AI-Daily-Builder

arXiv 2606.04302 · 2026-06-07

LazyAttention: position-agnostic KV reuse that unsticks the RAG cache bottleneck

Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, Yongjoo Park · University of Illinois Urbana-Champaign (DAIS group)

A new ICML 2026 paper, LazyAttention (arXiv:2606.04302), tackles a stubborn limitation of KV caching for retrieval-augmented generation: because positional information is baked into the cache, a chunk cached at one position cannot be reused at another.

arxiv.org/abs/2606.04302


What the paper is about

Key-value (KV) caching is the standard trick that makes large-language-model inference fast: once a token is processed, its key and value vectors are stored so they never have to be recomputed. In long-context settings like retrieval-augmented generation (RAG) and in-context learning, caching matters even more, because the same reference documents get fed to the model over and over.

There is a catch. Conventional KV caches bake positional information directly into the stored vectors, so a document chunk cached at position 1 cannot simply be dropped into position 3 of a different prompt: the positions no longer line up. The paper “LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding” (arXiv:2606.04302, submitted 3 June 2026, accepted to ICML 2026, by Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, and Yongjoo Park of the University of Illinois Urbana-Champaign) attacks exactly this reusability wall.
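To make the mismatch concrete, here is a minimal sketch that assumes rotary positional embeddings (RoPE). The article does not say which positional scheme the paper targets, so RoPE, the helper `rope_rotate`, and the toy key vector are all illustrative assumptions. The sketch shows that the same raw key, once rotated for position 1, no longer matches what position 3 requires.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles.

    Assumption: a RoPE-style encoding; `base` follows the common default of 10000.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # angle for this pair at this position
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

# Hypothetical raw key vector for one token of a document chunk.
key = [0.5, -1.0, 0.25, 2.0]

cached_at_1 = rope_rotate(key, position=1)  # what a conventional cache would store
needed_at_3 = rope_rotate(key, position=3)  # what a prompt needs if the chunk moves

# Largest element-wise difference between the stored and the required vector.
max_gap = max(abs(a - b) for a, b in zip(cached_at_1, needed_at_3))
print(f"max element-wise gap: {max_gap:.4f}")
```

Under this assumption the stored vector is simply the wrong vector for the new position, which is why a conventional cache entry cannot be moved as-is.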

The core idea

Today’s workarounds fall into two camps. One restricts reuse to shared prefixes only (fine if every request starts with the same boilerplate, useless once retrieved chunks shuffle around). The other re-encodes positions by materializing a fresh copy of the cache in memory, which is expensive in both time and bandwidth.
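The difference between prefix-only reuse and reuse at any position can be sketched with a toy comparison. The chunk names and both helper functions are hypothetical, and the sketch only counts which chunks could be served from cache; it does not model the memory copy that re-encoding needs.

```python
def prefix_reusable(cached_order, request_order):
    """Count leading chunks that match exactly, as a prefix-only cache requires."""
    count = 0
    for cached, wanted in zip(cached_order, request_order):
        if cached != wanted:
            break
        count += 1
    return count

def position_agnostic_reusable(cached_chunks, request_order):
    """Count requested chunks that exist in the cache at any position."""
    available = set(cached_chunks)
    return sum(1 for chunk in request_order if chunk in available)

# Hypothetical chunk identifiers: one earlier prompt and one new request.
cached = ["system_prompt", "doc_a", "doc_b", "doc_c"]
request = ["system_prompt", "doc_c", "doc_a", "doc_d"]

print(prefix_reusable(cached, request))             # 1
print(position_agnostic_reusable(cached, request))  # 3
```

Only the shared boilerplate survives under prefix matching, while two of the three retrieved documents in the request are already cached and would be reusable if position did not matter.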

LazyAttention’s move is to stop writing position into the cache at all, and instead apply positional encoding lazily — “on the fly” inside the attention kernel during computation. The authors describe this as kernelizing deferred positional encoding to get “zero-copy, position-agnostic KV reuse.” Because position is injected at compute time, a single physical KV copy can serve many logical requests sitting at arbitrary positions, with no duplication. They build separate kernels tuned for the two phases of inference: prefilling (digesting the prompt) and decoding (generating tokens one at a time).
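The deferred-encoding idea can be sketched in a few lines, again assuming RoPE (an assumption; the article does not name the scheme). This is plain Python for illustration, not the paper's fused GPU kernels, and `rope_rotate`, `lazy_scores`, and the toy vectors are hypothetical. The cache holds raw, position-free keys, and the rotation is applied only when attention scores are computed.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """RoPE-style rotation of consecutive pairs (assumed positional scheme)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lazy_scores(query, query_pos, raw_keys, chunk_offset):
    """Attention logits over raw cached keys, rotating each key at compute time.

    `chunk_offset` is the logical position where this request places the chunk.
    The cache itself is never modified or copied.
    """
    q = rope_rotate(query, query_pos)
    return [dot(q, rope_rotate(k, chunk_offset + j)) for j, k in enumerate(raw_keys)]

# One physical copy of a chunk's un-rotated keys (hypothetical values).
chunk_cache = [[0.5, -1.0, 0.25, 2.0], [1.0, 0.0, -0.5, 0.5]]
query = [0.3, 0.7, -0.2, 0.1]

# Two logical requests place the same cached chunk at different positions.
scores_a = lazy_scores(query, query_pos=10, raw_keys=chunk_cache, chunk_offset=0)
scores_b = lazy_scores(query, query_pos=10, raw_keys=chunk_cache, chunk_offset=4)
print(scores_a)
print(scores_b)
```

Both requests read the same `chunk_cache` list and get position-correct scores, which is the zero-copy property the authors describe. A production version would fold the rotation into the attention kernel rather than materialize rotated keys per call.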

Why it matters

RAG and agentic pipelines are where serving costs quietly pile up: the same handful of popular documents get retrieved across many users and many queries, but conventional caches force re-processing whenever those chunks land in a new spot. The paper’s reported gains are measured against Block-Attention, a recent state-of-the-art reuse method, under skewed document distributions — the realistic case where a few documents are hot and most are cold.

| Metric | Reported gain vs Block-Attention |
| --- | --- |
| Time-to-first-token (TTFT) | 1.37x reduction |
| Inference throughput | 1.40x increase |
| Output quality | “comparable” (per the abstract) |
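As a quick reading aid, the sketch below converts the two multipliers into absolute terms. It assumes that “1.37x reduction” means the baseline TTFT divided by 1.37 and that “1.40x increase” means the baseline throughput multiplied by 1.40. The baseline values (100 ms, 50 requests per second) are made up for illustration and do not come from the paper.

```python
def ttft_after(baseline_ms, reduction_factor=1.37):
    """TTFT after the gain, assuming 'Nx reduction' means baseline / N."""
    return baseline_ms / reduction_factor

def throughput_after(baseline_rps, increase_factor=1.40):
    """Throughput after the gain, assuming 'Nx increase' means baseline * N."""
    return baseline_rps * increase_factor

# Hypothetical baselines, not figures from the paper.
new_ttft = ttft_after(100.0)
new_throughput = throughput_after(50.0)
ttft_saving_pct = (1 - 1 / 1.37) * 100

print(f"TTFT: 100.0 ms -> {new_ttft:.1f} ms ({ttft_saving_pct:.0f}% lower)")
print(f"Throughput: 50.0 req/s -> {new_throughput:.1f} req/s")
```

Under that reading, a 1.37x reduction corresponds to roughly a 27% drop in time-to-first-token.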

Two caveats are worth flagging for anyone reading past the headline numbers. First, the improvements are claimed under skewed document distributions; a uniform workload with little chunk reuse should narrow the gap, since there is less cache to share. Second, the abstract reports quality as “comparable” rather than identical. Deferring positional encoding is an architectural intervention, so the right move before adopting it is to re-run your own task-specific evals rather than trusting a single quality summary.
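The first caveat can be checked on your own traffic with a simulation along these lines. It is a rough sketch under stated assumptions: retrieval popularity follows a Zipf-like law, the cache is unbounded with no eviction, and a chunk counts as reusable whenever it has been seen before at any position. The function name, corpus size, request count, and skew values are all illustrative and not taken from the paper.

```python
import random

def reuse_rate(num_docs, num_requests, chunks_per_request, skew, seed=0):
    """Fraction of retrieved chunks already present in a position-agnostic cache.

    Popularity weight of the document at rank r is 1 / r**skew (Zipf-like).
    skew=0.0 gives a uniform distribution.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** skew) for rank in range(1, num_docs + 1)]
    docs = list(range(num_docs))
    seen = set()   # unbounded cache: an assumption, real caches evict
    hits = 0
    total = 0
    for _ in range(num_requests):
        for doc in rng.choices(docs, weights=weights, k=chunks_per_request):
            total += 1
            if doc in seen:
                hits += 1
            else:
                seen.add(doc)
    return hits / total

skewed = reuse_rate(num_docs=10_000, num_requests=2_000, chunks_per_request=4, skew=1.2)
uniform = reuse_rate(num_docs=10_000, num_requests=2_000, chunks_per_request=4, skew=0.0)
print(f"skewed reuse rate:  {skewed:.2%}")
print(f"uniform reuse rate: {uniform:.2%}")
```

Replacing the synthetic sampler with a replay of real retrieval logs gives a first estimate of how much reuse a position-agnostic cache could capture before any kernel work is done.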

Practitioner note

If you run a RAG service and your retrieval distribution is heavy-tailed (a small set of evergreen documents dominates), this is the class of optimization that pays off without touching model weights or retraining. The practical question to ask your serving stack: does it only reuse shared prefixes, or can it reuse a retrieved chunk regardless of where it appears in the prompt? LazyAttention is squarely aimed at the latter. Treat the 1.37x and 1.40x figures as a rough upper bound that depends on skew, validate quality on your own benchmark, and check whether the kernel approach is compatible with your positional-encoding scheme before planning a migration.

An under-considered angle

Most of the public conversation about LLM serving cost fixates on prefix caching and longer context windows, but the sharper lever in retrieval systems is positional reusability — the ability to treat a cached chunk as a movable object rather than something glued to where it was first seen. That reframing has a downstream consequence few teams budget for: it shifts effort away from squeezing the model and toward designing the retrieval layer so that hot chunks actually recur. A position-agnostic cache is only worth its complexity if your retriever produces enough repetition to fill it; the optimization and the retrieval-distribution shape are coupled, and evaluating one without the other will mislead you on real-world savings.
