AI-Daily-Builder

2026-06-07

vLLM 0.22 Adds Multi-Tier KV Cache Offloading: GPU to CPU to Disk for Long-Context Local Serving

vLLM 0.22.0 (May 29, 2026) shipped a multi-tier KV cache offloading framework that cascades cached blocks past CPU DRAM down to disk, with a Python filesystem tier and a Mooncake disk backend; the June 5 patch, 0.22.1, is a smaller follow-up.

What shipped

vLLM 0.22.0, tagged May 29, 2026, landed a native multi-tier KV cache offloading framework (PR #40020). Until now vLLM could spill the key/value cache from GPU memory to CPU DRAM and back, but that was the end of the road: once host RAM filled, blocks were evicted and prefill had to be recomputed. The new framework treats CPU DRAM as a single primary tier and lets one or more secondary tiers sit behind it — a local filesystem, object storage, a key-value store, or a remote node. The release added a Python filesystem secondary tier (PR #41735) and Mooncake disk offloading (PR #42689) as the first concrete backends, alongside DeepSeek V4 support (PR #43142).

The June 5, 2026 patch, 0.22.1, is a smaller follow-up: it adds JetBrains’ Mellum v2 code MoE, routes int8 and GPTQ linear ops through zentorch kernels on AMD Zen CPUs, and fixes a DeepSeek-V4 CUTLASS init break. The headline feature for anyone running long context on a single box is in 0.22.0.

How the tiering works

The design (RFC #38260) introduces a TieringManager that runs inside the scheduler process and a SecondaryTierManager interface that each backend implements. The orchestration rule is “always cascade”: when a block is confirmed in CPU DRAM, the manager pushes it down to every registered secondary tier asynchronously, using zero-copy views into the CPU tensors so the scheduler never blocks on I/O. When a request needs a block that has aged out of DRAM, the manager promotes it back up from the secondary tier rather than triggering a fresh prefill. Blocks are stored in a canonical CPU layout (TP rank 1 form), which is what lets a cache written under one tensor-parallel degree be reused under another.
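The cascade-and-promote rule is easier to see in a toy model. The sketch below is not vLLM's code: ToyTieringManager, DictTier and every method name are hypothetical, the primary tier is a plain LRU dictionary, and the cascade runs synchronously where the real framework is described as asynchronous.

```python
from collections import OrderedDict
from typing import Optional, Protocol


class SecondaryTier(Protocol):
    """Hypothetical backend interface: anything that can store and fetch a block."""

    def put(self, key: str, block: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...


class DictTier:
    """In-memory stand-in for a disk or remote backend."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def put(self, key: str, block: bytes) -> None:
        self.blocks[key] = block

    def get(self, key: str) -> Optional[bytes]:
        return self.blocks.get(key)


class ToyTieringManager:
    """LRU primary tier (standing in for CPU DRAM) with cascading secondaries."""

    def __init__(self, dram_capacity: int, secondaries: list[SecondaryTier]) -> None:
        self.dram: OrderedDict[str, bytes] = OrderedDict()
        self.capacity = dram_capacity
        self.secondaries = secondaries
        self.promotions = 0

    def _admit(self, key: str, block: bytes) -> None:
        self.dram[key] = block
        self.dram.move_to_end(key)
        while len(self.dram) > self.capacity:
            self.dram.popitem(last=False)  # evict the least recently used block

    def store(self, key: str, block: bytes) -> None:
        self._admit(key, block)
        for tier in self.secondaries:  # "always cascade" to every secondary tier
            tier.put(key, block)

    def lookup(self, key: str) -> Optional[bytes]:
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key]
        for tier in self.secondaries:
            block = tier.get(key)
            if block is not None:
                self.promotions += 1
                self._admit(key, block)  # promote back into the primary tier
                return block
        return None  # miss everywhere: the caller would have to recompute prefill


mgr = ToyTieringManager(dram_capacity=2, secondaries=[DictTier()])
for name in ("a", "b", "c"):
    mgr.store(name, name.upper().encode())

aged_out = "a" not in mgr.dram   # "a" was evicted when "c" arrived
promoted = mgr.lookup("a")       # served from the secondary tier, not recomputed
```

The point of the toy is the last two lines: an eviction from the primary tier stops being a loss once a copy already sits in a lower tier.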

The economics are simple. GPU HBM moves data near 3.35 TB/s; a PCIe 4.0 NVMe drive sits around 7 GB/s with sub-millisecond latency. That gap looks brutal until you compare against the alternative, which is not “read from HBM” but “recompute the entire prefill.” Loading precomputed blocks off disk beats recomputation handily once prompts get long.
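A back-of-envelope version of that comparison, as a sketch under stated assumptions: the model shape (80 layers, 8 KV heads, head dimension 128) is a typical published 70B configuration and not something the release notes specify, and the 7 GB/s figure is treated as sustained sequential read with no filesystem or deserialization overhead.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes for one token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def load_seconds(tokens: int, bytes_per_token: int, bandwidth_bytes_per_s: float) -> float:
    """Time to stream a cached prefix at a given sustained read bandwidth."""
    return tokens * bytes_per_token / bandwidth_bytes_per_s


# Assumed model shape (not from the release notes): 80 layers, 8 KV heads, head_dim 128.
per_token_fp16 = kv_bytes_per_token(80, 8, 128, bytes_per_elem=2)
per_token_fp8 = kv_bytes_per_token(80, 8, 128, bytes_per_elem=1)

context_tokens = 128 * 1024
nvme_bandwidth = 7e9  # bytes/s, the PCIe 4.0 NVMe figure quoted above

load_fp16_s = load_seconds(context_tokens, per_token_fp16, nvme_bandwidth)
load_fp8_s = load_seconds(context_tokens, per_token_fp8, nvme_bandwidth)

print(f"FP16 KV: {per_token_fp16} B/token, {load_fp16_s:.1f} s to load 128K tokens")
print(f"FP8 KV:  {per_token_fp8} B/token, {load_fp8_s:.1f} s to load 128K tokens")
```

Compare the result against a prefill time measured on your own hardware; the disk tier only pays off when the load time comes in below it.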

Why this matters for a local rig

The 2026 reality is that 128K- and even 1M-token windows are ordinary workloads, and KV cache is what eats the card. A rough single-box picture, drawn from independent benchmarks on an H100-class card serving a 70B model at 128K context:

Setup                              Approx. concurrent users at 128K
GPU HBM only                       ~1
FP8 KV + CPU swap + disk offload   ~8-10

The same source cites time-to-first-token on a 128K system prompt dropping from roughly 11 seconds to about 1.5 seconds when the prefill is served as a cache hit from disk instead of recomputed. (Those figures are from a hosting vendor’s April 2026 writeup, not vLLM’s own numbers, so treat the multiplier as directional.) For a self-hosted operator, the practical unlock is that a fixed system prompt or a long retrieved document no longer has to be paid for on every request — it lives on the SSD and gets promoted back when needed.

The same release also tightened the quantization path that makes the on-GPU footprint smaller in the first place: the batch-invariant CUTLASS FP8 path is reported at +28.9% end-to-end (PR #40408), and a padded NVFP4 quant kernel at +2.4% to +5.7% (PR #42774). Smaller KV blocks plus somewhere to put the overflow is the combination that stretches a single card.

Practitioner note

If you are on a single-GPU box and your bottleneck is context length rather than raw token throughput, the lever here is a fast local NVMe plus the filesystem or Mooncake tier, not a second GPU. Benchmark with a realistic shared prefix (a system prompt or a fixed RAG document) so the cache-hit path actually fires; a cold, all-unique workload will show you only the I/O cost and none of the recompute savings. Pin a sensible CPU DRAM budget before enabling the disk tier — the primary tier is still the staging area every block passes through.
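To see why the shared prefix matters for a benchmark, here is a toy model of block-level prefix matching. It assumes fixed 16-token blocks keyed by a hash chained over everything before them, a common way to key prefix caches; the block size, hash choice and function names are illustrative and are not taken from vLLM's internals.

```python
import hashlib

BLOCK_SIZE = 16  # assumed block size in tokens, for illustration only


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Hash each full block, chaining in the previous hash so a key covers the whole prefix."""
    hashes = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        chunk = tokens[start:start + block_size]
        prev = hashlib.sha256(prev + repr(chunk).encode()).digest()
        hashes.append(prev)
    return hashes


def hit_rate(requests: list[list[int]]) -> float:
    """Fraction of blocks already seen in earlier requests (an unbounded cache)."""
    seen: set[bytes] = set()
    hits = total = 0
    for tokens in requests:
        for h in block_hashes(tokens):
            total += 1
            if h in seen:
                hits += 1
            else:
                seen.add(h)
    return hits / total if total else 0.0


# Four requests sharing a 64-token "system prompt", each with a unique 16-token suffix.
prefix = list(range(64))
shared_workload = [prefix + [1000 + i] * 16 for i in range(4)]

# Four requests of the same length with no tokens in common.
unique_workload = [[i * 1000 + j for j in range(80)] for i in range(4)]

shared_rate = hit_rate(shared_workload)
unique_rate = hit_rate(unique_workload)
print(f"shared prefix: {shared_rate:.2f}, all unique: {unique_rate:.2f}")
```

The all-unique workload never reuses a block, so any tier below the GPU only adds write traffic; the shared-prefix workload is the one that exercises the promote path.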

Under-considered angle: multi-tier offloading quietly changes the threat and reliability surface of a “local” deployment. KV blocks now persist to disk, which means your prompt and retrieved-context contents live on the SSD past the lifetime of a request — a privacy and retention consideration that did not exist when the cache evaporated with GPU memory. And because blocks are stored in a canonical TP-rank-1 layout, a cache built today can be reused after you change tensor-parallel degree or restart the server, turning the KV store into durable state you have to reason about (eviction policy, disk pressure, stale entries) rather than ephemeral scratch. The interesting near-term question is whether a persistent, cross-restart KV cache becomes a managed asset on solo rigs the way a vector index already is.
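One way to put a retention bound on that state is an age-based sweep of the cache directory. This is a minimal sketch that assumes the filesystem tier writes ordinary files under a directory you control; the on-disk layout, and whether a running server tolerates files disappearing, are assumptions to verify, so run something like this only while the server is stopped. Note that unlinking a file does not securely erase it from an SSD.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def purge_stale(cache_dir: str, max_age_s: float, now: Optional[float] = None) -> list[Path]:
    """Delete files whose modification time is older than max_age_s; return what was removed."""
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletions do not interfere with the directory walk.
    for path in list(Path(cache_dir).rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > max_age_s:
            path.unlink()
            removed.append(path)
    return removed


# Demo in a throwaway directory: one fresh file, one backdated by two hours.
with tempfile.TemporaryDirectory() as demo_dir:
    fresh = Path(demo_dir) / "fresh.bin"
    stale = Path(demo_dir) / "stale.bin"
    fresh.write_bytes(b"x")
    stale.write_bytes(b"x")
    two_hours_ago = time.time() - 7200
    os.utime(stale, (two_hours_ago, two_hours_ago))

    removed_names = [p.name for p in purge_stale(demo_dir, max_age_s=3600)]
    fresh_survived = fresh.exists()
```

A time-to-live is the bluntest possible policy; it bounds how long prompt contents sit on disk at the cost of throwing away cache hits for rarely used prefixes.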

