arXiv 2606.09659 · 2026-06-09

Latent Context Language Models: encoder-decoder compression trained on 350B tokens beats KV-cache pruning — 81% on GSM8K at 16x compression where baselines drop to 0%

Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov

A 15-author team spanning the Goldstein, Goldblum and Izmailov groups revives encoder-decoder context compression at scale: a 0.6B encoder plus 4B decoder trained on roughly 350B tokens sets a new speed/memory/accuracy Pareto frontier, with models and code released.

arxiv.org/abs/2606.09659

What shipped

A paper titled “End-to-End Context Compression at Scale” (arXiv:2606.09659, cs.CL, submitted June 8, 2026) takes an idea the field had mostly filed under “tried it, too lossy” — encoder-decoder context compression — and asks what happens if you stop treating it as a bolt-on trick and instead train it properly, at pre-training scale. The 15-author team spans the Tom Goldstein, Micah Goldblum and Pavel Izmailov orbits, with co-authors including Zhuang Liu, Sanae Lotfi, Brian Bartoldson, Bhavya Kailkhura and Sean McLeish.

The motivating problem is the one every long-context deployment hits: the KV cache grows linearly with context length, and at hundreds of thousands of tokens it — not the weights — becomes the memory bottleneck. Existing KV-cache compression methods either degrade quality substantially or burn considerable time and compute just to compress a single long prompt. The authors’ answer is Latent Context Language Models (LCLMs): a small encoder maps a long token sequence into a much shorter sequence of latent embeddings, and the decoder consumes the latents instead of the raw tokens.
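To make the linear growth concrete, here is a back-of-the-envelope sketch of KV-cache size for a single sequence. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical chosen for illustration and is not taken from the paper; only the formula's linear dependence on token count is the point.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, layer and KV head.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, for illustration only (not from the paper).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens in (128_000, 512_000, 1_000_000):
    gib = kv_cache_bytes(tokens, **SHAPE) / 2**30
    print(f"{tokens:>9} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumed dimensions each token costs 128 KiB of cache, so the cache alone reaches tens of GiB well before 1M tokens, while the weights stay fixed.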

How it works

The architecture is deliberately simple. A 0.6B-parameter encoder (initialized from Qwen3-Embedding) reads the context in fixed 1024-token windows and mean-pools groups of N tokens into single latent vectors; an MLP adapter projects those latents into the embedding space of a 4B-parameter decoder (initialized from Qwen3-4B-Instruct). The team trained variants at 1:4, 1:8 and 1:16 compression ratios.
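The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not the released implementation: random vectors stand in for the encoder's per-token hidden states, the dimensions are tiny, and the two-layer ReLU adapter is an assumption (the source says only that the adapter is an MLP). All function names are hypothetical.

```python
import random


def mean_pool(token_vectors, group_size):
    """Average each consecutive group of `group_size` vectors into one latent."""
    latents = []
    for start in range(0, len(token_vectors), group_size):
        group = token_vectors[start:start + group_size]
        dim = len(group[0])
        latents.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return latents


def linear(vec, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def mlp_adapter(latent, w1, b1, w2, b2):
    """Project a pooled latent into the decoder's embedding width (ReLU is assumed)."""
    hidden = [max(0.0, h) for h in linear(latent, w1, b1)]
    return linear(hidden, w2, b2)


def compress(token_vectors, group_size, params, window=1024):
    """Process fixed windows, pool groups of N tokens, then apply the adapter."""
    out = []
    for start in range(0, len(token_vectors), window):
        for latent in mean_pool(token_vectors[start:start + window], group_size):
            out.append(mlp_adapter(latent, *params))
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
ENC_DIM, HIDDEN_DIM, DEC_DIM = 8, 16, 12  # toy widths, not the real model sizes
params = (random_matrix(HIDDEN_DIM, ENC_DIM, rng), [0.0] * HIDDEN_DIM,
          random_matrix(DEC_DIM, HIDDEN_DIM, rng), [0.0] * DEC_DIM)

# Stand-in for encoder hidden states over a 2048-token context.
encoder_states = [[rng.gauss(0.0, 1.0) for _ in range(ENC_DIM)] for _ in range(2048)]
decoder_inputs = compress(encoder_states, group_size=16, params=params)
print(len(encoder_states), "tokens ->", len(decoder_inputs), "latents")
```

With a group size of 16, the 2048 stand-in token states become 128 decoder-width vectors, which is the 1:16 setting in miniature.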

The load-bearing part is the training budget. Each model goes through a four-stage pipeline — adapter warmup, encoder training, continual pre-training, then supervised fine-tuning — on roughly 350 billion tokens, with interleaved compressed and uncompressed blocks plus auxiliary reconstruction objectives. Prior compressor papers typically fine-tuned on a few billion tokens at most; this is the first time the recipe has been pushed to genuine continual-pre-training scale, which is exactly why the authors frame the contribution as “at scale” rather than as a new mechanism.

The numbers

Against a strong roster of KV-cache compression baselines — SnapKV, KVzip, FastKVzip, Expected Attention and Attention Matching — the paper reports a new Pareto frontier across general-task performance, compression speed and peak memory:

Time-to-first-token: LCLMs avoid the full prefill cost that cache-pruning methods still pay, with reported TTFT speedups up to 8.8x at higher compression ratios on RULER-style settings.
Information-dense tasks: on GSM8K at 16x compression, where 15 of every 16 context positions (about 94%) are gone, LCLMs hold 81% accuracy while the competing methods collapse to 0%. Cache pruning throws away entries it deems unimportant; a trained compressor learns to keep the arithmetic.
Memory: on an H200, peak memory stays nearly flat from 128K to 512K tokens at 16x compression, and the approach scales to 1M-token contexts where the baselines run out of memory.
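The compression ratios translate into decoder-side sequence lengths by simple division. The sketch below works that arithmetic for the three released ratios; the specific context lengths are illustrative, and the count covers only the positions the decoder attends over, not weights, activations or the encoder's own working memory.

```python
def latent_count(num_tokens, ratio):
    """Latent positions the decoder sees for a context of `num_tokens` (ceil division)."""
    return -(-num_tokens // ratio)


def fraction_removed(ratio):
    """Share of context positions eliminated at a 1:ratio compression."""
    return 1 - 1 / ratio


for ratio in (4, 8, 16):
    print(f"1:{ratio} removes {fraction_removed(ratio):.2%} of positions; "
          f"1,000,000 tokens -> {latent_count(1_000_000, ratio):,} latents")
```

At 1:16 a 1M-token context becomes 62,500 latent positions, which is why the decoder's cache can stay small where a full-length cache would not fit.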

There is also a forward-looking agent experiment: the decoder skims compressed context and calls an EXPAND tool to retrieve the original text of any chunk it needs verbatim, which substantially improves exact string-match accuracy on needle-in-a-haystack tasks. Everything is released — models on Hugging Face (latent-context) and code on GitHub (LeonLixyz/LCLM).
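A minimal sketch of the skim-then-EXPAND idea follows, assuming the raw text is kept in fixed-size chunks alongside the latents. `ChunkStore`, its methods and the chunk size are hypothetical names for illustration and do not reflect the released code's API; in a real agent the decoder would choose which chunk to expand, whereas here the chunk id is computed directly.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkStore:
    """Keeps the original tokens per chunk so any chunk can be recovered verbatim."""
    chunk_size: int = 1024  # hypothetical chunk size
    chunks: list = field(default_factory=list)
    expand_calls: int = 0

    def add(self, tokens):
        """Split raw tokens into chunks; latents for each chunk would be built separately."""
        for start in range(0, len(tokens), self.chunk_size):
            self.chunks.append(list(tokens[start:start + self.chunk_size]))

    def expand(self, chunk_id):
        """The EXPAND tool call: return the original text of one chunk."""
        self.expand_calls += 1
        return list(self.chunks[chunk_id])


# Needle-in-a-haystack toy: one exact string hidden in filler.
tokens = ["filler"] * 3000
tokens[2500] = "needle-7f3a"

store = ChunkStore()
store.add(tokens)

needed = 2500 // store.chunk_size  # stand-in for the model's own choice of chunk
recovered = store.expand(needed)
print("needle-7f3a" in recovered, "after", store.expand_calls, "expand call(s)")
```

The design point is that exact strings never need to survive compression: the latents only have to tell the decoder where to look, and the store supplies the verbatim text.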

Why a builder should care

The practical claim here is that context compression works when it is a trained capability rather than post-hoc cache surgery. That distinction matters for three audiences. If you serve long-context workloads, the TTFT and memory numbers attack your two worst cost curves at once, because the encoder is small and cheap to batch. If you build agents, the skim-then-EXPAND pattern offers a new memory tier: cheaper than re-reading raw history, more faithful than a text summary, with verbatim recovery on demand. And if you train models, the result reads as evidence that a 350B-token investment can buy a 16x context discount that the pruning baselines tested here did not match, a trade that gets more attractive every time your average context length doubles.

The honest caveats: the released decoder is 4B parameters, so nobody has shown this recipe on a frontier-scale model; the training cost is real and falls on whoever trains the compressor; and the GSM8K-at-16x number, while striking, comes from one task family. The 0%-baseline comparison also flatters the setup, since cache-pruning methods were never designed for 94% eviction.

Practitioner note

If I ran a long-context serving stack today, I would benchmark the released 1:4 model against my current SnapKV-style pipeline on my own traces before believing any of this transfers — but the experiment is cheap, because the weights and code are public and the decoder is only 4B. The metric I would watch is not average accuracy but the failure mode: pruning fails by silently dropping facts, while a trained compressor fails by blurring them, and the EXPAND-tool pattern gives you a recovery path for the second failure that does not exist for the first. For agent builders, I would prototype compressed memory now even if I keep raw logs as ground truth: store latents for old turns, expand on demand, and measure how often the agent actually needs the expansion. That ratio tells you your real compression budget.
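One way to turn the expansion ratio into a compression budget is sketched below. It rests on a simplifying assumption of mine, not the paper's: every expanded chunk is read in full in addition to its latents, so the decoder's effective load per input token is 1/ratio plus the fraction of chunks expanded. The log counts are made-up placeholders.

```python
def effective_ratio(nominal_ratio, expand_fraction):
    """Input tokens per position actually read, given the share of chunks expanded."""
    if not 0.0 <= expand_fraction <= 1.0:
        raise ValueError("expand_fraction must be between 0 and 1")
    # Latents cost 1/nominal_ratio per token; expanded chunks add their full length.
    return 1.0 / (1.0 / nominal_ratio + expand_fraction)


# Hypothetical counts from an agent's logs, for illustration only.
total_chunks = 400
expanded_chunks = 25
expand_fraction = expanded_chunks / total_chunks

print(f"expanded {expand_fraction:.2%} of chunks -> "
      f"effective compression about {effective_ratio(16, expand_fraction):.1f}x")
```

Under this model, expanding one chunk in sixteen halves a nominal 16x ratio to 8x, so a low expansion rate matters as much as the headline ratio.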

Under-considered angle

The quiet implication is economic, not architectural. KV-cache pruning kept compression inside the serving layer, where every provider pays the cost per request, forever. LCLMs move the cost into training, once, and amortize it across all future requests — the same shift that made instruction-tuning beat prompt engineering. If the recipe scales with decoder size, “context compression” stops being an inference optimization and becomes a model capability you ship, version and fine-tune. The open question worth tracking is whether frontier labs adopt the encoder-decoder split or fold the compressor into the model itself; either way, the 350B-token price tag this paper publishes is the first credible quote for what that capability costs to build.