arXiv 2605.06285 · 2026-05-29

LatentRAG moves agentic RAG reasoning into latent space, cutting inference latency ~90%

Yijia Zheng, Marcel Worring · University of Amsterdam

A new arXiv paper, LatentRAG, shifts the multi-step reasoning and query generation of agentic RAG from token-by-token text into continuous latent space, matching explicit-agent accuracy while cutting inference latency by roughly 90%.

arxiv.org/abs/2605.06285

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG (arXiv:2605.06285), by Yijia Zheng and Marcel Worring of the University of Amsterdam, was submitted to arXiv on May 7, 2026. It targets the single most painful cost of agentic retrieval-augmented generation: inference latency.

The problem with agentic RAG

Agentic RAG systems are powerful because they don’t retrieve once and answer. They autonomously issue search queries, read what comes back, reason about gaps, and chain multiple steps — issuing follow-up subqueries until they have enough to answer. That autonomy is exactly what makes them accurate on hard, multi-hop questions.

It is also exactly what makes them slow. Every thought and every subquery is generated as natural language, one token at a time. A multi-step agent that thinks out loud and writes several subqueries pays the full autoregressive decoding cost at each step. For interactive production use — chat, search, copilots — that latency has kept multi-step retrieval agents largely off the table.

What LatentRAG changes

LatentRAG’s move is to stop serializing the agent’s reasoning into text at all. Instead of generating long natural-language thoughts and subqueries token by token, it produces latent tokens for thoughts and subqueries directly from the model’s hidden states in a single forward pass. The reasoning and the retrieval both stay in continuous latent space rather than discrete language.

Because the expensive part of agentic RAG was the token-by-token generation of thoughts and subqueries, doing that work in latent space — in one forward pass rather than many decoding steps — is where the speedup comes from.
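A back-of-the-envelope sketch can make that argument concrete. It assumes plain autoregressive decoding, where each generated token costs one sequential forward pass, and it reads the description above as one forward pass per reasoning step in the latent case. The token counts are made up for illustration and are not taken from the paper; prefill, retrieval time, and final-answer generation are ignored.

```python
def explicit_decode_passes(steps):
    """Sequential forward passes when thoughts and subqueries are written as text.

    steps: list of (thought_tokens, subquery_tokens) per agent step.
    Assumes one forward pass per generated token (no speculative decoding).
    """
    return sum(thought + subquery for thought, subquery in steps)


def latent_passes(num_steps):
    """Sequential forward passes if each step emits its latent tokens in one pass.

    Assumption: one pass per reasoning step, which is one reading of the article.
    """
    return num_steps


# Hypothetical three-step agent: (thought tokens, subquery tokens) per step.
steps = [(60, 12), (45, 10), (50, 14)]

explicit = explicit_decode_passes(steps)
latent = latent_passes(len(steps))
print(f"explicit text reasoning: {explicit} sequential passes")  # 191
print(f"latent reasoning:        {latent} sequential passes")    # 3
```

The counts are not latency measurements, since a single forward pass over a long context is not free. They only show why removing token-by-token decoding from the reasoning loop shrinks the sequential work so sharply.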

Keeping it readable

A latent-only agent would be a black box. LatentRAG adds a parallel latent-decoding step that converts the latent representations back into natural language, so the reasoning remains transparent and inspectable. The decoding runs alongside the latent computation rather than gating it, so restoring transparency does not reintroduce the latency that latent reasoning removes.
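The "alongside rather than gating" idea can be sketched with a background worker. This is a toy control-flow illustration, not the paper's implementation: `latent_step`, `retrieve`, and `decode_to_text` are hypothetical stand-ins, and the only point is that the agent loop never waits on the decoder.

```python
from concurrent.futures import ThreadPoolExecutor


def latent_step(step):
    # Hypothetical stand-in for the model's latent thought at this step.
    return [float(step), float(step) * 0.5]


def retrieve(latent):
    # Hypothetical stand-in for a retriever that consumes the latent query.
    return f"doc-{int(latent[0])}"


def decode_to_text(latent):
    # Hypothetical stand-in for the decoder that narrates a latent thought.
    return f"thought about {latent[0]:.1f}"


def run_agent(num_steps):
    docs, pending = [], []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in range(num_steps):
            latent = latent_step(step)
            # Submit decoding and move on; the loop does not block on it.
            pending.append(pool.submit(decode_to_text, latent))
            docs.append(retrieve(latent))
        # Collect the readable trace only after the critical path has finished.
        trace = [future.result() for future in pending]
    return docs, trace


docs, trace = run_agent(3)
print(docs)
print(trace)
```

In this arrangement the readable trace arrives slightly after the retrievals it describes, which is the price of keeping decoding off the critical path.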

The results

Across seven benchmark datasets, the authors report accuracy comparable to explicit agentic RAG while reducing inference latency by roughly 90%. That largely closes the speed gap with traditional single-step RAG — which is fast precisely because it does only one retrieval and one generation. If the result holds, you would get multi-step agentic accuracy at close to single-step speed.
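To put "roughly 90%" in speedup terms, the conversion is simple arithmetic: a fractional latency reduction r corresponds to a speedup of 1 / (1 - r). The helper below is only a convenience for that formula; the 80% and 95% rows are included for comparison and are not figures from the paper.

```python
def speedup_factor(latency_reduction):
    """Convert a fractional latency reduction (0 <= r < 1) into a speedup multiple."""
    if not 0 <= latency_reduction < 1:
        raise ValueError("latency_reduction must be in [0, 1)")
    return 1.0 / (1.0 - latency_reduction)


for reduction in (0.80, 0.90, 0.95):
    factor = speedup_factor(reduction)
    print(f"{reduction:.0%} lower latency is about {factor:.1f}x faster")
```

A 90% cut is therefore about a 10x speedup, and the multiple is sensitive near the top of the range: 80% is 5x, while 95% is 20x. That is why the exact reduction on your own workload matters more than the headline number.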

Why it matters

Most agentic-RAG efficiency work attacks the number of operations: fewer searches, fewer reasoning steps, smarter stopping. LatentRAG instead attacks the per-step cost itself, by never generating the agent’s thoughts as text on the critical path. That is a different axis of optimization. If it generalizes, it reframes the accuracy-vs-latency tradeoff that has kept multi-step retrieval agents too slow for interactive deployment.

Practitioner note

For teams running or evaluating agentic RAG:

Don’t assume agentic accuracy requires agentic latency. The latency tax has been the standard reason teams fall back to single-step RAG. LatentRAG’s claim is that the tax was a property of text serialization, not of multi-step reasoning. If you rejected agentic RAG purely on speed, that calculus may be changing.
Watch where the reasoning lives, not just whether it exists. A natural-language thought log is auditable by construction. A latent thought decoded after the fact is a reconstruction. Treat the decoded text as an explanation, not a guaranteed transcript.
Benchmark on your own retrieval corpus. Seven public datasets are a strong signal, but latent reasoning trained on benchmark-style multi-hop questions may behave differently on your domain’s query distribution. The latency win is the easy thing to reproduce; the accuracy parity is the thing to verify.
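For the last point, a minimal evaluation harness might look like the sketch below. It assumes your pipeline is a callable that takes a question string and returns an answer string, and it scores with normalized exact match, which is a common but crude metric chosen here for simplicity. The names `evaluate` and `toy_pipeline` and the example data are hypothetical.

```python
import statistics
import time


def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences do not count.
    return " ".join(text.lower().split())


def evaluate(pipeline, examples):
    """Run pipeline(question) over (question, gold_answer) pairs.

    Returns exact-match accuracy plus median and max wall-clock latency in seconds.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    latencies, correct = [], 0
    for question, gold in examples:
        start = time.perf_counter()
        prediction = pipeline(question)
        latencies.append(time.perf_counter() - start)
        correct += int(normalize(prediction) == normalize(gold))
    return {
        "accuracy": correct / len(examples),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
    }


# Toy stand-in for a real RAG pipeline, for demonstration only.
def toy_pipeline(question):
    return {"capital of France?": "Paris"}.get(question, "unknown")


examples = [("capital of France?", "paris"), ("capital of Italy?", "Rome")]
report = evaluate(toy_pipeline, examples)
print(report["accuracy"])
```

Running the same harness over a single-step pipeline, an explicit agent, and a latent agent on questions drawn from your own corpus gives a like-for-like view of both halves of the claim.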

The under-considered angle: moving reasoning into latent space trades auditability for speed, and the bolt-on decoder is where that trade hides. When an agent reasons in text, your logs are the reasoning — you can search them, guardrail them, and replay them. When it reasons in continuous hidden states and a separate decoder narrates afterward, what you log is the narration, not the computation. There is no guarantee the natural-language decode faithfully reflects what the latent step actually did. For anyone who has to govern, audit, or red-team a retrieval agent — in regulated domains especially — a 90% latency cut that quietly relocates the reasoning into opaque states is not a free win. It is a new surface where the explanation and the behavior can diverge.
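One way to keep that distinction visible in practice is to record provenance in the log itself. The sketch below is an assumption about how a team might do this and is not something the paper describes: each entry says whether its text was generated directly or decoded from a latent state, and decoded entries carry a fingerprint of the latent vector so the narration can later be tied back to the computation it claims to describe. The field names are hypothetical.

```python
import hashlib
import json
import struct
from dataclasses import asdict, dataclass
from typing import Optional


def latent_fingerprint(latent):
    # Hash the raw float64 bytes so identical latent vectors map to the same ID.
    packed = struct.pack(f"<{len(latent)}d", *latent)
    return hashlib.sha256(packed).hexdigest()[:16]


@dataclass
class ReasoningLogEntry:
    step: int
    text: str
    provenance: str  # "generated_text" or "decoded_from_latent"
    latent_sha256_prefix: Optional[str] = None


def log_decoded_step(step, latent, decoded_text):
    """Serialize a log line that marks the text as a reconstruction, not a transcript."""
    entry = ReasoningLogEntry(
        step=step,
        text=decoded_text,
        provenance="decoded_from_latent",
        latent_sha256_prefix=latent_fingerprint(latent),
    )
    return json.dumps(asdict(entry))


line = log_decoded_step(0, [0.12, -0.4, 0.9], "Need the director's birthplace.")
print(line)
```

A fingerprint does not make the decode faithful. It only ensures that an auditor reading the log knows which entries are narration and which stored latent state each one refers to.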