arXiv 2605.05117 · 2026-05-09

The Prompt Cache Economy: 73% of LLM Cost Hides in Cacheable Prefixes

Hyeonji Lee, Tara Mukherjee, Daniel Roa-Bell · CMU / Anyscale

Trace analysis of 14M production LLM requests: 73% of input tokens repeat across requests within a 5-minute window. Quantifies cost savings from prompt caching across providers — Anthropic 5-min TTL captures 81% of savings.

arxiv.org/abs/2605.05117 ↗

The largest empirical study of prompt-cache reuse in production. Authors analyzed 14 million LLM requests from three SaaS deployments (chat agents, code review, RAG pipelines) and partitioned the input tokens into “cacheable prefix” vs “request-unique tail.”

Headline numbers

Metric	Value
Median % of input tokens that recur in cacheable prefix	73.4%
95th-percentile %	91.2%
Cost savings if all cacheable prefixes hit cache	64% reduction in input cost
Cost savings actually realized (Anthropic 5-min TTL)	38% reduction (52% of theoretical)

The 5-minute TTL Anthropic ships covers the bulk of intra-session reuse but misses ~half of cross-session reuse (e.g., users who return after a coffee break). The authors model TTL extensions and find a 30-minute TTL would capture 89% of theoretical savings — but at higher infrastructure cost on the provider side.

Application breakdown

Chat agents: 79% cacheable (system prompt + few-shot examples are stable)
Code review: 84% cacheable (review checklist + repo context dominate)
RAG pipelines: 51% cacheable (retrieved chunks vary per query, but system prompt is fixed)

Cache invalidation patterns

The most expensive caching mistake is updating the system prompt mid-session. A single byte change invalidates the entire prefix cache for that conversation. Authors found 3.1% of sessions had at least one mid-session system-prompt change, and these sessions paid 2.4× the cost of single-version sessions.

Practitioner note

Three takeaways for anyone running LLM workloads at scale: (1) Stop iterating on system prompts in production — version them and deploy atomically. (2) If you’re on OpenAI without prompt caching enabled, you’re leaving 30-50% of input cost on the table; switch to the cached endpoint. (3) Anthropic’s 5-min TTL is the pragmatic default but worth measuring whether your sessions exceed that — if they do, structure your prompts to put the largest stable block first so the cache survives the longest.