arXiv 2605.05117 · 2026-05-09
The Prompt Cache Economy: 73% of LLM Cost Hides in Cacheable Prefixes
Hyeonji Lee, Tara Mukherjee, Daniel Roa-Bell · CMU / Anyscale
Trace analysis of 14M production LLM requests: 73% of input tokens repeat across requests within a 5-minute window. Quantifies cost savings from prompt caching across providers — Anthropic 5-min TTL captures 81% of savings.
The largest empirical study of prompt-cache reuse in production. Authors analyzed 14 million LLM requests from three SaaS deployments (chat agents, code review, RAG pipelines) and partitioned the input tokens into “cacheable prefix” vs “request-unique tail.”
Headline numbers
| Metric | Value |
|---|---|
| Median % of input tokens that recur in cacheable prefix | 73.4% |
| 95th-percentile % | 91.2% |
| Cost savings if all cacheable prefixes hit cache | 64% reduction in input cost |
| Cost savings actually realized (Anthropic 5-min TTL) | 38% reduction (52% of theoretical) |
The 5-minute TTL Anthropic ships covers the bulk of intra-session reuse but misses ~half of cross-session reuse (e.g., users who return after a coffee break). The authors model TTL extensions and find a 30-minute TTL would capture 89% of theoretical savings — but at higher infrastructure cost on the provider side.
Application breakdown
- Chat agents: 79% cacheable (system prompt + few-shot examples are stable)
- Code review: 84% cacheable (review checklist + repo context dominate)
- RAG pipelines: 51% cacheable (retrieved chunks vary per query, but system prompt is fixed)
Cache invalidation patterns
The most expensive caching mistake is updating the system prompt mid-session. A single byte change invalidates the entire prefix cache for that conversation. Authors found 3.1% of sessions had at least one mid-session system-prompt change, and these sessions paid 2.4× the cost of single-version sessions.
Practitioner note
Three takeaways for anyone running LLM workloads at scale: (1) Stop iterating on system prompts in production — version them and deploy atomically. (2) If you’re on OpenAI without prompt caching enabled, you’re leaving 30-50% of input cost on the table; switch to the cached endpoint. (3) Anthropic’s 5-min TTL is the pragmatic default but worth measuring whether your sessions exceed that — if they do, structure your prompts to put the largest stable block first so the cache survives the longest.