arXiv 2605.06208 · 2026-05-10
Agentic Context Decay: Why Multi-Tool Agents Lose Coherence Past 7 Calls
Jin Park, Aldous Foster, Maya Sundaram, Yiren Liang · Princeton / DeepMind
Empirical bound on multi-step agent coherence: success rate drops sharply at the 7th tool call across Claude 4.7, GPT-5, Gemini 3 Pro. Mitigations via tool-trace summarization recover most of the loss.
This paper formalizes the “AgentEval-3” finding informally cited in many builder posts: agents lose coherence past a small number of tool calls. Authors construct a controlled task suite (200 tasks, 1-15 tool calls each) and measure success rate as a function of tool-call depth.
The sharp 7-call cliff
| Tool calls in task | Claude 4.7 | GPT-5 | Gemini 3 Pro |
|---|---|---|---|
| 1-3 | 96% | 95% | 93% |
| 4-6 | 91% | 89% | 88% |
| 7 | 78% | 74% | 76% |
| 8-10 | 51% | 48% | 53% |
| 11-15 | 28% | 24% | 31% |
The 7-call inflection is consistent across all three frontier models, suggesting it reflects a property of the architectural class (transformer + standard tool-use harness) rather than per-model training.
Why 7?
Authors argue the limit is not context-window capacity (all three models had >100K tokens free). Instead, attention to the initial task description decays as the tool-call trace grows. By the 7th call, the task description occupies under 8% of the model’s effective attention budget for the next decision.
Mitigations that work
| Technique | Recovery (7-10 call range) |
|---|---|
| Periodic task-restatement (every 4 calls, re-inject task) | +18% |
| Tool-trace summarization (compress prior calls every 5 steps) | +24% |
| Decompose to sub-agents (each under 7 calls) | +31% |
| Combined (restate + summarize + decompose) | +38% |
The decompose-to-sub-agents result echoes the pattern Anthropic Managed Agents shipped on May 6, 2026 — the structural fix is to never let any single agent exceed the 7-call ceiling.
Practitioner note
If your agent workflow has long tool-call chains, this is the empirical justification for the orchestrator/worker pattern even in single-task scenarios. The number to remember is 7: design your sub-agents to have an action budget of ≤6 calls, with one call reserved for “report back to parent.” For tasks that genuinely need 10+ steps, you should be planning a multi-agent decomposition from the start, not hoping a single agent will hold coherence through it.