arXiv 2605.06208 · 2026-05-10

Agentic Context Decay: Why Multi-Tool Agents Lose Coherence Past 7 Calls

Jin Park, Aldous Foster, Maya Sundaram, Yiren Liang · Princeton / DeepMind

Empirical bound on multi-step agent coherence: success rate drops sharply at the 7th tool call across Claude 4.7, GPT-5, Gemini 3 Pro. Mitigations via tool-trace summarization recover most of the loss.

arxiv.org/abs/2605.06208

This paper formalizes the “AgentEval-3” finding that many builder posts cite informally: agents lose coherence past a small number of tool calls. The authors construct a controlled task suite (200 tasks, 1-15 tool calls each) and measure success rate as a function of tool-call depth.

The sharp 7-call cliff

Tool calls in task	Claude 4.7	GPT-5	Gemini 3 Pro
1-3	96%	95%	93%
4-6	91%	89%	88%
7	78%	74%	76%
8-10	51%	48%	53%
11-15	28%	24%	31%

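The decline sets in at the 7th call for all three frontier models: success falls 12 to 15 points from the 4-6 range, then a further 23 to 27 points in the 8-10 range. That consistency suggests the effect is a property of the architectural class (transformer + standard tool-use harness) rather than of any one model's training.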
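The bin-to-bin drops can be recomputed directly from the table. A minimal sketch, with the rates copied from the table above; the helper name `drops` is chosen here for illustration and is not from the paper:

```python
# Success rates (%) copied from the table above, keyed by tool-call depth bin.
RATES = {
    "Claude 4.7":   {"1-3": 96, "4-6": 91, "7": 78, "8-10": 51, "11-15": 28},
    "GPT-5":        {"1-3": 95, "4-6": 89, "7": 74, "8-10": 48, "11-15": 24},
    "Gemini 3 Pro": {"1-3": 93, "4-6": 88, "7": 76, "8-10": 53, "11-15": 31},
}
BINS = ["1-3", "4-6", "7", "8-10", "11-15"]


def drops(rates):
    """Drop in success rate (in points) between each pair of consecutive bins."""
    return {f"{a} -> {b}": rates[a] - rates[b] for a, b in zip(BINS, BINS[1:])}


for model, rates in RATES.items():
    print(model, drops(rates))
```

Reading the output per model shows where the curve bends: the step into the 7-call bin is more than twice the size of the step before it, and the step after it is larger again.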

Why 7?

The authors argue the limit is not context-window capacity (all three models had >100K tokens free). Instead, attention to the initial task description decays as the tool-call trace grows. By the 7th call, the task description occupies under 8% of the model’s effective attention budget for the next decision.
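A token-share calculation gives some intuition for this dilution argument. The sketch below is a toy proxy, not the authors' attention measurement: the function `task_share` and the token counts (a 500-token task description, 800 tokens appended per tool call) are hypothetical values chosen only for illustration.

```python
def task_share(task_tokens: int, tokens_per_call: int, calls: int) -> float:
    """Fraction of the prompt taken up by the task description after `calls` tool calls.

    Token share is only a crude stand-in for attention; it ignores how the
    model actually distributes attention over positions.
    """
    return task_tokens / (task_tokens + tokens_per_call * calls)


# Hypothetical sizes: the description stays fixed while the trace keeps growing.
for calls in (1, 3, 7, 10):
    print(calls, f"{task_share(500, 800, calls):.1%}")
```

The description never changes size, so its share can only shrink as the trace grows, even with most of the context window still free.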

Mitigations that work

Technique	Recovery (7-10 call range)
Periodic task-restatement (every 4 calls, re-inject task)	+18%
Tool-trace summarization (compress prior calls every 5 steps)	+24%
Decompose to sub-agents (each under 7 calls)	+31%
Combined (restate + summarize + decompose)	+38%
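The first two techniques can be sketched as a context-assembly step that runs before each decision. This is a hypothetical harness, not the authors' implementation: `build_context`, `naive_summary`, and the `TASK:` / `SUMMARY:` / `REMINDER` labels are assumptions made for illustration. Only the intervals (restate every 4 calls, summarize every 5 steps) come from the table.

```python
from typing import Callable, List

RESTATE_EVERY = 4    # from the table: re-inject the task every 4 calls
SUMMARIZE_EVERY = 5  # from the table: compress prior calls every 5 steps


def build_context(task: str, trace: List[str],
                  summarize: Callable[[List[str]], str]) -> List[str]:
    """Assemble the messages for the next decision (hypothetical harness)."""
    context = [f"TASK: {task}"]
    # Compress every complete block of SUMMARIZE_EVERY calls; keep the rest verbatim.
    cut = (len(trace) // SUMMARIZE_EVERY) * SUMMARIZE_EVERY
    if cut:
        context.append(f"SUMMARY: {summarize(trace[:cut])}")
    context.extend(trace[cut:])
    # Restate the task at the end of the context every RESTATE_EVERY calls.
    if trace and len(trace) % RESTATE_EVERY == 0:
        context.append(f"REMINDER, TASK: {task}")
    return context


def naive_summary(entries: List[str]) -> str:
    """Placeholder summarizer; a real one would likely call a model."""
    return f"{len(entries)} earlier calls: " + "; ".join(e[:20] for e in entries)


trace = [f"call {i}: result {i}" for i in range(1, 9)]
ctx = build_context("Find the cheapest flight", trace, naive_summary)
for line in ctx:
    print(line)
```

With eight calls in the trace, the first five are compressed into one summary line, the last three stay verbatim, and the task is restated at the end because 8 is a multiple of 4.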

The decompose-to-sub-agents result echoes the pattern that Anthropic Managed Agents shipped on May 6, 2026: the structural fix is to keep every single agent under the 7-call threshold.

Practitioner note

If your agent workflow has long tool-call chains, this is the empirical justification for the orchestrator/worker pattern even in single-task scenarios. The number to remember is 7: design your sub-agents to have an action budget of ≤6 calls, with one call reserved for “report back to parent.” For tasks that genuinely need 10+ steps, you should be planning a multi-agent decomposition from the start, not hoping a single agent will hold coherence through it.
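The budget rule can be enforced in the harness rather than left to the prompt. A minimal sketch, assuming the budget of 6 is a total that includes the report-back call (the conservative reading, which keeps each worker under 7 calls); `Worker`, `BudgetExceeded`, and `orchestrate` are hypothetical names, and the tool is any function from string to string:

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_CALLS = 6  # assumed total per-worker budget, including the report-back call


class BudgetExceeded(RuntimeError):
    """Raised when a worker tries to use the slot reserved for reporting."""


@dataclass
class Worker:
    subtask: str
    calls_used: int = 0
    log: List[str] = field(default_factory=list)

    def call_tool(self, tool: Callable[[str], str], arg: str) -> str:
        # Refuse any tool call that would consume the reserved report slot.
        if self.calls_used >= MAX_CALLS - 1:
            raise BudgetExceeded(f"{self.subtask!r}: only the report call remains")
        self.calls_used += 1
        result = tool(arg)
        self.log.append(result)
        return result

    def report(self) -> str:
        # The final, reserved call: hand a compact result back to the parent.
        self.calls_used += 1
        return f"{self.subtask}: {len(self.log)} tool results"


def orchestrate(subtasks: List[str], steps_per_subtask: int,
                tool: Callable[[str], str]) -> List[str]:
    """Run each subtask in a fresh worker so no single trace grows past the budget."""
    reports = []
    for sub in subtasks:
        worker = Worker(sub)
        for i in range(steps_per_subtask):
            worker.call_tool(tool, f"{sub} step {i}")
        reports.append(worker.report())
    return reports


# A 10-step job split into two workers of 5 tool calls each, plus one report apiece.
print(orchestrate(["search", "compare"], 5, str.upper))
```

Each worker starts with an empty trace, so the parent only ever sees the short reports. Asking a worker for a sixth tool call raises `BudgetExceeded`, which turns an over-long plan into a visible error instead of a silent loss of coherence.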