Builder Daily

arXiv 2605.06208 · 2026-05-10

Agentic Context Decay: Why Multi-Tool Agents Lose Coherence Past 7 Calls

Jin Park, Aldous Foster, Maya Sundaram, Yiren Liang · Princeton / DeepMind

Empirical bound on multi-step agent coherence: success rate drops sharply at the 7th tool call across Claude 4.7, GPT-5, Gemini 3 Pro. Mitigations via tool-trace summarization recover most of the loss.

arxiv.org/abs/2605.06208 ↗


This paper formalizes the “AgentEval-3” finding informally cited in many builder posts: agents lose coherence past a small number of tool calls. Authors construct a controlled task suite (200 tasks, 1-15 tool calls each) and measure success rate as a function of tool-call depth.

The sharp 7-call cliff

Tool calls in taskClaude 4.7GPT-5Gemini 3 Pro
1-396%95%93%
4-691%89%88%
778%74%76%
8-1051%48%53%
11-1528%24%31%

The 7-call inflection is consistent across all three frontier models, suggesting it reflects a property of the architectural class (transformer + standard tool-use harness) rather than per-model training.

Why 7?

Authors argue the limit is not context-window capacity (all three models had >100K tokens free). Instead, attention to the initial task description decays as the tool-call trace grows. By the 7th call, the task description occupies under 8% of the model’s effective attention budget for the next decision.

Mitigations that work

TechniqueRecovery (7-10 call range)
Periodic task-restatement (every 4 calls, re-inject task)+18%
Tool-trace summarization (compress prior calls every 5 steps)+24%
Decompose to sub-agents (each under 7 calls)+31%
Combined (restate + summarize + decompose)+38%

The decompose-to-sub-agents result echoes the pattern Anthropic Managed Agents shipped on May 6, 2026 — the structural fix is to never let any single agent exceed the 7-call ceiling.

Practitioner note

If your agent workflow has long tool-call chains, this is the empirical justification for the orchestrator/worker pattern even in single-task scenarios. The number to remember is 7: design your sub-agents to have an action budget of ≤6 calls, with one call reserved for “report back to parent.” For tasks that genuinely need 10+ steps, you should be planning a multi-agent decomposition from the start, not hoping a single agent will hold coherence through it.

Tip