arXiv 2604.10261 · 2026-04-11
The Amazing Agent Race: Strong Tool Users, Weak Navigators
Zae Myung Kim, Dongseok Lee, Jaehyung Kim
Builds a DAG-puzzle Wikipedia benchmark. Across 1,400 instances and three frameworks, the best system hits 37.2% — navigation errors dominate, not tool-use errors.
The benchmark builds DAG puzzles where agents must navigate Wikipedia, chain multiple tools, and aggregate results. Across 1,400 instances and three agent frameworks, the best system reaches only 37.2% accuracy. The error attribution is the interesting part: navigation errors account for 27–52% of trials, while tool-use errors stay under 17%.
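To make the task shape concrete, here is a minimal sketch of what a DAG-puzzle instance could look like. This is my reconstruction, not the paper's schema: every name here (`PuzzleNode`, `depends_on`, the example puzzle) is illustrative. The point is just that valid solutions must respect dependency order across pages.

```python
from dataclasses import dataclass, field

# Hypothetical instance format (not the paper's actual schema).
# Each node is a Wikipedia page the agent must reach; edges encode
# which hops must be completed first.
@dataclass
class PuzzleNode:
    page_title: str                                   # Wikipedia page to reach
    depends_on: list[str] = field(default_factory=list)  # prerequisite node ids

@dataclass
class PuzzleInstance:
    nodes: dict[str, PuzzleNode]
    answer: str  # final aggregated answer the agent must produce

def topological_order(puzzle: PuzzleInstance) -> list[str]:
    """One valid visit order; any order respecting the DAG is acceptable."""
    visited: set[str] = set()
    order: list[str] = []
    def visit(node_id: str) -> None:
        if node_id in visited:
            return
        visited.add(node_id)
        for dep in puzzle.nodes[node_id].depends_on:
            visit(dep)  # resolve prerequisites before this node
        order.append(node_id)
    for node_id in puzzle.nodes:
        visit(node_id)
    return order

# Toy example: three pages, chained dependencies.
puzzle = PuzzleInstance(
    nodes={
        "a": PuzzleNode("Alan Turing"),
        "b": PuzzleNode("Enigma machine", depends_on=["a"]),
        "c": PuzzleNode("Bletchley Park", depends_on=["a", "b"]),
    },
    answer="1939",
)
print(topological_order(puzzle))  # ['a', 'b', 'c']
```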
Practitioner note (mine)
This empirically separates “your model can call tools” from “your model can stay coherent over a long browse.” Most builders worry about the former; the data says the latter is the binding constraint.
Concrete implication: invest in state tracking and replanning in your agent harness more than in tool-format polish. A clean tool schema doesn’t help if the agent has lost track of which page it’s on.
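What "harness-owned state tracking" could mean in practice, as a minimal sketch: the harness, not the model, records which page the agent actually landed on, and forces a replan whenever the landing page diverges from the plan. The names `step` and `replan` are placeholder callables I've assumed, not any framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class BrowseState:
    current_page: str
    plan: list[str] = field(default_factory=list)     # remaining target pages
    visited: list[str] = field(default_factory=list)  # ground-truth trail

def run_browse(
    state: BrowseState,
    step: Callable[[str, str], str],            # (current_page, target) -> page landed on
    replan: Callable[[BrowseState], list[str]], # fresh plan from the *actual* state
    max_steps: int = 50,
) -> BrowseState:
    """Harness-owned loop: the model proposes moves, but the harness
    tracks where the agent really is and replans on drift."""
    for _ in range(max_steps):
        if not state.plan:
            break
        target = state.plan[0]
        landed = step(state.current_page, target)
        state.current_page = landed
        state.visited.append(landed)
        if landed == target:
            state.plan.pop(0)             # on track: advance the plan
        else:
            state.plan = replan(state)    # drifted: replan from ground truth
    return state
```

The design choice this encodes: drift detection is a cheap string comparison on harness-side state, so the expensive model call (`replan`) only fires when the agent has demonstrably lost its place.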