arXiv 2604.10261 · 2026-04-11
The Amazing Agent Race: Strong Tool Users, Weak Navigators
Zae Myung Kim, Dongseok Lee, Jaehyung Kim
Builds a DAG-puzzle Wikipedia benchmark. Across 1,400 instances and three frameworks, the best system hits 37.2% — navigation errors dominate, not tool-use errors.
The benchmark builds DAG puzzles where agents must navigate Wikipedia, chain multiple tools, and aggregate results. Across 1,400 instances and three agent frameworks, the best system reaches only 37.2% accuracy. The error attribution is the interesting part: navigation errors account for 27–52% of trials, while tool-use errors stay under 17%.
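To make the task shape concrete, here is a minimal sketch of what a DAG-puzzle instance could look like. This is my reconstruction, not the paper's schema: every name here (`PuzzleNode`, `depends_on`, the example puzzle) is illustrative. The point is just that valid solutions must respect dependency order across pages.

```python
from dataclasses import dataclass, field

# Hypothetical instance format (not the paper's actual schema).
# Each node is a Wikipedia page the agent must reach; edges encode
# which hops must be completed first.
@dataclass
class PuzzleNode:
    page_title: str                                   # Wikipedia page to reach
    depends_on: list[str] = field(default_factory=list)  # prerequisite node ids

@dataclass
class PuzzleInstance:
    nodes: dict[str, PuzzleNode]
    answer: str  # final aggregated answer the agent must produce

def topological_order(puzzle: PuzzleInstance) -> list[str]:
    """One valid visit order; any order respecting the DAG is acceptable."""
    visited: set[str] = set()
    order: list[str] = []
    def visit(node_id: str) -> None:
        if node_id in visited:
            return
        visited.add(node_id)
        for dep in puzzle.nodes[node_id].depends_on:
            visit(dep)  # resolve prerequisites before this node
        order.append(node_id)
    for node_id in puzzle.nodes:
        visit(node_id)
    return order

# Toy example: three pages, chained dependencies.
puzzle = PuzzleInstance(
    nodes={
        "a": PuzzleNode("Alan Turing"),
        "b": PuzzleNode("Enigma machine", depends_on=["a"]),
        "c": PuzzleNode("Bletchley Park", depends_on=["a", "b"]),
    },
    answer="1939",
)
print(topological_order(puzzle))  # ['a', 'b', 'c']
```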
Practitioner note (mine)
This empirically separates “your model can call tools” from “your model can stay coherent over a long browse.” Most builders worry about the former; the data says the latter is the binding constraint.
Concrete implication: invest in state tracking and replanning in your agent harness more than in tool-format polish. A clean tool schema doesn’t help if the agent has lost track of which page it’s on.
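What "harness-owned state tracking" could mean in practice, as a minimal sketch: the harness, not the model, records which page the agent actually landed on, and forces a replan whenever the landing page diverges from the plan. The names `step` and `replan` are placeholder callables I've assumed, not any framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class BrowseState:
    current_page: str
    plan: list[str] = field(default_factory=list)     # remaining target pages
    visited: list[str] = field(default_factory=list)  # ground-truth trail

def run_browse(
    state: BrowseState,
    step: Callable[[str, str], str],            # (current_page, target) -> page landed on
    replan: Callable[[BrowseState], list[str]], # fresh plan from the *actual* state
    max_steps: int = 50,
) -> BrowseState:
    """Harness-owned loop: the model proposes moves, but the harness
    tracks where the agent really is and replans on drift."""
    for _ in range(max_steps):
        if not state.plan:
            break
        target = state.plan[0]
        landed = step(state.current_page, target)
        state.current_page = landed
        state.visited.append(landed)
        if landed == target:
            state.plan.pop(0)             # on track: advance the plan
        else:
            state.plan = replan(state)    # drifted: replan from ground truth
    return state
```

The design choice this encodes: drift detection is a cheap string comparison on harness-side state, so the expensive model call (`replan`) only fires when the agent has demonstrably lost its place.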