arXiv 2604.16529 · 2026-04-16
Scaling Test-Time Compute for Agentic Coding
Joongwon Kim, Wannan Yang, Kelvin Niu
Argues test-time scaling for long-horizon coding agents is a representation problem. Reports Claude Opus 4.5 going from 70.9% to 77.6% on SWE-Bench Verified.
The paper reframes test-time compute (TTC) for agentic coding as a representation problem rather than a sampling problem. Each rollout is converted into a structured summary capturing hypotheses, progress, and failure modes; multiple rollouts are then combined via Recursive Tournament Voting (parallel) and Parallel-Distill-Refine (sequential).
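A minimal Python sketch of the parallel combination step, under my own assumptions: the `RolloutSummary` fields and the pairwise `judge` callable are guesses at a plausible shape, not the paper's actual schema or implementation.

```python
from dataclasses import dataclass


@dataclass
class RolloutSummary:
    # Hypothetical schema: the paper says summaries capture hypotheses,
    # progress, and failure modes; the exact fields here are my assumption.
    hypotheses: list[str]
    progress: str
    failure_modes: list[str]
    candidate_patch: str


def tournament(summaries: list[RolloutSummary], judge) -> RolloutSummary:
    """Recursive pairwise tournament: `judge(a, b)` returns the stronger
    of two summaries; winners advance until one summary remains."""
    if len(summaries) == 1:
        return summaries[0]
    winners = []
    for i in range(0, len(summaries) - 1, 2):
        winners.append(judge(summaries[i], summaries[i + 1]))
    if len(summaries) % 2 == 1:
        winners.append(summaries[-1])  # odd summary out gets a bye
    return tournament(winners, judge)
```

In practice `judge` would be an LLM call comparing two structured summaries; the recursion keeps each comparison's context small, which is presumably the point of the tournament structure.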
Reported numbers (per the abstract): Claude Opus 4.5 improves from 70.9% to 77.6% on SWE-Bench Verified using mini-SWE-agent, and from 46.9% to 59.1% on Terminal-Bench v2.0.
Practitioner note (mine)
The “structured summary as the unit of compute” idea is the part I’d actually adopt. Most TTC recipes amount to “draw N samples and vote”; this paper formalizes a way to compress prior trajectories so subsequent rollouts can build on what was already explored. If you run an agent harness over multiple rollouts, replacing raw transcripts with structured per-rollout summaries before voting is a low-effort upgrade with measurable gains.
The Recursive Tournament Voting step is more involved and needs careful engineering, but the structured-summary intermediate form is a single change you can ship in a day.
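To make the “ship in a day” part concrete, here is a hedged sketch of that single change: summarize each transcript before voting, instead of voting over raw transcripts. The prompt wording, the `llm` callable, and the `extract_patch` helper are all hypothetical stand-ins, not anything from the paper.

```python
from collections import Counter

# Hypothetical prompt; the paper's actual summary template is not reproduced here.
SUMMARY_PROMPT = (
    "Summarize this agent rollout as: hypotheses tried, progress made, "
    "failure modes hit, and the final proposed patch.\n\nTranscript:\n{transcript}"
)


def summarize_rollouts(transcripts, llm):
    # `llm` is any text-in/text-out callable (an API client, a local model, ...).
    return [llm(SUMMARY_PROMPT.format(transcript=t)) for t in transcripts]


def vote(summaries, extract_patch):
    # Plain majority vote over the patches pulled from the structured
    # summaries, replacing a vote over raw transcripts.
    patches = [extract_patch(s) for s in summaries]
    return Counter(patches).most_common(1)[0][0]
```

The only moving part you add to an existing harness is `summarize_rollouts`; the voting code downstream stays whatever it already was.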