arXiv 2604.16529 · 2026-04-16

Scaling test-time compute for agentic coding

Joongwon Kim, Wannan Yang, Kelvin Niu

Argues that test-time scaling for long-horizon coding agents is a "representation problem" rather than a "sampling problem". Claude Opus 4.5 rises from 70.9% to 77.6% on SWE-Bench Verified.

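The paper reframes test-time compute (TTC) for agentic coding as a representation problem rather than a sampling problem. Each rollout is converted into a structured summary (hypotheses, progress, failure modes); multiple rollouts are then combined using Recursive Tournament Voting (parallel) and Parallel-Distill-Refine (sequential).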
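To make the "structured summary" idea concrete, here is a minimal sketch of what a per-rollout record could look like. Only the three field names (hypotheses, progress, failure modes) come from the summary above; the class name `RolloutSummary`, the helper `render_context`, and the text layout are my own assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutSummary:
    """Compressed record of one agent rollout (hypothetical schema)."""
    hypotheses: list[str] = field(default_factory=list)
    progress: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)


def render_context(summaries: list[RolloutSummary]) -> str:
    """Format prior summaries as plain text for the next rollout's prompt."""
    lines: list[str] = []
    for i, s in enumerate(summaries, start=1):
        lines.append(f"Rollout {i}")
        sections = (
            ("Hypotheses", s.hypotheses),
            ("Progress", s.progress),
            ("Failure modes", s.failure_modes),
        )
        for label, items in sections:
            lines.append(f"  {label}:")
            # Show an explicit placeholder so empty sections stay visible.
            for item in (items or ["(none)"]):
                lines.append(f"    - {item}")
    return "\n".join(lines)
```

In a real harness the fields would presumably be filled by a summarization call over the raw transcript; that step is left out here because it depends on the model API in use.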

Numbers (as reported in the abstract): with mini-SWE-agent, Claude Opus 4.5 goes from 70.9% to 77.6% on SWE-Bench Verified and from 46.9% to 59.1% on Terminal-Bench v2.0.

Practical notes (mine)

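"Structured summaries as the unit of compute" is the part I would actually adopt. Most TTC recipes are "draw N samples, then vote"; this paper formalizes a way to compress past trajectories so that later rollouts build on what has already been explored. If you run a multi-rollout agent harness, swapping raw transcripts for structured per-rollout summaries before voting is a low-effort upgrade with measurable gains.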
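A rough sketch of the "summarize, then vote" swap described above. Everything here is illustrative: the `Judge` signature and `vote_on_summaries` are hypothetical names, and in practice each judge would be an LLM call that reads the rendered summaries instead of the raw transcripts. Plain majority voting stands in for whatever aggregation the paper actually uses.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge: reads the rendered summaries and returns the index
# of the rollout it prefers. In a real harness this wraps a model call.
Judge = Callable[[Sequence[str]], int]


def vote_on_summaries(summaries: Sequence[str], judges: Sequence[Judge]) -> int:
    """Return the index of the rollout that wins a simple majority vote."""
    if not summaries or not judges:
        raise ValueError("need at least one summary and one judge")
    ballots: Counter[int] = Counter()
    for judge in judges:
        choice = judge(summaries)
        if not 0 <= choice < len(summaries):
            raise ValueError(f"judge returned out-of-range index {choice}")
        ballots[choice] += 1
    # Ties resolve to the index that received its first ballot earliest.
    winner, _count = ballots.most_common(1)[0]
    return winner
```

The point of the design is that the judges' input is short and uniform across rollouts, so the comparison does not depend on how long or noisy each transcript was.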

Recursive Tournament Voting is more engineering-heavy and needs careful implementation, but the structured summary as an intermediate format can ship in a day.
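For orientation only, here is a generic bracket-style tournament reduction, which is one plausible reading of the name "Recursive Tournament Voting". I have not reproduced the paper's procedure; the function `recursive_tournament`, the `pick` callback, and the `group_size` parameter are all assumptions for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def recursive_tournament(
    candidates: Sequence[T],
    pick: Callable[[Sequence[T]], T],
    group_size: int = 2,
) -> T:
    """Reduce candidates to one winner by repeated group-wise selection.

    `pick` is a hypothetical comparator (e.g. a model-backed judge) that
    returns the preferred element of a small group.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for start in range(0, len(pool), group_size):
            group = pool[start:start + group_size]
            # A leftover group of one advances without a comparison.
            next_round.append(group[0] if len(group) == 1 else pick(group))
        pool = next_round
    return pool[0]
```

Even this toy version hints at where the engineering care goes: with a noisy judge, group composition and ordering can change the winner, and each round adds judge calls.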