arXiv 2604.13120 · 2026-04-13
AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous SWE
Rajesh Kumar, Waqar Ali, Junaid Ahmed
Proposes execution-grounded verification as a first-class principle: every code change must survive Docker-sandboxed execution. Reaches 40.0% on SWE-Bench Lite.
The core rule: no code change propagates to the next agent until it survives Docker-sandboxed execution. Five role-decomposed agents (Planner, Coder, Tester, Debugger, Critic) coordinate through shared memory.
Reported result: 40.0% on SWE-Bench Lite, 26-28 points above single-agent baselines. Ablations show that execution feedback and role decomposition each contribute independently.
Practitioner note (mine)
The headline takeaway: next-token likelihood is a weaker supervision signal than “did the test actually pass.” This matches what builders running coding agents are converging on (Claude Code’s verification loop, Cursor’s test-aware Agent mode, GitHub’s new Debugger agent).
For your own builds, the actionable pattern: gate every agent step on whether the change made a sandboxed run do what you expected, not on the model's stated confidence.