arXiv 2604.13120 · 2026-04-13
AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous SWE
Rajesh Kumar, Waqar Ali, Junaid Ahmed
Proposes execution-grounded verification as a first-class principle: every code change must survive Docker-sandboxed execution. Reaches 40.0% on SWE-Bench Lite.
The core rule: no code change propagates to the next agent until it survives Docker-sandboxed execution. Five role-decomposed agents (Planner, Coder, Tester, Debugger, Critic) coordinate through shared memory.
Reported result: 40.0% on SWE-Bench Lite, 26-28 points above single-agent baselines. Ablations show that execution feedback and role decomposition each contribute independently.
Practitioner note (mine)
The headline takeaway: next-token likelihood is a weaker supervision signal than “did the test actually pass.” This matches what builders running coding agents are converging on (Claude Code’s verification loop, Cursor’s test-aware Agent mode, GitHub’s new Debugger agent).
For your own builds, the actionable pattern: gate every agent step on whether the change made a sandboxed run do what you expected, not on the model's stated confidence.