AI-Daily-Builder

2026-05-29 views · 7 models

SWE-bench Verified — May 2026 agentic-coding leaderboard (pass@1 %)

Prompt

SWE-bench Verified is a set of 500 human-verified GitHub issues drawn from popular open-source Python repositories. For each issue, the model-driven agent must read the report, locate the files to change, write a patch, apply it, and pass the repository's hidden test suite, with no hints. The score is the percentage of issues fully resolved on the first attempt (pass@1). This card reports PUBLISHED accuracy from public leaderboards as of 2026-05-28; it is not a latency benchmark.
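As a minimal sketch of the scoring arithmetic described above: pass@1 here is the number of fully resolved issues divided by the 500 issues in the set, expressed as a percentage. The function name `pass_at_1` and the resolved count of 443 are illustrative assumptions, not figures published by any leaderboard.

```python
def pass_at_1(resolved: int, total: int = 500) -> float:
    """Percentage of issues fully resolved on the first attempt, to one decimal.

    `total` defaults to 500, the size of SWE-bench Verified as stated in the card.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100 * resolved / total, 1)


# Hypothetical count for illustration only: 443 of 500 issues resolved.
print(pass_at_1(443))  # prints 88.6
```

Because each issue is worth 0.2 points on a 500-issue set, a one-point difference between two models corresponds to only five issues.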

Notes

Published accuracy leaderboard, NOT a measured-latency run: `latency_ms` is set to 0 (not applicable) and token/cost fields are omitted on every row. The verified datum is the pass@1 % in each `response`. Scores are compiled from public SWE-bench Verified leaderboards (swebench.com, llm-stats.com, benchlm.ai, andrew.ooo, marc0.dev) snapshotted ~2026-05-28; exact numbers vary by harness, scaffold, and snapshot date, so treat differences of 1-2 points as noise. Verdict tiers (accuracy, not speed): win = 88%+, tie = 84-87.9%, loss = under 84%. Claude Mythos Preview is a restricted-access model; its 93.9% is reported but most teams cannot run it. Takeaways: (1) the frontier has compressed: the top three (Mythos 93.9, GPT-5.5 88.7, Opus 4.8 88.6) span ~5 points, and the second and third are 0.1 apart; (2) agentic-coding pass rates above 88% suggest the benchmark itself is saturating, and SWE-bench Pro / Terminal-Bench Hard are becoming the better discriminators; (3) open-weight DeepSeek V4 Pro Max at 80.6 trails the top closed score (Mythos, 93.9) by ~13 points but is closing.
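The tier thresholds and the gaps quoted in the takeaways can be checked with a short sketch. The scores are copied from the rows of this card; the names `SCORES`, `verdict`, and `gap` are hypothetical helpers introduced for illustration, and the handling of values between 87.9 and 88.0 (treated as a tie) is an assumption, since the card only publishes one decimal place.

```python
# pass@1 percentages as listed in this card's results.
SCORES = {
    "Claude Mythos Preview": 93.9,
    "GPT-5.5": 88.7,
    "Claude Opus 4.8": 88.6,
    "Claude Opus 4.7 (Adaptive)": 87.6,
    "GPT-5.3-Codex": 85.0,
    "Gemini 3.1 Pro": 80.6,
    "DeepSeek V4 Pro Max": 80.6,
}


def verdict(score: float) -> str:
    """Map a pass@1 percentage to the card's accuracy tier."""
    if score >= 88.0:
        return "win"
    if score >= 84.0:
        return "tie"
    return "loss"


def gap(a: str, b: str) -> float:
    """Difference in points between two models' scores, to one decimal."""
    return round(SCORES[a] - SCORES[b], 1)


for name, score in SCORES.items():
    print(f"{name}: {score}% -> {verdict(score)}")

print(gap("Claude Mythos Preview", "Claude Opus 4.8"))      # spread of the top three
print(gap("Claude Mythos Preview", "DeepSeek V4 Pro Max"))  # open-weight vs. top score
```

The two gaps come out at 5.3 and 13.3 points, matching the "~5" and "~13" in the takeaways. Measured against GPT-5.5, the best model that is not restricted-access, the open-weight gap is 8.1 points.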

Results — 7 models

Claude Mythos Preview (restricted) WIN · 0ms

93.9% resolved · #1 · public leaderboard (restricted-access model)

GPT-5.5 WIN · 0ms

88.7% resolved · public leaderboard

Claude Opus 4.8 WIN · 0ms

88.6% resolved · public leaderboard

Claude Opus 4.7 (Adaptive) TIE · 0ms

87.6% resolved · public leaderboard

GPT-5.3-Codex TIE · 0ms

85.0% resolved · public leaderboard

Gemini 3.1 Pro LOSS · 0ms

80.6% resolved · public leaderboard

DeepSeek V4 Pro Max (open-weight) LOSS · 0ms

80.6% resolved · public leaderboard · best open-weight