SWE-Bench Verified

coding

A human-validated subset of 500 software engineering problems drawn from real GitHub issues. It evaluates a language model's ability to resolve real-world coding issues by generating patches for Python codebases.

Leaderboard

Showing 20 of 101 results

Model                    Score
Claude Fable 5           95.0%
Claude Mythos Preview    93.9%
Claude Opus 4.8          88.6%
Claude Opus 4.7          87.6%
Claude Haiku 4.5         85.7%
Claude Opus 4.5          80.9%
Claude Opus 4.6          80.8%
DeepSeek-V4-Pro-Max      80.6%
Gemini 3.1 Pro           80.6%
MiniMax M3               80.5%
Qwen3.7 Max              80.4%
Kimi K2.6                80.2%
MiniMax M2.5             80.2%
GPT-5.2                  80.0%
Claude Sonnet 4.6        79.6%
DeepSeek-V4-Flash-Max    79.0%
MiMo-V2.5-Pro            78.9%
Qwen3.6 Plus             78.8%
Gemini 3 Flash           78.0%
MiMo-V2-Pro              78.0%