SWE-Bench Verified


A human-validated subset of 500 software engineering problems drawn from real GitHub issues. It evaluates how well language models resolve real-world coding issues by generating patches for Python codebases.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, frontend_development, reasoning. Language: en. Verified by llm-stats: no.
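
The metadata above gives a maximum score of 1, while the leaderboard shows percentages. The sketch below shows how the two scales could relate, assuming a score is simply the fraction of the 500 problems a model resolves. That assumption, the helper names, and the example count of 403 are illustrative and are not taken from llm-stats; self-reported results may be computed differently.

```python
# Minimal sketch: relating a 0-to-1 score to the percentage shown on a leaderboard.
# Assumption: score = resolved problems / total problems (500 for this subset).

TOTAL_TASKS = 500  # size of the subset, as stated in the description


def resolved_fraction(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Return the fraction of problems resolved, on a 0-to-1 scale."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return resolved / total


def as_percent(score: float) -> str:
    """Format a 0-to-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


def implied_resolved(percent: float, total: int = TOTAL_TASKS) -> int:
    """Estimate how many problems a percentage corresponds to (rounded)."""
    return round(percent / 100 * total)


if __name__ == "__main__":
    score = resolved_fraction(403)  # hypothetical count, for illustration only
    print(as_percent(score))        # → 80.6%
    print(implied_resolved(80.6))   # → 403
```

Under this assumption each problem is worth 0.2 percentage points, so two entries that differ by 0.2% would differ by a single resolved problem.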

Leaderboard

  1. Claude Mythos Preview self-reported llm-stats
    93.9%
  2. Claude Opus 4.8 self-reported llm-stats
    88.6%
  3. Claude Opus 4.7 self-reported llm-stats
    87.6%
  4. Claude Haiku 4.5 self-reported llm-stats
    85.7%
  5. Claude Opus 4.5 self-reported llm-stats
    80.9%
  6. Claude Opus 4.6 self-reported llm-stats
    80.8%
  7. DeepSeek-V4-Pro-Max self-reported llm-stats
    80.6%
  8. Gemini 3.1 Pro self-reported llm-stats
    80.6%
  9. MiniMax M3 self-reported llm-stats
    80.5%
  10. Qwen3.7 Max self-reported llm-stats
    80.4%
  11. Kimi K2.6 self-reported llm-stats
    80.2%
  12. MiniMax M2.5 self-reported llm-stats
    80.2%
  13. GPT-5.2 self-reported llm-stats
    80.0%
  14. Claude Sonnet 4.6 self-reported llm-stats
    79.6%
  15. DeepSeek-V4-Flash-Max self-reported llm-stats
    79.0%
  16. Qwen3.6 Plus self-reported llm-stats
    78.8%
  17. Gemini 3 Flash self-reported llm-stats
    78.0%
  18. MiMo-V2-Pro self-reported llm-stats
    78.0%
  19. GLM-5 self-reported llm-stats
    77.8%
  20. Mistral Medium 3.5 self-reported llm-stats
    77.6%