SWE-Bench Pro

coding

SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended reasoning and multi-step problem solving.

Methodology

Metadata imported from the public llm-stats benchmark listing. Modality: text. Max score: 1 (the leaderboard displays scores as percentages of this maximum). Categories: agents, code, reasoning. Language: en. Verified by llm-stats: no.
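
The metadata gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of that conversion follows, assuming scores are stored as fractions of the maximum; the helper name `to_percent` is hypothetical and not part of any llm-stats API.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of max_score, with one decimal place.

    Assumption: raw scores lie in [0, max_score], as implied by "Max score: 1".
    """
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return f"{score / max_score * 100:.1f}%"


if __name__ == "__main__":
    # 0.778 is the top leaderboard entry expressed as a fraction of the maximum.
    print(to_percent(0.778))
```

Under this assumption, a raw score of 0.778 corresponds to the 77.8% shown for the top entry.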

Leaderboard

All scores are self-reported and sourced from llm-stats.

  1. Claude Mythos Preview: 77.8%
  2. Claude Opus 4.8: 69.2%
  3. Claude Opus 4.7: 64.3%
  4. Qwen3.7 Max: 60.6%
  5. MiniMax M3: 59.0%
  6. GPT-5.5: 58.6%
  7. Kimi K2.6: 58.6%
  8. GLM-5.1: 58.4%
  9. GPT-5.4: 57.7%
  10. GPT-5.3 Codex: 56.8%
  11. Qwen3.6 Plus: 56.6%
  12. GPT-5.2 Codex: 56.4%
  13. MiniMax M2.7: 56.2%
  14. DeepSeek-V4-Pro-Max: 55.4%
  15. MiniMax M2.5: 55.4%
  16. Gemini 3.5 Flash: 55.1%
  17. GPT-5.4 mini: 54.4%
  18. Gemini 3.1 Pro: 54.2%
  19. Qwen3.6-27B: 53.5%
  20. MAI-Thinking-1: 52.8%
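
Several entries above share a score (58.6% and 55.4% each appear twice) yet occupy consecutive positions. As an illustration only, the sketch below shows one way tied scores could be given a shared rank ("competition" ranking). The function name `competition_ranks` is hypothetical, and nothing here reflects how llm-stats actually orders ties.

```python
from typing import List, Tuple


def competition_ranks(entries: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Rank (model, score) pairs by descending score; equal scores share a rank.

    After a tie, the next rank skips ahead (1, 2, 2, 4), as in competition ranking.
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as the previous entry: reuse its rank
        else:
            rank = position
        ranked.append((rank, model, score))
    return ranked


if __name__ == "__main__":
    # A four-entry subset of the leaderboard; ranks are relative to this subset only.
    subset = [
        ("GPT-5.5", 58.6),
        ("Kimi K2.6", 58.6),
        ("GLM-5.1", 58.4),
        ("MiniMax M3", 59.0),
    ]
    for rank, model, score in competition_ranks(subset):
        print(f"{rank}. {model}: {score:.1f}%")
```

With this scheme the two 58.6% entries would share a rank instead of being listed one above the other.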