SWE-bench Multilingual


A multilingual benchmark for issue resolution in software engineering, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, curated from 2,456 candidates by 68 expert annotators, and is designed to evaluate large language models across diverse software ecosystems beyond Python.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (leaderboard values are shown as percentages). Categories: code, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  All scores are self-reported (source: llm-stats).

  1. Claude Mythos Preview: 87.3%
  2. Claude Opus 4.8: 84.4%
  3. Qwen3.7 Max: 78.3%
  4. Claude Opus 4.6: 77.8%
  5. Kimi K2.6: 76.7%
  6. MiniMax M2.7: 76.5%
  7. DeepSeek-V4-Pro-Max: 76.2%
  8. Qwen3.6 Plus: 73.8%
  9. DeepSeek-V4-Flash-Max: 73.3%
  10. Kimi K2.5: 73.0%
  11. MiniMax M2.1: 72.5%
  12. MiMo-V2-Flash: 71.7%
  13. MiMo-V2-Pro: 71.7%
  14. Qwen3.6-27B: 71.3%
  15. DeepSeek-V3.2 (Thinking): 70.2%
  16. DeepSeek-V3.2: 70.2%
  17. Qwen3.5-397B-A17B: 69.3%
  18. GLM-4.7: 66.7%
  19. MAI-Code-1-Flash: 65.5%
  20. Kimi K2-Thinking-0905: 61.1%