Multi-SWE-Bench

A multilingual issue-resolution benchmark that evaluates how well large language models resolve software issues across diverse programming ecosystems. It covers 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances annotated by 68 expert annotators. It addresses a limitation of existing benchmarks, which focus almost exclusively on Python.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. MiniMax M2.7: 52.7% (self-reported, llm-stats)
  2. MiniMax M2.5: 51.3% (self-reported, llm-stats)
  3. MiniMax M2.1: 49.4% (self-reported, llm-stats)
  4. Kimi K2-Thinking-0905: 41.9% (self-reported, llm-stats)
  5. MiniMax M2: 36.2% (self-reported, llm-stats)