Aider-Polyglot Edit


A challenging multi-language coding benchmark that evaluates models' code-editing abilities across C++, Go, Java, JavaScript, Python, and Rust. It contains 225 of Exercism's most difficult programming problems, chosen because each was solved by 3 or fewer of 7 top coding models. The benchmark focuses on code-editing tasks and measures both the correctness of solutions and proper use of the edit format. It was designed to recalibrate the evaluation scale so that top models score between 5% and 50%.
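The selection rule described above (keep a problem only if 3 or fewer of the 7 reference models solved it) can be sketched as a simple filter. This is a minimal illustration, not the benchmark's actual selection script: the `ProblemResult` record, the `select_hard_problems` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

NUM_REFERENCE_MODELS = 7  # from the description: 7 top coding models
MAX_SOLVERS = 3           # from the description: solved by 3 or fewer


@dataclass(frozen=True)
class ProblemResult:
    """Hypothetical record of how many reference models solved a problem."""
    name: str
    language: str
    solved_by: int  # count of reference models that solved it (0..7)


def select_hard_problems(results, max_solvers=MAX_SOLVERS):
    """Keep only problems solved by at most `max_solvers` reference models."""
    return [r for r in results if r.solved_by <= max_solvers]


# Made-up sample data, for illustration only (not real benchmark results).
sample = [
    ProblemResult("example-a", "python", 7),
    ProblemResult("example-b", "rust", 3),
    ProblemResult("example-c", "go", 0),
    ProblemResult("example-d", "java", 4),
]

hard = select_hard_problems(sample)
print([r.name for r in hard])  # ['example-b', 'example-c']
```

The threshold is inclusive, so a problem solved by exactly 3 models is kept, matching the "3 or fewer" wording.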

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, general. Language: en. Verified by llm-stats: no.
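The metadata gives a maximum score of 1, while the leaderboard shows percentages. A small sketch of the conversion follows, assuming raw scores are stored as fractions between 0 and the maximum score; that storage format and the `to_percent` helper are assumptions, not something the metadata states.

```python
def to_percent(score, max_score=1.0):
    """Convert a raw score in [0, max_score] to a percentage with one decimal.

    Assumes the raw score is a fraction of `max_score` (hypothetical format).
    """
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return round(100 * score / max_score, 1)


# 0.797 is the fractional form of the 79.7% leaderboard entry.
print(f"{to_percent(0.797)}%")
```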

Leaderboard

All scores are self-reported, as listed by llm-stats.

  1. DeepSeek-V3: 79.7%
  2. Gemini 2.5 Pro: 72.7%
  3. o3-mini: 60.4%
  4. o4-mini: 58.2%
  5. Gemini 2.5 Flash: 56.7%
  6. GPT-4.1: 52.9%
  7. GPT-4.5: 44.9%
  8. GPT-4.1 mini: 31.6%
  9. GPT-4o: 18.2%
  10. GPT-4.1 nano: 6.2%