Aider-Polyglot


A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models get two attempts at each problem; if the first attempt fails, the test error output is fed back to the model before the second. The benchmark therefore measures both initial problem-solving ability and the capacity to edit code in response to error feedback, giving an end-to-end evaluation of code generation and editing across multiple programming languages.
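The two-attempt protocol described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `evaluate`, `solve`, `run_tests`, `TestResult`, and the result keys are all hypothetical names chosen for this example, and the real tooling may structure attempts and scoring differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass
class TestResult:
    """Outcome of running one exercise's unit tests (hypothetical type)."""
    passed: bool
    error_output: str = ""


def evaluate(
    problems: Sequence[str],
    solve: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str, str], TestResult],
    max_attempts: int = 2,
) -> Dict[str, float]:
    """Score a model under an assumed two-attempt protocol.

    `solve(problem, feedback)` returns candidate code; `feedback` is None on
    the first attempt and the failing test output on later attempts.
    `run_tests(problem, code)` runs the exercise's tests on that code.
    """
    solved_first_try = 0
    solved_overall = 0

    for problem in problems:
        feedback: Optional[str] = None
        for attempt in range(1, max_attempts + 1):
            code = solve(problem, feedback)
            result = run_tests(problem, code)
            if result.passed:
                solved_overall += 1
                if attempt == 1:
                    solved_first_try += 1
                break
            # Failed: pass the test errors into the next attempt.
            feedback = result.error_output

    total = len(problems)
    if total == 0:
        return {"pass_rate_1": 0.0, "pass_rate_2": 0.0}
    return {
        # Fraction solved without any feedback.
        "pass_rate_1": solved_first_try / total,
        # Fraction solved within the allowed attempts.
        "pass_rate_2": solved_overall / total,
    }
```

In this sketch, the gap between the two rates gives a rough sense of how much a model benefits from seeing the test errors, which is the editing ability the description refers to.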

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (the leaderboard shows scores as percentages). Categories: code, general. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5: 88.0% (self-reported, llm-stats)
  2. Gemini 2.5 Pro Preview 06-05: 82.2% (self-reported, llm-stats)
  3. o3: 81.3% (self-reported, llm-stats)
  4. Gemini 2.5 Pro: 76.5% (self-reported, llm-stats)
  5. DeepSeek-V3.2-Exp: 74.5% (self-reported, llm-stats)
  6. DeepSeek-R1-0528: 71.6% (self-reported, llm-stats)
  7. o4-mini: 68.9% (self-reported, llm-stats)
  8. DeepSeek-V3.1: 68.4% (self-reported, llm-stats)
  9. o3-mini: 66.7% (self-reported, llm-stats)
  10. Gemini 2.5 Flash: 61.9% (self-reported, llm-stats)
  11. Kimi K2 Instruct: 60.0% (self-reported, llm-stats)
  12. Kimi K2-Instruct-0905: 60.0% (self-reported, llm-stats)
  13. Qwen3-235B-A22B-Instruct-2507: 57.3% (self-reported, llm-stats)
  14. GPT-4.1: 51.6% (self-reported, llm-stats)
  15. Qwen3-Next-80B-A3B-Instruct: 49.8% (self-reported, llm-stats)
  16. DeepSeek-V3: 49.6% (self-reported, llm-stats)
  17. Magistral Medium: 47.1% (self-reported, llm-stats)
  18. GPT-4.1 mini: 34.7% (self-reported, llm-stats)
  19. GPT-4o: 30.7% (self-reported, llm-stats)
  20. Gemini 2.5 Flash-Lite: 26.7% (self-reported, llm-stats)