Aider-Polyglot

coding

A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises in six languages: C++, Go, Java, JavaScript, Python, and Rust. Each model gets two attempts per problem; if the first attempt fails, the model is shown the test errors before making its second attempt. The benchmark therefore measures both initial problem-solving ability and the capacity to edit code in response to error feedback, giving an end-to-end evaluation of code generation and editing across multiple programming languages.
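The two-attempt protocol described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `Solver`, `TestRunner`, `solve_with_retry`, and `pass_rate` are hypothetical names, and the real harness drives a model and language-specific test suites rather than plain Python callables.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not the benchmark's real API):
# a solver maps (problem, feedback) to candidate code, and a test runner
# maps candidate code to an error message, or None when all tests pass.
Solver = Callable[[str, Optional[str]], str]
TestRunner = Callable[[str], Optional[str]]


def solve_with_retry(problem: str, solver: Solver, run_tests: TestRunner,
                     max_attempts: int = 2) -> Optional[int]:
    """Return the 1-based attempt number that passed, or None if none did."""
    feedback: Optional[str] = None  # no feedback is available on the first attempt
    for attempt in range(1, max_attempts + 1):
        candidate = solver(problem, feedback)
        feedback = run_tests(candidate)  # test errors feed the next attempt
        if feedback is None:
            return attempt
    return None


def pass_rate(outcomes: List[Optional[int]], within: int) -> float:
    """Percentage of problems solved in at most `within` attempts."""
    solved = sum(1 for o in outcomes if o is not None and o <= within)
    return 100.0 * solved / len(outcomes)


# Toy stand-ins: this "model" only succeeds once it has seen an error.
def toy_solver(problem: str, feedback: Optional[str]) -> str:
    return "good" if feedback is not None else "bad"


def toy_tests(code: str) -> Optional[str]:
    return None if code == "good" else "AssertionError: expected good"


result = solve_with_retry("two-fer", toy_solver, toy_tests)  # passes on attempt 2
```

Recording the attempt number, rather than a plain pass/fail flag, lets the same run report a first-attempt pass rate alongside the two-attempt pass rate.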

Leaderboard

Showing 20 of 22 results

Model                             Score
GPT-5                             88.0%
Gemini 2.5 Pro Preview 06-05      82.2%
o3                                81.3%
Gemini 2.5 Pro                    76.5%
DeepSeek-V3.2-Exp                 74.5%
DeepSeek-R1-0528                  71.6%
o4-mini                           68.9%
DeepSeek-V3.1                     68.4%
o3-mini                           66.7%
Gemini 2.5 Flash                  61.9%
Qwen3-Coder 480B A35B Instruct    61.8%
Kimi K2 Instruct                  60.0%
Kimi K2-Instruct-0905             60.0%
Qwen3-235B-A22B-Instruct-2507     57.3%
GPT-4.1                           51.6%
Qwen3-Next-80B-A3B-Instruct       49.8%
DeepSeek-V3                       49.6%
Magistral Medium                  47.1%
GPT-4.1 mini                      34.7%
GPT-4o                            30.7%
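To put the percentages in concrete terms, a score can be converted into an approximate number of solved exercises. The sketch below assumes each score is the share of the 225 exercises solved; the page does not state this explicitly, and `approx_solved` is a hypothetical helper.

```python
TOTAL_EXERCISES = 225  # exercise count taken from the benchmark description


def approx_solved(score_pct: float, total: int = TOTAL_EXERCISES) -> int:
    """Approximate solved-exercise count, assuming score = solved / total."""
    return round(score_pct * total / 100)


top_score_solved = approx_solved(88.0)  # 198 under this assumption
```

Under the same assumption, a single exercise is worth about 0.44 percentage points, so score gaps smaller than that cannot correspond to a whole exercise.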