Aider-Polyglot Edit

coding

A challenging multi-language coding benchmark that evaluates models' code-editing abilities across C++, Go, Java, JavaScript, Python, and Rust. It contains 225 of Exercism's most difficult programming problems, chosen because 3 or fewer of 7 top coding models solved them. The benchmark focuses on code-editing tasks and measures both the correctness of solutions and proper use of the edit format. It was designed to recalibrate the evaluation scale so that top models score between 5% and 50%.
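The selection rule described above (keep a problem only if 3 or fewer of the 7 reference models solved it) can be sketched as a simple filter. This is a minimal illustration, not the benchmark's actual tooling: the function name `select_hard_problems`, the `solved_by` mapping, and the solve counts are all hypothetical.

```python
from typing import Dict, List


def select_hard_problems(solved_by: Dict[str, int], max_solvers: int = 3) -> List[str]:
    """Return the problems solved by at most `max_solvers` reference models.

    `solved_by` maps a problem name to the number of reference models
    (out of 7 in the description above) that solved it.
    """
    return sorted(name for name, count in solved_by.items() if count <= max_solvers)


# Hypothetical solve counts for illustration only; not real benchmark data.
example_counts = {
    "bowling": 2,
    "two-fer": 7,
    "forth": 0,
    "zebra-puzzle": 3,
    "leap": 6,
}

hard = select_hard_problems(example_counts)
print(hard)  # ['bowling', 'forth', 'zebra-puzzle']
```

The threshold is inclusive, matching "3 or fewer": a problem solved by exactly 3 models is kept.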

Leaderboard

Model               Score
DeepSeek-V3         79.7%
Gemini 2.5 Pro      72.7%
o3-mini             60.4%
o4-mini             58.2%
Gemini 2.5 Flash    56.7%
GPT-4.1             52.9%
GPT-4.5             44.9%
GPT-4.1 mini        31.6%
GPT-4o              18.2%
GPT-4.1 nano         6.2%