Aider-Polyglot Edit
A challenging multi-language coding benchmark that evaluates models' code-editing abilities across C++, Go, Java, JavaScript, Python, and Rust. It contains 225 of Exercism's most difficult programming problems, chosen because 3 or fewer of 7 top coding models solved them. The benchmark focuses on code-editing tasks and measures both the correctness of solutions and proper use of the edit format. It is designed to recalibrate the evaluation scale so that top models score between 5% and 50%.
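The selection rule described above (keep a problem only if 3 or fewer of the 7 reference models solved it) can be illustrated with a minimal sketch. The function name, problem names, and model names below are hypothetical placeholders, not the benchmark's actual tooling or data.

```python
from typing import Dict, List, Set


def select_hard_problems(
    solved_by: Dict[str, Set[str]], max_solvers: int = 3
) -> List[str]:
    """Return problems solved by at most `max_solvers` models, sorted by name.

    `solved_by` maps a problem name to the set of models that solved it.
    The threshold of 3 follows the description above; the data layout is
    an assumption made for this illustration.
    """
    return sorted(
        problem
        for problem, models in solved_by.items()
        if len(models) <= max_solvers
    )


# Hypothetical results: which models solved which problem.
solved_by = {
    "problem-a": {"m1", "m2", "m3", "m4", "m5"},  # 5 solvers: too easy, dropped
    "problem-b": {"m1", "m2"},                    # 2 solvers: kept
    "problem-c": set(),                           # 0 solvers: kept
    "problem-d": {"m1", "m2", "m3"},              # exactly 3 solvers: kept
}

hard = select_hard_problems(solved_by)
print(hard)
```

Filtering on the number of solvers, rather than on a per-model score, is what pushes the retained set toward problems that most strong models fail.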
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, general. Language: en. Verified by llm-stats: no.