Aider-Polyglot
A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models get two attempts at each problem; if the first attempt fails, the test error output is provided as feedback for the second. The benchmark therefore measures both initial problem-solving ability and the capacity to edit code in response to error feedback, giving an end-to-end evaluation of code generation and editing across multiple programming languages.
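The two-attempt protocol described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `run_exercise`, `pass_rates`, `Outcome`, and the `solve` and `run_tests` callables are all hypothetical names, and it assumes `run_tests` returns `None` on success and an error string on failure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence, Tuple


@dataclass
class Outcome:
    passed_first: bool      # solved on the first attempt, without feedback
    passed_within_two: bool  # solved on the first or the second attempt


def run_exercise(
    exercise: Any,
    solve: Callable[[Any, Optional[str]], str],       # hypothetical model call
    run_tests: Callable[[Any, str], Optional[str]],   # hypothetical test runner
) -> Outcome:
    # Attempt 1: the model sees only the exercise, no feedback.
    code = solve(exercise, None)
    errors = run_tests(exercise, code)  # assumed: None means all tests passed
    if errors is None:
        return Outcome(passed_first=True, passed_within_two=True)

    # Attempt 2: the model also receives the test errors from attempt 1.
    code = solve(exercise, errors)
    errors = run_tests(exercise, code)
    return Outcome(passed_first=False, passed_within_two=errors is None)


def pass_rates(outcomes: Sequence[Outcome]) -> Tuple[float, float]:
    # Returns (first-attempt pass rate, two-attempt pass rate), each in [0, 1].
    if not outcomes:
        raise ValueError("no outcomes to score")
    n = len(outcomes)
    first = sum(o.passed_first for o in outcomes) / n
    within_two = sum(o.passed_within_two for o in outcomes) / n
    return first, within_two
```

Reporting both rates separates initial problem solving from the ability to repair code using error feedback; the gap between them indicates how much the second attempt helps.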
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, general. Language: en. Verified by llm-stats: no.