SWE-bench Verified (Agentless)
A human-validated subset of SWE-bench that evaluates how well language models resolve real-world GitHub issues when run with an agentless approach, that is, a fixed pipeline rather than an autonomous agent loop. The benchmark tests models on software engineering problems that require understanding and coordinating changes across multiple functions, classes, and files.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, reasoning. Language: en. Verified by llm-stats: no.
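A maximum score of 1 is consistent with reporting the fraction of issues a model resolves. The sketch below illustrates that reading only. It assumes the score is the share of instances whose generated patch passes the issue's tests; the InstanceResult type, the resolved_rate function, and the instance identifiers are hypothetical names for illustration, not part of llm-stats or the SWE-bench tooling.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome for one benchmark issue (hypothetical structure)."""
    instance_id: str  # placeholder identifier, not a real SWE-bench ID
    resolved: bool    # True if the generated patch made the issue's tests pass


def resolved_rate(results: list[InstanceResult]) -> float:
    """Return the fraction of resolved instances, a value between 0 and 1."""
    if not results:
        # Assumption: an empty run is scored as 0 rather than raising.
        return 0.0
    return sum(r.resolved for r in results) / len(results)


# Illustrative data only; these are not real benchmark results.
results = [
    InstanceResult("example-1", True),
    InstanceResult("example-2", False),
    InstanceResult("example-3", True),
    InstanceResult("example-4", False),
]
score = resolved_rate(results)
print(f"score = {score:.2f}")
```

Under this assumption, a score of 1 would mean every issue in the subset was resolved.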