SWE-Bench Verified
A human-validated subset of 500 software engineering problems drawn from real GitHub issues. It evaluates a language model's ability to resolve real-world coding issues by generating patches for Python codebases.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, frontend_development, reasoning. Language: en. Verified by llm-stats: no.
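The metadata gives a maximum score of 1 but does not say how the score is computed. The sketch below assumes the score is the fraction of the 500 problems whose generated patch resolves the issue, with unattempted problems counted as unresolved. The names `InstanceResult` and `resolve_rate`, and the example identifiers, are hypothetical and chosen for illustration; the data is synthetic, not a real model result.

```python
from dataclasses import dataclass

TOTAL_PROBLEMS = 500  # size of the verified subset, per the description


@dataclass
class InstanceResult:
    instance_id: str  # hypothetical identifier for one problem
    resolved: bool    # True if the generated patch resolved the issue


def resolve_rate(results: list[InstanceResult], total: int = TOTAL_PROBLEMS) -> float:
    """Return the fraction of problems resolved, a value in [0, 1].

    Assumption: problems missing from `results` count as unresolved,
    so the denominator is always `total`.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    ids = [r.instance_id for r in results]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate instance_id in results")
    if len(ids) > total:
        raise ValueError("more results than problems")
    resolved = sum(1 for r in results if r.resolved)
    return resolved / total


# Synthetic example: 250 of 500 problems resolved.
example = [InstanceResult(f"example-{i}", resolved=(i < 250)) for i in range(500)]
print(resolve_rate(example))  # 0.5
```

Dividing by the fixed total rather than by the number of submitted results keeps scores comparable between runs that skip some problems.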