SWE-bench Verified (Multiple Attempts)
SWE-bench Verified is a human-validated subset of 500 test samples from the original SWE-bench dataset. It evaluates an AI system's ability to automatically resolve real GitHub issues in Python repositories. Given a codebase and an issue description, the model must edit the code so that the issue is resolved, which can require understanding and coordinating changes across multiple functions, classes, and files. Because every sample in the Verified subset was manually validated, it provides a more reliable evaluation than the original dataset.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning. Language: en. Verified by llm-stats: no.
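A maximum score of 1 suggests that the reported number is a fraction of resolved samples. The metadata above does not say how multiple attempts are aggregated, so the following is only a minimal sketch under one stated assumption: a sample counts as resolved if at least one of its attempts passes. The function name `resolved_rate`, the sample IDs, and the pass/fail outcomes are hypothetical and used for illustration only.

```python
from typing import Dict, List


def resolved_rate(attempts: Dict[str, List[bool]]) -> float:
    """Return the fraction of samples with at least one passing attempt.

    ASSUMPTION: "multiple attempts" is read here as "resolved if any
    attempt passes". The source metadata does not define the rule.
    """
    if not attempts:
        raise ValueError("no samples to score")
    # Count the samples where any single attempt resolved the issue.
    resolved = sum(1 for outcomes in attempts.values() if any(outcomes))
    return resolved / len(attempts)


# Hypothetical per-sample outcomes (True = the attempt resolved the issue).
results = {
    "sample-1": [False, True],
    "sample-2": [False, False],
    "sample-3": [True],
    "sample-4": [False, False, True],
}

score = resolved_rate(results)
print(f"resolved rate: {score:.2f}")
```

Under this assumption the score always lies between 0 and 1, which is consistent with the stated maximum. A stricter aggregation rule, such as requiring a majority of attempts to pass, would give a lower number for the same outcomes.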