SWE-bench Verified (Multiple Attempts)

reasoning

SWE-bench Verified is a human-validated subset of 500 test samples from the original SWE-bench dataset. It evaluates AI systems' ability to automatically resolve real GitHub issues in Python repositories. Given a codebase and an issue description, a model must edit the code so that the issue is resolved, which requires understanding and coordinating changes across multiple functions, classes, and files. Because every test sample has been manually validated, the Verified subset provides a more reliable evaluation than the original dataset.
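As a rough illustration of how a multiple-attempts score could be tallied, the sketch below counts an instance as resolved if at least one attempt passes. That scoring rule is an assumption made for illustration; this page does not state how attempts are aggregated. The function name `resolve_rate`, the instance IDs, and the toy data are all hypothetical.

```python
from typing import Dict, List


def resolve_rate(attempts: Dict[str, List[bool]]) -> float:
    """Return the percentage of instances with at least one passing attempt.

    ASSUMPTION: an instance counts as resolved if any attempt passes.
    The benchmark page does not specify its aggregation rule.
    """
    if not attempts:
        raise ValueError("no instances to score")
    resolved = sum(1 for outcomes in attempts.values() if any(outcomes))
    return 100.0 * resolved / len(attempts)


# Hypothetical toy data: instance ID -> pass/fail outcome of each attempt.
toy = {
    "repo-a__issue-1": [False, True],
    "repo-b__issue-2": [False, False],
    "repo-c__issue-3": [True],
    "repo-d__issue-4": [False, False, True],
}

print(resolve_rate(toy))  # 75.0 (3 of 4 instances have a passing attempt)
```

Under the same assumption, and if the score is computed over all 500 samples, a score of 71.6% would correspond to 358 resolved instances.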

Leaderboard


Kimi K2 Instruct: 71.6%
