SWE-bench Verified (Agentless)

reasoning

A human-validated subset of SWE-bench that evaluates how well language models resolve real-world GitHub issues under an agentless setup. Tasks require understanding an existing codebase and coordinating changes across multiple functions, classes, and files.
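The description does not spell out what "agentless" involves. A common reading is a fixed pipeline (localize the relevant files, propose candidate patches, keep one that passes tests) rather than an autonomous tool-using loop. The sketch below illustrates that reading only. Every name in it (`Issue`, `localize`, `run_pipeline`, `propose`, `passes_tests`) is hypothetical, the keyword-overlap localizer is a toy stand-in for a model call, and none of it is taken from the benchmark's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Issue:
    text: str               # issue description
    files: Dict[str, str]   # path -> file contents (toy "repository")


def localize(issue: Issue, top_k: int = 1) -> List[str]:
    """Rank files by how many issue words appear in them (toy localization)."""
    words = set(issue.text.lower().split())
    ranked = sorted(
        issue.files,
        key=lambda path: -sum(w in issue.files[path].lower() for w in words),
    )
    return ranked[:top_k]


def run_pipeline(
    issue: Issue,
    propose: Callable[[Issue, List[str]], Iterable[Dict[str, str]]],
    passes_tests: Callable[[Dict[str, str]], bool],
) -> Optional[Dict[str, str]]:
    """Fixed three-step flow: localize, propose patches, validate each one."""
    targets = localize(issue)
    for candidate in propose(issue, targets):
        patched = {**issue.files, **candidate}  # apply candidate file contents
        if passes_tests(patched):
            return candidate
    return None  # no candidate resolved the issue


# --- hypothetical demo data; in practice `propose` would call a model ---
issue = Issue(
    text="divide crashes on zero",
    files={
        "math_utils.py": "def divide(a, b):\n    return a / b\n",
        "io_utils.py": "def read(p):\n    return open(p).read()\n",
    },
)


def propose(issue: Issue, targets: List[str]) -> Iterable[Dict[str, str]]:
    # First candidate leaves the bug in place; second guards the zero case.
    yield {targets[0]: issue.files[targets[0]]}
    yield {targets[0]: "def divide(a, b):\n    return None if b == 0 else a / b\n"}


def passes_tests(files: Dict[str, str]) -> bool:
    namespace: dict = {}
    exec(files["math_utils.py"], namespace)  # load the patched module source
    try:
        return namespace["divide"](4, 2) == 2 and namespace["divide"](1, 0) is None
    except ZeroDivisionError:
        return False


patch = run_pipeline(issue, propose, passes_tests)
```

In this toy run the first candidate fails validation and the second is kept. The control flow is fixed in advance, and the model (here replaced by `propose`) never chooses its own next action.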

Leaderboard

Kimi K2 Instruct: 51.8%
MiMo-V2.5-Pro: 35.7%