SWE-bench Verified (Agentic Coding)


SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, a language model must generate a patch that resolves the described problem. The benchmark measures real-world agentic coding ability: to fix each well-defined, clearly described issue, a model has to navigate a complex codebase, understand the underlying software engineering problem, and coordinate changes across multiple functions, classes, and files.
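A score on this benchmark is easiest to read as a count of tasks. The sketch below is a minimal illustration, assuming the reported percentage is simply the share of the 500 tasks whose generated patch applies and passes that task's tests. The `TaskResult` record and the function names are hypothetical and are not part of any official evaluation harness.

```python
from dataclasses import dataclass

TOTAL_TASKS = 500  # size of SWE-bench Verified, per the description above


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record layout, for illustration)."""
    task_id: str
    patch_applied: bool  # did the generated patch apply cleanly?
    tests_passed: bool   # did the task's tests pass after applying it?


def is_resolved(result: TaskResult) -> bool:
    # Assumption: a task counts as resolved only if both conditions hold.
    return result.patch_applied and result.tests_passed


def resolve_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks resolved."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


def tasks_for_score(percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a reported percentage back into a task count."""
    return round(percent * total / 100)
```

Under that assumption, `tasks_for_score(77.2)` gives 386 and `tasks_for_score(65.8)` gives 329, so a one-point difference on this benchmark corresponds to five tasks.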

Leaderboard


Claude Sonnet 4.5: 77.2%
Kimi K2 Instruct: 65.8%