SWE-bench Verified (Agentic Coding)


SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositories. Given a codebase and an issue description, a language model must generate a patch that resolves the described problem. The benchmark measures real-world agentic coding ability: to fix these well-defined, clearly described issues, a model has to navigate a complex codebase, understand the underlying software engineering problem, and coordinate changes across multiple functions, classes, and files.
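As a rough illustration of what "resolving" an issue could mean in practice, the sketch below assumes a patch counts as a fix when the tests that failed before the patch now pass and the tests that passed before still pass. The names `TestOutcome`, `fail_to_pass`, `pass_to_pass`, and `is_resolved` are hypothetical, chosen for this example; this is not the benchmark's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TestOutcome:
    """Hypothetical record of test results after applying a model's patch."""
    # Tests expected to go from failing to passing once the issue is fixed.
    fail_to_pass: Dict[str, bool]
    # Tests that passed before the patch and must keep passing.
    pass_to_pass: Dict[str, bool]


def is_resolved(outcome: TestOutcome) -> bool:
    """Assumed criterion: every target test passes and nothing regresses."""
    # Require at least one target test so an empty record is never "resolved".
    fixed = bool(outcome.fail_to_pass) and all(outcome.fail_to_pass.values())
    no_regressions = all(outcome.pass_to_pass.values())
    return fixed and no_regressions


outcome = TestOutcome(
    fail_to_pass={"test_issue_repro": True},
    pass_to_pass={"test_existing_behavior": True},
)
print(is_resolved(outcome))
```

Under this assumption, a patch that fixes the reported behavior but breaks an existing test would not count as resolved.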

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, reasoning. Language: en. Verified by llm-stats: no.
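The metadata gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of the conversion follows, assuming the score is the fraction of the 500 problems resolved and the percentage is that fraction times 100. The function names are hypothetical, and because the listed results are self-reported, the implied problem counts are illustrative only.

```python
TOTAL_INSTANCES = 500  # size of the subset, as stated in the description


def to_percent(score: float) -> float:
    """Convert a score on the assumed 0..1 scale to a percentage."""
    return round(score * 100, 1)


def implied_resolved(percent: float, total: int = TOTAL_INSTANCES) -> int:
    """Number of problems a percentage would correspond to, if it is a
    plain resolved/total ratio over the full set (an assumption)."""
    return round(percent / 100 * total)


print(to_percent(0.772))       # 77.2
print(implied_resolved(77.2))  # 386
print(implied_resolved(65.8))  # 329
```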

Leaderboard

  1. Claude Sonnet 4.5: 77.2% (self-reported, llm-stats)
  2. Kimi K2 Instruct: 65.8% (self-reported, llm-stats)