SWE-Lancer (IC-Diamond subset)


SWE-Lancer (IC-Diamond subset) is a benchmark of real-world freelance software engineering tasks sourced from Upwork, priced from $50 for bug fixes to $32,000 for feature implementations. Independent engineering tasks are graded with end-to-end tests that experienced software engineers have triple-verified. The benchmark also includes managerial tasks, in which a model chooses between technical implementation proposals.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5: 100.0% (self-reported, via llm-stats)
  2. GPT-5.3 Codex: 81.4% (self-reported, via llm-stats)
  3. GPT-5.2: 74.6% (self-reported, via llm-stats)
  4. GPT-4.5: 17.4% (self-reported, via llm-stats)
  5. GPT-4o: 12.4% (self-reported, via llm-stats)
  6. o3-mini: 7.4% (self-reported, via llm-stats)
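
The metadata gives a maximum score of 1, while the leaderboard shows percentages. The following is a minimal Python sketch of how the two could relate, assuming the displayed percentage is simply the score divided by the maximum score, times 100. That mapping is an assumption, not something the metadata states. The `Entry` class and the `to_percent` and `rank` helpers are hypothetical names introduced for illustration, and the fractional scores are back-computed from the percentages listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float  # assumed to be a fraction of the maximum score (max score: 1)


# Fractions back-computed from the leaderboard percentages above
# (assumption: percent = score / max_score * 100). Deliberately unsorted.
ENTRIES = [
    Entry("o3-mini", 0.074),
    Entry("GPT-5.2", 0.746),
    Entry("GPT-5", 1.0),
    Entry("GPT-4o", 0.124),
    Entry("GPT-5.3 Codex", 0.814),
    Entry("GPT-4.5", 0.174),
]


def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a one-decimal percentage of max_score."""
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return f"{100 * score / max_score:.1f}%"


def rank(entries: list[Entry]) -> list[Entry]:
    """Return entries ordered from highest to lowest score."""
    return sorted(entries, key=lambda e: e.score, reverse=True)


for position, entry in enumerate(rank(ENTRIES), start=1):
    print(f"{position}. {entry.model}: {to_percent(entry.score)}")
```

Rejecting scores outside the range from 0 to the maximum is a design choice of this sketch: it makes an entry that was mistakenly recorded as a percentage (for example 81.4 instead of 0.814) fail loudly rather than appear as 8140.0%.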