SWE-Lancer

A benchmark for evaluating large language models on real-world freelance software engineering tasks sourced from Upwork. It contains over 1,400 tasks with a combined value of $1 million USD, ranging from $50 bug fixes to $32,000 feature implementations. It includes both independent engineering tasks, graded via end-to-end tests, and managerial tasks, in which the model's choice is assessed against the choice made by the original engineering manager.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5.1 Codex: 66.3% (self-reported, via llm-stats)
  2. GPT-4.5: 37.3% (self-reported, via llm-stats)
  3. GPT-4o: 32.6% (self-reported, via llm-stats)
  4. o3-mini: 18.0% (self-reported, via llm-stats)
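
The metadata lists a maximum score of 1, while the leaderboard shows percentages. A minimal Python sketch of how the two might relate, assuming (this is not stated in the source) that each displayed percentage is simply 100 times the normalized score. The names `LEADERBOARD`, `to_normalized`, and `ranked` are illustrative only and not part of any llm-stats API.

```python
# Leaderboard entries copied from the list above: (model, displayed percentage).
LEADERBOARD = [
    ("GPT-5.1 Codex", 66.3),
    ("GPT-4.5", 37.3),
    ("GPT-4o", 32.6),
    ("o3-mini", 18.0),
]

MAX_SCORE = 1.0  # "Max score: 1" from the Methodology metadata


def to_normalized(percent: float, max_score: float = MAX_SCORE) -> float:
    """Convert a displayed percentage to a score on the [0, max_score] scale.

    Assumption: displayed percentage = 100 * score / max_score.
    """
    return round(percent / 100.0 * max_score, 4)


def ranked(entries):
    """Return entries sorted from highest to lowest percentage."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    for rank, (model, pct) in enumerate(ranked(LEADERBOARD), start=1):
        print(f"{rank}. {model}: {pct:.1f}% (normalized {to_normalized(pct):.3f})")
```

All four results are self-reported and not verified by llm-stats, so the ranking is best read as indicative rather than as a controlled comparison.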