SWE-Lancer (IC-Diamond subset)

Category: coding

SWE-Lancer (IC-Diamond subset) is a benchmark of real-world freelance software engineering tasks from Upwork, ranging from $50 bug fixes to $32,000 feature implementations. Independent engineering tasks are graded with end-to-end tests that were triple-verified by experienced software engineers; the benchmark also includes managerial tasks, in which models choose between technical implementation proposals.

Leaderboard

Model            Score
GPT-5            100.0%
GPT-5.3 Codex    81.4%
GPT-5.2          74.6%
GPT-4.5          17.4%
GPT-4o           12.4%
o3-mini          7.4%