LiveCodeBench

LiveCodeBench is a holistic, contamination-free benchmark for evaluating large language models on code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates models in four scenarios: code generation, self-repair, code execution, and test output prediction. Each problem is annotated with its release date, so a model can be evaluated only on problems released after its training cutoff, which it cannot have seen during training.
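The release-date filtering described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `Problem` record, its field names, the helper functions, and the sample data are all hypothetical, and the solve rate here is a plain fraction of solved problems rather than any official metric.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    problem_id: str
    source: str          # e.g. "LeetCode", "AtCoder", "CodeForces"
    release_date: date   # date the contest problem was published
    solved: bool         # whether the model's answer was judged correct


def unseen_problems(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


def solve_rate(problems):
    """Fraction of problems solved, in [0, 1]; None if the list is empty."""
    if not problems:
        return None
    return sum(p.solved for p in problems) / len(problems)


# Made-up sample data, for illustration only.
problems = [
    Problem("a", "LeetCode", date(2024, 1, 10), True),
    Problem("b", "AtCoder", date(2024, 6, 2), False),
    Problem("c", "CodeForces", date(2024, 9, 15), True),
]
cutoff = date(2024, 3, 1)  # assumed training cutoff for some model

fresh = unseen_problems(problems, cutoff)
print([p.problem_id for p in fresh])  # ['b', 'c']
print(solve_rate(fresh))              # 0.5
```

Filtering on the release date before scoring is what keeps the comparison fair across models with different cutoffs: each model is scored only on the subset of problems published after its own cutoff.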

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, general, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. DeepSeek-V4-Pro-Max: 93.5% (self-reported, llm-stats)
  2. DeepSeek-V4-Flash-Max: 91.6% (self-reported, llm-stats)
  3. DeepSeek-V3.2 (Thinking): 83.3% (self-reported, llm-stats)
  4. DeepSeek-V3.2: 83.3% (self-reported, llm-stats)
  5. MiniMax M2: 83.0% (self-reported, llm-stats)
  6. LongCat-Flash-Thinking-2601: 82.8% (self-reported, llm-stats)
  7. Nemotron 3 Super (120B A12B): 81.2% (self-reported, llm-stats)
  8. Grok-3 Mini: 80.4% (self-reported, llm-stats)
  9. Grok 4 Fast: 80.0% (self-reported, llm-stats)
  10. Grok-3: 79.4% (self-reported, llm-stats)
  11. Grok-4 Heavy: 79.4% (self-reported, llm-stats)
  12. LongCat-Flash-Thinking: 79.4% (self-reported, llm-stats)
  13. Grok-4: 79.0% (self-reported, llm-stats)
  14. MiniMax M2.1: 78.0% (self-reported, llm-stats)
  15. Nova 2 Pro: 74.6% (self-reported, llm-stats)
  16. DeepSeek-V3.2-Exp: 74.1% (self-reported, llm-stats)
  17. DeepSeek-R1-0528: 73.3% (self-reported, llm-stats)
  18. GLM-4.5: 72.9% (self-reported, llm-stats)
  19. Nemotron Nano 9B v2: 71.1% (self-reported, llm-stats)
  20. Nova 2 Lite: 71.0% (self-reported, llm-stats)