CodeForces

A competitive programming benchmark built from problems on the CodeForces platform. It evaluates the code generation capabilities of LLMs on algorithmic problems with difficulty ratings from 800 to 2400. Problems span dynamic programming, graph algorithms, data structures, and mathematics, and solutions are evaluated in a standardized way by direct submission to the platform.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 3000. Categories: math, reasoning. Language: en. Verified by llm-stats: no.
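
The metadata gives a maximum score of 3000, while the leaderboard reports percentages. One plausible reading, which the source does not confirm, is that each percentage is a rating-style score divided by the maximum score and capped at 100%. The sketch below illustrates that reading only. The helper name `rating_to_percent` and the sample ratings are hypothetical and are not taken from the leaderboard.

```python
# Assumption: leaderboard percentages are score / max score, capped at 100%.
# This is an illustrative sketch, not the documented llm-stats formula.

MAX_SCORE = 3000  # "Max score" value from the benchmark metadata


def rating_to_percent(rating: float, max_score: float = MAX_SCORE) -> float:
    """Hypothetical helper: express a rating as a percentage of max_score.

    The result is clamped to the range [0, 100] and rounded to one decimal
    place, matching the precision used in the leaderboard.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    percent = 100.0 * rating / max_score
    return round(min(max(percent, 0.0), 100.0), 1)


if __name__ == "__main__":
    # Hypothetical ratings chosen only for illustration.
    print(rating_to_percent(2400))  # 80.0
    print(rating_to_percent(3200))  # 100.0 (capped)
```

Under this assumption, any score at or above the maximum would appear as 100.0%, so tied entries at the top of the list would not necessarily reflect identical underlying scores.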

Leaderboard

  1. DeepSeek-V4-Flash-Max self-reported llm-stats
    100.0%
  2. DeepSeek-V4-Pro-Max self-reported llm-stats
    100.0%
  3. DeepSeek-V3.2-Speciale self-reported llm-stats
    90.0%
  4. Qwen3.5-122B-A10B self-reported llm-stats
    85.1%
  5. Qwen3.5-35B-A3B self-reported llm-stats
    82.2%
  6. GPT OSS 120B self-reported llm-stats
    82.1%
  7. GPT OSS 120B self-reported llm-stats
    82.1%
  8. Qwen3.5-27B self-reported llm-stats
    80.7%
  9. DeepSeek-V3.2 (Thinking) self-reported llm-stats
    79.5%
  10. DeepSeek-V3.2 self-reported llm-stats
    79.5%
  11. GPT OSS 20B self-reported llm-stats
    74.3%
  12. GPT OSS 20B self-reported llm-stats
    74.3%
  13. DeepSeek-V3.2-Exp self-reported llm-stats
    70.7%
  14. DeepSeek-V3.1 self-reported llm-stats
    69.7%
  15. Qwen3 32B self-reported llm-stats
    65.9%
  16. DeepSeek-R1-0528 self-reported llm-stats
    64.3%