CodeForces

math

A competitive programming benchmark built from problems on the CodeForces platform. It evaluates the code generation capabilities of LLMs on algorithmic problems with difficulty ratings from 800 to 2400. Problems span dynamic programming, graph algorithms, data structures, and mathematics. Evaluation is standardized: solutions are submitted directly to the platform.

Leaderboard


Model                     Score
DeepSeek-V4-Flash-Max     100.0%
DeepSeek-V4-Pro-Max       100.0%
DeepSeek-V3.2-Speciale    90.0%
Qwen3.5-122B-A10B         85.1%
Qwen3.5-35B-A3B           82.2%
GPT OSS 120B              82.1%
Qwen3.5-27B               80.7%
DeepSeek-V3.2 (Thinking)  79.5%
DeepSeek-V3.2             79.5%
GPT OSS 20B               74.3%
DeepSeek-V3.2-Exp         70.7%
DeepSeek-V3.1             69.7%
Qwen3 32B                 65.9%
DeepSeek-R1-0528          64.3%
Gemma 4 12B               55.3%
DiffusionGemma 26B-A4B    47.6%