HumanEval+

reasoning

An enhanced version of HumanEval that uses the EvalPlus framework to expand the original test cases by 80x. The larger test suite evaluates the functional correctness of LLM-synthesized code more rigorously and catches wrong code that the original tests missed.
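To illustrate why a larger test suite matters, here is a minimal sketch in which a wrong solution passes a small base suite but fails once extra cases are added. Everything in it is hypothetical: the median task, the `buggy_median` function, the `passes` helper, and both test lists are made up for illustration and are not taken from HumanEval, HumanEval+, or the EvalPlus API.

```python
from typing import Callable, List, Tuple

# Hypothetical test case type: (input list, expected output).
TestCase = Tuple[List[float], float]


def buggy_median(xs: List[float]) -> float:
    """Illustrative wrong solution: it ignores the even-length case."""
    return sorted(xs)[len(xs) // 2]


def passes(candidate: Callable[[List[float]], float], tests: List[TestCase]) -> bool:
    """Return True only if the candidate matches every expected output."""
    for args, expected in tests:
        try:
            if candidate(list(args)) != expected:
                return False
        except Exception:
            # A crash counts as a failure, same as a wrong answer.
            return False
    return True


# A small base suite (made up) that happens to cover only odd-length inputs.
base_tests: List[TestCase] = [([3, 1, 2], 2), ([5], 5), ([9, 7, 8, 1, 3], 7)]

# Extra cases (also made up) of the kind a larger suite might add.
extra_tests: List[TestCase] = [([1, 2, 3, 4], 2.5), ([10, 20], 15.0)]

base_ok = passes(buggy_median, base_tests)
plus_ok = passes(buggy_median, base_tests + extra_tests)
print(f"base suite: {base_ok}, extended suite: {plus_ok}")
```

The wrong solution is accepted by the base suite and rejected by the extended one, which is the kind of gap an expanded test set is meant to expose. A score on the extended suite can therefore be lower than a score on the base suite for the same model.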

Leaderboard


Model                           Score
Phi 4 Reasoning                 92.9%
Phi 4 Reasoning Plus            92.3%
Granite 3.3 8B Base             86.1%
Granite 3.3 8B Instruct         86.1%
Phi 4                           82.8%
IBM Granite 4.0 Tiny Preview    78.3%
MiMo-V2.5-Pro                   75.6%
Qwen2.5 32B Instruct            52.4%
Qwen2.5 14B Instruct            51.2%
ERNIE 4.5                       25.0%