HumanEval+

An enhanced version of HumanEval that extends the original test cases by 80x using the EvalPlus framework. The larger test suite allows a more rigorous evaluation of the functional correctness of LLM-synthesized code and catches incorrect solutions that the original tests let through.
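The idea of catching previously undetected wrong code can be illustrated with a toy example. The sketch below is not the EvalPlus API: the function names, the task, and the inputs are all hypothetical, and it assumes candidates are judged by comparing their outputs against a reference solution on each test input.

```python
from typing import Callable, Sequence


def reference_median(xs: list[float]) -> float:
    """Hypothetical ground-truth solution: median of a non-empty list."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def buggy_median(xs: list[float]) -> float:
    """Hypothetical model output: forgets to sort the input first."""
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2


def passes(
    candidate: Callable[[list[float]], float],
    reference: Callable[[list[float]], float],
    inputs: Sequence[list[float]],
) -> bool:
    """True if the candidate matches the reference on every input."""
    return all(candidate(list(x)) == reference(list(x)) for x in inputs)


# A small base suite (assumed): every input happens to be sorted already.
BASE_INPUTS = [[1, 2, 3], [1, 2, 3, 4]]
# Extra inputs of the kind a larger suite would add (assumed): unsorted lists.
EXTRA_INPUTS = [[3, 1, 2], [5, 1, 4, 2]]

base_ok = passes(buggy_median, reference_median, BASE_INPUTS)
plus_ok = passes(buggy_median, reference_median, BASE_INPUTS + EXTRA_INPUTS)

print(f"passes base suite:     {base_ok}")
print(f"passes extended suite: {plus_ok}")
```

The buggy solution passes the small suite only because every base input is already sorted; the additional unsorted inputs expose the missing sort. A larger and more varied test suite makes this kind of accidental pass less likely.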

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Phi 4 Reasoning: 92.9% (self-reported, llm-stats)
  2. Phi 4 Reasoning Plus: 92.3% (self-reported, llm-stats)
  3. Granite 3.3 8B Base: 86.1% (self-reported, llm-stats)
  4. Granite 3.3 8B Instruct: 86.1% (self-reported, llm-stats)
  5. Phi 4: 82.8% (self-reported, llm-stats)
  6. IBM Granite 4.0 Tiny Preview: 78.3% (self-reported, llm-stats)
  7. Qwen2.5 32B Instruct: 52.4% (self-reported, llm-stats)
  8. Qwen2.5 14B Instruct: 51.2% (self-reported, llm-stats)
  9. ERNIE 4.5: 25.0% (self-reported, llm-stats)