HumanEval


A benchmark that measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.
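To illustrate what "functional correctness" means here, the following is a minimal sketch of how a generated completion could be checked against unit tests. Everything in it is an assumption for illustration: the `add` problem, the `check` test function, and the helper names `passes` and `score` are hypothetical and are not taken from the benchmark or its official harness.

```python
from typing import List

# Hypothetical docstring-to-code problem (illustrative only, not from the dataset).
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# A candidate completion, as a model might produce it for the prompt above.
CANDIDATE_BODY = "    return a + b\n"

# Hypothetical unit tests, written as a function that receives the candidate.
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)      # define the candidate function
        exec(test, namespace)                     # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False


def score(results: List[bool]) -> float:
    """Fraction of problems solved, between 0 and 1."""
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    ok = passes(PROMPT, CANDIDATE_BODY, TEST, "add")
    bad = passes(PROMPT, "    return a - b\n", TEST, "add")
    print(ok, bad, score([ok, bad]))
```

This sketch runs generated code directly with `exec` and has no timeout, so it is only suitable for trusted toy inputs. A real evaluation would run each candidate in an isolated sandbox with time limits.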

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. MiniCPM-SALA (self-reported, llm-stats): 95.1%
  2. 95.0%
  3. Kimi K2 0905 (self-reported, llm-stats): 94.5%
  4. Claude 3.5 Sonnet (self-reported, llm-stats): 93.7%
  5. GPT-5 (self-reported, llm-stats): 93.4%
  6. Kimi K2 Instruct (self-reported, llm-stats): 93.3%
  7. Qwen2.5-Coder 32B Instruct (self-reported, llm-stats): 92.7%
  8. 92.4%
  9. o1-mini (self-reported, llm-stats): 92.4%
  10. Sarvam-30B (self-reported, llm-stats): 92.1%
  11. 92.0%
  12. 92.0%
  13. Claude 3.5 Sonnet (self-reported, llm-stats): 92.0%
  14. Mistral Large 2 (self-reported, llm-stats): 92.0%
  15. Qwen2.5 VL 32B Instruct (self-reported, llm-stats): 91.5%
  16. 90.2%
  17. GPT-4o (self-reported, llm-stats): 90.2%
  18. Granite 3.3 8B Base (self-reported, llm-stats): 89.7%
  19. Granite 3.3 8B Instruct (self-reported, llm-stats): 89.7%
  20. Gemini Diffusion (self-reported, llm-stats): 89.6%