EvalPlus


A rigorous code synthesis evaluation framework that augments existing datasets with extensive test cases, generated by LLM-based and mutation-based strategies, to better assess the functional correctness of generated code. It includes HumanEval+, which has 80x more test cases than the original HumanEval.
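To make the idea of test augmentation concrete, here is a minimal sketch of mutation-based input generation combined with checking a candidate against a reference solution. It is not EvalPlus's actual code or API: the task (largest element of a list), the function names (`reference_solution`, `candidate_solution`, `mutate_all`, `augment`, `passes`), and the mutation operators are all hypothetical choices made for illustration.

```python
def reference_solution(nums):
    """Ground truth for a hypothetical task: return the largest element."""
    return max(nums)


def candidate_solution(nums):
    """Hypothetical generated code with a bug: it assumes non-negative values."""
    best = 0
    for n in nums:
        if n > best:
            best = n
    return best


def mutate_all(nums):
    """Yield every single-step mutant of a non-empty list of ints."""
    for i in range(len(nums)):
        yield nums[:i] + [-nums[i]] + nums[i + 1:]     # flip the sign
        yield nums[:i] + [nums[i] + 1] + nums[i + 1:]  # perturb the value
        if len(nums) > 1:
            yield nums[:i] + nums[i + 1:]              # drop the element
    yield nums + [0]                                   # grow the list


def augment(seeds, rounds=2):
    """Expand the seed inputs by applying the mutations for a few rounds."""
    pool = [list(s) for s in seeds]
    frontier = pool
    for _ in range(rounds):
        frontier = [m for s in frontier for m in mutate_all(s)]
        pool.extend(frontier)
    return pool


def passes(candidate, reference, inputs):
    """A candidate passes only if it matches the reference on every input."""
    return all(candidate(list(x)) == reference(list(x)) for x in inputs)


seeds = [[1, 2, 3], [5, 4]]   # stand-in for a small original test set
augmented = augment(seeds)    # stand-in for the augmented test set

base_ok = passes(candidate_solution, reference_solution, seeds)
plus_ok = passes(candidate_solution, reference_solution, augmented)
print(f"seed tests: {base_ok}, augmented tests: {plus_ok}")
```

In this sketch the buggy candidate passes the two seed inputs but fails on the augmented set, because the mutations eventually produce an all-negative list such as `[-4]`. That is the kind of false positive a larger test suite is meant to expose.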

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 100. Categories: code, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Kimi K2 Base: 80.3% (self-reported, llm-stats)
  2. Qwen2 72B Instruct: 79.0% (self-reported, llm-stats)
  3. Qwen3 235B A22B: 77.6% (self-reported, llm-stats)
  4. Qwen2 7B Instruct: 70.3% (self-reported, llm-stats)