HumanEval Plus


An enhanced version of HumanEval that expands the original test cases by 80x using the EvalPlus framework. The larger test suite allows a more rigorous evaluation of the functional correctness of LLM-synthesized code and catches incorrect solutions that the original tests fail to detect.
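The effect of a larger test suite can be illustrated with a minimal sketch. This is not the EvalPlus API: the task, the function names (`reference_common`, `buggy_common`, `passes`), and the test inputs are all hypothetical and chosen only to show how a wrong solution can pass a small base suite yet fail once extra tests are added.

```python
from typing import Callable, List, Tuple

# Each test case is (arguments, expected result). All values are illustrative.
TestCase = Tuple[Tuple[list, list], list]


def reference_common(l1: list, l2: list) -> list:
    """Return the sorted unique elements shared by both lists."""
    return sorted(set(l1) & set(l2))


def buggy_common(l1: list, l2: list) -> list:
    """Plausible but wrong candidate: forgets to remove duplicates."""
    return sorted(x for x in l1 if x in l2)


# A small base suite, standing in for the original hand-written tests.
BASE_TESTS: List[TestCase] = [
    (([1, 2, 3], [2, 3, 4]), [2, 3]),
    (([5], [6]), []),
]

# Hypothetical additional tests, standing in for the extended suite.
EXTRA_TESTS: List[TestCase] = [
    (([1, 1, 2], [1, 2]), [1, 2]),  # duplicates in the first list
    (([], [1]), []),                # empty input
]


def passes(candidate: Callable[[list, list], list], tests: List[TestCase]) -> bool:
    """Return True only if the candidate matches the expected output on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failure.
            return False
    return True


if __name__ == "__main__":
    print("buggy, base tests only:", passes(buggy_common, BASE_TESTS))
    print("buggy, base + extra   :", passes(buggy_common, BASE_TESTS + EXTRA_TESTS))
    print("reference, all tests  :", passes(reference_common, BASE_TESTS + EXTRA_TESTS))
```

In this sketch the buggy candidate is accepted by the base suite and rejected only after the duplicate-handling test is added, which is the kind of previously undetected error the description refers to.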

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: code, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Mistral Small 3.2 24B Instruct (self-reported, via llm-stats): 92.9%