EvalPlus

coding

A rigorous evaluation framework for code synthesis. It augments existing datasets with extensive test cases, generated by LLM-based and mutation-based strategies, to better assess the functional correctness of generated code. It includes HumanEval+, which has 80x more test cases than HumanEval.
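As a rough illustration of the mutation-based side of this idea, the sketch below grows a small set of seed inputs by type-aware mutation and then checks a candidate function against a trusted reference on every input. It is a minimal sketch under stated assumptions, not EvalPlus's actual implementation: the names `mutate`, `augment_inputs`, and `passes`, the specific mutation rules, and the use of a reference solution to supply expected outputs are all hypothetical choices made for this example.

```python
import random


def mutate(value, rng):
    """Return a small type-aware perturbation of value (illustrative rules only)."""
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.choice([-1.0, 1.0])
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # drop one character
        return value + rng.choice("abc")      # append one character
    if isinstance(value, list) and value:
        if rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + [mutate(value[i], rng)] + value[i + 1:]
        return value + [rng.choice(value)]    # duplicate an existing element
    return value  # unsupported types and empty lists are left unchanged


def augment_inputs(seeds, n_new, seed=0):
    """Extend seed inputs (tuples of arguments) with n_new mutated variants."""
    rng = random.Random(seed)
    pool = list(seeds)
    while len(pool) < len(seeds) + n_new:
        parent = rng.choice(pool)  # mutants can themselves be mutated again
        pool.append(tuple(mutate(arg, rng) for arg in parent))
    return pool


def passes(candidate, reference, inputs):
    """True if candidate matches reference on every input without raising."""
    for args in inputs:
        expected = reference(*args)
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


# Hypothetical task: sum of absolute values.
def reference_abs_sum(xs):
    return sum(abs(x) for x in xs)


def candidate_abs_sum(xs):
    return sum(xs)  # wrong for negative numbers, but passes on positive-only tests


seed_inputs = [([1, 2, 3],)]
augmented = augment_inputs(seed_inputs, n_new=50)
print(len(augmented), passes(candidate_abs_sum, reference_abs_sum, augmented))
```

The buggy candidate agrees with the reference on the single seed input. Whether the augmented set exposes the bug depends on which mutations are drawn, which is why a larger and more varied test set gives a stricter check of functional correctness.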

Leaderboard

Kimi K2 Base: 80.3%
Qwen2 72B Instruct: 79.0%
Qwen3 235B A22B: 77.6%
Qwen2 7B Instruct: 70.3%