MBPP EvalPlus

reasoning

MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. EvalPlus extends MBPP with about 35x more test cases, so that LLM-synthesized code is checked more rigorously than with the original tests alone.
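To illustrate why extra test cases matter, here is a minimal sketch of test-based checking. It is not the EvalPlus harness: the `passes_all` helper, the test-case format, and the toy task are all assumptions made for this example. The idea is that a generated solution counts as correct only if it passes every test, so a larger test set can expose bugs that a small one misses.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical test-case format for this sketch: (arguments, expected result).
TestCase = Tuple[tuple, object]


def passes_all(candidate: Callable, tests: Iterable[TestCase]) -> bool:
    """Return True only if the candidate matches the expected result on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash on any input counts as a failure.
            return False
    return True


# Toy task (illustrative, not taken from MBPP): return the largest element of a list.
def correct_max(xs):
    return max(xs)


def buggy_max(xs):
    # Bug: silently assumes the input is sorted in ascending order.
    return xs[-1]


# A small base test set that happens to use only sorted inputs.
base_tests = [(([1, 2, 3],), 3), (([5],), 5)]

# An extended test set that adds unsorted inputs.
extended_tests = base_tests + [(([3, 1, 2],), 3), (([-1, -5],), -1)]

print(passes_all(buggy_max, base_tests))        # buggy solution slips through
print(passes_all(buggy_max, extended_tests))    # extended tests catch the bug
print(passes_all(correct_max, extended_tests))  # correct solution still passes
```

In this sketch the buggy solution passes the base tests but fails the extended ones, which is the kind of false positive a larger test set is meant to reduce.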

Leaderboard


Llama 3.1 405B Instruct: 88.6%

Llama 3.3 70B Instruct: 87.6%