HumanEval-Mul

reasoning

A multilingual variant of the HumanEval benchmark, which measures the functional correctness of programs synthesized from docstrings. HumanEval consists of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.
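To make "functional correctness" concrete, here is a minimal Python sketch of how a single generated solution might be scored: the candidate counts as correct only if every unit-test assertion passes. The toy problem, the helper name `passes_tests`, and the aggregate `score` are illustrative assumptions, not part of the benchmark or its official harness, and the sketch covers one language only.

```python
# Minimal sketch of a functional-correctness check (assumed setup, not the official harness).
# A candidate is "correct" only if all assertions in the test code pass.

def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate defines `entry_point` and all tests pass."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)              # define the candidate function
        exec(test_src, namespace)                   # define check(candidate)
        namespace["check"](namespace[entry_point])  # run the assertions
    except Exception:
        # Any error (syntax error, wrong answer, missing function) counts as a failure.
        return False
    return True


# Hypothetical toy problem: a docstring prompt plus two model completions.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
GOOD = PROMPT + "    return a + b\n"
BAD = PROMPT + "    return a - b\n"
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

results = [passes_tests(GOOD, TESTS, "add"), passes_tests(BAD, TESTS, "add")]
score = 100.0 * sum(results) / len(results)  # percentage of candidates that pass
print(f"{score:.1f}%")
```

This sketch runs the generated code in the current process with no timeout or isolation; an actual evaluation would need to sandbox untrusted code and handle each target language's toolchain.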

Leaderboard

DeepSeek-V3: 82.6%
DeepSeek-V2.5: 73.8%