HumanEval-Mul


A multilingual variant of the HumanEval benchmark, which measures functional correctness when synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics.
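As a rough illustration of what "functional correctness" means here, the sketch below runs a generated solution against unit tests and counts a problem as solved only if every assertion passes. The helper name `passes_tests`, the toy `add` problem, and its tests are hypothetical and not taken from the benchmark; the `check(candidate)` convention mirrors the original Python HumanEval layout and is assumed rather than confirmed for this multilingual variant. A real harness would also sandbox execution and enforce timeouts.

```python
def passes_tests(candidate_source: str, test_source: str, entry_point: str) -> bool:
    """Return True if the candidate function passes every assertion in the tests."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # defines the candidate function
        exec(test_source, namespace)       # defines check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Any failed assertion, runtime error, or syntax error counts as a failure.
        return False
    return True


# Hypothetical problem: a docstring prompt plus a model-written body.
candidate = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''

# Hypothetical unit tests in the check(candidate) style.
tests = '''
def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''

# Score = fraction of problems whose single sample passes all tests.
results = [passes_tests(candidate, tests, "add")]
score = sum(results) / len(results)
```

Under this scheme a score is the fraction of problems solved, so it falls between 0 and 1 and can be reported as a percentage.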

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. DeepSeek-V3: 82.6% (self-reported, via llm-stats)
  2. DeepSeek-V2.5: 73.8% (self-reported, via llm-stats)