MultiPL-E HumanEval


MultiPL-E is a scalable and extensible approach to benchmarking neural code generation that translates unit test-driven code generation benchmarks across multiple programming languages. It extends the HumanEval benchmark to 18 additional programming languages, enabling evaluation of code generation models across diverse programming paradigms and providing insights into how models generalize programming knowledge across language boundaries.
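The translation of unit tests can be illustrated with a minimal sketch. This is not MultiPL-E's actual code: the function names `to_lua_value` and `translate_assert` are hypothetical, Lua is picked arbitrarily as the target language, and the assertion helper name `lu.assertEquals` is an assumption about the target test library. The sketch handles only a small subset of Python literals (None, booleans, numbers, strings, lists, tuples) and only assertions of the form `assert candidate(...) == expected`.

```python
import ast


def to_lua_value(node: ast.expr) -> str:
    """Render a small subset of Python literals as Lua source text."""
    if isinstance(node, ast.Constant):
        value = node.value
        if value is None:
            return "nil"
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, (int, float)):
            return repr(value)
        if isinstance(value, str):
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            return '"' + escaped + '"'
    if isinstance(node, (ast.List, ast.Tuple)):
        # Python sequences become Lua table constructors.
        return "{" + ", ".join(to_lua_value(e) for e in node.elts) + "}"
    raise ValueError(f"unsupported literal: {ast.dump(node)}")


def translate_assert(py_assert: str, assert_fn: str = "lu.assertEquals") -> str:
    """Translate `assert candidate(<literals>) == <literal>` into one Lua line.

    `assert_fn` is the name of the equality assertion in the target test
    library; the default is an assumption, not something the source specifies.
    """
    stmt = ast.parse(py_assert).body[0]
    if not isinstance(stmt, ast.Assert) or not isinstance(stmt.test, ast.Compare):
        raise ValueError("expected an `assert a == b` statement")
    compare = stmt.test
    if len(compare.ops) != 1 or not isinstance(compare.ops[0], ast.Eq):
        raise ValueError("only a single `==` comparison is supported")
    call = compare.left
    if not isinstance(call, ast.Call) or call.keywords:
        raise ValueError("left side must be a positional-only function call")
    args = ", ".join(to_lua_value(a) for a in call.args)
    expected = to_lua_value(compare.comparators[0])
    return f"{assert_fn}(candidate({args}), {expected})"


if __name__ == "__main__":
    print(translate_assert("assert candidate([1, 2, 3], True) == 'ok'"))
```

Because the translation works on literal values and equality checks rather than on arbitrary Python code, the same test case can be re-emitted for each target language by swapping the value renderer and the assertion template.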

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, language. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Llama 3.1 405B Instruct: 75.2% (self-reported, llm-stats)
  2. Llama 3.1 70B Instruct: 65.5% (self-reported, llm-stats)
  3. Llama 3.1 8B Instruct: 50.8% (self-reported, llm-stats)