MultiPL-E

MultiPL-E is a scalable, extensible system for translating unit-test-driven code generation benchmarks into multiple programming languages. It extends the Python benchmarks HumanEval and MBPP to 18 additional programming languages, so that neural code generation models can be evaluated across diverse programming paradigms and language features.
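To make the idea of translating a unit-test-driven benchmark concrete, here is a minimal sketch of how one Python test case might be rendered as a JavaScript assertion. It is an illustration only, not MultiPL-E's actual translator: the helper names to_js_literal and translate_test are hypothetical, the choice of JavaScript as the target is an assumption, and the sketch handles only values that json.dumps can serialize.

```python
import json


def to_js_literal(value):
    """Render a simple Python value as a JavaScript literal (hypothetical helper).

    json.dumps maps True/False/None to true/false/null and renders
    numbers, strings, and lists in a form JavaScript also accepts.
    """
    return json.dumps(value)


def translate_test(func_name, args, expected):
    """Turn one (args, expected) test case into a Node.js assertion line.

    Assumes the target file has imported Node's built-in `assert` module.
    """
    rendered_args = ", ".join(to_js_literal(a) for a in args)
    return (
        f"assert.deepStrictEqual({func_name}({rendered_args}), "
        f"{to_js_literal(expected)});"
    )


# Example: the Python test `assert add(1, 2) == 3` as a JavaScript line.
print(translate_test("add", [1, 2], 3))
```

A real translator also has to convert the function signature, docstring, and language-specific types; this sketch covers only the test-case values.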

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, language. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen3-235B-A22B-Instruct-2507 self-reported llm-stats
    87.9%
  2. Qwen3-Next-80B-A3B-Instruct self-reported llm-stats
    87.8%
  3. Qwen3 VL 235B A22B Instruct self-reported llm-stats
    86.1%
  4. Kimi K2 Instruct self-reported llm-stats
    85.7%
  5. Kimi K2-Instruct-0905 self-reported llm-stats
    85.7%
  6. Qwen2.5 32B Instruct self-reported llm-stats
    75.4%
  7. Qwen2.5 72B Instruct self-reported llm-stats
    75.1%
  8. Qwen2.5 14B Instruct self-reported llm-stats
    72.8%
  9. Qwen2.5 7B Instruct self-reported llm-stats
    70.4%
  10. Qwen2 72B Instruct self-reported llm-stats
    69.2%
  11. Qwen3 235B A22B self-reported llm-stats
    65.9%
  12. Qwen2.5-Omni-7B self-reported llm-stats
    65.8%
  13. Qwen2 7B Instruct self-reported llm-stats
    59.1%