MBPP

MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, a reference code solution, and three automated test cases. The problems cover programming fundamentals and standard library functionality.
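
To make the problem format concrete, here is a minimal sketch of one record and a checker for it. The sample problem, the field names (`text`, `code`, `test_list`), and the helpers `passes_all_tests` and `score` are illustrative assumptions, not taken from this page. The scoring rule (percentage of problems whose solution passes every test) is likewise an assumption that is merely consistent with the stated maximum score of 100.

```python
# Hypothetical record shaped like the description above: a task description,
# a reference solution, and three assert-style test cases.
# The field names are assumptions made for illustration.
problem = {
    "text": "Write a function to find the maximum of two numbers.",
    "code": "def max_of_two(a, b):\n    return a if a >= b else b\n",
    "test_list": [
        "assert max_of_two(1, 2) == 2",
        "assert max_of_two(5, 3) == 5",
        "assert max_of_two(-1, -1) == -1",
    ],
}


def passes_all_tests(candidate_code, tests):
    """Return True if candidate_code runs and satisfies every test string."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the candidate's functions
        for test in tests:
            exec(test, namespace)  # a failing assert raises AssertionError
    except Exception:
        return False
    return True


def score(problems, solutions):
    """Percentage (0 to 100) of problems whose solution passes all its tests."""
    passed = sum(
        passes_all_tests(solution, p["test_list"])
        for p, solution in zip(problems, solutions)
    )
    return 100.0 * passed / len(problems)


if __name__ == "__main__":
    print(passes_all_tests(problem["code"], problem["test_list"]))
```

Because `exec` runs arbitrary code, a real harness would execute model-generated solutions in a sandbox with a timeout; this sketch omits both for brevity.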

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 100. Categories: general, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Sarvam-30B, 92.7% (self-reported, llm-stats)
  2. Llama-3.3 Nemotron Super 49B v1, 91.3% (self-reported, llm-stats)
  3. Qwen2.5-Coder 32B Instruct, 90.2% (self-reported, llm-stats)
  4. MiniCPM-SALA, 89.1% (self-reported, llm-stats)
  5. Qwen2.5 72B Instruct, 88.2% (self-reported, llm-stats)
  6. Llama 3.1 Nemotron Nano 8B V1, 84.6% (self-reported, llm-stats)
  7. Qwen2.5 32B Instruct, 84.0% (self-reported, llm-stats)
  8. Qwen2.5 VL 32B Instruct, 84.0% (self-reported, llm-stats)
  9. Qwen2.5-Coder 7B Instruct, 83.5% (self-reported, llm-stats)
  10. Qwen2.5 14B Instruct, 82.0% (self-reported, llm-stats)
  11. Qwen3 235B A22B, 81.4% (self-reported, llm-stats)
  12. Phi-3.5-MoE-instruct, 80.8% (self-reported, llm-stats)
  13. Qwen2 72B Instruct, 80.2% (self-reported, llm-stats)
  14. Qwen2.5 7B Instruct, 79.2% (self-reported, llm-stats)
  15. Codestral-22B, 78.2% (self-reported, llm-stats)
  16. Llama 4 Maverick, 77.6% (self-reported, llm-stats)
  17. Gemini Diffusion, 76.0% (self-reported, llm-stats)
  18. Mistral Small 3.1 24B Instruct, 74.7% (self-reported, llm-stats)
  19. Gemma 3 27B, 74.4% (self-reported, llm-stats)
  20. Qwen2.5-Omni-7B, 73.2% (self-reported, llm-stats)