GSM8K

Grade School Math 8K (GSM8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. Solving them requires multi-step reasoning with elementary arithmetic operations.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (the leaderboard shows scores as percentages). Categories: math, reasoning. Language: en. Verified by llm-stats: no.
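
The metadata does not say how each self-reported score was computed. As a rough sketch, assuming the common convention of exact match on the final numeric answer (GSM8K reference solutions end with a "#### <answer>" line), accuracy on the 0-to-1 scale could be computed as shown here. The helper names and the sample problems are hypothetical, made up for illustration and not taken from the dataset.

```python
import re
from decimal import Decimal

# Integers or decimals, optional sign, optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def to_number(text):
    """Return the last number found in text as a Decimal, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return Decimal(matches[-1].replace(",", ""))


def reference_answer(solution):
    """Extract the final answer that follows the '####' marker in a reference solution."""
    return to_number(solution.split("####")[-1])


def accuracy(predictions, references):
    """Fraction of predictions whose last number equals the reference answer (0 to 1)."""
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = to_number(prediction)
        if predicted is not None and predicted == reference_answer(reference):
            correct += 1
    return correct / len(references)


# Hypothetical reference solutions in the '#### <answer>' style (not real dataset rows).
references = [
    "3 boxes * 4 pens = 12 pens. 12 - 2 = 10 pens. #### 10",
    "600 * 2 = 1200 steps. #### 1,200",
    "7 + 8 = 15. 15 / 3 = 5. #### 5",
    "20 * 0.5 = 10. 10 + 2.5 = 12.5 #### 12.5",
]

# Hypothetical model outputs; the third one is wrong on purpose.
predictions = [
    "She has 10 pens left.",
    "The total is 1200 steps.",
    "Each child gets 6.",
    "The cost is $12.50",
]

score = accuracy(predictions, references)
print(f"{score:.1%}")  # prints "75.0%"
```

Taking the last number in the model output is a simplification: it treats "12.50" and "12.5" as equal but would misread an answer followed by extra digits, so a production harness would likely use stricter answer extraction.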

Leaderboard

All scores are self-reported and sourced from llm-stats.

  1. Kimi K2 Instruct: 97.3%
  2. o1: 97.1%
  3. GPT-4.5: 97.0%
  4. Llama 3.1 405B Instruct: 96.8%
  5. Claude 3.5 Sonnet: 96.4%
  6. Claude 3.5 Sonnet: 96.4%
  7. Gemma 3 27B: 95.9%
  8. Qwen2.5 32B Instruct: 95.9%
  9. Qwen2.5 72B Instruct: 95.8%
  10. DeepSeek-V2.5: 95.1%
  11. Claude 3 Opus: 95.0%
  12. Nova Pro: 94.8%
  13. Qwen2.5 14B Instruct: 94.8%
  14. Nova Lite: 94.5%
  15. Gemma 3 12B: 94.4%
  16. Qwen3 235B A22B: 94.4%
  17. Mistral Large 2: 93.0%
  18. Nova Micro: 92.3%
  19. Claude 3 Sonnet: 92.3%
  20. Kimi K2 Base: 92.1%