HiddenMath

Google DeepMind's internal mathematical reasoning benchmark. It uses novel problems not encountered during model training, so that scores reflect mathematical reasoning rather than memorization.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: math, reasoning. Language: en. Verified by llm-stats: no.
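
The metadata gives a maximum score of 1, while the leaderboard entries are shown as percentages. A minimal sketch of that relationship, assuming the displayed percentage is simply the raw score divided by the maximum score and multiplied by 100 (the helper name `to_percent` is hypothetical and not part of llm-stats):

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of max_score, one decimal place.

    Assumption: leaderboard percentages are score / max_score * 100.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{100 * score / max_score:.1f}%"


# A raw score of 0.63 on a 0-to-1 scale is displayed as a percentage.
print(to_percent(0.63))
```

Under this assumption, a raw score of 0.63 corresponds to the 63.0% shown for the top entry.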

Leaderboard

  1. Gemini 2.0 Flash (self-reported, via llm-stats)
    63.0%
  2. Gemma 3 27B (self-reported, via llm-stats)
    60.3%
  3. Gemini 2.0 Flash-Lite (self-reported, via llm-stats)
    55.3%
  4. Gemma 3 12B (self-reported, via llm-stats)
    54.5%
  5. Gemini 1.5 Pro (self-reported, via llm-stats)
    52.0%
  6. Gemini 1.5 Flash (self-reported, via llm-stats)
    47.2%
  7. Gemma 3 4B (self-reported, via llm-stats)
    43.0%
  8. Gemma 3n E4B Instructed (self-reported, via llm-stats)
    37.7%
  9. 37.7%
  10. Gemini 1.5 Flash 8B (self-reported, via llm-stats)
    32.8%
  11. Gemma 3n E2B Instructed (self-reported, via llm-stats)
    27.7%
  12. 27.7%
  13. Gemma 3 1B (self-reported, via llm-stats)
    15.8%