BIG-Bench Hard


BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks, selected because language models in prior evaluations did not outperform the average human rater on them. The benchmark contains 6,511 evaluation examples that test multi-step reasoning, including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. The tasks call for capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: language, math, reasoning. Language: en. Verified by llm-stats: no.
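
The metadata above gives a maximum score of 1, while the leaderboard shows percentages. As a minimal sketch of how the two relate, assuming the score is plain exact-match accuracy over the evaluation examples (the self-reported results below may use different prompting or answer extraction), the helper names `exact_match_accuracy` and `as_percent` are hypothetical and not part of any llm-stats or BIG-Bench API:

```python
def exact_match_accuracy(predictions, targets):
    """Return the fraction of predictions equal to their target (0 to 1).

    Assumption: answers are compared as strings after trimming whitespace.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)


def as_percent(score):
    """Format a 0-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


if __name__ == "__main__":
    # Toy answers for illustration only, not real BBH data.
    preds = ["True", "(A)", "7"]
    golds = ["True", "(B)", "7"]
    print(as_percent(exact_match_accuracy(preds, golds)))
```

Under this assumption, a score of 0.931 on the 0 to 1 scale corresponds to the 93.1% shown in the leaderboard.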

Leaderboard

  1. Claude 3.5 Sonnet self-reported llm-stats
    93.1%
  2. Claude 3.5 Sonnet self-reported llm-stats
    93.1%
  3. Gemini 1.5 Pro self-reported llm-stats
    89.2%
  4. Gemma 3 27B self-reported llm-stats
    87.6%
  5. Claude 3 Opus self-reported llm-stats
    86.8%
  6. Gemma 3 12B self-reported llm-stats
    85.7%
  7. Gemini 1.5 Flash self-reported llm-stats
    85.5%
  8. Claude 3 Sonnet self-reported llm-stats
    82.9%
  9. Phi-3.5-MoE-instruct self-reported llm-stats
    79.1%
  10. Claude 3 Haiku self-reported llm-stats
    73.7%
  11. Gemma 3 4B self-reported llm-stats
    72.2%
  12. Phi 4 Mini self-reported llm-stats
    70.4%
  13. Granite 3.3 8B Base self-reported llm-stats
    69.1%
  14. Granite 3.3 8B Instruct self-reported llm-stats
    69.1%
  15. Phi-3.5-mini-instruct self-reported llm-stats
    69.0%
  16. IBM Granite 4.0 Tiny Preview self-reported llm-stats
    55.7%
  17. Gemma 3n E4B self-reported llm-stats
    52.9%
  18. 52.9%
  19. Gemma 3n E2B self-reported llm-stats
    44.3%
  20. 44.3%