BBH

BIG-Bench Hard (BBH) is a suite of 23 challenging tasks selected from BIG-Bench on which prior language model evaluations did not outperform the average human rater. The tasks require multi-step reasoning across diverse domains, including arithmetic, logical reasoning, reading comprehension, and commonsense reasoning. The benchmark was designed to test capabilities believed to be beyond the language models available when it was released, and it focuses on complex reasoning skills such as temporal understanding, spatial reasoning, causal understanding, and deductive logical reasoning.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: language, math, reasoning. Language: en. Verified by llm-stats: no.
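
The metadata above gives a maximum score of 1, while the leaderboard shows percentages. Assuming the percentages are simply the 0-to-1 score rescaled (this page does not state that explicitly), the conversion could be sketched in Python as follows. The function name to_percent is hypothetical and is not part of any llm-stats tooling.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw benchmark score as a percentage string with one decimal place.

    Assumption: the leaderboard percentage equals score / max_score * 100.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return f"{100 * score / max_score:.1f}%"


if __name__ == "__main__":
    # Illustrative value only: a raw score of 0.889 on the 0-to-1 scale.
    print(to_percent(0.889))  # prints "88.9%"
```

Under this assumption, a raw score of 0.889 would be displayed as 88.9%.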

Leaderboard

  1. Qwen3 235B A22B (self-reported, llm-stats): 88.9%
  2. Nova Pro (self-reported, llm-stats): 86.9%
  3. Qwen2.5 32B Instruct (self-reported, llm-stats): 84.5%
  4. DeepSeek-V2.5 (self-reported, llm-stats): 84.3%
  5. Nova Lite (self-reported, llm-stats): 82.4%
  6. Qwen2 72B Instruct (self-reported, llm-stats): 82.4%
  7. MiniCPM-SALA (self-reported, llm-stats): 81.5%
  8. Nova Micro (self-reported, llm-stats): 79.5%
  9. Qwen2.5 14B Instruct (self-reported, llm-stats): 78.2%
  10. Hermes 3 70B (self-reported, llm-stats): 67.8%
  11. ERNIE 4.5 (self-reported, llm-stats): 30.4%