BIG-Bench Extra Hard

BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each task in BIG-Bench Hard with a novel task probing a similar reasoning capability at significantly higher difficulty. It contains 23 tasks covering diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. BBEH was designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores. In the original evaluation it left substantial room for improvement: the best general-purpose model reached a (harmonic) average accuracy of 9.8% and the best reasoning-specialized model reached 44.8%.
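
How an "average accuracy" over 23 tasks is aggregated affects how a figure like this should be read. The BBEH paper describes its headline number as a harmonic mean, which penalizes weak tasks more strongly than an arithmetic mean does. The following is a minimal sketch of that difference, not the benchmark's official scoring code. The helper `harmonic_mean`, the task names, and the accuracy values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def harmonic_mean(values):
    """Harmonic mean of non-negative numbers; 0.0 if any value is zero."""
    values = list(values)
    if any(v < 0 for v in values):
        raise ValueError("accuracies must be non-negative")
    if any(v == 0 for v in values):
        # The harmonic mean tends to zero as any single value tends to zero.
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-task accuracies in percent (NOT real BBEH results).
task_accuracy = {"task_a": 80.0, "task_b": 40.0, "task_c": 10.0}

arithmetic = mean(task_accuracy.values())
harmonic = harmonic_mean(task_accuracy.values())

print(f"arithmetic mean: {arithmetic:.1f}%")
print(f"harmonic mean:   {harmonic:.1f}%")
```

With these illustrative numbers the single weak task pulls the harmonic mean well below the arithmetic mean, which is why a model needs to do reasonably well on every task to score highly under this kind of aggregation.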

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, language, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Gemma 4 31B: 74.4% (self-reported, llm-stats)
  2. Gemma 4 26B-A4B: 64.8% (self-reported, llm-stats)
  3. Gemma 4 E4B: 33.1% (self-reported, llm-stats)
  4. Gemma 4 E2B: 21.9% (self-reported, llm-stats)
  5. Gemma 3 27B: 19.3% (self-reported, llm-stats)
  6. Gemma 3 12B: 16.3% (self-reported, llm-stats)
  7. Gemini Diffusion: 15.0% (self-reported, llm-stats)
  8. Gemma 3 4B: 11.0% (self-reported, llm-stats)
  9. Gemma 3 1B: 7.2% (self-reported, llm-stats)