BIG-Bench Extra Hard

reasoning

BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each task in BIG-Bench Hard with a novel task that probes similar reasoning capabilities at significantly higher difficulty. The benchmark contains 23 tasks testing diverse reasoning skills, including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. It was designed to address saturation on existing benchmarks, where state-of-the-art models achieve near-perfect scores. When BBEH was released, the best general-purpose model reached only 9.8% and the best reasoning-specialized model 44.8% (harmonic mean accuracy across tasks), leaving substantial room for improvement.

Leaderboard

Model                     Score
Gemma 4 31B               74.4%
Gemma 4 26B-A4B           64.8%
Gemma 4 12B               53.0%
DiffusionGemma 26B-A4B    47.6%
Gemma 4 E4B               33.1%
Gemma 4 E2B               21.9%
Gemma 3 27B               19.3%
Gemma 3 12B               16.3%
Gemini Diffusion          15.0%
Gemma 3 4B                11.0%
Gemma 3 1B                7.2%