BIG-Bench Hard

math

BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks, selected because prior language models did not outperform the average human rater on them. The benchmark contains 6,511 evaluation examples that test multi-step reasoning, including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.
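As a rough illustration of how a single percentage could be derived from a multi-task benchmark like this one, the sketch below scores predictions by exact match and then aggregates them two ways: a macro average (mean of per-task accuracies) and a micro average (accuracy over all examples pooled). This page does not state which aggregation or answer-matching rule the listed scores use, so the normalization in `exact_match`, the `score` helper, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def exact_match(prediction: str, target: str) -> bool:
    """Assumed matching rule: equal after trimming whitespace and ignoring case."""
    return prediction.strip().lower() == target.strip().lower()


def score(records):
    """Score an iterable of (task, prediction, target) tuples.

    Returns (per_task_accuracy, macro_average, micro_average).
    Expects at least one record.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, target in records:
        total[task] += 1
        correct[task] += exact_match(prediction, target)

    per_task = {task: correct[task] / total[task] for task in total}
    macro = sum(per_task.values()) / len(per_task)      # every task weighted equally
    micro = sum(correct.values()) / sum(total.values())  # every example weighted equally
    return per_task, macro, micro


# Toy records invented for illustration only; these are not real BBH examples.
records = [
    ("boolean_expressions", "True", "True"),
    ("boolean_expressions", "False", "True"),
    ("object_counting", "7", "7"),
]
per_task, macro, micro = score(records)
```

With unevenly sized tasks the two averages differ (here 0.75 macro versus about 0.67 micro), which is one reason scores reported by different sources are not always directly comparable.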

Leaderboard

Showing 20 of 21 results

Model                                     Score
Claude 3.5 Sonnet                         93.1%
Claude 3.5 Sonnet                         93.1%
Gemini 1.5 Pro                            89.2%
Gemma 3 27B                               87.6%
Claude 3 Opus                             86.8%
Gemma 3 12B                               85.7%
Gemini 1.5 Flash                          85.5%
Claude 3 Sonnet                           82.9%
Phi-3.5-MoE-instruct                      79.1%
Claude 3 Haiku                            73.7%
Gemma 3 4B                                72.2%
Phi 4 Mini                                70.4%
Granite 3.3 8B Base                       69.1%
Granite 3.3 8B Instruct                   69.1%
Phi-3.5-mini-instruct                     69.0%
IBM Granite 4.0 Tiny Preview              55.7%
Gemma 3n E4B                              52.9%
Gemma 3n E4B Instructed LiteRT Preview    52.9%
Gemma 3n E2B                              44.3%
Gemma 3n E2B Instructed LiteRT (Preview)  44.3%