BBH

math

BIG-Bench Hard (BBH) is a suite of 23 challenging tasks selected from BIG-Bench because prior language model evaluations on them did not outperform the average human rater. The tasks require multi-step reasoning across diverse domains, including arithmetic, logical reasoning, reading comprehension, and commonsense reasoning. The benchmark was designed to test capabilities believed to be beyond the language models of its time, with a focus on temporal understanding, spatial reasoning, causal understanding, and deductive logic.

Leaderboard


Model                   Score
Qwen3 235B A22B         88.9%
MiMo-V2.5-Pro           88.4%
Nova Pro                86.9%
Qwen2.5 32B Instruct    84.5%
DeepSeek-V2.5           84.3%
Nova Lite               82.4%
Qwen2 72B Instruct      82.4%
MiniCPM-SALA            81.5%
Nova Micro              79.5%
Qwen2.5 14B Instruct    78.2%
Hermes 3 70B            67.8%
ERNIE 4.5               30.4%