BIG-Bench Hard
math official site →
BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks selected because prior language model evaluations did not outperform average human-rater performance. The benchmark contains 6,511 evaluation examples testing various forms of multi-step reasoning including arithmetic, logical reasoning (Boolean expressions, logical deduction), geometric reasoning, temporal reasoning, and language understanding. Tasks require capabilities such as causal judgment, object counting, navigation, pattern recognition, and complex problem solving.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: language, math, reasoning. Language: en. Verified by llm-stats: no.