BIG-Bench

Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark of 204+ tasks designed to probe large language models and extrapolate their future capabilities. It covers diverse domains, including linguistics, mathematics, common-sense reasoning, biology, physics, social bias, and software development. The benchmark focuses on tasks believed to be beyond the capabilities of current language models, and it includes tasks in English as well as in other languages.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: language, math, reasoning. Language: en. Verified by llm-stats: no.
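
The metadata gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of that conversion, assuming the displayed percentage is simply the raw score divided by the maximum score; the function name `to_percent` and the raw value `0.75` are hypothetical and used only for illustration:

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw benchmark score as a percentage of the maximum score.

    Assumption: the leaderboard percentage is score / max_score * 100,
    rounded to one decimal place. This is not confirmed by the source.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{100 * score / max_score:.1f}%"


# Hypothetical raw score on the 0-to-1 scale.
print(to_percent(0.75))  # prints "75.0%"
```

Under this assumption, a leaderboard entry of 75.0% corresponds to a raw score of 0.75 out of 1.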

Leaderboard

  1. Gemini 1.0 Pro: 75.0% (self-reported, source: llm-stats)
  2. Gemma 2 27B: 74.9% (self-reported, source: llm-stats)
  3. Gemma 2 9B: 68.2% (self-reported, source: llm-stats)