BoolQ


BoolQ is a reading comprehension dataset of 15,942 naturally occurring yes/no questions. Each example consists of a question, a passage, and a boolean answer; the questions were written in unprompted, unconstrained settings. Many of them ask for complex, non-factoid information, so answering them requires entailment-like inference over the passage.
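To make the example structure concrete, here is a minimal sketch in Python of one way to represent a BoolQ-style record and score yes/no predictions by accuracy. The class name `BoolQExample`, the helper `accuracy`, and the four sample records are hypothetical and written for illustration only; they are not taken from the dataset, and the benchmark's official evaluation code may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BoolQExample:
    """One record: a yes/no question, a supporting passage, a boolean answer."""
    question: str
    passage: str
    answer: bool


def accuracy(examples: Sequence[BoolQExample], predictions: Sequence[bool]) -> float:
    """Return the fraction of predictions that match the gold boolean answers."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        raise ValueError("cannot score an empty set of examples")
    correct = sum(ex.answer == pred for ex, pred in zip(examples, predictions))
    return correct / len(examples)


# Hypothetical records, invented for this sketch (not drawn from BoolQ).
examples = [
    BoolQExample(
        question="is water made of hydrogen and oxygen",
        passage="A water molecule contains two hydrogen atoms and one oxygen atom.",
        answer=True,
    ),
    BoolQExample(
        question="is the moon larger than the earth",
        passage="The Moon's diameter is roughly a quarter of the Earth's.",
        answer=False,
    ),
    BoolQExample(
        question="do penguins live in antarctica",
        passage="Several penguin species breed on the Antarctic continent.",
        answer=True,
    ),
    BoolQExample(
        question="is a tomato a vegetable in botanical terms",
        passage="Botanically, a tomato is a fruit because it develops from the ovary of a flower.",
        answer=False,
    ),
]

# Hypothetical model outputs: the third prediction is wrong.
predictions = [True, False, False, False]

score = accuracy(examples, predictions)
print(f"accuracy = {score:.2f}")
```

Because the answer is a plain boolean, scoring reduces to exact match, which keeps the result in the 0 to 1 range.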

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: language, reasoning. Language: en. Verified by llm-stats: no.
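The metadata gives a maximum score of 1, while the leaderboard shows percentages. A small sketch of the conversion, assuming the displayed percentage is simply the normalized score multiplied by 100 (the source does not state this explicitly); `MAX_SCORE` and `to_percent` are hypothetical names:

```python
MAX_SCORE = 1.0  # from the benchmark metadata ("Max score: 1")


def to_percent(score: float, max_score: float = MAX_SCORE) -> float:
    """Convert a raw score in [0, max_score] to a percentage with one decimal place."""
    if not 0 <= score <= max_score:
        raise ValueError(f"score must lie between 0 and {max_score}")
    return round(100 * score / max_score, 1)


print(to_percent(0.88))
```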

Leaderboard

  1. Hermes 3 70B self-reported llm-stats
    88.0%
  2. Gemma 2 27B self-reported llm-stats
    84.8%
  3. Phi-3.5-MoE-instruct self-reported llm-stats
    84.6%
  4. Gemma 2 9B self-reported llm-stats
    84.2%
  5. Gemma 3n E4B self-reported llm-stats
    81.6%
  6. [model name missing]
    81.6%
  7. Phi 4 Mini self-reported llm-stats
    81.2%
  8. Phi-3.5-mini-instruct self-reported llm-stats
    78.0%
  9. Gemma 3n E2B self-reported llm-stats
    76.4%
  10. [model name missing]
    76.4%