TriviaQA


A large-scale reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts, together with independently gathered evidence documents (six per question on average) that provide high-quality distant supervision for answering the questions. The dataset features relatively complex, compositional questions with considerable syntactic and lexical variability, and finding the answers requires cross-sentence reasoning.
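The relationship between question-answer pairs and question-answer-evidence triples can be sketched as a small data structure. This is a minimal illustration only: the class name `Triple`, its field names, and the toy records are assumptions made for this example and do not reflect the dataset's actual schema or contents.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One question-answer-evidence triple (hypothetical field names)."""
    question: str
    answer: str
    evidence: str  # text of a single evidence document


def evidence_per_question(triples):
    """Group evidence documents by their (question, answer) pair."""
    grouped = defaultdict(list)
    for t in triples:
        grouped[(t.question, t.answer)].append(t.evidence)
    return dict(grouped)


def mean_evidence_count(triples):
    """Average number of evidence documents per question-answer pair."""
    grouped = evidence_per_question(triples)
    return sum(len(docs) for docs in grouped.values()) / len(grouped)


# Toy records invented for illustration; they are not taken from TriviaQA.
triples = [
    Triple("Which planet is called the Red Planet?", "Mars", "doc A"),
    Triple("Which planet is called the Red Planet?", "Mars", "doc B"),
    Triple("What is the capital of France?", "Paris", "doc C"),
]

print(len(evidence_per_question(triples)))  # number of question-answer pairs
print(mean_evidence_count(triples))         # evidence documents per pair
```

Each question-answer pair fans out into one triple per evidence document, which is why the triple count is several times larger than the pair count.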

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, reasoning. Language: en. Verified by llm-stats: no.
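The metadata lists a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of the conversion follows, assuming the displayed percentage is simply the normalized score multiplied by 100; the helper name `to_percent` is hypothetical and not part of any llm-stats tooling.

```python
def to_percent(score, max_score=1.0):
    """Format a raw score as a percentage of max_score, one decimal place."""
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return f"{100 * score / max_score:.1f}%"


# Under the stated assumption, a normalized score of 0.851 is shown as 85.1%.
print(to_percent(0.851))
```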

Leaderboard

  1. Kimi K2 Base self-reported llm-stats
    85.1%
  2. Gemma 2 27B self-reported llm-stats
    83.7%
  3. Mistral Small 3.1 24B Base self-reported llm-stats
    80.5%
  4. Mistral Small 3.1 24B Instruct self-reported llm-stats
    80.5%
  5. Mistral Small 3 24B Base self-reported llm-stats
    80.3%
  6. Granite 3.3 8B Base self-reported llm-stats
    78.2%
  7. Gemma 2 9B self-reported llm-stats
    76.6%
  8. Ministral 3 (14B Base 2512) self-reported llm-stats
    74.9%
  9. Mistral Large 3 self-reported llm-stats
    74.9%
  10. Mistral NeMo Instruct self-reported llm-stats
    73.8%
  11. Gemma 3n E4B self-reported llm-stats
    70.2%
  12. 70.2%
  13. Ministral 8B Instruct self-reported llm-stats
    65.5%
  14. Gemma 3n E2B self-reported llm-stats
    60.8%
  15. 60.8%