TruthfulQA


TruthfulQA is a benchmark that measures whether language models generate truthful answers to questions. It comprises 817 questions spanning 38 categories, including health, law, finance, and politics. The questions are crafted so that some humans would answer falsely because of a false belief or misconception, which tests whether models avoid reproducing falsehoods learned from human-written text.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: finance, general, healthcare, legal, reasoning. Language: en. Verified by llm-stats: no.
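
The metadata gives a maximum score of 1, while the leaderboard reports percentages. A minimal sketch of how the two could relate, assuming the score is simply the fraction of the 817 questions judged truthful. The metric behind each self-reported number is not stated here and may differ; the function names and the sample count of 600 are hypothetical, used only for illustration.

```python
TOTAL_QUESTIONS = 817  # question count from the benchmark description


def truthful_score(num_truthful: int, total: int = TOTAL_QUESTIONS) -> float:
    """Return the fraction of questions answered truthfully, in [0, 1].

    Assumption: one point per truthful answer, normalized by the total.
    """
    if not 0 <= num_truthful <= total:
        raise ValueError("num_truthful must be between 0 and total")
    return num_truthful / total


def as_percent(score: float) -> str:
    """Format a score in [0, 1] as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# Hypothetical example: 600 of 817 answers judged truthful.
score = truthful_score(600)
print(as_percent(score))  # prints 73.4%
```

Under this assumption, a score of 1 corresponds to 100% on the leaderboard.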

Leaderboard

  1. MAI-Thinking-1 (self-reported, llm-stats): 88.0%
  2. Phi-3.5-MoE-instruct (self-reported, llm-stats): 77.5%
  3. Granite 3.3 8B Instruct (self-reported, llm-stats): 66.9%
  4. Phi 4 Mini (self-reported, llm-stats): 66.4%
  5. Phi-3.5-mini-instruct (self-reported, llm-stats): 64.0%
  6. Hermes 3 70B (self-reported, llm-stats): 63.3%
  7. Llama 3.1 Nemotron 70B Instruct (self-reported, llm-stats): 58.6%
  8. Qwen2.5 14B Instruct (self-reported, llm-stats): 58.4%
  9. Jamba 1.5 Large (self-reported, llm-stats): 58.3%
  10. IBM Granite 4.0 Tiny Preview (self-reported, llm-stats): 58.1%
  11. Qwen2.5 32B Instruct (self-reported, llm-stats): 57.8%
  12. Command R+ (self-reported, llm-stats): 56.3%
  13. Qwen2 72B Instruct (self-reported, llm-stats): 54.8%
  14. Qwen2.5-Coder 32B Instruct (self-reported, llm-stats): 54.2%
  15. Jamba 1.5 Mini (self-reported, llm-stats): 54.1%
  16. Granite 3.3 8B Base (self-reported, llm-stats): 52.1%
  17. Qwen2.5-Coder 7B Instruct (self-reported, llm-stats): 50.6%
  18. Mistral NeMo Instruct (self-reported, llm-stats): 50.3%