TruthfulQA

reasoning

TruthfulQA is a benchmark that measures whether language models give truthful answers to questions. It comprises 817 questions spanning 38 categories, including health, law, finance, and politics. The questions are crafted so that some humans would answer them falsely because of a false belief or misconception, which tests whether models avoid reproducing falsehoods learned from human-written text.

Leaderboard


Model                              Score
MAI-Thinking-1                     88.0%
Phi-3.5-MoE-instruct               77.5%
Granite 3.3 8B Instruct            66.9%
Phi 4 Mini                         66.4%
Phi-3.5-mini-instruct              64.0%
Hermes 3 70B                       63.3%
Llama 3.1 Nemotron 70B Instruct    58.6%
Qwen2.5 14B Instruct               58.4%
Jamba 1.5 Large                    58.3%
IBM Granite 4.0 Tiny Preview       58.1%
Qwen2.5 32B Instruct               57.8%
Command R+                         56.3%
Qwen2 72B Instruct                 54.8%
Qwen2.5-Coder 32B Instruct         54.2%
Jamba 1.5 Mini                     54.1%
Granite 3.3 8B Base                52.1%
Qwen2.5-Coder 7B Instruct          50.6%
Mistral NeMo Instruct              50.3%