SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. It contains 4,326 short, fact-seeking questions, collected adversarially and designed so that each has a single, indisputable answer. The questions span diverse topics, from science and technology to entertainment. The benchmark also measures calibration, that is, whether models know what they know.
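As a rough illustration of how short-form factuality scores of this kind can be summarized, the sketch below assumes each answer has already been graded as correct, incorrect, or not attempted, and derives overall accuracy, accuracy on attempted questions, and their harmonic mean (an F-score). The function name `summarize_grades` and the label strings are hypothetical, chosen only for this example. The page does not say which of these quantities the leaderboard percentages correspond to.

```python
from collections import Counter

# Hypothetical label strings for this sketch; a real grader may use other names.
ALLOWED_GRADES = {"correct", "incorrect", "not_attempted"}


def summarize_grades(grades):
    """Summarize per-question grades into three aggregate scores.

    grades: a non-empty list of strings, each one of ALLOWED_GRADES.
    Returns a dict with overall accuracy, accuracy on attempted
    questions, and the harmonic mean of the two (F-score).
    """
    if not grades:
        raise ValueError("grades must not be empty")
    unknown = set(grades) - ALLOWED_GRADES
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")

    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    # Fraction of all questions answered correctly.
    overall_correct = correct / total
    # Fraction of attempted questions answered correctly; abstaining
    # does not lower this number, which is why both are reported.
    correct_given_attempted = correct / attempted if attempted else 0.0

    # Harmonic mean of the two rates above.
    denom = overall_correct + correct_given_attempted
    f_score = 2 * overall_correct * correct_given_attempted / denom if denom else 0.0

    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


if __name__ == "__main__":
    example = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
    print(summarize_grades(example))
```

Reporting accuracy on attempted questions alongside overall accuracy separates a model that declines when unsure from one that guesses, which is the "know what they know" aspect described above.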

Leaderboard

Showing 20 of 46 results

Model                            Score
DeepSeek-V3.2-Exp                97.1%
Grok 4 Fast                      95.0%
DeepSeek-V3.1                    93.4%
DeepSeek-R1-0528                 92.3%
ERNIE 5.0                        75.0%
Gemini 3 Pro                     72.1%
Gemini 3 Flash                   68.7%
GPT-4.5                          62.5%
DeepSeek-V4-Pro-Max              57.9%
Qwen3 VL 32B Thinking            55.4%
Qwen3-235B-A22B-Instruct-2507    54.3%
Gemini 2.5 Pro Preview 06-05     54.0%
Qwen3 VL 235B A22B Instruct      51.9%
Gemini 2.5 Pro                   50.8%
Qwen3 VL 8B Thinking             49.6%
Qwen3 VL 4B Instruct             48.0%
o1                               47.0%
Qwen3 VL 235B A22B Thinking      44.4%
Gemini 3.1 Flash-Lite            43.3%
o1-preview                       42.4%