SimpleQA

SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. It contains 4,326 short, fact-seeking questions that were collected adversarially and written so that each has a single, indisputable answer. The questions span topics from science and technology to entertainment. The benchmark is also used to assess calibration, that is, whether models know what they know.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: factuality, general, reasoning. Language: en. Verified by llm-stats: no.
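
As a rough illustration of how a score on the 0 to 1 scale above maps to the percentages shown in the leaderboard, the sketch below aggregates per-question grades into summary metrics. The metadata on this page does not describe the grading scheme, so the three grade labels (correct, incorrect, not attempted) and the function name `summarize` are assumptions made for illustration, not the llm-stats or OpenAI implementation.

```python
from collections import Counter

# Hypothetical grade labels: the metadata above does not specify them.
CORRECT = "correct"
INCORRECT = "incorrect"
NOT_ATTEMPTED = "not_attempted"


def summarize(grades):
    """Aggregate per-question grades into metrics on a 0-1 scale."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    return {
        # Fraction of all questions answered correctly (max score: 1).
        "overall_correct": counts[CORRECT] / total if total else 0.0,
        # Accuracy restricted to questions the model chose to answer.
        "correct_given_attempted": counts[CORRECT] / attempted if attempted else 0.0,
        # Fraction of questions the model declined to answer.
        "not_attempted_rate": counts[NOT_ATTEMPTED] / total if total else 0.0,
    }


def as_percent(score):
    """Format a 0-1 score the way the leaderboard displays it."""
    return f"{score:.1%}"
```

For example, `summarize(["correct"] * 3 + ["incorrect", "not_attempted"])` gives an overall score of 0.6, which `as_percent` renders as `60.0%`. Reporting accuracy on attempted questions separately from the abstention rate is one simple way to see whether a model declines to answer when it is unsure.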

Leaderboard

All scores below are self-reported, as listed by llm-stats.

  1. DeepSeek-V3.2-Exp: 97.1%
  2. Grok 4 Fast: 95.0%
  3. DeepSeek-V3.1: 93.4%
  4. DeepSeek-R1-0528: 92.3%
  5. ERNIE 5.0: 75.0%
  6. Gemini 3 Pro: 72.1%
  7. Gemini 3 Flash: 68.7%
  8. GPT-4.5: 62.5%
  9. DeepSeek-V4-Pro-Max: 57.9%
  10. Qwen3 VL 32B Thinking: 55.4%
  11. Qwen3-235B-A22B-Instruct-2507: 54.3%
  12. Gemini 2.5 Pro Preview 06-05: 54.0%
  13. Qwen3 VL 235B A22B Instruct: 51.9%
  14. Gemini 2.5 Pro: 50.8%
  15. Qwen3 VL 8B Thinking: 49.6%
  16. Qwen3 VL 4B Instruct: 48.0%
  17. o1: 47.0%
  18. Qwen3 VL 235B A22B Thinking: 44.4%
  19. Gemini 3.1 Flash-Lite: 43.3%
  20. o1-preview: 42.4%