SuperGPQA

SuperGPQA is a comprehensive benchmark that evaluates large language models across 285 graduate-level academic disciplines. It contains 25,957 questions spanning 13 broad disciplinary areas, including Engineering, Medicine, Science, and Law, and extends to specialized fields in light industry, agriculture, and service-oriented domains. The questions were built with a Human-LLM collaborative filtering mechanism involving more than 80 expert annotators, and are designed to assess graduate-level knowledge and reasoning.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (the leaderboard displays scores as percentages). Categories: chemistry, economics, finance, general, healthcare, legal, math, physics, reasoning. Language: English (en). Verified by llm-stats: no.
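The metadata lists a maximum score of 1, while the leaderboard shows percentages. The following is a minimal sketch of that conversion, assuming the stored score is a fraction between 0 and the maximum score and that the displayed value is rounded to one decimal place. `to_percent` is a hypothetical helper written for illustration, not part of any llm-stats API.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a normalized benchmark score as a one-decimal percentage.

    Assumption: `score` lies in [0, max_score]; the rounding to one
    decimal place mirrors how the leaderboard entries are displayed.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0.0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return f"{100.0 * score / max_score:.1f}%"


if __name__ == "__main__":
    # A normalized score of 0.736 corresponds to the top leaderboard entry.
    print(to_percent(0.736))
```

Keeping the stored value normalized and formatting it only for display makes scores comparable across benchmarks that use different maximum scores.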

Leaderboard

All scores below are self-reported (source: llm-stats).

  1. Qwen3.7 Max: 73.6%
  2. Qwen3.6 Plus: 71.6%
  3. Qwen3.5-397B-A17B: 70.4%
  4. Qwen3.5-122B-A10B: 67.1%
  5. Qwen3.6-27B: 66.0%
  6. Qwen3.5-27B: 65.6%
  7. Qwen3-235B-A22B-Thinking-2507: 64.9%
  8. Qwen3 VL 235B A22B Thinking: 64.3%
  9. Qwen3.5-35B-A3B: 63.4%
  10. Qwen3-235B-A22B-Instruct-2507: 62.6%
  11. Qwen3-Next-80B-A3B-Thinking: 60.8%
  12. Qwen3 VL 235B A22B Instruct: 60.4%
  13. Qwen3 VL 32B Thinking: 59.0%
  14. Qwen3-Next-80B-A3B-Instruct: 58.8%
  15. Kimi K2 Instruct: 57.2%
  16. Kimi K2-Instruct-0905: 57.2%
  17. Qwen3 VL 30B A3B Thinking: 56.4%
  18. Qwen3 VL 32B Instruct: 54.6%
  19. Qwen3 VL 30B A3B Instruct: 53.1%
  20. Qwen3 VL 8B Thinking: 51.2%