Humanity's Last Exam


Humanity's Last Exam (HLE) is a multimodal academic benchmark of 2,500 questions spanning mathematics, the humanities, and the natural sciences. It is designed to test LLM capabilities at the frontier of human knowledge, using questions with unambiguous, verifiable solutions.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: math, reasoning, vision. Language: en. Verified by llm-stats: no.
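The metadata gives a maximum score of 1, while leaderboard results are shown as percentages. The sketch below shows one way to relate the two, assuming (this is not confirmed by the source) that a displayed percentage is simply the raw score divided by the maximum score, times 100. The function name `to_percent` is hypothetical and is not part of any llm-stats API.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of the maximum score.

    Assumption: the leaderboard percentage is score / max_score * 100.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    # One decimal place, matching the precision shown on the leaderboard.
    return f"{100 * score / max_score:.1f}%"


if __name__ == "__main__":
    print(to_percent(0.647))  # prints 64.7%
```

Under that assumption, a raw score of 0.647 on the 0 to 1 scale corresponds to a leaderboard entry of 64.7%.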

Leaderboard

All scores below are self-reported, as listed by llm-stats.

  1. Claude Mythos Preview: 64.7%
  2. Muse Spark: 58.4%
  3. Claude Opus 4.8: 57.9%
  4. GPT-5.5 Pro: 57.2%
  5. Claude Opus 4.7: 54.7%
  6. Claude Opus 4.6: 53.1%
  7. GLM-5.1: 52.3%
  8. GPT-5.5: 52.2%
  9. Gemini 3.1 Pro: 51.4%
  10. Kimi K2-Thinking-0905: 51.0%
  11. Grok-4 Heavy: 50.7%
  12. Kimi K2.5: 50.2%
  13. Claude Sonnet 4.6: 49.0%
  14. Qwen3.5-27B: 48.5%
  15. DeepSeek-V4-Pro-Max: 48.2%
  16. Qwen3.5-122B-A10B: 47.5%
  17. Qwen3.5-35B-A3B: 47.4%
  18. Gemini 3 Pro: 45.8%
  19. DeepSeek-V4-Flash-Max: 45.1%
  20. Gemini 3 Flash: 43.5%