PolyMATH


PolyMATH is a challenging multi-modal mathematical reasoning benchmark designed to evaluate the general cognitive reasoning abilities of Multi-modal Large Language Models (MLLMs). The benchmark comprises 5,000 manually collected, high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: math, multimodal, reasoning, spatial_reasoning, vision. Language: en. Verified by llm-stats: no.
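
The metadata lists a maximum score of 1, while the leaderboard reports percentages. The sketch below shows one way to relate the two. It assumes the displayed percentages are raw scores on the 0 to 1 scale multiplied by 100, which the source does not state explicitly. `to_percent` is a hypothetical helper for illustration, not part of any llm-stats API.

```python
def to_percent(score: float, max_score: float = 1.0) -> float:
    """Express a raw score as a percentage of max_score, rounded to one decimal.

    Assumption: leaderboard percentages are raw scores scaled by 100 / max_score.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return round(100 * score / max_score, 1)


if __name__ == "__main__":
    # 0.865 is an illustrative raw score on the 0-1 scale.
    print(f"{to_percent(0.865)}%")
```

Rounding to one decimal matches the precision shown in the leaderboard and avoids floating-point noise in the displayed value.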

Leaderboard

All scores below are self-reported; source: llm-stats.

  1. Qwen3.7 Max: 86.5%
  2. Qwen3.6 Plus: 77.4%
  3. Qwen3.5-397B-A17B: 73.3%
  4. Qwen3.5-27B: 71.2%
  5. Qwen3.5-122B-A10B: 68.9%
  6. Qwen3.5-35B-A3B: 64.4%
  7. Qwen3-235B-A22B-Thinking-2507: 60.1%
  8. Qwen3-Next-80B-A3B-Thinking: 56.3%
  9. Qwen3 VL 32B Thinking: 52.0%
  10. Qwen3 VL 30B A3B Thinking: 51.7%
  11. Qwen3-235B-A22B-Instruct-2507: 50.2%
  12. Qwen3 VL 8B Thinking: 47.5%
  13. Qwen3-Next-80B-A3B-Instruct: 45.9%
  14. Qwen3 VL 4B Thinking: 44.6%
  15. Qwen3 VL 30B A3B Instruct: 44.3%
  16. Qwen3 VL 32B Instruct: 40.5%
  17. Qwen3 VL 8B Instruct: 30.4%
  18. Qwen3 VL 4B Instruct: 28.8%