MathVista


MathVista evaluates the mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples drawn from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA). Together they combine challenges from diverse mathematical and visual tasks to assess whether a model can understand complex figures and reason rigorously about them.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: math, multimodal, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  All scores below are self-reported (source: llm-stats).

  1. o3: 86.8%
  2. o4-mini: 84.3%
  3. Kimi-k1.5: 74.9%
  4. Llama 4 Maverick: 73.7%
  5. GPT-4.1 mini: 73.1%
  6. GPT-4.5: 72.3%
  7. GPT-4.1: 72.2%
  8. o1: 71.8%
  9. QvQ-72B-Preview: 71.4%
  10. Llama 4 Scout: 70.7%
  11. Pixtral Large: 69.4%
  12. Grok-2: 69.0%
  13. Gemini 1.5 Pro: 68.1%
  14. Grok-2 mini: 68.1%
  15. Qwen2.5-Omni-7B: 67.9%
  16. Claude 3.5 Sonnet: 67.7%
  17. Mistral Small 3.2 24B Instruct: 67.1%
  18. Gemini 1.5 Flash: 65.8%
  19. GPT-4o: 63.8%
  20. DeepSeek VL2: 62.8%
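
The metadata under Methodology lists a max score of 1, while the leaderboard shows percentages. The sketch below is one way to reconcile the two, assuming (this is not stated in the source) that each percentage is simply the normalized score multiplied by 100. It uses a handful of entries copied from the leaderboard above; the function names `to_fraction` and `gap_to_leader` are hypothetical helpers, not part of any llm-stats API.

```python
# Assumption: displayed percentage = normalized score * 100.
MAX_SCORE = 1.0  # "Max score: 1" from the Methodology metadata

# A few entries copied from the leaderboard (percentages as displayed).
leaderboard_pct = {
    "o3": 86.8,
    "o4-mini": 84.3,
    "Kimi-k1.5": 74.9,
    "GPT-4o": 63.8,
    "DeepSeek VL2": 62.8,
}


def to_fraction(pct: float, max_score: float = MAX_SCORE) -> float:
    """Convert a displayed percentage to the 0..max_score scale."""
    return round(pct / 100.0 * max_score, 4)


def gap_to_leader(scores: dict[str, float]) -> dict[str, float]:
    """Percentage-point gap between each model and the top score."""
    best = max(scores.values())
    return {name: round(best - pct, 1) for name, pct in scores.items()}


normalized = {name: to_fraction(pct) for name, pct in leaderboard_pct.items()}
gaps = gap_to_leader(leaderboard_pct)

for name, pct in leaderboard_pct.items():
    print(f"{name}: {pct}% -> {normalized[name]} (gap to leader: {gaps[name]} pts)")
```

Because every score listed is self-reported, gaps computed this way compare published numbers rather than results from a single controlled evaluation.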