ChartQA


ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1 (leaderboard scores are shown as percentages of this maximum). Categories: multimodal, reasoning, vision. Language: English. Verified by llm-stats: no.
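The metadata gives a maximum score of 1, while the leaderboard lists percentages. The following is a minimal sketch of how a per-question accuracy in the range 0 to 1 could be computed and rendered as a percentage. The matching rule is an assumption, since this page does not state one: the sketch uses a relaxed match of the kind commonly applied to ChartQA, where numeric answers count as correct within a 5% relative tolerance and all other answers need a case-insensitive exact match. The function names `is_correct`, `score`, and `as_percent` are hypothetical and not part of any llm-stats or ChartQA tooling.

```python
from typing import Optional, Sequence


def _to_float(text: str) -> Optional[float]:
    """Parse an answer as a number, ignoring a trailing '%' and thousands separators."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None


def is_correct(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Assumed relaxed match: numeric answers within a relative tolerance,
    everything else by case-insensitive exact match."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            # Relative error is undefined for a zero target; require an exact zero.
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def score(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of correct answers, so the maximum possible score is 1."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    correct = sum(is_correct(p, t) for p, t in zip(predictions, targets))
    return correct / len(targets)


def as_percent(value: float) -> str:
    """Render a score in [0, 1] the way the leaderboard displays it."""
    return f"{value * 100:.1f}%"
```

For example, `as_percent(score(predictions, targets))` would turn a list of model answers and reference answers into a leaderboard-style figure. Self-reported results may have been produced with different prompts or matching rules, so this sketch is not a way to reproduce the numbers listed here.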

Leaderboard

Scores are self-reported (source: llm-stats).

  1. Claude 3.5 Sonnet: 90.8%
  2. Llama 4 Maverick: 90.0%
  3. Qwen2.5 VL 72B Instruct: 89.5%
  4. Nova Pro: 89.2%
  5. Llama 4 Scout: 88.8%
  6. Qwen2-VL-72B-Instruct: 88.3%
  7. Pixtral Large: 88.1%
  8. Mistral Small 3.2 24B Instruct: 87.4%
  9. Qwen2.5 VL 7B Instruct: 87.3%
  10. Nova Lite: 86.8%
  11. DeepSeek VL2: 86.0%
  12. GPT-4o: 85.7%
  13. Llama 3.2 90B Instruct: 85.5%
  14. Qwen2.5-Omni-7B: 85.3%
  15. DeepSeek VL2 Small: 84.5%
  16. Llama 3.2 11B Instruct: 83.4%
  17. Phi-3.5-vision-instruct: 81.8%
  18. Pixtral-12B: 81.8%
  19. Phi-4-multimodal-instruct: 81.4%
  20. DeepSeek VL2 Tiny: 81.0%