CharXiv-D


CharXiv-D is the descriptive-question subset of the CharXiv benchmark, designed to assess how well multimodal large language models extract basic information from scientific charts. Its questions cover information extraction, enumeration, pattern recognition, and counting across 2,323 diverse charts from arXiv papers, all curated and verified by human experts.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, reasoning, structured_output, vision. Language: en. Verified by llm-stats: no.
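
The metadata gives a maximum score of 1, while the leaderboard reports percentages. A minimal sketch of that conversion follows, assuming the displayed percentage is simply the raw score divided by the maximum score; the function name `to_percent` is hypothetical and not part of any llm-stats API.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of the maximum score.

    Assumption: the leaderboard percentage is score / max_score * 100,
    rounded to one decimal place.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{score / max_score * 100:.1f}%"


# Example: a raw score of 0.905 on a 0-to-1 scale.
print(to_percent(0.905))
```

Under this assumption, a raw score of 0.905 corresponds to the 90.5% shown for the top entry.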

Leaderboard

All scores are self-reported, as listed by llm-stats.

  1. Qwen3 VL 32B Instruct: 90.5%
  2. Qwen3 VL 32B Thinking: 90.2%
  3. GPT-4.5: 90.0%
  4. GPT-4.1 mini: 88.4%
  5. GPT-4.1: 87.9%
  6. Qwen3 VL 30B A3B Thinking: 86.9%
  7. Qwen3 VL 8B Thinking: 85.9%
  8. Qwen3 VL 30B A3B Instruct: 85.5%
  9. GPT-4o: 85.3%
  10. Qwen3 VL 4B Thinking: 83.9%
  11. Qwen3 VL 8B Instruct: 83.0%
  12. Qwen3 VL 4B Instruct: 76.2%
  13. GPT-4.1 nano: 73.9%