CharXiv-R

CharXiv-R is the reasoning component of the CharXiv benchmark. Its questions require synthesizing information across the visual elements of a chart, and it evaluates how well multimodal large language models understand and reason about scientific charts taken from arXiv papers.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.
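
The metadata gives a maximum score of 1, while the leaderboard shows percentages. Assuming the displayed values are the raw score rescaled to a 0-100 range (the page does not state this explicitly), the conversion might look like the minimal sketch below. The helper name `to_percent` is hypothetical and is not part of any llm-stats tooling.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw benchmark score as a percentage string with one decimal.

    Assumes the score lies on a 0..max_score scale (here max_score = 1,
    as listed in the metadata). Hypothetical helper for illustration only.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{100 * score / max_score:.1f}%"


if __name__ == "__main__":
    # A raw score of 0.932 on a 0..1 scale is shown as a percentage.
    print(to_percent(0.932))
```

Under this assumption, a leaderboard entry shown as 93.2% corresponds to a raw score of 0.932.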

Leaderboard

  1. Claude Mythos Preview: 93.2% (self-reported, llm-stats)
  2. Claude Opus 4.7: 91.0% (self-reported, llm-stats)
  3. Claude Opus 4.8: 89.9% (self-reported, llm-stats)
  4. Kimi K2.6: 86.7% (self-reported, llm-stats)
  5. Muse Spark: 86.4% (self-reported, llm-stats)
  6. Gemini 3.5 Flash: 84.2% (self-reported, llm-stats)
  7. GPT-5.2: 82.1% (self-reported, llm-stats)
  8. GPT-5.5 Instant: 81.6% (self-reported, llm-stats)
  9. Qwen3.6 Plus: 81.5% (self-reported, llm-stats)
  10. Gemini 3 Pro: 81.4% (self-reported, llm-stats)
  11. GPT-5: 81.1% (self-reported, llm-stats)
  12. Gemini 3 Flash: 80.3% (self-reported, llm-stats)
  13. Qwen3.5-27B: 79.5% (self-reported, llm-stats)
  14. o3: 78.6% (self-reported, llm-stats)
  15. Qwen3.6-27B: 78.4% (self-reported, llm-stats)
  16. Kimi K2.5: 77.5% (self-reported, llm-stats)
  17. Qwen3.5-35B-A3B: 77.5% (self-reported, llm-stats)
  18. Claude Opus 4.6: 77.4% (self-reported, llm-stats)
  19. Qwen3.5-122B-A10B: 77.2% (self-reported, llm-stats)
  20. Gemini 3.1 Flash-Lite: 73.2% (self-reported, llm-stats)