CharXiv-D

reasoning

CharXiv-D is the descriptive questions subset of the CharXiv benchmark, designed to assess multimodal large language models' ability to extract basic information from scientific charts. It contains descriptive questions covering information extraction, enumeration, pattern recognition, and counting across 2,323 diverse charts from arXiv papers, all curated and verified by human experts.

Leaderboard


Model                        Score
Qwen3 VL 32B Instruct        90.5%
Qwen3 VL 32B Thinking        90.2%
GPT-4.5                      90.0%
GPT-4.1 mini                 88.4%
GPT-4.1                      87.9%
Qwen3 VL 30B A3B Thinking    86.9%
Qwen3 VL 8B Thinking         85.9%
Qwen3 VL 30B A3B Instruct    85.5%
GPT-4o                       85.3%
Qwen3 VL 4B Thinking         83.9%
Qwen3 VL 8B Instruct         83.0%
Qwen3 VL 4B Instruct         76.2%
GPT-4.1 nano                 73.9%