DocVQA

multimodal

DocVQA is a dataset for visual question answering on document images, containing 50,000 questions defined on more than 12,000 document images. The benchmark tests whether a model can understand a document's layout and content and retrieve the information needed to answer questions about the image.
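This page does not say how the leaderboard percentages are computed. DocVQA results are commonly reported as ANLS (Average Normalized Levenshtein Similarity), so the sketch below assumes that metric, with the usual threshold of 0.5 and case-insensitive matching. The function names `levenshtein` and `anls` are illustrative and are not taken from any official evaluation script.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best per-answer similarity.

    Assumption: similarity is 1 - NL when the normalized edit
    distance NL is below `threshold`, and 0 otherwise.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        p = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            a = answer.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under these assumptions, a leaderboard figure such as 96.4% would correspond to `anls(...)` returning roughly 0.964 over the test questions. The threshold means a nearly correct answer (for example, one wrong OCR character) earns partial credit, while an unrelated answer earns none.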

Leaderboard

Showing the top 20 of 26 results, sorted by score.

Model                             Score
Qwen2.5 VL 72B Instruct           96.4%
Qwen2.5 VL 7B Instruct            95.7%
Claude 3.5 Sonnet                 95.2%
Qwen2.5-Omni-7B                   95.2%
Mistral Small 3.2 24B Instruct    94.9%
Qwen2.5 VL 32B Instruct           94.8%
Llama 4 Maverick                  94.4%
Llama 4 Scout                     94.4%
Grok-2                            93.6%
Nova Pro                          93.5%
DeepSeek VL2                      93.3%
Pixtral Large                     93.3%
Phi-4-multimodal-instruct         93.2%
Grok-2 mini                       93.2%
GPT-4o                            92.8%
Nova Lite                         92.4%
DeepSeek VL2 Small                92.3%
Pixtral-12B                       90.7%
Llama 3.2 90B Instruct            90.1%
DeepSeek VL2 Tiny                 88.9%