TextVQA

multimodal

TextVQA contains 45,336 questions on 28,408 images, each of which requires reasoning about text in the image to answer. It was introduced to benchmark VQA models' ability to read and reason about text within images, particularly for assistive technologies for visually impaired users. The dataset addresses a gap in earlier VQA datasets, which either had few text-based questions or were too small.

Leaderboard


Model                        Score
Qwen2-VL-72B-Instruct        85.5%
Qwen2.5 VL 7B Instruct       84.9%
Qwen2.5-Omni-7B              84.4%
DeepSeek VL2                 84.2%
DeepSeek VL2 Small           83.4%
Nova Pro                     81.5%
DeepSeek VL2 Tiny            80.7%
Nova Lite                    80.2%
Grok-1.5V                    78.1%
Phi-4-multimodal-instruct    75.6%
Llama 3.2 90B Instruct       73.5%
Phi-3.5-vision-instruct      72.0%
Gemma 3 12B                  67.7%
Gemma 3 27B                  65.1%
Gemma 3 4B                   57.8%