TextVQA


TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. It was introduced to benchmark VQA models' ability to read and reason about text within images, particularly for assistive technologies for visually impaired users. The dataset addresses a gap: existing VQA datasets either had few text-based questions or were too small.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: image_to_text, multimodal, vision. Language: en. Verified by llm-stats: no.
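
The metadata above gives a maximum score of 1 but does not say how the score is computed. As a rough sketch only, assuming the soft accuracy commonly used for VQA-style benchmarks (each question has several human answers, and a prediction earns min(matches / 3, 1)), a score on that 0 to 1 scale could be computed as follows. The function names and the lowercase/strip normalization are illustrative assumptions, not taken from llm-stats or from any official evaluation script.

```python
def vqa_soft_accuracy(prediction, human_answers):
    """Simplified VQA-style soft accuracy for one question (assumed metric)."""
    # Assumption: light normalization (strip + lowercase); real evaluators
    # typically apply more elaborate answer normalization.
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    # Full credit once at least 3 annotators agree with the prediction.
    return min(matches / 3.0, 1.0)


def benchmark_score(predictions, references):
    """Mean per-question accuracy, on the 0..1 scale (max score: 1)."""
    scores = [vqa_soft_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)
```

Under this assumption, a leaderboard entry shown as a percentage is the 0 to 1 score multiplied by 100.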

Leaderboard

All scores below are self-reported (source: llm-stats).

  1. Qwen2-VL-72B-Instruct: 85.5%
  2. Qwen2.5 VL 7B Instruct: 84.9%
  3. Qwen2.5-Omni-7B: 84.4%
  4. DeepSeek VL2: 84.2%
  5. DeepSeek VL2 Small: 83.4%
  6. Nova Pro: 81.5%
  7. DeepSeek VL2 Tiny: 80.7%
  8. Nova Lite: 80.2%
  9. Grok-1.5V: 78.1%
  10. Phi-4-multimodal-instruct: 75.6%
  11. Llama 3.2 90B Instruct: 73.5%
  12. Phi-3.5-vision-instruct: 72.0%
  13. Gemma 3 12B: 67.7%
  14. Gemma 3 27B: 65.1%
  15. Gemma 3 4B: 57.8%