DocVQA


A dataset for visual question answering on document images, containing 50,000 questions defined on more than 12,000 document images. The benchmark tests a model's ability to understand document structure and content: answering a question requires comprehending the page layout and retrieving the relevant information from the image.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1 (the leaderboard shows scores as percentages of this maximum). Categories: image_to_text, multimodal, vision. Language: en. Verified by llm-stats: no.
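The metadata above does not name the scoring metric. DocVQA results are commonly reported as ANLS (Average Normalized Levenshtein Similarity), which lies in the range 0 to 1 and so matches the stated maximum score of 1. Whether every self-reported number on this page uses ANLS is an assumption. Under that assumption, a minimal sketch of the metric follows; the function names and the 0.5 threshold default are illustrative choices, not taken from this page.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best similarity against any accepted answer.

    predictions: list of predicted answer strings, one per question.
    references:  list of lists of accepted answer strings, one list per question.
    A similarity counts only if the normalized edit distance is below `threshold`
    (assumed 0.5 here); otherwise the question scores 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            distance = levenshtein(p, a) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)


# Hypothetical predictions and accepted answers, for illustration only.
preds = ["Invoice", "12/05/1998"]
refs = [["invoice"], ["12/05/1999", "12 May 1999"]]
print(f"{anls(preds, refs):.1%}")  # shown as a percentage, like the leaderboard
```

The threshold keeps near-misses, such as a single misread character, from scoring zero, while answers that are mostly wrong receive no partial credit.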

Leaderboard

  1. Qwen2.5 VL 72B Instruct: 96.4% (self-reported, llm-stats)
  2. Qwen2.5 VL 7B Instruct: 95.7% (self-reported, llm-stats)
  3. Claude 3.5 Sonnet: 95.2% (self-reported, llm-stats)
  4. Qwen2.5-Omni-7B: 95.2% (self-reported, llm-stats)
  5. Mistral Small 3.2 24B Instruct: 94.9% (self-reported, llm-stats)
  6. Qwen2.5 VL 32B Instruct: 94.8% (self-reported, llm-stats)
  7. Llama 4 Maverick: 94.4% (self-reported, llm-stats)
  8. Llama 4 Scout: 94.4% (self-reported, llm-stats)
  9. Grok-2: 93.6% (self-reported, llm-stats)
  10. Nova Pro: 93.5% (self-reported, llm-stats)
  11. DeepSeek VL2: 93.3% (self-reported, llm-stats)
  12. Pixtral Large: 93.3% (self-reported, llm-stats)
  13. Phi-4-multimodal-instruct: 93.2% (self-reported, llm-stats)
  14. Grok-2 mini: 93.2% (self-reported, llm-stats)
  15. GPT-4o: 92.8% (self-reported, llm-stats)
  16. Nova Lite: 92.4% (self-reported, llm-stats)
  17. DeepSeek VL2 Small: 92.3% (self-reported, llm-stats)
  18. Pixtral-12B: 90.7% (self-reported, llm-stats)
  19. Llama 3.2 90B Instruct: 90.1% (self-reported, llm-stats)
  20. DeepSeek VL2 Tiny: 88.9% (self-reported, llm-stats)
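
Several entries share a score (positions 3 and 4 at 95.2%, for example), yet the list numbers them sequentially. For readers who want tied models to share a rank, here is a minimal sketch of competition-style ranking over the first five rows copied from the list above. The function name `competition_rank` is hypothetical, and the tie-handling convention is an illustrative choice, not something the source specifies.

```python
# First five rows copied from the leaderboard: (model, score in percent).
rows = [
    ("Qwen2.5 VL 72B Instruct", 96.4),
    ("Qwen2.5 VL 7B Instruct", 95.7),
    ("Claude 3.5 Sonnet", 95.2),
    ("Qwen2.5-Omni-7B", 95.2),
    ("Mistral Small 3.2 24B Instruct", 94.9),
]


def competition_rank(rows):
    """Return (rank, model, score) tuples where equal scores share a rank.

    After a tie the next rank skips ahead (1, 2, 3, 3, 5), as in
    standard competition ranking.
    """
    # sorted() is stable, so tied rows keep their original relative order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position
        ranked.append((rank, model, score))
    return ranked


ranked = competition_rank(rows)
for rank, model, score in ranked:
    print(f"{rank:>2}. {model}: {score:.1f}%")
```

Because all scores here are self-reported and unverified, a shared rank is arguably a more faithful reading of a 0.0-point difference than the sequential numbering.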