DocVQA (test)

DocVQA is a visual question answering benchmark comprising 50,000 questions defined on more than 12,000 document images. It tests whether a model can use a document's structure and content to answer questions about letters, memos, notes, reports, and other document types drawn from the UCSF Industry Documents Library.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, vision. Language: en. Verified by llm-stats: no.
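
The metadata above gives a maximum score of 1 but does not name the metric. The official DocVQA evaluation uses Average Normalized Levenshtein Similarity (ANLS), which gives partial credit for near-miss answers and zero credit once the normalized edit distance reaches 0.5. The following is a minimal sketch of that metric, not the official scoring script. The function names `levenshtein` and `anls`, the lowercasing and whitespace stripping, and the sample inputs are illustrative assumptions, and the self-reported scores listed below may have been computed with different preprocessing.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Mean per-question score in [0, 1].

    predictions: list of predicted answer strings.
    references: list of lists; references[i] holds the accepted answers
    for question i. Normalization (strip + lowercase) is an assumption.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        pred_norm = pred.strip().lower()
        best = 0.0
        for gold in golds:
            gold_norm = gold.strip().lower()
            longest = max(len(pred_norm), len(gold_norm))
            distance = levenshtein(pred_norm, gold_norm) / longest if longest else 0.0
            # Answers at or beyond the threshold distance earn no credit.
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)


# Hypothetical example: one exact match (ignoring case) and one complete miss.
score = anls(["Paris", "cat"], [["paris", "Paris, France"], ["dog"]])
print(f"{score * 100:.1f}%")
```

Multiplying the result by 100, as in the last line, gives a figure on the same percentage scale as the leaderboard.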

Leaderboard

  1. Qwen3 VL 235B A22B Instruct: 97.1% (self-reported, llm-stats)
  2. Qwen3 VL 32B Instruct: 96.9% (self-reported, llm-stats)
  3. Qwen2-VL-72B-Instruct: 96.5% (self-reported, llm-stats)
  4. Qwen3 VL 235B A22B Thinking: 96.5% (self-reported, llm-stats)
  5. Qwen3 VL 32B Thinking: 96.1% (self-reported, llm-stats)
  6. Qwen3 VL 8B Instruct: 96.1% (self-reported, llm-stats)
  7. Qwen3 VL 4B Instruct: 95.3% (self-reported, llm-stats)
  8. Qwen3 VL 8B Thinking: 95.3% (self-reported, llm-stats)
  9. Qwen3 VL 30B A3B Instruct: 95.0% (self-reported, llm-stats)
  10. Qwen3 VL 30B A3B Thinking: 95.0% (self-reported, llm-stats)
  11. Qwen3 VL 4B Thinking: 94.2% (self-reported, llm-stats)