MTVQA

MTVQA (Multilingual Text-Centric Visual Question Answering) is the first benchmark featuring high-quality human expert annotations across 9 diverse languages. It comprises 6,778 question-answer pairs over 2,116 images and addresses the visual-textual misalignment problem in multilingual text-centric VQA.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, text-to-image, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen2-VL-72B-Instruct: 30.9% (self-reported, via llm-stats)