MTVQA
multimodal official site →
MTVQA (Multilingual Text-Centric Visual Question Answering) is the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. It addresses visual-textual misalignment problems in multilingual text-centric VQA.
Methodology
Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, text-to-image, vision. Language: en. Verified by llm-stats: no.