MTVQA

multimodal

MTVQA (Multilingual Text-Centric Visual Question Answering) is the first benchmark to provide high-quality human expert annotations across 9 diverse languages. It comprises 6,778 question-answer pairs over 2,116 images and addresses the visual-textual misalignment problem in multilingual text-centric VQA.

Leaderboard

Qwen2-VL-72B-Instruct: 30.9%
