VQAv2 (test)
reasoning official site →
VQA v2.0 (Visual Question Answering v2.0) is a balanced dataset designed to counter language priors in visual question answering. It consists of complementary image pairs where the same question yields different answers, forcing models to rely on visual understanding rather than language bias. The dataset contains 1,105,904 questions across 204,721 COCO images, requiring understanding of vision, language, and commonsense knowledge.
Methodology
Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: image_to_text, multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.