VQAv2 (val)


VQAv2 is a balanced Visual Question Answering dataset of open-ended questions about images. Answering them requires understanding of vision, language, and commonsense knowledge. VQAv2 addresses the language-bias problems of the original VQA dataset by collecting complementary images, so that every question is associated with a pair of similar images that yield different answers. A model therefore has to understand the visual content rather than rely on language priors.
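The effect of balancing can be illustrated with a small sketch. The data below is made up for illustration, `blind_model` is a hypothetical baseline that ignores the image, and exact-match scoring is a simplification rather than the official VQA evaluation metric. Under those assumptions, a model that answers from the question alone can be right on at most one image of each complementary pair:

```python
# Toy, made-up balanced pairs: (question, image_id, answer).
# Each question appears with two similar images whose answers differ.
TOY_PAIRS = [
    ("Is the umbrella open?", "img_a", "yes"),
    ("Is the umbrella open?", "img_b", "no"),
    ("What color is the bus?", "img_c", "red"),
    ("What color is the bus?", "img_d", "blue"),
]


def blind_model(question, image_id):
    """Hypothetical language-prior baseline: the image is never inspected."""
    priors = {
        "Is the umbrella open?": "yes",
        "What color is the bus?": "red",
    }
    return priors[question]


def accuracy(model, examples):
    """Fraction of examples answered exactly right (simplified scoring)."""
    correct = sum(model(q, img) == ans for q, img, ans in examples)
    return correct / len(examples)


blind_accuracy = accuracy(blind_model, TOY_PAIRS)
print(blind_accuracy)
```

Because the two images in each pair have different answers, the image-blind baseline scores exactly half of this toy set, however well its priors match the answer distribution.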

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: image_to_text, language, multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Gemma 3 12B: 71.6% (self-reported, llm-stats)
  2. Gemma 3 27B: 71.0% (self-reported, llm-stats)
  3. Gemma 3 4B: 62.4% (self-reported, llm-stats)