VQAv2

VQAv2 is a balanced Visual Question Answering dataset designed to reduce language bias. Each question is paired with complementary images: similar images for which the same question has different answers. This forces models to rely on visual understanding rather than language priors. It contains roughly twice as many image-question pairs as the original VQA dataset.
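The balancing idea can be illustrated with a small sketch. The two examples, the image IDs, and the helper names (`VQAExample`, `blind_predict`, `sighted_predict`, `accuracy`) are hypothetical and are not taken from the dataset or its tooling. Scoring here is plain exact match, which is a simplification chosen for illustration. The sketch shows why a model that answers from the question text alone cannot get both halves of a complementary pair right.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class VQAExample:
    """One image-question-answer triple (hypothetical schema)."""
    image_id: str
    question: str
    answer: str


# A made-up complementary pair: same question, different images, different answers.
pair = [
    VQAExample("img_a", "Is the umbrella open?", "yes"),
    VQAExample("img_b", "Is the umbrella open?", "no"),
]


def blind_predict(example: VQAExample) -> str:
    """Language-prior baseline: ignores the image and always answers 'yes'."""
    return "yes"


def sighted_predict(example: VQAExample) -> str:
    """Stand-in for a model that uses the image (here, a lookup by image ID)."""
    return {"img_a": "yes", "img_b": "no"}[example.image_id]


def accuracy(examples: Sequence[VQAExample],
             predict: Callable[[VQAExample], str]) -> float:
    """Fraction of examples answered correctly under exact-match scoring."""
    correct = sum(predict(ex) == ex.answer for ex in examples)
    return correct / len(examples)


print(accuracy(pair, blind_predict))    # 0.5
print(accuracy(pair, sighted_predict))  # 1.0
```

Because the blind baseline returns the same answer for both images, it is capped at half of any such pair, whereas a model that actually reads the image can answer both.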

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1 (leaderboard scores are shown as percentages of this maximum). Categories: image_to_text, multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Pixtral Large: 80.9% (self-reported, llm-stats)
  2. Pixtral-12B: 78.6% (self-reported, llm-stats)
  3. Llama 3.2 90B Instruct: 78.1% (self-reported, llm-stats)