Vibe-Eval


Vibe-Eval is a hard evaluation suite for measuring the progress of multimodal language models. It consists of 269 visual understanding prompts with gold-standard responses authored by experts. The benchmark has two objectives: vibe-checking multimodal chat models on day-to-day tasks, and rigorously testing frontier models. On the hard set, more than 50% of the questions are answered incorrectly by every frontier model.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: general, multimodal, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Gemini 2.5 Pro Preview 06-05: 67.2% (self-reported, llm-stats)
  2. Gemini 2.5 Pro: 65.6% (self-reported, llm-stats)
  3. Gemini 2.5 Flash: 65.4% (self-reported, llm-stats)
  4. Gemini 2.0 Flash: 56.3% (self-reported, llm-stats)
  5. Gemini 1.5 Pro: 53.9% (self-reported, llm-stats)
  6. Gemini 2.5 Flash-Lite: 51.3% (self-reported, llm-stats)
  7. Gemini 1.5 Flash: 48.9% (self-reported, llm-stats)
  8. Gemini 1.5 Flash 8B: 40.9% (self-reported, llm-stats)
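
The leaderboard reports percentages, while the metadata lists a maximum score of 1. The following is a minimal Python sketch of how the two scales might relate, assuming a percentage is simply 100 times the score divided by the maximum score (the page does not state this). The names LEADERBOARD, to_normalized, and ranked are hypothetical and used for illustration only; the numbers are copied from the list above.

```python
# Leaderboard entries copied from the list above: (model, score in percent).
LEADERBOARD = [
    ("Gemini 2.5 Pro Preview 06-05", 67.2),
    ("Gemini 2.5 Pro", 65.6),
    ("Gemini 2.5 Flash", 65.4),
    ("Gemini 2.0 Flash", 56.3),
    ("Gemini 1.5 Pro", 53.9),
    ("Gemini 2.5 Flash-Lite", 51.3),
    ("Gemini 1.5 Flash", 48.9),
    ("Gemini 1.5 Flash 8B", 40.9),
]

MAX_SCORE = 1.0  # "Max score: 1" from the benchmark metadata


def to_normalized(percent, max_score=MAX_SCORE):
    """Convert a percentage to the 0..max_score scale.

    Assumption (not stated on the page): percent = 100 * score / max_score.
    """
    return round(percent / 100.0 * max_score, 3)


def ranked(entries):
    """Return entries sorted by score, highest first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    for rank, (model, pct) in enumerate(ranked(LEADERBOARD), start=1):
        print(f"{rank}. {model}: {pct:.1f}% ({to_normalized(pct):.3f})")
```

All entries are marked as self-reported, so any comparison built on these numbers inherits that caveat.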