EgoSchema

A diagnostic benchmark for very long-form video language understanding, consisting of over 5,000 human-curated multiple-choice questions based on 3-minute video clips from Ego4D and covering a broad range of natural human activities and behaviors.

Methodology

Imported from llm-stats public benchmark metadata. Modality: video. Max score: 1. Categories: long_context, reasoning, vision. Language: en. Verified by llm-stats: no.
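
As a rough illustration of how a max score of 1 relates to percentage figures on a leaderboard, here is a minimal sketch. It assumes the score is plain multiple-choice accuracy (correct answers divided by total questions), which the metadata above does not state explicitly. The function names `accuracy` and `as_percent` are hypothetical and not part of any llm-stats or EgoSchema tooling.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions answered correctly, on a 0-to-1 scale.

    Assumption: each question has exactly one correct option, and both
    sequences hold the chosen / correct option index per question.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def as_percent(score: float, max_score: float = 1.0) -> str:
    """Render a score out of max_score as a percentage with one decimal."""
    return f"{100 * score / max_score:.1f}%"


# Toy data: 3 of 4 hypothetical questions answered correctly.
score = accuracy([0, 1, 2, 3], [0, 1, 2, 4])
print(as_percent(score))  # 75.0%
```

Under this assumption, a score of 0.779 on the 0-to-1 scale would be displayed as 77.9%.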

Leaderboard

  1. Qwen2-VL-72B-Instruct: 77.9% (self-reported, via llm-stats)
  2. Qwen2.5 VL 72B Instruct: 76.2% (self-reported, via llm-stats)
  3. GPT-4o: 72.2% (self-reported, via llm-stats)
  4. Nova Pro: 72.1% (self-reported, via llm-stats)
  5. Gemini 2.0 Flash: 71.5% (self-reported, via llm-stats)
  6. Nova Lite: 71.4% (self-reported, via llm-stats)
  7. Qwen2.5-Omni-7B: 68.6% (self-reported, via llm-stats)
  8. Gemini 2.0 Flash-Lite: 67.2% (self-reported, via llm-stats)
  9. Gemini 1.0 Pro: 55.7% (self-reported, via llm-stats)