TempCompass

TempCompass is a benchmark for evaluating the temporal perception capabilities of Video Large Language Models (Video LLMs). It constructs conflicting videos that share identical static content but differ in one specific temporal aspect, so a model cannot answer correctly by relying on single-frame bias. The benchmark covers several temporal aspects (action, motion, speed, temporal order, and attribute changes) across four task formats: multi-choice QA, yes/no QA, caption matching, and caption generation.
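
To illustrate why conflicting videos defeat single-frame bias, here is a minimal sketch in Python. It is not the benchmark's actual data format or evaluation code: the `VideoQA` fields, the toy video pair, and both stand-in "models" are hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class VideoQA:
    """One multi-choice item. All field names are hypothetical."""
    static_content: str   # what any single frame shows
    temporal_label: str   # the temporal aspect that differs within a pair
    question: str
    answer: str


def accuracy(items: List[VideoQA], predict: Callable[[VideoQA], str]) -> float:
    """Fraction of items answered correctly, on a 0 to 1 scale."""
    correct = sum(predict(item) == item.answer for item in items)
    return correct / len(items)


# A toy conflicting pair: identical static content, opposite motion direction.
pair = [
    VideoQA("red ball on a table", "left", "Which way does the ball move?", "left"),
    VideoQA("red ball on a table", "right", "Which way does the ball move?", "right"),
]


def single_frame_model(item: VideoQA) -> str:
    # Uses only static content, so it gives the same answer for both videos.
    return "left" if "ball" in item.static_content else "right"


def temporal_model(item: VideoQA) -> str:
    # Stand-in for a model that actually perceives the temporal aspect.
    return item.temporal_label


single_frame_acc = accuracy(pair, single_frame_model)
temporal_acc = accuracy(pair, temporal_model)
print(f"single-frame: {single_frame_acc:.1%}, temporal: {temporal_acc:.1%}")
```

Because both videos in the pair look the same frame by frame, the single-frame stand-in can get at most one of the two items right, while the temporal stand-in gets both.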

Methodology

Imported from the public llm-stats benchmark metadata. Modality: multimodal. Max score: 1 (leaderboard scores are shown as percentages). Categories: multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen2.5 VL 72B Instruct: 74.8% (self-reported, llm-stats)
  2. Qwen2.5 VL 7B Instruct: 71.7% (self-reported, llm-stats)