ERQA


ERQA (Embodied Reasoning Question Answering) is a benchmark of 400 multiple-choice visual questions covering spatial reasoning, trajectory reasoning, action reasoning, state estimation, and multi-view reasoning. It is used to evaluate how well AI models reason about interactions in the physical world.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: reasoning, spatial_reasoning, vision. Language: en. Verified by llm-stats: no.
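
To illustrate how a score on the 0-to-1 scale (max score: 1) maps to the percentages shown in the leaderboard, here is a minimal sketch. It assumes the score is plain accuracy over the multiple-choice questions, which the metadata above does not state explicitly. The function names and the toy answers are hypothetical and are not taken from ERQA or llm-stats.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions answered correctly, in [0, 1].

    Assumption: each question has exactly one correct choice label,
    and labels are compared case-insensitively.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def as_percent(score: float) -> str:
    """Format a 0-to-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# Hypothetical toy data: 4 questions, 3 answered correctly.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "B", "A"]
score = accuracy(preds, gold)
print(score, as_percent(score))  # prints: 0.75 75.0%
```

Under this assumption, a leaderboard entry of 65.7% corresponds to a score of about 0.657 on the 0-to-1 scale.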

Leaderboard

All scores below are self-reported, as listed by llm-stats.

  1. GPT-5: 65.7%
  2. Qwen3.6 Plus: 65.7%
  3. Qwen3.5-35B-A3B: 64.8%
  4. Muse Spark: 64.7%
  5. o3: 64.0%
  6. Qwen3.6-27B: 62.5%
  7. Qwen3.5-122B-A10B: 62.0%
  8. Qwen3.5-27B: 60.5%
  9. Qwen3 VL 235B A22B Thinking: 52.5%
  10. Qwen3 VL 32B Thinking: 52.3%
  11. Qwen3 VL 235B A22B Instruct: 51.3%
  12. Qwen3 VL 32B Instruct: 48.8%
  13. Qwen3 VL 4B Thinking: 47.3%
  14. Qwen3 VL 8B Thinking: 46.8%
  15. Qwen3 VL 8B Instruct: 45.8%
  16. Qwen3 VL 30B A3B Thinking: 45.3%
  17. Qwen3 VL 30B A3B Instruct: 43.0%
  18. Qwen3 VL 4B Instruct: 41.3%
  19. GPT-4o: 35.2%