ScreenSpot-Pro


ScreenSpot-Pro is a GUI grounding benchmark that evaluates the grounding capabilities of multimodal large language models (MLLMs) in professional, high-resolution computing environments. It comprises 1,581 instructions across 23 applications spanning 5 industries and 3 operating systems, using authentic high-resolution images from professional domains with expert annotations. Unlike earlier benchmarks that focus on cropped screenshots of consumer applications, ScreenSpot-Pro targets the complexity and diversity of real-world professional software, and it reveals significant gaps in the GUI perception capabilities of current MLLMs.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1 (leaderboard scores are shown as percentages of this maximum). Categories: grounding, multimodal, spatial_reasoning, vision. Language: en. Verified by llm-stats: no.
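The metadata above does not spell out how a grounding prediction is scored. The following is a minimal sketch, assuming the point-in-box criterion commonly used for GUI grounding benchmarks: a prediction counts as correct when the predicted click point falls inside the annotated target box, and the score is the fraction of correct predictions. The names `Box` and `grounding_accuracy` and the example coordinates are hypothetical and are not taken from the benchmark's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in pixel coordinates (hypothetical layout)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the box (an assumption of this sketch).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions, targets):
    """Return the fraction of predicted click points that land in their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


# Two made-up instructions on a high-resolution screenshot.
targets = [Box(100, 200, 140, 220), Box(3000, 50, 3040, 90)]
predictions = [(120, 210), (2990, 70)]  # first hits, second misses by 10 px

score = grounding_accuracy(predictions, targets)
print(f"{score:.3f} -> {score * 100:.1f}%")  # prints "0.500 -> 50.0%"
```

Under this assumption a score lies between 0 and 1, which is consistent with the maximum score of 1 listed in the metadata, and multiplying by 100 gives the percentage form used in the leaderboard.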

Leaderboard

All scores are self-reported (source: llm-stats).

  1. Claude Opus 4.8: 87.9%
  2. GPT-5.2: 86.3%
  3. Muse Spark: 84.1%
  4. Gemini 3 Pro: 72.7%
  5. Qwen3.5-122B-A10B: 70.4%
  6. Qwen3.5-27B: 70.3%
  7. Gemini 3 Flash: 69.1%
  8. Qwen3.5-35B-A3B: 68.6%
  9. Qwen3.6 Plus: 68.2%
  10. Qwen3 VL 235B A22B Instruct: 62.0%
  11. Qwen3 VL 235B A22B Thinking: 61.8%
  12. Qwen3 VL 30B A3B Instruct: 60.5%
  13. Qwen3 VL 4B Instruct: 59.5%
  14. Qwen3 VL 32B Instruct: 57.9%
  15. Qwen3 VL 30B A3B Thinking: 57.3%
  16. Qwen3 VL 32B Thinking: 57.1%
  17. Qwen3 VL 8B Instruct: 54.6%
  18. Qwen3 VL 4B Thinking: 49.2%
  19. Qwen3 VL 8B Thinking: 46.6%
  20. Qwen2.5 VL 72B Instruct: 43.6%