ScreenSpot

ScreenSpot is the first realistic GUI grounding benchmark to cover mobile, desktop, and web environments. The dataset comprises over 1,200 instructions drawn from iOS, Android, macOS, Windows, and web interfaces, each annotated with the type of the target element (text or icon/widget). It is designed to evaluate how accurately visual GUI agents locate screen elements from natural language instructions.
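As a rough illustration of how a grounding score of this kind can be computed, the sketch below counts a prediction as correct when the predicted click point falls inside the annotated bounding box of the target element. This point-in-box rule is an assumption here, since this page does not spell out the scoring rule, and the `Sample` record with its field names is hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hypothetical evaluation record (field names are illustrative)."""
    instruction: str
    element_type: str        # "text" or "icon/widget"
    bbox: tuple              # assumed layout: (left, top, right, bottom) in pixels
    predicted_point: tuple   # assumed model output: (x, y) in pixels


def is_hit(sample):
    """Return True if the predicted point lies inside the target box (edges included)."""
    left, top, right, bottom = sample.bbox
    x, y = sample.predicted_point
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(samples):
    """Fraction of samples whose predicted point lands in the target box."""
    if not samples:
        return 0.0
    return sum(is_hit(s) for s in samples) / len(samples)


def accuracy_by_type(samples):
    """Accuracy computed separately for each annotated element type."""
    groups = {}
    for s in samples:
        groups.setdefault(s.element_type, []).append(s)
    return {kind: grounding_accuracy(group) for kind, group in groups.items()}
```

Splitting the result by element type is one way to see whether a model handles text targets and icon/widget targets differently, which the annotated element types make possible.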

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: grounding, multimodal, spatial_reasoning, vision. Language: en. Verified by llm-stats: no.
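The metadata gives a maximum score of 1, while the leaderboard shows percentages. A plausible reading, which is an assumption and not something this page confirms, is that scores are stored as fractions between 0 and 1 and rendered as percentages for display. The helper name `to_percent` below is hypothetical.

```python
def to_percent(score, max_score=1.0):
    """Render a raw score as a percentage string with one decimal place.

    Assumes scores are stored on a 0..max_score scale (max_score is 1 here).
    """
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside the range 0..{max_score}")
    return f"{score / max_score:.1%}"
```

Under this assumption, a stored value of 0.958 would be displayed as 95.8%.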

Leaderboard

  1. Qwen3 VL 32B Instruct: 95.8% (self-reported, llm-stats)
  2. Qwen3 VL 32B Thinking: 95.7% (self-reported, llm-stats)
  3. Qwen3 VL 235B A22B Instruct: 95.4% (self-reported, llm-stats)
  4. Qwen3 VL 235B A22B Thinking: 95.4% (self-reported, llm-stats)
  5. Qwen3 VL 30B A3B Instruct: 94.7% (self-reported, llm-stats)
  6. Qwen3 VL 30B A3B Thinking: 94.7% (self-reported, llm-stats)
  7. Qwen3 VL 8B Instruct: 94.4% (self-reported, llm-stats)
  8. Qwen3 VL 4B Instruct: 94.0% (self-reported, llm-stats)
  9. Qwen3 VL 8B Thinking: 93.6% (self-reported, llm-stats)
  10. Qwen3 VL 4B Thinking: 92.9% (self-reported, llm-stats)
  11. Qwen2.5 VL 32B Instruct: 88.5% (self-reported, llm-stats)
  12. Nova 2 Pro: 88.1% (self-reported, llm-stats)
  13. Qwen2.5 VL 72B Instruct: 87.1% (self-reported, llm-stats)
  14. Nova 2 Omni: 85.4% (self-reported, llm-stats)
  15. Qwen2.5 VL 7B Instruct: 84.7% (self-reported, llm-stats)
  16. Nova 2 Lite: 83.3% (self-reported, llm-stats)