ScreenSpot-Pro

multimodal

ScreenSpot-Pro is a GUI grounding benchmark designed to evaluate the grounding capabilities of multimodal large language models (MLLMs) in professional, high-resolution computing environments. It comprises 1,581 instructions across 23 applications spanning 5 industries and 3 operating systems, with authentic high-resolution images from professional domains and expert annotations. Unlike previous benchmarks, which focus on cropped screenshots of consumer applications, ScreenSpot-Pro targets the complexity and diversity of real-world professional software, and it reveals significant gaps in the GUI perception capabilities of current MLLMs.

Leaderboard

Showing 20 of 22 results

Claude Opus 4.8: 87.9%
GPT-5.2: 86.3%
Muse Spark: 84.1%
Gemini 3 Pro: 72.7%
Qwen3.5-122B-A10B: 70.4%
Qwen3.5-27B: 70.3%
Gemini 3 Flash: 69.1%
Qwen3.5-35B-A3B: 68.6%
Qwen3.6 Plus: 68.2%
Qwen3 VL 235B A22B Instruct: 62.0%
Qwen3 VL 235B A22B Thinking: 61.8%
Qwen3 VL 30B A3B Instruct: 60.5%
Qwen3 VL 4B Instruct: 59.5%
Qwen3 VL 32B Instruct: 57.9%
Qwen3 VL 30B A3B Thinking: 57.3%
Qwen3 VL 32B Thinking: 57.1%
Qwen3 VL 8B Instruct: 54.6%
Qwen3 VL 4B Thinking: 49.2%
Qwen3 VL 8B Thinking: 46.6%
Qwen2.5 VL 72B Instruct: 43.6%