ScreenSpot

multimodal

ScreenSpot is the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. The dataset comprises over 1,200 instructions from iOS, Android, macOS, Windows and Web environments, along with annotated element types (text and icon/widget), designed to evaluate visual GUI agents' ability to accurately locate screen elements based on natural language instructions.
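As a rough illustration of how a grounding score like those below could be computed, here is a minimal sketch. It assumes each prediction is a single click point and counts as correct when it falls inside the annotated element's bounding box. That scoring rule is an assumption rather than something stated on this page, and the names `Sample`, `is_hit`, and `accuracy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One benchmark item (hypothetical schema, not the dataset's actual format)."""
    instruction: str
    bbox: tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
    element_type: str  # "text" or "icon"


def is_hit(point: tuple[float, float], bbox: tuple[float, float, float, float]) -> bool:
    """Return True if the predicted click point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def accuracy(samples: list[Sample], predictions: list[tuple[float, float]]) -> float:
    """Fraction of samples whose predicted point lands inside the target box."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    if not samples:
        return 0.0
    hits = sum(is_hit(p, s.bbox) for s, p in zip(samples, predictions))
    return hits / len(samples)


samples = [
    Sample("Open the settings menu", (10, 10, 50, 30), "icon"),
    Sample("Tap the Sign in button", (100, 200, 180, 240), "text"),
]
predictions = [(30, 20), (90, 220)]  # first point is inside its box, second is not
score = accuracy(samples, predictions)
```

Under this assumed rule, a leaderboard percentage would be `accuracy(...) * 100` over the full instruction set; grouping by `element_type` or platform before calling `accuracy` would give per-category breakdowns.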

Leaderboard

Model                          Score
Qwen3 VL 32B Instruct          95.8%
Qwen3 VL 32B Thinking          95.7%
Qwen3 VL 235B A22B Instruct    95.4%
Qwen3 VL 235B A22B Thinking    95.4%
Qwen3 VL 30B A3B Instruct      94.7%
Qwen3 VL 30B A3B Thinking      94.7%
Qwen3 VL 8B Instruct           94.4%
Qwen3 VL 4B Instruct           94.0%
Qwen3 VL 8B Thinking           93.6%
Qwen3 VL 4B Thinking           92.9%
Qwen2.5 VL 32B Instruct        88.5%
Nova 2 Pro                     88.1%
Qwen2.5 VL 72B Instruct        87.1%
Nova 2 Omni                    85.4%
Qwen2.5 VL 7B Instruct         84.7%
Nova 2 Lite                    83.3%