VisualWebBench

multimodal

A multimodal benchmark that assesses the capabilities of multimodal large language models (MLLMs) on web page understanding and grounding tasks. It comprises seven tasks (captioning, webpage QA, heading OCR, element OCR, element grounding, action prediction, and action grounding) with 1.5K human-curated instances drawn from 139 real websites across 87 sub-domains.

Leaderboard

Nova Pro: 79.7%

Nova Lite: 77.7%