VisualWebBench


A multimodal benchmark designed to assess multimodal large language models (MLLMs) on web page understanding and grounding tasks. It comprises seven tasks (captioning, webpage QA, heading OCR, element OCR, element grounding, action prediction, and action grounding), with 1.5K human-curated instances drawn from 139 real websites spanning 87 sub-domains.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: frontend_development, multimodal, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Nova Pro: 79.7% (self-reported, via llm-stats)
  2. Nova Lite: 77.7% (self-reported, via llm-stats)
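The metadata lists a max score of 1, so the leaderboard percentages can be read as raw scores on a 0 to 1 scale multiplied by 100. The following is a minimal sketch under that assumption. The `Entry` record, its field names, and the helpers `to_percent` and `rank` are hypothetical illustrations, not part of llm-stats or VisualWebBench. Only the two scores come from the leaderboard above.

```python
from dataclasses import dataclass

MAX_SCORE = 1.0  # "Max score: 1" from the benchmark metadata


@dataclass
class Entry:
    model: str
    score: float  # assumed: raw score stored on the 0..MAX_SCORE scale
    self_reported: bool


def to_percent(score: float, max_score: float = MAX_SCORE) -> str:
    """Format a raw score as a percentage of the maximum, one decimal place."""
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} outside [0, {max_score}]")
    return f"{100 * score / max_score:.1f}%"


def rank(entries: list[Entry]) -> list[tuple[int, Entry]]:
    """Sort entries by descending score and attach 1-based ranks."""
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    return list(enumerate(ordered, start=1))


# Scores taken from the leaderboard, expressed as fractions of MAX_SCORE.
entries = [
    Entry("Nova Lite", 0.777, self_reported=True),
    Entry("Nova Pro", 0.797, self_reported=True),
]

for position, entry in rank(entries):
    note = " (self-reported)" if entry.self_reported else ""
    print(f"{position}. {entry.model}: {to_percent(entry.score)}{note}")
```

The input list is deliberately out of order to show that the ranking comes from sorting by score, not from the order in which results were submitted. The self-reported flag is carried through so that unverified results stay labeled as such.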