OSWorld Screenshot-only


OSWorld Screenshot-only is a variant of the OSWorld benchmark that evaluates multimodal AI agents using only screenshot observations as they complete open-ended computer tasks on real operating systems (Ubuntu, Windows, macOS). It tests an agent's ability to carry out complex workflows involving web apps, desktop applications, file I/O, and multi-application tasks through visual interface understanding and GUI grounding.
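To make the screenshot-only constraint concrete, here is a minimal sketch of an episode loop in which the agent's policy sees nothing but pixels. It is not the OSWorld API: `Observation`, `screenshot_only`, `run_episode`, and the `reset`/`step`/`policy` callables are all hypothetical names chosen for illustration, and the step budget is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Observation:
    """Hypothetical observation; a real environment may expose other fields."""
    screenshot: bytes                         # raw image bytes of the screen
    accessibility_tree: Optional[str] = None  # structured UI metadata, if any


def screenshot_only(obs: Observation) -> Observation:
    """Keep the pixels and drop any structured UI metadata."""
    return Observation(screenshot=obs.screenshot, accessibility_tree=None)


def run_episode(
    reset: Callable[[], Observation],
    step: Callable[[str], Tuple[Observation, bool, float]],
    policy: Callable[[bytes], str],
    max_steps: int = 15,  # assumed budget, not taken from the benchmark
) -> float:
    """Run one task and return its score (0.0 if the budget runs out)."""
    obs = screenshot_only(reset())
    for _ in range(max_steps):
        # The policy receives only the screenshot bytes.
        action = policy(obs.screenshot)
        raw_obs, done, score = step(action)
        if done:
            return score
        obs = screenshot_only(raw_obs)
    return 0.0
```

The design point is that filtering happens before the policy is called, so the agent must locate buttons, menus, and text fields from the image alone, which is what the GUI grounding in the description refers to.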

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: agents, general, grounding, multimodal, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Claude 3.5 Sonnet: 14.9% (self-reported; source: llm-stats)
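The metadata lists a maximum score of 1, while the leaderboard shows a percentage. A small sketch of the conversion follows, assuming the percentage is simply the raw score divided by the maximum; `as_percent` is a hypothetical helper, and the raw value 0.149 is inferred from the listed 14.9% rather than given in the source.

```python
def as_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw benchmark score as a percentage of the maximum."""
    if not 0.0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return f"{score / max_score:.1%}"


# Assumed raw score corresponding to the leaderboard entry.
print(as_percent(0.149))  # prints "14.9%"
```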