OSWorld Extended

OSWorld is a scalable benchmark built on real computer environments for evaluating multimodal agents on open-ended tasks across Ubuntu, Windows, and macOS. It comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows. Agents are evaluated on their ability to operate computer interfaces in realistic computing environments, using screenshots as observations and issuing actions in response.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: agents, general, multimodal, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Claude 3.5 Sonnet: 22.0% (self-reported, via llm-stats)
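
The percentage on the leaderboard can be related to the "Max score: 1" field in the metadata with a small piece of arithmetic. The following is a minimal sketch, not the scoring code used by OSWorld or llm-stats: the function names are hypothetical, and it assumes that the aggregate score is the mean of per-task scores, each between 0 and 1, and that the percentage is the aggregate divided by the maximum score.

```python
def mean_score(task_scores: list[float]) -> float:
    """Average per-task scores (assumed to lie in [0, 1]) into one aggregate."""
    if not task_scores:
        raise ValueError("task_scores must not be empty")
    return sum(task_scores) / len(task_scores)


def score_to_percent(score: float, max_score: float = 1.0) -> str:
    """Format an aggregate score as a percentage of the maximum score."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{100 * score / max_score:.1f}%"


# Hypothetical example: one success out of four tasks.
print(score_to_percent(mean_score([1.0, 0.0, 0.0, 0.0])))

# Under the assumptions above, a raw score of 0.22 with a max score of 1
# is displayed as the leaderboard's percentage format.
print(score_to_percent(0.22))
```

Under these assumptions, a displayed value of 22.0% corresponds to a raw aggregate score of 0.22 on the 0 to 1 scale.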