OSWorld Extended
reasoning official site →
OSWorld is a scalable, real computer environment benchmark for evaluating multimodal agents on open-ended tasks across Ubuntu, Windows, and macOS. It comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows. The benchmark evaluates agents' ability to interact with computer interfaces using screenshots and actions in realistic computing environments.
Methodology
Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: agents, general, multimodal, reasoning. Language: en. Verified by llm-stats: no.