OSWorld


OSWorld is a first-of-its-kind scalable, real computer environment for multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS, and contains 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: agents, general, multimodal, vision. Language: en. Verified by llm-stats: no.
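
The metadata gives a maximum score of 1, while the leaderboard shows percentages. The sketch below illustrates one way the two could relate, assuming the displayed figure is the raw score scaled by 100 and that a run's score is the fraction of tasks that pass execution-based checks. Both function names are hypothetical and are not part of OSWorld or llm-stats; the actual scoring may differ (for example, if some tasks award partial credit).

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks that passed (assumes binary pass/fail per task)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score on a 0..max_score scale as a one-decimal percentage."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{100 * score / max_score:.1f}%"


# Illustrative only: three of four hypothetical tasks passed.
rate = success_rate([True, False, True, True])
print(to_percent(rate))
```

Under these assumptions, a raw score of 0.727 would be displayed as 72.7%.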

Leaderboard

All scores are self-reported, as listed by llm-stats.

  1. Claude Opus 4.6: 72.7%
  2. Claude Sonnet 4.6: 72.5%
  3. Qwen3 VL 235B A22B Instruct: 66.7%
  4. Claude Opus 4.5: 66.3%
  5. GLM-5V-Turbo: 62.3%
  6. Claude Sonnet 4.5: 61.4%
  7. Claude Haiku 4.5: 50.7%
  8. Claude Haiku 4.5: 44.9%
  9. Qwen3 VL 32B Thinking: 41.0%
  10. Qwen3 VL 235B A22B Thinking: 38.1%
  11. Qwen3 VL 8B Instruct: 33.9%
  12. Qwen3 VL 8B Thinking: 33.9%
  13. Qwen3 VL 32B Instruct: 32.6%
  14. Qwen3 VL 4B Thinking: 31.4%
  15. Qwen3 VL 30B A3B Thinking: 30.6%
  16. Qwen3 VL 30B A3B Instruct: 30.3%
  17. Qwen3 VL 4B Instruct: 26.2%
  18. Qwen2.5 VL 72B Instruct: 8.8%
  19. Qwen2.5 VL 32B Instruct: 5.9%