AndroidWorld_SR


AndroidWorld Success Rate (SR) benchmark: a dynamic benchmarking environment for autonomous agents operating on Android devices. It evaluates agents on 116 programmatic tasks across 20 real-world Android apps using multimodal inputs (screenshots, accessibility trees, and natural-language instructions). The metric is the fraction of tasks an agent completes successfully, such as sending messages, creating calendar events, and navigating mobile interfaces. Published at ICLR 2025. Baseline reported in the original paper: 30.6% success rate (M3A agent) vs. 80.0% human performance; the self-reported leaderboard scores listed under Leaderboard are more recent and higher.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: agents, general, multimodal, reasoning. Language: en. Verified by llm-stats: no.
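
As a minimal sketch of how a success rate with a maximum score of 1 relates to the percentages shown in the leaderboard, the snippet below computes the fraction of passed tasks and formats it as a percentage. The function name `success_rate` and the example outcomes are hypothetical, and it is an assumption (not confirmed by the metadata) that llm-stats stores scores as a fraction in [0, 1] and displays them as percentages.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of tasks marked successful, in [0, 1]."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    # True counts as 1 and False as 0 when summed.
    return sum(outcomes) / len(outcomes)


# Hypothetical run over 116 tasks, 58 of which passed their check.
# These numbers are illustrative only, not results from any real agent.
outcomes = [True] * 58 + [False] * 58
sr = success_rate(outcomes)

# A score of 0.5 on the 0-to-1 scale is displayed as 50.0%.
print(f"{sr:.3f} -> {sr:.1%}")
```

A single run is shown for simplicity; a reported score may also be an average over several runs or seeds, which this sketch does not model.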

Leaderboard

  1. Qwen3.5-35B-A3B: 71.1% (self-reported, llm-stats)
  2. Qwen3.5-122B-A10B: 66.4% (self-reported, llm-stats)
  3. Qwen3.5-27B: 64.2% (self-reported, llm-stats)
  4. Qwen3 VL 235B A22B Instruct: 63.7% (self-reported, llm-stats)
  5. Qwen3 VL 32B Thinking: 63.7% (self-reported, llm-stats)
  6. Qwen2.5 VL 72B Instruct: 35.0% (self-reported, llm-stats)
  7. Qwen2.5 VL 7B Instruct: 25.5% (self-reported, llm-stats)
  8. Qwen2.5 VL 32B Instruct: 22.0% (self-reported, llm-stats)