Android Control Low_EM

reasoning

Android control benchmark evaluating autonomous agents on mobile device interaction tasks, scored by exact match (EM) in the low-level instruction setting.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, reasoning. Language: en. Verified by llm-stats: no.
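
The metadata above does not spell out how exact match is computed. As a rough illustration only, the sketch below scores predicted agent actions against reference actions using strict equality and returns a value in [0, 1], consistent with the stated max score of 1. The action dictionaries, field names, and the functions `step_exact_match` and `exact_match_score` are hypothetical; the benchmark's real matching rules (for example, any tolerance on tap coordinates) are not given here and may differ.

```python
from typing import Mapping, Sequence

# Hypothetical action representation: a flat mapping such as
# {"type": "click", "target": "Settings"}. The field names are assumptions.
Action = Mapping[str, object]


def step_exact_match(predicted: Action, reference: Action) -> bool:
    """Return True only if every field of the predicted action equals the reference."""
    return dict(predicted) == dict(reference)


def exact_match_score(predictions: Sequence[Action],
                      references: Sequence[Action]) -> float:
    """Fraction of steps whose predicted action exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(step_exact_match(p, r) for p, r in zip(predictions, references))
    return matches / len(references)


if __name__ == "__main__":
    refs = [
        {"type": "click", "target": "Settings"},
        {"type": "scroll", "direction": "down"},
        {"type": "click", "target": "Wi-Fi"},
        {"type": "input_text", "text": "home"},
    ]
    preds = [
        {"type": "click", "target": "Settings"},
        {"type": "scroll", "direction": "up"},   # wrong direction, no match
        {"type": "click", "target": "Wi-Fi"},
        {"type": "input_text", "text": "home"},
    ]
    print(f"{exact_match_score(preds, refs):.1%}")
```

Multiplying the returned fraction by 100 gives a percentage in the same form as the leaderboard entries.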

Leaderboard

  1. Qwen2.5 VL 72B Instruct: 93.7% (self-reported, llm-stats)
  2. Qwen2.5 VL 32B Instruct: 93.3% (self-reported, llm-stats)
  3. Qwen2.5 VL 7B Instruct: 91.4% (self-reported, llm-stats)