Android Control High_EM

reasoning

An Android device-control benchmark that uses the High_EM (high exact match) metric to assess agent performance on mobile interface tasks.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: multimodal, reasoning. Language: en. Verified by llm-stats: no.
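
As a rough illustration of how an exact-match score on a 0 to 1 scale could be computed, the sketch below compares predicted actions with reference actions one step at a time and returns the fraction that match exactly. The action fields (`action_type`, `x`, `y`, `text`, `direction`) and the function name `exact_match_rate` are hypothetical. The metadata above does not describe the benchmark's actual matching rules, which may differ from strict equality (for example, by tolerating small coordinate offsets).

```python
from typing import Any, Mapping, Sequence

# Hypothetical action representation: a flat mapping of field name to value.
Action = Mapping[str, Any]


def exact_match_rate(predicted: Sequence[Action], reference: Sequence[Action]) -> float:
    """Return the fraction of steps whose predicted action equals the reference.

    Assumption: a step counts as correct only if every field matches exactly.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of steps")
    if not reference:
        return 0.0  # no steps to score
    matches = sum(1 for p, r in zip(predicted, reference) if dict(p) == dict(r))
    return matches / len(reference)


# Illustrative data only; not taken from the benchmark.
reference = [
    {"action_type": "click", "x": 120, "y": 340},
    {"action_type": "input_text", "text": "pizza near me"},
    {"action_type": "scroll", "direction": "down"},
]
predicted = [
    {"action_type": "click", "x": 120, "y": 340},
    {"action_type": "input_text", "text": "pizza"},  # differs from the reference
    {"action_type": "scroll", "direction": "down"},
]

score = exact_match_rate(predicted, reference)
print(f"{score:.3f}")
```

Under this convention, a leaderboard entry such as 69.6% would correspond to a score of 0.696 against the maximum of 1.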

Leaderboard

  1. Qwen2.5 VL 32B Instruct self-reported llm-stats
    69.6%
  2. Qwen2.5 VL 72B Instruct self-reported llm-stats
    67.4%
  3. Qwen2.5 VL 7B Instruct self-reported llm-stats
    60.1%