AITZ_EM


Android-In-The-Zoo (AitZ) is a benchmark for evaluating autonomous GUI agents on smartphones. It contains 18,643 screen-action pairs with chain-of-action-thought annotations, spanning more than 70 Android apps. It is designed to connect perception (screen layouts and UI elements) with cognition (action decision-making) so that agents can complete smartphone tasks triggered by natural-language instructions.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1 (leaderboard values are shown as percentages of this maximum). Categories: agents, multimodal, reasoning. Language: en. Verified by llm-stats: no.
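
As a rough illustration of how a score with a maximum of 1 could be produced from screen-action pairs, the sketch below computes the fraction of steps where a predicted action equals the reference action. It assumes the "EM" suffix in AITZ_EM stands for exact match, which the metadata above does not state. The `Action` record, its field names, and the `exact_match_score` helper are hypothetical and do not reflect the actual AitZ schema or the official evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; fields are illustrative, not the AitZ schema."""
    kind: str           # e.g. "click", "type", "scroll"
    argument: str = ""  # e.g. a target element name or the typed text


def exact_match_score(predicted: list[Action], reference: list[Action]) -> float:
    """Return the fraction of steps whose predicted action equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy episode (invented data): the agent gets the last step wrong.
reference = [
    Action("click", "search_box"),
    Action("type", "weather"),
    Action("click", "go"),
]
predicted = [
    Action("click", "search_box"),
    Action("type", "weather"),
    Action("scroll", "down"),
]

score = exact_match_score(predicted, reference)
# Report the score on the 0-1 scale and as a percentage.
print(f"{score:.3f} ({score:.1%})")
```

Under this reading, a leaderboard entry of 83.2% would correspond to a score of 0.832 on the 0-1 scale.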

Leaderboard

  1. Qwen2.5 VL 72B Instruct: 83.2% (self-reported; source: llm-stats)
  2. Qwen2.5 VL 32B Instruct: 83.1% (self-reported; source: llm-stats)
  3. Qwen2.5 VL 7B Instruct: 81.9% (self-reported; source: llm-stats)