TIR-Bench

reasoning

A tool-calling and multimodal interaction benchmark for testing visual instruction following and execution reliability.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: agents, multimodal, reasoning, tool_calling. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen3.6 Plus (self-reported, llm-stats): 61.6%
  2. Qwen3.5-27B (self-reported, llm-stats): 59.8%
  3. Qwen3.5-35B-A3B (self-reported, llm-stats): 55.5%
  4. Qwen3.5-122B-A10B (self-reported, llm-stats): 53.2%
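
The metadata lists a max score of 1, while the leaderboard shows percentages. The sketch below assumes (this is not stated in the source) that each percentage is simply the fraction of the max score times 100, and converts the listed values back to that 0 to 1 scale. The names LEADERBOARD, MAX_SCORE, to_fraction, and ranked are hypothetical helpers for illustration, not part of any llm-stats API.

```python
# Self-reported scores exactly as listed in the leaderboard (percent).
LEADERBOARD = [
    ("Qwen3.6 Plus", 61.6),
    ("Qwen3.5-27B", 59.8),
    ("Qwen3.5-35B-A3B", 55.5),
    ("Qwen3.5-122B-A10B", 53.2),
]

# "Max score: 1" from the benchmark metadata.
MAX_SCORE = 1.0


def to_fraction(percent, max_score=MAX_SCORE):
    """Convert a percentage to the benchmark's native scale.

    Assumption: percent == 100 * score / max_score.
    """
    return percent / 100.0 * max_score


def ranked(entries):
    """Return entries sorted by score, highest first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    for rank, (model, percent) in enumerate(ranked(LEADERBOARD), start=1):
        print(f"{rank}. {model}: {to_fraction(percent):.3f}")
```

Under that assumption, sorting by score reproduces the order shown in the leaderboard, and the gap between the first and last listed models is 8.4 percentage points.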