TIR-Bench

reasoning

Categories: agents, multimodal, reasoning, tool calling
Modality: multimodal
Language: en
Multilingual: No
Max score: 1
Scoring: %, higher is better
Verified by llm-stats: No

A tool-calling and multimodal interaction benchmark for testing visual instruction following and execution reliability.

Leaderboard

Showing 4 of 4 results

Qwen3.6 Plus: 61.6%
Qwen3.5-27B: 59.8%
Qwen3.5-35B-A3B: 55.5%
Qwen3.5-122B-A10B: 53.2%

Wikibench. Content licensed CC BY-SA 4.0.