TIR-Bench
reasoning
A tool-calling and multimodal interaction benchmark for testing visual instruction following and execution reliability.
Methodology
Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: agents, multimodal, reasoning, tool_calling. Language: en. Verified by llm-stats: no.
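The metadata line above is a flat run of "Key: value." pairs, so it can be read into a structured record. The following is a minimal sketch, not the llm-stats import code: the `BenchmarkCard` class and `parse_card` function are hypothetical names introduced here for illustration, and the parsing assumes fields are separated by ". " and that keys are followed by ": ".

```python
from dataclasses import dataclass

# The metadata line copied verbatim from the card above.
METADATA = (
    "Imported from llm-stats public benchmark metadata. Modality: multimodal. "
    "Max score: 1. Categories: agents, multimodal, reasoning, tool_calling. "
    "Language: en. Verified by llm-stats: no."
)


@dataclass(frozen=True)
class BenchmarkCard:
    """Hypothetical record type; field names mirror the keys in the card."""
    name: str
    modality: str
    max_score: float
    categories: tuple
    language: str
    verified: bool


def parse_card(name: str, metadata: str) -> BenchmarkCard:
    """Parse 'Key: value.' pairs; sentences without a key are skipped."""
    fields = {}
    for part in metadata.split(". "):
        key, sep, value = part.partition(": ")
        if sep:  # only keep fragments that actually contain "Key: value"
            fields[key.strip().lower()] = value.strip().rstrip(".")
    return BenchmarkCard(
        name=name,
        modality=fields["modality"],
        max_score=float(fields["max score"]),
        categories=tuple(c.strip() for c in fields["categories"].split(",")),
        language=fields["language"],
        verified=fields["verified by llm-stats"] == "yes",
    )


card = parse_card("TIR-Bench", METADATA)
print(card)
```

Keeping `max_score` as a number makes it straightforward to check that a reported score does not exceed it, and the boolean `verified` field keeps the "Verified by llm-stats: no" caveat attached to any score drawn from this card.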