Tau2 Airline
The TAU2 airline domain benchmark evaluates conversational agents in dual-control environments, where both the AI agent and the user interact with tools, in airline customer service scenarios. It tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modification, cancellation, and refunds.
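The dual-control idea described above can be illustrated with a toy sketch. This is not the benchmark's code or API: the `Reservation` class, the two tool functions, and the flight number are hypothetical names invented for illustration. The sketch only assumes that agent and user each hold tools that act on shared state, so the agent has to guide the user before it can finish the task.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    """Shared state that both parties act on (hypothetical, for illustration)."""
    flight: str
    status: str = "booked"
    user_confirmed: bool = False
    log: list = field(default_factory=list)


def agent_cancel_reservation(res: Reservation) -> str:
    """Agent-side tool (hypothetical): only succeeds after the user confirms."""
    if not res.user_confirmed:
        res.log.append("agent:cancel_blocked")
        return "blocked"
    res.status = "cancelled"
    res.log.append("agent:cancel")
    return "ok"


def user_confirm_cancellation(res: Reservation) -> str:
    """User-side tool (hypothetical): an action only the user can take."""
    res.user_confirmed = True
    res.log.append("user:confirm")
    return "ok"


def run_episode() -> Reservation:
    """One scripted episode: the agent is blocked, guides the user, then retries."""
    res = Reservation(flight="XX100")  # placeholder flight number
    if agent_cancel_reservation(res) == "blocked":
        # In a real conversation the agent would ask the user to confirm here.
        user_confirm_cancellation(res)
        agent_cancel_reservation(res)
    return res


if __name__ == "__main__":
    final = run_episode()
    print(final.status, final.log)
```

In this toy episode the task can only be completed through a sequence that involves both parties, which is the kind of coordination the benchmark description refers to.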
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, reasoning, tool_calling. Language: en. Verified by llm-stats: no.