Tau2 Airline

The TAU2 airline domain is a benchmark for evaluating conversational agents in dual-control environments, where both the AI agent and the user interact with tools, set in airline customer service scenarios. It tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modifications, cancellations, and refunds.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, reasoning, tool_calling. Language: en. Verified by llm-stats: no.
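
The metadata gives a max score of 1, while the leaderboard shows percentages. A minimal sketch of how the two scales might relate, assuming a displayed value such as 76.5% is simply the 0-1 score multiplied by 100 (the source does not state this, and the helper names `percent_to_score` and `score_to_percent` are hypothetical):

```python
# Assumption (not confirmed by the source): a leaderboard entry such as
# "76.5%" is the 0-1 benchmark score multiplied by 100.
MAX_SCORE = 1.0  # taken from the metadata line "Max score: 1"


def percent_to_score(text: str) -> float:
    """Convert a leaderboard string like '76.5%' to the 0-1 scale."""
    value = float(text.strip().rstrip("%"))
    return round(value / 100.0 * MAX_SCORE, 4)


def score_to_percent(score: float) -> str:
    """Format a 0-1 score with one decimal place, as the leaderboard does."""
    return f"{score / MAX_SCORE * 100:.1f}%"
```

Under that assumption, `percent_to_score("76.5%")` puts the top entry on the same 0-1 scale as the stated max score, and `score_to_percent` reverses the conversion.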

Leaderboard

  All scores below are self-reported (source: llm-stats).

  1. LongCat-Flash-Thinking-2601: 76.5%
  2. Nova 2 Omni: 68.8%
  3. LongCat-Flash-Thinking: 67.5%
  4. GPT-5.1: 67.0%
  5. GPT-5.1 Instant: 67.0%
  6. GPT-5.1 Thinking: 67.0%
  7. Nova 2 Pro: 65.2%
  8. o3: 64.8%
  9. Nova 2 Lite: 64.8%
  10. Claude Haiku 4.5: 63.6%
  11. GPT-5: 62.6%
  12. Qwen3-Next-80B-A3B-Thinking: 60.5%
  13. Qwen3-235B-A22B-Thinking-2507: 58.0%
  14. LongCat-Flash-Chat: 58.0%
  15. LongCat-Flash-Lite: 58.0%
  16. Kimi K2 Instruct: 56.5%
  17. Kimi K2-Instruct-0905: 56.5%
  18. Nemotron 3 Super (120B A12B): 56.3%
  19. Mercury 2: 53.0%
  20. Nemotron 3 Nano (30B A3B): 48.0%