TAU-bench Airline


TAU-bench Airline is the airline domain of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains. It evaluates how well language agents interact with users through dynamic conversations while following domain-specific rules and using API tools. Agents must handle airline-related tasks and policies reliably.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, reasoning, tool_calling. Language: en. Verified by llm-stats: no.
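
The metadata gives a maximum score of 1, while the leaderboard lists percentages. A minimal sketch of the conversion, assuming each leaderboard value is simply the raw score divided by the maximum score and multiplied by 100 (the helper name to_percent is hypothetical and not part of llm-stats or τ-bench):

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw benchmark score as a percentage of max_score.

    Assumption: leaderboard percentages are score / max_score * 100,
    rounded to one decimal place.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return f"{score / max_score * 100:.1f}%"


if __name__ == "__main__":
    # Illustrative raw scores on the 0-to-1 scale.
    for raw in (0.70, 0.608, 0.324):
        print(to_percent(raw))
```

Under this assumption, a raw score of 0.608 is displayed as 60.8%.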

Leaderboard

  1. Claude Sonnet 4.5: 70.0% (self-reported, llm-stats)
  2. MiniMax M1 80K: 62.0% (self-reported, llm-stats)
  3. GLM-4.5-Air: 60.8% (self-reported, llm-stats)
  4. GLM-4.5: 60.4% (self-reported, llm-stats)
  5. Claude Sonnet 4: 60.0% (self-reported, llm-stats)
  6. MiniMax M1 40K: 60.0% (self-reported, llm-stats)
  7. Claude Opus 4: 59.6% (self-reported, llm-stats)
  8. Claude 3.7 Sonnet: 58.4% (self-reported, llm-stats)
  9. Claude Opus 4.1: 56.0% (self-reported, llm-stats)
  10. GPT-4.5: 50.0% (self-reported, llm-stats)
  11. o1: 50.0% (self-reported, llm-stats)
  12. GPT-4.1: 49.4% (self-reported, llm-stats)
  13. o4-mini: 49.2% (self-reported, llm-stats)
  14. Qwen3-Next-80B-A3B-Thinking: 49.0% (self-reported, llm-stats)
  15. Claude 3.5 Sonnet: 46.0% (self-reported, llm-stats)
  16. Qwen3-235B-A22B-Thinking-2507: 46.0% (self-reported, llm-stats)
  17. Qwen3-Next-80B-A3B-Instruct: 44.0% (self-reported, llm-stats)
  18. GPT-4o: 42.8% (self-reported, llm-stats)
  19. GPT-4.1 mini: 36.0% (self-reported, llm-stats)
  20. o3-mini: 32.4% (self-reported, llm-stats)