Tau2 Airline

reasoning

The TAU2 airline domain benchmark evaluates conversational agents in dual-control environments, where both AI agents and users interact with tools, using airline customer service scenarios. It tests agent coordination, communication, and the ability to guide user actions in tasks such as flight booking, modifications, cancellations, and refunds.

Leaderboard

Showing 20 of 24 results

Model                           Score
LongCat-Flash-Thinking-2601     76.5%
Nova 2 Omni                     68.8%
LongCat-Flash-Thinking          67.5%
GPT-5.1                         67.0%
GPT-5.1 Instant                 67.0%
GPT-5.1 Thinking                67.0%
Nova 2 Pro                      65.2%
o3                              64.8%
Nova 2 Lite                     64.8%
Claude Haiku 4.5                63.6%
GPT-5                           62.6%
Qwen3-Next-80B-A3B-Thinking     60.5%
Qwen3-235B-A22B-Thinking-2507   58.0%
LongCat-Flash-Chat              58.0%
LongCat-Flash-Lite              58.0%
Kimi K2 Instruct                56.5%
Kimi K2-Instruct-0905           56.5%
Nemotron 3 Super (120B A12B)    56.3%
Mercury 2                       53.0%
Nemotron 3 Nano (30B A3B)       48.0%