Tau2 Telecom

reasoning

The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP (decentralized partially observable Markov decision process). Both the agent and the user use tools in shared telecommunications troubleshooting scenarios, which test the agent's coordination and communication capabilities.

Leaderboard

Showing 20 of 35 results

Model                          Score
Claude Opus 4.6                99.3%
LongCat-Flash-Thinking-2601    99.3%
GPT-5.4                        98.9%
GPT-5.2                        98.7%
Claude Opus 4.5                98.2%
GPT-5.5                        98.0%
Claude Sonnet 4.6              97.9%
MiMo-V2-Pro                    96.8%
GPT-5                          96.7%
GPT-5.1                        95.6%
GPT-5.1 Instant                95.6%
GPT-5.1 Thinking               95.6%
GPT-5.4 mini                   93.4%
Nova 2 Pro                     92.7%
GPT-5.4 nano                   92.5%
Muse Spark                     91.5%
MiniMax M2                     87.0%
MiniMax M2.1                   87.0%
LongCat-Flash-Thinking         83.1%
Claude Haiku 4.5               83.0%