Tau2 Retail

reasoning

The τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment, where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

Leaderboard

Showing 20 of 27 results

Model                           Score
Claude Haiku 4.5                93.3%
Claude Opus 4.6                 91.9%
Claude Sonnet 4.6               91.7%
Claude Opus 4.5                 88.9%
LongCat-Flash-Thinking-2601     88.6%
Claude Haiku 4.5                83.2%
GPT-5.2                         82.0%
GPT-5                           81.1%
o3                              80.2%
Nova 2 Omni                     78.3%
GPT-5.1                         77.9%
GPT-5.1 Instant                 77.9%
GPT-5.1 Thinking                77.9%
Nova 2 Pro                      77.7%
Nova 2 Lite                     76.5%
LongCat-Flash-Lite              73.1%
Qwen3-235B-A22B-Thinking-2507   71.9%
LongCat-Flash-Thinking          71.5%
Qwen3-235B-A22B-Instruct-2507   71.3%
LongCat-Flash-Chat              71.3%