Tau-bench

reasoning

τ-bench: A benchmark for tool-agent-user interaction in real-world domains. It tests how well language agents interact with users and follow domain-specific rules in dynamic conversations, using API tools and policy guidelines in the retail and airline domains. It also evaluates how consistent and reliable agent behavior is across multiple trials of the same task.
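One way to quantify consistency over multiple trials is the pass^k metric described in the τ-bench paper: the probability that k independent trials of the same task all succeed, averaged over tasks. The sketch below is a minimal illustration, not the benchmark's own code. The function names and the per-task success counts are hypothetical, and nothing here implies that the leaderboard scores were computed this way.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed.

    Uses the combinatorial estimator C(c, k) / C(n, k), where n is the number
    of trials run for a task and c is the number of successful trials.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    # math.comb returns 0 when k > num_successes, so the estimate is 0.0 then.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: 3 tasks, 4 trials each, with 4, 2, and 0 successes.
successes = [4, 2, 0]
for k in (1, 2, 4):
    print(f"pass^{k} = {benchmark_pass_hat_k(successes, num_trials=4, k=k):.3f}")
```

With k = 1 the estimate reduces to the ordinary average success rate. As k grows, a task contributes only if the agent succeeds on nearly every trial, so the score penalizes inconsistent behavior.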

Leaderboard


Model             Score
Step-3.5-Flash    88.2%
GLM-4.7           87.4%
MiMo-V2-Flash     80.3%
GLM-4.7-Flash     79.5%
MiniMax M2        77.2%
o3                63.0%