Tau-bench


τ-bench is a benchmark for tool-agent-user interaction in real-world domains. It tests whether a language agent can hold a dynamic conversation with a user while calling API tools and following domain-specific policy guidelines, in two domains: retail and airline. It also evaluates how consistent and reliable the agent's behavior is across multiple trials.
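The consistency measurement can be made concrete with the pass^k metric from the τ-bench paper: the chance that k independent trials of the same task all succeed, averaged over tasks. The sketch below estimates it from n recorded trials per task as C(c, k) / C(n, k), where c is the number of successful trials. The function names and the sample data are hypothetical, and nothing here implies that the llm-stats scores listed on this page were computed this way.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed.

    Uses C(successes, k) / C(trials, k): the fraction of size-k subsets
    of the recorded trials in which every trial succeeded.
    """
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must satisfy 0 <= successes <= trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [
        pass_hat_k(sum(outcomes), len(outcomes), k)
        for outcomes in results.values()
    ]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes: four trials each for two made-up tasks.
example_results = {
    "task_a": [True, True, True, True],    # always succeeds
    "task_b": [True, False, True, False],  # succeeds half the time
}

for k in (1, 2):
    print(f"pass^{k} = {benchmark_pass_hat_k(example_results, k):.3f}")
```

With this sample data, pass^1 is 0.750 and pass^2 falls to about 0.583, because the inconsistent task rarely succeeds twice in a row. That drop as k grows is the reliability gap the metric is meant to expose.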

Methodology

Imported from llm-stats public benchmark metadata and not verified by llm-stats. Modality: text. Language: en. Categories: agents, general, reasoning, tool_calling. Max score: 1 (leaderboard scores are shown as percentages of this maximum).

Leaderboard

  1. Step-3.5-Flash: 88.2% (self-reported, llm-stats)
  2. GLM-4.7: 87.4% (self-reported, llm-stats)
  3. MiMo-V2-Flash: 80.3% (self-reported, llm-stats)
  4. GLM-4.7-Flash: 79.5% (self-reported, llm-stats)
  5. MiniMax M2: 77.2% (self-reported, llm-stats)
  6. o3: 63.0% (self-reported, llm-stats)