TAU-bench Retail


A benchmark for evaluating tool-agent-user interaction in retail environments. It tests a language agent's ability to handle dynamic conversations with users while calling domain-specific API tools and following policy guidelines. Tasks include order cancellations, address changes, and order status checks, each carried out through a multi-turn conversation.
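To make the task shape concrete, here is a minimal Python sketch of a retail tool environment with a policy check. Everything in it is an assumption made for illustration: the names `RetailEnv`, `Order`, `run_episode`, and `task_passed`, the rule that only pending orders can be changed, and the idea of scoring by final state are hypothetical and are not taken from the benchmark's actual code or API.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    status: str   # e.g. "pending", "shipped", "cancelled"
    address: str


@dataclass
class RetailEnv:
    """Toy stand-in for a retail backend (hypothetical, not the real benchmark)."""
    orders: dict = field(default_factory=dict)

    def get_order_status(self, order_id: str) -> str:
        return self.orders[order_id].status

    def cancel_order(self, order_id: str) -> str:
        order = self.orders[order_id]
        # Assumed policy rule: only pending orders may be cancelled.
        if order.status != "pending":
            return "refused: only pending orders can be cancelled"
        order.status = "cancelled"
        return "ok: order cancelled"

    def change_address(self, order_id: str, new_address: str) -> str:
        order = self.orders[order_id]
        # Assumed policy rule: the address is fixed once an order leaves "pending".
        if order.status != "pending":
            return "refused: address can only change on pending orders"
        order.address = new_address
        return "ok: address updated"


def run_episode(env: RetailEnv, tool_calls: list) -> list:
    """Apply a scripted list of (tool_name, kwargs) pairs and collect the replies."""
    results = []
    for name, kwargs in tool_calls:
        tool = getattr(env, name)
        results.append(tool(**kwargs))
    return results


def task_passed(env: RetailEnv, expected_status: dict) -> int:
    """Return 1 if every order ends in its expected status, else 0 (assumed scoring)."""
    ok = all(env.orders[oid].status == status for oid, status in expected_status.items())
    return 1 if ok else 0


if __name__ == "__main__":
    env = RetailEnv(orders={
        "A1": Order("A1", "pending", "1 Main St"),
        "B2": Order("B2", "shipped", "2 Oak Ave"),
    })
    replies = run_episode(env, [
        ("cancel_order", {"order_id": "A1"}),
        ("cancel_order", {"order_id": "B2"}),
    ])
    print(replies)
    print(task_passed(env, {"A1": "cancelled", "B2": "shipped"}))
```

In this sketch the policy lives inside the tools, so a refused call leaves the state untouched. A real agent would also have to decide from the conversation which tool to call and when to decline a request.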

Methodology

Metadata imported from the public llm-stats benchmark listing. Modality: text. Max score: 1 (the leaderboard shows scores as percentages). Categories: communication, reasoning, tool_calling. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Claude Sonnet 4.5: 86.2% (self-reported, llm-stats)
  2. Claude Opus 4.1: 82.4% (self-reported, llm-stats)
  3. Claude Opus 4: 81.4% (self-reported, llm-stats)
  4. Claude 3.7 Sonnet: 81.2% (self-reported, llm-stats)
  5. Claude Sonnet 4: 80.5% (self-reported, llm-stats)
  6. GLM-4.5: 79.7% (self-reported, llm-stats)
  7. GLM-4.5-Air: 77.9% (self-reported, llm-stats)
  8. o4-mini: 71.8% (self-reported, llm-stats)
  9. o1: 70.8% (self-reported, llm-stats)
  10. Qwen3-Next-80B-A3B-Thinking: 69.6% (self-reported, llm-stats)
  11. Claude 3.5 Sonnet: 69.2% (self-reported, llm-stats)
  12. GPT-4.5: 68.4% (self-reported, llm-stats)
  13. GPT-4.1: 68.0% (self-reported, llm-stats)
  14. GPT OSS 120B: 67.8% (self-reported, llm-stats)
  15. Qwen3-235B-A22B-Thinking-2507: 67.8% (self-reported, llm-stats)
  16. MiniMax M1 40K: 67.8% (self-reported, llm-stats)
  17. MiniMax M1 80K: 63.5% (self-reported, llm-stats)
  18. Qwen3-Next-80B-A3B-Instruct: 60.9% (self-reported, llm-stats)
  19. GPT-4o: 60.3% (self-reported, llm-stats)
  20. o3-mini: 57.6% (self-reported, llm-stats)