Tau2 Retail

The τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment, where both the agent and the user can interact with tools. It tests tool-agent-user interaction, rule adherence, and task consistency in retail customer support contexts.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, reasoning, tool_calling. Language: en. Verified by llm-stats: no.
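
The metadata gives a maximum score of 1, while the leaderboard entries are shown as percentages. A minimal sketch of that conversion follows, assuming the displayed figures are raw scores on a 0 to 1 scale multiplied by 100 (the source does not state this explicitly). The function name `to_percent` is hypothetical and is not part of llm-stats or τ²-bench.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score on a 0..max_score scale as a one-decimal percentage.

    Assumption: leaderboard percentages are raw scores scaled by 100.
    """
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} is outside the range 0..{max_score}")
    return f"{100 * score / max_score:.1f}%"


# Example: a raw score of 0.933 would be displayed as a percentage string.
print(to_percent(0.933))
```

Under this assumption, a score of exactly 1 would correspond to an agent that completes every retail task successfully.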

Leaderboard

  1. Claude Haiku 4.5 self-reported llm-stats
    93.3%
  2. Claude Opus 4.6 self-reported llm-stats
    91.9%
  3. Claude Sonnet 4.6 self-reported llm-stats
    91.7%
  4. Claude Opus 4.5 self-reported llm-stats
    88.9%
  5. LongCat-Flash-Thinking-2601 self-reported llm-stats
    88.6%
  6. Claude Haiku 4.5 self-reported llm-stats
    83.2%
  7. GPT-5.2 self-reported llm-stats
    82.0%
  8. GPT-5 self-reported llm-stats
    81.1%
  9. o3 self-reported llm-stats
    80.2%
  10. Nova 2 Omni self-reported llm-stats
    78.3%
  11. GPT-5.1 self-reported llm-stats
    77.9%
  12. GPT-5.1 Instant self-reported llm-stats
    77.9%
  13. GPT-5.1 Thinking self-reported llm-stats
    77.9%
  14. Nova 2 Pro self-reported llm-stats
    77.7%
  15. Nova 2 Lite self-reported llm-stats
    76.5%
  16. LongCat-Flash-Lite self-reported llm-stats
    73.1%
  17. Qwen3-235B-A22B-Thinking-2507 self-reported llm-stats
    71.9%
  18. LongCat-Flash-Thinking self-reported llm-stats
    71.5%
  19. Qwen3-235B-A22B-Instruct-2507 self-reported llm-stats
    71.3%
  20. LongCat-Flash-Chat self-reported llm-stats
    71.3%