Tau2 Telecom


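The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP (decentralized partially observable Markov decision process). Both the agent and the user call tools that act on a shared telecommunications troubleshooting scenario, so solving a task tests the agent's ability to coordinate and communicate with the user, not just its own tool use.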
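To make "dual control" concrete, here is a minimal Python sketch of a shared state that neither party can fix alone. It is an illustration only, not the τ²-Bench API: the `World` class, the tool functions, and the airplane-mode and suspended-line scenario are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Shared state; each party observes and controls only part of it."""
    airplane_mode: bool = True    # device side: only the user can change it
    line_suspended: bool = True   # carrier side: only the agent can change it

    def has_service(self) -> bool:
        return not self.airplane_mode and not self.line_suspended


# Hypothetical user-side tools (act on the device).
def user_check_status_bar(world: World) -> str:
    return "airplane mode on" if world.airplane_mode else "airplane mode off"


def user_toggle_airplane_mode(world: World) -> None:
    world.airplane_mode = not world.airplane_mode


# Hypothetical agent-side tools (act on the carrier backend).
def agent_get_line_status(world: World) -> str:
    return "suspended" if world.line_suspended else "active"


def agent_resume_line(world: World) -> None:
    world.line_suspended = False


def run_episode() -> tuple[World, list[str]]:
    """Scripted episode: the agent fixes its side, then guides the user."""
    world = World()
    transcript = []

    # The agent can inspect and repair the carrier side directly.
    if agent_get_line_status(world) == "suspended":
        agent_resume_line(world)
        transcript.append("agent: resumed line")

    # The agent cannot see the device, so it must ask the user to look.
    transcript.append("agent: please check your status bar")
    observation = user_check_status_bar(world)
    transcript.append(f"user: {observation}")

    # Only the user can act on the device; the agent can only instruct.
    if observation == "airplane mode on":
        transcript.append("agent: please turn off airplane mode")
        user_toggle_airplane_mode(world)
        transcript.append("user: done")

    return world, transcript
```

In this toy setup, calling `agent_resume_line` alone leaves the user without service; the episode succeeds only when the agent also communicates the device-side step to the user, which is the coordination the benchmark description refers to.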

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, reasoning, tool_calling. Language: en. Verified by llm-stats: no.
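The metadata gives a maximum score of 1, while the leaderboard shows percentages. Assuming the displayed percentage is simply the raw score divided by the maximum score (an assumption, since the page does not state the conversion), the mapping could be sketched as follows; `to_percent` is a hypothetical helper name.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of max_score (assumed conversion)."""
    return f"{100 * score / max_score:.1f}%"
```

Under that assumption, a raw score of 0.993 would be displayed as 99.3%.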

Leaderboard

  1. Claude Opus 4.6: 99.3% (self-reported, llm-stats)
  2. LongCat-Flash-Thinking-2601: 99.3% (self-reported, llm-stats)
  3. GPT-5.4: 98.9% (self-reported, llm-stats)
  4. GPT-5.2: 98.7% (self-reported, llm-stats)
  5. Claude Opus 4.5: 98.2% (self-reported, llm-stats)
  6. GPT-5.5: 98.0% (self-reported, llm-stats)
  7. Claude Sonnet 4.6: 97.9% (self-reported, llm-stats)
  8. MiMo-V2-Pro: 96.8% (self-reported, llm-stats)
  9. GPT-5: 96.7% (self-reported, llm-stats)
  10. GPT-5.1: 95.6% (self-reported, llm-stats)
  11. GPT-5.1 Instant: 95.6% (self-reported, llm-stats)
  12. GPT-5.1 Thinking: 95.6% (self-reported, llm-stats)
  13. GPT-5.4 mini: 93.4% (self-reported, llm-stats)
  14. Nova 2 Pro: 92.7% (self-reported, llm-stats)
  15. GPT-5.4 nano: 92.5% (self-reported, llm-stats)
  16. Muse Spark: 91.5% (self-reported, llm-stats)
  17. MiniMax M2: 87.0% (self-reported, llm-stats)
  18. MiniMax M2.1: 87.0% (self-reported, llm-stats)
  19. LongCat-Flash-Thinking: 83.1% (self-reported, llm-stats)
  20. Claude Haiku 4.5: 83.0% (self-reported, llm-stats)