Toolathlon

Toolathlon (Tool Decathlon) is a benchmark for evaluating how well AI agents use multiple tools across diverse task categories. It measures tool selection, sequencing, and execution across ten tool-use scenarios.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (the leaderboard reports scores as percentages). Categories: agents, reasoning, tool_calling. Language: en. Verified by llm-stats: no.
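The metadata gives a maximum score of 1, while the leaderboard shows percentages. The sketch below shows one way to relate the two, assuming the percentage is simply the raw score divided by the maximum score and multiplied by 100. That linear mapping, the helper name `to_percent`, and the sample value 0.599 are illustrative assumptions and do not come from the llm-stats metadata.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage string with one decimal place.

    Assumes a linear scale from 0 to max_score (an assumption, not a
    documented llm-stats convention).
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return f"{score / max_score * 100:.1f}%"


# Hypothetical raw score on the 0-1 scale.
print(to_percent(0.599))
```

Under this assumption, a raw score of 0.599 would be displayed as 59.9%.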

Leaderboard

  All scores are self-reported; source: llm-stats.

  1. Claude Opus 4.8: 59.9%
  2. Gemini 3.5 Flash: 56.5%
  3. GPT-5.5: 55.6%
  4. GPT-5.4: 54.6%
  5. DeepSeek-V4-Pro-Max: 51.8%
  6. Kimi K2.6: 50.0%
  7. Gemini 3 Flash: 49.4%
  8. DeepSeek-V4-Flash-Max: 47.8%
  9. GPT-5.2: 46.3%
  10. MiniMax M2.7: 46.3%
  11. MiniMax M2.1: 43.5%
  12. GPT-5.4 mini: 42.9%
  13. GLM-5.1: 40.7%
  14. Qwen3.6 Plus: 39.8%
  15. Qwen3.5-397B-A17B: 38.3%
  16. GPT-5.4 nano: 35.5%
  17. DeepSeek-V3.2 (Thinking): 35.2%
  18. DeepSeek-V3.2: 35.2%
  19. DeepSeek-V3.2-Speciale: 35.2%