t2-bench

reasoning

t2-bench is a benchmark for evaluating agentic tool use. It measures how well models select, sequence, and use tools to solve complex tasks, testing autonomous planning and execution in multi-step scenarios.

Leaderboard

Showing 20 of 23 results

Model                        Score
Gemini 3.1 Pro               99.3%
Gemini 3 Flash               90.2%
GLM-5                        89.7%
Qwen3.5-397B-A17B            86.7%
Gemma 4 31B                  86.4%
Gemma 4 26B-A4B              85.5%
Gemini 3 Pro                 85.4%
Qwen3.5-35B-A3B              81.2%
DeepSeek-V3.2                80.3%
DeepSeek-V3.2-Speciale       80.3%
DeepSeek-V3.2 (Thinking)     80.2%
Qwen3.5-4B                   79.9%
Qwen3.5-122B-A10B            79.5%
Qwen3.5-9B                   79.1%
Qwen3.5-27B                  79.0%
Qwen3 Max                    74.8%
K-EXAONE-236B-A23B           73.2%
GPT OSS 120B High            63.9%
Gemma 4 E4B                  57.5%
DiffusionGemma 26B-A4B       56.2%