BFCL-v3

reasoning

The Berkeley Function Calling Leaderboard v3 (BFCL-v3) is a benchmark that evaluates the function-calling capabilities of large language models in multi-turn and multi-step interactions. Multi-turn test cases extend the conversation over several exchanges, so the model must retain context from earlier turns; multi-step test cases require the model to make several function calls internally to satisfy a single complex user request. The benchmark includes 1,000 test cases across domains such as vehicle control, trading bots, travel booking, and file system management. It uses state-based evaluation, which verifies both the resulting changes to system state and the correctness of the execution path.
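To make the idea of state-based evaluation concrete, here is a minimal sketch in Python. It is not the benchmark's actual harness: the ToyFileSystem class, the call format, and the state_check and path_check helpers are all hypothetical names invented for illustration. The sketch replays a model's function calls against a toy environment, compares the final state with the state produced by a reference sequence, and separately checks whether the reference calls appear in order in the model's call sequence.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileSystem:
    """Hypothetical stand-in for a benchmark environment (not BFCL's API)."""
    files: dict = field(default_factory=dict)

    def touch(self, name: str) -> None:
        # Create an empty file if it does not exist yet.
        self.files.setdefault(name, "")

    def write(self, name: str, text: str) -> None:
        if name not in self.files:
            raise FileNotFoundError(name)
        self.files[name] = text

    def mv(self, src: str, dst: str) -> None:
        self.files[dst] = self.files.pop(src)


def run_calls(calls):
    """Apply (function_name, kwargs) pairs to a fresh environment; return its final state."""
    env = ToyFileSystem()
    for name, kwargs in calls:
        getattr(env, name)(**kwargs)
    return env.files


def state_check(model_calls, reference_calls) -> bool:
    """True if both call sequences leave the environment in the same state."""
    return run_calls(model_calls) == run_calls(reference_calls)


def path_check(model_calls, reference_calls) -> bool:
    """True if the reference calls occur, in order, within the model's calls."""
    remaining = iter(model_calls)
    return all(any(call == ref for call in remaining) for ref in reference_calls)


# Assumed reference trajectory for a request such as
# "create a.txt, write 'hi' into it, then rename it to b.txt".
reference = [
    ("touch", {"name": "a.txt"}),
    ("write", {"name": "a.txt", "text": "hi"}),
    ("mv", {"src": "a.txt", "dst": "b.txt"}),
]

# A model that follows the reference exactly.
model_good = list(reference)

# A model that reaches the same final state by a different route.
model_shortcut = [
    ("touch", {"name": "b.txt"}),
    ("write", {"name": "b.txt", "text": "hi"}),
]

print(state_check(model_good, reference), path_check(model_good, reference))
print(state_check(model_shortcut, reference), path_check(model_shortcut, reference))
```

In this toy setup the shortcut trajectory passes the state check but fails the path check, which illustrates why an evaluator might look at both. How the real benchmark defines and weighs these checks is not specified here, so treat the two helper functions as one possible reading rather than its definition.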

Leaderboard

Model                             Score
GLM-4.5                           77.8%
GLM-4.5-Air                       76.4%
LongCat-Flash-Thinking            74.4%
Qwen3-Next-80B-A3B-Thinking       72.0%
MAI-Thinking-1                    72.0%
Qwen3-235B-A22B-Thinking-2507     71.9%
Qwen3 VL 235B A22B Thinking       71.9%
Qwen3 VL 32B Thinking             71.7%
Qwen3-235B-A22B-Instruct-2507     70.9%
Qwen3-Next-80B-A3B-Instruct       70.3%
Qwen3 VL 32B Instruct             70.2%
Qwen3-Coder 480B A35B Instruct    68.7%
Qwen3 VL 30B A3B Thinking         68.6%
Qwen3 VL 235B A22B Instruct       67.7%
Qwen3 VL 4B Thinking              67.3%
Qwen3 VL 30B A3B Instruct         66.3%
Qwen3 VL 8B Instruct              66.3%
Qwen3 VL 4B Instruct              63.3%
Qwen3 VL 8B Thinking              63.0%