BFCL-v3

Berkeley Function Calling Leaderboard v3 (BFCL-v3) is a benchmark that evaluates the function-calling capabilities of large language models in multi-turn and multi-step interactions. It uses extended conversations in which a model must retain context across turns and may need to execute several internal function calls to satisfy a single complex user request. The benchmark includes 1,000 test cases across domains such as vehicle control, trading bots, travel booking, and file system management. Scoring is state-based: it checks both the resulting system state and the correctness of the execution path.
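The state-based evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `FileSystem` backend, its methods, and the `run_calls` and `evaluate` helpers are hypothetical names. Treating "execution path correctness" as an in-order subsequence check is also an assumption made for the sake of the example.

```python
from dataclasses import dataclass, field


@dataclass
class FileSystem:
    """Hypothetical toy backend: a flat mapping of file name -> contents."""
    files: dict = field(default_factory=dict)

    def touch(self, name: str) -> None:
        # Create an empty file if it does not exist yet.
        self.files.setdefault(name, "")

    def write(self, name: str, text: str) -> None:
        self.files[name] = text

    def rm(self, name: str) -> None:
        self.files.pop(name, None)


def run_calls(calls):
    """Execute a list of (function_name, kwargs) calls on a fresh backend."""
    fs = FileSystem()
    for name, kwargs in calls:
        getattr(fs, name)(**kwargs)
    return fs


def evaluate(model_calls, reference_calls):
    """Return (state_ok, path_ok) for one test case.

    state_ok: the final backend state matches the state of the reference run.
    path_ok:  every reference call appears in the model's calls, in order
              (extra model calls are tolerated; this rule is an assumption).
    """
    state_ok = run_calls(model_calls) == run_calls(reference_calls)
    remaining = iter(model_calls)  # consumed as matches are found
    path_ok = all(call in remaining for call in reference_calls)
    return state_ok, path_ok


reference = [
    ("touch", {"name": "a.txt"}),
    ("write", {"name": "a.txt", "text": "hi"}),
]
# Reaches the same state with a detour through b.txt.
detour = [
    ("touch", {"name": "a.txt"}),
    ("touch", {"name": "b.txt"}),
    ("rm", {"name": "b.txt"}),
    ("write", {"name": "a.txt", "text": "hi"}),
]
# Reaches the same state but skips a reference step.
shortcut = [("write", {"name": "a.txt", "text": "hi"})]

print(evaluate(detour, reference))
print(evaluate(shortcut, reference))
```

The `shortcut` case shows why the two checks are kept separate in this sketch: a model can land on the right final state while skipping a step the reference path requires, so a state comparison alone would not catch it.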

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: agents, finance, general, reasoning, structured_output, tool_calling. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GLM-4.5: 77.8% (self-reported, llm-stats)
  2. GLM-4.5-Air: 76.4% (self-reported, llm-stats)
  3. LongCat-Flash-Thinking: 74.4% (self-reported, llm-stats)
  4. Qwen3-Next-80B-A3B-Thinking: 72.0% (self-reported, llm-stats)
  5. MAI-Thinking-1: 72.0% (self-reported, llm-stats)
  6. Qwen3-235B-A22B-Thinking-2507: 71.9% (self-reported, llm-stats)
  7. Qwen3 VL 235B A22B Thinking: 71.9% (self-reported, llm-stats)
  8. Qwen3 VL 32B Thinking: 71.7% (self-reported, llm-stats)
  9. Qwen3-235B-A22B-Instruct-2507: 70.9% (self-reported, llm-stats)
  10. Qwen3-Next-80B-A3B-Instruct: 70.3% (self-reported, llm-stats)
  11. Qwen3 VL 32B Instruct: 70.2% (self-reported, llm-stats)
  12. Qwen3 VL 30B A3B Thinking: 68.6% (self-reported, llm-stats)
  13. Qwen3 VL 235B A22B Instruct: 67.7% (self-reported, llm-stats)
  14. Qwen3 VL 4B Thinking: 67.3% (self-reported, llm-stats)
  15. Qwen3 VL 30B A3B Instruct: 66.3% (self-reported, llm-stats)
  16. Qwen3 VL 8B Instruct: 66.3% (self-reported, llm-stats)
  17. Qwen3 VL 4B Instruct: 63.3% (self-reported, llm-stats)
  18. Qwen3 VL 8B Thinking: 63.0% (self-reported, llm-stats)