BFCL


The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive, executable function-calling evaluation dedicated to assessing the ability of large language models (LLMs) to invoke functions. It evaluates serial and parallel function calls in Python, Java, and JavaScript, as well as REST API calls, using a novel Abstract Syntax Tree (AST) evaluation method. The benchmark consists of over 2,000 question-function-answer pairs covering diverse application domains and complex use cases, including multiple function calls, parallel function calls, and multi-turn interactions.
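To make the idea of AST-based evaluation concrete, the following is a minimal sketch in Python, not BFCL's actual checker. It assumes the model emits a single Python-style call with keyword arguments only, and that the expected answer is given as a list of acceptable values per parameter. The names `parse_call`, `call_matches`, `OPTIONAL`, and the `get_weather` example are hypothetical and chosen for illustration.

```python
import ast

# Hypothetical sentinel: put it in a parameter's list to mark that parameter optional.
OPTIONAL = object()


def parse_call(source: str):
    """Parse one Python-style call such as f(a=1, b="x") into (name, kwargs)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("this sketch only supports keyword arguments")
    name = ast.unparse(node.func)  # also handles dotted names like math.sqrt
    kwargs = {}
    for kw in node.keywords:
        if kw.arg is None:  # a **mapping argument, not supported here
            raise ValueError("argument unpacking is not supported")
        # literal_eval only accepts literals, so no model output is executed
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    return name, kwargs


def call_matches(source: str, expected_name: str, allowed: dict) -> bool:
    """Return True if the call names the right function and every argument
    takes one of the acceptable values listed for it in `allowed`."""
    try:
        name, kwargs = parse_call(source)
    except (SyntaxError, ValueError):
        return False  # unparseable or non-literal output counts as a miss
    if name != expected_name:
        return False
    if any(param not in allowed for param in kwargs):
        return False  # the model passed a parameter that is not expected
    for param, values in allowed.items():
        if param not in kwargs:
            if OPTIONAL not in values:
                return False  # a required parameter is missing
            continue
        if kwargs[param] not in values:
            return False
    return True


expected = {
    "city": ["Berkeley", "Berkeley, CA"],
    "unit": ["celsius", OPTIONAL],
}
print(call_matches('get_weather(city="Berkeley", unit="celsius")', "get_weather", expected))
print(call_matches('get_weather(city="Oakland")', "get_weather", expected))
```

Comparing parsed structure rather than raw strings means differences in whitespace, quoting style, or argument order do not affect the result. A fuller checker would also need to handle positional arguments, type coercion, and lists of parallel calls, which this sketch leaves out.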

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, reasoning, tool_calling. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Llama 3.1 405B Instruct: 88.5% (self-reported, llm-stats)
  2. Llama 3.1 70B Instruct: 84.8% (self-reported, llm-stats)
  3. Llama 3.1 8B Instruct: 76.1% (self-reported, llm-stats)
  4. Nova 2 Sonic: 74.5% (self-reported, llm-stats)
  5. Qwen3 235B A22B: 70.8% (self-reported, llm-stats)
  6. Qwen3 32B: 70.3% (self-reported, llm-stats)
  7. Qwen3 30B A3B: 69.1% (self-reported, llm-stats)
  8. Nova Pro: 68.4% (self-reported, llm-stats)
  9. Nova Lite: 66.6% (self-reported, llm-stats)
  10. QwQ-32B: 66.4% (self-reported, llm-stats)
  11. Nova Micro: 56.2% (self-reported, llm-stats)