BFCL v2

The Berkeley Function Calling Leaderboard (BFCL) v2 is a benchmark for evaluating the function-calling capabilities of large language models. It comprises 2,251 question-function-answer pairs built from enterprise and OSS-contributed functions, and it addresses data contamination and bias by drawing on live, user-contributed scenarios. The benchmark reports AST accuracy, executable accuracy, irrelevance detection, and relevance detection across multiple programming languages (Python, Java, JavaScript), and it includes complex real-world function-calling scenarios with multilingual prompts.
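To make the idea of AST accuracy concrete, here is a minimal sketch of how a Python-style function call emitted by a model could be parsed and compared against an expected answer. It is an illustration under stated assumptions, not the official BFCL checker: the helper names (`parse_call`, `ast_match`), the `get_weather` function, and the "allowed values per parameter" answer format are all hypothetical, and positional arguments are rejected to keep the example short.

```python
import ast


def parse_call(call_src):
    """Parse one Python-style call string into (function_name, keyword_args).

    Simplifying assumption: only keyword arguments with literal values
    are supported.
    """
    node = ast.parse(call_src.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("positional arguments are not supported in this sketch")
    name = ast.unparse(node.func)  # also handles dotted names such as math.sqrt
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def ast_match(call_src, expected_name, allowed_values, required):
    """Return True if the call names the expected function, supplies every
    required parameter, and uses only allowed values for each parameter.

    `allowed_values` maps parameter name -> list of acceptable values
    (a hypothetical answer format chosen for this example).
    """
    try:
        name, kwargs = parse_call(call_src)
    except (SyntaxError, ValueError):
        return False  # output is not a well-formed call
    if name != expected_name:
        return False
    if any(param not in kwargs for param in required):
        return False
    return all(
        key in allowed_values and value in allowed_values[key]
        for key, value in kwargs.items()
    )


# Hypothetical expected answer for a weather question.
allowed = {"city": ["Paris"], "unit": ["celsius", "metric"]}

print(ast_match('get_weather(city="Paris", unit="celsius")',
                "get_weather", allowed, ["city"]))
print(ast_match('get_forecast(city="Paris")',
                "get_weather", allowed, ["city"]))
```

The first call satisfies the check and the second fails on the function name. Comparing the parsed structure rather than the raw string means that argument order and whitespace do not affect the result. Executable accuracy, by contrast, would require actually running the call and inspecting its result.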

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, reasoning, tool_calling. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Llama 3.3 70B Instruct: 77.3% (self-reported, via llm-stats)
  2. Llama 3.1 Nemotron Ultra 253B v1: 74.1% (self-reported, via llm-stats)
  3. Llama-3.3 Nemotron Super 49B v1: 73.7% (self-reported, via llm-stats)
  4. Llama 3.2 3B Instruct: 67.0% (self-reported, via llm-stats)
  5. Llama 3.1 Nemotron Nano 8B V1: 63.6% (self-reported, via llm-stats)