BFCL v2

reasoning

Berkeley Function Calling Leaderboard (BFCL) v2 is a benchmark for evaluating the function-calling capabilities of large language models. It contains 2,251 question-function-answer pairs built from enterprise and open-source (OSS) contributed functions, and it uses live, user-contributed scenarios to reduce data contamination and bias. Models are scored on AST accuracy, executable accuracy, irrelevance detection, and relevance detection across Python, Java, and JavaScript, and the test set includes complex real-world function-calling scenarios with multilingual prompts.
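To make the idea of AST accuracy more concrete, the sketch below shows one way an AST-based check could work: a model's output is parsed as a Python call expression, and its function name and keyword arguments are compared against a set of acceptable values. This is a simplified illustration, not the benchmark's actual checker. The helper names (parse_call, ast_match), the "allowed values" format, the use of None to mark an optional parameter, and the get_weather example are all assumptions made for this example.

```python
import ast


def parse_call(source: str):
    """Parse a single Python call expression into (function_name, kwargs)."""
    tree = ast.parse(source.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a single function call")
    if call.args:
        # Assumption: this sketch only handles keyword arguments.
        raise ValueError("positional arguments are not handled here")
    name = ast.unparse(call.func)  # requires Python 3.9+
    # literal_eval raises ValueError for anything that is not a plain literal.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, kwargs


def ast_match(source: str, expected_name: str, allowed: dict) -> bool:
    """Return True if the call matches the expected name and allowed values.

    `allowed` maps each parameter to a list of acceptable values.
    Including None in that list marks the parameter as optional
    (a convention assumed for this sketch).
    """
    try:
        name, kwargs = parse_call(source)
    except (SyntaxError, ValueError):
        return False  # output is not a well-formed call
    if name != expected_name:
        return False
    if set(kwargs) - set(allowed):
        return False  # the model supplied a parameter that is not expected
    for param, options in allowed.items():
        if param not in kwargs:
            if None not in options:
                return False  # required parameter is missing
            continue
        if kwargs[param] not in options:
            return False
    return True


# Hypothetical test case: function name and acceptable values are made up.
spec = {"city": ["Paris"], "unit": ["celsius", "C", None]}
print(ast_match('get_weather(city="Paris", unit="celsius")', "get_weather", spec))
print(ast_match('get_weather(city="London")', "get_weather", spec))
```

Because the comparison works on the parsed structure rather than the raw string, differences in whitespace or quoting style do not affect the result, while a wrong function name, an unexpected parameter, or an out-of-range value does.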

Leaderboard

Model                               Score
Llama 3.3 70B Instruct              77.3%
Llama 3.1 Nemotron Ultra 253B v1    74.1%
Llama-3.3 Nemotron Super 49B v1     73.7%
Llama 3.2 3B Instruct               67.0%
Llama 3.1 Nemotron Nano 8B V1       63.6%