BigCodeBench

reasoning

BigCodeBench challenges LLMs to solve 1,140 fine-grained programming tasks by composing multiple function calls, used as tools, from 139 libraries across 7 domains. It evaluates code generation that requires diverse function calls and complex instructions, and comes in two variants: Complete (code completion from comprehensive docstrings) and Instruct (code generation from natural-language instructions).
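To make the difference between the two variants concrete, here is a minimal sketch of how one task might be turned into a Complete prompt and an Instruct prompt. The `Task` record, its field names, the example task, and the helper functions are all hypothetical illustrations, not the benchmark's actual schema or harness. The `percent_solved` helper assumes a leaderboard score is simply the share of tasks solved, which the description above does not state.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    signature: str    # function header the model must implement
    docstring: str    # detailed description, used by the Complete variant
    instruction: str  # natural-language request, used by the Instruct variant


def complete_prompt(task: Task) -> str:
    """Complete variant: the model continues code after a comprehensive docstring."""
    return f'{task.signature}\n    """{task.docstring}"""\n'


def instruct_prompt(task: Task) -> str:
    """Instruct variant: the model sees only a natural-language instruction."""
    return task.instruction


def percent_solved(results: list[bool]) -> float:
    """Assumed scoring: share of solved tasks as a percentage, one decimal place."""
    if not results:
        return 0.0
    return round(100.0 * sum(results) / len(results), 1)


# Made-up example task, not taken from the benchmark.
task = Task(
    signature="def count_words(path: str) -> dict:",
    docstring="Read the text file at `path` and return a word-frequency "
              "mapping built with collections.Counter.",
    instruction="Write a function that reads a text file and returns how "
                "often each word appears.",
)

print(complete_prompt(task))
print(instruct_prompt(task))
```

In this sketch the Complete prompt ends where the function body would start, so the model finishes the code, while the Instruct prompt leaves the function name, signature, and library choice to the model.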

Leaderboard

Gemini Diffusion

45.4%

Qwen2.5-Coder 7B Instruct

41.0%
