BigCodeBench-Hard

reasoning

BigCodeBench-Hard is a subset of 148 challenging programming tasks from BigCodeBench, designed to evaluate how well large language models solve complex, real-world programming problems. Each task requires diverse function calls from multiple libraries across 7 domains, including computation, networking, data analysis, and visualization. The benchmark tests compositional reasoning and the ability to implement complex instructions; its tasks span 139 libraries, with an average of 2.8 libraries per task.

Leaderboard


Qwen2.5-Coder 32B Instruct

27.0%
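
To put the score in context, here is a rough sketch of how many tasks it might correspond to. It assumes the percentage is simply the share of the 148 tasks that were solved; the page does not state the exact metric, so treat this as an illustration only. The helper names `implied_solved` and `score_percent` are hypothetical.

```python
# Assumption: the leaderboard score is (tasks solved / total tasks) * 100.
# The page does not state the metric, so this is only an illustration.

TOTAL_TASKS = 148       # task count given in the benchmark description
REPORTED_SCORE = 27.0   # percent, as shown on the leaderboard


def implied_solved(total: int, percent: float) -> int:
    """Nearest whole number of solved tasks consistent with a percentage."""
    return round(total * percent / 100)


def score_percent(solved: int, total: int) -> float:
    """Share of tasks solved, as a percentage rounded to one decimal place."""
    return round(100 * solved / total, 1)


solved = implied_solved(TOTAL_TASKS, REPORTED_SCORE)
print(solved, score_percent(solved, TOTAL_TASKS))
```

Under that assumption, the reported score would correspond to roughly 40 of the 148 tasks.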
