ACEBench
ACEBench is a benchmark for evaluating the tool-use capabilities of large language models across three evaluation types: Normal (basic tool-use scenarios), Special (tool use with ambiguous or incomplete instructions), and Agent (multi-agent interactions that simulate real-world dialogues). It covers 4,538 APIs across 8 major domains and 68 sub-domains, including technology, finance, entertainment, society, health, culture, and environment, and supports both English and Chinese.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: finance, general, healthcare, reasoning, tool_calling. Language: en. Verified by llm-stats: no.
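As an illustration only, the metadata fields above could be held in a small typed record. The class name `BenchmarkMetadata`, its field names, and the `normalize` helper are assumptions made for this sketch and are not part of ACEBench or of any llm-stats API; the values are copied from the metadata line above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkMetadata:
    """Hypothetical container for the metadata fields listed above."""
    name: str
    modality: str
    max_score: float
    categories: tuple[str, ...]
    language: str
    verified: bool

    def normalize(self, raw_score: float) -> float:
        """Scale a raw score into [0, 1] relative to max_score."""
        if not 0 <= raw_score <= self.max_score:
            raise ValueError(
                f"score {raw_score} is outside [0, {self.max_score}]"
            )
        return raw_score / self.max_score


# Values taken from the metadata line; the structure itself is an assumption.
ACEBENCH = BenchmarkMetadata(
    name="ACEBench",
    modality="text",
    max_score=1.0,
    categories=("finance", "general", "healthcare", "reasoning", "tool_calling"),
    language="en",
    verified=False,
)
```

Because the maximum score is 1, `normalize` leaves a valid score unchanged and mainly serves as a range check before scores from different benchmarks are compared.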