Nexus

The NexusRaven benchmark evaluates the zero-shot function-calling capabilities of large language models across cybersecurity tools and API interactions.

Leaderboard

Model                      Score
Llama 3.1 405B Instruct    58.7%
Llama 3.1 70B Instruct     56.7%
Llama 3.1 8B Instruct      38.5%
Llama 3.2 3B Instruct      34.3%