HellaSwag

reasoning

A challenging commonsense natural language inference dataset. It uses Adversarial Filtering to produce questions that are trivial for humans (>95% accuracy) but were difficult for state-of-the-art models at the time of its release. Each question asks for the most plausible ending to a sentence describing a physical situation or everyday activity.
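
To make the task format concrete, here is a minimal sketch of how a multiple-choice benchmark of this kind might be scored: each candidate ending is scored against the context, the highest-scoring ending is the prediction, and accuracy is the fraction of correct predictions. The field names (`ctx`, `endings`, `label`), the two example items, and the `toy_score` function are illustrative assumptions, not the official evaluation code. A real evaluation would replace `toy_score` with a language model's likelihood of each ending.

```python
from typing import Callable, Dict, List

# Hypothetical item layout (an assumption for illustration):
# a context, four candidate endings, and the index of the correct ending.
Item = Dict[str, object]
Scorer = Callable[[str, str], float]


def toy_score(context: str, ending: str) -> float:
    """Placeholder scorer: counts ending words that also appear in the context.

    This stands in for a model's likelihood of the ending given the context.
    """
    ctx_words = set(context.lower().split())
    return float(sum(1 for w in ending.lower().split() if w in ctx_words))


def predict(item: Item, score: Scorer) -> int:
    """Return the index of the highest-scoring ending (first one wins ties)."""
    scores = [score(item["ctx"], e) for e in item["endings"]]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(items: List[Item], score: Scorer) -> float:
    """Fraction of items where the predicted ending matches the label."""
    correct = sum(predict(it, score) == it["label"] for it in items)
    return correct / len(items)


# Two made-up items in the assumed layout.
items: List[Item] = [
    {
        "ctx": "A woman fills a bucket with water and picks up a sponge. She",
        "endings": [
            "starts to wash the car with the sponge.",
            "flies away on a bicycle.",
            "eats the bucket slowly.",
            "paints the sky green.",
        ],
        "label": 0,
    },
    {
        "ctx": "A man is holding a guitar on a stage. He",
        "endings": [
            "throws the guitar into a guitar case on the stage floor.",
            "begins to play a song.",
            "sings under the water.",
            "folds a newspaper.",
        ],
        "label": 1,
    },
]

print(f"accuracy: {accuracy(items, toy_score):.1%}")
```

The word-overlap scorer picks the wrong ending for the second item because the distractor reuses words from the context. This loosely illustrates why endings selected to fool surface-level cues are hard for shallow scoring, though it is only a toy stand-in for Adversarial Filtering itself.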

Leaderboard

Showing 20 of 27 results

Model                                   Score
Claude 3 Opus                           95.4%
GPT-4                                   95.3%
Gemini 1.5 Pro                          93.3%
MiMo-V2.5-Pro                           89.8%
Claude 3 Sonnet                         89.0%
Command R+                              88.6%
Hermes 3 70B                            88.2%
Qwen2 72B Instruct                      87.6%
Gemini 1.5 Flash                        86.5%
Gemma 2 27B                             86.4%
Claude 3 Haiku                          85.9%
Llama 3.1 Nemotron 70B Instruct         85.6%
Qwen2.5 32B Instruct                    85.2%
Phi-3.5-MoE-instruct                    83.8%
Mistral NeMo Instruct                   83.5%
Qwen2.5-Coder 32B Instruct              83.0%
Gemma 2 9B                              81.9%
Granite 3.3 8B Base                     80.1%
Gemma 3n E4B                            78.6%
Gemma 3n E4B Instructed LiteRT Preview  78.6%