HellaSwag


A challenging commonsense natural language inference dataset. It uses Adversarial Filtering to create questions that are trivial for humans (>95% accuracy) but difficult for state-of-the-art models. Each question requires completing a sentence about physical situations and everyday activities.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning. Language: en. Verified by llm-stats: no.
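The metadata gives a max score of 1, while the leaderboard shows percentages, so a score here reads naturally as accuracy: the fraction of items where the model picks the correct ending. The sketch below shows one way such a score could be computed and formatted. It is only an illustration: the `Item` class, the `predict` and `accuracy` helpers, and the toy scores are hypothetical, and the source does not say how the self-reported numbers were actually produced.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Item:
    """Hypothetical container for one multiple-choice item."""
    context: str             # the sentence beginning
    endings: Sequence[str]   # candidate endings
    label: int               # index of the correct ending


def predict(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring ending (first index wins ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(items: Sequence[Item], all_scores: Sequence[Sequence[float]]) -> float:
    """Fraction of items whose top-scored ending matches the label, in [0, 1]."""
    if not items:
        raise ValueError("need at least one item")
    if len(items) != len(all_scores):
        raise ValueError("one score list is required per item")
    correct = sum(predict(s) == item.label for item, s in zip(items, all_scores))
    return correct / len(items)


# Toy data: the scores stand in for whatever a model assigns to each ending
# (for example a log-likelihood). They are made up for this example.
items = [
    Item("She cracks an egg into the pan and", ["flies away.", "fries it.", "paints it.", "sleeps."], 1),
    Item("He ties his shoes and", ["goes for a run.", "eats them.", "melts.", "sings to them."], 0),
    Item("The dog sees the ball and", ["reads it.", "files taxes.", "chases it.", "drives off."], 2),
    Item("They open the umbrella because", ["it is raining.", "it is hungry.", "it is Tuesday.", "it sings."], 0),
]
toy_scores = [
    [-9.1, -2.3, -8.7, -7.5],  # picks index 1 (correct)
    [-1.9, -9.9, -8.8, -7.0],  # picks index 0 (correct)
    [-8.0, -9.5, -2.1, -6.4],  # picks index 2 (correct)
    [-5.0, -9.0, -3.2, -8.1],  # picks index 2 (wrong, label is 0)
]

score = accuracy(items, toy_scores)
print(f"{score:.1%}")
```

With three of the four toy items answered correctly, `score` is 0.75 on the 0 to 1 scale and is displayed as a percentage in the same style as the leaderboard entries.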

Leaderboard

  1. Claude 3 Opus self-reported llm-stats
    95.4%
  2. GPT-4 self-reported llm-stats
    95.3%
  3. Gemini 1.5 Pro self-reported llm-stats
    93.3%
  4. Claude 3 Sonnet self-reported llm-stats
    89.0%
  5. Command R+ self-reported llm-stats
    88.6%
  6. Hermes 3 70B self-reported llm-stats
    88.2%
  7. Qwen2 72B Instruct self-reported llm-stats
    87.6%
  8. Gemini 1.5 Flash self-reported llm-stats
    86.5%
  9. Gemma 2 27B self-reported llm-stats
    86.4%
  10. Claude 3 Haiku self-reported llm-stats
    85.9%
  11. Llama 3.1 Nemotron 70B Instruct self-reported llm-stats
    85.6%
  12. Qwen2.5 32B Instruct self-reported llm-stats
    85.2%
  13. Phi-3.5-MoE-instruct self-reported llm-stats
    83.8%
  14. Mistral NeMo Instruct self-reported llm-stats
    83.5%
  15. Qwen2.5-Coder 32B Instruct self-reported llm-stats
    83.0%
  16. Gemma 2 9B self-reported llm-stats
    81.9%
  17. Granite 3.3 8B Base self-reported llm-stats
    80.1%
  18. Gemma 3n E4B self-reported llm-stats
    78.6%
  19. 78.6%
  20. Qwen2.5-Coder 7B Instruct self-reported llm-stats
    76.8%