WinoGrande


WinoGrande: An Adversarial Winograd Schema Challenge at Scale. WinoGrande is a large-scale dataset of 44,000 pronoun resolution problems designed to test machine commonsense reasoning. It uses adversarial filtering to reduce spurious biases, giving a more robust evaluation of whether AI systems truly understand commonsense or merely exploit statistical shortcuts. When the benchmark was first published, the best AI methods achieved 59.4-79.1% accuracy, significantly below human performance of 94.0%; the self-reported leaderboard scores on this page come from later models.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: language, reasoning. Language: en. Verified by llm-stats: no.
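
As a rough illustration of how a score on this kind of benchmark can be computed, the sketch below treats each problem as a sentence with one blank and two candidate fillers, and reports the fraction answered correctly (a value between 0 and 1, in line with "Max score: 1", shown on the leaderboard as a percentage). The `Item` class, its field names, the `fill` and `accuracy` helpers, and the `always_first` baseline are all assumptions made for this example, and the two sample sentences are a classic Winograd-style pair written for illustration rather than taken from the dataset. It is not the evaluation code used by any model vendor or by llm-stats.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    """One hypothetical two-choice fill-in-the-blank problem."""
    sentence: str  # contains exactly one "_" blank
    option1: str
    option2: str
    answer: int    # 1 or 2, the index of the correct option


def fill(item: Item, choice: int) -> str:
    """Return the sentence with the blank replaced by the chosen option."""
    option = item.option1 if choice == 1 else item.option2
    return item.sentence.replace("_", option, 1)


def accuracy(items: Sequence[Item], predict: Callable[[Item], int]) -> float:
    """Fraction of items where predict() picks the correct option (0.0 to 1.0)."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Illustrative twin sentences: one changed word flips the correct answer.
ITEMS = [
    Item("The trophy doesn't fit into the suitcase because _ is too large.",
         "trophy", "suitcase", 1),
    Item("The trophy doesn't fit into the suitcase because _ is too small.",
         "trophy", "suitcase", 2),
]


def always_first(item: Item) -> int:
    """A trivial baseline that ignores the sentence entirely."""
    return 1


score = accuracy(ITEMS, always_first)
print(f"{score:.1%}")  # prints "50.0%"
```

Because the two sentences differ by a single word that flips the answer, a predictor that ignores the context gets exactly half of the pair right. This is why a two-choice benchmark of this kind has a chance level of about 50%, and why leaderboard scores are best read relative to that floor rather than to zero.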

Leaderboard

  1. GPT-4 self-reported llm-stats
    87.5%
  2. Command R+ self-reported llm-stats
    85.4%
  3. Qwen2 72B Instruct self-reported llm-stats
    85.1%
  4. Llama 3.1 Nemotron 70B Instruct self-reported llm-stats
    84.5%
  5. Gemma 2 27B self-reported llm-stats
    83.7%
  6. Hermes 3 70B self-reported llm-stats
    83.2%
  7. Qwen2.5 32B Instruct self-reported llm-stats
    82.0%
  8. Phi-3.5-MoE-instruct self-reported llm-stats
    81.3%
  9. Qwen2.5-Coder 32B Instruct self-reported llm-stats
    80.8%
  10. Gemma 2 9B self-reported llm-stats
    80.6%
  11. Mistral NeMo Instruct self-reported llm-stats
    76.8%
  12. Ministral 8B Instruct self-reported llm-stats
    75.3%
  13. Granite 3.3 8B Base self-reported llm-stats
    74.4%
  14. Qwen2.5-Coder 7B Instruct self-reported llm-stats
    72.9%
  15. Gemma 3n E4B self-reported llm-stats
    71.7%
  16. 71.7%
  17. Phi-3.5-mini-instruct self-reported llm-stats
    68.5%
  18. Phi 4 Mini self-reported llm-stats
    67.0%
  19. Gemma 3n E2B self-reported llm-stats
    66.8%
  20. 66.8%