HellaSwag
A challenging commonsense natural language inference dataset. It uses Adversarial Filtering to create questions that are trivial for humans (>95% accuracy) but difficult for state-of-the-art models. Each question requires completing a sentence with a plausible ending, drawn from physical situations and everyday activities.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning. Language: en. Verified by llm-stats: no.
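A minimal sketch of how a score in the 0 to 1 range could be computed for a sentence-completion task like this one, assuming the score is plain accuracy over multiple-choice items. The item layout, the `accuracy` and `overlap_score` names, and the toy data are all hypothetical and are not taken from the benchmark metadata; a real evaluation would replace the toy scorer with a model's score for each candidate ending.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical item layout: (context, candidate endings, index of the correct ending).
Item = Tuple[str, Sequence[str], int]


def accuracy(items: Sequence[Item], score: Callable[[str, str], float]) -> float:
    """Fraction of items whose top-scoring ending is the labeled one (0 to 1)."""
    if not items:
        raise ValueError("need at least one item")
    correct = 0
    for context, endings, label in items:
        # Pick the ending the scorer rates highest; ties go to the earliest ending.
        predicted = max(range(len(endings)), key=lambda i: score(context, endings[i]))
        if predicted == label:
            correct += 1
    return correct / len(items)


def overlap_score(context: str, ending: str) -> float:
    """Toy stand-in for a model (an assumption): counts words shared with the context."""
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    ending_words = set(re.findall(r"[a-z]+", ending.lower()))
    return float(len(context_words & ending_words))


# Made-up examples for illustration only; not drawn from the dataset.
toy_items: List[Item] = [
    ("A man pours water into a glass.", ["He drinks the water.", "He paints the fence."], 0),
    ("A dog runs after a ball.", ["The car stops.", "The dog catches the ball."], 1),
]

result = accuracy(toy_items, overlap_score)
print(f"accuracy = {result:.2f}")
```

Passing the scorer in as a function keeps the accuracy calculation independent of any particular model, so the same loop works whether the score comes from a word-overlap heuristic or a language model's likelihood of each ending.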