MMLU-redux-2.0
A curated version of the MMLU benchmark with 5,700 manually re-annotated questions across 57 subjects, built to identify and correct errors in the original dataset. It addresses the 6.49% error rate found in MMLU and provides more reliable evaluation metrics for language models.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, language, math, reasoning. Language: en. Verified by llm-stats: no.
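A maximum score of 1 is consistent with reporting results as a fraction of correctly answered questions. The metadata does not state the scoring formula, so the sketch below is an assumption: it treats the score as plain accuracy over multiple-choice answers. The function name `accuracy` and the sample answer letters are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of predictions that match the gold answers.

    Assumption: the benchmark score is plain accuracy in [0, 1].
    This is not confirmed by the metadata above.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    # Count exact matches between predicted and gold answer letters.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical answers for four questions; three of four match.
score = accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])
print(score)
```

Under this assumption, a model that answers every question correctly scores 1, which matches the stated maximum.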