MMLU-redux-2.0

A curated version of the MMLU benchmark featuring 5,700 manually re-annotated questions across 57 subjects, built to identify and correct errors in the original dataset. It addresses the 6.49% error rate found in MMLU and provides more reliable evaluation metrics for language models.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, language, math, reasoning. Language: en. Verified by llm-stats: no.
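
As a rough illustration of how a score with a maximum of 1 relates to the percentages shown on the leaderboard, the sketch below assumes the score is plain accuracy (the fraction of questions answered correctly). That scoring rule is an assumption, not something the metadata states, and the function names `accuracy` and `as_percent` are hypothetical helpers, not part of any llm-stats API.

```python
def accuracy(predictions, answers):
    """Return the fraction of predictions that match the reference answers.

    Assumption: the benchmark score is simple accuracy in [0, 1].
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def as_percent(score, max_score=1.0):
    """Render a score on a 0..max_score scale as a one-decimal percentage."""
    return f"{100 * score / max_score:.1f}%"


# Toy example with four multiple-choice questions (made-up data).
toy_score = accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"])
print(as_percent(toy_score))
```

Under this assumption, a stored score of 0.902 would be displayed as 90.2%.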

Leaderboard

  1. Kimi K2 Base: 90.2% (self-reported; source: llm-stats)