MMLU-Base


Base version of the Massive Multitask Language Understanding (MMLU) benchmark. It evaluates language models across 57 tasks, including elementary mathematics, US history, computer science, law, and other professional and academic subjects, and is designed to measure the breadth and depth of a model's academic and professional understanding.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: finance, general, healthcare, language, legal, math, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen2.5-Coder 7B Instruct: 68.0% (self-reported, via llm-stats)
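
The leaderboard percentage can be related to the "Max score: 1" field under Methodology. The sketch below assumes scores are stored as fractions of the max score (so 68.0% would correspond to a raw value of 0.68); the metadata does not state this storage convention, and the helper name to_percent is hypothetical.

```python
def to_percent(raw_score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of the maximum score.

    Assumption: raw scores lie in [0, max_score]; this is not confirmed
    by the benchmark metadata.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= raw_score <= max_score:
        raise ValueError("raw_score must lie between 0 and max_score")
    return f"{raw_score / max_score * 100:.1f}%"


# Illustrative value only: 0.68 is inferred from the 68.0% leaderboard entry.
print(to_percent(0.68))
```

Rejecting out-of-range values keeps a score that is already expressed as a percentage (for example 68.0) from being silently scaled a second time.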