AGIEval


A human-centric benchmark for evaluating foundation models on standardized exams, including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. It contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, legal, math, reasoning. Language: en. Verified by llm-stats: no.
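
The metadata lists a maximum score of 1, while leaderboard entries are shown as percentages. A minimal sketch of that conversion, assuming the displayed percentage is simply the raw score divided by the maximum score and multiplied by 100 (the function name `to_percent` is hypothetical and not part of llm-stats):

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of max_score with one decimal place.

    Assumption: the leaderboard percentage is score / max_score * 100.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{100 * score / max_score:.1f}%"


# Example: a raw score of 0.658 on a benchmark with max score 1
print(to_percent(0.658))
```

Under this assumption, a raw score of 0.658 would be displayed as 65.8%.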

Leaderboard

  1. Mistral Small 3 24B Base: 65.8% (self-reported, llm-stats)
  2. Ministral 3 (14B Base 2512): 64.8% (self-reported, llm-stats)
  3. Hermes 3 70B: 56.2% (self-reported, llm-stats)
  4. Gemma 2 27B: 55.1% (self-reported, llm-stats)
  5. Gemma 2 9B: 52.8% (self-reported, llm-stats)
  6. Granite 3.3 8B Base: 49.3% (self-reported, llm-stats)
  7. Ministral 8B Instruct: 48.3% (self-reported, llm-stats)
  8. ERNIE 4.5: 28.5% (self-reported, llm-stats)