AlignBench


AlignBench is a multi-dimensional benchmark for evaluating how well Large Language Models are aligned in Chinese. It covers 8 main categories: Fundamental Language Ability, Advanced Chinese Understanding, Open-ended Questions, Writing Ability, Logical Reasoning, Mathematics, Task-oriented Role Play, and Professional Knowledge. The benchmark includes 683 queries rooted in real usage scenarios, each paired with a human-verified reference. Responses are scored with a rule-calibrated, multi-dimensional LLM-as-Judge approach that uses Chain-of-Thought reasoning.
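The judging approach described above can be illustrated with a minimal sketch. It is not AlignBench's actual implementation: the dimension names, the prompt wording, the 1 to 10 scale, the JSON reply format, and the function names (build_judge_prompt, parse_judge_reply, category_means) are all assumptions made for illustration. It shows a judge prompt that asks for reasoning before scores, a parser for the scores, and a per-category average.

```python
import json
import re
from statistics import mean

# Hypothetical dimension names; the real rubric may differ per category.
DIMENSIONS = ("correctness", "helpfulness", "clarity")


def build_judge_prompt(query, reference, answer):
    """Assemble a judging prompt that asks for reasoning first, then scores."""
    dims = ", ".join(DIMENSIONS)
    return (
        "You are grading a model answer against a human-verified reference.\n"
        f"Question: {query}\n"
        f"Reference: {reference}\n"
        f"Answer: {answer}\n"
        f"First explain your reasoning step by step. Then rate {dims} and "
        "overall from 1 to 10, and finish with a JSON object of those scores."
    )


def parse_judge_reply(reply):
    """Extract the last flat JSON object in the reply and validate its scores."""
    matches = re.findall(r"\{[^{}]*\}", reply)
    if not matches:
        raise ValueError("no score object found in judge reply")
    scores = json.loads(matches[-1])
    for key in (*DIMENSIONS, "overall"):
        value = scores.get(key)
        # Assumed scale: every score must be a number between 1 and 10.
        if not isinstance(value, (int, float)) or not 1 <= value <= 10:
            raise ValueError(f"missing or out-of-range score: {key}")
    return scores


def category_means(records):
    """Average the overall score per category; records are (category, scores)."""
    by_category = {}
    for category, scores in records:
        by_category.setdefault(category, []).append(scores["overall"])
    return {category: mean(values) for category, values in by_category.items()}
```

Asking for the reasoning before the scores keeps the numeric rating conditioned on the judge's written analysis, and taking the last JSON object in the reply avoids picking up braces that appear inside that analysis.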

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: creativity, general, language, math, reasoning, roleplay, writing. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen2.5 72B Instruct: 81.6% (self-reported, llm-stats)
  2. DeepSeek-V2.5: 80.4% (self-reported, llm-stats)
  3. Qwen2.5 7B Instruct: 73.3% (self-reported, llm-stats)
  4. Qwen2 7B Instruct: 72.1% (self-reported, llm-stats)