AlpacaEval 2.0

AlpacaEval 2.0 is a length-controlled automatic evaluator for instruction-following language models. It uses GPT-4 Turbo to judge each model response against a baseline response. The benchmark covers 805 diverse instruction-following tasks, including creative writing, classification, programming, and general knowledge questions. It reports a 0.98 Spearman correlation with Chatbot Arena, and a run is fast (under 3 minutes) and affordable (under $10 in OpenAI credits). To address the length bias of automatic evaluation, it reports length-controlled win rates, and it uses weighted scoring based on response quality.
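The idea behind a length-controlled win rate can be illustrated with a small sketch. This is a deliberately simplified illustration, not the AlpacaEval implementation: it assumes each comparison yields a soft preference in [0, 1] for the model over the baseline, fits a two-parameter logistic model on a normalized length difference, and then reports the predicted win rate when that difference is zero. The function names, the feature choice, and the sample data are all hypothetical.

```python
import math


def raw_win_rate(prefs):
    """Mean soft preference for the model over the baseline, in percent."""
    return 100.0 * sum(prefs) / len(prefs)


def length_controlled_win_rate(prefs, len_diffs, steps=5000, lr=0.1):
    """Simplified length control (an assumption, not the official method).

    Fits p = sigmoid(a + b * len_diff) by gradient descent on the
    cross-entropy loss, then returns sigmoid(a): the predicted win rate
    if the model and baseline responses had equal length.
    """
    a, b = 0.0, 0.0
    n = len(prefs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for y, d in zip(prefs, len_diffs):
            p = 1.0 / (1.0 + math.exp(-(a + b * d)))
            grad_a += p - y
            grad_b += (p - y) * d
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return 100.0 / (1.0 + math.exp(-a))


# Hypothetical data: the model tends to be preferred when its answer is longer.
prefs = [0.9, 0.8, 0.9, 0.2, 0.3, 0.1, 0.85, 0.15]
# Hypothetical normalized length differences (model minus baseline).
len_diffs = [1.0, 0.8, 1.2, -0.5, -0.2, -0.6, 0.9, -0.4]

raw = raw_win_rate(prefs)
lc = length_controlled_win_rate(prefs, len_diffs)
print(f"raw win rate: {raw:.1f}%")
print(f"length-controlled win rate: {lc:.1f}%")
```

In this toy data the model's wins coincide with longer answers, so removing the length term lowers its score relative to the raw win rate. A model that wins with shorter answers would move the other way.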

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: creativity, general, reasoning, writing. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Granite 3.3 8B Base: 62.7% (self-reported, llm-stats)
  2. Granite 3.3 8B Instruct: 62.7% (self-reported, llm-stats)
  3. DeepSeek-V2.5: 50.5% (self-reported, llm-stats)
  4. IBM Granite 4.0 Tiny Preview: 35.2% (self-reported, llm-stats)