AlpacaEval 2.0
AlpacaEval 2.0 is a length-controlled automatic evaluator for instruction-following language models. It uses GPT-4 Turbo as a judge to compare each model response against a baseline response. The benchmark covers 805 diverse instruction-following tasks, including creative writing, classification, programming, and general knowledge questions. It reports a 0.98 Spearman correlation with Chatbot Arena, and a run is fast (under 3 minutes) and affordable (under $10 in OpenAI credits). To counter the tendency of automatic judges to favor longer responses, it reports length-controlled win rates, and it scores each comparison with a weighted (continuous) preference rather than a binary win or loss.
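The two scoring ideas in the description, weighted win rates and length control, can be illustrated with a small sketch. This is not the official AlpacaEval implementation: the function names, the `scale` parameter, and the toy data are hypothetical, and the length control here is a deliberately simplified logistic fit with only an intercept and a length term. The idea it shows is to fit the judge's preferences as a function of the length difference, then report the predicted win rate with the length difference set to zero.

```python
import math


def weighted_win_rate(preferences):
    """Mean judge preference for the model (each in [0, 1]), as a percentage."""
    return 100.0 * sum(preferences) / len(preferences)


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def length_controlled_win_rate(preferences, length_diffs,
                               scale=100.0, steps=5000, lr=1.0):
    """Simplified length control (an assumption, not the official method).

    Fits p = sigmoid(theta + phi * tanh(length_diff / scale)) to the soft
    preferences by gradient descent on the cross-entropy loss, then returns
    the predicted win rate at length_diff = 0, i.e. sigmoid(theta).
    """
    xs = [math.tanh(d / scale) for d in length_diffs]
    theta, phi = 0.0, 0.0
    n = len(preferences)
    for _ in range(steps):
        g_theta = 0.0
        g_phi = 0.0
        for y, x in zip(preferences, xs):
            err = _sigmoid(theta + phi * x) - y
            g_theta += err
            g_phi += err * x
        theta -= lr * g_theta / n
        phi -= lr * g_phi / n
    return 100.0 * _sigmoid(theta)


# Toy data: the model is preferred only when its answer is longer
# (length_diffs = model length minus baseline length, in tokens).
prefs = [0.9, 0.9, 0.9, 0.1]
diffs = [200, 200, 200, -200]

raw = weighted_win_rate(prefs)
controlled = length_controlled_win_rate(prefs, diffs)
print(f"raw: {raw:.1f}  length-controlled: {controlled:.1f}")
```

On this toy data the raw weighted win rate is about 70, while the length-controlled value falls to about 50, because the fit attributes the model's apparent advantage entirely to its longer answers.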
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: creativity, general, reasoning, writing. Language: en. Verified by llm-stats: no.