AlpacaEval 2.0

reasoning

AlpacaEval 2.0 is a length-controlled automatic evaluator for instruction-following language models. It uses GPT-4 Turbo as a judge to compare each model response against a baseline response. The evaluation set contains 805 diverse instruction-following tasks, including creative writing, classification, programming, and general-knowledge questions. The benchmark reports a 0.98 Spearman correlation with Chatbot Arena while being fast (under 3 minutes) and affordable (under $10 in OpenAI credits). It mitigates length bias in automatic evaluation by reporting length-controlled win rates, and it uses weighted scoring based on response quality.
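
As a rough illustration of how a win rate against a baseline can be computed, the sketch below averages per-instruction preference scores. It assumes each score is the probability, between 0 and 1, that the judge prefers the model's response over the baseline's. The function name `win_rate` and the sample scores are hypothetical and are not taken from the AlpacaEval code base. This is only the raw, unadjusted win rate; it does not attempt the length correction that the length-controlled win rate applies.

```python
from statistics import mean


def win_rate(preferences):
    """Return the win rate (in percent) from per-instruction preference scores.

    Assumption: each score is the probability in [0, 1] that the judge
    prefers the model's response over the baseline's; 0.5 means a tie.
    """
    if not preferences:
        raise ValueError("need at least one preference score")
    if any(not 0.0 <= p <= 1.0 for p in preferences):
        raise ValueError("preference scores must lie in [0, 1]")
    # The win rate is the mean preference, expressed as a percentage.
    return 100.0 * mean(preferences)


# Hypothetical scores for four instructions: two wins, one loss, one tie.
example_scores = [1.0, 0.0, 0.5, 1.0]
example_win_rate = win_rate(example_scores)
print(f"{example_win_rate:.1f}%")  # prints "62.5%"
```

Using a continuous preference score rather than a hard win/loss label lets near-ties contribute partial credit, which is one plausible reading of the "weighted scoring" mentioned above.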

Leaderboard

Granite 3.3 8B Base: 62.7%
Granite 3.3 8B Instruct: 62.7%
DeepSeek-V2.5: 50.5%
IBM Granite 4.0 Tiny Preview: 35.2%