Arena Hard

reasoning

Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs. It consists of 500 challenging real-world prompts curated by BenchBuilder, covering open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark follows the LLM-as-a-Judge methodology, using GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard-Auto achieves 98.6% correlation with human preference rankings and provides 3x higher separation of model performances than MT-Bench, which makes it effective at distinguishing between models of similar quality.
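
As a rough illustration of the ideas in the description, the sketch below shows one way judge verdicts could be turned into a percentage score, and one way "separation" between models could be measured. Everything in it is an assumption made for illustration: the verdict labels (win, tie, loss), counting a tie as half a win, the percentile bootstrap interval, and defining separation as the share of model pairs whose intervals do not overlap. It is not the benchmark's actual scoring pipeline.

```python
import random
from itertools import combinations

# Hypothetical verdict labels and score mapping (assumption: a tie is half a win).
SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(verdicts):
    """Percentage score of a model against a baseline, given judge verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(SCORES[v] for v in verdicts) / len(verdicts)


def bootstrap_interval(verdicts, rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for win_rate."""
    rng = random.Random(seed)
    n = len(verdicts)
    # Resample the verdicts with replacement and rescore each resample.
    samples = sorted(
        win_rate([verdicts[rng.randrange(n)] for _ in range(n)])
        for _ in range(rounds)
    )
    lower = samples[int(rounds * alpha / 2)]
    upper = samples[int(rounds * (1 - alpha / 2)) - 1]
    return lower, upper


def separability(intervals):
    """Fraction of model pairs whose confidence intervals do not overlap."""
    pairs = list(combinations(intervals, 2))
    if not pairs:
        return 0.0
    separated = sum(
        1
        for (lo_a, hi_a), (lo_b, hi_b) in pairs
        if hi_a < lo_b or hi_b < lo_a
    )
    return separated / len(pairs)


if __name__ == "__main__":
    verdicts = ["win", "tie", "loss", "win"]
    print(win_rate(verdicts))  # 62.5
    print(bootstrap_interval(verdicts))
    # Two of the three pairs below are separated; (30, 40) and (35, 50) overlap.
    print(separability([(10, 20), (30, 40), (35, 50)]))
```

Under these assumptions, a benchmark with higher separation is one where more model pairs end up with non-overlapping intervals, so small quality differences are less likely to be lost in noise.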

Leaderboard

Showing 20 of 26 results

Model                            Score
Qwen3 235B A22B                  95.6%
Qwen3 32B                        93.8%
Qwen3 30B A3B                    91.0%
Llama-3.3 Nemotron Super 49B v1  88.3%
Mistral Small 3 24B Instruct     87.6%
Qwen2.5 72B Instruct             81.2%
Phi 4 Reasoning Plus             79.0%
DeepSeek-V2.5                    76.2%
Phi 4                            75.4%
Phi 4 Reasoning                  73.3%
Ministral 8B Instruct            70.9%
Jamba 1.5 Large                  65.4%
Mistral Small 4                  58.3%
Granite 3.3 8B Base              57.6%
Granite 3.3 8B Instruct          57.6%
Mistral Large 3                  55.1%
Ministral 3 (14B Instruct 2512)  55.1%
Qwen2.5 7B Instruct              52.0%
Ministral 3 (8B Instruct 2512)   50.9%
Jamba 1.5 Mini                   46.1%