Arena-Hard v2

reasoning

Arena-Hard-Auto v2 is a challenging benchmark of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The prompts span diverse domains, including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. Evaluation is automatic and uses LLM-as-a-Judge, achieving 98.6% correlation with human preference rankings while separating model performance 3x more clearly than MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities.
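As a rough illustration of what "correlation with human preference rankings" can mean, the sketch below computes a Spearman rank correlation between benchmark scores and human preference ratings. The description above does not say which correlation measure is used, so the choice of Spearman is an assumption, and the scores and ratings in the example are hypothetical toy values, not benchmark data.

```python
def ranks(values):
    """Return 1-based ranks, where the highest value gets rank 1.

    Assumes there are no ties among the values.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation of two equal-length, tie-free lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    rx, ry = ranks(xs), ranks(ys)
    # Sum of squared rank differences, then the standard tie-free formula.
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical values for five models (illustration only).
benchmark_scores = [86.0, 80.0, 74.0, 61.0, 50.0]
human_ratings = [1300, 1310, 1250, 1200, 1150]

rho = spearman(benchmark_scores, human_ratings)
print(f"Spearman correlation: {rho:.3f}")
```

A value near 1 means the benchmark orders models almost the same way human raters do. In the toy data only the top two models swap places, so the correlation stays high.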

Leaderboard

Model                           Score
MiMo-V2-Flash                   86.2%
Qwen3-Next-80B-A3B-Instruct     82.7%
Qwen3-235B-A22B-Thinking-2507   79.7%
Qwen3-235B-A22B-Instruct-2507   79.2%
Qwen3 VL 235B A22B Instruct     77.4%
Nemotron 3 Super (120B A12B)    73.9%
Sarvam-105B                     71.0%
Nemotron 3 Nano (30B A3B)       67.7%
Qwen3 VL 32B Instruct           64.7%
Qwen3-Next-80B-A3B-Thinking     62.3%
Qwen3 VL 32B Thinking           60.5%
Qwen3 VL 30B A3B Instruct       58.5%
Qwen3 VL 30B A3B Thinking       56.7%
Qwen3 VL 8B Thinking            51.1%
Sarvam-30B                      49.0%
Qwen3 VL 4B Thinking            36.8%