Arena Hard


Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs. It consists of 500 challenging real-world prompts curated by BenchBuilder, covering open-ended software engineering problems, mathematical questions, and creative writing tasks. The benchmark follows the LLM-as-a-Judge methodology, using GPT-4.1 and Gemini-2.5 as automatic judges to approximate human preference. Arena-Hard-Auto reports a 98.6% correlation with human preference rankings and 3x higher separation of model performance than MT-Bench, which makes it well suited to distinguishing between models of similar quality.
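
The description above does not spell out how individual judge verdicts turn into a single score. As a rough illustration only, the sketch below assumes each prompt yields one verdict comparing a candidate model against a baseline, and that the score is a plain win rate with ties counted as half a win. The verdict labels and the `win_rate` helper are hypothetical and not taken from the Arena-Hard-Auto codebase, whose actual aggregation may differ.

```python
from collections import Counter

# Hypothetical verdict labels for this sketch:
#   "model"    - the judge preferred the candidate model's answer
#   "baseline" - the judge preferred the baseline's answer
#   "tie"      - the judge rated both answers as equivalent
VALID_VERDICTS = {"model", "baseline", "tie"}


def win_rate(verdicts):
    """Return the candidate's win rate, counting each tie as half a win."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return (counts["model"] + 0.5 * counts["tie"]) / len(verdicts)


# Four prompts: two wins, one tie, one loss.
verdicts = ["model", "model", "tie", "baseline"]
print(f"{win_rate(verdicts):.1%}")  # (2 + 0.5) / 4 = 62.5%
```

Counting ties as half a win keeps the score symmetric: swapping the candidate and the baseline turns a score of s into 1 - s.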

Methodology

This entry is imported from the llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: creativity, general, reasoning, writing. Language: en. Verified by llm-stats: no.
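
The metadata fields above can be held in a small record. The sketch below is one hedged way to do that. It assumes the leaderboard percentages are raw scores on the 0 to 1 scale implied by "Max score: 1", multiplied by 100. The `BenchmarkMetadata` class and its field names are hypothetical and do not reflect the llm-stats schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkMetadata:
    """Hypothetical container for the metadata fields listed on this page."""
    modality: str
    max_score: float
    categories: tuple
    language: str
    verified: bool

    def as_percent(self, raw_score: float) -> str:
        """Format a raw score as a percentage of max_score (assumed convention)."""
        if not 0 <= raw_score <= self.max_score:
            raise ValueError(f"score {raw_score} is outside [0, {self.max_score}]")
        return f"{raw_score / self.max_score:.1%}"


arena_hard = BenchmarkMetadata(
    modality="text",
    max_score=1.0,
    categories=("creativity", "general", "reasoning", "writing"),
    language="en",
    verified=False,
)

print(arena_hard.as_percent(0.956))  # 95.6%
```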

Leaderboard

  1. Qwen3 235B A22B: 95.6% (self-reported, llm-stats)
  2. Qwen3 32B: 93.8% (self-reported, llm-stats)
  3. Qwen3 30B A3B: 91.0% (self-reported, llm-stats)
  4. Llama-3.3 Nemotron Super 49B v1: 88.3% (self-reported, llm-stats)
  5. Mistral Small 3 24B Instruct: 87.6% (self-reported, llm-stats)
  6. Qwen2.5 72B Instruct: 81.2% (self-reported, llm-stats)
  7. Phi 4 Reasoning Plus: 79.0% (self-reported, llm-stats)
  8. DeepSeek-V2.5: 76.2% (self-reported, llm-stats)
  9. Phi 4: 75.4% (self-reported, llm-stats)
  10. Phi 4 Reasoning: 73.3% (self-reported, llm-stats)
  11. Ministral 8B Instruct: 70.9% (self-reported, llm-stats)
  12. Jamba 1.5 Large: 65.4% (self-reported, llm-stats)
  13. Granite 3.3 8B Base: 57.6% (self-reported, llm-stats)
  14. Granite 3.3 8B Instruct: 57.6% (self-reported, llm-stats)
  15. Mistral Large 3: 55.1% (self-reported, llm-stats)
  16. Qwen2.5 7B Instruct: 52.0% (self-reported, llm-stats)
  17. Jamba 1.5 Mini: 46.1% (self-reported, llm-stats)
  18. Mistral Small 3.2 24B Instruct: 43.1% (self-reported, llm-stats)
  19. Phi-3.5-MoE-instruct: 37.9% (self-reported, llm-stats)
  20. Phi-3.5-mini-instruct: 37.0% (self-reported, llm-stats)