Arena-Hard v2

Arena-Hard-Auto v2 is a challenging benchmark of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. It covers diverse domains, including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. Evaluation is automatic and uses an LLM as the judge; the reported results are a 98.6% correlation with human preference rankings and 3x better separation of model performance than MT-Bench. Prompts are selected for specificity, complexity, and required domain knowledge so that the benchmark distinguishes more clearly between model capabilities.
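To make the LLM-as-a-Judge idea concrete, here is a minimal sketch of how per-prompt judge verdicts could be aggregated into a single score. It assumes a simplified setup that is not taken from the benchmark itself: each prompt yields one verdict for a candidate model against a fixed baseline, the labels "win", "tie", and "loss" are hypothetical, and ties count as half a win. The official pipeline may use different labels and a different aggregation method.

```python
from collections import Counter

# Hypothetical verdict labels (an assumption for this sketch, not the
# benchmark's own format): one verdict per prompt for the candidate model.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Return the candidate's win rate in [0, 1], counting ties as half a win."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    return (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


# Illustrative data only: two wins, one tie, one loss over four prompts.
example = ["win", "win", "tie", "loss"]
print(f"{win_rate(example):.1%}")  # prints 62.5%
```

A score on a 0 to 1 scale like this maps directly onto a percentage, which is how leaderboard entries are usually displayed.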

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (shown as a percentage in the leaderboard). Categories: creativity, general, reasoning, writing. Language: en. Verified by llm-stats: no.

Leaderboard

All scores are self-reported and sourced from llm-stats.

  1. MiMo-V2-Flash: 86.2%
  2. Qwen3-Next-80B-A3B-Instruct: 82.7%
  3. Qwen3-235B-A22B-Thinking-2507: 79.7%
  4. Qwen3-235B-A22B-Instruct-2507: 79.2%
  5. Qwen3 VL 235B A22B Instruct: 77.4%
  6. Nemotron 3 Super (120B A12B): 73.9%
  7. Sarvam-105B: 71.0%
  8. Nemotron 3 Nano (30B A3B): 67.7%
  9. Qwen3 VL 32B Instruct: 64.7%
  10. Qwen3-Next-80B-A3B-Thinking: 62.3%
  11. Qwen3 VL 32B Thinking: 60.5%
  12. Qwen3 VL 30B A3B Instruct: 58.5%
  13. Qwen3 VL 30B A3B Thinking: 56.7%
  14. Qwen3 VL 8B Thinking: 51.1%
  15. Sarvam-30B: 49.0%
  16. Qwen3 VL 4B Thinking: 36.8%