WildBench

WildBench is an automated evaluation framework that benchmarks large language models using 1,024 challenging, real-world tasks selected from over one million human-chatbot conversation logs. It introduces two evaluation metrics (WB-Reward and WB-Score) that achieve high correlation with human preferences and uses task-specific checklists for systematic evaluation.
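As a rough illustration of how two metrics of this kind could be aggregated, here is a minimal sketch. It rests on assumptions that this page does not state: that WB-Reward averages five-way pairwise judgments mapped to values between -1 and +1, and that WB-Score averages per-response ratings on a 1-10 scale after recentering them around 5. The label names, the `REWARD` table, and both function names are hypothetical and are not part of any WildBench API.

```python
from statistics import mean

# Assumed five-way pairwise outcomes and their reward values (hypothetical labels).
REWARD = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def wb_reward(judgments):
    """Average the per-task rewards of one model against one baseline model."""
    return mean(REWARD[j] for j in judgments)


def wb_score(ratings):
    """Average assumed 1-10 ratings after recentering each one: (rating - 5) * 2."""
    return mean((r - 5) * 2 for r in ratings)


if __name__ == "__main__":
    print(wb_reward(["much_better", "tie", "slightly_worse", "slightly_better"]))
    print(wb_score([7, 5, 9]))
```

Under these assumptions, a positive `wb_reward` means the model was preferred over the baseline more often than not, and a positive `wb_score` means its average rating was above the midpoint of the scale.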

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, general, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Mistral Large 3: 68.5% (self-reported; source: llm-stats)
  2. Mistral Small 3.2 24B Instruct: 65.3% (self-reported; source: llm-stats)
  3. Mistral Small 3 24B Instruct: 52.2% (self-reported; source: llm-stats)
  4. Jamba 1.5 Large: 48.5% (self-reported; source: llm-stats)
  5. Jamba 1.5 Mini: 42.4% (self-reported; source: llm-stats)