WildBench

Category: reasoning

WildBench is an automated evaluation framework that benchmarks large language models using 1,024 challenging, real-world tasks selected from over one million human-chatbot conversation logs. It introduces two evaluation metrics (WB-Reward and WB-Score) that achieve high correlation with human preferences and uses task-specific checklists for systematic evaluation.

Leaderboard

Mistral Large 3

68.5%

Ministral 3 (14B Instruct 2512)

68.5%

Ministral 3 (8B Instruct 2512)

66.8%

Mistral Small 3.2 24B Instruct

65.3%

Ministral 3 (3B Instruct 2512)

56.8%

Mistral Small 3 24B Instruct

52.2%

Jamba 1.5 Large

48.5%

Jamba 1.5 Mini

42.4%
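
The listing above contains a tie at the top (two models at 68.5%), so a positional reading of the list can be misleading. As a minimal sketch, the entries can be loaded into a small structure and ranked so that tied scores share a rank. The names `Entry`, `LEADERBOARD`, and `rank` are hypothetical and chosen only for illustration; the scores are copied from the listing, and nothing here reflects how WildBench itself computes or publishes results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float  # percentage exactly as shown in the listing


# Scores transcribed from the leaderboard above (illustrative container only).
LEADERBOARD = [
    Entry("Mistral Large 3", 68.5),
    Entry("Ministral 3 (14B Instruct 2512)", 68.5),
    Entry("Ministral 3 (8B Instruct 2512)", 66.8),
    Entry("Mistral Small 3.2 24B Instruct", 65.3),
    Entry("Ministral 3 (3B Instruct 2512)", 56.8),
    Entry("Mistral Small 3 24B Instruct", 52.2),
    Entry("Jamba 1.5 Large", 48.5),
    Entry("Jamba 1.5 Mini", 42.4),
]


def rank(entries):
    """Return (rank, entry) pairs, highest score first.

    Uses competition ranking: tied scores share a rank and the
    next distinct score skips ahead (1, 1, 3, ...).
    """
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    ranked = []
    for position, entry in enumerate(ordered, start=1):
        if ranked and entry.score == ranked[-1][1].score:
            # Same score as the previous entry: reuse its rank.
            ranked.append((ranked[-1][0], entry))
        else:
            ranked.append((position, entry))
    return ranked


if __name__ == "__main__":
    for r, e in rank(LEADERBOARD):
        print(f"{r}. {e.model}: {e.score:.1f}%")
```

Because Python's `sorted` is stable, tied models keep the order in which they appear in the listing.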
