Wild Bench
reasoning official site →
WildBench is an automated evaluation framework that benchmarks large language models using 1,024 challenging, real-world tasks selected from over one million human-chatbot conversation logs. It introduces two evaluation metrics (WB-Reward and WB-Score) that achieve high correlation with human preferences and uses task-specific checklists for systematic evaluation.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, general, reasoning. Language: en. Verified by llm-stats: no.