MT-Bench
MT-Bench is a challenging multi-turn benchmark that measures how well large language models sustain coherent, informative conversations across follow-up turns. Responses are graded by strong LLMs (such as GPT-4) acting as judges. This makes evaluation of multi-turn dialogue scalable, and it makes the evaluation explainable because each judge gives a rationale alongside its score.
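To make the LLM-as-judge scoring concrete, here is a minimal Python sketch of the aggregation step. It assumes each judge reply ends with a rating written as `[[N]]` on a 1-10 scale (the bracketed style used by the public MT-Bench judge prompts), and that a model's score is the mean of its per-turn ratings. The function names `parse_rating` and `mt_bench_score` are hypothetical, and the ×10 rescaling to a 100-point range is an assumption, not llm-stats' documented method.

```python
import re
from statistics import mean
from typing import Dict, List, Optional

# Assumed verdict format: the judge writes its score as "[[N]]", e.g. "Rating: [[7]]".
RATING_RE = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(judge_reply: str) -> Optional[float]:
    """Return the last [[N]] rating in a judge reply, or None if missing or outside 1-10."""
    matches = RATING_RE.findall(judge_reply)
    if not matches:
        return None
    score = float(matches[-1])
    return score if 1 <= score <= 10 else None


def mt_bench_score(replies_per_turn: List[List[str]]) -> Dict[str, float]:
    """replies_per_turn[t] holds the judge replies for turn t+1 across all questions."""
    result: Dict[str, float] = {}
    all_scores: List[float] = []
    for turn, replies in enumerate(replies_per_turn, start=1):
        # Skip replies whose verdict could not be parsed instead of scoring them as 0.
        scores = [s for s in map(parse_rating, replies) if s is not None]
        if scores:
            result[f"turn_{turn}"] = mean(scores)
            all_scores.extend(scores)
    # Overall score: mean over every parsed turn-level rating, pooled across turns.
    result["overall"] = mean(all_scores) if all_scores else float("nan")
    # Hypothetical rescaling onto a 100-point range; not confirmed as llm-stats' method.
    result["overall_100"] = result["overall"] * 10
    return result
```

For example, judge replies ending in `[[8]]` and `[[6]]` for turn 1, and `[[7]]` plus one unparseable reply for turn 2, give per-turn means of 7.0 and an overall score of 7.0 (70.0 after rescaling). Unparseable verdicts are skipped rather than counted as zero, so a formatting failure by the judge is not scored as a bad answer.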
Methodology
Imported from llm-stats public benchmark metadata.
Modality: text
Max score: 100
Categories: communication, creativity, general, reasoning, roleplay
Language: en
Verified by llm-stats: no