MT-Bench

reasoning

MT-Bench is a challenging multi-turn benchmark that measures how well large language models sustain coherent, informative, and engaging conversations. It uses strong LLMs as judges, which makes the evaluation of multi-turn dialogue capabilities scalable and explainable.
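As a rough illustration of the LLM-as-judge scoring described above, the sketch below parses a 1-10 rating from each judge reply and averages the ratings per turn and overall. It assumes the judge ends its verdict with a rating in double brackets such as `[[8]]`, as the original MT-Bench judge prompts request. The names `parse_rating` and `mt_bench_score`, the question IDs, and the sample replies are hypothetical.

```python
import re
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Assumption: the judge reply contains the rating as "[[N]]".
RATING_RE = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(judge_reply: str) -> Optional[float]:
    """Return the 1-10 rating found in a judge reply, or None if absent or out of range."""
    match = RATING_RE.search(judge_reply)
    if match is None:
        return None
    rating = float(match.group(1))
    return rating if 1.0 <= rating <= 10.0 else None


def mt_bench_score(judge_replies: Dict[Tuple[str, int], str]) -> Dict[str, float]:
    """Average the parsed ratings per turn and over all turns.

    judge_replies maps (question_id, turn) to the raw judge reply text.
    Replies without a parseable rating are skipped.
    """
    by_turn: Dict[int, List[float]] = {}
    for (_question_id, turn), reply in judge_replies.items():
        rating = parse_rating(reply)
        if rating is not None:
            by_turn.setdefault(turn, []).append(rating)
    if not by_turn:
        raise ValueError("no parseable ratings in judge replies")
    result = {f"turn_{turn}": mean(ratings) for turn, ratings in sorted(by_turn.items())}
    result["overall"] = mean(r for ratings in by_turn.values() for r in ratings)
    return result


# Hypothetical judge replies for two questions with two turns each.
sample_replies = {
    ("q1", 1): "Clear and accurate. Rating: [[9]]",
    ("q1", 2): "Misses part of the follow-up. Rating: [[6]]",
    ("q2", 1): "Mostly correct. Rating: [[8]]",
    ("q2", 2): "The judge did not return a rating.",
}
sample_scores = mt_bench_score(sample_replies)
```

Skipping unparseable replies keeps one malformed verdict from aborting a run, at the cost of averaging over fewer samples. A stricter pipeline might re-query the judge instead.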

Leaderboard

Model                              Score
Qwen2.5 7B Instruct                87.5
Mistral Large 2                    86.3
Qwen2 7B Instruct                  84.1
Mistral Small 3 24B Instruct       83.5
Ministral 8B Instruct              83
Pixtral-12B                        76.8
Hermes 3 70B                       8.99
Qwen2.5 72B Instruct               0.935
Llama-3.3 Nemotron Super 49B v1    0.917
DeepSeek-V2.5                      0.902
Llama 3.1 Nemotron Nano 8B V1      0.81
Llama 3.1 Nemotron 70B Instruct    0.09
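The scores listed above do not appear to share one scale: some look like 0-100 values, one like a 0-10 value, and others like 0-1 fractions. The sketch below shows one way to bring such values onto a 0-100 scale before comparing them. It guesses each score's scale from its magnitude alone, which is an assumption rather than anything the leaderboard states, and the function name `to_percent` is hypothetical.

```python
def to_percent(score: float) -> float:
    """Rescale a score to 0-100, guessing its original scale from its magnitude.

    Heuristic (an assumption, not documented by the leaderboard):
      score <= 1   -> treated as a 0-1 fraction
      score <= 10  -> treated as a 0-10 rating
      otherwise    -> treated as already on 0-100
    """
    if score < 0 or score > 100:
        raise ValueError(f"score out of supported range: {score}")
    if score <= 1.0:
        return round(score * 100.0, 2)
    if score <= 10.0:
        return round(score * 10.0, 2)
    return round(score, 2)


# A few of the listed values, one per apparent scale.
rescaled = [to_percent(s) for s in (87.5, 8.99, 0.935)]
```

The heuristic cannot distinguish a genuinely low score from a value recorded on a different scale, so any result that lands far from the rest after rescaling is better checked against its original source than trusted.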