MT-Bench


MT-Bench is a challenging multi-turn benchmark that measures how well large language models sustain coherent, informative, and engaging conversations. It uses strong LLMs as judges, which makes the evaluation of multi-turn dialogue scalable and explainable.
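As a rough illustration of LLM-as-judge scoring, the sketch below averages per-turn judge ratings into per-turn and overall scores. It assumes each conversation turn has already received a numeric rating from a judge model (the original MT-Bench setup uses a 1 to 10 scale). The data layout and the function name `aggregate_ratings` are hypothetical, chosen only for this example, and this is not the official evaluation code.

```python
from statistics import mean


def aggregate_ratings(ratings_by_question):
    """Average judge ratings per turn and overall.

    ratings_by_question maps a question id to a list of judge ratings,
    one rating per conversation turn (hypothetical layout).
    """
    per_turn = {}
    for turns in ratings_by_question.values():
        # Turn numbers start at 1 to match how turns are usually described.
        for turn_index, rating in enumerate(turns, start=1):
            per_turn.setdefault(turn_index, []).append(rating)

    turn_means = {turn: mean(values) for turn, values in per_turn.items()}
    overall = mean(
        rating for turns in ratings_by_question.values() for rating in turns
    )
    return turn_means, overall


if __name__ == "__main__":
    # Made-up ratings for two questions with two turns each.
    example = {"q1": [8, 6], "q2": [10, 4]}
    print(aggregate_ratings(example))
```

Reporting per-turn means alongside the overall mean makes it easier to see whether a model degrades on follow-up turns.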

Methodology

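Imported from llm-stats public benchmark metadata. Modality: text. Max score: 100. Categories: communication, creativity, general, reasoning, roleplay. Language: en. Verified by llm-stats: no. Scores are self-reported, and several leaderboard entries appear to use a 0-10 or 0-1 scale rather than the stated maximum of 100, so ranks across those entries are not directly comparable.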

Leaderboard

  1. Qwen2.5 7B Instruct (self-reported, llm-stats): 87.5
  2. Mistral Large 2 (self-reported, llm-stats): 86.3
  3. Qwen2 7B Instruct (self-reported, llm-stats): 84.1
  4. Mistral Small 3 24B Instruct (self-reported, llm-stats): 83.5
  5. Ministral 8B Instruct (self-reported, llm-stats): 83
  6. Pixtral-12B (self-reported, llm-stats): 76.8
  7. Hermes 3 70B (self-reported, llm-stats): 8.99
  8. Qwen2.5 72B Instruct (self-reported, llm-stats): 0.935
  9. Llama-3.3 Nemotron Super 49B v1 (self-reported, llm-stats): 0.917
  10. DeepSeek-V2.5 (self-reported, llm-stats): 0.902
  11. Llama 3.1 Nemotron Nano 8B V1 (self-reported, llm-stats): 0.81
  12. Llama 3.1 Nemotron 70B Instruct (self-reported, llm-stats): 0.09
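
One way to avoid ranking entries that were reported on different scales against each other is to group them by apparent scale first. The sketch below is a heuristic only: the thresholds (at most 1, at most 10, otherwise up to 100) and the names `apparent_scale` and `group_by_scale` are assumptions made for this example, not part of llm-stats or MT-Bench. It deliberately does not rescale any value, since the source does not state which scale each entry actually uses.

```python
def apparent_scale(score):
    """Guess the reporting scale from the magnitude of a score (heuristic)."""
    if score <= 1.0:
        return "0-1"
    if score <= 10.0:
        return "0-10"
    return "0-100"


def group_by_scale(entries):
    """Group (model, score) pairs by apparent scale, highest score first."""
    groups = {}
    for name, score in entries:
        groups.setdefault(apparent_scale(score), []).append((name, score))
    for bucket in groups.values():
        bucket.sort(key=lambda entry: entry[1], reverse=True)
    return groups


if __name__ == "__main__":
    # A few entries copied from the leaderboard above.
    leaderboard = [
        ("Qwen2.5 7B Instruct", 87.5),
        ("Hermes 3 70B", 8.99),
        ("Qwen2.5 72B Instruct", 0.935),
        ("Llama 3.1 Nemotron 70B Instruct", 0.09),
    ]
    for scale, bucket in group_by_scale(leaderboard).items():
        print(scale, bucket)
```

A magnitude-based guess can be wrong (a very low score on a 0-10 scale would land in the 0-1 group), so any grouping like this should be checked against each model's original report.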