MM-MT-Bench

MM-MT-Bench is a multi-turn, LLM-as-a-judge benchmark that tests how well multimodal instruction-tuned models follow user instructions across a dialogue and answer open-ended questions zero-shot.

Methodology

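Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 100. Categories: communication, multimodal. Language: en. Verified by llm-stats: no. All leaderboard entries are self-reported and do not appear to share a single scale: the top three values look like 0-100 scores, the Qwen3 VL values look like 0-10 scores, and the last value is below 1, so ranks across these groups are not directly comparable.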

Leaderboard

  1. Mistral Large 3 (self-reported, llm-stats): 84.9
  2. Pixtral Large (self-reported, llm-stats): 74
  3. Pixtral-12B (self-reported, llm-stats): 60.5
  4. Qwen3 VL 235B A22B Instruct (self-reported, llm-stats): 8.5
  5. Qwen3 VL 235B A22B Thinking (self-reported, llm-stats): 8.5
  6. Qwen3 VL 32B Instruct (self-reported, llm-stats): 8.4
  7. Qwen3 VL 32B Thinking (self-reported, llm-stats): 8.3
  8. Qwen3 VL 30B A3B Instruct (self-reported, llm-stats): 8.1
  9. Qwen3 VL 8B Thinking (self-reported, llm-stats): 8
  10. Qwen3 VL 30B A3B Thinking (self-reported, llm-stats): 7.9
  11. Qwen3 VL 4B Thinking (self-reported, llm-stats): 7.7
  12. Qwen3 VL 8B Instruct (self-reported, llm-stats): 7.7
  13. Qwen3 VL 4B Instruct (self-reported, llm-stats): 7.5
  14. Qwen2.5-Omni-7B (self-reported, llm-stats): 0.06
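
The stated maximum is 100, yet the listed values span several orders of magnitude. One cautious way to read the leaderboard is to bucket entries by apparent scale before comparing them. The following is a minimal Python sketch of that idea. The thresholds (1 and 10) and the function names `apparent_scale` and `group_by_scale` are assumptions made for illustration, not part of the llm-stats metadata. The sketch deliberately does not rescale any score, since the true reporting scale of each entry is not stated in the source.

```python
# Scores copied from the leaderboard above (model name -> self-reported value).
SCORES = {
    "Mistral Large 3": 84.9,
    "Pixtral Large": 74,
    "Pixtral-12B": 60.5,
    "Qwen3 VL 235B A22B Instruct": 8.5,
    "Qwen3 VL 235B A22B Thinking": 8.5,
    "Qwen3 VL 32B Instruct": 8.4,
    "Qwen3 VL 32B Thinking": 8.3,
    "Qwen3 VL 30B A3B Instruct": 8.1,
    "Qwen3 VL 8B Thinking": 8,
    "Qwen3 VL 30B A3B Thinking": 7.9,
    "Qwen3 VL 4B Thinking": 7.7,
    "Qwen3 VL 8B Instruct": 7.7,
    "Qwen3 VL 4B Instruct": 7.5,
    "Qwen2.5-Omni-7B": 0.06,
}


def apparent_scale(score: float) -> str:
    """Guess the reporting scale from magnitude alone.

    This is a heuristic (an assumption): a genuinely low score on a
    0-100 scale would be misclassified, so treat the label as a hint.
    """
    if score <= 1.0:
        return "0-1"
    if score <= 10.0:
        return "0-10"
    return "0-100"


def group_by_scale(scores: dict) -> dict:
    """Bucket (model, score) pairs by apparent scale, highest score first."""
    groups = {}
    for model, score in scores.items():
        groups.setdefault(apparent_scale(score), []).append((model, score))
    for entries in groups.values():
        # Stable sort: tied scores keep their original leaderboard order.
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return groups


if __name__ == "__main__":
    for scale, entries in group_by_scale(SCORES).items():
        print(f"Apparent scale {scale}:")
        for model, score in entries:
            print(f"  {model}: {score}")
```

Comparing models only within a bucket avoids treating, for example, 8.5 and 84.9 as values on the same axis when they may not be.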