Multi-IF

Multi-IF benchmarks LLMs on multi-turn and multilingual instruction following. It extends IFEval by adding multi-turn sequences and translating the English prompts into seven other languages, yielding 4,501 multilingual conversations of three turns each. Results on the benchmark show that current leading LLMs struggle to maintain accuracy across multi-turn instructions and have higher error rates in languages with non-Latin scripts.
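
To make the three-turn setup concrete, the following is a minimal sketch of how per-turn accuracy could be tallied over a set of conversations. It is a simplified illustration and not the official Multi-IF scoring code: the `Turn` class, the `per_turn_accuracy` function, the toy checkers, and the assumption that a turn passes only when every instruction active at that turn is satisfied are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical checker type: takes a model response and returns True
# if the response satisfies one instruction.
Checker = Callable[[str], bool]


@dataclass
class Turn:
    response: str            # model output for this turn
    checkers: List[Checker]  # instructions assumed to be active at this turn


def turn_passes(turn: Turn) -> bool:
    """A turn passes only if every active instruction is satisfied (assumption)."""
    return all(check(turn.response) for check in turn.checkers)


def per_turn_accuracy(conversations: List[List[Turn]]) -> Dict[int, float]:
    """Return the fraction of conversations that pass at each turn index (1-based)."""
    totals: Dict[int, int] = {}
    passed: Dict[int, int] = {}
    for conversation in conversations:
        for index, turn in enumerate(conversation, start=1):
            totals[index] = totals.get(index, 0) + 1
            passed[index] = passed.get(index, 0) + int(turn_passes(turn))
    return {index: passed[index] / totals[index] for index in sorted(totals)}


# Toy instructions, invented for illustration only.
def is_lowercase(response: str) -> bool:
    return response == response.lower()


def ends_with_period(response: str) -> bool:
    return response.endswith(".")


def under_five_words(response: str) -> bool:
    return len(response.split()) < 5


# Two toy conversations; each later turn carries the earlier instructions forward.
conv_a = [
    Turn("hello there.", [is_lowercase]),
    Turn("hello again.", [is_lowercase, ends_with_period]),
    Turn("a much longer reply than allowed.",
         [is_lowercase, ends_with_period, under_five_words]),
]
conv_b = [
    Turn("Hi", [is_lowercase]),
    Turn("ok.", [is_lowercase, ends_with_period]),
    Turn("ok fine.", [is_lowercase, ends_with_period, under_five_words]),
]

scores = per_turn_accuracy([conv_a, conv_b])
print(scores)  # {1: 0.5, 2: 1.0, 3: 0.5}
```

Grouping the conversations by language before calling `per_turn_accuracy` would give a per-language breakdown of the same kind, which is one way to compare Latin-script and non-Latin-script results.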

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, instruction_following, language, reasoning, structured_output. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen3-235B-A22B-Thinking-2507: 80.6% (self-reported, llm-stats)
  2. o3-mini: 79.5% (self-reported, llm-stats)
  3. Qwen3 VL 235B A22B Thinking: 79.1% (self-reported, llm-stats)
  4. Qwen3 VL 32B Thinking: 78.0% (self-reported, llm-stats)
  5. Qwen3-Next-80B-A3B-Thinking: 77.8% (self-reported, llm-stats)
  6. Qwen3-235B-A22B-Instruct-2507: 77.5% (self-reported, llm-stats)
  7. Qwen3 VL 235B A22B Instruct: 76.3% (self-reported, llm-stats)
  8. Qwen3-Next-80B-A3B-Instruct: 75.8% (self-reported, llm-stats)
  9. Qwen3 VL 8B Instruct: 75.1% (self-reported, llm-stats)
  10. Qwen3 VL 8B Thinking: 75.1% (self-reported, llm-stats)
  11. Qwen3 VL 4B Thinking: 73.6% (self-reported, llm-stats)
  12. Qwen3 VL 30B A3B Thinking: 73.0% (self-reported, llm-stats)
  13. Qwen3 30B A3B: 72.2% (self-reported, llm-stats)
  14. Qwen3 VL 32B Instruct: 72.0% (self-reported, llm-stats)
  15. GPT-4.1: 70.8% (self-reported, llm-stats)
  16. GPT-4.5: 70.8% (self-reported, llm-stats)
  17. GPT-4.1 mini: 67.0% (self-reported, llm-stats)
  18. Qwen3 VL 30B A3B Instruct: 66.1% (self-reported, llm-stats)
  19. GPT-4o: 60.9% (self-reported, llm-stats)
  20. GPT-4.1 nano: 57.2% (self-reported, llm-stats)