Multi-IF
Multi-IF benchmarks LLMs on multi-turn, multilingual instruction following. It extends IFEval by adding multi-turn sequences and translating the English prompts into 7 other languages, yielding 4,501 multilingual conversations of three turns each. Results on the benchmark show that current leading LLMs struggle to maintain accuracy across turns and have higher error rates on languages with non-Latin scripts.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: communication, instruction_following, language, reasoning, structured_output. Language: en. Verified by llm-stats: no.
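Since each conversation has three turns and the maximum score is 1, one way to read a result is as the fraction of turns in which the model followed its instructions, broken down by turn and by language. The following is a minimal sketch under that assumption; the record fields (`language`, `turn`, `followed`) and the sample values are hypothetical and not taken from the Multi-IF release, and the official scoring may differ.

```python
from collections import defaultdict

# Hypothetical per-turn results. Field names and values are illustrative
# only; they are not taken from the Multi-IF dataset or its scoring code.
records = [
    {"language": "en", "turn": 1, "followed": True},
    {"language": "en", "turn": 2, "followed": True},
    {"language": "en", "turn": 3, "followed": False},
    {"language": "hi", "turn": 1, "followed": True},
    {"language": "hi", "turn": 2, "followed": False},
    {"language": "hi", "turn": 3, "followed": False},
]


def accuracy_by(rows, key):
    """Return the fraction of rows with followed=True, grouped by `key`."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        passed[row[key]] += int(row["followed"])
    return {group: passed[group] / total[group] for group in total}


# Accuracy per turn index, per language, and over all turns (each in [0, 1]).
by_turn = accuracy_by(records, "turn")
by_language = accuracy_by(records, "language")
overall = sum(row["followed"] for row in records) / len(records)

print(by_turn)
print(by_language)
print(overall)
```

Grouping by turn shows whether accuracy drops as a conversation progresses, and grouping by language shows any gap between scripts; both are the kinds of breakdown the description above refers to.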