Multi-Challenge

reasoning

MultiChallenge is a realistic multi-turn conversation evaluation benchmark that challenges frontier LLMs across four key categories: instruction retention (maintaining instructions throughout conversations), inference memory (recalling and connecting details from previous turns), reliable versioned editing (adapting to evolving instructions during collaborative editing), and self-coherence (avoiding contradictions in responses). The benchmark evaluates models on sustained, contextually complex dialogues across diverse topics including travel planning, technical documentation, and professional communication.

Leaderboard

Showing 20 of 28 results

Model                           Score
Nova 2 Pro                      77.7%
Nova 2 Lite                     76.6%
Nova 2 Omni                     75.5%
GPT-5                           69.6%
Qwen3.5-397B-A17B               67.6%
Step3-VL-10B                    62.6%
Qwen3.5-122B-A10B               61.5%
Qwen3.5-27B                     60.8%
o3                              60.4%
Qwen3.5-35B-A3B                 60.0%
Nemotron 3 Super (120B A12B)    55.2%
Qwen3.5-9B                      54.5%
Kimi K2 Instruct                54.1%
Kimi K2-Instruct-0905           54.1%
MAI-Thinking-1                  53.0%
Qwen3.5-4B                      49.0%
MiniMax M1 40K                  44.7%
MiniMax M1 80K                  44.7%
GPT-4.5                         43.8%
o4-mini                         43.0%