Scale MultiChallenge

reasoning

MultiChallenge is a realistic multi-turn conversation benchmark developed by Scale AI. It evaluates large language models on four challenging conversation categories: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Each category requires accurate instruction-following, context allocation, and in-context reasoning at the same time. Although frontier models achieved near-perfect scores on earlier multi-turn benchmarks, all of them scored below 50% accuracy on MultiChallenge when it was released.

Leaderboard

Model      Accuracy
GPT-5      69.6%
o3         60.4%
o4-mini    43.0%
GPT-4o     40.3%