MultiChallenge (o3-mini grader)

reasoning

A realistic multi-turn conversation benchmark that challenges frontier LLMs in four areas: instruction retention, inference memory, reliable versioned editing, and self-coherence. Despite near-perfect scores on existing multi-turn benchmarks, all frontier models scored below 50% accuracy on MultiChallenge at the time of its release.

Leaderboard

GPT-5           69.6%
o3-mini         50.2%
GPT-4.5         50.1%
GPT-4.1         46.2%
GPT-4.1 mini    42.2%
GPT-4o          39.9%
GPT-4.1 nano    31.1%