Scale MultiChallenge


MultiChallenge is a multi-turn conversation benchmark developed by Scale AI. It evaluates large language models on four categories of realistic conversational challenges: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. Each challenge requires accurate instruction-following, context allocation, and in-context reasoning. When the benchmark was introduced, all frontier models scored below 50% accuracy on MultiChallenge despite achieving near-perfect scores on earlier multi-turn evaluation benchmarks; the leaderboard below lists more recent models that exceed that mark.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning, communication, general. Language: en. Multilingual: no. Verified by llm-stats: no.
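
The metadata above gives a maximum score of 1, while the leaderboard reports percentages. The following is a minimal sketch of how the two could relate, assuming a leaderboard percentage is simply the raw score divided by the maximum score. The `BenchmarkMetadata` class, its field names, and the `as_percent` helper are hypothetical and are not part of any llm-stats API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkMetadata:
    """Hypothetical container for the metadata fields listed above."""
    name: str
    modality: str
    max_score: float
    categories: tuple[str, ...]
    language: str
    multilingual: bool
    verified: bool


# Values copied from the Methodology line; the structure itself is an assumption.
MULTICHALLENGE = BenchmarkMetadata(
    name="MultiChallenge",
    modality="text",
    max_score=1.0,
    categories=("reasoning", "communication", "general"),
    language="en",
    multilingual=False,
    verified=False,
)


def as_percent(raw_score: float, meta: BenchmarkMetadata) -> str:
    """Format a raw score as a percentage of the benchmark's maximum score."""
    if not 0.0 <= raw_score <= meta.max_score:
        raise ValueError(f"score {raw_score} is outside [0, {meta.max_score}]")
    return f"{raw_score / meta.max_score:.1%}"


if __name__ == "__main__":
    # Under the assumption above, a raw score of 0.696 is shown as 69.6%.
    print(as_percent(0.696, MULTICHALLENGE))
```

Rejecting scores outside the declared range keeps a value that is already a percentage (for example 69.6) from being silently rescaled a second time.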

Leaderboard

  1. GPT-5: 69.6% (self-reported, llm-stats)
  2. o3: 60.4% (self-reported, llm-stats)
  3. o3: 56.5% (self-reported, llm-stats)
  4. o4-mini: 43.0% (self-reported, llm-stats)
  5. GPT-4o: 40.3% (self-reported, llm-stats)