OpenAI-MRCR: 2 needle 128k

reasoning official site →

Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: long_context, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5 self-reported llm-stats
    95.2%
  2. MiniMax M1 40K self-reported llm-stats
    76.1%
  3. MiniMax M1 80K self-reported llm-stats
    73.4%
  4. GPT-4.1 self-reported llm-stats
    57.2%
  5. GPT-4.1 mini self-reported llm-stats
    47.2%
  6. GPT-4.5 self-reported llm-stats
    38.5%
  7. GPT-4.1 nano self-reported llm-stats
    36.6%
  8. GPT-4o self-reported llm-stats
    31.9%
  9. o3-mini self-reported llm-stats
    18.7%