OpenAI-MRCR: 2 needle 1M

reasoning official site →

Multi-Round Co-reference Resolution benchmark that tests an LLM's ability to distinguish between multiple similar needles hidden in long conversations. Models must reproduce specific instances of content (e.g., 'Return the 2nd poem about tapirs') from multi-turn synthetic conversations, requiring reasoning about context, ordering, and subtle differences between similar outputs.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: long_context, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. MiniMax M1 40K self-reported llm-stats
    58.6%
  2. MiniMax M1 80K self-reported llm-stats
    56.2%
  3. GPT-4.1 self-reported llm-stats
    46.3%
  4. GPT-4.1 mini self-reported llm-stats
    33.3%
  5. GPT-4.1 nano self-reported llm-stats
    12.0%