MRCR v2 (8-needle)

MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. It tests a model's ability to track and reason about multiple pieces of information at once across extended conversations.
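To make the 8-needle setup concrete, here is a minimal toy sketch of how such a task could be assembled and scored. It is not the benchmark's actual data or grader: the prompts, the helper names (`build_conversation`, `score`), the distractor count, and the string-similarity metric are all assumptions chosen for illustration. The only detail taken from the description above is that 8 needle items sit inside a long conversation and that scores top out at 1.

```python
import random
from difflib import SequenceMatcher

NEEDLE_REQUEST = "write a poem about tapirs"  # hypothetical repeated request


def build_conversation(num_needles=8, num_distractors=40, seed=0):
    """Build a toy chat where the same request appears `num_needles` times."""
    rng = random.Random(seed)
    # Needles: identical requests, each with a distinct reply.
    turns = [(NEEDLE_REQUEST, f"tapir poem draft #{i + 1}")
             for i in range(num_needles)]
    # Distractors: unrelated requests that pad out the context.
    turns += [(f"write a note about topic {j}", f"note on topic {j}")
              for j in range(num_distractors)]
    rng.shuffle(turns)
    # Needle replies in the order they appear after shuffling.
    needles = [reply for request, reply in turns if request == NEEDLE_REQUEST]
    return turns, needles


def score(prediction, reference):
    """Assumed metric: string similarity in [0, 1], so the max score is 1."""
    return SequenceMatcher(None, prediction, reference).ratio()


turns, needles = build_conversation()
target = 3  # ask the model for the 3rd occurrence
question = f"Reproduce the reply to occurrence {target} of: {NEEDLE_REQUEST}"
reference = needles[target - 1]

# A perfect answer scores 1.0; returning a different needle scores lower.
print(score(reference, reference))
print(score(needles[0], reference) < 1.0)
```

In this sketch the difficulty comes from the needles being nearly identical: the model has to resolve which of the 8 occurrences is being asked for, not just find a matching request.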

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, long_context, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Claude Opus 4.6: 93.0% (self-reported, llm-stats)
  2. GPT-5.5: 74.0% (self-reported, llm-stats)
  3. Gemini 3.1 Flash-Lite: 60.1% (self-reported, llm-stats)
  4. GPT-5.4 mini: 33.6% (self-reported, llm-stats)
  5. GPT-5.4 nano: 33.1% (self-reported, llm-stats)
  6. Gemini 3.5 Flash: 26.6% (self-reported, llm-stats)
  7. Gemini 3 Pro: 26.3% (self-reported, llm-stats)
  8. Gemini 3.1 Pro: 26.3% (self-reported, llm-stats)
  9. Gemini 3 Flash: 22.1% (self-reported, llm-stats)
  10. Gemini 2.5 Pro Preview 06-05: 16.4% (self-reported, llm-stats)