LongBench v2


LongBench v2 is a benchmark that assesses how well LLMs handle long-context problems requiring deep understanding and reasoning across realistic multitask settings. It consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, long_context, reasoning, structured_output. Language: en. Verified by llm-stats: no.
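
The metadata gives a maximum score of 1, while the leaderboard shows percentages. A minimal sketch of how the two could relate, assuming the score is plain accuracy over the multiple-choice questions (the source does not confirm the exact scoring rule). The function names and the toy data are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of predictions that match the gold answers (0 to 1)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Compare answer letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def as_percentage(score: float) -> str:
    """Format a 0-to-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# Hypothetical toy data: four questions with answer letters, not real benchmark items.
gold = ["A", "C", "B", "D"]
predicted = ["A", "C", "D", "D"]
score = accuracy(predicted, gold)
print(score, as_percentage(score))
```

Under this assumption, a leaderboard entry of 63.2% would correspond to a raw score of 0.632 on the 0-to-1 scale.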

Leaderboard

  1. Qwen3.5-397B-A17B self-reported llm-stats
    63.2%
  2. Qwen3.6 Plus self-reported llm-stats
    62.0%
  3. MiniMax M1 80K self-reported llm-stats
    61.5%
  4. Kimi K2.5 self-reported llm-stats
    61.0%
  5. MAI-Thinking-1 self-reported llm-stats
    61.0%
  6. MiniMax M1 40K self-reported llm-stats
    61.0%
  7. MiMo-V2-Flash self-reported llm-stats
    60.6%
  8. Qwen3.5-27B self-reported llm-stats
    60.6%
  9. Qwen3.5-122B-A10B self-reported llm-stats
    60.2%
  10. Qwen3.5-35B-A3B self-reported llm-stats
    59.0%
  11. DeepSeek-V3 self-reported llm-stats
    48.7%