LongBench v2

reasoning

LongBench v2 is a benchmark that assesses how well LLMs handle long-context problems requiring deep understanding and reasoning across a range of real-world tasks. It consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, spanning six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.
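
Because every question is multiple-choice, a score on this benchmark can be read as plain accuracy: the share of questions where the model's chosen option matches the reference answer. The sketch below is a minimal illustration under that assumption, not the benchmark's official scoring script. The function names `accuracy` and `approx_correct` are hypothetical, and the conversion from a percentage to a question count assumes a score was computed over all 503 questions, which this page does not state.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted option letter matches the reference letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def approx_correct(score_percent, n_questions=503):
    """Approximate number of correct answers implied by a percentage score.

    Assumption: the score is accuracy over all n_questions questions.
    """
    return round(score_percent / 100 * n_questions)


if __name__ == "__main__":
    # Toy example: three of four predictions match the reference answers.
    print(accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))
    # Under the stated assumption, a 63.2% score corresponds to about this many questions.
    print(approx_correct(63.2))
```

Read this way, a one-point difference between two scores corresponds to roughly five questions, so small gaps between adjacent entries should be interpreted with care.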

Leaderboard

Model                Score
Qwen3.5-397B-A17B    63.2%
Qwen3.6 Plus         62.0%
MiniMax M1 80K       61.5%
Kimi K2.5            61.0%
MAI-Thinking-1       61.0%
MiniMax M1 40K       61.0%
MiMo-V2-Flash        60.6%
Qwen3.5-27B          60.6%
Qwen3.5-122B-A10B    60.2%
Qwen3.5-35B-A3B      59.0%
Qwen3.5-9B           55.2%
Qwen3.5-4B           50.0%
DeepSeek-V3          48.7%
Qwen3.5-2B           38.7%
Qwen3.5-0.8B         26.1%