LongBench v2
LongBench v2 is a benchmark that assesses how well LLMs handle long-context problems requiring deep understanding and reasoning across a range of real-world tasks. It consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, spanning six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, long_context, reasoning, structured_output. Language: en. Verified by llm-stats: no.
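Because the questions are multiple-choice and the maximum score is 1, a score can plausibly be read as the fraction of questions answered correctly. The sketch below shows one way such a score could be computed, overall and per task category. It is an illustration under that assumption, not the official evaluation script: the record keys `category`, `answer`, and `prediction` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def score(records):
    """Return (overall_accuracy, per_category_accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'answer' (gold choice letter) and 'prediction'
    (model's choice letter, or None if no answer could be parsed).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        pred = rec["prediction"]
        # A missing prediction counts as wrong; letters are compared case-insensitively.
        if pred is not None and pred.strip().upper() == rec["answer"].strip().upper():
            correct[rec["category"]] += 1
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    per_category = {cat: correct[cat] / totals[cat] for cat in totals}
    return overall, per_category
```

Reporting the per-category breakdown alongside the overall figure helps show whether a model is weak on one of the six task types even when its aggregate score looks reasonable.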