LongCodeBench


LongCodeBench evaluates how well large language models understand code at very long context lengths, scaling up to 1M tokens. It tests whether a model can reason about an extensive codebase provided in a single prompt by answering multiple-choice questions about that code.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: coding, long_context, reasoning. Language: en. Verified by llm-stats: no.
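
Since the benchmark is described as multiple-choice with a maximum score of 1, one plausible reading is that a score is the fraction of questions answered correctly. The sketch below illustrates that reading only; the function name `multiple_choice_accuracy`, the letter-based answer format, and the case-insensitive matching are assumptions for illustration, not the benchmark's published scoring code.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions answered correctly, in [0, 1].

    Hypothetical helper: assumes each prediction and answer is a single
    option label such as "A" or "B", compared case-insensitively.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Made-up example data: three of four predictions match the answer key.
score = multiple_choice_accuracy(["A", "c", "B", "D"], ["A", "C", "D", "D"])
print(f"{score:.1%}")  # prints "75.0%"
```

Under this assumption, a leaderboard entry of 84.0% would correspond to a raw score of 0.84 on the 0 to 1 scale.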

Leaderboard

  1. Nova 2 Lite: 84.0% (self-reported, via llm-stats)
  2. Nova 2 Pro: 84.0% (self-reported, via llm-stats)