ZebraLogic

ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.
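To make the CSP framing concrete, here is a minimal sketch of a logic grid puzzle solved by brute-force search. The puzzle itself (three houses, the names, the drinks, and the three clues) is a hypothetical toy invented for illustration and is not taken from the benchmark. The helper `search_space_size` shows one way to quantify complexity, assuming each attribute assigns its values one per house, so a grid with N houses and M attributes has (N!)^M candidate assignments.

```python
from itertools import permutations
from math import factorial


def search_space_size(n_houses: int, n_attributes: int) -> int:
    """Count candidate grids, assuming each attribute is a permutation over houses."""
    return factorial(n_houses) ** n_attributes


def solve_toy_puzzle() -> list[tuple[tuple[str, str], ...]]:
    """Brute-force a hypothetical 3-house puzzle with two attributes (name, drink)."""
    names = ("Alice", "Bob", "Carol")
    drinks = ("tea", "milk", "coffee")
    solutions = []
    for name_order in permutations(names):        # name_order[i] lives in house i
        for drink_order in permutations(drinks):  # drink_order[i] is drunk in house i
            house_of = {name: i for i, name in enumerate(name_order)}
            drink_at = {drink: i for i, drink in enumerate(drink_order)}
            # Clue 1: Alice drinks tea.
            if house_of["Alice"] != drink_at["tea"]:
                continue
            # Clue 2: Bob lives in the first house (index 0).
            if house_of["Bob"] != 0:
                continue
            # Clue 3: the milk drinker lives directly to the right of the coffee drinker.
            if drink_at["milk"] != drink_at["coffee"] + 1:
                continue
            solutions.append(tuple(zip(name_order, drink_order)))
    return solutions


if __name__ == "__main__":
    for grid in solve_toy_puzzle():
        print(grid)
    # The candidate count grows very quickly with grid size.
    for n, m in [(3, 2), (4, 4), (5, 5)]:
        print(f"{n} houses x {m} attributes: {search_space_size(n, m)} candidate grids")
```

Exhaustive enumeration is only practical for tiny grids like this one. The rapid growth of the candidate count gives an intuition for why accuracy could fall as puzzles get larger, though this sketch says nothing about how the benchmark's own generator or complexity measure is implemented.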

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen3 VL 235B A22B Thinking: 97.3% (self-reported, llm-stats)
  2. LongCat-Flash-Thinking: 95.5% (self-reported, llm-stats)
  3. Qwen3-235B-A22B-Instruct-2507: 95.0% (self-reported, llm-stats)
  4. LongCat-Flash-Chat: 89.3% (self-reported, llm-stats)
  5. Kimi K2 Instruct: 89.0% (self-reported, llm-stats)
  6. Kimi K2-Instruct-0905: 89.0% (self-reported, llm-stats)
  7. MiniMax M1 80K: 86.8% (self-reported, llm-stats)
  8. MiniMax M1 40K: 80.1% (self-reported, llm-stats)