ZebraLogic

reasoning

ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.
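To make the CSP framing concrete, here is a minimal sketch of a logic grid puzzle solved by exhaustive search. The three-house puzzle, its clues, and the function names are hypothetical illustrations, not taken from the benchmark, and brute-force enumeration is used only for clarity. It is not presented as how ZebraLogic generates or scores its puzzles. The sketch also assumes that one natural measure of complexity is the number of candidate assignments, which for N houses and M attributes is (N!)^M.

```python
from itertools import permutations, product
from math import factorial

# Hypothetical toy puzzle: 3 houses in a row, 2 attributes per house.
ATTRIBUTES = {
    "name": ("Alice", "Bob", "Carol"),
    "pet": ("cat", "dog", "fish"),
}


def search_space_size(n_houses, n_attributes):
    """Number of candidate grids: each attribute is a permutation over houses."""
    return factorial(n_houses) ** n_attributes


def satisfies_clues(name, pet):
    """Check the (made-up) clues; name[i] and pet[i] describe house i, left to right."""
    alice = name.index("Alice")
    carol = name.index("Carol")
    return (
        alice == 0                          # Clue 1: Alice lives in the first house.
        and pet.index("dog") == alice + 1   # Clue 2: the dog is directly right of Alice.
        and pet[carol] == "fish"            # Clue 3: Carol owns the fish.
    )


def solve():
    """Enumerate every assignment and keep those consistent with all clues."""
    solutions = []
    for name, pet in product(
        permutations(ATTRIBUTES["name"]), permutations(ATTRIBUTES["pet"])
    ):
        if satisfies_clues(name, pet):
            solutions.append({"name": name, "pet": pet})
    return solutions


if __name__ == "__main__":
    for solution in solve():
        print(solution)
    print("candidates (3 houses, 2 attributes):", search_space_size(3, 2))
    print("candidates (5 houses, 5 attributes):", search_space_size(5, 5))
```

Under this assumed measure, the toy puzzle has only 36 candidate grids, while a 5-house, 5-attribute puzzle already has about 2.5 × 10^10, which gives a sense of how quickly difficulty can be scaled by changing the grid size.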

Leaderboard

Model                            Score
Qwen3 VL 235B A22B Thinking      97.3%
LongCat-Flash-Thinking           95.5%
Qwen3-235B-A22B-Instruct-2507    95.0%
LongCat-Flash-Chat               89.3%
Kimi K2 Instruct                 89.0%
Kimi K2-Instruct-0905            89.0%
MiniMax M1 80K                   86.8%
MiniMax M1 40K                   80.1%