FlenQA

reasoning

Flexible Length Question Answering dataset for evaluating how input length affects the reasoning performance of language models. It features True/False questions embedded in contexts of varying lengths (250-3000 tokens) across three reasoning tasks: Monotone Relations, People In Rooms, and a simplified Ruletaker.
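
Because the benchmark varies context length while keeping the question format fixed, a natural way to read results is accuracy per context length. The following is a minimal sketch of that bookkeeping, not the dataset's official evaluation code. The record fields `length`, `label`, and `prediction` are hypothetical names chosen for illustration, and the sample records are made up.

```python
from collections import defaultdict


def accuracy_by_length(records):
    """Return {context_length: accuracy} for True/False predictions.

    Each record is a dict with hypothetical keys:
      "length"     - context length in tokens (e.g. 250 or 3000)
      "label"      - gold answer, True or False
      "prediction" - model answer, True or False
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        length = rec["length"]
        total[length] += 1
        if rec["prediction"] == rec["label"]:
            correct[length] += 1
    # Sort by length so any trend across context sizes is easy to read.
    return {length: correct[length] / total[length] for length in sorted(total)}


# Made-up records for illustration only; not real FlenQA data or results.
sample = [
    {"length": 250, "label": True, "prediction": True},
    {"length": 250, "label": False, "prediction": False},
    {"length": 3000, "label": True, "prediction": False},
    {"length": 3000, "label": False, "prediction": False},
]

print(accuracy_by_length(sample))
```

Grouping by length rather than reporting a single pooled number keeps any drop at longer contexts visible; a single aggregate score would average it away.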

Leaderboard

Phi 4 Reasoning Plus

97.9%

Phi 4 Reasoning

97.7%