COLLIE


COLLIE is a grammar-based framework for systematically constructing constrained text generation tasks. It supports rich, compositional constraints at several generation levels (word, sentence, paragraph, passage) and targets modeling challenges such as language understanding, logical reasoning, and semantic planning. The COLLIE-v1 dataset contains 2,080 instances across 13 constraint structures.
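To make "compositional constraints" concrete, here is a minimal sketch of how simple checks on generated text could be combined into one task and scored. It is not the COLLIE library's API: the names `Constraint`, `word_count_equals`, `last_word_is`, `all_of`, and `pass_rate` are hypothetical, and the whitespace-based word splitting is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Constraint:
    """A named predicate over generated text (hypothetical, not the COLLIE API)."""
    name: str
    check: Callable[[str], bool]


def word_count_equals(n: int) -> Constraint:
    # Assumption: words are whitespace-separated tokens.
    return Constraint(f"word_count == {n}", lambda text: len(text.split()) == n)


def last_word_is(word: str) -> Constraint:
    def check(text: str) -> bool:
        words = text.split()
        # Ignore trailing punctuation and case when comparing the last word.
        return bool(words) and words[-1].strip(".,!?;:").lower() == word.lower()
    return Constraint(f"last_word == {word!r}", check)


def all_of(*parts: Constraint) -> Constraint:
    # Composition: the combined constraint holds only if every part holds.
    name = " AND ".join(p.name for p in parts)
    return Constraint(name, lambda text: all(p.check(text) for p in parts))


def pass_rate(constraint: Constraint, outputs: Sequence[str]) -> float:
    # Fraction of outputs satisfying the constraint, in the range [0, 1].
    if not outputs:
        return 0.0
    return sum(constraint.check(o) for o in outputs) / len(outputs)


if __name__ == "__main__":
    task = all_of(word_count_equals(5), last_word_is("home"))
    outputs = [
        "The dog ran back home.",   # 5 words, ends with "home": satisfies both
        "We finally went home.",    # 4 words: fails the word count
        "Home is where we go.",     # 5 words, ends with "go": fails the last word
    ]
    print(task.name, pass_rate(task, outputs))
```

Keeping each constraint as a small predicate means new task structures come from composition rather than from new checking code, which is the general idea behind a grammar of constraints; the actual COLLIE grammar is richer than this sketch.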

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: language, reasoning, writing. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5: 99.0% (self-reported, llm-stats)
  2. o3-mini: 98.7% (self-reported, llm-stats)
  3. o3: 98.4% (self-reported, llm-stats)
  4. Mistral Medium 3.5: 95.8% (self-reported, llm-stats)
  5. GPT-4.5: 72.3% (self-reported, llm-stats)
  6. GPT-4.1: 65.8% (self-reported, llm-stats)
  7. GPT-4o: 61.0% (self-reported, llm-stats)
  8. GPT-4.1 mini: 54.6% (self-reported, llm-stats)
  9. GPT-4.1 nano: 42.5% (self-reported, llm-stats)