Internal API instruction following (hard)

No specific documentation for the Internal API instruction following (hard) benchmark was found in official sources.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, structured_output. Language: en. Verified by llm-stats: no.

Leaderboard

  1. GPT-5: 64.0% (self-reported, llm-stats)
  2. GPT-4.5: 54.0% (self-reported, llm-stats)
  3. o3-mini: 50.0% (self-reported, llm-stats)
  4. GPT-4.1: 49.1% (self-reported, llm-stats)
  5. GPT-4.1 mini: 45.1% (self-reported, llm-stats)
  6. GPT-4.1 nano: 31.6% (self-reported, llm-stats)
  7. GPT-4o: 29.2% (self-reported, llm-stats)
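
The metadata lists a max score of 1, while the leaderboard reports percentages. The source does not state how the two relate. Assuming the percentages are simply the raw score divided by the max score and scaled by 100, the conversion could be sketched as below. The helper name `to_percent` is hypothetical and is not part of any llm-stats tooling.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of max_score, one decimal place.

    Assumption: leaderboard percentages are score / max_score * 100.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{score / max_score * 100:.1f}%"


# Illustrative raw scores chosen to match two leaderboard entries.
print(to_percent(0.64))
print(to_percent(0.491))
```

Under this assumption, a raw score of 0.64 would correspond to the 64.0% shown for the top entry.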