IFEval

IFEval (Instruction-Following Evaluation) is a benchmark for large language models that focuses on verifiable instructions, meaning constraints whose satisfaction can be checked programmatically. It covers 25 instruction types across around 500 prompts, each containing one or more verifiable constraints.
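
To make "verifiable" concrete, here is a minimal sketch of how such constraints might be checked and aggregated. It is not the official IFEval implementation: the checker names (min_words, no_commas, has_n_bullets), the example responses, and the score function are all hypothetical and chosen for illustration. The two aggregates mirror the prompt-level and instruction-level accuracies commonly reported for this benchmark, both on a 0 to 1 scale.

```python
import re
from typing import Callable, Dict, List, Tuple

Checker = Callable[[str], bool]

# Hypothetical checkers: illustrative stand-ins, not IFEval's own instruction set.
def min_words(n: int) -> Checker:
    """Build a checker that passes when the response has at least n words."""
    return lambda response: len(response.split()) >= n

def no_commas(response: str) -> bool:
    """Pass when the response contains no comma characters."""
    return "," not in response

def has_n_bullets(n: int) -> Checker:
    """Build a checker that passes when exactly n lines start with '- ' or '* '."""
    def check(response: str) -> bool:
        bullets = re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)
        return len(bullets) == n
    return check

def score(examples: List[Tuple[str, List[Checker]]]) -> Dict[str, float]:
    """Aggregate pass/fail results into two accuracies in [0, 1].

    prompt_level: fraction of prompts where every constraint passed.
    instruction_level: fraction of individual constraints that passed.
    """
    prompt_hits = 0
    instruction_hits = 0
    instruction_total = 0
    for response, checkers in examples:
        results = [checker(response) for checker in checkers]
        prompt_hits += all(results)
        instruction_hits += sum(results)
        instruction_total += len(results)
    return {
        "prompt_level": prompt_hits / len(examples),
        "instruction_level": instruction_hits / instruction_total,
    }

# Made-up responses paired with the constraints their prompts imposed.
examples = [
    ("- apples\n- pears", [has_n_bullets(2), no_commas]),
    ("Short, reply", [min_words(5), no_commas]),
    ("one two three four five", [min_words(5), no_commas, has_n_bullets(1)]),
]
result = score(examples)
```

Because every check is a deterministic function of the response text, no human or model judge is needed to score a run. Prompt-level accuracy is the stricter of the two numbers, since a single failed constraint fails the whole prompt.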

Methodology

This entry is imported from llm-stats public benchmark metadata and has not been verified by llm-stats. Modality: text. Language: en. Categories: general, instruction_following, structured_output. Max score: 1; the leaderboard below shows scores as percentages of that maximum.

Leaderboard

All scores are self-reported, as listed by llm-stats.

  1. Qwen3.5-27B: 95.0%
  2. Qwen3.6 Plus: 94.3%
  3. Qwen3.7 Max: 94.3%
  4. o3-mini: 93.9%
  5. Qwen3.5-122B-A10B: 93.4%
  6. Claude 3.7 Sonnet: 93.2%
  7. Qwen3.5-397B-A17B: 92.6%
  8. Nova Pro: 92.1%
  9. Llama 3.3 70B Instruct: 92.1%
  10. Qwen3.5-35B-A3B: 91.9%
  11. Gemma 3 27B: 90.4%
  12. Nemotron Nano 9B v2: 90.3%
  13. Gemma 3 4B: 90.2%
  14. Kimi K2 Instruct: 89.8%
  15. Kimi K2-Instruct-0905: 89.8%
  16. Nova Lite: 89.7%
  17. LongCat-Flash-Chat: 89.6%
  18. Llama 3.1 Nemotron Ultra 253B v1: 89.5%
  19. Gemma 3 12B: 88.9%
  20. Qwen3-Next-80B-A3B-Thinking: 88.9%