IFEval

general

Instruction-Following Evaluation (IFEval) is a benchmark for large language models that focuses on verifiable instructions, meaning instructions whose fulfilment can be checked programmatically. It covers 25 instruction types across around 500 prompts, each containing one or more verifiable constraints.
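As a rough illustration of what "verifiable" means here, the sketch below checks a response against a few constraints using plain string logic and reports the fraction satisfied. The checker names, thresholds, and the `fraction_followed` helper are hypothetical and chosen only for this example; they are not the benchmark's own code, and the scores listed on this page may be computed differently.

```python
from typing import Callable, List


# Hypothetical checkers: names and thresholds are illustrative assumptions,
# not taken from the benchmark's implementation.
def has_min_words(response: str, n: int) -> bool:
    """True if the response contains at least n whitespace-separated words."""
    return len(response.split()) >= n


def has_no_commas(response: str) -> bool:
    """True if the response contains no comma characters."""
    return "," not in response


def includes_keyword(response: str, keyword: str) -> bool:
    """True if the keyword appears in the response, ignoring case."""
    return keyword.lower() in response.lower()


def fraction_followed(response: str, checks: List[Callable[[str], bool]]) -> float:
    """Share of constraints the response satisfies (hypothetical scoring helper)."""
    if not checks:
        raise ValueError("at least one check is required")
    passed = sum(1 for check in checks if check(response))
    return passed / len(checks)


response = "The quick brown fox jumps over the lazy dog"
checks = [
    lambda r: has_min_words(r, 5),         # satisfied: nine words
    has_no_commas,                         # satisfied: no commas
    lambda r: includes_keyword(r, "cat"),  # not satisfied: "cat" is absent
]
score = fraction_followed(response, checks)
print(f"{score:.2f}")
```

Because each check is a deterministic function of the response text, a result like this can be reproduced without a human or model judge, which is the property the description above refers to.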

Leaderboard

Showing 20 of 64 results

Qwen3.5-27B

95.0%

Qwen3.6 Plus

94.3%

Qwen3.7 Max

94.3%

o3-mini

93.9%

Qwen3.5-122B-A10B

93.4%

Claude 3.7 Sonnet

93.2%

Qwen3.5-397B-A17B

92.6%

Nova Pro

92.1%

Llama 3.3 70B Instruct

92.1%

Qwen3.5-35B-A3B

91.9%

Qwen3.5-9B

91.5%

Gemma 3 27B

90.4%

Nemotron Nano 9B v2

90.3%

Gemma 3 4B

90.2%

Kimi K2 Instruct

89.8%

Kimi K2-Instruct-0905

89.8%

Qwen3.5-4B

89.8%

Nova Lite

89.7%

LongCat-Flash-Chat

89.6%

Llama 3.1 Nemotron Ultra 253B v1

89.5%
