GPT-4.1

GPT-4.1 is OpenAI's latest and most advanced flagship model, significantly improving on GPT-4 Turbo in benchmark performance, speed, and cost-effectiveness.

Benchmark results

Benchmark                                  Score  Tags           Source
Aider-Polyglot                             51.6%  self-reported  llm-stats
Aider-Polyglot Edit                        52.9%  self-reported  llm-stats
AIME 2024                                  48.1%  self-reported  llm-stats
AIME 2025                                  46.4%  self-reported  llm-stats
CharXiv-D                                  87.9%  self-reported  llm-stats
CharXiv-R                                  56.7%  self-reported  llm-stats
COLLIE                                     65.8%  self-reported  llm-stats
ComplexFuncBench                           65.5%  self-reported  llm-stats
GPQA                                       66.3%  self-reported  llm-stats
Graphwalks BFS <128k                       61.7%  self-reported  llm-stats
Graphwalks BFS >128k                       19.0%  self-reported  llm-stats
Graphwalks parents <128k                   58.0%  self-reported  llm-stats
Graphwalks parents >128k                   25.0%  self-reported  llm-stats
HMMT 2025                                  28.9%  self-reported  llm-stats
Humanity's Last Exam                       5.4%   self-reported  llm-stats
IFEval                                     87.4%  self-reported  llm-stats
Internal API instruction following (hard)  49.1%  self-reported  llm-stats
MathVista                                  72.2%  self-reported  llm-stats
MMLU                                       90.2%  self-reported  llm-stats
MMMLU                                      87.3%  self-reported  llm-stats
MMMU                                       74.8%  self-reported  llm-stats
Multi-Challenge                            38.3%  self-reported  llm-stats
Multi-IF                                   70.8%  self-reported  llm-stats
MultiChallenge (o3-mini grader)            46.2%  self-reported  llm-stats
OpenAI-MRCR: 2 needle 128k                 57.2%  self-reported  llm-stats
OpenAI-MRCR: 2 needle 1M                   46.3%  self-reported  llm-stats
SWE-Bench Verified                         54.6%  self-reported  llm-stats
TAU-bench Airline                          49.4%  self-reported  llm-stats
TAU-bench Retail                           68.0%  self-reported  llm-stats
Video-MME (long, no subtitles)             72.0%  self-reported  llm-stats
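
Several long-context benchmarks in the table are reported in pairs at a shorter and a longer context length, so the drop between the two can be computed directly. The following is a minimal sketch in Python with the scores copied from the table above; the names `long_context_pairs` and `score_drop` are illustrative and not part of any published tooling.

```python
# Self-reported scores (percent) copied from the benchmark table.
# Each tuple is (shorter-context score, longer-context score).
long_context_pairs = {
    "Graphwalks BFS": (61.7, 19.0),          # <128k vs >128k
    "Graphwalks parents": (58.0, 25.0),      # <128k vs >128k
    "OpenAI-MRCR: 2 needle": (57.2, 46.3),   # 128k vs 1M
}


def score_drop(shorter: float, longer: float) -> float:
    """Return the drop in percentage points, rounded to one decimal place."""
    return round(shorter - longer, 1)


drops = {
    name: score_drop(shorter, longer)
    for name, (shorter, longer) in long_context_pairs.items()
}

for name, drop in drops.items():
    print(f"{name}: {drop} percentage points lower at the longer context")
```

The differences are in percentage points, not relative change, and they inherit the self-reported status of the underlying scores.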