GPT-4.5

GPT-4.5 is OpenAI's most advanced model, offering improved reasoning, coding, and creative capabilities with faster performance and longer context handling than GPT-4. It features enhanced instruction following, reduced hallucinations, and better factual accuracy.

Benchmark results

Benchmark                                  Score  Tags           Source
Aider-Polyglot Edit                        44.9%  self-reported  llm-stats
AIME 2024                                  36.7%  self-reported  llm-stats
CharXiv-D                                  90.0%  self-reported  llm-stats
CharXiv-R                                  55.4%  self-reported  llm-stats
COLLIE                                     72.3%  self-reported  llm-stats
ComplexFuncBench                           63.0%  self-reported  llm-stats
GPQA                                       69.5%  self-reported  llm-stats
Graphwalks BFS <128k                       72.3%  self-reported  llm-stats
Graphwalks parents <128k                   72.6%  self-reported  llm-stats
GSM8k                                      97.0%  self-reported  llm-stats
HumanEval                                  88.0%  self-reported  llm-stats
IFEval                                     88.2%  self-reported  llm-stats
Internal API instruction following (hard)  54.0%  self-reported  llm-stats
MathVista                                  72.3%  self-reported  llm-stats
MMLU                                       90.8%  self-reported  llm-stats
MMMLU                                      85.1%  self-reported  llm-stats
MMMU                                       75.2%  self-reported  llm-stats
Multi-Challenge                            43.8%  self-reported  llm-stats
Multi-IF                                   70.8%  self-reported  llm-stats
MultiChallenge (o3-mini grader)            50.1%  self-reported  llm-stats
OpenAI-MRCR: 2 needle 128k                 38.5%  self-reported  llm-stats
SimpleQA                                   62.5%  self-reported  llm-stats
SWE-Bench Verified                         38.0%  self-reported  llm-stats
SWE-Lancer                                 37.3%  self-reported  llm-stats
SWE-Lancer (IC-Diamond subset)             17.4%  self-reported  llm-stats
TAU-bench Airline                          50.0%  self-reported  llm-stats
TAU-bench Retail                           68.4%  self-reported  llm-stats