o3

o3 is OpenAI's most powerful reasoning model and performs well across domains. It sets a new standard on math, science, coding, and visual reasoning tasks, and it also excels at technical writing and instruction-following. Use it for multi-step problems that require analysis across text, code, and images.

Benchmark results

Benchmark             Score   Tags           Source
Aider-Polyglot        81.3%   self-reported  llm-stats
AIME 2024             91.6%   self-reported  llm-stats
AIME 2025             86.4%   self-reported  llm-stats
ARC-AGI               88.0%   self-reported  llm-stats
ARC-AGI v2            6.5%    self-reported  llm-stats
BrowseComp            49.7%   self-reported  llm-stats
CharXiv-R             78.6%   self-reported  llm-stats
COLLIE                98.4%   self-reported  llm-stats
ERQA                  64.0%   self-reported  llm-stats
FrontierMath          15.8%   self-reported  llm-stats
GPQA                  83.3%   self-reported  llm-stats
Humanity's Last Exam  14.7%   self-reported  llm-stats
Humanity's Last Exam  24.3%   self-reported  llm-stats
MathVista             86.8%   self-reported  llm-stats
MMMU                  82.9%   self-reported  llm-stats
MMMU-Pro              76.4%   self-reported  llm-stats
Multi-Challenge       60.4%   self-reported  llm-stats
Scale MultiChallenge  56.5%   self-reported  llm-stats
Scale MultiChallenge  60.4%   self-reported  llm-stats
SWE-Bench Verified    69.1%   self-reported  llm-stats
Tau-bench             63.0%   self-reported  llm-stats
Tau2 Airline          64.8%   self-reported  llm-stats
Tau2 Retail           80.2%   self-reported  llm-stats
Tau2 Telecom          58.2%   self-reported  llm-stats
VideoMMMU             83.3%   self-reported  llm-stats