o3
o3 is OpenAI's most powerful reasoning model, well-rounded across domains. It sets a new standard for math, science, coding, and visual reasoning tasks, and it also excels at technical writing and instruction following. Use it to think through multi-step problems that involve analysis across text, code, and images.
Benchmark results
| Benchmark | Score | Tags | Source |
|---|---|---|---|
| Aider-Polyglot | 81.3% | self-reported llm-stats | link → |
| AIME 2024 | 91.6% | self-reported llm-stats | link → |
| AIME 2025 | 86.4% | self-reported llm-stats | link → |
| ARC-AGI | 88.0% | self-reported llm-stats | link → |
| ARC-AGI v2 | 6.5% | self-reported llm-stats | link → |
| BrowseComp | 49.7% | self-reported llm-stats | link → |
| CharXiv-R | 78.6% | self-reported llm-stats | link → |
| COLLIE | 98.4% | self-reported llm-stats | link → |
| ERQA | 64.0% | self-reported llm-stats | link → |
| FrontierMath | 15.8% | self-reported llm-stats | link → |
| GPQA | 83.3% | self-reported llm-stats | link → |
| Humanity's Last Exam | 14.7% | self-reported llm-stats | link → |
| Humanity's Last Exam | 24.3% | self-reported llm-stats | link → |
| MathVista | 86.8% | self-reported llm-stats | link → |
| MMMU | 82.9% | self-reported llm-stats | link → |
| MMMU-Pro | 76.4% | self-reported llm-stats | link → |
| Multi-Challenge | 60.4% | self-reported llm-stats | link → |
| Scale MultiChallenge | 56.5% | self-reported llm-stats | link → |
| Scale MultiChallenge | 60.4% | self-reported llm-stats | link → |
| SWE-Bench Verified | 69.1% | self-reported llm-stats | link → |
| Tau-bench | 63.0% | self-reported llm-stats | link → |
| Tau2 Airline | 64.8% | self-reported llm-stats | link → |
| Tau2 Retail | 80.2% | self-reported llm-stats | link → |
| Tau2 Telecom | 58.2% | self-reported llm-stats | link → |
| VideoMMMU | 83.3% | self-reported llm-stats | link → |