o1
A research preview model focused on mathematical and logical reasoning. It shows improved performance on tasks that require step-by-step reasoning, mathematical problem-solving, and code generation, with stronger formal reasoning and strong general capabilities retained.
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AIME 2024 | 74.3% | self-reported llm-stats |
| FrontierMath | 5.5% | self-reported llm-stats |
| GPQA | 78.0% | self-reported llm-stats |
| GPQA Biology | 69.2% | self-reported llm-stats |
| GPQA Chemistry | 64.7% | self-reported llm-stats |
| GPQA Physics | 92.8% | self-reported llm-stats |
| GSM8k | 97.1% | self-reported llm-stats |
| HumanEval | 88.1% | self-reported llm-stats |
| LiveBench | 67.0% | self-reported llm-stats |
| MATH | 96.4% | self-reported llm-stats |
| MathVista | 71.8% | self-reported llm-stats |
| MGSM | 89.3% | self-reported llm-stats |
| MMLU | 91.8% | self-reported llm-stats |
| MMMLU | 87.7% | self-reported llm-stats |
| MMMU | 77.6% | self-reported llm-stats |
| SimpleQA | 47.0% | self-reported llm-stats |
| SWE-Bench Verified | 41.0% | self-reported llm-stats |
| TAU-bench Airline | 50.0% | self-reported llm-stats |
| TAU-bench Retail | 70.8% | self-reported llm-stats |