o1

o1 is a research preview model focused on mathematical and logical reasoning. It shows improved performance on tasks that require step-by-step reasoning, mathematical problem-solving, and code generation, while maintaining strong general capabilities.

Benchmark results

Benchmark           Score   Tags           Source
AIME 2024           74.3%   self-reported  llm-stats
FrontierMath        5.5%    self-reported  llm-stats
GPQA                78.0%   self-reported  llm-stats
GPQA Biology        69.2%   self-reported  llm-stats
GPQA Chemistry      64.7%   self-reported  llm-stats
GPQA Physics        92.8%   self-reported  llm-stats
GSM8k               97.1%   self-reported  llm-stats
HumanEval           88.1%   self-reported  llm-stats
LiveBench           67.0%   self-reported  llm-stats
MATH                96.4%   self-reported  llm-stats
MathVista           71.8%   self-reported  llm-stats
MGSM                89.3%   self-reported  llm-stats
MMLU                91.8%   self-reported  llm-stats
MMMLU               87.7%   self-reported  llm-stats
MMMU                77.6%   self-reported  llm-stats
SimpleQA            47.0%   self-reported  llm-stats
SWE-Bench Verified  41.0%   self-reported  llm-stats
TAU-bench Airline   50.0%   self-reported  llm-stats
TAU-bench Retail    70.8%   self-reported  llm-stats