o3-mini

A smaller variant of o3, expected to offer enhanced multimodal capabilities, improved reasoning, and more efficient resource use than previous models while maintaining strong performance on core tasks.

Benchmark results

Benchmark                                   Score  Tags           Source
Aider-Polyglot                              66.7%  self-reported  llm-stats
Aider-Polyglot Edit                         60.4%  self-reported  llm-stats
AIME 2024                                   87.3%  self-reported  llm-stats
COLLIE                                      98.7%  self-reported  llm-stats
ComplexFuncBench                            17.6%  self-reported  llm-stats
FrontierMath                                9.2%   self-reported  llm-stats
GPQA                                        77.2%  self-reported  llm-stats
Graphwalks BFS <128k                        51.0%  self-reported  llm-stats
Graphwalks parents <128k                    58.3%  self-reported  llm-stats
IFEval                                      93.9%  self-reported  llm-stats
Internal API instruction following (hard)   50.0%  self-reported  llm-stats
LiveBench                                   84.6%  self-reported  llm-stats
MATH                                        97.9%  self-reported  llm-stats
MGSM                                        92.0%  self-reported  llm-stats
MMLU                                        86.9%  self-reported  llm-stats
Multi-Challenge                             39.9%  self-reported  llm-stats
Multi-IF                                    79.5%  self-reported  llm-stats
MultiChallenge (o3-mini grader)             50.2%  self-reported  llm-stats
Multilingual MMLU                           80.7%  self-reported  llm-stats
OpenAI-MRCR: 2 needle 128k                  18.7%  self-reported  llm-stats
SimpleQA                                    15.0%  self-reported  llm-stats
SWE-Bench Verified                          49.3%  self-reported  llm-stats
SWE-Lancer                                  18.0%  self-reported  llm-stats
SWE-Lancer (IC-Diamond subset)              7.4%   self-reported  llm-stats
TAU-bench Airline                           32.4%  self-reported  llm-stats
TAU-bench Retail                            57.6%  self-reported  llm-stats