o1

Reasoning-tuned model.

Benchmark results

Benchmark      Score   Tags   Source
AIME 2024      83.3%
GPQA Diamond   78.0%
HumanEval      92.4%
MATH           94.8%