GPT-5
GPT-5 is a flagship model from OpenAI designed for coding, reasoning, and agentic tasks across domains. It offers higher reasoning capability at medium speed.
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| Aider-Polyglot | 88.0% | self-reported, llm-stats |
| AIME 2025 | 94.6% | self-reported, llm-stats |
| BrowseComp | 54.9% | self-reported, llm-stats |
| BrowseComp Long Context 128k | 90.0% | self-reported, llm-stats |
| BrowseComp Long Context 256k | 88.8% | self-reported, llm-stats |
| CharXiv-R | 81.1% | self-reported, llm-stats |
| COLLIE | 99.0% | self-reported, llm-stats |
| ERQA | 65.7% | self-reported, llm-stats |
| FActScore | 1.0% | self-reported, llm-stats |
| FrontierMath | 26.3% | self-reported, llm-stats |
| GPQA | 85.7% | self-reported, llm-stats |
| Graphwalks BFS <128k | 78.3% | self-reported, llm-stats |
| Graphwalks parents <128k | 73.3% | self-reported, llm-stats |
| HealthBench Hard | 1.6% | self-reported, llm-stats |
| HMMT 2025 | 93.3% | self-reported, llm-stats |
| HumanEval | 93.4% | self-reported, llm-stats |
| Humanity's Last Exam | 24.8% | self-reported, llm-stats |
| Internal API instruction following (hard) | 64.0% | self-reported, llm-stats |
| LongFact Concepts | 0.7% | self-reported, llm-stats |
| LongFact Objects | 0.8% | self-reported, llm-stats |
| MATH | 84.7% | self-reported, llm-stats |
| MMLU | 92.5% | self-reported, llm-stats |
| MMMU | 84.2% | self-reported, llm-stats |
| MMMU-Pro | 78.4% | self-reported, llm-stats |
| Multi-Challenge | 69.6% | self-reported, llm-stats |
| MultiChallenge (o3-mini grader) | 69.6% | self-reported, llm-stats |
| OpenAI-MRCR: 2 needle 128k | 95.2% | self-reported, llm-stats |
| OpenAI-MRCR: 2 needle 256k | 86.8% | self-reported, llm-stats |
| Scale MultiChallenge | 69.6% | self-reported, llm-stats |
| SWE-Bench Verified | 74.9% | self-reported, llm-stats |
| SWE-Lancer (IC-Diamond subset) | 100.0% | self-reported, llm-stats |
| Tau2 Airline | 62.6% | self-reported, llm-stats |
| Tau2 Retail | 81.1% | self-reported, llm-stats |
| Tau2 Telecom | 96.7% | self-reported, llm-stats |
| VideoMME w sub. | 86.7% | self-reported, llm-stats |
| VideoMMMU | 84.6% | self-reported, llm-stats |

Note: the values for FActScore, HealthBench Hard, LongFact Concepts, and LongFact Objects appear to be hallucination or error rates, where lower is better, rather than accuracy scores. The three MultiChallenge rows (Multi-Challenge, MultiChallenge (o3-mini grader), and Scale MultiChallenge) report the same 69.6% and likely refer to one benchmark listed under different names.