GPT-5.5
GPT-5.5 is OpenAI's smartest model yet, designed for real work across agentic coding, computer use, knowledge work, and early scientific research. It matches GPT-5.4's per-token latency in real-world serving while reaching a much higher level of intelligence and using significantly fewer tokens to complete the same tasks. GPT-5.5 supports a 1M-token context window in the API and a 400K-token context window in Codex, and it achieves state-of-the-art results on Terminal-Bench 2.0, OSWorld-Verified, GDPval, FrontierMath, and CyberGym.
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| ARC-AGI | 95.0% | self-reported llm-stats |
| ARC-AGI v2 | 85.0% | self-reported llm-stats |
| BixBench | 80.5% | self-reported llm-stats |
| BrowseComp | 84.4% | self-reported llm-stats |
| CyberGym | 81.8% | self-reported llm-stats |
| Finance Agent | 60.0% | self-reported llm-stats |
| FrontierMath | 35.4% | self-reported llm-stats |
| GDPval-MM | 84.9% | self-reported llm-stats |
| GeneBench | 25.0% | self-reported llm-stats |
| GPQA | 93.6% | self-reported llm-stats |
| Graphwalks BFS >128k | 45.4% | self-reported llm-stats |
| Graphwalks parents >128k | 58.5% | self-reported llm-stats |
| Humanity's Last Exam | 52.2% | self-reported llm-stats |
| MCP Atlas | 75.3% | self-reported llm-stats |
| MMMU-Pro | 83.2% | self-reported llm-stats |
| MRCR v2 (8-needle) | 74.0% | self-reported llm-stats |
| OfficeQA Pro | 54.1% | self-reported llm-stats |
| OSWorld-Verified | 78.7% | self-reported llm-stats |
| SWE-Bench Pro | 58.6% | self-reported llm-stats |
| Tau2 Telecom | 98.0% | self-reported llm-stats |
| Terminal-Bench 2.0 | 82.7% | self-reported llm-stats |
| Toolathlon | 55.6% | self-reported llm-stats |