GPT-5.5

GPT-5.5 is OpenAI's most capable model to date, designed for real work across agentic coding, computer use, knowledge work, and early scientific research. It matches GPT-5.4's per-token latency in real-world serving while reaching a much higher level of intelligence and using significantly fewer tokens to complete the same tasks. GPT-5.5 supports a 1M-token context window in the API and a 400K-token context window in Codex. It reports state-of-the-art results on Terminal-Bench 2.0, OSWorld-Verified, GDPval, FrontierMath, and CyberGym.
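The two context-window figures matter in practice when deciding whether a request fits on a given surface. The following is a minimal sketch, not an official API: the `fits_context` helper and the `CONTEXT_WINDOW_TOKENS` table are hypothetical names, "1M" and "400K" are read as round decimal counts, and the prompt and the reserved output are assumed to share a single window.

```python
# Context-window sizes quoted in the overview. Treating "1M" and "400K" as
# round decimal counts is an assumption, not a documented exact limit.
CONTEXT_WINDOW_TOKENS = {
    "api": 1_000_000,
    "codex": 400_000,
}


def fits_context(prompt_tokens: int, reserved_output_tokens: int, surface: str) -> bool:
    """Return True if prompt plus reserved output fits the surface's window.

    Assumption: input and output tokens count against one shared window.
    """
    if prompt_tokens < 0 or reserved_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    limit = CONTEXT_WINDOW_TOKENS[surface]  # KeyError for an unknown surface
    return prompt_tokens + reserved_output_tokens <= limit


if __name__ == "__main__":
    # A 600K-token prompt with 50K tokens reserved for the reply.
    for surface in CONTEXT_WINDOW_TOKENS:
        print(surface, fits_context(600_000, 50_000, surface))
```

Under these assumptions, the same 650K-token request fits the API window but not the Codex window, so a long-context job sized for the API may need trimming or chunking before it runs in Codex.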

Benchmark results

Benchmark                | Score | Tags          | Source
ARC-AGI                  | 95.0% | self-reported | llm-stats
ARC-AGI v2               | 85.0% | self-reported | llm-stats
BixBench                 | 80.5% | self-reported | llm-stats
BrowseComp               | 84.4% | self-reported | llm-stats
CyberGym                 | 81.8% | self-reported | llm-stats
Finance Agent            | 60.0% | self-reported | llm-stats
FrontierMath             | 35.4% | self-reported | llm-stats
GDPval-MM                | 84.9% | self-reported | llm-stats
GeneBench                | 25.0% | self-reported | llm-stats
GPQA                     | 93.6% | self-reported | llm-stats
Graphwalks BFS >128k     | 45.4% | self-reported | llm-stats
Graphwalks parents >128k | 58.5% | self-reported | llm-stats
Humanity's Last Exam     | 52.2% | self-reported | llm-stats
MCP Atlas                | 75.3% | self-reported | llm-stats
MMMU-Pro                 | 83.2% | self-reported | llm-stats
MRCR v2 (8-needle)       | 74.0% | self-reported | llm-stats
OfficeQA Pro             | 54.1% | self-reported | llm-stats
OSWorld-Verified         | 78.7% | self-reported | llm-stats
SWE-Bench Pro            | 58.6% | self-reported | llm-stats
Tau2 Telecom             | 98.0% | self-reported | llm-stats
Terminal-Bench 2.0       | 82.7% | self-reported | llm-stats
Toolathlon               | 55.6% | self-reported | llm-stats