GPT-5.2

GPT-5.2 delivers substantial gains in professional knowledge work: on GDPval it beats or ties experts in 70.9% of comparisons. It also sets new highs in coding (SWE-Bench Pro 55.6%), science (GPQA Diamond ~92–93%), math (AIME 2025: 100%), long-context accuracy at up to 256k tokens, and tool calling (Tau2 Telecom 98.7%). It rolls out in three variants, Instant, Thinking, and Pro, which are described as faster, more structured, and less error-prone. Pricing is $1.75 per 1M input tokens and $14 per 1M output tokens. Pro variants support xhigh reasoning for top-quality, end-to-end execution.
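
To make the quoted prices concrete, here is a minimal sketch of the cost arithmetic for a single request. It assumes flat per-token billing at the two rates above, with no caching discounts, tiers, or variant-specific pricing, none of which the summary specifies. The function name `estimate_cost_usd` and the example token counts are hypothetical and not part of any official API.

```python
from decimal import Decimal

# Prices quoted in the summary, in US dollars per 1M tokens.
INPUT_USD_PER_MILLION = Decimal("1.75")
OUTPUT_USD_PER_MILLION = Decimal("14")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request (hypothetical helper, flat rates assumed)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 200,000 input tokens and 10,000 output tokens.
    print(estimate_cost_usd(200_000, 10_000))
```

Under these assumptions, a request with 200,000 input tokens and 10,000 output tokens costs $0.35 + $0.14 = $0.49. Output tokens are priced at eight times the input rate, so long generated responses dominate the bill much sooner than long prompts do.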

Benchmark results

Benchmark                         Score    Tags           Source
AIME 2025                         100.0%   self-reported  llm-stats
ARC-AGI                           86.2%    self-reported  llm-stats
ARC-AGI v2                        52.9%    self-reported  llm-stats
BrowseComp                        65.8%    self-reported  llm-stats
BrowseComp Long Context 128k      92.0%    self-reported  llm-stats
BrowseComp Long Context 256k      89.8%    self-reported  llm-stats
CharXiv-R                         82.1%    self-reported  llm-stats
FrontierMath                      40.3%    self-reported  llm-stats
GPQA                              92.4%    self-reported  llm-stats
Graphwalks BFS <128k              94.0%    self-reported  llm-stats
Graphwalks parents <128k          89.0%    self-reported  llm-stats
HMMT 2025                         99.4%    self-reported  llm-stats
Humanity's Last Exam              34.5%    self-reported  llm-stats
MCP Atlas                         60.6%    self-reported  llm-stats
MMMLU                             89.6%    self-reported  llm-stats
MMMU-Pro                          79.5%    self-reported  llm-stats
ScreenSpot Pro                    86.3%    self-reported  llm-stats
SWE-Bench Verified                80.0%    self-reported  llm-stats
SWE-Lancer (IC-Diamond subset)    74.6%    self-reported  llm-stats
Tau2 Retail                       82.0%    self-reported  llm-stats
Tau2 Telecom                      98.7%    self-reported  llm-stats
Toolathlon                        46.3%    self-reported  llm-stats
VideoMMMU                         85.9%    self-reported  llm-stats