Claude Opus 4.6

Claude Opus 4.6 is Anthropic's most intelligent model. It improves on its predecessor's coding skills with more careful planning, the ability to sustain agentic tasks for longer, more reliable operation in larger codebases, and better code review and debugging. It is the first Opus-class model with a 1M-token context window (beta), 128K output tokens, and adaptive thinking. It offers effort controls (low/medium/high/max) and context compaction for long-running tasks. It is state-of-the-art on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp. Pricing: $5/$25 per million tokens (input/output).
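To make the pricing concrete, here is a minimal sketch of the per-request cost arithmetic. It assumes the flat $5/$25 per-million-token rates quoted above apply to every token; the function name and the example token counts are hypothetical, and any tiered, cached, or batch pricing is ignored.

```python
# Rates taken from the summary above (USD per million tokens).
INPUT_USD_PER_MTOK = 5
OUTPUT_USD_PER_MTOK = 25


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request, assuming flat per-token rates.

    Hypothetical helper for illustration only; it does not model
    discounts, caching, or long-context surcharges.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 200K-token prompt with an 8K-token reply.
    cost = estimate_cost_usd(200_000, 8_000)
    print(f"${cost:.2f}")  # prints "$1.20"
```

Under these assumptions, output tokens cost five times as much as input tokens, so long generations can dominate the bill even when the prompt is large.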

Benchmark results

Benchmark                 Score     Tags           Source
AIME 2025                 99.8%     self-reported  llm-stats
ARC-AGI v2                68.8%     self-reported  llm-stats
BrowseComp                84.0%     self-reported  llm-stats
CharXiv-R                 77.4%     self-reported  llm-stats
CyberGym                  73.8%     self-reported  llm-stats
DeepSearchQA              91.3%     self-reported  llm-stats
FigQA                     78.3%     self-reported  llm-stats
Finance Agent             60.7%     self-reported  llm-stats
GDPval-AA                 1,606     self-reported  llm-stats
GPQA                      91.3%     self-reported  llm-stats
Graphwalks BFS >128k      61.5%     self-reported  llm-stats
Graphwalks parents >128k  95.4%     self-reported  llm-stats
Humanity's Last Exam      53.1%     self-reported  llm-stats
MCP Atlas                 62.7%     self-reported  llm-stats
MMMLU                     91.1%     self-reported  llm-stats
MMMU-Pro                  77.3%     self-reported  llm-stats
MRCR v2 (8-needle)        93.0%     self-reported  llm-stats
OpenRCA                   34.9%     self-reported  llm-stats
OSWorld                   72.7%     self-reported  llm-stats
SWE-bench Multilingual    77.8%     self-reported  llm-stats
SWE-Bench Verified        80.8%     self-reported  llm-stats
Tau2 Retail               91.9%     self-reported  llm-stats
Tau2 Telecom              99.3%     self-reported  llm-stats
Terminal-Bench 2.0        65.4%     self-reported  llm-stats
Vending-Bench 2           8,017.59  self-reported  llm-stats