Claude Mythos Preview

Claude Mythos Preview is an unreleased general-purpose frontier model from Anthropic, a new tier above Opus (internal codename 'Capybara'). It identified thousands of zero-day vulnerabilities across every major operating system and web browser as part of Project Glasswing, a cross-industry cybersecurity initiative with 12 partners including AWS, Apple, Microsoft, and Google. It reports state-of-the-art results on SWE-bench Verified (93.9%), GPQA Diamond (94.6%), USAMO (97.6%), Terminal-Bench 2.0 (82.0%), CyberGym (83.1%), and Cybench (100% pass@1, saturated). It represents a 4.3x increase over the previous trendline for model performance. It is deployed under the ASL-3 Standard. Anthropic's risk report describes it as the best-aligned Claude model to date, and it received the first-ever 24-hour internal alignment review before deployment. It is not planned for general availability. Pricing for participants is $25 per million input tokens and $125 per million output tokens. The system card runs 244 pages.
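As a rough illustration of the participant pricing quoted above, the following Python sketch estimates the cost of a single request. The helper name estimate_cost_usd and the token counts in the usage line are hypothetical, and the sketch assumes simple linear per-token billing with no caching discounts, batch rates, or minimum charges, none of which are described here.

```python
from decimal import Decimal

# Prices quoted in the description, in USD per million tokens.
INPUT_USD_PER_MTOK = Decimal("25")
OUTPUT_USD_PER_MTOK = Decimal("125")
TOKENS_PER_MTOK = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumes linear per-token billing; real billing rules may differ.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = INPUT_USD_PER_MTOK * input_tokens / TOKENS_PER_MTOK
    output_cost = OUTPUT_USD_PER_MTOK * output_tokens / TOKENS_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # Made-up example: 200,000 input tokens and 20,000 output tokens.
    print(estimate_cost_usd(200_000, 20_000))
```

Decimal is used instead of float so that currency amounts do not pick up binary rounding error. Because output tokens cost five times as much as input tokens, output-heavy workloads dominate the total under this assumption.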

Benchmark results

Benchmark                 Score    Tags            Source
BrowseComp                86.9%    self-reported   llm-stats
CharXiv-R                 93.2%    self-reported   llm-stats
Cybench                   100.0%   self-reported   llm-stats
CyberGym                  83.1%    self-reported   llm-stats
FigQA                     89.0%    self-reported   llm-stats
GPQA                      94.6%    self-reported   llm-stats
Graphwalks BFS >128k      80.0%    self-reported   llm-stats
Humanity's Last Exam      64.7%    self-reported   llm-stats
MMMLU                     92.7%    self-reported   llm-stats
OSWorld-Verified          79.6%    self-reported   llm-stats
SWE-bench Multilingual    87.3%    self-reported   llm-stats
SWE-bench Multimodal      59.0%    self-reported   llm-stats
SWE-Bench Pro             77.8%    self-reported   llm-stats
SWE-bench Verified        93.9%    self-reported   llm-stats
Terminal-Bench 2.0        82.0%    self-reported   llm-stats
USAMO25                   97.6%    self-reported   llm-stats