Claude 3.5 Sonnet
Claude 3.5 Sonnet is a powerful AI model with industry-leading software engineering skills. It excels at coding, planning, and problem-solving, with significant improvements on agentic coding and tool-use tasks. The model includes computer use capabilities in public beta, which let it interact with computer interfaces the way a human user would.
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AI2D | 94.7% | self-reported llm-stats |
| BIG-Bench Hard | 93.1% | self-reported llm-stats |
| ChartQA | 90.8% | self-reported llm-stats |
| DocVQA | 95.2% | self-reported llm-stats |
| DROP | 87.1% | self-reported llm-stats |
| GPQA | 67.2% | self-reported llm-stats |
| GSM8k | 96.4% | self-reported llm-stats |
| HumanEval | 93.7% | self-reported llm-stats |
| MATH | 78.3% | self-reported llm-stats |
| MathVista | 67.7% | self-reported llm-stats |
| MGSM | 91.6% | self-reported llm-stats |
| MMLU | 90.4% | self-reported llm-stats |
| MMLU-Pro | 77.6% | self-reported llm-stats |
| MMMU | 68.3% | self-reported llm-stats |
| OSWorld Extended | 22.0% | self-reported llm-stats |
| OSWorld Screenshot-only | 14.9% | self-reported llm-stats |
| SWE-Bench Verified | 49.0% | self-reported llm-stats |
| TAU-bench Airline | 46.0% | self-reported llm-stats |
| TAU-bench Retail | 69.2% | self-reported llm-stats |
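
For readers who want to work with these numbers programmatically, the following is a minimal sketch of parsing the first two columns of the table above into typed records and sorting them by score. The `BenchmarkScore` type, the `parse_scores` helper, and the `TABLE` string are hypothetical names introduced only for this illustration, and `TABLE` holds just a subset of the rows. It assumes every score cell is a plain percentage.

```python
from typing import NamedTuple


class BenchmarkScore(NamedTuple):
    """Hypothetical record type for one table row (illustration only)."""
    name: str
    score: float  # percentage, e.g. 96.4 means 96.4%


# A subset of rows copied from the table above (first two columns only).
TABLE = """\
| Benchmark | Score |
|---|---|
| GPQA | 67.2% |
| GSM8k | 96.4% |
| HumanEval | 93.7% |
| MMLU | 90.4% |
| OSWorld Screenshot-only | 14.9% |
| SWE-Bench Verified | 49.0% |
"""


def parse_scores(table: str) -> list[BenchmarkScore]:
    """Parse a markdown table whose first two columns are name and 'NN.N%'."""
    rows = []
    for line in table.strip().splitlines()[2:]:  # skip header and separator rows
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append(BenchmarkScore(cells[0], float(cells[1].rstrip("%"))))
    return rows


# Sort from highest to lowest score and print an aligned listing.
scores = sorted(parse_scores(TABLE), key=lambda s: s.score, reverse=True)
for s in scores:
    print(f"{s.name:<26}{s.score:>6.1f}%")
```

Because the benchmarks measure different tasks with different metrics, the sorted order is only a convenience for scanning the numbers, not a ranking of capabilities.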