DeepSeek-V3.1
DeepSeek-V3.1 is a hybrid model that supports both thinking and non-thinking modes, selected through the chat template. It is built on DeepSeek-V3.1-Base, which went through a two-phase long-context extension (32K phase: 630B tokens; 128K phase: 209B tokens), and has 671B total parameters, of which 37B are activated per token. Key improvements include better tool calling from post-training optimization, higher thinking efficiency (answer quality comparable to DeepSeek-R1-0528 with faster responses), and the UE8M0 FP8 scale data format for model weights and activations. The model targets reasoning tasks in thinking mode and practical applications in non-thinking mode, with particularly strong performance in code agent tasks, math competitions, and search-based problem solving.
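To put the sizes quoted above in proportion, the following is a minimal sketch of the arithmetic, not anything taken from the model's code. The constant and function names are hypothetical, and the only inputs are the figures stated in the description (671B total parameters, 37B activated, 630B and 209B extension tokens).

```python
# Figures quoted in the description, in billions. Names are illustrative only.
TOTAL_PARAMS_B = 671
ACTIVE_PARAMS_B = 37
EXTENSION_TOKENS_B = {"32K": 630, "128K": 209}


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters that are activated."""
    return active_b / total_b


def extension_total(phases: dict) -> int:
    """Return the total tokens (in billions) across all extension phases."""
    return sum(phases.values())


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"Activated share of parameters: {share:.1%}")
    print(f"Long-context extension tokens: {extension_total(EXTENSION_TOKENS_B)}B")
```

Under these figures, roughly 5.5% of the parameters are activated, and the two extension phases together cover 839B tokens.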
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| Aider-Polyglot | 68.4% | self-reported, llm-stats |
| AIME 2024 | 66.3% | self-reported, llm-stats |
| AIME 2025 | 49.8% | self-reported, llm-stats |
| BrowseComp | 30.0% | self-reported, llm-stats |
| BrowseComp-zh | 49.2% | self-reported, llm-stats |
| CodeForces | 69.7% | self-reported, llm-stats |
| GPQA | 74.9% | self-reported, llm-stats |
| HMMT 2025 | 33.5% | self-reported, llm-stats |
| Humanity's Last Exam | 15.9% | self-reported, llm-stats |
| LiveCodeBench | 56.4% | self-reported, llm-stats |
| MMLU-Pro | 83.7% | self-reported, llm-stats |
| MMLU-Redux | 91.8% | self-reported, llm-stats |
| SimpleQA | 93.4% | self-reported, llm-stats |
| SWE-bench Multilingual | 54.5% | self-reported, llm-stats |
| SWE-Bench Verified | 66.0% | self-reported, llm-stats |
| Terminal-Bench | 31.3% | self-reported, llm-stats |