MAI-Thinking-1
MAI-Thinking-1 is Microsoft AI's first in-house reasoning model: a sparse Mixture-of-Experts model with 35B active and roughly 1T total parameters, built on the MAI-Base-1 base model and trained from scratch without distillation from third-party models. Built with Microsoft's Hill-Climbing Machine pipeline, it was pre-trained on 30T tokens of clean, commercially licensed, human-generated data, followed by 3.55T mid-training tokens. Post-training used reinforcement learning to train STEM, agentic coding, and helpfulness/safety specialists, which were then consolidated into a single model. Its reported strengths are mathematical reasoning and software engineering: it reaches 97.0% on AIME 2025 and 52.8% on SWE-Bench Pro, the latter competitive with Claude Opus 4.6. It supports a 256k-token context window, function calling, and developer instructions, and is preferred over Claude Sonnet 4.6 in blind human side-by-side evaluations.
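To put the size figures in proportion, the short sketch below works out what share of the parameters is active per token and how many tokens were used across pre-training and mid-training. It uses only the numbers quoted in the description; the constant and function names are illustrative, and because the total parameter count is given only as "~1T", the resulting share is approximate.

```python
# Figures quoted in the description. TOTAL_PARAMS is approximate ("~1T"),
# so the computed share is a rough estimate, not an official number.
ACTIVE_PARAMS = 35e9
TOTAL_PARAMS = 1e12
PRETRAIN_TOKENS = 30e12
MIDTRAIN_TOKENS = 3.55e12


def active_fraction(active: float, total: float) -> float:
    """Share of a sparse MoE model's parameters that are active per token."""
    return active / total


def total_tokens(pretrain: float, midtrain: float) -> float:
    """Combined pre-training and mid-training token count."""
    return pretrain + midtrain


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    tokens = total_tokens(PRETRAIN_TOKENS, MIDTRAIN_TOKENS)
    print(f"Active share per token: {frac:.1%}")                   # about 3.5%
    print(f"Pre- plus mid-training tokens: {tokens / 1e12:.2f}T")  # 33.55T
```

Under these assumptions, roughly 3.5% of the weights take part in any single forward pass, which is why the 35B active figure, rather than the ~1T total, is the more useful guide to per-token compute.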
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AdvancedIF | 85.0% | self-reported llm-stats |
| AIME 2025 | 97.0% | self-reported llm-stats |
| AIME 2026 | 94.5% | self-reported llm-stats |
| AIR-Bench | 88.0% | self-reported llm-stats |
| BFCL-v3 | 72.0% | self-reported llm-stats |
| CorpusQA | 82.0% | self-reported llm-stats |
| CyberSecEval 4 | 63.0% | self-reported llm-stats |
| GPQA | 84.2% | self-reported llm-stats |
| GraphWalks | 90.0% | self-reported llm-stats |
| HealthBench Professional | 35.0% | self-reported llm-stats |
| HMMT Feb 26 | 84.9% | self-reported llm-stats |
| IFBench | 69.0% | self-reported llm-stats |
| LiveCodeBench v6 | 87.7% | self-reported llm-stats |
| LongBench v2 | 61.0% | self-reported llm-stats |
| LongFact | 98.0% | self-reported llm-stats |
| MedXpertQA | 43.0% | self-reported llm-stats |
| MMLU-Pro | 85.0% | self-reported llm-stats |
| Multi-Challenge | 53.0% | self-reported llm-stats |
| SimpleQA Verified | 31.0% | self-reported llm-stats |
| SWE-Bench Pro | 52.8% | self-reported llm-stats |
| SWE-Bench Verified | 73.5% | self-reported llm-stats |
| Terminal-Bench 2.0 | 46.0% | self-reported llm-stats |
| TruthfulQA | 88.0% | self-reported llm-stats |