Kimi K2-Instruct-0905
Kimi K2-Instruct-0905 is the latest, most capable version of Kimi K2, achieving state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. This Mixture-of-Experts (MoE) model activates 32 billion of its 1 trillion total parameters and is optimized for agentic tasks. Key features include stronger agentic coding, a context length extended to 256K tokens, and a hybrid architecture trained with the MuonClip optimizer on 15.5T tokens. The model scores 65.8% on SWE-bench Verified (single attempt), 47.3% on SWE-bench Multilingual, and 70.6% on Tau2 Retail, a tool-use benchmark. It is a reflex-grade model without long thinking, designed to act on and execute complex tasks directly.
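To put the parameter and token counts quoted above in proportion, the following is a minimal sketch that derives two ratios from them. The constant and function names are illustrative assumptions, not part of any Kimi tooling, and the inputs are only the rounded figures stated in the description.

```python
# Rounded figures taken from the description above; names are hypothetical.
TOTAL_PARAMS = 1_000_000_000_000       # 1 trillion total parameters
ACTIVE_PARAMS = 32_000_000_000         # 32 billion activated parameters
TRAINING_TOKENS = 15_500_000_000_000   # 15.5T training tokens


def active_fraction(active: int, total: int) -> float:
    """Share of the total parameters that are activated."""
    return active / total


def tokens_per_param(tokens: int, params: int) -> float:
    """Training tokens seen per parameter."""
    return tokens / params


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"Activated share of parameters: {share:.1%}")  # 3.2%
    ratio = tokens_per_param(TRAINING_TOKENS, TOTAL_PARAMS)
    print(f"Training tokens per total parameter: {ratio}")  # 15.5
```

With these rounded inputs, about 3.2% of the parameters are activated, which is the sense in which a sparse MoE model of this size can have a per-token compute cost closer to that of a much smaller dense model.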
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| ACEBench | 76.5% | self-reported llm-stats |
| Aider-Polyglot | 60.0% | self-reported llm-stats |
| AIME 2024 | 69.6% | self-reported llm-stats |
| AIME 2025 | 49.5% | self-reported llm-stats |
| AutoLogi | 89.5% | self-reported llm-stats |
| CNMO 2024 | 74.3% | self-reported llm-stats |
| GPQA | 75.1% | self-reported llm-stats |
| HMMT 2025 | 38.8% | self-reported llm-stats |
| Humanity's Last Exam (HLE) | 4.7% | self-reported llm-stats |
| IFEval | 89.8% | self-reported llm-stats |
| LiveBench | 76.4% | self-reported llm-stats |
| LiveCodeBench | 53.7% | self-reported llm-stats |
| MATH-500 | 97.4% | self-reported llm-stats |
| MMLU | 89.5% | self-reported llm-stats |
| MMLU-Pro | 81.1% | self-reported llm-stats |
| MMLU-Redux | 92.7% | self-reported llm-stats |
| Multi-Challenge | 54.1% | self-reported llm-stats |
| MultiPL-E | 85.7% | self-reported llm-stats |
| OJBench | 27.1% | self-reported llm-stats |
| PolyMath-en | 65.1% | self-reported llm-stats |
| SimpleQA | 31.0% | self-reported llm-stats |
| SuperGPQA | 57.2% | self-reported llm-stats |
| SWE-bench Multilingual | 47.3% | self-reported llm-stats |
| SWE-bench Verified | 65.8% | self-reported llm-stats |
| Tau2 Airline | 56.5% | self-reported llm-stats |
| Tau2 Retail | 70.6% | self-reported llm-stats |
| Tau2 Telecom | 65.8% | self-reported llm-stats |
| Terminal-Bench | 25.0% | self-reported llm-stats |
| ZebraLogic | 89.0% | self-reported llm-stats |