Kimi K2-Thinking-0905
Kimi K2 Thinking is the latest and most capable version of Moonshot AI's open-source thinking model. Building on Kimi K2, it is designed as a thinking agent that reasons step by step while dynamically invoking tools. It sets a new state of the art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by substantially scaling multi-step reasoning depth and keeping tool use stable across 200–300 sequential calls. K2 Thinking is also a natively INT4-quantized model with a 256k context window, which reduces inference latency and GPU memory usage with no reported loss in quality.

Key features:
Deep thinking and tool orchestration: the model is trained end to end to interleave chain-of-thought reasoning with function calls.
Native INT4 quantization: Quantization-Aware Training (QAT) yields a roughly 2x inference speed-up that is reported as lossless.
Stable long-horizon agency: the model maintains coherent, goal-directed behavior across 200–300 consecutive tool invocations.
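To make the idea of interleaving reasoning with tool calls concrete, the following is a minimal sketch of an agent loop with a tool-call budget. It is an illustration only, not Moonshot AI's implementation or API: `Step`, `run_agent`, `toy_model`, and the `add` tool are hypothetical names, and the model is replaced by a scripted stub so the example runs on its own. The budget of 300 simply mirrors the upper end of the 200–300 range quoted above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

MAX_TOOL_CALLS = 300  # assumption: upper end of the 200-300 range quoted above

@dataclass
class Step:
    """One model turn: a reasoning note plus either a tool call or a final answer."""
    thought: str
    tool: Optional[str] = None  # None means the model is giving its final answer
    argument: str = ""
    answer: str = ""

Transcript = List[Tuple[str, str]]

def run_agent(
    model: Callable[[Transcript], Step],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_tool_calls: int = MAX_TOOL_CALLS,
) -> Tuple[str, int]:
    """Alternate between model reasoning and tool execution until an answer appears."""
    transcript: Transcript = [("task", task)]
    calls = 0
    while True:
        step = model(transcript)
        transcript.append(("thought", step.thought))
        if step.tool is None:
            return step.answer, calls
        if calls >= max_tool_calls:
            raise RuntimeError("tool-call budget exhausted")
        result = tools[step.tool](step.argument)
        calls += 1
        # Feed the tool result back so the next reasoning step can use it.
        transcript.append(("tool_result", f"{step.tool}: {result}"))

def toy_model(transcript: Transcript) -> Step:
    """Scripted stand-in for a real model: sums 2, 3 and 4 using two tool calls."""
    results = [text for kind, text in transcript if kind == "tool_result"]
    if len(results) == 0:
        return Step(thought="Add the first two numbers.", tool="add", argument="2 3")
    if len(results) == 1:
        partial = results[0].split(": ")[1]
        return Step(thought="Add the third number.", tool="add", argument=f"{partial} 4")
    total = results[-1].split(": ")[1]
    return Step(thought="All numbers are summed.", answer=total)

def add(argument: str) -> str:
    """Hypothetical tool: adds two whitespace-separated integers."""
    a, b = argument.split()
    return str(int(a) + int(b))

answer, calls = run_agent(toy_model, {"add": add}, "Sum 2, 3 and 4")
print(answer, calls)
```

In a real deployment the stub would be replaced by a call to the model's serving endpoint, and the transcript would carry the model's actual reasoning and tool-call messages. The explicit budget check is one simple way to bound a long-horizon run.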
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AIME 2025 | 100.0% | self-reported llm-stats |
| BrowseComp | 60.2% | self-reported llm-stats |
| BrowseComp-zh | 62.3% | self-reported llm-stats |
| FinSearchComp-T3 | 47.4% | self-reported llm-stats |
| FRAMES | 87.0% | self-reported llm-stats |
| GPQA | 84.5% | self-reported llm-stats |
| HealthBench | 58.0% | self-reported llm-stats |
| HMMT 2025 | 97.5% | self-reported llm-stats |
| Humanity's Last Exam | 51.0% | self-reported llm-stats |
| IMO-AnswerBench | 78.6% | self-reported llm-stats |
| LiveCodeBench v6 | 83.1% | self-reported llm-stats |
| MMLU-Pro | 84.6% | self-reported llm-stats |
| MMLU-Redux | 94.4% | self-reported llm-stats |
| Multi-SWE-Bench | 41.9% | self-reported llm-stats |
| OJBench | 48.7% | self-reported llm-stats |
| SciCode | 44.8% | self-reported llm-stats |
| Seal-0 | 56.3% | self-reported llm-stats |
| SWE-bench Multilingual | 61.1% | self-reported llm-stats |
| SWE-Bench Verified | 71.3% | self-reported llm-stats |
| Terminal-Bench | 47.1% | self-reported llm-stats |
| WritingBench | 73.8% | self-reported llm-stats |