GLM-5.1
GLM-5.1 is Z.AI's next-generation flagship foundation model, designed for long-horizon agentic engineering tasks. Built on a 754B-parameter Mixture-of-Experts (MoE) architecture with 40B active parameters, it can work continuously and autonomously on a single task for up to 8 hours, completing the full loop from planning and execution through iterative optimization to delivery. GLM-5.1 reports a state-of-the-art score of 58.4% on SWE-Bench Pro and performs strongly across coding, reasoning, and agentic benchmarks. It supports a 200K-token context length, up to 128K output tokens, thinking mode, function calling, structured output, context caching, and MCP integration. Overall performance is comparable to Claude Opus 4.6, with particular strengths in sustained execution and complex engineering optimization.
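The headline figures above imply some simple arithmetic, sketched below in Python. This is an illustration only, not part of any Z.AI SDK: the helper names are hypothetical, "200K" and "128K" are read as 200,000 and 128,000 tokens, and the sketch assumes that prompt and output tokens share a single context window, which the description does not state.

```python
# Figures taken from the model description above.
TOTAL_PARAMS_B = 754         # total parameters, in billions
ACTIVE_PARAMS_B = 40         # parameters active per token, in billions
CONTEXT_TOKENS = 200_000     # assumption: "200K" means 200,000 tokens
MAX_OUTPUT_TOKENS = 128_000  # assumption: "128K" means 128,000 tokens


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the MoE model's parameters used for each token."""
    return active_b / total_b


def max_prompt_tokens(
    requested_output: int,
    context: int = CONTEXT_TOKENS,
    output_cap: int = MAX_OUTPUT_TOKENS,
) -> int:
    """Prompt budget left after reserving room for the output.

    Assumption: prompt and output share one context window.
    """
    if not 0 <= requested_output <= output_cap:
        raise ValueError(f"requested_output must be in [0, {output_cap}]")
    return context - requested_output


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"Active parameter share: {share:.1%}")         # 5.3%
    print(f"Prompt budget at full output: {max_prompt_tokens(MAX_OUTPUT_TOKENS)}")  # 72000
```

Under these assumptions, about 5.3% of the parameters are active per token, and a request that reserves the full 128K output leaves roughly 72K tokens for the prompt.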
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AIME 2026 | 95.3% | self-reported, llm-stats |
| BrowseComp | 79.3% | self-reported, llm-stats |
| CyberGym | 68.7% | self-reported, llm-stats |
| GPQA | 86.2% | self-reported, llm-stats |
| HMMT 2025 | 94.0% | self-reported, llm-stats |
| HMMT Feb 26 | 82.6% | self-reported, llm-stats |
| Humanity's Last Exam | 52.3% | self-reported, llm-stats |
| IMO-AnswerBench | 83.8% | self-reported, llm-stats |
| MCP Atlas | 71.8% | self-reported, llm-stats |
| NL2Repo | 42.7% | self-reported, llm-stats |
| SWE-Bench Pro | 58.4% | self-reported, llm-stats |
| TAU3-Bench | 70.6% | self-reported, llm-stats |
| Terminal-Bench 2.0 | 69.0% | self-reported, llm-stats |
| Toolathlon | 40.7% | self-reported, llm-stats |
| Vending-Bench 2 | 5,634.41 | self-reported, llm-stats |