Grok 4 Heavy
Grok 4 Heavy is the multi-agent version of Grok 4, released alongside the standard model in summer 2025. The system spawns multiple Grok 4 agents in parallel. The agents work on a problem independently and then compare their solutions, much like a study group. They share the insights and tricks they discover, and the system combines their work rather than simply taking a majority vote. Grok 4 Heavy uses approximately 10x more test-time compute than regular Grok 4, which lets it solve significantly more complex problems. On Humanity's Last Exam it achieves over 50% accuracy on text-only problems, and it achieved a perfect score on the AIME 2025 mathematics competition. The system represents a major advance in multi-agent AI collaboration and reasoning.
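To make the difference between majority voting and comparing solutions concrete, here is a minimal toy sketch in Python. It is not xAI's implementation: the source does not describe how Grok 4 Heavy actually combines agent outputs. The agents (`careless_agent`, `careful_agent`), the square-root task, and the check inside `compare_and_select` are all hypothetical stand-ins chosen only for illustration.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical agent type: takes a target number and proposes its integer square root.
Agent = Callable[[int], int]


def run_agents(agents: List[Agent], target: int) -> List[int]:
    """Run every agent on the same problem in parallel and collect the answers in order."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        return list(pool.map(lambda agent: agent(target), agents))


def majority_vote(answers: List[int]) -> int:
    """Baseline aggregation: return the most common answer."""
    return Counter(answers).most_common(1)[0][0]


def compare_and_select(answers: List[int], target: int) -> int:
    """Illustrative stand-in for comparing solutions: keep the first answer
    that passes a check, and fall back to majority voting if none does."""
    for answer in answers:
        if answer * answer == target:
            return answer
    return majority_vote(answers)


def careless_agent(target: int) -> int:
    """Toy agent that always returns the same wrong guess."""
    return 36


def careful_agent(target: int) -> int:
    """Toy agent that computes the exact integer square root."""
    return math.isqrt(target)


# Two agents agree on a wrong answer; one agent finds the right one.
answers = run_agents([careless_agent, careless_agent, careful_agent], 1369)
voted = majority_vote(answers)
selected = compare_and_select(answers, 1369)
```

In this toy setup the two careless agents outvote the careful one, so `voted` is wrong while `selected` is right. The point is only that inspecting candidate solutions can recover a correct minority answer that a vote would discard. A real system would need a far more general way to judge candidates than the exact check assumed here.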
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AIME 2025 | 100.0% | self-reported, llm-stats |
| GPQA | 88.4% | self-reported, llm-stats |
| HMMT25 | 96.7% | self-reported, llm-stats |
| Humanity's Last Exam | 50.7% | self-reported, llm-stats |
| LiveCodeBench | 79.4% | self-reported, llm-stats |
| USAMO25 | 61.9% | self-reported, llm-stats |