Muse Spark
Muse Spark is the first model in the Muse family, developed by Meta Superintelligence Labs. It is a natively multimodal reasoning model that supports tool use, visual chain of thought, and multi-agent orchestration. Its Contemplating mode runs multiple agents that reason in parallel. The model reports competitive performance on multimodal perception, reasoning, health, and agentic tasks, with Contemplating mode achieving 58% on Humanity's Last Exam and 38% on FrontierScience Research.
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| ARC-AGI v2 | 42.5% | self-reported, llm-stats |
| CharXiv-R | 86.4% | self-reported, llm-stats |
| DeepSearchQA | 74.8% | self-reported, llm-stats |
| ERQA | 64.7% | self-reported, llm-stats |
| FrontierScience Research | 38.3% | self-reported, llm-stats |
| GDPval-AA | 1,444 | self-reported, llm-stats |
| GPQA | 89.5% | self-reported, llm-stats |
| HealthBench Hard | 42.8% | self-reported, llm-stats |
| Humanity's Last Exam | 58.4% | self-reported, llm-stats |
| IPhO 2025 | 82.6% | self-reported, llm-stats |
| LiveCodeBench Pro | 0.8 | self-reported, llm-stats |
| MedXpertQA | 78.4% | self-reported, llm-stats |
| MMMU-Pro | 80.4% | self-reported, llm-stats |
| ScreenSpot Pro | 84.1% | self-reported, llm-stats |
| SimpleVQA | 71.3% | self-reported, llm-stats |
| SWE-Bench Pro | 52.4% | self-reported, llm-stats |
| SWE-Bench Verified | 77.4% | self-reported, llm-stats |
| Tau2 Telecom | 91.5% | self-reported, llm-stats |
| Terminal-Bench 2.0 | 59.0% | self-reported, llm-stats |
| ZEROBench | 33.0% | self-reported, llm-stats |