MiMo-V2-Omni

MiMo-V2-Omni is Xiaomi's omni foundation model, uniting frontier multimodal understanding with strong agentic capability. It fuses dedicated image, video, and audio encoders into a single shared backbone, processing all modalities simultaneously. It natively supports structured tool calling, function execution, and UI grounding. It also supports over 10 hours of continuous audio understanding and a 256K-token context window.

Benchmark results

Benchmark            Score    Tags            Source
Claw-Eval            54.8%    self-reported   llm-stats
GDPval-AA            1,410    self-reported   llm-stats
MM-BrowserComp       52.0%    self-reported   llm-stats
OmniGAIA             49.8%    self-reported   llm-stats
PinchBench           81.2%    self-reported   llm-stats
SWE-Bench Verified   74.8%    self-reported   llm-stats