MiniCPM-SALA

MiniCPM-SALA (Sparse Attention and Linear Attention) is a 9B-parameter hybrid model built from a MiniCPM-4.0 checkpoint through continual training on roughly 2T tokens, about 25% of the cost of training from scratch. It interleaves InfLLM-V2 sparse attention (25% of layers) with Lightning Attention (75% of layers), reaching up to 3.5x the inference speed of dense baselines at 256K tokens. With HyPE (Hybrid Positional Encoding) and NoPE in the sparse layers, the model extrapolates to 2048K tokens despite a 520K training length, enabling 1M-token inference on consumer GPUs such as the RTX 5090.
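The 25% / 75% layer split and the extrapolation factor can be checked with a small sketch. The layer count (32) and the placement of the sparse layer at the end of each group of four are assumptions made for illustration only; the description above gives the proportions but not the actual depth or interleaving order. The function name `build_layer_plan` is likewise hypothetical.

```python
from collections import Counter


def build_layer_plan(num_layers: int, sparse_every: int = 4) -> list[str]:
    """Return one attention type per layer.

    Assumption (not from the model card): the last layer in every group of
    `sparse_every` layers is sparse and the rest are linear. With
    sparse_every=4 this gives the stated 25% sparse / 75% linear split.
    """
    if num_layers % sparse_every != 0:
        raise ValueError("num_layers must be a multiple of sparse_every")
    return [
        "sparse" if (i + 1) % sparse_every == 0 else "linear"
        for i in range(num_layers)
    ]


# Hypothetical depth of 32 layers, chosen only to make the ratio visible.
plan = build_layer_plan(32)
counts = Counter(plan)
sparse_fraction = counts["sparse"] / len(plan)

# Extrapolation factor: evaluated length divided by training length (in K tokens).
extrapolation_ratio = 2048 / 520

print(counts)
print(f"sparse fraction: {sparse_fraction:.2f}")
print(f"extrapolation ratio: {extrapolation_ratio:.2f}x")
```

Under these assumptions the plan holds 8 sparse and 24 linear layers, and the 2048K evaluation length is a little under four times the 520K training length.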

Benchmark results

Benchmark              Score   Tags           Source
AIME 2024              83.8%   self-reported  llm-stats
AIME 2025              78.3%   self-reported  llm-stats
BBH                    81.5%   self-reported  llm-stats
CMMLU                  81.5%   self-reported  llm-stats
HumanEval              95.1%   self-reported  llm-stats
IFEval                 76.3%   self-reported  llm-stats
LiveCodeBench v5       60.5%   self-reported  llm-stats
LiveCodeBench v6       52.0%   self-reported  llm-stats
MBPP                   89.1%   self-reported  llm-stats
MMLU-Pro               67.0%   self-reported  llm-stats
MRCR 128K (2-needle)   28.6%   self-reported  llm-stats
MRCR 128K (4-needle)   19.6%   self-reported  llm-stats
MRCR 128K (8-needle)   10.1%   self-reported  llm-stats
MRCR 64K (2-needle)    29.8%   self-reported  llm-stats
MRCR 64K (4-needle)    20.6%   self-reported  llm-stats
MRCR 64K (8-needle)    16.6%   self-reported  llm-stats
NoLiMa 128K            23.9%   self-reported  llm-stats
NoLiMa 32K             54.5%   self-reported  llm-stats
NoLiMa 64K             43.0%   self-reported  llm-stats
RULER 1000K            86.3%   self-reported  llm-stats
RULER 128K             89.4%   self-reported  llm-stats
RULER 2048K            81.6%   self-reported  llm-stats
RULER 512K             87.1%   self-reported  llm-stats
RULER 64K              92.7%   self-reported  llm-stats
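
The RULER rows are listed alphabetically rather than by context length, which makes the long-context trend hard to read. As a rough illustration, the sketch below reorders the self-reported RULER scores from the table by context length and computes the drop in percentage points relative to the 64K score. The variable names are illustrative; only the scores come from the table.

```python
# Self-reported RULER scores from the table, keyed by context length in K tokens.
ruler_scores = {
    64: 92.7,
    128: 89.4,
    512: 87.1,
    1000: 86.3,
    2048: 81.6,
}

baseline = ruler_scores[64]

# Drop in percentage points relative to the 64K score, rounded to one decimal.
drops = {
    length: round(baseline - score, 1)
    for length, score in sorted(ruler_scores.items())
}

for length, drop in drops.items():
    print(f"RULER {length}K: {ruler_scores[length]:.1f}% ({drop:.1f} points below 64K)")
```

Read this way, the score falls by 6.4 points between 64K and 1000K and by 11.1 points at 2048K, the length that lies beyond the 520K training length.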