MiniCPM-SALA

MiniCPM-SALA (Sparse Attention and Linear Attention) is a 9B-parameter hybrid model built from a MiniCPM-4.0 checkpoint through continual training on roughly 2T tokens, about 25% of the cost of training from scratch. Its layers are 25% InfLLM-V2 sparse attention and 75% Lightning Attention, interleaved, and it reaches up to 3.5x the inference speed of dense baselines at 256K tokens.
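To make the 25%/75% layer mix concrete, here is a minimal Python sketch of one way such an interleaving could be laid out. It assumes a uniform pattern (every fourth layer is sparse attention) and a hypothetical depth of 32 layers; neither the layer count nor the exact placement of sparse layers is stated above, and the function name `layer_pattern` is purely illustrative.

```python
from collections import Counter


def layer_pattern(num_layers: int, sparse_every: int = 4) -> list[str]:
    """Label each layer as "sparse" or "linear".

    Assumption for illustration only: every `sparse_every`-th layer uses
    sparse attention and all others use linear attention. The real model
    may place its sparse layers differently.
    """
    if num_layers <= 0 or sparse_every <= 0:
        raise ValueError("num_layers and sparse_every must be positive")
    return [
        "sparse" if (i + 1) % sparse_every == 0 else "linear"
        for i in range(num_layers)
    ]


# Hypothetical depth of 32 layers, chosen only so the ratio divides evenly.
pattern = layer_pattern(32)
counts = Counter(pattern)
sparse_share = counts["sparse"] / len(pattern)

print(pattern[:8])
print(f"sparse layers: {counts['sparse']}, linear layers: {counts['linear']}")
print(f"sparse share: {sparse_share:.0%}")
```

With `sparse_every=4`, one layer in four is sparse, which reproduces the 25%/75% split for any depth divisible by four.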

HumanEval: 95.1%
RULER 64K: 92.7%
RULER 128K: 89.4%
MBPP: 89.1%
RULER 512K: 87.1%
RULER 1000K: 86.3%
AIME 2024: 83.8%
RULER 2048K: 81.6%
BBH: 81.5%
CMMLU: 81.5%
AIME 2025: 78.3%
IFEval: 76.3%
MMLU-Pro: 67.0%
LiveCodeBench v5: 60.5%
NoLiMa 32K: 54.5%
LiveCodeBench v6: 52.0%
NoLiMa 64K: 43.0%
MRCR 64K (2-needle): 29.8%
MRCR 128K (2-needle): 28.6%
NoLiMa 128K: 23.9%
MRCR 64K (4-needle): 20.6%
MRCR 128K (4-needle): 19.6%
MRCR 64K (8-needle): 16.6%
MRCR 128K (8-needle): 10.1%