Pixtral-12B

A 12B-parameter multimodal model with a 400M-parameter vision encoder, capable of understanding both natural images and documents. It excels at multimodal tasks while maintaining strong text-only performance.

DocVQA: 90.7%
ChartQA: 81.8%
VQAv2: 78.6%
HumanEval: 72.0%
MMLU: 69.2%
IFEval: 61.3%
MathVista: 58.0%
MM IF-Eval: 52.7%
MMMU: 52.5%
MATH: 48.1%
MT-Bench: 76.8
MM-MT-Bench: 60.5