Pixtral-12B

Pixtral-12B is a 12B-parameter multimodal model with a 400M-parameter vision encoder that can understand both natural images and documents. It excels at multimodal tasks while maintaining strong text-only performance, and it supports variable image sizes and multiple images in context.

Benchmark results

Benchmark     Score   Tags            Source
ChartQA       81.8%   self-reported   llm-stats
DocVQA        90.7%   self-reported   llm-stats
HumanEval     72.0%   self-reported   llm-stats
IFEval        61.3%   self-reported   llm-stats
MATH          48.1%   self-reported   llm-stats
MathVista     58.0%   self-reported   llm-stats
MM IF-Eval    52.7%   self-reported   llm-stats
MM-MT-Bench   60.5    self-reported   llm-stats
MMLU          69.2%   self-reported   llm-stats
MMMU          52.5%   self-reported   llm-stats
MT-Bench      76.8    self-reported   llm-stats
VQAv2         78.6%   self-reported   llm-stats
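
The table mixes percentage scores with two scores listed without a percent sign (MM-MT-Bench and MT-Bench), so the two groups should not be ranked or averaged together. The following is a minimal Python sketch, not part of the original page, that separates the rows by unit before sorting. The names ROWS, parse_score, and split_by_unit are hypothetical helpers introduced for illustration; the values are copied from the table above.

```python
# Rows copied from the benchmark table as (benchmark, score) strings.
ROWS = [
    ("ChartQA", "81.8%"),
    ("DocVQA", "90.7%"),
    ("HumanEval", "72.0%"),
    ("IFEval", "61.3%"),
    ("MATH", "48.1%"),
    ("MathVista", "58.0%"),
    ("MM IF-Eval", "52.7%"),
    ("MM-MT-Bench", "60.5"),
    ("MMLU", "69.2%"),
    ("MMMU", "52.5%"),
    ("MT-Bench", "76.8"),
    ("VQAv2", "78.6%"),
]


def parse_score(text):
    """Return (value, is_percent) for a score string such as '81.8%' or '60.5'."""
    text = text.strip()
    is_percent = text.endswith("%")
    return float(text.rstrip("%")), is_percent


def split_by_unit(rows):
    """Split rows into two dicts: percentage scores and scores without a unit."""
    percent, unitless = {}, {}
    for name, raw in rows:
        value, is_percent = parse_score(raw)
        (percent if is_percent else unitless)[name] = value
    return percent, unitless


percent_scores, unitless_scores = split_by_unit(ROWS)

# Rank only the percentage-based benchmarks against each other.
for name, value in sorted(percent_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{value:5.1f}%")

# Report the remaining scores separately, without implying a percent scale.
for name, value in unitless_scores.items():
    print(f"{name:<12}{value:5.1f}")
```

Keeping the two groups in separate dictionaries makes it harder to compare numbers that are on different scales by accident. All scores are tagged self-reported in the table, so any ranking derived this way inherits that caveat.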