Pixtral-12B
Pixtral-12B is a 12B-parameter multimodal model with a 400M-parameter vision encoder, capable of understanding both natural images and documents. It excels at multimodal tasks while maintaining strong text-only performance, and it supports variable image sizes and multiple images in context.
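Because image size is variable and several images can share one context, the context budget an image consumes depends on its resolution. The sketch below gives a rough way to estimate that cost. It assumes the vision encoder splits each image into square patches of 16 pixels, emits one token per patch, and adds one separator token per patch row; none of these figures come from this page, and the function names are hypothetical, so treat the result as a ballpark rather than the model's exact accounting.

```python
import math


def estimate_image_tokens(width: int, height: int, patch_size: int = 16) -> int:
    """Rough token count for one image under the assumed patching scheme.

    patch_size=16 and the one-separator-per-row rule are assumptions
    made for illustration, not values stated in the model card.
    """
    if width <= 0 or height <= 0 or patch_size <= 0:
        raise ValueError("width, height and patch_size must be positive")
    cols = math.ceil(width / patch_size)   # patches per row
    rows = math.ceil(height / patch_size)  # number of patch rows
    # One token per patch, plus one assumed separator token per row.
    return rows * cols + rows


def estimate_prompt_image_tokens(sizes) -> int:
    """Sum the estimate over several (width, height) images in one prompt."""
    return sum(estimate_image_tokens(w, h) for w, h in sizes)


if __name__ == "__main__":
    images = [(512, 512), (1024, 768)]
    for w, h in images:
        print(f"{w}x{h}: ~{estimate_image_tokens(w, h)} tokens")
    print("total:", estimate_prompt_image_tokens(images))
```

Under these assumptions the cost grows with image area, so downscaling large images before sending them is the simplest way to fit more images into one context.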
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| ChartQA | 81.8% | self-reported, llm-stats |
| DocVQA | 90.7% | self-reported, llm-stats |
| HumanEval | 72.0% | self-reported, llm-stats |
| IFEval | 61.3% | self-reported, llm-stats |
| MATH | 48.1% | self-reported, llm-stats |
| MathVista | 58.0% | self-reported, llm-stats |
| MM IF-Eval | 52.7% | self-reported, llm-stats |
| MM-MT-Bench | 60.5 | self-reported, llm-stats |
| MMLU | 69.2% | self-reported, llm-stats |
| MMMU | 52.5% | self-reported, llm-stats |
| MT-Bench | 76.8 | self-reported, llm-stats |
| VQAv2 | 78.6% | self-reported, llm-stats |