Phi-4-multimodal-instruct
Phi-4-multimodal-instruct is a lightweight (5.57B parameters) open multimodal foundation model that builds on the research and datasets used for Phi-3.5 and Phi-4.0. It accepts text, image, and audio inputs, generates text outputs, and supports a 128K-token context length. The model was further trained with supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF) to improve instruction following and safety.
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AI2D | 82.3% | self-reported llm-stats |
| BLINK | 61.3% | self-reported llm-stats |
| ChartQA | 81.4% | self-reported llm-stats |
| DocVQA | 93.2% | self-reported llm-stats |
| InfoVQA | 72.7% | self-reported llm-stats |
| InterGPS | 48.6% | self-reported llm-stats |
| MathVista | 62.4% | self-reported llm-stats |
| MMBench | 86.7% | self-reported llm-stats |
| MMMU | 55.1% | self-reported llm-stats |
| MMMU-Pro | 38.5% | self-reported llm-stats |
| OCRBench | 84.4% | self-reported llm-stats |
| POPE | 85.6% | self-reported llm-stats |
| ScienceQA Visual | 97.5% | self-reported llm-stats |
| TextVQA | 75.6% | self-reported llm-stats |
| Video-MME | 55.0% | self-reported llm-stats |