Phi-4-multimodal-instruct

Phi-4-multimodal-instruct is a lightweight (5.57B-parameter) open multimodal foundation model that builds on research and datasets from Phi-3.5 and 4.0. It processes text, image, and audio inputs to generate text outputs and supports a 128K-token context length. The model was enhanced via supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF) to improve instruction following and safety.

Benchmark results

Benchmark          Score   Tags           Source
AI2D               82.3%   self-reported  llm-stats
BLINK              61.3%   self-reported  llm-stats
ChartQA            81.4%   self-reported  llm-stats
DocVQA             93.2%   self-reported  llm-stats
InfoVQA            72.7%   self-reported  llm-stats
InterGPS           48.6%   self-reported  llm-stats
MathVista          62.4%   self-reported  llm-stats
MMBench            86.7%   self-reported  llm-stats
MMMU               55.1%   self-reported  llm-stats
MMMU-Pro           38.5%   self-reported  llm-stats
OCRBench           84.4%   self-reported  llm-stats
POPE               85.6%   self-reported  llm-stats
ScienceQA Visual   97.5%   self-reported  llm-stats
TextVQA            75.6%   self-reported  llm-stats
Video-MME          55.0%   self-reported  llm-stats
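
For readers who want to work with these numbers programmatically, the following is a minimal Python sketch that holds the scores listed above in a dictionary and derives a few summary figures. The names `SCORES` and `summarize` are hypothetical and chosen for illustration only. The unweighted mean is not a metric reported by the source; the benchmarks measure different skills on different scales, so treat it as a rough orientation aid at most.

```python
# Scores copied from the benchmark table above (self-reported, in percent).
SCORES = {
    "AI2D": 82.3,
    "BLINK": 61.3,
    "ChartQA": 81.4,
    "DocVQA": 93.2,
    "InfoVQA": 72.7,
    "InterGPS": 48.6,
    "MathVista": 62.4,
    "MMBench": 86.7,
    "MMMU": 55.1,
    "MMMU-Pro": 38.5,
    "OCRBench": 84.4,
    "POPE": 85.6,
    "ScienceQA Visual": 97.5,
    "TextVQA": 75.6,
    "Video-MME": 55.0,
}


def summarize(scores):
    """Return (unweighted mean, lowest benchmark, highest benchmark).

    Hypothetical helper: the mean is an illustration only, not a
    figure published alongside the model.
    """
    mean = sum(scores.values()) / len(scores)
    lowest = min(scores, key=scores.get)
    highest = max(scores, key=scores.get)
    return mean, lowest, highest


if __name__ == "__main__":
    mean, lowest, highest = summarize(SCORES)
    print(f"benchmarks:      {len(SCORES)}")
    print(f"unweighted mean: {mean:.1f}%")
    print(f"lowest:          {lowest} ({SCORES[lowest]}%)")
    print(f"highest:         {highest} ({SCORES[highest]}%)")
```

Keeping the scores in a plain dictionary makes it easy to sort them, filter by benchmark family, or compare them against another model's self-reported results.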