Phi-3.5-vision-instruct

Phi-3.5-vision-instruct is a 4.2B-parameter open multimodal model with up to 128K context tokens. It emphasizes multi-frame image understanding and reasoning, boosting performance on single-image benchmarks while enabling multi-image comparison, summarization, and even video analysis. The model underwent safety post-training for improved instruction-following, alignment, and robust handling of visual and text inputs, and is released under the MIT license.

Benchmark results

Benchmark Score Tags Source
AI2D 78.1% self-reported llm-stats link →
ChartQA 81.8% self-reported llm-stats link →
InterGPS 36.3% self-reported llm-stats link →
MathVista 43.9% self-reported llm-stats link →
MMBench 81.9% self-reported llm-stats link →
MMMU 43.0% self-reported llm-stats link →
POPE 86.1% self-reported llm-stats link →
ScienceQA 91.3% self-reported llm-stats link →
TextVQA 72.0% self-reported llm-stats link →