Phi-3.5-vision-instruct
Phi-3.5-vision-instruct is a 4.2B-parameter open multimodal model with up to 128K context tokens. It emphasizes multi-frame image understanding and reasoning, boosting performance on single-image benchmarks while enabling multi-image comparison, summarization, and even video analysis. The model underwent safety post-training for improved instruction-following, alignment, and robust handling of visual and text inputs, and is released under the MIT license.
Benchmark results
| Benchmark | Score | Tags | Source |
|---|---|---|---|
| AI2D | 78.1% | self-reported llm-stats | link → |
| ChartQA | 81.8% | self-reported llm-stats | link → |
| InterGPS | 36.3% | self-reported llm-stats | link → |
| MathVista | 43.9% | self-reported llm-stats | link → |
| MMBench | 81.9% | self-reported llm-stats | link → |
| MMMU | 43.0% | self-reported llm-stats | link → |
| POPE | 86.1% | self-reported llm-stats | link → |
| ScienceQA | 91.3% | self-reported llm-stats | link → |
| TextVQA | 72.0% | self-reported llm-stats | link → |