Phi-3.5-vision-instruct

Phi-3.5-vision-instruct is a 4.2B-parameter open multimodal model with up to 128K context tokens. It emphasizes multi-frame image understanding and reasoning, boosting performance on single-image benchmarks while enabling multi-image comparison, summarization, and even video analysis.

ScienceQA

91.3%

i
POPE

86.1%

i
MMBench

81.9%

i
ChartQA

81.8%

i
AI2D

78.1%

i
TextVQA

72.0%

i
MathVista

43.9%

i
MMMU

43.0%

i
InterGPS

36.3%

i