Qwen2.5 VL 72B Instruct

Qwen2.5-VL is the flagship vision-language model of the Qwen series and a significant improvement over Qwen2-VL. It excels at recognizing objects; analyzing text, charts, and layouts in images; acting as a visual agent; understanding long videos (over one hour) and pinpointing the segments where specific events occur; performing visual localization with bounding boxes or points; and generating structured outputs from documents.
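Localization results come back as text, so a downstream program has to parse them before it can draw boxes or crop regions. The Python sketch below assumes the model was prompted to answer with a JSON list of objects that use the keys `bbox_2d` (holding [x1, y1, x2, y2] in pixel coordinates) and `label`, optionally wrapped in a json code fence. Those key names and the coordinate convention are assumptions to check against real outputs, and `parse_boxes` and `Box` are hypothetical helpers, not part of any Qwen library.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(reply: str) -> list:
    """Extract bounding boxes from a model reply (assumed JSON format)."""
    # Grab the outermost [...] span; this also skips a surrounding ```json fence.
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if match is None:
        return []
    items = json.loads(match.group(0))
    boxes = []
    for item in items:
        # Assumed key: "bbox_2d" = [x1, y1, x2, y2]; items without it are skipped.
        if "bbox_2d" not in item:
            continue
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append(Box(item.get("label", ""), x1, y1, x2, y2))
    return boxes
```

For example, `parse_boxes('[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]')` yields a single `Box` labeled `cat`, while a reply with no JSON list yields an empty list.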

Benchmark results

Benchmark                Score   Tags           Source
AI2D                     88.4%   self-reported  llm-stats
AITZ_EM                  83.2%   self-reported  llm-stats
Android Control High_EM  67.4%   self-reported  llm-stats
Android Control Low_EM   93.7%   self-reported  llm-stats
AndroidWorld_SR          35.0%   self-reported  llm-stats
CC-OCR                   79.8%   self-reported  llm-stats
ChartQA                  89.5%   self-reported  llm-stats
DocVQA                   96.4%   self-reported  llm-stats
EgoSchema                76.2%   self-reported  llm-stats
Hallusion Bench          55.2%   self-reported  llm-stats
LVBench                  47.3%   self-reported  llm-stats
MathVision               38.1%   self-reported  llm-stats
MathVista-Mini           74.8%   self-reported  llm-stats
MLVU-M                   74.6%   self-reported  llm-stats
MMBench                  88.0%   self-reported  llm-stats
MMBench-Video            2.0*    self-reported  llm-stats
MMMU                     70.2%   self-reported  llm-stats
MMMU-Pro                 51.1%   self-reported  llm-stats
MMStar                   70.8%   self-reported  llm-stats
MMVet                    76.2%   self-reported  llm-stats
MobileMiniWob++_SR       68.0%   self-reported  llm-stats
MVBench                  70.4%   self-reported  llm-stats
OCRBench                 88.5%   self-reported  llm-stats
OCRBench-V2 (en)         61.5%   self-reported  llm-stats
OSWorld                  8.8%    self-reported  llm-stats
PerceptionTest           73.2%   self-reported  llm-stats
ScreenSpot               87.1%   self-reported  llm-stats
ScreenSpot Pro           43.6%   self-reported  llm-stats
TempCompass              74.8%   self-reported  llm-stats
VideoMME w/o sub.        73.3%   self-reported  llm-stats

* MMBench-Video is scored on a 0 to 3 scale, so this value is a mean score rather than a percentage.