Qwen2.5 VL 32B Instruct

Qwen2.5-VL is a vision-language model from the Qwen family. Key enhancements include visual understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long video comprehension with event pinpointing, visual localization (bounding boxes/points), and structured output generation.
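As a rough illustration of how the visual localization and structured output capabilities might be consumed downstream, the sketch below parses a JSON list of detections into typed boxes. It assumes the model is prompted to reply with items that carry a `bbox_2d` list of `[x1, y1, x2, y2]` pixel coordinates and a `label` string; that schema, the `Box` class, the `parse_boxes` helper, and the sample reply are all illustrative assumptions, not output captured from the model.

```python
import json
from dataclasses import dataclass
from typing import List

FENCE = "`" * 3  # some replies wrap JSON in a Markdown code fence


@dataclass
class Box:
    label: str
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def area(self) -> int:
        # Clamp at zero so a malformed (inverted) box never yields a negative area.
        return max(0, self.x2 - self.x1) * max(0, self.y2 - self.y1)


def parse_boxes(reply: str) -> List[Box]:
    """Parse a reply assumed to hold a JSON list of {"bbox_2d", "label"} items."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    boxes = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append(Box(item["label"], x1, y1, x2, y2))
    return boxes


# Hypothetical reply, written by hand for this example.
sample_reply = (
    '[{"bbox_2d": [10, 20, 110, 220], "label": "cat"},'
    ' {"bbox_2d": [5, 5, 25, 15], "label": "ball"}]'
)
boxes = parse_boxes(sample_reply)
for box in boxes:
    print(box.label, box.area)
```

Keeping the parsing in one small function makes it easy to swap in a different schema if the prompt asks for points or extra fields instead of boxes.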

Benchmark results

Benchmark                 Score   Tags           Source
AITZ_EM                   83.1%   self-reported  llm-stats
Android Control High_EM   69.6%   self-reported  llm-stats
Android Control Low_EM    93.3%   self-reported  llm-stats
AndroidWorld_SR           22.0%   self-reported  llm-stats
CC-OCR                    77.1%   self-reported  llm-stats
CharadesSTA               54.2%   self-reported  llm-stats
DocVQA                    94.8%   self-reported  llm-stats
GPQA                      46.0%   self-reported  llm-stats
HumanEval                 91.5%   self-reported  llm-stats
InfoVQA                   83.4%   self-reported  llm-stats
LVBench                   49.0%   self-reported  llm-stats
MATH                      82.2%   self-reported  llm-stats
MathVision                38.4%   self-reported  llm-stats
MathVista-Mini            74.7%   self-reported  llm-stats
MBPP                      84.0%   self-reported  llm-stats
MMBench-Video             1.9     self-reported  llm-stats
MMLU                      78.4%   self-reported  llm-stats
MMLU-Pro                  68.8%   self-reported  llm-stats
MMMU                      70.0%   self-reported  llm-stats
MMMU-Pro                  49.5%   self-reported  llm-stats
MMStar                    69.5%   self-reported  llm-stats
OCRBench-V2 (en)          57.2%   self-reported  llm-stats
OCRBench-V2 (zh)          59.1%   self-reported  llm-stats
OSWorld                   5.9%    self-reported  llm-stats
ScreenSpot                88.5%   self-reported  llm-stats
ScreenSpot Pro            39.4%   self-reported  llm-stats
VideoMME w sub.           77.9%   self-reported  llm-stats
VideoMME w/o sub.         70.5%   self-reported  llm-stats

Note: MMBench-Video is graded on a 0 to 3 scale, so 1.9 is a mean score rather than a percentage.