Qwen2.5 VL 7B Instruct

Qwen2.5-VL is a vision-language model from the Qwen family. Its key enhancements are improved visual understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer and phone control), long-video comprehension with event pinpointing, visual localization (bounding boxes and points), and structured output generation.
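
To make the visual localization and structured output points concrete, here is a minimal sketch of how a client might parse a localization response into typed boxes. It assumes the model was prompted to answer with a JSON array whose items carry a `bbox_2d` list (`[x1, y1, x2, y2]`, in pixels) and a `label` string. That shape, the `Box` and `parse_boxes` names, and the sample response are illustrative assumptions, not something this page specifies.

```python
import json
from dataclasses import dataclass

FENCE = "`" * 3  # markdown code fence that models sometimes wrap around JSON


@dataclass
class Box:
    """One localized object; field layout is an assumption for illustration."""
    label: str
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def area(self) -> int:
        # Clamp to zero so a malformed (inverted) box never yields a negative area.
        return max(0, self.x2 - self.x1) * max(0, self.y2 - self.y1)


def parse_boxes(response_text: str) -> list[Box]:
    """Parse a JSON array of {"bbox_2d": [x1, y1, x2, y2], "label": str} items."""
    text = response_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (e.g. a "json" language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    boxes = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append(Box(item.get("label", ""), int(x1), int(y1), int(x2), int(y2)))
    return boxes


# Hypothetical response text, written by hand for this example.
sample = (
    '[{"bbox_2d": [10, 20, 110, 220], "label": "cat"},'
    ' {"bbox_2d": [0, 0, 50, 40], "label": "bowl"}]'
)
boxes = parse_boxes(sample)
for box in boxes:
    print(box.label, box.area)
```

Parsing into a small dataclass rather than passing raw dictionaries around keeps downstream code (drawing, filtering by area, click targeting for agent use) independent of the exact response format, which may differ from what is assumed here.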

Benchmark results

Benchmark                Score  Tags           Source
AITZ_EM                  81.9%  self-reported  llm-stats
Android Control High_EM  60.1%  self-reported  llm-stats
Android Control Low_EM   91.4%  self-reported  llm-stats
AndroidWorld_SR          25.5%  self-reported  llm-stats
CC-OCR                   77.8%  self-reported  llm-stats
CharadesSTA              43.6%  self-reported  llm-stats
ChartQA                  87.3%  self-reported  llm-stats
DocVQA                   95.7%  self-reported  llm-stats
Hallusion Bench          52.9%  self-reported  llm-stats
InfoVQA                  82.6%  self-reported  llm-stats
LongVideoBench           54.7%  self-reported  llm-stats
LVBench                  45.3%  self-reported  llm-stats
MathVision               25.1%  self-reported  llm-stats
MathVista-Mini           68.2%  self-reported  llm-stats
MLVU                     70.2%  self-reported  llm-stats
MMBench                  84.3%  self-reported  llm-stats
MMBench-Video            1.8%   self-reported  llm-stats
MMMU                     58.6%  self-reported  llm-stats
MMMU-Pro                 38.3%  self-reported  llm-stats
MMStar                   63.9%  self-reported  llm-stats
MMT-Bench                63.6%  self-reported  llm-stats
MMVet                    67.1%  self-reported  llm-stats
MobileMiniWob++_SR       91.4%  self-reported  llm-stats
MVBench                  69.6%  self-reported  llm-stats
OCRBench                 86.4%  self-reported  llm-stats
PerceptionTest           70.5%  self-reported  llm-stats
ScreenSpot               84.7%  self-reported  llm-stats
ScreenSpot Pro           29.0%  self-reported  llm-stats
TempCompass              71.7%  self-reported  llm-stats
TextVQA                  84.9%  self-reported  llm-stats
VideoMME w sub.          71.6%  self-reported  llm-stats
VideoMME w/o sub.        65.1%  self-reported  llm-stats
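
For readers who want to work with these figures programmatically, the following is a rough sketch of turning rows shaped like "name score% tag source" into records. Benchmark names can contain spaces, so the pattern anchors on the percentage rather than splitting on whitespace; anything after the source field is ignored. The `ROW` pattern and `parse_row` helper are hypothetical names introduced here, and the sample lines are copied from the table above.

```python
import re

# Lazy name match, then the first "<number>%" token, then two whitespace-free fields.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<score>\d+(?:\.\d+)?)%\s+(?P<tag>\S+)\s+(?P<source>\S+)"
)


def parse_row(line):
    """Return a dict for a benchmark row, or None for headers and blank lines."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    return {
        "name": match["name"],
        "score": float(match["score"]),
        "tag": match["tag"],
        "source": match["source"],
    }


lines = [
    "Benchmark Score Tags Source",  # header: no percentage, so it is skipped
    "DocVQA 95.7% self-reported llm-stats",
    "Android Control High_EM 60.1% self-reported llm-stats",
    "ScreenSpot Pro 29.0% self-reported llm-stats",
]

records = [row for row in map(parse_row, lines) if row is not None]
lowest = min(records, key=lambda row: row["score"])
print(lowest["name"], lowest["score"])
```

Note that the scores come from different benchmarks with different metrics, so sorting or averaging across rows is only a convenience for browsing, not a meaningful comparison.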