Qwen2.5 VL 7B Instruct
Qwen2.5-VL is a vision-language model from the Qwen family. Key enhancements include visual understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long video comprehension with event pinpointing, visual localization (bounding boxes/points), and structured output generation.
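As a rough illustration of how the localization and structured-output capabilities might be consumed downstream, the sketch below parses a JSON list of detections and filters out implausible boxes. The reply schema (a list of objects with `bbox_2d` as `[x1, y1, x2, y2]` pixel coordinates and a `label` string), the `Box` class, the `parse_boxes` helper, and the sample reply are all assumptions made for illustration; none of them is specified on this page, and no model is called.

```python
import json
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(reply: str, width: int, height: int) -> list[Box]:
    """Parse a JSON list of detections and keep only boxes that fit the image.

    Assumes each item looks like {"bbox_2d": [x1, y1, x2, y2], "label": "..."}.
    This schema is an assumption, not something the model card states.
    """
    boxes = []
    for item in json.loads(reply):
        x1, y1, x2, y2 = item["bbox_2d"]
        # Drop degenerate or out-of-bounds boxes instead of trusting the reply.
        if 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height:
            boxes.append(Box(item["label"], x1, y1, x2, y2))
    return boxes


# Hypothetical reply text written by hand; not actual model output.
sample_reply = (
    '[{"bbox_2d": [10, 20, 110, 220], "label": "cat"},'
    ' {"bbox_2d": [300, 40, 280, 90], "label": "dog"}]'
)
boxes = parse_boxes(sample_reply, width=640, height=480)
print(boxes)
```

Validating coordinates against the image size is a defensive choice: generated text is not guaranteed to be well-formed, so a caller would probably also want to catch `json.JSONDecodeError` and `KeyError` in practice.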
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AITZ_EM | 81.9% | self-reported llm-stats |
| Android Control High_EM | 60.1% | self-reported llm-stats |
| Android Control Low_EM | 91.4% | self-reported llm-stats |
| AndroidWorld_SR | 25.5% | self-reported llm-stats |
| CC-OCR | 77.8% | self-reported llm-stats |
| CharadesSTA | 43.6% | self-reported llm-stats |
| ChartQA | 87.3% | self-reported llm-stats |
| DocVQA | 95.7% | self-reported llm-stats |
| HallusionBench | 52.9% | self-reported llm-stats |
| InfoVQA | 82.6% | self-reported llm-stats |
| LongVideoBench | 54.7% | self-reported llm-stats |
| LVBench | 45.3% | self-reported llm-stats |
| MathVision | 25.1% | self-reported llm-stats |
| MathVista-Mini | 68.2% | self-reported llm-stats |
| MLVU | 70.2% | self-reported llm-stats |
| MMBench | 84.3% | self-reported llm-stats |
| MMBench-Video | 1.8 (score on a 0–3 scale) | self-reported llm-stats |
| MMMU | 58.6% | self-reported llm-stats |
| MMMU-Pro | 38.3% | self-reported llm-stats |
| MMStar | 63.9% | self-reported llm-stats |
| MMT-Bench | 63.6% | self-reported llm-stats |
| MMVet | 67.1% | self-reported llm-stats |
| MobileMiniWob++_SR | 91.4% | self-reported llm-stats |
| MVBench | 69.6% | self-reported llm-stats |
| OCRBench | 86.4% | self-reported llm-stats |
| PerceptionTest | 70.5% | self-reported llm-stats |
| ScreenSpot | 84.7% | self-reported llm-stats |
| ScreenSpot Pro | 29.0% | self-reported llm-stats |
| TempCompass | 71.7% | self-reported llm-stats |
| TextVQA | 84.9% | self-reported llm-stats |
| VideoMME (with subtitles) | 71.6% | self-reported llm-stats |
| VideoMME (without subtitles) | 65.1% | self-reported llm-stats |