Qwen2.5 VL 32B Instruct

Qwen2.5-VL is a vision-language model from the Qwen family. Its key enhancements are stronger visual understanding (objects, text, charts, and layouts), visual agent capabilities (tool use and computer or phone control), long-video comprehension with pinpointing of specific events, visual localization (bounding boxes or points), and structured output generation.
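
To make the localization and structured-output capabilities concrete, here is a minimal Python sketch of how a caller might consume a grounding reply. It assumes the model was prompted to answer with a JSON list of objects carrying a `bbox_2d` field (`[x1, y1, x2, y2]` in pixels) and a `label`. That schema, the helper name `parse_boxes`, and the sample reply are illustrative assumptions, not something stated on this page.

```python
import json

FENCE = "`" * 3  # Markdown code fence that chat models often wrap JSON in


def parse_boxes(model_output: str) -> list[dict]:
    """Parse a JSON list of boxes, tolerating an optional Markdown fence.

    Assumed (hypothetical) schema per item:
        {"bbox_2d": [x1, y1, x2, y2], "label": "..."}
    """
    text = model_output.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1].rsplit(FENCE, 1)[0]
    boxes = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append(
            {
                "label": item.get("label", ""),
                "box": (x1, y1, x2, y2),
                # Clamp to zero so a degenerate box never yields a negative area.
                "area": max(0, x2 - x1) * max(0, y2 - y1),
            }
        )
    return boxes


# Hand-written sample reply for illustration only (not real model output).
sample = '[{"bbox_2d": [10, 20, 110, 220], "label": "button"}]'
print(parse_boxes(sample))
```

Validating the reply on the caller's side like this keeps downstream code (drawing boxes, clicking UI elements) independent of whether the model wraps its JSON in a fence.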

Benchmark scores:

DocVQA: 94.8%
Android Control Low_EM: 93.3%
HumanEval: 91.5%
ScreenSpot: 88.5%
MBPP: 84.0%
InfoVQA: 83.4%
AITZ_EM: 83.1%
MATH: 82.2%
MMLU: 78.4%
VideoMME w sub.: 77.9%
CC-OCR: 77.1%
MathVista-Mini: 74.7%
VideoMME w/o sub.: 70.5%
MMMU: 70.0%
Android Control High_EM: 69.6%
MMStar: 69.5%
MMLU-Pro: 68.8%
OCRBench-V2 (zh): 59.1%
OCRBench-V2 (en): 57.2%
CharadesSTA: 54.2%
MMMU-Pro: 49.5%
LVBench: 49.0%
GPQA: 46.0%
ScreenSpot Pro: 39.4%
MathVision: 38.4%
AndroidWorld_SR: 22.0%
OSWorld: 5.9%
MMBench-Video: 1.9 (mean score on the benchmark's 0 to 3 scale, not a percentage)