Qwen2.5 VL 7B Instruct

Qwen2.5-VL is a vision-language model from the Qwen family. Key enhancements include visual understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long video comprehension with event pinpointing, visual localization (bounding boxes/points), and structured output generation.

DocVQA: 95.7%
Android Control Low_EM: 91.4%
MobileMiniWob++_SR: 91.4%
ChartQA: 87.3%
OCRBench: 86.4%
TextVQA: 84.9%
ScreenSpot: 84.7%
MMBench: 84.3%
InfoVQA: 82.6%
AITZ_EM: 81.9%
CC-OCR: 77.8%
TempCompass: 71.7%
VideoMME w sub.: 71.6%
PerceptionTest: 70.5%
MLVU: 70.2%
MVBench: 69.6%
MathVista-Mini: 68.2%
MMVet: 67.1%
VideoMME w/o sub.: 65.1%
MMStar: 63.9%
MMT-Bench: 63.6%
Android Control High_EM: 60.1%
MMMU: 58.6%
LongVideoBench: 54.7%
Hallusion Bench: 52.9%
LVBench: 45.3%
CharadesSTA: 43.6%
MMMU-Pro: 38.3%
ScreenSpot Pro: 29.0%
AndroidWorld_SR: 25.5%
MathVision: 25.1%
MMBench-Video: 1.8 (mean score on a 0 to 3 scale, not a percentage)