Qwen3 VL 4B Instruct

Qwen3-VL is a family of multimodal models that unify vision, language, and reasoning across text, images, and video, with the stated goal of human-level perception and cognition. This entry covers the 4B-parameter Instruct variant; the largest model in the family has 235B parameters. The models are trained jointly on visual and textual data from an early stage, which is intended to give strong language grounding.

DocVQA (test): 95.3%
ScreenSpot: 94.0%
OCRBench: 88.1%
MMBench-V1.1: 85.1%
AI2D: 84.1%
WritingBench: 82.5%
IFEval: 82.3%
MMLU-Redux: 81.5%
InfoVQA (test): 80.3%
MMLU: 77.2%
CC-OCR: 76.2%
CharXiv-D: 76.2%
MLVU-M: 75.3%
MathVista-Mini: 73.7%
RealWorldQA: 70.9%
MMStar: 69.8%
MVBench: 68.9%
MMMU (val): 67.4%
MMLU-Pro: 67.1%
BLINK: 65.8%
MuirBench: 63.8%
OCRBench-V2 (en): 63.7%
BFCL-v3: 63.3%
Include: 61.4%
LiveBench 20241125: 60.9%
ScreenSpot Pro: 59.5%
MMLU-ProX: 59.4%
Hallusion Bench: 57.6%
OCRBench-V2 (zh): 57.6%
LVBench: 56.2%
VideoMMMU: 56.2%
CharadesSTA: 55.5%
MMMU-Pro: 53.2%
MathVision: 51.6%
ODinW: 48.2%
SimpleQA: 48.0%
AIME 2025: 46.6%
ERQA: 41.3%
SuperGPQA: 40.3%
CharXiv-R: 39.7%
LiveCodeBench v6: 37.9%
HMMT25: 30.7%
PolyMATH: 28.8%
OSWorld: 26.2%
MM-MT-Bench: 7.5