Qwen3 VL 4B Thinking

Qwen3-VL is a family of large multimodal models that unify vision, language, and reasoning, aiming for strong perception and cognition across text, images, and video. This entry covers the 4B-parameter Thinking variant; the largest model in the family has 235B parameters. The models are trained jointly on visual and textual data from an early stage, which gives them strong language grounding.

DocVQA (test): 94.2%
ScreenSpot: 92.9%
MMBench-V1.1: 86.7%
MMLU-Redux: 86.0%
AI2D: 84.9%
WritingBench: 84.0%
CharXiv-D: 83.9%
InfoVQA (test): 83.0%
IFEval: 82.6%
MMLU: 81.5%
OCRBench: 80.8%
MathVista-Mini: 79.5%
MLVU-M: 75.7%
MuirBench: 75.0%
AIME 2025: 74.5%
CC-OCR: 73.8%
MMLU-Pro: 73.6%
Multi-IF: 73.6%
MMStar: 73.2%
RealWorldQA: 73.2%
MMMU (val): 70.8%
VideoMMMU: 69.4%
MVBench: 69.3%
LiveBench 20241125: 68.4%
BFCL-v3: 67.3%
MMLU-ProX: 65.0%
Include: 64.6%
GPQA: 64.1%
HallusionBench: 64.1%
BLINK: 63.4%
OCRBench-V2 (en): 61.8%
MathVision: 60.0%
Charades-STA: 59.0%
MMMU-Pro: 57.0%
OCRBench-V2 (zh): 55.8%
LVBench: 53.5%
HMMT25: 53.1%
LiveCodeBench v6: 51.3%
CharXiv-R: 50.3%
ScreenSpot Pro: 49.2%
ERQA: 47.3%
SuperGPQA: 46.8%
PolyMATH: 44.6%
ODinW: 39.4%
Arena-Hard v2: 36.8%
OSWorld: 31.4%
MM-MT-Bench: 7.7
Creative Writing v3: 0.761