Qwen3 VL 32B Thinking

Qwen3-VL is a family of large multimodal models that unifies vision, language, and reasoning for perception and cognition across text, images, and video. The flagship model in the family uses a 235B-parameter architecture; this entry covers the 32B Thinking variant. The models integrate early joint training of visual and textual modalities for strong language grounding.

Benchmark               Score
DocVQA (test)           96.1%
ScreenSpot              95.7%
MMLU-Redux              91.9%
MMBench-V1.1            90.8%
CharXiv-D               90.2%
InfoVQA (test)          89.2%
AI2D                    88.9%
MMLU                    88.7%
IFEval                  87.8%
WritingBench            86.2%
MathVista-Mini          85.9%
OCRBench                85.5%
AIME 2025               83.7%
MMLU-Pro                82.1%
MuirBench               80.3%
MMStar                  79.4%
VideoMMMU               79.0%
RealWorldQA             78.4%
MMMU (val)              78.1%
Multi-IF                78.0%
VideoMME w/o sub.       77.3%
MMLU-ProX               77.2%
Include                 76.3%
LiveBench 20241125      74.7%
MVBench                 73.2%
GPQA                    73.1%
BFCL-v3                 71.7%
MathVision              70.2%
BLINK                   68.5%
OCRBench-V2 (en)        68.4%
MMMU-Pro                68.1%
Hallusion Bench         67.4%
LiveCodeBench v6        65.6%
CharXiv-R               65.2%
AndroidWorld_SR         63.7%
CharadesSTA             62.8%
LVBench                 62.6%
OCRBench-V2 (zh)        62.1%
Arena-Hard v2           60.5%
SuperGPQA               59.0%
ScreenSpot Pro          57.1%
SimpleQA                55.4%
ERQA                    52.3%
PolyMATH                52.0%
OSWorld                 41.0%
MM-MT-Bench             8.3
Creative Writing v3     0.833