Qwen3 VL 32B Instruct

Qwen3-VL is a family of large multimodal models that unifies vision, language, and reasoning across text, images, and video, with the stated aim of human-level perception and cognition. It integrates early joint training of visual and textual modalities for strong language grounding. The 235B-parameter figure describes the family's flagship architecture; this page covers the 32B Instruct variant.

Benchmark	Score
DocVQA (test)	96.9%
ScreenSpot	95.8%
CharXiv-D	90.5%
MMLU-Redux	89.8%
AI2D	89.5%
OCRBench	89.5%
InfoVQA (test)	87.0%
MMLU	86.4%
IFEval	84.7%
MathVista-Mini	83.8%
WritingBench	82.9%
MLVU-M	82.1%
CC-OCR	80.3%
RealWorldQA	79.0%
MMLU-Pro	78.6%
MMStar	77.7%
MMMU (val)	76.0%
Include	74.0%
MMLU-ProX	73.4%
MuirBench	72.8%
MVBench	72.8%
LiveBench 20241125	72.2%
Multi-IF	72.0%
BFCL-v3	70.2%
GPQA	68.9%
OCRBench-V2 (en)	67.4%
BLINK	67.3%
AIME 2025	66.2%
MMMU-Pro	65.3%
Arena-Hard v2	64.7%
HallusionBench	63.8%
LVBench	63.8%
MathVision	63.4%
CharXiv-R	62.8%
CharadesSTA	61.2%
OCRBench-V2 (zh)	59.2%
ScreenSpot Pro	57.9%
SuperGPQA	54.6%
ERQA	48.8%
ODinW	46.6%
LiveCodeBench v6	43.8%
PolyMATH	40.5%
OSWorld	32.6%
MM-MT-Bench	8.4
Creative Writing v3	0.856

Pricing, uptime, and speed via OpenRouter — updated Jul 17, 2026, 04:19 AM.

Provider	Status	Input	Output	Limits	Uptime	Speed	Notes
Alibaba	available	$0.10/Mtok	$0.42/Mtok	131K tokens context, 33K tokens max output	100.0% 5m 100.0%	853 ms p50 TTFT, 47 tok/s p50	
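
To make the pricing and speed figures concrete, here is a minimal Python sketch that estimates the cost and rough latency of a single request. The prices, TTFT, and throughput are copied from the Alibaba row above; the function names are hypothetical, and the latency model (time to first token plus output tokens divided by median throughput) is a simplifying assumption, not something the provider documents.

```python
# Figures copied from the provider row above; they change over time.
INPUT_USD_PER_MTOK = 0.10    # USD per million input tokens
OUTPUT_USD_PER_MTOK = 0.42   # USD per million output tokens
P50_TTFT_SECONDS = 0.853     # median time to first token
P50_TOKENS_PER_SECOND = 47   # median output throughput


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the price of one request in USD (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


def estimate_latency_seconds(output_tokens: int) -> float:
    """Rough median latency: TTFT plus generation time (an assumed model)."""
    if output_tokens < 0:
        raise ValueError("token count must be non-negative")
    return P50_TTFT_SECONDS + output_tokens / P50_TOKENS_PER_SECOND


if __name__ == "__main__":
    cost = estimate_cost_usd(input_tokens=20_000, output_tokens=1_000)
    latency = estimate_latency_seconds(output_tokens=1_000)
    print(f"cost: ${cost:.5f}")        # cost: $0.00242
    print(f"latency: {latency:.1f} s")  # latency: 22.1 s
```

Because both numbers are medians from a single snapshot, treat the result as an order-of-magnitude estimate rather than a quote.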