Qwen3 VL 8B Instruct

Qwen3-VL is a family of large multimodal models that unifies vision, language, and reasoning across text, images, and video, with the aim of strong perception and cognition in all three. The family's largest model has 235B parameters; this listing covers the 8B Instruct variant. Visual and textual modalities are trained jointly from an early stage for strong language grounding.

Benchmark	Score
DocVQA (test)	96.1%
ScreenSpot	94.4%
OCRBench	89.6%
AI2D	85.7%
MMBench-V1.1	85.0%
MMLU-Redux	84.9%
IFEval	83.7%
InfoVQA (test)	83.1%
WritingBench	83.1%
CharXiv-D	83.0%
MMLU	80.7%
CC-OCR	79.9%
MLVU-M	78.1%
MathVista-Mini	77.2%
Multi-IF	75.1%
MMLU-Pro	71.6%
RealWorldQA	71.5%
Video-MME	71.4%
MMStar	70.9%
MMMU (val)	69.6%
BLINK	69.1%
MVBench	68.7%
Include	67.0%
BFCL-v3	66.3%
MMLU-ProX	65.4%
OCRBench-V2 (en)	65.4%
VideoMMMU	65.3%
MuirBench	64.4%
LiveBench 20241125	62.0%
OCRBench-V2 (zh)	61.2%
HallusionBench	61.1%
LVBench	58.0%
Charades-STA	56.0%
MMMU-Pro	55.9%
ScreenSpot Pro	54.6%
MathVision	53.9%
CharXiv-R	46.4%
AIME 2025	45.9%
ERQA	45.8%
ODinW	44.7%
SuperGPQA	44.5%
LiveCodeBench v6	39.3%
OSWorld	33.9%
HMMT25	32.5%
PolyMATH	30.4%
MM-MT-Bench	7.7

Pricing, uptime, and speed via OpenRouter — updated Jul 17, 2026, 04:19 AM.

Provider	Status	Input	Cached input	Output	Context	Max output	Uptime	Uptime (5m)	TTFT (p50)	Throughput (p50)	Notes
Alibaba	available	$0.12/Mtok	not listed	$0.45/Mtok	131K tokens	33K tokens	100.0%	100.0%	444 ms	60 tok/s	
Parasail	available	$0.25/Mtok	$0.12/Mtok	$0.75/Mtok	262K tokens	262K tokens	100.0%	100.0%	551 ms	37 tok/s	bf16
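
To make the per-token prices and speed figures easier to compare, here is a minimal Python sketch that estimates the cost and rough latency of a single request for each provider. It assumes that "Mtok" means one million tokens, that no cached-input discount applies, and that the p50 TTFT and throughput figures hold for the whole response. The `Provider` class, the function names, and the 10,000-input / 1,000-output token counts are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Pricing and p50 speed figures copied from the provider table."""
    name: str
    input_usd_per_mtok: float   # USD per one million input tokens (assumed meaning of "Mtok")
    output_usd_per_mtok: float  # USD per one million output tokens
    ttft_ms_p50: float          # median time to first token, in milliseconds
    tok_per_s_p50: float        # median output throughput, in tokens per second


PROVIDERS = [
    Provider("Alibaba", 0.12, 0.45, 444, 60),
    Provider("Parasail", 0.25, 0.75, 551, 37),
]


def request_cost_usd(p: Provider, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, ignoring any cached-input discount (assumption)."""
    total = (input_tokens * p.input_usd_per_mtok
             + output_tokens * p.output_usd_per_mtok)
    return total / 1_000_000


def estimated_latency_s(p: Provider, output_tokens: int) -> float:
    """Rough latency: median TTFT plus generation time at median throughput."""
    return p.ttft_ms_p50 / 1000 + output_tokens / p.tok_per_s_p50


if __name__ == "__main__":
    # Hypothetical workload: 10,000 input tokens and 1,000 output tokens.
    for p in PROVIDERS:
        cost = request_cost_usd(p, 10_000, 1_000)
        latency = estimated_latency_s(p, 1_000)
        print(f"{p.name}: ${cost:.5f} per request, about {latency:.1f} s")
```

Under these assumptions the Alibaba endpoint comes out at roughly half the cost and noticeably lower latency for this workload, while Parasail offers the larger context and output limits. Real latency varies with load and prompt size, so treat the numbers as a comparison aid, not a guarantee.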