Qwen2-VL-72B-Instruct

An instruction-tuned large multimodal model for visual understanding and step-by-step reasoning. It accepts image and video input, processes inputs at dynamic resolution, and uses Multimodal Rotary Position Embedding (M-RoPE). These features support complex problem solving, multilingual text recognition in images, and agent-like interaction in video contexts.
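To make "dynamic resolution" concrete, the sketch below estimates how many visual tokens an image of a given size might occupy. It is an illustration under stated assumptions, not the model's actual preprocessing: the 14-pixel patch size and 2x2 patch merging are taken from the Qwen2-VL technical report rather than from this page, and the helper name `estimate_visual_tokens` and its `min_pixels` / `max_pixels` defaults are hypothetical.

```python
import math

# Assumptions (from the Qwen2-VL report, not stated on this page):
PATCH = 14            # assumed ViT patch edge, in pixels
MERGE = 2             # assumed 2x2 merging of adjacent patches into one token
UNIT = PATCH * MERGE  # so one visual token covers roughly 28x28 pixels


def estimate_visual_tokens(height, width,
                           min_pixels=4 * UNIT * UNIT,
                           max_pixels=16384 * UNIT * UNIT):
    """Hypothetical helper: approximate visual token count for one image.

    The image is snapped to multiples of UNIT, then scaled down (or up)
    if its area falls outside the illustrative [min_pixels, max_pixels] budget.
    """
    # Snap each side to the nearest multiple of UNIT, keeping at least one unit.
    h = max(UNIT, round(height / UNIT) * UNIT)
    w = max(UNIT, round(width / UNIT) * UNIT)

    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(UNIT, math.floor(height / scale / UNIT) * UNIT)
        w = max(UNIT, math.floor(width / scale / UNIT) * UNIT)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / UNIT) * UNIT
        w = math.ceil(width * scale / UNIT) * UNIT

    # One token per UNIT x UNIT cell of the resized image.
    return (h // UNIT) * (w // UNIT)


if __name__ == "__main__":
    for size in [(224, 224), (448, 672), (1080, 1920)]:
        print(size, estimate_visual_tokens(*size))
```

The point of the sketch is that the token count scales with image area instead of being fixed, so small images cost few tokens and large documents or charts keep more detail. Lowering `max_pixels` trades detail for a shorter visual sequence.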

DocVQA_test: 96.5%
VCR_en_easy: 91.9%
ChartQA: 88.3%
OCRBench: 87.7%
MMBench: 86.5%
MMBench_test: 86.5%
TextVQA: 85.5%
InfoVQA_test: 84.5%
EgoSchema: 77.9%
RealWorldQA: 77.8%
MMVet_GPT4Turbo: 74.0%
MVBench: 73.6%
MathVista-Mini: 70.5%
MMMU_val: 64.5%
MMMU-Pro: 46.2%
MTVQA: 30.9%