Qwen2-VL-72B-Instruct

An instruction-tuned large multimodal model for visual understanding and step-by-step reasoning. It accepts image and video input, handles inputs at dynamic resolution, and uses Multimodal Rotary Position Embedding (M-RoPE). These features support complex problem solving, multilingual text recognition in images, and agent-like interactions in video contexts.

Benchmark results

Benchmark         Score   Tags            Source
ChartQA           88.3%   self-reported   llm-stats
DocVQA_test       96.5%   self-reported   llm-stats
EgoSchema         77.9%   self-reported   llm-stats
InfoVQA_test      84.5%   self-reported   llm-stats
MathVista-Mini    70.5%   self-reported   llm-stats
MMBench           86.5%   self-reported   llm-stats
MMBench_test      86.5%   self-reported   llm-stats
MMMU-Pro          46.2%   self-reported   llm-stats
MMMU_val          64.5%   self-reported   llm-stats
MMVet_GPT4Turbo   74.0%   self-reported   llm-stats
MTVQA             30.9%   self-reported   llm-stats
MVBench           73.6%   self-reported   llm-stats
OCRBench          87.7%   self-reported   llm-stats
RealWorldQA       77.8%   self-reported   llm-stats
TextVQA           85.5%   self-reported   llm-stats
VCR_en_easy       91.9%   self-reported   llm-stats