Qwen2-VL-72B-Instruct
An instruction-tuned large multimodal model that excels at visual understanding and step-by-step reasoning. It accepts image and video input, processes inputs at dynamic resolution, and uses multimodal rotary position embeddings (M-RoPE). Together these support complex problem solving, multilingual text recognition in images, and agent-like interactions in video contexts.
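To make dynamic resolution handling concrete, the sketch below estimates how many visual tokens a single image would occupy at a given size. It rests on assumptions not stated in this listing: a 14-pixel ViT patch and a 2x2 merge of adjacent patch tokens, so each token covers roughly a 28x28 pixel cell. The function name `visual_token_count` is hypothetical, and the sketch ignores any minimum or maximum pixel budget and any special delimiter tokens the real preprocessing may apply.

```python
PATCH_SIZE = 14  # assumed ViT patch edge, in pixels
MERGE_SIZE = 2   # assumed 2x2 merging of adjacent patch tokens
FACTOR = PATCH_SIZE * MERGE_SIZE  # 28: each merged token covers a 28x28 pixel cell


def visual_token_count(height: int, width: int) -> int:
    """Estimate visual tokens for one image under the assumptions above.

    Excludes special delimiter tokens and any pixel-budget clamping.
    """
    # Snap each side to the nearest whole number of 28-pixel cells, at least one.
    h_cells = max(1, round(height / FACTOR))
    w_cells = max(1, round(width / FACTOR))
    return h_cells * w_cells


if __name__ == "__main__":
    for h, w in [(224, 224), (1080, 1920)]:
        print(f"{w}x{h}: {visual_token_count(h, w)} tokens")
```

The point of the sketch is that token cost scales with image area rather than being fixed, which is why larger documents and charts consume more of the context window than thumbnails.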
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| ChartQA | 88.3% | self-reported llm-stats |
| DocVQA_test | 96.5% | self-reported llm-stats |
| EgoSchema | 77.9% | self-reported llm-stats |
| InfoVQA_test | 84.5% | self-reported llm-stats |
| MathVista-Mini | 70.5% | self-reported llm-stats |
| MMBench | 86.5% | self-reported llm-stats |
| MMBench_test | 86.5% | self-reported llm-stats |
| MMMU-Pro | 46.2% | self-reported llm-stats |
| MMMU_val | 64.5% | self-reported llm-stats |
| MMVet_GPT4Turbo | 74.0% | self-reported llm-stats |
| MTVQA | 30.9% | self-reported llm-stats |
| MVBench | 73.6% | self-reported llm-stats |
| OCRBench | 87.7% | self-reported llm-stats |
| RealWorldQA | 77.8% | self-reported llm-stats |
| TextVQA | 85.5% | self-reported llm-stats |
| VCR_en_easy | 91.9% | self-reported llm-stats |