Llama 3.2 11B Instruct

Llama 3.2 11B Vision Instruct is an instruction-tuned multimodal large language model optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. It accepts text and images as input and generates text as output.

Benchmark results

Benchmark Score Tags Source
AI2D 91.1% self-reported llm-stats link →
ChartQA 83.4% self-reported llm-stats link →
DocVQA 88.4% self-reported llm-stats link →
GPQA 32.8% self-reported llm-stats link →
MATH 51.9% self-reported llm-stats link →
MathVista 51.5% self-reported llm-stats link →
MGSM 68.9% self-reported llm-stats link →
MMLU 73.0% self-reported llm-stats link →
MMMU 50.7% self-reported llm-stats link →
MMMU-Pro 33.0% self-reported llm-stats link →
VQAv2 (test) 75.2% self-reported llm-stats link →