GPT-4o
GPT-4o ('o' for 'omni') is a multimodal AI model that accepts text, audio, image, and video inputs and generates text, audio, and image outputs. It matches GPT-4 Turbo performance on English text and code, and improves on it in non-English languages, vision, and audio understanding.
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| ActivityNet | 61.9% | self-reported, llm-stats |
| AI2D | 94.2% | self-reported, llm-stats |
| Aider-Polyglot | 30.7% | self-reported, llm-stats |
| Aider-Polyglot Edit | 18.2% | self-reported, llm-stats |
| AIME 2024 | 13.1% | self-reported, llm-stats |
| ChartQA | 85.7% | self-reported, llm-stats |
| CharXiv-D | 85.3% | self-reported, llm-stats |
| CharXiv-R | 58.8% | self-reported, llm-stats |
| COLLIE | 61.0% | self-reported, llm-stats |
| ComplexFuncBench | 66.5% | self-reported, llm-stats |
| DocVQA | 92.8% | self-reported, llm-stats |
| EgoSchema | 72.2% | self-reported, llm-stats |
| ERQA | 35.2% | self-reported, llm-stats |
| GPQA | 70.1% | self-reported, llm-stats |
| Graphwalks BFS <128k | 41.7% | self-reported, llm-stats |
| Graphwalks parents <128k | 35.4% | self-reported, llm-stats |
| Humanity's Last Exam | 5.3% | self-reported, llm-stats |
| IFEval | 81.0% | self-reported, llm-stats |
| Internal API instruction following (hard) | 29.2% | self-reported, llm-stats |
| MathVista | 61.4% | self-reported, llm-stats |
| MMLU | 85.7% | self-reported, llm-stats |
| MMLU-Pro | 74.7% | self-reported, llm-stats |
| MMMLU | 81.4% | self-reported, llm-stats |
| MMMU | 72.2% | self-reported, llm-stats |
| MMMU-Pro | 59.9% | self-reported, llm-stats |
| Multi-Challenge | 40.3% | self-reported, llm-stats |
| Multi-IF | 60.9% | self-reported, llm-stats |
| MultiChallenge (o3-mini grader) | 39.9% | self-reported, llm-stats |
| OpenAI-MRCR: 2 needle 128k | 31.9% | self-reported, llm-stats |
| Scale MultiChallenge | 40.3% | self-reported, llm-stats |
| SimpleQA | 38.2% | self-reported, llm-stats |
| SWE-Bench Verified | 33.2% | self-reported, llm-stats |
| SWE-Lancer | 32.6% | self-reported, llm-stats |
| SWE-Lancer (IC-Diamond subset) | 12.4% | self-reported, llm-stats |
| TAU-bench Airline | 42.8% | self-reported, llm-stats |
| TAU-bench Retail | 60.3% | self-reported, llm-stats |
| Tau2 Airline | 45.5% | self-reported, llm-stats |
| Tau2 Retail | 63.4% | self-reported, llm-stats |
| Tau2 Telecom | 23.5% | self-reported, llm-stats |
| VideoMMMU | 61.2% | self-reported, llm-stats |