GPT-4
GPT-4 is a large multimodal model that accepts image and text inputs and generates text outputs. It reaches human-level performance on several professional and academic benchmarks, such as the Uniform Bar Exam (90.0%) and the LSAT (88.0%) listed below.
Benchmark results
| Benchmark | Score | Tags |
|---|---|---|
| AI2 Reasoning Challenge (ARC) | 96.3% | self-reported llm-stats |
| DROP | 80.9% | self-reported llm-stats |
| GPQA | 35.7% | self-reported llm-stats |
| HellaSwag | 95.3% | self-reported llm-stats |
| HumanEval | 67.0% | self-reported llm-stats |
| LSAT | 88.0% | self-reported llm-stats |
| MATH | 42.0% | self-reported llm-stats |
| MGSM | 74.5% | self-reported llm-stats |
| MMLU | 86.4% | self-reported llm-stats |
| SAT Math | 89.0% | self-reported llm-stats |
| Uniform Bar Exam | 90.0% | self-reported llm-stats |
| Winogrande | 87.5% | self-reported llm-stats |