GPQA
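A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be "Google-proof": experts who hold or are pursuing a PhD in the relevant field reach 65% accuracy, while skilled non-experts with unrestricted web access reach only 34%.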
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: biology, chemistry, general, physics, reasoning. Language: en. Verified by llm-stats: no.
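Since the maximum score is 1, a result on this benchmark is presumably reported as the fraction of questions answered correctly. A minimal sketch of that scoring, assuming each answer is recorded as a single choice letter; the function name `accuracy` and the toy data are hypothetical and are not taken from llm-stats or the official GPQA tooling:

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions answered correctly, in [0, 1]."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Compare choice letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data: four questions, three answered correctly.
predicted = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
print(accuracy(predicted, gold))  # 0.75
```

On this scale, the 65% expert accuracy quoted in the description corresponds to a score of 0.65.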