DROP


DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark that tests discrete reasoning over the content of paragraphs. Its questions are crowdsourced and adversarially created. Answering them requires resolving references within the paragraph and performing discrete operations such as addition, counting, or sorting. This demands a fuller understanding of the paragraph than shortcuts such as paraphrase matching and entity typing can provide.
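To make "discrete operations" concrete, here is a minimal sketch built on a hypothetical DROP-style passage. The passage, its numbers, and the helper `field_goal_yards` are invented for illustration and are not taken from the dataset. The regular expression stands in for the reading step a model must perform from natural language. The point is that each answer is computed from several facts in the passage rather than copied from a single span.

```python
import re

# Hypothetical DROP-style passage, invented for illustration (not from the dataset).
PASSAGE = (
    "The Bears opened the scoring with a 32-yard field goal. "
    "The Lions answered with a 45-yard field goal, "
    "and the Bears closed the half with a 21-yard field goal."
)


def field_goal_yards(passage: str) -> list[int]:
    """Return the yardage of every field goal mentioned, in passage order."""
    return [int(y) for y in re.findall(r"(\d+)-yard field goal", passage)]


yards = field_goal_yards(PASSAGE)  # [32, 45, 21]

# Counting: "How many field goals were kicked?"
count = len(yards)
# Sorting: "Which field goal was the longest?"
ordered = sorted(yards)
longest = ordered[-1]
# Addition: "How many field goal yards were there in total?"
total = sum(yards)
# Subtraction: "How many yards longer was the longest field goal than the shortest?"
difference = ordered[-1] - ordered[0]

print(count, longest, total, difference)  # 3 45 98 24
```

None of the answers (3, 45, 98, 24) appears as a single phrase that could be copied directly as the answer; each has to be derived by combining extracted values.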

Methodology

Imported from llm-stats public benchmark metadata.

Modality: text
Max score: 1 (leaderboard scores below are shown as percentages of this maximum)
Categories: math, reasoning
Language: English
Verified by llm-stats: no
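The metadata does not say which metric the listed scores use. For reference, DROP's official evaluation reports exact match and an F1 score over normalized answer tokens. The sketch below is a simplified bag-of-tokens F1, assuming lowercase, punctuation, and article normalization. It omits the official script's number handling and multi-span alignment, so it is not a drop-in replacement for that script. The helper names `normalize_tokens`, `token_f1`, and `mean_score` are hypothetical. Each per-question score falls in [0, 1], so a dataset average also has a maximum of 1 and is displayed as a percentage by multiplying by 100.

```python
import string
from collections import Counter

ARTICLES = {"a", "an", "the"}


def normalize_tokens(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, then split on whitespace.

    Simplification: removing punctuation also merges decimals such as "3.5" into "35".
    """
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [tok for tok in text.split() if tok not in ARTICLES]


def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted and a gold answer, in [0, 1]."""
    pred, ref = normalize_tokens(prediction), normalize_tokens(gold)
    if not pred and not ref:
        return 1.0  # both empty after normalization: treat as a match
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_score(pairs: list[tuple[str, str]]) -> float:
    """Average per-question F1 over (prediction, gold) pairs; maximum is 1."""
    return sum(token_f1(p, g) for p, g in pairs) / len(pairs)


print(round(token_f1("98 yards", "98"), 3))                           # 0.667
print(mean_score([("the Bears", "Bears"), ("Lions", "Bears")]) * 100)  # 50.0
```

The partial-credit case shows why F1 is more forgiving than exact match: "98 yards" against the gold "98" has full recall but only half precision.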

Leaderboard

All scores below are self-reported (source: llm-stats).

  1. DeepSeek-V3: 91.6%
  2. Claude 3.5 Sonnet: 87.1%
  3. Claude 3.5 Sonnet: 87.1%
  4. GPT-4 Turbo: 86.0%
  5. Nova Pro: 85.4%
  6. Llama 3.1 405B Instruct: 84.8%
  7. GPT-4o: 83.4%
  8. Claude 3.5 Haiku: 83.1%
  9. Claude 3 Opus: 83.1%
  10. GPT-4: 80.9%
  11. Nova Lite: 80.2%
  12. GPT-4o mini: 79.7%
  13. Llama 3.1 70B Instruct: 79.6%
  14. Nova Micro: 79.3%
  15. LongCat-Flash-Chat: 79.1%
  16. Claude 3 Sonnet: 78.9%
  17. Claude 3 Haiku: 78.4%
  18. Phi 4: 75.5%
  19. Gemini 1.5 Pro: 74.9%
  20. GPT-3.5 Turbo: 70.2%