PIQA

PIQA (Physical Interaction: Question Answering) is a benchmark dataset for physical commonsense reasoning in natural language. It tests whether AI systems can answer questions that require knowledge of the physical world, posed as multiple-choice questions about everyday situations, with a focus on atypical solutions inspired by instructables.com. The dataset contains about 21,000 questions; for each one, a model must choose the most appropriate solution for a physical interaction.
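As a rough illustration of the task format described above, the sketch below models one question as a goal with two candidate solutions and scores a chooser by accuracy. The field names `goal`, `sol1`, `sol2`, and `label` follow the naming used in commonly distributed copies of the dataset, but treat them as assumptions here. The `PiqaItem` class, the `accuracy` helper, the `always_first` baseline, and both example questions are hypothetical, written for illustration and not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PiqaItem:
    """One PIQA-style question (field names are assumptions, see lead-in)."""
    goal: str   # the physical task to accomplish
    sol1: str   # first candidate solution
    sol2: str   # second candidate solution
    label: int  # index of the correct solution: 0 -> sol1, 1 -> sol2


def accuracy(items: Sequence[PiqaItem], choose: Callable[[PiqaItem], int]) -> float:
    """Return the fraction of items where `choose` matches the gold label."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(1 for item in items if choose(item) == item.label)
    return correct / len(items)


def always_first(item: PiqaItem) -> int:
    """Hypothetical baseline that always picks sol1."""
    return 0


# Invented examples, not drawn from the actual dataset.
examples = [
    PiqaItem("Keep a door from slamming shut",
             "Wedge a folded towel under it",
             "Pour water under it",
             0),
    PiqaItem("Dry wet shoes overnight",
             "Seal them in a plastic bag",
             "Stuff them with newspaper",
             1),
]

score = accuracy(examples, always_first)
print(f"{score:.3f} ({score:.1%})")  # prints: 0.500 (50.0%)
```

With two choices per question, a chooser that ignores the input lands near 0.5 on a balanced set, which is a useful reference point when reading scores on this benchmark. A real evaluation would replace `always_first` with a function that queries the model under test.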

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (leaderboard scores are shown as percentages of this maximum). Categories: general, physics, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Phi-3.5-MoE-instruct self-reported llm-stats
    88.6%
  2. Hermes 3 70B self-reported llm-stats
    84.4%
  3. Gemma 2 27B self-reported llm-stats
    83.2%
  4. Gemma 2 9B self-reported llm-stats
    81.7%
  5. Gemma 3n E4B self-reported llm-stats
    81.0%
  6. 81.0%
  7. Phi-3.5-mini-instruct self-reported llm-stats
    81.0%
  8. Gemma 3n E2B self-reported llm-stats
    78.9%
  9. 78.9%
  10. Phi 4 Mini self-reported llm-stats
    77.6%
  11. ERNIE 4.5 self-reported llm-stats
    55.2%