HealthBench Hard

A more challenging variant of HealthBench, which evaluates large language models' performance and safety in healthcare through 5,000 multi-turn conversations, with rigorous evaluation criteria validated by 262 physicians from 60 countries.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: healthcare. Language: en. Verified by llm-stats: no.
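
The metadata gives a maximum score of 1, while the leaderboard reports percentages. As a minimal sketch, assuming each percentage is simply the 0-to-1 score scaled by 100 (the metadata does not state this explicitly), the conversion could look like the following. The function name `to_percent` is hypothetical and used only for illustration.

```python
MAX_SCORE = 1.0  # "Max score: 1" from the benchmark metadata


def to_percent(score: float, max_score: float = MAX_SCORE) -> str:
    """Format a raw score as a percentage string with one decimal place.

    Assumption: leaderboard percentages are score / max_score * 100.
    """
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return f"{100.0 * score / max_score:.1f}%"


if __name__ == "__main__":
    # An illustrative raw score of 0.428 on the 0-to-1 scale.
    print(to_percent(0.428))
```

Under this assumption, a leaderboard entry of 30.0% would correspond to a raw score of 0.300.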

Leaderboard

  1. Muse Spark: 42.8% (self-reported, via llm-stats)
  2. GPT OSS 120B: 30.0% (self-reported, via llm-stats)
  3. GPT-5.3 Chat: 25.9% (self-reported, via llm-stats)
  4. GPT-5.5 Instant: 22.9% (self-reported, via llm-stats)
  5. GPT OSS 20B: 10.8% (self-reported, via llm-stats)
  6. GPT-5: 1.6% (self-reported, via llm-stats)