HealthBench

An open-source benchmark for measuring the performance and safety of large language models in healthcare. It consists of 5,000 multi-turn conversations, with model responses graded against 48,562 unique rubric criteria written by 262 physicians and spanning multiple health contexts and behavioral dimensions.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: healthcare. Language: en. Verified by llm-stats: no.
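
To illustrate how a rubric-graded response can end up on a 0 to 1 scale (the "Max score: 1" above), here is a minimal sketch. It assumes each rubric criterion carries a point value, positive for desirable behavior and negative for undesirable behavior, and that the score is the points earned divided by the maximum positive points, clipped to [0, 1]. The `Criterion` class, the `rubric_score` function, and the point values are hypothetical names and numbers chosen for illustration; the benchmark's own grader and aggregation may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    points: int  # positive = desirable behavior, negative = undesirable
    met: bool    # whether the grader judged the response to meet it


def rubric_score(criteria: list[Criterion]) -> float:
    """Return earned points over maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive-point criterion")
    # Met criteria contribute their points, including negative ones.
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


example = [
    Criterion(points=5, met=True),    # desirable behavior, present
    Criterion(points=3, met=False),   # desirable behavior, missing
    Criterion(points=-4, met=True),   # undesirable behavior, present
]
score = rubric_score(example)  # (5 - 4) / (5 + 3) = 0.125
print(f"{score:.1%}")          # prints 12.5%
```

Under this assumption, a leaderboard value such as 58.0% would correspond to a normalized score of 0.58.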

Leaderboard

  1. Kimi K2-Thinking-0905: 58.0% (self-reported, via llm-stats)
  2. GPT OSS 120B: 57.6% (self-reported, via llm-stats)
  3. GPT-5.3 Chat: 54.1% (self-reported, via llm-stats)
  4. GPT-5.5 Instant: 51.4% (self-reported, via llm-stats)
  5. GPT OSS 20B: 42.5% (self-reported, via llm-stats)