HealthBench

healthcare

An open-source benchmark for measuring the performance and safety of large language models in healthcare. It consists of 5,000 multi-turn conversations graded against 48,562 unique rubric criteria written by 262 physicians, spanning health contexts and behavioral dimensions.

Leaderboard

Kimi K2-Thinking-0905: 58.0%
GPT OSS 120B: 57.6%
GPT-5.3 Chat: 54.1%
GPT-5.5 Instant: 51.4%
GPT OSS 20B: 42.5%