MedXpertQA


MedXpertQA is a benchmark for evaluating expert-level medical knowledge and advanced reasoning. It contains 4,460 questions spanning 17 specialties and 11 body systems, divided into a text-only subset and a multimodal subset. The questions are expert-level exam items that incorporate diverse medical images and rich clinical information.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: healthcare, multimodal, reasoning, vision. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Muse Spark: 78.4% (self-reported, llm-stats)
  2. Qwen3.5-122B-A10B: 67.3% (self-reported, llm-stats)
  3. Qwen3.5-27B: 62.4% (self-reported, llm-stats)
  4. Qwen3.5-35B-A3B: 61.4% (self-reported, llm-stats)
  5. Gemma 4 31B: 61.3% (self-reported, llm-stats)
  6. Gemma 4 26B-A4B: 58.1% (self-reported, llm-stats)
  7. MAI-Thinking-1: 43.0% (self-reported, llm-stats)
  8. Gemma 4 E4B: 28.7% (self-reported, llm-stats)
  9. Gemma 4 E2B: 23.5% (self-reported, llm-stats)
  10. MedGemma 4B IT: 18.8% (self-reported, llm-stats)