PhiBench


PhiBench is an internal benchmark designed to evaluate diverse skills and reasoning abilities of language models. It covers a wide range of tasks, including coding (debugging, extending incomplete code, explaining code snippets) and mathematics (identifying proof errors, generating related problems). It was created by Microsoft's research team to address limitations of standard academic benchmarks and to guide the development of the Phi-4 model.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, math, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Phi 4 Reasoning Plus: 74.2% (self-reported, llm-stats)
  2. Phi 4 Reasoning: 70.6% (self-reported, llm-stats)
  3. Phi 4: 56.2% (self-reported, llm-stats)
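
The metadata gives a maximum score of 1, while the leaderboard shows percentages. The following is a minimal sketch of how the two could be reconciled, assuming each displayed percentage is simply the 0-to-1 score multiplied by 100 (the page does not state this). The names `Entry`, `to_fraction`, and `ranked` are hypothetical and used only for illustration; the figures are the ones listed in the leaderboard.

```python
from dataclasses import dataclass

MAX_SCORE = 1.0  # from the metadata line "Max score: 1"


@dataclass(frozen=True)
class Entry:
    model: str
    percent: float                 # value as displayed on the leaderboard
    source: str = "self-reported"  # all three listed scores are self-reported


# Scores copied from the leaderboard, deliberately out of order.
ENTRIES = [
    Entry("Phi 4", 56.2),
    Entry("Phi 4 Reasoning Plus", 74.2),
    Entry("Phi 4 Reasoning", 70.6),
]


def to_fraction(percent: float) -> float:
    """Assumed mapping: displayed percent = 100 * score / MAX_SCORE."""
    return round(percent / 100.0 * MAX_SCORE, 4)


def ranked(entries):
    """Return the entries sorted from highest to lowest score."""
    return sorted(entries, key=lambda e: e.percent, reverse=True)


for rank, entry in enumerate(ranked(ENTRIES), start=1):
    print(f"{rank}. {entry.model}: {entry.percent:.1f}% "
          f"(score {to_fraction(entry.percent):.3f}, {entry.source})")
```

Because the scores are self-reported and not verified by llm-stats, a ranking built this way reflects the reported numbers only, not an independent evaluation.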