PhiBench

math

PhiBench is an internal benchmark designed to evaluate diverse skills and reasoning abilities of language models. It covers a wide range of tasks, including coding (debugging, extending incomplete code, explaining code snippets) and mathematics (identifying proof errors, generating related problems). Microsoft's research team created it to address limitations of standard academic benchmarks and to guide the development of the Phi-4 model.

Leaderboard


Phi 4 Reasoning Plus: 74.2%

Phi 4 Reasoning: 70.6%

Phi 4: 56.2%