PhiBench
PhiBench is an internal benchmark designed to evaluate the diverse skills and reasoning abilities of language models. It covers a wide range of tasks, including coding (debugging, extending incomplete code, explaining code snippets) and mathematics (identifying proof errors, generating related problems). Microsoft's research team created it to address limitations of standard academic benchmarks and to guide the development of the Phi-4 model.
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general, math, reasoning. Language: en. Verified by llm-stats: no.