PaperBench

coding

PaperBench is a benchmark that evaluates how well AI agents can replicate research papers. It tests models on complex, multi-step workflows that involve implementing code, running experiments, and reproducing the scientific results reported in academic publications.

Leaderboard

Kimi K2.5: 63.5%

MiniMax M3: 52.6%