Program Bench

coding

Program Bench evaluates code-generation agents by asking them to recreate a program's behavior given only a compiled binary and its documentation. It spans 200 tasks, ranging from small CLI tools to large systems such as FFmpeg and SQLite, and submissions are judged against more than 248,000 fuzz-generated behavioral tests.
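As a rough illustration of what a behavioral test can look like, the sketch below runs a reference program and a candidate reimplementation on the same inputs and counts how often their observable behavior matches. It is not Program Bench's actual harness: the helper names `observe` and `pass_rate`, the choice to compare only exit code and stdout, and the 5-second timeout are all assumptions made for this example.

```python
import subprocess
from typing import List, Sequence, Tuple


def observe(cmd: Sequence[str], stdin_data: bytes, timeout: float = 5.0) -> Tuple[int, bytes]:
    """Run one command on the given stdin and return (exit code, stdout).

    Hypothetical helper: stderr is discarded here, and a timeout raises
    subprocess.TimeoutExpired instead of being scored as a failure.
    """
    proc = subprocess.run(
        cmd,
        input=stdin_data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def pass_rate(
    reference_cmd: Sequence[str],
    candidate_cmd: Sequence[str],
    inputs: List[bytes],
) -> float:
    """Fraction of inputs on which the candidate matches the reference."""
    if not inputs:
        raise ValueError("need at least one test input")
    passed = sum(
        observe(reference_cmd, data) == observe(candidate_cmd, data)
        for data in inputs
    )
    return passed / len(inputs)
```

In this sketch the reference command would be the compiled binary, the candidate command would be the agent's reimplementation, and the inputs would come from a fuzzer. A real harness would likely also compare other observable effects, such as stderr or files written.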

Leaderboard

Kimi K2.7 Code

53.6%
