Terminal-Bench 2.1

coding

Terminal-Bench 2.1 is an updated release of the Terminal-Bench benchmark that tests AI agents' ability to operate a computer via the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks.

Leaderboard

Claude Fable 5: 84.3%

MiniMax M3: 66.0%