Terminal-Bench 2.1

coding

Terminal-Bench 2.1 is an updated release of the Terminal-Bench benchmark that tests AI agents' ability to operate a computer via the terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: agents, code, reasoning, tool_calling. Language: en. Verified by llm-stats: no.
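The metadata lists a maximum score of 1, while the leaderboard shows a percentage. A minimal sketch of how the two scales could relate, assuming (this is not stated in the source) that the displayed percentage is simply the raw score divided by the maximum score and multiplied by 100. The function name to_percent is hypothetical and used only for illustration.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of the maximum score.

    Assumption: the leaderboard percentage is score / max_score * 100.
    The source only states "Max score: 1"; it does not confirm this mapping.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    # Scale to percent and keep one decimal place, as in the leaderboard.
    return f"{score / max_score * 100:.1f}%"


if __name__ == "__main__":
    # Under the assumption above, a raw score of 0.66 is displayed as 66.0%.
    print(to_percent(0.66))
```

Under this reading, the 66.0% in the leaderboard would correspond to a raw score of 0.66 on the 0 to 1 scale.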

Leaderboard

  1. MiniMax M3 (self-reported, llm-stats): 66.0%