Terminal-Bench

coding

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as dealing with cybersecurity vulnerabilities. The benchmark has two parts: a dataset of ~100 hand-crafted, human-verified tasks, and an execution harness that connects language models to a terminal sandbox.
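
To make the dataset-plus-harness split concrete, here is a minimal sketch of what such a loop could look like. It is not Terminal-Bench's actual harness or API: `Task`, `scripted_agent`, `run_task`, and `pass_rate` are hypothetical names, the "agent" is a fixed script standing in for a language model, and a temporary directory stands in for a real isolated sandbox.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class Task:
    """Hypothetical task record: an instruction plus a check on the final state."""
    instruction: str
    check: Callable[[Path], bool]


def scripted_agent(instruction: str, history: List[str]) -> Optional[str]:
    """Stand-in for a model: replays a fixed command list, then stops."""
    script = ["echo hello > greeting.txt"]
    return script[len(history)] if len(history) < len(script) else None


def run_task(task: Task, agent: Callable[[str, List[str]], Optional[str]],
             max_steps: int = 5) -> bool:
    """Let the agent issue shell commands in a scratch directory, then verify."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)  # assumption: a temp dir, not a real sandbox
        history: List[str] = []
        for _ in range(max_steps):
            command = agent(task.instruction, history)
            if command is None:  # agent signals it is done
                break
            result = subprocess.run(command, shell=True, cwd=workdir,
                                    capture_output=True, text=True, timeout=30)
            # Feed terminal output back so the agent can react to it.
            history.append(result.stdout + result.stderr)
        return task.check(workdir)


def pass_rate(results: List[bool]) -> float:
    """Percentage of tasks whose check passed."""
    return 100.0 * sum(results) / len(results)


def greeting_written(workdir: Path) -> bool:
    """Example check: the file exists and contains the expected text."""
    path = workdir / "greeting.txt"
    return path.exists() and path.read_text().strip() == "hello"


if __name__ == "__main__":
    task = Task("Create greeting.txt containing 'hello'", greeting_written)
    print(run_task(task, scripted_agent))
```

Under this reading, a leaderboard percentage is simply the share of tasks whose final check passes, which is what `pass_rate` computes over a list of per-task results.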

Leaderboard

Showing 20 of 26 results

Model                     Score
Claude Haiku 4.5          92.0%
Claude Sonnet 4.5         50.0%
MiniMax M2.1              47.9%
Kimi K2-Thinking-0905     47.1%
MiniMax M2                46.3%
Claude Opus 4.1           43.3%
Nova 2 Pro                41.3%
Claude Haiku 4.5          41.0%
GLM-4.6                   40.5%
LongCat-Flash-Chat        39.5%
Claude Opus 4             39.2%
DeepSeek-V3.2-Exp         37.7%
GLM-4.5                   37.5%
Claude Sonnet 4           35.5%
Claude 3.7 Sonnet         35.2%
LongCat-Flash-Lite        33.8%
GLM-4.7                   33.3%
Nova 2 Lite               32.5%
DeepSeek-V3.1             31.3%
MiMo-V2-Flash             30.5%