Terminal-Bench

coding

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as handling cybersecurity vulnerabilities. The benchmark has two parts: a dataset of roughly 100 hand-crafted, human-verified tasks, and an execution harness that connects language models to a terminal sandbox.
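To make the dataset-plus-harness split concrete, here is a minimal sketch of how a harness of this kind could be structured. It is not Terminal-Bench's actual implementation: the `Task` record, `run_task`, `pass_rate`, and `scripted_agent` are hypothetical names, a temporary directory stands in for a real sandbox, and a fixed script stands in for a language model that would normally read command output and decide its next step.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task record: an instruction plus a shell check that exits 0 on success."""
    instruction: str
    check_command: str


def run_task(task: Task, agent: Callable[[str], List[str]]) -> bool:
    """Run the agent's commands in a throwaway directory, then run the task's check."""
    # Assumption: a temp directory stands in for the sandbox. It gives no real isolation.
    with tempfile.TemporaryDirectory() as workdir:
        for command in agent(task.instruction):
            subprocess.run(command, shell=True, cwd=workdir,
                           capture_output=True, timeout=30)
        check = subprocess.run(task.check_command, shell=True, cwd=workdir,
                               capture_output=True, timeout=30)
        return check.returncode == 0


def pass_rate(tasks: List[Task], agent: Callable[[str], List[str]]) -> float:
    """Fraction of tasks whose check passes, on a 0 to 1 scale."""
    results = [run_task(task, agent) for task in tasks]
    return sum(results) / len(results)


def scripted_agent(instruction: str) -> List[str]:
    """Stand-in for a language model: it only knows how to solve one task."""
    if "hello.txt" in instruction:
        return ["echo hello > hello.txt"]
    return []


TASKS = [
    Task("Create hello.txt containing the word hello", "grep -q hello hello.txt"),
    Task("Create a directory named build", "test -d build"),
]

print(pass_rate(TASKS, scripted_agent))  # 0.5: the first task passes, the second fails
```

The point of the sketch is the separation of concerns: tasks carry their own verification, and the harness only executes commands and reports whether each check passed.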

Methodology

This entry is imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: agents, code, reasoning. Language: en. Verified by llm-stats: no.
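The metadata gives a maximum score of 1, while the leaderboard shows percentages. Assuming the percentages are simply the 0 to 1 score scaled by 100 (an assumption, not something the metadata states), the conversion can be sketched as follows. The helper name `to_percent` is hypothetical.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a score on a 0..max_score scale as a percentage with one decimal."""
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} is outside the range 0..{max_score}")
    return f"{score / max_score * 100:.1f}%"


print(to_percent(0.5))    # 50.0%
print(to_percent(0.479))  # 47.9%
```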

Leaderboard

  1. Claude Haiku 4.5 (self-reported, llm-stats): 92.0%
  2. Claude Sonnet 4.5 (self-reported, llm-stats): 50.0%
  3. MiniMax M2.1 (self-reported, llm-stats): 47.9%
  4. Kimi K2-Thinking-0905 (self-reported, llm-stats): 47.1%
  5. MiniMax M2 (self-reported, llm-stats): 46.3%
  6. Claude Opus 4.1 (self-reported, llm-stats): 43.3%
  7. Nova 2 Pro (self-reported, llm-stats): 41.3%
  8. Claude Haiku 4.5 (self-reported, llm-stats): 41.0%
  9. GLM-4.6 (self-reported, llm-stats): 40.5%
  10. LongCat-Flash-Chat (self-reported, llm-stats): 39.5%
  11. Claude Opus 4 (self-reported, llm-stats): 39.2%
  12. DeepSeek-V3.2-Exp (self-reported, llm-stats): 37.7%
  13. GLM-4.5 (self-reported, llm-stats): 37.5%
  14. Claude Sonnet 4 (self-reported, llm-stats): 35.5%
  15. Claude 3.7 Sonnet (self-reported, llm-stats): 35.2%
  16. LongCat-Flash-Lite (self-reported, llm-stats): 33.8%
  17. GLM-4.7 (self-reported, llm-stats): 33.3%
  18. Nova 2 Lite (self-reported, llm-stats): 32.5%
  19. DeepSeek-V3.1 (self-reported, llm-stats): 31.3%
  20. MiMo-V2-Flash (self-reported, llm-stats): 30.5%