Cybersecurity CTFs

Cybersecurity CTFs is a Capture the Flag (CTF) benchmark that evaluates LLMs on offensive security challenges. Its tasks span cryptography, web exploitation, binary analysis, and forensics, and are used to assess AI capabilities in cybersecurity problem-solving.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: safety. Language: en. Verified by llm-stats: no.
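
The metadata lists a maximum score of 1, while the leaderboard reports percentages. A minimal sketch of that conversion follows, assuming each raw score is stored as a fraction of the maximum. The function name `to_percent` and the sample value are illustrative and not part of the llm-stats metadata.

```python
def to_percent(score: float, max_score: float = 1.0) -> str:
    """Format a raw score as a percentage of the maximum score.

    Assumption: scores are stored on a 0..max_score scale; the
    benchmark metadata only states that the max score is 1.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return f"{100 * score / max_score:.1f}%"


# Hypothetical raw value chosen to match the display format of the leaderboard.
print(to_percent(0.776))
```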

Leaderboard

  1. GPT-5.3 Codex self-reported llm-stats
    77.6%
  2. Claude Haiku 4.5 self-reported llm-stats
    46.9%
  3. o1-mini self-reported llm-stats
    28.7%