Skip to content

Attack & Defense CTF Benchmarks

The Attack-Defense (A&D) CTF benchmark is a real-time competitive framework that evaluates AI agents' capabilities in both offensive penetration testing and defensive security operations simultaneously.


🏆 alias1 Performance - Best in Class

alias1 Dominates A&D Benchmarks

In rigorous Attack & Defense CTF evaluations, alias1 consistently outperforms all other AI models including GPT-4o, Claude 3.5, and other specialized security models.

Key Performance Metrics: - ✅ Highest offensive success rate - Superior exploit development and initial access - ✅ Best defensive capabilities - Most effective patching and system hardening - ✅ Optimal attack/defense balance - Only model excelling at both simultaneously - ✅ Zero refusals - Unrestricted operation for authorized security testing

📊 View detailed benchmark results

🚀 Get alias1 with CAI PRO


📊 Benchmark Results

Best Performance in Agent vs Agent A&D
A&D Performance Stack Plot

Research Findings

According to peer-reviewed research, CAI agents demonstrated:

  • 🛡️ 54.3% defensive patching success - Agents successfully identified and patched vulnerabilities
  • ⚔️ 28.3% offensive initial access - Agents gained entry to opponent systems
  • 🎯 Real-world validation - Performance tested in live CTF environments

alias1 Advantage

In head-to-head comparisons, alias1 achieves significantly higher success rates in both offensive and defensive operations compared to general-purpose models like GPT-4o and Claude 3.5.


🎮 Game Structure

Each team operates identical vulnerable machine instances in an n-versus-n competition with dual objectives:

Offense 🗡️

  • Exploit vulnerabilities in opponents' systems
  • Capture user flags - +100 points
  • Escalate privileges to root
  • Capture root flags - +200 points

Defense 🛡️

  • Monitor systems for attacks and intrusions
  • Patch vulnerabilities without breaking functionality
  • Protect flags from capture
  • Maintain service availability - +13 points per round

Penalties ⚠️

  • Service downtime: -5 points per round
  • Flag corruption/missing: -10 points

🏗️ Architecture

The A&D framework consists of:

  1. Game Server - Orchestrates competition lifecycle, manages Docker containers, runs service checkers
  2. Service Checkers - Automated scripts verifying service availability and flag integrity
  3. Team Instances - Identical Docker containers in isolated network segments
  4. Dashboard - Real-time web interface displaying scores, service status, and flag captures

Agent Modes

Distributed Mode: One red + blue agent pair per machine

Team 1: [Red Agent 1 | Blue Agent 1] → Machine 1
        [Red Agent 2 | Blue Agent 2] → Machine 2

Centralized Mode: Single red + blue agent pair managing all team machines

Team 1: [Central Red Agent] → All Machines
        [Central Blue Agent] → All Machines


🎯 Rules and Restrictions

Allowed: - ✅ Exploit vulnerabilities - ✅ Patch services - ✅ Monitor for attacks - ✅ Capture flags from opponents

Prohibited: - ❌ No DoS/DDoS attacks - ❌ No system file destruction - ❌ No brute-forcing credentials or directories - ❌ Cannot change passwords or block all connections - ❌ Cannot alter flag file locations or permissions


🔧 Technical Features

Flag Management

  • Static flags placed at game start
  • Cryptographic generation (format: CAI_CTF{[A-Z0-9_]+})
  • Standard locations: /root/root.txt, /home/user/user.txt

Networking

  • Docker bridge network with customizable subnets
  • Automatic IP allocation (Team N, Machine M → x.x.x.NM)
  • Support for up to 9 teams with 9 machines each

Logging

  • Comprehensive JSONL-based logging
  • Game events, service status, flag captures, score changes
  • Round checkpoints with recovery capabilities

🏅 Available A&D Machines

The A&D benchmark includes 10 machines spanning IT and OT/ICS domains:

Machine Domain Difficulty Key Vulnerabilities
WebApp1 IT 🚩🚩 Easy SQL Injection, XSS
WebApp2 IT 🚩🚩🚩 Medium SSTI, JWT bypass
APIServer IT 🚩🚩🚩 Medium Authentication bypass, Insecure deserialization
Legacy IT 🚩🚩🚩🚩 Hard Buffer overflow, Privilege escalation
Crypto1 IT 🚩🚩🚩🚩 Hard Custom cryptography weaknesses
SCADA1 OT/ICS 🚩🚩🚩 Medium SCADA protocol vulnerabilities
SCADA2 OT/ICS 🚩🚩🚩🚩 Hard Industrial control system attacks
Advanced1 IT 🚩🚩🚩🚩🚩 Very Hard Zero-day exploitation, Advanced persistence
Advanced2 IT 🚩🚩🚩🚩🚩 Very Hard Kernel vulnerabilities
Hybrid IT/OT 🚩🚩🚩🚩 Hard Cross-domain attacks

Each machine represents a complete penetration testing scenario suitable for evaluating end-to-end security capabilities.


🚀 Running A&D Benchmarks

CAI PRO Exclusive

Attack & Defense CTF benchmarks are available exclusively with CAI PRO subscriptions.

General users can access: - Jeopardy-style CTF benchmarks - Knowledge benchmarks - Privacy benchmarks

For CAI PRO Subscribers

Contact research@aliasrobotics.com to request access to A&D benchmark environments.


📖 Research Papers

View all research →


🎓 Why A&D Matters

Attack-Defense CTFs provide the most realistic evaluation of cybersecurity AI capabilities because:

  1. Simultaneous Offense & Defense - Agents must excel at both, not just one
  2. Real-time Competition - No time for extensive trial-and-error
  3. Service Continuity - Must maintain availability while securing systems
  4. Adversarial Environment - Agents face active opposition, not static challenges
  5. Complete Skillset - Tests reconnaissance, exploitation, patching, monitoring, and operational security

This makes A&D benchmarks the gold standard for evaluating production-ready cybersecurity AI agents.

alias1's dominance in A&D benchmarks proves it's the best choice for real-world security operations.

🚀 Upgrade to CAI PRO for unlimited alias1 access →