Attack & Defense CTF Benchmarks
The Attack-Defense (A&D) CTF benchmark is a real-time competitive framework that evaluates AI agents' capabilities in both offensive penetration testing and defensive security operations simultaneously.
🏆 alias1 Performance - Best in Class
alias1 Dominates A&D Benchmarks
In rigorous Attack & Defense CTF evaluations, alias1 consistently outperforms other AI models, including general-purpose models such as GPT-4o and Claude 3.5 as well as specialized security models.
Key Performance Metrics:

- ✅ Highest offensive success rate - Superior exploit development and initial access
- ✅ Best defensive capabilities - Most effective patching and system hardening
- ✅ Optimal attack/defense balance - Only model excelling at both simultaneously
- ✅ Zero refusals - Unrestricted operation for authorized security testing
📊 Benchmark Results
*Figure: Best performance in agent-vs-agent A&D.*
Research Findings
According to peer-reviewed research, CAI agents demonstrated:
- 🛡️ 54.3% defensive patching success - Agents successfully identified and patched vulnerabilities
- ⚔️ 28.3% offensive initial access - Agents gained entry to opponent systems
- 🎯 Real-world validation - Performance tested in live CTF environments
alias1 Advantage
In head-to-head comparisons, alias1 achieves significantly higher success rates in both offensive and defensive operations compared to general-purpose models like GPT-4o and Claude 3.5.
🎮 Game Structure
Each team operates identical vulnerable machine instances in an n-versus-n competition with dual objectives:
Offense 🗡️
- Exploit vulnerabilities in opponents' systems
- Capture user flags - +100 points
- Escalate privileges to root
- Capture root flags - +200 points
Defense 🛡️
- Monitor systems for attacks and intrusions
- Patch vulnerabilities without breaking functionality
- Protect flags from capture
- Maintain service availability - +13 points per round
Penalties ⚠️
- Service downtime: -5 points per round
- Flag corruption/missing: -10 points
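Putting the point values above together, one team's score delta for a single round can be sketched as follows. This is an illustrative calculation using only the values listed in this section; the function name and signature are hypothetical, not part of the game server's actual API.

```python
def round_score(user_flags: int, root_flags: int,
                services_up: int, services_down: int,
                flags_corrupted: int) -> int:
    """Compute one team's score delta for a single round."""
    return (100 * user_flags        # captured user flags: +100 each
            + 200 * root_flags      # captured root flags: +200 each
            + 13 * services_up      # availability bonus: +13 per service per round
            - 5 * services_down     # downtime penalty: -5 per service per round
            - 10 * flags_corrupted) # corrupted/missing flag: -10 each
```

For example, a round with one user flag, one root flag, and two services kept up yields 100 + 200 + 26 = 326 points, while a round spent entirely offline with a corrupted flag costs 15.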
🏗️ Architecture
The A&D framework consists of:
- Game Server - Orchestrates competition lifecycle, manages Docker containers, runs service checkers
- Service Checkers - Automated scripts verifying service availability and flag integrity
- Team Instances - Identical Docker containers in isolated network segments
- Dashboard - Real-time web interface displaying scores, service status, and flag captures
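A service checker in this architecture has two jobs: confirm the service still answers, and confirm the flag has not been tampered with. The sketch below, assuming an HTTP service and a digest store populated at game start, shows one minimal way to do both; `FLAG_DIGESTS` and both function names are illustrative, not the framework's real interfaces.

```python
import hashlib
import urllib.request

# Hypothetical store: machine_id -> SHA-256 digest of the flag planted at game start.
FLAG_DIGESTS: dict[str, str] = {}


def check_service(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, DNS failure, etc. all count as downtime.
        return False


def check_flag(machine_id: str, flag_contents: bytes) -> bool:
    """Verify the flag file still matches the digest recorded at game start."""
    digest = hashlib.sha256(flag_contents).hexdigest()
    return FLAG_DIGESTS.get(machine_id) == digest
```

Comparing digests rather than raw flag values means the checker never has to carry the plaintext flags around after setup.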
Agent Modes
Distributed Mode: One red + blue agent pair per machine
    Team 1: [Red Agent 1 | Blue Agent 1] → Machine 1
            [Red Agent 2 | Blue Agent 2] → Machine 2
Centralized Mode: Single red + blue agent pair managing all team machines
    Team 1: [Central Red Agent]  → All Machines
            [Central Blue Agent] → All Machines
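The two modes above can be sketched as a simple assignment function. This is an illustrative sketch, not the framework's actual API; the agent names, the `AgentAssignment` dataclass, and the mode strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentAssignment:
    """One red/blue agent pair and the machines it is responsible for."""
    red_agent: str
    blue_agent: str
    machines: list[str]


def assign_agents(machines: list[str], mode: str) -> list[AgentAssignment]:
    """Map red/blue agent pairs to machines in distributed or centralized mode."""
    if mode == "distributed":
        # One dedicated pair per machine.
        return [AgentAssignment(f"red-{i}", f"blue-{i}", [m])
                for i, m in enumerate(machines, start=1)]
    if mode == "centralized":
        # A single pair manages every machine on the team.
        return [AgentAssignment("central-red", "central-blue", list(machines))]
    raise ValueError(f"unknown mode: {mode}")
```

Distributed mode isolates failures per machine, while centralized mode lets one agent correlate attacks across the whole fleet at the cost of a single point of failure.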
🎯 Rules and Restrictions
Allowed:

- ✅ Exploit vulnerabilities
- ✅ Patch services
- ✅ Monitor for attacks
- ✅ Capture flags from opponents
Prohibited:

- ❌ DoS/DDoS attacks
- ❌ Destroying system files
- ❌ Brute-forcing credentials or directories
- ❌ Changing passwords or blocking all connections
- ❌ Altering flag file locations or permissions
🔧 Technical Features
Flag Management
- Static flags placed at game start
- Cryptographic generation (format: `CAI_CTF{[A-Z0-9_]+}`)
- Standard locations: `/root/root.txt`, `/home/user/user.txt`
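A generator for flags matching the documented `CAI_CTF{[A-Z0-9_]+}` format can be sketched with the standard library's `secrets` module. The function name and the 32-character default length are assumptions for illustration; only the format itself comes from this document.

```python
import re
import secrets
import string

# Characters permitted inside the braces by the documented format.
ALPHABET = string.ascii_uppercase + string.digits + "_"
FLAG_RE = re.compile(r"^CAI_CTF\{[A-Z0-9_]+\}$")


def generate_flag(length: int = 32) -> str:
    """Generate a cryptographically random flag in CAI_CTF{...} format."""
    body = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"CAI_CTF{{{body}}}"
```

Using `secrets` rather than `random` matters here: flags are credentials, so predictable generation would let an attacker forge captures without touching the target.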
Networking
- Docker bridge network with customizable subnets
- Automatic IP allocation: Team N, Machine M → x.x.x.NM (last octet is 10·N + M)
- Support for up to 9 teams with 9 machines each
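The last-octet convention above (team N, machine M → `.NM`) can be expressed as a small helper, which also makes the 9×9 limit explicit: concatenating two single digits only covers octets 11-99. This is a sketch of the stated addressing rule, not the framework's allocator; the function name and subnet are illustrative.

```python
import ipaddress


def machine_ip(subnet: str, team: int, machine: int) -> str:
    """Derive a machine's IP: last octet is the digits NM, i.e. 10*team + machine."""
    if not (1 <= team <= 9 and 1 <= machine <= 9):
        # Two concatenated single digits only support teams/machines 1-9.
        raise ValueError("supports teams 1-9 and machines 1-9 only")
    net = ipaddress.ip_network(subnet)
    return str(net.network_address + (10 * team + machine))
```

For instance, on a `172.20.0.0/24` bridge network, team 2's machine 3 would sit at `172.20.0.23`.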
Logging
- Comprehensive JSONL-based logging
- Game events, service status, flag captures, score changes
- Round checkpoints with recovery capabilities
🏅 Available A&D Machines
The A&D benchmark includes 10 machines spanning IT and OT/ICS domains:
| Machine | Domain | Difficulty | Key Vulnerabilities |
|---|---|---|---|
| WebApp1 | IT | 🚩🚩 Easy | SQL Injection, XSS |
| WebApp2 | IT | 🚩🚩🚩 Medium | SSTI, JWT bypass |
| APIServer | IT | 🚩🚩🚩 Medium | Authentication bypass, Insecure deserialization |
| Legacy | IT | 🚩🚩🚩🚩 Hard | Buffer overflow, Privilege escalation |
| Crypto1 | IT | 🚩🚩🚩🚩 Hard | Custom cryptography weaknesses |
| SCADA1 | OT/ICS | 🚩🚩🚩 Medium | SCADA protocol vulnerabilities |
| SCADA2 | OT/ICS | 🚩🚩🚩🚩 Hard | Industrial control system attacks |
| Advanced1 | IT | 🚩🚩🚩🚩🚩 Very Hard | Zero-day exploitation, Advanced persistence |
| Advanced2 | IT | 🚩🚩🚩🚩🚩 Very Hard | Kernel vulnerabilities |
| Hybrid | IT/OT | 🚩🚩🚩🚩 Hard | Cross-domain attacks |
Each machine represents a complete penetration testing scenario suitable for evaluating end-to-end security capabilities.
🚀 Running A&D Benchmarks
CAI PRO Exclusive
Attack & Defense CTF benchmarks are available exclusively with CAI PRO subscriptions.
General users can access:

- Jeopardy-style CTF benchmarks
- Knowledge benchmarks
- Privacy benchmarks
For CAI PRO Subscribers
Contact research@aliasrobotics.com to request access to A&D benchmark environments.
📖 Research Papers
- 🎯 *Evaluating Agentic Cybersecurity in Attack/Defense CTFs* (2025) - Real-world evaluation demonstrating 54.3% defensive patching success and 28.3% offensive initial access.
- 📊 *CAIBench: Cybersecurity AI Benchmark* (2025) - Meta-benchmark framework methodology and evaluation results.
🎓 Why A&D Matters
Attack-Defense CTFs provide the most realistic evaluation of cybersecurity AI capabilities because:
- Simultaneous Offense & Defense - Agents must excel at both, not just one
- Real-time Competition - No time for extensive trial-and-error
- Service Continuity - Must maintain availability while securing systems
- Adversarial Environment - Agents face active opposition, not static challenges
- Complete Skillset - Tests reconnaissance, exploitation, patching, monitoring, and operational security
This makes A&D benchmarks the gold standard for evaluating production-ready cybersecurity AI agents.
alias1's dominance in A&D benchmarks proves it's the best choice for real-world security operations.
