# Running Benchmarks

This guide explains how to set up and run CAIBench evaluations to assess AI model performance across cybersecurity tasks.
## 🔧 Prerequisites

### System Requirements

- Python 3.8 or higher
- Docker (for CTF and Cyber Range benchmarks)
- Git with submodule support
- At least 8 GB RAM recommended
- 20 GB of free disk space for benchmark containers
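As a quick preflight, a short script (a sketch, not part of CAI) can confirm that the interpreter version and free disk space meet the requirements above:

```python
import shutil
import sys

def preflight(path=".", version=sys.version_info, min_python=(3, 8), min_disk_gb=20):
    """Return a list of unmet requirements; an empty list means good to go."""
    problems = []
    if tuple(version[:2]) < min_python:
        problems.append(f"Python {version[0]}.{version[1]} is below {min_python[0]}.{min_python[1]}")
    # shutil.disk_usage reports free bytes for the filesystem containing `path`
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_disk_gb:
        problems.append(f"only {free_gb:.1f} GB free, {min_disk_gb} GB needed")
    return problems

print(preflight())
```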
### Required Packages

```bash
# Install base dependencies
pip install cai-framework

# Install benchmark-specific requirements
pip install cvss
```
## 📦 Setup

### 1. Clone Repository with Submodules

```bash
git clone https://github.com/aliasrobotics/cai.git
cd cai
git submodule update --init --recursive
```
### 2. Configure API Keys

Create a `.env` file in the project root:

```bash
# For alias1 (CAI PRO)
ALIAS_API_KEY="sk-your-caipro-key"

# For OpenAI models
OPENAI_API_KEY="sk-..."

# For Anthropic models
ANTHROPIC_API_KEY="sk-ant-..."

# For DeepSeek models
DEEPSEEK_API_KEY="sk-..."

# For OpenRouter (access to 200+ models)
OPENROUTER_API_KEY="sk-or-..."
OPENROUTER_API_BASE="https://openrouter.ai/api/v1"

# For Ollama (local models)
OLLAMA_API_BASE="http://localhost:11434/v1"
```
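Each backend reads a different variable from this file. Before launching a long run, a small helper (hypothetical, not shipped with CAI; the mapping mirrors the `.env` entries above) can check that the variable for your chosen `--backend` is actually set:

```python
import os

# Which environment variable each --backend value relies on
# (mirrors the .env entries above; adjust if your setup differs).
BACKEND_KEYS = {
    "alias": "ALIAS_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "ollama": "OLLAMA_API_BASE",  # a local server URL rather than a key
}

def missing_key(backend, env=os.environ):
    """Return the name of the unset variable for `backend`, or None if present."""
    var = BACKEND_KEYS[backend]
    return None if env.get(var) else var

print(missing_key("openai"))
```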
### 3. Verify Setup

```bash
# Test basic functionality
python -c "from cai import cli; print('CAI installed successfully!')"

# Check benchmarks directory
ls benchmarks/
```
## 🚀 Running Benchmarks

### Basic Command Structure

```bash
python benchmarks/eval.py \
  --model MODEL_NAME \
  --dataset_file INPUT_FILE \
  --eval EVAL_TYPE \
  --backend BACKEND \
  [--save_interval N]
```
### Parameters

| Parameter | Description | Required | Example |
|---|---|---|---|
| `--model` / `-m` | Model identifier | ✅ Yes | `alias1`, `gpt-4o`, `ollama/qwen2.5:14b` |
| `--dataset_file` / `-d` | Path to benchmark dataset | ✅ Yes | `benchmarks/cybermetric/CyberMetric-2-v1.json` |
| `--eval` / `-e` | Benchmark type | ✅ Yes | `cybermetric`, `seceval`, `cti_bench`, `cyberpii-bench` |
| `--backend` / `-B` | API backend | ✅ Yes | `alias`, `openai`, `anthropic`, `ollama`, `openrouter` |
| `--save_interval` / `-s` | Save results every N questions | ❌ No | `10` |
## 📊 Benchmark Types

### Knowledge Benchmarks

#### CyberMetric

Measures performance on cybersecurity-specific question answering and contextual understanding.
```bash
# Using alias1 (CAI PRO)
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json \
  --eval cybermetric \
  --backend alias

# Using Ollama with Qwen
python benchmarks/eval.py \
  --model ollama/qwen2.5:14b \
  --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json \
  --eval cybermetric \
  --backend ollama

# Using OpenAI GPT-4o mini
python benchmarks/eval.py \
  --model gpt-4o-mini \
  --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json \
  --eval cybermetric \
  --backend openai
```
#### SecEval

Evaluates LLMs on security-related tasks like phishing analysis and vulnerability classification.

```bash
# Using Anthropic Claude
python benchmarks/eval.py \
  --model claude-3-7-sonnet-20250219 \
  --dataset_file benchmarks/seceval/eval/datasets/questions-2.json \
  --eval seceval \
  --backend anthropic

# Using alias1
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/seceval/eval/datasets/questions-2.json \
  --eval seceval \
  --backend alias
```
#### CTI Bench

Evaluates Cyber Threat Intelligence understanding and processing.

```bash
# Using OpenRouter with Qwen
python benchmarks/eval.py \
  --model qwen/qwen3-32b:free \
  --dataset_file benchmarks/cti_bench/data/cti-mcq1.tsv \
  --eval cti_bench \
  --backend openrouter

# Running one of the other CTI Bench dataset variants
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/cti_bench/data/cti-ate2.tsv \
  --eval cti_bench \
  --backend alias
```
### Privacy Benchmarks

#### CyberPII-Bench

Evaluates the ability to identify and sanitize Personally Identifiable Information (PII).

```bash
# Using alias1 (recommended for best privacy protection)
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/cyberPII-bench/memory01_gold.csv \
  --eval cyberpii-bench \
  --backend alias
```
Learn more about privacy benchmarks →
## 📁 Output Structure

Results are automatically saved to structured directories:

```
outputs/
└── benchmark_name/
    └── model_YYYYMMDD_random-id/
        ├── answers.json            # Complete test with LLM responses
        ├── information.txt         # Performance metrics and metadata
        ├── entity_performance.txt  # (Privacy benchmarks only)
        ├── metrics.txt             # (Privacy benchmarks only)
        ├── mistakes.txt            # (Privacy benchmarks only)
        └── overall_report.txt      # (Privacy benchmarks only)
```
### Example Output Files

`information.txt`:

```
Model: alias1
Benchmark: cybermetric
Accuracy: 87.0%
Total Questions: 100
Correct: 87
Incorrect: 13
Runtime: 245 seconds
Date: 2025-01-15
```
`answers.json`:

```json
{
  "question_1": {
    "prompt": "What is SQL injection?",
    "expected": "A code injection technique...",
    "response": "SQL injection is...",
    "correct": true
  }
}
```
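Because each entry carries a `correct` flag, the headline accuracy can be recomputed directly from `answers.json`, for example to re-score a filtered subset of questions. A minimal sketch assuming the layout shown above:

```python
import json

def accuracy(answers):
    """Fraction of entries whose `correct` flag is true."""
    flags = [entry["correct"] for entry in answers.values()]
    return sum(flags) / len(flags)

# In practice: answers = json.load(open("outputs/.../answers.json"))
answers = {
    "question_1": {"prompt": "What is SQL injection?", "correct": True},
    "question_2": {"prompt": "What is XSS?", "correct": False},
}
print(f"{accuracy(answers):.1%}")  # 50.0%
```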
## 🎯 Best Practices

### 1. Model Selection

**Recommended: Use alias1.** For cybersecurity benchmarks, alias1 consistently achieves the highest scores:

- 🥇 Best performance across all benchmark categories
- ✅ Zero refusals for security-related questions
- 🚀 Optimized for cybersecurity tasks
### 2. Save Intervals

For long-running benchmarks, use `--save_interval` to write intermediate results:

```bash
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json \
  --eval cybermetric \
  --backend alias \
  --save_interval 25  # Save every 25 questions
```
### 3. Parallel Execution

Run multiple benchmarks in parallel (in separate terminals):

```bash
# Terminal 1: CyberMetric
python benchmarks/eval.py --model alias1 --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json --eval cybermetric --backend alias

# Terminal 2: SecEval
python benchmarks/eval.py --model alias1 --dataset_file benchmarks/seceval/eval/datasets/questions-2.json --eval seceval --backend alias

# Terminal 3: CTI Bench
python benchmarks/eval.py --model alias1 --dataset_file benchmarks/cti_bench/data/cti-mcq1.tsv --eval cti_bench --backend alias
```
### 4. Docker Benchmarks (CAI PRO)

For Jeopardy CTF, Attack & Defense, and Cyber Range benchmarks:

**CAI PRO Exclusive:** Docker-based benchmarks (CTFs, A&D, Cyber Ranges) are available exclusively with CAI PRO. Contact research@aliasrobotics.com for access.
## 📊 Interpreting Results

### Accuracy Metrics

Different benchmarks use different metrics:

- **Knowledge benchmarks**: Accuracy (% of correct answers)
- **Privacy benchmarks**: Precision, Recall, F1, and F2 scores
- **CTF benchmarks**: Success rate (% of challenges solved)
- **A&D benchmarks**: Points scored (offensive + defensive)
### Comparing Models

When comparing models, consider:

- **Overall accuracy**: higher is better
- **Response quality**: check `answers.json` for reasoning
- **Refusal rate**: how often the model refuses to answer
- **Runtime**: time to complete the benchmark
- **Consistency**: run the benchmark multiple times for statistical significance
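For the consistency point in particular, summarizing repeated runs as a mean and spread makes comparisons less sensitive to single-run noise. A sketch (the per-run accuracy values would be copied from each run's `information.txt`):

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Mean and sample standard deviation of per-run accuracies (in %)."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(accuracies), stdev(accuracies)

runs = [87.5, 85.0, 88.5]  # hypothetical accuracies from three repeated runs
avg, spread = summarize(runs)
print(f"{avg:.1f}% ± {spread:.1f}")  # 87.0% ± 1.8
```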
## 🔍 Troubleshooting

### Common Issues

**Issue: "Module not found" errors**

```bash
# Solution: update submodules and reinstall benchmark requirements
git submodule update --init --recursive
pip install cvss
```

**Issue: "API key not found"**

```bash
# Solution: verify the .env file exists and has the correct format
cat .env
# Should show: BACKEND_API_KEY="sk-..."
```

**Issue: Docker containers fail to start**

```bash
# Solution: check that the Docker daemon is running
docker ps
sudo systemctl start docker  # Linux
```

**Issue: Out-of-memory errors**

Use a smaller model or increase system RAM. As a fallback, run benchmarks with a save interval so partial results survive an interrupted run:

```bash
--save_interval 10
```
## 📚 Additional Resources

- 📊 CAIBench Research Paper
- 🎯 A&D CTF Evaluation Paper
- 💻 GitHub Repository
- 📖 Knowledge Benchmarks Guide
- 🔒 Privacy Benchmarks Guide
## 🚀 Next Steps

- **View A&D Benchmark Results**: see alias1's performance in attack and defense scenarios
- **Explore Jeopardy CTFs**: learn about CTF benchmarks
- **Upgrade to CAI PRO**: get unlimited alias1 access and exclusive benchmarks