# Running Benchmarks

This guide explains how to set up and run CAIBench evaluations to assess AI model performance across cybersecurity tasks.
## 🔧 Prerequisites

### System Requirements

- Python 3.8 or higher
- Docker (for CTF and Cyber Range benchmarks)
- Git with submodule support
- At least 8 GB RAM recommended
- 20 GB of free disk space for benchmark containers
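As a quick preflight, a short script (a sketch, not part of CAI) can confirm that the interpreter version and free disk space meet the requirements above:

```python
import shutil
import sys

def preflight(path=".", version=sys.version_info, min_python=(3, 8), min_disk_gb=20):
    """Return a list of unmet requirements; an empty list means good to go."""
    problems = []
    if tuple(version[:2]) < min_python:
        problems.append(f"Python {version[0]}.{version[1]} is below {min_python[0]}.{min_python[1]}")
    # shutil.disk_usage reports free bytes for the filesystem containing `path`
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_disk_gb:
        problems.append(f"only {free_gb:.1f} GB free, {min_disk_gb} GB needed")
    return problems

print(preflight())
```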
### Required Packages

```bash
# Install base dependencies
pip install cai-framework

# Install benchmark-specific requirements
pip install cvss
```
## 📦 Setup

### 1. Clone Repository with Submodules

```bash
git clone https://github.com/aliasrobotics/cai.git
cd cai
git submodule update --init --recursive
```
### 2. Configure API Keys

Create a `.env` file in the project root:

```bash
# For alias1 (CAI PRO)
ALIAS_API_KEY="sk-your-caipro-key"

# For OpenAI models
OPENAI_API_KEY="sk-..."

# For Anthropic models
ANTHROPIC_API_KEY="sk-ant-..."

# For DeepSeek models
DEEPSEEK_API_KEY="sk-..."

# For OpenRouter (access to 200+ models)
OPENROUTER_API_KEY="sk-or-..."
OPENROUTER_API_BASE="https://openrouter.ai/api/v1"

# For Ollama (local models)
OLLAMA_API_BASE="http://localhost:11434/v1"
```
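Each backend reads a different variable from this file. Before launching a long run, a small helper (hypothetical, not shipped with CAI; the mapping mirrors the `.env` entries above) can check that the variable for your chosen `--backend` is actually set:

```python
import os

# Which environment variable each --backend value relies on
# (mirrors the .env entries above; adjust if your setup differs).
BACKEND_KEYS = {
    "alias": "ALIAS_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "ollama": "OLLAMA_API_BASE",  # a local server URL rather than a key
}

def missing_key(backend, env=os.environ):
    """Return the name of the unset variable for `backend`, or None if present."""
    var = BACKEND_KEYS[backend]
    return None if env.get(var) else var

print(missing_key("openai"))
```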
### 3. Verify Setup

```bash
# Test basic functionality
python -c "from cai import cli; print('CAI installed successfully!')"

# Check benchmarks directory
ls benchmarks/
```
## 🚀 Running Benchmarks

### Basic Command Structure

```bash
python benchmarks/eval.py \
  --model MODEL_NAME \
  --dataset_file INPUT_FILE \
  --eval EVAL_TYPE \
  --backend BACKEND \
  [--save_interval N]
```
### Parameters

| Parameter | Description | Required | Example |
|---|---|---|---|
| `--model` / `-m` | Model identifier | ✅ Yes | `alias1`, `gpt-4o`, `ollama/qwen2.5:14b` |
| `--dataset_file` / `-d` | Path to benchmark dataset | ✅ Yes | `benchmarks/cybermetric/CyberMetric-2-v1.json` |
| `--eval` / `-e` | Benchmark type | ✅ Yes | `cybermetric`, `seceval`, `cti_bench`, `cyberpii-bench` |
| `--backend` / `-B` | API backend | ✅ Yes | `alias`, `openai`, `anthropic`, `ollama`, `openrouter` |
| `--save_interval` / `-s` | Save results every N questions | ❌ No | `10` |
## 📊 Benchmark Types

### Knowledge Benchmarks

#### CyberMetric

Measures performance on cybersecurity-specific question answering and contextual understanding.
```bash
# Using alias1 (CAI PRO)
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json \
  --eval cybermetric \
  --backend alias

# Using Ollama with Qwen
python benchmarks/eval.py \
  --model ollama/qwen2.5:14b \
  --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json \
  --eval cybermetric \
  --backend ollama

# Using OpenAI GPT-4o mini
python benchmarks/eval.py \
  --model gpt-4o-mini \
  --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json \
  --eval cybermetric \
  --backend openai
```
#### SecEval

Evaluates LLMs on security-related tasks like phishing analysis and vulnerability classification.

```bash
# Using Anthropic Claude
python benchmarks/eval.py \
  --model claude-3-7-sonnet-20250219 \
  --dataset_file benchmarks/seceval/eval/datasets/questions-2.json \
  --eval seceval \
  --backend anthropic

# Using alias1
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/seceval/eval/datasets/questions-2.json \
  --eval seceval \
  --backend alias
```
#### CTI Bench

Evaluates Cyber Threat Intelligence understanding and processing.

```bash
# Using OpenRouter with Qwen
python benchmarks/eval.py \
  --model qwen/qwen3-32b:free \
  --dataset_file benchmarks/cti_bench/data/cti-mcq1.tsv \
  --eval cti_bench \
  --backend openrouter

# Running one of the other CTI Bench dataset variants
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/cti_bench/data/cti-ate2.tsv \
  --eval cti_bench \
  --backend alias
```
### Privacy Benchmarks

#### CyberPII-Bench

Evaluates the ability to identify and sanitize Personally Identifiable Information (PII).

```bash
# Using alias1 (recommended for best privacy protection)
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/cyberPII-bench/memory01_gold.csv \
  --eval cyberpii-bench \
  --backend alias
```
Learn more about privacy benchmarks →
## 📁 Output Structure

Results are automatically saved to structured directories:

```
outputs/
└── benchmark_name/
    └── model_YYYYMMDD_random-id/
        ├── answers.json            # Complete test with LLM responses
        ├── information.txt         # Performance metrics and metadata
        ├── entity_performance.txt  # (Privacy benchmarks only)
        ├── metrics.txt             # (Privacy benchmarks only)
        ├── mistakes.txt            # (Privacy benchmarks only)
        └── overall_report.txt      # (Privacy benchmarks only)
```
### Example Output Files

`information.txt`:

```
Model: alias1
Benchmark: cybermetric
Accuracy: 87.0%
Total Questions: 100
Correct: 87
Incorrect: 13
Runtime: 245 seconds
Date: 2025-01-15
```
`answers.json`:

```json
{
  "question_1": {
    "prompt": "What is SQL injection?",
    "expected": "A code injection technique...",
    "response": "SQL injection is...",
    "correct": true
  }
}
```
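Because each entry carries a `correct` flag, the headline accuracy can be recomputed directly from `answers.json`, for example to re-score a filtered subset of questions. A minimal sketch assuming the layout shown above:

```python
import json

def accuracy(answers):
    """Fraction of entries whose `correct` flag is true."""
    flags = [entry["correct"] for entry in answers.values()]
    return sum(flags) / len(flags)

# In practice: answers = json.load(open("outputs/.../answers.json"))
answers = {
    "question_1": {"prompt": "What is SQL injection?", "correct": True},
    "question_2": {"prompt": "What is XSS?", "correct": False},
}
print(f"{accuracy(answers):.1%}")  # 50.0%
```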
## 🎯 Best Practices

### 1. Model Selection

**Recommended: Use alias1.** For cybersecurity benchmarks, alias1 consistently achieves the highest scores:

- 🥇 Best performance across all benchmark categories
- ✅ Zero refusals for security-related questions
- 🚀 Optimized for cybersecurity tasks
### 2. Save Intervals

For long-running benchmarks, use `--save_interval` to write intermediate results:

```bash
python benchmarks/eval.py \
  --model alias1 \
  --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json \
  --eval cybermetric \
  --backend alias \
  --save_interval 25  # Save every 25 questions
```
### 3. Parallel Execution

Run multiple benchmarks in parallel (in separate terminals):

```bash
# Terminal 1: CyberMetric
python benchmarks/eval.py --model alias1 --dataset_file benchmarks/cybermetric/CyberMetric-2-v1.json --eval cybermetric --backend alias

# Terminal 2: SecEval
python benchmarks/eval.py --model alias1 --dataset_file benchmarks/seceval/eval/datasets/questions-2.json --eval seceval --backend alias

# Terminal 3: CTI Bench
python benchmarks/eval.py --model alias1 --dataset_file benchmarks/cti_bench/data/cti-mcq1.tsv --eval cti_bench --backend alias
```
### 4. Docker Benchmarks (CAI PRO)

For Jeopardy CTF, Attack & Defense, and Cyber Range benchmarks:

**CAI PRO Exclusive:** Docker-based benchmarks (CTFs, A&D, Cyber Ranges) are available exclusively with CAI PRO. Contact research@aliasrobotics.com for access.
## 📊 Interpreting Results

### Accuracy Metrics

Different benchmarks use different metrics:

- **Knowledge benchmarks**: Accuracy (% of correct answers)
- **Privacy benchmarks**: Precision, Recall, F1, and F2 scores
- **CTF benchmarks**: Success rate (% of challenges solved)
- **A&D benchmarks**: Points scored (offensive + defensive)
### Comparing Models

When comparing models, consider:

- **Overall accuracy**: higher is better
- **Response quality**: check `answers.json` for reasoning
- **Refusal rate**: how often the model refuses to answer
- **Runtime**: time to complete the benchmark
- **Consistency**: run the benchmark multiple times for statistical significance
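For the consistency point in particular, summarizing repeated runs as a mean and spread makes comparisons less sensitive to single-run noise. A sketch (the per-run accuracy values would be copied from each run's `information.txt`):

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Mean and sample standard deviation of per-run accuracies (in %)."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(accuracies), stdev(accuracies)

runs = [87.5, 85.0, 88.5]  # hypothetical accuracies from three repeated runs
avg, spread = summarize(runs)
print(f"{avg:.1f}% ± {spread:.1f}")  # 87.0% ± 1.8
```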
## 🔍 Troubleshooting

### Common Issues

**Issue: "Module not found" errors**

```bash
# Solution: update submodules and reinstall benchmark requirements
git submodule update --init --recursive
pip install cvss
```

**Issue: "API key not found"**

```bash
# Solution: verify the .env file exists and has the correct format
cat .env
# Should show: BACKEND_API_KEY="sk-..."
```

**Issue: Docker containers fail to start**

```bash
# Solution: check that the Docker daemon is running
docker ps
sudo systemctl start docker  # Linux
```

**Issue: Out-of-memory errors**

Use a smaller model or increase system RAM. As a fallback, run benchmarks with a save interval so partial results survive an interrupted run:

```bash
--save_interval 10
```
## 📚 Additional Resources

- 📊 CAIBench Research Paper
- 🎯 A&D CTF Evaluation Paper
- 💻 GitHub Repository
- 📖 Knowledge Benchmarks Guide
- 🔒 Privacy Benchmarks Guide
## 🚀 Next Steps

- **View A&D Benchmark Results**: see alias1's performance in attack and defense scenarios
- **Explore Jeopardy CTFs**: learn about CTF benchmarks
- **Upgrade to CAI PRO**: get unlimited alias1 access and exclusive benchmarks