AI Red Teaming: Testing Your AI Agent Defenses
You cannot defend what you have not attacked. AI red teaming systematically probes every layer of your agent stack — from prompt boundaries to container escape vectors — so you find the vulnerabilities before attackers do. This playbook provides the complete test methodology with 25 specific test cases across 5 categories.
Test Categories & Cases
- ▸Direct system prompt override
- ▸Indirect injection via document
- ▸Nested injection in tool output
- ▸Role-playing jailbreak
- ▸Encoded instruction injection (base64, unicode)
- ▸Request for dangerous content (should refuse)
- ▸Privilege escalation attempt
- ▸Out-of-scope task request
- ▸Social engineering the agent
- ▸Persistence/memory manipulation
- ▸Prompt to output full system prompt
- ▸Extract other users' data via RAG
- ▸Leak environment variables or secrets
- ▸Output training data verbatim
- ▸API key extraction via crafted query
- ▸Infinite recursion prompt
- ▸Memory exhaustion via long context
- ▸Token flooding to exceed rate limit
- ▸Slow tool call bomb
- ▸Embedding space flooding in RAG
- ▸Model checksum verification
- ▸Dependency vulnerability scan
- ▸Backdoor trigger phrase test
- ▸Model behavior consistency across versions
- ▸Serialization attack on model artifacts
CI/CD Integration: Automated Security Gate
# GitHub Actions — AI security gate
name: AI Agent Security Tests
on: [push, pull_request]
jobs:
ai-red-team:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Verify model checksums
run: sha256sum -c models/checksums.txt
- name: Run behavioral test suite
run: python tests/behavioral_suite.py --agent moltbot
env:
AGENT_ENDPOINT: http://localhost:8080
- name: Prompt injection scan
run: python tests/injection_tests.py --category RT01 RT02 RT03
- name: Assert zero critical findings
run: python tests/assert_results.py --max-critical 0
# Block deployment if any critical finding
- name: Gate deployment
if: failure()
run: echo "SECURITY GATE FAILED — deployment blocked" && exit 1Finding Severity Classification
CRITICAL — Block Deployment
- • System prompt fully overrideable
- • Agent can exfiltrate secrets/credentials
- • Unrestricted command execution
- • Cross-tenant data access
HIGH — Fix Within 7 Days
- • Partial injection (limited override)
- • Rate limit bypassable
- • Excessive agency without confirmation
- • Audit log gaps
MEDIUM — Fix Within 30 Days
- • Inconsistent refusal behavior
- • Verbose error messages
- • Suboptimal sandboxing
LOW — Track & Improve
- • Hallucination without guardrail
- • Missing structured output validation
- • Log verbosity issues
Frequently Asked Questions
What is AI red teaming?
AI red teaming is the practice of adversarially testing AI systems to discover security vulnerabilities before attackers do. For LLM-based agents, it includes: prompt injection testing, jailbreak attempts, data exfiltration probes, behavioral boundary testing, and infrastructure security testing. The goal is to find weaknesses in both the model's behavior and the surrounding system.
How often should I red team my AI agents?
Minimum: before every major model update or agent capability change. Best practice: run automated adversarial test suites in CI/CD on every build. Quarterly: comprehensive manual red team exercise including novel attack vectors. After any security incident: immediate re-test of affected attack surface.
What is a behavioral test suite for AI agents?
A behavioral test suite is a set of deterministic tests that verify an AI agent behaves correctly and securely. It includes: refusal tests (agent must decline dangerous requests), boundary tests (agent stays within declared scope), consistency tests (same input produces safe output across model versions), and canary tests (known injection patterns must be blocked). Run in CI/CD before every deployment.
Can I automate AI red teaming?
Yes, partially. Automated tests cover: known injection patterns, refusal boundary testing, output length/format validation, rate limit enforcement, model checksum verification. Human red teamers are still required for: novel attack vectors, social engineering scenarios, and creative jailbreak development. Use Moltbot to orchestrate automated tests and track results over time.