"Not a Pentest" Notice: AI red teaming as described here is for testing your own AI systems. Never use these techniques against systems you do not own or have explicit permission to test.

Moltbot AI Security · Batch 5

AI Red Teaming: Testing Your AI Agent Defenses

You cannot defend what you have not attacked. AI red teaming systematically probes every layer of your agent stack — from prompt boundaries to container escape vectors — so you find the vulnerabilities before attackers do. This playbook provides the complete test methodology with 25 specific test cases across 5 categories.

Test categories

Specific test cases

LLM01-05

OWASP coverage

CI/CD

Automation target

Test Categories & Cases

RT01Prompt Injection TestsOWASP LLM01

▸Direct system prompt override
▸Indirect injection via document
▸Nested injection in tool output
▸Role-playing jailbreak
▸Encoded instruction injection (base64, unicode)

RT02Boundary & Refusal TestsOWASP LLM01/LLM08

▸Request for dangerous content (should refuse)
▸Privilege escalation attempt
▸Out-of-scope task request
▸Social engineering the agent
▸Persistence/memory manipulation

RT03Data Exfiltration TestsOWASP LLM06

▸Prompt to output full system prompt
▸Extract other users' data via RAG
▸Leak environment variables or secrets
▸Output training data verbatim
▸API key extraction via crafted query

RT04Denial of Service TestsOWASP LLM04

▸Infinite recursion prompt
▸Memory exhaustion via long context
▸Token flooding to exceed rate limit
▸Slow tool call bomb
▸Embedding space flooding in RAG

RT05Supply Chain TestsOWASP LLM03/LLM05

▸Model checksum verification
▸Dependency vulnerability scan
▸Backdoor trigger phrase test
▸Model behavior consistency across versions
▸Serialization attack on model artifacts

CI/CD Integration: Automated Security Gate

# GitHub Actions — AI security gate
name: AI Agent Security Tests
on: [push, pull_request]

jobs:
  ai-red-team:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Verify model checksums
        run: sha256sum -c models/checksums.txt

      - name: Run behavioral test suite
        run: python tests/behavioral_suite.py --agent moltbot
        env:
          AGENT_ENDPOINT: http://localhost:8080

      - name: Prompt injection scan
        run: python tests/injection_tests.py --category RT01 RT02 RT03

      - name: Assert zero critical findings
        run: python tests/assert_results.py --max-critical 0

      # Block deployment if any critical finding
      - name: Gate deployment
        if: failure()
        run: echo "SECURITY GATE FAILED — deployment blocked" && exit 1

Finding Severity Classification

CRITICAL — Block Deployment

• System prompt fully overrideable
• Agent can exfiltrate secrets/credentials
• Unrestricted command execution
• Cross-tenant data access

HIGH — Fix Within 7 Days

• Partial injection (limited override)
• Rate limit bypassable
• Excessive agency without confirmation
• Audit log gaps

MEDIUM — Fix Within 30 Days

• Inconsistent refusal behavior
• Verbose error messages
• Suboptimal sandboxing

LOW — Track & Improve

• Hallucination without guardrail
• Missing structured output validation
• Log verbosity issues

Frequently Asked Questions

What is AI red teaming?

AI red teaming is the practice of adversarially testing AI systems to discover security vulnerabilities before attackers do. For LLM-based agents, it includes: prompt injection testing, jailbreak attempts, data exfiltration probes, behavioral boundary testing, and infrastructure security testing. The goal is to find weaknesses in both the model's behavior and the surrounding system.

How often should I red team my AI agents?

Minimum: before every major model update or agent capability change. Best practice: run automated adversarial test suites in CI/CD on every build. Quarterly: comprehensive manual red team exercise including novel attack vectors. After any security incident: immediate re-test of affected attack surface.

What is a behavioral test suite for AI agents?

A behavioral test suite is a set of deterministic tests that verify an AI agent behaves correctly and securely. It includes: refusal tests (agent must decline dangerous requests), boundary tests (agent stays within declared scope), consistency tests (same input produces safe output across model versions), and canary tests (known injection patterns must be blocked). Run in CI/CD before every deployment.

Can I automate AI red teaming?

Yes, partially. Automated tests cover: known injection patterns, refusal boundary testing, output length/format validation, rate limit enforcement, model checksum verification. Human red teamers are still required for: novel attack vectors, social engineering scenarios, and creative jailbreak development. Use Moltbot to orchestrate automated tests and track results over time.

Further Resources

AI Agent Security Hub

Full OWASP LLM defense map

Prompt Injection Defense

Fix RT01 findings

Roast My Moltbot

Free quick red-team of your setup

Model Poisoning Protection

Fix RT05 supply chain findings