
LLM Output Filtering

LLM outputs can contain PII, policy violations, or otherwise harmful content. Four filter types address this: safety classifiers, PII filtering, policy violation detection, and post-processing.

What is output filtering for LLMs? Simply explained

Output filtering for LLMs works like a content filter for AI responses: it checks every output before it is shown to the user. A safety classifier detects harmful content (hate speech, violence, self-harm). PII filtering removes personal data (email addresses, phone numbers, credit card numbers). Policy violation detection adds disclaimers to legal, financial, or medical advice. Post-processing formats code, sanitises Markdown, and validates links. Without output filtering, an AI model could inadvertently disclose PII, generate harmful content, or create legal exposure.
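The four stages can be combined into a single pipeline. The sketch below is a minimal illustration in plain Python; all function names, keyword lists, and thresholds are invented stand-ins, not a real Moltbot API:

```python
import re

def safety_check(text: str) -> bool:
    """Stand-in for a safety classifier: flag a few obvious keywords."""
    return not any(word in text.lower() for word in ("attack plan", "build a bomb"))

def redact_pii(text: str) -> str:
    """Stand-in for PII filtering: redact email addresses via regex."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<PII_REDACTED>", text)

def add_disclaimers(text: str) -> str:
    """Stand-in for policy-violation detection."""
    if "invest in" in text.lower():
        text += "\n\nThis is not financial advice."
    return text

def post_process(text: str, max_chars: int = 10_000) -> str:
    """Stand-in for post-processing: enforce a length limit."""
    return text[:max_chars]

def filter_output(text: str) -> str:
    # Safety check first; if it fails, none of the later stages run.
    if not safety_check(text):
        return "I cannot provide that information."
    return post_process(add_disclaimers(redact_pii(text)))

print(filter_output("Contact alice@example.com and invest in FooCorp."))
```

The ordering matters: the safety classifier runs first so that a blocked output is replaced wholesale, while redaction and disclaimers only apply to outputs that are allowed through.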


The 4 Output Filter Types

OF-1 Content Safety Classifier

Run a safety classifier on every LLM output before returning it to the user. Detect harmful content and block or replace it.

# Moltbot output safety classifier:
output_filtering:
  enabled: true
  classifier: "moltbot-safety-v3"

  # Safety categories:
  categories:
    violence:
      enabled: true
      threshold: 0.70  # Block if confidence > 70%
      action: block_and_replace
      replacement: "I cannot provide information about violence."

    hate_speech:
      enabled: true
      threshold: 0.75
      action: block_and_replace
      replacement: "I cannot generate hate speech."

    self_harm:
      enabled: true
      threshold: 0.80
      action: block_and_alert  # High severity — alert security team

    sexual_content:
      enabled: true
      threshold: 0.70
      action: block_and_replace

    illegal_activities:
      enabled: true
      threshold: 0.75
      action: block_and_replace

    phishing:
      enabled: true
      threshold: 0.85
      action: block_and_alert

  # Fallback output:
  fallback:
    enabled: true
    message: "I cannot provide that information. For assistance, contact support."

  # Logging:
  logging:
    log_blocked_outputs: true
    log_classifier_scores: true
    alert_on_high_severity: true
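The threshold/action logic in this config can be sketched as follows. The category table is a hard-coded stand-in, and a real deployment would obtain the scores from the safety classifier model (e.g. "moltbot-safety-v3") rather than passing them in:

```python
# Per-category thresholds and actions, mirroring the YAML above (subset).
CATEGORIES = {
    "violence": {"threshold": 0.70, "action": "block_and_replace",
                 "replacement": "I cannot provide information about violence."},
    "self_harm": {"threshold": 0.80, "action": "block_and_alert"},
}

FALLBACK = "I cannot provide that information. For assistance, contact support."

def apply_safety(output: str, scores: dict) -> str:
    """Block the output if any category score exceeds its threshold."""
    for name, cfg in CATEGORIES.items():
        if scores.get(name, 0.0) > cfg["threshold"]:
            if cfg["action"] == "block_and_alert":
                # High severity: notify the security team, then return fallback.
                print(f"ALERT: high-severity category {name!r} triggered")
                return FALLBACK
            return cfg.get("replacement", FALLBACK)
    return output  # all scores below threshold: pass through unchanged

print(apply_safety("Here is a recipe.", {"violence": 0.12, "self_harm": 0.03}))
```
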

OF-2 PII Filtering

Detect and redact personally identifiable information (PII) in LLM outputs. Protect user privacy and comply with GDPR.

# Moltbot PII output filtering:
pii_filtering:
  enabled: true
  engine: "presidio"

  # PII entities to detect:
  entities:
    - PERSON
    - EMAIL_ADDRESS
    - PHONE_NUMBER
    - IBAN_CODE
    - CREDIT_CARD
    - IP_ADDRESS
    - URL
    - LOCATION
    - DATE_OF_BIRTH
    - NATIONAL_ID
    - PASSPORT

  # Action on detection:
  action: redact  # Options: redact, block, alert

  # Redaction format:
  redaction_format: "<PII_REDACTED>"

  # Context-aware redaction:
  context_aware: true
  # If the user explicitly asks for their own PII (e.g., "show me my email"),
  # allow it with user consent. If PII appears unexpectedly, redact it.

  # Audit logging:
  logging:
    log_pii_detections: true
    log_redaction_count: true
    # Required for GDPR accountability (Art. 30 RoPA)
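A drastically simplified stand-in for the configured Presidio engine, showing only the redact action, the <PII_REDACTED> format, and the audit count. The two regexes are toy recognisers, not Presidio's; production deployments should use presidio-analyzer:

```python
import re

# Toy patterns for two of the configured entities (illustrative only).
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # EMAIL_ADDRESS
    re.compile(r"\+?\d[\d\s/-]{7,}\d"),        # PHONE_NUMBER (very rough)
]

def redact(text: str, marker: str = "<PII_REDACTED>") -> tuple[str, int]:
    """Redact matches and return (clean_text, redaction_count) for audit logging."""
    count = 0
    for pattern in PII_PATTERNS:
        text, n = pattern.subn(marker, text)
        count += n
    return text, count

clean, n = redact("Mail alice@example.com or call +49 30 1234567.")
```

Returning the redaction count alongside the cleaned text makes the `log_redaction_count` audit requirement above straightforward to satisfy.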

OF-3 Policy Violation Detection

Detect policy violations specific to your organisation (e.g., competitor disparagement, or legal, medical, or financial advice given without a disclaimer).

# Moltbot policy violation detection:
policy_filtering:
  enabled: true
  policies:
    competitor_disparagement:
      enabled: true
      keywords: ["competitor is bad", "competitor sucks", "avoid competitor"]
      threshold: 0.60
      action: add_disclaimer
      disclaimer: "This is an AI-generated response. For objective comparisons, consult independent sources."

    legal_advice:
      enabled: true
      keywords: ["you should sue", "legal advice", "file a lawsuit"]
      threshold: 0.70
      action: add_disclaimer
      disclaimer: "This is not legal advice. Consult a qualified attorney for legal matters."

    medical_advice:
      enabled: true
      keywords: ["take this medication", "medical diagnosis", "prescribe"]
      threshold: 0.70
      action: add_disclaimer
      disclaimer: "This is not medical advice. Consult a healthcare professional."

    financial_advice:
      enabled: true
      keywords: ["invest in", "buy stock", "sell stock"]
      threshold: 0.70
      action: add_disclaimer
      disclaimer: "This is not financial advice. Consult a financial advisor."

  # Disclaimer placement:
  disclaimer_placement: append  # append to end of output

  # Policy-specific logging:
  logging:
    log_policy_violations: true
    log_disclaimer_added: true
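The keyword-and-disclaimer mechanics might look like the following sketch. It uses plain substring matching only; the scored threshold from the config is omitted, and the policy table is an illustrative subset:

```python
# Subset of the policies above, keyed by policy name.
POLICIES = {
    "legal_advice": {
        "keywords": ["you should sue", "legal advice", "file a lawsuit"],
        "disclaimer": "This is not legal advice. Consult a qualified attorney for legal matters.",
    },
    "financial_advice": {
        "keywords": ["invest in", "buy stock", "sell stock"],
        "disclaimer": "This is not financial advice. Consult a financial advisor.",
    },
}

def add_disclaimers(output: str) -> str:
    """Append the matching disclaimer(s) to the end of the output."""
    lowered = output.lower()
    for policy in POLICIES.values():
        if any(kw in lowered for kw in policy["keywords"]):
            output += "\n\n" + policy["disclaimer"]  # disclaimer_placement: append
    return output
```
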

OF-4 Output Post-Processing

Post-process LLM outputs for consistency, formatting, and safety. Apply transformations before returning to the user.

# Moltbot output post-processing:
post_processing:
  enabled: true

  # Transformations:
  transformations:
    # 1. Code formatting
    code_formatting:
      enabled: true
      language_detection: true
      syntax_highlighting: true

    # 2. Markdown sanitisation
    markdown_sanitisation:
      enabled: true
      # Remove potentially harmful markdown:
      # - HTML tags (unless whitelisted)
      # - Javascript execution
      # - External images from untrusted sources
      allowed_html_tags: ["b", "i", "u", "strong", "em", "code", "pre"]

    # 3. Link validation
    link_validation:
      enabled: true
      # Validate all links in output:
      # - Block links to malicious domains
      # - Add rel="nofollow" to external links
      block_domains: ["malicious-site.com", "phishing-site.com"]
      add_nofollow: true

    # 4. Length limit
    length_limit:
      enabled: true
      max_characters: 10000
      on_exceed: truncate_with_ellipsis

  # Safety check after post-processing:
  safety_check:
    enabled: true
    # Re-run safety classifier after transformations
    # This catches cases where transformations introduce safety issues

  # Output caching:
  caching:
    enabled: true
    cache_duration_seconds: 300
    # Cache safe outputs to reduce LLM calls
    # Do not cache outputs that required disclaimers or redactions
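Two of the transformations above (HTML tag whitelisting and the length limit) can be sketched like this. Note that this toy sanitiser strips disallowed tags but keeps their inner text; a production sanitiser would also drop the content of script-like elements:

```python
import re

# Whitelist from the markdown_sanitisation config above.
ALLOWED_TAGS = {"b", "i", "u", "strong", "em", "code", "pre"}

def sanitise_html(text: str) -> str:
    """Strip any HTML tag (opening or closing) not on the whitelist."""
    def keep_or_drop(match: re.Match) -> str:
        tag = match.group(1).lower()
        return match.group(0) if tag in ALLOWED_TAGS else ""
    return re.sub(r"</?\s*([a-zA-Z0-9]+)[^>]*>", keep_or_drop, text)

def limit_length(text: str, max_characters: int = 10_000) -> str:
    """Enforce the length limit (on_exceed: truncate_with_ellipsis)."""
    if len(text) > max_characters:
        return text[: max_characters - 1] + "…"
    return text

def post_process(text: str) -> str:
    return limit_length(sanitise_html(text))
```
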

Frequently Asked Questions

What is the difference between input filtering and output filtering for LLMs?

Input filtering happens BEFORE the LLM processes the prompt. It scans user input for malicious patterns, prompt-injection attempts, and policy violations. If input is blocked, the LLM never sees it.

Output filtering happens AFTER the LLM generates a response. It scans the LLM's output for harmful content, PII, and policy violations. If output is blocked, the LLM has already done the work, but the user never sees the harmful response.

Both are necessary: input filtering reduces the chance that the LLM produces harmful content in the first place, while output filtering catches harmful content that slips through input filtering or is generated despite safe input. Output filtering is particularly important for jailbreak attempts that bypass input filters, PII that the LLM retrieves from a RAG corpus, and policy violations the LLM generates inadvertently.

How accurate are LLM safety classifiers?

LLM safety classifier accuracy varies by model and training data. For state-of-the-art classifiers (2025-2026):
- 90-95% accuracy for clear-cut harmful content (hate speech, explicit violence)
- 75-85% accuracy for nuanced content (satire, fictional violence, medical information)
- False positives: 5-15% (safe content incorrectly flagged as harmful)
- False negatives: 5-10% (harmful content incorrectly allowed)

Tradeoffs: a higher threshold means fewer false positives but more false negatives; a lower threshold means the reverse. Recommendation: set thresholds based on your use case. For customer-facing applications, prioritise safety and accept more false positives (lower threshold). For internal tools, balance safety with usability (higher threshold). Always log classifier scores so you can tune thresholds over time.

How do I handle PII in LLM outputs for GDPR compliance?

GDPR requires that you minimise PII disclosure and have a lawful basis for processing PII. For LLM outputs:
1) PII filtering: detect and redact PII in all outputs before returning them to the user.
2) Context-aware filtering: if the user explicitly asks for their own PII (e.g., 'show me my email'), allow it only with user consent and clear disclosure.
3) Logging: log all PII detections and redactions for GDPR accountability (Art. 30 RoPA).
4) Data minimisation: configure your RAG system to avoid retrieving PII in the first place.
5) User rights: implement a mechanism for users to request deletion of their PII from the RAG corpus.
6) Legal basis: ensure you have a legal basis (Art. 6 GDPR) for any PII processing; for most AI assistants, legitimate interest or contract performance applies.

Can output filtering introduce latency?

Yes: output filtering adds latency because the output must pass through the classifiers before being returned to the user. Typical latency impact:
- 50-200 ms for content safety classification
- 100-300 ms for PII detection with multiple entities
- 50-150 ms for policy violation detection
- Total: 200-650 ms of additional latency

Mitigation:
1) Use fast classifier models (quantised or distilled models).
2) For streaming outputs, run classification in parallel with LLM generation (classify chunks as they are generated).
3) Cache classifier results for repeated outputs.
4) Use GPU acceleration for classification.
5) For low-latency use cases, consider a two-tier approach: a fast classifier for obvious violations and a slower classifier for edge cases.
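The streaming mitigation (classifying chunks as they are generated) can be sketched as a sliding-window check. Here `classify` is a toy keyword stand-in for a fast classifier; in production it would run asynchronously alongside generation:

```python
def classify(text: str) -> bool:
    """Return True if the text is safe (toy keyword check)."""
    return "harmful" not in text.lower()

def stream_with_filtering(chunks, window: int = 3):
    """Yield chunks, re-checking a sliding window of recent chunks so that
    harmful content split across chunk boundaries is still caught."""
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        if not classify("".join(buffer[-window:])):
            yield "[output stopped by safety filter]"
            return
        yield chunk

out = list(stream_with_filtering(["Hello ", "wor", "ld."]))
```

The window re-check is the important detail: classifying each chunk in isolation would miss a flagged phrase that straddles two chunks.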


ClawGuru Security Team

Security Research & Engineering · Output Filtering Specialists
📅 Published: 28.04.2026 · 🔄 Last reviewed: 28.04.2026
This guide is based on hands-on experience with output-filtering implementations for LLM systems in production environments. The best practices described have been proven in real deployments and continuously refined.