LLM Output Filtering
LLM outputs can expose PII, violate policies, or contain other harmful content. Four filter types address this: safety classification, PII filtering, policy violation detection, and post-processing.
What is output filtering for LLMs? Simply explained
Output filtering for LLMs works like a content filter for AI responses: it checks every output before it is shown to the user. A safety classifier detects harmful content (hate speech, violence, self-harm). PII filtering removes personal data (email addresses, phone numbers, credit card numbers). Policy violation detection adds disclaimers to legal, financial, or medical advice. Post-processing formats code, sanitises Markdown, and validates links. Without output filtering, an AI model could inadvertently disclose PII, generate harmful content, or create legal exposure.
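In code, these four stages are typically chained in a fixed order. A minimal sketch of that chain, with each stage as a trivial placeholder (real configurations for each stage follow below); none of these function names are Moltbot APIs:

import re

def safety_classifier(text: str) -> str:
    # Stage 1: block harmful content (placeholder keyword check).
    return "I cannot provide that information." if "hate speech" in text.lower() else text

def pii_filter(text: str) -> str:
    # Stage 2: redact personal data (placeholder email redaction).
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<PII_REDACTED>", text)

def policy_filter(text: str) -> str:
    # Stage 3: append disclaimers for legal/medical/financial advice (placeholder).
    if "legal advice" in text.lower():
        text += "\n\nThis is not legal advice. Consult a qualified attorney for legal matters."
    return text

def post_process(text: str) -> str:
    # Stage 4: formatting, link validation, length limits (placeholder truncation).
    return text[:10000]

def filter_llm_output(raw_output: str) -> str:
    # Run every stage in order before the response reaches the user.
    for stage in (safety_classifier, pii_filter, policy_filter, post_process):
        raw_output = stage(raw_output)
    return raw_output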
4 Output Filter Types
Run a safety classifier on every LLM output before returning it to the user. Detect harmful content and block or replace it.
# Moltbot output safety classifier:
output_filtering:
  enabled: true
  classifier: "moltbot-safety-v3"

  # Safety categories:
  categories:
    violence:
      enabled: true
      threshold: 0.70  # Block if confidence > 70%
      action: block_and_replace
      replacement: "I cannot provide information about violence."
    hate_speech:
      enabled: true
      threshold: 0.75
      action: block_and_replace
      replacement: "I cannot generate hate speech."
    self_harm:
      enabled: true
      threshold: 0.80
      action: block_and_alert  # High severity: alert the security team
    sexual_content:
      enabled: true
      threshold: 0.70
      action: block_and_replace
    illegal_activities:
      enabled: true
      threshold: 0.75
      action: block_and_replace
    phishing:
      enabled: true
      threshold: 0.85
      action: block_and_alert

  # Fallback output:
  fallback:
    enabled: true
    message: "I cannot provide that information. For assistance, contact support."

  # Logging:
  logging:
    log_blocked_outputs: true
    log_classifier_scores: true
    alert_on_high_severity: true
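How a config like this could be enforced is sketched below, assuming the per-category thresholds and actions above; the scores dict would come from the safety classifier, and the names and structure are illustrative assumptions, not Moltbot's actual API.

import logging
from dataclasses import dataclass

logger = logging.getLogger("output_filtering")
FALLBACK = "I cannot provide that information. For assistance, contact support."

@dataclass
class CategoryRule:
    threshold: float          # block when classifier confidence exceeds this value
    action: str               # "block_and_replace" or "block_and_alert"
    replacement: str = FALLBACK

RULES = {
    "violence":    CategoryRule(0.70, "block_and_replace", "I cannot provide information about violence."),
    "hate_speech": CategoryRule(0.75, "block_and_replace", "I cannot generate hate speech."),
    "self_harm":   CategoryRule(0.80, "block_and_alert"),
    "phishing":    CategoryRule(0.85, "block_and_alert"),
}

def apply_safety_rules(text: str, scores: dict[str, float]) -> str:
    """Return the text unchanged, or a replacement if any category exceeds its threshold."""
    for category, confidence in scores.items():
        rule = RULES.get(category)
        if rule and confidence > rule.threshold:
            # log_blocked_outputs / log_classifier_scores from the config above
            logger.warning("blocked output: category=%s confidence=%.2f action=%s",
                           category, confidence, rule.action)
            return rule.replacement
    return text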
Detect and redact personally identifiable information (PII) in LLM outputs. Protect user privacy and comply with GDPR.

# Moltbot PII output filtering:
pii_filtering:
  enabled: true
  engine: "presidio"

  # PII entities to detect:
  entities:
    - PERSON
    - EMAIL_ADDRESS
    - PHONE_NUMBER
    - IBAN_CODE
    - CREDIT_CARD
    - IP_ADDRESS
    - URL
    - LOCATION
    - DATE_OF_BIRTH
    - NATIONAL_ID
    - PASSPORT

  # Action on detection:
  action: redact  # Options: redact, block, alert

  # Redaction format:
  redaction_format: "<PII_REDACTED>"

  # Context-aware redaction:
  context_aware: true
  # If the user explicitly asks for their own PII (e.g., "show me my email"),
  # allow it with user consent. If PII appears unexpectedly, redact it.

  # Audit logging:
  logging:
    log_pii_detections: true
    log_redaction_count: true
    # Required for GDPR accountability (Art. 30 RoPA)
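The config names presidio as the detection engine. A minimal sketch of detection and redaction with Microsoft Presidio, using the redaction token from the config and a subset of the entity list (only entities Presidio recognises out of the box); the redact_pii wrapper function is an illustrative assumption.

# Minimal Presidio-based redaction sketch matching the config above.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str) -> str:
    results = analyzer.analyze(
        text=text,
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "IBAN_CODE",
                  "CREDIT_CARD", "IP_ADDRESS", "URL", "LOCATION"],
        language="en",
    )
    redacted = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<PII_REDACTED>"})},
    )
    return redacted.text

print(redact_pii("Contact Jane Doe at jane.doe@example.com or +1 555 0100."))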
Detect policy violations specific to your organisation (e.g., competitor disparagement, legal advice without disclaimer, financial advice without disclaimer).

# Moltbot policy violation detection:
policy_filtering:
  enabled: true
  policies:
    competitor_disparagement:
      enabled: true
      keywords: ["competitor is bad", "competitor sucks", "avoid competitor"]
      threshold: 0.60
      action: add_disclaimer
      disclaimer: "This is an AI-generated response. For objective comparisons, consult independent sources."
    legal_advice:
      enabled: true
      keywords: ["you should sue", "legal advice", "file a lawsuit"]
      threshold: 0.70
      action: add_disclaimer
      disclaimer: "This is not legal advice. Consult a qualified attorney for legal matters."
    medical_advice:
      enabled: true
      keywords: ["take this medication", "medical diagnosis", "prescribe"]
      threshold: 0.70
      action: add_disclaimer
      disclaimer: "This is not medical advice. Consult a healthcare professional."
    financial_advice:
      enabled: true
      keywords: ["invest in", "buy stock", "sell stock"]
      threshold: 0.70
      action: add_disclaimer
      disclaimer: "This is not financial advice. Consult a financial advisor."

  # Disclaimer placement:
  disclaimer_placement: append  # append to end of output

  # Policy-specific logging:
  logging:
    log_policy_violations: true
    log_disclaimer_added: true
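A minimal sketch of the keyword path for these policies (real detection would also score the output against each policy's threshold with a classifier; that part is omitted here). The policy names, keywords, and disclaimer texts mirror the config, but the check_policies function itself is an illustrative assumption, not Moltbot's API.

POLICIES = {
    "legal_advice": {
        "keywords": ["you should sue", "legal advice", "file a lawsuit"],
        "disclaimer": "This is not legal advice. Consult a qualified attorney for legal matters.",
    },
    "medical_advice": {
        "keywords": ["take this medication", "medical diagnosis", "prescribe"],
        "disclaimer": "This is not medical advice. Consult a healthcare professional.",
    },
    "financial_advice": {
        "keywords": ["invest in", "buy stock", "sell stock"],
        "disclaimer": "This is not financial advice. Consult a financial advisor.",
    },
}

def check_policies(output: str) -> str:
    """Append a disclaimer for every policy whose keywords appear in the output."""
    lowered = output.lower()
    disclaimers = [
        policy["disclaimer"]
        for policy in POLICIES.values()
        if any(keyword in lowered for keyword in policy["keywords"])
    ]
    if disclaimers:
        # disclaimer_placement: append (end of output), as configured above
        output = output + "\n\n" + "\n".join(disclaimers)
    return output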
Post-process LLM outputs for consistency, formatting, and safety. Apply transformations before returning to the user.

# Moltbot output post-processing:
post_processing:
  enabled: true

  # Transformations:
  transformations:
    # 1. Code formatting
    code_formatting:
      enabled: true
      language_detection: true
      syntax_highlighting: true

    # 2. Markdown sanitisation
    markdown_sanitisation:
      enabled: true
      # Remove potentially harmful markdown:
      # - HTML tags (unless whitelisted)
      # - Javascript execution
      # - External images from untrusted sources
      allowed_html_tags: ["b", "i", "u", "strong", "em", "code", "pre"]

    # 3. Link validation
    link_validation:
      enabled: true
      # Validate all links in output:
      # - Block links to malicious domains
      # - Add rel="nofollow" to external links
      block_domains: ["malicious-site.com", "phishing-site.com"]
      add_nofollow: true

    # 4. Length limit
    length_limit:
      enabled: true
      max_characters: 10000
      on_exceed: truncate_with_ellipsis

  # Safety check after post-processing:
  safety_check:
    enabled: true
    # Re-run safety classifier after transformations.
    # This catches cases where transformations introduce safety issues.

  # Output caching:
  caching:
    enabled: true
    cache_duration_seconds: 300
    # Cache safe outputs to reduce LLM calls.
    # Do not cache outputs that required disclaimers or redactions.
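A minimal sketch of two of the transformations above, validating links against the blocked-domain list and enforcing the length limit, using only the Python standard library; the function names and the simple URL regex are illustrative assumptions, not Moltbot internals.

import re
from urllib.parse import urlparse

BLOCK_DOMAINS = {"malicious-site.com", "phishing-site.com"}
MAX_CHARACTERS = 10000
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

def strip_blocked_links(text: str) -> str:
    """Replace links whose host is on the block list with a placeholder."""
    def replace(match: re.Match) -> str:
        host = urlparse(match.group(0)).hostname or ""
        return "[link removed]" if host in BLOCK_DOMAINS else match.group(0)
    return URL_PATTERN.sub(replace, text)

def enforce_length_limit(text: str) -> str:
    """Truncate overly long outputs (on_exceed: truncate_with_ellipsis)."""
    return text if len(text) <= MAX_CHARACTERS else text[:MAX_CHARACTERS] + "…"

def post_process(text: str) -> str:
    return enforce_length_limit(strip_blocked_links(text))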
Frequently Asked Questions

What is the difference between input filtering and output filtering for LLMs?
Input filtering happens BEFORE the LLM processes the prompt. It scans user input for malicious patterns, prompt injection attempts, and policy violations. If input is blocked, the LLM never sees it. Output filtering happens AFTER the LLM generates a response. It scans the LLM's output for harmful content, PII, and policy violations. If output is blocked, the LLM has already done the work, but the user never sees the harmful response. Both are necessary: input filtering reduces the chance that the LLM produces harmful content in the first place, and output filtering catches harmful content that slips through input filtering or is generated despite safe input. Output filtering is particularly important for jailbreak attempts that bypass input filters, PII that the LLM retrieves from the RAG corpus, and policy violations the LLM generates inadvertently.
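A minimal sketch of how the two layers combine around a model call; filter_input(), call_llm(), and filter_output() are trivial placeholders for whatever input filter, model backend, and output filter you actually use.

import re

BLOCKED_INPUT_MESSAGE = "Your request could not be processed."

def filter_input(prompt: str) -> tuple[str, bool]:
    """Placeholder input filter: block an obvious injection phrase."""
    blocked = "ignore previous instructions" in prompt.lower()
    return prompt, not blocked

def call_llm(prompt: str) -> str:
    """Placeholder for the actual model call."""
    return f"(model response to: {prompt})"

def filter_output(output: str) -> str:
    """Placeholder output filter: redact an obvious email pattern."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<PII_REDACTED>", output)

def answer(prompt: str) -> str:
    safe_prompt, input_ok = filter_input(prompt)   # runs BEFORE the model sees the prompt
    if not input_ok:
        return BLOCKED_INPUT_MESSAGE               # the LLM never sees blocked input
    raw_output = call_llm(safe_prompt)
    return filter_output(raw_output)               # runs AFTER the model has responded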
How accurate are LLM safety classifiers?
LLM safety classifier accuracy varies by model and training data. State-of-the-art classifiers (2025-2026): 90-95% accuracy for clear-cut harmful content (hate speech, explicit violence); 75-85% accuracy for nuanced content (satire, fictional violence, medical information). False positives (safe content incorrectly flagged as harmful): 5-15%. False negatives (harmful content incorrectly allowed): 5-10%. Tradeoffs: a higher threshold means fewer false positives but more false negatives; a lower threshold means more false positives but fewer false negatives. Recommendation: set thresholds based on your use case. For customer-facing applications, prioritise safety over false positives (lower threshold, so borderline content gets blocked). For internal tools, balance safety with usability (higher threshold). Always log classifier scores so you can tune thresholds over time.
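A minimal sketch of tuning a blocking threshold from logged scores, assuming you have a human-reviewed sample of (score, actually harmful) pairs from production logs; the numbers below are illustrative placeholders, not real data.

logged = [
    (0.95, True), (0.82, True), (0.71, False), (0.64, False),
    (0.55, True), (0.40, False), (0.30, False), (0.12, False),
]

for threshold in (0.60, 0.70, 0.80, 0.90):
    # "block when confidence exceeds the threshold", as in the config above
    false_positives = sum(1 for score, harmful in logged if score > threshold and not harmful)
    false_negatives = sum(1 for score, harmful in logged if score <= threshold and harmful)
    print(f"threshold={threshold:.2f}  false positives={false_positives}  false negatives={false_negatives}")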
How do I handle PII in LLM outputs for GDPR compliance?
GDPR requires that you minimise PII disclosure and have a lawful basis for processing PII. For LLM outputs: 1) PII filtering — detect and redact PII in all outputs before returning to the user. 2) Context-aware filtering — if the user explicitly asks for their own PII (e.g., 'show me my email'), allow it only with user consent and clear disclosure. 3) Logging — log all PII detections and redactions for GDPR accountability (Art. 30 RoPA). 4) Data minimisation — configure your RAG system to avoid retrieving PII in the first place. 5) User rights — implement a mechanism for users to request deletion of their PII from the RAG corpus. 6) Legal basis — ensure you have a legal basis (Art. 6 GDPR) for any PII processing. For most AI assistants, legitimate interest or contract performance applies.
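A minimal sketch of the context-aware exception in point 2, gating redaction on explicit intent plus recorded consent; the request fields and the email-only redaction are illustrative assumptions.

import re
from dataclasses import dataclass

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

@dataclass
class Request:
    user_asked_for_own_pii: bool   # e.g. detected intent "show me my email"
    user_consented: bool           # explicit consent recorded for this disclosure

def apply_pii_policy(output: str, request: Request) -> str:
    if request.user_asked_for_own_pii and request.user_consented:
        return output                              # allowed with consent; still log the disclosure
    return EMAIL.sub("<PII_REDACTED>", output)     # default: redact unexpected PII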
Can output filtering introduce latency?
Yes — output filtering adds latency because the output must be processed through the classifier before being returned to the user. Typical latency impact: 50-200ms for content safety classification. 100-300ms for PII detection with multiple entities. 50-150ms for policy violation detection. Total: 200-650ms additional latency. Mitigation: 1) Use fast classifier models (quantised models, distilled models). 2) Run classification in parallel with LLM generation for streaming outputs (classify chunks as they are generated). 3) Cache classifier results for repeated outputs. 4) Use GPU acceleration for classification. 5) For low-latency use cases, consider a two-tier approach: fast classifier for obvious violations, slow classifier for edge cases.
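A minimal sketch of the two-tier idea in point 5: a cheap heuristic decides clear cases immediately, and only borderline outputs pay for the slower classifier. Both scoring functions are placeholders; a distilled or quantised model would replace fast_score, and the full classifier would replace slow_classifier.

def fast_score(text: str) -> float:
    """Cheap heuristic score in [0, 1]; a distilled model would go here."""
    hits = sum(term in text.lower() for term in ("bomb", "kill", "suicide", "hate"))
    return min(1.0, hits / 2)

def classify_output(text: str) -> str:
    score = fast_score(text)
    if score >= 0.9:
        return "block"                    # obviously harmful: decided in tier 1
    if score <= 0.1:
        return "allow"                    # obviously safe: decided in tier 1
    return slow_classifier(text)          # edge cases only: pay the extra latency

def slow_classifier(text: str) -> str:
    """Placeholder for the full safety classifier (adds 50-200 ms per call)."""
    return "allow"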