
Output Filtering

Defense mechanism that inspects and sanitizes AI model outputs before delivery, preventing data leakage, harmful content, and policy violations.

Last updated: January 24, 2025

Definition

Output filtering is a defense mechanism that inspects AI model responses before delivering them to users or downstream systems. It serves as the last line of defense when input validation and model-level guardrails fail to prevent harmful or sensitive content generation.

Unlike input filtering (which blocks bad requests), output filtering catches bad responses—whether from successful attacks, model errors, or edge cases in the model's training.


What Output Filtering Catches

  • Data leakage — PII, API keys, credentials, internal system information
  • System prompt disclosure — Confidential instructions extracted by attackers
  • Policy violations — Content that violates usage policies (harmful, illegal, etc.)
  • Hallucinated sensitive data — Fabricated but realistic-looking credentials or PII
  • Injection payloads — SQL, XSS, or command injection in generated code
  • Formatting attacks — Markdown/HTML injection that could affect downstream rendering
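The last category can be checked mechanically. Below is a minimal sketch (function and pattern names are my own, not from any library) that neutralizes raw HTML and strips markdown images; many renderers fetch image URLs automatically, which attackers abuse to exfiltrate data through query strings:

```python
import html
import re

# Markdown image syntax: ![alt](url). Rendered images are fetched
# automatically, so an attacker can smuggle data out via the URL.
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\(https?://[^)]+\)")

def neutralize_rendering(text: str) -> str:
    # Escape raw HTML so it displays as text instead of executing
    text = html.escape(text)
    # Drop markdown images pointing at external hosts
    return MARKDOWN_IMAGE.sub("[image removed]", text)
```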

Implementation Approaches

PII Detection and Redaction

import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def filter_pii(text: str) -> str:
    results = analyzer.analyze(text=text, language='en')
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results
    )
    return anonymized.text

# Example: filter_pii("Contact alice@example.com or call 212-555-0101")
# Presidio's default anonymizer replaces each entity with its type:
# "Contact <EMAIL_ADDRESS> or call <PHONE_NUMBER>"

Content Classification

Using classifiers to detect policy-violating content:

from openai import OpenAI

client = OpenAI()

def check_content_policy(text: str) -> dict:
    response = client.moderations.create(input=text)
    results = response.results[0]

    return {
        "flagged": results.flagged,
        "categories": {
            k: v
            for k, v in results.categories.model_dump().items()
            if v
        }
    }

Pattern-Based Detection

Regex patterns for sensitive data types:

SENSITIVE_PATTERNS = {
    "api_key": r"(sk-|pk_|api[_-]?key)[a-zA-Z0-9]{20,}",
    "aws_key": r"AKIA[0-9A-Z]{16}",
    "jwt": r"eyJ[a-zA-Z0-9_-]+\.eyJ[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]*",
    "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def detect_secrets(text: str) -> list:
    findings = []
    for name, pattern in SENSITIVE_PATTERNS.items():
        if re.search(pattern, text):
            findings.append(name)
    return findings

System Prompt Leakage Detection

Detecting when the model has disclosed system instructions:

SYSTEM_PROMPT_INDICATORS = [
    "my instructions",
    "i was told to",
    "my system prompt",
    "i am configured to",
    "my guidelines state",
]

from difflib import SequenceMatcher

def check_prompt_leakage(output: str, system_prompt: str) -> bool:
    # Check for indicator phrases
    lower_output = output.lower()
    for indicator in SYSTEM_PROMPT_INDICATORS:
        if indicator in lower_output:
            return True

    # Check for substantial overlap with the system prompt itself
    if len(system_prompt) > 50:
        ratio = SequenceMatcher(None, output, system_prompt).ratio()
        if ratio > 0.3:
            return True

    return False
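A complementary technique is a canary token: embed a unique random marker in the system prompt, then block any output that echoes it verbatim. This sidesteps fuzzy matching entirely. A minimal sketch, with names of my own choosing:

```python
import secrets

def make_canary() -> str:
    # A unique marker to embed somewhere in the system prompt
    return f"CANARY-{secrets.token_hex(8)}"

def leaked_canary(output: str, canary: str) -> bool:
    # A verbatim echo of the canary means the prompt was disclosed
    return canary in output
```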

Output Filtering Pipeline

A complete output filtering pipeline chains multiple checks:

class OutputFilter:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt

    def filter(self, output: str) -> tuple[str, list]:
        issues = []
        filtered = output

        # 1. Check for policy violations
        moderation = check_content_policy(output)
        if moderation["flagged"]:
            issues.append(("policy_violation", moderation["categories"]))
            return "[Content blocked by policy]", issues

        # 2. Check for system prompt leakage
        if check_prompt_leakage(output, self.system_prompt):
            issues.append(("prompt_leakage", None))
            return "[Response blocked: potential disclosure]", issues

        # 3. Redact PII
        filtered = filter_pii(filtered)
        if filtered != output:
            issues.append(("pii_redacted", None))

        # 4. Check for secrets
        secrets = detect_secrets(filtered)
        if secrets:
            issues.append(("secrets_detected", secrets))
            # Redact detected patterns
            for secret_type in secrets:
                pattern = SENSITIVE_PATTERNS[secret_type]
                filtered = re.sub(pattern, "[REDACTED]", filtered)

        return filtered, issues
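One design point the pipeline above leaves implicit: filters should fail closed. If a check itself raises (a moderation API timeout, a bad regex), the safe behavior is to block the response rather than pass it through unfiltered. A minimal sketch, with names of my own choosing:

```python
def fail_closed(filter_fn, output: str) -> tuple[str, list]:
    # Wrap any filter; an exception in the filter blocks the response
    # instead of letting unfiltered output through.
    try:
        return filter_fn(output)
    except Exception as exc:
        return "[Response blocked: filter error]", [("filter_error", str(exc))]
```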

Downstream Output Handling

Output filtering also prevents injection attacks when LLM outputs are passed to other systems:

  • SQL injection — Sanitize before database queries
  • Command injection — Never pass raw output to shells
  • XSS — Escape HTML before rendering in browsers
  • Path traversal — Validate file paths in generated output

import os
import shlex
import subprocess

# DANGEROUS: raw model output interpolated into a shell command
os.system(f"echo {llm_output}")

# SAFER: quote the output before it reaches the shell
safe_output = shlex.quote(llm_output)
os.system(f"echo {safe_output}")

# SAFEST: avoid the shell entirely and pass output as an argument
subprocess.run(["echo", llm_output])
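The SQL bullet deserves its own illustration: model output that ends up in a query must be bound as a parameter, never interpolated into the query string. A sketch using the standard library's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

llm_output = "'); DROP TABLE notes; --"  # hostile model output

# DANGEROUS: interpolation hands attacker text to the SQL parser
# conn.execute(f"INSERT INTO notes VALUES ('{llm_output}')")

# SAFER: a bound parameter keeps the payload inert data
conn.execute("INSERT INTO notes VALUES (?)", (llm_output,))
```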

Limitations

  • False positives — Overly aggressive filtering blocks legitimate content
  • Encoding evasion — Base64, Unicode, and obfuscation can bypass pattern matching
  • Semantic attacks — Paraphrased sensitive information may not match patterns
  • Performance overhead — Multiple classification passes add latency
  • Context blindness — Filters can't always distinguish legitimate from malicious context
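The encoding-evasion limitation has a partial mitigation: decode plausible base64 runs and re-scan the decoded text with the same pattern filters. A sketch (the 16-character threshold is an arbitrary choice, not a standard):

```python
import base64
import re

# Runs of base64 alphabet characters long enough to hide a secret
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def decoded_views(text: str) -> list[str]:
    # Collect decodings of likely base64 runs for re-scanning
    views = []
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(), validate=True)
            views.append(raw.decode("utf-8"))
        except Exception:
            continue  # not valid base64, or not valid UTF-8 text
    return views
```

Each returned view can then be passed back through `detect_secrets` and the other filters.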

Best Practices

  • Defense in depth — Output filtering complements, not replaces, input validation
  • Log blocked content — Capture filtered outputs for security analysis
  • Graceful degradation — Provide useful error messages when content is blocked
  • Regular testing — Red team your filters with bypass attempts
  • Tune thresholds — Balance security against usability based on risk profile
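For the logging practice, one hedged sketch: record which checks fired plus a hash of the blocked text, so analysts can correlate incidents without writing sensitive content to plaintext logs (the schema here is an assumption of mine, not a standard):

```python
import hashlib
import json
import logging

logger = logging.getLogger("output_filter")

def log_blocked(output: str, issues: list) -> dict:
    record = {
        "event": "output_blocked",
        "issues": [name for name, _ in issues],
        # Hash instead of raw text: correlatable, but not a second leak
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    logger.warning(json.dumps(record))
    return record
```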

Real-World Examples

Microsoft Presidio — Open-source PII detection and anonymization library widely used in AI applications.

AWS Comprehend — Managed PII detection service that can filter AI outputs before storage or transmission.

LlamaGuard — Meta's content safety classifier specifically designed for LLM output moderation.


References

  • OWASP (2023). "LLM02: Insecure Output Handling." OWASP Top 10 for LLM Applications.
  • Microsoft (2023). "Presidio: Data Protection and Anonymization SDK."
  • Meta AI (2023). "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations."

Framework Mappings

Framework          Reference
OWASP LLM Top 10   LLM02: Insecure Output Handling
NIST AI RMF        MEASURE 2.7, MANAGE 2.2
MITRE ATLAS        AML.M0004: Restrict Number of ML Model Queries

Citation

Aizen, K. (2025). "Output Filtering." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/defenses/output-filtering/