
Output Filtering

Defense mechanism that inspects and sanitizes AI model outputs before delivery, preventing data leakage, harmful content, and policy violations.

Last updated: January 24, 2025

Definition

Output filtering is a defense mechanism that inspects AI model responses before delivering them to users or downstream systems. It serves as the last line of defense when input validation and model-level guardrails fail to prevent harmful or sensitive content generation.

Unlike input filtering (which blocks bad requests), output filtering catches bad responses—whether from successful attacks, model errors, or edge cases in the model's training.


What Output Filtering Catches

  • Data leakage — PII, API keys, credentials, internal system information
  • System prompt disclosure — Confidential instructions extracted by attackers
  • Policy violations — Content that violates usage policies (harmful, illegal, etc.)
  • Hallucinated sensitive data — Fabricated but realistic-looking credentials or PII
  • Injection payloads — SQL, XSS, or command injection in generated code
  • Formatting attacks — Markdown/HTML injection that could affect downstream rendering
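The last category can be checked mechanically. Below is a minimal sketch (function and pattern names are my own, not from any library) that neutralizes raw HTML and strips markdown images; many renderers fetch image URLs automatically, which attackers abuse to exfiltrate data through query strings:

```python
import html
import re

# Markdown image syntax: ![alt](url). Rendered images are fetched
# automatically, so an attacker can smuggle data out via the URL.
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\(https?://[^)]+\)")

def neutralize_rendering(text: str) -> str:
    # Escape raw HTML so it displays as text instead of executing
    text = html.escape(text)
    # Drop markdown images pointing at external hosts
    return MARKDOWN_IMAGE.sub("[image removed]", text)
```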

Implementation Approaches

PII Detection and Redaction

import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def filter_pii(text: str) -> str:
    results = analyzer.analyze(text=text, language='en')
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results
    )
    return anonymized.text

# Example: filter_pii("Contact alice@example.com or call 212-555-0101")
# Presidio's default anonymizer replaces each entity with its type:
# "Contact <EMAIL_ADDRESS> or call <PHONE_NUMBER>"

Content Classification

Using classifiers to detect policy-violating content:

from openai import OpenAI

client = OpenAI()

def check_content_policy(text: str) -> dict:
    response = client.moderations.create(input=text)
    results = response.results[0]

    return {
        "flagged": results.flagged,
        "categories": {
            k: v
            for k, v in results.categories.model_dump().items()
            if v
        }
    }

Pattern-Based Detection

Regex patterns for sensitive data types:

SENSITIVE_PATTERNS = {
    "api_key": r"(sk-|pk_|api[_-]?key)[a-zA-Z0-9]{20,}",
    "aws_key": r"AKIA[0-9A-Z]{16}",
    "jwt": r"eyJ[a-zA-Z0-9_-]+\.eyJ[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]*",
    "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def detect_secrets(text: str) -> list:
    findings = []
    for name, pattern in SENSITIVE_PATTERNS.items():
        if re.search(pattern, text):
            findings.append(name)
    return findings

System Prompt Leakage Detection

Detecting when the model has disclosed system instructions:

SYSTEM_PROMPT_INDICATORS = [
    "my instructions",
    "i was told to",
    "my system prompt",
    "i am configured to",
    "my guidelines state",
]

from difflib import SequenceMatcher

def check_prompt_leakage(output: str, system_prompt: str) -> bool:
    # Check for indicator phrases
    lower_output = output.lower()
    for indicator in SYSTEM_PROMPT_INDICATORS:
        if indicator in lower_output:
            return True

    # Check for substantial overlap with the system prompt itself
    if len(system_prompt) > 50:
        ratio = SequenceMatcher(None, output, system_prompt).ratio()
        if ratio > 0.3:
            return True

    return False
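A complementary technique is a canary token: embed a unique random marker in the system prompt, then block any output that echoes it verbatim. This sidesteps fuzzy matching entirely. A minimal sketch, with names of my own choosing:

```python
import secrets

def make_canary() -> str:
    # A unique marker to embed somewhere in the system prompt
    return f"CANARY-{secrets.token_hex(8)}"

def leaked_canary(output: str, canary: str) -> bool:
    # A verbatim echo of the canary means the prompt was disclosed
    return canary in output
```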

Output Filtering Pipeline

A complete output filtering pipeline chains multiple checks:

class OutputFilter:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt

    def filter(self, output: str) -> tuple[str, list]:
        issues = []
        filtered = output

        # 1. Check for policy violations
        moderation = check_content_policy(output)
        if moderation["flagged"]:
            issues.append(("policy_violation", moderation["categories"]))
            return "[Content blocked by policy]", issues

        # 2. Check for system prompt leakage
        if check_prompt_leakage(output, self.system_prompt):
            issues.append(("prompt_leakage", None))
            return "[Response blocked: potential disclosure]", issues

        # 3. Redact PII
        filtered = filter_pii(filtered)
        if filtered != output:
            issues.append(("pii_redacted", None))

        # 4. Check for secrets
        secrets = detect_secrets(filtered)
        if secrets:
            issues.append(("secrets_detected", secrets))
            # Redact detected patterns
            for secret_type in secrets:
                pattern = SENSITIVE_PATTERNS[secret_type]
                filtered = re.sub(pattern, "[REDACTED]", filtered)

        return filtered, issues
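One design point the pipeline above leaves implicit: filters should fail closed. If a check itself raises (a moderation API timeout, a bad regex), the safe behavior is to block the response rather than pass it through unfiltered. A minimal sketch, with names of my own choosing:

```python
def fail_closed(filter_fn, output: str) -> tuple[str, list]:
    # Wrap any filter; an exception in the filter blocks the response
    # instead of letting unfiltered output through.
    try:
        return filter_fn(output)
    except Exception as exc:
        return "[Response blocked: filter error]", [("filter_error", str(exc))]
```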

Downstream Output Handling

Output filtering also prevents injection attacks when LLM outputs are passed to other systems:

  • SQL injection — Sanitize before database queries
  • Command injection — Never pass raw output to shells
  • XSS — Escape HTML before rendering in browsers
  • Path traversal — Validate file paths in generated output

import os
import shlex
import subprocess

# DANGEROUS: raw model output interpolated into a shell command
os.system(f"echo {llm_output}")

# SAFER: quote the output before it reaches the shell
safe_output = shlex.quote(llm_output)
os.system(f"echo {safe_output}")

# SAFEST: avoid the shell entirely and pass output as an argument
subprocess.run(["echo", llm_output])
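The SQL bullet deserves its own illustration: model output that ends up in a query must be bound as a parameter, never interpolated into the query string. A sketch using the standard library's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

llm_output = "'); DROP TABLE notes; --"  # hostile model output

# DANGEROUS: interpolation hands attacker text to the SQL parser
# conn.execute(f"INSERT INTO notes VALUES ('{llm_output}')")

# SAFER: a bound parameter keeps the payload inert data
conn.execute("INSERT INTO notes VALUES (?)", (llm_output,))
```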

Limitations

  • False positives — Overly aggressive filtering blocks legitimate content
  • Encoding evasion — Base64, Unicode, and obfuscation can bypass pattern matching
  • Semantic attacks — Paraphrased sensitive information may not match patterns
  • Performance overhead — Multiple classification passes add latency
  • Context blindness — Filters can't always distinguish legitimate from malicious context
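The encoding-evasion limitation has a partial mitigation: decode plausible base64 runs and re-scan the decoded text with the same pattern filters. A sketch (the 16-character threshold is an arbitrary choice, not a standard):

```python
import base64
import re

# Runs of base64 alphabet characters long enough to hide a secret
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def decoded_views(text: str) -> list[str]:
    # Collect decodings of likely base64 runs for re-scanning
    views = []
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(), validate=True)
            views.append(raw.decode("utf-8"))
        except Exception:
            continue  # not valid base64, or not valid UTF-8 text
    return views
```

Each returned view can then be passed back through `detect_secrets` and the other filters.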

Best Practices

  • Defense in depth — Output filtering complements, not replaces, input validation
  • Log blocked content — Capture filtered outputs for security analysis
  • Graceful degradation — Provide useful error messages when content is blocked
  • Regular testing — Red team your filters with bypass attempts
  • Tune thresholds — Balance security against usability based on risk profile
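For the logging practice, one hedged sketch: record which checks fired plus a hash of the blocked text, so analysts can correlate incidents without writing sensitive content to plaintext logs (the schema here is an assumption of mine, not a standard):

```python
import hashlib
import json
import logging

logger = logging.getLogger("output_filter")

def log_blocked(output: str, issues: list) -> dict:
    record = {
        "event": "output_blocked",
        "issues": [name for name, _ in issues],
        # Hash instead of raw text: correlatable, but not a second leak
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    logger.warning(json.dumps(record))
    return record
```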

Real-World Examples

Microsoft Presidio — Open-source PII detection and anonymization library widely used in AI applications.

AWS Comprehend — Managed PII detection service that can filter AI outputs before storage or transmission.

LlamaGuard — Meta's content safety classifier specifically designed for LLM output moderation.


References

  • OWASP (2023). "LLM02: Insecure Output Handling." OWASP Top 10 for LLM Applications.
  • Microsoft (2023). "Presidio: Data Protection and Anonymization SDK."
  • Meta AI (2023). "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations."

Framework Mappings

Framework          Reference
OWASP LLM Top 10   LLM02: Insecure Output Handling
NIST AI RMF        MEASURE 2.7, MANAGE 2.2
MITRE ATLAS        AML.M0004: Restrict Number of ML Model Queries

Citation

Aizen, K. (2025). "Output Filtering." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/defenses/output-filtering/