Input Validation
First line of defense against prompt injection and malicious inputs, using pattern matching, classification, and structural analysis to filter threats.
Definition
Input validation for AI systems refers to the systematic inspection and filtering of user inputs before they reach the language model. Unlike traditional input validation (which focuses on data types and format), AI input validation must address semantic attacks where syntactically valid text contains malicious instructions.
This defense operates at the perimeter—the first opportunity to block attacks before they interact with the model. While no input validation can catch all prompt injection attempts, it significantly raises the cost of automated and unsophisticated attacks.
Why Traditional Validation Fails
Traditional input validation techniques don't translate directly to AI security:
- No type system — LLMs process natural language; there's no schema to validate against
- Semantic attacks — Malicious content is syntactically indistinguishable from legitimate queries
- Context dependence — The same text can be benign or malicious depending on application context
- Encoding tricks — Attackers use Unicode, base64, and obfuscation to evade pattern matching
Implementation Approaches
Pattern-Based Detection
Regex and keyword filters for known attack patterns:
```python
import re

# Example detection patterns
INJECTION_PATTERNS = [
    r"ignore (previous|prior|above) instructions",
    r"you are now (DAN|in developer mode)",
    r"system prompt:",
    r"<system>|</system>",
    r"IMPORTANT: .* override",
]

def check_patterns(text: str) -> bool:
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return True
    return False
```

Limitation: Trivially bypassed by rephrasing. Catches only the most naive attacks.
Classifier-Based Detection
ML models trained to identify injection attempts:
```python
# Using a fine-tuned classifier (model name is illustrative)
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="injection-detector-v1")

def classify_input(text: str) -> dict:
    result = classifier(text)[0]
    return {
        "is_injection": result["label"] == "INJECTION",
        "confidence": result["score"],
    }
```

Limitation: Subject to adversarial examples. Requires ongoing training data updates.
Structural Analysis
Detecting instruction-like structures in user input:
- Imperative verb detection ("ignore", "override", "pretend")
- Role assignment patterns ("you are now", "act as")
- Delimiter injection (`<system>`, `[INST]`, `###`)
- Unusual Unicode or encoding patterns
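The signals above can be combined into a simple scoring heuristic. A minimal sketch follows; the verb list, role patterns, and Unicode check are illustrative examples, not an exhaustive or production-ready rule set:

```python
import re
import unicodedata

# Illustrative signal lists; real deployments need broader coverage.
IMPERATIVE_VERBS = {"ignore", "override", "pretend", "disregard", "forget"}
ROLE_PATTERNS = [r"\byou are now\b", r"\bact as\b"]
DELIMITERS = ["<system>", "[INST]", "###"]

def structural_score(text: str) -> int:
    """Count instruction-like signals in user input."""
    score = 0
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    score += len(IMPERATIVE_VERBS & words)                       # imperative verbs
    score += sum(1 for p in ROLE_PATTERNS if re.search(p, lowered))  # role assignment
    score += sum(1 for d in DELIMITERS if d.lower() in lowered)      # delimiter injection
    # Format (Cf) and private-use (Co) characters often indicate obfuscation
    score += sum(1 for ch in text if unicodedata.category(ch) in ("Cf", "Co"))
    return score
```

An application would compare the score against a threshold tuned on its own traffic, trading false positives against missed attacks.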
Length and Complexity Limits
Simple but effective constraints:
- Maximum input length — Limits attack surface
- Character set restrictions — Block unusual Unicode ranges
- Nesting depth limits — Prevent deeply nested instruction patterns
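These constraints can be enforced in a few lines. A minimal sketch, assuming an English-only application (the length budget and ASCII-only allowlist are illustrative and would reject legitimate non-English input):

```python
MAX_LENGTH = 4000                               # assumed application-specific budget
ALLOWED_RANGES = [(0x20, 0x7E), (0x0A, 0x0A)]   # printable ASCII plus newline

def enforce_limits(text: str) -> tuple[bool, str]:
    """Return (accepted, reason); reject on any violation (fail closed)."""
    if len(text) > MAX_LENGTH:
        return False, "input exceeds maximum length"
    for ch in text:
        if not any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES):
            return False, f"disallowed character U+{ord(ch):04X}"
    return True, "ok"
```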
Defense Effectiveness
| Technique | Blocks | Bypassed By |
|---|---|---|
| Pattern matching | Known attack strings | Rephrasing, encoding |
| ML classifiers | Statistically similar attacks | Novel phrasings, adversarial examples |
| Length limits | Complex multi-stage attacks | Concise payloads |
| Unicode filtering | Encoding tricks | ASCII-only attacks |
Implementation Best Practices
- Layer defenses — Combine multiple techniques; don't rely on any single approach
- Log everything — Capture blocked inputs for analysis and classifier training
- Fail closed — When uncertain, reject the input rather than allow it through
- Normalize first — Decode Unicode and expand encodings (e.g., base64) before pattern matching
- Context-aware rules — Validation rules should match application risk profile
- Regular updates — Attack patterns evolve; validation rules must too
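The "normalize first" practice can be sketched as follows. NFKC folding and format-character stripping are one possible canonicalization step, not a complete defense against encoding tricks:

```python
import unicodedata

def normalize_input(text: str) -> str:
    """Canonicalize input before running pattern or classifier checks."""
    # NFKC folds compatibility forms, e.g. fullwidth letters to ASCII
    text = unicodedata.normalize("NFKC", text)
    # Strip zero-width and other format (Cf) characters used to split keywords
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

Running downstream pattern checks on the normalized text closes evasions such as fullwidth characters or zero-width spaces inserted inside trigger words.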
Limitations
Input validation is a necessary but insufficient defense. It cannot:
- Catch semantically valid attacks (requests that look legitimate but have malicious intent)
- Prevent indirect prompt injection (malicious content in retrieved documents)
- Stop novel attack variations not present in training data
- Distinguish between legitimate edge cases and attacks
Input validation should be one layer in a defense-in-depth architecture, not a complete solution.
Real-World Examples
OpenAI Moderation API — Provides content classification that can be used as input filtering before sending content to GPT models.
LangChain Input Guardrails — Framework-level validation hooks that can intercept and filter inputs before LLM calls.
Rebuff — Open-source prompt injection detection system combining heuristics and ML classification.
References
- OWASP (2023). "OWASP Top 10 for Large Language Model Applications."
- Greshake, K. et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection."
- Liu, Y. et al. (2023). "Prompt Injection Attacks and Defenses in LLM-Integrated Applications."
Framework Mappings
| Framework | Reference |
|---|---|
| OWASP LLM Top 10 | LLM01: Prompt Injection (Mitigation) |
| NIST AI RMF | GOVERN 1.1, MAP 1.5 |
| MITRE ATLAS | AML.M0015: Adversarial Input Detection |
Citation
Aizen, K. (2025). "Input Validation." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/defenses/input-validation/