Skip to main content
Menu
Defenses Wiki Entry

Input Validation

First line of defense against prompt injection and malicious inputs, using pattern matching, classification, and structural analysis to filter threats.

Last updated: January 24, 2025

Definition

Input validation for AI systems refers to the systematic inspection and filtering of user inputs before they reach the language model. Unlike traditional input validation (which focuses on data types and format), AI input validation must address semantic attacks where syntactically valid text contains malicious instructions.

This defense operates at the perimeter—the first opportunity to block attacks before they interact with the model. While no input validation can catch all prompt injection attempts, it raises the barrier significantly against automated and unsophisticated attacks.


Why Traditional Validation Fails

Traditional input validation techniques don't translate directly to AI security:

  • No type system — LLMs process natural language; there's no schema to validate against
  • Semantic attacks — Malicious content is syntactically indistinguishable from legitimate queries
  • Context dependence — The same text can be benign or malicious depending on application context
  • Encoding tricks — Attackers use Unicode, base64, and obfuscation to evade pattern matching

Implementation Approaches

Pattern-Based Detection

Regex and keyword filters for known attack patterns:

# Example detection patterns
INJECTION_PATTERNS = [
    r"ignore (previous|prior|above) instructions",
    r"you are now (DAN|in developer mode)",
    r"system prompt:",
    r"<system>|</system>",
    r"IMPORTANT: .* override",
]

def check_patterns(text: str) -> bool:
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return True
    return False

Limitation: Trivially bypassed by rephrasing. Catches only the most naive attacks.

Classifier-Based Detection

ML models trained to identify injection attempts:

# Using a fine-tuned classifier
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="injection-detector-v1")

def classify_input(text: str) -> dict:
    result = classifier(text)[0]
    return {
        "is_injection": result["label"] == "INJECTION",
        "confidence": result["score"]
    }

Limitation: Subject to adversarial examples. Requires ongoing training data updates.

Structural Analysis

Detecting instruction-like structures in user input:

  • Imperative verb detection ("ignore", "override", "pretend")
  • Role assignment patterns ("you are now", "act as")
  • Delimiter injection (<system>, [INST], ###)
  • Unusual Unicode or encoding patterns

Length and Complexity Limits

Simple but effective constraints:

  • Maximum input length — Limits attack surface
  • Character set restrictions — Block unusual Unicode ranges
  • Nesting depth limits — Prevent deeply nested instruction patterns

Defense Effectiveness

Technique Blocks Bypassed By
Pattern matching Known attack strings Rephrasing, encoding
ML classifiers Statistically similar attacks Novel phrasings, adversarial examples
Length limits Complex multi-stage attacks Concise payloads
Unicode filtering Encoding tricks ASCII-only attacks

Implementation Best Practices

  • Layer defenses — Combine multiple techniques; don't rely on any single approach
  • Log everything — Capture blocked inputs for analysis and classifier training
  • Fail closed — When uncertain, reject the input rather than allow it through
  • Normalize first — Decode Unicode, expand encoding before pattern matching
  • Context-aware rules — Validation rules should match application risk profile
  • Regular updates — Attack patterns evolve; validation rules must too

Limitations

Input validation is a necessary but insufficient defense. It cannot:

  • Catch semantically valid attacks (requests that look legitimate but have malicious intent)
  • Prevent indirect prompt injection (malicious content in retrieved documents)
  • Stop novel attack variations not present in training data
  • Distinguish between legitimate edge cases and attacks

Input validation should be one layer in a defense-in-depth architecture, not a complete solution.


Real-World Examples

OpenAI Moderation API — Provides content classification that can be used as input filtering before sending content to GPT models.

LangChain Input Guardrails — Framework-level validation hooks that can intercept and filter inputs before LLM calls.

Rebuff — Open-source prompt injection detection system combining heuristics and ML classification.


References

  • OWASP (2023). "OWASP Top 10 for Large Language Model Applications."
  • Greshake, K. et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection."
  • Liu, Y. et al. (2023). "Prompt Injection Attacks and Defenses in LLM-Integrated Applications."

Framework Mappings

Framework Reference
OWASP LLM Top 10 LLM01: Prompt Injection (Mitigation)
NIST AI RMF GOVERN 1.1, MAP 1.5
MITRE ATLAS AML.M0015: Adversarial Input Detection

Citation

Aizen, K. (2025). "Input Validation." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/defenses/input-validation/