Input Validation
First line of defense against prompt injection and malicious inputs, using pattern matching, classification, and structural analysis to filter threats.
Definition
Input validation for AI systems refers to the systematic inspection and filtering of user inputs before they reach the language model. Unlike traditional input validation (which focuses on data types and format), AI input validation must address semantic attacks where syntactically valid text contains malicious instructions.
This defense operates at the perimeter—the first opportunity to block attacks before they interact with the model. While no input validation can catch all prompt injection attempts, it significantly raises the cost of automated and unsophisticated attacks.
Why Traditional Validation Fails
Traditional input validation techniques don't translate directly to AI security:
- No type system — LLMs process natural language; there's no schema to validate against
- Semantic attacks — Malicious content is syntactically indistinguishable from legitimate queries
- Context dependence — The same text can be benign or malicious depending on application context
- Encoding tricks — Attackers use Unicode, base64, and obfuscation to evade pattern matching
Implementation Approaches
Pattern-Based Detection
Regex and keyword filters for known attack patterns:
```python
import re

# Example detection patterns
INJECTION_PATTERNS = [
    r"ignore (previous|prior|above) instructions",
    r"you are now (DAN|in developer mode)",
    r"system prompt:",
    r"<system>|</system>",
    r"IMPORTANT: .* override",
]

def check_patterns(text: str) -> bool:
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return True
    return False
```

Limitation: Trivially bypassed by rephrasing. Catches only the most naive attacks.
Classifier-Based Detection
ML models trained to identify injection attempts:
```python
# Using a fine-tuned classifier (model name is illustrative)
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="injection-detector-v1")

def classify_input(text: str) -> dict:
    result = classifier(text)[0]
    return {
        "is_injection": result["label"] == "INJECTION",
        "confidence": result["score"],
    }
```

Limitation: Subject to adversarial examples. Requires ongoing training data updates.
Structural Analysis
Detecting instruction-like structures in user input:
- Imperative verb detection ("ignore", "override", "pretend")
- Role assignment patterns ("you are now", "act as")
- Delimiter injection (`<system>`, `[INST]`, `###`)
- Unusual Unicode or encoding patterns
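The signals above can be combined into a simple scoring heuristic. A minimal sketch follows; the verb list, role patterns, and Unicode check are illustrative examples, not an exhaustive or production-ready rule set:

```python
import re
import unicodedata

# Illustrative signal lists; real deployments need broader coverage.
IMPERATIVE_VERBS = {"ignore", "override", "pretend", "disregard", "forget"}
ROLE_PATTERNS = [r"\byou are now\b", r"\bact as\b"]
DELIMITERS = ["<system>", "[INST]", "###"]

def structural_score(text: str) -> int:
    """Count instruction-like signals in user input."""
    score = 0
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    score += len(IMPERATIVE_VERBS & words)                       # imperative verbs
    score += sum(1 for p in ROLE_PATTERNS if re.search(p, lowered))  # role assignment
    score += sum(1 for d in DELIMITERS if d.lower() in lowered)      # delimiter injection
    # Format (Cf) and private-use (Co) characters often indicate obfuscation
    score += sum(1 for ch in text if unicodedata.category(ch) in ("Cf", "Co"))
    return score
```

An application would compare the score against a threshold tuned on its own traffic, trading false positives against missed attacks.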
Length and Complexity Limits
Simple but effective constraints:
- Maximum input length — Limits attack surface
- Character set restrictions — Block unusual Unicode ranges
- Nesting depth limits — Prevent deeply nested instruction patterns
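These constraints can be enforced in a few lines. A minimal sketch, assuming an English-only application (the length budget and ASCII-only allowlist are illustrative and would reject legitimate non-English input):

```python
MAX_LENGTH = 4000                               # assumed application-specific budget
ALLOWED_RANGES = [(0x20, 0x7E), (0x0A, 0x0A)]   # printable ASCII plus newline

def enforce_limits(text: str) -> tuple[bool, str]:
    """Return (accepted, reason); reject on any violation (fail closed)."""
    if len(text) > MAX_LENGTH:
        return False, "input exceeds maximum length"
    for ch in text:
        if not any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES):
            return False, f"disallowed character U+{ord(ch):04X}"
    return True, "ok"
```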
Defense Effectiveness
| Technique | Blocks | Bypassed By |
|---|---|---|
| Pattern matching | Known attack strings | Rephrasing, encoding |
| ML classifiers | Statistically similar attacks | Novel phrasings, adversarial examples |
| Length limits | Complex multi-stage attacks | Concise payloads |
| Unicode filtering | Encoding tricks | ASCII-only attacks |
Implementation Best Practices
- Layer defenses — Combine multiple techniques; don't rely on any single approach
- Log everything — Capture blocked inputs for analysis and classifier training
- Fail closed — When uncertain, reject the input rather than allow it through
- Normalize first — Decode Unicode and expand encodings (e.g., base64) before pattern matching
- Context-aware rules — Validation rules should match application risk profile
- Regular updates — Attack patterns evolve; validation rules must too
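The "normalize first" practice can be sketched as follows. NFKC folding and format-character stripping are one possible canonicalization step, not a complete defense against encoding tricks:

```python
import unicodedata

def normalize_input(text: str) -> str:
    """Canonicalize input before running pattern or classifier checks."""
    # NFKC folds compatibility forms, e.g. fullwidth letters to ASCII
    text = unicodedata.normalize("NFKC", text)
    # Strip zero-width and other format (Cf) characters used to split keywords
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

Running downstream pattern checks on the normalized text closes evasions such as fullwidth characters or zero-width spaces inserted inside trigger words.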
Limitations
Input validation is a necessary but insufficient defense. It cannot:
- Catch semantically valid attacks (requests that look legitimate but have malicious intent)
- Prevent indirect prompt injection (malicious content in retrieved documents)
- Stop novel attack variations not present in training data
- Distinguish between legitimate edge cases and attacks
Input validation should be one layer in a defense-in-depth architecture, not a complete solution.
Real-World Examples
OpenAI Moderation API — Provides content classification that can be used as input filtering before sending content to GPT models.
LangChain Input Guardrails — Framework-level validation hooks that can intercept and filter inputs before LLM calls.
Rebuff — Open-source prompt injection detection system combining heuristics and ML classification.
References
- OWASP (2023). "OWASP Top 10 for Large Language Model Applications."
- Greshake, K. et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection."
- Liu, Y. et al. (2023). "Prompt Injection Attacks and Defenses in LLM-Integrated Applications."
Framework Mappings
| Framework | Reference |
|---|---|
| OWASP LLM Top 10 | LLM01: Prompt Injection (Mitigation) |
| NIST AI RMF | GOVERN 1.1, MAP 1.5 |
| MITRE ATLAS | AML.M0015: Adversarial Input Detection |
Citation
Aizen, K. (2025). "Input Validation." AI Security Wiki, snailsploit.com. Retrieved from https://snailsploit.com/ai-security/wiki/defenses/input-validation/