Skip to content

Input Sanitization

Input sanitization is a defense layer that inspects, classifies, and transforms user input before it reaches the LLM. It detects and blocks prompt injection attacks, jailbreak attempts, and PII leakage at the input boundary — the same way web applications sanitize form inputs before passing them to a database.

LLMs process user input as part of their prompt. Unlike traditional software where input and instructions are structurally separated, LLMs treat everything as text in the same context window. This creates a fundamental vulnerability:

  • Prompt injection: Malicious instructions embedded in user input that override the system prompt (“Ignore previous instructions and…”).
  • Jailbreaks: Attempts to bypass safety constraints through social engineering, role-playing scenarios, or encoding tricks.
  • PII in prompts: Users inadvertently (or intentionally) including personal data that gets sent to external model providers, logged, or used in training.

Without input sanitization, every user query is a potential attack surface.

User Input
┌──────────────────┐
│ Length & Format │ Reject oversized or malformed input
│ Validation │
└────────┬─────────┘
┌──────────────────┐
│ PII Detection │ Detect and redact emails, SSNs,
│ & Redaction │ phone numbers, credit cards
└────────┬─────────┘
┌──────────────────┐
│ Injection │ Classify input for injection
│ Detection │ patterns and jailbreaks
└────────┬─────────┘
┌────┴─────┐
│ │
Clean Flagged
│ │
▼ ▼
Pass to Block, log,
model and return
safe response
  1. Format validation: Check input length, encoding, and structure. Reject inputs that exceed token budgets or contain suspicious encoding.
  2. PII detection: Scan for patterns matching personal data (emails, phone numbers, SSNs, credit card numbers). Redact or tokenize detected PII before the input moves forward.
  3. Injection detection: Classify the input for known injection patterns using one or more methods:
    • Pattern matching: Regular expressions for common injection prefixes (“ignore previous”, “system:”, delimiter injection).
    • Perplexity detection: Inputs with unusual token distributions may contain injected instructions.
    • Classifier model: A fine-tuned model that classifies inputs as benign or adversarial.
  4. Decision: Clean inputs proceed to the model. Flagged inputs are blocked, logged for security review, and the user receives a generic safe response.
  • Your system accepts free-text user input that is incorporated into LLM prompts.
  • Your application handles any type of sensitive or personal data.
  • You expose LLM functionality to external or untrusted users.
  • Compliance requirements (GDPR, HIPAA, PCI-DSS) mandate input data handling controls.
  • The system only processes internal, pre-validated data that never includes user input (batch processing of structured records). Sanitization adds latency to a path that does not need it.
  • You use structured input exclusively (API calls with typed parameters, not free text). Schema validation is more appropriate than text-level sanitization.
  • The LLM has no tools, no access to sensitive data, and no actions beyond text generation in a sandboxed environment. The blast radius of a successful injection is limited to the response text itself.
  1. False positives — Aggressive injection detection blocks legitimate queries that happen to contain patterns resembling attacks. “Please ignore the previous error and retry” is a valid user request that matches injection patterns.
  2. Latency — Each sanitization step adds processing time. PII detection with NER models can add 10-50ms. Classifier-based injection detection adds more.
  3. Arms race — Injection techniques evolve continuously. Pattern-based detection has a short half-life as attackers find new phrasings. Classifier models require ongoing retraining.
  4. Incomplete coverage — No sanitization layer catches everything. Sophisticated indirect injection (through retrieved documents, tool outputs, or multi-turn context manipulation) bypasses input-level checks.
import re
from dataclasses import dataclass
from enum import Enum
class ThreatLevel(Enum):
CLEAN = "clean"
PII_DETECTED = "pii_detected"
INJECTION_SUSPECTED = "injection_suspected"
BLOCKED = "blocked"
@dataclass
class SanitizationResult:
threat_level: ThreatLevel
sanitized_input: str
detections: list[str]
original_length: int
PII_PATTERNS = {
"email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b"),
"ssn": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
"phone": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
"credit_card": re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b"),
}
INJECTION_PATTERNS = [
re.compile(r"ignore\s+(?:all\s+)?(?:previous|prior|above)\s+instructions", re.I),
re.compile(r"disregard\s+(?:all\s+)?(?:previous|prior|your)\s+instructions", re.I),
re.compile(r"you\s+are\s+now\s+(?:a|an)\s+", re.I),
re.compile(r"new\s+instruction[s]?\s*:", re.I),
re.compile(r"system\s*(?:prompt|message)\s*:", re.I),
re.compile(r"\[INST\]|\[/INST\]|<<SYS>>|<\|im_start\|>", re.I),
re.compile(r"```\s*system", re.I),
]
MAX_INPUT_LENGTH = 10000
REDACTION_PLACEHOLDER = "[REDACTED]"
def sanitize_input(text: str) -> SanitizationResult:
detections: list[str] = []
original_length = len(text)
sanitized = text
if len(text) > MAX_INPUT_LENGTH:
return SanitizationResult(
threat_level=ThreatLevel.BLOCKED,
sanitized_input="",
detections=[f"Input exceeds maximum length: {len(text)} > {MAX_INPUT_LENGTH}"],
original_length=original_length,
)
for pii_type, pattern in PII_PATTERNS.items():
matches = pattern.findall(sanitized)
if matches:
detections.append(f"{pii_type}: {len(matches)} instance(s) redacted")
sanitized = pattern.sub(REDACTION_PLACEHOLDER, sanitized)
for pattern in INJECTION_PATTERNS:
if pattern.search(sanitized):
detections.append(f"Injection pattern detected: {pattern.pattern[:50]}")
return SanitizationResult(
threat_level=ThreatLevel.INJECTION_SUSPECTED,
sanitized_input="",
detections=detections,
original_length=original_length,
)
if any("redacted" in d for d in detections):
threat_level = ThreatLevel.PII_DETECTED
else:
threat_level = ThreatLevel.CLEAN
return SanitizationResult(
threat_level=threat_level,
sanitized_input=sanitized,
detections=detections,
original_length=original_length,
)
def enforce_sanitization(text: str) -> str:
result = sanitize_input(text)
if result.threat_level == ThreatLevel.BLOCKED:
raise ValueError("Input blocked by sanitization policy")
if result.threat_level == ThreatLevel.INJECTION_SUSPECTED:
raise ValueError("Input blocked: potential prompt injection detected")
return result.sanitized_input

For production systems, complement pattern-based detection with a trained classifier (e.g., a fine-tuned model on prompt injection datasets) and use a dedicated PII detection library (Presidio, Phileas) for higher accuracy across entity types and languages.

ToolTypeNotes
PresidioOpen-source (Microsoft)PII detection and anonymization with pluggable NER backends
LLM GuardOpen-sourceInput/output sanitization specifically designed for LLM applications
RebuffOpen-sourcePrompt injection detection using multiple detection strategies
Lakera GuardManaged APIReal-time prompt injection and content moderation
NVIDIA NeMo GuardrailsOpen-sourceProgrammable guardrails for LLM applications