Embedding Drift Detector
Monitor embedding distribution shifts over time to detect silent RAG degradation before it surfaces as bad answers.
What It Is
An embedding drift detector continuously monitors the statistical distribution of query embeddings and document embeddings in your RAG system. When the distribution of production queries drifts significantly from the distribution the index was built for — or when a model update silently changes the embedding space — the detector fires an alert before answer quality degrades visibly.
The Problem It Solves
RAG systems are built on an implicit contract: the embedding model maps queries and documents into a shared vector space where proximity equals relevance. This contract breaks silently in several ways:
- Query distribution shift: Users start asking questions about topics that were poorly represented in the original corpus. Retrieval quality drops but no error is thrown — the system returns the "closest" chunks, which are now irrelevant.
- Embedding model updates: The embedding provider ships a new model version. Existing index vectors were computed with the old model. Queries embedded with the new model land in different regions of the space. Cosine similarity scores degrade across the board.
- Index staleness: New documents are indexed incrementally, but the overall distribution of the index no longer reflects the corpus. Clusters become unbalanced.
- Dimensionality collapse: Under certain fine-tuning regimes or model updates, embeddings lose variance in key dimensions, reducing the effective discriminative power of the space.
In all cases, retrieval recall degrades silently. The LLM still produces fluent answers — just wrong ones, grounded in irrelevant context. Without embedding-level monitoring, the first signal is user complaints.
How It Works
flowchart TD
A["Production queries"] --> B["Compute query embeddings"]
B --> C["Store in rolling window"]
C --> D["Compute distribution statistics"]
D --> E{"Compare against baseline"}
E -->|"Drift below threshold"| F["No action"]
E -->|"Drift exceeds threshold"| G["Alert: embedding drift detected"]
G --> H["Trigger re-indexing or model audit"]
I["Periodic index sampling"] --> J["Compute index distribution stats"]
J --> E
- Baseline capture: At indexing time, compute aggregate statistics over the embedding space — centroid, variance per dimension, inter-cluster distances, and a sample of pairwise cosine similarities.
- Rolling query window: Maintain a rolling window of recent query embeddings (last N queries or last T hours).
- Distribution comparison: Periodically compare the query window distribution against the index baseline using statistical distance metrics (Wasserstein distance, KL divergence on projected distributions, or centroid drift magnitude).
- Alignment check: Compute the average cosine similarity between recent query embeddings and their top-k retrieved document embeddings. A decline in this metric indicates the query-document alignment is degrading.
- Alert and action: When drift exceeds the configured threshold, fire an alert. Depending on severity, trigger a re-indexing job, flag the embedding model for review, or activate a fallback retrieval strategy.
When to Use It
- Your RAG system serves production traffic where retrieval quality directly impacts user experience.
- You use a third-party embedding model that may be updated without notice (OpenAI, Cohere, Voyage).
- Your corpus grows incrementally and the query distribution evolves over time.
- You have experienced silent RAG quality degradation that was only caught by user complaints or spot checks.
When not to Use It
- Your embedding model is pinned, self-hosted, and never changes. The index-query alignment is stable by construction. Distribution monitoring adds cost without catching real issues.
- The corpus is small and static (e.g., a fixed FAQ set). Manual quality checks are sufficient and cheaper than automated monitoring infrastructure.
- You already run end-to-end eval suites on every retrieval pipeline change. If evaluation catches quality regressions before they reach production, embedding-level monitoring is redundant.
Trade-offs
- Statistical noise — Embedding distributions are high-dimensional. Meaningful drift detection requires dimensionality reduction (PCA, random projection) that can obscure real shifts or flag noise as drift. Threshold tuning is non-trivial.
- Baseline rot — The baseline itself becomes stale. If query patterns evolve legitimately, the baseline must be updated — but updating too aggressively masks real drift. Requires a policy for baseline refresh.
- Compute overhead — Storing and computing statistics over embedding windows adds storage and CPU cost. For high-throughput systems, sampling is necessary, which introduces its own accuracy trade-off.
- Action gap — The detector tells you something drifted, not what to do about it. Re-indexing is expensive and may not be the right fix if the drift is on the query side. Requires runbooks for each drift scenario.
Failure Modes
False Alarm from Seasonal Query Shifts
Trigger: Query patterns change due to normal seasonal variation (holiday shopping, tax season, back-to-school), shifting the query embedding distribution. Symptom: Drift detector fires alerts on legitimate distribution changes. Team investigates, finds no bug, and starts ignoring alerts. Alert fatigue erodes trust in the detector. Mitigation: Maintain multiple baselines (rolling 30-day and rolling 365-day). Only alert when the current window deviates from BOTH baselines. Annotate known seasonal periods.
Dimensionality Reduction Hides Real Drift
Trigger: PCA or random projection used for efficiency discards the dimensions where actual drift occurs. Symptom: Drift is happening in the full embedding space (retrieval quality is dropping) but the detector reports stable after projection. A false sense of security. Mitigation: Monitor retrieval quality metrics (recall, MRR) alongside embedding statistics. If quality drops but drift detector is silent, revisit the projection strategy. Test the detector against known drift events.
Baseline Rot
Trigger: The baseline is never refreshed, and the corpus evolves legitimately over months. Symptom: The detector permanently reports drift because the current distribution is genuinely different from the stale baseline — but there is no actual problem. Alternatively, the team refreshes aggressively and the baseline absorbs real drift. Mitigation: Implement a scheduled baseline refresh with a confirmation step: compute the new baseline, compare retrieval quality before/after refresh, and only commit the new baseline if quality is stable.
Implementation Example
import time
from dataclasses import dataclass, field
import numpy as np
@dataclass
class EmbeddingBaseline:
centroid: np.ndarray
variance_per_dim: np.ndarray
mean_pairwise_similarity: float
sample_size: int
created_at: float
@dataclass
class DriftReport:
centroid_shift: float
variance_ratio: float
alignment_score: float
is_drifted: bool
details: dict
class EmbeddingDriftDetector:
def __init__(
self,
centroid_threshold: float = 0.15,
alignment_threshold: float = 0.75,
window_size: int = 1000,
):
self._centroid_threshold = centroid_threshold
self._alignment_threshold = alignment_threshold
self._window_size = window_size
self._baseline: EmbeddingBaseline | None = None
self._query_window: list[np.ndarray] = []
self._alignment_scores: list[float] = []
def set_baseline(self, index_embeddings: np.ndarray) -> EmbeddingBaseline:
centroid = np.mean(index_embeddings, axis=0)
variance = np.var(index_embeddings, axis=0)
sample_idx = np.random.choice(
len(index_embeddings),
size=min(500, len(index_embeddings)),
replace=False,
)
sample = index_embeddings[sample_idx]
norms = np.linalg.norm(sample, axis=1, keepdims=True)
norms = np.where(norms == 0, 1, norms)
normalized = sample / norms
sim_matrix = normalized @ normalized.T
upper_tri = sim_matrix[np.triu_indices_from(sim_matrix, k=1)]
mean_sim = float(np.mean(upper_tri))
self._baseline = EmbeddingBaseline(
centroid=centroid,
variance_per_dim=variance,
mean_pairwise_similarity=mean_sim,
sample_size=len(index_embeddings),
created_at=time.monotonic(),
)
return self._baseline
def record_query(
self, query_embedding: np.ndarray, top_k_similarity: float
) -> None:
self._query_window.append(query_embedding)
self._alignment_scores.append(top_k_similarity)
if len(self._query_window) > self._window_size:
self._query_window.pop(0)
self._alignment_scores.pop(0)
def check_drift(self) -> DriftReport:
if self._baseline is None or len(self._query_window) < 50:
return DriftReport(
centroid_shift=0.0,
variance_ratio=1.0,
alignment_score=1.0,
is_drifted=False,
details={"reason": "insufficient data"},
)
query_matrix = np.array(self._query_window)
query_centroid = np.mean(query_matrix, axis=0)
shift_vector = query_centroid - self._baseline.centroid
centroid_shift = float(np.linalg.norm(shift_vector))
query_variance = np.var(query_matrix, axis=0)
safe_baseline_var = np.where(
self._baseline.variance_per_dim == 0, 1e-10, self._baseline.variance_per_dim
)
variance_ratio = float(np.mean(query_variance / safe_baseline_var))
alignment_score = float(np.mean(self._alignment_scores))
is_drifted = (
centroid_shift > self._centroid_threshold
or alignment_score < self._alignment_threshold
)
return DriftReport(
centroid_shift=centroid_shift,
variance_ratio=variance_ratio,
alignment_score=alignment_score,
is_drifted=is_drifted,
details={
"window_size": len(self._query_window),
"centroid_threshold": self._centroid_threshold,
"alignment_threshold": self._alignment_threshold,
"top_drifted_dims": int(
np.sum(np.abs(query_variance - self._baseline.variance_per_dim)
> 2 * self._baseline.variance_per_dim)
),
},
)
Tool Landscape
| Tool | Type | Notes |
|---|---|---|
| Arize Phoenix | Open-source observability | Embedding drift visualization and alerting for LLM applications |
| Evidently AI | Open-source monitoring | Data and embedding drift detection with statistical tests |
| WhyLabs | Managed platform | Continuous profiling of embedding distributions with anomaly detection |
| Galileo | Managed platform | RAG-specific quality and drift monitoring |
| Custom + Prometheus | DIY | Export drift metrics to Prometheus, alert via standard thresholds |
Related Patterns
- Span-Level Tracing — Trace retrieval quality metrics per request alongside embedding drift aggregate metrics.
- Hybrid Search — BM25 acts as a natural fallback when embedding-based retrieval degrades due to drift.
- Data Contract Pattern — Extend data contracts to include embedding model version and dimensionality as contract terms.
- Semantic Caching — Embedding drift invalidates cache similarity assumptions. Drift detection should trigger cache invalidation.