# Semantic Caching
## What It Is

Semantic caching stores LLM responses indexed by the meaning of the query rather than the exact string. When a new query arrives, its embedding is compared against cached query embeddings. If a sufficiently similar query exists in the cache, the cached response is returned without making an LLM call.
## The Problem It Solves

Traditional exact-match caching misses the vast majority of cache opportunities in LLM workloads. Users phrase the same question in dozens of different ways. “What’s your return policy?” and “How do I return an item?” are semantically identical but share no cache key in a string-match system. Without semantic caching, every rephrasing triggers a full inference call at full cost and latency.
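The gap is easy to see with a plain exact-match key: hashing the two phrasings above produces completely unrelated keys, so a string-keyed cache can never connect them. A minimal sketch:

```python
import hashlib

# Two semantically identical questions from the example above.
q1 = "What's your return policy?"
q2 = "How do I return an item?"

# An exact-match cache keys on the literal string (or a hash of it).
key1 = hashlib.sha256(q1.encode()).hexdigest()
key2 = hashlib.sha256(q2.encode()).hexdigest()

# The keys share nothing, so the rephrased query is always a cache miss.
assert key1 != key2
```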
## How It Works

```mermaid
flowchart TD
    A["Query arrives"] --> B["Generate embedding"]
    B --> C{"Search cache\n(cosine similarity)"}
    C -->|"Hit (sim ≥ threshold)"| D["Return cached response"]
    C -->|"Miss"| E["Call LLM"]
    E --> F["Cache response with TTL"]
    F --> G["Return response"]
```
1. A new query arrives and is converted to an embedding vector using a lightweight embedding model.
2. The embedding is compared against all cached query embeddings using cosine similarity (or approximate nearest neighbor search).
3. If any cached query has similarity above the configured threshold (typically 0.92–0.97), the cached response is returned.
4. On a cache miss, the query is sent to the LLM. The response, along with the query embedding, is stored in the cache with a TTL.
5. Cache entries expire based on TTL or are evicted using LRU when the cache reaches capacity.
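To make the similarity step concrete, here is a toy cosine-similarity computation on 3-dimensional vectors. Real embedding models produce hundreds or thousands of dimensions; the numbers here are illustrative only:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of magnitudes;
    # 1.0 means the vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.4, 0.1])
rephrasing = np.array([0.85, 0.45, 0.15])  # near-duplicate meaning
unrelated = np.array([0.1, 0.2, 0.95])     # different topic

sim_hit = cosine_similarity(query, rephrasing)
sim_miss = cosine_similarity(query, unrelated)

# With a 0.95 threshold, the rephrasing hits and the unrelated query misses.
assert sim_hit >= 0.95
assert sim_miss < 0.95
```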
## When to Use It

- Customer support or FAQ workloads where users ask similar questions repeatedly.
- Internal tools where the same types of queries recur across users.
- Any workload with high query repetition and tolerance for slightly stale answers.
- Workloads where cost reduction is a priority and sub-second latency for common queries is desirable.
## When NOT to Use It

- Queries depend on real-time data that changes between requests (stock prices, live inventory). Cached responses will be stale and wrong.
- Every query is genuinely unique (code generation with different specifications, creative writing with unique prompts). Hit rates will be near zero, and you pay the embedding cost on every request for no benefit.
- The system prompt or context changes frequently per request. Two queries may be textually similar but have different system prompts, making cached responses incorrect.
- You need deterministic, auditable responses where every interaction must be independently generated and logged.
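One mitigation for the system-prompt pitfall above is to partition the cache by a hash of the system prompt, so textually similar queries issued under different instructions never share an entry. A minimal sketch; `cache_namespace` is a hypothetical helper, not part of any library:

```python
import hashlib

def cache_namespace(system_prompt: str) -> str:
    # Hypothetical helper: derive a stable partition key from the system
    # prompt, so each prompt variant gets its own cache space.
    return hashlib.sha256(system_prompt.encode()).hexdigest()[:16]

ns_support = cache_namespace("You are a helpful support agent.")
ns_legal = cache_namespace("You are a cautious legal assistant.")

# Different system prompts map to different partitions.
assert ns_support != ns_legal
# The same prompt always maps back to the same partition.
assert ns_support == cache_namespace("You are a helpful support agent.")
```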
## Trade-offs

- **Embedding cost**: Every query requires an embedding call, even on a cache hit. For very cheap models, the embedding cost can approach the inference cost, reducing the net savings.
- **Similarity threshold tuning**: Too high and you miss valid cache hits; too low and you serve wrong answers for queries that are similar in wording but different in intent. Requires monitoring and adjustment per workload.
- **Stale responses**: Cached answers may be outdated if the underlying knowledge changes. TTL must be tuned to the data's freshness requirements.
- **Cache poisoning**: A low-quality response that gets cached will be served to many users. Consider caching only responses that pass quality checks.
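The cache-poisoning point suggests gating writes on a quality check. A minimal sketch under stated assumptions: `looks_cacheable` is a hypothetical heuristic and `StubCache` a stand-in, neither from any library:

```python
class StubCache:
    """Stand-in for a semantic cache; only store() matters here."""
    def __init__(self):
        self.entries = []

    def store(self, embedding, response):
        self.entries.append((embedding, response))

def looks_cacheable(response: dict, min_length: int = 20) -> bool:
    # Hypothetical heuristic: reject empty, very short, or refusal-style
    # answers so they are not replayed to every similar future query.
    text = response.get("text", "")
    return len(text) >= min_length and not text.lower().startswith("i'm sorry")

def store_if_safe(cache, embedding, response) -> bool:
    if looks_cacheable(response):
        cache.store(embedding, response)
        return True
    return False

cache = StubCache()
assert store_if_safe(cache, [0.1, 0.2], {"text": "Returns are accepted within 30 days of purchase."})
assert not store_if_safe(cache, [0.1, 0.2], {"text": "I'm sorry, I can't help with that."})
assert len(cache.entries) == 1  # only the good response was cached
```

Real deployments might replace the heuristic with an LLM-based grader or user-feedback signal; the gate on `store()` is the important part.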
## Implementation Example

```python
import time
from dataclasses import dataclass

import numpy as np


@dataclass
class CacheEntry:
    query_embedding: np.ndarray
    response: dict
    created_at: float
    ttl: float


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95, default_ttl: float = 3600):
        self._entries: list[CacheEntry] = []
        self._threshold = similarity_threshold
        self._default_ttl = default_ttl

    @staticmethod
    def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        dot = np.dot(a, b)
        norm = np.linalg.norm(a) * np.linalg.norm(b)
        if norm == 0:
            return 0.0
        return float(dot / norm)

    def _evict_expired(self) -> None:
        # Drop entries whose TTL has elapsed.
        now = time.monotonic()
        self._entries = [
            e for e in self._entries if (now - e.created_at) < e.ttl
        ]

    def lookup(self, query_embedding: np.ndarray) -> dict | None:
        self._evict_expired()
        best_similarity = 0.0
        best_entry = None
        # Linear scan over all entries; swap in ANN search at scale.
        for entry in self._entries:
            sim = self._cosine_similarity(query_embedding, entry.query_embedding)
            if sim > best_similarity:
                best_similarity = sim
                best_entry = entry
        if best_entry and best_similarity >= self._threshold:
            return best_entry.response
        return None

    def store(
        self,
        query_embedding: np.ndarray,
        response: dict,
        ttl: float | None = None,
    ) -> None:
        self._entries.append(
            CacheEntry(
                query_embedding=query_embedding,
                response=response,
                created_at=time.monotonic(),
                ttl=ttl or self._default_ttl,
            )
        )


async def cached_completion(
    cache: SemanticCache,
    embed_fn,
    llm_fn,
    query: str,
    **kwargs,
) -> dict:
    query_embedding = await embed_fn(query)
    cached = cache.lookup(query_embedding)
    if cached is not None:
        return cached

    response = await llm_fn(query, **kwargs)
    cache.store(query_embedding, response)
    return response
```

For production use, replace the in-memory list with a vector database (Redis with vector search, Qdrant, or Pinecone) to support persistence, distributed access, and approximate nearest neighbor search at scale.
## Tool Landscape

| Tool | Type | Notes |
|---|---|---|
| GPTCache | Open-source library | Purpose-built semantic cache for LLM applications |
| Redis + RediSearch | Infrastructure | Vector similarity search on top of Redis, good for existing Redis users |
| Qdrant | Vector database | Can serve as both a retrieval store and semantic cache |
| Portkey | Managed gateway | Built-in semantic caching at the gateway layer |
| Momento | Managed cache | Serverless caching with vector search support |
## Related Patterns

- **LLM Gateway Pattern**: The gateway is the natural integration point for semantic caching.
- **Model Router Pattern**: Routing decisions happen before cache lookup in the request pipeline.
- **Prompt Compression Pattern**: Another cost-reduction pattern. The two can be combined: compress, then cache.
- **Cost Attribution Pattern**: Track cache hit rates per feature to measure cost savings.