
Semantic Caching

Semantic caching stores LLM responses indexed by the meaning of the query rather than the exact string. When a new query arrives, its embedding is compared against cached query embeddings. If a sufficiently similar query exists in the cache, the cached response is returned without making an LLM call.

Traditional exact-match caching misses the vast majority of cache opportunities in LLM workloads. Users phrase the same question in dozens of different ways. “What’s your return policy?” and “How do I return an item?” are semantically identical but share no cache key in a string-match system. Without semantic caching, every rephrasing triggers a full inference call at full cost and latency.
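To make the contrast concrete, here is a minimal sketch of why string keys fail where vectors succeed. The embeddings are hand-made toy vectors standing in for a real embedding model; only their closeness matters for the illustration:

```python
import hashlib

import numpy as np


def exact_key(query: str) -> str:
    # Exact-match caching keys on the normalized string.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


q1 = "What's your return policy?"
q2 = "How do I return an item?"

# Different strings -> different keys -> guaranteed cache miss.
print(exact_key(q1) == exact_key(q2))  # False

# Toy vectors: with a real embedding model, semantically similar
# queries land close together in the vector space.
e1 = np.array([0.81, 0.57, 0.12])
e2 = np.array([0.79, 0.60, 0.10])
print(cosine(e1, e2))  # close to 1.0
```

The exact-match system sees two unrelated keys; a semantic cache sees two vectors whose similarity clears any reasonable threshold.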

```mermaid
flowchart TD
    A["Query arrives"] --> B["Generate embedding"]
    B --> C{"Search cache\n(cosine similarity)"}
    C -->|"Hit (sim ≥ threshold)"| D["Return cached response"]
    C -->|"Miss"| E["Call LLM"]
    E --> F["Cache response with TTL"]
    F --> G["Return response"]
```

Step by step:
  1. A new query arrives and is converted to an embedding vector using a lightweight embedding model.
  2. The embedding is compared against all cached query embeddings using cosine similarity (or approximate nearest neighbor search).
  3. If any cached query has similarity above the configured threshold (typically 0.92-0.97), the cached response is returned.
  4. On a cache miss, the query is sent to the LLM. The response, along with the query embedding, is stored in the cache with a TTL.
  5. Cache entries expire based on TTL or are evicted using LRU when the cache reaches capacity.
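Step 5 combines TTL expiry with LRU eviction, which the reference implementation further down omits for brevity. One common way to sketch both together is an `OrderedDict` keyed by an entry ID; the capacity and key scheme here are illustrative assumptions:

```python
import time
from collections import OrderedDict


class LRUTTLCache:
    """Evict entries past their TTL, and the least recently used entry
    when capacity is exceeded (a sketch of step 5 above)."""

    def __init__(self, capacity: int = 1000, ttl: float = 3600.0):
        self._entries: OrderedDict[str, tuple[float, object]] = OrderedDict()
        self._capacity = capacity
        self._ttl = ttl

    def get(self, key: str):
        item = self._entries.get(key)
        if item is None:
            return None
        created_at, value = item
        if time.monotonic() - created_at >= self._ttl:
            del self._entries[key]          # expired: evict on access
            return None
        self._entries.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (time.monotonic(), value)
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop least recently used


cache = LRUTTLCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # touch "a" so "b" becomes least recently used
cache.put("c", 3)       # over capacity: evicts "b"
print(cache.get("b"))   # None
```

In a semantic cache the key would be an entry ID and the values would carry the embedding and response; the eviction mechanics are the same.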
Semantic caching is a good fit for:
  • Customer support or FAQ workloads where users ask similar questions repeatedly.
  • Internal tools where the same types of queries recur across users.
  • Any workload with high query repetition and tolerance for slightly stale answers.
  • Workloads where cost reduction is a priority and sub-second latency for common queries is desirable.
Avoid semantic caching when:
  • Queries depend on real-time data that changes between requests (stock prices, live inventory). Cached responses will be stale and wrong.
  • Every query is genuinely unique (code generation with different specifications, creative writing with unique prompts). Hit rates will be near zero, and you pay the embedding cost on every request for no benefit.
  • The system prompt or context changes frequently per request. Two queries may be textually similar but have different system prompts, making cached responses incorrect.
  • You need deterministic, auditable responses where every interaction must be independently generated and logged.
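One mitigation for the system-prompt caveat is to partition the cache by a hash of the (system prompt, model) pair, so textually similar queries under different prompts can never collide. A minimal sketch; the namespace scheme is an assumption, not part of the implementation below:

```python
import hashlib


def cache_namespace(system_prompt: str, model: str) -> str:
    # Scope cache entries to one (system prompt, model) configuration:
    # lookups only consider entries within the same namespace, e.g. by
    # keeping one cache instance per namespace in a dict.
    digest = hashlib.sha256(system_prompt.encode()).hexdigest()[:16]
    return f"{model}:{digest}"


ns1 = cache_namespace("You are a support agent.", "gpt-4o-mini")
ns2 = cache_namespace("You are a legal assistant.", "gpt-4o-mini")
print(ns1 == ns2)  # False: different system prompts never share entries
```

The cost is a lower hit rate (each namespace warms up separately), which is the correct trade when a cross-prompt hit would produce a wrong answer.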
Key tradeoffs:
  1. Embedding cost — Every query requires an embedding call, even on cache hits. For very cheap models, the embedding cost can approach the inference cost, reducing the net savings.
  2. Similarity threshold tuning — Too high and you miss valid cache hits. Too low and you serve wrong answers for different-enough queries. Requires monitoring and adjustment per workload.
  3. Stale responses — Cached answers may be outdated if the underlying knowledge changes. TTL must be tuned to the data freshness requirements.
  4. Cache poisoning — A low-quality response that gets cached will be served to many users. Consider caching only responses that pass quality checks.
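For the cache-poisoning risk, one option is to gate storage behind a quality check, so only vetted responses enter the cache. A hedged sketch; the specific checks (minimum length, refusal markers) are illustrative assumptions, and a real gate might use an evaluator model instead:

```python
def passes_quality_gate(response: dict) -> bool:
    """Illustrative heuristics for whether a response is safe to cache."""
    text = response.get("text", "")
    if len(text) < 20:                      # likely truncated or empty
        return False
    refusal_markers = ("i can't help", "as an ai", "i'm sorry, but")
    if any(m in text.lower() for m in refusal_markers):
        return False                        # don't cache refusals
    return True


def store_if_good(cache, query_embedding, response) -> bool:
    # Responses that fail the gate are still returned to the caller,
    # just never cached, so one bad generation can't fan out to many users.
    if passes_quality_gate(response):
        cache.store(query_embedding, response)
        return True
    return False


print(passes_quality_gate({"text": "Returns are accepted within 30 days with a receipt."}))  # True
print(passes_quality_gate({"text": "I'm sorry, but I can't help with that."}))               # False
```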
A minimal in-memory reference implementation:

```python
import time
from dataclasses import dataclass

import numpy as np


@dataclass
class CacheEntry:
    query_embedding: np.ndarray
    response: dict
    created_at: float
    ttl: float


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95, default_ttl: float = 3600):
        self._entries: list[CacheEntry] = []
        self._threshold = similarity_threshold
        self._default_ttl = default_ttl

    @staticmethod
    def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        dot = np.dot(a, b)
        norm = np.linalg.norm(a) * np.linalg.norm(b)
        if norm == 0:
            return 0.0
        return float(dot / norm)

    def _evict_expired(self) -> None:
        now = time.monotonic()
        self._entries = [e for e in self._entries if (now - e.created_at) < e.ttl]

    def lookup(self, query_embedding: np.ndarray) -> dict | None:
        """Return the cached response for the most similar query, if above threshold."""
        self._evict_expired()
        best_similarity = 0.0
        best_entry = None
        for entry in self._entries:
            sim = self._cosine_similarity(query_embedding, entry.query_embedding)
            if sim > best_similarity:
                best_similarity = sim
                best_entry = entry
        if best_entry is not None and best_similarity >= self._threshold:
            return best_entry.response
        return None

    def store(
        self, query_embedding: np.ndarray, response: dict, ttl: float | None = None
    ) -> None:
        self._entries.append(
            CacheEntry(
                query_embedding=query_embedding,
                response=response,
                created_at=time.monotonic(),
                # Fall back to the default only when no TTL is given,
                # so an explicit ttl=0 is not silently overridden.
                ttl=self._default_ttl if ttl is None else ttl,
            )
        )


async def cached_completion(
    cache: SemanticCache,
    embed_fn,
    llm_fn,
    query: str,
    **kwargs,
) -> dict:
    """Embed the query, try the cache, and fall back to the LLM on a miss."""
    query_embedding = await embed_fn(query)
    cached = cache.lookup(query_embedding)
    if cached is not None:
        return cached
    response = await llm_fn(query, **kwargs)
    cache.store(query_embedding, response)
    return response
```

For production use, replace the in-memory list with a vector database (Redis with vector search, Qdrant, or Pinecone) to support persistence, distributed access, and approximate nearest neighbor search at scale.

| Tool | Type | Notes |
| --- | --- | --- |
| GPTCache | Open-source library | Purpose-built semantic cache for LLM applications |
| Redis + RediSearch | Infrastructure | Vector similarity search on top of Redis, good for existing Redis users |
| Qdrant | Vector database | Can serve as both a retrieval store and semantic cache |
| Portkey | Managed gateway | Built-in semantic caching at the gateway layer |
| Momento | Managed cache | Serverless caching with vector search support |