Entity Resolution Graph
Use graph-based approaches to deduplicate, link, and merge entity mentions across heterogeneous data sources into a unified knowledge graph.
What It Is
Entity Resolution Graph is a pattern that uses graph algorithms to identify, link, and merge mentions of the same real-world entity across multiple data sources. Instead of relying purely on string matching or embedding similarity, it models candidate matches as a graph and uses connectivity, transitivity, and community detection to make resolution decisions that individual pairwise comparisons miss.
The Problem It Solves
AI systems ingest data from many sources. The same entity (customer, product, location, concept) appears with different names, formats, and identifiers across sources:
- "Microsoft Corp", "MSFT", "Microsoft Corporation" in financial data.
- "Dr. Jane Smith", "J. Smith", "jane.smith@hospital.org" in medical records.
- "GPT-4", "gpt-4-turbo", "OpenAI GPT4" in technical documentation.
Without resolution, knowledge graphs have duplicate nodes, RAG systems retrieve fragmented context, and analytics produce wrong counts. Pairwise matching alone fails because:
- It misses transitive links: A matches B, B matches C, but A does not directly match C.
- It cannot use graph structure: an entity connected to the same neighbors is likely the same entity even if names differ.
How It Works
flowchart TD
A["Mentions from source A"] --> D["Candidate generation (blocking)"]
B["Mentions from source B"] --> D
C["Mentions from source C"] --> D
D --> E["Pairwise similarity scoring"]
E --> F["Build resolution graph"]
F --> G["Community detection / clustering"]
G --> H["Merge cluster into canonical entities"]
H --> I["Deduplicated entity store"]
- Extract mentions — Pull entity mentions from each source with context (surrounding text, metadata, co-occurring entities).
- Blocking — Create candidate pairs using cheap heuristics (same first letter, shared token, locality-sensitive hashing) to avoid O(n^2) comparisons.
- Pairwise scoring — Score candidate pairs using string similarity (Jaro-Winkler, edit distance), embedding cosine similarity, or an LLM-based judge.
- Build resolution graph — Create a graph where nodes are mentions and edges are scored similarities above a threshold.
- Community detection — Run connected components, Louvain clustering, or label propagation to identify clusters of mentions that refer to the same entity.
- Merge — For each cluster, select or synthesize a canonical entity record that combines the best attributes from all mentions.
When to Use It
- You are building a knowledge graph from multiple data sources and need deduplicated entities.
- Your RAG system retrieves fragmented and contradictory information about the same entity from different documents.
- Analytics or reporting requires accurate entity counts across heterogeneous sources.
- You have long-tail entities (niche products, internal tools, domain-specific terms) that embedding models have not seen.
When not to Use It
- Entities have reliable unique identifiers across all sources (UUIDs, standardized codes). Simple join-on-ID is cheaper and more accurate.
- The entity space is small and manually curated. A lookup table or alias mapping is simpler and more maintainable.
- Real-time resolution is required. Graph-based approaches are batch-oriented. For real-time, use embedding-based nearest neighbor lookup.
Trade-offs
- Blocking quality — The blocking strategy determines recall. Aggressive blocking (more pairs) catches more matches but costs more compute. Conservative blocking misses matches.
- Threshold sensitivity — The similarity threshold for creating graph edges controls precision/recall. Too low creates false links; too high misses valid ones.
- Transitive errors — Graph-based approaches propagate errors: one bad edge can merge two unrelated entity clusters.
- Maintenance cost — New data sources require re-running or incrementally updating the resolution pipeline.
Failure Modes
Transitive Merge Cascade
Trigger: A single incorrect edge links two unrelated entity clusters, and community detection merges them into one super-cluster. Symptom: Hundreds of unrelated entities share a canonical record. Downstream queries get nonsensical answers because the "entity" now represents multiple real-world things. Mitigation: Set a maximum cluster size threshold. Flag clusters that grow beyond expected bounds for manual review. Use edge confidence scores and remove low-confidence edges before community detection.
Blocking Function Misses Cross-Script Entities
Trigger: The blocking function uses first-N characters or phonetic hashing, but entity mentions span multiple scripts or languages ("München" vs. "Munich"). Symptom: Valid matches are never generated as candidate pairs because they fall into different blocks. Recall drops silently for multilingual corpora. Mitigation: Use embedding-based blocking that captures semantic similarity across languages. Add transliteration normalization before blocking.
Stale Canonical After New Source Ingestion
Trigger: A new data source provides better-quality entity data than the original canonical, but the pipeline always preserves the first-seen canonical. Symptom: Canonical records remain low-quality even though superior data exists in the cluster. Retrieval surfaces the weaker representation. Mitigation: Re-evaluate canonical selection when new sources are added, using a quality scoring function (completeness, recency, source authority). Allow canonical promotion.
Implementation Example
from dataclasses import dataclass, field
@dataclass
class EntityMention:
mention_id: str
source: str
name: str
context: str = ""
attributes: dict = field(default_factory=dict)
def jaro_winkler_similarity(s1: str, s2: str) -> float:
if s1 == s2:
return 1.0
len1, len2 = len(s1), len(s2)
if len1 == 0 or len2 == 0:
return 0.0
match_distance = max(len1, len2) // 2 - 1
s1_matches = [False] * len1
s2_matches = [False] * len2
matches = 0
transpositions = 0
for i in range(len1):
start = max(0, i - match_distance)
end = min(i + match_distance + 1, len2)
for j in range(start, end):
if s2_matches[j] or s1[i] != s2[j]:
continue
s1_matches[i] = True
s2_matches[j] = True
matches += 1
break
if matches == 0:
return 0.0
k = 0
for i in range(len1):
if not s1_matches[i]:
continue
while not s2_matches[k]:
k += 1
if s1[i] != s2[k]:
transpositions += 1
k += 1
jaro = (
matches / len1 + matches / len2 + (matches - transpositions / 2) / matches
) / 3
prefix_len = 0
for i in range(min(4, min(len1, len2))):
if s1[i] == s2[i]:
prefix_len += 1
else:
break
return jaro + prefix_len * 0.1 * (1 - jaro)
class ResolutionGraph:
def __init__(self, threshold: float = 0.85):
self._mentions: dict[str, EntityMention] = {}
self._edges: dict[str, list[tuple[str, float]]] = {}
self._threshold = threshold
def add_mention(self, mention: EntityMention) -> None:
self._mentions[mention.mention_id] = mention
self._edges[mention.mention_id] = []
def compute_similarities(self) -> None:
mention_ids = list(self._mentions.keys())
for i in range(len(mention_ids)):
for j in range(i + 1, len(mention_ids)):
m1 = self._mentions[mention_ids[i]]
m2 = self._mentions[mention_ids[j]]
sim = jaro_winkler_similarity(m1.name.lower(), m2.name.lower())
if sim >= self._threshold:
self._edges[mention_ids[i]].append((mention_ids[j], sim))
self._edges[mention_ids[j]].append((mention_ids[i], sim))
def find_clusters(self) -> list[set[str]]:
visited: set[str] = set()
clusters: list[set[str]] = []
for mention_id in self._mentions:
if mention_id in visited:
continue
cluster: set[str] = set()
queue = [mention_id]
while queue:
current = queue.pop(0)
if current in visited:
continue
visited.add(current)
cluster.add(current)
for neighbor_id, _ in self._edges.get(current, []):
if neighbor_id not in visited:
queue.append(neighbor_id)
clusters.append(cluster)
return clusters
def resolve(self) -> list[dict]:
self.compute_similarities()
clusters = self.find_clusters()
resolved = []
for cluster in clusters:
mentions = [self._mentions[mid] for mid in cluster]
canonical = max(mentions, key=lambda m: len(m.name))
resolved.append({
"canonical_name": canonical.name,
"sources": list({m.source for m in mentions}),
"mention_count": len(mentions),
"all_names": list({m.name for m in mentions}),
})
return resolved
Tool Landscape
| Tool | Type | Notes |
|---|---|---|
| Dedupe | Open-source Python | Active learning based entity resolution with blocking |
| Senzing | Commercial | Real-time entity resolution engine with graph-based approach |
| Neo4j + GDS | Graph DB + library | Community detection algorithms (Louvain, WCC) on entity graphs |
| Splink | Open-source (UK Gov) | Probabilistic record linkage at scale with Spark support |
| LLM-based judge | Custom | Use LLM to score whether two mentions refer to the same entity |
Related Patterns
- GraphRAG — A clean entity-resolved knowledge graph is the foundation for effective GraphRAG.
- Data Contract Pattern — Data contracts can enforce entity ID standards that reduce the need for resolution.