Skip to content

Entity Resolution Graph

Entity Resolution Graph is a pattern that uses graph algorithms to identify, link, and merge mentions of the same real-world entity across multiple data sources. Instead of relying purely on string matching or embedding similarity, it models candidate matches as a graph and uses connectivity, transitivity, and community detection to make resolution decisions that individual pairwise comparisons miss.

AI systems ingest data from many sources. The same entity (customer, product, location, concept) appears with different names, formats, and identifiers across sources:

  • “Microsoft Corp”, “MSFT”, “Microsoft Corporation” in financial data.
  • “Dr. Jane Smith”, “J. Smith”, “jane.smith@hospital.org” in medical records.
  • “GPT-4”, “gpt-4-turbo”, “OpenAI GPT4” in technical documentation.

Without resolution, knowledge graphs have duplicate nodes, RAG systems retrieve fragmented context, and analytics produce wrong counts. Pairwise matching alone fails because:

  • It misses transitive links: A matches B, B matches C, but A does not directly match C.
  • It cannot use graph structure: an entity connected to the same neighbors is likely the same entity even if names differ.
flowchart TD
    A["Mentions from source A"] --> D["Candidate generation (blocking)"]
    B["Mentions from source B"] --> D
    C["Mentions from source C"] --> D
    D --> E["Pairwise similarity scoring"]
    E --> F["Build resolution graph"]
    F --> G["Community detection / clustering"]
    G --> H["Merge cluster into canonical entities"]
    H --> I["Deduplicated entity store"]
  1. Extract mentions — Pull entity mentions from each source with context (surrounding text, metadata, co-occurring entities).
  2. Blocking — Create candidate pairs using cheap heuristics (same first letter, shared token, locality-sensitive hashing) to avoid O(n^2) comparisons.
  3. Pairwise scoring — Score candidate pairs using string similarity (Jaro-Winkler, edit distance), embedding cosine similarity, or an LLM-based judge.
  4. Build resolution graph — Create a graph where nodes are mentions and edges are scored similarities above a threshold.
  5. Community detection — Run connected components, Louvain clustering, or label propagation to identify clusters of mentions that refer to the same entity.
  6. Merge — For each cluster, select or synthesize a canonical entity record that combines the best attributes from all mentions.
  • You are building a knowledge graph from multiple data sources and need deduplicated entities.
  • Your RAG system retrieves fragmented and contradictory information about the same entity from different documents.
  • Analytics or reporting requires accurate entity counts across heterogeneous sources.
  • You have long-tail entities (niche products, internal tools, domain-specific terms) that embedding models have not seen.
  • Entities have reliable unique identifiers across all sources (UUIDs, standardized codes). Simple join-on-ID is cheaper and more accurate.
  • The entity space is small and manually curated. A lookup table or alias mapping is simpler and more maintainable.
  • Real-time resolution is required. Graph-based approaches are batch-oriented. For real-time, use embedding-based nearest neighbor lookup.
  1. Blocking quality — The blocking strategy determines recall. Aggressive blocking (more pairs) catches more matches but costs more compute. Conservative blocking misses matches.
  2. Threshold sensitivity — The similarity threshold for creating graph edges controls precision/recall. Too low creates false links; too high misses valid ones.
  3. Transitive errors — Graph-based approaches propagate errors: one bad edge can merge two unrelated entity clusters.
  4. Maintenance cost — New data sources require re-running or incrementally updating the resolution pipeline.
from dataclasses import dataclass, field
@dataclass
class EntityMention:
mention_id: str
source: str
name: str
context: str = ""
attributes: dict = field(default_factory=dict)
def jaro_winkler_similarity(s1: str, s2: str) -> float:
if s1 == s2:
return 1.0
len1, len2 = len(s1), len(s2)
if len1 == 0 or len2 == 0:
return 0.0
match_distance = max(len1, len2) // 2 - 1
s1_matches = [False] * len1
s2_matches = [False] * len2
matches = 0
transpositions = 0
for i in range(len1):
start = max(0, i - match_distance)
end = min(i + match_distance + 1, len2)
for j in range(start, end):
if s2_matches[j] or s1[i] != s2[j]:
continue
s1_matches[i] = True
s2_matches[j] = True
matches += 1
break
if matches == 0:
return 0.0
k = 0
for i in range(len1):
if not s1_matches[i]:
continue
while not s2_matches[k]:
k += 1
if s1[i] != s2[k]:
transpositions += 1
k += 1
jaro = (
matches / len1 + matches / len2 + (matches - transpositions / 2) / matches
) / 3
prefix_len = 0
for i in range(min(4, min(len1, len2))):
if s1[i] == s2[i]:
prefix_len += 1
else:
break
return jaro + prefix_len * 0.1 * (1 - jaro)
class ResolutionGraph:
def __init__(self, threshold: float = 0.85):
self._mentions: dict[str, EntityMention] = {}
self._edges: dict[str, list[tuple[str, float]]] = {}
self._threshold = threshold
def add_mention(self, mention: EntityMention) -> None:
self._mentions[mention.mention_id] = mention
self._edges[mention.mention_id] = []
def compute_similarities(self) -> None:
mention_ids = list(self._mentions.keys())
for i in range(len(mention_ids)):
for j in range(i + 1, len(mention_ids)):
m1 = self._mentions[mention_ids[i]]
m2 = self._mentions[mention_ids[j]]
sim = jaro_winkler_similarity(m1.name.lower(), m2.name.lower())
if sim >= self._threshold:
self._edges[mention_ids[i]].append((mention_ids[j], sim))
self._edges[mention_ids[j]].append((mention_ids[i], sim))
def find_clusters(self) -> list[set[str]]:
visited: set[str] = set()
clusters: list[set[str]] = []
for mention_id in self._mentions:
if mention_id in visited:
continue
cluster: set[str] = set()
queue = [mention_id]
while queue:
current = queue.pop(0)
if current in visited:
continue
visited.add(current)
cluster.add(current)
for neighbor_id, _ in self._edges.get(current, []):
if neighbor_id not in visited:
queue.append(neighbor_id)
clusters.append(cluster)
return clusters
def resolve(self) -> list[dict]:
self.compute_similarities()
clusters = self.find_clusters()
resolved = []
for cluster in clusters:
mentions = [self._mentions[mid] for mid in cluster]
canonical = max(mentions, key=lambda m: len(m.name))
resolved.append({
"canonical_name": canonical.name,
"sources": list({m.source for m in mentions}),
"mention_count": len(mentions),
"all_names": list({m.name for m in mentions}),
})
return resolved
ToolTypeNotes
DedupeOpen-source PythonActive learning based entity resolution with blocking
SenzingCommercialReal-time entity resolution engine with graph-based approach
Neo4j + GDSGraph DB + libraryCommunity detection algorithms (Louvain, WCC) on entity graphs
SplinkOpen-source (UK Gov)Probabilistic record linkage at scale with Spark support
LLM-based judgeCustomUse LLM to score whether two mentions refer to the same entity
  • GraphRAG — A clean entity-resolved knowledge graph is the foundation for effective GraphRAG.
  • Data Contract Pattern — Data contracts can enforce entity ID standards that reduce the need for resolution.