In Part 1 of this two-part series, we introduced the problem of record linkage (also known as entity resolution, data matching, or reconciliation) in the context of personal healthcare data. We briefly explored the limitations of solutions based on matching rules devised by experts, concluding that any acceptable system of rules would be large and unwieldy, and even then wouldn’t cover all the edge cases that are common in real-world data. That leaves us with the question: if not rules, then what?

In this final part of the series, we’ll describe our current ML-based solution to this problem at Included Health, part of a larger online-and-offline system for which we recently earned U.S. Patent 11,321,366. After reading, you’ll take away a high-level understanding of how to use a simple combination of machine learning and graph manipulation to identify real-world people in tabular data.

Machine-learned matching

Spurred by our experience with matching based on rules that generalize the idea of natural keys, we at Included Health surmised that better results might come from letting a machine learn the tricky edge cases for itself. This is the outlook behind our current approach to identifying people in data, which improves recall significantly over our earlier rules-based approach, without compromising the already-high precision of that approach.

The structure of our current data-matching pipeline is straightforward. It comprises typical wrangling and munging steps followed by an ML-based graph algorithm, all orchestrated by Airflow and executed by Spark:

  • Extraction: Derive patient descriptions from source datasets, injecting the corresponding “description IDs” (another term of art) into each such dataset.
  • Cleaning: Raise probability of finding true positives and avoiding false positives by normalizing attributes into standard formats.
  • Blocking: Avoid needless downstream compute by cheaply identifying candidate matches within all possible patient-description pairs.
  • Comparison: Detect co-referential patient-description pairs among the candidate matches using a model trained on string-similarity features.
  • Clustering: Partition the space of all patient descriptions into components internally connected by the co-reference relation.

This pipeline produces a mapping from identifiers for patient descriptions to identifiers for patient entities, such that each entity may have many descriptions. At Included Health, this mapping is a core data asset that enables us to stitch together all our datasets – from third-party database snapshots, to internal event streams – into a unified whole in which longitudinal patient journeys become visible.

Let’s look at the stages of this pipeline in a bit more detail, providing a high-level picture of how to use a simple combination of machine learning and graph manipulation to identify real-world people in tabular data.


Our data-matching pipeline is situated within Included Health’s data platform, in which the schema of any queryable dataset is codified as a protobuf message. We make extensive use of message- and field-level options to express schema metadata through annotations. For example, one field may be annotated as containing a patient’s first name, while another is annotated as containing their SSN.

For the toy dataset used in our Jon Fitch example, the schema might look like this (using strings across the board for simplicity):

message ExampleSchema {

    string FirstName  = 1 [(patient.attribute) = FIRST_NAME];

    string LastName   = 2 [(patient.attribute) = LAST_NAME];

    string Birthdate  = 3 [(patient.attribute) = BIRTHDATE];

    string Ssn        = 4 [(patient.attribute) = SSN];

    string EmployeeId = 5 [(patient.attribute) = EMPLOYEE_ID];

    string ZipCode    = 6 [(patient.attribute) = ZIP_CODE];

    string Address    = 7 [(patient.attribute) = ADDRESS];


By way of these annotations, a patient description can be extracted from each record conforming to the schema, with the allowance that a description field will be null if the source schema does not contain a corresponding field-level annotation, or if the source field is itself null. Corresponding to the extracted object, a “description ID” (essentially a hash of the extracted object) is injected into the source record to facilitate later reidentification.


As usual, data cleaning is tedious but important. To each field, we apply a sequence of steps – such as Unicode normalization, case folding, removal of diacritics, and date formatting – that transforms the raw field contents into a standardized representation. (See the W3C’s note, “Character Model for the World Wide Web: String Matching”, for more on this topic.) These steps may also include nullification of both common filler values and idiosyncratic fillers identified through data profiling, as well as values that do not conform to expected syntactic structures (e.g. invalid SSNs). To cultivate code reuse and maintainability, we invented a simple domain-specific language (DSL) that enables authorship of cleaning tasks as text-format protobufs.


Since the extraction step yields over 100 million patient descriptions, it is not feasible to compare each of the 5 quadrillion possible pairs in detail. Instead, from among those pairs, the blocking step cheaply identifies a much smaller subset of candidate matches for further analysis downstream.

The significance of the word “blocking” in this context has no connection with asynchronous programming. Rather, it pertains to the fact that this step defines a family of blocks (i.e. subsets) of patient descriptions, such that a description may be part of several blocks and, for any block, each pair of descriptions within it is regarded as a candidate match. In this sense, blocking is just a kind of indexing.

At Included Health, we apply a variation of an approach known as standard blocking, wherein each patient description is assigned one or more keys derived from its attributes. As with the cleaning step, we encourage code reuse and maintainability through a protobuf-based DSL for authoring blocking tasks. Concepts leveraged here include tokenization, n-gram sequences, and phonetic encodings.

For example, several blocking keys could be derived from a description by combining a phonetic encoding of its name with each of the strings obtainable by dropping up to one digit from its birthdate (assuming all the relevant fields are nonempty). This would allow two descriptions to share a blocking key (and thus be regarded as a candidate match) despite minor typos and stylization differences.

Here’s how those blocking keys would look for one of our Jon Fitch examples with birthdate 1990-12-12, using Double Metaphone to phonetically encode the names:








The bolded blocking key is shared by the example record in which Jon’s first name is stylized as “John” and his birthdate is mistyped as “1990-11-12”, causing that record to be considered a candidate match.

The success of blocking can be evaluated by three metrics (to be maximized):

  • Reduction ratio: the fraction of possible pairs that aren’t candidate matches
  • Pair quality: the fraction of candidate matches that are genuinely co-referential
  • Pair completeness: the fraction of genuinely co-referential pairs that are candidate matches

By our measurements, we have a reduction ratio of 99.99998%, a pair quality of 34.1%, and a pair completeness of 99.4%. To reduce the cost of our overall data-matching pipeline, we would like to raise our pair quality in the medium term, without materially damaging pair completeness.

The astute reader will have noticed that the standard-blocking approach bears similarity to the disjunctive rules-based matching technique described above, with blocking keys playing the role of composite natural keys. The salient difference is that, in the blocking stage, it is acceptable – expected, even – to generate false positives, with the understanding that these candidates will be denied downstream. Of course, as we optimize the pair-quality metric, we reduce the number of false positives and the effort necessary to remove them.

Alternative approaches often seen in the literature include sorted-neighborhood blocking, canopy clustering, and machine-learning of optimal blocking keys. We selected standard blocking because its implementation and maintenance are straightforward, and because it is embarrassingly parallelizable.

For more details on blocking (as well as other topics we’ll touch on below), a great resource is Peter Christen’s book, Data Matching.


Having generated a suitably small set of candidate matches, the next step is to examine them each in detail, generating feature vectors that represent dimensions of similarity between each pair of patient descriptions:

  • Do their names (or parts thereof) sound similar, under some permutation of tokens?
  • Are their names (or parts thereof) spelled alike, under some permutation of tokens?
  • How many parts (day, month, and year) of their birthdates agree?
  • Are their street and/or email addresses spelled similarly?
  • Do they have similar SSNs, employee IDs, insurance IDs, and such?
  • Etc.

Concepts leveraged here include tokenization, string-similarity measures (Jaro–Winkler, Monge–Elkan, Jaccard), and phonetic encodings.

Once generated, these features are passed to a binary classifier, which we trained (using a variation of the query-by-committee strategy familiar from active learning) to identify genuine matches. The training process used about 8,000 training examples (“actively” selected) and 1,500 testing examples (randomly selected), all hand-labeled by data experts from the engineering team, with uncertain judgements subjected to peer review. We used XGBoost to train our model, due to its strong track record on classification of tabular data and its clever handling of sparse features.

Compared to the disjunctive rules-based matching technique described earlier, this approach suffers from fewer false negatives and false positives; the model does a better job handling the edge cases.

False negatives are reduced because string-similarity features help identify nearly-matching attributes, sometimes using binary functions (e.g. Jaro–Winkler similarity) that are unavailable to the rules-based matching technique, which is limited to unary functions that map a single patient description to its composite natural keys, in the same way as our blocking step applies unary functions to generate a description’s blocking keys.

False positives, in turn, are reduced because the classifier has been trained to “know” that a few exactly matching fields don’t outweigh a countervailing global dissimilarity between two patient descriptions. To bolster this, out of respect for HIPAA, we also incorporate a handful of escape clauses of the kind envisaged above. This helps us paper over some lacunas in our model, without the extravagant trouble of labeling additional training data to teach the model new tricks. However, since it can be difficult to hand-roll excellent rules, we try to keep these escape clauses few and simple.


The final stage of our data-matching pipeline builds an undirected graph in which the patient descriptions are nodes, with an edge connecting two descriptions whenever they were a candidate match and the model classified them as a genuine match based on a detailed field-wise comparison. By partitioning this graph into its connected components, we reveal clusters of co-referential descriptions. Crucially, two descriptions may belong to the same cluster even if the model didn’t match them directly, since two nodes belong to the same connected component whenever they are mutually reachable by a path of any length.

To exercise finer control over this construction, we may prime the graph by drawing some edges by fiat, for instance when we know two descriptions are in fact co-referential, but we expect the model won’t match them. Conversely, we may forbid by fiat that two descriptions should end up in the same connected component, applying a technique like the Edmonds–Karp algorithm to separate the descriptions into distinct clusters with minimal disturbance to the original partitioning.

Finally, to each cluster we assign an identifier – the “entity ID” – that allows us to say, for any two source records from which we extracted patient descriptions, whether those records pertain to the same real-world individual. By a careful selection algorithm, we ensure high coherence between the entity ID assignments produced by subsequent runs of our pipeline.


In framing the problem at hand, we said earlier that, to build an accurate picture of someone’s longitudinal journey through the healthcare system, we need the ability to identify all their personal data, without jumbling it together with anyone else’s. And we agreed to measure our success in terms of precision and recall, as ways of expressing how well we avoided false positives and false negatives.

In an ideal world, we would raise both precision and recall to 100%. But, as noted, these metrics tend to pull in opposite directions. In the healthcare context, where HIPAA demands utmost respect, precision is somewhat more important. Endorsing that view, we have tuned our pipeline to deliver the following results, as measured against our gold-standard test data:

  • Precision: 99.9%
  • Recall: 98.8%

This represents an increase in recall of over 7%, without any sacrifice of precision, as compared with the disjunctive rules-based approach formerly deployed at Included Health.

Putting faith our machine-learned pipeline’s entity IDs on the basis of these metrics, data professionals at Included Health gain the ability to perform entity-wise joins and aggregations involving diverse datasets sourced from across our business and external sources – an ability that underpins not only their capacity to derive insights from longitudinal data, but also our power to offer user-facing features such as a timeline view of your medical history.


I hope you enjoyed this brief tour of Included Health’s approach to identifying patients in healthcare data. You may have learned something new about data wrangling, string comparison, or graph algorithms in the wild. Or perhaps you knew everything already, but you gained a new appreciation for some concrete challenges that arise when integrating data from disparate sources. If you found this exciting, reach out to us or visit our careers page. We’re hiring!


This post was originally written by Cole Leahy former software engineer at Included Health, and co-authored by Nick Gorski, Angelo Sisante, and Vinay Goel. Together, they work on modeling and integration of data from heterogeneous sources, producing trustworthy data assets that support online services and drive insights on data science and analytics teams.