Healthcare in the United States is fractured. Notwithstanding their good intentions, the myriad providers and institutions in the health system forge data silos that interfere with a patient’s own ability to track their past care and take steps toward a healthy future.
At Included Health, we use machine learning to stitch together the seams of this frayed system, providing our members with a unified view of their care history, paired with personalized guidance on optimizing their health.
To that end, we manage an expanding set of more than three billion data points, drawn from insurance claims, hospitalization events, biometric measurements, application logs, and more. Data is the foundation of the insights by which we raise the standard of care.
As individual units, these personal-data points have limited value, answering questions like:
- What portion of last year’s emergency room (ER) visit was covered by my insurance?
- How many milligrams of that medication did my cardiologist prescribe?
- Do my current cholesterol levels mean I’m at risk of a heart attack?
But when composed together into longitudinal patient journeys, the value of personal data multiplies, driving insights like:
- My cholesterol levels are trending in a healthy direction since starting the medication prescribed by my cardiologist after last year’s ER visit.
While the concept is simple, tracking patient journeys is difficult in practice. The very same real-world person has different representations across different third-party systems. In order to build a picture of their journey over time, we need the ability to link all of them together. To be both useful and HIPAA-compliant, there should be no false negatives (missing links) and no false positives (spurious links).
In this first installment of a two-part series, we’ll examine this problem in more detail. We’ll also see why one solution, based on rules devised by experts, leaves much to be desired. In Part 2, we’ll describe our current solution to this problem at Included Health, part of a larger online-and-offline system for which we recently earned U.S. Patent 11,321,366. After reading both parts, you’ll take away a high-level understanding of how to use a simple combination of machine learning and graph manipulation to identify real-world people in tabular data.
Such an endeavor (independent of implementation) is known variously as record linkage, entity resolution, data matching, reconciliation, and several other terms – an amusing microcosm of the very problem it seeks to address, namely the trouble that arises when one thing has many names.
Problem
In general, one entity – that is, a real thing, such as a person or a place – can have many names, descriptions, or representations. This is true not only in the context of healthcare, but also in e-commerce catalogs, bibliographic databases, and so forth.
In structured healthcare data, patients tend to be described by combinations of personally identifiable information (PII) – attributes such as name, address, social security number, and so forth. Since PII comes in many shapes and sizes, it is not always obvious, at least to a machine, when two PII descriptions denote the same entity.
To illustrate that point, consider this simplified example:
First name | Last name | Birthdate | SSN | Employee ID | Zip code | Address |
Jon | Fitch | 1990-12-12 | 111223333 | STKR-42 | 94611 | 123 Oak Dr. Apt. 6 |
Jon | Fitch | 1990-12-12 | 111222333 | STKR-42 | 94611 | 123 Oak Drive Unit 6 |
Jon | Fitch | 1990-12-12 | 777445555 | STKR-42 | 94611 | 123 Oak Drive Unit 6 |
Jon | Fitch | 1990-11-12 | 111223333 | STKR-42 | 94611 | 123 Oak Drive No. 6 |
John | Fitch | 1990-11-12 | 111223333 | STKR-42 | 94610 | |
Jonathan | Fitch-Jones | 1990-12-12 | 3333 | | 94610 | |
Jonathan | Fitch-Jones | 1990-12-12 | | STKR-42 | 94610 | |
Jonathan | Fitch-Jones | 1990-12-12 | 111223333 | | | |
As a human, I may feel confident that these descriptions – which I’ll call “patient descriptions” – all represent one and the same person. But as a machine, how would I arrive at the same conclusion?
The answer can’t be that the descriptions all share a common key, because they don’t. They don’t share a common surrogate key, because they’re derived from several different upstream data stores. And they don’t share a common natural key either, such as the composite attribute comprising birthdate and social security number (SSN), because any given attribute is liable to be absent or incorrect in some of the upstream data.
This highlights an inconvenient truth about would-be natural keys in healthcare data: there is no fixed combination of PII attributes that can be relied on to identify patients across all datasets. Any proposed combination of PII attributes will suffer from both false negatives (incorrect judgments that two records aren’t co-referential) and false positives (incorrect judgments that two records are co-referential).
Let’s dig into those two problem types a bit more.
False negatives
In our context, a false negative is a missing link: an incorrect judgment that two records do not refer to the same real-world person. False negatives afflict any proposed natural key, because any given PII attribute in a patient record may inaccurately represent the real-world person to whom the record refers.
Returning to our earlier example, imagine we are considering the natural key comprising birthdate and SSN, and note that these two records cannot be assigned a key using that scheme:
First name | Last name | Birthdate | SSN | Employee ID | Zip code | Address |
Jonathan | Fitch-Jones | 1990-12-12 | 3333 | | 94610 | |
Jonathan | Fitch-Jones | 1990-12-12 | | STKR-42 | 94610 | |
These records don’t contain valid SSNs, so we can’t identify them as co-referential using a rule that involves SSNs. In effect, the records will be treated as singletons, standing fruitlessly outside every longitudinal patient dataset.
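To make this concrete, here is a minimal Python sketch of a key built from birthdate and SSN. The dictionary field names, the `natural_key` helper, and the rule that a valid SSN is exactly nine digits are all assumptions made for illustration; they are not a description of any production system.

```python
import re
from typing import Optional, Tuple

def natural_key(record: dict) -> Optional[Tuple[str, str]]:
    """Return (birthdate, ssn), or None if either field is absent or malformed."""
    birthdate = record.get("birthdate") or ""
    ssn = record.get("ssn") or ""
    # Assumed validity checks: ISO-style date and a nine-digit SSN.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", birthdate):
        return None
    if not re.fullmatch(r"\d{9}", ssn):
        return None
    return (birthdate, ssn)

records = [
    # The two rows from the table above.
    {"first": "Jonathan", "last": "Fitch-Jones", "birthdate": "1990-12-12",
     "ssn": "3333", "zip": "94610"},
    {"first": "Jonathan", "last": "Fitch-Jones", "birthdate": "1990-12-12",
     "employee_id": "STKR-42", "zip": "94610"},
    # A row with a complete SSN, taken from the first table, for contrast.
    {"first": "Jonathan", "last": "Fitch-Jones", "birthdate": "1990-12-12",
     "ssn": "111223333"},
]

keys = [natural_key(r) for r in records]
unkeyed = [r for r, k in zip(records, keys) if k is None]
```

Under these assumptions, the first two records receive no key at all and end up in `unkeyed`, so nothing downstream can group them with the third.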
This example shows that, if a proposed natural key uses a field that can be absent or malformed, then that key will be subject to false negatives. The same holds, too, when a proposed natural key uses a field that can be inaccurate, as illustrated here:
First name | Last name | Birthdate | SSN | Employee ID | Zip code | Address |
Jon | Fitch | 1990-12-12 | 111222333 | STKR-42 | 94611 | 123 Oak Drive Unit 6 |
Jon | Fitch | 1990-12-12 | 777445555 | STKR-42 | 94611 | 123 Oak Drive Unit 6 |
Jon | Fitch | 1990-11-12 | 111223333 | STKR-42 | 94611 | 123 Oak Drive No. 6 |
John | Fitch | 1990-11-12 | 111223333 | STKR-42 | 94610 | |
Let’s say that Jon’s true birthdate is 1990-12-12, and his true SSN is 111223333. In the first of these records, his SSN is wrong. In the second, we may suppose his SSN has been mistakenly replaced by that of his fraternal twin, Jean. Finally, in the third and fourth, his birthdate is wrong. (Such mistakes are surprisingly common in real-world data.) Therefore, if we have agreed to use the natural key comprising birthdate and SSN as our method for identifying co-referential records, then we will again confront false negatives when it comes to matching these records with Jon’s others.
Similar thought experiments apply to any proposed natural key. This motivates our general conclusion that any natural key will suffer from false negatives.
False positives
What, then, of false positives? In our context, these are spurious links: incorrect judgments that two records do indeed refer to the same real-world person. False positives afflict any proposed natural key, because, as discussed above, any given PII attribute can be inaccurate.
To see this clearly, let’s juxtapose one of Jon Fitch’s records with a record pertaining to his fraternal twin, Jean:
First name | Last name | Birthdate | SSN | Employee ID | Zip code | Address |
Jon | Fitch | 1990-12-12 | 777445555 | STKR-42 | 94611 | 123 Oak Drive Unit 6 |
Jean | Fitch | 1990-12-12 | 777445555 | STKR-43 | 94611 | 123 Oak Drive Unit 6 |
In the first record, we may suppose Jon’s SSN has been incorrectly replaced by Jean’s. (Again, this is surprisingly common in real-world data.) Realistically, we may also suppose that many other PII attributes are genuinely similar between the twins. Therefore, if we have agreed to use the natural key comprising birthdate and SSN as our method for identifying co-referential records, then we will fall victim to the false-positive judgment that Jon’s and Jean’s records both represent the same person.
As above, similar thought experiments apply to any proposed natural key, motivating our general conclusion that any natural key will suffer from false positives.
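The collision between the twins can be reproduced with a few lines of Python. This is only a sketch: the `id` labels and field names are hypothetical, and grouping with a dictionary stands in for whatever join or group-by a real pipeline would use.

```python
from collections import defaultdict

records = [
    {"id": "jon-1", "first": "Jon", "last": "Fitch", "birthdate": "1990-12-12",
     "ssn": "777445555", "employee_id": "STKR-42"},
    {"id": "jean-1", "first": "Jean", "last": "Fitch", "birthdate": "1990-12-12",
     "ssn": "777445555", "employee_id": "STKR-43"},
]

# Group record ids by the (birthdate, SSN) natural key.
groups = defaultdict(list)
for record in records:
    groups[(record["birthdate"], record["ssn"])].append(record["id"])

# Any key shared by more than one record is treated as a single person.
collisions = {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Here `collisions` holds one entry containing both `jon-1` and `jean-1`: the key has merged two different people, even though their first names and employee IDs disagree.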
Consequences
False negatives and false positives interfere with the goal of composing individual healthcare records into a longitudinal view of a patient’s journey through the health system.
For employees at Included Health, such as data scientists and business-intelligence analysts, this leads to inaccurate results: under-counting or over-counting of patients in specific groups, missed insights about the efficacy of medical interventions, and so forth.
And more importantly, false negatives and false positives lead to a degraded experience in our external application and internal tools. Our users may not see all the personal data they expected, and (I shudder at the thought) may even see someone else’s private data, in violation of HIPAA, the landmark federal law that sets the standard for protection of sensitive healthcare records in the United States. Meanwhile, the on-staff providers serving those users may be hamstrung by their lack of access to trustworthy information.
We want to minimize false negatives and false positives. And to minimize anything, we first need to measure it. But rather than measuring the absolute quantity of incorrect judgments, we will leverage two metrics commonly applied to classification problems:
- Recall: the fraction of genuine matches that are predicted, i.e. true positives divided by the sum of true positives and false negatives
- Precision: the fraction of predicted matches that are genuine, i.e. true positives divided by the sum of true positives and false positives
Precision and recall tend to pull in opposite directions, with improvement in one coming at the other’s expense. To obtain a truly useful longitudinal view over healthcare data, we need a solution that scores highly on both axes.
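One common way to apply these metrics to record linkage is to count pairs of records: a pair is a predicted match if a solution puts both records in the same cluster, and a genuine match if both truly belong to the same person. The sketch below assumes that pairwise framing, uses hypothetical record ids, and treats an empty denominator as a score of 1.0, which is a convention chosen here rather than anything stated above.

```python
from itertools import combinations

def pairs(clusters):
    """All unordered pairs of record ids that share a cluster."""
    result = set()
    for cluster in clusters:
        result.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return result

def precision_recall(predicted, truth):
    """Pairwise precision and recall of predicted clusters against true ones."""
    predicted_pairs, true_pairs = pairs(predicted), pairs(truth)
    tp = len(predicted_pairs & true_pairs)   # genuine matches that were predicted
    fp = len(predicted_pairs - true_pairs)   # spurious links
    fn = len(true_pairs - predicted_pairs)   # missing links
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Ground truth: three records for Jon, one for Jean.
truth = [{"jon-1", "jon-2", "jon-3"}, {"jean-1"}]
# A flawed prediction: one of Jon's records is attached to Jean instead.
predicted = [{"jon-1", "jon-2"}, {"jon-3", "jean-1"}]

precision, recall = precision_recall(predicted, truth)
```

In this toy case there is one true positive, one false positive, and two false negatives, so precision is 1/2 and recall is 1/3.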
Solutions
As we have seen, neither integer-like surrogate keys nor composite natural keys can be relied on to identify patients across datasets from different sources. We need something more clever than “they have the same key” to enable a machine to recognize co-referential records.
Disjunctive rule-based matching
Let’s start with a simple generalization of the natural-keys idea, which Included Health used in production for some time.
To combat false negatives, it might help to consider two records as co-referential when they match on one or more natural keys. For example, if we allow matches on either last name and employee ID and zip code, or last name and employee ID and SSN, then all of the following records will be regarded as co-referential:
First name | Last name | Birthdate | SSN | Employee ID | Zip code | Address |
Jon | Fitch | 1990-12-12 | 111223333 | STKR-42 | 94611 | 123 Oak Dr. Apt. 6 |
Jon | Fitch | 1990-12-12 | 111222333 | STKR-42 | 94611 | 123 Oak Drive Unit 6 |
Jon | Fitch | 1990-12-12 | 777445555 | STKR-42 | 94611 | 123 Oak Drive Unit 6 |
Jon | Fitch | 1990-11-12 | 111223333 | STKR-42 | 94611 | 123 Oak Drive No. 6 |
John | Fitch | 1990-11-12 | 111223333 | STKR-42 | 94610 | |
This approach can be codified using a simple rules engine, where each rule is of the form “if attributes A through Z match exactly, then regard the records as co-referential”. The system will match two records if the disjunction of the antecedents of all the rules is satisfied, or in other words, whenever the “if” condition evaluates to true for one or more rules.
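A rules engine of this kind might be sketched as follows. The rule list mirrors the two example keys above; the field names, the helper functions, and the use of a union-find structure to group records transitively are assumptions for illustration, not a description of the system Included Health ran in production.

```python
# Each rule is a tuple of fields that must all match exactly.
RULES = [
    ("last", "employee_id", "zip"),
    ("last", "employee_id", "ssn"),
]

def rule_matches(a, b, fields):
    # A missing or empty field never counts as a match.
    return all(bool(a.get(f)) and a.get(f) == b.get(f) for f in fields)

def is_match(a, b):
    # Disjunction: the pair matches if any single rule is satisfied.
    return any(rule_matches(a, b, fields) for fields in RULES)

def link(records):
    """Group record indices into clusters, merging matched pairs transitively."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if is_match(records[i], records[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

records = [
    {"last": "Fitch", "employee_id": "STKR-42", "zip": "94611", "ssn": "111223333"},
    {"last": "Fitch", "employee_id": "STKR-42", "zip": "94611", "ssn": "111222333"},
    {"last": "Fitch", "employee_id": "STKR-42", "zip": "94611", "ssn": "111223333"},
    {"last": "Fitch", "employee_id": "STKR-42", "zip": "94610", "ssn": "111223333"},
    # Jean's record, which neither rule should link to Jon's.
    {"last": "Fitch", "employee_id": "STKR-43", "zip": "94611", "ssn": "777445555"},
]

clusters = link(records)
```

The fourth record shares a zip code with none of the others, yet it still joins Jon’s cluster through the SSN rule, while Jean’s record stays on its own. Note that the all-pairs comparison is quadratic in the number of records, so it is only suitable for a toy example like this one.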
A problem with this approach is that, to capture all the true matches, any adequate system of rules is likely to be large and unwieldy. Furthermore, it can be difficult, even for healthcare-data experts, to invent rules clever enough to cover all the edge cases. In our experience, as our pool of data grew in scope and size, it became increasingly troublesome to hand-roll novel natural keys that promised to reduce false negatives without at the same time increasing false positives.
When it comes to combating false positives, as opposed to false negatives, disjunctive rule-based matching is no help at all. In fact, it can even exacerbate the problem by introducing more chances for key collisions. For example, even if two records don’t collide on last name and employee ID and zip code, they may collide on last name and employee ID and SSN.
To forestall this, perhaps escape clauses could be established, stating for instance that two records that match on last name and employee ID and SSN should not, in fact, be regarded as co-referential if they explicitly disagree on several other fields, such as first name, birthdate, address, and so forth. Again, though, it’s tricky to hand-roll the right rules. (Are “Jon” and “John” different first names, or not? How equivalent is “Sam” to “Samuel” or “Samantha”?) Furthermore, as the natural-key schemes and escape clauses grow in number, their interactions tend to become complex and unmanageable.
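The difficulty shows up quickly in code. The following sketch bolts a single escape clause onto one rule; the field names, the `MAX_DISAGREEMENTS` threshold, and the choice of exact string comparison are hypothetical, picked only to show how easily such a clause misfires.

```python
KEY_FIELDS = ("last", "employee_id", "ssn")
ESCAPE_FIELDS = ("first", "birthdate", "address")
MAX_DISAGREEMENTS = 1  # hypothetical threshold: veto on two or more conflicts

def disagreements(a, b, fields=ESCAPE_FIELDS):
    """Count fields present in both records whose values differ exactly."""
    return sum(1 for f in fields if a.get(f) and b.get(f) and a[f] != b[f])

def matches_with_escape(a, b):
    key_match = all(bool(a.get(f)) and a.get(f) == b.get(f) for f in KEY_FIELDS)
    return key_match and disagreements(a, b) <= MAX_DISAGREEMENTS

jon = {"first": "Jon", "last": "Fitch", "birthdate": "1990-12-12",
       "ssn": "111223333", "employee_id": "STKR-42",
       "address": "123 Oak Drive Unit 6"}
jon_abbreviated = dict(jon, address="123 Oak Dr. Apt. 6")
john = {"first": "John", "last": "Fitch", "birthdate": "1990-11-12",
        "ssn": "111223333", "employee_id": "STKR-42"}

kept = matches_with_escape(jon, jon_abbreviated)   # one conflict: address
vetoed = not matches_with_escape(jon, john)        # two conflicts: name, birthdate
```

With exact comparison, “Jon” versus “John” counts as a disagreement, and together with the mistyped birthdate it is enough to veto what is really a true match. The clause meant to remove false positives has created a false negative, and tuning the threshold or adding nickname tables only moves the problem around.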
Envoi
I hope you enjoyed this introduction to the problem of identifying patients in healthcare data. Be sure to stick around for Part 2, where we’ll dive into Included Health’s current ML-based solution. If you found this topic exciting, reach out to us or visit our careers page. We’re hiring!
Authors
This post was originally written by Cole Leahy, former software engineer at Included Health, and co-authored by Nick Gorski, Angelo Sisante, and Vinay Goel. Together, they work on modeling and integration of data from heterogeneous sources, producing trustworthy data assets that support online services and drive insights on data science and analytics teams.