It’s a puzzle, it’s an algorithm, it’s deduplication


The following is a guest post by Richard Ngethe, MBA, and Jonathan Friedman, MS.

According to a UNAIDS study of Brazil, an estimated 140 million records in the national user database, which includes all health records, represented only 100 million unique patients. The number of patients was therefore overstated by 40%.[1]

This is the kind of puzzle we can get our heads around. Obviously, in our field, which involves applying business intelligence strategies to curb transmission of HIV, this digital health data puzzle has to be solved, because an overstatement of 40% in clinical records is going to mask the true incidence of HIV infection in a population and skew any interventions to address the epidemic.


Here’s how we tackle duplicate client records at Data.FI, a USAID-funded project that strengthens the analysis and use of data to accelerate HIV and COVID-19 epidemic control.

For example, to improve digital health data and boost HIV care in South Africa, we supported the deduplication of more than 20 million records using machine learning. Comparing every record in the database with every other record would have meant considering roughly 200 trillion record pairs, and any deduplication algorithm would have taken years to run. Indexing techniques reduced the number of record pairs to the millions, and the algorithm ran in a matter of days.

First, how do duplicate records arise? Imagine a person arrives at a clinic for an HIV test. A record is created, the test is positive, and antiretroviral treatment is started.

The next month, the person seeks another supply of antiretroviral treatment at a different clinic several kilometers away from her hometown while visiting family. A new record is created. Several months later, after her marriage, she visits the first clinic for another supply of medication and gives her married name. A third record is created.

One patient, three records. Clearly, we are not going to understand this person’s trajectory of HIV treatment, and we will likely overstate the HIV infection rate in the area, because one patient counts as three in the database.

One solution is to establish clear, national, standardized registration practices for all clients of a health system with processes understood and valued by all health providers. With training and good governance, a health system can institute these norms.

But, meanwhile, what about the duplicate records that are skewing the numbers and negatively affecting work to fight HIV? Consider this: to find every duplicate among 1,000 patient records, you could need to compare each record with every other record. The number of possible comparison pairs would be close to half a million. With 10,000 patient records, the number of comparisons for the algorithm zooms to about 50 million.
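The arithmetic behind these figures is the number of unordered pairs among n records, n × (n − 1) / 2. A quick check in Python (the function name `pair_count` is ours, chosen for illustration):

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (1_000, 10_000, 20_000_000):
    print(f"{n:>12,} records -> {pair_count(n):,} pairs")
```

The pair count grows with the square of the number of records, which is why a database of 20 million records reaches the hundreds of trillions.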

Obviously, these calculations would take human beings years to complete—a huge resource drain if done manually. That is why, for deduplication, we turn to indexing, blocking, algorithms, machine learning, and some strategic human input.

Deduplication takes place at the highest level where health records are pooled. There are several stages:

  • Preprocessing
  • Indexing
  • Comparing
  • Classifying

Preprocessing is the cleaning and standardization of the data fields in each record so we can compare them. It’s the labor-intensive removal of differences that requires human decision making.
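As a rough illustration of what standardizing fields can look like, here is a minimal sketch in Python. The cleaning rules and the list of accepted date formats are assumptions made for the example, not the rules Data.FI applies; in practice those decisions are exactly the human judgment calls described above.

```python
import re
import unicodedata
from datetime import datetime
from typing import Optional

# Assumed input formats, for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def normalize_name(raw: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_name("  MARÍA  Nkosi-Dlamini "))
print(normalize_date("05/03/1990"))
```

Returning None for an unparseable date, rather than guessing, leaves the value for a person to review.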

Indexing and blocking arrange the data more efficiently.

  • One way to index is to sort data alphabetically so that related information is arranged more closely together. Sorting alphabetically without incorporating other considerations will struggle, for example, to catch duplicates where names have spelling mistakes or are commonly spelled differently. Phonetic encoding instead indexes client names by their pronunciation, which smooths over possible spelling differences.
  • Blocking reduces the search space by comparing only records that share a value, for example gender: duplicate client records rarely list a client as male in one instance and female in another.
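The two ideas above can be combined in a short Python sketch. It assumes a simplified Soundex as the phonetic encoding and a block key of gender plus surname code; the post does not say which encoding or block keys Data.FI uses, and the records and field names here are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Soundex digit for each consonant; vowels and other letters have no code.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]


def candidate_pairs(records):
    """Pair up only records that share a (gender, surname code) block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[(rec["gender"], soundex(rec["surname"]))].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Invented records, for illustration only.
RECORDS = [
    {"id": 1, "gender": "F", "surname": "Mokoena"},
    {"id": 2, "gender": "F", "surname": "Mokwena"},
    {"id": 3, "gender": "M", "surname": "Mokoena"},
    {"id": 4, "gender": "F", "surname": "Dlamini"},
    {"id": 5, "gender": "F", "surname": "Dhlamini"},
]

print(candidate_pairs(RECORDS))
```

Five records have ten possible pairs, but only the pairs inside a block are passed on for comparison. The trade-off is that a true duplicate with a wrong gender or a very different surname falls into another block and is never compared.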

The next stage is comparing. The deduplication algorithm calculates similarity scores between fields such as names, locations, or birth dates and pulls likely matches among records. These scores are what we use in the final step, classifying the likely matches identified.
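A minimal sketch of field-by-field comparison, assuming Python’s standard-library `difflib` as the string similarity measure. The post does not name the measure Data.FI uses, and the field names and records below are hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Similarity between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(rec_a: dict, rec_b: dict) -> dict:
    """Return one similarity score per field for a candidate pair."""
    return {
        "first_name": string_similarity(rec_a["first_name"], rec_b["first_name"]),
        "surname": string_similarity(rec_a["surname"], rec_b["surname"]),
        # Dates are compared exactly here; a real system might allow
        # for swapped day and month or a single mistyped digit.
        "birth_date": 1.0 if rec_a["birth_date"] == rec_b["birth_date"] else 0.0,
    }


# Invented records, for illustration only.
first_visit = {"first_name": "Thandi", "surname": "Mokoena", "birth_date": "1990-03-05"}
second_visit = {"first_name": "Thandi", "surname": "Mokwena", "birth_date": "1990-03-05"}

print(compare(first_visit, second_visit))
```

Each candidate pair becomes a small vector of scores, one per field, which the classification step then turns into a decision.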

For classifying, we use three main strategies:

  • Deterministic matching. We call it a match if a record duplicates information exactly, such as a unique ID. An exact match has a high positive predictive value: it is very likely a true duplicate.
  • Probabilistic matching. This technique considers how likely it is that two fields will be highly similar for any random pair of records. It’s preferred to deterministic matching because it’s more sensitive and flexible (it casts a wider net).
  • Supervised machine learning. This is less manual than the other strategies but relies on human input at the beginning. Experts review a set of candidate pairs and identify which are duplicates. The machine learning then seeks to replicate and improve upon human decisions on a new (but similar) dataset.
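The first two strategies can be sketched in a few lines of Python. The probabilistic part follows the widely used Fellegi-Sunter approach, which is our choice for the example and not something the post attributes to Data.FI. Every number here is an invented assumption: `m` is the chance a field agrees when two records belong to the same person, `u` is the chance it agrees for a random pair, and the two thresholds are arbitrary.

```python
import math

# Hypothetical (m, u) probabilities per field, for illustration only.
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "birth_date": (0.97, 0.003),
}
UPPER, LOWER = 8.0, 0.0  # assumed decision thresholds


def deterministic_match(rec_a: dict, rec_b: dict) -> bool:
    """Exact agreement on a unique ID (hypothetical field name)."""
    uid = rec_a.get("unique_id")
    return bool(uid) and uid == rec_b.get("unique_id")


def match_weight(agreements: dict) -> float:
    """Sum of log-likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if agreements[field]:
            total += math.log2(m / u)  # agreement is evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement counts against
    return total


def classify(rec_a: dict, rec_b: dict, agreements: dict) -> str:
    if deterministic_match(rec_a, rec_b):
        return "match"
    weight = match_weight(agreements)
    if weight >= UPPER:
        return "match"
    if weight <= LOWER:
        return "non-match"
    return "possible match"  # left for human review


# The married-name case: surname differs, first name and birth date agree.
print(classify({}, {}, {"surname": False, "first_name": True, "birth_date": True}))
```

With these assumed numbers, agreement on a rare value such as a birth date carries more weight than agreement on a name, so a pair can still be classified as a match after a change of surname. Pairs that fall between the two thresholds are the ones worth sending to a reviewer.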

Of course, after the machine learning takes place, humans need to review the results to ensure the deduplicated database is accurate. Merging, the last step, can occur at the above-site level, where data are aggregated, and also at the site level, where care is provided. A deduplicated database of clinic records provides a true accounting of the epidemic in a locale and enhances understanding of what services the clinic should provide to meet the needs of its client base. A merged record consolidates a patient’s history in a single file, fully apprising a caregiver so they can make the correct clinical decisions and thereby improve patient service and outcomes.
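Before records can be merged, the confirmed matching pairs have to be grouped by patient. The post does not describe how this is done; one common way, shown here purely as an assumption, is to treat matches as links and collect every chain of linked records into one group (a union-find structure). The record IDs are invented.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group record IDs so that every chain of matches forms one patient."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the group's representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(group) for group in clusters.values())


# Records 1, 2 and 3 are the same patient; record 4 is someone else.
print(cluster_matches([1, 2, 3, 4], [(1, 2), (2, 3)]))
```

In the earlier example of one patient with three records, records 1 and 3 may never have been matched directly, yet they end up in the same group because both match record 2. That same chaining can also spread a single wrong match, which is one more reason for the human review described above.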

Richard Ngethe, MBA, is digital advisor for Data.FI, Palladium.

Jonathan Friedman, MS, is Senior Technical Advisor in Data Science, Data.FI, Palladium.

Data.FI is a global project that helps countries improve their data systems to strengthen prevention, testing, treatment, and lab services to end the HIV epidemic and to combat COVID-19.

[1] Retrieved from

2 thoughts on “It’s a puzzle, it’s an algorithm, it’s deduplication”

  1. Jane

    Good article and the way to actually go. The challenge is when a country has different databases handling the same programme interventions. It’s possible if a country has a unique patient number used for all health interventions, no matter the place.
    Infrastructure has also been a big barrier to having a unified way of identification.
    There is also a challenge in training healthcare workers and in their reception to adopting the available EMR systems, especially in public health care facilities.
    Currently, patients are changing their unique identification to what might suit them.

  2. Dr. Timur Aptekar

    Very good article. I have a few comments. 1) Deduplication is relevant for large countries such as Brazil, China, India, etc. Other countries may check their databases and must do it manually on a quarterly basis. 2) I disagree that deduplication needs to be done at the highest level only. The regional level has fewer patients to check, duplication is more frequent at the regional level, and manual checking will be more productive there, as regional staff have access to more information and will receive additional information much faster.

