1 + 1 = 1 or Record Deduplication with Python

3:45pm - 4:10pm on Friday, October 5 in Madison

Flávio Juvenal

Audience Level:: Intermediate
Slides:: http://bit.ly/pygotham-dup
Watch:: https://youtu.be/4O87RdBgRJ4

Overview

How do you find duplicate records in a dataset that lacks a unique identifier, such as the Social Security Number (SSN) for US citizens? The answer is to use Data Deduplication techniques: look for matches by cleaning and comparing attributes in a fuzzy way. In this talk, you’ll learn how to do this through Python examples.

Description

Record Deduplication, or more generally Record Linkage, is the task of finding which records refer to the same entity, like a person or a company. It’s used mainly when records lack a unique identifier, such as the Social Security Number for US citizens. Without one, you can’t trivially find duplicate records in a single dataset, nor easily link records from different datasets. Instead, record linkage looks for matches by cleaning and comparing record attributes in a fuzzy way. Imagine you have two datasets with information about people, but without any unique identifier in the records. You have to compare attributes like name, date of birth, and address in a smart way to find which records from the two datasets refer to the same person. A similar approach is used to dedupe records within a single dataset, so Record Deduplication is a kind of Record Linkage.
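As a rough illustration of "cleaning and comparing attributes in a fuzzy way", here is a minimal sketch that uses only the Python standard library. It is not the method presented in the talk. The helper names (`clean`, `similarity`, `record_score`, `find_duplicates`), the field names, the 0.85 threshold, and the sample records are all assumptions made up for this example.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def clean(value):
    """Lowercase, strip accents, and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def similarity(a, b):
    """Fuzzy similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, clean(a), clean(b)).ratio()


def record_score(r1, r2, fields=("name", "birth_date", "address")):
    """Average the per-field similarities (hypothetical field names)."""
    return sum(similarity(r1[f], r2[f]) for f in fields) / len(fields)


def find_duplicates(records, threshold=0.85):
    """Return index pairs whose average similarity reaches the threshold.

    The threshold is an arbitrary assumption; real data needs tuning.
    """
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if record_score(r1, r2) >= threshold
    ]


# Made-up sample records, for illustration only.
people = [
    {"name": "João Silva", "birth_date": "1990-01-01",
     "address": "123 Main St, New York"},
    {"name": "Joao  Silva", "birth_date": "1990-01-01",
     "address": "123 Main Street, New York"},
    {"name": "Maria Souza", "birth_date": "1985-07-12",
     "address": "9 Elm Ave, Boston"},
]

print(find_duplicates(people))
```

With this sample data, only the first two records clear the threshold. Comparing every pair of records is quadratic in the size of the dataset, so this sketch only suits small inputs.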

There are a number of important applications of data deduplication in government and business. For example, by deduping records from Census data, the Australian government found that there were 250,000 fewer people in the country than previously thought. This reduction affected the estimates of government agencies and even caused the revision of economic projections. Similarly, businesses can use record linkage techniques to enrich their customers’ data with publicly available datasets.

In this talk, you’ll learn through Python examples the main concepts of Record Deduplication: what kinds of problems it can solve, the most common workflow for the process, which algorithms are involved, and which tools and libraries you can use. Although some of the concepts discussed are related to data mining, any intermediate-level Python developer will be able to learn the basics of deduping data with Python.