How to Remove Duplicate Data with Pandas
The great thing about Python is the packages that are available for processing data. Currently, there are over 200,000 packages on PyPI, with more being added every day. One of those packages is pandas.
Pandas describes itself on PyPI as the most powerful and flexible open-source data analysis and manipulation tool available in any language. Along with NumPy, pandas is usually one of the first packages downloaded when beginning to work with datasets for the first time.
Datasets come in all shapes and sizes. Sometimes the same types of data come from different sources, which can lead to errors in the data. These errors include incomplete data, missing data, and duplicated data, to name a few.
Pandas has two primary data structures, the DataFrame and the Series. Today's article concerns data as a DataFrame. Picture a scenario that may be familiar to many of you.
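As a quick sketch of the difference (the names and values here are purely illustrative), a Series is a single one-dimensional labeled column, while a DataFrame is a two-dimensional table of such columns:

```python
import pandas as pd

# A Series: a one-dimensional labeled array (a single column of data).
names = pd.Series(["Ada", "Grace", "Ada"])

# A DataFrame: a two-dimensional table of labeled columns.
subscribers = pd.DataFrame({
    "first_name": ["Ada", "Grace", "Ada"],
    "email": ["ada@example.com", "grace@example.com", "ada@example.com"],
})

print(type(names).__name__)        # Series
print(type(subscribers).__name__)  # DataFrame
```

A .csv file loaded with pandas becomes a DataFrame, which is why that structure is the focus here.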
You have a .csv file of email subscribers consisting of a first name and email address. These addresses come from several sources you have in place on the internet. Because of the multiple sources, there is a good chance that the same email address appears in more than one of them.
Now, what you want is a clean email list with no duplicate entries.
How do you do that? Pandas. In just four lines of code, pandas can remove the duplicates and export the list to a new .csv file.
import pandas as pd

df = pd.read_csv(r"path_to\email_list.csv")
df = df[~df.duplicated(subset='email')]
df.to_csv(r"path_to\email_list_unique.csv", index=False)
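An equivalent and arguably more readable approach uses DataFrame.drop_duplicates, which by default keeps the first occurrence of each duplicated value. A minimal sketch, using made-up sample data in place of the subscriber file:

```python
import pandas as pd

# Hypothetical sample data standing in for email_list.csv.
df = pd.DataFrame({
    "first_name": ["Ada", "Grace", "Ada"],
    "email": ["ada@example.com", "grace@example.com", "ada@example.com"],
})

# keep='first' (the default) retains the first row for each duplicated email.
unique = df.drop_duplicates(subset="email", keep="first")

print(len(unique))  # 2
```

Passing keep='last' would instead retain the most recent occurrence, which can be useful when later sources hold fresher subscriber details.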
You are now the proud owner of a file of unique email subscribers.
If you enjoy reading stories like these and want to support me as a writer, consider subscribing to Medium for $5 a month. As a member, you have unlimited access to stories on Medium. If you sign up using my link, I’ll earn a small commission.