
Dealing with Duplicates

Counting Duplicates

To count all the duplicated rows in a DataFrame:

ds.duplicated().sum()
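As a minimal sketch of how this works (the DataFrame and its columns below are made up for illustration):

```python
import pandas as pd

# Hypothetical data: the first and third rows are identical.
ds = pd.DataFrame({
    "age": [25, 30, 25, 40],
    "zip_code": ["10001", "20002", "10001", "30003"],
})

# duplicated() marks each repeat of an earlier row as True;
# summing the boolean mask gives the number of duplicated rows.
n_dupes = ds.duplicated().sum()
print(n_dupes)  # 1: the second occurrence of (25, "10001")
```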

To count duplicates with respect to a subset of columns:

ds.duplicated(subset=['age', 'zip_code']).sum()
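For instance, with a hypothetical frame where two rows agree on `age` and `zip_code` but differ elsewhere:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share age and zip_code but have different names.
ds = pd.DataFrame({
    "name": ["ann", "bob", "cat"],
    "age": [25, 30, 25],
    "zip_code": ["10001", "20002", "10001"],
})

# Only the columns listed in subset are compared; the rest are ignored.
n_subset_dupes = ds.duplicated(subset=["age", "zip_code"]).sum()
print(n_subset_dupes)  # 1, even though no full row is duplicated
```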

To count duplicates in a single column:

ds['column_name'].duplicated().sum()
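A quick sketch with an assumed column (`age` here is just an example name):

```python
import pandas as pd

# Hypothetical data: the value 30 appears three times in "age".
ds = pd.DataFrame({"age": [25, 30, 30, 30, 40]})

# Every repeat after the first occurrence of a value counts as a duplicate.
n_col_dupes = ds["age"].duplicated().sum()
print(n_col_dupes)  # 2
```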

Visualizing Duplicates

To view all the duplicated rows:

ds.loc[ds.duplicated(keep='last'), :]

where keep='last' marks every occurrence except the last one as a duplicate, so all but the last instance of each duplicated row are shown.
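A small sketch of this behaviour (the data is made up):

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical.
ds = pd.DataFrame({
    "age": [25, 25, 30],
    "zip_code": ["10001", "10001", "20002"],
})

# keep='last' flags every occurrence except the last,
# so the earlier copy (row 0) is the one displayed here.
dupes = ds.loc[ds.duplicated(keep="last"), :]
print(dupes.index.tolist())  # [0]
```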

Enumerating Dataset Duplicates

Given the subset of duplicated rows extracted by doing:

ds = ds.loc[ds.duplicated(keep='last'), :]

We can count how many times each row is repeated with:

dfanalysis = ds.groupby(['ColA','ColB']).size().reset_index(name='count')
# we can also use groupby(list(df)) if we want to group by all the columns

We can sort the results in descending order and save them to a CSV file with:

dfanalysis.sort_values(by="count", ascending=False).to_csv("duplicated_count_stats.csv", index=False)
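Putting the enumeration and sorting together on a made-up frame (the CSV export is omitted here; `to_csv` simply writes the sorted table to disk):

```python
import pandas as pd

# Hypothetical duplicated rows: (x, 1) appears three times, (y, 2) once.
ds = pd.DataFrame({
    "ColA": ["x", "x", "y", "x"],
    "ColB": [1, 1, 2, 1],
})

# size() counts the rows in each (ColA, ColB) group;
# reset_index turns the group keys and counts back into columns.
dfanalysis = ds.groupby(["ColA", "ColB"]).size().reset_index(name="count")
dfanalysis = dfanalysis.sort_values(by="count", ascending=False)
print(dfanalysis.iloc[0]["count"])  # 3
```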

Removing Duplicates

To remove duplicates and keep only the first occurrence of each row (note that drop_duplicates returns a new DataFrame, so assign the result back if needed):

ds.drop_duplicates(keep = 'first')

To remove duplicates and keep only the last occurrence of each row:

ds.drop_duplicates(keep = 'last')

To remove duplicates with respect to a subset of fields:

ds.drop_duplicates(subset = ['age', 'zip_code'])
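A combined sketch of the removal options above, on a hypothetical frame:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are full duplicates,
# and row 2 matches them on age and zip_code only.
ds = pd.DataFrame({
    "name": ["ann", "ann", "bob"],
    "age": [25, 25, 25],
    "zip_code": ["10001", "10001", "10001"],
})

# drop_duplicates returns a new DataFrame; assign it back to keep the result.
kept_first = ds.drop_duplicates(keep="first")
kept_subset = ds.drop_duplicates(subset=["age", "zip_code"])

print(len(kept_first))   # 2: the second full duplicate is dropped
print(len(kept_subset))  # 1: only the first match on the subset survives
```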