# Dealing with Duplicates

### Counting Duplicates

In order to count all the duplicated rows (every occurrence after the first) we do:

```python
ds.duplicated().sum()
```

In order to count all the duplicated values with respect to a certain subset of fields we do:

```python
ds.duplicated(subset=['age', 'zip_code']).sum()
```

In order to count duplicates with respect to a certain column we can do:

```python
ds['column_name'].duplicated().sum()
```
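The three counting variants above can be checked on a small hypothetical DataFrame (the column names and values here are just for illustration):

```python
import pandas as pd

# Hypothetical sample data: row 1 repeats row 0 exactly,
# and row 3 repeats rows 0-1 on (age, zip_code) but has a different name.
ds = pd.DataFrame({
    "age":      [25, 25, 30, 25],
    "zip_code": ["10001", "10001", "20002", "10001"],
    "name":     ["Ann", "Ann", "Bob", "Cleo"],
})

# Full-row duplicates: only row 1 is an exact repeat.
print(ds.duplicated().sum())                             # 1

# Duplicates on a subset of fields: rows 1 and 3 repeat (25, "10001").
print(ds.duplicated(subset=["age", "zip_code"]).sum())   # 2

# Duplicates in a single column: the value 25 repeats twice after row 0.
print(ds["age"].duplicated().sum())                      # 2
```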

### Visualizing Duplicates

In order to view all the duplicates we can do:

```python
ds.loc[ds.duplicated(keep='last'), :]
```

where `keep='last'` flags every occurrence of a duplicated row except the last one, so the last encountered instance is the one left out of the result. Passing `keep=False` instead flags every occurrence of a duplicated row.
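A minimal sketch of the difference between the `keep` options, using hypothetical data:

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 are exact duplicates, row 2 is unique.
ds = pd.DataFrame({"age": [25, 25, 30], "zip_code": ["10001", "10001", "20002"]})

# keep='last': all occurrences except the last are flagged -> shows row 0.
print(ds.loc[ds.duplicated(keep="last"), :])

# keep=False: every occurrence of a duplicated row is flagged -> shows rows 0 and 1.
print(ds.loc[ds.duplicated(keep=False), :])
```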

### Enumerating Dataset Duplicates

Given the subset of duplicated rows extracted by doing:

```python
ds = ds.loc[ds.duplicated(keep='last'), :]
```

We can count how many times each of those is repeated with:

```python
dfanalysis = ds.groupby(['ColA', 'ColB']).size().reset_index(name='count')
# we can also use groupby(list(ds)) if we want to group by all the columns
```

We can order the results in descending order and save them to a CSV file by doing:

```python
dfanalysis.sort_values(by='count', ascending=False).to_csv('duplicated_count_stats.csv', index=False)
```
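The grouping step above can be sketched end to end on hypothetical data (`ColA` and `ColB` are placeholder column names, as in the snippet):

```python
import pandas as pd

# Hypothetical duplicated rows: ('x', 1) appears twice, ('y', 2) once.
ds = pd.DataFrame({"ColA": ["x", "x", "y"], "ColB": [1, 1, 2]})

# size() counts rows per group; reset_index turns the result back
# into a flat DataFrame with a named 'count' column.
dfanalysis = ds.groupby(["ColA", "ColB"]).size().reset_index(name="count")

# Most-repeated combinations first.
print(dfanalysis.sort_values(by="count", ascending=False))
```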

### Removing Duplicates

To remove duplicates and keep only the first encountered instance of each row we do:

```python
ds.drop_duplicates(keep='first')
```

To remove duplicates and keep only the last encountered instance of each row we do:

```python
ds.drop_duplicates(keep='last')
```

To remove duplicates with respect to a subset of fields:

```python
ds.drop_duplicates(subset=['age', 'zip_code'])
```
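Note that `drop_duplicates` returns a new DataFrame rather than modifying `ds` in place, so the result must be assigned back. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 are exact duplicates.
ds = pd.DataFrame({"age": [25, 25, 30], "zip_code": ["10001", "10001", "20002"]})

# drop_duplicates returns a copy; assign it back to keep the result.
first_kept = ds.drop_duplicates(keep="first")  # keeps rows 0 and 2
last_kept = ds.drop_duplicates(keep="last")    # keeps rows 1 and 2

print(first_kept)
print(last_kept)
```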