DataFrame.
duplicated
Return boolean Series denoting duplicate rows.
Considering certain columns is optional.
Only consider certain columns for identifying duplicates, by default use all of the columns.
Determines which duplicates (if any) to mark.
first : Mark duplicates as True except for the first occurrence.
first
True
last : Mark duplicates as True except for the last occurrence.
last
False : Mark all duplicates as True.
Boolean series for each duplicated rows.
See also
Index.duplicated
Equivalent method on index.
Series.duplicated
Equivalent method on Series.
Series.drop_duplicates
Remove duplicate values from Series.
DataFrame.drop_duplicates
Remove duplicate values from DataFrame.
Examples
Consider dataset containing ramen rating.
>>> df = pd.DataFrame({ ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
By default, for each set of duplicated values, the first occurrence is set on False and all others on True.
>>> df.duplicated() 0 False 1 True 2 False 3 False 4 False dtype: bool
By using ‘last’, the last occurrence of each set of duplicated values is set on False and all others on True.
>>> df.duplicated(keep='last') 0 True 1 False 2 False 3 False 4 False dtype: bool
By setting keep on False, all duplicates are True.
keep
>>> df.duplicated(keep=False) 0 True 1 True 2 False 3 False 4 False dtype: bool
To find duplicates on specific column(s), use subset.
subset
>>> df.duplicated(subset=['brand']) 0 False 1 True 2 False 3 True 4 True dtype: bool