bodo.pandas.BodoDataFrame.drop_duplicates
BodoDataFrame.drop_duplicates(
    subset=None,
    *,
    keep="first",
    inplace=False,
    ignore_index=False,
) -> BodoDataFrame
Return DataFrame with duplicate rows removed.
Currently, only the subset argument and the keep values 'first' (the default) and 'last' are supported.
All other uses will fall back to Pandas.
See pandas.DataFrame.drop_duplicates for more details.
Note
When subset is specified, BodoDataFrame.drop_duplicates is not guaranteed to produce exactly the same output as Pandas.
Instead, for each subset key, Bodo DataFrames returns the first or last row encountered during processing, and rows may be processed in any order. The set of unique keys returned will be identical to Pandas, but the values in the other columns may differ.
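Since only the set of keys is guaranteed to match, a comparison against Pandas is safer when it looks at the subset keys alone and ignores row order. The following is a minimal sketch of such a check. It uses plain pandas for both results so that it runs anywhere; the sample data and the helper same_unique_keys are hypothetical and not part of the Bodo API, and the sketch assumes a Bodo result has first been materialized as a regular Pandas DataFrame.

```python
import pandas as pd  # plain pandas, used here as the reference implementation

# Hypothetical sample data: 'rating' differs within each 'brand' group.
data = {
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie"],
    "rating": [4.0, 3.0, 3.5, 5.0],
}
reference = pd.DataFrame(data).drop_duplicates(subset=["brand"])


def same_unique_keys(result, expected, subset):
    """Compare only the subset key tuples, ignoring row order and other columns."""
    result_keys = set(result[subset].itertuples(index=False, name=None))
    expected_keys = set(expected[subset].itertuples(index=False, name=None))
    return result_keys == expected_keys


# Stand-in for a result that kept a different row per key than the reference did.
other_valid_result = pd.DataFrame(data).drop_duplicates(subset=["brand"], keep="last")

# The keys agree even though the surviving 'rating' values do not.
print(same_unique_keys(other_valid_result, reference, ["brand"]))
```

A full row-by-row equality check would reject other_valid_result here, even though it is an acceptable answer under the guarantee described in the note.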
Parameters
- subset : None | List[str], default None: Only consider certain columns when identifying duplicates. By default (None), all columns are used.
- keep : str, default 'first': Determines which duplicates (if any) to keep. Only 'first' and 'last' are supported. First and last occurrences are relative to the order in which the Bodo DataFrame workers process the rows, not to any index ordering of the DataFrame.
- All other parameters will trigger a fallback to pandas.DataFrame.drop_duplicates if a non-default value is provided.
Returns
- BodoDataFrame: Bodo DataFrame with duplicates removed.
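The parameter rules above can be sketched as follows. This is an illustration only: the try/except import is an assumption added so the snippet also runs where Bodo is not installed, and the variable names are hypothetical. Which calls are handled natively and which fall back is taken from the description above, not verified against the implementation.

```python
try:
    import bodo.pandas as pd  # Bodo DataFrames
except ImportError:
    import pandas as pd  # fallback so the sketch runs without Bodo installed

df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# Described as supported: the default keep="first" and keep="last".
first_kept = df.drop_duplicates()
last_kept = df.drop_duplicates(keep="last")

# Described as triggering a fallback to pandas.DataFrame.drop_duplicates:
# an unsupported keep value, or a non-default value for another parameter.
no_repeats = df.drop_duplicates(keep=False)       # drops every duplicated row
reindexed = df.drop_duplicates(ignore_index=True)

print(len(first_kept), len(last_kept), len(no_repeats), len(reindexed))
```

The row counts are the same whichever path is taken; only the order of rows and, with keep='first' or 'last', the choice of surviving row may vary under Bodo.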
Example
import bodo.pandas as pd
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
print(df.drop_duplicates())
Output:
To remove duplicates on specific column(s), use subset.
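A minimal sketch of subset usage on the same sample data, assuming a single-column and a two-column key. The try/except import is an assumption so the snippet also runs without Bodo, and the variable names are hypothetical. No printed output is reproduced here because, under Bodo, the row kept for each key and the row order are not guaranteed.

```python
try:
    import bodo.pandas as pd  # Bodo DataFrames
except ImportError:
    import pandas as pd  # fallback so the sketch runs without Bodo installed

df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# One row per distinct brand; the surviving 'style' and 'rating' values
# for each brand may differ from Pandas when run with Bodo.
by_brand = df.drop_duplicates(subset=["brand"])

# One row per distinct (brand, style) pair.
by_brand_style = df.drop_duplicates(subset=["brand", "style"])

print(by_brand)
print(by_brand_style)
```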
Output: