pyspark.pandas.Series.nunique

Series.nunique(dropna: bool = True, approx: bool = False, rsd: float = 0.05) → int

Return number of unique elements in the object. Excludes NA values by default.

Parameters

dropna: bool, default True: Don’t include NaN in the count.
approx: bool, default False: If False, use the exact algorithm and return the exact number of unique values. If True, use the HyperLogLog approximate algorithm, which is significantly faster for large amounts of data. Note: This parameter is specific to pandas-on-Spark and is not found in pandas.
rsd: float, default 0.05: Maximum estimation error allowed in the HyperLogLog algorithm. Note: Just like approx, this parameter is specific to pandas-on-Spark.

Returns

int: Number of unique elements in the object.

See also

Examples

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3, np.nan]).nunique()
3

>>> ps.Series([1, 2, 3, np.nan]).nunique(dropna=False)
4

On large datasets, we recommend using the approximate algorithm (approx=True) to speed up this function. The result is an estimate close to the exact unique count, with the allowed estimation error set by rsd.

>>> ps.Series([1, 2, 3, np.nan]).nunique(approx=True)
3
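To give a feel for how rsd trades accuracy for memory, here is a minimal pure-Python sketch of a textbook HyperLogLog estimator. It is an illustration only, not the implementation Spark uses internally, which may differ in its hashing, bias correction, and the way it maps rsd to a register count. The names precision_for_rsd and hll_estimate are hypothetical, and the sketch assumes the textbook standard-error relation of about 1.04 / sqrt(m) for m registers.

```python
import hashlib
import math


def precision_for_rsd(rsd):
    """Smallest p such that m = 2**p registers gives 1.04 / sqrt(m) <= rsd."""
    p = math.ceil(math.log2((1.04 / rsd) ** 2))
    return max(p, 7)  # the alpha constant below assumes m >= 128


def hll_estimate(values, rsd=0.05):
    """Estimate the number of distinct values with a textbook HyperLogLog."""
    p = precision_for_rsd(rsd)
    m = 1 << p
    registers = [0] * m
    for v in values:
        digest = hashlib.sha256(repr(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # 64-bit hash of the value
        idx = h >> (64 - p)                      # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)             # bias correction for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)           # linear counting for small counts
    return raw


# Assumed toy data: 60000 rows holding 20000 distinct integers.
data = [i % 20000 for i in range(60000)]
exact = len(set(data))
for rsd in (0.05, 0.01):
    approx = hll_estimate(data, rsd)
    registers_used = 1 << precision_for_rsd(rsd)
    rel_err = abs(approx - exact) / exact
    print(f"rsd={rsd}: registers={registers_used}, estimate={approx:.0f}, error={rel_err:.3%}")
```

A smaller rsd requires more registers, so the estimate tightens at the cost of memory. Duplicate values never change the registers, which is why the estimator can process the data in a single pass without keeping a set of seen values.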

>>> idx = ps.Index([1, 1, 2, None])
>>> idx
Float64Index([1.0, 1.0, 2.0, nan], dtype='float64')

>>> idx.nunique()
2

>>> idx.nunique(dropna=False)
3