pyspark.sql.functions.count_distinct

pyspark.sql.functions.count_distinct(col: ColumnOrName, *cols: ColumnOrName) → pyspark.sql.column.Column

Returns a new Column holding the number of distinct values of col, or of distinct combinations of values when cols are also given.

New in version 3.2.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

col (Column or str): first column to compute on.
cols (Column or str): other columns to compute on.

Returns

Column: the number of distinct values of col, or of distinct value combinations across all the given columns.

Examples

>>> from pyspark.sql import types
>>> from pyspark.sql.functions import count_distinct
>>> df1 = spark.createDataFrame([1, 1, 3], types.IntegerType())
>>> df2 = spark.createDataFrame([1, 2], types.IntegerType())
>>> df1.join(df2).show()
+-----+-----+
|value|value|
+-----+-----+
|    1|    1|
|    1|    2|
|    1|    1|
|    1|    2|
|    3|    1|
|    3|    2|
+-----+-----+
>>> df1.join(df2).select(count_distinct(df1.value, df2.value)).show()
+----------------------------+
|count(DISTINCT value, value)|
+----------------------------+
|                           4|
+----------------------------+
