pyspark.sql.functions.count_min_sketch

pyspark.sql.functions.count_min_sketch(col: ColumnOrName, eps: ColumnOrName, confidence: ColumnOrName, seed: ColumnOrName) → pyspark.sql.column.Column

Returns a count-min sketch of a column with the given eps, confidence, and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. A count-min sketch is a probabilistic data structure that estimates how often each item occurs (its frequency) using sub-linear space.

New in version 3.5.0.

Parameters

col (Column or str): target column to compute on.
eps (Column or str): relative error; must be positive.
confidence (Column or str): confidence; must be positive and less than 1.0.
seed (Column or str): random seed.
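
To give a feel for how eps and confidence trade accuracy for space, the following sketch computes the table dimensions a count-min sketch would typically use. The sizing rule shown here (depth from the confidence, width from the relative error) is an assumption based on the usual count-min construction and is not part of this function's documented contract; the helper name sketch_dimensions is hypothetical.

```python
import math


def sketch_dimensions(eps: float, confidence: float) -> tuple:
    """Return (depth, width) for a count-min sketch (hypothetical helper)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be positive and less than 1.0")
    # Assumed sizing rule: each extra hash row halves the failure
    # probability, and each row holds about 2 / eps counters.
    depth = math.ceil(-math.log(1 - confidence) / math.log(2))
    width = math.ceil(2 / eps)
    return depth, width


# Smaller eps widens the table; higher confidence adds rows.
print(sketch_dimensions(0.5, 0.5))
print(sketch_dimensions(0.25, 0.9))
```

Under this assumption, the values eps=0.5 and confidence=0.5 give a single row of four counters, so the serialized sketch stays very small.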

Returns

Column: count-min sketch of the column

Examples

>>> from pyspark.sql.functions import count_min_sketch, hex, lit
>>> df = spark.createDataFrame([[1], [2], [1]], ['data'])
>>> df = df.agg(count_min_sketch(df.data, lit(0.5), lit(0.5), lit(1)).alias('sketch'))
>>> df.select(hex(df.sketch).alias('r')).collect()
[Row(r='0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000')]
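
The hex string above is opaque on its own, so here is a minimal sketch that unpacks it in plain Python. The byte layout is an assumption inferred from this example output: a big-endian header (version, total count, depth, width), then one hash coefficient per row, then depth * width counters. The names SKETCH_HEX and decode_sketch are hypothetical; in real code, deserialize the bytes to a CountMinSketch instead of parsing them by hand.

```python
import struct

# Hex string copied from the example output above.
SKETCH_HEX = (
    "0000000100000000000000030000000100000004"
    "000000005D8D6AB9"
    "0000000000000000000000000000000200000000000000010000000000000000"
)


def decode_sketch(data: bytes) -> dict:
    """Decode the assumed layout of a serialized count-min sketch."""
    header = ">iqii"  # version (int), total count (long), depth, width
    version, total_count, depth, width = struct.unpack_from(header, data, 0)
    offset = struct.calcsize(header)
    # One 8-byte hash coefficient per row (assumed).
    hash_coefficients = struct.unpack_from(f">{depth}q", data, offset)
    offset += 8 * depth
    # depth * width 8-byte counters, stored row by row (assumed).
    flat = struct.unpack_from(f">{depth * width}q", data, offset)
    table = [list(flat[r * width:(r + 1) * width]) for r in range(depth)]
    return {
        "version": version,
        "total_count": total_count,
        "depth": depth,
        "width": width,
        "hash_coefficients": list(hash_coefficients),
        "table": table,
    }


decoded = decode_sketch(bytes.fromhex(SKETCH_HEX))
print(decoded)
```

For the example above, this reading gives a total count of 3 with depth 1 and width 4, and the single row holds the counters 0, 2, 1, 0. That matches the input, where the value 1 appears twice and the value 2 once.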
