pyspark.sql.functions.count_min_sketch
pyspark.sql.functions.count_min_sketch(col: ColumnOrName, eps: ColumnOrName, confidence: ColumnOrName, seed: ColumnOrName) → pyspark.sql.column.Column
Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. A count-min sketch is a probabilistic data structure that estimates how often each value occurs, using sub-linear space.
New in version 3.5.0.
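To give a feel for how eps and confidence control the size of the sketch, here is a minimal sketch in plain Python. It assumes the sizing rule width = ceil(2 / eps) and depth = ceil(log2(1 / (1 - confidence))); that rule is an assumption about the underlying implementation, not something this page states, although it is consistent with a depth of 1 and a width of 4 for eps = 0.5 and confidence = 0.5. The helper name sketch_dimensions is hypothetical.

```python
import math
from typing import Tuple


def sketch_dimensions(eps: float, confidence: float) -> Tuple[int, int]:
    """Return (depth, width) of the counter table under the assumed sizing rule."""
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Assumed rule: a smaller relative error needs more counters per row.
    width = math.ceil(2.0 / eps)
    # Assumed rule: a higher confidence needs more independent hash rows.
    depth = math.ceil(-math.log1p(-confidence) / math.log(2.0))
    return depth, width


depth, width = sketch_dimensions(eps=0.25, confidence=0.95)
print(f"depth={depth}, width={width}, counters={depth * width}")
```

Under this rule the number of counters depends only on eps and confidence, not on the number of rows aggregated, which is what "sub-linear space" refers to.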
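- Parameters
col : Column or str
target column to compute on.
eps : Column or str
relative error, must be positive.
confidence : Column or str
confidence, must be positive and less than 1.0.
seed : Column or str
random seed.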
- Returns
Column
count-min sketch of the column
Examples
>>> from pyspark.sql.functions import count_min_sketch, hex, lit
>>> df = spark.createDataFrame([[1], [2], [1]], ['data'])
>>> df = df.agg(count_min_sketch(df.data, lit(0.5), lit(0.5), lit(1)).alias('sketch'))
>>> df.select(hex(df.sketch).alias('r')).collect()
[Row(r='0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000')]
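The hex string in the example can be unpacked by hand to see what the sketch contains. The following is a rough illustration in plain Python, not the library's deserializer. It assumes a big-endian layout of version (int), total count (long), depth (int), width (int), one hash seed (long) per row, then depth × width counters (long). That layout is inferred from the bytes shown above and should be treated as an assumption. The function name parse_sketch is hypothetical.

```python
import struct

# Hex string copied from the example output above.
SKETCH_HEX = (
    "0000000100000000000000030000000100000004000000005D8D6AB9"
    "0000000000000000000000000000000200000000000000010000000000000000"
)


def parse_sketch(data: bytes) -> dict:
    """Decode a serialized sketch, assuming the layout described in the lead-in."""
    header = ">iqii"  # version, total count, depth, width (big-endian, no padding)
    version, total_count, depth, width = struct.unpack_from(header, data, 0)
    offset = struct.calcsize(header)
    # One 8-byte hash seed per row.
    hash_a = struct.unpack_from(f">{depth}q", data, offset)
    offset += 8 * depth
    # The counter table, stored row by row.
    flat = struct.unpack_from(f">{depth * width}q", data, offset)
    table = [list(flat[r * width:(r + 1) * width]) for r in range(depth)]
    return {
        "version": version,
        "total_count": total_count,
        "depth": depth,
        "width": width,
        "hash_a": list(hash_a),
        "table": table,
    }


parsed = parse_sketch(bytes.fromhex(SKETCH_HEX))
for key, value in parsed.items():
    print(key, value)
```

Read this way, the example sketch records three items in total, in a single row of four counters. In a live session the same bytes would come from the sketch column itself rather than from the hex string.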