pyspark.sql.functions.array_union
pyspark.sql.functions.array_union(col1, col2)
Array function: returns a new array containing the union of elements in col1 and col2, without duplicates.
New in version 2.4.0.
Changed in version 3.4.0: Supports Spark Connect.
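Parameters
col1: Column or str
    Name of the column containing the first array.
col2: Column or str
    Name of the column containing the second array.
Returns
Column
    A new array containing the union of elements in col1 and col2.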
Notes
This function does not preserve the order of the elements in the input arrays.
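Because element order is not guaranteed, results are best compared without regard to order (or sorted first, as the examples do with sort_array). The following is a minimal plain-Python sketch of the documented semantics, not Spark's implementation. The helper names array_union_model and same_elements are hypothetical and exist only for illustration; the model assumes hashable elements and treats None as a single ordinary value.

```python
from collections import Counter
from typing import Hashable, List, Optional


def array_union_model(
    left: List[Optional[Hashable]], right: List[Optional[Hashable]]
) -> List[Optional[Hashable]]:
    """Illustrative model: each distinct element of either input appears once.

    The first-occurrence order produced here is an artifact of this sketch,
    not something the Spark function promises.
    """
    seen = set()
    result = []
    for item in left + right:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def same_elements(a: list, b: list) -> bool:
    """Order-insensitive comparison of two lists (multiset equality)."""
    return Counter(a) == Counter(b)


# Compare against an expected result without depending on element order.
union = array_union_model(["a", "b", None], ["a", None, "c"])
print(same_elements(union, [None, "a", "b", "c"]))
```

In Spark code the equivalent precaution is to wrap the call in sort_array before displaying or asserting on the result.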
Examples
Example 1: Basic usage
>>> from pyspark.sql import Row, functions as sf
>>> df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])
>>> df.select(sf.sort_array(sf.array_union(df.c1, df.c2))).show()
+-------------------------------------+
|sort_array(array_union(c1, c2), true)|
+-------------------------------------+
|                      [a, b, c, d, f]|
+-------------------------------------+
Example 2: Union with no common elements
>>> from pyspark.sql import Row, functions as sf
>>> df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2=["d", "e", "f"])])
>>> df.select(sf.sort_array(sf.array_union(df.c1, df.c2))).show()
+-------------------------------------+
|sort_array(array_union(c1, c2), true)|
+-------------------------------------+
|                   [a, b, c, d, e, f]|
+-------------------------------------+
Example 3: Union with all common elements
>>> from pyspark.sql import Row, functions as sf
>>> df = spark.createDataFrame([Row(c1=["a", "b", "c"], c2=["a", "b", "c"])])
>>> df.select(sf.sort_array(sf.array_union(df.c1, df.c2))).show()
+-------------------------------------+
|sort_array(array_union(c1, c2), true)|
+-------------------------------------+
|                            [a, b, c]|
+-------------------------------------+
Example 4: Union with null values
>>> from pyspark.sql import Row, functions as sf
>>> df = spark.createDataFrame([Row(c1=["a", "b", None], c2=["a", None, "c"])])
>>> df.select(sf.sort_array(sf.array_union(df.c1, df.c2))).show()
+-------------------------------------+
|sort_array(array_union(c1, c2), true)|
+-------------------------------------+
|                      [NULL, a, b, c]|
+-------------------------------------+
Example 5: Union with empty arrays
>>> from pyspark.sql import Row, functions as sf
>>> from pyspark.sql.types import ArrayType, StringType, StructField, StructType
>>> data = [Row(c1=[], c2=["a", "b", "c"])]
>>> schema = StructType([
...     StructField("c1", ArrayType(StringType()), True),
...     StructField("c2", ArrayType(StringType()), True)
... ])
>>> df = spark.createDataFrame(data, schema)
>>> df.select(sf.sort_array(sf.array_union(df.c1, df.c2))).show()
+-------------------------------------+
|sort_array(array_union(c1, c2), true)|
+-------------------------------------+
|                            [a, b, c]|
+-------------------------------------+