pyspark.sql.DataFrame.intersect
DataFrame.intersect(other)
Return a new DataFrame containing rows only in both this DataFrame and another DataFrame. Note that any duplicates are removed. To preserve duplicates, use intersectAll().

New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
other : DataFrame
    Another DataFrame that needs to be combined with this DataFrame.

Returns
DataFrame
    A new DataFrame containing the rows common to both DataFrames.
Notes
This is equivalent to INTERSECT in SQL.
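For reference, a minimal sketch of the SQL form, reusing df1 and df2 from Example 1 below; the view names intersect_a and intersect_b are illustrative:

>>> df1.createOrReplaceTempView("intersect_a")  # hypothetical view name
>>> df2.createOrReplaceTempView("intersect_b")  # hypothetical view name
>>> spark.sql(
...     "SELECT * FROM intersect_a INTERSECT SELECT * FROM intersect_b"
... ).sort("C1", "C2").show()
+---+---+
| C1| C2|
+---+---+
|  a|  1|
|  b|  3|
+---+---+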
Examples
Example 1: Intersecting two DataFrames with the same schema
>>> df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
>>> df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
>>> result_df = df1.intersect(df2).sort("C1", "C2")
>>> result_df.show()
+---+---+
| C1| C2|
+---+---+
|  a|  1|
|  b|  3|
+---+---+
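For comparison, with the same df1 and df2, intersectAll() (available since version 2.4.0) preserves the duplicate ("a", 1) pair instead of collapsing it:

>>> df1.intersectAll(df2).sort("C1", "C2").show()
+---+---+
| C1| C2|
+---+---+
|  a|  1|
|  a|  1|
|  b|  3|
+---+---+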
Example 2: Intersecting two DataFrames with the same schema but partially overlapping rows
>>> df1 = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "value"])
>>> df2 = spark.createDataFrame([(2, "B"), (3, "C")], ["id", "value"])
>>> result_df = df1.intersect(df2).sort("id", "value")
>>> result_df.show()
+---+-----+
| id|value|
+---+-----+
|  2|    B|
+---+-----+
Example 3: Intersecting two DataFrames with mismatched column names

intersect() matches columns by position rather than by name, so the rows are compared pairwise even though the column names differ; note also that the duplicate (1, 2) rows collapse to a single row in the result.
>>> df1 = spark.createDataFrame([(1, 2), (1, 2), (3, 4)], ["A", "B"])
>>> df2 = spark.createDataFrame([(1, 2), (1, 2)], ["C", "D"])
>>> result_df = df1.intersect(df2).sort("A", "B")
>>> result_df.show()
+---+---+
|  A|  B|
+---+---+
|  1|  2|
+---+---+
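As a quick check, reusing result_df from Example 3, the result carries the left-hand DataFrame's column names:

>>> result_df.columns
['A', 'B']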