pyspark.RDD.intersection

RDD.intersection(other: pyspark.rdd.RDD[T]) → pyspark.rdd.RDD[T]

Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.

New in version 1.0.0.

Parameters

other : RDD
    another RDD

Returns

RDD
    the intersection of this RDD and another one

See also

Notes

This method performs a shuffle internally.

Examples

>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> sorted(rdd1.intersection(rdd2).collect())
[1, 2, 3]
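
The deduplication described above, and the reason equal elements must be brought together (the shuffle mentioned in the Notes), can be illustrated with a plain-Python model. This is only a conceptual sketch, not Spark's implementation: `intersection_model` is a hypothetical helper, it runs on ordinary lists rather than RDDs, and it assumes the elements are hashable.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List


def intersection_model(left: Iterable[Hashable], right: Iterable[Hashable]) -> List[Hashable]:
    """Pure-Python model of the intersection semantics (no Spark involved)."""
    # Group by element value, recording which input(s) each value came from.
    # In a distributed setting, this grouping is the step that needs a shuffle.
    sides = defaultdict(set)
    for x in left:
        sides[x].add("left")
    for x in right:
        sides[x].add("right")
    # Keep each value once, and only if both inputs contained it.
    return [x for x, seen in sides.items() if seen == {"left", "right"}]


# Duplicates in either input (3 on the left, 1 on the right) appear once in the result.
result = intersection_model([1, 10, 2, 3, 3, 4, 5], [1, 1, 6, 2, 3, 7, 8])
print(sorted(result))  # [1, 2, 3]
```

Because grouping is by value, the number of times an element occurs in either input has no effect on the output.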
