RDD.join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
Performs a hash join across the cluster.
New in version 0.7.0.
Parameters
other : RDD
    another RDD
numPartitions : int, optional
    the number of partitions in the new RDD
Returns
RDD
    an RDD containing all pairs of elements with matching keys
See also
RDD.leftOuterJoin()
RDD.rightOuterJoin()
RDD.fullOuterJoin()
RDD.cogroup()
RDD.groupWith()
pyspark.sql.DataFrame.join()
Examples
>>> rdd1 = sc.parallelize([("a", 1), ("b", 4)])
>>> rdd2 = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(rdd1.join(rdd2).collect())
[('a', (1, 2)), ('a', (1, 3))]
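The pairing rule can also be illustrated without a Spark cluster. The following is a minimal plain-Python sketch of inner-join semantics on lists of (key, value) pairs; `local_join` is a hypothetical helper written for this illustration, not part of PySpark, and it is not a description of how Spark implements the distributed join. It only shows which (k, (v1, v2)) tuples come out: one per matching pair of values, with unmatched keys such as "b" dropped.

```python
from collections import defaultdict


def local_join(left, right):
    """Inner-join two lists of (key, value) pairs on a single machine.

    Hypothetical helper for illustration only; it mimics the output
    shape of RDD.join, not Spark's distributed execution.
    """
    # Build phase: index the right-hand values by key.
    index = defaultdict(list)
    for k, v2 in right:
        index[k].append(v2)
    # Probe phase: emit one (k, (v1, v2)) tuple per matching pair.
    # Keys present on only one side produce no output.
    return [(k, (v1, v2)) for k, v1 in left for v2 in index.get(k, [])]


pairs1 = [("a", 1), ("b", 4)]
pairs2 = [("a", 2), ("a", 3)]
result = sorted(local_join(pairs1, pairs2))
print(result)  # [('a', (1, 2)), ('a', (1, 3))]
```

Note that a key occurring m times on one side and n times on the other yields m * n output tuples, so heavily duplicated keys can make the joined result much larger than either input.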