pyspark.RDD.groupByKey

RDD.groupByKey(numPartitions: Optional[int] = None, partitionFunc: Callable[[K], int] = <function portable_hash>) → pyspark.rdd.RDD[Tuple[K, Iterable[V]]]

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.
New in version 0.7.0.
Parameters
- numPartitions : int, optional
  the number of partitions in the new RDD
- partitionFunc : function, optional, default portable_hash
  function to compute the partition index
Returns
- RDD
  an RDD of (key, iterable of values) pairs, with the values for each key grouped into a single sequence
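For illustration, both optional arguments can be passed together. The first_letter_partitioner below is a hypothetical helper, not part of the API; any callable mapping a key to an int will do. The other calls (parallelize, getNumPartitions, mapValues, collect) are standard RDD methods:

>>> def first_letter_partitioner(key):  # hypothetical partitioner used only for this sketch
...     return ord(key[0]) % 2
>>> rdd = sc.parallelize([("apple", 1), ("banana", 2), ("avocado", 3)])
>>> grouped = rdd.groupByKey(numPartitions=2, partitionFunc=first_letter_partitioner)
>>> grouped.getNumPartitions()
2
>>> sorted(grouped.mapValues(list).collect())
[('apple', [1]), ('avocado', [3]), ('banana', [2])]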
Notes
If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance.
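As a sketch of that alternative, here is a per-key sum written both ways; reduceByKey is part of the RDD API, and the tiny dataset is only illustrative:

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.groupByKey().mapValues(sum).collect())  # materializes every value per key before summing
[('a', 2), ('b', 1)]
>>> sorted(rdd.reduceByKey(lambda x, y: x + y).collect())  # combines values map-side, shuffling less data
[('a', 2), ('b', 1)]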
Examples
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.groupByKey().mapValues(len).collect())
[('a', 2), ('b', 1)]
>>> sorted(rdd.groupByKey().mapValues(list).collect())
[('a', [1, 1]), ('b', [1])]
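The examples above assume a running SparkContext bound to sc, as in the PySpark shell. A minimal standalone sketch, assuming a local master (the master string and application name are arbitrary choices):

from pyspark import SparkContext

sc = SparkContext("local[2]", "groupByKey-example")  # master / app name chosen for illustration
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.groupByKey().mapValues(list).collect()))
# [('a', [1, 1]), ('b', [1])]
sc.stop()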