RDD.groupByKey(numPartitions=None, partitionFunc=portable_hash)
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.
New in version 0.7.0.
Parameters
numPartitions : int, optional
    the number of partitions in the new RDD
partitionFunc : function, optional
    function to compute the partition index
Returns
RDD
    an RDD containing the keys and the grouped values for each key
See also
RDD.reduceByKey()
RDD.combineByKey()
RDD.aggregateByKey()
RDD.foldByKey()
Notes
If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance.
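The difference can be sketched without a cluster. The following is a plain-Python model, not Spark's implementation: it assumes a hypothetical two-partition input and counts how many records would cross the shuffle when values are grouped first versus combined within each partition first, which is what reduceByKey does. The helper names `shuffle_group_by_key` and `shuffle_reduce_by_key` are illustrative only.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]


def shuffle_group_by_key(parts):
    """Model of groupByKey: every record is shuffled, then grouped by key."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return len(shuffled), dict(grouped)


def shuffle_reduce_by_key(parts, func):
    """Model of reduceByKey: combine inside each partition before shuffling."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    # Merge the per-partition partial results after the shuffle.
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return len(shuffled), merged


group_records, grouped = shuffle_group_by_key(partitions)
reduce_records, reduced = shuffle_reduce_by_key(partitions, lambda x, y: x + y)
sums_from_groups = {key: sum(values) for key, values in grouped.items()}

print(group_records, reduce_records)  # records crossing the shuffle in each model
print(sums_from_groups == reduced)    # both approaches give the same sums
```

In this model both approaches produce the same per-key sums, but the grouped version moves one record per input pair, while the combined version moves at most one record per key per partition.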
Examples
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.groupByKey().mapValues(len).collect())
[('a', 2), ('b', 1)]
>>> sorted(rdd.groupByKey().mapValues(list).collect())
[('a', [1, 1]), ('b', [1])]
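To illustrate how numPartitions and the partition function interact, here is a minimal plain-Python sketch, again a model rather than Spark's code. It assumes each key is routed to partition `partition_func(key) % num_partitions` and that values are then grouped per key within that partition. The function `group_by_key` and its argument names are hypothetical, and integer keys are used so that Python's built-in `hash` is deterministic.

```python
def group_by_key(pairs, num_partitions, partition_func=hash):
    """Model: route each key to a partition by hash, then group its values."""
    parts = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        index = partition_func(key) % num_partitions
        parts[index].setdefault(key, []).append(value)
    return parts


# Hypothetical data with small integer keys (hash(n) == n for these).
pairs = [(1, "x"), (2, "y"), (1, "z"), (4, "w")]
parts = group_by_key(pairs, num_partitions=2)

for index, part in enumerate(parts):
    print(index, part)
```

All values for a given key end up in the same partition, so each key appears exactly once in the result.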