pyspark.SparkContext.runJob
SparkContext.runJob(rdd: pyspark.rdd.RDD[T], partitionFunc: Callable[[Iterable[T]], Iterable[U]], partitions: Optional[Sequence[int]] = None, allowLocal: bool = False) → List[U]

Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.
If ‘partitions’ is not specified, this will run over all partitions.
New in version 1.1.0.
- Parameters
  - rdd : RDD
    target RDD to run tasks on
  - partitionFunc : function
    a function to run on each partition of the RDD
  - partitions : list, optional
    set of partitions to run on; some jobs may not want to compute on all partitions of the target RDD, e.g. for operations like first()
  - allowLocal : bool, default False
    this parameter has no effect
- Returns
  - list
    results of specified partitions
Examples
>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part])
[0, 1, 4, 9, 16, 25]
>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part], [0, 2], True)
[0, 1, 16, 25]
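A fuller, self-contained sketch of the same calls, assuming a local Spark installation; the master URL, app name, partition counts, and variable names below are illustrative and not part of the documented API:

from pyspark import SparkContext

# Assumed local setup; master URL and app name are illustrative.
sc = SparkContext("local[2]", "runJob-sketch")

# 100 elements spread evenly over 4 partitions (partition 0 holds 0..24).
rdd = sc.parallelize(range(100), 4)

# Evaluate only partition 0, similar in spirit to how first() avoids
# computing every partition of the target RDD.
first_partition = sc.runJob(rdd, lambda it: list(it), partitions=[0])
print(first_partition[:3])   # [0, 1, 2]

# Omitting `partitions` runs partitionFunc over all partitions; here it
# yields one element count per partition, concatenated into a single list.
counts = sc.runJob(rdd, lambda it: [sum(1 for _ in it)])
print(counts)                # [25, 25, 25, 25]

sc.stop()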