pyspark.SparkContext.parallelize

SparkContext.parallelize(c: Iterable[T], numSlices: Optional[int] = None) → pyspark.rdd.RDD[T]

Distribute a local Python collection to form an RDD. For performance, pass a range object directly if the input represents a range, rather than materializing it into a list first.
New in version 0.7.0.
- Parameters
- c : collections.abc.Iterable
  iterable collection to distribute
- numSlices : int, optional
  the number of partitions of the new RDD
- Returns
- RDD
  an RDD representing the distributed collection
Examples
>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
[[0], [2], [3], [4], [6]]
>>> sc.parallelize(range(0, 6, 2), 5).glom().collect()
[[], [0], [], [2], [4]]
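When numSlices is omitted, the partition count falls back to the context's default parallelism. A minimal sketch of that behavior, assuming the same local sc as in the examples above (the exact default depends on your cluster configuration, so only the equality is checked):
>>> rdd = sc.parallelize([1, 2, 3, 4])  # numSlices omitted
>>> rdd.getNumPartitions() == sc.defaultParallelism
True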
Distribute a list of strings.
>>> strings = ["a", "b", "c"]
>>> sc.parallelize(strings, 2).glom().collect()
[['a'], ['b', 'c']]
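To illustrate the range recommendation above, a range object can be passed directly without building a list on the driver. A sketch under that assumption; how the range is sliced internally is an implementation detail, so only the observable results are checked here:
>>> sc.parallelize(range(1000), 4).getNumPartitions()
4
>>> sc.parallelize(range(1000), 4).sum()  # sum of 0..999
499500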