Class SparkContext

object --+
         |
         SparkContext
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.
Method Details
Create a new SparkContext. At least the master and app name should be set, either through the named parameters here or through the conf parameter (a SparkConf object).
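As an illustrative sketch (not part of the generated doctests; master and appName are the constructor's keyword parameters, and "local[2]" is just an example master URL):

>>> from pyspark import SparkContext
>>> sc = SparkContext(master="local[2]", appName="ExampleApp")
>>> sc.parallelize(range(3)).collect()
[0, 1, 2]
>>> sc.stop()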
Set a Java system property, such as spark.executor.memory. This must be invoked before instantiating SparkContext.
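For example, a sketch of setting executor memory before the context exists (the "2g" value is illustrative):

>>> from pyspark import SparkContext
>>> SparkContext.setSystemProperty("spark.executor.memory", "2g")
>>> sc = SparkContext("local", "PropertyExample")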
Default level of parallelism to use when not given by the user (e.g. for reduce tasks).
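For instance, it can serve as the partition count when none is given explicitly (a sketch; the actual value depends on the master configuration):

>>> numSlices = sc.defaultParallelism
>>> len(sc.parallelize(range(100), numSlices).glom().collect()) == numSlices
True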
Distribute a local Python collection to form an RDD.

>>> sc.parallelize(range(5), 5).glom().collect()
[[0], [1], [2], [3], [4]]
Build the union of a list of RDDs. This supports unions() of RDDs with different serialized formats, although this forces them to be reserialized using the default serializer:

>>> path = os.path.join(tempdir, "union-text.txt")
>>> with open(path, "w") as testFile:
...    testFile.write("Hello")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
[u'Hello']
>>> parallelized = sc.parallelize(["World!"])
>>> sorted(sc.union([textFile, parallelized]).collect())
[u'Hello', 'World!']
Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions. The variable will be sent to each node only once.
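A minimal sketch of reading a broadcast value inside tasks through its value attribute:

>>> lookup = sc.broadcast({"a": 1, "b": 2})
>>> sc.parallelize(["a", "b", "a"]).map(lambda k: lookup.value[k]).collect()
[1, 2, 1]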
Create an Accumulator with the given initial value, using a given AccumulatorParam helper object to define how values of that data type are added, if one is provided. Default AccumulatorParams are used for integers and floating-point numbers if you do not provide one. For other types, a custom AccumulatorParam can be used.
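A sketch of both forms: a plain integer accumulator using the default AccumulatorParam, and a hypothetical list-valued ListParam defining zero() and addInPlace():

>>> counter = sc.accumulator(0)
>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: counter.add(x))
>>> counter.value
10
>>> from pyspark.accumulators import AccumulatorParam
>>> class ListParam(AccumulatorParam):
...     def zero(self, value):
...         return []
...     def addInPlace(self, v1, v2):
...         v1.extend(v2)
...         return v1
>>> seen = sc.accumulator([], ListParam())
>>> sc.parallelize([1, 2, 3]).foreach(lambda x: seen.add([x]))
>>> sorted(seen.value)
[1, 2, 3]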
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

To access the file in Spark jobs, use SparkFiles.get(path) to find its download location.

>>> from pyspark import SparkFiles
>>> path = os.path.join(tempdir, "test.txt")
>>> with open(path, "w") as testFile:
...    testFile.write("100")
>>> sc.addFile(path)
>>> def func(iterator):
...    with open(SparkFiles.get("test.txt")) as testFile:
...        fileVal = int(testFile.readline())
...        return [x * 100 for x in iterator]
>>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
[100, 200, 300, 400]
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
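As a sketch (the path and module name below are hypothetical), a dependency added this way can be imported inside tasks:

>>> sc.addPyFile("/path/to/mymodule.py")      # hypothetical local path
>>> def use_dep(x):
...     import mymodule                       # shipped to every worker by addPyFile
...     return mymodule.transform(x)          # hypothetical helper in that module
>>> result = sc.parallelize([1, 2, 3]).map(use_dep).collect()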
Set the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.
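A minimal sketch (the directory below is illustrative; on a cluster it must be an HDFS path):

>>> sc.setCheckpointDir(os.path.join(tempdir, "checkpoints"))
>>> rdd = sc.parallelize(range(10))
>>> rdd.checkpoint()
>>> rdd.count()
10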