SparkContext.
addArchive
Add an archive to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download/unpacked location. The given path should be one of .zip, .tar, .tar.gz, .tgz and .jar.
SparkFiles.get()
New in version 3.3.0.
can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get() to find its download location.
See also
SparkContext.listArchives()
Notes
A path can be added only once. Subsequent additions of the same path are ignored. This API is experimental.
Examples
Creates a zipped file that contains a text file written ‘100’.
>>> import os >>> import tempfile >>> import zipfile >>> from pyspark import SparkFiles
>>> with tempfile.TemporaryDirectory() as d: ... path = os.path.join(d, "test.txt") ... with open(path, "w") as f: ... _ = f.write("100") ... ... zip_path1 = os.path.join(d, "test1.zip") ... with zipfile.ZipFile(zip_path1, "w", zipfile.ZIP_DEFLATED) as z: ... z.write(path, os.path.basename(path)) ... ... zip_path2 = os.path.join(d, "test2.zip") ... with zipfile.ZipFile(zip_path2, "w", zipfile.ZIP_DEFLATED) as z: ... z.write(path, os.path.basename(path)) ... ... sc.addArchive(zip_path1) ... arch_list1 = sorted(sc.listArchives) ... ... sc.addArchive(zip_path2) ... arch_list2 = sorted(sc.listArchives) ... ... # add zip_path2 twice, this addition will be ignored ... sc.addArchive(zip_path2) ... arch_list3 = sorted(sc.listArchives) ... ... def func(iterator): ... with open("%s/test.txt" % SparkFiles.get("test1.zip")) as f: ... mul = int(f.readline()) ... return [x * mul for x in iterator] ... ... collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
>>> arch_list1 ['file:/.../test1.zip'] >>> arch_list2 ['file:/.../test1.zip', 'file:/.../test2.zip'] >>> arch_list3 ['file:/.../test1.zip', 'file:/.../test2.zip'] >>> collected [100, 200, 300, 400]