pyspark.sql.DataFrameWriter.bucketBy

DataFrameWriter.bucketBy(numBuckets: int, col: Union[str, List[str], Tuple[str, ...]], *cols: Optional[str]) → pyspark.sql.readwriter.DataFrameWriter

Buckets the output by the given columns. If specified, the output is laid out on the file system in a manner similar to Hive's bucketing scheme, but it uses a different bucket hash function and is not compatible with Hive's bucketing.
New in version 2.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
- numBuckets : int
  the number of buckets to save.
- col : str, list or tuple
  a name of a column, or a list of names.
- cols : str
  additional column names (optional). If col is a list, it should be empty (see the sketch after this list).
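A minimal sketch of the two accepted call forms, assuming df is a DataFrame with name and age columns (df is not part of the upstream docstring); passing the names as a list means no additional names may be supplied through cols.

>>> writer = df.write.bucketBy(2, "name", "age")    # column names as varargs
>>> writer = df.write.bucketBy(2, ["name", "age"])  # column names as a list; cols left empty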
Notes
Applicable for file-based data sources in combination with DataFrameWriter.saveAsTable().

Examples
Write a DataFrame into a Parquet file in a bucketed manner, and read it back.
>>> from pyspark.sql.functions import input_file_name
>>> # Write a DataFrame into a Parquet file in a bucketed manner.
... _ = spark.sql("DROP TABLE IF EXISTS bucketed_table")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(2, "name").mode("overwrite").saveAsTable("bucketed_table")
>>> # Read the Parquet file as a DataFrame.
... spark.read.table("bucketed_table").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE bucketed_table")
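A further illustrative sketch, not part of the upstream docstring: bucket by two columns and sort rows within each bucket via DataFrameWriter.sortBy(), which is commonly combined with bucketBy(); the table name multi_bucketed_table is assumed.

>>> # Bucket the output by name and age, sorting each bucket by age (illustrative).
... _ = spark.sql("DROP TABLE IF EXISTS multi_bucketed_table")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(2, "name", "age").sortBy("age").mode(
...     "overwrite").saveAsTable("multi_bucketed_table")
>>> spark.read.table("multi_bucketed_table").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE multi_bucketed_table")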