DataFrameWriter.sortBy
Sorts the output in each bucket by the given columns on the file system.
New in version 2.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
col : str, tuple or list
    A name of a column, or a list of names.
cols : str
    Additional names (optional). If col is a list it should be empty.
Examples
Write a DataFrame into a Parquet file in a sorted-bucketed manner, and read it back.

>>> # Write a DataFrame into a Parquet file in a sorted-bucketed manner.
... _ = spark.sql("DROP TABLE IF EXISTS sorted_bucketed_table")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(1, "name").sortBy("age").mode(
...     "overwrite").saveAsTable("sorted_bucketed_table")
>>> # Read the Parquet file as a DataFrame.
... spark.read.table("sorted_bucketed_table").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE sorted_bucketed_table")