pyspark.sql.DataFrameWriter.sortBy

DataFrameWriter.sortBy(col: Union[str, List[str], Tuple[str, ...]], *cols: Optional[str]) → pyspark.sql.readwriter.DataFrameWriter

Sorts the output in each bucket by the given columns on the file system.
New in version 2.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- col : str, tuple or list
a name of a column, or a list of names.
- cols : str
additional names (optional). If col is a list, it should be empty (both calling conventions are sketched below).
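The two calling conventions are interchangeable; a minimal sketch, assuming a running SparkSession bound to spark (the table name two_col_sorted_table is illustrative, and sortBy only takes effect together with bucketBy when saving as a table):

>>> df = spark.createDataFrame(
...     [(100, "Hyukjin Kwon"), (140, "Haejoon Lee")], schema=["age", "name"])
>>> # Variadic form: one column name in col, additional names via *cols.
... df.write.bucketBy(1, "name").sortBy("age", "name").mode(
...     "overwrite").saveAsTable("two_col_sorted_table")
>>> # List form: all names passed in col; *cols must be left empty.
... df.write.bucketBy(1, "name").sortBy(["age", "name"]).mode(
...     "overwrite").saveAsTable("two_col_sorted_table")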
Examples
Write a DataFrame into a Parquet file in a sorted-bucketed manner, and read it back.
>>> # Write a DataFrame into a Parquet file in a sorted-bucketed manner.
... _ = spark.sql("DROP TABLE IF EXISTS sorted_bucketed_table")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(1, "name").sortBy("age").mode(
...     "overwrite").saveAsTable("sorted_bucketed_table")
>>> # Read the Parquet file as a DataFrame.
... spark.read.table("sorted_bucketed_table").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE sorted_bucketed_table")
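Whether a sort specification was actually recorded can be checked from the catalog metadata. A sketch, reusing the illustrative two_col_sorted_table created above (the exact rows emitted by DESCRIBE TABLE EXTENDED vary across Spark versions):

>>> # Show the bucketing and sorting metadata recorded for the table.
... spark.sql("DESCRIBE TABLE EXTENDED two_col_sorted_table").filter(
...     "col_name IN ('Num Buckets', 'Bucket Columns', 'Sort Columns')"
... ).show(truncate=False)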