pyspark.sql.DataFrame.groupBy

DataFrame.groupBy(*cols: ColumnOrName) → GroupedData

Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

groupby() is an alias for groupBy().

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.
Parameters
    cols : list, str or Column
        Columns to group by. Each element should be a column name (string) or an expression (Column), or a list of them.

Returns
    GroupedData
        Grouped data by the given columns.
Examples
>>> df = spark.createDataFrame([
...     (2, "Alice"), (2, "Bob"), (2, "Bob"), (5, "Bob")], schema=["age", "name"])
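For reference, the DataFrame created above looks like this:

>>> df.show()
+---+-----+
|age| name|
+---+-----+
|  2|Alice|
|  2|  Bob|
|  2|  Bob|
|  5|  Bob|
+---+-----+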
An empty list of grouping columns triggers a global aggregation over the whole DataFrame.
>>> df.groupBy().avg().show()
+--------+
|avg(age)|
+--------+
|    2.75|
+--------+
Group by ‘name’, and specify a dictionary to calculate the sum of ‘age’.
>>> df.groupBy("name").agg({"age": "sum"}).sort("name").show() +-----+--------+ | name|sum(age)| +-----+--------+ |Alice| 2| | Bob| 9| +-----+--------+
Group by ‘name’, and calculate the maximum value of each numeric column.
>>> df.groupBy(df.name).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+
Group by ‘name’ and ‘age’, and count the number of rows in each group.
>>> df.groupBy(["name", df.age]).count().sort("name", "age").show() +-----+---+-----+ | name|age|count| +-----+---+-----+ |Alice| 2| 1| | Bob| 2| 2| | Bob| 5| 1| +-----+---+-----+