pyspark.sql.DataFrame.mapInPandas

DataFrame.mapInPandas(func, schema)

Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.

The function should take an iterator of pandas.DataFrames and return another iterator of pandas.DataFrames. All columns are passed together as an iterator of pandas.DataFrames to the function and the returned iterator of pandas.DataFrames are combined as a DataFrame. Each pandas.DataFrame size can be controlled by spark.sql.execution.arrow.maxRecordsPerBatch.

New in version 3.0.0.

Parameters
funcfunction

a Python native function that takes an iterator of pandas.DataFrames, and outputs an iterator of pandas.DataFrames.

schemapyspark.sql.types.DataType or str

the return type of the func in PySpark. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.

Notes

This API is experimental

Examples

>>> from pyspark.sql.functions import pandas_udf
>>> df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
>>> def filter_func(iterator):
...     for pdf in iterator:
...         yield pdf[pdf.id == 1]
>>> df.mapInPandas(filter_func, df.schema).show()  
+---+---+
| id|age|
+---+---+
|  1| 21|
+---+---+