pyspark.pandas.DataFrame.spark.cache

spark.cache() → CachedDataFrame

Yields and caches the current DataFrame.

The pandas-on-Spark DataFrame is yielded as a protected resource and its data is cached for the duration of the context. The data is uncached automatically once execution leaves the context.

To specify the StorageLevel manually, use DataFrame.spark.persist() instead.
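
As a rough sketch of the persist() route, the helper below caches a frame at a caller-chosen storage level for the length of one with block. The name rows_at_storage_level is hypothetical and not part of pyspark. The sketch assumes the frame returned by persist() can be used in a with block in the same way as the one returned by cache(), and that it exposes spark.storage_level.

```python
def rows_at_storage_level(psdf, level):
    """Persist psdf at the given level, count its rows, then release the cache.

    Hypothetical helper for illustration; not part of pyspark.
    """
    # persist() takes the storage level explicitly, whereas cache() uses a default.
    with psdf.spark.persist(level) as cached:
        # len() is evaluated right away, so the work runs while the data is cached.
        return cached.spark.storage_level, len(cached)


# Possible usage, assuming pyspark is installed and a Spark session can start:
#   import pyspark
#   import pyspark.pandas as ps
#   df = ps.DataFrame({"dogs": [.2, .0, .6, .2], "cats": [.3, .6, .0, .1]})
#   level, n_rows = rows_at_storage_level(df, pyspark.StorageLevel.MEMORY_ONLY)
```

The result is computed inside the with block on purpose: pandas-on-Spark evaluates many operations lazily, so a result that is only materialized after the block exits would no longer benefit from the cache.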

See also

DataFrame.spark.persist

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df
   dogs  cats
0   0.2   0.3
1   0.0   0.6
2   0.6   0.0
3   0.2   0.1
>>> with df.spark.cache() as cached_df:
...     print(cached_df.count())
...
dogs    4
cats    4
dtype: int64
>>> df = df.spark.cache()
>>> df.to_pandas().mean(axis=1)
0    0.25
1    0.30
2    0.30
3    0.15
dtype: float64

To uncache the DataFrame explicitly, call unpersist():

>>> df.spark.unpersist()
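
The assign-then-unpersist pattern above can be wrapped so that the cache is released even if the work in between raises. This is a minimal sketch: run_with_cache is a hypothetical name, not a pyspark function, and it relies only on the cache() and unpersist() calls shown in the example.

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def run_with_cache(psdf: Any, action: Callable[[Any], T]) -> T:
    """Cache psdf, run action on the cached frame, and always unpersist.

    Hypothetical helper for illustration; not part of pyspark.
    """
    cached = psdf.spark.cache()
    try:
        # action should return an already computed result (for example via
        # to_pandas()), because the cache is gone once this function returns.
        return action(cached)
    finally:
        # Runs on both success and failure, so the cached data is not leaked.
        cached.spark.unpersist()


# Possible usage with the df from the example above:
#   means = run_with_cache(df, lambda cached: cached.to_pandas().mean(axis=1))
```

For work that fits in a single block, the with form is shorter and does the same cleanup. The explicit try/finally form is mainly useful when caching and uncaching have to happen in different places.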