pyspark.pandas.Series.corr

Series.corr(other: pyspark.pandas.series.Series, method: str = 'pearson') → float

Compute correlation with other Series, excluding missing values.

Parameters
other : Series
method : {‘pearson’, ‘spearman’}
  • pearson : standard correlation coefficient

  • spearman : Spearman rank correlation
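
To make the two options concrete, the following plain-Python sketch computes both coefficients by hand. It is an illustration only, not the pandas-on-Spark implementation: the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical, and it assumes tied values receive the mean of their rank positions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


s1 = [0.2, 0.0, 0.6, 0.2]
s2 = [0.3, 0.6, 0.0, 0.1]
pearson_r = pearson(s1, s2)
spearman_r = spearman(s1, s2)
```

For these inputs, `pearson_r` and `spearman_r` agree with the values in the Examples section to six decimal places.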

Returns
correlation : float

Notes

There are behavior differences between pandas-on-Spark and pandas.

  • the method argument accepts only ‘pearson’ and ‘spearman’

  • the data should not contain NaNs; pandas-on-Spark returns an error if it does.

  • pandas-on-Spark doesn’t support the following argument(s).

    • min_periods argument is not supported
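
Given the notes above, one possible workaround is to drop incomplete rows and check the number of remaining pairs before calling corr. The sketch below is an assumption-laden illustration: `corr_after_dropna` is a hypothetical helper, not part of pandas-on-Spark, and returning NaN when too few pairs remain is a choice made here to mirror what pandas does with min_periods. It relies only on column selection, `dropna`, `len`, and `Series.corr`.

```python
def corr_after_dropna(frame, left, right, method="pearson", min_periods=1):
    """Correlate two columns of `frame` after dropping rows with missing values.

    `frame` is assumed to be a pandas-on-Spark DataFrame holding both columns.
    `min_periods` is emulated by hand, since Series.corr does not accept it here.
    """
    # Keep only rows where both columns are present.
    pair = frame[[left, right]].dropna()
    # Emulate min_periods: too few complete pairs yields NaN (an assumption).
    if len(pair) < min_periods:
        return float("nan")
    return pair[left].corr(pair[right], method=method)
```

With a DataFrame that has columns 's1' and 's2', the call would look like `corr_after_dropna(df, 's1', 's2', method='spearman', min_periods=3)`. Note that `len` on a pandas-on-Spark DataFrame triggers a count over the data.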

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'s1': [.2, .0, .6, .2],
...                    's2': [.3, .6, .0, .1]})
>>> s1 = df.s1
>>> s2 = df.s2
>>> s1.corr(s2, method='pearson')
-0.851064...
>>> s1.corr(s2, method='spearman')
-0.948683...