pyspark.pandas.DataFrame.cov

DataFrame.cov(min_periods: Optional[int] = None, ddof: int = 1) → pyspark.pandas.frame.DataFrame

Compute pairwise covariance of columns, excluding NA/null values.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

NA and null values are excluded from the calculation automatically. The min_periods parameter sets the minimum number of observations required for each computed value; column pairs with fewer observations than this threshold are returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

New in version 3.3.0.

Parameters

min_periods (int, optional): Minimum number of observations required per pair of columns to have a valid result.
ddof (int, default 1): Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. New in version 3.4.0.
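
To make the role of ddof concrete, the following is a minimal pure-Python sketch of a covariance with divisor N - ddof. It is an illustration only, not the pyspark.pandas implementation; the helper name sample_cov is hypothetical. The data are the dogs/cats values used in the Examples section.

```python
def sample_cov(xs, ys, ddof=1):
    """Covariance of two equal-length sequences, divided by N - ddof.

    Hypothetical helper for illustration; assumes no missing values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - ddof)


dogs = [1, 0, 2, 1]
cats = [2, 3, 0, 1]

cov_default = sample_cov(dogs, cats)             # divisor 4 - 1 = 3
cov_population = sample_cov(dogs, cats, ddof=0)  # divisor 4 - 0 = 4
print(cov_default, cov_population)
```

With the default ddof=1 the result for dogs and cats is -1.0, which agrees with the off-diagonal entry of the first example's output; a larger ddof shrinks the divisor and so scales every entry of the matrix by the same factor.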

Returns

DataFrame: The covariance matrix of the series of the DataFrame.

See also

Series.cov: Compute covariance with another Series.

Examples

>>> df = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
...                   columns=['dogs', 'cats'])
>>> df.cov()
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667

>>> np.random.seed(42)
>>> df = ps.DataFrame(np.random.randn(1000, 5),
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795
>>> df.cov(ddof=2)
          a         b         c         d         e
a  0.999439 -0.020181  0.059336 -0.008952  0.014159
b -0.020181  1.060413 -0.008551 -0.024762  0.009836
c  0.059336 -0.008551  1.011683 -0.001487 -0.000271
d -0.008952 -0.024762 -0.001487  0.922220 -0.013705
e  0.014159  0.009836 -0.000271 -0.013705  0.978775
>>> df.cov(ddof=-1)
          a         b         c         d         e
a  0.996444 -0.020121  0.059158 -0.008926  0.014116
b -0.020121  1.057235 -0.008526 -0.024688  0.009807
c  0.059158 -0.008526  1.008650 -0.001483 -0.000270
d -0.008926 -0.024688 -0.001483  0.919456 -0.013664
e  0.014116  0.009807 -0.000270 -0.013664  0.975842

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair to have a valid result:

>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> sdf = ps.from_pandas(df)
>>> sdf.cov(min_periods=12)
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
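
The NaN entries above arise because columns a and b share only 10 rows in which both are non-null, fewer than the requested 12. The rule can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the library's code: the helper name pairwise_cov is hypothetical, and returning NaN when N - ddof is not positive is a choice made for this sketch.

```python
import math


def pairwise_cov(xs, ys, min_periods=None, ddof=1):
    """Covariance over rows where both values are present.

    Hypothetical helper for illustration only.
    """
    # Keep only rows in which neither value is None or NaN.
    pairs = [
        (x, y)
        for x, y in zip(xs, ys)
        if x is not None and y is not None
        and not math.isnan(x) and not math.isnan(y)
    ]
    n = len(pairs)
    # Too few complete observations for this column pair.
    if min_periods is not None and n < min_periods:
        return float("nan")
    # Assumption of this sketch: a non-positive divisor yields NaN.
    if n - ddof <= 0:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return cross / (n - ddof)


nan = float("nan")
a = [nan, nan, 1.0, 2.0, 3.0, 4.0]
b = [1.0, 2.0, nan, nan, 2.0, 5.0]
c = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

cov_ab = pairwise_cov(a, b, min_periods=3)  # only 2 complete rows
cov_ac = pairwise_cov(a, c, min_periods=3)  # 4 complete rows
print(cov_ab, cov_ac)
```

Here a and b overlap in only two rows, so the pair falls below min_periods=3 and the result is NaN, while a and c overlap in four rows and produce an ordinary covariance.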
