pyspark.pandas.Index.value_counts

Index.value_counts(normalize: bool = False, sort: bool = True, ascending: bool = False, bins: None = None, dropna: bool = True) → Series

Return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters

normalize : boolean, default False
    If True then the object returned will contain the relative frequencies of the unique values.
sort : boolean, default True
    Sort the result by counts.
ascending : boolean, default False
    Sort in ascending order.
bins : Not Yet Supported
dropna : boolean, default True
    Don’t include counts of NaN.
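
How the four supported parameters interact can be illustrated with a plain-Python sketch. The helper names below (`value_counts_sketch`, `_is_missing`) are hypothetical, and this is not how pandas-on-Spark computes the result. It only mirrors the documented behavior on a small in-memory list and returns (value, count) pairs rather than a Series. The order of tied counts is a property of this sketch only.

```python
import math
from collections import Counter


def _is_missing(value):
    # Hypothetical helper: treat None and float NaN as missing values.
    return value is None or (isinstance(value, float) and math.isnan(value))


def value_counts_sketch(values, normalize=False, sort=True,
                        ascending=False, dropna=True):
    # Count non-missing values; tally missing ones separately because
    # NaN != NaN and would otherwise not group under a single key.
    counts = Counter()
    missing = 0
    for value in values:
        if _is_missing(value):
            missing += 1
        else:
            counts[value] += 1

    pairs = list(counts.items())
    if not dropna and missing:
        pairs.append((float("nan"), missing))

    if sort:
        # Sort by count; descending unless ascending=True.
        pairs.sort(key=lambda pair: pair[1], reverse=not ascending)

    if normalize:
        # Divide each count by the sum of the counts that were kept.
        total = sum(count for _, count in pairs)
        if total:
            pairs = [(value, count / total) for value, count in pairs]
    return pairs


data = [0, 0, 1, 1, 1, float("nan")]
print(value_counts_sketch(data))                  # [(1, 3), (0, 2)]
print(value_counts_sketch(data, normalize=True))  # [(1, 0.6), (0, 0.4)]
print(value_counts_sketch(data, ascending=True))  # [(0, 2), (1, 3)]
```

With `dropna=True` the missing entry is excluded before normalizing, which is why the relative frequencies are computed over five values rather than six.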

Returns

counts : Series

See also

Series.count: Number of non-NA elements in a Series.

Examples

For Series

>>> import numpy as np
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'x': [0, 0, 1, 1, 1, np.nan]})
>>> df.x.value_counts()  
1.0    3
0.0    2
Name: x, dtype: int64

normalize

With normalize set to True, returns the relative frequencies, computed by dividing each count by the sum of all counts.

>>> df.x.value_counts(normalize=True)  
1.0    0.6
0.0    0.4
Name: x, dtype: float64

dropna

With dropna set to False we can also see NaN index values.

>>> df.x.value_counts(dropna=False)  
1.0    3
0.0    2
NaN    1
Name: x, dtype: int64

For Index

>>> idx = ps.Index([3, 1, 2, 3, 4, np.nan])
>>> idx  
Float64Index([3.0, 1.0, 2.0, 3.0, 4.0, nan], dtype='float64')

>>> idx.value_counts().sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
dtype: int64

sort

With sort set to False, the result is not sorted by count.

>>> idx.value_counts(sort=False).sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
dtype: int64

normalize

With normalize set to True, returns the relative frequencies, computed by dividing each count by the sum of all counts.

>>> idx.value_counts(normalize=True).sort_index()
1.0    0.2
2.0    0.2
3.0    0.4
4.0    0.2
dtype: float64

dropna

With dropna set to False we can also see NaN index values.

>>> idx.value_counts(dropna=False).sort_index()  
1.0    1
2.0    1
3.0    2
4.0    1
NaN    1
dtype: int64

For MultiIndex

>>> midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
...                       ['speed', 'weight', 'length']],
...                      [[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                       [1, 1, 1, 1, 1, 2, 1, 2, 2]])
>>> s = ps.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3], index=midx)
>>> s.index  
MultiIndex([(  'lama', 'weight'),
            (  'lama', 'weight'),
            (  'lama', 'weight'),
            (   'cow', 'weight'),
            (   'cow', 'weight'),
            (   'cow', 'length'),
            ('falcon', 'weight'),
            ('falcon', 'length'),
            ('falcon', 'length')],
           )

>>> s.index.value_counts().sort_index()
(cow, length)       1
(cow, weight)       2
(falcon, length)    2
(falcon, weight)    1
(lama, weight)      3
dtype: int64

>>> s.index.value_counts(normalize=True).sort_index()
(cow, length)       0.111111
(cow, weight)       0.222222
(falcon, length)    0.222222
(falcon, weight)    0.111111
(lama, weight)      0.333333
dtype: float64

If the Index has a name, the result keeps that name.

>>> idx = ps.Index([0, 0, 0, 1, 1, 2, 3], name='pandas-on-Spark')
>>> idx.value_counts().sort_index()
0    3
1    2
2    1
3    1
Name: pandas-on-Spark, dtype: int64
