pyspark.pandas.Index.value_counts

Index.value_counts(normalize: bool = False, sort: bool = True, ascending: bool = False, bins: None = None, dropna: bool = True) → Series

Return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently occurring element. Excludes NA values by default.
- Parameters
- normalize : boolean, default False
If True, the returned object contains the relative frequencies of the unique values.
- sort : boolean, default True
Sort by counts (frequencies).
- ascending : boolean, default False
Sort in ascending order.
- bins : Not yet supported
- dropna : boolean, default True
Don’t include counts of NaN.
- Returns
- counts : Series
See also
Series.count
Number of non-NA elements in a Series.
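How the parameters above interact can be sketched in plain Python. This is only an illustration of the documented semantics, not the pandas-on-Spark implementation: `value_counts_sketch` is a hypothetical helper, it returns a list of `(value, count)` pairs rather than a Series, and it assumes that with `dropna=False` the NaN count is included in the total used by `normalize`.

```python
import math
from collections import Counter


def value_counts_sketch(values, normalize=False, sort=True,
                        ascending=False, dropna=True):
    """Hypothetical pure-Python illustration; not pyspark code."""

    def is_na(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Count non-missing values. Missing values are tallied separately
    # because NaN != NaN, so a Counter would not group them reliably.
    counts = Counter(v for v in values if not is_na(v))
    na_count = sum(1 for v in values if is_na(v))

    items = list(counts.items())
    if not dropna and na_count:
        items.append((float("nan"), na_count))

    if normalize:
        # Assumption: divide each count by the sum of the counts kept.
        total = sum(c for _, c in items)
        items = [(v, c / total) for v, c in items]

    if sort:
        # Descending by count unless ascending=True.
        items.sort(key=lambda kv: kv[1], reverse=not ascending)
    return items


data = [0, 0, 1, 1, 1, float("nan")]
default_counts = value_counts_sketch(data)
relative_counts = value_counts_sketch(data, normalize=True)
with_nan = value_counts_sketch(data, dropna=False)
```

With `data` as above, `default_counts` holds the value 1 with count 3 followed by the value 0 with count 2, which mirrors the Series example in the next section. Ties keep their first-seen order here; a distributed backend need not guarantee that, which may be why the Index examples call `sort_index()` before printing.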
Examples
For Series
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'x':[0, 0, 1, 1, 1, np.nan]})
>>> df.x.value_counts()
1.0    3
0.0    2
Name: x, dtype: int64
With normalize set to True, returns the relative frequencies, obtained by dividing each count by the sum of all counts.
>>> df.x.value_counts(normalize=True)
1.0    0.6
0.0    0.4
Name: x, dtype: float64
dropna
With dropna set to False we can also see NaN index values.
>>> df.x.value_counts(dropna=False)
1.0    3
0.0    2
NaN    1
Name: x, dtype: int64
For Index
>>> idx = ps.Index([3, 1, 2, 3, 4, np.nan])
>>> idx
Float64Index([3.0, 1.0, 2.0, 3.0, 4.0, nan], dtype='float64')
>>> idx.value_counts().sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
dtype: int64
sort
With sort set to False, the result is not sorted by count.
>>> idx.value_counts(sort=False).sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
dtype: int64
normalize
With normalize set to True, returns the relative frequencies, obtained by dividing each count by the sum of all counts.
>>> idx.value_counts(normalize=True).sort_index()
1.0    0.2
2.0    0.2
3.0    0.4
4.0    0.2
dtype: float64
dropna
With dropna set to False we can also see NaN index values.
>>> idx.value_counts(dropna=False).sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
NaN    1
dtype: int64
For MultiIndex
>>> import pandas as pd
>>> midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
...                       ['speed', 'weight', 'length']],
...                      [[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                       [1, 1, 1, 1, 1, 2, 1, 2, 2]])
>>> s = ps.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3], index=midx)
>>> s.index
MultiIndex([(  'lama', 'weight'),
            (  'lama', 'weight'),
            (  'lama', 'weight'),
            (   'cow', 'weight'),
            (   'cow', 'weight'),
            (   'cow', 'length'),
            ('falcon', 'weight'),
            ('falcon', 'length'),
            ('falcon', 'length')],
           )
>>> s.index.value_counts().sort_index()
(cow, length)       1
(cow, weight)       2
(falcon, length)    2
(falcon, weight)    1
(lama, weight)      3
dtype: int64
>>> s.index.value_counts(normalize=True).sort_index()
(cow, length)       0.111111
(cow, weight)       0.222222
(falcon, length)    0.222222
(falcon, weight)    0.111111
(lama, weight)      0.333333
dtype: float64
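The MultiIndex figures above can be checked by hand: each distinct (level 0, level 1) pair is counted as a single value, and the relative frequency is that count divided by the 9 entries. The following plain-Python sketch reproduces the arithmetic with `collections.Counter`; it is an illustration only and does not use or describe the pandas-on-Spark internals. The names `pairs`, `counts`, and `relative` are chosen here for the example.

```python
from collections import Counter

# The same (level-0, level-1) pairs that s.index displays above.
pairs = [
    ("lama", "weight"), ("lama", "weight"), ("lama", "weight"),
    ("cow", "weight"), ("cow", "weight"), ("cow", "length"),
    ("falcon", "weight"), ("falcon", "length"), ("falcon", "length"),
]

# Each whole tuple is one value to count.
counts = Counter(pairs)
total = sum(counts.values())

# Relative frequencies, ordered by tuple to mimic sort_index()
# and rounded to six decimals as in the printed output.
relative = {pair: round(n / total, 6) for pair, n in sorted(counts.items())}
```

Here `counts[("lama", "weight")]` is 3 and `total` is 9, so the relative frequency of that pair is 3/9, which rounds to the 0.333333 shown in the output.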
If the Index has a name, the result keeps that name.
>>> idx = ps.Index([0, 0, 0, 1, 1, 2, 3], name='pandas-on-Spark')
>>> idx.value_counts().sort_index()
0    3
1    2
2    1
3    1
Name: pandas-on-Spark, dtype: int64