pyspark.pandas.DataFrame.sample

DataFrame.sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, random_state: Optional[int] = None) → pyspark.pandas.frame.DataFrame

Return a random sample of items from an axis of the object.

Call this function with named arguments, specifying the frac argument.

You can use random_state for reproducibility. Note, however, that unlike in pandas, specifying a seed in pandas-on-Spark/Spark does not guarantee that the sampled rows will be fixed. The result set depends not only on the seed but also on how the data is distributed across machines and, to some extent, on network randomness when shuffle operations are involved. Even in the simplest case, the result set depends on the system's CPU core count.
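The dependence on data distribution can be pictured with a toy model in plain Python. This is only an illustrative sketch, not Spark's actual sampling code: the function name toy_partitioned_sample, the round-robin split, and the way each partition derives its generator from the seed are all assumptions made for the example.

```python
import random


def toy_partitioned_sample(rows, frac, seed, num_partitions):
    """Toy model only: sample each partition with its own seeded generator."""
    sampled = []
    for pid in range(num_partitions):
        # A round-robin split stands in for however the data happens to be distributed.
        partition = rows[pid::num_partitions]
        # Assumption for this sketch: each partition derives its generator from the shared seed.
        rng = random.Random(seed + pid)
        # Keep each row independently with probability `frac`.
        sampled.extend(row for row in partition if rng.random() < frac)
    return sorted(sampled)


rows = list(range(20))
# Same seed and fraction, different partition counts.
two_parts = toy_partitioned_sample(rows, frac=0.25, seed=1, num_partitions=2)
four_parts = toy_partitioned_sample(rows, frac=0.25, seed=1, num_partitions=4)
print(two_parts)
print(four_parts)
```

In this model the seed fixes the random draws, but which row meets which draw depends on the partitioning, so the two calls need not select the same rows even though the seed is identical.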

Parameters

n (int, optional): Number of items to return. This is currently NOT supported. Use frac instead.
frac (float, optional): Fraction of axis items to return.
replace (bool, default False): Sample with or without replacement.
random_state (int, optional): Seed for the random number generator (if int).

Returns

Series or DataFrame: A new object of the same type as the caller, containing the sampled items.

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'],
...                   columns=['num_legs', 'num_wings', 'num_specimen_seen'])
>>> df  
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

A random 25% sample of the DataFrame. Note that we use random_state to ensure the reproducibility of the examples.

>>> df.sample(frac=0.25, random_state=1)  
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8

Extract a random 40% sample of the Series df['num_legs'], with replacement, so the same item can appear more than once.

>>> df['num_legs'].sample(frac=0.4, replace=True, random_state=1)  
falcon    2
spider    8
spider    8
Name: num_legs, dtype: int64

Specifying the exact number of items to return is not supported at the moment.

>>> df.sample(n=5)  
Traceback (most recent call last):
    ...
NotImplementedError: Function sample currently does not support specifying ...
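
Since only frac is accepted, one possible workaround is to convert a desired row count into a fraction before calling sample. The sketch below is a suggestion rather than part of the pandas-on-Spark API: frac_for_n and sample_about_n are hypothetical helper names, and the result holds only approximately n rows because fraction-based sampling in Spark does not guarantee an exact count.

```python
def frac_for_n(n: int, total_rows: int) -> float:
    """Hypothetical helper: turn a desired row count into a `frac` value."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # Cap at 1.0 so the fraction stays valid when sampling without replacement.
    return min(1.0, n / total_rows)


def sample_about_n(psdf, n: int, random_state=None):
    """Hypothetical wrapper: sample roughly `n` rows from a pandas-on-Spark DataFrame."""
    # len() on a pandas-on-Spark DataFrame runs a Spark job to count the rows.
    frac = frac_for_n(n, len(psdf))
    return psdf.sample(frac=frac, random_state=random_state)
```

With the four-row df above, sample_about_n(df, 2, random_state=1) would call df.sample(frac=0.5, random_state=1). The number of rows returned may differ from 2.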
