pyspark.pandas.DataFrame.sample
DataFrame.sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, random_state: Optional[int] = None) → pyspark.pandas.frame.DataFrame

Return a random sample of items from an axis of the object.

Call this function with keyword arguments and specify the frac argument.

You can use random_state for reproducibility. However, unlike in pandas, specifying a seed in pandas-on-Spark/Spark does not guarantee that the sampled rows will be fixed. The result set depends not only on the seed but also on how the data is distributed across machines and, to some extent, on network randomness when shuffle operations are involved. Even in the simplest case, the result set depends on the system's CPU core count.
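The dependence on data layout can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not Spark's actual implementation: the function sample_partitioned, the round-robin partitioning, and the per-partition seeding scheme are all hypothetical and exist only to show why one seed can select different rows when the same data is split differently.

```python
import random
from typing import List, Sequence


def sample_partitioned(rows: Sequence[str], num_partitions: int,
                       frac: float, seed: int) -> List[str]:
    """Toy model: every partition samples its own rows independently."""
    # Assumption: rows are dealt round-robin into partitions.
    # A real cluster decides the layout itself.
    partitions = [list(rows[i::num_partitions]) for i in range(num_partitions)]
    sampled = []
    for index, partition in enumerate(partitions):
        # Hypothetical scheme: derive each partition's generator from the seed.
        rng = random.Random(seed + index)
        # Keep each row with probability `frac` (sampling without replacement).
        sampled.extend(row for row in partition if rng.random() < frac)
    return sampled


rows = [f"row{i}" for i in range(20)]

# Same seed and same fraction, but two different layouts of the same data.
one = sample_partitioned(rows, num_partitions=1, frac=0.25, seed=1)
four = sample_partitioned(rows, num_partitions=4, frac=0.25, seed=1)

print(one)
print(four)
```

In this model, repeating a call with the same seed and the same number of partitions returns the same rows, while changing only num_partitions can change which rows are kept. That is the reason a seed alone does not pin down the result when the partitioning is outside your control.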
Parameters
n : int, optional
    Number of items to return. This is currently NOT supported. Use frac instead.
frac : float, optional
    Fraction of axis items to return.
replace : bool, default False
    Sample with or without replacement.
random_state : int, optional
    Seed for the random number generator (if int).

Returns
Series or DataFrame
    A new object of the same type as the caller, containing the sampled items.

Examples
>>> df = ps.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'],
...                   columns=['num_legs', 'num_wings', 'num_specimen_seen'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

A random 25% sample of the DataFrame. Note that we use random_state to ensure the reproducibility of the examples.

>>> df.sample(frac=0.25, random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8

Extract a random 40% sample from the Series df['num_legs'], with replacement, so the same item can appear more than once.

>>> df['num_legs'].sample(frac=0.4, replace=True, random_state=1)
falcon    2
spider    8
spider    8
Name: num_legs, dtype: int64

Specifying the exact number of items to return is not supported at the moment.

>>> df.sample(n=5)
Traceback (most recent call last):
    ...
NotImplementedError: Function sample currently does not support specifying ...
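
Because n is not supported, one possible workaround is to turn a target row count into a fraction and pass that as frac. The sketch below is an illustration, not part of the pandas-on-Spark API: frac_for_n and sample_about_n are hypothetical helper names, and the code assumes the object passed in supports len() and sample(frac=..., random_state=...). The number of rows returned is only approximately n, since fraction-based sampling does not guarantee an exact count.

```python
def frac_for_n(n: int, total_rows: int) -> float:
    """Fraction to pass as `frac` so that roughly `n` of `total_rows` rows are drawn."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if total_rows <= 0:
        # Nothing to sample from.
        return 0.0
    # Cap at 1.0: without replacement, at most every row can be returned.
    return min(1.0, n / total_rows)


def sample_about_n(psdf, n: int, random_state=None):
    """Draw approximately `n` rows (hypothetical helper, not a library function).

    Assumes `psdf` supports len() and .sample(frac=..., random_state=...).
    Note that len() on a distributed DataFrame requires counting its rows.
    """
    frac = frac_for_n(n, len(psdf))
    return psdf.sample(frac=frac, random_state=random_state)


# One row out of four corresponds to the 25% fraction used in the examples above.
print(frac_for_n(1, 4))
```

With the df defined in the examples, a call such as sample_about_n(df, n=2, random_state=1) would request frac=0.5. As with any fraction-based sample, the result may contain more or fewer than two rows.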