RDD.takeSample(withReplacement, num, seed=None)
Return a fixed-size sampled subset of this RDD.
New in version 1.3.0.
Parameters
withReplacement : bool
    whether sampling is done with replacement
num : int
    size of the returned sample
seed : int, optional
    random seed
Returns
list
    a fixed-size sampled subset of this RDD in an array
See also
RDD.sample()
Notes
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
Examples
>>> import sys
>>> rdd = sc.parallelize(range(0, 10))
>>> len(rdd.takeSample(True, 20, 1))
20
>>> len(rdd.takeSample(False, 5, 2))
5
>>> len(rdd.takeSample(False, 15, 3))
10
>>> sc.range(0, 10).takeSample(False, sys.maxsize)
Traceback (most recent call last):
    ...
ValueError: Sample size cannot be greater than ...
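The size behavior in the examples above can be mimicked on a plain Python list, which may help when reasoning about what to expect without a running SparkContext. The sketch below is only a local analogy: `take_sample_local` is a hypothetical helper, not part of PySpark, and it does not reproduce Spark's distributed sampling algorithm or its seeded output. It only illustrates that sampling with replacement can return more elements than the input holds, while sampling without replacement is capped at the input size.

```python
import random


def take_sample_local(data, with_replacement, num, seed=None):
    """Hypothetical local analogue of RDD.takeSample for an in-memory iterable.

    Assumption: this mirrors only the size semantics of takeSample,
    not Spark's actual sampling implementation.
    """
    if num < 0:
        raise ValueError("Sample size cannot be negative.")
    items = list(data)
    if num == 0 or not items:
        return []
    rng = random.Random(seed)  # seed=None falls back to system randomness
    if with_replacement:
        # Each draw is independent, so num may exceed len(items).
        return [rng.choice(items) for _ in range(num)]
    # Without replacement, at most len(items) distinct positions can be drawn.
    return rng.sample(items, min(num, len(items)))


data = range(0, 10)
print(len(take_sample_local(data, True, 20, 1)))   # 20
print(len(take_sample_local(data, False, 5, 2)))   # 5
print(len(take_sample_local(data, False, 15, 3)))  # 10
```

As with `takeSample` itself, the whole result is materialized in local memory, so this pattern only makes sense for small samples.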