pyspark.mllib.random.RandomRDDs
Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
New in version 1.1.0.
Methods
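exponentialRDD(sc, mean, size[, …])
Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
exponentialVectorRDD(sc, mean, numRows, numCols)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
gammaRDD(sc, shape, scale, size[, …])
Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
gammaVectorRDD(sc, shape, scale, numRows, …)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
logNormalRDD(sc, mean, std, size[, …])
Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
logNormalVectorRDD(sc, mean, std, numRows, …)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
normalRDD(sc, size[, numPartitions, seed])
Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
normalVectorRDD(sc, numRows, numCols[, …])
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
poissonRDD(sc, mean, size[, numPartitions, seed])
Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
poissonVectorRDD(sc, mean, numRows, numCols)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
uniformRDD(sc, size[, numPartitions, seed])
Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
uniformVectorRDD(sc, numRows, numCols[, …])
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).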
Methods Documentation
exponentialRDD
Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
New in version 1.3.0.
sc (pyspark.SparkContext): SparkContext used to create the RDD.
mean: Mean, or 1 / lambda, for the Exponential distribution.
size: Size of the RDD.
numPartitions: Number of partitions in the RDD (default: sc.defaultParallelism).
seed: Random seed (default: a random long integer).
pyspark.RDD
RDD of float comprised of i.i.d. samples ~ Exp(mean).
Examples
>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> # the standard deviation of Exp(mean) equals mean, not sqrt(mean)
>>> abs(stats.stdev() - mean) < 0.5
True
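The relationship between the mean parameter and the rate lambda can be checked locally without Spark. The sketch below is a stand-in that uses Python's random module rather than RandomRDDs; the helper names rate_to_mean and local_exponential_samples are hypothetical and exist only for this illustration.

```python
import random
from statistics import fmean, pstdev


def rate_to_mean(lam):
    """Convert a rate (lambda) into the mean parameter, mean = 1 / lambda."""
    if lam <= 0:
        raise ValueError("lambda must be positive")
    return 1.0 / lam


def local_exponential_samples(mean, size, seed):
    """Local stand-in (no Spark): draw `size` samples ~ Exp(mean)."""
    rng = random.Random(seed)
    # random.expovariate expects the rate, so pass 1 / mean.
    return [rng.expovariate(1.0 / mean) for _ in range(size)]


mean = rate_to_mean(0.5)
samples = local_exponential_samples(mean, 10000, seed=2)
sample_mean = fmean(samples)
sample_std = pstdev(samples)  # for Exp(mean) this should be close to mean
```

The converted value would then be passed as the mean argument, for example RandomRDDs.exponentialRDD(sc, mean, 1000).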
exponentialVectorRDD
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
numRows: Number of Vectors in the RDD.
numCols: Number of elements in each Vector.
numPartitions: Number of partitions in the RDD (default: sc.defaultParallelism).
RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).
>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.matrix(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> # the standard deviation of Exp(mean) equals mean, not sqrt(mean)
>>> abs(mat.std() - mean) < 0.5
True
gammaRDD
Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
shape: Shape (> 0) parameter for the Gamma distribution.
scale: Scale (> 0) parameter for the Gamma distribution.
RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
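The expected mean and standard deviation used in the example follow from the shape and scale parameters. As a rough local check that does not need Spark, the sketch below computes them and compares against samples from Python's random.gammavariate, which takes shape and scale in the same order. The helper name gamma_moments is hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def gamma_moments(shape, scale):
    """Return (mean, standard deviation) of Gamma(shape, scale)."""
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must both be > 0")
    return shape * scale, sqrt(shape * scale * scale)


shape, scale = 1.0, 2.0
exp_mean, exp_std = gamma_moments(shape, scale)

# Local stand-in for the RDD contents (no Spark involved).
rng = random.Random(2)
samples = [rng.gammavariate(shape, scale) for _ in range(10000)]
sample_mean = fmean(samples)
sample_std = pstdev(samples)
```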
gammaVectorRDD
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
shape: Shape (> 0) of the Gamma distribution.
scale: Scale (> 0) of the Gamma distribution.
RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).
>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
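logNormalRDD
Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
sc: SparkContext used to create the RDD.
mean: Mean for the log normal distribution, that is, the mean of the underlying normal distribution.
std: Standard deviation for the log normal distribution, that is, the standard deviation of the underlying normal distribution.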
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
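The formulas for expMean and expStd in the example imply that mean and std describe the logarithm of each sample rather than the sample itself. A minimal local sketch of that reading, using random.lognormvariate as a stand-in for the RDD contents, is shown below; the helper name lognormal_moments is hypothetical.

```python
import random
from math import exp, log, sqrt
from statistics import fmean, pstdev


def lognormal_moments(mean, std):
    """Mean and standard deviation of exp(Z) where Z ~ N(mean, std^2)."""
    exp_mean = exp(mean + 0.5 * std * std)
    exp_std = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
    return exp_mean, exp_std


mean, std = 0.0, 1.0
exp_mean, exp_std = lognormal_moments(mean, std)

# Local stand-in samples (no Spark involved).
rng = random.Random(2)
samples = [rng.lognormvariate(mean, std) for _ in range(10000)]

# Taking logs should recover the input parameters approximately.
logs = [log(x) for x in samples]
log_mean = fmean(logs)
log_std = pstdev(logs)
```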
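logNormalVectorRDD
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
mean: Mean of the log normal distribution, that is, the mean of the underlying normal distribution.
std: Standard deviation of the log normal distribution, that is, the standard deviation of the underlying normal distribution.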
RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).
>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
normalRDD
Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v)
RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).
>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - 0.0) < 0.1
True
>>> abs(stats.stdev() - 1.0) < 0.1
True
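The rescaling from standard normal to N(mean, sigma^2) is an ordinary per-element function, so it can be tried locally before it is passed to map. The sketch below uses random.gauss as a stand-in for the RDD contents; rescale_normal is a hypothetical helper name.

```python
import random
from statistics import fmean, pstdev


def rescale_normal(v, mean, sigma):
    """Map a standard normal sample v to a sample from N(mean, sigma^2)."""
    return mean + sigma * v


# Local stand-in for standard normal samples (no Spark involved).
rng = random.Random(1)
standard = [rng.gauss(0.0, 1.0) for _ in range(10000)]

target_mean, target_sigma = 5.0, 2.0
rescaled = [rescale_normal(v, target_mean, target_sigma) for v in standard]
rescaled_mean = fmean(rescaled)
rescaled_std = pstdev(rescaled)
```

With Spark available, the same function could be applied as RandomRDDs.normalRDD(sc, n).map(lambda v: rescale_normal(v, 5.0, 2.0)).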
normalVectorRDD
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - 0.0) < 0.1
True
>>> abs(mat.std() - 1.0) < 0.1
True
poissonRDD
Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
mean: Mean, or lambda, for the Poisson distribution.
RDD of float comprised of i.i.d. samples ~ Pois(mean).
>>> from math import sqrt
>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
poissonVectorRDD
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).
>>> import numpy as np
>>> from math import sqrt
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.matrix(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
uniformRDD
Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)
RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).
>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
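The U(0.0, 1.0) to U(a, b) transformation can likewise be exercised locally. In the sketch below, random.random stands in for the RDD contents, and scale_uniform is a hypothetical helper name.

```python
import random
from statistics import fmean


def scale_uniform(v, a, b):
    """Map a sample v from U(0.0, 1.0) to a sample from U(a, b)."""
    return a + (b - a) * v


a, b = -1.0, 3.0

# Local stand-in for U(0.0, 1.0) samples (no Spark involved).
rng = random.Random(4)
unit = [rng.random() for _ in range(10000)]

scaled = [scale_uniform(v, a, b) for v in unit]
scaled_mean = fmean(scaled)  # should be close to (a + b) / 2
```

With Spark available, the same function could be applied as RandomRDDs.uniformRDD(sc, n).map(lambda v: scale_uniform(v, a, b)).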
uniformVectorRDD
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
numPartitions: Number of partitions in the RDD.
seed: Seed for the RNG that generates the seed for the generator in each partition.
RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> mat.max() <= 1.0 and mat.min() >= 0.0
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4
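The seed parameter is described as a seed for the RNG that generates the seed for the generator in each partition. One way to picture that two-level scheme is the local sketch below. It is an illustration only and is not Spark's implementation; the function names partition_seeds and generate_partition, and the choice of 64-bit seeds, are assumptions made for this example.

```python
import random


def partition_seeds(seed, num_partitions):
    """Derive one seed per partition from a single top-level seed."""
    master = random.Random(seed)
    # Assumption: 64-bit per-partition seeds; chosen only for illustration.
    return [master.getrandbits(64) for _ in range(num_partitions)]


def generate_partition(partition_seed, n):
    """Generate n samples ~ U(0.0, 1.0) with a partition-local generator."""
    rng = random.Random(partition_seed)
    return [rng.random() for _ in range(n)]


seeds = partition_seeds(seed=4, num_partitions=4)
partitions = [generate_partition(s, 25) for s in seeds]
total = sum(len(p) for p in partitions)
```

Under this scheme the same top-level seed reproduces the same per-partition seeds, while each partition still draws from its own independent generator.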