pyspark.mllib.feature.StandardScaler

class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True)
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
New in version 1.2.0.
Parameters

withMean : bool, optional
    False by default. Centers the data with the mean before scaling. This builds a dense output, so take care when applying to sparse input.
withStd : bool, optional
    True by default. Scales the data to unit standard deviation.
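With both withMean and withStd enabled, each column is centered by its mean and divided by its standard deviation. The following is a minimal NumPy sketch (not part of the pyspark API) that reproduces the numbers in the Examples section, assuming the column statistics are the unbiased (N-1) sample mean and standard deviation:

import numpy as np

X = np.array([[-2.0, 2.3, 0.0],
              [3.8, 0.0, 1.9]])   # same rows as the example below

mean = X.mean(axis=0)             # column means: [0.9, 1.15, 0.95]
std = X.std(axis=0, ddof=1)       # sample std:   [~4.10, ~1.63, ~1.34]

print((X - mean) / std)           # rows of +/-0.7071, matching the doctest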
Examples
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True
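As a supplementary sketch (not part of the original doctest), the fitted model can also transform a single vector rather than an RDD; `new_point` is just an illustrative name:

>>> new_point = Vectors.dense([1.0, 1.0, 1.0])
>>> scaled_point = model.transform(new_point)  # (new_point - model.mean) / model.std, elementwise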
Methods

fit(dataset)
    Computes the mean and variance and stores as a model to be used for later scaling.
Methods Documentation

fit(dataset)

    Parameters
    dataset : pyspark.RDD
        The data used to compute the mean and variance to build the transformation model.

    Returns
    StandardScalerModel
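Since withMean defaults to False, fit and transform can also be run on sparse input without the densification caveat noted above. A minimal sketch, assuming the same live SparkContext `sc` as in the Examples section:

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

sparse_data = sc.parallelize([
    Vectors.sparse(3, {0: -2.0, 1: 2.3}),
    Vectors.sparse(3, {0: 3.8, 2: 1.9}),
])

scaler = StandardScaler(withMean=False, withStd=True)  # keyword form of StandardScaler(False, True)
model = scaler.fit(sparse_data)                        # returns a StandardScalerModel
scaled = model.transform(sparse_data).collect()        # each column divided by its standard deviation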