DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
Abstract class for transformers that transform one dataset into another.
New in version 1.3.0.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
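The default copy behaviour described above can be sketched in plain Python. This is a minimal illustration only: `ToyParams` and its `_paramMap` and `_defaultParamMap` attributes are hypothetical stand-ins, not the actual PySpark classes; only the use of `copy.copy()` is taken from the description.

```python
import copy


class ToyParams:
    """Hypothetical stand-in for a component that carries params."""

    def __init__(self, uid):
        self.uid = uid
        self._defaultParamMap = {}  # default values
        self._paramMap = {}         # user-supplied values

    def copy(self, extra=None):
        extra = extra or {}
        # Shallow copy: the new object initially shares its attributes.
        that = copy.copy(self)
        # Give the copy its own param maps so later changes do not leak back.
        that._defaultParamMap = dict(self._defaultParamMap)
        that._paramMap = dict(self._paramMap)
        # Extra params override whatever was set on the original.
        that._paramMap.update(extra)
        return that


original = ToyParams("toy_0001")
original._paramMap["threshold"] = 1.0
clone = original.copy({"threshold": 0.5})
```

The copy keeps the uid of the original, while the extra value affects only the copy.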
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
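The precedence rule above (default param values < user-supplied values < extra) amounts to three successive dictionary updates. The sketch below uses plain dicts with string keys purely for illustration; `merge_param_maps` is a hypothetical helper, not part of PySpark.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three maps into one flat map; later sources win on conflict."""
    merged = dict(defaults)
    merged.update(user_supplied)
    merged.update(extra or {})
    return merged


defaults = {"threshold": 0.0, "outputCol": "out"}
user_supplied = {"threshold": 1.0}
extra = {"outputCol": "vector"}
merged = merge_param_maps(defaults, user_supplied, extra)
```

Here `threshold` comes from the user-supplied map and `outputCol` from `extra`, since each overrides the default.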
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
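Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |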
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Abstract class for estimators that fit models to data.
New in version 1.3.0.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
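Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params; if a list or tuple of param maps is given, fit is called on each param map and a list of models is returned |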
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Abstract class for models that are fitted by estimators.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Transforms the input dataset with optional parameters.
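Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |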
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages. If stages is an empty list, the pipeline acts as an identity transformer.
New in version 1.3.0.
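The stage-by-stage fitting logic described above can be sketched without Spark. Everything in this block is a toy illustration: `ToyTransformer`, `ToyEstimator`, `ToyPipelineModel` and `fit_pipeline` are hypothetical names operating on plain Python lists, and the sketch transforms the data after every stage for simplicity.

```python
class ToyTransformer:
    """Hypothetical transformer: adds a constant to every value."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, dataset):
        return [x + self.offset for x in dataset]


class ToyEstimator:
    """Hypothetical estimator: learns the mean and returns a centering model."""

    def fit(self, dataset):
        mean = sum(dataset) / len(dataset)
        return ToyTransformer(-mean)


class ToyPipelineModel:
    """Holds the fitted stages and applies them in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, dataset):
        for stage in self.stages:
            dataset = stage.transform(dataset)
        return dataset


def fit_pipeline(stages, dataset):
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            # Estimator: fit a model, then use that model as the transformer.
            stage = stage.fit(dataset)
        fitted.append(stage)
        # The transformed dataset becomes the input to the next stage.
        dataset = stage.transform(dataset)
    return ToyPipelineModel(fitted)


data = [1.0, 2.0, 3.0]
model = fit_pipeline([ToyTransformer(1.0), ToyEstimator()], data)
centered = model.transform(data)
identity = fit_pipeline([], data).transform(data)
```

With an empty stage list the fitted model returns its input unchanged, which mirrors the identity-transformer behaviour noted above.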
Creates a copy of this instance.
Parameters: | extra – extra parameters |
---|---|
Returns: | new instance |
New in version 1.4.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
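Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params; if a list or tuple of param maps is given, fit is called on each param map and a list of models is returned |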
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the pipeline stages.
New in version 1.3.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
New in version 2.0.0.
Save this ML instance to the given path, a shortcut of write().save(path).
New in version 2.0.0.
Sets params for Pipeline.
New in version 1.3.0.
Sets the pipeline stages.
Parameters: | value – a list of transformers or estimators |
---|---|
Returns: | the pipeline instance |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
New in version 2.0.0.
Represents a compiled pipeline with transformers and fitted models.
New in version 1.3.0.
Creates a copy of this instance.
Parameters: | extra – extra parameters |
---|---|
Returns: | new instance |
New in version 1.4.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
New in version 2.0.0.
Save this ML instance to the given path, a shortcut of write().save(path).
New in version 2.0.0.
Transforms the input dataset with optional parameters.
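Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |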
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
New in version 2.0.0.
A param with self-contained documentation.
New in version 1.3.0.
Components that take parameters. This also provides an internal param map to store parameter values attached to the instance.
New in version 1.3.0.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Binarize a column of continuous features given a threshold.
>>> df = spark.createDataFrame([(0.5,)], ["values"])
>>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
>>> binarizer.transform(df).head().features
0.0
>>> binarizer.setParams(outputCol="freqs").transform(df).head().freqs
0.0
>>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
>>> binarizer.transform(df, params).head().vector
1.0
>>> binarizerPath = temp_path + "/binarizer"
>>> binarizer.save(binarizerPath)
>>> loadedBinarizer = Binarizer.load(binarizerPath)
>>> loadedBinarizer.getThreshold() == binarizer.getThreshold()
True
New in version 1.4.0.
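The thresholding rule behind the Binarizer example can be written as a one-line function. This is a plain-Python sketch, not the Spark implementation; `binarize` is a hypothetical helper, and mapping values equal to the threshold to 0.0 is an assumption, since the example above does not exercise that case.

```python
def binarize(value, threshold=0.0):
    """Return 1.0 if value is strictly greater than threshold, else 0.0.

    Assumption: a value equal to the threshold maps to 0.0.
    """
    return 1.0 if value > threshold else 0.0


low = binarize(0.5, threshold=1.0)    # same inputs as the first doctest call
high = binarize(0.5, threshold=-0.5)  # same inputs as the param-map override
```

These two calls reproduce the 0.0 and 1.0 results shown in the example.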
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this Binarizer.
New in version 1.4.0.
Transforms the input dataset with optional parameters.
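Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |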
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note: Experimental
LSH class for Euclidean distance metrics. The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.sql.functions import col
>>> data = [(0, Vectors.dense([-1.0, -1.0 ]),),
... (1, Vectors.dense([-1.0, 1.0 ]),),
... (2, Vectors.dense([1.0, -1.0 ]),),
... (3, Vectors.dense([1.0, 1.0]),)]
>>> df = spark.createDataFrame(data, ["id", "features"])
>>> brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
... seed=12345, bucketLength=1.0)
>>> model = brp.fit(df)
>>> model.transform(df).head()
Row(id=0, features=DenseVector([-1.0, -1.0]), hashes=[DenseVector([-1.0])])
>>> data2 = [(4, Vectors.dense([2.0, 2.0 ]),),
... (5, Vectors.dense([2.0, 3.0 ]),),
... (6, Vectors.dense([3.0, 2.0 ]),),
... (7, Vectors.dense([3.0, 3.0]),)]
>>> df2 = spark.createDataFrame(data2, ["id", "features"])
>>> model.approxNearestNeighbors(df2, Vectors.dense([1.0, 2.0]), 1).collect()
[Row(id=4, features=DenseVector([2.0, 2.0]), hashes=[DenseVector([1.0])], distCol=1.0)]
>>> model.approxSimilarityJoin(df, df2, 3.0, distCol="EuclideanDistance").select(
... col("datasetA.id").alias("idA"),
... col("datasetB.id").alias("idB"),
... col("EuclideanDistance")).show()
+---+---+-----------------+
|idA|idB|EuclideanDistance|
+---+---+-----------------+
| 3| 6| 2.23606797749979|
+---+---+-----------------+
...
>>> brpPath = temp_path + "/brp"
>>> brp.save(brpPath)
>>> brp2 = BucketedRandomProjectionLSH.load(brpPath)
>>> brp2.getBucketLength() == brp.getBucketLength()
True
>>> modelPath = temp_path + "/brp-model"
>>> model.save(modelPath)
>>> model2 = BucketedRandomProjectionLSHModel.load(modelPath)
>>> model.transform(df).head().hashes == model2.transform(df).head().hashes
True
New in version 2.2.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
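Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params; if a list or tuple of param maps is given, fit is called on each param map and a list of models is returned |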
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of bucketLength or its default value.
New in version 2.2.0.
Gets the value of inputCol or its default value.
Gets the value of numHashTables or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of bucketLength.
New in version 2.2.0.
Sets the value of numHashTables.
Sets params for this BucketedRandomProjectionLSH.
New in version 2.2.0.
Returns an MLWriter instance for this ML instance.
Note: Experimental
Model fitted by BucketedRandomProjectionLSH, where multiple random vectors are stored. The vectors are normalized to be unit vectors and each vector is used in a hash function: \(h_i(x) = \lfloor r_i \cdot x / \mathrm{bucketLength} \rfloor\), where \(r_i\) is the i-th random unit vector. The number of buckets will be (max L2 norm of input vectors) / bucketLength.
New in version 2.2.0.
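The hash function above can be evaluated by hand for a small case. The sketch below is plain Python rather than the Spark implementation; `normalize` and `brp_hash` are hypothetical helpers, and the two projection directions are fixed for illustration, whereas a fitted model draws them at random.

```python
import math


def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def brp_hash(x, random_unit_vectors, bucket_length):
    """One hash value per unit vector: floor(r_i . x / bucketLength)."""
    return [
        math.floor(sum(r_c * x_c for r_c, x_c in zip(r, x)) / bucket_length)
        for r in random_unit_vectors
    ]


# Hypothetical projection directions, chosen by hand for the example.
unit_vectors = [normalize([1.0, 0.0]), normalize([3.0, 4.0])]
hashes = brp_hash([2.0, 3.0], unit_vectors, bucket_length=1.0)
```

A larger `bucket_length` merges neighbouring buckets, so more points share a hash value.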
Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.
Note: This method is experimental and will likely change behavior in the next release.
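Parameters: | dataset – the dataset to search for nearest neighbors of the key; key – feature vector representing the item to search for; numNearestNeighbors – the maximum number of nearest neighbors; distCol – output column for storing the distance between each result row and the key (defaults to “distCol”) |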
---|---|
Returns: | A dataset containing at most k items closest to the key. A column “distCol” is added to show the distance between each row and the key. |
Joins two datasets to approximately find all pairs of rows whose distances are smaller than the threshold. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.
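Parameters: | datasetA – one of the datasets to join; datasetB – another dataset to join; threshold – the threshold for the distance of row pairs; distCol – output column for storing the distance between each pair of rows (defaults to “distCol”) |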
---|---|
Returns: | A joined dataset containing pairs of rows. The original rows are in columns “datasetA” and “datasetB”, and a column “distCol” is added to show the distance between each pair. |
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
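Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |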
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Maps a column of continuous features to a column of feature buckets.
>>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
>>> df = spark.createDataFrame(values, ["values"])
>>> bucketizer = Bucketizer(splits=[-float("inf"), 0.5, 1.4, float("inf")],
... inputCol="values", outputCol="buckets")
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df).collect()
>>> len(bucketed)
6
>>> bucketed[0].buckets
0.0
>>> bucketed[1].buckets
0.0
>>> bucketed[2].buckets
1.0
>>> bucketed[3].buckets
2.0
>>> bucketizer.setParams(outputCol="b").transform(df).head().b
0.0
>>> bucketizerPath = temp_path + "/bucketizer"
>>> bucketizer.save(bucketizerPath)
>>> loadedBucketizer = Bucketizer.load(bucketizerPath)
>>> loadedBucketizer.getSplits() == bucketizer.getSplits()
True
>>> bucketed = bucketizer.setHandleInvalid("skip").transform(df).collect()
>>> len(bucketed)
4
New in version 1.4.0.
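The bucket assignment in the example above can be reproduced with a binary search over the splits. This is a plain-Python sketch under stated assumptions, not the Spark implementation: `bucketize` is a hypothetical helper, buckets are assumed to be half-open intervals `[splits[i], splits[i+1])` with the last bucket also including its upper bound, and "keep" is assumed to place NaN values in one extra bucket after the regular ones.

```python
import math
from bisect import bisect_right


def bucketize(value, splits, handle_invalid="error"):
    """Map value to a bucket index for sorted splits (assumed half-open buckets)."""
    n_buckets = len(splits) - 1
    if math.isnan(value):
        if handle_invalid == "keep":
            return float(n_buckets)   # assumption: NaN gets its own extra bucket
        if handle_invalid == "skip":
            return None               # the caller drops the row
        raise ValueError("NaN found and handle_invalid is 'error'")
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value %r is outside the splits" % value)
    if value == splits[-1]:
        return float(n_buckets - 1)   # last bucket includes its upper bound
    return float(bisect_right(splits, value) - 1)


splits = [-float("inf"), 0.5, 1.4, float("inf")]
values = [0.1, 0.4, 1.2, 1.5, float("nan"), float("nan")]
kept = [bucketize(v, splits, "keep") for v in values]
skipped = [b for b in (bucketize(v, splits, "skip") for v in values)
           if b is not None]
```

As in the example, "keep" retains all six rows and "skip" leaves four.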
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of handleInvalid or its default value.
New in version 2.1.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of handleInvalid.
New in version 2.1.0.
Sets params for this Bucketizer.
New in version 1.4.0.
Transforms the input dataset with optional parameters.
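Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |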
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note: Experimental
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
- numTopFeatures chooses a fixed number of top features according to a chi-squared test.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
- fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame(
... [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
... (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
... (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
... ["features", "label"])
>>> selector = ChiSqSelector(numTopFeatures=1, outputCol="selectedFeatures")
>>> model = selector.fit(df)
>>> model.transform(df).head().selectedFeatures
DenseVector([18.0])
>>> model.selectedFeatures
[2]
>>> chiSqSelectorPath = temp_path + "/chi-sq-selector"
>>> selector.save(chiSqSelectorPath)
>>> loadedSelector = ChiSqSelector.load(chiSqSelectorPath)
>>> loadedSelector.getNumTopFeatures() == selector.getNumTopFeatures()
True
>>> modelPath = temp_path + "/chi-sq-selector-model"
>>> model.save(modelPath)
>>> loadedModel = ChiSqSelectorModel.load(modelPath)
>>> loadedModel.selectedFeatures == model.selectedFeatures
True
New in version 2.0.0.
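The five selection methods listed above differ only in how they turn per-feature p-values into a set of selected indices. The sketch below illustrates that step in plain Python and is not the Spark implementation: `select_features` is a hypothetical helper, the p-values are made up, and all default thresholds other than `num_top_features=50` are illustrative assumptions.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Return the sorted indices of the selected features."""
    n = len(p_values)
    # Feature indices ordered from most to least significant.
    ranked = sorted(range(n), key=lambda i: p_values[i])
    if selector_type == "numTopFeatures":
        chosen = ranked[:num_top_features]
    elif selector_type == "percentile":
        chosen = ranked[:int(n * percentile)]
    elif selector_type == "fpr":
        chosen = [i for i in ranked if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: largest rank k with p_(k) <= fdr * k / n.
        k = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= fdr * rank / n:
                k = rank
        chosen = ranked[:k]
    elif selector_type == "fwe":
        # Threshold scaled by 1/numFeatures.
        chosen = [i for i in ranked if p_values[i] < fwe / n]
    else:
        raise ValueError("unknown selector type: %s" % selector_type)
    return sorted(chosen)


# Made-up p-values for four features.
p = [0.30, 0.04, 0.001, 0.02]
top1 = select_features(p, "numTopFeatures", num_top_features=1)
half = select_features(p, "percentile", percentile=0.5)
by_fpr = select_features(p, "fpr")
by_fdr = select_features(p, "fdr")
by_fwe = select_features(p, "fwe")
```

On this input fwe is the strictest threshold-based method and fpr the most permissive, with fdr in between.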
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
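Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params; if a list or tuple of param maps is given, fit is called on each param map and a list of models is returned |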
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of numTopFeatures or its default value.
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Gets the value of selectorType or its default value.
New in version 2.1.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of fdr. Only applicable when selectorType = “fdr”.
New in version 2.2.0.
Sets the value of featuresCol.
Sets the value of fpr. Only applicable when selectorType = “fpr”.
New in version 2.1.0.
Sets the value of fwe. Only applicable when selectorType = “fwe”.
New in version 2.2.0.
Sets the value of numTopFeatures. Only applicable when selectorType = “numTopFeatures”.
New in version 2.0.0.
Sets params for this ChiSqSelector.
New in version 2.0.0.
Sets the value of percentile. Only applicable when selectorType = “percentile”.
New in version 2.1.0.
Sets the value of selectorType.
New in version 2.1.0.
Returns an MLWriter instance for this ML instance.
Note: Experimental
Model fitted by ChiSqSelector.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
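Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |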
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
>>> df = spark.createDataFrame(
... [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
... ["label", "raw"])
>>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
>>> model = cv.fit(df)
>>> model.transform(df).show(truncate=False)
+-----+---------------+-------------------------+
|label|raw |vectors |
+-----+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+
...
>>> sorted(model.vocabulary) == ['a', 'b', 'c']
True
>>> countVectorizerPath = temp_path + "/count-vectorizer"
>>> cv.save(countVectorizerPath)
>>> loadedCv = CountVectorizer.load(countVectorizerPath)
>>> loadedCv.getMinDF() == cv.getMinDF()
True
>>> loadedCv.getMinTF() == cv.getMinTF()
True
>>> loadedCv.getVocabSize() == cv.getVocabSize()
True
>>> modelPath = temp_path + "/count-vectorizer-model"
>>> model.save(modelPath)
>>> loadedModel = CountVectorizerModel.load(modelPath)
>>> loadedModel.vocabulary == model.vocabulary
True
New in version 1.6.0.
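The vocabulary extraction and counting shown in the example above can be mimicked with `collections.Counter`. This is a plain-Python sketch, not the Spark implementation: `fit_vocabulary` and `count_vector` are hypothetical helpers, ordering the vocabulary by corpus-wide term count with alphabetical tie-breaking is an assumption, and the vectors are dense lists rather than sparse vectors.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Vocabulary ordered by total term count, ties broken alphabetically."""
    totals = Counter(term for doc in documents for term in doc)
    return sorted(totals, key=lambda term: (-totals[term], term))


def count_vector(document, vocabulary):
    """Dense count vector aligned with the vocabulary order."""
    counts = Counter(document)
    return [float(counts[term]) for term in vocabulary]


docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocab = fit_vocabulary(docs)
vectors = [count_vector(doc, vocab) for doc in docs]
```

The resulting counts match the values shown in the `vectors` column of the example output.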
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
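The precedence ordering described above (default param values < user-supplied values < extra) can be sketched with plain dictionaries. This is a minimal illustration, not the library's code: `merge_param_maps` is a hypothetical helper, string keys stand in for Param objects, and the values are made up for the example.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps into one flat map; later maps win on conflicts."""
    merged = dict(defaults)          # lowest precedence
    merged.update(user_supplied)     # overrides defaults
    merged.update(extra or {})       # highest precedence
    return merged

# Illustrative values only.
defaults = {"minDF": 1.0, "vocabSize": 262144}
user_supplied = {"vocabSize": 1000, "inputCol": "raw"}
extra = {"inputCol": "tokens"}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)  # {'minDF': 1.0, 'vocabSize': 1000, 'inputCol': 'tokens'}
```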
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Sets params for this CountVectorizer.
New in version 1.6.0.
Returns an MLWriter instance for this ML instance.
Model fitted by CountVectorizer.
New in version 1.6.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
>>> from pyspark.ml.linalg import Vectors
>>> df1 = spark.createDataFrame([(Vectors.dense([5.0, 8.0, 6.0]),)], ["vec"])
>>> dct = DCT(inverse=False, inputCol="vec", outputCol="resultVec")
>>> df2 = dct.transform(df1)
>>> df2.head().resultVec
DenseVector([10.969..., -0.707..., -2.041...])
>>> df3 = DCT(inverse=True, inputCol="resultVec", outputCol="origVec").transform(df2)
>>> df3.head().origVec
DenseVector([5.0, 8.0, 6.0])
>>> dctPath = temp_path + "/dct"
>>> dct.save(dctPath)
>>> loadedDtc = DCT.load(dctPath)
>>> loadedDtc.getInverse()
False
New in version 1.6.0.
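The scaled (unitary) DCT-II described above can be written out directly from its definition. The sketch below is plain Python rather than the Spark implementation; the function names `scaled_dct2` and `scaled_idct2` are hypothetical. It reproduces the coefficients shown in the example for the input [5.0, 8.0, 6.0].

```python
import math

def scaled_dct2(x):
    """Unitary DCT-II of a real sequence, with no zero padding."""
    n = len(x)
    out = []
    for k in range(n):
        total = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        # The k = 0 row uses a different scale so that the matrix is orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def scaled_idct2(coeffs):
    """Inverse transform: multiply by the transpose of the orthonormal matrix."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out

coeffs = scaled_dct2([5.0, 8.0, 6.0])
restored = scaled_idct2(coeffs)
print([round(c, 3) for c in coeffs])    # [10.97, -0.707, -2.041]
print([round(v, 6) for v in restored])  # [5.0, 8.0, 6.0]
```

Because the transform matrix is unitary, the inverse is just its transpose, which is why the round trip recovers the original vector.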
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Sets params for this DCT.
New in version 1.6.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided “weight” vector. In other words, it scales each column of the dataset by a scalar multiplier.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([2.0, 1.0, 3.0]),)], ["values"])
>>> ep = ElementwiseProduct(scalingVec=Vectors.dense([1.0, 2.0, 3.0]),
... inputCol="values", outputCol="eprod")
>>> ep.transform(df).head().eprod
DenseVector([2.0, 2.0, 9.0])
>>> ep.setParams(scalingVec=Vectors.dense([2.0, 3.0, 5.0])).transform(df).head().eprod
DenseVector([4.0, 3.0, 15.0])
>>> elementwiseProductPath = temp_path + "/elementwise-product"
>>> ep.save(elementwiseProductPath)
>>> loadedEp = ElementwiseProduct.load(elementwiseProductPath)
>>> loadedEp.getScalingVec() == ep.getScalingVec()
True
New in version 1.5.0.
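The Hadamard product itself is a one-line computation. The following plain-Python sketch (the helper name `hadamard` is hypothetical) reproduces the two results from the example above.

```python
def hadamard(vector, scaling_vec):
    """Element-wise product of an input vector with a weight vector."""
    if len(vector) != len(scaling_vec):
        raise ValueError("vectors must have the same length")
    return [v * w for v, w in zip(vector, scaling_vec)]

values = [2.0, 1.0, 3.0]
print(hadamard(values, [1.0, 2.0, 3.0]))  # [2.0, 2.0, 9.0]
print(hadamard(values, [2.0, 3.0, 5.0]))  # [4.0, 3.0, 15.0]
```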
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Sets params for this ElementwiseProduct.
New in version 1.5.0.
Sets the value of scalingVec.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Maps a sequence of terms to their term frequencies using the hashing trick. The implementation currently uses Austin Appleby’s MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to map the hash value to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["words"])
>>> hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="features")
>>> hashingTF.transform(df).head().features
SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {0: 1.0, 1: 1.0, 2: 1.0})
>>> hashingTFPath = temp_path + "/hashing-tf"
>>> hashingTF.save(hashingTFPath)
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
New in version 1.3.0.
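The hashing trick described above (hash each term, then take the result modulo numFeatures to get a column index) can be sketched in plain Python. Note the swap: `zlib.crc32` is used here only as a readily available stand-in hash and is not MurmurHash3, so the column indices will differ from those Spark produces. The helper name `hashing_tf` is hypothetical.

```python
import zlib

def hashing_tf(terms, num_features):
    """Map terms to a fixed-length term-frequency vector via hash modulo."""
    vec = [0.0] * num_features
    for term in terms:
        # Stand-in hash for illustration only; Spark uses MurmurHash3_x86_32.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] += 1.0
    return vec

features = hashing_tf(["a", "b", "c", "a"], 16)
print(len(features), sum(features))  # 16 4.0
```

Distinct terms may collide on the same index; the vector length stays fixed at num_features regardless of vocabulary size, which is the point of the technique.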
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of numFeatures or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Sets the value of numFeatures.
Sets params for this HashingTF.
New in version 1.3.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Computes the Inverse Document Frequency (IDF) given a collection of documents.
>>> from pyspark.ml.linalg import DenseVector
>>> df = spark.createDataFrame([(DenseVector([1.0, 2.0]),),
... (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"])
>>> idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf")
>>> model = idf.fit(df)
>>> model.idf
DenseVector([0.0, 0.0])
>>> model.transform(df).head().idf
DenseVector([0.0, 0.0])
>>> idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs
DenseVector([0.0, 0.0])
>>> params = {idf.minDocFreq: 1, idf.outputCol: "vector"}
>>> idf.fit(df, params).transform(df).head().vector
DenseVector([0.2877, 0.0])
>>> idfPath = temp_path + "/idf"
>>> idf.save(idfPath)
>>> loadedIdf = IDF.load(idfPath)
>>> loadedIdf.getMinDocFreq() == idf.getMinDocFreq()
True
>>> modelPath = temp_path + "/idf-model"
>>> model.save(modelPath)
>>> loadedModel = IDFModel.load(modelPath)
>>> loadedModel.transform(df).head().idf == model.transform(df).head().idf
True
New in version 1.4.0.
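The numbers in the example above can be reproduced by hand. The sketch below assumes the smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents in which the term has a positive frequency, and assumes that terms with df below minDocFreq receive a weight of 0. The helper names `idf_weights` and `tf_idf` are hypothetical, and this is plain Python rather than the Spark implementation.

```python
import math

def idf_weights(tf_vectors, min_doc_freq=0):
    """Compute one IDF weight per term from a list of term-frequency vectors."""
    m = len(tf_vectors)
    size = len(tf_vectors[0])
    weights = []
    for j in range(size):
        doc_freq = sum(1 for vec in tf_vectors if vec[j] > 0)
        if doc_freq < min_doc_freq:
            weights.append(0.0)  # term filtered out by minDocFreq
        else:
            weights.append(math.log((m + 1.0) / (doc_freq + 1.0)))
    return weights

def tf_idf(tf_vector, weights):
    """Scale a term-frequency vector by the IDF weights."""
    return [tf * w for tf, w in zip(tf_vector, weights)]

tf = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.2]]
print(idf_weights(tf, min_doc_freq=3))  # [0.0, 0.0]
w = idf_weights(tf, min_doc_freq=1)
print([round(v, 4) for v in tf_idf(tf[0], w)])  # [0.2877, 0.0]
```

The first term appears in two of three documents, giving log(4/3), about 0.2877; the second appears in all three, giving log(1) = 0.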
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Sets the value of minDocFreq.
New in version 1.4.0.
Sets params for this IDF.
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Model fitted by IDF.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of DoubleType or FloatType. Currently Imputer does not support categorical features and possibly creates incorrect values for a categorical feature.
Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. For computing median, pyspark.sql.DataFrame.approxQuantile() is used with a relative error of 0.001.
>>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0),
... (4.0, 4.0), (5.0, 5.0)], ["a", "b"])
>>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
>>> model = imputer.fit(df)
>>> model.surrogateDF.show()
+---+---+
| a| b|
+---+---+
|3.0|4.0|
+---+---+
...
>>> model.transform(df).show()
+---+---+-----+-----+
| a| b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN| 1.0| 4.0|
|2.0|NaN| 2.0| 4.0|
|NaN|3.0| 3.0| 3.0|
...
>>> imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show()
+---+---+-----+-----+
| a| b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN| 4.0| NaN|
...
>>> imputerPath = temp_path + "/imputer"
>>> imputer.save(imputerPath)
>>> loadedImputer = Imputer.load(imputerPath)
>>> loadedImputer.getStrategy() == imputer.getStrategy()
True
>>> loadedImputer.getMissingValue()
1.0
>>> modelPath = temp_path + "/imputer-model"
>>> model.save(modelPath)
>>> loadedModel = ImputerModel.load(modelPath)
>>> loadedModel.transform(df).head().out_a == model.transform(df).head().out_a
True
New in version 2.2.0.
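Mean imputation as described above (compute the mean after filtering out missing values, then substitute it for each missing entry) can be sketched in plain Python. This covers only the case from the first part of the example, where NaN marks a missing value and None is also treated as missing; the helper names are hypothetical and this is not the Spark implementation.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing, as in the example above."""
    return value is None or math.isnan(value)

def mean_surrogate(column):
    """Mean of the column computed over non-missing values only."""
    present = [v for v in column if not is_missing(v)]
    return sum(present) / len(present)

def impute(column, surrogate):
    """Replace every missing entry with the surrogate value."""
    return [surrogate if is_missing(v) else v for v in column]

nan = float("nan")
a = [1.0, 2.0, nan, 4.0, 5.0]
b = [nan, nan, 3.0, 4.0, 5.0]
print(mean_surrogate(a), mean_surrogate(b))  # 3.0 4.0
print(impute(a, mean_surrogate(a)))          # [1.0, 2.0, 3.0, 4.0, 5.0]
print(impute(b, mean_surrogate(b)))          # [4.0, 4.0, 3.0, 4.0, 5.0]
```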
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCols or its default value.
Gets the value of missingValue or its default value.
New in version 2.2.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCols or its default value.
New in version 2.2.0.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Sets the value of missingValue.
New in version 2.2.0.
Sets the value of outputCols.
New in version 2.2.0.
Sets params for this Imputer.
New in version 2.2.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Model fitted by Imputer.
New in version 2.2.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Returns a DataFrame containing inputCols and their corresponding surrogates, which are used to replace the missing values in the input DataFrame.
New in version 2.2.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes). See StringIndexer for converting strings into indices.
New in version 1.6.0.
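Conceptually, the mapping from indices back to strings is a list lookup. The sketch below is plain Python with a hypothetical helper name and made-up labels; it assumes indices may arrive as floats and casts them to int before the lookup.

```python
def index_to_string(indices, labels):
    """Map each index to the label stored at that position."""
    return [labels[int(i)] for i in indices]

labels = ["a", "b", "c"]  # illustrative labels: index 0 -> "a", 1 -> "b", 2 -> "c"
restored_labels = index_to_string([0.0, 2.0, 1.0, 0.0], labels)
print(restored_labels)  # ['a', 'c', 'b', 'a']
```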
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Sets params for this IndexToString.
New in version 1.6.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Rescales each feature individually to the range [-1, 1] by dividing by the maximum absolute value of that feature. It does not shift or center the data, and thus does not destroy any sparsity.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([1.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> maScaler = MaxAbsScaler(inputCol="a", outputCol="scaled")
>>> model = maScaler.fit(df)
>>> model.transform(df).show()
+-----+------+
| a|scaled|
+-----+------+
|[1.0]| [0.5]|
|[2.0]| [1.0]|
+-----+------+
...
>>> scalerPath = temp_path + "/max-abs-scaler"
>>> maScaler.save(scalerPath)
>>> loadedMAScaler = MaxAbsScaler.load(scalerPath)
>>> loadedMAScaler.getInputCol() == maScaler.getInputCol()
True
>>> loadedMAScaler.getOutputCol() == maScaler.getOutputCol()
True
>>> modelPath = temp_path + "/max-abs-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = MaxAbsScalerModel.load(modelPath)
>>> loadedModel.maxAbs == model.maxAbs
True
New in version 2.0.0.
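The scaling rule can be sketched in plain Python as a fit step that records the maximum absolute value per feature and a transform step that divides by it. The helper names are hypothetical, and mapping an all-zero feature to 0.0 instead of dividing by zero is an assumption of this sketch, not a documented behavior.

```python
def fit_max_abs(rows):
    """Return the maximum absolute value of each feature (column)."""
    size = len(rows[0])
    return [max(abs(row[j]) for row in rows) for j in range(size)]

def transform_max_abs(rows, max_abs):
    """Divide each feature by its maximum absolute value."""
    # Assumption: a feature whose max abs is 0 is left at 0.0.
    return [[(v / m if m != 0 else 0.0) for v, m in zip(row, max_abs)]
            for row in rows]

rows = [[1.0], [2.0]]
max_abs = fit_max_abs(rows)
print(max_abs)                           # [2.0]
print(transform_max_abs(rows, max_abs))  # [[0.5], [1.0]]
```

Zero entries stay zero under this rule, which is why sparsity is preserved.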
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Sets params for this MaxAbsScaler.
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
Model fitted by MaxAbsScaler.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Saves this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
LSH class for Jaccard distance. The input can be dense or sparse vectors, but sparse input is more efficient. For example, Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)]) means there are 10 elements in the space and the set contains elements 2, 3, and 5. Any input vector must have at least 1 non-zero index, and all non-zero values are treated as binary “1” values.
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.sql.functions import col
>>> data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
... (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
... (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
>>> df = spark.createDataFrame(data, ["id", "features"])
>>> mh = MinHashLSH(inputCol="features", outputCol="hashes", seed=12345)
>>> model = mh.fit(df)
>>> model.transform(df).head()
Row(id=0, features=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), hashes=[DenseVector([-1638925...
>>> data2 = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
... (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
... (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
>>> df2 = spark.createDataFrame(data2, ["id", "features"])
>>> key = Vectors.sparse(6, [1, 2], [1.0, 1.0])
>>> model.approxNearestNeighbors(df2, key, 1).collect()
[Row(id=5, features=SparseVector(6, {1: 1.0, 2: 1.0, 4: 1.0}), hashes=[DenseVector([-163892...
>>> model.approxSimilarityJoin(df, df2, 0.6, distCol="JaccardDistance").select(
... col("datasetA.id").alias("idA"),
... col("datasetB.id").alias("idB"),
... col("JaccardDistance")).show()
+---+---+---------------+
|idA|idB|JaccardDistance|
+---+---+---------------+
| 1| 4| 0.5|
| 0| 5| 0.5|
+---+---+---------------+
...
>>> mhPath = temp_path + "/mh"
>>> mh.save(mhPath)
>>> mh2 = MinHashLSH.load(mhPath)
>>> mh2.getOutputCol() == mh.getOutputCol()
True
>>> modelPath = temp_path + "/mh-model"
>>> model.save(modelPath)
>>> model2 = MinHashLSHModel.load(modelPath)
>>> model.transform(df).head().hashes == model2.transform(df).head().hashes
True
New in version 2.2.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
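The precedence rule (default param values < user-supplied values < extra) can be illustrated with ordinary dictionaries. This is only a sketch of the ordering: `merge_param_maps` is a hypothetical helper, plain strings stand in for Param objects, and the values are made up for illustration.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win on conflicting keys."""
    merged = dict(defaults)        # lowest precedence
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest precedence
    return merged


defaults = {"numHashTables": 1, "outputCol": "out"}
user_supplied = {"outputCol": "hashes"}
extra = {"numHashTables": 3}

print(merge_param_maps(defaults, user_supplied, extra))
# {'numHashTables': 3, 'outputCol': 'hashes'}
```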
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of numHashTables or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of numHashTables.
Sets params for this MinHashLSH.
New in version 2.2.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Model produced by MinHashLSH, in which multiple hash functions are stored. Each hash function is picked from the following family of hash functions, where \(a_i\) and \(b_i\) are randomly chosen integers less than prime: \(h_i(x) = ((x \cdot a_i + b_i) \mod prime)\). This hash family is approximately min-wise independent according to the reference.
See also
Tom Bohman, Colin Cooper, and Alan Frieze. “Min-wise independent linear permutations.” Electronic Journal of Combinatorics 7 (2000): R26.
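The hash family above can be sketched in plain Python as follows. This is not Spark's implementation: the modulus `PRIME`, the helper names, and the use of `random.Random` are assumptions made for illustration, and the sketch adds the usual MinHash step of taking the minimum hash over the elements of a set, which the description above does not spell out.

```python
import random

PRIME = 2147483647  # assumed prime modulus (2**31 - 1), chosen for illustration


def make_hash_functions(num_tables, seed):
    """Draw (a_i, b_i) pairs of random integers less than PRIME."""
    rng = random.Random(seed)
    # a_i starts at 1 so that no hash function collapses to a constant.
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_tables)]


def minhash_signature(index_set, coefficients):
    """For each h_i(x) = (x * a_i + b_i) mod PRIME, keep the minimum over the set."""
    return [min((x * a + b) % PRIME for x in index_set)
            for a, b in coefficients]


coeffs = make_hash_functions(num_tables=3, seed=12345)
sig_a = minhash_signature({0, 1, 2}, coeffs)
sig_b = minhash_signature({0, 1, 2}, coeffs)
sig_c = minhash_signature({2, 3, 4}, coeffs)

print(sig_a == sig_b)  # True: identical sets give identical signatures
print(len(sig_c))      # 3: one hash value per hash function
```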
New in version 2.2.0.
Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.
Note
This method is experimental and will likely change behavior in the next release.
Parameters: |
|
---|---|
Returns: | A dataset containing at most k items closest to the key. A column “distCol” is added to show the distance between each row and the key. |
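For comparison with the approximate search, an exact brute-force search is easy to write for small data. The sketch below is plain Python, not the LSH method itself; `exact_nearest_neighbors` is a hypothetical helper, rows are represented as sets of non-zero indices, and the data mirrors `df2` and `key` from the MinHashLSH example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two non-empty sets of indices."""
    return 1.0 - len(a & b) / len(a | b)


def exact_nearest_neighbors(dataset, key, k):
    """Return at most k (row_id, distance) pairs, closest first."""
    scored = [(row_id, jaccard_distance(items, key))
              for row_id, items in dataset.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


df2 = {3: {1, 3, 5}, 4: {2, 3, 5}, 5: {1, 2, 4}}
key = {1, 2}

nearest = exact_nearest_neighbors(df2, key, 1)
print(nearest[0][0])  # 5
```

On this data the exact search returns the same row (id 5) as the approximate method in the example.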
Join two datasets to approximately find all pairs of rows whose distance is smaller than the threshold. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use that. This allows caching of the transformed data when necessary.
Parameters: |
|
---|---|
Returns: | A joined dataset containing pairs of rows. The original rows are in columns “datasetA” and “datasetB”, and a column “distCol” is added to show the distance between each pair. |
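An exact brute-force join makes the meaning of "approximately" concrete. The sketch below is plain Python rather than LSH; `exact_similarity_join` is a hypothetical helper, and the data mirrors `df` and `df2` from the MinHashLSH example, with each row written as its set of non-zero indices.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two non-empty sets of indices."""
    return 1.0 - len(a & b) / len(a | b)


def exact_similarity_join(dataset_a, dataset_b, threshold):
    """Return every (id_a, id_b, distance) with distance below the threshold."""
    pairs = []
    for id_a, items_a in sorted(dataset_a.items()):
        for id_b, items_b in sorted(dataset_b.items()):
            distance = jaccard_distance(items_a, items_b)
            if distance < threshold:
                pairs.append((id_a, id_b, distance))
    return pairs


df = {0: {0, 1, 2}, 1: {2, 3, 4}, 2: {0, 2, 4}}
df2 = {3: {1, 3, 5}, 4: {2, 3, 5}, 5: {1, 2, 4}}

pairs = exact_similarity_join(df, df2, 0.6)
print([(a, b) for a, b, _ in pairs])  # [(0, 5), (1, 4), (1, 5), (2, 5)]
```

The MinHashLSH example reports only the pairs (1, 4) and (0, 5) for the same data and threshold, which is consistent with an approximate join that can miss some qualifying pairs.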
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Rescale each feature individually to a common range [min, max] linearly using column summary statistics; this is also known as min-max normalization or rescaling. The rescaled value for feature E is calculated as:
Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min
For the case E_max == E_min, Rescaled(e_i) = 0.5 * (max + min)
Note
Since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
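The rescaling formula, including the E_max == E_min case, can be sketched for a single feature column in plain Python. `min_max_rescale` is a hypothetical helper, and the default target range [0.0, 1.0] is an assumption chosen to match the example output in this section.

```python
def min_max_rescale(column, new_min=0.0, new_max=1.0):
    """Rescale one feature column linearly to [new_min, new_max]."""
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        # Constant feature: every value maps to the midpoint of the range.
        return [0.5 * (new_max + new_min) for _ in column]
    return [(e - e_min) / (e_max - e_min) * (new_max - new_min) + new_min
            for e in column]


print(min_max_rescale([0.0, 2.0]))                  # [0.0, 1.0]
print(min_max_rescale([3.0, 3.0, 3.0]))             # [0.5, 0.5, 0.5]
print(min_max_rescale([0.0, 1.0, 2.0], -1.0, 1.0))  # [-1.0, 0.0, 1.0]
```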
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> mmScaler = MinMaxScaler(inputCol="a", outputCol="scaled")
>>> model = mmScaler.fit(df)
>>> model.originalMin
DenseVector([0.0])
>>> model.originalMax
DenseVector([2.0])
>>> model.transform(df).show()
+-----+------+
| a|scaled|
+-----+------+
|[0.0]| [0.0]|
|[2.0]| [1.0]|
+-----+------+
...
>>> minMaxScalerPath = temp_path + "/min-max-scaler"
>>> mmScaler.save(minMaxScalerPath)
>>> loadedMMScaler = MinMaxScaler.load(minMaxScalerPath)
>>> loadedMMScaler.getMin() == mmScaler.getMin()
True
>>> loadedMMScaler.getMax() == mmScaler.getMax()
True
>>> modelPath = temp_path + "/min-max-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = MinMaxScalerModel.load(modelPath)
>>> loadedModel.originalMin == model.originalMin
True
>>> loadedModel.originalMax == model.originalMax
True
New in version 1.6.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this MinMaxScaler.
New in version 1.6.0.
Returns an MLWriter instance for this ML instance.
Model fitted by MinMaxScaler.
New in version 1.6.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words. When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
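The behaviour described above can be sketched in plain Python. `ngrams` is a hypothetical helper, not the Spark transformer; it assumes n is at least 1 and interprets "ignored" as dropping null entries before the n-grams are built.

```python
def ngrams(tokens, n):
    """Return space-separated n-grams built from a list of string tokens."""
    words = [t for t in tokens if t is not None]  # null values are ignored
    # When len(words) < n the range is empty, so no n-grams are returned.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


tokens = ["a", "b", "c", "d", "e"]
print(ngrams(tokens, 2))      # ['a b', 'b c', 'c d', 'd e']
print(ngrams(tokens, 4))      # ['a b c d', 'b c d e']
print(ngrams(["a", "b"], 3))  # []
print(ngrams([], 2))          # []
```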
>>> df = spark.createDataFrame([Row(inputTokens=["a", "b", "c", "d", "e"])])
>>> ngram = NGram(n=2, inputCol="inputTokens", outputCol="nGrams")
>>> ngram.transform(df).head()
Row(inputTokens=[u'a', u'b', u'c', u'd', u'e'], nGrams=[u'a b', u'b c', u'c d', u'd e'])
>>> # Change n-gram length
>>> ngram.setParams(n=4).transform(df).head()
Row(inputTokens=[u'a', u'b', u'c', u'd', u'e'], nGrams=[u'a b c d', u'b c d e'])
>>> # Temporarily modify output column.
>>> ngram.transform(df, {ngram.outputCol: "output"}).head()
Row(inputTokens=[u'a', u'b', u'c', u'd', u'e'], output=[u'a b c d', u'b c d e'])
>>> ngram.transform(df).head()
Row(inputTokens=[u'a', u'b', u'c', u'd', u'e'], nGrams=[u'a b c d', u'b c d e'])
>>> # Must use keyword arguments to specify params.
>>> ngram.setParams("text")
Traceback (most recent call last):
...
TypeError: Method setParams forces keyword arguments.
>>> ngramPath = temp_path + "/ngram"
>>> ngram.save(ngramPath)
>>> loadedNGram = NGram.load(ngramPath)
>>> loadedNGram.getN() == ngram.getN()
True
New in version 1.5.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this NGram.
New in version 1.5.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Normalize a vector to have unit norm using the given p-norm.
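As a rough illustration of p-norm normalization, the following plain-Python sketch divides each entry by the vector's p-norm. `normalize` is a hypothetical helper; returning an all-zero vector unchanged and rejecting p < 1 are assumptions of this sketch, not documented behaviour.

```python
import math


def normalize(vector, p=2.0):
    """Scale a vector so that its p-norm equals 1."""
    if p < 1.0:
        raise ValueError("p must be at least 1")
    if math.isinf(p):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        return list(vector)  # assumption: leave an all-zero vector as is
    return [x / norm for x in vector]


print([round(x, 4) for x in normalize([3.0, -4.0], 2.0)])  # [0.6, -0.8]
print([round(x, 4) for x in normalize([3.0, -4.0], 1.0)])  # [0.4286, -0.5714]
```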
>>> from pyspark.ml.linalg import Vectors
>>> svec = Vectors.sparse(4, {1: 4.0, 3: 3.0})
>>> df = spark.createDataFrame([(Vectors.dense([3.0, -4.0]), svec)], ["dense", "sparse"])
>>> normalizer = Normalizer(p=2.0, inputCol="dense", outputCol="features")
>>> normalizer.transform(df).head().features
DenseVector([0.6, -0.8])
>>> normalizer.setParams(inputCol="sparse", outputCol="freqs").transform(df).head().freqs
SparseVector(4, {1: 0.8, 3: 0.6})
>>> params = {normalizer.p: 1.0, normalizer.inputCol: "dense", normalizer.outputCol: "vector"}
>>> normalizer.transform(df, params).head().vector
DenseVector([0.4286, -0.5714])
>>> normalizerPath = temp_path + "/normalizer"
>>> normalizer.save(normalizerPath)
>>> loadedNormalizer = Normalizer.load(normalizerPath)
>>> loadedNormalizer.getP() == normalizer.getP()
True
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this Normalizer.
New in version 1.4.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example, with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because including it would make the vector entries sum to one, and hence be linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
Note
This is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.
See also
StringIndexer for converting categorical values into category indices
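The effect of dropLast on the encoding can be sketched in plain Python. `one_hot` is a hypothetical helper that returns dense lists for readability, whereas the encoder described above produces sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector.

    With drop_last=True the last category has no slot of its own and is
    represented by the all-zero vector.
    """
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError("category index out of range")
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if i < size:
        vector[i] = 1.0
    return vector


print(one_hot(2.0, 5))                   # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))                   # [0.0, 0.0, 0.0, 0.0]
print(one_hot(4.0, 5, drop_last=False))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```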
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> model = stringIndexer.fit(stringIndDf)
>>> td = model.transform(stringIndDf)
>>> encoder = OneHotEncoder(inputCol="indexed", outputCol="features")
>>> encoder.transform(td).head().features
SparseVector(2, {0: 1.0})
>>> encoder.setParams(outputCol="freqs").transform(td).head().freqs
SparseVector(2, {0: 1.0})
>>> params = {encoder.dropLast: False, encoder.outputCol: "test"}
>>> encoder.transform(td, params).head().test
SparseVector(3, {0: 1.0})
>>> onehotEncoderPath = temp_path + "/onehot-encoder"
>>> encoder.save(onehotEncoderPath)
>>> loadedEncoder = OneHotEncoder.load(onehotEncoderPath)
>>> loadedEncoder.getDropLast() == encoder.getDropLast()
True
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this OneHotEncoder.
New in version 1.4.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
... (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
... (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
>>> df = spark.createDataFrame(data,["features"])
>>> pca = PCA(k=2, inputCol="features", outputCol="pca_features")
>>> model = pca.fit(df)
>>> model.transform(df).collect()[0].pca_features
DenseVector([1.648..., -4.013...])
>>> model.explainedVariance
DenseVector([0.794..., 0.205...])
>>> pcaPath = temp_path + "/pca"
>>> pca.save(pcaPath)
>>> loadedPca = PCA.load(pcaPath)
>>> loadedPca.getK() == pca.getK()
True
>>> modelPath = temp_path + "/pca-model"
>>> model.save(modelPath)
>>> loadedModel = PCAModel.load(modelPath)
>>> loadedModel.pc == model.pc
True
>>> loadedModel.explainedVariance == model.explainedVariance
True
New in version 1.5.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this PCA.
New in version 1.5.0.
Returns an MLWriter instance for this ML instance.
Model fitted by PCA. Transforms vectors to a lower dimensional space.
New in version 1.5.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Returns a vector of proportions of variance explained by each principal component.
New in version 2.0.0.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns a principal components Matrix. Each column is one principal component.
New in version 2.0.0.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Perform feature expansion in a polynomial space. As the Wikipedia article on polynomial expansion puts it, “In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition”. Take a 2-variable feature vector (x, y) as an example: expanding it with degree 2 gives (x, x * x, y, x * y, y * y).
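The degree-2 expansion of (x, y) can be reproduced with a short recursive sketch in plain Python. `polynomial_expansion` is a hypothetical helper; its term ordering was chosen to match the (x, x * x, y, x * y, y * y) example, and it is not claimed to be the library's implementation.

```python
def polynomial_expansion(values, degree):
    """Return all monomials of total degree 1..degree, constant term omitted."""

    def expand(last_idx, remaining_degree, multiplier):
        # No variables or no degree left: emit the accumulated product.
        if remaining_degree == 0 or last_idx < 0:
            return [multiplier]
        terms = []
        alpha = multiplier
        # Use values[last_idx] to the power 0, 1, ..., remaining_degree.
        for power in range(remaining_degree + 1):
            terms.extend(expand(last_idx - 1, remaining_degree - power, alpha))
            alpha *= values[last_idx]
        return terms

    # The first emitted term is the constant 1.0, which is dropped.
    return expand(len(values) - 1, degree, 1.0)[1:]


print(polynomial_expansion([0.5, 2.0], 2))  # [0.5, 0.25, 2.0, 1.0, 4.0]
```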
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([0.5, 2.0]),)], ["dense"])
>>> px = PolynomialExpansion(degree=2, inputCol="dense", outputCol="expanded")
>>> px.transform(df).head().expanded
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> px.setParams(outputCol="test").transform(df).head().test
DenseVector([0.5, 0.25, 2.0, 1.0, 4.0])
>>> polyExpansionPath = temp_path + "/poly-expansion"
>>> px.save(polyExpansionPath)
>>> loadedPx = PolynomialExpansion.load(polyExpansionPath)
>>> loadedPx.getDegree() == px.getDegree()
True
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this PolynomialExpansion.
New in version 1.4.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be less than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
NaN handling: QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user can instead choose to keep or remove NaN values by setting the handleInvalid parameter. If the user chooses to keep NaN values, they are handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], and NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile() for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
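The three NaN policies can be sketched in plain Python once the split points are known. `bucketize` is a hypothetical helper that takes the splits as given instead of computing approximate quantiles, and it assumes each bucket covers a half-open interval [lower, upper).

```python
import bisect
import math


def bucketize(values, splits, handle_invalid="error"):
    """Map each value to a bucket index given sorted split points.

    NaN values raise an error ("error"), are dropped ("skip"), or go to
    an extra bucket after the regular ones ("keep").
    """
    num_buckets = len(splits) - 1
    result = []
    for value in values:
        if math.isnan(value):
            if handle_invalid == "error":
                raise ValueError("NaN value found")
            if handle_invalid == "keep":
                result.append(float(num_buckets))  # special NaN bucket
            continue  # "skip" drops the value
        index = bisect.bisect_right(splits, value) - 1
        result.append(float(min(index, num_buckets - 1)))
    return result


values = [0.1, 0.4, 1.2, 1.5, float("nan"), float("nan")]
splits = [float("-inf"), 0.4, float("inf")]

print(bucketize(values, splits, "keep"))  # [0.0, 1.0, 1.0, 1.0, 2.0, 2.0]
print(bucketize(values, splits, "skip"))  # [0.0, 1.0, 1.0, 1.0]
```

With two regular buckets the NaN values land in bucket 2, and the result lengths (6 for "keep", 4 for "skip") match the row counts in the example for this class.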
>>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
>>> df = spark.createDataFrame(values, ["values"])
>>> qds = QuantileDiscretizer(numBuckets=2,
... inputCol="values", outputCol="buckets", relativeError=0.01, handleInvalid="error")
>>> qds.getRelativeError()
0.01
>>> bucketizer = qds.fit(df)
>>> qds.setHandleInvalid("keep").fit(df).transform(df).count()
6
>>> qds.setHandleInvalid("skip").fit(df).transform(df).count()
4
>>> splits = bucketizer.getSplits()
>>> splits[0]
-inf
>>> print("%2.1f" % round(splits[1], 1))
0.4
>>> bucketed = bucketizer.transform(df).head()
>>> bucketed.buckets
0.0
>>> quantileDiscretizerPath = temp_path + "/quantile-discretizer"
>>> qds.save(quantileDiscretizerPath)
>>> loadedQds = QuantileDiscretizer.load(quantileDiscretizerPath)
>>> loadedQds.getNumBuckets() == qds.getNumBuckets()
True
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so that both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
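The precedence rule described above (default param values < user-supplied values < extra) can be sketched with ordinary dictionaries. This is only an illustration under assumptions: plain strings stand in for Param objects, the function name `extract_param_map` is hypothetical, and the values are made up.

```python
def extract_param_map(defaults, user_supplied, extra=None):
    """Merge three param dictionaries; later sources win on conflict."""
    merged = dict(defaults)        # lowest precedence
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest precedence
    return merged


# Illustrative values only; they are not taken from any real estimator.
defaults = {"relativeError": 0.001, "handleInvalid": "error", "numBuckets": 2}
user_supplied = {"relativeError": 0.01, "numBuckets": 3}
extra = {"numBuckets": 5}

merged = extract_param_map(defaults, user_supplied, extra)
```

Here `numBuckets` appears in all three sources and takes the value from `extra`, `relativeError` takes the user-supplied value, and `handleInvalid` falls back to its default. The input dictionaries are left unmodified.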
Fits a model to the input dataset with optional parameters.
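Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models. |
---|---|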
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of handleInvalid or its default value.
New in version 2.1.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Gets the value of relativeError or its default value.
New in version 2.0.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of handleInvalid.
New in version 2.1.0.
Sets the value of numBuckets.
New in version 2.0.0.
Sets params for this QuantileDiscretizer.
New in version 2.0.0.
Sets the value of relativeError.
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
A regex-based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or by repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens by a minimum length. It returns an array of strings that can be empty.
>>> df = spark.createDataFrame([("A B c",)], ["text"])
>>> reTokenizer = RegexTokenizer(inputCol="text", outputCol="words")
>>> reTokenizer.transform(df).head()
Row(text=u'A B c', words=[u'a', u'b', u'c'])
>>> # Change a parameter.
>>> reTokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text=u'A B c', tokens=[u'a', u'b', u'c'])
>>> # Temporarily modify a parameter.
>>> reTokenizer.transform(df, {reTokenizer.outputCol: "words"}).head()
Row(text=u'A B c', words=[u'a', u'b', u'c'])
>>> reTokenizer.transform(df).head()
Row(text=u'A B c', tokens=[u'a', u'b', u'c'])
>>> # Must use keyword arguments to specify params.
>>> reTokenizer.setParams("text")
Traceback (most recent call last):
...
TypeError: Method setParams forces keyword arguments.
>>> regexTokenizerPath = temp_path + "/regex-tokenizer"
>>> reTokenizer.save(regexTokenizerPath)
>>> loadedReTokenizer = RegexTokenizer.load(regexTokenizerPath)
>>> loadedReTokenizer.getMinTokenLength() == reTokenizer.getMinTokenLength()
True
>>> loadedReTokenizer.getGaps() == reTokenizer.getGaps()
True
New in version 1.4.0.
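The two tokenization modes can be sketched with Python's `re` module. This is an approximation under assumptions: Python regex syntax stands in for the Java dialect, the function name `regex_tokenize` is hypothetical, and the default arguments are chosen so that the output matches the example above.

```python
import re


def regex_tokenize(text, pattern=r"\s+", gaps=True, min_token_length=1,
                   to_lowercase=True):
    """Tokenize text by splitting on pattern (gaps) or by matching it."""
    if to_lowercase:
        text = text.lower()
    if gaps:
        # The pattern describes the separators between tokens.
        tokens = re.split(pattern, text)
    else:
        # The pattern describes the tokens themselves.
        tokens = [m.group(0) for m in re.finditer(pattern, text)]
    # Drop tokens shorter than the minimum length (this removes "" too).
    return [t for t in tokens if len(t) >= min_token_length]


split_tokens = regex_tokenize("A B c")
matched_tokens = regex_tokenize("A B c", pattern=r"\w+", gaps=False)
long_tokens = regex_tokenize("Spark ML is fun", min_token_length=3)
```

Splitting on whitespace and matching `\w+` give the same tokens for this input; the third call shows the length filter removing short tokens.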
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of minTokenLength or its default value.
New in version 1.4.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of minTokenLength.
New in version 1.4.0.
Sets params for this RegexTokenizer.
New in version 1.4.0.
Sets the value of toLowercase.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
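Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|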
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-’. Also see the R formula docs.
>>> df = spark.createDataFrame([
... (1.0, 1.0, "a"),
... (0.0, 2.0, "b"),
... (0.0, 0.0, "a")
... ], ["y", "x", "s"])
>>> rf = RFormula(formula="y ~ x + s")
>>> model = rf.fit(df)
>>> model.transform(df).show()
+---+---+---+---------+-----+
| y| x| s| features|label|
+---+---+---+---------+-----+
|1.0|1.0| a|[1.0,1.0]| 1.0|
|0.0|2.0| b|[2.0,0.0]| 0.0|
|0.0|0.0| a|[0.0,1.0]| 0.0|
+---+---+---+---------+-----+
...
>>> rf.fit(df, {rf.formula: "y ~ . - s"}).transform(df).show()
+---+---+---+--------+-----+
| y| x| s|features|label|
+---+---+---+--------+-----+
|1.0|1.0| a| [1.0]| 1.0|
|0.0|2.0| b| [2.0]| 0.0|
|0.0|0.0| a| [0.0]| 0.0|
+---+---+---+--------+-----+
...
>>> rFormulaPath = temp_path + "/rFormula"
>>> rf.save(rFormulaPath)
>>> loadedRF = RFormula.load(rFormulaPath)
>>> loadedRF.getFormula() == rf.getFormula()
True
>>> loadedRF.getFeaturesCol() == rf.getFeaturesCol()
True
>>> loadedRF.getLabelCol() == rf.getLabelCol()
True
>>> str(loadedRF)
'RFormula(y ~ x + s) (uid=...)'
>>> modelPath = temp_path + "/rFormulaModel"
>>> model.save(modelPath)
>>> loadedModel = RFormulaModel.load(modelPath)
>>> loadedModel.uid == model.uid
True
>>> loadedModel.transform(df).show()
+---+---+---+---------+-----+
| y| x| s| features|label|
+---+---+---+---------+-----+
|1.0|1.0| a|[1.0,1.0]| 1.0|
|0.0|2.0| b|[2.0,0.0]| 0.0|
|0.0|0.0| a|[0.0,1.0]| 0.0|
+---+---+---+---------+-----+
...
>>> str(loadedModel)
'RFormulaModel(ResolvedRFormula(label=y, terms=[x,s], hasIntercept=true)) (uid=...)'
New in version 1.5.0.
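To make the `features` column in the example above easier to read, the following plain-Python sketch reproduces it for the formula `y ~ x + s`. It rests on an assumption inferred from the example output: numeric terms pass through unchanged, and string terms are one-hot encoded over their levels (ordered by descending frequency) with the last level dropped. The helper names `factor_levels` and `encode_row` are hypothetical.

```python
from collections import Counter

rows = [
    {"y": 1.0, "x": 1.0, "s": "a"},
    {"y": 0.0, "x": 2.0, "s": "b"},
    {"y": 0.0, "x": 0.0, "s": "a"},
]


def factor_levels(values):
    """Order levels by descending frequency, ties broken alphabetically."""
    counts = Counter(values)
    return sorted(counts, key=lambda level: (-counts[level], level))


def encode_row(row, numeric_terms, string_terms, levels):
    """Build the feature list for one row."""
    features = [float(row[name]) for name in numeric_terms]
    for name in string_terms:
        # One indicator per level except the last, which is implied.
        for level in levels[name][:-1]:
            features.append(1.0 if row[name] == level else 0.0)
    return features


levels = {"s": factor_levels(r["s"] for r in rows)}
features = [encode_row(r, ["x"], ["s"], levels) for r in rows]
labels = [r["y"] for r in rows]
```

Because the levels of `s` are only known after scanning the data, an encoding like this needs a fitting step before rows can be transformed.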
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
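Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models. |
---|---|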
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of forceIndexLabel.
New in version 2.1.0.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of featuresCol.
Sets the value of forceIndexLabel.
New in version 2.1.0.
Sets params for RFormula.
New in version 1.5.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Model fitted by RFormula. Fitting is required to determine the factor levels of formula terms.
New in version 1.5.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
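Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|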
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Implements the transforms defined by a SQL statement. Currently only SQL syntax of the form ‘SELECT ... FROM __THIS__’ is supported, where ‘__THIS__’ represents the underlying table of the input dataset.
>>> df = spark.createDataFrame([(0, 1.0, 3.0), (2, 2.0, 5.0)], ["id", "v1", "v2"])
>>> sqlTrans = SQLTransformer(
... statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
>>> sqlTrans.transform(df).head()
Row(id=0, v1=1.0, v2=3.0, v3=4.0, v4=3.0)
>>> sqlTransformerPath = temp_path + "/sql-transformer"
>>> sqlTrans.save(sqlTransformerPath)
>>> loadedSqlTrans = SQLTransformer.load(sqlTransformerPath)
>>> loadedSqlTrans.getStatement() == sqlTrans.getStatement()
True
New in version 1.6.0.
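The `__THIS__` placeholder can be thought of as a stand-in for a temporary table holding the input rows. The sketch below illustrates that idea with the standard-library `sqlite3` module; SQLite is used here only as a substitute for Spark SQL, and the function name `sql_transform` and the table name `input_rows` are hypothetical.

```python
import sqlite3


def sql_transform(rows, columns, statement, table_name="input_rows"):
    """Run statement against rows, substituting the table for __THIS__."""
    conn = sqlite3.connect(":memory:")
    try:
        # Load the input rows into a temporary in-memory table.
        conn.execute(
            "CREATE TABLE {} ({})".format(table_name, ", ".join(columns)))
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            "INSERT INTO {} VALUES ({})".format(table_name, placeholders),
            rows)
        # Replace the placeholder with the real table name and execute.
        cursor = conn.execute(statement.replace("__THIS__", table_name))
        names = [d[0] for d in cursor.description]
        return names, cursor.fetchall()
    finally:
        conn.close()


names, result = sql_transform(
    [(0, 1.0, 3.0), (2, 2.0, 5.0)],
    ["id", "v1", "v2"],
    "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__",
)
```

The first result row has the same values as the `Row` shown in the example above.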
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
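Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|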
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
The “unit std” is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
>>> standardScaler = StandardScaler(inputCol="a", outputCol="scaled")
>>> model = standardScaler.fit(df)
>>> model.mean
DenseVector([1.0])
>>> model.std
DenseVector([1.4142])
>>> model.transform(df).collect()[1].scaled
DenseVector([1.4142])
>>> standardScalerPath = temp_path + "/standard-scaler"
>>> standardScaler.save(standardScalerPath)
>>> loadedStandardScaler = StandardScaler.load(standardScalerPath)
>>> loadedStandardScaler.getWithMean() == standardScaler.getWithMean()
True
>>> loadedStandardScaler.getWithStd() == standardScaler.getWithStd()
True
>>> modelPath = temp_path + "/standard-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = StandardScalerModel.load(modelPath)
>>> loadedModel.std == model.std
True
>>> loadedModel.mean == model.mean
True
New in version 1.4.0.
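The numbers in the example above can be checked by hand with the standard-library `statistics` module, whose `stdev` function computes the corrected sample standard deviation (dividing by n - 1). This is a single-feature sketch, not the Spark implementation; the helper names are hypothetical, and the default flags (no centering, scaling enabled) are an assumption chosen to match the example output.

```python
import statistics


def fit_standard_scaler(column):
    """Return the mean and corrected sample standard deviation."""
    return statistics.mean(column), statistics.stdev(column)


def scale(value, mean, std, with_mean=False, with_std=True):
    """Optionally center by the mean and scale by the standard deviation."""
    if with_mean:
        value -= mean
    # Guard against division by zero; how the real transformer treats
    # constant features is outside the scope of this sketch.
    if with_std and std != 0.0:
        value /= std
    return value


mean, std = fit_standard_scaler([0.0, 2.0])
scaled = [scale(v, mean, std) for v in [0.0, 2.0]]
```

For the two samples 0.0 and 2.0 the mean is 1.0 and the unbiased variance is ((0 - 1)^2 + (2 - 1)^2) / (2 - 1) = 2, so the standard deviation is the square root of 2, about 1.4142, and the second value scales to 2 / 1.4142, which is also about 1.4142.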
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
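Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models. |
---|---|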
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this StandardScaler.
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Model fitted by StandardScaler.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
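Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|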
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
A feature transformer that filters out stop words from input.
Note
Null values in the input array are preserved unless null is explicitly added to stopWords.
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["text"])
>>> remover = StopWordsRemover(inputCol="text", outputCol="words", stopWords=["b"])
>>> remover.transform(df).head().words == ['a', 'c']
True
>>> stopWordsRemoverPath = temp_path + "/stopwords-remover"
>>> remover.save(stopWordsRemoverPath)
>>> loadedRemover = StopWordsRemover.load(stopWordsRemoverPath)
>>> loadedRemover.getStopWords() == remover.getStopWords()
True
>>> loadedRemover.getCaseSensitive() == remover.getCaseSensitive()
True
New in version 1.6.0.
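The filtering behavior, including the treatment of null values and case sensitivity, can be sketched in plain Python as follows. `None` stands in for null, the function name `remove_stop_words` is hypothetical, and the case-insensitive default is an assumption of this sketch.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return tokens with every stop word removed."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Compare lowercased forms; None is kept as-is so that it is removed
    # only when None itself appears in stop_words.
    blocked = {w.lower() if w is not None else None for w in stop_words}
    return [t for t in tokens
            if (t.lower() if t is not None else None) not in blocked]


plain = remove_stop_words(["a", "b", "c"], ["b"])
with_null = remove_stop_words(["a", None, "B"], ["b"])
null_removed = remove_stop_words(["a", None, "B"], ["b", None])
strict = remove_stop_words(["a", "B"], ["b"], case_sensitive=True)
```

In the second call the `None` entry survives because it is not listed among the stop words; in the third it is removed because it was added explicitly.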
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of caseSensitive or its default value.
New in version 1.6.0.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Loads the default stop words for the given language. Supported languages: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, swedish, turkish.
New in version 2.0.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of caseSensitive.
New in version 1.6.0.
Sets params for this StopWordsRemover.
New in version 1.6.0.
Transforms the input dataset with optional parameters.
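Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|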
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels) and are ordered by label frequency, so the most frequent label gets index 0.
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='error')
>>> model = stringIndexer.fit(stringIndDf)
>>> td = model.transform(stringIndDf)
>>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
... key=lambda x: x[0])
[(0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)]
>>> inverter = IndexToString(inputCol="indexed", outputCol="label2", labels=model.labels)
>>> itd = inverter.transform(td)
>>> sorted(set([(i[0], str(i[1])) for i in itd.select(itd.id, itd.label2).collect()]),
... key=lambda x: x[0])
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')]
>>> stringIndexerPath = temp_path + "/string-indexer"
>>> stringIndexer.save(stringIndexerPath)
>>> loadedIndexer = StringIndexer.load(stringIndexerPath)
>>> loadedIndexer.getHandleInvalid() == stringIndexer.getHandleInvalid()
True
>>> modelPath = temp_path + "/string-indexer-model"
>>> model.save(modelPath)
>>> loadedModel = StringIndexerModel.load(modelPath)
>>> loadedModel.labels == model.labels
True
>>> indexToStringPath = temp_path + "/index-to-string"
>>> inverter.save(indexToStringPath)
>>> loadedInverter = IndexToString.load(indexToStringPath)
>>> loadedInverter.getLabels() == inverter.getLabels()
True
New in version 1.4.0.
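Frequency-ordered indexing and its inverse can be sketched in plain Python. The input labels below are read off the `label2` output in the example above; the helper names are hypothetical, and breaking ties alphabetically is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter


def fit_labels(values):
    """Return labels ordered by descending frequency."""
    counts = Counter(str(v) for v in values)
    # Assumption: ties are broken alphabetically.
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_values(values, labels):
    """Map each value to the index of its label; unseen labels raise."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[str(v)] for v in values]


def invert(indices, labels):
    """Map indices back to the original label strings."""
    return [labels[int(i)] for i in indices]


data = ["a", "b", "c", "a", "a", "c"]
labels = fit_labels(data)
indexed = index_values(data, labels)
restored = invert(indexed, labels)
```

Label `a` occurs three times and gets index 0, `c` occurs twice and gets index 1, and `b` occurs once and gets index 2, which agrees with the indexed output in the example.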
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
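Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models. |
---|---|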
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of handleInvalid or its default value.
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of handleInvalid.
Sets params for this StringIndexer.
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Model fitted by StringIndexer.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Ordered list of labels, corresponding to indices to be assigned.
New in version 1.5.0.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
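Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|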
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
A tokenizer that converts the input string to lowercase and then splits it on whitespace.
>>> df = spark.createDataFrame([("a b c",)], ["text"])
>>> tokenizer = Tokenizer(inputCol="text", outputCol="words")
>>> tokenizer.transform(df).head()
Row(text=u'a b c', words=[u'a', u'b', u'c'])
>>> # Change a parameter.
>>> tokenizer.setParams(outputCol="tokens").transform(df).head()
Row(text=u'a b c', tokens=[u'a', u'b', u'c'])
>>> # Temporarily modify a parameter.
>>> tokenizer.transform(df, {tokenizer.outputCol: "words"}).head()
Row(text=u'a b c', words=[u'a', u'b', u'c'])
>>> tokenizer.transform(df).head()
Row(text=u'a b c', tokens=[u'a', u'b', u'c'])
>>> # Must use keyword arguments to specify params.
>>> tokenizer.setParams("text")
Traceback (most recent call last):
...
TypeError: Method setParams forces keyword arguments.
>>> tokenizerPath = temp_path + "/tokenizer"
>>> tokenizer.save(tokenizerPath)
>>> loadedTokenizer = Tokenizer.load(tokenizerPath)
>>> loadedTokenizer.transform(df).head().tokens == tokenizer.transform(df).head().tokens
True
New in version 1.3.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this Tokenizer.
New in version 1.3.0.
Transforms the input dataset with optional parameters.
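Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|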
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
A feature transformer that merges multiple columns into a vector column.
>>> df = spark.createDataFrame([(1, 0, 3)], ["a", "b", "c"])
>>> vecAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
>>> vecAssembler.transform(df).head().features
DenseVector([1.0, 0.0, 3.0])
>>> vecAssembler.setParams(outputCol="freqs").transform(df).head().freqs
DenseVector([1.0, 0.0, 3.0])
>>> params = {vecAssembler.inputCols: ["b", "a"], vecAssembler.outputCol: "vector"}
>>> vecAssembler.transform(df, params).head().vector
DenseVector([0.0, 1.0])
>>> vectorAssemblerPath = temp_path + "/vector-assembler"
>>> vecAssembler.save(vectorAssemblerPath)
>>> loadedAssembler = VectorAssembler.load(vectorAssemblerPath)
>>> loadedAssembler.transform(df).head().freqs == vecAssembler.transform(df).head().freqs
True
New in version 1.4.0.
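Conceptually, assembling a vector means concatenating the selected columns of each row in the order given by `inputCols`. A minimal plain-Python sketch follows; rows are modeled as dictionaries and vectors as lists of floats, the function name `assemble` is hypothetical, and the flattening of list-valued columns is an assumption of this sketch.

```python
def assemble(row, input_cols):
    """Concatenate the selected columns of one row into a flat vector."""
    vector = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # Assumption: vector-valued columns are flattened in place.
            vector.extend(float(v) for v in value)
        else:
            vector.append(float(value))
    return vector


row = {"a": 1, "b": 0, "c": 3}
all_cols = assemble(row, ["a", "b", "c"])
reordered = assemble(row, ["b", "a"])
nested = assemble({"u": [1, 2], "w": 3}, ["u", "w"])
```

The second call mirrors the example above in which a param map temporarily selects columns `b` and `a`: the output order follows the column list, not the order of columns in the dataset.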
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCols or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
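The relationship between the "has default", "is set", and "is defined" checks and the get-or-default lookup can be sketched with a toy class. This is an assumption-laden illustration in plain Python: `ParamStore` and its method names are hypothetical stand-ins, and the real methods may raise a different error type.

```python
class ParamStore:
    """Toy stand-in for the default and user-supplied param maps."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        # True only when the user supplied a value explicitly.
        return name in self._user

    def is_defined(self, name):
        # Defined means explicitly set OR backed by a default.
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        # User-supplied value first, then the default, otherwise an error.
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)


store = ParamStore({"maxIter": 100})
print(store.is_set("maxIter"), store.is_defined("maxIter"))
store.set("maxIter", 5)
print(store.get_or_default("maxIter"))
```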
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets params for this VectorAssembler.
New in version 1.4.0.
Transforms the input dataset with optional parameters.
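Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|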
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Class for indexing categorical feature columns in a dataset of Vector.
- Automatically identify categorical features (default behavior)
- This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
- Set maxCategories to the maximum number of categories any categorical feature should have.
- E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
- Index all features, if all features are categorical
- If maxCategories is set to be very large, then this will build an index of unique values for all features.
- Warning: This can cause problems if features are continuous since this will collect ALL unique values to the driver.
- E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories >= 3, then both features will be declared categorical.
This returns a model which can transform categorical features to use 0-based indices.
- This is not guaranteed to choose the same category index across multiple runs.
- If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0. This maintains vector sparsity.
- More stability may be added in the future.
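The decision rule described above can be sketched in plain Python. This is only an illustration of the maxCategories rule and the "value 0 maps to index 0" guarantee, not the Spark implementation; the function name `build_category_maps` is hypothetical, and the ordering of non-zero values is an assumption of this sketch.

```python
def build_category_maps(rows, max_categories):
    """Return {feature_index: {value: category_index}} for categorical features."""
    num_features = len(rows[0])
    category_maps = {}
    for j in range(num_features):
        distinct = {row[j] for row in rows}
        if len(distinct) > max_categories:
            continue  # too many distinct values: treat the feature as continuous
        # Put 0.0 first so value 0 maps to index 0 (keeps sparsity);
        # the order of the remaining values is an assumption of this sketch.
        ordered = sorted(distinct, key=lambda v: (v != 0.0, v))
        category_maps[j] = {value: index for index, value in enumerate(ordered)}
    return category_maps


# Feature 0 takes values {-1.0, 0.0}; feature 1 takes values {1.0, 3.0, 5.0}.
rows = [(-1.0, 1.0), (0.0, 3.0), (-1.0, 5.0)]
print(build_category_maps(rows, max_categories=2))  # only feature 0 is categorical
print(build_category_maps(rows, max_categories=3))  # both features are categorical
```

Features missing from the returned map are the ones treated as continuous.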
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(Vectors.dense([-1.0, 0.0]),),
... (Vectors.dense([0.0, 1.0]),), (Vectors.dense([0.0, 2.0]),)], ["a"])
>>> indexer = VectorIndexer(maxCategories=2, inputCol="a", outputCol="indexed")
>>> model = indexer.fit(df)
>>> model.transform(df).head().indexed
DenseVector([1.0, 0.0])
>>> model.numFeatures
2
>>> model.categoryMaps
{0: {0.0: 0, -1.0: 1}}
>>> indexer.setParams(outputCol="test").fit(df).transform(df).collect()[1].test
DenseVector([0.0, 1.0])
>>> params = {indexer.maxCategories: 3, indexer.outputCol: "vector"}
>>> model2 = indexer.fit(df, params)
>>> model2.transform(df).head().vector
DenseVector([1.0, 0.0])
>>> vectorIndexerPath = temp_path + "/vector-indexer"
>>> indexer.save(vectorIndexerPath)
>>> loadedIndexer = VectorIndexer.load(vectorIndexerPath)
>>> loadedIndexer.getMaxCategories() == indexer.getMaxCategories()
True
>>> modelPath = temp_path + "/vector-indexer-model"
>>> model.save(modelPath)
>>> loadedModel = VectorIndexerModel.load(modelPath)
>>> loadedModel.numFeatures == model.numFeatures
True
>>> loadedModel.categoryMaps == model.categoryMaps
True
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
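Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list or tuple of param maps is given, fit is called on each param map and a list of models is returned |
---|---|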
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of maxCategories or its default value.
New in version 1.4.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of maxCategories.
New in version 1.4.0.
Sets params for this VectorIndexer.
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Model fitted by VectorIndexer.
This also appends metadata to the output column, marking features as Numeric (continuous), Nominal (categorical), or Binary (either continuous or categorical). Non-ML metadata is not carried over from the input to the output column.
This maintains vector sparsity.
New in version 1.4.0.
Feature value index. Keys are categorical feature indices (column indices). Values are maps from original feature values to 0-based category indices. If a feature is not in this map, it is treated as continuous.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Number of features, i.e., length of Vectors which this transforms.
New in version 1.4.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
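Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|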
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
This class takes a feature vector and outputs a new feature vector with a subarray of the original features.
The subset of features can be specified with either indices (setIndices()) or names (setNames()). At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names.
The output vector will order features with the selected indices first (in the order given), followed by the selected names (in the order given).
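The selection and ordering rules above can be sketched in plain Python. This is a hedged illustration rather than the library's code: `slice_features` and the attribute names `f0` to `f4` are hypothetical, and a plain list of names stands in for the vector column's attribute metadata.

```python
def slice_features(values, attribute_names, indices=(), names=()):
    """Select features by index first, then by name, in the order given."""
    name_to_index = {name: i for i, name in enumerate(attribute_names)}
    selected = list(indices) + [name_to_index[name] for name in names]
    if not selected:
        raise ValueError("at least one feature must be selected")
    if len(set(selected)) != len(selected):
        # Covers overlap between indices and names as well as plain repeats.
        raise ValueError("duplicate features are not allowed")
    return [values[i] for i in selected]


values = [-2.0, 2.3, 0.0, 0.0, 1.0]
attribute_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical attribute names

print(slice_features(values, attribute_names, indices=[1, 4]))
print(slice_features(values, attribute_names, indices=[4], names=["f1"]))
```

In the second call the indexed feature comes first and the named feature second, regardless of their positions in the input vector.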
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (Vectors.dense([-2.0, 2.3, 0.0, 0.0, 1.0]),),
... (Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0]),),
... (Vectors.dense([0.6, -1.1, -3.0, 4.5, 3.3]),)], ["features"])
>>> vs = VectorSlicer(inputCol="features", outputCol="sliced", indices=[1, 4])
>>> vs.transform(df).head().sliced
DenseVector([2.3, 1.0])
>>> vectorSlicerPath = temp_path + "/vector-slicer"
>>> vs.save(vectorSlicerPath)
>>> loadedVs = VectorSlicer.load(vectorSlicerPath)
>>> loadedVs.getIndices() == vs.getIndices()
True
>>> loadedVs.getNames() == vs.getNames()
True
New in version 1.6.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of inputCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
setParams(self, inputCol=None, outputCol=None, indices=None, names=None): Sets params for this VectorSlicer.
New in version 1.6.0.
Transforms the input dataset with optional parameters.
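Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|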
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Word2Vec trains a model of Map(String, Vector), i.e., it maps each word to a vector representation that can be used for further natural language processing or machine learning.
>>> sent = ("a b " * 100 + "a c " * 10).split(" ")
>>> doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
>>> word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
>>> model = word2Vec.fit(doc)
>>> model.getVectors().show()
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09461779892444...|
| b|[1.15474212169647...|
| c|[-0.3794820010662...|
+----+--------------------+
...
>>> from pyspark.sql.functions import format_number as fmt
>>> model.findSynonyms("a", 2).select("word", fmt("similarity", 5).alias("similarity")).show()
+----+----------+
|word|similarity|
+----+----------+
| b| 0.25053|
| c| -0.69805|
+----+----------+
...
>>> model.transform(doc).head().model
DenseVector([0.5524, -0.4995, -0.3599, 0.0241, 0.3461])
>>> word2vecPath = temp_path + "/word2vec"
>>> word2Vec.save(word2vecPath)
>>> loadedWord2Vec = Word2Vec.load(word2vecPath)
>>> loadedWord2Vec.getVectorSize() == word2Vec.getVectorSize()
True
>>> loadedWord2Vec.getNumPartitions() == word2Vec.getNumPartitions()
True
>>> loadedWord2Vec.getMinCount() == word2Vec.getMinCount()
True
>>> modelPath = temp_path + "/word2vec-model"
>>> model.save(modelPath)
>>> loadedModel = Word2VecModel.load(modelPath)
>>> loadedModel.getVectors().first().word == model.getVectors().first().word
True
>>> loadedModel.getVectors().first().vector == model.getVectors().first().vector
True
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
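Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list or tuple of param maps is given, fit is called on each param map and a list of models is returned |
---|---|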
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of inputCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of maxSentenceLength or its default value.
New in version 2.0.0.
Gets the value of numPartitions or its default value.
New in version 1.4.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets the value of outputCol or its default value.
Gets a param by its name.
Gets the value of seed or its default value.
Gets the value of stepSize or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of maxSentenceLength.
New in version 2.0.0.
Sets the value of numPartitions.
New in version 1.4.0.
Sets params for this Word2Vec.
New in version 1.4.0.
Sets the value of vectorSize.
New in version 1.4.0.
Sets the value of windowSize.
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
Model fitted by Word2Vec.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Finds the “num” words closest in similarity to “word”. word can be a string or a vector representation. Returns a DataFrame with two fields, word and similarity (the cosine similarity).
New in version 1.5.0.
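The cosine-similarity ranking behind this lookup can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Spark implementation: `find_synonyms` and `cosine_similarity` are hypothetical helpers, the word vectors are made up, only a string query is handled, and excluding the query word from the results is inferred from the doctest output above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_synonyms(vectors, word, num):
    """Return the num (word, similarity) pairs most similar to word."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Made-up 2-dimensional word vectors, for illustration only.
vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [-1.0, 0.0]}
for synonym, similarity in find_synonyms(vectors, "a", 2):
    print(synonym, round(similarity, 5))
```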
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Returns the vector representation of the words as a dataframe with two fields, word and vector.
New in version 1.5.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
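Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|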
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
This binary classifier optimizes the hinge loss using the OWLQN optimizer. It currently supports only L2 regularization.
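For readers unfamiliar with the hinge loss, the objective can be sketched in plain Python. This is an illustration only: the function name `hinge_objective` is hypothetical, and the exact scaling of the L2 penalty (the 0.5 factor here) and any feature standardization are assumptions, not taken from the Spark source.

```python
def hinge_objective(rows, coefficients, intercept, reg_param):
    """Mean hinge loss plus an L2 penalty; labels are 0.0 or 1.0."""
    total = 0.0
    for label, features in rows:
        margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
        signed_label = 2.0 * label - 1.0            # map {0, 1} to {-1, +1}
        total += max(0.0, 1.0 - signed_label * margin)
    # Assumed penalty form: 0.5 * regParam * ||w||^2 (intercept not penalized).
    l2_penalty = 0.5 * reg_param * sum(w * w for w in coefficients)
    return total / len(rows) + l2_penalty


# Same two training rows as the doctest that follows.
rows = [(1.0, (1.0, 1.0, 1.0)), (0.0, (1.0, 2.0, 3.0))]
print(hinge_objective(rows, [0.0, 0.0, 0.0], 0.0, reg_param=0.01))
```

With all-zero coefficients every instance sits inside the margin, so each contributes a loss of exactly 1.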
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> df = sc.parallelize([
... Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
... Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>> svm = LinearSVC(maxIter=5, regParam=0.01)
>>> model = svm.fit(df)
>>> model.coefficients
DenseVector([0.0, -0.2792, -0.1833])
>>> model.intercept
1.0206118982229047
>>> model.numClasses
2
>>> model.numFeatures
3
>>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, -1.0, -1.0))]).toDF()
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.rawPrediction
DenseVector([-1.4831, 1.4831])
>>> svm_path = temp_path + "/svm"
>>> svm.save(svm_path)
>>> svm2 = LinearSVC.load(svm_path)
>>> svm2.getMaxIter()
5
>>> model_path = temp_path + "/svm_model"
>>> model.save(model_path)
>>> model2 = LinearSVCModel.load(model_path)
>>> model.coefficients[0] == model2.coefficients[0]
True
>>> model.intercept == model2.intercept
True
New in version 2.2.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
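Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list or tuple of param maps is given, fit is called on each param map and a list of models is returned |
---|---|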
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of aggregationDepth or its default value.
Gets the value of featuresCol or its default value.
Gets the value of fitIntercept or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of rawPredictionCol or its default value.
Gets the value of regParam or its default value.
Gets the value of standardization or its default value.
Gets the value of tol or its default value.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of aggregationDepth.
Sets the value of featuresCol.
Sets the value of fitIntercept.
setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction", fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2): Sets params for Linear SVM Classifier.
New in version 2.2.0.
Sets the value of predictionCol.
Sets the value of rawPredictionCol.
Sets the value of standardization.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Model fitted by LinearSVC.
New in version 2.2.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Number of classes (values which the label can take).
New in version 2.1.0.
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
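Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|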
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Logistic regression. This class supports multinomial logistic (softmax) and binomial logistic regression.
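The two link functions named above can be sketched in plain Python. This is a generic illustration of the logistic (binomial) and softmax (multinomial) functions, not code from the library; the function names are hypothetical.

```python
import math


def binomial_probability(raw_margin):
    """Logistic function: probability of the positive class given the margin."""
    return 1.0 / (1.0 + math.exp(-raw_margin))


def softmax(raw_scores):
    """Multinomial class probabilities from per-class raw scores."""
    shift = max(raw_scores)                      # subtract the max for stability
    exps = [math.exp(score - shift) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]


print(binomial_probability(3.54))
print(softmax([1.0, 2.0, 3.0]))
```

A two-class softmax over scores `[0, m]` gives the same positive-class probability as the logistic function applied to `m`, which is why the binomial case can be treated as a special case of the multinomial one.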
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> bdf = sc.parallelize([
... Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)),
... Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)),
... Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)),
... Row(label=0.0, weight=4.0, features=Vectors.dense(3.0, 3.0))]).toDF()
>>> blor = LogisticRegression(regParam=0.01, weightCol="weight")
>>> blorModel = blor.fit(bdf)
>>> blorModel.coefficients
DenseVector([-1.080..., -0.646...])
>>> blorModel.intercept
3.112...
>>> data_path = "data/mllib/sample_multiclass_classification_data.txt"
>>> mdf = spark.read.format("libsvm").load(data_path)
>>> mlor = LogisticRegression(regParam=0.1, elasticNetParam=1.0, family="multinomial")
>>> mlorModel = mlor.fit(mdf)
>>> mlorModel.coefficientMatrix
SparseMatrix(3, 4, [0, 1, 2, 3], [3, 2, 1], [1.87..., -2.75..., -0.50...], 1)
>>> mlorModel.interceptVector
DenseVector([0.04..., -0.42..., 0.37...])
>>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 1.0))]).toDF()
>>> result = blorModel.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.02..., 0.97...])
>>> result.rawPrediction
DenseVector([-3.54..., 3.54...])
>>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
>>> blorModel.transform(test1).head().prediction
1.0
>>> blor.setParams("vector")
Traceback (most recent call last):
...
TypeError: Method setParams forces keyword arguments.
>>> lr_path = temp_path + "/lr"
>>> blor.save(lr_path)
>>> lr2 = LogisticRegression.load(lr_path)
>>> lr2.getRegParam()
0.01
>>> model_path = temp_path + "/lr_model"
>>> blorModel.save(model_path)
>>> model2 = LogisticRegressionModel.load(model_path)
>>> blorModel.coefficients[0] == model2.coefficients[0]
True
>>> blorModel.intercept == model2.intercept
True
New in version 1.3.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
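Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list or tuple of param maps is given, fit is called on each param map and a list of models is returned |
---|---|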
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of aggregationDepth or its default value.
Gets the value of elasticNetParam or its default value.
Gets the value of featuresCol or its default value.
Gets the value of fitIntercept or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of probabilityCol or its default value.
Gets the value of rawPredictionCol or its default value.
Gets the value of regParam or its default value.
Gets the value of standardization or its default value.
Get threshold for binary classification.
If thresholds is set with length 2 (i.e., binary classification), this returns the equivalent threshold: \(\frac{1}{1 + \frac{thresholds(0)}{thresholds(1)}}\). Otherwise, returns threshold if set or its default value if unset.
New in version 1.4.0.
If thresholds is set, returns its value. Otherwise, if threshold is set, returns the equivalent thresholds for binary classification: (1-threshold, threshold). If neither is set, raises an error.
New in version 1.5.0.
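The two conversions between threshold and thresholds described above are inverses of each other, which a short plain-Python sketch can confirm. The function names are hypothetical; only the formulas come from the text.

```python
import math


def threshold_from_thresholds(thresholds):
    """Binary threshold equivalent to a length-2 thresholds list."""
    t0, t1 = thresholds
    return 1.0 / (1.0 + t0 / t1)     # assumes t1 is non-zero


def thresholds_from_threshold(threshold):
    """Length-2 thresholds equivalent to a binary threshold."""
    return [1.0 - threshold, threshold]


pair = thresholds_from_threshold(0.3)
print(pair, threshold_from_thresholds(pair))
assert math.isclose(threshold_from_thresholds(pair), 0.3)
```

Substituting (1 - t, t) into the first formula gives 1 / (1 + (1 - t) / t) = t, so a round trip returns the original threshold up to floating-point error.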
Gets the value of tol or its default value.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of aggregationDepth.
Sets the value of elasticNetParam.
Sets the value of featuresCol.
Sets the value of fitIntercept.
Sets params for logistic regression. If the threshold and thresholds Params are both set, they must be equivalent.
New in version 1.3.0.
Sets the value of predictionCol.
Sets the value of probabilityCol.
Sets the value of rawPredictionCol.
Sets the value of standardization.
Sets the value of threshold. Clears value of thresholds if it has been set.
New in version 1.4.0.
Sets the value of thresholds. Clears value of threshold if it has been set.
New in version 1.5.0.
Returns an MLWriter instance for this ML instance.
Model fitted by LogisticRegression.
New in version 1.3.0.
Model coefficients of binomial logistic regression. An exception is thrown in the case of multinomial logistic regression.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Evaluates the model on a test dataset.
Parameters: | dataset – Test dataset to evaluate model on, where dataset is an instance of pyspark.sql.DataFrame |
---|---|
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, then merges them with the extra values from input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Indicates whether a training summary exists for this model instance.
New in version 2.0.0.
Model intercept of binomial logistic regression. An exception is thrown in the case of multinomial logistic regression.
New in version 1.4.0.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Number of classes (values which the label can take).
New in version 2.1.0.
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Gets summary (e.g. accuracy/precision/recall, objective history, total iterations) of model trained on the training set. An exception is thrown if trainingSummary is None.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
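Parameters: | dataset – input dataset, an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params |
---|---|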
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Abstraction for Logistic Regression Results for a given model.
New in version 2.0.0.
Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
Note
Experimental
Abstraction for multinomial Logistic Regression Training results. Currently, the training summary ignores the training weights except for the objective trace.
New in version 2.0.0.
Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
Objective function (scaled loss + regularization) at each iteration.
New in version 2.0.0.
DataFrame output by the model’s transform method.
New in version 2.0.0.
Field in “predictions” which gives the probability of each class as a vector.
New in version 2.0.0.
Note
Experimental
Binary Logistic regression results for a given model.
New in version 2.0.0.
Computes the area under the receiver operating characteristic (ROC) curve.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
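As a rough illustration of what the area under the ROC curve measures, the sketch below integrates a list of (FPR, TPR) points with the trapezoidal rule. The function name and sample points are hypothetical, and Spark's own computation may differ in detail.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points as (FPR, TPR), including both endpoints.
roc_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
auc = area_under_curve(roc_points)
print(auc)
```

A value of 0.5 corresponds to the diagonal from (0, 0) to (1, 1), and 1.0 to a classifier that ranks every positive above every negative.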
Returns a DataFrame with two fields (threshold, F-Measure) describing the F-Measure curve with beta = 1.0.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
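The (threshold, F-Measure) pairs can be sketched in plain Python. This is an illustrative computation under stated assumptions, not Spark's code: `f_measure_by_threshold` is a hypothetical helper, every distinct score is used as a threshold, a score at or above the threshold counts as a positive prediction, and all instances have weight 1.0.

```python
def f_measure_by_threshold(scores, labels, beta=1.0):
    """Return [(threshold, F-measure)] using each distinct score as a threshold."""
    curve = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = beta ** 2 * precision + recall
        f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
        curve.append((threshold, f))
    return curve


# Hypothetical positive-class probabilities and true labels.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
curve = f_measure_by_threshold(scores, labels)
for threshold, f in curve:
    print(threshold, round(f, 4))
```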
Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
Returns the precision-recall curve as a DataFrame with two fields (recall, precision), with (0.0, 1.0) prepended to it.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
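The shape of the precision-recall curve, including the prepended (0.0, 1.0) point, can be sketched as follows. The helper name and sample data are hypothetical, instance weights are ignored, and the data is assumed to contain at least one positive label.

```python
def precision_recall_curve(scores, labels):
    """Return [(recall, precision)] with (0.0, 1.0) prepended."""
    positives = sum(labels)
    curve = [(0.0, 1.0)]
    # Lower the threshold one distinct score at a time.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        curve.append((tp / positives, tp / (tp + fp)))
    return curve


# Hypothetical positive-class probabilities and true labels.
pr = precision_recall_curve([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
for recall, precision in pr:
    print(round(recall, 4), round(precision, 4))
```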
Returns a DataFrame with two fields (threshold, precision) describing the precision curve. Every probability obtained by transforming the dataset is used as a threshold when calculating the precision.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
DataFrame output by the model’s transform method.
New in version 2.0.0.
Field in “predictions” which gives the probability of each class as a vector.
New in version 2.0.0.
Returns a DataFrame with two fields (threshold, recall) describing the recall curve. Every probability obtained by transforming the dataset is used as a threshold when calculating the recall.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Returns the receiver operating characteristic (ROC) curve as a DataFrame with two fields (FPR, TPR), with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
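To make the (FPR, TPR) layout concrete, here is a small pure-Python sketch that builds ROC points and adds the two endpoints described above. `roc_curve` and the sample data are hypothetical, instance weights are ignored, and both classes are assumed to be present.

```python
def roc_curve(scores, labels):
    """Return [(FPR, TPR)] with (0.0, 0.0) prepended and (1.0, 1.0) appended."""
    positives = sum(labels)
    negatives = len(labels) - positives
    curve = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        curve.append((fp / negatives, tp / positives))
    curve.append((1.0, 1.0))
    return curve


# Hypothetical positive-class probabilities and true labels.
roc = roc_curve([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(roc)
```

Because the endpoint is appended unconditionally in this sketch, the final point can appear twice when the lowest threshold already classifies everything as positive.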
Note
Experimental
Binary Logistic regression training results for a given model.
New in version 2.0.0.
Computes the area under the receiver operating characteristic (ROC) curve.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Returns a DataFrame with two fields (threshold, F-Measure) describing the F-Measure curve with beta = 1.0.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
Objective function (scaled loss + regularization) at each iteration.
New in version 2.0.0.
Returns the precision-recall curve as a DataFrame with two fields (recall, precision), with (0.0, 1.0) prepended to it.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Returns a DataFrame with two fields (threshold, precision) describing the precision curve. Every probability obtained by transforming the dataset is used as a threshold when calculating the precision.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
DataFrame output by the model’s transform method.
New in version 2.0.0.
Field in “predictions” which gives the probability of each class as a vector.
New in version 2.0.0.
Returns a DataFrame with two fields (threshold, recall) describing the recall curve. Every probability obtained by transforming the dataset is used as a threshold when calculating the recall.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Returns the receiver operating characteristic (ROC) curve as a DataFrame with two fields (FPR, TPR), with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
Note
This ignores instance weights (setting all to 1.0) from LogisticRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Number of training iterations until termination.
New in version 2.0.0.
Decision tree learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")
>>> model = dt.fit(td)
>>> model.numNodes
3
>>> model.depth
1
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> model.numFeatures
1
>>> model.numClasses
2
>>> print(model.toDebugString)
DecisionTreeClassificationModel (uid=...) of depth 1 with 3 nodes...
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> result = model.transform(test0).head()
>>> result.prediction
0.0
>>> result.probability
DenseVector([1.0, 0.0])
>>> result.rawPrediction
DenseVector([1.0, 0.0])
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> dtc_path = temp_path + "/dtc"
>>> dt.save(dtc_path)
>>> dt2 = DecisionTreeClassifier.load(dtc_path)
>>> dt2.getMaxDepth()
2
>>> model_path = temp_path + "/dtc_model"
>>> model.save(model_path)
>>> model2 = DecisionTreeClassificationModel.load(model_path)
>>> model.featureImportances == model2.featureImportances
True
New in version 1.4.0.
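A decision tree chooses splits by how much they reduce an impurity measure. The sketch below shows the two criteria commonly used for classification, Gini impurity and entropy, plus the resulting information gain; the function names are hypothetical and this is not Spark's implementation. The sample counts mirror the two-row training set in the example above, where one split separates the labels perfectly.

```python
import math


def gini(class_counts):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def entropy(class_counts):
    """Entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = sum(parent)
    return (impurity(parent)
            - sum(left) / n * impurity(left)
            - sum(right) / n * impurity(right))


# One instance of each class, split into two pure children.
print(gini([1, 1]), entropy([1, 1]))
print(information_gain([1, 1], [1, 0], [0, 1]))
```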
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of impurity or its default value.
New in version 1.6.0.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of probabilityCol or its default value.
Gets the value of rawPredictionCol or its default value.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for the DecisionTreeClassifier.
New in version 1.4.0.
Sets the value of predictionCol.
Sets the value of probabilityCol.
Sets the value of rawPredictionCol.
Returns an MLWriter instance for this ML instance.
Model fitted by DecisionTreeClassifier.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Returns the depth of the decision tree.
New in version 1.5.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Estimate of the importance of each feature.
This generalizes the idea of “Gini” importance to other losses, following the explanation of Gini importance from “Random Forests” documentation by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
Note
Feature importance for single decision trees can have high variance due to correlated predictor variables. Consider using a RandomForestClassifier to determine feature importance instead.
New in version 2.0.0.
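One way to picture a gain-based importance for a single tree is to credit each split's gain, scaled by the number of instances reaching that node, to the feature it splits on and then normalize. The sketch below follows that idea under stated assumptions: `tree_feature_importances` and the `(feature_index, gain, num_instances)` tuples are hypothetical, and the exact weighting Spark applies is not spelled out in the text above.

```python
def tree_feature_importances(splits, num_features):
    """splits: iterable of (feature_index, gain, num_instances), one per internal node."""
    totals = [0.0] * num_features
    for feature_index, gain, num_instances in splits:
        # Credit the split's gain, scaled by node size, to its feature.
        totals[feature_index] += gain * num_instances
    norm = sum(totals)
    # Normalize so the importances sum to 1 (left as zeros if there were no splits).
    return [t / norm for t in totals] if norm > 0 else totals


# Hypothetical internal nodes of a small tree over three features.
splits = [(0, 0.30, 100), (1, 0.10, 60), (0, 0.05, 40)]
importances = tree_feature_importances(splits, num_features=3)
print(importances)
```

A feature that is never used in a split, like feature 2 here, receives an importance of zero.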
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Number of classes (values which the label can take).
New in version 2.1.0.
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns the number of nodes of the decision tree.
New in version 1.5.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Full description of model.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Gradient-Boosted Trees (GBTs) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features.
The implementation is based upon: J.H. Friedman. “Stochastic Gradient Boosting.” 1999.
Notes on Gradient Boosting vs. TreeBoost:
- This implementation is for Stochastic Gradient Boosting, not for TreeBoost.
- Both algorithms learn tree ensembles by minimizing loss functions.
- TreeBoost (Friedman, 1999) additionally modifies the outputs at tree leaf nodes based on the loss function, whereas the original gradient boosting method does not.
- We expect to implement TreeBoost in the future: SPARK-4240
Note
Multiclass labels are not currently supported.
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed", seed=42)
>>> model = gbt.fit(td)
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> allclose(model.treeWeights, [1.0, 0.1, 0.1, 0.1, 0.1])
True
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> model.totalNumNodes
15
>>> print(model.toDebugString)
GBTClassificationModel (uid=...)...with 5 trees...
>>> gbtc_path = temp_path + "/gbtc"
>>> gbt.save(gbtc_path)
>>> gbt2 = GBTClassifier.load(gbtc_path)
>>> gbt2.getMaxDepth()
2
>>> model_path = temp_path + "/gbtc_model"
>>> model.save(model_path)
>>> model2 = GBTClassificationModel.load(model_path)
>>> model.featureImportances == model2.featureImportances
True
>>> model.treeWeights == model2.treeWeights
True
>>> model.trees
[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...]
New in version 1.4.0.
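The `treeWeights` and `trees` values in the example above hint at how a boosted ensemble combines its members: each regression tree contributes its output scaled by its weight. The sketch below is a simplified illustration, not Spark's code. `gbt_predict` and `stump` are hypothetical, and mapping a positive weighted sum to class 1.0 is an assumption made for this sketch.

```python
def gbt_predict(trees, tree_weights, features):
    """Weighted sum of per-tree outputs; a positive margin maps to class 1.0."""
    margin = sum(w * tree(features) for tree, w in zip(trees, tree_weights))
    return 1.0 if margin > 0.0 else 0.0


def stump(features):
    """Hypothetical depth-1 regression tree on a single feature."""
    return 1.0 if features[0] > 0.5 else -1.0


trees = [stump] * 5
tree_weights = [1.0, 0.1, 0.1, 0.1, 0.1]

print(gbt_predict(trees, tree_weights, [-1.0]))
print(gbt_predict(trees, tree_weights, [1.0]))
```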
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxIter or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Gets the value of stepSize or its default value.
Gets the value of subsamplingRate or its default value.
New in version 1.4.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for Gradient Boosted Tree Classification.
New in version 1.4.0.
Sets the value of predictionCol.
Sets the value of subsamplingRate.
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Model fitted by GBTClassifier.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Estimate of the importance of each feature.
Each feature’s importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method is suggested by Hastie et al. (Hastie, Tibshirani, Friedman. “The Elements of Statistical Learning, 2nd Edition.” 2001.) and follows the implementation from scikit-learn.
New in version 2.0.0.
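The averaging and normalization described above can be sketched in a few lines. The helper name and the per-tree vectors are hypothetical; the sketch assumes each tree already provides one importance value per feature.

```python
def ensemble_feature_importances(per_tree_importances):
    """Average per-tree importance vectors, then normalize to sum to 1."""
    num_trees = len(per_tree_importances)
    num_features = len(per_tree_importances[0])
    averaged = [
        sum(tree[j] for tree in per_tree_importances) / num_trees
        for j in range(num_features)
    ]
    total = sum(averaged)
    return [a / total for a in averaged] if total > 0 else averaged


# Hypothetical importance vectors for two trees over three features.
combined = ensemble_feature_importances([[0.8, 0.2, 0.0], [0.6, 0.0, 0.4]])
print(combined)
```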
Number of trees in ensemble.
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Full description of model.
New in version 2.0.0.
Total number of nodes, summed over all trees in the ensemble.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns the weights for each tree.
New in version 1.5.0.
Trees in this ensemble. Warning: These have null parent Estimators.
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> allclose(model.treeWeights, [1.0, 1.0, 1.0])
True
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> result = model.transform(test0).head()
>>> result.prediction
0.0
>>> numpy.argmax(result.probability)
0
>>> numpy.argmax(result.rawPrediction)
0
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> model.trees
[DecisionTreeClassificationModel (uid=...) of depth..., DecisionTreeClassificationModel...]
>>> rfc_path = temp_path + "/rfc"
>>> rf.save(rfc_path)
>>> rf2 = RandomForestClassifier.load(rfc_path)
>>> rf2.getNumTrees()
3
>>> model_path = temp_path + "/rfc_model"
>>> model.save(model_path)
>>> model2 = RandomForestClassificationModel.load(model_path)
>>> model.featureImportances == model2.featureImportances
True
New in version 1.4.0.
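The example above reads the predicted class off `rawPrediction` and `probability` with `argmax`. A simplified picture of how a forest could arrive at those vectors is to sum per-tree class-probability vectors (each tree weighted 1.0, as in `treeWeights`) and normalize. This is a sketch under that assumption; `forest_predict` and the per-tree vectors are hypothetical.

```python
def forest_predict(tree_probabilities):
    """Combine per-tree class-probability vectors into a forest prediction."""
    num_classes = len(tree_probabilities[0])
    # Raw prediction: element-wise sum of the per-tree vectors (weight 1.0 each).
    raw = [sum(p[k] for p in tree_probabilities) for k in range(num_classes)]
    total = sum(raw)
    # Probability: the raw prediction normalized to sum to 1.
    probability = [r / total for r in raw]
    # Prediction: index of the largest raw value, as a float label.
    prediction = float(max(range(num_classes), key=lambda k: raw[k]))
    return raw, probability, prediction


# Hypothetical class-probability vectors from three trees.
raw, probability, prediction = forest_predict([[1.0, 0.0], [0.8, 0.2], [0.4, 0.6]])
print(raw, probability, prediction)
```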
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featureSubsetStrategy or its default value.
New in version 1.4.0.
Gets the value of featuresCol or its default value.
Gets the value of impurity or its default value.
New in version 1.6.0.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of numTrees or its default value.
New in version 1.4.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of probabilityCol or its default value.
Gets the value of rawPredictionCol or its default value.
Gets the value of seed or its default value.
Gets the value of subsamplingRate or its default value.
New in version 1.4.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featureSubsetStrategy.
New in version 1.4.0.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for the RandomForestClassifier.
New in version 1.4.0.
Sets the value of predictionCol.
Sets the value of probabilityCol.
Sets the value of rawPredictionCol.
Sets the value of subsamplingRate.
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Model fitted by RandomForestClassifier.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Estimate of the importance of each feature.
Each feature’s importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method is suggested by Hastie et al. (Hastie, Tibshirani, Friedman. “The Elements of Statistical Learning, 2nd Edition.” 2001.) and follows the implementation from scikit-learn.
New in version 2.0.0.
Number of trees in ensemble.
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Number of classes (values which the label can take).
New in version 2.1.0.
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Full description of model.
New in version 2.0.0.
Total number of nodes, summed over all trees in the ensemble.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns the weights for each tree.
New in version 1.5.0.
Trees in this ensemble. Warning: These have null parent Estimators.
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
Naive Bayes Classifiers. It supports both Multinomial and Bernoulli NB. Multinomial NB can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector binary (0/1) data, it can also be used as Bernoulli NB. The input feature values must be nonnegative.
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... Row(label=0.0, weight=0.1, features=Vectors.dense([0.0, 0.0])),
... Row(label=0.0, weight=0.5, features=Vectors.dense([0.0, 1.0])),
... Row(label=1.0, weight=1.0, features=Vectors.dense([1.0, 0.0]))])
>>> nb = NaiveBayes(smoothing=1.0, modelType="multinomial", weightCol="weight")
>>> model = nb.fit(df)
>>> model.pi
DenseVector([-0.81..., -0.58...])
>>> model.theta
DenseMatrix(2, 2, [-0.91..., -0.51..., -0.40..., -1.09...], 1)
>>> test0 = sc.parallelize([Row(features=Vectors.dense([1.0, 0.0]))]).toDF()
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.32..., 0.67...])
>>> result.rawPrediction
DenseVector([-1.72..., -0.99...])
>>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
>>> model.transform(test1).head().prediction
1.0
>>> nb_path = temp_path + "/nb"
>>> nb.save(nb_path)
>>> nb2 = NaiveBayes.load(nb_path)
>>> nb2.getSmoothing()
1.0
>>> model_path = temp_path + "/nb_model"
>>> model.save(model_path)
>>> model2 = NaiveBayesModel.load(model_path)
>>> model.pi == model2.pi
True
>>> model.theta == model2.theta
True
>>> nb = nb.setThresholds([0.01, 10.00])
>>> model3 = nb.fit(df)
>>> result = model3.transform(test0).head()
>>> result.prediction
0.0
New in version 1.5.0.
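The `pi`, `theta`, `rawPrediction` and `probability` values in the example above can be reproduced by hand with the usual smoothed, weighted multinomial Naive Bayes estimates. The sketch below is an independent pure-Python reconstruction, not Spark's code; `fit_multinomial_nb` and `predict` are hypothetical helpers, and the formulas are the standard textbook ones, which happen to match the printed digits.

```python
import math


def fit_multinomial_nb(rows, num_classes, num_features, smoothing=1.0):
    """rows: iterable of (label, weight, features). Returns (pi, theta) as logs."""
    class_weight = [0.0] * num_classes
    feature_sums = [[0.0] * num_features for _ in range(num_classes)]
    for label, weight, features in rows:
        k = int(label)
        class_weight[k] += weight
        for j, value in enumerate(features):
            feature_sums[k][j] += weight * value
    total_weight = sum(class_weight)
    # Log class priors with additive smoothing.
    pi = [
        math.log((class_weight[k] + smoothing)
                 / (total_weight + num_classes * smoothing))
        for k in range(num_classes)
    ]
    # Log conditional feature probabilities with additive smoothing.
    theta = [
        [
            math.log((feature_sums[k][j] + smoothing)
                     / (sum(feature_sums[k]) + num_features * smoothing))
            for j in range(num_features)
        ]
        for k in range(num_classes)
    ]
    return pi, theta


def predict(pi, theta, features):
    """Return (raw log scores, normalized probabilities) for one feature vector."""
    raw = [pi[k] + sum(t * x for t, x in zip(theta[k], features))
           for k in range(len(pi))]
    shift = max(raw)  # subtract the max before exponentiating for stability
    exps = [math.exp(r - shift) for r in raw]
    probability = [e / sum(exps) for e in exps]
    return raw, probability


# Same (label, weight, features) rows as the example above.
rows = [(0.0, 0.1, [0.0, 0.0]), (0.0, 0.5, [0.0, 1.0]), (1.0, 1.0, [1.0, 0.0])]
pi, theta = fit_multinomial_nb(rows, num_classes=2, num_features=2)
raw, probability = predict(pi, theta, [1.0, 0.0])
print(pi, theta, raw, probability)
```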
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of probabilityCol or its default value.
Gets the value of rawPredictionCol or its default value.
Gets the value of thresholds or its default value.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of featuresCol.
Sets params for Naive Bayes.
New in version 1.5.0.
Sets the value of predictionCol.
Sets the value of probabilityCol.
Sets the value of rawPredictionCol.
Sets the value of thresholds.
Returns an MLWriter instance for this ML instance.
Model fitted by NaiveBayes.
New in version 1.5.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Number of classes (values which the label can take).
New in version 2.1.0.
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Classifier trainer based on the Multilayer Perceptron. Each layer has a sigmoid activation function; the output layer has softmax. The number of inputs has to be equal to the size of the feature vectors. The number of outputs has to be equal to the total number of labels.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (0.0, Vectors.dense([0.0, 0.0])),
... (1.0, Vectors.dense([0.0, 1.0])),
... (1.0, Vectors.dense([1.0, 0.0])),
... (0.0, Vectors.dense([1.0, 1.0]))], ["label", "features"])
>>> mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[2, 2, 2], blockSize=1, seed=123)
>>> model = mlp.fit(df)
>>> model.layers
[2, 2, 2]
>>> model.weights.size
12
>>> testDF = spark.createDataFrame([
... (Vectors.dense([1.0, 0.0]),),
... (Vectors.dense([0.0, 0.0]),)], ["features"])
>>> model.transform(testDF).show()
+---------+----------+
| features|prediction|
+---------+----------+
|[1.0,0.0]|       1.0|
|[0.0,0.0]|       0.0|
+---------+----------+
...
>>> mlp_path = temp_path + "/mlp"
>>> mlp.save(mlp_path)
>>> mlp2 = MultilayerPerceptronClassifier.load(mlp_path)
>>> mlp2.getBlockSize()
1
>>> model_path = temp_path + "/mlp_model"
>>> model.save(model_path)
>>> model2 = MultilayerPerceptronClassificationModel.load(model_path)
>>> model.layers == model2.layers
True
>>> model.weights == model2.weights
True
>>> mlp2 = mlp2.setInitialWeights(list(range(0, 12)))
>>> model3 = mlp2.fit(df)
>>> model3.weights != model2.weights
True
>>> model3.layers == model.layers
True
New in version 1.6.0.
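The weight count reported in the example (12 for layers [2, 2, 2]) is consistent with a fully connected network that keeps one bias per output unit. The sketch below is a plain-Python illustration under that assumption; the weight layout used in `forward` is chosen for readability and is not claimed to match the library's internal layout.

```python
import math


def count_weights(layers):
    """Weights plus biases for a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layer_params):
    """layer_params is a list of (weight_rows, biases), one pair per layer.

    weight_rows[j] holds the incoming weights of output unit j.
    """
    activations = list(x)
    for i, (rows, biases) in enumerate(layer_params):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(rows, biases)]
        is_output = i == len(layer_params) - 1
        # Sigmoid on hidden layers, softmax on the output layer.
        activations = softmax(z) if is_output else [sigmoid(v) for v in z]
    return activations


print(count_weights([2, 2, 2]))  # 12
```

Because the last layer applies softmax, the values returned by `forward` are non-negative and sum to 1, so they can be read as class probabilities.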
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of initialWeights or its default value.
New in version 2.0.0.
Gets the value of labelCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Gets the value of tol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of featuresCol.
Sets the value of initialWeights.
New in version 2.0.0.
Sets params for MultilayerPerceptronClassifier.
New in version 1.6.0.
Sets the value of predictionCol.
Returns an MLWriter instance for this ML instance.
Model fitted by MultilayerPerceptronClassifier.
New in version 1.6.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Reduction of Multiclass Classification to Binary Classification. Performs the reduction using a one-against-all strategy. For a multiclass classification with k classes, it trains k models (one per class). Each example is scored against all k models, and the model with the highest score is picked to label the example.
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> data_path = "data/mllib/sample_multiclass_classification_data.txt"
>>> df = spark.read.format("libsvm").load(data_path)
>>> lr = LogisticRegression(regParam=0.01)
>>> ovr = OneVsRest(classifier=lr)
>>> model = ovr.fit(df)
>>> model.models[0].coefficients
DenseVector([0.5..., -1.0..., 3.4..., 4.2...])
>>> model.models[1].coefficients
DenseVector([-2.1..., 3.1..., -2.6..., -2.3...])
>>> model.models[2].coefficients
DenseVector([0.3..., -3.4..., 1.0..., -1.1...])
>>> [x.intercept for x in model.models]
[-2.7..., -2.5..., -1.3...]
>>> test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0, 1.0, 1.0))]).toDF()
>>> model.transform(test0).head().prediction
0.0
>>> test1 = sc.parallelize([Row(features=Vectors.sparse(4, [0], [1.0]))]).toDF()
>>> model.transform(test1).head().prediction
2.0
>>> test2 = sc.parallelize([Row(features=Vectors.dense(0.5, 0.4, 0.3, 0.2))]).toDF()
>>> model.transform(test2).head().prediction
0.0
>>> model_path = temp_path + "/ovr_model"
>>> model.save(model_path)
>>> model2 = OneVsRestModel.load(model_path)
>>> model2.transform(test0).head().prediction
0.0
New in version 2.0.0.
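The reduction itself is independent of the binary learner. The following plain-Python sketch shows the relabel, train, and arg-max steps; `train_centroid_scorer` is a deliberately trivial, hypothetical binary scorer used only to keep the example self-contained, and the data is made up.

```python
def relabel(labels, positive_class):
    """Binary view of a multiclass label list: 1.0 for one class, 0.0 for the rest."""
    return [1.0 if y == positive_class else 0.0 for y in labels]


def train_centroid_scorer(points, binary_labels):
    """Toy binary 'classifier': score = negative squared distance to the positive centroid."""
    positives = [p for p, y in zip(points, binary_labels) if y == 1.0]
    dim = len(points[0])
    centroid = [sum(p[d] for p in positives) / len(positives) for d in range(dim)]
    return lambda x: -sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))


def fit_one_vs_rest(points, labels):
    classes = sorted(set(labels))
    # One binary model per class, each trained on "this class vs. the rest".
    scorers = [train_centroid_scorer(points, relabel(labels, c)) for c in classes]
    return classes, scorers


def predict(x, classes, scorers):
    scores = [score(x) for score in scorers]
    return classes[scores.index(max(scores))]  # highest-scoring model wins


points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.0, 9.0), (0.3, 9.1)]
labels = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
classes, scorers = fit_one_vs_rest(points, labels)
print(predict((4.8, 5.1), classes, scorers))  # 1.0
```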
Creates a copy of this instance with a randomly generated uid and some extra params. This creates a deep copy of the embedded paramMap, and copies the embedded and extra parameters over.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of classifier or its default value.
New in version 2.0.0.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Save this ML instance to the given path, a shortcut of write().save(path).
New in version 2.0.0.
Sets the value of classifier.
Note
Only LogisticRegression and NaiveBayes are supported now.
New in version 2.0.0.
Sets the value of featuresCol.
setParams(self, featuresCol=None, labelCol=None, predictionCol=None, classifier=None): Sets params for OneVsRest.
New in version 2.0.0.
Sets the value of predictionCol.
Note
Experimental
Model fitted by OneVsRest. This stores the models resulting from training k binary classifiers: one for each class. Each example is scored against all k models, and the model with the highest score is picked to label the example.
New in version 2.0.0.
Creates a copy of this instance with a randomly generated uid and some extra params. This creates a deep copy of the embedded paramMap, and copies the embedded and extra parameters over.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of classifier or its default value.
New in version 2.0.0.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Save this ML instance to the given path, a shortcut of write().save(path).
New in version 2.0.0.
Sets the value of classifier.
Note
Only LogisticRegression and NaiveBayes are supported now.
New in version 2.0.0.
Sets the value of featuresCol.
Sets the value of predictionCol.
Transforms the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
... (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
>>> df = spark.createDataFrame(data, ["features"])
>>> bkm = BisectingKMeans(k=2, minDivisibleClusterSize=1.0)
>>> model = bkm.fit(df)
>>> centers = model.clusterCenters()
>>> len(centers)
2
>>> model.computeCost(df)
2.000...
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]
>>> transformed = model.transform(df).select("features", "prediction")
>>> rows = transformed.collect()
>>> rows[0].prediction == rows[1].prediction
True
>>> rows[2].prediction == rows[3].prediction
True
>>> bkm_path = temp_path + "/bkm"
>>> bkm.save(bkm_path)
>>> bkm2 = BisectingKMeans.load(bkm_path)
>>> bkm2.getK()
2
>>> model_path = temp_path + "/bkm_model"
>>> model.save(model_path)
>>> model2 = BisectingKMeansModel.load(model_path)
>>> model2.hasSummary
False
>>> model.clusterCenters()[0] == model2.clusterCenters()[0]
array([ True, True], dtype=bool)
>>> model.clusterCenters()[1] == model2.clusterCenters()[1]
array([ True, True], dtype=bool)
New in version 2.0.0.
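The bisecting idea can be sketched in a few lines of plain Python. This is a simplified variant, not the Spark implementation: it splits one leaf at a time (largest divisible leaf first) instead of grouping the splits of a whole level, and the deterministic initialisation inside `two_means` is an assumption made for the sketch.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def two_means(points, iterations=10):
    """Split points into two groups with a few Lloyd iterations.

    Initial centers: the first point and the point farthest from it.
    Requires at least two distinct points.
    """
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    best = None
    for _ in range(iterations):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break  # keep the last valid split
        best = (left, right)
        c0, c1 = mean(left), mean(right)
    return best


def bisecting_kmeans(points, k):
    leaves = [list(points)]
    while len(leaves) < k:
        # A leaf is divisible here if it holds at least two distinct points.
        divisible = [c for c in leaves if len(set(c)) > 1]
        if not divisible:
            break
        target = max(divisible, key=len)  # larger clusters get priority
        left, right = two_means(target)
        leaves.remove(target)
        leaves.extend([left, right])
    return leaves


data = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
leaves = bisecting_kmeans(data, k=2)
print(sorted(len(c) for c in leaves))  # [2, 2]
```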
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of minDivisibleClusterSize or its default value.
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of featuresCol.
Sets the value of minDivisibleClusterSize.
New in version 2.0.0.
Sets params for BisectingKMeans.
New in version 2.0.0.
Sets the value of predictionCol.
Returns an MLWriter instance for this ML instance.
Model fitted by BisectingKMeans.
New in version 2.0.0.
Get the cluster centers, represented as a list of NumPy arrays.
New in version 2.0.0.
Computes the sum of squared distances between the input points and their corresponding cluster centers.
New in version 2.0.0.
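As a worked example of this cost, the sketch below recomputes the value 2.0 shown in the doctest for the four sample points. It assumes that each point's corresponding center is its nearest center and that the two centers are the means of the two point pairs; the doctest does not print the centers, so those values are an assumption.

```python
def compute_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers = [(0.5, 0.5), (8.5, 8.5)]  # assumed: means of the two obvious pairs
print(compute_cost(points, centers))  # 2.0
```

Each point lies 0.5 away from its center along both axes, so it contributes 0.25 + 0.25 = 0.5, and the four points together give 2.0.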
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Indicates whether a training summary exists for this model instance.
New in version 2.1.0.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Gets summary (e.g. cluster assignments, cluster sizes) of the model trained on the training set. An exception is thrown if no summary exists.
New in version 2.1.0.
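Since accessing the summary throws when none exists (for example on a model that was loaded from disk, as the doctest above shows), callers may want to guard the access with hasSummary. The sketch below uses `ToyModel`, a hypothetical stand-in with the same two attributes, so that it runs without Spark; the exception type is also an assumption.

```python
from types import SimpleNamespace


class ToyModel:
    """Hypothetical stand-in exposing hasSummary and summary like the models above."""

    def __init__(self, summary=None):
        self._summary = summary

    @property
    def hasSummary(self):
        return self._summary is not None

    @property
    def summary(self):
        if self._summary is None:
            raise RuntimeError("No training summary available for this model")
        return self._summary


def cluster_sizes_or_none(model):
    """Return the summary's cluster sizes, or None when no summary exists."""
    return model.summary.clusterSizes if model.hasSummary else None


trained = ToyModel(SimpleNamespace(k=2, clusterSizes=[2, 2]))
loaded = ToyModel()  # mimics a model restored with load(), which has no summary
```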
Transforms the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Bisecting KMeans clustering results for a given model.
New in version 2.1.0.
DataFrame of predicted cluster centers for each training data point.
New in version 2.1.0.
Size of (number of data points in) each cluster.
New in version 2.1.0.
Name for column of features in predictions.
New in version 2.1.0.
The number of clusters the model was trained with.
New in version 2.1.0.
Name for column of predicted clusters in predictions.
New in version 2.1.0.
DataFrame produced by the model’s transform method.
New in version 2.1.0.
K-means clustering with a k-means++-like initialization mode (the k-means|| algorithm by Bahmani et al.).
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
... (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
>>> df = spark.createDataFrame(data, ["features"])
>>> kmeans = KMeans(k=2, seed=1)
>>> model = kmeans.fit(df)
>>> centers = model.clusterCenters()
>>> len(centers)
2
>>> model.computeCost(df)
2.000...
>>> transformed = model.transform(df).select("features", "prediction")
>>> rows = transformed.collect()
>>> rows[0].prediction == rows[1].prediction
True
>>> rows[2].prediction == rows[3].prediction
True
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]
>>> kmeans_path = temp_path + "/kmeans"
>>> kmeans.save(kmeans_path)
>>> kmeans2 = KMeans.load(kmeans_path)
>>> kmeans2.getK()
2
>>> model_path = temp_path + "/kmeans_model"
>>> model.save(model_path)
>>> model2 = KMeansModel.load(model_path)
>>> model2.hasSummary
False
>>> model.clusterCenters()[0] == model2.clusterCenters()[0]
array([ True, True], dtype=bool)
>>> model.clusterCenters()[1] == model2.clusterCenters()[1]
array([ True, True], dtype=bool)
New in version 1.5.0.
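For intuition about the initialization mode, the sketch below implements sequential k-means++ seeding in plain Python: each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. This is the sequential scheme that k-means|| builds on, not the k-means|| algorithm itself and not Spark's implementation; the function name and the seed handling are assumptions.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=0):
    """Pick up to k initial centers with k-means++ (D^2-weighted) sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance of every point to its nearest already-chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0.0:
            break  # every point coincides with a chosen center
        r = rng.uniform(0.0, total)
        cumulative = 0.0
        for p, w in zip(points, d2):
            cumulative += w
            if w > 0.0 and cumulative >= r:
                centers.append(p)
                break
    return centers


data = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
initial_centers = kmeans_pp_init(data, k=2, seed=1)
```

Points far from the existing centers are more likely to be picked, which tends to spread the initial centers across the data.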
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Gets the value of tol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of featuresCol.
Sets params for KMeans.
New in version 1.5.0.
Sets the value of predictionCol.
Returns an MLWriter instance for this ML instance.
Model fitted by KMeans.
New in version 1.5.0.
Get the cluster centers, represented as a list of NumPy arrays.
New in version 1.5.0.
Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Indicates whether a training summary exists for this model instance.
New in version 2.1.0.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Gets summary (e.g. cluster assignments, cluster sizes) of the model trained on the training set. An exception is thrown if no summary exists.
New in version 2.1.0.
Transforms the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
GaussianMixture clustering. This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated “mixing” weights specifying each component’s contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol (set through the tol param in this API), or until it has reached the maximum number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
Note
For high-dimensional data (with many features), this algorithm may perform poorly. This is due to high-dimensional data (a) making it difficult to cluster at all (based on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([-0.1, -0.05 ]),),
... (Vectors.dense([-0.01, -0.1]),),
... (Vectors.dense([0.9, 0.8]),),
... (Vectors.dense([0.75, 0.935]),),
... (Vectors.dense([-0.83, -0.68]),),
... (Vectors.dense([-0.91, -0.76]),)]
>>> df = spark.createDataFrame(data, ["features"])
>>> gm = GaussianMixture(k=3, tol=0.0001,
... maxIter=10, seed=10)
>>> model = gm.fit(df)
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
3
>>> summary.clusterSizes
[2, 2, 2]
>>> summary.logLikelihood
8.14636...
>>> weights = model.weights
>>> len(weights)
3
>>> model.gaussiansDF.select("mean").head()
Row(mean=DenseVector([0.825, 0.8675]))
>>> model.gaussiansDF.select("cov").head()
Row(cov=DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], False))
>>> transformed = model.transform(df).select("features", "prediction")
>>> rows = transformed.collect()
>>> rows[4].prediction == rows[5].prediction
True
>>> rows[2].prediction == rows[3].prediction
True
>>> gmm_path = temp_path + "/gmm"
>>> gm.save(gmm_path)
>>> gm2 = GaussianMixture.load(gmm_path)
>>> gm2.getK()
3
>>> model_path = temp_path + "/gmm_model"
>>> model.save(model_path)
>>> model2 = GaussianMixtureModel.load(model_path)
>>> model2.hasSummary
False
>>> model2.weights == model.weights
True
>>> model2.gaussiansDF.select("mean").head()
Row(mean=DenseVector([0.825, 0.8675]))
>>> model2.gaussiansDF.select("cov").head()
Row(cov=DenseMatrix(2, 2, [0.0056, -0.0051, -0.0051, 0.0046], False))
New in version 2.0.0.
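The quantity that expectation maximization tracks can be shown with a one-dimensional mixture. The sketch below computes the E-step responsibilities and the log-likelihood for fixed parameters; it is a univariate simplification of the multivariate model described above, and the sample values, weights, means, and variances are made up for the illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(xs, weights, means, variances):
    """Return per-point responsibilities and the total log-likelihood."""
    responsibilities = []
    log_likelihood = 0.0
    for x in xs:
        # Weighted density of x under each mixture component.
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        responsibilities.append([j / total for j in joint])
        log_likelihood += math.log(total)
    return responsibilities, log_likelihood


xs = [-0.1, 0.0, 0.9, 1.0]
resp, ll = e_step(xs, weights=[0.5, 0.5], means=[0.0, 1.0], variances=[0.05, 0.05])
```

An EM loop would alternate this step with a re-estimation of the weights, means, and variances, and stop once the change in `ll` between iterations falls below the tolerance or the iteration limit is reached.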
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map. If there are conflicts, the value later in this ordering wins: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of probabilityCol or its default value.
Gets the value of seed or its default value.
Gets the value of tol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of featuresCol.
Sets params for GaussianMixture.
New in version 2.0.0.
Sets the value of predictionCol.
Sets the value of probabilityCol.
Returns an MLWriter instance for this ML instance.
Model fitted by GaussianMixture.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Retrieve Gaussian distributions as a DataFrame. Each row represents a Gaussian Distribution. The DataFrame has two columns: mean (Vector) and cov (Matrix).
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Indicates whether a training summary exists for this model instance.
New in version 2.1.0.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Gets summary (e.g. cluster assignments, cluster sizes) of the model trained on the training set. An exception is thrown if no summary exists.
New in version 2.1.0.
Transforms the input dataset with optional parameters.
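Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. |
---|---|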
Returns: | transformed dataset |
New in version 1.3.0.
Weight for each Gaussian distribution in the mixture. This is a multinomial probability distribution over the k Gaussians, where weights[i] is the weight for Gaussian i, and weights sum to 1.
New in version 2.0.0.
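To illustrate how the mixture weights act as a probability distribution over the k Gaussians, here is a minimal sketch for the one-dimensional case. The univariate simplification, the function names, and the sample numbers are all assumptions made for illustration; the model itself works with mean vectors and covariance matrices.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given the point x."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    parts = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(parts)
    return [p / total for p in parts]


# Two equally weighted components centered at 0 and 5; x = 0 sits on the first.
resp = responsibilities(0.0, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

Because the weights sum to 1, the weighted component densities form a valid mixture density and the responsibilities also sum to 1.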
Returns an MLWriter instance for this ML instance.
Note
Experimental
Gaussian mixture clustering results for a given model.
New in version 2.1.0.
DataFrame of predicted cluster centers for each training data point.
New in version 2.1.0.
Size of (number of data points in) each cluster.
New in version 2.1.0.
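Conceptually, the cluster sizes are counts of predicted cluster indices. A minimal sketch, assuming predictions are available as a plain list of integer cluster indices (the helper name and the list-based input are hypothetical):

```python
from collections import Counter


def cluster_sizes(predictions, k):
    """Count how many points were assigned to each of the k clusters."""
    counts = Counter(predictions)
    # Clusters with no assigned points are reported with size 0.
    return [counts.get(i, 0) for i in range(k)]


sizes = cluster_sizes([0, 1, 1, 2, 1], k=3)
```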
Name for column of features in predictions.
New in version 2.1.0.
The number of clusters the model was trained with.
New in version 2.1.0.
Name for column of predicted clusters in predictions.
New in version 2.1.0.
DataFrame produced by the model’s transform method.
New in version 2.1.0.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
- “term” = “word”: an element of the vocabulary
- “token”: instance of a term appearing in a document
- “topic”: multinomial distribution over terms representing some concept
- “document”: one piece of text, corresponding to one row in the input data
Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as pyspark.ml.feature.Tokenizer and pyspark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.
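The input format described above can be sketched without Spark: each document becomes a vector of length vocabSize whose entries are term counts. The helper names are hypothetical, and the alphabetical vocabulary ordering is an assumption of this sketch (a real vectorizer may order terms differently).

```python
def build_vocabulary(tokenized_docs):
    """Map each distinct term to a column index (alphabetical order here)."""
    vocab = sorted({term for doc in tokenized_docs for term in doc})
    return {term: index for index, term in enumerate(vocab)}


def to_count_vector(tokens, vocab_index):
    """Turn one tokenized document into a dense vector of term counts."""
    counts = [0.0] * len(vocab_index)
    for token in tokens:
        counts[vocab_index[token]] += 1.0
    return counts


docs = [["spark", "ml", "spark"], ["ml", "lda"]]
vocab_index = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab_index) for doc in docs]
```

Each token occurrence increments the entry for its term, so the entries of a document vector sum to the number of tokens in that document.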
>>> from pyspark.ml.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA
>>> df = spark.createDataFrame([[1, Vectors.dense([0.0, 1.0])],
... [2, SparseVector(2, {0: 1.0})],], ["id", "features"])
>>> lda = LDA(k=2, seed=1, optimizer="em")
>>> model = lda.fit(df)
>>> model.isDistributed()
True
>>> localModel = model.toLocal()
>>> localModel.isDistributed()
False
>>> model.vocabSize()
2
>>> model.describeTopics().show()
+-----+-----------+--------------------+
|topic|termIndices| termWeights|
+-----+-----------+--------------------+
| 0| [1, 0]|[0.50401530077160...|
| 1| [0, 1]|[0.50401530077160...|
+-----+-----------+--------------------+
...
>>> model.topicsMatrix()
DenseMatrix(2, 2, [0.496, 0.504, 0.504, 0.496], 0)
>>> lda_path = temp_path + "/lda"
>>> lda.save(lda_path)
>>> sameLDA = LDA.load(lda_path)
>>> distributed_model_path = temp_path + "/lda_distributed_model"
>>> model.save(distributed_model_path)
>>> sameModel = DistributedLDAModel.load(distributed_model_path)
>>> local_model_path = temp_path + "/lda_local_model"
>>> localModel.save(local_model_path)
>>> sameLocalModel = LocalLDAModel.load(local_model_path)
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
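Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models. |
---|---|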
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of checkpointInterval or its default value.
Gets the value of docConcentration or its default value.
New in version 2.0.0.
Gets the value of featuresCol or its default value.
Gets the value of keepLastCheckpoint or its default value.
New in version 2.0.0.
Gets the value of learningDecay or its default value.
New in version 2.0.0.
Gets the value of learningOffset or its default value.
New in version 2.0.0.
Gets the value of maxIter or its default value.
Gets the value of optimizeDocConcentration or its default value.
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of seed or its default value.
Gets the value of subsamplingRate or its default value.
New in version 2.0.0.
Gets the value of topicConcentration or its default value.
New in version 2.0.0.
Gets the value of topicDistributionCol or its default value.
New in version 2.0.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of checkpointInterval.
Sets the value of docConcentration.
>>> algo = LDA().setDocConcentration([0.1, 0.2])
>>> algo.getDocConcentration()
[0.1..., 0.2...]
New in version 2.0.0.
Sets the value of featuresCol.
Sets the value of k.
>>> algo = LDA().setK(10)
>>> algo.getK()
10
New in version 2.0.0.
Sets the value of keepLastCheckpoint.
>>> algo = LDA().setKeepLastCheckpoint(False)
>>> algo.getKeepLastCheckpoint()
False
New in version 2.0.0.
Sets the value of learningDecay.
>>> algo = LDA().setLearningDecay(0.1)
>>> algo.getLearningDecay()
0.1...
New in version 2.0.0.
Sets the value of learningOffset.
>>> algo = LDA().setLearningOffset(100)
>>> algo.getLearningOffset()
100.0
New in version 2.0.0.
Sets the value of optimizeDocConcentration.
>>> algo = LDA().setOptimizeDocConcentration(True)
>>> algo.getOptimizeDocConcentration()
True
New in version 2.0.0.
Sets the value of optimizer. Currently only supports ‘em’ and ‘online’.
>>> algo = LDA().setOptimizer("em")
>>> algo.getOptimizer()
'em'
New in version 2.0.0.
setParams(self, featuresCol="features", maxIter=20, seed=None, checkpointInterval=10, k=10, optimizer="online", learningOffset=1024.0, learningDecay=0.51, subsamplingRate=0.05, optimizeDocConcentration=True, docConcentration=None, topicConcentration=None, topicDistributionCol="topicDistribution", keepLastCheckpoint=True)
Sets params for LDA.
New in version 2.0.0.
Sets the value of subsamplingRate.
>>> algo = LDA().setSubsamplingRate(0.1)
>>> algo.getSubsamplingRate()
0.1...
New in version 2.0.0.
Sets the value of topicConcentration.
>>> algo = LDA().setTopicConcentration(0.5)
>>> algo.getTopicConcentration()
0.5...
New in version 2.0.0.
Sets the value of topicDistributionCol.
>>> algo = LDA().setTopicDistributionCol("topicDistributionCol")
>>> algo.getTopicDistributionCol()
'topicDistributionCol'
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
Latent Dirichlet Allocation (LDA) model. This abstraction permits different underlying representations, including local and distributed data structures.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Return the topics described by their top-weighted terms.
New in version 2.0.0.
Value for LDA.docConcentration estimated from data. If Online LDA was used and LDA.optimizeDocConcentration was set to false, then this returns the fixed (given) value for the LDA.docConcentration parameter.
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Indicates whether this instance is of type DistributedLDAModel
New in version 2.0.0.
Checks whether a param is explicitly set by user.
Calculates a lower bound on the log likelihood of the entire corpus. See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
New in version 2.0.0.
Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
New in version 2.0.0.
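The two bounds above are commonly related by per-token normalization: a lower bound on the log likelihood, negated and divided by the number of tokens, gives an upper bound on log perplexity. The formula and the helper below are assumptions of this sketch, since this section does not state how the bound is normalized.

```python
import math


def log_perplexity(log_likelihood_bound, token_count):
    """Per-token log perplexity from a log-likelihood lower bound (assumed form)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return -log_likelihood_bound / token_count


# A corpus of 100 tokens with a log-likelihood bound of -200.0.
value = log_perplexity(-200.0, 100)
perplexity = math.exp(value)
```

Under this form, a higher (less negative) log likelihood yields a lower perplexity, which matches the "lower is better" remark.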
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
WARNING: If this model is actually a DistributedLDAModel instance produced by the Expectation-Maximization (“em”) optimizer, then this method could involve collecting a large amount of data to the driver (on the order of vocabSize x k).
New in version 2.0.0.
Transforms the input dataset with optional parameters.
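Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. |
---|---|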
Returns: | transformed dataset |
New in version 1.3.0.
Local (non-distributed) model fitted by LDA. This model stores the inferred topics only; it does not store info about the training dataset.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Return the topics described by their top-weighted terms.
New in version 2.0.0.
Value for LDA.docConcentration estimated from data. If Online LDA was used and LDA.optimizeDocConcentration was set to false, then this returns the fixed (given) value for the LDA.docConcentration parameter.
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Indicates whether this instance is of type DistributedLDAModel
New in version 2.0.0.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Calculates a lower bound on the log likelihood of the entire corpus. See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
New in version 2.0.0.
Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
New in version 2.0.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
WARNING: If this model is actually a DistributedLDAModel instance produced by the Expectation-Maximization (“em”) optimizer, then this method could involve collecting a large amount of data to the driver (on the order of vocabSize x k).
New in version 2.0.0.
Transforms the input dataset with optional parameters.
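Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. |
---|---|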
Returns: | transformed dataset |
New in version 1.3.0.
Vocabulary size (number of terms or words in the vocabulary)
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
Distributed model fitted by LDA. This type of model is currently only produced by Expectation-Maximization (EM).
This model stores the inferred topics, the full training dataset, and the topic distribution for each training document.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Return the topics described by their top-weighted terms.
New in version 2.0.0.
Value for LDA.docConcentration estimated from data. If Online LDA was used and LDA.optimizeDocConcentration was set to false, then this returns the fixed (given) value for the LDA.docConcentration parameter.
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
If using checkpointing and LDA.keepLastCheckpoint is set to true, then there may be saved checkpoint files. This method is provided so that users can manage those files.
Note
Removing the checkpoints can cause failures if a partition is lost and is needed by certain DistributedLDAModel methods. Reference counting will clean up the checkpoints when this model and derivative data go out of scope.
Returns: | List of checkpoint files from training |
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Indicates whether this instance is of type DistributedLDAModel
New in version 2.0.0.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Calculates a lower bound on the log likelihood of the entire corpus. See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
New in version 2.0.0.
Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
New in version 2.0.0.
Log probability of the current parameter estimate: log P(topics, topic distributions for docs | alpha, eta)
New in version 2.0.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Convert this distributed model to a local representation. This discards info about the training dataset.
WARNING: This involves collecting a large topicsMatrix() to the driver.
New in version 2.0.0.
Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
WARNING: If this model is actually a DistributedLDAModel instance produced by the Expectation-Maximization (“em”) optimizer, then this method could involve collecting a large amount of data to the driver (on the order of vocabSize x k).
New in version 2.0.0.
Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters)
New in version 2.0.0.
Transforms the input dataset with optional parameters.
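Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. |
---|---|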
Returns: | transformed dataset |
New in version 1.3.0.
Vocabulary size (number of terms or words in the vocabulary)
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
MLlib utilities for linear algebra. For dense vectors, MLlib uses the NumPy array type, so you can simply pass NumPy arrays around. For sparse vectors, users can construct a SparseVector object from MLlib or pass SciPy scipy.sparse column vectors if SciPy is available in their environment.
A dense vector represented by a value array. A NumPy array is used for storage, and arithmetic is delegated to the underlying NumPy array.
>>> from pyspark.ml.linalg import Vectors
>>> v = Vectors.dense([1.0, 2.0])
>>> u = Vectors.dense([3.0, 4.0])
>>> v + u
DenseVector([4.0, 6.0])
>>> 2 - v
DenseVector([1.0, 0.0])
>>> v / 2
DenseVector([0.5, 1.0])
>>> v * u
DenseVector([3.0, 8.0])
>>> u / v
DenseVector([3.0, 2.0])
>>> u % 2
DenseVector([1.0, 0.0])
Compute the dot product of two Vectors. The other operand may be a NumPy array, list, SparseVector, or SciPy sparse vector; a NumPy array operand may be either 1- or 2-dimensional. Equivalent to calling numpy.dot on the two operands.
>>> import array
>>> import numpy as np
>>> from pyspark.ml.linalg import DenseVector, SparseVector
>>> dense = DenseVector(array.array('d', [1., 2.]))
>>> dense.dot(dense)
5.0
>>> dense.dot(SparseVector(2, [0, 1], [2., 1.]))
4.0
>>> dense.dot(range(1, 3))
5.0
>>> dense.dot(np.array(range(1, 3)))
5.0
>>> dense.dot([1.,])
Traceback (most recent call last):
...
AssertionError: dimension mismatch
>>> dense.dot(np.reshape([1., 2., 3., 4.], (2, 2), order='F'))
array([ 5., 11.])
>>> dense.dot(np.reshape([1., 2., 3.], (3, 1), order='F'))
Traceback (most recent call last):
...
AssertionError: dimension mismatch
Calculates the norm of a DenseVector.
>>> a = DenseVector([0, -1, 2, -3])
>>> a.norm(2)
3.7...
>>> a.norm(1)
6.0
Squared distance of two Vectors.
>>> dense1 = DenseVector(array.array('d', [1., 2.]))
>>> dense1.squared_distance(dense1)
0.0
>>> dense2 = np.array([2., 1.])
>>> dense1.squared_distance(dense2)
2.0
>>> dense3 = [2., 1.]
>>> dense1.squared_distance(dense3)
2.0
>>> sparse1 = SparseVector(2, [0, 1], [2., 1.])
>>> dense1.squared_distance(sparse1)
2.0
>>> dense1.squared_distance([1.,])
Traceback (most recent call last):
...
AssertionError: dimension mismatch
>>> dense1.squared_distance(SparseVector(1, [0,], [1.,]))
Traceback (most recent call last):
...
AssertionError: dimension mismatch
A simple sparse vector class for passing data to MLlib. Users may alternatively pass SciPy’s scipy.sparse data types.
Dot product with a SparseVector or 1- or 2-dimensional Numpy array.
>>> a = SparseVector(4, [1, 3], [3.0, 4.0])
>>> a.dot(a)
25.0
>>> a.dot(array.array('d', [1., 2., 3., 4.]))
22.0
>>> b = SparseVector(4, [2], [1.0])
>>> a.dot(b)
0.0
>>> a.dot(np.array([[1, 1], [2, 2], [3, 3], [4, 4]]))
array([ 22., 22.])
>>> a.dot([1., 2., 3.])
Traceback (most recent call last):
...
AssertionError: dimension mismatch
>>> a.dot(np.array([1., 2.]))
Traceback (most recent call last):
...
AssertionError: dimension mismatch
>>> a.dot(DenseVector([1., 2.]))
Traceback (most recent call last):
...
AssertionError: dimension mismatch
>>> a.dot(np.zeros((3, 2)))
Traceback (most recent call last):
...
AssertionError: dimension mismatch
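The reason a sparse dot product is cheap is that only the active entries contribute. A minimal sketch in plain Python, using the same numbers as the examples above; the function names and the list-based representation are hypothetical, and the sparse-sparse case assumes both index lists are sorted.

```python
def sparse_dot_dense(size, indices, values, dense):
    """Dot product of a sparse vector (indices/values) with a dense sequence."""
    if len(dense) != size:
        raise ValueError("dimension mismatch")
    # Only active entries are visited; implicit zeros contribute nothing.
    return sum(v * dense[i] for i, v in zip(indices, values))


def sparse_dot_sparse(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors with sorted indices (two-pointer merge)."""
    total = 0.0
    i = j = 0
    while i < len(indices_a) and j < len(indices_b):
        if indices_a[i] == indices_b[j]:
            total += values_a[i] * values_b[j]
            i += 1
            j += 1
        elif indices_a[i] < indices_b[j]:
            i += 1
        else:
            j += 1
    return total
```

For example, `sparse_dot_dense(4, [1, 3], [3.0, 4.0], [1., 2., 3., 4.])` visits only positions 1 and 3.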
A list of indices corresponding to active entries.
Calculates the norm of a SparseVector.
>>> a = SparseVector(4, [0, 1], [3., -4.])
>>> a.norm(1)
7.0
>>> a.norm(2)
5.0
Number of nonzero elements. This scans all active values and counts the nonzeros.
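The scan is needed because an active entry can still hold an explicit zero, so the number of active entries and the number of nonzeros may differ. A minimal sketch over a plain list of active values (the helper name is hypothetical):

```python
def num_nonzeros(active_values):
    """Count active values that are not zero."""
    return sum(1 for value in active_values if value != 0)


# Three active entries, one of which is an explicitly stored zero.
count = num_nonzeros([3.0, 0.0, -4.0])
```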
Size of the vector.
Squared distance from a SparseVector or 1-dimensional NumPy array.
>>> a = SparseVector(4, [1, 3], [3.0, 4.0])
>>> a.squared_distance(a)
0.0
>>> a.squared_distance(array.array('d', [1., 2., 3., 4.]))
11.0
>>> a.squared_distance(np.array([1., 2., 3., 4.]))
11.0
>>> b = SparseVector(4, [2], [1.0])
>>> a.squared_distance(b)
26.0
>>> b.squared_distance(a)
26.0
>>> b.squared_distance([1., 2.])
Traceback (most recent call last):
...
AssertionError: dimension mismatch
>>> b.squared_distance(SparseVector(3, [1,], [1.0,]))
Traceback (most recent call last):
...
AssertionError: dimension mismatch
A list of values corresponding to active entries.
Factory methods for working with vectors.
Note
Dense vectors are simply represented as NumPy array objects, so there is no need to convert them for use in MLlib. For sparse vectors, the factory methods in this class create an MLlib-compatible type, or users can pass in SciPy’s scipy.sparse column vectors.
Create a dense vector of 64-bit floats from a Python list or numbers.
>>> Vectors.dense([1, 2, 3])
DenseVector([1.0, 2.0, 3.0])
>>> Vectors.dense(1.0, 2.0)
DenseVector([1.0, 2.0])
Create a sparse vector, using either a dictionary, a list of (index, value) pairs, or two separate arrays of indices and values (sorted by index).
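Parameters: | size – Size of the vector; args – Non-zero entries, as a dictionary, list of tuples, or two sorted lists containing indices and values. |
---|---|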
>>> Vectors.sparse(4, {1: 1.0, 3: 5.5})
SparseVector(4, {1: 1.0, 3: 5.5})
>>> Vectors.sparse(4, [(1, 1.0), (3, 5.5)])
SparseVector(4, {1: 1.0, 3: 5.5})
>>> Vectors.sparse(4, [1, 3], [1.0, 5.5])
SparseVector(4, {1: 1.0, 3: 5.5})
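The three accepted input forms shown above all describe the same (index, value) pairs. The following sketch normalizes them into sorted index and value lists; `normalize_sparse_args` is a hypothetical helper written for illustration, not the library's constructor, and its validation rules are assumptions.

```python
def normalize_sparse_args(size, *args):
    """Return (indices, values) from a dict, a list of pairs, or two arrays."""
    if len(args) == 1:
        entries = args[0]
        if isinstance(entries, dict):
            entries = entries.items()
        pairs = sorted(entries)               # sort by index
    elif len(args) == 2:
        pairs = list(zip(args[0], args[1]))   # assumed already sorted by index
    else:
        raise TypeError("expected one mapping/list of pairs or two arrays")

    indices = [int(i) for i, _ in pairs]
    values = [float(v) for _, v in pairs]
    for previous, current in zip(indices, indices[1:]):
        if current <= previous:
            raise ValueError("indices must be strictly increasing")
    if indices and (indices[0] < 0 or indices[-1] >= size):
        raise ValueError("index out of range for vector size")
    return indices, values


from_dict = normalize_sparse_args(4, {1: 1.0, 3: 5.5})
from_pairs = normalize_sparse_args(4, [(3, 5.5), (1, 1.0)])
from_arrays = normalize_sparse_args(4, [1, 3], [1.0, 5.5])
```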
Column-major dense matrix.
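In column-major layout the values of each column are stored contiguously, so entry (i, j) of a matrix with numRows rows sits at position i + j * numRows of the flat value array. A minimal sketch with plain lists (the helper names are hypothetical):

```python
def column_major_get(values, num_rows, i, j):
    """Entry at row i, column j of a column-major flat array."""
    return values[i + j * num_rows]


def column(values, num_rows, j):
    """All entries of column j, which are contiguous in column-major storage."""
    return values[j * num_rows:(j + 1) * num_rows]


# Flat storage for the 2 x 3 matrix
#   [[1.0, 3.0, 5.0],
#    [2.0, 4.0, 6.0]]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```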
Alternating Least Squares (ALS) matrix factorization.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called ‘factor’ matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
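The alternating scheme can be sketched for the simplest case: rank 1, a small fully observed ratings matrix, and plain Python lists. These simplifications, the function name, and the sample data are assumptions for illustration only; the actual implementation handles missing ratings, higher ranks, and distributed blocks.

```python
def als_rank1(ratings, num_iters=5, reg=0.0):
    """Rank-1 ALS on a dense list-of-rows matrix: ratings[u][i] ~ x[u] * y[i]."""
    num_users = len(ratings)
    num_items = len(ratings[0])
    x = [1.0] * num_users   # user factors
    y = [1.0] * num_items   # item factors
    for _ in range(num_iters):
        # Hold y constant and solve the least-squares problem for each x[u].
        yy = sum(v * v for v in y) + reg
        x = [
            sum(ratings[u][i] * y[i] for i in range(num_items)) / yy
            for u in range(num_users)
        ]
        # Hold the new x constant and solve for each y[i].
        xx = sum(v * v for v in x) + reg
        y = [
            sum(ratings[u][i] * x[u] for u in range(num_users)) / xx
            for i in range(num_items)
        ]
    return x, y


# A matrix that is exactly rank 1: the outer product of [1, 2, 3] and [4, 5].
ratings = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
x, y = als_rank1(ratings)
```

With `reg=0.0` and an exactly rank-1 input, the products `x[u] * y[i]` reproduce the ratings up to floating-point error.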
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as “users” and “products”) into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user’s feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the “out-links” of each user (which blocks of products it will contribute to) and “in-link” information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users’ ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on “Collaborative Filtering for Implicit Feedback Datasets”, adapted for the blocked approach used here.
Essentially, instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r > 0 and 0 if r <= 0. The ratings then act as ‘confidence’ values related to the strength of indicated user preferences rather than explicit ratings given to items.
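The mapping from a raw rating r to a preference and a confidence can be sketched as follows. The preference rule comes from the text above; the linear confidence form `1 + alpha * |r|` and the `alpha` parameter are assumptions of this sketch, since this section does not spell out how confidence is computed.

```python
def preference(r):
    """Binary preference: 1 if the rating is positive, otherwise 0."""
    return 1.0 if r > 0 else 0.0


def confidence(r, alpha=1.0):
    """Assumed linear confidence weight; larger |r| means more confidence."""
    return 1.0 + alpha * abs(r)


observations = [3.0, 0.0, -2.0]
prefs = [preference(r) for r in observations]
confs = [confidence(r, alpha=2.0) for r in observations]
```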
>>> df = spark.createDataFrame(
... [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
... ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5, seed=0)
>>> model = als.fit(df)
>>> model.rank
10
>>> model.userFactors.orderBy("id").collect()
[Row(id=0, features=[...]), Row(id=1, ...), Row(id=2, ...)]
>>> test = spark.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", "item"])
>>> predictions = sorted(model.transform(test).collect(), key=lambda r: r[0])
>>> predictions[0]
Row(user=0, item=2, prediction=-0.13807615637779236)
>>> predictions[1]
Row(user=1, item=0, prediction=2.6258413791656494)
>>> predictions[2]
Row(user=2, item=0, prediction=-1.5018409490585327)
>>> user_recs = model.recommendForAllUsers(3)
>>> user_recs.where(user_recs.user == 0) .select("recommendations.item", "recommendations.rating").collect()
[Row(item=[0, 1, 2], rating=[3.910..., 1.992..., -0.138...])]
>>> item_recs = model.recommendForAllItems(3)
>>> item_recs.where(item_recs.item == 2) .select("recommendations.user", "recommendations.rating").collect()
[Row(user=[2, 1, 0], rating=[4.901..., 3.981..., -0.138...])]
>>> als_path = temp_path + "/als"
>>> als.save(als_path)
>>> als2 = ALS.load(als_path)
>>> als.getMaxIter()
5
>>> model_path = temp_path + "/als_model"
>>> model.save(model_path)
>>> model2 = ALSModel.load(model_path)
>>> model.rank == model2.rank
True
>>> sorted(model.userFactors.collect()) == sorted(model2.userFactors.collect())
True
>>> sorted(model.itemFactors.collect()) == sorted(model2.itemFactors.collect())
True
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
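Parameters: | dataset – input dataset, which is an instance of pyspark.sql.DataFrame; params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models. |
---|---|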
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of checkpointInterval or its default value.
Gets the value of coldStartStrategy or its default value.
New in version 2.2.0.
Gets the value of finalStorageLevel or its default value.
New in version 2.0.0.
Gets the value of implicitPrefs or its default value.
New in version 1.4.0.
Gets the value of intermediateStorageLevel or its default value.
New in version 2.0.0.
Gets the value of maxIter or its default value.
Gets the value of numItemBlocks or its default value.
New in version 1.4.0.
Gets the value of numUserBlocks or its default value.
New in version 1.4.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of regParam or its default value.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of checkpointInterval.
Sets the value of coldStartStrategy.
New in version 2.2.0.
Sets the value of finalStorageLevel.
New in version 2.0.0.
Sets the value of implicitPrefs.
New in version 1.4.0.
Sets the value of intermediateStorageLevel.
New in version 2.0.0.
Sets the value of nonnegative.
New in version 1.4.0.
Sets both numUserBlocks and numItemBlocks to the specified value.
New in version 1.4.0.
Sets the value of numItemBlocks.
New in version 1.4.0.
Sets the value of numUserBlocks.
New in version 1.4.0.
Sets params for ALS.
New in version 1.4.0.
Sets the value of predictionCol.
Returns an MLWriter instance for this ML instance.
Model fitted by ALS.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
a DataFrame that stores item factors in two columns: id and features
New in version 1.4.0.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Returns top numUsers users recommended for each item, for all items.
Parameters: | numUsers – max number of recommendations for each item |
---|---|
Returns: | a DataFrame of (itemCol, recommendations), where recommendations are stored as an array of (userCol, rating) Rows. |
New in version 2.2.0.
Returns top numItems items recommended for each user, for all users.
Parameters: | numItems – max number of recommendations for each user |
---|---|
Returns: | a DataFrame of (userCol, recommendations), where recommendations are stored as an array of (itemCol, rating) Rows. |
New in version 2.2.0.
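Conceptually, the top items for a user can be obtained by scoring every item with the dot product of the user and item factor vectors and keeping the highest scores. The following is a brute-force sketch in plain Python under that assumption. The function name and the dictionary layout are hypothetical, and the library's distributed computation will differ.

```python
import heapq


def recommend_for_all_users(user_factors, item_factors, num_items):
    """Return {user_id: [(item_id, score), ...]} with the top num_items per user."""
    recommendations = {}
    for user_id, u in user_factors.items():
        # Score every item by the dot product of the two factor vectors.
        scored = (
            (item_id, sum(a * b for a, b in zip(u, v)))
            for item_id, v in item_factors.items()
        )
        recommendations[user_id] = heapq.nlargest(
            num_items, scored, key=lambda pair: pair[1]
        )
    return recommendations


# Hypothetical two-dimensional factors keyed by id.
user_factors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
item_factors = {10: [0.9, 0.1], 11: [0.2, 0.8], 12: [0.5, 0.5]}

recs = recommend_for_all_users(user_factors, item_factors, num_items=2)
for user_id, items in recs.items():
    print(user_id, items)
```

Swapping the roles of the two factor dictionaries gives the corresponding sketch for recommending users to each item.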
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
a DataFrame that stores user factors in two columns: id and features
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Accelerated Failure Time (AFT) Model Survival Regression
Fit a parametric AFT survival regression model based on the Weibull distribution of the survival time.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0), 1.0),
... (0.0, Vectors.sparse(1, [], []), 0.0)], ["label", "features", "censor"])
>>> aftsr = AFTSurvivalRegression()
>>> model = aftsr.fit(df)
>>> model.predict(Vectors.dense(6.3))
1.0
>>> model.predictQuantiles(Vectors.dense(6.3))
DenseVector([0.0101, 0.0513, 0.1054, 0.2877, 0.6931, 1.3863, 2.3026, 2.9957, 4.6052])
>>> model.transform(df).show()
+-----+---------+------+----------+
|label| features|censor|prediction|
+-----+---------+------+----------+
|  1.0|    [1.0]|   1.0|       1.0|
|  0.0|(1,[],[])|   0.0|       1.0|
+-----+---------+------+----------+
...
>>> aftsr_path = temp_path + "/aftsr"
>>> aftsr.save(aftsr_path)
>>> aftsr2 = AFTSurvivalRegression.load(aftsr_path)
>>> aftsr2.getMaxIter()
100
>>> model_path = temp_path + "/aftsr_model"
>>> model.save(model_path)
>>> model2 = AFTSurvivalRegressionModel.load(model_path)
>>> model.coefficients == model2.coefficients
True
>>> model.intercept == model2.intercept
True
>>> model.scale == model2.scale
True
New in version 1.6.0.
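The quantiles printed in the example above are consistent with the Weibull quantile function. The sketch below, in plain Python, assumes the predicted value is 1.0 (as `predict` returns in the example) and that the fitted scale is 1.0. The helper name `weibull_aft_quantiles` is hypothetical and not part of pyspark.

```python
import math


def weibull_aft_quantiles(predicted, scale, probabilities):
    """Quantiles of a Weibull survival time: predicted * (-ln(1 - p)) ** scale."""
    return [predicted * (-math.log(1.0 - p)) ** scale for p in probabilities]


# Default quantileProbabilities, as listed in the setParams signature.
probs = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]

quantiles = weibull_aft_quantiles(predicted=1.0, scale=1.0, probabilities=probs)
print([round(q, 4) for q in quantiles])
# [0.0101, 0.0513, 0.1054, 0.2877, 0.6931, 1.3863, 2.3026, 2.9957, 4.6052]
```

With both values equal to 1.0, each quantile reduces to -ln(1 - p), which reproduces the vector shown in the example.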
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of aggregationDepth or its default value.
Gets the value of featuresCol or its default value.
Gets the value of fitIntercept or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of quantileProbabilities or its default value.
New in version 1.6.0.
Gets the value of quantilesCol or its default value.
New in version 1.6.0.
Gets the value of tol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of aggregationDepth.
Sets the value of featuresCol.
Sets the value of fitIntercept.
setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", fitIntercept=True, maxIter=100, tol=1E-6, censorCol="censor", quantileProbabilities=[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99], quantilesCol=None, aggregationDepth=2)
New in version 1.6.0.
Sets the value of predictionCol.
Sets the value of quantileProbabilities.
New in version 1.6.0.
Sets the value of quantilesCol.
New in version 1.6.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Model fitted by AFTSurvivalRegression.
New in version 1.6.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Decision tree learning algorithm for regression. It supports both continuous and categorical features.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> dt = DecisionTreeRegressor(maxDepth=2, varianceCol="variance")
>>> model = dt.fit(df)
>>> model.depth
1
>>> model.numNodes
3
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> model.numFeatures
1
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> dtr_path = temp_path + "/dtr"
>>> dt.save(dtr_path)
>>> dt2 = DecisionTreeRegressor.load(dtr_path)
>>> dt2.getMaxDepth()
2
>>> model_path = temp_path + "/dtr_model"
>>> model.save(model_path)
>>> model2 = DecisionTreeRegressionModel.load(model_path)
>>> model.numNodes == model2.numNodes
True
>>> model.depth == model2.depth
True
>>> model.transform(test1).head().variance
0.0
New in version 1.4.0.
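To illustrate why the example above yields a tree of depth 1 with 3 nodes, the sketch below picks a single split on one feature by minimizing the weighted variance of the two children. It is a simplified illustration in plain Python: candidate thresholds are midpoints between distinct feature values, which is an assumption and not how the library bins continuous features. The function names are hypothetical.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(points):
    """points: list of (feature, label). Returns (threshold, weighted child variance)."""
    xs = sorted({x for x, _ in points})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(points)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Same toy data as the example: feature 0.0 has label 0.0, feature 1.0 has label 1.0.
points = [(0.0, 0.0), (1.0, 1.0)]
parent_variance = variance([y for _, y in points])
threshold, child_variance = best_split(points)
print(parent_variance, threshold, child_variance)
```

One split separates the two rows into pure leaves, so the variance at each leaf is 0.0, matching the `variance` column value in the example.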
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of impurity or its default value.
New in version 1.4.0.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Gets the value of varianceCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for the DecisionTreeRegressor.
New in version 1.4.0.
Sets the value of predictionCol.
Sets the value of varianceCol.
Returns an MLWriter instance for this ML instance.
Model fitted by DecisionTreeRegressor.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Return depth of the decision tree.
New in version 1.5.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Estimate of the importance of each feature.
This generalizes the idea of “Gini” importance to other losses, following the explanation of Gini importance from “Random Forests” documentation by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
Note
Feature importance for single decision trees can have high variance due to correlated predictor variables. Consider using a RandomForestRegressor to determine feature importance instead.
New in version 2.0.0.
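One way to read the description above is that each feature accumulates the gain of every split that uses it, and the totals are then normalized to sum to 1. The sketch below follows that reading in plain Python and additionally assumes that each gain is weighted by the number of training instances reaching the node. The tuple layout and the function name are hypothetical.

```python
def feature_importances(splits, num_features):
    """splits: iterable of (feature_index, gain, node_instance_count)."""
    totals = [0.0] * num_features
    for feature_index, gain, node_count in splits:
        # Assumed weighting: gain scaled by how many instances reach the node.
        totals[feature_index] += gain * node_count
    norm = sum(totals)
    return [t / norm for t in totals] if norm > 0 else totals


# Hypothetical internal nodes of one tree over three features.
splits = [(0, 0.25, 100), (1, 0.10, 60), (0, 0.05, 40)]
importances = feature_importances(splits, num_features=3)
print(importances)
```

A feature that is never used in a split, such as index 2 here, receives an importance of 0.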
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Return number of nodes of the decision tree.
New in version 1.5.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Full description of model.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Gradient-Boosted Trees (GBTs) learning algorithm for regression. It supports both continuous and categorical features.
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> gbt = GBTRegressor(maxIter=5, maxDepth=2, seed=42)
>>> print(gbt.getImpurity())
variance
>>> model = gbt.fit(df)
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> model.numFeatures
1
>>> allclose(model.treeWeights, [1.0, 0.1, 0.1, 0.1, 0.1])
True
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> gbtr_path = temp_path + "/gbtr"
>>> gbt.save(gbtr_path)
>>> gbt2 = GBTRegressor.load(gbtr_path)
>>> gbt2.getMaxDepth()
2
>>> model_path = temp_path + "/gbtr_model"
>>> model.save(model_path)
>>> model2 = GBTRegressionModel.load(model_path)
>>> model.featureImportances == model2.featureImportances
True
>>> model.treeWeights == model2.treeWeights
True
>>> model.trees
[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...]
New in version 1.4.0.
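The `treeWeights` in the example above hint at how the ensemble combines its trees: for regression, the prediction is a weighted sum of the individual tree predictions. The plain-Python sketch below shows this under that assumption. The two functions standing in for fitted trees are hypothetical.

```python
def ensemble_predict(trees, tree_weights, features):
    """Weighted sum of the individual tree predictions."""
    return sum(w * tree(features) for tree, w in zip(trees, tree_weights))


def first_tree(features):
    # Hypothetical stump: splits the single feature at 0.5.
    return 1.0 if features[0] > 0.5 else 0.0


def residual_tree(features):
    # Hypothetical later tree: nothing left to correct on this toy data.
    return 0.0


trees = [first_tree] + [residual_tree] * 4
tree_weights = [1.0, 0.1, 0.1, 0.1, 0.1]

print(ensemble_predict(trees, tree_weights, [-1.0]))
print(ensemble_predict(trees, tree_weights, [1.0]))
```

The first tree carries weight 1.0 and each later tree is scaled down by 0.1, so later trees act as small corrections to the first prediction.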
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featuresCol or its default value.
Gets the value of impurity or its default value.
New in version 1.4.0.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxIter or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Gets the value of stepSize or its default value.
Gets the value of subsamplingRate or its default value.
New in version 1.4.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for Gradient Boosted Tree Regression.
New in version 1.4.0.
Sets the value of predictionCol.
Sets the value of subsamplingRate.
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Model fitted by GBTRegressor.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Estimate of the importance of each feature.
Each feature’s importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method is suggested by Hastie et al. (Hastie, Tibshirani, Friedman. “The Elements of Statistical Learning, 2nd Edition.” 2001.) and follows the implementation from scikit-learn.
New in version 2.0.0.
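The averaging and normalization described above can be sketched in plain Python as follows. The per-tree importance vectors are made-up inputs and the function name is hypothetical. The sketch only shows the averaging and normalization steps, not how each tree's vector is obtained.

```python
def ensemble_importances(per_tree):
    """Average per-tree importance vectors, then normalize the result to sum to 1."""
    num_features = len(per_tree[0])
    averaged = [
        sum(tree[j] for tree in per_tree) / len(per_tree)
        for j in range(num_features)
    ]
    total = sum(averaged)
    return [a / total for a in averaged] if total > 0 else averaged


# Hypothetical importance vectors for three trees over three features.
per_tree = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]
combined = ensemble_importances(per_tree)
print(combined)
```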
Number of trees in ensemble.
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Full description of model.
New in version 2.0.0.
Total number of nodes, summed over all trees in the ensemble.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns the weights for each tree.
New in version 1.5.0.
Trees in this ensemble. Warning: These have null parent Estimators.
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Generalized Linear Regression.
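Fit a Generalized Linear Model specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports “gaussian”, “binomial”, “poisson”, “gamma” and “tweedie” as family. Valid link functions for each family are listed below. The first link function of each family is the default one.
“gaussian” -> “identity”, “log”, “inverse”
“binomial” -> “logit”, “probit”, “cloglog”
“poisson” -> “log”, “identity”, “sqrt”
“gamma” -> “inverse”, “identity”, “log”
“tweedie” -> power link function specified through “linkPower”. The default link power in the tweedie family is 1 - variancePower.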
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(0.0, 0.0)),
... (1.0, Vectors.dense(1.0, 2.0)),
... (2.0, Vectors.dense(0.0, 0.0)),
... (2.0, Vectors.dense(1.0, 1.0)),], ["label", "features"])
>>> glr = GeneralizedLinearRegression(family="gaussian", link="identity", linkPredictionCol="p")
>>> model = glr.fit(df)
>>> transformed = model.transform(df)
>>> abs(transformed.head().prediction - 1.5) < 0.001
True
>>> abs(transformed.head().p - 1.5) < 0.001
True
>>> model.coefficients
DenseVector([1.5..., -1.0...])
>>> model.numFeatures
2
>>> abs(model.intercept - 1.5) < 0.001
True
>>> glr_path = temp_path + "/glr"
>>> glr.save(glr_path)
>>> glr2 = GeneralizedLinearRegression.load(glr_path)
>>> glr.getFamily() == glr2.getFamily()
True
>>> model_path = temp_path + "/glr_model"
>>> model.save(model_path)
>>> model2 = GeneralizedLinearRegressionModel.load(model_path)
>>> model.intercept == model2.intercept
True
>>> model.coefficients[0] == model2.coefficients[0]
True
New in version 2.0.0.
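In the example above, `prediction` and the link prediction column `p` agree because the identity link is used. The sketch below, in plain Python, shows the general relationship between the linear predictor and the prediction for a few common links. The `LINKS` table and the `predict` helper are hypothetical, and the coefficient values are rounded approximations of those printed in the example.

```python
import math

# Each entry maps a link name to (link, inverse link).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}


def predict(coefficients, intercept, features, link="identity"):
    """Return (linear predictor eta, prediction mu = inverse_link(eta))."""
    eta = intercept + sum(c * x for c, x in zip(coefficients, features))
    _, inverse_link = LINKS[link]
    return eta, inverse_link(eta)


# First row of the example data: features (0.0, 0.0).
eta, mu = predict([1.5, -1.0], 1.5, [0.0, 0.0], link="identity")
print(eta, mu)

# With a log link the same linear predictor maps to exp(eta).
eta_log, mu_log = predict([1.5, -1.0], 1.5, [0.0, 0.0], link="log")
print(eta_log, mu_log)
```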
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of fitIntercept or its default value.
Gets the value of labelCol or its default value.
Gets the value of linkPredictionCol or its default value.
New in version 2.0.0.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of regParam or its default value.
Gets the value of solver or its default value.
Gets the value of tol or its default value.
Gets the value of variancePower or its default value.
New in version 2.2.0.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of featuresCol.
Sets the value of fitIntercept.
Sets the value of linkPredictionCol.
New in version 2.0.0.
Sets params for generalized linear regression.
New in version 2.0.0.
Sets the value of predictionCol.
Sets the value of variancePower.
New in version 2.2.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Model fitted by GeneralizedLinearRegression.
New in version 2.0.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Evaluates the model on a test dataset.
Parameters: | dataset – Test dataset to evaluate model on, where dataset is an instance of pyspark.sql.DataFrame |
---|
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Indicates whether a training summary exists for this model instance.
New in version 2.0.0.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Gets summary (e.g. residuals, deviance, pValues) of model on training set. An exception is thrown if trainingSummary is None.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Generalized linear regression results evaluated on a dataset.
New in version 2.0.0.
The dispersion of the fitted model. It is taken as 1.0 for the “binomial” and “poisson” families, and otherwise estimated by the residual Pearson’s Chi-Squared statistic (which is defined as sum of the squares of the Pearson residuals) divided by the residual degrees of freedom.
New in version 2.0.0.
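The dispersion estimate described above can be written out in a few lines of plain Python. This sketch ignores instance weights and uses textbook variance functions. The function name, its arguments, and the sample numbers are illustrative assumptions.

```python
# Variance functions V(mu) for a few families (textbook definitions).
VARIANCE_FUNCTIONS = {
    "gaussian": lambda mu: 1.0,
    "gamma": lambda mu: mu * mu,
    "poisson": lambda mu: mu,
    "binomial": lambda mu: mu * (1.0 - mu),
}


def dispersion(labels, fitted, family, num_params):
    """Pearson chi-squared statistic divided by the residual degrees of freedom."""
    if family in ("binomial", "poisson"):
        return 1.0  # taken as 1.0 for these families
    variance_fn = VARIANCE_FUNCTIONS[family]
    # Sum of squared Pearson residuals: (y - mu)^2 / V(mu).
    pearson_chi2 = sum(
        (y - mu) ** 2 / variance_fn(mu) for y, mu in zip(labels, fitted)
    )
    residual_dof = len(labels) - num_params
    return pearson_chi2 / residual_dof


labels = [1.0, 2.0, 3.0, 5.0]
fitted = [1.5, 1.5, 3.5, 4.5]

gaussian_dispersion = dispersion(labels, fitted, "gaussian", num_params=2)
poisson_dispersion = dispersion(labels, fitted, "poisson", num_params=2)
print(gaussian_dispersion, poisson_dispersion)
```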
Field in predictions which gives the predicted value of each instance. This is set to a new column name if the original model’s predictionCol is not set.
New in version 2.0.0.
Note
Experimental
Generalized linear regression training results.
New in version 2.0.0.
Akaike’s “An Information Criterion” (AIC) for the fitted model.
New in version 2.0.0.
Standard error of estimated coefficients and intercept.
If GeneralizedLinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
Degrees of freedom.
New in version 2.0.0.
The deviance for the fitted model.
New in version 2.0.0.
The dispersion of the fitted model. It is taken as 1.0 for the “binomial” and “poisson” families, and otherwise estimated by the residual Pearson’s Chi-Squared statistic (which is defined as sum of the squares of the Pearson residuals) divided by the residual degrees of freedom.
New in version 2.0.0.
The deviance for the null model.
New in version 2.0.0.
Number of instances in DataFrame predictions.
New in version 2.2.0.
Two-sided p-value of estimated coefficients and intercept.
If GeneralizedLinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
Field in predictions which gives the predicted value of each instance. This is set to a new column name if the original model’s predictionCol is not set.
New in version 2.0.0.
Predictions output by the model’s transform method.
New in version 2.0.0.
The numeric rank of the fitted linear model.
New in version 2.0.0.
The residual degrees of freedom.
New in version 2.0.0.
The residual degrees of freedom for the null model.
New in version 2.0.0.
Get the residuals of the fitted model by type.
Parameters: | residualsType – The type of residuals which should be returned. Supported options: deviance (default), pearson, working, and response. |
---|
New in version 2.0.0.
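To make the four residual types concrete, the sketch below computes them for a single observation under a Poisson family with a log link, using standard textbook definitions and ignoring instance weights. Whether these match the library's computation in every detail is an assumption, and the function name is hypothetical.

```python
import math


def poisson_residual(y, mu, residuals_type="deviance"):
    """Residual of one observation y with fitted mean mu (Poisson family, log link)."""
    if residuals_type == "response":
        return y - mu
    if residuals_type == "pearson":
        # Poisson variance function is V(mu) = mu.
        return (y - mu) / math.sqrt(mu)
    if residuals_type == "working":
        # Log link: d(eta)/d(mu) = 1 / mu.
        return (y - mu) / mu
    if residuals_type == "deviance":
        term = y * math.log(y / mu) if y > 0 else 0.0
        unit_deviance = 2.0 * (term - (y - mu))
        return math.copysign(math.sqrt(max(unit_deviance, 0.0)), y - mu)
    raise ValueError("unsupported residuals type: %s" % residuals_type)


for kind in ("deviance", "pearson", "working", "response"):
    print(kind, poisson_residual(6.0, 4.0, kind))
```

All four types are zero when the fitted mean equals the observation, and they share the sign of `y - mu` otherwise.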
T-statistic of estimated coefficients and intercept.
If GeneralizedLinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
Currently implemented using the parallelized pool adjacent violators algorithm. Only the univariate (single-feature) algorithm is supported.
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> ir = IsotonicRegression()
>>> model = ir.fit(df)
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> model.boundaries
DenseVector([0.0, 1.0])
>>> ir_path = temp_path + "/ir"
>>> ir.save(ir_path)
>>> ir2 = IsotonicRegression.load(ir_path)
>>> ir2.getIsotonic()
True
>>> model_path = temp_path + "/ir_model"
>>> model.save(model_path)
>>> model2 = IsotonicRegressionModel.load(model_path)
>>> model.boundaries == model2.boundaries
True
>>> model.predictions == model2.predictions
True
New in version 1.6.0.
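The pool adjacent violators idea can be sketched in a few lines of plain Python. This is a sequential, single-machine version written for illustration only; it is not the parallelized implementation that Spark uses, and `pool_adjacent_violators` is a hypothetical name. The labels are assumed to be ordered by increasing feature value.

```python
def pool_adjacent_violators(values, weights=None):
    """Return the non-decreasing least-squares fit to `values`."""
    if weights is None:
        weights = [1.0] * len(values)
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for value, weight in zip(values, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, w2, n2 = blocks.pop()
            mean1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(mean1 * w1 + mean2 * w2) / total, total, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


fitted = pool_adjacent_violators([1.0, 3.0, 2.0, 4.0, 3.5])
```

The two out-of-order pairs (3.0, 2.0) and (4.0, 3.5) are each pooled into their mean, which yields a monotone sequence.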
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
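The merge ordering described above (default param values < user-supplied values < extra) behaves like successive dictionary updates. The sketch below uses plain string keys as stand-ins for Param objects, and `merge_param_maps` is a hypothetical helper, not the library's implementation.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param dictionaries; later sources win on conflict."""
    merged = dict(defaults)
    merged.update(user_supplied)
    merged.update(extra or {})
    return merged


defaults = {"maxIter": 100, "regParam": 0.0}
user_supplied = {"maxIter": 10}
extra = {"regParam": 0.1, "maxIter": 5}
merged = merge_param_maps(defaults, user_supplied, extra)
```

Because `extra` is applied last, its `maxIter` value overrides both the default and the user-supplied value.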
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of featuresCol or its default value.
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of featureIndex.
Sets the value of featuresCol.
setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", weightCol=None, isotonic=True, featureIndex=0): Sets the params for IsotonicRegression.
Sets the value of predictionCol.
Returns an MLWriter instance for this ML instance.
Model fitted by IsotonicRegression.
New in version 1.6.0.
Boundaries in increasing order for which predictions are known.
New in version 1.6.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Predictions associated with the boundaries at the same index, monotone because of isotonic regression.
New in version 1.6.0.
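One way to read the boundaries and predictions arrays together is as the knots of a piecewise-linear function. The sketch below assumes that inputs outside the boundary range take the first or last prediction and that inputs between two boundaries are linearly interpolated; this rule is an assumption of the sketch and is not stated in this reference. `predict_isotonic` is a hypothetical helper.

```python
from bisect import bisect_left


def predict_isotonic(boundaries, predictions, x):
    """Piecewise-linear lookup over (boundaries, predictions)."""
    # Clamp to the first or last prediction outside the known range.
    if x <= boundaries[0]:
        return predictions[0]
    if x >= boundaries[-1]:
        return predictions[-1]
    hi = bisect_left(boundaries, x)
    if boundaries[hi] == x:
        return predictions[hi]
    lo = hi - 1
    # Linear interpolation between the two neighbouring boundaries.
    fraction = (x - boundaries[lo]) / (boundaries[hi] - boundaries[lo])
    return predictions[lo] + fraction * (predictions[hi] - predictions[lo])
```

With boundaries `[0.0, 1.0]` and predictions `[0.0, 1.0]`, an input of -1.0 falls below the first boundary and maps to 0.0.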
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Linear regression.
The learning objective is to minimize the squared error, with regularization. The specific squared error loss function used is: $L = \frac{1}{2n} \lVert A \, \mathrm{coefficients} - y \rVert^2$
This supports multiple types of regularization:
- none (a.k.a. ordinary least squares)
- L2 (ridge regression)
- L1 (Lasso)
- L2 + L1 (elastic net)
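The loss and the regularization options listed above can be written out in plain Python as follows. The elastic net penalty form, reg * (alpha * L1 + (1 - alpha) / 2 * L2^2), is the common textbook convention and is an assumption of this sketch; the function names are hypothetical, and the intercept is omitted for brevity.

```python
def squared_error_loss(rows, labels, coefficients):
    """L = 1/(2n) * ||A w - y||^2 for a list of feature rows A."""
    n = len(rows)
    total = 0.0
    for row, label in zip(rows, labels):
        prediction = sum(a * w for a, w in zip(row, coefficients))
        total += (prediction - label) ** 2
    return total / (2.0 * n)


def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Assumed penalty: reg * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)


rows = [[1.0], [0.0]]
labels = [1.0, 0.0]
loss_at_optimum = squared_error_loss(rows, labels, [1.0])
loss_at_zero = squared_error_loss(rows, labels, [0.0])
```

Setting `elastic_net_param` to 0.0 gives a pure L2 (ridge) penalty, 1.0 gives a pure L1 (Lasso) penalty, and `reg_param` of 0.0 corresponds to ordinary least squares.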
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (1.0, 2.0, Vectors.dense(1.0)),
... (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal", weightCol="weight")
>>> model = lr.fit(df)
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> abs(model.transform(test0).head().prediction - (-1.0)) < 0.001
True
>>> abs(model.coefficients[0] - 1.0) < 0.001
True
>>> abs(model.intercept - 0.0) < 0.001
True
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> abs(model.transform(test1).head().prediction - 1.0) < 0.001
True
>>> lr.setParams("vector")
Traceback (most recent call last):
...
TypeError: Method setParams forces keyword arguments.
>>> lr_path = temp_path + "/lr"
>>> lr.save(lr_path)
>>> lr2 = LinearRegression.load(lr_path)
>>> lr2.getMaxIter()
5
>>> model_path = temp_path + "/lr_model"
>>> model.save(model_path)
>>> model2 = LinearRegressionModel.load(model_path)
>>> model.coefficients[0] == model2.coefficients[0]
True
>>> model.intercept == model2.intercept
True
>>> model.numFeatures
1
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of aggregationDepth or its default value.
Gets the value of elasticNetParam or its default value.
Gets the value of featuresCol or its default value.
Gets the value of fitIntercept or its default value.
Gets the value of labelCol or its default value.
Gets the value of maxIter or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of regParam or its default value.
Gets the value of solver or its default value.
Gets the value of standardization or its default value.
Gets the value of tol or its default value.
Gets the value of weightCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of aggregationDepth.
Sets the value of elasticNetParam.
Sets the value of featuresCol.
Sets the value of fitIntercept.
Sets params for linear regression.
New in version 1.4.0.
Sets the value of predictionCol.
Sets the value of standardization.
Returns an MLWriter instance for this ML instance.
Model fitted by LinearRegression.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Evaluates the model on a test dataset.
Parameters: | dataset – Test dataset to evaluate model on, where dataset is an instance of pyspark.sql.DataFrame |
---|
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Indicates whether a training summary exists for this model instance.
New in version 2.0.0.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Gets the summary (e.g. residuals, MSE, R-squared) of the model on the training set. An exception is thrown if trainingSummary is None.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Linear regression results evaluated on a dataset.
New in version 2.0.0.
Standard error of estimated coefficients and intercept. This value is only available when using the “normal” solver.
If LinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
The weighted residuals, the usual residuals rescaled by the square root of the instance weights.
New in version 2.0.0.
Returns the explained variance regression score. $\mathrm{explainedVariance} = 1 - \frac{\mathrm{variance}(y - \hat{y})}{\mathrm{variance}(y)}$
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
Returns the mean absolute error, which is a risk function corresponding to the expected value of the absolute error loss or l1-norm loss.
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Returns the mean squared error, which is a risk function corresponding to the expected value of the squared error loss or quadratic loss.
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Two-sided p-value of estimated coefficients and intercept. This value is only available when using the “normal” solver.
If LinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
Field in “predictions” which gives the predicted value of the label at each instance.
New in version 2.0.0.
Returns $R^2$, the coefficient of determination.
See also
Wikipedia coefficient of determination <http://en.wikipedia.org/wiki/Coefficient_of_determination>
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Returns the root mean squared error, which is defined as the square root of the mean squared error.
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
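The error metrics described in this summary (explained variance, mean absolute error, mean squared error, R^2 and root mean squared error) can be reproduced on a small list of labels and predictions. The sketch below ignores instance weights, as the notes above say the summary does, and it uses population variances (dividing by n), which is an assumption of the sketch. `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(labels, predictions):
    """Unweighted regression metrics following the definitions above."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mean_label = sum(labels) / n
    mean_error = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    # Population variances (divide by n); an assumption of this sketch.
    var_label = sum((y - mean_label) ** 2 for y in labels) / n
    var_error = sum((e - mean_error) ** 2 for e in errors) / n
    return {
        "meanAbsoluteError": sum(abs(e) for e in errors) / n,
        "meanSquaredError": mse,
        "rootMeanSquaredError": math.sqrt(mse),
        "explainedVariance": 1.0 - var_error / var_label,
        "r2": 1.0 - mse / var_label,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

When the errors average to zero, as in this example, the explained variance and R^2 coincide.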
T-statistic of estimated coefficients and intercept. This value is only available when using the “normal” solver.
If LinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
Note
Experimental
Linear regression training results. Currently, the training summary ignores the training weights except for the objective trace.
New in version 2.0.0.
Standard error of estimated coefficients and intercept. This value is only available when using the “normal” solver.
If LinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
Degrees of freedom.
New in version 2.2.0.
The weighted residuals, the usual residuals rescaled by the square root of the instance weights.
New in version 2.0.0.
Returns the explained variance regression score. $\mathrm{explainedVariance} = 1 - \frac{\mathrm{variance}(y - \hat{y})}{\mathrm{variance}(y)}$
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Field in “predictions” which gives the features of each instance as a vector.
New in version 2.0.0.
Field in “predictions” which gives the true label of each instance.
New in version 2.0.0.
Returns the mean absolute error, which is a risk function corresponding to the expected value of the absolute error loss or l1-norm loss.
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Returns the mean squared error, which is a risk function corresponding to the expected value of the squared error loss or quadratic loss.
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Number of instances in DataFrame predictions.
New in version 2.0.0.
Objective function (scaled loss + regularization) at each iteration. This value is only available when using the “l-bfgs” solver.
New in version 2.0.0.
Two-sided p-value of estimated coefficients and intercept. This value is only available when using the “normal” solver.
If LinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
Field in “predictions” which gives the predicted value of the label at each instance.
New in version 2.0.0.
DataFrame output by the model’s transform method.
New in version 2.0.0.
Returns $R^2$, the coefficient of determination.
See also
Wikipedia coefficient of determination <http://en.wikipedia.org/wiki/Coefficient_of_determination>
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
Residuals (label - predicted value)
New in version 2.0.0.
Returns the root mean squared error, which is defined as the square root of the mean squared error.
Note
This ignores instance weights (setting all to 1.0) from LinearRegression.weightCol. This will change in later Spark versions.
New in version 2.0.0.
T-statistic of estimated coefficients and intercept. This value is only available when using the “normal” solver.
If LinearRegression.fitIntercept is set to True, then the last element returned corresponds to the intercept.
New in version 2.0.0.
Random Forest learning algorithm for regression. It supports both continuous and categorical features.
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
>>> model = rf.fit(df)
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> allclose(model.treeWeights, [1.0, 1.0])
True
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.transform(test0).head().prediction
0.0
>>> model.numFeatures
1
>>> model.trees
[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...]
>>> model.getNumTrees
2
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
0.5
>>> rfr_path = temp_path + "/rfr"
>>> rf.save(rfr_path)
>>> rf2 = RandomForestRegressor.load(rfr_path)
>>> rf2.getNumTrees()
2
>>> model_path = temp_path + "/rfr_model"
>>> model.save(model_path)
>>> model2 = RandomForestRegressionModel.load(model_path)
>>> model.featureImportances == model2.featureImportances
True
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of cacheNodeIds or its default value.
Gets the value of checkpointInterval or its default value.
Gets the value of featureSubsetStrategy or its default value.
New in version 1.4.0.
Gets the value of featuresCol or its default value.
Gets the value of impurity or its default value.
New in version 1.4.0.
Gets the value of labelCol or its default value.
Gets the value of maxBins or its default value.
Gets the value of maxDepth or its default value.
Gets the value of maxMemoryInMB or its default value.
Gets the value of minInfoGain or its default value.
Gets the value of minInstancesPerNode or its default value.
Gets the value of numTrees or its default value.
New in version 1.4.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Gets the value of seed or its default value.
Gets the value of subsamplingRate or its default value.
New in version 1.4.0.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of cacheNodeIds.
Sets the value of checkpointInterval.
Sets the value of featureSubsetStrategy.
New in version 1.4.0.
Sets the value of featuresCol.
Sets the value of maxMemoryInMB.
Sets the value of minInfoGain.
Sets the value of minInstancesPerNode.
Sets params for random forest regression.
New in version 1.4.0.
Sets the value of predictionCol.
Sets the value of subsamplingRate.
New in version 1.4.0.
Returns an MLWriter instance for this ML instance.
Model fitted by RandomForestRegressor.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Estimate of the importance of each feature.
Each feature’s importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method is suggested by Hastie et al. (Hastie, Tibshirani, Friedman. “The Elements of Statistical Learning, 2nd Edition.” 2001.) and follows the implementation from scikit-learn.
New in version 2.0.0.
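The averaging and normalization described above can be sketched as follows. The per-tree importance vectors are made-up inputs, `ensemble_feature_importances` is a hypothetical helper, and any per-tree preprocessing the library may apply is not modelled here.

```python
def ensemble_feature_importances(per_tree_importances):
    """Average per-tree importances, then normalize the result to sum to 1."""
    num_trees = len(per_tree_importances)
    num_features = len(per_tree_importances[0])
    averaged = [
        sum(tree[j] for tree in per_tree_importances) / num_trees
        for j in range(num_features)
    ]
    total = sum(averaged)
    # If no feature was ever used for a split, leave the zeros untouched.
    return [v / total for v in averaged] if total > 0 else averaged


importances = ensemble_feature_importances([[2.0, 0.0, 2.0], [1.0, 1.0, 0.0]])
```

The averaged vector here is `[1.5, 0.5, 1.0]`, so the first feature ends up with half of the total importance.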
Number of trees in ensemble.
New in version 2.0.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns the number of features the model was trained on. If unknown, returns -1.
New in version 2.1.0.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Full description of model.
New in version 2.0.0.
Total number of nodes, summed over all trees in the ensemble.
New in version 2.0.0.
Transforms the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns the weights for each tree.
New in version 1.5.0.
Trees in this ensemble. Warning: These have null parent Estimators.
New in version 2.0.0.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Conduct Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.
The null hypothesis is that the occurrence of the outcomes is statistically independent.
Parameters: |
|
---|---|
Returns: | DataFrame containing the test result for every feature against the label. This DataFrame will contain a single Row with the following fields: pValues (Vector), degreesOfFreedom (Array[Int]), and statistics (Vector). Each of these fields has one value per feature. |
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.stat import ChiSquareTest
>>> dataset = [[0, Vectors.dense([0, 0, 1])],
... [0, Vectors.dense([1, 0, 1])],
... [1, Vectors.dense([2, 1, 1])],
... [1, Vectors.dense([3, 1, 1])]]
>>> dataset = spark.createDataFrame(dataset, ["label", "features"])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label')
>>> chiSqResult.select("degreesOfFreedom").collect()[0]
Row(degreesOfFreedom=[3, 1, 0])
New in version 2.2.0.
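To show where the statistic and the degrees of freedom come from, the sketch below builds the contingency counts for one feature against the label in plain Python. It computes only the statistic and the degrees of freedom (the p-value needs a chi-squared distribution function and is left out), and `chi_square_independence` is a hypothetical helper rather than the library's code.

```python
from collections import Counter


def chi_square_independence(feature_values, labels):
    """Return (statistic, degrees of freedom) for one categorical feature."""
    n = len(labels)
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    observed = Counter(zip(feature_values, labels))
    statistic = 0.0
    for feat, feat_count in feature_counts.items():
        for lab, lab_count in label_counts.items():
            # Expected cell count under independence of feature and label.
            expected = feat_count * lab_count / n
            statistic += (observed[(feat, lab)] - expected) ** 2 / expected
    dof = (len(feature_counts) - 1) * (len(label_counts) - 1)
    return statistic, dof


labels = [0, 0, 1, 1]
features = [[0, 0, 1], [1, 0, 1], [2, 1, 1], [3, 1, 1]]
results = [chi_square_independence([row[j] for row in features], labels)
           for j in range(3)]
```

On the four-row dataset from the example above, the degrees of freedom come out as 3, 1 and 0, because the features take 4, 2 and 1 distinct values against 2 label values.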
Note
Experimental
Compute the correlation matrix for the input dataset of Vectors using the specified method. Methods currently supported: pearson (default), spearman.
Note
For Spearman, a rank correlation, we need to create an RDD[Double] for each column and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], which is fairly costly. Cache the input Dataset before calling corr with method = ‘spearman’ to avoid recomputing the common lineage.
Parameters: |
|
---|---|
Returns: | A dataframe that contains the correlation matrix of the column of vectors. This dataframe contains a single row and a single column of name ‘$METHODNAME($COLUMN)’. |
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.stat import Correlation
>>> dataset = [[Vectors.dense([1, 0, 0, -2])],
... [Vectors.dense([4, 5, 0, 3])],
... [Vectors.dense([6, 7, 0, 8])],
... [Vectors.dense([9, 0, 0, 1])]]
>>> dataset = spark.createDataFrame(dataset, ['features'])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1. , 0.0556..., NaN, 0.4004...],
[ 0.0556..., 1. , NaN, 0.9135...],
[ NaN, NaN, 1. , NaN],
[ 0.4004..., 0.9135..., NaN, 1. ]])
>>> spearmanCorr = Correlation.corr(dataset, 'features', method='spearman').collect()[0][0]
>>> print(str(spearmanCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1. , 0.1054..., NaN, 0.4 ],
[ 0.1054..., 1. , NaN, 0.9486... ],
[ NaN, NaN, 1. , NaN],
[ 0.4 , 0.9486... , NaN, 1. ]])
New in version 2.2.0.
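The two supported methods differ only in what is correlated: Pearson works on the raw values, while Spearman works on their ranks. The plain-Python sketch below illustrates this on single pairs of columns; it is not the distributed implementation, and the helper names are hypothetical. Tied values share the average of their rank positions, which is an assumption of the sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


col0 = [1, 4, 6, 9]
col1 = [0, 5, 7, 0]
col2 = [0, 0, 0, 0]
col3 = [-2, 3, 8, 1]
```

Applied to the columns of the example dataset, these helpers agree with the matrices printed above, including NaN for the constant third column.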
Builder for a param grid used in grid search-based model selection.
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> output = ParamGridBuilder() \
... .baseOn({lr.labelCol: 'l'}) \
... .baseOn([lr.predictionCol, 'p']) \
... .addGrid(lr.regParam, [1.0, 2.0]) \
... .addGrid(lr.maxIter, [1, 5]) \
... .build()
>>> expected = [
... {lr.regParam: 1.0, lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'},
... {lr.regParam: 2.0, lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'},
... {lr.regParam: 1.0, lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'},
... {lr.regParam: 2.0, lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}]
>>> len(output) == len(expected)
True
>>> all([m in expected for m in output])
True
New in version 1.4.0.
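Conceptually, the builder produces the Cartesian product of the value lists and adds the fixed base values to every combination. The sketch below shows that idea with plain dictionaries and string keys standing in for Param objects; `build_param_grid` is a hypothetical helper.

```python
from itertools import product


def build_param_grid(grid, base=None):
    """Expand {param: [values]} into a list of param maps (plain dicts)."""
    base = base or {}
    names = list(grid)
    maps = []
    for combination in product(*(grid[name] for name in names)):
        param_map = dict(base)
        param_map.update(zip(names, combination))
        maps.append(param_map)
    return maps


param_maps = build_param_grid(
    {"regParam": [1.0, 2.0], "maxIter": [1, 5]},
    base={"labelCol": "l", "predictionCol": "p"})
```

Two values for each of two params give four param maps, each carrying the two fixed base entries.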
Sets the given parameters in this grid to fixed values.
New in version 1.4.0.
K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold is used as the test set exactly once.
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator
>>> from pyspark.ml.linalg import Vectors
>>> dataset = spark.createDataFrame(
... [(Vectors.dense([0.0]), 0.0),
... (Vectors.dense([0.4]), 1.0),
... (Vectors.dense([0.5]), 0.0),
... (Vectors.dense([0.6]), 1.0),
... (Vectors.dense([1.0]), 1.0)] * 10,
... ["features", "label"])
>>> lr = LogisticRegression()
>>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>>> evaluator = BinaryClassificationEvaluator()
>>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
>>> cvModel = cv.fit(dataset)
>>> cvModel.avgMetrics[0]
0.5
>>> evaluator.evaluate(cvModel.transform(dataset))
0.8333...
New in version 1.4.0.
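The fold construction described above can be sketched on an in-memory list. This is a simplified stand-in for the distributed splitting that the library performs, and `k_fold_splits`, `average_metric`, `fit` and `evaluate` are hypothetical names used only for illustration.

```python
import random


def k_fold_splits(rows, num_folds=3, seed=42):
    """Yield (training, test) pairs; every row lands in exactly one test set."""
    rng = random.Random(seed)
    positions = list(range(len(rows)))
    rng.shuffle(positions)
    # The i-th shuffled row goes to fold i % num_folds, so folds never overlap.
    fold_of = {row_index: i % num_folds for i, row_index in enumerate(positions)}
    for fold in range(num_folds):
        training = [r for j, r in enumerate(rows) if fold_of[j] != fold]
        test = [r for j, r in enumerate(rows) if fold_of[j] == fold]
        yield training, test


def average_metric(rows, num_folds, fit, evaluate):
    """Average an evaluation metric over all folds for one param setting."""
    scores = [evaluate(fit(training), test)
              for training, test in k_fold_splits(rows, num_folds)]
    return sum(scores) / len(scores)


rows = list(range(30))
splits = list(k_fold_splits(rows, num_folds=3))
```

With 30 rows and 3 folds, each pair uses 20 rows for training and 10 for testing. Computing `average_metric` once per param map and keeping the best value mirrors the model selection step.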
Creates a copy of this instance with a randomly generated uid and some extra params. This creates a deep copy of the embedded paramMap, and copies the embedded and extra parameters over.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
New in version 1.4.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: |
|
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of estimator or its default value.
Gets the value of estimatorParamMaps or its default value.
Gets the value of evaluator or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of estimatorParamMaps.
CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data. CrossValidatorModel also tracks the metrics for each param map evaluated.
New in version 1.4.0.
Average cross-validation metrics for each paramMap in CrossValidator.estimatorParamMaps, in the corresponding order.
Best model from cross validation.
Creates a copy of this instance with a randomly generated uid and some extra params. This copies the underlying bestModel, creates a deep copy of the embedded paramMap, and copies the embedded and extra parameters over.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
New in version 1.4.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of estimator or its default value.
Gets the value of estimatorParamMaps or its default value.
Gets the value of evaluator or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of estimatorParamMaps.
Transforms the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Note
Experimental
Validation for hyper-parameter tuning. Randomly splits the input dataset into train and validation sets, and uses evaluation metric on the validation set to select the best model. Similar to CrossValidator, but only splits the set once.
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
>>> dataset = spark.createDataFrame(
... [(Vectors.dense([0.0]), 0.0),
... (Vectors.dense([0.4]), 1.0),
... (Vectors.dense([0.5]), 0.0),
... (Vectors.dense([0.6]), 1.0),
... (Vectors.dense([1.0]), 1.0)] * 10,
... ["features", "label"])
>>> lr = LogisticRegression()
>>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>>> evaluator = BinaryClassificationEvaluator()
>>> tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
>>> tvsModel = tvs.fit(dataset)
>>> evaluator.evaluate(tvsModel.transform(dataset))
0.8333...
New in version 2.0.0.
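The single random split described above can be sketched in plain Python. This is an assumption-laden illustration rather than the library's splitting logic: the helper name train_validation_split is hypothetical, and each row is assigned independently, so the split sizes only approximate the requested ratio.

```python
import random


def train_validation_split(rows, train_ratio=0.75, seed=None):
    """Randomly assign each row to the train or the validation set, once."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, validation = [], []
    for row in rows:
        target = train if rng.random() < train_ratio else validation
        target.append(row)
    return train, validation


rows = list(range(100))
train, validation = train_validation_split(rows, train_ratio=0.75, seed=7)
print(len(train), len(validation))
```

Each candidate param map would then be fitted on the train rows and scored on the validation rows, with the best-scoring model kept.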
Creates a copy of this instance with a randomly generated uid and some extra params. This creates a deep copy of the embedded paramMap, and copies the embedded and extra parameters over.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of estimator or its default value.
Gets the value of estimatorParamMaps or its default value.
Gets the value of evaluator or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of estimatorParamMaps.
setParams(self, estimator=None, estimatorParamMaps=None, evaluator=None, trainRatio=0.75, seed=None): Sets params for the train validation split.
New in version 2.0.0.
Sets the value of trainRatio.
New in version 2.0.0.
Note
Experimental
Model from train validation split.
New in version 2.0.0.
best model from train validation split
Creates a copy of this instance with a randomly generated uid and some extra params. This copies the underlying bestModel, creates a deep copy of the embedded paramMap, and copies the embedded and extra parameters over. It also creates a shallow copy of validationMetrics.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
New in version 2.0.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of estimator or its default value.
Gets the value of estimatorParamMaps or its default value.
Gets the value of evaluator or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of seed or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Sets the value of estimatorParamMaps.
Transforms the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
evaluated validation metrics
Base class for evaluators that compute metrics from predictions.
New in version 1.4.0.
Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
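The default copy behaviour described above (shallow copy via copy.copy(), then copying the embedded and extra parameters over) might look roughly like this. CopyableParams and its attribute names are hypothetical; the sketch only mirrors the documented steps.

```python
import copy


class CopyableParams:
    """Hypothetical Params-like class used to illustrate copy()."""

    def __init__(self, uid, defaults=None):
        self.uid = uid
        self._defaults = dict(defaults or {})
        self._user = {}

    def copy(self, extra=None):
        that = copy.copy(self)                 # shallow copy keeps the same uid
        that._defaults = dict(self._defaults)  # give the copy its own maps
        that._user = dict(self._user)
        that._user.update(extra or {})         # extra params are applied last
        return that


original = CopyableParams("evaluator_1", defaults={"metricName": "rmse"})
clone = original.copy({"metricName": "mae"})
print(clone.uid, clone._user)
```

Because the maps are re-created on the copy, setting a param on the clone does not leak back into the original. A subclass holding other mutable state would need to override copy() to handle it.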
Evaluates the output with optional parameters.
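Parameters: | dataset – a dataset that contains labels/observations and predictions; params – optional param map that overrides embedded params |
---|---|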
Returns: | metric |
New in version 1.4.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Indicates whether the metric returned by evaluate() should be maximized (True, default) or minimized (False). A given evaluator may support multiple metrics which may be maximized or minimized.
New in version 1.5.0.
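A tuning routine could consult this flag when choosing among candidate metrics, along the lines of the sketch below. The function select_best and the sample scores are illustrative assumptions, not part of the API.

```python
def select_best(metrics, is_larger_better=True):
    """Return the index of the best metric given the optimisation direction."""
    pick = max if is_larger_better else min
    return pick(range(len(metrics)), key=lambda i: metrics[i])


# Made-up values: an AUC-like metric is maximized, an RMSE-like one minimized.
auc_scores = [0.71, 0.83, 0.79]
rmse_scores = [2.8, 1.9, 2.2]
print(select_best(auc_scores, is_larger_better=True))    # 1
print(select_best(rmse_scores, is_larger_better=False))  # 1
```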
Checks whether a param is explicitly set by user.
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Note
Experimental
Evaluator for binary classification, which expects two input columns: rawPrediction and label. The rawPrediction column can be of type double (binary 0/1 prediction, or probability of label 1) or of type vector (length-2 vector of raw predictions, scores, or label probabilities).
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator
>>> from pyspark.ml.linalg import Vectors
>>> scoreAndLabels = map(lambda x: (Vectors.dense([1.0 - x[0], x[0]]), x[1]),
... [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)])
>>> dataset = spark.createDataFrame(scoreAndLabels, ["raw", "label"])
...
>>> evaluator = BinaryClassificationEvaluator(rawPredictionCol="raw")
>>> evaluator.evaluate(dataset)
0.70...
>>> evaluator.evaluate(dataset, {evaluator.metricName: "areaUnderPR"})
0.83...
>>> bce_path = temp_path + "/bce"
>>> evaluator.save(bce_path)
>>> evaluator2 = BinaryClassificationEvaluator.load(bce_path)
>>> str(evaluator2.getRawPredictionCol())
'raw'
New in version 1.4.0.
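The first value in the doctest above (0.70...) can be reproduced without Spark, assuming it is the area under the ROC curve. The sketch below uses the rank-based (Mann-Whitney) formulation with ties counted as one half, which is a different technique from however the library builds its curve; area_under_roc is a hypothetical helper.

```python
def area_under_roc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1.0]
    negatives = [s for s, y in zip(scores, labels) if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Same (score, label) pairs as in the doctest.
pairs = [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0),
         (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)]
auc = area_under_roc([s for s, _ in pairs], [y for _, y in pairs])
print(round(auc, 4))  # 0.7083
```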
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Evaluates the output with optional parameters.
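Parameters: | dataset – a dataset that contains labels/observations and predictions; params – optional param map that overrides embedded params |
---|---|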
Returns: | metric |
New in version 1.4.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of rawPredictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Indicates whether the metric returned by evaluate() should be maximized (True, default) or minimized (False). A given evaluator may support multiple metrics which may be maximized or minimized.
New in version 1.5.0.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of metricName.
New in version 1.4.0.
Sets params for binary classification evaluator.
New in version 1.4.0.
Sets the value of rawPredictionCol.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Evaluator for Regression, which expects two input columns: prediction and label.
>>> from pyspark.ml.evaluation import RegressionEvaluator
>>> scoreAndLabels = [(-28.98343821, -27.0), (20.21491975, 21.5),
... (-25.98418959, -22.0), (30.69731842, 33.0), (74.69283752, 71.0)]
>>> dataset = spark.createDataFrame(scoreAndLabels, ["raw", "label"])
...
>>> evaluator = RegressionEvaluator(predictionCol="raw")
>>> evaluator.evaluate(dataset)
2.842...
>>> evaluator.evaluate(dataset, {evaluator.metricName: "r2"})
0.993...
>>> evaluator.evaluate(dataset, {evaluator.metricName: "mae"})
2.649...
>>> re_path = temp_path + "/re"
>>> evaluator.save(re_path)
>>> evaluator2 = RegressionEvaluator.load(re_path)
>>> str(evaluator2.getPredictionCol())
'raw'
New in version 1.4.0.
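The three values in the doctest above can be checked by hand with the textbook definitions of RMSE, R² and MAE, assuming the first (default) value is the RMSE. The helper regression_metrics is hypothetical and does not use Spark.

```python
import math


def regression_metrics(predictions, labels):
    """Compute RMSE, MAE and R^2 from paired predictions and labels."""
    n = len(labels)
    errors = [p - y for p, y in zip(predictions, labels)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)   # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Same (prediction, label) pairs as in the doctest.
score_and_labels = [(-28.98343821, -27.0), (20.21491975, 21.5),
                    (-25.98418959, -22.0), (30.69731842, 33.0),
                    (74.69283752, 71.0)]
metrics = regression_metrics([p for p, _ in score_and_labels],
                             [y for _, y in score_and_labels])
print({name: round(value, 3) for name, value in metrics.items()})
```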
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Evaluates the output with optional parameters.
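Parameters: | dataset – a dataset that contains labels/observations and predictions; params – optional param map that overrides embedded params |
---|---|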
Returns: | metric |
New in version 1.4.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Indicates whether the metric returned by evaluate() should be maximized (True, default) or minimized (False). A given evaluator may support multiple metrics which may be maximized or minimized.
New in version 1.5.0.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of metricName.
New in version 1.4.0.
Sets params for regression evaluator.
New in version 1.4.0.
Sets the value of predictionCol.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Evaluator for Multiclass Classification, which expects two input columns: prediction and label.
>>> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
>>> scoreAndLabels = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0),
... (1.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0), (2.0, 0.0)]
>>> dataset = spark.createDataFrame(scoreAndLabels, ["prediction", "label"])
...
>>> evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
>>> evaluator.evaluate(dataset)
0.66...
>>> evaluator.evaluate(dataset, {evaluator.metricName: "accuracy"})
0.66...
>>> mce_path = temp_path + "/mce"
>>> evaluator.save(mce_path)
>>> evaluator2 = MulticlassClassificationEvaluator.load(mce_path)
>>> str(evaluator2.getPredictionCol())
'prediction'
New in version 1.5.0.
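Both values in the doctest above round to 0.66, which can be checked by hand. The sketch assumes the default metric is a label-frequency-weighted F1 score; the helpers accuracy and weighted_f1 are hypothetical and use the standard definitions rather than library code.

```python
from collections import Counter


def accuracy(pairs):
    """Share of (prediction, label) pairs where the prediction is correct."""
    return sum(1 for pred, label in pairs if pred == label) / len(pairs)


def weighted_f1(pairs):
    """Per-class F1 averaged with weights proportional to label frequency."""
    label_counts = Counter(label for _, label in pairs)
    pred_counts = Counter(pred for pred, _ in pairs)
    true_pos = Counter(label for pred, label in pairs if pred == label)
    total = 0.0
    for cls, n_label in label_counts.items():
        tp = true_pos[cls]
        precision = tp / pred_counts[cls] if pred_counts[cls] else 0.0
        recall = tp / n_label
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_label * f1
    return total / len(pairs)


# Same (prediction, label) pairs as in the doctest.
score_and_labels = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0),
                    (1.0, 1.0), (1.0, 1.0), (2.0, 2.0), (2.0, 0.0)]
print(round(accuracy(score_and_labels), 4))     # 0.6667
print(round(weighted_f1(score_and_labels), 4))  # 0.6614
```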
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Evaluates the output with optional parameters.
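Parameters: | dataset – a dataset that contains labels/observations and predictions; params – optional param map that overrides embedded params |
---|---|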
Returns: | metric |
New in version 1.4.0.
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Gets the value of labelCol or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Indicates whether the metric returned by evaluate() should be maximized (True, default) or minimized (False). A given evaluator may support multiple metrics which may be maximized or minimized.
New in version 1.5.0.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of metricName.
New in version 1.5.0.
Sets params for multiclass classification evaluator.
New in version 1.5.0.
Sets the value of predictionCol.
Returns an MLWriter instance for this ML instance.
Note
Experimental
A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in Li et al., PFP: Parallel FP-Growth for Query Recommendation [LI2008]. PFP distributes computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm is described in Han et al., Mining frequent patterns without candidate generation [HAN2000].
[LI2008] | http://dx.doi.org/10.1145/1454008.1454027 |
[HAN2000] | http://dx.doi.org/10.1145/335191.335372 |
Note
null values in the feature column are ignored during fit().
Note
Internally, transform() collects and broadcasts the association rules.
>>> from pyspark.ml.fpm import FPGrowth
>>> from pyspark.sql.functions import split
>>> data = (spark.read
... .text("data/mllib/sample_fpgrowth.txt")
... .select(split("value", r"\s+").alias("items")))
>>> data.show(truncate=False)
+------------------------+
|items |
+------------------------+
|[r, z, h, k, p] |
|[z, y, x, w, v, u, t, s]|
|[s, x, o, n, r] |
|[x, z, y, m, t, s, q, e]|
|[z] |
|[x, z, y, r, q, t, p] |
+------------------------+
>>> fp = FPGrowth(minSupport=0.2, minConfidence=0.7)
>>> fpm = fp.fit(data)
>>> fpm.freqItemsets.show(5)
+---------+----+
| items|freq|
+---------+----+
| [s]| 3|
| [s, x]| 3|
|[s, x, z]| 2|
| [s, z]| 2|
| [r]| 3|
+---------+----+
only showing top 5 rows
>>> fpm.associationRules.show(5)
+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
| [t, s]| [y]| 1.0|
| [t, s]| [x]| 1.0|
| [t, s]| [z]| 1.0|
| [p]| [r]| 1.0|
| [p]| [z]| 1.0|
+----------+----------+----------+
only showing top 5 rows
>>> new_data = spark.createDataFrame([(["t", "s"], )], ["items"])
>>> sorted(fpm.transform(new_data).first().prediction)
['x', 'y', 'z']
New in version 2.2.0.
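The itemset counts, rule confidences and prediction shown in the doctest above can be reproduced on the same six baskets with a brute-force enumeration. This is deliberately not FP-growth: it counts every subset of every basket, which only works for tiny inputs. All function names are hypothetical, and the use of a ceiling to turn minSupport into a minimum count is an assumption.

```python
import math
from collections import Counter
from itertools import combinations

# The six baskets shown by data.show() in the doctest.
transactions = ["r z h k p", "z y x w v u t s", "s x o n r",
                "x z y m t s q e", "z", "x z y r q t p"]
baskets = [frozenset(t.split()) for t in transactions]


def frequent_itemsets(baskets, min_support):
    """Count every subset of every basket; keep those meeting min_support."""
    min_count = math.ceil(min_support * len(baskets))
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def association_rules(freq, min_confidence):
    """Single-consequent rules: confidence = freq(itemset) / freq(antecedent)."""
    rules = []
    for itemset, count in freq.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            confidence = count / freq[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


def predict(rules, items):
    """Consequents of rules whose antecedent is contained in items."""
    items = frozenset(items)
    return sorted({c for a, c, _ in rules if a <= items and c not in items})


freq = frequent_itemsets(baskets, min_support=0.2)
rules = association_rules(freq, min_confidence=0.7)
print(freq[frozenset("sx")])         # 3
print(predict(rules, ["t", "s"]))    # ['x', 'y', 'z']
```

Every subset of a frequent itemset is itself frequent, which is why the antecedent lookup in association_rules always succeeds.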
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
Fits a model to the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | fitted model(s) |
New in version 1.3.0.
Gets the value of itemsCol or its default value.
Gets the value of minConfidence or its default value.
Gets the value of minSupport or its default value.
Gets the value of numPartitions or its default value.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Gets the value of predictionCol or its default value.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Sets the value of minConfidence.
Sets the value of minSupport.
Sets the value of numPartitions.
New in version 2.2.0.
Sets the value of predictionCol.
Returns an MLWriter instance for this ML instance.
Note
Experimental
Model fitted by FPGrowth.
New in version 2.2.0.
DataFrame with three columns:
* antecedent - Array of the same type as the input column.
* consequent - Array of the same type as the input column.
* confidence - Confidence for the rule (DoubleType).
New in version 2.2.0.
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: | extra – Extra parameters to copy to the new instance |
---|---|
Returns: | Copy of this instance |
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
Returns the documentation of all params with their optional default values and user-supplied values.
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from the input into a flat param map, where the later value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra.
Parameters: | extra – extra param values |
---|---|
Returns: | merged param map |
DataFrame with two columns:
* items - Itemset of the same type as the input column.
* freq - Frequency of the itemset (LongType).
New in version 2.2.0.
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
Gets a param by its name.
Checks whether a param has a default value.
Tests whether this instance contains a param with a given (string) name.
Checks whether a param is explicitly set by user or has a default value.
Checks whether a param is explicitly set by user.
Reads an ML instance from the input path, a shortcut of read().load(path).
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
Returns an MLReader instance for this class.
Save this ML instance to the given path, a shortcut of write().save(path).
Transforms the input dataset with optional parameters.
Parameters: | dataset – input dataset; params – optional param map that overrides embedded params |
---|---|
Returns: | transformed dataset |
New in version 1.3.0.
Returns an MLWriter instance for this ML instance.
Object with a unique ID.
A unique id for the object.
(Private) Mixin for instances that provide JavaMLReader.
Reads an ML instance from the input path, a shortcut of read().load(path).
(Private) Specialization of MLReader for JavaParams types
(Private) Mixin for ML instances that provide JavaMLWriter.
Save this ML instance to the given path, a shortcut of write().save(path).
(Private) Specialization of MLWriter for JavaParams types
(Private) Java Model for prediction tasks (regression and classification). To be mixed in with pyspark.ml.JavaModel.
Mixin for instances that provide MLReader.
New in version 2.0.0.
Utility class that can load ML instances.
New in version 2.0.0.
Mixin for ML instances that provide MLWriter.
New in version 2.0.0.
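The reader/writer mixins above follow a shortcut pattern: save(path) delegates to write().save(path) and load(path) delegates to read().load(path). A minimal pure-Python sketch of that pattern follows; the Tiny* classes and the JSON file layout are assumptions for illustration and say nothing about Spark's on-disk format.

```python
import json
import os
import tempfile


class TinyWriter:
    """Hypothetical writer that persists an instance's params as JSON."""

    def __init__(self, instance):
        self._instance = instance

    def save(self, path):
        with open(path, "w") as handle:
            json.dump(self._instance.params, handle)


class TinyReader:
    """Hypothetical reader that rebuilds an instance of a given class."""

    def __init__(self, cls):
        self._cls = cls

    def load(self, path):
        with open(path) as handle:
            return self._cls(json.load(handle))


class TinyEvaluator:
    def __init__(self, params):
        self.params = dict(params)

    def write(self):
        return TinyWriter(self)

    def save(self, path):
        self.write().save(path)        # shortcut of write().save(path)

    @classmethod
    def read(cls):
        return TinyReader(cls)

    @classmethod
    def load(cls, path):
        return cls.read().load(path)   # shortcut of read().load(path)


path = os.path.join(tempfile.mkdtemp(), "evaluator.json")
TinyEvaluator({"rawPredictionCol": "raw"}).save(path)
restored = TinyEvaluator.load(path)
print(restored.params["rawPredictionCol"])  # raw
```

Keeping write() and read() as separate entry points leaves room for configuring the writer or reader before the final save or load call.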