RandomForest

class pyspark.mllib.tree.RandomForest

Learning algorithm for a random forest model for classification or regression.

New in version 1.2.0.
Methods

trainClassifier(data, numClasses, …[, …])
    Train a random forest model for binary or multiclass classification.
trainRegressor(data, …[, …])
    Train a random forest model for regression.

Attributes

supportedFeatureSubsetStrategies
Methods Documentation

classmethod trainClassifier(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None)

Train a random forest model for binary or multiclass classification.
New in version 1.2.0.
Parameters

- data : pyspark.RDD
  Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}.
- numClasses : int
  Number of classes for classification.
- categoricalFeaturesInfo : dict
  Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.
- numTrees : int
  Number of trees in the random forest.
- featureSubsetStrategy : str, optional
  Number of features to consider for splits at each node. Supported values: "auto", "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest), set to "sqrt". The sketch after the examples below shows an explicit override. (default: "auto")
- impurity : str, optional
  Criterion used for information gain calculation. Supported values: "gini" or "entropy". (default: "gini")
- maxDepth : int, optional
  Maximum depth of the tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)
- maxBins : int, optional
  Maximum number of bins used for splitting features. (default: 32)
- seed : int, optional
  Random seed for bootstrapping and choosing feature subsets. Set as None to generate a seed based on system time. (default: None)

Returns

- RandomForestModel
  Model that can be used for prediction.
Examples

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>>
>>> data = [
...     LabeledPoint(0.0, [0.0]),
...     LabeledPoint(0.0, [1.0]),
...     LabeledPoint(1.0, [2.0]),
...     LabeledPoint(1.0, [3.0])
... ]
>>> model = RandomForest.trainClassifier(sc.parallelize(data), 2, {}, 3, seed=42)
>>> model.numTrees()
3
>>> model.totalNumNodes()
7
>>> print(model)
TreeEnsembleModel classifier with 3 trees
>>> print(model.toDebugString())
TreeEnsembleModel classifier with 3 trees
  Tree 0:
    Predict: 1.0
  Tree 1:
    If (feature 0 <= 1.5)
     Predict: 0.0
    Else (feature 0 > 1.5)
     Predict: 1.0
  Tree 2:
    If (feature 0 <= 1.5)
     Predict: 0.0
    Else (feature 0 > 1.5)
     Predict: 1.0
>>> model.predict([2.0])
1.0
>>> model.predict([0.0])
0.0
>>> rdd = sc.parallelize([[3.0], [1.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
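The sketch below is not part of the original reference: it shows categoricalFeaturesInfo and an explicit featureSubsetStrategy override used together. The toy data is hypothetical, and a running SparkContext bound to sc is assumed, as in the doctest above.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Hypothetical toy data: feature 0 is categorical with 3 categories {0, 1, 2};
# feature 1 is continuous.
data = [
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [1.0, 2.5]),
    LabeledPoint(1.0, [2.0, 3.0]),
    LabeledPoint(1.0, [2.0, 4.5]),
]
model = RandomForest.trainClassifier(
    sc.parallelize(data),            # assumes a live SparkContext `sc`
    numClasses=2,
    categoricalFeaturesInfo={0: 3},  # entry (0 -> 3): feature 0 has 3 categories
    numTrees=10,
    featureSubsetStrategy="sqrt",    # explicit override of the "auto" default
    impurity="entropy",
    maxDepth=3,
    seed=42,
)
print(model.predict([2.0, 3.0]))

Note that maxBins must be at least as large as the largest category count declared in categoricalFeaturesInfo; the default of 32 comfortably covers the 3 categories here.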
classmethod trainRegressor(data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None)

Train a random forest model for regression.
New in version 1.2.0.
Parameters

- data : pyspark.RDD
  Training dataset: RDD of LabeledPoint. Labels are real numbers.
- categoricalFeaturesInfo : dict
  Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.
- numTrees : int
  Number of trees in the random forest.
- featureSubsetStrategy : str, optional
  Number of features to consider for splits at each node. Supported values: "auto", "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest), set to "onethird" for regression. (default: "auto")
- impurity : str, optional
  Criterion used for information gain calculation. The only supported value for regression is "variance". (default: "variance")
- maxDepth : int, optional
  Maximum depth of the tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)
- maxBins : int, optional
  Maximum number of bins used for splitting features. (default: 32)
- seed : int, optional
  Random seed for bootstrapping and choosing feature subsets. Set as None to generate a seed based on system time. (default: None)

Returns

- RandomForestModel
  Model that can be used for prediction (see the scoring sketch after the examples below).
Examples

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import RandomForest
>>> from pyspark.mllib.linalg import SparseVector
>>>
>>> sparse_data = [
...     LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
...     LabeledPoint(1.0, SparseVector(2, {1: 1.0})),
...     LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
...     LabeledPoint(1.0, SparseVector(2, {1: 2.0}))
... ]
>>>
>>> model = RandomForest.trainRegressor(sc.parallelize(sparse_data), {}, 2, seed=42)
>>> model.numTrees()
2
>>> model.totalNumNodes()
4
>>> model.predict(SparseVector(2, {1: 1.0}))
1.0
>>> model.predict(SparseVector(2, {0: 1.0}))
0.5
>>> rdd = sc.parallelize([[0.0, 1.0], [1.0, 0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.5]
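As a follow-up sketch (again not from the original reference, and assuming the same running SparkContext sc), the returned model's predictions can be scored against the true labels. Note that predict is called on an RDD of feature vectors rather than inside a transformation over the LabeledPoints.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Hypothetical toy regression data: the label equals the single feature value.
points = sc.parallelize(  # assumes a live SparkContext `sc`
    [LabeledPoint(float(x), [float(x)]) for x in range(8)]
)
model = RandomForest.trainRegressor(points, {}, numTrees=10, seed=42)

# predict() is applied to an RDD of feature vectors, then zipped back with
# the labels; both RDDs derive from `points`, so they zip element-for-element.
predictions = model.predict(points.map(lambda lp: lp.features))
labels_and_preds = points.map(lambda lp: lp.label).zip(predictions)
mse = labels_and_preds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()
print("training MSE: %s" % mse)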
Attributes Documentation

supportedFeatureSubsetStrategies = ('auto', 'all', 'sqrt', 'log2', 'onethird')
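As a usage note, this tuple can serve as a guard for user-supplied strategies; a minimal, illustrative sketch:

from pyspark.mllib.tree import RandomForest

strategy = "log2"  # hypothetical user input
if strategy not in RandomForest.supportedFeatureSubsetStrategies:
    raise ValueError("unsupported featureSubsetStrategy: %r" % strategy)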