pyspark.ml.clustering.DistributedLDAModel
Distributed model fitted by LDA. This type of model is currently only produced by Expectation-Maximization (EM).
This model stores the inferred topics, the full training dataset, and the topic distribution for each training document.
New in version 2.0.0.
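As a minimal sketch, assuming an active SparkSession named spark and a toy term-count DataFrame (both assumptions, not part of this reference): fitting LDA with the EM optimizer is what yields a DistributedLDAModel, whose topics can then be inspected with describeTopics.

from pyspark.ml.clustering import LDA, DistributedLDAModel
from pyspark.ml.linalg import Vectors

# Toy corpus (assumed for illustration): each row is (id, term-count vector).
df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 2.0, 0.0])),
     (1, Vectors.dense([0.0, 1.0, 3.0]))],
    ["id", "features"])

# optimizer="em" produces a DistributedLDAModel; the default "online"
# optimizer would produce a LocalLDAModel instead.
lda = LDA(k=2, maxIter=10, optimizer="em", seed=1)
model = lda.fit(df)
print(isinstance(model, DistributedLDAModel))    # True
model.describeTopics(maxTermsPerTopic=3).show()  # top-weighted terms per topic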
Methods
clear(param)
Clears a param from the param map if it has been explicitly set.
copy([extra])
Creates a copy of this instance with the same uid and some extra params.
describeTopics([maxTermsPerTopic])
Return the topics described by their top-weighted terms.
estimatedDocConcentration()
Value for LDA.docConcentration estimated from data.
explainParam(param)
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
explainParams()
Returns the documentation of all params with their optionally default values and user-supplied values.
extractParamMap([extra])
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
getCheckpointFiles()
If using checkpointing and LDA.keepLastCheckpoint is set to true, then there may be saved checkpoint files.
getCheckpointInterval()
Gets the value of checkpointInterval or its default value.
getDocConcentration()
Gets the value of docConcentration or its default value.
getFeaturesCol()
Gets the value of featuresCol or its default value.
getK()
Gets the value of k or its default value.
getKeepLastCheckpoint()
Gets the value of keepLastCheckpoint or its default value.
getLearningDecay()
Gets the value of learningDecay or its default value.
getLearningOffset()
Gets the value of learningOffset or its default value.
getMaxIter()
Gets the value of maxIter or its default value.
getOptimizeDocConcentration()
Gets the value of optimizeDocConcentration or its default value.
getOptimizer()
Gets the value of optimizer or its default value.
getOrDefault(param)
Gets the value of a param in the user-supplied param map or its default value.
getParam(paramName)
Gets a param by its name.
getSeed()
Gets the value of seed or its default value.
getSubsamplingRate()
Gets the value of subsamplingRate or its default value.
getTopicConcentration()
Gets the value of topicConcentration or its default value.
getTopicDistributionCol()
Gets the value of topicDistributionCol or its default value.
hasDefault(param)
Checks whether a param has a default value.
hasParam(paramName)
Tests whether this instance contains a param with a given (string) name.
isDefined(param)
Checks whether a param is explicitly set by user or has a default value.
isDistributed()
Indicates whether this instance is of type DistributedLDAModel.
isSet(param)
Checks whether a param is explicitly set by user.
load(path)
Reads an ML instance from the input path, a shortcut of read().load(path).
logLikelihood(dataset)
Calculates a lower bound on the log likelihood of the entire corpus.
logPerplexity(dataset)
Calculate an upper bound on perplexity.
logPrior()
Log probability of the current parameter estimate: log P(topics, topic distributions for docs | alpha, eta)
read()
Returns an MLReader instance for this class.
save(path)
Save this ML instance to the given path, a shortcut of write().save(path).
set(param, value)
Sets a parameter in the embedded param map.
setFeaturesCol(value)
Sets the value of featuresCol.
setSeed(value)
Sets the value of seed.
setTopicDistributionCol(value)
Sets the value of topicDistributionCol.
toLocal()
Convert this distributed model to a local representation.
topicsMatrix()
Inferred topics, where each topic is represented by a distribution over terms.
trainingLogLikelihood()
Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters)
transform(dataset[, params])
Transforms the input dataset with optional parameters.
vocabSize()
Vocabulary size (number of terms or words in the vocabulary).
write()
Returns an MLWriter instance for this ML instance.
Attributes
checkpointInterval
maxIter
params
Returns all params ordered by name.
Methods Documentation
copy([extra])
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
Parameters: extra – extra parameters to copy to the new instance
Returns: JavaParams – copy of this instance
estimatedDocConcentration()
Value for LDA.docConcentration estimated from data. If Online LDA was used and LDA.optimizeDocConcentration was set to false, then this returns the fixed (given) value for the LDA.docConcentration parameter.
extractParamMap([extra])
Parameters: extra – extra param values
Returns: merged param map
getCheckpointFiles()
If using checkpointing and LDA.keepLastCheckpoint is set to true, then there may be saved checkpoint files. This method is provided so that users can manage those files.
Returns: list of checkpoint files from training
Notes
Removing the checkpoints can cause failures if a partition is lost and is needed by certain DistributedLDAModel methods. Reference counting will clean up the checkpoints when this model and derivative data go out of scope.
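A brief sketch of inspecting those files, assuming a checkpoint directory was set via SparkContext.setCheckpointDir before training and keepLastCheckpoint was left at its default of true:

# May be empty if checkpointing was not enabled or no checkpoint was kept.
for path in model.getCheckpointFiles():
    print(path)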
getOrDefault(param)
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
logLikelihood(dataset)
Calculates a lower bound on the log likelihood of the entire corpus. See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
Warning
If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
logPerplexity(dataset)
Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
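A short sketch of both bounds, reusing the model and df from the fitting sketch near the top of this page:

ll = model.logLikelihood(df)   # lower bound on the corpus log likelihood
lp = model.logPerplexity(df)   # upper bound on perplexity (lower is better)
print(ll, lp)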
toLocal()
Convert this distributed model to a local representation. This discards info about the training dataset.
Warning
This involves collecting a large topicsMatrix() to the driver.
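A minimal sketch of the conversion; the resulting LocalLDAModel keeps the topics but no longer reports itself as distributed:

local_model = model.toLocal()       # collects the topics matrix to the driver
print(local_model.isDistributed())  # False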
topicsMatrix()
Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
Warning
If this model is actually a DistributedLDAModel instance produced by the Expectation-Maximization (“em”) optimizer, then this method could involve collecting a large amount of data to the driver (on the order of vocabSize x k).
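A sketch of reading the collected matrix on the driver, using the shape described above (rows = vocabSize, columns = k):

topics = model.topicsMatrix()          # local pyspark.ml.linalg matrix
print(topics.numRows, topics.numCols)  # vocabSize, k
print(topics.toArray()[:, 0])          # term weights of the first topic (one column per topic)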
trainingLogLikelihood()
Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters)
Notes
This excludes the prior; for that, use logPrior().
Even with logPrior(), this is not the same as the data log likelihood given the hyperparameters.
This is computed from the topic distributions computed during training. If you call logLikelihood() on the same training dataset, the topic distributions will be computed again, possibly giving different results.
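A brief sketch of the training-set quantities that exist only on the distributed (EM-trained) model:

print(model.trainingLogLikelihood())  # log P(docs | topics, topic distributions, hyperparameters)
print(model.logPrior())               # log P(topics, topic distributions for docs | alpha, eta)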
transform(dataset[, params])
Transforms the input dataset with optional parameters.
New in version 1.3.0.
Parameters: dataset – input dataset (pyspark.sql.DataFrame); params – an optional param map that overrides embedded params
Returns: transformed dataset (pyspark.sql.DataFrame)
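A short sketch of scoring documents; the per-document topic mixture is written to the column named by topicDistributionCol, which defaults to "topicDistribution":

scored = model.transform(df)
scored.select("id", "topicDistribution").show(truncate=False)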
Attributes Documentation
params
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.