Gets the desired number of leaf clusters.
Gets the max number of k-means iterations to split clusters.
Gets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster.
Gets the random seed.
Java-friendly version of run().
Runs the bisecting k-means algorithm.
Input: an RDD of vectors.
Returns: the model for the bisecting k-means.
Sets the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
Sets the max number of k-means iterations to split clusters (default: 20).
Sets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1).
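The two readings of this setting can be illustrated with a small Python sketch. It is not taken from the Spark source: the function name `min_divisible_size` is hypothetical, and rounding fractional results up to a whole point count is an assumption made here for illustration.

```python
import math


def min_divisible_size(min_divisible_cluster_size: float, num_points: int) -> int:
    """Turn the setting into an absolute point count (assumed rule, see lead-in)."""
    if min_divisible_cluster_size <= 0:
        raise ValueError("min_divisible_cluster_size must be positive")
    if min_divisible_cluster_size >= 1.0:
        # Values >= 1.0 are read as an absolute number of points.
        return math.ceil(min_divisible_cluster_size)
    # Values < 1.0 are read as a proportion of all input points.
    # Rounding up is an assumption of this sketch.
    return math.ceil(min_divisible_cluster_size * num_points)
```

With 1000 input points, a setting of 50.0 would require at least 50 points in a cluster before it can be split, and a setting of 0.25 would require at least 250.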
Sets the random seed (default: hash value of the class name).
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively, it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
Reference: Steinbach, Karypis, and Kumar, A comparison of document clustering techniques, KDD Workshop on Text Mining, 2000.
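The level-by-level procedure described above can be sketched on a single machine in plain Python. This is a simplified illustration under stated assumptions, not Spark's implementation: the names `bisecting_kmeans`, `_bisect`, and `min_size` are hypothetical, a cluster is treated as divisible when it has at least `min_size` points and at least two distinct points, and each split is a plain 2-means run seeded from two distinct points.

```python
import random


def _mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def _sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _bisect(points, max_iterations, rng):
    """Split one cluster in two with plain 2-means (Lloyd iterations)."""
    centers = rng.sample(sorted(set(points)), 2)  # two distinct starting centers
    left, right = [], []
    for _ in range(max_iterations):
        new_left, new_right = [], []
        for p in points:
            closer_to_first = _sqdist(p, centers[0]) <= _sqdist(p, centers[1])
            (new_left if closer_to_first else new_right).append(p)
        if not new_left or not new_right:
            break  # keep the last valid split
        left, right = new_left, new_right
        new_centers = [_mean(left), _mean(right)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k=4, max_iterations=20, min_size=1, seed=0):
    """Return a list of leaf clusters (lists of points); may hold fewer than k."""
    rng = random.Random(seed)
    root = list(points)
    leaves = [root]
    level = [root]  # clusters on the current bottom level
    while len(leaves) < k:
        # Simplified divisibility test (an assumption of this sketch).
        divisible = [c for c in level
                     if len(c) >= max(min_size, 2) and len(set(c)) > 1]
        if not divisible:
            break
        # Each split adds one leaf, so split at most k - len(leaves) clusters,
        # giving larger clusters higher priority.
        divisible.sort(key=len, reverse=True)
        divisible = divisible[: k - len(leaves)]
        next_level = []
        for cluster in divisible:
            left, right = _bisect(cluster, max_iterations, rng)
            if left and right:
                leaves = [c for c in leaves if c is not cluster]
                leaves += [left, right]
                next_level += [left, right]
        if not next_level:
            break
        level = next_level
    return leaves
```

For example, `bisecting_kmeans([(0.0,), (0.1,), (10.0,), (10.1,), (20.0,)], k=3)` returns three leaf clusters that together contain all five points, while an input with only two distinct points yields at most two leaves regardless of `k`, mirroring the note that the actual number of leaf clusters can be smaller than requested.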