public class CosineSilhouette
extends Object
1 - s
where s
is the cosine similarity between two points.
The total distance of the point X
to the points $C_{i}$
belonging to the cluster $\Gamma$
is:
$$ \sum\limits_{i=1}^N d(X, C_{i} ) = \sum\limits_{i=1}^N \Big( 1 - \frac{\sum\limits_{j=1}^D x_{j}c_{ij} }{ \|X\|\|C_{i}\|} \Big) = \sum\limits_{i=1}^N 1 - \sum\limits_{i=1}^N \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \frac{c_{ij}}{\|C_{i}\|} = N - \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \Big( \sum\limits_{i=1}^N \frac{c_{ij}}{\|C_{i}\|} \Big) $$
where $x_{j}$
is the j
-th dimension of the point X
and $c_{ij}$
is the j
-th dimension
of the i
-th point in cluster $\Gamma$
.
Then, we can define the vector:
$$ \xi_{X} : \xi_{X i} = \frac{x_{i}}{\|X\|}, i = 1, ..., D $$
which can be precomputed for each point and the vector
$$ \Omega_{\Gamma} : \Omega_{\Gamma i} = \sum\limits_{j=1}^N \xi_{C_{j}i}, i = 1, ..., D $$
which can be precomputed too for each cluster $\Gamma$
by its points $C_{i}$
.
With these definitions, the numerator becomes:
$$ N - \sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j} $$
Thus the average distance of a point X
to the points of the cluster $\Gamma$
is:
$$ 1 - \frac{\sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}}{N} $$
In the implementation, the precomputed values for the clusters are distributed among the worker nodes via broadcasted variables, because we can assume that the clusters are limited in number.
The main strengths of this algorithm are the low computational complexity and the intrinsic
parallelism. The precomputed information for each point and for each cluster can be computed
with a computational complexity which is O(N/W)
, where N
is the number of points in the
dataset and W
is the number of worker nodes. After that, every point can be analyzed
independently from the others.
For every point we need to compute the average distance to all the clusters. Since the formula
above requires O(D)
operations, this phase has a computational complexity which is
O(C*D*N/W)
where C
is the number of clusters (which we assume quite low), D
is the number
of dimensions, N
is the number of points in the dataset and W
is the number of worker
nodes.
Constructor and Description |
---|
CosineSilhouette() |
Modifier and Type | Method and Description |
---|---|
static scala.collection.immutable.Map<Object,scala.Tuple2<Vector,Object>> |
computeClusterStats(Dataset<Row> df,
String featuresCol,
String predictionCol)
The method takes the input dataset and computes the aggregated values
about a cluster which are needed by the algorithm.
|
static double |
computeSilhouetteCoefficient(Broadcast<scala.collection.immutable.Map<Object,scala.Tuple2<Vector,Object>>> broadcastedClustersMap,
Vector normalizedFeatures,
double clusterId)
It computes the Silhouette coefficient for a point.
|
static double |
computeSilhouetteScore(Dataset<?> dataset,
String predictionCol,
String featuresCol)
Compute the Silhouette score of the dataset using the cosine distance measure.
|
static double |
overallScore(Dataset<Row> df,
Column scoreColumn) |
static double |
pointSilhouetteCoefficient(scala.collection.immutable.Set<Object> clusterIds,
double pointClusterId,
long pointClusterNumOfPoints,
scala.Function1<Object,Object> averageDistanceToCluster) |
public static scala.collection.immutable.Map<Object,scala.Tuple2<Vector,Object>> computeClusterStats(Dataset<Row> df, String featuresCol, String predictionCol)
df
- The DataFrame which contains the input datapredictionCol
- The name of the column which contains the predicted cluster id
for the point.featuresCol
- (undocumented)Map
which associates each cluster id to a
its statistics (ie. the precomputed values N
and $\Omega_{\Gamma}$
).public static double computeSilhouetteCoefficient(Broadcast<scala.collection.immutable.Map<Object,scala.Tuple2<Vector,Object>>> broadcastedClustersMap, Vector normalizedFeatures, double clusterId)
broadcastedClustersMap
- A map of the precomputed values for each cluster.normalizedFeatures
- The Vector
representing the
normalized features of the current point.clusterId
- The id of the cluster the current point belongs to.public static double computeSilhouetteScore(Dataset<?> dataset, String predictionCol, String featuresCol)
dataset
- The input dataset (previously clustered) on which compute the Silhouette.predictionCol
- The name of the column which contains the predicted cluster id
for the point.featuresCol
- The name of the column which contains the feature vector of the point.public static double pointSilhouetteCoefficient(scala.collection.immutable.Set<Object> clusterIds, double pointClusterId, long pointClusterNumOfPoints, scala.Function1<Object,Object> averageDistanceToCluster)