Guaranteed Recovery of Unambiguous Clusters

Authors: Kayvon Mazooji, Ilan Shomorony

Abstract: Clustering is often a challenging problem because of the inherent ambiguity
in what the “correct” clustering should be. Even when the number of clusters
$K$ is known, this ambiguity often still exists, particularly when there is
variation in density among different clusters, and clusters have multiple
relatively separated regions of high density. In this paper we propose an
information-theoretic characterization of when a $K$-clustering is ambiguous,
and design an algorithm that recovers the clustering whenever it is
unambiguous. This characterization formalizes the situation when two high
density regions within a cluster are separable enough that they look more like
two distinct clusters than two truly distinct clusters in the clustering. The
algorithm first identifies $K$ partial clusters (or “seeds”) using a
density-based approach, and then adds unclustered points to the initial $K$
partial clusters in a greedy manner to form a complete clustering. We implement
and test a version of the algorithm that is modified to effectively handle
overlapping clusters, and observe that it requires little parameter selection
and displays improved performance on many datasets compared to widely used
algorithms for non-convex cluster recovery.

Source: http://arxiv.org/abs/2501.13093v1