January 31, 2011
This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Eu...
April 20, 2012
We present an new sequential Monte Carlo sampler for coalescent based Bayesian hierarchical clustering. Our model is appropriate for modeling non-i.i.d. data and offers a substantial reduction of computational cost when compared to the original sampler without resorting to approximations. We also propose a quadratic complexity approximation that in practice shows almost no loss in performance compared to its counterpart. We show that as a byproduct of our formulation, we obta...
July 4, 2009
We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman's coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over others, and demonstrate our approach in document clustering and phylolinguistics.
June 5, 2021
Hierarchical and k-medoids clustering are deterministic clustering algorithms based on pairwise distances. Using these same pairwise distances, we propose a novel stochastic clustering method based on random partition distributions. We call our method CaviarPD, for cluster analysis via random partition distributions. CaviarPD first samples clusterings from a random partition distribution and then finds the best cluster estimate based on these samples using algorithms to minim...
April 28, 2023
A novel nonparametric clustering algorithm is proposed using the interpoint distances between the members of the data to reveal the inherent clustering structure existing in the given set of data, where we apply the classical nonparametric univariate kernel density estimation method to the interpoint distances to estimate the density around a data member. Our clustering algorithm is simple in its formation and easy to apply resulting in well-defined clusters. The algorithm st...
November 25, 2015
With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide the set of data instances into the subsets which maximize the intra-subset similarity and inter-subset dissimilarity, where a similarity...
January 2, 2015
It is well known that most of the common clustering objectives are NP-hard to optimize. In practice, however, clustering is being routinely carried out. One approach for providing theoretical understanding of this seeming discrepancy is to come up with notions of clusterability that distinguish realistically interesting input data from worst-case data sets. The hope is that there will be clustering algorithms that are provably efficient on such 'clusterable' instances. In oth...
March 31, 2020
In many modern applications, there is interest in analyzing enormous data sets that cannot be easily moved across computers or loaded into memory on a single computer. In such settings, it is very common to be interested in clustering. Existing distributed clustering algorithms are mostly distance or density based without a likelihood specification, precluding the possibility of formal statistical inference. Model-based clustering allows statistical inference, yet research on...
September 12, 2009
In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guaranties for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical...
March 31, 2023
Bayesian clustering typically relies on mixture models, with each component interpreted as a different cluster. After defining a prior for the component parameters and weights, Markov chain Monte Carlo (MCMC) algorithms are commonly used to produce samples from the posterior distribution of the component labels. The data are then clustered by minimizing the expectation of a clustering loss function that favours similarity to the component labels. Unfortunately, although these...