April 4, 2020
Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this assumption is appealing for some applications, it is not appropriate for tasks such as entity resolution, modeling of sparse networks, and DNA sequencing. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points -- the microclustering property. Motivated by these ...
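A minimal simulation (not from the paper; the concentration parameter and seed are arbitrary) of why exchangeable models such as the Chinese restaurant process lack this property: the largest cluster occupies a roughly constant fraction of the data as n grows, i.e. linear rather than sublinear growth.

```python
# Under a Chinese restaurant process, cluster sizes grow linearly in n,
# so the CRP lacks the microclustering property.
import random

def crp_cluster_sizes(n, alpha=1.0, seed=0):
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of points in cluster k
    for i in range(n):
        r = rng.random() * (i + alpha)
        acc = 0.0
        for k, s in enumerate(sizes):
            acc += s
            if r < acc:
                sizes[k] += 1
                break
        else:
            sizes.append(1)  # new cluster with probability alpha / (i + alpha)
    return sizes

for n in (1000, 10000, 100000):
    sizes = crp_cluster_sizes(n)
    # the fraction stays bounded away from zero: linear, not sublinear, growth
    print(n, max(sizes) / n)
```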
January 16, 2017
Employing nonparametric methods for density estimation has become routine in Bayesian statistical practice. Models based on discrete nonparametric priors such as Dirichlet Process Mixture (DPM) models are very attractive choices due to their flexibility and tractability. However, a common problem in fitting DPMs or other discrete models to data is that they tend to produce a large number of (sometimes) redundant clusters. In this work we propose a method that produces parsimo...
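As a hedged illustration of the redundant-cluster phenomenon (not the paper's proposed remedy), fitting a truncated Dirichlet process mixture with scikit-learn's BayesianGaussianMixture on two well-separated Gaussians typically leaves a tail of small extra components:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# two true clusters
X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])

dpm = BayesianGaussianMixture(
    n_components=15,                                   # truncation level
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

print(np.sort(dpm.weights_)[::-1])  # a few large weights plus a tail of small ones
print((dpm.weights_ > 0.01).sum())  # number of non-negligible components
```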
June 9, 2020
Loss-based clustering methods, such as k-means and its variants, are standard tools for finding groups in data, but their lack of uncertainty quantification for the estimated clusters is a disadvantage. Model-based clustering based on mixture models provides an alternative, but such methods face computational challenges and are highly sensitive to the choice of kernel. This article proposes a generalized Bayes framework that bridges between these two paradigms through the use...
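One way to read "generalized Bayes through a loss" is as a Gibbs posterior that weights each partition by exp(-w * loss). The sketch below (the k-means loss, temperature w, and single-site Metropolis moves are illustrative assumptions, not the paper's algorithm) samples such a posterior to get uncertainty around a k-means-style solution.

```python
import numpy as np

def kmeans_loss(X, z, K):
    """Within-cluster sum of squared deviations from cluster means."""
    loss = 0.0
    for k in range(K):
        pts = X[z == k]
        if len(pts):
            loss += ((pts - pts.mean(axis=0)) ** 2).sum()
    return loss

def gibbs_posterior_sampler(X, K=3, w=1.0, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(0, K, len(X))            # random initial partition
    cur = kmeans_loss(X, z, K)
    samples = []
    for t in range(iters):
        i = rng.integers(len(X))              # propose relabeling one point
        old = z[i]
        z[i] = rng.integers(K)
        new = kmeans_loss(X, z, K)
        if rng.random() < np.exp(-w * (new - cur)):
            cur = new                         # accept: target is exp(-w * loss)
        else:
            z[i] = old                        # reject
        if t % 50 == 0:
            samples.append(z.copy())
    return samples                            # draws over partitions
```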
July 19, 2023
Bayesian nonparametric mixture models are widely used to cluster observations. However, one major drawback of the approach is that the estimated partition often exhibits unbalanced cluster frequencies, with only a few dominant clusters and a large number of sparsely populated ones. This feature makes the results uninterpretable unless we are willing to ignore a substantial number of observations and clusters. Interpreting the posterior distribution as penalize...
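The imbalance is already visible in the prior: under Dirichlet process stick-breaking, cluster weights decay geometrically in expectation. A quick illustration (the concentration parameter and truncation level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 1.0
v = rng.beta(1, alpha, size=30)                          # stick-breaking proportions
w = v * np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))  # cluster weights
print(np.round(np.sort(w)[::-1][:10], 3))                # few large weights, long sparse tail
```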
June 20, 2018
This paper introduces a novel data clustering algorithm based on Langevin dynamics, where the associated potential is constructed directly from the data. To introduce a self-consistent potential, we adopt the potential model from the established Quantum Clustering method. The first step is to use a radial basis function to construct a density distribution from the data. A potential function is then constructed such that this density distribution is the ground state solution t...
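The potential construction described here follows the Quantum Clustering recipe of Horn and Gottlieb: with a Gaussian radial basis density psi, choose V so that psi solves the Schrodinger equation (-sigma^2/2) * Laplacian(psi) + V * psi = E * psi, so minima of V mark cluster centers. A sketch (sigma and the example data are assumptions):

```python
import numpy as np

def quantum_potential(x, data, sigma):
    """V(x) up to an additive constant, for d-dimensional points."""
    d = data.shape[1]
    r2 = ((x - data) ** 2).sum(axis=1)        # squared distances to data points
    g = np.exp(-r2 / (2 * sigma ** 2))        # RBF kernel values
    psi = g.sum()                             # Parzen-window density
    # Solving H psi = E psi for V gives
    # V(x) = E - d/2 + (1 / (2 sigma^2 psi)) * sum_i r_i^2 exp(-r_i^2 / 2 sigma^2)
    return (r2 * g).sum() / (2 * sigma ** 2 * psi) - d / 2

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
# V is high between the two clumps and low at their centers; points evolved
# under (Langevin) dynamics in this potential settle into the cluster minima.
print(quantum_potential(np.zeros(2), data, sigma=1.0))
print(quantum_potential(np.array([-2.0, 0.0]), data, sigma=1.0))
```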
January 16, 2013
The K-Means and EM algorithms are popular in clustering and mixture modeling due to their simplicity and ease of implementation. However, they have several significant limitations: both converge to a local optimum of their respective objective functions (ignoring the uncertainty in the model space), require the a priori specification of the number of classes/clusters, and are inconsistent. In this work we overcome these limitations by using the Minimum Message Length (MML) pri...
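A rough sketch of the selection idea, with BIC standing in for the MML criterion (which is not reconstructed here; the data and the candidate range of K are also assumptions): fit mixtures for several K and keep the one minimizing the criterion, rather than fixing K a priori.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-4, 1, (150, 1)), rng.normal(4, 1, (150, 1))])

# fit a mixture for each candidate K and score it
fits = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 8)]
best = min(fits, key=lambda m: m.bic(X))  # MML would minimize message length instead
print(best.n_components)                  # recovers K = 2 here
```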
March 27, 2017
We develop a general class of Bayesian repulsive Gaussian mixture models that encourage well-separated clusters, aiming at reducing potentially redundant components produced by independent priors for locations (such as the Dirichlet process). The asymptotic results for the posterior distribution of the proposed models are derived, including posterior consistency and posterior contraction rate in the context of nonparametric density estimation. More importantly, we show that c...
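A hedged sketch of what a repulsive prior on component locations can look like (the particular repulsion function g(d) = d^2 / (d^2 + tau) and its scale tau are illustrative assumptions, not the paper's specification): independent Gaussian priors are multiplied by pairwise terms that vanish as two locations coincide.

```python
import numpy as np
from itertools import combinations

def log_repulsive_prior(mu, tau=1.0, prior_sd=5.0):
    """mu: (K, d) array of component locations."""
    lp = -0.5 * (mu ** 2).sum() / prior_sd ** 2  # independent Gaussian base prior
    for i, j in combinations(range(len(mu)), 2):
        d2 = ((mu[i] - mu[j]) ** 2).sum()
        lp += np.log(d2 / (d2 + tau))            # -> -inf as mu_i -> mu_j
    return lp

print(log_repulsive_prior(np.array([[0.0], [0.1]])))  # near-duplicates penalized
print(log_repulsive_prior(np.array([[0.0], [3.0]])))  # well-separated favored
```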
March 14, 2000
We discuss a new approach to data clustering. We find that maximum likelihood leads naturally to a Hamiltonian of Potts variables which depends on the correlation matrix and whose low-temperature behavior describes the correlation structure of the data. For random, uncorrelated data sets no correlation structure emerges. On the other hand, for data sets with a built-in cluster structure, the method is able to detect and recover that structure efficiently. Finally we apply the...
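A minimal Monte Carlo sketch of the idea (the number of Potts states q, temperature, and update rule are all illustrative assumptions): with H(s) = -sum over i<j of J_ij * [s_i == s_j] and couplings J derived from the correlation matrix, Metropolis updates at low temperature align the spins of strongly correlated items.

```python
import numpy as np

def potts_metropolis(J, q=10, T=0.1, sweeps=200, seed=0):
    """J: symmetric coupling matrix (e.g. from correlations), J_ii = 0."""
    rng = np.random.default_rng(seed)
    n = J.shape[0]
    s = rng.integers(0, q, n)          # initial spin (cluster) labels
    for _ in range(sweeps * n):
        i = rng.integers(n)
        new = rng.integers(q)
        # energy change from relabeling spin i
        dE = (J[i] * (s == s[i])).sum() - (J[i] * (s == new)).sum()
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            s[i] = new
    return s                            # spins sharing a label form clusters
```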
November 2, 2011
Bayesian models offer great flexibility for clustering applications---Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-m...
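The abstract is cut off before the algorithm itself, but the setup points to a small-variance-asymptotics, k-means-like procedure. The sketch below (the penalty lambda, initialization, and relabeling step are my assumptions) opens a new cluster whenever a point sits farther than sqrt(lambda) from every current center.

```python
import numpy as np

def dp_means_like(X, lam, iters=20):
    centers = [X.mean(axis=0)]                 # start with one global cluster
    z = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = []
        for x in X:
            d2 = [((x - c) ** 2).sum() for c in centers]
            if min(d2) > lam:
                centers.append(x.copy())       # penalty paid: open a new cluster
                labels.append(len(centers) - 1)
            else:
                labels.append(int(np.argmin(d2)))
        z = np.array(labels)
        keep = [k for k in range(len(centers)) if (z == k).any()]
        centers = [X[z == k].mean(axis=0) for k in keep]
        z = np.array([keep.index(k) for k in z])  # relabel compactly
    return z, centers                          # number of clusters chosen by lambda
```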
March 31, 2020
In many modern applications, there is interest in analyzing enormous data sets that cannot easily be moved across computers or loaded into memory on a single computer. Clustering is a common goal in such settings. Existing distributed clustering algorithms are mostly distance or density based, without a likelihood specification, precluding the possibility of formal statistical inference. Model-based clustering allows statistical inference, yet research on...