ID: 1810.02627

Replica analysis of Bayesian data clustering

October 5, 2018

View on ArXiv
Alexander Mozeika, Anthony CC Coolen
Condensed Matter
Disordered Systems and Neura...

We use statistical mechanics to study model-based Bayesian data clustering. In this approach, each partition of the data into clusters is regarded as a microscopic system state, the negative data log-likelihood gives the energy of each state, and the data set realisation acts as disorder. Optimal clustering corresponds to the ground state of the system, and is hence obtained from the free energy via a low `temperature' limit. We assume that for large sample sizes the free energy density is self-averaging, and we use the replica method to compute the asymptotic free energy density. The main order parameter in the resulting (replica symmetric) theory, the distribution of the data over the clusters, satisfies a self-consistent equation which can be solved by a population dynamics algorithm. From this order parameter one computes the average free energy, and all relevant macroscopic characteristics of the problem. The theory describes numerical experiments perfectly, and gives a significant improvement over the mean-field theory that was used to study this model in past.

Similar papers 1

Alexander Mozeika, Anthony CC Coolen
Disordered Systems and Neura...
Data Analysis, Statistics an...

We show that model-based Bayesian clustering, the probabilistically most systematic approach to the partitioning of data, can be mapped into a statistical physics problem for a gas of particles, and as a result becomes amenable to a detailed quantitative analysis. A central role in the resulting statistical physics framework is played by an entropy function. We demonstrate that there is a relevant parameter regime where mean-field analysis of this function is exact, and that,...

Susanne Still, William Bialek
Data Analysis, Statistics an...
General Physics

Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering crite...

Clara Grazian
Methodology

Clustering is an important task in many areas of knowledge: medicine and epidemiology, genomics, environmental science, economics, visual sciences, among others. Methodologies to perform inference on the number of clusters have often been proved to be inconsistent, and introducing a dependence structure among the clusters implies additional difficulties in the estimation process. In a Bayesian setting, clustering is performed by considering the unknown partition as a random o...

Andrea Cremaschi, Timothy M. Wertz, Iorio Maria De
Methodology
Statistics Theory
Statistics Theory

Mixture models are commonly used in applications with heterogeneity and overdispersion in the population, as they allow the identification of subpopulations. In the Bayesian framework, this entails the specification of suitable prior distributions for the weights and location parameters of the mixture. Widely used are Bayesian semi-parametric models based on mixtures with infinite or random number of components, such as Dirichlet process mixtures or mixtures with random numbe...

Repulsive Mixtures

April 24, 2012

85% Match
Francesca Petralia, Vinayak Rao, David B. Dunson
Methodology

Discrete mixture models are routinely used for density estimation and clustering. While conducting inferences on the cluster-specific parameters, current frequentist and Bayesian methods often encounter problems when clusters are placed too close together to be scientifically meaningful. Current Bayesian practice generates component-specific parameters independently from a common prior, which tends to favor similar components and often leads to substantial probability assigne...

Abhinav Natarajan, Iorio Maria De, Andreas Heinecke, ... , Glenn Simon
Methodology

Clustering in high-dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack probabilistic interpretation and rely on heuristics for estimation of the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale and devising algorithms that are able to effectively explore the posterior space is an open problem. Based on recent developments in Bayesian ...

Panagiotis Papastamoulis
Methodology
Computation

We consider the problem of inferring an unknown number of clusters in replicated multinomial data. Under a model based clustering point of view, this task can be treated by estimating finite mixtures of multinomial distributions with or without covariates. Both Maximum Likelihood (ML) as well as Bayesian estimation are taken into account. Under a Maximum Likelihood approach, we provide an Expectation--Maximization (EM) algorithm which exploits a careful initialization procedu...

Themis Matsoukas
Populations and Evolution

We present a thermodynamic theory for a generic population of $M$ individuals distributed into $N$ groups (clusters). We construct the ensemble of all distributions with fixed $M$ and $N$, introduce a selection functional that embodies the physics that governs the population, and obtain the distribution that emerges in the scaling limit as the most probable among all distributions consistent with the given physics. We develop the thermodynamics of the ensemble and establish a...

Ketong Wang, Michael D. Porter
Methodology

Bayesian model-based clustering is a widely applied procedure for discovering groups of related observations in a dataset. These approaches use Bayesian mixture models, estimated with MCMC, which provide posterior samples of the model parameters and clustering partition. While inference on model parameters is well established, inference on the clustering partition is less developed. A new method is developed for estimating the optimal partition from the pairwise posterior sim...

Sara Wade, Zoubin Ghahramani
Methodology

Clustering is widely studied in statistics and machine learning, with applications in a variety of fields. As opposed to classical algorithms which return a single clustering solution, Bayesian nonparametric models provide a posterior over the entire space of partitions, allowing one to assess statistical properties, such as uncertainty on the number of clusters. However, an important problem is how to summarize the posterior; the huge dimension of partition space and difficu...