September 20, 2018
Bayesian model-based clustering is a widely applied procedure for discovering groups of related observations in a dataset. These approaches use Bayesian mixture models, estimated with MCMC, which provide posterior samples of the model parameters and clustering partition. While inference on model parameters is well established, inference on the clustering partition is less developed. A new method is developed for estimating the optimal partition from the pairwise posterior sim...
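The abstract is truncated, but the object it refers to is the pairwise posterior similarity matrix: the fraction of MCMC draws in which two observations share a cluster. As an illustrative sketch (not the paper's new method), the following computes that matrix from partition draws and selects the sampled partition closest to it under a Binder-type loss; function names are hypothetical.

```python
import numpy as np

def similarity_matrix(draws):
    """Posterior co-clustering probabilities from MCMC partition draws.

    draws: list of (n,) integer label arrays, one per MCMC draw; entry
    (i, j) of the result is the fraction of draws in which items i and j
    are assigned to the same cluster.
    """
    A = np.stack([(p[:, None] == p[None, :]).astype(float) for p in draws])
    return A.mean(axis=0)

def best_sampled_partition(draws, pi):
    """Pick the sampled partition whose co-clustering matrix is closest
    (element-wise) to the posterior similarity matrix pi; this is one
    simple Binder-type criterion, not the paper's estimator."""
    def loss(p):
        A = (p[:, None] == p[None, :]).astype(float)
        return np.abs(A - pi).sum()
    return min(draws, key=loss)
```

For example, if 8 of 10 draws put items {0,1} and {2,3} together, the similarity matrix records 0.8 for those pairs and the majority partition is selected.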
January 13, 2015
In this work we explain how to properly use mean-field methods to solve the inverse Ising problem when the phase space is clustered, that is, when many states are present. The clustering of the phase space can occur for many reasons, e.g. when a system undergoes a phase transition. Mean-field methods for the inverse Ising problem are typically used without taking into account the possible clustered structure of the input configurations and may lead to very bad inference (for instanc...
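For context, the naive mean-field estimator the abstract alludes to inverts the connected correlation matrix of the spin configurations; this is the standard formula $J = -C^{-1}$ (off-diagonal), not the paper's corrected procedure. A minimal sketch:

```python
import numpy as np

def nmf_couplings(S):
    """Naive mean-field inverse Ising: J_ij = -(C^{-1})_ij for i != j,
    where C is the connected correlation matrix of the spins.

    S: (n_samples, n_spins) array of +/-1 configurations. As the
    abstract warns, pooling configurations drawn from several states
    (a clustered phase space) can make this estimate badly wrong; C
    should then be computed within each state separately.
    """
    C = np.cov(S, rowvar=False)   # connected correlations C_ij = <s_i s_j> - <s_i><s_j>
    J = -np.linalg.inv(C)
    np.fill_diagonal(J, 0.0)      # keep only pairwise couplings
    return J
```

On independent spins (a single trivial state) the inferred couplings correctly shrink toward zero as the sample size grows.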
May 24, 2019
We propose a score function for Bayesian clustering. The function is parameter free and captures the interplay between the within-cluster variance and the between-cluster entropy of a clustering. It can be used to choose the number of clusters in well-established clustering methods such as hierarchical clustering or the $K$-means algorithm.
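The excerpt does not give the score's functional form, so the sketch below only illustrates the two ingredients it names, computed for a $K$-means clustering: the average within-cluster variance (lower is better) and the entropy of the cluster-size distribution (higher means more balanced clusters). How to combine them parameter-free is the paper's contribution; `kmeans` here is plain Lloyd's algorithm as a stand-in for any standard implementation.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm (stand-in for any standard K-means)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def score_ingredients(X, labels):
    """The two quantities the abstract's score trades off:
    average within-cluster variance and cluster-size entropy.
    (The paper's exact combination is not given in the excerpt.)"""
    within = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                 for j in np.unique(labels)) / len(X)
    p = np.bincount(labels) / len(X)
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return within, entropy
```

Scanning $k$ and inspecting how the two quantities move against each other is the usage pattern such a score supports.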
August 1, 2014
We present a thermodynamic theory for a generic population of $M$ individuals distributed into $N$ groups (clusters). We construct the ensemble of all distributions with fixed $M$ and $N$, introduce a selection functional that embodies the physics that governs the population, and obtain the distribution that emerges in the scaling limit as the most probable among all distributions consistent with the given physics. We develop the thermodynamics of the ensemble and establish a...
May 18, 2017
We analyze the clustering problem through a flexible probabilistic model that aims to identify an optimal partition of the sample $X_1, \ldots, X_n$. We perform exact clustering with high probability using a convex semidefinite estimator that can be interpreted as a corrected, relaxed version of $K$-means. The estimator is analyzed in a non-asymptotic framework and shown to be optimal or near-optimal in recovering the partition. Furthermore, its performance is shown to be adaptive ...
May 17, 2008
Data clustering, including problems such as finding network communities, can be put into a systematic framework by means of a Bayesian approach. The application of Bayesian approaches to real problems can be, however, quite challenging. In most cases the solution is explored via Monte Carlo sampling or variational methods. Here we work further on the application of variational methods to clustering problems. We introduce generative models based on a hidden group structure and...
June 15, 2018
Variable clustering is important for exploratory analysis. However, only a few dedicated methods for variable clustering with the Gaussian graphical model have been proposed. More severely, small insignificant partial correlations due to noise can dramatically change the clustering result when evaluating, for example, with the Bayesian Information Criterion (BIC). In this work, we address this issue by proposing a Bayesian model that accounts for negligible small, but no...
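To make the noise issue concrete: in a Gaussian graphical model the partial correlations come from the inverse covariance (precision) matrix, and finite-sample noise leaves small nonzero entries between unrelated variables. The sketch below computes the standard partial-correlation matrix and applies a crude hard threshold, which is only a stand-in for the abstract's Bayesian treatment; the threshold `eps` is an illustrative assumption.

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation matrix from the empirical precision matrix:
    pc_ij = -P_ij / sqrt(P_ii * P_jj), with P = inv(cov(X))."""
    P = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(P))
    pc = -P / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc

def zero_negligible(pc, eps=0.1):
    """Hard-threshold tiny noise-driven entries before clustering the
    variables; a crude stand-in for the paper's Bayesian model."""
    out = pc.copy()
    out[np.abs(out) < eps] = 0.0
    return out
```

With two independent blocks of correlated variables, the within-block partial correlations are large while the cross-block entries are pure sampling noise that the threshold removes.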
January 3, 2020
Counting the number of clusters when the clusters overlap significantly is a challenging problem in machine learning. We argue that a purely mathematical quantum theory, formulated using the path integral technique, leads, when applied to non-physics modeling, to non-physics quantum theories that are statistical in nature. We show that a quantum theory can be a more robust statistical theory for separating data to count overlapping clusters. The theory is also confirmed from da...
March 31, 2020
In many modern applications, there is interest in analyzing enormous data sets that cannot be easily moved across computers or loaded into memory on a single computer. In such settings, it is very common to be interested in clustering. Existing distributed clustering algorithms are mostly distance or density based without a likelihood specification, precluding the possibility of formal statistical inference. Model-based clustering allows statistical inference, yet research on...
June 2, 2010
Model selection in clustering requires (i) to specify a suitable clustering principle and (ii) to control the model order complexity by choosing an appropriate number of clusters depending on the noise level in the data. We advocate an information theoretic perspective where the uncertainty in the measurements quantizes the set of data partitionings and, thereby, induces uncertainty in the solution space of clusterings. A clustering model, which can tolerate a higher level of...