Mean-field theory of Bayesian clustering

An Experimental Comparison of Several Clustering and Initialization Methods

January 30, 2013

Marina Meila, David Heckerman

Machine Learning

We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation-Maximization (EM) algorithm, a winner-take-all version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root node, using high-dimensional discrete-variable data sets (both real and synthetic)...

Inferring a property of a large system from a small number of samples

December 1, 2021

Damián G. Hernández, Inés Samengo

Data Analysis, Statistics an...

Statistical Mechanics

Mathematical Physics

Inferring the value of a property of a large stochastic system is a difficult task when the number of samples is insufficient to reliably estimate the probability distribution. The Bayesian estimator of the property of interest requires the knowledge of the prior distribution, and in many situations, it is not clear which prior should be used. Several estimators have been developed so far, in which the proposed prior was individually tailored for each property of interest; su...

Physics-constrained Bayesian inference of state functions in classical density-functional theory

October 7, 2020

Peter Yatsyshin, Serafim Kalliadasis, Andrew B. Duncan

Statistical Mechanics

Data Analysis, Statistics an...

Machine Learning

We develop a novel data-driven approach to the inverse problem of classical statistical mechanics: given experimental data on the collective motion of a classical many-body system, how does one characterise the free energy landscape of that system? By combining non-parametric Bayesian inference with physically-motivated constraints, we develop an efficient learning algorithm which automates the construction of approximate free energy functionals. In contrast to optimisation-b...

Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means

November 26, 2023

Supratik Basu, Jyotishka Ray Choudhury, ... , Swagatam Das

Machine Learning

Methodology

Clustering stands as one of the most prominent challenges within the realm of unsupervised machine learning. Among the array of centroid-based clustering algorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes center stage as one of the extensively employed techniques in the literature. Nonetheless, both $k$-means and its variants grapple with noteworthy limitations. These encompass a heavy reliance on initial cluster centroids, susceptibility to conve...

Significance analysis and statistical mechanics: an application to clustering

September 13, 2010

Marta Łuksza, Michael Lässig, Johannes Berg

Molecular Networks

Statistical Mechanics

Quantitative Methods

This paper addresses the statistical significance of structures in random data: given a set of vectors and a measure of mutual similarity, how likely is it that a subset of these vectors forms a cluster with enhanced similarity among its elements? The computation of this cluster p-value for randomly distributed vectors is mapped onto a well-defined problem of statistical mechanics. We solve this problem analytically, establishing a connection between the physics of quenched disorder...

Dirichlet Process Parsimonious Mixtures for clustering

January 14, 2015

Faicel Chamroukhi, Marius Bartcus, Hervé Glotin

Machine Learning

Methodology

The parsimonious Gaussian mixture models, which exploit an eigenvalue decomposition of the group covariance matrices of the Gaussian mixture, have proven successful, particularly in cluster analysis. They are generally estimated by maximum likelihood, and estimation has also been considered from a parametric Bayesian perspective. We propose new Dirichlet Process Parsimonious mixtures (DPPM) which represent a Bayesian nonparametric formulation of these parsimoniou...

Data clustering and noise undressing of correlation matrices

March 14, 2000

M. Marsili (INFM, Trieste-SISSA)

Statistical Mechanics

Disordered Systems and Neura...

Adaptation and Self-Organizi...

We discuss a new approach to data clustering. We find that maximum likelihood leads naturally to a Hamiltonian of Potts variables which depends on the correlation matrix and whose low-temperature behavior describes the correlation structure of the data. For random, uncorrelated data sets no correlation structure emerges. On the other hand, for data sets with a built-in cluster structure, the method is able to detect and efficiently recover that structure. Finally we apply the...

Thermodynamics of the Minimum Description Length on Community Detection

June 19, 2018

Juan Ignacio Perotti, Claudio Juan Tessone, ... , Guido Caldarelli

Physics and Society

Disordered Systems and Neura...

Social and Information Netwo...

Data Analysis, Statistics an...

Modern statistical modeling is an important complement to the more traditional approach of physics where Complex Systems are studied by means of extremely simple idealized models. The Minimum Description Length (MDL) is a principled approach to statistical modeling combining Occam's razor with Information Theory for the selection of models providing the most concise descriptions. In this work, we introduce the Boltzmannian MDL (BMDL), a formalization of the principle of MDL w...

Structures of Spurious Local Minima in $k$-means

February 16, 2020

Wei Qian, Yuqian Zhang, Yudong Chen

Machine Learning

Optimization and Control

Statistics Theory

$k$-means clustering is a fundamental problem in unsupervised learning. The problem concerns finding a partition of the data points into $k$ clusters such that the within-cluster variation is minimized. Despite its importance and wide applicability, a theoretical understanding of the $k$-means problem has not been completely satisfactory. Existing algorithms with theoretical performance guarantees often rely on sophisticated (sometimes artificial) algorithmic techniques and r...

Random Partition Models for Microclustering Tasks

April 4, 2020

Brenda Betancourt, Giacomo Zanella, Rebecca C. Steorts

Methodology

Statistics Theory

Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points -- the microclustering property. Motivated by these ...
