Cohesion and Repulsion in Bayesian Distance Clustering

July 12, 2021

Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means

November 26, 2023

85% Match

Supratik Basu, Jyotishka Ray Choudhury, ..., Swagatam Das

Machine Learning

Methodology

Clustering stands as one of the most prominent challenges within the realm of unsupervised machine learning. Among the array of centroid-based clustering algorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes center stage as one of the extensively employed techniques in the literature. Nonetheless, both $k$-means and its variants grapple with noteworthy limitations. These encompass a heavy reliance on initial cluster centroids, susceptibility to conve...

Dirichlet Process Parsimonious Mixtures for clustering

January 14, 2015

85% Match

Faicel Chamroukhi, Marius Bartcus, Hervé Glotin

Machine Learning

Methodology

The parsimonious Gaussian mixture models, which exploit an eigenvalue decomposition of the group covariance matrices of the Gaussian mixture, have shown their success in particular in cluster analysis. Their estimation is in general performed by maximum likelihood estimation and has also been considered from a parametric Bayesian perspective. We propose new Dirichlet Process Parsimonious mixtures (DPPM) which represent a Bayesian nonparametric formulation of these parsimoniou...

Clustering - What Both Theoreticians and Practitioners are Doing Wrong

May 22, 2018

85% Match

Shai Ben-David

Machine Learning

Unsupervised learning is widely recognized as one of the most important challenges facing machine learning nowadays. However, in spite of hundreds of papers on the topic being published every year, current theoretical understanding and practical implementations of such tasks, in particular of clustering, is very rudimentary. This note focuses on clustering. I claim that the most significant challenge for clustering is model selection. In contrast with other common computa...

Inference of global clusters from locally distributed data

January 4, 2010

85% Match

XuanLong Nguyen

Methodology

Machine Learning

We consider the problem of analyzing the heterogeneity of clustering distributions for multiple groups of observed data, each of which is indexed by a covariate value, and inferring global clusters arising from observations aggregated over the covariate domain. We propose a novel Bayesian nonparametric method reposing on the formalism of spatial modeling and a nested hierarchy of Dirichlet processes. We provide an analysis of the model properties, relating and contrasting the...

Parameter-wise co-clustering for high-dimensional data

August 25, 2018

85% Match

M. P. B. Gallaugher, C. Biernacki, P. D. McNicholas

Machine Learning

In recent years, data dimensionality has increasingly become a concern, leading to many parameter and dimension reduction techniques being proposed in the literature. A parameter-wise co-clustering model, for data modelled via continuous random variables, is presented. The proposed model, although allowing more flexibility, still maintains the very high degree of parsimony achieved by traditional co-clustering. A stochastic expectation-maximization (SEM) algorithm along with ...

Demystifying Information-Theoretic Clustering

October 15, 2013

85% Match

Greg Ver Steeg, Aram Galstyan, ..., Simon DeDeo

Machine Learning

Information Theory

Data Analysis, Statistics an...

Machine Learning

We propose a novel method for clustering data which is grounded in information-theoretic principles and requires no parametric assumptions. Previous attempts to use information theory to define clusters in an assumption-free way are based on maximizing mutual information between data and cluster labels. We demonstrate that this intuition suffers from a fundamental conceptual flaw that causes clustering performance to deteriorate as the amount of data increases. Instead, we re...

Repulsion, Chaos and Equilibrium in Mixture Models

June 19, 2023

85% Match

Andrea Cremaschi, Timothy M. Wertz, Maria De Iorio

Methodology

Statistics Theory

Mixture models are commonly used in applications with heterogeneity and overdispersion in the population, as they allow the identification of subpopulations. In the Bayesian framework, this entails the specification of suitable prior distributions for the weights and location parameters of the mixture. Widely used are Bayesian semi-parametric models based on mixtures with infinite or random number of components, such as Dirichlet process mixtures or mixtures with random numbe...

A Bayesian non-parametric method for clustering high-dimensional binary data

March 8, 2016

85% Match

Tapesh Santra

Applications

Machine Learning

In many real-life problems, objects are described by a large number of binary features. For instance, documents are characterized by the presence or absence of certain keywords; cancer patients are characterized by the presence or absence of certain mutations, etc. In such cases, grouping together similar objects/profiles based on such high-dimensional binary features is desirable, but challenging. Here, I present a Bayesian non-parametric algorithm for clustering high dimensional binar...

A Random Finite Set Model for Data Clustering

March 14, 2017

84% Match

Dinh Phung, Ba-Ngu Vo

Machine Learning

The goal of data clustering is to partition data points into groups to minimize a given objective function. While most existing clustering algorithms treat each data point as a vector, in many applications each datum is not a vector but a point pattern or a set of points. Moreover, many existing clustering methods require the user to specify the number of clusters, which is not available in advance. This paper proposes a new class of models for data clustering that addresses se...

Replica analysis of Bayesian data clustering

October 5, 2018

84% Match

Alexander Mozeika, Anthony CC Coolen

Disordered Systems and Neura...

We use statistical mechanics to study model-based Bayesian data clustering. In this approach, each partition of the data into clusters is regarded as a microscopic system state, the negative data log-likelihood gives the energy of each state, and the data set realisation acts as disorder. Optimal clustering corresponds to the ground state of the system, and is hence obtained from the free energy via a low `temperature' limit. We assume that for large sample sizes the free ene...
