March 4, 2003
Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy-like and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature beyond which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias that arises from sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure, without having to define an external criterion for the goodness or stability of the clustering. We show that, in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.
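The abstract gives no implementation details, but the trade-off it describes can be illustrated with a small sketch. The snippet below is a hedged illustration, not the authors' algorithm: it runs temperature-controlled soft clustering (soft k-means with inverse temperature `beta`) and subtracts a simple Miller-Madow-style finite-sample penalty from the empirical complexity term I(C;X), so that extra cluster structure only "counts" when it exceeds the expected sampling bias. The function names, the choice of penalty, and the toy data are all assumptions made for illustration.

```python
# Hedged illustration only: temperature-controlled soft clustering with a naive
# finite-sample correction to the complexity term. Not the paper's algorithm.
import numpy as np

def soft_kmeans(X, k, beta, n_iter=100, seed=0):
    """Soft assignments p(c|x) proportional to exp(-beta * ||x - mu_c||^2)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, k) squared distances
        logp = -beta * d2
        logp -= logp.max(axis=1, keepdims=True)                 # numerical stability
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)                       # p(c|x)
        mu = (p.T @ X) / (p.sum(axis=0)[:, None] + 1e-12)       # weighted means
    return mu, p

def corrected_complexity(p):
    """Empirical I(C;X) = H(C) - <H(C|x)>, minus a Miller-Madow-style penalty of
    (k-1)/(2n) nats, a crude stand-in for 'subtracting the sampling bias'."""
    n, k = p.shape
    pc = p.mean(axis=0)
    H_c = -(pc * np.log(pc + 1e-12)).sum()
    H_c_given_x = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    return (H_c - H_c_given_x) - (k - 1) / (2 * n)

# Toy usage: sweep the inverse temperature and keep the solution where the
# corrected complexity stops improving, i.e. where added clusters would
# mostly fit sampling noise.
X = np.random.default_rng(1).normal(size=(300, 2))
for beta in (0.5, 1.0, 2.0, 4.0):
    _, p = soft_kmeans(X, k=10, beta=beta)
    print(beta, corrected_complexity(p))
```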
Similar papers
November 26, 2005
In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem from an information theoretic perspective which avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster "protot...
October 15, 2013
We propose a novel method for clustering data which is grounded in information-theoretic principles and requires no parametric assumptions. Previous attempts to use information theory to define clusters in an assumption-free way are based on maximizing mutual information between data and cluster labels. We demonstrate that this intuition suffers from a fundamental conceptual flaw that causes clustering performance to deteriorate as the amount of data increases. Instead, we re...
September 6, 2017
We show that model-based Bayesian clustering, the probabilistically most systematic approach to the partitioning of data, can be mapped into a statistical physics problem for a gas of particles, and as a result becomes amenable to a detailed quantitative analysis. A central role in the resulting statistical physics framework is played by an entropy function. We demonstrate that there is a relevant parameter regime where mean-field analysis of this function is exact, and that,...
June 2, 2010
Model selection in clustering requires (i) specifying a suitable clustering principle and (ii) controlling the model order complexity by choosing an appropriate number of clusters depending on the noise level in the data. We advocate an information theoretic perspective where the uncertainty in the measurements quantizes the set of data partitionings and, thereby, induces uncertainty in the solution space of clusterings. A clustering model, which can tolerate a higher level of...
February 28, 2017
We present an asymptotic criterion to determine the optimal number of clusters in k-means. We consider k-means as data compression, and propose to adopt the number of clusters that minimizes the estimated description length after compression. Here we report two types of compression ratio based on two ways to quantify the description length of data after compression. This approach further offers a way to evaluate whether clusters obtained with k-means have a hierarchical struc...
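The compression view lends itself to a simple, if crude, sketch: choose the k that minimizes a two-part code length (bits to name each point's cluster, a Gaussian code for the residuals, and a BIC-style cost for the centroids). This is an illustrative assumption and not necessarily either of the two compression ratios the paper reports; `description_length` and the toy data are hypothetical.

```python
# Hedged sketch of "k-means as compression": choose k by a two-part description
# length. Illustrative only; not necessarily the paper's criterion.
import numpy as np
from sklearn.cluster import KMeans

def description_length(X, labels, centers):
    """Bits for cluster indices + Gaussian code for residuals + centroid cost."""
    n, d = X.shape
    k = len(centers)
    resid = X - centers[labels]
    sigma2 = max(resid.var(), 1e-12)                       # shared residual variance
    index_bits = n * np.log2(k) if k > 1 else 0.0
    residual_bits = 0.5 * n * d * np.log2(2 * np.pi * np.e * sigma2)
    model_bits = 0.5 * k * d * np.log2(n)                  # BIC-style parameter cost
    return index_bits + residual_bits + model_bits

X = np.random.default_rng(0).normal(size=(500, 2))
costs = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(description_length(X, km.labels_, km.cluster_centers_))
best_k = 1 + int(np.argmin(costs))
```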
February 6, 2013
Assignment methods are at the heart of many algorithms for unsupervised learning and clustering - in particular, the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differen...
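A minimal way to see the distinction is a single update step for spherical Gaussians with a shared, fixed variance: K-means uses winner-take-all assignments, while EM re-estimates the means with soft responsibilities. The sketch below is an assumed toy comparison, not the paper's analysis; the function name and the `sigma2` parameter are hypothetical.

```python
# Hedged toy contrast between a hard (K-means style) and a soft (EM style)
# assignment step for spherical Gaussians with shared, fixed variance.
import numpy as np

def mean_update(X, mu, sigma2=1.0, hard=False):
    """One assignment + re-estimation step; returns updated cluster means."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)        # squared distances (n, k)
    if hard:
        r = np.zeros_like(d2)
        r[np.arange(len(X)), d2.argmin(axis=1)] = 1.0            # winner-take-all
    else:
        logr = -d2 / (2.0 * sigma2)
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)                        # responsibilities p(c|x)
    return (r.T @ X) / (r.sum(axis=0)[:, None] + 1e-12)          # weighted means
```

As `sigma2` shrinks toward zero the soft responsibilities collapse onto the hard assignments, which is one standard way the two update rules are related.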
May 22, 2018
Unsupervised learning is widely recognized as one of the most important challenges facing machine learning nowadays. However, in spite of hundreds of papers on the topic being published every year, current theoretical understanding and practical implementations of such tasks, in particular of clustering, are very rudimentary. This note focuses on clustering. I claim that the most significant challenge for clustering is model selection. In contrast with other common computa...
May 22, 2012
Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from t...
August 17, 2016
In this paper, we investigate mutual information as a cost function for clustering, and show in which cases hard, i.e., deterministic, clusters are optimal. Using convexity properties of mutual information, we show that certain formulations of the information bottleneck problem are solved by hard clusters. Similarly, hard clusters are optimal for the information-theoretic co-clustering problem that deals with simultaneous clustering of two dependent data sets. If both data se...
February 8, 2017
This paper presents a clustering approach that allows for rigorous statistical error control similar to a statistical test. We develop estimators for both the unknown number of clusters and the clusters themselves. The estimators depend on a tuning parameter alpha which is similar to the significance level of a statistical hypothesis test. By choosing alpha, one can control the probability of overestimating the true number of clusters, while the probability of underestimation...