March 4, 2003
Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy-like and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature beyond which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias that arises from sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure, without having to define an external criterion for the goodness or stability of the clustering. We show that, in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.
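The abstract gives no implementation details, but the trade-off it describes can be illustrated with a small sketch. The snippet below is a hedged illustration, not the authors' algorithm: it runs temperature-controlled soft clustering (soft k-means with inverse temperature `beta`) and subtracts a simple Miller-Madow-style finite-sample penalty from the empirical complexity term I(C;X), so that extra cluster structure only "counts" when it exceeds the expected sampling bias. The function names, the choice of penalty, and the toy data are all assumptions made for illustration.

```python
# Hedged illustration only: temperature-controlled soft clustering with a naive
# finite-sample correction to the complexity term. Not the paper's algorithm.
import numpy as np

def soft_kmeans(X, k, beta, n_iter=100, seed=0):
    """Soft assignments p(c|x) proportional to exp(-beta * ||x - mu_c||^2)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, k) squared distances
        logp = -beta * d2
        logp -= logp.max(axis=1, keepdims=True)                 # numerical stability
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)                       # p(c|x)
        mu = (p.T @ X) / (p.sum(axis=0)[:, None] + 1e-12)       # weighted means
    return mu, p

def corrected_complexity(p):
    """Empirical I(C;X) = H(C) - <H(C|x)>, minus a Miller-Madow-style penalty of
    (k-1)/(2n) nats, a crude stand-in for 'subtracting the sampling bias'."""
    n, k = p.shape
    pc = p.mean(axis=0)
    H_c = -(pc * np.log(pc + 1e-12)).sum()
    H_c_given_x = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    return (H_c - H_c_given_x) - (k - 1) / (2 * n)

# Toy usage: sweep the inverse temperature and keep the solution where the
# corrected complexity stops improving, i.e. where added clusters would
# mostly fit sampling noise.
X = np.random.default_rng(1).normal(size=(300, 2))
for beta in (0.5, 1.0, 2.0, 4.0):
    _, p = soft_kmeans(X, k=10, beta=beta)
    print(beta, corrected_complexity(p))
```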
Similar papers
November 26, 2005
In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem from an information theoretic perspective which avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster "protot...
October 15, 2013
We propose a novel method for clustering data which is grounded in information-theoretic principles and requires no parametric assumptions. Previous attempts to use information theory to define clusters in an assumption-free way are based on maximizing mutual information between data and cluster labels. We demonstrate that this intuition suffers from a fundamental conceptual flaw that causes clustering performance to deteriorate as the amount of data increases. Instead, we re...
September 6, 2017
We show that model-based Bayesian clustering, the probabilistically most systematic approach to the partitioning of data, can be mapped into a statistical physics problem for a gas of particles, and as a result becomes amenable to a detailed quantitative analysis. A central role in the resulting statistical physics framework is played by an entropy function. We demonstrate that there is a relevant parameter regime where mean-field analysis of this function is exact, and that,...
June 2, 2010
Model selection in clustering requires (i) specifying a suitable clustering principle and (ii) controlling the model order complexity by choosing an appropriate number of clusters depending on the noise level in the data. We advocate an information theoretic perspective where the uncertainty in the measurements quantizes the set of data partitionings and, thereby, induces uncertainty in the solution space of clusterings. A clustering model, which can tolerate a higher level of...
February 28, 2017
We present an asymptotic criterion to determine the optimal number of clusters in k-means. We consider k-means as data compression, and propose to adopt the number of clusters that minimizes the estimated description length after compression. Here we report two types of compression ratio based on two ways to quantify the description length of data after compression. This approach further offers a way to evaluate whether clusters obtained with k-means have a hierarchical struc...
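The compression view lends itself to a simple, if crude, sketch: choose the k that minimizes a two-part code length (bits to name each point's cluster, a Gaussian code for the residuals, and a BIC-style cost for the centroids). This is an illustrative assumption and not necessarily either of the two compression ratios the paper reports; `description_length` and the toy data are hypothetical.

```python
# Hedged sketch of "k-means as compression": choose k by a two-part description
# length. Illustrative only; not necessarily the paper's criterion.
import numpy as np
from sklearn.cluster import KMeans

def description_length(X, labels, centers):
    """Bits for cluster indices + Gaussian code for residuals + centroid cost."""
    n, d = X.shape
    k = len(centers)
    resid = X - centers[labels]
    sigma2 = max(resid.var(), 1e-12)                       # shared residual variance
    index_bits = n * np.log2(k) if k > 1 else 0.0
    residual_bits = 0.5 * n * d * np.log2(2 * np.pi * np.e * sigma2)
    model_bits = 0.5 * k * d * np.log2(n)                  # BIC-style parameter cost
    return index_bits + residual_bits + model_bits

X = np.random.default_rng(0).normal(size=(500, 2))
costs = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(description_length(X, km.labels_, km.cluster_centers_))
best_k = 1 + int(np.argmin(costs))
```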
February 6, 2013
Assignment methods are at the heart of many algorithms for unsupervised learning and clustering - in particular, the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differen...
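A minimal way to see the distinction is a single update step for spherical Gaussians with a shared, fixed variance: K-means uses winner-take-all assignments, while EM re-estimates the means with soft responsibilities. The sketch below is an assumed toy comparison, not the paper's analysis; the function name and the `sigma2` parameter are hypothetical.

```python
# Hedged toy contrast between a hard (K-means style) and a soft (EM style)
# assignment step for spherical Gaussians with shared, fixed variance.
import numpy as np

def mean_update(X, mu, sigma2=1.0, hard=False):
    """One assignment + re-estimation step; returns updated cluster means."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)        # squared distances (n, k)
    if hard:
        r = np.zeros_like(d2)
        r[np.arange(len(X)), d2.argmin(axis=1)] = 1.0            # winner-take-all
    else:
        logr = -d2 / (2.0 * sigma2)
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)                        # responsibilities p(c|x)
    return (r.T @ X) / (r.sum(axis=0)[:, None] + 1e-12)          # weighted means
```

As `sigma2` shrinks toward zero the soft responsibilities collapse onto the hard assignments, which is one standard way the two update rules are related.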
May 22, 2018
Unsupervised learning is widely recognized as one of the most important challenges facing machine learning nowadays. However, in spite of hundreds of papers on the topic being published every year, current theoretical understanding and practical implementations of such tasks, in particular of clustering, are very rudimentary. This note focuses on clustering. I claim that the most significant challenge for clustering is model selection. In contrast with other common computa...
May 22, 2012
Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from t...
August 17, 2016
In this paper, we investigate mutual information as a cost function for clustering, and show in which cases hard, i.e., deterministic, clusters are optimal. Using convexity properties of mutual information, we show that certain formulations of the information bottleneck problem are solved by hard clusters. Similarly, hard clusters are optimal for the information-theoretic co-clustering problem that deals with simultaneous clustering of two dependent data sets. If both data se...
February 8, 2017
This paper presents a clustering approach that allows for rigorous statistical error control similar to a statistical test. We develop estimators for both the unknown number of clusters and the clusters themselves. The estimators depend on a tuning parameter alpha which is similar to the significance level of a statistical hypothesis test. By choosing alpha, one can control the probability of overestimating the true number of clusters, while the probability of underestimation...