ID: physics/0303011

How many clusters? An information theoretic perspective

March 4, 2003


Similar papers (page 2)

Truecluster: robust scalable clustering with model selection

January 2, 2006

89% Match
Jens Oehlschlägel
Artificial Intelligence

Data-based classification is fundamental to most branches of science. While recent years have brought enormous progress in various areas of statistical computing and clustering, some general challenges in clustering remain: model selection, robustness, and scalability to large datasets. We consider the important problem of deciding on the optimal number of clusters, given an arbitrary definition of space and clusteriness. We show how to construct a cluster information criteri...
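The truncated abstract does not spell out the cluster information criterion itself; as a hedged illustration of the same general idea (choosing the number of clusters by minimizing a penalized model-selection score), here is a minimal sketch using the BIC of a Gaussian mixture. This is an assumed stand-in, not the Truecluster construction.

```python
# Minimal sketch, NOT the Truecluster criterion: pick the number of clusters k
# by minimizing the Bayesian Information Criterion of a fitted Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs along the diagonal of a 2-D space.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 2.0, 4.0)])

bic = {}
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bic[k] = gm.bic(X)            # lower BIC = better penalized fit

print(min(bic, key=bic.get))      # expected to recover 3 on this toy data
```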


Info-Clustering: A Mathematical Theory for Data Clustering

May 4, 2016

88% Match
Chung Chan, Ali Al-Bashabsheh, Qiaoqiao Zhou, ... , Tie Liu
Information Theory
Information Theory
Genomics
Neurons and Cognition

We formulate an info-clustering paradigm based on a multivariate information measure, called multivariate mutual information, that naturally extends Shannon's mutual information between two random variables to the multivariate case involving more than two random variables. With proper model reductions, we show that the paradigm can be applied to study the human genome and connectome in a more meaningful way than the conventional algorithmic approach. Not only can info-cluster...
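As a hedged, purely illustrative aside: one common way to extend mutual information beyond two variables is the total correlation, TC(X_1,...,X_n) = sum_i H(X_i) - H(X_1,...,X_n). The multivariate mutual information used in the paper is a different, partition-based quantity, so the sketch below (empirical entropies of discrete variables in plain Python/NumPy) only illustrates the flavor of dependence among more than two random variables.

```python
# Hedged sketch: total correlation of discrete variables from an empirical
# joint sample. This is one multivariate extension of mutual information, not
# the partition-based multivariate mutual information of Chan et al.
import numpy as np
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (bits) of a sequence of hashable symbols."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def total_correlation(columns):
    """columns: list of equal-length 1-D arrays, one per random variable."""
    joint = list(zip(*columns))
    return sum(entropy(c) for c in columns) - entropy(joint)

rng = np.random.default_rng(1)
z = rng.integers(0, 2, size=5000)          # shared latent bit
x1 = z ^ (rng.random(5000) < 0.1)          # noisy copy of z
x2 = z ^ (rng.random(5000) < 0.1)          # another noisy copy of z
x3 = rng.integers(0, 2, size=5000)         # independent variable
print(total_correlation([x1, x2, x3]))     # dominated by the x1-x2 dependence
```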


Stability of Information in the Heat Flow Clustering

May 2, 2024

88% Match
Brian Weber
Information Theory
Statistical Mechanics
Information Theory

Clustering methods must be tailored to the dataset they operate on, as there is no objective or universal definition of "cluster," yet the arbitrariness in the clustering method must nevertheless be minimized. This paper develops a quantitative "stability" method of determining clusters, where stable or persistent clustering signals are used to indicate that real structures have been identified in the underlying dataset. This method is based on modulating clustering methods by cont...
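The heat-flow construction itself cannot be recovered from the truncated abstract; the sketch below shows a generic, widely used stability heuristic instead (cluster two random subsamples, score their agreement on the overlap with the adjusted Rand index), offered only to illustrate what a "stable clustering signal" can mean in practice.

```python
# Hedged sketch: subsampling stability of k-means for a candidate k.
# This is a generic stability heuristic, not the paper's heat-flow method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=20, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(n_pairs):
        a = rng.choice(n, size=int(frac * n), replace=False)
        b = rng.choice(n, size=int(frac * n), replace=False)
        la = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[a])
        lb = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[b])
        # Compare labels only on the points present in both subsamples.
        common, ia, ib = np.intersect1d(a, b, return_indices=True)
        scores.append(adjusted_rand_score(la[ia], lb[ib]))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (150, 2)) for c in (0.0, 3.0)])
print(max(range(2, 6), key=lambda k: stability(X, k)))   # 2 should be most stable
```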


Towards Automatic Clustering Analysis using Traces of Information Gain: The InfoGuide Method

January 23, 2020

88% Match
Paulo Rocha, Diego Pinheiro, ... , Carmelo Bastos-Filho
Machine Learning
Machine Learning

Clustering analysis has become a ubiquitous information retrieval tool in a wide range of domains, but a more automatic framework is still lacking. Though internal metrics are key to the successful retrieval of clusters, their effectiveness on real-world datasets remains not fully understood, mainly because of the unrealistic assumptions they make about the underlying datasets. We hypothesized that capturing traces of information gain between increasingly complex clusterin...
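The InfoGuide procedure is not reproducible from the truncated abstract; as a loose, hedged proxy for "traces of information gain between increasingly complex clusterings," the sketch below tracks the normalized mutual information between k-means solutions at consecutive k and inspects where that trace levels off.

```python
# Hedged sketch (NOT the InfoGuide algorithm): information-theoretic agreement
# between clusterings of increasing complexity, as a crude stopping signal.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (200, 2)) for c in [(0, 0), (4, 0), (0, 4)]])

labels = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
          for k in range(2, 9)}
trace = {k: normalized_mutual_info_score(labels[k], labels[k + 1])
         for k in range(2, 8)}
print(trace)   # inspect where the trace of agreement between k and k+1 levels off
```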


Too Much Information Kills Information: A Clustering Perspective

September 16, 2020

88% Match
Yicheng Xu, Vincent Chau, Chenchen Wu, Yong Zhang, ... , Yifei Zou
Machine Learning
Machine Learning

Clustering is one of the most fundamental tools in artificial intelligence, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach to variance-based k-clustering tasks, which include the widely known k-means clustering. The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only. With certain assumptions, the resulting clu...
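The paper's concrete sampling scheme and guarantees are not reproduced here; the sketch below only illustrates the general idea of deciding from a sample: fit k-means on a small random subset, then assign every point to its nearest learned centroid.

```python
# Hedged sketch of sampling-based k-means: cluster a random subset, then extend
# the decision to all points. Not the authors' specific estimator or analysis.
import numpy as np
from sklearn.cluster import KMeans

def sampled_kmeans(X, k, sample_size, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=sample_size, replace=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    # Assign every point (sampled or not) to its nearest centroid from the sample fit.
    return km.predict(X), km.cluster_centers_

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (5000, 2)) for c in [(0, 0), (5, 5)]])
labels, centers = sampled_kmeans(X, k=2, sample_size=200)
```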


On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset

October 31, 2018

88% Match
Amber Srivastava, Mayank Baranwal, Srinivasa Salapaka
Machine Learning
Artificial Intelligence
Machine Learning

Typically, clustering algorithms provide clustering solutions with a prespecified number of clusters. The lack of a priori knowledge of the true number of underlying clusters in the dataset makes it important to have a metric for comparing clustering solutions with different numbers of clusters. This article quantifies a notion of persistence of clustering solutions that enables comparing solutions with different numbers of clusters. The persistence relates to the range of data-r...
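The paper defines persistence over a range of a data-resolution parameter; as an analogous, easy-to-compute illustration (an assumption, not the authors' definition), the sketch below reads a simple persistence off a dendrogram: the cluster count whose interval of merge heights is widest.

```python
# Hedged sketch: dendrogram-based "persistence" of each cluster count, i.e. the
# width of the merge-height interval over which that count survives.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in [(0, 0), (3, 0), (0, 3)]])

Z = linkage(X, method="ward")
heights = Z[:, 2]                  # merge distances, nondecreasing
n = len(X)
# After the i-th merge (0-indexed), n - i - 1 clusters remain; the gap between
# successive merge heights is how long that cluster count "persists".
persistence = {n - i - 1: heights[i + 1] - heights[i] for i in range(len(heights) - 1)}
print(max(persistence, key=persistence.get))   # expected to print 3 on this toy example
```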


ExClus: Explainable Clustering on Low-dimensional Data Representations

November 4, 2021

88% Match
Xander Vankwikelberge, Bo Kang, ... , Jefrey Lijffijt
Machine Learning

Dimensionality reduction and clustering techniques are frequently used to analyze complex data sets, but their results are often not easy to interpret. We consider how to support users in interpreting apparent cluster structure on scatter plots where the axes are not directly interpretable, such as when the data is projected onto a two-dimensional space using a dimensionality-reduction method. Specifically, we propose a new method to compute an interpretable clustering automa...
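ExClus's information-theoretic selection of explanatory attributes is not reproduced here; the sketch below only illustrates the general workflow the abstract describes: project to a non-interpretable 2-D embedding, cluster there, then describe each cluster in terms of the original features.

```python
# Hedged sketch of "project, cluster, explain": clusters found in a PCA embedding
# are summarized by the original features that deviate most from the global mean.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris()
X, names = data.data, np.array(data.feature_names)

emb = PCA(n_components=2).fit_transform(X)        # axes not directly interpretable
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)

mu, sigma = X.mean(axis=0), X.std(axis=0)
for c in np.unique(labels):
    z = (X[labels == c].mean(axis=0) - mu) / sigma  # standardized deviation per feature
    top = np.argsort(-np.abs(z))[:2]
    print(f"cluster {c}: " + ", ".join(f"{names[i]} ({z[i]:+.1f} sd)" for i in top))
```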


Phase transitions and optimal algorithms in high-dimensional Gaussian mixture clustering

October 10, 2016

88% Match
Thibault Lesieur, Caterina De Bacco, Jess Banks, Florent Krzakala, ... , Lenka Zdeborová
Machine Learning
Disordered Systems and Neural Networks
Information Theory
Information Theory

We consider the problem of Gaussian mixture clustering in the high-dimensional limit where the data consists of $m$ points in $n$ dimensions, $n,m \rightarrow \infty$ and $\alpha = m/n$ stays finite. Using exact but non-rigorous methods from statistical physics, we determine the critical value of $\alpha$ and the distance between the clusters at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. We also determin...
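The paper's exact thresholds and its optimal (message-passing) algorithm are not reproduced here; the numerical sketch below only shows the qualitative picture with a simple spectral estimator: for two symmetric Gaussian clusters, the overlap with the true memberships stays near chance for small alpha = m/n and becomes positive as alpha grows.

```python
# Hedged numerical sketch: two symmetric Gaussian clusters in n dimensions,
# m = alpha * n samples, membership estimated from the sign of the leading
# principal component. Illustrates a transition from chance-level to partial
# recovery as alpha grows; not the paper's exact analysis.
import numpy as np

def overlap(alpha, n=500, rho=1.0, seed=0):
    rng = np.random.default_rng(seed)
    m = int(alpha * n)
    s = rng.choice([-1.0, 1.0], size=m)                  # true memberships
    v = rng.normal(size=n) / np.sqrt(n)                  # cluster direction, |v| ~ 1
    X = np.outer(s, rho * v) + rng.normal(size=(m, n))   # signal + noise
    _, vecs = np.linalg.eigh(X.T @ X / m)                # leading eigenvector last
    shat = np.sign(X @ vecs[:, -1])
    return abs(np.mean(s * shat))                        # 0 = chance, 1 = perfect

for alpha in (0.5, 1.0, 2.0, 4.0):
    print(alpha, round(overlap(alpha), 2))
```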


A Data-Driven Approach to Estimating the Number of Clusters in Hierarchical Clustering

August 16, 2016

88% Match
Antoine Zambelli
Quantitative Methods
Machine Learning
Methodology

We propose two new methods for estimating the number of clusters in a hierarchical clustering framework, in the hope of creating a fully automated process with no human intervention. The methods are completely data-driven, require no input from the researcher, and as such are fully automated. They are easy to implement and not computationally intensive. We analyze performance on several simulated data sets and the Biobase Gene Expression Set, comparing o...
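The two proposed methods are not specified in the truncated abstract; as a hedged baseline in the same spirit (data-driven, no researcher input), the sketch below cuts a hierarchical dendrogram at each candidate number of clusters and keeps the cut with the best silhouette score.

```python
# Hedged sketch of a standard data-driven baseline, not the paper's methods:
# choose k by silhouette score over dendrogram cuts.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (80, 2)) for c in [(0, 0), (4, 0), (2, 4), (6, 4)]])

Z = linkage(X, method="average")
scores = {k: silhouette_score(X, fcluster(Z, t=k, criterion="maxclust"))
          for k in range(2, 10)}
print(max(scores, key=scores.get))   # expected to recover 4 on this toy data
```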


An Information-Theoretic External Cluster-Validity Measure

December 12, 2012

88% Match
Byron E Dom
Machine Learning
Machine Learning

In this paper we propose a measure of clustering quality or accuracy that is appropriate in situations where it is desirable to evaluate a clustering algorithm by comparing the clusters it produces with "ground truth" consisting of classes assigned to the patterns manually or by some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also has the characteristic of allowing clusterings with different numbers...
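The paper's own measure is not reproduced here; for orientation, the sketch below computes related standard information-theoretic external validity scores available in scikit-learn, each comparing predicted cluster labels against ground-truth classes and tolerating different numbers of clusters and classes.

```python
# Hedged sketch: common information-theoretic external cluster-validity scores
# (related to, but not the same as, the measure proposed in the paper).
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score,
                             homogeneity_completeness_v_measure)

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # ground-truth classes
pred  = [0, 0, 1, 1, 1, 1, 2, 2, 2]      # clustering with one misassigned point

print(normalized_mutual_info_score(truth, pred))
print(adjusted_mutual_info_score(truth, pred))        # chance-corrected
print(homogeneity_completeness_v_measure(truth, pred))
```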
