Similar papers 2
January 2, 2006
Data-based classification is fundamental to most branches of science. While recent years have brought enormous progress in various areas of statistical computing and clustering, some general challenges in clustering remain: model selection, robustness, and scalability to large datasets. We consider the important problem of deciding on the optimal number of clusters, given an arbitrary definition of space and clusteriness. We show how to construct a cluster information criteri...
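The criterion itself is cut off above; as a hedged sketch of the general idea of choosing the number of clusters with an information criterion (BIC of a Gaussian mixture via scikit-learn is used here as an illustrative stand-in, not the paper's own criterion):

# Illustrative stand-in, not the paper's criterion: pick the number of
# clusters k that minimizes the BIC of a fitted Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0.0, 3.0, 6.0)])

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 8)}
best_k = min(bic, key=bic.get)
print(best_k)  # close to 3 for this toy data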
May 4, 2016
We formulate an info-clustering paradigm based on a multivariate information measure, called multivariate mutual information, that naturally extends Shannon's mutual information between two random variables to the multivariate case involving more than two random variables. With proper model reductions, we show that the paradigm can be applied to study the human genome and connectome in a more meaningful way than the conventional algorithmic approach. Not only can info-cluster...
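The abstract is truncated; for reference, one standard formulation of a multivariate mutual information for a random vector $Z_V = (Z_i : i \in V)$ (given here as context, not quoted from the abstract) is
$$ I(Z_V) \;=\; \min_{\mathcal{P}} \frac{1}{|\mathcal{P}|-1}\Big[\sum_{C\in\mathcal{P}} H(Z_C) - H(Z_V)\Big], $$
where the minimum ranges over partitions $\mathcal{P}$ of $V$ into at least two nonempty parts; for $|V|=2$ the only such partition gives $H(Z_1)+H(Z_2)-H(Z_1,Z_2)$, i.e., Shannon's mutual information.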
May 2, 2024
Clustering methods must be tailored to the dataset they operate on, as there is no objective or universal definition of ``cluster,'' but nevertheless arbitrariness in the clustering method must be minimized. This paper develops a quantitative ``stability'' method of determining clusters, where stable or persistent clustering signals are used to indicate that real structures have been identified in the underlying dataset. This method is based on modulating clustering methods by cont...
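The modulation scheme is cut off above; the sketch below illustrates only a generic stability heuristic under assumed choices (k-means, bootstrap subsamples, and scikit-learn's adjusted Rand index), not the paper's specific construction.

# Generic stability heuristic: for each candidate k, cluster two bootstrap
# subsamples and score how consistently the shared points are grouped.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=10, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_pairs):
        a = rng.choice(len(X), int(frac * len(X)), replace=False)
        b = rng.choice(len(X), int(frac * len(X)), replace=False)
        shared = np.intersect1d(a, b)
        la = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[a])
        lb = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[b])
        scores.append(adjusted_rand_score(la.predict(X[shared]), lb.predict(X[shared])))
    return float(np.mean(scores))  # higher = more stable / persistent signal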
January 23, 2020
Clustering analysis has become a ubiquitous information retrieval tool in a wide range of domains, but a more automatic framework is still lacking. Though internal metrics are the key players in a successful retrieval of clusters, their effectiveness on real-world datasets remains not fully understood, mainly because of the unrealistic assumptions they make about the underlying datasets. We hypothesized that capturing {\it traces of information gain} between increasingly complex clusterin...
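The hypothesis is cut off above; as a loose illustration only (the paper's construction is not reproduced), one literal reading of an information-gain trace is the information shared by successive clusterings of increasing complexity, assumed here to come from k-means and scored with scikit-learn's adjusted mutual information.

# Loose illustration: how much information does the (k+1)-cluster solution
# share with the k-cluster solution, as k grows?
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

def info_trace(X, k_max=10):
    labelings = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
                 for k in range(2, k_max + 1)}
    return [adjusted_mutual_info_score(labelings[k], labelings[k + 1])
            for k in range(2, k_max)]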
September 16, 2020
Clustering is one of the most fundamental tools in artificial intelligence, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach for variance-based k-clustering tasks, which include the widely known k-means clustering. The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in that subset only. Under certain assumptions, the resulting clu...
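As a minimal sketch of the general idea (uniform sampling is an assumption here; the paper's sampling rule and its guarantees are not reproduced): cluster only a sampled subset, then assign every point to its nearest learned center.

# Sketch: k-means on a uniform subsample, then nearest-center assignment
# for the full dataset.
import numpy as np
from sklearn.cluster import KMeans

def subset_kmeans(X, k, sample_size, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    return km.predict(X), km.cluster_centers_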
October 31, 2018
Typically, clustering algorithms provide clustering solutions with a prespecified number of clusters. The lack of a priori knowledge of the true number of underlying clusters in the dataset makes it important to have a metric for comparing clustering solutions with different numbers of clusters. This article quantifies a notion of persistence of clustering solutions that enables comparing solutions with different numbers of clusters. The persistence relates to the range of data-r...
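The definition is cut off above; the sketch below uses an illustrative proxy rather than the article's exact notion: in single-linkage hierarchical clustering, a k-cluster solution ``persists'' over the range of merge distances for which exactly k clusters exist.

# Illustrative proxy for persistence: the range of merge heights over which
# a k-cluster solution survives in a single-linkage dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage

def persistence_by_k(X):
    Z = linkage(X, method="single")
    heights = Z[:, 2]                 # merge distances, nondecreasing
    n = len(X)
    persistence = {}
    for i in range(len(heights) - 1):
        k = n - i - 1                 # clusters present after the i-th merge
        persistence[k] = heights[i + 1] - heights[i]
    return persistence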
November 4, 2021
Dimensionality reduction and clustering techniques are frequently used to analyze complex data sets, but their results are often not easy to interpret. We consider how to support users in interpreting apparent cluster structure on scatter plots where the axes are not directly interpretable, such as when the data is projected onto a two-dimensional space using a dimensionality-reduction method. Specifically, we propose a new method to compute an interpretable clustering automa...
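The proposed method is cut off above; a common recipe in the same spirit (assumed components: PCA, k-means, and a shallow decision tree from scikit-learn, not the paper's algorithm) is to cluster in the 2-D projection and then describe each apparent cluster with rules over the original, interpretable features.

# Cluster in a 2-D projection, then explain the clusters with a shallow
# decision tree fitted on the original features.
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

def explain_projected_clusters(X, feature_names, k=3):
    X2 = PCA(n_components=2).fit_transform(X)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X2)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
    return export_text(tree, feature_names=list(feature_names))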
October 10, 2016
We consider the problem of Gaussian mixture clustering in the high-dimensional limit where the data consists of $m$ points in $n$ dimensions, $n,m \rightarrow \infty$ and $\alpha = m/n$ stays finite. Using exact but non-rigorous methods from statistical physics, we determine the critical value of $\alpha$ and the distance between the clusters at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. We also determin...
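The critical value itself is not reproduced here; the snippet below is only a toy numerical probe of the setup ($m = \alpha n$ points from two symmetric Gaussian clusters, recovered with a simple spectral estimator), useful for observing a better-than-chance transition empirically rather than deriving it.

# Toy probe: symmetric two-cluster Gaussian mixture in n dimensions with
# m = alpha * n samples; estimate memberships from the top singular direction.
import numpy as np

def overlap_with_truth(n=500, alpha=2.0, sep=1.5, seed=0):
    rng = np.random.default_rng(seed)
    m = int(alpha * n)
    v = np.zeros(n)
    v[0] = sep / 2.0                      # cluster means at +v and -v
    s = rng.choice([-1, 1], size=m)       # true memberships
    X = np.outer(s, v) + rng.normal(size=(m, n))
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    est = np.sign(X @ Vt[0])              # spectral guess, sign-ambiguous
    return abs(np.mean(est == s) - 0.5) * 2.0   # 0 = chance, 1 = perfect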
August 16, 2016
We propose two new methods for estimating the number of clusters in a hierarchical clustering framework, with the aim of creating a fully automated process with no human intervention. The methods are completely data-driven, require no input from the researcher, and as such are fully automated. They are easy to implement and not computationally intensive. We analyze performance on several simulated data sets and the Biobase Gene Expression Set, comparing o...
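The two estimators themselves are cut off above; as a stand-in, the sketch below shows a familiar data-driven heuristic in the same hierarchical framework, cutting the dendrogram at the largest jump in merge heights (SciPy's linkage and fcluster are assumptions, not the paper's code).

# Stand-in heuristic: choose k where successive merge heights jump the most.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def estimate_k(X, method="average"):
    Z = linkage(X, method=method)
    heights = Z[:, 2]
    i = int(np.argmax(np.diff(heights)))  # largest gap between merges
    k = len(X) - i - 1                    # clusters present before that gap closes
    labels = fcluster(Z, t=k, criterion="maxclust")
    return k, labels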
December 12, 2012
In this paper we propose a measure of clustering quality or accuracy that is appropriate in situations where it is desirable to evaluate a clustering algorithm by comparing the clusters it produces with ``ground truth'' consisting of classes assigned to the patterns by manual or other means in whose veracity there is confidence. Such measures are referred to as ``external.'' Our measure also has the characteristic of allowing clusterings with different numbers...
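The measure itself is not named in the truncated abstract; for comparison, two standard external measures that also accept clusterings and ground-truth classes with different numbers of groups are shown below (scikit-learn's adjusted Rand index and normalized mutual information, given as context rather than as the paper's measure).

# Two standard external measures that tolerate differing numbers of clusters.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth  = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # ground-truth classes (3 classes)
labels = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # clustering with a possibly different k
print(adjusted_rand_score(truth, labels))
print(normalized_mutual_info_score(truth, labels))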