How many clusters? An information theore...

An Information-theoretic Perspective of Hierarchical Clustering

August 13, 2021

88% Match

Yicheng Pan, Feng Zheng, Bingchen Fan

Machine Learning

A combinatorial cost function for hierarchical clustering was introduced by Dasgupta \cite{dasgupta2016cost}. It has been generalized by Cohen-Addad et al. \cite{cohen2019hierarchical} to a general form named admissible function. In this paper, we investigate hierarchical clustering from the \emph{information-theoretic} perspective and formulate a new objective function. We also establish the relationship between these two perspectives. In algorithmic aspect, we get rid of th...

Find SimilarView on arXiv

Clustering Stability: An Overview

July 7, 2010

88% Match

Luxburg Ulrike von

Machine Learning

A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability...

Find SimilarView on arXiv

Stop using the elbow criterion for k-means and how to choose the number of clusters instead

December 23, 2022

87% Match

Erich Schubert

Machine Learning

A major challenge when using k-means clustering often is how to choose the parameter k, the number of clusters. In this letter, we want to point out that it is very easy to draw poor conclusions from a common heuristic, the "elbow method". Better alternatives have been known in literature for a long time, and we want to draw attention to some of these easy to use options, that often perform better. This letter is a call to stop using the elbow method altogether, because it se...

Find SimilarView on arXiv

A Graph-based Approach to Estimating the Number of Clusters

February 23, 2024

87% Match

Yichuan Bai, Lynna Chu

Methodology

We consider the problem of estimating the number of clusters ($k$) in a dataset. We propose a non-parametric approach to the problem that is based on maximizing a statistic constructed from similarity graphs. This graph-based statistic is a robust summary measure of the similarity information among observations and is applicable even if the number of dimensions or number of clusters is possibly large. The approach is straightforward to implement, computationally fast, and can...

Find SimilarView on arXiv

Bayesian cluster analysis: Point estimation and credible balls

May 13, 2015

87% Match

Sara Wade, Zoubin Ghahramani

Methodology

Clustering is widely studied in statistics and machine learning, with applications in a variety of fields. As opposed to classical algorithms which return a single clustering solution, Bayesian nonparametric models provide a posterior over the entire space of partitions, allowing one to assess statistical properties, such as uncertainty on the number of clusters. However, an important problem is how to summarize the posterior; the huge dimension of partition space and difficu...

Find SimilarView on arXiv

The Informativeness of K -Means for Learning Mixture Models

March 30, 2017

87% Match

Zhaoqiang Liu, Vincent Y. F. Tan

Machine Learning

Information Theory

Machine Learning

Information Theory

Methodology

The learning of mixture models can be viewed as a clustering problem. Indeed, given data samples independently generated from a mixture of distributions, we often would like to find the {\it correct target clustering} of the samples according to which component distribution they were generated from. For a clustering problem, practitioners often choose to use the simple $k$-means algorithm. $k$-means attempts to find an {\it optimal clustering} that minimizes the sum-of-square...

Find SimilarView on arXiv

Clustering Optimisation Method for Highly Connected Biological Data

August 8, 2022

87% Match

Richard Tjörnhammar

Quantitative Methods

Machine Learning

Currently, data-driven discovery in biological sciences resides in finding segmentation strategies in multivariate data that produce sensible descriptions of the data. Clustering is but one of several approaches and sometimes falls short because of difficulties in assessing reasonable cutoffs, the number of clusters that need to be formed or that an approach fails to preserve topological properties of the original system in its clustered form. In this work, we show how a simp...

Find SimilarView on arXiv

Information based clustering: Supplementary material

November 25, 2005

87% Match

Noam Slonim, Gurinder Singh Atwal, ... , Bialek William

Quantitative Methods

This technical report provides the supplementary material for a paper entitled "Information based clustering", to appear shortly in Proceedings of the National Academy of Sciences (USA). In Section I we present in detail the iterative clustering algorithm used in our experiments and in Section II we describe the validation scheme used to determine the statistical significance of our results. Then in subsequent sections we provide all the experimental results for three very di...

Find SimilarView on arXiv

A Clustering Method Based on Information Entropy Payload

September 13, 2022

87% Match

Shaodong Deng, Long Sheng, ... , Deng Fuyi

Machine Learning

Artificial Intelligence

Existing clustering algorithms such as K-means often need to preset parameters such as the number of categories K, and such parameters may lead to the failure to output objective and consistent clustering results. This paper introduces a clustering method based on the information theory, by which clusters in the clustering result have maximum average information entropy (called entropy payload in this paper). This method can bring the following benefits: firstly, this method ...

Find SimilarView on arXiv

Entropy regularization in probabilistic clustering

July 19, 2023

87% Match

Beatrice Franzolini, Giovanni Rebaudo

Methodology

Computation

Machine Learning

Bayesian nonparametric mixture models are widely used to cluster observations. However, one major drawback of the approach is that the estimated partition often presents unbalanced clusters' frequencies with only a few dominating clusters and a large number of sparsely-populated ones. This feature translates into results that are often uninterpretable unless we accept to ignore a relevant number of observations and clusters. Interpreting the posterior distribution as penalize...

Find SimilarView on arXiv

How many clusters? An information theoretic perspective

An Information-theoretic Perspective of Hierarchical Clustering

Clustering Stability: An Overview

Stop using the elbow criterion for k-means and how to choose the number of clusters instead

A Graph-based Approach to Estimating the Number of Clusters

Bayesian cluster analysis: Point estimation and credible balls

The Informativeness of K -Means for Learning Mixture Models

Clustering Optimisation Method for Highly Connected Biological Data

Information based clustering: Supplementary material

A Clustering Method Based on Information Entropy Payload

Entropy regularization in probabilistic clustering