How many clusters? An information theoretic perspective

The information bottleneck and geometric clustering

December 27, 2017

DJ Strouse, David J Schwab

Machine Learning

Artificial Intelligence

Information Theory

The information bottleneck (IB) approach to clustering takes a joint distribution $P\!\left(X,Y\right)$ and maps the data $X$ to cluster labels $T$ that retain maximal information about $Y$ (Tishby et al., 1999). This objective results in an algorithm that clusters data points based on the similarity of their conditional distributions $P\!\left(Y\mid X\right)$. This is in contrast to classic "geometric clustering" algorithms such as $k$-means and Gaussian mixture models (...

Algorithms of maximum likelihood data clustering with applications

April 9, 2002

Lorenzo Giada, Matteo Marsili

Statistical Mechanics

We address the problem of data clustering by introducing an unsupervised, parameter-free approach based on the maximum likelihood principle. Starting from the observation that data sets belonging to the same cluster share common information, we construct an expression for the likelihood of any possible cluster structure. The likelihood in turn depends only on the Pearson coefficient of the data. We discuss clustering algorithms that provide a fast and reliable approximation t...

Entropy Regularized Power k-Means Clustering

January 10, 2020

Saptarshi Chakraborty, Debolina Paul, ..., Jason Xu

Machine Learning

Despite its well-known shortcomings, $k$-means remains one of the most widely used approaches to data clustering. Current research continues to tackle its flaws while attempting to preserve its simplicity. Recently, the power $k$-means algorithm was proposed to avoid trapping in local minima by annealing through a family of smoother surfaces. However, the approach lacks theoretical justification and fails in high dimensions when many features are irrelevant. This pap...

To Cluster, or Not to Cluster: An Analysis of Clusterability Methods

August 24, 2018

A. Adolfsson, M. Ackerman, N. C. Brownstein

Machine Learning

Clustering is an essential data mining tool that aims to discover inherent cluster structure in data. For most applications, applying clustering is only appropriate when cluster structure is present. As such, the study of clusterability, which evaluates whether data possesses such structure, is an integral part of cluster analysis. However, methods for evaluating clusterability vary radically, making it challenging to select a suitable measure. In this paper, we perform an ex...

Computational Feasibility of Clustering under Clusterability Assumptions

January 2, 2015

Shai Ben-David

Computational Complexity

Machine Learning

It is well known that most of the common clustering objectives are NP-hard to optimize. In practice, however, clustering is being routinely carried out. One approach for providing theoretical understanding of this seeming discrepancy is to come up with notions of clusterability that distinguish realistically interesting input data from worst-case data sets. The hope is that there will be clustering algorithms that are provably efficient on such 'clusterable' instances. In oth...

Large Scale Correlation Clustering Optimization

December 13, 2011

Shai Bagon, Meirav Galun

Computer Vision and Pattern ...

Clustering is a fundamental task in unsupervised learning. The focus of this paper is the Correlation Clustering functional, which combines positive and negative affinities between the data points. The contribution of this paper is twofold: (i) a theoretical analysis of the functional, and (ii) new optimization algorithms that can cope with large-scale problems (>100K variables) that are infeasible using existing methods. Our theoretical analysis provides a probabilistic gen...

Automatic Parameter Selection for Non-Redundant Clustering

December 19, 2023

Collin Leiber, Dominik Mautz, ..., Christian Böhm

Machine Learning

Artificial Intelligence

High-dimensional datasets often contain multiple meaningful clusterings in different subspaces. For example, objects can be clustered either by color, weight, or size, revealing different interpretations of the given dataset. A variety of approaches are able to identify such non-redundant clusterings. However, most of these methods require the user to specify the expected number of subspaces and clusters for each subspace. Stating these values is a non-trivial problem and usu...

Iterative Optimization and Simplification of Hierarchical Clusterings

April 1, 1996

D. Fisher

Artificial Intelligence

Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high quality, but be computationally inexpensive as well. In general, we cannot have it both ways, but we can partition the search so that a system inexpensively constructs a 'tentative' clust...

Partitioning Relational Matrices of Similarities or Dissimilarities using the Value of Information

October 28, 2017

Isaac J. Sledge, Jose C. Principe

Artificial Intelligence

Machine Learning

In this paper, we provide an approach to clustering relational matrices whose entries correspond to either similarities or dissimilarities between objects. Our approach is based on the value of information, a parameterized, information-theoretic criterion that measures the change in costs associated with changes in information. Optimizing the value of information yields a deterministic annealing style of clustering with many benefits. For instance, investigators avoid needing...

SMLSOM: The shrinking maximum likelihood self-organizing map

April 28, 2021

Ryosuke Motegi, Yoichi Seki

Machine Learning

Information Retrieval

Determining the number of clusters in a dataset is a fundamental issue in data clustering. Many methods have been proposed to solve the problem of selecting the number of clusters, considering it to be a problem with regard to model selection. This paper proposes an efficient algorithm that automatically selects a suitable number of clusters based on a probability distribution model framework. The algorithm includes the following two components. First, a generalization of Koh...
