March 10, 2015
Text clustering is a text-mining technique that divides a given set of text documents into meaningful clusters; it is used to organize large collections of text documents. In most clustering algorithms the number of clusters must be specified a priori, which is a drawback of these algorithms. The aim of this paper is to show experimentally how to determine the number of clusters based on cluster quality. Since partitional clusterin...
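One concrete way to pick the number of clusters from cluster quality, in the spirit of this abstract, is to score each candidate k with the silhouette coefficient and keep the best one. A minimal NumPy sketch, assuming synthetic blob data, plain Lloyd's k-means, and a deterministic farthest-point initialization (all illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# three well-separated 2-D blobs (illustrative data, not from the paper)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

def init_centers(X, k):
    # deterministic farthest-point (maximin) initialization
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    return np.array(centers)

def kmeans(X, k, iters=50):
    centers = init_centers(X, k)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    # mean silhouette coefficient: (b - a) / max(a, b) for each point
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    n, scores = len(X), []
    for i in range(n):
        own = labels == labels[i]
        if own.sum() == 1:            # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = D[i, own & (np.arange(n) != i)].mean()
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# score each candidate k and keep the best-scoring one
scores = {k: silhouette(X, kmeans(X, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

On well-separated blobs the silhouette score peaks at the true number of clusters, which is exactly the "cluster quality" criterion the abstract refers to.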
November 15, 2019
In many applications we want to find the number of clusters in a dataset. A common approach is to use the penalized k-means algorithm with an additive penalty term linear in the number of clusters. An open problem is estimating the value of the coefficient of the penalty term. Since estimating the value of the coefficient in a principled manner appears to be intractable for general clusters, we investigate "ideal clusters", i.e. identical spherical clusters with no overlaps a...
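Penalized k-means with an additive penalty linear in the number of clusters can be sketched as follows. The toy data, the deterministic quantile initialization, and the value of the coefficient `lam` are all assumptions for illustration; the abstract's open problem is precisely how to set that coefficient in a principled way:

```python
import numpy as np

rng = np.random.default_rng(1)
# two identical, well-separated 1-D Gaussian clusters (illustrative data)
X = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])[:, None]

def kmeans_sse(X, k, iters=100):
    # deterministic quantile initialization, then Lloyd iterations
    centers = np.quantile(X, (np.arange(k) + 0.5) / k)[:, None]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

lam = 200.0  # penalty coefficient -- assumed; choosing it is the open problem
cost = {k: kmeans_sse(X, k) + lam * k for k in range(1, 6)}
best_k = min(cost, key=cost.get)
```

With a suitable `lam`, the penalized cost is minimized at the true number of clusters: the SSE term rewards more clusters while the linear term `lam * k` counteracts it.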
December 17, 2018
An impurity measure $I: \mathbb{R}^d \to \mathbb{R}^+$ is a function that assigns to a $d$-dimensional vector ${\bf v}$ a non-negative value $I({\bf v})$, such that the more homogeneous ${\bf v}$ is with respect to the values of its coordinates, the larger its impurity. A well-known example of an impurity measure is the entropy impurity. We study the problem of clustering based on impurity measures. Let $V$ be a collection of $n$ $d$-dimensional vectors with non-negative...
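The entropy impurity mentioned above can be written as $I({\bf v}) = \lVert{\bf v}\rVert_1 \, H\!\left({\bf v}/\lVert{\bf v}\rVert_1\right)$, so a homogeneous vector (all coordinates equal) has maximal impurity while a vector concentrated on one coordinate has zero impurity. A small sketch; the weighting by $\lVert{\bf v}\rVert_1$ is one common convention, assumed here:

```python
import numpy as np

def entropy_impurity(v):
    """Entropy impurity: I(v) = ||v||_1 * H(v / ||v||_1), with H in bits."""
    v = np.asarray(v, dtype=float)
    s = v.sum()
    p = v[v > 0] / s            # normalize to a probability vector
    return float(-s * (p * np.log2(p)).sum())

# equal coordinates (homogeneous vector) -> maximal impurity;
# all mass on one coordinate -> zero impurity
hi = entropy_impurity([5, 5, 5, 5])   # 20 * log2(4) = 40 bits
lo = entropy_impurity([20, 0, 0, 0])  # 0 bits
```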
January 16, 2013
The study of complex systems is limited by the fact that only a few variables are accessible for modeling and sampling, and these are not necessarily the most relevant ones for explaining the system's behavior. In addition, empirical data typically undersample the space of possible states. We study a generic framework in which a complex system is seen as a system of many interacting degrees of freedom, known only in part, that optimize a given function. We show that the underlyi...
December 24, 2014
This paper presents an information-theoretic approach to the concept of intelligence in the computational sense. We introduce a probabilistic framework from which computational intelligence is shown to be an entropy-minimizing process at the local level. Using this new scheme, we develop a simple data-driven clustering example and discuss its applications.
November 2, 2011
Bayesian models offer great flexibility for clustering applications---Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-m...
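The small-variance asymptotics alluded to here yield a hard-clustering procedure, often called DP-means, in which a point farther than a threshold from every existing center opens a new cluster. A rough sketch with toy data; the threshold `lam` and the data are hand-picked assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated 2-D blobs (illustrative data)
X = np.vstack([rng.normal([0, 0], 0.4, (100, 2)),
               rng.normal([10, 0], 0.4, (100, 2))])

def dp_means(X, lam, iters=20):
    # lam: squared-distance threshold for opening a new cluster
    centers = [X.mean(axis=0)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i, x in enumerate(X):
            d2 = [((x - c) ** 2).sum() for c in centers]
            j = int(np.argmin(d2))
            if d2[j] > lam:             # too far from everything: new cluster
                centers.append(x.copy())
                j = len(centers) - 1
            labels[i] = j
        # drop empty clusters, recompute centers, and relabel compactly
        keep = [j for j in range(len(centers)) if (labels == j).any()]
        centers = [X[labels == j].mean(axis=0) for j in keep]
        remap = {old: new for new, old in enumerate(keep)}
        labels = np.array([remap[j] for j in labels])
    return labels, np.array(centers)

labels, centers = dp_means(X, lam=9.0)
```

Unlike k-means, the number of clusters is not fixed in advance: it emerges from the threshold, which plays the role of the Dirichlet-process concentration parameter in the asymptotic analysis.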
March 14, 2000
We discuss a new approach to data clustering. We find that maximum likelihood leads naturally to a Hamiltonian of Potts variables which depends on the correlation matrix and whose low-temperature behavior describes the correlation structure of the data. For random, uncorrelated data sets no correlation structure emerges. On the other hand, for data sets with a built-in cluster structure, the method is able to detect and efficiently recover that structure. Finally we apply the...
October 22, 2017
We derive a new Bayesian Information Criterion (BIC) by formulating the problem of estimating the number of clusters in an observed data set as maximization of the posterior probability of the candidate models. Given that some mild assumptions are satisfied, we provide a general BIC expression for a broad class of data distributions. This serves as a starting point when deriving the BIC for specific distributions. Along this line, we provide a closed-form BIC expression for m...
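For spherical Gaussian clusters, a criterion of this general shape can be evaluated per candidate k: maximized log-likelihood minus $(p/2)\log n$, where $p$ counts the free parameters. The sketch below uses an X-means-style shared-variance likelihood; this is an assumed simplification for illustration, not necessarily the exact expression derived in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
# two 1-D Gaussian clusters (illustrative data)
X = np.concatenate([rng.normal(-5, 1, 150), rng.normal(5, 1, 150)])[:, None]
n, d = X.shape

def kmeans(X, k, iters=100):
    centers = np.quantile(X, (np.arange(k) + 0.5) / k)[:, None]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def bic(X, labels, centers):
    # shared-variance spherical-Gaussian log-likelihood, X-means style
    k = len(centers)
    resid = ((X - centers[labels]) ** 2).sum()
    var = resid / (n - k)
    ll = 0.0
    for j in range(k):                       # mixing-proportion term
        nj = (labels == j).sum()
        if nj > 0:
            ll += nj * np.log(nj / n)
    ll += -0.5 * n * d * np.log(2 * np.pi * var) - 0.5 * resid / var
    p = k * d + (k - 1) + 1                  # means + mixing weights + variance
    return ll - 0.5 * p * np.log(n)

scores = {k: bic(X, *kmeans(X, k)) for k in range(1, 6)}
best_k = max(scores, key=scores.get)
```

Maximizing this posterior-probability surrogate over the candidate models selects the number of clusters, which is the model-selection view the abstract formalizes.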
January 16, 2001
We discuss a new approach to data clustering. We find that maximum likelihood leads naturally to a Hamiltonian of Potts variables which depends on the correlation matrix and whose low-temperature behavior describes the correlation structure of the data. For random, uncorrelated data sets no correlation structure emerges. On the other hand, for data sets with a built-in cluster structure, the method is able to detect and efficiently recover that structure. Finally we apply the...
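The Potts Hamiltonian in question couples spins through the correlation matrix, $H(\mathbf{s}) = -\sum_{i<j} C_{ij}\,\delta_{s_i s_j}$, so configurations that assign strongly correlated items to the same Potts state (cluster label) have low energy. A toy evaluation of that energy; the block-structured correlation matrix is an assumed example:

```python
import numpy as np

def potts_energy(C, s):
    """H(s) = -sum_{i<j} C_ij * delta(s_i, s_j): low energy when strongly
    correlated pairs share a Potts state."""
    same = (s[:, None] == s[None, :])
    iu = np.triu_indices(len(s), k=1)        # pairs i < j only
    return float(-(C[iu] * same[iu]).sum())

# correlation matrix with two blocks: items 0 and 1 are strongly correlated
C = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
aligned = potts_energy(C, np.array([0, 0, 1]))  # matches the block structure
mixed   = potts_energy(C, np.array([0, 1, 0]))  # splits the correlated pair
```

The configuration aligned with the built-in correlation structure has strictly lower energy, which is why the low-temperature behavior of this Hamiltonian reveals the clusters.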
March 5, 2015
We survey the application of a relatively new branch of statistical physics -- "community detection" -- to data mining. In particular, we focus on the diagnosis of materials and automated image segmentation. Community detection describes the quest of partitioning a complex system involving many elements into optimally decoupled subsets, or communities, of such elements. We review a multiresolution variant which is used to ascertain structures at different spatial and temporal scal...
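A standard quality function optimized in community detection is Newman-Girvan modularity, $Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta_{c_i c_j}$; multiresolution variants of the kind the abstract reviews generalize it with a resolution parameter. A minimal sketch on a toy graph (the two-triangles example is an assumption):

```python
import numpy as np

def modularity(A, c):
    """Newman-Girvan modularity of partition c for adjacency matrix A."""
    m2 = A.sum()                      # 2m for an undirected graph
    k = A.sum(axis=1)                 # node degrees
    same = (c[:, None] == c[None, :])
    return float(((A - np.outer(k, k) / m2) * same).sum() / m2)

# two triangles joined by a single bridge edge (assumed toy graph)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

good = modularity(A, np.array([0, 0, 0, 1, 1, 1]))  # communities = triangles
bad  = modularity(A, np.array([0, 1, 0, 1, 0, 1]))  # arbitrary split
```

The partition that matches the two triangles scores markedly higher than an arbitrary one, which is the signal community-detection algorithms maximize when segmenting images or diagnosing material structure.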