January 30, 2013
We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation-Maximization (EM) algorithm, a winner-take-all version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root node, using high-dimensional discrete-variable data sets (both real and synthetic)...
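The contrast between soft EM and its winner-take-all variant comes down to the E-step. Below is a minimal sketch, assuming one-dimensional Gaussian components with fixed unit variance (an illustrative simplification, not the paper's naive-Bayes setup): the soft version spreads responsibility across components, while the hard version assigns each point entirely to its best component, giving the K-means-like behavior described above.

```python
import numpy as np

def e_step(x, means, weights, winner_take_all=False):
    # log-density of each point under each unit-variance Gaussian component
    log_p = -0.5 * (x[:, None] - means[None, :]) ** 2 + np.log(weights)[None, :]
    if winner_take_all:
        # hard assignment: all responsibility goes to the best component
        resp = np.zeros_like(log_p)
        resp[np.arange(len(x)), log_p.argmax(axis=1)] = 1.0
        return resp
    # soft assignment: normalized posterior responsibilities
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def m_step(x, resp):
    nk = resp.sum(axis=0) + 1e-12           # guard against empty components
    return resp.T @ x / nk, nk / len(x)     # updated means and mixing weights

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
means, weights = np.array([0.0, 1.0]), np.array([0.5, 0.5])
for _ in range(50):
    means, weights = m_step(x, e_step(x, means, weights, winner_take_all=True))
print(means)  # should approach the true component means 0 and 5
```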
November 15, 2018
The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high-dimensional space as a union of low-dimensional subspaces. In particular, we propose a highly scalable sampling-based algorithm that clusters the entire data set by first applying spectral clustering to a small random sample and then classifying, or labeling, the remaining out-of-sample points. The key idea is that this random subs...
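A minimal sketch of the sample-then-extend scheme, assuming scikit-learn; the sample size, affinity, and out-of-sample classifier (k-nearest neighbors here) are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, _ = make_blobs(n_samples=5000, centers=4, n_features=50, random_state=0)

# step 1: spectral clustering on a small random sample only
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=300, replace=False)
sample_labels = SpectralClustering(n_clusters=4, random_state=0).fit_predict(X[idx])

# step 2: extend the sample labels to all out-of-sample points by classification
clf = KNeighborsClassifier(n_neighbors=5).fit(X[idx], sample_labels)
labels = clf.predict(X)
```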
February 22, 2023
This work introduces a refinement of the Parsimonious Model for fitting a Gaussian mixture. The improvement is based on grouping the covariance matrices according to a similarity criterion, such as sharing principal directions. This and other similarity criteria that arise from the spectral decomposition of a matrix are the bases of the Parsimonious Model. The classification can be achieved with simple modifications of the CEM (Classification Expectation Maximiz...
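As a concrete illustration of the shared-principal-directions criterion, the sketch below compares the eigenvector bases of two covariance matrices obtained from their spectral decompositions; the exact test and tolerance are assumptions made for illustration, not the paper's criterion.

```python
import numpy as np

def share_principal_directions(S1, S2, tol=1e-6):
    # spectral decomposition: Sigma = D diag(lambda) D^T, eigenvalues ascending
    _, D1 = np.linalg.eigh(S1)
    _, D2 = np.linalg.eigh(S2)
    # shared directions mean the eigenvector bases agree up to sign, so
    # |D1^T D2| should be (close to) the identity matrix; this simple check
    # assumes a consistent eigenvalue ordering between the two matrices
    overlap = np.abs(D1.T @ D2)
    return np.allclose(overlap, np.eye(len(S1)), atol=tol)

S1 = np.diag([4.0, 1.0])
S2 = np.diag([9.0, 2.0])   # same principal axes, different scales
print(share_principal_directions(S1, S2))  # True
```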
December 5, 2023
This paper addresses the clustering of data in the hyperdimensional computing (HDC) domain. In prior work, an HDC-based clustering framework, referred to as HDCluster, was proposed. However, the performance of the existing HDCluster is not robust: it degrades because the hypervectors for the clusters are chosen at random during the initialization step. To overcome this bottleneck, we assign the initial cluster hypervectors by exploring the similar...
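The role of initialization can be seen in a self-contained sketch of HDC-style clustering; the random-projection encoding, bipolar hypervectors, and seeding of cluster hypervectors from encoded data points are illustrative assumptions, not the HDCluster implementation.

```python
import numpy as np

D = 10_000                                  # hypervector dimensionality
rng = np.random.default_rng(0)

def cluster(X, k, iters=20):
    # random-projection encoding into bipolar {-1, +1} hypervectors
    proj = rng.standard_normal((D, X.shape[1]))
    H = np.sign(X @ proj.T)
    # data-driven initialization: seed cluster hypervectors with encoded
    # data points instead of drawing them at random
    centers = H[rng.choice(len(H), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = (H @ centers.T).argmax(axis=1)   # most similar cluster wins
        for j in range(k):
            if np.any(labels == j):
                # bundle member hypervectors into the cluster hypervector
                centers[j] = np.sign(H[labels == j].sum(axis=0))
    return labels

X = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(4, 1, (50, 8))])
labels = cluster(X, k=2)
```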
September 6, 2019
Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these problems. However, Python has lacked such a package. We therefore introduce AutoG...
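Python does offer the building blocks for this kind of automatic selection. A minimal sketch, assuming scikit-learn: fit Gaussian mixtures over a grid of component counts and covariance structures, and keep the model with the lowest BIC.

```python
import itertools
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# fit every (component count, covariance structure) pair and keep the best BIC
candidates = (
    GaussianMixture(n_components=k, covariance_type=c, random_state=0).fit(X)
    for k, c in itertools.product(range(1, 8), ["full", "tied", "diag", "spherical"])
)
best = min(candidates, key=lambda gm: gm.bic(X))
print(best.n_components, best.covariance_type)
```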
September 11, 2013
Cluster analysis faces two problems in high dimensions: first, the 'curse of dimensionality', which can lead to overfitting and poor generalization performance; and second, the sheer time taken for conventional algorithms to process large amounts of high-dimensional data. In many applications, only a small subset of features provides information about the cluster membership of any one data point; however, this informative feature subset may not be the same for all data points. He...
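To make "informative feature subset" concrete, here is an illustrative sketch (not the paper's method): after an ordinary K-means pass, features are ranked per cluster by how much their within-cluster variance shrinks relative to the global variance, so pure noise features score near zero.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=600, centers=3, n_features=20, random_state=0)
X[:, 10:] = rng.standard_normal((600, 10))   # features 10-19 carry no cluster signal

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
global_var = X.var(axis=0)
for j in range(3):
    # relevance: fraction of a feature's variance explained away within the cluster
    rel = 1.0 - X[labels == j].var(axis=0) / global_var
    print(f"cluster {j}: top features {np.argsort(rel)[::-1][:5]}")
```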
September 10, 2019
Subspace clustering, the task of clustering high-dimensional data when the data points come from a union of subspaces, is one of the fundamental tasks in unsupervised machine learning. Most of the existing algorithms for this task require prior knowledge of the number of clusters, along with a few additional parameters that need to be set or tuned a priori according to the type of data to be clustered. In this work, a parameter-free method for subspace clustering is proposed, whe...
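One generic device for removing the cluster-count parameter is the eigengap heuristic on the normalized graph Laplacian, sketched below; this is offered only as an illustration of parameter-free estimation, and the paper's actual rule may differ (note the Gaussian affinity bandwidth sigma is itself an assumed input here).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

def estimate_k(X, sigma=1.0, k_max=10):
    # Gaussian affinity and symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    W = np.exp(-pairwise_distances(X) ** 2 / (2.0 * sigma ** 2))
    d = np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - W / d[:, None] / d[None, :]
    # read the number of clusters off the largest gap between the
    # smallest eigenvalues of L
    vals = np.sort(np.linalg.eigvalsh(L))[: k_max + 1]
    return int(np.argmax(np.diff(vals)) + 1)

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
print(estimate_k(X, sigma=2.0))  # expected to recover 4 for well-separated blobs
```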
September 25, 2017
In this paper, we present a deep extension of Sparse Subspace Clustering, termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit-sphere distribution assumption for the learned deep features, DSSC can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity given by neural networks. One appealing advantage of DSSC is that, when the original real-world data do not meet the class-specific linear subsp...
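The self-expressiveness principle that SSC and DSSC share can be shown in a few lines. The PyTorch sketch below learns a coefficient matrix C that reconstructs each feature vector from the others under an l1 penalty; the random matrix Z stands in for learned deep features, and the network architecture and unit-sphere regularization from the paper are omitted.

```python
import torch

torch.manual_seed(0)
Z = torch.randn(100, 32)                       # stand-in for learned deep features
C = torch.zeros(100, 100, requires_grad=True)  # self-expression coefficients
opt = torch.optim.Adam([C], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    C_off = C - torch.diag(torch.diag(C))      # forbid trivial self-reconstruction
    # reconstruction error plus l1 sparsity penalty, as in SSC
    loss = ((Z - C_off @ Z) ** 2).sum() + 0.1 * C_off.abs().sum()
    loss.backward()
    opt.step()

with torch.no_grad():
    C_final = C - torch.diag(torch.diag(C))
    affinity = 0.5 * (C_final.abs() + C_final.abs().T)  # input to spectral clustering
```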
August 10, 2019
Experimental evaluation is a major research methodology for investigating clustering algorithms and many other machine learning algorithms. For this purpose, a number of benchmark datasets have been widely used in the literature, and their quality plays a key role in the value of the research. However, in most existing studies, little attention has been paid to the properties of these datasets, and they are often regarded as black-box problems. For example, it is comm...
November 25, 2015
With rapidly increasing data volumes, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains, for instance bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide them into subsets that maximize intra-subset similarity and inter-subset dissimilarity, where a similarity...
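That objective can be made concrete with a small sketch: for a candidate partition, compute the average intra-cluster similarity and the average inter-cluster similarity (cosine similarity is an illustrative choice), and prefer partitions where the former exceeds the latter.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def intra_inter(X, labels):
    S = cosine_similarity(X)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)              # ignore self-similarity
    off_diag = ~np.eye(len(X), dtype=bool)
    return S[same].mean(), S[off_diag & ~same].mean()

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
print(intra_inter(X, labels))  # intra-cluster similarity should exceed inter-cluster
```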