High-Dimensional Data Clustering

How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces

January 13, 2022

89% Match

Brian A. Powell

Machine Learning

The failure of the Euclidean norm to reliably distinguish between nearby and distant points in high-dimensional space is well known. This phenomenon of distance concentration manifests in a variety of data distributions, with iid or correlated features, including centrally distributed and clustered data. Unsupervised learning based on Euclidean nearest neighbors, and more general proximity-oriented data mining tasks like clustering, might therefore be adversely affected by dis...
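
The distance concentration mentioned in this abstract can be illustrated with a small numerical sketch. This is not the paper's method or experiment: the function name `relative_contrast`, the uniform iid data, the sample size, and the chosen dimensions are all assumptions made for illustration. The sketch measures how much farther the farthest point is than the nearest one, relative to the nearest distance, as the dimension grows.

```python
import numpy as np

def relative_contrast(dim, n_points=1000, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query to n_points iid uniform points in [0, 1]^dim.
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # iid uniform features (assumption)
    query = rng.random(dim)                # a single random query point
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast between the nearest and farthest neighbor at several dimensions.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dim={d:5d}  relative contrast={c:.3f}")
```

Under these assumptions the relative contrast should shrink as the dimension increases, which is the sense in which nearby and distant points become hard to tell apart.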


A Bayesian Fisher-EM algorithm for discriminative Gaussian subspace clustering

December 8, 2020

89% Match

Nicolas Jouvin, Charles Bouveyron, Pierre Latouche

Methodology

High-dimensional data clustering has become and remains a challenging task for modern statistics and machine learning, with a wide range of applications. We consider in this work the powerful discriminative latent mixture model, and we extend it to the Bayesian framework. Modeling data as a mixture of Gaussians in a low-dimensional discriminative subspace, a Gaussian prior distribution is introduced over the latent group means and a family of twelve submodels is derived cons...


Clustering, Classification, Discriminant Analysis, and Dimension Reduction via Generalized Hyperbolic Mixtures

August 28, 2013

89% Match

Katherine Morris, Paul D. McNicholas

Methodology

Computation

Machine Learning

A method for dimension reduction with clustering, classification, or discriminant analysis is introduced. This mixture model-based approach is based on fitting generalized hyperbolic mixtures on a reduced subspace within the paradigm of model-based clustering, classification, or discriminant analysis. A reduced subspace of the data is derived by considering the extent to which group means and group covariances vary. The members of the subspace arise through linear combination...


Clustering based on Mixtures of Sparse Gaussian Processes

March 23, 2023

89% Match

Zahra Moslehi, Abdolreza Mirzaei, Mehran Safayani

Machine Learning

Creating low dimensional representations of a high dimensional data set is an important component in many machine learning applications. How to cluster data using their low dimensional embedded space is still a challenging problem in machine learning. In this article, we focus on proposing a joint formulation for both clustering and dimensionality reduction. When a probabilistic model is desired, one possible solution is to use the mixture models in which both cluster indicat...


Sparse Subspace Clustering: Algorithm, Theory, and Applications

March 5, 2012

89% Match

Ehsan Elhamifar, Rene Vidal

cs.CV

cs.IR

cs.IT

cs.LG

math.IT

math.OC

stat.ML

In many real-world problems, we are dealing with collections of high-dimensional data, such as images, videos, text and web documents, DNA microarray data, and more. Often, high-dimensional data lie close to low-dimensional structures corresponding to several classes or categories the data belongs to. In this paper, we propose and study an algorithm, called Sparse Subspace Clustering (SSC), to cluster data points that lie in a union of low-dimensional subspaces. The key idea ...


Principal component analysis based clustering for high-dimension, low-sample-size data

March 16, 2015

89% Match

Kazuyoshi Yata, Makoto Aoshima

Statistics Theory

In this paper, we consider clustering based on principal component analysis (PCA) for high-dimension, low-sample-size (HDLSS) data. We give theoretical reasons why PCA is effective for clustering HDLSS data. First, we derive a geometric representation of HDLSS data taken from a two-class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample principal component scores in the HDLSS context. We develop ideas of the geome...


Analysis of Sparse Subspace Clustering: Experiments and Random Projection

April 2, 2022

89% Match

Mehmet F. Demirel, Enrico Au-Yeung

Machine Learning

Optimization and Control

Clustering can be defined as the process of assembling objects into a number of groups whose elements are similar to each other in some manner. As a technique that is used in many domains, such as face clustering, plant categorization, image segmentation, and document classification, clustering is considered one of the most important unsupervised learning problems. Scientists have surveyed this problem for years and developed different techniques that can solve it, such as k-mean...


Dirichlet Process Parsimonious Mixtures for clustering

January 14, 2015

89% Match

Faicel Chamroukhi, Marius Bartcus, Hervé Glotin

Machine Learning

Methodology

The parsimonious Gaussian mixture models, which exploit an eigenvalue decomposition of the group covariance matrices of the Gaussian mixture, have shown their success in particular in cluster analysis. Their estimation is in general performed by maximum likelihood estimation and has also been considered from a parametric Bayesian perspective. We propose new Dirichlet Process Parsimonious mixtures (DPPM) which represent a Bayesian nonparametric formulation of these parsimoniou...


Clustering by latent dimensions

May 28, 2018

88% Match

Shohei Hidaka, Neeraj Kashyap

Machine Learning

This paper introduces a new clustering technique, called dimensional clustering, which clusters each data point by its latent pointwise dimension, which is a measure of the dimensionality of the data set local to that point. Pointwise dimension is invariant under a broad class of transformations. As a result, dimensional clustering can be usefully applied to a wide range of datasets. Concretely, we present a statistical model which estimates the pointwise dimensio...


Variable Selection for Clustering and Classification

March 21, 2013

88% Match

Jeffrey L. Andrews, Paul D. McNicholas

Computation

As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibit...
