Cohesion and Repulsion in Bayesian Dista...

Spatially-Aware Comparison and Consensus for Clusterings

January 31, 2011

84% Match

Parasaran Raman, Jeff M. Phillips, Suresh Venkatasubramanian

Machine Learning

Computational Geometry

Databases

This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Eu...

Efficient hierarchical clustering for continuous data

April 20, 2012

84% Match

Ricardo Henao, Joseph E. Lucas

Machine Learning

We present a new sequential Monte Carlo sampler for coalescent-based Bayesian hierarchical clustering. Our model is appropriate for modeling non-i.i.d. data and offers a substantial reduction of computational cost when compared to the original sampler without resorting to approximations. We also propose a quadratic complexity approximation that in practice shows almost no loss in performance compared to its counterpart. We show that as a byproduct of our formulation, we obta...

Bayesian Agglomerative Clustering with Coalescents

July 4, 2009

84% Match

Yee Whye Teh, Hal Daumé III, Daniel Roy

Machine Learning

We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman's coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over others, and demonstrate our approach in document clustering and phylolinguistics.

Cluster Analysis via Random Partition Distributions

June 5, 2021

84% Match

David B. Dahl, Jacob Andros, J. Brandon Carter

Methodology

Hierarchical and k-medoids clustering are deterministic clustering algorithms based on pairwise distances. Using these same pairwise distances, we propose a novel stochastic clustering method based on random partition distributions. We call our method CaviarPD, for cluster analysis via random partition distributions. CaviarPD first samples clusterings from a random partition distribution and then finds the best cluster estimate based on these samples using algorithms to minim...

A new interpoint distance-based clustering algorithm using kernel density estimation

April 28, 2023

84% Match

Dr. Soumita Modak

Methodology

Applications

A novel nonparametric clustering algorithm is proposed using the interpoint distances between the members of the data to reveal the inherent clustering structure existing in the given set of data, where we apply the classical nonparametric univariate kernel density estimation method to the interpoint distances to estimate the density around a data member. Our clustering algorithm is simple in its formation and easy to apply resulting in well-defined clusters. The algorithm st...

A Short Survey on Data Clustering Algorithms

November 25, 2015

84% Match

Ka-Chun Wong

Data Structures and Algorith...

Computer Vision and Pattern ...

Machine Learning

Computation

Machine Learning

With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide the set of data instances into the subsets which maximize the intra-subset similarity and inter-subset dissimilarity, where a similarity...

Computational Feasibility of Clustering under Clusterability Assumptions

January 2, 2015

84% Match

Shai Ben-David

Computational Complexity

Machine Learning

It is well known that most of the common clustering objectives are NP-hard to optimize. In practice, however, clustering is being routinely carried out. One approach for providing theoretical understanding of this seeming discrepancy is to come up with notions of clusterability that distinguish realistically interesting input data from worst-case data sets. The hope is that there will be clustering algorithms that are provably efficient on such 'clusterable' instances. In oth...

Distributed Bayesian clustering using finite mixture of mixtures

March 31, 2020

84% Match

Hanyu Song, Yingjian Wang, David B. Dunson

Computation

Methodology

In many modern applications, there is interest in analyzing enormous data sets that cannot be easily moved across computers or loaded into memory on a single computer. In such settings, it is very common to be interested in clustering. Existing distributed clustering algorithms are mostly distance or density based without a likelihood specification, precluding the possibility of formal statistical inference. Model-based clustering allows statistical inference, yet research on...

Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions

September 12, 2009

84% Match

Ery Arias-Castro

Machine Learning

Statistics Theory

In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guarantees for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical...

Bayesian Clustering via Fusing of Localized Densities

March 31, 2023

84% Match

Alexander Dombowsky, David B. Dunson

Methodology

Bayesian clustering typically relies on mixture models, with each component interpreted as a different cluster. After defining a prior for the component parameters and weights, Markov chain Monte Carlo (MCMC) algorithms are commonly used to produce samples from the posterior distribution of the component labels. The data are then clustered by minimizing the expectation of a clustering loss function that favours similarity to the component labels. Unfortunately, although these...

Cohesion and Repulsion in Bayesian Distance Clustering
