ID: 2107.05414

Cohesion and Repulsion in Bayesian Distance Clustering

July 12, 2021

Abhinav Natarajan, Maria De Iorio, Andreas Heinecke, Emanuel Mayer, Simon Glenn
Statistics
Methodology

Clustering in high dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack probabilistic interpretation and rely on heuristics to estimate the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale, and devising algorithms that can effectively explore the posterior space remains an open problem. Building on recent developments in Bayesian distance-based clustering, we propose a hybrid solution that entails defining a likelihood on the pairwise distances between observations. The novelty of the approach consists in including both cohesion and repulsion terms in the likelihood, which allows for cluster identifiability. This implies that clusters are composed of objects which have small "dissimilarities" among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion). We show how this modelling strategy has interesting connections with existing proposals in the literature, as well as a decision-theoretic interpretation. The proposed method is computationally efficient and applicable to a wide variety of scenarios. We demonstrate the approach in a simulation study and in an application to digital numismatics.
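The cohesion/repulsion idea can be illustrated with a toy score on pairwise distances. This is not the paper's actual likelihood: the functional forms and the rate parameters `rate_within` and `rate_between` below are illustrative assumptions, chosen only to show how penalizing within-cluster distances (cohesion) and rewarding between-cluster separation (repulsion) makes the true partition score higher than a scrambled one.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def toy_log_score(D, labels, rate_within=1.0, rate_between=0.2):
    """Toy log-score of a partition given a distance matrix D.

    Cohesion: pairs in the same cluster are penalized in proportion to
    their distance, so tight clusters score high. Repulsion: pairs in
    different clusters are rewarded for being far apart. Both terms are
    illustrative choices, not the model from the paper.
    """
    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices_from(D, k=1)                  # each pair counted once
    cohesion = -rate_within * D[iu][same[iu]].sum()
    repulsion = np.log1p(rate_between * D[iu][~same[iu]]).sum()
    return cohesion + repulsion

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),          # two well-separated blobs
               rng.normal(3.0, 0.3, (10, 2))])
D = pairwise_distances(X)
true_labels = np.repeat([0, 1], 10)
interleaved = np.tile([0, 1], 10)                      # deliberately bad partition
```

Under this score the blob-respecting partition beats the interleaved one, which is the identifiability intuition the abstract describes: both terms are needed, since cohesion alone cannot distinguish merging two distant groups from keeping them apart.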

Similar papers

Unsupervised Statistical Learning for Die Analysis in Ancient Numismatics

December 1, 2021

88% Match
Andreas Heinecke, Emanuel Mayer, ... , Yoonju Jung
Computer Vision and Pattern ...

Die analysis is an essential numismatic method, and an important tool of ancient economic history. Yet, manual die studies are too labor-intensive to comprehensively study large coinages such as those of the Roman Empire. We address this problem by proposing a model for unsupervised computational die analysis, which can reduce the time investment necessary for large-scale die studies by several orders of magnitude, in many cases from years to weeks. From a computer vision vie...


A Mathematical Theory for Clustering in Metric Spaces

September 25, 2015

88% Match
Cheng-Shang Chang, Wanjiun Liao, ... , Li-Heng Liou
Machine Learning

Clustering is one of the most fundamental problems in data analysis and it has been studied extensively in the literature. Though many clustering algorithms have been proposed, clustering theories that justify the use of these clustering algorithms are still unsatisfactory. In particular, one of the fundamental challenges is to address the following question: What is a cluster in a set of data points? In this paper, we make an attempt to address such a question by conside...


Bayesian Distance Clustering

October 19, 2018

87% Match
Leo L Duan, David B Dunson
Machine Learning

Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some inf...


A review on Bayesian model-based clustering

March 30, 2023

87% Match
Clara Grazian
Methodology

Clustering is an important task in many areas of knowledge: medicine and epidemiology, genomics, environmental science, economics, visual sciences, among others. Methodologies to perform inference on the number of clusters have often been proved to be inconsistent, and introducing a dependence structure among the clusters implies additional difficulties in the estimation process. In a Bayesian setting, clustering is performed by considering the unknown partition as a random o...


Revisiting k-means: New Algorithms via Bayesian Nonparametrics

November 2, 2011

86% Match
Brian Kulis, Michael I. Jordan
Machine Learning

Bayesian models offer great flexibility for clustering applications: Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-m...
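The small-variance asymptotic connection this abstract alludes to yields a k-means-like procedure in which a point spawns a new cluster whenever it is too far from every existing center. The sketch below follows that rule; the penalty `lam` and the toy data are illustrative choices, not values from the paper.

```python
import numpy as np

def dp_means(X, lam, n_iter=25):
    """Minimal sketch of a DP-means-style algorithm: Lloyd-like updates,
    but a point whose squared distance to every center exceeds lam
    opens a new cluster. lam is a user-chosen penalty."""
    centers = [X.mean(axis=0)]                 # start from the overall mean
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest center, or a brand-new one if all are too far.
        for i, x in enumerate(X):
            d2 = [np.sum((x - c) ** 2) for c in centers]
            if min(d2) > lam:
                centers.append(x.copy())
                labels[i] = len(centers) - 1
            else:
                labels[i] = int(np.argmin(d2))
        # Update step: recompute each non-empty cluster mean, drop empty clusters.
        centers = [X[labels == k].mean(axis=0)
                   for k in range(len(centers)) if np.any(labels == k)]
        # Re-index labels so they match the surviving centers.
        _, labels = np.unique(labels, return_inverse=True)
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),   # two blobs far apart
               rng.normal(8.0, 0.3, (10, 2))])
labels, centers = dp_means(X, lam=6.0)
```

Unlike k-means, the number of clusters is not fixed in advance: it emerges from the trade-off between fit and the penalty `lam`, mirroring how a Dirichlet process mixture trades off likelihood against new-cluster probability.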


A generalized Bayes framework for probabilistic clustering

June 9, 2020

86% Match
Tommaso Rigon, Amy H. Herring, David B. Dunson
Methodology
Machine Learning

Loss-based clustering methods, such as k-means and its variants, are standard tools for finding groups in data. However, the lack of quantification of uncertainty in the estimated clusters is a disadvantage. Model-based clustering based on mixture models provides an alternative, but such methods face computational problems and large sensitivity to the choice of kernel. This article proposes a generalized Bayes framework that bridges between these two paradigms through the use...


Information based clustering

November 26, 2005

86% Match
Noam Slonim, Gurinder Singh Atwal, ... , William Bialek
Quantitative Methods

In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem from an information theoretic perspective which avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster "protot...


Parsimonious Hierarchical Modeling Using Repulsive Distributions

January 16, 2017

86% Match
J. J. Quinlan, F. A. Quintana, G. L. Page
Methodology

Employing nonparametric methods for density estimation has become routine in Bayesian statistical practice. Models based on discrete nonparametric priors such as Dirichlet Process Mixture (DPM) models are very attractive choices due to their flexibility and tractability. However, a common problem in fitting DPMs or other discrete models to data is that they tend to produce a large number of (sometimes) redundant clusters. In this work we propose a method that produces parsimo...


Bayesian cluster analysis: Point estimation and credible balls

May 13, 2015

86% Match
Sara Wade, Zoubin Ghahramani
Methodology

Clustering is widely studied in statistics and machine learning, with applications in a variety of fields. As opposed to classical algorithms which return a single clustering solution, Bayesian nonparametric models provide a posterior over the entire space of partitions, allowing one to assess statistical properties, such as uncertainty on the number of clusters. However, an important problem is how to summarize the posterior; the huge dimension of partition space and difficu...
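One standard recipe for the summarization problem this abstract raises is to build the posterior similarity matrix from MCMC partition samples and pick the partition minimizing a loss against it. The sketch below uses Binder's loss, one of the losses discussed in this line of work, and restricts the search to the sampled partitions as a simplification; the toy samples are invented for illustration.

```python
import numpy as np

def similarity_matrix(partitions):
    """Posterior similarity matrix: entry (i, j) is the fraction of
    sampled partitions in which items i and j share a cluster."""
    partitions = np.asarray(partitions)        # shape (n_samples, n_items)
    n = partitions.shape[1]
    psm = np.zeros((n, n))
    for z in partitions:
        psm += (z[:, None] == z[None, :])      # co-clustering indicator
    return psm / len(partitions)

def binder_point_estimate(partitions):
    """Return the sampled partition minimizing Binder's loss against the
    posterior similarity matrix, plus the matrix itself."""
    psm = similarity_matrix(partitions)
    def binder(z):
        co = (z[:, None] == z[None, :])
        return np.sum((co - psm) ** 2)
    losses = [binder(np.asarray(z)) for z in partitions]
    return partitions[int(np.argmin(losses))], psm

# Nine samples agree on {0,1} vs {2,3}; one lumps everything together.
samples = [[0, 0, 1, 1]] * 9 + [[0, 0, 0, 0]]
est, psm = binder_point_estimate(samples)
```

The point estimate recovers the majority partition, and the similarity matrix itself carries the uncertainty information (here, a 0.1 posterior co-clustering probability across the two groups) that a single "best" clustering discards.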


Interpretable Clustering with the Distinguishability Criterion

April 24, 2024

86% Match
Ali Turfah, Xiaoquan Wen
Machine Learning
Methodology

Cluster analysis is a popular unsupervised learning tool used in many disciplines to identify heterogeneous sub-populations within a sample. However, validating cluster analysis results and determining the number of clusters in a data set remains an outstanding problem. In this work, we present a global criterion called the Distinguishability criterion to quantify the separability of identified clusters and validate inferred cluster configurations. Our computational implement...
