
Model-based Clustering with HDBSCAN*

Michael Strobl
Joerg Sander, Dept of Computing Science
Ricardo Campello
Osmar R. Zaiane, University of Alberta (Database)

Full Text: PKDD2020-hdbscan.pdf

We propose an efficient model-based clustering approach for creating Gaussian Mixture Models from finite datasets. Models are extracted from HDBSCAN* hierarchies using the Classification Likelihood and the Expectation Maximization algorithm. Prior knowledge of the number of components of the model, corresponding to the number of clusters, is not necessary; the number can be determined dynamically. Because HDBSCAN* creates relatively small hierarchies compared to previous approaches, this can be done efficiently. The fewer objects a dataset contains, the more difficult it is to accurately estimate the parameters of a fully unrestricted Gaussian Mixture Model. Therefore, our algorithm can create more parsimonious models if necessary. The user has a choice of two information criteria for model selection, as well as a likelihood test using unseen data, in order to select the best-fitting model. We compare our approach to two baselines and show its superiority in two settings: recovering the original data-generating distribution and partitioning the data correctly. Furthermore, we show that our approach is robust to its hyperparameter settings.
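To make the trade-off between dataset size and model complexity concrete, the following is a minimal sketch, not the authors' implementation. It counts the free parameters of a Gaussian Mixture Model under three common covariance restrictions and scores a model with the Bayesian Information Criterion (BIC). The function names `gmm_free_parameters` and `bic` are hypothetical, and the choice of BIC and of these particular restrictions is an assumption: the abstract does not name its two information criteria or its parsimonious model families.

```python
import math


def gmm_free_parameters(n_components, n_features, covariance="full"):
    """Count the free parameters of a Gaussian Mixture Model.

    Assumed covariance restrictions (illustrative, not taken from the paper):
      "full"      - one unrestricted symmetric matrix per component
      "diag"      - one diagonal matrix per component
      "spherical" - a single variance per component
    """
    weights = n_components - 1           # mixing weights sum to 1
    means = n_components * n_features    # one mean vector per component
    if covariance == "full":
        cov = n_components * n_features * (n_features + 1) // 2
    elif covariance == "diag":
        cov = n_components * n_features
    elif covariance == "spherical":
        cov = n_components
    else:
        raise ValueError(f"unknown covariance restriction: {covariance!r}")
    return weights + means + cov


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion; lower values indicate a better trade-off."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


if __name__ == "__main__":
    # Parameter counts for 3 components in 10 dimensions.
    for restriction in ("full", "diag", "spherical"):
        print(restriction, gmm_free_parameters(3, 10, restriction))
```

With 3 components in 10 dimensions, the unrestricted model has 197 free parameters, against 62 for the diagonal and 35 for the spherical restriction. On a small dataset, the penalty term in `bic` therefore weighs much more heavily on the unrestricted model, which is one way to see why a more parsimonious model can be preferable.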

Citation

M. Strobl, J. Sander, R. Campello, O. Zaiane. "Model-based Clustering with HDBSCAN*". European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, September 2020.

Keywords:	Hierarchical Clustering, Expectation Maximization, Model Selection
Category:	In Conference

BibTeX

@incollection{Strobl+al:ECMLPKDD20,
  author = {Michael Strobl and Joerg Sander and Ricardo Campello and Osmar R.
    Zaiane},
  title = {Model-based Clustering with HDBSCAN*},
  booktitle = {European Conference on Machine Learning and Principles and
    Practice of Knowledge Discovery in Databases},
  year = 2020,
}

Last Updated: September 15, 2020
Submitted by Sabina P
