Training Mixture Models at Scale via Coresets

by Mario Lucic, et al.

How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data that guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets whose size is polynomial in the dimension and the number of mixture components, and independent of the data set size. Hence, one can compute a (1+ε)-approximation of the optimal model on a significantly smaller data set. More importantly, such coresets can be constructed efficiently in both distributed and streaming settings. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and on new complexity results for mixtures of Gaussians. As a by-product of our analysis, we prove that the pseudo-dimension of arbitrary mixtures of Gaussians is polynomial in the ambient dimension. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
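To make the idea of a weighted subset concrete, here is a minimal Python sketch of a generic importance-sampling construction. It is not the paper's algorithm: the sampling distribution `q` (half uniform, half proportional to squared distance from the data mean) is a hypothetical stand-in for a proper sensitivity bound, and the names `build_weighted_subset` and `gmm_neg_log_likelihood` are made up for illustration. The sketch only checks the weighted negative log-likelihood against the full-data value for one fixed spherical Gaussian mixture, whereas a true coreset must hold this guarantee uniformly over all models.

```python
import numpy as np

def build_weighted_subset(X, m, rng):
    """Sample m points with probability q_i and attach weight 1 / (m * q_i)."""
    n = X.shape[0]
    # Hypothetical sensitivity proxy (an assumption, not the paper's bound):
    # half uniform mass, half proportional to squared distance from the mean.
    d2 = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    q = 0.5 / n + 0.5 * d2 / d2.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])  # makes weighted sums unbiased estimates
    return X[idx], weights

def gmm_neg_log_likelihood(X, weights, pis, mus, sigmas):
    """Weighted negative log-likelihood under a spherical-covariance GMM."""
    d = X.shape[1]
    # Squared distance of every point to every component mean: shape (n, k).
    sq = np.sum((X[:, None, :] - mus[None, :, :]) ** 2, axis=2)
    log_comp = (np.log(pis)[None, :]
                - 0.5 * d * np.log(2 * np.pi * sigmas ** 2)[None, :]
                - 0.5 * sq / sigmas[None, :] ** 2)
    # Numerically stable log-sum-exp over the components.
    mx = log_comp.max(axis=1, keepdims=True)
    log_px = mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))
    return -np.sum(weights * log_px)

# Synthetic two-cluster data in 2D; all values here are illustrative.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (5000, 2)),
               rng.normal(3.0, 1.0, (5000, 2))])
C, w = build_weighted_subset(X, 500, rng)

# One fixed candidate model, used for both evaluations.
pis = np.array([0.5, 0.5])
mus = np.array([[-2.0, -2.0], [3.0, 3.0]])
sigmas = np.array([1.0, 1.0])

full = gmm_neg_log_likelihood(X, np.ones(len(X)), pis, mus, sigmas)
approx = gmm_neg_log_likelihood(C, w, pis, mus, sigmas)
rel_err = abs(approx - full) / full
print(f"full={full:.1f}  subset={approx:.1f}  relative error={rel_err:.4f}")
```

The weights `1 / (m * q_i)` make the weighted sum an unbiased estimate of the full-data sum, which is why evaluating a model on 500 weighted points can track its cost on all 10,000. Any weighted fitting routine could then be run on `(C, w)` in place of the full data.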



