Training Mixture Models at Scale via Coresets

03/23/2017
by Mario Lucic, et al.

How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets whose size is polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can compute a (1+ε)-approximation of the optimal model on a significantly smaller data set. More importantly, such coresets can be constructed efficiently in both distributed and streaming settings. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and on new complexity results for mixtures of Gaussians. As a by-product of our analysis, we prove that the pseudo-dimension of arbitrary mixtures of Gaussians is polynomial in the ambient dimension. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
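
To make the idea of a weighted subset concrete, the following is a minimal sketch of a generic importance-sampling scheme. It is not the construction from the paper: the sampling distribution (a mix of uniform and distance-to-mean probabilities) is a simplifying assumption, and the names `build_coreset` and `weighted_gaussian_nll` are hypothetical. It only illustrates how a small weighted sample can stand in for the full data when evaluating a likelihood-based cost.

```python
import numpy as np


def build_coreset(X, m, rng):
    """Sample m weighted points from X (shape n x d) by importance sampling.

    Assumption: the sampling distribution mixes a uniform term with a term
    proportional to squared distance from the data mean. This is a stand-in
    for a proper sensitivity bound, not the paper's construction.
    """
    n = X.shape[0]
    sq_dist = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    total = sq_dist.sum()
    if total > 0:
        q = 0.5 / n + 0.5 * sq_dist / total
    else:
        # All points coincide; fall back to uniform sampling.
        q = np.full(n, 1.0 / n)
    idx = rng.choice(n, size=m, replace=True, p=q)
    # Inverse-probability weights make weighted sums unbiased estimates
    # of the corresponding sums over the full data set.
    weights = 1.0 / (m * q[idx])
    return X[idx], weights


def weighted_gaussian_nll(X, weights, mean, var):
    """Weighted negative log-likelihood under an isotropic Gaussian."""
    d = X.shape[1]
    sq = np.sum((X - mean) ** 2, axis=1)
    nll = 0.5 * (d * np.log(2 * np.pi * var) + sq / var)
    return float(np.sum(weights * nll))


rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))          # synthetic data, for illustration only
C, w = build_coreset(X, m=500, rng=rng)

# Cost of one candidate model on the full data versus on the weighted sample.
full = weighted_gaussian_nll(X, np.ones(len(X)), np.zeros(5), 1.0)
approx = weighted_gaussian_nll(C, w, np.zeros(5), 1.0)
```

Here `approx` is an unbiased estimate of `full` for any fixed model. A coreset guarantee of the kind described above is stronger, since it must hold simultaneously for all candidate models, which this sketch does not establish.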

01/31/2022

On the identifiability of mixtures of ranking models

Mixtures of ranking models are standard tools for ranking problems. Howe...
07/14/2018

On the Identifiability of Finite Mixtures of Finite Product Measures

The problem of identifiability of finite mixtures of finite product meas...
04/27/2021

Mixture models for spherical data with applications to protein bioinformatics

Finite mixture models are fitted to spherical data. Kent distributions a...
11/12/2013

The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

In this paper we show that very large mixtures of Gaussians are efficien...
01/09/2019

The statistical Minkowski distances: Closed-form formula for Gaussian Mixture Models

The traditional Minkowski distances are induced by the corresponding Min...
06/06/2020

Learning Mixtures of Plackett-Luce Models with Features from Top-l Orders

Plackett-Luce model (PL) is one of the most popular models for preferenc...
10/20/2020

Distributed Learning of Finite Gaussian Mixtures

Advances in information technology have led to extremely large datasets ...