Training Mixture Models at Scale via Coresets

03/23/2017
by   Mario Lucic, et al.

How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can compute a (1+ε)-approximation for the optimal model on a significantly smaller data set. More importantly, such coresets can be efficiently constructed in both distributed and streaming settings. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new complexity results for mixtures of Gaussians. As a by-product of our analysis, we prove that the pseudo-dimension of arbitrary mixtures of Gaussians is polynomial in the ambient dimension. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
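To make the notion of a weighted subset concrete, here is a minimal sketch that compares the negative log-likelihood of a fixed Gaussian mixture on a full data set with the weighted negative log-likelihood on a much smaller subset. Everything in it is an illustrative assumption rather than the paper's method: the data are synthetic and one-dimensional, the function names are hypothetical, and the subset is drawn uniformly with weight n/m per point. Uniform sampling is only a baseline and does not in general carry the guarantee described in the abstract, which must hold for all candidate models and not just the single one checked here.

```python
import math
import random


def gmm_neg_log_likelihood(points, weights, components):
    """Weighted negative log-likelihood of 1-D points under a Gaussian mixture.

    components is a list of (mixing_weight, mean, std) tuples.
    """
    total = 0.0
    for x, w in zip(points, weights):
        density = sum(
            pi * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2.0 * math.pi))
            for pi, mu, sigma in components
        )
        total -= w * math.log(density)
    return total


def uniform_weighted_subset(points, m, rng):
    """Baseline only (not the paper's construction): m uniform samples, each weighted n/m."""
    n = len(points)
    sample = rng.sample(points, m)
    return sample, [n / m] * m


rng = random.Random(0)

# Synthetic data from two well-separated Gaussians (assumed for illustration).
data = [rng.gauss(-2.0, 1.0) for _ in range(5000)]
data += [rng.gauss(3.0, 0.5) for _ in range(5000)]

# One fixed candidate model; a coreset guarantee would cover every candidate.
model = [(0.5, -2.0, 1.0), (0.5, 3.0, 0.5)]

full_cost = gmm_neg_log_likelihood(data, [1.0] * len(data), model)
subset, subset_weights = uniform_weighted_subset(data, 500, rng)
subset_cost = gmm_neg_log_likelihood(subset, subset_weights, model)
relative_error = abs(subset_cost - full_cost) / abs(full_cost)

print(f"full: {full_cost:.1f}  subset: {subset_cost:.1f}  rel. error: {relative_error:.4f}")
```

The weights sum to the size of the original data set, so the weighted cost on the subset is on the same scale as the cost on the full data. A coreset construction would replace `uniform_weighted_subset` with a sampling scheme whose weights make this approximation hold uniformly over models.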

research
01/31/2022

On the identifiability of mixtures of ranking models

Mixtures of ranking models are standard tools for ranking problems. Howe...
research
07/14/2018

On the Identifiability of Finite Mixtures of Finite Product Measures

The problem of identifiability of finite mixtures of finite product meas...
research
04/27/2021

Mixture models for spherical data with applications to protein bioinformatics

Finite mixture models are fitted to spherical data. Kent distributions a...
research
08/21/2015

Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Coresets are efficient representations of data sets such that models tra...
research
01/19/2020

Algebraic and Analytic Approaches for Parameter Learning in Mixture Models

We present two different approaches for parameter learning in several mi...
research
11/12/2013

The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

In this paper we show that very large mixtures of Gaussians are efficien...
research
06/12/2019

Coresets for Gaussian Mixture Models of Any Shape

An ε-coreset for a given set D of n points, is usually a small weighted ...
