Coresets for Gaussian Mixture Models of Any Shape

06/12/2019
by   Dan Feldman, et al.

An ε-coreset for a given set D of n points is usually a small weighted set such that querying the coreset provably yields a (1+ε)-factor approximation to querying the original (full) dataset, for a given family of queries. Using existing techniques, coresets can be maintained for streaming, dynamic (insertions/deletions), and distributed data in parallel, e.g., on a network, GPU, or cloud. We suggest the first coresets that approximate the negative log-likelihood for k-Gaussian Mixture Models (GMMs) of arbitrary shapes (ratio between the eigenvalues of their covariance matrices). For example, for any input set D whose coordinates are integers in [-n^100, n^100] and any fixed k, d ≥ 1, the coreset size is (log n)^O(1)/ε^2, and the coreset can be computed in time near-linear in n, with high probability. The optimal GMM may then be approximated quickly by learning on the small coreset. Previous results [NIPS'11, JMLR'18] suggested such small coresets for the case of semi-spherical unit Gaussians, i.e., where the corresponding eigenvalues are constants between 1/(2π) and 2π. Our main technique is a reduction between coresets for k-GMMs and projective clustering problems. We implemented our algorithms and provide open code and experimental results. Since our coresets are generic, with no special dependency on GMMs, we hope that they will be useful for many other functions.
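To make the query in the abstract concrete, the following is a minimal sketch of the weighted negative log-likelihood of a k-GMM, evaluated once on the full data with unit weights and once on a small weighted subset. The subset here is a plain uniform sample with weights n/m, used only as a stand-in: it is not the paper's coreset construction (which goes through projective clustering) and it carries no (1+ε) guarantee. The function names `gmm_neg_log_likelihood` and `uniform_sample` are hypothetical and chosen for illustration.

```python
import numpy as np

def gmm_neg_log_likelihood(points, weights, mix, means, covs):
    """Weighted negative log-likelihood of a k-component GMM.

    points:  (n, d) array, weights: (n,) array of point weights,
    mix:     k mixing proportions, means: k vectors of length d,
    covs:    k positive-definite (d, d) covariance matrices.
    """
    n, d = points.shape
    log_terms = []
    for pi_j, mu_j, cov_j in zip(mix, means, covs):
        diff = points - mu_j
        # Squared Mahalanobis distance via a linear solve (no explicit inverse).
        solved = np.linalg.solve(cov_j, diff.T).T
        maha = np.sum(diff * solved, axis=1)
        _, logdet = np.linalg.slogdet(cov_j)
        log_terms.append(
            np.log(pi_j) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        )
    log_terms = np.stack(log_terms, axis=1)  # shape (n, k)
    # Log-sum-exp over components for numerical stability.
    peak = log_terms.max(axis=1, keepdims=True)
    log_density = peak[:, 0] + np.log(np.exp(log_terms - peak).sum(axis=1))
    return -np.sum(weights * log_density)

def uniform_sample(points, m, rng):
    """Stand-in for a coreset: m uniform samples, each with weight n/m."""
    idx = rng.choice(len(points), size=m, replace=False)
    return points[idx], np.full(m, len(points) / m)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(20000, 2))
    subset, subset_weights = uniform_sample(data, 1000, rng)

    # One arbitrary query (a candidate GMM); a coreset must work for all of them.
    mix = [0.5, 0.5]
    means = [np.zeros(2), np.ones(2)]
    covs = [np.eye(2), 2.0 * np.eye(2)]

    full = gmm_neg_log_likelihood(data, np.ones(len(data)), mix, means, covs)
    approx = gmm_neg_log_likelihood(subset, subset_weights, mix, means, covs)
    print("full:", full, "weighted subset:", approx, "ratio:", approx / full)
```

An ε-coreset in the sense of the abstract would keep this ratio within 1 ± ε for every query GMM in the family simultaneously, including ones whose covariance eigenvalues differ by many orders of magnitude; a uniform sample gives no such guarantee, which is why a dedicated construction is needed.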

