Learning Mixture of Gaussians with Streaming Data

07/08/2017
by Aditi Raghunathan, et al.

In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of N points in d dimensions generated by an unknown mixture of k spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each pair of centers is Cσ distant with C = Ω((k log k)^{1/4} σ), where σ^2 is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to constants); our center separation requirement matches the best known result for spherical Gaussians [Vempala and Wang]. For finite samples, we show that a bias term based on the initial estimate decreases at an O(1/poly(N)) rate while the variance decreases at the nearly optimal rate of σ^2 d/N. Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers, for which we provide an online PCA-based clustering algorithm. The asymptotic per-step time complexity of our algorithm is the optimal d·k, while its space complexity is O(dk log k). In addition to the bias and variance terms, which tend to 0, the hard-thresholding-based updates of the streaming Lloyd's algorithm are agnostic to the data distribution and hence incur an approximation error that cannot be avoided. However, by using a streaming version of the classical (soft-thresholding-based) EM method that exploits the Gaussian distribution explicitly, we show that for a mixture of two Gaussians the true means can be estimated consistently, with estimation error decreasing at a nearly optimal rate and tending to 0 as N→∞.
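The hard-assignment streaming update described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `streaming_lloyd` and `demo`, the constant step size `eta`, and the synthetic data are all assumptions made for illustration, and the initial centers are simply passed in rather than computed by the online PCA-based seeding the abstract mentions.

```python
import numpy as np

def streaming_lloyd(stream, init_centers, eta=0.05):
    """One pass over `stream`, updating k centers with hard assignments.

    `eta` is a hypothetical constant step size; the abstract does not
    specify the step-size schedule.
    """
    centers = np.array(init_centers, dtype=float)  # private copy, shape (k, d)
    for x in stream:
        x = np.asarray(x, dtype=float)
        # Hard thresholding: pick the single nearest center (O(d*k) work).
        j = int(np.argmin(np.sum((centers - x) ** 2, axis=1)))
        # Move only that center a small step toward the new point.
        centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers

def demo(seed=0):
    """Synthetic check on two well-separated spherical Gaussians (assumed data)."""
    rng = np.random.default_rng(seed)
    true_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
    labels = rng.integers(0, 2, size=2000)
    points = true_centers[labels] + rng.normal(scale=1.0, size=(2000, 2))
    # Assumed "good" initial estimate: each true center shifted by 1.
    init = true_centers + 1.0
    return true_centers, streaming_lloyd(points, init, eta=0.01)
```

Calling `demo()` returns the true centers and the streaming estimates for comparison. Each step in this sketch touches every center once, so the per-point cost is proportional to d·k, and only the k current centers are kept in memory.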


research
10/31/2017

On Learning Mixtures of Well-Separated Gaussians

We consider the problem of efficiently learning mixtures of a large numb...
research
04/13/2020

Learning Mixtures of Spherical Gaussians via Fourier Analysis

Suppose that we are given independent, identically distributed samples x...
research
07/26/2012

Achieving Approximate Soft Clustering in Data Streams

In recent years, data streaming has gained prominence due to advances in...
research
12/28/2020

No-substitution k-means Clustering with Adversarial Order

We investigate k-means clustering in the online no-substitution setting ...
research
09/16/2019

Streaming PTAS for Constrained k-Means

We generalise the results of Bhattacharya et al. (Journal of Computing S...
research
03/10/2019

One-Pass Sparsified Gaussian Mixtures

We present a one-pass sparsified Gaussian mixture model (SGMM). Given P-...
research
05/06/2022

What Makes A Good Fisherman? Linear Regression under Self-Selection Bias

In the classical setting of self-selection, the goal is to learn k model...
