1 Introduction
This contribution is in the context of Gaussian Mixture Models (GMMs), which is a probabilistic unsupervised method for clustering and data modeling. GMMs have been used in a wide range of scenarios, e.g., Melnykov et al. (2010)
. Traditionally, free parameters of a GMM are estimated using the socalled
ExpectationMaximization (EM) algorithm, which has the appealing property of requiring no learning rates and which automatically enforces all constraints that GMMs impose.1.1 Motivation
GMMs have several properties that make their application to image data appealing, like the ability to generate samples from the modeled distribution, or their intrinsic outlier detection ability. However, for such highdimensional and dataintensive scenarios which require a high degree of parallelization to be efficiently solvable, the traditional way of training GMMs by EM is reaching its limits, both in terms of memory consumption and execution speed. Memory requirements skyrocket mainly because GMM is intrinsically a batchtype algorithm that needs to store and process all samples in memory in order to be parallelizable. In addition, computation time becomes excessive because EM involves covariance matrix inversions, which become prohibitive for high data dimensionality. Here, an online or minibatch type of optimization such as Stochastic Gradient Descent (SGD) has strong advantages w.r.t. the memory consumption in particular, but also w.r.t. computation time since it requires no matrix inversions. Therefore, we seek to develop a Stochastic Gradient Descent algorithm for GMMs, making it possible to efficiently train GMMs on large collections of images. Additionally, since the number of free GMM parameters depends quadratically on the data dimensionality
, we seek to develop simplified GMM models with a less severe impact of .1.2 Related Work
There are several proposals for what is termed “incremental” or “online” EM for training GMMs Vlassis and Likas (2002); Engel and Heinen (2010); Pinto and Engel (2015); Cederborg et al. (2010); Song and Wang (2005); Kristan et al. (2008)
, allowing GMMs to be trained by providing one sample at a time. However, none of these models reproduce the original GMM algorithm faithfully, which is why most of them additionally resort to component pruning or merging heuristics for components, leading to learning dynamics that are difficult to understand. This is also the case for the Locally Weighted Projection Regression (LWPR) algorithm
Vijayakumar et al. (2005) which can be interpreted as a GMM model adapted for online learning. To our knowledge, there is only a single work proposing to train GMMs by SGD Hosseini and Sra (2015). Here, constraint enforcement is ensured by using manifold optimization techniques and a carefully chosen set of regularizers, leading to a rather complex model with several additional and hardtochoose hyperparameters. The idea of including regularization/annealing into GMMs (in the context of EM) was proposed in Verbeek et al. (2005); Ormoneit and Tresp (1998), although their regularizes significantly differs from ours. Approximating the full GMM loglikelihood is used in Verbeek et al. (2003); Pinheiro and Bates (1995); Dognin et al. (2009), although together with EM training. There has been significant work on simplifying GMM models (e.g., Jakovljević (2014)), leading to what is called parsimonious GMM models, which is reviewed in Bouveyron and BrunetSaumard (2014).1.3 Contributions
The main novel contributions of this article are:

[leftmargin=2em]

a generic and novel method for training GMMs using standard SGD that enforces all constraints and is numerically stable for high dimensional data (e.g., )

an annealing procedure that ensures convergence from a wide range of initial conditions in a streaming setting (excluding, e.g., centroid initialization by kmeans)

a simplification of the basic GMM model that is particularly suited for applying GMMs to highdimensional data

an analysis of the link between SGDtrained GMM models and selforganizing maps
Apart from these novel contributions, we provide a publicly available TensorFlow implementation.^{1}^{1}1https://gitlab.cs.hsfulda.de/MLProjects/sgdgmm
2 Datasets
We use two datasets for all experiments. MNIST (LeCun et al., 1998) consists of gray scale images of handwritten digits (
) and is a common benchmark for computer vision systems and classification problems.
SVHN (Netzer et al., 2011) is a class benchmark based on color images of house numbers (). We use the cropped and centered digit format. For both benchmarks, we use either the full images ( for SVHN, for MNIST), or a crop of the central patch of each image. The class information is not used since GMM is a purely unsupervised method.3 Methods: Stochastic Gradient Descent for GMMs
Gaussian Mixture Models (GMMs) are normally formulated in terms of
component probabilities, modeled by multivariate Gaussians. It is assumed that each data sample
, represented as a single row of the data matrix , has been sampled from a single Gaussian component , selected with a priori probability . Sampling is performed from the Gaussian conditional probability , whose parameters are the centroids and covariance matrices : . A GMM aims to explain the data by minimizing the loglikelihood function(1) 
Free parameters to be adapted are the component probabilities , the component centroids and the component covariance matrices .
3.1 Maxcomponent Approximation
As a first step to implement Stochastic Gradient Descent (SGD) for optimizing the loglikelihood (see Eq. 1), we observe that the component weights and the conditional probabilities are positive by definition. It is therefore evident that any single component of the inner sum over the components is a lower bound for the whole inner sum. The largest of these lower bounds is given by the maximum over the components, thus giving
(2) 
This is what we term the maxcomponent approximation of Eq. 2. Since , we can decrease by minimizing . The advantage of is that it is not affected by numerical instabilities as is. An easy way to see this is by considering the case of a single Gaussian component with weight , centroid and whose covariance matrix has only diagonal entries which are all equal to . In this case, an
dimensional input vector
, would give a conditional probability of which will, depending on floating point inaccuracy, produce NaN or infinity for the first factor due to overflow, and zero for the second factor due to underflow, resulting in an undefined behavior.3.2 Constraint Enforcement
GMMs require the weights to be normalized: and elements of the diagonal matrices to be nonnegative: . In traditional GMM training via ExpectationMaximization, these constraints are taken care of by the method of Lagrangian multipliers. In SGD, they must be enforced in different manner. For the weights , we adopt the approach proposed in Hosseini and Sra (2015), which replaces them by other free parameters from which the are computed such that normalization is ensured. One such computational scheme is
(3) 
For ensuring nonnegativeness of the covariances, we identify all covariances inferior to a certain minimal value after each gradient descent step, and subsequently clip them to .
3.3 Annealing procedure
A major issue for SGD optimization of GMMs are local minima that obviously correspond to undesirable solutions. Prominently among these is what we term the singlecomponent solution: here, a single component has a weight close to , with its centroid and covariance matrix being given by the mean and covariance of the data:
(4) 
Another undesirable local minimum is the degenerate solution where all components have the same weight, centroid and covariance matrix, the latter being identical to the singlecomponent solution. See Fig. 4 (left) for a visualization of the latter.
Such solutions usually evolve from an incorrect initialization of the GMM model, and it is easy to show that the gradient w.r.t. all model parameters is zero in both of these scenarios. This is easy to show: As the gradients read (with denoting the responsibility of component for sample )
(5)  
(6) 
it is easy to see that when either all or single centroid/covariance matrix are equal to the total mean and variance of the data (
and ), we obtain zero gradients for degenerate solutions ( ) and singlecomponent solutions (), respectively.Our approach for avoiding these undesirable solutions in principle is to punish their characteristic response patterns by an appropriate modification of the loss function that is minimized, i.e.,
. The following guidelines should be adhered to:
[leftmargin=2em]

singlecomponent and degenerate solution should be punished by the annealing procedure

as we mainly wish to ensure a good starting point for optimization, annealing should be active only at the beginning of the training process

as training advances, annealed loss should smoothly transition into the original loss

the number of free parameters introduced by the annealing procedure should be low
These requirements are met by what we term the smoothed maxcomponent loglikelihood :
(7) 
Here, we assign a normalized coefficient vector to each Gaussian mixture component . The entries of are computed in the following fashion:

[leftmargin=2em]

Assume that the Gaussian components are arranged on a 1D grid of dimensions or on a 2D grid of dimensions . Each linear component index has thus an unique associated 1D or 2D coordinate .

Assume that the vector of length is actually representing a 1D structure of dimension or a 2D structure of dimension . Each linear vector index in has thus a unique associated 1D or 2D coordinate .

The entries of the vector are computed as
(8)
and subsequently normalize to have unit sum. Thus, Eq. 7 essentially represents a convolution of the probabilities , arranged in an 1D or 2D grid, with a Gaussian convolution filter, thus amounting to a smoothing operation. The 1D or 2D variance in Eq. 8 is a parameter that must be set as a function of the grid size such that Gaussians are neither homogeneous nor delta peaks. Thus, the loss function in Eq. 7 is maximized if the log probabilities follow an unimodal Gaussian profile of variance , whereas singlecomponent and degenerate solutions are punished.
It is trivial to so see that the annealed loss function in Eq. 7 reduces to the nonannealed form Eq. 2 in the limit where . This is because the vectors approach Kronecker deltas in this case, with only a single entry of value , which essentially removes the inner sum in Eq. 7. By making timedependent, starting at an intermediate value of and then let it approach a small final value , we can smoothly transition the annealed loss function Eq. 7 to the original maxcomponent loglikelihood Eq. 2. Time dependency of can thus be chosen to be:
(9) 
where the time constant in the exponential is chosen as to ensure a smooth transition.
3.4 Full GMM Model Trained by SGD and its HyperParameters
Putting everything together, training Gaussian Mixture Models with SGD is performed by minimizing the smoothed maxcomponent loglikelihood of Eq. 7, enforcing the constraints on the component weights and covariances as detailed in Sec. 3.2 and transitioning from the smoothed to the “bare” loss function as detailed in Sec. 3.3. In order to avoid convergence to other undesirable local minima by SGD, we introduce three weighting constants , and , all in the range, for controlling the relative adaptation speeds for the three groups of free parameters in the GMM models (setting, e.g., would disable the adaptation of weights). We have thus the update rules:
(10) 
Centroids are initialized to random values in range , weights are chosen equiprobable, that is to say , for all components, and covariance matrix entries are uniformly initialized to very small positive values (a good adhoc choice is ). Last but not least, a learning rate is required for performing SGD. We find that our algorithm depends very weakly on , and that a fixed value of is usually a good choice. We summarize good practices for choosing hyperparameters of the proposed SGD approach for GMMs in App. C. Please not that it is not required to initialize the component centroids by kmeans as it is usually recommended when training GMMs by EM.
3.5 Simplifications and Memory/Performance Issues
For the full GMM model, regardless of whether it is trained by EM or SGD, the memory requirements of a GMM with components depend strongly on the data dimensionality . A good measure for memory requirements is the number of free model parameters computed as . To reduce this number, we propose several strategies that are compatible with training by SGD (see also Fig. 1):
Diagonal Covariance Matrix
A wellknown simplification of GMMs consists of using a diagonal covariance matrix instead of a full one. This reduces the number of trainable parameters to , where is the dimensionality of the data vectors and the number of components.
Local Principal Directions
A potentially very powerful simplification consist of using a diagonal covariance matrix of diagonal entries, and letting SGD adapt the local principal directions along which these variances are measured for each component (in addition to the other free parameters, of course). The expression of the conditional component probabilities now includes, for each component , the normalized principal direction vectors and reads:
(11) 
This leads to a value of , and thus achieves satisfactory model simplification only if can be chosen very small and is high. In order for the model to be fully determined, another constraint needs to be enforced after each SGD step, namely the orthogonality of the local principal directions for each component : . For now, the matrix whose rows are the vectors
is subjected to a QR decomposition
and setting .4 Experiments
Unless otherwise stated, the experiments in this section will always be conducted with the following parameter values (see App. C for a justification of these choices): total iterations , , , minibatch size , , , , , and . The adaptation strengths , and are all set to . Except for Sec. 3.5, we will use a diagonal covariance matrix for each of the components. Training/test data are taken from MNIST or SVHN, either using full images or the central patches.
4.1 Validity of the MaxComponent Approximation and Comparison to EM
Since we minimize only an approximation to the GMM loglikelihood, a first important question is about the quality of this approximation. To this end, we plot both energy functions over time when training on full MNIST and SVHN images, see Fig. 3. This figure shows that the loglikelihood obtained by EMbased training (using the implementation of sklearn) with the same configuration (using iterations and a kmeansbased initialization) achieves a very similar loglikelihood, which confirms that SGD is indeed comparable in performance.
4.2 Effects and Benefits of annealing
In order to demonstrate the beneficial effects of the annealing procedure, we conduct the basic experiment on full MNIST with annealing effectively turned off, which is achieved by setting . As a result, in Fig. 4 we observe typical singlecomponent solutions when visualizing the centroids and weights, along with a markedly inferior (higher) loss value of , in contrast to when annealing is turned on. As a side effect, annealing enforces a topological ordering of centroids in the manner of a SOM (see also App. B
for a mathematical analysis), which is only of aesthetic value for the moment, but might be exploited in the future.
4.3 Basic Feasibility and Parameter Space
We put everything together and train an GMM using SGD on various datasets of high and intermediate dimensionality (results: see Fig. 2 and App. D). In particular, we analyze how the choice of the relative adaptation strengths , and number of components affects the final loglikelihood. The results presented in Tab. 1 suggest that increasing is always beneficial, and that the covariances and weights should be adapted more slowly than the centroids. The batch size had an effect on convergence speed but no significant impact on the final loss value (not shown here).
Parameter  

,  ,  ,  ,  ,  
,  ,  ,  ,  ,  
,  ,  ,  ,  , 
4.4 Robustness to initial conditions
To test whether SGD converges as a function of the initial conditions, we train three GMMs on MNIST using three different initializations: . All of these choices led to nonconvergence of the EMbased GMM implementation of sklearn. Differently from the other experiments, we choose a learning rate of since the original value does not lead to convergent learning. The alternative is to double the training time which works as well. Fig. 5 shows the development of the loss function in each of the three cases, and we observe that all three cases indeed converge by comparing the final loss values to Fig. 3 (middle) and to the EM baseline. This behavior persists across all datasets, but only MNIST is shown due to space limitations.
4.5 SGD Training of Simplified GMMs with Principal Local Directions
We train GMMs with local principal directions on MNIST and SVHN patches and log the final loss function values to assess model quality. We find that learned centroids and final loss values approach those of the diagonal model for for both datasets. Please see App. A for a visualization of the final centroids. For the full datasets, we find that MNIST and SVHN require values of and , respectively for achieving similar loss values as the diagonal model.
5 Discussion
We showed that GMMs can be trained on highdimensional image data by SGD, at very small batch sizes, in a purely online fashion.
Memory and execution time requirements are very modest, and can be reduced even more by using intelligent simplifications of the GMM model.
Here, we would like to discuss a few noteworthy points and suggest some avenues for improving the model:
Robust Convergence
Unlike EM approaches which require a careful selection of starting point, the presented SGD scheme is very robust to initial conditions due to the proposed annealing scheme.
In all experiments with highdimensional image data, we found no initializations that did not, given reasonable time, converge to a regular solution.
Numerical Stability
While we are not optimizing the full loglikelihood here, the maxcomponent approximation is actually a good one, and has the advantage of absolute numerical stability.
Clearly, clever ways might be found to avoid instabilities in the loglikelihood function computation, but when it comes to automatically computed gradients this is no longer possible, and indeed we found that it was mainly the (inaccessible) gradient computations where NaN values happened mostly.
Complex annealing procedure
The complex way of regularizing, which is similar in spirit to simulated annealing approaches, may seem daunting, although the time constants can be determined by simple heuristics, see App. C.
A next step will be to replace this procedure by an automated scheme based on an evaluation of the loss function: when it is stationary, can be decreased slightly.
Repeating this strategy until a small minimal value is reached should remove the need to fix annealing parameters except for time constants which can be chosen as a function of dataset size.
Free Parameters
Apart from the annealing mechanism, the model contains the same parameters any SGD approach would, the learning rate and the weighting constants for the model parameters: , and .
We found no experimental evidence suggesting that the latter require values different from , although from general principles it might be sensible to adapt the variances and weights slower than the centroids.
Further, it is required to understand how these constants can be used to obtain better solutions.
Interesting is that standard DNN optimizers like Adam Kingma and Ba (2014) did not seem to produce satisfactory solutions which also requires looking into.
Approximate Gradient Descent
The procedure described in Sec. 3.4 performs gradient descent on an approximation to the full loglikelihood Eq. 1.
This might seem unsatisfactory, so we would like to point out that the original EM procedure does nothing else: it optimizes a lower bound of the loglikelihood.
In particular, the Mstep is not guaranteed to improve the loglikelihood at all: all we know is that it will never make it worse.
In our SGD approach, there is a simple way to fix this shortcoming, which we will tackle as an next step: we can replace the maxoperation in Eq. 2 by a softmax function with an initially large steepness parameter which will make it indistinguishable from a discrete max operation.
After convergence, can be relaxed, which will result in a smooth transition to optimizing the full loglikelihood, but from an already wellconverged state.
6 Conclusion and Outlook
On a more abstract level, we developed the presented method because of our interest in continual or incremental learning, which is essentially about eliminating the wellknown catastrophic forgetting effect. We believe GMMs are an essential building block for continual learning models since their parameter updates are purely local in the sense that only components that are close to the currently bestmatching component are updated. Next steps will consist of constructing deep network classifiers from convolutional GMM layers, with a readout layer on top (all layers being trained by SGD). Further, we will investigate how to sample efficiently from such hierarchical convolutional GMMs, allow generating large batches of samples in a replaybased architecture for continual learning.
References
 Modelbased clustering of highdimensional data: a review. Computational Statistics & Data Analysis 71, pp. 52–78. Cited by: §1.2.

Incremental local online gaussian mixture regression for imitation learning of multiple tasks
. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 267–274. Cited by: §1.2.  A fast, accurate approximation to log likelihood of gaussian mixture models. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3817–3820. Cited by: §1.2.

Incremental learning of multivariate gaussian mixture models.
In
Advances in Artificial Intelligence – SBIA 2010
, A. C. da Rocha Costa, R. M. Vicari, and F. Tonidandel (Eds.), Berlin, Heidelberg, pp. 82–91. External Links: ISBN 9783642161384 Cited by: §1.2.  Energy functions for selforganizing maps. In Kohonen maps, pp. 303–315. Cited by: Appendix B.
 Matrix manifold optimization for gaussian mixtures. In Advances in Neural Information Processing Systems, pp. 910–918. Cited by: §1.2, §3.2.

Gaussian mixture model with precision matrices approximated by sparsely represented eigenvectors
. In 2014 22nd Telecommunications Forum Telfor (TELFOR), pp. 435–440. Cited by: §1.2.  Adam: a method for stochastic optimization. Note: cite arxiv:1412.6980Comment: Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015 External Links: Link Cited by: §5.
 The selforganizing map. Proceedings of the IEEE 78 (9), pp. 1464–1480. Cited by: Appendix B.
 Incremental learning with gaussian mixture models. In Computer Vision Winter Workshop, pp. 25–32. Cited by: §1.2.
 Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.
 Finite mixture models and modelbased clustering. Statistics Surveys 4, pp. 80–116. Cited by: §1.

Reading digits in natural images with unsupervised feature learning.
In
NIPS workshop on deep learning and unsupervised feature learning
, Vol. 2011, pp. 5. Cited by: §2. 
Averaging, maximum penalized likelihood and bayesian estimation for improving gaussian mixture probability density estimates.
IEEE Transactions on Neural Networks
9 (4), pp. 639–650. Cited by: §1.2.  A Comprehensive, Applicationoriented Study of Catastrophic Forgetting in DNNs. In International Conference on Learning Representations (ICLR), Cited by: Appendix D.
 Approximations to the loglikelihood function in the nonlinear mixedeffects model. Journal of computational and Graphical Statistics 4 (1), pp. 12–35. Cited by: §1.2.
 A fast incremental gaussian mixture model. PloS one 10 (10), pp. e0139931. Cited by: §1.2.
 Highly efficient incremental estimation of gaussian mixture models for online data stream clustering. In Intelligent Computing: Theory and Applications III, Vol. 5803, pp. 174–183. Cited by: §1.2.
 Selforganizing mixture models. Neurocomputing 63, pp. 99–123. Cited by: §1.2.
 Efficient greedy learning of gaussian mixture models. Neural computation 15 (2), pp. 469–485. Cited by: §1.2.
 Incremental online learning in high dimensions. Neural computation 17 (12), pp. 2602–2634. Cited by: §1.2.
 A greedy em algorithm for gaussian mixture learning. Neural Processing Letters 15 (1), pp. 77–87. External Links: ISSN 1573773X, Document, Link Cited by: §1.2.
Appendix A Local principal directions
Here, we give a visualization of the resulting centroids for various choices of when training on MNIST patches.
Appendix B Rigorous Link to SelfOrganizing Maps
It is known that the selforganizing map (SOM, Kohonen (1990)) rule has no energy function it is minimizing. However, some modifications (see Heskes (1999)) have been proposed that ensure the existence of a energy function. These energybased SOM models reproduce all features of the original model and use a learning rule that the original SOM algorithm is actually approximating very closely. In the notation of this article, SOMs model the data through prototypes and neighborhood functions defined on a periodic 2D grid, and their energy function is written as
(12) 
Discarding constant terms, we find that this is actually identical to the maxcomponent loglikelihood approximation given in Eq. 2 with constant equiprobable weights and a constant diagonal with equal diagonal entries. To our knowledge, this is the first time that a rigorous link between SOMs and GMMs has been established based on a comparison of energy functions, showing that SOMs are actually implementing a annealingtype approximation to the full GMM model, with component weights and variances chosen in a special way.
Appendix C RulesOfThumb for SGDTraining of GMMs on images
When training DNNs by SGD, several parameters need to be set according to heuristics. Here, we present some rulesofthumb for performing this selection with GMMs. Generally, we can always set the batch size to , and . In all of our experiments, it was always feasible to set
to half an epoch, and
to half an epoch, too. However, tuning these parameters can speed up training significantly. The number of prototypes is a critical choice, but follows a simple “the more the better” rule. From the mathematical foundations of GMMs, it is evident that more components must always be able to reach a lower loss (except for possible local minima). The relative adaptation strengths , and can always be set to 1 (at least we did not find any examples where that did not work).
[leftmargin=2em]

Choose the number of iterations a an increasing function of : the more components, the more free parameters, the more time to train is needed

Flatten images, and put them as rows with random indices (as usual for SGD) into a tensor

Choose a quadratic number of components as

Initialize to .

Preliminary train the GMM with constant . Select as the first time step where loss does no longer decrease. It never hurts if is chosen too large, it just takes longer.

Select by trial and error on the train set: start with very large values and use the smallest value that gives the same final loss value. Larger values for never hurt but take longer.

Train with the chosen parameters!
Appendix D More Results
Visualization of prototypes training on different datasets (extracted 10 random classes, see Pfülb and Gepperth (2019) for details about the datasets). Each result (in the form of centroid visualizations) is recorded after training for epoch on a given dataset with (see Figure 7 and Figure 8).