Abstract
Modelling of multivariate densities is a core component in many signal processing, pattern recognition and machine learning applications. The modelling is often done via Gaussian mixture models (GMMs), which use computationally expensive and potentially unstable training algorithms. We provide an overview of a fast and robust implementation of GMMs in the C++ language, employing multi-threaded versions of the Expectation Maximisation (EM) and k-means training algorithms. Multi-threading is achieved through reformulation of the EM and k-means algorithms into a MapReduce-like framework. Furthermore, the implementation uses several techniques to improve numerical stability and modelling accuracy. We demonstrate that the multi-threaded implementation achieves a speedup of an order of magnitude on a recent 16 core machine, and that it can achieve higher modelling accuracy than a previously well-established publicly accessible implementation. The multi-threaded implementation is included as a user-friendly class in recent releases of the open source Armadillo C++ linear algebra library. The library is provided under the permissive Apache 2.0 license, allowing unencumbered use in commercial products.
1 Introduction
Modelling multivariate data through a convex mixture of Gaussians, also known as a Gaussian mixture model (GMM), has many uses in fields such as signal processing, econometrics, pattern recognition, machine learning and computer vision. Examples of applications include multi-stage feature extraction for action recognition [4], modelling of intermediate features derived from deep convolutional neural networks [11, 12, 16], classification of human epithelial cell images [32], implicit sparse coding for face recognition [33], speech-based identity verification [28], and probabilistic foreground estimation for surveillance systems [26]. GMMs are also commonly used as the emission distribution for hidden Markov models [2].

In the GMM approach, a distribution of samples (vectors) is modelled as:
(1)   p(x|\lambda) = \sum_{g=1}^{N_G} w_g \, \mathcal{N}(x|\mu_g, \Sigma_g)
where x is a D-dimensional vector, w_g is the weight for component g (with constraints \sum_{g=1}^{N_G} w_g = 1 and w_g \geq 0), and \mathcal{N}(x|\mu_g, \Sigma_g) is a D-dimensional Gaussian density function with mean \mu_g and covariance matrix \Sigma_g:
(2)   \mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\left[ -\tfrac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu) \right]
where |\Sigma| and \Sigma^{-1} denote the determinant and inverse of \Sigma, respectively, while (x-\mu)^\top denotes the transpose of (x-\mu). The full parameter set can be compactly stated as \lambda = \{ w_g, \mu_g, \Sigma_g \}_{g=1}^{N_G}, where N_G is the number of Gaussians.
Given a training dataset and a value for N_G, the estimation of \lambda is typically done through a tailored instance of the Expectation Maximisation (EM) algorithm [8, 21, 24, 27]. The k-means algorithm [3, 9, 18] is also typically used for providing the initial estimate of \lambda for the EM algorithm. Choosing the optimal N_G is data dependent and beyond the scope of this work; see [14, 25] for example methods.
Unfortunately, GMM parameter estimation via the EM algorithm is computationally intensive and can suffer from numerical stability issues. Given the ever growing sizes of datasets and the need for fast, robust and accurate modelling of such datasets, we have provided an open source implementation of multi-threaded (parallelised) versions of the k-means and EM algorithms. In addition, core functions are recast in order to considerably reduce the likelihood of numerical instability due to floating point underflows and overflows. The implementation is provided as a user-friendly class in recent releases of the cross-platform Armadillo C++ linear algebra library [29, 30]. The library is licensed under the permissive Apache 2.0 license [31], thereby allowing unencumbered use in commercial products.
We continue the paper as follows. In Section 2 we provide an overview of parameter estimation via the EM algorithm, its reformulation for multi-threaded execution, and approaches for improving numerical stability. In Section 3 we provide a summary of the k-means algorithm along with approaches for improving its convergence and modelling accuracy. The implementation in C++ is overviewed in Section 4, where we list and describe the user accessible functions. In Section 5 we provide a demonstration that the implementation can achieve a speedup of an order of magnitude on a recent 16 core machine, as well as obtain higher modelling accuracy than a previously well-established publicly accessible implementation.
2 Expectation Maximisation and Multi-Threading
The overall likelihood for a set of samples, X = \{x_i\}_{i=1}^{N_V}, is found using p(X|\lambda) = \prod_{i=1}^{N_V} p(x_i|\lambda). A parameter set \lambda that suitably models the underlying distribution of X can be estimated using a particular instance of the Expectation Maximisation (EM) algorithm [8, 21, 24, 27]. As its name suggests, the EM algorithm is comprised of iterating two steps: the expectation step, followed by the maximisation step. GMM parameters generated by the previous iteration (\lambda^{old}) are used by the current iteration to generate a new set of parameters (\lambda^{new}), such that p(X|\lambda^{new}) \geq p(X|\lambda^{old}).
In a direct implementation of the EM algorithm specific to GMMs, the estimated versions of the parameters (\hat{w}_g, \hat{\mu}_g, \hat{\Sigma}_g) within one iteration are calculated as follows:
(3)   p(g|x_i, \lambda) = \frac{ w_g \, \mathcal{N}(x_i|\mu_g, \Sigma_g) }{ \sum_{k=1}^{N_G} w_k \, \mathcal{N}(x_i|\mu_k, \Sigma_k) }

(4)   L_g = \sum_{i=1}^{N_V} p(g|x_i, \lambda)

(5)   \hat{w}_g = \frac{L_g}{N_V}

(6)   \hat{\mu}_g = \frac{1}{L_g} \sum_{i=1}^{N_V} p(g|x_i, \lambda) \, x_i

(7)   \hat{\Sigma}_g = \frac{1}{L_g} \sum_{i=1}^{N_V} p(g|x_i, \lambda) \, (x_i - \hat{\mu}_g)(x_i - \hat{\mu}_g)^\top
Once the estimated parameters for all Gaussians are found, the parameters are updated, \lambda^{new} = \{ \hat{w}_g, \hat{\mu}_g, \hat{\Sigma}_g \}_{g=1}^{N_G}, and the iteration starts anew. The process is typically repeated until the number of iterations has reached a predefined number, and/or the increase in the overall likelihood after each iteration falls below a predefined threshold.
In Eqn. (3), p(g|x_i, \lambda) is the a-posteriori probability of Gaussian g given x_i and current parameters. Thus the estimates \hat{\mu}_g and \hat{\Sigma}_g are weighted versions of the sample mean and sample covariance, respectively.

Overall, the algorithm is a hill climbing procedure for maximising p(X|\lambda). While there are no guarantees that it will reach a global maximum, it is guaranteed to monotonically converge to a saddle point or a local maximum [8, 9, 22]. The above implementation can also be interpreted as an unsupervised probabilistic clustering procedure, with N_G being the assumed number of clusters. For a full derivation of the EM algorithm tailored to GMMs, the reader is directed to [2, 27] or Appendix A.
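To make the update rules concrete, one iteration of Eqns. (3) to (7) can be sketched for the one-dimensional case as follows. This is an illustrative minimal sketch only; the struct and function names are ours and are not part of the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal single-threaded EM iteration for a 1-D GMM, following
// Eqns. (3)-(7); illustrative only, not the library's internals.
struct Gauss1D { double w, mu, var; };  // weight, mean, variance

constexpr double kPi = 3.14159265358979323846;

double gauss_pdf(double x, double mu, double var) {
  return std::exp(-0.5 * (x - mu) * (x - mu) / var) / std::sqrt(2.0 * kPi * var);
}

void em_iteration(std::vector<Gauss1D>& model, const std::vector<double>& X) {
  const std::size_t NG = model.size(), NV = X.size();
  std::vector<double> L(NG, 0.0), mu_acc(NG, 0.0), var_acc(NG, 0.0);
  std::vector<std::vector<double>> resp(NG, std::vector<double>(NV));
  for (std::size_t i = 0; i < NV; ++i) {
    double denom = 0.0;
    for (std::size_t g = 0; g < NG; ++g)
      denom += model[g].w * gauss_pdf(X[i], model[g].mu, model[g].var);
    for (std::size_t g = 0; g < NG; ++g) {
      // Eqn. (3): a-posteriori probability of Gaussian g given sample i
      const double p = model[g].w * gauss_pdf(X[i], model[g].mu, model[g].var) / denom;
      resp[g][i] = p;
      L[g]      += p;         // Eqn. (4)
      mu_acc[g] += p * X[i];  // numerator of Eqn. (6)
    }
  }
  for (std::size_t g = 0; g < NG; ++g) {
    model[g].w  = L[g] / NV;         // Eqn. (5)
    model[g].mu = mu_acc[g] / L[g];  // Eqn. (6)
  }
  for (std::size_t g = 0; g < NG; ++g) {  // Eqn. (7), using the updated means
    for (std::size_t i = 0; i < NV; ++i)
      var_acc[g] += resp[g][i] * (X[i] - model[g].mu) * (X[i] - model[g].mu);
    model[g].var = var_acc[g] / L[g];
  }
}
```

Starting from well-separated initial means, a single such iteration already pulls the means towards the centres of the two underlying clusters.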
2.1 Reformulation for Multi-Threaded Execution
The EM algorithm is quite computationally intensive. This is in large part due to the use of the exp() function, which needs to be applied numerous times for each and every sample. Fortunately, CPUs with a multitude of cores are now quite common and accessible, allowing for multi-threaded (parallel) execution.
One approach for parallelisation is the MapReduce framework [7], where data is split into chunks and farmed out to separate workers for processing (mapping). The results are then collected and combined (reduced) to produce the final result. Below we provide a reformulation of the EM algorithm into a MapReduce-like framework.
As Eqn. (3) can be executed independently for each sample, the summations in Eqns. (4) and (6) can be split into separate sets of summations, where the summation in each set can be executed independently and in parallel with other sets. To allow similar splitting of the summation for calculating covariance matrices, Eqn. (7) needs to be rewritten into the following form:
(8)   \hat{\Sigma}_g = \left[ \frac{1}{L_g} \sum_{i=1}^{N_V} p(g|x_i, \lambda) \, x_i x_i^\top \right] - \hat{\mu}_g \hat{\mu}_g^\top
The multi-threaded estimation of the parameters can now be formally stated as follows. Given N_T threads, the training samples are split into N_T chunks, with each chunk containing approximately the same number of samples. For the thread with index t \in [1, N_T], the start index of its samples is denoted by i_t^{start}, while the end index is denoted by i_t^{end}. For each thread t and Gaussian g, the accumulators \mathrm{acc}_L^{[t,g]}, \mathrm{acc}_\mu^{[t,g]} and \mathrm{acc}_\Sigma^{[t,g]} are calculated as follows:
(9)    \mathrm{acc}_L^{[t,g]} = \sum_{i = i_t^{start}}^{i_t^{end}} p(g|x_i, \lambda)

(10)   \mathrm{acc}_\mu^{[t,g]} = \sum_{i = i_t^{start}}^{i_t^{end}} p(g|x_i, \lambda) \, x_i

(11)   \mathrm{acc}_\Sigma^{[t,g]} = \sum_{i = i_t^{start}}^{i_t^{end}} p(g|x_i, \lambda) \, x_i x_i^\top
where p(g|x_i, \lambda) is defined in Eqn. (3).
Once the accumulators for all threads are calculated, for each Gaussian g the reduction operation combines them to form the estimates of \hat{w}_g, \hat{\mu}_g and \hat{\Sigma}_g as follows:
(12)   \hat{w}_g = \frac{1}{N_V} \sum_{t=1}^{N_T} \mathrm{acc}_L^{[t,g]}

(13)   \hat{\mu}_g = \frac{ \sum_{t=1}^{N_T} \mathrm{acc}_\mu^{[t,g]} }{ \sum_{t=1}^{N_T} \mathrm{acc}_L^{[t,g]} }

(14)   \hat{\Sigma}_g = \frac{ \sum_{t=1}^{N_T} \mathrm{acc}_\Sigma^{[t,g]} }{ \sum_{t=1}^{N_T} \mathrm{acc}_L^{[t,g]} } - \hat{\mu}_g \hat{\mu}_g^\top
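A minimal sketch of this map and reduce structure is given below, using std::thread rather than the OpenMP directives used in the actual implementation; resp() stands in for p(g|x_i, \lambda) of a single Gaussian in one dimension, and all names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Per-thread accumulators acc_L, acc_mu and acc_Sigma of Eqns. (9)-(11),
// for one Gaussian in 1-D.
struct Acc { double L = 0.0, mu = 0.0, S = 0.0; };

template <typename Resp>
Acc map_reduce(const std::vector<double>& X, Resp resp, unsigned n_threads) {
  std::vector<Acc> partial(n_threads);  // one accumulator set per thread
  std::vector<std::thread> workers;
  const std::size_t chunk = (X.size() + n_threads - 1) / n_threads;
  for (unsigned t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t]() {
      const std::size_t start = t * chunk;
      const std::size_t end   = std::min(X.size(), start + chunk);
      for (std::size_t i = start; i < end; ++i) {  // map: Eqns. (9)-(11)
        const double p = resp(X[i]);
        partial[t].L  += p;
        partial[t].mu += p * X[i];
        partial[t].S  += p * X[i] * X[i];
      }
    });
  }
  for (auto& w : workers) w.join();
  Acc total;  // reduce: the sums over t in Eqns. (12)-(14)
  for (const auto& a : partial) {
    total.L += a.L;  total.mu += a.mu;  total.S += a.S;
  }
  return total;
}
```

The estimates then follow as mu_hat = total.mu / total.L and Sigma_hat = total.S / total.L - mu_hat * mu_hat, mirroring Eqns. (13) and (14).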
2.2 Improving Numerical Stability
Due to the necessarily limited precision of numerical floating point representations [13, 23], the direct computation of Eqns. (1) and (2) can quickly lead to numerical underflows or overflows, which in turn lead to either poor models or a complete failure to estimate the parameters. To address this problem, the following reformulation can be used. First, the logarithm version of Eqn. (2) is taken:
(15)   \log \mathcal{N}(x|\mu, \Sigma) = -\tfrac{1}{2} \left[ D \log(2\pi) + \log |\Sigma| + (x-\mu)^\top \Sigma^{-1} (x-\mu) \right]
which leads to the corresponding logarithm version of Eqn. (1):
(16)   \log p(x|\lambda) = \log \sum_{g=1}^{N_G} \exp\!\left[ \log w_g + \log \mathcal{N}(x|\mu_g, \Sigma_g) \right]
The right hand side of Eqn. (16) can be expressed as a repeated addition in the form of:

(17)   \log(a + b) = \log\!\left( \exp[\log a] + \exp[\log b] \right)
which in turn can be rewritten in the form of:

(18)   \log(a + b) = \log a + \log\!\left( 1 + \exp[\log b - \log a] \right)
In the latter form, if we ensure that \log a \geq \log b (through swapping \log a and \log b when required), the exponential will always produce values \leq 1, which helps to reduce the occurrence of overflows. Overall, by keeping most of the computation in the log domain, both underflows and overflows are considerably reduced.
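Eqn. (18) is the familiar log-add (two-term log-sum-exp) operation; a minimal sketch:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Numerically stable log(a + b) computed from log(a) and log(b), as per
// Eqn. (18); the larger argument is factored out so the exponential only
// ever sees values <= 0, and log1p() stays accurate when b << a.
double log_add(double log_a, double log_b) {
  if (log_a < log_b) std::swap(log_a, log_b);  // ensure log_a >= log_b
  return log_a + std::log1p(std::exp(log_b - log_a));
}
```

For example, log_add(-1000.0, -1001.0) remains finite, whereas evaluating log(exp(-1000) + exp(-1001)) directly underflows both exponentials to zero and yields negative infinity.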
A further practical issue is the occurrence of degenerate or ill-conditioned covariance matrices, stemming either from not enough samples with p(g|x_i, \lambda) > 0 contributing to the calculation of \hat{\Sigma}_g in Eqn. (7), or from too many samples which are essentially the same (ie., have very low variance). When the diagonal entries in a covariance matrix are too close to zero, inversion of the matrix is unstable and can cause the calculated log-likelihood to become unreliable or non-finite. A straightforward and effective approach to address this problem is to place an artificial floor on the diagonal entries in each covariance matrix after each EM iteration. While the optimum value of the floor is data dependent, a small positive constant is typically sufficient to promote numerical stability and convergence.
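Applying the floor can be sketched as follows, with var_floor standing for the data-dependent constant:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Raise any diagonal covariance entry below var_floor up to the floor;
// applied to each Gaussian after every EM iteration.
void apply_var_floor(std::vector<double>& dcov, double var_floor) {
  for (double& v : dcov)
    v = std::max(v, var_floor);
}
```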
3 Initialisation via Multi-Threaded k-Means
As a starting point, the initial means can be set to randomly selected training vectors, the initial covariance matrices can be set equal to identity matrices, and the initial weights can be uniform. However, the exp() function as well as the matrix inverse in Eqn. (2) are typically quite time consuming to compute. In order to speed up training, the initial estimate of \lambda is typically provided via the k-means clustering algorithm [3, 9, 15], which avoids such time consuming operations.
The baseline k-means clustering algorithm is a straightforward iterative procedure comprised of two steps: (i) calculating the distance from each sample to each mean, and (ii) calculating the new version of each mean as the average of samples which were found to be the closest to the previous version of the corresponding mean. The required number of iterations is data dependent, but about 10 iterations are often sufficient to generate a good initial estimate of \lambda.
The k-means algorithm can be interpreted as a simplified version (or special case) of the EM algorithm for GMMs [15]. Instead of each sample being assigned a set of probabilities representing cluster membership (soft assignment), each sample is assigned to only one cluster (hard assignment). Furthermore, it can be assumed that the covariance matrix of each Gaussian is non-informative, diagonal, and/or shared across all Gaussians. More formally, the estimation of model parameters is as per Eqns. (5), (6) and (7), but p(g|x_i, \lambda) is redefined as:
(19)   p(g|x_i, \lambda) = \begin{cases} 1, & \text{if } g = \arg\min_{k} \, \mathrm{dist}(\mu_k, x_i) \\ 0, & \text{otherwise} \end{cases}
where \mathrm{dist}(\mu, x) is a distance metric. Apart from this difference, the parameter estimation is the same as for EM. As such, multi-threading is achieved as per Section 2.1.
We note that it is possible to implement the k-means algorithm in a multitude of ways, such as the cluster splitting LBG algorithm [18], or to use an elaborate strategy for selecting the initial means [1]. While there are also alternative and more complex implementations offering relatively fast execution [10], we have elected to adapt the baseline k-means algorithm due to its straightforward amenability to multi-threading.
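The two steps of the baseline algorithm can be sketched for 1-D data with the squared Euclidean distance; this is a single-threaded illustration, not the multi-threaded implementation:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One baseline k-means iteration: (i) assign each sample to its nearest
// mean, (ii) recompute each mean as the average of its assigned samples.
void kmeans_iteration(std::vector<double>& means, const std::vector<double>& X) {
  const std::size_t K = means.size();
  std::vector<double> sum(K, 0.0);
  std::vector<std::size_t> count(K, 0);
  for (double x : X) {
    std::size_t best = 0;  // step (i): find the nearest mean
    for (std::size_t k = 1; k < K; ++k)
      if ((x - means[k]) * (x - means[k]) < (x - means[best]) * (x - means[best]))
        best = k;
    sum[best] += x;
    ++count[best];
  }
  for (std::size_t k = 0; k < K; ++k)  // step (ii): recompute the means
    if (count[k] > 0)
      means[k] = sum[k] / count[k];
}
```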
3.1 Issues with Modelling Accuracy and Convergence
A typical and naive choice for \mathrm{dist}(\mu, x) in Eqn. (19) is the squared Euclidean distance, \mathrm{dist}(\mu, x) = \| \mu - x \|^2. However, for multivariate datasets formed by combining data from various sensors, there is a notable downside to using the Euclidean distance. When one of the dimensions within the data has a much larger range than the other dimensions, it will dominate the contribution to the overall distance, with the other dimensions effectively ignored. This can adversely skew the initial parameter estimates, easily leading to poor initial conditions for the EM algorithm. This in turn can lead to poor modelling, as the EM algorithm is only guaranteed to reach a local maximum [8, 9, 22]. To address this problem, the squared Mahalanobis distance can be used [3, 9]:

(20)   \mathrm{dist}(\mu, x) = (\mu - x)^\top \Sigma_{global}^{-1} (\mu - x)
where \Sigma_{global} is a global covariance matrix, estimated from all available training data. To maintain efficiency, \Sigma_{global} is typically diagonal, which makes calculating its inverse straightforward (ie., taking the reciprocals of the values on the main diagonal).
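With a diagonal \Sigma_{global}, Eqn. (20) reduces to a per-dimension normalisation; a minimal sketch:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Squared Mahalanobis distance of Eqn. (20) with a diagonal global
// covariance: each dimension is scaled by the reciprocal of its variance,
// so no dimension can dominate purely due to having a larger range.
double sq_mahalanobis_diag(const std::vector<double>& x,
                           const std::vector<double>& mu,
                           const std::vector<double>& global_var) {
  double d = 0.0;
  for (std::size_t j = 0; j < x.size(); ++j) {
    const double diff = x[j] - mu[j];
    d += diff * diff / global_var[j];
  }
  return d;
}
```

In the test below, the second dimension is 100 times larger in scale, yet after normalisation by its (much larger) variance it contributes the same amount to the distance as the first dimension.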
In practice it is possible that while iterating, at least one of the means has no vectors assigned to it, becoming a "dead" mean. This might stem from an unfortunate starting point, or from specifying a relatively large value for N_G when modelling a relatively small dataset. As such, an additional heuristic is required to resolve this situation. An effective approach for resurrecting a "dead" mean is to make it equal to one of the vectors that has been assigned to the most "popular" mean, where the most "popular" mean is the one that currently has the most vectors assigned to it.
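This heuristic can be sketched as follows (1-D, with hypothetical names; assignments[i] holds the index of the mean currently closest to X[i]):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Resurrect a "dead" mean by setting it to one of the vectors assigned
// to the most "popular" mean (the mean with the most assigned vectors).
void resurrect_dead_mean(std::vector<double>& means,
                         const std::vector<std::size_t>& assignments,
                         const std::vector<double>& X,
                         std::size_t dead) {
  std::vector<std::size_t> count(means.size(), 0);
  for (std::size_t a : assignments) ++count[a];
  std::size_t popular = 0;
  for (std::size_t k = 1; k < count.size(); ++k)
    if (count[k] > count[popular]) popular = k;
  for (std::size_t i = 0; i < X.size(); ++i)
    if (assignments[i] == popular) {  // take one of its vectors
      means[dead] = X[i];
      return;
    }
}
```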
4 Implementation in C++
We have provided a numerical implementation of Gaussian Mixture Models in the C++ language as part of recent releases of the open source Armadillo C++ linear algebra library [29]. The library is available under the permissive Apache 2.0 license [31], and can be obtained from http://arma.sourceforge.net. To considerably reduce execution time, the implementation contains multi-threaded versions of the EM and k-means training algorithms (as overviewed in Sections 2 and 3). Implementation of multi-threading is achieved with the aid of OpenMP pragma directives [5].
There are two main choices for the type of covariance matrix \Sigma_g: full and diagonal. While full covariance matrices have more capacity for modelling data, diagonal covariance matrices provide several practical advantages:

- the computationally expensive (and potentially unstable) matrix inverse operation in Eqn. (2) is reduced to simply taking the reciprocals of the diagonal elements,
- the determinant operation is considerably simplified to taking the product of the diagonal elements,
- diagonal covariance matrices contain fewer parameters that need to be estimated, and hence require fewer training samples [9].
Given the above practical considerations, the implementation uses diagonal covariance matrices. We note that diagonal covariance GMMs with N_G > 1 can model distributions of samples with correlated elements, which in turn suggests that full covariance GMMs can be approximated using diagonal covariance GMMs with a larger number of Gaussians [28].
4.1 User Accessible Classes and Functions
The implementation is provided as two user-friendly classes within the arma namespace: gmm_diag and fgmm_diag. The former uses double precision floating point values, while the latter uses single precision floating point values. For an instance of the double precision gmm_diag class named M, its member functions and variables are listed below. The interface allows the user full control over the parameters for GMM fitting, as well as easy and flexible access to the trained model. Figure 1 contains a complete C++ program which demonstrates usage of the gmm_diag class.
In the description below, all vectors and matrices refer to corresponding objects from the Armadillo library; scalars have the type double, matrices have the type mat, column vectors have the type vec, row vectors have the type rowvec, row vectors of unsigned integers have the type urowvec, and indices have the type uword (representing an unsigned integer). When using the single precision fgmm_diag class, all vector and matrix types have the f prefix (for example, fmat), while scalars have the type float. The word "heft" is explicitly used in the classes as a shorter version of "weight", while keeping the same meaning within the context of GMMs.

M.log_p(V)
return a scalar (double precision floating point value) representing the loglikelihood of column vector V 
M.log_p(V, g)
return a scalar (double precision floating point value) representing the loglikelihood of column vector V, according to Gaussian with index g (specified as an unsigned integer of type uword) 
M.log_p(X)
return a row vector (of type rowvec) containing loglikelihoods of each column vector in matrix X 
M.log_p(X, g)
return a row vector (of type rowvec) containing loglikelihoods of each column vector in matrix X, according to Gaussian with index g (specified as an unsigned integer of type uword) 
M.avg_log_p(X)
return a scalar (double precision floating point value) representing the average loglikelihood of all column vectors in matrix X 
M.avg_log_p(X, g)
return a scalar (double precision floating point value) representing the average loglikelihood of all column vectors in matrix X, according to Gaussian with index g (specified as an unsigned integer of type uword) 
M.assign(V, dist_mode)
return an unsigned integer (of type uword) representing the index of the closest mean (or Gaussian) to vector V; the parameter dist_mode is one of:
eucl_dist: Euclidean distance (takes only means into account)
prob_dist: probabilistic "distance", defined as the inverse likelihood (takes into account means, covariances and hefts)
M.assign(X, dist_mode)
return a row vector of unsigned integers (of type urowvec) containing the indices of the closest means (or Gaussians) to each column vector in matrix X; parameter dist_mode is eucl_dist or prob_dist, as per the .assign() function above 
M.raw_hist(X, dist_mode)
return a row vector of unsigned integers (of type urowvec) representing the raw histogram of counts; each entry corresponds to a Gaussian, and is the number of times that Gaussian was the closest to the column vectors in matrix X; parameter dist_mode is eucl_dist or prob_dist, as per the .assign() function above 
M.norm_hist(X, dist_mode)
similar to the .raw_hist() function above; return a row vector (of type rowvec) containing normalised counts; the vector sums to one; parameter dist_mode is either eucl_dist or prob_dist, as per the .assign() function above 
M.generate()
return a column vector (of type vec) representing a random sample generated according to the model’s parameters 
M.generate(N)
return a matrix (of type mat) containing N column vectors, with each vector representing a random sample generated according to the model’s parameters 
M.n_gaus()
return an unsigned integer (of type uword) containing the number of means/Gaussians in the model 
M.n_dims()
return an unsigned integer (of type uword) containing the dimensionality of the means/Gaussians in the model 
M.reset(n_dims, n_gaus)
set the model to have dimensionality n_dims, with n_gaus number of Gaussians, specified as unsigned integers of type uword; all the means are set to zero, all diagonal covariances are set to one, and all the hefts (weights) are set to be uniform 
M.save(filename)
save the model to a file and return a bool indicating either success (true) or failure (false) 
M.load(filename)
load the model from a file and return a bool indicating either success (true) or failure (false) 
M.means
readonly matrix (of type mat) containing the means (centroids), stored as column vectors 
M.dcovs
readonly matrix (of type mat) containing the diagonal covariances, with the set of diagonal covariances for each Gaussian stored as a column vector 
M.hefts
readonly row vector (of type rowvec) containing the hefts (weights) 
M.set_means(X)
set the means (centroids) to be as specified in matrix X (of type mat), with each mean (centroid) stored as a column vector; the number of means and their dimensionality must match the existing model 
M.set_dcovs(X)
set the diagonal covariances to be as specified in matrix X (of type mat), with the set of diagonal covariances for each Gaussian stored as a column vector; the number of diagonal covariance vectors and their dimensionality must match the existing model 
M.set_hefts(V)
set the hefts (weights) of the model to be as specified in row vector V (of type rowvec); the number of hefts must match the existing model 
M.set_params(means, dcovs, hefts)
set all the parameters at the same time, using matrices denoted as means and dcovs as well as the row vector denoted as hefts; the layout of the matrices and vectors is as per the .set_means(), .set_dcovs() and .set_hefts() functions above; the number of Gaussians and dimensionality can be different from the existing model 
M.learn(data, n_gaus, dist_mode, seed_mode, km_iter, em_iter, var_floor, print_mode)
learn the model parameters via the k-means and/or EM algorithms, and return a boolean value, with true indicating success, and false indicating failure; the parameters have the following meanings:
data
matrix (of type mat) containing training samples; each sample is stored as a column vector 
n_gaus
set the number of Gaussians to n_gaus; to help convergence, it is recommended that the given data matrix (above) contains at least 10 samples for each Gaussian 
dist_mode
specifies the distance used during the seeding of initial means and k-means clustering:
eucl_dist: Euclidean distance
maha_dist: Mahalanobis distance, which uses a global diagonal covariance matrix estimated from the given training samples
seed_mode
specifies how the initial means are seeded prior to running the k-means and/or EM algorithms:
keep_existing: keep the existing model (do not modify the means, covariances and hefts)
static_subset: a subset of the training samples (repeatable)
random_subset: a subset of the training samples (random)
static_spread: a maximally spread subset of training samples (repeatable)
random_spread: a maximally spread subset of training samples (random start)
Note that seeding the initial means with static_spread and random_spread can be more time consuming than with static_subset and random_subset; these seed modes are inspired by the so-called k-means++ approach [1], with the aim of improving clustering quality.

km_iter
the maximum number of iterations of the k-means algorithm; this is data dependent, but typically 10 iterations are sufficient 
em_iter
the maximum number of iterations of the EM algorithm; this is data dependent, but typically 5 to 10 iterations are sufficient 
var_floor
the variance floor (smallest allowed value) for the diagonal covariances; setting this to a small nonzero value can help with convergence and/or better quality parameter estimates 
print_mode
boolean value (either true or false) which enables/disables the printing of progress during the k-means and EM algorithms

5 Evaluation
5.1 Speedup from Multi-Threading
To demonstrate the achievable speedup with the multi-threaded versions of the EM and k-means algorithms, we trained a GMM with 100 Gaussians on a recent 16 core machine using a synthetic dataset comprising 1,000,000 samples with 100 dimensions. 10 iterations of the k-means algorithm and 10 iterations of the EM algorithm were used. The samples were stored in double precision floating point format, resulting in a total data size of approximately 762 MiB.
Figure 2 shows that a speedup of an order of magnitude is achieved when all 16 cores are used. Specifically, for the synthetic dataset used in this demonstration, the training time was reduced from approximately 272 seconds to about 27 seconds. In each case, the k-means algorithm took approximately 30% of the total training time.
5.2 Comparison with Full-Covariance GMMs in MLPACK
In order to validate our intuition that a diagonal GMM is a good choice compared to the significantly more complex problem of estimating GMMs with full covariance matrices, we compare the gmm_diag class (described in Section 4) against the full-covariance GMM implementation in the well-established MLPACK C++ machine learning library [6].
We selected common datasets from the UCI machine learning dataset repository [17], and trained both diagonal and full-covariance GMMs on these datasets. The number of Gaussians was chosen according to the original source of each dataset; where possible, 3 times the number of classes in the dataset was used. In some cases, small amounts of Gaussian noise were added to the dataset to ensure training stability of the full-covariance GMMs. Both implementations used 10 iterations of k-means for initialisation, followed by running the EM algorithm until convergence or until reaching a maximum of 250 iterations. The entire fitting procedure was repeated 10 times, each time with a different random starting point.
The results are given in Table 1, which shows the best log-likelihood of the 10 runs, the average wall-clock runtime for the fitting, as well as dataset information (number of samples, dimensionality, and number of Gaussians used for modelling). We can see that the diagonal GMM implementation in the gmm_diag class provides speedups from one to two orders of magnitude over the full-covariance implementation in MLPACK. Furthermore, in most cases there is no significant loss in goodness-of-fit (as measured by log-likelihood). In several cases (winequality, phy, covertype, pokerhand) the log-likelihood is notably higher for the gmm_diag class; we conjecture that in these cases the diagonal covariance matrices are acting as a form of regularisation to reduce overfitting [3].
dataset     | num. samples | num. dims | num. Gaus. | MLPACK fit time | gmm_diag fit time | fit time ratio | MLPACK log-likelihood | gmm_diag log-likelihood
cloud       | 2,048        | 10        | 5          | 1.50s           | 0.14s             | 10.7           | 59.9810               | 64.1210
ozone       | 2,534        | 72        | 6          | 8.59s           | 0.10s             | 85.9           | 226.1310              | 307.9510
winequality | 6,497        | 11        | 30         | 16.10s          | 0.68s             | 23.7           | 47.1210               | 15.8510
corel       | 37,749       | 32        | 50         | 544.62s         | 4.55s             | 119.7          | +4.5210               | +4.4410
birch3      | 100,000      | 2         | 6          | 18.13s          | 2.39s             | 7.6            | 2.7010                | 2.7110
phy         | 150,000      | 78        | 30         | 3867.12s        | 29.25s            | 132.2          | 2.1010                | 1.8810
covertype   | 581,012      | 55        | 21         | 10360.53s       | 64.83s            | 159.8          | 9.4610                | 6.9010
pokerhand   | 1,000,000    | 10        | 25         | 3653.94s        | 55.85s            | 65.4           | 1.9010                | 1.6810
6 Conclusion
In this paper we have demonstrated a multi-threaded and robust implementation of Gaussian Mixture Models in the C++ language. Multi-threading is achieved through reformulation of the Expectation-Maximisation and k-means algorithms into a MapReduce-like framework. The implementation also uses several techniques to improve numerical stability and modelling accuracy. We demonstrated that the implementation achieves a speedup of an order of magnitude on a recent 16 core machine, and that it can achieve higher modelling accuracy than a previously well-established publicly accessible implementation. The multi-threaded implementation is released as open source software and included in recent releases of the cross-platform Armadillo C++ linear algebra library. The library is provided under the permissive Apache 2.0 license, allowing unencumbered use in commercial products.
Appendix A: Abridged Derivation of the EM Algorithm for Gaussian Mixture Models
In the Gaussian Mixture Model (GMM) approach, the distribution of samples (vectors) is modelled as:
(21)   p(x|\lambda) = \sum_{g=1}^{N_G} w_g \, p(x|\theta_g)
where x is a D-dimensional vector, w_g is a weight (with constraints \sum_{g=1}^{N_G} w_g = 1 and w_g \geq 0), and p(x|\theta_g) is a multivariate Gaussian density function with parameter set \theta_g = \{\mu_g, \Sigma_g\}:
(22)   p(x|\theta_g) = \frac{1}{(2\pi)^{D/2} |\Sigma_g|^{1/2}} \exp\!\left[ -\tfrac{1}{2} (x-\mu_g)^\top \Sigma_g^{-1} (x-\mu_g) \right]
where \mu_g is the mean vector and \Sigma_g is the covariance matrix. Thus the complete parameter set for Eqn. (21) is expressed as \lambda = \{ w_g, \theta_g \}_{g=1}^{N_G}. Given a set of training samples, X = \{x_i\}_{i=1}^{N_V}, we need to find \lambda that suitably models the underlying distribution. Stated more formally, we need to find \lambda that maximises the following likelihood function:
(23)   p(X|\lambda) = \prod_{i=1}^{N_V} p(x_i|\lambda)
The Expectation-Maximisation (EM) algorithm [8, 21, 24, 27] is an iterative likelihood function optimisation technique, often used in pattern recognition and machine learning [3, 9]. It is a general method for finding the maximum-likelihood estimate of the parameters of an assumed distribution, when either the training data is incomplete or has missing values, or when the likelihood function can be made analytically tractable by assuming the existence of (and values for) missing data.
To apply the EM algorithm to finding \lambda, we must first assume that our training data X is incomplete, and assume the existence of missing data Y = \{y_i\}_{i=1}^{N_V}, where each y_i indicates the mixture component that "generated" the corresponding x_i. Thus y_i \in [1, N_G] and y_i = g if the i-th feature vector (x_i) was "generated" by the g-th component. If we know the values for Y, then Eqn. (23) can be modified to:
(24)   p(X, Y|\lambda) = \prod_{i=1}^{N_V} w_{y_i} \, p(x_i|\theta_{y_i})
As its name suggests, the EM algorithm is comprised of two steps which are iterated: (i) expectation, followed by (ii) maximisation. In the expectation step, the expected value of the complete data log-likelihood, \log p(X, Y|\lambda), is found with respect to the unknown data Y, given the training data X and current parameter estimates \lambda^{[k]} (where k indicates the iteration number):
(25)   Q(\lambda, \lambda^{[k]}) = E\!\left[ \log p(X, Y|\lambda) \,|\, X, \lambda^{[k]} \right]

(26)   = \sum_{y \in \Upsilon} \log p(X, y|\lambda) \; p(y|X, \lambda^{[k]})
where y = (y_1, \ldots, y_{N_V}) is an instance of the missing data, and \Upsilon is the space of values y can take on. The maximisation step then maximises the expectation:
(27)   \lambda^{[k+1]} = \arg\max_{\lambda} \; Q(\lambda, \lambda^{[k]})
The expectation and maximisation steps are iterated until convergence, or when the increase in likelihood falls below a predefined threshold. As can be seen in Eqn. (26), we require p(y|X, \lambda^{[k]}). We can define it as follows:
(28)   p(y|X, \lambda^{[k]}) = \prod_{i=1}^{N_V} p(y_i|x_i, \lambda^{[k]})
Given initial parameters \lambda^{[0]} (which can be found via the k-means algorithm [3, 9, 18]; see also Section 3), we can compute p(x_i|\theta_g^{[0]}). Moreover, we can interpret the mixing weights (w_g) as a-priori probabilities of each mixture component, ie., w_g = P(g|\lambda). Hence we can apply Bayes' rule [9] to obtain:
(29)   p(g|x_i, \lambda^{[k]}) = \frac{ p(x_i|\theta_g^{[k]}) \; P(g|\lambda^{[k]}) }{ p(x_i|\lambda^{[k]}) }

(30)   = \frac{ p(x_i|\theta_g^{[k]}) \; w_g^{[k]} }{ \sum_{n=1}^{N_G} p(x_i|\theta_n^{[k]}) \; w_n^{[k]} }
Expanding Eqn. (26) yields:
(31)   Q(\lambda, \lambda^{[k]}) = \sum_{y \in \Upsilon} \sum_{i=1}^{N_V} \log\!\left( w_{y_i} \, p(x_i|\theta_{y_i}) \right) p(y|X, \lambda^{[k]})

(32)   = \sum_{g=1}^{N_G} \sum_{i=1}^{N_V} \log\!\left( w_g \, p(x_i|\theta_g) \right) p(g|x_i, \lambda^{[k]})

(33)   = \sum_{g=1}^{N_G} \sum_{i=1}^{N_V} \log(w_g) \, p(g|x_i, \lambda^{[k]}) + \sum_{g=1}^{N_G} \sum_{i=1}^{N_V} \log\!\left( p(x_i|\theta_g) \right) p(g|x_i, \lambda^{[k]})
Hence the two terms in Eqn. (33), denoted Q_w and Q_\theta respectively, can be maximised separately, to obtain w_g and \theta_g = \{\mu_g, \Sigma_g\}. To find the expression which maximises w_g, we need to introduce the Lagrange multiplier [9] \psi, with the constraint \sum_{g} w_g = 1, take the derivative of Q_w with respect to w_g and set the result to zero:
(37)   0 = \frac{\partial}{\partial w_g} \left[ Q_w + \psi \left( \sum_{g=1}^{N_G} w_g - 1 \right) \right]

(38)   = \frac{\partial}{\partial w_g} \left[ \sum_{g=1}^{N_G} \sum_{i=1}^{N_V} \log(w_g) \, p(g|x_i, \lambda^{[k]}) + \psi \left( \sum_{g=1}^{N_G} w_g - 1 \right) \right]

(39)   = \sum_{i=1}^{N_V} \frac{1}{w_g} \, p(g|x_i, \lambda^{[k]}) + \psi
Rearranging Eqn. (39) to obtain a value for \psi:
(40)   \psi \, w_g = - \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]})
Summing both sides over g yields:
(41)   \psi \sum_{g=1}^{N_G} w_g = - \sum_{g=1}^{N_G} \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]})

(42)   \psi = - \sum_{i=1}^{N_V} \sum_{g=1}^{N_G} p(g|x_i, \lambda^{[k]})

(43)   \psi = - N_V

where we have used \sum_{g} w_g = 1 and \sum_{g} p(g|x_i, \lambda^{[k]}) = 1; substituting \psi back into Eqn. (40) yields the weight estimate given in Eqn. (62).
To find expressions which maximise \mu_g and \Sigma_g, let us now expand Q_\theta:
(46)   Q_\theta = \sum_{g=1}^{N_G} \sum_{i=1}^{N_V} \log\!\left( p(x_i|\theta_g) \right) p(g|x_i, \lambda^{[k]})

(47)   = \sum_{g=1}^{N_G} \sum_{i=1}^{N_V} \left[ -\tfrac{1}{2} \log |\Sigma_g| - \tfrac{1}{2} (x_i - \mu_g)^\top \Sigma_g^{-1} (x_i - \mu_g) \right] p(g|x_i, \lambda^{[k]})
where the term -\tfrac{D}{2}\log(2\pi) was omitted, since it vanishes when taking a derivative with respect to \mu_g or \Sigma_g^{-1}. To find the expression which maximises \mu_g, we need to take the derivative of Q_\theta with respect to \mu_g, and set the result to zero:
(48)   0 = \frac{\partial Q_\theta}{\partial \mu_g}

(49)   = \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, \frac{\partial}{\partial \mu_g} \left[ -\tfrac{1}{2} (x_i - \mu_g)^\top \Sigma_g^{-1} (x_i - \mu_g) \right]
Lütkepohl [19] states that \partial (y^\top A y) / \partial y = (A + A^\top) y, and if A is symmetric, then \partial (y^\top A y) / \partial y = 2 A y. Since \Sigma_g^{-1} is symmetric, Eqn. (49) reduces to:
(50)   0 = \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, \Sigma_g^{-1} (x_i - \mu_g)

(51)   = \Sigma_g^{-1} \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, (x_i - \mu_g)

(52)   = \Sigma_g^{-1} \left[ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, x_i - \mu_g \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \right]
Multiplying both sides by \Sigma_g yields:
(53)   0 = \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, x_i - \mu_g \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]})

(54)   \mu_g = \frac{ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, x_i }{ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) }
According to Lütkepohl [19], \partial \log |A| / \partial A = (A^{-1})^\top and \partial \, \mathrm{tr}(AB) / \partial A = B^\top. Moreover, we note that (x_i - \mu_g)(x_i - \mu_g)^\top is a symmetric matrix. Using y^\top A y = \mathrm{tr}(A \, y y^\top) and |\Sigma_g^{-1}| = 1 / |\Sigma_g|, Eqn. (47) can be rewritten as:

(55)   Q_\theta = \sum_{g=1}^{N_G} \sum_{i=1}^{N_V} \left[ \tfrac{1}{2} \log |\Sigma_g^{-1}| - \tfrac{1}{2} \mathrm{tr}\!\left( \Sigma_g^{-1} (x_i - \mu_g)(x_i - \mu_g)^\top \right) \right] p(g|x_i, \lambda^{[k]})

To find an expression which maximises \Sigma_g, we can take the derivative of Eqn. (55) with respect to \Sigma_g^{-1} and set the result to zero:
(56)   0 = \frac{\partial Q_\theta}{\partial \Sigma_g^{-1}}

(57)   = \tfrac{1}{2} \, \Sigma_g^\top \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) - \tfrac{1}{2} \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \left[ (x_i - \mu_g)(x_i - \mu_g)^\top \right]^\top

(58)   = \tfrac{1}{2} \, \Sigma_g \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) - \tfrac{1}{2} \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, (x_i - \mu_g)(x_i - \mu_g)^\top
thus
(60)   \Sigma_g \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) = \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, (x_i - \mu_g)(x_i - \mu_g)^\top

(61)   \Sigma_g = \frac{ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, (x_i - \mu_g)(x_i - \mu_g)^\top }{ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) }
In summary,
(62)   \hat{w}_g = \frac{1}{N_V} \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]})

(63)   \hat{\mu}_g = \frac{ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, x_i }{ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) }

(64)   \hat{\Sigma}_g = \frac{ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) \, (x_i - \hat{\mu}_g)(x_i - \hat{\mu}_g)^\top }{ \sum_{i=1}^{N_V} p(g|x_i, \lambda^{[k]}) }
where
(65)   p(g|x_i, \lambda^{[k]}) = \frac{ p(x_i|\theta_g^{[k]}) \; w_g^{[k]} }{ p(x_i|\lambda^{[k]}) }
which can be explicitly stated as:
(66)   p(g|x_i, \lambda^{[k]}) = \frac{ p(x_i|\theta_g^{[k]}) \; w_g^{[k]} }{ \sum_{n=1}^{N_G} p(x_i|\theta_n^{[k]}) \; w_n^{[k]} }
References
 [1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
 [2] J. Bilmes. A gentle tutorial of the EM algorithm and its applications to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR97021, International Computer Science Institute, Berkeley, California, 1998.
 [3] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 [4] J. Carvajal, A. Wiliem, C. McCool, B. Lovell, and C. Sanderson. Comparative evaluation of action recognition methods via Riemannian manifolds, Fisher vectors and GMMs: Ideal and challenging conditions. In Lecture Notes in Computer Science (LNCS), Vol. 9794, pages 88–100, 2016.
 [5] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2007.
 [6] R. R. Curtin, J. R. Cline, N. P. Slagle, W. B. March, P. Ram, N. A. Mehta, and A. G. Gray. MLPACK: A scalable C++ machine learning library. Journal of Machine Learning Research, 14:801–805, 2013.
 [7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Symposium on Operating Systems Design and Implementation, pages 137–150, 2004.
 [8] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
 [9] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, 2001.
 [10] C. Elkan. Using the triangle inequality to accelerate k-means. In International Conference on Machine Learning, pages 147–153, 2003.
 [11] Z. Ge, C. McCool, C. Sanderson, and P. Corke. Modelling local deep convolutional neural network features to improve fine-grained image classification. In International Conference on Image Processing (ICIP), pages 4112–4116, 2015.
 [12] Z. Ge, C. McCool, C. Sanderson, P. Wang, L. Liu, I. Reid, and P. Corke. Exploiting temporal information for DCNN-based fine-grained object classification. In International Conference on Digital Image Computing: Techniques and Applications, 2016.
 [13] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.
 [14] G. Hamerly and C. Elkan. Learning the k in k-means. In Neural Information Processing Systems, 2003.
 [15] B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In International Conference on Machine Learning, pages 513–520, 2012.
 [16] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
 [17] M. Lichman. UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml.
 [18] Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantization. IEEE Transactions on Communications, 28(1):84–95, 1980.
 [19] H. Lütkepohl. Handbook of Matrices. John Wiley & Sons, 1996.
 [20] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, 2012.
 [21] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, 2nd edition, 2008.
 [22] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
 [23] D. Monniaux. The pitfalls of verifying floatingpoint computations. ACM Transactions on Programming Languages and Systems, 30(3), 2008.
 [24] T. Moon. Expectationmaximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.
 [25] D. Pelleg and A. Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In International Conference on Machine Learning, pages 727–734, 2000.
 [26] V. Reddy, C. Sanderson, and B. Lovell. Improved foreground detection via block-based classifier cascade with probabilistic decision integration. IEEE Transactions on Circuits and Systems for Video Technology, 23(1):83–93, 2013.
 [27] R. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
 [28] D. Reynolds, T. Quatieri, and R. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3):19–41, 2000.
 [29] C. Sanderson and R. Curtin. Armadillo: a templatebased C++ library for linear algebra. Journal of Open Source Software, 1:26, 2016.
 [30] C. Sanderson and R. Curtin. Armadillo: C++ template metaprogramming for compiletime optimization of linear algebra. In Platform for Advanced Scientific Computing (PASC) Conference, Switzerland, 2017.
 [31] A. St. Laurent. Understanding Open Source and Free Software Licensing. O’Reilly Media, 2008.
 [32] A. Wiliem, C. Sanderson, Y. Wong, P. Hobson, R. Minchin, and B. Lovell. Automatic classification of human epithelial type 2 cell indirect immunofluorescence images using cell pyramid matching. Pattern Recognition, 47(7):2315–2324, 2014.
 [33] Y. Wong, M. Harandi, and C. Sanderson. On robust face recognition via sparse coding: The good, the bad and the ugly. IET Biometrics, 3(4):176–189, 2014.