1 Introduction
Online learning is needed when data is not known a priori, its distribution evolves, or when the data model must be updated without interrupting operation: the model must evolve as new data appears, allowing older data to become progressively less important. Our goal is to produce a fast, robust way of handling longrunning modelling tasks. Since capturing, processing, and saving all past observed samples is computationally unfeasible, similar samples can be grouped and represented by more compact entities, such as gmms.
kde methods are nonparametric approaches to build gmms that do not require prior definition of the number of components. In order to manage complexity, some approaches compress the model to a predefined number of components Goldberger and Roweis (2004), or optimize some datadriven choice Chen et al. (2010). A different approach Ozertem et al. (2008)
is to view model compression as a clustering problem. The main difficulty of adapting kde methods to online operation lies in maintaining sufficient information to generalize to unobserved data and to adjust model complexity without access to all past samples. An online approach based on meanshift mode finding adds each new sample to the model as a Gaussian component. However, it is sensitive to nonGaussian areas due to skewed or heavy tailed data
Han et al. (2008). A twolevel approach Declercq and Piater (2008), based on the idea that each component of the (nonoverfitting) mixture is in turn represented by an underlying mixture that represents data very precisely (possibly overfitting), allows the model to be refined without sacrificing accuracy, as a merge in the upper level can later be accurately split. It is also able handle the problem of nonGaussian data by using a uniform distribution in those regions.
The okde approach Kristan and Leonardis (2014); Kristan et al. (2011) builds a twolevel model of the target distribution. It maintains a sample distribution that is used to compute the corresponding kde. It supports singlesample updates and has a forgetting factor. This allows model adaptation when the target distribution evolves. Compressions approximate component clusters using single Gaussians. If the target distribution changes sufficiently, compressions can be reversed, using the underlying model kept by each component.
Our approach follows okde, obtaining comparable or better results. It allows for flexibility, abstraction, efficient memory use, computational efficiency, and numerical robustness. Furthermore, our C++ implementation, based on stateoftheart libraries Guennebaud et al. (2010), allows easy exploration of modelling alternatives.
The article is organized as follows: Section 2 describes okde. Section 3 describes our adaptations for numerical stability. Section 4 describes our approach to high dimensionality. Section 5 presents specific computational strategies to improve performance. Section 6 describes the evaluation setup and Section 7 presents our results. Section 8 concludes the article.
2 Online Kernel Density Estimation
okde was not developed with high dimensionality in mind. Although it allows for any number of dimensions, computational complexity scales quadratically with the number of dimensions, making the approach computationally prohibitive. Furthermore, problems due to overflows, underflows, and precision issues also arise. These issues worsen as the number of dimensions grows, limiting the usefulness of the approach. Before describing our approach, we present okde in detail.
2.1 Model Definition
Sample models are defined as dimensional, component gmms (Eq. 1): x are dimensional observations; are mixture weights, such that ; and are variate components (Eq. 2), with mean and covariance .
(1) 
(2) 
To maintain low complexity, is compressed when a threshold is met. To allow recovery from miscompressions, a model of the observations is also kept for each component: , where is a mixture model (with at most two components) for the th component of .
The kernel density estimate is defined as a convolution of by a kernel with a covariance (bandwidth) H:
(3) 
2.2 Optimal Bandwidth
KDE approaches determine a bandwidth H that minimizes the distance between and (the unknown) generating the data. The bandwidth can be written as ( is the scale and is the structure). Finding H is equivalent to finding the optimal scaling (Eq. 4), where (Eq. 2.2) and , , , , and is the covariance of the th component of .
(4) 
(5) 
G is the pilot bandwidth (Eq. 6). is the covariance of a single Gaussian approximating the entire model. The structure of H can be approximated by the structure of the covariance of the observations Wand and Jones (1994); Duong and Hazelton (2003), i.e., .
(6) 
2.3 Model Compression
Compression approximates the original component by an component, , (Eq. 7), such that the compressed kde does not change significantly.
(7) 
Clustering approaches can be used to compress . The aim is to find clusters such that each cluster can be approximated by a single component in . Let be a set of assignments clustering into submixtures. The submixture corresponding to indexes is defined as
(8) 
must be such that and (a threshold). is the largest local clustering error , i.e., the error induced under the KDE if is approximated by a single Gaussian.
(10) 
2.4 Local Clustering Error
2.5 Distance between mixture models
The Hellinger distance cannot be calculated analytically for mixture models and is approximated by the unscented transform Julier and Uhlmann (1996) (Eq. 13), where are weighted sets of sigma points of the th Gaussian .
(13) 
Note that is the th column of ( has columns). The sigma points are simply () and . Thus, must be counted separately for the and sets, such that for each set of sigma points. Let
be a singular value decomposition of
, such that and , then . In line with Julier and Uhlmann (1996), was set to 3.2.6 Hierarchical Compression
A hierachical approach avoids evaluating all possible assignments Goldberger and Roweis (2004); Ihler (2005).
is first split into two submixtures using Goldberger’s Kmeans
Goldberger and Roweis (2004). To avoid singularities associated with Diracdelta components of , Kmeans is applied to . Next, each submixture is approximated by a single Gaussian (section 2.3) and the submixture yielding the largest local error is further split into two submixtures. The process continues until , producing a binary tree where each of the leaves represents the clustering assignments . can then be used to approximate the corresponding components in by a single Gaussian, resulting in the compressed distribution .If two components of are merged, their detailed models must also be merged: the detailed model of the th component in is calculated by first defining a normalized extended mixture (Eq. 14). If this mixture has more than two components, then the twocomponent is generated by splitting into two submixtures using Golberger’s Kmeans and approximating each submixture by a single Gaussian.
(14) 
2.7 Revitalization
Overcompressions can be detected by checking whether the local clustering error of each component, evaluated against its detailed model , exceeds . Those components are removed from and replaced by the components of their detailed models. Each new component needs a detailed model: new ones are generated based on their covariances. Let be one of the new components. If the determinant of is zero, then the component is a single datapoint and its detailed model is just the component itself. Otherwise, the component has been generated through clustering in previous compression steps: its detailed model is initialized by splitting along its principal axis Huber (2015) into a twocomponent mixture, whose first two moments match those of the original component. This has two advantages: first, since the component is symmetric around the mean, the splitting process minimizes the induced error; and second, it is moment preserving, i.e., the mean and covariance of the split Gaussian (and, thus, of the entire mixture), remains unchanged.
3 Improving Numeric Stability
In this section, we detail strategies that allow working with skewed or degenerate sample distributions, as well as avoiding underflows and overflows.
3.1 Degenerate Covariance Matrices
Some data dimensions may seldom change or sufficient observations with different values may not have occurred, causing covariance values of 0 along those axes. One way to detect and correct singular or nearsingular matrices is to compute their eigen decomposition , with
, and check for eigenvalues smaller than
. These eigenvalues are then corrected by 1% of the average of the eigenvalues (Eq. 15, 16, 17). The corrected covariance is then given by .(15) 
(16) 
(17) 
3.2 Determinant Computation
The Gaussian pdf (Eq. 2) uses the determinant () and the inverse () of the covariance (). A way to efficiently compute the inverse of a nonsingular matrix is to first compute its LU decomposition and use it to compute the inverse. Since we guarantee that is positive definite, the determinant can be computed by multiplying the diagonal entries of U. However, when working in high dimensionality, overflows or underflows are bound to occur, even when using double precision floating point. To avoid this problem, we compute the logarithm of the determinant instead: this is simply the sum of the logarithm of each entry in the diagonal of matrix U.
3.3 Whitening Transformation
A single means that all components are scaled equally in all dimensions. If the spread of data is much greater in one dimension than in others, extreme differences of spread in the various dimensions may occur. This can be avoided by first whitening the data, by transforming it to have unit covariance, smoothing it using a radially symmetric kernel, and, finally, transforming it back Fukunaga (1990). Whitening allows to better fit the global distribution of the data, leading to better models Silverman (1986).
In our approach, both the computation of
and model compression are performed on whitened data. Whitening starts by approximating the whole model by a single Gaussian. The parameters needed to transform the model covariance into the identity matrix are then computed and individually applied to all components in the mixture. The inverse operation can be used to recover the model.
Formally, the whitening transformation of data x with covariance is defined as , where and
are the eigenvalues and eigenvectors from the eigendecomposition of the covariance matrix
.Whitening impacts the model in two ways. The first is computational performance: since the structure is equal to , it significantly saves computational costs in matrix multiplications (eq. 2.2). The second is model quality, as both the computation of and the compression phase, performed using Goldberger’s Kmeans, benefit from working on spherized data.
3.4 Optimal Bandwidth Stability
Sometimes, observed data may produce zerovalued . Recall that we assume that the structure of is wellapproximated by the covariance of the entire model and that finding amounts to finding (section 2.2). If this scaling factor is zero, e.g. caused by precision issues or by data distribution in the model, will be zero. This means that no smoothing will be applied to the KDE. In this case, Diracdeltas will remain with zero covariance, not allowing the use of the standard likelihood function. Since is computed on whitened data, one solution for this problem is to detect these occurrences and set the bandwidth as the identity matrix, and then computing the backwards whitening transformation.
4 Using Diagonal Covariance Matrices
Since a kde is a linear combination of components, full covariances may not be necessary, even when features are not statistically independent: a second linear combination of (a higher number of) components with diagonal covariances is capable of modelling the correlations between features.
With diagonal covariances, computational needs grow linearly with the number of dimensions, a very convenient aspect for high dimensionality applications. This is in contrast with quadratic growth for full covariances. Furthermore, for nonnormalized data, diagonal covariances may improve stability, as certain relationships that could result in singular covariances are ignored. Our numerical stability strategies have a stronger impact with diagonal covariances: for data with flat or nearflat dimensions this is particularly relevant (section 7).
5 Lazy Operation and Result Buffering
In theory, each time a sample is added, should be updated. However, is only needed for the compression phase or to evaluate the likelihood of a given sample. In other cases, computation of may be postponed. This means that the cost of adding a sample is simply that of adding one Diracdelta component to the mixture and updating the current weights. This can be done without impact on model quality.
The determinant and the inverse of covariance matrices are needed to compute the likelihood function (which is called very often). Thus, it makes sense to save them, once they have been computed, and keep them until the covariance changes. Then, previous values are marked as invalid, but are not recomputed immediately. As with , it is advantageous to postpone these computations until they are actually needed, making the process of adding components to the mixture faster.
6 Evaluation Setup
To evaluate our approach and to critically compare our results, we follow the same evaluation strategies of the original oKDE paper Kristan et al. (2011). For intrinsic evaluation, we assess model quality by using the average negative loglikelihood and model complexity. For extrinsic evaluation, we assess the accuracy of a classification task. For the latter, we also compare our results with a discriminative approach, since it is typically better suited for classification tasks. We used the oisvm Orabona (2009); Orabona et al. (2010).
The datasets (Table 1 and Fig. 1) and evaluation setup from the original oKDE paper were used. Since some of the datasets have been updated, we ran both approaches on the most recent versions. We also evaluated our approach on the LF Face Database, a high dimensional scenario (section 6.1). For the online classification task, we randomly shuffled the data in each dataset and used 75% of the data to train and the rest to test. For each dataset, to minimize impact of lucky/unlucky partitions, we generated 12 random shuffles. Since we wish to make minimal assumptions about data nature and distribution, we did not do any data preprocessing.
Dataset  N  N  N 

Iris  150  4  3 
Yeast  1484  8  10 
Pima  768  8  2 
Red (wine quality, red)  1599  11  6 
White (wine quality, white)  4898  11  7 
Wine  178  13  3 
Letter  20000  16  26 
Seg (image segmentation)  2310  19  7 
Steel  1941  27  7 
Cancer (breast cancer)  569  30  2 
Skin  245057  3  2 
Covtype  581012  10  7 
For the kde approaches, each class is represented by the model built from its training samples: for oKDE, each model is initialized with 3 samples before adding the rest, one at a time; in our approach, due to lazy operation, there is no need for initialization, and each model is trained by adding one sample at a time from the very start. To evaluate classification performance we used .
For oisvm, we trained
binary classifiers, using training samples from a class as positive examples, and the other training samples as negative examples. Then, we followed a 1vsAll approach,
, in which is the confidence score of the th binary classifier for x. A second order polynomial kernel was used with gamma and coef0 parameters set to 1. The complexity parameter C was also set to 1.6.1 LF Face Database
The LF Face Database consists of 30,000 indoor free pose face images from 10 subjects (3,000 images per subject). Capture was performed using a PlayStation Eye camera (640x480 pixel). A Haar cascade Viola and Jones (2001); Bradski (2000) detected frontal face poses, but also some high inclination and head rotation poses. The square face regions were cropped and resized to 64x64 pixels. During capture, each subject was asked to behave normally and avoid being static, to avoid restrictions on facial pose or expression. Since faces tend to be centered in the cropped square, a fixed mask, roughly selecting parts of eyes, nose, and cheeks, was applied to the crops, producing 128pixel vectors.
7 Results
Tests were run on an Intel Xeon E5530@2.40GHz, 48GB RAM, Linux openSUSE 13.1. oKDE and oisvm were run on MATLAB R2013a with nosplash nodesktop nodisplay nojvm. Our approach (xokde++) was compiled enabling all vectorizing options available on our CPU architecture. In the tables, “—” indicates that the model could not be built.
7.1 Model Quality
As a proxy for estimation quality, we use the average negative loglikelihood and the average complexity of the models. Tables 2 and 3 present results after observing all samples.
xokde++  xokde++/d  oKDE  

Dataset        
Iris  7.4  1.7  3.3  1.3  7.8  1.6 
Yeast  13.2  0.5  14.3  0.3  7.1  1.7 
Pima  29.3  0.5  27.6  0.2  31.0  0.7 
Red  0.2  0.6  0.2  0.3  14.7  4.2 
White  0.5  0.2  1.6  0.2  55.2  54.3 
Wine  51.1  4.3  33.0  3.4  64.9  18.5 
Letter  17.9  0.1  19.9  0.1  18.2  0.1 
Seg  25.7  0.8  31.0  1.0  156.6  33.7 
Steel  —  —  85.2  4.3  —  — 
Cancer  25.6  5.7  16.4  2.4  433.9  33.7 
Skin  13.5  0.3  13.9  0.2  13.4  0.2 
Covtype  51.9  0.1  55.9  0.0  52.6  0.1 
= one standard deviation).
xokde++  xokde++/d  oKDE  

Dataset  
Iris  28  3  22  3  28  3 
Yeast  31  15  30  19  31  15 
Pima  62  9  42  19  64  7 
Red  39  23  53  37  40  24 
White  31  24  54  40  36  25 
Wine  44  7  44  7  45  7 
Letter  65  12  42  9  66  14 
Seg  49  12  51  24  52  11 
Steel  —  —  20  7  —  — 
Cancer  40  7  153  11  50  12 
Skin  8  2  10  2  8  2 
Covtype  18  5  17  5  20  5 
Relative to okde, xokde++ produces models with similar complexity but with lower average negative loglikelihood (better fits). This is clear in datasets where some dimensions have very little variance while others have very large variance. The “steel”, “segmentation”, and “cancer” datasets are examples of this kind of problem. The numerical stability methods in xokde++ allow recovery from these situations (section
3.1).7.2 Classifier Accuracy
Table 4 shows that xokde++ achieves better performance in 7 out of 12 datasets, and lower performance in only 3 datasets. Some of the results for oKDE are different than those reported in the original paper, perhaps due to the fact that no dataset was balanced or normalized for unit variance. We intentionally used the datasets with no further processing to study a more realistic online operation scenario, where the sample distribution is not known. These results demonstrate the numerical robustness of xokde++ in handling nonnormalized datasets. oisvm typically outperforms the generative approaches, but only slightly. However, it shares some of the oKDE numerical instabilities.
xokde++  xokde++/d  oKDE  OISVM  

Dataset  
Iris  96.4  2.7  95.0  3.4  96.4  2.4  97.1  2.1 
Yeast  49.7  2.3  48.1  1.6  50.6  3.3  59.2  2.2 
Pima  67.8  3.4  70.1  3.9  69.7  2.9  76.9  2.5 
Red  62.0  2.5  54.6  1.9  56.9  6.3  58.3  1.9 
White  49.9  1.3  42.4  1.7  44.9  10.6  53.2  1.6 
Wine  97.7  1.4  98.5  1.8  93.9  6.1  96.8  2.8 
Letter  95.8  0.2  93.4  0.4  95.8  0.2  93.0  0.4 
Seg  91.5  1.1  89.4  1.2  75.0  5.3  95.0  0.9 
Steel  —  —  56.9  9.0  —  —  —  — 
Cancer  94.8  1.7  96.2  1.7  52.8  12.0  95.9  0.8 
Skin  99.6  0.1  99.4  0.0  99.7  0.1  99.8  0.0 
Covtype  52.0  1.2  51.6  0.6  68.0  0.9  —  — 
The high dimensionality scenario was tested using the LF Face Database. okde, oisvm, and xokde++ (with full covariances), were unable to build models for this dataset. xokde++ (with diagonal covariances) obtained an accuracy of 94.1%, very close to the accuracy of 95.4% obtained by a batch SVM.
7.3 Time and Memory Performance
xokde++ produces models with similar or better quality than okde. Moreover, it does so at lower computational cost. Tables 5 and 6 show the time and memory needed to train and test each dataset: xokde++ achieves speedups from 3 to 10; with diagonal covariances, speedups range from 11 to 40. Regarding memory, xokde++ uses at most 10% of the memory required by okde. This difference is more critical in large datasets such as the “skin” and “covtype”. For these datasets, oKDE needed 913MB and 5064MB, while xokde++ needed only 83MB and 361MB, respectively.
Regarding memory, oisvm performs poorly when compared with the other approaches. For the “letter” dataset, with 26 classes, the memory needed by oisvm was 4 times the required by okde and 43 and 78 times more than xokde++ with full and diagonal covariances, respectively. For large datasets, time performance also degrades: the “skin” dataset took nearly twice the time it took for okde. This is even clearer for the “covtype” dataset, which has 500,000 samples: 7 days of computation were not enough to complete training a single shuffle, showing that it was unfeasible to complete the 12 shuffles of our evaluation setup in an acceptable time frame.
xokde++  xokde++/d  oKDE  OISVM  

Dataset  MB  %  MB  %  MB  MB  % 
Iris  4.6  3.8  4.5  3.7  123.0  116.6  94.8 
Yeast  6.8  3.9  5.8  3.4  173.1  148.9  86.0 
Pima  6.2  3.9  5.2  3.3  157.5  124.8  79.3 
Red  8.1  5.6  6.2  4.2  145.4  162.2  111.5 
White  10.3  5.1  8.3  4.1  204.3  228.4  111.8 
Wine  6.3  4.4  4.8  3.3  144.3  120.2  83.3 
Letter  42.6  9.8  23.2  5.4  433.3  1823.9  420.9 
Seg  15.7  7.7  7.7  3.8  203.9  192.3  94.3 
Steel  —  —  7.2  4.0  —  —  — 
Cancer  10.6  5.5  6.6  3.4  193.1  141.8  73.4 
Skin  83.1  9.1  83.2  9.1  913.3  2255.1  246.9 
Covtype  361.2  7.1  360.4  7.1  5064.8  —  — 
xokde++  xokde++/d  oKDE  OISVM  

Dataset  time  time  time  time  
Iris  0.5  0.1  11.0  0.1  0.0  34.2  5.0  0.7  0.2  0.1  26.7 
Yeast  39.4  3.2  4.5  6.5  0.7  27.1  177.2  8.1  7.7  0.3  23.1 
Pima  16.9  2.1  7.9  3.3  0.3  40.1  133.9  2.3  1.5  0.4  90.1 
Red  61.0  7.7  5.4  14.4  0.7  22.6  326.4  27.5  10.6  1.8  30.7 
White  130.7  4.0  5.8  39.0  1.7  19.6  764.1  51.3  37.2  3.6  20.5 
Wine  3.0  0.5  3.2  0.5  0.1  21.4  9.7  0.6  0.3  0.1  29.4 
Letter  2119.0  66.7  3.0  191.7  3.5  33.7  6452.9  191.9  5225.4  1341.0  1.2 
Seg  296.8  26.1  1.7  45.0  6.1  11.5  515.7  21.5  84.1  12.0  6.1 
Steel  —  —  —  8.7  0.5  —  —  —  —  —  — 
Cancer  52.6  6.9  2.6  11.7  1.1  11.6  135.6  11.2  0.6  0.1  221.8 
Skin  572.9  204.5  11.5  192.2  40.9  34.2  6565.1  1471.0  10522.5  2144.0  0.6 
Covtype  9803.8  1614.9  2.6  1156.6  124.5  22.2  25713.3  2081.3  —  —  — 
7.4 Performance of Full vs. Diagonal Covariances
Table 7 presents a detailed comparison of the two xokde++ variants, regarding model quality and computational needs. Using diagonal covariances produces models with slightly higher complexity at a small loss in model quality, while needing less memory and greatly improving computational time performance. For the “steel” dataset, diagonal covariances also allowed the model to cope with feature dimensions with low variance, thus avoiding the severe numerical instability found when using full covariances, where the lack of feature variance frequently produced singular matrices.
For full covariances, the number of variables increases quadratically with the number of dimensions and datasets with higher dimensionality will benefit from using diagonal covariances. Even though the diagonal approach tends to need more components, partially offsetting memory savings, the average memory usage of the diagonal approach is just 78.5% of the needed for full covariances. We observed an improvement of this number when we considered more than 15 dimensions: the average memory usage falls to 73% with only small losses in accuracy, around 0.72% (absolute). Considering all datasets, we obtain an average value of 1.67% (absolute) accuracy loss against full covariance models.
xokde++  xokde++/d  comparison  

Dataset  %    %    %  %  
Iris  96.4  28  7.4  0.5  4.6  95.0  22  3.3  0.1  4.5  1.4  7  3.1  97.9 
Yeast  49.7  31  13.2  39.4  6.8  48.1  30  14.3  6.5  5.8  1.6  1  6.0  85.2 
Pima  67.8  62  29.3  16.9  6.2  70.1  42  27.6  3.3  5.2  2.3  21  5.1  83.5 
Red  62.0  39  0.2  61.0  8.1  54.6  53  0.2  14.4  6.2  7.4  14  4.2  75.8 
White  49.9  31  0.5  130.7  10.3  42.4  54  1.6  39.0  8.3  7.5  22  3.4  80.8 
Wine  97.7  44  51.1  3.0  6.3  98.5  44  33.0  0.5  4.8  0.8  0  6.6  75.8 
Letter  95.8  65  17.9  2119.0  42.6  93.4  42  19.9  191.7  23.2  2.4  23  11.1  54.5 
Seg  91.5  49  25.7  296.8  15.7  89.4  51  31.0  45.0  7.7  2.2  2  6.6  48.7 
Steel  —  —  —  —  —  56.9  20  85.2  8.7  7.2  —  —  —  — 
Cancer  94.8  40  25.6  52.6  10.6  96.2  153  16.4  11.7  6.6  1.5  113  4.5  61.8 
Skin  99.6  8  13.5  572.9  83.1  99.4  10  13.9  192.2  83.2  0.1  3  3.0  100.1 
Covtype  52.0  18  51.9  9803.8  361.2  51.6  17  55.9  1156.6  360.4  0.4  2  8.5  99.8 
8 Conclusions
xokde++ is a stateoftheart online kde approach. It is efficient, numerically robust, and able to handle high dimensionality. Model quality comparable to that of oKDE, while needing significantly less memory and computation time. The numerical stability improvements allow xokde++ to cope with nonnormalized data and achieve an average classifier accuracy of 78% (against oKDE’s 73%). This is a 6.5% average improvement over the baseline. Furthermore, it continues to achieve good results where other approaches fail to even convege (section 7.2). From a software standpoint, xokde++ is extensible, due to its pure template library nature, as shown by the use of diagonal covariances, which were easily implemented and provide aditional time and memory efficiency while sacrificing little model quality.
Besides speed and memory efficiency, one central contribution of our approach is its adaptability and extensibility. Since it was implemented in an OOP language (C++), the templated classes allow for easy extension, as shown by switching between full and diagonal covariances. Other parts of the code are also easily extended, e.g., using triangular covariances, or replacing the Gaussian pdf with a different one. Similarly, changing the compression phase is also a simple process.
Future work to improve computational performance includes highperformance computing adaptations. Other improvements include changing the hierarchical compression approach, to another, more paralelizable approach such as pairwise merging of components Runnalls (2007). Moreover, model quality can also be improved. Instead of using full or diagonal covariances, which are either too large for high dimensionality or too restrictive for certain nonlinearly independent data, covariances could be approximated with lowrank perturbations MagdonIsmail and Purnell (2009), or using Toeplitz matrices Pasupathy and Damodar (1992); Cai et al. (2014). Finally, improvements on numerical stability and model quality may be obtained by avoiding correction of degenerate covariances and, instead, compute pseudoinverses and pseudologdeterminants Mikheev (2006).
9 Acknowledgements
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.
References
 Bradski (2000) Bradski, G., 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools .
 Cai et al. (2014) Cai, T.T., Ren, Z., Zhou, H.H., 2014. Estimating structured highdimensional covariance and precision matrices: Optimal rates and adaptive estimation. The Annals of Statistics 38, 2118–2144.
 Chen et al. (2010) Chen, S., Hong, X., Harris, C.J., 2010. Probability density estimation with tunable kernels using orthogonal forward regression. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Trans. on 40, 1101–1114.
 Declercq and Piater (2008) Declercq, A., Piater, J.H., 2008. Online learning of Gaussian mixture models  a twolevel approach., in: VISAPP (1), pp. 605–611.
 Duong and Hazelton (2003) Duong, T., Hazelton, M., 2003. Plugin bandwidth matrices for bivariate kernel density estimation. Journal of Nonparametric Statistics 15, 17–30.
 Fukunaga (1990) Fukunaga, K., 1990. Introduction to statistical pattern recognition. 2 ed.. Academic press. p. 128.
 Goldberger and Roweis (2004) Goldberger, J., Roweis, S.T., 2004. Hierarchical clustering of a mixture model, in: Advances in Neural Information Processing Systems, pp. 505–512.
 Guennebaud et al. (2010) Guennebaud, G., Jacob, B., et al., 2010. Eigen v3. http://eigen.tuxfamily.org.
 Han et al. (2008) Han, B., Comaniciu, D., Zhu, Y., Davis, L.S., 2008. Sequential kernel density approximation and its application to realtime visual tracking. Pattern Analysis and Machine Intelligence, IEEE Trans. on 30, 1186–1197.
 Huber (2015) Huber, M., 2015. Nonlinear Gaussian Filtering: Theory, Algorithms, and Applications. KIT Scientific Publishing. volume 19. pp. 108–109.
 Ihler (2005) Ihler, A.T., 2005. Inference in sensor networks: Graphical models and particle methods. Ph.D. thesis. Massachusetts Institute of Technology.

Julier and Uhlmann (1996)
Julier, S.J., Uhlmann, J.K.,
1996.
A general method for approximating nonlinear transformations of probability distributions.
Technical Report. RRG, Department of Engineering Science, University of Oxford.  Kristan and Leonardis (2014) Kristan, M., Leonardis, A., 2014. Online discriminative kernel density estimator with gaussian kernels. Cybernetics, IEEE Trans. on 44, 355–365.
 Kristan et al. (2011) Kristan, M., Leonardis, A., Skočaj, D., 2011. Multivariate online kernel density estimation with Gaussian kernels. Pattern Recognition 44, 2630–2642.

Lichman (2013)
Lichman, M., 2013.
UCI machine learning repository.
http://archive.ics.uci.edu/ml.  MagdonIsmail and Purnell (2009) MagdonIsmail, M., Purnell, J.T., 2009. Approximating the covariance matrix of GMMs with lowrank perturbations. Int. J. Data Mining and Bioinformatics 3, 1–15.
 Mikheev (2006) Mikheev, P., 2006. Multidimensional gaussian probability density and its applications in the degenerate case. Radiophysics and quantum electronics 49, 564–571.
 Orabona (2009) Orabona, F., 2009. DOGMA: a MATLAB toolbox for Online Learning. http://dogma.sourceforge.net.
 Orabona et al. (2010) Orabona, F., Castellini, C., Caputo, B., Jie, L., Sandini, G., 2010. Online independent support vector machines. Pattern Recognition 43, 1402–1412.

Ozertem et al. (2008)
Ozertem, U., Erdogmus, D.,
Jenssen, R., 2008.
Mean shift spectral clustering.
Pattern Recognition 41, 1924–1938.  Pasupathy and Damodar (1992) Pasupathy, J., Damodar, R., 1992. The gaussian toeplitz matrix. Linear algebra and its applications 171, 133–147.
 Pollard (2002) Pollard, D., 2002. A user’s guide to measure theoretic probability. volume 8. Cambridge University Press.
 Runnalls (2007) Runnalls, A.R., 2007. Kullbackleibler approach to gaussian mixture reduction. Aerospace and Electronic Systems, IEEE Trans. on 43, 989–999.
 Silverman (1986) Silverman, B.W., 1986. Density estimation for statistics and data analysis. volume 26. CRC press.

Viola and Jones (2001)
Viola, P., Jones, M., 2001.
Rapid object detection using a boosted cascade of simple features, in: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, IEEE. pp. I–511.
 Wand and Jones (1994) Wand, M.P., Jones, M.C., 1994. Kernel smoothing. volume 60 of Monographs on statistics and applied probability. Chapman & Hall/CRC.