Online learning is needed when data is not known a priori, its distribution evolves, or when the data model must be updated without interrupting operation: the model must evolve as new data appears, allowing older data to become progressively less important. Our goal is to produce a fast, robust way of handling long-running modelling tasks. Since capturing, processing, and saving all past observed samples is computationally unfeasible, similar samples can be grouped and represented by more compact entities, such as gmms.
kde methods are nonparametric approaches to build gmms that do not require prior definition of the number of components. In order to manage complexity, some approaches compress the model to a predefined number of components Goldberger and Roweis (2004), or optimize some data-driven choice Chen et al. (2010). A different approach Ozertem et al. (2008)
is to view model compression as a clustering problem. The main difficulty of adapting kde methods to online operation lies in maintaining sufficient information to generalize to unobserved data and to adjust model complexity without access to all past samples. An online approach based on mean-shift mode finding adds each new sample to the model as a Gaussian component. However, it is sensitive to non-Gaussian areas due to skewed or heavy tailed dataHan et al. (2008). A two-level approach Declercq and Piater (2008)
, based on the idea that each component of the (non-overfitting) mixture is in turn represented by an underlying mixture that represents data very precisely (possibly overfitting), allows the model to be refined without sacrificing accuracy, as a merge in the upper level can later be accurately split. It is also able handle the problem of non-Gaussian data by using a uniform distribution in those regions.
The okde approach Kristan and Leonardis (2014); Kristan et al. (2011) builds a two-level model of the target distribution. It maintains a sample distribution that is used to compute the corresponding kde. It supports single-sample updates and has a forgetting factor. This allows model adaptation when the target distribution evolves. Compressions approximate component clusters using single Gaussians. If the target distribution changes sufficiently, compressions can be reversed, using the underlying model kept by each component.
Our approach follows okde, obtaining comparable or better results. It allows for flexibility, abstraction, efficient memory use, computational efficiency, and numerical robustness. Furthermore, our C++ implementation, based on state-of-the-art libraries Guennebaud et al. (2010), allows easy exploration of modelling alternatives.
The article is organized as follows: Section 2 describes okde. Section 3 describes our adaptations for numerical stability. Section 4 describes our approach to high dimensionality. Section 5 presents specific computational strategies to improve performance. Section 6 describes the evaluation setup and Section 7 presents our results. Section 8 concludes the article.
2 Online Kernel Density Estimation
okde was not developed with high dimensionality in mind. Although it allows for any number of dimensions, computational complexity scales quadratically with the number of dimensions, making the approach computationally prohibitive. Furthermore, problems due to overflows, underflows, and precision issues also arise. These issues worsen as the number of dimensions grows, limiting the usefulness of the approach. Before describing our approach, we present okde in detail.
2.1 Model Definition
To maintain low complexity, is compressed when a threshold is met. To allow recovery from mis-compressions, a model of the observations is also kept for each component: , where is a mixture model (with at most two components) for the -th component of .
The kernel density estimate is defined as a convolution of by a kernel with a covariance (bandwidth) H:
2.2 Optimal Bandwidth
KDE approaches determine a bandwidth H that minimizes the distance between and (the unknown) generating the data. The bandwidth can be written as ( is the scale and is the structure). Finding H is equivalent to finding the optimal scaling (Eq. 4), where (Eq. 2.2) and , , , , and is the covariance of the -th component of .
G is the pilot bandwidth (Eq. 6). is the covariance of a single Gaussian approximating the entire model. The structure of H can be approximated by the structure of the covariance of the observations Wand and Jones (1994); Duong and Hazelton (2003), i.e., .
2.3 Model Compression
Compression approximates the original -component by an -component, , (Eq. 7), such that the compressed kde does not change significantly.
Clustering approaches can be used to compress . The aim is to find clusters such that each cluster can be approximated by a single component in . Let be a set of assignments clustering into sub-mixtures. The sub-mixture corresponding to indexes is defined as
The parameters of each
are defined by matching the first two moments (mean and covariance) of
must be such that and (a threshold). is the largest local clustering error , i.e., the error induced under the KDE if is approximated by a single Gaussian.
2.4 Local Clustering Error
2.5 Distance between mixture models
The Hellinger distance cannot be calculated analytically for mixture models and is approximated by the unscented transform Julier and Uhlmann (1996) (Eq. 13), where are weighted sets of sigma points of the -th Gaussian .
Note that is the -th column of ( has columns). The sigma points are simply () and . Thus, must be counted separately for the and sets, such that for each set of sigma points. Let
be a singular value decomposition of, such that and , then . In line with Julier and Uhlmann (1996), was set to 3.
2.6 Hierarchical Compression
is first split into two sub-mixtures using Goldberger’s K-meansGoldberger and Roweis (2004). To avoid singularities associated with Dirac-delta components of , K-means is applied to . Next, each sub-mixture is approximated by a single Gaussian (section 2.3) and the sub-mixture yielding the largest local error is further split into two sub-mixtures. The process continues until , producing a binary tree where each of the leaves represents the clustering assignments . can then be used to approximate the corresponding components in by a single Gaussian, resulting in the compressed distribution .
If two components of are merged, their detailed models must also be merged: the detailed model of the -th component in is calculated by first defining a normalized extended mixture (Eq. 14). If this mixture has more than two components, then the two-component is generated by splitting into two sub-mixtures using Golberger’s K-means and approximating each sub-mixture by a single Gaussian.
Over-compressions can be detected by checking whether the local clustering error of each component, evaluated against its detailed model , exceeds . Those components are removed from and replaced by the components of their detailed models. Each new component needs a detailed model: new ones are generated based on their covariances. Let be one of the new components. If the determinant of is zero, then the component is a single data-point and its detailed model is just the component itself. Otherwise, the component has been generated through clustering in previous compression steps: its detailed model is initialized by splitting along its principal axis Huber (2015) into a two-component mixture, whose first two moments match those of the original component. This has two advantages: first, since the component is symmetric around the mean, the splitting process minimizes the induced error; and second, it is moment preserving, i.e., the mean and covariance of the split Gaussian (and, thus, of the entire mixture), remains unchanged.
3 Improving Numeric Stability
In this section, we detail strategies that allow working with skewed or degenerate sample distributions, as well as avoiding underflows and overflows.
3.1 Degenerate Covariance Matrices
Some data dimensions may seldom change or sufficient observations with different values may not have occurred, causing covariance values of 0 along those axes. One way to detect and correct singular or near-singular matrices is to compute their eigen decomposition , with
, and check for eigenvalues smaller than. These eigenvalues are then corrected by 1% of the average of the eigenvalues (Eq. 15, 16, 17). The corrected covariance is then given by .
3.2 Determinant Computation
The Gaussian pdf (Eq. 2) uses the determinant () and the inverse () of the covariance (). A way to efficiently compute the inverse of a non-singular matrix is to first compute its LU decomposition and use it to compute the inverse. Since we guarantee that is positive definite, the determinant can be computed by multiplying the diagonal entries of U. However, when working in high dimensionality, overflows or underflows are bound to occur, even when using double precision floating point. To avoid this problem, we compute the logarithm of the determinant instead: this is simply the sum of the logarithm of each entry in the diagonal of matrix U.
3.3 Whitening Transformation
A single means that all components are scaled equally in all dimensions. If the spread of data is much greater in one dimension than in others, extreme differences of spread in the various dimensions may occur. This can be avoided by first whitening the data, by transforming it to have unit covariance, smoothing it using a radially symmetric kernel, and, finally, transforming it back Fukunaga (1990). Whitening allows to better fit the global distribution of the data, leading to better models Silverman (1986).
In our approach, both the computation of
and model compression are performed on whitened data. Whitening starts by approximating the whole model by a single Gaussian. The parameters needed to transform the model covariance into the identity matrix are then computed and individually applied to all components in the mixture. The inverse operation can be used to recover the model.
Formally, the whitening transformation of data x with covariance is defined as , where and
are the eigenvalues and eigenvectors from the eigendecomposition of the covariance matrix.
Whitening impacts the model in two ways. The first is computational performance: since the structure is equal to , it significantly saves computational costs in matrix multiplications (eq. 2.2). The second is model quality, as both the computation of and the compression phase, performed using Goldberger’s K-means, benefit from working on spherized data.
3.4 Optimal Bandwidth Stability
Sometimes, observed data may produce zero-valued . Recall that we assume that the structure of is well-approximated by the covariance of the entire model and that finding amounts to finding (section 2.2). If this scaling factor is zero, e.g. caused by precision issues or by data distribution in the model, will be zero. This means that no smoothing will be applied to the KDE. In this case, Dirac-deltas will remain with zero covariance, not allowing the use of the standard likelihood function. Since is computed on whitened data, one solution for this problem is to detect these occurrences and set the bandwidth as the identity matrix, and then computing the backwards whitening transformation.
4 Using Diagonal Covariance Matrices
Since a kde is a linear combination of components, full covariances may not be necessary, even when features are not statistically independent: a second linear combination of (a higher number of) components with diagonal covariances is capable of modelling the correlations between features.
With diagonal covariances, computational needs grow linearly with the number of dimensions, a very convenient aspect for high dimensionality applications. This is in contrast with quadratic growth for full covariances. Furthermore, for non-normalized data, diagonal covariances may improve stability, as certain relationships that could result in singular covariances are ignored. Our numerical stability strategies have a stronger impact with diagonal covariances: for data with flat or near-flat dimensions this is particularly relevant (section 7).
5 Lazy Operation and Result Buffering
In theory, each time a sample is added, should be updated. However, is only needed for the compression phase or to evaluate the likelihood of a given sample. In other cases, computation of may be postponed. This means that the cost of adding a sample is simply that of adding one Dirac-delta component to the mixture and updating the current weights. This can be done without impact on model quality.
The determinant and the inverse of covariance matrices are needed to compute the likelihood function (which is called very often). Thus, it makes sense to save them, once they have been computed, and keep them until the covariance changes. Then, previous values are marked as invalid, but are not recomputed immediately. As with , it is advantageous to postpone these computations until they are actually needed, making the process of adding components to the mixture faster.
6 Evaluation Setup
To evaluate our approach and to critically compare our results, we follow the same evaluation strategies of the original oKDE paper Kristan et al. (2011). For intrinsic evaluation, we assess model quality by using the average negative log-likelihood and model complexity. For extrinsic evaluation, we assess the accuracy of a classification task. For the latter, we also compare our results with a discriminative approach, since it is typically better suited for classification tasks. We used the oisvm Orabona (2009); Orabona et al. (2010).
The datasets (Table 1 and Fig. 1) and evaluation setup from the original oKDE paper were used. Since some of the datasets have been updated, we ran both approaches on the most recent versions. We also evaluated our approach on the LF Face Database, a high dimensional scenario (section 6.1). For the online classification task, we randomly shuffled the data in each dataset and used 75% of the data to train and the rest to test. For each dataset, to minimize impact of lucky/unlucky partitions, we generated 12 random shuffles. Since we wish to make minimal assumptions about data nature and distribution, we did not do any data preprocessing.
|Red (wine quality, red)||1599||11||6|
|White (wine quality, white)||4898||11||7|
|Seg (image segmentation)||2310||19||7|
|Cancer (breast cancer)||569||30||2|
For the kde approaches, each class is represented by the model built from its training samples: for oKDE, each model is initialized with 3 samples before adding the rest, one at a time; in our approach, due to lazy operation, there is no need for initialization, and each model is trained by adding one sample at a time from the very start. To evaluate classification performance we used .
For oisvm, we trained
binary classifiers, using training samples from a class as positive examples, and the other training samples as negative examples. Then, we followed a 1-vs-All approach,, in which is the confidence score of the -th binary classifier for x. A second order polynomial kernel was used with gamma and coef0 parameters set to 1. The complexity parameter C was also set to 1.
6.1 LF Face Database
The LF Face Database consists of 30,000 indoor free pose face images from 10 subjects (3,000 images per subject). Capture was performed using a PlayStation Eye camera (640x480 pixel). A Haar cascade Viola and Jones (2001); Bradski (2000) detected frontal face poses, but also some high inclination and head rotation poses. The square face regions were cropped and resized to 64x64 pixels. During capture, each subject was asked to behave normally and avoid being static, to avoid restrictions on facial pose or expression. Since faces tend to be centered in the cropped square, a fixed mask, roughly selecting parts of eyes, nose, and cheeks, was applied to the crops, producing 128-pixel vectors.
Tests were run on an Intel Xeon E5530@2.40GHz, 48GB RAM, Linux openSUSE 13.1. oKDE and oisvm were run on MATLAB R2013a with -nosplash -nodesktop -nodisplay -nojvm. Our approach (xokde++) was compiled enabling all vectorizing options available on our CPU architecture. In the tables, “—” indicates that the model could not be built.
7.1 Model Quality
= one standard deviation).
Relative to okde, xokde++ produces models with similar complexity but with lower average negative log-likelihood (better fits). This is clear in datasets where some dimensions have very little variance while others have very large variance. The “steel”, “segmentation”, and “cancer” datasets are examples of this kind of problem. The numerical stability methods in xokde++ allow recovery from these situations (section3.1).
7.2 Classifier Accuracy
Table 4 shows that xokde++ achieves better performance in 7 out of 12 datasets, and lower performance in only 3 datasets. Some of the results for oKDE are different than those reported in the original paper, perhaps due to the fact that no dataset was balanced or normalized for unit variance. We intentionally used the datasets with no further processing to study a more realistic online operation scenario, where the sample distribution is not known. These results demonstrate the numerical robustness of xokde++ in handling non-normalized datasets. oisvm typically outperforms the generative approaches, but only slightly. However, it shares some of the oKDE numerical instabilities.
The high dimensionality scenario was tested using the LF Face Database. okde, oisvm, and xokde++ (with full covariances), were unable to build models for this dataset. xokde++ (with diagonal covariances) obtained an accuracy of 94.1%, very close to the accuracy of 95.4% obtained by a batch SVM.
7.3 Time and Memory Performance
xokde++ produces models with similar or better quality than okde. Moreover, it does so at lower computational cost. Tables 5 and 6 show the time and memory needed to train and test each dataset: xokde++ achieves speedups from 3 to 10; with diagonal covariances, speedups range from 11 to 40. Regarding memory, xokde++ uses at most 10% of the memory required by okde. This difference is more critical in large datasets such as the “skin” and “covtype”. For these datasets, oKDE needed 913MB and 5064MB, while xokde++ needed only 83MB and 361MB, respectively.
Regarding memory, oisvm performs poorly when compared with the other approaches. For the “letter” dataset, with 26 classes, the memory needed by oisvm was 4 times the required by okde and 43 and 78 times more than xokde++ with full and diagonal covariances, respectively. For large datasets, time performance also degrades: the “skin” dataset took nearly twice the time it took for okde. This is even clearer for the “covtype” dataset, which has 500,000 samples: 7 days of computation were not enough to complete training a single shuffle, showing that it was unfeasible to complete the 12 shuffles of our evaluation setup in an acceptable time frame.
7.4 Performance of Full vs. Diagonal Covariances
Table 7 presents a detailed comparison of the two xokde++ variants, regarding model quality and computational needs. Using diagonal covariances produces models with slightly higher complexity at a small loss in model quality, while needing less memory and greatly improving computational time performance. For the “steel” dataset, diagonal covariances also allowed the model to cope with feature dimensions with low variance, thus avoiding the severe numerical instability found when using full covariances, where the lack of feature variance frequently produced singular matrices.
For full covariances, the number of variables increases quadratically with the number of dimensions and datasets with higher dimensionality will benefit from using diagonal covariances. Even though the diagonal approach tends to need more components, partially offsetting memory savings, the average memory usage of the diagonal approach is just 78.5% of the needed for full covariances. We observed an improvement of this number when we considered more than 15 dimensions: the average memory usage falls to 73% with only small losses in accuracy, around 0.72% (absolute). Considering all datasets, we obtain an average value of 1.67% (absolute) accuracy loss against full covariance models.
xokde++ is a state-of-the-art online kde approach. It is efficient, numerically robust, and able to handle high dimensionality. Model quality comparable to that of oKDE, while needing significantly less memory and computation time. The numerical stability improvements allow xokde++ to cope with non-normalized data and achieve an average classifier accuracy of 78% (against oKDE’s 73%). This is a 6.5% average improvement over the baseline. Furthermore, it continues to achieve good results where other approaches fail to even convege (section 7.2). From a software standpoint, xokde++ is extensible, due to its pure template library nature, as shown by the use of diagonal covariances, which were easily implemented and provide aditional time and memory efficiency while sacrificing little model quality.
Besides speed and memory efficiency, one central contribution of our approach is its adaptability and extensibility. Since it was implemented in an OOP language (C++), the templated classes allow for easy extension, as shown by switching between full and diagonal covariances. Other parts of the code are also easily extended, e.g., using triangular covariances, or replacing the Gaussian pdf with a different one. Similarly, changing the compression phase is also a simple process.
Future work to improve computational performance includes high-performance computing adaptations. Other improvements include changing the hierarchical compression approach, to another, more paralelizable approach such as pairwise merging of components Runnalls (2007). Moreover, model quality can also be improved. Instead of using full or diagonal covariances, which are either too large for high dimensionality or too restrictive for certain non-linearly independent data, covariances could be approximated with low-rank perturbations Magdon-Ismail and Purnell (2009), or using Toeplitz matrices Pasupathy and Damodar (1992); Cai et al. (2014). Finally, improvements on numerical stability and model quality may be obtained by avoiding correction of degenerate covariances and, instead, compute pseudo-inverses and pseudo-log-determinants Mikheev (2006).
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.
- Bradski (2000) Bradski, G., 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools .
- Cai et al. (2014) Cai, T.T., Ren, Z., Zhou, H.H., 2014. Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. The Annals of Statistics 38, 2118–2144.
- Chen et al. (2010) Chen, S., Hong, X., Harris, C.J., 2010. Probability density estimation with tunable kernels using orthogonal forward regression. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Trans. on 40, 1101–1114.
- Declercq and Piater (2008) Declercq, A., Piater, J.H., 2008. Online learning of Gaussian mixture models - a two-level approach., in: VISAPP (1), pp. 605–611.
- Duong and Hazelton (2003) Duong, T., Hazelton, M., 2003. Plug-in bandwidth matrices for bivariate kernel density estimation. Journal of Nonparametric Statistics 15, 17–30.
- Fukunaga (1990) Fukunaga, K., 1990. Introduction to statistical pattern recognition. 2 ed.. Academic press. p. 128.
- Goldberger and Roweis (2004) Goldberger, J., Roweis, S.T., 2004. Hierarchical clustering of a mixture model, in: Advances in Neural Information Processing Systems, pp. 505–512.
- Guennebaud et al. (2010) Guennebaud, G., Jacob, B., et al., 2010. Eigen v3. http://eigen.tuxfamily.org.
- Han et al. (2008) Han, B., Comaniciu, D., Zhu, Y., Davis, L.S., 2008. Sequential kernel density approximation and its application to real-time visual tracking. Pattern Analysis and Machine Intelligence, IEEE Trans. on 30, 1186–1197.
- Huber (2015) Huber, M., 2015. Nonlinear Gaussian Filtering: Theory, Algorithms, and Applications. KIT Scientific Publishing. volume 19. pp. 108–109.
- Ihler (2005) Ihler, A.T., 2005. Inference in sensor networks: Graphical models and particle methods. Ph.D. thesis. Massachusetts Institute of Technology.
Julier and Uhlmann (1996)
Julier, S.J., Uhlmann, J.K.,
A general method for approximating nonlinear transformations of probability distributions.Technical Report. RRG, Department of Engineering Science, University of Oxford.
- Kristan and Leonardis (2014) Kristan, M., Leonardis, A., 2014. Online discriminative kernel density estimator with gaussian kernels. Cybernetics, IEEE Trans. on 44, 355–365.
- Kristan et al. (2011) Kristan, M., Leonardis, A., Skočaj, D., 2011. Multivariate online kernel density estimation with Gaussian kernels. Pattern Recognition 44, 2630–2642.
Lichman, M., 2013.
UCI machine learning repository.http://archive.ics.uci.edu/ml.
- Magdon-Ismail and Purnell (2009) Magdon-Ismail, M., Purnell, J.T., 2009. Approximating the covariance matrix of GMMs with low-rank perturbations. Int. J. Data Mining and Bioinformatics 3, 1–15.
- Mikheev (2006) Mikheev, P., 2006. Multidimensional gaussian probability density and its applications in the degenerate case. Radiophysics and quantum electronics 49, 564–571.
- Orabona (2009) Orabona, F., 2009. DOGMA: a MATLAB toolbox for Online Learning. http://dogma.sourceforge.net.
- Orabona et al. (2010) Orabona, F., Castellini, C., Caputo, B., Jie, L., Sandini, G., 2010. On-line independent support vector machines. Pattern Recognition 43, 1402–1412.
Ozertem et al. (2008)
Ozertem, U., Erdogmus, D.,
Jenssen, R., 2008.
Mean shift spectral clustering.Pattern Recognition 41, 1924–1938.
- Pasupathy and Damodar (1992) Pasupathy, J., Damodar, R., 1992. The gaussian toeplitz matrix. Linear algebra and its applications 171, 133–147.
- Pollard (2002) Pollard, D., 2002. A user’s guide to measure theoretic probability. volume 8. Cambridge University Press.
- Runnalls (2007) Runnalls, A.R., 2007. Kullback-leibler approach to gaussian mixture reduction. Aerospace and Electronic Systems, IEEE Trans. on 43, 989–999.
- Silverman (1986) Silverman, B.W., 1986. Density estimation for statistics and data analysis. volume 26. CRC press.
Viola and Jones (2001)
Viola, P., Jones, M., 2001.
Rapid object detection using a boosted cascade of simple features, in: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, IEEE. pp. I–511.
- Wand and Jones (1994) Wand, M.P., Jones, M.C., 1994. Kernel smoothing. volume 60 of Monographs on statistics and applied probability. Chapman & Hall/CRC.