I Introduction
The covariance matrix of multivariate data is required in many sequential machine learning and neural network (NN) based applications [1], including speech recognition [2], deep learning architectures for image processing and computer vision [3, 4, 5], stochastic fuzzy NNs [6], pricing option contracts in financial markets [7], adaptive tracking control problems [8], detection tasks [9, 10], and many others. In settings where data arrives sequentially, the covariance matrix must be updated in an online manner [11, 12]. Techniques such as cross-validation, which attempt to impose regularization, or model selection are typically not feasible in such settings [13]. Instead, to minimize complexity, it is often assumed that the covariance matrix is known in advance [6] or that it is restricted to a specific simplified structure, such as a diagonal matrix [14, 3]. Moreover, when the number of observations is comparable to the number of variables, the covariance estimation problem becomes far more challenging. In such scenarios, the sample covariance matrix is neither well-conditioned nor necessarily invertible (despite the fact that both properties are required for most applications). When the number of observations is smaller than the dimensionality, the inversion cannot be computed at all [15, 16].
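To make the last point concrete, the following sketch (with illustrative dimensions, not taken from the paper) shows that with fewer observations than variables the sample covariance matrix is rank-deficient and hence cannot be inverted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                        # fewer observations (n) than variables (p)
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)          # p x p sample covariance matrix

# After mean subtraction the rank is at most n - 1, so S is singular
print(np.linalg.matrix_rank(S))      # 9, far below p = 50
```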
An extensive body of literature exists concerning improved estimators in such situations [17, 18]. However, in the absence of specific knowledge about the structure of the true covariance matrix, the most successful approach so far has, arguably, been shrinkage estimation [19]. It has been demonstrated in [20] that the largest sample eigenvalues are systematically biased upward and the smallest ones downward. This bias is corrected by pulling down the largest eigenvalues and pushing up the smallest ones, toward their grand mean.
The optimal shrinkage coefficient can be obtained analytically, which is a huge advantage for deep learning architectures, since a key factor in realizing such architectures is the resource complexity involved in their training [21]. An example of such a deep architecture is the deep spatiotemporal inference network (DeSTIN) [3]. The latter extensively utilizes the quadratic discriminant analysis (QDA) classifier under the simplifying assumption that the covariance matrices involved in the process are diagonal. Such an assumption is made in order to avoid additional complexity during the training and inference processes. It is well known that for a small ratio of training observations to observation dimensionality, the QDA classifier performs poorly, due to highly variable class-conditional sample covariance matrices. In order to improve the classifier's performance, regularization is required, with the aim of providing an appropriate compromise between the bias and the variance of the solution. It has been demonstrated in [22, 23] that the QDA classifier can be improved tremendously using shrinkage estimators. The sequential approximated inverse of the shrinkage estimator, derived in this paper, allows us to utilize the shrinkage estimator in the DeSTIN architecture with relatively negligible additional complexity. In addition, the relatively simple update rules pave the way to implementing the inverse shrinkage estimator on analog computational circuits, offering the potential for a large improvement in power efficiency [24].

The rest of this paper is organized as follows: Section 2 presents the general idea of the shrinkage estimator. In Section 3, we derive a sequential update for the shrinkage estimator, while in Section 4, the related approximated inverses are derived. In Section 5, we conduct an experimental study and examine the sequential update rules.
Notations: we denote vectors by lowercase boldface letters and matrices by uppercase boldface letters. The transpose operator is denoted as $(\cdot)^T$. The trace, the determinant and the Frobenius norm of a matrix are denoted as $\mathrm{tr}(\cdot)$, $\det(\cdot)$ and $\|\cdot\|_F$, respectively. The identity matrix is denoted as $\mathbf{I}$, while $\mathbf{1}$ is a column vector of all ones. For any real matrices $\mathbf{A}$ and $\mathbf{B}$, the inner product is defined as $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^T \mathbf{B})$, where $\langle \mathbf{A}, \mathbf{A} \rangle = \|\mathbf{A}\|_F^2$ [15, Sec. 2.20].

II Shrinkage Estimator for Covariance Matrices
We briefly review a single-target shrinkage estimator by following [20, 25], which is generally applied to high-dimensional estimation problems. Let $\{\mathbf{x}_i\}_{i=1}^{n}$ be a sample of independent identically distributed (i.i.d.) $p$-dimensional vectors drawn from a density with a mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. When the number of observations $n$ is large (i.e., $n \gg p$), the most common estimator of $\boldsymbol{\Sigma}$ is the sample covariance matrix
$$\mathbf{S} = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T, \qquad (1)$$
where $\bar{\mathbf{x}}$ is the sample mean, defined as
$$\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i. \qquad (2)$$
Both $\bar{\mathbf{x}}$ and $\mathbf{S}$ are unbiased estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, respectively, i.e., $E\{\bar{\mathbf{x}}\} = \boldsymbol{\mu}$ and $E\{\mathbf{S}\} = \boldsymbol{\Sigma}$. The shrinkage estimator is of the form

$$\hat{\boldsymbol{\Sigma}}(\lambda) = (1-\lambda)\mathbf{S} + \lambda \mathbf{T}, \qquad (3)$$
where the target $\mathbf{T}$ is a restricted estimator of $\boldsymbol{\Sigma}$ defined as

$$\mathbf{T} = \frac{\mathrm{tr}(\mathbf{S})}{p}\,\mathbf{I}. \qquad (4)$$
The work in [20] proposed to find an estimator which minimizes the mean squared error (MSE) with respect to $\lambda$, i.e.,

$$\lambda_O = \arg\min_{\lambda} E\left\{ \left\| \hat{\boldsymbol{\Sigma}}(\lambda) - \boldsymbol{\Sigma} \right\|_F^2 \right\}, \qquad (5)$$
which can be given by the distribution-free formula

$$\lambda_O = \frac{E\left\{ \left\langle \mathbf{S} - \boldsymbol{\Sigma},\, \mathbf{S} - \mathbf{T} \right\rangle \right\}}{E\left\{ \left\| \mathbf{S} - \mathbf{T} \right\|_F^2 \right\}}. \qquad (6)$$
The scalar $\lambda_O$ is called the oracle shrinkage coefficient, since it depends on the unknown covariance matrix $\boldsymbol{\Sigma}$. Therefore, (6) must be estimated. The latter can be estimated from its sample counterparts as in [25]. We denote this estimator as $\hat{\lambda}$.
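As a concrete illustration of (1)-(4), the estimator can be sketched in a few lines. This is a minimal sketch: the shrinkage coefficient `lam` is passed in directly, whereas the paper estimates it from the data as in [25].

```python
import numpy as np

def shrinkage_covariance(X, lam):
    """Single-target shrinkage estimator (1 - lam) * S + lam * T,
    with scaled-identity target T = (tr(S) / p) * I as in (4)."""
    n, p = X.shape
    x_bar = X.mean(axis=0)                 # sample mean, eq. (2)
    X_c = X - x_bar
    S = X_c.T @ X_c / (n - 1)              # sample covariance, eq. (1)
    T = (np.trace(S) / p) * np.eye(p)      # shrinkage target, eq. (4)
    return (1.0 - lam) * S + lam * T       # eq. (3)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))          # n = 10 observations in p = 50 dims
Sigma_hat = shrinkage_covariance(X, lam=0.3)
# unlike the sample covariance, the shrunk estimate is full rank
print(np.linalg.matrix_rank(Sigma_hat))    # 50
```

Because the target adds a positive multiple of the identity, the estimate is invertible for any number of observations, which is exactly what the sequel exploits.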
III Sequential Update of the Shrinkage Estimator
IV Sequential Update for the Inverse of the Shrinkage Estimator
In this section, we derive approximated inverses of the shrinkage estimator which are updated sequentially and do not involve any matrix inversion. We start, therefore, from the inverse of the sample covariance matrix, which can be obtained from the current inverse of (1) using the Sherman-Morrison-Woodbury matrix identity [26, Ch. 3] as
(13) 
The last update rule can be used only if $\mathbf{S}$ is invertible; it will not be invertible for $n \leq p$. Since the shrinkage estimator (3) is a regularized version of (1), its inverse exists for any number of observations. Computing the inverse of (3) involves two main steps. The first is to update the inverse of (11) from an inverse of (3). The second is to update the next-step inverse of (3) from (12) and the inverse of (11) calculated in the first step. Suppose, for example, that the exact inverse of (3) is known. In the same manner as in (13), the inverse of (11) can be calculated from it as
(14) 
Using [15, 15.11.(b)], the exact inverse can be calculated from (14) and (12) through a sequence of iterations
(15) 
where the iterations run over the columns of (12) and of the identity matrix $\mathbf{I}$, respectively. The inverse of (10) is equal to the output of the last iteration, i.e.,
(16) 
In order to avoid the calculation of these iterations, we can use approximations of (16). The inverse approximations of the shrinkage estimator are discussed in the following subsection.
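For reference, the rank-one special case of the Sherman-Morrison-Woodbury identity invoked above can be illustrated generically. This is a standalone sketch of the identity itself, not the paper's specific update rule (13):

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, without any matrix inversion."""
    Au = A_inv @ u                         # p-vector
    vA = v @ A_inv                         # p-vector
    denom = 1.0 + v @ Au                   # scalar; must be nonzero
    return A_inv - np.outer(Au, vA) / denom

rng = np.random.default_rng(1)
p = 6
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)                # a well-conditioned SPD matrix
u = rng.standard_normal(p)

# rank-one update of the inverse, verified against direct inversion
updated = sherman_morrison(np.linalg.inv(A), u, u)
print(np.allclose(updated, np.linalg.inv(A + np.outer(u, u))))   # True
```

Each such update costs only matrix-vector products, which is the source of the complexity savings sought in this section.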
IV-A Inverse Approximations for the Shrinkage Estimator
We consider two approximations for (16). The first approximation is defined as
(17) 
where
(18) 
The matrix (18) differs from (14) in that it relies on the approximated inverse (17) instead of the exact inverse (16). A possible motivation justifying the update rule (17) stems from the mean value theorem, as explained in [27]. Another motivation arises from the Neumann series [15, Sec. 19.15], where (16) is approximately equal to (17) for a relatively small shrinkage coefficient. We define the scalar in (17) as the value that minimizes the reconstruction squared error, i.e.,
(19) 
which is equal to
(20) 
An additional simplification can be obtained by examining the last term in (12). Under the assumption that the difference is relatively small, we can write an approximation of (12) by neglecting its last term, i.e.,
(21) 
This leads to the second approximation of (16), denoted as
(22) 
where
(23) 
where the corresponding coefficient is calculated by
(24) 
We examine these two approximations in the following section.
V Experiments
In this section we implement and evaluate the sequential update of the inverse shrinkage estimator. As in [28], we assume that the observations are i.i.d. Gaussian vectors. In order to study the estimators' performance, an autoregressive covariance matrix is used. We let $\boldsymbol{\Sigma}$ be the covariance matrix of a Gaussian AR(1) process [29], denoted by

$$\boldsymbol{\Sigma}_{ij} = r^{|i-j|}. \qquad (25)$$

As in [17, 28], we use the same setting of the coefficient $r$. In all simulations, the dimensionality $p$ is fixed and the number of observations $n$ ranges from 1 to 30. Each simulation is repeated 200 times and the average values are plotted as a function of $n$. The experimental results are summarized in box plots. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, and the whiskers correspond to approximately $\pm 2.7\sigma$, or 99.3% coverage if the data are normally distributed. The outliers are plotted individually.
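Under the usual convention for a Gaussian AR(1) process, the entries of the matrix in (25) decay geometrically away from the diagonal. A minimal construction is sketched below; the value r = 0.5 is illustrative only, since the paper takes its setting of r from [17, 28].

```python
import numpy as np

def ar1_covariance(p, r):
    """AR(1) covariance matrix with entries Sigma_ij = r ** |i - j|."""
    idx = np.arange(p)
    return r ** np.abs(idx[:, None] - idx[None, :])

Sigma = ar1_covariance(4, 0.5)   # r = 0.5 is an illustrative choice
print(Sigma[0])                  # [1.    0.5   0.25  0.125]
```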
The reconstruction errors of the approximated inverses (17) and (22) are defined by
(26) 
and
(27) 
respectively. These reconstruction errors are normalized by $p$, since $p$ is the squared Frobenius norm of the identity matrix $\mathbf{I}$. We examine the approximated inverses (17) and (22), where the shrinkage coefficient is set equal to the estimate of [25]. The experimental results for the reconstruction errors (26) and (27) are summarized in Fig. 1 and Fig. 2, respectively. The values of (26) converge on average to zero as the number of observations increases. In several simulations, however, the update rule accumulates error and diverges.
The related reconstruction error (27) is depicted in Fig. 2. The reconstruction error (27) does not converge to zero due to the simplification involved in using (21) instead of (12). However, the use of (21) renders (22) much more robust to outliers in comparison to the first approximation (17). In that sense, a relatively small and fixed reconstruction error can be accepted in order to avoid unexpected outliers.
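The normalized error metric in (26)-(27) can be computed generically as below; `inv_approx` stands in for either approximation (17) or (22), and the exact inverse is used here only as a sanity check that the metric vanishes when the approximation is perfect:

```python
import numpy as np

def reconstruction_error(Sigma_hat, inv_approx):
    """Squared Frobenius norm of (Sigma_hat @ inv_approx - I),
    normalized by p = ||I||_F^2, as in (26)-(27)."""
    p = Sigma_hat.shape[0]
    residual = Sigma_hat @ inv_approx - np.eye(p)
    return np.linalg.norm(residual, 'fro') ** 2 / p

rng = np.random.default_rng(2)
p = 8
M = rng.standard_normal((p, p))
Sigma_hat = M @ M.T + p * np.eye(p)          # SPD stand-in for the estimator
err = reconstruction_error(Sigma_hat, np.linalg.inv(Sigma_hat))
print(err < 1e-12)                           # exact inverse: error ~ 0
```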
VI Conclusions
A key challenge in many large-scale sequential machine learning methods stems from the need to obtain the covariance matrix of the data, which is unknown in practice and must be estimated. In order to avoid additional complexity during the modeling process, it is commonly assumed that the covariance matrix is known in advance or, alternatively, simplified estimators are employed. In Section 3, we derived a sequential update rule for the shrinkage estimator, which offers a compromise between the sample covariance matrix and a well-conditioned matrix. The optimal shrinkage coefficient, in the sense of mean squared error, is obtained analytically, which is a notable advantage since a key factor in realizing large-scale architectures is the resource complexity involved. In Section 4, sequential update rules that approximate the inverse shrinkage estimator were derived. The experimental results in Section 5 clearly demonstrate that the reconstruction errors of the approximated inverses are relatively small. The sequential update rules that approximate the inverse of the shrinkage estimator provide a general result that can be utilized in a wide range of sequential machine learning applications. The approach therefore paves the way for improved large-scale machine learning methods that involve sequential updates in high-dimensional data spaces.
References
[1] Jun-Kun Wang et al., “Robust inverse covariance estimation under noisy measurements,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 928–936.
[2] G. E. Dahl, Dong Yu, Li Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, Jan 2012.
[3] S. Young, Junjie Lu, J. Holleman, and I. Arel, “On the impact of approximate computation in an analog DeSTIN architecture,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 934–946, May 2014.

[4] M. Ranzato and G. E. Hinton, “Modeling pixel means and covariances using factorized third-order Boltzmann machines,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010, pp. 2551–2558.
[5] Tomer Lancewicki and Mayer Aladjem, “Locally multidimensional scaling by creating neighborhoods in diffusion maps,” Neurocomputing, vol. 139, pp. 382–396, 2014.

[6] Long Xu, Junping Wang, and Quanshi Chen, “Kalman filtering state of charge estimation for battery management system based on a stochastic fuzzy neural network battery model,” Energy Conversion and Management, vol. 53, no. 1, pp. 33–39, 2012.
[7] Joao F. G. de Freitas, Mahesan Niranjan, Andrew H. Gee, and Arnaud Doucet, “Sequential Monte Carlo methods to train neural network models,” Neural Computation, vol. 12, no. 4, pp. 955–993, 2000.
[8] H. E. Psillakis and A. T. Alexandridis, “NN-based adaptive tracking control of uncertain nonlinear systems disturbed by unknown covariance noise,” IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1830–1835, Nov 2007.
[9] F. Porikli and T. Kocak, “Robust license plate detection using covariance descriptor in a neural network framework,” in 2006 IEEE International Conference on Video and Signal Based Surveillance (AVSS '06), Nov 2006, pp. 107–107.
 [10] Tomer Lancewicki and Itamar Arel, “Covariance matrix estimation for reinforcement learning,” in the 2nd Multidisciplinary Conference on Reinforcement Learning and Decision Making, 2015, pp. 14–18.
[11] Tomer Lancewicki, Benjamin Goodrich, and Itamar Arel, “Sequential covariance-matrix estimation with application to mitigating catastrophic forgetting,” in 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015, pp. 628–633.
[12] G. Sateesh Babu and S. Suresh, “Meta-cognitive neural network for classification problems in a sequential learning framework,” Neurocomputing, vol. 81, pp. 86–96, 2012.
 [13] Joao FG de Freitas, Mahesan Niranjan, and Andrew H Gee, “Regularisation in sequential learning algorithms,” in Advances in Neural Information Processing Systems, 1998, pp. 458–464.
[14] E. Manitsas, R. Singh, B. C. Pal, and G. Strbac, “Distribution system state estimation using an artificial neural network approach for pseudo measurement modeling,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 1888–1896, Nov 2012.
 [15] George A. F. Seber, A Matrix Handbook for Statisticians, John Wiley and Sons, Inc., 2007.
 [16] Tomer Lancewicki, “Regularization of the kernel matrix via covariance matrix shrinkage estimation,” arXiv preprint arXiv:1707.06156, 2017.
[17] Peter J. Bickel and Elizaveta Levina, “Regularized estimation of large covariance matrices,” The Annals of Statistics, vol. 36, no. 1, pp. 199–227, 2008.
[18] Angelika Rohde and Alexandre B. Tsybakov, “Estimation of high-dimensional low-rank matrices,” The Annals of Statistics, vol. 39, no. 2, pp. 887–930, 2011.
[19] Olivier Ledoit and Michael Wolf, “Nonlinear shrinkage estimation of large-dimensional covariance matrices,” Institute for Empirical Research in Economics, University of Zurich, Working Paper no. 515, 2011.

[20] Olivier Ledoit and Michael Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004.
[21] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent, “The difficulty of training deep architectures and the effect of unsupervised pre-training,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 153–160.
[22] T. Lancewicki and M. Aladjem, “Multi-target shrinkage estimation for covariance matrices,” IEEE Transactions on Signal Processing, vol. 62, no. 24, pp. 6380–6390, Dec 2014.
[23] Tomer Lancewicki, Dimensionality Reduction and Shrinkage Estimation in Multivariate Data, Ph.D. thesis, Ben-Gurion University of the Negev, Israel, 2014.
[24] Junjie Lu, S. Young, I. Arel, and J. Holleman, “30.10 A 1 TOPS/W analog deep machine-learning engine with floating-gate storage in 0.13 μm CMOS,” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb 2014, pp. 504–505.
 [25] J. Schafer and K. Strimmer, “A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics,” Statist. Appl. Genet. Molec. Biol., vol. 4, no. 1, 2005.
 [26] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification (2nd ed), John Wiley and Sons, 2001.
[27] Kenneth S. Miller, “On the inverse of the sum of matrices,” Mathematics Magazine, vol. 54, no. 2, pp. 67–72, 1981.
 [28] Yilun Chen, A. Wiesel, Y.C. Eldar, and A.O. Hero, “Shrinkage algorithms for MMSE covariance estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5016–5029, 2010.
 [29] S.M. Pandit and S.M. Wu, Time series and system analysis, with applications, Wiley, 1983.