I Introduction
Machine learning tasks often require a data preprocessing stage, also known as feature extraction. One common and powerful way of extracting features is to employ the "kernel trick", formulated as an inner product in a feature space, which allows us to build interesting extensions of existing algorithms. The general idea is that if an algorithm can be formulated so that the input vector enters only in the form of a scalar product, then that scalar product can be replaced with some other choice of kernel. Many models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally. For instance, the kernel trick can be applied to principal component analysis (PCA) in order to develop a nonlinear variant of PCA [1, 2], while other examples include nearest-neighbor classifiers [3], kernel Fisher discriminant analysis (KFDA) [4], and support vector machines (SVMs) [5].
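As a concrete illustration of the idea (an example of ours, not from the paper): a degree-2 polynomial kernel on two-dimensional inputs corresponds to an explicit three-dimensional feature map, and the kernel evaluates the feature-space inner product without ever forming the features.

```python
import numpy as np

# Explicit feature map for a degree-2 polynomial kernel on 2-D inputs:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that
# <phi(x), phi(y)> = (x^T y)^2. The "kernel trick" computes the
# right-hand side directly, without ever constructing phi(x).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, y):
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
lhs = float(np.dot(phi(x), phi(y)))   # inner product in feature space
rhs = poly2_kernel(x, y)              # same value from the kernel alone
assert abs(lhs - rhs) < 1e-10
```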
The aforementioned schemes attempt to discover structure in the data. For example, in pattern recognition and regression estimation, we are given a training set of inputs and outputs, and attempt to infer the outputs for unseen test inputs. This is only possible if we have some measure of how the test set is related to the training set. Generally speaking, we expect similar inputs to lead to similar outputs, where similarity is commonly measured in terms of a loss function that indicates how well the inferred outputs match the true outputs. The training stage commonly involves a risk function that contains a term measuring the loss incurred for the training patterns. However, in order to generalize well to the test data, it is also necessary to control the complexity of the model used for explaining the training data, a task that is often accomplished with the help of regularization terms [5, Ch. 4]. Minimizing the empirical risk without regularization terms can lead to numerical instabilities and poor generalization performance. Therefore, it is essential to use objective functions that involve both an empirical loss term and a regularization term. A possible way to avoid the aforementioned problems is to use the class of admissible solutions [6], often referred to in statistics as shrinkage estimators [7].

In this paper we point out the connections between regularization of the kernel matrix and shrinkage estimation of the covariance matrix in the feature space. Since the kernel matrix and the sample covariance matrix involve inner and outer products in the feature space, respectively, their spectral properties are tightly connected [8]. This allows us to examine the stability of the kernel matrix through the sample covariance matrix in the feature space, and vice versa. More specifically, the use of kernels often involves a large number of features compared to the number of observations. In this scenario, the sample covariance matrix is neither well-conditioned nor necessarily invertible (despite the fact that these two properties are required for most machine learning applications). This necessitates a solution for the problem of estimating high-dimensional covariance matrices under small sample size settings.
There already exists an extensive body of literature concerning the small sample size scenario [9, 10, 11, 12, 13], in which improved estimates are achieved by incorporating additional knowledge into the estimation process, such as sparseness [14, 15, 16], a graph model [17, 18], or a factor model [19], among others. However, such additional knowledge is often either unavailable or not trustworthy. In the absence of further knowledge about the structure of the true covariance matrix, the most successful approach so far has been shrinkage estimation [20].
This paper proposes an analytic, distribution-free regularization of the kernel matrix through a shrinkage estimation of the sample covariance matrix that is optimal in the sense of mean-squared error (MSE). The regularization can be computed directly from the kernel matrix, thereby releasing us from dealing with the feature space explicitly.
The rest of the paper is organized as follows. In Section 2, we formulate the problem. In Section 3 we introduce the shrinkage estimation and derive its optimal solution with respect to the kernel matrix. In Section 4 we examine the relation between the kernel matrix and the sample covariance matrix. Section 5 presents numerical simulation results for classification tasks. Section 6 summarizes our principal conclusions. To make for easier reading, the derivations of some of our results appear in the appendix.
Notation: We denote vectors by lowercase boldface letters and matrices by uppercase boldface letters. The transpose operator is denoted as $(\cdot)^{T}$. The trace and the Frobenius norm of a matrix are denoted as $\operatorname{tr}(\cdot)$ and $\|\cdot\|_{F}$, respectively. The identity matrix is denoted as $\mathbf{I}$, while $\mathbf{1}$ is a column vector of all ones, and $\mathbf{J}=\mathbf{1}\mathbf{1}^{T}$ is a matrix of ones. The centering matrix is denoted as $\mathbf{H}=\mathbf{I}-\frac{1}{n}\mathbf{J}$. For any real matrices $\mathbf{A}$ and $\mathbf{B}$, the inner product is defined as $\langle\mathbf{A},\mathbf{B}\rangle=\operatorname{tr}(\mathbf{A}^{T}\mathbf{B})$ [21, Sec. 2.20]. To make for easier reading, when $\mathbf{X}$ is a random matrix, we use the notation $\operatorname{Var}(\mathbf{X})=\sum_{ij}\operatorname{Var}(x_{ij})$ (the sum of the variances of the elements of $\mathbf{X}$), where $\mathrm{E}$ denotes the expectation operator.

II Problem Formulation
Let $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ be a sample of $n$ independent, identically distributed (i.i.d.) $p$-dimensional vectors, and let $\varphi(\cdot)$ be a nonlinear mapping to a $q$-dimensional feature space with a covariance matrix $\boldsymbol{\Sigma}$ of size $q\times q$. When the number of observations is large (i.e., $n\gg q$), the most common estimator of $\boldsymbol{\Sigma}$ is the sample covariance matrix $\mathbf{S}$ of size $q\times q$, defined as

$$\mathbf{S}=\frac{1}{n-1}\,\boldsymbol{\Phi}\mathbf{H}\boldsymbol{\Phi}^{T} \qquad (1)$$

where $\boldsymbol{\Phi}$ is the $q\times n$ design matrix whose $i$-th column is given by $\varphi(\mathbf{x}_{i})$. The matrix (1) is an unbiased estimator of $\boldsymbol{\Sigma}$, i.e., $\mathrm{E}(\mathbf{S})=\boldsymbol{\Sigma}$.

The kernel function in the feature space is given by the relation

$$k(\mathbf{x}_{i},\mathbf{x}_{j})=\varphi(\mathbf{x}_{i})^{T}\varphi(\mathbf{x}_{j}) \qquad (2)$$

where the kernel matrix $\mathbf{K}$ of size $n\times n$ is defined by

$$\mathbf{K}=\mathbf{H}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\mathbf{H} \qquad (3)$$
Multiplication by the centering matrix $\mathbf{H}$ makes the data centered in the feature space [5, Ch. 14.2]. Since the use of kernels mostly involves a large number of features $q$ compared to the number of observations $n$, we are forced to deal with the problem of estimating high-dimensional covariance matrices under a small sample size. In this scenario, the sample covariance matrix (1) is neither well-conditioned nor necessarily invertible. When $q>n$, the inverse cannot be computed at all [22, Sec. 2.2]. Although significant progress in dealing with the small sample size problem has been made by various regularization methods [9, 23, 24], for example, by improving the accuracy of the pseudo-inverse of (1) [25, 26] or by regularizing (1) for specific tasks in discriminant analysis [27, 28, 29, 30, 31], these methods require an explicit expression of the feature space. In general, when using kernels, the feature space is only known implicitly.
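The small-sample rank deficiency described above can be verified numerically. The following sketch (ours, assuming a linear kernel and hypothetical dimensions) builds the centered kernel matrix and the sample covariance matrix and checks that the latter is singular when the number of observations is small:

```python
import numpy as np

# Illustrative sketch with a linear kernel: many features p, few observations n.
rng = np.random.default_rng(0)
p, n = 50, 10
X = rng.standard_normal((p, n))        # design matrix, one observation per column
H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
K = H @ X.T @ X @ H                    # centered (linear) kernel matrix, n x n
S = X @ H @ X.T / (n - 1)              # sample covariance matrix, p x p

# After centering, the sample covariance has rank at most n - 1,
# so with n <= p it is singular and cannot be inverted.
assert K.shape == (n, n)
assert np.linalg.matrix_rank(S) <= n - 1
```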
In the following section, we develop an analytical solution that regularizes the kernel matrix (3) by examining the relationship between $\boldsymbol{\Sigma}$, (1) and (3) through a shrinkage estimation for covariance matrices. It has been demonstrated in [32] that the largest sample eigenvalues of (1) are systematically biased upwards, and the smallest ones downwards. This bias can be corrected by pulling down the largest eigenvalues and pushing up the smallest ones, toward the grand mean of all sample eigenvalues. We tackle this problem through the use of a shrinkage estimator that offers a compromise between the sample covariance matrix and a well-conditioned matrix (also known as the "target"), with the aim of minimizing the mean-squared error (MSE). Since the spectral properties of (1) and (3) are tightly connected, any modification of the eigenvalues of (1) automatically results in a related change in the eigenvalues of (3) that reflects the optimal estimation of the covariance matrix in the feature space, in the sense of MSE.

III Shrinkage Estimator for Covariance Matrices
We briefly review the topic of the single-target shrinkage estimator for an unknown covariance matrix by following [32, 33]; it is generally applied to high-dimensional estimation problems. The shrinkage estimator is of the form

$$\hat{\boldsymbol{\Sigma}}(\lambda)=(1-\lambda)\,\mathbf{S}+\lambda\,\mathbf{T} \qquad (4)$$

where the target $\mathbf{T}$ is a restricted estimator of $\boldsymbol{\Sigma}$ defined as

$$\mathbf{T}=\frac{\operatorname{tr}(\mathbf{S})}{q}\,\mathbf{I} \qquad (5)$$
The objective is to find an estimator that minimizes the mean-squared error (MSE)

$$\mathrm{MSE}(\lambda)=\mathrm{E}\left\|\hat{\boldsymbol{\Sigma}}(\lambda)-\boldsymbol{\Sigma}\right\|_{F}^{2} \qquad (6)$$
The value of $\lambda$ that minimizes the MSE (6) is defined as

$$\lambda_{O}=\arg\min_{\lambda}\ \mathrm{E}\left\|\hat{\boldsymbol{\Sigma}}(\lambda)-\boldsymbol{\Sigma}\right\|_{F}^{2} \qquad (7)$$
and is given by the distribution-free formula

$$\lambda_{O}=\frac{\mathrm{E}\left\langle\mathbf{S}-\boldsymbol{\Sigma},\,\mathbf{S}-\mathbf{T}\right\rangle}{\mathrm{E}\left\|\mathbf{S}-\mathbf{T}\right\|_{F}^{2}} \qquad (8)$$
The scalar (8) is called the oracle shrinkage coefficient, since it depends on the unknown covariance matrix $\boldsymbol{\Sigma}$. Therefore, (8) must be estimated. As we will show next, the optimal shrinkage coefficient of the covariance matrix can be estimated directly from the kernel matrix (3), without explicitly dealing with the feature space, which is commonly unknown in practice.
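The shrinkage estimator (4) with a scaled-identity target can be sketched in a few lines. This is a minimal illustration of ours: the shrinkage coefficient below is a hypothetical fixed value, not the MSE-optimal coefficient the paper estimates from the kernel matrix.

```python
import numpy as np

# Single-target shrinkage: Sigma_hat = (1 - lam) * S + lam * T,
# with T = (tr(S)/p) * I, i.e., eigenvalues are pulled toward their mean.
def shrink_covariance(S, lam):
    p = S.shape[0]
    T = (np.trace(S) / p) * np.eye(p)   # well-conditioned target
    return (1.0 - lam) * S + lam * T

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))        # p=100 variables, n=5 observations
S = np.cov(X)                            # singular: rank <= n - 1 = 4
S_shrunk = shrink_covariance(S, lam=0.3)  # lam is a hypothetical fixed value

# Shrinkage makes the estimate positive definite (hence invertible)
# while preserving the total variance tr(S).
assert np.linalg.eigvalsh(S_shrunk).min() > 0
assert abs(np.trace(S_shrunk) - np.trace(S)) < 1e-8
```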
III-A Estimation of the Oracle Shrinkage Coefficient (8)
In this section, we show that the unbiased estimator of (9) can be written as a function of the kernel matrix (3). It has been shown in [34, Sec. 3.B] that the target (5) is a special case of a general target framework, which allows us to reformulate (8) as

$$\lambda_{O}=\frac{\operatorname{Var}(\mathbf{S})-\operatorname{Cov}(\mathbf{T},\mathbf{S})}{\mathrm{E}\left\|\mathbf{S}-\mathbf{T}\right\|_{F}^{2}} \qquad (9)$$
The oracle shrinkage coefficient (9) can be estimated from its sample counterparts as

$$\hat{\lambda}_{O}=\frac{\widehat{\operatorname{Var}}(\mathbf{S})-\widehat{\operatorname{Cov}}(\mathbf{T},\mathbf{S})}{\left\|\mathbf{S}-\mathbf{T}\right\|_{F}^{2}} \qquad (10)$$

where the symbol $\hat{\ }$ indicates an estimated value of the parameter. The estimated oracle shrinkage coefficient (10) is bounded to [0,1] in order to keep the shrinkage estimator positive definite, as required of a covariance matrix [34, 35]. The unbiased estimators of $\operatorname{Var}(\mathbf{S})$ and $\operatorname{Cov}(\mathbf{T},\mathbf{S})$ are derived in Appendix A and can be written as functions of the kernel matrix (3), i.e.,
(11) 
and
(12) 
respectively. The denominator of (10) can also be written as a function of (3), i.e.,
(13) 
Therefore, by using (11), (12) and (13), the estimated oracle shrinkage coefficient (10) can be written as a function of the kernel matrix (3); importantly, without dealing explicitly with the feature space. In the next section, we utilize the shrinkage coefficient (10) in order to regularize the kernel matrix (3).
IV Kernel Regularization Through Shrinkage Estimation
In this section we examine the relation between $\boldsymbol{\Sigma}$, (1) and (3) with respect to the shrinkage estimator (4). Let $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{q}$ denote the eigenvalues of the unknown covariance matrix $\boldsymbol{\Sigma}$ in decreasing order. It is well known that $\boldsymbol{\Sigma}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}$ [21, Ch. 6.17]. As a result, the squared bias of (5) with respect to $\boldsymbol{\Sigma}$ can be written as

$$\left\|\mathrm{E}(\mathbf{T})-\boldsymbol{\Sigma}\right\|_{F}^{2}=\sum_{i=1}^{q}\left(\lambda_{i}-\mu\right)^{2} \qquad (14)$$

where $\mu$ is the mean of the eigenvalues $\lambda_{1},\ldots,\lambda_{q}$, i.e.,

$$\mu=\frac{1}{q}\sum_{i=1}^{q}\lambda_{i} \qquad (15)$$

and the matrices $\mathbf{U}$ and $\boldsymbol{\Lambda}$ are the eigenvector and eigenvalue matrices of $\boldsymbol{\Sigma}$, respectively, such that $\boldsymbol{\Sigma}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}$.

The above result shows that the squared bias of (5), i.e., (14), is equal to the dispersion of the eigenvalues of $\boldsymbol{\Sigma}$ around their mean. Let $l_{1}\geq l_{2}\geq\cdots\geq l_{q}$ denote the eigenvalues of (1) in decreasing order. Using (14), the expected dispersion of the eigenvalues of (1) around their mean is equal to

$$\mathrm{E}\left\|\mathbf{S}-\mu\mathbf{I}\right\|_{F}^{2}=\sum_{i=1}^{q}\left(\lambda_{i}-\mu\right)^{2}+\mathrm{E}\left\|\mathbf{S}-\boldsymbol{\Sigma}\right\|_{F}^{2} \qquad (16)$$

indicating that the eigenvalues of (1) are more dispersed around their mean than the true ones, where the excess dispersion is equal to $\mathrm{E}\|\mathbf{S}-\boldsymbol{\Sigma}\|_{F}^{2}$. The excess dispersion implies that the largest eigenvalues of (1) are biased upwards and the smallest downwards. Therefore, we can improve upon the sample covariance matrix by shrinking its eigenvalues toward their mean (15). This is done in practice via the shrinkage estimator (4), where the optimal shrinkage coefficient in the sense of MSE is given by (10).
The above results relate to the kernel matrix (3) as follows. Let $l_{1}(\mathbf{K})\geq\cdots\geq l_{n}(\mathbf{K})$ denote the eigenvalues of (3) in decreasing order. In the small sample size scenario ($n\ll q$), it is straightforward to show that the eigenvalues of (3) and (1) are related by

$$l_{i}(\mathbf{K})=(n-1)\,l_{i}(\mathbf{S}),\qquad i=1,\ldots,n \qquad (17)$$

The other eigenvalues of (1) are all zero, and their eigenvectors are indefinite.
The procedure of regularizing the kernel matrix (3) by

$$\tilde{\mathbf{K}}=(1-\hat{\lambda}_{O})\,\mathbf{K}+\hat{\lambda}_{O}\,\frac{\operatorname{tr}(\mathbf{K})}{n}\,\mathbf{I} \qquad (18)$$

is therefore equivalent to the correction of the first $n$ eigenvalues of the sample covariance matrix (1) with respect to their mean in the feature space, which is optimal in the sense of MSE. We examine the regularized kernel matrix (18) in the next section.
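The spectral relation between the centered kernel matrix and the sample covariance matrix can be checked numerically. The following is an illustrative sketch of ours, using a linear kernel and hypothetical dimensions: each nonzero eigenvalue of the kernel matrix equals (n - 1) times the corresponding eigenvalue of the sample covariance matrix.

```python
import numpy as np

# Numerical check of the eigenvalue relation between K and S
# in the small-sample setting n << p (linear kernel for concreteness).
rng = np.random.default_rng(2)
p, n = 200, 8
X = rng.standard_normal((p, n))          # one observation per column
H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
K = H @ X.T @ X @ H                      # centered kernel matrix, n x n
S = X @ H @ X.T / (n - 1)                # sample covariance, p x p

# Compare the top n-1 eigenvalues (the remaining ones are zero).
ev_K = np.sort(np.linalg.eigvalsh(K))[::-1][: n - 1]
ev_S = np.sort(np.linalg.eigvalsh(S))[::-1][: n - 1]
assert np.allclose(ev_K, (n - 1) * ev_S)
```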
V Experiments
To provide some insight into how the regularized kernel matrix (18) behaves, we consider a two-class classification problem. The first-class observations are generated from two Gaussians in a two-dimensional space with a standard deviation of 0.1, centered at two distinct points. The second-class observations are generated from a two-dimensional Gaussian with a standard deviation of 0.1, centered at a third point. The problem is not linearly separable, giving rise to the use of kernels. In order to distinguish between the two classes, kernel Fisher discriminant analysis (KFDA) [4] is used, utilizing the radial basis function (RBF) kernel

$$k(\mathbf{x}_{i},\mathbf{x}_{j})=\exp\!\left(-\frac{\left\|\mathbf{x}_{i}-\mathbf{x}_{j}\right\|^{2}}{2\sigma^{2}}\right) \qquad (19)$$
with a fixed kernel width $\sigma$. We regularize the within-class scatter matrix of the KFDA using (18), versus the fixed regularization proposed in [4]. These two scenarios are referred to as "shrinkage KFDA" and "KFDA", respectively. We run the experiments with the number of training observations per Gaussian, denoted by $m$, varying from 3 to 30. Each simulation is repeated 100 times, and the average misclassification rates as a function of $m$ are depicted in Fig. 1.
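The data generation and kernel computation can be sketched as follows. This is our outline of the setup: the class centers, the kernel width, and the sample size below are hypothetical stand-ins, since the exact values from the text are not reproduced here.

```python
import numpy as np

# RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
def rbf_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
m = 5  # training observations per Gaussian (hypothetical)
# Class 1: mixture of two Gaussians (std 0.1) at two hypothetical centers;
# class 2: one Gaussian (std 0.1) between them.
c1 = np.vstack([rng.normal([-1.0, 0.0], 0.1, (m, 2)),
                rng.normal([1.0, 0.0], 0.1, (m, 2))])
c2 = rng.normal([0.0, 0.0], 0.1, (m, 2))
X = np.vstack([c1, c2])
K = rbf_kernel(X, X)

assert K.shape == (3 * m, 3 * m)
assert np.allclose(np.diag(K), 1.0)   # k(x, x) = 1 for the RBF kernel
```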
As can be seen from Fig. 1, the shrinkage KFDA outperforms the KFDA for every value of $m$. We used the paired t-test to evaluate the null hypothesis that the difference between the two misclassification rates over the 100 simulations comes from a population with mean equal to zero. The t-test rejects the null hypothesis at the 99% significance level, meaning that the shrinkage KFDA consistently outperformed the KFDA.
The average value of (10) as a function of $m$ (training observations per Gaussian) is plotted in Fig. 2. When the number of observations is small, the shrinkage coefficient is relatively high, giving more weight to the well-structured target (5). As the number of observations increases, the sample covariance matrix (1) is preferred, primarily because it provides a more accurate description of the true covariance matrix. Consequently, the shrinkage coefficient (10) decreases as the number of training observations increases.
In order to visually examine the impact of regularization, we present the KFDA decision boundaries when the number of observations for each Gaussian is 5; that is, the first class contains 10 training observations while the second class contains 5. Fig. 3 depicts the classification performance resulting from training the KFDA and the shrinkage KFDA on this data set using the kernel (19). Fig. 3(a) shows the output values produced by the KFDA for inputs on a two-dimensional grid. The outputs produce contour lines of constant value that vary monotonically along the underlying nonlinear structure of the data. Fig. 3(b) illustrates the decision boundaries found by the KFDA. It can be seen clearly that although the data set is not linearly separable in the two-dimensional data space, it is linearly separable in the nonlinear feature space defined implicitly by the nonlinear kernel (19); hence, the training data points are perfectly separated in the original data space. However, Fig. 3(b) also shows that the decision boundaries stretch far beyond the area surrounding the second class. This is the result of the poor regularization term used by the KFDA, leading to overfitting with respect to the training observations and poor generalization performance. In comparison, Fig. 3(c) shows the output values produced by the shrinkage KFDA for inputs on the same two-dimensional grid. Again, the outputs produce contour lines of constant value that vary monotonically along the underlying nonlinear structure of the data, but the contour lines in Fig. 3(c) follow the data structure more moderately than those in Fig. 3(a). Fig. 3(d) presents the decision boundaries found by the shrinkage KFDA. Again, the data set is linearly separable in the nonlinear feature space, but the decision boundaries in Fig. 3(d) do not overfit the training data as in Fig. 3(b).
The same experiment is repeated with the number of observations for each Gaussian equal to 20, i.e., the first class has 40 training observations while the second class has 20. The results are provided in Fig. 4. Although the training data points are perfectly separated in the original data space both by the KFDA and by the shrinkage KFDA, the KFDA overfits the training data. As a consequence, the decision boundaries produced by the KFDA create three separate regions assigned to class 2. This is again the result of a poor regularization term. The shrinkage KFDA contour lines in Fig. 4(c) follow the data more moderately than those in Fig. 4(a). As a result, the decision boundaries found by the shrinkage KFDA, shown in Fig. 4(d), clearly separate the two classes without overfitting the training data as in Fig. 4(b).
VI Conclusions
In this paper we pointed out the connections between regularization of the kernel matrix (3) and shrinkage estimation of the covariance matrix in the feature space. Since the kernel matrix (3) and the sample covariance matrix (1) involve inner and outer products in the feature space, respectively, their spectral properties are tightly connected. This allows us to examine the stability of the kernel matrix through the sample covariance matrix in the feature space, and vice versa. More specifically, the use of kernels often involves a large number of features compared to the number of observations. In this scenario, the sample covariance matrix is neither well-conditioned nor necessarily invertible, forcing us to deal with the problem of estimating high-dimensional covariance matrices under a small sample size. The use of a shrinkage estimator allows us to effectively address this problem by providing a compromise between the sample covariance matrix (1) and the target (5), with the aim of minimizing the mean-squared error (MSE). Since the spectral properties of (1) and (3) are tightly connected, any modification of the eigenvalues of (1) automatically results in a related change in the eigenvalues of (3) that reflects the optimal correction of the covariance matrix in the feature space, in the sense of MSE. The result is an analytical, distribution-free kernel matrix regularization approach that is tuned directly from the kernel matrix and releases us from dealing with the feature space explicitly. Numerical simulations demonstrate that the proposed regularization is significantly effective in classification tasks.
Appendix A Derivation of (11) and (12)
The unbiased estimator (11) is derived as follows. Let the matrix be defined as
(20) 
Then the estimator can be written as
(21) 
where the last term equals zero (as can be seen after substituting the sum into the inner product), which simplifies (21) to
(22) 
and finally
(23) 
References
 [1] B. Schölkopf, A. Smola, and K.R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural computation, vol. 10, no. 5, pp. 1299–1319, 1998.
 [2] T. Lancewicki and M. Aladjem, “Locally multidimensional scaling by creating neighborhoods in diffusion maps,” Neurocomputing, vol. 139, no. 0, pp. 382 – 396, 2014.

 [3] L. Zhang, X. Zhen, and L. Shao, “Learning object-to-class kernels for scene classification,” Image Processing, IEEE Transactions on, vol. 23, no. 8, pp. 3241–3253, Aug 2014.
 [4] B. Schölkopf and K.-R. Müller, “Fisher discriminant analysis with kernels,” Neural Networks for Signal Processing IX, vol. 1, p. 1, 1999.
 [5] B. Schölkopf and A. J. Smola, Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2002.
 [6] A. N. Tikhonov, A. Goncharsky, V. Stepanov, and A. G. Yagola, Numerical methods for the solution of illposed problems. Springer Science & Business Media, 2013, vol. 328.

 [7] W. James and C. Stein, “Estimation with quadratic loss,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1961, pp. 361–379.
 [8] X. Lu, M. Unoki, S. Matsuda, C. Hori, and H. Kashioka, “Controlling tradeoff between approximation accuracy and complexity of a smooth function in a reproducing kernel Hilbert space for noise reduction,” Signal Processing, IEEE Transactions on, vol. 61, no. 3, pp. 601–610, Feb 2013.
 [9] P. J. Bickel and E. Levina, “Regularized estimation of large covariance matrices,” The Annals of Statistics, vol. 36, no. 1, pp. 199–227, 2008.
 [10] A. Rohde and A. B. Tsybakov, “Estimation of highdimensional lowrank matrices,” Ann. Statist., vol. 39, no. 2, pp. 887–930, 04 2011.

 [11] T. Lancewicki and I. Arel, “Covariance matrix estimation for reinforcement learning,” in The 2nd Multidisciplinary Conference on Reinforcement Learning and Decision Making, 2015, pp. 14–18.
 [12] S. Sahyoun, S. Djouadi, and T. Lancewicki, “Nonlinear optimal control for reduced order quadratic nonlinear systems,” in 22nd International Symposium on Mathematical Theory of Networks and Systems, 2016, pp. 334–339.
 [13] T. Lancewicki, B. Goodrich, and I. Arel, “Sequential covariancematrix estimation with application to mitigating catastrophic forgetting,” in Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on. IEEE, 2015, pp. 628–633.
 [14] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu, “Highdimensional covariance estimation by minimizing l1penalized logdeterminant divergence,” Electron. J. Statist., vol. 5, pp. 935–980, 2011.

 [15] E. Yang, A. Lozano, and P. Ravikumar, “Elementary estimators for sparse covariance matrices and other structured moments,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 397–405.
 [16] T. T. Cai and H. H. Zhou, “Minimax estimation of large covariance matrices under l1 norm,” Statist. Sinica, vol. 22, no. 4, pp. 1319–1349, 2012.
 [17] B. Rajaratnam, H. Massam, and C. M. Carvalho, “Flexible covariance estimation in graphical gaussian models,” Ann. Statist., vol. 36, no. 6, pp. 2818–2849, 12 2008.
 [18] K. Khare and B. Rajaratnam, “Wishart distributions for decomposable covariance graph models,” Ann. Statist., vol. 39, no. 1, pp. 514–555, 02 2011.
 [19] J. Fan, Y. Fan, and J. Lv, “High dimensional covariance matrix estimation using a factor model,” Journal of Econometrics, vol. 147, no. 1, pp. 186 – 197, 2008, econometric modelling in finance and risk management: An overview.
 [20] O. Ledoit and M. Wolf, “Nonlinear shrinkage estimation of largedimensional covariance matrices,” Institute for Empirical Research in Economics University of Zurich Working Paper, no. 515, 2011.
 [21] G. A. F. Seber, A Matrix Handbook for Statisticians. John Wiley and Sons, Inc., 2007.
 [22] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral control approach to reinforcement learning,” J. Mach. Learn. Res., vol. 11, pp. 3137–3181, Dec. 2010.
 [23] P. Bickel, B. Li, A. Tsybakov, S. Geer, B. Yu, T. Valdes, C. Rivero, J. Fan, and A. Vaart, “Regularization in statistics,” Test, vol. 15, no. 2, pp. 271–344, 2006.
 [24] G. Cao, L. Bachega, and C. Bouman, “The sparse matrix transform for covariance estimation and analysis of high dimensional signals,” IEEE Transactions on Image Processing, vol. 20, no. 3, pp. 625–640, 2011.
 [25] D. Hoyle, “Accuracy of pseudo-inverse covariance learning: a random matrix theory analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 7, pp. 1470–1481, 2011.
 [26] S. Raudys and R. P. Duin, “Expected classification error of the fisher linear classifier with pseudo-inverse covariance matrix,” Pattern Recognition Letters, vol. 19, no. 5-6, pp. 385–392, 1998.

 [27] L.-F. Chen, H.-Y. M. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “A new LDA-based face recognition system which can solve the small sample size problem,” Pattern Recognition, vol. 33, no. 10, pp. 1713–1726, 2000.
 [28] J. Wang, K. Plataniotis, J. Lu, and A. Venetsanopoulos, “Kernel quadratic discriminant analysis for small sample size problem,” Pattern Recognition, vol. 41, no. 5, pp. 1528–1538, 2008.
 [29] J. Ye, R. Janardan, C. H. Park, and H. Park, “An optimization criterion for generalized discriminant analysis on undersampled problems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp. 982–994, 2004.

 [30] P. Howland and H. Park, “Generalizing discriminant analysis using the generalized singular value decomposition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp. 995–1006, 2004.
 [31] W. J. Krzanowski, P. Jonathan, W. V. McCarthy, and M. R. Thomas, “Discriminant analysis with singular covariance matrices: Methods and applications to spectroscopic data,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 44, no. 1, pp. 101–115, 1995.

 [32] O. Ledoit and M. Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004.
 [33] J. Schäfer and K. Strimmer, “A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics,” Statist. Appl. Genet. Molec. Biol., vol. 4, no. 1, 2005.
 [34] T. Lancewicki and M. Aladjem, “Multitarget shrinkage estimation for covariance matrices,” IEEE Transactions on Signal Processing, vol. 62, no. 24, pp. 6380–6390, Dec 2014.
 [35] T. Lancewicki, “Dimensionality reduction and shrinkage estimation in multivariate data,” Ph.D. dissertation, Ben-Gurion University of the Negev, Israel, 2014.