## 1 Introduction

Kernel-based methods (e.g., the support vector machine, kernel ridge regression, multiple kernel learning, and kernel dimension reduction in regression) have proved to be powerful techniques for analyzing complex data and have been actively studied over the last two decades because of their flexibility (SVM92, ; Saunders98, ; Charpiat-15, ; Back-08, ; Steinwart-08, ; Hofmann-08, ). Examples of unsupervised kernel methods include kernel principal component analysis (kernel PCA), kernel canonical correlation analysis (standard kernel CCA), and weighted multiple kernel CCA (Schlkof-kpca, ; Akaho, ; Back-02, ; Ashad-14, ; Yu-11, ). These methods have been studied extensively for decades. However, none of them is robust: all are sensitive to contaminated data. This paper introduces the robust kernel covariance operator (kernel CO) and kernel cross-covariance operator (kernel CCO) for unsupervised kernel methods such as kernel CCA.

Although many researchers have studied robustness in the supervised learning setting (e.g., the support vector machine for classification and regression (Christmann-04, ; Christmann-07, ; Debruyne-08, )), there are few well-founded robust methods for kernel unsupervised learning. Robustness is an important and challenging issue in using statistical machine learning for multiple source data analysis, because outliers often occur in real data and can wreak havoc on statistical machine learning methods. Since the 1960s, many robust methods, which are less sensitive to outliers, have been developed to overcome this problem. The objective of robust statistics is to use methods that fit the bulk of the data and to detect deviations from the original patterns (Huber-09, ; Hampel-11, ).

Recently, in the field of kernel methods, a robust kernel density estimator (robust kernel DE) based on robust kernel mean elements (robust kernel ME) has been proposed by Kim-12, which is less sensitive to outliers than the kernel density estimator. Robust kernel DE is computed using a kernelized iteratively re-weighted least squares (KIRWLS) algorithm in a reproducing kernel Hilbert space (RKHS). In addition, two spatial robust kernel PCA methods have been proposed based on the weighted eigenvalue decomposition (Huang-KPCA, ) and spherical kernel PCA (Debruyne-10, ), showing that the influence function (IF) of kernel PCA, a well-known measure of robustness, can be arbitrarily large for unbounded kernels.

Kernel methods explicitly or implicitly depend on the kernel CO or the kernel CCO. These operators are among the most useful tools in unsupervised kernel methods but have not yet been robustified. This paper shows that they can be formulated as an empirical optimization problem, and that robustness can be achieved by combining this formulation with the M-estimation ideas of Huber and Hampel (Huber-09, ; Hampel-11, ). The robust kernel CO and robust kernel CCO can be computed efficiently via a KIRWLS algorithm.

In the past decade, CCA with a positive definite kernel has been proposed and is called standard kernel CCA. Several of its variants have also been proposed (Fukumizu-SCKCCA, ; Hardoon2009, ; Otopal-12, ; Ashad-15, ). Because they rely on a simple eigendecomposition, these methods are still widely used for multiple source data analysis. An empirical comparison and sensitivity analysis of robust linear CCA and standard kernel CCA has also been reported, which gives an interpretation similar to that of kernel PCA but without any robustness measure (e.g., the IF of standard kernel CCA) (Ashad-10, ). In addition, the author of Romanazii-92 has proposed the IF of the canonical correlation and canonical vectors of linear CCA. While the IF of an estimator can characterize its robustness, asymptotic properties and standard error, the IF of standard kernel CCA has not yet been proposed, and a robust kernel CCA has not yet been studied. These considerations motivate the study of the IF of kernel CCA and of robust kernel CCA in unsupervised learning.

The contribution of this paper is fourfold. First, we propose a robust kernel CO and robust kernel CCO based on a generalized loss function instead of the quadratic loss function. Second, we propose the IF of standard kernel CCA: kernel canonical correlation (kernel CC) and kernel canonical variates (kernel CV). Third, we propose a method for detecting influential observations in multiple sets of data, based on a visualization of the IF of kernel CCA. Finally, we propose a method based on the robust kernel CO and robust kernel CCO, called robust kernel CCA, which is less sensitive to contamination than standard kernel CCA. Experiments on both synthesized data and imaging genetics analysis demonstrate that the proposed visualization and robust kernel CCA can be applied effectively to ideal and contaminated data.

The remainder of this paper is organized as follows. In the following section, we provide a brief review of positive definite kernels, the kernel ME, the robust kernel ME and the kernel CCO. In Section 3 we present the definition, representer theorem, KIRWLS convergence, and an algorithm for the robust kernel CCO. In Section 4, we discuss the basic notion of the IF, and the IF of the kernel ME, kernel CO, kernel CCO and robust kernel CCO. After a brief review of standard kernel CCA in Section 5.1, we propose the IF of standard kernel CCA (kernel CC and kernel CV) and the robust kernel CCA in Section 5.2 and Section 5.3, respectively. In Section 6, we describe experiments conducted on both synthesized data and real imaging genetics analysis. In Section 7, concluding remarks and future research directions are presented. In the appendix, we discuss the detailed results.

## 2 Standard and robust kernel (cross-) covariance operator

The kernel ME, kernel CO, and kernel CCO with positive definite kernels have been extensively applied to nonparametric statistical inference by representing distributions in the form of means and covariances in the RKHS (Gretton-08, ; Fukumizu-08, ; Song-08, ; Kim-12, ; Gretton-12, ). To define the kernel ME, robust kernel ME, kernel CO and kernel CCO, we need the basic notions of positive definite kernels and reproducing kernel Hilbert spaces (RKHSs), which are briefly addressed in the following (Aron-RKHS, ; Berlinet-04, ; Ashad-14T, ).

### 2.1 Basic notion of kernel methods

Let $F_X$, $F_Y$ and $F_{XY}$ be probability measures on the given nonempty sets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{X}\times\mathcal{Y}$, respectively, such that $F_X$ and $F_Y$ are the marginals of $F_{XY}$. Also let $X_1,\ldots,X_n$; $Y_1,\ldots,Y_n$ and $(X_1,Y_1),\ldots,(X_n,Y_n)$ be independent and identically distributed (IID) samples from the distributions $F_X$, $F_Y$ and $F_{XY}$, respectively. A symmetric kernel, $k_X(\cdot,\cdot)$, defined on a space $\mathcal{X}$ is called a positive definite kernel if the Gram matrix $(k_X(X_i,X_j))_{ij}$ is positive semi-definite for every $n$ and all points $X_1,\ldots,X_n\in\mathcal{X}$. An RKHS is a Hilbert space with a reproducing kernel whose span is dense in the Hilbert space. We can equivalently define an RKHS as a Hilbert space of functions on which all evaluation functionals are bounded and linear. The Moore-Aronszajn theorem states that every symmetric, positive definite kernel defines a unique reproducing kernel Hilbert space (Aron-RKHS, ). The feature map is a mapping $\Phi:\mathcal{X}\to\mathcal{H}_X$ defined as $\Phi(x)=k_X(\cdot,x)$. The vector $\Phi(x)$ is called a feature vector. The inner product of two feature vectors is $\langle\Phi(x),\Phi(x')\rangle_{\mathcal{H}_X}=k_X(x,x')$ for all $x,x'\in\mathcal{X}$. This is called the kernel trick. By the reproducing property, $f(x)=\langle f,k_X(\cdot,x)\rangle_{\mathcal{H}_X}$ with $f\in\mathcal{H}_X$, and the kernel trick, the kernel can evaluate the inner product of any two feature vectors efficiently, without knowing an explicit form of either the feature map or the feature vector. Another great advantage is that the computational cost does not depend on the dimension of the original space after computing the Gram matrices (Fukumizu-14, ; Ashad-14, ).

### 2.2 Standard kernel mean element

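Let $k_X$ be a measurable positive definite kernel on $\mathcal{X}$ with $E_X[\sqrt{k_X(X,X)}]<\infty$. The kernel mean, $\mathcal{M}_X$, of $X$ on $\mathcal{H}_X$ is an element of $\mathcal{H}_X$ and is defined by the mean of the $\mathcal{H}_X$-valued random variable $k_X(\cdot,X)$,

$$\mathcal{M}_X(\cdot)=E_X[k_X(\cdot,X)].$$

The kernel mean always exists for an arbitrary probability measure under the assumption that the positive definite kernel is bounded and measurable. By the reproducing property, the kernel ME satisfies the following equality

$$\langle\mathcal{M}_X,f\rangle_{\mathcal{H}_X}=E_X[f(X)]$$

for all $f\in\mathcal{H}_X$.

The empirical kernel ME, $\hat{\mathcal{M}}_X=\frac{1}{n}\sum_{i=1}^{n}k_X(\cdot,X_i)$, is an element of the RKHS. The empirical kernel ME of the feature vectors $k_X(\cdot,X_i)$ can be regarded as a solution to the empirical risk optimization problem (Kim-12, )

$$\hat{\mathcal{M}}_X=\arg\min_{f\in\mathcal{H}_X}\frac{1}{n}\sum_{i=1}^{n}\|k_X(\cdot,X_i)-f\|^2_{\mathcal{H}_X}. \tag{1}$$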
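As a minimal sketch of Sections 2.1 and 2.2, the following NumPy snippet builds a Gram matrix, checks that it is positive semi-definite, and compares the empirical risk of Eq. (1) at the empirical kernel ME (uniform weights) with the risk at another weighted combination of feature vectors. The Gaussian kernel, the bandwidth `sigma`, the synthetic data and the helper names `gaussian_gram` and `risk` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) between rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def risk(K, w):
    """(1/n) sum_i ||k(., X_i) - f||^2 for f = sum_j w_j k(., X_j), using only the Gram matrix."""
    return np.mean(np.diag(K) - 2.0 * K @ w + w @ K @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # hypothetical sample
n = len(X)
K = gaussian_gram(X, X)

# Positive definiteness of the kernel: the Gram matrix has no negative eigenvalues
# (up to floating-point round-off).
min_eig = np.linalg.eigvalsh(K).min()

# Uniform weights give the empirical kernel ME; any other weighting has larger risk.
w_mean = np.full(n, 1.0 / n)
w_other = rng.dirichlet(np.ones(n))
risk_mean, risk_other = risk(K, w_mean), risk(K, w_other)

# The empirical kernel ME is a function; evaluate it at a few new points.
X_new = rng.normal(size=(3, 2))
m_hat = gaussian_gram(X_new, X) @ w_mean
```

Everything is computed from kernel evaluations alone, which is the practical content of the kernel trick: the feature vectors themselves are never formed.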

### 2.3 Robust kernel mean element

As explained in Section 2.2, the kernel ME is the solution to an empirical risk optimization problem and is therefore a least-squares type of estimator. This type of estimator is sensitive to the presence of outliers in the features, $\Phi(X_i)=k_X(\cdot,X_i)$. To reduce the effect of outliers, we can use $M$-estimation. In recent years, the robust kernel ME has been proposed for density estimation (Kim-12, ). The robust kernel ME, based on a robust loss function $\zeta(t)$ on $t\geq 0$, is defined as

$$\hat{\mathcal{M}}_R=\arg\min_{f\in\mathcal{H}_X}\frac{1}{n}\sum_{i=1}^{n}\zeta\left(\|\Phi(X_i)-f\|_{\mathcal{H}_X}\right). \tag{2}$$

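Examples of robust loss functions include Huber's loss function, Hampel's loss function, and Tukey's biweight loss function. Unlike the quadratic loss function, the derivatives of these loss functions are bounded (Huber-09, ; Hampel-86, ; Tukey-77, ). Huber's loss function, a hybrid between the squared and absolute error losses, is defined as

$$\zeta(t)=\begin{cases}t^2/2, & 0\leq t\leq c,\\ ct-c^2/2, & c<t,\end{cases}$$

where $c$ ($c>0$) is a tuning parameter. Hampel's loss function is defined as

$$\zeta(t)=\begin{cases}t^2/2, & 0\leq t<c_1,\\ c_1t-c_1^2/2, & c_1\leq t<c_2,\\ \dfrac{c_1(t-c_3)^2}{2(c_2-c_3)}+\dfrac{c_1(c_2+c_3-c_1)}{2}, & c_2\leq t<c_3,\\ \dfrac{c_1(c_2+c_3-c_1)}{2}, & c_3\leq t,\end{cases}$$

where the non-negative free parameters $c_1<c_2<c_3$ allow us to control the degree of suppression of large errors. Tukey's biweight loss function is defined as

$$\zeta(t)=\begin{cases}\dfrac{c^2}{6}\left[1-\left(1-(t/c)^2\right)^3\right], & 0\leq t\leq c,\\ c^2/6, & c<t,\end{cases}$$

where $c>0$.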
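The three loss functions can be sketched in NumPy as follows, in their standard textbook forms; the parameter defaults (`c`, `c1`, `c2`, `c3`) and the helper `num_derivative` are illustrative assumptions. The snippet also checks numerically that the slopes of the robust losses stay bounded while the slope of the quadratic loss grows with the residual.

```python
import numpy as np

def huber(t, c=1.0):
    """Huber loss: quadratic up to c, linear beyond."""
    t = np.abs(t)
    return np.where(t <= c, 0.5 * t**2, c * t - 0.5 * c**2)

def hampel(t, c1=1.0, c2=2.0, c3=3.0):
    """Hampel three-part loss: quadratic, linear, descending, then constant."""
    t = np.abs(t)
    top = 0.5 * c1 * (c2 + c3 - c1)
    return np.select(
        [t < c1, t < c2, t < c3],
        [0.5 * t**2,
         c1 * t - 0.5 * c1**2,
         c1 * (t - c3)**2 / (2.0 * (c2 - c3)) + top],
        default=top,
    )

def tukey(t, c=3.0):
    """Tukey biweight loss: smooth and constant beyond c."""
    t = np.abs(t)
    return np.where(t <= c, (c**2 / 6.0) * (1.0 - (1.0 - (t / c)**2)**3), c**2 / 6.0)

def num_derivative(loss, t, h=1e-6):
    """Central finite-difference approximation of the derivative of a loss."""
    return (loss(t + h) - loss(t - h)) / (2.0 * h)

t = np.linspace(0.01, 50.0, 5000)
max_slopes = {
    name: float(np.max(np.abs(num_derivative(f, t))))
    for name, f in [("huber", huber), ("hampel", hampel), ("tukey", tukey)]
}
# The quadratic loss t^2/2 has slope t, which is unbounded in the residual.
quad_slope = float(np.max(num_derivative(lambda s: 0.5 * s**2, t)))
```

With these defaults the Huber and Hampel slopes are capped at `c` and `c1` respectively, and the Tukey slope returns to zero beyond `c`, which is what limits the pull of a single outlying feature vector.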

The basic assumptions on the loss function are: (i) $\zeta$ is non-decreasing, $\zeta(0)=0$ and $\zeta(t)/t\to 0$ as $t\to 0$; (ii) $\varphi(t)=\zeta'(t)/t$ exists and is finite, where $\zeta'(t)$ is the derivative of $\zeta(t)$; (iii) $\zeta'(t)$ and $\varphi(t)$ are continuous and bounded; and (iv) $\varphi(t)$ is Lipschitz continuous. All of these assumptions hold for Huber's loss function as well as others (Kim-12, ). Figure 1 presents the family of loss functions together with their derivative-based functions (including the second derivative).

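Essentially, Eq. (2) does not have a closed-form solution, but using KIRWLS, the solution of the robust kernel mean at the $h$-th iteration is

$$\hat{\mathcal{M}}_R^{(h)}=\sum_{i=1}^{n}w_i^{(h-1)}k_X(\cdot,X_i),$$

where

$$w_i^{(h)}=\frac{\varphi\left(\|\Phi(X_i)-\hat{\mathcal{M}}_R^{(h)}\|_{\mathcal{H}_X}\right)}{\sum_{b=1}^{n}\varphi\left(\|\Phi(X_b)-\hat{\mathcal{M}}_R^{(h)}\|_{\mathcal{H}_X}\right)},\qquad \varphi(t)=\frac{\zeta'(t)}{t}.$$

Given the weights of the robust kernel ME, $w=[w_1,\ldots,w_n]^T$, of a set of observations $X_1,\ldots,X_n$, the points $\Phi(X_i)-\sum_{a=1}^{n}w_a\Phi(X_a)$ are centered and the centered robust Gram matrix is $\tilde{K}=HKH^T$, where $K$ is the Gram matrix, $H=I_n-\mathbf{1}_nw^T$ and $\mathbf{1}_n=[1,\ldots,1]^T$. For a set of test points $X_1^t,\ldots,X_T^t$, we define two matrices of order $T\times n$ as $K^{test}_{ij}=\langle\Phi(X_i^t),\Phi(X_j)\rangle$ and $\tilde{K}^{test}_{ij}=\langle\Phi(X_i^t)-\sum_{b}w_b\Phi(X_b),\Phi(X_j)-\sum_{d}w_d\Phi(X_d)\rangle$. Like the centered Gram matrix, the centered robust Gram matrix of test points, $\tilde{K}^{test}$, in terms of the Gram matrix $K$ and $K^{test}$ is defined as

$$\tilde{K}^{test}=K^{test}-\mathbf{1}_Tw^TK-K^{test}w\mathbf{1}_n^T+\mathbf{1}_Tw^TKw\mathbf{1}_n^T.$$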

### 2.4 Standard kernel (cross-) covariance operator

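In this section we study the covariance of two random feature vectors $k_X(\cdot,X)$ and $k_Y(\cdot,Y)$. As for standard random vectors, the notion of kernel covariance is useful as a basis for describing the statistical dependence among two or more variables.

Let $(\mathcal{X},\mathcal{B}_X)$ and $(\mathcal{Y},\mathcal{B}_Y)$ be two measurable spaces and $(X,Y)$ be a random variable on $\mathcal{X}\times\mathcal{Y}$ with distribution $F_{XY}$. Let $k_X$ and $k_Y$ be two measurable positive definite kernels with respective RKHSs $\mathcal{H}_X$ and $\mathcal{H}_Y$. The kernel CCO (centered) is a linear operator $\Sigma_{XY}:\mathcal{H}_Y\to\mathcal{H}_X$ defined as

$$\Sigma_{XY}=E_{XY}\left[\tilde{k}_X(\cdot,X)\otimes\tilde{k}_Y(\cdot,Y)\right],$$

where $\tilde{k}_X(\cdot,X)=k_X(\cdot,X)-\mathcal{M}_X$, $\tilde{k}_Y(\cdot,Y)=k_Y(\cdot,Y)-\mathcal{M}_Y$, and $\otimes$ is a tensor product operator ($(a_1\otimes a_2)x=\langle x,a_2\rangle_{H_2}a_1$ for $a_1\in H_1$ and $a_2,x\in H_2$, where $H_1$ and $H_2$ are Hilbert spaces) (Reed-80, ).

By the reproducing property, the kernel CCO, with $E_X[k_X(X,X)]<\infty$ and $E_Y[k_Y(Y,Y)]<\infty$, satisfies

$$\langle f_X,\Sigma_{XY}f_Y\rangle_{\mathcal{H}_X}=E_{XY}\left[\left(f_X(X)-E_X[f_X(X)]\right)\left(f_Y(Y)-E_Y[f_Y(Y)]\right)\right]$$

for all $f_X\in\mathcal{H}_X$ and $f_Y\in\mathcal{H}_Y$. This is a bounded operator. As in Eq. (1), we can define the kernel CCO through an empirical risk optimization problem as follows,

$$\hat{\Sigma}_{XY}=\arg\min_{\Sigma\in\mathcal{H}_X\otimes\mathcal{H}_Y}\frac{1}{n}\sum_{i=1}^{n}\left\|\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)-\Sigma\right\|^2_{\mathcal{H}_X\otimes\mathcal{H}_Y}. \tag{3}$$

The empirical kernel CCO is then

$$\hat{\Sigma}_{XY}=\frac{1}{n}\sum_{i=1}^{n}\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i), \tag{4}$$

where $\tilde{k}_X(\cdot,X_i)=k_X(\cdot,X_i)-\frac{1}{n}\sum_{b=1}^{n}k_X(\cdot,X_b)$ and $\tilde{k}_Y(\cdot,Y_i)=k_Y(\cdot,Y_i)-\frac{1}{n}\sum_{b=1}^{n}k_Y(\cdot,Y_b)$ are the (empirically) centered kernels. In the special case where $Y$ is equal to $X$, this gives the kernel CO.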

## 3 Robust kernel (cross-) covariance operator

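As with the robust kernel ME (see Section 2.3), to reduce the effect of outliers, we propose to use $M$-estimation to find a robust sample covariance of $k_X(\cdot,X)$ and $k_Y(\cdot,Y)$. To do this, we estimate the kernel CO and kernel CCO based on robust loss functions, which gives the robust kernel CO and robust kernel CCO, respectively. With a robust loss function $\zeta$ in place of the quadratic loss, Eq. (3) can be written as

$$\hat{\Sigma}_{RXY}=\arg\min_{\Sigma\in\mathcal{H}_X\otimes\mathcal{H}_Y}\frac{1}{n}\sum_{i=1}^{n}\zeta\left(\left\|\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)-\Sigma\right\|_{\mathcal{H}_X\otimes\mathcal{H}_Y}\right), \tag{5}$$

where $\tilde{k}_X(\cdot,X_i)$ and $\tilde{k}_Y(\cdot,Y_i)$ denote the centered feature vectors.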

### 3.1 Representation of robust kernel (cross-) covariance operator

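In this section, we represent the robust kernel CCO, $\hat{\Sigma}_{RXY}$, as a weighted combination of the product of two kernels, $\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)$, where the tilde denotes centered feature vectors. We will also address necessary and sufficient conditions for the robust kernel CCO. Eq. (5) can be reformulated as $\hat{\Sigma}_{RXY}=\arg\min_{\Sigma\in\mathcal{H}_X\otimes\mathcal{H}_Y}J(\Sigma)$, where

$$J(\Sigma)=\frac{1}{n}\sum_{i=1}^{n}\zeta\left(\left\|\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)-\Sigma\right\|_{\mathcal{H}_X\otimes\mathcal{H}_Y}\right) \tag{6}$$

and $\zeta$ is the robust loss function. In order to optimize $J$ in a product RKHS, the necessary conditions are characterized through the Gâteaux differentials of $J$. Given a product vector space $\mathcal{H}_X\otimes\mathcal{H}_Y$ and a function $J:\mathcal{H}_X\otimes\mathcal{H}_Y\to\mathbb{R}$, the Gâteaux differential of $J$ at $\Sigma$ with increment $T$ is defined as

$$\delta J(\Sigma;T)=\lim_{\epsilon\to 0}\frac{J(\Sigma+\epsilon T)-J(\Sigma)}{\epsilon}.$$

The Gâteaux differential on a probability distribution is also defined in Section 4.

Based on the optimality principle (Luenberger-97, ), if the Gâteaux differential is well defined for all $T\in\mathcal{H}_X\otimes\mathcal{H}_Y$, a necessary condition for $J$ to have a minimum at $\hat{\Sigma}_{RXY}$ is that $\delta J(\hat{\Sigma}_{RXY};T)=0$ for all $T$. We can state the following lemma.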

###### Lemma 3.1

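Under the assumptions (i) and (ii), the Gâteaux differential of the objective function $J$ at $\Sigma\in\mathcal{H}_X\otimes\mathcal{H}_Y$ with increment $T\in\mathcal{H}_X\otimes\mathcal{H}_Y$ is

$$\delta J(\Sigma;T)=-\langle V(\Sigma),T\rangle_{\mathcal{H}_X\otimes\mathcal{H}_Y},$$

where $V:\mathcal{H}_X\otimes\mathcal{H}_Y\to\mathcal{H}_X\otimes\mathcal{H}_Y$ is defined as

$$V(\Sigma)=\frac{1}{n}\sum_{i=1}^{n}\varphi\left(\left\|\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)-\Sigma\right\|\right)\left(\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)-\Sigma\right),\qquad \varphi(t)=\frac{\zeta'(t)}{t}.$$

A necessary condition for $\hat{\Sigma}_{RXY}$ to be the robust kernel CCO is

$$V(\hat{\Sigma}_{RXY})=0.$$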

The key difference between Lemma 3.1 and the corresponding lemma of Kim-12 is the underlying RKHS: the latter is stated on a single RKHS, whereas the former is stated on a product RKHS. Lemma 3.1 is therefore a generalization of that result.

###### Theorem 3.2

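Under the same assumptions as Lemma 3.1, the robust kernel CCO (centered) for any $(x,y)\in\mathcal{X}\times\mathcal{Y}$ is then

$$\hat{\Sigma}_{RXY}(x,y)=\sum_{i=1}^{n}w_i\,\tilde{k}_X(x,X_i)\,\tilde{k}_Y(y,Y_i), \tag{7}$$

where $w_i\geq 0$, and $\sum_{i=1}^{n}w_i=1$. Furthermore,

$$w_i\propto\varphi\left(\left\|\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)-\hat{\Sigma}_{RXY}\right\|_{\mathcal{H}_X\otimes\mathcal{H}_Y}\right),\qquad \varphi(t)=\frac{\zeta'(t)}{t}. \tag{8}$$

The representer Theorem 3.2 tells us that, for a robust loss function with decreasing $\varphi$, the weight $w_i$ is small whenever the residual norm $\|\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)-\hat{\Sigma}_{RXY}\|$ is large. Therefore, the robust kernel CCO is robust in the sense that it down-weights outlying points.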

In order to state the sufficient condition for to be the minimizer of Eq. (5), we need an additional assumption on .

###### Theorem 3.3

For a positive definite kernel, becomes strictly convex for the Huber loss function.

### 3.2 Algorithm for robust kernel (cross-) covariance operator

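As explained in (Kim-12, ), Eq. (5) does not have a closed-form solution, but using the kernel trick the standard IRWLS can be extended to an RKHS. The solution at the $h$-th iteration is then

$$\hat{\Sigma}_{RXY}^{(h)}=\sum_{i=1}^{n}w_i^{(h-1)}\,\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i),$$

where

$$w_i^{(h)}=\frac{\varphi\left(\left\|\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)-\hat{\Sigma}_{RXY}^{(h)}\right\|\right)}{\sum_{b=1}^{n}\varphi\left(\left\|\tilde{k}_X(\cdot,X_b)\otimes\tilde{k}_Y(\cdot,Y_b)-\hat{\Sigma}_{RXY}^{(h)}\right\|\right)},\qquad \varphi(t)=\frac{\zeta'(t)}{t}.$$

###### Theorem 3.4

Assume (i) - (iii) and that $\varphi(t)$ is non-increasing. Let $S$ be the set of stationary points of the objective function $J$ of Eq. (5) and $\{\hat{\Sigma}_{RXY}^{(h)}\}$ be the sequence produced by the KIRWLS algorithm. Then $J(\hat{\Sigma}_{RXY}^{(h)})$ decreases monotonically at every iteration and converges. Moreover,

$$\left\|\hat{\Sigma}_{RXY}^{(h)}-S\right\|\to 0$$

as $h\to\infty$.

Theorem 3.4 states that $\hat{\Sigma}_{RXY}^{(h)}$ becomes close to the set of stationary points of $J$ as the number of iterations increases. Under the assumptions of Theorem 3.4 and for a strictly convex $J$, it is also guaranteed that $\hat{\Sigma}_{RXY}^{(h)}$ converges to $\hat{\Sigma}_{RXY}$ in the Hilbert-Schmidt norm and supremum norm.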

The algorithm for estimating robust kernel CCO is given in Figure 2. The input of this algorithm is a robust kernel ME. The computational complexity of a robust kernel ME is in each iteration, where is the number of data points. The algorithm that we have presented involves finding the robust kernel CCO with the dimension . A naive implementation of the algorithm in Figure 2 would show that both time and memory complexity are similar to in each iteration. In practice, the required number of iterations is around . A computational complexity with cubic growth in the number of data points would be a serious liability in application to large dataset. We are able to reduce the time complexity using the low-rank approximation of the Gram matrix (Drineas-05, ). We can also use the random features approach. Random Features provide a finite-dimensional alternative to the kernel trick by instead mapping the data to an equivalent randomized feature space (Rahimi-07, ).
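The re-weighting scheme can be sketched in NumPy as below. This is not the algorithm of Figure 2 itself but a minimal illustration under stated assumptions: a Gaussian kernel, the Huber loss with its threshold `c` set to the median of the initial residual norms (a heuristic chosen here), robust centering with the robust kernel ME weights, and synthetic data with five planted outliers. It uses the fact that the Hilbert-Schmidt inner product of two tensor products factorizes, so the Gram matrix of the elements $\tilde{k}_X(\cdot,X_i)\otimes\tilde{k}_Y(\cdot,Y_i)$ is the element-wise product of the two centered Gram matrices. The names `gaussian_gram`, `center` and `kirwls` are hypothetical.

```python
import numpy as np

def gaussian_gram(A, sigma=1.0):
    """Gaussian Gram matrix of the rows of A."""
    sq = (A**2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def center(K, w):
    """Center at the weighted mean sum_a w_a Phi(X_a): H K H^T with H = I - 1 w^T."""
    n = len(w)
    H = np.eye(n) - np.outer(np.ones(n), w)
    return H @ K @ H.T

def kirwls(G, n_iter=200, tol=1e-10):
    """Re-weighted least squares on a Gram matrix G of Hilbert-space elements e_i.

    Returns weights w of the robust estimate sum_i w_i e_i under the Huber loss.
    """
    n = G.shape[0]
    w = np.full(n, 1.0 / n)
    c = None
    for _ in range(n_iter):
        # Residual norms ||e_i - sum_j w_j e_j|| computed from G only.
        d = np.sqrt(np.maximum(np.diag(G) - 2.0 * G @ w + w @ G @ w, 0.0))
        if c is None:
            c = np.median(d)  # assumed heuristic for the Huber threshold
        # phi(t) = zeta'(t) / t for the Huber loss: 1 if t <= c, c / t otherwise.
        phi = np.where(d <= c, 1.0, c / np.maximum(d, 1e-12))
        w_new = phi / phi.sum()
        converged = np.abs(w_new - w).max() < tol
        w = w_new
        if converged:
            break
    return w

rng = np.random.default_rng(0)
n, n_out = 100, 5
X = rng.normal(size=(n, 2))
Y = X @ np.array([[1.0, 0.5], [-0.5, 1.0]]) + 0.1 * rng.normal(size=(n, 2))
X[-n_out:] = rng.normal(loc=8.0, size=(n_out, 2))    # planted outliers
Y[-n_out:] = rng.normal(loc=-8.0, size=(n_out, 2))

Kx, Ky = gaussian_gram(X), gaussian_gram(Y)
wx, wy = kirwls(Kx), kirwls(Ky)          # robust kernel ME weights (input step)
G = center(Kx, wx) * center(Ky, wy)      # Gram matrix of the tensor-product elements
w = kirwls(G)                            # weights of the robust kernel CCO
```

The standard kernel CCO corresponds to `w = 1/n` for every point; here the planted outliers should receive smaller weights than typical inliers, which is the down-weighting described by Theorem 3.2.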

## 4 Influence function of robust kernel and kernel (cross-) covariance operator

To define robustness in statistics, different approaches have been proposed, for example, the minimax approach (Huber-64, ), the sensitivity curve (Tukey-77, ), the IF (Hampel-74, ; Hampel-86, ) and the finite sample breakdown point (Dono-83, ). Due to its simplicity, the IF is the most useful approach in statistical supervised learning (Christmann-07, ; Christmann-04, ). In this section, we briefly introduce the definition of the IF and the IF of the kernel ME, kernel CO, and kernel CCO. We then propose the IF of the robust kernel CO and the robust kernel CCO.

Let $X_1,\ldots,X_n$ be an IID sample from a population with distribution function $F$, let $F_n$ be its empirical distribution function, and let $T_n=T_n(X_1,\ldots,X_n)$ be a statistic. Also let $\mathcal{A}$ be a class of all possible distributions containing $F_n$ for all $n\geq 1$ and $F$. We assume that there exists a functional $R:\mathcal{D}(R)\to\mathbb{R}$, where $\mathcal{D}(R)$ is the set of all probability distributions in $\mathcal{A}$ for which $R$ is defined, such that

$$T_n=R(F_n),$$

where $R$ does not depend on $n$. $R$ is then called a statistical functional. Suppose the domain of $R$ is a convex set containing all distributions, and the data do not follow the model $F$ exactly but deviate slightly toward a distribution $G$. The Gâteaux derivative, $R'_F$, of $R$ at $F$ is defined as

$$R'_F(G-F)=\lim_{\epsilon\to 0}\frac{R\left((1-\epsilon)F+\epsilon G\right)-R(F)}{\epsilon}.$$

Gâteaux differentiability at $F$ ensures that the directional derivative of $R$ exists in all directions that stay in $\mathcal{D}(R)$.

Suppose $z'\in\mathcal{X}$ and $\Delta_{z'}$ is the probability measure which gives mass $1$ at the point $z'$. Then, for $\epsilon\in(0,1)$, $F_\epsilon=(1-\epsilon)F+\epsilon\Delta_{z'}$ is a contaminated distribution. The influence function (a special case of the Gâteaux derivative) of a statistical functional $R$ at $F$ is defined by

$$IF(z',R,F)=\lim_{\epsilon\to 0}\frac{R(F_\epsilon)-R(F)}{\epsilon}, \tag{9}$$

provided that the limit exists. It can be intuitively interpreted as a suitably normalized asymptotic influence of outliers on the value of an estimate or test statistic. The IF exists under an even weaker condition than Gâteaux differentiability. The IF reflects the bias caused by adding a few outliers at the point $z'$, standardized by the amount of contamination. Therefore a bounded IF indicates the robustness of an estimator (Hampel-86, ).

### 4.1 Influence function based robustness measures

The three metrics of the IF that can be used as robustness measures of a functional $R$ are the gross error sensitivity, the local shift sensitivity and the rejection point. The gross error sensitivity of $R$ at $F$ is defined as

$$\gamma^*(R,F)=\sup_{z}\left|IF(z,R,F)\right|. \tag{10}$$

The gross error sensitivity measures the worst effect that a small amount of contamination of fixed size can have on the estimator. The local shift sensitivity of $R$ at $F$ for all $z\neq z'$ is defined by

$$\lambda^*(R,F)=\sup_{z\neq z'}\frac{\left|IF(z,R,F)-IF(z',R,F)\right|}{\|z-z'\|}.$$

It measures the worst effect of rounding error (a small fluctuation in the observation). The rejection point of $R$ at $F$ is defined by

$$\rho^*(R,F)=\inf\left\{r>0:\ IF(z,R,F)=0\ \text{when}\ \|z\|>r\right\}.$$

The rejection point $\rho^*$ is infinite if there exists no such $r$. We can reject those observations which are farther away than $\rho^*$. For a robust estimator, $\rho^*$ will be finite.

### 4.2 Influence function of kernel (cross-) covariance operator

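In kernel methods, every estimate is a function. For a scalar-valued estimate, we define the IF at a fixed point. But if the estimate is a function, we are able to express the change of the function value at every point. Suppose $R(\cdot,F)$ and $R(\cdot,F_\epsilon)$ are two function estimates on the distribution $F$ and on the distribution $F_\epsilon=(1-\epsilon)F+\epsilon\Delta_{z'}$ contaminated at $z'$, respectively. The influence function for $R(\cdot,F)$ is defined by

$$IF(\cdot,z',R,F)=\lim_{\epsilon\to 0}\frac{R(\cdot,F_\epsilon)-R(\cdot,F)}{\epsilon}.$$

We can estimate the IF using the empirical distribution; the result is called the empirical IF (EIF). Suppose a sample of size $n$ is drawn, with empirical distribution $F_n$. Also let $F_{n,\epsilon}=(1-\epsilon)F_n+\epsilon\Delta_{z'}$ be the contamination model with the empirical data. The empirical IF for $R(\cdot,F_n)$ is defined as

$$EIF(\cdot,z',R,F_n)=\lim_{\epsilon\to 0}\frac{R(\cdot,F_{n,\epsilon})-R(\cdot,F_n)}{\epsilon}.$$

As a first example, let $R$ be the kernel ME, $R(F_X)=\mathcal{M}_X=\int k_X(\cdot,X)\,dF_X=E_X[k_X(\cdot,X)]$, where $X\sim F_X$. The value of the parameter at the contamination model, $F_\epsilon=(1-\epsilon)F_X+\epsilon\Delta_{z'}$, is

$$R(F_\epsilon)=(1-\epsilon)E_X[k_X(\cdot,X)]+\epsilon\,k_X(\cdot,z').$$

Thus the IF of the kernel ME at point $z'$ is given by

$$IF(\cdot,z',R,F_X)=k_X(\cdot,z')-E_X[k_X(\cdot,X)].$$

We can estimate the IF of the kernel ME with the empirical distribution, $F_n$, at the data points $X_1,\ldots,X_n$, at $z'$ for every point $x$ as

$$EIF(x,z',R,F_n)=k_X(x,z')-\frac{1}{n}\sum_{i=1}^{n}k_X(x,X_i),$$

which is called the EIF of kernel ME.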
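A short numerical sketch of the EIF of the kernel ME is given below, assuming the form $k_X(x,z')-\frac{1}{n}\sum_{i}k_X(x,X_i)$ that follows from the linearity of the mean. It contrasts a bounded kernel (Gaussian) with an unbounded one (linear) as the contamination point $z'$ moves away from the data. The kernels, the evaluation point `x0`, the synthetic sample and the helper names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Bounded kernel on the real line."""
    return np.exp(-(a - b)**2 / (2.0 * sigma**2))

def linear_kernel(a, b):
    """Unbounded kernel on the real line."""
    return a * b

def eif_kernel_mean(kernel, x, z, X):
    """EIF(x, z') = k(x, z') - (1/n) sum_i k(x, X_i)."""
    return kernel(x, z) - np.mean(kernel(x, X))

rng = np.random.default_rng(0)
X = rng.normal(size=200)                      # hypothetical sample
x0 = 1.0                                      # point at which the estimate is evaluated
zs = np.array([1.0, 10.0, 100.0, 1000.0])     # contamination points moving away

eif_gauss = np.array([abs(eif_kernel_mean(gaussian_kernel, x0, z, X)) for z in zs])
eif_lin = np.array([abs(eif_kernel_mean(linear_kernel, x0, z, X)) for z in zs])
```

For the Gaussian kernel the absolute EIF can never exceed 1, whatever the value of $z'$, whereas for the linear kernel it grows in proportion to $z'$. A bounded kernel therefore gives the kernel ME a bounded IF, although the IF does not vanish for distant points.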

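As a second example, let $R(F_{XY})=E_{XY}[f_X(X)f_Y(Y)]$ be the mean of the product of two random variables, $f_X(X)$ and $f_Y(Y)$, with $f_X\in\mathcal{H}_X$ and $f_Y\in\mathcal{H}_Y$. The value of the parameter at the contamination model at $z'=(x',y')$, $F_\epsilon=(1-\epsilon)F_{XY}+\epsilon\Delta_{z'}$, is given by

$$R(F_\epsilon)=(1-\epsilon)E_{XY}[f_X(X)f_Y(Y)]+\epsilon f_X(x')f_Y(y').$$

Thus the IF of $R$ is given by

$$IF(z',R,F_{XY})=f_X(x')f_Y(y')-E_{XY}[f_X(X)f_Y(Y)]. \tag{11}$$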

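We can find the IF of a combined statistic given the IFs of its component statistics. The IF of a complicated statistic can be calculated with the chain rule, say $R(F)=a(R_1(F),\ldots,R_s(F))$, that is,

$$IF_R(z)=\sum_{i=1}^{s}\frac{\partial a}{\partial R_i}IF_{R_i}(z).$$

For example, the IF of the covariance of two random variables, $X$ and $Y$, can be calculated using the above chain rule applied to $\mathrm{Cov}(X,Y)=E_{XY}[XY]-E_X[X]E_Y[Y]$ as

$$IF(z',\mathrm{Cov},F_{XY})=(x'-E_X[X])(y'-E_Y[Y])-\mathrm{Cov}(X,Y)$$

for $z'=(x',y')$.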
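The chain-rule result for the covariance can be checked numerically. The sketch below assumes the standard closed form $(x'-E[X])(y'-E[Y])-\mathrm{Cov}(X,Y)$ and compares it with a finite-difference quotient of the covariance functional under the contaminated empirical distribution $(1-\epsilon)F_n+\epsilon\Delta_{z'}$. The data, the contamination point `z` and the step `eps` are arbitrary illustrative choices.

```python
import numpy as np

def cov_weighted(x, y, p):
    """Covariance of (x, y) under a discrete distribution with probabilities p."""
    mx, my = p @ x, p @ y
    return p @ ((x - mx) * (y - my))

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)

z = (4.0, -3.0)   # contamination point z' = (x', y')
eps = 1e-6        # small contamination fraction

# Represent F_n and (1 - eps) F_n + eps * delta_z as weights on the augmented sample.
xc, yc = np.append(x, z[0]), np.append(y, z[1])
p0 = np.append(np.full(n, 1.0 / n), 0.0)
p_eps = np.append(np.full(n, (1.0 - eps) / n), eps)

cov0 = cov_weighted(xc, yc, p0)
if_numeric = (cov_weighted(xc, yc, p_eps) - cov0) / eps
if_closed = (z[0] - x.mean()) * (z[1] - y.mean()) - cov0
```

The two values differ only by a term of order `eps`, and the closed form grows without bound as $z'$ moves away from the mean, which is why the covariance, and with it any method built on the standard kernel CO or CCO with an unbounded kernel, is sensitive to outliers.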

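Using Eq. (11) and the reproducing property, the IF of the kernel CCO, $\Sigma_{XY}$, with distribution $F_{XY}$, at $z'=(x',y')$ is given by

$$IF(\cdot,z',\Sigma_{XY},F_{XY})=\left[k_X(\cdot,x')-\mathcal{M}_X\right]\otimes\left[k_Y(\cdot,y')-\mathcal{M}_Y\right]-\Sigma_{XY}. \tag{12}$$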
