Robust Kernel (Cross-) Covariance Operators in Reproducing Kernel Hilbert Space toward Kernel Methods

02/17/2016
by   Md. Ashad Alam, et al.

To the best of our knowledge, there are no general well-founded robust methods for statistical unsupervised learning. Most unsupervised methods explicitly or implicitly depend on the kernel covariance operator (kernel CO) or the kernel cross-covariance operator (kernel CCO), which are sensitive to contaminated data, even when bounded positive definite kernels are used. First, we propose a robust kernel covariance operator (robust kernel CO) and a robust kernel cross-covariance operator (robust kernel CCO) based on a generalized loss function instead of the quadratic loss function. Second, we propose the influence function of classical kernel canonical correlation analysis (classical kernel CCA). Third, using this influence function, we propose a visualization method to detect influential observations from two sets of data. Finally, we propose a method based on the robust kernel CO and robust kernel CCO, called robust kernel CCA, which is designed for contaminated data and is less sensitive to noise than classical kernel CCA. The principles we describe also apply to many kernel methods that must deal with the kernel CO or kernel CCO. Experiments on synthesized data and an imaging genetics analysis demonstrate that the proposed visualization method and robust kernel CCA can be applied effectively to both ideal data and contaminated data. The robust methods show superior performance over the state-of-the-art methods.

1 Introduction

The incorporation of various unsupervised learning methods for multiple data sources into genomic analysis is a rather recent topic. Through dual representations, the task of learning with multiple data sources is closely related to kernel-based data fusion, which has been actively studied in the last decade (Bach, 2008, Steinwart and Christmann, 2008, Hofmann et al., 2008). Kernel fusion in unsupervised learning has a close connection with unsupervised kernel methods. As unsupervised kernel methods, kernel principal component analysis (Schölkopf et al., 1998, Alam and Fukumizu, 2014, kernel PCA), kernel canonical correlation analysis (Akaho, 2001, Bach and Jordan, 2002, classical kernel CCA), weighted multiple kernel CCA and others have been extensively studied for unsupervised kernel fusion over the past decades (S. Yu and Moreau, 2011). These methods, however, are not robust; they are sensitive to contaminated data. Although a number of studies have addressed robustness for supervised learning, especially support vector machines for classification and regression (Christmann and Steinwart, 2004, 2007, Debruyne et al., 2008), there are no general well-founded robust methods for unsupervised learning.

Robustness is an essential and challenging issue in statistical machine learning for the analysis of data from multiple sources, because outliers, observations that cause surprise in relation to the majority of the data, frequently occur in real data. Outliers may be correct, but we need to check them for transcription errors. They can play havoc with classical statistical methods and statistical machine learning methods. To overcome this problem, many robust methods have been developed since the 1960s, which are less sensitive to outliers. The goals of robust statistics are to fit the methods to the bulk of the data and to flag the points deviating from the original pattern for further investigation (Huber and Ronchetti, 2009, Hampel et al., 2011). In recent years, a robust kernel density estimator (robust kernel DE) has been proposed (Kim and Scott, 2012), which is less sensitive to outliers than the standard kernel density estimator. To the best of our knowledge, two special robust kernel PCA methods have been proposed, based on a weighted eigenvalue decomposition (Huang et al., 2009b) and on spherical kernel PCA (Debruyne et al., 2010). The latter authors show that the influence function (IF), a well-known measure of robustness, of kernel PCA can be arbitrarily large for unbounded kernels.

During the last ten years, a number of papers have studied the properties of CCA using positive definite kernels, called classical kernel CCA, and several variants have been proposed (Fukumizu et al., 2007, Hardoon and Shawe-Taylor, 2009, Otopal, 2012, Alam and Fukumizu, 2015). Owing to the properties of its eigendecomposition, it remains a widely applied method for multiple-source data analysis. In recent years, two canonical correlation analysis (CCA) methods based on the Hilbert-Schmidt independence criterion (hsicCCA) and on centered kernel target alignment (ktaCCA) have been proposed by Chang et al. (2013). These methods are also able to extract nonlinear structure from the data. However, owing to their gradient-based optimization, they cannot extract all canonical variates from the same initial value and do not work for high-dimensional data sets; see Section 5.3 for details. An empirical comparison and sensitivity analysis of robust linear CCA and classical kernel CCA has also been reported, giving an interpretation for kernel CCA similar to that for kernel PCA but without any theoretical results (Alam et al., 2010).

Most kernel methods explicitly or implicitly depend on the kernel covariance operator (kernel CO) or the kernel cross-covariance operator (kernel CCO). These are among the most useful tools of unsupervised kernel methods but have not yet been made robust. To achieve robustness, they can be formulated as empirical optimization problems combined with the ideas of Huber's or Hampel's M-estimation model (Huber and Ronchetti, 2009, Hampel et al., 2011). The robust kernel CO and robust kernel CCO can then be computed efficiently via a kernelized iteratively re-weighted least squares (KIRWLS) algorithm. The robust kernel DE, based on the robust kernel mean element (robust kernel ME), uses KIRWLS in a reproducing kernel Hilbert space (RKHS) (Kim and Scott, 2012). Debruyne et al. (2010) have proposed a visualization method for detecting influential observations from a single data set using the IF of kernel PCA. In addition, Romanazzi (1992) has proposed the IF of the canonical correlations and canonical vectors of linear CCA, but the IF of classical kernel CCA and a robust variant of kernel CCA have not been proposed yet. All of these considerations motivate us to study the robust kernel CCO toward unsupervised kernel methods.

The contribution of this paper is fourfold. First, we propose the robust kernel CO and robust kernel CCO based on a generalized loss function instead of the quadratic loss function. Second, we propose the IF of classical kernel CCA: the kernel canonical correlation (kernel CC) and kernel canonical variates (kernel CV). Third, to detect influential observations from multiple sets of data, we propose a visualization method using the influence function of kernel CCA. Finally, we propose a method based on the robust kernel CO and robust kernel CCO, called robust kernel CCA, which is less sensitive to contamination than classical kernel CCA. Experiments on synthesized data and an imaging genetics analysis demonstrate that the proposed visualization method and robust kernel CCA can be applied effectively to both ideal data (ID) and contaminated data (CD).

The remainder of this paper is organized as follows. In the next section, we provide a brief review of the kernel ME, kernel CCO, robust kernel ME, robust kernel CO, robust kernel CCO and robust Gram matrices, together with algorithms. In Section 3, we briefly discuss the IF, the IF of the kernel ME, and the IFs of the kernel CO and kernel CCO. After a brief review of classical kernel CCA in Section 4.1, we propose the IF of classical kernel CCA (kernel CC and kernel CV) in Section 4.1.1. The robust kernel CCA is proposed in Section 4.2. In Section 5, we describe experiments conducted on both synthesized data and the imaging genetics analysis, together with the visualization method. In the Appendix, we discuss the results in detail.

2 Classical and robust kernel (cross-) covariance operator in RKHS

The kernel ME, kernel CO and kernel CCO with positive definite kernels have been extensively applied to nonparametric statistical inference by representing distributions in the form of means and covariances in an RKHS (Gretton et al., 2008, Fukumizu et al., 2008, Song et al., 2008, Kim and Scott, 2012, Gretton et al., 2012). The basic notions of the kernel ME, kernel CO and kernel CCO, along with their robustness as measured through the IF, are briefly discussed below.

2.1 Classical kernel (cross-) covariance operator

Let $F_X$, $F_Y$ and $F_{XY}$ be probability measures on $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{X}\times\mathcal{Y}$, respectively. Also let $X_1,\ldots,X_n$; $Y_1,\ldots,Y_n$; and $(X_1,Y_1),\ldots,(X_n,Y_n)$ be random samples from the distributions $F_X$, $F_Y$ and $F_{XY}$, respectively. A symmetric kernel $k(\cdot,\cdot)$ defined on a space is called a positive definite kernel if the Gram matrix $\big(k(X_i,X_j)\big)_{ij}$ is positive semi-definite (Aronszajn, 1950). By the reproducing property and the kernel trick, the kernel can evaluate the inner product of any two feature vectors efficiently without knowing an explicit form of either the feature map $\Phi(\cdot)$ or the feature space. In addition, the computational cost does not depend on the dimension of the original space once the Gram matrices have been computed (Fukumizu and Leng, 2014, Alam and Fukumizu, 2014).
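As a minimal illustration of this point (the helper name is ours, and the Gaussian kernel is used only as one common choice), the following snippet computes a Gram matrix; every operator estimate discussed below depends on the data only through such matrices.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

# Once K is computed, the cost of the kernel methods below no longer
# depends on the dimension of the original input space.
X = np.random.randn(100, 5)
K = gaussian_gram(X, sigma=1.0)
```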

A mapping $\Phi(X) = k(\cdot, X)$ with $\mathbf{E}_X[k(X,X)] < \infty$ is an element of the RKHS $\mathcal{H}_X$. By the reproducing property, with $f\in\mathcal{H}_X$, the kernel mean element is defined as $\mu_X := \mathbf{E}_X[k(\cdot, X)]$, so that $\langle f, \mu_X\rangle_{\mathcal{H}_X} = \mathbf{E}_X[f(X)]$. Given an independent and identically distributed sample $X_1,\ldots,X_n$, the empirical kernel ME is $\hat{\mu}_X = \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$. The sample kernel ME of the feature vectors can be regarded as the solution to the empirical risk optimization problem (Kim and Scott, 2012)

$$\hat{\mu}_X = \operatorname*{arg\,min}_{g\in\mathcal{H}_X} \sum_{i=1}^n \big\|\Phi(X_i) - g\big\|^2_{\mathcal{H}_X}. \qquad (1)$$
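Under the quadratic loss in Eq. (1) the minimizer is simply the uniform average of the feature vectors, so evaluating the empirical kernel ME at the sample points reduces to a row average of the Gram matrix. The sketch below (our illustration; `gaussian_gram` is the helper defined above) makes this explicit; the robust losses introduced later replace the uniform weights.

```python
import numpy as np

def empirical_kernel_mean(K):
    """Evaluate mu_hat = (1/n) * sum_i k(., X_i) at the sample points.

    With the quadratic loss of Eq. (1) every observation gets the same
    weight 1/n; robust losses replace these uniform weights.
    """
    n = K.shape[0]
    weights = np.full(n, 1.0 / n)
    return K @ weights          # vector of mu_hat(X_j), j = 1, ..., n
```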

Similarly, we can define the kernel CCO as the solution to an empirical risk optimization problem. An operator $\Sigma_{XY}$, with $\mathbf{E}_X[k_X(X,X)] < \infty$ and $\mathbf{E}_Y[k_Y(Y,Y)] < \infty$, is defined through the reproducing property by

$$\langle f, \Sigma_{XY} g\rangle_{\mathcal{H}_X} = \mathbf{E}_{XY}\big[\big(f(X)-\mathbf{E}_X[f(X)]\big)\big(g(Y)-\mathbf{E}_Y[g(Y)]\big)\big], \quad f\in\mathcal{H}_X,\ g\in\mathcal{H}_Y,$$

and is called the kernel CCO. Given a pair of independent and identically distributed samples $(X_1, Y_1), \ldots, (X_n, Y_n)$, the empirical kernel CCO is an operator on the RKHSs, and Eq. (1) becomes

$$\hat{\Sigma}_{XY} = \operatorname*{arg\,min}_{\Gamma} \sum_{i=1}^{n} \big\|\tilde{\Phi}(X_i)\otimes\tilde{\Phi}(Y_i) - \Gamma\big\|^2, \qquad (2)$$

where $\tilde{\Phi}(X_i) = k_X(\cdot, X_i) - \hat{\mu}_X$ and $\tilde{\Phi}(Y_i) = k_Y(\cdot, Y_i) - \hat{\mu}_Y$ are the centered feature maps. The special case $Y = X$ gives the kernel CO.

2.2 Robust kernel (cross-) covariance operator

It is known (as in Section 2.1) that the kernel ME is the solution to an empirical risk optimization problem with the quadratic loss, i.e., a least-squares-type estimator. Estimators of this type are sensitive to the presence of outliers in the features $\Phi(X_i)$. In recent years, the robust kernel ME has been proposed for density estimation (Kim and Scott, 2012). Our goal is to extend this notion to the kernel CO and kernel CCO. To do so, we estimate the kernel CO and kernel CCO based on robust loss functions (M-estimators) and call the results the robust kernel CO and robust kernel CCO, respectively. The most common examples of robust loss functions $\rho(t)$ on $t \ge 0$ are Huber's and Hampel's loss functions. Unlike the quadratic loss function, the derivatives of these loss functions are bounded (Huber and Ronchetti, 2009, Hampel et al., 1986). Huber's loss function is defined as

$$\rho(t) = \begin{cases} t^2/2, & 0 \le t \le c,\\ c\,t - c^2/2, & t > c,\end{cases}$$

and Hampel's loss function is defined, for constants $0 < a < b < c$, as

$$\rho(t) = \begin{cases} t^2/2, & 0 \le t < a,\\ a\,t - a^2/2, & a \le t < b,\\ \dfrac{a(b+c-a)}{2} - \dfrac{a(c-t)^2}{2(c-b)}, & b \le t < c,\\ \dfrac{a(b+c-a)}{2}, & t \ge c.\end{cases}$$

The basic assumptions on $\rho$ are: (i) $\rho$ is non-decreasing, $\rho(0)=0$ and $\rho(t)/t \to 0$ as $t \to 0$; (ii) $\varphi(0) := \lim_{t\to 0} \psi(t)/t$ exists and is finite, where $\psi$ is the derivative of $\rho$; (iii) $\psi(t)$ and $\psi(t)/t$ are continuous and bounded; and (iv) $\varphi(t) = \psi(t)/t$ is Lipschitz continuous. Huber's loss function, as well as others, satisfies all of these assumptions (Kim and Scott, 2012).
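For reference, here is a minimal sketch of Huber's loss, its (bounded) derivative and the weight function $\varphi(t) = \psi(t)/t$ that drives the KIRWLS iterations below; the cut-off $c$ is a tuning constant and the function names are ours.

```python
import numpy as np

def huber_rho(t, c):
    """Huber's loss: quadratic on [0, c], linear beyond c."""
    t = np.abs(t)
    return np.where(t <= c, 0.5 * t ** 2, c * t - 0.5 * c ** 2)

def huber_psi(t, c):
    """Derivative of Huber's loss; bounded, unlike the quadratic loss."""
    return np.clip(t, -c, c)

def huber_weight(t, c, eps=1e-12):
    """KIRWLS weight phi(t) = psi(t) / t (equal to 1 on [0, c])."""
    t = np.abs(t)
    return np.where(t <= c, 1.0, c / np.maximum(t, eps))
```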

Given the weights $w_1, \ldots, w_n$ of the robust kernel ME of a set of observations $X_1, \ldots, X_n$, the points are centered as $\tilde{\Phi}(X_i) = \Phi(X_i) - \sum_{j=1}^n w_j \Phi(X_j)$ and the centered Gram matrix is $\tilde{K} = (I_n - \mathbf{1}_n w^T)\, K\, (I_n - w \mathbf{1}_n^T)$, where $w = (w_1, \ldots, w_n)^T$ and $\mathbf{1}_n$ is the vector of $n$ ones.

Eq. (2) can be written as

$$\hat{\Sigma}^{R}_{XY} = \operatorname*{arg\,min}_{\Gamma} \sum_{i=1}^{n} \rho\big(\big\|\tilde{\Phi}(X_i)\otimes\tilde{\Phi}(Y_i) - \Gamma\big\|\big). \qquad (3)$$

As in Kim and Scott (2012), Eq. (3) does not have a closed-form solution, but using the kernel trick the classical iteratively re-weighted least squares (IRWLS) method can be extended to an RKHS. The solution is then

$$\hat{\Sigma}^{R}_{XY} = \sum_{i=1}^{n} w_i\, \tilde{\Phi}(X_i)\otimes\tilde{\Phi}(Y_i), \quad \text{where} \quad w_i = \frac{\varphi\big(\|\tilde{\Phi}(X_i)\otimes\tilde{\Phi}(Y_i) - \hat{\Sigma}^{R}_{XY}\|\big)}{\sum_{j=1}^{n} \varphi\big(\|\tilde{\Phi}(X_j)\otimes\tilde{\Phi}(Y_j) - \hat{\Sigma}^{R}_{XY}\|\big)}.$$

The algorithms for estimating the robust Gram matrix and the robust kernel CCO are given in Figures 1 and 2, respectively.

 Input: in . The kernel matrix with kernel and . Threshold , (e.g., ). The objective function of robust mean element is

  1. Do the following steps until:

    where

    • Set and .

    • Solve and make a vector for .

    • Update the mean element, .

    • Update error, .

    • Update as .

Output: the centered robust kernel matrix, where  

Figure 1: The algorithm of estimating centered kernel matrix using robust kernel mean element.
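The procedure in Figure 1 can be summarized in code. Below is a minimal sketch of KIRWLS for the robust kernel ME and the resulting robustly centered Gram matrix, assuming Huber's loss; the exact update order and stopping rule may differ from the authors', and the function names are ours.

```python
import numpy as np

def robust_kernel_mean_weights(K, c, tol=1e-6, max_iter=100):
    """KIRWLS weights w for the robust kernel mean mu = sum_i w_i k(., X_i)."""
    n = K.shape[0]
    w = np.full(n, 1.0 / n)                       # start from the classical mean
    obj_old = None
    for _ in range(max_iter):
        # errors e_i = ||Phi(X_i) - mu||_H, computed via the kernel trick
        e = np.sqrt(np.maximum(np.diag(K) - 2.0 * K @ w + w @ K @ w, 0.0))
        obj = np.sum(np.where(e <= c, 0.5 * e ** 2, c * e - 0.5 * c ** 2))
        if obj_old is not None and abs(obj_old - obj) <= tol * max(abs(obj_old), 1e-12):
            break
        obj_old = obj
        phi = np.where(e <= c, 1.0, c / np.maximum(e, 1e-12))   # Huber weights
        w = phi / phi.sum()                       # normalized KIRWLS update
    return w

def centered_robust_gram(K, w):
    """Centered Gram matrix using the robust mean weights instead of 1/n."""
    n = K.shape[0]
    C = np.eye(n) - np.outer(np.ones(n), w)
    return C @ K @ C.T
```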

 Input: . The robust centered kernel matrix and with kernel and , and, are the th column of the and , respectively. Also define and . Threshold (e.g., ). The objective function of robust cross-covariance operator is

  1. Do the following steps until:

    where 

    • Set , and

    • Solve and make a vector for .

    • Calculate a vector, and make a matrix , where is matrix that th column consists of all elements of the matrix .

    • Update the robust covariance, .

    • Update error, .

    • Update as .

Output: the robust cross-covariance operator.  

Figure 2: The algorithm of estimating robust cross-covariance operator.
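Analogously to Figure 1, the robust kernel CCO iteration of Figure 2 can be run entirely on $n \times n$ matrices, because the inner product between two rank-one terms $\tilde{\Phi}(X_i)\otimes\tilde{\Phi}(Y_i)$ and $\tilde{\Phi}(X_j)\otimes\tilde{\Phi}(Y_j)$ equals the product of the corresponding centered Gram entries. The sketch below is our interpretation under Huber's loss; it returns the KIRWLS weights that define $\hat{\Sigma}^{R}_{XY} = \sum_i w_i\, \tilde{\Phi}(X_i)\otimes\tilde{\Phi}(Y_i)$.

```python
import numpy as np

def robust_cross_covariance_weights(Kx_c, Ky_c, c, tol=1e-6, max_iter=100):
    """KIRWLS weights for the robust kernel CCO, given centered Gram matrices."""
    n = Kx_c.shape[0]
    H = Kx_c * Ky_c                      # H[i, j] = <term_i, term_j> in the product RKHS
    w = np.full(n, 1.0 / n)
    obj_old = None
    for _ in range(max_iter):
        e = np.sqrt(np.maximum(np.diag(H) - 2.0 * H @ w + w @ H @ w, 0.0))
        obj = np.sum(np.where(e <= c, 0.5 * e ** 2, c * e - 0.5 * c ** 2))
        if obj_old is not None and abs(obj_old - obj) <= tol * max(abs(obj_old), 1e-12):
            break
        obj_old = obj
        phi = np.where(e <= c, 1.0, c / np.maximum(e, 1e-12))
        w = phi / phi.sum()
    return w
```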

3 Influence function of kernel (cross-) covariance operator

To define the notion of robustness in statistics, different approaches have been proposed since the 1970s, for example the minimax approach (Huber, 1964), the sensitivity curve (Tukey, 1977), the influence function (Hampel, 1974, Hampel et al., 1986) and, in the finite-sample setting, the breakdown point (Donoho and Huber, 1983). Due to its simplicity, the IF is the most useful approach in statistics and in statistical supervised learning (Christmann and Steinwart, 2004, 2007). In this section, we briefly discuss the notion of the IF, the IF of the kernel ME, and the IFs of the kernel CO and kernel CCO (for details, see the Appendix).

Let $(\Omega, \mathcal{A}, P)$ be a probability space and $(\mathcal{X}, \mathcal{B})$ a measurable space. We want to estimate the parameter $\theta$ of a distribution $F$ in $\mathcal{A}$. We assume that there exists a functional $T: \mathcal{D}(T) \to \mathbb{R}$, where $\mathcal{D}(T)$ is the set of all probability distributions in $\mathcal{A}$. Let $G$ be some distribution in $\mathcal{A}$. If the data do not follow the model $F$ exactly but deviate slightly toward $G$, the Gâteaux derivative of $T$ at $F$ is given by

$$L_T(F; G) = \lim_{\epsilon \to 0} \frac{T\big[(1-\epsilon)F + \epsilon G\big] - T(F)}{\epsilon}. \qquad (4)$$

Suppose $z \in \mathcal{X}$ and $\delta_z$ is the probability measure that puts mass $1$ at $z$. The influence function (a special case of the Gâteaux derivative) of $T$ at $F$ is defined by

$$\mathrm{IF}(z; T, F) = \lim_{\epsilon \to 0} \frac{T\big[(1-\epsilon)F + \epsilon \delta_z\big] - T(F)}{\epsilon}, \qquad (5)$$

provided that the limit exists. It can be intuitively interpreted as a suitably normalized asymptotic influence of outliers on the value of an estimate or test statistic.
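As a toy illustration of Eq. (5) (ours, not from the paper): for the sample mean the contamination quotient can be computed exactly, and it recovers the familiar influence function $z - \bar{x}$.

```python
import numpy as np

def if_mean(z, sample, eps=1e-6):
    """Finite-epsilon version of Eq. (5) for T = mean.

    The mean of (1 - eps) * F_n + eps * delta_z is
    (1 - eps) * mean(sample) + eps * z, so the difference quotient equals
    z - mean(sample) exactly, i.e. the IF of the mean.
    """
    t_f = np.mean(sample)
    t_eps = (1.0 - eps) * t_f + eps * z
    return (t_eps - t_f) / eps
```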

There are three properties associated with the IF: gross error sensitivity, local shift sensitivity and the rejection point. They measure, respectively, the worst effect of a gross error, the worst effect of rounding error, and the distance beyond which an observation has no influence at all.

3.1 Influence function of kernel mean element and kernel cross-raw moment

For a scalar estimate we can only define the IF at a fixed point, but if the estimate is a function we can express the change of the function value at every point. The IF of the kernel ME $\mu_X$ at $X'$, evaluated at any point $x$, is $\mathrm{IF}(x; \mu_X, X') = k_X(x, X') - \mu_X(x)$.

Let the cross-raw moment be $\mathcal{M}_{XY} := \mathbf{E}_{XY}\big[k_X(\cdot, X) \otimes k_Y(\cdot, Y)\big]$. The IF of $\mathcal{M}_{XY}$ at $Z' = (X', Y')$, evaluated at every point $(x, y)$, is given by

$$\mathrm{IF}\big((x, y); \mathcal{M}_{XY}, Z'\big) = k_X(x, X')\, k_Y(y, Y') - \mathcal{M}_{XY}(x, y),$$

which is estimated with the pairs of data points $(X_i, Y_i)_{i=1}^{n}$ at any evaluation point $(x, y)$ as

$$\widehat{\mathrm{IF}}\big((x, y); \mathcal{M}_{XY}, Z'\big) = k_X(x, X')\, k_Y(y, Y') - \frac{1}{n}\sum_{i=1}^{n} k_X(x, X_i)\, k_Y(y, Y_i).$$

3.2 Influence function of complicated statistics

The IF of a complicated statistic, i.e., a function of simpler statistics, can be calculated with the chain rule. Say $T = h(T_1, \ldots, T_k)$; then

$$\mathrm{IF}(z; T, F) = \sum_{j=1}^{k} \frac{\partial h}{\partial T_j}\, \mathrm{IF}(z; T_j, F).$$

It can also be used to find the IF of a transformed statistic, given the influence function of the statistic itself.

The IF of the kernel CCO $\Sigma_{XY}$ with joint distribution $F_{XY}$, obtained via the chain rule at $Z' = (X', Y')$, is given by

$$\mathrm{IF}\big((x, y); \Sigma_{XY}, Z'\big) = \big(k_X(x, X') - \mu_X(x)\big)\big(k_Y(y, Y') - \mu_Y(y)\big) - \Sigma_{XY}(x, y),$$

which is estimated with the data points, for every $(x, y)$, as

$$\widehat{\mathrm{IF}}\big((x, y); \Sigma_{XY}, Z'\big) = \Big(k_X(x, X') - \tfrac{1}{n}\textstyle\sum_{i=1}^{n} k_X(x, X_i)\Big)\Big(k_Y(y, Y') - \tfrac{1}{n}\textstyle\sum_{i=1}^{n} k_Y(y, Y_i)\Big) - \hat{\Sigma}_{XY}(x, y). \qquad (6)$$

For bounded kernels, the above IFs possess the three properties: bounded gross error sensitivity, bounded local shift sensitivity and a finite rejection point. This is not true for unbounded kernels, for example the linear and polynomial kernels. A similar conclusion holds for the kernel covariance operator.
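Assuming the form of Eq. (6) above, the empirical IF of the kernel CCO at a contaminating pair can be evaluated on a grid of points as follows; this is our own sketch, with `kx` and `ky` denoting arbitrary kernel functions supplied by the user.

```python
import numpy as np

def eif_kernel_cco(x, y, x_new, y_new, X, Y, kx, ky):
    """Empirical IF of the kernel CCO (Eq. 6) at (x_new, y_new), evaluated at (x, y)."""
    kx_vals = np.array([kx(x, xi) for xi in X])
    ky_vals = np.array([ky(y, yi) for yi in Y])
    kx_cent = kx_vals - kx_vals.mean()              # centered kernel evaluations
    ky_cent = ky_vals - ky_vals.mean()
    sigma_hat = np.mean(kx_cent * ky_cent)          # empirical CCO evaluated at (x, y)
    return (kx(x, x_new) - kx_vals.mean()) * (ky(y, y_new) - ky_vals.mean()) - sigma_hat

# Example kernels (Gaussian with unit bandwidth):
kx = lambda a, b: float(np.exp(-0.5 * np.sum((np.atleast_1d(a) - np.atleast_1d(b)) ** 2)))
ky = kx
```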

4 Classical and robust kernel canonical correlation analysis

In this Section, we review classical kernel CCA and propose the IF and empirical IF (EIF) of kernel CCA. After that we propose a robust kernel CCA method based on robust kernel CO and robust kernel CCO.

4.1 Classical kernel CCA

Classical kernel CCA has been proposed as a nonlinear extension of linear CCA (Akaho, 2001, Lai and Fyfe, 2000). Bach and Jordan (2002) extended classical kernel CCA with an efficient computational algorithm based on the incomplete Cholesky factorization. Over the last decade, classical kernel CCA has been used for various purposes, including preprocessing for classification, as a contrast function for independent component analysis, and for testing independence between two sets of variables, and it has been applied in many domains such as genomics, computer graphics, computer-aided drug discovery and computational biology (Alzate and Suykens, 2008, Hardoon et al., 2004, Huang et al., 2009a). Theoretical results on the convergence of kernel CCA have also been obtained (Fukumizu et al., 2007, Hardoon and Shawe-Taylor, 2009).

The aim of classical kernel CCA is to seek two sets of functions in the RKHSs for which the correlation (Corr) of the random variables is maximized. In the simplest case, given two random variables $X$ and $Y$ and two functions in the RKHSs, $f_X \in \mathcal{H}_X$ and $f_Y \in \mathcal{H}_Y$, the optimization problem for the random variables $f_X(X)$ and $f_Y(Y)$ is

$$\max_{\substack{f_X \in \mathcal{H}_X,\, f_Y \in \mathcal{H}_Y \\ f_X \neq 0,\, f_Y \neq 0}} \mathrm{Corr}\big(f_X(X), f_Y(Y)\big). \qquad (7)$$

The optimizing functions $f_X$ and $f_Y$ are determined up to scale.

Using a finite sample, we are able to estimate the desired functions. Given an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a joint distribution $F_{XY}$, by taking the inner products with elements or "parameters" in the RKHSs, we have the features $f_X(\cdot) = \sum_{i=1}^{n} a_i^X k_X(\cdot, X_i)$ and $f_Y(\cdot) = \sum_{i=1}^{n} a_i^Y k_Y(\cdot, Y_i)$, where $k_X$ and $k_Y$ are the associated kernel functions of $\mathcal{H}_X$ and $\mathcal{H}_Y$, respectively. The kernel Gram matrices are defined as $(K_X)_{ij} := k_X(X_i, X_j)$ and $(K_Y)_{ij} := k_Y(Y_i, Y_j)$. We need the centered kernel Gram matrices $M_X = C K_X C$ and $M_Y = C K_Y C$, where $C = I_n - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$ and $\mathbf{1}_n$ is the vector of $n$ ones. The empirical estimate of Eq. (7) is then given by

$$\max_{a^X,\, a^Y \in \mathbb{R}^n} \frac{a^{X\top} M_X M_Y\, a^Y}{\big(a^{X\top} (M_X + \kappa I_n)^2\, a^X\big)^{1/2}\, \big(a^{Y\top} (M_Y + \kappa I_n)^2\, a^Y\big)^{1/2}},$$

where $a^X$ and $a^Y$ are the directions of $f_X$ and $f_Y$, respectively, and $\kappa > 0$ is the regularization coefficient.
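For concreteness, a common way to solve the regularized empirical problem is as a symmetric generalized eigenvalue problem in the dual coefficients (cf. Bach and Jordan, 2002). The sketch below is our own and its normalization may differ slightly from the paper's; the `center` flag lets the same routine be reused later with pre-centered or re-weighted Gram matrices.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_cca(Kx, Ky, kappa=1e-3, n_components=1, center=True):
    """Regularized kernel CCA via a generalized eigenproblem on dual coefficients."""
    n = Kx.shape[0]
    if center:
        C = np.eye(n) - np.ones((n, n)) / n      # centering matrix
        Kx, Ky = C @ Kx @ C, C @ Ky @ C
    Rx = Kx + kappa * np.eye(n)
    Ry = Ky + kappa * np.eye(n)
    A = np.block([[np.zeros((n, n)), Kx @ Ky],
                  [Ky @ Kx,          np.zeros((n, n))]])
    B = np.block([[Rx @ Rx,          np.zeros((n, n))],
                  [np.zeros((n, n)), Ry @ Ry]])
    vals, vecs = eigh(A, B)                      # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]  # largest = canonical correlations
    return vals[idx], vecs[:n, idx], vecs[n:, idx]
```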

4.1.1 Influence function of classical kernel CCA

By using the IF results for kernel PCA, linear PCA and linear CCA, we can derive the IF of kernel CCA, i.e., of the kernel CC and kernel CVs.

Theorem 4.1

Given two sets of random variables $(X, Y)$ with distribution $F_{XY}$, the influence functions of the kernel canonical correlation and of the kernel canonical variates at $Z' = (X', Y')$ are given by

(8)

where .

To prove Theorem 4.1, we need to find the IFs of the operators involved; all notation and the proof are given in the Appendix.

It is known that the inverse of an operator may not exist, and even when it exists it may not be a continuous operator in general (Fukumizu et al., 2007). While we can define the kernel canonical correlation using the correlation operator $V_{XY} = \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}$ even when $\Sigma_{XX}^{-1/2}$ and $\Sigma_{YY}^{-1/2}$ are not proper operators, the IF of the covariance operator holds only for finite-dimensional RKHSs. For infinite-dimensional RKHSs, we can find the IF of $V_{XY}$ by introducing a regularization term as follows:

$$V_{XY}^{\kappa} = (\Sigma_{XX} + \kappa I)^{-1/2}\, \Sigma_{XY}\, (\Sigma_{YY} + \kappa I)^{-1/2}, \qquad (9)$$

where $\kappa > 0$ is a regularization coefficient, which also gives the empirical estimator.

Let $(X_i, Y_i)_{i=1}^{n}$ be a sample from the distribution $F_{XY}$. The EIFs of the kernel CC and kernel CV at $Z' = (X', Y')$, for all sample points, are obtained by plugging the empirical estimators into the expressions of Theorem 4.1.

For bounded kernels, the IFs and EIFs stated in Theorem 4.1 and thereafter have the three properties: bounded gross error sensitivity, bounded local shift sensitivity and a finite rejection point. For unbounded kernels, say the linear or polynomial kernels, the IFs are unbounded. As a consequence, the results of classical kernel CCA with bounded kernels are less sensitive than those with unbounded kernels. In practice, however, classical kernel CCA is affected by contaminated data even when bounded kernels are used (Alam et al., 2010).

4.2 Robust kernel CCA

In this section, we propose a robust kernel CCA method based on the robust kernel CO and robust kernel CCO. While many robust linear CCA methods have been proposed, which emphasize fitting the bulk of the data well and flagging the points that deviate from the original pattern for further investigation (Adrover and Donato, 2015, Alam et al., 2010), there is no general well-founded robust method for kernel CCA. Classical kernel CCA uses the same weight for each data point to estimate the kernel CO and kernel CCO, which is the solution of an empirical risk optimization problem with the quadratic loss function. It is known that the least squares loss function is not a robust loss function. Instead, we can solve the empirical risk optimization problem with a robust loss function, where the weights are determined from the data via KIRWLS. After obtaining the robust kernel CO and robust kernel CCO, we use them in classical kernel CCA; we call the resulting method robust kernel CCA. Figure 3 presents a detailed algorithm of the proposed method (except for the first two steps, all steps are the same as in classical kernel CCA). The method is designed for contaminated data as well, and the principles we describe also apply to other kernel methods that must deal with the kernel CO or kernel CCO.

 Input: in .

  1. Calculate the robust cross-covariance operator, using algorithm in Figure 2.

  2. Calculate the robust covariance operator and using the same weight of cross-covariance operator (for simplicity).

  3. Find

  4. For , we have the largest eigenvalue of for .

  5. The unit eigenfunctions of

    corresponding to the th eigenvalues are and

  6. The jth () kernel canonical variates are given by

    where and

Output: the robust kernel CCA  

Figure 3: The algorithm of estimating robust kernel CCA
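Putting the pieces together, the following sketch mirrors Figure 3: it first computes the robust weights and robustly centered Gram matrices, folds the cross-covariance weights into the features (scaling the $i$-th feature by $\sqrt{w_i}$), and then solves the classical kernel CCA eigenproblem. It relies on the helper functions sketched earlier (`robust_kernel_mean_weights`, `centered_robust_gram`, `robust_cross_covariance_weights`, `kernel_cca`); the authors' exact weighting scheme may differ.

```python
import numpy as np

def robust_kernel_cca(Kx, Ky, c, kappa=1e-3, n_components=1):
    """Robust kernel CCA sketch following Figure 3 (steps 1-2 replace the
    classical operators with their robust, KIRWLS-weighted counterparts)."""
    wx = robust_kernel_mean_weights(Kx, c)             # robust centering weights
    wy = robust_kernel_mean_weights(Ky, c)
    Kx_c = centered_robust_gram(Kx, wx)
    Ky_c = centered_robust_gram(Ky, wy)
    w = robust_cross_covariance_weights(Kx_c, Ky_c, c) # robust kernel CCO weights
    D = np.diag(np.sqrt(w))                            # fold w_i into the features
    return kernel_cca(D @ Kx_c @ D, D @ Ky_c @ D,
                      kappa=kappa, n_components=n_components, center=False)
```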

5 Experiments

We generate two types of simulated data, the original data and data with contamination, which we call ideal data (ID) and contaminated data (CD), respectively. We conduct experiments on synthetic data as well as on real data sets. The real data sets are described in Sections 5.2 and 5.3. The synthetic data sets are as follows:

Three circles structural data (TCSD): Data are generated along three circles of different radii with small noise:

where , and , for , , and , respectively, and independently for an ID and for the CD.

Sine function structural data (SFSD): 1500 data points are generated along a sine function with small noise:

where and independently for the ID and for the CD.

Multivariate Gaussian structural data (MGSD): Given multivariate normal data (generated as in Alam and Fukumizu, 2015), we divide them into two sets of variables, use the first six variables of the first set as $X$, and transform the remaining variables by taking absolute values to obtain $Y$. For the CD, the same construction is used with contamination added.

Sine and cosine function structural data (SCSD): We use uniform marginal distributions and transform the data by two periodic functions, sine and cosine, to make two sets $X$ and $Y$, respectively, with additive Gaussian noise. For the CD, contamination is added.

SNP and fMRI structural data (SMSD): Two data sets were simulated, SNP data $X$ and fMRI data $Y$ with 1000 voxels. To correlate the SNPs with the voxels, a latent model is used as in Parkhomenko et al. (2009). For contamination, we change the signal level and the noise level.

In our experiments, we first compare the classical and robust kernel covariance operators. After that, the robust kernel CCA is compared with classical kernel CCA, hsicCCA and ktaCCA. In all experiments, for the Gaussian kernel we use the median of the pairwise distances as the bandwidth, and for the Laplacian kernel the bandwidth is set to 1. The same regularization parameter is used for classical kernel CCA and robust kernel CCA. In the robust methods, we use Huber's loss function with the constant $c$ set to the median.
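The median heuristic mentioned above is straightforward to implement; the snippet below (ours) sets the Gaussian bandwidth to the median of the pairwise Euclidean distances.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_bandwidth(X):
    """Median of the pairwise Euclidean distances, used as the Gaussian bandwidth."""
    return np.median(pdist(X))

# Usage with the Gaussian Gram matrix defined earlier:
# K = gaussian_gram(X, sigma=median_bandwidth(X))
```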

5.1 Kernel covariance operator and robust kernel covariance operator

We evaluate the performance of the kernel CO and the robust kernel CO in two different settings. First, we check the accuracy of both operators against the kernel CO computed from a large data set (taken as a proxy for the population kernel CO). The measure of the kernel CO and robust kernel CO estimators is defined as