    Sparse Kernel Canonical Correlation Analysis via ℓ_1-regularization

Canonical correlation analysis (CCA) is a multivariate statistical technique for finding linear relationships between two sets of variables. Its kernel generalization, kernel CCA, has been proposed to find nonlinear relations between datasets. Despite their wide usage, both methods share a common limitation: the lack of sparsity in their solutions. In this paper, we consider sparse kernel CCA and propose a novel sparse kernel CCA algorithm (SKCCA). Our algorithm is based on a relationship between kernel CCA and least squares. Sparsity of the dual transformations is introduced by penalizing the $\ell_1$-norm of the dual vectors. Experiments demonstrate that our algorithm not only performs well in computing sparse dual transformations but can also alleviate the over-fitting problem of kernel CCA.


1 Introduction

The description of the relationship between two sets of variables has long been a topic of interest to many researchers. Canonical correlation analysis (CCA), originally introduced by Hotelling, is a multivariate statistical technique for finding the linear relationship between two sets of variables. The two sets of variables can be considered as different views of the same object or as views of different objects, and are assumed to contain some joint information in the correlations between them. CCA seeks a linear transformation for each of the two sets of variables such that the projected variables in the transformed space are maximally correlated.

Let $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ be samples of variables $x$ and $y$, respectively. Denote

 X = [x_1 \cdots x_n] \in \mathbb{R}^{d_1 \times n}, \qquad Y = [y_1 \cdots y_n] \in \mathbb{R}^{d_2 \times n},

and assume both $X$ and $Y$ have zero mean, i.e., $\sum_{i=1}^n x_i = 0$ and $\sum_{i=1}^n y_i = 0$. Then CCA solves the following optimization problem

 \max_{w_x, w_y} \; w_x^T X Y^T w_y \quad \text{s.t.} \quad w_x^T X X^T w_x = 1, \; w_y^T Y Y^T w_y = 1, (1.1)

to get the first pair of weight vectors $w_x$ and $w_y$, which are further utilized to obtain the first pair of canonical variables $X^T w_x$ and $Y^T w_y$, respectively. For the remaining pairs of weight vectors and canonical variables, CCA sequentially solves the same problem as (1.1) with additional constraints of orthogonality among the canonical variables. Suppose we have obtained a pair of linear transformations $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$; then for a pair of new data $(x, y)$, its projection into the new coordinate system determined by $(W_x, W_y)$ will be

 (W_x^T x, \; W_y^T y). (1.2)
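As a concrete illustration of (1.1), the sketch below (NumPy; the function and variable names are ours, not from the paper) computes the first pair of weight vectors by whitening both views with pseudo-inverse square roots and taking the leading singular pair of the whitened cross-covariance matrix.

```python
import numpy as np

def cca_first_pair(X, Y, eps=1e-10):
    """First pair of CCA weight vectors for centered X (d1 x n) and Y (d2 x n).

    Solves max w_x^T X Y^T w_y subject to w_x^T X X^T w_x = w_y^T Y Y^T w_y = 1
    by whitening each view and taking the leading singular pair.
    """
    def inv_sqrt(C):
        # pseudo-inverse square root of a symmetric PSD matrix
        w, V = np.linalg.eigh(C)
        inv = np.where(w > eps, 1.0 / np.sqrt(np.maximum(w, eps)), 0.0)
        return (V * inv) @ V.T

    Wx_white = inv_sqrt(X @ X.T)
    Wy_white = inv_sqrt(Y @ Y.T)
    U, s, Vt = np.linalg.svd(Wx_white @ (X @ Y.T) @ Wy_white)
    w_x = Wx_white @ U[:, 0]          # undo the whitening
    w_y = Wy_white @ Vt[0, :]
    return w_x, w_y, s[0]             # s[0] is the first canonical correlation
```

Subsequent pairs correspond to the remaining singular vectors, which automatically satisfy the orthogonality constraints of the sequential formulation.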

Since CCA only considers linear transformations of the original variables, it cannot capture nonlinear relations among variables. However, in a wide range of practical problems linear relations may not be adequate for studying relations among variables. Detecting nonlinear relations among data is important and useful in modern data analysis, especially when dealing with data that are not in the form of vectors, such as text documents, images, micro-array data and so on. A natural extension, therefore, is to explore and exploit nonlinear relations among data. Nonlinear CCA has received wide attention [11, 30], and the most frequently used approach is the kernel generalization of CCA, named kernel canonical correlation analysis (kernel CCA). Motivated by the development and successful applications of kernel learning methods [37, 39], such as support vector machines (SVM) [7, 37], kernel principal component analysis (KPCA), kernel Fisher discriminant analysis, kernel partial least squares and so on, a large body of research on kernel CCA has emerged [1, 32, 2, 16, 17, 25, 24, 29, 30, 39].

Kernel methods have attracted a great deal of attention in the field of nonlinear data analysis. In kernel methods, we first implicitly represent data as elements in reproducing kernel Hilbert spaces associated with positive definite kernels, then apply linear algorithms to the data and substitute the linear inner product by kernel functions, which results in nonlinear variants. The main idea of kernel CCA is that we first virtually map the data $X$ into a high dimensional feature space via a mapping $\phi_x$ such that the data in the feature space become

 \Phi_x = [\phi_x(x_1) \cdots \phi_x(x_n)] \in \mathbb{R}^{N_x \times n},

where $N_x$ is the dimension of the feature space, which can be very high or even infinite. The mapping from the input data to the feature space is performed implicitly by considering a positive definite kernel function $\kappa_x$ satisfying

 \kappa_x(x_1, x_2) = \langle \phi_x(x_1), \phi_x(x_2) \rangle, (1.3)

where $\langle \cdot, \cdot \rangle$ is the inner product in the feature space, rather than by giving the coordinates of $\phi_x(x)$ explicitly. The feature space is known as the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel function $\kappa_x$. In the same way, we can map $Y$ into a feature space associated with a kernel $\kappa_y$ through a mapping $\phi_y$ such that

 \Phi_y = [\phi_y(y_1) \cdots \phi_y(y_n)] \in \mathbb{R}^{N_y \times n}.

After mapping $X$ to $\Phi_x$ and $Y$ to $\Phi_y$, we then apply ordinary linear CCA to the data pair $(\Phi_x, \Phi_y)$.

Let

 K_x = \langle \Phi_x, \Phi_x \rangle = [\kappa_x(x_i, x_j)]_{i,j=1}^n \in \mathbb{R}^{n \times n}, \qquad K_y = \langle \Phi_y, \Phi_y \rangle = [\kappa_y(y_i, y_j)]_{i,j=1}^n \in \mathbb{R}^{n \times n} (1.4)

be the matrices consisting of inner products of the datasets $X$ and $Y$, respectively. $K_x$ and $K_y$ are called kernel matrices or Gram matrices. Then kernel CCA seeks linear transformations in the feature space by expressing the weight vectors as linear combinations of the training data, that is,

 w_x = \Phi_x \alpha = \sum_{i=1}^n \alpha_i \phi_x(x_i), \qquad w_y = \Phi_y \beta = \sum_{i=1}^n \beta_i \phi_y(y_i),

where $\alpha, \beta \in \mathbb{R}^n$ are called dual vectors. The first pair of dual vectors can be determined by solving the following optimization problem

 \max_{\alpha, \beta} \; \alpha^T K_x K_y \beta \quad \text{s.t.} \quad \alpha^T K_x^2 \alpha = 1, \; \beta^T K_y^2 \beta = 1. (1.5)

The remaining pairs of dual vectors are obtained by sequentially solving the same problem as (1.5) with extra orthogonality constraints. More details on the derivation of kernel CCA are presented in Section 2.

Suppose we have obtained dual transformations $W_x, W_y \in \mathbb{R}^{n \times l}$ and the corresponding CCA transformations $\mathcal{W}_x = \Phi_x W_x$ and $\mathcal{W}_y = \Phi_y W_y$ in the feature spaces; then the projection of a data pair $(x, y)$ onto the kernel CCA directions can be computed by first mapping $x$ and $y$ into the feature spaces and then evaluating their inner products with $\mathcal{W}_x$ and $\mathcal{W}_y$. More specifically, the projections can be carried out as

 \langle \mathcal{W}_x, \phi_x(x) \rangle = \langle \Phi_x W_x, \phi_x(x) \rangle = W_x^T K_x(X, x), (1.6)

with $K_x(X, x) = [\kappa_x(x_1, x), \ldots, \kappa_x(x_n, x)]^T$, and

 \langle \mathcal{W}_y, \phi_y(y) \rangle = \langle \Phi_y W_y, \phi_y(y) \rangle = W_y^T K_y(Y, y), (1.7)

with $K_y(Y, y) = [\kappa_y(y_1, y), \ldots, \kappa_y(y_n, y)]^T$.
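To make the dual formulation concrete, the following sketch (NumPy/SciPy with an RBF kernel; the function names and the small ridge term added to the constraints for numerical stability are our own assumptions, not part of the plain problem (1.5)) computes the first pair of dual vectors as a generalized eigenvalue problem.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(A, B, gamma=1.0):
    # A: d x n, B: d x m -> n x m Gram matrix of exp(-gamma * ||a - b||^2)
    d2 = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * d2)

def kcca_first_pair(Kx, Ky, reg=1e-3):
    """First pair of dual vectors (alpha, beta) for kernel CCA.

    Solves the symmetric generalized eigenproblem equivalent to (1.5),
    with a ridge term added to K^2 to keep the right-hand side definite.
    """
    n = Kx.shape[0]
    Z = np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + reg * np.eye(n), Z],
                  [Z, Ky @ Ky + reg * np.eye(n)]])
    vals, vecs = eigh(A, B)            # ascending eigenvalues
    top = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    alpha, beta = top[:n].copy(), top[n:].copy()
    # rescale so that alpha^T Kx^2 alpha = beta^T Ky^2 beta = 1
    alpha /= np.sqrt(alpha @ Kx @ Kx @ alpha)
    beta /= np.sqrt(beta @ Ky @ Ky @ beta)
    return alpha, beta, vals[-1]
```

Without the ridge term the constraint matrices $K_x^2$ and $K_y^2$ are typically singular, which is one symptom of the over-fitting problem discussed later.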

Both optimization problems (1.1) and (1.5) can be solved by considering generalized eigenvalue problems of the form

 A x = \lambda B x, (1.8)

where $A$ and $B$ are symmetric positive semi-definite. This generalized eigenvalue problem can be solved efficiently using approaches from numerical linear algebra. CCA and kernel CCA have been successfully applied in many fields, including cross-language document retrieval, content-based image retrieval, bioinformatics [46, 53], [2, 17], and the computation of principal angles between linear subspaces [6, 20].

Despite the wide usage of CCA and kernel CCA, they share a common limitation, namely the lack of sparsity in the transformation matrices $W_x$ and $W_y$ and the dual transformation matrices. Equation (1.2) shows that the projections of the data pair $(x, y)$ are linear combinations of the data themselves, which makes interpretation of the extracted features difficult if the transformation matrices $W_x$ and $W_y$ are dense. Similarly, from (1.6) and (1.7) we can see that the kernel functions $\kappa_x$ and $\kappa_y$ must be evaluated at all training data when the dual transformation matrices are dense, which can lead to excessive computational time for computing projections of new data. To handle this limitation of CCA, researchers have suggested incorporating sparsity into the weight vectors, and many papers have studied sparse CCA [9, 23, 35, 40, 41, 48, 50, 51, 52]. Similarly, we shall find sparse solutions for kernel CCA so that projections of new data can be computed by evaluating the kernel function at only a subset of the training data. Although there are many sparse kernel approaches, such as support vector machines, the relevance vector machine and sparse kernel partial least squares [14, 34], little work can be found in the area of sparse kernel CCA [13, 43].

In this paper we first consider a new sparse CCA approach and then generalize it to incorporate sparsity into kernel CCA. A relationship between CCA and least squares is established so that CCA solutions can be obtained by solving a least squares problem. We introduce sparsity by penalizing the $\ell_1$-norm of the solutions, which eventually leads to an $\ell_1$-norm penalized least squares optimization problem of the form

 \min_{x \in \mathbb{R}^d} \; \frac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1,

where $\lambda > 0$ is a regularization parameter controlling the sparsity of $x$. We adopt a fixed-point continuation (FPC) method [21, 22] to solve the $\ell_1$-norm regularized least squares problem above, which results in a new sparse CCA algorithm (SCCALS). Since the optimization criteria of CCA and kernel CCA have the same form, the same idea can be extended to kernel CCA to get a sparse kernel CCA algorithm (SKCCA).

The remainder of the paper is organized as follows. In Section 2, we present background results on both CCA and kernel CCA, including a full parameterization of the general solutions of CCA and a detailed derivation of kernel CCA. In Section 3, we first establish a relationship between CCA and least squares problems; based on this relationship, we then propose to incorporate sparsity into CCA by penalizing the least squares objective with the $\ell_1$-norm. Solving the penalized least squares problems by FPC leads to a new sparse CCA algorithm, SCCALS. In Section 4, we extend the idea behind SCCALS to its kernel counterpart, which results in a novel sparse kernel CCA algorithm, SKCCA. Numerical results of applying the newly proposed algorithms to various applications, together with comparative empirical results for other algorithms, are presented in Section 5. Finally, we draw some concluding remarks in Section 6.

2 Background

In this section we provide background results on CCA and kernel CCA so as to make the paper self-contained. In the first subsection, we present the full parameterization of the general solutions of CCA and related results; in the second subsection, based on the parameterization in the previous subsection, we give a detailed derivation of kernel CCA.

2.1 Canonical correlation analysis

As stated in the Introduction, by solving (1.1), or equivalently

 \min_{w_x, w_y} \; \|X^T w_x - Y^T w_y\|_2^2 \quad \text{s.t.} \quad w_x^T X X^T w_x = 1, \; w_y^T Y Y^T w_y = 1, (2.1)

we can get a pair of weight vectors $w_x$ and $w_y$ for CCA. One pair of weight vectors is not enough for most practical problems, however. To obtain multiple projections of CCA, we recursively solve the following optimization problem

 (w_x^k, w_y^k) = \arg\max_{w_x, w_y} \; w_x^T X Y^T w_y
 \text{s.t.} \; w_x^T X X^T w_x = 1, \; X^T w_x \perp \{X^T w_x^1, \cdots, X^T w_x^{k-1}\},
 \phantom{\text{s.t.} \;} w_y^T Y Y^T w_y = 1, \; Y^T w_y \perp \{Y^T w_y^1, \cdots, Y^T w_y^{k-1}\}, \qquad k = 2, \cdots, l, (2.2)

where $l$ is the number of projections we need. The unit vectors $X^T w_x^k$ and $Y^T w_y^k$ in (2.2) are called the $k$th pair of canonical variables. If we denote

 W_x = [w_x^1 \cdots w_x^l] \in \mathbb{R}^{d_1 \times l}, \qquad W_y = [w_y^1 \cdots w_y^l] \in \mathbb{R}^{d_2 \times l},

then it can be shown that the optimization problem above is equivalent to

 \max_{W_x, W_y} \; \mathrm{Trace}(W_x^T X Y^T W_y) \quad \text{s.t.} \quad W_x^T X X^T W_x = I, \; W_x \in \mathbb{R}^{d_1 \times l}, \quad W_y^T Y Y^T W_y = I, \; W_y \in \mathbb{R}^{d_2 \times l}. (2.3)

Hence, optimization problem (2.3) will be used as the criterion of CCA.

A solution of (2.3) can be obtained by solving a generalized eigenvalue problem of the form (1.8). Furthermore, we can fully characterize all solutions of the optimization problem (2.3). Define

 r = \mathrm{rank}(X), \quad s = \mathrm{rank}(Y), \quad m = \mathrm{rank}(XY^T), \quad t = \min\{r, s\}.

Let the (reduced) SVD factorizations of $X$ and $Y$ be, respectively,

 X = U \begin{bmatrix} \Sigma_1 \\ 0 \end{bmatrix} Q_1^T = [U_1 \; U_2] \begin{bmatrix} \Sigma_1 \\ 0 \end{bmatrix} Q_1^T = U_1 \Sigma_1 Q_1^T, (2.4)

and

 Y = V \begin{bmatrix} \Sigma_2 \\ 0 \end{bmatrix} Q_2^T = [V_1 \; V_2] \begin{bmatrix} \Sigma_2 \\ 0 \end{bmatrix} Q_2^T = V_1 \Sigma_2 Q_2^T, (2.5)

where

 U \in \mathbb{R}^{d_1 \times d_1}, \; U_1 \in \mathbb{R}^{d_1 \times r}, \; U_2 \in \mathbb{R}^{d_1 \times (d_1 - r)}, \; \Sigma_1 \in \mathbb{R}^{r \times r}, \; Q_1 \in \mathbb{R}^{n \times r},
 V \in \mathbb{R}^{d_2 \times d_2}, \; V_1 \in \mathbb{R}^{d_2 \times s}, \; V_2 \in \mathbb{R}^{d_2 \times (d_2 - s)}, \; \Sigma_2 \in \mathbb{R}^{s \times s}, \; Q_2 \in \mathbb{R}^{n \times s},

$U$ and $V$ are orthogonal, $\Sigma_1$ and $\Sigma_2$ are nonsingular and diagonal, and $Q_1$ and $Q_2$ are column orthogonal. It follows from the two orthogonality constraints in (2.3) that

 l \le \min\{\mathrm{rank}(X), \mathrm{rank}(Y)\} = \min\{r, s\} = t. (2.6)

Next, let

 Q_1^T Q_2 = P_1 \Sigma P_2^T (2.7)

be the singular value decomposition of $Q_1^T Q_2$, where $P_1 \in \mathbb{R}^{r \times r}$ and $P_2 \in \mathbb{R}^{s \times s}$ are orthogonal and $\Sigma \in \mathbb{R}^{r \times s}$ is diagonal. Assume there are $q$ distinct nonzero singular values $\sigma_1 > \cdots > \sigma_q > 0$ with multiplicities $m_1, \ldots, m_q$, respectively; then

 m = \sum_{i=1}^q m_i = \mathrm{rank}(Q_1^T Q_2) \le \min\{r, s\} = t.
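These quantities are straightforward to compute numerically. The sketch below (NumPy; the helper names are ours) forms the factorizations (2.4)-(2.7) and returns $r$, $s$, $m$, $t$ together with the singular values of $Q_1^T Q_2$.

```python
import numpy as np

def cca_structure(X, Y, tol=1e-10):
    """Compute r, s, m, t of Section 2.1 via the factorizations (2.4)-(2.7).

    Also returns the singular values sigma of Q1^T Q2.
    """
    def thin_svd(M):
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        k = int((S > tol * S[0]).sum())       # numerical rank
        return U[:, :k], S[:k], Vt[:k, :].T   # U1, Sigma1, Q1
    U1, S1, Q1 = thin_svd(X)
    V1, S2, Q2 = thin_svd(Y)
    r, s = S1.size, S2.size
    sigma = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    m = int((sigma > tol).sum())
    t = min(r, s)
    return r, s, m, t, sigma
```

Since $Q_1$ and $Q_2$ are column orthogonal, the singular values of $Q_1^T Q_2$ all lie in $[0, 1]$, consistent with the bound $m \le t$ above.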

The full characterization of $W_x$ and $W_y$ is given in the following theorem.

Theorem 2.1.

i). If $l = \alpha_k := \sum_{i=1}^k m_i$ for some $k$ satisfying $1 \le k \le q$, then $(W_x, W_y)$ with $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$ is a solution of optimization problem (2.3) if and only if

 W_x = U_1 \Sigma_1^{-1} P_1(:, 1:l) W + U_2 E, \qquad W_y = V_1 \Sigma_2^{-1} P_2(:, 1:l) W + V_2 F, (2.8)

where $W \in \mathbb{R}^{l \times l}$ is orthogonal, and $E \in \mathbb{R}^{(d_1 - r) \times l}$ and $F \in \mathbb{R}^{(d_2 - s) \times l}$ are arbitrary.

ii). If $\alpha_k < l < \alpha_{k+1}$ for some $k$ satisfying $0 \le k < q$, where $\alpha_0 = 0$ and $\alpha_k = \sum_{i=1}^k m_i$, then $(W_x, W_y)$ with $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$ is a solution of optimization problem (2.3) if and only if

 W_x = U_1 \Sigma_1^{-1} [P_1(:, 1:\alpha_k) \;\; P_1(:, 1+\alpha_k : \alpha_{k+1}) G] W + U_2 E,
 W_y = V_1 \Sigma_2^{-1} [P_2(:, 1:\alpha_k) \;\; P_2(:, 1+\alpha_k : \alpha_{k+1}) G] W + V_2 F, (2.9)

where $W \in \mathbb{R}^{l \times l}$ is orthogonal, $G \in \mathbb{R}^{m_{k+1} \times (l - \alpha_k)}$ is column orthogonal, and $E$ and $F$ are arbitrary.

iii). If $m < l \le t$, then $(W_x, W_y)$ with $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$ is a solution of optimization problem (2.3) if and only if

 W_x = U_1 \Sigma_1^{-1} [P_1(:, 1:m) \;\; P_1(:, m+1 : r) G_1] W + U_2 E,
 W_y = V_1 \Sigma_2^{-1} [P_2(:, 1:m) \;\; P_2(:, m+1 : s) G_2] W + V_2 F, (2.10)

where $W \in \mathbb{R}^{l \times l}$ is orthogonal, $G_1 \in \mathbb{R}^{(r - m) \times (l - m)}$ and $G_2 \in \mathbb{R}^{(s - m) \times (l - m)}$ are column orthogonal, and $E$ and $F$ are arbitrary.

An immediate application of Theorem 2.1 is that we can prove that Uncorrelated Linear Discriminant Analysis (ULDA) [8, 27, 55] is a special case of CCA when one set of variables is derived from the data matrix and the other set of variables is constructed from class information. This theorem has also been utilized to design a sparse CCA algorithm.

2.2 Kernel canonical correlation analysis

Now, we look at some details of the derivation of kernel CCA. Note from Theorem 2.1 that each solution of CCA can be expressed as

 W_x = X \widetilde{W}_x + W_x^{\perp}, \qquad W_y = Y \widetilde{W}_y + W_y^{\perp},

where $W_x^{\perp}$ and $W_y^{\perp}$ are orthogonal to the range spaces of $X$ and $Y$, respectively. Since, intrinsically, kernel CCA performs ordinary CCA on $\Phi_x$ and $\Phi_y$, it follows that the solutions of kernel CCA should be obtained by virtually solving

 \max_{\mathcal{W}_x, \mathcal{W}_y} \; \mathrm{Trace}(\mathcal{W}_x^T \Phi_x \Phi_y^T \mathcal{W}_y) \quad \text{s.t.} \quad \mathcal{W}_x^T \Phi_x \Phi_x^T \mathcal{W}_x = I, \; \mathcal{W}_x \in \mathbb{R}^{N_x \times l}, \quad \mathcal{W}_y^T \Phi_y \Phi_y^T \mathcal{W}_y = I, \; \mathcal{W}_y \in \mathbb{R}^{N_y \times l}. (2.11)

Similar to ordinary CCA, each solution of (2.11) can be represented as

 \mathcal{W}_x = \Phi_x W_x + W_x^{\perp}, \qquad \mathcal{W}_y = \Phi_y W_y + W_y^{\perp}, (2.12)

where $W_x, W_y \in \mathbb{R}^{n \times l}$ are usually called dual transformation matrices, and $W_x^{\perp}$ and $W_y^{\perp}$ are orthogonal to the range spaces of $\Phi_x$ and $\Phi_y$, respectively.

Substituting (2.12) into (2.11), we have

 \mathcal{W}_x^T \Phi_x \Phi_y^T \mathcal{W}_y = W_x^T K_x K_y W_y, \quad \mathcal{W}_x^T \Phi_x \Phi_x^T \mathcal{W}_x = W_x^T K_x^2 W_x, \quad \mathcal{W}_y^T \Phi_y \Phi_y^T \mathcal{W}_y = W_y^T K_y^2 W_y.

Thus, the computation of the transformations of kernel CCA can be converted to the computation of the dual transformation matrices $W_x$ and $W_y$ by solving the following optimization problem

 \max_{W_x, W_y} \; \mathrm{Trace}(W_x^T K_x K_y W_y) \quad \text{s.t.} \quad W_x^T K_x^2 W_x = I, \; W_x \in \mathbb{R}^{n \times l}, \quad W_y^T K_y^2 W_y = I, \; W_y \in \mathbb{R}^{n \times l}, (2.13)

which is used as the criterion of kernel CCA in this paper.

As can be seen from the analysis above, the terms $W_x^{\perp}$ and $W_y^{\perp}$ in (2.12) do not contribute to the canonical correlations between $\Phi_x$ and $\Phi_y$ and are thus usually neglected in practice. Therefore, when we are given a set of testing data $X_t$ consisting of $N$ points, the projection of $X_t$ onto the kernel CCA directions can be performed by first mapping $X_t$ into the feature space and then computing its inner product with $\mathcal{W}_x = \Phi_x W_x$. More specifically, suppose $\Phi_{x,t}$ is the image of $X_t$ in the feature space; then the projection of $X_t$ onto the kernel CCA directions is given by

 \mathcal{W}_x^T \Phi_{x,t} = W_x^T K_{x,t},

where $K_{x,t} = \langle \Phi_x, \Phi_{x,t} \rangle \in \mathbb{R}^{n \times N}$ is the matrix consisting of the kernel evaluations of $X_t$ with all training data $X$. A similar process can be adopted to compute projections of new data drawn from variable $y$.

In the process of deriving (2.13), we assumed the data $\Phi_x$ and $\Phi_y$ have been centered (that is, the column means of both $\Phi_x$ and $\Phi_y$ are zero); otherwise, we need to perform data centering before applying kernel CCA. Unlike data centering of $X$ and $Y$, we cannot perform centering directly on $\Phi_x$ and $\Phi_y$ since we do not know their explicit coordinates. However, as shown in [38, 37], data centering in an RKHS can be accomplished via operations on kernel matrices. To center $\Phi_x$, a natural idea would be to compute $\Phi_{x,c} = \Phi_x (I - \frac{e_n e_n^T}{n})$, where $e_n$ denotes the column vector in $\mathbb{R}^n$ with all entries equal to 1. However, since kernel CCA makes use of the data only through the kernel matrix $K_x$, the centering process can be performed on $K_x$ as

 K_{x,c} = \langle \Phi_{x,c}, \Phi_{x,c} \rangle = (I - \frac{e_n e_n^T}{n}) \langle \Phi_x, \Phi_x \rangle (I - \frac{e_n e_n^T}{n}) = (I - \frac{e_n e_n^T}{n}) K_x (I - \frac{e_n e_n^T}{n}). (2.14)

Similarly, we can center the testing data as

 K_{x,t,c} = \langle \Phi_{x,c}, \Phi_{x,t} - \Phi_x \frac{e_n e_N^T}{n} \rangle = (I - \frac{e_n e_n^T}{n}) K_{x,t} - (I - \frac{e_n e_n^T}{n}) K_x \frac{e_n e_N^T}{n}. (2.15)

More details about data centering in an RKHS can be found in [38, 37]. In the remainder of this paper, we assume the given data have been centered.
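The centering formulas (2.14) and (2.15) translate directly into code. The sketch below (NumPy; the function names are ours) implements both; for a linear kernel it reproduces the Gram matrices of explicitly centered data, which gives a convenient sanity check.

```python
import numpy as np

def center_train_kernel(K):
    """Center an n x n training Gram matrix as in (2.14): H K H, H = I - e e^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def center_test_kernel(K_t, K):
    """Center an n x N train/test kernel block K_t as in (2.15)."""
    n, N = K_t.shape
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K_t - H @ K @ np.ones((n, N)) / n
```

Note that every row and column of the centered training kernel sums to zero, reflecting the zero-mean constraint in the feature space.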

Several papers have studied properties of kernel CCA, including its geometry and its statistical consistency. In the remainder of this paper, we consider sparse kernel CCA. Before that, we explore a relation between CCA and least squares in the next section.

3 Sparse CCA based on least squares formulation

Note from (2.1) that when one of $X$ and $Y$ is one dimensional, CCA is equivalent to least squares estimation for a linear regression problem. For more general cases, relations between CCA and linear regression have been established under additional conditions on $X$ and $Y$. In this section, we establish a relation between CCA and linear regression without any additional constraint on $X$ and $Y$. Moreover, based on this relation we design a new sparse CCA algorithm.

We focus on a solution subset of optimization problem (2.3) presented in the following lemma, whose proof is trivial and omitted.

Lemma 3.1.

Any pair $(W_x, W_y)$ of the form

 W_x = U_1 \Sigma_1^{-1} P_1(:, 1:l) + U_2 E, \qquad W_y = V_1 \Sigma_2^{-1} P_2(:, 1:l) + V_2 F, (3.1)

is a solution of optimization problem (2.3), where $E$ and $F$ are arbitrary.

Suppose the matrix factorizations (2.4)-(2.7) have been computed, and let

 T_x = Y^T [(YY^T)^{\frac{1}{2}}]^{\dagger} V_1 P_2(:, 1:l) \Sigma(1:l, 1:l)^{-1} = Q_2 P_2(:, 1:l) \Sigma(1:l, 1:l)^{-1}, (3.2)
 T_y = X^T [(XX^T)^{\frac{1}{2}}]^{\dagger} U_1 P_1(:, 1:l) \Sigma(1:l, 1:l)^{-1} = Q_1 P_1(:, 1:l) \Sigma(1:l, 1:l)^{-1}, (3.3)

where $(\cdot)^{\dagger}$ denotes the Moore-Penrose inverse of a matrix; then we have the following theorem.

Theorem 3.2.

For any $l$ satisfying $l \le m$, suppose $W_x$ and $W_y$ satisfy

 W_x = \arg\min \{ \|X^T W_x - T_x\|_F^2 : W_x \in \mathbb{R}^{d_1 \times l} \}, (3.4)

and

 W_y = \arg\min \{ \|Y^T W_y - T_y\|_F^2 : W_y \in \mathbb{R}^{d_2 \times l} \}, (3.5)

where $T_x$ and $T_y$ are defined in (3.2) and (3.3), respectively. Then $W_x$ and $W_y$ form a solution of optimization problem (2.3).

Proof.

Since (3.4) and (3.5) have the same form, we only prove the result for $W_x$; the same idea can be applied to $W_y$.

We know that $W_x$ is a solution of (3.4) if and only if it satisfies the normal equation

 X X^T W_x = X T_x. (3.6)

Substituting factorizations (2.4), (2.5) and (2.7) into the equation above, we get

 X X^T = U_1 \Sigma_1^2 U_1^T,

and

 X T_x = U_1 \Sigma_1 Q_1^T Q_2 P_2(:, 1:l) \Sigma(1:l, 1:l)^{-1} = U_1 \Sigma_1 P_1(:, 1:l),

which yield an equivalent reformulation of (3.6)

 U_1 \Sigma_1^2 U_1^T W_x = U_1 \Sigma_1 P_1(:, 1:l). (3.7)

It is easy to check that $W_x$ is a solution of (3.7) if and only if

 W_x = U_1 \Sigma_1^{-1} P_1(:, 1:l) + U_2 E, (3.8)

where $E \in \mathbb{R}^{(d_1 - r) \times l}$ is an arbitrary matrix. Therefore, $W_x$ is a solution of (3.4) if and only if it can be formulated as (3.8).

Similarly, $W_y$ is a solution of (3.5) if and only if it can be written as

 W_y = V_1 \Sigma_2^{-1} P_2(:, 1:l) + V_2 F, (3.9)

where $F \in \mathbb{R}^{(d_2 - s) \times l}$ is an arbitrary matrix.

Now, comparing equations (3.8) and (3.9) with equation (3.1) in Lemma 3.1, we conclude that for any solution $W_x$ of the least squares problem (3.4) and any solution $W_y$ of the least squares problem (3.5), $W_x$ and $W_y$ form a solution of optimization problem (2.3), hence a solution of CCA. ∎
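Theorem 3.2 can be checked numerically. The sketch below (NumPy; the names are ours, and we assume full-rank data with $l \le m$) builds $T_x$ and $T_y$ from (3.2)-(3.3) and recovers a CCA solution by solving the two least squares problems.

```python
import numpy as np

def cca_via_least_squares(X, Y, l):
    """Recover a CCA solution through the least squares problems (3.4)-(3.5).

    T_x = Q2 P2(:,1:l) Sigma(1:l,1:l)^{-1}, T_y = Q1 P1(:,1:l) Sigma(1:l,1:l)^{-1},
    then W_x = argmin ||X^T W - T_x||_F, W_y = argmin ||Y^T W - T_y||_F.
    """
    U1, S1, Q1t = np.linalg.svd(X, full_matrices=False)   # (2.4)
    V1, S2, Q2t = np.linalg.svd(Y, full_matrices=False)   # (2.5)
    Q1, Q2 = Q1t.T, Q2t.T
    P1, sigma, P2t = np.linalg.svd(Q1.T @ Q2)             # (2.7)
    P2 = P2t.T
    Tx = Q2 @ P2[:, :l] / sigma[:l]      # scale column i by 1/sigma_i
    Ty = Q1 @ P1[:, :l] / sigma[:l]
    Wx = np.linalg.lstsq(X.T, Tx, rcond=None)[0]
    Wy = np.linalg.lstsq(Y.T, Ty, rcond=None)[0]
    return Wx, Wy, sigma[:l]
```

The returned pair satisfies the constraints of (2.3), and $W_x^T X Y^T W_y$ equals the diagonal matrix of the leading canonical correlations, exactly as the proof predicts.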

Remark 3.1.

In Theorem 3.2 we only consider $l$ satisfying $l \le m$. This is reasonable, since there are $m$ nonzero canonical correlations between $X$ and $Y$, and weight vectors corresponding to zero canonical correlations do not contribute to the correlation between the data $X$ and $Y$.

Consider the usual regression situation: we have a set of observations $(x_i, b_i)$, $i = 1, \ldots, n$, where $x_i \in \mathbb{R}^d$ and $b_i$ are the regressor and response for the $i$th observation. Suppose the data have been centered; then the linear regression model has the form

 f(x) = \sum_{j=1}^d x_j \beta_j,

where $x_j$ denotes the $j$th component of $x$, and aims to estimate $\beta$ so as to predict an output for each input $x$. The famous least squares estimation minimizes the residual sum of squares

 \mathrm{Res}(\beta) = \|X^T \beta - b\|_2^2.

Therefore, (3.4) and (3.5) can be interpreted as least squares estimations of linear regression problems with the columns of $X$ and $Y$ being the regressors and the rows of $T_x$ and $T_y$ being the corresponding responses.

Recent research on the lasso shows that simultaneous sparsity and regression can be achieved by penalizing the $\ell_1$-norm of the variables. Motivated by this, we incorporate sparsity into CCA via the established relationship between CCA and least squares, by considering the following $\ell_1$-norm penalized least squares problems

 \min_{W_x} \{ \frac{1}{2} \|X^T W_x - T_x\|_F^2 + \sum_{i=1}^l \lambda_{x,i} \|W_{x,i}\|_1 : W_x \in \mathbb{R}^{d_1 \times l} \}, (3.10)

and

 \min_{W_y} \{ \frac{1}{2} \|Y^T W_y - T_y\|_F^2 + \sum_{i=1}^l \lambda_{y,i} \|W_{y,i}\|_1 : W_y \in \mathbb{R}^{d_2 \times l} \}, (3.11)

where $\lambda_{x,i}, \lambda_{y,i} > 0$ are regularization parameters and $W_{x,i}$, $W_{y,i}$ are the $i$th columns of $W_x$ and $W_y$, respectively. When we set $\lambda_{x,1} = \cdots = \lambda_{x,l} = \lambda_x$ and $\lambda_{y,1} = \cdots = \lambda_{y,l} = \lambda_y$, problems (3.10) and (3.11) become

 \min_{W_x} \{ \frac{1}{2} \|X^T W_x - T_x\|_F^2 + \lambda_x \|W_x\|_1 : W_x \in \mathbb{R}^{d_1 \times l} \}, (3.12)

and

 \min_{W_y} \{ \frac{1}{2} \|Y^T W_y - T_y\|_F^2 + \lambda_y \|W_y\|_1 : W_y \in \mathbb{R}^{d_2 \times l} \}, (3.13)

where

 \|W_x\|_1 = \sum_{i=1}^{d_1} \sum_{j=1}^{l} |W_x(i,j)|, \qquad \|W_y\|_1 = \sum_{i=1}^{d_2} \sum_{j=1}^{l} |W_y(i,j)|.

Since (3.10) and (3.11) (and likewise (3.12) and (3.13)) have the same form, all results holding for one problem naturally extend to the other, so we concentrate on (3.10). Optimization problem (3.10) reduces to an $\ell_1$-regularized minimization problem of the form

 \min_{x \in \mathbb{R}^d} \; \frac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1, (3.14)

when $l = 1$. In the field of compressed sensing, (3.14) has been intensively studied as the basis pursuit denoising problem, and many efficient approaches have been proposed to solve it; see [3, 15, 21, 54]. In this paper we adopt the fixed-point continuation (FPC) method [21, 22], due to its simple implementation and nice convergence properties.

The fixed-point algorithm for (3.14) is an iterative method which updates the iterates as

 x^{k+1} = S_{\nu}(x^k - \tau A^T (A x^k - b)), \quad \text{with } \nu = \tau \lambda, (3.15)

where $\tau$ denotes the step size, and $S_{\nu}$ is the soft-thresholding operator defined as

 S_{\nu}(x) = [S_{\nu}(x_1) \cdots S_{\nu}(x_d)]^T

with

 S_{\nu}(\omega) = \mathrm{sign}(\omega) \max\{|\omega| - \nu, 0\}, \quad \omega \in \mathbb{R}. (3.16)

$S_{\nu}$ reduces any component with magnitude less than $\nu$ to zero, thus reducing the $\ell_1$-norm and introducing sparsity.
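The operator (3.16) is one line in code; a minimal NumPy sketch (our own function name), applied componentwise as in (3.15):

```python
import numpy as np

def soft_threshold(x, nu):
    """Soft-thresholding operator S_nu of (3.16), applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - nu, 0.0)
```

For example, `soft_threshold(np.array([3.0, -0.5, 1.5]), 1.0)` gives `[2.0, 0.0, 0.5]`: the component with magnitude below the threshold is set exactly to zero, while the others are shrunk toward zero by `nu`.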

The fixed-point algorithm can be naturally extended to solve (3.10), which yields

 W_{x,i}^{k+1} = S_{\nu_{x,i}}(W_{x,i}^k - \tau_x X (X^T W_{x,i}^k - T_{x,i})), \quad i = 1, \cdots, l, (3.17)

where $\nu_{x,i} = \tau_x \lambda_{x,i}$, with $\tau_x$ denoting the step size. We can prove that the fixed-point iterations have some nice convergence properties, which are presented in the following theorem.
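A minimal version of the single-vector iteration (3.15), without the continuation strategy of the full FPC method, can be sketched as follows (the names and the default step size are our own choices).

```python
import numpy as np

def fixed_point_l1(A, b, lam, tau=None, tol=1e-10, max_iter=10000):
    """Fixed-point iteration (3.15) for min 0.5 ||Ax - b||^2 + lam ||x||_1.

    The step size must satisfy 0 < tau < 2 / ||A^T A||_2 for convergence;
    the default picks tau = 1 / ||A^T A||_2.
    """
    if tau is None:
        tau = 1.0 / np.linalg.norm(A.T @ A, 2)
    nu = tau * lam
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = x - tau * (A.T @ (A @ x - b))            # gradient step
        x_new = np.sign(g) * np.maximum(np.abs(g) - nu, 0.0)  # shrinkage step
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            return x_new
        x = x_new
    return x
```

At a solution, the optimality conditions of (3.14) hold: the gradient $A^T(Ax - b)$ has magnitude at most $\lambda$ in every coordinate, and equals $-\lambda\,\mathrm{sign}(x_i)$ on the support, which gives a simple way to verify the iterate.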

Theorem 3.3.

Let $\Omega$ be the solution set of (3.10); then there exists $M^*$ such that

 X(X^T W_x - T_x) \equiv M^*, \quad \forall \, W_x \in \Omega. (3.18)