## 1 Introduction

The description of the relationship between two sets of variables has long been of interest to many researchers. Canonical correlation analysis (CCA), originally introduced in [26], is a multivariate statistical technique for finding linear relationships between two sets of variables. The two sets of variables can be considered as different views of the same object or as views of different objects, and are assumed to contain joint information reflected in the correlations between them. CCA seeks a linear transformation for each of the two sets of variables such that the projected variables in the transformed space are maximally correlated.

Let $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^{p}$ and $\{y_i\}_{i=1}^{n} \subset \mathbb{R}^{q}$ be samples for variables $x$ and $y$, respectively. Denote

$$X = [x_1, \ldots, x_n] \in \mathbb{R}^{p \times n}, \qquad Y = [y_1, \ldots, y_n] \in \mathbb{R}^{q \times n},$$

and assume both $X$ and $Y$ have zero mean, i.e., $X\mathbf{1}_n = 0$ and $Y\mathbf{1}_n = 0$, where $\mathbf{1}_n$ denotes the column vector in $\mathbb{R}^n$ with all entries being 1. Then CCA solves the following optimization problem

$$\max_{w_x,\, w_y} \ w_x^T X Y^T w_y \quad \text{s.t.} \quad w_x^T X X^T w_x = 1, \ \ w_y^T Y Y^T w_y = 1, \tag{1.1}$$

to get the first pair of weight vectors $w_x \in \mathbb{R}^p$ and $w_y \in \mathbb{R}^q$, which are further utilized to obtain the first pair of canonical variables $X^T w_x$ and $Y^T w_y$, respectively. For the remaining pairs of weight vectors and canonical variables, CCA sequentially solves the same problem as (1.1) with additional constraints of orthogonality among canonical variables. Suppose we have obtained a pair of linear transformations $W_x \in \mathbb{R}^{p \times l}$ and $W_y \in \mathbb{R}^{q \times l}$; then for a pair of new data $(x, y)$, its projection into the new coordinate system determined by $(W_x, W_y)$ is

$$\left( W_x^T x, \ W_y^T y \right). \tag{1.2}$$
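To make (1.1) concrete, here is a minimal NumPy sketch on synthetic toy data of our own choosing; the SVD-based whitening solver below is one standard route (not necessarily the implementation used in this paper). It computes the first pair of weight vectors and checks that the projected variables in (1.2) attain the computed correlation.

```python
import numpy as np

def first_canonical_pair(X, Y, eps=1e-10):
    """Solve (1.1): find w_x, w_y maximizing the correlation of X^T w_x and Y^T w_y.

    X is p x n, Y is q x n; both are assumed centered (zero row means).
    Whitens each view via its SVD, then takes the leading singular vectors.
    """
    Ux, sx, Vtx = np.linalg.svd(X, full_matrices=False)
    Uy, sy, Vty = np.linalg.svd(Y, full_matrices=False)
    rx = int((sx > eps * sx[0]).sum())   # numerical ranks
    ry = int((sy > eps * sy[0]).sum())
    Ux, sx, Vtx = Ux[:, :rx], sx[:rx], Vtx[:rx]
    Uy, sy, Vty = Uy[:, :ry], sy[:ry], Vty[:ry]
    P, sig, Qt = np.linalg.svd(Vtx @ Vty.T)  # correlations between whitened views
    w_x = Ux @ (P[:, 0] / sx)   # undo the whitening: w_x = U_x Sigma_x^{-1} p_1
    w_y = Uy @ (Qt[0] / sy)
    return w_x, w_y, sig[0]

# Hypothetical toy data sharing one latent signal z across the two views.
rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)
X = np.vstack([z + 0.1 * rng.standard_normal(n), rng.standard_normal(n)])
Y = np.vstack([-z + 0.1 * rng.standard_normal(n), rng.standard_normal(n)])
X -= X.mean(axis=1, keepdims=True)   # enforce the zero-mean assumption
Y -= Y.mean(axis=1, keepdims=True)

w_x, w_y, rho = first_canonical_pair(X, Y)
# The projections (1.2) of the training samples correlate exactly with rho.
corr = np.corrcoef(X.T @ w_x, Y.T @ w_y)[0, 1]
print(round(rho, 4), round(corr, 4))
```

Since both views contain the shared latent signal, the computed canonical correlation is close to 1 on this toy data.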

Since CCA only considers linear transformations of the original variables, it cannot capture nonlinear relations among variables. However, in a wide range of practical problems linear relations may not be adequate for studying relations among variables. Detecting nonlinear relations among data is important and useful in modern data analysis, especially when dealing with data that are not in the form of vectors, such as text documents, images, microarray data and so on. A natural extension, therefore, is to explore and exploit nonlinear relations among data. Nonlinear CCA has received wide attention [11, 30]; the most frequently used approach is the kernel generalization of CCA, named kernel canonical correlation analysis (kernel CCA). Motivated by the development and successful applications of kernel learning methods [37, 39], such as support vector machines (SVM) [7, 37], kernel principal component analysis (KPCA) [38], kernel Fisher discriminant analysis [33], kernel partial least squares [36] and so on, a large body of research on kernel CCA has emerged [1, 32, 2, 16, 17, 25, 24, 29, 30, 39]. Kernel methods have attracted a great deal of attention in the field of nonlinear data analysis. In kernel methods, we first implicitly represent data as elements in reproducing kernel Hilbert spaces associated with positive definite kernels, then apply linear algorithms to the data, substituting kernel functions for the linear inner products, which results in nonlinear variants. The main idea of kernel CCA is to first virtually map the data into a high-dimensional feature space $\mathcal{F}_x$ via a mapping $\phi$ such that the data in the feature space become

$$\Phi_x = \left[\phi(x_1), \phi(x_2), \ldots, \phi(x_n)\right] \in \mathbb{R}^{d_x \times n},$$

where $d_x$ is the dimension of the feature space, which can be very high or even infinite. The mapping from the input data to the feature space is performed implicitly by considering a positive definite kernel function $\mathcal{K}_x$ satisfying

$$\mathcal{K}_x(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle, \tag{1.3}$$

where $\langle \cdot, \cdot \rangle$ is the inner product in $\mathcal{F}_x$, rather than by giving the coordinates of $\phi(x)$ explicitly. The feature space $\mathcal{F}_x$ is known as the reproducing kernel Hilbert space (RKHS) [49] associated with the kernel function $\mathcal{K}_x$. In the same way, we can map $\{y_i\}_{i=1}^{n}$ into a feature space $\mathcal{F}_y$ associated with a kernel $\mathcal{K}_y$ through a mapping $\psi$ such that

$$\Phi_y = \left[\psi(y_1), \psi(y_2), \ldots, \psi(y_n)\right] \in \mathbb{R}^{d_y \times n}.$$

After mapping $x_i$ to $\phi(x_i)$ and $y_i$ to $\psi(y_i)$, we then apply ordinary linear CCA to the data pair $(\Phi_x, \Phi_y)$.

Let

$$K_x = \Phi_x^T \Phi_x, \qquad K_y = \Phi_y^T \Phi_y \tag{1.4}$$

be the $n \times n$ matrices consisting of inner products of the datasets $\{\phi(x_i)\}_{i=1}^{n}$ and $\{\psi(y_i)\}_{i=1}^{n}$, respectively; $K_x$ and $K_y$ are called kernel matrices or Gram matrices. Then kernel CCA seeks linear transformations in the feature space by expressing the weight vectors as linear combinations of the training data, that is,

$$w_x = \Phi_x \alpha, \qquad w_y = \Phi_y \beta,$$

where $\alpha, \beta \in \mathbb{R}^{n}$ are called dual vectors. The first pair of dual vectors can be determined by solving the following optimization problem

$$\max_{\alpha,\, \beta} \ \alpha^T K_x K_y \beta \quad \text{s.t.} \quad \alpha^T K_x^2 \alpha = 1, \ \ \beta^T K_y^2 \beta = 1. \tag{1.5}$$

The remaining pairs of dual vectors are obtained by sequentially solving the same problem as (1.5) with extra constraints of orthogonality. More details on the derivation of kernel CCA are presented in Section 2.

Suppose we have obtained dual transformation matrices $A = [\alpha_1, \ldots, \alpha_l]$ and $B = [\beta_1, \ldots, \beta_l]$ and the corresponding CCA transformations $W_x = \Phi_x A$ and $W_y = \Phi_y B$ in the feature spaces. Then the projection of a data pair $(x, y)$ onto the kernel CCA directions can be computed by first mapping $x$ and $y$ into the feature spaces $\mathcal{F}_x$ and $\mathcal{F}_y$, then evaluating their inner products with $W_x$ and $W_y$. More specifically, the projections can be carried out as

$$W_x^T \phi(x) = A^T \Phi_x^T \phi(x) = A^T k_x \tag{1.6}$$

with $k_x = \left[\mathcal{K}_x(x_1, x), \ldots, \mathcal{K}_x(x_n, x)\right]^T$, and

$$W_y^T \psi(y) = B^T \Phi_y^T \psi(y) = B^T k_y \tag{1.7}$$

with $k_y = \left[\mathcal{K}_y(y_1, y), \ldots, \mathcal{K}_y(y_n, y)\right]^T$.
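As an illustration of (1.4)–(1.7), the following NumPy/SciPy sketch builds Gram matrices with a Gaussian kernel on hypothetical toy data, solves a *regularized* variant of (1.5) as a generalized eigenvalue problem, and projects a new point via (1.6). The ridge term `eps` is our own addition for numerical stability; the plain problem (1.5) is ill-posed when the Gram matrices are singular.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_gram(U, V, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||u_i - v_j||^2), cf. (1.3)-(1.4)."""
    d2 = (U**2).sum(1)[:, None] + (V**2).sum(1)[None, :] - 2.0 * U @ V.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
n = 60
t = rng.uniform(-1.0, 1.0, n)
Xs = np.column_stack([t, t**2]) + 0.05 * rng.standard_normal((n, 2))  # samples x_i
Ys = np.sin(np.pi * t)[:, None] + 0.05 * rng.standard_normal((n, 1))  # samples y_i

Kx, Ky = rbf_gram(Xs, Xs), rbf_gram(Ys, Ys)

# Regularized form of (1.5) as a generalized eigenvalue problem:
# maximize a^T Kx Ky b  s.t.  a^T (Kx^2 + eps I) a = b^T (Ky^2 + eps I) b = 1.
eps = 1e-3 * n
Z = np.zeros((n, n))
A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
B = np.block([[Kx @ Kx + eps * np.eye(n), Z], [Z, Ky @ Ky + eps * np.eye(n)]])
vals, vecs = eigh(A, B)                    # eigenvalues in ascending order
alpha, beta = vecs[:n, -1], vecs[n:, -1]   # leading eigenvector -> dual vectors

# Projection of a new point via (1.6): evaluate the kernel at all training data.
x_new = np.array([[0.3, 0.09]])
proj = rbf_gram(Xs, x_new)[:, 0] @ alpha
print(round(float(vals[-1]), 4))
```

Note that, by the Cauchy–Schwarz inequality, the leading eigenvalue of the regularized problem lies in $[0, 1]$, mirroring the fact that canonical correlations are bounded by 1.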

Both optimization problems (1.1) and (1.5) can be solved by considering generalized eigenvalue problems [4] of the form

$$\mathcal{A} z = \lambda \mathcal{B} z, \tag{1.8}$$

where $\mathcal{A}$ and $\mathcal{B}$ are symmetric positive semi-definite. This generalized eigenvalue problem can be solved efficiently using approaches from numerical linear algebra [19]. CCA and kernel CCA have been successfully applied in many fields, including cross-language document retrieval [47], content-based image retrieval [25], bioinformatics [46, 53], independent component analysis [2, 17], and the computation of principal angles between linear subspaces [6, 20].

Despite the wide usage of CCA and kernel CCA, they share a common limitation: the lack of sparseness in the transformation matrices $W_x$ and $W_y$ and in the dual transformation matrices $A$ and $B$. Equation (1.2) shows that the projections of a data pair $(x, y)$ are linear combinations of the original variables, which makes interpretation of the extracted features difficult when the transformation matrices $W_x$ and $W_y$ are dense. Similarly, from (1.6) and (1.7) we can see that the kernel functions $\mathcal{K}_x(x_i, x)$ and $\mathcal{K}_y(y_i, y)$ must be evaluated at all $n$ training points when the dual transformation matrices $A$ and $B$ are dense, which can lead to excessive computational time for projecting new data. To handle this limitation of CCA, researchers have suggested incorporating sparsity into the weight vectors, and many papers have studied sparse CCA [9, 23, 35, 40, 41, 48, 50, 51, 52]. Similarly, we shall seek sparse solutions for kernel CCA so that projections of new data can be computed by evaluating the kernel function at only a subset of the training data. Although there are many sparse kernel approaches [5], such as support vector machines [37], the relevance vector machine [45] and sparse kernel partial least squares [14, 34], little work can be found in the area of sparse kernel CCA [13, 43].
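Generalized eigenvalue problems of the form (1.8) are directly supported by standard numerical libraries; a small SciPy sketch on toy matrices of our own choosing:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
m = 6
M = rng.standard_normal((m, m))
A = M + M.T                      # symmetric
N = rng.standard_normal((m, m))
B = N @ N.T + m * np.eye(m)      # symmetric positive definite

vals, vecs = eigh(A, B)          # solves A z = lambda B z, eigenvalues ascending
z, lam = vecs[:, -1], vals[-1]   # leading eigenpair
print(np.allclose(A @ z, lam * B @ z))
```

`scipy.linalg.eigh` requires the right-hand matrix to be positive definite; when it is only positive semi-definite, as can happen for (1.1) and (1.5), a small ridge term is commonly added to make it definite.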

In this paper we first consider a new sparse CCA approach and then generalize it to incorporate sparsity into kernel CCA. A relationship between CCA and least squares is established so that CCA solutions can be obtained by solving a least squares problem. We introduce sparsity by penalizing the $\ell_1$-norm of the solutions, which eventually leads to an $\ell_1$-norm penalized least squares optimization problem of the form

$$\min_{w} \ \mu \|w\|_1 + \frac{1}{2}\left\|Aw - b\right\|_2^2,$$

where $\mu > 0$ is a regularization parameter controlling the sparsity of $w$. We adopt a fixed-point continuation (FPC) method [21, 22] to solve the $\ell_1$-norm regularized least squares problem above, which results in a new sparse CCA algorithm (SCCA_LS). Since the optimization criteria of CCA and kernel CCA are of the same form, the same idea can be extended to kernel CCA to obtain a sparse kernel CCA algorithm (SKCCA).

The remainder of the paper is organized as follows. In Section 2, we present background results on both CCA and kernel CCA, including a full parameterization of the general solutions of CCA and a detailed derivation of kernel CCA. In Section 3, we first establish a relationship between CCA and least squares problems; based on this relationship, we then propose to incorporate sparsity into CCA by penalizing the least squares problems with the $\ell_1$-norm. Solving the penalized least squares problems by FPC leads to a new sparse CCA algorithm, SCCA_LS. In Section 4, we extend the idea behind SCCA_LS to its kernel counterpart, which results in a novel sparse kernel CCA algorithm, SKCCA. Numerical results of applying the newly proposed algorithms to various applications, together with comparative empirical results for other algorithms, are presented in Section 5. Finally, we draw some concluding remarks in Section 6.

## 2 Background

In this section we provide enough background results on CCA and kernel CCA to make the paper self-contained. In the first subsection, we present the full parameterization of the general solutions of CCA and related results; in the second subsection, based on this parameterization, we demonstrate a detailed derivation of kernel CCA.

### 2.1 Canonical correlation analysis

As stated in the Introduction, by solving (1.1), or equivalently the generalized eigenvalue problem

$$\begin{bmatrix} 0 & X Y^T \\ Y X^T & 0 \end{bmatrix} \begin{bmatrix} w_x \\ w_y \end{bmatrix} = \lambda \begin{bmatrix} X X^T & 0 \\ 0 & Y Y^T \end{bmatrix} \begin{bmatrix} w_x \\ w_y \end{bmatrix}, \tag{2.1}$$

we can get a pair of weight vectors $w_x$ and $w_y$ for CCA. One pair of weight vectors is not enough for most practical problems, however. To obtain multiple projections of CCA, we recursively solve the following optimization problem

$$\max_{w_x^{(k)},\, w_y^{(k)}} \ \left(w_x^{(k)}\right)^{T} X Y^T w_y^{(k)} \quad \text{s.t.} \quad \left(w_x^{(k)}\right)^{T} X X^T w_x^{(j)} = \delta_{kj}, \ \ \left(w_y^{(k)}\right)^{T} Y Y^T w_y^{(j)} = \delta_{kj}, \ \ j = 1, \ldots, k, \tag{2.2}$$

for $k = 1, \ldots, l$, where $l$ is the number of projections we need and $\delta_{kj}$ denotes the Kronecker delta. The unit vectors $X^T w_x^{(k)}$ and $Y^T w_y^{(k)}$ in (2.2) are called the $k$th pair of canonical variables. If we denote

$$W_x = \left[w_x^{(1)}, \ldots, w_x^{(l)}\right] \in \mathbb{R}^{p \times l}, \qquad W_y = \left[w_y^{(1)}, \ldots, w_y^{(l)}\right] \in \mathbb{R}^{q \times l},$$

then we can show [9] that the optimization problem above is equivalent to

$$\max_{W_x,\, W_y} \ \operatorname{trace}\!\left(W_x^T X Y^T W_y\right) \quad \text{s.t.} \quad W_x^T X X^T W_x = I_l, \ \ W_y^T Y Y^T W_y = I_l. \tag{2.3}$$

Hence, optimization problem (2.3) will be used as the criterion of CCA.

A solution of (2.3) can be obtained via solving a generalized eigenvalue problem of the form (2.1). Furthermore, we can fully characterize all solutions of the optimization problem (2.3). Define $r_1 = \operatorname{rank}(X)$ and $r_2 = \operatorname{rank}(Y)$.

Let the (reduced) SVD factorizations of $X$ and $Y$ be, respectively,

$$X = U_1 \begin{bmatrix} \Sigma_1 \\ 0 \end{bmatrix} V_1^T \tag{2.4}$$

and

$$Y = U_2 \begin{bmatrix} \Sigma_2 \\ 0 \end{bmatrix} V_2^T, \tag{2.5}$$

where $U_1 \in \mathbb{R}^{p \times p}$ and $U_2 \in \mathbb{R}^{q \times q}$ are orthogonal, $\Sigma_1 \in \mathbb{R}^{r_1 \times r_1}$ and $\Sigma_2 \in \mathbb{R}^{r_2 \times r_2}$ are nonsingular and diagonal, and $V_1 \in \mathbb{R}^{n \times r_1}$ and $V_2 \in \mathbb{R}^{n \times r_2}$ are column orthogonal. It follows from the two orthogonality constraints in (2.3) that

$$l \leq \min\{r_1, r_2\}. \tag{2.6}$$

Next, let

$$V_1^T V_2 = P \Sigma Q^T \tag{2.7}$$

be the singular value decomposition of $V_1^T V_2$, where $P \in \mathbb{R}^{r_1 \times r_1}$ and $Q \in \mathbb{R}^{r_2 \times r_2}$ are orthogonal and $\Sigma \in \mathbb{R}^{r_1 \times r_2}$ is diagonal, and assume there are $s$ distinct nonzero singular values $\sigma_1 > \cdots > \sigma_s > 0$ with multiplicities $m_1, \ldots, m_s$, respectively. The full characterization of $W_x$ and $W_y$ is given in the following theorem [9].

###### Theorem 2.1.

i). If for some satisfying , then with and is a solution of optimization problem (2.3) if and only if

(2.8)

where is orthogonal, and are arbitrary.

ii). If for some satisfying , then with and is a solution of optimization problem (2.3) if and only if

(2.9)

where , is orthogonal, is column orthogonal, and are arbitrary.

iii). If , then with and is a solution of optimization problem (2.3) if and only if

(2.10)

where is orthogonal, and are column orthogonal, and are arbitrary.

An immediate application of Theorem 2.1 is that we can prove that uncorrelated linear discriminant analysis (ULDA) [8, 27, 55] is a special case of CCA when one set of variables is derived from the data matrix and the other set of variables is constructed from class information. This theorem has also been utilized in [9] to design a sparse CCA algorithm.

### 2.2 Kernel canonical correlation analysis

Now, we look at some details of the derivation of kernel CCA. Note from Theorem 2.1 that each solution of CCA can be expressed as the sum of a matrix whose columns lie in the range space of the data matrix and a matrix whose columns are orthogonal to that range space. Since, intrinsically, kernel CCA is performing ordinary CCA on $\Phi_x$ and $\Phi_y$, it follows that the solutions of kernel CCA should be obtained by virtually solving

$$\max_{W_x,\, W_y} \ \operatorname{trace}\!\left(W_x^T \Phi_x \Phi_y^T W_y\right) \quad \text{s.t.} \quad W_x^T \Phi_x \Phi_x^T W_x = I_l, \ \ W_y^T \Phi_y \Phi_y^T W_y = I_l. \tag{2.11}$$

Similar to ordinary CCA, each solution of (2.11) shall be represented as

$$W_x = \Phi_x A + W_x^{\perp}, \qquad W_y = \Phi_y B + W_y^{\perp}, \tag{2.12}$$

where $A, B \in \mathbb{R}^{n \times l}$ are usually called dual transformation matrices, and the columns of $W_x^{\perp}$ and $W_y^{\perp}$ are orthogonal to the range spaces of $\Phi_x$ and $\Phi_y$, respectively.

Substituting (2.12) into (2.11), we have

$$W_x^T \Phi_x \Phi_y^T W_y = A^T K_x K_y B, \qquad W_x^T \Phi_x \Phi_x^T W_x = A^T K_x^2 A, \qquad W_y^T \Phi_y \Phi_y^T W_y = B^T K_y^2 B.$$

Thus, the computation of the transformations of kernel CCA can be converted into the computation of the dual transformation matrices $A$ and $B$ by solving the following optimization problem

$$\max_{A,\, B} \ \operatorname{trace}\!\left(A^T K_x K_y B\right) \quad \text{s.t.} \quad A^T K_x^2 A = I_l, \ \ B^T K_y^2 B = I_l, \tag{2.13}$$

which is used as the criterion of kernel CCA in this paper.

As can be seen from the analysis above, the terms $W_x^{\perp}$ and $W_y^{\perp}$ in (2.12) do not contribute to the canonical correlations between $\Phi_x$ and $\Phi_y$, and thus are usually neglected in practice. Therefore, when we are given a set of testing data $X_t = [\tilde{x}_1, \ldots, \tilde{x}_N]$ consisting of $N$ points, the projection of $X_t$ onto the kernel CCA directions can be performed by first mapping $X_t$ into the feature space $\mathcal{F}_x$, then computing the inner products with $\Phi_x A$. More specifically, suppose $\Phi_t = [\phi(\tilde{x}_1), \ldots, \phi(\tilde{x}_N)]$ is the image of $X_t$ in the feature space $\mathcal{F}_x$; then the projection of $X_t$ onto the kernel CCA directions is given by

$$A^T \Phi_x^T \Phi_t = A^T K_t,$$

where $K_t = \Phi_x^T \Phi_t \in \mathbb{R}^{n \times N}$ is the matrix consisting of the kernel evaluations of $X_t$ with all training data, that is, $(K_t)_{ij} = \mathcal{K}_x(x_i, \tilde{x}_j)$. A similar process can be adopted to compute projections of new data drawn from variable $y$.

In the process of deriving (2.13), we assumed that the data $\Phi_x$ and $\Phi_y$ have been centered (that is, the mean of the columns of $\Phi_x$ and of $\Phi_y$ is zero); otherwise, we need to perform data centering before applying kernel CCA. Unlike the centering of $X$ and $Y$, we cannot perform data centering directly on $\Phi_x$ and $\Phi_y$, since we do not know their explicit coordinates. However, as shown in [38, 37], data centering in an RKHS can be accomplished via operations on the kernel matrices. To center $\Phi_x$, a natural idea would be to compute $\Phi_x \left(I_n - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T\right)$, where $\mathbf{1}_n$ denotes the column vector in $\mathbb{R}^n$ with all entries being 1. However, since kernel CCA makes use of the data only through the kernel matrix $K_x$, the centering process can be performed on $K_x$ as

$$K_x \leftarrow \left(I_n - \tfrac{1}{n}\mathbf{1}_n \mathbf{1}_n^T\right) K_x \left(I_n - \tfrac{1}{n}\mathbf{1}_n \mathbf{1}_n^T\right). \tag{2.14}$$

Similarly, we can center the testing data as

$$K_t \leftarrow \left(I_n - \tfrac{1}{n}\mathbf{1}_n \mathbf{1}_n^T\right) \left(K_t - \tfrac{1}{n} K_x \mathbf{1}_n \mathbf{1}_N^T\right). \tag{2.15}$$

More details about data centering in an RKHS can be found in [38, 37]. In the sequel of this paper, we assume the given data have been centered.
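The centering formulas (2.14) and (2.15) are easy to verify numerically. The sketch below (NumPy, toy data and variable names of our own choosing) uses the linear kernel, so the feature map is the identity, and compares Gram-matrix centering against explicit centering in feature space.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, p = 8, 3, 5
Phi = rng.standard_normal((p, n))     # training features (linear kernel: phi(x) = x)
Phi_t = rng.standard_normal((p, N))   # test features

Kx = Phi.T @ Phi      # training Gram matrix
Kt = Phi.T @ Phi_t    # kernel evaluations between training and test data

one_n = np.ones((n, 1))
C = np.eye(n) - one_n @ one_n.T / n                  # centering matrix I - (1/n) 1 1^T
Kx_c = C @ Kx @ C                                    # (2.14)
Kt_c = C @ (Kt - Kx @ one_n @ np.ones((1, N)) / n)   # (2.15)

# Reference: center explicitly in feature space, then form the Gram matrices.
mu = Phi.mean(axis=1, keepdims=True)
Kx_ref = (Phi - mu).T @ (Phi - mu)
Kt_ref = (Phi - mu).T @ (Phi_t - mu)
print(np.allclose(Kx_c, Kx_ref), np.allclose(Kt_c, Kt_ref))  # True True
```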

## 3 Sparse CCA based on least squares formulation

Note from (2.1) that when one of $x$ and $y$ is one-dimensional, CCA is equivalent to the least squares estimation of a linear regression problem. For more general cases, a relation between CCA and linear regression has been established in [42] under additional conditions on $X$ and $Y$. In this section, we establish a relation between CCA and linear regression without any additional constraint on $X$ and $Y$; moreover, based on this relation we design a new sparse CCA algorithm.

We focus on a solution subset of optimization problem (2.3), presented in the following lemma, whose proof is trivial and omitted.

###### Lemma 3.1.

Any of the following forms

(3.1)

is a solution of optimization problem (2.3), where the free blocks in (3.1) are arbitrary.

Suppose the matrix factorizations (2.4)–(2.7) have been computed, and let

(3.2)

(3.3)

where $(\cdot)^{\dagger}$ denotes the Moore–Penrose inverse of a general matrix; then we have the following theorem.

###### Theorem 3.2.

###### Proof.

Since (3.4) and (3.5) have the same form, we only prove the result for (3.4); the same idea can be applied to (3.5).

We know that a matrix is a solution of (3.4) if and only if it satisfies the normal equation

(3.6)

Substituting factorizations (2.4), (2.5) and (2.7) into the equation above, we get

and

which yield an equivalent reformulation of (3.6)

(3.7)

It is easy to check that a matrix is a solution of (3.7) if and only if it can be expressed as

(3.8)

where the free block in (3.8) is an arbitrary matrix. Therefore, a matrix is a solution of (3.4) if and only if it can be formulated as (3.8).

Similarly, a matrix is a solution of (3.5) if and only if it can be written as

(3.9)

where the free block in (3.9) is an arbitrary matrix.

###### Remark 3.1.

In Theorem 3.2 we only consider a number of weight vector pairs not exceeding the number of nonzero canonical correlations between $X$ and $Y$. This is reasonable, since weight vectors corresponding to zero canonical correlations do not contribute to the correlation between the data $X$ and $Y$.

Consider the usual regression situation: we have a set of observations $\{(a_i, b_i)\}_{i=1}^{n}$, where $a_i \in \mathbb{R}^{m}$ and $b_i \in \mathbb{R}$ are the regressor and response for the $i$th observation. Suppose the data have been centered; then the linear regression model has the form

$$b_i = a_i^T w + \varepsilon_i,$$

and aims to estimate the coefficient vector $w$ so as to predict an output for each input $a$. The famous least squares estimation minimizes the residual sum of squares

$$\mathrm{RSS}(w) = \sum_{i=1}^{n} \left(b_i - a_i^T w\right)^2.$$

Therefore, (3.4) and (3.5) can be interpreted as least squares estimations of linear regression problems, with the columns of $X$ and $Y$ serving as regressors and the rows of the matrices defined in (3.2) and (3.3) serving as the corresponding responses.
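As a minimal reminder of this least squares machinery (NumPy, toy data and names of our own choosing), the residual sum of squares is minimized by any solution of the normal equation, which `np.linalg.lstsq` computes directly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 4
A = rng.standard_normal((n, m))                   # rows a_i^T are the regressors
w_true = np.array([1.0, -2.0, 0.0, 0.5])
b = A @ w_true + 0.01 * rng.standard_normal(n)    # responses b_i

w_hat = np.linalg.lstsq(A, b, rcond=None)[0]      # minimizes sum_i (b_i - a_i^T w)^2
# Any minimizer satisfies the normal equation A^T A w = A^T b (cf. (3.6)).
print(np.allclose(A.T @ A @ w_hat, A.T @ b))  # True
```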

Recent research on the lasso [44] shows that sparsity and regression can be achieved simultaneously by penalizing the $\ell_1$-norm of the variables. Motivated by this, we incorporate sparsity into CCA via the established relationship between CCA and least squares, considering the following $\ell_1$-norm penalized least squares problems

(3.10)

and

(3.11)

where the regularization parameters are positive scalars, one for each column of the unknown matrices, weighting the $\ell_1$-norms of the corresponding columns. With particular choices of these parameters, problems (3.10) and (3.11) become

(3.12)

and

(3.13)

Since (3.10) and (3.11) (likewise (3.12) and (3.13)) have the same form, all results holding for one problem can be naturally extended to the other, so we concentrate on (3.10). Optimization problem (3.10) reduces to an $\ell_1$-regularized minimization problem of the form

$$\min_{w \in \mathbb{R}^{m}} \ \mu \|w\|_1 + \frac{1}{2}\left\|Aw - b\right\|_2^2 \tag{3.14}$$

in the single-column case. In the field of compressed sensing, (3.14) has been intensively studied as the denoising basis pursuit problem, and many efficient approaches have been proposed to solve it; see [3, 15, 21, 54]. In this paper we adopt the fixed-point continuation (FPC) method [21, 22], due to its simple implementation and nice convergence properties.

The fixed-point algorithm for (3.14) is an iterative method that updates the iterates as

$$w^{k+1} = S_{\tau\mu}\!\left(w^{k} - \tau A^T \left(A w^{k} - b\right)\right), \tag{3.15}$$

where $\tau > 0$ denotes the step size, and $S_{\nu}$ is the soft-thresholding operator defined as

$$S_{\nu}(w) = \left[\, s_{\nu}(w_1), \ldots, s_{\nu}(w_m) \,\right]^T$$

with

$$s_{\nu}(t) = \operatorname{sign}(t) \max\{|t| - \nu,\, 0\}. \tag{3.16}$$

$s_{\nu}$ reduces any $t$ with magnitude less than $\nu$ to zero, thus reducing the $\ell_1$-norm and introducing sparsity.
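The soft-thresholding operator is one line of NumPy (our naming):

```python
import numpy as np

def soft_threshold(w, nu):
    """Componentwise s_nu(t) = sign(t) * max(|t| - nu, 0), cf. (3.16)."""
    return np.sign(w) * np.maximum(np.abs(w) - nu, 0.0)

out = soft_threshold(np.array([-1.5, -0.2, 0.0, 0.3, 2.0]), 0.5)
print(out)  # entries with |t| < 0.5 are zeroed; the rest shrink toward zero by 0.5
```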

The fixed-point algorithm can be naturally extended to solve (3.10), which yields

(3.17)

where the soft-thresholding operator is applied columnwise and $\tau$ denotes the step size. We can prove that the fixed-point iterations have some nice convergence properties, which are presented in the following theorem.
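A plain fixed-point loop of the form (3.15), without the continuation strategy of [21, 22] and on a hypothetical small problem of our own construction, already illustrates how the iteration produces sparse solutions:

```python
import numpy as np

def soft_threshold(w, nu):
    return np.sign(w) * np.maximum(np.abs(w) - nu, 0.0)

def fixed_point_l1(A, b, mu, iters=5000):
    """Iterate w <- S_{tau*mu}(w - tau * A^T (A w - b)) for problem (3.14)."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/||A||_2^2 ensures convergence
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        w = soft_threshold(w - tau * A.T @ (A @ w - b), tau * mu)
    return w

rng = np.random.default_rng(4)
n, m = 40, 100
A = rng.standard_normal((n, m))
w_true = np.zeros(m)
w_true[[5, 30, 77]] = [2.0, -1.5, 1.0]   # sparse ground truth
b = A @ w_true                           # noiseless underdetermined system

w = fixed_point_l1(A, b, mu=0.1)
print(np.count_nonzero(np.abs(w) > 1e-3), "nonzeros out of", m)
```

On this toy problem the recovered `w` has only a handful of nonzero entries, close to the planted support; the continuation strategy of FPC additionally decreases $\mu$ along a path to accelerate convergence.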
