 # An Inexact Manifold Augmented Lagrangian Method for Adaptive Sparse Canonical Correlation Analysis with Trace Lasso Regularization

Canonical correlation analysis (CCA for short) describes the relationship between two sets of variables by finding some linear combinations of these variables that maximizing the correlation coefficient. However, in high-dimensional settings where the number of variables exceeds sample size, or in the case of that the variables are highly correlated, the traditional CCA is no longer appropriate. In this paper, an adaptive sparse version of CCA (ASCCA for short) is proposed by using the trace Lasso regularization. The proposed ASCCA reduces the instability of the estimator when the covariates are highly correlated, and thus improves its interpretation. The ASCCA is further reformulated to an optimization problem on Riemannian manifolds, and an manifold inexact augmented Lagrangian method is then proposed for the resulting optimization problem. The performance of the ASCCA is compared with the other sparse CCA techniques in different simulation settings, which illustrates that the ASCCA is feasible and efficient.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Canonical correlation analysis (CCA for short) firstly proposed by Hotelling  aims to tackle the associations between two sets of variables. It has wide applications in many important fields such as biology [26, 18, 13], medicine , image analysis [7, 15], etc.

Suppose that there are two data sets: containing variables and containing variables, both are obtained from observations. The CCA seeks two linear combinations of these variables from and with maximal correlation coefficient. Specially, let

 Σxx=1nXTX,Σyy=1nYTY

be the sample covariance matrices of and respectively, and be the sample cross-covariance matrix, then the CCA finds a pair such that

 corr(Xu,Yv)=uTΣxyv√uTΣxxu√vTΣyyv (1.1)

is maximized. The new variables and are canonical variables, and the correlations between canonical variables are called canonical correlations. The canonical variables and

can be respectively obtained by the eigenvectors of matrix

 Σ−1xxΣxyΣ−1yyΣyx.

The canonical correlations are given by the positive square root of those eigenvalues. Since the CCA model (

1.1) is in form of fractional, it is difficult to optimize. A equivalent formulation of CCA is given by

 {maxu,v uTΣxyvs.t. uTΣxxu=1,vTΣyyv=1,

which can be regarded as an optimization problem on the generalized Stiefel manifolds. However, a potential disadvantage of the CCA is that, the learned solution is a linear combination of all original variables, which brings down the interpretability. If the number of variables exceeds sample size, traditional CCA cannot be performed due to that and are singular. Hence, many researchers proposed various sparse CCA (SCCA) to handle the case that the number of variables exceeds observations, and to improve the interpretability of canonical variables by restricting the linear combinations to a subset of original variables.

In this paper, we propose an adaptive sparse CCA model by incorporating the trace Lasso regularization. The matrix version of trace Lasso regularization can be adopted to both highly correlated and uncorrelated data. Our major contributions are summarized in follows:

1. We present a matrix version of trace Lasso regularization, and show that the new regularization function enjoys the properties of original trace Lasso.

2. By introducing trace Lasso regularization into the CCA model, we obtain an adaptive sparse CCA model (ASCCA). To our knowledge, our ASCCA is the first to takes the data correlation into account in the CCA model. In addition, our model consider multiple variables simultaneously.

3. The new model is reformulated to an optimization problem on the generalized Stiefel manifold. An manifold inexact augmented Lagrangian method is proposed for the resulted optimization problem, and the convergence is established under some assumptions.

4. The experimental results demonstrate that, the proposed ASCCA is superior to some existing sparse CCA models.

The rest of the paper is organized as follows. Section 2 briefly gives some reviews on the related works. Section 3 proposes an adaptive sparse version of the CCA introduced by the new trace Lasso regularization. Section 4 provides an optimization reformulation and the manifold inexact augmented Lagrangian method for the new model, and gives the convergence analysis. In Section 5, a simulation study is provided to show the validity and efficiency of the proposed method. Section 6 concludes this paper with some final remarks.

## 2 Related works

It is well known that, if the sample size exceeds dimension, the traditional CCA does not perform. To overcome this difficulty, various methods were proposed via incorporate different regularization function. Vinod 

proposed a canonical ridge, which is an adaptation of the ridge regression for the CCA framework proposed by Hoerl and Kennard

, and introduced an efficient sparsity penalty strategy. After that, various approaches for sparse CCA (SCCA for short) were proposed in literature, which includes regularization[17, 25], elastic net , group sparse and structured sparse [14, 4], etc. There also exists some limitations. If there is a group of variables which the pairwise correlation is high, the Lasso tends to select only one variable from this group, which may lead some misunderstands to the truth. Group sparse regularization needs the prior knowledge of group, which is unrealistic in some real applications. The proposed adaptive sparse CCA model utilized the new trace Lasso regularization, which incorporates data matrix into regularization, to adaptively deal with the correlation of covariation matrix.

The original SCCA model is difficult to handle, so many researchers simplify it by assuming that and are diagonal matrices or identity matrices. Parkhomenko, et al  assume that, the covariance matrices

are the identity matrices, and used a sparse singular value decomposition to derive sparse singular vectors. Wilms and Croux

 converted the SCCA model into a penalized regression framework. Suo  presented an approximated SCCA model as follows

 {minu,v−uTΣxyv+λ1∥u∥1+λ2∥v∥1s.t. uTΣxxu≤1,vTΣxxv≤1,

and problem (2) was solved by a linearized alternating direction method of multipliers (LADMM). Witten  further relaxed (2) to

 {minu,v−uTΣxyvs.t. ∥u∥22≤1,∥v∥22≤1,P1(u)≤c1,P2(v)≤c2,

where and are some regularizations for sparsity, and then developed a penalized matrix decomposition algorithm to solve model (2). Focusing on a sparse version of the original CCA model (1), Gao  proposed a two-stage method by a convex relaxation of CCA model. For the matrix case, many researchers adopted the residual model to obtain the high-order canonical variables [17, 25, 19]. In this paper, we get multiple variables simultaneously in our new model. In addition, all results on the matrix case mentioned above have not given convergence analysis for their algorithms, we proposed an efficient method to solve our new model, and provided the convergence analysis.

The original trace Lasso was proposed by Grave . Trace Lasso regularization was successful applied to various scenarios including subspace clustering , sparse representation classification  and subspace segment , and so on. However, they only considered the trace Lasso regularization in vector case in literature. In this paper, we generalize the original trace Lasso regularization to matrix case, and adopt it as a new regularization for the SCCA, and get an adaptive SCCA model.

#### Notations:

We use capital and lowercase symbols to represent matrix and vector, respectively. Let denote the vector of all 1’s, be a vector whose -th entry is 1 and 0 for others, is a diagonal matrix where the -th diagonal entry is , and be a vector where the -th entry is . Let and denote the -th row and -th column of , be the trace of . For a vector , let be the and norm. For a matrix , let be the norm, and denote the Frobenius norm and nuclear norm respectively, denote the operator norm.

## 3 Adaptive sparse CCA using trace Lasso regularization

### 3.1 Trace Lasso in vector case

Consider the following linear estimator:

 minw12∥Xw−y∥+λΩ(w)

where is a data matrix. The trace Lasso is a correlation based penalized norm proposed by Grave et al  for balancing the and norm. It is defined as follows

 Ω(w)=∥Xdiag(w)∥∗

where is nuclear norm. A main advantage of trace Lasso being superior to other norm is that, the trace Lasso involves the data matrix , makes it adaptive to the correlation of data. As shown in , if each column of is normalized to

, the trace Lasso interpolates between the

norm and norm in the sense of

 ∥w∥1≤∥Xdiag(w)∥∗≤∥w∥2.

The inequality are tight. To see this, if the data are uncorrelated (), trace Lasso reduce to , and if the data are highly correlated (), trace Lasso equals to .

### 3.2 Trace Lasso in matrix case

Let , define a linear operator as

 AX(W)=(XDiag(W⋅1),⋯,XDiag(W⋅r))

 A∗X(M)=(diag(XTM1),⋯,diag(XTMr))

where denotes -th block matrix of . Then, the trace Lasso in matrix case is defined as follows

 Ω(W)=∥AX(W)∥∗

It is easy to show that, the trace Lasso regularizer in matrix case (3.2) has similar properties to that in vector case. If each column of is normalized, then the linear operator can be rewritten to

 AX(W)=r∑i=1p∑j=1X⋅jWij¯eTij=p∑j=1X⋅j⋅(r∑i=1wij¯eTij)

where is an unit vector in which the -th component is 1 and the others are 0. There are two special case:

1. If the data (i.e., column vectors of ) are uncorrelated, i.e., . Then (3.2) gives a singular value decomposition of . In the case, trace Lasso (3.2) reduces to the norm

 ∥AX(W)∥∗=p∑j=1∥X⋅j∥2∥Wj⋅∥2=∥W∥2,1.
2. If the data are highly correlated, especially if all columns of are identical and have unit size, we have

 AX(W)=X⋅1⋅p∑j=1r∑i=1Wij¯eTij=X⋅1⋅(vec(W))T,

where . Then trace Lasso (3.2) reduces to Frobenius norm

 ∥AX(W)∥∗=∥X⋅1⋅(vec(W))T∥∗=∥X⋅1∥2⋅∥vec(W)∥2=∥W∥F.

The following proposition show that the trace Lasso (3.2) in matrix case is adaptive to the correlation of data, which is similar to the original trace Lasso.

###### Proposition 3.1.

Let , and each column of is normalized, . Then

 ∥W∥F≤∥AX(W)∥∗≤√r∥W∥2,1.
###### Proof..

We first show that . Specifically, we have

 ∥AX(W)∥2F=r∑i=1∥Xdiag(W⋅i)∥2F=r∑i=1p∑j=1W2ij∥X⋅j∥22=r∑i=1p∑j=1W2ij=∥W∥2F

Then, for the first inequality of (3.1) we have

 ∥W∥F=∥AX(W)∥F≤∥AX(W)∥∗.

Denote the -th column of the -th submatrix in by , and let , then for the second inequality of (3.1) we have

 ∥AX(W)∥∗=max∥M∥op≤1⟨M,AX(W)⟩=max∥M∥2≤1⟨A∗X(M),W⟩=max∥M∥2≤1r∑i=1WT⋅idiag(XTMi)=max∥M∥2≤1p∑j=1XT⋅j(r∑i=1WijMi⋅j)≤max∥M∥2≤1p∑j=1∥X⋅j∥2⋅∥∥ ∥∥r∑i=1WijMi⋅j∥∥ ∥∥2≤max∥M∥2≤1p∑j=1∥X⋅j∥2⋅∥^Mj∥F∥Wj⋅∥2≤√rp∑j=1∥Wj⋅∥2=√r∥W∥2,1.

The first equality used the fact that the dual norm of the trace norm is the operator norm. The last inequality used that , and which deduces . ∎

###### Remark 3.1.

If , then Proposition 3.1 is indeed Proposition 3 in .

### 3.3 Regression framework of the adaptive SCCA

Given two data matrices and on the same set of observations, where is the sample size, and are the feature numbers. Without loss of generality, we assume that data matrices and are mean centered. By and , the CCA problem can be rewritten as

 {(u∗,v∗)=argmaxu,v uTXTYv,s.t.uTXTXu=1,vTYTYv=1. (3.1)

For multiple canonical vectors, let and where denote the -th pair of the canonical vectors, the multiple CCA problem is

 (3.2)

The CCA problem (3.2) can be reformulated to a constrained bilinear regression problem of the form

To adapt to the dependence of data, we consider an adaptive sparse CCA (SCCA) model with trace Lasso regularization. Specifically, we have

 ⎧⎪⎨⎪⎩(U∗,V∗)=argminU,V12∥XU−YV∥2F+λu∥AX(U)∥∗+λv∥AY(V)∥∗,s.t.UT(XTX)U=Ir, VT(YTY)V=Ir,

where , and are the penalty parameters, and are linear operators.

## 4 Optimization method for SCCA (3.3)

The SCCA model (3.3) is a nonconvex and nonsmooth optimization problem, and it is difficult to be solved. Riemannian optimization methods are popular to solve a class constrained optimization problem with special structure. Hence, in this section we first reformulate problem (3.3) to a nonsmooth optimization problem on the generalized Stiefel manifolds, then adopt an manifold inexact augmented Lagrangian method in  to solve the resulting problem. Finally, we give a convergence analysis of the proposed method.

### 4.1 Augmented Lagrangian scheme

Let , and , then problem (3.3) can be reformulated as

Here, we assume that and are positive define 333If it is not positive define, we can replace by .. Then and can be regarded to generalized Stiefel manifolds, and problem (4.1) is an optimization problem on generalized Stiefel manifolds. We further reformulate (4.1) to

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩(U∗,V∗)=argminU,V12∥XU−YV∥2F+g(P)+h(Q),s.t.AX(U)=P,AY(V)=QU∈M1, V∈M2,

The Lagrangian function associated with (4.1) is given by

 L(U,V,P,Q;Λ1,Λ2)=12∥XU−YV∥2F+g(P)+h(Q)−⟨Λ1,AX(U)−P⟩−⟨Λ2,AY(V)−Q⟩,

where and denote the Lagrangian multipliers. Let be a penalty parameter. Then, the corresponding augmented Lagrangian function is given by

 Lρ(U,V,P,Q;Λ1,Λ2)=L(U,V,P,Q;Λ1,Λ2)+ρ2∥AX(U)−P∥2F+ρ2∥AY(V)−Q∥2F

Then, the proposed manifold inexact augmented Lagrangian method for (4.1) is summarized in Algorithm 1.

### 4.2 Convergence analysis

Let be a variable formed by concatenating and . Let where . Then, problem (4.1) can be rewritten as

 minWF(W)  s.t.   h(W)=0, W∈N,

where , and is given by

 h(W):=(AX(U)−P , AY(V)−Q).

The corresponding augmented Lagrangian can be rewritten as

 Lρ(W;Z)=F(W)+r∑i,jZij[h(W)]ij+ρ2r∑i,j[h(W)]2ij

The corresponding KKT condition is given by

where is the Riemannian subdifferential of at . To obtain an efficient implementation of Algorithm 1, we inexactly solve the iteration subproblem (4.1) in which the following stoping criteria is used:

 δk∈∂Ψk(Wk+1)  and ∥δk∥≤ϵk

where as .

Following Yang, Zhang and Song , we give the constraint qualifications of problem (4.2):

###### Definition 4.1 (Licq).

Linear independence constraint qualifications (LICQ) are said to hold at for problem (4.2) if

 {grad[h(W)]ij|i=1,⋯,m;j=1,⋯,r} are linearly % independent in TWN.
###### Theorem 4.1.

Suppose is a sequence generated by Algorithm 1, and the stoping criteria (4.2) is hit at the -th iteration. Then the limit point set of is nonempty. Let be a limit point of , and the LICQ holds at . Then is a KKT point of the problem (4.2).

See . ∎

###### Lemma 4.1.

The LICQ always holds at for problem (4.2) .

###### Proof..

Let , then

 ∇[h1(W)]ij=∇U[h1(W)]ij×∇V[h1(W)]ij×∇P[h1(W)]ij×∇Q[h1(W)]ij,i=1,⋯,n;j=1,⋯,pr,∇[h2(W)]ij=∇U[h2(W)]ij×∇V[h2(W)]ij×∇P[h2(W)]ij×∇Q[h2(W)]ij,i=1,⋯,n;j=1,⋯,qr.

For all , let be a matrix in which the entry at the -th row and the -column is 1, the others are 0. Then

 ∇P[h1(W)]ij=En×prij,∇Q[h1(W)]ij=0,i=1,⋯,n;j=1,⋯,pr,∇P[h1(W)]ij=0,∇Q[h1(W)]ij=En×qrij,i=1,⋯,n;j=1,⋯,qr.

A basis of the normal cone of at , denoted by , is given by

 {ΣxxU(eieTj+ejeTi):i=1,⋯,r,j=i,⋯,r}×{ΣyyV(eieTj+ejeTi):i=1,⋯,r,j=i,⋯,r}.

It is easy to show that, , if there exists such that

 n∑i=1pr∑j=1Z1ij∇[h1(W)]ij+n∑i=1qr∑j=1Z2ij∇[h2(W)]ij∈NWN,

then . Since is a submanifold of Euclidean space, it derives immediately from (4.2) that

Which implies that LICQ holds at and completes the proof. ∎

### 4.3 Riemannian gradient method for subproblem (4.1)

In section 4.1, we present an manifold inexact augmented Lagrangian method to solve problem (4.1). The main challenge in the proposed method (Algorithm 1) is to solve subproblem (4.1) efficiently. Problem (4.1) is a nonsmooth problem under manifold constrained. In this subsection, we first get an equivalence smooth optimization problem by using the Moreau envelop technique, then we present Riemannian gradient method to solve the equivalent problem.

The proximal mapping associated with is defined by

 Proxp(U)=argminW{p(W)+12∥U−W∥2F}.

For fixed and , we consider

 minU,V,P,Q{Ψ(U,V,P,Q):=Lρ(U,V,P,Q;Λ1,Λ2),  s.t.  U∈M1,V∈M2}

Let

 ψ(U,V)=infP,QΨ(U,V,P,Q)=12∥XU−YV∥2F+g(Proxg/ρ(AX(U)−Λ1/ρ))+h(Proxh/ρ(AY(V)−Λ2/ρ))+ρ2∥AX(U)−Λ1/ρ−Proxg/ρ(AX(U)−Λ1/ρ)∥2F+ρ2∥AY(V)−Proxh/ρ(AY(V)−Λ2/ρ)∥2F−12ρ∥Λ1∥2F−12ρ∥Λ2∥2F.

Hence, if

 (~U,~V,~P,~Q)=argminU∈M1,V∈M2,P,QΨ(U,V,P,Q),

then can be computed by

 ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩(~U,~V)=argminU∈M1,V∈M2ψ(U,V),~P=Proxg/ρ(AX(~U)−Λ1/ρ),~Q=Proxh/ρ(AY(~V)−Λ2/ρ).

Notice that the subproblems for and are proximal operators. Both and in (4.3) are nuclear norm functions, the proximal operator is indeed a singular value shrinkage operator, which is given by:

where and .

Now we focus on the subproblem regarding jointed variable in (4.3). Recall that

 (~U,~V)=argminU,V{ψ(U,V), s.t. U∈M1,V∈M2}.

Let and be a product manifold. Then, problem (4.3) can be formulated

 ~W=argminW{ψ(W), s% .t. W∈M}.

By Lemma B.1, is continuously differentiable in Euclidean space, and its Euclidean gradient is

 ∇ψ(W)=(∇Uψ(W)∇Vψ(W))=⎛⎜⎝XT(XU−YV)+ρA∗X(U)[AX(U)−1ρΛ1−Proxg/ρ(AX(U)−1ρΛ1)]YT(YV−XU)+ρA∗Y(V)[AY(V)−1ρΛ2−Proxh/ρ(AY(V)−1ρΛ2)]⎞⎟⎠

Since is a Riemannian submanifold in Euclidean space, by lemma (B.2) is retraction smooth, and its Riemannian gradient is