Canonical correlation analysis (CCA), first proposed by Hotelling, aims to quantify the associations between two sets of variables. It has wide applications in many important fields such as biology [26, 18, 13], medicine, and image analysis [7, 15].
Suppose that there are two data sets, $X \in \mathbb{R}^{n \times p}$ containing $p$ variables and $Y \in \mathbb{R}^{n \times q}$ containing $q$ variables, both obtained from $n$ observations. CCA seeks two linear combinations of the variables in $X$ and $Y$ with maximal correlation coefficient. Specifically, let $\Sigma_{xx}$ and $\Sigma_{yy}$ be the sample covariance matrices of $X$ and $Y$ respectively, and let $\Sigma_{xy}$ be the sample cross-covariance matrix, with $\Sigma_{yx} = \Sigma_{xy}^\top$. Then CCA finds a pair $(u, v)$ such that
$$\rho = \frac{u^\top \Sigma_{xy} v}{\sqrt{u^\top \Sigma_{xx} u}\,\sqrt{v^\top \Sigma_{yy} v}} \tag{1.1}$$
is maximized. The new variables $Xu$ and $Yv$ are the canonical variables, and the correlations between canonical variables are called canonical correlations. The canonical vectors $u$ and $v$ can be obtained as eigenvectors of the matrices
$$\Sigma_{xx}^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} \quad \text{and} \quad \Sigma_{yy}^{-1}\Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy},$$
respectively. The canonical correlations are given by the positive square roots of the corresponding eigenvalues. Since the CCA model (1.1) is fractional, it is difficult to optimize directly. An equivalent formulation of CCA is given by
$$\max_{u, v}\; u^\top \Sigma_{xy} v \quad \text{s.t.} \quad u^\top \Sigma_{xx} u = 1,\; v^\top \Sigma_{yy} v = 1,$$
which can be regarded as an optimization problem on generalized Stiefel manifolds. However, a potential disadvantage of CCA is that the learned solution is a linear combination of all original variables, which reduces interpretability. Moreover, if the number of variables exceeds the sample size, traditional CCA cannot be performed because $\Sigma_{xx}$ and $\Sigma_{yy}$ are singular. Hence, many researchers have proposed sparse CCA (SCCA) models to handle the case in which the number of variables exceeds the number of observations, and to improve the interpretability of the canonical variables by restricting the linear combinations to a subset of the original variables.
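The classical (non-sparse) CCA computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the method proposed in this paper: it assumes more observations than variables (so both sample covariance matrices are invertible), uses a $1/n$ scaling for the covariances, and replaces the eigenvector computation by the equivalent singular value decomposition of the whitened cross-covariance. The function names `inv_sqrt` and `classical_cca` are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def classical_cca(X, Y):
    """Return canonical vectors (columns of U, V) and canonical correlations rho.

    Assumes n > max(p, q), so the sample covariance matrices are invertible.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each variable
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n              # sample covariance of X (1/n scaling assumed)
    Syy = Yc.T @ Yc / n              # sample covariance of Y
    Sxy = Xc.T @ Yc / n              # sample cross-covariance
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    A, rho, Bt = np.linalg.svd(Wx @ Sxy @ Wy, full_matrices=False)
    U = Wx @ A                       # satisfies U^T Sxx U = I
    V = Wy @ Bt.T                    # satisfies V^T Syy V = I
    return U, V, rho
```

The squared singular values returned in `rho` coincide with the leading eigenvalues of $\Sigma_{xx}^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$, which is why the two computations are interchangeable when the covariances are invertible.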
In this paper, we propose an adaptive sparse CCA model that incorporates trace Lasso regularization. The matrix version of the trace Lasso regularization adapts to both highly correlated and uncorrelated data. Our major contributions are summarized as follows:
We present a matrix version of the trace Lasso regularization and show that the new regularization function enjoys the properties of the original trace Lasso.
By introducing the trace Lasso regularization into the CCA model, we obtain an adaptive sparse CCA model (ASCCA). To our knowledge, ASCCA is the first CCA model to take the data correlation into account. In addition, our model considers multiple canonical variables simultaneously.
The new model is reformulated as an optimization problem on the generalized Stiefel manifold. A manifold inexact augmented Lagrangian method is proposed for the resulting optimization problem, and its convergence is established under some assumptions.
The experimental results demonstrate that the proposed ASCCA is superior to some existing sparse CCA models.
The rest of the paper is organized as follows. Section 2 briefly reviews related work. Section 3 proposes an adaptive sparse version of CCA based on the new trace Lasso regularization. Section 4 provides an optimization reformulation and the manifold inexact augmented Lagrangian method for the new model, together with its convergence analysis. In Section 5, a simulation study demonstrates the validity and efficiency of the proposed method. Section 6 concludes the paper with some final remarks.
2 Related works
It is well known that traditional CCA cannot be performed if the dimension exceeds the sample size. To overcome this difficulty, various methods have been proposed that incorporate different regularization functions. Vinod proposed the canonical ridge, an adaptation to the CCA framework of the ridge regression proposed by Hoerl and Kennard, and introduced an efficient sparsity penalty strategy. After that, various approaches to sparse CCA (SCCA) were proposed in the literature, including $\ell_1$ regularization [17, 25], the elastic net, and group sparse and structured sparse penalties [14, 4]. These approaches also have some limitations. If there is a group of variables whose pairwise correlations are high, the Lasso tends to select only one variable from the group, which may give a misleading picture of the truth. Group sparse regularization needs prior knowledge of the groups, which is unrealistic in some real applications. The proposed adaptive sparse CCA model uses the new trace Lasso regularization, which incorporates the data matrix into the regularization, to adapt to the correlation structure of the data.
The original SCCA model is difficult to handle, so many researchers simplify it by assuming that the sample covariance matrices are diagonal matrices or identity matrices. Parkhomenko et al. adopt such an assumption on the covariance matrices, and [24] converted the SCCA model into a penalized regression framework. Suo presented an approximated SCCA model as follows
where the two penalty functions are regularizers promoting sparsity, and then developed a penalized matrix decomposition algorithm to solve model (2). Focusing on a sparse version of the original CCA model (1), Gao proposed a two-stage method based on a convex relaxation of the CCA model. For the matrix case, many researchers adopted a residual model to obtain the higher-order canonical variables [17, 25, 19]. In this paper, our new model yields multiple canonical variables simultaneously. In addition, none of the works on the matrix case mentioned above gives a convergence analysis for its algorithm; we propose an efficient method to solve our new model and provide its convergence analysis.
The original trace Lasso was proposed by Grave et al. Trace Lasso regularization has been successfully applied to various scenarios, including subspace clustering, sparse representation classification, and subspace segmentation. However, these works only consider trace Lasso regularization in the vector case. In this paper, we generalize the original trace Lasso regularization to the matrix case and adopt it as a new regularization for SCCA, which yields an adaptive SCCA model.
We use capital and lowercase symbols to represent matrix and vector, respectively. Let denote the vector of all 1’s, be a vector whose -th entry is 1 and 0 for others, is a diagonal matrix where the -th diagonal entry is , and be a vector where the -th entry is . Let and denote the -th row and -th column of , be the trace of . For a vector , let be the and norm. For a matrix , let be the norm, and denote the Frobenius norm and nuclear norm respectively, denote the operator norm.
3 Adaptive sparse CCA using trace Lasso regularization
3.1 Trace Lasso in vector case
Consider the following linear estimator:
where $X$ is a data matrix. The trace Lasso is a correlation-based penalty norm proposed by Grave et al. for balancing the $\ell_1$ and $\ell_2$ norms. It is defined as follows:
$$\Omega(w) = \|X\,\mathrm{Diag}(w)\|_*,$$
where $\|\cdot\|_*$ is the nuclear norm. A main advantage of the trace Lasso over other norms is that it involves the data matrix $X$, which makes it adaptive to the correlation of the data. As shown by Grave et al., if each column of $X$ is normalized to unit $\ell_2$ norm, the trace Lasso interpolates between the $\ell_1$ norm and the $\ell_2$ norm in the sense that
$$\|w\|_2 \le \|X\,\mathrm{Diag}(w)\|_* \le \|w\|_1.$$
Both inequalities are tight. If the data are uncorrelated ($X^\top X = I$), the trace Lasso reduces to $\|w\|_1$, and if the data are highly correlated (all columns of $X$ identical, i.e. $X^\top X = \mathbf{1}\mathbf{1}^\top$), the trace Lasso equals $\|w\|_2$.
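The interpolation property of the vector trace Lasso can be checked numerically. The following is a minimal sketch, assuming the standard definition of the trace Lasso as the nuclear norm of the data matrix with its columns scaled by the weight vector; the function name `trace_lasso` and the test matrices are hypothetical choices for illustration.

```python
import numpy as np

def trace_lasso(X, w):
    """Trace Lasso penalty: nuclear norm of X @ Diag(w)."""
    # Broadcasting X * w scales column j of X by w[j], i.e. forms X @ Diag(w).
    return np.linalg.norm(X * w, ord="nuc")

rng = np.random.default_rng(0)
w = rng.standard_normal(3)

# Uncorrelated design: orthonormal columns, so X^T X = I.
X_uncorrelated = np.eye(5)[:, :3]

# Perfectly correlated design: every column is the same unit vector.
x = rng.standard_normal(5)
x /= np.linalg.norm(x)
X_correlated = np.outer(x, np.ones(3))

# Generic design with unit-norm columns: the penalty lies between the two norms.
X_generic = rng.standard_normal((5, 3))
X_generic /= np.linalg.norm(X_generic, axis=0)

print(trace_lasso(X_uncorrelated, w), np.linalg.norm(w, 1))
print(trace_lasso(X_correlated, w), np.linalg.norm(w, 2))
print(np.linalg.norm(w, 2), trace_lasso(X_generic, w), np.linalg.norm(w, 1))
```

The first two printed pairs agree up to rounding, and the last line shows the generic value sandwiched between the $\ell_2$ and $\ell_1$ norms.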
3.2 Trace Lasso in matrix case
Let , define a linear operator as
and its adjoint operator as
where denotes -th block matrix of . Then, the trace Lasso in matrix case is defined as follows
It is easy to show that the trace Lasso regularizer in the matrix case (3.2) has properties similar to those in the vector case. If each column of the data matrix is normalized, then the linear operator can be rewritten as
where $e_i$ is the unit vector whose $i$-th component is 1 and whose other components are 0. There are two special cases:
If the data are highly correlated, in particular if all columns of the data matrix are identical and have unit norm, we have
where . Then the trace Lasso (3.2) reduces to the Frobenius norm
The following proposition shows that the trace Lasso (3.2) in the matrix case is adaptive to the correlation of the data, as is the original trace Lasso.
Let , and each column of is normalized, . Then
We first show that . Specifically, we have
Then, for the first inequality of (3.1) we have
Denote the -th column of the -th submatrix in by , and let , then for the second inequality of (3.1) we have
The first equality uses the fact that the dual norm of the trace norm is the operator norm. The last inequality uses that , and which implies . ∎
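The highly correlated special case can be illustrated numerically. The sketch below rests on an assumption made purely for illustration: that the linear operator maps a $p \times r$ matrix $W$ with columns $w_1, \dots, w_r$ to the horizontally concatenated blocks $[X\,\mathrm{Diag}(w_1), \dots, X\,\mathrm{Diag}(w_r)]$. This concatenated form is not taken from the text above, and the function name `matrix_trace_lasso` is hypothetical.

```python
import numpy as np

def matrix_trace_lasso(X, W):
    """Nuclear norm of an ASSUMED operator A(W) = [X Diag(w_1), ..., X Diag(w_r)]."""
    # ASSUMPTION: the blocks are stacked side by side; this form is illustrative only.
    blocks = [X * W[:, j] for j in range(W.shape[1])]
    return np.linalg.norm(np.hstack(blocks), ord="nuc")

rng = np.random.default_rng(1)
n, p, r = 6, 4, 2
W = rng.standard_normal((p, r))

# Highly correlated data: all columns of X equal the same unit vector.
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
X_correlated = np.outer(x, np.ones(p))

# Generic data with unit-norm columns.
X_generic = rng.standard_normal((n, p))
X_generic /= np.linalg.norm(X_generic, axis=0)

print(matrix_trace_lasso(X_correlated, W), np.linalg.norm(W, "fro"))
print(matrix_trace_lasso(X_generic, W), np.linalg.norm(W, "fro"))
```

Under this assumed operator, the first pair of values coincides (the penalty collapses to the Frobenius norm), while for generic normalized data the penalty is at least as large as the Frobenius norm.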
3.3 Regression framework of the adaptive SCCA
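We are given two data matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ on the same set of observations, where $n$ is the sample size and $p$ and $q$ are the numbers of features. Without loss of generality, we assume that the data matrices $X$ and $Y$ are mean centered. Expressing the sample covariance and cross-covariance matrices in terms of $X$ and $Y$, the CCA problem can be rewritten as
For multiple canonical vectors, let $U = [u_1, \dots, u_r]$ and $V = [v_1, \dots, v_r]$, where $(u_i, v_i)$ denotes the $i$-th pair of canonical vectors. The multiple CCA problem is
The CCA problem (3.2) can be reformulated as a constrained bilinear regression problem of the form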
To adapt to the dependence structure of the data, we consider an adaptive sparse CCA (SCCA) model with trace Lasso regularization. Specifically, we have
where , and are the penalty parameters, and are linear operators.
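The link between the trace form of the multiple CCA problem and the bilinear regression form used in this subsection can be checked with a short expansion. The sketch below assumes notation that is not fixed by the text above: data matrices $X$ and $Y$, canonical matrices $U$ and $V$ with $r$ columns each, and constraints normalized as $U^\top X^\top X U = I_r$ and $V^\top Y^\top Y V = I_r$ (any $1/n$ scaling is taken to be absorbed into the data).

```latex
\begin{aligned}
\|XU - YV\|_F^2
  &= \operatorname{tr}(U^\top X^\top X U)
     - 2\operatorname{tr}(U^\top X^\top Y V)
     + \operatorname{tr}(V^\top Y^\top Y V) \\
  &= 2r - 2\operatorname{tr}(U^\top X^\top Y V).
\end{aligned}
```

Hence, on the feasible set, minimizing the regression residual $\|XU - YV\|_F^2$ is equivalent to maximizing $\operatorname{tr}(U^\top X^\top Y V)$ under these assumptions.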
4 Optimization method for SCCA (3.3)
The SCCA model (3.3) is a nonconvex and nonsmooth optimization problem, and it is difficult to solve. Riemannian optimization methods are popular for solving constrained optimization problems with special structure. Hence, in this section we first reformulate problem (3.3) as a nonsmooth optimization problem on generalized Stiefel manifolds, and then adopt a manifold inexact augmented Lagrangian method from the literature to solve the resulting problem. Finally, we give a convergence analysis of the proposed method.
4.1 Augmented Lagrangian scheme
Let , and , then problem (3.3) can be reformulated as
Here, we assume that the two matrices defining the constraints are positive definite (if one of them is not positive definite, it can be replaced by a positive definite surrogate). Then the two feasible sets can be regarded as generalized Stiefel manifolds, and problem (4.1) is an optimization problem on generalized Stiefel manifolds. We further reformulate (4.1) as
The Lagrangian function associated with (4.1) is given by
where and denote the Lagrangian multipliers. Let be a penalty parameter. Then, the corresponding augmented Lagrangian function is given by
4.2 Convergence analysis
Let be a variable formed by concatenating and . Let where . Then, problem (4.1) can be rewritten as
where , and is given by
The corresponding augmented Lagrangian can be rewritten as
The corresponding KKT condition is given by
where is the Riemannian subdifferential of at . To obtain an efficient implementation of Algorithm 1, we solve the iteration subproblem (4.1) inexactly, using the following stopping criterion:
where as .
Definition 4.1 (LICQ).
The linear independence constraint qualification (LICQ) is said to hold at a feasible point of problem (4.2) if
See . ∎
The LICQ always holds at for problem (4.2) .
Let , then
For all , let be the matrix whose entry in the -th row and -th column is 1 and whose other entries are 0. Then
A basis of the normal cone of at , denoted by , is given by
It is easy to show that, , if there exists such that
then . Since the manifold is a submanifold of Euclidean space, it follows immediately from (4.2) that
This implies that LICQ holds at the given point, which completes the proof. ∎
4.3 Riemannian gradient method for subproblem (4.1)
In Section 4.1, we presented a manifold inexact augmented Lagrangian method to solve problem (4.1). The main challenge in the proposed method (Algorithm 1) is to solve subproblem (4.1) efficiently. Problem (4.1) is a nonsmooth problem with manifold constraints. In this subsection, we first derive an equivalent smooth optimization problem by using the Moreau envelope technique, and then present a Riemannian gradient method to solve the equivalent problem.
The proximal mapping associated with is defined by
For fixed and , we consider
then can be computed by
Notice that the subproblems for and are proximal operators. Since both nonsmooth terms in (4.3) are nuclear norm functions, each proximal operator is a singular value shrinkage operator, which is given by:
where and .
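The singular value shrinkage operator mentioned above is the standard proximal operator of a scaled nuclear norm. As a minimal sketch (the function name `singular_value_shrinkage` and the threshold name `tau` are hypothetical, and the scaling convention with a factor of one half on the quadratic term is an assumption):

```python
import numpy as np

def singular_value_shrinkage(Z, tau):
    """Proximal operator of tau * ||.||_* at Z.

    Solves argmin_P  tau * ||P||_* + 0.5 * ||P - Z||_F^2
    by soft-thresholding the singular values of Z.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # soft-threshold each singular value
    return (U * s_shrunk) @ Vt            # U @ Diag(s_shrunk) @ Vt

rng = np.random.default_rng(2)
Z = rng.standard_normal((5, 8))
P = singular_value_shrinkage(Z, tau=1.0)
print(np.linalg.svd(Z, compute_uv=False))
print(np.linalg.svd(P, compute_uv=False))
```

Each singular value of the output is the corresponding singular value of the input reduced by the threshold and clipped at zero, so a large enough threshold lowers the rank.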
Now we focus on the subproblem in (4.3) with respect to the joint variable. Recall that
Let and be a product manifold. Then, problem (4.3) can be formulated as
By Lemma B.1, is continuously differentiable in Euclidean space, and its Euclidean gradient is
Since is a Riemannian submanifold in Euclidean space, by Lemma B.2 is retraction smooth, and its Riemannian gradient is
This shows that problem (4.3) is a smooth optimization problem on a Riemannian manifold. In this paper, we adopt a Riemannian Barzilai-Borwein (RBB) gradient method to solve problem (4.3); see Algorithm 2 for details.
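Two building blocks of any Riemannian gradient method on a generalized Stiefel manifold are a tangent-space projection, which turns a Euclidean gradient into a Riemannian gradient when the manifold is viewed as a submanifold of Euclidean space, and a retraction, which maps a step back onto the manifold. The sketch below illustrates both for the set of matrices $U$ with $U^\top M U = I$ and $M$ positive definite. It is not a reproduction of Algorithm 2: the Barzilai-Borwein step-size rule is omitted, the polar-type retraction is one common choice among several, and the names `project_tangent`, `retract`, and `sym_inv_sqrt` are hypothetical.

```python
import numpy as np

def sym_inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q / np.sqrt(w)) @ Q.T

def project_tangent(U, G, M):
    """Orthogonal projection (Euclidean metric) of G onto the tangent space at U.

    Tangent space: {xi : U^T M xi + xi^T M U = 0}.
    Normal space:  {M U S : S symmetric}, so we solve B S + S B = C for S.
    """
    MU = M @ U
    B = MU.T @ MU                      # U^T M^2 U, symmetric positive definite
    C = MU.T @ G + G.T @ MU            # symmetric right-hand side
    d, Q = np.linalg.eigh(B)
    S = Q @ ((Q.T @ C @ Q) / (d[:, None] + d[None, :])) @ Q.T
    return G - MU @ S

def retract(U, xi, M):
    """Polar-type retraction: rescale U + xi so that the result Y satisfies Y^T M Y = I."""
    Y = U + xi
    return Y @ sym_inv_sqrt(Y.T @ M @ Y)

rng = np.random.default_rng(3)
p, r = 6, 2
A = rng.standard_normal((20, p))
M = A.T @ A / 20 + 0.1 * np.eye(p)                  # positive definite by construction
U0 = retract(rng.standard_normal((p, r)), np.zeros((p, r)), M)  # a feasible point
G = rng.standard_normal((p, r))                     # stand-in for a Euclidean gradient
xi = project_tangent(U0, G, M)                      # Riemannian gradient direction
U1 = retract(U0, -0.1 * xi, M)                      # one fixed-step descent move
print(np.linalg.norm(U1.T @ M @ U1 - np.eye(r)))
```

The printed residual is at the level of rounding error, confirming that the step stays on the manifold.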
5 Random Simulation
In this section, the performance of the SCCA model and the proposed method is verified by random simulation. The proposed adaptive trace-Lasso-regularized CCA is compared with the sparse CCA model CoLaR proposed by Gao. CoLaR is a computationally feasible two-stage method, which consists of a convex-programming-based initialization stage and a group-Lasso-based refinement stage. In the first stage, CoLaR solves the following convex minimization problem:
where and are sample covariance matrices. Let and be the matrices whose column vectors are respectively the top left and right singular vectors of . Then, in the second stage, CoLaR solves the following group-Lasso problem: