 # Tensor Canonical Correlation Analysis

In many applications, such as classification of images or videos, it is of interest to develop a framework for tensor data instead of ad-hoc ways of transforming data to vectors, due to computational and under-sampling issues. In this paper, we study canonical correlation analysis by extending the framework of two-dimensional analysis (Lee and Choi, 2007) to tensor-valued data. Instead of adopting the iterative algorithm provided in Lee and Choi (2007), we propose an efficient algorithm, called the higher-order power method, which is commonly used in tensor decomposition and more efficient in large-scale settings. Moreover, we carefully examine the theoretical properties of our algorithm and establish a local convergence property via the theory of Lojasiewicz inequalities. Our results fill a missing, but crucial, part of the literature on tensor data. For practical applications, we further develop (a) an inexact updating scheme which allows us to use state-of-the-art stochastic gradient descent algorithms, (b) an effective initialization scheme which alleviates the problem of local optima in non-convex optimization, and (c) an extension for extracting several canonical components. Empirical analyses on challenging data, including gene expression, air pollution indexes in Taiwan, and electricity demand in Australia, show the effectiveness and efficiency of the proposed methodology.


## 1 Introduction

Canonical Correlation Analysis (CCA) is a widely used multivariate statistical method that aims to learn a low dimensional representation from two sets of data by maximizing the correlation of linear combinations of the two sets (Hotelling, 1936). The sets of data could be different views (deformations) of images, different measurements of the same object, or data with labels. CCA has been widely used in many applications, ranging from text and image retrieval (Mroueh et al., 2016) and image clustering (Jin et al., 2015) to various scientific fields (Hardoon et al., 2004). In contrast to principal component analysis (PCA) (Pearson, 1901), which is an unsupervised learning technique for finding low dimensional data representations, CCA benefits from label information (Sun and Chen, 2007), other views of the data, or correlated side information as in multi-view learning (Sun, 2013).

In this paper, we study CCA in the context of matrix-valued and tensor-valued data. Matrix-valued data has been studied in (Gupta and Nagar, 2000; Kollo and von Rosen, 2005; Werner et al., 2008; Leng and Tang, 2012; Yin and Li, 2012; Zhao and Leng, 2014; Zhou and Li, 2014), and tensor-valued data in (Ohlson et al., 2013; Manceur and Dutilleul, 2013; Lu et al., 2011). Practitioners often apply CCA to such data by first converting them into vectors, thus destroying the inherent structure present in the samples, increasing the dimensionality of the problem, and increasing the computational complexity. Lee and Choi (2007) proposed a two-dimensional CCA (2DCCA) that addresses these problems for matrix-valued data by preserving the data representation and alleviating the expensive computation that arises from vectorizing the data. However, convergence properties of the iterative singular value decomposition (SVD) algorithm proposed in Lee and Choi (2007) are not known.

In this paper, we develop a power method that can solve the 2DCCA problem faster and provably converges to a stationary point of the objective function. More generally, we present a natural extension of 2DCCA to tensor-valued data that projects the data onto the space of low-rank tensors to maximize the correlation. We present two algorithms for solving the resulting optimization problem, the higher-order power method (HOPM) and its variant, alternating least squares (ALS), which are particularly effective in large data settings. HOPM is a widely used algorithm for tensor decomposition (Kolda and Bader, 2009; De Lathauwer et al., 2000a, b), with convergence properties intensely studied in recent years (Wen et al., 2012; Uschmajew, 2015; Xu and Yin, 2013; Li et al., 2015; Guan et al., 2018). However, due to the constraints of the CCA optimization problem, these convergence results are not directly applicable. In order to establish convergence of HOPM for the tensor-valued CCA problem, we first establish the equivalence of the HOPM and ALS algorithms. Utilizing this equivalence, we present a transparent convergence proof for the rank-one case by applying the theory of analytical gradient flows and the Lojasiewicz gradient inequality. The convergence is established in a model-free setting and applies to any stationary point. We propose an effective initialization procedure to reduce the probability of the algorithms getting trapped in a local maximum. Moreover, we develop an inexact updating rule, which enables us to use a gradient descent method suitable for large-scale data and convenient for error analysis. We also discuss how to extend HOPM for extracting multiple canonical components.

### 1.1 Main Contributions

We make several contributions in this paper.

First, we interpret 2DCCA (Lee and Choi, 2007) as a non-convex method for a low-rank tensor factorization problem. This allows us to formulate a novel tensor extension of 2DCCA, which we term tensor canonical correlation analysis (TCCA). Kim et al. (2007) and Luo et al. (2015) discuss tensor extensions of CCA in different settings: the former stresses correlation between each mode, while the latter discusses CCA when multiple views of the data are available. Neither provides theoretical justifications or efficient algorithms. See Section 1.2 for more discussion.

Second, we develop the higher-order power method and its variant, alternating least squares, for solving the TCCA problem. One obstacle in the analysis of tensor factorization is unidentifiability: since the correlation objective is scale invariant, any rescaling of a solution is also a solution. Thus, without additional regularization or penalty, there is no guarantee preventing the iteration sequence of the HOPM from going to zero or infinity. Moreover, this property also implies that the iteration sequence can oscillate even when it is close to a stationary point. We illustrate this in Section 6. In other words, local convergence may not hold for the HOPM. To circumvent this difficulty and gain further numerical stability, we propose the ALS, which uses a different normalization scheme, and show that HOPM and ALS with the same initialization are equivalent. See Proposition 3 for a formal statement. Utilizing this insight and borrowing tools from tensor factorization, we show that ALS and, hence, HOPM have the local convergence property.

Third, we consider several practical issues that arise in the analysis of real data. We provide an effective initialization scheme to reduce the probability of the algorithm getting trapped in a poor local optimum and develop a faster algorithm for large-scale data. Furthermore, we discuss the generalization of TCCA to its higher-rank variant, which allows for the extraction of multiple features from the data. We justify the generalization under a probabilistic model, and a deflation procedure is provided for practical applications.

Finally, we apply our method to several real data sets, including gene expression data, air pollution data, and electricity demands. The results demonstrate that our method is efficient and effective even in the extremely high-dimensional setting and can be used to extract useful and interpretable features from the data.

### 1.2 Related Work

Two papers in the literature are closely related to our tensor canonical correlation analysis. Luo et al. (2015) consider a multiple-view extension of CCA, which results in a tensor covariance structure. However, they only consider vector-valued data, so vectorization is still required in their setting. Kim et al. (2007) extend CCA to tensor data, focusing on three-dimensional videos, but only consider joint space-time linear relationships, which is neither the low-rank approximation of CCA nor an extension of 2DCCA. Moreover, none of the aforementioned papers provide convergence analysis or efficient algorithms, as we detail in Section 4. Furthermore, 2DCCA (Lee and Choi, 2007) and other 2D extensions proposed in the literature (Yang et al., 2004; Ye et al., 2004; Zhang and Zhou, 2005; Yang et al., 2008) also lack rigorous analysis.

Various extensions of CCA are available in the literature. Kernel methods aim to extract nonlinear relations by implicitly mapping data to a high-dimensional feature space (Hardoon et al., 2004; Fukumizu et al., 2007). Other nonlinear transformations have been considered to extract more sophisticated relations embedded in the data. Andrew et al. (2013) utilized neural networks to learn nonlinear transformations, Michaeli et al. (2016) used Lancaster's theory to extend CCA to a nonparametric setting, while Lopez-Paz et al. (2014) proposed a randomized nonlinear CCA, which uses random features to approximate a kernel matrix. Bach and Jordan (2006) and Safayani et al. (2018) interpreted CCA as a latent variable model, which allows development of an EM algorithm for parameter estimation. In the probabilistic setting, Wang et al. (2016b) developed variational inference. In general, nonlinear extensions of CCA often lead to complicated optimization problems and lack rigorous theoretical justifications.

Our work is also related to scalable methods for solving the CCA problem. Although CCA can be solved as a generalized eigenvalue problem (see Section 3.1 for details), computing the singular value (or eigenvalue) decomposition of a large (covariance) matrix is computationally expensive. Finding an efficient optimization method for CCA remains an important problem, especially for large-scale data, and has been intensively studied (Yger et al., 2012; Ge et al., 2016; Allen-Zhu and Li, 2017; Xu et al., 2018; Fu et al., 2017; Arora et al., 2017; Bhatia et al., 2018). Ma et al. (2015) and Wang et al. (2016a) are particularly related to our work. Wang et al. (2016a) formulated CCA as a regularized least squares problem and solved it with stochastic gradient descent. Inspired by their work, we optimize the TCCA objective using an alternating least squares algorithm and bound its error accumulation in Section 5. Ma et al. (2015) developed an augmented approximate gradient (AppGrad) scheme to avoid computing the inverse of a large matrix. Although AppGrad does not directly maximize the correlation in each iteration, it can be shown that the CCA solution is a fixed point of the AppGrad scheme. Therefore, if the initialization is sufficiently close to the global CCA solution, the iterates of AppGrad converge to it. A similar observation holds for the HOPM and ALS algorithms developed in this paper, with the difference that we use alternating minimization instead of gradient descent. More details and discussion are provided in Section 4.2.

### 1.3 Organization of the paper

In Section 2, we define basic notation and operators of multilinear algebra and introduce the Lojasiewicz inequality. Section 3 reviews the concept of CCA as well as 2DCCA and formulates the tensor CCA. Section 4 presents our algorithms, HOPM and ALS, and studies their convergence. Section 5 deals with practical considerations including an inexact updating rule, effective initialization, and extension to higher rank tensors. Numerical simulations verifying the theoretical results and applications to real data are given in Section 6. All technical proofs can be found in the appendix.

## 2 Preliminaries

In this section, we provide necessary background for the subsequent analysis. First, we present some basic notation and operators in multilinear algebra. Then, we introduce the Lojasiewicz inequality and the associated convergence results.

### 2.1 Multilinear Algebra and Notation

We briefly introduce the notation and concepts from multilinear algebra needed for our analysis. See Kolda and Bader (2009) for more details. We start with the notion of multi-dimensional arrays, which are called tensors. The order of a tensor, also called its way or mode, is the number of dimensions the tensor has. For example, a vector is a first-order tensor and a matrix is a second-order tensor. We use the following convention to distinguish between scalars, vectors, matrices, and higher-order tensors: scalars are denoted by lower-case letters ($x$), vectors by upper-case letters ($U$), matrices by bold-face upper-case letters ($\mathbf{X}$), and tensors by calligraphic letters ($\mathcal{X}$). For an $m$-mode tensor $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_m}$, we denote its $(i_1, \dots, i_m)$-th element by $x_{i_1 i_2 \dots i_m}$.

Next, we define some useful operators in multilinear algebra. Matricization, also known as unfolding, is the process of transforming a tensor into a matrix. The mode-$k$ matricization of a tensor $\mathcal{X}$ is denoted by $\mathbf{X}_{(k)}$ and arranges the mode-$k$ fibers to be the columns of the resulting matrix. More specifically, tensor element $(i_1, \dots, i_m)$ maps to matrix element $(i_k, j)$, where $j = 1 + \sum_{l \neq k} (i_l - 1) J_l$ with $J_l = \prod_{l' < l,\, l' \neq k} d_{l'}$. The vector obtained by vectorization of $\mathcal{X}$ is denoted by $\operatorname{vec}(\mathcal{X})$. The Frobenius norm of the tensor is defined as $\|\mathcal{X}\| = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$, where $\langle \cdot, \cdot \rangle$ is the inner product defined on two tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{d_1 \times \cdots \times d_m}$ and given by

$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{d_1} \cdots \sum_{i_m=1}^{d_m} x_{i_1 i_2 \dots i_m}\, y_{i_1 i_2 \dots i_m}.$$

The mode-$k$ product of a tensor $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_m}$ with a matrix $\mathbf{A} \in \mathbb{R}^{J \times d_k}$ is a tensor of size $d_1 \times \cdots \times d_{k-1} \times J \times d_{k+1} \times \cdots \times d_m$ defined by

$$(\mathcal{X} \times_k \mathbf{A})_{i_1 \dots i_{k-1}\, j\, i_{k+1} \dots i_m} = \sum_{i_k=1}^{d_k} x_{i_1 i_2 \dots i_m}\, a_{j i_k}.$$

The outer product of vectors $U_1, \dots, U_m$ is an $m$-order tensor defined by

$$(U_1 \circ \cdots \circ U_m)_{i_1 i_2 \dots i_m} = (U_1)_{i_1} \cdots (U_m)_{i_m}.$$

The Kronecker product of two matrices $\mathbf{A} \in \mathbb{R}^{m_1 \times n_1}$ and $\mathbf{B} \in \mathbb{R}^{m_2 \times n_2}$ is the $m_1 m_2 \times n_1 n_2$ matrix given by $\mathbf{A} \otimes \mathbf{B} = [a_{ij}\mathbf{B}]$. We call $\mathcal{X}$ a rank-one tensor if there exist vectors $U_1, \dots, U_m$ such that $\mathcal{X} = U_1 \circ \cdots \circ U_m$. Given $n$ samples $\mathcal{X}_1, \dots, \mathcal{X}_n$, it is also useful to define the data tensor $\mathcal{X}_{1:n}$ with elements $(\mathcal{X}_{1:n})_{t\, i_1 \dots i_m} = (\mathcal{X}_t)_{i_1 \dots i_m}$.
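The operators above translate directly into code. The following is a minimal NumPy sketch of the mode-$k$ matricization and the mode-$k$ product; the function names `unfold` and `mode_k_product` are our own, and the column ordering of `unfold` may differ from the convention stated in the text.

```python
import numpy as np

def unfold(X, k):
    """Mode-k matricization: arrange the mode-k fibers as columns.
    (Column ordering may differ from the convention in the text.)"""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_k_product(X, A, k):
    """Mode-k product X x_k A: apply A along mode k of X."""
    Y = np.tensordot(A, X, axes=(1, k))  # contracted mode moves to the front
    return np.moveaxis(Y, 0, k)          # move it back to position k

X = np.arange(24.0).reshape(2, 3, 4)
A = np.ones((5, 3))
print(unfold(X, 1).shape)             # (3, 8)
print(mode_k_product(X, A, 1).shape)  # (2, 5, 4)
```

As a sanity check, the mode-$k$ product with an identity matrix leaves the tensor unchanged, and the mode-0 unfolding of a matrix is the matrix itself.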

### 2.2 Convergence Analysis via Lojasiewicz Inequality

There are two common approaches for establishing convergence of non-convex methods. One assumes that the initial point and the global optimum lie in a convergence region that is locally convex and, thus, the optimization procedure converges to the global optimum. For certain problems, there are effective initialization schemes that provide, with high probability, an initial point that is close enough (within the convergence region) to the global optimum. We provide an effective initialization scheme in Section 5.3 for our method.

Another approach is based on the theory of analytical gradient flows, which allows establishing convergence of gradient descent based algorithms for difficult problems, such as non-smooth, non-convex, or manifold optimization. This approach is also useful for problems arising in tensor decomposition, where the optimum set is neither compact nor isolated. For example, in the case of PCA or CCA problems, the optimum set is a low dimensional subspace or manifold. The advantage of this approach is that it can be applied to study convergence to all stationary points, without assuming a model for the data. The drawback is that the convergence rate on a particular problem depends on the Lojasiewicz exponent in Lemma 1 below, which is hard to compute explicitly. However, Liu et al. (2018) recently showed that the Lojasiewicz exponent at any critical point of a quadratic optimization problem with orthogonality constraints is $1/2$, which leads to linear convergence of gradient descent. Li and Pong (2018) developed various calculus rules to deduce the Lojasiewicz exponent.

The method of the Lojasiewicz gradient inequality allows us to study an optimization problem

$$\min_{Z \in \mathbb{R}^p} f(Z), \tag{1}$$

where $f$ may not be convex, to which we apply a gradient-based algorithm for finding stationary points. The key ingredient for establishing a linear or sublinear convergence rate is the following Lojasiewicz gradient inequality (Łojasiewicz, 1965).

###### Lemma 1 (Lojasiewicz gradient inequality).

Let $f$ be a real analytic function on a neighborhood of a stationary point $X$ in $\mathbb{R}^p$. Then, there exist a constant $c > 0$ and the Lojasiewicz exponent $\theta \in (0, 1/2]$ such that

$$|f(Y) - f(X)|^{1-\theta} \le c\, \|\nabla f(Y)\|, \tag{L}$$

where $Y$ is in some neighborhood of $X$.

See Absil et al. (2005) and references therein for the proof and discussion. Since the objective function appearing in the CCA optimization problem is a polynomial, we can use the Lojasiewicz gradient inequality in our convergence analysis. Suppose $\{Z_k\}$ is a sequence of iterates produced by a descent algorithm satisfying the following assumptions:

Primary descent condition:

there exists $\sigma > 0$ such that, for sufficiently large $k$, it holds that

$$f(Z_k) - f(Z_{k+1}) \ge \sigma\, \|\nabla f(Z_k)\|\, \|Z_k - Z_{k+1}\|. \tag{A1}$$
Stationary condition:

for sufficiently large $k$, it holds that

$$\nabla f(Z_k) = 0 \implies Z_k = Z_{k+1}. \tag{A2}$$
Asymptotic small step-size safeguard:

there exists $\kappa > 0$ such that, for large enough $k$, it holds that

$$\|Z_{k+1} - Z_k\| \ge \kappa\, \|\nabla f(Z_k)\|. \tag{A3}$$

We then have the following theorem, which is the main tool in our analysis; its proof can be found in Schneider and Uschmajew (2015).

###### Theorem 2.

Under the conditions of Lemma 1 and assumptions (A1) and (A2), if there exists a cluster point $Z^*$ of the sequence $\{Z_k\}$ satisfying (L), then $Z^*$ is the limit of the sequence, i.e., $Z_k \to Z^*$. Furthermore, if (A3) holds, then

$$\|Z_k - Z^*\| \le C \begin{cases} e^{-ck} & \text{if } \theta = \tfrac{1}{2}, \text{ for some } c > 0, \\ k^{-\theta} & \text{if } 0 < \theta < \tfrac{1}{2}, \end{cases}$$

for some $C > 0$. Moreover, $\nabla f(Z^*) = 0$.

It is not hard to see that the gradient descent iterates for a strongly convex and smooth function satisfy conditions (A1)-(A3) and the Lojasiewicz gradient inequality. Moreover, if one can show that Lojasiewicz's inequality holds with $\theta = 1/2$ on the set of stationary points, then linear convergence to stationary points can be established (Liu et al., 2018). In fact, they related Lojasiewicz's inequality to the following error bound (Luo and Tseng, 1993):

$$\|\nabla f(X)\| \ge \mu\, \|X - X^*\|,$$

where $\mu > 0$ and $X^*$ is in the set of stationary points. Karimi et al. (2016) proved that these two conditions actually imply quadratic growth:

$$f(X) - f(X^*) \ge \frac{\mu'}{2}\, \|X - X^*\|^2,$$

where $\mu' > 0$. This means that $f$ behaves like a strongly convex function when $X$ is close to a stationary point. These local conditions are useful in non-convex and even non-smooth, non-Euclidean settings. See Karimi et al. (2016) and Bolte et al. (2017) for more discussion.

## 3 Tensor Canonical Correlation Analysis

In this section, we review canonical correlation analysis for vector-valued data, focusing on finding the first pair of canonical variables. Next, we present the extension of CCA to matrix-valued data and our proposal for tensor-valued data. Section 5.2 discusses how to extend the approach for finding multiple canonical variables. Unless otherwise stated, we assume that all random elements in this paper are zero-mean.

### 3.1 Canonical Correlation Analysis

We review the classical canonical correlation analysis in this section. Consider two multivariate random vectors $X \in \mathbb{R}^{d_x}$ and $Y \in \mathbb{R}^{d_y}$. Canonical correlation analysis aims to maximize the correlation between projections of the two random vectors and can be formulated as the following maximization problem

$$\max_{U \in \mathbb{R}^{d_x},\, V \in \mathbb{R}^{d_y}} \operatorname{corr}(U^\top X, V^\top Y) = \max_{U \in \mathbb{R}^{d_x},\, V \in \mathbb{R}^{d_y}} \frac{\operatorname{cov}(U^\top X, V^\top Y)}{\sqrt{\operatorname{var}(U^\top X)\, \operatorname{var}(V^\top Y)}}. \tag{2}$$

The above optimization problem can be written equivalently in the following constrained form

$$\max_{U, V}\; U^\top \Sigma_{XY} V \quad \text{s.t.} \quad U^\top \Sigma_{XX} U = 1 = V^\top \Sigma_{YY} V,$$

where, under the zero-mean assumption, $\Sigma_{XY} = \mathbb{E}[XY^\top]$, $\Sigma_{XX} = \mathbb{E}[XX^\top]$, and $\Sigma_{YY} = \mathbb{E}[YY^\top]$. Using the standard technique of Lagrange multipliers, we can obtain $U$ and $V$ by solving the following generalized eigenvalue problem

$$\begin{bmatrix} 0 & \Sigma_{XY} \\ \Sigma_{YX} & 0 \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix} = \lambda \begin{bmatrix} \Sigma_{XX} & 0 \\ 0 & \Sigma_{YY} \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix}, \tag{3}$$

where $\Sigma_{YX} = \Sigma_{XY}^\top$ and $\lambda$ is the Lagrange multiplier. In practice, the data distribution is unknown, but i.i.d. draws from the distribution are available. We can estimate the canonical correlation coefficients by replacing the expectations with sample averages, leading to the empirical version of the optimization problem.
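For concreteness, the first canonical pair can be computed from sample covariances by whitening and an SVD, which is equivalent to solving the generalized eigenvalue problem (3). The following is a self-contained NumPy sketch; the function name and the small ridge term `reg` (added for numerical stability with rank-deficient covariances) are our own choices.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """First canonical directions (U, V) and correlation for
    column-centered data X (n, dx) and Y (n, dy)."""
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    # Whiten: T = Lx^{-1} Sxy Ly^{-T}, with Sxx = Lx Lx^T, Syy = Ly Ly^T
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    P, s, Qt = np.linalg.svd(T)
    # Un-whiten the leading singular vectors
    U = np.linalg.solve(Lx.T, P[:, 0])
    V = np.linalg.solve(Ly.T, Qt[0])
    return U, V, s[0]  # s[0] is the first canonical correlation
```

By construction, $U^\top \Sigma_{XX} U \approx 1$ and $V^\top \Sigma_{YY} V \approx 1$, matching the constraints of the problem.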

### 3.2 Two-Dimensional Canonical Correlation Analysis

Lee and Choi (2007) proposed two-dimensional canonical correlation analysis (2DCCA), which extends CCA to matrix-valued data. The naive way of dealing with matrix-valued data is to reshape each data matrix into a vector and then apply CCA. However, this preprocessing breaks the structure of the data and introduces several side effects, including increased computational complexity and a larger amount of data required for accurate estimation of the canonical directions. To overcome this difficulty, 2DCCA maintains the original data representation and performs CCA-style dimension reduction as follows. Given two data matrices $\mathbf{X}$ and $\mathbf{Y}$, 2DCCA seeks left and right transformations $L_x, R_x$ and $L_y, R_y$ that maximize the correlation between $L_x^\top \mathbf{X} R_x$ and $L_y^\top \mathbf{Y} R_y$, and can be formulated as

$$\max_{L_x, R_x, L_y, R_y} \operatorname{Cov}(L_x^\top \mathbf{X} R_x,\, L_y^\top \mathbf{Y} R_y) \quad \text{s.t.} \quad \operatorname{Var}(L_x^\top \mathbf{X} R_x) = 1 = \operatorname{Var}(L_y^\top \mathbf{Y} R_y). \tag{4}$$

Lee and Choi (2007) developed an iterative SVD-based algorithm for solving the problem above. The idea is that, for fixed $R_x$ and $R_y$, $\mathbf{X} R_x$ and $\mathbf{Y} R_y$ are random vectors, so $L_x$ and $L_y$ can be obtained using any CCA algorithm; similarly, we can switch the roles of the variables and fix $L_x$ and $L_y$. Thus, the algorithm alternates between updating $L_x$ and $L_y$ with $R_x$ and $R_y$ fixed, and updating $R_x$ and $R_y$ with $L_x$ and $L_y$ fixed. We call it an SVD-based algorithm, since Lee and Choi (2007) solve CCA using the generalized eigendecomposition in (3). It is well known that computing the SVD of large matrices is computationally and memory intensive (Arora et al., 2012; Wang et al., 2018). Moreover, as shown in Section 6, the sequence of iterates obtained by this procedure can oscillate even when the iterates are close to a stationary point. See the comparison with HOPM in the last paragraph of Section 4.1.

To the best of our knowledge, there is no clear interpretation of why 2DCCA improves classification results or how the method relates to the data structure, nor does a rigorous convergence analysis exist for 2DCCA. We address these issues by first noting that the 2DCCA problem in (4) can be expressed as

$$\max_{\mathbf{U}, \mathbf{V}:\, \operatorname{rank}(\mathbf{U}) = 1 = \operatorname{rank}(\mathbf{V})} \operatorname{corr}\big(\operatorname{Tr}(\mathbf{U}^\top \mathbf{X}),\, \operatorname{Tr}(\mathbf{V}^\top \mathbf{Y})\big). \tag{5}$$

Therefore, we can treat 2DCCA as the CCA optimization problem with a low-rank restriction on the canonical directions, and the SVD-based algorithm as a particular non-convex solver for this constrained program. Furthermore, formulating the optimization problem as in (5) allows us to use techniques based on the Burer-Monteiro factorization (Chen and Wainwright, 2015; Park et al., 2018, 2017; Sun and Dai, 2017; Haeffele et al., 2014) and non-convex optimization (Yu et al., 2018; Zhao et al., 2015; Yu et al., 2017; Li et al., 2019; Yu et al., 2019).
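The reformulation in (5) rests on the identity that, for a rank-one direction built from left and right vectors, the trace inner product reduces to a bilinear form: $\operatorname{Tr}((L R^\top)^\top \mathbf{X}) = L^\top \mathbf{X} R$. A quick numerical check of this identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))
l, r = rng.standard_normal(4), rng.standard_normal(5)

# Rank-one direction U = l r^T: the 2DCCA bilinear form l^T X r equals
# the CCA-style inner product Tr(U^T X) used in formulation (5).
U = np.outer(l, r)
assert np.isclose(l @ X @ r, np.trace(U.T @ X))
```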

### 3.3 Canonical Correlation Analysis with Tensor-valued Data

We are ready to present the problem of tensor canonical correlation analysis, a natural extension of 2DCCA. Consider two zero-mean random tensors $\mathcal{X}$ and $\mathcal{Y}$. For simplicity of presentation, we assume the two random tensors have the same mode and shape; this assumption is not needed in the analysis or the algorithm. TCCA seeks two rank-one tensors $\mathcal{U}$ and $\mathcal{V}$ that maximize the correlation between $\langle \mathcal{U}, \mathcal{X} \rangle$ and $\langle \mathcal{V}, \mathcal{Y} \rangle$,

$$\max_{\mathcal{U}, \mathcal{V}} \operatorname{Corr}(\langle \mathcal{U}, \mathcal{X} \rangle, \langle \mathcal{V}, \mathcal{Y} \rangle). \tag{6}$$

Since the population distribution is unknown, replacing the covariance and cross-covariance matrices by their empirical counterparts gives the empirical version of the optimization problem in (6),

$$\max_{\mathcal{U}, \mathcal{V}} \rho_n(\mathcal{U}, \mathcal{V}),$$

where $\rho_n$ is the sample correlation defined as

$$\rho_n(\mathcal{U}, \mathcal{V}) = \frac{\frac{1}{n}\sum_{t=1}^n \langle \mathcal{U}, \mathcal{X}_t \rangle \langle \mathcal{V}, \mathcal{Y}_t \rangle}{\sqrt{\frac{1}{n}\sum_{t=1}^n \langle \mathcal{U}, \mathcal{X}_t \rangle^2 \cdot \frac{1}{n}\sum_{t=1}^n \langle \mathcal{V}, \mathcal{Y}_t \rangle^2}} \tag{7}$$

and $\mathcal{X}_t, \mathcal{Y}_t$, $t = 1, \dots, n$, are the samples from the unknown distributions of $\mathcal{X}, \mathcal{Y}$. The following empirical residual form of the problem (6) will be useful for developing efficient algorithms

$$\min_{\mathcal{U}, \mathcal{V}} \frac{1}{2n}\sum_{t=1}^n \big(\langle \mathcal{U}, \mathcal{X}_t \rangle - \langle \mathcal{V}, \mathcal{Y}_t \rangle\big)^2 \quad \text{s.t.} \quad \frac{1}{n}\sum_{t=1}^n \langle \mathcal{U}, \mathcal{X}_t \rangle^2 = 1 = \frac{1}{n}\sum_{t=1}^n \langle \mathcal{V}, \mathcal{Y}_t \rangle^2. \tag{8}$$

This formulation reveals that TCCA is related to tensor decomposition. Furthermore, the sub-problem obtained by fixing all components of $\mathcal{U}$ and $\mathcal{V}$ except one is equivalent to a least squares problem. This allows us to use state-of-the-art solvers based on (stochastic) gradient descent that are especially suitable for large-scale data (Wang et al., 2016a) and to develop the computationally efficient algorithms for CCA with tensor data proposed in Section 5.
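Since the directions are rank-one, each score reduces to a sequence of small contractions rather than an inner product of vectorized tensors. The sample correlation in (7) can be evaluated as in the following sketch, where the helper `rho_n` and its data layout (samples stacked along the first axis, rank-one directions given as lists of factor vectors) are our own conventions.

```python
import numpy as np

def rho_n(U, V, Xs, Ys):
    """Sample correlation (7) between <U, X_t> and <V, Y_t>,
    where U, V are rank-one tensors given by their factor vectors
    and Xs, Ys stack the n samples along the first axis."""
    def scores(factors, data):
        # <u1 o ... o um, T> = T x_1 u1^T ... x_m um^T, per sample
        s = data
        for u in factors:
            s = np.tensordot(s, u, axes=(1, 0))  # contract next data mode
        return s  # shape (n,)
    a, b = scores(U, Xs), scores(V, Ys)
    n = len(a)
    return (a @ b / n) / np.sqrt((a @ a / n) * (b @ b / n))
```

By the Cauchy-Schwarz inequality, the returned value always lies in $[-1, 1]$, and for one-mode (vector) data it reduces to the usual uncentered sample correlation.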

## 4 Algorithms

We present the higher-order power method and its alternating least squares variant for solving the TCCA problem, and we establish convergence results. Note that we discuss only the rank-one factorization here and leave the generalization to Section 5.

### 4.1 Higher-order Power Method

The power method is a practical tool for finding the leading eigenvector of a matrix and is also used in tensor factorization. There are many interpretations of the power method; here we show that it minimizes a certain Lagrangian.

The Lagrangian associated with the optimization problem in (8) is

$$L(\mathcal{U}, \mathcal{V}, \lambda, \mu) = \frac{1}{2n}\sum_{t=1}^n \big(\langle \mathcal{U}, \mathcal{X}_t \rangle - \langle \mathcal{V}, \mathcal{Y}_t \rangle\big)^2 + \lambda\Big(1 - \frac{1}{n}\sum_{t=1}^n \langle \mathcal{U}, \mathcal{X}_t \rangle^2\Big) + \mu\Big(1 - \frac{1}{n}\sum_{t=1}^n \langle \mathcal{V}, \mathcal{Y}_t \rangle^2\Big), \tag{9}$$

where $\lambda$ and $\mu$ are the Lagrange multipliers. Minimizing the above problem in one component of $\mathcal{U}$, with the other components of $\mathcal{U}$ and $\mathcal{V}$ fixed, leads to a least squares problem. Define the partial contraction of $\mathcal{X}_{1:n}$ with all components of $\mathcal{U}$ except $U_j$ as

$$\mathbf{X}_j = \mathcal{X}_{1:n} \times_2 U_1^\top \cdots \times_j U_{j-1}^\top \times_{j+2} U_{j+1}^\top \cdots \times_{m+1} U_m^\top.$$

Similar notation is defined for the partial contraction $\mathbf{Y}_j$ of $\mathcal{Y}_{1:n}$. With this notation, setting the gradients $\nabla_{U_j} L$ and $\nabla_\lambda L$ equal to zero yields the following stationary conditions

$$\frac{1 - 2\lambda}{n}\, \mathbf{X}_j^\top \mathbf{X}_j U_j = \frac{1}{n}\, \mathbf{X}_j^\top \mathbf{Y}_j V_j, \qquad 1 - 2\lambda = U_j^\top \Big(\frac{1}{n}\mathbf{X}_j^\top \mathbf{X}_j\Big) U_j.$$

Combining the two equations with the analogous stationary conditions for $V_j$, we obtain the following updating rule

$$U_j = \frac{\mathbf{X}_j^\dagger \mathbf{Y}_j V_j}{\sqrt{V_j^\top \mathbf{Y}_j^\top \mathbf{X}_j \mathbf{X}_j^\dagger \mathbf{Y}_j V_j}}, \qquad V_j = \frac{\mathbf{Y}_j^\dagger \mathbf{X}_j U_j}{\sqrt{U_j^\top \mathbf{X}_j^\top \mathbf{Y}_j \mathbf{Y}_j^\dagger \mathbf{X}_j U_j}}, \tag{10}$$

where $\mathbf{X}_j^\dagger, \mathbf{Y}_j^\dagger$ are the pseudo-inverses of $\mathbf{X}_j, \mathbf{Y}_j$, respectively. The update (10) is in the form of the power method on the matrices $\mathbf{X}_j^\dagger \mathbf{Y}_j$ and $\mathbf{Y}_j^\dagger \mathbf{X}_j$ (Regalia and Kofidis, 2000; Kolda and Bader, 2009). Cyclically updating each component yields the higher-order power method, which is detailed in Algorithm 1. Note that the updating rule is similar to the one in Wang et al. (2017). However, when the tensor has more than one mode, the matrices of the power method vary in each iteration, which creates obstacles for the theoretical analysis.
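One cyclic sweep of the rank-one update (10) can be sketched as follows in NumPy. This is a minimal illustration under our own conventions (samples stacked along the first axis, both tensors with the same number of modes, non-degenerate data so the normalizations are well defined); Algorithm 1 in the paper additionally specifies initialization and stopping criteria, which we omit here.

```python
import numpy as np

def partial_contraction(data, factors, j):
    """Contract the data tensor (n, d1, ..., dm) with every factor
    except the j-th, giving the n x d_j matrix used in update (10)."""
    out, axis = data, 1
    for i, u in enumerate(factors):
        if i == j:
            axis += 1  # leave mode j free
        else:
            out = np.tensordot(out, u, axes=(axis, 0))
    return out

def hopm_sweep(Xs, Ys, U, V):
    """One cyclic sweep of the rank-one HOPM update (10)."""
    for j in range(len(U)):
        Xj = partial_contraction(Xs, U, j)
        Yj = partial_contraction(Ys, V, j)
        u = np.linalg.pinv(Xj) @ (Yj @ V[j])
        U[j] = u / np.sqrt(u @ (Xj.T @ (Xj @ u)))  # normalize as in (10)
        v = np.linalg.pinv(Yj) @ (Xj @ U[j])
        V[j] = v / np.sqrt(v @ (Yj.T @ (Yj @ v)))
    return U, V
```

After each update, the corresponding score vector $\mathbf{X}_j U_j$ has unit Euclidean norm, reflecting the normalization in (10).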

It is easy to see that updating one component of $\mathcal{U}$ and $\mathcal{V}$ simultaneously results in an SVD-based algorithm. In HOPM, on the other hand, we update only one component per iteration. For exact updating, the SVD-based algorithm requires an SVD in each iteration, which may be computationally expensive when the data size is large. HOPM only requires solving a least squares problem, which can be done even in a large-scale data setting. From the results of Section 6, we see that the two methods have similar convergence behavior. However, the SVD-based algorithm lacks the local convergence property due to unidentifiability, which is also shown in Section 6. In contrast, by connecting HOPM to the alternating least squares method introduced in the next section, we show that HOPM enjoys the local convergence property without any additional modifications.

### 4.2 Alternating Least Squares

In a CCA problem, only the projection subspace is identifiable, since the correlation is scale invariant. The same is true for PCA and partial least squares (PLS) problems, in both the matrix and tensor cases (Arora et al., 2012; Uschmajew, 2015; Gao et al., 2017). Therefore, normalization steps restricting canonical variables to the unit sphere are important only for numerical stability. The major difference between PCA and CCA is that in CCA the normalization steps are essential in HOPM for preventing the $\mathcal{U}$ and $\mathcal{V}$ iterates from converging to zero quickly.

The key insight we present in this section is that different normalization schemes actually generate the same sequence of correlation values, as stated formally in Proposition 3. This insight allows us to develop alternating least squares for solving TCCA, which normalizes with the Euclidean norm. See Algorithm 1. The Euclidean norm provides the simplest form of normalization and does not depend on the data, which increases numerical stability and allows for a simple theoretical analysis. One source of numerical instability of HOPM is rank deficiency of the empirical covariance matrices, which commonly arises in CCA due to the low-rank structure of the data and undersampling. The numerical stability can be improved by adding a regularization term to the empirical covariance matrix.

Another major benefit of ALS is related to identifiability. Observe that if the components of $(\mathcal{U}, \mathcal{V})$ form a stationary point, then suitably rescaled components, which represent the same rank-one tensors, also form a stationary point. In particular, the optimum set is neither isolated nor compact. In this case, it is possible for the iteration sequence to diverge even while approaching the stationary set. This is the main difficulty in the convergence analysis of tensor factorization. Moreover, any component approaching zero or infinity causes numerical instability. One way to overcome the numerical instability is to add a penalty function balancing the magnitudes of the components. However, with this approach one cannot find the exact solution when updating each component, and the whole optimization process is slowed down. ALS resolves the identifiability issue by restricting each component to the compact unit sphere.

Note that projection onto the unit sphere is not the projection onto the constraint set in (8). Therefore, we need to address the question of whether HOPM and ALS generate two sequences of iterates with different behaviors. The key question is whether ALS minimizes the TCCA objective function in (8). In what follows, we explain how to derive ALS and answer the above question by introducing a new objective function. Consider the modified loss (potential) function

$$\tilde{L}(\alpha, \mathcal{U}, \beta, \mathcal{V}, \lambda, \mu) = \frac{1}{2n}\sum_{t=1}^n \big(\langle \alpha\mathcal{U}, \mathcal{X}_t \rangle - \langle \beta\mathcal{V}, \mathcal{Y}_t \rangle\big)^2 + \lambda\Big(1 - \frac{1}{n}\sum_{t=1}^n \langle \alpha\mathcal{U}, \mathcal{X}_t \rangle^2\Big) + \mu\Big(1 - \frac{1}{n}\sum_{t=1}^n \langle \beta\mathcal{V}, \mathcal{Y}_t \rangle^2\Big), \tag{11}$$

where we have added two extra normalization variables $\alpha$ and $\beta$ to the components of $\mathcal{U}$ and $\mathcal{V}$. This type of potential function appears in the literature on PCA and tensor decomposition. For example, the following two optimization problems are equivalent formulations of finding the best rank-one approximation of a tensor $\mathcal{X}$:

$$\min_{\mathcal{U}, \alpha:\ \|U_1\| = \cdots = \|U_m\| = 1} \|\mathcal{X} - \alpha\,\mathcal{U}\|^2 \iff \min_{\mathcal{U}} \|\mathcal{X} - \mathcal{U}\|^2.$$

HOPM corresponds to the latter problem, which is convex with respect to each component $U_j$, while ALS corresponds to the former problem, which is no longer convex with respect to $U_j$ and $\alpha$. In the first formulation, it is obvious that once $\mathcal{U}$ is found, then so is $\alpha$. Therefore, we may ignore $\alpha$ in the optimization procedure, but this form is convenient for the analysis.

To see how ALS relates to (11), consider the gradients of the potential

$$\nabla_{U_j}\tilde{L} = \frac{\alpha^2(1 - 2\lambda)}{n}\,\mathbf{X}_j^\top \mathbf{X}_j U_j - \frac{\alpha\beta}{n}\,\mathbf{X}_j^\top \mathbf{Y}_j V_j, \quad \nabla_\alpha\tilde{L} = \frac{\alpha(1 - 2\lambda)}{n}\,U_j^\top \mathbf{X}_j^\top \mathbf{X}_j U_j - \frac{\beta}{n}\,U_j^\top \mathbf{X}_j^\top \mathbf{Y}_j V_j, \quad \nabla_\lambda\tilde{L} = 1 - \frac{\alpha^2}{n}\,U_j^\top \mathbf{X}_j^\top \mathbf{X}_j U_j, \tag{12}$$

with analogous expressions for $\beta$, $V_j$, and $\mu$. Writing the ALS iterates as

$$\begin{aligned} U_{kj} &= \mathbf{X}_{kj}^\dagger \mathbf{Y}_{kj} V_{k-1,j} \big/ \|\mathbf{X}_{kj}^\dagger \mathbf{Y}_{kj} V_{k-1,j}\|, \qquad \alpha_{kj} = \big(U_{kj}^\top \mathbf{X}_{kj}^\top \mathbf{X}_{kj} U_{kj}\big)^{-1/2}, \\ 1 - 2\lambda_{kj} &= \alpha_{kj}\,\beta_{k-1,j}\, U_{kj}^\top \mathbf{X}_{kj}^\top \mathbf{Y}_{kj} V_{k-1,j} = \rho_n(U_{kj}, V_{k-1,j}), \\ V_{kj} &= \mathbf{Y}_{kj}^\dagger \mathbf{X}_{kj} U_{kj} \big/ \|\mathbf{Y}_{kj}^\dagger \mathbf{X}_{kj} U_{kj}\|, \qquad \beta_{kj} = \big(V_{kj}^\top \mathbf{Y}_{kj}^\top \mathbf{Y}_{kj} V_{kj}\big)^{-1/2}, \\ 1 - 2\mu_{kj} &= \alpha_{kj}\,\beta_{kj}\, U_{kj}^\top \mathbf{X}_{kj}^\top \mathbf{Y}_{kj} V_{kj} = \rho_n(U_{kj}, V_{kj}), \end{aligned} \tag{13}$$

it is not hard to see that the ALS iterates satisfy the stationary conditions

$$\nabla_{\alpha, U_j, \lambda}\tilde{L}(\alpha_{kj}, U_{kj}, \lambda_{kj}, \beta_{k-1,j}, V_{k-1,j}, \mu_{k-1,j}) = 0, \qquad \nabla_{\beta, V_j, \mu}\tilde{L}(\alpha_{kj}, U_{kj}, \lambda_{kj}, \beta_{kj}, V_{kj}, \mu_{kj}) = 0.$$

Therefore, ALS alternately produces iterates that satisfy the stationary conditions of (11). However, since we introduced the normalization variables $\alpha$ and $\beta$, the subproblem became non-convex. In particular, it is not immediately clear whether ALS increases the correlation in each iteration. To answer this question, the following proposition shows that HOPM and ALS generate iterates with the same correlation in each iteration, based on the fact that the correlation is scale invariant. In particular, this shows that ALS increases the correlation in each iteration when solving the TCCA problem in (8). Moreover, the proposition states that the stationary points of HOPM and ALS are equivalent in the sense that the correlations are the same.

###### Proposition 3.

Let $(U_{kj}, V_{kj})$ and $(A_{kj}, B_{kj})$ be the iterates generated by HOPM and ALS, respectively, with the same starting values. Then, it holds that

$$U_{kj} = \frac{A_{kj}}{\sqrt{A_{kj}^\top \big(\frac{1}{n}\mathbf{X}_{kj}^\top \mathbf{X}_{kj}\big) A_{kj}}}, \qquad V_{kj} = \frac{B_{kj}}{\sqrt{B_{kj}^\top \big(\frac{1}{n}\mathbf{Y}_{kj}^\top \mathbf{Y}_{kj}\big) B_{kj}}},$$

and

$$\rho_n(U_{kj}, V_{kj}) = \rho_n(A_{kj}, B_{kj}).$$

Moreover, if a point is a critical point of the modified loss $\tilde{L}$ with non-zero normalization variables, then the corresponding rescaled point is a critical point of the original loss $L$.

Ma et al. (2015) used a similar idea to develop a faster algorithm for CCA, called AppGrad. The difference is that we establish the relationship between two alternating minimization schemes, while Ma et al. (2015) established the relationship between gradient descent schemes. Notably, they only show that the CCA solution is a fixed point of AppGrad, while we further show that this type of scheme actually finds a stationary point of the modified non-convex loss (11). Thus, ALS and AppGrad are non-convex methods for the CCA problem. ALS is also related to optimization on matrix manifolds (Absil et al., 2008), but that discussion is beyond the scope of this paper.

### 4.3 Convergence Analysis

We present our main convergence result for ALS. We start by introducing several assumptions under which we prove the convergence result.

###### Assumption 1.

Assume the following conditions hold:

$$\begin{aligned}
0 &< \sigma_{l,x} := \sigma_{\min}\Big(\frac{1}{n}\sum_{t=1}^{n}\mathrm{vec}(X_t)\,\mathrm{vec}(X_t)^{\top}\Big) < \sigma_{\max}\Big(\frac{1}{n}\sum_{t=1}^{n}\mathrm{vec}(X_t)\,\mathrm{vec}(X_t)^{\top}\Big) =: \sigma_{u,x} < \infty, \\
0 &< \sigma_{l,y} := \sigma_{\min}\Big(\frac{1}{n}\sum_{t=1}^{n}\mathrm{vec}(Y_t)\,\mathrm{vec}(Y_t)^{\top}\Big) < \sigma_{\max}\Big(\frac{1}{n}\sum_{t=1}^{n}\mathrm{vec}(Y_t)\,\mathrm{vec}(Y_t)^{\top}\Big) =: \sigma_{u,y} < \infty, \\
\rho_n&(U_0, V_0) > 0.
\end{aligned}$$

The same conditions appeared in Ma et al. (2015), who studied CCA for vector-valued data. Wang et al. (2017) require the smallest eigenvalue to be bounded away from zero, which can always be achieved by adding a regularization term to the covariance matrix, as discussed in the previous section. However, instead of assuming a bound on the largest eigenvalue, they assume that the observations are bounded. The third condition can easily be satisfied by noting that if $\rho_n(U_0, V_0) < 0$, we can flip the signs of the components of $U_0$ or $V_0$ to obtain $\rho_n(U_0, V_0) > 0$. Finally, we remark that the first two conditions are sufficient to prevent the iterates from converging to zero. See Lemma 7 for details.
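As an illustration, the conditions of Assumption 1 can be checked numerically from the data. The helper below is our own sketch (not part of the paper's algorithms): it computes the extreme eigenvalues of the vectorized second-moment matrices and applies the sign-flip trick to the initial components.

```python
import numpy as np

def check_assumption1(Xs, Ys, U0, V0):
    """Numerically check Assumption 1 for tensor samples.

    Xs, Ys : arrays of shape (n, ...) holding n tensor observations.
    U0, V0 : initial components with the same trailing shape as one sample.
    Returns extreme eigenvalues, the (sign-corrected) rho_n(U0, V0), and V0.
    """
    n = Xs.shape[0]
    vx = Xs.reshape(n, -1)            # vec(X_t) stacked row-wise
    vy = Ys.reshape(n, -1)
    Sx = vx.T @ vx / n                # (1/n) sum vec(X_t) vec(X_t)^T
    Sy = vy.T @ vy / n
    ex, ey = np.linalg.eigvalsh(Sx), np.linalg.eigvalsh(Sy)
    a, b = vx @ U0.ravel(), vy @ V0.ravel()
    rho0 = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    if rho0 < 0:                      # flip one sign to make rho_n(U0, V0) > 0
        V0, rho0 = -V0, -rho0
    return {"sigma_l_x": ex[0], "sigma_u_x": ex[-1],
            "sigma_l_y": ey[0], "sigma_u_y": ey[-1],
            "rho0": rho0, "V0": V0}
```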

###### Theorem 4.

If Assumption 1 holds, then the dynamic (13) satisfies Conditions (A1), (A2), and (A3). Furthermore, the iterates generated by ALS converge to a stationary point at a rate that depends on the exponent in the Lojasiewicz gradient inequality in Lemma 1.

Theorem 4 establishes a local convergence result that does not depend on any explicit model assumptions; we only require the data to be well conditioned. We note that this local convergence result is not trivial: experiments in Section 6 illustrate that the SVD-based algorithm does not enjoy this property. However, the convergence analysis does not provide information about the statistical properties of the estimator, which may require further assumptions on the data generating process.

Although this analysis does not give us an exact convergence rate, Espig et al. (2015), Espig (2015), and Hu and Li (2018) indicated that sublinear, linear, and superlinear convergence can all occur in the problem of rank-one approximation of a tensor. For canonical correlation analysis with tensor-valued data, however, we conjecture that with a stronger assumption on the data generating process, it may be possible to obtain linear convergence. This is corroborated by our experiment in Section 6. We further conjecture that as the sample size increases, the probability that the algorithm converges to a global optimum approaches one, even when the initial point is chosen randomly. We leave these theoretical justifications under a suitable data generating process for future work.

## 5 Practical considerations

In this section, we first discuss computational issues associated with TCCA, including an inexact updating rule and several useful schemes. Moreover, our algorithms and analysis so far identify only the first pair of canonical variables via a rank-one decomposition. In practice, we often want to identify additional pairs of canonical variables and therefore seek a higher-rank decomposition. Unfortunately, defining the proper notion of higher rank in the tensor setting is a subtle issue. Even worse, obtaining more canonical variables may not be possible, since projection restricted to a low-rank tensor space may not be well defined. To resolve this issue, we investigate a probabilistic model and observe that 2DCCA solves CCA with an additional low-rank constraint. This leads to the development of a practical deflation procedure.

### 5.1 Efficient Algorithms for Large-scale Data

The major obstacle in applying CCA to large-scale data is that many algorithms involve the inversion of large matrices, which is computationally and memory intensive. This problem also appears in Algorithm 1. Inspired by the inexact updating of CCA, we first note that the unnormalized update $X_{kj}^{\dagger} Y_{kj} V_{k-1,j}$ is the solution to the following least squares problem

$$\min_{\tilde{U}} \; \frac{1}{2n}\big\|X_{kj}\tilde{U} - Y_{kj}V_{k-1,j}\big\|^2, \tag{14}$$

and similarly for $V_{kj}$. In the following theorem, we show that it suffices to solve these least squares problems inexactly: as long as we can bound the error of the inexact update, we obtain a sufficiently accurate estimate of the canonical variables. More specifically, we show that the error accumulates at most exponentially. See Algorithm 1 for the complete procedure with inexact updating for ALS or HOPM. Theorem 5 allows us to use advanced stochastic optimization methods for the least squares problem, e.g., stochastic variance reduced gradient (Johnson and Zhang, 2013; Shalev-Shwartz and Zhang, 2013).
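In this spirit, an inexact half-step can be sketched as a few full-gradient descent passes on (14); in practice a stochastic solver such as SVRG would replace the full-gradient loop. The function below is a hedged illustration with our own naming; the warm start `u0` and the step count stand in for the accuracy parameter $\epsilon$.

```python
import numpy as np

def inexact_update(X, Y, v, u0, n_steps=50, lr=None):
    """Approximately solve min_u (1/2n) ||X u - Y v||^2 by gradient descent.

    u0 is a warm start; n_steps controls the residual accuracy.
    """
    n = X.shape[0]
    target = Y @ v
    if lr is None:
        lr = n / (np.linalg.norm(X, 2) ** 2)   # 1/L for the (1/2n) scaling
    u = u0.copy()
    for _ in range(n_steps):
        grad = X.T @ (X @ u - target) / n      # gradient of the objective
        u -= lr * grad
    return u / np.linalg.norm(u)
```

With enough inner steps the inexact update matches the pseudoinverse-based exact update, while a small, fixed budget of steps yields the cheap approximate iterates covered by Theorem 5.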

###### Theorem 5.

Let $(U_{kj}, V_{kj})$ be the iterates generated by option I in Algorithm 1 and $(U^{*}_{kj}, V^{*}_{kj})$ those generated by option II in Algorithm 1. Under Assumption 1, we have

$$\max\big\{\|U_{kj} - U^{*}_{kj}\|, \; \|V_{kj} - V^{*}_{kj}\|\big\} = O\big(r^{2mk+j}\sqrt{\epsilon}\big),$$

for some constant $r$ that depends on the data.

Note that we obtain the same order of error bound for the inexact updating as in Wang et al. (2016a), who studied the case of vector-valued data.

Several techniques can be used to speed up the convergence of the inexact updating of ALS. First, we can warm-start the least squares solvers by using the previous iterates as initial points. Due to the local convergence property, after several iterations the subproblem (14) of TCCA varies only slightly between consecutive iterates, i.e., for $k$ large enough,

$$\|U_{kj} - U_{k-1,j}\| \approx 0 \approx \|V_{kj} - V_{k-1,j}\|.$$

Therefore we may use $U_{k-1,j}$ (and likewise $V_{k-1,j}$) as the initialization point when minimizing (14). Second, we can regularize the problem by adding a ridge penalty, or equivalently by choosing positive regularization parameters in Algorithm 1. The regularization guarantees that the least squares problem is strongly convex and speeds up the optimization process. This type of regularization is necessary when the size of the data is smaller than the dimension of the parameters, so that the condition on the smallest eigenvalue in Assumption 1 can be satisfied (Ma et al., 2015; Wang et al., 2018). Finally, the shift-and-invert preconditioning method could also be considered, but we leave this for future work.
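To illustrate the ridge-regularized update, the following sketch (with our own names; `Rx` plays the role of the regularization parameter) solves the strongly convex surrogate of (14) in closed form, which remains well posed even when the sample size is smaller than the dimension.

```python
import numpy as np

def ridge_update(X, Y, v, Rx):
    """Ridge-regularized ALS half-step.

    Solves min_u (1/2n)||X u - Y v||^2 + (Rx/2)||u||^2, whose normal
    equations are (X^T X / n + Rx I) u = X^T Y v / n, then normalizes.
    """
    n, p = X.shape
    A = X.T @ X / n + Rx * np.eye(p)        # always positive definite for Rx > 0
    u = np.linalg.solve(A, X.T @ (Y @ v) / n)
    return u / np.linalg.norm(u)
```

As the penalty vanishes (and with more samples than parameters), this recovers the pseudoinverse-based direction of the unregularized update.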

### 5.2 $(k_1, k_2, \ldots, k_m)$-TCCA

We develop a general TCCA procedure for extracting more than one canonical component. We can interpret general TCCA as a higher rank approximation of general CCA. That is, we seek to solve the following CCA problem

$$\min_{U,\, V \,\text{in a low-rank space}} \; \frac{1}{n}\sum_{t=1}^{n}\big(\langle U, X_t\rangle - \langle V, Y_t\rangle\big)^2 \quad \text{s.t.} \quad \frac{1}{n}\sum_{t=1}^{n}\langle U, X_t\rangle^2 = 1 = \frac{1}{n}\sum_{t=1}^{n}\langle V, Y_t\rangle^2, \tag{15}$$

where $U$ and $V$ lie in a "low-rank" space. For example, TCCA restricts solutions to the space of rank-one tensors. There are many ways to obtain a higher-rank tensor factorization, but here we focus on the rank-$(k_1, k_2, \ldots, k_m)$ approximation (De Lathauwer et al., 2000a), which is particularly related to 2DCCA and TCCA. For simplicity, we now restrict our discussion to $(k_1, k_2)$-2DCCA. To perform $(k_1, k_2)$-2DCCA, Lee and Choi (2007) obtain $L_x$ and $L_y$ by computing the largest generalized eigenvectors of

$$\begin{bmatrix} 0 & S^{r}_{XY} \\ S^{r}_{YX} & 0 \end{bmatrix}\begin{bmatrix} L_x \\ L_y \end{bmatrix} = \lambda \begin{bmatrix} S^{r}_{XX} + R_x I & 0 \\ 0 & S^{r}_{YY} + R_y I \end{bmatrix}\begin{bmatrix} L_x \\ L_y \end{bmatrix},$$

where $S^{r}_{XY}$, $S^{r}_{YX}$, $S^{r}_{XX}$, and $S^{r}_{YY}$ are the corresponding (projected) sample covariance matrices, and similarly for the other mode. We refer to this method as the SVD-based algorithm for $(k_1, k_2)$-2DCCA. The corresponding extension of HOPM for $(k_1, \ldots, k_m)$-TCCA is given in Algorithm 2, where we replace vectors by matrices. Note that we only need to compute SVDs of small matrices.
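For concreteness, a generalized eigenproblem of this form can be solved by whitening with the right-hand-side block matrix. The sketch below assumes the projected covariance matrices $S^r$ are already available (their construction follows Lee and Choi, 2007) and uses illustrative names of our own.

```python
import numpy as np

def top_generalized_eigvecs(Sxy, Sxx, Syy, Rx, Ry, k):
    """Top-k generalized eigenvectors for the SVD-based 2DCCA step.

    Sxy : (p, q) cross term; Sxx : (p, p), Syy : (q, q) symmetric PSD blocks.
    Returns Lx of shape (p, k) and Ly of shape (q, k).
    """
    p, q = Sxy.shape
    A = np.zeros((p + q, p + q))              # block anti-diagonal left side
    A[:p, p:], A[p:, :p] = Sxy, Sxy.T
    B = np.zeros((p + q, p + q))              # block diagonal, regularized right side
    B[:p, :p] = Sxx + Rx * np.eye(p)
    B[p:, p:] = Syy + Ry * np.eye(q)
    # whiten with B^{-1/2} and solve the resulting symmetric eigenproblem
    w, Q = np.linalg.eigh(B)
    Binv_half = Q @ np.diag(w ** -0.5) @ Q.T
    evals, evecs = np.linalg.eigh(Binv_half @ A @ Binv_half)
    Z = Binv_half @ evecs[:, ::-1][:, :k]     # top-k generalized eigenvectors
    return Z[:p], Z[p:]
```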