The Partial Least Squares (PLS) is a well-established framework for estimation, regression and classification, whose objective is to predict a set of dependent variables (responses) from a set of independent variables (predictors) through the extraction of a small number of latent variables. One member of the PLS family is Partial Least Squares Regression (PLSR) - a multivariate method which, in contrast to Multiple Linear Regression (MLR) and Principal Component Regression (PCR), is proven to be particularly suited to highly collinear data[1, 2]
. In order to predict response variablesfrom independent variables
, PLS finds a set of latent variables (also called latent vectors, score vectors or components) by projecting bothand onto a new subspace, while at the same time maximizing the pairwise covariance between the latent variables of and . A standard way to optimize the model parameters is the Non-linear Iterative Partial Least Squares (NIPALS); for an overview of PLS and its applications in neuroimaging see [4, 5, 6]. There are many variations of the PLS model including the orthogonal projection on latent structures (O-PLS) , Biorthogonal PLS (BPLS) , recursive partial least squares (RPLS) , nonlinear PLS [10, 11]. The PLS regression is known to exhibit high sensitivity to noise, a problem that can be attributed to redundant latent variables, whose selection still remains an open problem 
. Penalized regression methods are also popular for simultaneous variable selection and coefficient estimation, which impose e.g., L2 or L1 constraints on the regression coefficients. Algorithms of this kind are Ridge regression and Lasso. The recent progress in sensor technology, biomedicine, and biochemistry has highlighted the necessity to consider multiple data streams as multi-way data structures , for which the corresponding analysis methods are very naturally based on tensor decompositions [16, 17, 18]. Although matricization of a tensor is an alternative way to express such data, this would result in the “Large Small ”problem and also make it difficult to interpret the results, as the physical meaning and multi-way data structures would be lost due to the unfolding operation.
The -way PLS (N-PLS) decomposes the independent and dependent data into rank-one tensors, subject to maximum pairwise covariance of the latent vectors. This promises enhanced stability, resilience to noise, and intuitive interpretation of the results [19, 20]. Owing to these desirable properties N-PLS has found applications in areas ranging from chemometrics[21, 22, 23] to neuroscience [24, 25]. A modification of the N-PLS and the multi-way covariates regression were studied in [26, 27, 28], where the weight vectors yielding the latent variables are optimized by the same strategy as in N-PLS, resulting in better fitting to independent data while maintaining no difference in predictive performance. The tensor decomposition used within N-PLS is Canonical Decomposition /Parallel Factor Analysis (CANDECOMP/PARAFAC or CP) , which makes N-PLS inherit both the advantages and limitations of CP . These limitations are related to poor fitness ability, computational complexity and slow convergence when handling multivariate dependent data and higher order () independent data, causing N-PLS not to be guaranteed to outperform standard PLS [23, 31].
In this paper, we propose a new generalized mutilinear regression model, called Higer-Order Partial Least Squares (HOPLS), which makes it possible to predict an th-order tensor () (or a particular case of two-way matrix ) from an th-order tensor by projecting tensor onto a low-dimensional common latent subspace. The latent subspaces are optimized sequentially through simultaneous rank- approximation of and rank- approximation of (or rank-one approximation in particular case of two-way matrix ). Owing to the better fitness ability of the orthogonal Tucker model as compared to CP  and the flexibility of the block Tucker model , the analysis and simulations show that HOPLS proves to be a promising multilinear subspace regression framework that provides not only an optimal tradeoff between fitness and model complexity but also enhanced predictive ability in general. In addition, we develop a new strategy to find a closed-form solution by employing higher-order singular value decomposition (HOSVD) , which makes the computation more efficient than the currently used iterative way.
The article is structured as follows. In Section 2, an overview of two-way PLS is presented, and the notation and notions related to multi-way data analysis are introduced. In Section 3, the new multilinear regression model is proposed, together with the corresponding solutions and algorithms. Extensive simulations on synthetic data and a real world case study on the fusion of behavioral and neural data are presented in Section 4, followed by conclusions in Section 5.
2 Background and Notation
2.1 Notation and definitions
th-order tensors (multi-way arrays) are denoted by underlined boldface capital letters, matrices (two-way arrays) by boldface capital letters, and vectors by boldface lower-case letters. The th entry of a vector is denoted by , element of a matrix is denoted by , and element of an th-order tensor by or . Indices typically range from to their capital version, e.g., . The mode- matricization of a tensor is denoted by . The th factor matrix in a sequence is denoted by .
The -mode product of a tensor and matrix is denoted by and is defined as:
The rank- Tucker model  is a tensor decomposition defined and denoted as follows:
where is the core tensor and are the factor matrices. The last term is the simplified notation, introduced in , for the Tucker operator. When the factor matrices are orthonormal and the core tensor is all-orthogonal this model is called HOSVD [33, 35].
where the symbol ‘’ denotes the outer product of vectors, is the column- vector of matrix , and are scalars. The CP model can also be represented by (2), under the condition that the core tensor is super-diagonal, i.e., and if for all .
The -mode product between and is of size , and is defined as
The inner product of two tensors is defined by , and the squared Frobenius norm by .
The -mode cross-covariance between an th-order tensor and an th-order tensor with the same size on the th-mode, denoted by , is defined as
where the symbol represents an -mode multiplication between two tensors, and is defined as
As a special case, for a matrix , the -mode cross-covariance between and simplifies as
under the assumption that -mode column vectors of and columns of are mean-centered.
2.2 Standard PLS (two-way PLS)
in order to deal with collinear predictor variables. The usefulness of PLS in chemical applications was illuminated by the group of S. Wold[40, 41], after some initial work by Kowalski et al. . Currently, the PLS regression is being widely applied in chemometrics, sensory evaluation, industrial process control, and more recently, in the analysis of functional brain imaging data[43, 44, 45, 46, 47].
The principle behind PLS is to search for a set of latent vectors by performing a simultaneous decomposition of and with the constraint that these components explain as much as possible of the covariance between and . This can be formulated as
where consists of extracted orthonormal latent variables from , i.e. , and are latent variables from having maximum covariance with column-wise. The matrices and represent loadings and are respectively the residuals for and . In order to find the first set of components, we need to optimize the two sets of weights so as to satisfy
The latent variable then is estimated as . Based on the assumption of a linear relation , is predicted by
where is a diagonal matrix with , implying that the problem boils down to finding common latent variables
that explain the variance of bothand , as illustrated in Fig. 1.
3 Higher-order PLS (HOPLS)
For a two-way matrix, the low-rank approximation is equivalent to subspace approximation, however, for a higher-order tensor, these two criteria lead to completely different models (i.e., CP and Tucker model). The -way PLS (N-PLS), developed by Bro , is a straightforward multi-way extension of standard PLS based on the CP model. Although CP model is the best low-rank approximation, Tucker model is the best subspace approximation, retaining the maximum amount of variation . It thus provides better fitness than the CP model except in a special case when perfect CP exists, since CP is a restricted version of the Tucker model when the core tensor is super-diagonal.
There are two different approaches for extracting the latent components: sequential and simultaneous methods. A sequential method extracts one latent component at a time, deflates the proper tensors and calculates the next component from the residuals. In a simultaneous method, all components are calculated simultaneously by minimizing a certain criterion. In the following, we employ a sequential method since it provides better performance.
Consider an th-order independent tensor and an th-order dependent tensor , having the same size on the first mode, i.e., . Our objective is to find the optimal subspace approximation of and , in which the latent vectors from and have maximum pairwise covariance. Considering a linear relation between the latent vectors, the problem boils down to finding the common latent subspace which can approximate both and simultaneously. We firstly address the general case of a tensor and a tensor . A particular case with a tensor and a matrix is presented separately in Sec. 3.3, using a slightly different approach.
3.1 Proposed model
Applying Tucker decomposition within a PLS framework is not straightforward, and to that end we propose a novel block-wise orthogonal Tucker approach to model the data. More specifically, we assume is decomposed as a sum of rank-() Tucker blocks, while is decomposed as a sum of rank-() Tucker blocks (see Fig. 2), which can be expressed as
where is the number of latent vectors, is the -th latent vector, and are loading matrices on mode- and mode- respectively, and and are core tensors.
However the Tucker decompositions in (12) are not unique  due to the permutation, rotation, and scaling issues. To alleviate this problem, additional constraints should be imposed such that the core tensors and are all-orthogonal, a sequence of loading matrices are column-wise orthonormal, i.e., and , the latent vector is of length one, i.e. . Thus, each term in (12) is represented as an orthogonal Tucker model, implying essentially uniqueness as it is subject only to trivial indeterminacies .
By defining a latent matrix , mode- loading matrix , mode- loading matrix and core tensor , , the HOPLS model in (12) can be rewritten as
where and are residuals after extracting components. The core tensors and have a special block-diagonal structure (see Fig. 2) and their elements indicate the level of local interactions between the corresponding latent vectors and loading matrices. Note that the tensor decomposition in (13) is similar to the block term decomposition discussed in , which aims to the decomposition of only one tensor. However, HOPLS attempts to find the block Tucker decompositions of two tensors with block-wise orthogonal constraints, which at the same time satisfies a certain criteria related to having common latent components on a specific mode.
Benefiting from the advantages of Tucker decomposition over the CP model 
, HOPLS promises to approximate data better than N-PLS. Specifically, HOPLS differs substantially from the N-PLS model in the sense that extraction of latent components in HOPLS is based on subspace approximation rather than on low-rank approximation and the size of loading matrices is controlled by a hyperparameter, providing a tradeoff between fitness and model complexity. Note that HOPLS simplifies into N-PLS if we defineand .
3.2 Optimization criteria and algorithm
The tensor decompositions in (12) consists of two simultaneous optimization problems: (i) approximating and by orthogonal Tucker model, (ii) having at the same time a common latent component on a specific mode. If we apply HOSVD individually on and , the best rank-() approximation for and the best rank-() approximation for can be obtained while the common latent vector cannot be ensured. Another way is to find the best approximation of by HOSVD first, subsequently, can be approximated by a fixed . However, this procedure, which resembles multi-way principal component regression , has the drawback that the common latent components are not necessarily predictive for .
The optimization of subspace transformation according to (12) will be formulated as a problem of determining a set of orthogonormal loadings and latent vectors that satisfies a certain criterion. Since each term can be optimized sequentially with the same criteria based on deflation, in the following, we shall simplify the problem to that of finding the first latent vector and two sequences of loading matrices and .
In order to develop a strategy for the simultaneous minimization of the Frobenius norm of residuals and , while keeping a common latent vector , we first need to introduce the following basic results:
Given a tensor and column orthonormal matrices , with , the least-squares (LS) solution to is given by .
where tensor is the residual and the symbol ‘’ denotes the Kronecker product. Since and is column orthonormal, the LS solution of with fixed matrices and is given by ; writing it in a tensor form we obtain the desired result. ∎
Given a fixed tensor , the following two constrained optimization problems are equivalent:
1) , s. t. matrices are column orthonormal and .
2) , s. t. matrices are column orthonormal and .
The proof is available in  (see pp. 477-478).
According to Proposition 3.2, minimization of and under the orthonormality constraint is equivalent to maximization of and .
However, taking into account the common latent vector between and , there is no straightforward way to maximize and simultaneously. To this end, we propose to maximize a product of norms of two core tensors, i.e., . Since the latent vector is determined by , the first step is to optimize the orthonormal loadings, then the common latent vectors can be computed by the fixed loadings.
Let and , then .
where is the vectorization of the tensor . ∎
Note that this form is quite similar to the optimization problem for two-way PLS in (10), where the cross-covariance matrix is replaced by . In addition, the optimization item becomes the norm of a small tensor in contrast to a scalar in (10). Thus, if we define as a mode- cross-covariance tensor , the optimization problem can be finally formulated as
where and are the parameters to optimize.
which can be obtained by rank-() HOSVD on tensor . Based on Proposition 3.1, the optimization term in (3.2) is equivalent to the norm of core tensor . To achieve this goal, the higher-order orthogonal iteration (HOOI) algorithm [37, 16], which is known to converge fast, is employed to find the parameters and by orthogonal Tucker decomposition of .
Subsequently, based on the estimate of the loadings and , we can now compute the common latent vector . Note that taking into account the asymmetry property of the HOPLS framework, we need to estimate from predictors and to estimate regression coefficient for prediction of responses . For a given set of loading matrices , the latent vector should explain variance of as much as possible, that is
The above procedure should be carried out repeatedly using the deflation operation, until an appropriate number of components (i.e., ) are obtained, or the norms of residuals are smaller than a certain threshold. The deflation111Note that the latent vectors are not orthogonal in HOPLS algorithm, which is related to deflation. The theoretical explanation and proof are given in the supplement material. is performed by subtracting from and the information explained by a rank-() tensor and a rank-() tensor , respectively. The HOPLS algorithm is outlined in Algorithm 1.
3.3 The case of the tensor and matrix
Suppose that we have an th-order independent tensor () and a two-way dependent data , with the same sample size . Since for two-way matrix, subspace approximation is equivalent to low-rank approximation. HOPLS operates by modeling independent data as a sum of rank-() tensors while dependent data is modeled with a sum of rank-one matrices as
where and is a scalar.
Let and is of length one, then solves the problem . In other words, a linear combination of the columns of by using a weighting vector of length one has least squares properties in terms of approximating .
Since is given and
, it is obvious that the ordinary least squares solution to solve the problem is, hence, . If a with length one is found according to some criterion, then automatically with gives the best fit of for that . ∎
As discussed in the previous section, the problem of minimizing with respect to matrices and vector is equivalent to maximizing the norm of core tensor with an orthonormality constraint. Meanwhile, we attempt to find an optimal with unity length which ensures that is linearly correlated with the latent vector , i.e., , then according to Proposition 3.4, gives the best fit of . Therefore, replacing by in the expression for the core tensor in (15), we can optimize the parameters of X-loading matrices and Y-loading vector by maximizing the norm of , which gives the best approximation of both tensor and matrix . Finally, the optimization problem of our interest can be formulated as:
where the loadings and are parameters to optimize. This form is similar to (3.2), but has a different cross-covariance tensor defined between a tensor and a matrix, implying that the problem can be solved by performing a rank-() HOSVD on . Subsequently, the core tensor corresponding to can also be computed.
Next, the latent vector should be estimated so as to best approximate with given loading matrices . According to the model for , if we take its mode-1 matricizacion, we can write
where is still unknown. However, the core tensor (i.e., ) and the core tensor (i.e., ) has a linear connection that . Therefore, the latent vector can be estimated in another way that is different with the previous approach in Section 3.2. For fixed matrices , , the least square solution for the normalized , which minimizes the squared norm of the residual , can be obtained from
where we used the fact that are columnwise orthonormal and the symbol denotes Moore-Penrose pseudoinverse. With the estimated latent vector , and loadings , the regression coefficient used to predict is computed as
The procedure for a two-way response matrix is summarized in Algorithm 2. In this case, HOPLS model is also shown to unify both standard PLS and N-PLS within the same framework, when the appropriate parameters are selected222Explanation and proof are given in the supplement material..