Regression analysis is commonly used for modeling the relationship between a predictor vector and a scalar response . Typically a good regression model can achieve two goals: accurate prediction on future response and parsimonious interpretation of the dependence structure between and (Hastie et al., 2009). As a general setup, it fits training samples via minimizing a regularized loss, i.e., a loss plus a regularization term , as follows
where is the regression coefficient vector, is the standard Euclidean inner product, and is the regularization parameter. For example, the sum of squared loss with -norm regularization leads to the celebrated LASSO approach (Tibshirani, 1996), which performs sparse estimation of
and thus has implicit feature selection embedded therein.
In many modern real-world applications, the predictors/features are represented more naturally as higher-order tensors, such as videos and Magnetic Resonance Imaging (MRI) scans. In this case, if we want to predict a response variable for each tensor, a naive approach is to perform linear regression on the vectorized data (e.g., by stretching the tensor element by element). However, it completely ignores the multidimensional structure of the tensor data, such as the spatial coherence of the voxels. This motivates the tensor regression framework(Yu and Liu, 2016; Zhou et al., 2013), which treats each observation as a tensor and learns a tensor coefficient via regularized model fitting:
When -norm regularization is used, this formulation is essentially equivalent to (1) via vectorization. To effectively exploit the structural information of , we can impose a low-rank constraint on for Problem (2). Some authors achieved this by fixing the CANDECOMP/PARAFAC (CP) rank of as a priori. For example, (Su et al., 2012) assumed to be rank-1. Since rank-1 constraint is too restrictive, (Guo et al., 2012) and (Zhou et al., 2013) imposed a rank- constraint in a tensor decomposition model, but none of the above methods considered adding sparsity constraint as well to enhance the model interpretability. (Wang et al., 2014) imposed a restrictive rank-1 constraint on and also applied an elastic net regularization (Zou and Hastie, 2005). (Tan et al., 2012) imposed a rank- constraint and applied -norm regularization to factor matrices to also promote sparsity in . (Signoretto et al., 2014) applied the trace norm (nuclear norm) for low-rank estimation of , and (Song and Lu, 2017) imposed a combination of trace norm and -norm on . (Bengua et al., 2017) showed that the trace norm may not be appropriate for capturing the global correlation of a tensor as it provides only the mean of the correlation between a single mode (rather than a few modes) and the rest of the tensor. In all the above sparse and low-rank tensor models, the sparsity is imposed on itself, which, does not necessarily lead to the sparsity on the decomposed matrices.
In this paper, we propose a sparse and low-rank tensor regression model in which the unit-rank tensors from the CP decomposition of the coefficient tensor are assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor factorization/regression problems (SURF) in the form of
To make the solution process efficient for the SURF problem, we propose a boosted/stagewise estimation procedure to efficiently trace out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The effectiveness and efficiency of our proposed approach is demonstrated on various real-world datasets as well as under various simulation setups.
2 Preliminaries on Tensors
We start with a brief review of some necessary preliminaries on tensors, and more details can be found in (Kolda and Bader, 2009). We denote scalars by lowercase letters, e.g., ; vectors by boldfaced lowercase letters, e.g., ; matrices by boldface uppercase letters, e.g., ; and tensors by calligraphic letters, e.g., . We denote their entries by , , , etc., depending on the number of dimensions. Indices are denoted by lowercase letters spanning the range from 1 to the uppercase letter of the index, e.g., . Each entry of an th-order tensor is indexed by indices , and each indexes the -mode of . Specifically, denotes every mode except .
Definition 1 (Inner Product).
The inner product of two tensors is the sum of the products of their entries, defined as .
It follows immediately that the Frobenius norm of is defined as . The norm of a tensor is defined as .
Definition 2 (Tensor Product).
Let be a length- vector for each . The tensor product of , denoted by , is an -tensor of which the entries are given by . We call a rank-1 tensor or a unit-rank tensor.
Definition 3 (CP Decomposition).
Every tensor can be decomposed as a weighted sum of rank-1 tensors with a suitably large as
where , and .
Definition 4 (Tensor Rank).
The tensor rank of , denoted by , is the smallest number such that the equality (3) holds.
Definition 5 (-mode Product).
The -mode product of a tensor by a vector , denoted by , is an -tensor of which the entries are given by
3 Sparse and Low-Rank Tensor Regression
3.1 Model Formulation
For an th-order predictor tensor and a scalar response , , we consider the regression model of the form
where is an th-order coefficient tensor, and is a random error term of zero mean. Without loss of generality, the intercept is set to zero by centering the response and standardizing the predictors as ; and for . Our goal is to estimate with i.i.d. observations .
To reduce the complexity of the model and leverage the structural information in , we assume that the coefficient tensor to be both low-rank and sparse. Specifically, we assume can be decomposed via CP decomposition as , where each rank-1 tensor is possibly sparse, or equivalently, the vectors in its representation , , are possibly sparse, for all .
Here we impose sparsity on the rank-1 components from CP decomposition – rather than on itself (Chen et al., 2012; Tan et al., 2012). This adaption can be more beneficial in multiple ways: 1) It integrates a finer sparsity structure into the CP decomposition, which enables a direct control of component-wise sparsity; 2) It leads to an appealing model interpretation and feature grouping: the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions; 3) It leads to a more flexible and parsimonious model as it requires less number of parameters to recover the within-decomposition sparsity of a tensor than existing methods which impose sparsity on the tensor itself, thus makes the model generalizability better.
A straightforward way of conducting model estimation is to solve the following optimization problem:
where , , are regularization parameters. This problem is very difficult to solve because: 1) The CP rank needs to be pre-specified; 2) As the CP decomposition may not be unique, the pursue of its within-decomposition sparsity is highly non-convex and the problem may suffer from parameter identifiability issues (Mishra et al., 2017); 3) The estimation may involve many regularization parameters, for which the tuning becomes very costly.
3.2 Divide-and-Conquer: Sequential Pursue for Sparse Tensor Decomposition
We propose a divide-and-conquer strategy to recover the sparse CP decomposition. Our approach is based on the sequential extraction method (a.k.a. deflation) (Phan et al., 2015; Mishra et al., 2017), which seeks a unit-rank tensor at a time and then deflates to find further unit-rank tensors from the residuals. This has been proved to be a rapid and effective method of partitioning and concentrating tensor decomposition. Specifically, we sequentially solve the following sparse unit-rank problem:
where is the sequential number of the unit-rank terms and is the current residue of response with
where is the estimated unit-rank tensor in the -th step, with tuning done by, e.g., cross validation. The final estimator is then obtained as . Here for improving the convexity of the problem and its numerical stability, we have used the elastic net (Zou and Hastie, 2005) penalty form instead of LASSO, which is critical to ensure the convergence of the optimization solution. The accuracy of the solution can be controlled simply by adjusting the values of and . Since we mainly focus on sparse estimation, we fix as a small constant in numerical studies.
As each is of unit rank, it can be decomposed as with and . It is then clear that . That is, the sparsity of a unit-rank tensor directly leads to the sparsity of its components. This allows us to kill multiple birds with one stone: by simply pursuing the element-wise sparsity of the unit-rank coefficient tensor with only one tuning parameter , solving (6) can produce a set of sparse factor coefficients for simultaneously.
With this sequential pursue strategy, the general problem boils down to a set of sparse unit-rank estimation problems, for which we develop a novel stagewise/boosting algorithm.
4 Fast Stagewise Unit-Rank Tensor Factorization (SURF)
For simplicity, we drop the index and write the generic form of the problem in (6) as
Let , where , , and the factors are identifiable up to sign flipping. Let , and . Then (7) can be reformulated as
Before diving into the stagewise/boosting algorithm, we first consider an alternating convex search (ACS) approach (Chen et al., 2012; Minasian et al., 2014) which appears to be natural for solving (7) with any fixed tuning parameter. Specifically, we alternately optimize with respect to a block of variables with others fixed. For each block , the relevant constraints are and , but the objective function in (8) is a function of only through their product . So both constraints are avoided when optimizing with respect to . Let and , the subproblem boils down to
where . Once we obtain the solution , we can set and to satisfy the constraints whenever . If , is no longer identifiable, we then set and terminate the algorithm. Note that when , each subproblem is strongly convex and the generated solution sequence is uniformly bounded, we can show that ACS can converge to a coordinate-wise minimum point of (8) with properly chosen initial value (Mishra et al., 2017). The optimization procedure then needs to be repeated to a grid of tuning parameter values for obtaining the solution paths of parameters and locating the optimal sparse solution along the paths.
Inspired by the biconvex structure of (7) and the stagewise algorithm for LASSO (Zhao and Yu, 2007; Vaughan et al., 2017), we develop a fast stagewise unit-rank tensor factorization (SURF) algorithm to trace out the entire solution paths of (7) in a single run. The main idea of a stagewise procedure is to build a model from scratch, gradually increasing the model complexity in a sequence of simple learning steps. For instance, in stagewise estimation for linear regression, a forward step searches for the best predictor to explain the current residual and updates its coefficient by a small increment, and a backward step, on the contrary, may decrease the coefficient of a selected predictor to correct for greedy selection whenever necessary. Due to the biconvex or multi-convex structure of our objective function, it turns out that efficient stagewise estimation remains possible: the only catch is that, when we determine which coefficient to update at each iteration, we always get competing proposals from different tensor modes, rather than just one proposal in case of LASSO.
To simplify the notations, the objective function (9) is re-arranged into a standard LASSO as
using the augmented data and , where
is the identity matrix of size. We write the objective function (10) by
We use to denote all the variables if necessary.
The structure of our stagewise algorithm is presented in Algorithm 1. It can be viewed as a boosting procedure that builds up the solution gradually in terms of forward step (line 4) and backward step (line 3)222Boosting amounts to combine a set of “weak learners” to build one strong learner, and is connected to stagewise and regularized estimation methods. SURF is a boosting method since each backward/forward step is in essence finding a weaker learner to incrementally improve the current learner (thus generate a path of solutions).. The initialization step is solved explicitly (see Lemma 1 below). At each subsequent iteration, the parameter update takes the form in either forward or backward direction, where is a length- vector with all zeros except for a in the -th coordinate, , and is the pre-specified step size controlling the fineness of the grid. The algorithm also keeps track of the tuning parameter . Intuitively, the selection of the index and the increment
is guided by minimizing the penalized loss function with the currentsubject to a constraint on the step size. Comparing to the standard stagewise LASSO, the main difference here is that we need to select the “best” triple of over all the dimensions across all tensor modes.
where is a constant at each iteration. Then the solution at each forward step is
and at each backward step is
where denotes the vector formed by the diagonal elements of a square matrix, and is the -mode active index set at current iteration.
Computational Analysis. In Algorithm 1, the most time-consuming part are the calculations of and , involved in both forward and backward steps. We further write as , , where is a constant during each iteration but varies from one iteration to the next, is a length- vector with all ones. At each iteration, the computational cost is dominated by the update of (), which can be obtained by
where denotes every mode except and , which greatly reduces the computation. Therefore, when , the updates of , , and requires , , , operations, respectively, when , we only need to additionally update , which requires operations. Overall the computational complexity of our approach is per iteration. In contrast, the ACS algorithm has to be run for each fixed , and within each of such problems it requires per iteration (da Silva et al., 2015).
5 Convergence Analysis
We provide convergence analysis for our stagewise algorithm in this section. All detailed proofs are given in the Appendix A. Specifically, Lemma 1 and 2 below justify the validity of the initialization.
Let be the -mode matricization of . Denote where each is a column of , then
Moreover, letting and represents its corresponding indices in tensor space, then the initial non-zero solution of (11), denoted as , is given by
where is a vector with all 0’s except for a in the -th coordinate.
If there exists and with such that , it must be true that .
Lemma 3 shows that the backward step always performs coordinate descent update of fixed size , each time along the steepest coordinate direction within the current active set, until the descent becomes impossible subject to a tolerance level . Also, the forward step performs coordinate descent when . Lemma 4 shows that when gets changed, the penalized loss for the previous can no longer be improved subject to a tolerance level . Thus controls the granularity of the paths, and is a convergence threshold in optimizing the penalized loss with any fixed tuning parameter. They enable convenient trade-off between computational efficiency and estimation accuracy.
For any with , we have .
For any with , we have
For any such that , we have as , where denotes a coordinate-wise minimum point of Problem (7).
We evaluate the effectiveness and efficiency of our method SURF through numerical experiments on both synthetic and real data, and compare with various state-of-the-art regression methods, including LASSO, Elastic Net (ENet), Regularized multilinear regression and selection (Remurs) (Song and Lu, 2017)
, optimal CP-rank Tensor Ridge Regression (orTRR) (Guo et al., 2012), Generalized Linear Tensor Regression Model (GLTRM) (Zhou et al., 2013), and a variant of our method with Alternating Convex Search (ACS) estimation. Table 1 summarizes the properties of all methods. All methods are implemented in MATLAB and executed on a machine with 3.50GHz CPU and 256GB RAM. For LASSO and ENet we use the MATLAB package glmnet from (Friedman et al., 2010). For GLTRM, we solve the regularized CP tensor regression simultaneously for all factors based on TensorReg toolbox (Zhou et al., 2013). We follow (Kampa et al., 2014)
to arrange the test and training sets in the ratio of 1:5. The hyperparameters of all methods are optimized using 5-fold cross validation on the training set, with range, , and . Specifically, for GLTRM, ACS, and SURF, we simply set . For LASSO, ENet and ACS, we generate a sequence of 100 values for to cover the whole path. For fairness, the number of iterations for all compared methods are fixed to 100. All cases are run 50 times and the average results on the test set are reported. Our code is available at https://github.com/LifangHe/SURF.
Synthetic Data. We first use the synthetic data to examine the performance of our method in different scenarios, with varying step sizes, sparsity level, number of features as well as sample size. We generate the data as follows: , where is a random noise generated from , and . We generate samples from , where is a covariance matrix, the correlation coefficient between features and is defined as . We generate the true support as , where each , , is a column vector with i.i.d. entries and normalized with norm, the scalars are defined by . To impose sparsity on , we set of its entries (chosen uniformly at random) to zero. When studying one of factors, other factors are fixed to , , , .
In a first step we analyze the critical parameter for our method. This parameter controls how close SURF approximates the ACS paths. Figure 1 shows the solution path plot of our method versus ACS method under both big and small step sizes. As shown by the plots, a smaller step size leads to a closer approximation to the solutions of ACS. In Figure 2
, we also provide a plot of averaged prediction error with standard deviation bars (left side of y-axis) and CPU execution time (right side of y-axis in mins) over different values of step size. From the figure, we can see that the choice of the step size affects both computational speed and the root mean-squared prediction error (RMSE). The smaller the value of step size, the more accurate the regression result but the longer it will take to run. In both figures, the moderate step sizeseems to offer a better trade-off between performance and ease of implementation. In the following we fix .
Next, we examine the performance of our method with varying sparsity level of . For this purpose, we compare the prediction error (RMSE) and running time (log min) of all methods on the synthetic data. Figure 3(a)-(b) shows the results for the case of on synthetic 2D data, where indicates the sparsity level of true . As can be seen from the plots, SURF generates slightly better predictions than other existing methods when the true is sparser. Moreover, as shown in Figure 3(c), it is also interesting to note that larger step sizes give much more sparsity of coefficients for SURF, this explains why there is no value for some curves as the increase of in Figure 1(c).
Furthermore, we compare the prediction error and running time (log min) of all methods with increasing number of features. Figure 4(a)-(b) shows the results for the case of on synthetic 2D data. Overall, SURF gives better predictions at a lower computational cost. Particularly, SURF and ACS have very similar prediction qualities, this matches with our theoretical result on the solutions of SURF versus ACS. SURF achieves better predictions than other tensor methods, indicating the effectiveness of structured sparsity in unit-rank tensor decomposition itself. In terms of running time, it is clear that as the number of features is increased, SURF is significantly faster than other methods.
Finally, Figure 5 shows the results with increasing number of samples. Overall, SURF gives better predictions at a lower computational cost. Particularly, Remurs and orTRR do not change too much as increasing number of samples, this may due to early stop in the iteration process when searching for optimized solution.
Real Data. We also examine the performance of our method on a real medical image dataset, including both DTI and MRI images, obtained from the Parkinson’s progression markers initiative (PPMI) database333http://www.ppmi-info.org/data with human subjects. We parcel the brain into 84 regions and extract four types connectivity matrices. Our goal is to predict the Montreal Cognitive Assessment (MoCA) scores for each subject. Details of data processing are presented in Appendix B. We use three metrics to evaluate the performance: root mean squared prediction error (RMSE), which describes the deviation between the ground truth of the response and the predicted values in out-of-sample testing; sparsity of coefficients (SC), which is the same as the defined in the synthetic data analysis (i.e., the percentage of zero entries in the corresponding coefficient); and CPU execution time. Table 2
shows the results of all methods on both individual and combined datasets. Again, we can observe that SURF gives better predictions at a lower computational cost, as well as good sparsity. In particular, the paired t-tests showed that for all five real datasets, the RMSE and SC of our approach are significantly lower and higher than those of Remurs and GLTRM methods, respectively. This indicates that the performance gain of our approach over the other low-rank + sparse methods is indeed significant. Figure4(c) provides the running time (log min) of all methods on the PPMI MRI images of voxels each, which clearly demonstrates the efficiency of our approach.
This work is supported by NSF No. IIS-1716432 (Wang), IIS-1750326 (Wang), IIS-1718798 (Chen), DMS-1613295 (Chen), IIS-1749940 (Zhou), IIS-1615597 (Zhou), ONR N00014-18-1-2585 (Wang), and N00014-17-1-2265 (Zhou), and Michael J. Fox Foundation grant number 14858 (Wang). Data used in the preparation of this article were obtained from the Parkinson’s Progression Markers Initiative (PPMI) database (http://www.ppmi-info.org/data). For up-to-date information on the study, visit http://www.ppmi-info.org. PPMI – a public-private partnership – is funded by the Michael J. Fox Foundation for Parkinson’s Research and funding partners, including Abbvie, Avid, Biogen, Bristol-Mayers Squibb, Covance, GE, Genentech, GlaxoSmithKline, Lilly, Lundbeck, Merk, Meso Scale Discovery, Pfizer, Piramal, Roche, Sanofi, Servier, TEVA, UCB and Golub Capital.
- Hastie et al.  Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning, 2009.
- Tibshirani  Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, pages 267–288, 1996.
Yu and Liu 
Rose Yu and Yan Liu.
Learning from multiway data: Simple and efficient tensor regression.
International Conference on Machine Learning, pages 373–381, 2016.
- Zhou et al.  Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.
- Su et al.  Ya Su, Xinbo Gao, Xuelong Li, and Dacheng Tao. Multivariate multilinear regression. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(6):1560–1573, 2012.
- Guo et al.  Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827, 2012.
Wang et al. 
Fei Wang, Ping Zhang, Buyue Qian, Xiang Wang, and Ian Davidson.
Clinical risk prediction with multilinear sparse logistic regression.In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 145–154. ACM, 2014.
- Zou and Hastie  Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.
- Tan et al.  Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang. Logistic tensor regression for classification. In International Conference on Intelligent Science and Intelligent Data Engineering, pages 573–581. Springer, 2012.
- Signoretto et al.  Marco Signoretto, Quoc Tran Dinh, Lieven De Lathauwer, and Johan AK Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351, 2014.
- Song and Lu  Xiaonan Song and Haiping Lu. Multilinear regression for embedded feature selection with application to fmri analysis. In AAAI, pages 2562–2568, 2017.
- Bengua et al.  Johann A Bengua, Ho N Phien, Hoang Duong Tuan, and Minh N Do. Efficient tensor completion for color image and video recovery: Low-rank tensor train. IEEE Transactions on Image Processing, 26(5):2466–2479, 2017.
- Kolda and Bader  Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
Chen et al. 
Kun Chen, Kung-Sik Chan, and Nils Chr Stenseth.
Reduced rank stochastic regression with a sparse singular value decomposition.Journal of the Royal Statistical Society: Series B, 74(2):203–221, 2012.
- Mishra et al.  Aditya Mishra, Dipak K. Dey, and Kun Chen. Sequential co-sparse factor regression. Journal of Computational and Graphical Statistics, pages 814–825, 2017.
- Phan et al.  Anh-Huy Phan, Petr Tichavskỳ, and Andrzej Cichocki. Tensor deflation for candecomp/parafac—part i: Alternating subspace update algorithm. IEEE Transactions on Signal Processing, 63(22):5924–5938, 2015.
- Minasian et al.  Arin Minasian, Shahram ShahbazPanahi, and Raviraj S Adve. Energy harvesting cooperative communication systems. IEEE Transactions on Wireless Communications, 13(11):6118–6131, 2014.
- Zhao and Yu  Peng Zhao and Bin Yu. Stagewise lasso. Journal of Machine Learning Research, 8(Dec):2701–2726, 2007.
- Vaughan et al.  Gregory Vaughan, Robert Aseltine, Kun Chen, and Jun Yan. Stagewise generalized estimation equations with grouped variables. Biometrics, 73:1332–1342, 2017.