1 Introduction
Regression analysis is commonly used for modeling the relationship between a predictor vector and a scalar response . Typically a good regression model can achieve two goals: accurate prediction on future response and parsimonious interpretation of the dependence structure between and (Hastie et al., 2009). As a general setup, it fits training samples via minimizing a regularized loss, i.e., a loss plus a regularization term , as follows
(1) 
where is the regression coefficient vector, is the standard Euclidean inner product, and is the regularization parameter. For example, the sum of squared loss with norm regularization leads to the celebrated LASSO approach (Tibshirani, 1996), which performs sparse estimation of
and thus has implicit feature selection embedded therein.
In many modern realworld applications, the predictors/features are represented more naturally as higherorder tensors, such as videos and Magnetic Resonance Imaging (MRI) scans. In this case, if we want to predict a response variable for each tensor, a naive approach is to perform linear regression on the vectorized data (e.g., by stretching the tensor element by element). However, it completely ignores the multidimensional structure of the tensor data, such as the spatial coherence of the voxels. This motivates the tensor regression framework
(Yu and Liu, 2016; Zhou et al., 2013), which treats each observation as a tensor and learns a tensor coefficient via regularized model fitting:(2) 
When norm regularization is used, this formulation is essentially equivalent to (1) via vectorization. To effectively exploit the structural information of , we can impose a lowrank constraint on for Problem (2). Some authors achieved this by fixing the CANDECOMP/PARAFAC (CP) rank of as a priori. For example, (Su et al., 2012) assumed to be rank1. Since rank1 constraint is too restrictive, (Guo et al., 2012) and (Zhou et al., 2013) imposed a rank constraint in a tensor decomposition model, but none of the above methods considered adding sparsity constraint as well to enhance the model interpretability. (Wang et al., 2014) imposed a restrictive rank1 constraint on and also applied an elastic net regularization (Zou and Hastie, 2005). (Tan et al., 2012) imposed a rank constraint and applied norm regularization to factor matrices to also promote sparsity in . (Signoretto et al., 2014) applied the trace norm (nuclear norm) for lowrank estimation of , and (Song and Lu, 2017) imposed a combination of trace norm and norm on . (Bengua et al., 2017) showed that the trace norm may not be appropriate for capturing the global correlation of a tensor as it provides only the mean of the correlation between a single mode (rather than a few modes) and the rest of the tensor. In all the above sparse and lowrank tensor models, the sparsity is imposed on itself, which, does not necessarily lead to the sparsity on the decomposed matrices.
In this paper, we propose a sparse and lowrank tensor regression model in which the unitrank tensors from the CP decomposition of the coefficient tensor are assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divideandconquer strategy to simplify the task into a set of sparse unitrank tensor factorization/regression problems (SURF) in the form of
To make the solution process efficient for the SURF problem, we propose a boosted/stagewise estimation procedure to efficiently trace out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The effectiveness and efficiency of our proposed approach is demonstrated on various realworld datasets as well as under various simulation setups.
2 Preliminaries on Tensors
We start with a brief review of some necessary preliminaries on tensors, and more details can be found in (Kolda and Bader, 2009). We denote scalars by lowercase letters, e.g., ; vectors by boldfaced lowercase letters, e.g., ; matrices by boldface uppercase letters, e.g., ; and tensors by calligraphic letters, e.g., . We denote their entries by , , , etc., depending on the number of dimensions. Indices are denoted by lowercase letters spanning the range from 1 to the uppercase letter of the index, e.g., . Each entry of an thorder tensor is indexed by indices , and each indexes the mode of . Specifically, denotes every mode except .
Definition 1 (Inner Product).
The inner product of two tensors is the sum of the products of their entries, defined as .
It follows immediately that the Frobenius norm of is defined as . The norm of a tensor is defined as .
Definition 2 (Tensor Product).
Let be a length vector for each . The tensor product of , denoted by , is an tensor of which the entries are given by . We call a rank1 tensor or a unitrank tensor.
Definition 3 (CP Decomposition).
Every tensor can be decomposed as a weighted sum of rank1 tensors with a suitably large as
(3) 
where , and .
Definition 4 (Tensor Rank).
The tensor rank of , denoted by , is the smallest number such that the equality (3) holds.
Definition 5 (mode Product).
The mode product of a tensor by a vector , denoted by , is an tensor of which the entries are given by
3 Sparse and LowRank Tensor Regression
3.1 Model Formulation
For an thorder predictor tensor and a scalar response , , we consider the regression model of the form
(4) 
where is an thorder coefficient tensor, and is a random error term of zero mean. Without loss of generality, the intercept is set to zero by centering the response and standardizing the predictors as ; and for . Our goal is to estimate with i.i.d. observations .
To reduce the complexity of the model and leverage the structural information in , we assume that the coefficient tensor to be both lowrank and sparse. Specifically, we assume can be decomposed via CP decomposition as , where each rank1 tensor is possibly sparse, or equivalently, the vectors in its representation , , are possibly sparse, for all .
Here we impose sparsity on the rank1 components from CP decomposition – rather than on itself (Chen et al., 2012; Tan et al., 2012). This adaption can be more beneficial in multiple ways: 1) It integrates a finer sparsity structure into the CP decomposition, which enables a direct control of componentwise sparsity; 2) It leads to an appealing model interpretation and feature grouping: the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions; 3) It leads to a more flexible and parsimonious model as it requires less number of parameters to recover the withindecomposition sparsity of a tensor than existing methods which impose sparsity on the tensor itself, thus makes the model generalizability better.
A straightforward way of conducting model estimation is to solve the following optimization problem:
(5) 
where , , are regularization parameters. This problem is very difficult to solve because: 1) The CP rank needs to be prespecified; 2) As the CP decomposition may not be unique, the pursue of its withindecomposition sparsity is highly nonconvex and the problem may suffer from parameter identifiability issues (Mishra et al., 2017); 3) The estimation may involve many regularization parameters, for which the tuning becomes very costly.
3.2 DivideandConquer: Sequential Pursue for Sparse Tensor Decomposition
We propose a divideandconquer strategy to recover the sparse CP decomposition. Our approach is based on the sequential extraction method (a.k.a. deflation) (Phan et al., 2015; Mishra et al., 2017), which seeks a unitrank tensor at a time and then deflates to find further unitrank tensors from the residuals. This has been proved to be a rapid and effective method of partitioning and concentrating tensor decomposition. Specifically, we sequentially solve the following sparse unitrank problem:
(6) 
where is the sequential number of the unitrank terms and is the current residue of response with
where is the estimated unitrank tensor in the th step, with tuning done by, e.g., cross validation. The final estimator is then obtained as . Here for improving the convexity of the problem and its numerical stability, we have used the elastic net (Zou and Hastie, 2005) penalty form instead of LASSO, which is critical to ensure the convergence of the optimization solution. The accuracy of the solution can be controlled simply by adjusting the values of and . Since we mainly focus on sparse estimation, we fix as a small constant in numerical studies.
As each is of unit rank, it can be decomposed as with and . It is then clear that . That is, the sparsity of a unitrank tensor directly leads to the sparsity of its components. This allows us to kill multiple birds with one stone: by simply pursuing the elementwise sparsity of the unitrank coefficient tensor with only one tuning parameter , solving (6) can produce a set of sparse factor coefficients for simultaneously.
With this sequential pursue strategy, the general problem boils down to a set of sparse unitrank estimation problems, for which we develop a novel stagewise/boosting algorithm.
4 Fast Stagewise UnitRank Tensor Factorization (SURF)
For simplicity, we drop the index and write the generic form of the problem in (6) as
(7) 
Let , where , , and the factors are identifiable up to sign flipping. Let , and . Then (7) can be reformulated as
(8) 
Before diving into the stagewise/boosting algorithm, we first consider an alternating convex search (ACS) approach (Chen et al., 2012; Minasian et al., 2014) which appears to be natural for solving (7) with any fixed tuning parameter. Specifically, we alternately optimize with respect to a block of variables with others fixed. For each block , the relevant constraints are and , but the objective function in (8) is a function of only through their product . So both constraints are avoided when optimizing with respect to . Let and , the subproblem boils down to
(9) 
where . Once we obtain the solution , we can set and to satisfy the constraints whenever . If , is no longer identifiable, we then set and terminate the algorithm. Note that when , each subproblem is strongly convex and the generated solution sequence is uniformly bounded, we can show that ACS can converge to a coordinatewise minimum point of (8) with properly chosen initial value (Mishra et al., 2017). The optimization procedure then needs to be repeated to a grid of tuning parameter values for obtaining the solution paths of parameters and locating the optimal sparse solution along the paths.
Inspired by the biconvex structure of (7) and the stagewise algorithm for LASSO (Zhao and Yu, 2007; Vaughan et al., 2017), we develop a fast stagewise unitrank tensor factorization (SURF) algorithm to trace out the entire solution paths of (7) in a single run. The main idea of a stagewise procedure is to build a model from scratch, gradually increasing the model complexity in a sequence of simple learning steps. For instance, in stagewise estimation for linear regression, a forward step searches for the best predictor to explain the current residual and updates its coefficient by a small increment, and a backward step, on the contrary, may decrease the coefficient of a selected predictor to correct for greedy selection whenever necessary. Due to the biconvex or multiconvex structure of our objective function, it turns out that efficient stagewise estimation remains possible: the only catch is that, when we determine which coefficient to update at each iteration, we always get competing proposals from different tensor modes, rather than just one proposal in case of LASSO.
To simplify the notations, the objective function (9) is rearranged into a standard LASSO as
(10) 
using the augmented data and , where
is the identity matrix of size
. We write the objective function (10) byWe use to denote all the variables if necessary.
(11) 
(12) 
(13)  
The structure of our stagewise algorithm is presented in Algorithm 1. It can be viewed as a boosting procedure that builds up the solution gradually in terms of forward step (line 4) and backward step (line 3)^{2}^{2}2Boosting amounts to combine a set of “weak learners” to build one strong learner, and is connected to stagewise and regularized estimation methods. SURF is a boosting method since each backward/forward step is in essence finding a weaker learner to incrementally improve the current learner (thus generate a path of solutions).. The initialization step is solved explicitly (see Lemma 1 below). At each subsequent iteration, the parameter update takes the form in either forward or backward direction, where is a length vector with all zeros except for a in the th coordinate, , and is the prespecified step size controlling the fineness of the grid. The algorithm also keeps track of the tuning parameter . Intuitively, the selection of the index and the increment
is guided by minimizing the penalized loss function with the current
subject to a constraint on the step size. Comparing to the standard stagewise LASSO, the main difference here is that we need to select the “best” triple of over all the dimensions across all tensor modes.Problem (12) and (13) can be solved efficiently. By expansion of , we have
where is a constant at each iteration. Then the solution at each forward step is
and at each backward step is
where denotes the vector formed by the diagonal elements of a square matrix, and is the mode active index set at current iteration.
Computational Analysis. In Algorithm 1, the most timeconsuming part are the calculations of and , involved in both forward and backward steps. We further write as , , where is a constant during each iteration but varies from one iteration to the next, is a length vector with all ones. At each iteration, the computational cost is dominated by the update of (), which can be obtained by
(14) 
where denotes every mode except and , which greatly reduces the computation. Therefore, when , the updates of , , and requires , , , operations, respectively, when , we only need to additionally update , which requires operations. Overall the computational complexity of our approach is per iteration. In contrast, the ACS algorithm has to be run for each fixed , and within each of such problems it requires per iteration (da Silva et al., 2015).
5 Convergence Analysis
We provide convergence analysis for our stagewise algorithm in this section. All detailed proofs are given in the Appendix A. Specifically, Lemma 1 and 2 below justify the validity of the initialization.
Lemma 1.
Let be the mode matricization of . Denote where each is a column of , then
Moreover, letting and represents its corresponding indices in tensor space, then the initial nonzero solution of (11), denoted as , is given by
where is a vector with all 0’s except for a in the th coordinate.
Lemma 2.
If there exists and with such that , it must be true that .
Lemma 3 shows that the backward step always performs coordinate descent update of fixed size , each time along the steepest coordinate direction within the current active set, until the descent becomes impossible subject to a tolerance level . Also, the forward step performs coordinate descent when . Lemma 4 shows that when gets changed, the penalized loss for the previous can no longer be improved subject to a tolerance level . Thus controls the granularity of the paths, and is a convergence threshold in optimizing the penalized loss with any fixed tuning parameter. They enable convenient tradeoff between computational efficiency and estimation accuracy.
Lemma 3.
For any with , we have .
Lemma 4.
For any with , we have
Theorem 1.
For any such that , we have as , where denotes a coordinatewise minimum point of Problem (7).
6 Experiments
We evaluate the effectiveness and efficiency of our method SURF through numerical experiments on both synthetic and real data, and compare with various stateoftheart regression methods, including LASSO, Elastic Net (ENet), Regularized multilinear regression and selection (Remurs) (Song and Lu, 2017)
, optimal CPrank Tensor Ridge Regression (
orTRR) (Guo et al., 2012), Generalized Linear Tensor Regression Model (GLTRM) (Zhou et al., 2013), and a variant of our method with Alternating Convex Search (ACS) estimation. Table 1 summarizes the properties of all methods. All methods are implemented in MATLAB and executed on a machine with 3.50GHz CPU and 256GB RAM. For LASSO and ENet we use the MATLAB package glmnet from (Friedman et al., 2010). For GLTRM, we solve the regularized CP tensor regression simultaneously for all factors based on TensorReg toolbox (Zhou et al., 2013). We follow (Kampa et al., 2014)to arrange the test and training sets in the ratio of 1:5. The hyperparameters of all methods are optimized using 5fold cross validation on the training set, with range
, , and . Specifically, for GLTRM, ACS, and SURF, we simply set . For LASSO, ENet and ACS, we generate a sequence of 100 values for to cover the whole path. For fairness, the number of iterations for all compared methods are fixed to 100. All cases are run 50 times and the average results on the test set are reported. Our code is available at https://github.com/LifangHe/SURF.Synthetic Data. We first use the synthetic data to examine the performance of our method in different scenarios, with varying step sizes, sparsity level, number of features as well as sample size. We generate the data as follows: , where is a random noise generated from , and . We generate samples from , where is a covariance matrix, the correlation coefficient between features and is defined as . We generate the true support as , where each , , is a column vector with i.i.d. entries and normalized with norm, the scalars are defined by . To impose sparsity on , we set of its entries (chosen uniformly at random) to zero. When studying one of factors, other factors are fixed to , , , .
In a first step we analyze the critical parameter for our method. This parameter controls how close SURF approximates the ACS paths. Figure 1 shows the solution path plot of our method versus ACS method under both big and small step sizes. As shown by the plots, a smaller step size leads to a closer approximation to the solutions of ACS. In Figure 2
, we also provide a plot of averaged prediction error with standard deviation bars (left side of yaxis) and CPU execution time (right side of yaxis in mins) over different values of step size. From the figure, we can see that the choice of the step size affects both computational speed and the root meansquared prediction error (RMSE). The smaller the value of step size, the more accurate the regression result but the longer it will take to run. In both figures, the moderate step size
seems to offer a better tradeoff between performance and ease of implementation. In the following we fix .Next, we examine the performance of our method with varying sparsity level of . For this purpose, we compare the prediction error (RMSE) and running time (log min) of all methods on the synthetic data. Figure 3(a)(b) shows the results for the case of on synthetic 2D data, where indicates the sparsity level of true . As can be seen from the plots, SURF generates slightly better predictions than other existing methods when the true is sparser. Moreover, as shown in Figure 3(c), it is also interesting to note that larger step sizes give much more sparsity of coefficients for SURF, this explains why there is no value for some curves as the increase of in Figure 1(c).
Furthermore, we compare the prediction error and running time (log min) of all methods with increasing number of features. Figure 4(a)(b) shows the results for the case of on synthetic 2D data. Overall, SURF gives better predictions at a lower computational cost. Particularly, SURF and ACS have very similar prediction qualities, this matches with our theoretical result on the solutions of SURF versus ACS. SURF achieves better predictions than other tensor methods, indicating the effectiveness of structured sparsity in unitrank tensor decomposition itself. In terms of running time, it is clear that as the number of features is increased, SURF is significantly faster than other methods.
Finally, Figure 5 shows the results with increasing number of samples. Overall, SURF gives better predictions at a lower computational cost. Particularly, Remurs and orTRR do not change too much as increasing number of samples, this may due to early stop in the iteration process when searching for optimized solution.
Datasets  Metrics  Comparative Methods  
LASSO  ENet  Remurs  orTRR  GLTRM  ACS  SURF  
DTI  RMSE  2.940.34  2.920.32  2.910.32  3.480.21  3.090.35  2.810.24  2.810.23 
Sparsity  0.990.01  0.970.01  0.660.13  0.000.00  0.900.10  0.920.02  0.950.01  
Time  6.40.3  46.64.6  161.39.3  27.95.6  874.829.6  60.824.4  1.70.2  
DTI  RMSE  3.180.36  3.160.42  2.970.30  3.760.44  3.260.46  2.900.31  2.910.32 
Sparsity  0.990.01  0.950.03  0.370.09  0.000.00  0.910.06  0.930.02  0.940.01  
Time  5.70.3  42.42.9  155.010.7  10.20.1  857.422.5  63.021.6  5.20.8  
DTI  RMSE  3.060.34  2.990.34  2.930.27  3.560.41  3.140.39  2.890.38  2.870.35 
Sparsity  0.980.01  0.950.01  0.430.17  0.000.00  0.870.03  0.900.03  0.930.02  
Time  5.80.3  45.01.0  163.69.0  7.50.9  815.46.5  66.344.9  1.50.1  
DTI  RMSE  3.200.40  3.210.59  2.840.35  3.660.35  3.120.32  2.820.33  2.830.32 
Sparsity  0.990.01  0.960.03  0.440.13  0.000.00  0.860.03  0.900.02  0.910.02  
Time  5.50.2  42.31.4  159.67.6  26.63.1  835.89.9  96.743.2  3.80.5  
Combined  RMSE  3.020.37  2.890.41  2.810.31  3.330.27  3.260.45  2.790.31  2.780.29 
Sparsity  0.990.00  0.970.01  0.340.22  0.000.00  0.910.19  0.970.01  0.990.00  
Time  8.80.6  71.92.3  443.5235.7  48.48.0  1093.649.7  463.2268.1  6.20.5 
Real Data. We also examine the performance of our method on a real medical image dataset, including both DTI and MRI images, obtained from the Parkinson’s progression markers initiative (PPMI) database^{3}^{3}3http://www.ppmiinfo.org/data with human subjects. We parcel the brain into 84 regions and extract four types connectivity matrices. Our goal is to predict the Montreal Cognitive Assessment (MoCA) scores for each subject. Details of data processing are presented in Appendix B. We use three metrics to evaluate the performance: root mean squared prediction error (RMSE), which describes the deviation between the ground truth of the response and the predicted values in outofsample testing; sparsity of coefficients (SC), which is the same as the defined in the synthetic data analysis (i.e., the percentage of zero entries in the corresponding coefficient); and CPU execution time. Table 2
shows the results of all methods on both individual and combined datasets. Again, we can observe that SURF gives better predictions at a lower computational cost, as well as good sparsity. In particular, the paired ttests showed that for all five real datasets, the RMSE and SC of our approach are significantly lower and higher than those of Remurs and GLTRM methods, respectively. This indicates that the performance gain of our approach over the other lowrank + sparse methods is indeed significant. Figure
4(c) provides the running time (log min) of all methods on the PPMI MRI images of voxels each, which clearly demonstrates the efficiency of our approach.Acknowledgements
This work is supported by NSF No. IIS1716432 (Wang), IIS1750326 (Wang), IIS1718798 (Chen), DMS1613295 (Chen), IIS1749940 (Zhou), IIS1615597 (Zhou), ONR N000141812585 (Wang), and N000141712265 (Zhou), and Michael J. Fox Foundation grant number 14858 (Wang). Data used in the preparation of this article were obtained from the Parkinson’s Progression Markers Initiative (PPMI) database (http://www.ppmiinfo.org/data). For uptodate information on the study, visit http://www.ppmiinfo.org. PPMI – a publicprivate partnership – is funded by the Michael J. Fox Foundation for Parkinson’s Research and funding partners, including Abbvie, Avid, Biogen, BristolMayers Squibb, Covance, GE, Genentech, GlaxoSmithKline, Lilly, Lundbeck, Merk, Meso Scale Discovery, Pfizer, Piramal, Roche, Sanofi, Servier, TEVA, UCB and Golub Capital.
References
 Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning, 2009.
 Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, pages 267–288, 1996.

Yu and Liu [2016]
Rose Yu and Yan Liu.
Learning from multiway data: Simple and efficient tensor regression.
In
International Conference on Machine Learning
, pages 373–381, 2016.  Zhou et al. [2013] Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.
 Su et al. [2012] Ya Su, Xinbo Gao, Xuelong Li, and Dacheng Tao. Multivariate multilinear regression. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(6):1560–1573, 2012.
 Guo et al. [2012] Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827, 2012.

Wang et al. [2014]
Fei Wang, Ping Zhang, Buyue Qian, Xiang Wang, and Ian Davidson.
Clinical risk prediction with multilinear sparse logistic regression.
In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 145–154. ACM, 2014.  Zou and Hastie [2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.
 Tan et al. [2012] Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang. Logistic tensor regression for classification. In International Conference on Intelligent Science and Intelligent Data Engineering, pages 573–581. Springer, 2012.
 Signoretto et al. [2014] Marco Signoretto, Quoc Tran Dinh, Lieven De Lathauwer, and Johan AK Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351, 2014.
 Song and Lu [2017] Xiaonan Song and Haiping Lu. Multilinear regression for embedded feature selection with application to fmri analysis. In AAAI, pages 2562–2568, 2017.
 Bengua et al. [2017] Johann A Bengua, Ho N Phien, Hoang Duong Tuan, and Minh N Do. Efficient tensor completion for color image and video recovery: Lowrank tensor train. IEEE Transactions on Image Processing, 26(5):2466–2479, 2017.
 Kolda and Bader [2009] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

Chen et al. [2012]
Kun Chen, KungSik Chan, and Nils Chr Stenseth.
Reduced rank stochastic regression with a sparse singular value decomposition.
Journal of the Royal Statistical Society: Series B, 74(2):203–221, 2012.  Mishra et al. [2017] Aditya Mishra, Dipak K. Dey, and Kun Chen. Sequential cosparse factor regression. Journal of Computational and Graphical Statistics, pages 814–825, 2017.
 Phan et al. [2015] AnhHuy Phan, Petr Tichavskỳ, and Andrzej Cichocki. Tensor deflation for candecomp/parafac—part i: Alternating subspace update algorithm. IEEE Transactions on Signal Processing, 63(22):5924–5938, 2015.
 Minasian et al. [2014] Arin Minasian, Shahram ShahbazPanahi, and Raviraj S Adve. Energy harvesting cooperative communication systems. IEEE Transactions on Wireless Communications, 13(11):6118–6131, 2014.
 Zhao and Yu [2007] Peng Zhao and Bin Yu. Stagewise lasso. Journal of Machine Learning Research, 8(Dec):2701–2726, 2007.
 Vaughan et al. [2017] Gregory Vaughan, Robert Aseltine, Kun Chen, and Jun Yan. Stagewise generalized estimation equations with grouped variables. Biometrics, 73:1332–1342, 2017.
 da Silva et al. [2015] Alex Pereira da Silva, Pierre Comon, and Andre Lima Ferrer de Almeida. Rank1 tensor approximation methods and application to deflation. arXiv preprint arXiv:1508.05273, 2015.
 Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
 Kampa et al. [2014] Kittipat Kampa, S Mehta, ChunAn Chou, Wanpracha Art Chaovalitwongse, and Thomas J Grabowski. Sparse optimization in feature selection: application in neuroimaging. Journal of Global Optimization, 59(23):439–457, 2014.