1 Introduction
Higher-order data, also called tensors, frequently occur in scientific and real-world applications. In neuroscience, for instance, functional magnetic resonance imaging (fMRI) yields tensor data consisting of a series of brain-scan images; such data can be organized as 3D (3-mode) data with shape time × neuron × neuron. In many fields we encounter the problem of analyzing the relationship between a tensor variable X_i and a scalar response y_i for every sample i. Specifically, we assume

(1)  y_i = ⟨X_i, B⟩ + ε_i,

in which ⟨·,·⟩ is the inner product operator, ε_i is the noise, and B is the coefficient that needs to be estimated through regression. Notice that in the real world, such tensor data generally have two properties that make the coefficient difficult to infer: (1)
Ultrahigh-dimensional setting, where the number of samples is much smaller than the number of variables. For example, each sample of the CMU2008 dataset [17] is a 3D tensor with 71,553 voxels in total, yet only 360 trials were recorded. The high-dimensional setting makes the estimation break down, because we are trying to infer a large model from a limited number of observations. (2) Higher-order structure of data. Higher-order structure exists in many fields, such as fMRI and videos, whose data have the shape time × pixel × pixel. Traditional machine learning methods are designed for vectors or matrices; hence, dealing with higher-order data is difficult for them. In past years, many methods have been introduced to address these two problems.
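As a concrete illustration of the regression model in Eq. (1), the following sketch (hypothetical shapes, NumPy only, not the paper's code) generates tensor covariates and scalar responses via the tensor inner product:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N samples, each a 3-mode tensor (e.g. time x voxel x voxel).
N, shape = 8, (4, 3, 3)
B = rng.normal(size=shape)            # true coefficient tensor (unknown in practice)
X = rng.normal(size=(N,) + shape)     # tensor-valued covariates
eps = 0.01 * rng.normal(size=N)       # noise

# The inner product <X_i, B> is the sum of element-wise products.
y = np.array([np.sum(X[i] * B) for i in range(N)]) + eps
```

The estimation task of the paper is to recover B from the pairs (X_i, y_i) alone.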
To preserve the spatial structure, several methods have been introduced based on the CANDECOMP/PARAFAC (CP) decomposition, which approximates an M-order tensor through

(2)  B ≈ ∑_{r=1}^{R} a_r^(1) ∘ a_r^(2) ∘ ⋯ ∘ a_r^(M),

where ∘ denotes the vector outer product. Here, R is defined as the CP-rank of the tensor B. For instance, [31] propose GLTRM, which first decomposes the variable tensor and then applies a generalized linear model to estimate each component vector. In addition, [7] propose SURF, which uses a divide-and-conquer strategy for each component vector. Almost all CP-decomposition-based methods, including GLTRM, suffer from the problem that the CP-rank R must be prespecified. However, we usually have no prior knowledge about the value of R. Even if we use techniques such as cross-validation to estimate R from the data, the solving procedure becomes tedious and computationally expensive for large-scale data. A method called orTRR, proposed in [5], automatically obtains a low-rank coefficient without prespecifying R. But orTRR uses the ℓ2 norm rather than the ℓ1 norm to recover the sparsity of the data, which makes its performance on variable selection poorer than that of other methods.
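To make Eq. (2) concrete, a minimal NumPy sketch (the helper name `cp_reconstruct` is ours, for illustration only) rebuilds a tensor from CP factor vectors by summing outer products:

```python
import numpy as np

def cp_reconstruct(factors):
    """Sum of outer products of CP factor vectors, as in Eq. (2).

    factors[r] is the list of M vectors belonging to the r-th rank-one term.
    """
    tensor = None
    for vecs in factors:
        outer = vecs[0]
        for v in vecs[1:]:
            outer = np.multiply.outer(outer, v)
        tensor = outer if tensor is None else tensor + outer
    return tensor

# A rank-1 third-order example: a single outer product a o b o c.
a, b, c = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])
T = cp_reconstruct([[a, b, c]])
```

CP-based regression methods estimate the factor vectors a_r^(k) instead of the full tensor, which is why the rank R must be chosen in advance.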
To address these problems, other methods impose constraints directly on the coefficient tensor. For example, [20] propose Remurs, which exploits the commonly used ℓ1 norm to enforce sparsity on the estimated coefficient tensor; a tensor nuclear norm is added to make the solution low-rank. Remurs is solved through the ADMM framework [2]. Although it obtains acceptable solutions, Remurs incurs an expensive time cost for large-scale data.
In this paper, we derive ideas from a scalable estimator, the Elementary Estimator [24], and propose Sparse and Low-Rank Higher-Order Tensor Regression (SLTR), which directly enforces sparsity and low-rankness on the coefficient tensor with the ℓ1 norm and the nuclear norm, respectively. We also propose a fast algorithm that makes use of the parallel proximal method. Because the solution can be obtained in a two-layer parallel manner, the algorithm can be implemented in parallel through multithreaded computation or GPUs; thus, the solution of SLTR can be obtained with small time complexity. We empirically show that the parallel version of SLTR is faster than previous methods while obtaining solutions that are no worse. See details in Section 6 and Section 7. To summarize, this paper makes the following contributions:

A sparse and low-rank tensor regression model: Deriving ideas from the Elementary Estimator, in Section 4 we propose a sparse and low-rank tensor regression model built on the ℓ1 norm and the nuclear norm, respectively.

A fast and scalable algorithm for the model: We provide a fast solution to our model, obtained through two-layer parallelization. In Section 6 we show that SLTR has lower time complexity than its counterparts.

A convergence-rate analysis of sparse and low-rank tensor regression: In Section 5, we prove a sharp error bound for our model; in particular, we provide a more exact error bound for 3D data. To the best of our knowledge, we are the first to prove an error bound for sparse and low-rank tensor regression with the ℓ1 norm and the nuclear norm applied directly to the tensor.

Experiments on a real-world fMRI dataset: We evaluate SLTR and four baseline methods on several simulated datasets and one fMRI dataset with nine projects. Results show that SLTR is faster and more stable than previous methods. See a detailed explanation in Section 7.
2 Notations
‖·‖_1 denotes the element-wise ℓ1 norm, ‖·‖_* denotes the nuclear norm, ‖·‖_∞ denotes the ℓ∞ norm, and ‖·‖_2 denotes the spectral norm. Unfolding a tensor W along its k-th mode is denoted as W_(k), where the columns of W_(k) are the k-th-mode vectors of W. The inverse operator is denoted as fold_(k)(·). The operation vec⁻¹(·) converts a vector into a tensor.
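The unfolding and folding operators can be sketched in NumPy as follows (the function names `unfold` and `fold` are ours, chosen for illustration):

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-k unfolding: the mode-k fibers of the tensor become the columns.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    # Inverse of unfold: reshape back, then move the mode axis into place.
    moved = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(moved), 0, mode)

T = np.arange(24).reshape(2, 3, 4)
U1 = unfold(T, 1)                  # a 3 x 8 matrix
T_back = fold(U1, 1, T.shape)      # recovers T exactly
```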
3 Background
3.1 Elementary estimator for linear regression models
In the literature of linear regression (LR), several methods address the difficulty of estimation in the high-dimensional setting, in which the number of samples is smaller than the number of variables. For instance, Lasso attaches an ℓ1 norm to the ordinary least squares (OLS) problem in order to reduce the number of variables; it can be solved iteratively via the proximal gradient method. [3] further propose the Dantzig selector for estimation in high-dimensional settings, which is solved through linear programming (LP) with an interior-point method. These methods have prohibitively large time complexity when there is a large number of variables. To address this problem, [24] propose a fast estimator for high-dimensional linear regression named the Elementary Estimator (EE). Given vector samples x_i and scalar responses y_i, EE aims to solve

(3)  min_θ R(θ)
     s.t. R*(θ − θ̂_init) ≤ λ.

Here, R(·) is an arbitrary regularization function, R*(·) is its dual norm, and θ̂_init is an initial closed-form estimate. EE shares similarities with the Dantzig selector; however, it has been shown in [24] that EE is more practically scalable in high-dimensional settings.
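With the ℓ1 norm as the regularizer R (so that its dual is the ℓ∞ norm), Eq. (3) admits a simple closed-form solution: element-wise soft-thresholding of the initial estimate. A minimal sketch, where the initial estimate is a made-up vector standing in for whatever closed-form estimator is used:

```python
import numpy as np

def soft_threshold(v, lam):
    # Closed form of: minimize ||theta||_1  s.t.  ||theta - v||_inf <= lam.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

theta_init = np.array([0.3, -1.5, 0.05, 2.0])   # hypothetical initial estimate
theta_hat = soft_threshold(theta_init, lam=0.1)
```

This closed form is what makes EE cheap: no iterative optimization over the constraint set is required.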
Furthermore, [25] propose the superposition-structured elementary estimator, which combines several structural constraints and has the formulation
(4)  
s.t.  
This class of superposition-structured estimators can be solved by solving every single estimator independently and in parallel.
3.2 Regularized matrix regression
Although plenty of methods have been proposed for LR, they cannot be directly used in tasks with matrix data, such as digital images. One intuitive workaround is to first vectorize the two-dimensional data and then apply existing LR methods. However, this destroys the higher-order structure of the data, such as spatial coherence. Therefore, [30] extend ordinary linear regression to two-dimensional data and propose regularized matrix regression. Because, for 2D data, the true coefficient can often be well approximated by a low-rank structure, regularized matrix regression exploits the nuclear norm. Given matrix covariates X_i and scalar responses y_i, regularized matrix regression solves
(5)  min_B (1/2) ∑_{i=1}^{N} (y_i − ⟨X_i, B⟩)² + λ‖B‖_*.
Here, ‖·‖_* is the nuclear norm and λ controls the rank of the coefficient B. Although effective for second-order data, this matrix regression model is infeasible for higher-order tensor data.
4 Method
Because Eq. (5) is introduced only for matrix data, a new model is needed for problems with higher-order tensor data. Deriving ideas from Eq. (5), [20] propose a tensor regression method named Remurs, which combines the low-rank constraint (nuclear norm) with the sparsity constraint (ℓ1 norm). Specifically, given M-order tensor samples X_i and corresponding responses y_i, Remurs solves
(6)  min_B (1/2) ∑_{i=1}^{N} (y_i − ⟨X_i, B⟩)² + τ‖B‖_1 + γ‖B‖_*.
Notice that the matrix nuclear norm cannot be directly applied to higher-order tensors. Moreover, as [8] state, the computation of the tensor rank is NP-hard. Therefore, we use the convex tensor nuclear norm regularization [15]

(7)  ‖B‖_* = (1/M) ∑_{k=1}^{M} ‖B_(k)‖_*

as a convex relaxation of the tensor rank. Remurs relies on the ADMM method for estimation, which is computationally expensive for large problems. To address this issue, deriving ideas from EE (Eq. (3)), we propose our Sparse and Low-Rank Higher-Order Tensor Regression (SLTR) model. SLTR aims to solve
(8)  
s.t.  
Here, three tuning parameters control the tightness of the constraints, the nuclear norm takes the form of Eq. (7) with its corresponding dual norm, and the fold operator reshapes a matrix back into a tensor. Notice that, by the definition of the element-wise ℓ1 norm, ‖B‖_1 = ‖B_(k)‖_1 for every mode k. Integrating this with the definition of the tensor nuclear norm, SLTR can be decomposed into
(9)  
After estimating B̂_1, …, B̂_M, the final estimate is obtained by averaging them through

(10)  B̂ = (1/M) ∑_{k=1}^{M} B̂_k.
Remark 4.1
The tensor nuclear norm of Eq. (7) is based on unfolded tensors. Unfolding a tensor loses certain information and requires a large number of computations. Directly applying low-rank constraints to the tensor, rather than to its unfolded versions, could be explored in future work.
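Under the form of Eq. (7), an average of the nuclear norms of the mode-k unfoldings, the quantity can be computed as below; `unfold` is a helper defined here for illustration:

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-k unfolding: mode-k fibers become the columns of the matrix.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def tensor_nuclear_norm(tensor):
    # Average of matrix nuclear norms over all mode-k unfoldings (Eq. (7)).
    M = tensor.ndim
    return sum(np.linalg.norm(unfold(tensor, k), 'nuc') for k in range(M)) / M

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4, 5))
val = tensor_nuclear_norm(W)
```

Each term requires an SVD of one unfolding, which is the source of the computational cost mentioned in the remark above.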
4.1 Two-layer parallel solution
In this section, we propose an algorithm that solves SLTR quickly and accurately. In Eq. (9), although SLTR is decomposed into M subproblems, these subproblems are not independent and thus cannot be solved separately, because they share the same elements of B. To overcome this drawback, we introduce several auxiliary variables into Eq. (9). Then, for each mode k, the corresponding subproblem can be solved separately as
(11)  
Therefore, SLTR can be solved in a parallel manner.
Moreover, Eq. (11) can be further optimized in parallel through the parallel proximal algorithm [4]. Specifically, to simplify the exposition, we abbreviate the notation and convert Eq. (11) for each mode to
(12)  
s.t. 
Here, the objective is split into four functions, and ι_C denotes the indicator function of a set C: ι_C(x) = 0 if x ∈ C, and +∞ otherwise. To solve Eq. (12), we propose an algorithm based on the parallel proximal method, summarized in Algorithm 1. Due to space limitations, the four proximal operators of these functions are deferred to the appendix.
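Two of the proximal operators that appear in such sparse-plus-low-rank splittings are standard and can be sketched directly (the remaining operators, for the loss term and the constraint-set indicator, are problem-specific and omitted here):

```python
import numpy as np

def prox_l1(V, tau):
    # prox of tau * ||.||_1 : element-wise soft-thresholding.
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def prox_nuclear(V, tau):
    # prox of tau * ||.||_* : singular value thresholding (SVT).
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

A = np.diag([3.0, 1.0, 0.2])
A_svt = prox_nuclear(A, 0.5)   # singular values 3, 1, 0.2 -> 2.5, 0.5, 0
```

The SVT step is the SVD whose cost dominates the per-iteration complexity discussed in Section 6.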
5 Theorem
Proposition 5.1
The tensor nuclear norm is a norm function and it is decomposable [18]; that is, for a suitable pair of subspaces, the tensor nuclear norm of the sum of an element from each subspace equals the sum of their tensor nuclear norms.
(C1: sparse) The true coefficient tensor is exactly sparse, with s nonzero elements.
(C2: low-rank) The true coefficient tensor has low orthogonal rank, where the orthogonal rank is the minimal number of mutually orthogonal rank-one terms needed to represent the tensor.
Suppose that the true coefficient tensor satisfies conditions (C1: sparse) and (C2: low-rank), and suppose we solve Eq. (8) with tuning parameters satisfying the constraints. Then the optimal solution satisfies the following error bound:
(13) 
Corollary 5.1
If the coefficient is a three-mode tensor and conditions (C1: sparse) and (C2: low-rank) hold for the true coefficient, then the optimal solution of Eq. (8) satisfies the following error bound:
(14) 
where the rank terms refer to the ranks of the corresponding unfolded tensors.
6 Discussion
6.1 Complexity analysis
We begin with the time complexity. First, each sample needs to be vectorized once. Then the initial estimate is computed in closed form. Next, the subproblems of all modes are solved simultaneously using the parallel proximal method; the time complexity of solving one subproblem in parallel is dominated by the most expensive proximal operator, namely the SVD step. In conclusion, by virtue of our two-layer parallel solution, the total time cost of solving SLTR is dominated by the SVD step of a single subproblem.
As for memory complexity, this is the bottleneck of our approach. Because of the parallel proximal method, many auxiliary variables are needed, and the memory used per mode is quite expensive for large data. We leave improvements of the memory complexity to future work.
6.2 Relevance to previous works
Many methods have been proposed in the literature for regression tasks on higher-order tensor data. In this paper, we focus on the setting in which the variables are represented by tensors while the responses form a vector of scalars. Several models have recently been proposed to estimate the coefficient tensor for this specific problem, which we call the higher-order tensor regression problem.
One group of methods is the direct extension of regularized linear regression. Naively, one way to solve this regression problem is vectorization: all the elements of the tensor are stacked into a vector, to which existing linear regression models are applied. One obvious shortcoming of vectorization is the loss of the latent structural information of the data. To preserve this latent information, Remurs [20] was proposed to estimate a sparse and low-rank coefficient tensor by integrating the tensor nuclear norm and the ℓ1 norm into the optimization problem; the tensor nuclear norm is approximated by the sum of the nuclear norms of several unfolded matrices. In addition, [13] improve Remurs by substituting the tubal nuclear norm [29, 28] for the nuclear norm, which can be computed efficiently through the discrete Fourier transform. However, these methods are computationally expensive because non-differentiable regularizers, the ℓ1 norm and the nuclear norm, appear in their objective functions. Therefore, this group of methods is currently not an ideal choice for higher-order tensor regression.

To preserve the latent structure when dealing with tensors, another prevailing group of methods [7, 31, 5, 23] is based on the CANDECOMP/PARAFAC decomposition. Generally, instead of directly estimating the coefficient tensor, these methods infer every component vector in separate subtasks. For example, [31] propose GLTRM, using a generalized linear model (GLM) to solve each subtask. Moreover, orTRR [5] enforces sparsity and low-rankness on the estimated coefficient tensor; instead of the ℓ1 norm, orTRR utilizes the ℓ2 norm to obtain sparsity. Recently, [7] proposed SURF, exploiting a divide-and-conquer strategy in which each subtask has a formulation similar to the Elastic Net [33]. The authors of SURF empirically show that their method converges, but a statistical convergence rate is not proved; by contrast, in this paper we theoretically prove the error bound of our method. Noticeably, the main limitation of CP-decomposition-based methods is that the decomposition rank must be prespecified, yet we generally have no prior knowledge about the tensor rank in real-world applications. Although orTRR can automatically obtain a low-rank result, its estimate is suboptimal because the ℓ2 norm is inferior to the ℓ1 norm in the sparse setting. Hence, these methods are not well suited to real-world applications.
Some other models address different problem settings. Recently, [6, 9, 26] proposed models for nonparametric estimation, assuming a nonlinear relationship between the response and the tensor and making use of either additive models or Gaussian processes. Apart from these, many models [21, 19, 27, 32, 14] estimate the relationship between a variable tensor and a response tensor. Another line of statistical models involving tensor data is tensor decomposition [11, 22, 1, 12, 16]. Tensor decomposition can be considered an unsupervised problem that aims at approximating a tensor with lower-order data; in contrast, our SLTR is a supervised method estimating the latent relationship between variables and responses. Because these methods have objectives different from ours, we exclude them from the experiments. In Section 7, we compare SLTR with several baseline methods: Lasso, Elastic Net, Remurs, GLTRM, and SURF.
7 Experiment
7.1 Experiment Setups
Baselines and metrics:
To compare with our method, we use four groups of baselines: 1) linear regression, specifically Lasso and Elastic Net (with the trade-off ratio between the ℓ1 and ℓ2 norms set to 0.5); 2) Remurs; 3) SURF; and 4) GLTRM. Furthermore, we use three metrics to evaluate the performance of our method and the baselines: 1) time cost, 2) coefficient error between the estimated and true coefficient tensors, and 3) mean squared error (MSE).
Datasets:
We compare SLTR with the baselines on simulated datasets and a real-world fMRI dataset. Our simulated datasets are generated through several steps. First, we generate X_i and B with each element drawn from the standard normal distribution. Then, we randomly set a fraction of the elements of B to zero. Finally, we compute the responses y_i = ⟨X_i, B⟩ + σ ε_i, where σ controls the noise ratio and ε_i is also drawn from the standard normal distribution. In addition to the simulated datasets, we further evaluate SLTR on the CMU2008 fMRI dataset [17].
Other setups:
In the experiments, the parameters of all methods are selected through a 5-fold cross-validation procedure that takes the average MSE on the validation sets as the selection criterion. Detailed descriptions of the ranges of the tuning parameters are given in the appendix. As for the experimental environment, all experiments are run on a Linux server with 2 Intel Xeon Silver 4216 CPUs and 256 GB RAM, and our SLTR is implemented in MATLAB. For fairness, we set the same maximal number of iterations for all methods and let them terminate when the change between successive iterates falls below a fixed tolerance. We run every experiment 20 times and report the metrics averaged over these 20 trials.
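The simulated-data generation described above can be sketched as follows (the shape, sample count, sparsity level, and noise scale are placeholder values, since the experiments vary them):

```python
import numpy as np

rng = np.random.default_rng(0)

N, shape = 160, (10, 10, 5)   # placeholder sample count and tensor shape
sparsity, sigma = 0.7, 0.1    # placeholder sparsity level and noise scale

# Coefficient tensor with standard-normal entries, then sparsified.
B = rng.normal(size=shape)
B[rng.random(shape) < sparsity] = 0.0

# Covariates, noise, and responses y_i = <X_i, B> + sigma * eps_i.
X = rng.normal(size=(N,) + shape)
eps = rng.normal(size=N)
y = np.tensordot(X, B, axes=B.ndim) + sigma * eps
```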
7.2 Experiments on simulated data
First, we evaluate SLTR on both 2D and 3D simulated datasets. The shape of the data varies from 15×15 to 30×30 for 2D data and from 10×10×5 to 40×40×5 for 3D data. We fix the sparsity level and the noise coefficient. The numbers of samples are set to 50% and 8% of the number of variables for 2D and 3D data, respectively. The time costs of all methods are shown in Figure (1). Not only does the sequential version of SLTR have a lower time cost, but the parallel implementation of SLTR achieves even larger improvements over the baselines. Notice that the code of SURF is inapplicable to 2D data and GLTRM is infeasible for 3D data, so we omit these two methods in the corresponding subfigures. Ideally, given a better environment, the speedup of SLTR could be even larger. Next, we report the MSE of all methods on these simulated datasets in Table (1). The bold number denotes the best MSE value, and the underlined value represents the second-best result. The results indicate that SLTR obtains the best solution in most cases, and in the remaining cases SLTR's estimate differs only slightly from the best one. Because of the infeasibility of SURF and GLTRM, we also omit these two methods in the table under the corresponding conditions. The results in Figure (1) and Table (1) indicate that SLTR spends less time obtaining a solution that is no worse than those of the other methods.


Table 1: MSE on simulated datasets.

size      | SLTR   | Remurs | SURF       | GLTRM      | Lasso  | Elastic Net
2D Data – 50% samples
15×15     | 0.1419 | 0.1949 | Infeasible | 1.1620     | 0.1893 | 0.2015
20×20     | 0.1105 | 0.1567 | Infeasible | 1.3967     | 0.1152 | 0.1157
25×25     | 0.1364 | 0.2992 | Infeasible | 1.7319     | 0.1464 | 0.1460
30×30     | 0.1094 | 0.1323 | Infeasible | 2.2655     | 0.1414 | 0.1418
3D Data – 8% samples
10×10×5   | 0.8532 | 0.8538 | 0.8472     | Infeasible | 1.7993 | 1.7990
15×15×5   | 0.9892 | 0.9867 | 0.9994     | Infeasible | 2.3151 | 2.3132
20×20×5   | 0.9383 | 0.9378 | 1.0049     | Infeasible | 2.5469 | 2.5193
25×25×5   | 0.9275 | 0.9398 | 0.9391     | Infeasible | 2.0149 | 2.0149
30×30×5   | 0.9186 | 0.9190 | 0.9289     | Infeasible | 1.9381 | 1.9377
35×35×5   | 0.9336 | 0.9370 | 0.9527     | Infeasible | 2.0147 | 2.0147
40×40×5   | 0.9073 | 0.9072 | 1.0006     | Infeasible | 2.1059 | 2.1065

In addition, we apply SLTR in high-dimensional settings. We fix the shape of the data and vary the number of samples N from 50 to 400 in increments of 50; the sparsity level and the noise factor are fixed. The MSE values shown in Table (2) indicate that SLTR obtains the best solution under almost all conditions. Note that for N = 150, even though the MSE of SLTR is slightly worse than the MSE of Remurs (0.9190), it is still much better than those of the other methods.


Table 2: MSE in high-dimensional settings with varying N.

N    | SLTR   | Remurs | SURF   | GLTRM      | Lasso  | Elastic Net
50   | 1.6123 | 1.6198 | 1.6439 | Infeasible | 4.5083 | 4.5759
100  | 1.0798 | 1.0946 | 1.7101 | Infeasible | 1.6433 | 1.6433
150  | 0.9295 | 0.9190 | 0.9953 | Infeasible | 1.6777 | 1.6502
200  | 0.8351 | 0.8469 | 0.8376 | Infeasible | 1.9072 | 0.8376
250  | 0.7130 | 0.7267 | 0.7199 | Infeasible | 1.3757 | 1.3708
300  | 0.7282 | 0.7325 | 0.7524 | Infeasible | 1.7316 | 1.6938
350  | 0.6275 | 0.6275 | 0.6379 | Infeasible | 1.3207 | 1.2804
400  | 0.5954 | 0.5969 | 0.5975 | Infeasible | 1.1487 | 1.1450

Finally, to test the stability of SLTR and its sensitivity to the sparsity level, we generate simulated datasets with varying sparsity levels. The shape of each sample is fixed, and a total of 160 samples are generated; a different dataset is generated for each trial. The averaged time cost and the variance over 20 trials are reported in Figure (2). Under every setting, SLTR outperforms the baselines. Moreover, the time-cost variance of SLTR is much smaller than that of SURF and Remurs. Although the LR methods also have small time variance, they may obtain worse solutions than SLTR (see Table (1)). Therefore, SLTR is faster than the other methods on datasets with different sparsity levels, and it is more stable.

7.3 Experiments on real-world data
In this section, we perform fMRI classification tasks on the CMU2008 dataset [17], which contains 9 projects in total. Each sample in this dataset is a 3-mode tensor with 71,553 voxels. The classification task aims to predict human activities associated with recognizing the meanings of nouns. Following [10, 20], we focus on binary classification between the classes "tools" and "animals". Here, the class "tool" combines observations from "tool" and "furniture", and the class "animal" combines observations from "animal" and "insect" in the CMU2008 dataset. As in the simulated experiments, the tuning parameters of each method are selected through 5-fold cross-validation. For each subject, we split the dataset into training and testing sets and use AUC to evaluate the classification results, which are shown in Table 3. Notice that although SURF has the lowest time cost, the AUC of its solution is always around 0.5 and sometimes drops below 0.5; hence, we consider the classification results of SURF unacceptable. Among the remaining methods, SLTR has the lowest time cost in all cases. One interesting result occurs on project #1, where the linear regression methods obtain a much better solution; the reason might be that for this subject the voxels are independent, so the data has no latent structure. Overall, on a real-world fMRI dataset, SLTR obtains an acceptable solution with the least time cost.


Project  SLTR  Remurs  Lasso  Elastic Net  SURF  
seq. time  par. time  AUC  time  AUC  time  AUC  time  AUC  time  AUC  
#1  0.9785  0.2135  0.75  23.9151  0.75  2.2105  0.9143  2.2334  0.9277  0.166  0.5042 
#2  0.9405  0.2103  0.5785  28.229  0.6  2.1347  0.7164  2.1105  0.6974  0.0967  0.4571 
#3  1.1246  0.2466  0.7929  21.716  0.7571  3.0322  0.7177  3.020  0.728  0.0809  0.5286 
#4  0.9276  0.2089  0.8286  33.0095  0.75  2.0809  0.7237  2.0809  0.7376  0.0278  0.3429 
#5  0.9385  0.2104  0.6527  26.8979  0.6389  2.1995  0.7411  2.2036  0.7589  0.1826  0.7125 
#6  0.9007  0.2076  0.7704  10.0312  0.5972  2.3102  0.6554  2.365  0.6689  0.1707  0.5486 
#7  0.9405  0.2096  0.6250  19.3765  0.5037  2.6606  0.5851  2.604  0.5929  0.1364  0.4444 
#8  0.8923  0.2086  0.6944  16.7164  0.5486  2.2362  0.6741  2.2854  0.7165  0.0849  0.4028 
#9  0.9036  0.2157  0.6142  24.1116  0.5781  2.1138  0.5522  2.1014  0.5489  0.1742  0.5857 

References

[1] Genevera Allen, 'Sparse higher-order principal components analysis', in Artificial Intelligence and Statistics, pp. 27–36, (2012).
[2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al., 'Distributed optimization and statistical learning via the alternating direction method of multipliers', Foundations and Trends in Machine Learning, 3(1), 1–122, (2011).
[3] Emmanuel Candes, Terence Tao, et al., 'The Dantzig selector: Statistical estimation when p is much larger than n', The Annals of Statistics, 35(6), 2313–2351, (2007).
[4] Patrick L. Combettes and Jean-Christophe Pesquet, 'Proximal splitting methods in signal processing', in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, 185–212, Springer, (2011).
[5] Weiwei Guo, Irene Kotsia, and Ioannis Patras, 'Tensor learning for regression', IEEE Transactions on Image Processing, 21(2), 816–827, (2011).
[6] Botao Hao, Boxiang Wang, Pengyuan Wang, Jingfei Zhang, Jian Yang, and Will Wei Sun, 'Sparse tensor additive regression', arXiv preprint arXiv:1904.00479, (2019).
[7] Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang, 'Boosted sparse and low-rank tensor regression', in Advances in Neural Information Processing Systems, pp. 1009–1018, (2018).
[8] Christopher J. Hillar and Lek-Heng Lim, 'Most tensor problems are NP-hard', Journal of the ACM, 60(6), 45, (2013).
[9] Masaaki Imaizumi and Kohei Hayashi, 'Doubly decomposing nonparametric tensor regression', in International Conference on Machine Learning, pp. 727–736, (2016).
[10] Kittipat Kampa, S. Mehta, Chun-An Chou, Wanpracha Art Chaovalitwongse, and Thomas J. Grabowski, 'Sparse optimization in feature selection: application in neuroimaging', Journal of Global Optimization, 59(2-3), 439–457, (2014).
[11] Tamara G. Kolda and Brett W. Bader, 'Tensor decompositions and applications', SIAM Review, 51(3), 455–500, (2009).
[12] Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc, 'Model-driven sparse CP decomposition for higher-order tensors', in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1048–1057, IEEE, (2017).
[13] Wenwen Li, Jian Lou, Shuo Zhou, and Haiping Lu, 'STURM: Sparse tubal-regularized multilinear regression for fMRI', in International Workshop on Machine Learning in Medical Imaging, pp. 256–264, Springer, (2019).
[14] Yimei Li, Hongtu Zhu, Dinggang Shen, Weili Lin, John H. Gilmore, and Joseph G. Ibrahim, 'Multiscale adaptive regression models for neuroimaging data', Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 559–578, (2011).
[15] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye, 'Tensor completion for estimating missing values in visual data', IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 208–220, (2012).
[16] Oscar Hernan Madrid-Padilla and James Scott, 'Tensor decomposition with generalized lasso penalties', Journal of Computational and Graphical Statistics, 26(3), 537–546, (2017).
[17] Tom M. Mitchell, Svetlana V. Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L. Malave, Robert A. Mason, and Marcel Adam Just, 'Predicting human brain activity associated with the meanings of nouns', Science, 320(5880), 1191–1195, (2008).
[18] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, Bin Yu, et al., 'A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers', Statistical Science, 27(4), 538–557, (2012).
[19] Garvesh Raskutti, Ming Yuan, Han Chen, et al., 'Convex regularization for high-dimensional multiresponse tensor regression', The Annals of Statistics, 47(3), 1554–1584, (2019).
[20] Xiaonan Song and Haiping Lu, 'Multilinear regression for embedded feature selection with application to fMRI analysis', in Thirty-First AAAI Conference on Artificial Intelligence, (2017).
[21] Will Wei Sun and Lexin Li, 'STORE: Sparse tensor response regression and neuroimaging analysis', The Journal of Machine Learning Research, 18(1), 4908–4944, (2017).
[22] Will Wei Sun, Junwei Lu, Han Liu, and Guang Cheng, 'Provable sparse tensor decomposition', Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 899–916, (2017).
[23] Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang, 'Logistic tensor regression for classification', in International Conference on Intelligent Science and Intelligent Data Engineering, pp. 573–581, Springer, (2012).
[24] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar, 'Elementary estimators for high-dimensional linear regression', in International Conference on Machine Learning, pp. 388–396, (2014).
[25] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar, 'Elementary estimators for sparse covariance matrices and other structured moments', in International Conference on Machine Learning, pp. 397–405, (2014).
[26] Rose Yu, Guangyu Li, and Yan Liu, 'Tensor regression meets Gaussian processes', arXiv preprint arXiv:1710.11345, (2017).
[27] Rose Yu and Yan Liu, 'Learning from multiway data: Simple and efficient tensor regression', in International Conference on Machine Learning, pp. 373–381, (2016).
[28] Zemin Zhang and Shuchin Aeron, 'Exact tensor completion using t-SVD', IEEE Transactions on Signal Processing, 65(6), 1511–1526, (2016).
[29] Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, and Misha Kilmer, 'Novel methods for multilinear data completion and de-noising based on tensor-SVD', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3842–3849, (2014).
[30] Hua Zhou and Lexin Li, 'Regularized matrix regression', Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2), 463–483, (2014).
[31] Hua Zhou, Lexin Li, and Hongtu Zhu, 'Tensor regression with applications in neuroimaging data analysis', Journal of the American Statistical Association, 108(502), 540–552, (2013).
[32] Hongtu Zhu, Yasheng Chen, Joseph G. Ibrahim, Yimei Li, Colin Hall, and Weili Lin, 'Intrinsic regression models for positive-definite matrices with applications to diffusion tensor imaging', Journal of the American Statistical Association, 104(487), 1203–1212, (2009).
[33] Hui Zou and Trevor Hastie, 'Regularization and variable selection via the elastic net', Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320, (2005).