1 Introduction
Higher-order data, also called tensors, frequently occur in various scientific and real-world applications. In neuroscience, for example, functional magnetic resonance imaging (fMRI) produces tensor data consisting of a series of brain scan images, which can be characterized as 3D (3-mode) data with the shape time × neuron × neuron. In many fields, we encounter the problem of analyzing the relationship between a tensor variable and a scalar response for every sample. Specifically, we assume
(1)  y_i = \langle \mathcal{X}_i, \mathcal{W} \rangle + \epsilon_i,
in which \langle \cdot, \cdot \rangle is the inner product operator, \epsilon_i is the noise, and \mathcal{W} is the coefficient that needs to be estimated through regression. Notice that in the real world, such tensor data generally have two properties that make the coefficient difficult to infer: (1) Ultrahigh-dimensional setting, where the number of samples is much smaller than the number of variables. For example, each sample of the CMU2008 dataset [16] is a 3D tensor with 71,553 voxels in total, yet only 360 trials are recorded. The high-dimensional setting makes the estimation break down because we are trying to infer a large model from a limited number of observations. (2) Higher-order structure of data. Higher-order structure exists in many fields, such as fMRI and videos, whose samples have the shape time × pixel × pixel. Traditional machine learning methods are designed for processing vectors or matrices; hence, dealing with higher-order data is difficult. In past years, many methods have been introduced to address these two problems.
Under high-dimensional settings, several well-known models making use of variable selection have been proposed, such as Lasso [23] and the Dantzig selector [2]. Because they cannot deal with data other than vectors, one naive way to use them on tensor data is vectorization: all the elements of the tensor are stacked into a vector so that existing linear regression can work. Intuitively, however, the latent structural information is lost in this manner. Therefore, some methods aim at directly handling the tensor. For example,
[18] propose Remurs, which exploits the commonly used \ell_1 norm for enforcing sparsity on the estimated coefficient tensor; in addition, a nuclear norm is attached to make the solution low-rank. The main shortcoming of Remurs is that the tensor nuclear norm is approximated by the nuclear norms of its unfolding matrices. Because the tensor must be unfolded into matrices, its structure is still destroyed. Therefore, although this kind of method can obtain an acceptable solution in high-dimensional settings, the higher-order structure is lost.
To preserve the spatial structure, several methods have been introduced based on the CANDECOMP/PARAFAC (CP) decomposition, which approximates an M-order tensor through
(2)  \mathcal{X} \approx \sum_{r=1}^{R} x_r^{(1)} \circ x_r^{(2)} \circ \cdots \circ x_r^{(M)}.
Here, R is defined as the CP-rank of the tensor \mathcal{X}. For instance, [29] propose GLTRM, which first decomposes the variable tensor and then applies the generalized linear model to estimate each component vector. In addition, [7] propose SURF, which uses a divide-and-conquer strategy for each component vector. Almost all CP-decomposition-based methods, including GLTRM, suffer from the problem that the CP-rank R must be prespecified; however, we usually have no prior knowledge about the value of R. Even if techniques such as cross-validation can be used to estimate R from the data, the solving procedure becomes tedious and computationally expensive for large-scale data. A method called orTRR was previously proposed in [4] to automatically obtain a low-rank coefficient without prespecifying R, but orTRR uses the \ell_2 norm rather than the \ell_1 norm to recover the sparsity of data, which makes its performance on variable selection poorer than others. To the best of our knowledge, no scalable estimator has previously been proposed that enforces both sparsity and low-rankness on the solution in high-dimensional settings.
In this paper, we draw on the ideas of a scalable estimator, the Elementary Estimator [24], and propose Fast Sparse Higher-Order Tensor Regression (FasTR), which estimates a unit-rank coefficient tensor. First, the problem is decomposed into several subproblems; then, for each subproblem, i.e., each component vector, a closed-form solution can be obtained efficiently. Because the computation of the closed-form solution can be sped up through multithreading or GPUs, the solution of FasTR can be obtained with small time complexity; see details in Section 6. To summarize, this paper has the following novelties:

A sparse tensor regression model and its fast and scalable solution: In Section 4, we propose a regression model for tensor data, using the \ell_1 norm to obtain sparsity. Moreover, we provide a scalable solution for the model, which is obtained iteratively, while at each iteration the temporary estimation has a closed form.

State-of-the-art error bound for the tensor regression model: We theoretically prove that our sparse estimator has a state-of-the-art error bound of order \sqrt{s_m \log p_m / N}, where s_m denotes the number of nonzero elements of the m-th decomposed component of the coefficient tensor and the constant factor is characterized by the data. Details are shown in Section 5.

Experiments on a real-world fMRI dataset: In Section 7, we compare our FasTR with four baselines on several simulated datasets and one fMRI dataset with nine projects. Experimental results empirically show that FasTR obtains better estimations with less time cost.
2 Notations
\|\cdot\|_1 denotes the elementwise \ell_1 norm, \|\cdot\|_* denotes the nuclear norm, \|\cdot\|_2 denotes the \ell_2 norm, and \|\cdot\|_{op} denotes the spectral norm. Throughout this paper, a higher-order tensor is denoted by a calligraphic letter (e.g., \mathcal{X}) and a vector is denoted by a lowercase letter (e.g., x). Scalars are also denoted by lowercase letters but stated clearly within the context to avoid confusion.
3 Background
3.1 Elementary estimator for linear regression models (EE-Ridge)
For vector (first-order) data, [24] propose a state-of-the-art method called EE-Ridge to solve high-dimensional linear regression problems. Given the sample matrix X \in \mathbb{R}^{n \times p} and the response vector y \in \mathbb{R}^{n}, EE-Ridge has the following formulation:
(3)  \min_{\theta} \mathcal{R}(\theta)
s.t.  \mathcal{R}^*\big(\theta - (X^\top X / n + \varepsilon I)^{-1} X^\top y / n\big) \le \lambda_n.
Here, \mathcal{R}(\cdot) is an arbitrary norm function and \mathcal{R}^*(\cdot) is its dual norm. The two hyperparameters, \varepsilon and \lambda_n, respectively handle the non-invertibility of the covariance matrix and control the level of sparsity. Although EE-Ridge shares certain similarities with the Dantzig selector [2], EE-Ridge has outstanding computational performance. For instance, when the \ell_1 norm is selected as \mathcal{R}(\cdot), a closed-form solution to Eq. (3) can be obtained through \hat{\theta} = S_{\lambda_n}[(X^\top X / n + \varepsilon I)^{-1} X^\top y / n], in which S_{\lambda}(\cdot) denotes the elementwise soft-thresholding operator. Noticeably, the calculation of this solution is dominated by computing the matrix inversion, which generally requires O(p^3) time.^{1} This is a significant improvement over previous variable selectors, such as Lasso and the Dantzig selector, whose iterative solvers are considerably more expensive. Furthermore, the computation of the EE-Ridge solution can easily be sped up by virtue of multiple threads or GPUs. ^{1}Some other methods can compute the matrix inversion with lower time complexity; for instance, the Strassen algorithm proposed in [19] has a time complexity of O(p^{2.807}).
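The closed form above can be sketched in a few lines: a ridge-style initial estimate followed by elementwise soft-thresholding. This is a minimal illustration, not the authors' implementation; the hyperparameter values are purely illustrative.

```python
import numpy as np

def ee_ridge(X, y, eps=0.1, lam=0.05):
    """Sketch of the EE-Ridge closed form with the l1 norm:
    soft-threshold the ridge-regularized initial estimate."""
    n, p = X.shape
    # Initial estimate: (X^T X / n + eps * I)^{-1} X^T y / n
    theta_init = np.linalg.solve(X.T @ X / n + eps * np.eye(p), X.T @ y / n)
    # Elementwise soft-thresholding S_lam(.) enforces sparsity
    return np.sign(theta_init) * np.maximum(np.abs(theta_init) - lam, 0.0)
```

Since the dominant cost is one linear solve (no iterative optimization), the estimate is obtained in a single shot, which is the source of the method's speed.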
3.2 Higher-order tensor regression model
Given M-order predictors \mathcal{X}_i and scalar responses y_i, i = 1, \ldots, N, higher-order tensor regression models consider that the responses are generated from a linear formulation y_i = \langle \mathcal{X}_i, \mathcal{W} \rangle + \epsilon_i. Here \mathcal{W} is an M-order coefficient tensor and \epsilon_i is an error term. Following the idea of linear regression (LR), the coefficient tensor is estimated through
(4)  \min_{\mathcal{W}} \frac{1}{2N} \sum_{i=1}^{N} \big(y_i - \langle \mathcal{X}_i, \mathcal{W} \rangle\big)^2 + \lambda \mathcal{R}(\mathcal{W}),
where \mathcal{R}(\cdot) is a norm function enforcing certain properties on the coefficient tensor. Eq. (4) is akin to the formulation of LR; however, existing LR methods cannot be directly applied to it. A naive adaptation is vectorization: by first stacking the elements of a tensor into a vector, LR can be utilized. However, vectorization destroys the structural information of the data, which makes it inapplicable in real-world applications.
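To make the vectorization baseline concrete, the sketch below (illustrative names, not from the paper) flattens each tensor sample so an ordinary linear model applies; the key identity is that the tensor inner product equals the dot product of the flattened tensors.

```python
import numpy as np

def vectorize_samples(tensors):
    """Stack each tensor sample into a row of a design matrix,
    discarding the mode structure in the process."""
    return np.stack([t.ravel() for t in tensors])

# The tensor inner product <X, W> equals vec(X) . vec(W)
X = np.arange(24.0).reshape(2, 3, 4)   # one 3-mode sample
W = np.ones((2, 3, 4))
assert np.isclose(np.sum(X * W), X.ravel() @ W.ravel())
```

The resulting design matrix has one column per tensor entry, which is exactly why the vectorized problem becomes ultrahigh-dimensional.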
3.3 Unit-rank tensor CANDECOMP/PARAFAC decomposition
To preserve latent structural information and decrease the ultrahigh dimensionality when dealing with a higher-order tensor \mathcal{X}, the CANDECOMP/PARAFAC decomposition was proposed in [8, 3, 6] to decompose the tensor into the outer products of several vectors. Specifically, a tensor \mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_M} can be decomposed into the outer products of component vectors
(5)  \mathcal{X} = \sum_{r=1}^{R} x_r^{(1)} \circ x_r^{(2)} \circ \cdots \circ x_r^{(M)},
in which each x_r^{(m)} \in \mathbb{R}^{p_m}. In this way, the number of variables is largely reduced from \prod_{m=1}^{M} p_m to R \sum_{m=1}^{M} p_m. Intuitively, the CANDECOMP/PARAFAC decomposition preserves more latent information than simple vectorization.
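A reconstruction of Eq. (5) can be sketched as repeated outer products, one rank-1 term per list of M component vectors (a small illustration under our reading of the decomposition, not library code):

```python
import numpy as np

def cp_reconstruct(factors_per_rank):
    """Sum over r of the outer product of the M component vectors
    x_r^(1) o x_r^(2) o ... o x_r^(M), as in Eq. (5)."""
    out = None
    for vecs in factors_per_rank:       # one list of M vectors per rank-1 term
        term = vecs[0]
        for v in vecs[1:]:
            term = np.multiply.outer(term, v)   # grow one mode at a time
        out = term if out is None else out + term
    return out
```

For a rank-1 (unit-rank) 3-mode tensor of shape p1 × p2 × p3, only p1 + p2 + p3 numbers are stored instead of p1·p2·p3, which is the dimensionality reduction the text describes.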
4 Method
Substituting the M-order coefficient tensor \mathcal{W} with its CP decomposition, the coefficient tensor can be obtained by estimating all the component vectors w^{(m)}. As in linear regression, to infer a certain w^{(m)}, each sample \mathcal{X}_i needs to be “projected” onto the m-th space. Then, intuitively, we aim to let each w^{(m)} fit the projected samples corresponding to its space. In addition, instead of the coefficient tensor \mathcal{W}, we impose sparsity constraints on each component w^{(m)}. This leads to a more flexible and efficient model because fewer variables need to be dealt with in the high-dimensional setting. Therefore, letting \tilde{X}_m denote the matrix whose i-th row is \mathcal{X}_i projected onto the m-th space, our objective is solving
(6)  \min_{w^{(m)}} \|w^{(m)}\|_1  s.t.  \big\|w^{(m)} - (\tilde{X}_m^\top \tilde{X}_m / N + \varepsilon I)^{-1} \tilde{X}_m^\top y / N\big\|_\infty \le \lambda_m.
Here, \lambda_m is a tuning parameter controlling the degree of sparsity and I is an identity matrix. The parameter \varepsilon aims to make the matrix \tilde{X}_m^\top \tilde{X}_m / N + \varepsilon I invertible, which handles the crucial problem of high-dimensional learning. Then, fortunately, based on EE-Ridge, Eq. (6) has the closed-form solution
(7)  \hat{w}^{(m)} = S_{\lambda_m}\big[(\tilde{X}_m^\top \tilde{X}_m / N + \varepsilon I)^{-1} \tilde{X}_m^\top y / N\big].
With this, as long as \tilde{X}_m is easy to compute, we can solve Eq. (6) directly.
4.1 Proposed: Fast Sparse Higher-Order Tensor Regression (FasTR)
In our method, we use a simple and intuitive formulation of the projection function:
(8)  \tilde{x}_i^{(m)} = \mathcal{X}_i \times_1 w^{(1)} \cdots \times_{m-1} w^{(m-1)} \times_{m+1} w^{(m+1)} \cdots \times_M w^{(M)},
i.e., the mode-k vector products of \mathcal{X}_i with every component vector except that of mode m.
Then, substituting Eq. (8) into Eq. (6), our FasTR aims to solve
(9)  \min_{w^{(m)}} \|w^{(m)}\|_1  s.t.  \big\|w^{(m)} - (\tilde{X}_m^\top \tilde{X}_m / N + \varepsilon I)^{-1} \tilde{X}_m^\top y / N\big\|_\infty \le \lambda_m,  with the i-th row of \tilde{X}_m given by Eq. (8),
for m = 1, \ldots, M. Notice that since the computation of \tilde{X}_m is dominated by a large number of multiplications, obtaining the solution in Eq. (7) can easily be accelerated by multiple CPUs or GPUs.
Moreover, we propose a fast algorithm to solve Eq. (9) in a component-wise manner. When estimating a certain w^{(m)}, we fix the other component vectors as constants. At each iteration, we first compute \tilde{X}_m and then obtain the estimation through Eq. (7). Specifically, let \hat{w}^{(m)}_t denote the estimation of the m-th mode component vector at the t-th iteration.
The algorithm is summarized in Algorithm 1.
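The component-wise scheme above can be sketched as follows. This is our reading of the alternating procedure, not the authors' code: fix all other components, project every sample onto mode m, then apply the EE-Ridge closed form of Eq. (7) to update w^(m). Hyperparameter values and the random initialization are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def project_mode(sample, comps, m):
    """Contract the sample tensor with every component vector except
    that of mode m (Eq. (8)); contracting trailing modes first keeps
    the earlier axis indices stable."""
    out = sample
    for k in reversed(range(len(comps))):
        if k != m:
            out = np.tensordot(out, comps[k], axes=(k, 0))
    return out  # vector of length p_m

def fastr(samples, y, shape, eps=0.1, lam=0.05, iters=10, seed=0):
    """Sketch of the alternating component-wise estimator: each mode
    update is the closed form of Eq. (7) on the projected design."""
    rng = np.random.default_rng(seed)
    comps = [rng.normal(size=p) for p in shape]
    n = len(samples)
    for _ in range(iters):
        for m in range(len(shape)):
            # Rows of the projected design matrix for mode m
            Xt = np.stack([project_mode(s, comps, m) for s in samples])
            init = np.linalg.solve(Xt.T @ Xt / n + eps * np.eye(shape[m]),
                                   Xt.T @ y / n)
            comps[m] = soft_threshold(init, lam)
    return comps
```

Each mode update is one small linear solve of size p_m, which is why the per-component cost stays low even when the full tensor has millions of entries.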
5 Theorem
We now provide a statistical analysis of the component estimator (Eq. (9)). We follow the idea of [24] and make the following assumptions:
(C-Sparse) The coefficient component w^{(m)*} is exactly sparse with s_m nonzero elements.
(C-Ridge) Let u_1, \ldots, u_r be the singular vectors of \tilde{X}_m corresponding to its singular values, where r is the rank of \tilde{X}_m; the condition relates these quantities to the ridge parameter \varepsilon with some sequence, following [24].
Theorem. Suppose assumptions (C-Sparse) and (C-Ridge) are satisfied. Then there exist positive constants c_1, c_2, c_3 such that the estimated solution \hat{w}^{(m)} of Eq. (9) satisfies
(10)  \|\hat{w}^{(m)} - w^{(m)*}\|_2 \le c_3 \sqrt{\frac{s_m \log p_m}{N}}
with probability at least 1 - c_1 \exp(-c_2 \log p_m). Here, we suppose that \lambda_m is selected proportionally to \sqrt{\log p_m / N}.
6 Discussion
6.1 Complexity analysis
When estimating w^{(m)} for each mode, the time complexity is dominated by computing the projection \tilde{X}_m, which costs O(N \prod_{k=1}^{M} p_k). Once the projection is obtained, \hat{w}^{(m)} is calculated through Eq. (7) in O(p_m^3 + N p_m^2) time. As the subtasks of estimating w^{(m)} for each mode are independent of each other, they can be optimized in parallel. Furthermore, since the time cost of computing Eq. (7) can easily be reduced by making use of multiple threads or GPUs, this part of the time becomes negligible. Therefore, integrating all the ingredients, the time complexity of our method is dominated by the projection step.
Apart from its computational efficiency, FasTR also requires little memory. Our method needs two parts of memory: one for storing the components w^{(m)} and another for the computations on \tilde{X}_m. Because, among all the computations involving the projection, storing \tilde{X}_m needs the most memory, the second part is dominated by holding the N projected samples for the current mode. Therefore, the total memory footprint of FasTR is determined by these two parts.
6.2 Relevance to previous works
Many methods have been proposed in the literature for regression tasks on higher-order tensor data. In this paper, we focus on the setting where the variables are represented by a tensor while the responses are scalars collected in a vector. Several models have recently been proposed to estimate the coefficient tensor for this specific, what we call, higher-order tensor regression problem.
One group of these methods is the direct extension of regularized linear regression. Naively, one way to solve this regression problem is vectorization: all the elements of the tensor are first stacked into a vector, and then existing linear regression models can be applied. One obvious shortcoming of vectorization is the loss of the latent structural information of the data. To preserve part of this information, Remurs [18] was proposed to estimate a sparse and low-rank coefficient tensor by integrating the tensor nuclear norm and the \ell_1 norm into the optimization problem. Notice that in Remurs, the tensor nuclear norm is approximated by the summation of the ranks of several unfolded matrices; Remurs thus still discards some structural information when unfolding the tensor into matrices, although it generally outperforms vectorization. In addition, [13] improve Remurs by replacing the nuclear norm with the tubal nuclear norm [28, 27], which can be solved efficiently through the discrete Fourier transform. However, tensor unfolding is still required. Furthermore, these methods are also computationally expensive because non-differentiable regularizers, the \ell_1 norm and the nuclear norm, exist in their objective functions. Therefore, currently, this group of methods is not a good choice for higher-order tensor regression.
To preserve the latent structure when dealing with tensors, another prevailing group of methods [7, 29, 4, 22] is based on the CANDECOMP/PARAFAC decomposition. Generally, instead of directly estimating the coefficient tensor \mathcal{W}, these methods infer every component vector in a separate subtask. For example, [29] propose GLTRM, which uses the generalized linear model (GLM) to solve each subtask. Moreover, orTRR was proposed in [4] to enforce sparsity and low-rankness on the estimated coefficient tensor; instead of the \ell_1 norm, orTRR utilizes the \ell_2 norm to obtain sparsity. In addition, [7] recently propose SURF, which exploits a divide-and-conquer strategy where each subtask has a formulation similar to the Elastic Net [31]. In the SURF paper, the authors empirically show that their method converges, but no statistical convergence rate is proved; in contrast, in this paper we theoretically prove the error bound of our method. Noticeably, the main limitation of CANDECOMP/PARAFAC-decomposition-based methods is that the decomposition rank must be prespecified; however, we generally have no prior knowledge about the tensor rank in real-world applications. Although orTRR can automatically obtain a low-rank result, its estimate is suboptimal because the \ell_2 norm is inferior to the \ell_1 norm in the sparse setting. Hence, these methods are not ideal for real-world applications.
Some other models were introduced for different problem settings. Recently, [5, 9, 25] proposed models for nonparametric estimation, making use of either additive models or Gaussian processes. Apart from the above-mentioned ones, many models [20, 17, 26, 30, 14] have been put forward to estimate the relationship between a variable tensor and a response tensor. Another line of statistical models involving tensor data is tensor decomposition [11, 21, 1, 12, 15]. Tensor decomposition can be considered an unsupervised problem that aims at approximating a tensor with lower-order data; on the contrary, our FasTR is a supervised method estimating the latent relationship between variables and responses. Because these methods have objectives different from ours, we exclude them from the experiments. In Section 7, we compare FasTR with several of the introduced higher-order tensor regression methods, including Lasso, Elastic Net, Remurs, GLTRM, and SURF.
7 Experiment
7.1 Experiment Setups
We experiment on several simulated datasets and a real-world dataset with nine projects to compare the performance of our method with previous ones. We use four groups of previous methods as baselines: 1) linear regression (Lasso and Elastic Net), 2) Remurs, 3) SURF, and 4) GLTRM.
We use three metrics to evaluate the performance of our method and the baselines: 1) time cost, 2) mean squared error (MSE), and 3) coefficient error (CE). Here, MSE measures the prediction error on the responses and CE measures the error of the estimated coefficient tensor against the ground truth.
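For concreteness, the two error metrics can be sketched as below. MSE is the standard mean squared prediction error; the exact formula for CE was not recoverable from the text, so the relative Frobenius-norm error shown here is our assumption, not necessarily the paper's definition.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between true and predicted responses."""
    return np.mean((y_true - y_pred) ** 2)

def coefficient_error(W_hat, W_true):
    """Hypothetical form of CE: relative Frobenius-norm error of the
    estimated coefficient tensor (an assumption, not the paper's formula)."""
    return np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true)
```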
Furthermore, simulated data are generated through the following three steps:

Step 1: For m = 1 to M, generate w^{(m)} \in \mathbb{R}^{p_m} with each element drawn from the Gaussian distribution N(0, 1); given the sparsity degree s, randomly set \lfloor s \cdot p_m \rfloor elements of w^{(m)} to be 0; finally form \mathcal{W} = w^{(1)} \circ \cdots \circ w^{(M)}.

Step 2: Generate \mathcal{X}_i, where each element of \mathcal{X}_i is drawn from the distribution N(0, 1).

Step 3: Generate the response y_i = \langle \mathcal{X}_i, \mathcal{W} \rangle + \sigma \epsilon_i with respect to \mathcal{W}. Here, \sigma controls the degree of noise and each \epsilon_i is generated from N(0, 1).
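The three-step protocol above can be sketched as a small generator. Function and parameter names are illustrative, and the defaults are assumptions rather than the paper's exact settings; the structure (sparse rank-1 coefficient, Gaussian samples, noisy linear responses) follows the steps described.

```python
import numpy as np

def simulate(shape, n, sparsity=0.5, noise=0.1, seed=0):
    """Sketch of the simulation protocol (illustrative defaults)."""
    rng = np.random.default_rng(seed)
    # Step 1: Gaussian component vectors with a fraction of entries zeroed,
    # combined into a rank-1 coefficient tensor via outer products.
    comps = []
    for p in shape:
        w = rng.normal(size=p)
        w[rng.choice(p, int(sparsity * p), replace=False)] = 0.0
        comps.append(w)
    W = comps[0]
    for w in comps[1:]:
        W = np.multiply.outer(W, w)
    # Step 2: Gaussian sample tensors.
    X = rng.normal(size=(n,) + tuple(shape))
    # Step 3: responses y_i = <X_i, W> + noise * eps_i.
    y = np.tensordot(X, W, axes=len(shape)) + noise * rng.normal(size=n)
    return X, y, W
```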
In addition to simulated datasets, we also use the CMU2008 fMRI dataset to show the superiority of our method.
7.2 Experiments on simulated data
When generating simulated datasets, we fix the sparsity degree and the noise degree. For fairness, we set the same maximal number of iterations for all methods and let each method terminate early once its convergence criterion is met. Moreover, all tuning parameters of each method are selected through 5-fold cross-validation; the detailed intervals of the tuning parameters are shown in the appendix. For every single experiment, we run each method 20 times and average the metrics over these 20 trials.
To evaluate the superiority of our method, we generate both 2D and 3D datasets, varying the data size and the number of samples. For linear regression (LR), we use Lasso and Elastic Net and report the metrics of whichever obtains the better MSE. Figure 1 shows the time cost of each method. Because SURF is infeasible for 2D data and GLTRM is infeasible for 3D data, we discard these two methods in the corresponding subfigures. Moreover, for 3D data, SURF obtains MSE values much worse than the other methods (see Table 1); thus, we also discard SURF in Figure 1 due to its poor performance. We can see that our FasTR outperforms the baselines, and as the dimensionality of the data increases, the speed-up of FasTR becomes more and more obvious. Specifically, the two linear regression methods cannot obtain a solution in less than 90 seconds on the largest data shape, while the other methods cost no more than 4 seconds. Furthermore, Table 1 reports the MSE and CE values of every method on each dataset. Noticeably, FasTR has better MSE and CE than the baselines under every setting, for both 2D and 3D data. Notably, LR cannot compute an estimation on the two largest data sizes because the code throws a segmentation fault, which makes it infeasible for large-scale data. Notice also that the performance of the linear regression methods is worse than that of the tensor regression methods, which coincides with our statement that vectorization harms the structural information of the data. To summarize, for large-scale data, FasTR obtains a better solution with much less time cost.


size (N)  FasTR  Remurs  LR  SURF  GLTRM  
MSE  CE  MSE  CE  MSE  CE  MSE  CE  MSE  CE  
2D Data  
0.0427  0.7717  0.1157  0.8398  0.7095  0.9989  Inapplicable  1.8169  3.1381  
0.0305  0.5284  0.2393  0.8509  0.4235  0.9937  4.5746  5.715  
0.0342  0.3483  1.3414  0.8449  0.4980  0.9927  7.9937  9.2854  
0.0628  0.3436  0.1804  0.8030  0.5149  0.9998  22.4912  29.7461  
0.056  0.2544  0.4596  0.7905  0.5434  0.9994  52.3116  62.9425  
0.0518  0.2349  10.978  0.8404  0.5140  0.9998  72.3897  87.5139  
3D Data  
0.0047  0.1205  0.0610  0.8524  0.0324  0.4507  0.7933  0.8417  Inapplicable  
0.0127  0.1863  0.0925  0.9281  0.0586  0.2126  2.8068  0.9997  
0.1219  0.6772  0.3757  0.8707  0.2544  0.6461  12.6857  0.9989  
0.2158  0.8323  0.5301  0.8562  1.3930  1.0711  5.5318  0.9619  
0.1385  0.8820  0.2472  0.8929  0.5221  1.1443  9.5182  0.8948  
0.2514  0.8776  0.4386  0.8990  1.2765  1.0495  21.0606  0.9898  
0.3257  0.7334  0.5597  0.8877  1.3863  1.0010  175.2118  0.8436  
0.1810  0.6901  0.4867  0.9116  Inapplicable  335.1932  0.9988  
0.2105  0.5467  0.4036  0.8843  107.6186  0.8733  

In high-dimensional settings, we generate a 3D dataset and vary the number of samples N from 300 to 600. The sparsity level and the noise coefficient are fixed. Table 2 indicates that for every N, FasTR has a much lower MSE value, which shows that FasTR outperforms the baselines on large-scale and high-dimensional datasets. The MSE value does not decrease as the number of samples increases, contrary to what one might expect. This is because, for every N, the simulated dataset is generated separately, so the datasets with different N are unrelated; therefore, the MSE need not improve when more samples are provided. Nevertheless, under every condition, FasTR shows the best performance.


N  MSE Value  
FasTR  Remurs  LR  SURF  GLTRM  
300  0.2529  0.4795  1.0069  22.9196  Inapplicable 
350  0.2082  0.4451  1.0019  23.6516  
400  0.2252  0.3692  0.8662  10.6082  
450  0.3081  0.4902  1.2470  94.4651  
500  0.1887  0.4896  0.8028  60.3166  
550  0.2614  0.5480  1.0503  92.1064  
600  0.1914  0.3370  0.7686  17.6773  

Finally, to test the sensitivity of FasTR to different sparsity levels, we vary the sparsity level when generating simulated datasets, with the other settings fixed and the noise coefficient set to 0.2. The time cost and MSE of every method except GLTRM are reported in Figure 2. For every sparsity level, FasTR retains its superiority among all the methods; even under very sparse conditions, FasTR still obtains a significantly lower MSE than the others. In a word, FasTR performs better than previous methods on datasets with different levels of sparsity.
7.3 Experiments on real-world data


Project  FasTR  Remurs  Lasso  Elastic Net  SURF  
time  AUC  time  AUC  time  AUC  time  AUC  time  AUC  
#1  0.3062  0.6786  23.9151  0.5929  2.2287  0.9388  2.6994  0.9566  0.1743  0.5042 
#2  0.3409  0.7692  19.2467  0.5572  2.6699  0.7552  2.6772  0.7575  0.1367  0.4785 
#3  0.3594  0.5314  21.7162  0.8642  2.5251  0.8109  2.5365  0.7786  0.1539  0.5286 
#4  0.3052  0.7778  33.0095  0.7071  2.4533  0.7237  2.4499  0.7376  0.1392  0.5857 
#5  0.3073  0.7681  26.8979  0.7272  2.6739  0.5602  2.7003  0.5413  0.1285  0.4755 
#6  0.3302  0.7222  10.0312  0.5972  2.3102  0.6554  2.365  0.6689  0.1707  0.5486 
#7  0.3453  0.6796  19.3765  0.5714  2.7856  0.6531  2.9838  0.679  0.1364  0.4444 
#8  0.2941  0.8  16.7164  0.5486  2.2362  0.6741  2.2854  0.7165  0.1811  0.5384 
#9  0.2992  0.6953  24.1116  0.6428  2.9229  0.5531  2.912  0.5509  0.196  0.5857 

In this section, we perform fMRI classification tasks on the CMU2008 dataset [16], which contains 9 projects in total. Each sample of this dataset is a 3-mode tensor with 71,553 voxels in total. This classification task aims to predict human activities associated with recognizing the meanings of nouns. Following [10, 18], we focus on binary classification between the classes “tools” and “animals”; here, the class “tools” combines observations from “tool” and “furniture”, and the class “animals” combines observations from “animal” and “insect” in the CMU2008 dataset. As in the simulated experiments, the tuning parameters of each method are selected through 5-fold cross-validation. For each subject, we split the entire dataset into training and testing sets and use AUC to evaluate the classification results. Results are shown in Table 3, which indicates that FasTR obtains the best AUC value in most cases. Notice that although SURF has the lowest time cost, the AUC of its solution is around 0.5 and sometimes drops below 0.5; hence, we consider SURF's classification results unacceptable. One interesting result occurs in project #1, where the linear regression methods obtain a much better solution; we conjecture that in this subject the voxels are nearly independent, so the data has little latent structure. Overall, FasTR performs strongly on a real-world fMRI dataset.
References

[1] Genevera Allen, ‘Sparse higher-order principal components analysis’, in Artificial Intelligence and Statistics, pp. 27–36, (2012).
[2] Emmanuel Candes, Terence Tao, et al., ‘The Dantzig selector: Statistical estimation when p is much larger than n’, The Annals of Statistics, 35(6), 2313–2351, (2007).
[3] J Douglas Carroll and Jih-Jie Chang, ‘Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart–Young” decomposition’, Psychometrika, 35(3), 283–319, (1970).
 [4] Weiwei Guo, Irene Kotsia, and Ioannis Patras, ‘Tensor learning for regression’, IEEE Transactions on Image Processing, 21(2), 816–827, (2011).
 [5] Botao Hao, Boxiang Wang, Pengyuan Wang, Jingfei Zhang, Jian Yang, and Will Wei Sun, ‘Sparse tensor additive regression’, arXiv preprint arXiv:1904.00479, (2019).
[6] Richard A Harshman et al., ‘Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis’, (1970).
 [7] Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang, ‘Boosted sparse and lowrank tensor regression’, in Advances in Neural Information Processing Systems, pp. 1009–1018, (2018).
 [8] Frank L Hitchcock, ‘The expression of a tensor or a polyadic as a sum of products’, Journal of Mathematics and Physics, 6(14), 164–189, (1927).
 [9] Masaaki Imaizumi and Kohei Hayashi, ‘Doubly decomposing nonparametric tensor regression’, in International Conference on Machine Learning, pp. 727–736, (2016).

[10] Kittipat Kampa, S Mehta, Chun-An Chou, Wanpracha Art Chaovalitwongse, and Thomas J Grabowski, ‘Sparse optimization in feature selection: application in neuroimaging’, Journal of Global Optimization, 59(2-3), 439–457, (2014).
[11] Tamara G Kolda and Brett W Bader, ‘Tensor decompositions and applications’, SIAM Review, 51(3), 455–500, (2009).
 [12] Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc, ‘Modeldriven sparse cp decomposition for higherorder tensors’, in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1048–1057. IEEE, (2017).
 [13] Wenwen Li, Jian Lou, Shuo Zhou, and Haiping Lu, ‘Sturm: Sparse tubalregularized multilinear regression for fmri’, in International Workshop on Machine Learning in Medical Imaging, pp. 256–264. Springer, (2019).
 [14] Yimei Li, Hongtu Zhu, Dinggang Shen, Weili Lin, John H Gilmore, and Joseph G Ibrahim, ‘Multiscale adaptive regression models for neuroimaging data’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 559–578, (2011).
 [15] Oscar Hernan MadridPadilla and James Scott, ‘Tensor decomposition with generalized lasso penalties’, Journal of Computational and Graphical Statistics, 26(3), 537–546, (2017).
 [16] Tom M Mitchell, Svetlana V Shinkareva, Andrew Carlson, KaiMin Chang, Vicente L Malave, Robert A Mason, and Marcel Adam Just, ‘Predicting human brain activity associated with the meanings of nouns’, science, 320(5880), 1191–1195, (2008).
 [17] Garvesh Raskutti, Ming Yuan, Han Chen, et al., ‘Convex regularization for highdimensional multiresponse tensor regression’, The Annals of Statistics, 47(3), 1554–1584, (2019).
 [18] Xiaonan Song and Haiping Lu, ‘Multilinear regression for embedded feature selection with application to fmri analysis’, in ThirtyFirst AAAI Conference on Artificial Intelligence, (2017).
 [19] Volker Strassen, ‘Gaussian elimination is not optimal’, Numerische mathematik, 13(4), 354–356, (1969).
 [20] Will Wei Sun and Lexin Li, ‘Store: sparse tensor response regression and neuroimaging analysis’, The Journal of Machine Learning Research, 18(1), 4908–4944, (2017).
 [21] Will Wei Sun, Junwei Lu, Han Liu, and Guang Cheng, ‘Provable sparse tensor decomposition’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 899–916, (2017).
 [22] Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang, ‘Logistic tensor regression for classification’, in International Conference on Intelligent Science and Intelligent Data Engineering, pp. 573–581. Springer, (2012).
 [23] Robert Tibshirani, ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288, (1996).
 [24] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar, ‘Elementary estimators for highdimensional linear regression’, in International Conference on Machine Learning, pp. 388–396, (2014).
 [25] Rose Yu, Guangyu Li, and Yan Liu, ‘Tensor regression meets gaussian processes’, arXiv preprint arXiv:1710.11345, (2017).
 [26] Rose Yu and Yan Liu, ‘Learning from multiway data: Simple and efficient tensor regression’, in International Conference on Machine Learning, pp. 373–381, (2016).
 [27] Zemin Zhang and Shuchin Aeron, ‘Exact tensor completion using tsvd’, IEEE Transactions on Signal Processing, 65(6), 1511–1526, (2016).

[28] Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, and Misha Kilmer, ‘Novel methods for multilinear data completion and de-noising based on tensor-SVD’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3842–3849, (2014).
[29] Hua Zhou, Lexin Li, and Hongtu Zhu, ‘Tensor regression with applications in neuroimaging data analysis’, Journal of the American Statistical Association, 108(502), 540–552, (2013).
 [30] Hongtu Zhu, Yasheng Chen, Joseph G Ibrahim, Yimei Li, Colin Hall, and Weili Lin, ‘Intrinsic regression models for positivedefinite matrices with applications to diffusion tensor imaging’, Journal of the American Statistical Association, 104(487), 1203–1212, (2009).
 [31] Hui Zou and Trevor Hastie, ‘Regularization and variable selection via the elastic net’, Journal of the royal statistical society: series B (statistical methodology), 67(2), 301–320, (2005).