Fast and Scalable Estimator for Sparse and Unit-Rank Higher-Order Regression Models

by   Jiaqi Zhang, et al.

Because tensor data appear more and more frequently in various scientific researches and real-world applications, analyzing the relationship between tensor features and the univariate outcome becomes an elementary task in many fields. To solve this task, we propose Fast Sparse Tensor Regression model (FasTR) based on so-called unit-rank CANDECOMP/PARAFAC decomposition. FasTR first decomposes the tensor coefficient into component vectors and then estimates each vector with ℓ_1 regularized regression. Because of the independence of component vectors, FasTR is able to solve in a parallel way and the time complexity is proved to be superior to previous models. We evaluate the performance of FasTR on several simulated datasets and a real-world fMRI dataset. Experiment results show that, compared with four baseline models, in every case, FasTR can compute a better solution within less time.



There are no comments yet.


page 1

page 2

page 3

page 4


Sparse and Low-Rank Tensor Regression via Parallel Proximal Method

Motivated by applications in various scientific fields having demand of ...

Boosted Sparse and Low-Rank Tensor Regression

We propose a sparse and low-rank tensor regression model to relate a uni...

On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm

Tensor train (TT) decomposition provides a space-efficient representatio...

Generalized Co-sparse Factor Regression

Multivariate regression techniques are commonly applied to explore the a...

Sturm: Sparse Tubal-Regularized Multilinear Regression for fMRI

While functional magnetic resonance imaging (fMRI) is important for heal...

Higher-Order Partial Least Squares (HOPLS): A Generalized Multi-Linear Regression Method

A new generalized multilinear regression model, termed the Higher-Order ...

SPARTan: Scalable PARAFAC2 for Large & Sparse Data

In exploratory tensor mining, a common problem is how to analyze a set o...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Now, higher-order data, which is also called tensor, frequently occur in various scientific and real-world applications. Specifically, in neuroscience, functional magnetic resonance imaging (fMRI) is an example of such tensor data consisting of a series of brain scanning images. Therefore, such data can be characterized as a 3D data (or 3-mode data) with the shape of time neuron neuron. In many fields, we can encounter the problem that analyzing the relationship between the tensor variable and the scalar response for every sample . Specifically, we assume


in which is the inner product operator, is the noise, and is the coefficient needs to be estimated through regression. Notice that in the real world, these tensor data generally have two properties which makes the coefficient difficult to be inferred perfectly: (1) Ultra-high-dimensional setting, where the number of samples is much less than the number of variables. For example, each sample of the CMU2008 dataset [16] is a 3D tensor with shape of , which is 71553 voxels in total. However, only 360 trials are recorded. The high-dimensional setting will make the estimated solution breaks down because we are trying to infer a large model with a limited number of observations. (2) Higher-order structure of data. The higher-order structure of data exists in many fields, such as fMRI and videos, with the shape of time pixel

pixel. Traditional machine learning methods are proposed for processing vectors or matrices, hence, dealing with high-order data might be a difficulty. In past years, many methods are introduced to address these two problems.

Under high-dimensional settings, several well-known models were already proposed making use of variable selection, such as Lasso [23] and Dantzig selector [2]

. Because they are unable to deal with data other than vectors, one naive way to use them on tensor data is vectorization. All the elements in the tensor are stacked into a vector, thence, the existing linear regression can work. However, intuitively, the latent structural information will be lost in such a manner. Therefore, some methods aim at directly handling the tensor. For example,

[18] propose Remurs exploiting commonly used

norm for enforcing sparsity on the estimated coefficient tensor. In addition, a nuclear norm is attached to it to make the solution low-rank. The main shortcoming of Remurs is that the tensor nuclear norm is approximated by the nuclear norm of its unfolding matrices. Because the tensor should be unfolded into matrices, its structure is still destructed. Therefore, though this kind of method is able to obtain an acceptable solution in high-dimensional settings, the higher-order structure is lost.

To reserve the spatial structure, several methods are introduced based on the CANDECOMP/PARAFAC decomposition, which approximates an M-order tensor through


Here, is defined as the CP-rank of the tensor . For instance, [29] propose GLTRM which first decomposes the variable tensor and then applies the generalized linear model to estimate each component vector. In addition, [7] propose SURF using the divide-and-conquer strategy for each component vector. Almost all the CANDECOMP/PARAFAC-decomposition-based methods, including GLTRM, suffer a problem that the CP-rank should be pre-specified. However, we always have no prior knowledge about the value of . Even if we can use techniques, such as cross-validation, to estimate from the data, the solving procedure becomes trivial and computationally expensive for large-scale data. A method called orTRR is previously proposed in [4] automatically obtaining a low-rank coefficient without pre-specifying . But orTRR uses norm rather than norm for recovering the sparsity of data, which makes it performance poorer than others on variable selection. To our best knowledge, there is no scalable estimator proposed before, for enforcing both sparsity and low-rankness on the solution in high-dimensional settings.

In this paper, we derive ideas of a scalable estimator, Elementary Estimator [24], and propose Fast Sparse Higher-Order Tensor Regression (FasTR), which estimates a unit-rank coefficient tensor. First, the problem is decomposed into several sub-problems. Then, for each sub-problem, i.e., each component vector, a closed-form solution can be obtained efficiently. Notice that because the computation of closed-form solution can be speeded up through multi-threading computation or GPUs, thus, the solution of FasTR is able to be obtained with small time complexity. See details in Section 6. To summarize, this paper has the following novelties:

  • A sparse tensor regression model and its fast and scalable solution: In Section 4, we propose a regression model for tensor data, using norm for obtaining the sparsity. Moreover, we provide a scalable solution for the model, which is obtained iteratively while at each iteration the temporary estimation has closed form.

  • State-of-the-art error bound for tensor regression model: We theoretically prove that our sparse estimator has a state-of-the-art error bound , where denotes the non-zero elements of the th decomposed component of variable tensor and is a constant characterized by the data. Details are shown in Section 5.

  • Experiments on real-world fMRI dataset: In Section 7, we make comparison between our FasTR and four baselines on several simulated datasets and one fMRI dataset with nine projects. Experiment results empirically show that FasTR can obtain better estimations with less time cost.

2 Notations

denotes the element-wise norm, denotes the nuclear norm, denotes the norm, and denotes the spectral norm. Throughout this paper, the higher-order tensor is denoted by the calligraphic letter and the vector is denoted by a lower-case letter . Scalars are also denoted by lower-case letters but stated clearly within the context to avoid confusion.

3 Background

3.1 Elmentary estimator for linear regression models (EE-Ridge)

For vector (first-order) data, [24] propose a state-of-the-art method called EE-Ridge to solve high-dimensional linear regression problems. Given the sample matrix and the response vector , EE-Ridge has the following formulation:


Here, is an arbitrary norm function and . The two hyper-parameters, and , handles with the non-invertibility of covariance matrix and controls the level of sparsity respectively. Although EE-Ridge shares certain similarities with the Dantzig selector [2], EE-Ridge has outstanding performance on computational complexity. For instance, when selecting the norm function , a closed-form solution to Eq. (3) can be obtained through , in which denotes the soft-thresholding operator. Noticeably, the calculation of this solution is dominated by computing the matrix inversion, which generally acquires time. 111Some other methods can compute matrix inversion with lower time complexity. For instance, the Strassen algorithm proposed in [19] has the time complexity of on matrix inversion computation. This is a significant improvement in previous variable selectors, such as Lasso and Dantzig selector with the time cost of and respectively. Furthermore, the computation of solution to EE-Ridge can be easily speeded up by the virtue of multiple threads or GPUs.

3.2 Higher-order tensor regression model

Given -order predictors and scalar responses , , higher-order tensor regression models consider that the responses are generated from a linear formulation . Here is an -order coefficient tensor and is an error term. Deriving ideas of LR, the coefficient tensor is estimated through


where is a norm function enforcing certain properties on the coefficient tensor. Eq. (4) is akin to the formulation of LR, however, existing LR methods can not be directly applied to it. A naive adaption is the vectorization. By stacking the elements of a tensor into a vector first, LR can be utilized. However, vectorization will hurt the structural information of data, which makes it inapplicable in real-world applications.

3.3 Unit-Rank Tensor CANDECOMP/PARAFAC decomposition

To reserve latent structural information and decrease the ultra-high dimensionality when dealing with the higher-order tensor , CANDECOMP/PARAFAC decomposition is proposed in [8, 3, 6] to decompose the tensor into the outer products of several vectors. Specifically, given a tensor , it can be decomposed into the outer-products of component vectors


in which each . In this way, the number of variables largely reduced from to . Intuitively, CANDECOMP/PARAFAC decomposition reserves more latent information than simply vectorizing.

4 Method

Substituting the M-order coefficient tensor with its CP decomposition , the coefficient tensor can be obtained through estimating all the component vectors . Likewise the linear regression, to infer a certain , each sample needs to be “projected” onto the -th space . Then, intuitively, we aim to let each fit the projected samples corresponding to its space. In addition, instead of coefficient tensor , we impose sparsity constraints on each component . This leads to a more flexible and efficient model because fewer variables need to be dealt with in the high-dimensional settings. Therefore, letting denote the matrix with the -th row bpreing the projected on the -th space, our objective is solving


Here, is a tuning parameter controlling the degree of sparsity and

is an identity matrix. Parameter

aims to make matrix invertible, which handling the crucial problem of high-dimensional learning. Then, fortunately, based on EE-Ridge, Eq. (6) has a closed-form solution


With this, as long as is easy to be computed, we can solve Eq. (6) directly.

4.1 Proposed: Fast Sparse Higher-Order Tensor Regression (FasTR)

In our method, we use an simple and intuitive formulation of the projection function as


Then, substituting with Eq. (8) in Eq. (6), our FasTR aims to solve


for . Notice that the computing of is dominated by a large number of multiplications, obtaining the solution Eq. (7) can be easily accelerated by multiple CPUs or GPUs.

Moreover, we propose a fast algorithm to solve Eq. (9) in a component-wise manner. When estimating of a certain , we fix other component vectors as constants. At each iteration, we first compute and then get the estimation through Eq. (7). Specifically, let denotes the estimation of the -th mode component vector at the -th iteration,

The algorithm is summarized in Algorithm 1.

1:  Input: Samples , corresponding observed responses , , the maximum number of iterations , and tuning parameters and .
2:  Randomly initialize
3:  for  to iter do
4:     for  to  do
5:        Compute through Eq. (8);
6:        Compute through Eq. (7);
7:     end for
8:  end for
10:  Output: .
Algorithm 1   FasTR

5 Theorem

We now provide a statistical analysis of the component estimator (Eq. (9)). We follow the idea of [24] and make following assumptions:

(C-Sparse) The coefficient component is exactly sparse with non-zero elements.

(C-Ridge) Let be the singular vectors of

corresponding to the singular values

. Here, is the rank of . Let . Then, with some equence .

Consider assumptions (C-Sparse) and (C-Ridge) are satisfied, there exist positive constants (), such that the estimated solution of Eq. (9) satisfies


with probability at least

. Here, we suppose that is selected through .

6 Discussion

6.1 Complexity analysis

When estimating for each mode, the time complexity is dominated by computing , which costs . Once the projection is obtained, is calculated through Eq. (7) with time complexity. As the sub-tasks that estimating for each are independent to each other, these sub-tasks can be optimized parallelly. Furthermore, notice that the time cost of computing through Eq. (7) can be easily reduced making use of multiple threads or GPUs, this part of time becomes negligible. Therefore, integrating all the ingredients, the time complexity of our method is .

Apart from the computationally efficiency of FasTR, our method also acquires small number of memory space. Our method requires two parts of memory space, one of which is used for storing components and another is for dealing computations on . Because, among all the computations about the projection, needs the most memory space, the second part requires space. Therefore, totally, the memory complexity of FasTR is .

6.2 Relevance to previous works

Many methods have been proposed in the literature of regression tasks on higher-order tensor data. In this paper, we focus on the setting that the variables are represented by a tensor while the responses are denoted by a vector . Several models were already recently proposed to estimate the coefficient tensor for this specific, what we call, higher-order tensor regression problem.

One group of these methods is the direct extension of regularized linear regression. Naively, one way to solve this regression problem is vectorization. All the elements in the tensor are first stacked into a vector and then existing linear regression models can be applied to it. One obvious shortcoming of vectorization is that it will cause a loss of latent structural information of the data. To reserve certain potential information, [18] is proposed to estimate a sparse and low-rank coefficient tensor, by integrating the tensor nuclear norm and norm into the optimization problem. Notice that in Remurs, the tensor nuclear norm is approximated by the summation of ranks of several unfolded matrices. Remurs still discards some structural information when unfolding the tensor into matrices, although it outperforms than vectorization generally. In addition, [13] improve Remurs by substituting the nuclear norm into Tubal nuclear norm [28, 27]

, which can be efficiently solved through discrete Fourier transform. However, the tensor unfolding is still required. Furthermore, these methods are also computationally expensive because the non-differential regularizer,

norm or nuclear norm exists in their objective function. Therefore, currently, this group of methods is not a good choice for higher-order tensor regression.

To reserve the latent structure when dealing with tensors, another prevailing group of methods [7, 29, 4, 22] are proposed based on CANDECOMP/PARAFRAC decomposition. Generally, instead of directly estimate the coefficient tensor , we aim at inferring every component vector in each sub-task. For example, [29] propose GLTRM using generalized linear model (GLM) to solve each sub-task. Moreover, orTRR is proposed in [4] enforcing sparsity and low-rankness on the estimated coefficient tensor. Instead of norm, orTRR utilize norm to obtain the sparsity. In addition, recently, [7] propose SURF exploiting divide-and-conquer strategy where the sub-task has a similar formulation of Elastic Net [31]. In the paper of SURF, authors empirically show that their method can converge, but a statistical convergence rate is not proved. On the contrary, in this paper, we theoretically prove the error bound of our method. Noticeably, the main limitation of CANDECOMP/PARAFAC-decomposition-based method is that the decomposition rank should be pre-specified, however, we generally have no prior knowledge about the tensor rank in real-world applications. Although orTRR is able to automatically obtain a low-rank result, the estimated result is sub-optimal due to the fact that norm is inferior to norm in the sparse setting. Hence, these methods are not suitable for real-world applications.

Some other models were introduced previously for other problem settings. Recently, [5, 9, 25] propose models for non-parametric estimation by assuming that the response , making use of either additive model or Gaussian process. Apart from the above-mentioned ones, many models [20, 17, 26, 30, 14] were put forward to estimate the relationship between the variable tensor and a response tensor . Another line of statistical models involving tensor data is tensor decomposition [11, 21, 1, 12, 15]. Tensor decomposition can be considered as an unsupervised problem which aims at approximating the tensor with lower-order data. On the contrary, our FasTR is a supervised method estimating the latent relationship between variables and responses. Because these methods have different objectives from our method, we pay little attention to them and exclude them from experiments. In section 7, we compare FasTR with several introduced higher-order tensor regression methods, including Lasso, Elastic Net, Remurs, GLTRM, and SURF.

7 Experiment

7.1 Experiment Setups

We experiment on several simulated datasets and a real-world datasets with nine projects to compare the performance of our method with previous methods. We use four previous methods as baselines, which are 1) Linear Regression (Lasso and Elastic Net), 2) Remurs, 3) SURF, and 4) GLTRM. Specifically,

We use three metrics to evaluate the performance of our method and baselines, including 1) time cost, 2) mean squared error (MSE), and 3) coefficient error (CE). Here, and .

Furthermore, simulated data are generated through following three steps:

  • Step 1: For to , generate

    with each element derived from the Gaussian distribution

    ; given the sparsity degree , randomly set elements of to be ; .

  • Step 2: Generate while each element of is derived from the distribution .

  • Step 3: Generate the response with respect to . Here, controls the degree of noise and each element of is generated from .

In addtion to simulated datasets, we also use CMU2008 fMRI dataset to show the superiority of our method.

7.2 Experiments on simulated data

Figure 1: (a) Time cost of FasTR and baseline methods on simulated 2D data with shape of , , , , and . The dashed line indicates that Lasso and Elastic Net are not able to obtain an estimation in 90 seconds. The code of SURF is infeasible for 2D data, hence, its experimental result is omitted. (b) Time cost of FasTR and baseline methods on simulated 3D data. with shape of , , , , , and . GLTRM is only feasible for 2D data, thus, we omit it in the sub-figure. Moreover, the estimation of SURF is terrible, so we also discard it when comparing the time cost.

When generating simulated datasets, we let the sparsity degree and noise degree . Out of fairness, we set the maximal number of iterations to be for all the methods and let the method terminate when . Moreover, all the tuning parameters of each method are selected through 5-fold cross-validation. The detailed interval of tuning parameters is shown in the appendix. For every single experiment, we run each method 20 times and average the metrics’ value over these 20 trials.

To evaluate the superiority of our method, we generate both 2D and 3D datasets varying the data size and the number of samples. For linear regression (LR), we use Lasso and Elastic Net and we report the value of metrics for the method which obtains the better MSE. In Figure 1, we show the time cost of each method. Notice that because SURF is infeasible for 2D data and GLTRM is infeasible for 3D data, we discard these two methods in the sub-figure correspondingly. Moreover, for 3D data, SURF obtains MSE values much worse that other methods (see Table 1), thus, we discard SURF in Figure 1 due to its terrible performance. We can see that our FasTR outperforms other baselines and as the dimensionality of data increases, the speeds up of FasTR become more and more obvious. Specifically, the two linear regression methods are not able to obtain a solution in less than 90 seconds for data with a shape of , while other methods cost no more than 4 seconds. Furthermore, in Table 1 reports the MSE and CE values of every method on each dataset. Noticeably, FasTR has better MSE and CE under every setting, compared with baselines, for both 2D and 3D data. Specifically, LR can not compute an estimation because the code throws segmentation fault when the size of data is and , which makes it infeasible for large-scale data. Notice that the performance of linear regression methods are worse than tensor regression methods, which coincide with our statement that the vectorization can do harm to the structural information of data. To summarize, for large-scale data, FasTR is able to obtain a better solution with a much less time cost.


size (N) FasTR Remurs LR SURF GLTRM
2D Data
0.0427 0.7717 0.1157 0.8398 0.7095 0.9989 Inapplicable 1.8169 3.1381
0.0305 0.5284 0.2393 0.8509 0.4235 0.9937 4.5746 5.715
0.0342 0.3483 1.3414 0.8449 0.4980 0.9927 7.9937 9.2854
0.0628 0.3436 0.1804 0.8030 0.5149 0.9998 22.4912 29.7461
0.056 0.2544 0.4596 0.7905 0.5434 0.9994 52.3116 62.9425
0.0518 0.2349 10.978 0.8404 0.5140 0.9998 72.3897 87.5139
1-15[0.8pt/2pt] 3D Data
0.0047 0.1205 0.0610 0.8524 0.0324 0.4507 0.7933 0.8417 Inapplicable
0.0127 0.1863 0.0925 0.9281 0.0586 0.2126 2.8068 0.9997
0.1219 0.6772 0.3757 0.8707 0.2544 0.6461 12.6857 0.9989
0.2158 0.8323 0.5301 0.8562 1.3930 1.0711 5.5318 0.9619
0.1385 0.8820 0.2472 0.8929 0.5221 1.1443 9.5182 0.8948
0.2514 0.8776 0.4386 0.8990 1.2765 1.0495 21.0606 0.9898
0.3257 0.7334 0.5597 0.8877 1.3863 1.0010 175.2118 0.8436
0.1810 0.6901 0.4867 0.9116 Inapplicable 335.1932 0.9988
0.2105 0.5467 0.4036 0.8843 107.6186 0.8733


Table 1: The MSE and CE of FasTR and baselines on several 2D and 3D simulated datasets. SURF is inapplicable for 2D data and GLTRM is inapplicable for 3D data. The LR methods can not obtain the estimation for simulated data with too large size ( and ) because the MATLAB package “GLMNET” throws segmentation fault for too large tensor data. “LR” stands for Lasso and Elastic Net and we choose one of them with better MSE to represent linear regression.

In high-dimensional settings, we generate a 3D dataset with a shape of and vary the number of samples from to . The sparsity level is set to and the noise coefficient is fixed to . Apparently, Table 2 indicates that for every , FasTR has much lower MSE value, which indicates that FasTR outperform baselines on large-scale and high-dimensional datasets. The MSE value does not reduce along with the increment of the number of samples, which might be a general thought. Because for every , the simulated dataset is generated separately, meaning that these datasets with different have no relevance. Therefore, in this experiment, the MSE does not have to be improved when more samples are provided. However, in every condition, we show that FasTR has better performance.


N MSE Value
300 0.2529 0.4795 1.0069 22.9196 Inapplicable
350 0.2082 0.4451 1.0019 23.6516
400 0.2252 0.3692 0.8662 10.6082
450 0.3081 0.4902 1.2470 94.4651
500 0.1887 0.4896 0.8028 60.3166
550 0.2614 0.5480 1.0503 92.1064
600 0.1914 0.3370 0.7686 17.6773


Table 2: The MSE value of each method, varying the number of samples. The data size is with sparsity level noise coefficient is 0.1 and . GLTRM is inapplicable for 3D data, hence, its MSE value is omitted. “LR” stands for Lasso and Elastic Net and we choose one of them with better MSE to represent linear regression.

At last, to test the sensitivity of FasTR for different sparsity levels, we vary the sparsity level when generating simulated datasets with the shape of and the . The noise coefficient is set to 0.2. Varying the sparsity level from , the time cost and MSE of every method, except GLTRM, are reported in Figure 2. For different sparsity level, FasTR retains the superiority among all the methods. Although under too sparse conditions where the sparsity level is larger than , FasTRsignificantly has lower MSE than others. In a word, FasTR obtains better performance on datasets with different level of sparsity, compared with previous methods.

Figure 2: Varying the sparsity level of simulated data, we compare the performance of FasTR and baseline methods on 3D data with shape of and noise factor of . (a) The time cost of methods. We discard GLTRM in this experiment because of its infeasibility for 3D data. (b) The MSE of methods.

7.3 Experiments on real-world data


Project FasTR Remurs Lasso Elastic Net SURF
time AUC time AUC time AUC time AUC time AUC
#1 0.3062 0.6786 23.9151 0.5929 2.2287 0.9388 2.6994 0.9566 0.1743 0.5042
#2 0.3409 0.7692 19.2467 0.5572 2.6699 0.7552 2.6772 0.7575 0.1367 0.4785
#3 0.3594 0.5314 21.7162 0.8642 2.5251 0.8109 2.5365 0.7786 0.1539 0.5286
#4 0.3052 0.7778 33.0095 0.7071 2.4533 0.7237 2.4499 0.7376 0.1392 0.5857
#5 0.3073 0.7681 26.8979 0.7272 2.6739 0.5602 2.7003 0.5413 0.1285 0.4755
#6 0.3302 0.7222 10.0312 0.5972 2.3102 0.6554 2.365 0.6689 0.1707 0.5486
#7 0.3453 0.6796 19.3765 0.5714 2.7856 0.6531 2.9838 0.679 0.1364 0.4444
#8 0.2941 0.8 16.7164 0.5486 2.2362 0.6741 2.2854 0.7165 0.1811 0.5384
#9 0.2992 0.6953 24.1116 0.6428 2.9229 0.5531 2.912 0.5509 0.196 0.5857


Table 3: Time cost and AUC value of FasTR and baselines on nine classification projects of CMU2008 datasets.

In this section, we perform fMRI classification tasks on CMU2008 datasets [16] with 9 projects in total. Each sample of this dataset is a 3-mode tensor of size (71553 voxels). This classification task aims to predict human activities associated with recognizing the meanings of nouns. Following [10, 18], we focus on classifications of binary classes: “tools” and “animals”. Here, the class “tool” combines observations from “tool” and “furniture”, class “animal” combines observations from “animal” and “insect” in the CMU2008 dataset. Like simulated experiments, values of tuning parameters of each method are selected through 5-fold cross-validation. For each subject, we split the entire dataset into the training dataset and testing dataset with the proportion of and respectively and use AUC to evaluate classification results. Results are shown in Table 3, which indicates that FasTR obtains the best AUC value for most cases. Notice that although SURF has the lowest time cost, the AUC of its solution is around 0.5 and drops below 0.5 sometimes. Hence, we think the classification result of SURF is unacceptable. One interesting result occurs on project #1, where linear regression methods obtain a much better solution. We think the reason might be that in this subject, the voxels are independent, hence, the data has no latent structure. Anyway, FasTR has a significant performance on a real-world fMRI dataset.


  • [1]

    Genevera Allen, ‘Sparse higher-order principal components analysis’, in

    Artificial Intelligence and Statistics, pp. 27–36, (2012).
  • [2] Emmanuel Candes, Terence Tao, et al., ‘The dantzig selector: Statistical estimation when p is much larger than n’, The annals of Statistics, 35(6), 2313–2351, (2007).
  • [3] J Douglas Carroll and Jih-Jie Chang, ‘Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition’, Psychometrika, 35(3), 283–319, (1970).
  • [4] Weiwei Guo, Irene Kotsia, and Ioannis Patras, ‘Tensor learning for regression’, IEEE Transactions on Image Processing, 21(2), 816–827, (2011).
  • [5] Botao Hao, Boxiang Wang, Pengyuan Wang, Jingfei Zhang, Jian Yang, and Will Wei Sun, ‘Sparse tensor additive regression’, arXiv preprint arXiv:1904.00479, (2019).
  • [6] Richard A Harshman et al., ‘Foundations of the parafac procedure: Models and conditions for an" explanatory" multimodal factor analysis’, (1970).
  • [7] Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang, ‘Boosted sparse and low-rank tensor regression’, in Advances in Neural Information Processing Systems, pp. 1009–1018, (2018).
  • [8] Frank L Hitchcock, ‘The expression of a tensor or a polyadic as a sum of products’, Journal of Mathematics and Physics, 6(1-4), 164–189, (1927).
  • [9] Masaaki Imaizumi and Kohei Hayashi, ‘Doubly decomposing nonparametric tensor regression’, in International Conference on Machine Learning, pp. 727–736, (2016).
  • [10]

    Kittipat Kampa, S Mehta, Chun-An Chou, Wanpracha Art Chaovalitwongse, and Thomas J Grabowski, ‘Sparse optimization in feature selection: application in neuroimaging’,

    Journal of Global Optimization, 59(2-3), 439–457, (2014).
  • [11] Tamara G Kolda and Brett W Bader, ‘Tensor decompositions and applications’, SIAM review, 51(3), 455–500, (2009).
  • [12] Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc, ‘Model-driven sparse cp decomposition for higher-order tensors’, in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1048–1057. IEEE, (2017).
  • [13] Wenwen Li, Jian Lou, Shuo Zhou, and Haiping Lu, ‘Sturm: Sparse tubal-regularized multilinear regression for fmri’, in International Workshop on Machine Learning in Medical Imaging, pp. 256–264. Springer, (2019).
  • [14] Yimei Li, Hongtu Zhu, Dinggang Shen, Weili Lin, John H Gilmore, and Joseph G Ibrahim, ‘Multiscale adaptive regression models for neuroimaging data’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 559–578, (2011).
  • [15] Oscar Hernan Madrid-Padilla and James Scott, ‘Tensor decomposition with generalized lasso penalties’, Journal of Computational and Graphical Statistics, 26(3), 537–546, (2017).
  • [16] Tom M Mitchell, Svetlana V Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L Malave, Robert A Mason, and Marcel Adam Just, ‘Predicting human brain activity associated with the meanings of nouns’, science, 320(5880), 1191–1195, (2008).
  • [17] Garvesh Raskutti, Ming Yuan, Han Chen, et al., ‘Convex regularization for high-dimensional multiresponse tensor regression’, The Annals of Statistics, 47(3), 1554–1584, (2019).
  • [18] Xiaonan Song and Haiping Lu, ‘Multilinear regression for embedded feature selection with application to fmri analysis’, in Thirty-First AAAI Conference on Artificial Intelligence, (2017).
  • [19] Volker Strassen, ‘Gaussian elimination is not optimal’, Numerische mathematik, 13(4), 354–356, (1969).
  • [20] Will Wei Sun and Lexin Li, ‘Store: sparse tensor response regression and neuroimaging analysis’, The Journal of Machine Learning Research, 18(1), 4908–4944, (2017).
  • [21] Will Wei Sun, Junwei Lu, Han Liu, and Guang Cheng, ‘Provable sparse tensor decomposition’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 899–916, (2017).
  • [22] Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang, ‘Logistic tensor regression for classification’, in International Conference on Intelligent Science and Intelligent Data Engineering, pp. 573–581. Springer, (2012).
  • [23] Robert Tibshirani, ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288, (1996).
  • [24] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar, ‘Elementary estimators for high-dimensional linear regression’, in International Conference on Machine Learning, pp. 388–396, (2014).
  • [25] Rose Yu, Guangyu Li, and Yan Liu, ‘Tensor regression meets gaussian processes’, arXiv preprint arXiv:1710.11345, (2017).
  • [26] Rose Yu and Yan Liu, ‘Learning from multiway data: Simple and efficient tensor regression’, in International Conference on Machine Learning, pp. 373–381, (2016).
  • [27] Zemin Zhang and Shuchin Aeron, ‘Exact tensor completion using t-svd’, IEEE Transactions on Signal Processing, 65(6), 1511–1526, (2016).
  • [28] Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, and Misha Kilmer, ‘Novel methods for multilinear data completion and de-noising based on tensor-svd’, in

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , pp. 3842–3849, (2014).
  • [29] Hua Zhou, Lexin Li, and Hongtu Zhu, ‘Tensor regression with applications in neuroimaging data analysis’, Journal of the American Statistical Association, 108(502), 540–552, (2013).
  • [30] Hongtu Zhu, Yasheng Chen, Joseph G Ibrahim, Yimei Li, Colin Hall, and Weili Lin, ‘Intrinsic regression models for positive-definite matrices with applications to diffusion tensor imaging’, Journal of the American Statistical Association, 104(487), 1203–1212, (2009).
  • [31] Hui Zou and Trevor Hastie, ‘Regularization and variable selection via the elastic net’, Journal of the royal statistical society: series B (statistical methodology), 67(2), 301–320, (2005).