Now, higher-order data, which is also called tensor, frequently occur in various scientific and real-world applications. Specifically, in neuroscience, functional magnetic resonance imaging (fMRI) is an example of such tensor data consisting of a series of brain scanning images. Therefore, such data can be characterized as a 3D data (or 3-mode data) with the shape of time neuron neuron. In many fields, we can encounter the problem that analyzing the relationship between the tensor variable and the scalar response for every sample . Specifically, we assume
in which is the inner product operator, is the noise, and is the coefficient needs to be estimated through regression. Notice that in the real world, these tensor data generally have two properties which makes the coefficient difficult to be inferred perfectly: (1) Ultra-high-dimensional setting, where the number of samples is much less than the number of variables. For example, each sample of the CMU2008 dataset  is a 3D tensor with shape of , which is 71553 voxels in total. However, only 360 trials are recorded. The high-dimensional setting will make the estimated solution breaks down because we are trying to infer a large model with a limited number of observations. (2) Higher-order structure of data. The higher-order structure of data exists in many fields, such as fMRI and videos, with the shape of time pixel
pixel. Traditional machine learning methods are proposed for processing vectors or matrices, hence, dealing with high-order data might be a difficulty. In past years, many methods are introduced to address these two problems.
. Because they are unable to deal with data other than vectors, one naive way to use them on tensor data is vectorization. All the elements in the tensor are stacked into a vector, thence, the existing linear regression can work. However, intuitively, the latent structural information will be lost in such a manner. Therefore, some methods aim at directly handling the tensor. For example, propose Remurs exploiting commonly used
norm for enforcing sparsity on the estimated coefficient tensor. In addition, a nuclear norm is attached to it to make the solution low-rank. The main shortcoming of Remurs is that the tensor nuclear norm is approximated by the nuclear norm of its unfolding matrices. Because the tensor should be unfolded into matrices, its structure is still destructed. Therefore, though this kind of method is able to obtain an acceptable solution in high-dimensional settings, the higher-order structure is lost.
To reserve the spatial structure, several methods are introduced based on the CANDECOMP/PARAFAC decomposition, which approximates an M-order tensor through
Here, is defined as the CP-rank of the tensor . For instance,  propose GLTRM which first decomposes the variable tensor and then applies the generalized linear model to estimate each component vector. In addition,  propose SURF using the divide-and-conquer strategy for each component vector. Almost all the CANDECOMP/PARAFAC-decomposition-based methods, including GLTRM, suffer a problem that the CP-rank should be pre-specified. However, we always have no prior knowledge about the value of . Even if we can use techniques, such as cross-validation, to estimate from the data, the solving procedure becomes trivial and computationally expensive for large-scale data. A method called orTRR is previously proposed in  automatically obtaining a low-rank coefficient without pre-specifying . But orTRR uses norm rather than norm for recovering the sparsity of data, which makes it performance poorer than others on variable selection. To our best knowledge, there is no scalable estimator proposed before, for enforcing both sparsity and low-rankness on the solution in high-dimensional settings.
In this paper, we derive ideas of a scalable estimator, Elementary Estimator , and propose Fast Sparse Higher-Order Tensor Regression (FasTR), which estimates a unit-rank coefficient tensor. First, the problem is decomposed into several sub-problems. Then, for each sub-problem, i.e., each component vector, a closed-form solution can be obtained efficiently. Notice that because the computation of closed-form solution can be speeded up through multi-threading computation or GPUs, thus, the solution of FasTR is able to be obtained with small time complexity. See details in Section 6. To summarize, this paper has the following novelties:
A sparse tensor regression model and its fast and scalable solution: In Section 4, we propose a regression model for tensor data, using norm for obtaining the sparsity. Moreover, we provide a scalable solution for the model, which is obtained iteratively while at each iteration the temporary estimation has closed form.
State-of-the-art error bound for tensor regression model: We theoretically prove that our sparse estimator has a state-of-the-art error bound , where denotes the non-zero elements of the th decomposed component of variable tensor and is a constant characterized by the data. Details are shown in Section 5.
Experiments on real-world fMRI dataset: In Section 7, we make comparison between our FasTR and four baselines on several simulated datasets and one fMRI dataset with nine projects. Experiment results empirically show that FasTR can obtain better estimations with less time cost.
denotes the element-wise norm, denotes the nuclear norm, denotes the norm, and denotes the spectral norm. Throughout this paper, the higher-order tensor is denoted by the calligraphic letter and the vector is denoted by a lower-case letter . Scalars are also denoted by lower-case letters but stated clearly within the context to avoid confusion.
3.1 Elmentary estimator for linear regression models (EE-Ridge)
For vector (first-order) data,  propose a state-of-the-art method called EE-Ridge to solve high-dimensional linear regression problems. Given the sample matrix and the response vector , EE-Ridge has the following formulation:
Here, is an arbitrary norm function and . The two hyper-parameters, and , handles with the non-invertibility of covariance matrix and controls the level of sparsity respectively. Although EE-Ridge shares certain similarities with the Dantzig selector , EE-Ridge has outstanding performance on computational complexity. For instance, when selecting the norm function , a closed-form solution to Eq. (3) can be obtained through , in which denotes the soft-thresholding operator. Noticeably, the calculation of this solution is dominated by computing the matrix inversion, which generally acquires time. 111Some other methods can compute matrix inversion with lower time complexity. For instance, the Strassen algorithm proposed in  has the time complexity of on matrix inversion computation. This is a significant improvement in previous variable selectors, such as Lasso and Dantzig selector with the time cost of and respectively. Furthermore, the computation of solution to EE-Ridge can be easily speeded up by the virtue of multiple threads or GPUs.
3.2 Higher-order tensor regression model
Given -order predictors and scalar responses , , higher-order tensor regression models consider that the responses are generated from a linear formulation . Here is an -order coefficient tensor and is an error term. Deriving ideas of LR, the coefficient tensor is estimated through
where is a norm function enforcing certain properties on the coefficient tensor. Eq. (4) is akin to the formulation of LR, however, existing LR methods can not be directly applied to it. A naive adaption is the vectorization. By stacking the elements of a tensor into a vector first, LR can be utilized. However, vectorization will hurt the structural information of data, which makes it inapplicable in real-world applications.
3.3 Unit-Rank Tensor CANDECOMP/PARAFAC decomposition
To reserve latent structural information and decrease the ultra-high dimensionality when dealing with the higher-order tensor , CANDECOMP/PARAFAC decomposition is proposed in [8, 3, 6] to decompose the tensor into the outer products of several vectors. Specifically, given a tensor , it can be decomposed into the outer-products of component vectors
in which each . In this way, the number of variables largely reduced from to . Intuitively, CANDECOMP/PARAFAC decomposition reserves more latent information than simply vectorizing.
Substituting the M-order coefficient tensor with its CP decomposition , the coefficient tensor can be obtained through estimating all the component vectors . Likewise the linear regression, to infer a certain , each sample needs to be “projected” onto the -th space . Then, intuitively, we aim to let each fit the projected samples corresponding to its space. In addition, instead of coefficient tensor , we impose sparsity constraints on each component . This leads to a more flexible and efficient model because fewer variables need to be dealt with in the high-dimensional settings. Therefore, letting denote the matrix with the -th row bpreing the projected on the -th space, our objective is solving
Here, is a tuning parameter controlling the degree of sparsity and
is an identity matrix. Parameteraims to make matrix invertible, which handling the crucial problem of high-dimensional learning. Then, fortunately, based on EE-Ridge, Eq. (6) has a closed-form solution
With this, as long as is easy to be computed, we can solve Eq. (6) directly.
4.1 Proposed: Fast Sparse Higher-Order Tensor Regression (FasTR)
In our method, we use an simple and intuitive formulation of the projection function as
for . Notice that the computing of is dominated by a large number of multiplications, obtaining the solution Eq. (7) can be easily accelerated by multiple CPUs or GPUs.
Moreover, we propose a fast algorithm to solve Eq. (9) in a component-wise manner. When estimating of a certain , we fix other component vectors as constants. At each iteration, we first compute and then get the estimation through Eq. (7). Specifically, let denotes the estimation of the -th mode component vector at the -th iteration,
The algorithm is summarized in Algorithm 1.
(C-Sparse) The coefficient component is exactly sparse with non-zero elements.
(C-Ridge) Let be the singular vectors of
corresponding to the singular values. Here, is the rank of . Let . Then, with some equence .
6.1 Complexity analysis
When estimating for each mode, the time complexity is dominated by computing , which costs . Once the projection is obtained, is calculated through Eq. (7) with time complexity. As the sub-tasks that estimating for each are independent to each other, these sub-tasks can be optimized parallelly. Furthermore, notice that the time cost of computing through Eq. (7) can be easily reduced making use of multiple threads or GPUs, this part of time becomes negligible. Therefore, integrating all the ingredients, the time complexity of our method is .
Apart from the computationally efficiency of FasTR, our method also acquires small number of memory space. Our method requires two parts of memory space, one of which is used for storing components and another is for dealing computations on . Because, among all the computations about the projection, needs the most memory space, the second part requires space. Therefore, totally, the memory complexity of FasTR is .
6.2 Relevance to previous works
Many methods have been proposed in the literature of regression tasks on higher-order tensor data. In this paper, we focus on the setting that the variables are represented by a tensor while the responses are denoted by a vector . Several models were already recently proposed to estimate the coefficient tensor for this specific, what we call, higher-order tensor regression problem.
One group of these methods is the direct extension of regularized linear regression. Naively, one way to solve this regression problem is vectorization. All the elements in the tensor are first stacked into a vector and then existing linear regression models can be applied to it. One obvious shortcoming of vectorization is that it will cause a loss of latent structural information of the data. To reserve certain potential information,  is proposed to estimate a sparse and low-rank coefficient tensor, by integrating the tensor nuclear norm and norm into the optimization problem. Notice that in Remurs, the tensor nuclear norm is approximated by the summation of ranks of several unfolded matrices. Remurs still discards some structural information when unfolding the tensor into matrices, although it outperforms than vectorization generally. In addition,  improve Remurs by substituting the nuclear norm into Tubal nuclear norm [28, 27]
, which can be efficiently solved through discrete Fourier transform. However, the tensor unfolding is still required. Furthermore, these methods are also computationally expensive because the non-differential regularizer,norm or nuclear norm exists in their objective function. Therefore, currently, this group of methods is not a good choice for higher-order tensor regression.
To reserve the latent structure when dealing with tensors, another prevailing group of methods [7, 29, 4, 22] are proposed based on CANDECOMP/PARAFRAC decomposition. Generally, instead of directly estimate the coefficient tensor , we aim at inferring every component vector in each sub-task. For example,  propose GLTRM using generalized linear model (GLM) to solve each sub-task. Moreover, orTRR is proposed in  enforcing sparsity and low-rankness on the estimated coefficient tensor. Instead of norm, orTRR utilize norm to obtain the sparsity. In addition, recently,  propose SURF exploiting divide-and-conquer strategy where the sub-task has a similar formulation of Elastic Net . In the paper of SURF, authors empirically show that their method can converge, but a statistical convergence rate is not proved. On the contrary, in this paper, we theoretically prove the error bound of our method. Noticeably, the main limitation of CANDECOMP/PARAFAC-decomposition-based method is that the decomposition rank should be pre-specified, however, we generally have no prior knowledge about the tensor rank in real-world applications. Although orTRR is able to automatically obtain a low-rank result, the estimated result is sub-optimal due to the fact that norm is inferior to norm in the sparse setting. Hence, these methods are not suitable for real-world applications.
Some other models were introduced previously for other problem settings. Recently, [5, 9, 25] propose models for non-parametric estimation by assuming that the response , making use of either additive model or Gaussian process. Apart from the above-mentioned ones, many models [20, 17, 26, 30, 14] were put forward to estimate the relationship between the variable tensor and a response tensor . Another line of statistical models involving tensor data is tensor decomposition [11, 21, 1, 12, 15]. Tensor decomposition can be considered as an unsupervised problem which aims at approximating the tensor with lower-order data. On the contrary, our FasTR is a supervised method estimating the latent relationship between variables and responses. Because these methods have different objectives from our method, we pay little attention to them and exclude them from experiments. In section 7, we compare FasTR with several introduced higher-order tensor regression methods, including Lasso, Elastic Net, Remurs, GLTRM, and SURF.
7.1 Experiment Setups
We experiment on several simulated datasets and a real-world datasets with nine projects to compare the performance of our method with previous methods. We use four previous methods as baselines, which are 1) Linear Regression (Lasso and Elastic Net), 2) Remurs, 3) SURF, and 4) GLTRM. Specifically,
We use three metrics to evaluate the performance of our method and baselines, including 1) time cost, 2) mean squared error (MSE), and 3) coefficient error (CE). Here, and .
Furthermore, simulated data are generated through following three steps:
Step 1: For to , generate
with each element derived from the Gaussian distribution; given the sparsity degree , randomly set elements of to be ; .
Step 2: Generate while each element of is derived from the distribution .
Step 3: Generate the response with respect to . Here, controls the degree of noise and each element of is generated from .
In addtion to simulated datasets, we also use CMU2008 fMRI dataset to show the superiority of our method.
7.2 Experiments on simulated data
When generating simulated datasets, we let the sparsity degree and noise degree . Out of fairness, we set the maximal number of iterations to be for all the methods and let the method terminate when . Moreover, all the tuning parameters of each method are selected through 5-fold cross-validation. The detailed interval of tuning parameters is shown in the appendix. For every single experiment, we run each method 20 times and average the metrics’ value over these 20 trials.
To evaluate the superiority of our method, we generate both 2D and 3D datasets varying the data size and the number of samples. For linear regression (LR), we use Lasso and Elastic Net and we report the value of metrics for the method which obtains the better MSE. In Figure 1, we show the time cost of each method. Notice that because SURF is infeasible for 2D data and GLTRM is infeasible for 3D data, we discard these two methods in the sub-figure correspondingly. Moreover, for 3D data, SURF obtains MSE values much worse that other methods (see Table 1), thus, we discard SURF in Figure 1 due to its terrible performance. We can see that our FasTR outperforms other baselines and as the dimensionality of data increases, the speeds up of FasTR become more and more obvious. Specifically, the two linear regression methods are not able to obtain a solution in less than 90 seconds for data with a shape of , while other methods cost no more than 4 seconds. Furthermore, in Table 1 reports the MSE and CE values of every method on each dataset. Noticeably, FasTR has better MSE and CE under every setting, compared with baselines, for both 2D and 3D data. Specifically, LR can not compute an estimation because the code throws segmentation fault when the size of data is and , which makes it infeasible for large-scale data. Notice that the performance of linear regression methods are worse than tensor regression methods, which coincide with our statement that the vectorization can do harm to the structural information of data. To summarize, for large-scale data, FasTR is able to obtain a better solution with a much less time cost.
|1-15[0.8pt/2pt] 3D Data|
In high-dimensional settings, we generate a 3D dataset with a shape of and vary the number of samples from to . The sparsity level is set to and the noise coefficient is fixed to . Apparently, Table 2 indicates that for every , FasTR has much lower MSE value, which indicates that FasTR outperform baselines on large-scale and high-dimensional datasets. The MSE value does not reduce along with the increment of the number of samples, which might be a general thought. Because for every , the simulated dataset is generated separately, meaning that these datasets with different have no relevance. Therefore, in this experiment, the MSE does not have to be improved when more samples are provided. However, in every condition, we show that FasTR has better performance.
At last, to test the sensitivity of FasTR for different sparsity levels, we vary the sparsity level when generating simulated datasets with the shape of and the . The noise coefficient is set to 0.2. Varying the sparsity level from , the time cost and MSE of every method, except GLTRM, are reported in Figure 2. For different sparsity level, FasTR retains the superiority among all the methods. Although under too sparse conditions where the sparsity level is larger than , FasTRsignificantly has lower MSE than others. In a word, FasTR obtains better performance on datasets with different level of sparsity, compared with previous methods.
7.3 Experiments on real-world data
In this section, we perform fMRI classification tasks on CMU2008 datasets  with 9 projects in total. Each sample of this dataset is a 3-mode tensor of size (71553 voxels). This classification task aims to predict human activities associated with recognizing the meanings of nouns. Following [10, 18], we focus on classifications of binary classes: “tools” and “animals”. Here, the class “tool” combines observations from “tool” and “furniture”, class “animal” combines observations from “animal” and “insect” in the CMU2008 dataset. Like simulated experiments, values of tuning parameters of each method are selected through 5-fold cross-validation. For each subject, we split the entire dataset into the training dataset and testing dataset with the proportion of and respectively and use AUC to evaluate classification results. Results are shown in Table 3, which indicates that FasTR obtains the best AUC value for most cases. Notice that although SURF has the lowest time cost, the AUC of its solution is around 0.5 and drops below 0.5 sometimes. Hence, we think the classification result of SURF is unacceptable. One interesting result occurs on project #1, where linear regression methods obtain a much better solution. We think the reason might be that in this subject, the voxels are independent, hence, the data has no latent structure. Anyway, FasTR has a significant performance on a real-world fMRI dataset.
Genevera Allen, ‘Sparse higher-order principal components analysis’, inArtificial Intelligence and Statistics, pp. 27–36, (2012).
-  Emmanuel Candes, Terence Tao, et al., ‘The dantzig selector: Statistical estimation when p is much larger than n’, The annals of Statistics, 35(6), 2313–2351, (2007).
-  J Douglas Carroll and Jih-Jie Chang, ‘Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition’, Psychometrika, 35(3), 283–319, (1970).
-  Weiwei Guo, Irene Kotsia, and Ioannis Patras, ‘Tensor learning for regression’, IEEE Transactions on Image Processing, 21(2), 816–827, (2011).
-  Botao Hao, Boxiang Wang, Pengyuan Wang, Jingfei Zhang, Jian Yang, and Will Wei Sun, ‘Sparse tensor additive regression’, arXiv preprint arXiv:1904.00479, (2019).
-  Richard A Harshman et al., ‘Foundations of the parafac procedure: Models and conditions for an" explanatory" multimodal factor analysis’, (1970).
-  Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang, ‘Boosted sparse and low-rank tensor regression’, in Advances in Neural Information Processing Systems, pp. 1009–1018, (2018).
-  Frank L Hitchcock, ‘The expression of a tensor or a polyadic as a sum of products’, Journal of Mathematics and Physics, 6(1-4), 164–189, (1927).
-  Masaaki Imaizumi and Kohei Hayashi, ‘Doubly decomposing nonparametric tensor regression’, in International Conference on Machine Learning, pp. 727–736, (2016).
Kittipat Kampa, S Mehta, Chun-An Chou, Wanpracha Art Chaovalitwongse, and Thomas J Grabowski, ‘Sparse optimization in feature selection: application in neuroimaging’,Journal of Global Optimization, 59(2-3), 439–457, (2014).
-  Tamara G Kolda and Brett W Bader, ‘Tensor decompositions and applications’, SIAM review, 51(3), 455–500, (2009).
-  Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc, ‘Model-driven sparse cp decomposition for higher-order tensors’, in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1048–1057. IEEE, (2017).
-  Wenwen Li, Jian Lou, Shuo Zhou, and Haiping Lu, ‘Sturm: Sparse tubal-regularized multilinear regression for fmri’, in International Workshop on Machine Learning in Medical Imaging, pp. 256–264. Springer, (2019).
-  Yimei Li, Hongtu Zhu, Dinggang Shen, Weili Lin, John H Gilmore, and Joseph G Ibrahim, ‘Multiscale adaptive regression models for neuroimaging data’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 559–578, (2011).
-  Oscar Hernan Madrid-Padilla and James Scott, ‘Tensor decomposition with generalized lasso penalties’, Journal of Computational and Graphical Statistics, 26(3), 537–546, (2017).
-  Tom M Mitchell, Svetlana V Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L Malave, Robert A Mason, and Marcel Adam Just, ‘Predicting human brain activity associated with the meanings of nouns’, science, 320(5880), 1191–1195, (2008).
-  Garvesh Raskutti, Ming Yuan, Han Chen, et al., ‘Convex regularization for high-dimensional multiresponse tensor regression’, The Annals of Statistics, 47(3), 1554–1584, (2019).
-  Xiaonan Song and Haiping Lu, ‘Multilinear regression for embedded feature selection with application to fmri analysis’, in Thirty-First AAAI Conference on Artificial Intelligence, (2017).
-  Volker Strassen, ‘Gaussian elimination is not optimal’, Numerische mathematik, 13(4), 354–356, (1969).
-  Will Wei Sun and Lexin Li, ‘Store: sparse tensor response regression and neuroimaging analysis’, The Journal of Machine Learning Research, 18(1), 4908–4944, (2017).
-  Will Wei Sun, Junwei Lu, Han Liu, and Guang Cheng, ‘Provable sparse tensor decomposition’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 899–916, (2017).
-  Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang, ‘Logistic tensor regression for classification’, in International Conference on Intelligent Science and Intelligent Data Engineering, pp. 573–581. Springer, (2012).
-  Robert Tibshirani, ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288, (1996).
-  Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar, ‘Elementary estimators for high-dimensional linear regression’, in International Conference on Machine Learning, pp. 388–396, (2014).
-  Rose Yu, Guangyu Li, and Yan Liu, ‘Tensor regression meets gaussian processes’, arXiv preprint arXiv:1710.11345, (2017).
-  Rose Yu and Yan Liu, ‘Learning from multiway data: Simple and efficient tensor regression’, in International Conference on Machine Learning, pp. 373–381, (2016).
-  Zemin Zhang and Shuchin Aeron, ‘Exact tensor completion using t-svd’, IEEE Transactions on Signal Processing, 65(6), 1511–1526, (2016).
-  Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, and Misha Kilmer, ‘Novel methods for multilinear data completion and de-noising based on tensor-svd’, in , pp. 3842–3849, (2014).
-  Hua Zhou, Lexin Li, and Hongtu Zhu, ‘Tensor regression with applications in neuroimaging data analysis’, Journal of the American Statistical Association, 108(502), 540–552, (2013).
-  Hongtu Zhu, Yasheng Chen, Joseph G Ibrahim, Yimei Li, Colin Hall, and Weili Lin, ‘Intrinsic regression models for positive-definite matrices with applications to diffusion tensor imaging’, Journal of the American Statistical Association, 104(487), 1203–1212, (2009).
-  Hui Zou and Trevor Hastie, ‘Regularization and variable selection via the elastic net’, Journal of the royal statistical society: series B (statistical methodology), 67(2), 301–320, (2005).