Sparse and Low-Rank Tensor Regression via Parallel Proximal Method

11/29/2019
by   Jiaqi Zhang, et al.

Motivated by applications in various scientific fields that require predicting the relationship between higher-order (tensor) features and a univariate response, we propose a Sparse and Low-rank Tensor Regression model (SLTR). This model enforces sparsity and low-rankness of the tensor coefficient by directly applying the ℓ_1 norm and the tensor nuclear norm to it, respectively, so that (1) the structural information of the tensor is preserved and (2) the fitted model is easy to interpret. To make the solving procedure scalable and efficient, SLTR optimizes the two norm regularizers with the proximal gradient method, which can easily be implemented in parallel. Additionally, a tighter convergence rate is proved for three-order tensor data. We evaluate SLTR on several simulated datasets and one fMRI dataset. Experiment results show that, compared with previous models, SLTR obtains a solution no worse than theirs at a much lower time cost.


1 Introduction

Higher-order data, also called tensors, frequently occur in various scientific and real-world applications. In neuroscience, for example, functional magnetic resonance imaging (fMRI) produces such tensor data, consisting of a series of brain scan images; the data can therefore be characterized as 3D (3-mode) data with the shape time × neuron × neuron. In many fields, we encounter the problem of analyzing the relationship between a tensor variable X_i and a scalar response y_i for every sample i. Specifically, we assume

y_i = ⟨X_i, W⟩ + ε_i,    (1)

in which ⟨·, ·⟩ is the inner product operator, ε_i is the noise, and W is the coefficient tensor that needs to be estimated through regression. Notice that, in the real world, such tensor data generally have two properties that make the coefficient difficult to infer accurately:

(1) Ultra-high-dimensional setting, where the number of samples is much smaller than the number of variables. For example, each sample of the CMU2008 dataset [17] is a 3D tensor of shape 51 × 61 × 23, i.e., 71,553 voxels in total, yet only 360 trials are recorded. The high-dimensional setting makes the estimation break down because we are trying to infer a large model from a limited number of observations. (2) Higher-order structure of data. Higher-order structure exists in many fields, such as fMRI and videos, whose data have the shape time × pixel × pixel. Traditional machine learning methods are designed for vectors or matrices; hence, dealing with higher-order data can be difficult. In the past years, many methods have been introduced to address these two problems.

To preserve the spatial structure, several methods have been introduced based on the CANDECOMP/PARAFAC (CP) decomposition, which approximates an M-order tensor X through

X ≈ Σ_{r=1}^{R} a_r^(1) ∘ a_r^(2) ∘ ⋯ ∘ a_r^(M),    (2)

where ∘ denotes the vector outer product and a_r^(m) is the r-th component vector of mode m.

Here, R is defined as the CP-rank of the tensor X. For instance, [31] propose GLTRM, which first decomposes the coefficient tensor and then applies the generalized linear model to estimate each component vector. In addition, [7] propose SURF, which uses a divide-and-conquer strategy for each component vector. Almost all CANDECOMP/PARAFAC-decomposition-based methods, including GLTRM, suffer from the problem that the CP-rank R must be pre-specified; however, we usually have no prior knowledge about the value of R. Even if we can use techniques such as cross-validation to estimate R from the data, the solving procedure becomes tedious and computationally expensive for large-scale data. A method called orTRR, previously proposed in [5], automatically obtains a low-rank coefficient without pre-specifying R, but orTRR uses the ℓ_2 norm rather than the ℓ_1 norm to recover the sparsity of the data, which makes its performance on variable selection poorer than that of other methods.
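
To make Eq. (2) concrete, here is a minimal NumPy sketch of rebuilding a tensor from CP factor matrices; the factor shapes and rank below are illustrative and not taken from the paper.

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild the tensor sum_{r=1}^{R} a_r^(1) o ... o a_r^(M) of Eq. (2).
    factors[m] has shape (I_m, R): column r is the mode-m component vector a_r^(m)."""
    R = factors[0].shape[1]
    X = np.zeros(tuple(f.shape[0] for f in factors))
    for r in range(R):
        component = factors[0][:, r]
        for f in factors[1:]:
            component = np.multiply.outer(component, f[:, r])  # outer product, one mode at a time
        X += component
    return X

# Example: a rank-2 approximation of a 3-order tensor of shape (4, 5, 6).
factors = [np.random.randn(dim, 2) for dim in (4, 5, 6)]
print(cp_reconstruct(factors).shape)  # (4, 5, 6)
```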

To address this problem, some methods directly place constraints on the coefficient tensor. For example, [20] propose Remurs, which exploits the commonly used ℓ_1 norm to enforce sparsity on the estimated coefficient tensor; in addition, a nuclear norm term is attached to make the solution low-rank. Remurs is solved through the ADMM framework [2]. Although it is able to obtain an acceptable solution, Remurs has an expensive time cost on large-scale data.

In this paper, we derive ideas from a scalable estimator, the Elementary Estimator [24], and propose Sparse and Low-Rank Higher-Order Tensor Regression (SLTR), which directly enforces sparsity and low-rankness on the coefficient tensor with the ℓ_1 norm and the tensor nuclear norm, respectively. We also propose a fast algorithm that makes use of the parallel proximal method. Because the solution can be obtained in a two-layer parallel manner, the algorithm can be implemented in parallel through multi-threaded computation or GPUs; thus, the solution of SLTR can be obtained with a small time complexity. We empirically show that the parallel version of SLTR is faster than previous methods while obtaining a solution that is no worse. See details in Section 6 and Section 7. To summarize, this paper makes the following contributions:

  • A sparse and low-rank tensor regression model: Deriving ideas from the Elementary Estimator, in Section 4 we propose a sparse and low-rank tensor regression model based on the ℓ_1 norm and the tensor nuclear norm, respectively.

  • A fast and scalable algorithm for the model: We also provide a fast solution to our model, obtained through two-layer parallelization. In Section 6 we prove that SLTR has lower time complexity than its counterparts.

  • A theoretical convergence rate for sparse and low-rank tensor regression: In Section 5, we theoretically prove a sharp error bound for our model. Specifically, we provide a more exact error bound for 3D data. To the best of our knowledge, we are the first to prove the error bound of sparse and low-rank tensor regression with the ℓ_1 norm and the nuclear norm applied directly to the tensor.

  • Experiments on a real-world fMRI dataset: We evaluate SLTR and four baseline methods on several simulated datasets and one fMRI dataset with nine projects. Results show that SLTR is faster and more stable than previous methods. See the detailed explanation in Section 7.

2 Notations

‖·‖_1 denotes the element-wise ℓ_1 norm, ‖·‖_* denotes the nuclear norm, ‖·‖_2 denotes the ℓ_2 norm, and ‖·‖ denotes the spectral norm. Unfolding a tensor W along its m-th mode is denoted as W_(m), where the columns of W_(m) are the mode-m fibers of W. Its opposite operator is denoted as fold_m(·). The operation ten(·) converts a vector into a tensor.
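
A minimal NumPy sketch of the mode-m unfolding and its inverse (the helper names unfold and fold are ours; the paper's MATLAB implementation is not reproduced here):

```python
import numpy as np

def unfold(W, m):
    """Mode-m unfolding W_(m): each column is a mode-m fiber of W."""
    return np.moveaxis(W, m, 0).reshape(W.shape[m], -1)

def fold(Wm, m, shape):
    """Inverse of unfold: rebuild a tensor of the given shape from W_(m)."""
    lead = [shape[m]] + [s for i, s in enumerate(shape) if i != m]
    return np.moveaxis(Wm.reshape(lead), 0, m)

W = np.random.randn(3, 4, 5)
assert np.allclose(fold(unfold(W, 1), 1, W.shape), W)  # the round trip recovers W
```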

3 Background

3.1 Elementary estimator for linear regression models

In the literature on linear regression (LR), several methods have been proposed to address the difficulty of estimation in the high-dimensional setting, in which the number of samples is less than the number of variables. For instance, Lasso attaches an ℓ_1 norm to the ordinary least squares (OLS) problem in order to reduce the number of variables, and it can be solved iteratively via the proximal gradient method. On the other hand, [3] further propose the Dantzig selector for estimation in high-dimensional settings; the Dantzig selector is solved through linear programming (LP), which has a high time complexity when an interior-point method is used. These previous methods have prohibitively large time complexity when there is a large number of variables. Therefore, to address this problem, [24] propose a fast estimator for high-dimensional linear regression named the Elementary Estimator (EE). Given vector samples x_i and scalar responses y_i, EE aims to solve

min_θ  R(θ)    s.t.    R*(θ − θ̂_n) ≤ λ_n    (3)

Here, R(·) is an arbitrary regularizer, R*(·) is its dual norm, θ̂_n is a simple closed-form initial estimate, and λ_n is a tuning parameter. EE shares similarities with the Dantzig selector; however, it has been shown in [24] that EE is more practically scalable in high-dimensional settings.
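
When R(·) is the ℓ_1 norm, its dual norm is the ℓ_∞ norm, so the constraint in Eq. (3) becomes an elementwise box around the initial estimate and the problem has a closed-form solution: soft-thresholding. A minimal sketch, where the closed-form initial estimate theta_init is treated as given (how [24] construct it is not shown here):

```python
import numpy as np

def ee_l1(theta_init, lam):
    """Solve  min ||theta||_1  s.t.  ||theta - theta_init||_inf <= lam
    by elementwise soft-thresholding of the initial estimate."""
    return np.sign(theta_init) * np.maximum(np.abs(theta_init) - lam, 0.0)

theta_init = np.array([2.5, -0.3, 0.0, 1.1])
print(ee_l1(theta_init, lam=0.5))  # [ 2.  -0.   0.   0.6]
```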

Furthermore, [25] propose the superposition-structured elementary estimator, which combines several such estimators and has the formulation

(4)
s.t.

This class of superposition-structured estimators can be solved by solving every single estimator independently and in parallel.

3.2 Regularized matrix regression

Although plenty of methods have been proposed for LR, they cannot be directly used in tasks with matrix data, such as digital images. One intuitive way is to first vectorize the two-dimensional data and then apply existing LR methods. However, this destroys the higher-order structure of the data, such as spatial coherence. Therefore, [30] extend ordinary linear regression to two-dimensional data and propose regularized matrix regression. Because, for 2D data, the true coefficient can often be well approximated by a low-rank structure, regularized matrix regression exploits the nuclear norm. Given matrix covariates X_i and scalar responses y_i, regularized matrix regression solves

min_B  (1/2) Σ_{i=1}^{n} (y_i − ⟨X_i, B⟩)^2 + λ ‖B‖_*    (5)

Here, ‖·‖_* is the nuclear norm and λ controls the rank of the estimated coefficient B. Although effective for second-order data, this matrix regression model is infeasible for higher-order tensor data.

4 Method

Because Eq. (5) is introduced only for matrix data, a new model is needed for problems with higher-order tensor data. Therefore, deriving ideas from Eq. (5), [20] propose a tensor regression method named Remurs, which combines the low-rank constraint (nuclear norm) with the sparsity constraint (ℓ_1 norm). Specifically, given M-order tensor samples X_i and corresponding responses y_i, Remurs solves

min_W  (1/2) Σ_{i=1}^{n} (y_i − ⟨X_i, W⟩)^2 + λ_1 ‖W‖_* + λ_2 ‖W‖_1    (6)

Notice that the matrix nuclear norm regularization cannot be directly utilized for higher-order tensors. Moreover, as [8] state, the computation of the tensor rank is NP-hard; therefore, we use the convex tensor nuclear norm regularization [15]

‖W‖_* = Σ_{m=1}^{M} α_m ‖W_(m)‖_*    (7)

as the convex relaxation of the tensor rank. Remurs exploits the ADMM method for estimation, which is computationally expensive for a large problem. To address this issue, deriving ideas from EE (Eq. (3)), we propose our Sparse and Low-Rank Higher-Order Tensor Regression (SLTR) model. SLTR aims to solve

(8)
s.t.

Here, there are three tuning parameters, the nuclear norm takes the form of Eq. (7), and its dual norm appears in the constraints. Moreover, fold_m(·) reshapes a matrix back into a tensor. Notice that, due to the definition of the elementwise ℓ_1 norm, we have ‖W‖_1 = ‖W_(m)‖_1 for every mode m. Integrating this with the definition of the tensor nuclear norm, SLTR can be decomposed into

(9)

After estimating the M mode-wise solutions, the final estimate is obtained by averaging them through

Ŵ = (1/M) Σ_{m=1}^{M} fold_m(Ŵ_(m)).    (10)
Remark 4.1

The tensor nuclear norm in Eq. (7) that we use here is based on unfolded tensors. Unfolding a tensor loses certain information and requires a large number of computations. Directly applying low-rank constraints to the tensor itself rather than its unfolded versions could be explored in future work.
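
For concreteness, a sketch of the unfolding-based tensor nuclear norm in Eq. (7); the uniform weights 1/M are an assumption here, since the paper's exact weights are not shown.

```python
import numpy as np

def unfold(W, m):
    # mode-m unfolding, as in the Section 2 sketch
    return np.moveaxis(W, m, 0).reshape(W.shape[m], -1)

def tensor_nuclear_norm(W, weights=None):
    """Weighted sum of the nuclear norms of the mode-m unfoldings, as in Eq. (7)."""
    M = W.ndim
    weights = weights if weights is not None else [1.0 / M] * M  # uniform weights (assumption)
    return sum(w * np.linalg.norm(unfold(W, m), ord='nuc') for m, w in enumerate(weights))

W = np.random.randn(3, 4, 5)
print(tensor_nuclear_norm(W))
```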

4.1 Two-layer parallel solution

In this section, we propose an algorithm that solves SLTR quickly and accurately. In Eq. (9), although SLTR is decomposed into M subproblems, these subproblems are not independent and thus cannot be solved separately, because they share the same elements. To overcome this drawback, we introduce several auxiliary variables into Eq. (9). Then, for each mode m, each subproblem can be solved separately as

(11)

Therefore, SLTR can be solved in a parallel manner.

Moreover, Eq. (11) can be further optimized in parallel through the parallel proximal algorithm [4]. Specifically, to simplify the presentation, we abbreviate the mode-m variables and convert Eq. (11) for each mode to

(12)
s.t.

Here, the objective of Eq. (12) is split into four functions, and ι_C(·) denotes the indicator function of a set C, defined as ι_C(x) = 0 if x ∈ C and +∞ otherwise. To solve Eq. (12), we propose a parallel proximal method based algorithm, summarized in Algorithm 1. Due to space limitations, the four proximal operators of these functions are given in the appendix.
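
The appendix is not reproduced here. For reference, the proximal operators that such a splitting typically relies on are standard: soft-thresholding for the ℓ_1 norm, singular-value thresholding for the nuclear norm, and a projection for an indicator function. The sketch below shows these generic forms (the box projection for the indicator is a plausible EE-style constraint, not necessarily the paper's exact set C):

```python
import numpy as np

def prox_l1(V, t):
    """prox of t*||.||_1: elementwise soft-thresholding."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def prox_nuclear(V, t):
    """prox of t*||.||_* for a matrix V: singular-value thresholding."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def prox_indicator_box(V, center, radius):
    """prox of the indicator of {X : ||X - center||_inf <= radius}: projection onto the box."""
    return np.clip(V, center - radius, center + radius)
```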

  Input: tensor observations X_i, the corresponding observed responses y_i, the maximum number of iterations iter, and the tuning parameters.
  1:  Initialize the auxiliary variables.
  2:  for each mode m = 1 to M do (the mode-wise loops are independent and can run in parallel)
  3:     for t = 1 to iter do
  4:        evaluate the proximal operator of each of the four functions (these evaluations are independent and run in parallel)
  5:        combine the resulting proximal points
  6:        for each auxiliary variable do
  7:           update it using the combined proximal point
  8:        end for
  9:        update the current mode-m iterate
 10:     end for
 11:     record the mode-m estimate
 12:  end for
  Output: the final estimate Ŵ, averaged over the modes as in Eq. (10).
Algorithm 1   Parallel Proximal Based Algorithm for SLTR
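
The inner loop follows the parallel proximal algorithm (PPXA) of [4]. The generic PPXA iteration for minimizing a sum of convex functions through their proximal operators is sketched below; the specific functions, variable shapes, and step sizes used by Algorithm 1 are not reproduced, so this illustrates the iteration pattern rather than SLTR itself.

```python
import numpy as np

def ppxa(prox_list, x0, gamma=1.0, relax=1.0, iters=100):
    """Generic parallel proximal algorithm for min_x sum_i f_i(x).
    prox_list[i](v, t) must return prox_{t * f_i}(v). The prox evaluations inside
    one iteration are independent of each other, so they can run in parallel."""
    n = len(prox_list)
    w = 1.0 / n                               # equal weights on the functions
    y = [x0.copy() for _ in range(n)]         # one auxiliary variable per function
    x = x0.copy()
    for _ in range(iters):
        p = [prox(y[i], gamma / w) for i, prox in enumerate(prox_list)]  # parallelizable step
        p_bar = sum(p) / n                    # average of the proximal points
        y = [y[i] + relax * (2 * p_bar - x - p[i]) for i in range(n)]
        x = x + relax * (p_bar - x)
    return x
```

For example, wrapping prox_l1 and prox_indicator_box from the previous sketch so that each accepts (v, t) and passing them to ppxa yields a small solver for an ℓ_1 objective under a box constraint, which mirrors the structure of the EE-style subproblems.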

5 Theorem

Proposition 5.1

The tensor nuclear norm is a norm function and it is decomposable [18]. Specifically, given a suitable pair of subspaces (A, B), we have ‖U + V‖ = ‖U‖ + ‖V‖ for all U ∈ A and V ∈ B.

(C1: sparse) The true coefficient W* is exactly sparse with s non-zero elements.

(C2: low-rank) The true coefficient W* is a rank-r tensor, where r denotes the orthogonal rank of the tensor.

Suppose that the true coefficient tensor W* satisfies conditions (C1: sparse) and (C2: low-rank). Furthermore, suppose we solve Eq. (8) with tuning parameters satisfying the constraints. Then the optimal solution Ŵ satisfies the following error bound:

(13)
Corollary 5.1

If the coefficient is a three-mode tensor and conditions (C1: sparse) and (C2: low-rank) hold for the true coefficient W*, then the optimal solution Ŵ of Eq. (8) satisfies the following error bound:

(14)

where r_m denotes the rank of the mode-m unfolding W*_(m).

For each subproblem Eq. (11), every sequence generated by Algorithm 1 converges to a solution to Eq. (11).

6 Discussion

6.1 Complexity analysis

We begin with the time complexity analysis. First of all, each tensor sample needs to be vectorized once. Then the initial approximation is computed in closed form. Next, the subproblems of all modes are solved simultaneously using the parallel proximal method. The time complexity of solving one subproblem in parallel is dominated by the most expensive proximal operator, which is the SVD step. In conclusion, by virtue of our two-layer parallel solution, the total time of solving SLTR is dominated by the cost of a single SVD per iteration.

As for the memory complexity, this is the bottleneck of our approach. Because of the parallel proximal method, many auxiliary variables are needed; in total, several coefficient-sized variables are stored for each mode, which is quite expensive for large data. We leave improvements of the memory complexity to future work.

6.2 Relevance to previous works

Many methods have been proposed in the literature for regression tasks on higher-order tensor data. In this paper, we focus on the setting in which the variables are represented by a tensor while the responses form a vector. Several models have recently been proposed to estimate the coefficient tensor for this specific problem, which we call higher-order tensor regression.

One group of these methods is the direct extension of regularized linear regression. Naively, one way to solve this regression problem is vectorization: all the elements of the tensor are first stacked into a vector, and then existing linear regression models can be applied to it. One obvious shortcoming of vectorization is that it causes a loss of the latent structural information of the data. To preserve such information, Remurs [20] was proposed to estimate a sparse and low-rank coefficient tensor by integrating the tensor nuclear norm and the ℓ_1 norm into the optimization problem; in Remurs, the tensor nuclear norm is approximated by the summation of the nuclear norms of several unfolded matrices. In addition, [13] improve Remurs by substituting the tubal nuclear norm [29, 28] for the nuclear norm, which can be solved efficiently through the discrete Fourier transform. However, these methods are computationally expensive because non-differentiable regularizers, the ℓ_1 norm and the nuclear norm, appear in their objective functions. Therefore, this group of methods is currently not a good choice for higher-order tensor regression.

To preserve the latent structure when dealing with tensors, another prevailing group of methods [7, 31, 5, 23] is based on the CANDECOMP/PARAFAC decomposition. Generally, instead of directly estimating the coefficient tensor, these methods aim at inferring every component vector in each sub-task. For example, [31] propose GLTRM, which uses the generalized linear model (GLM) to solve each sub-task. Moreover, orTRR [5] enforces sparsity and low-rankness on the estimated coefficient tensor; instead of the ℓ_1 norm, orTRR utilizes the ℓ_2 norm to obtain sparsity. In addition, [7] recently propose SURF, which exploits a divide-and-conquer strategy in which each sub-task has a formulation similar to the Elastic Net [33]. In the SURF paper, the authors empirically show that their method converges, but a statistical convergence rate is not proved; in contrast, in this paper we theoretically prove the error bound of our method. Notably, the main limitation of CANDECOMP/PARAFAC-decomposition-based methods is that the decomposition rank must be pre-specified, yet we generally have no prior knowledge about the tensor rank in real-world applications. Although orTRR is able to automatically obtain a low-rank result, its estimate is sub-optimal because the ℓ_2 norm is inferior to the ℓ_1 norm in the sparse setting. Hence, these methods are not well suited to real-world applications.

Some other models have been introduced for other problem settings. Recently, [6, 9, 26] proposed models for non-parametric estimation of the response, making use of either additive models or Gaussian processes. Apart from the above-mentioned ones, many models [21, 19, 27, 32, 14] have been put forward to estimate the relationship between a variable tensor and a response tensor. Another line of statistical models involving tensor data is tensor decomposition [11, 22, 1, 12, 16]; tensor decomposition can be considered an unsupervised problem that aims at approximating a tensor with lower-order components, whereas our SLTR is a supervised method that estimates the latent relationship between variables and responses. Because these methods have objectives different from ours, we pay little attention to them and exclude them from the experiments. In Section 7, we compare SLTR with several of the introduced higher-order tensor regression methods, including Lasso, Elastic Net, Remurs, GLTRM, and SURF.

7 Experiment

7.1 Experiment Setups

Baselines and metrics:

To compare with our method, we use four previous approaches as baselines: 1) linear regression, specifically Lasso and Elastic Net (with the trade-off ratio between the ℓ_1 and ℓ_2 norms set to 0.5); 2) Remurs; 3) SURF; and 4) GLTRM. Furthermore, we use three metrics to evaluate the performance of our method and the baselines: 1) time cost, 2) coefficient error, i.e., the discrepancy between the estimated and the true coefficient tensors, and 3) mean squared error (MSE).

Datasets:

We compare SLTR with the baselines on simulated datasets and a real-world fMRI dataset. Specifically, our simulated datasets are generated through several steps. First, generate the samples X_i and the coefficient tensor W with each element drawn from the normal distribution. Then, randomly set a fraction of the elements of W, determined by the sparsity level, to zero. Compute the response according to Eq. (1), where a noise coefficient controls the ratio of the noise, which is also generated from the normal distribution. In addition to the simulated datasets, we further evaluate SLTR on the CMU2008 fMRI dataset [17].
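
A minimal NumPy sketch of this generation procedure; the sample count, tensor shape, sparsity level, and noise coefficient below are illustrative placeholders, not the paper's exact settings.

```python
import numpy as np

def make_dataset(n, shape, sparsity=0.1, noise=0.1, seed=0):
    """Generate n tensor samples X_i and responses y_i = <X_i, W> + noise * eps_i,
    where W is a sparse coefficient tensor (about `sparsity` fraction of nonzeros)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal(shape)
    W[rng.random(shape) > sparsity] = 0.0            # zero out all but ~sparsity of the entries
    X = rng.standard_normal((n,) + tuple(shape))
    y = np.tensordot(X, W, axes=W.ndim) + noise * rng.standard_normal(n)  # <X_i, W> + noise
    return X, y, W

X, y, W = make_dataset(n=160, shape=(20, 20, 5))
print(X.shape, y.shape)  # (160, 20, 20, 5) (160,)
```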

Other setups:

In the experiments, the parameters of all methods are selected through a 5-fold cross-validation procedure that takes the average MSE on the validation sets as the selection criterion. Detailed descriptions of the ranges of the tuning parameters are given in the appendix. As for the experiment environment, all experiments are run on a Linux server with 2 Intel Xeon Silver 4216 CPUs and 256 GB of RAM. Moreover, our SLTR is implemented in MATLAB. For fairness, we set the same maximal number of iterations for all methods and let them terminate under the same stopping criterion. We run every single experiment 20 times and report the metrics averaged over these 20 trials.

7.2 Experiments on simulated data

First, we evaluate SLTR on both 2D and 3D simulated datasets. The shape of the data varies from 15 × 15 to 30 × 30 for 2D data and from 10 × 10 × 5 to 40 × 40 × 5 for 3D data (see Table 1). We fix the sparsity level and the noise coefficient. The number of samples is set to 50% of the number of variables for 2D data and 8% for 3D data. The time costs of all methods are shown in Figure 1. We can see that not only does the sequential version of SLTR have a lower time cost, but the parallel implementation of SLTR achieves an even more pronounced improvement over the baselines. Notice that the code of SURF is inapplicable to 2D data and GLTRM is infeasible for 3D data, so we omit these two methods in the corresponding subfigures. Ideally, if a better environment were provided, the speedup of SLTR could be even larger. We then report the MSE of all methods on these simulated datasets in Table 1, where the bold number denotes the best MSE value and the underlined value represents the second-best result. The results indicate that in most cases SLTR obtains the best solution, while in the other cases the estimate computed by SLTR differs only slightly from the best one. Because of the infeasibility of SURF and GLTRM, we also omit these two methods from the table under the corresponding conditions. The results in Figure 1 and Table 1 indicate that SLTR spends less time to obtain a solution that is no worse than the solutions of the other methods.

Figure 1: Logarithmic time cost of SLTR and the baselines on simulated datasets with a fixed sparsity level. Note that SURF is infeasible for 2D data and GLTRM is infeasible for 3D data. "SLTR (sequential)" and "SLTR (parallel)" denote the sequential and parallel versions of SLTR, respectively. (a) Time cost on 2D data, varying the shape of the data; the number of samples is set to 50% of the number of variables. (b) Time cost on 3D data, varying the shape of the data; the number of samples is set to 8% of the number of variables.

 

size          SLTR     Remurs   SURF        GLTRM       Lasso    Elastic Net
2D data – 50% samples
15 × 15       0.1419   0.1949   Infeasible  1.162       0.1893   0.2015
20 × 20       0.1105   0.1567   Infeasible  1.3967      0.1152   0.1157
25 × 25       0.1364   0.2992   Infeasible  1.7319      0.1464   0.1460
30 × 30       0.1094   0.1323   Infeasible  2.2655      0.1414   0.1418
3D data – 8% samples
10 × 10 × 5   0.8532   0.8538   0.8472      Infeasible  1.7993   1.799
15 × 15 × 5   0.9892   0.9867   0.9994      Infeasible  2.3151   2.3132
20 × 20 × 5   0.9383   0.9378   1.0049      Infeasible  2.5469   2.5193
25 × 25 × 5   0.9275   0.9398   0.9391      Infeasible  2.0149   2.0149
30 × 30 × 5   0.9186   0.919    0.9289      Infeasible  1.9381   1.9377
35 × 35 × 5   0.9336   0.9370   0.9527      Infeasible  2.0147   2.0147
40 × 40 × 5   0.9073   0.9072   1.0006      Infeasible  2.1059   2.1065

 

Table 1: MSE of every method on simulated datasets with different data sizes. The same sparsity level and noise coefficient are used for every dataset. The bold number denotes the best MSE value and the underlined value represents the second-best result. Note that SURF is infeasible for 2D data and GLTRM is infeasible for 3D data.

In addition, we apply SLTR in high-dimensional settings. We fix the shape of the data and vary the number of samples N from 50 to 400 in increments of 50. The sparsity level and the noise factor are also fixed. The MSE values shown in Table 2 indicate that SLTR obtains the best solution under almost all conditions. Note that for N = 150, even though the MSE of SLTR (0.9295) is slightly worse than that of Remurs (0.9190), it is still much better than the others.

 

N     SLTR     Remurs   SURF     GLTRM       Lasso    Elastic Net
50    1.6123   1.6198   1.6439   Infeasible  4.5083   4.5759
100   1.0798   1.0946   1.7101   Infeasible  1.6433   1.6433
150   0.9295   0.9190   0.9953   Infeasible  1.6777   1.6502
200   0.8351   0.8469   0.8376   Infeasible  1.9072   0.8376
250   0.7130   0.7267   0.7199   Infeasible  1.3757   1.3708
300   0.7282   0.7325   0.7524   Infeasible  1.7316   1.6938
350   0.6275   0.6275   0.6379   Infeasible  1.3207   1.2804
400   0.5954   0.5969   0.5975   Infeasible  1.1487   1.1450

 

Table 2: MSE of every method in high-dimensional settings. The simulated datasets are generated with a fixed data shape, sparsity level, and noise coefficient, and the number of samples varies from 50 to 400. Note that GLTRM is inapplicable to 3D data. The bold number denotes the best MSE value and the underlined value represents the second-best result.

Finally, to test the stability of SLTR and its sensitivity to the sparsity level, we generate simulated datasets with varying sparsity levels. The shape of each sample is fixed and a total of 160 samples are generated; a different dataset is generated in each trial. The averaged time cost and the variance over 20 trials are reported in Figure 2. Under every setting, SLTR outperforms the baselines. Moreover, the time cost variance of SLTR is much smaller than that of SURF and Remurs. Although the LR methods also have a small time variance, they may obtain a worse solution than SLTR (see Table 1). Therefore, SLTR is faster than the other methods on datasets with different sparsity levels, and it is more stable.

Figure 2: Time cost of SLTR and the baselines on simulated datasets with different sparsity levels. The variance of the time cost is also reported for every experiment. The size of the coefficient tensor and the noise ratio are fixed.

7.3 Experiments on real-world data

In this section, we perform fMRI classification tasks on the CMU2008 dataset [17], which contains 9 projects in total. Each sample of this dataset is a 3-mode tensor of size 51 × 61 × 23 (71,553 voxels). The classification task aims to predict human activities associated with recognizing the meanings of nouns. Following [10, 20], we focus on binary classification of "tools" versus "animals"; here, the class "tools" combines observations from "tool" and "furniture", and the class "animals" combines observations from "animal" and "insect" in the CMU2008 dataset. As in the simulated experiments, the values of the tuning parameters of each method are selected through 5-fold cross-validation. For each subject, we split the entire dataset into a training set and a testing set and use the AUC to evaluate classification results. Results are shown in Table 3, which indicates that SLTR obtains the lowest time cost among the methods that produce acceptable solutions. Notice that although SURF has an even lower time cost, the AUC of its solution is always around 0.5 and sometimes drops below 0.5; hence, we consider the classification results of SURF unacceptable. One interesting result occurs on project #1, where the linear regression methods obtain a much better solution. We conjecture that in this subject the voxels are relatively independent, so the data have little latent structure. Overall, on a real-world fMRI dataset, SLTR obtains an acceptable solution with the least time cost.

 

Project   SLTR                          Remurs             Lasso              Elastic Net        SURF
          seq. time  par. time  AUC     time      AUC      time     AUC       time     AUC       time     AUC
#1        0.9785     0.2135     0.75    23.9151   0.75     2.2105   0.9143    2.2334   0.9277    0.166    0.5042
#2        0.9405     0.2103     0.5785  28.229    0.6      2.1347   0.7164    2.1105   0.6974    0.0967   0.4571
#3        1.1246     0.2466     0.7929  21.716    0.7571   3.0322   0.7177    3.020    0.728     0.0809   0.5286
#4        0.9276     0.2089     0.8286  33.0095   0.75     2.0809   0.7237    2.0809   0.7376    0.0278   0.3429
#5        0.9385     0.2104     0.6527  26.8979   0.6389   2.1995   0.7411    2.2036   0.7589    0.1826   0.7125
#6        0.9007     0.2076     0.7704  10.0312   0.5972   2.3102   0.6554    2.365    0.6689    0.1707   0.5486
#7        0.9405     0.2096     0.6250  19.3765   0.5037   2.6606   0.5851    2.604    0.5929    0.1364   0.4444
#8        0.8923     0.2086     0.6944  16.7164   0.5486   2.2362   0.6741    2.2854   0.7165    0.0849   0.4028
#9        0.9036     0.2157     0.6142  24.1116   0.5781   2.1138   0.5522    2.1014   0.5489    0.1742   0.5857

 

Table 3: Time cost and AUC of SLTR and the baselines on nine classification projects of the CMU2008 dataset. "seq. time" and "par. time" denote the time cost of sequential SLTR and parallel SLTR, respectively. The bold number denotes the smallest time cost and the underlined number highlights the best AUC value.

References

  • [1] Genevera Allen, ‘Sparse higher-order principal components analysis’, in Artificial Intelligence and Statistics, pp. 27–36, (2012).
  • [2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al., ‘Distributed optimization and statistical learning via the alternating direction method of multipliers’, Foundations and Trends® in Machine learning, 3(1), 1–122, (2011).
  • [3] Emmanuel Candes, Terence Tao, et al., ‘The dantzig selector: Statistical estimation when p is much larger than n’, The annals of Statistics, 35(6), 2313–2351, (2007).
  • [4] Patrick L Combettes and Jean-Christophe Pesquet, ‘Proximal splitting methods in signal processing’, in Fixed-point algorithms for inverse problems in science and engineering, 185–212, Springer, (2011).
  • [5] Weiwei Guo, Irene Kotsia, and Ioannis Patras, ‘Tensor learning for regression’, IEEE Transactions on Image Processing, 21(2), 816–827, (2011).
  • [6] Botao Hao, Boxiang Wang, Pengyuan Wang, Jingfei Zhang, Jian Yang, and Will Wei Sun, ‘Sparse tensor additive regression’, arXiv preprint arXiv:1904.00479, (2019).
  • [7] Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang, ‘Boosted sparse and low-rank tensor regression’, in Advances in Neural Information Processing Systems, pp. 1009–1018, (2018).
  • [8] Christopher J Hillar and Lek-Heng Lim, ‘Most tensor problems are np-hard’, Journal of the ACM (JACM), 60(6),  45, (2013).
  • [9] Masaaki Imaizumi and Kohei Hayashi, ‘Doubly decomposing nonparametric tensor regression’, in International Conference on Machine Learning, pp. 727–736, (2016).
  • [10] Kittipat Kampa, S Mehta, Chun-An Chou, Wanpracha Art Chaovalitwongse, and Thomas J Grabowski, ‘Sparse optimization in feature selection: application in neuroimaging’, Journal of Global Optimization, 59(2-3), 439–457, (2014).
  • [11] Tamara G Kolda and Brett W Bader, ‘Tensor decompositions and applications’, SIAM review, 51(3), 455–500, (2009).
  • [12] Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc, ‘Model-driven sparse cp decomposition for higher-order tensors’, in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1048–1057. IEEE, (2017).
  • [13] Wenwen Li, Jian Lou, Shuo Zhou, and Haiping Lu, ‘Sturm: Sparse tubal-regularized multilinear regression for fmri’, in International Workshop on Machine Learning in Medical Imaging, pp. 256–264. Springer, (2019).
  • [14] Yimei Li, Hongtu Zhu, Dinggang Shen, Weili Lin, John H Gilmore, and Joseph G Ibrahim, ‘Multiscale adaptive regression models for neuroimaging data’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 559–578, (2011).
  • [15] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye, ‘Tensor completion for estimating missing values in visual data’, IEEE transactions on pattern analysis and machine intelligence, 35(1), 208–220, (2012).
  • [16] Oscar Hernan Madrid-Padilla and James Scott, ‘Tensor decomposition with generalized lasso penalties’, Journal of Computational and Graphical Statistics, 26(3), 537–546, (2017).
  • [17] Tom M Mitchell, Svetlana V Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L Malave, Robert A Mason, and Marcel Adam Just, ‘Predicting human brain activity associated with the meanings of nouns’, science, 320(5880), 1191–1195, (2008).
  • [18] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, Bin Yu, et al., ‘A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers’, Statistical Science, 27(4), 538–557, (2012).
  • [19] Garvesh Raskutti, Ming Yuan, Han Chen, et al., ‘Convex regularization for high-dimensional multiresponse tensor regression’, The Annals of Statistics, 47(3), 1554–1584, (2019).
  • [20] Xiaonan Song and Haiping Lu, ‘Multilinear regression for embedded feature selection with application to fmri analysis’, in Thirty-First AAAI Conference on Artificial Intelligence, (2017).
  • [21] Will Wei Sun and Lexin Li, ‘Store: sparse tensor response regression and neuroimaging analysis’, The Journal of Machine Learning Research, 18(1), 4908–4944, (2017).
  • [22] Will Wei Sun, Junwei Lu, Han Liu, and Guang Cheng, ‘Provable sparse tensor decomposition’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 899–916, (2017).
  • [23] Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang, ‘Logistic tensor regression for classification’, in International Conference on Intelligent Science and Intelligent Data Engineering, pp. 573–581. Springer, (2012).
  • [24] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar, ‘Elementary estimators for high-dimensional linear regression’, in International Conference on Machine Learning, pp. 388–396, (2014).
  • [25] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar, ‘Elementary estimators for sparse covariance matrices and other structured moments’, in International Conference on Machine Learning, pp. 397–405, (2014).
  • [26] Rose Yu, Guangyu Li, and Yan Liu, ‘Tensor regression meets gaussian processes’, arXiv preprint arXiv:1710.11345, (2017).
  • [27] Rose Yu and Yan Liu, ‘Learning from multiway data: Simple and efficient tensor regression’, in International Conference on Machine Learning, pp. 373–381, (2016).
  • [28] Zemin Zhang and Shuchin Aeron, ‘Exact tensor completion using t-svd’, IEEE Transactions on Signal Processing, 65(6), 1511–1526, (2016).
  • [29] Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, and Misha Kilmer, ‘Novel methods for multilinear data completion and de-noising based on tensor-svd’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3842–3849, (2014).
  • [30] Hua Zhou and Lexin Li, ‘Regularized matrix regression’, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2), 463–483, (2014).
  • [31] Hua Zhou, Lexin Li, and Hongtu Zhu, ‘Tensor regression with applications in neuroimaging data analysis’, Journal of the American Statistical Association, 108(502), 540–552, (2013).
  • [32] Hongtu Zhu, Yasheng Chen, Joseph G Ibrahim, Yimei Li, Colin Hall, and Weili Lin, ‘Intrinsic regression models for positive-definite matrices with applications to diffusion tensor imaging’, Journal of the American Statistical Association, 104(487), 1203–1212, (2009).
  • [33] Hui Zou and Trevor Hastie, ‘Regularization and variable selection via the elastic net’, Journal of the royal statistical society: series B (statistical methodology), 67(2), 301–320, (2005).