# Stochastic Iterative Hard Thresholding for Low-Tucker-Rank Tensor Recovery

Low-rank tensor recovery problems have been widely studied in many applications of signal processing and machine learning. Tucker decomposition is known as one of the most popular decompositions in the tensor framework. In recent years, researchers have developed many state-of-the-art algorithms to address the problem of low-Tucker-rank tensor recovery. Motivated by the favorable properties of stochastic algorithms, such as stochastic gradient descent and stochastic iterative hard thresholding, we aim to extend the well-known stochastic iterative hard thresholding algorithm to the tensor framework in order to address the problem of recovering a low-Tucker-rank tensor from its linear measurements. We also develop a linear convergence analysis for the proposed method and conduct a series of experiments with both synthetic and real data to illustrate its performance.


## 1 Introduction

Tensors are high-dimensional extensions of vectors and matrices. There are many different kinds of tensor decompositions, among which the Canonical Polyadic (CP) decomposition and the Tucker decomposition are the most popular Sokolnikoff (1951); Zhang (2017). In recent years, low-rank tensor recovery problems have gained a great amount of attention in various applications including hyperspectral image restoration Fan et al. (2017), video processing Bengua et al. (2017), signal processing Li et al. (2015, 2017), and simultaneous blind deconvolution and phase retrieval Li et al. (2019). Unlike low-rank matrix recovery problems, which often use nuclear norm minimization as a popular heuristic for rank minimization, the computation of the nuclear norm for high-order tensors is NP-hard Hillar and Lim (2013); Friedland and Lim (2018).

Over the decades, the iterative hard thresholding (IHT) algorithm has been widely used in compressive sensing Blumensath and Davies (2009); Blanchard et al. (2015); Carrillo and Barner (2013) and low-rank matrix recovery Tanner and Wei (2013); Chunikhina et al. (2014); Geng et al. (2015). It has many extensions, such as the stochastic variant proposed in Nguyen et al. (2017), which was further extended to the multiple measurement vector framework in Qin et al. (2017). Inspired by the idea of using the IHT algorithm in low-rank matrix recovery problems, the authors in Rauhut et al. (2017) extended the IHT algorithm to the tensor framework and proposed the Tensor IHT (TIHT) algorithm as an alternative to tensor nuclear norm minimization. The authors of Li et al. (2019) then combined this TIHT algorithm with the higher-order singular value decomposition (HOSVD), a type of Tucker decomposition, to solve a low-rank tensor recovery problem formulated from a simultaneous blind deconvolution and phase retrieval problem. Another recent work de Morais Goulart and Favier (2017) also extends the IHT algorithm to the problem of low-rank tensor recovery based on a low-Tucker-rank approximation technique named sequentially optimal modal projections.

The stochastic versions of gradient descent algorithms and IHT algorithms usually have many favorable properties. For example, these algorithms do not need to compute the full gradient, which makes it possible for them to be utilized in large scale problems where computing the full gradient is very expensive. These properties inspired us to extend the stochastic IHT algorithm to the tensor framework and introduce the Stochastic Tensor IHT (StoTIHT) algorithm to recover a low-Tucker-rank tensor from its linear measurements. In this work, we provide convergence analysis for the proposed StoTIHT algorithm, based on a Tucker decomposition of the tensor, under the assumption that the linear operator used to obtain the measurements satisfies a tensor restricted isometry property (TRIP). Our simulations also indicate that the proposed StoTIHT algorithm converges much faster than the original TIHT algorithm in a large scale setting.

The remainder of this work is organized as follows. In Section 2, we briefly review some fundamental concepts and definitions used in the tensor framework. We formulate our low-rank tensor recovery problem in Section 3 and present the proposed StoTIHT algorithm in Section 4. We then introduce the linear convergence analysis for our proposed StoTIHT algorithm in Section 5 and illustrate its performance with both synthetic and real data in Section 6. Finally, we conclude our work in Section 7.

## 2 Preliminaries

In this section, we briefly review some fundamental concepts and definitions used in the tensor framework Sokolnikoff (1951); Zhang (2017); Kolda and Bader (2009). We denote a $d$-th order tensor as $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$. Vectors and matrices can be viewed as low-dimensional tensors with $d = 1$ and $d = 2$, respectively. Denote $\mathbf{X}^{(n)}$ as the mode-$n$ matricization or the $n$-th unfolding of a tensor $\mathcal{X}$ (one can refer to Kolda and Bader (2009) for a more detailed definition and some easily understandable examples). Similar to the matrix case, it is possible to vectorize a tensor, resulting in a column vector containing all of the elements of the tensor. Note that the ordering of the elements is not uniquely defined; in this work, one fixed ordering is used consistently. In particular, denote $\mathrm{vec}(\cdot)$ as an operator used to vectorize a matrix or a tensor. For a tensor $\mathcal{X}$, we choose the vectorization $\mathrm{vec}(\mathcal{X}) \triangleq \mathrm{vec}(\mathbf{X}^{(1)})$; that is, we vectorize the mode-1 matricization of the tensor. The inner product of two tensors is then defined as

$$\langle \mathcal{X}_1, \mathcal{X}_2 \rangle \triangleq \mathrm{vec}(\mathcal{X}_2)^\top \mathrm{vec}(\mathcal{X}_1).$$

The induced Frobenius norm is then defined as

$$\|\mathcal{X}\|_F \triangleq \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}.$$
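To make these conventions concrete, here is a minimal NumPy sketch (the helper names `unfold`, `vec`, `inner`, and `fro_norm` are ours, not from the paper); any fixed element ordering for the vectorization works, as long as it is used consistently:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization: arrange the mode-`mode` fibers of X as columns.
    This C-order variant is one consistent choice of element ordering."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def vec(X):
    """Vectorize a tensor via its mode-1 unfolding (mode 0 in zero-based indexing)."""
    return unfold(X, 0).reshape(-1)

def inner(X1, X2):
    """Tensor inner product <X1, X2> = vec(X2)^T vec(X1)."""
    return vec(X2) @ vec(X1)

def fro_norm(X):
    """Induced Frobenius norm ||X||_F = sqrt(<X, X>)."""
    return np.sqrt(inner(X, X))
```

For real tensors the inner product reduces to the sum of elementwise products, so `fro_norm` agrees with `np.linalg.norm` applied to the whole array.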

The Tucker rank of a tensor $\mathcal{X}$ is then defined as a tuple $\mathbf{r} = (r_1, \ldots, r_d)$ with $r_n = \mathrm{rank}(\mathbf{X}^{(n)})$. The Tucker decomposition is one of the most popular tensor decompositions, and one can find more details in Tucker (1963, 1964). The family of Tucker decompositions includes the HOSVD of a tensor $\mathcal{X}$, which is given as

$$\mathcal{X} = \mathcal{S} \times_1 \mathbf{U}^{(1)} \cdots \times_d \mathbf{U}^{(d)}. \qquad (2.1)$$

Here, $\mathcal{S} \in \mathbb{R}^{r_1 \times \cdots \times r_d}$ and $\mathbf{U}^{(n)} \in \mathbb{R}^{n_n \times r_n}$ denote the core tensor and the mode-$n$ basis, respectively. One can refer to Sokolnikoff (1951); Zhang (2017) for more details about the properties of the core tensor and bases. The product $\times_n$ is the mode-$n$ (matrix) product of the tensor, that is, the product of a tensor and a matrix along the $n$-th mode of the tensor.
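As an illustration, a truncated HOSVD of the form (2.1) can be sketched in a few lines of NumPy: the mode-$n$ basis is taken from the leading left singular vectors of the mode-$n$ unfolding, and the core is obtained by projecting onto these bases. The function names (`hosvd`, `mode_product`, `tucker_to_tensor`) and the C-order unfolding convention are our own illustrative choices:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization (C-order convention)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, U, mode):
    """Mode-n product X x_n U: multiply U into the `mode`-th axis of X."""
    Xm = np.moveaxis(X, mode, 0)
    out = np.tensordot(U, Xm, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd(X, ranks):
    """Truncated HOSVD: X ~ S x_1 U(1) ... x_d U(d) with Tucker rank <= ranks."""
    d = X.ndim
    U = []
    for k in range(d):
        # Leading left singular vectors of the mode-k unfolding give the mode-k basis.
        Uk = np.linalg.svd(unfold(X, k), full_matrices=False)[0]
        U.append(Uk[:, :ranks[k]])
    # Core tensor: project X onto the bases.
    S = X
    for k in range(d):
        S = mode_product(S, U[k].T, k)
    return S, U

def tucker_to_tensor(S, U):
    """Reassemble the tensor from its core and factor matrices, as in (2.1)."""
    X = S
    for k, Uk in enumerate(U):
        X = mode_product(X, Uk, k)
    return X
```

When the input tensor has exact Tucker rank at most `ranks`, this reconstruction is exact; otherwise it returns a quasi-optimal low-Tucker-rank approximation.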

## 3 Problem Formulation

In this work, we consider the recovery of a rank-$\mathbf{r}$ tensor $\mathcal{X}^\star \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$ from its linear measurements $\mathbf{y} = \mathcal{A}(\mathcal{X}^\star) \in \mathbb{R}^m$, where $\mathcal{A}: \mathbb{R}^{n_1 \times \cdots \times n_d} \rightarrow \mathbb{R}^m$ is a linear operator used to obtain the measurements. In particular, the $i$-th element of $\mathbf{y}$ is given as

$$y(i) = \mathcal{A}_i(\mathcal{X}^\star) = \langle \mathcal{A}_i, \mathcal{X}^\star \rangle, \quad i = 1, \ldots, m, \qquad (3.1)$$

where $\mathcal{A}_i \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ is a sensing tensor. We observe that the cost function can be rewritten as

$$\begin{aligned} F(\mathcal{X}) &\triangleq \frac{1}{2m}\|\mathbf{y}-\mathcal{A}(\mathcal{X})\|_2^2 = \frac{1}{2m}\sum_{i=1}^{m}\big(y(i)-\langle\mathcal{A}_i,\mathcal{X}\rangle\big)^2 \qquad (3.2) \\ &= \frac{1}{M}\sum_{i=1}^{M}\Bigg(\frac{1}{2b}\sum_{j=(i-1)b+1}^{ib}\big(y(j)-\langle\mathcal{A}_j,\mathcal{X}\rangle\big)^2\Bigg) \\ &= \frac{1}{M}\sum_{i=1}^{M}\frac{1}{2b}\big\|\mathbf{y}_{b_i}-\mathcal{A}_{b_i}(\mathcal{X})\big\|_2^2 \triangleq \frac{1}{M}\sum_{i=1}^{M} f_i(\mathcal{X}), \end{aligned}$$

where we decompose the measurement vector $\mathbf{y} \in \mathbb{R}^m$ into $M$ non-overlapping vectors $\mathbf{y}_{b_i} \in \mathbb{R}^b$, $i = 1, \ldots, M$. Note that we can choose $b$ such that $M = m/b$ is an integer. We denote $\mathcal{A}_{b_i}$ as a linear operator with the $j$-th entry of $\mathcal{A}_{b_i}(\mathcal{X})$ being $\langle\mathcal{A}_{(i-1)b+j}, \mathcal{X}\rangle$, $j = 1, \ldots, b$. It can be seen that each function $f_i$ is associated with a collection of $b$ measurements.
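The measurement model and block decomposition of the cost can be sketched as follows (function names such as `measure` and `f_i` are ours, and blocks are zero-indexed here); the full cost $F$ equals the average of the block costs $f_i$:

```python
import numpy as np

def measure(A_tensors, X):
    """y(i) = <A_i, X> for each sensing tensor A_i."""
    return np.array([np.vdot(Ai, X) for Ai in A_tensors])

def f_i(A_tensors, y, X, i, b):
    """Cost on the i-th block of b measurements (zero-based block index i)."""
    idx = slice(i * b, (i + 1) * b)
    r = y[idx] - measure(A_tensors[idx], X)
    return 0.5 / b * (r @ r)

def F(A_tensors, y, X):
    """Full cost F(X) = (1/2m)||y - A(X)||_2^2, the average of the f_i."""
    m = len(y)
    r = y - measure(A_tensors, X)
    return 0.5 / m * (r @ r)
```

Since each of the $M$ blocks contains $b = m/M$ measurements, averaging the block costs reproduces the $1/(2m)$ scaling of $F$ exactly.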

Due to the low-rankness of $\mathcal{X}^\star$, a standard approach to recover $\mathcal{X}^\star$ is to solve the following minimization

$$\underset{\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times\cdots\times n_d}}{\text{minimize}}\ \ F(\mathcal{X}) \quad \text{subject to}\ \ \mathrm{rank}(\mathcal{X}) \le \mathbf{r}, \qquad (3.3)$$

where the cost function $F(\mathcal{X})$ is defined in (3.2). The above optimization has been heavily studied in the existing literature. The authors in Gandy et al. (2011); Mu et al. (2014) relax the above optimization by minimizing the sum of the nuclear norms of the tensor matricizations. However, this kind of relaxation is actually non-optimal Recht et al. (2010). Inspired by the idea of using the IHT algorithm for compressive sensing and low-rank matrix recovery problems, the authors in Rauhut et al. (2017) extend the IHT algorithm Blumensath and Davies (2009) to the tensor framework and propose the TIHT algorithm as an alternative to tensor nuclear norm minimization. In particular, to recover $\mathcal{X}^\star$, the TIHT algorithm consists of the following two steps at iteration $t$:

$$\widetilde{\mathcal{X}}^t = \mathcal{X}^t + \mu\,\mathcal{A}^*\big(\mathbf{y}-\mathcal{A}(\mathcal{X}^t)\big), \qquad (3.4)$$
$$\mathcal{X}^{t+1} = \mathcal{H}_{\mathbf{r}}\big(\widetilde{\mathcal{X}}^t\big). \qquad (3.5)$$

Here, $\mu$ is the stepsize and $\mathcal{A}^*$ is the adjoint operator of $\mathcal{A}$; that is, for any tensor $\mathcal{X}$ and vector $\mathbf{z} \in \mathbb{R}^m$, $\langle\mathcal{A}(\mathcal{X}), \mathbf{z}\rangle = \langle\mathcal{X}, \mathcal{A}^*(\mathbf{z})\rangle$, so that $\mathcal{A}^*(\mathbf{z}) = \sum_{i=1}^m z(i)\,\mathcal{A}_i$. The operator $\mathcal{H}_{\mathbf{r}}$ computes a rank-$\mathbf{r}$ approximation of a tensor with the HOSVD. Note that the second step (3.5) is not straightforward, and the authors in Rauhut et al. (2017) required that the assumption

$$\big\|\mathcal{H}_{\mathbf{r}}(\widetilde{\mathcal{X}}^t)-\widetilde{\mathcal{X}}^t\big\|_F \le \eta\,\big\|\widetilde{\mathcal{X}}^t_{\mathrm{best}}-\widetilde{\mathcal{X}}^t\big\|_F \qquad (3.6)$$

hold for all $t$ with some $\eta \ge 1$. Here, $\widetilde{\mathcal{X}}^t_{\mathrm{best}}$ is the best rank-$\mathbf{r}$ approximation of $\widetilde{\mathcal{X}}^t$ with respect to the Tucker decomposition (given by the HOSVD), namely, $\widetilde{\mathcal{X}}^t_{\mathrm{best}} \triangleq \arg\min_{\mathrm{rank}(\mathcal{Z})\le\mathbf{r}} \|\mathcal{Z}-\widetilde{\mathcal{X}}^t\|_F$. We will also assume such an approximation exists in our theorem.
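For concreteness, one TIHT iteration (3.4)-(3.5) might be sketched as below, with a truncated HOSVD serving as one possible choice of the operator $\mathcal{H}_{\mathbf{r}}$; the $1/m$ scaling of the adjoint step matches the gradient of $F$ in (3.2) and can equivalently be absorbed into the stepsize $\mu$ (all function names here are ours):

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization (C-order convention)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, U, mode):
    """Mode-n product of tensor X with matrix U."""
    return np.moveaxis(np.tensordot(U, np.moveaxis(X, mode, 0), axes=(1, 0)), 0, mode)

def H_r(X, ranks):
    """Rank-r approximation via truncated HOSVD (one choice of H_r)."""
    U = [np.linalg.svd(unfold(X, k), full_matrices=False)[0][:, :ranks[k]]
         for k in range(X.ndim)]
    S = X
    for k in range(X.ndim):          # core: project onto the bases
        S = mode_product(S, U[k].T, k)
    for k in range(X.ndim):          # map back to the full space
        S = mode_product(S, U[k], k)
    return S

def tiht_step(A_tensors, y, X, mu, ranks):
    """One TIHT iteration: gradient step with the adjoint A*, then hard thresholding."""
    m = len(y)
    r = y - np.array([np.vdot(Ai, X) for Ai in A_tensors])   # residual y - A(X)
    adj = sum(ri * Ai for ri, Ai in zip(r, A_tensors))       # A*(r) = sum_i r(i) A_i
    return H_r(X + (mu / m) * adj, ranks)
```

With noiseless measurements and an exactly rank-$\mathbf{r}$ target, the residual vanishes at the true tensor, which is therefore a fixed point of the iteration.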

## 4 The Proposed StoTIHT Algorithm

We observe that the linear measurements $\mathbf{y}$ can be rewritten as

$$\mathbf{y} = \mathbf{A}\mathbf{x}^\star,$$

where $\mathbf{A} \in \mathbb{R}^{m \times n_1 n_2 \cdots n_d}$ is a matrix with the $i$-th row being the vectorized version of $\mathcal{A}_i$, and $\mathbf{x}^\star = \mathrm{vec}(\mathcal{X}^\star)$ is the vectorized version of $\mathcal{X}^\star$. Then, we can update $\widetilde{\mathcal{X}}^t$ in (3.4) with

$$\widetilde{\mathbf{x}}^t = \mathbf{x}^t + \mu\,\mathbf{A}^\top(\mathbf{y}-\mathbf{A}\mathbf{x}^t), \qquad (4.1)$$

where $\widetilde{\mathbf{x}}^t$ and $\mathbf{x}^t$ are the vectorized versions of $\widetilde{\mathcal{X}}^t$ and $\mathcal{X}^t$, respectively.

As previously stated, stochastic variants of such algorithms do not require computation of the full gradient and thus can be much more efficient in large-scale settings, especially when the computation and/or storage of the full gradient is very expensive. Thus, we propose a stochastic variant of the TIHT algorithm (StoTIHT) by replacing (4.1) with

$$\widetilde{\mathbf{x}}^t = \mathbf{x}^t + \frac{\mu}{M p(i_t)}\,\mathbf{A}_{(i_t,:)}^\top\big(\mathbf{y}_{b_{i_t}}-\mathbf{A}_{(i_t,:)}\mathbf{x}^t\big), \qquad (4.2)$$

where $i_t$ is an index randomly selected from $\{1, \ldots, M\}$ with probability $p(i_t)$, and $\mathbf{A}_{(i_t,:)}$ denotes the $i_t$-th block of rows of $\mathbf{A}$. This updating step is equivalent to

$$\widetilde{\mathcal{X}}^t = \mathcal{X}^t + \frac{\mu}{M p(i_t)}\,\frac{1}{b}\sum_{j=(i_t-1)b+1}^{i_t b}\mathcal{A}_j\big(y(j)-\langle\mathcal{A}_j,\mathcal{X}^t\rangle\big) = \mathcal{X}^t - \frac{\mu}{M p(i_t)}\,\nabla f_{i_t}(\mathcal{X}^t) \qquad (4.3)$$

with $f_{i_t}$ as in (3.2). Based on the above analysis, we summarize the proposed StoTIHT algorithm (Algorithm 1) as follows: at each iteration $t$, randomly select a block index $i_t$, form $\widetilde{\mathcal{X}}^t$ via the stochastic gradient step (4.3), and then apply the hard thresholding step (3.5).
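Putting the pieces together, the StoTIHT iteration can be sketched as follows, assuming uniform block selection $p(i) = 1/M$ (so that $\mu/(Mp(i_t)) = \mu$) and a truncated HOSVD as the hard-thresholding operator. All names are our own; this is a sketch under these assumptions, not the paper's reference implementation:

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, U, mode):
    return np.moveaxis(np.tensordot(U, np.moveaxis(X, mode, 0), axes=(1, 0)), 0, mode)

def H_r(X, ranks):
    """Truncated HOSVD as one choice of the rank-r approximation operator."""
    U = [np.linalg.svd(unfold(X, k), full_matrices=False)[0][:, :ranks[k]]
         for k in range(X.ndim)]
    S = X
    for k in range(X.ndim):
        S = mode_product(S, U[k].T, k)
    for k in range(X.ndim):
        S = mode_product(S, U[k], k)
    return S

def stotiht(A_tensors, y, shape, ranks, b, mu, n_iters, X0=None, seed=0):
    """StoTIHT sketch with uniform block selection p(i) = 1/M, so mu/(M p(i)) = mu."""
    rng = np.random.default_rng(seed)
    M = len(y) // b
    X = np.zeros(shape) if X0 is None else X0.copy()
    for _ in range(n_iters):
        it = int(rng.integers(M))               # draw a block index uniformly
        idx = range(it * b, (it + 1) * b)
        # Stochastic gradient of f_it at X, as in (3.2).
        grad = sum((np.vdot(A_tensors[j], X) - y[j]) * A_tensors[j] for j in idx) / b
        X = H_r(X - mu * grad, ranks)           # gradient step + hard thresholding
    return X
```

With exact low-rank data and noiseless measurements, every block gradient vanishes at $\mathcal{X}^\star$, so $\mathcal{X}^\star$ is a fixed point of the iteration.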

## 5 Linear Convergence for StoTIHT

We present a linear convergence analysis for the proposed StoTIHT algorithm in this section. We first introduce the tensor restricted isometry property (TRIP) in the following definition.

###### Definition 5.1.

(TRIP) Rauhut et al. (2017) Let $\mathcal{A}$ and $\mathcal{A}_{b_i}$ be the two linear operators defined in Section 3. For a fixed tensor Tucker decomposition and a corresponding Tucker rank $\mathbf{r}$, we say $\mathcal{A}$ and $\mathcal{A}_{b_i}$ satisfy the TRIP if there exists a tensor restricted isometry constant $\delta_{\mathbf{r}} \in (0, 1)$ such that

$$\frac{1}{m}\|\mathcal{A}(\mathcal{X})\|_2^2 \ge (1-\delta_{\mathbf{r}})\|\mathcal{X}\|_F^2, \qquad (5.1)$$
$$\frac{1}{b}\|\mathcal{A}_{b_i}(\mathcal{X})\|_2^2 \le (1+\delta_{\mathbf{r}})\|\mathcal{X}\|_F^2 \qquad (5.2)$$

hold for all tensors $\mathcal{X}$ of Tucker rank at most $\mathbf{r}$.

Note that (5.2) is stronger than the corresponding upper bound $\frac{1}{m}\|\mathcal{A}(\mathcal{X})\|_2^2 \le (1+\delta_{\mathbf{r}})\|\mathcal{X}\|_F^2$ on the full operator, but we will need (5.2) in the proof of the main theorem. We call the linear operator $\mathcal{A}$ defined in Section 3 a random Gaussian linear operator if the entries of all of the sensing tensors $\mathcal{A}_i$ are random Gaussian variables. It is shown in Rauhut et al. (2017) that the random Gaussian linear operators $\mathcal{A}$ and $\mathcal{A}_{b_i}$ satisfy the TRIP with high probability as long as $m$ and $b$ are sufficiently large relative to quantities that scale with the Tucker rank and the tensor dimensions (the restricted isometry constants involved are not necessarily the same, according to Rauhut et al. (2017)). The sample complexity bounds in Rauhut et al. (2017) are stated in terms of a partition tree of the vertices $\{1, \ldots, d\}$; see Rauhut et al. (2017) for a more detailed description of its construction.

Now, we are in a position to state our main result in the following theorem.

###### Theorem 5.1.

Assume that the operators $\mathcal{A}$ and $\mathcal{A}_{b_i}$ used to obtain the linear measurements satisfy the TRIP defined in Definition 5.1. Let $\mathcal{X}^\star$ be a feasible solution of the optimization problem (3.3). Denote $\mathcal{X}^0$ as the initial tensor. We also assume that the Tucker-rank-$\mathbf{r}$ approximation operator $\mathcal{H}_{\mathbf{r}}$ satisfies (3.6) for all $t$ with some $\eta \ge 1$. Using Algorithm 1, one can guarantee that the expectation of the recovery error is bounded by

$$\mathbb{E}_{I_t}\|\mathcal{X}^{t+1}-\mathcal{X}^\star\|_F \le \kappa^{t+1}\|\mathcal{X}^0-\mathcal{X}^\star\|_F + \sigma_{\mathcal{X}^\star}$$

at the $t$-th iteration. $I_t$ denotes the set containing all indices randomly selected at and before iteration $t$. Here, $\kappa$ and $\sigma_{\mathcal{X}^\star}$ are the contraction coefficient and tolerance parameter, which are defined as

$$\begin{aligned} \kappa &\triangleq 2\sqrt{1-\big(2-\mu\alpha_{3\mathbf{r}}\big)\mu\rho^-_{3\mathbf{r}}} + \sqrt{\eta^2-1}\,\sqrt{1+\mu^2\alpha_{3\mathbf{r}}\rho^+_{3\mathbf{r}}-2\mu\rho^-_{3\mathbf{r}}}, \\ \sigma_{\mathcal{X}^\star} &\triangleq \frac{\mu}{M\min_{i\in[M]}p(i)}\Big(2\,\mathbb{E}_{i_t}\big\|P_{U^t}\big(\nabla f_{i_t}(\mathcal{X}^\star)\big)\big\|_F + \sqrt{\eta^2-1}\,\mathbb{E}_{i_t}\big\|\nabla f_{i_t}(\mathcal{X}^\star)\big\|_F\Big) \end{aligned}$$

with $\rho^-_{3\mathbf{r}} \triangleq 1-\delta_{3\mathbf{r}}$, $\rho^+_{3\mathbf{r}} \triangleq 1+\delta_{3\mathbf{r}}$, $\alpha_{3\mathbf{r}} \triangleq \max_i \rho^+_{3\mathbf{r}}/(M p(i))$, and $i_t$ being an index selected from $[M]$ with probability $p(i_t)$. $U^t$ is defined as a subspace of $\mathbb{R}^{n_1\times\cdots\times n_d}$ that contains the span of $\mathcal{X}^t$, $\mathcal{X}^{t+1}$, and $\mathcal{X}^\star$. $P_{U^t}$ is then defined as the orthogonal projection onto $U^t$.

The above theorem indicates that the proposed StoTIHT algorithm is guaranteed to linearly converge to the true solution even when the full gradient computation is not available. Therefore, the proposed StoTIHT algorithm has a significant computational advantage in large-scale settings where the number of measurements is very large and the computation of the full gradient can be very expensive, since it only needs to compute a partial gradient.

We adapt the proofs of StoIHT in Nguyen et al. (2017) and TIHT in Rauhut et al. (2017) to show the linear convergence of our proposed StoTIHT algorithm. Note that in what follows, “rank” denotes Tucker rank. We first present a key lemma that will be used in the proof of Theorem 5.1.

###### Lemma 5.1.

Denote $i$ as the index randomly selected from $[M]$ with probability $p(i)$. For any fixed low-rank tensors $\mathcal{X}$ and $\mathcal{X}'$, let $U$ be a space that contains the space spanned by $\mathcal{X}$ and $\mathcal{X}'$. Assume that the rank of any tensor in $U$ is at most $\mathbf{r}$. Then, we have

$$\mathbb{E}_i\Big\|\mathcal{X}'-\mathcal{X}-\frac{\mu}{Mp(i)}P_U\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\Big\|_F \le \sqrt{1-(2-\mu\alpha_{\mathbf{r}})\mu\rho^-_{\mathbf{r}}}\,\|\mathcal{X}'-\mathcal{X}\|_F, \qquad (5.3)$$
$$\mathbb{E}_i\Big\|\mathcal{X}'-\mathcal{X}-\frac{\mu}{Mp(i)}\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\Big\|_F \le \sqrt{1+\mu^2\alpha_{\mathbf{r}}\rho^+_{\mathbf{r}}-2\mu\rho^-_{\mathbf{r}}}\,\|\mathcal{X}'-\mathcal{X}\|_F. \qquad (5.4)$$

To prove Lemma 5.1, we need the following two lemmas.

###### Lemma 5.2.

For any two low-rank tensors $\mathcal{X}$ and $\mathcal{X}'$, let $U$ be a space that contains the space spanned by $\mathcal{X}$ and $\mathcal{X}'$. Assume that the rank of any tensor in $U$ is at most $\mathbf{r}$. Suppose that the two linear operators $\mathcal{A}$ and $\mathcal{A}_{b_i}$ satisfy the TRIP defined in Definition 5.1. Then, the functions $F$ and $f_i$ defined in (3.2) satisfy

$$\rho^-_{\mathbf{r}}\|\mathcal{X}'-\mathcal{X}\|_F^2 \le \big\langle\mathcal{X}'-\mathcal{X}, \nabla F(\mathcal{X}')-\nabla F(\mathcal{X})\big\rangle, \qquad (5.5)$$
$$\|\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\|_F \le \rho^+_{\mathbf{r}}\|\mathcal{X}'-\mathcal{X}\|_F \qquad (5.6)$$

for all low-rank tensors $\mathcal{X}$ and $\mathcal{X}'$.

###### Proof.

Recall that

$$F(\mathcal{X}) = \frac{1}{2m}\sum_{i=1}^{m}\big(y(i)-\langle\mathcal{A}_i,\mathcal{X}\rangle\big)^2, \qquad f_i(\mathcal{X}) = \frac{1}{2b}\sum_{j=(i-1)b+1}^{ib}\big(y(j)-\langle\mathcal{A}_j,\mathcal{X}\rangle\big)^2.$$

With some fundamental calculations, we obtain their gradients

$$\nabla F(\mathcal{X}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{A}_i\big(\langle\mathcal{A}_i,\mathcal{X}\rangle-y(i)\big), \qquad \nabla f_i(\mathcal{X}) = \frac{1}{b}\sum_{j=(i-1)b+1}^{ib}\mathcal{A}_j\big(\langle\mathcal{A}_j,\mathcal{X}\rangle-y(j)\big).$$
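These gradient expressions can be checked numerically; below is a small self-contained sketch (all variable names are ours) comparing one entry of $\nabla F$ against a central finite difference of $F$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, shape = 8, (3, 4, 2)
A_tensors = [rng.standard_normal(shape) for _ in range(m)]
Xstar = rng.standard_normal(shape)
y = np.array([np.vdot(Ai, Xstar) for Ai in A_tensors])

def F(X):
    """F(X) = (1/2m) sum_i (y(i) - <A_i, X>)^2."""
    r = np.array([np.vdot(Ai, X) for Ai in A_tensors]) - y
    return 0.5 / m * (r @ r)

def grad_F(X):
    """grad F(X) = (1/m) sum_i A_i (<A_i, X> - y(i))."""
    return sum((np.vdot(Ai, X) - yi) * Ai for Ai, yi in zip(A_tensors, y)) / m

# Central finite-difference check along one coordinate direction.
X = rng.standard_normal(shape)
E = np.zeros(shape)
E[1, 2, 0] = 1.0
eps = 1e-6
fd = (F(X + eps * E) - F(X - eps * E)) / (2 * eps)
g = grad_F(X)[1, 2, 0]
```

Since $F$ is quadratic, the central difference matches the analytic gradient entry up to floating-point roundoff.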

It follows from the TRIP (5.1) that

$$\begin{aligned} \big\langle\mathcal{X}'-\mathcal{X}, \nabla F(\mathcal{X}')-\nabla F(\mathcal{X})\big\rangle &= \frac{1}{m}\Big\langle\mathcal{X}'-\mathcal{X}, \sum_{i=1}^{m}\mathcal{A}_i\langle\mathcal{A}_i,\mathcal{X}'-\mathcal{X}\rangle\Big\rangle \\ &= \frac{1}{m}\sum_{i=1}^{m}\langle\mathcal{A}_i,\mathcal{X}'-\mathcal{X}\rangle^2 = \frac{1}{m}\|\mathcal{A}(\mathcal{X}'-\mathcal{X})\|_2^2 \\ &\ge (1-\delta_{\mathbf{r}})\|\mathcal{X}'-\mathcal{X}\|_F^2. \end{aligned}$$

Thus, we finish the proof of (5.5) by setting $\rho^-_{\mathbf{r}} = 1-\delta_{\mathbf{r}}$.

Note that

$$\begin{aligned} \frac{1}{b}\|\mathcal{A}_{b_i}(\mathcal{X}'-\mathcal{X})\|_2^2 &= \frac{1}{b}\sum_{j=(i-1)b+1}^{ib}\langle\mathcal{A}_j,\mathcal{X}'-\mathcal{X}\rangle^2 = \frac{1}{b}\Big\langle\sum_{j=(i-1)b+1}^{ib}\mathcal{A}_j\langle\mathcal{A}_j,\mathcal{X}'-\mathcal{X}\rangle, \mathcal{X}'-\mathcal{X}\Big\rangle \\ &= \big\langle\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X}), \mathcal{X}'-\mathcal{X}\big\rangle \\ &= \big(f_i(\mathcal{X}')-f_i(\mathcal{X})-\langle\nabla f_i(\mathcal{X}),\mathcal{X}'-\mathcal{X}\rangle\big) + \big(f_i(\mathcal{X})-f_i(\mathcal{X}')-\langle\nabla f_i(\mathcal{X}'),\mathcal{X}-\mathcal{X}'\rangle\big). \end{aligned}$$

Without loss of generality, we assume that

$$f_i(\mathcal{X}')-f_i(\mathcal{X})-\langle\nabla f_i(\mathcal{X}),\mathcal{X}'-\mathcal{X}\rangle \le f_i(\mathcal{X})-f_i(\mathcal{X}')-\langle\nabla f_i(\mathcal{X}'),\mathcal{X}-\mathcal{X}'\rangle.$$

Then, it follows from the TRIP (5.2) that

$$f_i(\mathcal{X}')-f_i(\mathcal{X})-\langle\nabla f_i(\mathcal{X}),\mathcal{X}'-\mathcal{X}\rangle \le \frac{1}{2}(1+\delta_{\mathbf{r}})\|\mathcal{X}'-\mathcal{X}\|_F^2. \qquad (5.7)$$

Note that $f_i$ is a convex function with respect to $\mathcal{X}$; together with Lemma 4 in Zhou (2018), we then get (5.6) by setting $\rho^+_{\mathbf{r}} = 1+\delta_{\mathbf{r}}$.

###### Lemma 5.3.

(Co-coercivity) For any two low-rank tensors $\mathcal{X}$ and $\mathcal{X}'$, let $U$ be a space that contains the space spanned by $\mathcal{X}$ and $\mathcal{X}'$. Assume that the rank of any tensor in $U$ is at most $\mathbf{r}$. Then, we have

$$\big\|P_U\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\big\|_F^2 \le \rho^+_{\mathbf{r}}\big\langle\mathcal{X}'-\mathcal{X}, \nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big\rangle. \qquad (5.8)$$
###### Proof.

Exchanging the role of and in (5.7), we get

$$f_i(\mathcal{X})-f_i(\mathcal{X}')-\langle\nabla f_i(\mathcal{X}'),\mathcal{X}-\mathcal{X}'\rangle \le \frac{1}{2}\rho^+_{\mathbf{r}}\|\mathcal{X}'-\mathcal{X}\|_F^2.$$

Adding the above inequality and (5.7) together, we obtain

$$\big\langle\mathcal{X}'-\mathcal{X}, \nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big\rangle \le \rho^+_{\mathbf{r}}\|\mathcal{X}'-\mathcal{X}\|_F^2.$$

Define a function

$$G_i(\mathcal{W}) \triangleq f_i(\mathcal{W}) - \langle\nabla f_i(\mathcal{X}), \mathcal{W}\rangle.$$

Then, we have

$$\|\nabla G_i(\mathcal{W}_1)-\nabla G_i(\mathcal{W}_2)\|_F = \|\nabla f_i(\mathcal{W}_1)-\nabla f_i(\mathcal{W}_2)\|_F \le \rho^+_{\mathbf{r}}\|\mathcal{W}_1-\mathcal{W}_2\|_F$$

holds for any $\mathcal{W}_1$ and $\mathcal{W}_2$ with their span belonging to $U$. This implies that we can get a similar inequality as in (5.7), namely

$$G_i(\mathcal{W}_1)-G_i(\mathcal{W}_2)-\langle\nabla G_i(\mathcal{W}_2),\mathcal{W}_1-\mathcal{W}_2\rangle \le \frac{1}{2}\rho^+_{\mathbf{r}}\|\mathcal{W}_1-\mathcal{W}_2\|_F^2. \qquad (5.9)$$

Note that

$$G_i(\mathcal{W})-G_i(\mathcal{X}) = f_i(\mathcal{W})-f_i(\mathcal{X})-\langle\nabla f_i(\mathcal{X}),\mathcal{W}-\mathcal{X}\rangle = \frac{1}{2b}\sum_{j=(i-1)b+1}^{ib}\langle\mathcal{A}_j,\mathcal{W}-\mathcal{X}\rangle^2 \ge 0 \qquad (5.10)$$

holds for all $\mathcal{W}$. Here, the second equality follows by plugging in the expressions of $G_i$, $f_i$, and $\nabla f_i$. Define $\mathcal{W} \triangleq \mathcal{X}' - \frac{1}{\rho^+_{\mathbf{r}}}P_U\nabla G_i(\mathcal{X}')$; it can be seen that $\mathcal{W} \in U$ since both $\mathcal{X}'$ and $P_U\nabla G_i(\mathcal{X}')$ belong to $U$. Then, applying (5.9) and (5.10), we have

$$\begin{aligned} G_i(\mathcal{X}) \le G_i(\mathcal{W}) &= G_i\Big(\mathcal{X}'-\frac{1}{\rho^+_{\mathbf{r}}}P_U\nabla G_i(\mathcal{X}')\Big) \\ &\le G_i(\mathcal{X}') + \Big\langle\nabla G_i(\mathcal{X}'), -\frac{1}{\rho^+_{\mathbf{r}}}P_U\nabla G_i(\mathcal{X}')\Big\rangle + \frac{1}{2\rho^+_{\mathbf{r}}}\big\|P_U\nabla G_i(\mathcal{X}')\big\|_F^2 \\ &= G_i(\mathcal{X}') - \frac{1}{2\rho^+_{\mathbf{r}}}\big\|P_U\nabla G_i(\mathcal{X}')\big\|_F^2. \end{aligned}$$

Plugging the definitions of $G_i$ and $\nabla G_i$ into the above inequality gives

$$\frac{1}{2\rho^+_{\mathbf{r}}}\big\|P_U\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\big\|_F^2 \le G_i(\mathcal{X}')-G_i(\mathcal{X}) = f_i(\mathcal{X}')-f_i(\mathcal{X})-\langle\nabla f_i(\mathcal{X}),\mathcal{X}'-\mathcal{X}\rangle.$$

Finally, we can obtain (5.8) by summing the two inequalities with $\mathcal{X}$ and $\mathcal{X}'$ exchanged.

Proof of Lemma 5.1. With the above two lemmas, we obtain

$$\begin{aligned} &\mathbb{E}_i\Big\|\mathcal{X}'-\mathcal{X}-\frac{\mu}{Mp(i)}P_U\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\Big\|_F^2 \\ &\quad= \|\mathcal{X}'-\mathcal{X}\|_F^2 + \mathbb{E}_i\frac{\mu^2}{(Mp(i))^2}\big\|P_U\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\big\|_F^2 - 2\mu\,\mathbb{E}_i\Big\langle\mathcal{X}'-\mathcal{X}, \frac{1}{Mp(i)}P_U\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\Big\rangle \\ &\quad\le \|\mathcal{X}'-\mathcal{X}\|_F^2 + \mu^2\,\mathbb{E}_i\frac{\rho^+_{\mathbf{r}}}{(Mp(i))^2}\big\langle\mathcal{X}'-\mathcal{X}, \nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big\rangle - 2\mu\,\mathbb{E}_i\Big\langle\mathcal{X}'-\mathcal{X}, \frac{1}{Mp(i)}\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\Big\rangle \\ &\quad\le \|\mathcal{X}'-\mathcal{X}\|_F^2 + \Big(\mu^2\max_i\frac{\rho^+_{\mathbf{r}}}{Mp(i)} - 2\mu\Big)\mathbb{E}_i\Big\langle\mathcal{X}'-\mathcal{X}, \frac{1}{Mp(i)}\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\Big\rangle \\ &\quad= \|\mathcal{X}'-\mathcal{X}\|_F^2 - (2\mu-\mu^2\alpha_{\mathbf{r}})\big\langle\mathcal{X}'-\mathcal{X}, \nabla F(\mathcal{X}')-\nabla F(\mathcal{X})\big\rangle \\ &\quad\le \|\mathcal{X}'-\mathcal{X}\|_F^2 - (2\mu-\mu^2\alpha_{\mathbf{r}})\rho^-_{\mathbf{r}}\|\mathcal{X}'-\mathcal{X}\|_F^2. \end{aligned}$$

Here, the first inequality follows from (5.8) and the fact that $\langle\mathcal{X}'-\mathcal{X}, P_U(\cdot)\rangle = \langle\mathcal{X}'-\mathcal{X}, \cdot\rangle$ since $\mathcal{X}'-\mathcal{X} \in U$. The second equality follows from the definition of $\alpha_{\mathbf{r}}$ and the identity $\mathbb{E}_i\frac{1}{Mp(i)}\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big) = \nabla F(\mathcal{X}')-\nabla F(\mathcal{X})$. The last inequality follows from (5.5). Then, we complete the proof of (5.3) by applying the well-known inequality $\mathbb{E}\|\mathcal{Z}\|_F \le \sqrt{\mathbb{E}\|\mathcal{Z}\|_F^2}$.

Similar to the proof of (5.3), we also have

$$\begin{aligned} &\mathbb{E}_i\Big\|\mathcal{X}'-\mathcal{X}-\frac{\mu}{Mp(i)}\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\Big\|_F^2 \\ &\quad= \|\mathcal{X}'-\mathcal{X}\|_F^2 + \mathbb{E}_i\frac{\mu^2}{(Mp(i))^2}\big\|\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big\|_F^2 - 2\mu\,\mathbb{E}_i\Big\langle\mathcal{X}'-\mathcal{X}, \frac{1}{Mp(i)}\big(\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big)\Big\rangle \\ &\quad= \|\mathcal{X}'-\mathcal{X}\|_F^2 + \mathbb{E}_i\frac{\mu^2}{(Mp(i))^2}\big\|\nabla f_i(\mathcal{X}')-\nabla f_i(\mathcal{X})\big\|_F^2 - 2\mu\big\langle\mathcal{X}'-\mathcal{X}, \nabla F(\mathcal{X}')-\nabla F(\mathcal{X})\big\rangle \\ &\quad\le \|\mathcal{X}'-\mathcal{X}\|_F^2 + \mathbb{E}_i\frac{\mu^2}{(Mp(i))^2}(\rho^+_{\mathbf{r}})^2\|\mathcal{X}'-\mathcal{X}\|_F^2 - 2\mu\rho^-_{\mathbf{r}}\|\mathcal{X}'-\mathcal{X}\|_F^2 \\ &\quad\le \big(1+\mu^2\alpha_{\mathbf{r}}\rho^+_{\mathbf{r}}-2\mu\rho^-_{\mathbf{r}}\big)\|\mathcal{X}'-\mathcal{X}\|_F^2, \end{aligned}$$

where the first inequality follows from (5.5) and (5.6), and the last inequality follows from

$$\mathbb{E}_i\frac{\mu^2}{(Mp(i))^2}(\rho^+_{\mathbf{r}})^2 \le \mu^2\max_i\frac{\rho^+_{\mathbf{r}}}{Mp(i)}\,\mathbb{E}_i\frac{\rho^+_{\mathbf{r}}}{Mp(i)} = \mu^2\alpha_{\mathbf{r}}\sum_{i=1}^{M}\frac{\rho^+_{\mathbf{r}}}{Mp(i)}\,p(i) = \mu^2\alpha_{\mathbf{r}}\rho^+_{\mathbf{r}}.$$

Therefore, by applying $\mathbb{E}\|\mathcal{Z}\|_F \le \sqrt{\mathbb{E}\|\mathcal{Z}\|_F^2}$, we finish the proof of (5.4).

We are now prepared to prove Theorem 5.1.

###### Proof.

Denote $\widetilde{\mathcal{X}}^t_{\mathrm{best}}$ as the best rank-$\mathbf{r}$ approximation of $\widetilde{\mathcal{X}}^t$. Assume that the rank-$\mathbf{r}$ approximation operator $\mathcal{H}_{\mathbf{r}}$ satisfies

$$\big\|\mathcal{H}_{\mathbf{r}}(\widetilde{\mathcal{X}}^t)-\widetilde{\mathcal{X}}^t\big\|_F \le \eta\,\big\|\widetilde{\mathcal{X}}^t_{\mathrm{best}}-\widetilde{\mathcal{X}}^t\big\|_F$$

for all $t$ with some $\eta \ge 1$. Note that for the HOSVD, Grasedyck (2010) demonstrated methods to compute such quasi-optimal rank-$\mathbf{r}$ approximations with explicit constants $\eta$. We refer the reader to Grasedyck (2010) for a description of these methods. Then, we have

$$\|\mathcal{X}^{t+1}-\widetilde{\mathcal{X}}^t\|_F = \big\|\mathcal{H}_{\mathbf{r}}(\widetilde{\mathcal{X}}^t)-\widetilde{\mathcal{X}}^t\big\|_F \le \eta\,\big\|\widetilde{\mathcal{X}}^t_{\mathrm{best}}-\widetilde{\mathcal{X}}^t\big\|_F \le \eta\,\big\|\widetilde{\mathcal{X}}^t-\mathcal{X}^\star\big\|_F.$$

It follows that

$$\eta^2\big\|\widetilde{\mathcal{X}}^t-\mathcal{X}^\star\big\|_F^2 \ge \|\mathcal{X}^{t+1}-\widetilde{\mathcal{X}}^t\|_F^2 = \big\|\mathcal{X}^{t+1}-\mathcal{X}^\star+\mathcal{X}^\star-\widetilde{\mathcal{X}}^t\big\|_F^2.$$

Simplifying, we get

$$\|\mathcal{X}^{t+1}-\mathcal{X}^\star\|_F^2 \le 2\big\langle\mathcal{X}^{t+1}-\mathcal{X}^\star, \widetilde{\mathcal{X}}^t-\mathcal{X}^\star\big\rangle + (\eta^2-1)\big\|\widetilde{\mathcal{X}}^t-\mathcal{X}^\star\big\|_F^2.$$

Plugging in the updating step (4.3) gives

$$\|\mathcal{X}^{t+1}-\mathcal{X}^\star\|_F^2 \le 2\Big\langle\mathcal{X}^{t+1}-\mathcal{X}^\star,\ \mathcal{X}^t-\mathcal{X}^\star-\frac{\mu}{Mp(i_t)}\nabla f_{i_t}(\mathcal{X}^t)\Big\rangle + (\eta^2-1)\big\|\widetilde{\mathcal{X}}^t-\mathcal{X}^\star\big\|_F^2.$$