1 Introduction
In this paper, we develop a robust tensor decomposition (RTD) method, which recovers a low rank tensor subject to gross corruptions. Given an input tensor $T$, we aim to recover both $L^*$ and $S^*$, where $L^*$ is a low rank tensor and $S^*$ is a sparse tensor, such that

(1)  $T = L^* + S^*$, with $L^* = \sum_{i=1}^{r} \sigma_i^*\, u_i \otimes u_i \otimes u_i$.

The above form of $L^*$ is known as the Candecomp/Parafac or CP form. We assume that $L^*$ is a rank-$r$ orthogonal tensor, i.e., $\langle u_i, u_j \rangle = 1$ if $i = j$ and $0$ otherwise. The above problem arises in numerous applications such as image and video denoising [17], multitask learning, and robust learning of latent variable models (LVMs) with grossly corrupted moments; for details, see Section 1.2.

The matrix version of (1), viz., decomposing a matrix into sparse and low rank matrices, is known as robust principal component analysis (PCA) and has been studied extensively [8, 6, 15, 22]. Both convex [8, 6] and nonconvex [22] methods have been proposed with provable recovery.

One can attempt to solve the robust tensor problem in (1) using matrix methods: robust matrix PCA can be applied either to each matrix slice of the tensor or to the matrix obtained by flattening the tensor. However, such matrix methods ignore the tensor algebraic constraints, i.e., the CP rank constraints, which differ from the matrix rank constraints. Incorporating the tensor CP rank constraints poses a number of challenges. Enforcing a given tensor rank is NP-hard [14], unlike the matrix case, where low rank projections can be computed efficiently. Moreover, finding the best convex relaxation of the tensor CP rank is also NP-hard [14], again unlike the matrix case, where the convex relaxation of the rank, viz., the nuclear norm, can be computed efficiently.
1.1 Summary of Results
Proposed method:
We propose a nonconvex iterative method, termed RTD, that maintains low rank and sparse estimates $\hat{L}$ and $\hat{S}$, which are alternately updated. The low rank estimate $\hat{L}$ is updated through the eigenvector computation of $T - \hat{S}$, and the sparse estimate is updated through (hard) thresholding of the residual $T - \hat{L}$.

Tensor eigenvector computation:
Computing eigenvectors of $T - \hat{S}$ is challenging, since this tensor can have arbitrary "noise" added to an orthogonal tensor, and hence the techniques of [2] do not apply: they only guarantee an approximation to the eigenvectors up to the "noise" level. Results similar to the shifted power method of [19] should apply to our problem, but their guarantees hold only in an arbitrarily small ball centered at the true eigenvectors, and the size of this ball is typically not well defined. In this work, we provide a simple variant of the tensor power method based on gradient ascent of a regularized variational form of the eigenvalue problem of a tensor. We show that our method converges to the true eigenvectors at a linear rate when initialized within a reasonably small ball around the eigenvectors. See Theorem 2 for details.

Guaranteed recovery:
As a main result, we prove convergence of RTD to the global optimum under an incoherence assumption on $L^*$ and a bounded sparsity level for $S^*$. These conditions are similar to those required for the success of matrix robust PCA. We also prove a fast linear convergence rate, i.e., we obtain an additive $\epsilon$-approximation in $O(\log(1/\epsilon))$ iterations.
Superiority over matrix robust PCA:
We compare our method with matrix robust PCA, carried out either on matrix slices of the tensor or on the flattened tensor. We prove that our method is superior and can handle higher sparsity levels in the noise tensor $S^*$ when it is block structured. Intuitively, each block of noise represents correlated noise that persists for a subset of slices in the tensor. For example, in a video, an occlusion remains fixed over a small number of frames. In the scenario of moment-based estimation, $S^*$ represents gross corruptions of the moments of some multivariate distribution, and we can assume that the corruption occurs over a small subset of variables.

We prove that our tensor method can handle a much higher level of block sparse perturbations when the overlap between the blocks is controlled (e.g., random block sparsity). For example, for a rank-1 tensor, our method can handle far more corrupted entries per fiber of the tensor (i.e., per row/column of a slice of the tensor) than matrix robust PCA methods allow, and the matrix bound is tight [22]. We prove that even better gains are obtained when the rank of $L^*$ increases, and we provide precise results in this paper. Thus, our method achieves the best of both worlds: better accuracy and faster running times.
We conduct extensive simulations to empirically validate the performance of our method and compare it to various matrix robust PCA methods. Our synthetic experiments show that our tensor method is both more accurate and faster than matrix decomposition methods. On the real-world Curtain dataset, for activity detection, our tensor method obtains better recovery with a significant speedup.
Overview of techniques:
At a high level, the proposed method is a tensor analogue of the nonconvex matrix robust PCA method in [22]. However, both the algorithm and its analysis are significantly more challenging for two key reasons: (a) there is significantly more structure in the tensor problem, which needs to be exploited carefully using structure in the noise, and (b) unlike matrices, tensors can have an exponential number of eigenvectors [7].
We stress that we need to establish convergence to the globally optimal solution $(L^*, S^*)$, and not just to a local optimum, despite the nonconvexity of the decomposition problem. Intuitively, if we are in the basin of attraction of the global optimum, it is natural to expect that the estimates are progressively refined and get closer to the true solution. However, characterizing this basin, and the conditions needed to ensure that we "land" in it, is nontrivial and novel.
As mentioned above, our method alternates between finding a low rank estimate of the residual $T - \hat{S}$ and a sparse estimate of the residual $T - \hat{L}$. The main steps in our proof are as follows: (i) For updating the low rank estimate, we propose a modified tensor power method and prove that it converges to one of the eigenvectors of $T - \hat{S}$; in addition, the recovered eigenvectors are "close" to the components of $L^*$. (ii) When the sparse estimate is updated through hard thresholding, we prove that the support of $\hat{S}$ is contained within that of $S^*$. (iii) We make strict progress in each epoch, where $\hat{L}$ and $\hat{S}$ are alternately updated.

To prove the first part, we establish that the proposed method performs gradient ascent on a regularized variational form of the eigenvector problem. We then establish that the regularized objective satisfies local strong convexity and smoothness, and that with a polynomial number of initializations, we can recover vectors that are "reasonably" close to eigenvectors of $T - \hat{S}$. Using these two facts, we establish linear convergence to the true eigenvectors, which are close to the components of $L^*$. For steps (ii) and (iii), we show that using an intuitive block structure in the noise, we can bound the effect of noise on the eigenvectors of the true low rank tensor, and show that the proposed iterative scheme refines the estimates of $\hat{L}$ and $\hat{S}$ and converges to $(L^*, S^*)$ at a linear rate.
1.2 Applications
In addition to the standard application of robust tensor decomposition to video denoising [17], we propose two applications in probabilistic learning.
Learning latent variable models using grossly corrupted moments:
Using tensor decomposition for learning LVMs has been intensely studied in the last few years, e.g., [2, 1, 18]. The idea is to learn the models through CP decomposition of higher order moment tensors. While the above works assume access to empirical moments, we can extend the framework to robust estimation, where the moments are subject to gross corruptions. Such gross corruptions can occur either through adversarial manipulation or through systematic bias in estimating the moments of some subset of variables.
Multitask learning of linear Bayesian networks:
The samples of each Bayesian network are generated by a linear map applied to its hidden variables, plus noise. If the Bayesian networks are related, we can share parameters among them; in the above framework, we share the parameters that map the hidden variables to the observed ones. Assuming that all the noise covariances are sparse, stacking them together forms a sparse tensor, while the stacked shared components constitute a low rank tensor. Thus, we can consider the samples jointly and learn the parameters by performing robust tensor decomposition.
1.3 Related Work
Robust matrix decomposition:
In the matrix setting, the above problem of decomposition into sparse and low rank parts is popularly known as robust PCA and has been studied in a number of works ([8], [6]). The popular approach is based on convex relaxation, where the low rank penalty is replaced by the nuclear norm and the sparsity penalty by the $\ell_1$ norm. However, this technique is not applicable in the tensor setting when we consider the CP rank, for which no convex surrogate is available.
Recently, a nonconvex method for robust PCA was proposed in [22]. It involves alternating steps of PCA and thresholding of the residual. Our proposed tensor method can be seen as a tensor analogue of the method in [22]. However, the analysis is very different, since the optimization landscape for tensor decomposition differs significantly from that of the matrix case.
Convex Robust Tucker decomposition:
Previous works that employ convex surrogates for tensor problems use a different notion of rank, known as the Tucker rank or multirank, e.g., [25, 10, 16, 20, 12]. However, the multirank is based on the ranks of the matricizations (flattenings) of the tensor, and thus these methods do not exploit the tensor algebraic constraints. The problem of robust tensor PCA is specifically tackled in [13, 11]. In [11], convex and nonconvex methods are proposed based on the Tucker rank, but there are no guarantees that they recover the original tensors $L^*$ and $S^*$ in (1). In [13], success is proved under restricted eigenvalue conditions; however, these conditions are opaque, and it is not clear what level of sparsity they allow.
Sum of squares:
Barak and Moitra [5] recently considered CP tensor completion using algorithms based on the sum of squares hierarchy. However, these algorithms are expensive. In contrast, we consider simple iterative methods based on the power method, which are efficient and scalable to large datasets. It is, however, unclear whether the sum of squares algorithm improves the results for the block sparse model considered here.
Robust tensor decomposition:
Shah et al. [23] consider a robust tensor decomposition method using a randomized convex relaxation formulation. Under their random sparsity model, their algorithm provides guaranteed recovery as long as the number of nonzeros per fibre is sufficiently small; in contrast, our method potentially tolerates a far larger number of nonzero sparse corruptions per fibre.
2 Proposed Algorithm
Notations:
Let $[n] := \{1, \ldots, n\}$, and let $\|v\|$ denote the $\ell_2$ norm of a vector $v$. For a matrix or a tensor, $\|\cdot\|$ refers to the spectral norm and $\|\cdot\|_\infty$ refers to the maximum absolute entry.
Tensor preliminaries:
A real third order tensor $T \in \mathbb{R}^{n \times n \times n}$ is a three-way array. The different dimensions of the tensor are referred to as modes. Fibers are higher order analogues of matrix rows and columns; a fiber is obtained by fixing all but one of the indices of the tensor. For example, the mode-1 fibers of a third order tensor $T$ are the vectors $T(:, j, k)$. Similarly, slices are obtained by fixing all but two of the indices; for example, the slices along the third mode are the matrices $T(:, :, k)$. A flattening of $T$ along mode $m$ is the matrix whose columns correspond to the mode-$m$ fibres of $T$.
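As an illustration, these objects are easy to realize in code; the short numpy sketch below (with hypothetical dimensions of our choosing) shows a fiber, a slice, and a mode-1 flattening whose columns are the mode-1 fibers.

```python
import numpy as np

# A hypothetical 2 x 3 x 4 third-order tensor (dimensions chosen for illustration).
T = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

fiber = T[:, 1, 2]       # a mode-1 fiber: fix all indices except the first
slice3 = T[:, :, 0]      # a slice along the third mode: fix only the third index

# Mode-1 flattening: a 2 x 12 matrix whose columns are the mode-1 fibers of T.
flat1 = T.reshape(T.shape[0], -1)
```

With C-ordered reshaping, column $j$ of the flattening is the fiber with second and third indices $(j \,\mathrm{div}\, 4,\ j \,\mathrm{mod}\, 4)$.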
We view a tensor as a multilinear form. In particular, for vectors $u, v, w \in \mathbb{R}^n$, let

(2)  $T(I, v, w) := \sum_{j, k \in [n]} v_j w_k\, T(:, j, k) \in \mathbb{R}^n,$

which is a multilinear combination of the tensor mode-1 fibers. Similarly, $T(u, v, w) := \sum_{i, j, k \in [n]} u_i v_j w_k\, T_{ijk} \in \mathbb{R}$ is a multilinear combination of the tensor entries.
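A minimal numpy sketch of these two multilinear operations (the function names are ours):

```python
import numpy as np

def form(T, u, v, w):
    """T(u, v, w): scalar multilinear combination of the entries of T."""
    return np.einsum('ijk,i,j,k->', T, u, v, w)

def contract(T, v, w):
    """T(I, v, w): vector multilinear combination of the mode-1 fibers of T."""
    return np.einsum('ijk,j,k->i', T, v, w)
```

By construction, $T(u, v, w) = \langle u, T(I, v, w) \rangle$, which is the identity the tensor power method exploits.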
A tensor $T$ has CP rank at most $r$ if it can be written as the sum of $r$ rank-1 tensors:

(3)  $T = \sum_{i=1}^{r} \sigma_i\, u_i \otimes u_i \otimes u_i, \qquad \sigma_i \in \mathbb{R},\ u_i \in \mathbb{R}^n,$

where $\otimes$ denotes the outer product. We sometimes abbreviate $u \otimes u \otimes u$ as $u^{\otimes 3}$. Without loss of generality, $\|u_i\| = 1$, since the scale can be absorbed into $\sigma_i$.
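A sketch of building such a symmetric, orthogonal CP tensor with numpy (the rank and weights below are hypothetical choices of ours):

```python
import numpy as np

def cp_tensor(sigma, U):
    """Sum_i sigma[i] * u_i (x) u_i (x) u_i, where u_i are the columns of U."""
    return np.einsum('r,ir,jr,kr->ijk', sigma, U, U, U)

rng = np.random.default_rng(0)
n, r = 5, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthonormal basis
U = Q[:, :r]                                       # r orthonormal components
L = cp_tensor(np.array([3.0, 1.5]), U)             # rank-2 orthogonal tensor
```

Orthonormality of the components means that evaluating the multilinear form at a component, $L(u_i, u_i, u_i)$, recovers $\sigma_i$ exactly.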
RTD method:
We propose a nonconvex algorithm for robust tensor decomposition, described in Algorithm 1. The algorithm proceeds in stages $k = 1, \ldots, r$, where $r$ is the target rank of the low rank estimate. In the $k$-th stage, we alternate between a rank-$k$ projection of $T - \hat{S}$ and hard thresholding of the residual $T - \hat{L}$. The rank-$k$ projection, which requires the $k$ leading eigenpairs of $T - \hat{S}$, is computed by gradient ascent with multiple restarts and deflation (see Procedure 1).
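The alternating structure can be sketched as follows. This is our own simplified rank-1 variant: plain power iterations stand in for Procedure 1, and the threshold `zeta` is left as a free parameter rather than the schedule of Algorithm 1.

```python
import numpy as np

def top_eigpair(T, iters=50, seed=0):
    """Tensor power iterations v <- T(I,v,v)/||T(I,v,v)|| (stand-in for Procedure 1)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = np.einsum('ijk,j,k->i', T, v, v)
        v = w / np.linalg.norm(w)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)
    return lam, v

def hard_threshold(R, zeta):
    """Keep only entries whose magnitude exceeds the threshold zeta."""
    return R * (np.abs(R) > zeta)

def rtd_rank1(T, zeta, epochs=10):
    """Alternate rank-1 projection of T - S and hard thresholding of T - L."""
    S = np.zeros_like(T)
    for _ in range(epochs):
        lam, v = top_eigpair(T - S)
        L = lam * np.einsum('i,j,k->ijk', v, v, v)
        S = hard_threshold(T - L, zeta)
    return L, S
```

On a clean rank-1 input the low rank estimate is recovered in the first epoch and the sparse estimate stays empty; the interesting regime, analyzed in Section 3, is when the corruptions are nonzero but structured.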
Computational complexity:
In RTD, at the $k$-th stage, the eigenpairs are computed using Procedure 1, and the hard thresholding operation requires time linear in the number of tensor entries. Each stage runs for a number of iterations that is logarithmic in the target accuracy, and there are $r$ stages. By Theorem 2, a polynomial number of initializations and a logarithmic number of power iterations suffice, up to polylogarithmic factors. Hence, the overall computational complexity of RTD is polynomial in the tensor dimension and the rank.
3 Theoretical Guarantees
In this section, we provide guarantees for RTD, proposed in the previous section, for solving the robust tensor decomposition problem (1). Even though we consider symmetric $L^*$ and $S^*$ in (1), the results extend to asymmetric tensors by embedding them into symmetric tensors, along the lines of [24].
In general, it is impossible to uniquely recover the low rank and sparse components. Instead, we assume the following conditions to guarantee uniqueness:
(L) $L^*$ is a rank-$r$ orthogonal tensor as in (1), i.e., $L^* = \sum_{i \in [r]} \sigma_i^*\, u_i^{\otimes 3}$, where $\langle u_i, u_j \rangle = 1$ if $i = j$ and $0$ otherwise. Moreover, $L^*$ is $\mu$-incoherent, i.e., $\|u_i\|_\infty \leq \mu/\sqrt{n}$, for all $i \in [r]$.
The above conditions, viz., an incoherent low rank tensor, are similar to the conditions for robust matrix PCA. For tensors, the assumption of an orthogonal decomposition is limiting, since there exist tensors whose CP decomposition is nonorthogonal [2]. We later discuss how our analysis can be extended to nonorthogonal tensors. We now list the conditions for the sparse tensor $S^*$.
The tensor $S^*$ is block sparse, where each block has at most $d$ nonzero entries along any fibre, and the number of blocks is $B$. Let $\Psi$ be the support tensor, which has the same sparsity pattern as $S^*$ but with unit entries, i.e., $\Psi_{ijk} = 1$ iff $S^*_{ijk} \neq 0$, for all $i, j, k$. We now make assumptions on the sparsity pattern of $\Psi$. Let $\eta$ be the maximum fraction of overlap between the supports of any two distinct block fibres. (S) Let $d$ be the sparsity level along any fibre of a block and let $B$ be the number of blocks. We require
(4)  
(5) 
We assume the above block sparsity model for $S^*$. Under this model, the support tensor $\Psi$, which encodes the sparsity pattern, has rank at most $B$, but the sparse tensor $S^*$ need not, since its entries are allowed to be arbitrary. We also fix some parameters for ease of exposition, and we show one concrete example where our method significantly outperforms matrix robust PCA methods.
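The block sparse support model can be sketched as follows (our own parameterization: each block is generated from a $d$-sparse 0/1 vector, and the support is the union of the blocks):

```python
import numpy as np

def block_sparse_support(n, n_blocks, d, rng):
    """Union of blocks b (x) b (x) b, each from a d-sparse 0/1 vector b."""
    psi = np.zeros((n, n, n))
    for _ in range(n_blocks):
        b = np.zeros(n)
        b[rng.choice(n, size=d, replace=False)] = 1.0
        psi = np.maximum(psi, np.einsum('i,j,k->ijk', b, b, b))
    return psi
```

Each fibre of a single block has at most $d$ nonzeros, so any fibre of the union has at most $B \cdot d$ nonzeros.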
As discussed in the introduction, it may be advantageous to consider tensor methods for robust decomposition only when sparsity across the different matrix slices are aligned/structured in some manner, and a block sparse model is a natural structure to consider. We later demonstrate the precise nature of superiority of tensor methods under block sparse perturbations.
For the above sparsity structure, we set the threshold in our algorithm accordingly. Under these conditions, our proposed algorithm converges to the globally optimal solution.
Theorem 1 (Convergence to global optimum for RTD).
Let $(L^*, S^*)$ satisfy assumptions (L) and (S). Then the outputs $\hat{L}$ (with its parameters $\hat{\sigma}_i$ and $\hat{u}_i$) and $\hat{S}$ of Algorithm 1 satisfy w.h.p.:
Comparison with matrix methods:
We now compare RTD with matrix methods for recovering the sparse and low rank tensor components in (1). Robust matrix PCA can be performed either on each of the matrix slices of the input tensor or on the flattened tensor (see the definitions in Section 2). We can use either convex relaxation methods [8, 6, 15] or nonconvex methods [22] for robust matrix PCA.
Recall that $\eta$ measures the fraction of overlapping entries between any two different block fibres of the support tensor $\Psi$ in (3), which encodes the sparsity pattern of $S^*$ with 0/1 entries. A short proof is given in Appendix B.1.
Corollary 1 (Superiority of tensor methods).
Simplifications under random block sparsity:
We now obtain transparent results for a random block sparsity model, in which the components of the support tensor $\Psi$ in (3) are drawn uniformly among all $d$-sparse vectors. In this case, the overlap parameter $\eta$ simplifies, and so does the condition on the number of blocks $B$ in (5), with the exact form depending on the sparsity regime. Recall that, as before, we require the sparsity level of a fibre in any block to be at most $d$. For simplicity, we fix the regime for the remainder of this section.
We now explicitly compute the sparsity level of $S^*$ allowed by our method and compare it to the level allowed by matrix based robust PCA.
Corollary 2 (Superiority of tensor methods under random sparsity).
Let $D_{\mathrm{tensor}}$ be the number of nonzeros in $S^*$ allowed by RTD under the analysis of Theorem 1, and let $D_{\mathrm{matrix}}$ be the number allowed under the standard matrix robust PCA analysis. Also, let $S^*$ be sampled from the random block sparse model. Then the following holds:
for ,  (6)  
o.w.  (7) 
Unstructured sparse perturbations:
If we do not assume block sparsity in (S), but instead assume unstructured sparsity, i.e., a bound on the number of nonzeros along any fibre of $S^*$, then the matrix methods are superior: the sparsity level required for the success of RTD is more stringent than the requirement of matrix methods. Our analysis suggests that if there is no structure in the sparsity pattern, matrix methods win, possibly because finding a low rank tensor decomposition requires more stringent conditions on the noise level, while without block structure the tensor algebraic constraints do not add significantly new information. However, in the experiments, we find that our tensor method is superior to matrix methods even in this case, in terms of both accuracy and running time.
3.1 Analysis of Procedure 1
[Figures 1 and 2, panels (a)–(d): plots comparing TRPCA (non-whitened), TRPCAW with random and with true whitening, MRPCA applied per slice, and MRPCA applied to the flattened tensor.]
Our proof of Theorem 1 depends critically on the claim that Procedure 1 indeed obtains the top eigenpairs; we now prove this claim concretely. Let $\tilde{T} = T^* + E$ be a symmetric tensor that is a perturbed version of an orthogonal tensor $T^* = \sum_{i \in [r]} \sigma_i\, u_i^{\otimes 3}$, where the $u_i$ form an orthonormal basis.
Recall that Procedure 1 is parameterized by the number of initializations used to seed the power method, the number of power iterations, and the convergence criterion. For the analysis, we assume that the spectral norm of the perturbation $E$ is bounded by a small constant times the smallest eigenvalue of $T^*$. We now state the main result for the recovery of the components of $T^*$ when Procedure 1 is applied to $\tilde{T}$.
Theorem 2 (Gradient Ascent method).
Let $\tilde{T}$ be as defined above. Then applying Procedure 1 with deflation on $\tilde{T}$ with target rank $r$ yields $r$ eigenpairs of $\tilde{T}$, up to arbitrarily small error and with high probability. Moreover, there exists a permutation $\pi$ on $[r]$ such that each recovered eigenpair is close to the corresponding $(\sigma_{\pi(i)}, u_{\pi(i)})$.

While [2, Thm. 5.1] considers the power method alone, here we consider the power method followed by a gradient ascent procedure. With both methods, we obtain outputs that are "close" to the original eigenpairs of $T^*$. The crucial difference is that the outputs of Procedure 1 correspond to specific eigenpairs of the input tensor $\tilde{T}$, while the outputs of the usual power method have no such property and are only guaranteed to be accurate up to the perturbation level. We critically require this eigenpair property of the outputs in order to guarantee contraction of the error between the alternating steps of low rank decomposition and thresholding.
The analysis of Procedure 1 has two phases. In the first phase, we prove that with sufficiently many initializations and power iterations, we get close to the true eigenpairs of $T^*$. After this, in the second phase, we prove convergence to eigenpairs of $\tilde{T}$.
The proof of the first phase follows along the lines of the proof in [2], but with an improved requirement on the error tensor in (3.1). This improvement is due to the use of SVD initializations rather than random initializations to seed the power method; the corresponding analysis is given in [3].
The proof of the second phase follows from two observations: (a) Procedure 1 is simply gradient ascent on a regularized variational form of the eigenvalue problem, and (b) within a small distance of the eigenvectors of $\tilde{T}$, this objective is strongly concave as well as strongly smooth, with appropriate parameters. See the lemma below for a detailed proof of this claim. Hence, using the initialization guarantee from phase one, Procedure 1 converges at a linear rate to an approximate eigenpair of $\tilde{T}$, and Theorem 2 holds.
Lemma 3.
Let $f$ denote the regularized objective optimized by Procedure 1. Then $f$ is strongly concave and strongly smooth at all points $v$ sufficiently close to an eigenvector of $\tilde{T}$, and Procedure 1 converges to an eigenvector of $\tilde{T}$ at a linear rate.
Proof.
Consider the gradient and Hessian of $f$ with respect to $v$:
(8)  
(9) 
We first note that the stationary points of $f$ indeed correspond to eigenvectors of $\tilde{T}$. Recall that $\tilde{T} = T^* + E$, where $T^*$ is an orthogonal tensor and the perturbation $E$ is small in spectral norm. Hence, there is exactly one stationary point in a small neighborhood of each (scaled) component, and within each such neighborhood $f$ is strongly concave and strongly smooth. Hence, the standard gradient ascent procedure on $f$ converges to an eigenvector of $\tilde{T}$.
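To make the stationary-point claim concrete: one natural regularized objective with this property (an assumed form for illustration, not necessarily the exact regularizer used in Procedure 1) is $f(v) = T(v,v,v) - \tfrac{3}{4}\|v\|^4$, whose gradient is $3\,T(I,v,v) - 3\|v\|^2 v$, so stationary points satisfy $T(I,v,v) = \|v\|^2 v$, i.e., they are eigenvectors. The sketch below checks this numerically for an orthogonal tensor at $v = \sigma_1 u_1$.

```python
import numpy as np

def grad_f(T, v):
    """Gradient of the assumed objective f(v) = T(v,v,v) - (3/4) * ||v||^4."""
    return 3 * np.einsum('ijk,j,k->i', T, v, v) - 3 * np.dot(v, v) * v

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
U, sig = Q[:, :2], np.array([4.0, 2.0])
T = np.einsum('r,ir,jr,kr->ijk', sig, U, U, U)   # orthogonal rank-2 tensor

g = grad_f(T, sig[0] * U[:, 0])   # v = sigma_1 * u_1 should be stationary
```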
Extending to nonorthogonal low rank tensors:
In (L), we assume that the low rank tensor $L^*$ in (1) is orthogonal. We can also extend to nonorthogonal tensors whose components are linearly independent. In this case, there exists an invertible transformation, known as whitening, that orthogonalizes the tensor [2]. We can incorporate whitening into Procedure 1 to find the low rank tensor decomposition within the iterations of RTD.
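A sketch of the whitening idea, under an assumed setup of our choosing: a rank-$r$ second moment $M_2 = A\,\mathrm{diag}(w)\,A^\top$ with linearly independent columns of $A$. The whitening matrix built from the top-$r$ eigenpairs of $M_2$ makes the whitened components orthonormal.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 3
A = rng.standard_normal((n, r))            # non-orthogonal components
w = np.array([1.0, 2.0, 3.0])              # positive weights (hypothetical)
M2 = (A * w) @ A.T                          # second order moment, rank r

vals, vecs = np.linalg.eigh(M2)
vals, vecs = vals[-r:], vecs[:, -r:]        # top-r eigenpairs (M2 has rank r)
W = vecs / np.sqrt(vals)                    # whitening matrix: W.T @ M2 @ W = I_r

V = W.T @ (A * np.sqrt(w))                  # whitened components are orthonormal
```

Applying the same $W$ along each mode of the third moment then yields an orthogonal tensor, to which the orthogonal machinery above applies.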
4 Experiments
We now present an empirical study of our method. The goal of this study is threefold: a) establish that our method indeed recovers the lowrank and sparse parts correctly, without significant parameter tuning, b) demonstrate that whitening during low rank decomposition gives computational advantages, c) show that our tensor methods are superior to matrix based RPCA methods in practice.
Our pseudocode (Algorithm 1) prescribes the threshold in Step 5, which depends on knowledge of the singular values of the low rank component $L^*$. Instead, in the experiments, we set the threshold adaptively at each step of each stage, based on the current estimates. We found that this adaptive thresholding, in the tensor case as well, provides exact recovery while speeding up the computation significantly.

Synthetic datasets:
The low rank part $L^*$ is generated from a factor matrix whose entries are i.i.d. Gaussian samples. For the deterministic sparsity setting, $S^*$ is generated by setting each entry to be nonzero with a fixed probability, with each nonzero value drawn i.i.d. from a uniform distribution. For the block sparsity setting, we randomly select a number of vectors in which each entry is chosen to be nonzero with a fixed probability; the values of the nonzero entries are assigned as in the deterministic sparsity case. Each of these vectors forms a subtensor, and those subtensors together form $S^*$. To increase the incoherence of $L^*$, we randomly zero out rows of the factor matrix and then renormalize them. For the CP decomposition, we use the alternating least squares (ALS) method available in the tensor toolbox [4]; we use ALS since we found that it performs well empirically and is convenient to use. For whitening, we use two different whitening matrices: (a) the true second order moment of the true low rank part, and (b) the low rank part recovered from a random slice of the tensor using the matrix nonconvex RPCA method. We compare our method with matrix nonconvex RPCA in two ways: (a) treat each slice of the input tensor as a matrix RPCA problem, and (b) reshape the tensor along one mode and treat the result as a matrix RPCA problem.

We vary the sparsity of $S^*$ and the rank of $L^*$ with a fixed tensor size, and investigate the performance of our method, both with and without whitening, in comparison with the state-of-the-art matrix nonconvex RPCA algorithm. Our results on synthetic datasets are averaged over multiple runs. In Figure 1, we report the relative error achieved by each method under a fixed maximum number of iterations. Comparing (a) and (b) in Figure 1, we see that with block sparsity, our tensor method is better than the matrix based nonconvex RPCA method when the rank is small; with whitening, the advantage is more significant, but when the rank becomes higher, whitening is no longer helpful. In Figure 2, we illustrate the computational time of each method. Whitening simplifies the problem and gives a computational advantage; moreover, the running time of the tensor method is similar to that of the matrix method applied to the flattened tensor, while applying the matrix method slice by slice increases the running time.
Realworld dataset:
To demonstrate the advantage of our method, we apply it to activity detection, or foreground filtering [21]. The goal of this task is to detect activities from video footage, i.e., a set of image frames that form a tensor. In our robust decomposition framework, the moving objects correspond to sparse (foreground) perturbations, which we wish to filter out; the static ambient background is of less interest, since nothing changes in it.
We selected the Curtain video dataset, wherein a person walks in and out of the room during a subset of the frames. We compare our tensor method with the state-of-the-art matrix method of [22] on the set of frames where the activity happens. In our tensor method, we preserve the multimodal nature of the video and consider the set of image frames without vectorizing them. For the matrix method, we follow the setup of [22] by reshaping each image frame into a vector and stacking them together. We run both methods with the same convergence criterion. Our tensor method yields a substantial speedup and obtains better visual recovery for the same convergence accuracy, as shown in Figure 3.
5 Conclusion
We proposed a nonconvex alternating method for decomposing a tensor into low rank and sparse parts. We established convergence to the globally optimal solution under natural conditions such as incoherence of the low rank part and bounded sparsity levels for the sparse part. We proved that our tensor method can handle perturbations at a much higher sparsity level than robust matrix methods. Our simulations show superior performance of our tensor method, in terms of both accuracy and computational time. Future directions include analyzing (1) our method with whitening, and (2) the setting where grossly corrupted samples arrive in a streaming manner.
Acknowledgements
Animashree Anandkumar is supported in part by Microsoft Faculty Fellowship, NSF Career Award CCF1254106, ONR Award N000141410665, ARO YIP Award W911NF1310084, and AFOSR YIP FA95501510221. Yang Shi is supported by NSF Career Award CCF1254106 and ONR Award N000141512737, Niranjan is supported by NSF BigData Award IIS1251267 and ONR Award N000141512737.
References
 [1] A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory (COLT), June 2013.
 [2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. J. of Machine Learning Research, 15:2773–2832, 2014.
 [3] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed nonorthogonal tensor decomposition via alternating rank-1 updates. ArXiv 1402.5180, 2014.
 [4] Brett W. Bader, Tamara G. Kolda, et al. Matlab tensor toolbox version 2.6. Available online, February 2015.
 [5] B. Barak and A. Moitra. Tensor Prediction, Rademacher Complexity and Random 3XOR. ArXiv eprints 1501.06521, January 2015.
 [6] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
 [7] Dustin Cartwright and Bernd Sturmfels. The number of eigenvalues of a tensor. Linear algebra and its applications, 438(2):942–952, 2013.
 [8] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S Willsky. Ranksparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
 [9] KungChing Chang, Kelly Pearson, Tan Zhang, et al. Perronfrobenius theorem for nonnegative tensors. Commun. Math. Sci, 6(2):507–520, 2008.
 [10] Silvia Gandy, Benjamin Recht, and Isao Yamada. Tensor completion and lownrank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.
 [11] Donald Goldfarb and Zhiwei Qin. Robust lowrank tensor recovery: Models and algorithms. SIAM Journal on Matrix Analysis and Applications, 35(1):225–253, 2014.
 [12] Quanquan Gu, Huan Gui, and Jiawei Han. Robust tensor decomposition with gross corruption. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1422–1430. Curran Associates, Inc., 2014.
 [13] Quanquan Gu, Huan Gui, and Jiawei Han. Robust tensor decomposition with gross corruption. In Advances in Neural Information Processing Systems, pages 1422–1430, 2014.
 [14] Christopher J Hillar and LekHeng Lim. Most tensor problems are nphard. Journal of the ACM (JACM), 60(6):45, 2013.
 [15] Daniel Hsu, Sham M Kakade, and Tong Zhang. Robust matrix decomposition with sparse corruptions. Information Theory, IEEE Transactions on, 57(11):7221–7234, 2011.
 [16] Bo Huang, Cun Mu, Donald Goldfarb, and John Wright. Provable lowrank tensor recovery. Preprint, 2014.
 [17] Heng Huang and Chris Ding. Robust tensor factorization using R1 norm. In Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
 [18] Prateek Jain and Sewoong Oh. Learning mixtures of discrete product distributions using spectral decompositions. In Conference on Learning Theory (COLT), 2014.
 [19] Tamara G. Kolda and Jackson R. Mayo. Shifted power method for computing tensor eigenpairs. SIAM J. Matrix Analysis Applications, 32(4):1095–1124, 2011.
 [20] Nadia Kreimer, Aaron Stanton, and Mauricio D Sacchi. Tensor completion based on nuclear norm minimization for 5d seismic data reconstruction. Geophysics, 78(6):V273–V284, 2013.
 [21] Liyuan Li, Weimin Huang, Irene YuHua Gu, and Qi Tian. Statistical modeling of complex backgrounds for foreground object detection. Image Processing, IEEE Transactions on, 13(11):1459–1472, 2004.
 [22] Praneeth Netrapalli, UN Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Nonconvex robust pca. In Advances in Neural Information Processing Systems, pages 1107–1115, 2014.
 [23] Parikshit Shah, Nikhil Rao, and Gongguo Tang. Sparse and lowrank tensor decomposition. In Advances in Neural Information Processing Systems, 2015.
 [24] Stefan Ragnarsson and Charles F Van Loan. Block tensors and symmetric embeddings. Linear Algebra and Its Applications, 438(2):853–874, 2013.
 [25] Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. Estimation of lowrank tensors via convex optimization. arXiv preprint arXiv:1010.0789, 2010.
Appendix A Bounds for block sparse tensors
One of the main quantities to control is the spectral norm of the sparse perturbation tensor. The success of the power iterations, and the improvement in recovery accuracy over the iterative steps of RTD, require this bound.
Lemma 4 (Spectral norm bounds for block sparse tensors).
Let $S$ satisfy the block sparsity assumption (S). Then
(10) 
Proof.
Let $\Psi$ be the tensor that encodes the sparsity of $S$, i.e., $\Psi_{ijk} = 1$ iff $S_{ijk} \neq 0$, for all $i, j, k$. We have $\|S\| \leq \|S\|_\infty \cdot \|\Psi\|$, where the inequality follows from the Perron-Frobenius theorem for nonnegative tensors [9]; note that $\Psi$ is nonnegative by definition. Now we bound $\|\Psi\|$ along the lines of [3, Lemma 4]. By definition, $\Psi$ is a sum of blocks; define the corresponding normalized block fibre vectors. We have
Define the Gram matrix of these normalized vectors. It has unit diagonal entries, and its off-diagonal entries are bounded in absolute value by $\eta$, by assumption. From the Gershgorin disk theorem, every subset of its columns has singular values within a radius proportional to $\eta$ of 1; in particular, its spectral norm is at most $1 + \eta(B - 1)$.
For any unit vector, let $\mathcal{I}$ be the set of indices of its largest entries. By the argument above, the smallest entry indexed by $\mathcal{I}$ is small, and by the construction of $\mathcal{I}$, every entry outside $\mathcal{I}$ is at most as large. Now we can write the norm of the contraction as
Here the first inequality uses the fact that every entry outside $\mathcal{I}$ is small, and the last inequality uses the bound argued on the subset of columns, together with the assumed spectral norm bound. The result follows.
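The Gershgorin step above can be sanity-checked numerically: for a symmetric matrix with unit diagonal and off-diagonal entries bounded by $\beta$, every eigenvalue lies within $\beta(m-1)$ of 1 (the parameters below are hypothetical choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
m, beta = 10, 0.01
off = beta * (2 * rng.random((m, m)) - 1) * (1 - np.eye(m))
M = np.eye(m) + (off + off.T) / 2          # unit diagonal, |off-diagonal| <= beta
eigs = np.linalg.eigvalsh(M)               # all within beta*(m-1) of 1
```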
Another important bound required is the infinity norm of certain contractions of the (normalized) sparse tensor and its powers, denoted below. We use a loose bound based on the spectral norm. The constraint this requires is also needed for the power iterations to succeed, so it is not an additional requirement, and the loose bound below suffices for our results.
Lemma 5 (Infinity norm bounds).
Let $S$ satisfy the block sparsity assumption (S), and let the contraction vectors satisfy the incoherence assumption. Then the infinity norms of the contractions of $S$, and of their iterates, are bounded in terms of $\|S\|$ and the incoherence parameter.
Proof.
We have, from norm conversion,
(11)  
(12) 
where the $\ell_1$ norm (i.e., the sum of absolute values of entries) of a slice is bounded, since the number of nonzero entries contributed by one block to a slice is at most $d^2$.
Iterating this argument over the successive contractions yields the stated bounds.