1 Introduction
The past decades have seen a large body of work on tensors, or multiway arrays, in applied mathematics, signal processing, machine learning, and statistics, among many other fields. Tensors arise in numerous applications involving multiway data, such as brain imaging (zhou2013tensor; zhang2019tensor), electron microscopy imaging (han2020optimal; zhang2020denoising), and recommender system design (bi2018multilayer). In addition, tensor methods have been applied to many problems in statistics and machine learning where the observations are not necessarily tensors, such as topic and latent variable models (anandkumar2014tensor), additive index models (balasubramanian2018tensor), and high-order interaction pursuit (hao2018sparse). In this paper, we focus on a prototypical model for tensor estimation:
(1) 
Here, are the observations and unknown noise and is an order tensor parameter of interest. is a known linear map, which can be explicitly expressed as
(2) 
with the given measurement tensors . Our goal is to estimate based on . When , (1) becomes the low-rank tensor recovery problem (rauhut2017low), where the aim is to recover exactly.
In many applications, , i.e., the number of parameters in , is much greater than the sample size , so structural conditions are often assumed to ensure the problem is well-posed. In the literature, the low-rankness assumption has been widely considered (kolda2009tensor; zhou2013tensor; anandkumar2014guaranteed; richard2014statistical; montanari2018spectral). In this work, we focus on the setting where the target parameter has low Tucker rank and admits the following Tucker (or multilinear) decomposition with Tucker rank :
(3) 
Here, is the order core tensor; is a by matrix with orthonormal columns, which represents the mode singular vectors of ; “” is the tensor-matrix product along mode . The formal definitions of the Tucker decomposition and tensor-matrix product are given in Section 1.4.

With different designs of , the general model (1) covers many specific settings arising in applications, such as recommender systems (bi2018multilayer), neuroimaging (guhaniyogi2015bayesian; li2017parsimonious), longitudinal relational data analysis (hoff2015multilinear), and image processing (guo2012tensor). The specific settings of model (1) include:


Tensor regression with general random or deterministic design (zhou2013tensor; raskutti2015convex), where are general tensors. In particular, the Gaussian ensemble design ( has i.i.d. Gaussian/sub-Gaussian entries) is widely studied in the literature.

Tensor completion (gandy2011tensor; liu2013tensor; yuan2014tensor): , is the th canonical vector and are randomly selected integers from , “” represents the outer product and ;

Tensor estimation via rank-1 projections (hao2018sparse): , where are random vectors;

Tensor PCA/SVD (richard2014statistical; hopkins2015tensor; zhang2018tensor; perry2020statistical) is a special case of tensor completion where all entries are observable. In this particular setting, we can tensorize and rewrite model (1) equivalently as . Here is the low Tucker rank signal tensor and is the noise.
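As a concrete illustration of the Gaussian ensemble design above, the following Python sketch simulates model (1) for a small order-3 parameter of low Tucker rank. All dimensions, ranks, and the noise level are arbitrary choices for illustration, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, n = 8, 2, 200  # arbitrary illustrative dimensions, rank, sample size

# Low Tucker rank parameter: X = S x1 U1 x2 U2 x3 U3, with a random core S
# and orthonormal factor matrices U1, U2, U3.
S = rng.standard_normal((r, r, r))
U = [np.linalg.qr(rng.standard_normal((p, r)))[0] for _ in range(3)]
X = np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])

# Gaussian ensemble design: each measurement tensor A_i has i.i.d. N(0,1)
# entries, and the observation is y_i = <A_i, X> + eps_i.
A = rng.standard_normal((n, p, p, p))
sigma = 0.01
y = np.einsum('nijk,ijk->n', A, X) + sigma * rng.standard_normal(n)
```

Each mode-wise matricization of the resulting `X` has rank at most `r`, so `X` indeed has Tucker rank at most `(r, r, r)`.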
In view of model (1) and assumption (3), a natural estimator of is
(4) 
Here is the Tucker rank of (see the definition in Section 1.4). However, the optimization problem in (4) is nonconvex and NP-hard in general. To tame the nonconvexity, a common scheme is convex relaxation (mu2014square; raskutti2015convex; tomioka2011statistical). However, this scheme may either yield suboptimal statistical guarantees or require evaluating the tensor nuclear norm, which is NP-hard to compute in general (hillar2013most). Alternatively, a large body of literature turns to the nonconvex formulation and focuses on developing computationally efficient two-stage procedures for estimating : first, one obtains a warm initialization of , and then runs local algorithms to refine the estimate. Provable guarantees on estimation or recovery of under such a two-stage paradigm have been developed in different scenarios (rauhut2017low; chen2016non; ahmed2020tensor; han2020optimal; cai2019nonconvex; hao2018sparse; cai2020provable; xia2017statistically). In this work, we focus on the nonconvex formulation and aim to develop a provably computationally efficient estimator for . Departing from the existing literature, which focuses on first-order local methods, we propose a new Riemannian Gauss-Newton (RGN) algorithm for iterative refinement and establish the first quadratic convergence guarantee on the estimation of .

1.1 Our Contributions
In this paper, we develop a new Riemannian Gauss-Newton (RGN) algorithm for low-rank tensor estimation. The proposed algorithm is tuning-free and generally has the same per-iteration computational complexity as alternating minimization (zhou2013tensor; li2013tucker) and comparable complexity to other first-order methods, including projected gradient descent (chen2016non) and gradient descent (han2020optimal).
Moreover, assuming satisfies the tensor restricted isometry property (TRIP) (see Definition 2), we prove that with a proper initialization, the iterates generated by RGN converge quadratically to up to some statistical error. In particular, in the noiseless setting, i.e., , RGN converges quadratically to the exact parameter . Figure 1 shows the numerical performance of RGN in tensor regression (left panel) and tensor completion (right panel): in the noiseless case, RGN converges quadratically to ; in the noisy case, RGN converges quadratically to a neighborhood of up to some statistical error. More simulation results on tensor estimation via rank-1 projections and tensor SVD can be found in Section 6. Since RGN generally converges to a point with nonzero function value in the noisy setting, the generic theory of RGN can only guarantee a (super)linear convergence rate to a stationary point (absil2009optimization; breiding2018convergence). Our result complements the classic theory of RGN: we show that RGN converges quadratically to a neighborhood of the true parameter of interest, and that it achieves a statistically optimal estimation error rate. To the best of our knowledge, such a result is new, and our RGN is the first algorithm with a provable second-order convergence guarantee for low-rank tensor estimation.
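The "quadratic convergence up to a statistical error" behavior described above can be illustrated with a toy error recursion. This is only a caricature with arbitrary constants, not the actual RGN analysis: the error contracts quadratically until it hits a noise floor.

```python
# Toy error recursion (not the RGN analysis itself): a quadratic contraction
# e_{t+1} = c * e_t^2 that bottoms out at a statistical noise floor.
c, noise_floor = 2.0, 1e-6  # arbitrary illustrative constants
e = 0.1
errors = [e]
for _ in range(6):
    e = max(c * e * e, noise_floor)
    errors.append(e)
# The error roughly squares at each step (0.1 -> 0.02 -> 8e-4 -> 1.28e-6),
# then stalls at the 1e-6 floor, mirroring the noisy-case behavior above.
```

In the noiseless case (`noise_floor = 0`), the same recursion converges to zero, with the number of accurate digits roughly doubling per iteration.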
Furthermore, we provide a deterministic minimax lower bound for the estimation error under model (1). The lower bound matches the estimation error upper bound, which demonstrates the statistical rateoptimality of RGN.
Next, we apply RGN to two problems arising from applications in machine learning and statistics: tensor regression and tensor SVD. In both problems, we prove that the iterates of RGN converge quadratically to a neighborhood of that achieves the minimax optimal estimation error. A comparison of RGN and prior algorithms for tensor regression and tensor SVD is given in Table 1. We can see that for fixed , RGN achieves the best estimation error and signal-to-noise ratio requirement, i.e., sample complexity in tensor regression and least singular value in tensor SVD, compared to the state of the art, while maintaining a relatively low computational cost. Moreover, RGN is the only algorithm with guaranteed quadratic convergence in both applications. Finally, we conduct numerical studies to support our theoretical findings in Section 6. The simulation studies show that RGN offers much faster convergence than the existing approaches in the literature.

Tensor Regression  
Algorithm  required  convergence  estimation  per-iteration 
sample size  rate  error  cost  
RGN  quadratic  
(this work)  
GD  linear  
(han2020optimal)  
Nonconvex PGD  linear  
(chen2016non)  
Alter Mini  N.A.  linear  N.A.  
(zhou2013tensor)  
IHT  linear  
(rauhut2017low)  
¹Department of Statistics, University of Wisconsin–Madison (yluo86@wisc.edu, anruzhang@stat.wisc.edu).
Tensor SVD  
Algorithm  least  convergence  estimation  per-iteration 
singular value  rate  error  cost  
RGN  quadratic  
(this work)  
GD  linear  
(han2020optimal)  
Alter Mini  linear  
(zhang2018tensor) 
is the standard deviation of the Gaussian noise in tensor regression and tensor SVD.
relies on the assumption that the projection onto the low Tucker rank tensor manifold can be almost exactly achieved by the truncated high-order singular value decomposition (de2000best). However, it is unclear whether this assumption holds in general.

1.2 Related Literature
Our work is related to a broad range of literature from a number of communities. Here we make an attempt to discuss existing results without claiming that the survey is exhaustive.
First, low-rank tensor estimation has attracted much recent attention from the machine learning and statistics communities. Various methods have been proposed, including convex relaxation (mu2014square; raskutti2015convex; tomioka2011statistical), projected gradient descent (rauhut2017low; chen2016non; ahmed2020tensor; yu2016learning), gradient descent on the factorized model (han2020optimal; cai2019nonconvex; hao2018sparse), alternating minimization (zhou2013tensor; jain2014provable; liu2020tensor; xia2017statistically), and importance sketching (zhang2020islet). Moreover, when the target tensor has order two, our problem reduces to the widely studied low-rank matrix recovery/estimation problem (recht2010guaranteed; li2019rapid; ma2019implicit; sun2015guaranteed; tu2016low; wang2017unified; zhao2015nonconvex; zheng2015convergent; charisopoulos2021low; luo2020recursive; bauch2021rank). The reader is referred to the recent survey chi2019nonconvex.
Second, Riemannian manifold optimization methods have been powerful in solving optimization problems with geometric constraints (absil2009optimization; boumal2020introduction). Much progress on this topic has been made for low-rank matrix estimation (keshavan2009matrix; boumal2011rtrmc; boumal2015low; wei2016guarantees; meyer2011linear; mishra2014fixed; vandereycken2013low; huang2018blind; cherian2016riemannian; luo2020recursive); see the recent surveys on this line of work (cai2018exploiting; uschmajew2020geometric). Moreover, Riemannian manifold optimization methods have been applied to various problems in low-rank tensor estimation, such as tensor regression (cai2020provable; kressner2016preconditioned), tensor completion (rauhut2015tensor; kasai2016low; dong2021new; kressner2014low; heidel2018riemannian; xia2017polynomial; steinlechner2016riemannian; da2015optimization), and robust tensor PCA (cai2021generalized). These papers mostly focus on first-order Riemannian optimization methods, possibly due to the difficulty of deriving exact expressions for the Riemannian Hessian. A few exceptions exist: heidel2018riemannian and psenka2020second developed Riemannian trust-region methods for tensor completion under the Tucker and tensor-train formats, respectively; kressner2016preconditioned proposed approximate Riemannian Newton methods for tensor regression in the tensor-train and Tucker formats under the setting that the linear map has additive and Kronecker-product-type structure. Departing from these results, which focus on the geometric objects and numerical implementations, in this paper we not only develop an efficient implementation of RGN under the Tucker format but also prove the quadratic convergence of the iterates and the optimal estimation error rate for the estimation of .
Finally, the low-rank tensor estimation model (1) is related to various problems in different contexts. In the tensor-based scientific computing community, large-scale linear systems whose solutions admit a low-rank tensor structure commonly arise after discretizing high-dimensional partial differential equations (PDEs) (lynch1964tensor; hofreither2018black), which exactly become the central problem (1) in this paper. In the literature, various methods have been proposed to solve (1). For example, bousse2018linear developed an algebraic method and a Gauss-Newton method to solve linear systems with CP low-rank tensor solutions. georgieva2019greedy and kressner2016preconditioned respectively introduced a greedy approach and an approximate Riemannian Newton method to approximate the linear system by a low Tucker rank tensor. The reader is also referred to grasedyck2013literature for a recent survey. There are some key differences between this line of work and ours. First, their goal is often to find a low-rank tensor that approximately solves a linear system with small approximation error, while we aim to develop an estimator with small estimation error. Second, the design matrix in linear systems from discretized PDEs often has Kronecker-product-type structure; we do not assume such structure in this paper. On the other hand, the structures of the design assumed here are application dependent, e.g., sub-Gaussian design in tensor regression and “one-hot” design in tensor completion, as mentioned in the introduction. Finally, their work mainly focuses on computational aspects of the proposed methods (grasedyck2013literature), while this paper develops Riemannian Gauss-Newton for solving the low-rank tensor estimation problem and gives theoretical guarantees both for the quadratic convergence of the algorithm and for the optimal estimation error bound of the final estimator.

1.3 Organization of the Paper
After a brief introduction of notation and preliminaries in Section 1.4, we introduce our main algorithm, RGN, and its geometric ingredients in Section 2. The theoretical results of RGN and its applications to tensor regression and tensor SVD are discussed in Sections 3 and 4, respectively. The computational complexity of RGN and numerical studies are presented in Sections 5 and 6, respectively. Conclusions and future work are given in Section 7. Additional algorithms and all technical proofs are presented in Appendices A–C.
1.4 Notation and Preliminaries
The following notation will be used throughout this article. Lowercase letters (e.g., ), lowercase boldface letters (e.g., ), uppercase boldface letters (e.g., ), and boldface calligraphic letters (e.g., ) are used to denote scalars, vectors, matrices, and tensors of order three or higher, respectively. For simplicity, we denote by the tensor indexed by in a sequence of tensors . We use bracket subscripts to denote subvectors, submatrices, and subtensors. For any vector , define its norm as . For any matrix , let be the th largest singular value of . We also denote by and QR() the subspace spanned by the leading left singular vectors and the Q part of the QR orthogonalization of , respectively. represents the by identity matrix. Let be the set of all by matrices with orthonormal columns. For any , represents the projection matrix onto the column space of ; we use to represent the orthonormal complement of .
The matricization is the operation that unfolds the order tensor along mode into the matrix where . Specifically, the mode matricization of is formally defined as
(5) 
for any . We also use the notation to denote the mode tensorization, i.e., the inverse operator of . Throughout the paper, , as the inverse of , maps a matrix back to a tensor. The Hilbert–Schmidt norm of is defined as . The Tucker rank of a tensor is denoted by and defined as a tuple , where . Any Tucker rank tensor admits the Tucker decomposition (tucker1966some):
(6) 
where is the core tensor and contains the mode singular vectors. Here, the mode product of with a matrix is denoted by , has size , and is formally defined below:
(7) 
It is convenient to introduce the following abbreviations for the tensor-matrix product along multiple modes: ; . The following property of tensor matricization will be used (kolda2001orthogonal, Section 4):
(8) 
where “” is the matrix Kronecker product. For any tensor , we define as the best Tucker rank approximation of in terms of the Hilbert–Schmidt norm, where . Finally, for any linear operator , we denote by its adjoint operator.
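To make the matricization operation and the Kronecker-product identity (8) concrete, here is a minimal numpy sketch. Note that the ordering of the remaining modes inside the unfolding (and hence the ordering of the Kronecker factors) is a convention; the exact ordering used in (5) may differ, but the identity holds in the same form once the conventions are matched.

```python
import numpy as np

def unfold(T, mode):
    # Mode-k matricization: bring mode k to the front, flatten the rest
    # (remaining modes kept in their original order).
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    # Inverse of unfold for the same index convention.
    lead = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(lead), 0, mode)

rng = np.random.default_rng(0)
S = rng.standard_normal((2, 3, 4))                       # core tensor
U = [rng.standard_normal((d, r)) for d, r in [(5, 2), (6, 3), (7, 4)]]
X = np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])  # S x1 U1 x2 U2 x3 U3

# Identity (8) for mode 1 under this convention:
# unfold(X, 0) = U1 @ unfold(S, 0) @ kron(U2, U3).T
lhs = unfold(X, 0)
rhs = U[0] @ unfold(S, 0) @ np.kron(U[1], U[2]).T
```

Here `fold(unfold(T, k), k, T.shape)` recovers `T` for every mode `k`, matching the tensorization operator described above.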
2 Algorithm
We introduce the geometry of low Tucker rank tensor Riemannian manifolds in Section 2.1 and present the procedure of RGN in Section 2.2.
2.1 The Geometry for Low Tucker Rank Tensor Manifolds
Denote by the collection of dimensional tensors of Tucker rank . Then forms a smooth submanifold embedded in with dimension (uschmajew2013geometry; kressner2014low). Throughout the paper, we use the natural Euclidean inner product as the Riemannian metric. Suppose has the Tucker decomposition ; koch2010dynamical showed that the tangent space of at , , can be represented as
(9) 
In the representation above, the s are not free parameters due to the constraints , . In the following Lemma 1, we introduce another representation of with a minimal parameterization, which matches the degrees of freedom of the tangent space ( ). For , we let , which corresponds to the row space of , and define

(10) 
where . By (8), correspond to the subspaces of the column and row spans of , respectively.
Lemma 1.
We can also show that any tensor in has Tucker rank at most . This fact will facilitate the efficient computation of RGN discussed in Section 5.
Lemma 2.
Any tensor has Tucker rank at most .
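The rank bound in Lemma 2 can be checked numerically. The sketch below builds a generic tangent vector following the representation in (9), as a core perturbation plus one factor perturbation per mode, and verifies that each matricization has rank at most twice the Tucker rank (the standard bound for Tucker tangent spaces; dimensions here are arbitrary). The orthogonality constraints on the perturbations are ignored, since the rank bound does not require them.

```python
import numpy as np

rng = np.random.default_rng(2)
p, r = 8, 2  # arbitrary illustrative dimension and rank
S = rng.standard_normal((r, r, r))
U = [np.linalg.qr(rng.standard_normal((p, r)))[0] for _ in range(3)]
V = [rng.standard_normal((p, r)) for _ in range(3)]  # factor perturbations

def tucker(core, mats):
    return np.einsum('abc,ia,jb,kc->ijk', core, mats[0], mats[1], mats[2])

# A generic tangent vector at X = S x1 U1 x2 U2 x3 U3: perturb the core
# once, and each factor matrix once.
D = rng.standard_normal((r, r, r))
T = tucker(D, U)
for k in range(3):
    mats = list(U)
    mats[k] = V[k]
    T = T + tucker(S, mats)

# Each mode-k matricization lies in span(U_k) + span(V_k): rank <= 2r.
ranks = [np.linalg.matrix_rank(np.moveaxis(T, k, 0).reshape(p, -1))
         for k in range(3)]
```

Each mode-`k` unfolding of `T` factors through `[U[k], V[k]]`, which has at most `2r` columns, so the bound holds regardless of the random draw.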
Lemma 3.1 of koch2010dynamical and the tangent space representation in Lemma 1 yield the following projection operator that projects any tensor onto the tangent space of at :
(11) 
where and are respectively the contraction map and extension map defined as follows:
(12) 
In particular, is the adjoint operator of . We will see in Section 2.2 that the representation in (11) enables an efficient implementation of RGN.
2.2 Riemannian Optimization and Riemannian GaussNewton
In this subsection, we first give preliminaries on Riemannian optimization and then introduce the procedure of RGN for low-rank tensor estimation.
Overall three-step procedure of Riemannian optimization. Riemannian optimization concerns optimizing a real-valued function defined on a Riemannian manifold ; the reader is referred to absil2009optimization and boumal2020introduction for an introduction. Due to the nonlinearity of the manifold, continuous optimization on a Riemannian manifold often requires calculations in the tangent space. A typical Riemannian optimization method performs three steps per iteration: Step 1, find the tangent space; Step 2, update the point in the tangent space; Step 3, map the point from the tangent space back to the manifold.
Low-rank tensor Riemannian manifold (Step 1). We have already discussed the tangent space of low Tucker rank tensor manifolds in Section 2.1, i.e., Step 1 above.
Update on the tangent space (Step 2). Next, we describe the procedure of RGN in the tangent space. We begin by introducing a few more preliminaries on Riemannian manifold optimization. The Riemannian gradient of a smooth function at is defined as the unique tangent vector such that , where denotes the directional derivative of at along direction . Specifically, for the embedded submanifold , we have:
A common way to derive the RGN update in the literature is to first write down the Riemannian Newton equation, then replace the Riemannian Hessian by its Gauss-Newton approximation (absil2009optimization, Chapter 8.4.1), and finally solve the modified Riemannian Newton equation, i.e., the Riemannian Gauss-Newton equation. In our low-rank tensor estimation problem with the objective function (4), if the current iterate is , the RGN update solves the following RGN equation (absil2009optimization, Chapter 8.4):
(13) 
However, it is not clear how to solve this equation directly in practice.
Inspired by the classical Gauss-Newton (GN) algorithm, we instead introduce another scheme to derive RGN. Recall that, in solving a nonlinear least-squares problem in Euclidean space , the classic Gauss-Newton method can be viewed as a modified Newton method, and it can also be derived by replacing the nonlinear function with its local linear approximation at the current iterate (nocedal2006numerical, Chapter 10.3). These two interpretations are equivalent. A similar local linearization idea extends to the manifold setting, except that the linearization is taken in the tangent space at each iterate. Specifically, consider the objective function in (4); the linearization of at in is , which simplifies to . So the update can be calculated by
(14) 
By mapping back to the manifold, we get the new iterate .
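For readers less familiar with the classical Gauss-Newton scheme invoked above, here is a minimal Euclidean sketch: at each iterate, linearize the residual and solve the resulting least-squares problem. The toy model, data, and starting point are arbitrary choices for illustration.

```python
import numpy as np

def gauss_newton(F, J, x0, iters=20):
    # Classical Gauss-Newton: linearize the residual at the current iterate,
    # F(x + d) ~ F(x) + J(x) d, and solve the least-squares problem for d.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        d = np.linalg.lstsq(J(x), -F(x), rcond=None)[0]
        x = x + d
    return x

# Toy zero-residual problem: fit y = exp(a * t) + b from noiseless data.
t = np.linspace(0.0, 1.0, 30)
a_true, b_true = 1.5, 0.3
y = np.exp(a_true * t) + b_true

F = lambda x: np.exp(x[0] * t) + x[1] - y                      # residual
J = lambda x: np.column_stack([t * np.exp(x[0] * t),           # Jacobian
                               np.ones_like(t)])

x_hat = gauss_newton(F, J, np.array([1.0, 0.0]))
```

On zero-residual problems like this one, Gauss-Newton inherits Newton-like fast local convergence, which is the Euclidean analogue of the quadratic convergence established for RGN in this paper.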
Proposition 1.
Proposition 1 shows that (14) yields the RGN update, which directly provides a simple implementation of RGN. To see this, recall that , where and are defined analogously to (12) but evaluated at ; the objective function in (14) can then be rewritten as follows:
(15) 
where . Based on the calculation in (15), we define the following covariate maps , where for ,
Here, satisfies and similarly for . Then, by (15) and the fact that , (14) can be equivalently solved by
where
(16) 
Note that (16) is an unconstrained least-squares problem whose number of parameters equals .
Retraction (Step 3). Finally, we discuss how to map the point from the tangent space back to the manifold, i.e., Step 3 above. An ideal method is the exponential map, which moves a point on the manifold along a geodesic. However, computing the exponential map is prohibitively expensive in most situations, and a more practical choice is the so-called retraction, which is in general a first-order approximation of the exponential map. On the tensor manifold , a retraction map, denoted by , is a smooth map from to that satisfies i) and ii) for all and (absil2009optimization, Chapter 4). Here, is the tangent bundle of .
For low Tucker rank tensor manifolds, Proposition 2.3 of kressner2014low showed that the truncated high-order singular value decomposition (THOSVD) (de2000multilinear) is a retraction. We further show in the following Lemma 4 that the sequentially truncated HOSVD (STHOSVD) (vannieuwenhoven2012new), a computationally more efficient procedure than THOSVD, also satisfies the retraction properties. The detailed procedures of THOSVD and STHOSVD are given in Appendix A.
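A minimal numpy sketch of the sequentially truncated HOSVD may help fix ideas; it follows vannieuwenhoven2012new only loosely (truncating modes in their natural order, with no attention to efficiency), and the dimensions are arbitrary.

```python
import numpy as np

def sthosvd(T, ranks):
    # Sequentially truncated HOSVD: truncate one mode at a time and
    # compress the working core after each truncation.
    G = T.copy()
    factors = []
    for k, rk in enumerate(ranks):
        Gk = np.moveaxis(G, k, 0).reshape(G.shape[k], -1)        # mode-k unfolding
        Uk = np.linalg.svd(Gk, full_matrices=False)[0][:, :rk]   # leading left SVs
        factors.append(Uk)
        rest = tuple(s for i, s in enumerate(G.shape) if i != k)
        G = np.moveaxis((Uk.T @ Gk).reshape((rk,) + rest), 0, k)  # compress mode k
    return G, factors

# Sanity check on an exactly low Tucker rank tensor: the retraction is exact.
rng = np.random.default_rng(3)
p, r = 6, 2
S = rng.standard_normal((r, r, r))
U = [np.linalg.qr(rng.standard_normal((p, r)))[0] for _ in range(3)]
T = np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])

core, factors = sthosvd(T, (r, r, r))
T_hat = np.einsum('abc,ia,jb,kc->ijk', core, factors[0], factors[1], factors[2])
```

Because each mode is compressed before the next SVD is computed, the unfoldings shrink from one step to the next, which is the source of STHOSVD's computational advantage over THOSVD.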
Lemma 4 (Retraction of Sequentially Truncated HOSVD).
Summary of RGN. We give the complete RGN algorithm for low-rank tensor estimation in Algorithm 1.
Input: , , Tucker rank , initialization with Tucker decomposition , and defined as (10).
(17) 
(18) 
(19) 
Output: .
Remark 1 (Operator ).
In (19), plays the role of a retraction that maps the iterate from the tangent space of at back onto the manifold. Since it operates directly on the updated tensor, to distinguish it from the canonical notation for retraction, we use the simplified notation for this map. As mentioned before, THOSVD (de2000multilinear) and STHOSVD (vannieuwenhoven2012new) are two choices of retraction.
3 Theoretical Analysis
We analyze the convergence rate of RGN in this section. We begin by introducing the quasi-projection property of THOSVD and STHOSVD and the assumption on the linear map . Unlike the low-rank matrix projection, which can be computed efficiently and exactly via the truncated SVD, performing the low-rank tensor projection exactly, even for , can be NP-hard in general (hillar2013most). We thus introduce the following quasi-projection property and the approximation constant .
Definition 1 (Quasi-projection of and Approximation Constant ).
Let be the projection map from to the tensor space of Tucker rank at most , i.e., for any and of Tucker rank at most , one always has . We say satisfies the quasi-projection property with approximation constant if for any .
It is known that THOSVD and STHOSVD satisfy the quasi-projection property (see Chapter 10 of hackbusch2012tensor).
Proposition 2 (Quasi-projection property of THOSVD and STHOSVD).
THOSVD and STHOSVD, described in Appendix A, satisfy the quasi-projection property with approximation constant .
For technical convenience, we also assume that satisfies the following tensor restricted isometry property (TRIP) (rauhut2017low). The TRIP condition can be seen as a tensor generalization of the restricted isometry property (RIP). In the compressed sensing and low-rank matrix recovery literature, the RIP condition has been widely used as a standard assumption (candes2011tight; cai2013sharp). rauhut2017low showed that the TRIP condition holds if is randomly designed with a sufficient sample size .
Definition 2 (Tensor Restricted Isometry Property (TRIP)).
Let be a linear map. For a fixed tuple with for , define the tensor restricted isometry constant to be the smallest number such that holds for all of Tucker rank at most . If , we say satisfies the tensor restricted isometry property (TRIP).
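The restricted-isometry behavior of a random Gaussian map can be probed empirically. The Monte Carlo sketch below (with arbitrary dimensions and sample size) measures the distortion of a Gaussian map on random low Tucker rank tensors; note this yields only a lower estimate of the restricted isometry constant, since the supremum in Definition 2 runs over all low-rank tensors, not just sampled ones.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r, n = 6, 2, 400  # arbitrary illustrative dimension, rank, sample size

# A Gaussian map A: R^{p x p x p} -> R^n with i.i.d. N(0, 1/n) entries, so
# that E ||A(T)||_2^2 = ||T||_HS^2 for every fixed tensor T.
Amat = rng.standard_normal((n, p ** 3)) / np.sqrt(n)

def random_low_tucker():
    S = rng.standard_normal((r, r, r))
    U = [np.linalg.qr(rng.standard_normal((p, r)))[0] for _ in range(3)]
    return np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])

ratios = []
for _ in range(50):
    T = random_low_tucker()
    ratios.append(np.linalg.norm(Amat @ T.ravel()) ** 2
                  / np.linalg.norm(T) ** 2)

# Empirical distortion over the sampled tensors; only a lower estimate of
# the true restricted isometry constant in Definition 2.
distortion = max(abs(rho - 1.0) for rho in ratios)
```

With this sample size the empirical ratios concentrate near 1, consistent with rauhut2017low's result that Gaussian maps satisfy the TRIP given sufficiently many measurements.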
Next, we establish the deterministic convergence theory for RGN.
Theorem 1 (Convergence of RGN).
Suppose is either THOSVD or STHOSVD, satisfies the TRIP, and the initialization satisfies , where is the minimum of the least singular values of the matricizations of . Then for all ,