Low-rank Tensor Estimation via Riemannian Gauss-Newton: Statistical Optimality and Second-Order Convergence

by   Yuetian Luo, et al.
University of Wisconsin-Madison

In this paper, we consider the estimation of a low Tucker rank tensor from a number of noisy linear measurements. The general problem covers many specific examples arising from applications, including tensor regression, tensor completion, and tensor PCA/SVD. We propose a Riemannian Gauss-Newton (RGN) method with fast implementations for low Tucker rank tensor estimation. Different from the generic (super)linear convergence guarantee of RGN in the literature, we prove the first quadratic convergence guarantee of RGN for low-rank tensor estimation under some mild conditions. A deterministic estimation error lower bound, which matches the upper bound, is provided that demonstrates the statistical optimality of RGN. The merit of RGN is illustrated through two machine learning applications: tensor regression and tensor SVD. Finally, we provide the simulation results to corroborate our theoretical findings.



There are no comments yet.


page 1

page 2

page 3

page 4


An Optimal Statistical and Computational Framework for Generalized Tensor Estimation

This paper describes a flexible framework for generalized low-rank tenso...

Cross: Efficient Low-rank Tensor Completion

The completion of tensors, or high-order arrays, attracts significant at...

Optimal High-order Tensor SVD via Tensor-Train Orthogonal Iteration

This paper studies a general framework for high-order tensor SVD. We pro...

Generalized Low-rank plus Sparse Tensor Estimation by Fast Riemannian Optimization

We investigate a generalized framework to estimate a latent low-rank plu...

A Sharp Blockwise Tensor Perturbation Bound for Orthogonal Iteration

In this paper, we develop novel perturbation bounds for the high-order o...

Rank Determination for Low-Rank Data Completion

Recently, fundamental conditions on the sampling patterns have been obta...

A Low Rank Tensor Representation of Linear Transport and Nonlinear Vlasov Solutions and Their Associated Flow Maps

We propose a low-rank tensor approach to approximate linear transport an...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The past decades have seen a large body of work on tensors or multiway arrays in applied mathematics, signal processing, machine learning, statistics, among many other fields. Tensors arise in numerous applications involving multiway data, such as brain imaging (zhou2013tensor; zhang2019tensor), electron microscopy imaging (han2020optimal; zhang2020denoising), recommender system design (bi2018multilayer). In addition, tensor methods have been applied to many problems in statistics and machine learning where the observations are not necessarily tensors, such as topic and latent variable models (anandkumar2014tensor), additive index models (balasubramanian2018tensor), high-order interaction pursuit (hao2018sparse). In this paper, we focus on a prototypical model for tensor estimation:


Here, are the observations and unknown noise and is an order- tensor parameter of interest. is a known linear map, which can be explicitly expressed as


with the given measurement tensors . Our goal is to estimate based on . When , (1) becomes the low-rank tensor recovery problem (rauhut2017low) where the aim is to recover exactly.

In many applications, , i.e., the number of parameters in , is much greater than the sample size , so some structural conditions are often assumed to ensure the problem is well-posed. In the literature, the low-rankness assumption was widely considered (kolda2009tensor; zhou2013tensor; anandkumar2014guaranteed; richard2014statistical; montanari2018spectral). In this work, we focus on the setting that the target parameter is low Tucker rank and admits the following Tucker (or multilinear) decomposition with Tucker rank :


Here, is the order- core tensor; is a -by- matrix with orthonormal columns, which represents the mode-

singular vectors of

; “” is the tensor-matrix product along mode . The formal definitions of Tucker decomposition and tensor-matrix product are given in Section 1.4.

With different designs of , the general model (1) covers many specific settings arising from applications, such as recommender system (bi2018multilayer), neuroimaging (guhaniyogi2015bayesian; li2017parsimonious), longitudinal relational data analysis (hoff2015multilinear), imaging processing (guo2012tensor). The specific settings of model (1) include:

  • [leftmargin=*]

  • Tensor regression with general random or deterministic design (zhou2013tensor; raskutti2015convex), where are general tensors. Specifically, the Gaussian ensemble design ( has i.i.d. Gaussian/sub-Gaussian entries) is widely studied in the literature.

  • Tensor completion (gandy2011tensor; liu2013tensor; yuan2014tensor): , is the th canonical vector and are randomly selected integers from , “” represents the outer product and ;

  • Tensor estimation via rank-1 projections (hao2018sparse): , where are random vectors;

  • Tensor PCA/SVD (richard2014statistical; hopkins2015tensor; zhang2018tensor; perry2020statistical) is a special case of tensor completion where all entries are observable. In this particular setting, we can tensorize and rewrite the model (1) equivalently to . Here is the low Tucker rank signal tensor and is the noise.

In view of model (1) and assumption (3), a natural estimator of is


Here is the Tucker rank of (see definition in Section 1.4). However, the optimization problem in (4) is non-convex and NP-hard in general. To tame the non-convexity, a common scheme is the convex relaxation (mu2014square; raskutti2015convex; tomioka2011statistical)

. However, this scheme may either obtain suboptimal statistical guarantees or require evaluating the tensor nuclear norm, which is NP-hard to compute in general

(hillar2013most). Alternatively, a large body of literature turn to the non-convex formulation and focus on developing computational efficient two-stage procedures on estimating : first, one obtains a warm initialization of and then runs local algorithms to refine the estimate. Provable guarantees on estimation or recovery of for such a two-stage paradigm have been developed in different scenarios (rauhut2017low; chen2016non; ahmed2020tensor; han2020optimal; cai2019nonconvex; hao2018sparse; cai2020provable; xia2017statistically). In this work, we focus on the non-convex formulation and aim to develop a provable computationally efficient estimator for . Departing from the existing literature that focus on the first-order local methods, we propose a new Riemannian Gauss-Newton (RGN) algorithm for iterative refinement and establish the first quadratic convergence guarantee on the estimation of .

1.1 Our Contributions

In this paper, we develop a new Riemannian Gauss-Newton (RGN) algorithm for low-rank tensor estimation. The proposed algorithm is tuning free and generally has the same per-iteration computational complexity as the alternating minimization (zhou2013tensor; li2013tucker) and comparable complexity to the other first-order methods including projected gradient descent (chen2016non) and gradient descent (han2020optimal).

Moreover, assuming satisfies the tensor restricted isometry property (TRIP) (see Definition 2), we prove that with some proper initialization, the iterates generated by RGN converge quadratically to up to some statistical error. Especially in the noiseless setting, i.e., , RGN converges quadratically to the exact parameter . Figure 1 shows the numerical performance of RGN in tensor regression (left panel) and tensor completion (right panel): in the noiseless case, RGN converges quadratically to ; in the noisy case, RGN converges quadratically to a neighborhood of up to some statistical error. More simulation results on tensor estimation via rank-1 projections and tensor SVD can be found in Section 6. Since RGN generally converges to a point with nonzero function value in the noisy setting, the generic theory on RGN can only guarantee a (super)linear convergence rate to a stationary point (absil2009optimization; breiding2018convergence). Our result complements the classic theory of RGN: we show RGN converges quadratically to a neighborhood of the true parameter of interest, which achieves a statistically optimal estimation error rate. To our best knowledge, such a result is new and our RGN is the first algorithm with a provable guarantee of second-order convergence for the low-rank tensor estimation.

(a) Tensor regression. Here, is the sample size, with , Tucker rank , with and has i.i.d. standard Gaussian entries.
(b) Tensor completion. Here, we observe partial uniformly-at-random sampled entries from the noisy tensor index by , where with . Tucker rank of is , and has i.i.d. entries with .
Figure 1: RGN achieves a quadratic rate of convergence in low-rank tensor estimation. More details of the simulation setting are given in Section 6.

Furthermore, we provide a deterministic minimax lower bound for the estimation error under model (1). The lower bound matches the estimation error upper bound, which demonstrates the statistical rate-optimality of RGN.

Next, we apply RGN to two problems arising from applications in machine learning and statistics: tensor regression and tensor SVD. In both problems, we prove the iterates of RGN converge quadratically to a neighborhood of that achieves the minimax optimal estimation error. A comparison of RGN and prior algorithms on tensor regression and tensor SVD is given in Table 1. We can see for fixed

, RGN achieves the best estimation error and signal-to-noise ratio requirement, i.e., sample complexity in tensor regression and least singular value in tensor SVD, comparing to the state of the art while maintaining a relatively low computational cost. Moreover, RGN is the only algorithm with guaranteed quadratic convergence in both applications. Finally, we conduct numerical studies to support our theoretical findings in Section

6. The simulation studies show RGN offers much faster convergence compared to the existing approaches in the literature.

Tensor Regression
Algorithm required convergence estimation per-iteration
sample size rate error cost
RGN quadratic
(this work)
GD linear
Noconvex-PGD linear
Alter Mini N.A. linear N.A.
IHT111Department of Statistics, University of Wisconsin-Madison. (yluo86@wisc.edu,anruzhang@ stat.wisc.edu) linear
Tensor SVD
Algorithm least convergence estimation per-iteration
singular value rate error cost
RGN quadratic
(this work)
GD linear
Alter Mini linear
Table 1: Comparison of RGN with gradient descent (GD), nonconvex projected gradient descent (Nonconvex-PGD), alternating minimization (Alter Mini) and iterative hard thresholding (IHT) in the literature in aspects of signal-to-noise ratio requirement (i.e., sample complexity in tensor regression and least singular value in tensor SVD), convergence rate, estimation error, and per-iteration computational cost. Here we assume and

is the standard deviation of the Gaussian noise in tensor regression and tensor SVD.

11footnotetext: The analysis of IHT in rauhut2017low

relies on the assumption that the projection on the low Tucker rank tensor manifold can be almost exactly achieved by truncated high-order singular value decomposition

(de2000best). However, it is unclear whether this assumption holds in general.

1.2 Related Literature

Our work is related to a broad range of literature from a number of communities. Here we make an attempt to discuss existing results without claiming that the survey is exhaustive.

First, the low-rank tensor estimation has attracted much recent attention from machine learning and statistics communities. Various methods were proposed, including the convex relaxation (mu2014square; raskutti2015convex; tomioka2011statistical), projected gradient descent (rauhut2017low; chen2016non; ahmed2020tensor; yu2016learning), gradient descent on the factorized model (han2020optimal; cai2019nonconvex; hao2018sparse), alternating minimization (zhou2013tensor; jain2014provable; liu2020tensor; xia2017statistically), and importance sketching (zhang2020islet). Moreover, when the target tensor has order two, our problem reduces to the widely studied low-rank matrix recovery/estimation (recht2010guaranteed; li2019rapid; ma2019implicit; sun2015guaranteed; tu2016low; wang2017unified; zhao2015nonconvex; zheng2015convergent; charisopoulos2021low; luo2020recursive; bauch2021rank). The readers are referred to a recent survey in chi2019nonconvex.

Second, Riemannian manifold optimization methods have been powerful in solving optimization problems with geometric constraints (absil2009optimization; boumal2020introduction). Many progresses in this topic were made for the low-rank matrix estimation (keshavan2009matrix; boumal2011rtrmc; boumal2015low; wei2016guarantees; meyer2011linear; mishra2014fixed; vandereycken2013low; huang2018blind; cherian2016riemannian; luo2020recursive). See the recent survey on this line of work at cai2018exploiting; uschmajew2020geometric. Moreover, Riemannian manifold optimization method has been applied for various problems on low-rank tensor estimation, such as tensor regression (cai2020provable; kressner2016preconditioned), tensor completion (rauhut2015tensor; kasai2016low; dong2021new; kressner2014low; heidel2018riemannian; xia2017polynomial; steinlechner2016riemannian; da2015optimization), and robust tensor PCA (cai2021generalized). These papers mostly focus on the first-order Riemannian optimization methods, possibly due to the hardness of deriving the exact expressions of the Riemannian Hessian. A few exceptions also appear: heidel2018riemannian and psenka2020second developed Riemannian trust-region method for tensor completion under Tucker and tensor-train formats, respectively; kressner2016preconditioned proposed approximate Riemannian Newton methods for tensor regression in tensor-train and Tucker formats under the setting that the linear map has additive and Kronecker-product-type structures. Departing from these results that focus on the geometric objects and numerical implementations, in this paper we not only develop an efficient implementation of RGN under the Tucker format but also prove the quadratic convergence of the iterates and the optimal estimation error rate for the estimation of .

Finally, the low-rank tensor estimation model (1

) is related to various problems in different contexts. In the tensor-based scientific computing community, large-scale linear systems where the solution admits a low-rank tensor structure commonly arise after discretizing high-dimensional partial differential equations (PDEs)

(lynch1964tensor; hofreither2018black), which exactly become the central problem (1) in this paper. In the literature, various methods have been proposed there to solve (1). For example, bousse2018linear developed the algebraic method and Gauss-Newton method to solve the linear system with a CP low-rank tensor solution. georgieva2019greedy and kressner2016preconditioned respectively introduced a greedy approach and an approximate Riemannian Newton method to approximate the linear system by a low Tucker rank tensor. The readers are also referred to grasedyck2013literature for a recent survey. There are some key differences from this line of work to ours: first, their goal is often to find a low-rank tensor that approximately solves a linear system with small approximation error, while we aim to develop an estimator with small estimation error; second, the design matrix in linear systems from discretized PDEs often has Kronecker-product-type structure, we do not assume such structure in this paper. On the other hand, the structures of the design assumed here are application dependent, e.g., sub-Gaussian design in tensor regression and “one-hot” design in tensor completion as we mentioned in the introduction; finally, their work mainly focuses on computational aspects of the proposed methods (grasedyck2013literature), while this paper develops Riemannian Gauss-Newton for solving the low-rank tensor estimation problem and gives theoretical guarantees for the quadratic convergence of the algorithm and for the optimal estimation error bound of the final estimator.

1.3 Organization of the Paper

After a brief introduction of notation and preliminaries in Section 1.4, we introduce our main algorithm RGN and its geometric ingredients in Section 2. The theoretical results of RGN and its applications in tensor regression and tensor SVD are discussed in Sections 3 and 4, respectively. Computational complexity of RGN and numerical studies are presented in Sections 5 and 6, respectively. Conclusion and future work are given in Section 7. Additional algorithms and all technical proofs are presented in Appendices A-C.

1.4 Notation and Preliminaries

The following notation will be used throughout this article. Lowercase letters (e.g., ), lowercase boldface letters (e.g., ), uppercase boldface letters (e.g., ), and boldface calligraphic letters (e.g., ) are used to denote scalars, vectors, matrices, and order-3-or-higher tensors, respectively. For simplicity, we denote as the tensor indexed by in a sequence of tensors . We use bracket subscripts to denote sub-vectors, sub-matrices, and sub-tensors. For any vector , define its norm as . For any matrix , let be the th largest singular value of . We also denote and QR() as the subspace composed of the leading left singular vectors and the Q part of the QR orthogonalization of , respectively. represents the -by-identity matrix. Let be the set of all -by- matrices with orthonormal columns. For any , represents the projection matrix onto the column space of ; we use to represent the orthonormal complement of .

The matricization is the operation that unfolds the order- tensor along mode into the matrix where . Specifically, the mode- matricization of is formally defined as


for any . We also use notation to denote the mode- tensorization or reverse operator of . Throughout the paper, , as a reversed operation of , maps a matrix back to a tensor. The Hilbert-Schmidt norm of is defined as The Tucker rank of a tensor is denoted by and defined as a -tuple , where . For any Tucker rank- tensor , it has Tucker decomposition (tucker1966some):


where is the core tensor and is the mode- singular vectors. Here, the mode- product of with a matrix is denoted by and is of size , and its formal definition is given below


It is convenient to introduce the following abbreviations to denote the tensor-matrix product along multiple modes: ; . The following property about tensor matricization will be used (kolda2001orthogonal, Section 4):


where “” is the matrix Kronecker product. For any tensor , we define as the best Tucker rank approximation of in terms of Hilbert-Schmidt norm, where Finally, for any linear operator , we denote as its adjoint operator.

2 Algorithm

We introduce the geometry of low Tucker rank tensor Riemannian manifolds in Section 2.1 and present the procedure of RGN in Section 2.2.

2.1 The Geometry for Low Tucker Rank Tensor Manifolds

Denote the collection of -dimensional tensors of Tucker rank by . Then forms a smooth submanifold embedded in with dimension (uschmajew2013geometry; kressner2014low). Throughout the paper, we use the natural Euclidean inner product as the Riemannian metric. Suppose has Tucker decomposition ; koch2010dynamical showed that the tangent space of at , , can be represented as


In the representation above, s are not free parameters due to the constraints , . In the following Lemma 1, we introduce another representation of

with a minimal parameterization, which matches the degree of freedom of the tangent space (

). For , we let , which corresponds to the row space of , and define


where . By (8), correspond to the subspaces of the column and row spans of , respectively.

Lemma 1.

The tangent space of at in (9) can be written as

where is the mode- tensorization operator and is given in (10).

We can also show that any tensor in is at most Tucker rank . This fact will facilitate the efficient computation of RGN to be discussed in Section 5.

Lemma 2.

Any tensor is at most Tucker rank .

Lemma 3.1 of koch2010dynamical and the tangent space representation in Lemma 1 yield the following projection operator that projects any tensor onto the tangent space of at :


where and are respectively the contraction map and extension map defined as follows:


In particular, is the adjoint operator of . We will see in Section 2.2 that the representation in (11) helps the efficient implementation of RGN.

2.2 Riemannian Optimization and Riemannian Gauss-Newton

In this subsection, we first give a preliminary for Riemannian optimization and then introduce the procedure of RGN for low-rank tensor estimation.

Overall three-step procedure of Riemannian optimization. Riemannian optimization concerns optimizing a real-valued function defined on a Riemannian manifold , for which the readers are referred to absil2009optimization and boumal2020introduction for an introduction. Due to the common non-linearity, the continuous optimization on the Riemannian manifold often requires calculations on the tangent space. A typical procedure of a Riemannian optimization method contains three steps per iteration: Step 1. find the tangent space; Step 2. update the point on the tangent space; Step 3. map the point from the tangent space back to the manifold.

Low-rank tensor Riemannian manifold (Step 1). We have already discussed the tangent space of low Tucker rank tensor manifolds in Section 2.1, i.e., Step 1 above.

Update on tangent space (Step 2). Next, we describe the procedure of RGN in the tangent space. We begin by introducing a few more preliminaries for Riemannian manifold optimization. The Riemannian gradient of a smooth function at is defined as the unique tangent vector such that where denotes the directional derivative of at point along direction . Specifically for the embedded submanifold , we have:

Lemma 3.

For in (4), where is the projection operator onto the tangent space of at defined in (11).

A common way to derive RGN update in the literature is to first write down the Riemannian Newton equation, then replace the Riemannian Hessian by its Gauss-Newton approximation (absil2009optimization, Chapter 8.4.1), and finally solve the modified Riemannian Newton equation, i.e., the Riemannian Gauss-Newton equation. In our low-rank tensor estimation problem with the objective function (4), suppose the current iterate is , the RGN update should solve the following RGN equation (absil2009optimization, Chapter 8.4),


However, it is not clear how to solve this equation directly in practice.

Inspired by the classical Gauss-Newton (GN) algorithm, we instead introduce another scheme to derive RGN. Recall in solving the nonlinear least squares problem in the Euclidean space , the classic Gauss-Newton can be viewed as a modified Newton method, and can also be derived by replacing the non-linear function by its local linear approximation at the current iterate (nocedal2006numerical, Chapter 10.3). These two ways of interpretations are equivalent. Similar local linearization idea can be extended to the manifold setting except that the linearization needs to be taken in the tangent space in each iterate. Specifically, consider the objective function in (4), the linearization of at in is , which can be simplified to . So the update can be calculated by


By mapping back to the manifold, we get the new iterate .

Next, we show the proposed update derived in (14) actually matches the standard RGN update (13).

Proposition 1.

Let be update computed in (14). Then, is the Riemannian Gauss-Newton update, i.e., it solves the Riemannian Gauss-Newton equation (13).

Proposition 1 shows that (14) yields the RGN update, which directly provides a simple implementation of RGN. To see this, recall , where and are defined in the similar way as in (12) except evaluated on ; then the objective function in (14) can be rewritten as follows,


where . Based on the calculation in (15), we define the following covariates maps , where for ,

Here, satisfies and similarly for . Then, by (15) and the fact that , (14) can be equivalently solved by



Note that (16) is an unconstrained least squares with the number of parameters equal to .

Retraction (Step 3). Finally, we discuss how to map the point from the tangent space back to the manifold, i.e., Step 3 above. An ideal method is via the exponential map, which moves a point on the manifold along the geodesic. However, computing the exponential map is prohibitively expensive in most situations, and a more practical choice is the so-called retraction. Retraction is in general a first-order approximation of the exponential map. In tensor manifold , the retraction map, denoted by , should be a smooth map from to that satisfies i) and ii) for all and (absil2009optimization, Chapter 4). Here, is the tangent bundle of .

In the low Tucker rank tensor manifolds, Proposition 2.3 of kressner2014low showed that the truncated high-order singular value decomposition (T-HOSVD) (de2000multilinear) is a retraction. We further show in the following Lemma 4 that the sequentially truncated HOSVD (ST-HOSVD) (vannieuwenhoven2012new), a computationally more efficient procedure than T-HOSVD, also satisfies the retraction properties. The detailed procedures of T-HOSVD and ST-HOSVD are given in Appendix A.

Lemma 4 (Retraction of Sequentially Truncated HOSVD).

For ST-HOSVD defined in Appendix A, the map

is a retraction on around .

Summary of RGN. We give the complete RGN algorithm for low-rank tensor estimation in Algorithm 1.

Input: , , Tucker rank , initialization with Tucker decomposition , and defined as (10).

1:for  do
2:     Construct the covariates maps , where for
3:     Solve the unconstrained least squares problem
4:     Update
and via (10). Here is the retraction map onto (two choices are ST-HOSVD and T-HOSVD).
5:end for

Output: .

Algorithm 1 Riemannian Gauss-Newton for Low-rank Tensor Estimation
Remark 1 (Operator ).

In (19), plays the role as retraction that maps the iterate from the tangent space of at back onto the manifold. Since directly operates on the updated tensor, to distinguish with the canonical notation for retraction, we use a simplified notation to represent this map here. As we mentioned before, T-HOSVD (de2000multilinear) and ST-HOSVD (vannieuwenhoven2012new) are two choices of retractions.

3 Theoretical Analysis

We analyze the convergence rate of RGN in this section. We begin by introducing the quasi-projection property of T-HOSVD and ST-HOSVD and the assumption on the linear map . Different from the low-rank matrix projection, which can be efficiently and exactly computed via truncated SVD, performing low-rank tensor projection exactly, even for , can be NP hard in general (hillar2013most). We thus introduce the following quasi-projection property and the approximation constant .

Definition 1 (Quasi-projection of and Approximation Constant ).

Let be the projection map from to the tensor space of Tucker rank at most , i.e., for any and of Tucker rank at most , one always has .

We say satisfies the quasi-projection property with approximation constant if for any .

It is known that T-HOSVD and ST-HOSVD satisfy the quasi-projection property (see Chapter 10 in (hackbusch2012tensor)).

Proposition 2 (Quasi-projection property of T-HOSVD and ST-HOSVD).

T-HOSVD and ST-HOSVD described in Appendix A satisfy the quasi-projection property with approximation constant .

For technical convenience, we also assume satisfies the following Tensor Restricted Isometry Property (TRIP) (rauhut2017low). TRIP condition can be seen as a tensor generalization of the restricted isometry property (RIP). In the compressed-sensing and low-rank matrix recovery literature, the RIP condition has been widely used as one standard assumption (candes2011tight; cai2013sharp). rauhut2017low showed that TRIP condition holds if is randomly designed with a sufficient sample size .

Definition 2 (Tensor Restricted Isometry Property (TRIP)).

Let be a linear map. For a fixed -tuple with for , define the -tensor restricted isometry constant to be the smallest number such that holds for all of Tucker rank at most . If , we say satisfies -tensor restricted isometry property (TRIP).

Next, we establish the deterministic convergence theory for RGN.

Theorem 1 (Convergence of RGN).

Suppose is either T-HOSVD or ST-HOSVD, satisfies the -TRIP, and the initialization satisfies , where is the minimum of least singular values at each matricization of . Then for all ,