1 Introduction
Automatic differentiation (AD) is a powerful tool for numerically calculating derivatives of functions specified as computer programs. It significantly simplifies the programming of derivatives of complicated functions without loss of efficiency, providing better stability properties compared with classical numerical differentiation using finite differences. AD is commonly used in applied mathematics, and in particular, it is at the core of deep learning success, allowing researchers to combine ever more complex neural networks from known modules, and train them without worrying about efficient gradient computation.
In this paper, we are concerned with applying AD to the minimization problem
$$\min_{X \in \mathcal{M}} f(X),$$
where $f$ is a smooth function and $\mathcal{M}$ is the manifold of fixed-rank matrices or of fixed-rank tensor-trains [1]. It is known that in both cases, $\mathcal{M}$ forms a Riemannian manifold. One can, therefore, apply Riemannian optimization algorithms [2]
that are currently actively used in the development of state-of-the-art algorithms in numerical mathematics, partial differential equations and machine learning. A realization of such algorithms requires specific knowledge of computational aspects of low-rank objects and is especially complicated for tensor decompositions, where a number of tricks are required to reduce rank dependence and ensure the stability of an algorithm. The AD technique proposed in this work allows for a significant simplification of this process.
We are concerned with computing Riemannian gradients and matrix-vector products with approximate Riemannian Hessians, which are the building blocks for Riemannian optimization algorithms. Note that in the case of low-rank matrix or tensor-train manifolds, the matrix-vector product with the Hessian can be numerically unstable. This happens due to the presence of terms with inverted singular values
[3]. We, therefore, consider multiplication by the approximate Hessian with an omitted curvature term [4, 5, 6] (see details in Sec. 3). In the proposed method, calculating the Riemannian gradient or matrix-vector product with the approximate Riemannian Hessian of a function has the same asymptotic complexity as evaluation of the function itself at a single point. This holds under the assumption that the function evaluation is at least as expensive as the orthogonalization operation, which is a necessary step in any Riemannian gradient computation; this assumption holds true for most practical functions (see Propositions 5.2 and 6 for more details). Moreover, thanks to the implementation in TensorFlow (a Python library with AD support), the algorithms can be run both on CPUs and GPUs.
We numerically evaluate the performance of the proposed algorithms on several functions arising from solving systems of linear equations, the eigenvalue problem, the tensor completion problem and in the training of a machine learning model.
Our main contributions are:

We develop automatic differentiation algorithms for computing the Riemannian gradient and a matrix-vector product with the approximate Riemannian Hessian of a function for low-rank matrices and TT-tensors. Under mild assumptions, the asymptotic complexity of the proposed method equals the complexity of evaluating the function at one point.

We implement the proposed algorithms in TensorFlow and make them available in T3F (https://github.com/Bihaqo/t3f) – an open-source Python library for working with the TT decomposition.
Related work
There is a large body of work on creating libraries for working with tensors and tensor decompositions, which often include automatic differentiation abilities (see, e.g., [7, 8, 9, 10, 11], tntorch (https://tntorch.readthedocs.io)), but most of these libraries do not target Riemannian automatic differentiation, which is the focus of this paper. Typically, researchers compute Riemannian gradients manually, but Riemannian automatic differentiation libraries [12, 13, 14] are gaining traction, empowering the Riemannian optimization community. However, existing Riemannian AD libraries lack low-rank tensor support. For low-rank matrices, PyManOpt [12] supports Riemannian gradients, but no library supports multiplying the Riemannian Hessian by a given vector, which is required for second-order methods.
Note that in [12], an algorithm to compute the Riemannian gradient for low-rank matrices has already been proposed and implemented. Nevertheless, in this work, we present an alternative way of doing it that avoids inversions of singular values, which can be close to machine epsilon if the rank is overestimated.
A method for automatic second-order Riemannian differentiation for the manifold of low-rank tensors was proposed in [15]. The authors focus on the curvature term of the Riemannian Hessian (which we omit, as explained in Sec. 3) and assume that the other terms can be computed efficiently by a two-step procedure: first computing the Euclidean gradient or Hessian-by-vector product and then projecting it onto the tangent space. This is indeed efficient for some functions, but can be significantly slower than the approach proposed in this paper for other functions. Thus, the two papers complement each other: one can use [15] for computing the curvature term, and the algorithms proposed in this paper for the other terms.
2 Automatic differentiation (AD)
In this section, we give a brief introduction to the automatic differentiation concept. A reader familiar with this topic can skip this section.
AD is a technique for computing the value of the gradient of a smooth function $f$ specified by a computer program. In particular, it is assumed that $f$ can be represented as a sequence of elementary operations, for example additions, multiplications, trigonometric functions, logarithms, etc. Evaluation of $f$ can also involve other operations such as matrix decompositions, for which differentiation formulas are available. Under this assumption, AD allows for computing derivatives with working precision, and with a number of operations that is only a small constant factor larger than the number of operations needed to evaluate $f$ itself (i.e., with the same asymptotic complexity).
Let us illustrate the AD concept with a simple example. Consider a smooth function $f$ of two variables; it can be written as a sequence of elementary operations and depicted as the following computational graph:
AD uses the chain rule to find both components of $\nabla f$ in one pass through the computational graph in reverse order (in this paper we focus on reverse-mode autodiff, which is also sometimes called backpropagation; the alternative – forward-mode autodiff – is typically used for functions with few inputs and many outputs because of its smaller asymptotic complexity in that case). Let $v_i$ denote the intermediate variables of the graph and $\bar{v}_i = \partial f / \partial v_i$ their adjoints. Starting from the output adjoint equal to one, the chain rule $\bar{v}_i = \sum_{j:\, v_i \to v_j} \bar{v}_j\, \partial v_j / \partial v_i$ propagates the adjoints backward through the graph, and the adjoints of the inputs are exactly the components of $\nabla f$.
Thus, AD allows us to calculate all components of $\nabla f$ in one pass with $O(T_f)$ complexity, where $T_f$ is the number of FLOP required to calculate $f$ at a given point. In general, the computational graph for computing the gradient of a function has as many nodes as the original graph for evaluating the function value, and each node is, at most, a small constant times more expensive than the corresponding node from the original graph.
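To make the reverse pass concrete, the following self-contained Python sketch implements a toy reverse-mode AD. The `Var` class, the operator overloads, and the example function are our own illustrative choices, not the paper's implementation: each node stores its parents together with local derivatives, and a single sweep in reverse topological order accumulates the adjoints.

```python
import math

class Var:
    """A node of the computational graph: a value plus an adjoint accumulator."""
    def __init__(self, value, parents=()):
        self.value = value      # forward value v_i
        self.parents = parents  # tuples (parent_node, d v_i / d parent)
        self.grad = 0.0         # adjoint d f / d v_i, filled by backward()

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def sin(x):
    return Var(math.sin(x.value), [(x, math.cos(x.value))])

def backward(output):
    """Reverse pass: propagate adjoints in reverse topological order."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # each adjoint is complete before it propagates
        for parent, local in node.parents:
            parent.grad += node.grad * local

# Example: f(x1, x2) = x1 * x2 + sin(x1 * x2). One reverse pass yields both
# partial derivatives at roughly the cost of one extra function evaluation.
x1, x2 = Var(2.0), Var(3.0)
y = x1 * x2
f = y + sin(y)
backward(f)
```

After the backward pass, `x1.grad` and `x2.grad` equal $x_2(1+\cos(x_1 x_2))$ and $x_1(1+\cos(x_1 x_2))$, matching the chain rule; note that the shared node $y$ receives contributions from both of its uses before propagating further, which is exactly why the reverse topological ordering matters.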
Let us compare AD with numerical differentiation using finite differences, where the components of the gradient of a function $f$ of $n$ variables are approximated, e.g., using forward differences
$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x)}{h}, \qquad i = 1, \dots, n, \qquad (1)$$
where $e_i$ is the $i$-th unit vector and the step $h$ is chosen so that the approximation error is small enough. First, numerical differentiation is computationally more expensive than AD. Indeed, (1) requires $n + 1$ function evaluations to approximate $\nabla f$ and, hence, costs $n + 1$ times as much as a single evaluation of $f$. Moreover, due to the error amplification of the derivative approximation, (1) cannot achieve accuracy better than the square root of machine precision [16]. At the same time, AD is more robust and can achieve machine precision accuracy [17].
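A quick numerical check of both claims (one extra evaluation per component, and the $\sqrt{\varepsilon}$ accuracy barrier) can be sketched as follows; the test function $\sin$ and the step sizes are our own illustrative choices:

```python
import math

def forward_diff(f, x, h):
    # Forward difference (1): one extra function evaluation per component.
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # true derivative of sin at x = 1
# The total error is O(h) truncation plus O(eps / h) rounding, minimized near
# h ~ sqrt(eps) ~ 1e-8, where the error itself is also about sqrt(eps):
err_opt = abs(forward_diff(math.sin, 1.0, 1e-8) - exact)
err_large_h = abs(forward_diff(math.sin, 1.0, 1e-2) - exact)
```

With a large step, the $O(h)$ truncation term dominates; no choice of $h$ pushes the error of (1) much below $\sqrt{\varepsilon} \approx 10^{-8}$ in double precision, whereas AD is exact up to rounding.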
Another alternative to AD and numerical differentiation is symbolic differentiation, in which one assembles the final formula for each component of the gradient using a sequence of rules such as the product rule, chain rule, etc. Since the constraint of expressing the entire result as a single formula does not allow introducing intermediate variables, in the worst case the final formula may contain exponentially many duplicated fragments. By contrast to symbolic differentiation, AD uses intermediate variables to define those duplicated fragments, allowing one to never evaluate any quantity more than once and providing efficiency guarantees.
For a more in-depth review of automatic differentiation see, e.g., [18].
3 Riemannian optimization
Let us briefly introduce the Riemannian optimization concept. Let $\mathcal{M}$ be a smooth embedded submanifold of a Euclidean space. In this paper, we are concerned with the manifold of fixed-rank matrices and the manifold of tensors of fixed tensor-train rank. The definitions will be given in Section 4.1 and in Section 5.1 respectively. In this section, we only provide an introductory overview without implementation details.
Our goal is to solve a minimization problem with a smooth function $f$ defined on the ambient space. Assume that the solution to this problem can be approximated by a certain point of $\mathcal{M}$. Then, we can reformulate the problem as
$$\min_{x \in \mathcal{M}} f(x), \qquad (2)$$
i.e., the search space is restricted to a Riemannian manifold $\mathcal{M}$. Riemannian optimization algorithms usually involve computation of Riemannian gradients $\mathrm{grad}\, f(x)$, which for embedded submanifolds and functions defined on the ambient space may be written as a projection of the Euclidean gradient onto the tangent plane of $\mathcal{M}$ at the point $x$:
$$\mathrm{grad}\, f(x) = P_{T_x}\bigl(\nabla f(x)\bigr), \qquad (3)$$
where $P_{T_x}$ denotes the operator of orthogonal projection onto the tangent plane $T_x$ and depends on $x$ nonlinearly. Given the Riemannian gradient notion, we may solve (2) using the Riemannian gradient descent
$$x_{k+1} = R\bigl(x_k - \alpha_k\, \mathrm{grad}\, f(x_k)\bigr),$$
where the retraction $R$ returns a tangent vector back to the manifold (see [19] for different retraction operations) and the step-size parameter $\alpha_k$ is chosen to ensure decay of the functional. More advanced optimization algorithms, e.g., a Riemannian version of the conjugate gradient method, are also available [2].
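As an illustration of the scheme above, the following NumPy sketch runs Riemannian gradient descent on the manifold of rank-$r$ matrices for the model problem $f(X) = \frac{1}{2}\|X - A\|_F^2$. The choice of function, step size, and SVD-based retraction are our own illustrative assumptions; the tangent-space projection is the standard one for this manifold (cf. Sec. 4.1).

```python
import numpy as np

def svd_retract(Y, r):
    # Retraction: map an ambient-space point back to the rank-r manifold
    # by truncated SVD (the metric projection onto the manifold).
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def tangent_project(Z, U, V):
    # Orthogonal projection onto the tangent space at X = U S V^T.
    UUtZ = U @ (U.T @ Z)
    return UUtZ + (Z - UUtZ) @ (V @ V.T)

rng = np.random.default_rng(0)
n, m, r = 8, 6, 2
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r target
X = svd_retract(rng.standard_normal((n, m)), r)                # start on the manifold
err0 = np.linalg.norm(X - A)
for _ in range(500):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    riem_grad = tangent_project(X - A, U, V)   # project the Euclidean gradient
    X = svd_retract(X - 0.5 * riem_grad, r)    # tangent step + retraction
err_final = np.linalg.norm(X - A)
```

Since the target $A$ has rank exactly $r$, the iterates converge to it; the point of the sketch is only the structure of one iteration: Euclidean gradient, projection, step, retraction.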
One can also utilize second-order methods, which involve computation of the Riemannian Hessian operator. For the Riemannian Hessian we can use the formula [3, 4] (note that both the projection and retraction operations depend on the particular choice of a manifold; nevertheless, we do not attach a manifold subscript to them, as the manifold will be clear from context and to not overcomplicate the notation):
$$\mathrm{Hess}\, f(x)[\xi] = P_{T_x}\bigl(\nabla^2 f(x)[\xi]\bigr) + P'_x(\xi)\bigl[\nabla f(x)\bigr], \qquad (4)$$
where $\nabla^2 f(x)$ is the Euclidean Hessian and $P'_x$ denotes the Fréchet derivative of the projection $x \mapsto P_{T_x}$. The second term in (4) arises due to the nonlinearity of the manifold. For the manifold of low-rank matrices, it contains the inverse of a matrix of singular values [3]. If singular values are small, this can lead to numerical instabilities. To avoid this problem, the second term in (4) can be omitted [4, 20]. In this case, the optimization procedure can be interpreted as a constrained Gauss–Newton method. We, therefore, consider only linearized Hessians and are interested in an efficient matrix-vector product by the first term of (4):
$$\xi \mapsto P_{T_x}\bigl(\nabla^2 f(x)[\xi]\bigr). \qquad (5)$$
Note that first computing $\nabla f(x)$ and then applying $P_{T_x}$ as in (3) can be inefficient; instead, the Riemannian gradient should be calculated at once. For example, for the manifold of low-rank matrices, the Riemannian gradient can always be represented as a low-rank matrix (see Sec. 4.1 for details), while the Euclidean gradient can have arbitrarily large rank. Thus, using the Euclidean gradient in the intermediate calculations can lead to an inefficient algorithm. Similarly, first computing $\nabla^2 f(x)[\xi]$ and then applying $P_{T_x}$ as in (5) can be significantly less efficient than calculating the projected product at once. The goal of this paper is, thus, to develop an efficient tool to calculate (3) and (5) – the building block operations of Riemannian optimization. The key assumption we make is that we can efficiently evaluate $f$ at any point of the manifold. Then, under mild conditions (see Propositions 5.2 and 6 and below), the overall complexity of the presented algorithm is only a constant factor larger than the complexity of the function evaluation.
Let us introduce the scalar product $\langle \cdot, \cdot \rangle$ and the associated norm $\| \cdot \| = \sqrt{\langle \cdot, \cdot \rangle}$.
Using this notation, possible choices of $f$ are, for example:

$f(x) = \langle x, Ax \rangle - 2\langle x, b \rangle$ or $f(x) = \|Ax - b\|^2$ for given $A$ and $b$, which arise when solving linear systems;

$f(x) = \langle x, A(x)\, x \rangle / \langle x, x \rangle$ with possibly nonlinear $A$, which arises when solving (nonlinear) eigenvalue problems;

$f(x) = \|P_\Omega(x - a)\|^2$, where $P_\Omega$ denotes projection onto the index set $\Omega$, such that $(P_\Omega x)_i = x_i$ for $i \in \Omega$ and $(P_\Omega x)_i = 0$ otherwise. This type of problem is referred to as a matrix or tensor completion problem;

$f$ is a neural network loss function, which arises when using the TT-decomposition to parametrize a recurrent neural network and applying Riemannian optimization for training (for more details see Section 7.1).
In Section 5.3, we will also discuss how our approach can be used for operations that are not directly related to the minimization of a function, e.g., how to efficiently compute a preconditioned residual for two non-commuting operators.
4 Automatic differentiation for the Riemannian gradient: fixed-rank matrices
In this section, we propose an approach to automatically compute Riemannian gradients for the manifold of fixed-rank matrices.
4.1 The manifold of fixed-rank matrices
Let us briefly recall the concepts related to the manifold of fixed-rank matrices. The set of $n \times m$ matrices of fixed rank $r$,
$$\mathcal{M}_r = \{\, X \in \mathbb{R}^{n \times m} : \operatorname{rank}(X) = r \,\},$$
forms a smooth submanifold of $\mathbb{R}^{n \times m}$ [21, Example 8.14]. Using SVD, any point $X$ of the manifold can be represented as $X = U S V^\top$, where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{m \times r}$ are matrices with orthonormal columns – singular vectors ($U^\top U = I_r$, $V^\top V = I_r$) – and $S$ is the diagonal matrix of singular values. The tangent space of the manifold at a point $X$ can be written as
$$T_X \mathcal{M}_r = \left\{\, U\, \delta V^\top + \delta U\, V^\top : \ \delta U \in \mathbb{R}^{n \times r},\ \delta V \in \mathbb{R}^{m \times r},\ U^\top \delta U = 0_{r \times r} \,\right\}, \qquad (6)$$
where $0_{r \times r}$ denotes a zero matrix of size $r \times r$. In what follows, we refer to the matrices $\delta U$ and $\delta V$ that define an element of the tangent space as delta-matrices. The orthogonal projection of a matrix $Z$ onto the tangent space can, thus, be obtained as follows:
$$P_{T_X}(Z) = U U^\top Z + Z V V^\top - U U^\top Z V V^\top. \qquad (7)$$
We refer the reader to, e.g., [22, Sec. 2.1] for a more detailed discussion of the manifold of low-rank matrices, including the derivation of (7).
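Two defining properties of (6)–(7) are easy to check numerically: the projection is idempotent, and it leaves any vector of the form (6) unchanged. The following NumPy sketch does exactly that; the sizes and random data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 7, 5, 2
# A point X = U S V^T on the manifold of rank-r matrices.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
S = np.diag([3.0, 1.0])
X = U @ S @ V.T

def proj(Z):
    # Eq. (7): P_{T_X}(Z) = U U^T Z + Z V V^T - U U^T Z V V^T.
    UUtZ = U @ (U.T @ Z)
    return UUtZ + Z @ (V @ V.T) - UUtZ @ (V @ V.T)

# A tangent vector in the form (6), with the gauge condition U^T dU = 0.
dU = rng.standard_normal((n, r))
dU -= U @ (U.T @ dU)              # enforce the gauge
dV = rng.standard_normal((m, r))
xi = U @ dV.T + dU @ V.T

Z = rng.standard_normal((n, m))   # an arbitrary ambient matrix
```

Tangent vectors pass through `proj` untouched, and `proj(proj(Z))` equals `proj(Z)` for any ambient `Z`, confirming that (7) is an orthogonal projector onto (6).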
Finally, to simplify the notation, we denote the projection operator $P_{T_X}$ simply as $P_X$. We also introduce the mapping that takes parametrization matrices to an element of the tangent plane at the point $X$,
$$\phi_X : \mathbb{R}^{n \times r} \times \mathbb{R}^{m \times r} \to \mathbb{R}^{n \times m}, \qquad (8)$$
namely,
$$\phi_X(\delta U, \delta V) = U\, \delta V^\top + \delta U\, V^\top. \qquad (9)$$
This mapping will be used later in Sec. 4.2 to simplify the presentation of the algorithm.
4.2 Automatic differentiation approach
In this section, we propose an efficient way of computing the Riemannian gradient.
The Riemannian gradient is an $n \times m$ matrix, but as noted in the previous section it can be defined via the delta-matrices $\delta U$ and $\delta V$ using just $(n + m) r$ parameters ($(n + m) r - r^2$ if the gauge condition is taken into account). Thus, if we can avoid using full $n \times m$ matrices in intermediate calculations, we can potentially compute the Riemannian gradient with a better asymptotic complexity than $O(nm)$.
A naive approach to computing the Riemannian gradient is to first compute $\nabla f(X)$ with AD and then project the result onto the tangent plane by using formula (7):
$$\mathrm{grad}\, f(X) = P_{T_X}\bigl(\nabla f(X)\bigr). \qquad (10)$$
The problem with this approach is that it requires finding the full matrix of the Euclidean gradient of size $n \times m$, which we want to avoid. Alternatively, we may find the Riemannian gradient without explicitly forming $\nabla f(X)$. In particular, we notice that the Riemannian gradient (10) involves computing the following products of matrices:
$$\nabla f(X)\, V, \qquad \nabla f(X)^\top U. \qquad (11)$$
We may find these two quantities by using classical AD. Indeed, one can run classic AD on the function twice (each time with the complexity equal to evaluating the function at a single point, due to AD properties) to compute all the pieces that depend on $\nabla f(X)$.
However, in the rest of this section we propose an alternative way of computing quantities (11) by using classic AD a single time on a specially introduced auxiliary function. This alternative approach is introduced because it naturally generalizes into an efficient algorithm for the tensor case (see Sec. 5.2).
Quantities (11) can be computed at once by differentiating (using AD) the following auxiliary function defined using mapping (9):
$$\hat{f}(\delta U, \delta V) = f\bigl(U\, \delta V^\top + \delta U\, V^\top\bigr),$$
whose derivatives are evaluated at the point $(\delta U, \delta V) = (0, V S)$, so that the argument of $f$ equals $X$. We have
$$\frac{\partial \hat{f}}{\partial\, \delta U}\Big|_{(0,\, V S)} = \nabla f(X)\, V, \qquad \frac{\partial \hat{f}}{\partial\, \delta V}\Big|_{(0,\, V S)} = \nabla f(X)^\top U. \qquad (12)$$
Indeed, the argument of $f$ can be represented elementwise as
$$\bigl(U\, \delta V^\top + \delta U\, V^\top\bigr)_{ab} = \sum_{c} \bigl(U_{ac}\, \delta V_{bc} + \delta U_{ac}\, V_{bc}\bigr)$$
and, hence, its partial derivatives are
$$\frac{\partial \bigl(U\, \delta V^\top + \delta U\, V^\top\bigr)_{ab}}{\partial\, \delta U_{ij}} = \delta_{ai}\, V_{bj}, \qquad \frac{\partial \bigl(U\, \delta V^\top + \delta U\, V^\top\bigr)_{ab}}{\partial\, \delta V_{ij}} = U_{aj}\, \delta_{bi},$$
where $\delta_{ai}$ is the Kronecker delta. Applying the chain rule, we get (12). Thus, a low-rank representation of the Riemannian gradient can be written as
$$\mathrm{grad}\, f(X) = U\, \delta V_{\mathrm{grad}}^\top + \delta U_{\mathrm{grad}}\, V^\top$$
with
$$\delta U_{\mathrm{grad}} = \bigl(I - U U^\top\bigr) \frac{\partial \hat{f}}{\partial\, \delta U}, \qquad \delta V_{\mathrm{grad}} = \frac{\partial \hat{f}}{\partial\, \delta V}. \qquad (13)$$
The algorithm to compute the Riemannian gradient is summarized in Algorithm 1.
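The net result of (12)–(13) can be checked numerically without any AD machinery: for a function with a known Euclidean gradient, the delta-matrices below reproduce the full projection (7) while only ever forming $n \times r$ and $m \times r$ quantities. The test function $f(X) = \frac{1}{2}\|X - A\|_F^2$ and all sizes are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 6, 5, 2
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
S = np.diag([2.0, 1.0])
X = U @ S @ V.T

A = rng.standard_normal((n, m))
G = X - A        # Euclidean gradient of f(X) = 0.5 * ||X - A||_F^2

# Delta-matrices of the Riemannian gradient, built from the thin products
# G V and G^T U only (the quantities (11) that AD of the auxiliary function
# provides); the (I - U U^T) factor enforces the gauge condition.
dU_grad = G @ V - U @ (U.T @ (G @ V))   # n x r
dV_grad = G.T @ U                       # m x r
riem_grad = U @ dV_grad.T + dU_grad @ V.T

# Reference: the full projection (7), formed explicitly for comparison only.
UUtG = U @ (U.T @ G)
full_proj = UUtG + G @ (V @ V.T) - UUtG @ (V @ V.T)
```

The two $n \times m$ matrices coincide, and `dU_grad` satisfies the gauge $U^\top \delta U = 0$; in the actual algorithm `full_proj` is never formed.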
4.3 Complexity of the approach
Let us estimate the complexity of computing the delta-matrices by the proposed approach, i.e., by defining the auxiliary function and differentiating it with respect to $\delta U$ and $\delta V$.

Proposition
Let $f$ be a smooth function defined by a program, which takes as input the SVD decomposition of a matrix and outputs the value of $f$ in $T_f$ floating point operations (FLOP), where $T_f$ is polynomial with respect to the rank of the matrix X (i.e., the program belongs to the P complexity class). Then, the complexity of using Alg. 1 for computing the delta terms that define the Riemannian gradient is $O(T_f + n r^2)$.
As an example, computing
leads to FLOP, since
Proof (of Prop. 4.3)
The auxiliary function can be constructed by feeding to the program the factors of the matrix
$$U\, \delta V^\top + \delta U\, V^\top = \begin{pmatrix} U & \delta U \end{pmatrix} \begin{pmatrix} \delta V & V \end{pmatrix}^\top,$$
which is represented with rank $2r$ – twice larger than the rank of the original matrix. As a result, the asymptotic complexity (as a function of $n$ and $r$) of evaluating the function on such a matrix is unchanged. Thanks to the properties of AD, computing the derivatives of the auxiliary function with respect to the factors $\delta U$ and $\delta V$ has the same complexity as the function evaluation. Finally, the additional projection step in (13) can be done in $O(n r^2)$ operations, yielding the claimed total complexity.
For most functions used in practice, the asymptotic complexity of evaluating the function at one point exceeds $O(n r^2)$, and the total complexity of the proposed algorithm (as a function of $n$ and $r$) equals that of evaluating the function at one point.
4.4 More general view of the proposed algorithm
In this section, we look at the proposed algorithm from a more general perspective, trying to avoid specifics of the fixed-rank manifold. The main idea of the proposed algorithm is to introduce the auxiliary function (27) and express the desired Riemannian gradient in terms of its derivatives (12). Note that we could have used an alternative auxiliary function (one can use the same derivation as in Sec. 4.2 to prove that differentiating this alternative auxiliary function yields the same results): if one combines both arguments of the mapping (see (8)) into a single $(n + m) r$-dimensional vector, one can define an equivalent mapping from this vector onto the tangent plane
and an alternative representation of the auxiliary function.
(14)
Thus, the proposed approach is equivalent to defining a mapping from the parametrization of the tangent space onto the tangent space itself, defining an auxiliary function (14), computing its gradient using classical AD, and finally doing certain post-processing of this gradient: reshaping and enforcing the gauge conditions as in (13). It is not surprising that we obtain the Riemannian gradient using these formulas, as informally the gradient of the auxiliary function (14) is the steepest ascent direction over the parametrization and, hence, of $f$ at $X$ over all possible vectors from the tangent plane.
5 Automatic differentiation for the Riemannian gradient: fixed-rank tensors
In this section, we extend the results of Sec. 4.2 to fixed-rank tensor-train tensors, which generalize fixed-rank matrices to multidimensional arrays.
5.1 The manifold of TT-tensors of fixed rank
A $d$-dimensional tensor $X$ is said to be represented in the tensor-train format [23] (TT-format) if each of its elements is a product of matrices:
$$X(i_1, \dots, i_d) = G_1(i_1)\, G_2(i_2) \cdots G_d(i_d), \qquad (15)$$
where for fixed $i_k$, $G_k(i_k)$ is an $r_{k-1} \times r_k$ matrix for any value of $i_k$. We require $r_0 = r_d = 1$ such that $G_1(i_1)$ is a row vector and $G_d(i_d)$ is a column vector. The three-dimensional arrays of sizes $r_{k-1} \times n_k \times r_k$, $k = 1, \dots, d$, are called TT-cores and the vector $(r_1, \dots, r_{d-1})$
is called the TT-rank of $X$. For a more detailed discussion of the properties of the TT-format see [23].
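A minimal NumPy sketch of (15): a tensor is stored by its TT-cores, and a single element is evaluated as a product of matrix slices. The dimensions and ranks are arbitrary illustrative choices; `tt_full` is for verification only and defeats the purpose of the format on large tensors.

```python
import numpy as np

rng = np.random.default_rng(3)
dims = (4, 3, 5)        # n_1, n_2, n_3
ranks = (1, 2, 3, 1)    # r_0, ..., r_3 with r_0 = r_3 = 1
# TT-core k is an r_{k-1} x n_k x r_k array; its slice core[:, i_k, :]
# is the matrix G_k(i_k) from (15).
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1]))
         for k in range(len(dims))]

def tt_element(cores, idx):
    # Eq. (15): multiply one matrix slice per core, left to right.
    res = np.eye(1)
    for core, i in zip(cores, idx):
        res = res @ core[:, i, :]
    return res[0, 0]

def tt_full(cores):
    # Contract all cores into the dense tensor (verification only).
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape([c.shape[1] for c in cores])
```

The TT representation stores only $\sum_k r_{k-1} n_k r_k$ numbers instead of $\prod_k n_k$, which is the source of the format's efficiency for higher-dimensional tensors.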
Like in the matrix case (Sec. 4), the set of tensors of fixed TT-rank forms a smooth manifold. To parametrize its tangent spaces, we need the notion of orthogonalization of the TT-cores. A TT-representation (15) is called $\mu$-orthogonal if
$$\sum_{i_k} G_k(i_k)^\top\, G_k(i_k) = I_{r_k} \qquad (16)$$
for $k = 1, \dots, \mu - 1$, and
$$\sum_{i_k} G_k(i_k)\, G_k(i_k)^\top = I_{r_{k-1}} \qquad (17)$$
for $k = \mu + 1, \dots, d$. If $\mu = 1$ or $\mu = d$, we only require (17) or (16) respectively. The cores satisfying (16) and (17) are called, respectively, left- and right-orthogonal cores. The TT-decomposition of a tensor is not unique, and for any $\mu$ there exists a $\mu$-orthogonal representation of a given tensor [24, Sec. 4.2.1]. Moreover, for any two values of $\mu$, the corresponding orthogonal decompositions can be constructed to share the left-orthogonal TT-cores satisfying (16) and the right-orthogonal TT-cores satisfying (17).
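Left-orthogonal cores satisfying (16) can be produced by a standard QR sweep, sketched here with NumPy on illustrative data (this is a common construction, not the paper's specific implementation): each core is unfolded into an $r_{k-1} n_k \times r_k$ matrix and QR-factorized, and the triangular factor is absorbed into the next core, which leaves the represented tensor unchanged.

```python
import numpy as np

def left_orthogonalize(cores):
    # Sweep left to right; afterwards, cores 1..d-1 satisfy (16).
    cores = [c.copy() for c in cores]
    for k in range(len(cores) - 1):
        r0, n, r1 = cores[k].shape
        Q, R = np.linalg.qr(cores[k].reshape(r0 * n, r1))
        cores[k] = Q.reshape(r0, n, Q.shape[1])
        # Absorb R into the next core: the product of all cores is unchanged.
        cores[k + 1] = np.tensordot(R, cores[k + 1], axes=([1], [0]))
    return cores

def tt_dense(cores):
    # Dense tensor from TT-cores (for verification only).
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.squeeze()

rng = np.random.default_rng(4)
dims, ranks = (4, 3, 5), (1, 2, 3, 1)
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1]))
         for k in range(len(dims))]
lo = left_orthogonalize(cores)
```

The unfolded orthogonality $M^\top M = I_{r_k}$ of a reshaped core is exactly condition (16) written as a sum over slices $G_k(i_k)$.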
For a given tensor X, one can define a set of left-orthogonal TT-cores $L_k$, right-orthogonal TT-cores $R_k$, and unrestricted TT-cores $G_k$ such that for any $\mu$, there exists the following $\mu$-orthogonal decomposition of the tensor:
$$X(i_1, \dots, i_d) = L_1(i_1) \cdots L_{\mu-1}(i_{\mu-1})\, G_\mu(i_\mu)\, R_{\mu+1}(i_{\mu+1}) \cdots R_d(i_d). \qquad (18)$$
Using the left-orthogonal TT-cores $L_k$ and the right-orthogonal TT-cores $R_k$ of the tensor $X$, one may parametrize the tangent space as follows:
$$T_X \mathcal{M} = \Bigl\{\, \sum_{k=1}^{d} L_1(i_1) \cdots L_{k-1}(i_{k-1})\, \delta G_k(i_k)\, R_{k+1}(i_{k+1}) \cdots R_d(i_d) \,\Bigr\}. \qquad (19)$$
In what follows, we refer to the tensors $\delta G_k$ that define an element of the tangent space as delta-terms.
Additional gauge conditions, which generalize the orthogonality constraint in the definition of the matrix tangent space (6) to the tensor case, are usually introduced to uniquely parametrize elements of the tangent space:
$$\sum_{i_k} L_k(i_k)^\top\, \delta G_k(i_k) = 0, \qquad k = 1, \dots, d - 1. \qquad (20)$$
In what follows, we always assume that the deltas that define an element of the tangent space obey the gauge conditions (20).
Note that in (19), an element of the tangent space is formally represented as a sum of $d$ TT-tensors, each of TT-rank $\mathbf{r}$, and, hence, can be represented as a single TT-tensor with TT-rank $d\,\mathbf{r}$ [23, Sec. 4.1]. Nevertheless, thanks to the common cores, it can be represented with TT-rank equal to $2\mathbf{r}$. Indeed, by directly multiplying the block matrices, one can verify that an element of the tangent space (19) equals the TT-tensor with the cores
$$\begin{pmatrix} \delta G_1(i_1) & L_1(i_1) \end{pmatrix}, \quad \begin{pmatrix} R_k(i_k) & 0 \\ \delta G_k(i_k) & L_k(i_k) \end{pmatrix},\ k = 2, \dots, d - 1, \quad \begin{pmatrix} R_d(i_d) \\ \delta G_d(i_d) \end{pmatrix}. \qquad (21)$$
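The block structure (21) is pure block-matrix algebra and can be verified numerically for arbitrary cores (orthogonality is not needed for the identity itself). The following NumPy sketch, on illustrative random data, compares the block-core TT-tensor with the explicit sum of $d$ terms from (19):

```python
import numpy as np

def tt_dense(cores):
    # Dense tensor from TT-cores, for verification only.
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.squeeze()

rng = np.random.default_rng(5)
dims, ranks = (3, 4, 2), (1, 2, 2, 1)
d = len(dims)
shape = lambda k: (ranks[k], dims[k], ranks[k + 1])
L = [rng.standard_normal(shape(k)) for k in range(d)]   # stand-ins for L_k
R = [rng.standard_normal(shape(k)) for k in range(d)]   # stand-ins for R_k
dG = [rng.standard_normal(shape(k)) for k in range(d)]  # delta-terms

# Sum of d TT-tensors as in (19): term k carries the delta in slot k.
explicit_sum = sum(tt_dense(L[:k] + [dG[k]] + R[k + 1:]) for k in range(d))

# Rank-2r block cores as in (21).
first = np.concatenate([dG[0], L[0]], axis=2)                 # (1, n, 2r)
middle = [np.concatenate([
    np.concatenate([R[k], np.zeros_like(L[k])], axis=2),      # [ R_k    0  ]
    np.concatenate([dG[k], L[k]], axis=2),                    # [ dG_k  L_k ]
], axis=0) for k in range(1, d - 1)]
last = np.concatenate([R[d - 1], dG[d - 1]], axis=0)          # (2r, n, 1)
block = tt_dense([first] + middle + [last])
```

Multiplying the block slices left to right keeps a running pair (partial sum, product of $L$'s), so the final product reproduces the full sum in (19) while the TT-rank only doubles.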
For convenience, we also introduce a function that maps the delta-terms to an element of the tangent space,
$$\psi_X(\delta G_1, \dots, \delta G_d) \in T_X \mathcal{M}, \qquad (22)$$
defined as the TT-tensor with the cores (21). The following proposition gives the explicit representation of a general tensor projected onto the tangent plane of $\mathcal{M}$.
Proposition
[24, equation (4.17)] The orthogonal projection of a given tensor onto the tangent space $T_X \mathcal{M}$ is defined as the element of the tangent space (19) whose delta-terms $\delta G_k$, $k = 1, \dots, d - 1$, are given by
(23)
and whose last delta-term $\delta G_d$ is given by
For a more detailed discussion of the manifold of tensors of fixed tensor-train rank (including derivations of the above equations) see, e.g., Sec. 4.3–4.4 of [24].