Automatic differentiation for Riemannian optimization on low-rank matrix and tensor-train manifolds

03/27/2021
by   Alexander Novikov, et al.
Skoltech

In scientific computing and machine learning applications, matrices and more general multidimensional arrays (tensors) can often be approximated with the help of low-rank decompositions. Since matrices and tensors of fixed rank form smooth Riemannian manifolds, one of the popular tools for finding low-rank approximations is to use Riemannian optimization. Nevertheless, efficient implementation of Riemannian gradients and Hessians, required in Riemannian optimization algorithms, can be a nontrivial task in practice. Moreover, in some cases, analytic formulas are not even available. In this paper, we build upon automatic differentiation and propose a method that, given an implementation of the function to be minimized, efficiently computes Riemannian gradients and matrix-by-vector products between an approximate Riemannian Hessian and a given vector.


1 Introduction

Automatic differentiation (AD) is a powerful tool for numerically calculating derivatives of functions specified as computer programs. It significantly simplifies the programming of derivatives of complicated functions without loss of efficiency, providing better stability properties compared with classical numerical differentiation using finite differences. AD is commonly used in applied mathematics, and in particular, it is at the core of deep learning success, allowing researchers to combine ever more complex neural networks from known modules, and train them without worrying about efficient gradient computation.

In this paper, we are concerned with applying AD to the minimization problem

$$\min_{X \in \mathcal{M}} F(X),$$

where $F$ is a smooth function and $\mathcal{M}$ is a subset of $\mathbb{R}^{n\times m}$ consisting of fixed-rank matrices ($\mathcal{M}_r$) or a subset of $\mathbb{R}^{n_1\times\cdots\times n_d}$ consisting of fixed-rank tensor-trains ($\mathcal{M}_{\mathbf r}$) [1]. It is known that in both cases, $\mathcal{M}$ forms a smooth Riemannian manifold. One can, therefore, apply Riemannian optimization algorithms [2] that are currently actively used for the development of state-of-the-art algorithms in numerical mathematics, partial differential equations and machine learning. A realization of such algorithms requires specific knowledge of the computational aspects of low-rank objects and is especially complicated for tensor decompositions, where a number of tricks have to be applied to reduce the rank dependence and ensure the stability of an algorithm. The AD technique proposed in this work allows for a significant simplification of this process.

We are concerned with computing Riemannian gradients and matrix-vector products with approximate Riemannian Hessians, which are the building blocks for Riemannian optimization algorithms. Note that in the case of low-rank matrix or tensor-train manifolds, the matrix-vector product with the Hessian can be numerically unstable due to the presence of terms with inverted singular values [3]. We, therefore, consider multiplication by the approximate Hessian with an omitted curvature term [4, 5, 6] (see details in Sec. 3).

In the proposed method, calculating the Riemannian gradient or the matrix-vector product with the approximate Riemannian Hessian of a function has the same asymptotic complexity as evaluation of the function itself at a single point (this holds under the assumption that the function evaluation is at least as expensive as the orthogonalization operation, which is a necessary step in any Riemannian gradient computation; the assumption holds true for most practical functions, see Propositions 5.2 and 6 for more details). Moreover, thanks to the implementation in TensorFlow (a Python library with AD support), the algorithms can be run both on CPUs and GPUs.

We numerically evaluate the performance of the proposed algorithms on several functions arising from the solution of linear systems, eigenvalue problems, tensor completion, and the training of a machine learning model.

Our main contributions are:

  • We develop automatic differentiation algorithms for computing the Riemannian gradient and a matrix-vector product with the approximate Riemannian Hessian of a function for low-rank matrices and TT-tensors. Under mild assumptions, the asymptotic complexity of the proposed method equals the complexity of evaluating the function at one point.

  • We implement the proposed algorithms in TensorFlow and make them available in T3F (https://github.com/Bihaqo/t3f), an open-source Python library for working with the TT-decomposition.

Related work

There is a large body of work on creating libraries for working with tensors and tensor decompositions, which often include automatic differentiation capabilities (see, e.g., [7, 8, 9, 10, 11] and tntorch, https://tntorch.readthedocs.io), but most of these libraries do not target Riemannian automatic differentiation, which is the focus of this paper. Typically, researchers compute Riemannian gradients manually, but Riemannian automatic differentiation libraries [12, 13, 14] are gaining traction, empowering the Riemannian optimization community. However, existing Riemannian AD libraries lack low-rank tensor support. For low-rank matrices, PyManOpt [12] supports Riemannian gradients, but no library supports multiplying the Riemannian Hessian by a given vector, which is required for second-order methods.

Note that in [12], an algorithm to compute the Riemannian gradient for low-rank matrices has already been proposed and implemented. Nevertheless, in this work, we present an alternative way of doing it that avoids inversions of singular values, which can be close to machine epsilon if the rank is overestimated.

A method for automatic second-order Riemannian differentiation for the manifold of low-rank tensors was proposed in [15]. The authors focus on the curvature term of the Riemannian Hessian (which we omit, as explained in Sec. 3) and assume that the other terms can be computed efficiently by a two-step procedure: first computing the Euclidean gradient or Hessian-by-vector product and then projecting it onto the tangent space. This is indeed efficient for some functions, but can be significantly slower than the approach proposed in this paper for some other functions. Thus, the two papers complement each other: one can use [15] for computing the curvature term and the algorithms proposed in this paper for the other terms.

2 Automatic differentiation (AD)

In this section, we give a brief introduction to the automatic differentiation concept. A reader familiar with this topic can skip this section.

AD is a technique for computing the gradient of a smooth function specified by a computer program. In particular, it is assumed that the evaluation of the function can be represented as a sequence of elementary operations, for example, additions, multiplications, trigonometric functions, logarithms, etc. The evaluation can also involve other operations, such as matrix decompositions, for which differentiation formulas are available. Under this assumption, AD computes derivatives with working precision and with a number of operations that is only a small constant factor larger than the number of operations required to evaluate the function itself (i.e., with the same asymptotic complexity).

Let us illustrate the AD concept with a simple example: a smooth function of two variables can be written as a sequence of elementary operations (intermediate variables) and depicted as a computational graph whose nodes are these operations.

AD uses the chain rule to find both components of the gradient in one pass through the computational graph in reverse order: starting from the output node, the derivative of the output with respect to each intermediate variable is propagated back to the inputs, reusing the already computed derivatives of all downstream nodes. (In this paper we focus on reverse-mode autodiff, which is also sometimes called backpropagation. The alternative, forward-mode autodiff, is typically used for functions whose number of inputs is much smaller than the number of outputs, because of its smaller asymptotic complexity in that case.)

Thus, AD allows us to calculate all components of the gradient in one pass, with complexity proportional to the number of FLOP needed to evaluate the function itself at a given point. In general, the computational graph for computing the gradient of a function has as many nodes as the original graph for evaluating the function value, and each node is, at most, a small constant factor more expensive than the corresponding node of the original graph.
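To make the reverse pass concrete, here is a minimal sketch of our own (the worked function f(x1, x2) = sin(x1 * x2) + x1 is chosen arbitrarily and is not the example from the paper): the gradient is computed by a hand-written reverse pass and cross-checked against TensorFlow's reverse-mode AD.

```python
import numpy as np
import tensorflow as tf

# Arbitrary illustrative function: f(x1, x2) = sin(x1 * x2) + x1.
def f_and_grad(x1, x2):
    # forward pass: each line is one elementary operation (a graph node)
    v1 = x1 * x2
    v2 = np.sin(v1)
    y = v2 + x1
    # reverse pass: propagate d y / d node from the output back to the inputs
    dy_dv2 = 1.0
    dy_dv1 = dy_dv2 * np.cos(v1)   # through v2 = sin(v1)
    dy_dx1 = dy_dv1 * x2 + 1.0     # through v1 = x1 * x2 and through "+ x1"
    dy_dx2 = dy_dv1 * x1           # through v1 = x1 * x2
    return y, (dy_dx1, dy_dx2)

# cross-check against TensorFlow's reverse-mode AD
x1, x2 = tf.constant(0.7), tf.constant(-1.3)
with tf.GradientTape() as tape:
    tape.watch([x1, x2])
    y = tf.sin(x1 * x2) + x1
print(f_and_grad(0.7, -1.3)[1])    # manual reverse pass
print(tape.gradient(y, [x1, x2]))  # should agree up to floating-point precision
```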

Let us compare AD with numerical differentiation using finite differences, where the components of the gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$ are approximated, e.g., using forward differences

$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h\, e_i) - f(x)}{h}, \qquad i = 1, \ldots, n, \qquad (1)$$

where $e_i$ is the $i$-th unit vector and the step $h$ is chosen so that the approximation error is small enough. First, numerical differentiation is computationally more expensive than AD. Indeed, (1) requires $n+1$ function evaluations to approximate the full gradient and, hence, the complexity grows linearly with the number of variables. Moreover, due to the error amplification of the derivative approximation, (1) cannot achieve accuracy better than the square root of machine precision [16]. At the same time, AD is more robust and can achieve machine precision accuracy [17].
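The accuracy limitation of (1) is easy to observe numerically. The following small experiment of ours differentiates sin at x = 1 by forward differences and prints the error for decreasing step sizes; the error bottoms out around the square root of double machine precision (about 1e-8) instead of reaching machine precision.

```python
import numpy as np

# Forward-difference approximation of d/dx sin(x) at x = 1 for several step sizes.
# The error first decreases with h and then grows again once floating-point
# cancellation dominates, bottoming out near sqrt(machine eps) ~ 1e-8.
x = 1.0
exact = np.cos(x)
for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]:
    fd = (np.sin(x + h) - np.sin(x)) / h
    print(f"h = {h:.0e}   error = {abs(fd - exact):.2e}")
```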

Another alternative to AD and numerical differentiation is symbolic differentiation, in which one assembles the final formula for each component of the gradient using a sequence of rules such as the product rule, the chain rule, etc. Since the constraint of expressing the entire result as a single formula does not allow introducing intermediate variables, in the worst case the final formula may contain exponentially many duplicated fragments. In contrast to symbolic differentiation, AD uses intermediate variables to define those duplicated fragments, which allows one to never evaluate any quantity more than once and provides efficiency guarantees.

For a more in-depth review of automatic differentiation see e.g. [18].

3 Riemannian optimization

Let us briefly introduce the Riemannian optimization concept. Let $\mathcal{M}$ be a smooth embedded submanifold of a Euclidean space. In this paper, we are concerned with the manifold of fixed-rank matrices ($\mathcal{M}_r$) and the manifold of tensors of fixed tensor-train rank ($\mathcal{M}_{\mathbf r}$); the definitions will be given in Section 4.1 and in Section 5.1, respectively. In this section, we only provide an introductory overview without implementation details.

Our goal is to solve a minimization problem with a smooth function $F$:

$$\min_X F(X).$$

Assume that the solution to this problem can be approximated by a certain point $X \in \mathcal{M}$. Then, we can reformulate the problem as

$$\min_{X \in \mathcal{M}} F(X), \qquad (2)$$

i.e., the search space is restricted to the Riemannian manifold $\mathcal{M}$. Riemannian optimization algorithms usually involve computation of Riemannian gradients $\mathrm{grad}\,F(X)$, which for embedded submanifolds of Euclidean spaces and functions defined on the ambient space may be written as a projection of the Euclidean gradient $\nabla F(X)$ onto the tangent plane $T_X\mathcal{M}$ of $\mathcal{M}$ at the point $X$:

$$\mathrm{grad}\,F(X) = P_{T_X\mathcal{M}}\big(\nabla F(X)\big), \qquad (3)$$

where $P_{T_X\mathcal{M}}$ denotes the operator of orthogonal projection onto the tangent plane $T_X\mathcal{M}$ and depends on $X$ nonlinearly. Given the Riemannian gradient notion, we may solve (2) using the Riemannian gradient descent

$$X_{k+1} = R\big(X_k - \alpha_k\,\mathrm{grad}\,F(X_k)\big),$$

where the retraction $R$ returns a tangent vector back to the manifold (see [19] for different retraction operations) and the step size $\alpha_k$ is chosen to ensure decay of the functional. More advanced optimization algorithms, e.g., a Riemannian version of the conjugate gradient method, are also available [2].
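In code, the Riemannian gradient descent above is just a few lines. The sketch below is generic and assumes two hypothetical callables, riemannian_grad (implementing (3)) and retract (a retraction such as rank truncation); it is not the API of any particular library.

```python
def riemannian_gradient_descent(f, x0, riemannian_grad, retract,
                                step=1e-2, max_iters=100, tol=1e-8):
    """Minimize f over a manifold, given hypothetical riemannian_grad / retract helpers."""
    x = x0
    for _ in range(max_iters):
        xi = riemannian_grad(f, x)        # tangent vector, Eq. (3)
        x_new = retract(x, -step * xi)    # map the step back onto the manifold
        if abs(f(x_new) - f(x)) < tol:    # simple decrease-based stopping rule
            return x_new
        x = x_new
    return x
```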

One can also utilize second-order methods, which involve computation of the Riemannian Hessian operator. For the Riemannian Hessian we can use the formula [3, 4]

$$\mathrm{Hess}\,F(X)[Z] = P_{T_X\mathcal{M}}\big(\nabla^2 F(X)[Z]\big) + \frac{\partial P_{T_X\mathcal{M}}}{\partial X}[Z]\,\big(\nabla F(X)\big), \qquad Z \in T_X\mathcal{M}, \qquad (4)$$

where $\nabla^2 F(X)$ is the Euclidean Hessian and $\frac{\partial P_{T_X\mathcal{M}}}{\partial X}[Z]$ denotes the Fréchet derivative of the projector with respect to $X$ along $Z$. (Note that these operations depend on the particular choice of the manifold; nevertheless, we do not use a subscript, as the manifold will be clear from context and to not overcomplicate the notation.) The second term in (4) arises due to the nonlinearity of the manifold. For the manifold of low-rank matrices, it contains the inverse of the matrix of singular values [3]. If singular values are small, this can lead to numerical instabilities. To avoid this problem, the second term in (4) can be omitted [4, 20]. In this case, the optimization procedure can be interpreted as a constrained Gauss-Newton method. We, therefore, consider only linearized Hessians and are interested in an efficient matrix-vector product with the first term of (4):

$$Z \ \mapsto\ P_{T_X\mathcal{M}}\big(\nabla^2 F(X)[Z]\big), \qquad Z \in T_X\mathcal{M}. \qquad (5)$$

Note that first computing the Euclidean gradient $\nabla F(X)$ and then applying the projection as in (3) can be inefficient. Therefore, $P_{T_X\mathcal{M}}\big(\nabla F(X)\big)$ should be calculated at once. For example, for the manifold of low-rank matrices, the Riemannian gradient can always be represented as a low-rank matrix (see Sec. 4.1 for details), while the Euclidean gradient can have arbitrarily large rank. Thus, using the Euclidean gradient in the intermediate calculations can lead to an inefficient algorithm. Similarly, first computing the Euclidean Hessian-by-vector product $\nabla^2 F(X)[Z]$ and then applying the projection as in (5) can be significantly less efficient than calculating $P_{T_X\mathcal{M}}\big(\nabla^2 F(X)[Z]\big)$ at once. The goal of this paper is, thus, to develop an efficient tool to calculate (3) and (5), the building block operations of Riemannian optimization. The key assumption we make is that we can efficiently evaluate $F$ at any point of the manifold or of its tangent planes. Then, under mild conditions (see Propositions 5.2 and 6 and below), the overall complexity of the presented algorithm is only a constant factor larger than the complexity of the function evaluation.

Let us introduce the scalar product $\langle X, Y\rangle$ of two matrices (tensors), defined as the Euclidean inner product of their vectorizations, and the associated norm $\|X\| = \sqrt{\langle X, X\rangle}$.

Using this notation, possible choices of $F$ are, for example:

  • $F(X) = \tfrac12\langle \mathcal{A}X, X\rangle - \langle B, X\rangle$ (for a symmetric positive definite operator $\mathcal{A}$) or $F(X) = \|\mathcal{A}X - B\|^2$ for given $\mathcal{A}$ and $B$, which arise when solving linear systems $\mathcal{A}X = B$;

  • the Rayleigh quotient $F(X) = \langle \mathcal{A}(X)\,X, X\rangle / \langle X, X\rangle$ with a possibly nonlinear operator $\mathcal{A}(\cdot)$, which arises when solving (nonlinear) eigenvalue problems;

  • $F(X) = \tfrac12\big\|P_\Omega(X - A)\big\|^2$, where $P_\Omega$ denotes the projection onto the index set $\Omega$, i.e., $P_\Omega(X)$ keeps the entries of $X$ with indices in $\Omega$ and zeros out the rest. This type of problem is referred to as a matrix or tensor completion problem (a factored evaluation of this loss is sketched after this list);

  • $F(X)$ is a neural network loss function, which arises when using the TT-decomposition to parametrize a recurrent neural network and applying Riemannian optimization for training (for more details see Section 7.1).
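To illustrate the key assumption that $F$ can be evaluated cheaply from a low-rank representation, the sketch below (our example, not code from the paper) evaluates the completion loss $\tfrac12\|P_\Omega(X - A)\|^2$ for a matrix $X = UV^\top$ directly from its factors, touching only the observed entries.

```python
import numpy as np

def completion_loss_from_factors(U, V, idx_rows, idx_cols, observed_vals):
    """0.5 * || P_Omega(U V^T - A) ||^2 evaluated without forming U V^T.

    Only the observed entries (idx_rows[k], idx_cols[k]) with values
    observed_vals[k] are touched, so the cost is O(|Omega| * r) instead of
    O(n * m * r).
    """
    # entries of U V^T at the observed positions
    predictions = np.einsum('kr,kr->k', U[idx_rows], V[idx_cols])
    return 0.5 * np.sum((predictions - observed_vals) ** 2)
```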

In Section 5.3, we will also discuss how our approach can be used for operations that are not directly related to the minimization of a function, e.g., how to efficiently compute a preconditioned residual when the operator and the preconditioner do not commute.

4 Automatic differentiation for the Riemannian gradient: fixed-rank matrices

In this section, we propose an approach to automatically compute Riemannian gradients for the manifold of fixed-rank matrices.

4.1 The manifold of fixed-rank matrices

Let us briefly recall the concepts related to the manifold of fixed-rank matrices. The set of $n\times m$ matrices of fixed rank $r$,

$$\mathcal{M}_r = \{X \in \mathbb{R}^{n\times m} : \ \mathrm{rank}(X) = r\},$$

forms a smooth submanifold of $\mathbb{R}^{n\times m}$ [21, Example 8.14]. Using the SVD, any point of the manifold can be represented as $X = U\Sigma V^\top$, where $U \in \mathbb{R}^{n\times r}$ and $V \in \mathbb{R}^{m\times r}$ are matrices with orthonormal columns (the singular vectors, $U^\top U = V^\top V = I_r$) and $\Sigma$ is the $r\times r$ diagonal matrix of singular values. The tangent space of the manifold at a point $X$ can be written as

$$T_X\mathcal{M}_r = \big\{\,\delta U\, V^\top + U\, \delta V^\top :\ \delta U \in \mathbb{R}^{n\times r},\ \delta V \in \mathbb{R}^{m\times r},\ U^\top \delta U = 0_{r\times r}\,\big\}, \qquad (6)$$

where $0_{r\times r}$ denotes a zero matrix of size $r\times r$. In what follows, we refer to the matrices $\delta U$ and $\delta V$ that define an element of the tangent space as delta-matrices. The orthogonal projection of a matrix $Z \in \mathbb{R}^{n\times m}$ onto the tangent space can, thus, be obtained as follows:

$$P_{T_X\mathcal{M}_r}(Z) = \delta U\, V^\top + U\, \delta V^\top \quad \text{with} \quad \delta U = (I - UU^\top)\, Z\, V, \quad \delta V = Z^\top U. \qquad (7)$$

We refer the reader to, e.g., [22, Sec. 2.1] for a more detailed discussion of the manifold of low-rank matrices, including the derivation of (7).

Finally, to simplify the notation, we denote the projection operator as $P_X := P_{T_X\mathcal{M}_r}$.

We also introduce a mapping $\hat{x}$ that maps parametrization matrices to an element of the tangent plane at the point $X$,

$$\hat{x}: \ \mathbb{R}^{n\times r} \times \mathbb{R}^{m\times r} \to T_X\mathcal{M}_r, \qquad (8)$$

namely,

$$\hat{x}(\delta U, \delta V) = \delta U\, V^\top + U\, \delta V^\top. \qquad (9)$$

This mapping will be used later in Sec. 4.2 to simplify the presentation of the algorithm.
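As a sanity check of the formulas above, the snippet below (ours; it simply instantiates (6) and (7) in NumPy) verifies numerically that the delta-matrices of the projection satisfy the gauge condition and that projecting twice changes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 40, 5

# orthonormal factors of a rank-r point X = U @ Sigma @ V.T (Sigma plays no
# role here: the tangent space depends only on the column spaces of U and V)
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))

def project(Z, U, V):
    """Delta-matrices of P_{T_X M}(Z) for the parametrization dU V^T + U dV^T."""
    dV = Z.T @ U
    dU = Z @ V - U @ (U.T @ (Z @ V))   # gauge condition U^T dU = 0
    return dU, dV

Z = rng.standard_normal((n, m))
dU, dV = project(Z, U, V)
PZ = dU @ V.T + U @ dV.T

print(np.linalg.norm(U.T @ dU))     # ~0: the gauge condition holds
dU2, dV2 = project(PZ, U, V)
print(np.linalg.norm(dU2 - dU), np.linalg.norm(dV2 - dV))  # ~0: projection is idempotent
```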

4.2 Automatic differentiation approach

In this section, we propose an efficient way of computing the Riemannian gradient

$$\mathrm{grad}\,F(X) = P_X\big(\nabla F(X)\big).$$

The Riemannian gradient is an $n\times m$ matrix, but as noted in the previous section it can be defined via the delta-matrices $\delta U$ and $\delta V$ using just $(n+m)r$ parameters ($(n+m)r - r^2$ if the gauge condition is taken into account). Thus, if we can avoid using full $n\times m$ matrices in intermediate calculations, we can potentially compute the Riemannian gradient with a better asymptotic complexity than $\mathcal{O}(nm)$.

A naive approach to computing the Riemannian gradient is to first compute the Euclidean gradient $\nabla F(X)$ with AD and then project the result onto the tangent plane by using formula (7):

$$\mathrm{grad}\,F(X) = P_X\big(\nabla F(X)\big) = \delta U\, V^\top + U\, \delta V^\top, \quad \delta U = (I - UU^\top)\,\nabla F(X)\, V, \quad \delta V = \nabla F(X)^\top U. \qquad (10)$$

The problem with this approach is that it requires finding the full matrix of the Euclidean gradient of size $n\times m$, which we want to avoid. Alternatively, we may find the Riemannian gradient without explicitly forming $\nabla F(X)$. In particular, we notice that the Riemannian gradient (10) involves computing the following products of matrices:

$$\nabla F(X)\, V \quad \text{and} \quad \nabla F(X)^\top U. \qquad (11)$$

We may find these two quantities by using classical AD as follows: the first is the derivative of $F(X + A\, V^\top)$ with respect to $A$ at $A = 0$, and the second is the derivative of $F(X + U\, B^\top)$ with respect to $B$ at $B = 0$. So, one can use classical AD on the function twice (each time with complexity equal to evaluating the function at a single point, due to AD properties) to compute all the pieces that depend on $\nabla F(X)$.

However, in the rest of this section we propose an alternative way of computing quantities (11) by using classic AD a single time on a specially introduced auxiliary function. This alternative approach is introduced because it naturally generalizes into an efficient algorithm for the tensor case (see Sec. 5.2).

The quantities (11) can be computed at once by differentiating (using AD) the following auxiliary function, defined using the mapping (9):

$$g(\delta U, \delta V) = F\big(X + \hat{x}(\delta U, \delta V)\big) = F\big(X + \delta U\, V^\top + U\, \delta V^\top\big).$$

We have

$$\frac{\partial g}{\partial \delta U}\bigg|_{(0,0)} = \nabla F(X)\, V, \qquad \frac{\partial g}{\partial \delta V}\bigg|_{(0,0)} = \nabla F(X)^\top U. \qquad (12)$$

Indeed, the argument of $F$ can be represented elementwise as a sum of the entries of $X$, $\delta U\, V^\top$ and $U\, \delta V^\top$, and, hence, its partial derivatives with respect to the entries of $\delta U$ and $\delta V$ are expressed through Kronecker deltas; applying the chain rule then yields (12). Thus, a low-rank representation of the Riemannian gradient can be written as

$$\mathrm{grad}\,F(X) = \delta U\, V^\top + U\, \delta V^\top$$

with

$$\delta U = (I - UU^\top)\,\frac{\partial g}{\partial \delta U}\bigg|_{(0,0)}, \qquad \delta V = \frac{\partial g}{\partial \delta V}\bigg|_{(0,0)}. \qquad (13)$$

The algorithm to compute the Riemannian gradient is summarized in Algorithm 1.

1: Input: the SVD factors $U$, $\Sigma$, $V$ of $X$; an implementation of $F$ that evaluates the function at any matrix given in low-rank factorized form.
2: Output: $\delta U$, $\delta V$ such that $\mathrm{grad}\,F(X) = \delta U\, V^\top + U\, \delta V^\top$.
3: function g($\delta U$, $\delta V$)
4:     return $F\big(U\Sigma V^\top + \delta U\, V^\top + U\, \delta V^\top\big)$, evaluated from a rank-$2r$ factorization of the argument
5: Using AD, compute $G_U = \partial g / \partial \delta U$ at $(0, 0)$
6: Using AD, compute $G_V = \partial g / \partial \delta V$ at $(0, 0)$
7: $\delta U \leftarrow (I - UU^\top)\, G_U$;  $\delta V \leftarrow G_V$
Algorithm 1 Computing the Riemannian gradient for low-rank matrices via AD.
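A compact TensorFlow sketch of Algorithm 1 is given below. It is ours, not the T3F implementation: it assumes the objective is supplied as f_factored(A, B) = F(A B^T), evaluated directly from two factors so that the full n x m matrix is never formed, and the rank-2r factorization fed to it follows the construction used in the proof of Prop. 4.3 (the exact factor layout may differ from the one in the paper).

```python
import tensorflow as tf

def riemannian_grad_lowrank(f_factored, U, S, V):
    """Delta-matrices (dU, dV) of the Riemannian gradient of F at X = U diag(S) V^T.

    f_factored(A, B) is assumed to evaluate F(A @ B^T) directly from the factors.
    Returns dU, dV such that grad F(X) = dU @ V^T + U @ dV^T with U^T dU = 0.
    """
    n, r = U.shape
    m = V.shape[0]
    dU0 = tf.zeros((n, r), dtype=U.dtype)
    dV0 = tf.zeros((m, r), dtype=V.dtype)
    with tf.GradientTape() as tape:
        tape.watch([dU0, dV0])
        # rank-2r factorization of  X + dU0 @ V^T + U @ dV0^T :
        #   A = [U, dU0],  B = [V diag(S) + dV0, V]
        A = tf.concat([U, dU0], axis=1)
        B = tf.concat([V @ tf.linalg.diag(S) + dV0, V], axis=1)
        value = f_factored(A, B)
    gU, gV = tape.gradient(value, [dU0, dV0])  # = grad F(X) @ V  and  grad F(X)^T @ U
    dU = gU - U @ (tf.transpose(U) @ gU)       # enforce the gauge condition, cf. (13)
    dV = gV
    return dU, dV
```

For the complexity estimates of the next subsection to apply, f_factored must itself avoid dense n x m intermediates, as in the completion example of Sec. 3.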

4.3 Complexity of the approach

Let us estimate the complexity of computing the delta-matrices $\delta U$ and $\delta V$ by the proposed approach, i.e., by defining the auxiliary function $g$ and differentiating it with respect to $\delta U$ and $\delta V$.

Proposition

Let $F$ be a smooth function defined by a program that takes as input the SVD factors of a matrix $X$ of rank $r$ and outputs the value $F(X)$ in a number of floating point operations (FLOP) that is polynomial with respect to the rank of the matrix $X$ (i.e., the evaluation cost is polynomial in $r$). Then, the complexity of using Alg. 1 for computing the delta terms $\delta U$ and $\delta V$, which define the Riemannian gradient, is of the same order as one evaluation of the program plus $\mathcal{O}(nr^2)$ additional operations.

As an example, computing the squared Frobenius norm $F(X) = \|X\|_F^2$ from the SVD factors leads to $\mathcal{O}(r)$ FLOP, since $\|X\|_F^2 = \sum_{i=1}^{r} \Sigma_{ii}^2$.

Proof (Proof of Prop. 4.3)

The auxiliary function $g$ can be constructed by feeding to the program the factors of the matrix $U\Sigma V^\top + \delta U\, V^\top + U\, \delta V^\top$, which can be represented with rank $2r$, twice larger than the rank of the original matrix. As a result, the asymptotic complexity (as a function of $n$ and $m$) of evaluating the function on such a matrix is the same as for the original rank-$r$ matrix. Thanks to the properties of AD, computing the derivatives of $g$ with respect to the factors $\delta U$ and $\delta V$ has the same complexity. Finally, computing the factor $\delta U$ using (13) can be done in $\mathcal{O}(nr^2)$ operations, yielding the total complexity stated in the proposition.

For most functions used in practice, the asymptotic complexity of evaluating the function at one point exceeds $\mathcal{O}(nr^2)$, and the total complexity of the proposed algorithm (as a function of $n$ and $m$) equals that of a single function evaluation.

4.4 More general view of the proposed algorithm

In this section, we look at the proposed algorithm from a more general perspective, trying to avoid the specifics of the fixed-rank manifold. The main idea of the proposed algorithm is to introduce the auxiliary function $g$ and express the desired Riemannian gradient in terms of its derivatives (12). Note that we could have used an alternative auxiliary function; one can use the same derivation as in Sec. 4.2 to prove that differentiating the alternative auxiliary function yields the same results.

If one combines both arguments of the mapping $\hat{x}$ (see (8)) into a single $(n+m)r$-dimensional vector $z$, one can define an equivalent mapping $\hat{x}(z)$ from this vector onto the tangent plane and an alternative representation of the auxiliary function

$$\tilde g(z) = F\big(X + \hat{x}(z)\big). \qquad (14)$$

Thus, the proposed approach is equivalent to defining a mapping from the parametrization of the tangent space onto the tangent space itself, defining an auxiliary function (14), computing its gradient using classical AD, and finally doing certain post-processing of this gradient: reshaping and enforcing the gauge conditions as in (13). It is not surprising that we obtain the Riemannian gradient using these formulas: informally, the gradient of the auxiliary function (14) is the fastest-ascent direction of $\tilde g$ at $z = 0$ and, hence, of $F$ at $X$ among all possible directions from the tangent plane $T_X\mathcal{M}_r$.

5 Automatic differentiation for the Riemannian gradient: fixed-rank tensors

In this section, we extend the results of Sec. 4.2 to fixed-rank tensor-train tensors, which generalize fixed-rank matrices to multidimensional arrays.

5.1 The manifold of TT-tensors of fixed rank

A tensor $X \in \mathbb{R}^{n_1\times n_2\times\cdots\times n_d}$ is said to be represented in the tensor-train format [23] (TT-format) if each of its elements is a product of matrices:

$$X(i_1, i_2, \ldots, i_d) = G_1(i_1)\, G_2(i_2)\cdots G_d(i_d), \qquad (15)$$

where for fixed $i_k$, $G_k(i_k)$ is an $r_{k-1}\times r_k$ matrix for any value of $i_k = 1, \ldots, n_k$. We require $r_0 = r_d = 1$ such that $G_1(i_1)$ is a row vector and $G_d(i_d)$ is a column vector. The three-dimensional arrays $G_k$ of sizes $r_{k-1}\times n_k\times r_k$, $k = 1, \ldots, d$, are called TT-cores, and the vector

$$\mathbf{r} = (r_0, r_1, \ldots, r_d)$$

is called the TT-rank of $X$. For a more detailed discussion on the properties of the TT-format see [23].
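The element-wise formula (15) is a product of thin matrices, which the short snippet below makes explicit (it is an illustration of ours, not T3F code).

```python
import numpy as np

def tt_element(cores, index):
    """Evaluate X(i_1, ..., i_d) from TT-cores of shapes (r_{k-1}, n_k, r_k)."""
    # start with a 1 x 1 "row vector" and multiply the slice of each core
    result = np.ones((1, 1))
    for core, i in zip(cores, index):
        result = result @ core[:, i, :]   # (r_{k-1} x r_k) matrix for fixed i_k
    return result.item()                  # r_0 = r_d = 1, so this is a scalar

# a random 4-dimensional TT-tensor with mode sizes 3 and TT-rank (1, 2, 2, 2, 1)
rng = np.random.default_rng(0)
ranks, n = [1, 2, 2, 2, 1], 3
cores = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(4)]
print(tt_element(cores, (0, 2, 1, 0)))
```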

Like in the matrix case (Sec. 4), the set of tensors of fixed TT-rank $\mathbf r$,

$$\mathcal{M}_{\mathbf r} = \{X \in \mathbb{R}^{n_1\times\cdots\times n_d} : \ \text{TT-rank}(X) = \mathbf r\},$$

forms a smooth manifold. To parametrize its tangent spaces, we need the notion of orthogonalization of the TT-cores. A TT-representation (15) is called $k$-orthogonal if

$$\sum_{i_j=1}^{n_j} G_j(i_j)^\top G_j(i_j) = I_{r_j} \qquad (16)$$

for $j = 1, \ldots, k-1$, and

$$\sum_{i_j=1}^{n_j} G_j(i_j)\, G_j(i_j)^\top = I_{r_{j-1}} \qquad (17)$$

for $j = k+1, \ldots, d$. If $k = 1$ or $k = d$, we only require (17) or (16), respectively. The cores satisfying (16) and (17) are called, respectively, left- and right-orthogonal cores. The TT-decomposition of a tensor is not unique, and for any $k$ there exists a $k$-orthogonal representation of a given tensor [24, Sec. 4.2.1]. Moreover, for any two values of $k$, the corresponding $k$-orthogonal decompositions can be constructed to share the left-orthogonal TT-cores satisfying (16) and the right-orthogonal TT-cores satisfying (17).

For a given tensor $X$, one can define a set of left-orthogonal TT-cores $U_k$, right-orthogonal TT-cores $V_k$, and unrestricted TT-cores $S_k$ such that for any $k$, there exists the following $k$-orthogonal decomposition of the tensor:

$$X(i_1, \ldots, i_d) = U_1(i_1)\cdots U_{k-1}(i_{k-1})\, S_k(i_k)\, V_{k+1}(i_{k+1})\cdots V_d(i_d). \qquad (18)$$

Using the left-orthogonal TT-cores $U_k$ and the right-orthogonal TT-cores $V_k$ of the tensor $X$, one may parametrize the tangent space $T_X\mathcal{M}_{\mathbf r}$ as follows:

$$T_X\mathcal{M}_{\mathbf r} = \Big\{\, \sum_{k=1}^{d} \xi_k : \ \xi_k(i_1,\ldots,i_d) = U_1(i_1)\cdots U_{k-1}(i_{k-1})\, \delta X_k(i_k)\, V_{k+1}(i_{k+1})\cdots V_d(i_d) \,\Big\}. \qquad (19)$$

In what follows, we refer to the tensors $\delta X_1, \ldots, \delta X_d$ that define an element of the tangent space as delta-terms.

Additional gauge conditions are usually introduced to uniquely parametrize elements of the tangent space (these gauge conditions generalize the orthogonality constraint in the definition of the matrix tangent space (6) to the tensor case):

$$\sum_{i_k=1}^{n_k} U_k(i_k)^\top\, \delta X_k(i_k) = 0, \qquad k = 1, \ldots, d-1. \qquad (20)$$

In what follows, we always assume that the deltas that define an element of the tangent space obey the gauge conditions (20).

Note that in (19), an element of the tangent space is formally represented as a sum of $d$ TT-tensors, each of TT-rank $\mathbf r$, and, hence, can be represented as a single TT-tensor with TT-rank $d\,\mathbf r$ [23, Sec. 4.1]. Nevertheless, thanks to the common cores, it can be represented with TT-rank equal to $2\mathbf r$. Indeed, by directly multiplying the block matrices, one can verify that

$$\sum_{k=1}^{d} \xi_k(i_1,\ldots,i_d) = \begin{bmatrix} \delta X_1(i_1) & U_1(i_1) \end{bmatrix} \begin{bmatrix} V_2(i_2) & 0 \\ \delta X_2(i_2) & U_2(i_2) \end{bmatrix} \cdots \begin{bmatrix} V_{d-1}(i_{d-1}) & 0 \\ \delta X_{d-1}(i_{d-1}) & U_{d-1}(i_{d-1}) \end{bmatrix} \begin{bmatrix} V_d(i_d) \\ \delta X_d(i_d) \end{bmatrix}. \qquad (21)$$
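The block identity (21) is easy to verify numerically. The snippet below (ours) builds arbitrary cores $U_k$, $V_k$, $\delta X_k$, forms the sum of TT-tensors from (19) and the single TT-tensor with the doubled-rank block cores, and checks that they coincide (orthogonality and gauge conditions are not needed for the identity itself).

```python
import numpy as np
from functools import reduce

def tt_to_full(cores):
    """Contract TT-cores of shapes (r_{k-1}, n_k, r_k) into the full array."""
    full = reduce(lambda a, b: np.tensordot(a, b, axes=[[-1], [0]]), cores)
    return full.squeeze(axis=(0, -1))

rng = np.random.default_rng(0)
d, n, r = 4, 3, 2
ranks = [1] + [r] * (d - 1) + [1]
U  = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]
V  = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]
dX = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]

# left-hand side of (21): sum over k of  U_1 ... U_{k-1}  dX_k  V_{k+1} ... V_d
lhs = sum(tt_to_full(U[:k] + [dX[k]] + V[k + 1:]) for k in range(d))

# right-hand side: one TT-tensor whose block cores have doubled TT-rank
blocks = [np.concatenate([dX[0], U[0]], axis=2)]               # first core: [dX1, U1]
for k in range(1, d - 1):                                      # middle cores:
    top = np.concatenate([V[k], np.zeros_like(V[k])], axis=2)  #   [[Vk,  0 ],
    bot = np.concatenate([dX[k], U[k]], axis=2)                #    [dXk, Uk]]
    blocks.append(np.concatenate([top, bot], axis=0))
blocks.append(np.concatenate([V[-1], dX[-1]], axis=0))         # last core: [[Vd], [dXd]]
rhs = tt_to_full(blocks)

print(np.max(np.abs(lhs - rhs)))   # ~ machine precision
```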

For convenience, we also introduce a function $\hat{x}$ that maps the delta-terms $(\delta X_1, \ldots, \delta X_d)$ to an element of the tangent space, namely,

$$\hat{x}(\delta X_1, \ldots, \delta X_d) = \sum_{k=1}^{d} \xi_k, \qquad (22)$$

with $\xi_k$ as defined in (19) and the TT-representation (21). The following proposition gives the explicit representation of a general tensor projected onto the tangent plane of $\mathcal{M}_{\mathbf r}$.

Proposition

[24, equation (4.17)] The orthogonal projection $P_{T_X\mathcal{M}_{\mathbf r}}(Z)$ of a given tensor $Z$ onto the tangent space $T_X\mathcal{M}_{\mathbf r}$ is the element of the tangent space (19) whose delta-terms are given as follows: for $k = 1, \ldots, d-1$, the term $\delta X_k$ is the contraction of $Z$ with the left-orthogonal cores $U_1, \ldots, U_{k-1}$ and the right-orthogonal cores $V_{k+1}, \ldots, V_d$, with the component violating the gauge condition (20) projected out; and $\delta X_d$ is the contraction of $Z$ with the left-orthogonal cores $U_1, \ldots, U_{d-1}$ (23).

For a more detailed discussion of the manifold of fixed tensor-train rank tensors (including derivations of the above equations) see, e.g., Sec. 4.3-4.4 of [24].

5.2 Automatic differentiation

Let us find the Riemannian gradient of a function $F$ at a point $X$. Similarly to the matrix case, we consider an auxiliary function defined using (22):

$$g(\delta X_1, \ldots, \delta X_d) = F\big(\hat{x}(\delta X_1, \ldots, \delta X_d)\big).$$

Note that the intuitive explanation of the proposed method provided in Sec. 4.4 still applies in this case.

In particular, we have

$$X = \hat{x}(0, \ldots, 0, S_d),$$

where the first $d-1$ arguments are zero tensors of appropriate sizes and $S_d$ is defined in (18) for $k = d$. As a result, $g(0, \ldots, 0, S_d) = F(X)$.

Consider the derivative of $g$ with respect to $\delta X_k$ at the point $(0, \ldots, 0, S_d)$:

$$G_k = \frac{\partial g}{\partial \delta X_k}\bigg|_{(0, \ldots, 0, S_d)}, \qquad k = 1, \ldots, d. \qquad (24)$$

By comparing expressions (23) and (24), it is easy to see that the delta-terms that define the Riemannian gradient can be computed as

$$\delta X_k^{L} = G_k^{L} - U_k^{L}\big((U_k^{L})^\top G_k^{L}\big), \quad k = 1, \ldots, d-1, \qquad \delta X_d = G_d, \qquad (25)$$

where the superscript $L$ denotes the left unfolding of a core into an $r_{k-1} n_k \times r_k$ matrix.
1: Input: TT-tensor $X$ defined by its TT-cores; delta-terms $\delta X_1, \ldots, \delta X_d$ that define the tangent-space element (see (21))
2: Output: $W_1, \ldots, W_d$, the TT-cores of $\hat{x}(\delta X_1, \ldots, \delta X_d)$
3: Compute resp. left- and right-orthogonal cores $U_k$ and $V_k$ of $X$ as in (18)
4: $W_1 \leftarrow \begin{bmatrix} \delta X_1 & U_1 \end{bmatrix}$
5: for $k = 2$ to $d-1$ do
6:      $W_k \leftarrow \begin{bmatrix} V_k & 0 \\ \delta X_k & U_k \end{bmatrix}$
7: $W_d \leftarrow \begin{bmatrix} V_d \\ \delta X_d \end{bmatrix}$
Algorithm 2 Converting delta notation to TT-cores (implementation of (22)).
1: Input: $G_1, \ldots, G_d$, the TT-cores of $X$; $F$, a Python implementation of the function for a point given by its TT-cores.
2: Output: The TT-cores of the Riemannian gradient $P_{T_X\mathcal{M}_{\mathbf r}}\big(\nabla F(X)\big)$.
3: For $k = 1, \ldots, d$, compute resp. left- and right-orthogonal cores $U_k$, $V_k$ and the core $S_d$ as in (18).
4: function g($\delta X_1, \ldots, \delta X_d$)
5:      Run Alg. 2 passing $\delta X_1, \ldots, \delta X_d$ as input, and write the output into $W_1, \ldots, W_d$
6:      return $F(W_1, \ldots, W_d)$
7: Using AD, compute $G_k = \partial g / \partial \delta X_k$ at the point $(0, \ldots, 0, S_d)$ for $k = 1, \ldots, d$
8: for $k = 1$ to $d - 1$ do
9:      Reshape $G_k$ and $U_k$ into their left unfoldings $G_k^L$ and $U_k^L$
10:     $\delta X_k^L \leftarrow G_k^L - U_k^L\big((U_k^L)^\top G_k^L\big)$   (see (25); parentheses indicate the order of operations)
11: $\delta X_d \leftarrow G_d$
12: Run Alg. 2 passing as input $X$ and $\delta X_1, \ldots, \delta X_d$ to obtain the TT-cores of the Riemannian gradient
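The same trick carries over to the TT case almost verbatim; the sketch below (ours, not T3F code) mirrors the procedure above in TensorFlow. The helpers k_orthogonalize (producing the cores of (18)) and deltas_to_tt_cores (Algorithm 2) are hypothetical placeholders rather than functions of any existing library; only the AD step and the gauge post-processing (25) are spelled out.

```python
import tensorflow as tf

def left_unfold(core):
    """Reshape a TT-core of shape (r_prev, n, r_next) into an (r_prev * n, r_next) matrix."""
    return tf.reshape(core, (-1, core.shape[-1]))

def riemannian_grad_tt(f_on_cores, cores):
    """TT-cores of the Riemannian gradient of F at the TT-tensor given by `cores`.

    Assumed (hypothetical) helpers:
      k_orthogonalize(cores) -> (U, V, S_d): left-orthogonal cores U = [U_1, ..., U_{d-1}],
          right-orthogonal cores V = [V_2, ..., V_d] and the core S_d, as in (18);
      deltas_to_tt_cores(U, V, deltas) -> block TT-cores, i.e., Algorithm 2 / Eq. (21).
    f_on_cores(list_of_cores) evaluates F directly from TT-cores.
    """
    d = len(cores)
    U, V, S_d = k_orthogonalize(cores)                    # hypothetical helper
    deltas0 = [tf.zeros_like(u) for u in U] + [S_d]       # the point (0, ..., 0, S_d)
    with tf.GradientTape() as tape:
        tape.watch(deltas0)
        # the auxiliary function g, evaluated through Algorithm 2
        value = f_on_cores(deltas_to_tt_cores(U, V, deltas0))
    grads = tape.gradient(value, deltas0)                 # the derivatives (24)
    deltas = []
    for k in range(d - 1):                                # enforce gauge conditions, Eq. (25)
        G, Uk = left_unfold(grads[k]), left_unfold(U[k])
        D = G - Uk @ (tf.transpose(Uk) @ G)
        deltas.append(tf.reshape(D, grads[k].shape))
    deltas.append(grads[-1])
    return deltas_to_tt_cores(U, V, deltas)               # TT-cores of the gradient
```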