Tensor completion using geodesics on Segre manifolds

08/02/2021
by   Lars Swijsen, et al.

We propose a Riemannian conjugate gradient (CG) optimization method for finding low rank approximations of incomplete tensors. Our main contribution consists of an explicit expression of the geodesics on the Segre manifold. These are exploited in our algorithm to perform the retractions. We apply our method to movie rating predictions in a recommender system for the MovieLens dataset, and identification of pure fluorophores via fluorescent spectroscopy with missing data. In this last application, we recover the tensor decomposition from less than 10% of the data.

1. Introduction

Data is often stored in a $d$-dimensional array $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, called a tensor. Let $a_1, \ldots, a_d$ be vectors with $a_k \in \mathbb{R}^{n_k}$. We can turn these vectors into a tensor by applying the outer product and defining

$$(a_1 \otimes a_2 \otimes \cdots \otimes a_d)_{i_1 i_2 \cdots i_d} = (a_1)_{i_1} (a_2)_{i_2} \cdots (a_d)_{i_d}.$$

We call this tensor a rank-$1$ tensor and denote it by $a_1 \otimes a_2 \otimes \cdots \otimes a_d$.

Because of the high dimensionality of a tensor, one is often interested in storing the data in a concise way. A classic way to achieve this is using a canonical polyadic decomposition (CPD). Given a tensor $\mathcal{A}$ it is possible to write it as a finite sum of rank-$1$ tensors

$$\mathcal{A} = \sum_{i=1}^{r} a_1^{(i)} \otimes a_2^{(i)} \otimes \cdots \otimes a_d^{(i)},$$

with $a_k^{(i)} \in \mathbb{R}^{n_k}$. The minimal number $r$ of rank-$1$ tensors needed to build the tensor is called the rank of $\mathcal{A}$.
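To make the notation concrete, the following NumPy sketch builds a rank-$1$ tensor as an outer product and a CPD as a sum of $r$ such terms; the sizes and random factor vectors are placeholders.

import numpy as np

# Rank-1 tensor of size 3 x 4 x 5 as an outer product of three vectors.
a1, a2, a3 = np.random.randn(3), np.random.randn(4), np.random.randn(5)
rank1 = np.einsum('i,j,k->ijk', a1, a2, a3)

# A CP decomposition: a sum of r rank-1 terms of the same size.
r = 2
factors = [(np.random.randn(3), np.random.randn(4), np.random.randn(5)) for _ in range(r)]
cpd = sum(np.einsum('i,j,k->ijk', u, v, w) for (u, v, w) in factors)
print(rank1.shape, cpd.shape)  # (3, 4, 5) (3, 4, 5)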

The CPD has applications in different fields, such as signal processing, chemometrics and linguistics [kolda_tensor_2009, bro_parafac_1997]. It turns out that because of extra underlying structure in these tensors, the rank-$1$ tensors that are obtained contain valuable information. Using the fact that for low ranks the rank-$1$ terms are essentially unique [kruskal_three-way_1977, sidiropoulos_uniqueness_2000, chiantini_algorithm_2014], a property that is called $r$-identifiability, one can recover this information from the CPD.

In this paper we present a Riemannian optimization algorithm that seeks a best rank-$r$ approximation of a given tensor $\mathcal{A}$. Concretely, we aim to solve the following optimization problem:

$$\min_{\operatorname{rank}(\mathcal{B}) \leq r} \|\mathcal{A} - \mathcal{B}\|_F,$$

where $\|\mathcal{C}\|_F = \sqrt{\sum_{i_1, \ldots, i_d} \mathcal{C}_{i_1 \cdots i_d}^2}$ is called the Frobenius norm of $\mathcal{C}$. (We assume a solution for this optimization problem exists. However, this is not always the case, as is explained in [de_silva_tensor_2008].) Therefore, we optimize over $r$ copies of the space of all rank-$1$ tensors. It turns out that the space of all tensors of rank $1$ has an interesting structure. It is a space with curvature, and more specifically it is a Riemannian manifold called the Segre manifold, which we denote by $\mathcal{S}$. In this paper we compute the geodesics of this Riemannian manifold and obtain the following result:

Theorem 1.1.

The (unit-speed) geodesic of $\mathcal{S}$ through the point

$$p = (\lambda, x_1, \ldots, x_d) \in \mathbb{R}_{>0} \times S^{n_1 - 1} \times \cdots \times S^{n_d - 1}$$

in the direction

$$v = (\dot\lambda, \dot x_1, \ldots, \dot x_d), \qquad \dot\lambda^2 + \lambda^2 \sum_{k=1}^{d} \|\dot x_k\|^2 = 1,$$

is given by

$$\gamma(t) = \left( \sqrt{t^2 + 2\lambda\dot\lambda\, t + \lambda^2},\;\; x_1 \cos\big(a_1 \varphi(t)\big) + \frac{\dot x_1}{\|\dot x_1\|} \sin\big(a_1 \varphi(t)\big),\;\; \ldots,\;\; x_d \cos\big(a_d \varphi(t)\big) + \frac{\dot x_d}{\|\dot x_d\|} \sin\big(a_d \varphi(t)\big) \right),$$

where $P = \sqrt{\|\dot x_1\|^2 + \cdots + \|\dot x_d\|^2}$, $a_k = \|\dot x_k\| / P$, and $\varphi(t) = \arctan\left( \dfrac{t \lambda P}{\lambda + t \dot\lambda} \right)$. (Factors with $\dot x_k = 0$ are understood to remain at the constant point $x_k$.)
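A direct NumPy transcription of this formula might look as follows. A point is stored as a scalar $\lambda$ and a list of unit vectors, the direction is assumed to have unit norm, and the purely radial case $P = 0$ is handled separately; the function name and data layout are illustrative.

import numpy as np

def segre_geodesic(t, lam, xs, lam_dot, xs_dot):
    """Point at time t on the unit-speed geodesic of the Segre manifold
    through (lam, x_1, ..., x_d) with initial velocity (lam_dot, xdot_1, ...).
    Assumes lam_dot^2 + lam^2 * sum_k ||xdot_k||^2 == 1."""
    norms = np.array([np.linalg.norm(v) for v in xs_dot])
    P = np.linalg.norm(norms)                       # total angular speed
    m = np.sqrt(t**2 + 2.0 * lam * lam_dot * t + lam**2)
    if P == 0.0:                                    # purely radial geodesic
        return m, [x.copy() for x in xs]
    phi = np.arctan2(t * lam * P, lam + t * lam_dot)
    new_xs = []
    for x, xdot, nk in zip(xs, xs_dot, norms):
        if nk == 0.0:                               # this sphere factor does not rotate
            new_xs.append(x.copy())
        else:
            a_k = nk / P
            new_xs.append(np.cos(a_k * phi) * x + np.sin(a_k * phi) * xdot / nk)
    return m, new_xs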

These geodesics, while interesting in their own right, are then used to implement a Riemannian conjugate gradient method that moves over the manifold in a geometrically optimal way. They are a computationally viable alternative to the HOSVD-based retractions that can be found in the literature; see for example [breiding_riemannian_2018, kressner_low-rank_2014].

As an application of these optimal curves, we investigate two problems involving missing data, i.e., incomplete tensors. First, we look at the 1M MovieLens database. This is a collection of approximately 1 million ratings given by 6000 users on 4000 movies. We aim to use the algorithm to create a prototype recommender system that predicts the match between an arbitrary user-movie pair, as in [frolov_tensor_2017]. Next, we apply the algorithm in the setting of fluorescence spectroscopy. To measure fluorescent compounds in a solution, the latter is excited with light of different wavelengths. This excites the electrons and results in the emission of light at other wavelengths. The resulting emission patterns can be used to determine the fluorescent compounds in the mixtures. We experimentally study how incomplete the data can become while it remains possible to recover the entire dataset. These results might be applied in developing new compressive-sensing hardware that reduces the experimental duration.

The paper is structured as follows. In the next section we give a short introduction to Riemannian geometry; in particular, the relevant definitions and notation are introduced. Section 3 contains the computation of the geodesics of the Segre manifold. The geodesics are used in Section 4 to create a Riemannian conjugate gradient algorithm for tensor decomposition, whose experimental results are discussed in Section 5.

Acknowledgements

We thank Carlos Beltrán from the Universidad de Cantabria for initiating the computation of the geodesics of the Segre manifold during a visit by the last author.

2. Introduction to Riemannian Geometry

In this section we will give a brief introduction to some of the elementary notions of Riemannian geometry that we need in order to present the main result. If the reader is interested in a more in-depth introduction to this subject, we refer to [godinho_introduction_2014].

A manifold $M$ is a space that locally looks like a Euclidean space, together with some topological conditions. It is a second-countable Hausdorff space that is locally homeomorphic to a Euclidean space. (This means $M$ can be covered by open sets $U_i$ and continuous maps $\phi_i: U_i \to \mathbb{R}^n$, called charts, such that for each $i$, $\phi_i$ is invertible and $\phi_i^{-1}$ is continuous. Additionally, if $U_i$ and $U_j$ overlap, then the map $\phi_j \circ \phi_i^{-1}$ has to be differentiable on the overlap.) If the manifold locally resembles $\mathbb{R}^n$, it is said to be of dimension $n$. At each point $p \in M$ one can define the tangent space of $M$ at $p$, denoted by $T_p M$. This linear space contains all velocity vectors of paths in $M$ at $p$. Concretely it is defined by an equivalence relation on all paths $\gamma: (-\varepsilon, \varepsilon) \to M$ with $\gamma(0) = p$, where $\gamma_1 \sim \gamma_2$ if and only if $\gamma_1'(0) = \gamma_2'(0)$. (Choose a chart $\phi$ whose domain contains $p$; then $\gamma'(0)$ is defined as the regular derivative $(\phi \circ \gamma)'(0)$.) The union of all tangent spaces to $M$ is denoted by $TM$ and is called the tangent bundle. A vector field on the manifold is a smooth collection of tangent vectors at every point of the manifold; it is a smooth map $X: M \to TM$ with $X(p) \in T_p M$. Later on, we will need a tool to connect different tangent spaces on $M$ in order to transport tangent vectors from one point to another point. This can be achieved by a connection on the manifold. We again refer to [godinho_introduction_2014] for the technical definition.

A manifold is called a Riemannian manifold when it is endowed with a metric. A Riemannian metric is a smooth positive-definite inner product $g_p$ on each tangent space $T_p M$. Once we have defined a Riemannian metric on $M$, there is a unique connection that interacts with this metric in a nice way. This canonical connection is called the Levi–Civita connection. A Riemannian metric allows us to properly study notions such as distances and angles on a curved manifold. The length of a path $\gamma: [a, b] \to M$ is defined as

$$L(\gamma) = \int_a^b \sqrt{g_{\gamma(t)}\big(\dot\gamma(t), \dot\gamma(t)\big)}\; dt.$$

An important concept for the remaining sections will be the notion of a geodesic. A geodesic is the generalisation of a straight line in Euclidean space to curved manifolds. Locally on the manifold they correspond to distance minimizing curves. In [godinho_introduction_2014] one can find a definition which involves the Levi–Civita connection.

3. Geodesics on the Segre manifold

In this section we find the geodesics of the Segre manifold $\mathcal{S}$, consisting of all tensors of rank $1$, equipped with the metric induced by the ambient Euclidean metric. Thus, we are looking for the curves that are length-minimizing with respect to the usual Euclidean distance in the space of all tensors. For simplicity of notation, we restrict ourselves in the introduction of this section to the case of tensors of order $3$ with size $n_1 \times n_2 \times n_3$. The main theorem will be proved for tensors of arbitrary order and size.

Let us define $S^{n-1}$ as the unit sphere in $\mathbb{R}^n$, i.e., the manifold of points in $\mathbb{R}^n$ with Euclidean distance $1$ to the origin. Let $\mathbb{R}_{>0}$ denote the manifold of strictly positive real numbers. Note that, using the multilinear properties of the tensor product, an arbitrary rank-$1$ tensor of size $n_1 \times n_2 \times n_3$ can be parametrized by the following local diffeomorphism (a map $f: M \to N$ is a local diffeomorphism if for each $p \in M$ there exists an open set $U$ around $p$ such that $f(U)$ is open in $N$ and the restriction of $f$ to $U$ is differentiable with a differentiable inverse function):

$$\phi: \mathbb{R}_{>0} \times S^{n_1-1} \times S^{n_2-1} \times S^{n_3-1} \to \mathcal{S}, \qquad (\lambda, x_1, x_2, x_3) \mapsto \lambda\, x_1 \otimes x_2 \otimes x_3.$$

This allows us to locally identify $\mathcal{S}$ with $\mathbb{R}_{>0} \times S^{n_1-1} \times S^{n_2-1} \times S^{n_3-1}$. Additionally, deriving $\phi$ gives a description of the tangent space of the Segre manifold. Let $p = (\lambda, x_1, x_2, x_3)$; then we can identify the tangent space of $\mathcal{S}$ at $\phi(p)$ with

$$T_\lambda \mathbb{R}_{>0} \times T_{x_1} S^{n_1-1} \times T_{x_2} S^{n_2-1} \times T_{x_3} S^{n_3-1}$$

as follows:

$$(\dot\lambda, \dot x_1, \dot x_2, \dot x_3) \mapsto \dot\lambda\, x_1 \otimes x_2 \otimes x_3 + \lambda\, \dot x_1 \otimes x_2 \otimes x_3 + \lambda\, x_1 \otimes \dot x_2 \otimes x_3 + \lambda\, x_1 \otimes x_2 \otimes \dot x_3.$$

Note that the Segre manifold lives in the ambient Euclidean space of all tensors of size $n_1 \times n_2 \times n_3$. This allows us to equip the Segre manifold with the induced metric of the ambient Euclidean space. Fix a point $p = (\lambda, x_1, x_2, x_3)$ and take two tangent vectors $u = (\dot\lambda, \dot x_1, \dot x_2, \dot x_3)$ and $v = (\dot\mu, \dot y_1, \dot y_2, \dot y_3)$ at this point. Using the fact that $T_{x_k} S^{n_k-1}$ is the orthogonal complement of the position vector $x_k$ for all $k$, it follows that for the ambient metric:

$$\langle u, v \rangle = \dot\lambda\dot\mu\, \|x_1 \otimes x_2 \otimes x_3\|^2 + \lambda^2 \sum_{k=1}^{3} \langle \dot x_k, \dot y_k \rangle \prod_{j \neq k} \|x_j\|^2 = \dot\lambda\dot\mu + \lambda^2 \sum_{k=1}^{3} \langle \dot x_k, \dot y_k \rangle_{S^{n_k-1}},$$

where $\langle \cdot, \cdot \rangle_{S^{n_k-1}}$ is the inner product on the sphere, which corresponds to the usual Euclidean inner product. We used orthogonality in the first equality and the fact that each $x_j \in S^{n_j-1}$ and hence has unit norm in the second equality. We can generalize this computation in the following lemma.

Lemma 3.1.

Let $p = (\lambda, x_1, \ldots, x_d) \in \mathbb{R}_{>0} \times S^{n_1-1} \times \cdots \times S^{n_d-1}$ and take the tangent vectors $u = (\dot\lambda, \dot x_1, \ldots, \dot x_d)$ and $v = (\dot\mu, \dot y_1, \ldots, \dot y_d)$ at $p$. The metric induced by the ambient Euclidean space on $\mathcal{S}$ is given by

$$\langle u, v \rangle_p = \dot\lambda \dot\mu + \lambda^2 \sum_{k=1}^{d} \langle \dot x_k, \dot y_k \rangle.$$
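In code, the induced inner product of Lemma 3.1 is a one-liner. The sketch below assumes tangent vectors stored as a pair of a scalar and a list of vectors:

import numpy as np

def segre_inner(lam, u, v):
    """Induced metric at (lam, x_1, ..., x_d):
    <u, v> = u_lam * v_lam + lam^2 * sum_k <u_k, v_k>."""
    (u_lam, u_xs), (v_lam, v_xs) = u, v
    return u_lam * v_lam + lam**2 * sum(np.dot(a, b) for a, b in zip(u_xs, v_xs))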

Endowing the Segre manifold with this induced metric turns it into a Riemannian manifold. Note that this metric is closely related to the product metric of the usual metrics on $\mathbb{R}_{>0}$ and the spheres. If $(M, g_M)$ and $(N, g_N)$ are two Riemannian manifolds, then one can endow $M \times N$ with a warped product as follows: given two tangent vectors $(u_1, u_2)$ and $(v_1, v_2)$ at a point $(p, q)$, one can define the inner product as

$$\langle (u_1, u_2), (v_1, v_2) \rangle_{(p,q)} = g_M(u_1, v_1) + f(p)^2\, g_N(u_2, v_2),$$

where $f: M \to \mathbb{R}_{>0}$ is called the warping function. The manifold endowed with this warped product is denoted by $M \times_f N$. In the case of $f$ being the constant function $1$, this reduces to a regular product manifold. Using this notation one can see that

$$\mathcal{S} \cong \mathbb{R}_{>0} \times_\lambda \big( S^{n_1-1} \times \cdots \times S^{n_d-1} \big),$$

hence it is a cone over a product of spheres. For more details on warped products we refer to [oneill_semi-riemannian_1983].

The remainder of this section serves as a proof of Theorem 1.1.

Proof of Theorem 1.1.

Let us look at a path $\gamma(t) = (\lambda(t), x_1(t), \ldots, x_d(t))$ and assume it is the geodesic with given starting conditions $\gamma(0) = (\lambda, x_1, \ldots, x_d)$ and $\dot\gamma(0) = (\dot\lambda, \dot x_1, \ldots, \dot x_d)$.

Recall from the introduction of this section that the Segre manifold endowed with the induced metric is a warped product of $\mathbb{R}_{>0}$ endowed with the standard metric and $S^{n_1-1} \times \cdots \times S^{n_d-1}$ endowed with the product metric, with the identity map as warping function. This implies, as in [oneill_semi-riemannian_1983], that the angular path $(x_1(t), \ldots, x_d(t))$ is a pre-geodesic, i.e., a geodesic that is not necessarily arc-length parametrized, in $S^{n_1-1} \times \cdots \times S^{n_d-1}$.

Now let us re-parametrize the path in such a way that the angular path on the spheres has constant velocity $1$. Let us thus apply the change of variable $s = \sigma(t)$ with $\sigma(0) = 0$, such that we obtain the path $\tilde\gamma(s) = (\tilde\lambda(s), \tilde x_1(s), \ldots, \tilde x_d(s))$ with $\tilde\gamma(\sigma(t)) = \gamma(t)$. It holds for all $s$ that

$$\sum_{k=1}^{d} \left\| \frac{d \tilde x_k}{d s}(s) \right\|^2 = 1.$$

From this we find that

$$\frac{d\sigma}{dt}(0) = \sqrt{\sum_{k=1}^{d} \|\dot x_k\|^2} = P.$$

After this re-parametrization, the angular path is a proper geodesic, since it is an arc-length parametrization of a pre-geodesic. It is well known that this implies that the projection of this geodesic on each of the spheres results in a geodesic. Since geodesics on spheres are well known, we can conclude that

$$\tilde x_k(s) = x_k \cos(a_k s) + \frac{\dot x_k}{\|\dot x_k\|} \sin(a_k s)$$

for all $k$, where $a_k = \|\dot x_k\| / P$.

We now aim to find the path $\tilde\lambda(s)$. We will use the fact that geodesics are curves on a manifold that locally minimize distances. Recall that the length of $\gamma$ from $0$ to $T$ is

$$L(\gamma) = \int_0^T \sqrt{\langle \dot\gamma(t), \dot\gamma(t) \rangle_{\gamma(t)}}\; dt. \qquad (1)$$

This shows that we aim to minimize Eq. 1 over all paths of the form $(\tilde\lambda(s), \tilde x_1(s), \ldots, \tilde x_d(s))$ with $\tilde x_k$ as above. Using the induced metric described in Lemma 3.1, Eq. 1 can be rewritten as

$$L = \int \sqrt{\tilde\lambda'(s)^2 + \tilde\lambda(s)^2}\; ds,$$

since the angular path has constant velocity $1$. This minimization problem can be solved via the Euler–Lagrange equations [fox_introduction_1987]. When we define $F(\tilde\lambda, \tilde\lambda') = \sqrt{(\tilde\lambda')^2 + \tilde\lambda^2}$, we need to solve

$$\frac{\partial F}{\partial \tilde\lambda} - \frac{d}{ds} \frac{\partial F}{\partial \tilde\lambda'} = 0.$$

A straightforward computation gives

$$\frac{\tilde\lambda}{\sqrt{(\tilde\lambda')^2 + \tilde\lambda^2}} - \frac{\tilde\lambda'' \tilde\lambda^2 - \tilde\lambda\, (\tilde\lambda')^2}{\big( (\tilde\lambda')^2 + \tilde\lambda^2 \big)^{3/2}} = 0,$$

which, after cancellations, reduces to

$$\tilde\lambda \tilde\lambda'' - 2 (\tilde\lambda')^2 - \tilde\lambda^2 = 0.$$

One can check that the solution to this ODE is

$$\tilde\lambda(s) = \frac{a}{\cos(s + b)},$$

where $a$ and $b$ are constants to be determined by the starting conditions. Note that the starting conditions $\lambda(0) = \lambda$ and $\dot\lambda(0) = \dot\lambda$ were given for the original path and that $\frac{d\tilde\lambda}{ds}(0) = \dot\lambda / P$. Hence we have

$$\tan(b) = \frac{\dot\lambda}{\lambda P} \qquad \text{and} \qquad a = \lambda \cos(b).$$

We have thus found a pre-geodesic of the form

$$\tilde\gamma(s) = \left( \frac{a}{\cos(s + b)},\;\; x_1 \cos(a_1 s) + \frac{\dot x_1}{\|\dot x_1\|} \sin(a_1 s),\;\; \ldots,\;\; x_d \cos(a_d s) + \frac{\dot x_d}{\|\dot x_d\|} \sin(a_d s) \right). \qquad (2)$$

All that remains is computing an arc-length parametrization of this path. Denote this parametrization by $s(t)$, with the assumption that $s(0) = 0$, such that the path $\tilde\gamma(s(t))$ has length $t$ at time $t$. Using the abbreviation $P$ as before, we can compute

$$\frac{dt}{ds} = \sqrt{\tilde\lambda'(s)^2 + \tilde\lambda(s)^2} = \frac{a}{\cos^2(s + b)}.$$

The solution of this ODE is straightforward via separation of variables:

$$t(s) = a \tan(s + b) - a \tan(b).$$

Using that the geodesic has unit speed, i.e., $\dot\lambda^2 + \lambda^2 P^2 = 1$, we can compute $a \tan(b)$ to be $\lambda\dot\lambda$ and $a$ to be $\lambda^2 P$. We can then rewrite $t(s)$ to conclude that

$$s(t) = \arctan\left( \frac{t + \lambda\dot\lambda}{\lambda^2 P} \right) - b.$$

We can apply this change of variable to Eq. 2 to obtain the arc-length parametrized geodesic, which takes the following form after simplifying and renaming $\tilde\gamma(s(t))$ to $\gamma(t)$:

$$\gamma(t) = \left( \sqrt{t^2 + 2\lambda\dot\lambda\, t + \lambda^2},\;\; x_1 \cos\big(a_1 \varphi(t)\big) + \frac{\dot x_1}{\|\dot x_1\|} \sin\big(a_1 \varphi(t)\big),\;\; \ldots \right)$$

for all $t \geq 0$, with $\varphi(t) = \arctan\left( \frac{t \lambda P}{\lambda + t \dot\lambda} \right)$ and $a_k = \|\dot x_k\| / P$. Using the abbreviations as in Theorem 1.1, we obtain the result. ∎

4. Riemannian conjugate gradient algorithm for tensor decomposition

In the previous section we had an in-depth look at the geodesics of the Segre manifold. The aim of this section is to bring together all items needed to write a Riemannian conjugate gradient algorithm to find the best rank-$r$ decomposition of a given tensor.

A conjugate gradient algorithm is an improvement of the gradient descent algorithm, since it remembers the previous search direction. A gradient descent method on an embedded manifold has a clear geometric interpretation: one takes the negative gradient flow of the objective function in the ambient space, projects this flow onto the submanifold, and aims to follow the projected flow.

Since we are optimizing over a manifold, we need to compute certain geometric properties of the manifold. We will use the geodesics to move over the manifold in a given direction. Additionally we will use parallel transport to move the previous search direction to the current iteration point. The pseudo-code below is a basic layout of a conjugate gradient algorithm over a manifold; for more details see [absil_optimization_2009].

Data: Manifold $\mathcal{M}$, function $f: \mathcal{M} \to \mathbb{R}$
Result: Minimum of function $f$
Pick a starting point $p_0 \in \mathcal{M}$ and set $d_{-1} = 0$, $k = 0$;
while stopping conditions not met do
       Compute Riemannian gradient $g_k = \operatorname{grad} f(p_k)$;
       Move previous direction $d_{k-1}$ to $p_k$ with parallel transport to obtain $\tilde d_{k-1}$;
       Compute $d_k = -g_k + \beta_k \tilde d_{k-1}$;
       Use line-search algorithm to find step-size $t_k$;
       Use retraction to move from $p_k$ in direction of $d_k$ for time $t_k$ to obtain $p_{k+1}$;
       $k \leftarrow k + 1$;
end while
Algorithm 1 Riemannian conjugate gradient algorithm

There are different methods for choosing the parameter $\beta_k$ in the above algorithm. Our algorithm uses a Hestenes–Stiefel rule. For details we again refer to [absil_optimization_2009].
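For reference, the Hestenes–Stiefel rule computes $\beta_k$ from the new gradient and the transported previous gradient and direction. A minimal sketch, assuming tangent vectors are represented as flat NumPy arrays and inner is the Riemannian inner product at the current point:

def hestenes_stiefel_beta(inner, g_new, g_old, d_old):
    """Hestenes-Stiefel: beta = <g_new, y> / <d_old, y> with y = g_new - g_old,
    where g_old and d_old have already been transported to the current point."""
    y = g_new - g_old
    denom = inner(d_old, y)
    return 0.0 if denom == 0.0 else inner(g_new, y) / denom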

4.1. Objective function

Recall that the goal is to find the best rank-$r$ approximation of a given tensor $\mathcal{A}$. We aim to solve the following least squares optimization problem:

$$\min_{(p_1, \ldots, p_r) \in \mathcal{S}^{\times r}} \frac{1}{2} \left\| \mathcal{A} - \sum_{i=1}^{r} \lambda_i\, x_1^{(i)} \otimes \cdots \otimes x_d^{(i)} \right\|_F^2, \qquad p_i = (\lambda_i, x_1^{(i)}, \ldots, x_d^{(i)}).$$

As explained before, we optimize over the product manifold of $r$ copies of $\mathcal{S}$, denoted by $\mathcal{S}^{\times r}$, as in [breiding_riemannian_2018]. In the remainder of this section, we often use $\mathcal{M} = \mathcal{S}^{\times r}$.
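For concreteness, a dense NumPy version of this objective in the order-3 case, with each of the $r$ Segre points stored as a tuple $(\lambda, x, y, z)$ of a scalar and three unit vectors; this is a sketch, not the paper's Manopt/Tensorlab implementation:

import numpy as np

def cp_sum(points):
    """Tensor represented by r Segre points: sum_i lam_i * x_i (x) y_i (x) z_i."""
    return sum(lam * np.einsum('i,j,k->ijk', x, y, z) for (lam, x, y, z) in points)

def objective(A, points):
    """Least-squares objective 0.5 * ||A - sum of rank-1 terms||_F^2."""
    return 0.5 * np.linalg.norm(A - cp_sum(points)) ** 2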

4.2. Riemannian gradient

To compute the search direction from a point $p \in \mathcal{M}$, we need to compute the Riemannian gradient of the objective function $f$ at $p$. The Riemannian gradient of $f$ is defined as the unique vector field $\operatorname{grad} f$ such that for all $p \in \mathcal{M}$ and $v \in T_p \mathcal{M}$:

$$\langle \operatorname{grad} f(p), v \rangle_p = \mathrm{d} f_p(v).$$

Recall that $\mathcal{S}$ is an embedded manifold in the ambient Euclidean space of all tensors of the given size. It is shown in [boumal_introduction_2020] that the Riemannian gradient at a point $p$ on the Segre is the projection of the gradient of the extended map $\bar f$ on the ambient space to $T_p \mathcal{S}$. Thus we can conclude that

$$\operatorname{grad} f(p) = \Pi_{T_p \mathcal{S}} \big( \nabla \bar f(p) \big),$$

where $\Pi_{T_p \mathcal{S}}$ is the projection onto $T_p \mathcal{S}$. This gradient can be computed efficiently as in [vannieuwenhoven_computing_2015].
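In the coordinates of Lemma 3.1 this projection takes a simple form: keep the radial partial derivative, project each spherical partial derivative onto the tangent space of its sphere, and divide by $\lambda^2$ to convert the differential into the gradient with respect to the warped metric. The following is a sketch of that conversion for a single Segre factor, not the efficient routine of [vannieuwenhoven_computing_2015]; d_lam and d_xs are assumed to be the Euclidean partial derivatives of the objective in these coordinates.

import numpy as np

def riemannian_grad(lam, xs, d_lam, d_xs):
    """Project each d_xk onto {v : <v, xk> = 0} and rescale by 1/lam^2
    to account for the metric weight lam^2 from Lemma 3.1."""
    g_xs = [(dx - np.dot(dx, x) * x) / lam**2 for x, dx in zip(xs, d_xs)]
    return d_lam, g_xs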

4.3. Geodesics

In a Riemannian conjugate gradient algorithm we often need to move from a point $p$ in a certain direction $v \in T_p \mathcal{M}$. We will use the geometrically optimal curves

$$\exp_p(v) = \gamma_{p, v}(1),$$

where $\gamma_{p, v}$ is the geodesic starting at $p$ in the direction $v$. This map is known as the exponential map.

4.4. Parallel transport

Additionally, a Riemannian conjugate gradient algorithm requires a method to move tangent vectors from one iteration point to the next. The geometric tool corresponding to this is a connection on the manifold $\mathcal{M}$. Note that the Segre manifold is an embedded manifold in the ambient Euclidean space of all tensors of the fixed size. It is described, for example in [boumal_introduction_2020], that the parallel transport is simply the projection of the ambient parallel transport onto the tangent spaces. If we call $\bar\nabla$ the ambient connection and take two vector fields $X, Y$ on $\mathcal{M}$, we find

$$\nabla_X Y = \Pi_{T\mathcal{M}} \big( \bar\nabla_X \bar Y \big),$$

where $\bar Y$ is a smooth extension of the vector field $Y$ to the ambient Euclidean space and $\Pi_{T\mathcal{M}}$ is the orthogonal projection onto the tangent spaces.
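In an implementation one typically uses the induced vector transport: carry the coordinate components of a tangent vector to the new point and orthogonally project the spherical components onto the new tangent spaces. A sketch in the same coordinates as before; note this projection-based transport approximates, but is not identical to, exact parallel transport along the geodesic.

import numpy as np

def transport_by_projection(new_xs, u_lam, u_xs):
    """Keep the radial component and project each sphere component
    onto the tangent space at the new base point new_xs."""
    return u_lam, [u - np.dot(u, x) * x for x, u in zip(new_xs, u_xs)]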

4.5. Line search

Once we have determined the direction $d_k$ in which we move from a certain iteration point, we need a line search algorithm to dictate the step size $t_k$. A commonly used line search algorithm to determine a good step size is a so-called backtracking algorithm. One typically, as is mentioned in [boumal_introduction_2020], starts with an initial guess for $t_k$ and iteratively decreases it by a constant factor (often $1/2$) until the Armijo–Goldstein condition is satisfied [boumal_introduction_2020]. This condition checks whether the decrease of the objective function for a certain step size is at least proportional to what we would expect given the step size and the norm of the gradient.

One disadvantage of this method is the unknown number of iterations and therefore the uncontrolled number of function evaluations. After investigating the particular problem of tensor approximation, we noticed that the objective function behaved nicely as a function of the step size, as can be seen in Fig. 1.

Figure 1. Error for different step-sizes

Based on these observations we propose a simple formula using quadratic interpolation. One already knows the current function value, corresponding to step size $0$; call this $f_0$. Computing the function values corresponding to two trial step sizes $t_1$ and $t_2$, called respectively $f_1$ and $f_2$, one can easily calculate the interpolating parabola $q(t) = c_2 t^2 + c_1 t + f_0$ with

$$c_2 = \frac{t_2 (f_1 - f_0) - t_1 (f_2 - f_0)}{t_1 t_2 (t_1 - t_2)} \qquad \text{and} \qquad c_1 = \frac{t_1^2 (f_2 - f_0) - t_2^2 (f_1 - f_0)}{t_1 t_2 (t_1 - t_2)}.$$

This parabola attains its minimum at step size

$$t^* = -\frac{c_1}{2 c_2}.$$

Since the function value at step size $t^*$ has to be calculated anyway to determine the error at the next iteration point, this method requires only two additional function evaluations. This clear bound on the number of function evaluations is the main advantage of this method of determining $t_k$. In the code for this line search algorithm we still check whether the $t^*$ determined by the quadratic interpolation satisfies the Armijo–Goldstein condition; if not, we return to the default backtracking line search algorithm.
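A compact version of this rule, assuming f evaluates the objective along the chosen direction and t1, t2 are the two trial step sizes (the concrete trial values used in the experiments are not specified here):

def quadratic_step(f, f0, t1, t2):
    """Step size minimizing the parabola through (0, f0), (t1, f(t1)), (t2, f(t2))."""
    d1, d2 = f(t1) - f0, f(t2) - f0
    denom = t1 * t2 * (t1 - t2)
    c2 = (t2 * d1 - t1 * d2) / denom          # quadratic coefficient
    c1 = (t1**2 * d2 - t2**2 * d1) / denom    # linear coefficient
    if c2 <= 0:                               # no interior minimum: fall back to best trial
        return t1 if d1 <= d2 else t2
    return -c1 / (2 * c2)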

5. Experimental results

In this section we apply the Riemannian conjugate gradient algorithm to two different topics, namely recommender systems for movies and fluorescence spectroscopy.

The Riemannian conjugate gradient algorithm using geodesics on the Segre manifold is implemented in Matlab. The toolbox Manopt [boumal_manopt_2013] was used for the general framework of optimization over manifolds. In particular, we used the conjugategradient solver with the default options. Note that instead of parametrizing $\mathcal{S}^{\times r}$ as $(\mathbb{R}_{>0} \times S^{n_1-1} \times \cdots \times S^{n_d-1})^{\times r}$, we worked with

$$\mathbb{R}_{>0}^{r} \times \mathrm{OB}(n_1, r) \times \cdots \times \mathrm{OB}(n_d, r),$$

where $\mathrm{OB}(n, r)$ is the oblique manifold consisting of all $n \times r$ matrices with unit-norm column vectors. Its geometry is equivalent to that of $r$ copies of the sphere $S^{n-1}$, but its implementation is more efficient. Additionally, tools from Tensorlab [vervliet_tensorlab_2016] were used for efficient tensor calculations.

The experiments were run on an HP EliteBook with an AMD Ryzen PRO processor.

5.1. Recommender system for MovieLens data

In this well-known application we apply the algorithm to the problem of tensor completion. In the field of tensor completion, one aims to complete a given incomplete tensor, assuming an underlying low-rank structure [song_tensor_2019]. The dataset that will be used is the 1M MovieLens dataset [harper_movielens_2015]. This database contains approximately 1 million ratings (ranging from 1 to 5 stars) from about 6000 users of about 4000 movies. In order to increase the density, the data have already been filtered by MovieLens in such a way that every user has rated at least 20 movies. The density of the data, i.e., the percentage of known elements, is approximately 4%.

The aim of a recommender system is to predict the match between an arbitrary user and movie as accurately as possible. We are thus interested in a recommendation map

$$\rho: \{\text{users}\} \times \{\text{movies}\} \to \{1, \ldots, 5\}.$$

As is done in [frolov_fifty_2016], we interpret the dataset as a tensor $\mathcal{A}$. The first two modes of $\mathcal{A}$ correspond to the users and the movies, while the last mode corresponds to the ratings $1, \ldots, 5$. So if user $u$ gives movie $m$ a rating of $s$ stars, we will have $\mathcal{A}_{u,m,s} = 1$ and $\mathcal{A}_{u,m,s'} = 0$ if $s' \neq s$. The aim of this process is that, instead of generating a single expected rating for a given (user, movie)-pair, we can approximate a rating distribution. In other words, we aim to create a recommender map that looks like

$$\rho: \{\text{users}\} \times \{\text{movies}\} \to \mathbb{R}^{5}.$$

For more details and an overview of tensor methods being used in recommender systems, we refer to [frolov_tensor_2017]. As is usual, we assume the data can be well approximated by a low-rank tensor decomposition. We aim to find the best rank-$r$ approximation for the tensor $\mathcal{A}$, of course only looking at the known indices. The tensor can then be completed by this model, thus recommending ratings for unknown user-movie pairs. Before showing the experimental results, we outline some specifics of the algorithm used in this setting.
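Building this one-hot tensor from (user, movie, rating) triples is straightforward; a dense toy sketch follows (the real 6000 x 4000 x 5 tensor is of course kept sparse, storing only the observed indices):

import numpy as np

ratings = [(0, 1, 5), (0, 2, 3), (1, 1, 4)]   # (user, movie, stars) toy data
n_users, n_movies = 2, 3
A = np.zeros((n_users, n_movies, 5))
for u, m, s in ratings:
    A[u, m, s - 1] = 1.0                       # one-hot encoding of the rating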

We start by discussing the objective function. As mentioned above, we aim to minimize the distance between a rank-$r$ tensor and $\mathcal{A}$. To compute this distance, we only look at the set of given indices, which we call $\Omega$. Note that intuitively we expect entries of the tensor to be values in the interval $[0, 1]$. To promote this behaviour, we add a penalty function that punishes each entry of the tensor outside of this interval. Concretely, we define the objective function as

$$f(\mathcal{B}) = \tfrac{1}{2} \|\mathcal{A} - \mathcal{B}\|_\Omega^2 + \alpha\, P(\mathcal{B}),$$

where $\mathcal{B}$ is the low-rank approximation, $\|\cdot\|_\Omega$ is the Frobenius norm obtained by only summing over the indices in $\Omega$, and $P$ is the penalty term. The parameter $\alpha$, which is kept fixed in the remainder of this section, is used to give weight to the penalty term. The ninth power in the penalty term is used to push the penalized part of the graph to zero in a smooth way. The gradient of the objective function was adjusted accordingly. The addition of the penalty function removes the nice behaviour observed in Section 4.5, hence we used the standard backtracking algorithm provided by Manopt. We used the rand function provided by Manopt to randomly choose a starting point on the product manifold.
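The exact penalty formula is not recoverable here beyond the stated ninth power; one plausible realization, our assumption rather than the authors' verified choice, is a one-sided ninth-power hinge on each side of [0, 1]:

import numpy as np

def interval_penalty(B):
    """Hypothetical smooth penalty: zero on [0, 1], growing as a ninth power
    outside it (the high odd power makes the transition smooth at 0 and 1)."""
    return np.sum(np.maximum(-B, 0.0) ** 9 + np.maximum(B - 1.0, 0.0) ** 9)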

Secondly, we need a way to recommend a rating for a user-movie pair $(u, m)$, using our five predicted values $v_1, \ldots, v_5$. We considered four possible methods (a code sketch follows the list):

  1. Defining the rating as the weighted average, so $\rho(u, m) = \sum_{s=1}^{5} s \cdot v_s$.

  2. First rescaling the scores to have sum $1$, then proceeding as in the first option.

  3. Setting negative values equal to zero and setting values bigger than one equal to one, then proceeding as in the first option.

  4. Recommending a rating of $s$ stars if the highest value achieved is $v_s$.
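A hedged transcription of these four aggregation rules; the vector v holds the five predicted scores, and the weighted-average form in rule 1 is our reading of the elided formula:

import numpy as np

STARS = np.arange(1, 6)

def predict(v, method=1):
    """Turn the five predicted scores v into a recommended rating."""
    v = np.asarray(v, dtype=float)
    if method == 1:                      # weighted average of the raw scores
        return float(STARS @ v)
    if method == 2:                      # rescale scores to sum to one first
        return float(STARS @ (v / v.sum()))
    if method == 3:                      # clip scores to [0, 1] first
        return float(STARS @ np.clip(v, 0.0, 1.0))
    return int(STARS[np.argmax(v)])      # rating with the highest score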

As is customary, the full dataset was randomly split into a training set and a test set. Only the training set was used to train the algorithm. Firstly, one aims to find the optimal rank $r$ for approximating the tensor $\mathcal{A}$ by a rank-$r$ tensor. Choosing a small value for $r$ risks not being able to properly explain the dataset, while choosing a large value for $r$ risks overfitting the data in the training set and obtaining worse recommendations for the test set. For each rank we make a new random split of the training set. We run the algorithm to approximate the data and compare the prediction to the data we withheld. We used the root mean square error (RMSE) to compare the predictions to the real values. RMSE is a well-established measure of the difference between observed values and values predicted by a model and is given by

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u, m) \in T} \big( \rho(u, m) - r_{u,m} \big)^2},$$

where $T$ is the set of withheld (user, movie)-pairs and $r_{u,m}$ the true rating. We stopped the algorithm when the norm of the gradient dropped below a fixed tolerance; in all of our runs, this corresponds to a small relative gradient norm. One can find the results in Fig. 2. Note that one of the four options is not shown, since it was not at all competitive: its RMSE was substantially larger than that of the other options.

Figure 2. RMSE for different methods of predicting ratings.

The figure makes clear which option performs best in this setting, with only one other option being competitive. Already for reasonably low ranks, we achieve RMSE values in the range of those reported for several different algorithms; see [papadakis_scor_2017].

Running the algorithm to convergence takes a matter of minutes. Note however that this is not the relevant time: one is, for example, interested in quickly producing the top recommendations for a given user. Taking a set of arbitrary users and providing each of them with the films best suited to them took only a matter of seconds.

5.2. Fluorescence spectroscopy

As a final application we look at a dataset generated by fluorescence spectroscopy. The experiments aim to identify the concentration of several fluorophores in different mixtures. This is done by exciting them with light of different wavelengths and intercepting the light emitted by the mixtures.

Currently, experiments in fluorescence spectroscopy are executed by exciting the mixtures with a range of wavelengths and intercepting the emitted light over a range of wavelengths. It is of interest to determine how much data is really required to identify the fluorophores in the different mixtures. Concretely, we ask how incomplete the data tensor can be while still allowing recovery of the full dataset. This can be useful for developing new hardware for fluorescence spectroscopy: by only exciting the mixtures with light of randomly chosen wavelengths and using a filter to only intercept light at certain random wavelengths, we anticipate that one could drastically reduce the experimental duration. We investigate this on two different datasets: one smaller dataset with a small relative error and one bigger dataset with a larger relative error.

It is known that one can expect a dataset obtained by fluorescence spectroscopy on mixtures of $r$ different fluorophores to be explained by a model of rank $r$ [appellof_three-dimensional_1983]. We will check this experimentally in both datasets. We aim to experimentally determine how incomplete the tensor of data can become such that it can still be recovered. We will remove all but $N = c \cdot \dim \sigma_r(\mathcal{S})$ elements of $\mathcal{A}$, where $c$ is a variable and $\sigma_r(\mathcal{S})$ is the $r$-th secant variety of $\mathcal{S}$. Note that for order-three tensors of size $n_1 \times n_2 \times n_3$ one has $\dim \sigma_r(\mathcal{S}) \leq r(n_1 + n_2 + n_3 - 2)$, with equality generally expected to hold for sufficiently small ranks [abo_induction_2009]. Results will be quantified using both the relative approximation error and the core consistency of the approximation, which is a constant derived from a procedure called Corcondia [bro_new_2003]. Corcondia is often used for deriving the number of latent factors in a model, by computing a percentage that decreases when a model is being overfitted. For details on this method, together with an efficient implementation, we refer to [papalexakis_fast_2015].
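A sketch of the corresponding sampling step for the order-3 case, using the expected secant dimension $r(n_1 + n_2 + n_3 - 2)$ as the unit of the budget (function and parameter names are illustrative):

import numpy as np

def sample_indices(shape, r, c, seed=0):
    """Indices of N = c * r * (n1 + n2 + n3 - 2) uniformly random entries,
    where r * (n1 + n2 + n3 - 2) is the expected dimension of sigma_r(S)."""
    n1, n2, n3 = shape
    N = int(c * r * (n1 + n2 + n3 - 2))
    rng = np.random.default_rng(seed)
    flat = rng.choice(n1 * n2 * n3, size=N, replace=False)
    return np.unravel_index(flat, shape)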

The dataset [bro_parafac_1997] consists of five mixtures, each containing different concentrations of three fluorescent amino acids: tyrosine, tryptophan and phenylalanine. These mixtures are excited by light of wavelengths between 240 nm and 300 nm in intervals of 1 nm. The emitted light is intercepted at wavelengths between 250 nm and 450 nm in intervals of 1 nm. The data can therefore be collected in a tensor of size $5 \times 61 \times 201$. Figure 3 shows the relative errors obtained by approximating $\mathcal{A}$ with a rank-$r$ tensor for increasing values of $r$. Note that we stopped the algorithm when the norm of the gradient was less than a fixed tolerance; in all of our runs, this corresponds to a small relative gradient norm.

Figure 3. Relative errors per rank

It is clear that, as expected, a model of rank $3$ is sufficient to explain the dataset, giving a small relative error. Note that in this setting $\dim \sigma_3(\mathcal{S}) = 3(5 + 61 + 201 - 2) = 795$, which is only about 1.3% of the size of the entire dataset. Figure 4 shows the relative error obtained by only using $N = c \cdot \dim \sigma_3(\mathcal{S})$ randomly chosen elements of $\mathcal{A}$, as a function of $c$. As a further study for small values of $c$, Fig. 5 shows the core consistency of the approximation using only $N$ elements.

Figure 4. Relative error of the approximation obtained using only $N = c \cdot \dim \sigma_3(\mathcal{S})$ elements.
Figure 5. Core consistency of the approximation obtained using only $N$ elements.

Figures 4 and 5 suggest that already for small values of $c$ we have an accurate approximation of $\mathcal{A}$. One can check this by retrieving the emission patterns of the amino acids from the decompositions. Figure 6 shows that the emission patterns are well approximated from the incomplete tensor. This shows that in this setting one only needs around 10% of all the data collected by fluorescence spectroscopy. Because of the relatively small dataset, the corresponding decrease in the runtime of the algorithm is modest, a matter of seconds. However, since only about 10% of the data has to be collected, there is a possibility to significantly lower the acquisition time of the spectroscopic images by developing new hardware.

Figure 6. Emission patterns of the rank-3 decomposition, obtained from the full tensor (a) and from increasingly incomplete tensors (b)-(d), together with the relative error of each approximation.

Conclusions

We proposed a Riemannian conjugate gradient algorithm for low-rank tensor approximation using the geometry of the manifold over which one optimizes. The Riemannian geometry of the manifold of rank-$1$ tensors was studied. The main original contribution is the computation of the geodesics of the Segre manifold $\mathcal{S}$ and their application as retractions in the Riemannian conjugate gradient algorithm. We applied the algorithm in two different settings with incomplete tensors. First, we showed how to use the algorithm to build a recommender system for movie ratings. Second, we determined experimentally that one only needs a fraction of the data in order to recover a common fluorescence spectroscopy dataset up to measurement errors. This observation could be exploited to create new hardware for fluorescence spectroscopy that reduces the experimental duration.

References