Finding entries of maximum absolute value in low-rank tensors

12/04/2019 · Lars Grasedyck, et al. · RWTH Aachen University

We present an iterative method for finding extreme entries in low-rank tensors, based on a power iteration combined with a binary search. In this work we use the HT-format for low-rank tensors, but the method carries over verbatim to other low-rank formats. We present two different approaches to accelerate the basic power iteration: an orthogonal projection of Rayleigh-Ritz type, as well as an acceleration of the power iteration itself which can be achieved due to the diagonal structure of the underlying eigenvalue problem. Finally, the maximizing index is determined by a binary search based on the proposed iterative method for the approximation of the maximum norm. The iterative method for the maximum norm estimation inherits the linear complexity of the involved tensor arithmetic in the HT-format w.r.t. the tensor order, which is also verified by numerical tests.


1 Introduction

Scientific computing with functions, vectors and matrices in low-rank formats allows us to tackle even high-dimensional and extreme-scale problems, provided the underlying mathematical objects allow for a low-rank approximation. In this article we focus on the task of postprocessing low-rank matrices and tensors. Our guiding problem is to find the entry of maximum absolute value in a low-rank object without accessing every entry. Such a task is the discrete analogue of finding global optima and thus of general interest. While this is straightforward for rank one, it is surprisingly difficult for ranks beyond one. The seminal idea for finding the maximum absolute value efficiently stems from the PhD thesis of Mike Espig [5], where the problem is reformulated as a diagonal (extreme-scale) eigenvalue problem. The latter can e.g. be solved by a simple power iteration, and in case there is a unique maximizer, the corresponding eigenvector is of rank one. Hence, it makes sense to approximate the eigenvector by low rank (cf. [6] for a recent approach independent of this article). However, if several entries are of similar size, then the power iteration can be arbitrarily slow, the iterates cannot be approximated by low rank, and a truncation to low rank may destroy convergence altogether.

In this article we provide much more advanced approaches for finding the maximum by subspace acceleration and a squaring trick. Without truncation we are able to prove a uniformly good convergence rate for computing the maximum norm, and based on this we devise a divide-and-conquer strategy to find the corresponding entry.

The article is structured as follows: In Section 2 we introduce the HT-format followed by the problem for rank one or elementary tensors in Section 3. In Section 4 we summarize the power iteration approach and give examples that underline the involved difficulties. In Section 5 we provide the subspace acceleration and in Section 6 the squaring trick. Finally, in Section 7 we explain how one can determine the entry of maximum absolute value based on the so far introduced iterative methods. We conclude by giving numerical examples in Section 8 that show the benefits and limitations of the algorithms.

2 Low-rank formats for tensors

We consider order-d tensors and denote their entries by multi-indices. For larger tensor order d a full (entrywise) representation is not feasible (cf. the curse of dimensionality [3, 2]) and additional structure is required. In the following we use low-rank tensors that generalize the matrix rank.

[Elementary tensors] A tensor is called an elementary tensor or rank-one tensor if

(1)

The CP-rank (Canonical Polyadic rank) [14] of a tensor is defined as the minimum number of elementary terms such that

(2)

For computational purposes the minimum number of terms is not necessary; instead, any representation of the form (2) with a reasonably small number of terms is sufficient. Such a representation is called a CP-format with representation rank . The corresponding set is defined by

(3)

These sets are numerically difficult to handle, and thus a different type of rank is commonly used, which is based on matricizations. [-matricization] Let . We denote

(4)

and we use the short notation

(5)

Then the -matricization of is defined by

i.e. a rearrangement of the tensor into a matrix with row index set and column index set . For the CP-format, instead of all tensor entries of some tensor , the matrices from the right-hand side of (2) are stored. Assuming all tensor directions to be of the same size , this sums up to a storage complexity of , which is linear in the tensor order . Despite this desirable linear complexity in the tensor order, the CP-format has two drawbacks: on the one hand, the CP-rank is in general NP-hard to determine, cf. [15]; on the other hand, the set (3) is not closed for tensors of higher order, cf. [4, 12]. This fact can render the numerical treatment of tensors in CP-format difficult.
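
To make the storage count concrete, the following minimal dense sketch (plain numpy, illustrative names) builds a CP representation from d factor matrices of size n × r and reconstructs the tensor entrywise: the factors require O(d·n·r) numbers, while the dense tensor has n^d entries.

```python
import numpy as np

def cp_reconstruct(factors):
    """Reconstruct a dense tensor from CP factors U_1, ..., U_d (each n_mu x r).

    Entrywise: x[i_1, ..., i_d] = sum_j prod_mu U_mu[i_mu, j].
    The factors need O(d * n * r) numbers, the dense tensor has prod(n_mu) entries.
    """
    r = factors[0].shape[1]
    shape = tuple(U.shape[0] for U in factors)
    x = np.zeros(shape)
    for j in range(r):                      # sum over the r elementary terms
        term = factors[0][:, j]
        for U in factors[1:]:               # outer product of the j-th columns
            term = np.multiply.outer(term, U[:, j])
        x += term
    return x

# toy example: order-3 tensor with CP representation rank 2
rng = np.random.default_rng(0)
factors = [rng.standard_normal((4, 2)) for _ in range(3)]
x = cp_reconstruct(factors)
print(x.shape)   # (4, 4, 4)
```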

For the Tucker format the corresponding set is closed, cf. [12], but the storage complexity w.r.t. the tensor order is exponential.

The above mentioned complexities and shortcomings transfer to many arithmetic operations, such as dot products or Hadamard products, which are involved in the algorithms presented in this article. All of the shortcomings can be resolved by using the Hierarchical Tucker format (HT-format) introduced in the following.

2.1 The Hierarchical Tucker format

An HT-format for a tensor , , is based on a hierarchy of all tensor directions , which can be expressed by a binary tree, referred to as dimension tree, cf. [8, 12]:

[Dimension tree] For a tensor space , , by we denote the set of all tensor directions. A corresponding dimension tree is a tree with the following properties:

  1. all vertices of are non-empty subsets of ,

  2. the root vertex of is the set of all tensor directions,

  3. all vertices with have two disjoint successors with

A dimension tree for a tensor order , , is thus a binary tree with singletons , , at the leaf vertices. Two examples of dimension trees for tensor order are shown in Figure 1.


Figure 1: Two different dimension trees for tensor order .
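
As a small illustration (nested Python tuples, purely a sketch that does not claim to reproduce Figure 1 exactly), two common ways to build a dimension tree over the directions are a balanced tree obtained by recursive halving and a degenerate, TT-like linear tree:

```python
def balanced_tree(dims):
    """Dimension tree by recursive halving: each vertex is (subset, left, right)."""
    dims = tuple(dims)
    if len(dims) == 1:
        return (dims, None, None)                     # leaf: singleton {mu}
    mid = len(dims) // 2
    return (dims, balanced_tree(dims[:mid]), balanced_tree(dims[mid:]))

def linear_tree(dims):
    """Degenerate (TT-like) dimension tree: splits off one direction per level."""
    dims = tuple(dims)
    if len(dims) == 1:
        return (dims, None, None)
    return (dims, (dims[:1], None, None), linear_tree(dims[1:]))

print(balanced_tree(range(1, 5)))  # root {1,2,3,4} with successors {1,2} and {3,4}
print(linear_tree(range(1, 5)))    # root {1,2,3,4} with successors {1} and {2,3,4}
```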

The HT-format is again defined by matricizations, cf. Definition 3, where the subsets are the vertices of the underlying dimension tree, which determines the structure of the HT-format: [Hierarchical Tucker format (HT-format)] Let , , be a tensor and , , a dimension tree. Furthermore, for all let

(6)

, as in (4), be a decomposition of the -matricization of , where the , , are referred to as frames: the columns of span the range of ; in particular, at the root we assume (6) to be chosen such that

(7)

holds, i.e. . A Hierarchical Tucker format (HT-format) for is then defined by

  1. the above matrices for all leaf vertices , , of and

  2. transfer tensors at all non-leaf vertices , , with at the root, which fulfill for all

    (8)

    i.e. the transfer tensor holds the coefficients of the linear combination (8) by which the frame can be obtained from the frames and .

The vector containing the numbers of columns of the frames , , is called representation rank of the HT-format. Notice that we always assume at the root, i.e. contains only one column, namely , cf. (7).

[Hierarchical Tucker rank (HT-rank)] The Hierarchical Tucker rank (HT-rank) of a tensor , , w.r.t. some dimension tree , , is defined as the vector containing the smallest possible numbers , , such that an HT-format of with representation rank exists.

Notice that by (6) the components , , of the HT-rank of , , w.r.t. some dimension tree are the ranks of the respective matricizations , denoted as

According to (3) for , , we define the set as

(9)

i.e. the set of all tensors for which an HT-representation with representation rank exists. It can be shown that is a closed subset of , cf. [12].

Evaluation of tensor entries

Let , , be represented in the HT-format w.r.t. some dimension tree , i.e. at the leaf vertices the matrices are stored and at the non-leaf vertices the transfer tensors , , are stored. The tensor entries , , are thus not available explicitly but can be computed recursively: by (7) we have

which can be evaluated by using (8):

(10)

where we use the notation (5) for sub-indices of corresponding to some subset . The columns and are either stored explicitly (if the corresponding vertex or is a leaf of ) or can again be evaluated recursively by using (8). By this means, the tensor entries , , of can be evaluated with a computational work growing like , i.e. linearly with the tensor order , when is an upper bound for the components of the representation rank .
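
A dense recursive sketch of this entry evaluation for an order-4 tensor with a balanced dimension tree is given below; the data layout (frame matrices at the leaves, transfer tensors of shape (r_t, r_left, r_right) with a single column at the root) follows the HT-format definition above, but all names, the zero-based directions, and the concrete tree are illustrative assumptions.

```python
import numpy as np

def ht_entry(index, leaves, transfer, tree):
    """Evaluate one entry of an HT tensor recursively, cf. (10).

    leaves[mu]  : frame matrix U_mu of shape (n_mu, r_mu)       (leaf vertices)
    transfer[t] : transfer tensor B_t of shape (r_t, r_l, r_r)  (non-leaf vertices)
    tree        : vertex t as (t, left, right); leaves are ((mu,), None, None)
    Returns the vector of frame values at the vertex; at the root it has length 1.
    """
    t, left, right = tree
    if left is None:                       # leaf vertex {mu}: pick the i_mu-th row
        mu = t[0]
        return leaves[mu][index[mu], :]
    v_l = ht_entry(index, leaves, transfer, left)
    v_r = ht_entry(index, leaves, transfer, right)
    # linear combination (8): v_t[k] = sum_{p,q} B_t[k,p,q] * v_l[p] * v_r[q]
    return np.einsum('kpq,p,q->k', transfer[t], v_l, v_r)

# illustrative order-4 example with representation rank 2 at every vertex
rng = np.random.default_rng(1)
n, r = 5, 2
tree = ((0, 1, 2, 3),
        ((0, 1), ((0,), None, None), ((1,), None, None)),
        ((2, 3), ((2,), None, None), ((3,), None, None)))
leaves = {mu: rng.standard_normal((n, r)) for mu in range(4)}
transfer = {(0, 1): rng.standard_normal((r, r, r)),
            (2, 3): rng.standard_normal((r, r, r)),
            (0, 1, 2, 3): rng.standard_normal((1, r, r))}   # root: one column only
print(ht_entry((0, 2, 1, 4), leaves, transfer, tree)[0])    # the entry x[0, 2, 1, 4]
```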

Storage complexity

The storage complexity for a tensor , , in HT-representation w.r.t. some dimension tree also grows linearly with the tensor order (as for the CP-format): assuming all tensor directions to be of the same size and a bound for the components of the representation rank of , real numbers have to be stored at each leaf vertex , , cf. Definition 1 (a), and real numbers have to be stored at each interior vertex (neither leaf nor root) and real numbers at the root. Taking into account that a dimension tree , , consists of leaf vertices and non-leaf vertices, this sums up to a storage complexity of .

Arithmetic operations in the HT-format

Two arithmetic operations for tensors, which are of particular importance for this article, are the dot product and the Hadamard product of two tensors , defined by

These operations can be computed directly in the HT-format, cf. [12, 11]. The corresponding complexities grow again linearly with the tensor order : assuming all tensor directions to be of the same size and an upper bound for the components of the representation ranks of and , dot products can be computed with a complexity bounded by flops whereas flops are needed for the computation of Hadamard products (cf. [11] for a parallel algorithm).

A more detailed description of different low-rank formats for tensors can e.g. be found in [12, 10]. For this article we use the HT-format, but other formats can be used as well, e.g. the TT-format [20, 12] or the MPS format [21, 22]. Since the computation of the Hadamard product of two low-rank tensors typically yields a result of higher representation rank, one needs a truncation procedure which truncates tensors back to lower ranks. For the HT-format we use the truncation techniques introduced in [8], the complexity of which is comparable to that of the computation of dot products above.
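
The rank growth that makes such a truncation necessary is easiest to see in the CP-format: the Hadamard product of two tensors with representation ranks r and s admits a representation of rank r·s built from all pairwise column products of the factors (the HT representation ranks grow analogously, vertex by vertex). A minimal numpy check of this well-known fact (illustrative names only):

```python
import numpy as np

def cp_hadamard(factors_a, factors_b):
    """CP factors of the Hadamard product a∘b: all pairwise column products.

    If a has representation rank r and b has rank s, the result has rank r*s,
    which is why rank truncation is needed inside the iterative methods.
    """
    out = []
    for Ua, Ub in zip(factors_a, factors_b):
        r, s = Ua.shape[1], Ub.shape[1]
        # column (j,k) of the new factor is the entrywise product of columns j and k
        out.append(np.einsum('ij,ik->ijk', Ua, Ub).reshape(Ua.shape[0], r * s))
    return out

def cp_dense(factors):
    """Dense reconstruction, only used to verify the rank-r*s representation."""
    d = len(factors)
    subs = [[i, d] for i in range(d)]
    args = [x for pair in zip(factors, subs) for x in pair]
    return np.einsum(*args, list(range(d)))

rng = np.random.default_rng(2)
a = [rng.standard_normal((4, 2)) for _ in range(3)]   # rank-2 representation
b = [rng.standard_normal((4, 3)) for _ in range(3)]   # rank-3 representation
c = cp_hadamard(a, b)                                  # rank-6 representation
print(np.allclose(cp_dense(c), cp_dense(a) * cp_dense(b)))  # True
```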

3 Maximum entries of elementary tensors

Let , . [Maximum norm of a tensor] The maximum norm of a tensor is defined by .

Maximum norm of a matrix

Notice that Definition 3 may lead to an ambiguity: for a matrix , is often used to denote the operator norm of as a linear operator which does not match Definition 3:

cf. [7], where denotes the -th row of . In the following is always used according to Definition 3, i.e. .
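
For matrices the ambiguity is easy to reproduce with numpy, where `np.linalg.norm(A, np.inf)` returns the operator norm (maximum absolute row sum) rather than the entrywise maximum used throughout this article:

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [0.5,  0.5]])
print(np.linalg.norm(A, np.inf))   # 3.0 -- operator norm: maximum absolute row sum
print(np.max(np.abs(A)))           # 2.0 -- entrywise maximum norm used in this article
```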

Maximum norm of an elementary tensor

Let , , , be an elementary tensor. The maximum norm of is then given by

which can be maximized by maximizing each of the factors , , in complexity . Unfortunately, no such algorithm is known for tensors of higher rank (i.e. non-elementary tensors).
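
A minimal sketch of this factor-wise maximization (dense factors, illustrative function name): since the absolute value of an entry of an elementary tensor is the product of the absolute values of the factor entries, each factor can be maximized independently.

```python
import numpy as np

def max_abs_entry_rank_one(factors):
    """Maximum norm and maximizing index of x = x_1 ⊗ ... ⊗ x_d in O(d*n).

    Since |x[i_1, ..., i_d]| = prod_mu |x_mu[i_mu]|, each factor can be
    maximized independently of the others.
    """
    index = tuple(int(np.argmax(np.abs(v))) for v in factors)
    value = np.prod([np.abs(v[i]) for v, i in zip(factors, index)])
    return value, index

factors = [np.array([0.5, -2.0, 1.0]), np.array([3.0, -1.0]), np.array([-0.1, 0.2])]
print(max_abs_entry_rank_one(factors))   # (1.2, (1, 0, 1)) since |(-2) * 3 * 0.2| = 1.2
```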

Even if a good approximation of a matrix by a matrix of lower rank exists in the Euclidean sense (e.g. a truncated singular value decomposition), the corresponding maximum norms may differ extremely, as the following example indicates:

Example

Consider a matrix defined as

(11)

Then has the singular values and is the rank- best approximation of in the Frobenius (and spectral) norm with an approximation error

(12)

in the Frobenius norm. Assuming we get a corresponding error

(13)

for the approximation of the maximum norm, which tends to for , independently of the error (12).

4 Estimating the maximum norm by a power iteration

We use a power iteration to approximate the maximum norm of a tensor , , as introduced in [5]: consider the diagonal matrix

(14)

Then the maximum norm is the maximum absolute value of an eigenvalue of :

(15)

which can be computed by a standard power iteration if there exists only one eigenvalue of with , i.e. there is only one index with . Taking into account that the matrix multiplication , , can be written as the Hadamard product , this results in Algorithm 1, where denotes the Euclidean norm of .

1;
2 for   do
3       ;
       // Rayleigh quotient
4       ;
5      
Algorithm 1 Power iteration: the maximum norm of a tensor , , is approximated by , , if only one index with exists.
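
The following dense numpy sketch mirrors Algorithm 1 with all HT arithmetic replaced by full arrays, so it only illustrates the iteration itself, not the low-rank complexity: the multiplication with the diagonal matrix is the Hadamard product of the tensor with the current iterate, and the estimate is the Rayleigh quotient.

```python
import numpy as np

def power_iteration_max_norm(x, steps=500):
    """Algorithm 1 on a dense tensor: approximate ||x||_inf = max_i |x[i]|.

    Multiplication with the diagonal matrix D_x is the Hadamard product x ∘ y,
    and lambda_k is the Rayleigh quotient <y_k, x ∘ y_k>.  Convergence requires
    that exactly one entry attains the maximal absolute value.
    """
    y = np.ones_like(x)
    y /= np.linalg.norm(y)                   # start vector with ||y_0||_2 = 1
    lam = 0.0
    for _ in range(steps):
        z = x * y                            # Hadamard product x ∘ y_k
        lam = np.vdot(y, z)                  # Rayleigh quotient
        y = z / np.linalg.norm(z)            # normalized next iterate
    return abs(lam)

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 8, 8))
print(power_iteration_max_norm(x), np.max(np.abs(x)))   # estimate vs. exact value
```
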
Improvement of the Rayleigh quotient

Due to the diagonal structure of the eigenvalue problem in (15), the Rayleigh quotient in line 1 of Algorithm 1 can be improved, which results in a new estimator , , cf. Algorithm 2. [Improved estimator for the power iteration] The new estimator , , of Algorithm 2 is at least as good as the Rayleigh quotient , , from the standard power iteration, cf. Algorithm 1:

(16)

and always converges to from below: , . Using the Cauchy-Schwarz inequality and in line 1 of Algorithm 1 yields

i.e. (16). By induction one easily verifies that , , holds in line 2 of Algorithm 2, where denotes the -fold Hadamard product of . This implies

(17)

in line 2, i.e. , , are lower bounds of . We now use

(18)

and the norm equivalence for -norms on which follows from the Hölder inequality:

(19)

as well as

(20)

in order to get a lower bound for , :

(21)

Combining the bounds (17) and (21) for yields

(22)

i.e. for since , .

1;
2 for   do
3       ;
       // improved estimator
4       ;
5      
Algorithm 2 Power iteration: for , where is bounded from below by the absolute value of the Rayleigh quotient from Algorithm 1, cf. Lemma 4.

The following example shows that the estimator from Algorithm 2 is indeed better than the Rayleigh quotient from the standard power iteration:

Example

For the vector (order- tensor) the standard power iteration from Algorithm 1 does not converge to :

The estimator from Algorithm 2, however, immediately converges to :
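
This behavior can be reproduced with a dense numpy sketch of Algorithm 2. Here we take the improved estimator to be the Euclidean norm of the Hadamard product of the tensor with the current iterate, which is consistent with the Cauchy-Schwarz argument in the proof of Lemma 4 but is an assumption of this sketch, since the exact formula is not reproduced above. On a vector with two maximal entries of opposite sign the estimator equals the maximum norm already after the first step, while the Rayleigh quotient of Algorithm 1 stays at zero.

```python
import numpy as np

def power_iteration_improved(x, steps=50):
    """Algorithm 2 (dense sketch): power iteration with the improved estimator.

    Assumed estimator: lambda_k = ||x ∘ y_k||_2.  By Cauchy-Schwarz it dominates
    the absolute value of the Rayleigh quotient, and without truncation (where
    y_k is the normalized k-fold Hadamard power of x) it approaches ||x||_inf
    from below.
    """
    y = np.ones_like(x)
    y /= np.linalg.norm(y)
    lam = 0.0
    for _ in range(steps):
        z = x * y
        lam = np.linalg.norm(z)              # improved estimator, a lower bound of ||x||_inf
        y = z / lam
    return lam

x = np.array([2.0, -2.0])                    # two maximal entries of opposite sign
print(power_iteration_improved(x))           # 2.0, while the Rayleigh quotient stays at 0
```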

Error estimator for Algorithm 2

Since Algorithm 2 is a power iteration for with an improved estimator for the eigenvalue, cf. Lemma 4, any error estimator known for the standard power iteration can also be applied to Algorithm 2. Since in our framework the underlying matrix is diagonal, cf. (14), results for power iteration with symmetric matrices may be applied. E.g. in [18] the error , , is proven to be bounded by

(23)

if the corresponding matrix is symmetric positive definite, which in our framework means that all entries of are positive, cf. (14).

However, for our special case of a power iteration on a diagonal matrix , the bound

(24)

can directly be obtained from (22) in the proof of Lemma 4. Asymptotically the bounds (23) and (24) show the same behavior, where the bound in (24) is always better than the one in (23). Assuming an error which grows like the bound (23) leads to a slow convergence:

(25)

Even though the above error estimates may be pessimistic in some cases, our experiments show that the convergence behavior (25) in fact occurs for many standard examples, cf. Figures 1(a) and 2(a).

Complexity of the power iteration

In the HT-format all tensor operations involved in Algorithm 2 can be carried out directly in the HT-format with a complexity depending linearly on the tensor order and the mode sizes. We can thus expect such a linear complexity for the approximation of by Algorithm 2 with , , in HT-representation, which is verified by the experiments in Section 8.6, cf. Figure 7.

Truncation

Since the computation of Hadamard products increases the representation ranks of the iterates, we include truncations in each step of Algorithm 2 in order to keep the representation ranks of the iterates small enough. We compute the truncations by a hierarchical singular value decomposition (HSVD) in the HT-format [8]. A detailed analysis of truncations in iterative fixed-point like processes is given by [13]: if the truncation errors are small enough and the starting tensor is chosen appropriately, the iterative process still converges to the original fixed point, provided that the latter allows for a respective low-rank representation. In our numerical experiments, cf. Section 8, we truncate w.r.t. fixed prescribed HT-ranks in each step of the algorithms, which already worked well. Another option would be to truncate w.r.t. a given tolerance for the relative truncation error, which, however, may lead to high representation ranks in the beginning of the algorithms, since the underlying error estimations for the HT-format can be rather pessimistic, cf. [8].
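
As an illustration of the kind of truncation involved (a dense matrix analogue, not the HSVD of [8] itself): truncation to a prescribed rank is a best approximation in the Euclidean/Frobenius sense, obtained here from a plain singular value decomposition; for HT tensors the same idea is applied vertex-wise.

```python
import numpy as np

def truncate_to_rank(A, k):
    """Best rank-k approximation of a matrix in the Frobenius norm via the SVD.

    Dense analogue of the rank truncation used inside the iterations; the HSVD
    applies the same principle to the matricizations at all tree vertices.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 50))   # rank 8
A_trunc = truncate_to_rank(A, 4)
print(np.linalg.norm(A - A_trunc))      # Frobenius error of the rank-4 truncation
print(np.max(np.abs(A - A_trunc)))      # the entrywise error may behave differently
```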

Notice that it is possible to construct counterexamples where any truncation leads to the loss of that entry of which corresponds to , i.e. to a wrong approximation of in the end. Such counterexamples can be constructed as in (11): take (order- tensor) as defined in (11) with and , . A rank- best approximation of is given by , cf. (11), i.e. after truncation, instead of , the maximum norm would be computed. This example shows that it may happen that a single entry of larger absolute value (the entry of ) is "hidden" by many entries of smaller absolute value (the entries of ). The reason for this is that the truncation, cf. [8], is based on the singular value decomposition, which corresponds to a best approximation in the Euclidean sense and not w.r.t. the maximum norm. Nevertheless this does not contradict the convergence analysis developed in [13], since the truncation of does not meet the underlying assumptions: the starting tensor of the iterative process is far away from the fixed point, cf. (11).

5 Estimating the maximum norm by Ritz values

In order to accelerate the convergence of the power iteration, we make use of an orthogonal projection technique which uses not only the iterate of the current step for the estimation of , cf. line 2 in Algorithm 2, but the latest iterates, i.e. , for some , . The resulting method is often referred to as Rayleigh-Ritz method, cf. [1]:

Rayleigh-Ritz method

The idea is to approximate an exact eigenvector of the underlying eigenvalue problem , , cf. (14), (15), by a linear combination of the latest iterates , , which means that an approximation of is searched for in the corresponding subspace

(26)

The approximation and a corresponding approximation of are obtained by imposing a Galerkin condition:

Computing an orthonormal basis of and defining the symmetric matrix as

(27)

yields

i.e. is the eigenvalue of which has maximum absolute value, and . The eigenvalues of are called Ritz values of , i.e. the eigenvalue of fulfilling is approximated by the Ritz value of which maximizes .

5.1 Computation of orthonormal bases in the HT-format

Since we use the HT-format for the representation of low-rank tensors, the iterates spanning the subspace are tensors in the HT-format. We thus want to compute an orthonormal basis of

directly in the HT-format. This can e.g. be achieved by a Gram-Schmidt process which only involves the computation of sums and dot products and can thus be carried out directly in the HT-format.
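
A sketch of such a Gram-Schmidt process using only sums, scalar multiples and dot products is given below; flattened numpy arrays stand in for HT tensors, and with actual HT arithmetic each addition would be followed by a truncation.

```python
import numpy as np

def gram_schmidt(tensors, eps=1e-12):
    """Modified Gram-Schmidt using only sums, scalar multiples and dot products.

    The same operations are available for tensors in the HT-format, so the loop
    could be carried out directly on HT representations (plus truncation).
    """
    basis = []
    for v in tensors:
        w = v.copy()
        for q in basis:
            w = w - np.vdot(q, w) * q         # orthogonalize against previous basis
        norm = np.linalg.norm(w)
        if norm > eps:                        # skip (numerically) dependent iterates
            basis.append(w / norm)
    return basis

rng = np.random.default_rng(5)
iterates = [rng.standard_normal(4 * 4 * 4) for _ in range(3)]
Q = gram_schmidt(iterates)
print(np.round([[np.vdot(a, b) for b in Q] for a in Q], 10))   # identity matrix
```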

For our experiments in Section 8 we used a different procedure which computes a set of orthonormal HT-tensors together with a right factor , such that

(28)

We now briefly describe this procedure, which we refer to as the HT-QR decomposition. In [12] the computation of joint orthonormal bases (ONB) is described for a set of tensors in HT-representation: at each non-root vertex of the underlying dimension tree , a QR-decomposition of the matrix containing the corresponding frames of the tensors is computed:

(29)

The orthonormal columns of are kept as new (joint, orthonormal) basis at the vertex for all tensors and the right factors are multiplied to the father vertices, where the procedure continues recursively. By this, joint ONB are installed at all non-root vertices , i.e. the new HT-representations of only differ at the root vertex. At the leaf vertices, a QR-decomposition (29) can be computed directly since the frame matrices are stored explicitly. At the interior vertices, (29) can be computed by a respective QR-decomposition of

Notice that the computation of joint ONB for the set only affects their HT-representations while the tensors themselves stay invariant. After the HT-representations of have been transformed into joint ONB, an HT-QR decomposition (28) can easily be obtained by computing a respective QR-decomposition of the root transfer tensors .

The resulting accelerated power iteration is shown in Algorithm 3: we start with Algorithm 2, where we leave out the computation of the estimator in each step, which is now computed in the end as Ritz value of maximum absolute value based on the last iterates.

1;
2 for   do
3       ;
4       ;
5      
;
  // Compute an ONB
6 for   do
7       for   do
             ;
              // Matrix from (27)
8            
9      
10Compute the eigenvalues , , of ;
// Estimator for
Algorithm 3 Acceleration of the power iteration using a Rayleigh-Ritz type orthogonal projection method: the estimator now depends on the latest iterates of the power iteration, .
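
A dense sketch of this Rayleigh-Ritz estimate built on top of the power iteration (full arrays instead of HT tensors, a plain numpy QR instead of the HT-QR decomposition described above): the last p iterates are orthonormalized, the projected matrix from (27) is assembled from Hadamard products, and the Ritz value of maximum absolute value serves as the estimate.

```python
import numpy as np

def ritz_max_norm_estimate(x, steps=30, p=5):
    """Power iteration plus Rayleigh-Ritz projection (dense sketch of Algorithm 3).

    The subspace is spanned by the last p iterates; H[i, j] = <q_i, x ∘ q_j> is
    the projection of the diagonal operator D_x, and the Ritz value of maximum
    absolute value estimates ||x||_inf.
    """
    y = np.ones_like(x)
    y /= np.linalg.norm(y)
    iterates = [y]
    for _ in range(steps):
        z = x * y                                  # Hadamard product x ∘ y_k
        y = z / np.linalg.norm(z)
        iterates.append(y)
    V = np.stack([v.ravel() for v in iterates[-p:]], axis=1)
    Q, _ = np.linalg.qr(V)                         # orthonormal basis of the subspace
    H = Q.T @ (x.ravel()[:, None] * Q)             # H[i, j] = <q_i, x ∘ q_j>, cf. (27)
    ritz = np.linalg.eigvalsh(H)                   # H is symmetric since D_x is diagonal
    return np.max(np.abs(ritz))

rng = np.random.default_rng(6)
x = rng.standard_normal((6, 6, 6))
print(ritz_max_norm_estimate(x), np.max(np.abs(x)))   # estimate vs. exact value
```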

6 Acceleration by squaring

Even though the Rayleigh-Ritz method (Algorithm 3), cf. Section 5, may lead to a notable acceleration of the power iteration (Algorithm 2), cf. Figure 3(a) and Figure 3, there are also cases where the order of convergence stays invariant, cf. Figure 3(b). However, the power iteration combined with the Rayleigh-Ritz method is well suited to generate sufficiently good starting tensors for a faster (but less stable) method, which is presented in this section.

Power iteration with squaring

As stated in the proof of Lemma 4, the -th iterate of Algorithm 2 is the corresponding normalized power of the original tensor (if truncations down to smaller HT-representation rank are neglected):

(30)

The power iteration can be sped up by taking the new iterate

instead of line 2 in Algorithm 2, which results in Algorithm 4, hereinafter referred to as power iteration with squaring.

1;
2 for   do
3       ;
4       ;
5      
Algorithm 4 Power iteration with squaring: for .

Without truncations, the iterates of Algorithm 4 are also normalized powers of the original tensor :

(31)

which yields the corresponding new expression for in line 4 of Algorithm 4.

The comparison of (30) and (31) shows the immense acceleration which can be expected by using Algorithm 4 instead of Algorithm 2: the iterate in step of Algorithm 4 corresponds to the iterate in step of Algorithm 2. Assuming the slow convergence (25) for Algorithm 2, cf. Figures 1(a) and 2(a), the corresponding convergence behavior of Algorithm 4 would be

(32)

i.e. linear convergence with a convergence rate of or better, which is verified by our experiments in Section 8.3, cf. Figure 5.
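
A dense sketch of the squaring iteration follows. To read off an estimate of the maximum norm from the normalized iterates (31), the sketch additionally tracks the accumulated normalization factors in log-scale; this is one natural choice consistent with (31) and not necessarily the exact estimator used in line 4 of Algorithm 4.

```python
import numpy as np

def squaring_max_norm(x, steps=30):
    """Power iteration with squaring (dense sketch of Algorithm 4).

    Without truncation y_k is the normalized (2^k)-fold Hadamard power of x, and
    ||x^{(m)}||_2^(1/m) -> ||x||_inf; the running log-norm s_k = log ||x^{(2^k)}||_2
    turns this into the estimate exp(s_k / 2^k).
    """
    y = x / np.linalg.norm(x)
    log_norm, m = np.log(np.linalg.norm(x)), 1      # s_0 and current power 2^0
    for _ in range(steps):
        z = y * y                                   # squaring: Hadamard product y ∘ y
        nz = np.linalg.norm(z)
        y = z / nz
        log_norm = 2.0 * log_norm + np.log(nz)      # ||x^{(2m)}||_2 = ||x^{(m)}||_2^2 * ||y ∘ y||_2
        m *= 2
    return np.exp(log_norm / m)

rng = np.random.default_rng(7)
x = rng.standard_normal((8, 8, 8))
print(squaring_max_norm(x), np.max(np.abs(x)))      # estimate vs. exact value
```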

Truncation

As in Algorithm 2, in practice truncations are included into Algorithm 4, since otherwise the complexity of the involved tensor arithmetic quickly becomes too large, cf. Section 2.1. Algorithm 4, however, is less stable than Algorithm 2, which can be explained as follows: consider two tensors , , the Hadamard product of which is to be computed. If only is perturbed by a relative error (due to truncation), the computed Hadamard product will have the same relative error (notice that for any -norm, , if ):

(33)

whereas the relative error doubles if and are both afflicted with the relative error: