## 1 Introduction

Tensors, or multi-way arrays, appear in a wide range of applications: signal processing; neuroscientific applications such as electroencephalography; data mining; seismic data processing; machine learning applications such as facial recognition, handwritten digit classification, and latent semantic indexing; imaging; astronomy; and uncertainty quantification. For example, a database of grayscale images constitutes a third-order array when each image is stored as a matrix, while a numerical simulation of a system of partial differential equations (PDEs) in three-dimensional space, tracking several parameters over time, yields a five-dimensional dataset. Often, these datasets are treated as matrices rather than as tensors, so additional structure that could be leveraged to gain insight and lower computational cost frequently goes underutilized or undiscovered.

A key step in processing and studying these datasets involves compression, either to find an economical representation in memory, or to find the principal directions of variability. When working with tensors, there are many possible formats one may consider, and each format is equipped with a different notion of compression and rank. Examples of tensor formats include CANDECOMP/PARAFAC (CP), Tucker, Hierarchical Tucker, and Tensor Train, all of which have their respective benefits (see the surveys [28, 20, 10, 11]). The CP format, which represents a tensor as a sum of rank-one outer products, gives a compact and (under certain conditions) unique representation. Tucker generally finds a better fit for data by estimating the subspace of each mode, while Hierarchical Tucker and Tensor Train are useful for very high-dimensional tensors. In this paper, we focus on the Tucker representation, which is known for its favorable compression properties in a modest number of dimensions (3-7 modes).

Given a multilinear rank, the Tucker form finds a rank representation of a tensor as a product of a core tensor and factor matrices, the latter typically having orthonormal columns. Popular algorithms for compression in the Tucker format can be found in [12, 41, 13], and a survey of approximation techniques can be found in [20]. Depending on how small the target rank of an approximation is compared to the original dimensions, high compression can be achieved. Even when the data does not admit such compression, representing it in the Tucker format can still give insight into its principal directions, since Tucker is also a form of higher-dimensional principal component analysis (PCA).

In recent years, randomized matrix algorithms have gained popularity for developing low-rank matrix approximations (see the reviews [24, 30, 15]). These algorithms are easy to implement, computationally efficient for a wide range of matrices (e.g., sparse matrices, matrices that can be accessed only via matrix-vector products, and dense matrices that are difficult to load in memory), and have accuracy comparable with non-randomized algorithms. There is also well-developed error analysis applicable to several classes of random matrices for randomized algorithms. Even more recently, randomized algorithms have been developed for tensor decompositions (see below for a detailed review). In this paper, we make several contributions by proposing new algorithms that are accompanied by rigorous error analysis. The results in this paper enable the efficient computation of low-rank Tucker decompositions across a wide range of applications. The randomized algorithms developed here exploit the structure of tensor decompositions, while the analysis provides insight into the choice of parameters that control the tradeoff between accuracy and computational efficiency.

##### Contributions and Contents

In Section 2, we first review the necessary background information on tensors and randomized algorithms. Then, in Section 3, we present analyses of randomized versions of HOSVD and STHOSVD (proposed in [47] and [8] respectively). Our contributions on this front include the probabilistic analysis of the randomized versions of these algorithms in expectation, as well as analysis of the associated computational costs. In Section 4 we present adaptive randomized algorithms to compute low-rank tensor decompositions to be used in applications where the target rank is not known beforehand. In Section 5, we present a new randomized compression algorithm for large tensors, which produces a low-rank decomposition whose core tensor has entries taken directly from the tensor of interest. In this sense, the core tensor preserves the structure (e.g., sparsity, non-negativity) of the original tensor. For sparse tensors, our algorithm has the added benefit that the intermediate and final decompositions can be stored efficiently, thus enabling the computation of low-rank tensor decompositions of large, sparse tensors. To supplement the algorithm, we provide a probabilistic analysis in expectation. Finally, in Section 6, we test the performance of all algorithms on several synthetic tensors and real-world datasets, and discuss the performance of the proposed bounds.

##### Related Work

Several randomized algorithms have been proposed for computing low-rank tensor decompositions, e.g., Tucker format [8, 46, 29, 32, 40, 19, 47], CP format [19, 5, 6, 43], t-product [46], tensor networks [4], and Tensor Train format [8, 26]. Our work is most similar to [8, 19, 47]. The algorithm for randomized HOSVD is presented in [47], and the corresponding analysis is presented in [19] (both unpublished manuscripts). Randomized and adaptive versions of the STHOSVD were proposed and analyzed in [8], but our manuscript uses a different distribution of random matrices (see Section 3 for a justification of our choice), and provides bounds in expectation. To our knowledge, our proposed algorithm for producing structure-preserving tensor decompositions and the corresponding error analysis are novel. Related to this algorithm is the CUR-type decomposition for tensors proposed in [34, 14]. In contrast, our algorithm produces decompositions in which the core tensor (rather than the factor matrices, in the aforementioned references) retains entries from the original tensor.

## 2 Background

In this section, we introduce the necessary background information for working with tensors, and review the standard compression algorithms. We also discuss the optimal approximation of a tensor for comparison purposes. Finally, we review the relevant background for randomized matrix algorithms, specifically the randomized SVD.

### 2.1 Notation and preliminaries

We denote a -mode tensor with entries

A tensor can be “unfolded” into a matrix by reordering the elements, and this process is known as matricization. There are different unfoldings for a -mode tensor. Each mode- unfolding arranges the resulting matrix such that the columns are the mode- fibers of the tensor. The mode- unfolding is denoted as for .
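As a concrete illustration, a mode-$k$ unfolding can be sketched in NumPy. This is a hedged sketch: the helper name and the column ordering are our own choices (conventions for ordering the columns differ across the literature), not the paper's.

```python
import numpy as np

def unfold(X, mode):
    # Move the chosen mode to the front, then flatten the remaining modes
    # in row-major order, so each column of the result is a mode-`mode`
    # fiber of X.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X = np.arange(24).reshape(2, 3, 4)
print(unfold(X, 1).shape)  # (3, 8): 3 rows, one column per mode-1 fiber
```

Each unfolding is just a reshuffling of the same entries, so no information is lost or duplicated.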

##### Tensor product

The tensor product (or mode product) is a fundamental operation for multiplying a tensor by a matrix. Given a matrix , the mode- product of a tensor with is denoted , and has dimension . More specifically, the product can be expressed in terms of the entries of the tensor as
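The entrywise definition of the mode product can be sketched directly with `einsum` (a three-mode illustration; the array names and sizes are our own):

```python
import numpy as np

X = np.random.rand(4, 5, 6)   # a 3-mode tensor
A = np.random.rand(3, 5)      # matrix applied along mode 1 (size 5)

# Entrywise: (X x_1 A)[i, j, k] = sum_l A[j, l] * X[i, l, k]
Y = np.einsum('jl,ilk->ijk', A, X)
print(Y.shape)  # (4, 3, 6): mode 1 is replaced by A's row dimension
```

Only the contracted mode changes size; all other modes are untouched.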

The tensor product can also be expressed as the product of two matrices. That is, we can write for . The following lemma will be useful in our analysis.

###### Lemma 1

Let and let be a sequence of orthogonal projectors. Then for ,

The proof of this lemma can be found in [41, Theorem 5.1].

##### Tucker representation

The Tucker format of a tensor of rank consists of a core tensor and factor matrices with each . For short, it is written as and represents .

Note that storing a tensor in Tucker form is beneficial as it requires less storage than a full tensor when the target rank is significantly less than the original dimension. For a -mode tensor and target rank with , the cost of storing the Tucker form of is , compared to for a full tensor.
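To make the savings concrete, assume a $d$-mode tensor with every dimension equal to $n$ and every target rank equal to $r$; the Tucker form then stores roughly $r^d + dnr$ numbers (core plus factor matrices) versus $n^d$ for the full tensor. A back-of-the-envelope sketch:

```python
n, r, d = 200, 10, 3
full_storage = n ** d                 # dense tensor: n^d entries
tucker_storage = r ** d + d * n * r   # core r^d plus d factor matrices (n x r)
print(full_storage, tucker_storage)   # 8000000 7000
print(full_storage / tucker_storage)  # compression factor of roughly 1143x
```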

##### Kronecker products

The Kronecker product of two matrices and is

We also note some properties of Kronecker products that will be useful in our analysis, namely

Kronecker products are also useful for expressing tensor mode products in terms of matrix-matrix multiplications. Suppose , then

(1)
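Identity (1) can be checked numerically. Note that the ordering of the factors inside the Kronecker product depends on the unfolding convention; with the row-major unfolding used in this sketch (an assumption, as are all the names below), the remaining factors appear in their natural order:

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

rng = np.random.default_rng(0)
X = rng.random((3, 4, 5))
A0, A1, A2 = (rng.random((2, n)) for n in X.shape)

# Y = X x_0 A0 x_1 A1 x_2 A2
Y = np.einsum('ia,jb,kc,abc->ijk', A0, A1, A2, X)

# Mode-0 unfolding identity: Y_(0) = A0 X_(0) (A1 kron A2)^T
lhs = unfold(Y, 0)
rhs = A0 @ unfold(X, 0) @ np.kron(A1, A2).T
print(np.allclose(lhs, rhs))  # True
```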

### 2.2 HOSVD/STHOSVD

The Higher Order SVD (HOSVD) and Sequentially Truncated Higher Order SVD (STHOSVD) are two popular algorithms for computing low-rank tensor decompositions in the Tucker format. Given a -mode tensor and target rank , both algorithms give a compressed representation for the tensor in the Tucker format where is the core tensor of reduced rank, and are factor matrices such that . The factor matrices all have orthonormal columns, and each .

##### HOSVD

In the HOSVD algorithm, each mode is handled separately. The factor matrix is formed from the first left singular vectors of . Once all factor matrices are found, the core tensor is formed by . The error in approximating using the HOSVD depends on the error in each mode, as shown in the following theorem, the proof of which can be found in [41, Theorem 5.1].

###### Theorem 1

Let be the rank- approximation to -mode tensor using the HOSVD algorithm. Then

This theorem says that the error in the rank approximation of the tensor

computed using the HOSVD is the sum of squares of the discarded singular values from each mode unfolding. To simplify the upper bound, we introduce the notation

(2)

With this notation, the error in the HOSVD satisfies .
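A bare-bones HOSVD can be sketched as follows. This is an illustration only, with our own helper names; truncated SVDs of each unfolding give the factors, and the core is the projection of the tensor onto those factors:

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_prod(X, A, mode):
    # X x_mode A, implemented by contracting A with axis `mode` of X
    return np.moveaxis(np.tensordot(A, X, axes=(1, mode)), 0, mode)

def hosvd(X, ranks):
    # Factor for each mode: leading left singular vectors of the unfolding.
    U = [np.linalg.svd(unfold(X, k), full_matrices=False)[0][:, :r]
         for k, r in enumerate(ranks)]
    # Core: project X onto the factor subspaces, G = X x_1 U1^T ... x_d Ud^T
    G = X
    for k, Uk in enumerate(U):
        G = mode_prod(G, Uk.T, k)
    return G, U

X = np.random.rand(6, 7, 8)
G, U = hosvd(X, (2, 3, 4))
print(G.shape)  # (2, 3, 4)
```

When the target ranks equal the original dimensions, the factors are square orthogonal matrices and the reconstruction is exact.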

##### STHOSVD

An alternative to the HOSVD is the sequentially truncated HOSVD (STHOSVD) algorithm which also produces a compressed representation in the Tucker format. In contrast to HOSVD which processes the modes independently, the STHOSVD processes the modes sequentially. This makes the order in which the modes are processed important, since we may obtain different approximations by using different processing orders. At each stage, the core tensor is unfolded (initialized as the tensor ) and the factor matrix is obtained by taking the first singular vectors. A new core tensor is obtained by projecting the core tensor onto the subspace spanned by the columns of the factor matrix. A characteristic feature of the STHOSVD is that the core tensor shrinks at each iteration thus making the later modes cheaper to compute.

Given a tensor and the processing order , the rank- STHOSVD approximation of is , where each factor matrix has orthonormal columns, and the core tensor is defined as At the -th step of the process, we have a partially truncated core tensor defined as Then the -th partial approximation, of rank , can be defined as The algorithm is initialized with .
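A minimal STHOSVD sketch, under the same assumed helpers; note how the working core shrinks after each processed mode:

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def sthosvd(X, ranks, order=None):
    order = range(X.ndim) if order is None else order
    G, U = X, [None] * X.ndim
    for k in order:
        r = ranks[k]
        Uk, s, Vt = np.linalg.svd(unfold(G, k), full_matrices=False)
        U[k] = Uk[:, :r]
        shape = list(G.shape)
        shape[k] = r
        # Project the core: U_k^T G_(k) = diag(s_r) V_r^T, then refold.
        G = fold(np.diag(s[:r]) @ Vt[:r], k, shape)
    return G, U

X = np.random.rand(6, 7, 8)
G, U = sthosvd(X, (2, 3, 4))
print(G.shape)  # (2, 3, 4)
```

After mode $k$ is processed, its dimension in the working core drops from the original size to the target rank, which is why the later SVDs are cheaper.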

The approximation error in this case is the sum of errors in the successive approximations, and has the same upper bound as that of HOSVD. This is shown in the following theorem, which assumes that the processing order is . If a different processing order is taken, the bound will remain the same, so this assumption is taken for ease of notation. The proof of this theorem can be found in [41, Theorem 6.5].

###### Theorem 2

Let be the rank- STHOSVD approximation to -mode tensor with the processing order of modes. Then

The computational cost of the STHOSVD is lower than that of the HOSVD, as established in [41] and reviewed in Section 3.3. Although the errors of the HOSVD and the STHOSVD satisfy the same upper bound, it is not clear which algorithm attains a lower error. There is strong numerical evidence that the STHOSVD typically has a lower error, although counterexamples to this claim have been found [41]. For these reasons, the STHOSVD is preferable to the HOSVD, since it has a lower cost and the same worst-case error bound. A downside of the STHOSVD is that the processing order has to be determined in advance; some heuristics for this choice are given in [41].

### 2.3 Best Approximation

We would like to find an optimal rank- approximation of a given tensor , which we will call . Let . Then is an optimal tensor which satisfies the condition

The Eckart-Young theorem [18] states that the optimal rank- approximation to a matrix can be constructed using the SVD truncated to rank-. Unfortunately, no analog of this result for Tucker forms exists in higher dimensions. The existence of is guaranteed by [23, Theorem 10.8]. In general, computing requires the solution of an optimization problem. In [13], the higher order orthogonal iteration (HOOI) was proposed to compute the “best” approximation by generating a sequence of iterates that cycle through the modes sequentially. Because the HOOI algorithm requires repeated computations with the tensor , its implementation for large-scale tensors is challenging due to the overwhelming computational cost. Although neither the HOSVD nor the STHOSVD produces an optimal rank- approximation, they do satisfy the inequality

(3)

The proofs are available for the HOSVD (Theorem 10.3) and STHOSVD (Theorem 10.5) in [23]. The proof requires the observation that

(4)

We highlight this inequality since it will be important for our subsequent analysis. The inequality Eq. 3 suggests that the outputs of the HOSVD and the STHOSVD are accurate for low dimensions and can be employed in three different ways: as approximations to , as starting guesses for the HOOI algorithm, or to fit CP models.
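A bare-bones HOOI iteration can be sketched as follows. This is an alternating scheme in the spirit of [13], written with our own helper names as an illustration, not the reference implementation:

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_prod(X, A, mode):
    return np.moveaxis(np.tensordot(A, X, axes=(1, mode)), 0, mode)

def hooi(X, ranks, iters=10):
    # Initialize the factors with an HOSVD.
    U = [np.linalg.svd(unfold(X, k), full_matrices=False)[0][:, :r]
         for k, r in enumerate(ranks)]
    for _ in range(iters):
        for k in range(X.ndim):
            # Project onto all factors except mode k, then update U_k.
            Y = X
            for j in range(X.ndim):
                if j != k:
                    Y = mode_prod(Y, U[j].T, j)
            U[k] = np.linalg.svd(unfold(Y, k),
                                 full_matrices=False)[0][:, :ranks[k]]
    # Core tensor from the final factors.
    G = X
    for k in range(X.ndim):
        G = mode_prod(G, U[k].T, k)
    return G, U
```

Each sweep requires repeated mode products with the full tensor, which is the cost concern noted above for large-scale tensors.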

### 2.4 Randomized SVD

It is well known that computing the full SVD of a matrix costs , assuming . When the dimensions of a matrix are very large, the computational cost of a full SVD may be prohibitively expensive. The randomized SVD, popularized by [24], is a computationally efficient way to compute a rank- approximation of a matrix . Assuming that is approximately low-rank or has rapidly decaying singular values, the randomized SVD delivers a good low-rank representation of the matrix, and is much more computationally efficient than the full SVD across a wide range of matrices.

We now describe the randomized SVD algorithm for computing a rank- approximation to , where the target rank . The randomized algorithm has two distinct stages: the range-finding stage, and the postprocessing stage that computes the low-rank approximation. In the range-finding stage, we first draw a random matrix , where is the desired target rank, and is a small nonnegative integer used as an oversampling parameter. While many choices for the distribution of the random matrix are possible, in this paper we use the Gaussian distribution, i.e., the entries are independent and identically distributed random variables. Then, we compute the matrix whose columns consist of random linear combinations of the columns of . By taking a thin QR factorization , we obtain a matrix with orthonormal columns whose span approximates the range of . If the matrix is approximately rank , then is a good approximation for , since the range of is characterized by just the first left singular vectors. We can then express . In the second stage, we convert the low-rank representation into the SVD format. To this end, we compute the thin SVD of , giving . We truncate this representation to rank by retaining only the first diagonal elements of and dropping the corresponding columns from and . Finally, we compute , and obtain the low-rank approximation . Algorithm 1 summarizes the process.

We briefly review the computational cost of the RandSVD algorithm. Let denote the computational cost of a matrix-vector product (matvec) involving the matrix . Then, the cost of the algorithm is

where denotes the number of nonzeros of .
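The two-stage procedure can be sketched as follows. This is a standard randomized SVD in the spirit of [24]; the function and parameter names are ours:

```python
import numpy as np

def rand_svd(A, r, p=5, rng=None):
    rng = np.random.default_rng(rng)
    # Stage 1: range finding with a Gaussian test matrix and thin QR.
    Omega = rng.standard_normal((A.shape[1], r + p))
    Q, _ = np.linalg.qr(A @ Omega)
    # Stage 2: thin SVD of the small matrix B = Q^T A, then truncate to rank r.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :r], s[:r], Vt[:r]

# An exactly low-rank matrix is recovered to floating-point accuracy.
rng = np.random.default_rng(1)
A = rng.random((100, 8)) @ rng.random((8, 60))  # rank at most 8
U, s, Vt = rand_svd(A, 8)
print(np.allclose((U * s) @ Vt, A))  # True
```

The expensive operations are the two passes over `A` (the sketch `A @ Omega` and the projection `Q.T @ A`); everything else acts on small matrices of width `r + p`.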

An error bound for Algorithm 1 in the Frobenius norm is presented below, and we will use this result frequently in our analysis. This theorem and its proof can be found in [46, Theorem 3].

###### Theorem 3

Let , and be a Gaussian random matrix. Suppose is obtained from Algorithm 1 with inputs target rank and oversampling parameter such that , and let be the rank- truncated SVD of . Then, the error in expectation satisfies

###### Remark

We will use two slightly different formulations of this theorem in our later results. Instead of , we will use . It is straightforward to show the equivalence between the two forms; see [35, Section 5.3] for the explicit details. We will also use

(5)

which can be obtained by applying Hölder’s inequality [27, Theorem 23.10] to the stated result in the theorem.

We could use other distributions for the random matrix , but probabilistic bounds in expectation such as Theorem 3 are only known for the Gaussian distribution. Large-deviation bounds exist for other distributions, but we do not consider them here.

## 3 Randomized HOSVD/STHOSVD

In this section, we present randomized algorithms that are modified versions of the HOSVD and STHOSVD. We also develop rigorous error analysis and compare the two algorithms in terms of computational cost.

### 3.1 Algorithms

As mentioned earlier, the HOSVD algorithm first computes an SVD of each mode unfolding to construct the factor matrices, which are then used to form the core tensor. Computing this decomposition for large, high-dimensional tensors can be prohibitively expensive. To address this computational cost, we replace the full SVD of each mode unfolding with a randomized SVD to construct the factor matrix. The procedure to compute the core tensor remains unchanged. This is reflected in Algorithm 2, which we call the R-HOSVD algorithm. To our knowledge, this was first proposed in [47].

The randomized version of STHOSVD is obtained in a similar way: at each step, the SVD of the unfolded core tensor is replaced with a randomized SVD. This is shown in Algorithm 3 (we call this R-STHOSVD) and is similar to the algorithm proposed in [8]. The major difference of Algorithm 3 compared to [8] is the distribution of the random matrices . The authors in [8] advocate constructed as a Khatri-Rao product of Gaussian random matrices, as opposed to the standard Gaussian random matrices we use. Their main reason for avoiding standard Gaussian random matrices appears to be the high storage cost; however, we note that these matrices need not be stored explicitly, since their entries may be generated on the fly, either column-wise or in appropriately sized blocks. We next analyze the error in the decompositions produced by the R-HOSVD and R-STHOSVD algorithms.
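Putting the pieces together, R-STHOSVD replaces each SVD in the sequential loop with a randomized range finder followed by a small SVD for the truncation. A sketch under the same assumed helpers, using a standard Gaussian test matrix generated per mode:

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def r_sthosvd(X, ranks, p=5, rng=None):
    rng = np.random.default_rng(rng)
    G, U = X, [None] * X.ndim
    for k in range(X.ndim):
        r = ranks[k]
        Gk = unfold(G, k)
        # Randomized range finder on the current core unfolding.
        Omega = rng.standard_normal((Gk.shape[1], r + p))
        Q, _ = np.linalg.qr(Gk @ Omega)
        # Small SVD for the truncation step.
        B = Q.T @ Gk
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        U[k] = Q @ Ub[:, :r]
        shape = list(G.shape)
        shape[k] = r
        # New core unfolding equals U_k^T G_(k) = Ub_r^T B.
        G = fold(Ub[:, :r].T @ B, k, shape)
    return G, U
```

Because the core shrinks after each processed mode, the sketch `Gk @ Omega` gets cheaper with every step, mirroring the cost advantage of the deterministic STHOSVD.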

### 3.2 Error Analysis

In the results below, we assume that the matrices are standard Gaussian random matrices of appropriate sizes.

###### Theorem 4 (Randomized HOSVD)

Let be the output of Algorithm 2 with inputs target rank and oversampling parameter . Furthermore, assume that satisfies for . Then, the expected error in the approximation is

(6)

(7)

###### Proof

From Lemma 1, we can write

where the equality comes from linearity of expectations and the independence of for each mode . We can unfold each term in the summation as . Then, by applying Theorem 3, we can bound the expected value of the squared error in each mode to obtain

Finally, Hölder’s inequality gives

For the second inequality, recall that from Eq. 4. Thus, combined with the previous inequality, we have

To compare this result to the approximation error obtained using the HOSVD algorithm, we consider a few special cases. Let . Then, if , this factor becomes

Similarly, if we choose for some , the error satisfies

This shows that the application of a randomized SVD in each mode of the tensor does not seriously deteriorate the accuracy compared to using an SVD.

Now consider the randomized STHOSVD approximation. When bounding the error in expectation in this case, it is important to note that at each intermediate step, the partially truncated core tensor is a random tensor. This is in contrast to the R-HOSVD, where we only needed to account for for each mode because the operations are independent across the modes. Computing the modes sequentially means that, when processing mode , we must account for all the random matrices , where .

For this theorem, we use the same notation introduced in Section 2.2 for the STHOSVD, in that the partially truncated core tensor at step is , giving a partial approximation .

###### Theorem 5 (Randomized STHOSVD)

Let be the output of Algorithm 3 with inputs target rank , processing order , and oversampling parameter . Furthermore, assume that satisfies for . Then, the approximation error in expectation is

(8)

(9)

###### Proof

We first assume that the processing order is . The first equality in Lemma 1 and the linearity of expectations together give

(10)

We have used the fact that the -th term in the summation does not depend on the random matrices . We first consider . Since all the ’s are independent, we can write the expectation in an iterated form as

The -th term, which measures the difference in the sequential iterates, can be expressed as

Now let

If we unfold the difference along the -th mode, using Eq. 1 we have

(11)

The inequality holds because has orthonormal columns for every .

Now, let for simplicity. We take expectations and bound this last quantity in Eq. 11 using Theorem 3 (keeping fixed), as

(12)

We recall the definition and properties of the Loewner partial ordering [25, Section 7.7]. Let be symmetric; means that is positive semidefinite. If , then . Furthermore, for . Since has orthonormal columns, is a projector, so that

and the singular values of , which are squared eigenvalues of , satisfy

(13)

To summarize, Eqs. 13, 12, 11 and 10 combined give

The equality follows since the tensor is deterministic. Finally, we have by Hölder’s inequality and Eq. 2 that

In the general case, when the processing order does not equal , the proof is similar; we only need to relabel the modes according to the processing order. We omit the details.

We make several observations regarding Theorem 5. First, the upper bound for the error is the same for the R-STHOSVD as for the R-HOSVD (Theorem 4). Second, this result says that the error bound for R-STHOSVD is independent of the processing order. This means that while some processing orders may result in more accurate decompositions, every processing order has the same worst-case error bound. Our recommendation is to pick a processing order that minimizes the computational cost; see Section 3.3 for details. Third, the discussion following Theorem 4 regarding the choice of the oversampling parameter is applicable to the R-STHOSVD as well.

##### Extensions

There are several possible extensions of our results. First, we can extend this analysis to develop concentration results that give insight into the tail bounds; these can be obtained by combining our analysis with the results from, e.g., [21, Theorem 5.8]. Second, other distributions of random matrices may be used instead of standard Gaussian random matrices; examples include Rademacher random matrices, sparse Rademacher random matrices, and subsampled randomized Fourier transforms. It is also possible to combine the analysis with the probabilistic bounds for these other distributions. Typically, expectation bounds of the type presented in Theorems 4 and 5 are only possible for standard Gaussian random matrices; for other distributions, one can only develop tail bounds. We do not pursue these extensions here but may consider them in future work.

### 3.3 Computational Cost

We discuss the computational costs of the proposed randomized algorithms and compare them against the HOSVD and the STHOSVD algorithms. We make the following assumptions. First, we assume that the tensors are dense and our implementations take no advantage of their structure. Second, we assume that the target ranks in each dimension are sufficiently small, i.e., so that we can neglect the computational cost of the QR factorization and the truncation steps of the RandSVD algorithm. Third, we assume that the random matrices used in the algorithms are standard Gaussian random matrices. If other distributions are used, the computational cost may be lower. Finally, for the STHOSVD and R-STHOSVD algorithms, we assume that the processing order is

The computational cost of both the HOSVD and the STHOSVD was discussed in [41], and is reproduced in Table 1. In this paper, we also provide an analysis of the computational costs of R-HOSVD and R-STHOSVD, which is summarized in Table 1. The table includes the costs both for a general tensor with target rank , and for the special case when with target rank . For ease of notation, we denote the product by , and similarly for . The dominant costs of each algorithm lie in computing the SVDs of the unfoldings (the first term in each summation) and forming the core tensor (the second term in each summation).

Algorithm | Cost for | Cost for
---|---|---
HOSVD | |
R-HOSVD | |
STHOSVD | |
R-STHOSVD | |

##### Processing order

In all the previous analyses of the R-STHOSVD algorithm, we took the processing order of the modes to be . The error in the approximation depends on the choice of processing order; however, Theorem 5 shows that the worst-case error bound is independent of that order. For this reason, we choose a processing order that minimizes the computational cost. Since the dominant cost at each step is a randomized SVD with a cost of