# Low-Rank Tucker Approximation of a Tensor From Streaming Data

This paper describes a new algorithm for computing a low-Tucker-rank approximation of a tensor. The method applies a randomized linear map to the tensor to obtain a sketch that captures the important directions within each mode, as well as the interactions among the modes. The sketch can be extracted from streaming or distributed data or with a single pass over the tensor, and it uses storage proportional to the degrees of freedom in the output Tucker approximation. The algorithm does not require a second pass over the tensor, although it can exploit another view to compute a superior approximation. The paper provides a rigorous theoretical guarantee on the approximation error. Extensive numerical experiments show that that the algorithm produces useful results that improve on the state of the art for streaming Tucker decomposition.

## Authors

• 7 publications
• 10 publications
• 1 publication
• 2 publications
• 34 publications
• ### Streaming PTAS for Binary ℓ_0-Low Rank Approximation

We give a 3-pass, polylog-space streaming PTAS for the constrained binar...
09/25/2019 ∙ by Anup Bhattacharya, et al. ∙ 0

• ### Single Pass PCA of Matrix Products

In this paper we present a new algorithm for computing a low rank approx...
10/21/2016 ∙ by Shanshan Wu, et al. ∙ 0

• ### Fixed-Rank Approximation of a Positive-Semidefinite Matrix from Streaming Data

Several important applications, such as streaming PCA and semidefinite p...
06/18/2017 ∙ by Joel A. Tropp, et al. ∙ 0

• ### Low-Rank Tensor Decomposition via Multiple Reshaping and Reordering Operations

Tensor decomposition has been widely applied to find low-rank representa...
05/22/2018 ∙ by Chao Li, et al. ∙ 0

• ### Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

This paper concerns a fundamental class of convex matrix optimization pr...
02/22/2017 ∙ by Alp Yurtsever, et al. ∙ 0

• ### A Count Sketch Kaczmarz Method For Solving Large Overdetermined Linear Systems

In this paper, combining count sketch and maximal weighted residual Kacz...
04/06/2020 ∙ by Yanjun Zhang, et al. ∙ 0

• ### Universal Uhrig dynamical decoupling for bosonic systems

We construct efficient deterministic dynamical decoupling schemes protec...
10/16/2018 ∙ by Margret Heinze, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Large-scale datasets with natural tensor (multidimensional array) structure arise in a wide variety of applications including computer vision

[37], neuroscience [10], scientific simulation [4], sensor networks [28], and data mining [20]. In many cases, these tensors are too large to manipulate, to transmit, or even to store in a single machine. Luckily, tensors often exhibit a low-rank structure, and can be approximated by a low-rank tensor factorization, such as CANDECOMP/PARAFAC (CP), tensor train, or Tucker factorization [19]. These factorizations reduce the storage costs by exposing the latent structure. Sufficiently low rank tensors can be compressed by several orders of magnitude with negligible loss. However, computing these factorizations can require substantial computational resources. Indeed, one particular challenge is that these large tensors may not fit in main memory on our computer.

In this paper, we develop a new algorithm to compute a low-rank Tucker approximation for a tensor from streaming data, using storage proportional to the degrees of freedom in the output Tucker approximation. The algorithm forms a linear sketch of the tensor, and operates on the sketch to compute a low-rank Tucker approximation. Importantly, the main computational work is all performed on a small tensor, of size proportional to the core tensor of the Tucker factorization. We derive detailed probabilistic error bounds on the quality of the approximation in terms of the tail energy of any matricization of the target tensor.

This algorithm is useful in at least three concrete problem settings:

1. Streaming: Data from the tensor is generated sequentially. At each time stamp, we may observe a low dimensional slice, an individual entry, or an additive update to the tensor (the so-called “turnstile” model [25]). For example, each slice of the tensor may represent a subsequent time step in a simulation, or sensor measurements at a particular time. In the streaming setting, the complete tensor is not stored; indeed, it may be much larger than available computing resources.

Our algorithm can approximate tensors revealed via streaming updates by sketching the updates and storing the sketch. Linearity of the sketch guarantees that sketching commutes with slice, entrywise, or additive updates. Our method forms an approximation of the tensor only after all the data has been observed, rather than approximating the tensor-observed-so-far at any time. This protocol allows for offline data analysis, including many scientific applications. Conversely, this protocol is not suitable for real-time monitoring.

2. Limited memory: Data describing the tensor is stored on a hard disk of a computer with much smaller RAM. This setting reduces to the streaming setting by streaming the data from disk.

3. Distributed: Data describing the tensor may be stored on many different machines. Communicating data between these machines may be costly due to low network bandwidth or high latency. Our algorithm can approximate tensors stored in a distributed computing enviroment by sketching the data on each slave machine and transmitting the sketch to a master, which computes the sum of the sketches. Linearity of the sketch guarantees that the sum of the sketches is the sketch of the full tensor.

In the streaming setting, the tensor is not stored, so we require an algorithm that can compute an approximation from a single pass over the data. In constrast, multiple passes over the data are possible in the memory-limited or distributed settings.

This paper presents algorithms for all these settings, among other contributions:

• We present a new method to form a linear sketch an unknown tensor. This sketch captures both the principal subspaces of the tensor along each mode, and the action of the tensor that links these subspaces. This tensor sketch can be formed from any dimension reduction map.

• We develop a practical one-pass algorithm to compute a low rank Tucker approximation from streaming data. The algorithm sketches the tensor and then recovers a low rank Tucker approximation from this sketch.

• We propose a two-pass algorithm that improves on the the one-pass method. Both the one-pass and two-pass methods are appropriate in a limited memory or distributed data setting.

• We develop provable probabilistic guarantees on the performance of both the one-pass and two-pass algorithms when the tensor sketch is composed of Gaussian dimension reduction maps.

• We exhibit several random maps that can be used to sketch the tensor. Compared to the Gaussian map, others are cheaper to store, easier to apply, and deliver similar performance experimentally in tensor approximation error. In particular, we demonstrate the effective performance of a row-product random matrix, which we call the Tensor Random Projection (TRP), which uses exceedingly low storage.

• We perform a comprehensive simulation study with synthetic data, and consider applications to several real datasets, to demonstrate the practical performance of our method. Our methods reduce approximation error compared to the only existing one-pass Tucker approximation algorithm [24] by more than an order of magnitude given the same storage budget.

• We have developed and released an open-source package in python that implements our algorithms.

## 2 Background and Related Work

### 2.1 Notation

Our paper follows the notation of [19]. We denote scalar, vector, matrix, and tensor variables respectively by lowercase letters (), boldface lowercase letters (), boldface capital letters (), and boldface Euler script letters (

). For two vectors

and , we write if is greater than elementwise.

Define . For a matrix , we denote its row, column, and element as , , and , respectively, for , . We use to denote the Moore–Penrose pseudoinverse of the matrix . In particular, if and has full column rank; , if and has full row rank.

#### 2.1.1 Kronecker and Khatri-Rao product

For two matrices and , we define the Kronecker product as

 A⊗B=⎡⎢ ⎢⎣A(1,1)B⋯A(1,K)B⋮⋱⋮A(I,1)B⋯A(I,K)B⎤⎥ ⎥⎦. (2.1)

For , we define the Khatri-Rao product as , i.e. the “matching column-wise” Kronecker product. The resulting matrix of size is defined as

 A⊙B=[A(⋅,1)⊗B(⋅,1)⋯A(⋅,K)⊗B(⋅,K)]

#### 2.1.2 Tensor basics

For a tensor , its mode or order is the number of dimensions. If , we denote as . The inner product of two tensors is defined as . The Frobenius norm of is .

#### 2.1.3 Tensor unfoldings

Let and , and let denote the vectorization of . The mode- unfolding of is the matrix . The inner product for tensors matches that of any mode- unfolding:

 ⟨X,Y⟩=⟨X(n),Y(n)⟩=Tr((X(n))⊤Y(n)). (2.2)

#### 2.1.4 Tensor rank

The mode- rank is the rank of the mode- unfolding. We say the rank of is if its mode-n rank is for each . This notion of rank corresponds to the size of the core tensor in a Tucker factorization of . A superdiagonal tensor generalizes a diagonal matrix: all entries are zero except for the entries whose indices in each dimension are equal.

#### 2.1.5 Tensor contractions

Write for the mode- (matrix) product of with . That is, . The tensor has dimension . Mode products with respect to different modes commute: for ,

 X×nU×mV=X×mV×nUifn≠m.

Mode products along the same mode simplify: for , ,

 X×nA×nB=X×n(BA).

### 2.2 Tucker Approximation

Given a tensor and target rank , the idea of Tucker approximation is find a core tensor and orthogonal matrices for , called factor matrices, so that

 X≈G×1U1×⋯×NUN.

For brevity, we define . Any best rank- Tucker approximation is of the form , where solve the problem

 minimize∥X−G×1×…Un+1×NUN∥2Fsubject toU⊤nUn=I. (2.3)

The problem Eq. 2.3 is a challenging nonconvex optimization problem. Moreover, the solution is not unique [19]. We use the notation to represent a best rank- Tucker approximation of the tensor , which in general we cannot compute.

#### 2.2.1 Hosvd

The standard approach to computing a rank Tucker approximation for a tensor

begins with the higher order singular value decomposition (HOSVD)

[13, 35], (Algorithm 1):

The HOSVD can be computed in two passes over the tensor [42, 8]. We describe this method briefly here, and in more detail in the next section. In the first pass, sketch each matricization , , and use randomized linear algebra (e.g., the randomized range finder of [15]) to (approximately) recover its range . To form the core requires a second pass over , since the factor matrices depend on . The main algorithmic contribution of this paper is to develop a method to approximate both the factor matrices and the core in just one pass over .

#### 2.2.2 Hooi

The higher order orthogonal iteration (HOOI) [13] (Algorithm 2) improves on the resulting Tucker factorization by repeatedly minimizing the objective of Eq. 2.3 over the the core and the factor matrices.

Notice the core update (2.5) admits the closed form solution , which motivates the second step of HOSVD.

### 2.3 Previous Work

The only previous work on streaming Tucker approximation is [24], which develops a streaming method called Tucker TensorSketch (T.-TS) [24, Algorithm 2]. T.-TS improves on HOOI by sketching the data matrix in the least squares problems. However, the success of the approach depends on the quality of the initial core and factor matrices, and the alternating least squares algorithm takes several iterations to converge.

In contrast, our work is motivated by HOSVD (not HOOI), and requires no initialization or iteration. We treat the tensor as a multilinear operator. The sketch identifies a low-dimensional subspace for each input argument that captures the action of the operator. The reconstruction produces a low-Tucker-rank multilinear operator with the same action on this low-dimensional tensor product space. This linear algebraic view allows us to develop the first guarantees on approximation error for this class of problems111 The guarantees in [24] hold only when a new sketch is applied for each subsequent least squares solve; the resulting algorithm cannot be used in a streaming setting. In contrast, the practical streaming method T.-TS fixes the sketch for each mode, and so has no known guarantees. Interestingly, experiments in [24] show that the method achieves lower error using a fixed sketch (with no guarantees) than using fresh sketches at each iteration. . Moreover, we show in numerical experiments that our algorithm achieves a better approximation of the original tensor given the same memory resources.

More generally, there is a large literature on randomized algorithms for matrix factorizations and for solving optimization problems; see e.g. the review articles [15, 39]. In particular, our method is strongly motivated by the recent papers [32, 33], which provide methods for one-pass matrix approximation. The novelty of this paper is in our design of a core sketch (and reconstruction) for the Tucker decomposition, together with provable performance guarantees. The proof requires a careful accounting of the errors resulting from the factor sketches and from the core sketch. The structure of the Tucker sketch guarantees that these errors are independent.

Many researchers have used randomized algorithms to compute tensor decompositions. For example, [38, 7] apply sketching techniques to the CP decomposition, while [34] suggests sparsifying the tensor.Several papers aim to make Tucker decomposition efficient in the limited-memory or distributed settings [6, 42, 4, 18, 22, 8].

## 3 Dimension Reduction Maps

In this section, we first introduce some commonly used randomized dimension reduction maps together with some mathematical background, and explain how to calculate and update sketches.

### 3.1 Dimension Reduction Map

Dimension reduction maps (DRMs) take a collection of high dimensional objects to a lower dimensional space while preserving certain geometric properties [26]. For example, we may wish to preserve the pairwise distances between vectors, or to preserve the column space of matrices. We call the output of a DRM on an object a sketch of .

Some common DRMs include matrices with i.i.d. Gaussian entries or i.i.d.

entries. The Scrambled Subsampled Randomized Fourier Transform (SSRFT)

[40] and sparse random projections [1, 23] can achieve similar performance with fewer computational and storage requirements; see Appendix G for details.

Our theoretical bounds rely on properties of the Gaussian DRM. However, our numerical experiments indicate that many other DRMs yield qualitatively similar results; see, e.g., Fig. 1, Fig. 11 and Fig. 10) in Appendix J.

### 3.2 Tensor Random Projection

Here we present a strategy for reducing the storage of the random map that makes use of the tensor random projection (TRP), and extremely low storage structured dimension reduction map proposed in [29]. The tensor random projection (TRP) is defined as the iterated Khatri-Rao product of DRMS , :

 Ω=A1⊙⋯⊙AN. (3.1)

Each can be a Gaussian map, sign random projection, SSRFT, etc. The number of constituent maps and their dimensions for are parameters of the TRP, and control the quality of the map; see [29] for details. The TRP map is a row-product random matrix, which behaves like a Gaussian map in many respects [27]. Our experimental results confirm this behavior.

Supposing each is the same for , the TRP can be formed (and stored) using only random variables, while standard dimension reduction maps use randomness (and storage) that grows as when applied to a generic (dense) tensor. Table 1 compares the computational and storage costs for different DRMs.

We do not need to explicitly form or store the TRP map . Instead, we can store its constituent DRMs and compute the action of the map on the matricized tensor using the definition of the TRP. The additional computation required is minimal and empirically incurs almost no performance loss.

## 4 Algorithms for Tucker approximation

In this section, we present our proposed tensor sketch and algorithms for one- and two-pass Tucker approximation, and discuss the computational complexity and storage cost of these methods for both sparse and dense input tensors. We present guarantees for these methods in Section 5.

### 4.1 Tensor compression via sketching

##### The Tucker sketch

Our Tucker sketch generalizes the matrix sketch of [32] to higher order tensors. To compute a Tucker sketch for tensor with sketch size parameters and , draw independent, random DRMs

 Ω1,Ω2,…,ΩN% andΦ1,Φ2,…,ΦN, (4.1)

with and for . Use these DRMs to compute

 Vn =X(n)Ωn ∈ RIn×kn,n∈[N], H =X×1Φ⊤1⋯×NΦ⊤N ∈ Rs1×⋯×sN.

The factor sketch captures the span of the mode- fibers of for each , while the core sketch contains information about the interaction between different modes. See Algorithm 3 for pseudocode.

To produce a rank Tucker approximation of , choose sketch size parameters and . (Vector inequalities hold elementwise.) Our approximation guarantees depend closely on the parameters and . As a rule of thumb, we suggest selecting , as the theory requires , and choosing as large as possible given storage limitations.

The sketches and are linear functions of the original tensor, so we can compute the sketches in a single pass over the tensor . Linearity enables easy computation of the sketch even in the streaming model (Algorithm 8) or distributed model (Algorithm 9). Storing the sketches requires memory : much less than the full tensor.

###### Remark 1

The DRMs are large—much larger than the size of the Tucker factorization we seek! Even using a low memory mapping such as the SSRFT and sparse random map, the storage cost required grows as . However, we do not need to store these matrices. Instead, we can generate (and regenerate) them as needed using a (stored) random seed.222 Our theory assumes the DRMs are random, whereas our experiments use pseudorandom numbers. In fact, for many pseudorandom number generators it is NP hard to determine whether the output is random or pseudorandom [3]. In particular, we expect both to perform similarly for tensor approximation.

###### Remark 2

Alternatively, the TRP (Section 3.2) can be used to limit the storage of required. The Khatri-Rao structure in the sketch need not match the structure in the matricized tensor. However, we can take advantage of the structure of our problem to reduce storage even further. We generate DRMs for and define for each . Hence we need not store the maps , but only the small matrices . The storage required is thereby reduced from to , while the approximation error is essentially unchanged. We use this method in our experiments.

### 4.2 Low-Rank Approximation

Now we explain how to construct a Tucker decomposition of with target Tucker rank from the factor and core sketches.

We first present a simple two-pass algorithm, Algorithm 4, that uses only the factor sketches by projecting the unfolded matrix of original tensor

to the column space of each factor sketch. To project to the column space of each factor matrix, we calculate the QR decomposition of each factor sketch:

 Vn=QnRnfor n∈[N], (4.2)

where has orthonormal columns and is upper triangular. Consider the tensor approximation

 ~X =X×1Q1Q⊤1×2⋯×NQNQ⊤N. (4.3)

This approximation admits the guarantees stated in Appendix B. Using the commutativity of the mode product between different modes, we can rewrite as

 ~X=[X×Q⊤1×2⋯×NQ⊤N]W2×1Q1×2⋯×NQN=⟦W2;Q1,…,QN⟧, (4.4)

which gives an explicit Tucker approximation of our original tensor. The core approximation is much smaller than the original tensor . To compute this approximation, we need access to twice: once to compute , and again to apply them to in order to form .

##### One-Pass Approximation

To develop a one-pass method, we must use the core sketch — the compression of using the random projections – to approximate — the compression of using random projections . To develop intuition, consider the following calculation: if the factor matrix approximations capture the range of well, then projection onto their ranges in each mode approximately preserves the action of :

 X≈X×1Q1Q⊤1×⋯×NQNQ⊤N

Recall that for tensors , and with compatible sizes, . Use this rule to collect terms to recognize the two pass core approximation :

 X≈(X×1Q⊤1×N⋯×Q⊤N)×1Q1⋯×NQN=W2×1Q1⋯×NQN

Now contract both sides of this approximate equality with the DRMs and recognize the core sketch :

 H:=X×1Φ⊤1⋯×NΦ⊤N≈W2×1Φ⊤1Q1×⋯×NΦ⊤NQN.

We have chosen so each

has a left inverse with high probability. Hence we can solve the approximate equality for

:

 W2≈H×1(Φ⊤1Q1)†×⋯×N(Φ⊤NQN)†=:W1.

The right hand side of the approximation defines the one pass core approximation . Section C.2 controls the error in this approximation.

Algorithm 5 summarizes the resulting one-pass algorithm. One (streaming) pass over the tensor can be used to sketch the tensor; to recover the tensor, we only access the sketches. Section 5.1 (below) bounds the overall quality of the approximation.

The time and storage cost of Algorithm 5 is given by Table 2. The time and storage complexity of these methods compare favorably to the only previous method for streaming Tucker approximation [24]; see Appendix I for details.

### 4.3 Fixed-Rank Approximation

Algorithm 4 and Algorithm 5 produce a two-pass and one-pass rank- tensor approximation respectively. It is often valuable to truncate this approximation to a user-specified target rank [33, Figure 4].

Our fixed rank approximation method is motivated by the following lemma:

###### Lemma

Let be a tensor, and let be orthogonal matrices for . Then

 ⟦W×1Q1⋯×NQN⟧r=⟦W⟧r×1Q1⋯×NQN.

(This lemma does not necessarily hold if the best rank-r Tucker approximation is replaced by the output of any concrete algorithm such as HOSVD or HOOI.) The proof of Section 4.3 appears in Appendix D.

Motivated by this lemma, to produce a fixed rank approximation of , we compress the core tensor approximation from Algorithm 4 or Algorithm 5 to rank . This compression is cheap because the core approximation is small. We present this method (using HOOI as the the compression algorithm) as Algorithm 6. Other compression algorithms can be used to trade off the quality of approximation with the difficulty of running the algorithm. Reasonable choices include the sequentially-truncated HOSVD (ST-HOSVD) [36] or TTHRESH [5].

We also define an idealized version of the fixed rank approximation. Algorithm 6 and Algorithm 7 return the same Tucker approximation of when HOOI succeeds in computing the best rank approximation of the core . See [41] for details on the convergence of HOOI.

## 5 Guarantees

In this section, we present probabilistic guarantees on the preceding algorithms. We show that approximation error for the one-pass algorithm is the sum of the error from the two-pass algorithm and the error resulting from the core approximation. Proofs for the three theorems in this section can be found in the corresponding subsections of Appendix A.

#### 5.0.1 Tail energy

To state our results, we will need a tensor equivalent for the decay in the spectrum of a matrix. For each unfolding , define the th tail energy

 (τ(n)ρ)2:=min(In,I(−n))∑k>ρσ2k(X(n)),

where is the th largest singular value of .

### 5.1 Low rank approximation

Section 5.1 guarantees the performance of the two pass method Algorithm 4.

###### Theorem

Sketch the tensor using a Tucker sketch with parameters and using DRMs with i.i.d. Gaussian entries. Then the approximation computed with the two pass method Algorithm 4 satisfies

 E∥X−~X∥2F≤min1≤ρn

The two pass method does not use the core sketch, so this result does not depend on .

Section 5.1 guarantees the performance of one pass method Algorithm 5.

###### Theorem

Sketch the tensor using a Tucker sketch with parameters and using DRMs with i.i.d. Gaussian entries. Then the approximation computed with the one pass method Algorithm 5 satisfies the bound

 E∥X−^X∥2F≤(1+Δ)min1≤ρn

where .

The theorem shows that the method works best for tensors whose unfoldings exhibit spectral decay. As a simple consequence of this result, we see that the two pass method with perfectly recovers a tensor with exact Tucker rank , since in that case for each . However, this theorem states a stronger bound: the method exploits decay in the spectrum, wherever (in the first singular values of each mode unfolding) it occurs.

We see that the additional error due to sketching the core is a multiplicative factor more than the error due to sketching the factor matrices. This factor decreases as the size of the core sketch increases.

Section 5.1 also offers guidance on how to select the sketch size parameters and . In particular, suppose that the mode- unfolding has a good rank approximation for each mode . Then the choices and ensure that

 E∥X−^X∥2F≤4N∑n=1(τ(n)rn)2.

More generally, as and increase, the leading constant in the approximation error tends to one.

### 5.2 Fixed rank approximation

We now present a conditional analysis of the fixed rank approximation method given a low rank approximation. Recall that returns a best rank- Tucker approximation.

###### Theorem

Suppose approximates the target tensor , and let denote the best rank approximation to , computed with the idealized fixed rank method Algorithm 7. Then

 E∥X−^Xr∥F≤∥X−⟦X⟧r∥F+2√E∥X−^X∥2F.

The second term on the right-hand side of Section 5.2 is controlled by Section 5.1 and Section 5.1. Hence we can combine these results to provide guarantees for fixed rank approximation with either the two pass or one pass algorithms.

The resulting bound shows that the best rank- approximation of the output from the one or two pass algorithms is comparable in quality to a true best rank- approximation of the input tensor. An important insight is that the sketch size parameters and that guarantee a good low rank approximation also guarantee a good fixed rank approximation: the error due to sketching depends only on the sketch size parameters and , and not on the target rank .

In practice, one would truncate the rank of the approximation using HOOI (Algorithm 6), rather than the best rank approximation (Algorithm 7). Guarantees for resulting algorithm are beyond the scope of this paper, since there are no strong guarantees on the performance of HOOI; however, it is widely believed to produce an approximation that is usually quite close to the best rank approximation.

### 5.3 Proof sketch

To bound the approximation error of the algorithms presented in the main body of this paper, we first develop several structural results showing an additive decomposition of the error. First, the total error is the sum of the error due to sketching and the error due to fixed rank approximation. Second, the sketching error is the sum of the error due to the factor matrix approximations and to the core approximation. Third, the error due to the factor matrix approximations is the sum of the error in each the modes, as the errors due to each mode are mutually orthogonal. This finishes the approximation error bound for the two pass algorithm, Section 5.1. As for the error due to the core approximation, we rewrite the approximation error in the core tensor as a sum over each mode of errors that are mutually orthogonal. Indeed, these errors have the same form as the errors due to the factor matrix approximations, scaled down by a factor that depends on the sketch sizes and . This argument shows the error due to the core approximation is at most a factor times the error due to the factor matrix approximation.

## 6 Numerical Experiments

In this section, we study the performance of our method. We compare the performance of the method using various different DRMs, including TRP. We also compare our method with the algorithm proposed by [24] to show that for the same storage budget, our method produces better approximations. Our two-pass algorithm outperforms the one-pass version, as expected. (Contrast this to [24], where the multi-pass method performs less well than the one-pass version.)

We evaluate the experimental results using two metrics:

 normalized error:∥X−^X∥F/∥X∥Fregret:(∥X−^X∥F−∥X−XHOOI∥F)/∥X∥F.

The normalized error measures the fraction of the energy in captured by the approximation. The regret measures the increase in normalized error due to using the approximation rather than using . The relative error measures the decrease in performance relative to HOOI. The normalized error of a rank Tucker approximation is always positive when has a larger rank. In general, we find our proposed methods approaches the performance of HOOI for large enough storage budgets.

We ran all experiments on a server with 128 Intel® Xeon® E7-4850 v4 2.10GHz CPU cores and 1056GB memory. The code for our method is available at an anonymous Github repository https://github.com/tensorsketch/tensorsketch.

### 6.1 Synthetic experiments

All synthetic experiments use an input tensor with equal side lengths . We consider three different data generation schemes:

• Low rank + noise. Generate a core tensor with entries drawn from . Independently generate orthogonal factor matrices . Define and the noise parameter . Generate an input tensor as where the noise has i.i.d. entries.

• Sparse low rank + noise. We construct the input tensor as above (Low Rank + Noise), but with sparse factor matrices : If is the sparsity (proportion of non-zero elements) of , then the sparsity of the true signal is . We use unless otherwise specified.

• Polynomial decay. We construct the input tensor as

 X=superdiag(1,…,1,2−t,3−t,…,(I−r)−t).

The first entries are 1. Recall converts a vector to dimensional superdiagonal tensor. Our experiments use (geometric decay).

#### 6.1.1 Different dimension reduction maps perform similarly

Our first experiment investigates the performance of our one-pass fixed-rank algorithm as the sketch size (and hence, required storage) varies, for several types of dimension reductions maps, including Gaussian, SSRFT, Gaussian TRP, and Sparse TRP. We generate synthetic data as described above with . Fig. 1 shows the rank- approximation error as a function of the compression factor . (Results for other input tensors are presented as Fig. 10 and Fig. 11 in Appendix J.) We see that the log relative error for our one-pass algorithm converges to that of HOOI as increases for all input tensors. In the low rank case, the convergence rate is lower for higher noise levels. In general, the performance for different maps are approximately the same, although our theory only pertains to the Gaussian map.

We evaluate the run time for HOOI and our two algorithms with several different DRMs in Fig. 3. We can see that the one-pass algorithm is always slightly faster than the two-pass algorithm. The TRP generally provides a modest speedup in addition to the memory advantage. Both our one-pass and two-pass algorithms achieve nearly the accuracy of HOOI, and are usually much faster.

#### 6.1.2 A second pass reduces error

The second experiment compares our two-pass and one-pass algorithm. The design is similar to the first experiment. Fig. 2 shows that the two-pass algorithm typically outperforms the one-pass algorithm, especially in the high-noise, sparse, or rank-decay case. Both converge at the same asymptotic rate. (Results for other input tensors are available in Appendix J.)

#### 6.1.3 Improvement on state-of-the-art

The third experiment compares the performance of our two-pass and one-pass algorithms and Tucker TensorSketch (T.–TS), as described in [24], the only extant one-pass algorithm. For a fair comparison, we allocate the same storage budget to each algorithm and compare the relative error of the resulting fixed-rank approximations. We approximate synthetic 3D tensors with side length with Tucker rank . We use the suggested parameter settings for each algorithm: and for our methods; for T.–TS. Our one-pass algorithm (with the Gaussian TRP) uses storage, whereas T.-TS uses storage (see Table 3 in Appendix I).

Fig. 4 shows that our algorithms generally perform as well as T.–TS, and dramatically outperforms for small storage budgets. For example, our method achieves 1/50, 1/50, 1/7, and 1/4 the relative error of T.–TS for low rank and sparse low rank (), low rank (), and polynomial-decay input tensors, respectively. For the low rank () tensor, the performance of T.–TS is not even monotone as the storage budget increases! The performance of T.–TS is comparable with that of the algorithms presented in this paper only when the storage budget is large.

###### Remark 3

The paper [24] proposes a multi-pass method, Tucker Tensor-Times-Matrix-TensorSketch (TTMTS) that is dominated by the one-pass method Tucker TensorSketch(TS) in all numerical experiments; hence we compare only with T.-TS.

### 6.2 Applications

We also apply our method to datasets drawn from three application domains: climate, combustion, and video.

• Climate data. We consider global climate simulation datasets from the Community Earth System Model (CESM) Community Atmosphere Model (CAM) 5.0 [16, 17]. The dataset on aerosol absorption has four dimensions: times, altitudes, longitudes, and latitudes (). The data on net radiative flux at surface and dust aerosol burden have three dimensions: times, longitudes, and latitudes (). Each of these quantitives has a strong impact on the absorption of solar radiation and on cloud formation.

• Combustion data. We consider combustion simulation data from [21]. The data consists of three measured quantities — pressure, CO concentration, and temperature — each observed on a spatial grid.

• Video data. We use our streaming method to cluster frames of a video, as in [24]. Here, a low frame rate camera is mounted in a fixed position as people walk by. A 3D tensor is constructed with each video frames as a slice. The video consists of 2493 frames, each of size 1080 by 1980. As a tensor, stored as a numpy.array, the video data is 41.4 GB in total.

#### 6.2.1 Data compression

We show that our proposed algorithms are able to successfully compress climate and combustion data even when the full data does not fit in memory. Since the Tucker rank of the original tensor is unknown, we perform experiments for three different target ranks. In this experiment, we hope to understand the effect of different choices of storage budget to achieve the same compression ratio. We define the compression ratio as the ratio in size between the original input tensor and the output Tucker factors, i.e. . As in our experiments on simulated data, Fig. 5 shows that the two-pass algorithm outperforms the one-pass algorithm as expected. However, as the storage budget increases, both methods converge to the performance of HOOI. The rate of convergence is faster for smaller target ranks. Performance of our algorithms on the combustion simulation is qualitatively similar, but converges faster to the performance of HOOI. Fig. 8 visualizes the recovery of the temperature data in combustion simulation for a slice along the first dimension. We could observe that the recovery for both two-pass and one-pass algorithm approximate the recovery from HOOI. Fig. 14 in Appendix J shows similar results on another dataset.

#### 6.2.2 Video scene classification

We show how to use our single pass method to classify scenes in the video data described above. The goal is to identify frames in which people appear. We remove the first 100 frames and last 193 frames where the camera setup happened, as in [24]. We stream over the tensor and sketch it using parameters . Finally, we compute a fixed-rank approximation with and

. We apply K-means clustering to the resulting 10 or 20 dimensional vectors corresponding to each of the remaining 2200 frames.

We experimented with clustering vectors found in three ways: from the two-pass or one-pass Tucker approximation, or directly from the factor sketch.

When matching the video frames with the classification result, we can see that the background light is relatively dark at the beginning, thus classified into Class . After a change in the backgroun light, most other frames of the video are classified into Class . When a person passes by the camera, the frames are classified into Class . Right after the person passed by, the frames are classified into Class , the brighter background scene, due to the light adjustment.

Our classification results (using the linear sketch or approximation) are similar to those in [24] while using only as much storage; the one pass approximation requires more storage (but still less than [24]) to achieve similar performance. In particular, using the sketch itself, rather than the Tucker approximation, to summarize the data enables very efficient video scene classification.

On the other hand, to reconstruct the original video frames we require much larger and : the video is not very low rank along the spatial dimensions. Fig. 7 shows that even with , the recovered frame is very noisy.

## Acknowledgments

MU, YS, and YG were supported in part by DARPA Award FA8750-17-2-0101. JAT gratefully acknowledges support from ONR Awards N00014-11-10025, N00014-17-12146, and N00014-18-12363. The authors wish to thank Osman Asif Malik and Stephen Becker for their help in understanding and implementing Tucker TensorSketch, and Tamara Kolda for insightful comments on an early draft.

## References

• [1] D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal of computer and System Sciences, 66 (2003), pp. 671–687.
• [2] N. Ailon and B. Chazelle, The fast Johnson–Lindenstrauss transform and approximate nearest neighbors, SIAM Journal on computing, 39 (2009), pp. 302–322.
• [3] S. Arora and B. Barak, Computational complexity: a modern approach, Cambridge University Press, 2009.
• [4] W. Austin, G. Ballard, and T. G. Kolda, Parallel tensor compression for large-scale scientific data, in Parallel and Distributed Processing Symposium, 2016 IEEE International, IEEE, 2016, pp. 912–922.
• [5] R. Ballester-Ripoll, P. Lindstrom, and R. Pajarola, Tthresh: Tensor compression for multidimensional visual data, IEEE transactions on visualization and computer graphics, (2019).
• [6] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin, Efficient and scalable computations with sparse tensors, in High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, IEEE, 2012, pp. 1–6.
• [7] C. Battaglino, G. Ballard, and T. G. Kolda, A practical randomized cp tensor decomposition, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 876–901.
• [8] C. Battaglino, G. Ballard, and T. G. Kolda, Faster parallel tucker tensor decomposition using randomization, (2019).
• [9] C. Boutsidis and A. Gittens, Improved matrix algorithms via the subsampled randomized hadamard transform, SIAM Journal on Matrix Analysis and Applications, 34 (2013), pp. 1301–1340.
• [10] A. Cichocki, Tensor decompositions: a new concept in brain data analysis?, arXiv preprint arXiv:1305.0395, (2013).
• [11] K. L. Clarkson and D. P. Woodruff, Low-rank approximation and regression in input sparsity time, Journal of the ACM (JACM), 63 (2017), p. 54.
• [12] G. Cormode and M. Hadjieleftheriou, Finding frequent items in data streams, Proceedings of the VLDB Endowment, 1 (2008), pp. 1530–1541.
• [13] L. De Lathauwer, B. De Moor, and J. Vandewalle, A multilinear singular value decomposition, SIAM journal on Matrix Analysis and Applications, 21 (2000), pp. 1253–1278.
• [14] H. Diao, Z. Song, W. Sun, and D. P. Woodruff, Sketching for Kronecker Product Regression and P-splines, arXiv e-prints, (2017), arXiv:1712.09473, p. arXiv:1712.09473, https://arxiv.org/abs/1712.09473.
• [15] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM review, 53 (2011), pp. 217–288.
• [16] J. W. Hurrell, M. M. Holland, P. R. Gent, S. Ghan, J. E. Kay, P. J. Kushner, J.-F. Lamarque, W. G. Large, D. Lawrence, K. Lindsay, et al., The community earth system model: a framework for collaborative research, Bulletin of the American Meteorological Society, 94 (2013), pp. 1339–1360.
• [17] J. Kay, C. Deser, A. Phillips, A. Mai, C. Hannay, G. Strand, J. Arblaster, S. Bates, G. Danabasoglu, J. Edwards, et al., The community earth system model (cesm) large ensemble project: A community resource for studying climate change in the presence of internal climate variability, Bulletin of the American Meteorological Society, 96 (2015), pp. 1333–1349.
• [18] O. Kaya and B. Uçar, High performance parallel algorithms for the tucker decomposition of sparse tensors, in Parallel Processing (ICPP), 2016 45th International Conference on, IEEE, 2016, pp. 103–112.
• [19] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM review, 51 (2009), pp. 455–500.
• [20] T. G. Kolda and J. Sun, Scalable tensor decompositions for multi-aspect data mining, in 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 363–372.
• [21] S. Lapointe, B. Savard, and G. Blanquart, Differential diffusion effects, distributed burning, and local extinctions in high karlovitz premixed flames, Combustion and flame, 162 (2015), pp. 3341–3355.
• [22] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc, An input-adaptive and in-place approach to dense tensor-times-matrix multiply, in High Performance Computing, Networking, Storage and Analysis, 2015 SC-International Conference for, IEEE, 2015, pp. 1–12.
• [23] P. Li, T. J. Hastie, and K. W. Church, Very sparse random projections, in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2006, pp. 287–296.
• [24] O. A. Malik and S. Becker, Low-rank tucker decomposition of large tensors using tensorsketch, in Advances in Neural Information Processing Systems, 2018, pp. 10116–10126.
• [25] S. Muthukrishnan et al., Data streams: Algorithms and applications, Foundations and Trends® in Theoretical Computer Science, 1 (2005), pp. 117–236.
• [26] S. Oymak and J. A. Tropp, Universality laws for randomized dimension reduction, with applications, Information and Inference: A Journal of the IMA, (2015).
• [27] M. Rudelson, Row products of random matrices, Advances in Mathematics, 231 (2012), pp. 3199–3231.
• [28] J. Sun, D. Tao, S. Papadimitriou, P. S. Yu, and C. Faloutsos, Incremental tensor analysis: Theory and applications, ACM Transactions on Knowledge Discovery from Data (TKDD), 2 (2008), p. 11.
• [29] Y. Sun, Y. Guo, J. A. Tropp, and M. Udell, Tensor random projection for low memory dimension reduction, in NeurIPS Workshop on Relational Representation Learning, 2018, https://r2learning.github.io/assets/papers/CameraReadySubmission%2041.pdf.
• [30] J. A. Tropp, Improved analysis of the subsampled randomized hadamard transform, Advances in Adaptive Data Analysis, 3 (2011), pp. 115–126.
• [31] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher, Practical sketching algorithms for low-rank matrix approximation, SIAM Journal on Matrix Analysis and Applications, 38 (2017), pp. 1454–1485.
• [32] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher, More practical sketching algorithms for low-rank matrix approximation, Tech. Report 2018-01, California Institute of Technology, Pasadena, California, 2018.
• [33] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher, Streaming low-rank matrix approximation with an application to scientific simulation, Submitted to SISC, (2019), https://arxiv.org/abs/1902.08651.
• [34] C. E. Tsourakakis, Mach: Fast randomized tensor decompositions, in Proceedings of the 2010 SIAM International Conference on Data Mining, SIAM, 2010, pp. 689–700.
• [35] L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika, 31 (1966), pp. 279–311.
• [36] N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen, A new truncation strategy for the higher-order singular value decomposition, SIAM Journal on Scientific Computing, 34 (2012), pp. A1027–A1052.
• [37] M. A. O. Vasilescu and D. Terzopoulos, Multilinear analysis of image ensembles: Tensorfaces, in European Conference on Computer Vision, Springer, 2002, pp. 447–460.
• [38] Y. Wang, H.-Y. Tung, A. J. Smola, and A. Anandkumar, Fast and guaranteed tensor decomposition via sketching, in Advances in Neural Information Processing Systems, 2015, pp. 991–999.
• [39] D. P. Woodruff et al., Sketching as a tool for numerical linear algebra, Foundations and Trends® in Theoretical Computer Science, 10 (2014), pp. 1–157.
• [40] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, A fast randomized algorithm for the approximation of matrices, Applied and Computational Harmonic Analysis, 25 (2008), pp. 335–366.
• [41] Y. Xu, On the convergence of higher-order orthogonality iteration, arXiv preprint arXiv:1504.00538, (2015).
• [42] G. Zhou, A. Cichocki, and S. Xie, Decomposition of big tensors with low multilinear rank, arXiv preprint arXiv:1412.1885, (2014).

## Appendix A Proof of Main Results

### a.1 Error bound for the two pass approximation Algorithm 4

###### Proof (Proof of Section 5.1)

Suppose is the low-rank approximation from Algorithm 4. Use the definition of the mode- product to see

 ^X =[X×1Q⊤1×2⋯×NQ⊤N]×1Q1×1⋯×NQN =X×1Q1Q⊤1×2⋯×NQNQ⊤N.

Then the error bound follows directly from Appendix B.

### a.2 Error bound for the one pass approximation Algorithm 5

###### Proof (Proof of Section 5.1)

We show the approximation error can be decomposed as the error due to the factor matrix approximations and the error due to the core approximation. Let be the one pass approximation from Algorithm 5, and let

 ~X=X×1Q1Q⊤1×2⋯×NQNQ⊤N, (A.1)

be the two pass approximation from Algorithm 4. We will prove the error due to the factor matrix and core approximations are orthogonal:

 (A.2)

Intuitively, projects to the column space of the factor matrix approximations. Thus, is orthogonal to the span of the factor matrix approximations, while both lie entirely in the span of the factor matrix approximations. We prove Eq. A.2 formally as follows. Define , and for each let

 Yn=X×1Q1Q⊤1×2⋯×nQnQ⊤n. (A.3)

Write the error due to the factor matrix approximations as the telescoping sum

 X−~X=Y0−YN=N∑n=1(Yn−1−Yn).

The differences inside the sum satisfy

 Yn−1−Yn=Yn−1×n(I−QnQ⊤n)

for each . The one pass approximation (from Algorithm 5) and the two pass approximation to show differ only in how they define the core, so

We can characterize the unfoldings of this difference: for each ,

 (^X−~X)(n)=Qn(Bn)(n)

where is the mode th unfolding of the tensor defined as

Now recall from (2.2) that the tensor inner product matches the matrix inner product for any unfolding of the tensors. We use this fact to show orthogonality between the error due to the factor matrix approximation in the th mode and the error due to the core approximation, for each mode :

 ⟨Yn−1−Yn,^X−~X⟩=⟨(I−QnQ⊤n)Y(n)n−1,QnB