Randomized CP Tensor Decomposition

The CANDECOMP/PARAFAC (CP) tensor decomposition is a popular dimensionality-reduction method for multiway data. Dimensionality reduction is often sought since many high-dimensional tensors have low intrinsic rank relative to the dimension of the ambient measurement space. However, the emergence of `big data' poses significant computational challenges for computing this fundamental tensor decomposition. Leveraging modern randomized algorithms, we demonstrate that the coherent structure can be learned from a smaller representation of the tensor in a fraction of the time. Moreover, the high-dimensional signal can be faithfully approximated from the compressed measurements. Thus, this simple but powerful algorithm enables one to compute the approximate CP decomposition even for massive tensors. The approximation error can thereby be controlled via oversampling and the computation of power iterations. In addition to theoretical results, several empirical results demonstrate the performance of the proposed algorithm.



There are no comments yet.


page 17

page 18

page 21

page 22

page 23


Fast and Guaranteed Tensor Decomposition via Sketching

Tensor CANDECOMP/PARAFAC (CP) decomposition has wide applications in sta...

Vectorial Dimension Reduction for Tensors Based on Bayesian Inference

Dimensionality reduction for high-order tensors is a challenging problem...

Low-CP-rank Tensor Completion via Practical Regularization

Dimension reduction techniques are often used when the high-dimensional ...

Protecting Big Data Privacy Using Randomized Tensor Network Decomposition and Dispersed Tensor Computation

Data privacy is an important issue for organizations and enterprises to ...

TEC: Tensor Ensemble Classifier for Big Data

Tensor (multidimensional array) classification problem has become very p...

Randomized Online CP Decomposition

CANDECOMP/PARAFAC (CP) decomposition has been widely used to deal with m...

On Recoverability of Randomly Compressed Tensors with Low CP Rank

Our interest lies in the recoverability properties of compressed tensors...

Code Repositories


Randomized Tensor Decompositions

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Advances in data acquisition and storage technology have enabled the acquisition of massive amounts of data in a wide range of emerging applications. In particular, numerous applications across the physical, biological, social and engineering sciences generate large multidimensional, multi-relational and/or multi-modal data. Efficient analysis of this data requires dimensionality reduction techniques. However, traditionally employed matrix decompositions techniques such as the singular value decomposition (SVD) and principal component analysis (PCA) can become inadequate when dealing with multidimensional data. This is because reshaping multi-modal data into matrices, or

data flattening, can fail to reveal important structures in the data.

Tensor decompositions overcome this issue of information loss. The canonical CP decomposition is particularly suitable for data-driven discovery since it expresses a tensor as a sum of rank-one tensors. However, tensor decompositions of massive multidimensional data pose a tremendous computational challenge. Hence, innovations that reduce the computational demands have become increasingly relevant in this field. Tensor compression can ease the computational bottlenecks by computation of smaller (compressed) tensors, which are then used as a proxy to efficiently approximate the CP decomposition. Such a compressed tensor can be obtained, for instance, using the Tucker decomposition (Bro and Andersson, 1998; De Lathauwer et al., 2000)

. However, this approach requires the expensive computation of the left singular vectors for each mode. This computational challenge can be eased using modern randomized techniques developed to compute the SVD.

Tsourakakis (2010) presents a randomized Tucker decomposition algorithm based on the work of Achlioptas and McSherry (2007). Later, Zhou et al. (2014) proposed a more refined randomized CP algorithm using the ideas of Halko et al. (2011), however, they omit the important concept of power iterations in their work. As an alternative to random projections, Drineas and Mahoney (2007) proposed a randomized tensor algorithm based on the idea of random column selection. Related work by Vervliet et al. (2014) proposed a sparsity-promoting algorithm for incomplete tensors using concepts of compressed sensing. Alternatively, a different approach to efficiently compute large-scale tensor decompositions is based on the idea of subdividing a tensor into a set of blocks. These smaller blocks can then be used to approximate the CP decomposition of the full tensor in a parallelized or distributed fashion (Phan and Cichocki, 2011). Sidiropoulos et al. (2014) fused the idea of randomization (random projections) and blocking into a highly computationally efficient algorithm. More recently, Vervliet and Lathauwer (2016) also proposed a block sampling CP decomposition method for the analysis of large-scale tensors using randomization, thus showing significant computational savings while attaining near optimal accuracy. These block based algorithms are particularly relevant if the tensor is too large to fit into fast memory.

We present a randomized CP algorithm, which is closely related to randomized matrix decompositions as presented by

Martinsson et al. (2011) and Halko et al. (2011). The idea is to employ a degree of randomness in order to derive a smaller tensor from the high-dimensional input tensor. The smaller tensor can then be used to approximate the factor matrices using either alternating least squares or a block coordinate descent algorithm. Once the approximate factor matrices are computed, the full factor matrices can be efficiently recovered using the orthonormal basis matrices, which were used to project the high-dimensional tensor onto low-dimensional space. Embedding the CP decomposition into this probabilistic framework allows to achieve significant computational savings, while producing near-optimal approximation quality. Figure 1 shows the schematic of the randomized CP decomposition architecture.

CP decomposition

rCP decomposition

A degree of randomness is employed to derive a smaller tensor from the (big) input tensor.

Recover near-optimal factors.


Factor matrices



Figure 1: Schematic of the randomized CP decomposition architecture. First, a degree of randomness is employed to derive a smaller tensor from the big tensor . Then, the CP decomposition is performed on . Finally, the near-optimal high-dimensional factor matrices and are recovered from the approximate factor matrices and .

A crucial aspect to build a sufficient basis are power iterations. This concept is fundamental for modern high-performance randomized matrix decompositions, but as far as we are aware, it has not been applied in the context of tensors. In particular, in the presence of white noise the performance of randomized algorithms based on random projections can suffer considerably without additional power iterations. Thus, oversampling and power iterations allow one to control the error of the decomposition. The work is accompanied by an open-software package written in

Python, which allows the reproduction of all results.111The rTensor package can be obtained from the GIT repository: github.com/Benli11/rtensor.

The manuscript is organized as follows: Section 2 briefly reviews the CP decomposition and randomized matrix decompositions. Section 3 introduces the randomized CP tensor decomposition algorithm. Section 4 presents the evaluation of the computational performance, and examples. Finally, Section 5 summarizes the research findings and outlines further directions.

2 Background

Ideas for multi-way factor analysis emerged in the 1920s with the formulation of the polyadic decomposition by Hitchcock (1927). However, the polyadic tensor decomposition only achieved popularity much later in the 1970s with the canonical decomposition (CANDECOMP) in psychometrics, proposed by Carroll and Chang (1970). Concurrently, the method of parallel factors (PARAFAC) was introduced in chemometrics by Harshman (1970). Hence, this method became known as the CP (CANDECOMP/PARAFAC) decomposition. However, in the past, computation was severely inhibited by available computational power. Today, tensor decompositions enjoy increasing popularity, yet runtime bottlenecks still persist.

2.1 Notation

Scalars are denoted by lower case letters , vectors as bold lower case letters , and matrices by bold capitals . Tensors are denoted by calligraphic letters . The mode- unfolding of a tensor is expressed as , while the mode- folding of a matrix is defined as . The vector outer product, the Kronecker product, the Khatri-Rao product and the Hadamard product are denoted by , , , and . The inner product of two tensors is expressed as , and denotes the Frobenius norm for both matrices and tensors.

2.2 CP Decomposition

The CP decomposition is the tensor equivalent of the SVD since it approximates a tensor by a sum of rank-one tensors. Specifically, tensor rank is defined as the smallest sum of rank-one tensors required to generate the tensor (Hitchcock, 1927). The CP decomposition approximates these rank-one tensors. Given a third order tensor , the rank- CP decomposition is expressed as


where denotes the outer product. Specifically, each rank-one tensor is formulated as the outer product of the rank-one components , , and . Components are often constrained to unit length with the weights absorbed into the vector . Equation (1) can then be re-expressed as (See Fig. 2)


More compactly the components can be expressed as factor matrices, i.e., , , and . Using the Kruskal operator as defined by Kolda and Bader (2009), (2) can be more compactly expressed as

Figure 2: Schematic of the CP decomposition.

2.3 Randomized Matrix Algorithms

The efficient computation of low rank matrix approximations is a ubiquitous problem in machine learning and data mining. Randomized matrix algorithms have been demonstrated to be highly competitive and robust when compared to traditional deterministic methods. Specifically, randomized algorithms aim to construct a smaller matrix approximating a high-dimensional data matrix 

(Mahoney, 2011; Drineas and Mahoney, 2016). There exist several strategies for obtaining the smaller matrix, with random projections being the most robust, off-the-shelf approach. Randomized algorithms have been in particular studied for computing the near-optimal low-rank SVD (Frieze et al., 2004; Liberty et al., 2007; Woolfe et al., 2008; Martinsson et al., 2011). Following the seminal work by Halko et al. (2011), a randomized algorithm computes the low-rank matrix approximation

where the target rank is denoted as , and assumed to be . First a random sampling matrix

is drawn from, for example, a Gaussian distribution to construct the sampled matrix

If has exact rank , then the sampled matrix

spans with high probability a basis for the column space. However, most data matrices do not feature exact rank in practice so that the singular values

are non-zero. Thus, instead of just using samples, it is favorable to slightly oversample , were denotes the number of additional samples. In most situations small values are sufficient to obtain a good basis that is comparable to the best possible basis (Martinsson, 2016). An orthonormal basis

is then obtained via the QR-decomposition

, such that

is satisfied. Finally, is projected to low-dimensional space

where . The matrix can then be used to efficiently compute the matrix decomposition of interest, e.g., the SVD. The approximation error can be controlled by both oversampling, and the concept of power iterations (Rokhlin et al., 2009; Halko et al., 2011; Gu, 2015).

Randomized matrix algorithms are not only pass efficient, but they have also the ability to exploit modern computational parallelized and distributed computing architectures. Implementations in MATLAB, C and R are provided by Szlam et al. (2014)Voronin and Martinsson (2015), and Erichson et al. (2016).

3 Randomized CP Decomposition

Given a third order tensor , the objective of the CP decomposition is to find a set of normalized rank-one tensors which best approximates , i.e., minimizes the Frobenius norm

subject to (3)

If the dimensions of are large, the computational costs of solving this optimization problem can be prohibitive. However, for obtaining the factor matrices , , only the column spaces are of importance, rather then the individual columns of the mode , , matricizations of the tensor . This is because the CP decomposition learns the components based on proportional variations in inter-point distances between the components. Therefore, a compressed tensor must preserve pairwise Euclidean distances, where . This in turn requires that column spaces, and thus pairwise distances, are approximately preserved by the compression procedure. This compression can be achieved by generalizing the concepts of randomized matrix algorithms to tensors. In particular, we build upon the methods introduced by Martinsson et al. (2011) and Halko et al. (2011), as well as related work on randomized tensors by Drineas and Mahoney (2007), who proposed a randomized algorithm based on random column selection.

3.1 Randomized Tensor Algorithm

The aim is to use randomization as a computational resource to efficiently build a suitable basis that captures the action of the tensor . Assuming an -way tensor , the aim is to obtain a smaller compressed tensor , so that its tensor modes capture the action of the input tensor modes. Hence, we seek a natural basis in the form of a set of orthonormal matrices , so that


Here the operator denotes tensor-matrix multiplication defined as follows.

Definition 1

The -mode matrix product multiplies a tensor by the matrices in mode , i.e., each mode- fiber is multiplied by

Given a fixed target rank , these basis matrices can be efficiently obtained using a randomized algorithm. Specifically, the approximate basis for the -th tensor mode is obtained by first drawing random vectors from a Gaussian distribution. These random vectors form the measurement matrix , which is used to sketch the column space of as follows


where is the sample matrix. The sample matrix serves as an approximate basis for the range of the

-th tensor mode. Probability theory guarantees that the set of random vectors

are linearly independent with high probability. Hence, the corresponding random projections efficiently sample the range of a rank deficient tensor mode . The economic QR decomposition of the sample matrix is then used to obtain a natural basis, so that is orthonormal and has the same column space as . The final step restricts the tensor mode to this low-dimensional subspace


Thus, after iterations a compressed tensor and a set of orthonormal matrices is obtained. Since this is an iterative algorithm, we set after each iteration. The number of columns of the basis matrices form a trade-off between accuracy and computational performance. The aim is to use as few columns as possible, yet allow an accurate approximation of the input tensor. Assuming that the tensor exhibits low-rank structure, or equivalently, the rank is much smaller than the ambient dimensions of the tensor, the basis matrices will be an efficient representation. However, to improve the performance of the basis, we allow for additional oversampling in practice. This means, instead of drawing exactly random vectors, it is preferred to draw random vectors, where denotes the oversampling parameter.

The randomized algorithm as presented requires that the mode- unfolding of the tensor has a rapidly decaying spectrum in order to achieve good performance. However, this assumption is often not suitable. In particular, the spectrum starts to decay slowly if the tensor is compressed several times. To overcome this issue, the algorithm’s performance can be substantially improved using power iterations (Rokhlin et al., 2009; Halko et al., 2011; Gu, 2015). Specifically, power iterations turn a slowly decaying spectrum into a rapidly decaying one by taking powers of the tensor modes. Thus, instead of sampling the idea is to sample from the following tensor mode

where denotes a small integer. This power operation enforces that the singular values of are . Instead of using (5), an improved sample matrix is computed as


However, if (7) is implemented in this form it tends to distort the basis due to round-off errors. Therefore, in practice (normalized) subspace iterations are used to form the sample matrix. This means that the sample matrix is orthornormalized between each power iteration in order to stabilize the algorithm. For implementation, see Voronin and Martinsson (2015) and Szlam et al. (2014).

The combination of oversampling and additional power iterations can be used to control the trade-off between approximation quality and computational efficiency of the randomized tensor algorithm. Our results, for example, show that just subspace iterations and an oversampling parameter of about achieves near-optimal results. Algorithm 1 summarizes the computational steps.

Require: An -way tensor , and a desired target rank . Optional: Parameters and to control oversampling, and the number of power iterations. (1) initialize compressed tensor (2) for iterate over all tensor modes (3) slight oversampling (4) (5) generate random test matrix (6) compute sampling matrix (7) for normalized power iterations (optional) (8) (9) (10) (11) end for (12) orthonormalize sampling matrix (13) project tensor to smaller space (14) end for Return: Compressed tensor of dimension , and a set of orthonormal basis matrices .

Due to the convenient mathematical properties of the normal distribution, a Gaussian random test matrix is assumed in theory, however, in practice a uniform distributed random test matrix is sufficient. The performance can be even further improved using structured random matrices 

(Woolfe et al., 2008). If information is uniformly distributed across the data, randomly selected columns can also be used to build a suitable basis as well, which avoids the matrix multiplication in step (6). For numerical stability, normalized power iterations using the pivoted LU decomposition are computed in step 7-11. We recommend a default value of . In practice, the user can decide which modes to compress and specify different oversampling parameters for these modes.

Algorithm 1 A prototype randomized tensor compression algorithm.

3.1.1 Performance analysis

The average behavior of the randomized tensor algorithm is characterized using the expected residual error


where . Theorem 1 follows as a generalization from Theorem 10.5 formulated by Halko et al. (2011).

Theorem 1 (Expected Frobenius error)

Consider a low-rank real -way tensor . Then the expected approximation error, given a target rank and an oversampling parameter for each mode, is


The proof is shown in Appendix A. Intuitively, the projection of each tensor mode onto a low-dimensional space introduces an additional residual. This is expressed by the double sum on the right hand side. If the low-rank approximation captures the column space of each mode accurately, then the singular values for each mode are small. Further, it can be seen that the error can be improved by the oversampling parameter. The computation of additional power (subspace) iterations can improve the error further. This result again follows by generalizing the results of Halko et al. (2011) to tensors. Sharper performance bounds for both oversampling and additional power iterations can be derived following, for instance, the results by Witten and Candes (2015).

3.2 Optimization Strategies

While there exist several optimization strategies for minimizing the objective function defined in (10), we consider alternating least squares (ALS) and block coordinate descent (BCD). Both methods are suitable to deal with a compressed tensor , where . The optimization problem (3) is reformulated

subject to (10)

Once the compressed factor matrices , ,

are estimated, the full factor matrices can be recovered from


where , , denote the orthonormal basis matrices. For simplicity we focus on third order tensors, but the result generalizes to -way tensors.

3.2.1 ALS Algorithm

Due to simplicity and efficiency, ALS is the most popular method for computing the CP decomposition (Comon et al., 2009; Kolda and Bader, 2009). We note that the optimization (10) is equivalent to

with respect to the factor matrices and . Further, the tensor can be expressed in matricized form

where denotes the Khatri-Rao product. The optimization problem in this form is non-convex. However, an estimate for the factor matrices can be obtained using the least-squares method. Therefore, the ALS algorithm updates one component, holding the other two components fixed, in an alternating fashion until convergence. Specifically, the algorithm iterates over the following subproblems


Thus, each step involves a least-squares problem which can be solved using the Khatri-Rao product pseudo-inverse. Algorithm 2 summarizes the computational steps.

Definition 2

The Khatri-Rao product pseudo-inverse is defined as

where the operator denotes the Hadamard product, i.e., the elementwise multiplication of two equal sized matrices.

There exist few general convergence guarantees for the ALS algorithm (Uschmajew, 2012; Wang and Chu, 2014). Moreover, the final solution tends to depend on the initial guess and

. A standard initial guess uses the eigenvectors of

, ,  (Bader and Kolda, 2015). Further, it is important to note that normalization of the factor matrices is necessary after each iteration to achieve good convergence. Specifically, this prevents singularities of the Khatri-Rao product pseudo-inverse Kolda and Bader (2009). The algorithm can be further improved by reformulating the above subproblems as regularized least-squares problems, for instance, see Li et al. (2013) for technical details and convergence results. The ability of the ALS algorithm to impose structure on the factor matrices permits the formulation of non-negative, or sparsity-constrained tensor decompositions (Cichocki et al., 2009).

Require: An tensor , and a desired target rank . Optional: Parameters and to control oversampling, and the number of power iterations. (1) compress tensor using Algorithm 1 (2) (3) repeat (4) (5) (6) (7) (8) (9) (10) (11) until convergence criterion is reached (12) recover factor matrices (13) re-normalize the factor matrices and update the scaling vector Return: Normalized factor matrices and the scaling vector .
Algorithm 2 A prototype randomized CP algorithm using ALS.

3.2.2 BCD Algorithm

While ALS is the most popular algorithm for computing the CP decomposition, many alternative algorithms have been developed. One particularly intriguing algorithm is based on BCD (Xu and Yin, 2013). First, Cichocki and Phan (2009) proposed this approach for computing nonnegative tensor factorizations. The BCD algorithm is based on the idea of successive rank-one deflation. Unlike ALS, which updates the entire factor matrix at each step, BCD computes the rank-1 tensors in a hierarchical fashion. Therefore, the algorithm treats each component as a block. First, the most correlated rank-1 tensor is computed; then the second most correlated rank-1 tensor is learned on the residual tensor, and so on. Assuming that components have been computed, then the -th compressed residual tensor is defined


Then, the algorithm iterates over the following subproblems


Note that the computation can be more efficiently evaluated without explicitly constructing the residual tensor  (Kim et al., 2014). Algorithm 3 summarizes the computation.

Require: An tensor , and a desired target rank . Optional: Parameters and to control oversampling, and the number of power iterations. (1) compress tensor using Algorithm 1 (2) (3) initialize residual tensor (4) for compute rank- approximation (5) repeat (6) (7) (8) (9) (10) (11) (12) (13) until convergence criterion is reached (14) update residual tensor (15) end for (16) recover factor matrices (17) re-normalize the factor matrices and update the scaling vector Return: Normalized factor matrices and the scaling vector .
Algorithm 3 A prototype randomized CP algorithm using BCD.

3.3 Implementation Details

The presented algorithms are implemented in the programming language Python, using numerical linear algebra tools provided by the SciPy (Open Source Library of Scientific Tools) package (Jones et al., 2001). Specifically, SciPy provides MKL (Math Kernel Library) accelerated high performance implementations of BLAS and LAPACK routines. Thus, all linear algebra operations are threaded and highly optimized on Intel processors. The implementation of the CP decomposition follows the MATLAB Tensor Toolbox implementation (Bader and Kolda, 2015)

. This implementation normalizes the components after each step to achieve better convergence. Further, we use eigenvectors (see above) to initialize the factor matrices. Interestingly, randomly initialized factor matrices have the ability to achieve slightly better approximation errors. However, re-running the algorithms several times with different random seeds can display significant variance in the results. Thus, only the former approach is used for initialization. We note that the randomized algorithm introduces some randomness and slight variations into the CP decompositions as well. However, randomization can also act as an implicit regularization on the CP decomposition 

(Mahoney, 2011), meaning that the results of the randomized algorithm can be in some cases even ‘better’ than the results of the corresponding deterministic implementation. Following Bader and Kolda (2015), the convergence criterion is defined as the change in fit. The algorithm therefore stops when the improvement of the fit is less then a predefined threshold, where the fit is computed using .

4 Numerical Results

The randomized CP algorithm is evaluated on a number of examples where the near optimal approximation of massive tensors can be achieved in a fraction of the time using the randomized algorithm. Approximation accuracy is computed with the relative error , where denotes the approximated tensor. All computations are performed on a workstation running Ubuntu 16 LTS, with the following hardware specifications: 12 Intel Xeon CPUs E5-2620 (2.4GHz), and 32GB DDR3 memory.

Figure 3: Average relative error, plotted on a log scale, against increasing signal to noise ratio. The analysis is performed on a rank tensor of dimension . Power iterations improve the the approximation accuracy considerably, while the performance without power iterations is poor.

4.1 Computational Performance

The robustness of the randomized CP algorithm is first assessed on random low-rank tensors. Specifically, it is of interest to examine the approximation quality in the presence of additive white noise. Figure 3 shows the averaged relative errors over runs for varying signal-to-noise ratios (SNR). In the presence of little noise all algorithms converge towards the same relative error. However, at excessive levels of noise (i.e., SNR) the deterministic CP algorithms exhibit small gains in accuracy over the randomized algorithms using power iterations. Here, both the ALS and BCD algorithm show the same performance. The randomized algorithm without power iterations () is, however, poor. This highlights the importance of the power operation for real applications. The oversampling parameter for the randomized algorithms is set to . Increasing can slightly improve the accuracy, but is generally sufficient.

Next, the reconstruction errors and runtimes for tensors of varying dimensions are compared. Figure 4 shows the average evaluation results over runs for random low-rank tensors of different dimensions, and for varying target ranks . The randomized algorithms achieve near optimal approximation accuracy while demonstrating substantial computational savings. The computational advantage becomes pronounced with increasing tensor dimensions, as well as with an increasing number of iterations required for convergence. Using random tensors as presented here, all algorithms rapidly converge after about to iterations. However, it is evident that the computational cost per iteration of the randomized algorithm is substantially lower. Thus, the computational savings in real world applications, which require several hundred iterations to converge, can be even more substantial. Overall, the ALS algorithm is computationally more efficient than BCD. The deterministic ALS algorithm is faster than the BCD by nearly one order of magnitude. However, the randomized algorithms exhibit similar computational timings. Interestingly, the BCD relative error decreases sharply by about one order of magnitude when the target rank is achieved, and the tensor rank is much smaller then the ambient dimensions. Similar performance results are achieved for higher order tensors as well.

(a) Tensor of dimension .
(b) Tensor of dimension .
(c) Tensor of dimension .
Figure 4: Random tensor approximation and performance for rank tensors: rCP methods achieve speedups by - orders of magnitude and the same accuracy as their deterministic counterpart.
Figure 5: Random tensor approximation and performance for a -way rank tensor of dimension .

Figure 5 shows the computational performance for a -way tensor of dimension . Again, the randomized algorithms achieve speedups of 1-2 orders of magnitude, while attaining good approximation errors. Figure 6 shows the computational savings of a rank approximation for varying tensor dimensions.

Figure 6: Algorithm runtimes and speedups for varying tensor dimensions for a target rank approximation. Speedups rise sharply with increasing dimensions.

4.2 Numerical Examples

The examples demonstrate the advantages of the randomized CP decomposition. The first is a multiscale toy video example, and the second is the simulated flow field behind a stationary cylinder. Due to the better and more natural interpretability of the BCD algorithm, only this algorithm is considered in subsequent sections. BCD is in particular advantageous for data with spatial modes and temporal dynamics.

4.2.1 Multiscale Toy Video Example

The approximation of the underlying spatial modes and temporal dynamics of a system is a common problem in signal processing. In the following, we consider a toy example presenting multiscale intermittent dynamics in the time direction. Specifically, the data is generate by four Gaussian modes on a two-dimensional spatial grid (), which are undergoing intermittent oscillations in the temporal direction resolved with time steps. Thus, the resulting tensor is of dimension . Figure 7 shows the corresponding modes and the time dynamics.

Figure 7: Illustration of the multiscale toy video. The system is governed by four spatial modes experiencing intermittent oscillations in the temporal direction. The bottom subplot shows the noisy signal with a signal-to-noise ratio of .

This problem becomes even more challenging when the underlying structure needs to be reconstructed from noisy measurements. In particular, traditional matrix decomposition techniques such as the SVD or PCA face difficulties approximating the intermittent multiscale dynamics. The CP decomposition estimates fewer coefficients, in contrast. This is because the data do not need to be reshaped, which allows for a parsimonious approximation. Comparing the compression ratios between the two methods illustrates the difference. For a real rank tensor of dimension , the compression ratios are computed as follows

The SVD requires this data to be reshaped in some dimension. The comparison displays the striking difference between compression ratios. It is evident that the CP decomposition requires computing many fewer coefficients in order to approximate the tensor. This makes the method more robust in the presence of noise. A further advantage is that much less storage is required to approximate the data. This can be of importance if the bandwidth is constrained and only a limited amount of information can be transmitted, in which case the CP decomposition may be advantageous. However, the advantage of the SVD is that noise free data can be approximated with an accuracy as low as machine precision.

Figure 8, shows the decomposition results of the noisy (SNR=) toy video for both the randomized CP decomposition and the SVD. The first subplot shows the results of a rank approximation computed using the rCP algorithm with power iterations, and a small oversampling parameter . The method faithfully captures the underlying spatial modes and the time dynamics. For illustration, the second subplot shows the decomposition results without additional power iterations. It can clearly be seen that this approach introduces distinct artifacts, and the approximation quality is relatively poor overall. The last subplot corresponds to the SVD. The results show poor performance at separating the different modes. In particular, the spatiotemporal dynamics of modes 2 & 3 are mixed, as well as modes 2 & 4 to a lesser extent. Table 1 further quantifies the observed results. The achieved speedup of rCP is substantial, with a speedup factor of about . Interestingly, the relative error using the randomized algorithm with power iterations is slightly better than the deterministic algorithm. This is due to the beneficial properties of randomization, which can act to regularize. However, the reconstruction error without power iterations is large, as is the error resulting from the SVD.


(a) rCP BCD, .
(b) rCP BCD, .
(c) SVD


Figure 8: Toy video decomposition results. rCP with successfully reconstructs the original spatiotemporal dynamics from noise-corrupted data, while SVD and rCP without subspace iterations yield poor reconstruction results.
Parameters Time (s) Speedup Iterations Error


2.31 - 9 0.0171


, , 0.13 17 9 0.494
, , 0.14 16 10 0.0191
, , 0.15 15 10 0.0164


0.52 - - 0.137
Table 1: Summary of the computational results for the noisy toy video.

4.2.2 Flow behind a cylinder

Extracting the dominant coherent structures from fluid flows helps to better characterize them for modeling and control (Brunton and Noack, 2015). The workhorse algorithm in fluid dynamics and for model reduction is the SVD, also known in these communities as proper orthogonal decomposition (POD). However, fluid simulations generate high-resolution spatiotemporal grids of data which naturally manifest as tensors. In the following we examine the suitability of the CP decomposition for decomposing flow data, and compare the results to those of the SVD.

The fluid flow behind a cylinder, a canonical example in fluid dynamics, is presented here. Specifically, the data are constructed as a time-series of fluid vorticity fields behind a stationary cylinder on an equispaced grid (Colonius and Taira, 2008). The corresponding flow tensor is of dimension , containing 151 snapshots of a spatial grid. Figure 9 shows three example snapshots of the fluid flow.

Figure 9: Snapshots of the fluid flow behind a cylinder at time points .

The flow is characterized by a periodically shedding wake structure and is inherently low-rank in the absence of noise. This can be seen in the normalized spectra for the singular values as shown in Figure 10.

Figure 10: Normalized spectrum. The SVD and (r)CP BCD decompositions successfully capture pairs of characteristic frequencies in the low-rank cylinder flow.

The characteristic frequencies of flow oscillations occur in pairs, reflecting the complex-conjugate pairs of eigenvalues that define sine and cosine temporal dynamics. Accordingly, the normalized singular values and coefficients for both the SVD and the normalized lambda values of the CP decomposition using BCD accurately reflect the true physics of the cylinder wake. The singular values capture more energy in fewer modes than the CP decomposition, reflecting its low matrix rank. But as we shall see, the data is not low in tensor rank, and this is due to the complex spatial structure which the SVD essentially ignores by flattening spatial dimensions. We note that the CP decomposition using ALS does not converge to the underlying dynamics, and hence omit ALS from subsequent results.

Figure 11 shows both the approximated spatial modes and the temporal dynamics for the randomized CP decomposition and the SVD. The temporal dynamics of both methods have similar patterns. However, the spatial modes exhibit a distinct structure. The SVD groups modes together according to their variance in the data, while ignoring coordinate structure. Thus, the first four spatial modes revealed by the CP decomposition are condensed into the first two SVD modes.

CP spatial modes resemble spatial Fourier transforms of the SVD spatial modes. This is because CP modes are rank-one outer products of vectors in the

and dimensions which by construction cannot capture the triangular wake interactions seen in the SVD modes.


(a) rCP BCD modes, .
(b) SVD.


Figure 11: Fluid flow decomposition, no noise. Both methods capture the same dominant frequencies from the time dynamics, while randomized CP requires more rank-1 outer products in and to represent single frequency spatial dynamics.

The CP decomposition represents the spatial structure by single spatial frequencies per mode, which poses difficulties in obtaining a decomposition low in tensor rank. This explains why for a fixed target rank of across all methods, the SVD achieves a substantially lower reconstruction error (See Table 2). However, the compression ratios for the CP and SVD methods are and so that the tensor representation compresses the data nearly two orders of magnitude more.

Parameters Time (s) Speedup Iterations Error


115.55 - 458 0.117


, , 1.27 91 533 0.122
, , 1.41 82 517 0.121
, , 1.56 74 437 0.118


0.57 - - 4.25E-05
Table 2: Summary of the computational results for the noise-free cylinder flow.

Next, the analysis of the same flow is repeated in the presence of additive white noise. While this is not of concern when dealing with flow simulations, it is realistic when dealing with flows obtained from measurement. We choose a signal-to-noise ratio of 2 to demonstrate the robustness of the CP decomposition to noise. Figure 12 shows again the corresponding dominant spatial modes and temporal dynamics. Both the SVD and the CP decomposition faithfully capture the temporal dynamics. However, comparing the modes of the SVD to Figure 11, it is apparent that the spatial modes are overfit and contain a large amount of noise. The spatial modes revealed by the CP decomposition show a significantly better approximation. Again, it is crucial to use power iterations to achieve a good approximation quality (See Table 3). By inspection, the relative reconstruction error using the SVD is poor compared to the error achieved using the CP decomposition. Here, we have shown the error for a rank and approximation. The latter target rank was determined using the optimal hard threshold for singular values (Gavish and Donoho, 2014). However, the suggested target rank is lower then the number of modes which are of interested as seen in  10. The CP decomposition overcomes this disadvantage, and is able to approximate the first modes with only a slight loss of accuracy. Note that the randomized CP decomposition performs better then the deterministic algorithm in the presence of noise.

Parameters Time (s) Speedup Iterations Error


64.01 - 239 0.191


, , 0.99 64 332 0.522
, , 1.23 52 414 0.189
, , 1.13 56 370 0.153


0.58 - - 0.655
0.58 - - 0.311
Table 3: Summary of the computational results for the noise-corrupted cylinder flow.


(a) rCP BCD modes, .
(b) SVD.


Figure 12: Fluid flow decomposition noisy (SNR=2). Randomized CP modes are robust to additive noise and SVD spatial modes are corrupted by noise.

Figure 13 further illustrates the approximation quality with an example snapshot from the reconstructed flow in time. The left column corresponds to the noise free scenario. Here the performance of the SVD is nearly perfect. However, the left column containing the noisy example shows the clear advantage of the CP decomposition. Despite the denoising effect of the SVD, the approximated low-rank subspace remains distorted by noise in the ambient space. While not perfect, the CP decomposition allows a more meaningful interpretation of the underlying structure, and thus can be seen as a valuable tool for the analysis of fluid flows in the presence of noise. In addition, the spatiotemporal, multiscale separation performed by the CP decomposition can be beneficial for flow interpretation in practice. From a purely data analytic perspective, the CP modes offer a new and intriguing interpretation of flow structure, hence the trade-off between lower-rank approximation with SVD and multiscale separation with tensors must be weighed carefully. In practice, a tensor decomposition may be used to filter noise from the data, followed by a standard POD/Galerkin procedure if reduced order models are sought.


(a) Noise free.
(b) Noisy (SNR=2).
Figure 13: Fluid snapshots, and randomized CP and SVD approximations of the snapshot are pictured in the first, second and third rows, respectively. The randomized CP better recovers the true signal (top left) from noise-corrupted flow (top right).

5 Conclusion

The emergence of massive tensors require efficient algorithms for obtaining the CP decomposition. We have presented a randomized CP algorithm which substantially reduces the computational burdens involved in obtaining a tensor decomposition. Despite the computational savings, the approximation quality is near-optimal, and in the presence of noisy measurements even better then the deterministic CP algorithm. A key advantage of the randomized algorithm is that modern computational architectures are fully exploited. Thus, the algorithm benefits substantially from multithreading in a multi-core processor. In contrast to proposed algorithms which are based on computational concepts such as distributed computing, our proposed randomized algorithm provides substantial computational speedups even on standard desktop computers. This is achieved by substantially reducing the computational costs per iteration. In practice, real world examples require a larger number of iterations to converge, and thus traditional deterministic algorithms are often not feasible.

Randomized algorithms have established themselves as highly competitive methods for computing traditional matrix decompositions. Thus the generalization of these concepts to tensors are evident. In particular, the concept of oversampling as well as power iterations are crucial in order to achieve a robust tensor decomposition. Our experimental results show that in general a small oversampling parameter and about two power iterations are sufficient. Once a good basis is obtained, the CP decomposition can be obtained in a computationally efficient manner on the compressed tensors using well established optimization strategies. We have considered both the ALS and the BCD methods. Due to its more natural relationship to the SVD, we have favored the BCD method throughout our numerical examples. Indeed, the randomized CP decomposition demonstrates outstanding performance on several examples using artificial and real-world data. While the SVD is able to achieve reconstruction errors as low as machine precision, the CP decomposition is advantageous in the presence of noise. This is because the parsimonious approximation is more robust to noise. Further, the CP decomposition reveals interesting structures not captured by SVD.

The advantages of randomization make the algorithm useful as a common routine, which may simply replace the deterministic algorithm. The achieved approximation quality is sufficient for many approximations. However, if higher precision is required, the proposed algorithm can be used to efficiently determine the optimal tensor rank and provide good starting values to initialize the factor matrices. Further research will involve the development of randomized algorithms for sparse tensors, and tensors with missing values. Also of interest is the development of randomized algorithms that are tailor-made for non-negative tensor factorizations, which play an important role in many signal and video processing applications.

Appendix A Proof of Theorem 1

In the following, we sketch a proof for Theorem 1, which yields an upper bound for the approximate basis for the range of a tensor. To assess the quality of the basis matrices , we first show that the problem can be expressed as a sum of subproblems. Defining the residual error


Note that the Frobenius norm of a tensor and its matricized forms are equivalent. Defining the orthogonal projector , we can reformulate (19) compactly as


Assuming that yields an exact projection onto the column space of the matrix , we need to show first that the error can be expressed as a sum of the errors of the projections



denotes the identity matrix. Following 

Drineas and Mahoney (2007), let us add and subtract the term in Equation (20) so that we obtain


The bound (23) follows from the triangular inequality for a norm. Next, the common term is factored out in Equation (24). Then, the bound (25) follows from the properties of orthogonal projectors. This is because the , and then it holds that . See Proposition 8.5 by Halko et al. (2011) for a proof using matrices. Subsequently the residual error can be bounded


From this inequality, Equation (21) follows. We take the expectation of Equation (21)


Recalling that Theorem 10.5 formulated by Halko et al. (2011) states the following expected approximation error (formulated here using tensor notation)


assuming that the sample matrix in Equation (5) is constructed using a standard Gaussian matrix . Here denotes the singular values of the matricized tensor greater then the chosen target rank . Combining Equations (27) and (28) then yields the results of the theorem (9).

Figures 14 evaluates the theoretical upper bound over 100 runs.

(a) Tensor of dimension .
(b) Tensor of dimension .
Figure 14: Given both a third and fourth order random low-rank tensor, and assuming a fixed oversampling parameter , the performance of the theoretical upper bound (100 trials) for varying target ranks is bounding the average error faithfully.


  • Achlioptas and McSherry (2007) D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM, 54:9, 2007.
  • Bader and Kolda (2015) B. W. Bader and T. G. Kolda. MATLAB tensor toolbox version 2.6, 2015. URL http://www.sandia.gov/~tgkolda/TensorToolbox/.
  • Bro and Andersson (1998) R. Bro and C. A. Andersson. Improving the speed of multiway algorithms: Part II: Compression. Chemometrics and Intelligent Laboratory Systems, 42:105–113, 1998.
  • Brunton and Noack (2015) S. L. Brunton and B. R. Noack. Closed-loop turbulence control: Progress and challenges. Applied Mechanics Reviews, 67:1–60, 2015.
  • Carroll and Chang (1970) J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of ‘Eckart-Young’ decomposition. Psychometrika, 35:283–319, 1970.
  • Cichocki and Phan (2009) A. Cichocki and A. H. Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 92:708–721, 2009.
  • Cichocki et al. (2009) A. Cichocki, R. Zdunek, A. Phan, and S. Amari. Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009.
  • Colonius and Taira (2008) T. Colonius and K. Taira. A fast immersed boundary method using a nullspace approach and multi-domain far-field boundary conditions. Computer Methods in Applied Mechanics and Engineering, 197:2131–2146, 2008.
  • Comon et al. (2009) P. Comon, X. Luciani, and A. De Almeida. Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics, 23:393–405, 2009.
  • De Lathauwer et al. (2000) L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21:1253–1278, 2000.
  • Drineas and Mahoney (2007) P. Drineas and M. W. Mahoney. A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and its Applications, 420:553–571, 2007.
  • Drineas and Mahoney (2016) P. Drineas and M. W. Mahoney. RandNLA: Randomized numerical linear algebra. Communications of the ACM, 59:80–90, 2016.
  • Erichson et al. (2016) N. B. Erichson, S. Voronin, S. L. Brunton, and J. N. Kutz. Randomized matrix decompositions using R. arXiv preprint arXiv:1608.02148, 2016.
  • Frieze et al. (2004) A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 51:1025–1041, 2004.
  • Gavish and Donoho (2014) M. Gavish and D. L. Donoho. The optimal hard threshold for singular values is . IEEE Transactions on Information Theory, 60(8):5040–5053, 2014.
  • Gu (2015) M. Gu. Subspace iteration randomization and singular value problems. SIAM Journal on Scientific Computing, 37(3):1139–1173, 2015. doi: 10.1137/130938700.
  • Halko et al. (2011) N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011. doi: 10.1137/090771806.
  • Harshman (1970) R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. Technical Report 16, Working Papers in Phonetics, UCLA, 1970.
  • Hitchcock (1927) F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematical Physics, 6:164–189, 1927.
  • Jones et al. (2001) E. Jones, T. Oliphant, and P. Peterson. SciPy: Open source scientific tools for Python, 2001. URL https://www.scipy.org/.
  • Kim et al. (2014) J. Kim, Y. He, and H. Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58:285–319, 2014.
  • Kolda and Bader (2009) T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51:455–500, 2009.
  • Li et al. (2013) N. Li, S. Kindermann, and C. Navasca. Some convergence results on the regularized alternating least-squares method for tensor decomposition. Linear Algebra and its Applications, 438:796–812, 2013.
  • Liberty et al. (2007) E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104:20167–20172, 2007.
  • Mahoney (2011) M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3:123–224, 2011.
  • Martinsson (2016) P.-G. Martinsson. Randomized methods for matrix computations and analysis of high dimensional data. arXiv preprint arXiv:1607.01649, 2016.
  • Martinsson et al. (2011) P.-G. Martinsson, V. Rokhlin, and M. Tygert. A randomized algorithm for the decomposition of matrices. Applied and Computational Harmonic Analysis, 30:47–68, 2011.
  • Phan and Cichocki (2011) A. Phan and A. Cichocki. PARAFAC algorithms for large-scale problems. Neurocomputing, 74:1970–1984, 2011.
  • Rokhlin et al. (2009) V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31:1100–1124, 2009.
  • Sidiropoulos et al. (2014) N. Sidiropoulos, E.Papalexakis, and C. Faloutsos. Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition. IEEE Signal Processing Magazine, 31:57, 2014.
  • Szlam et al. (2014) A. Szlam, Y. Kluger, and M. Tygert. An implementation of a randomized algorithm for principal component analysis. arXiv preprint arXiv:1412.3510, 2014.
  • Tsourakakis (2010) C. E. Tsourakakis. MACH: Fast randomized tensor decompositions. In Proc. 2010 SIAM Int. Conf. Data Mining, pages 689–700, 2010.
  • Uschmajew (2012) A. Uschmajew. Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM Journal on Matrix Analysis and Applications, 33:639–652, 2012.
  • Vervliet and Lathauwer (2016) N. Vervliet and L. D. Lathauwer. A randomized block sampling approach to canonical polyadic decomposition of large-scale tensors. IEEE Journal of Selected Topics in Signal Processing, 10:284–295, 2016.
  • Vervliet et al. (2014) N. Vervliet, O. Debals, L. Sorber, and L. D. Lathauwer.

    Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis.

    IEEE Signal Processing Magazine, 31:71–79, 2014.
  • Voronin and Martinsson (2015) S. Voronin and P.-G. Martinsson. RSVDPACK: Subroutines for computing partial singular value decompositions via randomized sampling on single core, multi core, and GPU architectures. arXiv preprint arXiv:1502.05366, 2015.
  • Wang and Chu (2014) L. Wang and M. T. Chu. On the global convergence of the alternating least squares method for rank-one approximation to generic tensors. SIAM Journal on Matrix Analysis and Applications, 35:1058–1072, 2014.
  • Witten and Candes (2015) R. Witten and E. Candes. Randomized algorithms for low-rank matrix factorizations: sharp performance bounds. Algorithmica, 72:264–281, 2015.
  • Woolfe et al. (2008) F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Journal of Applied and Computational Harmonic Analysis, 25:335–366, 2008.
  • Xu and Yin (2013) Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
  • Zhou et al. (2014) G. Zhou, A. Cichocki, and S. Xie. Decomposition of big tensors with low multilinear rank. arXiv preprint arXiv:1412.1885, 2014.