# Accelerating Alternating Least Squares for Tensor Decomposition by Pairwise Perturbation

The alternating least squares algorithm for CP and Tucker decomposition is dominated in cost by the tensor contractions necessary to set up the quadratic optimization subproblems. We introduce a novel family of algorithms that uses perturbative corrections to the subproblems rather than recomputing the tensor contractions. This approximation is accurate when the factor matrices are changing little across iterations, which occurs when alternating least squares approaches convergence. We provide a theoretical analysis to bound the approximation error, leveraging a novel notion of the tensor condition number. Our numerical experiments demonstrate that the proposed pairwise perturbation algorithms are easy to control and converge to minima that are as good as alternating least squares. The performance of the new algorithms shows improvements of 1.3-2.8X with respect to state of the art alternating least squares approaches for various model tensor problems and real datasets on 1, 16 and 256 Intel KNL nodes of the Stampede2 supercomputer.

## Authors

• 10 publications
• 27 publications
10/22/2020

### Efficient parallel CP decomposition with pairwise perturbation and multi-sweep dimension tree

CP tensor decomposition with alternating least squares (ALS) is dominate...
05/17/2019

### Tensor Ring Decomposition: Energy Landscape and One-loop Convergence of Alternating Least Squares

In this work, we study the tensor ring decomposition and its associated ...
10/14/2021

### More Efficient Sampling for Tensor Decomposition

Recent papers have developed alternating least squares (ALS) methods for...
11/25/2019

### The Epsilon-Alternating Least Squares for Orthogonal Low-Rank Tensor Approximation and Its Global Convergence

The epsilon alternating least squares (ϵ-ALS) is developed and analyzed ...
04/14/2022

### Alternating Mahalanobis Distance Minimization for Stable and Accurate CP Decomposition

CP decomposition (CPD) is prevalent in chemometrics, signal processing, ...
04/06/2020

### Efficient Alternating Least Squares Algorithms for Truncated HOSVD of Higher-Order Tensors

The truncated Tucker decomposition, also known as the truncated higher-o...
02/03/2021

### PARAFAC2 AO-ADMM: Constraints in all modes

The PARAFAC2 model provides a flexible alternative to the popular CANDEC...

## Code Repositories

### pairwise-perturbation

Pairwise Perturbation: an efficient numerical algorithm for alternating least squares in tensor decompositions

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Tensor decompositions provide general techniques for approximation and modeling of high dimensional data

[30, 18, 14, 20, 40, 11]. They are fundamental in methods for computational chemistry [8, 26, 24], physics [37], and quantum information [37, 25]. Tensor decompositions are performed on tensors arising both in the context of numerical-PDEs (e.g. as part of preconditinioners [39]) as well as in data-driven statistical modeling [2, 30, 33, 34]. The alternating least squares (ALS) method, which is most commonly used to compute many of these tensor decompositions, has become a target for parallelization [27, 22], performance optimization [12, 43], and acceleration by randomization [7]. We propose a new algorithm that asymptotically accelerates ALS iteration complexity for CP and Tucker decomposition by leveraging an approximation that is provably accurate for well-conditioned problems and is increasingly so as the algorithm approaches the optimization local minima.

Generally, ALS solves quadratic optimization subproblems for individual factors composing the decomposition. It does so in an alternating manner, updating every factor at each sweep. For both CP and Tucker decomposition, computational cost of each sweep is dominated by the tensor contractions needed to setup the quadratic optimization subproblem for every factor matrix. These contractions are redone at every ALS sweep since they involve the factor matrices, all of which change after each sweep. We propose to circumvent these contractions in the scenario when the factor matrices are changing only slightly at each sweep, which is expected when ALS approaches a local minima. Our method approximates the setup of each quadratic optimization subproblem by computing perturbative corrections due to the change in each factor matrices to the quadratic subproblems used for updating each other factor matrix. To do so, pairwise perturbative operators are computed that propagate the change to each factor matrix to the subproblem needed to update each other factor matrix. Computing these operators costs slightly more than a typical ALS iteration. These operators are then reused to approximately perform more ALS sweeps until the changes to the factor matrices are deemed large, at which point, the pairwise operators are recomputed. Each sweep computed approximately in this way costs asymptotically less than a regular ALS sweep.

For CP decomposition, CP-ALS [11, 21] is widely used as it provides high-accuracy for the amount of computation required [30] (although alternatives based on gradient and subgradient descent are also competitive [1]). Within CP-ALS, the computational bottleneck of each iteration involves an operation called the matricized tensor-times Khatri-Rao product (MTTKRP). Similarly, the costliest operation in the ALS-based Tucker decomposition (Tucker-ALS) method is called the tensor times matrix-chain (TTMc) product. For an order tensor with modes of dimension , perturbative approximated computation of ALS sweeps reduces the cost of that sweep from to for a rank- CP decomposition and from to for a rank- Tucker decomposition.

In order to bound the error incurred by the perturbative approximation, we define a tensor condition number in Section 3

, which, for a tensor contracted with vectors along all except one mode, bounds the maximum relative output amplification vector due to any input perturbation to the input vectors. Both the MTTKRP and TTMc products in CP and rank-

Tucker decomposition, respectively, consist of a set of such contractions. We demonstrate that this condition number is the reciprocal of the distance to the nearest tensor which has a Tucker decomposition that results in a core tensor with a zero fiber. This tensor condition number generalizes the condition number of square matrices to equidimensional tensors, and provides a sensible extension to non-square matrices and non-equidimensional tensors.

To quantify the accuracy of the pairwise perturbation approximated algorithm, in Section 5, we provide an error analysis for both the MTTKRP and TTMc operations. For both operations, we bound the relative output error based on the relative magnitude of the perturbation to the factor matrices since the computation of the pairwise operators. We establish error bounds for both cases using the condition number of the input tensor. To achieve the same accuracy for the operation output, tighter restrictions are needed for MTTKRP operation compared to TTMc. For MTTKRP, the column-wise relative error of the output is when the column-wise 2-norm relative perturbation of all input matrices are bounded by . For TTMc, we establish the same relationship as well as a similar bound, relating the matrix 2-norms of the output error and input perturbation. In addition, for the TTMc operation, we derive a condition-number-independent 2-norm relative error bound of that holds when the residual of the Tucker decomposition is somewhat less than the norm of the original tensor. Finally, we derive a Frobenius norm error bound of for TTMc, which assumes only that HOSVD [15, 46] is performed to initialize Tucker-ALS (which is typical).

In order to evaluate the performance benefit of pairwise perturbation, in Section 6, we compare kernels and full decomposition performance on one and many nodes of a Intel KNL system (Stampede2) of implementations developed using the Cyclops [44] and ScaLAPACK [9] libraries. Our microbenchmark results compare the strong and weak scaling performance of one ALS sweep of the dimension tree algorithm and the first/restart step, in which the pairwise perturbation operators are calculated, as well as the middle steps, in which the operators are not recalculated, of the pairwise perturbation algorithm. These results show that the middle pairwise perturbation steps are 7.4-17.7X faster than one ALS sweep of the dimension tree algorithm in the weak scaling regime, while computing the pairwise operators takes no more than 2.6X the time of a dimension tree ALS sweep. We then study the performance and numerical behavior of pairwise perturbation for decomposition of synthetic tensors and application datasets. Our experimental results show that pairwise perturbation achieves as low residuals as standard ALS, and achieves typical speed-ups of 1.3-2.8X with respect to state of the art dimension tree based ALS algorithms on 1, 16, and 256 KNL nodes. Our results also show that with the increase of the input tensor size, the performance improvements increase, confirming the asymptotic cost improvement and indicating the potential of pairwise perturbation in large-scale data analysis.

## 2 Background

This section first outlines the notation that is used throughout this paper, then outlines the basic alternating least square algorithms for both CP and Tucker decomposition.

### 2.1 Notation and Definitions

Our analysis makes use of tensor algebra in both element-wise equations and specialized notation for tensor operations [30]. For vectors, bold lowercase Roman letters are used, e.g., . For matrices, bold uppercase Roman letters are used, e.g., . For tensors, bold calligraphic fonts are used, e.g., . An order tensor corresponds to an -dimensional array with dimensions . Elements of vectors, matrices, and tensors are denotes in parentheses, e.g., for a vector , for a matrix , and for an order 4 tensor . Columns of a matrix are denoted by .

The mode- matrix product of a tensor with a matrix is denoted by , with the result having dimensions . By juxtaposition of tensor and a matrix , we denote the mode- product with the transpose of the matrix, . Matricization is the process of unfolding a tensor into a matrix. Given a tensor the mode- matricized version is denoted by where . We generalize this notation to define the unfoldings of a tensor with dimensions into an order tensor, , where , e.g.,

 X(j,k,l,m)=X(1,3)(j,l,k+(m−1)⋅s2).

We use parenthesized superscripts as labels for different tensors, e.g., and are generally unrelated tensors.

The Hadamard product of two matrices resulting in matrix is denoted by , where . The outer product of K vectors of corresponding sizes is denoted by where is an order tensor. The Kronecker product of vectors and is denoted by where . For matrices and , their Khatri-Rao product resulting in a matrix of size defined by

 A⊙B=[a(1)⊗b(1),…,a(K)⊗b(K)].

### 2.2 CP Decomposition with ALS

The CP tensor decomposition [23, 21]

is a higher-order generalization of the matrix singular value decomposition (SVD). The CP decomposition is denoted by

 X≈[[A(1),⋯,A(N)]],whereA(i)=[a(i)1,⋯,a(i)r],

and serves to approximate a tensor by a sum of tensor products of vectors:

 X≈R∑r=1a(1)r∘⋯∘a(N)r.

The CP-ALS method alternates among quadratic optimization problems for each of the factor matrices , resulting in linear least squares problems for each row,

 A(n)P(n)T≅X(n),

where the matrix , where is formed by Khatri-Rao products of the other factor matrices,

 P(n)=A(1)⊙⋯⊙A(n−1)⊙A(n+1)⊙⋯⊙A(N).

These linear least squares problems are often solved via the normal equations [30]. We also adopt this strategy here to devise the pairwise perturbation method. The normal equations for the th factor matrix are

 A(n)Γ(n)←X(n)P(n),

where can be computed via

 Γ(n)=S(1)∗⋯∗S(n−1) ∗S(n+1)∗⋯∗S(N), with eachS(i)= A(i)TA(i).

These equations also give the th component of the optimality conditions for the unconstrained minimization of the nonlinear objective function,

 f(A(1),…,A(N))=12||X−[[A(1),⋯,A(N)]]||2F,

for which the th component of the gradient is

 ∂f∂A(n)=G(n)=A(n)Γ(n)−X(n)P(n)=(A(n)−A(n)new)Γ(n).

Algorithm 1 presents the basic ALS method described above, keeping track of the Frobenius norm of the components of the overall gradient to ascertain convergence.

The Matricized Tensor Times Khatri-Rao Product or MTTKRP computation is the main computational bottleneck of CP-ALS[6]. The computational cost of MTTKRP is if for all . The tensor contractions necessary for MTTKRP can be amortized across the linear least squares problems necessary for a given ALS sweep (while loop iteration in Algorithm 1). With the best choice of dimension trees [41, 48, 29, 6] to calculate in one ALS iteration, to leading order in , the computational complexity is The normal equations worsen the conditioning, but are advantageous for CP-ALS, since can be computed and inverted in just cost and the MTTKRP can be amortized by dimension trees. If QR is used instead of the normal equations, the product of with the right-hand sides would have the cost and would need to be done for each linear least squares problem, increasing the overall leading order cost by a factor of . Our pairwise perturbation algorithm amortizes the dominant cost of the computation of across multiple ALS sweeps.

### 2.3 Tucker Decomposition with ALS

In this section we review the ALS method for computing a low-rank Tucker decomposition of a tensor [46]. Tucker decomposition approximates a tensor by a core tensor contracted by orthogonal matrices along each mode. The Tucker decomposition is given by

 X≈[[G;A(1),…,A(N)]]=G×1A(1)×2A(2)⋯×NA(N).

The corresponding element-wise expression is

 X(x1,…,xN)≈∑{z1,…,zN}G(z1,…,zN)∏r∈{1,…,N}A(r)(xr,zr).

The core tensor is of order with dimensions (Tucker ranks) (throughout error and cost analysis we assume each for ). The matrices have orthonormal columns.

The higher-order singular value decomposition (HOSVD) [15, 46] computes the leading left singular vectors of each one-mode unfolding of , providing a good starting point for the Tucker-ALS algorithm. The classical HOSVD computes the truncated SVD of and sets for . The interlaced HOSVD [49, 19] instead computes the truncated SVD of

 Z(n)(n)=U(n)Σ(n)V(n)TwhereZ(1)=XandZ(n+1)(n)=Σ(n)V(n)T.

The interlaced HOSVD is cheaper, since the size of each is .

The ALS method for Tucker decomposition [3, 16, 30], which is also called the higher-order orthogonal iteration (HOOI), then proceeds by fixing all except one factor matrix, and computing a low-rank matrix factorization to update that factor matrix and the core tensor. To update the th factor matrix Tucker-ALS factorizes

 Y(n)=X×1A(1)T⋯×n−1A(n−1)T×n+1A(n+1)T⋯×NA(N)T

into a product of an orthogonal matrix

and the core tensor , so that This factorization can be done by taking to be the leading left singular vectors of . This Tucker-ALS procedure is given in Algorithm 2.

As in previous work [38, 13]

, our implementation computes these singular vectors by finding the left eigenvectors of the Gram matrix

Computing the Gram matrix sacrifices some numerical stability, but avoids a large SVD and provides consistency of the signs of the singular vectors across ALS sweeps.

The Tensor Times Matrix-chain or TTMc computes each and is the main computational bottleneck of Tucker-ALS [28]. With the use of dimensions trees [41, 48, 29, 6] to calculate in one ALS sweep, the computational complexity for each while loop iteration in Algorithm 2 to leading order in is

## 3 Tensor Conditioning

Our error analysis makes use of bounds on the spectral norm and condition number of tensors. We introduce a notion of a tensor condition number that corresponds to a global bound on the conditioning of the multilinear vector-valued function, associated with the product of the tensor with vectors along all except the first mode,

 gT(x1,…,xN−1)=T(1)(x1∘⋯∘xN−1).

The norm and condition number are given by extrema of the norm amplification of , which are described by the amplification function ,

 fT(x1,…,xN−1)=||gT(x1,…,xN−1)||2||x1||2⋯||xN−1||2.

The spectral norm of the tensor corresponds to its supremum,

 ||T||2=sup{fT}.

The tensor condition number is then defined as

 κ(T)=sup{fT}/inf{fT},

which enables quantification of the worst-case relative amplification of error with respect to input for the product of a tensor with vectors along all except one mode. In particular, provides an upper bound on the relative norm of the perturbation of with respect to the relative norm of any perturbation to any input vector. This tensor condition number should not be confused with the conditioning of CP decomposition [17, 47, 10].

### 3.1 Tensor Norm

The spectral norm of any tensor is

 ||T||2=max∀i∈{1,…,N−1},xi∈Rsi+1||x1||2=⋯=||xN−1||2=1||T(1)(x1∘⋯∘xN−1)||2.

The spectral tensor norm corresponds to the magnitude of the largest tensor singular value [32] and is invariant under reordering of modes of . Lemma 1 shows submultiplicativity of this norm for the tensor times matrix product.

###### Lemma 1

Given any tensor and matrix , if then .

###### Proof

Assume , then there exist unit vectors such that

 ||V||2=||V(N)(x1∘⋯∘xN−1)||2=||T(N)(x1∘⋯∘xN−2∘MxN−1)||2.

Let , so . We arrive at a contradiction, since

 ||T(N)(x1∘⋯∘xN−2∘z)||2≤||T(N)(x1∘⋯∘xN−2∘z)||2||M||2||z||2≤||T||2||M||2.

### 3.2 Tensor Condition Number

We define the condition number of a tensor as the ratio of the supremum and infimum of the aforementioned function . For a matrix , if this corresponds to defining where is the smallest singular value of in the reduced SVD, while if , then . Generally, for any tensor the condition number is

 κ(T)=||T||2/min∀i∈{1,…,N−1},xi∈Rsi+1||x1||2=⋯=||xN−1||2=1||T1(x1∘⋯∘xN−1)||2.

When tensor dimensions are unequal, the condition number is infinite if the first dimension is not the largest, so for some , . Aside from this condition, the ordering of modes of does not affect the condition number, since for any , the supremum/infimum of over the domain of unit vectors are for some choice of the maximum/minimum singular values of

 K=T(1,m)(x(1)∘⋯∘x(m−1)∘x(m+1)∘⋯∘x(N−1)).

To demonstrate that this condition number can be bounded for tensors of order greater than 2, we provide two examples of order three tensors that have unit condition number. The first example has , and yields a Givens rotation when contracted with a vector along the last mode. It is composed of two slices:

 [11]and[1−1].

The second example has and is composed of four slices:

 ⎡⎢ ⎢ ⎢⎣111−1⎤⎥ ⎥ ⎥⎦,⎡⎢ ⎢ ⎢⎣1−111⎤⎥ ⎥ ⎥⎦,⎡⎢ ⎢ ⎢⎣11−11⎤⎥ ⎥ ⎥⎦,⎡⎢ ⎢ ⎢⎣−1111⎤⎥ ⎥ ⎥⎦.

The fact that this tensor has unit condition number can be verified by symbolic algebraic manipulation or numerical tests. Other perfectly conditioned tensors can be obtained by multiplying the above tensors by an orthogonal matrix along any mode (we prove below that such transformations preserve condition number), but currently we do not know how to construct tensors of other order/dimensions with bounded condition number.

In our analysis, we make use of the following submultiplicativity property of the tensor condition number with respect to tensor times matrix products (the property also generalizes to pairs of arbitrary order tensors contracted over one mode).

###### Lemma

For any and matrix , if then .

###### Proof

Assume , then there exist unit vectors and such that

 κ(T)<κ(V)=||V(N)(x1∘⋯∘xN−1)||2||V(N)(y1∘⋯∘yN−1)||2=||T(1)(x1∘⋯∘xN−2∘MxN−1)||2||T(1)(y1∘⋯∘yN−2∘MyN−1)||2.

Let and , so , yielding a contradiction,

 κ(V)≤||T(1)(x1∘⋯∘xN−2∘(u/||u||2))||2||T(1)(y1∘⋯∘yN−2∘(v/||v||2)||2κ(M)≤κ(T)κ(M).

Applying Lemma 3.2 with a vector, i.e. when and so has condition number , implies . By an analogous argument to the proof of Lemma 3.2, we can also conclude that the norm and infimum of such a product of with unit vectors are bounded by those of , giving the following corollary.

###### Corollary

For any , vector , and any such that with and , if then , , and .

For an orthogonal matrix , Lemma 3.2 can be applied in both directions, namely for and , so we observe that . Using this fact, we demonstrate in the following theorem that any tensor can be transformed by orthogonal matrices along each mode, so that one of its fibers has norm .

###### Theorem 1

For any , there exist orthogonal matrices , with , such that satisfies , , and the first fiber of , i.e. the vector with , satisfies .

###### Proof

Given a tensor with infinite condition number, there must exist unit vectors , such that . We define orthogonal matrices such that . We can then contract with these matrices along the last modes, resulting in , with the same condition number as (by Lemma 3.2) and the same norm (by a similar argument). Then, we have that the first fiber of is

 v=V1(e1∘⋯∘e1)=T1(x1∘⋯∘xN−1),

and consequently .

By Theorem 1, the condition number of a tensor is infinity if and only if it can be transformed by products with orthogonal matrices along the last modes into a tensor with a zero fiber. Further, any tensor may be perturbed to have infinite condition number by adding to it some with relative norm .

## 4 Pairwise Perturbation Algorithms

We now introduce a pairwise perturbation (PP) algorithm to accelerate the ALS procedure when the iterative optimization steps are approaching a local minimum. We first focus on deriving the approximation that asymptotically reduces the cost complexity, then in Section 4.3, provide algorithms that minimize constant factors in cost via dimension trees [41, 48, 29, 6]. The key idea of the pairwise perturbation method is to compute pairwise perturbation operators, which correlate a pair of factor matrices. These tensors are then used to repeatedly update the quadratic subproblems for each tensor. As we will show, these updates are provably accurate if the factor matrices do not change significantly since their state at the time of formation of the pairwise perturbation operators.

### 4.1 Cp-Als

The pairwise perturbation procedure for CP decomposition approximates (defined in Section 2.2) using pairwise perturbation operators for . Below, we define these and more general partially contracted MTTKRP intermediates.

###### Definition

is defined as follows,

 M(i1,i2,…,im)=X(i1,i2,…,im)⨀j∈{1,…,N}∖{i1,i2,…,im}A(j).

Element-wise, we have

 ∑{x1,…,xN}∖{xi1,xi2,…,xim}X(x1,…,xN)∏r∈{1,…,N}∖{i1,i2,…,im}A(r)(xr,k).

Let denote the calculated with a regular ALS step, some number of steps prior to the current one. Then at the current step can be expressed as

 A(n)=A(n)p+dA(n),

and can be expressed as

 M(n)=X(n)N⨀i=1,i≠n(A(i)p+dA(i)).

The expression above can be rewritten as a function of , which is defined in the same way as except that is contracted with for . can be expressed as follows,

 M(n)(y,k)= M(n)p(y,k)+N∑i=1,i≠nsi∑x=1M(i,n)p(x,y,k)dA(i)(x,k)+ N∑i=1,i≠nN∑j=i+1,j≠nsi∑x=1sj∑z=1M(i,j,n)p(x,z,y,k)dA(i)(x,k)dA(j)(z,k)+⋯.

From the above expression we observe that except the first two terms, all terms include the contraction between tensor and at least two matrices , which are small in norm when each is small in norm. The pairwise perturbation algorithm obtains an effective approximation by computing only the first two terms (these terms are described by Figure 1):

 ~M(n)(y,k) =M(n)p(y,k)+N∑i=1,i≠nsi∑x=1M(i,n)p(x,y,k)dA(i)(x,k), whereM(n)p =X(n)N⨀i=1,i≠nA(i)p,andM(i,n)p=X(i,n)N⨀j∈{1,…,N}∖{i,n}A(j)p.

Given and , calculation of for requires operations overall. Further, we show in Section 5.1 that the column-wise relative approximation error of with respect to is small if each is sufficiently small. Algorithm 3 presents the PP-CP-ALS method described above.