 # Learning the Sparse and Low Rank PARAFAC Decomposition via the Elastic Net

In this article, we derive a Bayesian model for learning a sparse and low-rank PARAFAC decomposition of an observed tensor with missing values via the elastic net; the model is able to find the true rank and a sparse factor matrix, and it is robust to noise. We formulate an efficient block coordinate descent algorithm and an AdaMax stochastic block coordinate descent algorithm to solve it, the latter of which scales to large problems. To choose an appropriate rank and sparsity level in the PARAFAC decomposition, we compute a solution path by gradually increasing the regularization, which increases the sparsity and decreases the rank. Once the sparse structure of the factor matrices is found, we fix that structure and re-solve with a small regularization to decrease the recovery error; one can then choose from the solution path a decomposition with sufficiently sparse factor matrices and a low recovery error. Experiments on simulated and real data demonstrate the effectiveness of our algorithms.


## 1 Introduction

Tensor decomposition has been used as a powerful tool in many fields for multiway data analysis (Kolda and Bader (2009)). Among the representation models for tensor decomposition, the PARAFAC decomposition is one of the simplest. It aims to extract data-dependent basis functions (i.e., factors) so that the observed tensor can be expressed as a multilinear combination of these factors, which can reveal meaningful underlying structure in the data. The ideal situation for the PARAFAC decomposition is that we know (1) the true rank of the tensor, and (2) the sparse structure (i.e., the locations of zeros) of the factors. Given a wrong estimate of the tensor's rank or of the factors' sparse structure, even if the recovery error of the decomposition is very low, there can still be a high generalization error, because the learning procedure tries to fit the noise. But in most real-world scenarios, we cannot obtain prior knowledge of the true rank of the tensor or of the true sparse structure of the factors. Various works have proposed solutions to one or the other of these problems. For example, Liu et al. proposed a sparse non-negative tensor factorization using columnwise coordinate descent to obtain sparse factors (Liu et al. (2012)). To obtain a proper rank estimate, a novel rank-regularized PARAFAC decomposition was proposed in (Bazerque et al. (2013)). However, to the authors' best knowledge, no existing work handles both problems simultaneously in an integrated framework. Moreover, many works emphasize a low recovery error, and few focus on finding the true underlying factors of noisy observed data. As our simulation examples show, some algorithms (e.g., CP-ALS) achieve a lower relative recovery error yet deviate from the true underlying factors, because they fit the noise in order to reduce the recovery error. Our algorithm de-noises effectively on both our simulated data and real datasets. Finally, little research has used stochastic schemes to compute large-scale PARAFAC decompositions.

In response to the above problems, we develop a PARAFAC decomposition model that captures the true rank and produces sparse factors by minimizing the least-squares loss with elastic net regularization.

The elastic net regularization is a convex combination of $\ell_2$ regularization and $\ell_1$ regularization, where the $\ell_2$ regularization helps to capture the true rank, while the $\ell_1$ regularization tends to yield sparse factors. To solve the minimization problem, we use a block coordinate descent algorithm and a stochastic block coordinate descent algorithm; block coordinate descent has been shown to be an efficient scheme (Xu and Yin (2013)), especially for large-scale problems. To identify the true rank and sparse factors, we generate a solution path by gradually increasing the magnitude of the regularization, so that the initial dense decomposition is gradually transformed into a low-rank and sparse one. When we find the rank and the appropriate sparse factors, however, the recovery error may be large, because strong regularization shrinks the solution toward the origin. One can then fix the sparse structure of the factors and perform a constrained optimization with small regularization to decrease the recovery error. Finally, one chooses an appropriate low-rank and sparse PARAFAC decomposition from the refined solutions of the solution path that features a low recovery error. We first test our algorithm on synthetic data. The results show that the algorithm captures the true rank with high probability, that the sparse structure of the factors it finds is close to that of the true factors, and that the sparsity-constrained algorithm tends to find the true underlying factors. It also uncovers meaningful feature structure when applied to the COIL-20 dataset and the COPD data.

The rest of this paper is organized as follows: our Bayesian model is derived in Section 2. The block coordinate descent algorithm for the optimization is proposed in Section 3. Theoretical analysis is given in Section 4. The solution path method is given in Section 5. The stochastic block coordinate descent algorithm is proposed in Section 6. Results of applying the algorithms to synthetic and real data are summarized in Section 7, and we conclude in Section 8.

## 2 Bayesian PARAFAC model

The notation used throughout the article is as follows. We use bold lowercase and capital letters for vectors and matrices, respectively. Tensors are underlined, e.g., $\underline{X}$. Both the matrix and tensor Frobenius norms are represented by $\|\cdot\|_F$, and $\mathrm{vec}(\cdot)$ denotes the vectorization of a matrix. The symbols $\otimes$, $\odot$, $\circledast$, and $\circ$ denote the Kronecker, Khatri-Rao, Hadamard (entry-wise), and outer products, respectively; $\oslash$ denotes entry-wise division. $\mathrm{Tr}(\cdot)$ stands for the trace operation.

The observation $\underline{Z}$ is composed of the true tensor $\underline{X}$ and the noise tensor $\underline{E}$:

$$\underline{Z}_{i_1,i_2,\dots,i_N}=\underline{X}_{i_1,i_2,\dots,i_N}+\underline{E}_{i_1,i_2,\dots,i_N} \tag{1}$$

where the noise entries $\underline{E}_{i_1,\dots,i_N}$ are i.i.d. zero-mean Gaussian with variance $\sigma^2$. We only observe a subset of the entries of $\underline{Z}$, indicated by a binary tensor $\underline{\Delta}$.

Assume that $\underline{X}$ admits a rank-$R$ PARAFAC decomposition, which means that

$$\underline{X}_{i_1,i_2,\dots,i_N}=\sum_{r=1}^{R} A^{(1)}(i_1,r)\,A^{(2)}(i_2,r)\cdots A^{(N)}(i_N,r) \tag{2}$$

where $A^{(n)}\in\mathbb{R}^{I_n\times R}$ is the factor matrix of mode $n$, for $n\in[N]$. Let $a_r^{(n)}$ be the columns of $A^{(n)}$. Then we may express the decomposition as

$$\underline{X}=\sum_{r=1}^{R} a_r^{(1)}\circ a_r^{(2)}\circ\cdots\circ a_r^{(N)}=\sum_{r=1}^{R}\gamma_r\,\big(u_r^{(1)}\circ u_r^{(2)}\circ\cdots\circ u_r^{(N)}\big) \tag{3}$$

where $u_r^{(n)}:=a_r^{(n)}/\|a_r^{(n)}\|_2$, and the characteristic value $\gamma_r:=\prod_{n=1}^{N}\|a_r^{(n)}\|_2$. Denote $U^{(n)}:=[u_1^{(n)},\dots,u_R^{(n)}]$, $\gamma:=(\gamma_1,\dots,\gamma_R)^T$, and $A:=\{A^{(1)},\dots,A^{(N)}\}$.
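As a concrete illustration, the sum-of-outer-products form in (3) can be evaluated directly with a few lines of NumPy; the helper below is our own sketch, not the paper's implementation:

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild a tensor from PARAFAC factor matrices, as in Eq. (3).

    factors: list of N arrays, each of shape (I_n, R); the result is the
    sum over r of the outer products of the r-th columns of each factor.
    """
    R = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    X = np.zeros(shape)
    for r in range(R):
        outer = factors[0][:, r]
        for A in factors[1:]:
            outer = np.multiply.outer(outer, A[:, r])  # grow one mode at a time
        X += outer
    return X
```

For instance, three rank-one factors of sizes 2, 2, 2 produce a $2\times2\times2$ tensor whose $(i_1,i_2,i_3)$ entry is the product of the corresponding factor entries, matching (2).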

For fixed $n$, since the vectors $a_r^{(n)}$, $r\in[R]$, are interchangeable and identically distributed, they can be modeled as independent of each other and zero-mean, with density function

$$p_n(a_r^{(n)})=\beta_n\exp\!\Big(-\tfrac{1}{2}(a_r^{(n)})^T R_n^{-1}a_r^{(n)}-\mu_n\|a_r^{(n)}\|_1\Big) \tag{4}$$

where $\beta_n$ is the normalization constant; the density is the product of a Gaussian density (mean $0$, covariance $R_n$) and a double-exponential (Laplace) density.

In addition, the factors of different modes $n\in[N]$ are assumed mutually independent. And since scale ambiguity is inherently present in the PARAFAC model, the vectors $a_r^{(n)}$ for $n\in[N]$ are set to have equal power, that is,

$$\theta:=\mathrm{Tr}(R_1)=\mathrm{Tr}(R_2)=\cdots=\mathrm{Tr}(R_N),\qquad \mu:=\mu_1=\mu_2=\cdots=\mu_N \tag{5}$$

For computational tractability and simplicity, we assume that the $R_n$ are diagonal. Under these assumptions, the negative logarithm of the posterior distribution (up to a constant) is

$$L(\underline{X})=\frac{1}{2\sigma^2}\Big\|\Big(\underline{Z}-\sum_{r=1}^{R}a_r^{(1)}\circ a_r^{(2)}\circ\cdots\circ a_r^{(N)}\Big)\circledast\underline{\Delta}\Big\|_F^2+\sum_{r=1}^{R}\sum_{n=1}^{N}\Big[\tfrac{1}{2}(a_r^{(n)})^T R_n^{-1}a_r^{(n)}+\mu\|a_r^{(n)}\|_1\Big] \tag{6}$$

Correspondingly, the MAP estimator is $\hat{\underline{X}}=\sum_{r=1}^{R}\hat a_r^{(1)}\circ\cdots\circ\hat a_r^{(N)}$, where $\hat A$ is the solution of

$$\min_{A}\ \frac{1}{2}\Big\|\Big(\underline{Z}-\sum_{r=1}^{R}a_r^{(1)}\circ\cdots\circ a_r^{(N)}\Big)\circledast\underline{\Delta}\Big\|_F^2+\sum_{r=1}^{R}\sum_{n=1}^{N}\lambda\Big[\frac{1-\alpha}{2}(a_r^{(n)})^T R_n^{-1}a_r^{(n)}+\alpha\|a_r^{(n)}\|_1\Big] \tag{7}$$

where $\lambda=\sigma^2(1+\mu)$ and $\alpha=\mu/(1+\mu)$, obtained by matching (6) and (7). Note that this regularization is the elastic net introduced in (Zou and Hastie (2005)).

## 3 Block Coordinate Descent for Optimization

In order to solve (7), we derive a coordinate descent-based algorithm by first separating the cost function in (7) into a smooth part and a non-smooth part:

$$f(A):=\frac{1}{2}\Big\|\Big(\underline{Z}-\sum_{r=1}^{R}a_r^{(1)}\circ\cdots\circ a_r^{(N)}\Big)\circledast\underline{\Delta}\Big\|_F^2+\sum_{r=1}^{R}\sum_{n=1}^{N}\lambda\Big[\frac{1-\alpha}{2}(a_r^{(n)})^T R_n^{-1}a_r^{(n)}\Big] \tag{8}$$
$$r(A):=\sum_{n=1}^{N}r_n(A^{(n)})=\lambda\alpha\sum_{r=1}^{R}\sum_{n=1}^{N}\|a_r^{(n)}\|_1 \tag{9}$$
$$F(A):=f(A)+r(A) \tag{10}$$

The optimization problem in (7) is solved iteratively by updating one block at a time with all other blocks fixed. In detail, we cyclically minimize over the columns $a_r^{(n)}$ for $n\in[N]$ and $r\in[R]$. For example, if we minimize with respect to $a_1^{(n)}$ with all other columns of $A$ fixed, the following subproblem is obtained:

$$F(a_1^{(n)}):=\frac{1}{2}\Big\|\Big(\underline{Z}-\sum_{r=1}^{R}a_r^{(1)}\circ\cdots\circ a_r^{(N)}\Big)\circledast\underline{\Delta}\Big\|_F^2+\lambda\Big[\frac{1-\alpha}{2}(a_1^{(n)})^T R_n^{-1}a_1^{(n)}+\alpha\|a_1^{(n)}\|_1\Big] \tag{11}$$

Setting $\underline{W}:=\underline{Z}-\sum_{r=2}^{R}a_r^{(1)}\circ\cdots\circ a_r^{(N)}$, this becomes

$$F(a_1^{(n)}):=\frac{1}{2}\Big\|\Big(\underline{W}-a_1^{(1)}\circ a_1^{(2)}\circ\cdots\circ a_1^{(N)}\Big)\circledast\underline{\Delta}\Big\|_F^2+\lambda\Big[\frac{1-\alpha}{2}(a_1^{(n)})^T R_n^{-1}a_1^{(n)}+\alpha\|a_1^{(n)}\|_1\Big] \tag{12}$$

To make the calculation easier, we unfold the tensor into matrix form. Let $W_{(n)}$ denote the matrix obtained by unfolding the tensor $\underline{W}$ along its mode $n$. Using the fact that $\big(a_1^{(1)}\circ\cdots\circ a_1^{(N)}\big)_{(n)}=a_1^{(n)}h^T$, where $h:=a_1^{(N)}\otimes\cdots\otimes a_1^{(n+1)}\otimes a_1^{(n-1)}\otimes\cdots\otimes a_1^{(1)}$, (12) becomes

$$F(a_1^{(n)}):=\frac{1}{2}\big\|\big(W_{(n)}-a_1^{(n)}h^T\big)\circledast\Delta_{(n)}\big\|_F^2+\lambda\Big[\frac{1-\alpha}{2}(a_1^{(n)})^T R_n^{-1}a_1^{(n)}+\alpha\|a_1^{(n)}\|_1\Big] \tag{13}$$
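The unfolding identity used above can be checked numerically. The snippet below is our own sketch; it assumes the standard Kolda-style (column-major) mode-$n$ unfolding and verifies the identity for a random rank-one three-way tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
a = [rng.standard_normal(s) for s in (3, 4, 5)]

# rank-one tensor a[0] ∘ a[1] ∘ a[2]
X = np.einsum('i,j,k->ijk', a[0], a[1], a[2])

for n in range(3):
    # mode-n unfolding: move mode n first, flatten the rest in column-major order
    Xn = np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')
    # h is the Kronecker product of the remaining vectors in reverse mode order
    rest = [a[m] for m in range(3) if m != n][::-1]
    h = rest[0]
    for v in rest[1:]:
        h = np.kron(h, v)
    assert np.allclose(Xn, np.outer(a[n], h))  # (a1 ∘ a2 ∘ a3)_(n) == a_n h^T
```

If a different unfolding convention is used, the order of the Kronecker factors in $h$ changes accordingly.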

Denoting $x:=a_1^{(n)}$, $T:=R_n^{-1}$, $W:=W_{(n)}$, and $\Delta:=\Delta_{(n)}$, and dropping all subscripts and superscripts, we can rewrite (13) as

$$F(x):=\frac{1}{2}\|(W-xh^T)\circledast\Delta\|_F^2+\lambda\Big[\frac{1-\alpha}{2}x^TTx+\alpha\|x\|_1\Big] \tag{14}$$

which can be decomposed as

$$F(x)=\sum_{i_n=1}^{I_n}\Big[\frac{1}{2}\big\|\delta_{i_n}\circledast w_{i_n}-(\delta_{i_n}\circledast h)\,x_{i_n}\big\|_2^2\Big]+\lambda\Big[\frac{1-\alpha}{2}x^TTx+\alpha\|x\|_1\Big] \tag{15}$$

where $\delta_{i_n}$ and $w_{i_n}$ denote the $i_n$-th rows of the matrices $\Delta$ and $W$, respectively, and $x_{i_n}$ is the $i_n$-th entry of $x$. Since $F$ is strongly convex in $x$, the optimal solution

$$x^*=\operatorname*{argmin}_x F(x) \tag{16}$$

satisfies the first order condition

$$0\in\partial F(x^*) \tag{17}$$

$$\partial F(x)=\big(H+\lambda(1-\alpha)T\big)x-u+\lambda\alpha\,\mathrm{Sign}(x) \tag{18}$$

where

$$H=\begin{bmatrix}\|h\circledast\delta_1\|_2^2 & & & \\ & \|h\circledast\delta_2\|_2^2 & & \\ & & \ddots & \\ & & & \|h\circledast\delta_{I_n}\|_2^2\end{bmatrix} \tag{19}$$
$$u=\big[(h\circledast\delta_1)^T(w_1\circledast\delta_1),\;\dots,\;(h\circledast\delta_{I_n})^T(w_{I_n}\circledast\delta_{I_n})\big]^T \tag{20}$$

and

$$\mathrm{Sign}(x)_i=\begin{cases}1 & \text{if } x_i>0\\ [-1,1] & \text{if } x_i=0\\ -1 & \text{if } x_i<0\end{cases} \tag{21}$$

Let

$$d:=\mathrm{diag}\big(H+\lambda(1-\alpha)T\big):=(d_1,d_2,\dots,d_{I_n})^T \tag{22}$$

The optimality condition (17) is then equivalent to:

$$0\in x-u\oslash d+\lambda\alpha\,\mathrm{Sign}(x)\oslash d \tag{23}$$

So the solution is

$$x=\mathcal{T}_{\lambda\alpha}(u)\oslash d \tag{24}$$

where $\mathcal{T}_{\lambda\alpha}(u)_i:=\mathrm{sign}(u_i)\max(|u_i|-\lambda\alpha,0)$ is the soft-thresholding operator.

We formally state Algorithm 1 for solving (7).
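The closed-form column update (24) is easy to implement. The sketch below uses our own variable names (`T_diag` holds the diagonal of $R_n^{-1}$, and `Delta` is the binary observation mask in unfolded form); it is an illustration of the update, not the paper's reference code:

```python
import numpy as np

def soft_threshold(u, t):
    """Entry-wise soft-thresholding operator T_t(u)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def column_update(W, Delta, h, T_diag, lam, alpha):
    """Closed-form block update (24) for one factor column.

    Solves  min_x 0.5*||(W - x h^T) ⊛ Δ||_F^2
                + lam*[(1-alpha)/2 * x^T T x + alpha*||x||_1]
    with T diagonal (its diagonal passed as T_diag), as in Eq. (14).
    """
    Dh = Delta * h[None, :]                 # row i_n holds delta_{i_n} ⊛ h
    H = np.sum(Dh ** 2, axis=1)             # diagonal of H, Eq. (19)
    u = np.sum(Dh * (W * Delta), axis=1)    # vector u, Eq. (20)
    d = H + lam * (1 - alpha) * T_diag      # d = diag(H + λ(1-α)T), Eq. (22)
    return soft_threshold(u, lam * alpha) / d   # Eq. (24)
```

With no regularization (`lam = 0`) and a fully observed rank-one `W = x0 h^T`, the update recovers `x0` exactly, which is a useful sanity check.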

## 4 Theoretical Analysis

### 4.1 Convergence results

Since our algorithm is an instance of the unified Algorithm 1 in (Xu and Yin (2013)), the convergence of Algorithm 1 follows from Theorems 2.8 and 2.9 of (Xu and Yin (2013)); the proof is given in Appendix Section 9.1. We have the following convergence theorem. Let $\{A_k\}$ be the sequence generated by Algorithm 1, where $A_k$ is the solution at the $k$-th iteration of the repeat loop. Assume that $\{A_k\}$ is bounded. Then $\{A_k\}$ converges to a critical point, and the asymptotic convergence rates in Theorem 2.9 of (Xu and Yin (2013)) apply.

### 4.2 Properties of the Elastic Net Regularization

#### 4.2.1 The $\ell_2$ regularization tends to find the true rank

First, the $\ell_2$ regularization has the property of finding the true rank of the tensor (Bazerque et al. (2013)). To see this, note that when $\alpha=0$ (pure $\ell_2$ regularization), problem (7) becomes

$$\min_{A,\underline{X}}\ \frac{1}{2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\sum_{r=1}^{R}\sum_{n=1}^{N}\lambda\Big[\frac{1}{2}(a_r^{(n)})^T R_n^{-1}a_r^{(n)}\Big]\quad\text{s.t.}\ \underline{X}=\sum_{r=1}^{R}a_r^{(1)}\circ a_r^{(2)}\circ\cdots\circ a_r^{(N)} \tag{25}$$

This problem is equivalent to the following one by Proposition 4.2.1 (its proof is provided in Appendix Section 9.2):

$$\min_{\tilde U^{(n)},\tilde\gamma_r,\underline{X}}\ \frac{1}{2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\sum_{r=1}^{R}\frac{\lambda N}{2}\,\tilde\gamma_r^{2/N}\quad\text{s.t.}\ \underline{X}=\sum_{r=1}^{R}\tilde\gamma_r\,\big(R_1^{1/2}\tilde u_r^{(1)}\big)\circ\cdots\circ\big(R_N^{1/2}\tilde u_r^{(N)}\big) \tag{26}$$

The solutions of (25) and (26) coincide, i.e., the optimal factors are related by $a_r^{(n)}=\tilde\gamma_r^{1/N}R_n^{1/2}\tilde u_r^{(n)}$. To further stress the capability of (25) to produce a low-rank approximate tensor, consider transforming (26) once more by rewriting it in constrained-error form:

$$\min_{\tilde U^{(n)},\tilde\gamma,\underline{X}}\ \|\tilde\gamma\|_{2/N}^{2/N}\quad\text{s.t.}\ \underline{X}=\sum_{r=1}^{R}\tilde\gamma_r\,\big(R_1^{1/2}\tilde u_r^{(1)}\big)\circ\cdots\circ\big(R_N^{1/2}\tilde u_r^{(N)}\big),\qquad \|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2\le\eta \tag{27}$$

where $\tilde\gamma:=(\tilde\gamma_1,\dots,\tilde\gamma_R)^T$ and $\|\tilde\gamma\|_{2/N}^{2/N}:=\sum_{r=1}^{R}\tilde\gamma_r^{2/N}$. For any value of $\eta$ there exists a corresponding Lagrange multiplier $\lambda$ such that (26) and (27) yield the same solution. The $\ell_{2/N}$ quasi-norm in (27) produces a sparse vector when minimized (Chartrand (2007)). Note that sparsity of the vector $\tilde\gamma$ implies low rank of $\underline{X}$.

The above arguments imply that, with a properly chosen parameter $\lambda$, the regularized problem (25) can find the true rank of the tensor: when we give an overestimated rank $R$ (larger than the true rank of the tensor), the ideal solution of (25) shrinks the redundant columns of the factor matrices to zero.

#### 4.2.2 The $\ell_1$ regularization tends to find the true sparse structure of the factor matrices

Note that when there is only $\ell_1$ regularization ($\alpha=1$), the problem becomes

$$\min_{A}\ \frac{1}{2}\Big\|\Big(\underline{Z}-\sum_{r=1}^{R}a_r^{(1)}\circ\cdots\circ a_r^{(N)}\Big)\circledast\underline{\Delta}\Big\|_F^2+\sum_{n=1}^{N}\lambda\,\|\mathrm{vec}(A^{(n)})\|_1 \tag{28}$$

For each mode's factor matrix $A^{(n)}$, this is a standard lasso problem, which implies that the solution is sparse. Note that if the true factor matrix is sparse, then with a properly chosen $\lambda$ we can reveal the true sparsity structure of $A^{(n)}$. In applications, the sparsity structure of $A^{(n)}$ helps us make a meaningful interpretation of mode $n$, which stands for some attributes of the problem under consideration.

#### 4.2.3 The elastic net gives a flexible model for finding the true data structure of the tensor

Combining the $\ell_2$ and $\ell_1$ regularizations, we obtain the elastic net regularization, which combines the advantages of both: it finds the true (low) rank and factor matrices close to the true sparse factor matrices of the PARAFAC decomposition of an observed tensor. This helps us understand the structure of the data and reveal the facets of the objects.

#### 4.2.4 Estimate of the covariance matrix

To run Algorithm 1, the covariances $R_n$ must either be postulated a priori or replaced by sample estimates. Since the prior is often unknown, proper sample estimates are very important: they provide reasonable scaling in each dimension so that the algorithm performs well. As in Section C (Covariance estimation) of (Bazerque et al. (2013)), we can bridge the covariance matrices with their kernel counterparts. Since, when the model is given by (4), the covariance of the factors is hard to evaluate and has no analytical expression, we can sidestep this obstacle by assuming the model is Gaussian:

$$p_n(a_r^{(n)})=\beta_n\exp\!\Big(-\tfrac{1}{2}(a_r^{(n)})^T R_n^{-1}a_r^{(n)}\Big) \tag{29}$$

Define the kernel similarity matrix in mode $n$ as

$$K_n(i,j):=\mathbb{E}\big[X_{(n)}(i,:)\,(X_{(n)}(j,:))^T\big] \tag{30}$$

i.e., the expectation of the inner product of the $i$-th and $j$-th slices of $\underline{X}$ in mode $n$. With some calculation (see the supplementary material), we get

$$K_n=R\,\theta^{N-1}R_n \tag{31}$$

and

$$\mathbb{E}\|\underline{X}\|_F^2=\mathrm{Tr}(K_n)=R\,\theta^{N} \tag{32}$$

From this we obtain the covariance matrix estimates (by simply dropping the expectation):

$$\theta=\Big(\frac{\|\underline{X}\|_F^2}{R}\Big)^{1/N},\qquad K_n(i,j)=X_{(n)}(i,:)\,(X_{(n)}(j,:))^T,\qquad R_n=\frac{K_n}{R\,\theta^{N-1}} \tag{33}$$
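The estimates in (33) can be computed directly from the unfoldings of a current tensor estimate. The sketch below is ours (the function name and the unfolding convention are assumptions, matching the unfolding used in Section 3):

```python
import numpy as np

def covariance_estimates(X, R):
    """Sample estimates of Eq. (33): theta and the per-mode covariances R_n,
    given a current tensor estimate X and an assumed rank R."""
    N = X.ndim
    theta = (np.linalg.norm(X) ** 2 / R) ** (1.0 / N)
    Rn = []
    for n in range(N):
        # mode-n unfolding (column-major over the remaining modes)
        Xn = np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')
        Kn = Xn @ Xn.T                      # K_n(i,j) = X_(n)(i,:) X_(n)(j,:)^T
        Rn.append(Kn / (R * theta ** (N - 1)))
    return theta, Rn
```

For the all-ones $2\times2\times2$ tensor with $R=1$, for example, $\|\underline{X}\|_F^2=8$, so $\theta=2$ and each $R_n$ is the all-ones matrix.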

## 5 Solution Path by Warm Start in Practical Application

In this section, we give a more practical algorithm that helps choose a proper sparse and low-rank PARAFAC approximation of the observed tensor with missing values. By Lagrangian duality, problem (7) is equivalent to

$$\min_{A}\ \frac{1}{2}\Big\|\Big(\underline{Z}-\sum_{r=1}^{R}a_r^{(1)}\circ\cdots\circ a_r^{(N)}\Big)\circledast\underline{\Delta}\Big\|_F^2\quad\text{s.t.}\ \sum_{r=1}^{R}\sum_{n=1}^{N}\Big[\frac{1-\alpha}{2}(a_r^{(n)})^T R_n^{-1}a_r^{(n)}+\alpha\|a_r^{(n)}\|_1\Big]\le\eta \tag{34}$$

For any value of $\eta$ there exists a corresponding Lagrange multiplier $\lambda$ such that (7) and (34) yield the same solution. Although there is no explicit equation relating $\lambda$ and $\eta$, we know that a small $\lambda$ implies a large $\eta$ and a large ball-like solution space, while a large $\lambda$ implies a small $\eta$ and a small ball-like solution space. So we fix $\alpha$ and start from a small $\lambda$, moving toward a large $\lambda$, which shrinks the dense solution into a sparse one (when $\lambda$ is very large, the solution is $0$). This forms a solution path; the solutions are close to each other when the values of $\lambda$ are close, which means that, warm-started from the previous solution, the algorithm quickly finds the next one. For a fortunate $\lambda$, we meet the true data structure, i.e., the right sparse structure of the factor matrices; our numerical tests confirm this behavior. However, at the point where we find the right sparse structure, the recovery error is not very low, since a large $\lambda$ shrinks the solution toward the origin, whereas the best recovery error is achieved when $\lambda$ is small. This phenomenon suggests an efficient method: use Algorithm 1 to generate a solution path, and for a particular solution on the path, fix its sparse structure and solve a constrained problem with a small $\lambda$ to obtain a small recovery error. That is, for fixed $\alpha$, we use Algorithm 1 to compute a solution path, i.e., the solutions generated by Algorithm 1 for an increasing sequence of $\lambda$ values.
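The warm-started path above can be sketched in a few lines. Here `solve` stands in for one run of Algorithm 1 at a given regularization level; both the name and its signature are illustrative, not the paper's API:

```python
def solution_path(solve, lambdas, A0):
    """Warm-started solution path (Section 5 sketch).

    solve(lam, A_init): runs the inner solver at regularization level lam,
    starting from factors A_init, and returns the new factors.
    Returns the list of (lam, solution) pairs along the path.
    """
    path, A = [], A0
    for lam in sorted(lambdas):   # small -> large regularization
        A = solve(lam, A)         # warm start from the previous solution
        path.append((lam, A))
    return path
```

Warm starting matters here because neighboring values of $\lambda$ have nearby solutions, so each inner solve converges in few iterations.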

For a particular solution $A$ on the path, calculate $\gamma_r$ and $u_r^{(n)}$ as in (3), reorder the $\gamma_r$ so that $\gamma_1\ge\gamma_2\ge\cdots\ge\gamma_R$, and reorder the columns of the matrices $U^{(n)}$ and $A^{(n)}$ so that they coincide with the ordering of $\gamma$. Let $\hat R$ be the number of non-negligible values $\gamma_r$, and keep only the first $\hat R$ columns of each $A^{(n)}$. Define the sparse structure matrix $S^{(n)}$ of the factor matrix $A^{(n)}$ as

$$S^{(n)}(i_n,r)=\begin{cases}0 & \text{if } |A^{(n)}(i_n,r)|\le\epsilon\\ 1 & \text{otherwise}\end{cases} \tag{35}$$

where $i_n\in[I_n]$, $r\in[\hat R]$, and $\epsilon$ is a prespecified small number.

For notational simplicity, we still use $A^{(n)}$ to denote the truncated factor matrices and $R$ to denote $\hat R$. If the true sparse structure matrix is $S^{(n)}$, then solving (7) is equivalent to solving the constrained problem

$$\min_{A}\ \frac{1}{2}\Big\|\Big(\underline{Z}-\sum_{r=1}^{R}a_r^{(1)}\circ\cdots\circ a_r^{(N)}\Big)\circledast\underline{\Delta}\Big\|_F^2+\sum_{r=1}^{R}\sum_{n=1}^{N}\lambda\Big[\frac{1-\alpha}{2}(a_r^{(n)})^T R_n^{-1}a_r^{(n)}+\alpha\|a_r^{(n)}\|_1\Big]\quad\text{s.t.}\ A^{(n)}(i_n,r)=0\ \text{if}\ S^{(n)}(i_n,r)=0,\ \text{for } i_n\in[I_n],\ r\in[R] \tag{36}$$

This problem can be solved by Algorithm 1 by updating $A^{(n)}(i_n,r)$ only where $S^{(n)}(i_n,r)=1$ and setting $A^{(n)}(i_n,r)=0$ wherever $S^{(n)}(i_n,r)=0$. We formally give Algorithm 2.
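Extracting the structure matrix of (35) and enforcing the constraint of (36) amount to simple masking operations; a minimal sketch with our own helper names:

```python
import numpy as np

def sparse_structure(A, eps=1e-6):
    """Sparsity pattern S of Eq. (35): 1 where |A| exceeds eps, else 0.
    The default eps is an arbitrary illustrative choice."""
    return (np.abs(A) > eps).astype(float)

def apply_sparse_structure(A, S):
    """Constrained step of Algorithm 2 / Eq. (36): keep entries where
    S(i,r) = 1 and force entries with S(i,r) = 0 to stay exactly zero."""
    return A * S
```

In Algorithm 2 the mask is applied after every column update, so the zero pattern found on the solution path is preserved while the surviving entries are re-fit with a small $\lambda$.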

The convergence theory still holds, since we only update a subset of the elements of the factor matrices.

To conclude, we give our final Algorithm 3, which aims to find the true data structure with a low recovery error. For convenience, we call the solution path method using Algorithm 1 SPML, and the corresponding method using Algorithm 2, whose starting solution comes from SPML, SPMS.

## 6 Generalizing the Algorithm to Large-Scale Problems via AdaMax Stochastic Block Coordinate Descent

In real applications, the data often have large dimensions, and the above algorithm may be slow, since each iteration must process all observed entries. Fortunately, we can mitigate this to some extent by using a stochastic scheme.

Note that to update $a_1^{(n)}$ we use all of the available information, which is not necessary. To gain some insight, consider the rank-one matrix problem $Z=xy^T$. Suppose we know the true $y$ and need to update $x$; dropping the regularization and assuming no noise and no missing entries, the above algorithm updates $x$ via $x=Zy/\|y\|_2^2$. We can also update $x$ from a randomly chosen subset $S$ of the columns, via $x=Z_{:,S}\,y_S/\|y_S\|_2^2$. When there is noise, so that we observe $Z=xy^T+E$ with $E$ a white noise, the subset update is still an unbiased estimate of $x$, but with a higher variance. This simple case suggests that we can use a subset of the data to update $a_1^{(n)}$; because doing so loses some accuracy through higher variance, we should adopt a stochastic update scheme whose key idea is to keep a memory of the older information about the right direction while also taking notice of the new information.
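The matrix toy case above can be reproduced in a few lines; the sizes, noise level, and subset size below are arbitrary illustrative choices of ours:

```python
import numpy as np

# For Z = x y^T + E with known y, the full least-squares update is
# x = Z y / ||y||^2; a random column subset S gives a cheaper,
# unbiased, but higher-variance estimate of the same quantity.
rng = np.random.default_rng(1)
x = rng.standard_normal(50)
y = rng.standard_normal(2000)
Z = np.outer(x, y) + 0.01 * rng.standard_normal((50, 2000))

x_full = Z @ y / (y @ y)                          # uses every column
S = rng.choice(y.size, size=200, replace=False)   # 10% column sample
x_sub = Z[:, S] @ y[S] / (y[S] @ y[S])            # subset estimate

assert np.allclose(x_full, x, atol=0.01)
assert np.allclose(x_sub, x, atol=0.05)   # noisier, but still close
```

The extra variance of the subset estimate is what motivates smoothing successive stochastic updates with a momentum-style scheme such as AdaMax, rather than using each raw estimate directly.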

We now derive our stochastic block coordinate descent method using the AdaMax scheme, following the same road as in Section 3.

For example, if we consider minimizing with respect to