# Regularized Orthogonal Tensor Decompositions for Multi-Relational Learning

Multi-relational learning has received considerable attention across several research communities. Most existing methods either suffer from superlinear per-iteration cost or are sensitive to the given ranks. To address both issues, we propose a scalable core tensor trace norm Regularized Orthogonal Iteration Decomposition (ROID) method for full or incomplete tensor analytics, which can be generalized to a graph Laplacian regularized version by using auxiliary information, or to a sparse higher-order orthogonal iteration (SHOOI) version. We first establish the equivalence of the Schatten $p$-norm ($0<p<\infty$) of a low multi-linear rank tensor and that of its core tensor. We then obtain a much smaller matrix trace norm minimization problem. Finally, we develop two efficient augmented Lagrange multiplier algorithms to solve our problems with convergence guarantees. Extensive experiments on both real and synthetic datasets, even with only a few observations, verify both the efficiency and effectiveness of our methods.

## 1 Introduction

Relational learning is becoming increasingly important because of the high value hidden in relational data and because of its many applications in domains such as social networks, the semantic web, bioinformatics, and the linked data cloud [1]. One class of relational learning methods focuses mostly on modeling a single relation type, such as relational learning from latent attributes [2, 3], which models relations between objects as resulting from intrinsic latent attributes of these objects. In reality, however, relational data typically involve multiple types of relations between objects or attributes, which can themselves be similar. For example, in social networks [4], relationships between individuals may be personal, familial, or professional. Learning from this type of relational data is often referred to as multi-relational learning (MRL), which needs to model large-scale sparse relational databases efficiently [5].

Relational data are usually represented with the semantic web's RDF formalism, where relations are modeled as triples of the form (subject, relation, object), and a relation denotes the relationship either between two entities or between an entity and an attribute value. Given the multiple types of relationships, it is more natural to stack the matrices of observed relationships into one big sparse third-order tensor. Fig. 1 shows an illustration of this modeling method. In recent years, tensors have become ubiquitous, e.g., as multi-channel images and videos, and have become popular due to their ability to discover complex and interesting latent structures and correlations in data [6, 7, 8, 9]. There is also a growing interest in tensor methods for link prediction tasks, partially due to their natural representation of multi-relational data.

Tensor decomposition [10, 11, 12, 13] is a popular tool for multi-relational prediction problems [14, 15]. For example, Bader et al. [16] proposed a three-way component decomposition model for analyzing intrinsically asymmetric relationships. In addition, Nickel et al. [1] incorporated collective learning into the tensor factorization, which is designed to account for the inherent structure of relational data. Two of the most popular tensor factorizations are the Tucker decomposition [10] and the CANDECOMP/PARAFAC (CP) decomposition [11]. To address incomplete tensor estimation, two weighted alternating least-squares methods [8, 17] were proposed. However, these methods require the ability to reliably estimate the rank of the involved tensor [18, 19].

Recently, the low rank tensor recovery problem has been intensively studied. Liu et al. [20] first extended the trace norm (also known as the nuclear norm [21] or the Schatten $1$-norm [19]) regularization to partially observed low multi-linear rank tensor recovery. The tensor recovery problem is then transformed into a convex combination of trace norm minimizations of the matrix unfoldings along each mode. More recently, in their subsequent paper [22], Liu et al. proposed three efficient algorithms to solve the low multi-linear rank tensor completion problem. Similar algorithms can also be found in [18, 23, 24, 25]. In addition, there are theoretical developments that guarantee the reconstruction of a low rank tensor from partial measurements by solving trace norm minimization under some reasonable conditions [25, 26, 27, 28]. However, the tensor trace norm minimization problems have to be solved iteratively and involve multiple singular value decompositions (SVDs) in each iteration. Therefore, existing algorithms suffer from high computational cost, making them impractical for real-world applications [19, 29].

To address both of the issues mentioned above, i.e., the sensitivity to the given ranks and the computational efficiency, we propose a scalable core tensor trace norm Regularized Orthogonal Iteration Decomposition (ROID) method for full or incomplete tensor analytics. We first establish the equivalence of the Schatten $p$-norm ($0<p<\infty$) of a low multi-linear rank tensor and that of its core tensor. We use the trace norm of the core tensor to replace that of the whole tensor, and thereby obtain a much smaller-scale matrix trace norm minimization problem. In particular, our ROID method is generalized to a graph Laplacian regularized version by using auxiliary information from the relationships, or to a sparse higher-order orthogonal iteration (SHOOI) version. Finally, we develop two efficient augmented Lagrange multiplier (ALM) algorithms for our problems. Moreover, we theoretically analyze the convergence property of our algorithms. Our experimental results on real-world datasets verify both the efficiency and effectiveness of our methods.

The rest of the paper is organized as follows. We review preliminaries and related work in Section 2. In Section 3, we propose two novel core tensor trace norm regularized tensor decomposition models, and develop two efficient ALM algorithms and extend one algorithm to solve the SHOOI problem in Section 4. We provide the theoretical analysis of our algorithms in Section 5. We report the experimental results in Section 6. In Section 7, we conclude this paper and point out some potential extensions for future work.

## 2 Notations and Problem Formulations

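A third-order tensor is denoted by a calligraphic letter, e.g., $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times I_3}$, and its entries are denoted as $x_{i_1i_2i_3}$, where $i_n\in\{1,\ldots,I_n\}$ for $n=1,2,3$. Fibers are the higher-order analogue of matrix rows and columns. The mode-$n$ fibers of a third-order tensor are $x_{:i_2i_3}$, $x_{i_1:i_3}$ and $x_{i_1i_2:}$, respectively.

The mode-$n$ unfolding, also known as matricization, of a third-order tensor $\mathcal{X}$ is denoted by $X_{(n)}\in\mathbb{R}^{I_n\times\prod_{j\neq n}I_j}$ and arranges the mode-$n$ fibers to be the columns of the resulting matrix, such that the mode-$n$ index becomes the row index and the other two modes become column indices. The tensor element $(i_1,i_2,i_3)$ is mapped to the matrix element $(i_n,j)$, where

$$j=1+\sum_{k=1,\,k\neq n}^{3}(i_k-1)J_k\quad\text{with}\quad J_k=\prod_{m=1,\,m\neq n}^{k-1}I_m.$$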
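The index mapping above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the helper names `unfold`, `refold` and `column_index` are hypothetical, indices are zero-based (so the leading `1 +` of the formula disappears), and Fortran-order reshaping is used so that the earlier remaining mode varies fastest, as the formula requires.

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding: rows are indexed by i_n, columns by the remaining
    indices, with the earlier remaining mode varying fastest."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def refold(M, n, shape):
    """Inverse of unfold: rebuild the tensor of the given shape."""
    rest = [s for k, s in enumerate(shape) if k != n]
    return np.moveaxis(np.reshape(M, [shape[n]] + rest, order="F"), 0, n)

def column_index(idx, n, shape):
    """Zero-based column index j = sum_{k != n} i_k * J_k."""
    j, J = 0, 1
    for k, (i_k, I_k) in enumerate(zip(idx, shape)):
        if k != n:
            j += i_k * J   # J is the product of the earlier mode sizes (m != n)
            J *= I_k
    return j

X = np.arange(24.0).reshape(2, 3, 4)
idx = (1, 2, 3)
for n in range(3):
    M = unfold(X, n)
    # The entry x_{i1 i2 i3} sits at row i_n and column j of the unfolding.
    print(n, M.shape, M[idx[n], column_index(idx, n, X.shape)] == X[idx])
```

Any fixed column ordering gives the same singular values, so the choice of ordering matters only when an unfolding has to be refolded consistently.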

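The inner product of two same-sized tensors $\mathcal{X}$ and $\mathcal{Y}$ is the sum of the products of their entries, $\langle\mathcal{X},\mathcal{Y}\rangle=\sum_{i_1,i_2,i_3}x_{i_1i_2i_3}y_{i_1i_2i_3}$. The Frobenius norm of a third-order tensor $\mathcal{X}$ is defined as:

$$\|\mathcal{X}\|_F:=\sqrt{\langle\mathcal{X},\mathcal{X}\rangle}=\sqrt{\sum_{i_1=1}^{I_1}\sum_{i_2=1}^{I_2}\sum_{i_3=1}^{I_3}x_{i_1i_2i_3}^2}.$$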

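The $n$-mode product of a tensor $\mathcal{X}$ with a matrix $U\in\mathbb{R}^{J\times I_n}$, denoted by $\mathcal{X}\times_nU$, is defined (shown here for $n=1$) as:

$$(\mathcal{X}\times_1U)_{ji_2i_3}=\sum_{i_1=1}^{I_1}x_{i_1i_2i_3}u_{ji_1}.$$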
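As a minimal sketch of the mode product, assuming NumPy and a zero-based mode index, the contraction can be written with `np.tensordot`; the function name `mode_product` is an illustrative choice, not part of the paper.

```python
import numpy as np

def mode_product(X, U, n):
    """n-mode product of X with U of shape (J, I_n); n is zero-based."""
    # Contract axis n of X with axis 1 of U; tensordot puts the new axis
    # (of length J) first, so move it back to position n.
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
U = rng.standard_normal((2, 3))
Y = mode_product(X, U, 0)
print(Y.shape)  # mode 0 now has length J = 2
```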

### 2.1 Tensor Trace Norm

In exact analogy to the definition of the matrix rank, the rank of a tensor $\mathcal{X}$ is defined as the smallest number of rank-one tensors that generate $\mathcal{X}$ as their sum. However, there is no straightforward way to determine the rank of a tensor. In fact, the problem is NP-hard [6, 30]. Fortunately, the multi-linear rank (also called the Tucker rank in [27, 31]) of a tensor is easy to compute, and consists of the ranks of all mode-$n$ unfoldings.

###### Definition 1.

The multi-linear rank of a third-order tensor $\mathcal{X}$ is the tuple of the ranks of the mode-$n$ unfoldings,

$$\text{multi-linear rank}=[\operatorname{rank}(X_{(1)}),\operatorname{rank}(X_{(2)}),\operatorname{rank}(X_{(3)})].$$

In order to keep problems simple, the (weighted) sum of the ranks of all unfoldings along each mode is used to take the place of the multi-linear rank of the tensor, and is relaxed into the following definition.

###### Definition 2.

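The Schatten $p$-norm ($0<p<\infty$) of a third-order tensor $\mathcal{X}$ is the average of the Schatten $p$-norms of all unfoldings $X_{(n)}$, i.e.,

$$\|\mathcal{X}\|_{S_p}=\frac{1}{3}\sum_{n=1}^{3}\|X_{(n)}\|_{S_p}$$

where $\|X_{(n)}\|_{S_p}=\big(\sum_i\sigma_i^p\big)^{1/p}$ denotes the Schatten $p$-norm of the unfolding $X_{(n)}$, and $\sigma_i$ is the $i$-th singular value of $X_{(n)}$. When $p=1$, the Schatten $1$-norm is the well-known trace norm, $\|\mathcal{X}\|_*$.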
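Definition 2 can be illustrated with a short NumPy sketch. The function name `schatten_p_norm` is hypothetical, and the standard matrix definition $(\sum_i\sigma_i^p)^{1/p}$ is assumed for each unfolding. Because singular values do not depend on the column ordering, a plain reshape is enough here.

```python
import numpy as np

def schatten_p_norm(X, p=1.0):
    """Average over all modes of the Schatten p-norm of the mode-n unfolding."""
    norms = []
    for n in range(X.ndim):
        Xn = np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1))
        sigma = np.linalg.svd(Xn, compute_uv=False)  # singular values only
        norms.append(np.sum(sigma ** p) ** (1.0 / p))
    return float(np.mean(norms))

# Rank-one example: every unfolding has a single nonzero singular value,
# equal to ||a|| * ||b|| * ||c||, so the norm is the same for every p.
a, b, c = np.array([3.0, 4.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0])
X = np.einsum("i,j,k->ijk", a, b, c)
print(schatten_p_norm(X, p=1.0), schatten_p_norm(X, p=0.5))
```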

For some imbalanced sparse tensor decomposition problems (e.g., the YouTube data used in Section 6.2), the trace norm of the tensor can instead be weighted by pre-specified weights $\alpha_n$, $n=1,2,3$, which satisfy $\sum_{n=1}^{3}\alpha_n=1$.

### 2.2 Weighted Tensor Decompositions

We introduce two of the most often used tensor decomposition models for MRL problems. In [8], Acar et al. presented a weighted CANDECOMP/PARAFAC (WCP) decomposition model for sparse third-order tensors:

$$\min_{A,B,C}\frac{1}{2}\sum_{i,j,k}w_{ijk}\Big(t_{ijk}-\sum_{r=1}^{R}a_{ir}b_{jr}c_{kr}\Big)^2\tag{1}$$

where $R$ is a positive integer, $\mathcal{W}$ denotes a non-negative indicator tensor of the same size as an incomplete tensor $\mathcal{T}$: $w_{ijk}=1$ if $t_{ijk}$ is observed and $w_{ijk}=0$ otherwise, and $A\in\mathbb{R}^{I_1\times R}$, $B\in\mathbb{R}^{I_2\times R}$ and $C\in\mathbb{R}^{I_3\times R}$ are referred to as the factor matrices, which are the combination of the vectors from the rank-one components (e.g., $A=[a_1,a_2,\ldots,a_R]$).
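For concreteness, the WCP objective (1) can be evaluated as in the sketch below. The function name `wcp_loss`, the rank $R=2$ and the 30% observation mask are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def wcp_loss(T, Wt, A, B, C):
    """0.5 * sum_ijk w_ijk * (t_ijk - sum_r a_ir b_jr c_kr)^2, as in (1)."""
    approx = np.einsum("ir,jr,kr->ijk", A, B, C)  # rank-R CP reconstruction
    return 0.5 * float(np.sum(Wt * (T - approx) ** 2))

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((I, 2)) for I in (4, 5, 6))
T = np.einsum("ir,jr,kr->ijk", A, B, C)   # an exactly rank-2 tensor
Wt = rng.random(T.shape) < 0.3            # hypothetical observation mask
print(wcp_loss(T, Wt, A, B, C))           # no residual at the true factors
```

Unobserved entries have weight zero, so they contribute nothing to the loss regardless of the values stored in `T`.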

In [17], the weighted Tucker decomposition (WTucker) model is formulated as follows:

$$\min_{\mathcal{G},U,V,W}\frac{1}{2}\|\mathcal{W}*(\mathcal{T}-\mathcal{G}\times_1U\times_2V\times_3W)\|_F^2\tag{2}$$

where $*$ denotes the Hadamard (element-wise) product, $U\in\mathbb{R}^{I_1\times d_1}$, $V\in\mathbb{R}^{I_2\times d_2}$, $W\in\mathbb{R}^{I_3\times d_3}$, and $\mathcal{G}\in\mathbb{R}^{d_1\times d_2\times d_3}$ is a core tensor with the given multi-linear rank $(d_1,d_2,d_3)$. Since the decomposition rank $d_n$ is in general much smaller than $I_n$, the storage of the Tucker decomposition form can be significantly smaller than that of the original tensor. Moreover, unlike the rank of the tensor, the multi-linear rank is clearly computable. If the factor matrices of the Tucker decomposition are constrained to be orthogonal, the classical decomposition methods are referred to as the higher-order singular value decomposition (HOSVD) [32] or higher-order orthogonal iteration (HOOI) [33]. The latter leads to the estimation of best rank-$(d_1,d_2,d_3)$ approximations, while the truncation of HOSVD may achieve a good rank-$(d_1,d_2,d_3)$ approximation but in general not the best possible one [33]. Hence, we are particularly interested in extending the HOOI method for sparse MRL problems.

In addition, several extensions of both tensor decomposition models are developed for tensor estimation problems, such as [34], [35], [36]. However, for all those methods, a suitable rank value needs to be given, and it has been shown that both WTucker and WCP models are usually sensitive to the given ranks due to their least-squares formulations [18, 19], and they have poor performance when the data have a high rank [22].

### 2.3 Problem Formulations

For multi-relational prediction, the sparse tensor trace norm minimization problem is formulated as follows:

$$\min_{\mathcal{X}}\sum_{n=1}^{3}\alpha_n\|X_{(n)}\|_*,\quad\text{s.t.}\ \mathcal{X}_\Omega=\mathcal{T}_\Omega\tag{3}$$

where the $\alpha_n$'s are pre-specified weights, and $\Omega$ is the set of indices of observed entries. Liu et al. [22] proposed three efficient algorithms (e.g., the HaLRTC algorithm) to solve (3). In addition, there are some similar convex tensor completion algorithms in [18, 23, 24]. Tomioka and Suzuki [25] proposed a latent trace norm minimization model,

$$\min_{\{\mathcal{X}_n\}}\frac{1}{\lambda}\sum_{n=1}^{N}\|X_{n,(n)}\|_*+\frac{1}{2}\Big\|P_\Omega\Big(\sum_{n=1}^{N}\mathcal{X}_n\Big)-P_\Omega(\mathcal{T})\Big\|_F^2\tag{4}$$

where $P_\Omega$ is the projection operator: $P_\Omega(\mathcal{X})_{i_1i_2i_3}=x_{i_1i_2i_3}$ if $(i_1,i_2,i_3)\in\Omega$ and $P_\Omega(\mathcal{X})_{i_1i_2i_3}=0$ otherwise, and $\lambda$ is a regularization parameter.

More recently, it has been shown that the tensor trace norm minimization models mentioned above can be substantially suboptimal [27, 37]. However, if the order of the involved tensor is no more than three, the models (3) and (4) often perform better than the more balanced (square) matrix model in [27]. Indeed, all unfoldings $X_{(n)}$ share the same entries, and thus cannot be optimized independently. Therefore, we must apply variable splitting and introduce multiple additional equal-sized variables for all unfoldings of $\mathcal{X}$. Moreover, existing algorithms involve multiple SVDs of these full-size unfoldings in each iteration and therefore suffer from a high per-iteration computational cost.

## 3 Core Tensor Trace Norm Regularized Tensor Decomposition

To address the poor scalability of existing low multi-linear rank tensor recovery algorithms, we present two scalable core tensor trace norm (or together with graph Laplacian) regularized orthogonal decomposition models, and then achieve three smaller-scale matrix trace norm minimization problems. Then in Section 4, we will develop some efficient algorithms for solving the problems.

### 3.1 Core Tensor Trace Norm Minimization Models

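Assume that $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times I_3}$ is a multi-relational tensor with multi-linear rank $(r_1,r_2,r_3)$. Then $\mathcal{X}$ can be decomposed as:

$$\mathcal{X}=\mathcal{G}\times_1U\times_2V\times_3W\tag{5}$$

where $U\in\mathbb{R}^{I_1\times d_1}$, $V\in\mathbb{R}^{I_2\times d_2}$ and $W\in\mathbb{R}^{I_3\times d_3}$ are column-wise orthonormal matrices, and can be thought of as the principal components in each mode. The entries of the core tensor $\mathcal{G}\in\mathbb{R}^{d_1\times d_2\times d_3}$ show the level of interaction between the different components. To choose the ranks, we recommend the matrix rank estimation approach recently developed in [38] to compute good values $(r_1,r_2,r_3)$ for the multi-linear rank of the involved tensor. Then we can give some relatively large integers $(d_1,d_2,d_3)$ satisfying $d_n\geq r_n$, $n=1,2,3$.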

###### Theorem 1.

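Let $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times I_3}$ with multi-linear rank $(r_1,r_2,r_3)$ and $\mathcal{G}\in\mathbb{R}^{d_1\times d_2\times d_3}$ satisfy $\mathcal{X}=\mathcal{G}\times_1U\times_2V\times_3W$, and $U^TU=I_{d_1}$, $V^TV=I_{d_2}$, and $W^TW=I_{d_3}$, then

$$\|\mathcal{X}\|_{S_p}=\|\mathcal{G}\|_{S_p}$$

where $\|\mathcal{X}\|_{S_p}$ and $\|\mathcal{G}\|_{S_p}$ denote the Schatten $p$-norm of $\mathcal{X}$ and its core tensor $\mathcal{G}$, respectively.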
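The statement of Theorem 1 can be checked numerically. The sketch below is a sanity check under stated assumptions, not a proof: the sizes are arbitrary, the orthonormal factors are drawn with a reduced QR factorization, and the helper names are hypothetical.

```python
import numpy as np

def mode_product(X, U, n):
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

def schatten_p_norm(X, p=1.0):
    norms = []
    for n in range(X.ndim):
        Xn = np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1))
        sigma = np.linalg.svd(Xn, compute_uv=False)
        norms.append(np.sum(sigma ** p) ** (1.0 / p))
    return float(np.mean(norms))

rng = np.random.default_rng(0)
I, d = (10, 12, 14), (2, 3, 4)            # arbitrary sizes with d_n << I_n
G = rng.standard_normal(d)                # core tensor
# Reduced QR gives matrices of shape (I_n, d_n) with orthonormal columns.
U, V, W = (np.linalg.qr(rng.standard_normal((I[n], d[n])))[0] for n in range(3))
X = mode_product(mode_product(mode_product(G, U, 0), V, 1), W, 2)

for p in (0.5, 1.0, 2.0):
    print(p, schatten_p_norm(X, p), schatten_p_norm(G, p))
```

The two columns agree up to rounding because each unfolding of $\mathcal{X}$ has the same nonzero singular values as the corresponding unfolding of $\mathcal{G}$, which is why the SVDs can be moved to the much smaller core tensor.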

The proof of Theorem 1 is given in APPENDIX A. Since the trace norm (i.e., the Schatten $1$-norm) is the tightest convex surrogate to the rank function [21, 39], we mainly consider the trace norm case in this paper. According to the equivalence of the trace norm of a low multi-linear rank tensor and that of its core tensor, the tensor completion model (3) is reformulated as follows:

$$\begin{aligned}\min_{\mathcal{G},U,V,W,\mathcal{X}}\ &\frac{1}{\lambda}\|\mathcal{G}\|_*+\frac{1}{2}\|\mathcal{X}-\mathcal{G}\times_1U\times_2V\times_3W\|_F^2,\\ \text{s.t.}\ &\mathcal{X}_\Omega=\mathcal{T}_\Omega,\ U^TU=I_{d_1},\ V^TV=I_{d_2},\ W^TW=I_{d_3}.\end{aligned}\tag{6}$$

When all entries of $\mathcal{T}$ are observed, the model (6) degenerates to the following core tensor trace norm regularized tensor decomposition problem [29]:

$$\begin{aligned}\min_{\mathcal{G},U,V,W}\ &\frac{1}{\lambda}\|\mathcal{G}\|_*+\frac{1}{2}\|\mathcal{T}-\mathcal{G}\times_1U\times_2V\times_3W\|_F^2,\\ \text{s.t.}\ &U^TU=I_{d_1},\ V^TV=I_{d_2},\ W^TW=I_{d_3}.\end{aligned}\tag{7}$$

It is clear that the core tensor $\mathcal{G}$ of size $d_1\times d_2\times d_3$ is much smaller than the whole tensor $\mathcal{X}$, i.e., $d_n\ll I_n$ for all $n\in\{1,2,3\}$. Therefore, our core tensor trace norm regularized orthogonal tensor decomposition models (6) and (7) can alleviate the SVD computational burden of the much larger unfoldings in both models (3) and (4). Besides, the core tensor trace norm term promotes low multi-linear rank tensor decompositions and enhances the robustness to the choice of multi-linear rank, whereas traditional tensor decomposition methods are usually sensitive to the given multi-linear rank [22, 29].

### 3.2 Sparse HOOI Model

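When $\lambda\to\infty$, the model (6) degenerates to the following sparse tensor HOOI (SHOOI) problem,

$$\begin{aligned}\min_{\mathcal{G},U,V,W,\mathcal{Z}}\ &\frac{1}{2}\|\mathcal{W}*(\mathcal{Z}-\mathcal{T})\|_F^2,\\ \text{s.t.}\ &\mathcal{Z}=\mathcal{G}\times_1U\times_2V\times_3W,\ U^TU=I_{d_1},\ V^TV=I_{d_2},\ W^TW=I_{d_3}.\end{aligned}\tag{8}$$

In a sense, the SHOOI model (8) is a special case of our ROID method (see the Supplementary Materials for a detailed discussion). When all entries of $\mathcal{T}$ are observed, the SHOOI model (8) becomes the traditional HOOI problem in [33].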

### 3.3 Graph Regularized Model

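Inspired by the work in [40, 41, 42], we also exploit the auxiliary information given as link-affinity matrices in a graph regularized ROID (GROID) model:

$$\begin{aligned}\min_{\mathcal{G},U,V,W,\mathcal{X}}\ &\frac{1}{\lambda}\|\mathcal{G}\|_*+\frac{1}{2}\|\mathcal{X}-\mathcal{G}\times_1U\times_2V\times_3W\|_F^2\\ &+\frac{\mu}{2}\big[\operatorname{Tr}(U^TL_1U)+\operatorname{Tr}(V^TL_2V)+\operatorname{Tr}(W^TL_3W)\big],\\ \text{s.t.}\ &\mathcal{X}_\Omega=\mathcal{T}_\Omega,\ U^TU=I_{d_1},\ V^TV=I_{d_2},\ W^TW=I_{d_3}\end{aligned}\tag{9}$$

where $\mu$ is a regularization constant, $\operatorname{Tr}(\cdot)$ denotes the matrix trace, $L_n$ is the graph Laplacian matrix, i.e., $L_n=D_n-W_n$, $W_n$ is the weight matrix for the object set or different relations (not to be confused with the factor matrix $W$), and $D_n$ is the diagonal matrix whose entries are the column sums of $W_n$, i.e., $(D_n)_{ii}=\sum_j(W_n)_{ji}$.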
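A small sketch of the graph regularizer may help. It assumes a symmetric, non-negative affinity matrix, written `S` here to avoid clashing with the factor matrix $W$; the function names are hypothetical. For symmetric `S`, the trace term equals half the affinity-weighted sum of squared distances between rows of the factor matrix, so strongly linked objects are pushed toward similar latent factors.

```python
import numpy as np

def graph_laplacian(S):
    """L = D - S, with D the diagonal matrix of column sums of S."""
    return np.diag(S.sum(axis=0)) - S

def graph_penalty(U, L):
    """Tr(U^T L U) for a factor matrix U with one row per object."""
    return float(np.trace(U.T @ L @ U))

S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])   # toy affinity between three objects
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])        # objects 2 and 3 share the same factors
L = graph_laplacian(S)

# Same quantity written as 0.5 * sum_ij S_ij * ||u_i - u_j||^2.
pairwise = 0.5 * sum(S[i, j] * np.sum((U[i] - U[j]) ** 2)
                     for i in range(3) for j in range(3))
print(graph_penalty(U, L), pairwise)
```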

## 4 Optimization Algorithms

In this section, we propose an efficient method of augmented Lagrange multipliers (ALM) to solve our ROID problem (6), and then extend the proposed algorithm to solve (7)-(9). As a variant of the standard ALM, the alternating direction method of multipliers (ADMM) has received much attention recently due to the tremendous demand from large-scale machine learning applications [43, 44]. Similar to (3), the proposed problem (6) is difficult to solve due to the interdependent tensor trace norm term $\|\mathcal{G}\|_*$. Therefore, we first introduce three much smaller auxiliary variables $G_n$, $n=1,2,3$, into (6), and then reformulate it into the following equivalent form:

$$\begin{aligned}\min_{\mathcal{G},U,V,W,\{G_n\},\mathcal{X}}\ &\sum_{n=1}^{3}\frac{\|G_n\|_*}{3\lambda}+\frac{1}{2}\|\mathcal{X}-\mathcal{G}\times_1U\times_2V\times_3W\|_F^2,\\ \text{s.t.}\ &\mathcal{X}_\Omega=\mathcal{T}_\Omega,\ G_{(n)}=G_n,\ U^TU=I_{d_1},\ V^TV=I_{d_2},\ W^TW=I_{d_3}.\end{aligned}\tag{10}$$

The partial augmented Lagrangian function of (10) is

$$\begin{aligned}L_\rho(\{G_n\},\mathcal{G},U,V,W,\mathcal{X},\{Y_n\})=\ &\sum_{n=1}^{3}\Big(\frac{\|G_n\|_*}{3\lambda}+\langle Y_n,G_{(n)}-G_n\rangle+\frac{\rho}{2}\|G_{(n)}-G_n\|_F^2\Big)\\ &+\frac{1}{2}\|\mathcal{X}-\mathcal{G}\times_1U\times_2V\times_3W\|_F^2\end{aligned}\tag{11}$$

where $Y_n$, $n=1,2,3$, are the matrices of Lagrange multipliers (or dual variables), and $\rho>0$ is called the penalty parameter. Our ADMM iterative scheme for solving (10) is derived by successively minimizing $L_\rho$ over $\{G_n\}$, $(\mathcal{G},U,V,W)$ and $\mathcal{X}$, and then updating $\{Y_n\}$.

### 4.1 Updating $\{G_1^{k+1},G_2^{k+1},G_3^{k+1}\}$

By keeping all the other variables fixed, $G_n^{k+1}$ is updated by solving the following problem,

$$\min_{G_n}\frac{\|G_n\|_*}{3\lambda}+\frac{\rho_k}{2}\|G_{(n)}^k-G_n+Y_n^k/\rho_k\|_F^2.\tag{12}$$

For solving (12), we give the shrinkage operator [45] below.

###### Definition 3.

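For any matrix $M\in\mathbb{R}^{m\times n}$, the singular value thresholding (SVT) operator is defined as:

$$\operatorname{SVT}_\mu(M):=\bar{U}\operatorname{diag}(\max\{\bar{\sigma}-\mu,0\})\bar{V}^T$$

where $\max\{\cdot,\cdot\}$ should be understood element-wise, and $\bar{U}$, $\bar{V}$ and $\bar{\sigma}$ are obtained by the SVD of $M$, i.e., $M=\bar{U}\operatorname{diag}(\bar{\sigma})\bar{V}^T$.

Therefore, a closed-form solution to (12) is given by:

$$G_n^{k+1}=\operatorname{SVT}_{1/(3\lambda\rho_k)}\big(G_{(n)}^k+Y_n^k/\rho_k\big).\tag{13}$$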
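The shrinkage step can be sketched in a few lines of NumPy. The names `svt` and `update_Gn` are illustrative, and the inputs are assumed to be the current unfolding of the core tensor, its multiplier, and the scalars $\lambda$ and $\rho_k$.

```python
import numpy as np

def svt(M, mu):
    """Soft-threshold the singular values of M by mu."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - mu, 0.0)) @ Vt   # scales the columns of U

def update_Gn(G_unf, Y_n, lam, rho):
    """Closed-form update (13) for one auxiliary matrix G_n."""
    return svt(G_unf + Y_n / rho, 1.0 / (3.0 * lam * rho))

M = np.diag([3.0, 1.0])
print(svt(M, 2.0))   # singular values 3 and 1 shrink to 1 and 0
```

The SVD here acts on a $d_n\times\prod_{j\neq n}d_j$ matrix rather than on an unfolding of the full tensor, which is where the savings of the method come from.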

It is clear that only some much smaller matrices in (13) require an SVD. Therefore, our shrinkage step has a significantly lower per-iteration computational complexity than the algorithms for solving (3) and (4), which must compute SVDs of the full-size unfoldings. Hence, our algorithm has a much lower complexity than those in [18, 22, 23, 24, 25].

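### 4.2 Updating $\{U^{k+1},V^{k+1},W^{k+1},\mathcal{G}^{k+1}\}$

The optimization problem (10) with respect to $U$, $V$, $W$ and $\mathcal{G}$ is formulated as follows:

$$\begin{aligned}\min_{\mathcal{G},U,V,W}\ &\sum_{n=1}^{3}\frac{\rho_k}{2}\|G_{(n)}-G_n^{k+1}+Y_n^k/\rho_k\|_F^2+\frac{1}{2}\|\mathcal{X}^k-\mathcal{G}\times_1U\times_2V\times_3W\|_F^2,\\ \text{s.t.}\ &U^TU=I_{d_1},\ V^TV=I_{d_2},\ W^TW=I_{d_3}.\end{aligned}\tag{14}$$

Unlike the HOOI algorithm in [33], we propose a new orthogonal iteration scheme to update the matrices $U$, $V$ and $W$ for the optimization of (14). Moreover, the conventional HOOI can be seen as a special case of (14) when $\rho_k=0$. For any estimate of these matrices, the optimal solution with respect to $\mathcal{G}$ is given by the following theorem.

###### Theorem 2.

For given matrices $U$, $V$ and $W$, the optimal core tensor $\mathcal{G}$ of the optimization problem (14) is given by

$$\mathcal{G}=\frac{1}{1+3\rho_k}(\mathcal{A}+\rho_k\mathcal{B})\tag{15}$$

where $\mathcal{A}=\mathcal{X}^k\times_1U^T\times_2V^T\times_3W^T$, $\mathcal{B}=\sum_{n=1}^{3}\operatorname{refold}(G_n^{k+1}-Y_n^k/\rho_k)$, and $\operatorname{refold}(\cdot)$ denotes the refolding of the matrix into a tensor.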
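The closed form (15) can be sanity-checked numerically. The sketch below assumes random data of arbitrary small size and hypothetical helper names; it evaluates the objective of (14) for fixed orthonormal factors, which is a strictly convex quadratic in the core tensor, so any perturbation of the closed-form solution should increase it.

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def refold(M, n, shape):
    rest = [s for k, s in enumerate(shape) if k != n]
    return np.moveaxis(np.reshape(M, [shape[n]] + rest, order="F"), 0, n)

def mode_product(X, U, n):
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

def tucker(G, U, V, W):
    return mode_product(mode_product(mode_product(G, U, 0), V, 1), W, 2)

def core_update(X, U, V, W, Gs, Ys, rho):
    """Closed form (15): G = (A + rho * B) / (1 + 3 * rho)."""
    A = tucker(X, U.T, V.T, W.T)
    B = sum(refold(Gs[n] - Ys[n] / rho, n, A.shape) for n in range(3))
    return (A + rho * B) / (1.0 + 3.0 * rho)

def objective(G, X, U, V, W, Gs, Ys, rho):
    """Objective of (14) for fixed U, V, W."""
    fit = 0.5 * np.sum((X - tucker(G, U, V, W)) ** 2)
    pen = sum(0.5 * rho * np.sum((unfold(G, n) - Gs[n] + Ys[n] / rho) ** 2)
              for n in range(3))
    return fit + pen

rng = np.random.default_rng(0)
I, d, rho = (6, 7, 8), (2, 3, 4), 0.5
X = rng.standard_normal(I)
U, V, W = (np.linalg.qr(rng.standard_normal((I[n], d[n])))[0] for n in range(3))
Gs = [rng.standard_normal((d[n], np.prod(d) // d[n])) for n in range(3)]
Ys = [rng.standard_normal(g.shape) for g in Gs]
G_star = core_update(X, U, V, W, Gs, Ys, rho)
f_star = objective(G_star, X, U, V, W, Gs, Ys, rho)
```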

Please see APPENDIX B for the proof of Theorem 2. Moreover, we propose an orthogonal iteration scheme for solving $U$, $V$ and $W$, which is an alternating orthogonal Procrustes scheme to solve the rank-$(d_1,d_2,d_3)$ problem. Analogous to Theorem 4.2 in [33], we first state that the minimization problem (14) can be formulated as follows:

###### Theorem 3.

Assume a real third-order tensor $\mathcal{X}^k$, then the minimization problem (14) is equivalent to the maximization (over the matrices $U$, $V$ and $W$ having orthonormal columns) of the following function

$$g(U,V,W)=\|\mathcal{A}+\rho_k\mathcal{B}\|_F^2.\tag{16}$$

The detailed proof of Theorem 3 is given in APPENDIX C. According to the theorem, an orthogonal iteration scheme is proposed to successively solve $U$, $V$ and $W$ by fixing the other variables. Imagine that the matrices $V$ and $W$ are fixed and that the objective in (16) is merely a quadratic function of the unknown matrix $U$. Since $U$ consists of orthonormal columns, we have

$$\max_{U,\,U^TU=I_{d_1}}\|\mathcal{M}_1^k\times_1U^T+\rho_k\mathcal{B}\|_F^2=\|(M_1^k)_{(1)}^TU+\rho_kB_{(1)}^T\|_F^2\tag{17}$$

where $\mathcal{M}_1^k=\mathcal{X}^k\times_2(V^k)^T\times_3(W^k)^T$. This is actually the well-known orthogonal Procrustes problem [46]. Hence, we have

$$U^{k+1}=\operatorname{ORT}\big((M_1^k)_{(1)}B_{(1)}^T\big)\tag{18}$$

where $\operatorname{ORT}(M):=\hat{U}\hat{V}^T$, and $\hat{U}$ and $\hat{V}$ are the left and right singular vector matrices obtained by the tight SVD of the matrix $M$. Repeating the above procedure for $V$ and $W$, we have

$$V^{k+1}=\operatorname{ORT}\big((M_2^k)_{(2)}B_{(2)}^T\big),\qquad W^{k+1}=\operatorname{ORT}\big((M_3^k)_{(3)}B_{(3)}^T\big)\tag{19}$$

where $\mathcal{M}_2^k=\mathcal{X}^k\times_1(U^{k+1})^T\times_3(W^k)^T$ and $\mathcal{M}_3^k=\mathcal{X}^k\times_1(U^{k+1})^T\times_2(V^{k+1})^T$.
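The orthogonalization step used in the factor updates can be sketched as below, assuming the usual Procrustes solution (the product of the left and right singular vector matrices of a thin SVD). The function name `ort` and the test matrix are illustrative. The result has orthonormal columns and attains the largest possible value of $\operatorname{Tr}(Q^TM)$ over such matrices, namely the sum of the singular values of $M$.

```python
import numpy as np

def ort(M):
    """Orthonormal factor of M: U_hat @ V_hat^T from the thin SVD of M."""
    U_hat, _, Vt_hat = np.linalg.svd(M, full_matrices=False)
    return U_hat @ Vt_hat

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 3))   # stands in for an I_1 x d_1 product matrix
Q = ort(M)
print(np.allclose(Q.T @ Q, np.eye(3)))                               # orthonormal columns
print(np.trace(Q.T @ M), np.linalg.svd(M, compute_uv=False).sum())   # equal values
```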

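With the updated matrices $U^{k+1}$, $V^{k+1}$ and $W^{k+1}$, $\mathcal{G}^{k+1}$ is then updated by

$$\mathcal{G}^{k+1}=\frac{\rho_k\sum_{n=1}^{3}\operatorname{refold}(G_n^{k+1}-Y_n^k/\rho_k)}{1+3\rho_k}+\frac{\mathcal{M}_3^k\times_3(W^{k+1})^T}{1+3\rho_k}.\tag{20}$$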

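### 4.3 Updating $\mathcal{X}^{k+1}$

The optimization problem (10) with respect to $\mathcal{X}$ is formulated as follows:

$$\min_{\mathcal{X}}\|\mathcal{X}-\mathcal{G}^{k+1}\times_1U^{k+1}\times_2V^{k+1}\times_3W^{k+1}\|_F^2,\quad\text{s.t.}\ \mathcal{X}_\Omega=\mathcal{T}_\Omega.\tag{21}$$

By introducing a Lagrangian multiplier $\mathcal{Y}$ for the constraint $\mathcal{X}_\Omega=\mathcal{T}_\Omega$, the Lagrangian function of (21) is given by

$$H(\mathcal{X},\mathcal{Y})=\|\mathcal{X}-\mathcal{G}^{k+1}\times_1U^{k+1}\times_2V^{k+1}\times_3W^{k+1}\|_F^2+\langle\mathcal{Y},P_\Omega(\mathcal{X})-P_\Omega(\mathcal{T})\rangle.$$

Setting the gradient of $H$ with respect to $(\mathcal{X},\mathcal{Y})$ to zero, we obtain the following Karush-Kuhn-Tucker (KKT) conditions:

$$\begin{aligned}2(\mathcal{X}-\mathcal{G}^{k+1}\times_1U^{k+1}\times_2V^{k+1}\times_3W^{k+1})+P_\Omega(\mathcal{Y})&=0,\\ \mathcal{X}_\Omega-\mathcal{T}_\Omega&=0.\end{aligned}$$

From the KKT conditions, the optimal solution is:

$$\mathcal{X}^{k+1}=P_\Omega(\mathcal{T})+P_\Omega^\perp(\mathcal{G}^{k+1}\times_1U^{k+1}\times_2V^{k+1}\times_3W^{k+1})\tag{22}$$

where $P_\Omega^\perp$ is the complementary operator of $P_\Omega$.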
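In code, the update (22) amounts to keeping the observed entries and filling the rest from the current low-rank reconstruction. The sketch below assumes the observed set is given as a boolean mask of the same shape as the tensor; `update_X`, `mask` and `X_hat` are illustrative names.

```python
import numpy as np

def update_X(T, mask, X_hat):
    """X = P_Omega(T) + P_Omega^perp(X_hat); mask is True on observed entries."""
    return np.where(mask, T, X_hat)

T = np.array([[1.0, 0.0], [0.0, 4.0]])          # zeros stand for missing values
mask = np.array([[True, False], [False, True]])  # observed positions
X_hat = np.full((2, 2), 9.0)                     # stands in for G x1 U x2 V x3 W
print(update_X(T, mask, X_hat))
```

The same function works unchanged for third-order arrays, since `np.where` operates element-wise.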

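Based on the above analysis, we develop an efficient ADMM algorithm for solving (10), as outlined in Algorithm 1. Moreover, Algorithm 1 can be extended to solve (7) and the SHOOI problem (8) (the details can be found in the Supplementary Materials). For instance, with the tensor of Lagrange multipliers $\mathcal{Y}$, the iterations of ADMM for solving (8) go as follows:

$$\min_{\mathcal{G},U,V,W}\frac{1}{2}\|\mathcal{Z}^k-\mathcal{G}\times_1U\times_2V\times_3W+\mathcal{Y}^k/\rho_k\|_F^2,\quad\text{s.t.}\ U^TU=I_{d_1},\ V^TV=I_{d_2},\ W^TW=I_{d_3},\tag{23}$$

$$\min_{\mathcal{Z}}\frac{\rho_k}{2}\|\mathcal{Z}-\mathcal{G}^{k+1}\times_1U^{k+1}\times_2V^{k+1}\times_3W^{k+1}+\mathcal{Y}^k/\rho_k\|_F^2+\frac{1}{2}\|\mathcal{W}*(\mathcal{Z}-\mathcal{T})\|_F^2.\tag{24}$$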

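To monitor the convergence of Algorithm 1, we adopt the adaptive adjustment strategy for the penalty parameter from [43]. The necessary optimality conditions for (10) are primal feasibility

$$\begin{aligned}&\mathcal{X}^*_\Omega=\mathcal{T}_\Omega,\quad G_n^*=G_{(n)}^*,\ n=1,2,3,\\ &(U^*)^TU^*=I_{d_1},\ (V^*)^TV^*=I_{d_2},\ (W^*)^TW^*=I_{d_3}\end{aligned}\tag{25}$$

and dual feasibility

$$\begin{aligned}&0\in\partial\|G_n^*\|_*/(3\lambda)-Y_n^*,\\ &\mathcal{G}^*-\mathcal{X}^*\times_1(U^*)^T\times_2(V^*)^T\times_3(W^*)^T+\sum_{n=1}^{3}\operatorname{refold}(Y_n^*)=0\end{aligned}\tag{26}$$

where $(\{G_n^*\},\mathcal{G}^*,U^*,V^*,W^*,\mathcal{X}^*,\{Y_n^*\})$ is a KKT point of (10). By the optimality conditions of (12) and (14) and the multiplier update $Y_n^{k+1}=Y_n^k+\rho_k(G_{(n)}^{k+1}-G_n^{k+1})$, we have

$$\begin{aligned}&0\in\partial\|G_n^{k+1}\|_*/(3\lambda)-Y_n^{k+1}+\rho_k(G_{(n)}^{k+1}-G_{(n)}^k),\\ &\mathcal{G}^{k+1}-\mathcal{X}^{k+1}\times_1(U^{k+1})^T\times_2(V^{k+1})^T\times_3(W^{k+1})^T+\sum_{n=1}^{3}\operatorname{refold}(Y_n^{k+1})\\ &\quad+(\mathcal{X}^{k+1}-\mathcal{X}^k)\times_1(U^{k+1})^T\times_2(V^{k+1})^T\times_3(W^{k+1})^T=0.\end{aligned}$$

Let $r_k$ be the primal residual and $s_k$ be the dual residual at iteration $k$. We require the primal and dual residuals at the $k$-th iteration to be small, so that the iterates approximately satisfy the optimality conditions in (25) and (26). Following [43], an efficient strategy is to start from the initial penalty $\rho_0$ (the initialization in Algorithm 1) and update $\rho_k$ iteratively by:

$$\rho_{k+1}=\begin{cases}\gamma\rho_k,&r_k>10s_k,\\ \rho_k/\gamma,&s_k>10r_k,\\ \rho_k,&\text{otherwise},\end{cases}\tag{27}$$

where $\gamma>1$.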
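The penalty update (27) is a simple three-way rule, sketched below. The default `gamma=2.0` is an assumed illustrative value rather than the paper's setting, and `r` and `s` stand for the scalar norms of the primal and dual residuals.

```python
def update_rho(rho, r, s, gamma=2.0):
    """Adaptive penalty rule (27); gamma > 1 is an assumed default."""
    if r > 10.0 * s:      # primal residual dominates: tighten the constraint
        return gamma * rho
    if s > 10.0 * r:      # dual residual dominates: relax the penalty
        return rho / gamma
    return rho            # residuals are balanced: keep rho unchanged

print(update_rho(1.0, r=5.0, s=0.1))   # grows
print(update_rho(1.0, r=0.1, s=5.0))   # shrinks
print(update_rho(1.0, r=1.0, s=1.0))   # unchanged
```

Balancing the two residuals in this way keeps either one from stalling while the other is already small.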

### 4.4 Extension for GROID

Algorithm 1 can be extended to solve our GROID problem (9), where the main difference is that the subproblem with respect to $U$, $V$, $W$ and $\mathcal{G}$ is formulated as follows:

 minG,U,V,W3∑n=1ρk2∥G(n)−Gk+1n+Ykn/ρk∥2F (28) +12∥Xk−