# Duality-Induced Regularizer for Tensor Factorization Based Knowledge Graph Completion

Tensor factorization based models have shown great power in knowledge graph completion (KGC). However, their performance usually suffers severely from overfitting, which motivates various regularizers, such as the squared Frobenius norm and tensor nuclear norm regularizers. Yet the limited applicability of these regularizers significantly restricts their practical use. To address this challenge, we propose a novel regularizer—namely, DUality-induced RegulArizer (DURA)—which is not only effective in improving the performance of existing models but is also widely applicable to various methods. The major novelty of DURA is based on the observation that, for an existing tensor factorization based KGC model (primal), there is often another distance based KGC model (dual) closely associated with it. Experiments show that DURA yields consistent and significant improvements on benchmarks.


## 1 Introduction

Knowledge graphs contain quantities of factual triplets, which represent structured human knowledge. In the past few years, knowledge graphs have made great achievements in many areas, such as natural language processing (Zhang et al., 2019b), question answering (Huang et al., 2019), recommendation systems (Wang et al., 2018), and computer vision (Marino et al., 2016). Although commonly used knowledge graphs usually contain billions of triplets, they still suffer from the incompleteness problem that a lot of factual triplets are missing. Due to the large scale of knowledge graphs, it is impractical to find all valid triplets manually. Therefore, knowledge graph completion (KGC)—which aims to predict missing links between entities based on known links automatically—has attracted much attention recently.

Distance based (DB) models and tensor factorization based (TFB) models are two important categories of KGC models. DB models use the Minkowski distance to measure the plausibility of a triplet. Although they can achieve state-of-the-art performance, many of them still have difficulty in modeling complex relation patterns, such as one-to-many and many-to-one relations Lin et al. (2015); Wang et al. (2014). TFB models treat knowledge graphs as partially observed third-order binary tensors and formulate KGC as a tensor completion problem. Theoretically, these models are highly expressive and can handle complex relations well. However, they usually suffer severely from overfitting and consequently fail to reach state-of-the-art performance.

To tackle the overfitting problem of TFB models, researchers have proposed various regularizers. The squared Frobenius norm regularizer is a popular one that applies to various models Nickel et al. (2011); Yang et al. (2015); Trouillon et al. (2016). However, experiments show that it may decrease performance for some models (e.g., RESCAL) Ruffinelli et al. (2020). More recently, motivated by the great success of the matrix trace norm in the matrix completion problem Srebro et al. (2005); Candès and Recht (2009), Lacroix et al. (2018) proposed a tensor nuclear 3-norm regularizer. It gains significant improvements over the squared Frobenius norm regularizer. However, it is only suitable for canonical polyadic (CP) decomposition Hitchcock (1927) based models, such as CP and ComplEx Trouillon et al. (2016), and is not appropriate for a more general class of models, such as RESCAL Nickel et al. (2011). Therefore, it is still challenging to find a regularizer that is both widely applicable and effective.

In this paper, we propose a novel regularizer for tensor factorization based KGC models—namely, DUality-induced RegulArizer (DURA). The major novelty of DURA is based on an observation called duality—for an existing tensor factorization based KGC model (primal), there is often another distance based KGC model closely associated with it (dual). The duality can be derived by expanding the squared score functions of the associated distance based models. Then, the cross-term in the expansion is exactly a tensor factorization based KGC model, and the squared terms in it give us a regularizer. Using DURA, we can preserve the expressiveness of tensor factorization based KGC models and prevent them from overfitting. DURA is widely applicable to various tensor factorization based models, including CP, ComplEx, and RESCAL. Experiments show that DURA yields consistent and significant improvements on datasets for the knowledge graph completion task. It is worth noting that, when incorporated with DURA, RESCAL Nickel et al. (2011)—one of the first knowledge graph completion models—performs comparably to state-of-the-art methods and even beats them on several benchmarks.

## 2 Preliminaries

In this section, we review the background of this paper in Section 2.1 and introduce the notations used throughout this paper in Section 2.2.

### 2.1 Background

Knowledge Graph   Given a set E of entities and a set R of relations, a knowledge graph K = {(e_i, r_j, e_k)} ⊆ E × R × E is a set of triplets, where e_i and r_j are the i-th entity and j-th relation, respectively. Usually, e_i and e_k are also called the head entity and the tail entity, respectively.

Knowledge Graph Completion (KGC)    The goal of KGC is to predict valid but unobserved triplets based on the known triplets in K. KGC models contain two important categories: distance based models and tensor factorization based models, both of which are knowledge graph embedding (KGE) methods. KGE models associate each entity e_i and relation r_j with an embedding (which may be a real or complex vector, matrix, or tensor). Generally, they define a score function s: E × R × E → ℝ to associate a score s(e_i, r_j, e_k) with each potential triplet (e_i, r_j, e_k). The scores measure the plausibility of triplets. For a query (e_i, r_j, ?), KGE models first fill the blank with each entity in the knowledge graph and then score the resulting triplets. Valid triplets are expected to have higher scores than invalid triplets.

Distance Based (DB) KGC Models   DB models define the score function s with the Minkowski distance. That is, the score functions have the formulation s(e_i, r_j, e_k) = −‖Γ(e_i, r_j, e_k)‖_p, where Γ is a model-specific function. Equivalently, we can also use a squared score function s(e_i, r_j, e_k) = −‖Γ(e_i, r_j, e_k)‖_p².

Tensor Factorization Based (TFB) KGC Models   TFB models regard a knowledge graph as a third-order binary tensor X ∈ {0, 1}^{|E|×|R|×|E|}. The entry X_{ijk} = 1 if (e_i, r_j, e_k) is valid and X_{ijk} = 0 otherwise. Suppose that X_j denotes the j-th frontal slice of X, i.e., the adjacency matrix of the j-th relation. Usually, a TFB KGC model factorizes X_j as Re(H̄ R_j T^⊤), where the i-th (k-th) row of H (T) is h_i (t_k), R_j is a matrix representing the j-th relation, and Re(·) and the overbar denote the real part and the conjugate of a complex matrix, respectively. That is, the score functions are defined as s(e_i, r_j, e_k) = Re(h̄_i R_j t_k^⊤). Note that the real part and the conjugate of a real matrix are the matrix itself. Then, the aim of TFB models is to seek matrices H, R_1, …, R_{|R|}, T such that Re(H̄ R_j T^⊤) can approximate X_j. Let X̂_j = Re(H̄ R_j T^⊤) and X̂ be a tensor of which the j-th frontal slice is X̂_j. The regularized formulation of a tensor factorization based model can be written as

$$\min_{\hat{X}_1,\ldots,\hat{X}_{|R|}} \sum_{j=1}^{|R|} L(X_j, \hat{X}_j) + \lambda g(\hat{X}), \tag{1}$$

where λ > 0 is a fixed parameter, L measures the discrepancy between X_j and X̂_j, and g is the regularization function.

### 2.2 Other Notations

We use h_i and t_k to distinguish the embeddings of head and tail entities. Let ‖·‖_1, ‖·‖_2, and ‖·‖_F denote the ℓ1 norm, the ℓ2 norm, and the Frobenius norm of matrices or vectors. We use ⟨·, ·⟩ to represent the inner product of two real or complex vectors. Specifically, if u and v are two row vectors in the complex space, then the inner product is defined as ⟨u, v⟩ = ū v^⊤, where ū is the conjugate of u.

## 3 Related Work

Knowledge graph completion (KGC) models include rule-based methods Galárraga et al. (2013); Yang et al. (2017), KGE methods, and hybrid methods Guo et al. (2018). This work is related to KGE methods Bordes et al. (2013); Trouillon et al. (2016); Nathani et al. (2019); Zhang et al. (2019a). More specifically, it is related to distance based KGE models and tensor factorization based KGE models.

Distance based models describe relations as relational maps between head and tail entities. Then, they use the Minkowski distance to measure the plausibility of a given triplet. For example, TransE Bordes et al. (2013) and its variants Wang et al. (2014); Lin et al. (2015) represent relations as translations in vector spaces. They assume that a valid triplet (h, r, t) satisfies h_r + r ≈ t_r, where h_r and t_r mean that entity embeddings may be relation-specific. Structured embedding (SE) Bordes et al. (2011) uses linear maps to represent relations. Its score function is defined as −‖R_j^1 h_i − R_j^2 t_k‖_1, where R_j^1 and R_j^2 are relation-specific matrices. RotatE Sun et al. (2019) defines each relation as a rotation in a complex vector space, and its score function is −‖h_i ∘ r_j − t_k‖_1, where h_i, r_j, t_k are complex vectors, ∘ is the elementwise product, and each entry of r_j has modulus 1. ModE Zhang et al. (2020) assumes that the relational map for head entities is diagonal and that for tail entities is an identity matrix. It shares a similar score function with RotatE but with real-valued r_j.

Tensor factorization based models formulate the KGC task as a third-order binary tensor completion problem. RESCAL Nickel et al. (2011) factorizes the j-th frontal slice of X as A R_j A^⊤, in which the embeddings of head and tail entities are from the same space. As the relation-specific matrices R_j contain lots of parameters, RESCAL is prone to overfitting. DistMult Yang et al. (2015) simplifies the matrix R_j in RESCAL to be diagonal, but it sacrifices the expressiveness of models and can only handle symmetric relations. In order to model asymmetric relations, ComplEx Trouillon et al. (2016) extends DistMult to complex embeddings. Both DistMult and ComplEx can be regarded as variants of the CP decomposition Hitchcock (1927), in real and complex vector spaces, respectively.

Tensor factorization based (TFB) KGC models usually suffer severely from overfitting, which motivates various regularizers. In the original papers of TFB models, the authors usually use the squared Frobenius norm (ℓ2 norm) regularizer Nickel et al. (2011); Yang et al. (2015); Trouillon et al. (2016). This regularizer cannot bring satisfying improvements. Consequently, TFB models do not attain performance comparable to distance based models Sun et al. (2019); Zhang et al. (2020). More recently, Lacroix et al. (2018) proposed to use the tensor nuclear 3-norm Friedland and Lim (2018) (N3) as a regularizer, which brings more significant improvements than the squared Frobenius norm regularizer. However, it is designed for CP-like models, such as CP and ComplEx, and is not suitable for more general models such as RESCAL. Moreover, some regularization methods aim to leverage external background knowledge Minervini et al. (2017a); Ding et al. (2018); Minervini et al. (2017b). For example, to model equivalence and inversion axioms, Minervini et al. (2017a) impose a set of model-dependent soft constraints on the predicate embeddings. Ding et al. (2018) use non-negativity constraints on entity embeddings and approximate entailment constraints on relation embeddings to impose prior beliefs upon the structure of the embedding space.

## 4 Methods

In this section, we introduce a novel regularizer—DUality-induced RegulArizer (DURA)—for tensor factorization based knowledge graph completion. We first introduce basic DURA in Section 4.1 and explain why it is effective in Section 4.2. Then, we introduce DURA in Section 4.3. Finally, we give a theoretical analysis for DURA under some special cases in Section 4.4.

### 4.1 Basic DURA

Consider the knowledge graph completion query (e_i, r_j, ?). That is, we are given the head entity and the relation, aiming to predict the tail entity. Suppose that f_j(i, k) measures the plausibility of a given triplet (e_i, r_j, e_k), i.e., f_j(i, k) = s(e_i, r_j, e_k). Then the score function of a TFB model is

$$f_j(i,k) = \mathrm{Re}\,(\bar{h}_i R_j t_k^\top) = \mathrm{Re}\,(\langle h_i \bar{R}_j, t_k \rangle). \tag{2}$$

It first maps the entity embedding h_i to h_i R̄_j, and then uses the real part of an inner product to measure the similarity between h_i R̄_j and t_k. Notice that another commonly used similarity measure—the squared Euclidean distance—can replace the inner product similarity in Equation (2). We can obtain an associated distance based model formulated as

$$f^E_j(i,k) = -\|h_i \bar{R}_j - t_k\|_2^2. \tag{3}$$

Therefore, there exists a duality: for an existing tensor factorization based KGC model (primal), there is often another distance based KGC model (dual) closely associated with it.

Specifically, the relationship between the primal and the dual can be formulated as

$$f^E_j(i,k) = -\|h_i \bar{R}_j - t_k\|_2^2 = -\|h_i \bar{R}_j\|_2^2 - \|t_k\|_2^2 + 2\,\mathrm{Re}\,(\langle h_i \bar{R}_j, t_k \rangle) = 2 f_j(i,k) - \|h_i \bar{R}_j\|_2^2 - \|t_k\|_2^2. \tag{4}$$

Usually, we expect f_j(i, k) and f^E_j(i, k) to be higher for valid triplets than for invalid ones. Suppose that S is the set that contains all valid triplets. Then, for triplets in S, we have that

$$\max f^E_j(i,k) = \min -f^E_j(i,k) = \min\, -2 f_j(i,k) + \|h_i \bar{R}_j\|_2^2 + \|t_k\|_2^2. \tag{5}$$

By noticing that maximizing f_j(i, k) is exactly the aim of a TFB model, the duality induces a regularizer for tensor factorization based KGC models, i.e.,

$$\sum_{(h_i, r_j, t_k) \in S} \|h_i \bar{R}_j\|_2^2 + \|t_k\|_2^2, \tag{6}$$

which is called basic DURA.
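For models with diagonal relation matrices, such as CP, the term h_i R_j reduces to an elementwise product, and basic DURA becomes a few lines of array code. The following NumPy sketch is our own illustrative rendering (function and array names are assumptions, not the authors' implementation):

```python
import numpy as np

def basic_dura(h, r, t):
    """Basic DURA of Eq. (6) for CP-style models with diagonal relations.

    h, r, t: (batch, dim) real arrays holding the head, relation, and tail
    embeddings of sampled valid triplets. With R_j diagonal, h_i R_j is the
    elementwise product h_i * r_j, so the regularizer is the batch sum of
    ||h_i * r_j||_2^2 + ||t_k||_2^2.
    """
    return float(np.sum((h * r) ** 2) + np.sum(t ** 2))

# Toy usage: 4 triplets with 8-dimensional embeddings.
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4, 8))
reg = basic_dura(h, r, t)  # non-negative scalar added to the training loss
```

In training, this scalar would be multiplied by the regularization coefficient λ and added to the data-fitting loss.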

### 4.2 Why Basic DURA Helps

In this section, we demonstrate that basic DURA encourages tail entities connected to a head entity through the same relation to have similar embeddings, which accounts for its effectiveness in improving performance of TFB models.

First, we claim that tail entities connected to a head entity through the same relation should have similar embeddings. Suppose that we know a head entity e_i and a relation r_j, and our aim is to predict the tail entity. If r_j is a one-to-many relation, i.e., there exist two entities e_k and e_l such that both (e_i, r_j, e_k) and (e_i, r_j, e_l) are valid, then we expect that e_k and e_l have similar semantics. For example, if the two triplets (felid, include, tigers) and (felid, include, lions) are valid, then tigers and lions should have similar semantics. Further, we expect that entities with similar semantics have similar embeddings. In this way, if we know that (tigers, is, mammals) is valid, then we can predict that (lions, is, mammals) is also valid. See Figure 1(a) for an illustration of the prediction process.

However, TFB models fail to achieve the above goal. As shown in Figure 1(b), suppose that the embedding dimension is 2 and that the score of (e_i, r_j, e_k) is known. Then, any tail embedding t_{k'} gets the same score so long as it lies on the same line perpendicular to h_i R̄_j. Generally, the entities e_k and e_{k'} have similar semantics. However, their embeddings t_k and t_{k'} can even be orthogonal, which means that the two embeddings are dissimilar. Therefore, the performance of TFB models for knowledge graph completion is usually unsatisfying.

By Equation (4), we know that basic DURA constrains the distance between h_i R̄_j and t_k. When h_i and R_j are known, t_k lies in a small region (see Figure 1(c); we verify this claim in Section 5.4). Therefore, tail entities connected to a head entity through the same relation will have similar embeddings, which is beneficial to the prediction of unknown triplets.

### 4.3 DURA

Basic DURA encourages tail entities with similar semantics to have similar embeddings. However, it cannot handle the case that head entities have similar semantics.

Suppose that the two triplets (tigers, is, mammals) and (lions, is, mammals) are valid. Similar to the discussion in Section 4.2, we expect that tigers and lions have similar semantics and thus similar embeddings. If we further know that (felid, include, tigers) is valid, we can predict that (felid, include, lions) is valid. However, basic DURA cannot handle this case. Let h_1, h_2, t, and R be the embeddings of tigers, lions, mammals, and is, respectively. Then, the scores Re(⟨h_1 R̄, t⟩) and Re(⟨h_2 R̄, t⟩) can be equal even if h_1 and h_2 are orthogonal, as long as h_1 R̄ and h_2 R̄ have the same inner product with t.

To tackle the above issue, noticing that Re(⟨h_i R̄_j, t_k⟩) = Re(⟨t_k R_j^⊤, h_i⟩), we define another dual distance based KGC model

$$\tilde{f}^E_j(i,k) = -\|t_k R_j^\top - h_i\|_2^2.$$

Then, similar to the derivation in Equation (4), the duality induces a regularizer given by

$$\sum_{(h_i, r_j, t_k) \in S} \|t_k R_j^\top\|_2^2 + \|h_i\|_2^2. \tag{7}$$

When a TFB model is incorporated with regularizer (7), head entities with similar semantics will have similar embeddings.

Finally, combining the regularizers (6) and (7), DURA has the form of

$$\sum_{(h_i, r_j, t_k) \in S} \left[ \|h_i \bar{R}_j\|_2^2 + \|t_k\|_2^2 + \|t_k R_j^\top\|_2^2 + \|h_i\|_2^2 \right]. \tag{8}$$
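For RESCAL-style models with full relation matrices, Equation (8) can be written out directly. The sketch below is an illustrative assumption of ours (real-valued embeddings, single triplet), not the released implementation:

```python
import numpy as np

def dura(h, R, t):
    """DURA of Eq. (8) for a single real triplet (h_i, R_j, t_k).

    h, t: (dim,) entity embeddings; R: (dim, dim) relation matrix.
    Returns ||h R||^2 + ||t||^2 + ||t R^T||^2 + ||h||^2.
    """
    return float(np.sum((h @ R) ** 2) + np.sum(t ** 2)
                 + np.sum((t @ R.T) ** 2) + np.sum(h ** 2))

# When R is the identity, the regularizer reduces to 2(||h||^2 + ||t||^2).
h = np.array([1.0, 0.0])
t = np.array([0.0, 2.0])
R = np.eye(2)
```

In practice the four terms would be computed in batch and summed over the sampled valid triplets.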

### 4.4 Theoretic Analysis for Diagonal Relation Matrices

If we further relax the summation condition in the regularizer (8) to all possible combinations of entities and relations, we can write DURA as:

$$|E| \sum_{j=1}^{|R|} \left( \|H \bar{R}_j\|_F^2 + \|T\|_F^2 + \|T R_j^\top\|_F^2 + \|H\|_F^2 \right), \tag{9}$$

where |E| and |R| are the number of entities and relations, respectively.

In the rest of this section, we use the same definitions of L and g as in problem (1). When the relation embedding matrices R_j are diagonal in the real or complex space, as in CP or ComplEx, the formulation (9) gives an upper bound to the tensor nuclear 2-norm of X̂, which is an extension of trace norm regularizers in matrix completion. To simplify the notations, we take CP as an example, in which all involved embeddings are real. The conclusion in the complex space can be analogized accordingly.

###### Definition 1 (Friedland and Lim (2018)).

The nuclear 2-norm of a 3D tensor A ∈ ℝ^{n_1 × n_2 × n_3} is

$$\|A\|_* = \min\left\{ \sum_{i=1}^{r} \|u_{1,i}\|_2 \|u_{2,i}\|_2 \|u_{3,i}\|_2 : A = \sum_{i=1}^{r} u_{1,i} \otimes u_{2,i} \otimes u_{3,i},\ r \in \mathbb{N} \right\},$$

where u_{k,i} ∈ ℝ^{n_k} for k = 1, 2, 3 and i = 1, …, r, and ⊗ denotes the outer product.

For notational convenience, we define a relation matrix R, of which the j-th row consists of the diagonal entries of R_j. That is, R_{jd} = (R_j)_{dd}, where R_{jd} represents the entry in the j-th row and d-th column of the matrix R.

In the knowledge graph completion problem, the tensor nuclear 2-norm of X̂ is

$$\|\hat{X}\|_* = \min\left\{ \sum_{d=1}^{D} \|h_{:d}\|_2 \|r_{:d}\|_2 \|t_{:d}\|_2 : \hat{X} = \sum_{d=1}^{D} h_{:d} \otimes r_{:d} \otimes t_{:d} \right\},$$

where D is the embedding dimension, and h_{:d}, r_{:d}, and t_{:d} are the d-th columns of H, R, and T, respectively.

For DURA in (9), we have the following theorem.

###### Theorem 1.

Suppose that X̂_j = H R_j T^⊤ for j = 1, …, |R|, where H and T are real matrices and R_j is diagonal. Then, the following equation holds

$$\min_{\hat{X}_j = H R_j T^\top} \frac{1}{4\sqrt{|R|}} \sum_{j=1}^{|R|} \left( \|H R_j\|_F^2 + \|T\|_F^2 + \|T R_j^\top\|_F^2 + \|H\|_F^2 \right) = \|\hat{X}\|_*.$$

The minimum is attained if and only if ‖h_{:d}‖_2 ‖r_{:d}‖_2 = √|R| ‖t_{:d}‖_2 and ‖t_{:d}‖_2 ‖r_{:d}‖_2 = √|R| ‖h_{:d}‖_2 for all d, where h_{:d}, r_{:d}, and t_{:d} are the d-th columns of H, R, and T, respectively.

###### Proof.

See the supplementary material. ∎

Therefore, DURA in (9) gives an upper bound to the tensor nuclear 2-norm, which is a tensor analog to the matrix trace norm.

Remark   DURA in (8) is actually a weighted version of the one in (9), in which the regularization terms correspond to the sampled valid triplets. As shown in Srebro and Salakhutdinov (2010) and Lacroix et al. (2018), weighted versions of regularizers usually outperform the unweighted ones when entries of the matrix or tensor are sampled non-uniformly. Therefore, in the experiments, we implement DURA in the weighted way as in (8).

## 5 Experiments

In this section, we introduce the experimental settings in Section 5.1 and show the effectiveness of DURA in Section 5.2. We compare DURA to other regularizers in Section 5.3 and visualize the entity embeddings in Section 5.4. Finally, we analyze the sparsity induced by DURA in Section 5.5. The code of DURA is available on GitHub at https://github.com/MIRALab-USTC/KGE-DURA.

### 5.1 Experimental Settings

We consider three public knowledge graph datasets—WN18RR Toutanova and Chen (2015), FB15k-237 Dettmers et al. (2018), and YAGO3-10 Mahdisoltani et al. (2015)—for the knowledge graph completion task, which have been divided into training, validation, and testing sets in previous works. The statistics of these datasets are shown in Table 1. WN18RR, FB15k-237, and YAGO3-10 are extracted from WN18 Bordes et al. (2013), FB15k Bordes et al. (2013), and YAGO3 Mahdisoltani et al. (2015), respectively. Toutanova and Chen (2015) and Dettmers et al. (2018) indicated the test set leakage problem in WN18 and FB15k, where some test triplets may appear in the training dataset in the form of reciprocal relations. They created WN18RR and FB15k-237 to avoid the test set leakage problem, and we use them as the benchmark datasets. We use MRR and Hits@N (H@N) as evaluation metrics. For more details of training and evaluation protocols, please refer to the supplementary material.

Moreover, we find it better to assign different weights for the parts involved with relations. That is, the optimization problem has the form of

$$\min \sum_{(e_i, r_j, e_k) \in S} \Big[ \ell_{ijk}(H, R_1, \ldots, R_{|R|}, T) + \lambda \big( \lambda_1 (\|h_i\|_2^2 + \|t_k\|_2^2) + \lambda_2 (\|h_i \bar{R}_j\|_2^2 + \|t_k R_j^\top\|_2^2) \big) \Big],$$

where λ, λ_1, and λ_2 are fixed hyperparameters, and λ_1 and λ_2 are chosen by grid search.

### 5.2 Main Results

In this section, we compare the performance of DURA against several state-of-the-art KGC models, including CP Hitchcock (1927), RESCAL Nickel et al. (2011), ComplEx Trouillon et al. (2016), TuckER Balazevic et al. (2019b) and some DB models: RotatE Sun et al. (2019), MuRP Balazevic et al. (2019a), and HAKE Zhang et al. (2020).

Table 2 shows the effectiveness of DURA. RESCAL-DURA and ComplEx-DURA perform competitively with the SOTA DB models. RESCAL-DURA outperforms all the compared DB models in terms of MRR and H@1. Note that we reimplement CP, ComplEx, and RESCAL under the “reciprocal” setting Kazemi and Poole (2018); Lacroix et al. (2018), and obtain better results than the reported performance in the original papers. Overall, TFB models with DURA significantly outperform those without DURA, which shows its effectiveness in preventing models from overfitting.

Generally, models with more parameters and datasets with smaller sizes imply a larger risk of overfitting. Among the three datasets, WN18RR has the smallest size, with the fewest relations and training samples. Therefore, the improvements brought by DURA on WN18RR are expected to be larger than on the other datasets, which is consistent with the experiments. As stated in Wang et al. (2017), RESCAL is a more expressive model, but it is prone to overfit on small- and medium-sized datasets because it represents relations with many more parameters. For example, on the WN18RR dataset, RESCAL gets an H@10 score of 0.493, which is lower than ComplEx (0.522). The advantage of its expressiveness does not show up at all. However, incorporated with DURA, RESCAL gets an 8.4% improvement on H@10 and finally attains 0.577, which outperforms all compared models. On larger datasets such as YAGO3-10, overfitting also exists but may be less significant. Nonetheless, DURA still leads to consistent improvements, demonstrating its ability to prevent models from overfitting.

### 5.3 Comparison to Other Regularizers

In this section, we compare DURA to the popular squared Frobenius norm regularizer and the recent tensor nuclear 3-norm (N3) regularizer Lacroix et al. (2018). The squared Frobenius norm regularizer is given by Σ_{(h_i, r_j, t_k) ∈ S} (‖h_i‖_2² + ‖r_j‖_2² + ‖t_k‖_2²). The N3 regularizer is given by Σ_{(h_i, r_j, t_k) ∈ S} (‖h_i‖_3³ + ‖r_j‖_3³ + ‖t_k‖_3³), where ‖·‖_3 denotes the ℓ3 norm of vectors.
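Per triplet, both baselines are simple functions of the factor embeddings. A minimal sketch under our own notation (batched NumPy arrays, not the paper's code):

```python
import numpy as np

def frobenius_reg(h, r, t):
    # Squared Frobenius norm: sum of squared l2 norms of the three factors.
    return float(sum(np.sum(x ** 2) for x in (h, r, t)))

def n3_reg(h, r, t):
    # Tensor nuclear 3-norm regularizer (Lacroix et al., 2018):
    # sum of cubed absolute values of the entries of the three factors.
    return float(sum(np.sum(np.abs(x) ** 3) for x in (h, r, t)))
```

Note that both act on vector embeddings of h, r, and t, which is why N3 does not apply to RESCAL's full relation matrices.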

We implement both the squared Frobenius norm (FRO) and N3 regularizers in the weighted way as stated in Lacroix et al. (2018). Table 3 shows the performance of the three regularizers on three popular models: CP, ComplEx, and RESCAL. Note that when the TFB model is RESCAL, we only compare DURA to the squared Frobenius norm regularization as N3 does not apply to it.

For CP and ComplEx, DURA brings consistent improvements compared to FRO and N3 on all datasets. Specifically, on FB15k-237, compared to CP-N3, CP-DURA gets an improvement of 0.013 in terms of MRR. Even for the previous state-of-the-art TFB model ComplEx, DURA brings further improvements against the N3 regularizer. Incorporated with FRO, RESCAL performs worse than the vanilla model, which is consistent with the results in Ruffinelli et al. (2020). However, RESCAL-DURA brings significant improvements against RESCAL. All the results demonstrate that DURA is more widely applicable than N3 and more effective than the squared Frobenius norm regularizer.

### 5.4 Visualization

In this section, we visualize the tail entity embeddings using T-SNE van der Maaten and Hinton (2008) to show that DURA encourages tail entities with similar semantics to have similar embeddings.

Suppose that (h, r, ?) is a query, where h and r are a head entity and a relation, respectively. An entity t is an answer to the query if (h, r, t) is valid. We randomly selected 10 queries in FB15k-237, each of which has more than 50 answers (for more details about the 10 queries, please refer to the supplementary material). Then, we use T-SNE to visualize the answers' embeddings generated by CP and CP-DURA. Figure 2 shows the visualization results. Each entity is represented by a 2D point, and points in the same color represent tail entities with the same context (i.e., query). Figure 2 shows that, with DURA, entities with the same contexts are indeed assigned more similar representations, which verifies the claims in Section 4.2.

### 5.5 Sparsity Analysis

As real-world knowledge graphs usually contain billions of entities, the storage of entity embeddings faces severe challenges. Intuitively, if embeddings are sparse, that is, most of the entries are zero, we can store them with less storage. Therefore, the sparsity of the generated entity embeddings becomes crucial for real-world applications. In this part, we analyze the sparsity of embeddings induced by different regularizers.

Generally, there are few entries of entity embeddings that are exactly equal to 0 after training, which means that it is hard to obtain sparse entity embeddings directly. However, when we score triplets using the trained model, the embedding entries with values close to 0 will have minor contributions to the score of a triplet. If we set the embedding entries close to 0 to be exactly 0, we can transform embeddings into sparse ones. Thus, there is a trade-off between sparsity and performance decrement.

We define the following λ-sparsity to indicate the proportion of entries that are close to zero:

$$s_\lambda = \frac{\sum_{i=1}^{I} \sum_{d=1}^{D} \mathbb{1}_{\{|x| < \lambda\}}(E_{id})}{I \times D}, \tag{10}$$

where E is the entity embedding matrix, E_{id} is the entry in the i-th row and d-th column of E, I is the number of entities, D is the embedding dimension, and 𝟙_{{|x| < λ}} is the indicator function that takes the value 1 if |E_{id}| < λ and 0 otherwise.

To generate sparse entity embeddings, following Equation (10), we select all the entries of the entity embeddings whose absolute values are less than a threshold λ and set them to 0. Note that for any given sparsity, we can always find a proper threshold λ to approximate it, as s_λ is non-decreasing with respect to λ. Then, we evaluate the quality of the sparsified entity embeddings on the knowledge graph completion task. Figure 3 shows the effect of the entity embeddings' λ-sparsity on MRR. The results show that DURA causes a much gentler performance decrement as the embedding sparsity increases. In Figure 3(a), incorporated with DURA, CP maintains an MRR of 0.366 even when 60% of the entries are set to 0. More surprisingly, when the sparsity reaches 70%, CP-DURA still outperforms CP-N3 with zero sparsity. For RESCAL, when 80% of the entries are set to 0, RESCAL-DURA still attains an MRR of 0.341, significantly outperforming vanilla RESCAL, whose MRR decreases from 0.352 to 0.286. In summary, incorporated with the DURA regularizer, CP and RESCAL remain comparable to state-of-the-art models even when 70% of the entries of their entity embeddings are set to 0.
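The λ-sparsity in (10) and the thresholding step can be sketched as follows (function names are our own, not the paper's code):

```python
import numpy as np

def lambda_sparsity(E, lam):
    """Eq. (10): proportion of embedding entries with absolute value < lam."""
    return float(np.mean(np.abs(E) < lam))

def sparsify(E, lam):
    """Set entries with absolute value below lam to exactly zero."""
    return np.where(np.abs(E) < lam, 0.0, E)

E = np.array([[0.1, -0.5], [2.0, 0.01]])
s = lambda_sparsity(E, 0.2)   # 0.5: two of the four entries are below 0.2
E_sparse = sparsify(E, 0.2)   # the 0.1 and 0.01 entries become 0
```

Sweeping `lam` traces out the sparsity-versus-MRR curves of Figure 3.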

Following Han et al. (2015), we store the sparse embedding matrices in compressed sparse row (CSR) format or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number of non-zero elements and n is the number of rows or columns. Experiments show that DURA brings about 65% lower storage costs for entity embeddings when 70% of the entries are set to 0. Therefore, DURA can significantly reduce storage usage while maintaining satisfying performance.
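The storage count can be checked with a short helper, assuming the usual CSR layout of one array of non-zero values, one of column indices, and n + 1 row pointers:

```python
def csr_storage_numbers(dense_rows):
    """Numbers stored by CSR for a dense matrix given as a list of rows:
    a values + a column indices + (n + 1) row pointers = 2a + n + 1,
    where a is the number of non-zeros and n the number of rows."""
    a = sum(1 for row in dense_rows for x in row if x != 0)
    n = len(dense_rows)
    return 2 * a + n + 1

# A 2x2 matrix with 2 non-zeros needs 2*2 + 2 + 1 = 7 stored numbers;
# the saving over dense storage appears once the matrix is large and sparse.
```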

## 6 Conclusion

We propose a widely applicable and effective regularizer—namely, DURA—for tensor factorization based knowledge graph completion models. DURA is based on the observation that, for an existing tensor factorization based KGC model (primal), there is often another distance based KGC model (dual) closely associated with it. Experiments show that DURA brings consistent and significant improvements to TFB models on benchmark datasets. Moreover, visualization results show that DURA can encourage entities with similar semantics to have similar embeddings, which is beneficial to the prediction of unknown triplets.

## A.    Proof for Theorem 1

###### Theorem 2.

Suppose that X̂_j = H R_j T^⊤ for j = 1, …, |R|, where H and T are real matrices and R_j is diagonal. Then, the following equation holds

$$\min_{\hat{X}_j = H R_j T^\top} \frac{1}{4\sqrt{|R|}} \sum_{j=1}^{|R|} \left( \|H R_j\|_F^2 + \|T\|_F^2 + \|T R_j^\top\|_F^2 + \|H\|_F^2 \right) = \|\hat{X}\|_*.$$

The equation holds if and only if ‖h_{:d}‖_2 ‖r_{:d}‖_2 = √|R| ‖t_{:d}‖_2 and ‖t_{:d}‖_2 ‖r_{:d}‖_2 = √|R| ‖h_{:d}‖_2 for all d, where h_{:d}, r_{:d}, and t_{:d} are the d-th columns of H, R, and T, respectively.

###### Proof.

We have that

$$\begin{aligned} \sum_{j=1}^{|R|} \left( \|H R_j\|_F^2 + \|T\|_F^2 \right) &= \sum_{j=1}^{|R|} \left( \sum_{i=1}^{I} \|h_i \circ r_j\|_2^2 + \sum_{d=1}^{D} \|t_{:d}\|_2^2 \right) \\ &= \sum_{j=1}^{|R|} \left( \sum_{d=1}^{D} \|t_{:d}\|_2^2 + \sum_{i=1}^{I} \sum_{d=1}^{D} h_{id}^2 r_{jd}^2 \right) \\ &= |R| \sum_{d=1}^{D} \|t_{:d}\|_2^2 + \sum_{d=1}^{D} \|h_{:d}\|_2^2 \|r_{:d}\|_2^2 \\ &= \sum_{d=1}^{D} \left( \|h_{:d}\|_2^2 \|r_{:d}\|_2^2 + |R| \|t_{:d}\|_2^2 \right) \\ &\ge \sum_{d=1}^{D} 2\sqrt{|R|}\, \|h_{:d}\|_2 \|r_{:d}\|_2 \|t_{:d}\|_2 \\ &= 2\sqrt{|R|} \sum_{d=1}^{D} \|h_{:d}\|_2 \|r_{:d}\|_2 \|t_{:d}\|_2. \end{aligned}$$

The equality follows from the AM–GM inequality and holds if and only if ‖h_{:d}‖_2 ‖r_{:d}‖_2 = √|R| ‖t_{:d}‖_2 for all d.

For any CP decomposition X̂ = Σ_{d=1}^{D} h_{:d} ⊗ r_{:d} ⊗ t_{:d}, we can always rescale the columns as h'_{:d} = α_d h_{:d}, r'_{:d} = β_d r_{:d}, and t'_{:d} = γ_d t_{:d} with α_d β_d γ_d = 1, such that

$$\|h'_{:d}\|_2 \|r'_{:d}\|_2 = \sqrt{|R|}\, \|t'_{:d}\|_2,$$

and meanwhile ensure that X̂ = Σ_{d=1}^{D} h'_{:d} ⊗ r'_{:d} ⊗ t'_{:d}. Therefore, we know that

$$\begin{aligned} \frac{1}{\sqrt{|R|}} \sum_{j=1}^{|R|} \|\hat{X}_j\|_* &= \frac{1}{2\sqrt{|R|}} \sum_{j=1}^{|R|} \min_{\hat{X}_j = H R_j T^\top} \left( \|H R_j\|_F^2 + \|T\|_F^2 \right) \\ &\le \frac{1}{2\sqrt{|R|}} \min_{\hat{X}_j = H R_j T^\top} \sum_{j=1}^{|R|} \left( \|H R_j\|_F^2 + \|T\|_F^2 \right) \\ &= \min_{\hat{X} = \sum_{d=1}^{D} h_{:d} \otimes r_{:d} \otimes t_{:d}} \sum_{d=1}^{D} \|h_{:d}\|_2 \|r_{:d}\|_2 \|t_{:d}\|_2 \\ &= \|\hat{X}\|_*. \end{aligned}$$

In the same manner, we know that

$$\frac{1}{2\sqrt{|R|}} \min_{\hat{X}_j = H R_j T^\top} \sum_{j=1}^{|R|} \left( \|T R_j^\top\|_F^2 + \|H\|_F^2 \right) = \|\hat{X}\|_*.$$

The equality holds if and only if ‖t_{:d}‖_2 ‖r_{:d}‖_2 = √|R| ‖h_{:d}‖_2 for all d.

Therefore, the conclusion holds if and only if ‖h_{:d}‖_2 ‖r_{:d}‖_2 = √|R| ‖t_{:d}‖_2 and ‖t_{:d}‖_2 ‖r_{:d}‖_2 = √|R| ‖h_{:d}‖_2 for all d. ∎

Therefore, for DURA, we know that

$$\min_{\hat{X}_j = H R_j T^\top} \frac{1}{4\sqrt{|R|}} \sum_{j=1}^{|R|} \left( \|H R_j\|_F^2 + \|T\|_F^2 + \|T R_j^\top\|_F^2 + \|H\|_F^2 \right) = \|\hat{X}\|_*,$$

which completes the proof.

## B.    The optimal value of p

In DB models, the commonly used exponent p in the Minkowski distance is either 1 or 2. When p = 2, DURA takes the form of Equation (8) in the main text. If p = 1, we cannot expand the squared score function of the associated DB models as in Equation (4). Thus, the induced regularizer takes the form of Equation (8) with the squared ℓ2 norms replaced by ℓ1 norms. This regularizer with p = 1 (Reg_p1) does not give an upper bound on the tensor nuclear 2-norm as in Theorem 1. Table 4 shows that DURA significantly outperforms Reg_p1 on WN18RR and FB15k-237. Therefore, we choose p = 2.

## C.    Computational Complexity

Suppose that n is the number of triplets known to be true in the knowledge graph and D is the embedding dimension of entities. Then, for CP and ComplEx, the complexity of DURA is O(nD); for RESCAL, the complexity of DURA is O(nD²). That is to say, the computational complexity of weighted DURA is the same as that of the weighted squared Frobenius norm regularizer.

## D.    More Details About Experiments

In this section, we introduce the training protocol and the evaluation protocol.

### D.1    Training Protocol

We adopt the cross entropy loss function for all considered models as suggested in Ruffinelli et al. (2020). We adopt the “reciprocal” setting that creates a new triplet (t, r⁻¹, h) for each triplet (h, r, t) Lacroix et al. (2018); Kazemi and Poole (2018). We use Adagrad (Duchi et al., 2011) as the optimizer and use grid search to find the best hyperparameters based on the performance on the validation datasets. Specifically, we grid search the learning rates, the regularization coefficients, and the weights λ_1 and λ_2, as well as the batch sizes and embedding sizes, with dataset-specific ranges on WN18RR, FB15k-237, and YAGO3-10.

We implement DURA in PyTorch and run on all experiments with a single NVIDIA GeForce RTX 2080Ti graphics card.

As we regard link prediction as a multi-class classification problem and adopt the cross entropy loss, we can assign different weights to different classes (i.e., tail entities) based on their frequency of occurrence in the training dataset. Specifically, suppose that the loss of a given query (h, r, ?) with true tail entity t is ℓ((h, r, ?), t); then the weighted loss is

$$w(t)\, \ell((h, r, ?), t),$$

where

$$w(t) = w_0 \frac{\#t}{\max\{\#t_i : t_i \in \text{training set}\}} + (1 - w_0),$$

w_0 is a fixed number and #t denotes the frequency of occurrence of the entity t in the training set. For all models on WN18RR and for RESCAL on YAGO3-10 we use one setting of w_0, and for all the other cases we use another.
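The class weight w(t) above can be computed in one pass over the training tails; the helper below is an illustrative sketch with names of our own choosing:

```python
from collections import Counter

def tail_weights(train_tails, w0):
    """w(t) = w0 * #t / max_i(#t_i) + (1 - w0), as defined above.

    train_tails: iterable of the tail entities of the training triplets.
    Returns a dict mapping each tail entity to its loss weight.
    """
    counts = Counter(train_tails)
    max_count = max(counts.values())
    return {t: w0 * c / max_count + (1 - w0) for t, c in counts.items()}

# Frequent tails get weights near 1; rare tails get weights near 1 - w0.
weights = tail_weights(["a", "a", "b"], w0=0.5)
```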

We fix the learning rate after grid search. Table 5 shows the other best hyperparameters for DURA found by grid search. Please refer to the Experiments part in the main text for the search ranges of the hyperparameters.

### D.2    Evaluation Protocol

Following Bordes et al. (2013), we use entity ranking as the evaluation task. For each triplet (h, r, t) in the test dataset, the model is asked to answer the queries (h, r, ?) and (?, r, t). To do this, we fill the position of the missing entity with each candidate entity to create a set of candidate triplets, and then rank the triplets in descending order by their scores. Following the “Filtered” setting in Bordes et al. (2013), we then filter out all existing triplets known to be true at ranking. We choose Mean Reciprocal Rank (MRR) and Hits at N (H@N) as the evaluation metrics. Higher MRR or H@N indicates better performance. Detailed definitions are as follows.

• The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}.$$

• The Hits@N is the ratio of ranks that are no more than N:

$$\mathrm{Hits@N} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}_{x \le N}(\mathrm{rank}_i),$$

where 𝟙_{x ≤ N}(rank_i) = 1 if rank_i ≤ N and 0 otherwise.
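Both metrics follow directly from the list of filtered ranks; a minimal sketch:

```python
def mrr(ranks):
    """Mean reciprocal rank of a list of (1-indexed) ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    """Fraction of ranks that are at most n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

# Example: three queries whose true answers are ranked 1, 2, and 4.
ranks = [1, 2, 4]
```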

### D.3    The queries in T-SNE visualization

In Table 6, we list the ten queries used in the T-SNE visualization (Section 5.4 in the main text). Note that a query is represented as (h, r, ?), where h denotes the head entity and r denotes the relation.

## References

• I. Balazevic, C. Allen, and T. Hospedales (2019a) Multi-relational Poincaré graph embeddings. In Advances in Neural Information Processing Systems 32, pp. 4463–4473.
• I. Balazevic, C. Allen, and T. Hospedales (2019b) TuckER: tensor factorization for knowledge graph completion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5185–5194.
• A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, pp. 2787–2795.
• A. Bordes, J. Weston, R. Collobert, and Y. Bengio (2011) Learning structured embeddings of knowledge bases. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence.