## 1 Introduction

Knowledge graphs (KGs) [LehmannIJJKMHMK15, MahdisoltaniBS15] represent facts as subject-relation-object triples, e.g., (London, capital_of, UK). KG embedding (KGE) models embed each entity and each relation of a given KG into a latent semantic space such that important structure of the KG is retained. A large number of KGE models have been proposed in the literature; applications include question answering [questionanwsering, QANLP2], semantic search [SemanticSearch], and recommendation [RECKG, RECKG2].

Many of the available KGE models can be expressed as *bilinear models*, on which we focus throughout. Examples include RESCAL [NickelTK11], DistMult [YangYHGD14a], ComplEx [TrouillonWRGB16], Analogy [Analogy], and CP [Canonical]. KGE models assign a “score” to each subject-relation-object triple; high-scoring triples are considered more likely to be true. In bilinear models, the score is computed using a relation-specific linear combination of the pairwise interactions of the embeddings of the subject and the object. The models differ in the kind of interactions that are considered: RESCAL is dense in that it considers all pairwise interactions, whereas the other aforementioned models are sparse in that they consider only a small, hard-coded subset of interactions (and learn weights only for this subset). As a consequence, these latter models have fewer parameters. They empirically show state-of-the-art performance [LiuWY17, TrouillonWRGB16, Canonical] for multi-relational link prediction tasks.

In this paper, we propose the Relational Tucker3 (RT) decomposition, which tailors the standard Tucker3 decomposition [Tucker1966] to the relational domain. The RT decomposition is inspired by RESCAL, which specialized the Tucker2 decomposition in a similar way. We use the RT decomposition as a tool to explore (1) whether we can automatically learn which interactions should be considered instead of using hard-coded sparsity patterns, (2) whether and when this is beneficial, and finally (3) whether sparsity is indeed necessary to learn good representations.

In a nutshell, RT decomposes the KG into an entity embedding matrix, a relation embedding matrix, and a core tensor. We show that all existing bilinear models are special cases of RT under different viewpoints: the fixed core tensor view and the constrained core tensor view. In both cases, the differences between different bilinear models are reflected in different (fixed a priori) sparsity patterns of the associated core tensor. In contrast to bilinear models, RT offers a natural way to decouple entity and relation embedding sizes and allows parameter sharing across relations. These properties allow us to learn state-of-the-art dense representations for KGs. Moreover, to study the questions raised above, we propose and explore a sparse RT decomposition, in which the core tensor is encouraged to be sparse, but without using a predefined sparsity pattern.

We conducted an experimental study on common benchmark datasets to gain insight into the dense and sparse RT decompositions and to compare them with state-of-the-art models. Our results indicate that dense RT models can outperform state-of-the-art sparse models (when using the same number of parameters), and that it is possible and sometimes beneficial to learn sparsity patterns via a sparse RT model. We found that the best-performing method is dataset-dependent.

## 2 Background

#### Multi-relational link prediction.

Given a set $\mathcal{E}$ of entities and a set $\mathcal{R}$ of relations, a knowledge graph $\mathcal{K} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ is a set of triples $(i, k, j)$, where $i, j \in \mathcal{E}$ and $k \in \mathcal{R}$. Commonly, $i$, $k$, and $j$ are referred to as the *subject*, *relation*, and *object*, respectively. A knowledge graph can be viewed as a labeled graph, where each vertex corresponds to an entity, each label to a relation, and each labeled edge to a triple. The goal of multi-relational link prediction is to determine correct but unobserved triples based on $\mathcal{K}$. The task has been studied extensively in the literature [Nickel0TG16]. The main approaches include rule-based methods [PATH, AMIE, meilicke2018fine], knowledge graph embeddings [BordesUGWY13, TrouillonWRGB16, NickelTK11, NickelRP16, Analogy, dettmers2018conve], and combined methods such as [COMRULEEMBED].

#### KG embedding (KGE) models.

A KGE model associates with each entity $i \in \mathcal{E}$ and each relation $k \in \mathcal{R}$ an embedding $\boldsymbol{e}_i \in \mathbb{R}^{d_e}$ and $\boldsymbol{r}_k \in \mathbb{R}^{d_r}$ in a low-dimensional vector space, respectively. Here $d_e, d_r \in \mathbb{N}^+$ are hyper-parameters that refer to the *size* of the entity embeddings and relation embeddings, respectively. Each model uses a scoring function $s: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$ to associate a score $s(i,k,j)$ to each triple $(i,k,j)$. The scoring function depends on $i$, $k$, and $j$ only through their respective embeddings $\boldsymbol{e}_i$, $\boldsymbol{r}_k$, and $\boldsymbol{e}_j$. Triples with high scores are considered more likely to be true than triples with low scores.

Embedding models can roughly be classified into translation-based models [BordesUGWY13, WangZFC14], factorization models [TrouillonN17, Analogy], and neural models [SocherCMN13, dettmers2018conve]. Many of the available KGE models can be expressed as bilinear models [Bilinear], in which the scoring function takes the form

$$s(i,k,j) = \boldsymbol{e}_i^\top \boldsymbol{M}_k\, \boldsymbol{e}_j, \tag{1}$$

where $\boldsymbol{e}_i, \boldsymbol{e}_j \in \mathbb{R}^{d_e}$ and $\boldsymbol{M}_k \in \mathbb{R}^{d_e \times d_e}$. We refer to matrix $\boldsymbol{M}_k$ as the mixing matrix for relation $k$; $\boldsymbol{M}_k$ is derived from $\boldsymbol{r}_k$ using a model-specific mapping $f: \mathbb{R}^{d_r} \to \mathbb{R}^{d_e \times d_e}$. Existing bilinear models differ from each other mainly in this mapping. We summarize some of the most prevalent models in what follows. We use $\odot$ for the Hadamard product (i.e., elementwise multiplication), $\operatorname{vec}(\cdot)$ for the vectorization of a matrix from its columns, $\boldsymbol{I}$ for the identity matrix, $\operatorname{diag}(\cdot)$ for the diagonal matrix built from the arguments, and $\boldsymbol{0}$ for a zero matrix of appropriate size. By convention, vectors of form $\boldsymbol{a}_i$ refer to rows of some matrix $\boldsymbol{A}$ (taken as column vectors) and scalars of form $a_{ij}$ to individual entries.
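As a concrete illustration of Eq. (1), the following NumPy sketch (toy sizes and random values chosen purely for illustration) verifies that the bilinear score is a relation-specific weighted sum over all pairwise interactions of the subject and object embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d_e = 4                                 # entity embedding size (toy value)
e_i = rng.standard_normal(d_e)          # subject embedding
e_j = rng.standard_normal(d_e)          # object embedding
M_k = rng.standard_normal((d_e, d_e))   # mixing matrix for relation k

# Bilinear score of Eq. (1): s(i, k, j) = e_i^T M_k e_j
score = e_i @ M_k @ e_j

# Equivalently: a weighted sum over all pairwise interactions e_i[a] * e_j[b]
score_pairwise = sum(M_k[a, b] * e_i[a] * e_j[b]
                     for a in range(d_e) for b in range(d_e))
assert np.isclose(score, score_pairwise)
```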

#### RESCAL [NickelTK11].

RESCAL is an unconstrained bilinear model and directly learns the mixing matrices $\boldsymbol{M}_k$. In our notation, RESCAL sets $d_r = d_e^2$ and uses the mixing matrix $\boldsymbol{M}_k$ with

$$\operatorname{vec}(\boldsymbol{M}_k) = \boldsymbol{r}_k.$$

All of the bilinear models discussed below can be seen as constrained variants of RESCAL; constraints are used to facilitate learning and reduce the number of parameters.

#### DistMult [YangYHGD14a].

DistMult puts a diagonality constraint on the mixing matrices. The relation embeddings hold the values on the diagonal, i.e., $d_r = d_e$ and

$$\boldsymbol{M}_k = \operatorname{diag}(\boldsymbol{r}_k).$$

Since each mixing matrix is symmetric, we have $s(i,k,j) = s(j,k,i)$ so that DistMult can only model symmetric relations. DistMult is equivalent to the INDSCAL tensor decomposition [Carroll1970].
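The symmetry restriction is easy to check numerically. In the sketch below (illustrative toy values), the diagonal mixing matrix makes the score invariant under swapping subject and object:

```python
import numpy as np

rng = np.random.default_rng(1)
d_e = 4
e_i, e_j = rng.standard_normal((2, d_e))  # subject and object embeddings
r_k = rng.standard_normal(d_e)            # relation embedding

M_k = np.diag(r_k)  # DistMult mixing matrix: diag(r_k)

# Because M_k is symmetric, s(i, k, j) = s(j, k, i)
assert np.isclose(e_i @ M_k @ e_j, e_j @ M_k @ e_i)
```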

#### CP [Canonical].

CP is another classical tensor decomposition [Tensor] and has recently shown good results for KGE. Here CP associates two embeddings $\boldsymbol{a}_i, \boldsymbol{b}_i \in \mathbb{R}^{d_e/2}$ with each entity $i$ and uses a scoring function of form $s(i,k,j) = \boldsymbol{a}_i^\top \operatorname{diag}(\boldsymbol{r}_k)\, \boldsymbol{b}_j$. The CP decomposition can be expressed as a bilinear model using mixing matrix

$$\boldsymbol{M}_k = \begin{pmatrix} \boldsymbol{0} & \operatorname{diag}(\boldsymbol{r}_k) \\ \boldsymbol{0} & \boldsymbol{0} \end{pmatrix},$$

where $d_e$ is even, $d_r = d_e/2$, and thus $\boldsymbol{M}_k \in \mathbb{R}^{d_e \times d_e}$. To see this, observe that if we set $\boldsymbol{e}_i = (\boldsymbol{a}_i; \boldsymbol{b}_i)$, then $\boldsymbol{e}_i^\top \boldsymbol{M}_k\, \boldsymbol{e}_j = \boldsymbol{a}_i^\top \operatorname{diag}(\boldsymbol{r}_k)\, \boldsymbol{b}_j$. Note that CP can model symmetric and asymmetric relations.
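This reduction of CP to a bilinear model can be replayed numerically. The sketch below (toy sizes, illustrative values) stacks the two per-entity embeddings and places the diagonal relation matrix in the upper-right block of the mixing matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
half = 3  # d_e / 2
a_i, b_i = rng.standard_normal((2, half))  # subject-role / object-role embeddings of entity i
a_j, b_j = rng.standard_normal((2, half))  # same for entity j
r_k = rng.standard_normal(half)

# CP score with two embeddings per entity
cp_score = a_i @ np.diag(r_k) @ b_j

# Same score as a bilinear model: stack e_i = (a_i; b_i) and place diag(r_k)
# in the upper-right block of the mixing matrix
e_i = np.concatenate([a_i, b_i])
e_j = np.concatenate([a_j, b_j])
M_k = np.zeros((2 * half, 2 * half))
M_k[:half, half:] = np.diag(r_k)

assert np.isclose(cp_score, e_i @ M_k @ e_j)
```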

#### ComplEx [TrouillonWRGB16].

ComplEx is currently one of the best-performing KGE models (see also Sec. LABEL:sec:compvsprior). Let $d_e$ be even, set $d_r = d_e$, and denote by $\boldsymbol{r}_{k1}$ and $\boldsymbol{r}_{k2}$ the first and last $d_e/2$ entries of $\boldsymbol{r}_k$, respectively. ComplEx then uses mixing matrix

$$\boldsymbol{M}_k = \begin{pmatrix} \operatorname{diag}(\boldsymbol{r}_{k1}) & \operatorname{diag}(\boldsymbol{r}_{k2}) \\ -\operatorname{diag}(\boldsymbol{r}_{k2}) & \operatorname{diag}(\boldsymbol{r}_{k1}) \end{pmatrix}.$$

As CP, ComplEx can model both symmetric ($\boldsymbol{r}_{k2} = \boldsymbol{0}$) and asymmetric ($\boldsymbol{r}_{k2} \neq \boldsymbol{0}$) relations.

ComplEx can be expressed in a number of equivalent ways [SimpleE]. In their original work, TrouillonWRGB16 use complex embeddings (instead of real ones) and scoring function $s(i,k,j) = \operatorname{Re}\!\left(\boldsymbol{e}_i^\top \operatorname{diag}(\boldsymbol{r}_k)\, \bar{\boldsymbol{e}}_j\right)$, where $\operatorname{Re}(\cdot)$ extracts the real part of a complex number and $\bar{\boldsymbol{e}}_j$ denotes the complex conjugate of $\boldsymbol{e}_j$. Likewise, HolE [NickelRP16] is equivalent to ComplEx [HayashiS17]. HolE uses the scoring function $s(i,k,j) = \boldsymbol{r}_k^\top (\boldsymbol{e}_i \star \boldsymbol{e}_j)$, where $\boldsymbol{e}_i \star \boldsymbol{e}_j$ refers to the *circular correlation* between $\boldsymbol{e}_i$ and $\boldsymbol{e}_j$ (i.e., $[\boldsymbol{a} \star \boldsymbol{b}]_m = \sum_{l=0}^{d_e-1} a_l\, b_{(l+m) \bmod d_e}$, with 0-based indexing). The idea of using circular correlation relates to associative memory [NickelRP16]. HolE uses as mixing matrix the circulant matrix resulting from $\boldsymbol{r}_k$. In the remainder of this paper, we use the formulation of ComplEx via the mixing matrix $\boldsymbol{M}_k$ given above.
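The equivalence between the original complex-valued formulation and the real-valued mixing-matrix formulation can be verified numerically; the following sketch (toy sizes, illustrative values) computes both scores and compares them:

```python
import numpy as np

rng = np.random.default_rng(3)
half = 3  # d_e / 2
# Complex embeddings and relation vector (original ComplEx formulation)
e_i = rng.standard_normal(half) + 1j * rng.standard_normal(half)
e_j = rng.standard_normal(half) + 1j * rng.standard_normal(half)
r_k = rng.standard_normal(half) + 1j * rng.standard_normal(half)

# Original ComplEx score: Re(<e_i, r_k, conj(e_j)>)
complex_score = np.real(np.sum(e_i * r_k * np.conj(e_j)))

# Equivalent real-valued bilinear formulation with a block mixing matrix
u = np.concatenate([e_i.real, e_i.imag])  # real-valued subject embedding
v = np.concatenate([e_j.real, e_j.imag])  # real-valued object embedding
M_k = np.block([[np.diag(r_k.real),  np.diag(r_k.imag)],
                [-np.diag(r_k.imag), np.diag(r_k.real)]])

assert np.isclose(complex_score, u @ M_k @ v)
```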

#### Analogy [LiuWY17].

Analogy uses block-diagonal mixing matrices $\boldsymbol{M}_k$, where each block is either (1) a real scalar $x$ or (2) a $2 \times 2$ matrix of form $\begin{pmatrix} x & -y \\ y & x \end{pmatrix}$, where $x$ and $y$ refer to entries of $\boldsymbol{r}_k$ and each entry of $\boldsymbol{r}_k$ appears in exactly one block. We have $d_r = d_e$. Analogy aims to capture commutative relational structure: the constraint ensures that $\boldsymbol{M}_k \boldsymbol{M}_{k'} = \boldsymbol{M}_{k'} \boldsymbol{M}_k$ for all $k, k' \in \mathcal{R}$. Both DistMult (only first case allowed) and ComplEx (only second case allowed) are special cases of Analogy.
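The commutativity property can be checked numerically. The sketch below (a hypothetical 3-dimensional example with one scalar block and one 2x2 block; not taken from the Analogy paper) confirms that two Analogy mixing matrices with the same block structure commute:

```python
import numpy as np

def analogy_block(x, y):
    """2x2 block of the form [[x, -y], [y, x]] (a scaled rotation)."""
    return np.array([[x, -y], [y, x]])

def mixing(r):
    """Block-diagonal Analogy mixing matrix: one scalar block, one 2x2 block."""
    M = np.zeros((3, 3))
    M[0, 0] = r[0]
    M[1:, 1:] = analogy_block(r[1], r[2])
    return M

rng = np.random.default_rng(9)
M_k = mixing(rng.standard_normal(3))
M_l = mixing(rng.standard_normal(3))

# Matrices with this shared block structure commute, the property
# Analogy is designed around
assert np.allclose(M_k @ M_l, M_l @ M_k)
```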

## 3 The Relational Tucker3 Decomposition

In this section, we introduce the Relational Tucker3 (RT) decomposition, which decomposes the KG into entity embeddings, relation embeddings, and a core tensor. We show that each of the existing bilinear models can be viewed (1) as an unconstrained RT decomposition with a fixed (sparse) core tensor or (2) as a constrained RT decomposition with fixed relation embeddings. In contrast to bilinear models, the RT decomposition allows parameter sharing across different relations, and it decouples the entity and relation embedding sizes. Our experimental study (Sec. 4) suggests that both properties can be beneficial. We also introduce a sparse variant of RT called SRT to approach the question of whether we can learn sparsity patterns of the core tensor from the data.

In what follows, we make use of the tensor representation of KGs. In particular, we represent a knowledge graph $\mathcal{K}$ over $N$ entities and $K$ relations via a binary tensor $\boldsymbol{X} \in \{0,1\}^{N \times N \times K}$, where $x_{ijk} = 1$ if and only if $(i,k,j) \in \mathcal{K}$. Note that if $x_{ijk} = 0$, we assume that the truth value of triple $(i,k,j)$ is missing instead of false. The *scoring tensor* $\boldsymbol{S} \in \mathbb{R}^{N \times N \times K}$ of a particular embedding model of $\mathcal{K}$ is the tensor of all predicted scores, i.e., with $s_{ijk} = s(i,k,j)$. We use $\boldsymbol{X}_k$ to refer to the $k$-th frontal slice of a 3-way tensor $\boldsymbol{X}$. Note that $\boldsymbol{X}_k$ contains the data for relation $k$, and that *scoring matrix* $\boldsymbol{S}_k$ contains the respective predicted scores. Generally, embedding models aim to construct scoring tensors $\boldsymbol{S}$ that suitably approximate $\boldsymbol{X}$ [Nickel0TG16].

### 3.1 Definition

We start with the classical Tucker3 decomposition [Tucker1966], focusing on 3-way tensors throughout. The Tucker3 decomposition decomposes a given data tensor into three factor matrices (one per mode) and a core tensor, which stores the weights of the three-way interactions. The decomposition can be viewed as a form of higher-order PCA [Tensor]. In more detail, given a tensor $\boldsymbol{X} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$ and sufficiently large parameters $d_1, d_2, d_3$, the Tucker3 decomposition factorizes $\boldsymbol{X}$ into factor matrices $\boldsymbol{A} \in \mathbb{R}^{N_1 \times d_1}$, $\boldsymbol{B} \in \mathbb{R}^{N_2 \times d_2}$, $\boldsymbol{C} \in \mathbb{R}^{N_3 \times d_3}$, and core tensor $\boldsymbol{G} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ such that

$$\boldsymbol{X}_k \approx \boldsymbol{A} \left( \boldsymbol{G} \times_3 \boldsymbol{c}_k \right) \boldsymbol{B}^\top \quad \text{for } 1 \le k \le N_3,$$

where $\times_3$ refers to the mode-3 tensor product defined as

$$\boldsymbol{G} \times_3 \boldsymbol{c}_k = \sum_{m=1}^{d_3} c_{km}\, \boldsymbol{G}_m,$$

i.e., a linear combination of the frontal slices of $\boldsymbol{G}$. If $d_1, d_2, d_3$ are smaller than $N_1, N_2, N_3$, core tensor $\boldsymbol{G}$ can be interpreted as a compressed version of $\boldsymbol{X}$. It is well-known that the CP decomposition [Tensor] corresponds to the special case where $d_1 = d_2 = d_3 = d$ and $\boldsymbol{G}$ is fixed to the tensor with $g_{m_1 m_2 m_3} = 1$ iff $m_1 = m_2 = m_3$, else 0. The RT decomposition, which we introduce next, allows us to view existing bilinear models as decompositions with a fixed core tensor as well.
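The Tucker3 reconstruction and the mode-3 product can be sketched in a few lines of NumPy (toy dimensions and random factors chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
N1, N2, N3 = 5, 5, 4          # data tensor dimensions (toy sizes)
d1, d2, d3 = 3, 3, 2          # core tensor dimensions

A = rng.standard_normal((N1, d1))
B = rng.standard_normal((N2, d2))
C = rng.standard_normal((N3, d3))
G = rng.standard_normal((d1, d2, d3))  # core tensor, frontal slices G[:, :, m]

# Frontal slice k of the reconstruction: A (G x_3 c_k) B^T, where
# G x_3 c_k is a linear combination of the frontal slices of G
k = 1
G_x3_ck = sum(C[k, m] * G[:, :, m] for m in range(d3))
X_k = A @ G_x3_ck @ B.T

# The same reconstruction, for all slices at once, via einsum
X = np.einsum('ia,jb,kc,abc->ijk', A, B, C, G)
assert np.allclose(X[:, :, k], X_k)
```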

In particular, in KG embedding models, we associate each entity with a single embedding, which we use to represent the entity in both subject and object position. The relational Tucker3 (RT) decomposition applies this approach to the Tucker3 decomposition by enforcing $\boldsymbol{A} = \boldsymbol{B}$. In particular, given embedding sizes $d_e$ and $d_r$, the RT decomposition is parameterized by an *entity embedding matrix* $\boldsymbol{E} \in \mathbb{R}^{N \times d_e}$, a *relation embedding matrix* $\boldsymbol{R} \in \mathbb{R}^{K \times d_r}$, and a core tensor $\boldsymbol{G} \in \mathbb{R}^{d_e \times d_e \times d_r}$. As in the standard Tucker3 decomposition, RT composes mixing matrices from the frontal slices of the core tensor, i.e.,

$$\boldsymbol{M}_k = \boldsymbol{G} \times_3 \boldsymbol{r}_k = \sum_{m=1}^{d_r} r_{km}\, \boldsymbol{G}_m.$$

The scoring tensor has entries

$$s_{ijk} = \boldsymbol{e}_i^\top \boldsymbol{M}_k\, \boldsymbol{e}_j = \boldsymbol{e}_i^\top \left( \boldsymbol{G} \times_3 \boldsymbol{r}_k \right) \boldsymbol{e}_j. \tag{2}$$

Note that the mixing matrices for different relations share parameters through the frontal slices of the core tensor.
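A minimal NumPy sketch of the RT scoring tensor of Eq. (2) follows (toy sizes; note that the entity and relation embedding sizes are chosen independently, and the core tensor is shared across all relations):

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 6, 3        # number of entities / relations (toy sizes)
d_e, d_r = 4, 2    # entity / relation embedding sizes, chosen independently

E = rng.standard_normal((N, d_e))          # entity embeddings
R = rng.standard_normal((K, d_r))          # relation embeddings
G = rng.standard_normal((d_e, d_e, d_r))   # core tensor, shared across relations

# Mixing matrix of relation k: linear combination of core slices
k = 0
M_k = sum(R[k, m] * G[:, :, m] for m in range(d_r))

i, j = 2, 5
score = E[i] @ M_k @ E[j]

# Full scoring tensor in one shot
S = np.einsum('ia,jb,kc,abc->ijk', E, E, R, G)
assert np.isclose(S[i, j, k], score)
```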

The RT decomposition can represent any given tensor, i.e., the restriction to a single embedding per entity does not limit expressiveness. To see this, suppose that we are given a Tucker3 decomposition $\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}, \boldsymbol{G}$ of some tensor. Now consider the RT decomposition given by

$$\boldsymbol{E} = \begin{pmatrix} \boldsymbol{A} & \boldsymbol{B} \end{pmatrix}, \qquad \boldsymbol{R} = \boldsymbol{C}, \qquad \boldsymbol{G}'_m = \begin{pmatrix} \boldsymbol{0} & \boldsymbol{G}_m \\ \boldsymbol{0} & \boldsymbol{0} \end{pmatrix} \ \text{for } 1 \le m \le d_3.$$

We can verify that both decompositions produce the same tensor. Note that we used a similar construction in Sec. 2 to represent the CP decomposition as a bilinear model.
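The verification is mechanical; the following sketch (toy dimensions of our own choosing) builds an RT decomposition from a given Tucker3 decomposition by concatenating the factor matrices and padding the core slices, and checks that both reconstructions coincide:

```python
import numpy as np

rng = np.random.default_rng(10)
N1, N2, N3 = 4, 4, 3   # N1 == N2: both modes index the same entity set
d1, d2, d3 = 2, 2, 2

A = rng.standard_normal((N1, d1))
B = rng.standard_normal((N2, d2))
C = rng.standard_normal((N3, d3))
G = rng.standard_normal((d1, d2, d3))

# RT construction: concatenate A and B into a single embedding matrix and
# move each core slice into the upper-right block (as in the CP-as-bilinear
# construction of Sec. 2)
E = np.concatenate([A, B], axis=1)
G_rt = np.zeros((d1 + d2, d1 + d2, d3))
G_rt[:d1, d1:, :] = G

X_tucker = np.einsum('ia,jb,kc,abc->ijk', A, B, C, G)
X_rt = np.einsum('ia,jb,kc,abc->ijk', E, E, C, G_rt)
assert np.allclose(X_tucker, X_rt)
```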

### 3.2 The Fixed Core Tensor View

The RT decomposition gives rise to a new interpretation of the bilinear models of Sec. 2: We can view them as RT decompositions with a fixed core tensor and unconstrained entity and relation embedding matrices.

Intuitively, in this fixed core tensor view, the relation embedding $\boldsymbol{r}_k$ carries the relation-specific parameters as before, and the core tensor describes where to “place” these parameters in the mixing matrix. For example, in RESCAL, we have $d_r = d_e^2$, and each entry of $\boldsymbol{r}_k$ is placed at a separate position in the mixing matrix. We can express this placement via a fixed core tensor $\boldsymbol{G} \in \{0,1\}^{d_e \times d_e \times d_e^2}$ with

$$g_{ijm} = 1 \iff m = (j-1)d_e + i.$$

The frontal slices of this core tensor for a small value of $d_e$ are shown in Fig. 1. As another example, we can use a similar construction for ComplEx, where we use a fixed core tensor $\boldsymbol{G} \in \{-1,0,1\}^{d_e \times d_e \times d_e}$ with

$$g_{t,t,t} = g_{t+\frac{d_e}{2},\,t+\frac{d_e}{2},\,t} = 1, \qquad g_{t,\,t+\frac{d_e}{2},\,t+\frac{d_e}{2}} = 1, \qquad g_{t+\frac{d_e}{2},\,t,\,t+\frac{d_e}{2}} = -1$$

for $1 \le t \le d_e/2$, and all remaining entries equal to 0. The corresponding frontal slices are illustrated in Fig. 2.
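For the RESCAL case, the fixed core tensor can be constructed explicitly; the sketch below (toy size $d_e = 3$) checks that the resulting mixing matrix is exactly the un-vectorized relation embedding:

```python
import numpy as np

d_e = 3
d_r = d_e * d_e

# Fixed RESCAL core: entry m of r_k is placed at position (i, j) of M_k,
# following column-major vectorization (m = j * d_e + i, 0-indexed here)
G = np.zeros((d_e, d_e, d_r))
for j in range(d_e):
    for i in range(d_e):
        G[i, j, j * d_e + i] = 1.0

rng = np.random.default_rng(6)
r_k = rng.standard_normal(d_r)

# Mixing matrix: mode-3 product of the fixed core with r_k
M_k = np.einsum('abm,m->ab', G, r_k)

# The mixing matrix is exactly r_k, un-vectorized column by column
assert np.allclose(M_k, r_k.reshape(d_e, d_e, order='F'))
```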

If we express prior bilinear models via the fixed core tensor viewpoint, we obtain extremely sparse core tensors. The sparsity pattern of the core tensor is fixed though, and differs across bilinear models. A natural question is whether we can learn the sparsity pattern from the data instead of fixing it a priori, and whether and when such an approach is beneficial. We empirically approach this question in our experimental study in Sec. 4.

### 3.3 The Constrained Core Tensor View

An alternate viewpoint of existing bilinear models is in terms of a constrained core tensor. In this viewpoint, the relation embedding matrix is fixed to the identity matrix, i.e., $d_r = K$ and $\boldsymbol{R} = \boldsymbol{I}$, and the entity embedding matrix is unconstrained. We have

$$\boldsymbol{M}_k = \boldsymbol{G} \times_3 \boldsymbol{r}_k = \boldsymbol{G}_k.$$

The core tensor thus contains the mixing matrices directly, i.e., $\boldsymbol{G}_k = \boldsymbol{M}_k$. The various bilinear models can be expressed by constraining the frontal slices of the core tensor appropriately (as in Sec. 2).

### 3.4 Discussion

One of the main differences between the two viewpoints is that in the fixed core tensor viewpoint, $d_r$ is determined by $d_e$ (e.g., $d_r = d_e$ for ComplEx) and generally independent of the number of relations $K$. If $d_r < K$, we perform compression along the third mode (corresponding to relations). In contrast, in the constrained core tensor viewpoint, we have $d_r = K$ so that no compression of the third mode is performed.

In a general RT decomposition, there is no a priori coupling between the entity embedding size $d_e$ and the relation embedding size $d_r$, which allows us to choose both freely. Moreover, since the core tensor is shared across relations, different mixing matrices depend on shared parameters. This property can be beneficial if dependencies exist between relations. To illustrate this point, assume a relational dataset containing the three relations parent ($p$), mother ($m$), and father ($f$). Since generally $p = m \cup f$, the relations are highly dependent. Suppose for simplicity that there exist mixing matrices $\boldsymbol{M}_m$ and $\boldsymbol{M}_f$ that perfectly reconstruct the data in that $\boldsymbol{E} \boldsymbol{M}_m \boldsymbol{E}^\top = \boldsymbol{X}_m$ (likewise $f$). If we set $\boldsymbol{M}_p = \boldsymbol{M}_m + \boldsymbol{M}_f$, then $\boldsymbol{E} \boldsymbol{M}_p \boldsymbol{E}^\top = \boldsymbol{X}_m + \boldsymbol{X}_f = \boldsymbol{X}_p$, i.e., we can reconstruct the parent relation without additional parameters. We can express this with an RT decomposition with $d_r = 2$, where $\boldsymbol{G}_1 = \boldsymbol{M}_m$, $\boldsymbol{G}_2 = \boldsymbol{M}_f$, $\boldsymbol{r}_m = (1,0)^\top$, $\boldsymbol{r}_f = (0,1)^\top$, and $\boldsymbol{r}_p = (1,1)^\top$. By choosing $d_r < K$, we thus compress the frontal slices and force the model to discover commonalities between the various relations.
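The parent/mother/father example can be replayed numerically; the sketch below (random toy matrices standing in for the mother and father mixing matrices) builds an RT decomposition with two core slices whose relation embeddings select or sum them:

```python
import numpy as np

rng = np.random.default_rng(8)
N, d_e = 5, 3
E = rng.standard_normal((N, d_e))
M_m = rng.standard_normal((d_e, d_e))  # mixing matrix for 'mother'
M_f = rng.standard_normal((d_e, d_e))  # mixing matrix for 'father'

# RT with d_r = 2: core slices are M_m and M_f; relation embeddings
# select (mother, father) or sum (parent) the slices
G = np.stack([M_m, M_f], axis=-1)
r = {'mother': np.array([1.0, 0.0]),
     'father': np.array([0.0, 1.0]),
     'parent': np.array([1.0, 1.0])}

mix = {name: np.einsum('abm,m->ab', G, v) for name, v in r.items()}
assert np.allclose(mix['parent'], M_m + M_f)

# Parent scores are the sums of the mother and father scores
S = {name: E @ M @ E.T for name, M in mix.items()}
assert np.allclose(S['parent'], S['mother'] + S['father'])
```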

Since the mixing matrix $\boldsymbol{M}_k$ of each relation is determined by both the relation embeddings and the core tensor, an RT decomposition can have many more parameters per relation than $d_r$. To effectively compare various models, we define the effective relation embedding size $\tilde{d}_r$ of a given RT decomposition as the average number of parameters per relation. More precisely, we set

$$\tilde{d}_r = \frac{\operatorname{nnz}(\boldsymbol{R}) + \operatorname{nnz}(\boldsymbol{G})}{K},$$

where $\operatorname{nnz}(\cdot)$ refers to the number of non-zero free parameters in its argument. This definition ensures that the effective relation embedding size of a bilinear model is identical under both the fixed and the constrained core tensor interpretation (even though $d_r$ differs). Consider, for example, a ComplEx model. Under the fixed core tensor viewpoint, we have $\operatorname{nnz}(\boldsymbol{R}) = K d_e$ and $\operatorname{nnz}(\boldsymbol{G}) = 0$ (the core is fixed) so that $\tilde{d}_r = d_e$. In the constrained core tensor viewpoint, we have $\operatorname{nnz}(\boldsymbol{R}) = 0$ and $\operatorname{nnz}(\boldsymbol{G}) = K d_e$ so that $\tilde{d}_r = d_e$ as well (although $d_r$ differs). For RESCAL, we have $\tilde{d}_r = d_e^2$. For a fixed entity embedding size $d_e$, it is plausible that the suitable choice of $\tilde{d}_r$ is data-dependent. In the RT decomposition, we can control $\tilde{d}_r$ via $d_r$, and thus also decouple the entity embedding size from the effective relation embedding sizes. The effective number of parameters of an RT model is given by

$$N d_e + K \tilde{d}_r = N d_e + \operatorname{nnz}(\boldsymbol{R}) + \operatorname{nnz}(\boldsymbol{G}).$$
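A small helper makes the bookkeeping explicit. The sketch below assumes the effective size is the total number of non-zero *free* relation parameters (in the relation embedding matrix and the core tensor) divided by the number of relations, and replays the ComplEx example under both viewpoints:

```python
def effective_relation_size(R_free_nnz, G_free_nnz, K):
    """Average number of non-zero free relation parameters per relation."""
    return (R_free_nnz + G_free_nnz) / K

K, d_e = 10, 4

# ComplEx, fixed core tensor view: R is free (K x d_e entries), the core
# tensor is fixed and hence contributes no free parameters
fixed_view = effective_relation_size(K * d_e, 0, K)

# ComplEx, constrained core tensor view: R is fixed to the identity, and
# each of the K core slices has d_e free parameters
constrained_view = effective_relation_size(0, K * d_e, K)

assert fixed_view == constrained_view == d_e
```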

### 3.5 Sparse Relational Tucker Decomposition

In bilinear models such as ComplEx, DistMult, Analogy, or the CP decomposition, the core tensor is extremely sparse under both interpretations. In a general RT decomposition, this may not be the case and, in fact, the core tensor can become excessively large if it is dense and $d_e$ and $d_r$ are large (it has $d_e^2 d_r$ entries). On the other hand, the RT decomposition allows us to share parameters across relations so that we may use significantly smaller values of $d_e$ to obtain suitable representations. In our experimental study, we found that this was indeed the case in certain settings.

To explore the question of whether and when we can learn sparsity patterns from the data instead of fixing them upfront, we make use of a sparse RT (SRT) decomposition, i.e., an RT decomposition with a sparse core tensor. Let $\theta = \{\boldsymbol{E}, \boldsymbol{R}, \boldsymbol{G}\}$ be the parameter set and $L(\theta)$ be a loss function. In SRT, we add an additional $L_0$ regularization term on the core tensor and optimize

$$\min_\theta \; L(\theta) + \lambda \left\| \operatorname{vec}(\boldsymbol{G}) \right\|_0, \tag{3}$$

where $\lambda > 0$ is an $L_0$ regularization hyper-parameter. Solving Eq. (3) exactly is NP-hard. In practice, to obtain an approximate solution, we apply the hard concrete approximation [Concrete, L0norm], which has shown good results on sparsifying neural networks.¹ This approach also allows us to maintain a sparse model during the training process. In contrast to prior models, the frontal slices of the learned core tensor can have different sparsity patterns, capturing the distinct shared components.

¹Other sparsification techniques can be applied as well, of course, but we found this one to work well in practice.
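For completeness, here is a minimal sketch of hard concrete gate sampling, assuming the standard parameterization from the $L_0$-regularization literature (the stretch interval and temperature below are common defaults, not values taken from this paper); the sampled gates multiply the core tensor entries, and the hard clipping makes exact zeros possible:

```python
import numpy as np

def hard_concrete_gate(log_alpha, rng, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample approximately-binary gates from the hard concrete distribution."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=log_alpha.shape)
    # Binary-concrete sample with temperature beta
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma   # stretch to the interval (gamma, zeta)
    return np.clip(s_bar, 0.0, 1.0)     # hard clip: exact zeros/ones possible

rng = np.random.default_rng(7)
log_alpha = np.full((4, 4, 2), -4.0)   # strongly negative -> gates mostly closed
z = hard_concrete_gate(log_alpha, rng)

# The gates multiply the core tensor entries; closed gates zero them out
G = rng.standard_normal((4, 4, 2))
G_sparse = z * G
assert z.min() >= 0.0 and z.max() <= 1.0
```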

## 4 Experiments

We conducted an experimental study on common benchmark datasets to gain insight into the RT decomposition and to compare it with the state-of-the-art model ComplEx. Our main goal was to empirically study whether and when we can learn sparsity patterns from the data, and whether sparsity is necessary. We compared dense RT (DRT) decompositions, sparse RT (SRT) decompositions, and ComplEx w.r.t. (1) best prediction performance overall, (2) the relationship between entity embedding size and prediction performance, and (3) the relationship between model size (in terms of effective number of parameters) and prediction performance.

We found that an SRT *can* perform similarly to or better than ComplEx, indicating that it is sometimes possible and even beneficial to learn the sparsity pattern. Likewise, we observed that a DRT *can* outperform both SRT and ComplEx with a similar effective number of parameters and with only a fraction of the entity embedding size. This is not always the case, though: the best model generally depends on the dataset and model size requirements.

### 4.1 Experiment Setup

#### Data and evaluation.

We followed the widely adopted entity ranking evaluation
protocol [BordesUGWY13] on two benchmark datasets: FB15K-237 and
WN18RR [DBLP:conf/emnlp/ToutanovaCPPCG15, dettmers2018conve]. The datasets
are subsets of the larger WN18 and FB15K datasets, which are derived from WordNet and Freebase, respectively [BordesUGWY13]. Since WN18 and FB15K can be modeled well by simple rules [dettmers2018conve, meilicke2018fine], FB15K-237 and WN18RR were constructed to be more challenging. See Tab. 4.1 for key statistics. In *entity ranking*, we rank entities for test queries of the form $(i, k, ?)$ or $(?, k, j)$. We report the *mean reciprocal rank (MRR)* and *HITS@k* in the *filtered* setting, in which predictions that correspond to triples in the training or validation datasets are discarded (so that only new predictions are evaluated).
