Knowledge Graph Completion via Complex Tensor Factorization

Théo Trouillon et al., 02/22/2017

In statistical relational learning, knowledge graph completion deals with automatically understanding the structure of large knowledge graphs---labeled directed graphs---and predicting missing relationships---labeled edges. State-of-the-art embedding models propose different trade-offs between modeling expressiveness, and time and space complexity. We reconcile both expressiveness and complexity through the use of complex-valued embeddings and explore the link between such complex-valued embeddings and unitary diagonalization. We corroborate our approach theoretically and show that all real square matrices---thus all possible relation/adjacency matrices---are the real part of some unitarily diagonalizable matrix. This result opens the door to many other applications of square matrix factorization. Our approach based on complex embeddings is arguably simple, as it only involves a Hermitian dot product, the complex counterpart of the standard dot product between real vectors, whereas other methods resort to increasingly complicated composition functions to increase their expressiveness. The proposed complex embeddings are scalable to large data sets, as they remain linear in both space and time, while consistently outperforming alternative approaches on standard link prediction benchmarks.


1 Introduction

Web-scale knowledge graphs provide a structured representation of world knowledge, with projects such as the Google Knowledge Vault (Dong et al., 2014). They enable a wide range of applications, including recommender systems, question answering and automated personal agents. The incompleteness of these knowledge graphs—also called knowledge bases—has stimulated research into predicting missing entries, a task known as link prediction or knowledge graph completion. The need for high-quality predictions in link prediction applications has progressively made it the main problem in statistical relational learning (Getoor and Taskar, 2007), a research field interested in relational data representation and modeling.

Knowledge graphs were born with the advent of the Semantic Web, pushed by the World Wide Web Consortium (W3C) recommendations. Namely, the Resource Description Framework (RDF) standard, which underlies the data representation of knowledge graphs, provides for the first time a common framework across all connected information systems to share their data under the same paradigm. Since RDF is more expressive than classical relational databases, all existing relational data can be translated into RDF knowledge graphs (Sahoo et al., 2009).

Knowledge graphs express data as a directed graph with labeled edges (relations) between nodes (entities). Natural redundancies between the recorded relations often make it possible to fill in the missing entries of a knowledge graph. As an example, the relation CountryOfBirth might not be recorded for all entities, but it can be inferred if the relation CityOfBirth is known. The goal of link prediction is the automatic discovery of such regularities. However, many relations are non-deterministic: the combination of the two facts IsBornIn(John,Athens) and IsLocatedIn(Athens,Greece) does not always imply the fact HasNationality(John,Greece). Hence, it is natural to handle inference probabilistically, and jointly with other facts involving these relations and entities. To this end, an increasingly popular method is to state the knowledge graph completion task as a 3D binary tensor completion problem, where each tensor slice is the adjacency matrix of one relation in the knowledge graph, and to compute a decomposition of this partially-observed tensor from which its missing entries can be completed.

Factorization models with low-rank embeddings were popularized by the Netflix challenge (Koren et al., 2009). A partially-observed matrix or tensor is decomposed into a product of embedding matrices with much smaller dimensions, resulting in fixed-dimensional vector representations for each entity and relation in the graph, which allow completion of the missing entries. For a given fact $r(s,o)$ in which the subject entity $s$ is linked to the object entity $o$ through the relation $r$, a score for the fact can be recovered as a multilinear product between the embedding vectors of $s$, $r$ and $o$, or through more sophisticated composition functions (Nickel et al., 2016a).

Binary relations in knowledge graphs exhibit various types of patterns: hierarchies and compositions like FatherOf, OlderThan or IsPartOf, with strict/non-strict orders or preorders, and equivalence relations like IsSimilarTo. These characteristics map to different combinations of the following properties: reflexivity/irreflexivity, symmetry/antisymmetry and transitivity. As described in Bordes et al. (2013a), a relational model should (i) be able to learn all combinations of such properties, and (ii) be linear in both time and memory in order to scale to the size of present-day knowledge graphs, and keep up with their growth.

A natural way to handle any possible set of relations is to use the classic canonical polyadic (CP) decomposition (Hitchcock, 1927), which yields two different embeddings for each entity and thus low prediction performances, as shown in Section 5. With unique entity embeddings, multilinear products scale well and can naturally handle both symmetry and (ir)reflexivity of relations, and when combined with an appropriate loss function, dot products can even handle transitivity (Bouchard et al., 2015). However, dealing with antisymmetric—and more generally asymmetric—relations has so far almost always implied superlinear time and space complexity (Nickel et al., 2011; Socher et al., 2013) (see Section 2), making models prone to overfitting and not scalable. Finding the best trade-off between expressiveness, generalization and complexity is the keystone of embedding models.

In this work, we argue that the standard dot product between embeddings can be a very effective composition function, provided that one uses the right representation: instead of using embeddings containing real numbers, we discuss and demonstrate the capabilities of complex embeddings. When using complex vectors, that is vectors with entries in $\mathbb{C}$, the dot product is often called the Hermitian (or sesquilinear) dot product, as it involves the conjugate-transpose of one of the two vectors. As a consequence, the dot product is not symmetric anymore, and facts about one relation can receive different scores depending on the ordering of the entities involved in the fact. In summary, complex embeddings naturally represent arbitrary relations while retaining the efficiency of a dot product, that is, linearity in both space and time complexity.

This paper extends a previously published article (Trouillon et al., 2016). This extended version adds proofs of existence of the proposed model in both single and multi-relational settings, as well as proofs of the non-uniqueness of the complex embeddings for a given relation. Bounds on the rank of the proposed decomposition are also demonstrated and discussed. The learning algorithm is described in more detail, and additional experiments are provided, especially regarding the training time of the models.

The remainder of the paper is organized as follows. We first provide justification and intuition for using complex embeddings in the square matrix case (Section 2), where there is only a single type of relation between entities, and show the existence of the proposed decomposition for all possible relations. The formulation is then extended to a stacked set of square matrices in a third-order tensor to represent multiple relations (Section 3). The stochastic gradient descent algorithm used to learn the model is detailed in Section 4, where we present an equivalent reformulation of the proposed model that involves only real embeddings. This should help practitioners implement our method without requiring the use of complex numbers in their software implementation. We then describe experiments on large-scale public benchmark knowledge graphs in which we empirically show that this representation leads not only to simpler and faster algorithms, but also gives a systematic accuracy improvement over current state-of-the-art alternatives (Section 5). Related work is discussed in Section 6.

2 Relations as the Real Parts of Low-Rank Normal Matrices

We consider in this section a simplified link prediction task with a single relation, and introduce complex embeddings for low-rank matrix factorization.

We will first discuss the desired properties of embedding models, show how this problem relates to the spectral theorems, and discuss the classes of matrices these theorems encompass in the real and in the complex case. We then propose a matrix decomposition that is, to the best of our knowledge, new, along with a proof of its existence for all real square matrices. Finally, we discuss the rank of the proposed decomposition.

2.1 Modeling Relations

Let $\mathcal{E}$ be a set of entities, with $|\mathcal{E}| = n$. The truth of the single relation holding between two entities is represented by a sign value $y_{so} \in \{-1, 1\}$, where 1 represents true facts and -1 false facts, $s$ is the subject entity and $o$ is the object entity. The probability for the relation holding true is given by

$$P(y_{so} = 1) = \sigma(x_{so}) \qquad (1)$$

where $X \in \mathbb{R}^{n \times n}$ is a latent matrix of scores indexed by the subject (rows) and object entities (columns), $Y$ is the partially-observed sign matrix indexed in identical fashion, and $\sigma$ is a suitable sigmoid function. Throughout this paper we used the logistic inverse link function $\sigma(x) = \frac{1}{1 + e^{-x}}$.

2.1.1 Handling Both Asymmetry and Unique Entity Embeddings

In this work we pursue three objectives: finding a generic structure for $X$ that leads to a computationally efficient model, an expressive enough approximation of common relations in real world knowledge graphs, and good generalization performances in practice. Standard matrix factorization approximates $X$ by a matrix product $UV^T$, where $U$ and $V$ are two functionally-independent $n \times K$ matrices, $K$ being the rank of the matrix. Within this formulation it is assumed that entities appearing as subjects are different from entities appearing as objects. In the Netflix challenge (Koren et al., 2009) for example, each row of $U$ corresponds to a user and each row of $V$ corresponds to a movie. This extensively studied type of model is closely related to the singular value decomposition (SVD) and fits well with the case where the matrix $X$ is rectangular.

However, in many knowledge graph completion problems, the same entity can appear as both subject and object, and will then have two different embedding vectors, $u_i$ and $v_i$, depending on whether it appears as the subject or the object of a relation. It seems natural to learn unique embeddings of entities, as initially proposed by Nickel et al. (2011) and Bordes et al. (2011) and since then used systematically in other prominent approaches (Bordes et al., 2013b; Yang et al., 2015; Socher et al., 2013). In the factorization setting, using the same embeddings for left- and right-side factors boils down to a specific case of eigenvalue decomposition: orthogonal diagonalization.

Definition 1

A real square matrix $X \in \mathbb{R}^{n \times n}$ is orthogonally diagonalizable if it can be written as $X = EDE^T$, where $E, D \in \mathbb{R}^{n \times n}$, $D$ is diagonal, and $E$ is orthogonal so that $EE^T = E^TE = I$, where $I$ is the identity matrix.

The spectral theorem for symmetric matrices tells us that a matrix is orthogonally diagonalizable if and only if it is symmetric (Cauchy, 1829). It is therefore often used to approximate covariance matrices, kernel functions and distance or similarity matrices.
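As a minimal numerical illustration of this theorem (not part of the original development; variable names are ours), the following NumPy sketch checks the orthogonal diagonalization of a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random real symmetric matrix.
A = rng.normal(size=(5, 5))
X = A + A.T

# eigh returns real eigenvalues and an orthogonal matrix of eigenvectors.
d, E = np.linalg.eigh(X)

assert np.allclose(E @ np.diag(d) @ E.T, X)   # X = E D E^T
assert np.allclose(E.T @ E, np.eye(5))        # E is orthogonal
```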

However as previously stated, this paper is explicitly interested in problems where matrices—and thus the relation patterns they represent—can also be antisymmetric, or even not have any particular symmetry pattern at all (asymmetry). In order to both use a unique embedding for entities and extend the expressiveness to asymmetric relations, researchers have generalised the notion of dot products to scoring functions, also known as composition functions, that allow more general combinations of embeddings. We briefly recall several examples of scoring functions in Table 1, as well as the extension proposed in this paper.

These models propose different trade-offs between the three essential points:

  • Expressiveness, which is the ability to represent symmetric, antisymmetric and more generally asymmetric relations.

  • Scalability, which means keeping the scoring function linear in both time and space complexity.

  • Generalization, for which having unique entity embeddings is critical.

RESCAL (Nickel et al., 2011) and NTN (Socher et al., 2013) are very expressive, but their scoring functions have quadratic complexity in the rank of the factorization. More recently, the HolE model (Nickel et al., 2016b) proposed a solution with quasi-linear time complexity and linear space complexity. DistMult (Yang et al., 2015) can be seen as a joint orthogonal diagonalization with real embeddings, hence handling only symmetric relations. Conversely, TransE (Bordes et al., 2013b) handles symmetric relations at the price of strong constraints on its embeddings. The canonical polyadic decomposition (CP) (Hitchcock, 1927) generalizes poorly because it uses different embeddings for entities as subjects and as objects.

We reconcile expressiveness, scalability and generalization by going back to the realm of well-studied matrix factorizations, and making use of complex linear algebra, a scarcely used tool in the machine learning community.

Model | Scoring Function | Relation Parameters | Time | Space
CP (Hitchcock, 1927) | $\langle w_r, u_s, v_o \rangle$ | $w_r \in \mathbb{R}^K$ | $O(K)$ | $O(K)$
RESCAL (Nickel et al., 2011) | $e_s^T W_r e_o$ | $W_r \in \mathbb{R}^{K^2}$ | $O(K^2)$ | $O(K^2)$
TransE (Bordes et al., 2013b) | $-\|(e_s + w_r) - e_o\|_p$ | $w_r \in \mathbb{R}^K$ | $O(K)$ | $O(K)$
NTN (Socher et al., 2013) | $u_r^T f(e_s W_r^{[1..D]} e_o + V_r [e_s; e_o] + b_r)$ | $W_r \in \mathbb{R}^{K^2 D},\; b_r \in \mathbb{R}^K$ | $O(K^2 D)$ | $O(K^2 D)$
DistMult (Yang et al., 2015) | $\langle w_r, e_s, e_o \rangle$ | $w_r \in \mathbb{R}^K$ | $O(K)$ | $O(K)$
HolE (Nickel et al., 2016b) | $w_r^T (\mathcal{F}^{-1}[\overline{\mathcal{F}[e_s]} \odot \mathcal{F}[e_o]])$ | $w_r \in \mathbb{R}^K$ | $O(K \log K)$ | $O(K)$
ComplEx (this paper) | $\mathrm{Re}(\langle w_r, e_s, \bar{e}_o \rangle)$ | $w_r \in \mathbb{C}^K$ | $O(K)$ | $O(K)$

Table 1: Scoring functions of state-of-the-art latent factor models for a given fact $r(s, o)$, along with the representation of their relation parameters, and their time and space (memory) complexity. $K$ is the dimensionality of the embeddings. The entity embeddings $e_s$ and $e_o$ of subject and object are in $\mathbb{R}^K$ for each model, except for ComplEx, where $e_s, e_o \in \mathbb{C}^K$. $\bar{x}$ is the complex conjugate of $x$, and $D$ is an additional latent dimension of the NTN model. $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote respectively the Fourier transform and its inverse, $\odot$ is the element-wise product between two vectors, $\mathrm{Re}(\cdot)$ denotes the real part of a complex vector, and $\langle \cdot, \cdot, \cdot \rangle$ denotes the trilinear product.

2.1.2 Decomposition in the Complex Domain

We introduce a new decomposition of real square matrices using unitary diagonalization, the generalization of orthogonal diagonalization to complex matrices. This allows decomposition of arbitrary real square matrices with unique representations of rows and columns.

Let us first recall some notions of complex linear algebra as well as specific cases of diagonalization of real square matrices, before building our proposition upon these results.

A complex-valued vector $x \in \mathbb{C}^K$, with $x = \mathrm{Re}(x) + i\,\mathrm{Im}(x)$, is composed of a real part $\mathrm{Re}(x) \in \mathbb{R}^K$ and an imaginary part $\mathrm{Im}(x) \in \mathbb{R}^K$, where $i$ denotes the square root of $-1$. The conjugate $\bar{x}$ of a complex vector inverts the sign of its imaginary part: $\bar{x} = \mathrm{Re}(x) - i\,\mathrm{Im}(x)$.

Conjugation appears in the usual dot product for complex numbers, called the Hermitian product, or sesquilinear form, which is defined as:

$$\langle u, v \rangle := \bar{u}^T v .$$

A simple way to justify the Hermitian product for composing complex vectors is that it provides a valid topological norm in the induced vector space. For example, $\langle x, x \rangle = 0$ implies $x = 0$, while this is not the case for the bilinear form $x^T x$, as there are many complex vectors $x \neq 0$ for which $x^T x = 0$.

This yields an interesting property of the Hermitian product concerning the order of the involved vectors: $\langle u, v \rangle = \overline{\langle v, u \rangle}$, meaning that the real part of the product is symmetric, while the imaginary part is antisymmetric.
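These properties are easy to verify numerically; the NumPy sketch below (illustrative only, with names of our choosing) checks that the Hermitian self-product is real and non-negative, and that the real and imaginary parts of the product are respectively symmetric and antisymmetric in the two arguments:

```python
import numpy as np

def hermitian(u, v):
    # Hermitian (sesquilinear) product <u, v> = conj(u)^T v
    return np.conj(u) @ v

rng = np.random.default_rng(0)
u = rng.normal(size=3) + 1j * rng.normal(size=3)
v = rng.normal(size=3) + 1j * rng.normal(size=3)

# <u, u> is real and non-negative, so it induces a valid norm,
# unlike the bilinear form u^T u, which can vanish for non-zero u.
assert np.isclose(hermitian(u, u).imag, 0.0) and hermitian(u, u).real > 0

# Re(<u, v>) is symmetric in u and v, Im(<u, v>) is antisymmetric.
assert np.isclose(hermitian(u, v).real, hermitian(v, u).real)
assert np.isclose(hermitian(u, v).imag, -hermitian(v, u).imag)
```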

For matrices, we shall write $X^*$ for the conjugate-transpose $X^* = (\bar{X})^T = \overline{(X^T)}$. The conjugate-transpose is also often written $X^\dagger$ or $X^H$.

Definition 2

A complex square matrix $X \in \mathbb{C}^{n \times n}$ is unitarily diagonalizable if it can be written as $X = EWE^*$, where $E, W \in \mathbb{C}^{n \times n}$, $W$ is diagonal, and $E$ is unitary such that $EE^* = E^*E = I$.

Definition 3

A complex square matrix $X \in \mathbb{C}^{n \times n}$ is normal if it commutes with its conjugate-transpose so that $XX^* = X^*X$.

We can now state the spectral theorem for normal matrices.

Theorem 1 (Spectral theorem for normal matrices, von Neumann (1929))

Let $X \in \mathbb{C}^{n \times n}$ be a complex square matrix. Then $X$ is unitarily diagonalizable if and only if $X$ is normal.

It is easy to check that all real symmetric matrices are normal, and have purely real eigenvectors and eigenvalues. But the set of purely real normal matrices also includes all real antisymmetric matrices (useful to model hierarchical relations such as IsOlder), as well as all real orthogonal matrices (including permutation matrices), and many other matrices that are useful to represent binary relations, such as assignment matrices which represent bipartite graphs. However, far from all matrices expressed as $EWE^*$ are purely real, and Equation (1) requires the scores $X$ to be purely real.

As we only focus on real square matrices in this work, let us summarize all the cases for a real square matrix $X$ and its possible unitary diagonalization $X = EWE^*$, where $E, W \in \mathbb{C}^{n \times n}$, $W$ is diagonal and $E$ is unitary:

  • $X$ is symmetric if and only if $X$ is orthogonally diagonalizable, that is, $E$ and $W$ can be chosen purely real.

  • $X$ is normal and non-symmetric if and only if $X$ is unitarily diagonalizable and $E$ and $W$ cannot both be purely real.

  • $X$ is not normal if and only if $X$ is not unitarily diagonalizable.

We generalize all three cases by showing that, for any $X \in \mathbb{R}^{n \times n}$, there exists a unitary diagonalization in the complex domain, of which the real part equals $X$:

$$X = \mathrm{Re}(EWE^*) . \qquad (2)$$

In other words, the unitary diagonalization is projected onto the real subspace.

Theorem 2

Suppose $X \in \mathbb{R}^{n \times n}$ is a real square matrix. Then there exists a normal matrix $Z \in \mathbb{C}^{n \times n}$ such that $\mathrm{Re}(Z) = X$.

Let $Z = X + iX^T$. Then

$$Z^* = X^T - iX,$$

so that

$$ZZ^* = (X + iX^T)(X^T - iX) = XX^T + X^TX + i(X^TX^T - XX) = (X^T - iX)(X + iX^T) = Z^*Z .$$

Therefore $Z$ is normal. Note that there also exists another normal matrix, $X - iX^T$, whose real part equals $X$.
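The construction used in this proof can be checked numerically. The sketch below is illustrative only; it relies on the fact that, for a normal matrix with distinct eigenvalues (the generic case for a random $X$), NumPy's eig returns numerically near-orthonormal eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, n))     # arbitrary real square matrix

Z = X + 1j * X.T                # candidate of the proof, with Re(Z) = X

# Z commutes with its conjugate-transpose, hence it is normal.
assert np.allclose(Z @ Z.conj().T, Z.conj().T @ Z)

# Being normal, Z is unitarily diagonalizable (Theorem 1): Z = E diag(w) E^*.
w, E = np.linalg.eig(Z)
assert np.allclose(E.conj().T @ E, np.eye(n))          # E is (numerically) unitary
assert np.allclose(E @ np.diag(w) @ E.conj().T, Z)

# The real part of the unitary diagonalization recovers X (Corollary 1).
assert np.allclose(np.real(E @ np.diag(w) @ E.conj().T), X)
```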

Following Theorem 1 and Theorem 2, any real square matrix can be written as the real part of a complex diagonal matrix through a unitary change of basis.

Corollary 1

Suppose $X \in \mathbb{R}^{n \times n}$ is a real square matrix. Then there exist $E, W \in \mathbb{C}^{n \times n}$, where $E$ is unitary and $W$ is diagonal, such that $X = \mathrm{Re}(EWE^*)$.

From Theorem 2, we can write $X = \mathrm{Re}(Z)$, where $Z$ is a normal matrix, and from Theorem 1, $Z = EWE^*$ is unitarily diagonalizable.

Applied to the knowledge graph completion setting, the rows of $E$ here are vectorial representations of the entities corresponding to rows and columns of the relation score matrix $X$. The score for the relation holding true between entities $s$ and $o$ is hence

$$x_{so} = \mathrm{Re}(e_s^T W \bar{e}_o) \qquad (3)$$

where $e_s, e_o \in \mathbb{C}^n$ are the rows of $E$ corresponding to the entities $s$ and $o$, and $W \in \mathbb{C}^{n \times n}$ is diagonal. For a given entity, its subject embedding vector is the complex conjugate of its object embedding vector.
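A minimal sketch of this single-relation scoring function, written with NumPy (illustrative names only), also shows that purely real eigenvalues yield symmetric scores while purely imaginary eigenvalues yield antisymmetric scores:

```python
import numpy as np

def score(e_s, w, e_o):
    # x_{so} = Re(sum_k w_k * e_{sk} * conj(e_{ok})), as in Equation (3)
    return np.real(np.sum(e_s * w * np.conj(e_o)))

rng = np.random.default_rng(0)
K = 4
w   = rng.normal(size=K) + 1j * rng.normal(size=K)   # diagonal of W
e_s = rng.normal(size=K) + 1j * rng.normal(size=K)   # subject embedding
e_o = rng.normal(size=K) + 1j * rng.normal(size=K)   # object embedding

# In general the score is not symmetric: x_{so} != x_{os}.
print(score(e_s, w, e_o), score(e_o, w, e_s))

# Purely real eigenvalues give a symmetric score ...
w_real = w.real + 0j
assert np.isclose(score(e_s, w_real, e_o), score(e_o, w_real, e_s))

# ... and purely imaginary eigenvalues give an antisymmetric score.
w_imag = 1j * w.imag
assert np.isclose(score(e_s, w_imag, e_o), -score(e_o, w_imag, e_s))
```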

To illustrate this difference of expressiveness with respect to real-valued embeddings, let us consider two complex embeddings of dimension 1 with arbitrary values, $e_s, e_o \in \mathbb{C}$, as well as their real-valued, twice-bigger counterparts $e'_s, e'_o \in \mathbb{R}^2$. In the real-valued case, which corresponds to the DistMult model (Yang et al., 2015), the score is $x_{so} = e'^T_s \mathrm{diag}(w') e'_o$ with $w' \in \mathbb{R}^2$. Figure 1 represents the heatmaps of the scores $x_{so}$ and $x_{os}$, as a function of the relation embedding $w \in \mathbb{C}$ in the complex-valued case, and as a function of $w' \in \mathbb{R}^2$ in the real-valued case. In the real-valued case, which is symmetric in the subject and object entities, the scores $x_{so}$ and $x_{os}$ are equal for any value of $w'$. Whereas in the complex-valued case, the variation of $w$ allows $x_{so}$ and $x_{os}$ to be given any desired pair of values.

Figure 1: Left: Scores $x_{so}$ (top) and $x_{os}$ (bottom) for the proposed complex-valued decomposition, plotted as a function of the relation embedding $w \in \mathbb{C}$, for fixed entity embeddings $e_s$ and $e_o$. Right: Scores $x_{so}$ (top) and $x_{os}$ (bottom) for the corresponding real-valued decomposition with the same number of free real-valued parameters (i.e. in twice the dimension), plotted as a function of $w' \in \mathbb{R}^2$, for fixed entity embeddings $e'_s$ and $e'_o$. By varying $w$, the proposed complex-valued decomposition can attribute any pair of scores to $x_{so}$ and $x_{os}$, whereas $x_{so} = x_{os}$ for all $w'$ with the real-valued decomposition.

This decomposition however is non-unique; a simple example of this non-uniqueness is obtained by adding a purely imaginary constant to the eigenvalues. Let $X = \mathrm{Re}(EWE^*)$, where $E$ is unitary and $W$ is diagonal. Then, for any real constant $c \in \mathbb{R}$, we have:

$$\mathrm{Re}(E(W + icI)E^*) = \mathrm{Re}(EWE^*) + \mathrm{Re}(icEE^*) = \mathrm{Re}(EWE^*) = X .$$

In general, there are many other possible pairs of matrices $E$ and $W$ that preserve the real part of the decomposition. In practice, however, this non-uniqueness is not synonymous with poor generalization, as many effective matrix and tensor decomposition methods used in machine learning lead to non-unique solutions (Paatero and Tapper, 1994; Nickel et al., 2011). In this case also, the learned representations prove useful, as shown in the experimental section.

2.2 Low-Rank Decomposition

Addressing knowledge graph completion with data-driven approaches assumes that there is sufficient regularity in the observed data to generalize to unobserved facts. When formulated as a matrix completion problem, as is the case in this section, one way of implementing this hypothesis is to assume that the matrix $X$ has a low rank or approximately low rank. We first discuss the rank of the proposed decomposition, and then introduce the sign-rank and extend the bound developed on the rank to the sign-rank.

2.2.1 Rank Upper Bound

First, we recall one definition of the rank of a matrix (Horn and Johnson, 2012).

Definition 4

The rank of an $m$-by-$n$ complex matrix $X$ is $\mathrm{rank}(X) = K$ if $X$ has exactly $K$ linearly independent columns.

Also note that if $X$ is diagonalizable so that $X = EWE^{-1}$ with $\mathrm{rank}(X) = K$, then $W$ has $K$ non-zero diagonal entries, for some diagonal $W \in \mathbb{C}^{n \times n}$ and some invertible matrix $E \in \mathbb{C}^{n \times n}$. From this it is easy to derive a known subadditive property of the rank:

$$\mathrm{rank}(A + B) \leq \mathrm{rank}(A) + \mathrm{rank}(B) \qquad (4)$$

where $A, B \in \mathbb{C}^{m \times n}$.

We now show that any rank $K$ real square matrix can be reconstructed from a $2K$-dimensional unitary diagonalization.

Corollary 2

Suppose $X \in \mathbb{R}^{n \times n}$ and $\mathrm{rank}(X) = K$. Then there exist $E \in \mathbb{C}^{n \times 2K}$, whose columns form an orthonormal family of $\mathbb{C}^n$, and a diagonal $W \in \mathbb{C}^{2K \times 2K}$, such that $X = \mathrm{Re}(EWE^*)$.

Consider the complex square matrix $Z = X + iX^T$, so that $\mathrm{Re}(Z) = X$.

From Equation (4), $\mathrm{rank}(Z) \leq \mathrm{rank}(X) + \mathrm{rank}(iX^T) = 2K$.

The proof of Theorem 2 shows that $Z$ is normal. Thus $Z = EWE^*$ with $E \in \mathbb{C}^{n \times 2K}$, where the columns of $E$ are orthonormal eigenvectors of $Z$ spanning its column space, and $W \in \mathbb{C}^{2K \times 2K}$ is diagonal.

Since $E$ is not necessarily square, we replace the unitary requirement of Corollary 1 by the requirement that its columns are orthonormal, that is, they form an orthonormal basis of a subspace whose dimension is the smallest dimension of $E$, namely $2K$.

Also, given that such a decomposition always exists in dimension $n$ (Theorem 2), this upper bound is not relevant when $2K \geq n$.

2.2.2 Sign-Rank Upper Bound

Since we encode the truth values of each fact with $\pm 1$, we deal with square sign matrices: $Y \in \{-1, 1\}^{n \times n}$. Sign matrices have an alternative rank definition, the sign-rank.

Definition 5

The sign-rank $\mathrm{rank}_\pm(Y)$ of an $m$-by-$n$ sign matrix $Y$ is the rank of the $m$-by-$n$ real matrix of least rank that has the same sign pattern as $Y$, so that

$$\mathrm{rank}_\pm(Y) = \min_{X \in \mathbb{R}^{m \times n}} \{\, \mathrm{rank}(X) \mid \mathrm{sign}(X) = Y \,\},$$

where the sign function is applied entry-wise.

We define the sign function of $x \in \mathbb{R}$ as

$$\mathrm{sign}(x) = \begin{cases} 1 & \text{if } x \geq 0, \\ -1 & \text{otherwise,} \end{cases}$$

where the value 1 is here arbitrarily assigned to $\mathrm{sign}(0)$ to allow zero entries in $X$, conversely to the stricter usual definition of the sign-rank.

To make generalization possible, we hypothesize that the true matrix $Y$ has a low sign-rank, and thus can be reconstructed as the sign of a low-rank score matrix $X$. The low sign-rank assumption is theoretically justified by the fact that the sign-rank is a natural complexity measure of sign matrices (Linial et al., 2007) and is linked to learnability (Alon et al., 2016); it is empirically confirmed by the wide success of factorization models (Nickel et al., 2016a).

Using Corollary 2, we can now show that any square sign matrix of sign-rank $K$ can be reconstructed from a rank $2K$ unitary diagonalization.

Corollary 3

Suppose $Y \in \{-1, 1\}^{n \times n}$ with $\mathrm{rank}_\pm(Y) = K$. Then there exist $E \in \mathbb{C}^{n \times 2K}$, where the columns of $E$ are orthonormal, and a diagonal $W \in \mathbb{C}^{2K \times 2K}$, such that $Y = \mathrm{sign}(\mathrm{Re}(EWE^*))$.

By definition, if $\mathrm{rank}_\pm(Y) = K$, there exists a real square matrix $X$ such that $\mathrm{sign}(X) = Y$ and $\mathrm{rank}(X) = K$. From Corollary 2, $X = \mathrm{Re}(EWE^*)$, where $E \in \mathbb{C}^{n \times 2K}$ has orthonormal columns and $W \in \mathbb{C}^{2K \times 2K}$ is diagonal.

Previous attempts to approximate the sign-rank in relational learning did not use complex numbers. They showed the existence of compact factorizations under conditions on the sign matrix (Nickel et al., 2014), or only in specific cases (Bouchard et al., 2015). In contrast, our results show that if a square sign matrix has sign-rank $K$, then it can be exactly decomposed through a $2K$-dimensional unitary diagonalization.

Although we can only show the existence of a complex decomposition of rank $2K$ for a matrix with sign-rank $K$, the sign-rank of $Y$ is often much lower than the rank of $Y$; matrices whose sign-rank is known to be close to their rank are in fact hard to exhibit (Alon et al., 2016). For example, the $n \times n$ identity matrix has rank $n$, but its sign-rank is only 3. By swapping its columns $2j$ and $2j-1$ for $j = 1, \ldots, n/2$, the identity matrix corresponds to the relation marriedTo, a relation known to be hard to factorize over the reals (Nickel et al., 2014); since the rank is invariant by row/column permutations, so is the sign-rank. Yet our model can express this relation with a rank of at most 6, for any $n$.

Hence, by enforcing a low rank $K \ll n$ on $EWE^*$, individual relation scores $x_{so}$ between entities $s$ and $o$ can be efficiently predicted, as $x_{so} = \mathrm{Re}(e_s^T W \bar{e}_o)$, where $e_s, e_o \in \mathbb{C}^K$ and $W \in \mathbb{C}^{K \times K}$ is diagonal.

Finding the $K$ that matches the sign-rank of $Y$ corresponds to finding the smallest $K$ that brings the 0–1 loss on $X$ to zero, as link prediction can be seen as binary classification of the facts. In practice, and as classically done in machine learning to avoid this NP-hard problem, we use a continuous surrogate of the 0–1 loss, in this case the logistic loss as described in Section 4, and validate models on different values of $K$, as described in Section 5.

2.2.3 Rank Bound Discussion

Corollaries 2 and 3 use the aforementioned subadditive property of the rank to derive the upper bound. Let us give an example for which the rank of the decomposition needed is strictly greater than $K$, the rank of the decomposed matrix.

Consider the following 2-by-2 sign matrix:

$$Y = \begin{pmatrix} -1 & -1 \\ 1 & 1 \end{pmatrix}.$$

Not only is this matrix not normal, but one can also easily check that there is no real normal 2-by-2 matrix that has the same sign pattern as $Y$. Clearly, $Y$ is a rank 1 matrix since its columns are linearly dependent, hence its sign-rank is also 1. From Corollary 3, we know that there is a normal matrix whose real part has the same sign pattern as $Y$, and whose rank is at most 2.

However, there is no rank 1 unitary diagonalization of which the real part equals $Y$. Otherwise we could find $w, e_1, e_2 \in \mathbb{C}$ such that $\mathrm{Re}(w e_1 \bar{e}_1) < 0$ and $\mathrm{Re}(w e_2 \bar{e}_2) > 0$, that is, $\mathrm{Re}(w)|e_1|^2 < 0$ and $\mathrm{Re}(w)|e_2|^2 > 0$. This is obviously unsatisfiable. This example generalizes to any $n$-by-$n$ square sign matrix that has only $-1$ on its first row and $1$ everywhere else, and is hence rank 1; the same argument holds considering the diagonal entries $y_{11}$ and $y_{22}$.

This example shows that the rank of the unitary diagonalization given by Corollaries 2 and 3 can indeed need to be strictly greater than $K$, the rank or sign-rank of the decomposed matrix. However, there might be other examples for which the addition of an imaginary part could—additionally to making the matrix normal—create some linear dependence between the rows/columns and thus decrease the rank of the matrix, up to a factor of 2.

We summarize this section in three points:

  1. The proposed factorization encompasses all possible score matrices for a single binary relation.

  2. By construction, the factorization is well suited to represent both symmetric and antisymmetric relations.

  3. Relation patterns can be efficiently approximated with a low-rank factorization using complex-valued embeddings.

3 Extension to Multi-Relational Data

Let us now extend the previous discussion to models with multiple relations. Let $\mathcal{R}$ be the set of relations, with $|\mathcal{R}| = P$. We shall now write $X \in \mathbb{R}^{n \times n \times P}$ for the score tensor, $X_r \in \mathbb{R}^{n \times n}$ for the score matrix of the relation $r \in \mathcal{R}$, and $Y \in \{-1, 1\}^{n \times n \times P}$ for the partially-observed sign tensor.

Given one relation $r \in \mathcal{R}$ and two entities $s, o \in \mathcal{E}$, the probability that the fact $r(s,o)$ is true is given by:

$$P(y_{rso} = 1) = \sigma(\phi(r, s, o; \Theta)) \qquad (5)$$

where $\phi$ is the scoring function of the model considered and $\Theta$ denotes the model parameters. We denote the set of all possible facts (or triples) for a knowledge graph by $\mathcal{T} = \mathcal{R} \times \mathcal{E} \times \mathcal{E}$. While the tensor $Y$ as a whole is unknown, we assume that we observe a set of true and false triples $\{y_{rso}\}_{(r,s,o) \in \Omega}$, where $\Omega \subseteq \mathcal{T}$ is the set of observed triples. The goal is to find the probabilities of entries $y_{r's'o'}$ for a set of targeted unobserved triples $(r', s', o') \in \mathcal{T} \setminus \Omega$.

Depending on the scoring function $\phi(r, s, o; \Theta)$ used to model the score tensor $X$, we obtain different models. Examples of scoring functions are given in Table 1.

3.1 Complex Factorization Extension to Tensors

The single-relation model is extended by jointly factorizing all the square matrices of scores into a third-order tensor $X \in \mathbb{R}^{n \times n \times P}$, with a different diagonal matrix $W_r \in \mathbb{C}^{K \times K}$ for each relation $r$, and by sharing the entity embeddings across all relations:

$$\phi(r, s, o; \Theta) = \mathrm{Re}(\langle w_r, e_s, \bar{e}_o \rangle) = \mathrm{Re}\Big(\sum_{k=1}^{K} w_{rk}\, e_{sk}\, \bar{e}_{ok}\Big) \qquad (6)$$

where $K$ is the rank hyperparameter, $e_s, e_o \in \mathbb{C}^K$ are the rows in $E \in \mathbb{C}^{n \times K}$ corresponding to the entities $s$ and $o$, $w_r \in \mathbb{C}^K$ is a complex vector containing the diagonal of $W_r$, and $\langle a, b, c \rangle := \sum_k a_k b_k c_k$ is the component-wise multilinear dot product. (This is not the Hermitian extension of the multilinear dot product, as there appears to be no standard definition of the Hermitian multilinear product in the linear algebra literature.) For this scoring function, the set of parameters $\Theta$ is $\{e_i \in \mathbb{C}^K, w_r \in \mathbb{C}^K : i \in \mathcal{E}, r \in \mathcal{R}\}$. This resembles the real part of a complex matrix decomposition as in the single-relation case discussed above. However, we now have a different vector of eigenvalues for every relation. Expanding the real part of this product gives:

$$\mathrm{Re}(\langle w_r, e_s, \bar{e}_o \rangle) = \langle \mathrm{Re}(w_r), \mathrm{Re}(e_s), \mathrm{Re}(e_o) \rangle + \langle \mathrm{Re}(w_r), \mathrm{Im}(e_s), \mathrm{Im}(e_o) \rangle + \langle \mathrm{Im}(w_r), \mathrm{Re}(e_s), \mathrm{Im}(e_o) \rangle - \langle \mathrm{Im}(w_r), \mathrm{Im}(e_s), \mathrm{Re}(e_o) \rangle \qquad (7)$$
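The equivalence between the complex form of Equation (6) and the expanded real form of Equation (7) can be checked with a short NumPy sketch (illustrative function and variable names):

```python
import numpy as np

def phi_complex(w_r, e_s, e_o):
    # Equation (6): Re(<w_r, e_s, conj(e_o)>)
    return np.real(np.sum(w_r * e_s * np.conj(e_o)))

def phi_real_view(w_r, e_s, e_o):
    # Equation (7): the same score written with real vectors only
    return (np.sum(w_r.real * e_s.real * e_o.real)
            + np.sum(w_r.real * e_s.imag * e_o.imag)
            + np.sum(w_r.imag * e_s.real * e_o.imag)
            - np.sum(w_r.imag * e_s.imag * e_o.real))

rng = np.random.default_rng(0)
K = 8
w_r, e_s, e_o = (rng.normal(size=K) + 1j * rng.normal(size=K) for _ in range(3))

assert np.isclose(phi_complex(w_r, e_s, e_o), phi_real_view(w_r, e_s, e_o))
```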

These equations provide two interesting views of the model:

  • Changing the representation: with real embeddings, Equation (6) would correspond to DistMult (see Table 1); the complex conjugate of the object-entity embedding is what allows it to handle asymmetry.

  • Changing the scoring function: Equation (7) only involves real vectors corresponding to the real and imaginary parts of the embeddings and relations.

By separating the real and imaginary parts of the relation embedding $w_r$ as shown in Equation (7), it is apparent that these parts naturally act as weights on each latent dimension: $\mathrm{Re}(w_r)$ over the real part of $e_s \odot \bar{e}_o$, which is symmetric, and $\mathrm{Im}(w_r)$ over the imaginary part of $e_s \odot \bar{e}_o$, which is antisymmetric.

Indeed, the decomposition of each score matrix $X_r$, for each $r \in \mathcal{R}$, can be written as the sum of a symmetric matrix and an antisymmetric matrix. To see this, let us rewrite the decomposition of each score matrix in matrix notation. We write the real part of matrices with primes, $E' = \mathrm{Re}(E)$, and imaginary parts with double primes, $E'' = \mathrm{Im}(E)$:

$$X_r = \mathrm{Re}(E W_r E^*) = \big(E' W'_r E'^T + E'' W'_r E''^T\big) + \big(E' W''_r E''^T - E'' W''_r E'^T\big) \qquad (8)$$

It is trivial to check that the matrix $E' W'_r E'^T + E'' W'_r E''^T$ is symmetric and that the matrix $E' W''_r E''^T - E'' W''_r E'^T$ is antisymmetric. Hence this model is well suited to model jointly symmetric and antisymmetric relations between pairs of entities, while still using the same entity representations for subjects and objects. When learning, $W''_r$ simply needs to collapse to zero for symmetric relations $r$, and $W'_r$ to zero for antisymmetric relations $r$, as $X_r$ is indeed symmetric when $W_r$ is purely real, and antisymmetric when $W_r$ is purely imaginary.
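The following NumPy sketch (illustrative only) verifies this decomposition of $\mathrm{Re}(EW_rE^*)$ into a symmetric part weighted by $\mathrm{Re}(w_r)$ and an antisymmetric part weighted by $\mathrm{Im}(w_r)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5, 3
E = rng.normal(size=(n, K)) + 1j * rng.normal(size=(n, K))   # entity embeddings
w = rng.normal(size=K) + 1j * rng.normal(size=K)              # relation embedding

Ep, Epp = E.real, E.imag                     # E' and E''
Wp, Wpp = np.diag(w.real), np.diag(w.imag)   # W'_r and W''_r

X_r = np.real(E @ np.diag(w) @ E.conj().T)

sym = Ep @ Wp @ Ep.T + Epp @ Wp @ Epp.T      # symmetric part, weighted by Re(w_r)
asym = Ep @ Wpp @ Epp.T - Epp @ Wpp @ Ep.T   # antisymmetric part, weighted by Im(w_r)

assert np.allclose(X_r, sym + asym)
assert np.allclose(sym, sym.T)
assert np.allclose(asym, -asym.T)
```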

From a geometrical point of view, each relation embedding $w_r$ is an anisotropic scaling of the basis defined by the entity embeddings $E$, followed by a projection onto the real subspace.

3.2 Existence of the Tensor Factorization

Let us first discuss the existence of the multi-relational model when the rank of the decomposition $K \leq n$, which relates to simultaneous unitary diagonalization.

Definition 6

A family of matrices $X_1, \ldots, X_P \in \mathbb{C}^{n \times n}$ is simultaneously unitarily diagonalizable if there is a single unitary matrix $E \in \mathbb{C}^{n \times n}$ such that $X_r = E W_r E^*$ for all $r$ in $\{1, \ldots, P\}$, where the $W_r \in \mathbb{C}^{n \times n}$ are diagonal.

Definition 7

A family of normal matrices $X_1, \ldots, X_P \in \mathbb{C}^{n \times n}$ is a commuting family of normal matrices if $X_i X_j = X_j X_i$, for all $i, j$ in $\{1, \ldots, P\}$.

Theorem 3 (see Horn and Johnson (2012))

Suppose $\mathcal{F}$ is the family of matrices $X_1, \ldots, X_P \in \mathbb{C}^{n \times n}$. Then $\mathcal{F}$ is a commuting family of normal matrices if and only if $\mathcal{F}$ is simultaneously unitarily diagonalizable.

To apply Theorem 3 to the proposed factorization, we would have to make the hypothesis that the relation score matrices $X_r$ form a commuting family of normal matrices, which is too strong a hypothesis. Actually, the model is slightly different since we take only the real part of the tensor factorization. In the single-relation case, taking only the real part of the decomposition rids us of the normality requirement of Theorem 1 for the decomposition to exist, as shown in Theorem 2.

In the multiple-relation case, it is an open question whether taking the real part of the simultaneous unitary diagonalization will enable us to decompose families of arbitrary real square matrices—that is, with a single unitary matrix $E$ that has at most $n$ columns. Though it seems unlikely, we have not yet found a counter-example.

However, by letting the rank $K$ of the tensor factorization be greater than $n$, we can show that the proposed tensor decomposition exists for families of arbitrary real square matrices, by simply concatenating the decompositions given by Theorem 2 for each real square matrix $X_r$.

Theorem 4

Suppose $X_1, \ldots, X_P \in \mathbb{R}^{n \times n}$. Then there exist $E \in \mathbb{C}^{n \times nP}$ and diagonal matrices $W_1, \ldots, W_P \in \mathbb{C}^{nP \times nP}$, such that $X_r = \mathrm{Re}(E W_r E^*)$ for all $r$ in $\{1, \ldots, P\}$.

From Theorem 2 we have $X_r = \mathrm{Re}(E_r D_r E_r^*)$, where each $D_r \in \mathbb{C}^{n \times n}$ is diagonal and each $E_r \in \mathbb{C}^{n \times n}$ is unitary, for all $r$ in $\{1, \ldots, P\}$.

Let $E = [E_1 \cdots E_P] \in \mathbb{C}^{n \times nP}$, and

$$W_r = \mathrm{diag}(0, \ldots, 0, D_r, 0, \ldots, 0) \in \mathbb{C}^{nP \times nP},$$

where $D_r$ occupies the $r$-th $n \times n$ diagonal block and $0$ is the $n \times n$ zero matrix. Therefore $\mathrm{Re}(E W_r E^*) = \mathrm{Re}(E_r D_r E_r^*) = X_r$ for all $r$ in $\{1, \ldots, P\}$.
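The block construction of this proof can be reproduced numerically. The sketch below is illustrative only; it uses eig on the normal matrices $X_r + iX_r^T$ (numerically near-unitary eigenvectors in the generic case) and concatenates the per-relation diagonalizations into a shared entity factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n, P = 4, 3
Xs = [rng.normal(size=(n, n)) for _ in range(P)]   # arbitrary real relation matrices

Es, ws = [], []
for X in Xs:
    Z = X + 1j * X.T                 # normal matrix with Re(Z) = X (Theorem 2)
    w_r, E_r = np.linalg.eig(Z)      # numerically near-unitary for a normal Z
    Es.append(E_r)
    ws.append(w_r)

E = np.concatenate(Es, axis=1)       # shared entity factor of size n x nP

for r, X in enumerate(Xs):
    # Eigenvalues of relation r, zero-padded outside its own block.
    w = np.zeros(n * P, dtype=complex)
    w[r * n:(r + 1) * n] = ws[r]
    assert np.allclose(np.real(E @ np.diag(w) @ E.conj().T), X)
```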

By construction, the rank of the decomposition is at most $nP$. When $P \leq n$, this bound actually matches the general upper bound $\min(nP, n^2)$ on the rank of the canonical polyadic (CP) decomposition (Hitchcock, 1927; Kruskal, 1989). Since $P$ corresponds to the number of relations and $n$ to the number of entities, $P$ is always smaller than $n$ in real world knowledge graphs, hence the bound holds in practice.

Though when it comes to relational learning, we might expect the actual rank required to be much lower than $nP$, for two reasons. The first one, as discussed above, is that we are dealing with sign tensors, hence the rank of the matrices $\mathrm{Re}(E W_r E^*)$ need only match the sign-rank of the partially-observed matrices $Y_r$. The second one is that the matrices $X_r$ are related to each other, as they all represent the same entities in different relations, and thus benefit from sharing latent dimensions, as opposed to the construction exposed in the proof of Theorem 4, where the dimensions of the other relations are canceled out. In practice, the rank $K$ needed to generalize well is indeed much lower than $nP$, as we show experimentally in Figure 5.

Also, note that with the construction of the proof of Theorem 4, the matrix $E$ no longer has orthonormal columns. However, the unitarity constraints in the matrix case serve only the proof of existence, which exhibits just one solution among the infinitely many of the same rank. In practice, imposing orthonormality is essentially a numerical convenience for the decomposition of dense matrices, for example through iterative methods (Saad, 1992). When it comes to matrix and tensor completion, and thus generalization, imposing such constraints is more of a numerical hassle than anything else, especially for gradient methods. As there is no apparent link between orthonormality and generalization properties, we did not impose these constraints when learning the model in the following experiments.

4 Algorithm

Algorithm 1 describes stochastic gradient descent (SGD) to learn the proposed multi-relational model with the AdaGrad learning-rate updates (Duchi et al., 2011). We refer to the proposed model as ComplEx, for Complex Embeddings. We present a version of the algorithm that uses only real-valued vectors, in order to facilitate its implementation. To do so, we use separate real-valued representations of the real and imaginary parts of the embeddings.

These real and imaginary part vectors are initialized by sampling from a zero-mean normal distribution with unit variance. If the training set $\Omega$ contains only positive triples, negatives are generated for each batch using the local closed-world assumption, as in Bordes et al. (2013b). That is, for each triple, we randomly change either the subject or the object, to form a negative example. In this case the parameter $\eta$ sets the number of negative triples to generate for each positive triple. Collision with positive triples in $\Omega$ is not checked, as it occurs rarely in real world knowledge graphs since they are largely sparse, and checking may also be computationally expensive.
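A minimal sketch of this negative sampling scheme, using hypothetical entity and relation names for illustration, could look as follows:

```python
import random

def sample_negatives(triple, entities, eta, rng=random):
    """Local closed-world negative sampling: corrupt the subject or the
    object of a positive triple, eta times."""
    s, r, o = triple
    negatives = []
    for _ in range(eta):
        e = rng.choice(entities)
        if rng.random() < 0.5:
            negatives.append((e, r, o))   # corrupt the subject
        else:
            negatives.append((s, r, e))   # corrupt the object
    return negatives

# Hypothetical entities and triple, for illustration only.
entities = ["john", "athens", "greece", "maria"]
print(sample_negatives(("john", "CityOfBirth", "athens"), entities, eta=2))
```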

Squared gradients are accumulated to compute the AdaGrad learning rates, then the parameters are updated. At a fixed interval of iterations, the parameters are evaluated over the evaluation set (the evaluate_AP_or_MRR function in Algorithm 1). If the data set contains both positive and negative examples, average precision (AP) is used to evaluate the model. If the data set contains only positives, then mean reciprocal rank (MRR) is used, as average precision cannot be computed without true negatives. The optimization process is stopped when the measure considered decreases compared to the last evaluation (early stopping).

Bern(0.5) denotes the Bernoulli distribution with parameter 0.5, the one_random_sample function samples one entity uniformly from the set of all entities $\mathcal{E}$, and the sample_batch_of_size_b function samples $b$ true and false triples uniformly at random from the training set $\Omega$.

For a given embedding size $K$, let us rewrite Equation (7), by denoting the real part of embeddings with primes and the imaginary part with double primes: $e'_i = \mathrm{Re}(e_i)$, $e''_i = \mathrm{Im}(e_i)$, $w'_r = \mathrm{Re}(w_r)$, $w''_r = \mathrm{Im}(w_r)$. The set of parameters is $\Theta = \{e'_i, e''_i, w'_r, w''_r \in \mathbb{R}^K : i \in \mathcal{E}, r \in \mathcal{R}\}$, and the scoring function involves only real vectors:

$$\phi(r, s, o; \Theta) = \langle w'_r, e'_s, e'_o \rangle + \langle w'_r, e''_s, e''_o \rangle + \langle w''_r, e'_s, e''_o \rangle - \langle w''_r, e''_s, e'_o \rangle \qquad (9)$$

where each entity and each relation has two real embeddings.

Gradients are now easy to write:

$$\begin{aligned}
\nabla_{e'_s}\phi &= w'_r \odot e'_o + w''_r \odot e''_o, &
\nabla_{e''_s}\phi &= w'_r \odot e''_o - w''_r \odot e'_o, \\
\nabla_{e'_o}\phi &= w'_r \odot e'_s - w''_r \odot e''_s, &
\nabla_{e''_o}\phi &= w'_r \odot e''_s + w''_r \odot e'_s, \\
\nabla_{w'_r}\phi &= e'_s \odot e'_o + e''_s \odot e''_o, &
\nabla_{w''_r}\phi &= e'_s \odot e''_o - e''_s \odot e'_o,
\end{aligned}$$

where $\odot$ is the element-wise (Hadamard) product.
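These gradients can be checked against finite differences; the sketch below (illustrative function and variable names, not the authors' implementation) computes Equation (9) and its partial derivatives with NumPy:

```python
import numpy as np

def phi(wr, wi, sr, si, orr, oi):
    # Real-valued scoring function of Equation (9)
    return (np.sum(wr * sr * orr) + np.sum(wr * si * oi)
            + np.sum(wi * sr * oi) - np.sum(wi * si * orr))

def phi_grads(wr, wi, sr, si, orr, oi):
    # Partial derivatives of phi with respect to each real vector
    return {
        "w_r'": sr * orr + si * oi,
        "w_r''": sr * oi - si * orr,
        "e_s'": wr * orr + wi * oi,
        "e_s''": wr * oi - wi * orr,
        "e_o'": wr * sr - wi * si,
        "e_o''": wr * si + wi * sr,
    }

# Finite-difference check of the gradient with respect to e_s'.
rng = np.random.default_rng(0)
wr, wi, sr, si, orr, oi = (rng.normal(size=4) for _ in range(6))
eps = 1e-6
numeric = np.array([(phi(wr, wi, sr + eps * d, si, orr, oi)
                     - phi(wr, wi, sr - eps * d, si, orr, oi)) / (2 * eps)
                    for d in np.eye(4)])
assert np.allclose(numeric, phi_grads(wr, wi, sr, si, orr, oi)["e_s'"], atol=1e-5)
```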

We optimized the negative log-likelihood of the logistic model described in Equation (5), with $L^2$ regularization on the parameters $\Theta$:

$$\min_{\Theta} \; \sum_{(r,s,o) \in \Omega} \log\big(1 + \exp(-y_{rso}\,\phi(r, s, o; \Theta))\big) + \lambda \|\Theta\|_2^2 \qquad (10)$$

where $\lambda$ is the regularization parameter.
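A direct NumPy sketch of this regularized negative log-likelihood (illustrative function name; it assumes precomputed scores for the observed triples) is:

```python
import numpy as np

def regularized_nll(scores, labels, params, lam):
    """Equation (10): logistic negative log-likelihood plus L2 regularization.

    scores: phi(r, s, o; Theta) for the observed triples
    labels: truth values y_{rso} in {-1, +1}
    params: list of real parameter vectors (real and imaginary parts)
    lam:    regularization parameter lambda
    """
    data_term = np.sum(np.logaddexp(0.0, -labels * scores))  # log(1 + exp(-y*phi))
    reg_term = lam * sum(np.sum(p ** 2) for p in params)
    return data_term + reg_term

# Tiny illustrative call with made-up values.
scores = np.array([2.1, -0.3, 0.8])
labels = np.array([1, -1, -1])
params = [np.array([0.1, -0.2]), np.array([0.3, 0.05])]
print(regularized_nll(scores, labels, params, lam=0.01))
```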

To handle the regularization, note that using separate representations for the real and imaginary parts does not change anything, as the squared $L^2$-norm of a complex vector $v \in \mathbb{C}^K$ is the sum of the squared moduli of its entries:

$$\|v\|_2^2 = \sum_{k=1}^{K} |v_k|^2 = \sum_{k=1}^{K} \big(\mathrm{Re}(v_k)^2 + \mathrm{Im}(v_k)^2\big) = \|\mathrm{Re}(v)\|_2^2 + \|\mathrm{Im}(v)\|_2^2,$$

which is actually the sum of the squared $L^2$-norms of the vectors of the real and imaginary parts.

We can finally write the gradient of the loss with respect to a real embedding vector $v \in \Theta$, for one triple $(r, s, o)$ and its truth value $y_{rso}$:

$$\nabla_{v} \mathcal{L} = -y_{rso}\,\sigma\big(-y_{rso}\,\phi(r, s, o; \Theta)\big)\,\nabla_{v}\phi(r, s, o; \Theta) + 2\lambda v \qquad (11)$$
Input: Training set $\Omega$, validation set $\Omega_v$, learning rate $\alpha$, rank $K$, regularization factor $\lambda$, negative ratio $\eta$, batch size $b$, maximum iteration $m$, validate every $v$ iterations, AdaGrad regularizer $\epsilon$.
Output: Embeddings $e'_i$, $e''_i$, $w'_r$, $w''_r$ for each $i \in \mathcal{E}$ and $r \in \mathcal{R}$.
   $e'_i \sim \mathcal{N}(0, I_K)$, for each $i \in \mathcal{E}$
   $e''_i \sim \mathcal{N}(0, I_K)$, for each $i \in \mathcal{E}$
   $w'_r \sim \mathcal{N}(0, I_K)$, for each $r \in \mathcal{R}$
   $w''_r \sim \mathcal{N}(0, I_K)$, for each $r \in \mathcal{R}$
   previous_score $\leftarrow -\infty$;  $g_\theta \leftarrow 0$ for each parameter vector $\theta$
   for $i = 1, \ldots, m$ do
      for $j = 1, \ldots, \lceil |\Omega| / b \rceil$ do
         $\Omega_b \leftarrow$ sample_batch_of_size_b($\Omega$, $b$)
         // Negative sampling:
         $\Omega_{neg} \leftarrow \emptyset$
         for $(r(s, o), y)$ in $\Omega_b$ do
            for $k = 1, \ldots, \eta$ do
               $e \leftarrow$ one_random_sample($\mathcal{E}$)
               if Bern(0.5) then
                  $\Omega_{neg} \leftarrow \Omega_{neg} \cup \{(r(e, o), -1)\}$
               else
                  $\Omega_{neg} \leftarrow \Omega_{neg} \cup \{(r(s, e), -1)\}$
               end if
            end for
         end for
         $\Omega_b \leftarrow \Omega_b \cup \Omega_{neg}$
         for $(r(s, o), y)$ in $\Omega_b$ do
            for $\theta$ in $\{e'_s, e''_s, e'_o, e''_o, w'_r, w''_r\}$ do
               // AdaGrad updates:
               $g_\theta \leftarrow g_\theta + (\nabla_\theta \mathcal{L})^2$
               // Gradient updates:
               $\theta \leftarrow \theta - \frac{\alpha}{\sqrt{g_\theta} + \epsilon} \nabla_\theta \mathcal{L}$
            end for
         end for
      end for
      // Early stopping
      if $i \bmod v = 0$ then
         current_score $\leftarrow$ evaluate_AP_or_MRR($\Omega_v$)
         if current_score $\leq$ previous_score then
            break
         end if
         previous_score $\leftarrow$ current_score
      end if
   end for
   return $e'_i$, $e''_i$, $w'_r$, $w''_r$
Algorithm 1: Stochastic gradient descent with AdaGrad for the ComplEx model

5 Experiments

We evaluated the method proposed in this paper on both synthetic and real data sets. The synthetic data set contains both symmetric and antisymmetric relations, whereas the real data sets are standard link prediction benchmarks based on real knowledge graphs.

We compared ComplEx to state-of-the-art models, namely TransE (Bordes et al., 2013b), DistMult (Yang et al., 2015), RESCAL (Nickel et al., 2011) and also to the canonical polyadic decomposition (CP) (Hitchcock, 1927), to emphasize empirically the importance of learning unique embeddings for entities. For experimental fairness, we reimplemented these models within the same framework as the ComplEx model, using a Theano-based SGD implementation, downhill (https://github.com/lmjohns3/downhill) (Bergstra et al., 2010).

For the TransE model, results were obtained with its original max-margin loss, as it turned out to yield better results for this model only. To use this max-margin loss on data sets with observed negatives (Sections 5.1 and 5.2), positive triples were replicated when necessary to match the number of negative triples, as described in Garcia-Duran et al. (2016). All other models are trained with the negative log-likelihood of the logistic model (Equation (10)). In all the following experiments we used a fixed maximum number of iterations, a fixed batch size, and validated the models for early stopping at a fixed interval of iterations.

5.1 Synthetic Task

To assess our claim that ComplEx can accurately model jointly symmetry and antisymmetry, we randomly generated a knowledge graph of two relations and 30 entities. One relation is entirely symmetric, while the other is completely antisymmetric. This data set corresponds to a $2 \times 30 \times 30$ tensor. Figure 2 shows a part of this randomly generated tensor, with a symmetric slice and an antisymmetric slice, decomposed into training, validation and test sets. To ensure that all test values are predictable, the upper triangular parts of the matrices are always kept in the training set, and the diagonals are unobserved. We conducted a 5-fold cross-validation on the lower-triangular matrices, using the upper-triangular parts plus 3 folds for training, one fold for validation and one fold for testing. Each training set contains 1392 observed triples, whereas validation and test sets contain 174 triples each.

Figure 3 shows the best cross-validated average precision (area under the precision-recall curve) for different factorization models of ranks ranging up to 50. The regul