1 Introduction
Web-scale knowledge graphs provide a structured representation of world knowledge, with projects such as the Google Knowledge Vault (Dong et al., 2014). They enable a wide range of applications including recommender systems, question answering and automated personal agents. The incompleteness of these knowledge graphs (also called knowledge bases) has stimulated research into predicting missing entries, a task known as link prediction or knowledge graph completion. The need for high-quality predictions in link prediction applications has progressively made it the main problem in statistical relational learning (Getoor and Taskar, 2007), a research field concerned with relational data representation and modeling.
Knowledge graphs were born with the advent of the Semantic Web, pushed by the World Wide Web Consortium (W3C) recommendations. Namely, the Resource Description Framework (RDF) standard, which underlies knowledge graphs' data representation, provides for the first time a common framework across all connected information systems to share their data under the same paradigm. Because RDF is more expressive than classical relational databases, all existing relational data can be translated into RDF knowledge graphs (Sahoo et al., 2009).
Knowledge graphs express data as a directed graph with labeled edges (relations) between nodes (entities). Natural redundancies between the recorded relations often make it possible to fill in the missing entries of a knowledge graph. As an example, the relation CountryOfBirth may not be recorded for all entities, but it can be inferred if the relation CityOfBirth is known. The goal of link prediction is the automatic discovery of such regularities. However, many relations are nondeterministic: the combination of the two facts IsBornIn(John,Athens) and IsLocatedIn(Athens,Greece) does not always imply the fact HasNationality(John,Greece). Hence, it is natural to handle inference probabilistically, and jointly with other facts involving these relations and entities. To this end, an increasingly popular method is to state the knowledge graph completion task as a 3D binary tensor completion problem, where each tensor slice is the adjacency matrix of one relation in the knowledge graph, and to compute a decomposition of this partially observed tensor from which its missing entries can be completed.
Factorization models with low-rank embeddings were popularized by the Netflix challenge (Koren et al., 2009). A partially observed matrix or tensor is decomposed into a product of embedding matrices with much smaller dimensions, resulting in fixed-dimensional vector representations for each entity and relation in the graph, which allow completion of the missing entries. For a given fact $r(s,o)$ in which the subject entity $s$ is linked to the object entity $o$ through the relation $r$, a score for the fact can be recovered as a multilinear product between the embedding vectors of $s$, $r$ and $o$, or through more sophisticated composition functions (Nickel et al., 2016a).
Binary relations in knowledge graphs exhibit various types of patterns: hierarchies and compositions like FatherOf, OlderThan or IsPartOf, with strict/non-strict orders or preorders, and equivalence relations like IsSimilarTo. These characteristics map to different combinations of the following properties: reflexivity/irreflexivity, symmetry/antisymmetry and transitivity. As described in Bordes et al. (2013a), a relational model should (i) be able to learn all combinations of such properties, and (ii) be linear in both time and memory in order to scale to the size of present-day knowledge graphs, and keep up with their growth.
A natural way to handle any possible set of relations is to use the classic canonical polyadic (CP) decomposition (Hitchcock, 1927), which yields two different embeddings for each entity and thus low prediction performance, as shown in Section 5. With unique entity embeddings, multilinear products scale well and can naturally handle both symmetry and (ir)reflexivity of relations, and when combined with an appropriate loss function, dot products can even handle transitivity (Bouchard et al., 2015). However, dealing with antisymmetric (and more generally asymmetric) relations has so far almost always implied superlinear time and space complexity (Nickel et al., 2011; Socher et al., 2013) (see Section 2), making models prone to overfitting and not scalable. Finding the best trade-off between expressiveness, generalization and complexity is the keystone of embedding models.
In this work, we argue that the standard dot product between embeddings can be a very effective composition function, provided that one uses the right representation: instead of using embeddings containing real numbers, we discuss and demonstrate the capabilities of complex embeddings. When using complex vectors, that is vectors with entries in $\mathbb{C}$, the dot product is often called the Hermitian (or sesquilinear) dot product, as it involves the conjugate-transpose of one of the two vectors. As a consequence, the dot product is not symmetric any more, and facts about one relation can receive different scores depending on the ordering of the entities involved in the fact. In summary, complex embeddings naturally represent arbitrary relations while retaining the efficiency of a dot product, that is linearity in both space and time complexity.
This paper extends a previously published article (Trouillon et al., 2016). This extended version adds proofs of existence of the proposed model in both single and multi-relational settings, as well as proofs of the non-uniqueness of the complex embeddings for a given relation. Bounds on the rank of the proposed decomposition are also demonstrated and discussed. The learning algorithm is described in more detail, and more experiments are provided, especially regarding the training time of the models.
The remainder of the paper is organized as follows. We first provide justification and intuition for using complex embeddings in the square matrix case (Section 2), where there is only a single type of relation between entities, and show the existence of the proposed decomposition for all possible relations. The formulation is then extended to a stacked set of square matrices in a third-order tensor to represent multiple relations (Section 3). The stochastic gradient descent algorithm used to learn the model is detailed in Section 4, where we present an equivalent reformulation of the proposed model that involves only real embeddings. This should help practitioners when implementing our method, without requiring the use of complex numbers in their software implementation. We then describe experiments on large-scale public benchmark knowledge graphs in which we empirically show that this representation not only leads to simpler and faster algorithms, but also gives a systematic accuracy improvement over current state-of-the-art alternatives (Section 5). Related work is discussed in Section 6.
2 Relations as the Real Parts of Low-Rank Normal Matrices
We consider in this section a simplified link prediction task with a single relation, and introduce complex embeddings for low-rank matrix factorization.
We will first discuss the desired properties of embedding models, show how this problem relates to the spectral theorems, and discuss the classes of matrices these theorems encompass in the real and in the complex case. We then propose a new matrix decomposition—to the best of our knowledge—and a proof of its existence for all real square matrices. Finally we discuss the rank of the proposed decomposition.
2.1 Modeling Relations
Let $\mathcal{E}$ be a set of entities, with $|\mathcal{E}| = n$. The truth of the single relation holding between two entities is represented by a sign value $y_{so} \in \{-1, 1\}$, where $1$ represents true facts and $-1$ false facts, $s \in \mathcal{E}$ is the subject entity and $o \in \mathcal{E}$ is the object entity. The probability for the relation holding true is given by
$$P(y_{so} = 1) = \sigma(x_{so}) \tag{1}$$
where $X \in \mathbb{R}^{n \times n}$ is a latent matrix of scores indexed by the subject (rows) and object entities (columns), $Y$ is a partially observed sign matrix indexed in identical fashion, and $\sigma$ is a suitable sigmoid function. Throughout this paper we used the logistic inverse link function $\sigma(x) = \frac{1}{1 + e^{-x}}$.
2.1.1 Handling Both Asymmetry and Unique Entity Embeddings
In this work we pursue three objectives: finding a generic structure for $X$ that leads to a computationally efficient model, an expressive enough approximation of common relations in real-world knowledge graphs, and good generalization performance in practice. Standard matrix factorization approximates $X$ by a matrix product $UV^\top$, where $U$ and $V$ are two functionally independent $n \times K$ matrices, $K$ being the rank of the matrix. Within this formulation it is assumed that entities appearing as subjects are different from entities appearing as objects. In the Netflix challenge (Koren et al., 2009) for example, each row $u_i$ corresponds to the user $i$ and each column $v_j$ corresponds to the movie $j$. This extensively studied type of model is closely related to the singular value decomposition (SVD) and fits well with the case where the matrix $X$ is rectangular.
However, in many knowledge graph completion problems, the same entity $i$ can appear as both subject and object, and will then have two different embedding vectors, $u_i$ and $v_i$, depending on whether it appears as subject or object of a relation. It seems natural to learn unique embeddings of entities, as initially proposed by Nickel et al. (2011) and Bordes et al. (2011) and since then used systematically in other prominent approaches (Bordes et al., 2013b; Yang et al., 2015; Socher et al., 2013). In the factorization setting, using the same embeddings for left- and right-side factors boils down to a specific case of eigenvalue decomposition: orthogonal diagonalization.
Definition 1
A real square matrix $X \in \mathbb{R}^{n \times n}$ is orthogonally diagonalizable if it can be written as $X = UDU^\top$, where $U, D \in \mathbb{R}^{n \times n}$, $D$ is diagonal, and $U$ is orthogonal so that $UU^\top = U^\top U = I$, where $I$ is the identity matrix.
The spectral theorem for symmetric matrices tells us that a matrix is orthogonally diagonalizable if and only if it is symmetric (Cauchy, 1829). It is therefore often used to approximate covariance matrices, kernel functions and distance or similarity matrices.
However as previously stated, this paper is explicitly interested in problems where matrices—and thus the relation patterns they represent—can also be antisymmetric, or even not have any particular symmetry pattern at all (asymmetry). In order to both use a unique embedding for entities and extend the expressiveness to asymmetric relations, researchers have generalised the notion of dot products to scoring functions, also known as composition functions, that allow more general combinations of embeddings. We briefly recall several examples of scoring functions in Table 1, as well as the extension proposed in this paper.
These models propose different tradeoffs between the three essential points:

Expressiveness, which is the ability to represent symmetric, antisymmetric and more generally asymmetric relations.

Scalability, which means keeping a scoring function with linear time and space complexity.

Generalization, for which having unique entity embeddings is critical.
RESCAL (Nickel et al., 2011) and NTN (Socher et al., 2013) are very expressive, but their scoring functions have quadratic complexity in the rank of the factorization. More recently, the HolE model (Nickel et al., 2016b) proposed a solution that has quasi-linear complexity in time and linear space complexity. DistMult (Yang et al., 2015) can be seen as a joint orthogonal diagonalization with real embeddings, hence handling only symmetric relations. Conversely, TransE (Bordes et al., 2013b) handles symmetric relations at the price of strong constraints on its embeddings. The canonical polyadic (CP) decomposition (Hitchcock, 1927) generalizes poorly with its different embeddings for entities as subject and as object.
We reconcile expressiveness, scalability and generalization by going back to the realm of well-studied matrix factorizations, and making use of complex linear algebra, a scarcely used tool in the machine learning community.
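Model  Scoring Function  Relation Parameters
CP (Hitchcock, 1927)  $\langle w_r, u_s, v_o \rangle$  $w_r \in \mathbb{R}^K$
RESCAL (Nickel et al., 2011)  $e_s^\top W_r e_o$  $W_r \in \mathbb{R}^{K \times K}$
TransE (Bordes et al., 2013b)  $-\|(e_s + w_r) - e_o\|_p$  $w_r \in \mathbb{R}^K$
NTN (Socher et al., 2013)  $u_r^\top f(e_s^\top W_r^{[1..D]} e_o + V_r [e_s; e_o] + b_r)$  $W_r \in \mathbb{R}^{K \times K \times D}$, $V_r \in \mathbb{R}^{D \times 2K}$, $b_r, u_r \in \mathbb{R}^D$
DistMult (Yang et al., 2015)  $\langle w_r, e_s, e_o \rangle$  $w_r \in \mathbb{R}^K$
HolE (Nickel et al., 2016b)  $w_r^\top (\mathcal{F}^{-1}[\overline{\mathcal{F}[e_s]} \odot \mathcal{F}[e_o]])$  $w_r \in \mathbb{R}^K$
ComplEx (this paper)  $\mathrm{Re}(\langle w_r, e_s, \bar{e}_o \rangle)$  $w_r \in \mathbb{C}^K$
Table 1: Scoring functions of latent factor models for a given fact $r(s,o)$, along with their relation parameters. $K$ is the dimensionality of the embeddings. The entity embeddings $e_s$ and $e_o$ of subject $s$ and object $o$ are in $\mathbb{R}^K$ for each model, except for ComplEx, where $e_s, e_o \in \mathbb{C}^K$ (for CP, $u_s$ and $v_o$ are the separate subject and object embeddings). $\bar{x}$ is the complex conjugate, $f$ is the nonlinearity of the NTN model and $D$ its additional latent dimension.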
$\mathcal{F}$ and $\mathcal{F}^{-1}$ denote respectively the Fourier transform and its inverse, $\odot$ is the element-wise product between two vectors, $\mathrm{Re}(\cdot)$ denotes the real part of a complex vector, and $\langle \cdot, \cdot, \cdot \rangle$ denotes the trilinear product.
2.1.2 Decomposition in the Complex Domain
We introduce a new decomposition of real square matrices using unitary diagonalization, the generalization of orthogonal diagonalization to complex matrices. This allows decomposition of arbitrary real square matrices with unique representations of rows and columns.
Let us first recall some notions of complex linear algebra as well as specific cases of diagonalization of real square matrices, before building our proposition upon these results.
A complex-valued vector $x \in \mathbb{C}^K$, with $x = \mathrm{Re}(x) + i\,\mathrm{Im}(x)$, is composed of a real part $\mathrm{Re}(x) \in \mathbb{R}^K$ and an imaginary part $\mathrm{Im}(x) \in \mathbb{R}^K$, where $i$ denotes the square root of $-1$. The conjugate $\bar{x}$ of a complex vector inverts the sign of its imaginary part: $\bar{x} = \mathrm{Re}(x) - i\,\mathrm{Im}(x)$.
Conjugation appears in the usual dot product for complex numbers, called the Hermitian product, or sesquilinear form, which is defined as:
$$\langle u, v \rangle := \bar{u}^\top v.$$
A simple way to justify the Hermitian product for composing complex vectors is that it provides a valid topological norm in the induced vector space. For example, $\bar{x}^\top x = 0$ implies $x = 0$, while this is not the case for the bilinear form $x^\top x$, as there are many non-zero complex vectors $x$ for which $x^\top x = 0$.
This yields an interesting property of the Hermitian product concerning the order of the involved vectors: $\langle u, v \rangle = \overline{\langle v, u \rangle}$, meaning that the real part of the product is symmetric, while the imaginary part is antisymmetric.
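As a small illustration of this property, the sketch below evaluates the Hermitian product on two arbitrary complex vectors in plain Python. The helper name `hermitian` and the vector values are assumptions chosen for the example; they do not come from the paper.

```python
def hermitian(u, v):
    """Hermitian product <u, v> = conj(u)^T v for vectors given as lists."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

# Arbitrary example vectors (illustrative values only).
u = [1 + 2j, -0.5 + 1j]
v = [3 - 1j, 2 + 4j]

uv = hermitian(u, v)
vu = hermitian(v, u)

# Swapping the arguments conjugates the result ...
assert abs(uv - vu.conjugate()) < 1e-12
# ... so the real part is symmetric and the imaginary part antisymmetric.
assert abs(uv.real - vu.real) < 1e-12
assert abs(uv.imag + vu.imag) < 1e-12

# <x, x> is real and positive for a non-zero x, hence it induces a norm.
assert abs(hermitian(u, u).imag) < 1e-12 and hermitian(u, u).real > 0

# The bilinear form x^T x, in contrast, can vanish for a non-zero vector.
x = [1 + 0j, 1j]
assert sum(a * a for a in x) == 0
```

The last assertion shows why the plain bilinear form is unsuitable as a norm on complex vectors.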
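For matrices, we shall write $X^* \in \mathbb{C}^{m \times n}$ for the conjugate-transpose $X^* = (\bar{X})^\top = \overline{X^\top}$ of $X \in \mathbb{C}^{n \times m}$. The conjugate transpose is also often written $X^\dagger$ or $X^H$.
Definition 2
A complex square matrix $X \in \mathbb{C}^{n \times n}$ is unitarily diagonalizable if it can be written as $X = EWE^*$, where $E, W \in \mathbb{C}^{n \times n}$, $W$ is diagonal, and $E$ is unitary such that $EE^* = E^*E = I$.
Definition 3
A complex square matrix $X$ is normal if it commutes with its conjugate-transpose so that $XX^* = X^*X$.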
We can now state the spectral theorem for normal matrices.
Theorem 1 (Spectral theorem for normal matrices, von Neumann (1929))
Let $X$ be a complex square matrix. Then $X$ is unitarily diagonalizable if and only if $X$ is normal.
It is easy to check that all real symmetric matrices are normal, and have purely real eigenvectors and eigenvalues. But the set of purely real normal matrices also includes all real antisymmetric matrices (useful to model hierarchical relations such as IsOlder), as well as all real orthogonal matrices (including permutation matrices), and many other matrices that are useful to represent binary relations, such as assignment matrices which represent bipartite graphs. However, far from all matrices expressed as $EWE^*$ are purely real, and Equation (1) requires the scores $X$ to be purely real.
As we only focus on real square matrices in this work, let us summarize all the cases where $X$ is real square and $X = EWE^*$ if $X$ is unitarily diagonalizable, where $E, W \in \mathbb{C}^{n \times n}$, $W$ is diagonal and $E$ is unitary:

$X$ is symmetric if and only if $X$ is orthogonally diagonalizable and $E$ and $W$ are purely real.

$X$ is normal and non-symmetric if and only if $X$ is unitarily diagonalizable and $E$ and $W$ are not both purely real.

$X$ is not normal if and only if $X$ is not unitarily diagonalizable.
We generalize all three cases by showing that, for any $X \in \mathbb{R}^{n \times n}$, there exists a unitary diagonalization in the complex domain, of which the real part equals $X$:
$$X = \mathrm{Re}(EWE^*) \tag{2}$$
In other words, the unitary diagonalization is projected onto the real subspace.
Theorem 2
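Suppose $X \in \mathbb{R}^{n \times n}$ is a real square matrix. Then there exists a normal matrix $Z \in \mathbb{C}^{n \times n}$ such that $\mathrm{Re}(Z) = X$.
Proof. Let $Z := X + iX^\top$. Then
$$Z^* = X^\top - iX = -i(iX^\top + X) = -iZ,$$
so that
$$ZZ^* = Z(-iZ) = (-iZ)Z = Z^*Z.$$
Therefore $Z$ is normal. Note that there also exists a normal matrix $Z = X^\top + iX$ such that $\mathrm{Im}(Z) = X$.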
Following Theorem 1 and Theorem 2, any real square matrix can be written as the real part of a complex diagonal matrix through a unitary change of basis.
Corollary 1
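Suppose $X \in \mathbb{R}^{n \times n}$ is a real square matrix. Then there exist $E, W \in \mathbb{C}^{n \times n}$, where $E$ is unitary and $W$ is diagonal, such that $X = \mathrm{Re}(EWE^*)$.
Proof. From Theorem 2, we can write $X = \mathrm{Re}(Z)$, where $Z$ is a normal matrix, and from Theorem 1, $Z$ is unitarily diagonalizable.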
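The existence argument of Theorem 2 and Corollary 1 can be checked numerically. The sketch below assumes NumPy and a random test matrix. The rotation by $e^{-i\pi/4}$ is a convenience of this example rather than a step taken in the paper: since $Z^* = -iZ$ for $Z = X + iX^\top$, the rotated matrix is Hermitian, so `numpy.linalg.eigh` can be used to obtain a unitary eigenbasis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.standard_normal((n, n))      # arbitrary real square matrix

# Theorem 2: Z = X + i X^T is normal and its real part is X.
Z = X + 1j * X.T
assert np.allclose(Z @ Z.conj().T, Z.conj().T @ Z)

# Z^* = -i Z implies that H = exp(-i*pi/4) * Z is Hermitian (example-specific
# trick), so eigh yields real eigenvalues and a unitary eigenvector matrix E.
phase = np.exp(1j * np.pi / 4)
H = Z / phase
assert np.allclose(H, H.conj().T)
lam, E = np.linalg.eigh(H)
W = np.diag(phase * lam)             # complex eigenvalues of Z on the diagonal

# Corollary 1: E is unitary and X = Re(E W E^*).
assert np.allclose(E @ E.conj().T, np.eye(n))
X_rec = np.real(E @ W @ E.conj().T)
assert np.allclose(X_rec, X)
```

Note that `X` here is neither symmetric nor normal, yet it is recovered exactly (up to floating-point error) as the real part of a unitary diagonalization.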
Applied to the knowledge graph completion setting, the rows of $E$ here are vectorial representations of the entities corresponding to rows and columns of the relation score matrix $X$. The score for the relation holding true between entities $s$ and $o$ is hence
$$x_{so} = \mathrm{Re}(e_s^\top W \bar{e}_o) \tag{3}$$
where $e_s, e_o \in \mathbb{C}^n$ and $W \in \mathbb{C}^{n \times n}$ is diagonal. For a given entity, its subject embedding vector is the complex conjugate of its object embedding vector.
To illustrate this difference of expressiveness with respect to real-valued embeddings, let us consider two complex embeddings $e_s, e_o \in \mathbb{C}$ of dimension 1, with arbitrary values, as well as their real-valued, twice-bigger counterparts $e'_s, e'_o \in \mathbb{R}^2$. In the real-valued case, which corresponds to the DistMult model (Yang et al., 2015), the score is $x_{so} = e'^\top_s W' e'_o$. Figure 1 represents the heatmaps of the scores $x_{so}$ and $x_{os}$, as a function of $W \in \mathbb{C}$ in the complex-valued case, and as a function of the diagonal $W' \in \mathbb{R}^{2 \times 2}$ in the real-valued case. In the real-valued case, which is symmetric in the subject and object entities, the scores $x_{so}$ and $x_{os}$ are equal for any diagonal $W'$, whereas in the complex-valued case, varying $W$ makes it possible to score $x_{so}$ and $x_{os}$ with any desired pair of values.
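The contrast between the two cases can be reproduced with a few lines of plain Python. The embedding values, the helper names and the closed-form solver `solve_w` are assumptions made for this sketch; the real-valued counterparts are taken here to be the (real, imaginary) pairs of the complex embeddings.

```python
def complex_score(e_s, w, e_o):
    """Equation (3) in dimension 1: Re(e_s * w * conj(e_o))."""
    return (e_s * w * e_o.conjugate()).real

def real_score(e_s, w, e_o):
    """Real-valued score with a diagonal W': sum_k e_s[k] * w[k] * e_o[k]."""
    return sum(a * b * c for a, b, c in zip(e_s, w, e_o))

# Illustrative values only.
e_s, e_o = 1 - 2j, -3 + 1j
r_s, r_o = (1.0, -2.0), (-3.0, 1.0)

# Real case: the score is symmetric whatever the diagonal weights are.
w_real = (0.7, -1.3)
assert abs(real_score(r_s, w_real, r_o) - real_score(r_o, w_real, r_s)) < 1e-12

# Complex case: the two orderings generally receive different scores.
w = 0.5 + 2j
x_so = complex_score(e_s, w, e_o)
x_os = complex_score(e_o, w, e_s)
assert abs(x_so - x_os) > 1.0

def solve_w(t_so, t_os):
    """Pick w so that (x_so, x_os) = (t_so, t_os); needs Re(z), Im(z) != 0."""
    z = e_s * e_o.conjugate()
    return complex((t_so + t_os) / (2 * z.real), (t_os - t_so) / (2 * z.imag))

w_target = solve_w(1.0, -1.0)
assert abs(complex_score(e_s, w_target, e_o) - 1.0) < 1e-12
assert abs(complex_score(e_o, w_target, e_s) + 1.0) < 1e-12
```

The solver works because $x_{so} = \mathrm{Re}(wz)$ and $x_{os} = \mathrm{Re}(w\bar{z})$ with $z = e_s \bar{e}_o$ form two independent linear equations in the real and imaginary parts of $w$.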
This decomposition, however, is non-unique. A simple example of this non-uniqueness is obtained by adding a purely imaginary constant to the eigenvalues. Let $X \in \mathbb{R}^{n \times n}$, and $X = \mathrm{Re}(EWE^*)$ where $E$ is unitary and $W$ is diagonal. Then for any real constant $c \in \mathbb{R}$ we have:
$$X = \mathrm{Re}(E(W + icI)E^*) = \mathrm{Re}(EWE^* + icEE^*) = \mathrm{Re}(EWE^* + icI) = \mathrm{Re}(EWE^*).$$
In general, there are many other possible pairs of matrices $E$ and $W$ that preserve the real part of the decomposition. In practice, however, this does not imply poor generalization abilities, as many effective matrix and tensor decomposition methods used in machine learning lead to non-unique solutions (Paatero and Tapper, 1994; Nickel et al., 2011). In this case also, the learned representations prove useful, as shown in the experimental section.
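A quick numerical check of this non-uniqueness, assuming NumPy: the unitary matrix is drawn from the QR factorization of a random complex matrix, which is simply one convenient way of obtaining a unitary $E$ for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Random unitary E (Q factor of a complex Gaussian matrix) and eigenvalues w.
E, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
w = rng.standard_normal(n) + 1j * rng.standard_normal(n)

X = np.real(E @ np.diag(w) @ E.conj().T)

# Shifting every eigenvalue by i*c adds i*c*I, which has no real part.
c = 3.7
X_shifted = np.real(E @ np.diag(w + 1j * c) @ E.conj().T)
assert np.allclose(X, X_shifted)
```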
2.2 Low-Rank Decomposition
Addressing knowledge graph completion with data-driven approaches assumes that there is sufficient regularity in the observed data to generalize to unobserved facts. When formulated as a matrix completion problem, as is the case in this section, one way of implementing this hypothesis is to assume that the matrix has a low rank or approximately low rank. We first discuss the rank of the proposed decomposition, and then introduce the sign-rank and extend the bound developed on the rank to the sign-rank.
2.2.1 Rank Upper Bound
First, we recall one definition of the rank of a matrix (Horn and Johnson, 2012).
Definition 4
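The rank of an $m$-by-$n$ complex matrix $X$ is $\mathrm{rank}(X) = \mathrm{rank}(X^\top) = k$, if $X$ has exactly $k$ linearly independent columns.
Also note that if $X$ is diagonalizable so that $X = PWP^{-1}$ with $\mathrm{rank}(X) = k$, then $W$ has $k$ non-zero diagonal entries for some diagonal $W$ and some invertible matrix $P$. From this it is easy to derive a known subadditive property of the rank:
$$\mathrm{rank}(B + C) \leq \mathrm{rank}(B) + \mathrm{rank}(C) \tag{4}$$
where $B, C \in \mathbb{C}^{m \times n}$.
We now show that any rank-$k$ real square matrix can be reconstructed from a $2k$-dimensional unitary diagonalization.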
Corollary 2
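Suppose $X \in \mathbb{R}^{n \times n}$ and $\mathrm{rank}(X) = k$. Then there exist $E \in \mathbb{C}^{n \times 2k}$ and $W \in \mathbb{C}^{2k \times 2k}$ such that the columns of $E$ form an orthonormal basis of $\mathbb{C}^{2k}$, $W$ is diagonal, and $X = \mathrm{Re}(EWE^*)$.
Proof. Consider the complex square matrix $Z := X + iX^\top$. We have $\mathrm{rank}(iX^\top) = \mathrm{rank}(X^\top) = \mathrm{rank}(X) = k$.
From Equation (4), $\mathrm{rank}(Z) \leq \mathrm{rank}(X) + \mathrm{rank}(iX^\top) = 2k$.
The proof of Theorem 2 shows that $Z$ is normal. Thus $Z = EWE^*$ with $E \in \mathbb{C}^{n \times 2k}$, $W \in \mathbb{C}^{2k \times 2k}$, where the columns of $E$ form an orthonormal basis of $\mathbb{C}^{2k}$, and $W$ is diagonal.
Since $E$ is not necessarily square, we replace the unitary requirement of Corollary 1 by the requirement that its columns form an orthonormal basis of its smallest dimension, $2k$.
Also, given that such a decomposition always exists in dimension $n$ (Theorem 2), this upper bound is not relevant when $\mathrm{rank}(X) \geq \frac{n}{2}$.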
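The rank bound can be observed on a random low-rank matrix. The sketch below assumes NumPy; as an example-specific convenience (not a step from the paper), it diagonalizes $Z = X + iX^\top$ through the Hermitian matrix $e^{-i\pi/4}Z$ and keeps only the eigenvectors with non-negligible eigenvalues. The threshold `1e-10` is an arbitrary numerical tolerance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 2

# Random real n x n matrix of rank k.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
Z = X + 1j * X.T
assert np.linalg.matrix_rank(X) == k
assert np.linalg.matrix_rank(Z) <= 2 * k      # Equation (4)

# Unitary diagonalization of the normal matrix Z via a Hermitian rotation.
phase = np.exp(1j * np.pi / 4)
lam, E = np.linalg.eigh(Z / phase)

# Keep only the (at most 2k) directions with non-zero eigenvalues.
keep = np.abs(lam) > 1e-10
E_k = E[:, keep]
W_k = np.diag(phase * lam[keep])
assert E_k.shape[1] <= 2 * k

# Orthonormal columns, and X is still recovered from the compact factors.
assert np.allclose(E_k.conj().T @ E_k, np.eye(E_k.shape[1]))
assert np.allclose(np.real(E_k @ W_k @ E_k.conj().T), X)
```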
2.2.2 Sign-Rank Upper Bound
Since we encode the truth values of each fact with $\pm 1$, we deal with square sign matrices: $Y \in \{-1, 1\}^{n \times n}$. Sign matrices have an alternative rank definition, the sign-rank.
Definition 5
The sign-rank $\mathrm{rank}_\pm(Y)$ of an $m$-by-$n$ sign matrix $Y$ is the rank of the $m$-by-$n$ real matrix of least rank that has the same sign-pattern as $Y$, so that
$$\mathrm{rank}_\pm(Y) := \min_{X \in \mathbb{R}^{m \times n}} \{\mathrm{rank}(X) \mid \mathrm{sign}(X) = Y\},$$
where $\mathrm{sign}(X)_{ij} = \mathrm{sign}(x_{ij})$.
We define the sign function of $c \in \mathbb{R}$ as
$$\mathrm{sign}(c) = \begin{cases} 1 & \text{if } c \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
where the value $c = 0$ is here arbitrarily assigned to $1$ to allow zero entries in $X$, contrary to the stricter usual definition of the sign-rank.
To make generalization possible, we hypothesize that the true matrix $Y$ has a low sign-rank, and thus can be reconstructed by the sign of a low-rank score matrix $X$. The low sign-rank assumption is theoretically justified by the fact that the sign-rank is a natural complexity measure of sign matrices (Linial et al., 2007) and is linked to learnability (Alon et al., 2016), and empirically confirmed by the wide success of factorization models (Nickel et al., 2016a).
Using Corollary 2, we can now show that any square sign matrix of sign-rank $k$ can be reconstructed from a rank-$2k$ unitary diagonalization.
Corollary 3
Suppose $Y \in \{-1, 1\}^{n \times n}$, $\mathrm{rank}_\pm(Y) = k$. Then there exist $E \in \mathbb{C}^{n \times 2k}$ and $W \in \mathbb{C}^{2k \times 2k}$, where the columns of $E$ form an orthonormal basis of $\mathbb{C}^{2k}$ and $W$ is diagonal, such that $Y = \mathrm{sign}(\mathrm{Re}(EWE^*))$.
Proof. By definition, if $\mathrm{rank}_\pm(Y) = k$, there exists a real square matrix $X$ such that $\mathrm{rank}(X) = k$ and $\mathrm{sign}(X) = Y$. From Corollary 2, $X = \mathrm{Re}(EWE^*)$ where $E \in \mathbb{C}^{n \times 2k}$, $W \in \mathbb{C}^{2k \times 2k}$, where the columns of $E$ form an orthonormal basis of $\mathbb{C}^{2k}$, and $W$ is diagonal.
Previous attempts to approximate the sign-rank in relational learning did not use complex numbers. They showed the existence of compact factorizations under conditions on the sign matrix (Nickel et al., 2014), or only in specific cases (Bouchard et al., 2015). In contrast, our results show that if a square sign matrix has sign-rank $k$, then it can be exactly decomposed through a $2k$-dimensional unitary diagonalization.
Although we can only show the existence of a complex decomposition of rank $2k$ for a matrix with sign-rank $k$, the sign-rank of $Y$ is often much lower than the rank of $Y$, as we do not know of any matrix $Y \in \{-1, 1\}^{n \times n}$ for which $\mathrm{rank}_\pm(Y) > \sqrt{n}$ (Alon et al., 2016). For example, the $n \times n$ identity matrix has rank $n$, but its sign-rank is only 3. By swapping the columns $2j$ and $2j - 1$ for $j$ in $1, \ldots, \frac{n}{2}$, the identity matrix corresponds to the relation marriedTo, a relation known to be hard to factorize over the reals (Nickel et al., 2014), since the rank is invariant by row/column permutations. Yet our model can express it in rank at most 6, for any $n$.
Hence, by enforcing a low rank $K \ll n$ on $EWE^*$, individual relation scores $x_{so} = \mathrm{Re}(e_s^\top W \bar{e}_o)$ between entities $s$ and $o$ can be efficiently predicted, as $e_s, e_o \in \mathbb{C}^K$ and $W \in \mathbb{C}^{K \times K}$ is diagonal.
Finding the $K$ that matches the sign-rank of $Y$ corresponds to finding the smallest $K$ that brings the 0–1 loss on $X$ to $0$, as link prediction can be seen as binary classification of the facts. In practice, and as classically done in machine learning to avoid this NP-hard problem, we use a continuous surrogate of the 0–1 loss, in this case the logistic loss as described in Section 4, and validate models on different values of $K$, as described in Section 5.
2.2.3 Rank Bound Discussion
Corollaries 2 and 3 use the aforementioned subadditive property of the rank to derive the $2k$ upper bound. Let us give an example for which this bound is strictly greater than $k$.
Consider the following $2$-by-$2$ sign matrix:
$$Y = \begin{pmatrix} -1 & -1 \\ 1 & 1 \end{pmatrix}.$$
Not only is this matrix not normal, but one can also easily check that there is no real normal $2$-by-$2$ matrix that has the same sign-pattern as $Y$. Clearly, $Y$ is a rank-1 matrix since its columns are linearly dependent, hence its sign-rank is also 1. From Corollary 3, we know that there is a normal matrix whose real part has the same sign-pattern as $Y$, and whose rank is at most 2.
However, there is no rank-1 unitary diagonalization of which the real part equals $Y$. Otherwise we could find a $2$-by-$2$ complex matrix $Z$ such that $\mathrm{Re}(z_{11}) < 0$ and $\mathrm{Re}(z_{22}) > 0$, where $z_{11} = e_1 w \bar{e}_1 = w|e_1|^2$, $z_{22} = e_2 w \bar{e}_2 = w|e_2|^2$, $e \in \mathbb{C}^2$, $w \in \mathbb{C}$. This is obviously unsatisfiable. This example generalizes to any $n$-by-$n$ square sign matrix that only has $-1$ on its first row and is hence rank 1; the same argument holds considering $\mathrm{Re}(z_{11}) < 0$ and $\mathrm{Re}(z_{nn}) > 0$.
This example shows that the upper bound $2k$ on the rank of the unitary diagonalization shown in Corollaries 2 and 3 can be strictly greater than $k$, the rank or sign-rank of the decomposed matrix. However, there might be other examples for which the addition of an imaginary part could, in addition to making the matrix normal, create some linear dependence between the rows/columns and thus decrease the rank of the matrix, up to a factor of 2.
We summarize this section in three points:

The proposed factorization encompasses all possible score matrices for a single binary relation.

By construction, the factorization is well suited to represent both symmetric and antisymmetric relations.

Relation patterns can be efficiently approximated with a low-rank factorization using complex-valued embeddings.
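3 Extension to Multi-Relational Data
Let us now extend the previous discussion to models with multiple relations. Let $\mathcal{R}$ be the set of relations, with $|\mathcal{R}| = m$. We shall now write $\mathbf{X} \in \mathbb{R}^{m \times n \times n}$ for the score tensor, $X_r \in \mathbb{R}^{n \times n}$ for the score matrix of the relation $r \in \mathcal{R}$, and $\mathbf{Y} \in \{-1, 1\}^{m \times n \times n}$ for the partially observed sign tensor.
Given one relation $r \in \mathcal{R}$ and two entities $s, o \in \mathcal{E}$, the probability that the fact $r(s,o)$ is true is given by:
$$P(y_{rso} = 1) = \sigma(x_{rso}) = \sigma(\phi(r, s, o; \Theta)) \tag{5}$$
where $\phi$ is the scoring function of the model considered and $\Theta$ denotes the model parameters. We denote the set of all possible facts (or triples) for a knowledge graph by $\mathcal{T} = \mathcal{R} \times \mathcal{E} \times \mathcal{E}$. While the tensor $\mathbf{X}$ as a whole is unknown, we assume that we observe a set of true and false triples $\Omega = \{((r, s, o), y_{rso}) \mid (r, s, o) \in \mathcal{T}_\Omega\}$, where $y_{rso} \in \{-1, 1\}$ and $\mathcal{T}_\Omega \subseteq \mathcal{T}$ is the set of observed triples. The goal is to find the probabilities of entries $y_{r's'o'}$ for a set of targeted unobserved triples $\{(r', s', o') \in \mathcal{T} \setminus \mathcal{T}_\Omega\}$.
Depending on the scoring function $\phi(r, s, o; \Theta)$ used to model the score tensor $\mathbf{X}$, we obtain different models. Examples of scoring functions are given in Table 1.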
3.1 Complex Factorization Extension to Tensors
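The single-relation model is extended by jointly factorizing all the square matrices of scores into a third-order tensor $\mathbf{X} \in \mathbb{R}^{m \times n \times n}$, with a different diagonal matrix $W_r \in \mathbb{C}^{K \times K}$ for each relation $r$, and by sharing the entity embeddings $E \in \mathbb{C}^{n \times K}$ across all relations:
$$\phi(r, s, o; \Theta) = \mathrm{Re}(e_s^\top W_r \bar{e}_o) = \mathrm{Re}\Big(\sum_{k=1}^{K} w_{rk} e_{sk} \bar{e}_{ok}\Big) = \mathrm{Re}(\langle w_r, e_s, \bar{e}_o \rangle) \tag{6}$$
where $K$ is the rank hyperparameter, $e_s, e_o \in \mathbb{C}^K$ are the rows in $E$ corresponding to the entities $s$ and $o$, $w_r = \mathrm{diag}(W_r) \in \mathbb{C}^K$ is a complex vector, and $\langle a, b, c \rangle := \sum_k a_k b_k c_k$ is the component-wise multilinear dot product (footnote 2: this is not the Hermitian extension of the multilinear dot product, as there appears to be no standard definition of the Hermitian multilinear product in the linear algebra literature). For this scoring function, the set of parameters $\Theta$ is $\{e_i, w_r \in \mathbb{C}^K, i \in \mathcal{E}, r \in \mathcal{R}\}$. This resembles the real part of a complex matrix decomposition as in the single-relation case discussed above. However, we now have a different vector of eigenvalues for every relation. Expanding the real part of this product gives:
$$\begin{aligned} \mathrm{Re}(\langle w_r, e_s, \bar{e}_o \rangle) = {} & \langle \mathrm{Re}(w_r), \mathrm{Re}(e_s), \mathrm{Re}(e_o) \rangle + \langle \mathrm{Re}(w_r), \mathrm{Im}(e_s), \mathrm{Im}(e_o) \rangle \\ & + \langle \mathrm{Im}(w_r), \mathrm{Re}(e_s), \mathrm{Im}(e_o) \rangle - \langle \mathrm{Im}(w_r), \mathrm{Im}(e_s), \mathrm{Re}(e_o) \rangle \end{aligned} \tag{7}$$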
These equations provide two interesting views of the model:

Changing the scoring function: Equation (7) only involves real vectors corresponding to the real and imaginary parts of the embeddings and relations.
By separating the real and imaginary parts of the relation embedding $w_r$ as shown in Equation (7), it is apparent that these parts naturally act as weights on each latent dimension: $\mathrm{Re}(w_r)$ over the real part of $\langle e_s, e_o \rangle$, which is symmetric, and $\mathrm{Im}(w_r)$ over the imaginary part of $\langle e_s, e_o \rangle$, which is antisymmetric.
Indeed, the decomposition of each score matrix $X_r$ for each $r \in \mathcal{R}$ can be written as the sum of a symmetric matrix and an antisymmetric matrix. To see this, let us rewrite the decomposition of each score matrix $X_r$ in matrix notation. We write the real part of matrices with primes, $E' = \mathrm{Re}(E)$, and imaginary parts with double primes, $E'' = \mathrm{Im}(E)$:
$$\begin{aligned} X_r &= \mathrm{Re}(EW_rE^*) = \mathrm{Re}\big((E' + iE'')(W'_r + iW''_r)(E' - iE'')^\top\big) \\ &= (E'W'_rE'^\top + E''W'_rE''^\top) + (E'W''_rE''^\top - E''W''_rE'^\top) \end{aligned} \tag{8}$$
It is trivial to check that the matrix $E'W'_rE'^\top + E''W'_rE''^\top$ is symmetric and that the matrix $E'W''_rE''^\top - E''W''_rE'^\top$ is antisymmetric. Hence this model is well suited to model jointly symmetric and antisymmetric relations between pairs of entities, while still using the same entity representations for subjects and objects. When learning, it simply needs to collapse $W''_r = \mathrm{Im}(W_r)$ to zero for symmetric relations $r$, and $W'_r = \mathrm{Re}(W_r)$ to zero for antisymmetric relations $r$, as $X_r$ is indeed symmetric when $W_r$ is purely real, and antisymmetric when $W_r$ is purely imaginary.
From a geometrical point of view, each relation embedding $w_r$ is an anisotropic scaling of the basis defined by the entity embeddings $E$, followed by a projection onto the real subspace.
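The equivalence between the complex form (6), its real expansion (7) and the symmetric/antisymmetric split (8) can be verified numerically. The sketch below assumes NumPy and random embeddings; the function names `score_complex` and `score_real` are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 6, 4
E = rng.standard_normal((n, K)) + 1j * rng.standard_normal((n, K))
w_r = rng.standard_normal(K) + 1j * rng.standard_normal(K)

def score_complex(s, o):
    """Equation (6): Re(<w_r, e_s, conj(e_o)>)."""
    return np.real(np.sum(w_r * E[s] * np.conj(E[o])))

def score_real(s, o):
    """Equation (7): the same score written with real vectors only."""
    wr, wi = w_r.real, w_r.imag
    sr, si = E[s].real, E[s].imag
    o_r, oi = E[o].real, E[o].imag
    return (np.sum(wr * sr * o_r) + np.sum(wr * si * oi)
            + np.sum(wi * sr * oi) - np.sum(wi * si * o_r))

assert all(np.isclose(score_complex(s, o), score_real(s, o))
           for s in range(n) for o in range(n))

# Equation (8): symmetric part weighted by Re(W_r), antisymmetric by Im(W_r).
Er, Ei = E.real, E.imag
Wr, Wi = np.diag(w_r.real), np.diag(w_r.imag)
sym = Er @ Wr @ Er.T + Ei @ Wr @ Ei.T
anti = Er @ Wi @ Ei.T - Ei @ Wi @ Er.T
X_r = np.real(E @ np.diag(w_r) @ E.conj().T)

assert np.allclose(sym, sym.T)
assert np.allclose(anti, -anti.T)
assert np.allclose(X_r, sym + anti)
```

Setting `w_r.imag` to zero would leave only the symmetric part, and setting `w_r.real` to zero only the antisymmetric part, which mirrors the discussion above.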
3.2 Existence of the Tensor Factorization
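Let us first discuss the existence of the multi-relational model where the rank of the decomposition $K \leq n$, which relates to simultaneous unitary decomposition.
Definition 6
A family of matrices $X_1, \ldots, X_m \in \mathbb{C}^{n \times n}$ is simultaneously unitarily diagonalizable, if there is a single unitary matrix $E \in \mathbb{C}^{n \times n}$, such that $X_i = EW_iE^*$ for all $i$ in $1, \ldots, m$, where $W_i \in \mathbb{C}^{n \times n}$ are diagonal.
Definition 7
A family of normal matrices $X_1, \ldots, X_m \in \mathbb{C}^{n \times n}$ is a commuting family of normal matrices, if $X_iX_j = X_jX_i$, for all $i, j$ in $1, \ldots, m$.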
Theorem 3 (see Horn and Johnson (2012))
Suppose $\mathcal{F}$ is the family of matrices $X_1, \ldots, X_m \in \mathbb{C}^{n \times n}$. Then $\mathcal{F}$ is a commuting family of normal matrices if and only if $\mathcal{F}$ is simultaneously unitarily diagonalizable.
To apply Theorem 3 to the proposed factorization, we would have to make the hypothesis that the relation score matrices $X_r$ are a commuting family, which is too strong a hypothesis. Actually, the model is slightly different since we take only the real part of the tensor factorization. In the single-relation case, taking only the real part of the decomposition rids us of the normality requirement of Theorem 1 for the decomposition to exist, as shown in Theorem 2.
In the multiple-relation case, it is an open question whether taking the real part of the simultaneous unitary diagonalization will enable us to decompose families of arbitrary real square matrices, that is, with a single unitary matrix $E$ that has at most $n$ columns. Though it seems unlikely, we have not yet found a counterexample.
However, by letting the rank $K$ of the tensor factorization be greater than $n$, we can show that the proposed tensor decomposition exists for families of arbitrary real square matrices, by simply concatenating the decomposition of Theorem 2 of each real square matrix $X_r$.
Theorem 4
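Suppose $X_1, \ldots, X_m \in \mathbb{R}^{n \times n}$. Then there exist $E \in \mathbb{C}^{n \times nm}$ and diagonal matrices $W_r \in \mathbb{C}^{nm \times nm}$, such that $X_r = \mathrm{Re}(EW_rE^*)$ for all $r$ in $1, \ldots, m$.
Proof. From Theorem 2 we have $X_r = \mathrm{Re}(E_rW_rE_r^*)$, where $W_r \in \mathbb{C}^{n \times n}$ is diagonal, and each $E_r \in \mathbb{C}^{n \times n}$ is unitary for all $r$ in $1, \ldots, m$.
Let $E = [E_1 \ldots E_m]$, and
$$\Lambda_r = \begin{pmatrix} 0^{((r-1)n) \times ((r-1)n)} & & \\ & W_r & \\ & & 0^{((m-r)n) \times ((m-r)n)} \end{pmatrix}$$
where $0^{l \times l}$ is the zero $l \times l$ matrix. Therefore $X_r = \mathrm{Re}(E\Lambda_rE^*)$ for all $r$ in $1, \ldots, m$, with $\Lambda_r$ playing the role of the $nm \times nm$ diagonal matrix in the statement of the theorem.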
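The concatenation used in this proof is easy to reproduce. In the sketch below (NumPy assumed), `unitary_diag` is a hypothetical helper that returns one valid single-relation decomposition; it relies on the observation that $e^{-i\pi/4}(X + iX^\top)$ is Hermitian, which is a shortcut of this example rather than part of the paper's proof.

```python
import numpy as np

def unitary_diag(X):
    """Return unitary E and complex eigenvalues w with X = Re(E diag(w) E^*)."""
    phase = np.exp(1j * np.pi / 4)
    lam, E = np.linalg.eigh((X + 1j * X.T) / phase)
    return E, phase * lam

rng = np.random.default_rng(4)
n, m = 4, 3
Xs = [rng.standard_normal((n, n)) for _ in range(m)]   # arbitrary relations
parts = [unitary_diag(X) for X in Xs]

# Shared entity matrix: horizontal concatenation [E_1 ... E_m], size n x nm.
E = np.hstack([E_r for E_r, _ in parts])
assert E.shape == (n, n * m)

for r, X in enumerate(Xs):
    # Lambda_r: eigenvalues of relation r in its own block, zeros elsewhere.
    lam_r = np.zeros(n * m, dtype=complex)
    lam_r[r * n:(r + 1) * n] = parts[r][1]
    assert np.allclose(np.real(E @ np.diag(lam_r) @ E.conj().T), X)
```

Each relation only uses its own block of $n$ latent dimensions here, which is why this construction gives an upper bound rather than a compact representation.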
By construction, the rank of the decomposition is at most $nm$. When $m \leq n$, this bound actually matches the general upper bound on the rank of the canonical polyadic (CP) decomposition (Hitchcock, 1927; Kruskal, 1989). Since $m$ corresponds to the number of relations and $n$ to the number of entities, $m$ is always smaller than $n$ in real-world knowledge graphs, hence the bound holds in practice.
When it comes to relational learning, though, we might expect the actual rank to be much lower than $nm$ for two reasons. The first one, as discussed above, is that we are dealing with sign tensors, hence the rank of the matrices $X_r$ need only match the sign-rank of the partially observed matrices $Y_r$. The second one is that the matrices are related to each other, as they all represent the same entities in different relations, and thus benefit from sharing latent dimensions, as opposed to the construction used in the proof of Theorem 4, where the dimensions of other relations are canceled out. In practice, the rank needed to generalize well is indeed much lower than $nm$, as we show experimentally in Figure 5.
Also, note that with the construction of the proof of Theorem 4, the matrix $E$ is not unitary any more. However, the unitary constraints in the matrix case serve only the proof of existence, which is just one solution among the infinitely many of the same rank. In practice, imposing orthonormality is essentially a numerical convenience for the decomposition of dense matrices, through iterative methods for example (Saad, 1992). When it comes to matrix and tensor completion, and thus generalization, imposing such constraints is more of a numerical hassle than anything else, especially for gradient methods. As there is no apparent link between orthonormality and generalization properties, we did not impose these constraints when learning this model in the following experiments.
4 Algorithm
Algorithm 1 describes stochastic gradient descent (SGD) to learn the proposed multirelational model with the AdaGrad learningrate updates (Duchi et al., 2011). We refer to the proposed model as ComplEx, for Complex Embeddings. We expose a version of the algorithm that uses only realvalued vectors, in order to facilitate its implementation. To do so, we use separate realvalued representations of the real and imaginary parts of the embeddings.
These real and imaginary part vectors are initialized with vectors having a zero-mean normal distribution with unit variance. If the training set contains only positive triples, negatives are generated for each batch using the local closed-world assumption as in Bordes et al. (2013b). That is, for each triple, we randomly change either the subject or the object to form a negative example. In this case, a dedicated parameter sets the number of negative triples to generate for each positive triple. Collisions with positive triples in the training set are not checked, as they occur rarely in real-world knowledge graphs, which are largely sparse, and checking for them may also be computationally expensive.
Squared gradients are accumulated to compute AdaGrad learning rates, then the parameters are updated. At regular intervals of iterations, the parameters are evaluated over the evaluation set (evaluate_AP_or_MRR function in Algorithm 1). If the data set contains both positive and negative examples, average precision (AP) is used to evaluate the model. If the data set contains only positives, then mean reciprocal rank (MRR) is used, as average precision cannot be computed without true negatives. The optimization process is stopped when the measure considered decreases compared to the last evaluation (early stopping).
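The negative generation step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `corrupt_triples`, the `(r, s, o)` tuple layout, and the `-1` label for negatives are assumptions made for the example. As in the text, collisions with known positives are not checked.

```python
import random

def corrupt_triples(batch, entities, n_neg, rng):
    """Build n_neg negatives per positive by replacing its subject or object.

    Hypothetical helper: triples are (relation, subject, object) tuples and
    negatives are returned with an assumed truth value of -1.
    """
    negatives = []
    for (r, s, o) in batch:
        for _ in range(n_neg):
            e = rng.choice(entities)      # uniform draw over all entities
            if rng.random() < 0.5:        # fair coin: corrupt the subject...
                negatives.append(((r, e, o), -1))
            else:                         # ...or corrupt the object
                negatives.append(((r, s, e), -1))
    return negatives

rng = random.Random(0)
entities = list(range(30))
batch = [(0, 1, 2), (1, 3, 4)]            # two positive triples
negs = corrupt_triples(batch, entities, n_neg=3, rng=rng)
print(len(negs))                          # 3 negatives per positive
```

Each negative keeps the relation and one of the two entities of its positive triple, so the corrupted triples stay "local" to the observed facts.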
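Bern($p$) is the Bernoulli distribution, the one_random_sample function samples uniformly one entity in the set of all entities, and the sample_batch_of_size_b function samples true and false triples uniformly at random from the training set.
For a given embedding size $K$, let us rewrite Equation (7), by denoting the real part of embeddings with primes and the imaginary part with double primes: $e'_i = \mathrm{Re}(e_i)$, $e''_i = \mathrm{Im}(e_i)$, $w'_r = \mathrm{Re}(w_r)$, $w''_r = \mathrm{Im}(w_r)$. The set of parameters is $\Theta = \{e'_i, e''_i, w'_r, w''_r\}$ over all entities $i$ and all relations $r$, and the scoring function involves only real vectors:
$$\phi(r,s,o;\Theta) = \langle w'_r, e'_s, e'_o \rangle + \langle w'_r, e''_s, e''_o \rangle + \langle w''_r, e'_s, e''_o \rangle - \langle w''_r, e''_s, e'_o \rangle \tag{9}$$
where $\langle a, b, c \rangle = \sum_{k=1}^{K} a_k b_k c_k$ is the trilinear product, and where each entity and each relation has two real embeddings.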
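As a sanity check on the real-valued form of the score, the sketch below compares it numerically with the complex-valued form. It assumes that Equation (7) is the real part of the trilinear product of the relation embedding, the subject embedding and the conjugate of the object embedding; the function names are hypothetical.

```python
import random

def score_complex(w, e_s, e_o):
    """Re(<w, e_s, conj(e_o)>), using Python's built-in complex numbers."""
    return sum(wk * a * b.conjugate() for wk, a, b in zip(w, e_s, e_o)).real

def score_real(w_re, w_im, s_re, s_im, o_re, o_im):
    """Same score written with real vectors only (four trilinear products)."""
    total = 0.0
    for k in range(len(w_re)):
        total += w_re[k] * s_re[k] * o_re[k]
        total += w_re[k] * s_im[k] * o_im[k]
        total += w_im[k] * s_re[k] * o_im[k]
        total -= w_im[k] * s_im[k] * o_re[k]
    return total

def real_part(v):
    return [z.real for z in v]

def imag_part(v):
    return [z.imag for z in v]

rng = random.Random(0)
K = 5  # embedding size chosen for the example

def rand_vec():
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(K)]

w, e_s, e_o = rand_vec(), rand_vec(), rand_vec()
diff = abs(
    score_complex(w, e_s, e_o)
    - score_real(real_part(w), imag_part(w),
                 real_part(e_s), imag_part(e_s),
                 real_part(e_o), imag_part(e_o))
)
print(diff < 1e-9)  # the two forms agree up to floating-point error
```

Under the same assumption, a purely imaginary relation embedding gives a score that flips sign when subject and object are swapped, which is how the model can express antisymmetry.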
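Gradients are now easy to write:
$$\begin{aligned}
\nabla_{e'_s}\phi(r,s,o;\Theta) &= (w'_r \odot e'_o) + (w''_r \odot e''_o), \\
\nabla_{e''_s}\phi(r,s,o;\Theta) &= (w'_r \odot e''_o) - (w''_r \odot e'_o), \\
\nabla_{e'_o}\phi(r,s,o;\Theta) &= (w'_r \odot e'_s) - (w''_r \odot e''_s), \\
\nabla_{e''_o}\phi(r,s,o;\Theta) &= (w'_r \odot e''_s) + (w''_r \odot e'_s), \\
\nabla_{w'_r}\phi(r,s,o;\Theta) &= (e'_s \odot e'_o) + (e''_s \odot e''_o), \\
\nabla_{w''_r}\phi(r,s,o;\Theta) &= (e'_s \odot e''_o) - (e''_s \odot e'_o),
\end{aligned}$$
where $\odot$ is the elementwise (Hadamard) product.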
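The gradients of the score can be checked against central finite differences. The sketch below is only an illustration under the assumption that the score is the four-term real-valued trilinear form of Equation (9); the parameter names (`w_re`, `s_im`, and so on) are hypothetical.

```python
import random

def score(w_re, w_im, s_re, s_im, o_re, o_im):
    """Real-valued form of the score (assumed form of Equation (9))."""
    return sum(
        w_re[k] * (s_re[k] * o_re[k] + s_im[k] * o_im[k])
        + w_im[k] * (s_re[k] * o_im[k] - s_im[k] * o_re[k])
        for k in range(len(w_re))
    )

def analytic_grads(w_re, w_im, s_re, s_im, o_re, o_im):
    """Elementwise-product gradients with respect to each real vector."""
    K = range(len(w_re))
    return {
        "s_re": [w_re[k] * o_re[k] + w_im[k] * o_im[k] for k in K],
        "s_im": [w_re[k] * o_im[k] - w_im[k] * o_re[k] for k in K],
        "o_re": [w_re[k] * s_re[k] - w_im[k] * s_im[k] for k in K],
        "o_im": [w_re[k] * s_im[k] + w_im[k] * s_re[k] for k in K],
        "w_re": [s_re[k] * o_re[k] + s_im[k] * o_im[k] for k in K],
        "w_im": [s_re[k] * o_im[k] - s_im[k] * o_re[k] for k in K],
    }

def numeric_grad(params, name, eps=1e-6):
    """Central finite differences on one named parameter vector."""
    out = []
    for k in range(len(params[name])):
        plus = {n: list(v) for n, v in params.items()}
        minus = {n: list(v) for n, v in params.items()}
        plus[name][k] += eps
        minus[name][k] -= eps
        out.append((score(**plus) - score(**minus)) / (2 * eps))
    return out

rng = random.Random(1)
K = 4
params = {name: [rng.gauss(0, 1) for _ in range(K)]
          for name in ("w_re", "w_im", "s_re", "s_im", "o_re", "o_im")}
exact = analytic_grads(**params)
max_err = max(
    abs(a - n)
    for name in params
    for a, n in zip(exact[name], numeric_grad(params, name))
)
print(max_err < 1e-6)
```

Because the score is linear in each embedding when the others are held fixed, the finite-difference estimate matches the analytic gradient up to rounding error.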
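We optimized the negative log-likelihood of the logistic model described in Equation (5) with $L^2$ regularization on the parameters $\Theta$:
$$\mathcal{L}(\Omega;\Theta) = \sum_{((r,s,o),\,y) \in \Omega} \log\left(1 + \exp\left(-y\,\phi(r,s,o;\Theta)\right)\right) + \lambda \|\Theta\|_2^2 \tag{10}$$
where $\Omega$ is the training set of triples with their truth values $y \in \{-1, 1\}$, and $\lambda$ is the regularization parameter.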
To handle regularization, note that using separate representations for the real and imaginary parts does not change anything, as the squared $L^2$ norm of a complex vector $v = v' + i v''$ is the sum of the squared moduli of its entries:
$$\|v\|_2^2 = \sum_j |v_j|^2 = \sum_j \left( v'^{\,2}_j + v''^{\,2}_j \right) = \|v'\|_2^2 + \|v''\|_2^2,$$
which is exactly the sum of the squared norms of the vectors of the real and imaginary parts.
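We can finally write the gradient of the regularized loss $\mathcal{L}$ of Equation (10) with respect to a real embedding $v$ for one triple $(r,s,o)$ and its truth value $y$:
$$\nabla_v \mathcal{L}(\{((r,s,o),y)\};\Theta) = -y\,\sigma\left(-y\,\phi(r,s,o;\Theta)\right)\nabla_v \phi(r,s,o;\Theta) + 2\lambda v \tag{11}$$
where $\sigma(x) = 1/(1+e^{-x})$ is the logistic function.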
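To make the update concrete, the sketch below applies the gradient of the regularized logistic loss for one triple, followed by an AdaGrad step, to a single real embedding. It is a toy illustration under stated assumptions: truth values are in {-1, 1}, the other embeddings are held fixed so that the score is linear in `v` with coefficient vector `c` (standing in for the gradient of the score), only the norm of `v` is regularized, and all names and numeric values other than the initial learning rate of 0.5 are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triple_loss(y, phi, v, lam):
    """Logistic loss for one triple plus the L2 penalty on the embedding v."""
    return math.log1p(math.exp(-y * phi)) + lam * sum(x * x for x in v)

def triple_loss_grad(y, phi, grad_phi, v, lam):
    """-y * sigmoid(-y * phi) * grad_phi + 2 * lam * v."""
    coeff = -y * sigmoid(-y * phi)
    return [coeff * g + 2.0 * lam * x for g, x in zip(grad_phi, v)]

def adagrad_step(v, grad, accum, lr=0.5, eps=1e-8):
    """In-place AdaGrad update: per-coordinate steps scaled by past gradients."""
    for k, g in enumerate(grad):
        accum[k] += g * g
        v[k] -= lr * g / (math.sqrt(accum[k]) + eps)

c = [0.5, -1.0, 2.0]        # stands in for the gradient of the score w.r.t. v
v = [0.1, 0.2, -0.3]        # the real embedding being learned
accum = [0.0, 0.0, 0.0]     # accumulated squared gradients
y, lam = 1, 0.01            # a positive triple and a small regularization weight

initial_loss = triple_loss(y, dot(c, v), v, lam)
for _ in range(20):
    grad = triple_loss_grad(y, dot(c, v), c, v, lam)
    adagrad_step(v, grad, accum)
final_loss = triple_loss(y, dot(c, v), v, lam)
print(final_loss < initial_loss)
```

In a full implementation, the same step would be applied to every real and imaginary embedding vector involved in the batch, each with its own accumulator.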
5 Experiments
We evaluated the method proposed in this paper on both synthetic and real data sets. The synthetic data set contains both symmetric and antisymmetric relations, whereas the real data sets are standard link prediction benchmarks based on real knowledge graphs.
We compared ComplEx to state-of-the-art models, namely TransE (Bordes et al., 2013b), DistMult (Yang et al., 2015), RESCAL (Nickel et al., 2011) and also to the canonical polyadic decomposition (CP) (Hitchcock, 1927), to emphasize empirically the importance of learning unique embeddings for entities. For experimental fairness, we reimplemented these models within the same framework as the ComplEx model, using a Theano-based SGD implementation (https://github.com/lmjohns3/downhill) (Bergstra et al., 2010).
For the TransE model, results were obtained with its original max-margin loss, as it turned out to yield better results for this model only. To use this max-margin loss on data sets with observed negatives (Sections 5.1 and 5.2), positive triples were replicated when necessary to match the number of negative triples, as described in Garcia-Duran et al. (2016). All other models are trained with the negative log-likelihood of the logistic model (Equation (10)). In all the following experiments, we used a fixed maximum number of iterations and a fixed batch size, and validated the models for early stopping at a fixed interval of iterations.
5.1 Synthetic Task
To assess our claim that ComplEx can accurately model symmetry and antisymmetry jointly, we randomly generated a knowledge graph of two relations and 30 entities. One relation is entirely symmetric, while the other is completely antisymmetric. This data set corresponds to a $2 \times 30 \times 30$ tensor. Figure 2 shows a part of this randomly generated tensor, with a symmetric slice and an antisymmetric slice, decomposed into training, validation and test sets. To ensure that all test values are predictable, the upper-triangular parts of the matrices are always kept in the training set, and the diagonals are unobserved. We conducted a 5-fold cross-validation on the lower-triangular matrices, using the upper-triangular parts plus 3 folds for training, one fold for validation and one fold for testing. Each training set contains 1392 observed triples, whereas validation and test sets contain 174 triples each.
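The split sizes quoted above follow directly from the shape of the data, as the short sketch below shows. The counting uses only the figures given in the text; the generation of the two slices is an assumption made for illustration (truth values in {-1, 1}, with antisymmetry taken to mean that the two directions of a pair have opposite signs).

```python
import random

n_entities, n_relations, n_folds = 30, 2, 5

# Off-diagonal cells in one triangle of a 30 x 30 slice (diagonal unobserved).
per_triangle = n_entities * (n_entities - 1) // 2     # 435
upper = n_relations * per_triangle                    # 870, always in training
lower = n_relations * per_triangle                    # 870, split into 5 folds
fold_size = lower // n_folds                          # 174
train_size = upper + 3 * fold_size                    # 1392

print(fold_size, train_size)

# Hypothetical generation of one symmetric and one antisymmetric slice.
rng = random.Random(0)
sym = [[0] * n_entities for _ in range(n_entities)]
anti = [[0] * n_entities for _ in range(n_entities)]
for i in range(n_entities):
    for j in range(i + 1, n_entities):
        sym[i][j] = sym[j][i] = rng.choice((-1, 1))   # same value both ways
        anti[i][j] = rng.choice((-1, 1))
        anti[j][i] = -anti[i][j]                      # opposite value both ways
```

With such slices, every lower-triangular entry is determined by its upper-triangular counterpart, which is why keeping the upper triangles in training makes all test values predictable.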
Figure 3 shows the best cross-validated average precision (area under the precision-recall curve) for different factorization models of ranks ranging up to 50. The regularization parameter $\lambda$ is validated in {0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.00001, 0.0} and the learning rate was initialized to 0.5.
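For readers unfamiliar with the metric, average precision can be estimated from a ranked list of predictions as sketched below. This is a common non-interpolated estimator; whether the experiments used exactly this estimator is an assumption, and the function name and the {-1, 1} label encoding are chosen for the example.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive.

    y_true holds truth values (1 for positives, anything else for negatives);
    scores holds the model scores, higher meaning more likely true.
    Ties between scores are broken by input order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank      # precision at this positive's rank
    return total / hits if hits else 0.0

y_true = [1, -1, 1, -1]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(y_true, scores)
print(ap)  # positives at ranks 1 and 3: (1/1 + 2/3) / 2
```

A model that ranks every positive above every negative obtains an average precision of 1.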
As expected, DistMult (Yang et al., 2015) is not able to model antisymmetry and only predicts the symmetric relations correctly. Although TransE (Bordes et al., 2013b) is not a symmetric model, it performs poorly in practice, particularly on the antisymmetric relation. RESCAL (Nickel et al., 2011), with its large number of parameters, quickly overfits as the rank grows. Canonical Polyadic (CP) decomposition (Hitchcock, 1927) fails on both relations as it has to push symmetric and antisymmetric patterns through the entity embeddings. Surprisingly, only ComplEx succeeds even on such simple data.