Knowledge graphs store semantic information in the form of entities and relationships that is easily machine-processable, a property considered an important ingredient for building more intelligent systems that take advantage of such semantically structured representations. Thanks to long-term collaborative efforts, many knowledge graphs containing a huge amount of data, such as WordNet, YAGO, DBpedia, and Freebase, are now readily available and have been successfully used for coreference resolution, question expansion, and question answering. However, their underlying rigid symbolic representations, while very interpretable and efficient for their original purposes, make them hard to integrate, especially into deep learning systems that focus on learning distributed representations of data.
A promising method is to embed the entities and relations from a knowledge graph into a continuous low-dimensional vector space. Once those embeddings are well learned from the existing facts, the relationships between entities can be derived from interactions of their embeddings via an appropriate operator for each relation. Many ways have been proposed to model these interactions and to derive the existence of a relationship from them [3, 21, 22, 19, 15, 24, 7, 39, 16, 28, 37]. Knowledge graph completion (or link prediction) is regarded as a major strength of these relational learning models, since knowledge graphs often miss many facts and some of the edges they contain might be incorrect. Encoding entities as distributed embeddings also greatly improves the efficiency of link prediction, because such predictions can be made without exploring the original large graph.
The relational learning models for knowledge graphs usually predict the existence of a (subject, predicate, object) triple via a score function that represents the model’s confidence that the triple is true. These models are normally trained by maximizing the plausibility of observed triples. However, training on all-positive samples is tricky because the model can easily overgeneralize, and knowledge graphs usually contain only positive triples. A widely used method to generate negative samples is to “perturb” positive triples by replacing the subjects or objects of true triples with entities selected at random. Unfortunately, good “plausible” negative examples are still hard to come by, and randomly perturbed ones are usually not sufficient to train useful models: while it is relatively easy to predict that a person is born in a city, it is difficult to predict which city in particular. A better perturbation-based approach to generating more informative negative examples is to replace the subjects or objects of observed triples with entities semantically close to the replaced ones.
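The random perturbation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `corrupt_triple`, the toy entities, and the 50/50 choice between replacing the subject and the object are hypothetical and not taken from the paper.

```python
import random

def corrupt_triple(triple, entities, known_triples, rng, max_tries=100):
    """Build a negative triple by replacing the subject or the object.

    Candidates that are already known facts are rejected, so the returned
    triple is never an element of `known_triples`.
    """
    s, p, o = triple
    for _ in range(max_tries):
        e = rng.choice(entities)
        # Assumption: subject and object are corrupted with equal probability.
        candidate = (e, p, o) if rng.random() < 0.5 else (s, p, e)
        if candidate != triple and candidate not in known_triples:
            return candidate
    raise RuntimeError("no negative triple found")

# Toy data, purely illustrative.
known = {("alice", "born_in", "paris"), ("bob", "born_in", "rome")}
entities = ["alice", "bob", "paris", "rome", "berlin"]
negative = corrupt_triple(("alice", "born_in", "paris"), entities, known,
                          random.Random(0))
```

Note that such a sampler happily produces negatives like a person "born in" another person, which illustrates why purely random negatives tend to be uninformative.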
We propose a relational learning method using a generative adversarial architecture in which the negative examples are partly generated by a generative network (GN), and a discriminative network (DN) is trained to distinguish ground truths from the generated triples and randomly sampled false ones. GN and DN compete in a two-player minimax game: the discriminator tries to differentiate the positive triples from the others, and the generator tries to fool the discriminator. Competition in this game drives both networks to improve their performance until the generated examples are indistinguishable from the true triples. Upon convergence, GN recovers the training data and can be used for knowledge graph completion, while DN has been trained to be a good triple classifier. Unlike previous work [5, 32] using generative adversarial architectures, our GN is capable of generating unseen “plausible” triples, whereas they use GN only to grade and select existing negative samples for DN. Experiments show that our method can significantly improve the performance of classical relational learning models (e.g. TransE) on both the link prediction and triple classification tasks.
The remainder of this paper is structured as follows. Section 2 presents a brief overview of related work. In Section 3, our neural network architecture and training algorithm are presented. Section 4 reports experimental results. The conclusion is given in the final section.
2 Related Work
Relational learning models for knowledge graphs can be broadly divided into two groups: observable graph feature-based models and latent feature-based models. The intuition behind the former is that edges can be recovered from features extracted from the observable properties of the graph, so those models look at the direct correlation of patterns observed in a graph. They can be further divided into three classes: those that predict links using path ranking algorithms, those that infer new links using rules extracted from graphs, and those that link entities using a derived similarity between them. The latter try to find the correlation between nodes or edges in a graph through latent variables. We focus here on the latter, which are more closely related to this study.
What all latent feature-based models have in common is that they use latent features of entities to explain observable triples, and those features are not directly observed in the data (which is why they are called “latent”). A (subject, predicate, object) triple is usually represented by $(s, p, o)$, while the latent features of a triple are represented by three vectors $\mathbf{e}_s, \mathbf{e}_p, \mathbf{e}_o \in \mathbb{R}^d$, where $d$ is the dimensionality of the latent feature vector representations. The key intuition behind such models is that the relationships between entities can be inferred from the interactions of their latent features. Below, we briefly review several typical ways to model these interactions for predicting new facts.
The structured embedding (SE) model derives the probability of relationships from the distances between latent feature representations of entities, and models the score of a triple $(s, p, o)$ as $\lVert \mathbf{W}_p^{s}\mathbf{e}_s - \mathbf{W}_p^{o}\mathbf{e}_o \rVert$, where the matrices $\mathbf{W}_p^{s}$ and $\mathbf{W}_p^{o}$ transform the feature vectors of entities to model relationships specifically for the relation $p$. To reduce the number of parameters in the SE model, Bordes et al. (2013) proposed the TransE model, which translates the latent feature representations by a relation-specific offset instead of a linear transformation. The score of a triple is then defined as $\lVert \mathbf{e}_s + \mathbf{e}_p - \mathbf{e}_o \rVert$. The main shortcoming of this model is that the latent features of two entities do not interact with each other, because they are independently mapped to a common space. TransH projects the entity vectors onto relation-specific hyperplanes to alleviate the many-to-many problem. Shortly after TransH, TransR and TransD were proposed; they view entities and relations as living in two separate spaces and differ in the technique used to map from the entity space to the relation space. Our architecture uses those models as basic building blocks, and we show that their performance can be significantly improved using our training method.
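As a concrete illustration of the distance-based scores discussed above, the following sketch computes SE-style and TransE-style scores on plain Python lists. It assumes the L1 norm and the convention that a lower score means a more plausible triple; the function names and toy vectors are illustrative only.

```python
def l1_norm(v):
    """Sum of absolute values of the components of v."""
    return sum(abs(x) for x in v)

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def se_score(w_subj, w_obj, e_s, e_o):
    """SE-style score: distance between the two transformed entity vectors."""
    left = matvec(w_subj, e_s)
    right = matvec(w_obj, e_o)
    return l1_norm([a - b for a, b in zip(left, right)])

def transe_score(e_s, e_p, e_o):
    """TransE-style score: distance between e_s + e_p and e_o."""
    return l1_norm([s + p - o for s, p, o in zip(e_s, e_p, e_o)])

# Toy 2-dimensional embeddings (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
e_s, e_p = [1.0, 0.0], [0.0, 1.0]
good_object, bad_object = [1.0, 1.0], [0.0, 0.0]

perfect = transe_score(e_s, e_p, good_object)   # translation lands on the object
worse = transe_score(e_s, e_p, bad_object)
se_example = se_score(identity, identity, e_s, good_object)
```

SE needs two $d \times d$ matrices per relation, whereas TransE needs a single $d$-dimensional vector, which is the parameter saving mentioned in the text.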
RESCAL is a bilinear model that explains triples by capturing the pairwise interactions between two entity feature vectors using multiplicative terms. The score of a triple $(s, p, o)$ is modeled as $\mathbf{e}_s^\top \mathbf{W}_p \mathbf{e}_o$, where $\mathbf{W}_p$ is a weight matrix that specifies how the latent features interact for the relation $p$. This bilinear model has been extended with diagonal weight matrices and with complex-valued embeddings. RESCAL can be seen as a special case of tensor factorization methods, and similar methods have been explored for predicting triples and for modeling highly multi-relational data. Socher et al. (2013) stated that bilinear models can only capture linear interactions and might be unable to fit more complex relations, and they used a neural tensor to directly relate the two entity feature vectors across multiple dimensions. Convolutional neural networks have also been tried to capture similar relations [12, 25]. Even though neural tensors and convolutional networks have much more expressive power, which is useful for modelling large knowledge bases, they have more parameters than the SE and RESCAL models. Dong et al. (2014) and Yang et al. (2015) reported that such models tend to overfit, at least on relatively small datasets.
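The bilinear score and its diagonal restriction can be illustrated as follows. This is a sketch with hypothetical function names and toy values; it only shows the arithmetic of the score, not how the parameters are learned.

```python
def bilinear_score(e_s, w_p, e_o):
    """RESCAL-style score: sum over i, j of e_s[i] * w_p[i][j] * e_o[j]."""
    return sum(e_s[i] * w_p[i][j] * e_o[j]
               for i in range(len(e_s))
               for j in range(len(e_o)))

def diagonal_score(e_s, diag_p, e_o):
    """Same score when the relation matrix is restricted to its diagonal."""
    return sum(s * w * o for s, w, o in zip(e_s, diag_p, e_o))

e_s, e_o = [1.0, 2.0], [4.0, 5.0]
full = bilinear_score(e_s, [[1.0, 0.0], [0.0, 3.0]], e_o)
diag = diagonal_score(e_s, [1.0, 3.0], e_o)
# An off-diagonal weight lets feature 0 of the subject interact with
# feature 1 of the object, which the diagonal variant cannot express.
cross = bilinear_score(e_s, [[0.0, 1.0], [0.0, 0.0]], e_o)
```

The full matrix has $d^2$ parameters per relation while the diagonal variant has only $d$, at the price of dropping cross-feature interactions.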
The works most closely related to this study are the two previous studies [5, 32] in which an adversarial learning framework is applied. Their generators are trained to provide better negative samples for the discriminators than random selection does. Their generated negative samples in fact already exist in the training dataset, while our generator can produce unseen “plausible” examples. Besides, they train the generator by reinforcement learning, because the discrete sampling step prevents gradients from back-propagating. In our method, the generator is designed to take two elements of a triple as input and to produce the missing entity in the form of its vector, which makes the whole process easily trainable and fully differentiable.
3 Adversarial Learning-Based Framework
We describe here an adversarial learning-based framework to model the relation between two entities in a graph through latent variables, in which the entities are embedded into a continuous vector space, and the relations between them can be recovered from their distributed representations (or embeddings). The generative network (GN) is trained to deceive the discriminative network (DN) by gradually improving its ability to generate “just-like-truth” triples, while DN is taught to differentiate true triples from the generated ones as well as from randomly selected negative samples. Competition in this game drives both networks to improve their performance until the generated triples are indistinguishable from the genuine ones. In our framework, any relational learning model (e.g. TransE) can play the role of GN or DN in such a two-player game, and we explore both possibilities. In this section, we first formally introduce the architecture of the proposed framework, and then describe two implementations of this framework: one taking a translation-based model as the generator, and another using it as the discriminator.
A knowledge graph (KG) consists of a set of triples $(h, r, t)$, where $h$ and $t$ are entities and $r$ is a relation. We call $h$ the head and $t$ the tail of a triple, which states that $h$ has the relation $r$ with $t$. To embed the entities and relations of KG into a continuous vector space, an adversarial learning-based framework is designed to learn such embeddings so that the triples stored in KG can be recovered from those embeddings. The proposed framework is illustrated in Figure 1, and has two main components: a generator (GN) and a discriminator (DN). GN takes a head-relation pair $(h, r)$ as input and attempts to generate the vector representation of a tail that should be indistinguishable from that of the true tail $t$. DN is trained to distinguish the ground-truth triples from others by a score function, and the score of a triple represents DN’s confidence that the triple is true.
The training objective of DN is to differentiate the ground-truth triples from the triples generated by GN. However, in the early stages of training, GN is incapable of generating good negative samples for DN because GN is not yet well trained. Thus, another negative sample is added to train DN: a false triple constructed by replacing the correct tail with one randomly sampled from the entities in KG. Inspired by Wasserstein GAN, the loss function for DN’s part is formulated over a scoring function applied to each of these three triples.
In our definition, the lower the score is, the more likely the triple is true. As training progresses, DN learns to become a better triple classifier. The score of the positive sample is doubled in the loss in order to counterbalance the two types of negative samples.
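The exact loss is not reproduced in this text. The sketch below shows one form that is consistent with the description: the positive score counted twice, the scores of the generated and randomly corrupted triples subtracted, and lower scores meaning more plausible triples. The combination `2*f(true) - f(generated) - f(random)`, the function names, and the TransE-style scorer are assumptions made for illustration, not the authors' stated formula.

```python
def transe_distance(h, r, t):
    """Hypothetical scorer: L1 distance between h + r and t (lower = more plausible)."""
    return sum(abs(a + b - c) for a, b, c in zip(h, r, t))

def discriminator_loss(score, h, r, t_true, t_generated, t_random):
    """Assumed discriminator loss: 2*f(true) - f(generated) - f(random).

    Minimizing it pushes the score of the true triple down and the scores
    of both kinds of negative triples up.
    """
    return (2.0 * score(h, r, t_true)
            - score(h, r, t_generated)
            - score(h, r, t_random))

# Toy embeddings (hypothetical values).
h, r = [0.0, 0.0], [1.0, 1.0]
t_true = [1.0, 1.0]        # score 0
t_generated = [1.0, 2.0]   # score 1
t_random = [4.0, 4.0]      # score 6
loss = discriminator_loss(transe_distance, h, r, t_true, t_generated, t_random)
```

With one positive term weighted by two and two negative terms weighted by one, the positive and negative sides of the loss carry equal total weight, which is the balancing effect the text attributes to the doubling.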
GN is trained to generate the embedding of a tail for a given head-relation pair, and to make DN take the generated triple as a true one. Thus, the training loss of GN is defined over the score that DN assigns to the generated triple.
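A minimal form consistent with this description can be written as follows, assuming DN's scoring function is denoted $f$ and GN's output vector for the pair $(h, r)$ is denoted $G(h, r)$. Both symbols are introduced here only for illustration, and the expression is a sketch rather than the authors' exact formula.

```latex
\mathcal{L}_{G} \;=\; f\bigl(h,\, r,\, G(h, r)\bigr)
```

Under the lower-is-better convention for scores, minimizing this quantity drives the generated triple toward the range of scores that DN assigns to true triples.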
Upon convergence, GN has learned to generate “plausible” triples that are indistinguishable from the genuine ones, an ability that can be used for knowledge graph completion.
GN and DN are trained jointly in a two-player minimax game, and they share the same two embedding matrices: one holding a $d$-dimensional embedding for each entity in KG, and one holding a $d$-dimensional embedding for each relation, where $d$ is the dimensionality of embeddings.
As mentioned above, any relational learning model can serve as GN or DN. However, if overly complex models are used on both sides, they may suffer from a very large search space, which makes them difficult to train, especially in the adversarial learning setting. We reduce the search space by requiring at least one of GN and DN to adopt a simple but robust translation-based model, such as TransE, TransH, or TransD. We explore several variants of the proposed framework and mainly build two types of implementations: one using translation-based models as DN and another using them as GN.
3.2 Translation-based discriminator
In this setting, we use a multilayer perceptron (MLP) or a convolutional neural network (CNN) as the generator, and one of the translation-based models (TransE, TransH, or TransD) as the discriminator, which gives six different combinations.
When playing the role of generator, the CNN takes the concatenation of the vector representations of a head and a relation as input. A convolution with multiple filters is used to yield another feature vector by taking the dot product of the filter vectors with the input vector. After the input vector is convolved with the filter matrix, a non-linear function is applied, followed by a linear transformation. An MLP is a traditional feed-forward neural network consisting of multiple linear layers interleaved with non-linear functions.
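The MLP variant of the generator can be sketched as a forward pass over the concatenated head and relation vectors. The layer sizes, the ReLU non-linearity, and the hand-picked weights below are assumptions for illustration; the text does not fix these choices.

```python
def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise non-linearity (assumed; any non-linearity could be used)."""
    return [max(0.0, v) for v in x]

def mlp_generator(h, r, layers):
    """Map the concatenation [h; r] to a vector meant to act as a tail embedding.

    `layers` is a list of (weights, bias) pairs; the non-linearity is applied
    between layers but not after the last one.
    """
    x = list(h) + list(r)
    for i, (weights, bias) in enumerate(layers):
        x = linear(weights, bias, x)
        if i < len(layers) - 1:
            x = relu(x)
    return x

# Toy network: 4 inputs -> 3 hidden units -> 2 outputs (hypothetical weights).
layers = [
    ([[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, -1.0, 0.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]], [0.5, 0.0]),
]
generated_tail = mlp_generator([1.0, 2.0], [3.0, 4.0], layers)
```

Because the output is a dense vector rather than a sampled entity index, gradients from the discriminator's score can flow back into the generator without any discrete sampling step.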
The discriminator works as a maximum-margin classifier so that the distances between positive and negative samples are maximized with the chosen hyperplane.
3.3 Translation-based generator
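A translation-based model is used as the generator in this configuration. Taking TransH as an example, the generated triple in its vector representation takes the form of $(\mathbf{h}, \mathbf{r}, \mathbf{h}_\perp + \mathbf{r})$, where $\mathbf{h}$ and $\mathbf{r}$ denote the embeddings of a head and a relation respectively, and $\mathbf{h}_\perp = \mathbf{h} - \mathbf{w}_r^\top \mathbf{h}\,\mathbf{w}_r$. The weights in $\mathbf{w}_r$ are used to project the vector $\mathbf{h}$ onto the hyperplane defined for the relation $r$.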
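The hyperplane projection used by TransH, and the way a tail vector can be produced from it, can be illustrated as below. The sketch assumes the normal vector of the hyperplane has unit length and that the relation's translation vector lies in the hyperplane; all names and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_onto_hyperplane(v, normal):
    """Remove from v its component along `normal` (assumed to have unit length)."""
    scale = dot(normal, v)
    return [x - scale * n for x, n in zip(v, normal)]

def transh_generate_tail(head, translation, normal):
    """Project the head onto the relation hyperplane, then translate it."""
    projected = project_onto_hyperplane(head, normal)
    return [p + d for p, d in zip(projected, translation)]

# Toy 3-dimensional example: the hyperplane is the x-y plane.
head = [1.0, 2.0, 3.0]
normal = [0.0, 0.0, 1.0]        # unit normal of the relation hyperplane
translation = [0.5, 0.0, 0.0]   # lies inside the hyperplane
tail = transh_generate_tail(head, translation, normal)
```

Since the projection discards the component along the normal, distinct heads can map to the same point on the hyperplane, which is how TransH accommodates relations that are not one-to-one.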
When taking a translation-based model as a generator, the loss function for GN’s part is defined as:
This loss has two parts: one is defined to maximally separate the positive triples from the negative ones in the hyperspace, and the other is used to better deceive the discriminator by making use of its feedback. As discussed before, various neural networks, including MLP and CNN, are tried as discriminative networks to test several variants for the framework.
We evaluate our adversarial learning-based framework on two standard tasks for learning structured embedding of KG: link prediction and triple classification. For the link prediction, we report the results produced by generators, while for the triple classification, the performance of discriminators is reported.
We used WN18RR and FB15k-237 as datasets for link prediction, and WN11 and FB13 for triple classification. WN18RR and FB15k-237 were built to remove the reversible relations that exist in WN18 and FB15k, which make many test triples much easier to predict. WN18RR is a subset of WN18 obtained by removing such relations, and FB15k-237 is a subset of FB15k obtained by removing the redundancy. For each relation $r$, we add its reverse relation $r^{-1}$ to the dataset so that our GN always takes a head-relation pair as input, even when the entity to be predicted is a head. For example, the triple $(h, r, t)$ is expanded to $(h, r, t)$ and $(t, r^{-1}, h)$. However, we make sure that if a relation is in the training set, its reverse does not occur in the test set.
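The reverse-relation expansion can be sketched as a simple preprocessing step. The `_reverse` suffix used to name the reverse relation is an arbitrary convention chosen for this example.

```python
def add_reverse_triples(triples, suffix="_reverse"):
    """Return the triples plus one reversed copy of each.

    For (h, r, t) the reversed copy is (t, r + suffix, h), so predicting a
    head becomes predicting the tail of the reversed triple.
    """
    augmented = []
    for h, r, t in triples:
        augmented.append((h, r, t))
        augmented.append((t, r + suffix, h))
    return augmented

expanded = add_reverse_triples([("paris", "capital_of", "france")])
```

With this expansion a generator that maps a head-relation pair to a tail is enough to answer both "which tail?" and "which head?" queries.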
A set of negative triples is required to evaluate the triple classifier. We use the datasets released by Socher et al. (2013), in which one negative sample is added for each positive triple. The sizes of the datasets are listed in Table 1.
Table 2: Link prediction results.
|GN (MLP) + DN (TransE)|
|GN (CNN) + DN (TransE)|
|GN (TransE) + DN (MLP)|
|GN (TransE) + DN (CNN)|
|GN (MLP) + DN (TransH)||1555|
|GN (CNN) + DN (TransH)|
|GN (TransH) + DN (MLP)|
|GN (TransH) + DN (CNN)||52.0|
|GN (MLP) + DN (TransD)|
|GN (CNN) + DN (TransD)|
|GN (TransD) + DN (MLP)|
|GN (TransD) + DN (CNN)||173|
4.2 The Choice of Hyperparameters
We use grid search to determine the values of the hyperparameters from a few choices for each: the dimensionality of embeddings, the margin, the learning rate, the batch size, the weight decay, the number of critic iterations for the Wasserstein GAN, and the clipping threshold. We test MLPs with different numbers of layers and hidden sizes. For the convolutional network, we explore several numbers of filters.
4.3 Link Prediction
Link prediction aims to predict the missing entity $h$ or $t$ for a positive triple $(h, r, t)$. We evaluate the performance following the “filtered” setting: each test triple is ranked against all of its corrupted triples that do not appear in the training, validation, or test sets. We employ three widely used evaluation metrics: Mean Rank (MR), Mean Reciprocal Rank (MRR), and the proportion of test triples ranked in the top ten (Hits@10). The lower the MR, the higher the MRR, and the higher the Hits@10, the better.
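The filtered ranking and the three metrics can be sketched as follows, assuming that a lower score means a more plausible candidate. The function names, the dictionary-based interface, and the toy scores are illustrative assumptions rather than the evaluation code used in the paper.

```python
def filtered_rank(scores, true_entity, known_entities):
    """Rank of `true_entity` among the candidates (1 is best).

    `scores` maps each candidate entity to its score (lower = more plausible).
    Candidates in `known_entities`, i.e. other correct answers seen in the
    training, validation, or test sets, are skipped.
    """
    true_score = scores[true_entity]
    rank = 1
    for entity, score in scores.items():
        if entity == true_entity or entity in known_entities:
            continue
        if score < true_score:
            rank += 1
    return rank

def summarize(ranks, k=10):
    """Mean Rank, Mean Reciprocal Rank, and Hits@k over a list of ranks."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
        "Hits@k": sum(1 for r in ranks if r <= k) / n,
    }

# Toy candidate scores for one test query (hypothetical values).
scores = {"paris": 0.4, "rome": 0.1, "berlin": 0.9, "lyon": 0.2}
raw = filtered_rank(scores, "paris", known_entities=set())
filtered = filtered_rank(scores, "paris", known_entities={"rome"})
metrics = summarize([2, 4, 12], k=10)
```

Filtering matters because a candidate that outranks the test answer may itself be a correct answer, and penalizing the model for that would understate its quality.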
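On the validation set of WN18RR, the highest Hits@10 score is achieved with one configuration of the seven hyperparameters listed in Section 4.2 (dimensionality of embeddings, margin, learning rate, batch size, weight decay, number of critic iterations, and clipping threshold). We achieved the highest performance on FB15k-237 with similar hyperparameter values, differing in only one of them.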
The results of link prediction are shown in Table 2, in comparison with the baselines and four state-of-the-art systems. Our results are all achieved by the generators for this task. We list several different implementations of the framework, where GN indicates which model is used as the generator and DN indicates which is used as the discriminator in our adversarial learning-based framework. For example, the implementation “GN (MLP) + DN (TransE)” uses an MLP as the generator and a model based on TransE as the discriminator.
From these numbers, a handful of trends are readily apparent. First, the adversarial learning framework improves the performance of the baseline systems in most cases: the proposed method raises the average Hits@10 of the baseline systems on both WN18RR and FB15k-237. The largest gain is achieved by “GN (TransH) + DN (CNN)”, which improves over its baseline in both Hits@10 and MRR. Another striking result of these experiments is that “GN (TransH) + DN (CNN)” achieves state-of-the-art performance on WN18RR and a comparable result on FB15k-237 in Hits@10, using relatively simple models as its components.
Although “GN (TransH) + DN (CNN)” does not outperform ConvE on FB15k-237 in Hits@10 and MRR, it performs competitively. The main goal of this study is to investigate how well the proposed framework can further improve the baseline systems (such as TransE, TransD, and TransH) in comparison with the other adversarial learning-based methods. The best result on FB15k-237 is achieved by ConvE, at a relatively high computational cost: for each training triple, ConvE needs to compute the dot product of a tail vector with all the entity embeddings in KG, which might not scale well to large knowledge graphs. Furthermore, we tried only a few network configurations, and there are many ways (such as unsupervised pre-training and more carefully designed network architectures) in which our method could be improved further. “GN (TransH) + DN (CNN)” outperformed KBGAN by a significant margin; although a similar adversarial learning-based framework is applied, our method does not require reinforcement learning to train the networks, and the entire training process is fully differentiable.
4.4 Triple Classification
Triple classification evaluates how well a classifier distinguishes ground-truth triples from others. Given a triple, a threshold is required for the classifier to decide whether the triple is true. Specifically, if the score of the triple is less than the given threshold, the triple is classified as a fact, and otherwise as false. The threshold is chosen by maximizing the classification accuracy on the validation set. We report the performance of triple classification using the outputs of the discriminators. The experimental results are shown in Table 3. On the validation set of WN11, the highest accuracy is obtained with one configuration of the seven hyperparameters listed in Section 4.2. We achieved the highest performance on FB13 with similar hyperparameter values, differing in only one of them.
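The threshold selection described above can be sketched as a search over candidate cut points on the validation scores. Using the midpoints between consecutive sorted scores as candidates is a design choice of this example, and the toy scores and labels are hypothetical.

```python
def accuracy(scored, threshold):
    """Fraction of (score, is_true) pairs classified correctly.

    A triple is predicted to be true when its score is below the threshold.
    """
    correct = sum(1 for score, is_true in scored
                  if (score < threshold) == is_true)
    return correct / len(scored)

def choose_threshold(validation):
    """Pick the candidate threshold with the highest validation accuracy."""
    scores = sorted(score for score, _ in validation)
    candidates = ([scores[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(scores, scores[1:])]
                  + [scores[-1] + 1.0])
    return max(candidates, key=lambda th: accuracy(validation, th))

# Toy validation set of (score, label) pairs.
validation = [(0.2, True), (0.5, True), (0.9, False), (1.4, False), (0.7, False)]
threshold = choose_threshold(validation)
best_accuracy = accuracy(validation, threshold)
```

The chosen threshold is then applied unchanged to the test triples, so the reported accuracy is not tuned on the test set.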
Table 3: Triple classification accuracy.
|GN (CNN) + DN (TransE)|
|GN (TransH) + DN (CNN)||85.5|
The results of TransE and TransH are excerpted from previously published work. “unif” denotes that the negative samples are produced by replacing a head or a tail with a randomly selected entity, while “bern” denotes a similar sampling method in which the replacing entity is picked according to frequency. As shown in Table 3, “GN (CNN) + DN (TransE)” improves the accuracy of TransE on WN11, and “GN (TransH) + DN (CNN)” improves that of TransH on FB13. “GN (TransH) + DN (CNN)” achieved the highest accuracy on the WN11 dataset and comparable accuracy on the FB13 dataset, compared with the listed competitors.
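The text does not spell out the “bern” rule. The sketch below assumes it refers to the Bernoulli scheme commonly used with TransH, in which the head is replaced with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail for the relation. The function name and toy triples are hypothetical.

```python
from collections import defaultdict

def head_replacement_probability(triples):
    """Per relation, the probability of corrupting the head rather than the tail.

    Assumed rule: tph / (tph + hpt), with tph the mean number of distinct
    tails per (relation, head) and hpt the mean number of distinct heads
    per (relation, tail).
    """
    tails_of = defaultdict(set)   # (relation, head) -> set of tails
    heads_of = defaultdict(set)   # (relation, tail) -> set of heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)
    probabilities = {}
    for relation in {r for _, r, _ in triples}:
        tail_counts = [len(v) for (rel, _), v in tails_of.items() if rel == relation]
        head_counts = [len(v) for (rel, _), v in heads_of.items() if rel == relation]
        tph = sum(tail_counts) / len(tail_counts)
        hpt = sum(head_counts) / len(head_counts)
        probabilities[relation] = tph / (tph + hpt)
    return probabilities

triples = [
    ("ann", "has_child", "c1"), ("ann", "has_child", "c2"), ("ann", "has_child", "c3"),
    ("paris", "capital_of", "france"), ("rome", "capital_of", "italy"),
]
probs = head_replacement_probability(triples)
```

For a one-to-many relation such as the toy `has_child`, replacing the tail is more likely to hit another correct answer, so the scheme shifts probability toward replacing the head; under “unif” the two sides would be treated alike.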
We proposed a novel generative adversarial framework to learn structured embeddings of knowledge graphs, in which the generator is trained to recover the training data while the discriminator is trained to be a good triple classifier. Experimental results demonstrate that our method can improve classical relational learning models (e.g. TransE, TransD, and TransH) by a significant margin on both the link prediction and triple classification tasks. Unlike the few previous studies based on generative adversarial architectures, our generative network is able to generate unseen instances, whereas they use it merely as a negative sample selector for the discriminative network. The ability to directly generate the feature vector representations of unseen “plausible” entities makes the framework promising for practical integration with other intelligent systems, especially deep learning-based systems that focus on learning distributed representations. This distinguishing feature is analogous to the open-world assumption of description logics, in contrast to the modeling languages developed in the study of databases.
-  (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §3.1, §4.2.
-  (2009) Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500. Cited by: §2.
-  (2011) Learning structured embeddings of knowledge bases. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI’11), Cited by: §1, §1, §2, §2.
-  (2014) Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems (NIPS’14), Cited by: §1.
-  (2017) KBGAN: adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071. Cited by: §1, §2, §4.3, Table 2.
-  (2002) Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 104–111. Cited by: §1.
-  (2014) Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’14), Cited by: §1, §2, §3.1, §3.3, §4.4, Table 2, Table 3.
-  (2011) Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 529–539. Cited by: §2.
-  (2016) Learning first-order logic embeddings via matrix factorization. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16), Cited by: §2.
-  (2010) Relational retrieval using a combination of path-constrained random walks. Machine Learning 81 (1), pp. 53–67. Cited by: §2.
-  (2015) Embedding entities and relations for learning and inference in knowledge bases. In International Conference on Learning Representations, Cited by: §2, Table 2.
-  (2018) Convolutional 2D knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 1811–1818. Cited by: §2, §4.1, §4.3, Table 2.
-  (2012) Mining the semantic web - statistical learning for next generation knowledge bases. Data Mining and Knowledge Discovery 24 (3), pp. 613–662. Cited by: §2.
-  (2015) A review of relational machine learning for knowledge graphs. abs/1503.00759v3. Cited by: §1.
-  (2013) Translating embeddings for modeling multi-relational data. In Proceedings of Advances in Neural Information Processing Systems (NIPS’13), Cited by: §1, §3.1, §3.2, §4.3, Table 2, Table 3.
-  (2015) Semantically smooth knowledge graph embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 84–94. Cited by: §1, §2.
-  (2015) Learning to represent knowledge graphs with Gaussian embedding. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 623–632. Cited by: Table 3.
-  (2007) DBpedia: a nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC’07), Cited by: §1.
-  (2012) A latent factor model for highly multi-relational data. In Advances in neural information processing systems, pp. 3167–3175. Cited by: §1, §2.
-  (2015) Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 687–696. Cited by: §2, §3.1, Table 2.
-  (2011) A three-way model for collective learning on multi-relational data. In Proceedings of the International Conference on Machine Learning (ICML’11), Cited by: §1, §2, §2.
-  (2012) Factorizing YAGO: scalable machine learning for linked data. In Proceedings of the International Conference on World Wide Web (WWW’12), Cited by: §1.
-  (1995) WordNet: a lexical database for English. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1.
-  (2013) Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pp. 926–934. Cited by: §1, §2, §4.1.
-  (2017) A novel embedding model for knowledge base completion based on convolutional neural network. arXiv preprint arXiv:1712.02121. Cited by: §2.
-  (2016) Compositional learning of embeddings for relation paths in knowledge bases and text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1434–1444. Cited by: §2.
-  (2012) Predicting RDF triples in incomplete knowledge bases with tensor factorization. In Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC’12), Cited by: §2.
-  (2016) Representation learning of knowledge graphs with hierarchical types. In IJCAI, Cited by: §1.
-  (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’08), Cited by: §1.
-  (2015) Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66. Cited by: §4.1.
-  (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning (ICML), pp. 2071–2080. Cited by: §2, Table 2.
-  (2018) Incorporating gan for negative sampling in knowledge representation learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.
-  (2005) The SphereSearch engine for unified ranked retrieval of heterogeneous XML and web documents. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB’05), Cited by: §1.
-  (2007) Yago: a core of semantic knowledge. In Proceedings of the International Conference on World Wide Web (WWW’07), Cited by: §1.
-  (2010) Building Watson: an overview of the DeepQA project. AI Magazine 31 (3), pp. 59–79. Cited by: §1.
-  (2016) TransG: a generative model for knowledge graph embedding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 2316–2325. Cited by: Table 3.
-  (2018) Asynchronous bidirectional decoding for neural machine translation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
-  (2018) NSCaching: simple and efficient negative sampling for knowledge graph embedding. arXiv preprint arXiv:1812.06410. Cited by: Table 2.
-  (2015) Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI conference on artificial intelligence, Cited by: §1, §2, §2.