Large-scale knowledge graphs collect an increasing amount of structured data, where nodes correspond to entities and edges reflect the relationships between head and tail entities. This graph-structured knowledge base has become a resource of enormous value, with potential applications such as search engines, recommendation systems and question answering systems. However, it is still incomplete and cannot meet the growing needs of intelligent systems. To solve this problem, many studies [2, 20] achieve notable performance on automatically finding and filling the missing facts of existing relations. For newly added relations, however, these methods remain limited, and obtaining adequate training instances for every new relation is increasingly impractical. Therefore, people prefer an automatic completion solution, or even a more radical method that recognizes unseen classes without seeing any training instances.
Zero-shot learning aims to recognize objects or facts of new classes (unseen classes) with no examples being seen during the training stage. Correspondingly, an appealing characteristic of human learning is that, with a certain accumulation of knowledge, people are able to recognize new categories merely from their text descriptions. Therefore, instead of learning from instances, the semantic features of new classes can be reflected by their textual descriptions. Moreover, textual descriptions contain rich and unambiguous information and can be easily accessed from dictionaries, encyclopedia articles or various online resources, which is critical for large-scale recognition tasks.
In this paper, we propose a zero-shot relational learning method for knowledge graphs. As shown in Figure 1, we convert zero-shot learning into a knowledge transfer problem. We focus on how to generate reasonable relation embeddings for unseen relations merely from their text descriptions. Once trained, this system is capable of generating relation embeddings for arbitrary relations without fine-tuning. With these relation embeddings, the facts of unseen relations can be recognized simply by cosine similarity. To meet these requirements, the first challenge is how to establish an effective knowledge transfer process from the text semantic space to the knowledge graph semantic space. We leverage conditional GANs to generate plausible relation embeddings from text descriptions and to provide inter-class diversity for unseen relations. The second challenge is suppressing the noise in text descriptions. Human language always includes words (such as function words) that are irrelevant for identifying target relations. As in Figure 1, the bold words are more critical for the meaning of the relation League_players; therefore, weighting all words indiscriminately leads to inferior performance. To address this problem, we adopt a simple bag-of-words model based on word embeddings and, at the same time, calculate TF-IDF features to down-weight the less relevant words for zero-shot learning. Our main contributions are three-fold:
We are the first to consider zero-shot learning for knowledge graph completion, and propose a generative adversarial framework to generate reasonable relation embeddings for unseen relations merely from text descriptions;
Our method is model-agnostic and can be potentially applied to any version of KG embeddings;
We present two newly constructed datasets for zero-shot knowledge graph completion and show that our method achieves better performance than various embedding-based methods.
projects relations and entities from symbolic space to vector space, and the missing links of the existing relations can be inferred via simple vector operations. Subsequently, many notable embedding-based studies [23, 20] are proposed for knowledge graph completion. However, these methods cannot handle newly added relations. In contrast, the proposed method retains good recognition ability for the facts of unseen relations. Xiong et al. (2018) propose a few-shot learning method that learns a matching network and predicts the unseen relation facts by calculating their matching score with a few labeled instances. In contrast, our method follows the zero-shot setting and does not need any training instances for unseen relations. KBGAN adopts adversarial training to learn a better discriminator via selecting high-quality negative samples, but it still focuses on the link prediction of existing relations.
The core of zero-shot learning (ZSL) is realizing knowledge sharing and inductive transfer between the seen and the unseen classes, and the common solution is to find an intermediate semantic representation. For this purpose, Akata et al. (2013) propose an attribute-based model that learns a transformation matrix to build the correlations between attributes and instances. However, attribute-based methods still depend on a great deal of human labor to create attributes, and are sensitive to the quality of attributes. Text-based methods create the intermediate semantic representation directly from the available online unstructured text information. To suppress the noise in raw text, Wang et al. (2019) leverage TF-IDF features to down-weight the irrelevant words. As for model selection, we are greatly inspired by the ZSL framework of Zhu et al. (2018), which leverages a conditional GAN model to realize zero-shot learning on the image classification task. Currently, the majority of ZSL research comes from the computer vision domain. In the field of Natural Language Processing, Artetxe and Schwenk (2019) use a single sentence encoder to handle multilingual tasks by training the target model on only a single language. To the best of our knowledge, this work is the first on zero-shot relational learning for knowledge graphs.
Zero-Shot Learning Settings
Here we present the problem definition and some notations of zero-shot learning based on knowledge graph completion task. Knowledge graph is a directed graph-structured knowledge base and constructed from tremendous relation fact triples . Since the proposed work aims to explore the recognition ability when meeting the newly-added relations, our target can be formulated as predicting the tail entity given the head entity and the query relation . To be more specific, for each query tuple , there are a ground-truth tail entity and a candidate set ; our model needs to assign the highest ranking to against the rest candidate entities . According to the zero-shot setting, there are two different relation sets, the seen relation set and the unseen relation set , and obviously .
At the start, we have a background knowledge graph that collects a large scale of triples , and is available during the zero-shot training stage. With this knowledge graph, we establish a training set for the seen relations . During testing, the proposed model is to predict the relation facts of unseen relations . As for textual description, we automatically extract an online textual description for each relation in . For feasibility, we only consider a closed set of entities; more specifically, each entity that appears in the testing triples is still in the entity set . Thus, our testing set can be formulated as . As in the training process, the ground-truth tail entity needs to be correctly recognized by ranking it against the candidate tail entities . We leave out a subset of as the validation set by removing all training instances of the validation relations.
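The split described above can be illustrated with a minimal sketch. It assumes triples are plain `(head, relation, tail)` tuples and approximates the closed entity set by the entities that occur in the training triples; the function name `split_zero_shot` and the toy relation names are hypothetical and not taken from the paper.

```python
def split_zero_shot(triples, unseen_relations):
    """Partition (head, relation, tail) triples so that training relations
    and testing relations are disjoint."""
    train, test = [], []
    for head, relation, tail in triples:
        target = test if relation in unseen_relations else train
        target.append((head, relation, tail))
    # Closed-set assumption: keep a test triple only if both of its
    # entities already occur somewhere in the training triples.
    known_entities = {e for h, _, t in train for e in (h, t)}
    test = [(h, r, t) for h, r, t in test
            if h in known_entities and t in known_entities]
    return train, test


triples = [
    ("a", "plays_for", "x"),
    ("b", "plays_for", "x"),
    ("a", "born_in", "y"),
    ("x", "league_players", "a"),
    ("x", "league_players", "z"),  # "z" never appears in a training triple
]
train, test = split_zero_shot(triples, unseen_relations={"league_players"})
```

In this toy example the last triple is dropped from the test set because its tail entity is outside the closed entity set.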
Generative Adversarial Models
Generative adversarial networks have enjoyed considerable success in generating realistic objects, especially in the image domain. The generator aims to synthesize reasonable pseudo data from random variables, and the discriminator is to distinguish them from the real-world data. Besides random variables, Zhang et al. (2017) and Zhu et al. (2018) have shown that the generator is capable of transferring knowledge from textual inputs. The desired solution of this game is a Nash equilibrium; otherwise, training is prone to unstable behavior and mode collapse. Recently, many works [1, 9] have been proposed to effectively alleviate this problem. Compared with the non-saturating GAN (footnote 1: Goodfellow's team clarified that the standard GAN should be uniformly called the non-saturating GAN), WGAN optimizes the original objective by utilizing the Wasserstein distance between real and fake distributions. On this basis, Gulrajani et al. (2017) propose a gradient penalty strategy as an alternative to the weight clipping strategy of WGAN, which better satisfies the Lipschitz constraint. Miyato et al. (2018) introduce spectral normalization to further stabilize the training of the discriminator. In practice, our model benefits considerably from these strategies.
Besides, because KG triples come from different relations, our task should be regarded as a class-conditional generation problem, which is common in real-world datasets. ACGAN adds an auxiliary category recognition branch to the cost function of the discriminator and improves the diversity of the generated samples (footnote 2: Miyato and Koyama (2018) propose a projection-based way to alleviate mode collapse when dealing with too many classes, but it is not suitable for our margin ranking loss). Spectral normalization is also beneficial to the diversity of the synthetic data.
In this section, we describe the proposed model for zero-shot knowledge graph relational learning. As shown in Figure 3, the core of our approach is the design of a conditional generative model that learns qualified relation embeddings from raw text descriptions. Fed with text representations, the generator produces relation embeddings that reflect the corresponding relational semantic information in the knowledge graph feature space. Based on this, the prediction of unseen relations is converted to a simple supervised classification task. The discriminator, in contrast, seeks to separate the fake data from the real data distribution and identifies the relation type as well. For real data representations, it is worth mentioning that we utilize a feature encoder to generate a reasonable real data distribution from KG embeddings. The feature encoder is trained in advance on the training set and fixed during the adversarial training process.
Traditional KG embeddings fit well on the seen relation facts during training; however, the optimal zero-shot feature representations should provide a cluster-structured distribution for both seen and unseen relation facts. Therefore, we design a feature encoder to learn a better data distribution from the pretrained KG embeddings and one-hop structures.
Network Architecture: The feature encoder consists of two sub-encoders, the neighbor encoder and the entity encoder. To remain feasible on real-world large-scale KGs, for each entity , we only consider the one-hop neighbors . Therefore, we adopt the neighbor encoder to generate structural representations. Given a KG embedding matrix of dimension , we first utilize an embedding layer to look up the corresponding neighbor entity and relation embeddings , . Then, the structure-based representation of entity is calculated  as below,
where is the tanh activation function, and denotes the concatenation operation. For scalability, we set an upper limit on the number of neighbors. Besides, we also apply a simple feed-forward layer as the entity encoder to extract information from the entity pair itself,
To sum up, as shown in Figure 2, the relation fact representation is formulated as the concatenation of the neighbor embeddings , and the entity pair embedding ,
where , , are the learned parameters.
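As an illustration of the architecture just described, the following is a minimal sketch in plain Python. It assumes the neighbor encoder averages an affine projection of each concatenated (relation, entity) neighbor pair before applying tanh, and that the entity encoder applies one shared feed-forward layer with tanh to the head and the tail; these details, the parameter names `W1`, `b1`, `W2`, `b2`, and the toy dimensions are assumptions made for the example. Only the neighbor cap of 50 is a value reported in the paper.

```python
import math
import random


def linear(weights, bias, vector):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def neighbor_encoder(neighbors, weights, bias, max_neighbors=50):
    """Project each (relation, entity) neighbor pair, average, apply tanh."""
    neighbors = neighbors[:max_neighbors]  # upper limit for scalability
    # rel + ent is list concatenation of the two embeddings
    projected = [linear(weights, bias, rel + ent) for rel, ent in neighbors]
    mean = [sum(column) / len(projected) for column in zip(*projected)]
    return [math.tanh(x) for x in mean]


def entity_encoder(head, tail, weights, bias):
    """Shared feed-forward layer on head and tail; outputs are concatenated."""
    def encode(vector):
        return [math.tanh(x) for x in linear(weights, bias, vector)]
    return encode(head) + encode(tail)


def fact_representation(head_nb, tail_nb, head, tail, params):
    """Concatenate both neighbor embeddings with the entity-pair embedding."""
    u_head = neighbor_encoder(head_nb, params["W1"], params["b1"])
    u_tail = neighbor_encoder(tail_nb, params["W1"], params["b1"])
    u_pair = entity_encoder(head, tail, params["W2"], params["b2"])
    return u_head + u_pair + u_tail


rng = random.Random(0)
d, hidden = 4, 3  # toy sizes, not the paper's settings


def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]


params = {
    "W1": [rand_vec(2 * d) for _ in range(hidden)], "b1": rand_vec(hidden),
    "W2": [rand_vec(d) for _ in range(hidden)], "b2": rand_vec(hidden),
}
head, tail = rand_vec(d), rand_vec(d)
head_nb = [(rand_vec(d), rand_vec(d)) for _ in range(3)]
tail_nb = [(rand_vec(d), rand_vec(d)) for _ in range(2)]
x = fact_representation(head_nb, tail_nb, head, tail, params)
```

The sketch does not handle entities without neighbors; a real implementation would need a padding or fallback strategy for that case.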
Pretraining Strategy: The core of this pretraining step is to learn a cluster-structured data distribution that reflects a higher intra-class similarity and a relatively lower inter-class similarity. The traditional supervised approach with cross-entropy loss penalizes other classes too heavily and is impracticable for unseen classes. Thus, we adopt an effective matching-based approach via a margin ranking loss . For each relation , in one training step, we first randomly take out reference triples from the training set, a batch of positive triples from the rest of the training set, and a batch of negative triples (footnote 3: the negative triples are generated by polluting the tail entities). Then we use the feature encoder to generate the reference embedding , and calculate its cosine similarity respectively with and as and . Therefore, the margin ranking loss can be described as below,
where is the parameter set to learn and denotes the margin. The best parameters of the feature encoder are determined by the validation set .
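The matching-based objective can be sketched as follows. This is a minimal illustration that assumes the usual hinge form, max(0, margin + score_neg - score_pos), averaged over paired positive and negative triples; the function names and the toy two-dimensional vectors are hypothetical, and the reference embedding is passed in directly rather than computed by the feature encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def margin_ranking_loss(reference, positives, negatives, margin):
    """Mean hinge loss over paired positive and negative representations."""
    losses = []
    for pos, neg in zip(positives, negatives):
        score_pos = cosine(reference, pos)
        score_neg = cosine(reference, neg)
        # Penalize only when the negative is not at least `margin` below the positive.
        losses.append(max(0.0, margin + score_neg - score_pos))
    return sum(losses) / len(losses)


reference = [1.0, 0.0]
positives = [[1.0, 0.0], [1.0, 0.0]]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_small_margin = margin_ranking_loss(reference, positives, negatives, margin=0.5)
loss_large_margin = margin_ranking_loss(reference, positives, negatives, margin=1.5)
```

With the smaller margin both pairs are already separated, so the loss vanishes; the larger margin leaves the first pair inside the margin and yields a positive loss.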
Generative Adversarial Model
Generator: The generator is to generate plausible relation embeddings from textual descriptions. First, for text representations, we simply adopt the bag-of-words method, where words are encoded with the pretrained word embeddings [12, 15] as in Figure 3. To suppress noise, we first remove stop-words and punctuation, and then evaluate the importance of the remaining words via TF-IDF features . Thus, the text embedding is the vector sum of word embeddings weighted by TF-IDF values. To meet the GAN requirements, we concatenate each text embedding with a random vector sampled from a Gaussian distribution. As in Figure 3, the following knowledge transfer process is completed by two fully-connected (FC) layers and a layer normalization operation. So, relation embedding is generated by the generator with parameters . To avoid mode collapse and improve diversity, we adopt the Wasserstein loss and an additional classification loss. This classification loss is formulated as the margin ranking loss in equation 4. Here, the cluster center is regarded as the real relation representation, where is the number of facts of relation . Thus, positive scores are calculated from and ; negative scores are calculated from and negative fact representations, where negative facts are generated by polluting tail entities. In addition, visual pivot regularization  is also applied to provide enough inter-class discrimination.
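The construction of the generator input can be sketched as below. This is a minimal illustration under several assumptions: the stop-word list and the three-dimensional word vectors are toy placeholders, the smoothed IDF formula is one common variant rather than the one used in the paper, and all function names are hypothetical. The noise dimension of 15 matches the value reported in the implementation details.

```python
import math
import random

STOP_WORDS = {"the", "a", "of", "in", "is", "that", "for"}  # toy list (assumption)


def tokenize(text):
    """Lowercase, replace punctuation with spaces, drop stop-words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def tfidf_weights(doc_tokens, corpus_tokens):
    """Term frequency times a smoothed inverse document frequency."""
    n_docs = len(corpus_tokens)
    weights = {}
    for word in set(doc_tokens):
        tf = doc_tokens.count(word) / len(doc_tokens)
        df = sum(1 for tokens in corpus_tokens if word in tokens)
        weights[word] = tf * (math.log((1 + n_docs) / (1 + df)) + 1.0)
    return weights


def text_embedding(weights, word_vectors, dim):
    """TF-IDF-weighted sum of word vectors; unknown words are skipped."""
    total = [0.0] * dim
    for word, weight in weights.items():
        if word in word_vectors:
            total = [t + weight * x for t, x in zip(total, word_vectors[word])]
    return total


def generator_input(text_vec, noise_dim, rng):
    """Concatenate the text embedding with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return text_vec + noise


descriptions = [
    "The players that play in the league.",
    "A person employed for earning a livelihood.",
]
corpus = [tokenize(d) for d in descriptions]
tokens = corpus[0]
weights = tfidf_weights(tokens, corpus)
word_vectors = {"players": [1.0, 0.0, 0.0], "league": [0.0, 1.0, 0.0]}  # toy vectors
text_vec = text_embedding(weights, word_vectors, dim=3)
z = generator_input(text_vec, noise_dim=15, rng=random.Random(0))
```

Because the noise is resampled on every call, the same description yields a different generator input each time, which is what allows an arbitrary number of relation embeddings to be generated for one relation.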
Discriminator: The discriminator attempts to distinguish whether an input is the real data or the fake one ; besides, it also needs to correctly recognize their corresponding relation types. As in Figure 3, the input features are first transformed via an FC layer with Leaky ReLU . Following this, there are two network branches. The first branch is an FC layer that acts as a binary classifier to separate real data from fake data, and we utilize the Wasserstein loss as well. The other branch handles the classification of relation types. In order to stabilize training and eliminate mode collapse, we also adopt the gradient penalty to enforce the Lipschitz constraint. It penalizes the model if the gradient norm moves away from its target norm value 1. In summary, the loss function of the discriminator is formulated as:
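As a reminder of the gradient penalty mentioned in this paragraph, the generic term from Gulrajani et al. (2017) is commonly written as below. This is the standard WGAN-GP penalty only, not the full discriminator objective of this paper, and the notation is assumed for illustration: $D$ is the discriminator, $x$ a real sample, $\tilde{x}$ a generated sample, $\hat{x}$ a random interpolation between them, and $\lambda$ the penalty weight.

```latex
\mathcal{L}_{GP} = \lambda \, \mathbb{E}_{\hat{x}}
  \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The squared deviation from 1 is what "penalizes the model if the gradient norm moves away from its target norm value 1".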
Predicting Unseen Relations
After adversarial training, given a relation textual description , the generator can generate its plausible relation embedding . For a query tuple , the similarity ranking value can be calculated by the cosine similarity between and . It is worth mentioning that, since can be sampled indefinitely, we can generate an arbitrary number of generated relation embeddings . For better generalization, we utilize the average cosine similarity value as the final ranking score,
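The ranking step can be illustrated with a minimal sketch. It assumes each candidate tail entity has already been turned into a fact representation (for example by the feature encoder) that lives in the same space as the generated relation embeddings; the function names, candidate names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranking_score(generated_relations, candidate_fact):
    """Average cosine similarity over all generated relation embeddings."""
    total = sum(cosine(rel, candidate_fact) for rel in generated_relations)
    return total / len(generated_relations)


def rank_candidates(generated_relations, candidate_facts):
    """Candidate names sorted from the highest to the lowest score."""
    scores = {name: ranking_score(generated_relations, vec)
              for name, vec in candidate_facts.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Three embeddings generated for the same relation with different noise samples.
generated = [[1.0, 0.1], [0.9, -0.1], [1.0, 0.0]]
candidates = {
    "tail_a": [1.0, 0.0],
    "tail_b": [0.0, 1.0],
    "tail_c": [-1.0, 0.0],
}
ranking = rank_candidates(generated, candidates)
```

Averaging over several generated embeddings smooths out the effect of any single noise sample on the final ranking.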
Datasets and Evaluation Protocols
|Dataset||# Ent.||# Triples||# Train/Dev/Test|
KG Triples: Because no zero-shot relational learning dataset is available for knowledge graphs, we construct two datasets from existing KG datasets. We select NELL (footnote 4: http://rtw.ml.cmu.edu/rtw/) and Wikidata (footnote 5: https://pypi.org/project/Wikidata/) for two reasons: their large scale and the existence of official relation descriptions. For NELL, we take the latest dump and remove the inverse relations. The dataset statistics are presented in Table 1.
Textual Description: NELL and Wikidata are two well-configured knowledge graphs. Our textual descriptions combine several sources of information. For NELL, we integrate the relation description and its entity type descriptions. For Wikidata, each relation is represented as a property item. Besides its property description, we also leverage the attributes P31, P1629, P1855 as additional descriptions.
In our experiments, the baselines include three commonly-used KG embedding methods: TransE , DistMult  and ComplEx . These original models cannot handle zero-shot learning. Therefore, based on these three methods, we propose three zero-shot baselines, ZS-TransE, ZS-DistMult and ZS-ComplEx. Instead of randomly initializing a relation embedding matrix to represent relations, we add a feed-forward network with the same structure as our generator (footnote 6: this feed-forward network does not receive random noise as input) to calculate relation embeddings for these three methods. Likewise, we utilize text embeddings as input and fine-tune this feed-forward network and the entity embeddings via their original objectives. Under this setting, the unseen relation embeddings can be calculated via their text embeddings, and the unseen relation facts can be predicted via their original score functions. RESCAL  cannot directly adopt the same feed-forward network for zero-shot learning; for a fair comparison, we do not consider this KG embedding method.
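The idea behind the zero-shot baselines can be sketched as follows. This is a simplified illustration, not the baseline implementation: a single affine layer with tanh stands in for the feed-forward network (which in the paper mirrors the generator structure), the TransE score uses the L2 distance, and all names and toy values are assumptions.

```python
import math


def relation_from_text(text_vec, weights, bias):
    """Stand-in for the feed-forward network: one affine layer plus tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, text_vec)) + b)
            for row, b in zip(weights, bias)]


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2
                          for h, r, t in zip(head, relation, tail)))


def distmult_score(head, relation, tail):
    """DistMult plausibility: element-wise trilinear product, summed."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


text_vec = [1.0, 0.0]                 # toy text embedding of an unseen relation
weights = [[0.5, 0.0], [0.0, 0.5]]    # toy network parameters
bias = [0.0, 0.0]
relation = relation_from_text(text_vec, weights, bias)

head = [0.0, 0.0]
good_tail = list(relation)            # satisfies head + relation == tail exactly
bad_tail = [-1.0, 1.0]
good = transe_score(head, relation, good_tail)
bad = transe_score(head, relation, bad_tail)
```

Because the relation embedding is a function of the text embedding, an unseen relation can be scored with the original score function without ever having a row in a relation embedding matrix.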
For the NELL-ZS dataset, we set the embedding size as 100. For Wiki-ZS, we set the embedding size as 50 for faster training. The three aforementioned baselines are implemented based on the open-source knowledge embedding toolkit OpenKE (footnote 7: https://github.com/thunlp/OpenKE), and their hyperparameters are tuned using the Hits@10 metric on the validation set . The proposed generative method uses the pre-trained KG embeddings as input, which are trained on the triples in the training set. For TransE and DistMult, we directly use their 1-D vectors. For ComplEx, we run two experiments by respectively using the real embedding matrix and the imaginary embedding matrix as in Table 2. For both the feature encoder and the generative model, we adopt Adam  for parameter updates, and the margin is set as 10.0. For the feature encoder, the upper limit of the neighbor number is 50, the number of reference triples in one training step is 30, and the learning rate is . For the generative model, the learning rate is , and , are set as 0.5, 0.9 respectively. For each generator update, the discriminator is updated 5 times. The dimension of the random vector is 15, and the number of generated relation embeddings is 20. Spectral normalization is applied to both the generator and the discriminator. These hyperparameters are also tuned on the validation set . As for word embeddings, we directly use the released word embedding set GoogleNews-vectors-negative300.bin (footnote 8: http://code.google.com/p/word2vec/) of dimension 300.
|Relations||# Can. Num.||# Cos. Sim.||ZSGAN||ZS-DistMult||ZSGAN||ZS-DistMult|
The link prediction results of our method and the baselines are shown in Table 2. Even though NELL-ZS and Wiki-ZS have different scales of triples and relation sets, the proposed generative method still achieves consistent improvements over various baselines on both zero-shot datasets. It demonstrates that the generator successfully finds the intermediate semantic representation to bridge the gap between seen and unseen relations and generates reasonable relation embeddings for unseen relations merely from their text descriptions. Therefore, once trained, our model can be used to predict arbitrary newly added relations without fine-tuning, which is significant for real-world knowledge graph completion.
Model-Agnostic Property: From the results of the baselines, we can see that their performance is sensitive to the particular KG embedding method. Taking MRR and Hits@10 as examples, ZS-DistMult yields respectively 0.138 and 12.3% higher performance than ZS-TransE on the NELL-ZS dataset. However, our method achieves relatively consistent performance no matter which KG embedding matrix is used.
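For readers less familiar with the two metrics quoted above, the following minimal sketch shows how MRR and Hits@k are typically computed from the rank of the ground-truth tail entity in each query. The helper names and the toy ranks are hypothetical, and tie handling (a tied candidate does not push the ground truth down) is an assumption of this sketch.

```python
def rank_of_truth(scores, truth):
    """1-based rank of the ground-truth candidate under descending scores."""
    better = sum(1 for name, score in scores.items()
                 if name != truth and score > scores[truth])
    return 1 + better


def mrr(ranks):
    """Mean reciprocal rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks, k):
    """Fraction of queries whose ground truth is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


example_rank = rank_of_truth({"a": 0.9, "b": 0.5, "c": 0.7}, truth="c")
ranks = [1, 3, 12, 2]  # toy ranks for four queries
mrr_value = mrr(ranks)
hits10 = hits_at(ranks, 10)
hits1 = hits_at(ranks, 1)
```

MRR rewards every improvement in rank, whereas Hits@10 only registers whether the ground truth enters the top ten, which is why the two metrics can move by different amounts across embedding methods.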
Analysis of Textual Representations
Figure 4 illustrates the statistics of the text descriptions for the two datasets. On the whole, the textual descriptions of NELL-ZS are longer than those of Wiki-ZS. However, after calculating their TF-IDF values, the number of highly-weighted words in both datasets falls in [2, 5]. For example, the highly-weighted words of the relation Worker are livelihood, employed and earning. This demonstrates the capacity for noise suppression. As for word representations (footnote 9: ZSGAN is not limited to a particular type of word embedding), besides Word2Vec, we also try the contextualized word representations from BERT (footnote 10: we use the uncased BERT-Base model of hidden size 768) as in Table 4. But their performance is less than satisfactory for two reasons: their high dimension and the sequence-level information involved in the representations. It is difficult for the generator to reduce the dimension and extract discriminative features, so it is hard for the GAN to reach a Nash equilibrium.
Quality of Generated Data
In Table 3, we analyze the quality of the relation embeddings generated by our generator and compare the results for different relations against ZS-DistMult, since ZS-DistMult is the best baseline model in Table 2. Unlike images, our generated data cannot be inspected visually. Instead, we calculate the cosine similarity between the generated relation embeddings and the cluster center of their corresponding relations. It can be seen that our method indeed generates plausible relation embeddings for many relations and that the link prediction performance is positively correlated with the quality of the relation embeddings.
With respect to text information, we adopt the simple bag-of-words model rather than a neural-network-based text encoder, such as CNN and LSTM. We have indeed tried these relatively complicated encoders, but their performance is barely satisfactory. We believe one of the main reasons is that the additional trainable parameter set involved in these encoders reduces the difficulty of adversarial training. In other words, the generator is more likely to over-fit the training set; therefore, the generalization ability of the generator is poor when dealing with unseen relations. Even though the bag-of-words model achieves better performance here, it still lacks semantic diversity, especially when understanding a relation type requires the word sequence information in its textual description. In addition, as mentioned in the background, our zero-shot setting is based on a unified entity set. It can be understood as expanding the current large-scale knowledge graph by adding the unseen relation edges between the existing entity nodes. It would be more beneficial to further consider unseen entities. We leave these two points to future work.
In this paper, we propose a novel generative adversarial approach for zero-shot knowledge graph relational learning. We leverage GANs to generate plausible relation embeddings from raw textual descriptions. Under this condition, zero-shot learning is converted to the traditional supervised classification problem. An important aspect of our work is that our framework does not depend on a specific KG embedding method; it is model-agnostic and could potentially be applied to any version of KG embeddings. Experimentally, our model achieves consistent improvements over various baselines on two zero-shot datasets.
Pengda Qin is supported by China Scholarship Council and National Natural Science Foundation of China (61702047). Chunyun Zhang is supported by National Natural Science Foundation of China (61703234). Weiran Xu is supported by State Education Ministry – China Mobile Research Fund Project (MCM20190701), DOCOMO Beijing Communications Laboratories Co., Ltd, National Key Research and Development Project No. 2019YFF0303302. The authors from UCSB are not supported by any of the projects above.
-  (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: Generative Adversarial Models.
-  (2013) Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: Introduction, Related Work, Baselines.
-  (2018) KBGAN: adversarial learning for knowledge graph embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1470–1480. Cited by: Related Work.
-  (2010) Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI, Cited by: Datasets and Evaluation Protocols.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: Analysis of Textual Representations.
-  (2017) Many paths to equilibrium: gans do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446. Cited by: footnote 1.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: Generative Adversarial Models, footnote 1.
-  (2018) OpenKE: an open toolkit for knowledge embedding. In Proceedings of EMNLP, Cited by: Implementation Details.
-  (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: Generative Adversarial Models.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Implementation Details.
-  (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Cited by: Generative Adversarial Model.
-  (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: Generative Adversarial Model.
-  (2011) A three-way model for collective learning on multi-relational data.. In ICML, Vol. 11, pp. 809–816. Cited by: Related Work, Baselines.
-  (2017) Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th ICML-Volume 70, pp. 2642–2651. Cited by: Generative Adversarial Models.
-  (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on EMNLP (EMNLP), pp. 1532–1543. Cited by: Generative Adversarial Model.
-  (2016) Less is more: zero-shot learning from online textual documents with noise suppression. In CVPR, pp. 2249–2257. Cited by: Related Work.
-  (1988) Term-weighting approaches in automatic text retrieval. Information processing & management 24 (5), pp. 513–523. Cited by: Generative Adversarial Model.
-  (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: Feature Encoder.
-  (2015) Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on EMNLP, pp. 1499–1509. Cited by: Datasets and Evaluation Protocols.
-  (2016) Complex embeddings for simple link prediction. In ICML, pp. 2071–2080. Cited by: Introduction, Related Work, Baselines.
-  (2017) Zero-shot learning-the good, the bad and the ugly. In CVPR, pp. 4582–4591. Cited by: Feature Encoder.
-  (2018) One-shot relational learning for knowledge graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1980–1990. Cited by: Feature Encoder, Datasets and Evaluation Protocols.
-  (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575. Cited by: Related Work, Datasets and Evaluation Protocols, Baselines.
-  (2018) A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, pp. 1004–1013. Cited by: Generative Adversarial Model.