A knowledge graph (KG) is a special kind of graph structure, with entities as nodes and relations as directed edges. Each edge (also called a fact) is represented as a triplet of the form (head entity, relation, tail entity), denoted as $(h, r, t)$, indicating that two entities are connected by a specific relation, e.g., (Shakespeare, isAuthorOf, Hamlet) [4, 29, 2, 36]. KGs are general and useful, and they have served as fundamental building blocks for many applications, e.g., structured search, question answering [8, 5], and intelligent virtual assistants. This importance has also inspired many well-known KG projects, e.g., FreeBase, DBpedia, and YAGO.
However, as these triplets are hard to manipulate, one fundamental problem is how to find a good representation for entities and relations in the KG. Early works towards this goal rely on statistical relational learning over the symbolic triplet data [23, 13, 25]. However, these methods neither generalize well nor scale to large knowledge graphs. Recently, graph embedding techniques have been introduced to KG. These methods attempt to encode entities and relations in a KG into a low-dimensional vector space while capturing the connection properties of nodes and edges. They are scalable and have shown promising performance in basic KG tasks, such as link prediction and triplet classification [7, 40].
Besides, downstream tasks such as entity classification and entity linking can also benefit from the learned entity and relation embeddings. Provided that the relation encoding entity types (denoted as IsA) or the relation encoding equivalent entities (denoted as EqualTo) is contained in the KG and has been included in the learning process, entity classification can be treated as a link prediction task, and entity linking as a triplet classification task. A more direct entity linking method is to check the similarity score between the embeddings of two entities.
In recent years, constructing new scoring functions that better model the complex interactions between entities and relations has been the main focus for improving the performance of KG embedding [42, 20, 46, 38]. However, another very important aspect of KG embedding, i.e., negative sampling, has not received sufficient attention. The need for negative sampling comes from the fact that there are only positive triplets in a KG. To avoid trivial solutions of the embedding, for each positive triplet, a set containing all of its possible negative samples needs to be hand-made. Then, for the effectiveness and efficiency of stochastic updates in KG embedding, once we have picked a positive triplet, we also need to sample a negative triplet from its corresponding negative sample set. Importantly, the quality of these negative triplets matters.
Due to its simplicity and efficiency, uniform sampling is broadly used in KG embedding. However, it is a fixed scheme and ignores changes in the distribution of negative triplets during training. Thus, it suffers seriously from the vanishing gradient problem. Specifically, as observed in prior work, most negative triplets in the sampling set are easily classified ones. Since scoring functions tend to give observed (positive) triplets large values, as training proceeds, the scores (evaluated from scoring functions) of most non-observed (probably negative) triplets become smaller. Thus, when negative triplets are uniformly sampled, it is very likely that we pick one with zero gradient. As a result, the training of KG embedding is impeded by such vanishing gradients rather than by the optimization algorithm, which prevents KG embedding from reaching the desired performance. A better sampling scheme, i.e., Bernoulli sampling, has also been introduced. It improves uniform sampling by considering one-to-many, many-to-many, and many-to-one mappings between head and tail entities. However, it is still a fixed sampling scheme and thus suffers from vanishing gradients.
Therefore, high-quality negative triplets should have large scores. To capture them efficiently during training, negative sampling faces two main challenges: (i) how to capture and model the dynamic distribution of negative triplets, and (ii) how to sample negative triplets in an efficient way. Recently, two pioneering works, i.e., IGAN and KBGAN, have attempted to address these challenges. Both replace the fixed sampling scheme with a generative adversarial network (GAN). However, the GAN-based solutions still have several problems. First, GAN increases the number of training parameters because an extra generator is introduced. Second, GAN training can suffer from instability and degeneracy [1, 18], and the REINFORCE gradient used in IGAN and KBGAN is known to have high variance. These drawbacks lead to unstable performance across different scoring functions, and hence pretraining becomes a must for both IGAN and KBGAN.
In this paper, to address the challenges of high-quality negative sampling while avoiding the problems of using GAN, we propose a new cache-based negative sampling method, called NSCaching. By empirically studying the score distribution of negative samples, we find that it is highly skewed, i.e., there are only a few negative triplets with large scores and the rest are useless. This observation motivates us to maintain only high-quality negative triplets during training, and to dynamically update the maintained triplets. First, we store the high-quality negative triplets in a cache, and then design an importance sampling (IS) strategy to update the cache. The IS strategy not only captures the dynamic characteristics of the distribution, but also benefits the efficiency of NSCaching. Furthermore, we also take good care of “exploration and exploitation”, which balances exploring all possible high-quality negative triplets and sampling from a few large-score negative triplets in the cache. Contributions of our work are summarized as follows:
We propose a simple and efficient negative sampling scheme, NSCaching. It is a general negative sampling scheme, which can be injected into all popularly used KG embedding models. NSCaching has fewer parameters than both IGAN and KBGAN, and can be trained with gradient descent as the original KG embedding models.
We propose the uniform strategy to sample from the cache and IS strategy to update the cache in NSCaching with good care of “exploration and exploitation”.
We conduct experiments on four popular data sets, i.e., WN18 and FB15K, and their variants WN18RR and FB15K237. Experimental results demonstrate that our method is very efficient and also more effective than the state-of-the-art methods, i.e., IGAN and KBGAN.
Notation. We denote the set of entities as $\mathcal{E}$ and the set of relations as $\mathcal{R}$. A fact (edge) in a KG is represented by a triplet, i.e., $(h, r, t)$, where $h \in \mathcal{E}$ is the head entity, $t \in \mathcal{E}$ is the tail entity, and $r \in \mathcal{R}$ is the relation. Observed facts in a KG are represented by a set $\mathcal{S}$. Finally, we denote the embedding vectors of $h$, $r$ and $t$ by the corresponding boldface characters, i.e., $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$.
II Preliminary: Framework of KG Embedding
Here, we first introduce the general framework for training KG embedding models in Section II-A. Then, we describe its two key components, i.e., negative sampling and scoring function in Section II-B and II-C respectively.
II-A The General Framework
To build a KG embedding model, the most important thing is to design a scoring function $f(h, r, t)$, which captures the interactions between two entities $h$ and $t$ based on a relation $r$. Different scoring functions have their own strengths and weaknesses in capturing the underlying interactions. Besides, the observed facts in a KG are supposed to have larger scores than non-observed ones. With the factual information, the embeddings are learned by solving an optimization problem that maximizes the scoring function for positive triplets and minimizes it for non-observed triplets at the same time. Based on the properties of scoring functions, KG embedding models are generally divided into two categories. The first one is the translational distance model, i.e.,
$$\min \sum_{(h, r, t) \in \mathcal{S}} \left[ \gamma - f(h, r, t) + f(\bar{h}, r, \bar{t}) \right]_+ ,$$
and the second one is the semantic matching model, i.e.,
$$\min \sum_{(h, r, t) \in \mathcal{S}} \left[ \ell\big(+1, f(h, r, t)\big) + \ell\big(-1, f(\bar{h}, r, \bar{t})\big) \right],$$
where $(\bar{h}, r, \bar{t}) \notin \mathcal{S}$ is the hand-made negative triplet for $(h, r, t)$, $\gamma$ is the margin, $[x]_+ = \max(0, x)$, and $\ell(\alpha, \beta) = \log(1 + \exp(-\alpha\beta))$ is the logistic loss.
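As a minimal sketch of the two objective families described above, the per-triplet losses can be written as follows. It assumes the standard margin-based hinge form for translational distance models and the standard logistic loss for semantic matching models; the function names are illustrative and not taken from the original implementation.

```python
import math


def margin_ranking_loss(pos_score, neg_score, margin):
    """Hinge loss: zero once the positive outscores the negative by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


def logistic_loss(label, score):
    """log(1 + exp(-label * score)), computed in a numerically stable way."""
    z = -label * score
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def semantic_matching_loss(pos_score, neg_score):
    """Positive triplet is labeled +1, the sampled negative triplet -1."""
    return logistic_loss(+1, pos_score) + logistic_loss(-1, neg_score)
```

Note that the hinge loss is exactly zero for a negative triplet whose score is far below that of the positive one, so such a triplet contributes no gradient.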
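The above two objectives can be optimized by stochastic gradient descent in a unified framework (Algorithm 1). In each iteration, a mini-batch of size $m$ is first sampled from $\mathcal{S}$ at step 3. In step 5, since there are no negative triplets in $\mathcal{S}$, a set $\bar{\mathcal{S}}_{(h, r, t)}$, i.e.,
$$\bar{\mathcal{S}}_{(h, r, t)} = \{ (\bar{h}, r, t) \notin \mathcal{S} \mid \bar{h} \in \mathcal{E} \} \cup \{ (h, r, \bar{t}) \notin \mathcal{S} \mid \bar{t} \in \mathcal{E} \}, \qquad (5)$$
which contains negative triplets for $(h, r, t)$, is made, and one negative triplet $(\bar{h}, r, \bar{t})$ is sampled from $\bar{\mathcal{S}}_{(h, r, t)}$. Finally, the embedding parameters are updated in step 6. Thus, in optimization, the most important problem is how to do negative sampling, i.e., how to generate $\bar{\mathcal{S}}_{(h, r, t)}$ and sample a negative triplet from it.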
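The negative sampling step of the general framework can be sketched as below. This is only an illustration of drawing one corrupted triplet uniformly from the negative set without materializing the whole set; `sample_negative_uniform`, the fair coin between head and tail corruption, and the `max_tries` guard are assumptions made for the sketch.

```python
import random


def sample_negative_uniform(triplet, entities, observed, rng, max_tries=100):
    """Corrupt the head or the tail with a uniformly drawn entity.

    Candidates that are observed facts are rejected, so the result
    belongs to the negative set of `triplet`.
    """
    h, r, t = triplet
    for _ in range(max_tries):
        e = rng.choice(entities)
        # Assumption: head and tail corruption are equally likely.
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate not in observed:
            return candidate
    raise RuntimeError("no negative triplet found; negative set may be empty")
```

For example, with `entities = ["a", "b", "c"]` and `observed = {("a", "r", "b")}`, a call returns a triplet that keeps the relation `"r"` and differs from the positive triplet in exactly one entity.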
II-B Negative Sampling
Existing works on negative sampling can be divided into two categories, i.e., sampling from fixed distributions and sampling from dynamic distributions.
II-B1 Sample from fixed distributions
In the early works, negative triplets are uniformly sampled from the set $\bar{\mathcal{S}}_{(h, r, t)}$. Such a strategy is simple and efficient. Later, a better sampling scheme, i.e., Bernoulli sampling, was introduced. It improves uniform sampling by reducing the appearance of false negative triplets in one-to-many, many-to-many, and many-to-one relations between head and tail entities. However, as mentioned in the introduction, both schemes still sample from fixed distributions, which can neither model the dynamic changes in the distribution of negative triplets nor sample triplets with large scores. Thus, they suffer seriously from vanishing gradients.
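For illustration, the Bernoulli scheme is commonly implemented by estimating, for each relation, the average number of tails per head (tph) and heads per tail (hpt), and corrupting the head with probability tph / (tph + hpt). The sketch below follows that common formulation; the function name is hypothetical, and it assumes the relation appears in at least one triplet.

```python
from collections import defaultdict


def head_replace_prob(triplets, relation):
    """Probability of corrupting the head for `relation`: tph / (tph + hpt)."""
    tails_of = defaultdict(set)  # head -> set of tails seen with this relation
    heads_of = defaultdict(set)  # tail -> set of heads seen with this relation
    for h, r, t in triplets:
        if r == relation:
            tails_of[h].add(t)
            heads_of[t].add(h)
    tph = sum(len(v) for v in tails_of.values()) / len(tails_of)
    hpt = sum(len(v) for v in heads_of.values()) / len(heads_of)
    return tph / (tph + hpt)
```

For a one-to-many relation the head is replaced more often, since a randomly replaced tail is more likely to yield a false negative.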
II-B2 Sample from dynamic distributions
Prior work made a more detailed analysis of the problems with fixed sampling schemes. It observed that most negative triplets are easy ones, whose scores quickly become small during training. This leads to the vanishing gradient problem if a fixed sampling scheme is used. Motivated by the success of the Generative Adversarial Network (GAN) and its ability to model dynamic distributions, IGAN and KBGAN introduce GAN for negative sampling in KG.
When GAN is applied to negative sampling, a jointly trained generator serves as a sampler that can not only generate high-quality triplets by confusing the discriminator, but also dynamically adapt to the new distributions by keeping training. The discriminator, i.e., the KG embedding model, learns to distinguish between the positive triplets and the negative triplets selected by the generator. Under an alternating training procedure, the generator dynamically approximates the negative sample distribution, and the KG embedding model is improved by high-quality negative samples.
Specifically, given a positive triplet $(h, r, t)$, IGAN models the distribution over all entities to form a negative triplet $(\bar{h}, r, \bar{t})$. The quality of $(\bar{h}, r, \bar{t})$ is measured by the scoring function of the discriminator, i.e., the target KG embedding model. By joint training, IGAN can dynamically sample negative triplets with high quality. KBGAN operates in a different way. Instead of modeling a distribution over the whole entity set, KBGAN learns to sample from a subset of random entities. Namely, it first uniformly samples a set of entities to form a candidate set of triplets, and then picks one triplet from it. Under the framework of GAN, the generator in KBGAN can approximate the score distribution of triplets in the candidate set, and sample a triplet with relatively high quality.
Even though GAN provides a solution to model the dynamic negative sample distribution, it is known to suffer from instability and degeneracy [1, 18]. Besides, the REINFORCE gradient has to be used, which is known to have high variance. Thus, pretraining is a must for both IGAN and KBGAN. Finally, GAN increases the number of model parameters and brings extra training costs.
II-C Scoring Functions
The design of scoring function has been the main power source for improving embedding performance in recent years. Depending on the property of scoring functions, they are used in either translational distance or semantic matching models.
II-C1 Translational distance model
The simplest and most representative translational distance model is TransE. Inspired by the word representation learning area, if a triplet $(h, r, t)$ is true, the entity embeddings $\mathbf{h}$ and $\mathbf{t}$ should be connected by the relational vector $\mathbf{r}$, i.e., $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. Under this assumption, for example, the two facts (China, Capital, Beijing) and (UK, Capital, London) will satisfy $\mathbf{Beijing} - \mathbf{China} \approx \mathbf{London} - \mathbf{UK}$ in the embedding space. Thus in TransE, the scoring function is defined as the negative translational distance between $\mathbf{h}$ and $\mathbf{t}$ connected by the relation $\mathbf{r}$, i.e., $f(h, r, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$.
Despite its simplicity, TransE has difficulty dealing with one-to-many, many-to-one and many-to-many relations. Taking a one-to-many relation as an example, TransE enforces $\mathbf{h} + \mathbf{r} \approx \mathbf{t}_i$ for every different tail entity $t_i$, thus resulting in very similar embeddings for these different entities. To solve this problem, variants like TransH, TransR and TransD are introduced to project the embeddings of head/tail entities into various spaces. By maximizing the scoring function for all positive triplets, the distance between $\mathbf{h} + \mathbf{r}$ and $\mathbf{t}$ in the corresponding space can be reduced.
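The TransE score can be sketched in a few lines. The choice of norm (`p=1` or `p=2`) is left as a parameter here because it is treated as a hyper-parameter in common implementations; this is an assumption of the sketch rather than a statement about the configuration used in this paper.

```python
def transe_score(h, r, t, p=1):
    """Negative L_p distance between h + r and t (larger means more plausible)."""
    distance = sum(abs(hi + ri - ti) ** p for hi, ri, ti in zip(h, r, t)) ** (1.0 / p)
    return -distance
```

A triplet that satisfies the translation exactly, e.g. `transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])`, receives the maximal score of zero, and any violation of the translation makes the score negative.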
II-C2 Semantic matching model
Another group of scoring functions operates without the assumption that $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. Instead, they use similarity to measure the plausibility of triplets $(h, r, t)$. RESCAL is the earliest such model. The entity embeddings are also continuous vectors in $\mathbb{R}^d$, but each relation is represented as a matrix that models the pairwise interactions between every dimension of the entity embedding space. Namely, the scoring function of a triplet $(h, r, t)$ is defined as $f(h, r, t) = \mathbf{h}^\top \mathbf{M}_r \mathbf{t}$, where the relation is represented as a matrix $\mathbf{M}_r \in \mathbb{R}^{d \times d}$. This scoring function captures pairwise interactions between all components of $\mathbf{h}$ and $\mathbf{t}$, which needs $O(d^2)$ parameters per relation.
Some simple and effective variants of RESCAL are DistMult, HolE and ComplEx. DistMult simplifies RESCAL by restricting the interaction matrix to a diagonal matrix, which reduces the number of parameters per relation from $O(d^2)$ to $O(d)$. HolE and ComplEx improve DistMult by modeling asymmetric relations.
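The relation between RESCAL and DistMult can be made concrete with a small sketch using plain Python lists. The helper names are illustrative only; the point is that DistMult coincides with RESCAL under a diagonal relation matrix and is therefore symmetric in head and tail.

```python
def rescal_score(h, m_r, t):
    """Bilinear score h^T M_r t with a full d x d relation matrix."""
    return sum(h[i] * m_r[i][j] * t[j] for i in range(len(h)) for j in range(len(t)))


def distmult_score(h, r, t):
    """Bilinear score with a diagonal relation matrix, stored as a vector r."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def diagonal(r):
    """Expand a relation vector into the equivalent diagonal matrix."""
    return [[r[i] if i == j else 0.0 for j in range(len(r))] for i in range(len(r))]
```

Because `distmult_score(h, r, t)` always equals `distmult_score(t, r, h)`, DistMult cannot distinguish a relation from its inverse, which is the limitation that HolE and ComplEx address.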
III Proposed Model
In this section, we first describe our key observations in Section III-A, which are ignored by existing works but are the main motivations of our work. The proposed method is described in Section III-B, where we show how the challenges in negative sampling are addressed by the cache. Finally, we show an interesting connection between NSCaching and self-paced learning in Section III-C, which further explains the good performance.
III-A Closer Look at Distribution of Negative Triplets
Recall that, in Equation (5), the negative triplet $(\bar{h}, r, \bar{t})$ is formed by replacing either the head or the tail entity of a positive triplet $(h, r, t)$ with any other entity in $\mathcal{E}$. Before introducing the proposed method, we analyze the distribution of scores for $(\bar{h}, r, \bar{t})$.
Figure 1: Complementary cumulative distribution function (CCDF) of the distance of negative triplets, showing the proportion of negative triplets whose distance is at least a given value. The red dashed line shows where the margin lies. (a) Distribution of negative triplets at 6 timestamps for a certain positive triplet. (b) Distribution of negative samples for 5 different positive triplets after the pretraining stage.
Figure 1(a) shows the changes in the distribution of negative samples for one positive triplet, and Figure 1(b) shows the distributions of negative samples for different positive triplets. Note that once the distance is larger than the margin $\gamma$, i.e., the red vertical line, the gradient of the corresponding negative triplet vanishes to zero. Indeed, we can see that the distribution changes during the training process, and that negative triplets with large scores are rare. These observations are consistent with those in [39, 9], and further explain the vanishing gradient problem of uniform sampling, as most sampled negative triplets will have small scores.
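To make the link between the margin and vanishing gradients concrete, the following sketch computes the fraction of negative triplets that still produce a non-zero gradient. It assumes the margin-based hinge loss $[\gamma - f(h, r, t) + f(\bar{h}, r, \bar{t})]_+$; the function name and the example scores are hypothetical.

```python
def active_fraction(pos_score, neg_scores, margin):
    """Fraction of negatives whose hinge loss (and hence gradient) is non-zero."""
    active = [s for s in neg_scores if margin - pos_score + s > 0.0]
    return len(active) / len(neg_scores)
```

With a positive score of -1.0, a margin of 2.0, and negative scores `[-1.2, -2.0, -6.0, -8.0, -9.0]`, only the first two negatives remain active. When the score distribution is highly skewed, this fraction is small, so a uniformly drawn negative triplet is likely to have zero gradient.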
Although the necessity of finding negative triplets with large scores from a dynamic distribution is mentioned by the above works, they do not study these distributions in depth. Key Observations. The more important observations are:
The score distribution of negative triplets is highly skewed.
Thus, while GAN has a strong ability to model the full generation process of negative triplets, it wastes a lot of parameters and training time on learning how negative triplets with small scores are distributed. This is obviously not necessary. Besides, reinforcement learning has to be used once GAN is applied, which increases the difficulty of training. This raises a question: is it possible to directly keep track of those negative triplets with large scores?
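Table I: Comparison of NSCaching with existing negative sampling methods.
| Method | Negative sampling | Training |
|---|---|---|
| baseline | uniform random | gradient descent (from scratch) |
| IGAN | GAN | reinforcement learning (with pretrain) |
| KBGAN | GAN | reinforcement learning (with pretrain) |
| NSCaching | using cache | gradient descent (from scratch) |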
III-B NSCaching: the Proposed Method
In this section, we describe the proposed method, which addresses the aforementioned question. The basic idea is very simple and intuitive. Recall that the challenges in negative sampling are (i) how to model the dynamic distribution of negative triplets and (ii) how to sample negative triplets in an efficient way. Based on the key observations, we are motivated to use a small amount of memory, which caches negative samples with large scores for each triplet in $\mathcal{S}$, and to sample the negative triplet directly from the cache. Algorithm 2 shows the KG embedding framework based on our cache-based negative sampling scheme. Note that the proposed sampling scheme does not depend on the choice of scoring function, so all of those mentioned in Section II-C can be used here.
Basically, as a negative triplet can be constructed by replacing either the head or the tail entity, we maintain a head-cache $\mathcal{H}$ (indexed by $(r, t)$) and a tail-cache $\mathcal{T}$ (indexed by $(h, r)$), which store $\bar{h} \in \mathcal{E}$ and $\bar{t} \in \mathcal{E}$ respectively. Each pair $(r, t)$ or $(h, r)$ corresponds to a unique index. First, when a positive triplet $(h, r, t)$ is received, the corresponding caches containing candidates for negative triplets, i.e., $\mathcal{H}_{(r, t)}$ and $\mathcal{T}_{(h, r)}$, are indexed in step 5. A negative triplet $(\bar{h}, r, \bar{t})$ is generated from $\mathcal{H}_{(r, t)}$ and $\mathcal{T}_{(h, r)}$ in steps 6-7, and then the caches are updated in step 8. Finally, the embeddings are updated based on the choice of scoring function.
An overview of the proposed method and the state-of-the-art methods is given in Table I. The main difference from the general KG embedding framework in Algorithm 1 lies in steps 5-8 of Algorithm 2, where the sampling scheme is based on the cache instead. Besides, compared with previous complex GAN-based works [39, 9], our method in Algorithm 2 acts like a discriminative and distilled model of GAN, which only cares about negative triplets with large scores during training. Thus, the proposed method, i.e., NSCaching, not only has fewer parameters, but can also be easily trained from randomly initialized models (from scratch). Moreover, experimental results in Section IV show that NSCaching achieves the best performance.
However, in order to achieve the best performance, we need to carefully design how to sample from the cache (step 6) and how to update the cache (step 8). In the sequel, we describe the “exploration and exploitation” inside these steps and how they are balanced in detail. Then, we give a time and space analysis of Algorithm 2, which further explains its efficiency and memory saving. Note that we only discuss the operations and designs for the head-cache $\mathcal{H}$ here, as the designs are the same for the tail-cache $\mathcal{T}$.
III-B1 Uniform sampling strategy from the cache (step 6)
Recall that only head entities of negative triplets with large scores are in the cache $\mathcal{H}_{(r, t)}$, thus picking any $\bar{h} \in \mathcal{H}_{(r, t)}$ probably avoids the vanishing gradient problem. As larger scores also lead to bigger gradients, a very natural scheme is to always sample the negative triplet with the largest score.
However, as the distribution can change during the iterations of the algorithm, the negative triplets in the cache may not be accurate enough for sampling in the latest iteration. Besides, there are false negative triplets in the negative sample sets, whose scores can also be very high. As a consequence, we also need to consider triplets in the cache other than the one with the largest score.
This raises the question of how to keep the balance between exploration (i.e., exploring all possible high-quality negative samples) and exploitation (i.e., sampling the negative triplet with the largest score in the cache).
These considerations motivate us to use a uniformly random sampling scheme in step 6. It is simple, efficient, and does not introduce any bias into the selection process. Indeed, a stronger scheme would be to sample based on the triplets’ scores, where a larger score indicates a higher probability of being sampled. However, it has extra memory costs, as the scores need to be stored as well. Moreover, it introduces bias caused by the dynamically changing distribution and false negative triplets, which leads to inferior performance as shown in Section IV-C1.
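The cache lookup and uniform draw described above can be sketched as follows. The caches are modeled as plain dictionaries keyed by (relation, tail) and (head, relation); `p_head`, the probability of corrupting the head rather than the tail, is a parameter introduced for this sketch.

```python
import random


def sample_from_cache(triplet, head_cache, tail_cache, p_head, rng):
    """Draw one negative triplet uniformly from the caches of a positive triplet."""
    h, r, t = triplet
    h_bar = rng.choice(head_cache[(r, t)])  # uniform draw from the head-cache
    t_bar = rng.choice(tail_cache[(h, r)])  # uniform draw from the tail-cache
    # Choose which side to corrupt; p_head is an assumed parameter.
    return (h_bar, r, t) if rng.random() < p_head else (h, r, t_bar)
```

Since every cached entity is drawn with equal probability, no scores need to be stored alongside the cached indices.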
III-B2 Importance sampling strategy to update the cache (step 8)
As mentioned in Section II-A, the cache needs to be dynamically changed during the iterations of the algorithm. Otherwise, as long as the same negative triplets are kept in $\mathcal{H}_{(r, t)}$, sampling from the cache is still a scheme with a fixed distribution, which eventually suffers from the vanishing gradient problem. Thus, we need to refresh the cache in each iteration. Moreover, the cache needs to be updated in an efficient way.
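The proposed importance sampling (IS) strategy is presented in Algorithm 3. First, we uniformly sample a subset $\mathcal{R}_m \subset \mathcal{E}$ of size $N_2$ (step 2), then take its union with $\mathcal{H}_{(r, t)}$ to obtain $\hat{\mathcal{H}}_{(r, t)}$. The scores for all triplets in $\hat{\mathcal{H}}_{(r, t)}$ are evaluated in step 4. After that, we construct a subset $\tilde{\mathcal{H}}_{(r, t)}$ from $\hat{\mathcal{H}}_{(r, t)}$ by sampling entries in $\hat{\mathcal{H}}_{(r, t)}$ without replacement $N_1$ times following the probability
$$p(\bar{h} \mid (r, t)) = \frac{\exp\big(f(\bar{h}, r, t)\big)}{\sum_{\bar{h}_i \in \hat{\mathcal{H}}_{(r, t)}} \exp\big(f(\bar{h}_i, r, t)\big)}.$$
Finally, $\tilde{\mathcal{H}}_{(r, t)}$ is returned as the updated head-cache.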
Note that exploration and exploitation also need to be carefully balanced in Algorithm 3. As the cache needs to be updated, we have to sample from $\mathcal{E}$, and uniform sampling is chosen due to its efficiency. Thus, a bigger $N_1$ implies more exploitation, while a larger $N_2$ leads to more exploration. In step 6, uniform sampling or keeping the triplets with top scores could indeed be alternative choices. However, both of them are inappropriate. First, uniform sampling is obviously not proper, as triplets in $\mathcal{H}_{(r, t)}$ have much larger scores than those in $\mathcal{R}_m$. Second, deterministically keeping the top $N_1$ is not appropriate either, which is again due to the existence of false negative triplets (Section III-B1). All the above concerns are also empirically studied in the experiments in Section IV.
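The cache update can be sketched as below, assuming that sampling probabilities are proportional to the exponentiated scores and that sampling without replacement is realized by drawing one entry at a time and renormalizing over the remaining candidates. The function name, the `score_fn` callback, and this sequential realization are assumptions of the sketch, not details taken from Algorithm 3.

```python
import math
import random


def update_head_cache(cache, entities, score_fn, n1, n2, rng):
    """Refresh a head-cache: merge n2 random entities, then keep n1 by score."""
    extra = rng.sample(entities, n2)                      # exploration
    candidates = list(dict.fromkeys(list(cache) + extra)) # union, order kept
    scores = [score_fn(e) for e in candidates]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]         # stable softmax weights
    new_cache = []
    for _ in range(min(n1, len(candidates))):
        if sum(weights) > 0.0:
            idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        else:  # all remaining weights underflowed; fall back to uniform
            idx = rng.randrange(len(candidates))
        new_cache.append(candidates.pop(idx))
        weights.pop(idx)
    return new_cache
```

Here `score_fn(e)` stands for the score of the triplet obtained by placing entity `e` in the head position. Entities with large scores are very likely to stay in the cache, while low-score entities still have a small chance of being kept, which limits the influence of false negatives.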
III-B3 Space and time complexities
Here, we analyze the space and time complexities of NSCaching (Algorithm 2). Compared with the basic Algorithm 1, the main additional cost of introducing the cache comes from Algorithm 3 in step 8. In Algorithm 3, the time complexity of computing the scores of the $N_1 + N_2$ candidate triplets is $O((N_1 + N_2) d)$. The cost of step 6 contains two parts, i.e., normalization of the scores and sampling, which take $O(N_1 + N_2)$ and $O(N_1)$ respectively and are very small. Thus, the total cost of introducing the cache is $O((N_1 + N_2) d)$ for one triplet. We can also lazily update the cache $n$ epochs later rather than updating it immediately, which further reduces the update complexity to $O((N_1 + N_2) d / (n + 1))$.
As for space complexity, evaluating the scores for the candidate triplets takes $O((N_1 + N_2) d)$ space. Since we only store indices in the cache, it takes $O(|\mathcal{S}| N_1)$ space to store these indices for negative triplets. However, since there are many one-to-many, many-to-one and many-to-many relations, the cost will be smaller than $O(|\mathcal{S}| N_1)$ and the cache does not need to be stored in memory. In our experiments, the same value is used for $N_1$ and $N_2$ on WN18 and FB15K, and it is much smaller than the number of entities.
In comparison, to generate one negative triplet, the generator in IGAN costs $O(|\mathcal{E}| d)$ time since it needs to compute the distribution over all entities. KBGAN needs time linear in the size of its candidate set for measuring the candidate triplets. The generators in IGAN and KBGAN also incur additional space cost. Finally, the comparisons are summarized in Table I with TransE as the scoring function.
III-B4 Discussion on the Convergence
Both the baseline KG embedding models and NSCaching use stochastic gradient descent (SGD) for model training. While there is no theoretical guarantee, SGD has been applied to many nonconvex and complex models where convergence is empirically observed, including the baseline KG embedding models [7, 6, 42, 11, 38, 31, 27]. The only difference between NSCaching and the baseline models is how negative triplets are sampled.
Besides, since NSCaching samples negative triplets with larger scores, its gradients have larger magnitude than those of the baseline approach. This also prevents NSCaching from being stopped early by the sampling process and helps it converge with higher testing performance than that of the baseline models. The above are all empirically shown and studied in Section IV.
III-C Connection to Self-Paced Learning
The main idea of self-paced (or curriculum) learning [3, 24] is to pick easy samples first, and then gradually switch to hard ones. In this way, the classifier can first identify the rough position of the decision boundary, and the boundary can then be further refined near hard examples. It is very effective for complex and nonconvex models.
Recently, it has also been introduced into network embedding, where a big improvement in embedding quality has been reported. There, GAN is also used to monitor the distribution of edges in the network, and negative edges with scores above a threshold are sampled from the generator in GAN. Self-paced learning is achieved by increasing the threshold during the training of the embedding. Thus, we can see that neither KBGAN nor IGAN has benefited from self-paced learning.
In contrast, our caching scheme can explicitly benefit from it. The reason is that the embedding model only has weak discriminative ability in the beginning of the training. Thus, while there are still a lot of negative triplets with large scores, it is more likely that they are easy ones as most of negative samples are easy. However, as training goes on, those easy samples will gradually have small scores and are removed from the cache. These mean NSCaching will learn from easy samples first, but then gradually focus on hard ones, which is exactly the principle of self-paced learning. The above explanations are also verified by experiments, where we can see the negative triplets in the cache change from easy to hard ones (Section IV-F) and NSCaching training from scratch can already achieve better performance than IGAN and KBGAN with pre-training (Section IV-B).
IV Experiments
In this section, we carry out an empirical study of our method. All algorithms are written in Python with the PyTorch framework and run on a TITAN Xp GPU.
IV-A Experiment Setup
Four datasets are used here, i.e., WN18, FB15K and their variants WN18RR, FB15K237. WN18 and FB15K have been widely tested in the most well-known KG embedding works [7, 20, 38, 39, 9]. WN18RR and FB15K237 are variants that remove near-duplicate or inverse-duplicate relations from WN18 and FB15K, respectively. The two variants are harder and more realistic. Their statistics are shown in Table II.
Specifically, WN18 and WN18RR are subsets of WordNet, which is a large lexical database of English. The entities correspond to word senses, and the relations represent the lexical relations between them. FB15K and FB15K237 are subsets of the Freebase dataset, which contains general facts of the world. Freebase kept growing until January 2014 and it now contains approximately 44 million topics and 2.4 billion triplets.
Following previous KG embedding works [7, 42, 20, 38] and the GAN-based works [39, 9], we test the performance on the link prediction task, which is the standard testbed for KG embedding models. Link prediction aims to predict the missing entity $h$ or $t$ for a positive triplet $(h, r, t)$. In this task, we measure the rank of the head entity $h$ and the tail entity $t$ among all entities. Thus, link prediction emphasizes the rank of the correct entity rather than its concrete score.
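IV-A3 Performance measurements
Mean reciprocal ranking (MRR): It is computed as the average of the reciprocal ranks, $\frac{1}{|\mathcal{Q}|} \sum_{i=1}^{|\mathcal{Q}|} \frac{1}{rank_i}$, where $\mathcal{Q} = \{rank_1, \dots, rank_{|\mathcal{Q}|}\}$ is a set of ranking results;
Hit@10: It is the percentage of appearance in the top 10: $\frac{1}{|\mathcal{Q}|} \sum_{i=1}^{|\mathcal{Q}|} \mathbb{I}(rank_i \le 10)$, where $\mathbb{I}(\cdot)$ is the indicator function;
Mean rank (MR): It is computed as $\frac{1}{|\mathcal{Q}|} \sum_{i=1}^{|\mathcal{Q}|} rank_i$. A smaller value of MR tends to indicate better results.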
MRR and Hit@10 measure the top rankings of the positive entity at different levels. Hit@10 cares about general top rankings, while the top-1 samples contribute most to MRR. Larger values of MRR and Hit@10 indicate better performance. To avoid underestimating the performance of different models, we report the performance in a “Filtered” setting, i.e., all the corrupted triplets that exist in the train, valid and test sets are filtered out [39, 9]. Note that MR is not a good metric, as it is easily influenced by false positive samples. We report it here to keep consistency with the existing literature [39, 9].
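The three measurements and the “Filtered” setting can be sketched as follows. The function names and the dictionary-based representation of candidate scores are illustrative assumptions; ranks are 1-based.

```python
def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_k(ranks, k=10):
    """Fraction of test cases whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks):
    """Average rank of the correct entity."""
    return sum(ranks) / len(ranks)


def filtered_rank(scores, target, known_true):
    """Rank of `target`, ignoring other entities that also form true triplets."""
    better = [e for e, s in scores.items()
              if e != target and e not in known_true and s > scores[target]]
    return 1 + len(better)
```

For ranks `[1, 2, 20]`, Hit@10 is 2/3 while MR is about 7.67, which illustrates how a single poorly ranked case dominates MR but has little effect on MRR and Hit@10.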
IV-A4 Choices of the scoring function
A large number of scoring functions have been proposed in the literature, including the translational distance models TransE, TransH, TransR, TransD, TranSparse, TransM, ManifoldE, etc., and the semantic matching models RESCAL, DistMult, HolE, ComplEx, ANALOGY, etc. All these methods are summarized in a recent survey. Following [9, 39], in the sequel, TransE, TransH, TransD, DistMult and ComplEx will be used as scoring functions for comparison (see their definitions in Table III).
IV-B Comparison with State-of-the-arts
In this section, we focus on the comparison with state-of-the-art methods. Hyper-parameters of NSCaching are studied in Section IV-C.
IV-B1 Compared methods
The following methods for negative sampling are compared:
Bernoulli: As a basic extension of the uniform sampling scheme used in TransE, Bernoulli sampling aims at reducing false negative labels by replacing the head or tail with different probabilities for one-to-many, many-to-one and many-to-many relations. Specifically, it samples $(\bar{h}, r, t)$ or $(h, r, \bar{t})$ under a predefined Bernoulli distribution. Since it is shown to be better than uniform sampling, we choose it as the basic random sampling scheme;
KBGAN: This model first samples a set of entities uniformly from the whole entity set $\mathcal{E}$. Then the head or tail entity is replaced with the entities in this set to form a set of candidates $(\bar{h}, r, t)$ and $(h, r, \bar{t})$. The generator in KBGAN tries to pick one triplet among them. As proposed in the original work, we choose the simplest model, TransE, as the generator. For fair comparison, the size of the candidate set is the same as our cache size $N_1$. We use the published code (https://github.com/cai-lw/KBGAN) and change its configuration to match ours for fair comparison;
NSCaching (Algorithm 2): As in Section III, the negative samples are formed by replacing the head entity or tail entity with one uniformly sampled from the head-cache $\mathcal{H}$ or the tail-cache $\mathcal{T}$. The cache is updated as in Algorithm 3. Note that we can also lazily update the cache several iterations later, which can further save time. However, we just report the result of immediate update, which is shown to be both effective and efficient. The cache sizes $N_1$ and $N_2$ and the lazy-update interval $n$ are kept fixed unless otherwise specified.
As the source code of IGAN is not available, we do not run it here. Instead, we directly use its reported performance in the sequel. Finally, we also use Bernoulli sampling to choose between $(\bar{h}, r, t)$ and $(h, r, \bar{t})$ for KBGAN and NSCaching.
From scratch: The embeddings of relations and entities are initialized by the Xavier uniform initializer, and the models (denoted as KBGAN + scratch and NSCaching + scratch) are directly trained on the given KG;
With pretrain: As in [9, 39], we first pretrain each scoring function under the baseline model, i.e., Bernoulli sampling, for several epochs on both data sets. We denote it as pretrained. Then the obtained parameters are used to warm-start training on the given KG rather than starting from scratch. We keep training based on the warm-started KG embedding and evaluate the performance under different models, i.e., Bernoulli, KBGAN + pretrain and NSCaching + pretrain. Besides, the generator in KBGAN is warm-started with the corresponding TransE model.
IV-B2 Hyper-parameter settings
We use grid search to select the following hyper-parameters: hidden dimension , learning rate . For translational distance models, we tune the margin value . For semantic matching models, we tune the penalty value . We use Adam , a popular variant of SGD, for training and adopt its default settings except for the learning rate. The best hyper-parameters are tuned under the Bernoulli sampling scheme and evaluated by the MRR metric on the validation set. We keep them fixed for the baseline methods Bernoulli and KBGAN and for our proposed NSCaching. Following , we save and record the pretrained model after several initial training epochs. Then, the Bernoulli method keeps training until 3000 epochs, and the results of KBGAN and NSCaching are evaluated within 1000 epochs, either from scratch or with pretrain. All recorded results are tested with the best parameters chosen by the MRR value on the validation set.
IV-B3 Results on translational distance models
The performance on link prediction is compared in Table IV. First, for the translational distance models (TransE, TransH, TransD), KBGAN, NSCaching and IGAN (both with pretrain and from scratch) gain significant improvement over the baseline scheme Bernoulli, especially on MRR, which is mainly influenced by top rankings. This verifies the need for high-quality negative triplets during negative sampling and shows that these methods can effectively pick up such triplets.
Then, IGAN and KBGAN perform better with pretrain than from scratch, as indicated by MRR and Hit@10. This shows that pretraining is helpful for GAN-based models. In comparison, the proposed NSCaching, trained from either state (pretrain or scratch), outperforms IGAN and KBGAN. Finally, we find that MR is not an appropriate metric, as many of the pretrained models, which have not yet converged, show even smaller MR than Bernoulli.
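To make the difference between the three ranking metrics concrete, the sketch below computes MR, MRR and Hit@10 from a list of ranks. The function name `ranking_metrics` and the two rank lists are hypothetical and are not taken from Table IV. They only illustrate one way in which MR can disagree with MRR and Hit@10: a single badly ranked triplet dominates the mean rank.

```python
def ranking_metrics(ranks, k=10):
    """Return (MR, MRR, Hit@k) for a list of 1-based ranks of the true entities."""
    n = len(ranks)
    mr = sum(ranks) / n                          # mean rank: lower is better
    mrr = sum(1.0 / r for r in ranks) / n        # mean reciprocal rank: higher is better
    hit_k = sum(1 for r in ranks if r <= k) / n  # fraction ranked within the top k
    return mr, mrr, hit_k

# Hypothetical ranks: model_a ranks most test triplets at the top but has one
# outlier; model_b has no outlier but never reaches the top positions.
model_a = [1, 1, 2, 1, 1000]
model_b = [40, 50, 60, 45, 55]
print(ranking_metrics(model_a))
print(ranking_metrics(model_b))
```

In this toy case, model_b has the smaller MR (50 against 201) even though model_a is far better on MRR (about 0.70 against about 0.02) and on Hit@10 (0.8 against 0.0).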
Convergence of testing performance for various algorithms is shown in Figures 2 and 3. We use TransD as it offers the best performance among the three translational distance models. As can be seen, all algorithms converge to a stable testing MRR and Hit@10, which verifies the empirical convergence of the Adam optimizer. Then, while pretrain is a must for KBGAN to achieve good performance, NSCaching obtains good performance either from scratch or with pretrain. Finally, in all cases, NSCaching converges much faster and is more stable than both Bernoulli and KBGAN.
IV-B4 Results on semantic matching models
The performance is shown in the bottom rows of Table IV. As with the translational distance models, NSCaching outperforms the baseline scheme Bernoulli significantly, as indicated by the bold and underlined numbers. However, KBGAN does not show consistent performance. It performs even worse than the Bernoulli sampling scheme on WN18, WN18RR and FB15K, and KBGAN from scratch performs much worse than with pretrain. This observation further supports the view that GAN-based methods usually suffer from instability and degeneracy; such methods need a careful balance between the generator and the target KG embedding model. In contrast, NSCaching works consistently and performs the best among the various settings.
Convergence of testing performance for various algorithms is shown in Figures 4 and 5. We use ComplEx as the representative since it performs much better than DistMult. As can be seen, both Bernoulli and the proposed NSCaching converge to a stable state. In contrast, the performance of KBGAN declines after several epochs as it overfits. NSCaching, either with pretrain or from scratch, leads in performance and carries over to the semantic matching models without further tuning.
IV-B5 Results on triplet classification
To further verify the quality of the learned embeddings, we test them on the triplet classification task on the WN18RR and FB15K237 datasets. This task is to confirm whether a given triplet is correct or not, i.e., binary classification on triplet . In practice, it can help us quickly answer true-or-false questions. The WN18RR (https://github.com/thunlp/OpenKE/blob/master/benchmarks/WN18RR/valid_neg.txt) and FB15K237 (https://github.com/thunlp/OpenKE/blob/master/benchmarks/FB15K237/valid_neg.txt) datasets each release a set of positive and negative triplets, which can be used to evaluate the performance on the classification task. The decision rule of classification is as follows: for each , if its score is no less than the relation-specific threshold , then predict positive; otherwise, predict negative. The threshold is chosen to maximize the classification accuracy on the validation set. As shown in Table V, NSCaching still outperforms the baselines. This experiment further supports the claim that our proposed NSCaching helps learn a better embedding of the KG.
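The decision rule above can be sketched in a few lines. The function names (`best_threshold`, `fit_thresholds`, `classify`) and the validation scores are hypothetical. The sketch also assumes that a higher score means a more plausible triplet and that candidate thresholds are restricted to the observed validation scores, plus infinity for the "always negative" rule.

```python
def best_threshold(scores, labels):
    """Pick the threshold maximising accuracy of the rule `score >= threshold`."""
    best_thr, best_acc = None, -1.0
    # Candidates: every observed score, plus +inf to allow "predict all negative".
    for thr in sorted(set(scores)) + [float("inf")]:
        correct = sum((s >= thr) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc

def fit_thresholds(valid):
    """valid: list of (relation, score, is_positive). Returns {relation: threshold}."""
    by_rel = {}
    for rel, score, label in valid:
        scores, labels = by_rel.setdefault(rel, ([], []))
        scores.append(score)
        labels.append(label)
    return {rel: best_threshold(s, y)[0] for rel, (s, y) in by_rel.items()}

def classify(relation, score, thresholds):
    """Predict positive iff the score is no less than the relation's threshold."""
    return score >= thresholds[relation]

# Hypothetical validation scores for two relations.
valid = [("r1", 0.9, True), ("r1", 0.7, True), ("r1", 0.4, False), ("r1", 0.2, False),
         ("r2", -1.0, True), ("r2", -3.0, False)]
thresholds = fit_thresholds(valid)
print(thresholds)
```

Fitting one threshold per relation lets relations whose scores live on different scales (here `r1` and `r2`) be classified with the same rule.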
IV-C Cache Update and Sampling Scheme
In Section IV-B, we have shown that NSCaching achieves the best performance on four benchmark datasets. Here, we analyze the design choices concerning “exploration and exploitation” at steps 6 and 8 in Algorithm 2. TransD and WN18 are used here.
IV-C1 Uniform sampling from the cache (step 6)
Given a cache that stores high-quality negative samples, the first question is how to sample from it. Recall that we discussed three strategies in Section III-B1: (i) uniform sampling from the cache (denoted as “uniform sampling”); (ii) importance sampling according to the score of each sample in the cache (denoted as “IS sampling”); and (iii) top sampling, which chooses the sample with the largest score (denoted as “top sampling”). The testing MRR on WN18 with TransD is compared in Figure 6(a). As can be seen, top sampling has the worst performance, and uniform sampling is the best.
To show how exploration and exploitation are balanced here, we further compute two criteria that expose the difference between these strategies: (i) repeat ratio (denoted as “RR”), which measures the percentage of repeated negative triplets within epochs; and (ii) non-zero loss ratio (denoted as “NZL”), which is the percentage of non-zero losses in the same range. The value of RR is related to exploration: if the number of repeated negative triplets is high, the negative samples only explore a small part of the sample space, which results in worse exploration. The NZL ratio measures exploitation: a larger NZL means higher quality of the picked negative samples.
The RR is shown in Figure 7(a). The Bernoulli sampling method has almost zero repeated triplets since the number of explored negatives is extremely large; it therefore has the best exploration. Among the schemes based on NSCaching, uniform sampling has better exploration than IS sampling, followed by top sampling. The NZL ratio is shown in Figure 7(b). As training goes on, the baseline Bernoulli model suffers severely from the zero-loss problem, which leads to vanishing gradients. All three NSCaching schemes have more than half non-zero losses and thus achieve exploitation. To sum up, uniform sampling is the most balanced strategy among the three schemes, which is why NSCaching + uniform achieves the best performance.
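The two criteria can be computed along the following lines. This is a sketch under stated assumptions, not the paper's measurement code: RR is taken here as the fraction of sampled negatives that repeat an earlier sample in the same window, and NZL assumes a margin ranking loss of the form max(0, margin - pos + neg) in which a higher score means a more plausible triplet. The function names and the toy inputs are hypothetical.

```python
def repeat_ratio(negatives):
    """Fraction of sampled negative triplets that repeat an earlier sample."""
    seen, repeats = set(), 0
    for triplet in negatives:
        if triplet in seen:
            repeats += 1
        seen.add(triplet)
    return repeats / len(negatives)

def nonzero_loss_ratio(pos_scores, neg_scores, margin):
    """Fraction of (positive, negative) pairs whose margin loss is non-zero."""
    losses = [max(0.0, margin - p + n) for p, n in zip(pos_scores, neg_scores)]
    return sum(1 for loss in losses if loss > 0) / len(losses)

# Hypothetical window of sampled negatives: the first triplet is drawn three times.
negatives = [("a", "r", "x"), ("b", "r", "x"), ("a", "r", "x"),
             ("c", "r", "x"), ("a", "r", "x")]
rr = repeat_ratio(negatives)

# Hypothetical scores: two negatives are close to the positive, two are far away.
nzl = nonzero_loss_ratio([5.0, 5.0, 5.0, 5.0], [4.5, 1.0, 4.9, -2.0], margin=1.0)
print(rr, nzl)
```

A sampler that always returns the same few hard negatives would push NZL up and RR up together, which is the trade-off discussed above.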
IV-C2 Importance sampling strategy to update the cache (step 8)
As discussed in Section III-B2, we have two choices for updating the cache: (i) an importance sampling based method (IS update), which samples entities from the candidates without replacement according to the probability in (6); and (ii) a top sampling method (top update), which directly selects the entities with the top scores among the candidates. Again, let us first look at the performance comparison in Figure 6(b). We can see that IS update outperforms top update by a large margin.
Then, to explain the exploration and exploitation here, we add two extra measurements for comparison: (i) the number of changed elements in the cache (denoted as “CE”) and (ii) the ratio of non-zero losses, i.e., NZL. More changed elements indicate more exploration, and more non-zero losses indicate more exploitation.
The value of CE measures how many elements in the cache differ over a period of epochs. As shown in Figure 8(a), the number of changed elements under the top update scheme is much smaller than under the importance sampling update. As a result, the cache is updated quite slowly and the model mainly focuses on a few highly scored negative triplets, which may contain many false negative triplets. In comparison, the importance sampling based update scheme keeps the cache fresh and keeps track of dynamic changes in the negative sampling distribution. It not only provides enough qualified negative triplets for the KG embedding model to avoid zero loss, but also explores the large negative sample space well. In summary, we choose the importance sampling strategy to update the cache.
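The contrast between the two update rules can be sketched as below. This is an illustration only: the function names are hypothetical, the candidate list stands in for the union of the current cache and the randomly sampled entities, and the sampling weight is assumed to be proportional to exp(score), which may differ in detail from the probability defined in (6).

```python
import math
import random

def top_update(candidates, scores, cache_size):
    """Keep the `cache_size` candidates with the largest scores."""
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:cache_size]]

def is_update(candidates, scores, cache_size, rng=random):
    """Sample `cache_size` candidates without replacement, weighted by exp(score).

    The exp(score) weighting is an assumption made for this sketch.
    """
    pool = list(zip(candidates, scores))
    chosen = []
    while pool and len(chosen) < cache_size:
        top = max(s for _, s in pool)
        # Subtract the current maximum before exponentiating for numerical stability.
        weights = [math.exp(s - top) for _, s in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick)[0])
    return chosen

# Hypothetical candidate entities and their scores.
candidates = ["a", "b", "c", "d", "e", "f"]
scores = [3.0, 2.5, 2.0, 0.5, 0.0, -1.0]
print(top_update(candidates, scores, 3))
print(is_update(candidates, scores, 3, random.Random(0)))
```

With fixed scores, `top_update` returns the same three entities on every call, whereas `is_update` occasionally admits lower-scored entities, which is the behaviour that the CE measurement is meant to capture.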
IV-D Sensitivity Analysis: Cache Size
Compared with the baseline KG embedding models (i.e., Bernoulli [42, 26]), the only extra hyper-parameters here are and . Basically, is the size of the cache. Then, is the number of randomly sampled negative triplets from , which will later be used to update the cache. Here, we show their impact on NSCaching’s performance.
Figure 9(a) shows how performance changes when varying the cache size among , with fixed . When the cache size is small, the average score of entities stored in the cache should be larger than that in a larger cache. Thus, false negative samples are more likely to be sampled, which moves the boundary to a bad location. For the other values of , NSCaching performs quite stably: the convergence speed is similar, as are the values in the converged state. Thus, to find an appropriate cache size, the value of can be searched upward from a small value until the performance is stable.
The performance for different sizes of the random candidate subset is shown in Figure 9(b). The entities in the cache are updated more frequently as gets larger, which leads to better exploration. The trade-off is that a larger value of costs more. As shown by the colored lines in Figure 9(b), NSCaching performs consistently when is larger than 10. However, if the random subset is small, the content of the cache is harder to update, which leads to poor performance, as shown by the yellow dashed line ().
IV-E Illustration of Vanishing Gradient
To further clarify the vanishing gradient problem, we plot the average -norm of gradients vs. the number of epochs in Figure 10. Note that Adam , which is a stochastic gradient descent algorithm, is used as the optimizer. First, we can see that while the norms of gradients for both NSCaching and Bernoulli become smaller, they do not decrease to zero since the sampling process of the mini-batch introduces noise into the gradients. However, the norm from NSCaching is larger than that from Bernoulli, which is due to the caching-based negative sampling scheme. Thus, NSCaching successfully avoids the problem of vanishing gradient.
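The link between easy negatives and vanishing gradients can be seen on a single training pair. The sketch below is a toy illustration under explicit assumptions: it uses a TransE-style squared L2 distance d(h, r, t) = ||h + r - t||^2 and the hinge loss max(0, margin + d_pos - d_neg), and the two-dimensional embeddings are made up. The choice of norm and distance in the reported experiments may differ.

```python
import math

def transe_margin_grad_r(h, r, t, h_neg, t_neg, margin):
    """Gradient w.r.t. r of max(0, margin + d(h, r, t) - d(h_neg, r, t_neg)),
    with d(h, r, t) = ||h + r - t||^2 (an assumption made for this sketch)."""
    pos = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    neg = [hi + ri - ti for hi, ri, ti in zip(h_neg, r, t_neg)]
    d_pos = sum(x * x for x in pos)
    d_neg = sum(x * x for x in neg)
    if margin + d_pos - d_neg <= 0:
        # Inactive hinge: the loss is zero and so is the gradient.
        return [0.0] * len(r)
    return [2 * p - 2 * n for p, n in zip(pos, neg)]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical 2-d embeddings of a positive triplet (h, r, t).
h, r, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.1]
easy_tail = [5.0, 5.0]  # far from h + r: already separated by the margin
hard_tail = [1.0, 0.5]  # close to h + r: violates the margin

easy_norm = l2_norm(transe_margin_grad_r(h, r, t, h, easy_tail, margin=1.0))
hard_norm = l2_norm(transe_margin_grad_r(h, r, t, h, hard_tail, margin=1.0))
print(easy_norm, hard_norm)
```

The easy negative contributes nothing to the update, while the hard negative yields a non-zero gradient. Averaged over a mini-batch, a sampler that returns mostly easy negatives therefore produces small gradient norms.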
IV-F Explanation of the connection to Self-Paced Learning
Finally, we visualize the changes of entities in the cache, which verifies the effects of self-paced learning introduced in Section III-C. Following , we also use FB13 here since its triplets are more interpretable than those of the four evaluated datasets. We pick , , as the positive triplets, and the changes in its tail cache are shown in Table VI. As can be seen, the entities are meaningless at first, e.g., ostrava and ben_lilly, and then gradually change to human jobs, e.g., artist and sex_worker.
| epoch | entities in cache |
|---|---|
| 0 | allen_clarke, jose_gola, ostrava, ben_lilly, hans_zinsser |
| 20 | accountant, frank_pais, laura_marx, como, domitia_lepida |
| 100 | artist, , aviator, hans_zinsse, john_h_cough |
| 200 | physician, artist, raich_carter, coach, mark_shivas |
| 500 | artist, physician, cavan, sex_worker, attorney_at_law |
V Related work
V-1 Generative Adversarial Network
Generative Adversarial Network (GAN) was originally introduced as a powerful model for plausible image generation. The GAN contains two modules: a generator that serves as a complex distribution sampler, and a discriminator that measures the quality of generated samples. Under elaborate control of the training procedure of the generator and discriminator [1, 18], GAN has achieved significant success in the computer vision field [35, 47]. It has also been shown to sample high-quality negative samples for knowledge graph embedding [9, 39].
V-2 Negative Sampling
Negative sampling was originally introduced as an alternative to the hierarchical softmax, which aims at reducing the complexity of softmax on large-scale datasets . It then became popular in embedding learning, especially for word embedding , graph embedding , and KG embedding . More recently, there has been interest in applying GAN to negative sampling, e.g., IGAN  and KBGAN  for KG embedding and self-paced GAN  for network embedding.
VI Conclusion
We proposed NSCaching as a novel negative sampling method for knowledge graph embedding learning. The negative samples are drawn from a cache that dynamically holds high-quality negative samples. We analyze the design of NSCaching through the balance of exploration and exploitation. We empirically test NSCaching on four datasets and five scoring functions. Results show that the method generalizes well under various settings and achieves state-of-the-art performance on the FB15K dataset. When dealing with million-scale KGs, the memory needed to store the cache becomes a problem; using distributed computation or hashing will be pursued as future work. In addition, the theoretical convergence of NSCaching is an important and interesting direction for future work.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. Technical report, 2017.
-  S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer, 2007.
-  Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
-  K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250. ACM, 2008.
-  A. Bordes, S. Chopra, and J. Weston. Question answering with subgraph embeddings. In EMNLP, pages 615–620, 2014.
-  A. Bordes, X. Glorot, J. Weston, and Y. Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259, 2014.
-  A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
-  A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In ECML-PKDD, pages 165–180. Springer, 2014.
-  L. Cai and W.Y. Wang. Kbgan: Adversarial learning for knowledge graph embeddings. In ACL, volume 1, pages 1470–1480, 2018.
-  L. Drumond, S. Rendle, and L. Schmidt-Thieme. Predicting rdf triples in incomplete knowledge bases with tensor factorization. In SAC, pages 326–331, 2012.
-  M. Fan, Q. Zhou, E. Chang, and T. F. Zheng. Transition-based knowledge graph embedding with relational mapping properties. In PACLIC, 2014.
-  H. Gao and H. Huang. Self-paced network embedding. In SIGKDD, pages 1406–1415, 2018.
-  L. Getoor and B. Taskar. Introduction to statistical relational learning, volume 1. The MIT Press, 2007.
-  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
-  Y. Goldberg and O. Levy. word2vec explained: deriving mikolov et al.’s negative-sampling word-embedding method. Technical report, 2014.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
-  A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In SIGKDD, pages 855–864. ACM, 2016.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In NIPS, pages 5767–5777, 2017.
-  M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pages 297–304, 2010.
-  G. Ji, S. He, L. Xu, K. Liu, and J. Zhao. Knowledge graph embedding via dynamic mapping matrix. In ACL, volume 1, pages 687–696, 2015.
-  G. Ji, K. Liu, S. He, and J. Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In AAAI, pages 985–991, 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Technical report, 2014.
-  S. Kok and P. Domingos. Statistical predicate invention. In ICML, pages 433–440, 2007.
-  M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.
-  N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, pages 529–539. Association for Computational Linguistics, 2011.
-  Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, volume 15, pages 2181–2187, 2015.
-  H. Liu, Y. Wu, and Y. Yang. Analogical inference for multi-relational embeddings. In ICML, pages 2168–2178, 2017.
-  T. Mikolov, W. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In ACL, pages 746–751, 2013.
-  G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
-  M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
-  M. Nickel, L. Rosasco, and T. A. Poggio. Holographic embeddings of knowledge graphs. In AAAI, volume 2, pages 3–2, 2016.
-  M. Nickel, V. Tresp, and H. Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. Technical report, 2017.
-  A. Singhal. Introducing the knowledge graph: things, not strings. Official Google blog, 5, 2012.
-  Q. Song, H. Ge, J. Caverlee, and X. Hu. Self-attention generative adversarial networks. Technical report, 2018.
-  F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In WWW, pages 697–706, 2007.
-  Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.
-  T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In ICML, pages 2071–2080, 2016.
-  P. Wang, S. Li, and R. Pan. Incorporating GAN for negative sampling in knowledge representation learning. AAAI, 2018.
-  Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. TKDE, 29(12):2724–2743, 2017.
-  Y. Wang, D. Ruffinelli, S. Broscheit, and R. Gemulla. On evaluating embedding models for knowledge base completion. arXiv preprint arXiv:1810.07180, 2018.
-  Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, pages 1112–1119, 2014.
-  D. A. White. The knowledge-based software assistant: A program summary. In ICKBSE, pages 2–6, 1991.
-  R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
-  H. Xiao, M. Huang, and X. Zhu. From one point to a manifold: knowledge graph embedding for precise link prediction. In IJCAI, pages 1315–1321, 2016.
-  B. Yang, W. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. Technical report, 2017.
-  J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2242–2251. IEEE, 2017.