I Introduction
A knowledge graph (KG) is a special kind of graph structure, with entities as nodes and relations as directed edges. Each edge (also called a fact) is represented as a triplet of the form (head entity, relation, tail entity), denoted as $(h, r, t)$, indicating that two entities are connected by a specific relation, e.g., (Shakespeare, isAuthorOf, Hamlet) [4, 29, 2, 36]. KGs are very general and useful, and they have been used as fundamental building blocks for many applications, e.g., structured search [34], question answering [8, 5], and intelligent virtual assistants [43]. Such importance has also inspired many famous KG projects, e.g., FreeBase [4], DBpedia [2], and YAGO [36].
However, as these triplets are hard to manipulate, one fundamental problem is how to find a good representation for entities and relations in the KG [30]. Early works towards this goal lie in statistical relational learning using the symbolic triplet data [23, 13, 25]. However, these methods neither generalize well nor scale to large KGs. Recently, graph embedding techniques [40] have been introduced to KG. These methods attempt to encode entities and relations in a KG into a low-dimensional vector space while capturing the connection properties of nodes and edges. They are scalable and have also shown promising performance in basic KG tasks, such as link prediction and triplet classification [7, 40].
Besides, downstream tasks, such as entity classification [32] and entity linking [6], can also benefit from the learned entity and relation embeddings. Given that the relation encoding entity types (denoted as IsA) or the relation encoding equivalent entities (denoted as EqualTo) is contained in the KG and has been included in the learning process, entity classification can be treated as the link prediction task $(h, \text{IsA}, ?)$, and entity linking as the triplet classification task $(h, \text{EqualTo}, t)$. A more direct entity linking method proposed in [32] is to check the similarity score between the embeddings of two entities.
In recent years, constructing new scoring functions that better model the complex interactions between entities and relations has been the main focus for improving the performance of KG embedding [42, 20, 46, 38]. However, another very important perspective of KG embedding, i.e., negative sampling, has not been sufficiently emphasized. The need for negative sampling comes from the fact that there are only positive triplets in a KG [10]. To avoid trivial solutions of the embedding, for each positive triplet, a set that contains all of its possible negative samples needs to be handmade. Then, for the effectiveness and efficiency of stochastic updates in KG embedding, once we have picked a positive triplet, we also need to sample a negative triplet from its corresponding negative sample set. The quality of these negative triplets matters.
Due to its simplicity and efficiency, uniform sampling is broadly used in KG embedding [40]. However, it is a fixed scheme and ignores changes in the distribution of negative triplets during training. Thus, it suffers seriously from the vanishing gradient problem. Specifically, as observed in [39], most negative triplets in the sampling set are easily classified ones. Since scoring functions tend to give observed (positive) triplets large values, as training goes on, the scores for most non-observed (probably negative) triplets become smaller. Thus, when negative triplets are uniformly sampled, it is very likely that we pick one with zero gradient. As a result, the training process of KG embedding is impeded by such vanishing gradients rather than by the optimization algorithm, which prevents KG embedding from reaching the desired performance. A better sampling scheme, i.e., Bernoulli sampling, is introduced in [42]. It improves uniform sampling by considering one-to-many, many-to-many, and many-to-one mappings in the relation between head and tail. However, it is still a fixed sampling scheme, which suffers from vanishing gradients.
Therefore, high-quality negative triplets should have large scores. To efficiently capture them during training, we face two main challenges in negative sampling: (i) how to capture and model the dynamic distribution of negative triplets, and (ii) how to sample negative triplets in an efficient way. Recently, two pioneering works, i.e., IGAN [39] and KBGAN [9], have attempted to address these challenges. Both replace the fixed sampling scheme with a generative adversarial network (GAN) [16]. However, the GAN-based solutions still have many problems. First, GAN increases the number of training parameters because an extra generator is introduced. Second, GAN training can suffer from instability and degeneracy [1, 18], and the REINFORCE gradient [44] used in IGAN and KBGAN is known to have high variance. These drawbacks lead to unstable performance across different scoring functions, and hence pretraining becomes a must for both IGAN and KBGAN.
In this paper, to address the challenges of high-quality negative sampling while avoiding the problems of using GAN, we propose a new cache-based negative sampling method, called NSCaching. By empirically studying the score distribution of negative samples, we find that the score distribution is highly skewed, i.e., there are only a few negative triplets with large scores and the rest are useless. This observation motivates us to maintain only high-quality negative triplets during training, and to dynamically update the maintained triplets. First, we store the high-quality negative triplets in a cache, and then design an importance sampling (IS) strategy to update the cache. The IS strategy not only captures the dynamic characteristic of the distribution, but also benefits the efficiency of NSCaching. Furthermore, we take good care of "exploration and exploitation", which balances exploring all possible high-quality negative triplets against sampling from a few large-score negative triplets in the cache. Contributions of our work are summarized as follows:

We propose a simple and efficient negative sampling scheme, NSCaching. It is a general negative sampling scheme, which can be injected into all popularly used KG embedding models. NSCaching has fewer parameters than both IGAN and KBGAN, and can be trained with gradient descent as the original KG embedding models.

We propose the uniform strategy to sample from the cache and IS strategy to update the cache in NSCaching with good care of “exploration and exploitation”.

We conduct experiments on four popular data sets, i.e., WN18 and FB15K, and their variants WN18RR and FB15K237. Experimental results demonstrate that our method is very efficient and more effective than the state-of-the-art methods, i.e., IGAN and KBGAN.
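Notation. We denote the set of entities as $\mathcal{E}$ and the set of relations as $\mathcal{R}$. A fact (edge) in a KG is represented by a triplet, i.e., $(h, r, t)$, where $h \in \mathcal{E}$ is the head entity, $t \in \mathcal{E}$ is the tail entity, and $r \in \mathcal{R}$ is the relation. Observed facts in a KG are represented by a set $\mathcal{S} = \{(h, r, t)\}$. Finally, we denote the embedding vectors of $h$, $r$ and $t$ by their corresponding boldface characters, i.e., $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$.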
II Preliminary: Framework of KG Embedding
Here, we first introduce the general framework for training KG embedding models in Section II-A. Then, we describe its two key components, i.e., negative sampling and the scoring function, in Sections II-B and II-C respectively.
II-A The General Framework
To build a KG embedding model, the most important thing is to design a scoring function $f(h, r, t)$, which captures the interactions between two entities based on a relation [40]. Different scoring functions have their own strengths and weaknesses in capturing the underlying interactions. Besides, the observed facts in a KG are supposed to have larger scores than non-observed ones. With the factual information, the embeddings are learned by solving an optimization problem that maximizes the scoring function for positive triplets and minimizes it for non-observed triplets at the same time. Based on the properties of scoring functions, KG embedding models are generally divided into two categories. The first one is the translational distance model, i.e.,
$$\min \sum_{(h,r,t) \in \mathcal{S}} \left[ \gamma - f(h, r, t) + f(\bar{h}, r, \bar{t}) \right]_+ , \tag{1}$$
and the second one is the semantic matching model, i.e.,
$$\min \sum_{(h,r,t) \in \mathcal{S}} \ell\big(+1, f(h, r, t)\big) + \ell\big(-1, f(\bar{h}, r, \bar{t})\big) , \tag{2}$$
where $(\bar{h}, r, \bar{t})$ is the handmade negative triplet for $(h, r, t)$, $\gamma$ is the margin, $[x]_+ = \max(0, x)$, and $\ell(\alpha, \beta) = \log(1 + \exp(-\alpha\beta))$ is the logistic loss.
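In Algorithm 1, the embeddings are updated with the gradient of the per-triplet loss, i.e., for translational distance models
$$\nabla \left[ \gamma - f(h, r, t) + f(\bar{h}, r, \bar{t}) \right]_+ , \tag{3}$$
and for semantic matching models
$$\nabla \ell\big(+1, f(h, r, t)\big) + \nabla \ell\big(-1, f(\bar{h}, r, \bar{t})\big) . \tag{4}$$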
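The above two objectives can be optimized by stochastic gradient descent in a unified framework (Algorithm 1). In each iteration, a mini-batch $\mathcal{S}_{\text{batch}}$ of size $m$ is first sampled from $\mathcal{S}$ at step 3. In step 5, since there are no negative triplets in $\mathcal{S}$, a set $\bar{\mathcal{S}}_{(h,r,t)}$, i.e.,
$$\bar{\mathcal{S}}_{(h,r,t)} = \{ (\bar{h}, r, t) \notin \mathcal{S} \mid \bar{h} \in \mathcal{E} \} \cup \{ (h, r, \bar{t}) \notin \mathcal{S} \mid \bar{t} \in \mathcal{E} \} , \tag{5}$$
which contains negative triplets for $(h, r, t)$, is made, and one negative triplet $(\bar{h}, r, \bar{t})$ is sampled from $\bar{\mathcal{S}}_{(h,r,t)}$. Finally, the embedding parameters are updated in step 6. Thus, in optimization, the most important problem is how to do negative sampling, i.e., how to generate $\bar{\mathcal{S}}_{(h,r,t)}$ and sample a negative triplet from it.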
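To make the framework concrete, the following is a minimal sketch of the negative sampling step (step 5) and the per-triplet margin loss in (1). It assumes a TransE-style score with the L1 norm and uniform corruption of either the head or the tail; the function names (`transe_score`, `hinge_loss`, `uniform_negative`) and the use of integer entity ids are illustrative assumptions, not part of the original algorithm listing.

```python
import random


def transe_score(h, r, t):
    """Negative L1 translational distance: a larger value means more plausible."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def hinge_loss(pos_score, neg_score, margin):
    """Per-triplet margin loss [margin - f(positive) + f(negative)]_+ ."""
    return max(0.0, margin - pos_score + neg_score)


def uniform_negative(triplet, num_entities, observed, rng):
    """Corrupt the head or the tail uniformly at random (assumed 50/50 split).

    Candidates that are observed facts (including the positive triplet
    itself) are rejected and re-drawn.
    """
    h, r, t = triplet
    while True:
        e = rng.randrange(num_entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate not in observed:
            return candidate
```

Note that `hinge_loss` returns exactly zero once the negative score is at least `margin` below the positive score; such a negative triplet contributes no gradient, which is the situation that a good sampling scheme tries to avoid.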
II-B Negative Sampling
Existing works on negative sampling can be divided into two categories, i.e., sampling from fixed distributions and sampling from dynamic distributions.
II-B1 Sample from fixed distributions
In the early works [7], negative triplets are uniformly sampled from the set $\bar{\mathcal{S}}_{(h,r,t)}$. Such a strategy is simple and efficient. Later, a better sampling scheme, i.e., Bernoulli sampling, is introduced in [42]. It improves uniform sampling by reducing the appearance of false negative triplets in one-to-many, many-to-many, and many-to-one relations between head and tail entities. However, as mentioned in the introduction, both schemes still sample from fixed distributions, which can neither model the dynamic changes in the distribution of negative triplets nor sample triplets with large scores. Thus, they suffer seriously from the vanishing gradient problem.
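As an illustration of Bernoulli sampling, the sketch below computes, for each relation, the probability of corrupting the head rather than the tail. It follows our reading of the rule in [42]: with `tph` the average number of tails per head and `hpt` the average number of heads per tail, the head is replaced with probability `tph / (tph + hpt)`. The function name and the triplet layout `(h, r, t)` are assumptions made for this example.

```python
from collections import defaultdict


def head_replace_prob(triplets):
    """Per relation, return the probability of corrupting the head entity."""
    tails_of = defaultdict(set)  # (relation, head) -> distinct tails
    heads_of = defaultdict(set)  # (relation, tail) -> distinct heads
    for h, r, t in triplets:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)

    def mean_size_per_relation(groups):
        total, count = defaultdict(int), defaultdict(int)
        for (r, _), members in groups.items():
            total[r] += len(members)
            count[r] += 1
        return {r: total[r] / count[r] for r in total}

    tph = mean_size_per_relation(tails_of)  # average tails per head
    hpt = mean_size_per_relation(heads_of)  # average heads per tail
    return {r: tph[r] / (tph[r] + hpt[r]) for r in tph}
```

For a one-to-many relation the head is replaced more often, since a randomly replaced tail is more likely to produce another true fact. The probabilities depend only on the training set, so the scheme is still a fixed distribution.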
II-B2 Sample from dynamic distributions
More recently, two pioneering works [39, 9] made a more dedicated analysis of the problems with fixed sampling schemes. They observed that most of the negative triplets are easy ones, whose scores quickly become small during training. This leads to the vanishing gradient problem if a fixed sampling scheme is used. Motivated by the success of the Generative Adversarial Network (GAN) [16] and its ability to model dynamic distributions, IGAN and KBGAN introduce GAN for negative sampling in KG.
When GAN is applied to negative sampling, a jointly trained generator serves as a sampler that not only generates high-quality triplets by confusing the discriminator, but also dynamically adapts to new distributions as training continues. The discriminator, i.e., the KG embedding model, learns to distinguish between the positive triplets and the negative triplets selected by the generator. Under an alternating training procedure, the generator dynamically approximates the negative sample distribution, and the KG embedding model is improved by high-quality negative samples.
Specifically, given a positive triplet $(h, r, t)$, IGAN models the distribution over all entities to form a negative triplet $(\bar{h}, r, \bar{t})$. The quality of $(\bar{h}, r, \bar{t})$ is measured by the scoring function of the discriminator, i.e., the target KG embedding model. By joint training, IGAN can dynamically sample negative triplets with high quality. KBGAN operates in a different way. Instead of modeling a distribution over the whole entity set, KBGAN learns to sample from a subset of random entities. Namely, it first uniformly samples a set of entities to form a candidate set, and then picks one triplet from it. Under the framework of GAN, the generator in KBGAN can approximate the score distribution of triplets in the candidate set, and sample a triplet with relatively high quality.
Even though GAN provides a solution for modeling the dynamic negative sample distribution, it is known to suffer from instability and degeneracy [1, 18]. Besides, the REINFORCE gradient [44] has to be used, which is known to have high variance. Thus, pretraining is a must for both IGAN and KBGAN. Finally, GAN increases the number of model parameters and brings extra training costs.
II-C Scoring Functions
The design of the scoring function has been the main driving force for improving embedding performance in recent years. Depending on their properties, scoring functions are used in either translational distance or semantic matching models.
II-C1 Translational distance model
The simplest and most representative translational distance model is TransE [7]. Inspired by work on word representation learning [28], it assumes that if a triplet $(h, r, t)$ is true, the entity embeddings should be connected by the relational vector $\mathbf{r}$, i.e., $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. Under this assumption, for example, the two facts (China, Capital, Beijing) and (UK, Capital, London) satisfy $\mathbf{China} - \mathbf{Beijing} \approx \mathbf{UK} - \mathbf{London}$ in the embedding space. Thus in TransE, the scoring function is defined as the negative translational distance of $\mathbf{h}$ and $\mathbf{t}$ connected by relation $\mathbf{r}$, i.e., $f(h, r, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$.
Despite its simplicity, TransE has difficulty dealing with one-to-many, many-to-one and many-to-many relations. Take a one-to-many relation for example: TransE enforces $\mathbf{h} + \mathbf{r} \approx \mathbf{t}_i$ for each different tail entity $t_i$, thus resulting in very similar embeddings for these different entities. To solve this problem, variants like TransH [42], TransR [26], and TransD [20] are introduced to project the embeddings of head and tail entities into various spaces. By maximizing the scoring function for all positive triplets, the distance between $\mathbf{h} + \mathbf{r}$ and $\mathbf{t}$ in the corresponding space can be reduced.
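The one-to-many issue can be made explicit with the triangle inequality. As a short worked step (written with a generic norm, which is an assumption here), if $(h, r, t_1)$ and $(h, r, t_2)$ are both observed facts, then

```latex
\|\mathbf{t}_1 - \mathbf{t}_2\|
  = \|(\mathbf{h} + \mathbf{r} - \mathbf{t}_2) - (\mathbf{h} + \mathbf{r} - \mathbf{t}_1)\|
  \le \|\mathbf{h} + \mathbf{r} - \mathbf{t}_1\| + \|\mathbf{h} + \mathbf{r} - \mathbf{t}_2\| .
```

So driving both translational distances towards zero necessarily pushes $\mathbf{t}_1$ and $\mathbf{t}_2$ towards each other, regardless of how different the two entities are.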
II-C2 Semantic matching model
Another group of scoring functions operates without the assumption that $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. Instead, they use similarity to measure the plausibility of triplets $(h, r, t)$. RESCAL [32] is the earliest such model. Its entity embeddings are also continuous vectors in $\mathbb{R}^d$, but each relation is represented as a matrix that models the pairwise interaction between every dimension of the entity embedding space $\mathbb{R}^d$. Namely, the scoring function of a triplet is defined as $f(h, r, t) = \mathbf{h}^\top \mathbf{M}_r \mathbf{t}$, where the relation is represented as a matrix $\mathbf{M}_r \in \mathbb{R}^{d \times d}$. This scoring function captures pairwise interactions between all components of $\mathbf{h}$ and $\mathbf{t}$, which needs $O(d^2)$ parameters per relation.
Some simple and effective variants of RESCAL are DistMult [46], HolE [31] and ComplEx [38]. DistMult simplifies RESCAL by restricting the interaction matrix to a diagonal matrix, which reduces the number of parameters per relation from $O(d^2)$ to $O(d)$. HolE and ComplEx improve DistMult by modeling asymmetric relations.
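The relation between the two scores can be checked with a small sketch. It assumes plain Python lists as embeddings, and the function names are chosen for this example only. With a diagonal interaction matrix, the bilinear RESCAL score reduces to the DistMult score, which is symmetric in the head and tail.

```python
def rescal_score(h, M, t):
    """Bilinear score h^T M t with a full d x d relation matrix M."""
    d = len(h)
    return sum(h[i] * M[i][j] * t[j] for i in range(d) for j in range(d))


def distmult_score(h, r, t):
    """Bilinear score with a diagonal relation matrix stored as a vector r."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def diagonal(r):
    """Expand a vector into the corresponding diagonal matrix."""
    return [[r[i] if i == j else 0 for j in range(len(r))] for i in range(len(r))]
```

Because `distmult_score(h, r, t)` equals `distmult_score(t, r, h)` for any inputs, DistMult cannot distinguish a relation from its inverse, which is the limitation that asymmetric models address.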
III Proposed Model
In this section, we first describe our key observations in Section III-A, which are ignored by existing works but are the main motivations of our work. The proposed method is described in Section III-B, where we show how the challenges in negative sampling are addressed by the cache. Finally, we show an interesting connection between NSCaching and self-paced learning [24] in Section III-C, which further explains the good performance.
III-A Closer Look at Distribution of Negative Triplets
Recall that, in Equation (5), a negative triplet is formed by replacing either the head or the tail entity of a positive triplet with any other entity in $\mathcal{E}$. Before introducing the proposed method, we analyze the distribution of scores for $\bar{\mathcal{S}}_{(h,r,t)}$.
Figure 1: The complementary cumulative distribution function (CCDF) is used to show the proportion of negative triplets whose distance $D$ satisfies $D \geq x$. The red dashed line shows where the margin $\gamma$ lies. (a) is the distribution of negative triplets at 6 timestamps for a certain triplet $(h, r, t)$. (b) is the negative sample distribution of 5 different triplets after the pretraining stage.
Figure 1(a) shows the changes in the distribution of negative samples for one positive triplet, and Figure 1(b) shows the distributions of negative samples for different positive triplets. Note that once the distance is larger than the margin $\gamma$, i.e., the red vertical line, the gradient of the corresponding negative triplets vanishes to zero. Indeed, we can see that the distribution changes during the training process, and that negative triplets with large scores are rare. These observations are consistent with those in [39, 9], and further explain the vanishing gradient problem of uniform sampling, as most sampled negative triplets will have small scores.
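The quantity plotted in such a figure can be computed with a few lines. The sketch below is a minimal illustration under one assumption that is not stated in the text: the distance of a negative triplet is taken as the score of the positive triplet minus the score of the negative one, so that the margin loss in (1) is zero whenever the distance reaches the margin. The function names are hypothetical.

```python
def ccdf(values, x):
    """Empirical CCDF: fraction of values that are greater than or equal to x."""
    return sum(1 for v in values if v >= x) / len(values)


def vanishing_fraction(pos_score, neg_scores, margin):
    """Fraction of negatives whose margin loss, and hence gradient, is zero.

    Assumes distance = f(positive) - f(negative); the loss
    [margin - distance]_+ vanishes once distance >= margin.
    """
    distances = [pos_score - s for s in neg_scores]
    return ccdf(distances, margin)
```

Under uniform sampling, `vanishing_fraction` is the probability of drawing a negative triplet that does not update the embeddings at all.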
Although the necessity of finding negative triplets with large scores from a dynamic distribution is mentioned by the above works, they do not study these distributions in depth.
Key Observations. The more important observations are:

The score distribution of negative triplets is highly skewed.
Thus, while GAN has a strong ability to monitor the full generation process of negative triplets, it wastes a lot of parameters and training time on learning how negative triplets with small scores are distributed. This is obviously not necessary. Besides, reinforcement learning has to be used once GAN is applied, which increases the difficulty of training. This raises a question: is it possible to directly keep track of those negative triplets with large scores?
method  |  negative sampling strategy  |  training strategy  |  minibatch time  |  minibatch space  |  model parameters
baseline  |  uniform random  |  gradient descent (from scratch)  |    |    |  
IGAN [39]  |  GAN  |  reinforcement learning (with pretraining)  |    |    |  
KBGAN [9]  |  GAN  |  reinforcement learning (with pretraining)  |    |    |  
NSCaching  |  using cache  |  gradient descent (from scratch)  |    |    |  
III-B NSCaching: the Proposed Method
In this section, we describe the proposed method, which addresses the aforementioned question. The basic idea is very simple and intuitive. Recall that the challenges in negative sampling are (i) how to model the dynamic distribution of negative triplets and (ii) how to sample negative triplets in an efficient way. Motivated by the key observations, we use a small amount of memory to cache negative samples with large scores for each triplet in $\mathcal{S}$, and sample the negative triplet directly from the cache. Algorithm 2 shows the KG embedding framework based on our cache-based negative sampling scheme. Note that the proposed sampling scheme does not depend on the choice of scoring function; all the scoring functions mentioned in Section II-C can be used here.
Basically, as a negative triplet can be constructed by replacing either the head or the tail entity, we maintain a head-cache $\mathcal{H}$ (indexed by $(r, t)$) and a tail-cache $\mathcal{T}$ (indexed by $(h, r)$), which store $\bar{h} \in \mathcal{E}$ and $\bar{t} \in \mathcal{E}$ respectively. Each pair $(r, t)$ or $(h, r)$ corresponds to a unique index. First, when a positive triplet $(h, r, t)$ is received, the corresponding caches containing candidates for negative triplets, i.e., $\mathcal{H}_{(r,t)}$ and $\mathcal{T}_{(h,r)}$, are indexed in step 5. A negative triplet $(\bar{h}, r, \bar{t})$ is generated from $\mathcal{H}_{(r,t)}$ and $\mathcal{T}_{(h,r)}$ in steps 6-7, and then the cache is updated in step 8. Finally, the embeddings are updated based on the choice of scoring function.
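The cache layout described above can be sketched with two dictionaries. This is only an illustration: the random initialization of each cache entry, the cache size argument `n1`, and the function name `build_caches` are assumptions made for the example, and entities are assumed to be integer ids in `range(num_entities)`.

```python
import random


def build_caches(triplets, num_entities, n1, rng):
    """Create a head-cache keyed by (r, t) and a tail-cache keyed by (h, r).

    Each entry holds n1 distinct candidate entity ids, drawn at random
    here (an assumed initialization).
    """
    head_cache, tail_cache = {}, {}
    for h, r, t in triplets:
        if (r, t) not in head_cache:
            head_cache[(r, t)] = rng.sample(range(num_entities), n1)
        if (h, r) not in tail_cache:
            tail_cache[(h, r)] = rng.sample(range(num_entities), n1)
    return head_cache, tail_cache
```

Positive triplets that share the same $(r, t)$ or $(h, r)$ pair share one cache entry, so the number of entries can be smaller than the number of training triplets.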
An overview of the proposed method and the state-of-the-art methods is given in Table I. The main difference from the general KG embedding framework in Algorithm 1 lies in steps 5-8 of Algorithm 2, where the sampling scheme is based on the cache instead. Besides, compared with the previous complex GAN-based works [39, 9], our method in Algorithm 2 acts like a discriminative and distilled model of GAN, which only cares about negative triplets with large scores during training. Thus, the proposed method, i.e., NSCaching, not only has fewer parameters, but also can be easily trained from randomly initialized models (from scratch). Moreover, experimental results in Section IV show that NSCaching achieves the best performance.
However, in order to achieve the best performance, we need to carefully design how to sample from the cache (step 6) and how to update the cache (step 8). In the sequel, we describe the "exploration and exploitation" inside these steps and how they are balanced in detail. Then, we give a time and space analysis of Algorithm 2, which further explains its efficiency and memory savings. Note that we only discuss operations and designs for the head-cache $\mathcal{H}$ here, as the designs are the same for the tail-cache $\mathcal{T}$.
III-B1 Uniform sampling strategy from the cache (step 6)
Recall that only the heads of negative triplets with large scores are in the cache $\mathcal{H}_{(r,t)}$; thus picking any $\bar{h} \in \mathcal{H}_{(r,t)}$ probably avoids the vanishing gradient problem. As larger scores also lead to bigger gradients, a very natural scheme is to always sample the negative triplet with the largest score.
However, as the distribution can change during the iterations of the algorithm, the negative triplets in the cache may not be accurate enough for the sampling in the latest iteration. Besides, there are false negative triplets in the negative sample sets, of which scores can also be very high [40]. As a consequence, we also need to consider other triplets except the one with largest score in the cache.
This raises the question of how to keep the balance between exploration (i.e., exploring all possible high-quality negative samples) and exploitation (i.e., sampling the negative triplet with the largest score in the cache).
These considerations motivate us to use a uniformly random sampling scheme in step 6. It is simple, efficient, and does not introduce any bias into the selection process. Indeed, a stronger scheme would be to sample based on the triplets' scores, where a larger score indicates a higher probability of being sampled. However, it has extra memory costs, as scores need to be stored as well. Moreover, it introduces bias caused by the dynamically changing distribution and by false negative triplets, which leads to inferior performance as shown in Section IV-C1.
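The two options discussed above can be contrasted in a short sketch. Both helpers are hypothetical; the score-based variant uses a softmax over stored scores, which is one possible way (our assumption) to turn scores into sampling probabilities.

```python
import math
import random


def sample_uniform(cache, rng):
    """Step 6 as described: pick any cached entity with equal probability."""
    return rng.choice(cache)


def sample_by_score(cache, scores, rng):
    """Alternative: pick a cached entity with probability softmax(score).

    This needs the scores to be stored next to the cache entries.
    """
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # shift for numerical stability
    return rng.choices(cache, weights=weights, k=1)[0]
```

When one cached entity has a much larger score than the rest, `sample_by_score` returns it almost every time. If that entity is a false negative or its stored score is stale, the sampler keeps exploiting it, which is the bias that uniform sampling avoids.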
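III-B2 Importance sampling strategy to update the cache (step 8)
As mentioned in Section II-A, the cache needs to be dynamically changed during the iterations of the algorithm. Otherwise, while negative triplets are kept in $\mathcal{H}_{(r,t)}$, sampling from the cache is still a scheme with a fixed distribution, which eventually suffers from the vanishing gradient problem. Thus, we need to refresh the cache in each iteration. Moreover, the cache needs to be updated in an efficient way.
The proposed importance sampling (IS) strategy is presented in Algorithm 3. First, we uniformly sample a subset $\mathcal{R}_m \subset \mathcal{E}$ of size $N_2$ (step 2), then union it with $\mathcal{H}_{(r,t)}$ to obtain $\hat{\mathcal{H}}_{(r,t)}$. The scores for all triplets in $\hat{\mathcal{H}}_{(r,t)}$ are evaluated in step 4. After that, we construct a subset $\tilde{\mathcal{H}}_{(r,t)}$ from $\hat{\mathcal{H}}_{(r,t)}$ by sampling $N_1$ entries of $\hat{\mathcal{H}}_{(r,t)}$ without replacement, following the probability
$$p(\bar{h} \mid (r, t)) = \frac{\exp\big(f(\bar{h}, r, t)\big)}{\sum_{\bar{h}_i \in \hat{\mathcal{H}}_{(r,t)}} \exp\big(f(\bar{h}_i, r, t)\big)} . \tag{6}$$
Finally, $\tilde{\mathcal{H}}_{(r,t)}$ is returned as the updated head-cache.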
Note that exploration and exploitation also need to be carefully balanced in Algorithm 3. As the cache needs to be updated, we have to sample from $\mathcal{E}$, and uniform sampling is chosen due to its efficiency. Thus, a bigger $N_1$ implies more exploitation, while a larger $N_2$ leads to more exploration. In step 6, uniform sampling or keeping the triplets with the top $N_1$ scores are alternative choices. However, both of them are inappropriate. First, uniform sampling is not proper, as triplets in $\mathcal{H}_{(r,t)}$ have much larger scores than the newly sampled ones. Second, deterministically keeping the top $N_1$ is not appropriate either, again due to the existence of false negative triplets (Section III-B1). All the above concerns are also empirically studied in the experiments in Section IV.
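The cache update can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes integer entity ids, a user-supplied `score_fn(head, r, t)`, and a softmax over scores as the sampling probability; duplicates between the cache and the random subset are merged, and sampling without replacement is done by sequential draws.

```python
import math
import random


def update_head_cache(cache, r, t, score_fn, num_entities, n1, n2, rng):
    """Refresh one head-cache entry by importance sampling."""
    # Exploration: add n2 uniformly sampled entities to the cached ones.
    extra = [rng.randrange(num_entities) for _ in range(n2)]
    candidates = list(dict.fromkeys(list(cache) + extra))  # union, order kept

    # Score every candidate head and turn scores into unnormalized weights.
    scores = [score_fn(e, r, t) for e in candidates]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]

    # Exploitation: draw n1 candidates without replacement, favoring large scores.
    pool = list(range(len(candidates)))
    chosen = []
    for _ in range(min(n1, len(pool))):
        threshold = rng.random() * sum(weights[i] for i in pool)
        running = 0.0
        for position, index in enumerate(pool):
            running += weights[index]
            if threshold < running:
                break
        chosen.append(candidates[index])
        pool.pop(position)
    return chosen
```

A larger `n2` injects more fresh entities per update, while a larger `n1` keeps more of the current high-score candidates, mirroring the exploration and exploitation trade-off described above.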
III-B3 Space and time complexities
Here, we analyze the space and time complexities of NSCaching (Algorithm 2). Compared with the basic Algorithm 1, the main additional cost of introducing the cache comes from Algorithm 3 in step 8. In Algorithm 3, the time complexity of computing the scores of the $N_1 + N_2$ candidate triplets is $O((N_1 + N_2)d)$, where $d$ is the embedding dimension. The cost of step 6 contains two parts, i.e., normalization of the scores and sampling, which take $O(N_1 + N_2)$ and $O(N_1)$ respectively and are very small. Thus, the total cost of introducing the cache is $O((N_1 + N_2)d)$ for one triplet. We can also lazily update the cache $n$ epochs later rather than updating it immediately, which further reduces the update complexity to $O((N_1 + N_2)d / (n + 1))$.
As for space complexity, evaluating the scores of the candidate triplets takes $O((N_1 + N_2)d)$ space. Since we only store indices in the cache, it takes $O(|\mathcal{S}| N_1)$ space to store these indices for negative triplets. However, since there are many one-to-many, many-to-one and many-to-many relations, the cost will be smaller than $O(|\mathcal{S}| N_1)$, and the cache does not need to be stored in memory. In our experiments, the values of $N_1$ and $N_2$ used on WN18 and FB15K are much smaller than the number of entities.
In comparison, to generate one negative triplet, the generator in IGAN [39] costs $O(|\mathcal{E}| d)$ time, since it needs to compute the distribution over all entities. KBGAN [9] needs $O(N_1 d)$ cost for measuring a candidate set of $N_1$ triplets. The additional space cost for IGAN and KBGAN is also $O(|\mathcal{E}| d)$ and $O(N_1 d)$ respectively. Finally, the comparisons are summarized in Table I with TransE as the scoring function.
III-B4 Discussion on the Convergence
Both the baseline KG embedding models [40] and NSCaching use stochastic gradient descent (SGD) for model training. While there is no theoretical guarantee, SGD has been applied to many non-convex and complex models [22], where convergence is empirically observed, including the baseline KG embedding models [7, 6, 42, 11, 38, 31, 27]. The only difference between NSCaching and the baseline models is how negative triplets are sampled.
Besides, since NSCaching samples negative triplets with larger scores, its gradients have larger magnitude than those of the baseline approach. This also prevents NSCaching from being stopped early by the sampling process and helps it converge with higher testing performance than that of the baseline models. The above are all empirically shown and studied in Section IV.
III-C Connection to Self-Paced Learning
The main idea of self-paced (or curriculum) learning [3, 24] is to pick easy samples first, and then gradually switch to hard ones. In this way, the classifier can first identify the rough position where the decision boundary should be located, and then the boundary can be further refined near hard examples. It is very effective for complex and non-convex models.
Recently, it has also been introduced into network embedding, and a big improvement in embedding quality has been reported [12]. There, GAN is also used to monitor the distribution of edges in the network, and negative edges with scores above a threshold are sampled from the generator in GAN. Self-paced learning is achieved by increasing the threshold during the training of the embedding [12]. Thus, we can see that neither KBGAN nor IGAN has benefited from self-paced learning.
In contrast, our caching scheme can explicitly benefit from it. The reason is that the embedding model only has weak discriminative ability at the beginning of training. Thus, while there are still a lot of negative triplets with large scores, it is more likely that they are easy ones, as most negative samples are easy. However, as training goes on, those easy samples will gradually have small scores and be removed from the cache. This means NSCaching learns from easy samples first, but then gradually focuses on hard ones, which is exactly the principle of self-paced learning. The above explanations are also verified by experiments, where we can see that the negative triplets in the cache change from easy to hard ones (Section IV-F), and that NSCaching trained from scratch can already achieve better performance than IGAN and KBGAN with pretraining (Section IV-B).
IV Experiments
In this section, we carry out an empirical study of our method. All algorithms are written in Python with the PyTorch framework [33] and run on a TITAN Xp GPU.
IV-A Experiment Setup
IV-A1 Datasets
Four datasets are used here, i.e., WN18, FB15K and their variants WN18RR and FB15K237. WN18 and FB15K were first introduced in [7]. They are widely used in well-known KG embedding works [7, 20, 38, 39, 9]. WN18RR and FB15K237 are variants that remove near-duplicate or inverse-duplicate relations from WN18 and FB15K, and are introduced by [41] and [37] respectively. The two variants are harder and more realistic. Their statistics are shown in Table II.
Dataset  #entity  #relation  #train  #valid  #test 

WN18  40,943  18  141,442  5,000  5,000 
WN18RR  93,003  11  86,835  3,034  3,134 
FB15K  14,951  1,345  484,142  50,000  59,071 
FB15K237  14,541  237  272,115  17,535  20,466 
Specifically, WN18 and WN18RR are subsets of WordNet [29], which is a large lexical database of English. The entities correspond to word senses, and the relations denote the lexical relations between them. FB15K and FB15K237 are subsets of the Freebase dataset [4], which contains general facts about the world. Freebase kept growing until January 2014 and it now contains approximately 44 million topics and 2.4 billion triplets.
IV-A2 Tasks
Following previous KG embedding works [7, 42, 20, 38] and the GAN-based works [39, 9], we test the performance on the link prediction task. This is also the standard testbed for measuring KG embedding models. Link prediction aims to predict the missing entity $h$ or $t$ for a positive triplet $(h, r, t)$. In this task, we measure the rank of the head entity $h$ and the tail entity $t$ among all entities. Thus, link prediction emphasizes the rank of the correct entity rather than its concrete score.
IV-A3 Performance measurements
As in previous works [7, 38, 39, 9], we evaluate different models based on the following metrics:

Mean reciprocal ranking (MRR): It is computed as the average of the reciprocal ranks, $\frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{\text{rank}_i}$, where $S$ is a set of ranking results;

Hit@10: It is the percentage of appearance in the top $k = 10$: $\frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{I}(\text{rank}_i \leq k)$, where $\mathbb{I}(\cdot)$ is the indicator function;

Mean rank (MR): It is computed as $\frac{1}{|S|} \sum_{i=1}^{|S|} \text{rank}_i$. A smaller value of MR tends to indicate better results.
MRR and Hit@10 measure the top rankings of the positive entity at different levels. Hit@10 cares about general top rankings, and the top 1 samples contribute most to MRR. Larger values of MRR and Hit@10 indicate better performance. To avoid underestimating the performance of different models, we report the performance in a "Filtered" setting, i.e., all the corrupted triplets that exist in the train, valid and test sets are filtered out [39, 9]. Note that MR is not a good metric, as it is easily influenced by false positive samples. We report it here to keep consistency with the existing literature [39, 9].
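The metrics and the "Filtered" setting can be sketched as follows. The helper names, the `score_fn(h, r, t)` callable, and the integer entity ids are assumptions made for this example; ties are resolved optimistically (only strictly better candidates push the rank down), which is one possible convention.

```python
def filtered_tail_rank(score_fn, h, r, t, num_entities, known_facts):
    """Rank of the true tail among all entities, skipping other known facts."""
    target = score_fn(h, r, t)
    rank = 1
    for e in range(num_entities):
        if e == t or (h, r, e) in known_facts:
            continue  # the answer itself, or another true triplet: filtered out
        if score_fn(h, r, e) > target:
            rank += 1
    return rank


def mrr(ranks):
    """Mean reciprocal ranking."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hit_at_k(ranks, k=10):
    """Percentage of ranks that fall within the top k."""
    return 100.0 * sum(1 for rank in ranks if rank <= k) / len(ranks)


def mean_rank(ranks):
    """Mean rank."""
    return sum(ranks) / len(ranks)
```

For example, with ranks `[1, 2, 4, 20]`, a single badly ranked triplet moves the mean rank to 6.75, while MRR is dominated by the two top-ranked results.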
IV-A4 Choices of the scoring function
A large number of scoring functions have been proposed in the literature, including the translational distance models TransE [7], TransH [42], TransR [26], TransD [20], TranSparse [21], TransM [11], ManifoldE [45], etc., and the semantic matching models RESCAL [32], DistMult [46], HolE [31], ComplEx [38], ANALOGY [27], etc. All these methods are summarized in a recent survey [40]. Following [9, 39], in the sequel, TransE, TransH, TransD, DistMult and ComplEx are used as scoring functions for comparison (see their definitions in Table III).
model  |  scoring function  |  definition

translational distance  |  TransE [7]  |  
translational distance  |  TransH [42]  |  
translational distance  |  TransD [20]  |  
semantic matching  |  DistMult [46]  |  
semantic matching  |  ComplEx [38]  |  
IV-B Comparison with State-of-the-arts
In this section, we focus on the comparison with state-of-the-art methods. Hyperparameters of NSCaching are studied in Section IV-C.
scoring function  method  WN18 MRR  WN18 MR  WN18 Hit@10  WN18RR MRR  WN18RR MR  WN18RR Hit@10  FB15K MRR  FB15K MR  FB15K Hit@10  FB15K237 MRR  FB15K237 MR  FB15K237 Hit@10
TransE  pretrained  0.4213  217  91.50  0.1753  4038  44.48  0.4679  60  74.70  0.2262  237  38.64
TransE  Bernoulli  0.5001  249  94.13  0.1784  3924  45.09  0.4951  65  77.37  0.2556  197  41.89
TransE  KBGAN + pretrain  0.6880  293  94.92  0.1864  4420  45.39  0.4858  82  77.02  0.2938  628  46.69
TransE  KBGAN + scratch  0.6606  301  94.80  0.1808  5356  43.24  0.3771  335  72.67  0.2926  722  46.59
TransE  NSCaching + pretrain  0.7867  271  66.62  0.2048  4404  47.38  0.6475  62  81.54  0.3004  188  47.36
TransE  NSCaching + scratch  0.7818  249  94.63  0.2002  4472  47.83  0.6391  62  80.95  0.2993  186  47.64
TransE  IGAN + pretrain  -  240  91.3  -  -  -  -  81  74.0  -  -  -
TransE  IGAN + scratch  -  244  92.7  -  -  -  -  90  73.1  -  -  -
TransH  pretrained  0.4527  233  92.71  0.1755  5646  43.30  0.4316  58  73.98  0.2222  223  38.80
TransH  Bernoulli  0.5206  288  94.52  0.1862  4113  45.09  0.4518  60  76.55  0.2329  202  40.10
TransH  KBGAN + pretrain  0.6168  335  94.84  0.1923  4708  45.31  0.4262  86  75.91  0.2807  401  46.39
TransH  KBGAN + scratch  0.6018  288  94.60  0.1869  4881  44.81  0.3364  311  72.53  0.2779  455  46.19
TransH  NSCaching + pretrain  0.8063  286  95.32  0.2038  4425  48.04  0.6520  54  81.96  0.2812  187  46.48
TransH  NSCaching + scratch  0.8038  266  95.29  0.2041  4491  48.04  0.6391  54  81.05  0.2832  185  46.59
TransH  IGAN + pretrain  -  258  94.0  -  -  -  -  81  77.0  -  -  -
TransH  IGAN + scratch  -  276  86.9  -  -  -  -  90  73.3  -  -  -
TransD  pretrained  0.4426  243  92.69  0.1782  4955  42.18  0.4320  59  73.98  0.2244  215  39.53
TransD  Bernoulli  0.5093  256  94.61  0.1901  3555  46.41  0.4529  63  76.55  0.2451  188  42.89
TransD  KBGAN + pretrain  0.6130  307  94.92  0.1917  3785  46.49  0.4069  75  74.27  0.2487  798  44.33
TransD  KBGAN + scratch  0.5950  332  94.68  0.1875  4083  46.41  0.3151  184  69.77  0.2465  825  44.40
TransD  NSCaching + pretrain  0.8022  295  94.99  0.2013  2952  48.36  0.6567  54  82.02  0.2883  184  48.33
TransD  NSCaching + scratch  0.7994  286  95.16  0.2013  3104  48.39  0.6415  58  81.32  0.2863  189  47.85
TransD  IGAN + pretrain  -  248  93.3  -  -  -  -  79  77.6  -  -  -
TransD  IGAN + scratch  -  221  93.0  -  -  -  -  89  74.0  -  -  -
DistMult  pretrained  0.6340  1174  92.28  0.3765  7405  44.85  0.5004  176  77.46  0.2247  408  36.03
DistMult  Bernoulli  0.7918  862  93.38  0.3964  7420  45.25  0.5698  148  76.32  0.2491  280  42.03
DistMult  KBGAN + pretrain  0.6955  1143  93.11  0.3849  7586  44.32  0.5568  201  75.57  0.2670  370  45.34
DistMult  KBGAN + scratch  0.7275  794  93.08  0.2039  11351  29.52  0.4227  321  64.35  0.2272  276  39.91
DistMult  NSCaching + pretrain  0.8297  1038  93.83  0.4148  7477  45.80  0.7177  98  84.56  0.2882  265  45.79
DistMult  NSCaching + scratch  0.8306  827  93.74  0.4128  7708  45.45  0.7501  132  84.36  0.2834  273  45.56
ComplEx  pretrained  0.8046  1106  93.75  0.3934  8259  41.63  0.5558  115  79.95  0.2201  418  35.55
ComplEx  Bernoulli  0.9115  808  94.39  0.4431  4693  51.77  0.6713  78  85.05  0.2596  238  43.54
ComplEx  KBGAN + pretrain  0.8976  1060  93.73  0.4287  6929  47.03  0.6254  162  80.95  0.2818  268  45.37
ComplEx  KBGAN + scratch  0.7233  966  85.81  0.3180  7528  35.51  0.5002  294  76.10  0.1910  881  32.07
ComplEx  NSCaching + pretrain  0.9286  1079  94.03  0.4487  4861  51.76  0.7459  123  84.17  0.3017  220  47.75
ComplEx  NSCaching + scratch  0.9355  1072  93.98  0.4463  5365  50.89  0.7721  82  86.82  0.3021  221  48.05
IV-B1 Compared methods
The following negative sampling methods are compared:

Bernoulli [42]: As a basic extension of the uniform sampling scheme used in TransE, Bernoulli sampling aims to reduce false negative labels by replacing the head or tail with different probabilities for one-to-many, many-to-one and many-to-many relations. Specifically, it samples a head-corrupted or a tail-corrupted triplet under a predefined Bernoulli distribution. Since it has been shown to be better than uniform sampling, we choose it as the basic random sampling scheme;

KBGAN [9]: This model first samples a candidate set uniformly from the whole entity set. The head or tail entity is then replaced with the entities in this set to form the candidate head-corrupted and tail-corrupted triplets, and the generator in KBGAN tries to pick one triplet among them. As proposed in [9], we choose the simplest model, TransE, as the generator. For a fair comparison, the size of the candidate set is the same as our cache size. We use the published code (https://github.com/cailw/KBGAN) and change its configuration to match ours;

NSCaching (Algorithm 2): As in Section III, the negative samples are formed by replacing the head entity or tail entity with one uniformly sampled from the head cache or tail cache. The cache is updated as in Algorithm 3. Note that we can also lazily update the cache several iterations later, which can further save time. However, we only report the results of immediate update, which is shown to be both effective and efficient. We use and lazy-update with unless otherwise specified.
As the source code of IGAN [39] is not available, we do not run it ourselves; instead, we directly use its reported performance in the sequel. Finally, we also use Bernoulli sampling to choose between head and tail corruption for KBGAN and NSCaching.
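The Bernoulli choice between head and tail corruption can be sketched as follows. Following the description in [42], the head is replaced with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail for the given relation. The helper names `head_replace_prob` and `corrupt` are hypothetical, and the filtering of corrupted triplets that happen to be known facts is omitted for brevity.

```python
import random
from collections import defaultdict


def head_replace_prob(triplets):
    """Return {relation: tph / (tph + hpt)} computed from (h, r, t) triplets."""
    tails_of = defaultdict(set)  # (r, h) -> set of tails
    heads_of = defaultdict(set)  # (r, t) -> set of heads
    for h, r, t in triplets:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)

    tph_counts = defaultdict(list)
    hpt_counts = defaultdict(list)
    for (r, _), tails in tails_of.items():
        tph_counts[r].append(len(tails))
    for (r, _), heads in heads_of.items():
        hpt_counts[r].append(len(heads))

    probs = {}
    for r in tph_counts:
        tph = sum(tph_counts[r]) / len(tph_counts[r])
        hpt = sum(hpt_counts[r]) / len(hpt_counts[r])
        probs[r] = tph / (tph + hpt)
    return probs


def corrupt(triplet, entities, probs, rng):
    """Replace the head with probability probs[r], otherwise replace the tail."""
    h, r, t = triplet
    e = rng.choice(entities)  # uniform replacement entity (may coincide with h or t)
    if rng.random() < probs[r]:
        return (e, r, t)
    return (h, r, e)


if __name__ == "__main__":
    kg = [("a", "r", "x"), ("a", "r", "y"), ("a", "r", "z")]
    probs = head_replace_prob(kg)
    print(probs, corrupt(kg[0], ["a", "x", "y", "z"], probs, random.Random(0)))
```

For a one-to-many relation such as the toy relation above, the head is replaced more often, since a random tail replacement is more likely to produce a true (false negative) triplet.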
Besides, as suggested in [9, 39], two training strategies are used for KBGAN and NSCaching:

From scratch: The embeddings of relations and entities are initialized by the Xavier uniform initializer [14], and the models (denoted as KBGAN + scratch and NSCaching + scratch) are trained directly on the given KG;

With pretrain: As in [9, 39], we first pretrain each scoring function under the baseline model, i.e., Bernoulli sampling, for several epochs on each data set. We denote the result as pretrained. The obtained parameters are then used to warm-start training on the given KG instead of starting from scratch. We keep training from the warm-started KG embedding and evaluate the performance under the different models, i.e., Bernoulli, KBGAN + pretrain and NSCaching + pretrain. In addition, the generator in KBGAN is warm-started with the corresponding TransE model.
IV-B2 Hyperparameter settings
We use grid search to select the following hyperparameters: the hidden dimension and the learning rate. For translational distance models, we also tune the margin value; for semantic matching models, we tune the penalty value [38]. We train with Adam [22], a popular variant of the SGD algorithm, and adopt its default settings except for the learning rate. The best hyperparameters are selected under the Bernoulli sampling scheme according to the MRR on the validation set, and are kept fixed for the baseline methods Bernoulli and KBGAN and for our proposed NSCaching. Following [9], we save the pretrained model after several initial training epochs. Then, the Bernoulli method keeps training until 3000 epochs, while KBGAN and NSCaching are evaluated within 1000 epochs, either from scratch or with pretrain. All reported results are test results obtained with the parameters that give the best MRR on the validation set.
IV-B3 Results on translational distance models
The performance on link prediction is compared in Table IV. First, we can see that, for the translational distance models (TransE, TransH, TransD), KBGAN, NSCaching and IGAN (both with pretrain and from scratch) gain significant improvement over the baseline scheme Bernoulli, especially in MRR, which is mainly influenced by the top rankings. This verifies the need for high-quality negative triplets during negative sampling, and shows that these methods can effectively pick up such triplets.
Then, IGAN and KBGAN perform better with pretrain than from scratch, as indicated by MRR and Hit@10. This shows that pretraining is helpful for GAN-based models. In comparison, the proposed NSCaching, trained from either state (pretrain or scratch), can outperform IGAN and KBGAN. Finally, we find that MR is not an appropriate metric, as many of the pretrained models, which have not converged yet, show an even smaller MR than Bernoulli.
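The remark about MR can be illustrated with a small sketch. Assuming the usual definitions (MRR is the mean of the reciprocal ranks, MR the mean rank, and Hit@10 the percentage of test triplets ranked in the top 10), a few badly ranked triplets dominate MR while barely affecting MRR. The function name and the two toy rank lists are made up for this example.

```python
def ranking_metrics(ranks, k=10):
    """Return (MRR, MR, Hit@k in percent) for a list of 1-based ranks."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    mr = sum(ranks) / n
    hits = 100.0 * sum(r <= k for r in ranks) / n
    return mrr, mr, hits


if __name__ == "__main__":
    # Model A ranks most triplets first but has one outlier.
    model_a = [1, 1, 1, 1000]
    # Model B never ranks a correct triplet in the top 10.
    model_b = [20, 20, 20, 20]
    print("A:", ranking_metrics(model_a))
    print("B:", ranking_metrics(model_b))
```

In this toy case model B has the smaller (better) MR even though model A is clearly better on MRR and Hit@10, which mirrors the observation that an unconverged model can show a smaller MR.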
The convergence of testing performance for the various algorithms is shown in Figures 2 and 3. We use TransD as it offers the best performance among the three translational distance models. As can be seen, all algorithms converge to a stable testing MRR and Hit@10, which verifies the empirical convergence of the Adam optimizer. Then, while pretrain is a must for KBGAN to achieve good performance, NSCaching obtains good performance either from scratch or with pretrain. Finally, in all cases, NSCaching converges much faster and is more stable than both Bernoulli and KBGAN.
IV-B4 Results on semantic matching models
The performance is shown in the bottom rows of Table IV. As with the translational distance models, NSCaching outperforms the baseline scheme Bernoulli significantly, as indicated by the bold and underlined numbers. However, KBGAN does not show consistent performance. It performs even worse than the Bernoulli sampling scheme on WN18, WN18RR and FB15K, and KBGAN from scratch performs much worse than KBGAN with pretrain. This observation further supports the view that GAN-based methods usually suffer from instability and degeneracy, and need a careful balance between the generator and the target KG embedding model. In contrast, NSCaching works consistently and performs the best among the various settings.
The convergence of testing performance for the various algorithms is shown in Figures 4 and 5. We use ComplEx as the representative since it performs much better than DistMult. As can be seen, both Bernoulli and the proposed NSCaching converge to a stable state. In contrast, the performance of KBGAN drops after several epochs as it overfits. NSCaching, either with pretrain or from scratch, leads in performance and carries over to the semantic matching models without further tuning.
IV-B5 Results on triplet classification
To further verify the quality of the learned embeddings, we test them on the triplet classification task on the WN18RR and FB15K237 datasets. This task is to decide whether a given triplet is correct or not, i.e., binary classification on triplets [42]. In practice, it can help us quickly answer true-or-false questions. The WN18RR (https://github.com/thunlp/OpenKE/blob/master/benchmarks/WN18RR/valid_neg.txt) and FB15K237 (https://github.com/thunlp/OpenKE/blob/master/benchmarks/FB15K237/valid_neg.txt) datasets release a set of positive and negative triplets, which can be used to evaluate the performance on the classification task. The decision rule is as follows: for each triplet, if its score is no less than the relation-specific threshold, predict positive; otherwise, predict negative. The threshold is chosen to maximize the classification accuracy on the validation set. As shown in Table V, NSCaching still outperforms the baselines. This experiment further shows that the proposed NSCaching helps learn a better embedding of the KG.
scoring function  method  WN18RR  FB15K237
TransD  Bernoulli  86.81  78.24
TransD  KBGAN + pretrain  85.93  79.03
TransD  KBGAN + scratch  86.01  79.05
TransD  NSCaching + pretrain  87.84  80.63
TransD  NSCaching + scratch  87.64  80.69
ComplEx  Bernoulli  84.48  77.64
ComplEx  KBGAN + pretrain  79.87  74.11
ComplEx  KBGAN + scratch  71.73  72.61
ComplEx  NSCaching + pretrain  84.96  79.88
ComplEx  NSCaching + scratch  84.83  80.21
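The decision rule for triplet classification described above can be sketched in a few lines. The sketch assumes that validation scores and boolean labels are already grouped per relation; the function names and the exhaustive search over observed scores as candidate thresholds are illustrative choices, not necessarily what was used in the experiments.

```python
def best_threshold(scores, labels):
    """Return (threshold, accuracy) maximizing accuracy of `score >= threshold`."""
    # Every observed score is a candidate; +inf covers "predict all negative".
    candidates = sorted(set(scores)) + [float("inf")]
    best_thr, best_acc = None, -1.0
    for thr in candidates:
        correct = sum((s >= thr) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc


def fit_thresholds(valid):
    """valid maps relation -> (scores, labels); returns relation -> threshold."""
    return {r: best_threshold(s, y)[0] for r, (s, y) in valid.items()}


def classify(score, relation, thresholds):
    """Predict positive iff the score is no less than the relation's threshold."""
    return score >= thresholds[relation]


if __name__ == "__main__":
    valid = {"r1": ([0.9, 0.8, 0.3, 0.1], [True, True, False, False])}
    thresholds = fit_thresholds(valid)
    print(thresholds, classify(0.85, "r1", thresholds))
```

Keeping one threshold per relation matters because scores of different relations are generally not on a comparable scale.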
IV-C Cache Update and Sampling Scheme
In Section IV-B, we have shown that NSCaching achieves the best performance on four benchmark datasets. Here, we analyze the design choices concerning “exploration and exploitation” at steps 6 and 8 of Algorithm 2. TransD and WN18 are used here.
IV-C1 Uniform sampling from the cache (step 6)
Given a cache that stores high-quality negative samples, the first question is how to sample from it. Recall that we discussed three strategies in Section III-B1: (i) uniform sampling from the cache (denoted as “uniform sampling”); (ii) importance sampling according to the score of each sample in the cache (denoted as “IS sampling”); and (iii) top sampling, which chooses the sample with the largest score (denoted as “top sampling”). The testing MRR of TransD on WN18 is compared in Figure 6(a). As can be seen, top sampling has the worst performance, and uniform sampling is the best.
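The three ways of drawing one negative from a cache can be sketched as below. The exact importance weights are not restated in this section, so the sketch assumes a softmax over the cached scores for IS sampling; the function name `sample_from_cache` and the strategy labels are hypothetical.

```python
import numpy as np


def sample_from_cache(cache, scores, strategy, rng):
    """Pick one cached entity using 'uniform', 'is' or 'top' sampling."""
    scores = np.asarray(scores, dtype=float)
    if strategy == "uniform":
        idx = rng.integers(len(cache))
    elif strategy == "is":
        # Assumed form: softmax of the scores (shifted for numerical stability).
        p = np.exp(scores - scores.max())
        p /= p.sum()
        idx = rng.choice(len(cache), p=p)
    elif strategy == "top":
        idx = int(np.argmax(scores))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return cache[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache = ["e1", "e2", "e3"]
    scores = [0.1, 2.0, 0.5]
    for strategy in ("uniform", "is", "top"):
        print(strategy, sample_from_cache(cache, scores, strategy, rng))
```

Top sampling is deterministic for a fixed cache, so it keeps returning the same entity until the cache changes, whereas uniform sampling spreads the picks over all cached entities.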
To show how exploration and exploitation are balanced here, we further compute two criteria that distinguish these strategies: (i) the repeat ratio (denoted as “RR”), which measures the percentage of repeated negative triplets within a range of epochs; and (ii) the non-zero loss ratio (denoted as “NZL”), which is the percentage of non-zero losses in the same range. RR is related to exploration: if the number of repeated negative triplets is high, the negative samples only explore a small part of the sample space, which results in worse exploration. NZL measures exploitation: a larger NZL means a higher quality of the picked negative samples.
The RR is shown in Figure 7(a). The Bernoulli sampling method has almost no repeated triplets since the number of explored negatives is extremely large, so it has the best exploration. Among the schemes based on NSCaching, uniform sampling has better exploration than IS sampling, followed by top sampling. The NZL ratio is shown in Figure 7(b). As training goes on, the baseline Bernoulli model suffers severely from the zero loss problem, which leads to vanishing gradients. All three cache-based schemes have more than half non-zero losses and thus achieve exploitation. To sum up, uniform sampling is the most balanced strategy among the three schemes, which is why NSCaching + uniform achieves the best performance.
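The two criteria can be computed along the following lines. The precise counting used for the figures is not spelled out here, so the sketch assumes that RR is the fraction of sampled negatives that already appeared earlier in the same window, and that NZL is the fraction of strictly positive losses. The margin ranking loss is written in its common form with larger scores meaning more plausible triplets; all function names are illustrative.

```python
def margin_loss(pos_score, neg_score, margin):
    """Hinge loss max(0, margin - f(positive) + f(negative))."""
    return max(0.0, margin - pos_score + neg_score)


def repeat_ratio(negatives):
    """Fraction of sampled negative triplets already seen in this window."""
    seen = set()
    repeats = 0
    for triplet in negatives:
        if triplet in seen:
            repeats += 1
        else:
            seen.add(triplet)
    return repeats / len(negatives)


def nonzero_loss_ratio(losses):
    """Fraction of losses that are strictly positive."""
    return sum(loss > 0.0 for loss in losses) / len(losses)


if __name__ == "__main__":
    negatives = [(1, 0, 2), (1, 0, 3), (1, 0, 2), (1, 0, 2)]
    # An easy negative (score far below the positive) yields zero loss.
    losses = [margin_loss(-1.0, -5.0, 2.0), margin_loss(-1.0, -1.5, 2.0)]
    print(repeat_ratio(negatives), nonzero_loss_ratio(losses))
```

A zero loss contributes no gradient, which is why a low NZL under random sampling goes hand in hand with the vanishing gradient problem discussed in the text.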
IV-C2 Importance sampling strategy to update the cache (step 8)
As discussed in Section III-B2, we have two choices for updating the cache: (i) the importance sampling based method, which samples entities from the candidates without replacement according to the probability in (6) (IS update); and (ii) the top sampling method, which directly selects the entities with the top scores among the candidates (top update). Again, let us first look at the performance comparison in Figure 6(b). We can see that IS update outperforms top update by a large margin.
Then, to explain the exploration and exploitation here, we add two extra measurements for comparison: (i) the number of changed elements in the cache (denoted as “CE”) and (ii) the ratio of non-zero losses, i.e., NZL. More changed elements lead to more exploration, and more non-zero losses mean more exploitation.
CE counts the elements in the cache that differ over a period of epochs. As shown in Figure 8(a), the number of changed elements under the top update scheme is much smaller than under the importance sampling update. As a result, the cache is updated quite slowly and the model mainly focuses on a few highly scored negative triplets, which may contain many false negative triplets. By comparison, the importance sampling based update scheme keeps the cache fresh and keeps track of the dynamic changes of the negative sampling distribution. It not only provides enough qualified negative triplets for the KG embedding model to avoid zero loss, but also explores the large negative sample space well. In summary, we choose the importance sampling strategy to update the cache.
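A minimal sketch of the two cache update schemes and of the CE measurement is given below. It assumes that the candidate pool is the current cache plus a set of randomly drawn entities, and that the importance weights are a softmax over the scores; the equation referenced in the text is not restated in this section, so this form is an assumption, as are the names `update_cache`, `changed_elements` and `score_fn`.

```python
import numpy as np


def update_cache(cache, candidates, score_fn, rng, mode="is"):
    """Refresh a cache of distinct entities from cache + candidates."""
    pool = list(dict.fromkeys(list(cache) + list(candidates)))  # de-duplicate, keep order
    scores = np.array([score_fn(e) for e in pool], dtype=float)
    k = len(cache)
    if mode == "is":
        # Assumed weights: softmax of the scores; sample without replacement.
        p = np.exp(scores - scores.max())
        p /= p.sum()
        idx = rng.choice(len(pool), size=k, replace=False, p=p)
    elif mode == "top":
        idx = np.argsort(-scores)[:k]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [pool[i] for i in idx]


def changed_elements(old_cache, new_cache):
    """CE: number of entities in the new cache that were not in the old one."""
    return len(set(new_cache) - set(old_cache))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache, candidates = [0, 1, 2], [3, 4, 5]
    score = lambda e: float(e)  # toy score: larger entity id scores higher
    for mode in ("is", "top"):
        new_cache = update_cache(cache, candidates, score, rng, mode)
        print(mode, new_cache, changed_elements(cache, new_cache))
```

Under the top update the same high-scoring entities stay in the cache once they enter it, while the IS update occasionally admits lower-scoring entities, which is the source of the larger CE discussed above.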
IV-D Sensitivity Analysis: Cache Size
Compared with the baseline KG embedding models (i.e., Bernoulli [42, 26]), the only extra hyperparameters here are the cache size and the size of the random candidate subset. The former is the number of entries kept in the cache; the latter is the number of randomly sampled negative candidates that are later used to update the cache. Here, we show their impact on NSCaching’s performance.
Figure 9(a) shows how the performance changes as the cache size varies with the random subset size fixed. When the cache size is small, the average score of the entities stored in the cache should be larger than in a larger cache. Thus, false negative samples are more likely to be sampled, which pushes the boundary to a bad location. For the other cache sizes, NSCaching performs quite stably: the convergence speed is similar, as are the values in the converged state. Thus, when looking for an appropriate cache size, one can search from smaller values upward until the performance is stable.
The performance under different sizes of the random candidate subset is shown in Figure 9(b). Obviously, the entities in the cache are updated more frequently when the subset gets larger, which leads to better exploration. The trade-off is that a larger subset costs more. As shown by the colored lines in Figure 9(b), NSCaching performs consistently when the subset size is larger than 10. However, if the random subset is small, the content of the cache is harder to update, which leads to poor performance, as shown by the yellow dashed line.
IV-E Illustration of Vanishing Gradient
To further clarify the vanishing gradient problem, we plot the average norm of the gradients versus the number of epochs in Figure 10. Note that Adam [22], which is a stochastic gradient descent algorithm, is used as the optimizer. First, we can see that while the gradient norms for both NSCaching and Bernoulli become smaller, they do not decrease to zero, since the sampling of mini-batches introduces noise into the gradients. However, the norm from NSCaching is larger than that from Bernoulli, which is due to the caching-based negative sampling scheme. Thus, NSCaching successfully avoids the vanishing gradient problem.
IV-F Explanation of the connection to Self-Paced Learning
Finally, we visualize the changes of the entities in the cache, which verifies the effect of self-paced learning introduced in Section III-C. Following [39], we use FB13 here since its triplets are more interpretable than those of the four evaluated datasets. We pick a positive triplet, and the changes in its tail-cache are shown in Table VI. As can be seen, the entities are meaningless at first, e.g., ostrava and ben_lilly, and then gradually change to human jobs, e.g., artist and sex_worker.
epoch  entities in cache 

0  allen_clarke, jose_gola, ostrava, ben_lilly, hans_zinsser 
20  accountant, frank_pais, laura_marx, como, domitia_lepida 
100  artist, , aviator, hans_zinsse, john_h_cough 
200  physician, artist, raich_carter, coach, mark_shivas 
500  artist, physician, cavan, sex_worker, attorney_at_law 
V Related work
V-1 Generative Adversarial Network
Generative Adversarial Network (GAN) was originally introduced as a powerful model for plausible image generation. A GAN contains two modules: a generator that serves as a complex distribution sampler, and a discriminator that measures the quality of the generated samples. With elaborate control of the training procedure of the generator and discriminator [1, 18], GAN has achieved significant success in the computer vision field [35, 47]. It has also been shown to produce high-quality negative samples for knowledge graph embedding [9, 39].
V-2 Negative Sampling
Negative sampling was originally introduced as an alternative to the hierarchical softmax, aiming to reduce the complexity of the softmax on large-scale datasets [19]. It then became popular in embedding learning, especially for word embedding [15], graph embedding [17], and KG embedding [40]. More recently, there has been interest in applying GAN to negative sampling, e.g., IGAN [39] and KBGAN [9] for KG embedding and self-paced GAN [12] for network embedding.
VI Conclusion
We proposed NSCaching as a novel negative sampling method for knowledge graph embedding learning. The negative samples are drawn from a cache that dynamically holds high-quality negative samples. We analyzed the design of NSCaching through the balance of exploration and exploitation. Empirically, we tested NSCaching on four datasets and five scoring functions. The results show that the method generalizes well under various settings and achieves state-of-the-art performance on the FB15K dataset. When dealing with KGs at the scale of millions, the memory for storing the cache becomes a problem; using distributed computation or hashing will be pursued as future work. Besides, the theoretical convergence of NSCaching is also an important and interesting direction for future work.
References
 [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. Technical report, 2017.
 [2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer, 2007.
 [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
 [4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250. ACM, 2008.
 [5] A. Bordes, S. Chopra, and J. Weston. Question answering with subgraph embeddings. In EMNLP, pages 615–620, 2014.
 [6] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259, 2014.
 [7] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
 [8] A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In ECML-PKDD, pages 165–180. Springer, 2014.
 [9] L. Cai and W. Y. Wang. KBGAN: Adversarial learning for knowledge graph embeddings. In ACL, volume 1, pages 1470–1480, 2018.
 [10] L. Drumond, S. Rendle, and L. Schmidt-Thieme. Predicting RDF triples in incomplete knowledge bases with tensor factorization. In SAC, pages 326–331, 2012.
 [11] M. Fan, Q. Zhou, E. Chang, and T. F. Zheng. Transition-based knowledge graph embedding with relational mapping properties. In PACLIC, 2014.
 [12] H. Gao and H. Huang. Self-paced network embedding. In SIGKDD, pages 1406–1415, 2018.
 [13] L. Getoor and B. Taskar. Introduction to statistical relational learning, volume 1. The MIT Press, 2007.
 [14] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
 [15] Y. Goldberg and O. Levy. word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. Technical report, 2014.
 [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
 [17] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In SIGKDD, pages 855–864. ACM, 2016.
 [18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, pages 5767–5777, 2017.
 [19] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pages 297–304, 2010.
 [20] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao. Knowledge graph embedding via dynamic mapping matrix. In ACL, volume 1, pages 687–696, 2015.
 [21] G. Ji, K. Liu, S. He, and J. Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In AAAI, pages 985–991, 2016.
 [22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Technical report, 2014.
 [23] S. Kok and P. Domingos. Statistical predicate invention. In ICML, pages 433–440, 2007.
 [24] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.
 [25] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, pages 529–539. Association for Computational Linguistics, 2011.
 [26] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, volume 15, pages 2181–2187, 2015.
 [27] H. Liu, Y. Wu, and Y. Yang. Analogical inference for multi-relational embeddings. In ICML, pages 2168–2178, 2017.
 [28] T. Mikolov, W. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In ACL, pages 746–751, 2013.
 [29] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
 [30] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
 [31] M. Nickel, L. Rosasco, and T. A. Poggio. Holographic embeddings of knowledge graphs. In AAAI, volume 2, pages 3–2, 2016.
 [32] M. Nickel, V. Tresp, and H. Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011.
 [33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. Technical report, 2017.
 [34] A. Singhal. Introducing the knowledge graph: things, not strings. Official Google blog, 5, 2012.
 [35] Q. Song, H. Ge, J. Caverlee, and X. Hu. Self-attention generative adversarial networks. Technical report, 2018.
 [36] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In WWW, pages 697–706, 2007.
 [37] K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. In Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.
 [38] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In ICML, pages 2071–2080, 2016.
 [39] P. Wang, S. Li, and R. Pan. Incorporating GAN for negative sampling in knowledge representation learning. AAAI, 2018.
 [40] Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. TKDE, 29(12):2724–2743, 2017.
 [41] Y. Wang, D. Ruffinelli, S. Broscheit, and R. Gemulla. On evaluating embedding models for knowledge base completion. arXiv preprint arXiv:1810.07180, 2018.
 [42] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, pages 1112–1119, 2014.
 [43] D. A. White. The knowledge-based software assistant: A program summary. In ICKBSE, pages 2–6, 1991.
 [44] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 [45] H. Xiao, M. Huang, and X. Zhu. From one point to a manifold: knowledge graph embedding for precise link prediction. In IJCAI, pages 1315–1321, 2016.
 [46] B. Yang, W. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. Technical report, 2017.
 [47] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2242–2251. IEEE, 2017.