Distributional Negative Sampling for Knowledge Base Completion

08/16/2019 ∙ by Sarthak Dash, et al. ∙ ibm 0

State-of-the-art approaches for Knowledge Base Completion (KBC) exploit deep neural networks trained with both false and true assertions: positive assertions are explicitly taken from the knowledge base, whereas negative ones are generated by random sampling of entities. In this paper, we argue that random sampling is not a good training strategy since it is highly likely to generate a huge number of nonsensical assertions during training, which does not provide relevant training signal to the system. Hence, it slows down the learning process and decreases accuracy. To address this issue, we propose an alternative approach called Distributional Negative Sampling that generates meaningful negative examples which are highly likely to be false. Our approach achieves a significant improvement in Mean Reciprocal Rank values amongst two different KBC algorithms in three standard academic benchmarks.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Knowledge Base Completion (KBC) systems leverage existing knowledge from within an incomplete input KB in order to validate the truth value of unknown assertions, so as to augment an existing KB with additional assertions. A sub-task within KBC called link prediction is defined as follows: given the predicate and either subject or object entity of an assertion, the system should correctly predict the missing entity. For every possible entity in the KB an assertion is generated, and the system provides an assessment of its validity which it then uses to generate a ranked list of all possible candidates.

One of the popular approaches to train models for KBC leverage the assertions in the original KB in order to provide positive examples and use Random Negative Sampling (RNS) to generate negatives. RNS for link prediction consists of selecting a triple from the KB and substituting one of its entity (e.g. the object) with a random entity from the KB. For example, a positive assertion PlaysInMovie(Elijah Wood, Lord Of The Rings) might generate PlaysInMovie(Elijah Wood, DVR) or PlaysInMovie(Elijah Wood, Star Trek) as negative examples. However, this approach is far from being the optimal choice for training KBC systems, which are supposed to distinguish between true and false assertions. In fact, if one looks in depth into the meaning of the two different negative assertions generated by RNS in the example above, there is a very big difference: the first one does not make sense, while the second is actually meaningful, but false.

In this work, we introduce Distributional Negative Sampling (DNS), an alternative approach for RNS that can be applied to train deep learning based KBC systems. DNS is aimed at generating meaningful negative examples which are highly likely to be false, preventing generation of nonsensical negatives. DNS uses distributional similarity to drive the negative sampling process. Specifically during training, given a positive triple, DNS generates negative examples by replacing that entity with other entities that are similar to it. In the example above, the vector for

Lord Of The Rings

, as estimated during the KBC learning process at that state is compared against all other entities in the KG; those having higher similarity are more likely to be chosen as negative examples as opposed to random ones. As an example, the representation for

StarTrek is highly likely to be similar to Lord Of The Rings, as opposed to DVR, and therefore the assertion PlaysInMovie(Elijah Wood, Star Trek) is more likely to be selected as a negative sample. This creates a repository of meaningful (and mostly false) statements, that we use as negative examples for that statement.

Our intuition is that the entity vectors trained by KBC deep nets represent the distributional properties of entities in the original KB. Therefore, cosine similarity between those vectors can be used as a proxy for their distributional similarity. Since entity vectors are updated dynamically by the KBC algorithm, the behavior of DNS improves at every epoch, sparking a virtuous circle that reduces substantially the number of epochs needed for while significantly improving accuracy.

We test DNS on two different KBC architectures, showing consistent and significant improvement in five out of six settings - two architectures across three different evaluation benchmarks. DNS is the main contribution of this paper.

The rest of the paper is structured as follows. In section 2, we describe some of the common KBC approaches, focusing on the deep learning models and their negative sampling strategies. DNS is described in Section 3. In section 4 we describe our evaluation benchmarks, discussing the evaluation results. We also provide an analysis of the algorithm in section 5. Finally, Section 6 concludes the paper opening some interesting research direction for the future work.

2 Related Work

KBC is a very active field of research and state of the art approaches for this task are almost entirely deep learning based and make use of RNS. Bordes et al. (2013) define an algorithm TransE wherein the relation vector translates the subject representation onto the object representation. Wang et al. (2014) define an algorithm TransH that behaves like TransE

with a minor adjustment that the translation happens on relation specific hyper-planes upon which the entity vectors are projected. Compositional models using the tensor product such as

RESCAL Nickel et al. (2011) and Neural Tensor Network Socher et al. (2013) employ higher order tensors to represent entities and relations. Nickel et al. (2016b) employs correlation operator between entity representations and a dot product with the relation vector, to represent interactions within the triples for their HolE system. Trouillon et al. (2016) represent each real valued embeddings into their corresponding complex variants, and performs Hermitian dot product operation - the argument being the fact that this operation models both symmetric and antisymmetric relations. More complex models have been developed that use the path and content information in the KGs, for instance see Lin et al. (2015a); Toutanova and Chen (2015).

Just an handful of work in the KBC proposes alternatives to RNS. Xie et al. (2017) introduce a domain sampling approach, where for each relation

a probability

(based on the training data) is calculated. The negative sample is sampled from within the domain with probability and with probability it is sampled from the set of all entities. Cai and Wang (2017) define a GAN, an adversarial learning framework that uses the generator network to provide negative samples to the discriminator. Kotnis and Nastase (2017) proposes an approach to negative sampling based on using additional external knowledge about the type of the entities in the KB.

This approach consists of two steps, first a KBC model is trained using Typed-Sampling i.e. corrupted entities belonging to the same type as the replacement entity are used to construct negative samples. In the second step, a standard KBC model is trained which uses nearest neighbor sampling (on the frozen model in the first step) to generate negative samples. Since, such a system requires additional knowledge, this approach cannot be applied to KB where such type information is sparse or not available, making its applicability to real word scenarios unfeasible.

DNS approach is different from above work in the following fashion. First, DNS does not use type information and any other additional assumption. Second, DNS uses a stochastic approach to sample negatives as opposed to a deterministic nearest neighbor approach.

Note that, DNS reuses the embedding updated within the same KBC network it is has been trained from, generating a positive feedback loop that ultimately results in faster convergence (shown empirically) as measured by the number of epochs.

3 Distributional Negative Sampling

Let us define the problem setting in terms of notations. Denote the knowledge base consisting of a list of true triples as,

where denotes the fact present in the knowledge base comprising of head entity, relation and tail entity respectively. Let denote the total number of triples present in the knowledge base, and denote the set of entities and relations in the KB by and . The problem statement for KB Completion is to learn representations of entities and relations within and that best predict the triples present in the . Most deep learning architectures for KBC exploit fixed dimensional tensors to represent entities and relations; and differ by the way they combine these tensors to score a given triple - this difference yields different architectures.

Training of these models for KBC requires generation of negative samples, which are created under the Local Closed World Assumption (LCWA). Leveraging this assumption, one randomly generates negative examples by corrupting existing triples from the . We call this set , as opposed to the set which contains triples from original KB.

A commonly used loss function for training (follows from LCWA) is the pairwise ranking loss,

(1)

where specifies the width of the margin Bordes et al. (2011).

In practice, a fixed number C (a hyper-parameter) of negative samples are generated at random (hereby referred to as Random Negative Sampling(RNS)) in order to create a skewed distribution with a small yet fixed negatives-to-positives ratio. The negative samples are generated by either replacing the head (or tail) entity of a given triple by a random entity from

. This approach is then repeated for all triples in , followed by calculating the loss function in Equation 1 which is subsequently optimized during training111

For a quick review of machine learning on knowledge graphs that use pairwise ranking loss see also

Nickel et al. (2016a).
.

DNS on the other hand, has been designed to provide a solution to the problem of generating nonsensical training examples. DNS approach relies on the idea that meaningful assertion can be automatically generated by replacing entities in a given triple with other entities belonging to the same type. Note that, two entities and tend to have the same type if they share many relations, i.e. they are distributionally similar222This approach in inspired by the distributional hypothesis in computational linguistics Harris (1954).. Inspired by this idea, we propose the DNS block-diagram in Figure 1.

Figure 1: DNS Architecture using an off-the-shelf KBC model.

Our intuition is that the entity vectors trained by KBC deep networks represent well the distributional properties of entities in the original KB. Therefore, cosine similarity between those vectors can be used as a proxy for the semantic relatedness. This enables us to define the DNS algorithm as described by Algorithm 1 below.

Input Training set S = {(h,r,t)}, entity and relation sets and , batch size b.
Returns Given a batch B = {}, , return corrupted triples P for each triple in the batch.

1:
2:for each triple = batch B do
3:      Stores negative triples for triple
4:      Use bernoulli sampling to conclude that t should be corrupted.
5:     for (i, m) enumerate(M) do python style syntax. M: shape
6:         if  then
7:              continue Do not consider samples that are known to be true          
8:         
9:         
10:         With probability choose entity i.
11:         if entity i is chosen then
12:               [i] denotes entity i               
13:     Append (j, N) to P j: index of triple in step 2
14:return
Algorithm 1 Distributional Negative Sampling

Let us see in detail what’s happening in Algorithm [1] below. Let P denote all negative triples constructed per epoch. Steps 3-13 describes the algorithm for each training instance. For each triple , we use bernoulli sampling (as introduced by Wang et al. (2014)) to determine whether to corrupt the tail or the head entity. Without loss of generality, let us assume that the tail entity is to be corrupted. Cosine similarity is used to compute similarity values between and all other entities present in set 333Note that we have used cosine-similarity as a measure of distributional similarity in Algorithm [1]. We tried other distance metrics such euclidean and manhattan distances, but found that they performed no better.. If the generated negative sample actually exists in the training set then it is ignored (Steps 6,7) else, we make use of its associated similarity score in order to convert it into relevant probabilities in steps 8, 9. The intuition is that entities having representations orthogonal (or worse) to the entity in question are hardly similar, and therefore will not generate a meaningful assertion. Now, using the probabilities defined above, we choose whether to include the corresponding entity within our list of negative samples or not.

Once N is constructed (pertaining to the loop between steps 5-12), this set consists of negative samples corresponding to input triple . We then append i.e. all the corrupted instances for to the list P, and repeat this process for all the triples in the training data. Once completed, the list P contains all negative samples generated via DNS for the training data.

A few points to note on the above Algorithm [1]. Firstly, because of steps 8-10, there exists a non-zero albeit small probability of choosing a totally unrelated entity as a negative sample. When repeated over many training epochs, this allows the model to effectively explore the space of negatives, by focusing highly on meaningful negatives and less (non zero) on nonsensical assertions, as opposed to deterministically choosing top k nearest neighbors proposed by Kotnis and Nastase (2017).

Thirdly, unlike other KBC algorithms which require the number of negative samples as a hyper parameter (one that needs to be optimized by cross-validation), our Algorithm [1] does not need this parameter to explicitly specified. It automatically figures it out based on sampling step 10, using the accept/reject probabilities defined in steps 8,9 respectively. Thus in this aspect, the DNS variant is more generic than its RNS counterpart.

Implementation wise, in real-world situations that involve enormous number of entities, computing cosine-similarity in batches during training (step 4 in Algorithm [1]) may not be possible due to out-of-memory issues. To alleviate this, we can make use of Annoy444https://github.com/spotify/annoy, an open source library that implements Approximate Nearest Neighbor functionality using Locality-Sensitive Hashing(LSH).

4 Evaluation

In this section we provide an extensive evaluation of DNS and RNS on link prediction tasks for three data sets widely used in KBC literature. The link prediction task is defined as the task of predicting the correct object (or subject) entity given the subject (or object) entity and the relation parameter. Link prediction task is usually defined by splitting a gold standard KB into two subgraphs, used for training and test. For each triple in the test the KBC model is queried, this returns a ranked list of entities as an output. This ranked list is then used for evaluation purposes.

We evaluated our system in three different standard link prediction benchmarks, derived from Freebase Bollacker et al. (2008) and WordNet Miller (1995). For the Freebase KB, we use FB15K introduced in Bordes et al. (2013), and a more challenging dataset Fb15k-237 introduced in Toutanova and Chen (2015). The latter is obtained by removing near-duplicate and inverse relations from FB15K. For the WordNet KB, we use the dataset WN18RR introduced by Dettmers et al. (2017)

, which removes reversing relations and increases the difficulty of reasoning. Compared to an older WN18 dataset, this dataset WN18RR is much more challenging and therefore serves as a better dataset compared to its older variant. The evaluation metrics that we use are

filtered Mean Reciprocal Rank (MRR), Hits@10 and Hits@1. The dataset statistics are summarized in Table 1.

Dataset FB15K Fb15k-237 WN18RR
# Train 483,152 272,115 86,835
# Valid 50,000 17,535 3,034
# Test 59,071 20,466 3,134
# Entities 14,951 14,951 40,943
# Relations 1,357 237 11
Table 1: Knowledge Base completion datasets statistics

We compare the performances of DNS and RNS when applied to two different KBC algorithms: TransE Bordes et al. (2013) and RESCAL Nickel et al. (2011). These algorithms are easy to visualize in terms of vector operations. The DNS negatives thus generated are used directly to calculate the pairwise ranking loss (as described in equation 1).

The hyperparameters for all the experiments were fine-tuned (using grid search) based on the validation set, and

Adam optimizer Kingma and Ba (2014) was used with default hyper-parameter settings: -8. For FB and WN datasets, RESCAL used an embedding dimension of 200, whereas TransE algorithm used an embedding dimension of 100. The margin parameter for TransE model was finally set to 10.0, and for RESCAL model it equalled 5.0. Bernoulli Sampling introduced by Wang et al. (2014) was used to determine whether to corrupt the head or the tail entity. The maximum number of epochs was capped at 1000 for TransE/RESCAL models for FB and WN datasets. In addition, early-stopping criterion was considered to be the Filtered MRR on the validation set, and the threshold was set to be 20 epochs.

Table 2 provides comparable evaluation of RNS and DNS across the three different KBC benchmarks.

FB15k Fb15k-237 WN18RR
Algorithm MRR H@10 H@1 MRR H@10 H@1 MRR H@10 H@1
TransE(RNS) 46.3 74.9 29.7 25 42.8 16.9 - -
TransE(DNS) 43.0 63.9 31.1 29.2 45.7 20.9 18.4 44.4 4.3
RESCAL(RNS) 35.4 58.7 23.5 22.6 34.4 16.3 39.9 42.1 38.6
RESCAL(DNS) 41.2 62.7 29.6 27.5 44.1 19.2 42.8 44.1 42.1
Table 2: Knowledge Base completion performance comparison. H@n denotes Hits@n. We report filtered Mean Reciprocal Rank (MRR), filtered Hits@10 and filtered Hits@1 numbers in this table. All the values expressed here are in percentages. Results marked with †are produced by running Fast-TransX (Lin et al. (2015b)) with its default parameters. RNS results for TransE and RESCAL are borrowed from Nickel et al. (2016b).

DNS outperforms RNS on all the considered KBC algorithms and across all benchmarks and evaluation metrics555The only exception to this statement are the TransE filtered MRR and filtered Hits@10 results on FB15k which further warrants a detailed investigation.. Remarkably, we obtained very good improvement in accuracy (Hits@1), which is arguably the most important metric for real applications.

5 Analysis

In this section we provide an in depth analysis of the properties of DNS. First, we show that the learned vectors for the entities in the KG represent their meaning nicely from a distributional perspective, thus providing a qualitative analysis of the generated negative samples. Secondly, we show that DNS converges in a lower number of epochs compared to RNS, and we provide an argumentative analysis of why it happens.

EPOCH 1 EPOCH 5 EPOCH 10 EPOCH 90
English(0.40) Blu-ray Disc(0.56) Blu-ray Disc(0.55) Blu-ray Disc(0.41)
US Dollar(0.37) VHS(0.50) VHS(0.53) VHS(0.34)
Blu-ray Disc(0.36) video(0.33) video(0.36) video(0.27)
executive producer(0.29) Silent Hill(0.25) English(0.23) television(0.18)
French(0.28) Luther(0.23) television(0.23) Kid Rock(0.14)
Table 3: An example of negative samples generated by DNS for the entity DVD at various epochs.

We start from the observation that entity embeddings are learned during training of the KBC system, which is in turn trained using negative examples provided by DNS that rank them according to their similarity with the substituted entity. This creates a reinforcing cycle that in our experience seems to drive the algorithm toward faster convergence.

(a)
(b)
Figure 2: The left column represents Filtered MRR vs. Number of epochs, whereas the right column represents Filtered Hits@10 vs. Number of epochs. All these figures are plotted against the test data for first 50 epochs only.

Table 3 shows the nearest-neighbor results for the query entity DVD after Epochs 1, 5, 10, 90 for the TransE(DNS) algorithm on Fb15k-237 dataset. From this table, it is clear that as the training progresses, entities with similar types tend to become closer (via cosine similarity). This enables our DNS algorithm to provide meaningful false statements instead of nonsensical ones. As an example, given the triple film-distribution-medium(The Day After Tomorrow, DVD), RNS and DNS both at Epoch 1, are highly likely to generate nonsensical negatives like film-distribution-medium(The Day After Tomorrow, US Dollar).

However, after a few epochs, things change drastically. At epoch 90 for the same assertion film-distribution-medium(The Day After Tomorrow, DVD)

, the odds of choosing

VHS as a negative object by DNS is 1.05 x . Note that, this result is obtained by taking a softmax over the similarity scores computed between the query entity DVD and all other entities in Fb15k-237 dataset, and then performing a look-up for the entity VHS. Compared to the above, the odds of choosing VHS as a negative entity via RNS equals 1/14,952 . Thus, DNS is twice as likely to generate a meaningful assertion such as film-distribution-medium(The Day After Tomorrow, VHS) compared to RNS. Note that as already established before, as per LCWA this assertion is counted as less positive, as it’s unseen in the training data.

In summary, by having meaningful positive/negative training facts the system can better classify the validity of unknown assertions. From a machine learning perspective, this way of selecting negative samples provides a more efficient way to train the system.

This is because since RNS is highly likely to generate a nonsensical negative sample, the score of such negatives is highly likely to satisfy the margin in the overall hinge loss function (from equation 1). Comparing this to DNS, during latter stages of training, since the generated corrupted entity is closer to the actual entity (assuming unit normalized vectors), the overall score for such a negative sample is less likely to satisfy the margin, and hence contribute to a non-zero loss and non-zero gradients.

We support the above argument for improved efficiency, via empirical fashion as illustrated in Figure 2, where both filtered MRR and filtered Hits@10 are reported as a function of the number of epochs (restricted to first 50 epochs) for RESCAL algorithm on WN18RR dataset. In these figures, we see that both MRR and Hits@10 values for Distributional Negative Sampling(DNS) grows more rapidly as compared to Random Negative Sampling(RNS). For example in figure 1(a), both DNS performs slight better compared to RNS until epoch 10 (RNS:0.382 DNS:0.394) after that RNS saturates at 0.4 whereas filtered MRR for DNS rises till 0.43.

6 Conclusion and Future Work

In this paper, we propose an alternative to Random Negative Sampling (RNS) for Knowledge Base Completion (KBC), namely Distributional Negative Sampling (DNS), that generates meaningful negative examples which are highly likely to be false. We argue that RNS is highly likely to generate nonsensical assertions and we demonstrate how DNS solves this problem in a principled fashion. DNS consistently improves almost all the evaluation metrics over three widely known benchmarks for the two considered algorithms.

Although in this paper we focus specifically on KBC task, we believe that DNS is a general approach that can be used across many different tasks (that involve generating corrupted units as negative instances) where until now RNS has been used. A typical example of such a task is relation extraction.

We are interested in building deep learning-based reasoning systems, and we believe that DNS is just the first step in this direction as it plays a pivotal role in building better KBC models for common sense reasoning i.e. the ability to realize that some facts hold purely due to other existing relations. For KBC task, DNS provides a way for efficient training since it generates plausible false statements in a very efficient and mathematically principled manner without any further assumption.

References

  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250. Cited by: §4.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: §2, §4, §4.
  • A. Bordes, J. Weston, R. Collobert, Y. Bengio, et al. (2011) Learning structured embeddings of knowledge bases.. In AAAI, Vol. 6, pp. 6. Cited by: §3.
  • L. Cai and W. Y. Wang (2017) KBGAN: adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071. Cited by: §2.
  • T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2017) Convolutional 2d knowledge graph embeddings. arXiv preprint arXiv:1707.01476. Cited by: §4.
  • Z. S. Harris (1954) Distributional structure. Word 10 (2-3), pp. 146–162. Cited by: footnote 2.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • B. Kotnis and V. Nastase (2017) Analysis of the impact of negative sampling on link prediction in knowledge graphs. arXiv preprint arXiv:1708.06816. Cited by: §2, §3.
  • Y. Lin, Z. Liu, H. Luan, M. Sun, S. Rao, and S. Liu (2015a) Modeling relation paths for representation learning of knowledge bases. arXiv preprint arXiv:1506.00379. Cited by: §2.
  • Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu (2015b) Learning entity and relation embeddings for knowledge graph completion.. In AAAI, Vol. 15, pp. 2181–2187. Cited by: Table 2.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §4.
  • M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich (2016a) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: footnote 1.
  • M. Nickel, L. Rosasco, T. A. Poggio, et al. (2016b) Holographic embeddings of knowledge graphs.. In AAAI, pp. 1955–1961. Cited by: §2, Table 2.
  • M. Nickel, V. Tresp, and H. Kriegel (2011) A three-way model for collective learning on multi-relational data.. In ICML, Vol. 11, pp. 809–816. Cited by: §2, §4.
  • R. Socher, D. Chen, C. D. Manning, and A. Ng (2013) Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pp. 926–934. Cited by: §2.
  • K. Toutanova and D. Chen (2015) Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66. Cited by: §2, §4.
  • T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning, pp. 2071–2080. Cited by: §2.
  • Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014)

    Knowledge graph embedding by translating on hyperplanes.

    .
    In AAAI, pp. 1112–1119. Cited by: §2, §3, §4.
  • Q. Xie, X. Ma, Z. Dai, and E. Hovy (2017) An interpretable knowledge transfer model for knowledge base completion. arXiv preprint arXiv:1704.05908. Cited by: §2.