A Re-evaluation of Knowledge Graph Completion Methods

11/10/2019 ∙ by Zhiqing Sun, et al. ∙ Indian Institute of Science ∙ Carnegie Mellon University

Knowledge Graph Completion (KGC) aims at automatically predicting missing links for large-scale knowledge graphs. A vast number of state-of-the-art KGC techniques have been published in top conferences in several research fields including data mining, machine learning, and natural language processing. However, we notice that several recent papers report very high performance, which largely outperforms previous state-of-the-art methods. In this paper, we find that this can be attributed to the inappropriate evaluation protocol they use, and we propose a simple evaluation protocol to address this problem. The proposed protocol is robust to bias in the model, which can substantially affect the final results. We conduct extensive experiments and report the performance of several existing methods using our protocol.




1 Introduction

Real-world knowledge bases are usually expressed as multi-relational graphs, which are collections of factual triplets, where each triplet $(h, r, t)$ represents a relation $r$ between a head entity $h$ and a tail entity $t$. Examples of knowledge graphs include Freebase freebase, YAGO yago, DBpedia dbpedia, and WordNet wordnet. However, these real-world knowledge bases are usually incomplete kg_incomp1, which motivates research on automatically predicting missing links.

A popular approach for Knowledge Graph Completion (KGC) is to embed entities and relations into a continuous vector or matrix space and use a well-designed score function $f(h, r, t)$ to measure the plausibility of the triplet $(h, r, t)$. Most previous methods use translation-distance-based transe; transh; transg; rotate or semantic-matching-based rescal2013; distmult; hole; complex; analogy scoring functions, which are easy to analyze.

However, a vast number of neural network-based methods have been proposed recently. They have complex score functions that utilize black-box neural networks, including Convolutional Neural Networks (CNNs) conve; convkb, Recurrent Neural Networks (RNNs) ptranse; dolores, Graph Neural Networks (GNNs) r_gcn; sacn_paper, and Capsule Networks capse. While some of them report state-of-the-art performance on several benchmark datasets that is competitive with previous embedding-based approaches, a considerable portion of recent neural network-based papers report very high performance gains that are not consistent across different datasets. Moreover, most of these unusual behaviors are not analyzed at all. Such a pattern has become prominent and is misleading the whole community.

In this paper, we investigate this problem and find that it can be attributed to the inappropriate evaluation protocol used by these approaches. We demonstrate that their evaluation protocol gives a perfect score to a model that always outputs a constant irrespective of the input. This has led to artificial inflation of the performance of several models. To address this, we propose a simple evaluation protocol that creates a fair comparison environment for all types of score functions. We conduct extensive experiments to re-examine some recent methods and fairly compare them with existing approaches. Our contributions can be summarized as follows:

  1. We highlight unusual behavior of some of the recently proposed Knowledge Graph Completion methods and demonstrate the bias in their evaluation protocol.

  2. We propose a simple evaluation protocol that creates a fair comparison environment for all types of score functions.

  3. We report the performance of several recent methods using our proposed protocol and compare them fairly with prior approaches.

The source code and datasets used in the paper are available at http://github.com/svjan5/kgc-reevaluation.

2 Background and Related Work

Knowledge Graph Completion

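Given a Knowledge Graph $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$, where $\mathcal{E}$ and $\mathcal{R}$ denote the sets of entities and relations and $\mathcal{T} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ is the set of triplets (facts), the task of Knowledge Graph Completion (KGC) involves inferring missing facts based on the known facts. Most existing methods define an embedding for each entity and relation in $\mathcal{G}$, i.e., $e_h, e_r$ for all $h \in \mathcal{E}$, $r \in \mathcal{R}$, and a score function $f(h, r, t)$ which assigns a higher score to valid triplets than to invalid ones.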
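
As a minimal illustration of what such a score function looks like, the sketch below implements two well-known examples in plain Python: the translation-distance score used by TransE and the tri-linear product used by DistMult. The toy embeddings and the function names are assumptions made for this example and are not taken from the paper's code.

```python
import math

def transe_score(e_h, e_r, e_t):
    """TransE-style score: negative L2 distance between (e_h + e_r) and e_t."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(e_h, e_r, e_t)))

def distmult_score(e_h, e_r, e_t):
    """DistMult-style score: tri-linear product of the three embeddings."""
    return sum(h * r * t for h, r, t in zip(e_h, e_r, e_t))

# Hypothetical 2-dimensional embeddings, chosen only for illustration.
e_h, e_r = [1.0, 0.0], [0.0, 1.0]
e_valid, e_invalid = [1.0, 1.0], [-1.0, 3.0]

# The tail that is an exact translation of the head scores higher.
print(transe_score(e_h, e_r, e_valid))
print(transe_score(e_h, e_r, e_invalid))
```

Both functions follow the convention used in this section: a higher score means a more plausible triplet.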

KGC Evaluation

During KGC evaluation, for predicting $t$ in a given triplet $(h, r, t)$, a KGC model scores all the triplets in the set $\mathcal{T}' = \{(h, r, t') \mid t' \in \mathcal{E}\}$. Based on the scores, the model first sorts all the triplets and subsequently finds the rank of the valid triplet $(h, r, t)$ in the list. In a more relaxed setting called the filtered setting, all the known correct triplets (from train, valid, and test triplets) are removed from $\mathcal{T}'$ except the one being evaluated transe. The triplets in $\mathcal{T}' \setminus \{(h, r, t)\}$ are called negative samples.
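
The filtered setting can be sketched as follows. This is an illustrative reconstruction for tail prediction only, assuming triplets are stored as Python tuples; the toy graph, the relation name, and the helper `filtered_candidates` are hypothetical.

```python
def filtered_candidates(h, r, t, entities, known_triplets):
    """Build the filtered candidate set for predicting the tail of (h, r, t).

    Every candidate (h, r, c) that is a known correct triplet is dropped,
    except the triplet (h, r, t) that is currently being evaluated.
    """
    return [
        (h, r, c)
        for c in entities
        if c == t or (h, r, c) not in known_triplets
    ]

# Hypothetical toy graph: both facts below are known to be correct.
entities = ["paris", "france", "berlin", "germany", "europe"]
known = {
    ("paris", "located_in", "france"),
    ("paris", "located_in", "europe"),
}

cands = filtered_candidates("paris", "located_in", "france", entities, known)
# Everything in the candidate set except the evaluated triplet is a negative sample.
negatives = [c for c in cands if c[2] != "france"]
print(len(cands), len(negatives))
```

Filtering matters because, without it, the other correct answer ("europe" here) could outrank the evaluated triplet and be counted as an error.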

Existing Analysis of KGC Methods

Prior to our work, baselines_strike_back cast doubt on the claim that the performance improvement of several models is due to architectural changes as opposed to hyperparameter tuning or a different training objective. In our work, we raise similar concerns but from a different angle, by highlighting issues with the evaluation procedure used by several recent methods.

kg_geometry analyze the geometry of KG embeddings and its correlation with task performance, while effect_of_loss_function examine the effect of different loss functions on performance. However, their analysis is restricted to non-neural approaches.

3 Observations

In this section, we first describe our observations and concerns and then investigate what leads to them.

Method       FB15k-237 (MRR)    WN18RR (MRR)
ConvE        .325               .430
RotatE       .338 (+4.0%)       .476 (+10.6%)
TuckER       .358 (+10.2%)      .470 (+9.3%)
ConvKB       .396 (+21.8%)      .248 (-42.3%)
CapsE        .523 (+60.9%)      .415 (-3.4%)
KBAT         .518 (+59.4%)      .440 (+2.3%)
TransGate    .404 (+24.3%)      .409 (-4.9%)

Table 1: Changes in MRR for different methods on the FB15k-237 and WN18RR datasets with respect to ConvE show inconsistent improvements. Overall, we find that non-NN-based methods such as RotatE and TuckER give consistent improvement on both datasets, whereas for some NN-based methods the performance improves on one dataset while degrading on the other. Refer to Section 3.1 for details.

3.1 Inconsistent Improvements over Benchmark Datasets

Several recently proposed methods report high performance gains on a particular dataset. However, their performance on another dataset is not consistently improved. In Table 1, we report change in MRR score on FB15k-237 toutanova and WN18RR conve datasets with respect to ConvE conve for different methods including RotatE rotate, TuckER tucker, ConvKB convkb, CapsE capse, KBAT kbat, and TransGate transgate. Overall, we find that for a few recent NN based methods, there are inconsistent gains on these two datasets. For instance, in ConvKB, there is a 21.8% improvement over ConvE on FB15k-237, but a degradation of 42.3% on WN18RR, which is surprising given the method is claimed to be better than ConvE. On the other hand, methods like RotatE and TuckER give consistent improvement across both benchmark datasets.
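
The percentages in Table 1 are relative changes with respect to the ConvE baseline. The short sketch below recomputes the two ConvKB entries from the MRR values in the table; the helper name `relative_change` is introduced here only for illustration.

```python
def relative_change(mrr, baseline):
    """Relative change of `mrr` with respect to `baseline`, in percent."""
    return 100.0 * (mrr - baseline) / baseline

# MRR values copied from Table 1.
conve = {"FB15k-237": 0.325, "WN18RR": 0.430}
convkb = {"FB15k-237": 0.396, "WN18RR": 0.248}

for dataset in conve:
    change = relative_change(convkb[dataset], conve[dataset])
    print(f"ConvKB vs. ConvE on {dataset}: {change:+.1f}%")
```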

Figure 1: Sorted score distribution of ConvKB for an example valid triplet and its negative samples. The scores are normalized (lower is better). The dotted line indicates the score of the valid triplet. We find that in this example, around 58.5% of the negatively sampled triplets obtain the exact same score as the valid triplet. Please refer to Section 3.2 for more details.
Figure 2: Plot shows the frequency of the number of negative triplets with the same assigned score as the valid triplet during evaluation on FB15k-237 dataset. The results show that for methods like ConvKB and CapsE, a large number of negative triplets get the same score as the valid triplets whereas for methods like ConvE such occurrences are rare.
Figure 3: Distribution of the ratio of neurons becoming zero after ReLU activation in different methods for the valid triplets in the FB15k-237 dataset. We find that for ConvKB and CapsE an unusually large fraction of neurons become zero after ReLU activation, whereas this does not hold for ConvE. More details are presented in Section 3.2.


3.2 Observations on Score Functions

Score distribution

When evaluating KGC methods, for a given triplet $(h, r, t)$, the ranking of $t$ given $h$ and $r$ is computed by scoring all the triplets of the form $\{(h, r, t') \mid t' \in \mathcal{E}\}$. On investigating a few recent NN-based approaches, we find that they have an unusual score distribution, where some negatively sampled triplets have the same score as the valid triplet. An instance from the FB15k-237 dataset is presented in Figure 1. Here, out of 14,541 negatively sampled triplets, 8,520 have the exact same score as the valid triplet.

Statistics on the whole dataset

In Figure 2, we report the total number of triplets with the exact same score over the entire dataset for ConvKB convkb and CapsE capse and compare them with ConvE conve, which does not suffer from this issue. We find that both ConvKB and CapsE have multiple occurrences of such an unusual score distribution. On average, ConvKB and CapsE have 125 and 278 entities with exactly the same score as the valid triplet over the entire evaluation dataset of FB15k-237, whereas ConvE has around 0.002, which is almost negligible. In Section 4, we demonstrate how this leads to a massive performance gain for methods like ConvKB and CapsE.
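
A statistic of this kind can be computed with a few lines of code. The sketch below counts, for each evaluated triplet, how many negative samples receive exactly the same score as the valid triplet, and averages that count over the evaluation set. The input format (a list of `(valid_score, negative_scores)` pairs) and the toy numbers are assumptions for illustration, not the paper's data.

```python
def count_ties(valid_score, negative_scores):
    """Number of negative samples whose score equals the valid triplet's score."""
    return sum(1 for s in negative_scores if s == valid_score)

def average_ties(examples):
    """Average tie count over a list of (valid_score, negative_scores) pairs."""
    return sum(count_ties(v, negs) for v, negs in examples) / len(examples)

# Hypothetical scores for two evaluated triplets.
examples = [
    (0.5, [0.5, 0.5, 0.1]),  # two negatives tie with the valid triplet
    (0.2, [0.3, 0.1]),       # no ties
]
print(average_ties(examples))
```

Exact equality is used deliberately, since the observation concerns scores that are identical rather than merely close.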

Root of the problem

Further, we investigate the cause behind such an unusual score distribution. In Figure 3, we plot the ratio of neurons becoming zero after ReLU activation for the valid triplets vs. their normalized frequency on the FB15k-237 dataset. The results show that in ConvKB and CapsE, a large fraction (87.3% and 92.2% respectively) of the neurons become zero after applying ReLU activation. However, with ConvE, this fraction is substantially lower (around 41.1%). Because of the zeroing of nearly all neurons (at least 14.2% for ConvKB and 22.0% for CapsE), the representations of several triplets become very similar during the forward pass, which leads to them obtaining the exact same score.
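
The mechanism can be reproduced in a toy setting. In the sketch below, a hypothetical scoring layer applies ReLU and then a linear projection; when all pre-activations are negative, every neuron is zeroed and two different inputs receive an identical score. The weights and inputs are made up for illustration and do not correspond to any of the analyzed models.

```python
def relu(xs):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in xs]

def zero_ratio(activations):
    """Fraction of neurons that are exactly zero after the activation."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

def toy_score(pre_activations, weights):
    """Hypothetical scoring layer: linear projection of ReLU outputs."""
    return sum(w * a for w, a in zip(weights, relu(pre_activations)))

weights = [0.7, -0.3, 1.2, 0.5]

# Two different triplets whose pre-activations are all negative.
pre_a = [-0.4, -1.3, -0.2, -2.0]
pre_b = [-3.1, -0.1, -0.9, -0.5]

print(zero_ratio(relu(pre_a)), zero_ratio(relu(pre_b)))
print(toy_score(pre_a, weights) == toy_score(pre_b, weights))
```

Once the ReLU output is all zeros, the input no longer influences the score, which is why ties appear in bulk rather than occasionally.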

4 Evaluation Method

In this section, we present different evaluation methodologies that can be adopted in knowledge graph completion. We further show that an inappropriate evaluation protocol is the key reason for the unusual behavior of some recent NN-based methods.

How to deal with the same scores?

An essential aspect of the evaluation method is to decide how to break ties for triplets with the same score. More concretely, while scoring the candidate set $\mathcal{T}'$, if there are multiple triplets with the same score from the model, one should decide which triplet to pick. Based on this choice, we design a general evaluation scheme for KGC, which consists of the following three protocols, differing in where the correct triplet is placed in $\mathcal{T}'$:

  1. Top: In this setting, the correct triplet is inserted at the beginning of $\mathcal{T}'$.

  2. Bottom: Here, the correct triplet is inserted at the end of $\mathcal{T}'$.

  3. Random: In this, the correct triplet is placed randomly in $\mathcal{T}'$.

We assume that the triplets are sorted in a stable manner, i.e., the relative order of triplets with equal scores is maintained while sorting. Based on the definition of the three evaluation protocols, we have the following proposition.

Proposition 4.1.

A score function that gives a constant score to all triplets irrespective of the input, i.e., $f(h, r, t) = c$ for some constant $c$, achieves the best performance when evaluated using the Top evaluation scheme.
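
To make the proposition concrete, the following short derivation works out what a constant score function obtains under each of the three placements defined in this section. The symbol $n$ for the number of candidate triplets is notation introduced here, and the Random case assumes that the correct triplet is placed uniformly at random among the $n$ tied positions.

```latex
\begin{align*}
\mathrm{rank}_{\mathrm{Top}} &= 1, &
\mathrm{RR}_{\mathrm{Top}} &= 1, \\
\mathrm{rank}_{\mathrm{Bottom}} &= n, &
\mathrm{RR}_{\mathrm{Bottom}} &= \frac{1}{n}, \\
\mathbb{E}\left[\mathrm{rank}_{\mathrm{Random}}\right]
  &= \frac{1}{n}\sum_{k=1}^{n} k = \frac{n+1}{2}, &
\mathbb{E}\left[\mathrm{RR}_{\mathrm{Random}}\right]
  &= \frac{1}{n}\sum_{k=1}^{n} \frac{1}{k} = \frac{H_n}{n}.
\end{align*}
```

Here RR denotes the reciprocal rank and $H_n$ the $n$-th harmonic number. Under Top, every test triplet is ranked first, so a model that has learned nothing still reaches a mean reciprocal rank of 1, whereas under Random its expected reciprocal rank shrinks toward zero as $n$ grows.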

From Proposition 4.1, it is clear that the Top evaluation protocol does not evaluate the model rigorously. It gives an inappropriate advantage to models that are biased toward assigning the same score to different triplets. On the other hand, the Bottom evaluation protocol can be unfair to the model during inference time because it penalizes the model for giving the same score to multiple triplets, i.e., if many triplets have the same score as the correct triplet, the correct triplet gets the worst possible rank among them.

As a result, Random is the best evaluation technique, being both rigorous and fair to the model. It is in line with the situation we meet in the real world: given several candidates with the same score, the only option is to select one of them randomly. Hence, we propose to use the Random evaluation scheme for all model performance comparisons.
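
The three tie-breaking protocols of this section can be sketched in a few lines. This is a minimal illustration rather than the released implementation: it assumes that a higher score means a more plausible triplet, and the function name `rank_of_valid` and the toy scores are hypothetical. Counting strictly better and tied negatives is equivalent to inserting the correct triplet at the top, at the bottom, or at a random position of a stably sorted list.

```python
import random

def rank_of_valid(valid_score, negative_scores, protocol="random", rng=None):
    """Rank (1 = best) of the valid triplet among its negative samples.

    Assumes that a higher score means a more plausible triplet.
    """
    higher = sum(1 for s in negative_scores if s > valid_score)
    ties = sum(1 for s in negative_scores if s == valid_score)
    if protocol == "top":      # valid triplet placed ahead of all ties
        return higher + 1
    if protocol == "bottom":   # valid triplet placed behind all ties
        return higher + ties + 1
    if protocol == "random":   # valid triplet placed uniformly among ties
        rng = rng or random
        return higher + rng.randint(0, ties) + 1
    raise ValueError(f"unknown protocol: {protocol}")

# Hypothetical scores: one negative is better, three tie with the valid triplet.
negatives = [0.9, 0.5, 0.5, 0.5, 0.1]
print(rank_of_valid(0.5, negatives, "top"))      # 2
print(rank_of_valid(0.5, negatives, "bottom"))   # 5
print(rank_of_valid(0.5, negatives, "random", random.Random(0)))
```

When there are no ties, all three protocols return the same rank, which is why models without the tie problem are unaffected by the choice.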

5 Experiments

In this section, we conduct extensive experiments using our proposed evaluation protocols and make a fair comparison of several existing methods.

5.1 Datasets

We use two common benchmark datasets described below:

  • FB15k-237 toutanova is a subset of FB15k transe with inverse relations deleted to prevent direct inference of test triples from training.

  • WN18RR conve is a subset of WN18 transe containing lexical relations between words. Similar to FB15k-237, inverse relations are removed in WN18RR.

5.2 Methods Analyzed

In our experiments, we categorize methods into the following two categories.

  • Non-Affected: This includes methods which give consistent performance under different evaluation protocols. For experiments in this paper, we consider three such methods – ConvE conve, RotatE rotate, and TuckER tucker.

  • Affected: This category consists of recently proposed neural-network based methods whose performance is affected by different evaluation protocols. ConvKB convkb, CapsE capse, TransGate transgate, and KBAT kbat are methods in this category.

          Reported                Random                                Top                                         Bottom
Method    MRR     MR     H@10     MRR          MR         H@10          MRR            MR            H@10           MRR            MR            H@10
ConvE     .325    244    .501     .324 ± .0    285 ± 0    .501 ± .0     .324           285           .501           .324           285           .501
RotatE    .338    177    .533     .336 ± .0    178 ± 0    .530 ± .0     .336           178           .530           .336           178           .530
TuckER    .358    -      .544     .353 ± .0    162 ± 0    .536 ± .0     .353           162           .536           .353           162           .536
ConvKB    .396    257    .517     .243 ± .0    309 ± 2    .421 ± .0     .407 (+.164)   246 (-63)     .527 (+.106)   .130 (-.113)   373 (+64)     .383 (-.038)
CapsE     .523    303    .593     .032 ± .0    446 ± 2    .057 ± .002   .511 (+.479)   305 (-141)    .586 (+.528)   .009 (-.023)   585 (+139)    .000 (-.058)
KBAT      .518†   210†   .626†    .157 ± .0    270 ± 0    .331 ± .0     .157           270           .331           .157           270           .331

Table 2: Effect of different evaluation protocols on recent KG embedding methods on the FB15k-237 dataset. For Random, we report the average and standard deviation over 5 runs. For Top and Bottom, we report changes in performance with respect to the Random protocol in parentheses. Please refer to Section 5.4 for details. †: KBAT has test data leakage in their original implementation.

          Reported                Random                                  Top                                          Bottom
Method    MRR     MR      H@10    MRR           MR           H@10          MRR            MR             H@10           MRR            MR             H@10
ConvE     .43     4187    .52     .444 ± .0     4950 ± 0     .503 ± .0     .444           4950           .503           .444           4950           .503
RotatE    .476    3340    .571    .473 ± .0     3343 ± 0     .571 ± .0     .473           3343           .571           .473           3343           .571
TuckER    .470    -       .526    .461 ± .0     6324 ± 0     .516 ± .0     .461           6324           .516           .461           6324           .516
ConvKB    .248    2554    .525    .249 ± .0     3433 ± 42    .524 ± .0     .251 (+.002)   1696 (-1737)   .529 (+.005)   .164 (-.085)   5168 (+1735)   .516 (-.008)
CapsE‡    .415    719     .560    .088 ± .001   731 ± 0      .245 ± .006   .415 (+.327)   718 (-13)      .559 (+.314)   .030 (-.058)   744 (+13)      .000 (-.245)
KBAT      .440†   1940†   .581†   .412 ± .0     1921 ± 0     .554 ± .0     .412           1921           .554           .412           1921           .554

Table 3: Performance comparison under different evaluation protocols on the WN18RR dataset. For Random, we report the average and standard deviation over 5 runs. For Top and Bottom, we report changes in performance with respect to the Random protocol in parentheses. Please refer to Section 5.4 for details. ‡: CapsE uses the pre-trained 100-dimensional GloVe pennington2014glove word embeddings for initialization on the WN18RR dataset, which makes the comparison on WN18RR still unfair. †: KBAT has test data leakage in their original implementation.

5.3 Evaluation Metrics

For all the methods, we use the code and the hyperparameters provided by the authors in their respective papers. Model performance is evaluated using three standard metrics, Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hits@10 (H@10), in the filtered setting transe.
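
Given the rank of the valid triplet for each test query, the three metrics are computed as in the sketch below. The function names and the example ranks are chosen for illustration only.

```python
def mean_rank(ranks):
    """MR: average rank of the valid triplets (lower is better)."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR: average of 1 / rank (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k=10):
    """Hits@k: fraction of valid triplets ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical filtered ranks for four test triplets.
ranks = [1, 2, 4, 20]
print(mean_rank(ranks), mean_reciprocal_rank(ranks), hits_at_k(ranks, k=10))
```

MRR and Hits@10 are dominated by the top of the ranking, so they are the metrics most sensitive to how ties near rank 1 are broken.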

5.4 Evaluation Results

To analyze the effect of the different evaluation protocols described in Section 4, we study the performance variation of the models listed in Section 5.2. We study the effect of using the Top and Bottom protocols and compare them to the Random protocol. We also study the random error in the Random protocol with multiple runs, where we report the average and standard deviation over 5 runs with different random seeds.
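
The multi-seed reporting can be sketched as follows. Each test query is summarized here by a hypothetical pair `(higher, ties)`: the number of negatives scored strictly better than the valid triplet and the number tied with it. The data are invented for illustration, and the sample standard deviation from Python's `statistics` module is an assumption about how the spread is measured.

```python
import random
import statistics

def random_protocol_mrr(examples, seed):
    """MRR under the Random protocol for a single random seed."""
    rng = random.Random(seed)
    reciprocal_ranks = [
        1.0 / (higher + rng.randint(0, ties) + 1) for higher, ties in examples
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Hypothetical (higher, ties) counts for four test queries.
examples = [(0, 0), (3, 0), (1, 2), (10, 5)]

runs = [random_protocol_mrr(examples, seed) for seed in range(5)]
mean_mrr = statistics.mean(runs)
std_mrr = statistics.stdev(runs)

# Top and Bottom give the best and worst MRR that Random can produce.
top_mrr = sum(1.0 / (h + 1) for h, _ in examples) / len(examples)
bottom_mrr = sum(1.0 / (h + t + 1) for h, t in examples) / len(examples)
print(f"Random: {mean_mrr:.3f} +/- {std_mrr:.3f}  (Top {top_mrr:.3f}, Bottom {bottom_mrr:.3f})")
```

On a toy set this small the spread across seeds is visible; with thousands of test queries the per-query randomness averages out.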

The results on FB15k-237 and WN18RR are presented in Tables 2 and 3, respectively. In their original papers, ConvE, RotatE, and TuckER use a strategy similar to the proposed Random protocol, while ConvKB, CapsE, and KBAT use the Top protocol. We observe that for Non-Affected methods like ConvE, RotatE, and TuckER, the performance remains consistent across different evaluation protocols. However, with Affected methods, there is a considerable variation in performance. Specifically, we can observe that these models perform best when evaluated using Top and worst when evaluated using Bottom.¹ Finally, we find that the proposed Random protocol is very robust to different random seeds. Although the theoretical upper and lower bounds of a Random score are the Top and Bottom scores respectively, when we evaluate knowledge graph completion on real-world large-scale knowledge graphs, the randomness does not affect the evaluation results much.

¹ KBAT incorporates ConvKB in the last layer of its model architecture and should therefore be affected by different evaluation protocols. However, we find another bug, the leakage of test triples during negative sampling in the reported model, which results in more significant performance degradation.

6 Conclusion

In this paper, we performed an extensive re-examination of recent neural-network-based KGC techniques that claim very high performance on certain datasets. We find that many such models have issues with their score functions. Combined with an inappropriate evaluation protocol, such methods reported inflated performance. Based on our observations, we propose the Random evaluation protocol, which can clearly distinguish these affected methods from others. We also strongly encourage the research community to follow the Random evaluation protocol for all KGC evaluation purposes.