Real-world knowledge bases are usually expressed as multi-relational graphs, which are collections of factual triplets, where each triplet $(h, r, t)$ represents a relation $r$ between a head entity $h$ and a tail entity $t$. Examples of knowledge graphs include Freebase freebase, YAGO yago, DBpedia dbpedia, and WordNet wordnet. However, these real-world knowledge bases are usually incomplete kg_incomp1, which motivates research on automatically predicting missing links.
A popular approach for Knowledge Graph Completion (KGC) is to embed entities and relations into a continuous vector or matrix space, and use a well-designed score function to measure the plausibility of a triplet $(h, r, t)$. Most previous methods use translation-distance-based transe; transh; transg; rotate and semantic-matching-based rescal2013; distmult; hole; complex; analogy scoring functions, which are easy to analyze.
Recently, however, a large number of neural network-based methods have been proposed. They have complex score functions that use black-box neural networks, including Convolutional Neural Networks (CNNs) conve; convkb, Recurrent Neural Networks (RNNs) ptranse; dolores, Graph Neural Networks (GNNs) r_gcn; sacn_paper, and Capsule Networks capse. While some of them report state-of-the-art performance on several benchmark datasets that is competitive with previous embedding-based approaches, a considerable portion of recent neural network-based papers report very high performance gains that are not consistent across datasets. Moreover, most of these unusual behaviors are not analyzed at all. This pattern has become prominent and is misleading the whole community.
In this paper, we investigate this problem and find that it can be attributed to the inappropriate evaluation protocol used by these approaches. We demonstrate that their evaluation protocol gives a perfect score to a model that always outputs a constant irrespective of the input. This has led to artificial inflation of the performance of several models. To address this, we propose a simple evaluation protocol that creates a fair comparison environment for all types of score functions. We conduct extensive experiments to re-examine some recent methods and fairly compare them with existing approaches. Our contributions can be summarized as follows:
We highlight unusual behavior of some of the recently proposed Knowledge Graph Completion methods and demonstrate the bias in their evaluation protocol.
We propose a simple evaluation protocol that creates a fair comparison environment for all types of score functions.
We report the performance of several recent methods under our proposed protocol and compare them fairly with prior approaches.
The source code and datasets used in the paper are available at http://github.com/svjan5/kgc-reevaluation.
2 Background and Related Work
Knowledge Graph Completion
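Given a Knowledge Graph $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$, where $\mathcal{E}$ and $\mathcal{R}$ denote the sets of entities and relations and $\mathcal{T} = \{(h, r, t) \mid h, t \in \mathcal{E}, r \in \mathcal{R}\}$ is the set of triplets (facts), the task of Knowledge Graph Completion (KGC) involves inferring missing facts based on the known facts. Most of the existing methods define an embedding for each entity and relation in $\mathcal{G}$, i.e., $e_h, e_r \; \forall h \in \mathcal{E}, r \in \mathcal{R}$, and a score function $f(h, r, t): \mathcal{E} \times \mathcal{R} \times \mathcal{E} \rightarrow \mathbb{R}$ that assigns a higher score to valid triplets than to invalid ones.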
During KGC evaluation, to predict $t$ in a given triplet $(h, r, t)$, a KGC model scores all the triplets in the set $\mathcal{T}' = \{(h, r, t') \mid t' \in \mathcal{E}\}$. Based on the scores, the model first sorts all the triplets and then finds the rank of the valid triplet $(h, r, t)$ in the list. In a more relaxed setting, called the filtered setting, all the known correct triplets (from the train, valid, and test sets) are removed from $\mathcal{T}'$ except the one being evaluated transe. The triplets in $\mathcal{T}' - \{(h, r, t)\}$ are called negative samples.
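The filtered ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `filtered_rank`, `score_fn`, and the toy entities and scores are hypothetical names introduced here. Tie handling is deliberately left out (only strictly higher scores are counted), since it is the subject of Section 4.

```python
def filtered_rank(h, r, t, entities, known_triplets, score_fn):
    """Rank of the valid tail t among all candidate tails (1 = best)."""
    target = score_fn(h, r, t)
    rank = 1
    for cand in entities:
        if cand == t:
            continue
        # Filtered setting: other known correct triplets are not candidates.
        if (h, r, cand) in known_triplets:
            continue
        # Ties are NOT handled here; only strictly higher scores count.
        if score_fn(h, r, cand) > target:
            rank += 1
    return rank


# Toy example with a hypothetical lookup-table score function.
entities = ["a", "b", "c", "d"]
known = {("h", "r", "a"), ("h", "r", "b")}  # train + valid + test facts
toy_scores = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.1}


def score(h, r, t):
    return toy_scores[t]


# Evaluate the test triplet (h, r, b) without and with filtering.
raw_rank = filtered_rank("h", "r", "b", entities, set(), score)
filt_rank = filtered_rank("h", "r", "b", entities, known, score)
print(raw_rank, filt_rank)
```

In the toy data, the known correct tail `a` outscores `b`, so removing it in the filtered setting improves the rank of the triplet being evaluated.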
Existing Analysis of KGC Methods
Prior to our work, baselines_strike_back cast doubt on the claim that the performance improvement of several models is due to architectural changes as opposed to hyperparameter tuning or a different training objective. In our work, we raise similar concerns but from a different angle, by highlighting issues with the evaluation procedure used by several recent methods. kg_geometry analyze the geometry of KG embeddings and its correlation with task performance, while effect_of_loss_function examine the effect of different loss functions on performance. However, their analysis is restricted to non-neural approaches.
In this section, we first describe our observations and concerns and then investigate what leads to it.
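Table 1: MRR on FB15k-237 and WN18RR, with the relative change with respect to ConvE in parentheses.
|Method||FB15k-237||WN18RR|
|RotatE||.338 (+4.0%)||.476 (+10.6%)|
|TuckER||.358 (+10.2%)||.470 (+9.3%)|
|ConvKB||.396 (+21.8%)||.248 (-42.3%)|
|CapsE||.523 (+60.9%)||.415 (-3.4%)|
|KBAT||.518 (+59.4%)||.440 (+2.3%)|
|TransGate||.404 (+24.3%)||.409 (-4.9%)|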
3.1 Inconsistent Improvements over Benchmark Datasets
Several recently proposed methods report high performance gains on a particular dataset. However, their performance on another dataset is not consistently improved. In Table 1, we report change in MRR score on FB15k-237 toutanova and WN18RR conve datasets with respect to ConvE conve for different methods including RotatE rotate, TuckER tucker, ConvKB convkb, CapsE capse, KBAT kbat, and TransGate transgate. Overall, we find that for a few recent NN based methods, there are inconsistent gains on these two datasets. For instance, in ConvKB, there is a 21.8% improvement over ConvE on FB15k-237, but a degradation of 42.3% on WN18RR, which is surprising given the method is claimed to be better than ConvE. On the other hand, methods like RotatE and TuckER give consistent improvement across both benchmark datasets.
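As a quick check of how the percentages for ConvKB are obtained, the relative change can be recomputed from the MRR values. This is only an arithmetic sketch: `relative_change` is a hypothetical helper, and the ConvE baselines (.325 on FB15k-237 and .43 on WN18RR) are the reported ConvE values listed in the result tables of this paper.

```python
def relative_change(new, base):
    """Percentage change of `new` with respect to `base`."""
    return 100.0 * (new - base) / base


# Reported ConvE MRR used as the baseline on each dataset.
conve_fb15k237 = 0.325
conve_wn18rr = 0.43

# ConvKB: large gain on one dataset, large drop on the other.
convkb_fb = round(relative_change(0.396, conve_fb15k237), 1)
convkb_wn = round(relative_change(0.248, conve_wn18rr), 1)
print(convkb_fb, convkb_wn)
```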
3.2 Observations on Score Functions
When evaluating KGC methods, for a given triplet $(h, r, t)$, the ranking of $t$ given $h$ and $r$ is computed by scoring all the triplets of the form $\{(h, r, t') \mid t' \in \mathcal{E}\}$. On investigating a few recent NN based approaches, we find that they have an unusual score distribution, where some negatively sampled triplets have the same score as the valid triplet. An instance from the FB15k-237 dataset is presented in Figure 1. Here, out of 14,541 negatively sampled triplets, 8,520 have exactly the same score as the valid triplet.
Statistics on the whole dataset
In Figure 2, we report the total number of triplets with exactly the same score over the entire dataset for ConvKB convkb and CapsE capse, and compare them with ConvE conve, which does not suffer from this issue. We find that both ConvKB and CapsE have multiple occurrences of such an unusual score distribution. On average, ConvKB and CapsE have 125 and 278 entities with exactly the same score as the valid triplet over the entire evaluation dataset of FB15k-237, whereas ConvE has around 0.002, which is almost negligible. In Section 4, we demonstrate how this leads to a massive performance gain for methods like ConvKB and CapsE.
Root of the problem
Further, we investigate the cause behind such an unusual score distribution. In Figure 3, we plot the ratio of neurons becoming zero after ReLU activation for the valid triplets vs. their normalized frequency on the FB15k-237 dataset. The results show that in ConvKB and CapsE, a large fraction (87.3% and 92.2%, respectively) of the neurons become zero after applying ReLU activation. With ConvE, however, this fraction is substantially smaller (around 41.1%). Because of the zeroing of nearly all neurons (at least 14.2% for ConvKB and 22.0% for CapsE), the representations of several triplets become very similar during the forward pass, which leads to exactly the same score.
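The mechanism can be illustrated with a toy scorer. The sketch below is an assumption-laden stand-in (a single linear layer on top of ReLU features), not ConvKB or CapsE; `toy_score`, the weights, and the feature vectors are hypothetical. When every pre-activation of a triplet is negative, ReLU zeroes the whole representation and the score collapses to the bias, regardless of which triplet was fed in.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def toy_score(pre_activations, weights, bias):
    """Linear layer applied to ReLU features (toy stand-in for a neural scorer)."""
    hidden = relu(pre_activations)
    return sum(w * v for w, v in zip(weights, hidden)) + bias


weights = [0.4, -1.3, 2.0]
bias = 0.25

# Two different triplets whose pre-activations are all negative:
# ReLU zeroes every neuron, so both scores equal the bias.
s1 = toy_score([-0.5, -2.0, -0.1], weights, bias)
s2 = toy_score([-3.0, -0.7, -1.2], weights, bias)

# A triplet with one surviving neuron gets a distinct score.
s3 = toy_score([-0.5, 1.0, -0.1], weights, bias)
print(s1, s2, s3)
```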
4 Evaluation Method
In this section, we present different evaluation methodologies that can be adopted in knowledge graph completion. We further show that an inappropriate evaluation protocol is the key reason for the unusual behavior of some recent NN based methods.
How to deal with the same scores?
An essential aspect of the evaluation method is to decide how to break ties for triplets with the same score. More concretely, while scoring the candidate set $\mathcal{T}'$, if the model assigns the same score to multiple triplets, one must decide which triplet to pick. Based on this choice, we design a general evaluation scheme for KGC, which consists of the following three protocols that differ in where the correct triplet is placed in $\mathcal{T}'$:
Top: In this setting, the correct triplet is inserted at the beginning of $\mathcal{T}'$.
Bottom: Here, the correct triplet is inserted at the end of $\mathcal{T}'$.
Random: In this setting, the correct triplet is placed randomly in $\mathcal{T}'$.
We assume that the triplets are sorted in a stable manner, i.e., the relative order of triplets with equal scores is maintained while sorting. Based on the definition of the three evaluation protocols, we have the following proposition.
Proposition 4.1. A score function that gives a constant score to all triplets irrespective of the input, i.e., $f(h, r, t) = c$ for some constant $c$, achieves the best performance when evaluated using the Top evaluation scheme.
From Proposition 4.1, it is clear that the Top evaluation protocol does not evaluate the model rigorously. It gives an inappropriate advantage to models that are biased toward assigning the same score to different triplets. On the other hand, the Bottom evaluation protocol can be unfair to the model at inference time because it penalizes the model for giving the same score to multiple triplets, i.e., if many triplets have the same score as the correct triplet, the correct triplet gets the worst possible rank among them.
As a result, Random is the best evaluation technique, as it is both rigorous and fair to the model. It is in line with the situation encountered in the real world: given several candidates with the same score, the only option is to select one of them randomly. Hence, we propose to use the Random evaluation scheme for all model performance comparisons.
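The three protocols can be sketched as a single rank function. This is a minimal illustration assuming that a higher score is better and that the correct triplet competes only with negatives that share its exact score; `rank_with_ties` and its arguments are hypothetical names, not taken from the released code.

```python
import random


def rank_with_ties(target_score, negative_scores, protocol, rng=None):
    """Rank (1 = best) of the correct triplet under a tie-breaking protocol."""
    higher = sum(1 for s in negative_scores if s > target_score)
    ties = sum(1 for s in negative_scores if s == target_score)
    if protocol == "top":
        # Correct triplet placed before all tied negatives.
        return higher + 1
    if protocol == "bottom":
        # Correct triplet placed after all tied negatives.
        return higher + ties + 1
    if protocol == "random":
        # Correct triplet placed uniformly at random among the tied negatives.
        rng = rng or random
        return higher + rng.randint(0, ties) + 1
    raise ValueError("unknown protocol: " + protocol)


# A constant scorer: the correct triplet and all 9 negatives get the same score.
constant_negatives = [0.0] * 9
top_rank = rank_with_ties(0.0, constant_negatives, "top")
bottom_rank = rank_with_ties(0.0, constant_negatives, "bottom")
random_rank = rank_with_ties(0.0, constant_negatives, "random", random.Random(0))
print(top_rank, bottom_rank, random_rank)
```

With the constant scorer, Top returns rank 1 (the situation in Proposition 4.1), Bottom returns the last position, and Random falls somewhere in between. When no negative ties with the correct triplet, all three protocols return the same rank.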
In this section, we conduct extensive experiments using our proposed evaluation protocols and make a fair comparison of several existing methods.
We use two common benchmark datasets described below:
FB15k-237 toutanova is a subset of FB15k transe with inverse relations deleted to prevent direct inference of test triples from training.
WN18RR conve is a subset of WN18 transe containing lexical relations between words. Similar to FB15k-237, inverse relations are removed in WN18RR.
5.2 Methods Analyzed
In our experiments, we categorize methods into the following two categories.
Non-Affected: This includes methods which give consistent performance under different evaluation protocols. For experiments in this paper, we consider three such methods – ConvE conve, RotatE rotate, and TuckER tucker.
Affected: This category consists of recently proposed neural-network based methods whose performance is affected by different evaluation protocols. ConvKB convkb, CapsE capse, TransGate transgate, and KBAT kbat are methods in this category.
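Table 2: Effect of different evaluation protocols on FB15k-237. Reported columns are the previously reported results; Random columns give mean ± standard deviation over 5 runs.
|Method||Reported MRR||Reported MR||Reported H@10||Random MRR||Random MR||Random H@10||Top MRR||Top MR||Top H@10||Bottom MRR||Bottom MR||Bottom H@10|
|ConvE||.325||244||.501||.324 ± .0||285 ± 0||.501 ± .0||.324||285||.501||.324||285||.501|
|RotatE||.338||177||.533||.336 ± .0||178 ± 0||.530 ± .0||.336||178||.530||.336||178||.530|
|TuckER||.358||-||.544||.353 ± .0||162 ± 0||.536 ± .0||.353||162||.536||.353||162||.536|
|ConvKB||.396||257||.517||.243 ± .0||309 ± 2||.421 ± .0||.407||246||.527||.130||373||.383|
|CapsE||.523||303||.593||.032 ± .0||446 ± 2||.057 ± .002||.511||305||.586||.009||585||.000|
|KBAT||.518†||210†||.626†||.157 ± .0||270 ± 0||.331 ± .0||.157||270||.331||.157||270||.331|

Table 3: Effect of different evaluation protocols on WN18RR. Reported columns are the previously reported results; Random columns give mean ± standard deviation over 5 runs.
|Method||Reported MRR||Reported MR||Reported H@10||Random MRR||Random MR||Random H@10||Top MRR||Top MR||Top H@10||Bottom MRR||Bottom MR||Bottom H@10|
|ConvE||.43||4187||.52||.444 ± .0||4950 ± 0||.503 ± .0||.444||4950||.503||.444||4950||.503|
|RotatE||.476||3340||.571||.473 ± .0||3343 ± 0||.571 ± .0||.473||3343||.571||.473||3343||.571|
|TuckER||.470||-||.526||.461 ± .0||6324 ± 0||.516 ± .0||.461||6324||.516||.461||6324||.516|
|ConvKB||.248||2554||.525||.249 ± .0||3433 ± 42||.524 ± .0||.251||1696||.529||.164||5168||.516|
|CapsE‡||.415||719||.560||.088 ± .001||731 ± 0||.245 ± .006||.415||718||.559||.030||744||.000|
|KBAT||.440†||1940†||.581†||.412 ± .0||1921 ± 0||.554 ± .0||.412||1921||.554||.412||1921||.554|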
5.3 Evaluation Metrics
For all the methods, we use the code and the hyperparameters provided by the authors in their respective papers. Model performance is evaluated using three standard metrics: Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hits@10 (H@10), in the filtered setting transe.
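Given the rank of the correct triplet for each test query, the three metrics can be computed as in the following sketch. The function names and the toy rank list are hypothetical and serve only to illustrate the standard definitions.

```python
def mean_reciprocal_rank(ranks):
    """MRR: average of 1 / rank (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def mean_rank(ranks):
    """MR: average rank (lower is better)."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Hits@k: fraction of queries whose correct triplet is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy filtered ranks for four test queries.
ranks = [1, 2, 4, 20]
mrr = mean_reciprocal_rank(ranks)
mr = mean_rank(ranks)
h10 = hits_at_k(ranks, k=10)
print(mrr, mr, h10)
```

MRR is dominated by the top positions, which is why placing the correct triplet first among tied candidates can inflate it so strongly.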
5.4 Evaluation Results
To analyze the effect of the different evaluation protocols described in Section 4, we study the performance variation of the models listed in Section 5.2. We study the effect of using the Top and Bottom protocols and compare them to the Random protocol. We also study the random error of the Random protocol over multiple runs, reporting the average and standard deviation over 5 runs with different random seeds.
The results on FB15k-237 and WN18RR are presented in Tables 2 and 3, respectively. In their original papers, ConvE, RotatE, and TuckER use a strategy similar to the proposed Random protocol, while ConvKB, CapsE, and KBAT use the Top protocol. We observe that for Non-Affected methods like ConvE, RotatE, and TuckER, the performance remains consistent across different evaluation protocols. However, with Affected methods, there is a considerable variation in performance. Specifically, we observe that these models perform best when evaluated using Top and worst when evaluated using Bottom.¹ Finally, we find that the proposed Random protocol is very robust to different random seeds. Although the theoretical upper and lower bounds of a Random score are the Top and Bottom scores, respectively, when we evaluate knowledge graph completion on real-world large-scale knowledge graphs, the randomness does not affect the evaluation results much.
¹ KBAT incorporates ConvKB in the last layer of its model architecture, so it should be affected by different evaluation protocols. However, we found another bug in the reported model, a leakage of test triples during negative sampling, which results in more significant performance degradation.
In this paper, we performed an extensive re-examination of recent neural network based KGC techniques that claim very high performance on certain datasets. We find that many such models have issues with their score functions. Combined with an inappropriate evaluation protocol, such methods reported inflated performance. Based on our observations, we propose the Random evaluation protocol, which can clearly distinguish these affected methods from others. We also strongly encourage the research community to follow the Random evaluation protocol for all KGC evaluation purposes.