1 Introduction
Knowledge graph embedding models are neural architectures that learn vector representations (i.e. embeddings) of nodes and edges of a knowledge graph. Such knowledge graph embeddings have applications in knowledge graph completion, knowledge discovery, entity resolution, and link-based clustering, just to cite a few (Nickel et al., 2016a).
Despite burgeoning research, the problem of calibrating such models has been overlooked, and existing knowledge graph embedding models do not offer any guarantee on the probability estimates they assign to predicted facts. Probability calibration is important whenever predictions need to make probabilistic sense, i.e., if the model predicts a fact is true with 80% confidence, it should be correct 80% of the time. Prior art suggests using a sigmoid layer to turn logits returned by models into probabilities (Nickel et al., 2016a) (also called the expit transform), but we show that this provides poor calibration. Figure 1 shows reliability diagrams for off-the-shelf TransE and ComplEx. The identity function represents perfect calibration. Both models are miscalibrated: all TransE combinations in Figure 1a under-forecast the probabilities (i.e. probabilities are too small), whereas ComplEx under-forecasts or over-forecasts according to which loss is used (Figure 1b).
Calibration is crucial in high-stakes scenarios such as drug-target discovery from biological networks, where end-users need trustworthy and interpretable decisions. Moreover, since probabilities are not calibrated, when classifying triples (i.e. facts) as true or false, users must define relation-specific thresholds, which can be awkward for graphs with a great number of relation types.
To the best of our knowledge, this is the first work to focus on calibration for knowledge graph embeddings. Our contribution is twofold. First, we use Platt scaling and isotonic regression to calibrate knowledge graph embedding models on datasets that include ground truth negatives. One peculiar feature of knowledge graphs is that they usually rely on the open world assumption (facts not present are not necessarily false, they are simply unknown). This makes calibration troublesome because of the lack of ground truth negatives. For this reason, our second and main contribution is a calibration heuristic that combines Platt scaling or isotonic regression with synthetically generated negatives.
Experimental results show that we obtain better-calibrated models and that it is possible to calibrate knowledge graph embedding models even when ground truth negatives are not present. We also experiment with triple classification, and we show that calibrated models reach state-of-the-art accuracy without the need to define relation-specific decision thresholds.
Figure 1: Reliability diagrams of uncalibrated models. Probabilities are generated by a logistic sigmoid layer. The larger the deviation from the diagonal, the more miscalibrated the model. We present four different common loss functions used to train knowledge graph embedding models. (a) Uncalibrated TransE on WN11. (b) Uncalibrated ComplEx on FB13. Best viewed in color.
2 Related Work
A comprehensive survey of knowledge graph embedding models is out of the scope of this paper. Recent surveys such as Nickel et al. (2016a) and Cai et al. (2017) present an overview of recent literature.
TransE (Bordes et al., 2013) is the forerunner of distance-based methods, and spawned a number of models commonly referred to as TransX. The intuition behind the symmetric bilinear-diagonal model DistMult (Yang et al., 2015) paved the way for its asymmetric evolutions in the complex space, RotatE (Sun et al., 2019) and ComplEx (Trouillon et al., 2016) (a generalization of which uses hypercomplex representations (Zhang et al., 2019)). HolE relies instead on circular correlation (Nickel et al., 2016b). The recent TorusE (Ebisu & Ichise, 2018) operates on a Lie group and not in the Euclidean space. While the above models can be interpreted as multilayer perceptrons, others such as ConvE (Dettmers et al., 2018) or ConvKB (Nguyen et al., 2018) include convolutional layers. More recent works adopt capsule network architectures (Nguyen et al., 2019). Adversarial learning is used by KBGAN (Cai & Wang, 2018), whereas attention mechanisms are instead used by Nathani et al. (2019). Some models such as RESCAL (Nickel et al., 2011), TuckER (Balažević et al., 2019), and SimplE (Kazemi & Poole, 2018) rely on tensor decomposition techniques. More recently, ANALOGY adopts a differentiable version of analogical reasoning (Liu et al., 2017). In this paper we limit our analysis to four popular models: TransE, DistMult, ComplEx and HolE. They do not address the problem of assessing the reliability of predictions, let alone calibrating probabilities.
Besides well-established techniques such as Platt scaling (Platt et al., 1999) and isotonic regression (Zadrozny & Elkan, 2002), recent interest in the calibration of neural architectures shows that modern neural architectures are poorly calibrated and that calibration can be improved with novel methods. For example, Guo et al. (2017) successfully propose temperature scaling for calibrating modern neural networks in classification problems. Along the same lines, Kuleshov et al. (2018) propose a procedure based on Platt scaling to calibrate deep neural networks in regression problems.
The Knowledge Vault pipeline in (Dong et al., 2014) extracts triples from unstructured knowledge and is equipped with Platt scaling calibration, but this is not applied to knowledge graph embedding models. KG2E (He et al., 2015) proposes to use normally-distributed embeddings to account for the uncertainty, but their model does not provide the probability of a triple being true, so KG2E would also benefit from the output calibration we propose here. To the best of our knowledge, the only work that applies probability calibration to knowledge graph embedding models is Krompaß & Tresp (2015). The authors propose to use ensembles in order to improve the results of knowledge graph embedding tasks. For that, they propose to calibrate the models with Platt scaling, so they operate on the same scale. No further details on the calibration procedure are provided. Besides, there is no explanation of how to handle the lack of negatives.
3 Preliminaries
Knowledge Graph. Formally, a knowledge graph $\mathcal{G} = \{(s, p, o)\} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ is a set of triples $t = (s, p, o)$, each including a subject $s \in \mathcal{E}$, a predicate $p \in \mathcal{R}$, and an object $o \in \mathcal{E}$. $\mathcal{E}$ and $\mathcal{R}$ are the sets of all entities and relation types of $\mathcal{G}$.
Triple Classification. Binary classification task where $\mathcal{G}$ (which includes only positive triples) is used as training set, and $\mathcal{T} = \{(s, p, o)\} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ is a disjoint test set of labeled triples to classify. Note $\mathcal{T}$ includes positives and negatives. Since the learned models are not calibrated, multiple decision thresholds $\tau_i$ must be picked, where $i = 1, \dots, |\mathcal{R}|$, i.e. one for each relation type. This is done using a validation set (Bordes et al., 2013). Classification metrics apply (e.g. accuracy).
Link Prediction. Given a training set $\mathcal{G}$ that includes only positive triples, the goal is assigning a score $f(t) \in \mathbb{R}$ proportional to the likelihood that each unlabeled triple $t$ included in a held-out set $\mathcal{S}$ is true. Note $\mathcal{S}$ does not have ground truth positives or negatives. This task is cast as a learning-to-rank problem, and uses metrics such as mean rank (MR), mean reciprocal rank (MRR) or Hits@N.
Knowledge Graph Embeddings. Knowledge graph embedding models are neural architectures that encode concepts from a knowledge graph $\mathcal{G}$ (i.e. entities $\mathcal{E}$ and relation types $\mathcal{R}$) into low-dimensional, continuous vectors (i.e. the embeddings). Embeddings are learned by training a neural architecture over $\mathcal{G}$. Although such architectures vary, the training phase always consists in minimizing a loss function $\mathcal{L}$ that includes a scoring function $f_m(t)$, i.e. a model-specific function that assigns a score to a triple $t = (s, p, o)$ (more precisely, the inputs of $f_m$ are the embeddings of the subject $\mathbf{e}_s$, the predicate $\mathbf{r}_p$, and the object $\mathbf{e}_o$). The goal of the optimization procedure is learning optimal embeddings, such that the scoring function $f_m$ assigns high scores to positive triples $t^+$ and low scores to triples unlikely to be true $t^-$. Existing models propose scoring functions that combine the embeddings using different intuitions. Table 1b lists the scoring functions of the most common models. For example, the scoring function of TransE computes a similarity between the embedding of the subject $\mathbf{e}_s$ translated by the embedding of the predicate $\mathbf{r}_p$ and the embedding of the object $\mathbf{e}_o$, using the $L_1$ or $L_2$ norm $\|\cdot\|$. Such a scoring function is then used on positive and negative triples $t^+$, $t^-$ in the loss function. This is usually a pairwise margin-based loss (Bordes et al., 2013), negative log-likelihood, or multi-class log-likelihood (Lacroix et al., 2018). Since the training set usually includes only positive statements, we generate the synthetic negatives required for training. We do so by corrupting one side of the triple at a time (i.e. either the subject or the object), following the protocol proposed by Bordes et al. (2013).
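As a rough illustration of the TransE scoring function and of the corruption protocol described above, the following Python sketch scores a triple as the negative L1 or L2 distance and generates corruptions by replacing either the subject or the object. The function names, the toy embeddings and the uniform sampling are illustrative assumptions, not the AmpliGraph implementation; in particular, this sketch does not filter out corruptions that happen to be known positives.

```python
import random


def transe_score(e_s, r_p, e_o, norm=1):
    """Negative distance between the translated subject and the object."""
    diffs = [s + p - o for s, p, o in zip(e_s, r_p, e_o)]
    if norm == 1:
        distance = sum(abs(d) for d in diffs)
    else:
        distance = sum(d * d for d in diffs) ** 0.5
    return -distance  # a higher score means a more plausible triple


def corrupt(triple, entities, eta, rng):
    """Return eta corruptions of a positive triple (subject or object side)."""
    s, p, o = triple
    negatives = []
    while len(negatives) < eta:  # assumes at least two distinct entities
        e = rng.choice(entities)
        candidate = (e, p, o) if rng.random() < 0.5 else (s, p, e)
        if candidate != triple:
            negatives.append(candidate)
    return negatives


# Toy, hand-picked 2-d embeddings (hypothetical values).
entity_emb = {"a": [1.0, 0.0], "b": [1.5, 0.5], "c": [-2.0, 3.0]}
relation_emb = {"r": [0.5, 0.5]}

positive = ("a", "r", "b")
negatives = corrupt(positive, list(entity_emb), eta=4, rng=random.Random(0))

pos_score = transe_score(entity_emb["a"], relation_emb["r"], entity_emb["b"])
neg_scores = [transe_score(entity_emb[s], relation_emb[p], entity_emb[o])
              for s, p, o in negatives]
```

With these toy embeddings the positive triple is a perfect translation, so it receives the highest possible score and every corruption scores strictly lower.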


Calibration. Given a knowledge graph embedding model identified by its scoring function $f_m$, with $f_m(t) = \hat{p}$, where $\hat{p}$ is the estimated confidence level that a triple $t = (s, p, o)$ is true, we define $f_m$ to be calibrated if $\hat{p}$ represents a true probability. For example, if $f_m(\cdot)$ predicts 100 triples all with confidence $\hat{p} = 0.7$, we expect 70 of them to be actually true. Calibrating a model requires reliable metrics to detect miscalibration, and effective techniques to fix such distortion. Appendix A.1 includes definitions and background on the calibration metrics adopted in the paper.
4 Calibrating Knowledge Graph Embedding Models Predictions
We propose two scenario-dependent calibration techniques: the first addresses the case with ground truth negatives, the second deals with the absence of ground truth negatives.
Calibration with Ground Truth Negatives. We propose to use off-the-shelf Platt scaling and isotonic regression, techniques proved to be effective in the literature. It is worth reiterating that calibrating a model requires negative triples from a held-out dataset (which could be the validation set). Such negatives are usually available in triple classification datasets (FB13, WN11, YAGO39K).
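To make the ground truth case concrete, the sketch below fits Platt scaling, i.e. a one-dimensional logistic regression p = sigmoid(a * score + b), on held-out scores with binary labels. It is a minimal pure-Python illustration under stated assumptions: the function names and toy data are hypothetical, the fit uses plain Newton steps, and Platt's original target smoothing is omitted. The optional per-example weights are included only so the same routine could be reused with weighted examples.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt(scores, labels, weights=None, n_iter=50, ridge=1e-9):
    """Fit p = sigmoid(a * score + b) by weighted maximum likelihood."""
    if weights is None:
        weights = [1.0] * len(scores)
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y, w in zip(scores, labels, weights):
            p = sigmoid(a * s + b)
            g_a += w * (p - y) * s      # gradient of the log loss w.r.t. a
            g_b += w * (p - y)          # gradient of the log loss w.r.t. b
            v = w * p * (1.0 - p)
            h_aa += v * s * s           # Hessian entries
            h_ab += v * s
            h_bb += v
        h_aa += ridge                   # tiny damping keeps the Hessian invertible
        h_bb += ridge
        det = h_aa * h_bb - h_ab * h_ab
        a -= (h_bb * g_a - h_ab * g_b) / det   # Newton update
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Toy held-out scores with ground truth labels (hypothetical, not separable).
scores = [-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.0, 2.0, 2.5, 3.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
probs = [sigmoid(a * s + b) for s in scores]
```

At the optimum the gradient with respect to b vanishes, so the calibrated probabilities sum to the number of positives; this is a quick sanity check that the fit matches the base rate of the calibration set.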
Calibration with Synthetic Negatives. Our main contribution is for the case where no ground truth negatives are provided at all, which is in fact the usual scenario for link prediction tasks.
We propose to adopt Platt scaling or isotonic regression and to synthetically generate corrupted triples as negatives, while using sample weights to guarantee that the frequencies adhere to the base rate of the population (which is problem-dependent and must be user-specified). It is worth noting that it is not possible to calibrate a model without an implicit or explicit base rate. If it is not implicit in the dataset (the ratio of positives to totals), it must be explicitly provided.
We generate synthetic negatives following the standard protocol proposed by Bordes et al. (2013): for every positive triple $t = (s, p, o)$, we corrupt one side of the triple at a time (i.e. either the subject $s$ or the object $o$) by replacing it with other entities in $\mathcal{E}$. (We also experimented with per-batch entities only, without any significant changes to the results. Future work will experiment with additional techniques as proposed by Kotnis & Nastase (2017).) The number of corruptions generated per positive is defined by the user-defined corruption rate $\eta$. Since the number of negatives $N$ can be much greater than the number of positive triples $P$, when dealing with calibration with synthetically generated corruptions, we weigh the positive and negative triples to make the calibrated model match the population base rate $\alpha$, otherwise the base rate would depend on the arbitrary choice of $\eta$.
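Given a positive base rate $\alpha$, we propose the following weighting scheme:
$$\omega_+ = \eta, \qquad \omega_- = \frac{1}{\alpha} - 1 \tag{1}$$
where $\omega_+$ is the weight associated to the positive triples and $\omega_-$ to the negatives. The $\omega_+$ weight removes the imbalance determined by having a higher number of corruptions than positive triples in each batch. The $\omega_-$ weight guarantees that the given positive base rate $\alpha$ is respected.
The above can be verified as follows. For the unweighted problem, the positive base rate is simply the ratio of positive examples $P$ to the total number of examples $P + N$, where $N$ is the number of negatives:
$$\alpha = \frac{P}{P + N} \tag{2}$$
If we add uniform weights to each class, we have:
$$\alpha = \frac{\omega_+ P}{\omega_+ P + \omega_- N} \tag{3}$$
By defining $\omega_+ = \eta$, i.e. adopting the ratio of negatives to positives $\eta = N / P$ (corruption rate), we then have:
$$\alpha = \frac{\eta P}{\eta P + \omega_- N} = \frac{N}{N + \omega_- N} = \frac{1}{1 + \omega_-} \tag{4}$$
Thus, the negative weight is:
$$\omega_- = \frac{1}{\alpha} - 1 \tag{5}$$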
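The weighting scheme can be checked numerically. The sketch below assumes weights of the form ω+ = η for positives and ω- = 1/α - 1 for negatives, with η the number of corruptions per positive, and verifies that the weighted share of positives equals the requested base rate regardless of η. Function names and the example numbers are hypothetical.

```python
def calibration_weights(alpha, eta):
    """Return (w_pos, w_neg) for a target positive base rate alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    w_pos = float(eta)           # undoes the eta-to-1 imbalance of corruptions
    w_neg = 1.0 / alpha - 1.0    # enforces the requested base rate
    return w_pos, w_neg


def weighted_base_rate(n_pos, n_neg, w_pos, w_neg):
    """Weighted share of positives among all calibration examples."""
    return w_pos * n_pos / (w_pos * n_pos + w_neg * n_neg)


alpha = 0.25                     # hypothetical user-specified base rate
rates = []
for eta in (1, 5, 20):           # different corruption rates, same alpha
    n_pos = 100
    n_neg = eta * n_pos          # eta corruptions per positive triple
    w_pos, w_neg = calibration_weights(alpha, eta)
    rates.append(weighted_base_rate(n_pos, n_neg, w_pos, w_neg))
```

Each entry of `rates` equals `alpha` up to floating-point error, which illustrates why the choice of corruption rate no longer leaks into the calibrated probabilities. In practice these two weights would be passed as per-example sample weights to the Platt scaling or isotonic regression fit.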
5 Results
We compute the calibration quality of our heuristics, showing that we achieve calibrated predictions even when ground truth negative triples are not available. We then show the impact of calibrated predictions on the task of triple classification.
Datasets. We run experiments on triple classification datasets that include ground truth negatives (Table 1). We train on the training set, calibrate on the validation set, and evaluate on the test set.
We also use two standard link prediction benchmark datasets, WN18RR (Dettmers et al., 2018) (a subset of WordNet) and FB15K-237 (Toutanova et al., 2015) (a subset of Freebase). Their test sets do not include ground truth negatives.
Implementation Details. The knowledge graph embedding models are implemented with the AmpliGraph library (Costabello et al., 2019) version 1.1, using TensorFlow 1.13 (Abadi et al., 2016) and Python 3.6 on the backend. (We will open source our probability calibration in the camera-ready, should the paper be accepted.) All experiments were run under Ubuntu 16.04 on an Intel Xeon Gold 6142, 64 GB, equipped with a Tesla V100 16GB.
Hyperparameter Tuning. For each dataset in Table 1a, we train a TransE, DistMult, and a ComplEx knowledge graph embedding model. We rely on typical hyperparameter values: we train the embeddings with dimensionality , Adam optimizer, initial learning rate , negatives per positive ratio , . We train all models on four different loss functions: Self-adversarial (Sun et al., 2019), pairwise (Bordes et al., 2013), NLL, and Multiclass-NLL (Lacroix et al., 2018). Different losses are used in different experiments.
5.1 Calibration Results
Calibration Success. Table 2 reports Brier scores and log losses for all our calibration methods, grouped by the type of negative triples they deal with (ground truth or synthetic). All calibration methods show better-calibrated results than the uncalibrated case, by a considerable margin and for all datasets. In particular, to put the results of the synthetic strategy in perspective, if we simply predict the positive base rate as a baseline, for each of the cases in Table 2 (the three datasets share the same positive base rate ), we would get Brier score and log loss , results that are always worse than our methods. There is considerable variance of results between models given a dataset, which also happens when varying losses given a particular combination of model and dataset (Table 3). TransE provides the best results for WN11 and FB13, while DistMult works best for YAGO39K. We later examine whether this variance comes from the quality of the embeddings themselves, that is, whether better embeddings allow for better calibration.
In Figure 2, we also evaluate just the frequencies themselves, ignoring sharpness (i.e. whether probabilities are close to 0 or 1), using reliability diagrams for a single model-loss combination, for all datasets (ComplEx+NLL). Calibration plots show a remarkable difference between the uncalibrated baseline (s-shaped blue line on the left-hand side) and all calibrated models (curves closer to the identity function are better). A visual comparison of uncalibrated curves in Figure 1 with those in Figure 2 also gives a sense of the effectiveness of calibration.
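The base-rate baseline mentioned above for Table 2 is easy to reproduce. A constant predictor that always outputs the base rate α obtains a Brier score of α(1 - α) and a log loss equal to the binary entropy of α. The sketch below evaluates both metrics on toy labels; the value α = 0.5 and the label vector are assumptions chosen for illustration, not figures taken from the experiments.

```python
import math


def brier_score(probabilities, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


def log_loss(probabilities, labels, eps=1e-15):
    """Mean negative log-likelihood, with clipping to avoid log(0)."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


alpha = 0.5                              # hypothetical positive base rate
labels = [1] * 50 + [0] * 50             # toy test set with that base rate
baseline = [alpha] * len(labels)         # always predict the base rate

baseline_brier = brier_score(baseline, labels)
baseline_log_loss = log_loss(baseline, labels)
```

Any calibration method worth using should score below both baseline values, since the constant predictor carries no information about individual triples.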
Ground Truth vs Synthetic. As expected, the ground truth method generally performs better than the synthetic calibration, since it has more data in both quantity (twice as much) and quality (two classes instead of one). Even so, the synthetic method is much closer to the ground truth than to the uncalibrated scores, as highlighted by the calibration plots in Figure 2. For WN11, it is actually as good as the calibration with the ground truth. This shows that our proposed method works as intended and could be used in situations where we do not have access to the ground truth, as is the case for most knowledge graph datasets.
Isotonic vs Platt. Isotonic regression performs better than Platt scaling in general, but in practice isotonic regression has the disadvantage of not being a convex or differentiable algorithm (Zadrozny & Elkan, 2002). This is particularly problematic for the synthetic calibration, as it requires the generation of the synthetic corruptions, which can only be made to scale via a mini-batch based optimization procedure. Platt scaling, given that it minimizes a convex and differentiable loss, can be made part of a computational graph and optimized with mini-batches, thus it can rely on the modern computational infrastructure designed to train deep neural networks.
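The contrast with Platt scaling is easier to see from a sketch of isotonic regression itself. The pool-adjacent-violators procedure below needs all scores sorted at once and merges neighbouring blocks until the fitted values are non-decreasing, which is why it does not fit naturally into mini-batch gradient training. This is a simplified illustration: function names and data are hypothetical, scores are assumed distinct, and prediction uses a step function rather than interpolation.

```python
import bisect


def isotonic_fit(scores, labels, weights=None):
    """Weighted pool-adjacent-violators; returns sorted scores and fitted values."""
    if weights is None:
        weights = [1.0] * len(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block holds [weighted label sum, total weight, size]
    for i in order:
        blocks.append([weights[i] * labels[i], weights[i], 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, w, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += w
            blocks[-1][2] += c
    xs = [scores[i] for i in order]
    ys = []
    for s, w, c in blocks:
        ys.extend([s / w] * c)   # every member of a block shares its mean
    return xs, ys


def isotonic_predict(xs, ys, score):
    """Step-function lookup: value of the closest fitted score to the left."""
    i = bisect.bisect_right(xs, score) - 1
    return ys[max(i, 0)]


scores = [-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.0, 2.0, 2.5, 3.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
xs, ys = isotonic_fit(scores, labels)
```

On this toy data the two out-of-order pairs in the middle are pooled into blocks with a calibrated probability of 0.5, while the clean low and high regions map to 0 and 1.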
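Table 2: Brier scores and log losses on the test sets for uncalibrated and calibrated models (self-adversarial loss), grouped by the type of negatives used for calibration (GT = ground truth, Syn = synthetic). Lower is better.
| Dataset | Model | Brier: Uncalib | Brier: GT Platt | Brier: GT Iso | Brier: Syn Platt | Brier: Syn Iso | Log loss: Uncalib | Log loss: GT Platt | Log loss: GT Iso | Log loss: Syn Platt | Log loss: Syn Iso |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WN11 | TransE | .443 | .089 | .087 | .092 | .088 | 1.959 | .302 | .295 | .311 | .296 |
| WN11 | DistMult | .488 | .213 | .208 | .214 | .208 | 5.625 | .618 | .604 | .618 | .601 |
| WN11 | ComplEx | .490 | .240 | .227 | .240 | .228 | 6.061 | .674 | .651 | .674 | .650 |
| WN11 | HolE | .474 | .235 | .235 | .235 | .236 | 2.731 | .663 | .661 | .663 | .668 |
| FB13 | TransE | .446 | .124 | .124 | .148 | .141 | 1.534 | .390 | .391 | .459 | .442 |
| FB13 | DistMult | .473 | .178 | .170 | .185 | .192 | 2.177 | .533 | .518 | .549 | .567 |
| FB13 | ComplEx | .481 | .177 | .170 | .182 | .189 | 2.393 | .534 | .516 | .544 | .565 |
| FB13 | HolE | .452 | .229 | .228 | .242 | .263 | 1.681 | .650 | .651 | .677 | .725 |
| YAGO39K | TransE | .363 | .095 | .093 | .106 | .110 | 1.062 | .319 | .309 | .370 | .376 |
| YAGO39K | DistMult | .284 | .081 | .079 | .093 | .089 | 1.043 | .279 | .266 | .311 | .308 |
| YAGO39K | ComplEx | .264 | .089 | .084 | .097 | .095 | 1.199 | .305 | .278 | .323 | .313 |
| YAGO39K | HolE | .345 | .141 | .140 | .166 | .162 | 1.065 | .444 | .438 | .581 | .537 |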
Influence of Loss Function. We experiment with different losses, to assess how calibration affects each of them (Table 3). We choose to work with TransE, which is reported as a strong baseline in (Hamaguchi et al., 2017). Self-adversarial loss obtains the best calibration results for all calibration methods, across all datasets. Experiments also show the choice of the loss has a big impact, greater than the choice of calibration method or embedding model. We assess whether such variability is determined by the quality of the embeddings. To verify whether better embeddings lead to sharper calibration, we report the mean reciprocal rank (MRR), which, for each true test triple, computes the (inverse) rank of the triple against synthetic corruptions, then averages the inverse rank (Table 3). In fact, we notice no correlation between calibration results and MRR. In other words, embeddings that lead to the best predictive power are not necessarily the best calibrated.
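For reference, the mean reciprocal rank described above can be sketched in a few lines. The code assumes that each true test triple comes with the scores of its corruptions, that higher scores mean more plausible triples, and that any filtering of known positives (the "filtered" setting) has already been applied by the caller. Ties are resolved optimistically. Names and numbers are hypothetical.

```python
def reciprocal_rank(true_score, corruption_scores):
    """Inverse rank of the true triple among its corruptions (1 = best)."""
    rank = 1 + sum(1 for c in corruption_scores if c > true_score)
    return 1.0 / rank


def mean_reciprocal_rank(scored_triples):
    """Average inverse rank over (true_score, corruption_scores) pairs."""
    ranks = [reciprocal_rank(t, cs) for t, cs in scored_triples]
    return sum(ranks) / len(ranks)


# Toy example: the first triple is ranked 1st, the second 3rd.
scored = [
    (0.9, [0.1, 0.5]),
    (0.2, [0.3, 0.4, 0.1]),
]
mrr = mean_reciprocal_rank(scored)  # (1/1 + 1/3) / 2
```

Because MRR depends only on the ordering of scores, any monotonic calibration map such as Platt scaling with a positive slope leaves it unchanged, which is consistent with ranking quality and calibration quality being separate properties.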
Table 3: Calibration results for TransE trained with different losses, together with the filtered MRR (GT = ground truth negatives, Syn = synthetic negatives). Lower Brier score and log loss are better; higher MRR is better.
| Dataset | Loss | Brier: GT Platt | Brier: GT Iso | Brier: Syn Platt | Brier: Syn Iso | Log loss: GT Platt | Log loss: GT Iso | Log loss: Syn Platt | Log loss: Syn Iso | MRR (filtered) |
|---|---|---|---|---|---|---|---|---|---|---|
| WN11 | Pairwise | .202 | .198 | .209 | .200 | .591 | .585 | .606 | .589 | .058 |
| WN11 | NLL | .093 | .088 | .094 | .088 | .342 | .299 | .344 | .301 | .134 |
| WN11 | Multiclass-NLL | .204 | .189 | .204 | .189 | .599 | .550 | .599 | .551 | .108 |
| WN11 | Self-adversarial | .089 | .087 | .092 | .088 | .302 | .295 | .311 | .296 | .155 |
| FB13 | Pairwise | .225 | .203 | .225 | .208 | .636 | .582 | .637 | .594 | .282 |
| FB13 | NLL | .209 | .203 | .240 | .244 | .614 | .592 | .676 | .685 | .202 |
| FB13 | Multiclass-NLL | .146 | .146 | .162 | .159 | .455 | .454 | .500 | .490 | .402 |
| FB13 | Self-adversarial | .124 | .124 | .142 | .141 | .390 | .390 | .446 | .442 | .296 |
| YAGO39K | Pairwise | .123 | .103 | .147 | .113 | .445 | .352 | .477 | .393 | .371 |
| YAGO39K | NLL | .187 | .170 | .260 | .200 | .577 | .518 | .756 | .622 | .063 |
| YAGO39K | Multiclass-NLL | .111 | .104 | .128 | .116 | .392 | .350 | .431 | .440 | .325 |
| YAGO39K | Self-adversarial | .095 | .093 | .113 | .109 | .319 | .308 | .399 | .376 | .169 |
Positive Base Rate. We apply our synthetic calibration method to two link prediction benchmark datasets, FB15K-237 and WN18RR. As they only provide positive examples, we apply our method with varying base rates $\alpha_i$, linearly spaced from to . We evaluate results relying on the closed-world assumption, i.e. triples not present in training, validation or test sets are considered negative. For each base rate $\alpha_i$ we calibrate the model using the synthetic method with both isotonic regression and Platt scaling. We sample negatives from the negative set under the implied negative rate, and calculate a baseline which is simply having all probability predictions equal to the base rate $\alpha_i$. Figure 3 shows that isotonic regression and Platt scaling perform similarly and always considerably below the baseline. As expected from the previous results, the uncalibrated scores perform poorly, only reaching acceptable levels around some particular base rates.
Triple Classification and Decision Threshold. To overcome the need to learn decision thresholds from the validation set, we propose to rely on calibrated probabilities, and use the natural threshold of $\tau = 0.5$. Table 4 shows how calibration affects the triple classification task, comparing with the literature standard of per-relation thresholds (last column). For simplicity, note we use the same self-adversarial loss in Table 2 and Table 4. We learn thresholds on validation sets, resulting in 11, 7, and 33 thresholds for WN11, FB13 and YAGO39K respectively.
Using a single threshold $\tau = 0.5$ and calibration provides competitive results compared to multiple learned thresholds (note uncalibrated results with $\tau = 0.5$ are poor, as expected). It is worth mentioning that we are on par with state-of-the-art results for WN11. Isotonic regression is again the best method, but there is more variance in the model choice. Our proposed calibration method with synthetic negatives performs well overall, even though calibration is performed using only half of the validation set (negative examples are replaced by synthetic negatives).
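The two decision rules compared here can be sketched side by side. The first function applies a single threshold to calibrated probabilities; the second learns one threshold per relation type from raw validation scores by picking, for each relation, the candidate that maximises validation accuracy. Both are simplified illustrations with hypothetical names and toy data, and the per-relation search is only one plausible way of implementing the literature protocol.

```python
def classify(probabilities, tau=0.5):
    """Label a triple as true when its calibrated probability reaches tau."""
    return [1 if p >= tau else 0 for p in probabilities]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)


def learn_relation_thresholds(scores, labels, relations):
    """For each relation, pick the score threshold with best validation accuracy."""
    by_relation = {}
    for s, y, r in zip(scores, labels, relations):
        by_relation.setdefault(r, []).append((s, y))
    thresholds = {}
    for r, pairs in by_relation.items():
        candidates = sorted({s for s, _ in pairs})
        thresholds[r] = max(
            candidates,
            key=lambda t: sum((s >= t) == bool(y) for s, y in pairs),
        )
    return thresholds


# Single natural threshold on calibrated probabilities (toy values).
predictions = classify([0.2, 0.5, 0.9])
single_threshold_accuracy = accuracy(predictions, [0, 1, 0])

# Per-relation thresholds on uncalibrated scores living on different scales.
val_scores = [-2.0, -1.0, 1.0, 2.0, 5.0, 7.0]
val_labels = [0, 0, 1, 1, 0, 1]
val_relations = ["r1", "r1", "r1", "r1", "r2", "r2"]
thresholds = learn_relation_thresholds(val_scores, val_labels, val_relations)
```

The toy relations need very different raw-score thresholds, which is the bookkeeping that a single cut-off on calibrated probabilities avoids.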
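Table 4: Triple classification accuracy with a single threshold $\tau = 0.5$ on calibrated and uncalibrated probabilities, compared with uncalibrated scores and per-relation thresholds (self-adversarial loss). The literature result is reported once per dataset.
| Dataset | Model | Ground Truth ($\tau = 0.5$): Platt | Ground Truth ($\tau = 0.5$): Iso | Synthetic ($\tau = 0.5$): Platt | Synthetic ($\tau = 0.5$): Iso | Uncalib. ($\tau = 0.5$) | Uncalib. (per-relation $\tau$): Reproduced | Uncalib. (per-relation $\tau$): Literature |
|---|---|---|---|---|---|---|---|---|
| WN11 | TransE | 88.8 | 88.9 | 88.9 | 88.9 | 50.7 | 88.2 | 88.9 |
| WN11 | DistMult | 66.5 | 67.2 | 66.4 | 67.1 | 50.8 | 67.2 | |
| WN11 | ComplEx | 60.6 | 62.4 | 60.0 | 62.4 | 50.8 | 59.6 | |
| WN11 | HolE | 59.3 | 59.0 | 59.3 | 59.0 | 50.9 | 60.8 | |
| FB13 | TransE | 82.4 | 82.4 | 80.7 | 80.2 | 50.0 | 82.1 | 89.1 |
| FB13 | DistMult | 72.5 | 73.2 | 72.1 | 70.2 | 50.1 | 80.8 | |
| FB13 | ComplEx | 73.8 | 74.2 | 74.2 | 72.4 | 50.1 | 83.6 | |
| FB13 | HolE | 60.3 | 60.6 | 57.8 | 54.3 | 50.0 | 62.6 | |
| YAGO39K | TransE | 87.2 | 87.8 | 85.3 | 84.9 | 50.2 | 88.8 | 93.8 |
| YAGO39K | DistMult | 88.9 | 89.3 | 88.1 | 88.5 | 56.7 | 90.2 | |
| YAGO39K | ComplEx | 87.3 | 88.2 | 86.9 | 87.2 | 61.1 | 89.4 | |
| YAGO39K | HolE | 80.4 | 80.4 | 78.4 | 78.5 | 50.6 | 81.5 | |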
6 Conclusion
We propose a method to calibrate knowledge graph embedding models. We target datasets with and without ground truth negatives. We experiment on triple classification datasets and apply Platt scaling and isotonic regression with and without synthetic negatives controlled by our heuristics. All calibration methods perform significantly better than uncalibrated scores. We show that isotonic regression brings better calibration performance, but it is computationally more expensive. Additional experiments on triple classification show that calibration allows the use of a single decision threshold, reaching state-of-the-art results without the need to learn per-relation thresholds.
Future work will evaluate additional calibration algorithms, such as beta calibration (Kull et al., 2017) or Bayesian binning (Naeini et al., 2015). We will also experiment on ensembling of knowledge graph embedding models, inspired by Krompaß & Tresp (2015). The rationale is that different models operate on different scales, but calibrating brings them all to the same probability scale, so their output can be easily combined.
References

Abadi et al. (2016)
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey
Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al.
Tensorflow: A system for largescale machine learning.
In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.  Balažević et al. (2019) Ivana Balažević, Carl Allen, and Timothy M Hospedales. Tucker: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019.
 Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250. AcM, 2008.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto GarciaDuran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multirelational data. In NIPS, pp. 2787–2795, 2013.
 Brier (1950) Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
 Cai et al. (2017) Hongyun Cai, Vincent W Zheng, and Kevin ChenChuan Chang. A comprehensive survey of graph embedding: Problems, techniques and applications. arXiv preprint arXiv:1709.07604, 2017.
 Cai & Wang (2018) Liwei Cai and William Yang Wang. Kbgan: Adversarial learning for knowledge graph embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1470–1480, 2018.
 Costabello et al. (2019) Luca Costabello, Sumit Pai, Chan Le Van, Rory McGrath, Nicholas McCarthy, and Pedro Tabacof. AmpliGraph: a Library for Representation Learning on Knowledge Graphs, March 2019. URL https://doi.org/10.5281/zenodo.2595043.
 DeGroot & Fienberg (1983) Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(12):12–22, 1983.
 Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In Procs of AAAI, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17366.
 Dong et al. (2014) Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A webscale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 601–610. ACM, 2014.

Ebisu & Ichise (2018)
Takuma Ebisu and Ryutaro Ichise.
Toruse: Knowledge graph embedding on a lie group.
In
ThirtySecond AAAI Conference on Artificial Intelligence
, 2018.  Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1321–1330. JMLR. org, 2017.
 Hamaguchi et al. (2017) Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge transfer for outofknowledgebase entities: a graph neural network approach. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1802–1808. AAAI Press, 2017.
 He et al. (2015) Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao. Learning to represent knowledge graphs with gaussian embedding. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 623–632. ACM, 2015.
 Ji et al. (2016) Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 Kazemi & Poole (2018) Seyed Mehran Kazemi and David Poole. Simple embedding for link prediction in knowledge graphs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 4284–4295. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7682simpleembeddingforlinkpredictioninknowledgegraphs.pdf.
 Kotnis & Nastase (2017) Bhushan Kotnis and Vivi Nastase. Analysis of the impact of negative sampling on link prediction in knowledge graphs. arXiv preprint arXiv:1708.06816, 2017.
 Krompaß & Tresp (2015) Denis Krompaß and Volker Tresp. Ensemble solutions for linkprediction in knowledge graphs. In 2nd Workshop, 2015.
 Kuleshov et al. (2018) Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.
 Kull et al. (2017) Meelis Kull, Telmo M Silva Filho, Peter Flach, et al. Beyond sigmoids: How to obtain wellcalibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052–5080, 2017.
 Lacroix et al. (2018) Timothee Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, pp. 2869–2878, 2018.
 Liu et al. (2017) Hanxiao Liu, Yuexin Wu, and Yiming Yang. Analogical inference for multirelational embeddings. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2168–2178. JMLR. org, 2017.

Lv et al. (2018)
Xin Lv, Lei Hou, Juanzi Li, and Zhiyuan Liu.
Differentiating concepts and instances for knowledge graph embedding.
In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
, pp. 1971–1979, 2018.  Mahdisoltani et al. (2013) Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. Yago3: A knowledge base from multilingual wikipedias. 2013.
 Miller (1995) George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
 Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In TwentyNinth AAAI Conference on Artificial Intelligence, 2015.
 Nathani et al. (2019) Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning attentionbased embeddings for relation prediction in knowledge graphs. arXiv preprint arXiv:1906.01195, 2019.

Nguyen et al. (2018)
Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung.
A novel embedding model for knowledge base completion based on convolutional neural network.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 327–333, 2018.  Nguyen et al. (2019) Dai Quoc Nguyen, Thanh Vu, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. A capsule networkbased embedding model for knowledge graph completion and search personalization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2180–2189, 2019.
 Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and HansPeter Kriegel. A threeway model for collective learning on multirelational data. In ICML, volume 11, pp. 809–816, 2011.
 Nickel et al. (2016a) Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Procs of the IEEE, 104(1):11–33, 2016a.
 Nickel et al. (2016b) Maximilian Nickel, Lorenzo Rosasco, Tomaso A Poggio, et al. Holographic embeddings of knowledge graphs. In AAAI, pp. 1955–1961, 2016b.

NiculescuMizil & Caruana (2005)
Alexandru NiculescuMizil and Rich Caruana.
Predicting good probabilities with supervised learning.
In Proceedings of the 22nd international conference on Machine learning, pp. 625–632. ACM, 2005. 
 Platt et al. (1999) John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
 Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems (NIPS), 2013.
 Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197, 2019.
 Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1499–1509, 2015.
 Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In procs of ICML, pp. 2071–2080, 2016.
 Yang et al. (2015) Bishan Yang, Scott Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In Procs of ICLR, 2015.
 Zadrozny & Elkan (2002) Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699. ACM, 2002.
 Zhang et al. (2019) Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. Quaternion knowledge graph embedding. arXiv preprint arXiv:1904.10281, 2019.
 Zhang et al. (2018) Zhao Zhang, Fuzhen Zhuang, Meng Qu, Fen Lin, and Qing He. Knowledge graph embedding with hierarchical relation structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3198–3207, 2018.
Appendix A Appendix
A.1 Calibration Metrics
Reliability Diagram (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005). Also known as a calibration plot, this diagram is a visual depiction of the calibration of a model (see Figure 1 for an example). It shows the expected sample accuracy as a function of the estimated confidence. A hypothetical perfectly calibrated model is represented by the diagonal line (i.e. the identity function). Deviation from this diagonal indicates calibration issues (Guo et al., 2017).
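As an illustration of how the points of such a diagram can be obtained, the following minimal sketch groups predictions into equal-width confidence bins and compares the mean predicted probability with the observed fraction of positives in each bin. The function name `reliability_curve`, the equal-width binning, and the toy data are assumptions made for this example and are not taken from the experiments.

```python
def reliability_curve(y_true, p_hat, n_bins=10):
    """Return (mean predicted probability, observed positive fraction, count)
    for every non-empty equal-width bin over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_hat):
        # p == 1.0 would index one past the end, so clamp it to the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    curve = []
    for members in bins:
        if not members:
            continue  # empty bins contribute no point to the diagram
        mean_p = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        curve.append((mean_p, frac_pos, len(members)))
    return curve


# Toy data (hypothetical): labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
p_hat = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
curve = reliability_curve(y_true, p_hat, n_bins=2)
```

A point lying above the diagonal (observed fraction larger than mean predicted probability) corresponds to underforecasting in that bin, and a point below it to overforecasting.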
Brier Score (Brier, 1950). It is a popular metric used to measure how well a binary classifier is calibrated. It is defined as the mean squared error between $n$ probability estimates $\hat{p}_i$ and the corresponding actual outcomes $y_i \in \{0, 1\}$. The smaller the Brier score, the better calibrated the model. Note that the Brier score $BS \in [0, 1]$.
$$BS = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{p}_i \right)^2 \tag{6}$$
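The Brier score, i.e. the mean squared error between probability estimates and binary outcomes, can be computed in a few lines. The sketch below is only illustrative: the function name `brier_score` and the toy inputs are assumptions of this example.

```python
def brier_score(y_true, p_hat):
    """Mean squared error between binary outcomes and probability estimates."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, p_hat)) / n


# Toy data (hypothetical).
y_true = [1, 0, 1, 0]
confident = brier_score(y_true, [0.9, 0.1, 0.8, 0.3])
uninformative = brier_score(y_true, [0.5, 0.5, 0.5, 0.5])
perfect = brier_score(y_true, [1.0, 0.0, 1.0, 0.0])
```

On this toy input, a model that always answers 0.5 scores 0.25, whereas a perfect predictor scores 0, which matches the lower-is-better reading of the metric.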
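Log Loss is another effective and popular metric to measure the reliability of the probabilities returned by a classifier. The logarithmic loss measures the relative uncertainty between the probability estimates $\hat{p}_i$ produced by the model and the corresponding true labels $y_i$.
$$LL = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log \left( 1 - \hat{p}_i \right) \right] \tag{7}$$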
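A minimal sketch of the binary log loss follows. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero when a probability estimate is exactly 0 or 1. It is a common implementation detail, not something prescribed by the text.

```python
import math


def log_loss(y_true, p_hat, eps=1e-15):
    """Average negative log-likelihood of binary labels under the estimates."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        # Clip to (0, 1) so that log() stays finite (assumed safeguard).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Toy data (hypothetical).
coin_flip = log_loss([1, 0], [0.5, 0.5])
confident_right = log_loss([1, 0], [0.99, 0.01])
confident_wrong = log_loss([1, 0], [0.01, 0.99])
```

Unlike the Brier score, the log loss is unbounded above, so confidently wrong estimates are penalized much more heavily than hesitant ones.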
Platt Scaling. Proposed by Platt et al. (1999) for support vector machines, Platt scaling is a popular parametric calibration technique for binary classifiers. The method consists in fitting a logistic regression model to the scores returned by a binary classifier, such that $\hat{p} = \sigma(a s + b)$, where $s \in \mathbb{R}$ is the uncalibrated score of the classifier, $a, b \in \mathbb{R}$ are trained scalar weights, and $\hat{p}$ is the calibrated probability returned as output. Such a model can be trained by optimizing the NLL loss with non-binary targets derived by the Bayes rule under an uninformative prior, resulting in a Maximum a Posteriori estimate.
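The following sketch fits the two Platt scaling weights on a handful of scores, writing the calibrated probability as a sigmoid of `a * s + b` for a raw score `s`. It assumes the smoothed targets from Platt's original formulation, (N+ + 1) / (N+ + 2) for positives and 1 / (N- + 2) for negatives, and uses plain batch gradient descent as a simplification of the optimizer. The function names, learning rate, iteration count, and toy data are all choices made for this illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * s + b) by minimising the NLL with smoothed targets."""
    n = len(scores)
    n_pos = sum(labels)
    n_neg = n - n_pos
    # Non-binary targets (Platt's smoothing, assumed here).
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    targets = [t_pos if y == 1 else t_neg for y in labels]

    a, b = 0.0, 0.0
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, t in zip(scores, targets):
            err = sigmoid(a * s + b) - t  # derivative of the NLL w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    return sigmoid(a * score + b)


# Toy data (hypothetical): higher raw scores tend to be true triples.
scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(scores, labels)
calibrated = [calibrate(s, a, b) for s in scores]
```

Because only two scalars are learned, the mapping is monotonic in the raw score whenever `a` is positive, so calibration of this kind does not change the ranking of triples.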
Isotonic Regression (Zadrozny & Elkan, 2002). This popular nonparametric calibration technique consists in fitting a nondecreasing piecewise constant function to the output of an uncalibrated classifier. As for Platt scaling, the goal is learning a function $f$ such that $\hat{p} = f(s)$ is a calibrated probability. Isotonic regression learns $f$ by minimizing the square loss $\sum_{i=1}^{n} \left( f(s_i) - y_i \right)^2$ under the constraint that $f$ must be piecewise constant (Guo et al., 2017).
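One standard way to solve this constrained least-squares problem is the pool-adjacent-violators algorithm, sketched below. The sketch assumes distinct scores (ties would need to be pooled first), and the function names `fit_isotonic` and `predict_isotonic`, together with the toy data, are hypothetical choices for this illustration.

```python
def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: return a list of (right_edge, value) steps
    describing a nondecreasing piecewise constant function."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block: [sum of labels, number of points, largest score]
    for i in order:
        blocks.append([float(labels[i]), 1, scores[i]])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right
    return [(right, total / count) for total, count, right in blocks]


def predict_isotonic(steps, score):
    """Evaluate the step function; scores beyond the last edge get the last value."""
    for right, value in steps:
        if score <= right:
            return value
    return steps[-1][1]


# Toy data (hypothetical): one out-of-order label at score 0.3.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 1]
steps = fit_isotonic(scores, labels)
values = [v for _, v in steps]
```

In this example the violating pair at scores 0.2 and 0.3 is pooled into a single step with value 0.5, which is how the fitted function stays nondecreasing.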
A.2 Calibration Diagrams: Instances per Bin
A.3 Impact of Model Hyperparameters: Negative/Positive Ratio and Embedding Dimensionality
In Figure 5 we report the impact of the negative/positive ratio and of the embedding dimensionality. Results show that the embedding size has a higher impact than the negative/positive ratio. We observe that both calibrated and uncalibrated low-dimensional embeddings have a worse Brier score. Results also show that, past a certain embedding size, increasing the dimensionality no longer improves calibration. The negative/positive ratio follows a similar pattern: beyond a certain value, increasing it has no further effect on the calibration score.
A.4 Positive Base Rate Experiments: Link Prediction Performance
In Table 5, we present the traditional knowledge graph embedding rank metrics: MRR (mean reciprocal rank), MR (mean rank) and Hits@10 (precision at the top-10 results). We report the results for all datasets and models used in the main text, which appear in Table 2, Table 4 and Figure 3.
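| Dataset | Model | MR | MRR | Hits@10 |
|---|---|---|---|---|
| WN11 | TransE | 2289 | .155 | .309 |
| WN11 | DistMult | 10000 | .045 | .081 |
| WN11 | ComplEx | 13815 | .054 | .094 |
| WN11 | HolE | 13355 | .017 | .035 |
| FB13 | TransE | 3431 | .296 | .394 |
| FB13 | DistMult | 6667 | .183 | .337 |
| FB13 | ComplEx | 8937 | .018 | .039 |
| FB13 | HolE | 8937 | .018 | .039 |
| YAGO39K | TransE | 244 | .169 | .319 |
| YAGO39K | DistMult | 635 | .306 | .620 |
| YAGO39K | ComplEx | 1074 | .531 | .753 |
| YAGO39K | HolE | 922 | .101 | .189 |
| WN18RR | ComplEx | 4111 | .506 | .583 |
| FB15K-237 | ComplEx | 183 | .320 | .499 |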
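For reference, the three rank metrics reported above can be derived from the rank that each true test triple obtains among its corruptions. The sketch below assumes the usual definition of Hits@10 as the fraction of test triples ranked in the top 10. The function name `rank_metrics` and the list of ranks are hypothetical and unrelated to the values in the table.

```python
def rank_metrics(ranks, k=10):
    """Return (MR, MRR, Hits@k) for a list of 1-based ranks."""
    n = len(ranks)
    mr = sum(ranks) / n                         # mean rank, lower is better
    mrr = sum(1.0 / r for r in ranks) / n       # mean reciprocal rank, higher is better
    hits = sum(1 for r in ranks if r <= k) / n  # share of ranks within the top k
    return mr, mrr, hits


# Toy ranks (hypothetical).
mr, mrr, hits_at_10 = rank_metrics([1, 2, 4, 20])
```

MR is sensitive to a few very poorly ranked triples, while MRR and Hits@10 are dominated by the well-ranked ones, which is why the three metrics can disagree on which model is best.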
A.5 Per-Relation Decision Thresholds
We report in Table 6 the per-relation decision thresholds used in Table 4, under the ‘Reproduced’ column. Note that the thresholds reported here are not probabilities, as they have been applied to the raw scores returned by the model-dependent scoring function.
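To make the role of such thresholds concrete, the sketch below picks one cut-off per relation on raw validation scores and then uses it to classify a triple. It assumes that higher scores mean more plausible triples and that each cut-off is chosen to maximize validation accuracy. This selection rule, the function names, the relation names, and the scores are illustrative assumptions and do not reproduce the values of Table 6.

```python
def best_threshold(scores, labels):
    """Choose the cut-off maximising accuracy when score >= cut-off means 'true'."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def fit_thresholds(validation):
    """validation maps each relation to (raw scores, binary labels)."""
    return {rel: best_threshold(s, y) for rel, (s, y) in validation.items()}


def classify(relation, score, thresholds):
    return score >= thresholds[relation]


# Toy validation data (hypothetical relations and raw scores).
validation = {
    "born_in": ([-4.0, -3.0, -1.0, -0.5], [0, 0, 1, 1]),
    "works_for": ([0.5, 1.0, 2.5, 3.0], [0, 0, 1, 1]),
}
thresholds = fit_thresholds(validation)
same_score_a = classify("born_in", -0.8, thresholds)
same_score_b = classify("works_for", -0.8, thresholds)
```

The same raw score of -0.8 is accepted for one relation and rejected for the other, which illustrates why uncalibrated scores require one threshold per relation.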


