1.0.1 Link prediction
There are two major ways of measuring the quality of (neural) embeddings of entities in a Knowledge Graph for link prediction tasks, inspired by two different fields: information retrieval [1, 2, 3, 4, 5] and graph-based data mining [6, 7, 8, 9, 10, 11]. Approaches inspired by information retrieval tend to favor the mean rank measurement and its variants (mean average precision, top results, mean reciprocal rank), while graph-based data mining approaches resort to the standard evaluation measurements of classifiers based on false positive rate to true positive rate curves (e.g., ROC AUC, F-measure). While both techniques measure the quality of the embeddings, they may or may not be suitable for a particular link prediction task. Consider the popular mean rank (and its variants), which measures the ability of a retrieval system to score a true result above all other possible candidates. Applied to the knowledge graph completion task , typically, one would consider all links, i.e., assertions of relations of type , as true and all other possible assertions that are not in the Knowledge Graph as negative (i.e., ). If we use mean rank as our metric to measure the performance of a classifier for link prediction, we might run into the following problems.
Consider a simple KG, consisting of three entities and one relation , and assume that we only know about the existence of the link , as depicted in Figure 1. Our only true link (example for a prediction system) is and our negative links are . Our predictor of positive links for relation can be trained in such a way that it outputs the highest score for the true link and slightly lower scores for the negative links, i.e., . Obviously, would be ranked above all , yielding a perfect mean rank score. However, if we now wanted to use as our binary predictor if , and if , we would have a predictor which predicts all links to be true. Obviously, we could have tuned our predictor (e.g., by identifying a suitable threshold from a few samples) to have a threshold of (i.e., ), but that would be an inflexible predictor. In other words, the mean rank measure may evaluate certain embeddings to be of good quality, while they may actually be bad for a specific link prediction task. We believe that the mean rank measurement should be used to judge the quality of the embeddings for the reconstruction of the original knowledge graph that was used to train the embeddings, as opposed to the prediction of new links.
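The argument can be made concrete with a small sketch. All entity names, scores, and the 0.5 threshold below are illustrative assumptions (the concrete values of the example are not reproduced here); the point is only that a perfect mean rank can coexist with a thresholded classifier that accepts every candidate link.

```python
# Hypothetical scores for a toy graph with entities a, b, c and one relation.
# The values are illustrative assumptions, not taken from the paper.
scores = {
    ("a", "b"): 0.90,  # the only known (true) link
    ("a", "c"): 0.85,
    ("b", "a"): 0.85,
    ("b", "c"): 0.80,
    ("c", "a"): 0.80,
    ("c", "b"): 0.75,
}
true_links = {("a", "b")}


def mean_rank(scores, true_links):
    """Average 1-based rank of the true links among all candidates, best score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    ranks = [ranked.index(link) + 1 for link in true_links]
    return sum(ranks) / len(ranks)


def predict(scores, threshold=0.5):
    """Binary decision per candidate link: True if the score reaches the threshold."""
    return {link: score >= threshold for link, score in scores.items()}


rank = mean_rank(scores, true_links)
predictions = predict(scores)
false_positives = [l for l, p in predictions.items() if p and l not in true_links]

print("mean rank:", rank)
print("false positives:", len(false_positives), "of", len(scores) - len(true_links))
```

Here the true link is ranked first, yet all five negative candidates are also predicted as links at the assumed threshold.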
1.0.2 Link prediction as binary classification task
A mean-rank-based link prediction pipeline does not isolate the performance of entity embeddings in predicting a specific relation, as it gives a single scalar evaluation of overall performance. Such a performance description is akin to describing a whole population distribution by its mean value only. In the bioinformatics field, link prediction is a very important problem, and it is formulated as a binary classification task . This is different from the reconstruction setting. Instead of asking to which other entity among all possible entities the entity is most probable to be connected with a link, we ask what is the probability of having a link . An example of could be asking:
What is the probability that the gene TRIM28 () has function () negative regulation of transcription by RNA polymerase II ()?
Indeed, a predictor that ranks first among all other possible completions of is not a strong enough link predictor. Apart from the fact that many of these potential negative examples would not even make sense biologically, i.e., , the predictor risks accepting too many false positives before it correctly classifies all the true positive examples. For the rest of this paper, we focus on evaluation strategies of link prediction tasks for knowledge graphs, where link prediction is formulated as a binary classification problem.
1.0.3 How multi-relational are knowledge graphs?
Typically, in the statistical relational learning field, link prediction algorithms embed both entities and relations in the same embedding space, and then use binary operators defined on these representations to represent a labelled link (i.e., a triple). For instance, given a function that associates entities or relations with their vector space embeddings, a predictor might output the score by evaluating a standard Euclidean dot product like so . And if then we expect and to output very different scores, thus disambiguating links with relation or between the two entities and .
One might ask whether it is possible to construct link predictors by considering only the embeddings of the entities. Can we flatten the KG, by introducing the restriction that each pair of entities must have at most one unlabelled link (Figure 2), and come up with a strategy for disambiguating links with different relations? To better demonstrate the argument, Table 1 below provides the statistics of pairwise counts of links in the WN11 Knowledge Graph (original WordNet dataset  brought down to 11 relations as in ). We can see that only of all pairs are connected with more than one link. In other words, the degree to which this graph is multi-relational is really low.
| relation | # links | # sources | # targets |
| derivationally related form | 31867 | 16737 | 16737 |
| member of domain region | 983 | 118 | 925 |
| synset domain topic of | 3335 | 3170 | 313 |
| member of domain usage | 675 | 25 | 635 |

| # pairs (multi) | # pairs (total) | (%) |
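The pairwise statistic discussed above can be computed along the following lines. This is a minimal sketch under stated assumptions: the toy triples are hypothetical, and pairs are treated as ordered (head, tail) pairs, which may differ from the counting convention used for Table 1.

```python
from collections import defaultdict


def multi_relation_ratio(triples):
    """Return (#pairs with more than one relation, #connected pairs, percentage)."""
    relations_per_pair = defaultdict(set)
    for head, relation, tail in triples:
        # Assumption: a pair is an ordered (head, tail) tuple.
        relations_per_pair[(head, tail)].add(relation)
    total = len(relations_per_pair)
    multi = sum(1 for rels in relations_per_pair.values() if len(rels) > 1)
    pct = 100.0 * multi / total if total else 0.0
    return multi, total, pct


# Hypothetical toy triples, only for illustration.
toy_triples = [
    ("dog", "hypernym", "animal"),
    ("dog", "derivationally related form", "animal"),
    ("cat", "hypernym", "animal"),
    ("paris", "member of domain region", "france"),
]

multi, total, pct = multi_relation_ratio(toy_triples)
print(f"{multi} of {total} pairs are multi-relational ({pct:.2f}%)")
```

A low percentage indicates that flattening the graph to at most one unlabelled link per pair discards little information.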
Perhaps not directly, but similarly, some works have considered embedding the entities without the relations [10, 5] (although the potential pitfalls of flattening the links are not raised there). In , link prediction on a hierarchical transitive closure is treated without learning an embedding for the hierarchical relation (e.g., hypernym). In the bioinformatics domain, a similar idea of learning embeddings separately for each relation is proposed, further enhanced by considering binary classifiers operating on the learned embeddings to predict links of type .
1.0.4 Contribution of this work
We investigate empirically the potential of training neural embeddings globally for the entire graph, as opposed to training embeddings locally for a specific relation . This is different from what has been done previously [5, 10], and opens new ways of assessing the quality of the neural embeddings. By comparing, combining and extending different methodologies for link prediction on graph-based data coming from different domains, we present a unified methodology for the quality evaluation of neural embeddings for link prediction tasks for knowledge graphs. The link prediction problem is treated here as a binary classification task on entity embeddings. The evaluation steps required for an effective assessment of the quality of the knowledge graph embeddings are formalized. Our evaluation pipeline is made open source , and with this we aim to draw the attention of the community to the important issue of transparency and reproducibility of results.
2.0.1 Preparation of the datasets for neural embedding training
The preparation of datasets for neural embedding training is inspired by the methodology presented in . In the following we present our generalized approach to this problem, as we think it is crucial for a transparent and reproducible evaluation pipeline, and it has not been described in enough detail in other work. In addition to local retained graphs, where only triples of a specified relation are removed (as used in ), we also consider global retained graphs for all relations .
Let denote all the existing links in the KG, i.e., triples , and let denote the non-existing links , with the restrictions that (number of elements is equal), and , and , as in . The former restriction () ensures that we do not have imbalanced classes for the binary prediction (see  for more details); we therefore sample negatives at random with the sample size equal to the number of positive examples. Potentially, the set of all possible negative examples is much bigger than the set of all positive examples (because we are enumerating many more possible connections in the graph, excluding the existing ones). The latter restriction on the domain and the range of a relation restricts attention to the most probable and semantically consistent negative links. Then, forms the set of all positive examples and, analogously, the set of sampled negative examples (see Section 1.0.1). To test the quality of the embeddings and their ability to classify examples for a specific relation, we split, for each considered relation, the set of positive and negative links of type () in a given train/test ratio ( in ). To simplify the notation, we let represent the train split of all positive examples for links with type , and represent the test split of all positive examples. The same holds for the negative examples (i.e., train split and test split ). Essentially, when we say we mean the set consisting of of positive examples of type (we hope that the abuse of notation will increase comprehension). Figure 3 employs a set-theoretic depiction of the sets of all positive and negative links, as well as their subsets for links restricted to a specific type , and divisions into train and test splits.
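The sampling and splitting procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function names, the toy triples, the rejection-sampling strategy, and the 0.5 train ratio are all hypothetical, and the domain and range of a relation are approximated by the heads and tails observed in its positive links.

```python
import random


def sample_negatives(positives, existing, rng, max_tries_per_link=100):
    """Rejection-sample up to len(positives) negative triples for one relation.

    Heads are drawn from the observed sources (domain) and tails from the
    observed targets (range); candidates present in the full graph are rejected.
    May return fewer negatives if the try budget is exhausted.
    """
    relation = positives[0][1]
    heads = sorted({h for h, _, _ in positives})
    tails = sorted({t for _, _, t in positives})
    negatives = set()
    tries = 0
    while len(negatives) < len(positives) and tries < max_tries_per_link * len(positives):
        tries += 1
        candidate = (rng.choice(heads), relation, rng.choice(tails))
        if candidate not in existing:
            negatives.add(candidate)
    return sorted(negatives)


def train_test_split(examples, train_ratio, rng):
    """Shuffle the examples and cut them at the given train ratio."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
# Hypothetical toy graph.
full_graph = [
    ("g1", "has_function", "f1"),
    ("g2", "has_function", "f2"),
    ("g3", "has_function", "f3"),
    ("g4", "has_function", "f4"),
    ("g1", "interacts_with", "g2"),
]
positives = [t for t in full_graph if t[1] == "has_function"]
negatives = sample_negatives(positives, set(full_graph), rng)
pos_train, pos_test = train_test_split(positives, 0.5, rng)
neg_train, neg_test = train_test_split(negatives, 0.5, rng)
print(len(pos_train), len(pos_test), len(neg_train), len(neg_test))
```

Rejection sampling avoids enumerating all possible negatives, which would be infeasible for relations with many sources and targets.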
By treating the problem of evaluation of the quality of the embeddings in this set-theoretic approach, we now define the following datasets:
a global retained graph on all relations – training corpus for unsupervised learning of global entity embeddings ,
– train examples for the binary classifier on ,
– test examples for the binary classifier .
We also note the following properties, which must hold and serve as validation criteria for the generation of train and test data. In particular,
local and global retained graphs must be subgraphs of the full graph,
the difference between the full and retained global graphs is the set of positive links that we use in the test set, i.e., the embeddings will be used to predict these positive links,
there should be no link shared between the train and test sets,
all generated negative links do not exist in the original full graph.
The last two properties reflect what is usually done in the literature during the generation of negative links, and in this work we also conform to them. However, in Section 4 we elaborate further on the issue of generating the set of negative examples so that it is disjoint from the set of positive examples.
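The four validation criteria listed above translate naturally into set operations. The sketch below is one possible reading of them; the function name, the returned keys, and the toy data are hypothetical, and triples are assumed to be hashable tuples.

```python
def check_splits(full, retained, train_pos, test_pos, train_neg, test_neg):
    """Evaluate the four validation criteria and report each as a boolean."""
    full, retained = set(full), set(retained)
    train = set(train_pos) | set(train_neg)
    test = set(test_pos) | set(test_neg)
    negatives = set(train_neg) | set(test_neg)
    return {
        "retained_is_subgraph": retained <= full,
        "removed_links_are_test_positives": full - retained == set(test_pos),
        "train_test_disjoint": train.isdisjoint(test),
        "negatives_absent_from_full_graph": negatives.isdisjoint(full),
    }


# Hypothetical toy data.
full = [("a", "r", "b"), ("b", "r", "c"), ("c", "r", "d"), ("a", "r", "d")]
test_pos = [("a", "r", "d")]
retained = [t for t in full if t not in test_pos]
train_pos = [("a", "r", "b"), ("b", "r", "c")]
train_neg = [("b", "r", "a"), ("d", "r", "a")]
test_neg = [("c", "r", "a")]

report = check_splits(full, retained, train_pos, test_pos, train_neg, test_neg)
print(report)

# A negative that actually exists in the full graph violates the last criterion.
bad_report = check_splits(full, retained, train_pos, test_pos, train_neg, [("c", "r", "d")])
print(bad_report["negatives_absent_from_full_graph"])
```

Running such checks after every data generation step makes silent leakage between splits easier to detect.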
2.0.2 Training neural embeddings
In this work we employ a fast and scalable unsupervised neural embedding model , which aims at learning embeddings of entities, each of which is described by a set of discrete features (bag-of-features) coming from a fixed-length dictionary. The model is trained by assigning a -dimensional vector to each of the discrete features in the set that we want to embed directly. Ultimately, the look-up matrix (the matrix of embeddings, i.e., latent vectors) is learned by minimizing the following loss function
In this loss function, we need to indicate the generator of positive entry pairs , and the generator of negative entities , similar to the -negative sampling strategy proposed by Mikolov et al. . In our setting, for each entity in the knowledge graph, is the generator of entities from the local or global retained graphs, and is the generator of a negative entity , such that . denotes that we minimize the objective function for the small subsets of elements drawn from and (mini-batching). The similarity function is task-dependent and should operate on -dimensional vector representations of the entities; in our case we use the standard Euclidean dot product. This model (and, implicitly, the embeddings) is trained with the StarSpace toolkit . The aforementioned embedding scheme is different from a multi-relational knowledge graph embedding task, since we do not require explicit embeddings for the relations (Section 1.0.3).
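For reference, the StarSpace paper (Wu et al.) states its training objective in the general form below. The symbols ($E^{+}$ for the positive pair generator, $E^{-}$ for the negative entity generator, $k$ for the number of sampled negatives, $L^{\mathrm{batch}}$ for the mini-batch loss, and $\mathrm{sim}$ for the similarity function) follow that paper and are assumed to correspond to the quantities described above.

```latex
\sum_{\substack{(a,b)\,\in\,E^{+} \\ b^{-}\,\in\,E^{-}}}
  L^{\mathrm{batch}}\bigl(\mathrm{sim}(a,b),\ \mathrm{sim}(a,b^{-}_{1}),\ \dots,\ \mathrm{sim}(a,b^{-}_{k})\bigr)
```

With the dot product as the similarity, $\mathrm{sim}(a,b)$ is simply the inner product of the two $d$-dimensional entity vectors.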
Please note that since we are learning embeddings for the entities from the retained graphs (some links are excluded), the algorithm may fail to learn an embedding for an entity. That is, suppose that during the generation of the retained graph none of the links of an entity are assigned to the retained graph ; then the algorithm will not learn an embedding , which will lead to a situation where all the pairs will be missing during training or testing of the binary classifier (depending on whether these pairs are assigned to the train or test sets). Obviously, the number of possible missing embeddings is inversely related to the parameter, i.e., the more information we include during the embedding learning phase, the fewer embeddings will be missed.
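The effect described above can be quantified with a short sketch. The helper names and the toy graph are hypothetical; the only assumption is that an entity receives an embedding if and only if it occurs in at least one retained link.

```python
def entities_of(graph):
    """All entities appearing as head or tail of some triple."""
    return {e for h, _, t in graph for e in (h, t)}


def missing_rate(examples, embedded_entities):
    """Percentage of examples with at least one entity lacking an embedding."""
    if not examples:
        return 0.0
    missing = sum(
        1 for h, _, t in examples
        if h not in embedded_entities or t not in embedded_entities
    )
    return 100.0 * missing / len(examples)


# Hypothetical toy graph: the last two links are held out of the retained graph.
full_graph = [("a", "r", "b"), ("b", "r", "c"), ("a", "r", "c"), ("c", "r", "d")]
retained_graph = full_graph[:2]
held_out = full_graph[2:]

embedded = entities_of(retained_graph)  # entity "d" never appears, so it has no embedding
rate = missing_rate(held_out, embedded)
print(f"{rate:.1f}% of held-out examples cannot be represented")
```

Reporting this percentage next to the classification score shows how many examples the score is actually based on.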
2.0.3 Binary operators for link representation
Based on the embeddings of the nodes of the graph, we can come up with different ways of representing a link between an entity and . This is usually achieved with a binary operator that combines the embeddings of two entities into one single representation of the link . Popular choices for this operator include operations that preserve the original dimension of the entity embeddings (e.g., element-wise sum or mean ), as well as operations that juxtapose the entity embeddings, such as concatenation . The definitions of these operators are given in Table 2; we use them in our experiments and evaluation.
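As an illustration, the three operators named above can be written as follows. These are the standard definitions of element-wise sum, element-wise mean, and concatenation on plain Python lists; the function names and example vectors are hypothetical, and the exact formulation in Table 2 may differ in notation.

```python
def op_sum(u, v):
    """Element-wise sum; output has the same dimension as the inputs."""
    return [a + b for a, b in zip(u, v)]


def op_mean(u, v):
    """Element-wise mean; output has the same dimension as the inputs."""
    return [(a + b) / 2 for a, b in zip(u, v)]


def op_concat(u, v):
    """Concatenation; output has twice the dimension of the inputs."""
    return list(u) + list(v)


# Hypothetical 2-dimensional entity embeddings.
u = [1.0, 2.0]
v = [3.0, 5.0]

print(op_sum(u, v))
print(op_mean(u, v))
print(op_concat(u, v))

# Sum and mean are symmetric in their arguments, concatenation is not.
print(op_sum(u, v) == op_sum(v, u), op_concat(u, v) == op_concat(v, u))
```

Because concatenation is order-sensitive, it can distinguish a link from its reverse, which sum and mean cannot.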
2.0.4 Repeated random sub-sampling validation
To quantify confidence in the trained embeddings, we perform repeated random sub-sampling validation for each classifier . That is, for each relation we generate times: a retained graph corpus for unsupervised learning of entity embeddings, and train and test splits of positive and negative examples. Link prediction is then treated as a binary classification task with a logistic regression classifier defined on the link representation produced by the binary operator (e.g., ). The performance of the classifier is measured with the standard performance measurement based on false positive rate to true positive rate curves. In particular, we report the F1 score (i.e., F-measure with equal weights for precision and recall).
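The validation loop can be sketched as follows. This is a simplified illustration with several assumptions: a nearest-centroid rule stands in for the logistic regression classifier so that the example needs only the standard library, the link representations are hypothetical 2-dimensional vectors, and only the train/test splits are resampled (in the procedure described above, the retained graph and the embeddings are regenerated in every repetition as well).

```python
import random


def f1_score(y_true, y_pred):
    """F-measure with equal weights for precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_centroid(X, y):
    """Stand-in classifier (assumption): predict the class of the nearest centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label == 0])

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return 1 if d_pos <= d_neg else 0

    return predict


def repeated_subsampling(positives, negatives, fit, n_repeats, train_ratio, seed=0):
    """Resample train/test splits of positives and negatives; return one F1 per repeat."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train, test = [], []
        for label, group in ((1, positives), (0, negatives)):
            shuffled = list(group)
            rng.shuffle(shuffled)
            cut = int(train_ratio * len(shuffled))
            train += [(x, label) for x in shuffled[:cut]]
            test += [(x, label) for x in shuffled[cut:]]
        predict = fit([x for x, _ in train], [label for _, label in train])
        y_true = [label for _, label in test]
        y_pred = [predict(x) for x, _ in test]
        scores.append(f1_score(y_true, y_pred))
    return scores


# Hypothetical, well-separated link representations.
positives = [[1 + 0.1 * i, 1 - 0.1 * i] for i in range(10)]
negatives = [[-1 - 0.1 * i, -1 + 0.1 * i] for i in range(10)]

scores = repeated_subsampling(positives, negatives, fit_centroid, n_repeats=10, train_ratio=0.8)
print(sum(scores) / len(scores))
```

Any classifier exposing the same fit-then-predict shape, such as a logistic regression model, could be passed in place of `fit_centroid`.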
In this section we report on the possibility of training the embeddings once on the retained graph on all relations , and evaluating it separately on all train and test examples for each relation . We evaluate the quality of local and global embeddings and study the influence of the choice of the binary operator (e.g., sum, concatenation) used in the logistic regression classifier , as well as the size (controlled by ) of the retained local and global graphs, applied to the WN11 knowledge graph. All of our results are presented as averages of 10 repeated random sub-sampling validations (Section 2.0.1
and number of epochs is 10. The neural embeddings are trained with the StarSpace toolkit. Classification results are obtained with the scikit-learn Python library , and the grouping of data and its statistical analysis are performed with Pandas . All of our experiments were performed on a modern desktop PC with a quad-core Intel i7 CPU (clocked at 4 GHz) and 32 GB of RAM.
Table 3 reports the averaged cross-validation scores of the binary classifiers per relation . These scores are split by increasing amount of information (controlled by ) available during the unsupervised learning phase of the neural embeddings. This models the realistic scenario where links represent ever-growing knowledge about the domain, and where we want to predict new links that might emerge in the future. Overall, the unsupervised training phase seems to be quite robust to limited amounts of information (i.e., training embeddings with only 20% of links vs. training with 80% of available links), as the average F-measure score over all relations seems to be affected only slightly in both local and global settings. We can notice that the concatenation binary operator outperforms the other link representations, and all its predictions lie within the range. This observation might be explained by the fact that the output of the concatenation operator has twice as many dimensions to represent information (i.e., ). Besides, unlike sum and mean, it naturally encodes the directionality of links (i.e., asymmetric relations ).
Depending on the connectivity of the knowledge graph, retaining only some ratio (i.e., ) of the available links may have severe consequences for the number of missing embeddings for entities with very few incoming and outgoing links. In Table 4 we group F-measure scores for the concatenation operator and, additionally, report the percentage of missing embeddings in both train and test dataset splits for each relation . As expected, pre-training the embeddings on only 20% of the links of all relations misses many more entities than the local setting , where we train on 20% of the links only for a specific relation and on 100% of the links for : 29.38% vs. 3.7% of missed training examples for the global and local settings respectively and, analogously, 59.47% vs. 7.8% for test examples. This makes a big difference in the confidence we can place in a link prediction binary classifier trained locally or globally, even if the final F-measure scores are quite comparable. However, if we consider then the percentages of missing examples for the train () and test () splits in the global setting are tolerable, with and for the local approach respectively. An obvious advantage of the global approach is scalability in both time and space. We train and store only one neural model (embedding vectors are stored implicitly in the weight matrix of the hidden layer) for the global approach, whereas for the local approach we need to train and store as many models as there are relations in the knowledge graph. For the WN11 KG, if we consider , we need on average sec to train one global model, and on average seconds to train a model per relation for the local approach (averaged over 10 repeated random sub-sampling validations). Since we have 11 relations, we thus need seconds vs. seconds for and 10 epochs. The time needed to train these models will obviously grow as we increase the embedding dimension and the number of epochs.
Spacewise, the global approach needs MB and the local MB. As in the case of time complexity, the memory needed to store bigger models () will increase.
4.0.1 Related work
The most focused study of the link prediction problem for large-scale (unlabelled) graph-based data mining has, to the best of our knowledge, been conducted in . The focus of that work is on negative sample generation, and the authors emphasize that the link prediction problem is a hugely imbalanced binary prediction task, where the number of negative samples is orders of magnitude higher than the number of positive examples. In the bioinformatics community, Alshahrani et al.  proposed to circumvent the problem of imbalanced classes for the binary classification problem by considering negative links that have a biological meaning, thus discarding many potential negative links that are highly improbable biologically. They do this by restricting all the negative links to have the same domain and range as the positive links (i.e., they do not consider highly improbable links of type has function ). Link prediction for the entire knowledge graph is then treated as a set of binary classification tasks (one for each relation). Both of these works agree that link prediction should be treated as a binary classification task. Some works have focused their attention on the strategies for data splitting that produce biased train and test examples, such that implicit information from the test set may leak into the train set [17, 18]. In , the authors show that the random splits for the common knowledge graph evaluation benchmarks (Wordnet  and Freebase ) may bias the classification results for the symmetric relations. Solutions for unbiased evaluation include curated data splits where no such information leakage is present. Kadlec et al.  have mentioned that fair optimization of hyperparameters for competing approaches should be considered, as some of the reported KG completion results are significantly lower than what they potentially could be. Evaluation of the machine learning tasks that use link information from the knowledge graphs and neural embeddings is explored in [20, 21].
As a general remark, in most of the works, link prediction is evaluated with the mean rank metric and its variants from the information retrieval community (e.g., mean reciprocal rank, top results, mean average precision), which we believe is not the most suitable metric for link prediction. As we pointed out in Section 1.0.1, asking for the probability of a new link is different from asking whether the entity belongs to the set of the most probable entities, among all existing entities in the knowledge graph, to be connected to with the relation .
We believe that, in the literature, the evaluation of the quality of the embeddings for the link prediction task has received much less attention than the methodologies for training the embeddings. While most of the works do perform extensive evaluation of their embedding approaches, the exact steps and implications of negative sample generation, random train and test data splits, and the amount of information involved in the unsupervised learning are either not detailed well enough for easy and fair reproducibility of the results, or are presented as a secondary remark.
4.0.2 Negative example generation for link prediction
In general, the link prediction problem for knowledge graphs is different from other classification problems where positive and negative examples are well defined. Obtaining a representative test set with a prototypical distribution is often not trivial , and what is usually done is to randomly remove some links, which are then used as test positives. Moreover, during the generation of negative links for both train and test sets, we impose that no negative link appears as a training positive or test positive. We therefore implicitly leak information about the test positives when we generate train negatives. In other words, during the generation of negative links we should account for the possibility that a generated link might actually turn out to be true, and our binary classifiers should be robust and generalize well to these realistic situations. In future work we would like to study further the implications of negative example generation.
In this paper we focus on link prediction for knowledge graphs, treated as a binary classification problem on entity embeddings. We provide our first results of the evaluation of different strategies for training neural embeddings of entities on the WN11 knowledge graph. These early findings lead us to suggest that: i) if the number of multi-relational connections between nodes is low compared to the total number of connections, then the graph can be flattened and treated as an unlabelled graph (i.e., no two nodes are connected with more than one link), provided that we disambiguate the links with separate binary classifiers for each relation ; ii) training embeddings once globally and using them in binary classifiers for each relation gives classification performance (F-measure averaged over all relations on WN11) comparable to that of embeddings trained locally for each relation separately. The global approach to training embeddings is more scalable, as it requires only one neural model to represent all entities, as opposed to having as many models as there are relations in the knowledge graph. The confidence in our results is, of course, proportionate to the amount of information (percentage of all available links) that we include in the unsupervised training. Depending on the incoming and outgoing degrees of the entities in the graph, the global approach may fail to embed many entities. Thus, the global approach is less robust to limited availability of information than the local approach. Finally, we make our code for the evaluation pipeline for link prediction tasks open source , and hope that it will trigger a standardized benchmark for the evaluation of knowledge graph embeddings.
- [1] Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. (2013)
- [2] Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1) (jan 2016) 11–33
- [3] Kadlec, R., Bajgar, O., Kleindienst, J.: Knowledge base completion: Baselines strike back. arXiv:1705.10744 [cs] (may 2017)
- [4] Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., Weston, J.: Starspace: Embed all the things! arXiv (sep 2017)
- [5] Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. arXiv:1705.08039 [cs, stat] (may 2017)
- [6] Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, New York, New York, USA, ACM Press (aug 2014) 701–710
- [7] Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. KDD 2016 (aug 2016) 855–864
- [8] Garcia-Gasulla, D., Ayguadé, E., Labarta, J., Cortés, U.: Limitations and alternatives for the evaluation of large-scale link prediction. arXiv (nov 2016)
- [9] Chamberlain, B.P., Clough, J., Deisenroth, M.P.: Neural embeddings of graphs in hyperbolic space. arXiv:1705.10359 [cs, stat] (may 2017)
- [10] Alshahrani, M., Khan, M.A., Maddouri, O., Kinjo, A.R., Queralt-Rosinach, N., Hoehndorf, R.: Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics 33(17) (sep 2017) 2723–2730
- [11] Agibetov, A., Samwald, M.: Fast and scalable learning of neuro-symbolic representations of biomedical knowledge. arXiv (apr 2018)
- [12] Miller, G.A.: WordNet: a lexical database for english. Commun ACM 38(11) (nov 1995) 39–41
- [13] Agibetov, A., Samwald, M.: Github repository https://github.com/plumdeq/neuro-kglink, Last accessed 2018-05-31.
- [14] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. arXiv (oct 2013)
- [15] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in python. Journal of Machine Learning Research (2011)
- [16] McKinney, W.: Data structures for statistical computing in python. In van der Walt, S., Millman, J., eds.: Proceedings of the 9th Python in Science Conference. (2010) 51–56
- [17] Toutanova, K., Lin, V., Yih, W.t., Poon, H., Quirk, C.: Compositional learning of embeddings for relation paths in knowledge base and text. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Stroudsburg, PA, USA, Association for Computational Linguistics (2016) 1434–1444
- [18] Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings. (jul 2017)
- [19] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: A collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data - SIGMOD ’08, New York, New York, USA, ACM Press (jun 2008) 1247
- [20] Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In Groth, P., Simperl, E., Gray, A., Sabou, M., Krötzsch, M., Lecue, F., Flöck, F., Gil, Y., eds.: The semantic web – ISWC 2016. Volume 9982 of Lecture notes in computer science. Springer International Publishing, Cham (2016) 186–194
- [21] Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Global rdf vector space embeddings. In d’Amato, C., Fernandez, M., Tamma, V., Lecue, F., Cudré-Mauroux, P., Sequeda, J., Lange, C., Heflin, J., eds.: The semantic web – ISWC 2017. Volume 10587 of Lecture notes in computer science. Springer International Publishing, Cham (2017) 190–207