Embedding Cardinality Constraints in Neural Link Predictors

December 16, 2018 · Emir Muñoz, et al. (NUI Galway, UCL)

Neural link predictors learn distributed representations of the entities and relations in a knowledge graph. They are remarkably powerful in the link prediction and knowledge base completion tasks, mainly because the learned representations capture important statistical dependencies in the data. Recent work in the area has focused on either designing new scoring functions or incorporating extra information into the learning process to improve the representations. Yet the representations are mostly learned from the observed links between entities, ignoring commonsense or schema knowledge associated with the relations in the graph. A fundamental aspect of the topology of relational data is cardinality information, which bounds the number of predictions for a relation between a minimum and a maximum frequency. In this paper, we propose a new regularisation approach that incorporates relation cardinality constraints into any existing neural link predictor without affecting its efficiency or scalability. Our regularisation term aims to impose boundaries on the number of predictions with high probability, thus structuring the embedding space to respect commonsense cardinality assumptions and yielding better representations. Experimental results on Freebase, WordNet and YAGO show that, given suitable prior knowledge, the proposed method positively impacts the predictive accuracy of downstream link prediction tasks.


1. Introduction

Cognitive development research indicates that children learn to answer the cardinality-related question "How many?" at around 3.5 years of age (Wynn, 1990). This ability helps us to recognise physical and abstract things by counting. For example, a hand commonly has five fingers, a car has four wheels, and a meeting has at least two participants. This kind of commonsense knowledge is not easy for machines to acquire, even in contexts where it can be useful, such as Question Answering, Web Search, and Information Extraction (Tandon et al., 2017).

One fundamental application area for cardinality information is the completion of Knowledge Graphs (KGs), graph-structured knowledge bases where factual knowledge is represented in the form of relationships between entities. For instance, consider Freebase (Bollacker et al., 2007), the core of the Google Knowledge Graph project, where, as reported by Dong et al. (2014), 71% of the people described have no known place of birth. By leveraging cardinality information about the bornIn relationship (i.e., each person must have a place of birth), we can quantitatively assess the degree of incompleteness in Freebase and focus resources on predicting a single place of birth for each person. Yet link prediction models aimed at identifying missing facts in KGs do not consider such commonsense or schema knowledge, yielding potentially inconsistent and inaccurate predictions.

In this work, we focus on a specific class of link prediction models, namely Neural Link Predictors (Nickel et al., 2016a). Such models learn low-dimensional distributed representations—also referred to as embeddings—of all entities and relations in a knowledge graph. Neural link predictors are currently the state-of-the-art approach to tasks such as link prediction (Bordes et al., 2013; Yang et al., 2015; Trouillon et al., 2016; Ding et al., 2018), entity disambiguation and entity resolution (Bordes et al., 2014), taxonomy extraction (Nickel et al., 2012; Nickel and Kiela, 2017), and probabilistic question answering (Krompaß et al., 2014).

Recently, research has focused mainly on designing new scoring functions and on incorporating additional background knowledge during the learning process. We refer readers to (Nickel et al., 2016a; Wang et al., 2017) for a recent overview of this topic.

In this paper, we address the problem of incorporating prior knowledge in the form of relation cardinality information into state-of-the-art neural link predictors. For instance, we want to encode prior knowledge in the form of cardinality statements such as “a person should have at most two parents” or “a patient should be taking between 1 and 5 drugs at a time” in neural link prediction models. Such prior knowledge can be provided by domain experts, or automatically extracted from data (Galárraga et al., 2017; Muñoz and Nickles, 2017). It is expected that such cardinality constraints will be satisfied by both the facts in the knowledge graph and algorithms analysing the graph, such as link predictors. We believe that these constraints can impose commonsense knowledge upon the structure of the embedding space, thus helping us to learn better representations.

Table 1. Top-5 predictions (among 24 results with high probability) for the hasParent relation with Edgar Allan Poe given by DistMult (Yang et al., 2015) on the FB13 dataset (Bordes et al., 2011). The five top-ranked triples received probabilities of 0.989, 0.979, 0.974, 0.890, and 0.889.

Cardinality constraints are one of the most important constraints in conceptual modelling (Olivé, 2007, Chapter 4), as they make the topology of the data explicit. However, existing neural link prediction models are not designed to incorporate them for learning better representations and more accurate models.

Example 1.

One may expect that when predicting the parents (represented by the relation hasParent) of the entity Edgar Allan Poe, a model will predict at most two parents, preferably Eliza Poe and David Poe Jr. To illustrate this, let us analyse the actual predictions of a state-of-the-art neural link prediction model, DistMult (Yang et al., 2015), on the Freebase FB13 dataset (Bordes et al., 2011), which contains entities of the Freebase type deceased people and their relations. Table 1 shows the top-5 predicted parents for Edgar Allan Poe. As we can see, all predictions have a high probability (24 entities in total were scored above a high threshold), albeit some predictions are incorrect.

Nevertheless, the evaluation results of our example model appear positive because the evaluation protocol of link prediction models is based on ranking metrics, where correct predictions (e.g., eliza_poe) only need to be ranked higher than incorrect ones (e.g., benjamin_franklin).

To address this problem, in this paper we propose an efficient approach for embedding the notion of cardinality in neural link prediction models, without affecting their efficiency and scalability. The proposed approach is based on a novel regularisation term that constrains the number of predictions for a given relation. Briefly, our idea is to penalise the model when its predictions violate a cardinality constraint, expressed as lower and upper bounds on the cardinality of a given relation type. By doing so, the notion of cardinality of a relation is captured during training, yielding more accurate link prediction models that comply with available prior knowledge (Wang et al., 2015) and learn better representations for entities and relations in the knowledge base.

Organisation. The remainder of this paper is organised as follows. First we present the definitions of knowledge graphs and neural link prediction models in Section 2. Next we present the concept of relation cardinality constraint for knowledge graphs in Section 3. In Section 4, we introduce a cardinality regularisation term which allows neural link predictors to leverage available cardinality constraints. We evaluate the application of our regularisation term over different datasets and models in Section 5. Section 6 briefly discusses the existing works in link prediction over knowledge graphs. Finally, Section 7 concludes this paper.

2. Background

We start by introducing the fundamentals of knowledge graphs and neural link predictors.

Definition 1 (Knowledge Graphs).

A knowledge graph is a graph representation of a knowledge base. Let $\mathcal{E}$ be the set of all entities, and $\mathcal{R}$ the set of all relation types (predicates). We denote by $\mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ a knowledge graph comprising a set of facts or triples $(s, p, o)$, where $s, o \in \mathcal{E}$ and $p \in \mathcal{R}$. We refer to $s$ and $o$ as the subject and object entities, and to $p$ as the relation of a triple. Let $N = |\mathcal{E}|$ and $M = |\mathcal{R}|$ be the number of entities and relations, respectively.

The goal of link prediction models is to learn a scoring function $\phi : \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$ that, given a triple $(s, p, o)$, returns its corresponding score $\phi(s, p, o)$. Such a score can then be used for ranking missing triples according to the likelihood that the corresponding facts hold true.

Definition 2 (Neural Link Predictors).

Neural link prediction models (Nickel et al., 2016a; Wang et al., 2017) can be interpreted as neural networks consisting of an encoding layer and a scoring layer. Given a triple $(s, p, o)$, the encoding layer maps the entities $s$ and $o$ to their $k$-dimensional distributed representations $\mathbf{e}_s$ and $\mathbf{e}_o$. Then, the scoring layer computes the likelihood of the triple based on a relation-dependent function $\phi_p$. Henceforth, the scoring function is defined as $\phi(s, p, o) = \phi_p(\mathbf{e}_s, \mathbf{e}_o)$, where $\mathbf{e}_s, \mathbf{e}_o \in \mathbb{R}^k$ and $\phi_p : \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$.

A neural link predictor with parameters $\Theta$ defines a conditional probability distribution over the truth value of a triple $(s, p, o)$ (Nickel et al., 2016a):

(1)   $P(y_{spo} = 1 \mid \Theta) = \sigma(\phi(s, p, o))$

where $y_{spo} \in \{0, 1\}$ is the truth label of the triple, $\Theta$ denotes the set of all entity and relation embeddings (the parameters of the model), $\sigma(\cdot)$ is the standard logistic function, and $\phi$ denotes the model's scoring function (cf. Table 2). Most models consider the $k$-dimensional embeddings as real-valued, i.e. $\mathbf{e} \in \mathbb{R}^k$; however, there are exceptions like ComplEx (Trouillon et al., 2016), where $\mathbf{e} \in \mathbb{C}^k$.

A neural link prediction model is trained by minimising a loss function defined over a target knowledge graph $\mathcal{G}$, usually using stochastic gradient descent. Since knowledge graphs only contain positive examples (i.e. facts), a way to provide negative learning examples—motivated by the Local Closed World Assumption (LCWA) (Dong et al., 2014)—is to generate negative examples by corrupting the triples in the graph (Rendle et al., 2009; Bordes et al., 2013; Nickel et al., 2016a). Given a (positive) triple $(s, p, o)$, corrupted triples (negative examples) can be generated by replacing either the subject or the object with a random entity sampled uniformly from $\mathcal{E}$ (Bordes et al., 2011). Formally, given a positive example $t = (s, p, o)$, negative examples are sampled from the set of possible corruptions of $t$, namely $\mathcal{C}(t) = \{(\tilde{s}, p, o) \mid \tilde{s} \in \mathcal{E}\} \cup \{(s, p, \tilde{o}) \mid \tilde{o} \in \mathcal{E}\}$.
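
To make the corruption process concrete, here is a minimal Python sketch; the triple representation and helper names are our own illustration, not the authors' code:

```python
import random

def corrupt(triple, entities, num_negatives=2):
    """Generate negative examples for a positive triple (s, p, o) by
    replacing either the subject or the object with a random entity,
    following the Local Closed World Assumption."""
    s, p, o = triple
    negatives = []
    while len(negatives) < num_negatives:
        e = random.choice(entities)
        # Corrupt the subject or the object with equal probability.
        candidate = (e, p, o) if random.random() < 0.5 else (s, p, e)
        if candidate != triple:
            negatives.append(candidate)
    return negatives

# Example usage on a toy graph.
entities = ["edgar_allan_poe", "eliza_poe", "david_poe_jr", "boston"]
print(corrupt(("edgar_allan_poe", "hasParent", "eliza_poe"), entities))
```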

Let $\mathcal{T}^+$ be the set of positive examples, and $\mathcal{T}^-$ the set of negatives generated accordingly with the corruption function $\mathcal{C}$. Training consists of learning the parameters $\Theta$ that best explain $\mathcal{T}^+$ and $\mathcal{T}^-$ according to Eq. 1. For that, models such as TransE (Bordes et al., 2013), DistMult (Yang et al., 2015) and HolE (Nickel et al., 2016b) minimise a pairwise margin loss:

(2)   $\mathcal{L}(\Theta) = \sum_{t^+ \in \mathcal{T}^+} \sum_{t^- \in \mathcal{C}(t^+)} \max\{0,\; \gamma + \phi(t^-) - \phi(t^+)\}$

where $t^+$ is a positive example, $t^-$ is a negative one, and $\gamma > 0$ is the margin hyperparameter. The entity embeddings are also constrained to unit norm, i.e. $\lVert \mathbf{e} \rVert_2 = 1$. Other models like ComplEx (Trouillon et al., 2016) instead minimise the logistic loss:

$\mathcal{L}(\Theta) = \sum_{(t, y) \in \mathcal{T}^+ \cup\, \mathcal{T}^-} \log\left(1 + \exp(-y\, \phi(t))\right)$

where $t$ is an example (triple), and $y \in \{-1, +1\}$ is the label (negative or positive) associated with the example.


Model	Scoring Function $\phi_p(\mathbf{e}_s, \mathbf{e}_o)$	Parameters

ER-MLP	$\mathbf{w}^\top g(\mathbf{W}\, [\mathbf{e}_s; \mathbf{e}_p; \mathbf{e}_o])$	$\mathbf{e}_s, \mathbf{e}_p, \mathbf{e}_o \in \mathbb{R}^k$; $\mathbf{W}$, $\mathbf{w}$
DistMult	$\langle \mathbf{e}_s, \mathbf{e}_p, \mathbf{e}_o \rangle$	$\mathbf{e}_s, \mathbf{e}_p, \mathbf{e}_o \in \mathbb{R}^k$
ComplEx	$\mathrm{Re}\left(\langle \mathbf{e}_s, \mathbf{e}_p, \overline{\mathbf{e}_o} \rangle\right)$	$\mathbf{e}_s, \mathbf{e}_p, \mathbf{e}_o \in \mathbb{C}^k$

Table 2. Scoring functions of three state-of-the-art knowledge graph embedding models, where $g(\cdot)$ is a non-linearity, $[\cdot;\cdot;\cdot]$ denotes concatenation, $\langle \cdot, \cdot, \cdot \rangle$ the tri-linear dot product, and $\overline{\mathbf{e}}$ the complex conjugate.
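
To illustrate the scoring functions in Table 2, the following is a small NumPy sketch of DistMult and ComplEx together with the logistic link of Eq. 1; the randomly initialised embeddings are stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # embedding dimension

def distmult_score(e_s, e_p, e_o):
    """DistMult: tri-linear dot product <e_s, e_p, e_o>."""
    return np.sum(e_s * e_p * e_o)

def complex_score(e_s, e_p, e_o):
    """ComplEx: Re(<e_s, e_p, conj(e_o)>) with complex-valued embeddings."""
    return np.real(np.sum(e_s * e_p * np.conj(e_o)))

def probability(score):
    """Eq. 1: truth probability via the standard logistic function."""
    return 1.0 / (1.0 + np.exp(-score))

e_s, e_p, e_o = rng.normal(size=(3, k))                              # real-valued
c_s, c_p, c_o = rng.normal(size=(3, k)) + 1j * rng.normal(size=(3, k))  # complex
print(probability(distmult_score(e_s, e_p, e_o)))
print(probability(complex_score(c_s, c_p, c_o)))
```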

3. Relation Cardinalities

A relation type can have associated cardinality bounds, which restrict the number of object values that a subject can have.

Definition 3 (Relation Cardinality Bound).

Let $(l_p, u_p)$ be a cardinality bound for the relation $p \in \mathcal{R}$, where $l_p$ denotes the lower bound and $u_p$ the upper bound of the cardinality, s.t. $0 \le l_p \le u_p$ (Muñoz and Nickles, 2017). A knowledge graph $\mathcal{G}$ satisfies a cardinality bound $(l_p, u_p)$ for relation $p$ iff

$l_p \le \mathrm{card}(s, p) \le u_p$ for every subject $s$ appearing with $p$,

where $\mathrm{card}(s, p) = |\{o \in \mathcal{E} : (s, p, o) \in \mathcal{G}\}|$ is the number of triples with $s$ as subject and $p$ as relation (Muñoz and Nickles, 2017).
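
This definition reduces to counting objects per subject; a minimal sketch, assuming triples are stored as (subject, relation, object) tuples:

```python
from collections import Counter

def cardinalities(triples, relation):
    """card(s, p): number of triples with subject s and the given relation."""
    counts = Counter()
    for s, p, o in triples:
        if p == relation:
            counts[s] += 1
    return counts

def satisfies_bound(triples, relation, lower, upper):
    """True iff every subject observed with the relation respects
    the cardinality bound (lower, upper)."""
    return all(lower <= c <= upper
               for c in cardinalities(triples, relation).values())

kg = [("edgar_allan_poe", "hasParent", "eliza_poe"),
      ("edgar_allan_poe", "hasParent", "david_poe_jr")]
print(satisfies_bound(kg, "hasParent", 0, 2))  # True
```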

Example 2.

Given a cardinality bound $(l_{\mathrm{hasParent}}, u_{\mathrm{hasParent}}) = (0, 2)$, encoding the constraint "a person should have at most two parents", we would like to ensure that the embeddings learned by a neural link predictor yield predictions for the hasParent relation within the boundaries. In other words, we want the sum of probabilities over all possible parent entities of Edgar Allan Poe to lie between zero and two. (Note that by considering a lower bound equal to zero, we can account for the possible incompleteness of the KG.) We express this constraint over the triples $(\mathrm{edgar\_allan\_poe}, \mathrm{hasParent}, o)$ as:

(3)   $0 \le \sum_{o \in \mathcal{E}} P(y_{(\mathrm{edgar\_allan\_poe},\, \mathrm{hasParent},\, o)} = 1 \mid \Theta) \le 2$

where the conditional probabilities are given by the neural link prediction model.

The term in Eq. 3 expresses a supervision signal, not based on labelled data, that can be used as input to the training of neural link prediction models. It is worth mentioning that such cardinality boundaries can be provided by experts, gathered from the literature (Mirza et al., 2017), or extracted from knowledge bases (Muñoz and Nickles, 2017; Galárraga et al., 2017).

4. Regularisation Based on Cardinality

In this section, we propose an approach to incorporate cardinality bounds in the training of neural link prediction models. Specifically, we propose to leverage the available cardinality bounds, expressed as in Eq. 3, to define a regularisation term that encourages models to respect the available cardinality constraints.

Let $\mathcal{C} = \{(l_p, u_p) \mid p \in \mathcal{R}\}$ be the set of cardinality constraints for each relation in a given knowledge graph $\mathcal{G}$, where $l_p$ and $u_p$ are the lower and upper bound for relation $p$, respectively.

Given $p \in \mathcal{R}$ and $s \in \mathcal{E}$, let $\mathcal{T}_{s,p} = \{(s, p, o) \mid o \in \mathcal{E}\}$ be the set of all possible triples with relation $p$ and subject $s$, where the object is selected from $\mathcal{E}$. Following our toy example, assume that $p$ denotes the relation hasParent, and $s$ denotes the entity edgar_allan_poe. Hence, we can take the set of possible triples $\mathcal{T}_{s,p}$ to define the following hard constraint on the conditional probability of the triples in $\mathcal{T}_{s,p}$:

(4)   $l_p \le \sum_{(s,p,o) \in \mathcal{T}_{s,p}} P(y_{spo} = 1 \mid \Theta) \le u_p$

However, the inequality constraint in Eq. 4 is impractical to incorporate directly in neural link predictors.

In this work, we propose a continuous relaxation of the constraint in Eq. 4 into a soft constraint, by defining a continuous and differentiable loss function that penalises its violations. Specifically, we define a function that is strictly positive if the cardinality constraint for a given entity and relation is violated, and zero otherwise. Given a cardinality constraint $(l_p, u_p)$, the function $\mathcal{R}_{s,p}(\Theta)$ (or $\mathcal{R}_{s,p}$ for simplicity) is defined as follows:

(5)   $\mathcal{R}_{s,p} = \max\{0,\; l_p - \Phi_{s,p}\} + \max\{0,\; \Phi_{s,p} - u_p\}$, with $\Phi_{s,p} = \sum_{(s,p,o) \in \mathcal{T}_{s,p}} P(y_{spo} = 1 \mid \Theta)$.

Figure 1 shows the values of $\mathcal{R}_{s,p}$ (Eq. 5) based on $\Phi_{s,p}$ and a cardinality bound $(l_p, u_p)$. Notice that in the general case where the upper bound corresponds to $u_p = \infty$ and the lower bound to $l_p = 0$, the loss vanishes.

Figure 1. Regularisation term based on the bounds of a cardinality constraint $(l_p, u_p)$: the term is zero while the relation cardinality is valid, and grows as $\Phi_{s,p}$ violates the lower or the upper bound.

Therefore, we define a cardinality-regularised objective function, denoted by $\mathcal{L}_{\mathcal{C}}$, for neural link prediction models:

(6)   $\mathcal{L}_{\mathcal{C}}(\Theta) = \mathcal{L}(\Theta) + \lambda \sum_{s \in \mathcal{E}} \sum_{p \in \mathcal{R}} \mathcal{R}_{s,p}$

where $\lambda \ge 0$ weights the relative contribution of the regularisation term, and $\mathcal{L}$ can be either the pairwise ranking loss or the logistic loss. The regularised loss in Eq. 6 can be minimised using stochastic gradient descent (SGD) (Robbins and Monro, 1951) in mini-batch mode, as outlined in Algorithm 1.
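
A minimal sketch of Eq. 5 and Eq. 6, assuming the model provides the predicted probabilities for all triples (s, p, ·); the function and variable names are ours:

```python
import numpy as np

def cardinality_penalty(probs, lower, upper):
    """Eq. 5: hinge penalty on the predicted cardinality Phi_{s,p}, the sum
    of probabilities over all triples (s, p, .). Zero inside [lower, upper],
    growing linearly outside, as sketched in Figure 1."""
    phi = float(np.sum(probs))
    return max(0.0, lower - phi) + max(0.0, phi - upper)

def regularised_loss(base_loss, penalties, lam):
    """Eq. 6: original loss plus the weighted sum of cardinality penalties."""
    return base_loss + lam * sum(penalties)

# Toy example: the five hasParent probabilities from Table 1, bound (0, 2).
probs = [0.989, 0.979, 0.974, 0.890, 0.889]
print(cardinality_penalty(probs, 0, 2))  # 2.721 -> the upper bound is violated
```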

Although our approach considers both upper and lower bounds, the latter cannot be meaningfully imposed in all cases. For instance, given a constraint with a non-zero lower bound on a hasSpouse relation, the regularisation term can yield inconsistent results if the knowledge graph is incomplete and does not contain the spouse link of every person. In such cases, a zero lower bound can be used to account for the incompleteness of the knowledge graph.

Our approach is intuitive and easy to implement for any neural link prediction model. However, it is limited by the cost of computing the sum in Eq. 4: the set $\mathcal{T}_{s,p}$ grows linearly with the number of entities, and in large KGs it becomes too expensive to compute the sum of probabilities exactly. In the following subsections, we propose to use sampling techniques to overcome this problem by approximating the sum of probabilities.

Require: observed facts $\mathcal{T}^+$, number of epochs $\tau$, initial learning rate $\eta$
Ensure: optimal model parameters $\Theta$ (see (Nickel et al., 2016a))
1: initialise entity embeddings $\mathbf{e}$ and relation embeddings $\mathbf{r}$ according to (Glorot and Bengio, 2010)
2: for $i = 1, \ldots, \tau$ do
3:     $\mathcal{B} \leftarrow$ sample a batch of positive triples from $\mathcal{T}^+$   ▷ build batch for training
4:     $\mathcal{N} \leftarrow \emptyset$
5:     for $t^+ \in \mathcal{B}$ do   ▷ sample negative examples
6:         $\mathcal{N} \leftarrow \mathcal{N} \cup \mathrm{sample}(\mathcal{C}(t^+))$
7:     end for
8:     $g \leftarrow \nabla_{\Theta}\, \mathcal{L}_{\mathcal{C}}(\Theta)$ computed using $\mathcal{B}$ and $\mathcal{N}$   ▷ gradient of the loss function
9:     $\Theta \leftarrow \Theta - \eta\, g$   ▷ model parameter update via gradient descent
10:    $\mathbf{e} \leftarrow \mathbf{e} / \lVert \mathbf{e} \rVert_2$ for every entity embedding   ▷ projection step normalising all entity embeddings
11: end for
12: return $\Theta$
Algorithm 1 Learning the model parameters via projected SGD
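
As a concrete illustration of Algorithm 1, below is a minimal TensorFlow 2 sketch of one training step with a DistMult scorer and the logistic loss. This is our own reconstruction for illustration (toy sizes, and the cardinality term of Eq. 6 would simply be added to `loss`), not the authors' code:

```python
import tensorflow as tf

N, M, k = 1000, 20, 50  # entities, relations, embedding dimension (toy sizes)
ent = tf.Variable(tf.random.uniform([N, k], -0.1, 0.1))
rel = tf.Variable(tf.random.uniform([M, k], -0.1, 0.1))
opt = tf.keras.optimizers.Adagrad(learning_rate=0.1)

def score(s, p, o):
    """DistMult score <e_s, e_p, e_o> for index tensors s, p, o."""
    return tf.reduce_sum(
        tf.gather(ent, s) * tf.gather(rel, p) * tf.gather(ent, o), axis=-1)

def train_step(s, p, o, y):
    """One step on the logistic loss, followed by the projection step
    that normalises all entity embeddings to unit norm."""
    with tf.GradientTape() as tape:
        # softplus(-y * score) == log(1 + exp(-y * score))
        loss = tf.reduce_sum(tf.math.softplus(-y * score(s, p, o)))
    grads = tape.gradient(loss, [ent, rel])
    opt.apply_gradients(zip(grads, [ent, rel]))
    ent.assign(tf.math.l2_normalize(ent, axis=-1))  # projection step
    return loss

# Toy usage: one positive and one corrupted triple with labels +1 / -1.
s = tf.constant([0, 0]); p = tf.constant([3, 3]); o = tf.constant([1, 7])
y = tf.constant([1.0, -1.0])
print(float(train_step(s, p, o, y)))
```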

4.1. Lower Bound Estimation

We can sample a subset $\mathcal{S} \subseteq \mathcal{E}$ of all entities and obtain the following lower bound:

(7)   $\sum_{o \in \mathcal{S}} P(y_{spo} = 1 \mid \Theta) \le \sum_{o \in \mathcal{E}} P(y_{spo} = 1 \mid \Theta) = \Phi_{s,p}$

The tightness of the bound in Eq. 7 is determined by the selection of the entities in $\mathcal{S}$. In this work, we consider uniform sampling. More specifically, a random set of indices $\mathcal{I} \subseteq \{1, \ldots, N\}$ with $|\mathcal{I}| = m \le N$ is taken uniformly, forming the following lower bound:

$\sum_{i \in \mathcal{I}} P(y_{s p o_i} = 1 \mid \Theta) \le \Phi_{s,p}$

where the sum is over all elements in $\mathcal{I}$ with no repetitions.

4.2. Sum Estimation

Instead of defining a lower bound to $\Phi_{s,p}$, we can also approximate $\Phi_{s,p}$ directly by sampling. Let us consider a sum over a large collection of elements, $\Phi = \sum_{i=1}^{N} \phi_i$. We consider two standard methods for approximating such sums via Monte Carlo estimates, namely Importance Sampling (IS) and Bernoulli Sampling (BS) (Botev et al., 2017).

Importance Sampling. Based on the identity $\Phi = \mathbb{E}_{i \sim q}\left[\phi_i / q_i\right]$, a set of indices $\mathcal{I}$ with $|\mathcal{I}| = m$ is selected from a distribution $q$ over $\{1, \ldots, N\}$, yielding the following approximation:

$\Phi \approx \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \frac{\phi_i}{q_i}$

where $q_i$ defines the probability of sampling $i$ from $\{1, \ldots, N\}$.

Bernoulli Sampling. An alternative to IS is Bernoulli Sampling (BS), based on the following identity:

$\Phi = \mathbb{E}_{\mathbf{b}}\left[\sum_{i=1}^{N} \frac{b_i}{p_i}\, \phi_i\right]$, with $b_i \sim \mathrm{Bernoulli}(p_i)$,

where each independent Bernoulli variable $b_i \in \{0, 1\}$ denotes whether $\phi_i$ will be sampled or not, and $p_i$ is the probability of sampling $\phi_i$. This leads to the following approximation:

$\Phi \approx \sum_{i : b_i = 1} \frac{\phi_i}{p_i}$

where the sum is computed over the non-zero components of the vector $\mathbf{b} = (b_1, \ldots, b_N)$. Note that, when calculating an approximation to $\Phi$, IS relies on sampling with replacement, while BS relies on sampling without replacement.
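
The two estimators can be sketched in a few lines of NumPy; here `phi` stands for the per-triple probabilities, and both sampling distributions are chosen uniform purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = rng.uniform(size=10_000)   # per-triple probabilities phi_i
exact = phi.sum()

# Importance Sampling: draw m indices with replacement from q,
# estimate sum_i phi_i as the mean of phi_i / q_i.
m = 1_000
q = np.full(phi.size, 1.0 / phi.size)      # uniform proposal distribution
idx = rng.choice(phi.size, size=m, p=q)    # sampling with replacement
is_estimate = np.mean(phi[idx] / q[idx])

# Bernoulli Sampling: keep each index i independently with probability p_i,
# estimate the sum as sum over kept indices of phi_i / p_i (no replacement).
p = np.full(phi.size, m / phi.size)        # expected sample size m
keep = rng.random(phi.size) < p
bs_estimate = np.sum(phi[keep] / p[keep])

print(exact, is_estimate, bs_estimate)     # all three should be close
```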

By using our regularisation term with sampling, we add a time complexity that is linear in the number of sampled triples used to compute the regularisation term for each batch. Since this number can be smaller than the number of triples in a batch, the time complexity of neural link predictors is not sensibly affected during training, and not affected at all at test time. The proposed method does not increase the space complexity of the models, since the proposed regulariser does not change the number of model parameters.

5. Evaluation

In this section, we investigate the benefits of cardinality regularisation for state-of-the-art neural link prediction models. We compare the performance of the original and regularised losses in the link prediction task across different benchmark datasets, which are partitioned into train, validation, and test sets of triples (cf. Table 3).

5.1. Evaluation Protocol

The link prediction task consists of predicting a missing entity $s$ or $o$ when given a pair $(p, o)$ or $(s, p)$, respectively. During testing, for each test triple $(s, p, o)$, we replace the subject or object entity with all entities in the knowledge graph as corruptions (Bordes et al., 2013). The evaluation then ranks the entities in descending order w.r.t. the scores calculated by the scoring function and records the rank of the correct entity $s$ or $o$. We report results based on the ranks assigned to correct entities, measured using the mean reciprocal rank (MRR) and Hits@$k$ with $k \in \{1, 3, 5, 10\}$. (For MRR and Hits@$k$, the higher the better.) During the ranking process, some positive test triples could be ranked below other true triples, which should not be considered a mistake. Therefore, the above metrics have two settings: raw and filtered (Bordes et al., 2013). In the filtered setting, metrics are computed after removing all true triples appearing in the train, validation, or test sets from the ranking, whereas in the raw setting they are not removed.
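
A compact sketch of the filtered evaluation described above; `score_fn` and the set of known triples are assumptions standing in for a trained model and the dataset splits:

```python
import numpy as np

def filtered_rank(score_fn, test_triple, entities, known_triples):
    """Rank of the true object among all candidate objects, after removing
    (filtering) other candidates that form known true triples."""
    s, p, o = test_triple
    candidates = [e for e in entities
                  if e == o or (s, p, e) not in known_triples]
    ranked = sorted(candidates, key=lambda e: -score_fn(s, p, e))
    return 1 + ranked.index(o)

def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and Hits@k over a list of ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return float((1.0 / ranks).mean()), float((ranks <= k).mean())
```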

5.2. Datasets

Three widely used datasets for evaluating link prediction models are WordNet (Miller, 1995), Freebase (Bollacker et al., 2007), and YAGO (Mahdisoltani et al., 2015). In this work, we use four benchmark datasets generated from them: FB13, WN18, WN18RR and YAGO3-10.

The FB13 dataset (Bordes et al., 2011) is a subset of Freebase containing 13 relation types and entities of type deceased_people, where entities appear in at least 4 relations and relation types appear at least 5,000 times. (We use the corrected version by Socher et al. (2013), which contains only positive samples.) We also use two datasets derived from WordNet, namely WN18 and WN18RR. These datasets contain hyponym, hypernym, and other lexical relations of English concepts and words. It is known that ca. 72% of the triples in WN18 are redundant or inverse; these were removed in the WN18RR dataset (Dettmers et al., 2018). YAGO3-10 consists of entities in YAGO3 (mostly of the people type) linked by at least 10 relations, such as citizenship, gender and profession. The FB13, WN18RR, and YAGO3-10 datasets were shown to have no redundant or trivial triples (Dettmers et al., 2018). In Table 3 we summarise the characteristics of each of the datasets.

Dataset	#Relations	#Entities	#Train	#Valid	#Test
FB13	13	81,065	350,517	5,000	5,000
WN18	18	40,943	141,442	5,000	5,000
WN18RR	11	40,943	86,835	3,034	3,134
YAGO3-10	37	123,182	1,079,040	5,000	5,000
Table 3. Statistics for each of the datasets: number of relation types, entities, and triples in the train, validation, and test splits.

We mine the relation cardinality constraints from the training set of each dataset, following the algorithm proposed by Muñoz and Nickles (2017), using the normalisation option but without filtering outliers. Table 4 gives examples of the cardinality constraints mined from each dataset.

Dataset	Relation	Cardinality
FB13	/people/person/place_of_birth	(0, 2)
FB13	/people/person/parents	(0, 2)
FB13	/people/person/gender	(1, 1)
WN18	_hyponym	(0, 380)
WN18	_has_part	(0, 73)
WN18	_hypernym	(0, 4)
YAGO3-10	livesIn	(0, 12)
YAGO3-10	hasGender	(0, 1)
YAGO3-10	hasChild	(0, 19)
Table 4. Cardinality constraints extracted from FB13, WN18 (WN18RR) and YAGO3-10.

5.3. Results

For our experiments, we re-implemented three models using the TensorFlow framework (Abadi et al., 2016), namely ER-MLP (Dong et al., 2014), DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016) (recently proven to be equivalent to HolE (Hayashi and Shimbo, 2017)). Over the four benchmark datasets, we compare the performance of each model as originally stated by its authors and with the cardinality regularisation term (cf. Eq. 6).

As recommended by Trouillon et al. (2016), we minimise the logistic loss to train each model using SGD, with AdaGrad (Duchi et al., 2011) to adaptively select the learning rate. For each model and dataset, we selected the hyperparameters maximising filtered Hits@10 on the validation set using an exhaustive grid search.

The evaluation of our approach is three-fold: (i) we measure the effects of the regulariser in the link prediction task; (ii) we measure the effects of the different sampling techniques; and (iii) we measure the violations to the cardinality constraints before and after regularisation. To reduce the search space, during the grid search in (i) we fix the sampling technique to uniform. In (ii), we use the best model identified in (i) to study the effect of different sampling techniques, whilst in (iii) we use the overall best model per dataset.

Link Prediction. We train each model for 1,000 epochs with a mini-batch approach over the training set of each dataset, generating two negative examples per positive triple in each batch. We set $\lambda = 0$ to obtain the performance results of the original models (without regularisation), and use uniform sampling of subjects and objects to compute the regularisation term. (We identified via independent experiments that larger sample sizes do not yield performance improvements.)

Tables 5 and 6 show the link prediction results, confirming that in general our cardinality-based regularisation term helps to improve (or at least maintain) the performance of the original ER-MLP, DistMult and ComplEx models across all datasets. The only exception we observed is ComplEx over YAGO3-10, where the model without the regularisation term reaches better Hits@10 and MRR. We believe one reason for this is that constraining a lower bound on the sum of probabilities may not be the best technique when the number of entities is very large. In our experiments we also compare two alternative approaches, namely estimating the sum of probabilities via IS and BS.

ER-MLP and DistMult benefit the most across all datasets, with improvements of up to 36% in MRR. ComplEx is the overall best-performing model, outperforming ER-MLP (by up to 20x in WN18RR) and DistMult on every dataset and evaluation metric. Still, ComplEx benefits from the regularisation term on most of the datasets. Although we did not perform a thorough search of the hyperparameter space to reach state-of-the-art performance, the results demonstrate the advantages of our approach.

	FB13					WN18					WN18RR
Method	Hits@1	Hits@3	Hits@5	Hits@10	MRR	Hits@1	Hits@3	Hits@5	Hits@10	MRR	Hits@1	Hits@3	Hits@5	Hits@10	MRR
ER-MLP	4.40	7.55	9.14	11.82	6.94	21.64	37.30	44.94	56.52	33.02	1.84	3.29	4.10	5.31	3.10
ER-MLP + card	5.13	8.36	10.29	12.75	7.78	32.01	51.54	60.54	70.85	45.01	2.22	4.29	5.42	7.31	3.98
DistMult	18.07	29.29	32.94	37.01	24.92	64.46	87.47	90.66	93.49	76.62	38.93	43.49	45.93	49.63	42.46
DistMult + card	18.10	29.45	33.07	37.02	25.00	65.01	87.53	90.71	93.44	76.93	39.10	44.13	46.30	49.81	42.84
ComplEx	25.08	31.64	34.00	36.90	29.41	88.33	93.05	94.14	95.07	90.96	40.87	46.25	48.55	51.15	44.52
ComplEx + card	24.89	31.78	34.10	37.16	29.36	88.66	93.27	94.21	95.21	91.20	41.10	46.06	48.13	51.09	44.57
Table 5. Link prediction results (Hits@k and Mean Reciprocal Rank, filtered setting) on FB13, WN18 and WN18RR, comparing each original model with its cardinality-regularised variant (+ card).
	YAGO3-10
Method	Hits@1	Hits@3	Hits@5	Hits@10	MRR
ER-MLP	2.22	6.09	9.59	16.01	6.83
ER-MLP + card	2.33	6.16	9.65	16.54	6.95
DistMult	6.75	14.33	18.86	26.51	13.33
DistMult + card	7.03	14.53	19.12	26.66	13.59
ComplEx	7.12	15.61	20.76	29.11	14.33
ComplEx + card	7.56	15.10	20.30	29.01	14.47
Table 6. Link prediction results (Hits@k and Mean Reciprocal Rank, filtered setting) on YAGO3-10, comparing each original model with its cardinality-regularised variant (+ card).

Sampling techniques. To approximate the sum of probabilities, we test both Importance Sampling and Bernoulli Sampling, considering an expanded set of sample sizes. Starting from the best ComplEx models learned above, we tune the sampling technique for each of the datasets.

Results are shown in Table 7. In general, all sampling techniques work well, and there is no one-size-fits-all solution: the best choice depends on the dataset. (Knowledge about properties of the data that favour one of the sampling schemes can be exploited, and custom sampling schemes are also supported.) YAGO3-10 shows the biggest improvement, of 6% in MRR using BS, compared with the results in Table 6. This improvement might be correlated with the advantage of BS in handling the large number of entities in YAGO3-10. For FB13, WN18, and WN18RR we see smaller improvements in MRR and Hits@10 compared to the results in Table 5. Differences between the uniform-sampling results here and those in Table 5 are attributed to the expanded hyperparameter space, with more sample sizes than previously.

Dataset	Sampling	Hits@1	Hits@3	Hits@5	Hits@10	MRR
FB13	Uniform	25.84	31.85	34.19	37.26	29.89
FB13	Importance	25.17	31.36	34.36	36.18	29.18
FB13	Bernoulli	25.92	31.86	34.11	37.18	29.97
WN18	Uniform	88.98	93.66	94.84	95.98	92.12
WN18	Importance	88.97	93.64	94.73	96.08	91.10
WN18	Bernoulli	89.05	93.57	94.67	95.94	91.09
WN18RR	Uniform	41.27	46.57	48.58	51.51	44.87
WN18RR	Importance	41.09	46.68	48.81	51.50	44.78
WN18RR	Bernoulli	41.54	46.79	48.68	51.42	45.04
YAGO3-10	Uniform	8.32	15.52	20.92	29.29	15.30
YAGO3-10	Importance	8.23	15.71	20.70	29.49	15.28
YAGO3-10	Bernoulli	8.48	15.74	20.82	29.50	15.42
Table 7. Link prediction results (Hits@k and Mean Reciprocal Rank, filtered setting) for the best ComplEx model using different sampling techniques.

Cardinality Violations in KGs. We have shown that our regulariser is beneficial for the link prediction task; more importantly, the predictions that violate the cardinality constraints are significantly reduced. Figure 2 shows the changes in the distribution of $\Phi_{s,p}$ for four relations for ER-MLP in YAGO3-10, one of the settings that benefited the most. Figures 2(a), 2(c) and 2(d) illustrate positive impacts of the regularisation. We observed that the regulariser decreases the median and the long tail of the distribution above the third quartile for (almost) every relation, making predictions more accurate. For example, for the relation imports (bound (0, 6)) the mean of $\Phi_{s,p}$ is reduced by 78%, meaning fewer violations. Conversely, the biggest negative impact was on the relation hasWebsite (bound (0, 2), Fig. 2(b)), where violations increased by 65%. Both constraints are similarly restrictive over the number of objects, but they differ in their range: for the former, the objects are entities with links to other entities, while for the latter the objects are literals (URLs) with no further links. The prediction of literals is a known problem for neural link predictors, as literals do not have many links to other entities (García-Durán and Niepert, 2017).

Figure 2. Changes in the distribution of $\Phi_{s,p}$ without (left, in blue) and with (right, in orange) regularisation, using ER-MLP in YAGO3-10, for the relations (a) imports (0, 6), (b) hasWebsite (0, 2), (c) hasAcademicAdvisor (0, 4), and (d) hasChild (0, 19). Horizontal lines correspond to quartiles.

Following the DistMult example with the constraint $(l_{\mathrm{hasParent}}, u_{\mathrm{hasParent}}) = (0, 2)$, Table 8 shows the predictions for the parents of Edgar Allan Poe. There are fewer predictions with high probability, and a correct but previously missing entity, David Poe Jr., is now scored with a high probability, demonstrating the effectiveness of the regularisation.

Table 8. Predictions with high probability for the parents of Edgar Allan Poe given by DistMult when imposing the cardinality regulariser; only three predictions remain, with probabilities 0.861, 0.854, and 0.815.

We did not note any major difference in results between tight and loose cardinality bounds, or between constraints for relations with few and many instances. Finally, Fig. 3 shows the effect of different regularisation weights $\lambda$ on the average mean of $\Phi_{s,p}$ and Hits@10 across relations in WN18RR. As $\lambda$ grows, Hits@10 changes only slightly while the average mean of $\Phi_{s,p}$ decreases. This shows that the regularisation term does not negatively affect Hits@10 (a common evaluation metric) and helps to decrease the number of violations of the cardinality constraints.

Figure 3. Influence of the regularisation weight $\lambda$ on the average mean of $\Phi_{s,p}$ (solid blue line) and the Hits@10 score (dashed red line) in WN18 with ComplEx.

6. Related Work

Early works in neural link prediction (e.g., TransE (Bordes et al., 2013), RESCAL (Nickel et al., 2011), DistMult (Yang et al., 2015)) learn the representations of all entities and relations in the knowledge base by fitting simple scoring functions on the triples in the knowledge graph.

Recently, research has focused on either (i) designing more elaborate scoring functions that better capture the nature of each relation, or (ii) improving existing models with background knowledge (Wang et al., 2017). The former includes HolE (Nickel et al., 2016b), whose scoring function is inspired by cognitive models of associative memory; ComplEx (Trouillon et al., 2016), which uses complex-valued embeddings to model asymmetric relations; and ConvE (Dettmers et al., 2018), which builds a multi-layer convolutional network. The latter is characterised by the incorporation of additional information such as entity types, relation paths, and logical rules. We refer readers to (Nickel et al., 2016a; Wang et al., 2017) for a deeper review of neural link predictors.

Our work aligns with the second category, which focuses on adding background knowledge. Almost every work incorporating background knowledge agrees that such prior knowledge improves link prediction models (Guo et al., 2015; Minervini et al., 2016, 2017b, 2017a; Guo et al., 2017; Ding et al., 2018). However, none of them has considered integrity constraints such as cardinality.

Muñoz and Nickles (2017) mine cardinality constraints from knowledge graphs, and suggest their use to improve the accuracy of link prediction models.

In a similar vein, Galárraga et al. (2017) use fine-grained cardinality information to prune 'unnecessary' predictions; however, this is done only after the predictions are generated. In (Zhang et al., 2017), a single cardinality bound (one-to-one, one-to-many or many-to-many) is imposed in link prediction over single-relational graphs (such as organisational charts), which differs from the multi-relational nature of knowledge graphs.

7. Conclusions

In this paper, we presented a cardinality-based regularisation term for neural link prediction models. The regulariser incorporates background knowledge in the form of relation cardinality constraints that hitherto have been ignored by neural link predictors.

The incorporation of this regularisation term into the loss function significantly reduces the number of violations produced by models at prediction time, encouraging the number of high-probability predictions for each relation to satisfy the cardinality bounds.

Experimental results show that the regulariser consistently improves the quality of the knowledge graph embeddings, without affecting the efficiency or scalability of the learning algorithms.

Acknowledgements.
This work was partially supported by the TOMOE project funded by Fujitsu Laboratories Ltd., Japan and Insight Centre for Data Analytics at National University of Ireland Galway (supported by the Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289).

References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI. USENIX Association, 265–283.
  • Bollacker et al. (2007) Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. 2007. Freebase: A Shared Database of Structured General Human Knowledge. In AAAI. AAAI Press, 1962–1963.
  • Bordes et al. (2014) Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data - Application to word-sense disambiguation. Machine Learning 94, 2 (2014), 233–259.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS. 2787–2795.
  • Bordes et al. (2011) Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning Structured Embeddings of Knowledge Bases. In AAAI. AAAI Press.
  • Botev et al. (2017) Aleksandar Botev, Bowen Zheng, and David Barber. 2017. Complementary Sum Sampling for Likelihood Approximation in Large Scale Classification. In AISTATS (Proceedings of Machine Learning Research), Vol. 54. PMLR, 1030–1038.
  • Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D Knowledge Graph Embeddings. In AAAI. AAAI Press.
  • Ding et al. (2018) Boyang Ding, Quan Wang, Bin Wang, and Li Guo. 2018. Improving Knowledge Graph Embedding Using Simple Constraints. In ACL (1). Association for Computational Linguistics, 110–121.
  • Dong et al. (2014) Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In KDD. ACM, 601–610.
  • Duchi et al. (2011) John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12 (2011), 2121–2159.
  • Galárraga et al. (2017) Luis Galárraga, Simon Razniewski, Antoine Amarilli, and Fabian M. Suchanek. 2017. Predicting Completeness in Knowledge Bases. In WSDM. ACM, 375–383.
  • García-Durán and Niepert (2017) Alberto García-Durán and Mathias Niepert. 2017. KBLRN : End-to-End Learning of Knowledge Base Representations with Latent, Relational, and Numerical Features. CoRR abs/1709.04676 (2017).
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS (JMLR Proceedings), Vol. 9. JMLR.org, 249–256.
  • Guo et al. (2015) Shu Guo, Quan Wang, Bin Wang, Lihong Wang, and Li Guo. 2015. Semantically Smooth Knowledge Graph Embedding. In ACL (1). The Association for Computer Linguistics, 84–94.
  • Guo et al. (2017) Shu Guo, Quan Wang, Bin Wang, Lihong Wang, and Li Guo. 2017. SSE: Semantically Smooth Embedding for Knowledge Graphs. IEEE Trans. Knowl. Data Eng. 29, 4 (2017), 884–897.
  • Hayashi and Shimbo (2017) Katsuhiko Hayashi and Masashi Shimbo. 2017. On the Equivalence of Holographic and Complex Embeddings for Link Prediction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Regina Barzilay et al. (Eds.). Association for Computational Linguistics, 554–559.
  • Krompaß et al. (2014) Denis Krompaß, Maximilian Nickel, and Volker Tresp. 2014. Querying Factorized Probabilistic Triple Databases. In International Semantic Web Conference (2) (Lecture Notes in Computer Science), Vol. 8797. Springer, 114–129.
  • Mahdisoltani et al. (2015) Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. 2015. YAGO3: A Knowledge Base from Multilingual Wikipedias. In CIDR. www.cidrdb.org.
  • Miller (1995) George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM 38, 11 (1995), 39–41.
  • Minervini et al. (2017a) Pasquale Minervini, Luca Costabello, Emir Muñoz, Vít Novácek, and Pierre-Yves Vandenbussche. 2017a. Regularizing Knowledge Graph Embeddings via Equivalence and Inversion Axioms. In ECML/PKDD (1) (Lecture Notes in Computer Science), Vol. 10534. Springer, 668–683.
  • Minervini et al. (2016) Pasquale Minervini, Claudia d’Amato, Nicola Fanizzi, and Floriana Esposito. 2016. Leveraging the schema in latent factor models for knowledge graph completion. In SAC. ACM, 327–332.
  • Minervini et al. (2017b) Pasquale Minervini, Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. 2017b. Adversarial Sets for Regularising Neural Link Predictors. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Gal Elidan et al. (Eds.). AUAI Press.
  • Mirza et al. (2017) Paramita Mirza, Simon Razniewski, Fariz Darari, and Gerhard Weikum. 2017. Cardinal Virtues: Extracting Relation Cardinalities from Text. In ACL (2). Association for Computational Linguistics, 347–351.
  • Muñoz and Nickles (2017) Emir Muñoz and Matthias Nickles. 2017. Mining Cardinalities from Knowledge Bases. In DEXA (1) (Lecture Notes in Computer Science), Vol. 10438. Springer, 447–462.
  • Nickel and Kiela (2017) Maximilian Nickel and Douwe Kiela. 2017. Poincaré Embeddings for Learning Hierarchical Representations. In NIPS. 6341–6350.
  • Nickel et al. (2016a) Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016a. A Review of Relational Machine Learning for Knowledge Graphs. Proc. IEEE 104, 1 (2016), 11–33.
  • Nickel et al. (2016b) Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. 2016b. Holographic Embeddings of Knowledge Graphs. In AAAI. AAAI Press, 1955–1961.
  • Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A Three-Way Model for Collective Learning on Multi-Relational Data. In ICML. Omnipress, 809–816.
  • Nickel et al. (2012) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2012. Factorizing YAGO: scalable machine learning for linked data. In WWW. ACM, 271–280.
  • Olivé (2007) Antoni Olivé. 2007. Conceptual modeling of information systems. Springer.
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI. AUAI Press, 452–461.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A Stochastic Approximation Method. Ann. Math. Statist. 22, 3 (09 1951), 400–407. https://doi.org/10.1214/aoms/1177729586
  • Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. 2013. Reasoning With Neural Tensor Networks for Knowledge Base Completion. In NIPS. 926–934.
  • Tandon et al. (2017) Niket Tandon, Aparna S. Varde, and Gerard de Melo. 2017. Commonsense Knowledge in Machine Intelligence. SIGMOD Record 46, 4 (2017), 49–52.
  • Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In ICML (JMLR Workshop and Conference Proceedings), Vol. 48. JMLR.org, 2071–2080.
  • Wang et al. (2017) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Trans. Knowl. Data Eng. 29, 12 (2017), 2724–2743.
  • Wang et al. (2015) Quan Wang, Bin Wang, and Li Guo. 2015. Knowledge Base Completion Using Embeddings and Rules. In IJCAI. AAAI Press, 1859–1866.
  • Wynn (1990) Karen Wynn. 1990. Children's understanding of counting. Cognition 36, 2 (Aug 1990), 155–193. https://doi.org/10.1016/0010-0277(90)90003-3
  • Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In ICLR.
  • Zhang et al. (2017) Jiawei Zhang, Jianhui Chen, Junxing Zhu, Yi Chang, and Philip S. Yu. 2017. Link Prediction with Cardinality Constraint. In WSDM. ACM, 121–130.