Semantic Triple Encoder for Fast Open-Set Link Prediction

04/30/2020 ∙ by Bo Wang, et al. ∙ University of Technology Sydney ∙ University of Washington ∙ Jilin University

We improve both the open-set generalization and efficiency of link prediction on knowledge graphs by leveraging the contexts of entities and relations in a novel semantic triple encoder. Most previous methods, e.g., translation-based and GCN-based embedding approaches, were built upon graph embedding models. They simply treat the entities/relations as a closed set of graph nodes regardless of their context semantics, which however cannot provide critical information for the generalization to unseen entities/relations. In this paper, we partition each graph triple and develop a novel context-based encoder that separately maps each part and its context into a latent semantic space. We train this semantic triple encoder by optimizing two objectives specifically designed for link prediction. In particular, (1) We split each triple into two parts, i.e., i) head entity plus relation and ii) tail entity, process both contexts separately by a Transformer encoder, and combine the encoding outputs to derive the prediction. This Siamese-like architecture avoids the combinatorial explosion of candidate triples and significantly improves the efficiency, especially during inference; (2) We cover the contextualized semantics of the triples in the encoder so it can handle unseen entities during inference, which promisingly improves the generalization ability; (3) We train the model by optimizing two complementary objectives defined on the triple, i.e., classification and contrastive losses, for natural and reliable ranking scores during inference. In experiments, we achieve the state-of-the-art or competitive performance on three popular link prediction benchmarks. In addition, we empirically reduce the inference costs by one or two orders of magnitude compared to a recent context-based encoding approach and meanwhile keep a superior quality of prediction.




1 Introduction

A knowledge graph (KG) is a prevalent format of knowledge base (KB). It is structured as a directed graph whose vertices and edges stand for entities and their relations respectively, and it is usually represented as a set of triples in the form of (head entity, relation, tail entity). As backend knowledge sources, KGs play significant roles in a wide range of natural language processing (NLP) tasks, such as question answering (Hao et al., 2017), dialogue systems (He et al., 2017), information retrieval (Xiong et al., 2017) and recommendation systems (Zhang et al., 2016). However, curated knowledge graphs usually suffer from incompleteness or sparsity (Socher et al., 2013; West et al., 2014), which inevitably limits their practical applications. To mitigate this issue, the link prediction task (Bordes et al., 2011) has been studied, which aims to complete a graph triple whose head or tail entity is missing.

To address the link prediction problem, a variety of approaches attempt to learn representations of entities and relations as real-valued, low-dimensional vectors by exploring their structural and relational information in a KG. These graph embedding models can be coarsely grouped into three categories: (1) Translation-based embedding approaches, e.g., TransE (Bordes et al., 2013) and RotatE (Sun et al., 2019), score the plausibility of a triple based on how close the combined embedding of the head entity and relation is to the embedding of the tail entity in a latent space; (2) Rather than directly calculating the distance in latent space, conv-based embedding approaches, e.g., ConvE (Dettmers et al., 2018) and ConvKB (Nguyen et al., 2017), employ convolutional neural networks to generate triples' embeddings and then measure their correctness; (3) GCN-based embedding approaches, e.g., R-GCN (Schlichtkrull et al., 2018) and SACN (Shang et al., 2019), enhance the embeddings by using a relation-augmented graph convolution network (GCN) (Kipf and Welling, 2017) as a deep embedding encoder. Despite the success of the above methods in representing relational data, a key prerequisite, which is also their major weakness, is that the trained model can only be applied to triples defined on a closed set of entities and relations. This heavily limits the generalization of a link prediction model to unseen entities (an unseen entity is one that does not appear in the training phase but does appear in the inference phase), which unfortunately are quite common in practical applications and even in benchmark datasets.

To alleviate the problem above, some recent works (Xie et al., 2016; Xiao et al., 2017) encode the contexts of entities and relations (e.g., their text contents or descriptions) as their representations. They can be categorized as context-based encoding approaches. They use learning objectives similar in form to those of translation-based approaches, except that they are computed in the semantic space derived from the contexts. Although their idea is in the right direction, in practice they even underperform recent translation-based approaches because they only train shallow neural encoders (e.g., continuous bag-of-words or a two-layer CNN) to separately encode each component of a triple regardless of their connections. More recently, KG-BERT (Yao et al., 2019) proposed a straightforward yet effective context-based encoding approach: it applies a Transformer encoder (Vaswani et al., 2017) to the concatenated contexts of the components in a triple to obtain a rich contextualized representation. In practice, it outperforms most previous works on popular benchmarks. However, it suffers from overwhelming computational costs and inevitable overheads during inference due to the combinatorial explosion of the candidate triples that need to be processed.

In this work, we aim to (1) improve the generalization of link prediction to unseen entities by studying a new neural architecture for context-based encoder, and (2) reduce the overheads by avoiding processing each triple candidate independently. To this end, we propose a novel semantic triple encoder for link prediction. In particular, inspired by Reimers and Gurevych (2019), we split a graph triple into two parts and separately encode them into two vector representations using a Transformer encoder, where the first part is the concatenated contexts of the head entity and the relation and the second part is the context of the tail entity. The representations are then passed to the proposed task-specific modules for training and inference. Intuitively, our model takes the contextualized information into account, which improves the generalization to unseen entities. Meanwhile, using two branches of encoder for different parts of the triple avoids combinatorial explosion and thus significantly reduces the amount of encoding operations during inference. This can be seen as a Siamese-like architecture (Reimers and Gurevych, 2019) that relates the two separated parts via optimizing objectives defined on their combined representation.

We carefully choose two optimization objectives that complement each other in training the semantic triple encoder. In particular, following the common practice of contextualized encoding approaches (Yao et al., 2019), the first one is a classification objective built upon an interactive combination of the two parts' representations, to judge whether a triple is plausible or not. To further bridge the metric gap between training and inference on link prediction, a contrastive objective is presented to encourage the distance between the two parts' representations to be smaller for a correct triple than for an improper triple. This two-objective learning trains the encoder to reduce false positive predictions during link prediction inference. This is achieved by consistently producing more accurate ranking scores than an encoder trained toward either objective alone.

In experiments on three link prediction benchmarks, i.e., WN18RR, FB15k-237 and UMLS, we achieve state-of-the-art or competitive performance on most evaluation metrics. Moreover, our approach requires much less training and inference computation than the KG-BERT baseline while achieving even better performance. For example, with a better Mean Rank (MR), our approach spends only 6.5 hours to complete inference on FB15k-237, while KG-BERT requires up to 1 month. In addition, we provide further analyses and case studies that help to understand the proposed approach, and give comprehensive comparisons between graph embedding approaches and context-based encoding approaches.

2 Background

We start this section with a formal definition of context-based encoding model for link prediction. And then we give a brief introduction to a recently-proposed approach, KG-BERT (Yao et al., 2019), which is a crucial baseline in our later comparison.

Context-Based Encoding Approach.

Given a knowledge graph $\mathcal{G}$ organized as a set of triples, i.e., (head entity, relation, tail entity) or $(h, r, t)$ for short, the contexts of a triple are denoted by $(c^{(h)}, c^{(r)}, c^{(t)})$. A context $c^{(x)}$ stands for a piece of natural language text that could be the text, mention, or description corresponding to $x \in \{h, r, t\}$. Then, an appropriate tokenization followed by a word2vec embedding transforms each context into a sequence of word embeddings that can be processed by neural models, i.e., $X^{(x)} = [x^{(x)}_1, \dots, x^{(x)}_{n_x}]$, where $n_x$ denotes the length of the tokenized $c^{(x)}$. Lastly, a neural model $f$, which usually consists of an encoder and a discriminator, serves as a scoring function to measure the plausibility of each triple, i.e.,

$$s = f\left(X^{(h)}, X^{(r)}, X^{(t)}\right). \tag{1}$$

During training, the score $s$ can be leveraged by any proper objective to train the neural model in an end-to-end fashion. During inference, the score can be used as a ranking basis.

KG-BERT Baseline.

KG-BERT (Yao et al., 2019) was recently proposed as a context-based encoding approach that delivers state-of-the-art performance across several link prediction benchmarks. It straightforwardly trains a Transformer encoder (Vaswani et al., 2017), initialized by a pre-trained language model, BERT (Devlin et al., 2019), to encode the concatenated contexts of the head entity, relation and tail entity (with the special separator tokens defined by Devlin et al. (2019)), i.e., $\tilde{X}$ = [CLS] + $X^{(h)}$ + [SEP] + $X^{(r)}$ + [SEP] + $X^{(t)}$ + [SEP], where $X^{(h)}$, $X^{(r)}$ and $X^{(t)}$ are the embedded contexts of the head entity, relation and tail entity. It outputs a contextualized vector to represent the entire triple, i.e.,

$$c = \mathrm{Pool}\left(\text{Transformer-Enc}(\tilde{X})\right), \tag{2}$$

where $\mathrm{Pool}(\cdot)$ takes the contextualized representation of [CLS]. Next, the vector representation $c$ is passed into a two-way classifier to judge whether the triple is plausible or not. The model is trained by minimizing the binary cross entropy loss. During inference, the probability of the positive class is used to rank all triple candidates.

3 Proposed Approach

In this section, we will elaborate on the proposed two-objective learning based Semantic Triple Encoder for Link Prediction (STELP), whose architecture is illustrated in Figure 1. Particularly, we firstly propose a novel Transformer-based triple encoder that uses a Siamese-like architecture to extract triple embedding from the contextualized representations of its two partitions. The context-based embedding makes the model generalizable to unseen entities, while the partition significantly reduces the inference costs on all triple candidates. We then introduce two complementary objectives, which are optimized to train the above neural encoder in a similar spirit as multi-task learning. We finally provide details about training and inference, including the training/inference strategy and analysis about computation complexity.

Figure 1: An overview of the proposed two-objective learning based Semantic Triple Encoder for Link Prediction (STELP). This illustration is based on a corruption of the tail entity; the corruption of the head entity or even the relation is handled in the same way. Note that a notation whose superscript includes “′” is derived from a negative triple, and otherwise from a positive triple.

3.1 Semantic Triple Encoder

The two-branch architecture of Siamese network has been adopted by Reimers and Gurevych (2019) to improve the efficiency of the Transformer-based encoder without degenerating the performance. We extend this idea to triple encoder for link prediction. Unlike KG-BERT (Yao et al., 2019) that straightforwardly concatenates all the three components from a triple as the input to the encoder, we firstly split a triple into two parts, gaining a desirable trade-off between model efficiency and contextualized semantics. Intuitively, there exists a combinatorial number of options to pass the three components of a triple, i.e., head entity, relation and tail entity, into a Siamese-like encoder. After thorough empirical comparison, we select the one achieving overall the best performance in terms of evaluation metrics (see §4.5). That is, we concatenate the contexts of the head entity and relation into one piece of text as the first part and treat the context of tail entity as the second part.

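Formally, given the contexts of a triple composed of $(c^{(h)}, c^{(r)}, c^{(t)})$ for the head entity, relation and tail entity, the three components are partitioned into two parts, i.e.,

$$\tilde{X}^{(h,r)} = \text{[CLS]} + X^{(h)} + \text{[SEP]} + X^{(r)} + \text{[SEP]}, \tag{3}$$

$$\tilde{X}^{(t)} = \text{[CLS]} + X^{(t)} + \text{[SEP]}, \tag{4}$$

where $X^{(x)}$ denotes the embedded token sequence of the context $c^{(x)}$. We keep using the segment identifier given in Devlin et al. (2019) to mark whether a token belongs to an entity's context or a relation's context, i.e., 0 for the entity and 1 for the relation contexts. The two parts are then passed into two parameter-tied, Transformer-based encoders to generate the contextualized representation of each part. This procedure can be formulated as

$$u = \mathrm{Pool}\left(\text{Transformer-Enc}(\tilde{X}^{(h,r)})\right), \tag{5}$$

$$v = \mathrm{Pool}\left(\text{Transformer-Enc}(\tilde{X}^{(t)})\right), \tag{6}$$

where $\text{Transformer-Enc}(\cdot)$ is comprised of stacked layers of multi-head self-attention (Vaswani et al., 2017) with residual connections (He et al., 2016), and $\mathrm{Pool}(\cdot)$, as in Devlin et al. (2019), collects the contextualized representation of [CLS] from the encoder output. The Transformer encoder can be initialized by a pre-trained language model to further boost its capacity for sequence modeling, which alternates between BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) in our experiments.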
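The partition described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `partition_triple` is hypothetical, the contexts are assumed to be already tokenized, and giving segment id 0 to [CLS] and to the first [SEP] is an assumption that follows the usual BERT convention.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"

EncoderInput = Tuple[List[str], List[int]]  # (tokens, segment ids)


def partition_triple(head_ctx: List[str],
                     rel_ctx: List[str],
                     tail_ctx: List[str]) -> Tuple[EncoderInput, EncoderInput]:
    """Split a triple's tokenized contexts into the two encoder inputs.

    Part 1: [CLS] head [SEP] relation [SEP]
    Part 2: [CLS] tail [SEP]
    Segment ids: 0 for entity tokens, 1 for relation tokens (as in the text).
    Special-token segment ids are an assumption of this sketch.
    """
    part1 = [CLS] + head_ctx + [SEP] + rel_ctx + [SEP]
    seg1 = [0] * (len(head_ctx) + 2) + [1] * (len(rel_ctx) + 1)
    part2 = [CLS] + tail_ctx + [SEP]
    seg2 = [0] * len(part2)
    return (part1, seg1), (part2, seg2)


# Toy example with made-up contexts.
(p1, s1), (p2, s2) = partition_triple(["plant", "flora"], ["hypernym"], ["organism"])
print(p1, s1)
print(p2, s2)
```

Because the tail part does not depend on the head or relation, its encoding can be computed once per entity and reused across queries, which is where the inference savings come from.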

In addition to the high efficiency brought by the Siamese-like architecture, the above semantic triple encoder not only preserves the rich contextual information across the entity and relation (Yao et al., 2019), but also enables flexible choices of training objectives defined upon the generated embeddings. We discuss the details of the latter in the next section.

3.2 Learning Objectives

As illustrated in the top of Figure 1, we simultaneously optimize two learning objectives to train the semantic triple encoder in §3.1, which are the classification and contrastive objectives defined on the two parts’ embeddings.

Before presenting the two objectives, it is necessary to perform negative sampling to generate improper or non-factual triples as negative examples. Specifically, given a correct triple $tp = (h, r, t)$, we corrupt the triple and generate its corresponding improper triple $tp'$ by replacing either the head or the tail entity with another entity randomly sampled from the entities of the training knowledge graph $\mathcal{G}$, which satisfies $tp' = (h', r, t) \notin \mathcal{G}$ with $h' \in \mathcal{E}$ or $tp' = (h, r, t') \notin \mathcal{G}$ with $t' \in \mathcal{E}$, where $\mathcal{E}$ denotes the ground set of all unique entities on $\mathcal{G}$. In the remainder, a variable with the superscript “′” means that it is derived from a negative example.
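The negative sampling step can be illustrated with the following sketch. It is not the authors' code: the function name `corrupt`, the 50/50 choice between head and tail corruption, and the rejection of candidates that are already known triples are assumptions made for illustration.

```python
import random
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def corrupt(triple: Triple,
            entities: List[str],
            known: Set[Triple],
            num_neg: int = 5,
            rng: Optional[random.Random] = None) -> List[Triple]:
    """Sample improper triples by replacing the head or the tail entity.

    Candidates equal to the original triple or present in `known`
    are rejected (an assumption of this sketch).
    """
    rng = rng or random.Random(0)
    h, r, t = triple
    negatives: List[Triple] = []
    tries, max_tries = 0, 100 * num_neg  # guard against tiny entity sets
    while len(negatives) < num_neg and tries < max_tries:
        tries += 1
        e = rng.choice(entities)
        # Corrupt the head or the tail with equal probability.
        cand = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if cand != triple and cand not in known:
            negatives.append(cand)
    return negatives


# Toy graph with made-up entities.
entities = ["a", "b", "c", "d"]
known = {("a", "r", "b")}
print(corrupt(("a", "r", "b"), entities, known, num_neg=5))
```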

Triple Classification Objective.

Given the embedding of a candidate triple, link prediction is reduced to a binary classification determining whether the triple is plausible or not. This is a common practice and has been widely used by most non-distance-based link prediction approaches (Nguyen et al., 2017; Vu et al., 2019; Yao et al., 2019). Specifically, we generate the triple embedding by an interactive concatenation (Liu et al., 2016) of the two parts' contextualized representations $u$ and $v$ (the encoder outputs for the head-plus-relation part and the tail part), i.e.,

$$c = [u;\ u \circ v;\ u - v;\ v], \tag{7}$$

where $[\cdot\,;\cdot]$ denotes concatenation and $\circ$ the element-wise product. A two-way classifier is then applied to $c$ and produces a categorical distribution over the negative and positive classes, i.e.,

$$p = \mathrm{softmax}\left(\mathrm{MLP}(c)\right), \tag{8}$$

where the output $p$ contains two numbers in $[0, 1]$, denoting the probabilities of a triple being improper and correct respectively. During the inference of link prediction, the 2nd dimension of $p$, i.e., the positive class probability $s^c = p_2$, can serve as a score to perform triple ranking.

To train the encoder w.r.t. the triple classification objective, we use the following binary cross entropy loss, i.e.,

$$\mathcal{L}^c = -\sum_{tp \in \mathcal{D}} \log s^c \; - \sum_{tp' \in \mathcal{N}(\mathcal{D})} \log\left(1 - s^{c\prime}\right), \tag{9}$$

where $\mathcal{D}$ denotes the training dataset containing only correct triples, $\mathcal{N}(\mathcal{D})$ denotes a set of improper triples generated by applying the aforementioned negative sampling method to $\mathcal{D}$, $s^c$ denotes the positive class probability of a correct triple $tp$, and $1 - s^{c\prime}$ denotes the negative class probability of the improper triple $tp'$. Note that the loss weights are not balanced between positive and negative examples, since several improper triples are sampled for each correct one: more penalty on negative examples helps make this classifier suitable for ranking during link prediction inference, in which a correct triple needs to be distinguished among hundreds of thousands of improper ones.
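A small pure-Python sketch of the classification objective follows. Several details are assumptions made for illustration and are labeled in the comments: the interactive concatenation is taken to be [u; u∘v; u−v; v], the loss is an unnormalized sum, and the helper names are hypothetical. The classifier itself is left out; the loss takes positive-class probabilities as input.

```python
import math
from typing import List, Sequence


def interactive_concat(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Assumed form [u; u*v; u-v; v] of the interactive concatenation."""
    prod = [a * b for a, b in zip(u, v)]
    diff = [a - b for a, b in zip(u, v)]
    return list(u) + prod + diff + list(v)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def classification_loss(pos_probs: Sequence[float],
                        neg_probs: Sequence[float],
                        eps: float = 1e-12) -> float:
    """Binary cross entropy over correct and improper triples.

    pos_probs: positive-class probability of each correct triple.
    neg_probs: positive-class probability of each improper triple.
    Unnormalized sum (an assumption), so more negatives means more penalty.
    """
    loss = -sum(math.log(p + eps) for p in pos_probs)
    loss -= sum(math.log(1.0 - p + eps) for p in neg_probs)
    return loss


c = interactive_concat([1.0, 2.0], [3.0, 4.0])   # 4x the encoder width
p = softmax([0.2, 1.3])                          # [P(improper), P(correct)]
print(len(c), p[1])
print(classification_loss([0.9], [0.2, 0.1, 0.4, 0.3, 0.05]))
```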

However, the classification score $s^c$ might not contain sufficient information for ranking during inference, since it is only the confidence in a single triple's correctness and does not take other triple candidates into account. This may cause inconsistency between the model's training and inference. Intuitively, for a context-based encoder, subtle semantic differences between the contexts of the original entity and its corruption may not be well captured, not to mention the polysemy and ambiguity issues in the knowledge graph. This can lead to over-confident false positive predictions for a corruption (i.e., assigning a corrupted triple a score $s^c$ close to 1), which badly affects the performance on link prediction.

Triple Contrastive Objective.

To achieve a reliable ranking score for triple candidates, we further train the encoder with a contrastive objective that directly compares a correct triple with an improper one. It does not require additional labeling effort but provides necessary regularization to the pairwise ranking that the classification objective cannot offer. Before introducing the exact form of the loss function of this contrastive objective, we first define the score used to measure a triple's plausibility. In traditional link prediction approaches, the ranking score of a triple is inversely proportional to the distance between the combined embedding of the head entity and relation and the embedding of the tail entity (Bordes et al., 2013; Sun et al., 2019). For example, TransE (Bordes et al., 2013) uses the negative Euclidean distance as the ranking score. We can adopt a similar scoring paradigm that computes the distance between the two parts' contextualized embeddings generated by the semantic triple encoder. The major difference is that the distance is computed in a learned semantic space. In our method, we also use the Euclidean distance as in TransE, and will discuss other options for the distance metric in §4.5, such that

$$s^d = -\lVert u - v \rVert_2, \tag{10}$$

where $s^d$ is the plausibility score based on the two representations, $u$ and $v$, of a triple. The contrastive objective takes into account a pairwise ranking between a correct triple and an improper triple, where the latter is corrupted from the former by negative sampling. In particular, letting $s^d$ denote the score derived from a positive triple $tp$ and $s^{d\prime}$ denote the score derived from an improper triple $tp'$, we define the loss using a margin-based hinge loss function, i.e.,

$$\mathcal{L}^d = \sum_{tp \in \mathcal{D}} \sum_{tp' \in \mathcal{N}(tp)} \max\left(0,\ \lambda - s^d + s^{d\prime}\right), \tag{11}$$

where $\lambda$ is the margin and, likewise, $\mathcal{D}$ denotes the training dataset of correct triples and $\mathcal{N}(tp)$ denotes a set of improper triples corrupted from $tp$.
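The distance-based score and the margin-based hinge loss can be sketched as follows. The function names and the margin value of 1.0 are hypothetical choices for illustration (the margin used in the paper is not stated in this text), and the loss is written as an unnormalized sum.

```python
import math
from typing import List, Sequence


def distance_score(u: Sequence[float], v: Sequence[float]) -> float:
    """Negative Euclidean distance between the two parts' embeddings."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(pos_scores: Sequence[float],
                     neg_scores: Sequence[Sequence[float]],
                     margin: float = 1.0) -> float:
    """Margin-based hinge loss over (correct, improper) pairs.

    pos_scores[i] is the score of the i-th correct triple and
    neg_scores[i] holds the scores of its sampled corruptions.
    `margin` is a hypothetical value chosen for this sketch.
    """
    total = 0.0
    for s_pos, negs in zip(pos_scores, neg_scores):
        for s_neg in negs:
            # Zero loss once the correct triple beats the corruption by the margin.
            total += max(0.0, margin - s_pos + s_neg)
    return total


s_pos = distance_score([0.0, 0.0], [0.6, 0.8])   # close pair
s_neg = distance_score([0.0, 0.0], [3.0, 4.0])   # distant pair
print(s_pos, s_neg, contrastive_loss([s_pos], [[s_neg]]))
```

Unlike the classification loss, each term here depends on a pair of triples, so the encoder is penalized only when a corruption is scored too close to (or above) its correct counterpart.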

3.3 Training and Inference Details

Training and Inference Strategies.

The final loss to train the proposed STELP is a weighted sum of the two losses defined in the two aforementioned objectives, i.e.,

$$\mathcal{L} = \mathcal{L}^c + \gamma \mathcal{L}^d, \tag{12}$$

where $\mathcal{L}^c$ and $\mathcal{L}^d$ are the classification and contrastive losses and $\gamma$ is a trade-off weight between the two. After optimizing the proposed model w.r.t. $\mathcal{L}$, the score $s^c$ from the classification objective, the score $s^d$ from the contrastive objective, or their integration can be used as the ranking basis during inference. We will present a thorough empirical study of the possible options for a ranking score based on $s^c$ and $s^d$ in the experiments.

Training Efficiency.

Since the computation overhead is dominated by the computations inside the Transformer encoder, we focus on analyzing the complexity of computing the contextualized embeddings with the encoder. In practice, the sequence lengths of the two parts of a triple are similar because an entity's context is usually much longer than a relation's context, especially when the entity description is included (Xiao et al., 2017; Yao et al., 2019). Since the complexity of the Transformer encoder grows quadratically with the sequence length (Vaswani et al., 2017), encoding one sequence of length $L$ costs on the order of $L^2$, whereas encoding two sequences of length about $L/2$ costs on the order of $2 (L/2)^2 = L^2/2$. Hence, the proposed Siamese-like model is about twice as efficient as the KG-BERT baseline during training.

Inference Efficiency.

Since the time cost of inference is also dominated by the computations inside the Transformer encoder, we again focus on analyzing the complexity of computing the contextualized embeddings with the encoder. For the inference of one triple (e.g., using the head and relation to predict the tail), the KG-BERT baseline requires $\mathcal{O}(L^2 |\mathcal{E}|)$ per-token operations while our method only costs $\mathcal{O}\left((L/2)^2 (1 + |\mathcal{E}|)\right)$ operations, where $L$ is the length of the triple context and $|\mathcal{E}|$ is the number of all unique entities in the graph. This results in about $4\times$ acceleration when $|\mathcal{E}| \gg 1$ (which holds almost always). Then, for a thorough inference on a graph, KG-BERT requires $\mathcal{O}(L^2 |\mathcal{E}|^2 |\mathcal{R}|)$ operations versus $\mathcal{O}\left((L/2)^2 |\mathcal{E}| (1 + |\mathcal{R}|)\right)$ for our method, where $|\mathcal{R}|$ is the number of unique relations in the graph. Because $|\mathcal{E}|$ usually exceeds hundreds of thousands, our approach is significantly faster than the baseline. Lastly, on the test set of a benchmark dataset, our approach is empirically faster than the baseline by one or two orders of magnitude.
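To make the inference cost comparison concrete, the sketch below counts encoder operations under two simplifying assumptions: self-attention cost is quadratic in the sequence length, and each of the two parts is about half the length of the full triple context. The function names and the sequence length of 128 are hypothetical; the entity and relation counts are those of WN18RR in Table 1.

```python
def kgbert_query_ops(seq_len: int, num_entities: int) -> float:
    """One full-length encoding per candidate entity for a single (h, r, ?) query."""
    return seq_len ** 2 * num_entities


def stelp_query_ops(seq_len: int, num_entities: int) -> float:
    """One half-length (h, r) encoding plus one half-length encoding per entity."""
    return (seq_len / 2) ** 2 * (1 + num_entities)


def kgbert_graph_ops(seq_len: int, num_entities: int, num_relations: int) -> float:
    """Every (h, r) pair scored against every candidate tail."""
    return seq_len ** 2 * num_entities ** 2 * num_relations


def stelp_graph_ops(seq_len: int, num_entities: int, num_relations: int) -> float:
    """|E||R| head-relation encodings plus |E| tail encodings, all half-length."""
    return (seq_len / 2) ** 2 * num_entities * (1 + num_relations)


SEQ_LEN = 128                 # hypothetical context length
NUM_ENT, NUM_REL = 40943, 11  # WN18RR statistics from Table 1

query_speedup = kgbert_query_ops(SEQ_LEN, NUM_ENT) / stelp_query_ops(SEQ_LEN, NUM_ENT)
graph_speedup = (kgbert_graph_ops(SEQ_LEN, NUM_ENT, NUM_REL)
                 / stelp_graph_ops(SEQ_LEN, NUM_ENT, NUM_REL))
print(f"per-query speed-up: {query_speedup:.3f}x")
print(f"full-graph speed-up: {graph_speedup:.0f}x")
```

These counts ignore the cheap classifier and distance computations applied on top of the cached embeddings, so they describe encoder cost only and are not a prediction of wall-clock time.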

4 Experiment

In this section, we first introduce the benchmark datasets and experimental settings. We then report the evaluation results on three benchmarks, which demonstrate the advantages of our approach in terms of effectiveness, generalization and efficiency. Finally, we conduct several quantitative and qualitative analyses, including thorough ablation and case studies, which offer insights into the advantages of our approach.

| Dataset | # Ent | # Rel | # Train | # Dev | # Test |
|---|---|---|---|---|---|
| WN18RR | 40943 | 11 | 86835 | 3034 | 3134 |
| FB15k-237 | 14541 | 237 | 272115 | 17535 | 20466 |
| UMLS | 135 | 46 | 5216 | 652 | 661 |

Table 1: Summary statistics of benchmark datasets.

4.1 Experimental Settings

Benchmark Datasets.

We assessed the proposed approach on three link prediction benchmarks, whose statistics are listed in Table 1. First, WN18RR (Dettmers et al., 2018) is a popular link prediction benchmark dataset derived from WordNet (Miller, 1998). It consists of English phrases and their semantic relations (e.g., hypernym and synonym). Second, FB15k-237 (Toutanova et al., 2015) is a subset of Freebase (Bollacker et al., 2008) that consists of real-world named entities and their factoid relations. Third, UMLS (Dettmers et al., 2018) is a small knowledge graph containing medical semantic entities and their relations. Note that WN18RR and FB15k-237 are updated from WN18 and FB15k (Bordes et al., 2013) respectively because, as stated by Dettmers et al. (2018), the original datasets suffer from test leakage: more than 80% of the test triples can be found in the training set with another relation. As a result, a simple rule-based model (Dettmers et al., 2018) can easily achieve state-of-the-art results on them. Additionally, in line with prior context-based encoding approaches (Xiao et al., 2017; Yao et al., 2019), we employed entity descriptions as the entity contexts for WN18RR and FB15k-237, taken from synonym definitions (Miller, 1998) and Wikipedia paragraphs (Zuo et al., 2018) respectively. As for relation contexts, we straightforwardly used their text contents.

| Method | WN18RR Hits@10 | WN18RR MR | FB15k-237 Hits@10 | FB15k-237 MR | UMLS Hits@10 | UMLS MR |
|---|---|---|---|---|---|---|
| *GCN-based embedding approach* | | | | | | |
| R-GCN (Schlichtkrull et al., 2018) | .207 | 6700 | .300 | 500 | - | - |
| SACN (Shang et al., 2019) | .540 | - | .540 | - | - | - |
| R-GAT (Nathani et al., 2019) | .581 | 1940 | **.626** | 210 | - | - |
| *Conv-based embedding approach* | | | | | | |
| ConvE (Dettmers et al., 2018) | .531 | 4464 | .497 | 245 | .990 | 1.51 |
| ConvKB (Nguyen et al., 2017) | .558 | 1295 | .471 | 216 | - | - |
| CapsE (Vu et al., 2019) | .560 | 719 | .593 | 303 | - | - |
| *Translation-based embedding approach* | | | | | | |
| TransE (Bordes et al., 2013) | .532 | 2300 | .441 | 323 | .989 | 1.84 |
| DistMult (Yang et al., 2014) | .504 | 7000 | .446 | 512 | .846 | 5.52 |
| ComplEx (Trouillon et al., 2016) | .530 | 7882 | .450 | 645 | .967 | 2.59 |
| KBGAN (Cai and Wang, 2017) | .469 | - | .458 | - | - | - |
| RotatE (Sun et al., 2019) | .571 | 3340 | .533 | 177 | - | - |
| TuckER (Balažević et al., 2019) | .526 | - | .544 | - | - | - |
| *Context-based encoding approach* | | | | | | |
| KG-BERT (Yao et al., 2019) | .524 | 97 | .420 | 153 | .990 | **1.47** |
| STELP (ours) | **.709** | **51** | .482 | **117** | **.991** | 1.49 |

Table 2: Link prediction results on WN18RR, FB15k-237 and UMLS. On WN18RR and FB15k-237, some results are copied from Nathani et al. (2019) whereas the others are taken from the original papers; all UMLS results are copied from Yao et al. (2019), except ConvE, which is from our re-implementation. Numbers in bold denote state-of-the-art performance.

Training and Inference Settings.

In the training phase, the initialization of the Transformer encoder alternated between BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and the model was fine-tuned by mini-batch SGD with the Adam optimizer. For the hyperparameters, the batch size was set to 16, the learning rate was set to , the number of training epochs was set to 7, and 5 negative triples were sampled for each positive triple as in Yao et al. (2019). In addition, after being tuned among {0.5, 1.0, 2.0}, the trade-off weight $\gamma$ in Eq.(12) was empirically set to 1.0. In the inference phase, given a test triple of a KG as the correct candidate, all other entities in the KG act as improper candidates to corrupt either its head or its tail entity. The link prediction task aims at distinguishing the correct one from all others, i.e., ranking the correct triple above the corrupted ones. Here we used the “filtered” setting (Bordes et al., 2013) to ensure that all known correct triples are removed from the candidates except the current test triple itself. The evaluation metrics cover two aspects: (1) mean rank (MR) and mean reciprocal rank (MRR) directly reflect the absolute ranking; and (2) Hits@$k$ stands for the ratio of test examples whose correct candidate is ranked in the top-$k$. In addition, although there are two ranking scores derived from the classification and contrastive objectives respectively, only the one from the classification objective, $s^c$, is used as the ranking basis during inference; other options are discussed in §4.5.
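The filtered ranking protocol and the metrics above can be sketched as follows. This is an illustrative implementation, not the authors' evaluation script: the function names are hypothetical, scores are assumed to be "higher is better", and ties are resolved optimistically (a candidate must score strictly higher to outrank the correct entity).

```python
from typing import Dict, List, Sequence, Set


def filtered_rank(scores: Dict[str, float],
                  correct: str,
                  known_correct: Set[str]) -> int:
    """Rank (1 = best) of the correct entity among candidate entities.

    Other entities that also form known correct triples are skipped,
    which is the "filtered" setting described in the text.
    """
    target = scores[correct]
    rank = 1
    for entity, score in scores.items():
        if entity == correct or entity in known_correct:
            continue
        if score > target:  # optimistic tie handling (assumption)
            rank += 1
    return rank


def summarize(ranks: Sequence[int], ks: Sequence[int] = (1, 3, 10)) -> Dict[str, float]:
    """Mean rank, mean reciprocal rank and Hits@k over a list of ranks."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"Hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Toy query: "a" outranks the answer "c" but is itself a known correct tail.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
print(filtered_rank(toy_scores, correct="c", known_correct={"a"}))
print(summarize([1, 2, 4]))
```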

4.2 Evaluation Metrics on Link Prediction Benchmarks

The link prediction results of competitive approaches and ours on the three benchmark datasets are shown in Table 2. Our proposed STELP achieves state-of-the-art or competitive performance on all these datasets. The improvement is especially significant in terms of mean rank (MR), owing to the strong generalization of the context-based encoding approach, which is further analyzed in the sections below. In particular, on WN18RR, our approach outperforms all graph embedding approaches (including translation-based, conv-based and GCN-based) as well as the context-based encoding competitor. On FB15k-237, our approach surpasses all other methods by a large margin in terms of MR. Although STELP is inferior on Hits@10 to some graph embedding approaches, it still remarkably outperforms the baseline from the same genre.

4.3 Comparison with KG-BERT Baseline

| Model | Hits@1 | Hits@3 | Hits@10 | MR | MRR | T/Ep | Infer Time |
|---|---|---|---|---|---|---|---|
| KG-BERT (BERT-base) | .041 | .302 | .524 | 97 | .216 | 40min | 32h |
| STELP (BERT-base) | .222 | .436 | .647 | 99 | .364 | 20min | 0.9h |
| KG-BERT (RoBERTa-base) | .130 | .320 | .636 | 84 | .278 | 40min | 32h |
| STELP (RoBERTa-base) | .202 | .410 | .621 | 71 | .343 | 20min | 0.9h |
| KG-BERT (RoBERTa-large) | .119 | .387 | .698 | 95 | .297 | 79min | 92h |
| STELP (RoBERTa-large) | .243 | .491 | .709 | 51 | .401 | 55min | 1.0h |

Table 3: Extensive comparisons with the KG-BERT baseline on WN18RR. Note that “T/Ep” stands for training time per epoch and “Infer Time” denotes the inference time of link prediction on the test set.

Since our approach is an update of the non-Siamese-like baseline, KG-BERT (Yao et al., 2019), we compared our approach with KG-BERT on WN18RR in detail, including different initialization methods for the Transformer encoder. As shown in Table 3, in contrast to the baseline, our proposed STELP consistently reaches superior performance on most metrics. Note that we do not show results for BERT-large initialization since we found that neither model converged, and we used our re-implemented version of KG-BERT to support RoBERTa initialization. As for the empirical efficiency of training and inference, the timing data were collected by running the experimental code on a single NVIDIA RTX6000 with mixed-precision floating-point computation. As shown on the right of Table 3, our model is much faster than the non-Siamese-like baseline in both training and inference. Specifically, per Table 3, our Siamese-like architecture accelerates training by roughly 1.4× to 2× and inference by roughly 36× to 92×, which is roughly consistent with the theoretical analysis in §3.3. It is also worth mentioning that STELP with the RoBERTa-base model is competitive with KG-BERT with the RoBERTa-large model, with fewer learnable parameters and much higher efficiency.

4.4 Generalization to Open-Set Link Prediction

Context-based encoding approaches are inherently more generalizable to unseen entities than graph embedding approaches, since the contexts or mentions are used to generate the embeddings rather than only the nodes themselves. This is more prominent when the entities are not a closed set, i.e., when some entities unseen during training appear in the test set. For example, 209 out of 3134 and 29 out of 20466 test triples involve unseen entities on WN18RR and FB15k-237 respectively. This inevitably hurts the performance of graph embedding based approaches, especially on the unnormalized metric, i.e., mean rank.

Further, to quantitatively compare the generalization performance, we constructed two probing settings based on WN18RR. The first probing task keeps the training set unchanged but makes the test set consist only of the test triples involving unseen entities. In the second probing task, we conducted a more reasonable comparison by supporting inductive representations (Hamilton et al., 2017) for unseen entities in a translation-based approach, and thus made the following changes: (1) 1900 entities were sampled from the test set, and only test triples containing at least one of the sampled entities were kept, resulting in 1758 test triples in this probing task; (2) the training triples that do not contain the sampled entities are used as the new training set; and (3) the training triples containing exactly one of the sampled entities are used as a support set to inductively generate the embeddings of the unseen entities via the translation formula, such as “$h + r \approx t$” in TransE (Bordes et al., 2013). The second probing setting assigns the unseen entities competent embeddings, thus leading to a fairer comparison than the first one. Note that if an unseen entity is involved in multiple triples in the support set, the average over the multiple inductive representations is used as its single vector representation.
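The inductive step of the second probing task can be sketched as below, using the TransE relation h + r ≈ t: an unseen tail is estimated as h + r and an unseen head as t − r, and multiple estimates are averaged. The function name `induce_embedding` and the toy embeddings are hypothetical; this is an illustration of the described procedure, not the authors' code.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Vector = List[float]


def induce_embedding(entity: str,
                     support: Sequence[Triple],
                     ent_emb: Dict[str, Vector],
                     rel_emb: Dict[str, Vector]) -> Optional[Vector]:
    """Estimate an unseen entity's embedding from support triples.

    Uses h + r ~ t: as a tail the entity is estimated by h + r,
    as a head by t - r. Estimates are averaged; None if no support.
    """
    estimates: List[Vector] = []
    for h, r, t in support:
        if t == entity and h in ent_emb:
            estimates.append([a + b for a, b in zip(ent_emb[h], rel_emb[r])])
        elif h == entity and t in ent_emb:
            estimates.append([a - b for a, b in zip(ent_emb[t], rel_emb[r])])
    if not estimates:
        return None
    dim = len(estimates[0])
    return [sum(e[i] for e in estimates) / len(estimates) for i in range(dim)]


# Toy embeddings (made up): entity "x" is unseen during training.
ent_emb = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
rel_emb = {"r": [1.0, 1.0]}
support = [("a", "r", "x"), ("x", "r", "b")]
print(induce_embedding("x", support, ent_emb, rel_emb))
```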

| Setting | Model | Hits@1 | Hits@3 | Hits@10 | MR | MRR |
|---|---|---|---|---|---|---|
| Original task | STELP (ours) | .243 | .491 | .709 | 51 | .401 |
| | RotatE | .428 | .492 | .571 | 3340 | .476 |
| | TransE | .042 | .441 | .532 | 2300 | .243 |
| First probing task | STELP (ours) | .307 | .486 | .683 | 70 | .431 |
| | RotatE | .005 | .007 | .012 | 17955 | .007 |
| | TransE | .000 | .007 | .016 | 20721 | .007 |
| Second probing task | STELP (ours) | .301 | .497 | .676 | 99 | .427 |
| | TransE | .005 | .121 | .210 | 13102 | .078 |

Table 4: Probing tasks based on WN18RR for analyzing models’ generalization performance.

As shown in Table 4, our proposed STELP achieves similar results across the different settings, whereas even state-of-the-art graph embedding approaches (e.g., RotatE) show a substantial performance drop on the first probing task. Even when we used the translation formula to inductively complete the embeddings of unseen entities in the second probing task, the performance decrease of TransE is far more significant than that of our approach. This verifies the promising generalization of our model to unseen entities compared to popular graph embedding approaches.

4.5 Ablation Study

| Perspective | Detail | Hits@1 | Hits@3 | Hits@10 | MR | MRR |
|---|---|---|---|---|---|---|
| Full model | STELP (RoBERTa-large) | .243 | .491 | .709 | 51 | .401 |
| Objective | w/o contrastive objective | .255 | .474 | .685 | 68 | .399 |
| | w/o classification objective | .167 | .433 | .653 | 67 | .337 |
| Concatenation, e.g., Eq.(3-4) | [h, r] vs. [r, t] | .049 | .272 | .520 | 106 | .204 |
| | [h] vs. [r, t] | .270 | .461 | .668 | 51 | .402 |
| Distance measure in Eq.(10) | BiLinear | .231 | .403 | .605 | 79 | .354 |
| | Cosine Similarity | .313 | .503 | .691 | 76 | .439 |
| Ranking Basis | | .252 | .494 | .701 | 62 | .406 |
| | | .252 | .495 | .706 | 48 | .408 |
| | | .252 | .495 | .704 | 51 | .408 |
Table 5: Results of ablation study on WN18RR. Note that full model denotes using two objectives for training, “ [h, r] vs. [t]” as concatenation scheme, Euclidean distance as distance measurement, and as ranking basis during inference. And denotes scaling all scores to .

To further explore the effectiveness of each module, we conducted an extensive ablation study from various perspectives, reported in Table 5. (1) Ablating Objective: First, each of the components in Eq.(12) was dropped in turn, and varying degrees of performance reduction were observed. Note that, in the setting without the classification objective, the score derived from the contrastive objective was used as the ranking basis. (2) Choice of contexts’ concatenation: How to concatenate and encode the contexts from a triple is also non-trivial for the proposed approach. The two other options listed in the table achieve only sub-optimal results. (3) Choice of distance measuring method: Two other methods, i.e., Bilinear and Cosine similarity, were also applied to Eq.(10) for measuring the distance between the two representations from the head and tail entities respectively. The results demonstrate that Cosine similarity achieves a very competitive performance and even outperforms the Euclidean distance of the full model in terms of Hits@1, Hits@3 and MRR. (4) Choice of Ranking Basis: Since two scores can be derived from the classification and contrastive objectives respectively, it is straightforward to integrate them in either an additive or a multiplicative way as an ensemble score. However, as shown at the bottom of Table 5, the combined ranking bases achieve similar performance on most metrics. We further calculated the Pearson correlation between the two scores and found the coefficient and its p-value are and respectively. This means the two scores are highly linearly correlated, which explains why they achieve similar results.
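The comparison of ranking bases can be illustrated with a small sketch. It assumes both scores are min-max scaled to [0, 1] before being combined, which is an assumption made here and not necessarily the scaling used in the paper; `minmax`, `compare_ranking_bases`, and the toy scores are hypothetical names and values. Only the Pearson coefficient is computed (via `numpy.corrcoef`), not its p-value.

```python
import numpy as np

def minmax(x):
    """Scale scores to [0, 1]; a constant vector maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def compare_ranking_bases(s_cls, s_con):
    """Return the Pearson coefficient of the two scores and the candidate
    ordering (best first) induced by each ranking basis."""
    s_cls, s_con = minmax(s_cls), minmax(s_con)
    bases = {
        "cls": s_cls,           # classification score alone
        "con": s_con,           # contrastive score alone
        "sum": s_cls + s_con,   # additive ensemble
        "prod": s_cls * s_con,  # multiplicative ensemble
    }
    orders = {name: np.argsort(-score, kind="stable")
              for name, score in bases.items()}
    pearson = float(np.corrcoef(s_cls, s_con)[0, 1])
    return pearson, orders

# Toy scores for four candidates; the second score is an exact linear
# function of the first, i.e. the extreme case of linear correlation.
s_cls = np.array([0.9, 0.2, 0.6, 0.1])
s_con = 2.0 * s_cls + 1.0
pearson, orders = compare_ranking_bases(s_cls, s_con)
```

When the two scores are (almost) linearly related, every basis sorts the candidates in (almost) the same order, which is consistent with the near-identical rows at the bottom of Table 5.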

4.6 Case Study

In this subsection, we conduct several case studies to gain insight into the proposed approach.

sensitive: able to feel or perceive.
sensitive: responsive to physical stimuli.
sensitive: being susceptible to the attitudes, feelings, or circumstances of others.
Table 6: An example for the polysemy problem in WordNet: three meanings for the word “sensitive” are viewed as three separate nodes in WordNet.
Figure 2: A detailed comparison between the proposed STELP and RotatE across different relations on the WN18RR test set. The number in parentheses denotes the proportion of test triples with the corresponding relation. Note that the relation “similar to” is omitted since its proportion is less than 0.1%.

Why does STELP achieve better Hits@10 but worse Hits@1 than RotatE?

As shown in the top part of Table 4, we compare the proposed STELP with a recently proposed translation-based approach, RotatE, on all metrics. STELP outperforms RotatE on Hits@10 by a large margin but is inferior on Hits@1. The context-based encoding baseline, KG-BERT, shows the same pattern. To uncover the difference between context-based encoding approaches (e.g., KG-BERT, STELP) and graph embedding approaches (e.g., TransE, RotatE) behind this phenomenon, we conduct a case study on WN18RR inference. In particular, consider the oracle test triple (sensitive, derivationally related form, sense). After corrupting the tail entity and ranking with our STELP model, the top-12 tail candidates are (sensitive, sensitivity, sensibility, sensing, sense impression, sentiency, sensitive, sense, feel, sensory, sensitive, perceptive). Note that a word is represented by multiple entities in WordNet when it is polysemous, so the head entity and the 1st, 7th and 11th candidate tail entities, which all carry the text “sensitive”, express different meanings, as shown in Table 6. Among the top-12 tail candidates produced by the context-based encoding model, the gold tail entity is ranked only 8th, and many semantically similar tail entities also fit the oracle test triple; these act as false negative labels for a context encoder. This is not an issue for graph embedding approaches, because they consider only the structure of the graph and ignore the contexts. In addition, one of the highly ranked candidates with the text “sensitive” is identical to the oracle head entity, which matches the relation derivationally related form to a certain extent but violates the label in WN18RR. It is also worth mentioning that the “polysemy” or “ambiguity” problem is more severe in Freebase, since many named entities with the same text express totally different meanings. For example, many people share the same name, and an entity such as The Avengers could be a movie, a soundtrack album or a punk rock band. This could also partially explain why both KG-BERT and the proposed STELP achieve only competitive performance on FB15K-237.

Why does STELP bring significant improvements?

We further discuss several inference cases to compare the proposed STELP with RotatE. We first randomly selected an oracle test triple, (accentuation, hypernym, stress), on which STELP performs well while RotatE fails: STELP ranks the oracle head in third place, whereas RotatE ranks it at 21068. STELP achieves this by capturing rich semantic information, while RotatE only exploits the structural information observed in triples. Hence, STELP can distinguish semantically related entities well thanks to its capability of contextual generalization. Nonetheless, although the proposed STELP can acquire structural information via the contrastive objective, it can still be inferior in some circumstances. For example, when predicting the head entity for the triple (position, hypernym, point), RotatE reaches a better rank than STELP. Since STELP is good at capturing semantics, it assigns high ranking scores to many relevant entities, such as “point of intersection”, “location” and “address”. In contrast, RotatE does not favor entities with similar semantics because it employs only structural information. In addition, a detailed comparison between STELP and RotatE across different relations is shown in Figure 2. The proposed STELP achieves consistent performance across different relations and is more stable than the graph embedding approach. However, STELP performs worse on certain relations, e.g., verb group. We therefore inspected the test triples with the verb group relation to explain the performance gap, and found that almost half of them contain entities with the same word but different meanings (i.e., the aforementioned polysemy), such as (strike, verb group, strike) and (match, verb group, match). This hinders the context-based STELP model from ranking correctly.

Figure 3: Four randomly sampled comparative cases of frequency histograms of the classification score assigned to all tail corruptions of a triple. In each comparison, for the same test triple, the upper histogram is obtained from “full STELP” and the lower one from “STELP w/o contrastive objective”, and the text above each histogram shows the rank and the score of the corresponding un-corrupted (i.e., oracle) triple. Note that the interval from 0.0 to 0.1 is removed because the scores of most negative triples fall into it.

How does the proposed two-objective learning affect the ranking scores?

To demonstrate the effectiveness of the proposed two-objective learning over a single objective, in addition to the quantitative study in §4.5, we further compared the ranking scores derived from the classification objective for negative triples under “full STELP” and “STELP w/o contrastive objective”. As shown in Figure 3, each frequency histogram exhibits the distribution of the scores assigned to all tail corruptions of a triple, where the x-axis denotes the score and the y-axis denotes the frequency over the number of all corruptions. Comparing the frequency histograms of “full STELP” (above, in blue) and “STELP w/o contrastive objective” (below, in orange), we observe that the model trained with two-objective learning substantially reduces the number of false positive predictions and thus improves ranking performance. This also verifies that the contrastive objective regularizes the classification objective and alleviates the over-confidence problem (§3.2), leading to more accurate and reliable ranking scores for link prediction.
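A histogram of this kind can be produced with a few lines of NumPy. The sketch below is only an illustration of the construction described for Figure 3: scores in [0.0, 0.1) are dropped from the plot range, and bin counts are normalized by the total number of corruptions. The function name `corruption_histogram`, the bin count, and the toy scores are hypothetical.

```python
import numpy as np

def corruption_histogram(scores, low=0.1, high=1.0, bins=9):
    """Frequency histogram of scores assigned to a triple's corruptions.

    Scores below `low` are left out of the bins, but frequencies are still
    normalized by the number of all corruptions.
    """
    scores = np.asarray(scores, dtype=float)
    counts, edges = np.histogram(scores, bins=bins, range=(low, high))
    freq = counts / scores.size
    return freq, edges

# Toy scores of six corrupted triples: three near zero, one mildly
# confident, and two highly confident (potential false positives).
toy_scores = [0.01, 0.02, 0.05, 0.15, 0.95, 0.97]
freq, edges = corruption_histogram(toy_scores)
```

Mass in the right-most bins corresponds to corrupted triples that receive a high score, i.e., the false positive predictions discussed above.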

5 Related Work

Traditional Context-based Link Prediction.

In contrast to graph embedding approaches, which learn from the KG’s structure, context-based encoding approaches exploit textual information for relational knowledge. Socher et al. (2013) straightforwardly used continuous bag-of-words (CBoW) as the representation of each triple component, and then proposed a neural tensor network for relation classification. The word vectors are initialized by unsupervised pre-training on large corpora. Wang et al. (2014) embedded the entities of a KG and the entities’ text contents separately, and then aligned the embeddings of entities and contents in the same space. Xie et al. (2016) proposed a representation learning method for KGs via embedding entity descriptions, and explored a CNN encoder in addition to CBoW. To learn the model, they also used the objective that a vector integration of head and relation should be close to the vector of the tail, as in translation-based graph embedding approaches (Bordes et al., 2013). Xiao et al. (2017) incorporated contextualized embeddings of descriptions with the symbolic ones, and presented a topic-related objective besides the traditional one. These approaches share two defects: (1) the neural model is too shallow to produce expressively powerful representations, and (2) the two entities and their relation in a triple are modeled separately, regardless of the rich contextualized knowledge across them.

Pre-Trained Model Fine-Tuning for Link Prediction.

Recently, fine-tuning pre-trained language models (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019) has proven effective on a variety of downstream NLP tasks. Most of these language models are based on the Transformer encoder (Vaswani et al., 2017). They can be easily adapted to various task settings (e.g., sequence classification, tagging and question answering) via special-token insertion, concatenation and scoring (Devlin et al., 2019). However, applying a pre-trained language model to KG-related tasks has rarely been studied, and most of the relevant works concentrate on mining commonsense knowledge from pre-trained models for KGs like ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019). For example, Davison et al. (2019) presented a zero-shot evaluation method for commonsense triple classification to analyze whether a pre-trained language model is equipped with commonsense knowledge. Malaviya et al. (2019) used a GCN and automatic graph densification to learn graph structure, coupled with a pre-trained language model as an encoder for enhancing the contextual representation of knowledge, to solve the commonsense link prediction task. More recently, KG-BERT (Yao et al., 2019) is the first work to tackle tasks on traditional KGs, such as WordNet and Freebase, by using a pre-trained language model as the encoder. It feeds a concatenation of a triple’s contexts into the encoder, followed by a neural classifier that predicts the triple’s correctness. It hence benefits from rich contextualized information, but at the cost of computational overhead. In contrast to the KG-BERT baseline, our approach hits the sweet spot between efficiency and contextualization, and leverages two objectives and a Siamese-like architecture to boost the effectiveness and the generalization performance to unseen entities on link prediction. Both theoretical and empirical analyses show that our approach is much more efficient than the baseline, without sacrificing performance.
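A rough back-of-the-envelope count illustrates where the efficiency gap comes from. The sketch below is a simplification under stated assumptions: it considers only tail ranking, counts Transformer forward passes and ignores the cheap scoring step, and assumes that a Siamese-like (two-tower) model can encode each candidate entity once and reuse the result across queries. The function names and the problem sizes are hypothetical and are not figures from the paper.

```python
def cross_encoder_passes(num_queries, num_entities):
    """A model that encodes the whole triple jointly needs one encoder
    pass per (query, candidate tail) pair."""
    return num_queries * num_entities

def two_tower_passes(num_queries, num_entities):
    """A Siamese-like model encodes each (head, relation) query once and
    each candidate tail once, then scores pairs without the encoder."""
    return num_queries + num_entities

# Hypothetical sizes: 1,000 test queries ranked against 40,000 entities.
joint = cross_encoder_passes(1_000, 40_000)
siamese = two_tower_passes(1_000, 40_000)
speedup = joint / siamese
```

Under these assumptions the number of encoder passes grows with the product of queries and candidates for the joint encoder, but only with their sum for the two-tower design.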

Joint Embedding of Text and Knowledge Graph.

Our approach is also related to works that combine knowledge graphs with text information (Toutanova et al., 2015; Yamada et al., 2016). This is achieved by a joint embedding that represents the KG’s entities/relations and their textual knowledge in the same space. These works usually employ large-scale unlabeled corpora and use texts in which entities co-occur to enrich the graph embeddings. For example, taking into account the sharing of sub-structure among the textual relations in a large-scale corpus, Toutanova et al. (2015) applied a CNN to the lexicalized dependency paths of textual relations to obtain augmented relation representations. The representations can be fed into any traditional graph embedding approach (e.g., Yang et al., 2014) to improve effectiveness on link prediction tasks. In contrast, our work operates only on homogeneous textual data and employs the contexts of the entities/relations themselves (i.e., only their own text contents or descriptions), rather than acquiring textual knowledge (e.g., the textual relations of Toutanova et al. (2015)) from large-scale corpora to enrich traditional graph embeddings via joint embedding.

6 Conclusion

In this work, we develop a novel semantic triple encoder for more efficient context-based link prediction, which generalizes better to an open set of entities. The Transformer-based encoder separately processes the contexts of the two parts of a given triple and then passes both contextualized representations into downstream objective-specific modules. By jointly training the encoder toward two complementary objectives, i.e., classification and contrastive losses, the proposed STELP is able to output more accurate ranking scores for inference. The empirical evaluation and thorough analyses on several mainstream benchmark datasets show that our approach outperforms recent baselines, requires much less computation than other context-based encoding approaches, and easily generalizes to unseen entities in the inference phase.