KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation

11/13/2019 ∙ by Xiaozhi Wang, et al. ∙ 0

Pre-trained language representation models (PLMs) learn effective language representations from large-scale unlabeled corpora. Knowledge embedding (KE) algorithms encode the entities and relations in knowledge graphs into informative embeddings for knowledge graph completion and provide external knowledge for various NLP applications. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which not only better integrates factual knowledge into PLMs but also effectively learns knowledge graph embeddings. KEPLER uses a PLM to encode textual descriptions of entities as their entity embeddings, and then jointly learns the knowledge embeddings and language representations. Experimental results on NLP tasks such as relation extraction and entity typing show that KEPLER achieves results comparable to the state-of-the-art knowledge-enhanced PLMs without any additional inference overhead. Furthermore, we construct Wikidata5m, a new large-scale knowledge graph dataset with aligned text descriptions, to evaluate KE methods in both the traditional transductive setting and the challenging inductive setting, which requires models to predict entity embeddings for unseen entities. Experiments demonstrate that KEPLER achieves good results in both settings.




1 Introduction

Pre-trained language representation models (PLMs) such as ELMo (Peters et al., 2018a), BERT (Devlin et al., 2019a) and XLNet (Yang et al., 2019) learn effective language representations from large-scale unstructured and unlabelled corpora and achieve great performance on various NLP tasks. However, they typically lack factual world knowledge (Petroni et al., 2019; Logan et al., 2019).

Recent works (Zhang et al., 2019; Peters et al., 2017; Liu et al., 2019a) utilize entity embeddings of large-scale knowledge bases to provide external knowledge for PLMs and improve their performance on various NLP tasks. However, they have some issues: (1) They use fixed entity embeddings learned by a separate knowledge embedding (KE) algorithm, which cannot be easily aligned with the language representations because they are essentially in two different vector spaces. (2) They require an entity linker to link the words in context to corresponding entities so that they can benefit from the entity embeddings, which makes them suffer from the error propagation problem. (3) Their sophisticated mechanisms to retrieve and use entity embeddings lead to additional inference overhead compared with vanilla PLMs.

Knowledge embedding methods in fact have a strong connection with NLP models. Many works integrate knowledge embeddings into NLP models to improve applications such as machine translation (Zaremoodi et al., 2018), reading comprehension (Mihaylov and Frank, 2018; Zhong et al., 2019) and dialogue systems (Madotto et al., 2018), and some early works use text as additional information (Xie et al., 2016; An et al., 2018) or jointly train knowledge and text embeddings in the same space (Wang et al., 2014; Toutanova et al., 2015; Han et al., 2016; Cao et al., 2017, 2018).

In this paper, we propose to learn knowledge embedding and language representation with a unified model and encode them into the same semantic space, which can not only better integrate knowledge into PLMs but also help to learn more informative knowledge embeddings with the effective language representations. We propose KEPLER, which is short for “a unified model for Knowledge Embedding and Pre-trained LanguagE Representation”. We collect informative textual descriptions for entities in the knowledge graph and use a typical PLM to encode the descriptions as text embeddings; we then treat the description embeddings as entity embeddings and optimize a KE objective function on top of them. The key idea is to encode structural knowledge in the textual representation of entities using a PLM, which can generalize to unobserved entities in the knowledge graph.

Our KEPLER enjoys the following advantages: (1) We integrate world knowledge into PLMs with the supervision of the KE objective, which is more flexible for the PLMs, and encode the entity and text into the same space, which avoids the gap between the language representations and fixed entity embeddings. (2) We do not need an entity linker or additional mechanisms to retrieve corresponding entity embeddings, which avoids the error propagation problem and extra overhead. During inference, our KEPLER is exactly the same as standard PLMs, which can be adopted in a wide range of NLP applications. (3) Different from conventional KE methods, our KEPLER encodes textual entity descriptions as entity embeddings, which enables our model to infer knowledge embedding in the inductive setting (get entity embeddings for the unseen entities). This is especially useful for deployment, where the model may deal with unseen entities.

The existing KE datasets are relatively small-scale, which is not sufficient to pre-train a large model, and they typically lack description data and a data split for the inductive setting. Therefore, we construct Wikidata5m, a new large-scale knowledge graph dataset with an aligned text description for each entity. Wikidata5m is a subset of Wikidata (Vrandečić and Krötzsch, 2014), a free knowledge base with about sixty million entities. To ensure each entity is informative and the knowledge base is as clean as possible, we only select the entities with corresponding Wikipedia pages. Our Wikidata5m contains five million entities and twenty million triplets. We also benchmark several classical KE methods on Wikidata5m to facilitate future research. To our knowledge, this is the first million-scale general knowledge graph dataset.

To summarize, our contribution is three-fold: (1) We propose to encode entities and texts into the same space and jointly train the KE and language modeling objectives, and then get a better knowledge-enhanced PLM which avoids error propagation and additional overhead. Experimental results on various NLP tasks demonstrate the effectiveness of our KEPLER. (2) We encode textual descriptions as entity embeddings, which improves KE with textual information and enables inductive KE. (3) We introduce a new large-scale knowledge graph dataset Wikidata5m, which may promote the research on large-scale knowledge graph, inductive knowledge embedding and interactions between knowledge graph and NLP.

2 Related Work

Figure 1: A demonstration for KEPLER structure. By jointly training with knowledge embedding (KE) and pre-training language representation model (PLM) objectives, our framework can implicitly incorporate knowledge into the language representation model.
Pre-trained Language Model

There has been a long history of pre-training in NLP. Early works focus on distributed word representation (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014), many of which are still often adopted in current models as word embeddings for their ability to capture syntactic and semantic information from large-scale corpora. Peters et al. (2018b) push this trend a step forward by using a bidirectional LSTM to capture contextualized word embeddings (ELMo) for richer semantic meanings under different circumstances.

Apart from those methods using pre-trained word embeddings as input features, there is another trend exploring pre-trained encoders. Dai and Le (2015) first propose to train an auto-encoder on unlabeled data, and then fine-tune it on downstream tasks. Howard and Ruder (2018) propose a universal language model (ULMFiT) based on AWD-LSTM (Merity et al., 2018). With the powerful Transformer (Vaswani et al., 2017) as its encoder, Radford et al. (2018) demonstrate a pre-trained generative model (GPT) and its effects, while Devlin et al. (2019b) release a pre-trained deep Bidirectional Encoder Representation from Transformers (BERT), achieving state-of-the-art results on dozens of benchmarks.

After Devlin et al. (2019b), similar pre-trained encoders have sprung up. Yang et al. (2019) propose a permutation language model (XLNet) based on TransformerXL (Dai et al., 2019). Later, Liu et al. (2019c) show that more data and more sophisticated parameter tuning substantially benefit pre-trained encoders and release a new state-of-the-art model (RoBERTa). Other works explore how to add more tasks (Liu et al., 2019b) and more parameters (Lan et al., 2019; Raffel et al., 2019) to pre-training models.

Recently, some works attempt to incorporate knowledge information in pre-training. Zhang et al. (2019) introduce pre-processed knowledge embeddings into the Transformer architecture of BERT (Devlin et al., 2019b). With similar ideas, Peters et al. (2019) incorporate an integrated entity linker in their models. Besides, Logan et al. (2019) and Hayashi et al. (2019) utilize relations between entities inside one sentence to help train better generation models. Despite the promising results these knowledge-enhanced techniques bring, they either use fixed external knowledge information or have complex structures or pipelines to handle entities within sentences.

Knowledge Graph Embeddings

In recent years, knowledge embeddings have been extensively studied through predicting missing links in graphs. Conventional models define score functions for relation triples and predict head or tail entities with scores of candidate entities. For example, TransE (Bordes et al., 2013) treats tail entities as translations of heads, while DistMult (Yang et al., 2015) uses matrix multiplications as score functions and ComplEx (Trouillon et al., 2016) adopts complex operations based on it. RotatE (Sun et al., 2019a) combines the advantages of both of them.

Among these works, Xie et al. (2016) propose to utilize entity descriptions as an external information source and introduce an entity description encoder to enhance the TransE score function. Though similar to our method, Xie et al. (2016) aim at utilizing entity descriptions to help knowledge representation learning, while we take entity descriptions as a tool to incorporate external knowledge in our model.

3 KEPLER Model

In this section, we introduce the structure of our KEPLER model, and how we combine two training goals of masked language modeling and knowledge representation learning.

3.1 Training Objectives

To incorporate world knowledge into our pre-trained language representation models (PLMs), we design a multi-task loss as shown in Figure 1 and Equation 1,

$$\mathcal{L} = \mathcal{L}_{KE} + \mathcal{L}_{LM} \qquad (1)$$

where $\mathcal{L}_{KE}$ represents the knowledge embedding loss and $\mathcal{L}_{LM}$ represents the language model loss. Since our PLMs are involved in both tasks, jointly optimizing the two objectives could implicitly integrate knowledge from external graphs with text encoders, while keeping the strong abilities of PLMs for syntactic and semantic understanding.

More specifically, we adopt a general format using negative sampling,


where $(h, r, t)$ is the correct triple from the knowledge graph and $(h', r, t')$ are negative sampling triples. $f$ is the score function, for which we have many choices. Different from conventional knowledge embedding methods, for the entity embeddings $\mathbf{h}$ and $\mathbf{t}$, instead of looking them up in embedding tables, we use PLMs as our text encoders to extract entity representations from their descriptions.

For $\mathcal{L}_{LM}$, many alternatives for pre-trained language representation can be used, e.g., the masked language model (Devlin et al., 2019b). Note that the two tasks share only the text encoder, and for each mini-batch, the text sampled for $\mathcal{L}_{KE}$ and $\mathcal{L}_{LM}$ is not (necessarily) the same.

3.2 Model Details

Though the KEPLER framework admits many alternative model structures and training objectives, for clarity we introduce here the specific one that we use in experiments.

Model Structure

We use the Transformer architecture (Vaswani et al., 2017) as in (Devlin et al., 2019b; Liu et al., 2019c), which we will not describe in detail. To be more specific, we use the RoBERTaBASE code and checkpoints in all our experiments since it is one of the state-of-the-art pre-trained models with acceptable computing requirements. Besides the training data and hyperparameters, one of the major differences between RoBERTa and BERT is that RoBERTa uses Byte-Pair Encoding (BPE) (Sennrich et al., 2016) to better tokenize rare words.

Given a sequence of tokens $(x_1, \dots, x_N)$, the input format is $[\text{CLS}], x_1, \dots, x_N, [\text{EOS}]$, where [CLS] and [EOS] are two special tokens. The model output at [CLS] is often used as the sentence representation.

PLM Objective

Inspired by BERT (Devlin et al., 2019b), MLM randomly selects of input tokens, among which are masked with the special mark [MASK], are replaced by another random token, and the rest remain unchanged. Under MLM, models try to predict the correct tokens and a cross-entropy loss is calculated over the selected positions.
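The corruption step described above can be sketched in a few lines. This is a minimal illustration, not the fairseq implementation: the function name `mlm_corrupt` is hypothetical, and the default rates (15% of tokens selected, of which 80% are masked and 10% replaced at random) are the commonly cited BERT values, assumed here because the percentages are not legible in this copy of the text.

```python
import random

def mlm_corrupt(tokens, vocab, select_prob=0.15, mask_frac=0.8,
                random_frac=0.1, mask_token="[MASK]", rng=None):
    """Return (corrupted_tokens, target_positions) for masked language modeling.

    Default rates are assumed BERT-style values, not taken from the text.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # position not selected: no prediction loss here
        targets.append(i)
        roll = rng.random()
        if roll < mask_frac:
            corrupted[i] = mask_token          # replace with the mask symbol
        elif roll < mask_frac + random_frac:
            corrupted[i] = rng.choice(vocab)   # random token (may equal the original)
        # otherwise the token is left unchanged but is still predicted
    return corrupted, targets
```

The cross-entropy loss would then be computed only at the positions in `targets`, with the original tokens as labels.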

We adopt the pre-trained checkpoint of RoBERTaBASE for the initialization of our model. However, we still keep MLM as one of our objectives to avoid catastrophic forgetting (McCloskey and Cohen, 1989) while training towards the KRL loss. Note that in our experiments, further pre-training from the RoBERTaBASE checkpoint with MLM alone brings no improvement, suggesting that the combination of the two tasks contributes most to the performance.

KE Objective

We use the loss formula from (Sun et al., 2019b) as our KE objective, which takes negative sampling (Mikolov et al., 2013) for efficient optimization:


where $(h, r, t)$ is the correct triple, $(h', r, t')$ are negative sampling triples, $\gamma$ is the margin, $\sigma$ is the sigmoid function, and $f$ is the score function, for which we choose to follow TransE (Bordes et al., 2013) for its simplicity and efficiency,


where we take the norm as . Due to the limit of computing resources, we take the negative sampling size as 1. The negative sampling policy is to fix the head entity and randomly sample a tail entity, and vice versa.
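As a rough illustration of this objective, the sketch below combines a TransE-style distance with the uniform negative-sampling loss of Sun et al., which has the form $-\log\sigma(\gamma - d(h, r, t)) - \frac{1}{n}\sum_{i}\log\sigma(d(h'_i, r, t'_i) - \gamma)$. Several points are assumptions rather than details confirmed by this copy of the text: the score is treated as a distance $d = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_p$ (smaller is better), the norm defaults to $p = 1$ because the value is not legible here, and the 50/50 choice between corrupting the head or the tail is illustrative. All function names are hypothetical.

```python
import math
import random

def transe_distance(h, r, t, p=1):
    """||h + r - t||_p; a smaller value means a more plausible triple."""
    return sum(abs(hi + ri - ti) ** p for hi, ri, ti in zip(h, r, t)) ** (1.0 / p)

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def ke_loss(pos, negs, gamma):
    """Negative-sampling loss; pos and each item of negs are (h, r, t) vector triples."""
    loss = -log_sigmoid(gamma - transe_distance(*pos))
    for neg in negs:
        loss -= log_sigmoid(transe_distance(*neg) - gamma) / len(negs)
    return loss

def corrupt(triple, entity_ids, rng):
    """Keep the head and sample a new tail, or vice versa (assumed 50/50)."""
    h, r, t = triple
    e = rng.choice(entity_ids)
    return (h, r, e) if rng.random() < 0.5 else (e, r, t)
```

With a negative sampling size of 1, `negs` would hold a single corrupted triple per positive triple.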

Different from conventional KE methods, we do not have an entity embedding lookup table. Instead, we use our KEPLER model to encode the corresponding entity descriptions and take the [CLS] outputs as the entity embeddings.
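The practical consequence of dropping the lookup table can be shown with a toy stand-in for the encoder. In the sketch below, `toy_encode` is a hypothetical deterministic bag-of-words function that merely plays the role of the PLM's [CLS] output; it is not the KEPLER encoder. The point is only that any entity with a description receives an embedding, including entities never seen during training.

```python
import hashlib

DIM = 8  # toy embedding size, chosen arbitrarily for illustration

def toy_encode(text, dim=DIM):
    """Hypothetical stand-in for a PLM's [CLS] output: mean of hashed word vectors."""
    words = text.lower().split()
    vec = [0.0] * dim
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for k in range(dim):
            vec[k] += digest[k] / 255.0 - 0.5
    n = max(len(words), 1)
    return [v / n for v in vec]

def entity_embedding(description, encoder=toy_encode):
    """An entity's embedding is the encoding of its description, not a table row."""
    return encoder(description)

# An entity absent from any training table still gets a vector from its text.
unseen = entity_embedding("a fictional city described only by this sentence")
```

Swapping `toy_encode` for a real text encoder would leave `entity_embedding` unchanged, which is what makes the inductive setting possible.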

3.3 Downstream Tasks

Like all BERT-like models, we fine-tune KEPLER on downstream tasks and use [CLS] output for sentence-level prediction and the outputs of all tokens for sequence labelling tasks (Devlin et al., 2019b). For supervised relation extraction and few-shot relation extraction, we follow the approaches from (Baldini Soares et al., 2019) and (Gao et al., 2019) respectively.

Dataset #entity #relation #training #validation #test
FB15K 14,951 1,345 483,142 50,000 59,071
WN18 40,943 18 141,442 5,000 5,000
FB15K-237 14,541 237 272,115 17,535 20,466
WN18RR 40,943 11 86,835 3,034 3,134
Wikidata5m 4,594,485 822 20,542,906 40,641 41,028
Table 1: Statistics of Wikidata5m compared with existing widely-used benchmarks.

4 Wikidata5m

We construct a new large-scale knowledge graph dataset with aligned text descriptions. Our dataset is built by integrating Wikidata (Vrandečić and Krötzsch, 2014), a large-scale open knowledge base, with Wikipedia. Each entity in the knowledge graph is aligned with its text description in Wikipedia pages. In the following sections, we will first introduce the data collection steps, and then give the benchmarks of popular KE methods on this dataset.

4.1 Data Collection

We pull the latest dumps of Wikidata and Wikipedia from their respective websites. We remove pages whose first paragraphs contain fewer than 5 words. For each entity, we align it to a Wikipedia page with the MediaWiki wbgetentities action API. The first section of each Wikipedia page is extracted as the description of the entity. Entities that have no corresponding Wikipedia pages are discarded.

To construct the knowledge graph, we retrieve all the statements in entity pages, and map the entities and relations in statements to their canonical IDs in Wikidata. A statement is considered to be a valid triplet if both of its entities can be aligned with Wikipedia pages and its relation has a non-empty page in Wikidata. The final knowledge graph dataset contains 4,813,455 entities, 822 relations and 21,344,269 triplets, where each entity has a text description. Statistics of our Wikidata5m dataset and four widely-used datasets are shown in Table 1. Top-5 entity categories are listed in Table 3. We can see that our Wikidata5m is much larger than existing knowledge graph datasets, covering all sorts of domains.
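The validity rule for triplets amounts to a simple filter. The sketch below is illustrative only: the function name and the two input sets are hypothetical, and the IDs merely follow Wikidata's Q/P naming pattern.

```python
def valid_triplets(statements, entities_with_pages, relations_with_pages):
    """Keep (head, relation, tail) statements whose two entities both align
    with a Wikipedia page and whose relation has a non-empty Wikidata page."""
    return [
        (h, r, t)
        for h, r, t in statements
        if h in entities_with_pages
        and t in entities_with_pages
        and r in relations_with_pages
    ]

statements = [
    ("Q42", "P31", "Q5"),    # both entities aligned, relation documented
    ("Q42", "P31", "Q999"),  # tail has no Wikipedia page: dropped
    ("Q42", "P000", "Q5"),   # relation page is empty: dropped
]
kept = valid_triplets(statements, {"Q42", "Q5"}, {"P31"})
```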

4.2 Data Split

The data split statistics for the conventional transductive setting are also shown in Table 1.

In this work, we also evaluate models on the challenging inductive setting, which requires the models to produce entity embeddings for entities which are not seen at the training time and also do link predictions for the unseen entities. So we provide a data split for the inductive setting evaluation. The statistics for the inductive setting data split are shown in Table 2. In the inductive setting, the entities and triplets in training, validation and test sets are mutually disjoint, while in the transductive setting, only the triplet sets are mutually disjoint.
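The difference between the two settings can be stated as a pair of checks on a candidate split. The helper names below are hypothetical; the logic follows the description above, where a transductive split needs disjoint triplet sets and an inductive split additionally needs disjoint entity sets.

```python
from itertools import combinations

def entities_of(triplets):
    """All head and tail entities appearing in a collection of (h, r, t) triplets."""
    return {e for h, _, t in triplets for e in (h, t)}

def pairwise_disjoint(sets):
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

def is_transductive_split(train, valid, test):
    """Only the triplet sets must be mutually disjoint."""
    return pairwise_disjoint([set(train), set(valid), set(test)])

def is_inductive_split(train, valid, test):
    """Triplet sets and entity sets must both be mutually disjoint."""
    return is_transductive_split(train, valid, test) and pairwise_disjoint(
        [entities_of(s) for s in (train, valid, test)]
    )
```

A split in which a test triplet reuses a training entity passes the first check but fails the second.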

Subset #entity #relation #triplet
Training 4,579,609 822 20,496,514
Validation 7,374 199 6,699
Test 7,475 201 6,894

Table 2: Statistics of Wikidata5m inductive setting.

Entity Type Occurrence Percentage
Human 1,517,591 31.5%
Taxon 363,882 7.56%
Wikimedia list 118,823 2.47%
Film 114,266 2.37%
Human Settlement 110,939 2.30%
Total 2,225,501 46.2%

Table 3: Top-5 entity categories in Wikidata5m.
Method MR MRR HITS@1 HITS@3 HITS@10
TransE (Bordes et al., 2013) 109370 0.253 0.170 0.311 0.392
DistMult (Yang et al., 2015) 211030 0.253 0.208 0.278 0.334
ComplEx (Trouillon et al., 2016) 244540 0.281 0.228 0.310 0.373
SimplE (Kazemi and Poole, 2018) 115263 0.296 0.252 0.317 0.377
RotatE (Sun et al., 2019b) 89459 0.290 0.234 0.322 0.390
Table 4: Performance of different knowledge graph embedding models on Wikidata5m.

4.3 Benchmarks

To assess the challenges of Wikidata5m, we benchmark several popular knowledge graph embedding models on the dataset. Since the conventional knowledge graph embedding models are inherently transductive, we split the triplets of knowledge graph into train, valid and test sets. Each model is trained on the training set and evaluated on the link prediction task.

We benchmark five knowledge graph embedding models: TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), SimplE (Kazemi and Poole, 2018) and RotatE (Sun et al., 2019b). Because their original implementations do not scale to Wikidata5m, we benchmark these methods using the multi-GPU implementation in GraphVite (Zhu et al., 2019). The performance of link prediction is evaluated in the filtered setting, where test triplets are ranked against all candidate triplets that are not observed in the knowledge graph. We report the standard metrics of Mean Rank (MR), Mean Reciprocal Rank (MRR) and Hits at N (HITS@N).
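For readers unfamiliar with these metrics, the following sketch shows one way to compute a filtered rank and the summary statistics. It is not the GraphVite evaluation code: the function names are hypothetical, scores are assumed to be distances (lower is better), and ties are resolved optimistically.

```python
def filtered_rank(scores, true_id, known_ids):
    """Rank of the true candidate among candidates not already observed.

    scores: dict mapping candidate entity -> distance (lower is better).
    known_ids: other candidates that also form observed triplets and are
    therefore removed from the ranking (the filtered setting).
    """
    true_score = scores[true_id]
    better = sum(
        1
        for cand, score in scores.items()
        if cand != true_id and cand not in known_ids and score < true_score
    )
    return 1 + better

def summarize(ranks, ns=(1, 3, 10)):
    """Mean Rank, Mean Reciprocal Rank and HITS@N over a list of ranks."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ns:
        metrics[f"HITS@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics
```

For example, if candidate "a" scores better than the true tail "c" but ("h", "r", "a") is itself an observed triplet, "a" is skipped and does not push the true answer down the ranking.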

Table 4 shows the benchmarks of popular methods on Wikidata5m.

5 Experiments

In this section, we introduce the experiment settings and experimental results of KEPLER on various NLP and KE tasks.

5.1 Pre-training settings

In experiments, we choose RoBERTa (Liu et al., 2019c) as our base model and implement our methods in the fairseq framework (Ott et al., 2019) for pre-training. Due to limited computing resources, we choose the RoBERTaBASE architecture and use the released roberta.base parameters to initialize our model.

In our pre-training procedure, we only use the English Wikipedia corpus to save time and also for a fair comparison with previous knowledge-enhanced PLMs (Zhang et al., 2019; Peters et al., 2019).

5.2 NLP Tasks

In this section, we introduce how our KEPLER can be used as a knowledge-enhanced PLM on various NLP tasks and its performance compared with state-of-the-art models.

Model P R F-1
BERT 67.23 64.81 66.0
BERT - - 70.1
ERNIE 69.97 66.08 67.97
MTB - - 71.50
RoBERTa 70.07 70.63 70.35
KnowBERT 71.60 71.40 71.50
KEPLER 70.43 73.02 71.70
Table 5: Results on the relation classification dataset TACRED (%). Results with , and are from Zhang et al. (2019), Baldini Soares et al. (2019) and Peters et al. (2019) respectively. BASE and LARGE identify whether the model uses a base or large version BERT-like architecture.

Relation Classification

Relation classification is an important NLP task that requires models to classify relation types between two given entities from text. We evaluate our model and baselines on two commonly-used datasets: TACRED (Zhang et al., 2017) and FewRel (Han et al., 2018). TACRED covers 42 relation types and contains 106,264 sentences. FewRel is a few-shot relation classification dataset, which has 100 relations and 700 instances for each relation.

Here we follow the relation extraction fine-tuning procedure from Zhang et al. (2019), where four special tokens are added before and after entity mentions in the sentence to highlight where the entities are. Then we take the [CLS] output as the sentence representation for classification.
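The marker insertion can be sketched as follows. The marker strings `[E1]`, `[/E1]`, `[E2]`, `[/E2]` and the function name are placeholders chosen for illustration; the text specifies only that four special tokens surround the two mentions. Spans are assumed to be non-empty, non-overlapping, half-open token ranges.

```python
def mark_entities(tokens, head_span, tail_span,
                  markers=("[E1]", "[/E1]", "[E2]", "[/E2]")):
    """Insert special tokens before and after the head and tail mentions.

    head_span and tail_span are (start, end) token indices, end exclusive.
    """
    h_open, h_close, t_open, t_close = markers
    opens = {head_span[0]: h_open, tail_span[0]: t_open}
    closes = {head_span[1]: h_close, tail_span[1]: t_close}
    out = []
    for i in range(len(tokens) + 1):
        if i in closes:
            out.append(closes[i])  # close a mention that ends just before i
        if i in opens:
            out.append(opens[i])   # open a mention that starts at i
        if i < len(tokens):
            out.append(tokens[i])
    return out

marked = mark_entities(["Bill", "Gates", "founded", "Microsoft"], (0, 2), (3, 4))
```

The marked sequence is then encoded as usual, and the [CLS] output feeds the relation classifier.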

Model 5-way 1-shot 5-way 5-shot 10-way 1-shot 10-way 5-shot
Proto (KEPLER)
Table 6: Accuracies () on FewRel dataset. “Proto” indicates Prototypical Networks (Snell et al., 2017) used in Han et al. (2018). “PAIR” is proposed in Gao et al. (2019) and “MTB” is from Baldini Soares et al. (2019).
Model P R F-1
Table 7: Entity typing results on OpenEntity (). Models with come from Choi et al. (2018); Zhang et al. (2019); Peters et al. (2019) respectively. Our KEPLER performs better than RoBERTaBASE, and achieves comparable results with other state-of-the-art models.

Table 5 shows the results of various models on TACRED, from which we can see that our model achieves state-of-the-art performance on this benchmark. Note that some baselines use the LARGE version of pre-trained language models while we use the BASE architecture. We obtain a large improvement over our base model (RoBERTaBASE) and stay slightly ahead of other competitive methods (even those using a LARGE architecture).

Our model also performs well on the FewRel dataset. We use Prototypical Networks (Snell et al., 2017) and PAIR (Gao et al., 2019) as the base frameworks and try out different kinds of pre-trained models as encoders. As shown in Table 6, for both frameworks, our models outperform the others. We have also compared with the current state-of-the-art MTB (Baldini Soares et al., 2019), which outperforms ours slightly. Note, however, that MTB uses a large version of BERT while we use the base version, and that it carries out a new pre-training task specifically targeting relation extraction, while ours is a general way to combine knowledge and natural language which would benefit all knowledge-related tasks.
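For context, the core of a Prototypical Network is small enough to sketch directly: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype under squared Euclidean distance, as in Snell et al. The vectors below are hand-written toy values standing in for encoder outputs, and the function names and relation labels are illustrative.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def build_prototypes(support):
    """support: dict mapping label -> list of support embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))

# A 2-way 2-shot episode with toy 2-dimensional embeddings.
support = {
    "founder_of": [[1.0, 0.0], [0.8, 0.2]],
    "born_in": [[0.0, 1.0], [0.2, 0.8]],
}
protos = build_prototypes(support)
predicted = classify([0.7, 0.1], protos)
```

In the few-shot experiments, only the encoder producing the embeddings changes between the compared models; the prototype step stays the same.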

Entity Typing

Entity typing requires models to classify given entity mentions into pre-defined entity types. For this task, we evaluate all the models on OpenEntity (Choi et al., 2018) following the setting from Zhang et al. (2019), which focuses on nine general entity types.

Evaluation results are shown in Table 7. KEPLER currently achieves better results than RoBERTa, while ERNIE and KnowBERT show slightly better results than ours. This is mainly because we extract entity representations differently. KnowBERT adds special tokens before and after the mention and uses the output of the token before the mention as the representation for typing, while ours, for now, directly uses [CLS]. We will try this better way of entity representation in the future.

Method MR MRR HITS@1 HITS@3 HITS@10
KEPLER 30.8387 0.217 0.0 0.360 0.692

Table 8: Performance of KEPLER on inductive setting in Wikidata5m.

5.3 Knowledge Embedding

In this section, we show how our KEPLER works as a KE model, and evaluate it on our Wikidata5m dataset in the inductive setting.

We do not use the existing KE benchmarks because they lack high-quality text descriptions for their entities and do not have a reasonable data split for the inductive setting.

Inductive Setting

We evaluate the generalization ability of our KEPLER by testing it on the inductive setting in Wikidata5m (as described in Section 4.2), which requires it to produce effective entity embeddings for the unseen entities. The results are shown in Table 8.

6 Conclusion and Future Work

In this paper, we propose KEPLER, a unified model for knowledge embedding and pre-trained language representation. We jointly train the knowledge embedding and language representation objectives on top of the language representation model. Experimental results on extensive tasks demonstrate the effectiveness of our model.

In the future, we will: (1) Evaluate on more tasks whether our model can recall factual knowledge. (2) Try variations of existing models, such as highlighting entity mentions in descriptions or changing the form of the knowledge embedding, to better understand how KEPLER works and to further improve performance on downstream tasks.