KG-BERT: BERT for Knowledge Graph Completion

by Liang Yao, et al.
Northwestern University

Knowledge graphs are important resources for many artificial intelligence tasks but often suffer from incompleteness. In this work, we propose to use pre-trained language models for knowledge graph completion. We treat triples in knowledge graphs as textual sequences and propose a novel framework named Knowledge Graph Bidirectional Encoder Representations from Transformer (KG-BERT) to model these triples. Our method takes the entity and relation descriptions of a triple as input and computes the scoring function of the triple with the KG-BERT language model. Experimental results on multiple benchmark knowledge graphs show that our method can achieve state-of-the-art performance in triple classification, link prediction and relation prediction tasks.





Large-scale knowledge graphs (KGs) such as FreeBase [2], YAGO [25] and WordNet [15] provide an effective basis for many important AI tasks such as semantic search, recommendation [42] and question answering [6]. A KG is typically a multi-relational graph containing entities as nodes and relations as edges. Each edge is represented as a triple (head entity, relation, tail entity) ((h, r, t) for short), indicating a relation between two entities, e.g., (Steve Jobs, founded, Apple Inc.). Despite their effectiveness, knowledge graphs are still far from complete. This problem motivates the task of knowledge graph completion, which aims to assess the plausibility of triples not present in a knowledge graph.

Much research work has been devoted to knowledge graph completion. A common approach is knowledge graph embedding, which represents entities and relations in triples as real-valued vectors and assesses triples' plausibility with these vectors [31]. However, most knowledge graph embedding models use only the structure information in observed triple facts, and therefore suffer from the sparseness of knowledge graphs. Some recent studies incorporate textual information to enrich knowledge representation [24, 37, 35], but they learn a unique text embedding for the same entity/relation across different triples, ignoring contextual information. For instance, different words in the description of Steve Jobs should have distinct importance weights with respect to the two relations "founded" and "isCitizenOf", and the relation "wroteMusicFor" can have two different meanings, "writes lyrics" and "composes musical compositions", given different entities. On the other hand, syntactic and semantic information in large-scale text data is not fully utilized, as these methods employ only entity descriptions, relation mentions or word co-occurrence with entities [34, 39, 1].

Recently, pre-trained language models such as ELMo [20], GPT [21], BERT [8] and XLNet [41] have shown great success in natural language processing (NLP). These models can learn contextualized word embeddings from large amounts of free text and achieve state-of-the-art performance in many language understanding tasks. Among them, BERT is the most prominent, pre-training a bidirectional Transformer encoder through masked language modeling and next sentence prediction and thereby capturing rich linguistic knowledge in its pre-trained weights.

In this study, we propose a novel method for knowledge graph completion using pre-trained language models. Specifically, we first treat entities, relations and triples as textual sequences and turn knowledge graph completion into a sequence classification problem. We then fine-tune the BERT model on these sequences to predict the plausibility of a triple or a relation. The method achieves strong performance in several KG completion tasks. Our source code is publicly available. Our contributions are summarized as follows:

  • We propose a new language modeling method for knowledge graph completion. To the best of our knowledge, this is the first study to model triples’ plausibility with a pre-trained contextual language model.

  • Results on several benchmark datasets show that our method can achieve state-of-the-art results in triple classification, relation prediction and link prediction tasks.

Related Work

Knowledge Graph Embedding

A literature survey of knowledge graph embedding methods has been conducted by [31]. These methods can be classified into translational distance models and semantic matching models based on their scoring functions for a triple (h, r, t). Translational distance models use distance-based scoring functions. They assess the plausibility of a triple by the distance between the two entity vectors h and t, typically after a translation performed by the relation vector r. The representative models are TransE [3] and its extensions including TransH [33]. For TransE, the scoring function is defined as the negative translational distance f(h, r, t) = −‖h + r − t‖. Semantic matching models employ similarity-based scoring functions. The representative models are RESCAL [18], DistMult [40] and their extensions. For DistMult, the scoring function is defined as the bilinear function f(h, r, t) = hᵀ diag(r) t. Recently, convolutional neural networks have also shown promising results for knowledge graph completion [7, 16, 22].
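As a concrete illustration of the two families, a minimal NumPy sketch of the TransE and DistMult scoring functions (the embeddings are toy values, not trained vectors):

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    # Translational distance: plausible triples satisfy h + r ≈ t, so the
    # negative distance -||h + r - t|| is highest (near 0) for good triples.
    return -np.linalg.norm(h + r - t, ord=norm)

def distmult_score(h, r, t):
    # Bilinear score h^T diag(r) t, i.e. the sum of element-wise products.
    return np.sum(h * r * t)

# Toy 3-dimensional embeddings (illustrative values only).
h = np.array([1.0, 0.0, 0.5])
r = np.array([0.0, 1.0, 0.0])
t = np.array([1.0, 1.0, 0.5])
print(transe_score(h, r, t))  # exact translation here, so the score is -0.0
```

Both functions map a triple to a real-valued plausibility score; only the geometric assumption (translation vs. bilinear similarity) differs.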

The above methods conduct knowledge graph completion using only the structural information observed in triples, while different kinds of external information like entity types, logical rules and textual descriptions can be introduced to improve performance [31]. For textual descriptions, [24] first represented entities by averaging the word embeddings contained in their names, where the word embeddings are learned from an external corpus. [32] proposed to jointly embed entities and words into the same vector space by aligning Wikipedia anchors and entity names. [37] used convolutional neural networks (CNNs) to encode word sequences in entity descriptions. [35] proposed semantic space projection (SSP), which jointly learns topics and KG embeddings by characterizing the strong correlations between fact triples and textual descriptions. Despite their success, these models learn the same textual representations of entities and relations in every triple, although words in entity/relation descriptions can have different meanings or importance weights in different triples.

To address the above problems, [34] presented a text-enhanced KG embedding model, TEKE, which can assign different embeddings to a relation in different triples. TEKE utilizes co-occurrences of entities and words in an entity-annotated text corpus. [39] used an LSTM encoder with an attention mechanism to construct contextual text representations given different relations. [1] proposed an accurate text-enhanced KG embedding method that exploits triple-specific relation mentions and a mutual attention mechanism between relation mentions and entity descriptions. Although these methods can handle the semantic variety of entities and relations in distinct triples, they cannot make full use of the syntactic and semantic information in large-scale free text data, as only entity descriptions, relation mentions and word co-occurrence with entities are utilized. Compared with these methods, our method can learn context-aware text embeddings with rich language information via pre-trained language models.

Language Model Pre-training

Pre-trained language representation models can be divided into two categories: feature-based and fine-tuning approaches. Traditional word embedding methods such as Word2Vec [14] and GloVe [19] adopted feature-based approaches to learn context-independent word vectors. ELMo [20] generalized traditional word embeddings to context-aware word embeddings, where word polysemy can be properly handled. Different from feature-based approaches, fine-tuning approaches like GPT [21] and BERT [8] use the pre-trained model architecture and parameters as a starting point for specific NLP tasks. The pre-trained models capture rich semantic patterns from free text. Recently, pre-trained language models have also been explored in the context of KGs. [30] learned contextual embeddings on entity-relation chains (sentences) generated from random walks in a KG, then used the embeddings to initialize KG embedding models like TransE. [44] incorporated informative entities in KGs to enhance BERT language representation. [4] used GPT to generate tail phrase tokens given head phrases and relation types in a commonsense knowledge base that does not cleanly fit into a schema comparing two entities with a known relation; that method focuses on generating new entities and relations. Unlike these studies, we use the names or descriptions of entities and relations as input and fine-tune BERT to compute the plausibility scores of triples.


Bidirectional Encoder Representations from Transformers (BERT)

BERT [8] is a state-of-the-art pre-trained contextual language representation model built on a multi-layer bidirectional Transformer encoder [28]. The Transformer encoder is based on the self-attention mechanism. There are two steps in the BERT framework: pre-training and fine-tuning. During pre-training, BERT is trained on a large-scale unlabeled general-domain corpus (3,300M words from BooksCorpus and English Wikipedia) over two self-supervised tasks: masked language modeling and next sentence prediction. In masked language modeling, BERT predicts randomly masked input tokens. In next sentence prediction, BERT predicts whether two input sentences are consecutive. For fine-tuning, BERT is initialized with the pre-trained parameter weights, and all parameters are fine-tuned using labeled data from downstream tasks such as sentence pair classification, question answering and sequence labeling.

Knowledge Graph BERT (KG-BERT)

Figure 1: Illustrations of fine-tuning KG-BERT for predicting the plausibility of a triple.
Figure 2: Illustrations of fine-tuning KG-BERT for predicting the relation between two entities.

To take full advantage of contextual representations with rich language patterns, we fine-tune pre-trained BERT for knowledge graph completion. We represent entities and relations by their names or descriptions, then take the name/description word sequences as the input sentences of the BERT model for fine-tuning. As in the original BERT, a "sentence" can be an arbitrary span of contiguous text or word sequence, rather than an actual linguistic sentence. To model the plausibility of a triple, we pack the sentences of (h, r, t) into a single sequence. A "sequence" means the input token sequence to BERT, which may be two entity name/description sentences or the three sentences of (h, r, t) packed together.

The architecture of KG-BERT for modeling triples is shown in Figure 1. We name this version KG-BERT(a). The first token of every input sequence is always the special classification token [CLS]. The head entity is represented as a sentence containing tokens Tok_1, …, Tok_a, e.g., "Steven Paul Jobs was an American business magnate, entrepreneur and investor." or "Steve Jobs"; the relation is represented as a sentence containing tokens Tok_1, …, Tok_b, e.g., "founded"; the tail entity is represented as a sentence containing tokens Tok_1, …, Tok_c, e.g., "Apple Inc. is an American multinational technology company headquartered in Cupertino, California." or "Apple Inc.". The sentences of entities and relations are separated by the special token [SEP]. For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings. Different elements separated by [SEP] have different segment embeddings: the tokens in the head and tail entity sentences share the same segment embedding e_A, while the tokens in the relation sentence have a different segment embedding e_B. Different tokens in the same position have the same position embedding. Each input token i has an input representation E_i. The token representations are fed into the BERT model architecture, a multi-layer bidirectional Transformer encoder based on the original implementation described in [28]. The final hidden vectors of the special [CLS] token and of the i-th input token are denoted as C ∈ R^H and T_i ∈ R^H, where H is the hidden state size of pre-trained BERT. The final hidden state C corresponding to [CLS] is used as the aggregate sequence representation for computing triple scores. The only new parameters introduced during triple classification fine-tuning are the classification layer weights W ∈ R^{2×H}. The scoring function for a triple τ = (h, r, t) is s_τ = f(h, r, t) = sigmoid(C Wᵀ), where s_τ is a 2-dimensional real vector with s_{τ0}, s_{τ1} ∈ [0, 1] and s_{τ0} + s_{τ1} = 1.
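A minimal sketch of this input packing (whitespace splitting stands in for BERT's WordPiece tokenizer, so the token boundaries are illustrative only):

```python
# Sketch of packing a triple into a single BERT input sequence. A whitespace
# "tokenizer" is assumed for illustration; real KG-BERT uses BERT's WordPiece.
def pack_triple(head_text, relation_text, tail_text):
    tokens = ["[CLS]"]
    segment_ids = [0]
    for text, seg in ((head_text, 0), (relation_text, 1), (tail_text, 0)):
        words = text.split()
        tokens += words + ["[SEP]"]
        # Head and tail sentences share segment embedding e_A (id 0);
        # the relation sentence gets segment embedding e_B (id 1).
        segment_ids += [seg] * (len(words) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segs, pos = pack_triple("Steve Jobs", "founded", "Apple Inc.")
print(tokens)
# ['[CLS]', 'Steve', 'Jobs', '[SEP]', 'founded', '[SEP]', 'Apple', 'Inc.', '[SEP]']
```

The input representation E_i of each position then sums the token, segment and position embeddings indexed by these three id sequences.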
Given the positive triple set D⁺ and a negative triple set D⁻ constructed accordingly, we compute a cross-entropy loss with s_τ and the triple labels:

L = −Σ_{τ ∈ D⁺ ∪ D⁻} ( y_τ log(s_{τ0}) + (1 − y_τ) log(s_{τ1}) )

where y_τ ∈ {0, 1} is the label (negative or positive) of that triple. The negative triple set D⁻ is simply generated by replacing the head entity h or the tail entity t in a positive triple (h, r, t) ∈ D⁺ with a random entity h′ or t′, i.e.,

D⁻ = {(h′, r, t) | h′ ∈ E ∧ h′ ≠ h ∧ (h′, r, t) ∉ D⁺} ∪ {(h, r, t′) | t′ ∈ E ∧ t′ ≠ t ∧ (h, r, t′) ∉ D⁺}

where E is the set of entities. Note that a triple will not be treated as a negative example if it is already in the positive set D⁺. The pre-trained parameter weights and the new weights W can be updated via gradient descent.
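The negative sampling procedure can be sketched as follows (the entity names are made up, and the rejection loop assumes at least one valid corruption exists):

```python
import random

def corrupt_triple(triple, entities, positives, rng=random):
    # Replace the head or the tail with a random entity, rejecting candidates
    # that recreate a known positive triple, as required by the definition of
    # the negative set D-. Assumes at least one valid corruption exists.
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate != triple and candidate not in positives:
            return candidate

positives = {("steve_jobs", "founded", "apple"),
             ("bill_gates", "founded", "microsoft")}
entities = ["steve_jobs", "bill_gates", "apple", "microsoft", "pixar"]
negative = corrupt_triple(("steve_jobs", "founded", "apple"), entities, positives)
assert negative not in positives  # never a known positive triple
```

Sampling one such corruption per positive triple keeps the two classes balanced during triple classification training.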

The architecture of KG-BERT for predicting relations is shown in Figure 2. We name this version KG-BERT(b). We use only the sentences of the two entities h and t to predict the relation r between them. In our preliminary experiments, we found that predicting relations with two entities directly is better than using KG-BERT(a) with relation corruption, i.e., generating negative triples by replacing the relation r with a random relation r′. As in KG-BERT(a), the final hidden state C corresponding to [CLS] is used as the representation of the two entities. The only new parameters introduced in relation prediction fine-tuning are the classification layer weights W′ ∈ R^{R×H}, where R is the number of relations in a KG. The scoring function for a triple τ = (h, r, t) is s′_τ = f(h, t) = softmax(C W′ᵀ), where s′_τ is an R-dimensional real vector with s′_{τi} ∈ [0, 1] and Σ_i s′_{τi} = 1. We compute the following cross-entropy loss with s′_τ and the relation labels:

L′ = −Σ_{τ ∈ D⁺} Σ_{i=1}^{R} y′_{τi} log(s′_{τi})

where τ is an observed positive triple and y′_{τi} is the relation indicator for the triple τ: y′_{τi} = 1 when r = i and y′_{τi} = 0 when r ≠ i.
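A toy numeric sketch of the softmax scoring above (the vector C and the weights W_prime are made-up values standing in for the [CLS] hidden state and the relation classification weights, with toy sizes H = 4 and R = 3):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# C stands in for the final [CLS] hidden vector; W_prime holds one weight row
# per relation. Sizes (H = 4, R = 3) and all values are toy assumptions.
C = [0.5, -1.0, 0.25, 2.0]
W_prime = [[0.1, 0.1, 0.1, 0.1],
           [0.3, -0.2, 0.0, 0.5],
           [-0.4, 0.1, 0.2, 0.0]]
logits = [sum(c * w for c, w in zip(C, row)) for row in W_prime]
scores = softmax(logits)  # s'_tau: one probability per relation, summing to 1
predicted_relation = max(range(len(scores)), key=scores.__getitem__)
print(predicted_relation)  # index of the highest-scoring relation
```

At prediction time the candidate relations are simply ranked by their entries in s′_τ.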


Experiments

In this section, we evaluate KG-BERT on three experimental tasks. Specifically, we want to determine:

  • Can our model judge whether an unseen triple fact is true or not?

  • Can our model predict an entity given another entity and a specific relation?

  • Can our model predict relations given two entities?


Datasets

We ran our experiments on six widely used benchmark KG datasets: WN11 [24], FB13 [24], FB15K [3], WN18RR, FB15k-237 and UMLS [7]. WN11 and WN18RR are two subsets of WordNet; FB15K and FB15k-237 are two subsets of Freebase. WordNet is a large lexical KG of English in which each entity is a synset consisting of several words and corresponding to a distinct word sense. Freebase is a large knowledge graph of general world facts. UMLS is a medical semantic network containing semantic types (entities) and semantic relations. The test sets of WN11 and FB13 contain positive and negative triples and can be used for triple classification. The test sets of WN18RR, FB15K, FB15k-237 and UMLS contain only correct triples; we perform link (entity) prediction and relation prediction on these datasets. Table 1 provides statistics of all datasets we used.

For WN18RR, we use synsets definitions as entity sentences. For WN11, FB15K and UMLS, we use entity names as input sentences. For FB13, we use entity descriptions in Wikipedia as input sentences. For FB15k-237, we used entity descriptions made by  [37]. For all datasets, we use relation names as relation sentences.

Dataset # Ent # Rel # Train # Dev # Test
WN11 38,696 11 112,581 2,609 10,544
FB13 75,043 13 316,232 5,908 23,733
WN18RR 40,943 11 86,835 3,034 3,134
FB15K 14,951 1,345 483,142 50,000 59,071
FB15k-237 14,541 237 272,115 17,535 20,466
UMLS 135 46 5,216 652 661
Table 1: Summary statistics of datasets.


Baselines

We compare our KG-BERT with multiple state-of-the-art KG embedding methods: TransE and its extensions TransH [33], TransD [10], TransR [13], TransG [36], TranSparse [11] and PTransE [12], and DistMult and its extension DistMult-HRS [43], which use only structural information in the KG; the neural tensor network NTN [24] and its simplified version ProjE [23]; the CNN models ConvKB [16] and ConvE [7] and the graph convolutional network R-GCN [22]; KG embeddings with textual information: TEKE [34], DKRL [37], SSP [35] and AATE [1]; KG embeddings with entity hierarchical types: TKRL [38]; contextualized KG embeddings: DOLORES [30]; complex-valued KG embeddings ComplEx [27] and RotatE [26]; and the adversarial learning framework KBGAN [5].


Settings

We choose the pre-trained BERT-Base model with 12 layers, 12 self-attention heads and hidden size H = 768 as the initialization of KG-BERT, then fine-tune KG-BERT with the Adam optimizer implemented in BERT. In our preliminary experiments, we found that the BERT-Base model generally achieves better results than BERT-Large, and BERT-Base is simpler and less sensitive to hyper-parameter choices. Following the original BERT, we set the following hyper-parameters in KG-BERT fine-tuning: batch size 32, learning rate 5e-5 and dropout rate 0.1. We also tried the other values of these hyper-parameters suggested in [8] but did not find much difference. We tuned the number of epochs for different tasks: 3 for triple classification, 5 for link (entity) prediction and 20 for relation prediction. We found that more epochs lead to better results in relation prediction but not in the other two tasks. For triple classification training, we sample 1 negative triple per positive triple, which ensures class balance in binary classification. For link (entity) prediction training, we sample 5 negative triples per positive triple; we tried 1, 3, 5 and 10 and found 5 to be the best.

Method WN11 FB13 Avg.
NTN [24] 86.2 90.0 88.1
TransE [33] 75.9 81.5 78.7
TransH [33] 78.8 83.3 81.1
TransR [13] 85.9 82.5 84.2
TransD [10] 86.4 89.1 87.8
TEKE [34] 86.1 84.2 85.2
TransG [36] 87.4 87.3 87.4
TranSparse-S [11] 86.4 88.2 87.3
DistMult [43] 87.1 86.2 86.7
DistMult-HRS [43] 88.9 89.0 89.0
AATE [1] 88.0 87.2 87.6
ConvKB [16] 87.6 88.8 88.2
DOLORES [30] 87.5 89.3 88.4
KG-BERT(a) 93.5 90.4 91.9
Table 2: Triple classification accuracy (in percentage) for different embedding methods. The baseline results are obtained from corresponding papers.
Method WN18RR FB15k-237 UMLS
       MR Hits@10 MR Hits@10 MR Hits@10
TransE (our results) 2365 50.5 223 47.4 1.84 98.9
TransH (our results) 2524 50.3 255 48.6 1.80 99.5
TransR (our results) 3166 50.7 237 51.1 1.81 99.4
TransD (our results) 2768 50.7 246 48.4 1.71 99.3
DistMult (our results) 3704 47.7 411 41.9 5.52 84.6
ComplEx (our results) 3921 48.3 508 43.4 2.59 96.7
ConvE [7] 5277 48.0 246 49.1 – –
ConvKB [16] 2554 52.5 257 51.7 – –
R-GCN [22] – – – 41.7 – –
KBGAN [5] – 48.1 – 45.8 – –
RotatE [26] 3340 57.1 177 53.3 – –
KG-BERT(a) 97 52.4 153 42.0 1.47 99.0
Table 3: Link prediction results on the WN18RR, FB15k-237 and UMLS datasets. Baseline models marked (our results) are implemented with the OpenKE toolkit [9]; other baseline results are taken from the original papers. A dash (–) marks results not reported.
Method Mean Rank Hits@1
TransE [12] 2.5 84.3
TransR [38] 2.1 91.6
DKRL (CNN) [37] 2.5 89.0
DKRL (CNN) + TransE [37] 2.0 90.8
DKRL (CBOW) [37] 2.5 82.7
TKRL (RHE) [38] 1.7 92.8
TKRL (WHE) [38] 1.8 92.5
PTransE (ADD, len-2 path) [12] 1.2 93.6
PTransE (RNN, len-2 path) [12] 1.4 93.2
PTransE (ADD, len-3 path) [12] 1.4 94.0
SSP [35] 1.2 –
ProjE (pointwise) [23] 1.3 95.6
ProjE (listwise) [23] 1.2 95.7
ProjE (wlistwise) [23] 1.2 95.6
KG-BERT (b) 1.2 96.0
Table 4: Relation prediction results on FB15K dataset. The baseline results are obtained from corresponding papers.

Triple Classification.

Triple classification aims to judge whether a given triple (h, r, t) is correct or not. Table 2 presents the triple classification accuracy of different methods on WN11 and FB13. We can see that KG-BERT(a) clearly outperforms all baselines by a large margin, which shows the effectiveness of our method. We ran our models 10 times and found that the standard deviations are less than 0.2 and the improvements are statistically significant. To our knowledge, KG-BERT(a) achieves the best results so far. For a more in-depth performance analysis, we note that TransE could not achieve high accuracy scores because it could not deal with 1-to-N, N-to-1 and N-to-N relations. TransH, TransR, TransD, TranSparse and TransG outperform TransE by introducing relation-specific parameters. DistMult performs relatively well and can be further improved by the hierarchical relation structure information used in DistMult-HRS. ConvKB shows decent results, which suggests that CNN models can capture global interactions among the entity and relation embeddings. DOLORES further improves on ConvKB by incorporating contextual information in entity-relation random walk chains. NTN also achieves competitive performance, especially on FB13, which means it is an expressive model and that representing entities with word embeddings is helpful. The other text-enhanced KG embeddings TEKE and AATE outperform their base models TransE and TransH, which demonstrates the benefit of external text data. However, their improvements are still limited due to their limited utilization of rich language patterns. The improvement of KG-BERT(a) over the baselines on WN11 is larger than on FB13, because WordNet is a linguistic knowledge graph that is closer to the linguistic patterns contained in pre-trained language models.

Figure 3 reports triple classification accuracy with 5%, 10%, 15%, 20% and 30% of the original WN11 and FB13 training triples. We note that KG-BERT(a) can achieve higher test accuracy with limited training triples. For instance, KG-BERT(a) reaches 88.1% test accuracy on FB13 and 87.0% on WN11 with only small fractions of the training triples, higher than some baseline models (including text-enhanced models) trained on the full training set. These encouraging results suggest that KG-BERT(a) can fully utilize rich linguistic patterns in large external text data to overcome the sparseness of knowledge graphs.

The main reasons why KG-BERT(a) performs well are fourfold: 1) the input sequence contains both entity and relation word sequences; 2) the triple classification task is very similar to the next sentence prediction task in BERT pre-training, which captures the relationship between two sentences in large free text, so the pre-trained BERT weights are well positioned for inferring the relationship among the different elements of a triple; 3) the token hidden vectors are contextual embeddings: the same token can have different hidden vectors in different triples, so contextual information is explicitly used; and 4) the self-attention mechanism can discover the most important words connected to the triple fact.

(a) WN11
(b) FB13
Figure 3: Test accuracy of triple classification by varying training data proportions.

Link Prediction.

The link (entity) prediction task predicts the head entity h given (?, r, t) or the tail entity t given (h, r, ?), where ? denotes the missing element. The results are evaluated using a ranking produced by the scoring function (s_{τ0} in our method) over test triples. Each correct test triple (h, r, t) is corrupted by replacing either its head or tail entity with every entity e ∈ E; these candidates are then ranked in descending order of their plausibility scores. We report two common metrics: Mean Rank (MR) of the correct entities and Hits@10, the proportion of correct entities ranked in the top 10. A lower MR is better, while a higher Hits@10 is better. Following [17], we report results only under the filtered setting [3], which removes all corrupted triples that appear in the training, development and test sets before producing the ranking lists.
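The two metrics can be computed from the filtered ranks as in this small sketch (the rank list is made up for illustration):

```python
def mean_rank_and_hits(ranks, k=10):
    # ranks: 1-based filtered ranks of the correct entity for each test triple.
    mean_rank = sum(ranks) / len(ranks)
    # Hits@k: fraction of test triples whose correct entity ranks in the top k.
    hits_at_k = sum(r <= k for r in ranks) / len(ranks)
    return mean_rank, hits_at_k

mr, hits10 = mean_rank_and_hits([1, 3, 12, 2, 50])
print(mr, hits10)  # 13.6 0.6
```

A single very badly ranked triple inflates MR sharply while barely moving Hits@10, which is why the two metrics can disagree about a model, as Table 3 shows for KG-BERT(a).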

Table 3 shows the link prediction performance of various models. We test some classical baseline models with the OpenKE toolkit [9]; other results are taken from the original papers. We can observe that: 1) KG-BERT(a) achieves lower MR than the baseline models, and to our knowledge it achieves the lowest mean ranks on WN18RR and FB15k-237; 2) the Hits@10 scores of KG-BERT(a) are lower than those of some state-of-the-art methods. KG-BERT(a) can avoid very high ranks thanks to the semantic relatedness of entity and relation sentences, but the KG structure information is not explicitly modeled, so it cannot rank some neighbor entities of a given entity in the top 10. The CNN models ConvE and ConvKB perform better than the graph convolutional network R-GCN. ComplEx does not perform well on WN18RR and FB15k-237, but can be improved by the adversarial negative sampling used in KBGAN and RotatE.

Relation Prediction.

This task predicts the relation between two given entities, i.e., (h, ?, t). The procedure is similar to link prediction, but we rank the candidates with the relation scores s′_τ. We evaluate the relation ranking using Mean Rank (MR) and Hits@1 under the filtered setting.

Table 4 reports relation prediction results on FB15K. We note that KG-BERT(b) also shows promising results and achieves the highest Hits@1 so far. KG-BERT(b) is analogous to sentence pair classification in BERT fine-tuning and can likewise benefit from BERT pre-training. The text-enhanced models DKRL and SSP also outperform the structure-only methods TransE and TransH. TKRL and PTransE work well with hierarchical entity categories and extended path information, respectively. ProjE achieves very competitive results by treating KG completion as a ranking problem and optimizing ranking score vectors.

Figure 4: Illustration of attention patterns of KG-BERT(a). A positive training triple (twenty_dollar_bill_NN_1, hypernym, note_NN_6) from WN18RR is used as the example. Different colors denote different attention heads, and the transparency of a color reflects the attention score. We show the attention weights between [CLS] and the other tokens in layer 11 of the Transformer model.
Figure 5: Illustration of attention patterns of KG-BERT(b). The example is taken from FB15K. The two entities 20th century and World War II are used as input; the relation label is /time/event/includes_event.

Attention Visualization.

We show attention patterns of KG-BERT in Figures 4 and 5, using the visualization tool released by [29]. Figure 4 depicts the attention patterns of KG-BERT(a). A positive training triple (twenty_dollar_bill_NN_1, hypernym, note_NN_6) from WN18RR is taken as the example. The entity descriptions "a United States bill worth 20 dollars" and "a piece of paper money" as well as the relation name "hypernym" are used as the input sequence. We observe that some important words such as "paper" and "money" have higher attention scores connected to the label token [CLS], while less related words like "united" and "states" receive less attention. On the other hand, we can see that different attention heads focus on different tokens: [SEP] is highlighted by the same six attention heads, "a" and "piece" are highlighted by the same three attention heads, while "paper" and "money" are highlighted by four other attention heads. As mentioned in [28], multi-head attention allows KG-BERT to jointly attend to information from different representation subspaces at different positions; the different attention heads are concatenated to compute the final attention values. Figure 5 illustrates the attention patterns of KG-BERT(b). The triple (20th century, /time/event/includes_event, World War II) from FB15K is taken as input. We see attention patterns similar to those of KG-BERT(a): six attention heads attend to "century" in the head entity, while three other attention heads focus on "war" and "ii" in the tail entity. Multi-head attention can thus attend to different aspects of the two entities in a triple.


From the experimental results, we note that KG-BERT achieves strong performance in three KG completion tasks. However, a major limitation is that BERT is computationally expensive, which makes link prediction evaluation very time-consuming: the evaluation needs to replace the head or tail entity with almost every entity, and every corrupted triple sequence must be fed into the 12-layer Transformer model. Possible solutions include introducing 1-N scoring models like ConvE or using lightweight language models.

Conclusion and Future Work

In this work, we propose a novel knowledge graph completion method termed Knowledge Graph BERT (KG-BERT). We represent entities and relations as their name/description textual sequences and turn the knowledge graph completion problem into a sequence classification problem. KG-BERT can make use of the rich language information in large amounts of free text and highlight the most important words connected to a triple. The proposed method demonstrates promising results, outperforming state-of-the-art results on multiple benchmark KG datasets.

Future directions include improving the results by jointly modeling textual information with KG structures, and utilizing pre-trained models trained on more text data, such as XLNet. Applying KG-BERT as a knowledge-enhanced language model to language understanding tasks is another interesting direction we plan to explore.


  • [1] B. An, B. Chen, X. Han, and L. Sun (2018) Accurate text-enhanced knowledge graph representation learning. In NAACL, pp. 745–755. Cited by: Introduction, Knowledge Graph Embedding, Baselines., Table 2.
  • [2] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pp. 1247–1250. Cited by: Introduction.
  • [3] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In NIPS, pp. 2787–2795. Cited by: Knowledge Graph Embedding, Datasets., Link Prediction..
  • [4] A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi (2019) COMET: commonsense transformers for automatic knowledge graph construction. In ACL, pp. 4762–4779. Cited by: Language Model Pre-training.
  • [5] L. Cai and W. Y. Wang (2018) KBGAN: adversarial learning for knowledge graph embeddings. In NAACL, pp. 1470–1480. Cited by: Baselines., Table 3.
  • [6] W. Cui, Y. Xiao, H. Wang, Y. Song, S. Hwang, and W. Wang (2017) KBQA: learning question answering over qa corpora and knowledge bases. Proceedings of the VLDB Endowment 10 (5), pp. 565–576. Cited by: Introduction.
  • [7] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2018) Convolutional 2d knowledge graph embeddings. In AAAI, pp. 1811–1818. Cited by: Knowledge Graph Embedding, Datasets., Baselines., Table 3.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: Introduction, Language Model Pre-training, Bidirectional Encoder Representations from Transformers (BERT), Settings..
  • [9] X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, and J. Li (2018) OpenKE: an open toolkit for knowledge embedding. In EMNLP, pp. 139–144. Cited by: Link Prediction., Table 3.
  • [10] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao (2015) Knowledge graph embedding via dynamic mapping matrix. In ACL, pp. 687–696. Cited by: Baselines., Table 2.
  • [11] G. Ji, K. Liu, S. He, and J. Zhao (2016) Knowledge graph completion with adaptive sparse transfer matrix. In AAAI, Cited by: Baselines., Table 2.
  • [12] Y. Lin, Z. Liu, H. Luan, M. Sun, S. Rao, and S. Liu (2015) Modeling relation paths for representation learning of knowledge bases. In EMNLP, pp. 705–714. Cited by: Baselines., Table 4.
  • [13] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu (2015) Learning entity and relation embeddings for knowledge graph completion. In AAAI, Cited by: Baselines., Table 2.
  • [14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. Cited by: Language Model Pre-training.
  • [15] G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: Introduction.
  • [16] D. Q. Nguyen, D. Q. Nguyen, T. D. Nguyen, and D. Phung (2018) A convolutional neural network-based model for knowledge base completion and its application to search personalization. Semantic Web. Cited by: Knowledge Graph Embedding, Baselines., Table 2, Table 3.
  • [17] D. Q. Nguyen, T. D. Nguyen, D. Q. Nguyen, and D. Phung (2018) A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL, pp. 327–333. Cited by: Link Prediction..
  • [18] M. Nickel, V. Tresp, and H. Kriegel (2011) A three-way model for collective learning on multi-relational data. In ICML, pp. 809–816. Cited by: Knowledge Graph Embedding.
  • [19] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In EMNLP, Cited by: Language Model Pre-training.
  • [20] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, pp. 2227–2237. Cited by: Introduction, Language Model Pre-training.
  • [21] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: Introduction, Language Model Pre-training.
  • [22] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In ESWC, pp. 593–607. Cited by: Knowledge Graph Embedding, Baselines., Table 3.
  • [23] B. Shi and T. Weninger (2017) ProjE: embedding projection for knowledge graph completion. In AAAI, Cited by: Baselines., Table 4.
  • [24] R. Socher, D. Chen, C. D. Manning, and A. Ng (2013) Reasoning with neural tensor networks for knowledge base completion. In NIPS, pp. 926–934. Cited by: Introduction, Knowledge Graph Embedding, Datasets., Baselines., Table 2.
  • [25] F. M. Suchanek, G. Kasneci, and G. Weikum (2007) Yago: a core of semantic knowledge. In WWW, pp. 697–706. Cited by: Introduction.
  • [26] Z. Sun, Z. Deng, J. Nie, and J. Tang (2019) RotatE: knowledge graph embedding by relational rotation in complex space. In ICLR, Cited by: Baselines., Table 3.
  • [27] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In ICML, pp. 2071–2080. Cited by: Baselines..
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: Bidirectional Encoder Representations from Transformers (BERT), Knowledge Graph BERT (KG-BERT), Attention Visualization..
  • [29] J. Vig (2019) A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714. External Links: Link Cited by: Attention Visualization..
  • [30] H. Wang, V. Kulkarni, and W. Y. Wang (2018) DOLORES: deep contextualized knowledge graph embeddings. arXiv preprint arXiv:1811.00147. Cited by: Language Model Pre-training, Baselines., Table 2.
  • [31] Q. Wang, Z. Mao, B. Wang, and L. Guo (2017) Knowledge graph embedding: a survey of approaches and applications. IEEE TKDE 29 (12), pp. 2724–2743. Cited by: Introduction, Knowledge Graph Embedding, Knowledge Graph Embedding.
  • [32] Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph and text jointly embedding. In EMNLP, Cited by: Knowledge Graph Embedding.
  • [33] Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI, Cited by: Knowledge Graph Embedding, Baselines., Table 2.
  • [34] Z. Wang and J. Li (2016) Text-enhanced representation learning for knowledge graph. In IJCAI, pp. 1293–1299. Cited by: Introduction, Knowledge Graph Embedding, Baselines., Table 2.
  • [35] H. Xiao, M. Huang, L. Meng, and X. Zhu (2017) SSP: semantic space projection for knowledge graph embedding with text descriptions. In AAAI, Cited by: Introduction, Knowledge Graph Embedding, Baselines., Table 4.
  • [36] H. Xiao, M. Huang, and X. Zhu (2016) TransG: a generative model for knowledge graph embedding. In ACL, Vol. 1, pp. 2316–2325. Cited by: Baselines., Table 2.
  • [37] R. Xie, Z. Liu, J. Jia, H. Luan, and M. Sun (2016) Representation learning of knowledge graphs with entity descriptions. In AAAI, Cited by: Introduction, Knowledge Graph Embedding, Datasets., Baselines., Table 4.
  • [38] R. Xie, Z. Liu, and M. Sun (2016) Representation learning of knowledge graphs with hierarchical types. In IJCAI, pp. 2965–2971. Cited by: Baselines., Table 4.
  • [39] J. Xu, X. Qiu, K. Chen, and X. Huang (2017) Knowledge graph representation with jointly structural and textual encoding. In IJCAI, pp. 1318–1324. Cited by: Introduction, Knowledge Graph Embedding.
  • [40] B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2015) Embedding entities and relations for learning and inference in knowledge bases. In ICLR, Cited by: Knowledge Graph Embedding.
  • [41] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: Introduction.
  • [42] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W. Ma (2016) Collaborative knowledge base embedding for recommender systems. In KDD, pp. 353–362. Cited by: Introduction.
  • [43] Z. Zhang, F. Zhuang, M. Qu, F. Lin, and Q. He (2018) Knowledge graph embedding with hierarchical relation structure. In EMNLP, pp. 3198–3207. Cited by: Baselines., Table 2.
  • [44] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu (2019) ERNIE: enhanced language representation with informative entities. In ACL, pp. 1441–1451. Cited by: Language Model Pre-training.