Knowledge Enhanced Pretrained Language Models: A Comprehensive Survey

10/16/2021 ∙ by Xiaokai Wei, et al. ∙ Amazon

Pretrained Language Models (PLM) have established a new paradigm by learning informative contextualized representations on large-scale text corpora. This new paradigm has revolutionized the entire field of natural language processing and set new state-of-the-art performance for a wide variety of NLP tasks. However, though PLMs can store certain knowledge/facts from the training corpus, their knowledge awareness is still far from satisfactory. To address this issue, integrating knowledge into PLMs has recently become a very active research area, and a variety of approaches have been developed. In this paper, we provide a comprehensive survey of the literature on this emerging and fast-growing field of Knowledge Enhanced Pretrained Language Models (KE-PLMs). We introduce three taxonomies to categorize existing work. Besides, we also survey the various NLU and NLG applications on which KE-PLMs have demonstrated superior performance over vanilla PLMs. Finally, we discuss the challenges facing KE-PLMs as well as promising directions for future research.

1. Introduction

In recent years, large-scale pretrained language models (PLM), which are pretrained on huge text corpora typically with unsupervised objectives, have revolutionized the field of NLP. Pretrained models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT2/3 (Radford et al., 2019) (Brown et al., 2020) and T5 (Raffel et al., 2019) have gained huge success and greatly boosted state-of-the-art performance on various NLP applications (Qiu et al., 2020). The wide success of pretraining in NLP also inspires the adoption of self-supervised pretraining in other fields, such as graph representation learning (Hu et al., 2020a) (Hu et al., 2020b) and recommender systems (Sun et al., 2019)(Xie et al., 2020).

Training on large textual data also enables these PLMs to memorize certain facts and knowledge contained in the training corpus. As demonstrated in recent work, these pretrained language models can possess a decent amount of lexical knowledge (Liu et al., 2019) (Vulić et al., 2020) as well as factual knowledge (Petroni et al., 2019) (Roberts et al., 2020) (Wang et al., 2020). However, further study reveals that PLMs also have the following limitations in terms of knowledge awareness:

  • For NLU, recent studies have found that PLMs tend to rely on superficial signals/statistical cues (Petroni et al., 2020) (McCoy et al., 2019) (Niven and Kao, 2019), and can be easily fooled with negated (e.g., “Birds can [MASK]” vs. “Birds cannot [MASK]”) and misprimed probes (Kassner and Schütze, 2020). Besides, it has been found that PLMs often fail in reasoning tasks (Talmor et al., 2020).

  • For NLG, although PLMs are able to generate grammatically correct sentences, the generated text might not be logical or sensible (Lin et al., 2020) (Liu et al., 2020). For example, as noted in (Lin et al., 2020), given a set of concepts {dog, frisbee, catch, throw}, GPT2 generates “A dog throws a frisbee at a football player” and T5 generates “dog catches a frisbee and throws it to a dog”, neither of which aligns with human commonsense.

These observations motivate work on designing more knowledge-aware pre-trained models. Recently, an ever-growing body of work aims at explicitly incorporating knowledge into PLMs (Yamada et al., 2020) (Zhang et al., 2019)(Peters et al., 2019)(Verga et al., 2020)(Wang et al., 2020)(Liu et al., 2020)(Ji et al., 2020). These methods exploit knowledge from various sources, such as encyclopedia knowledge, commonsense knowledge and linguistic knowledge, with different injection strategies. Such knowledge integration mechanisms have successfully enhanced existing PLMs’ knowledge awareness and led to improved performance on a variety of tasks, including but not limited to entity typing (Yamada et al., 2020), question answering (Yang et al., 2019)(Lin et al., 2019), story generation (Guan et al., 2020) and knowledge graph completion (Yao et al., 2019).

In this paper, we aim to provide a comprehensive survey of this emerging field of Knowledge Enhanced Pretrained Language Models (KE-PLMs). Existing work on KE-PLMs has developed a diverse set of techniques for knowledge integration over different knowledge sources. To provide insights into these models and facilitate future research, we build three taxonomies to categorize the existing KE-PLMs. Figure 1 illustrates our proposed taxonomies of Knowledge Enhanced Pretrained Language Models (KE-PLMs).

In existing KE-PLMs, different types of knowledge sources (e.g., linguistic, commonsense, encyclopedia, application-specific) have been explored to enhance the capability of PLMs in different aspects. The first taxonomy helps us understand which knowledge sources have been considered for constructing KE-PLMs. In the second taxonomy, we recognize that a knowledge source can be exploited to different extents, and categorize the existing work based on knowledge granularity: text chunk-based, entity-based, relation triple-based and subgraph-based. Finally, we introduce a third taxonomy that groups the methods by their application areas. This taxonomy presents the range of applications that existing KE-PLMs have targeted to improve with the help of knowledge integration. By recognizing which application areas have been well addressed or under-addressed by KE-PLMs, we hope to shed light on future research opportunities in applying KE-PLMs to the under-addressed areas.

Related work Two contemporaneous reviews (Safavi and Koutra, 2021) and (Colon-Hernandez et al., 2021) also investigate incorporating relational knowledge graphs into pretrained language models. We cover a larger scope by discussing more types of knowledge (e.g., domain-specific knowledge) and present an in-depth discussion of the applications and potential future directions.

The rest of this survey is organized as follows. In section 2, we present our taxonomy based on the knowledge source of KE-PLMs, and discuss representative approaches in each category. In section 3, we categorize the different knowledge granularities that existing KE-PLMs exploit, and summarize the common techniques employed for incorporating such knowledge. In section 4, we present the applications that benefit from the development of KE-PLMs and introduce the corresponding datasets. In section 5, we discuss the challenges facing the design of highly effective KE-PLMs and the opportunities for future research in this area. Lastly, we conclude our survey in section 6.

Figure 1. Taxonomy of Knowledge Enhanced Pretrained Language Models (KE-PLMs)

2. Knowledge Source

Existing work has explored the integration of knowledge from diverse sources: linguistic knowledge, encyclopedia knowledge, commonsense knowledge and domain-specific knowledge. In this section, we categorize existing work by its knowledge source. For each category of knowledge source, we also introduce several representative methods and the corresponding knowledge they exploit.

2.1. Linguistic Knowledge

Lexical relation Lexically Informed BERT (LIBERT) (Lauscher et al., 2020) incorporates lexical relation knowledge by predicting whether two words in a sentence exhibit semantic similarity (i.e., whether they are synonyms or hypernym-hyponym pairs).

Word sense SenseBERT (Levine et al., 2020) exploits word-sense knowledge from WordNet as weak supervision by including an additional training task that predicts the word supersense (e.g., noun.food and noun.state) based on the masked word’s context.
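As an illustration of how such an auxiliary sense objective can be attached to a PLM, the following PyTorch sketch adds a supersense classification head on top of the encoder's hidden states. The hidden size, the 45-way WordNet supersense inventory and the loss weighting are illustrative assumptions for this sketch, not SenseBERT's actual implementation.

```python
import torch
import torch.nn as nn

class SupersenseHead(nn.Module):
    """Auxiliary head that predicts the WordNet supersense of a masked word
    from its contextualized representation (in the spirit of SenseBERT)."""

    def __init__(self, hidden_size: int = 768, n_supersenses: int = 45):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, n_supersenses)
        # Positions without a supersense label are marked with -100 and ignored.
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, hidden_states: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); labels: (batch, seq_len)
        logits = self.classifier(hidden_states)
        return self.loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))

# Training sketch: total_loss = mlm_loss + lambda_ss * supersense_head(hidden, labels)
```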

Syntax tree Syntax-BERT(Bai et al., 2021) employs syntax parsers to extract both dependency (Chen and Manning, 2014) and constituency parsing (Zhu et al., 2013) trees. Syntax-related masks are designed to incorporate information from the syntax trees. K-Adapter(Wang et al., 2020) also includes a linguistic adapter that incorporates dependency parsing information, by predicting the head index for each token in a sentence.

Part-of-Speech tag SentiLARE (Ke et al., 2020) and LIMIT-BERT (Zhou et al., 2020) consider POS tags as additional knowledge. For example, LIMIT-BERT incorporates multiple types of linguistic knowledge simultaneously in a multi-task manner. In addition to POS tags and syntactic parse trees, LIMIT-BERT also explores span and semantic role labeling (SRL) information.

With such additional linguistic knowledge incorporated, these approaches demonstrate superior performance on general NLU benchmark datasets such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019), or on specific applications such as sentiment analysis.

2.2. Encyclopedia Knowledge

Encyclopedia KGs such as Freebase (Bollacker et al., 2008), NELL (Carlson et al., 2010) and Wikidata (Vrandečić and Krötzsch, 2014) contain facts/world knowledge in the form of relation triples (head entity, relation, tail entity), for example, (’Joe Biden’, ’PresidentOf’, ’USA’). These encyclopedia KGs are able to provide abundant knowledge for PLMs to integrate. A majority of existing work on KE-PLMs (Zhang et al., 2019)(He et al., 2020)(Yu et al., 2020)(Wang et al., 2020)(Verga et al., 2020)(Sun et al., 2020a)(Qin et al., 2020) uses Wikidata (https://www.wikidata.org/) as the knowledge source. Typically, entities in Wikidata are linked with entity mentions in the text of Wikipedia; entity-aware training can then be performed on such linked data to learn the parameters of KE-PLMs.
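For concreteness, a minimal, hypothetical data structure for such entity-linked training data is sketched below in Python; the field names, character offsets and Wikidata identifiers are purely illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LinkedExample:
    """A sentence whose entity mentions are aligned to Wikidata items."""
    text: str
    mentions: List[Tuple[int, int, str]]   # (char_start, char_end, Wikidata QID)

example = LinkedExample(
    text="Joe Biden is the president of the USA.",
    mentions=[(0, 9, "Q6279"), (34, 37, "Q30")],
)

# A KG fact that this sentence expresses, as a (subject, property, object) triple
fact = ("Q30", "P35", "Q6279")   # illustrative: (USA, head of state, Joe Biden)
```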

2.3. Commonsense Knowledge

To empower PLMs with more commonsense reasoning capability, existing models typically resort to the following two commonsense knowledge graphs: ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019).

  • ConceptNet (https://conceptnet.io/) (Speer et al., 2017) is a multilingual knowledge graph consisting of triples with relations such as CapableOf, Causes and HasProperty. A knowledge triple (h, r, t) represents that the head concept h holds the relation r with the tail concept t. For example, the triple (cooking, requires, food) means that “the prerequisite of cooking is food”.

  • ATOMIC (Sap et al., 2019) contains inferential knowledge in the form of if-then triples. It covers a variety of social commonsense knowledge around specific event prompts (e.g., “X goes to the store”). Specifically, ATOMIC distills its commonsense knowledge into nine dimensions, covering the event’s causes and its effects on the agent.

Incorporating knowledge from ConceptNet and ATOMIC helps PLMs gain stronger capability on commonsense reasoning (Lin et al., 2019)(Shen et al., 2020)(Ye et al., 2019)(Lv et al., 2020) (Guan et al., 2020)(Ji et al., 2020)(Yu et al., 2020)(Liu et al., 2020)(Li et al., 2020). We discuss in more detail how this knowledge is incorporated in section 3 and how it benefits multiple commonsense-related downstream tasks, such as CommonsenseQA and text generation, in section 4.

2.4. Application/Domain Specific Knowledge

In this subsection, we introduce KE-PLMs which utilize domain-specific knowledge for particular vertical applications.

Sentiment knowledge In addition to the POS information mentioned in section 2.1, SentiLARE (Ke et al., 2020) also utilizes sentiment word polarity from SentiWordNet (Baccianella et al., 2010). SKEP (Tian et al., 2020) incorporates sentiment knowledge through self-supervised training objectives, including sentiment word detection, word polarity prediction and aspect-sentiment pair prediction.

Medical knowledge BERT-MK (He et al., 2020) integrates the biomedical ontology from the Unified Medical Language System (UMLS) (Bodenreider, 2004) to facilitate tasks in the medical domain. K-BERT (Liu et al., 2020) also exploits knowledge from a medical concept KG for higher-quality NER in the medical domain.

E-commerce Product Graph E(commerce)-BERT (Zhang et al., 2020) utilizes a product association graph (i.e., whether two products are substitutable or complementary) (McAuley et al., 2015) constructed from consumer shopping statistics. It introduces additional tasks for reconstructing a product given its neighboring products in the association graph.

Though most existing approaches exploit only one knowledge source, it is worth noting that certain methods attempt to incorporate knowledge from more than one source. For example, K-Adapter (Wang et al., 2020) incorporates knowledge from multiple sources by learning a separate adapter for each knowledge source. It exploits both dependency relations as linguistic knowledge and relational/factual knowledge from Wikidata.

3. Knowledge Granularity

A majority of approaches resort to knowledge graphs (encyclopedia, commonsense or domain-specific) as the source of knowledge. In this section, we group these models by the granularity of the knowledge they incorporate from the KG: text chunk-based knowledge, entity knowledge, relation triples and KG subgraphs.

3.1. Text Chunk-based Knowledge

RAG (Lewis et al., 2020) builds an index in which each entry is a text chunk from Wikipedia. It first retrieves the top-k documents/chunks from this memory using kNN-based retrieval, and a BART model (Lewis et al., 2020) is then employed to generate the output conditioned on the retrieved documents. Similarly, KIF (Fan et al., 2020) uses kNN-based retrieval on an external non-parametric memory storing Wikipedia sentences and dialogue utterances to improve generative dialogue modeling.
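A minimal sketch of the dense kNN retrieval step shared by these approaches is given below, using plain NumPy inner products; the actual systems rely on learned dense retrievers and approximate nearest-neighbor indices, which this sketch does not reproduce.

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray,
                   chunk_vecs: np.ndarray,
                   chunks: list,
                   k: int = 5):
    """Return the k text chunks whose embeddings have the highest inner
    product with the query embedding (dense kNN retrieval)."""
    scores = chunk_vecs @ query_vec              # (num_chunks,)
    top_idx = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top_idx]

# Usage sketch: embed the query and the chunks with the same encoder, retrieve
# the top-k chunks, then condition the generator (e.g., BART) on them.
```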

3.2. Entity Knowledge

Entity-level information can be highly useful for a variety of entity-related tasks such as NER, entity typing, relation classification and machine reading comprehension. Hence, many existing KE-PLM models target this type of simple yet powerful knowledge.

A popular approach to making PLMs more entity-aware is to introduce entity-aware objectives during pretraining. Such a strategy is adopted by multiple existing approaches: ERNIE (BAIDU) (Sun et al., 2020b), ERNIE (THU) (Zhang et al., 2019), CokeBERT (Su et al., 2020), KgPLM (He et al., 2020), LUKE (Yamada et al., 2020), GLM (Shen et al., 2020), KALM (Rosset et al., 2020), CoLAKE (Sun et al., 2020a), JAKET (Yu et al., 2020) and AMS (Ye et al., 2019). A typical choice of entity-related objective is an entity linking loss, which links an entity mention in text to an entity in the KG with a cross-entropy loss or a max-margin loss on the prediction (Zhang et al., 2019)(Yamada et al., 2020)(Yu et al., 2020)(Rosset et al., 2020). There are also methods that employ a replacement detection loss, in which the goal is to predict whether an entity mention has been replaced with a distractor (Xiong et al., 2020). In addition, certain methods adopt more than one type of loss. For example, KgPLM (He et al., 2020) shows through an ablation study that employing both a cross-entropy entity prediction loss and an entity replacement detection loss (referred to as the generative and discriminative loss, respectively) leads to better performance than using only one loss. We summarize the different entity-aware loss functions in Table 1.

Loss Type | Typical Form | Representative Methods
Cross-entropy entity linking (EL) loss | $-\log \frac{\exp(s(m, e))}{\sum_{e'} \exp(s(m, e'))}$ | ERNIE (THU) (Zhang et al., 2019), JAKET (Yu et al., 2020), EaE (Févry et al., 2020), LUKE (Yamada et al., 2020), CokeBERT (Su et al., 2020)
Max-margin EL loss | $\max(0, \gamma + s(m, e^{-}) - s(m, e))$ | GLM (Shen et al., 2020), KALM (Rosset et al., 2020)
Entity replacement detection loss | $-y \log \hat{y} - (1 - y) \log(1 - \hat{y})$ | WKLM (Xiong et al., 2020), GLM (Shen et al., 2020)
Squared loss | $\lVert \mathbf{h}_m - \mathbf{e} \rVert^2$ | E-BERT (Poerner et al., 2020)
Table 1. Summary of entity-related objectives. $s(m, e)$ denotes the similarity score between mention $m$ and entity $e$ (e.g., their inner product) given the contextualized embedding $\mathbf{h}_m$; $s(m, e^{-})$ is the similarity on negative (i.e., distractor) mention/entity pairs; $y$ indicates whether the entity has been replaced with a negative sample and $\hat{y}$ is the model's prediction.
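To make the objectives in Table 1 concrete, the following PyTorch sketch implements two of them, the max-margin entity linking loss and the entity replacement detection loss; the tensor shapes and the margin value are illustrative and not tied to any specific model in the table.

```python
import torch
import torch.nn.functional as F

def max_margin_el_loss(pos_score: torch.Tensor,
                       neg_score: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    """Push the score of the gold mention/entity pair above the distractor
    pair by at least `margin` (max-margin entity linking loss)."""
    return F.relu(margin + neg_score - pos_score).mean()

def replacement_detection_loss(logits: torch.Tensor,
                               replaced: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy for predicting whether an entity mention has been
    swapped with a distractor (entity replacement detection)."""
    return F.binary_cross_entropy_with_logits(logits, replaced.float())

# Example with random scores, e.g., inner products s(m, e) and s(m, e-)
pos, neg = torch.randn(8), torch.randn(8)
print(max_margin_el_loss(pos, neg))
```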

Besides named entities, a few existing approaches find it helpful to incorporate phrase information into pretraining, such as ERNIE (BAIDU) (Sun et al., 2020b) and E(commerce)-BERT(Zhang et al., 2020).

3.3. Relation Knowledge

KE-PLMs that incorporate entity information alone have already demonstrated impressive performance gains over vanilla PLMs. Recently, an increasing amount of work has considered integrating knowledge beyond entities, such as the relation triples in a KG, to make the models even more powerful.

For example, FaE (Fact-as-Experts) (Verga et al., 2020) further extends EaE (Févry et al., 2020) by building a fact (i.e., relation triples in the KG) memory in addition to the entity memory. The set of tail entities retrieved from the fact memory is aggregated with an attention mechanism and then fused with the token embeddings. BERT-MK (He et al., 2020) adopts a similar technique to ERNIE (THU) (Zhang et al., 2019) for knowledge fusion. ERICA (Qin et al., 2020) applies a contrastive loss on entities and relations, which pulls neighboring entities/relations close and pushes non-neighbors far apart in the embedding space. K-Adapter (Wang et al., 2020) introduces a factual adapter which incorporates relation information by performing relation classification based on the entity context. K-BERT (Liu et al., 2020) injects knowledge by augmenting a sentence with triplets from the KG, transforming it into a knowledge-rich sentence tree. To prevent changing the original semantics of the sentence, K-BERT adopts a visible matrix to control the visibility of each injected token.
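The visible matrix idea can be illustrated with a toy example: injected triple tokens should attend only to the entity they extend (and to each other), while the original sentence tokens attend to one another as usual. The tokenization and indices below are made up for illustration and do not reproduce K-BERT's soft-position encoding.

```python
import numpy as np

# Toy example: sentence "Tim Cook visited Beijing" with the triple
# (Tim Cook, CEO_of, Apple) injected after the entity "Tim Cook".
tokens = ["Tim", "Cook", "visited", "Beijing", "CEO_of", "Apple"]
sentence_idx = [0, 1, 2, 3]   # original sentence tokens
branch_idx = [4, 5]           # injected knowledge tokens
anchor_idx = [0, 1]           # entity tokens the injected branch hangs off

n = len(tokens)
visible = np.zeros((n, n), dtype=bool)

# Sentence tokens are mutually visible, as in a normal Transformer.
for i in sentence_idx:
    for j in sentence_idx:
        visible[i, j] = True

# Injected tokens see each other and their anchor entity, but nothing else;
# likewise, only the anchor entity sees the injected tokens.
for i in branch_idx:
    for j in branch_idx + anchor_idx:
        visible[i, j] = visible[j, i] = True

# `visible` is then converted into an additive attention mask
# (0 where True, a large negative value where False).
```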

To better predict the plausibility of a triple, KG-BERT (Yao et al., 2019) fine-tunes a BERT model on a training format as in (1), derived from KG triplets. Similarly, COMET (Bosselut et al., 2019) and (Guan et al., 2020) verbalize the concepts and relations from commonsense KGs into text tokens for training a knowledge enhanced PLM. KEPLER (Wang et al., 2019) and BLP (BERT for Link Prediction) (Daza et al., 2020) further include a TransE-based objective (Bordes et al., 2013) for the PLM to incorporate information from the triples in the KG.

[CLS] Text(head) [SEP] Text(relation) [SEP] Text(tail) [SEP]        (1)
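A minimal sketch of serializing a triple into such an input sequence, assuming the Hugging Face tokenizer API, is shown below; unlike KG-BERT itself, it does not assign separate segment ids to the three parts.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def triple_to_input(head: str, relation: str, tail: str):
    """Serialize a KG triple roughly in the style of Eq. (1):
    [CLS] head [SEP] relation [SEP] tail [SEP]."""
    text = f"{head} [SEP] {relation} [SEP] {tail}"
    return tokenizer(text, return_tensors="pt")   # adds [CLS] ... [SEP] around it

enc = triple_to_input("Joe Biden", "president of", "USA")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# The [CLS] representation would then be fed to a binary classifier that
# scores the plausibility of the triple.
```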

LRLM (Hayashi et al., 2019) and KGLM (Logan et al., 2019) are two language models that utilize relations in a KG when generating the next token in an autoregressive manner. They introduce a latent variable to determine whether the next token should be a text-based token or an entity/relation-induced token.

3.4. Subgraph Knowledge

As knowledge graphs provide richer knowledge than entity or relation triplets alone, an increasing number of approaches have recently started to explore the integration of more sophisticated knowledge, such as subgraphs of the KG. The approaches in this category typically have two stages: subgraph construction and subgraph representation learning. In the subgraph construction stage, they extract the 1-hop or K-hop relations around the entities mentioned in the text. As the subgraph can be large (particularly when K is large) and many of the entities/relations involved might not be highly relevant to the text context, certain pruning/filtering is usually conducted and/or attention mechanisms are applied to the nodes/relations. In the second stage, for representation learning on the extracted subgraphs, most existing approaches adopt variants of Graph Neural Networks with attention mechanisms.
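A schematic sketch of the first stage, K-hop subgraph extraction followed by simple relevance-based pruning, is given below; the adjacency-list format, relevance function and node budget are illustrative assumptions rather than the procedure of any specific model.

```python
from collections import deque

def extract_khop_subgraph(adj, seeds, k, relevance, budget=50):
    """BFS outward from the seed entities up to k hops, then keep only the
    triples whose endpoints are among the `budget` most relevant nodes.
    `adj` maps an entity to a list of (relation, neighbor) pairs;
    `relevance(node)` scores how relevant a node is to the text context."""
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    triples = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for rel, nbr in adj.get(node, []):
            triples.append((node, rel, nbr))
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    keep = set(sorted(visited, key=relevance, reverse=True)[:budget])
    return [t for t in triples if t[0] in keep and t[2] in keep]
```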

3.4.1. Encyclopedia KG

CokeBERT (Su et al., 2020) designs a Semantic-driven GNN (S-GNN) for representing knowledge subgraphs. S-GNN employs an attention mechanism to direct the model to focus on the most relevant entities/relations. Subsequently, conditioned on the pre-trained entity embeddings and the subgraph embedding, the entity mention embedding is computed and used in a way similar to ERNIE (THU) (Zhang et al., 2019).

CoLAKE (Sun et al., 2020a) also aggregates information from the extracted knowledge subgraphs in both pretraining and inference. Instead of constructing an explicit GNN, CoLAKE converts the subgraph into a token sequence and appends it to the input sequence of the PLM, based on the observation that the self-attention mechanism in the Transformer (Vaswani et al., 2017) is similar in spirit to Graph Attention Networks (Velickovic et al., 2018).

3.4.2. Commonsense KG

Knowledge subgraphs are also very useful for commonsense based applications, such as CommonsenseQA and text/story generation.

KagNet (Lin et al., 2019) employs a path-finding-based algorithm for subgraph construction. It first applies a GCN on the subgraph to generate node embeddings; the output of the GCN is then passed to an LSTM-based (Hochreiter and Schmidhuber, 1997) path encoder. Finally, the path embeddings on the subgraph are combined through attention-based pooling to obtain the knowledge representation.

GRF (Ji et al., 2020) also adopts a GNN-based module for concept subgraph encoding, and it designs a dynamic reasoning module to propagate information on this graph at each decoding step.

KG-BART (Liu et al., 2020) first constructs a knowledge subgraph from a commonsense KG and uses GloVe embedding (Pennington et al., 2014) based word similarity to prune irrelevant concepts. Graph attention is then employed to learn concept embeddings, which are integrated with token embeddings using concept-to-subword and subword-to-concept fusion. Such a scheme enables KG-BART to generate sentences that are more logical and natural, even with unseen concepts. (Lv et al., 2020) also uses a GNN to generate embeddings for concept graphs.

In Table 2, we summarize the existing approaches with their characteristics.

Method | Knowledge-Aware Pretraining | Knowledge-Aware Auxiliary Loss | Knowledge Source
ERICA (Qin et al., 2020) | Yes | entity/relation discrimination | Wikipedia, Wikidata
ERNIE (THU) (Zhang et al., 2019) | Yes | entity prediction | Wikipedia/Wikidata
ERNIE 2.0 (Baidu) (Sun et al., 2020b) | Yes | masked entity/phrase | N/A
E-BERT (Poerner et al., 2020) | Yes | entity/wordpiece alignment | Wikipedia2Vec
E(commerce)-BERT (Zhang et al., 2020) | Yes | neighbor product reconstruction | product graph/AutoPhrase (Shang et al., 2018)
EaE (Févry et al., 2020) | Yes | mention detection/linking | Wikipedia
CokeBERT (Su et al., 2020) | Yes | entity prediction | Wikipedia/Wikidata
COMET (Bosselut et al., 2019) | No | autoregressive | ATOMIC, ConceptNet
K-Adapter (Wang et al., 2020) | No | dependency relation | Wikipedia, Wikidata, Stanford Parser
KnowBERT (Peters et al., 2019) | Yes | entity linking | WordNet, Wikipedia
K-BERT (Liu et al., 2020) | No | finetuning | WikiZh, WebtextZh, CN-DBpedia, HowNet, MedicalKG
KEPLER (Wang et al., 2019) | Yes | TransE scoring | Wikipedia/Wikidata
KG-BERT (Yao et al., 2019) | Yes | relation cross-entropy | ConceptNet
KG-BART (Liu et al., 2020) | Yes | masked concept | ConceptNet
KgPLM (He et al., 2020) | Yes | generative/discriminative masked entity | Wikipedia/Wikidata
FaE (Verga et al., 2020) | Yes | masked entity, etc. | Wikipedia/Wikidata
JAKET (Yu et al., 2020) | Yes | entity category/relation type/masked entity | Wikipedia/Wikidata
LUKE (Yamada et al., 2020) | Yes | entity prediction | Wikipedia
WKLM (Xiong et al., 2020) | Yes | entity replacement detection | Wikipedia/Wikidata
CoLAKE (Sun et al., 2020a) | Yes | masked entity prediction | Wikipedia/Wikidata
KT-NET (Yang et al., 2019) | No | finetuning | N/A
LIBERT (Lauscher et al., 2020) | Yes | lexical relation prediction | WordNet
SenseBERT (Levine et al., 2020) | Yes | supersense prediction | WordNet
Syntax-BERT (Bai et al., 2021) | No | masks induced by syntax tree parsing | syntax tree
SentiLARE (Ke et al., 2020) | Yes | POS/word-level polarity/sentiment polarity | SentiWordNet
(Li et al., 2020) | No | finetuning | ConceptNet
COCOLM (Yu et al., 2020) | Yes | discourse relation/co-occurrence relation | ASER
(Guan et al., 2020) | Yes | autoregressive | ConceptNet/ATOMIC
AMS (Ye et al., 2019) | Yes | distractor-based loss | ConceptNet
GLM (Shen et al., 2020) | No | distractor-based entity linking loss | ConceptNet
GRF (Ji et al., 2020) | No | finetuning | ConceptNet
KagNet (Lin et al., 2019) | No | finetuning | ConceptNet
LIMIT-BERT (Zhou et al., 2020) | Yes | semantics/syntax | pretrained model
KGLM (Logan et al., 2019) | Yes | autoregressive | WikiText-2/Wikidata
LRLM (Hayashi et al., 2019) | Yes | autoregressive | Wikidata/Freebase
(Ostendorff et al., 2019) | No | finetuning | Wikidata
SKEP (Tian et al., 2020) | Yes | word polarity prediction | auto-mined
Table 2. Comparison of different knowledge enhanced pretrained language models

4. Applications

Knowledge enhanced pretrained language models have benefited a variety of NLP applications, especially knowledge-intensive ones. In the following, we discuss in more detail how KE-PLMs improve the performance of various NLG and NLU tasks. To facilitate future research in the area, we also briefly introduce a few widely used benchmark datasets for evaluating the efficacy of these models.

4.1. Text Generation

The KE-PLMs designed for natural language generation can be roughly categorized into the following two types:

  • To improve the logical soundness of generated text by incorporating commonsense knowledge graphs.

  • To improve the factualness of generated text by incorporating encyclopedia knowledge.

In the following, we introduce the commonly used datasets for evaluating these two aspects.

ROCStories (Mostafazadeh et al., 2016) consists of coherent stories (each with five sentences) and serves as an unlabeled training dataset widely used for story understanding and generation tasks. By integrating subgraph information from commonsense KGs (e.g., ConceptNet and ATOMIC), knowledge enhanced PLMs (Guan et al., 2020)(Ji et al., 2020)(Yu et al., 2020) are able to generate story endings that are more logical and align better with human commonsense.

CommonGen (Lin et al., 2020) is a popular dataset for constrained text generation. It combines crowdsourced and existing caption corpora to produce commonsense descriptions for over 35k concept sets. The goal of CommonGen is to assess a model’s capability for commonsense reasoning when generating text. In this task, given a concept set with typically three to five concepts, the model is expected to generate a coherent and sensible text description containing these concepts. By incorporating commonsense knowledge, KG-BART (Liu et al., 2020) and (Li et al., 2020) show a better capability of generating natural and sensible text for the given concept set.

4.2. Entity-related Tasks

Since many existing approaches incorporate entity-level knowledge, entity-related tasks (e.g., entity typing and relation classification) become natural testbeds for evaluating the efficacy of these KE-PLMs. By injecting entity information into the model, these approaches become more entity-aware and outperform vanilla PLMs (such as BERT and RoBERTa) on different benchmark datasets.

4.2.1. Entity Typing

The goal of entity typing is to classify entity mentions into predefined categories. Two popular benchmark datasets used for entity typing are Open Entity (Choi et al., 2018) and FIGER (Ling et al., 2015), a sentence-level entity typing dataset. To perform entity typing, existing approaches (Su et al., 2020)(Wang et al., 2019)(Zhang et al., 2019)(Yamada et al., 2020)(Sun et al., 2020a)(Poerner et al., 2020)(Wang et al., 2020) typically insert special tokens [E] and [/E] around the entity mention and employ the contextual representation of the [E] token to predict the category.
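A minimal sketch of this fine-tuning recipe, assuming the Hugging Face transformers API and a toy type inventory, might look as follows; the marker strings and the classifier head are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Register the entity markers and resize the embedding matrix accordingly.
tokenizer.add_special_tokens({"additional_special_tokens": ["[E]", "[/E]"]})
model.resize_token_embeddings(len(tokenizer))

text = "[E] Joe Biden [/E] was born in Scranton."
enc = tokenizer(text, return_tensors="pt")
hidden = model(**enc).last_hidden_state              # (1, seq_len, hidden)

# Pool the representation of the [E] marker and feed it to a type classifier.
e_id = tokenizer.convert_tokens_to_ids("[E]")
e_pos = (enc["input_ids"][0] == e_id).nonzero(as_tuple=True)[0]
entity_repr = hidden[0, e_pos]                        # (1, hidden)
type_logits = torch.nn.Linear(hidden.size(-1), 9)(entity_repr)   # 9 = toy label set
```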

4.2.2. Relation Classification

Relation classification/relation typing aims at classifying the relationship between two entities mentioned in text. TACRED (Zhang et al., 2017) is a widely used benchmark dataset for relation classification. It consists of over 106k sentences with a total of 42 relations. FewRel (Han et al., 2018) is a dataset constructed from Wikipedia text and Wikidata facts. It contains 100 relations and can be used for evaluating few-shot relation classification.

To perform the relation classification task, existing KE-PLMs (Su et al., 2020)(Zhang et al., 2019)(Wang et al., 2019)(Yamada et al., 2020)(Sun et al., 2020a)(Poerner et al., 2020)(Wang et al., 2020) typically proceed as follows during fine-tuning: special tokens [HE], [/HE], [TE] and [/TE] are added to mark the beginning and end of the head entity and the tail entity. The contextual embeddings for [HE] and [TE] are then concatenated to predict the relation category.
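Analogously, a short sketch of the relation classification setup, again assuming the Hugging Face transformers API, is shown below; the marker strings and the 42-way classifier (matching TACRED's label count) are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[HE]", "[/HE]", "[TE]", "[/TE]"]})
model.resize_token_embeddings(len(tokenizer))

text = "[HE] Joe Biden [/HE] is the president of [TE] USA [/TE] ."
enc = tokenizer(text, return_tensors="pt")
hidden = model(**enc).last_hidden_state

def marker_repr(marker: str) -> torch.Tensor:
    """Contextual embedding of the first occurrence of a marker token."""
    marker_id = tokenizer.convert_tokens_to_ids(marker)
    pos = (enc["input_ids"][0] == marker_id).nonzero(as_tuple=True)[0]
    return hidden[0, pos[:1]]                          # (1, hidden)

pair = torch.cat([marker_repr("[HE]"), marker_repr("[TE]")], dim=-1)  # (1, 2*hidden)
relation_logits = torch.nn.Linear(pair.size(-1), 42)(pair)
```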

4.3. Question Answering

4.3.1. Cloze-style Question Answering/Knowledge Probing

The LAMA (LAnguage Model Analysis) probe (Petroni et al., 2019) is a set of cloze-style questions with single-token answers. It is generated from relation triplets in a KG using templates which contain variables s and o for the subject and object (e.g., “s was born in o”). This dataset aims at measuring how much factual knowledge is stored in pretrained models. (Poerner et al., 2020) further constructs LAMA-UHN, a more difficult subset of LAMA, by filtering out easy-to-answer questions with overly informative entity names.
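A minimal sketch of such cloze-style probing with an off-the-shelf masked language model, using the Hugging Face fill-mask pipeline, might look as follows; the probe sentence is an illustrative LAMA-style template instantiation.

```python
from transformers import pipeline

# Ask a masked LM to fill the object slot of a relational template.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for pred in unmasker("Dante was born in [MASK]."):
    print(f'{pred["token_str"]:>12s}  {pred["score"]:.3f}')
```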

CoLAKE(Sun et al., 2020a), E-BERT(Poerner et al., 2020), KgPLM(He et al., 2020), EaE (Févry et al., 2020), KALM(Rosset et al., 2020), KEPLER(Wang et al., 2019) and K-Adapter(Wang et al., 2020) adopt LAMA for knowledge probing. With the injected knowledge, they are able to generate more factual answers than vanilla PLMs.

4.3.2. Open Domain Question Answering

PLMs equipped with knowledge can achieve better performance on Open Domain Question Answering (ODQA), as the context and answer often involve entities. A few KE-PLMs evaluate their approaches on ODQA datasets to showcase their improved capability. For example, KT-NET (Yang et al., 2019) and LUKE (Yamada et al., 2020) demonstrate their efficacy on SQuAD 1.1 (Rajpurkar et al., 2016). TriviaQA (Joshi et al., 2017) and SearchQA (Dunn et al., 2017) were used by EaE (Févry et al., 2020) and K-Adapter (Wang et al., 2020), respectively.

4.3.3. Commonsense QA

CommonsenseQA (Talmor et al., 2019) is a question answering dataset constructed from ConceptNet (Speer et al., 2017) to test a model’s capability of understanding several commonsense types (e.g., causal, spatial, social). Each question is equipped by human annotators with one correct answer and four distractors. The distractors are typically also related to the question but less aligned with human commonsense. One example question is “What do all humans want to experience in their own home? {feel comfortable, work hard, fall in love, lay eggs, live forever}” (https://www.tau-nlp.org/commonsenseqa). Through integrating knowledge from commonsense KGs, KagNet (Lin et al., 2019), GLM (Shen et al., 2020), AMS (Ye et al., 2019) and (Lv et al., 2020) are able to perform well on CommonsenseQA. CosmosQA (Huang et al., 2019) is a dataset with multiple-choice questions for commonsense-based reading comprehension, and K-Adapter (Wang et al., 2020) outperforms vanilla RoBERTa (Liu et al., 2019) on this dataset.

4.4. Knowledge Graph Completion

Knowledge graphs usually suffer from the problem of incompleteness, and many relations between entities are missing in the KG. This might negatively impact multiple downstream tasks, such as knowledge base QA (Hao et al., 2017). Pretrained language models enhanced by knowledge graphs can in turn help infer the missing links and improve the completeness of knowledge graphs. Some popular benchmark datasets for evaluating link prediction on knowledge graphs are WN18RR (Dettmers et al., 2018) (a subset of WordNet), FB15K-237 (Dettmers et al., 2018) (a subset of Freebase (Bollacker et al., 2008)) and Wikidata5M (recently introduced in (Wang et al., 2019)).

KG-BERT (Yao et al., 2019), BLP (Daza et al., 2020), GLM (Shen et al., 2020) and KEPLER (Wang et al., 2019), by including margin-based or cross-entropy losses derived from relation triples, have demonstrated competitive performance on the datasets mentioned above. Besides, it has been shown that COMET (Bosselut et al., 2019) is able to generate new high-quality commonsense knowledge, and hence can be employed to further complete commonsense KGs such as ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019).

4.5. Sentiment Analysis Tasks

KE-PLMs also demonstrate their effectiveness on sentiment analysis benchmarks such as the Amazon QA dataset (Miller et al., 2020), the Stanford Sentiment Treebank (SST) (Socher et al., 2013) and SemEval 2014 Task 4 (Pontiki et al., 2014). With the help of additional knowledge such as word sentiment polarity, SentiLARE (Ke et al., 2020), SKEP (Tian et al., 2020) and E(commerce)-BERT (Zhang et al., 2020) have achieved better performance than their vanilla counterparts on sentence-level and aspect-level sentiment classification. E(commerce)-BERT (Zhang et al., 2020) also demonstrates superior performance on review-based QA and review aspect extraction.

5. Challenges and Future Directions

In this section, we present the common challenges that face the development of KE-PLMs, and discuss possible research directions to address these challenges.

5.1. Exploring More Applications

As discussed in previous sections, knowledge enhanced PLMs have achieved success on multiple NLU and NLG tasks. Beyond these, there remain many other applications that can potentially benefit from KE-PLMs, such as the following:

  • Text Summarization KE-PLMs have shown impressive performance on generating more logical and factual text. Factual inconsistency has long been an issue affecting existing document summarization approaches (Kryscinski et al., 2020). Adopting techniques from knowledge enhanced PLMs to improve the factualness of document summarization would be a promising direction to explore.

  • Machine Translation To the best of our knowledge, how KE-PLMs can improve the quality of machine translation has not yet been explored. Integrating linguistic and commonsense knowledge into PLMs might further improve their translation performance.

  • Semantic parsing based applications might also benefit from entity-aware KE-PLMs, since semantic parsing often relies on the entity information in the text.

5.2. Exploring More Knowledge Sources

5.2.1. Application/Domain-specific Knowledge

Since most existing work focuses on encyclopedia KGs, commonsense KGs or linguistic knowledge, exploiting domain-specific knowledge is still a relatively under-explored area. As presented in previous sections, several papers have investigated knowledge integration in the medical and e-commerce domains. As future work, one could investigate the effectiveness of injecting knowledge from other domains, such as finance, sports and entertainment, into PLMs. It would be exciting to see how KE-PLMs could help improve NLU and NLG performance in these diverse fields.

5.2.2. Temporal Knowledge Graph

It would be beneficial to study how to effectively integrate temporal knowledge graphs (Trivedi et al., 2017) into PLMs. For example, the personName in the relation triple (personName, presidentOf, USA) differs across time periods. Handling such temporal knowledge could further enhance the capability of KE-PLMs, but it also poses additional challenges for knowledge representation and integration.

5.3. Integrating Complex Knowledge More Effectively

There have been some recent attempts to incorporate knowledge with complex model architectures or to perform pretraining in a more sophisticated manner, such as joint training of the PLM and KG embeddings (Su et al., 2020)(Wang et al., 2019). However, on benchmark datasets, the reported performance of these more sophisticated schemes does not always exceed that of approaches that incorporate only entity-level information, such as LUKE (Yamada et al., 2020). Hence, we believe there is still great potential in exploring more effective knowledge integration schemes.

5.4. Optimizing Computational Burden

Despite the success of KE-PLMs on a variety of applications, one should not overlook the fact that incorporating knowledge might incur additional computational burden and storage overhead. Most existing work only reports accuracy gains, while the incurred cost of knowledge integration has not been thoroughly studied. Designing more time- and space-efficient solutions would be critical to the practical adoption of these knowledge enhanced models.

5.4.1. Time Overhead

Many existing methods involve pre-training with auxiliary tasks on large-scale linked data (Zhang et al., 2019)(Sun et al., 2020a)(Yamada et al., 2020). How to minimize the amount of pretraining required while still achieving strong performance requires more investigation. Methods such as K-BERT (Liu et al., 2020), K-Adapter (Wang et al., 2020) and Syntax-BERT (Bai et al., 2021) make progress towards this end by being pretraining-free. We believe there is still potential in further striking a balance between pretraining workload and model performance. When it comes to inference, there is usually additional cost for KE-PLMs as well. For example, a few approaches (Ji et al., 2020)(Liu et al., 2020)(Sun et al., 2019) involve constructing knowledge subgraphs and learning graph embeddings on the fly, which might increase the inference time. It is worth developing more efficient inference strategies for KE-PLMs to facilitate their adoption in real-world applications.

5.4.2. Space Overhead

The additional space consumption of KE-PLMs might also be a concern for their practical deployment. For instance, Fact-as-Experts (FaE) (Verga et al., 2020) builds an external entity memory and fact memory (with 1.54 million KB triples) to complement the pretrained model. Retrieval Augmented Generation (RAG) (Lewis et al., 2020) relies on a non-parametric knowledge corpus with 21 million documents. These integrated entities/facts are not equally useful, and some entities might play a more important role in enhancing the model’s capability than others. Thus, how to select a succinct set of the most important knowledge entries out of a potentially huge space could be a promising future direction.

Another direction worth exploring is to employ model compression techniques (Cheng et al., 2018) on both knowledge integration parameters (e.g., entity memory) and language model parameters of KE-PLMs. For example, popular model compression methods such as quantization (Shen et al., 2020), knowledge distillation (Hinton et al., 2015) and parameter sharing (Lan et al., 2020) (Dehghani et al., 2019) can be applied to KE-PLMs for improving time and space efficiency.

5.5. Noise Resilient KE-PLMs

While existing approaches benefit from the additional information provided by knowledge sources, they might also incorporate noise from potentially noisy sources. This is especially true for methods that rely on off-the-shelf toolkits for knowledge extraction:

  • Entity recognition and entity linking. Existing entity-aware PLMs typically depend on third-party tools or simple heuristics for performing entity extraction and disambiguation. For example, ERNIE (Zhang et al., 2019), CoLAKE (Sun et al., 2020a) and CokeBERT (Su et al., 2020) use TAGME (Ferragina and Scaiella, 2010) to perform entity extraction and linking. ERICA (Qin et al., 2020) uses spaCy to perform NER and then links the entity mentions to Wikidata. (Rosset et al., 2020) relies on a (fuzzy) frequency-based dictionary look-up for entity linking. All of these entity linkers inevitably introduce a certain amount of noise into the training process, which might in turn degrade the PLM’s performance.

  • Syntax tree Syntax-BERT (Bai et al., 2021) and K-Adapter (Wang et al., 2020) employ the Stanford Parser (https://nlp.stanford.edu/software/lex-parser.shtml) to generate dependency parse trees, which are further used to construct auxiliary losses.

After knowledge integration, such noise might propagate to other sub-components of the model and hence negatively impact performance on downstream tasks. How different noise levels affect the model’s performance would be an interesting direction to explore. It would be highly desirable if new KE-PLMs developed in the future could demonstrate robustness to such noise, in both the pretraining and inference phases.

5.6. Upper Bound Study

Recently, gigantic pretrained models with hundreds of billions of parameters, such as Switch Transformers (Fedus et al., 2021) and GPT-3 (Brown et al., 2020), have demonstrated impressive capability for producing human-like text on a wide range of NLP tasks. Though their prohibitive training and inference costs make them less practical in large-scale real-world applications, they showcase the huge upside of PLMs’ performance once equipped with larger training corpora, more model parameters and more training iterations.

However, such an upper bound study for knowledge enhanced PLMs is still lacking, and there has not been as much effort on pushing the limits of KE-PLMs. It would be interesting to study how well a KE-PLM can perform with massive parameters and knowledge incorporated from a large number of knowledge sources and types (e.g., encyclopedia, commonsense, linguistic). This is a challenging problem that requires careful design of the knowledge integration scheme, in order to effectively integrate multiple knowledge sources while mitigating forgetting. Besides, more advanced distributed training techniques are also critical to empowering the training of such gigantic KE-PLMs.

6. Conclusion

Integrating knowledge into pretrained language models has been an active research area. We have witnessed ever-growing interest in this topic since BERT (Devlin et al., 2019) popularized the application of PLMs. In this survey, we thoroughly review and categorize the existing KE-PLMs from a methodological point of view, and establish taxonomies based on knowledge source, knowledge granularity and application. Finally, we highlight several challenges on this topic and discuss potential research directions in this area. We hope that this survey will facilitate future research and help practitioners further explore this promising field.

References

  • [1] S. Baccianella, A. Esuli, and F. Sebastiani (2010-05) SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. Cited by: §2.4.
  • [2] J. Bai, Y. Wang, Y. Chen, Y. Yang, J. Bai, J. Yu, and Y. Tong (2021-03) Syntax-BERT: Improving Pre-trained Transformers with Syntax Trees. arXiv e-prints, pp. arXiv:2103.04350. External Links: 2103.04350 Cited by: §2.1, Table 2, 2nd item, §5.4.1.
  • [3] O. Bodenreider (2004) The unified medical language system (umls): integrating biomedical terminology.. Nucleic Acids Res. 32 (Database-Issue), pp. 267–270. Cited by: §2.4.
  • [4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, New York, NY, USA, pp. 1247–1250. External Links: ISBN 9781605581026 Cited by: §2.2, §4.4.
  • [5] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26, pp. . External Links: Link Cited by: §3.3.
  • [6] A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi (2019-07) COMET: commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4762–4779. External Links: Link, Document Cited by: §3.3, Table 2, §4.4.
  • [7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. Cited by: §1, §5.6.
  • [8] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell (2010) Toward an architecture for never-ending language learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 1306–1313. Cited by: §2.2.
  • [9] D. Chen and C. Manning (2014-10) A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 740–750. External Links: Link, Document Cited by: §2.1.
  • [10] Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2018) Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Processing Magazine 35 (1), pp. 126–136. External Links: Document Cited by: §5.4.2.
  • [11] E. Choi, O. Levy, Y. Choi, and L. Zettlemoyer (2018-07) Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 87–96. External Links: Link, Document Cited by: §4.2.1.
  • [12] P. Colon-Hernandez, C. Havasi, J. Alonso, M. Huggins, and C. Breazeal (2021-01) Combining pre-trained language models and structured knowledge. arXiv e-prints, pp. arXiv:2101.12294. External Links: 2101.12294 Cited by: §1.
  • [13] D. Daza, M. Cochez, and P. Groth (2020-10) Inductive Entity Representations from Text via Link Prediction. arXiv e-prints, pp. arXiv:2010.03496. External Links: 2010.03496 Cited by: §3.3, §4.4.
  • [14] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2019) Universal transformers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §5.4.2.
  • [15] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2018) Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.4.
  • [16] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document Cited by: §1, §6.
  • [17] M. Dunn, L. Sagun, M. Higgins, V. Ugur Guney, V. Cirik, and K. Cho (2017-04) SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv e-prints, pp. arXiv:1704.05179. External Links: 1704.05179 Cited by: §4.3.2.
  • [18] A. Fan, C. Gardent, C. Braud, and A. Bordes (2020-04) Augmenting Transformers with KNN-Based Composite Memory for Dialogue. arXiv e-prints, pp. arXiv:2004.12744. External Links: 2004.12744 Cited by: §3.1.
  • [19] W. Fedus, B. Zoph, and N. Shazeer (2021-01) Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv e-prints, pp. arXiv:2101.03961. External Links: 2101.03961 Cited by: §5.6.
  • [20] P. Ferragina and U. Scaiella (2010) TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, New York, NY, USA, pp. 1625–1628. External Links: ISBN 9781450300995 Cited by: 1st item.
  • [21] T. Févry, L. Baldini Soares, N. FitzGerald, E. Choi, and T. Kwiatkowski (2020-11) Entities as experts: sparse memory access with entity supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4937–4951. Cited by: §3.3, Table 1, Table 2, §4.3.1, §4.3.2.
  • [22] J. Guan, F. Huang, Z. Zhao, X. Zhu, and M. Huang (2020) A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics 8, pp. 93–108. Cited by: §1, §2.3, §3.3, Table 2, §4.1.
  • [23] X. Han, H. Zhu, P. Yu, Z. Wang, Y. Yao, Z. Liu, and M. Sun (2018-October-November) FewRel: a large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4803–4809. External Links: Link, Document Cited by: §4.2.2.
  • [24] Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, and J. Zhao (2017-07) An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 221–231. External Links: Link, Document Cited by: §4.4.
  • [25] H. Hayashi, Z. Hu, C. Xiong, and G. Neubig (2019-08) Latent Relation Language Models. arXiv e-prints, pp. arXiv:1908.07690. External Links: 1908.07690 Cited by: §3.3, Table 2.
  • [26] B. He, X. Jiang, J. Xiao, and Q. Liu (2020-12) KgPLM: Knowledge-guided Language Model Pre-training via Generative and Discriminative Learning. arXiv e-prints, pp. arXiv:2012.03551. External Links: 2012.03551 Cited by: §2.2, §3.2, Table 2, §4.3.1.
  • [27] B. He, D. Zhou, J. Xiao, X. Jiang, Q. Liu, N. J. Yuan, and T. Xu (2020-11) BERT-MK: integrating graph contextualized knowledge into pre-trained language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 2281–2290. Cited by: §2.4, §3.3.
  • [28] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop. External Links: Link Cited by: §5.4.2.
  • [29] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.4.2.
  • [30] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec (2020) Strategies for pre-training graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [31] Z. Hu, Y. Dong, K. Wang, K. Chang, and Y. Sun (2020) GPT-gnn: generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Cited by: §1.
  • [32] L. Huang, R. Le Bras, C. Bhagavatula, and Y. Choi (2019-11) Cosmos QA: machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2391–2401. External Links: Link, Document Cited by: §4.3.3.
  • [33] H. Ji, P. Ke, S. Huang, F. Wei, X. Zhu, and M. Huang (2020-11) Language generation with multi-hop reasoning on commonsense knowledge graph. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 725–736. External Links: Link, Document Cited by: §1, §2.3, §3.4.2, Table 2, §4.1, §5.4.1.
  • [34] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer (2017-07) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1601–1611. External Links: Link, Document Cited by: §4.3.2.
  • [35] N. Kassner and H. Schütze (2020-07) Negated and misprimed probes for pretrained language models: birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7811–7818. External Links: Link, Document Cited by: 1st item.
  • [36] P. Ke, H. Ji, S. Liu, X. Zhu, and M. Huang (2020-11) SentiLARE: sentiment-aware language representation learning with linguistic knowledge. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6975–6988. External Links: Link, Document Cited by: §2.1, §2.4, Table 2, §4.5.
  • [37] W. Kryscinski, B. McCann, C. Xiong, and R. Socher (2020-11) Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9332–9346. External Links: Link, Document Cited by: 1st item.
  • [38] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations. External Links: Link Cited by: §5.4.2.
  • [39] A. Lauscher, O. Majewska, L. F. R. Ribeiro, I. Gurevych, N. Rozanov, and G. Glavaš (2020-11) Common sense or world knowledge? investigating adapter-based knowledge injection into pretrained transformers. In Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Online, pp. 43–49. External Links: Link, Document Cited by: §2.1.
  • [40] A. Lauscher, I. Vulić, E. M. Ponti, A. Korhonen, and G. Glavaš (2020-12) Specializing unsupervised pretraining models for word-level semantic similarity. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 1371–1383. External Links: Link, Document Cited by: Table 2.
  • [41] Y. Levine, B. Lenz, O. Dagan, O. Ram, D. Padnos, O. Sharir, S. Shalev-Shwartz, A. Shashua, and Y. Shoham (2020-07) SenseBERT: driving some sense into BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Cited by: §2.1, Table 2.
  • [42] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020-07) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. External Links: Link, Document Cited by: §3.1.
  • [43] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 9459–9474. Cited by: §3.1, §5.4.2.
  • [44] Y. Li, P. Goel, V. Kuppur Rajendra, H. Simrat Singh, J. Francis, K. Ma, E. Nyberg, and A. Oltramari (2020-12) Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection. arXiv e-prints, pp. arXiv:2012.10813. External Links: 2012.10813 Cited by: §2.3, Table 2, §4.1.
  • [45] B. Y. Lin, X. Chen, J. Chen, and X. Ren (2019-11) KagNet: knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2829–2839. External Links: Link, Document Cited by: §1, §2.3, §3.4.2, Table 2, §4.3.3.
  • [46] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren (2020-11) CommonGen: a constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1823–1840. External Links: Link, Document Cited by: 2nd item, §4.1.
  • [47] X. Ling, S. Singh, and D. S. Weld (2015) Design challenges for entity linking. Transactions of the Association for Computational Linguistics 3, pp. 315–328. External Links: Link, Document Cited by: §4.2.1.
  • [48] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019-06) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1073–1094. External Links: Link, Document Cited by: §1.
  • [49] W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang (2020) K-BERT: enabling language representation with knowledge graph. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 2901–2908. Cited by: §1, §2.4, §3.3, Table 2, §5.4.1.
  • [50] Y. Liu, Y. Wan, L. He, H. Peng, and P. S. Yu (2020-09) KG-BART: Knowledge Graph-Augmented BART for Generative Commonsense Reasoning. arXiv e-prints, pp. arXiv:2009.12677. External Links: 2009.12677 Cited by: 2nd item, §2.3, §3.4.2, Table 2, §4.1, §5.4.1.
  • [51] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §4.3.3.
  • [52] R. Logan, N. F. Liu, M. E. Peters, M. Gardner, and S. Singh (2019-07) Barack’s wife Hillary: using knowledge graphs for fact-aware language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5962–5971. External Links: Link, Document Cited by: §3.3, Table 2.
  • [53] S. Lv, D. Guo, J. Xu, D. Tang, N. Duan, M. Gong, L. Shou, D. Jiang, G. Cao, and S. Hu (2020) Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, USA, pp. 8449–8456. Cited by: §2.3, §3.4.2, §4.3.3.
  • [54] J. McAuley, R. Pandey, and J. Leskovec (2015) Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York, NY, USA, pp. 785–794. External Links: ISBN 9781450336642 Cited by: §2.4.
  • [55] T. McCoy, E. Pavlick, and T. Linzen (2019-07) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3428–3448. Cited by: 1st item.
  • [56] J. Miller, K. Krauth, B. Recht, and L. Schmidt (2020) The effect of natural distribution shift on question answering models. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 6905–6916. External Links: Link Cited by: §4.5.
  • [57] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016-06) A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 839–849. External Links: Link, Document Cited by: §4.1.
  • [58] T. Niven and H. Kao (2019-07) Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4658–4664. External Links: Link, Document Cited by: 1st item.
  • [59] M. Ostendorff, P. Bourgonje, M. Berger, J. Moreno-Schneider, and G. Rehm (2019) Enriching BERT with Knowledge Graph Embedding for Document Classification. In Proceedings of the GermEval 2019 Workshop, Erlangen, Germany. Cited by: Table 2.
  • [60] J. Pennington, R. Socher, and C. Manning (2014-10) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. External Links: Link, Document Cited by: §3.4.2.
  • [61] M. E. Peters, M. Neumann, R. Logan, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019-11) Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 43–54. Cited by: §1, Table 2.
  • [62] F. Petroni, P. Lewis, A. Piktus, T. Rocktäschel, Y. Wu, A. H. Miller, and S. Riedel (2020) How context affects language models’ factual predictions. In Automated Knowledge Base Construction, Cited by: 1st item.
  • [63] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller (2019-11) Language models as knowledge bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2463–2473. Cited by: §1, §4.3.1.
  • [64] N. Poerner, U. Waltinger, and H. Schütze (2020-11) E-BERT: efficient-yet-effective entity embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 803–818. Cited by: Table 1, Table 2, §4.2.1, §4.2.2, §4.3.1, §4.3.1.
  • [65] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar (2014-08) SemEval-2014 task 4: aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, pp. 27–35. External Links: Link, Document Cited by: §4.5.
  • [66] Y. Qin, Y. Lin, R. Takanobu, Z. Liu, P. Li, H. Ji, M. Huang, M. Sun, and J. Zhou (2020-12) ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning. arXiv e-prints, pp. arXiv:2012.15022. External Links: 2012.15022 Cited by: §2.2, §3.3, Table 2, 1st item.
  • [67] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang (2020-10) Pre-trained models for natural language processing: A survey. Science China Technological Sciences 63 (10), pp. 1872–1897. External Links: Document Cited by: §1.
  • [68] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §1.
  • [69] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019-10) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv e-prints, pp. arXiv:1910.10683. Cited by: §1.
  • [70] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016-11) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. External Links: Link, Document Cited by: §4.3.2.
  • [71] A. Roberts, C. Raffel, and N. Shazeer (2020-11) How much knowledge can you pack into the parameters of a language model?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. Cited by: §1.
  • [72] C. Rosset, C. Xiong, M. Phan, X. Song, P. Bennett, and S. Tiwary (2020-06) Knowledge-Aware Language Model Pretraining. arXiv e-prints, pp. arXiv:2007.00655. External Links: 2007.00655 Cited by: §3.2, Table 1, §4.3.1, 1st item.
  • [73] T. Safavi and D. Koutra (2021-11) Relational world knowledge representation in contextual language models: a review. To appear in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: §1.
  • [74] M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi (2019) ATOMIC: an atlas of machine commonsense for if-then reasoning. In AAAI, pp. 3027–3035. Cited by: 2nd item, §2.3, §4.4.
  • [75] J. Shang, J. Liu, M. Jiang, X. Ren, C. R. Voss, and J. Han (2018) Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering 30 (10), pp. 1825–1837. External Links: Document Cited by: Table 2.
  • [76] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer (2020) Q-BERT: hessian based ultra low precision quantization of BERT. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 8815–8821. External Links: Link Cited by: §5.4.2.
  • [77] T. Shen, Y. Mao, P. He, G. Long, A. Trischler, and W. Chen (2020-11) Exploiting structured knowledge in text via graph-guided representation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 8980–8994. Cited by: §2.3, §3.2, Table 1, Table 2, §4.3.3, §4.4.
  • [78] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013-10) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642. External Links: Link Cited by: §4.5.
  • [79] R. Speer, J. Chin, and C. Havasi (2017) ConceptNet 5.5: an open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, S. P. Singh and S. Markovitch (Eds.), pp. 4444–4451. External Links: Link Cited by: 1st item, §2.3, §4.3.3, §4.4.
  • [80] Y. Su, X. Han, Z. Zhang, P. Li, Z. Liu, Y. Lin, J. Zhou, and M. Sun (2020-09) CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models. arXiv e-prints, pp. arXiv:2009.13964. External Links: 2009.13964 Cited by: §3.2, §3.4.1, Table 1, Table 2, §4.2.1, §4.2.2, 1st item, §5.3.
  • [81] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, New York, NY, USA, pp. 1441–1450. External Links: ISBN 978-1-4503-6976-3, Link, Document Cited by: §1, §5.4.1.
  • [82] T. Sun, Y. Shao, X. Qiu, Q. Guo, Y. Hu, X. Huang, and Z. Zhang (2020) CoLAKE: contextualized language and knowledge embedding. In Proceedings of the 28th International Conference on Computational Linguistics, COLING, Cited by: §2.2, §3.2, §3.4.1, Table 2, §4.2.1, §4.2.2, §4.3.1, 1st item, §5.4.1.
  • [83] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang (2020) ERNIE 2.0: A continual pre-training framework for language understanding. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 8968–8975. External Links: Link Cited by: §3.2, §3.2, Table 2.
  • [84] A. Talmor, Y. Elazar, Y. Goldberg, and J. Berant (2020) oLMpics - on what language model pre-training captures. Transactions of the Association for Computational Linguistics 8, pp. 743–758. External Links: Link, Document Cited by: 1st item.
  • [85] A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019-06) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158. External Links: Document Cited by: §4.3.3.
  • [86] H. Tian, C. Gao, X. Xiao, H. Liu, B. He, H. Wu, H. Wang, and F. Wu (2020-07) SKEP: sentiment knowledge enhanced pre-training for sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4067–4076. External Links: Link, Document Cited by: §2.4, Table 2, §4.5.
  • [87] R. Trivedi, H. Dai, Y. Wang, and L. Song (2017-08) Know-evolve: deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3462–3471. External Links: Link Cited by: §5.2.2.
  • [88] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. External Links: Link Cited by: §3.4.1.
  • [89] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §3.4.1.
  • [90] P. Verga, H. Sun, L. Baldini Soares, and W. W. Cohen (2020-07) Facts as Experts: Adaptable and Interpretable Neural Memory over Symbolic Knowledge. arXiv e-prints, pp. arXiv:2007.00849. External Links: 2007.00849 Cited by: §1, §2.2, §3.3, Table 2, §5.4.2.
  • [91] D. Vrandečić and M. Krötzsch (2014-09) Wikidata: a free collaborative knowledgebase. Commun. ACM 57 (10), pp. 78–85. External Links: ISSN 0001-0782, Link, Document Cited by: §2.2.
  • [92] I. Vulić, E. M. Ponti, R. Litschko, G. Glavaš, and A. Korhonen (2020-11) Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7222–7240. External Links: Link, Document Cited by: §1.
  • [93] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §2.1.
  • [94] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018-11) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. External Links: Link, Document Cited by: §2.1.
  • [95] C. Wang, X. Liu, and D. Song (2020-10) Language Models are Open Knowledge Graphs. arXiv e-prints, pp. arXiv:2010.11967. External Links: 2010.11967 Cited by: §1.
  • [96] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, J. Ji, G. Cao, D. Jiang, and M. Zhou (2020-02) K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. arXiv e-prints, pp. arXiv:2002.01808. External Links: 2002.01808 Cited by: §1, §2.1, §2.2, §2.4, §3.3, Table 2, §4.2.1, §4.2.2, §4.3.1, §4.3.2, §4.3.3, 2nd item, §5.4.1.
  • [97] X. Wang, T. Gao, Z. Zhu, Z. Zhang, Z. Liu, J. Li, and J. Tang (2019-11) KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. arXiv e-prints, pp. arXiv:1911.06136. External Links: 1911.06136 Cited by: §3.3, Table 2, §4.2.1, §4.2.2, §4.3.1, §4.4, §4.4, §5.3.
  • [98] X. Xie, F. Sun, Z. Liu, S. Wu, J. Gao, B. Ding, and B. Cui (2020-10) Contrastive Learning for Sequential Recommendation. arXiv e-prints, pp. arXiv:2010.14395. External Links: 2010.14395 Cited by: §1.
  • [99] W. Xiong, J. Du, W. Y. Wang, and V. Stoyanov (2020) Pretrained encyclopedia: weakly supervised knowledge-pretrained language model. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §3.2, Table 1, Table 2.
  • [100] I. Yamada, A. Asai, H. Shindo, H. Takeda, and Y. Matsumoto (2020) LUKE: deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. Cited by: §1, §3.2, Table 1, Table 2, §4.2.1, §4.2.2, §4.3.2, §5.3, §5.4.1.
  • [101] A. Yang, Q. Wang, J. Liu, K. Liu, Y. Lyu, H. Wu, Q. She, and S. Li (2019-07) Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2346–2357. External Links: Link, Document Cited by: §1, Table 2, §4.3.2.
  • [102] L. Yao, C. Mao, and Y. Luo (2019-09) KG-BERT: BERT for Knowledge Graph Completion. arXiv e-prints, pp. arXiv:1909.03193. External Links: 1909.03193 Cited by: §1, §3.3, Table 2, §4.4.
  • [103] Z. Ye, Q. Chen, W. Wang, and Z. Ling (2019-08) Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models. arXiv e-prints, pp. arXiv:1908.06725. External Links: 1908.06725 Cited by: §2.3, §3.2, Table 2, §4.3.3.
  • [104] C. Yu, H. Zhang, Y. Song, and W. Ng (2020-12) CoCoLM: COmplex COmmonsense Enhanced Language Model. arXiv e-prints, pp. arXiv:2012.15643. External Links: 2012.15643 Cited by: §2.3, Table 2, §4.1.
  • [105] D. Yu, C. Zhu, Y. Yang, and M. Zeng (2020-10) JAKET: Joint Pre-training of Knowledge Graph and Language Understanding. arXiv e-prints, pp. arXiv:2010.00796. External Links: 2010.00796 Cited by: §2.2, §3.2, Table 1, Table 2.
  • [106] D. Zhang, Z. Yuan, Y. Liu, Z. Fu, F. Zhuang, P. Wang, H. Chen, and H. Xiong (2020-09) E-BERT: A Phrase and Product Knowledge Enhanced Language Model for E-commerce. arXiv e-prints, pp. arXiv:2009.02835. External Links: 2009.02835 Cited by: §2.4, §3.2, Table 2, §4.5.
  • [107] Y. Zhang, V. Zhong, D. Chen, G. Angeli, and C. D. Manning (2017-09) Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 35–45. External Links: Link, Document Cited by: §4.2.2.
  • [108] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu (2019-07) ERNIE: enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1441–1451. External Links: Document Cited by: §1, §2.2, §3.2, §3.3, §3.4.1, Table 1, Table 2, §4.2.1, §4.2.2, 1st item, §5.4.1.
  • [109] J. Zhou, Z. Zhang, H. Zhao, and S. Zhang (2020-11) LIMIT-BERT : linguistics informed multi-task BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 4450–4461. External Links: Link, Document Cited by: §2.1, Table 2.
  • [110] M. Zhu, Y. Zhang, W. Chen, M. Zhang, and J. Zhu (2013-08) Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, pp. 434–443. External Links: Link Cited by: §2.1.