
Does Knowledge Help General NLU? An Empirical Study

09/01/2021
by   Ruochen Xu, et al.
Microsoft

It is often observed in knowledge-centric tasks (e.g., commonsense question answering, relation classification) that the integration of external knowledge such as entity representations into language models can provide useful information to boost performance. However, it is still unclear whether this benefit extends to general natural language understanding (NLU) tasks. In this work, we empirically investigate the contribution of external knowledge by measuring the end-to-end performance of language models with various knowledge integration methods. We find that the introduction of knowledge can significantly improve the results on certain tasks while having no adverse effects on other tasks. We then employ mutual information to reflect the difference brought by knowledge and a neural interpretation model to reveal how a language model utilizes external knowledge. Our study provides valuable insights and guidance for practitioners to equip NLP models with knowledge.


1 Introduction

Language models utilize contextualized word representations to boost the performance of various NLP tasks Devlin et al. (2019); Liu et al. (2019); Lan et al. (2020); Clark et al. (2020). In recent years, there has been a rise in the trend of integrating external knowledge into language models Wang et al. (2020); Yu et al. (2020); Wang et al. (2019b); Liu et al. (2020); Peters et al. (2019); Zhang et al. (2019); Poerner et al. (2020) based on the transformer Vaswani et al. (2017). For instance, representations of entities in the input and related concepts are combined with contextual representations to provide additional information, leading to significant improvement in many tasks Han et al. (2018); Choi et al. (2018); Ling and Weld (2012); Berant et al. (2013).

However, most of these approaches Zhang et al. (2017); Talmor et al. (2019) focus on knowledge-centric tasks, e.g. commonsense QA and relation extraction, where completing the task requires information from an external source beyond the input text. These works overlook many more general NLP tasks which do not explicitly call for the use of knowledge, including but not limited to sentiment classification, natural language inference, sentence similarity, part-of-speech tagging, and named entity recognition Wang et al. (2018, 2019a). So far, the improvement on these tasks often originates from more sophisticated architectures, larger model sizes, and an increasing amount of pre-training data. Very little work has investigated whether external knowledge can improve the performance of these non-knowledge-centric tasks.

In this work, we aim to find out whether external knowledge can lead to better language understanding ability for general NLU tasks. Specifically, we want to answer the following questions:


  • First of all, does knowledge help general NLU tasks overall? (Section 4.2 Q1)

  • Among various NLU tasks, what source of knowledge and which tasks could benefit the most from the integration of external knowledge? (Section 4.2 Q2)

  • Under the same experimental settings, which integration methods are the most effective in combining knowledge with language models? (Section 4.2 Q3)

  • Which large-scale pre-trained language models benefit the most from external knowledge? (Section 4.2 Q4)

  • Besides the end-to-end performance indicator, can we tell whether knowledge is helpful? (Section 4.3 Q1)

  • If knowledge can help with certain NLU tasks, how does the language model utilize the external knowledge? (Section 4.3 Q2)

Answers to these questions not only help us understand how knowledge is leveraged in language models but also provide important insights into how to leverage knowledge in various NLU tasks.

In detail, we explore different sources of knowledge, including textual explanations of entities and their embeddings. We explore two main categories of knowledge integration methods. Knowledge as Text places descriptions of entities into the input text, with either the normal or a modified attention mechanism in the language model. Knowledge as Embedding integrates contextual or graphical embeddings of entities into the language model via addition. Both methods are non-invasive, meaning that the language model’s inner structure does not need to be altered. We apply these knowledge integration methods to 4 pretrained language models and conduct extensive experiments on 10 NLU tasks. The results show that introducing knowledge outperforms vanilla pretrained language models by 0.46 points on average across all language models.

To understand how and why knowledge integration methods help language models, we also utilize mutual information (MI) to reflect the difference brought by knowledge (see Figure 5) and visualize the contribution of inputs to the predictions of knowledge-enhanced language models (see Figure 3) to better understand the interaction between language models and knowledge. We find that knowledge integration methods retain more information about the input while gradually discarding task-irrelevant information, and finally keep more information about the output. Although the knowledge is only introduced for a subset of tokens in the input sentence, it affects the decision process of the model on all tokens and improves the generalization ability on certain tasks.

In summary, we present a systematic empirical analysis on how to effectively integrate external knowledge into existing language models for general NLU tasks. This provides valuable insights and guidance for practitioners to effectively equip language models with knowledge for different NLU tasks.

2 Related Work

In this section, we review previous works that explore how to combine external knowledge with language models, which can be grouped into the following categories.

Joint Pretraining Some recent works combine pre-trained language models with external knowledge by joint pretraining on both unstructured text and structured knowledge bases. ERNIE (Baidu) Sun et al. (2019) modifies the pretraining objective of BERT Devlin et al. (2019) to mask whole spans of named entities. WKLM Xiong et al. (2020) trains the model to detect whether an entity has been replaced by another one of the same category. LUKE Yamada et al. (2020) proposes a pretrained model that uses similar entity masks from Wikipedia during pretraining but treats words and entities in a given text as independent tokens. KEPLER Wang et al. (2019b) and JAKET Yu et al. (2020) introduce descriptive text of entities and their relations into pretraining. Built upon an existing pre-trained encoder Liu et al. (2019), a fully Wang et al. (2019b) or partially Yu et al. (2020) shared encoder is used to encode entity descriptive text with entity-related objectives such as relation type prediction.

Static Entity Representations Another way to combine knowledge with a language model is to use static entity representations learned separately from a knowledge base. ERNIE (THU) Zhang et al. (2019) and KnowBert Peters et al. (2019) merge entity representations with the language model using entity-to-word attention. E-BERT Poerner et al. (2020) aligns static entity vectors from Wikipedia2Vec with BERT’s native wordpiece vector space and uses the aligned entity vectors as if they were wordpiece vectors.

Adaptation to Knowledge-Free Model It is also possible to incorporate knowledge without joint pretraining or relying on knowledge embeddings. K-BERT Liu et al. (2020) injects triples from knowledge graphs into sentences. Special soft-positions and a visible matrix in attention are introduced to prevent the injected knowledge from diverting the meaning of the original sentence. K-Adapter Wang et al. (2020) initializes the model parameters from RoBERTa Liu et al. (2019) and equips it with adapters that continue training on entity-related objectives.

3 Approaches

Figure 1: Illustration of all approaches to incorporate knowledge with language models. Here we assume the encoder module has the flexibility to take either a sequence of tokens or a sequence of token embeddings.

3.1 Definition

Given the input text X = (x_1, ..., x_n) with n tokens, a language model M produces the contextual word representation H = M(X). We use h^l to represent the intermediate hidden states after l layers. We further assume W(X) = (w_1, ..., w_n) represents the token embeddings of X. For a specific downstream task, a header function f further takes the output of M as input and generates the prediction as ŷ = f(M(X)).

As we adopt entity information as knowledge in this paper, we assume that the input text X contains m entities {e_1, ..., e_m}, where each entity e_i is represented by a contiguous span of tokens in X: [s_i, t_i], where s_i and t_i represent the start and end positions of e_i.

3.2 Combine Knowledge with Language Models

For a downstream task, the pre-trained language model M and the head function f are jointly trained to minimize the loss function L on the training data D_train:

min_{M, f} Σ_{(X, y) ∈ D_train} L(f(M(X)), y).   (1)

Given the external knowledge K, we explore several general methods to incorporate it into any pre-trained language model M such that the knowledge-enhanced language model can encode the information from both X and K.

In this work, we consider two formats of knowledge centered on entities:

  • Free text: an unstructured text D(e_i) that describes an entity e_i, e.g. the definition of e_i from a dictionary;

  • Embedding: a continuous embedding vector v_i that encodes an entity e_i, e.g. the graph embedding of the node of e_i from a knowledge graph.

To align with the format of knowledge, our integration methods include i) Knowledge as Text, and ii) Knowledge as Embedding, as described in the following sections.

Approach        Example
Insert after    The sponge sponge: Any of various marine invertebrates … soaked soak: To be saturated with liquid … up the water.
Append to end   The sponge soaked up the water. sponge: Any of various marine invertebrates … soak: To be saturated with liquid …
Table 1: Examples of two approaches of combining external descriptions with text.

3.2.1 Knowledge as Text

The simplest way to incorporate a textual description D(e_i) with the input X is to concatenate them in the text space. We explore two ways of combination (Table 1): inserting the description after the entity e_i in X, and appending it to the end of X. Empirically, we found that the second approach always outperforms the first one on the GLUE benchmark, so we adopt the appending approach as the first knowledge combination method, which we refer to as Knowledge as Text (KT). A minimal sketch of both variants follows.
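As an illustration only (the helper name, whitespace tokenization, and toy description lookup below are ours, not the authors' implementation), the two combination variants of Table 1 can be sketched as:

```python
def combine_knowledge_as_text(tokens, entity_spans, descriptions, mode="append"):
    """Combine entity descriptions with the input text (cf. Table 1).

    tokens:       input tokens, e.g. ["The", "sponge", "soaked", "up", "the", "water."]
    entity_spans: (start, end) token indices (inclusive) of each entity in the input
    descriptions: mapping from an entity's surface form to its description string,
                  e.g. {"sponge": "sponge: Any of various marine invertebrates ..."}
    mode:         "insert" places each description right after its entity;
                  "append" places all descriptions after the sentence (the KT variant).
    """
    def desc_of(start, end):
        return descriptions[" ".join(tokens[start:end + 1])]

    if mode == "insert":
        out = []
        for i, tok in enumerate(tokens):
            out.append(tok)
            # Insert the description of any entity that ends at this position.
            out.extend(desc_of(s, e) for s, e in entity_spans if e == i)
        return " ".join(out)
    # Append all descriptions to the end of the original sentence.
    return " ".join(tokens) + " " + " ".join(desc_of(s, e) for s, e in entity_spans)


descriptions = {"sponge": "sponge: Any of various marine invertebrates ...",
                "soaked": "soak: To be saturated with liquid ..."}
tokens = ["The", "sponge", "soaked", "up", "the", "water."]
print(combine_knowledge_as_text(tokens, [(1, 1), (2, 2)], descriptions, mode="append"))
```

Switching mode between "insert" and "append" reproduces the two rows of Table 1; the paper adopts the append variant (KT).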

As pointed out in Liu et al. (2020), incorporating too much knowledge may divert the sentence from its original meaning by introducing a lot of noise. This is more likely to happen when there are multiple entities in the input text. To address this issue, we adopt the visibility matrix (Liu et al., 2020) to limit the impact of descriptions on the original text. In the Transformer architecture, an attention mask matrix A is added to the self-attention weights before the softmax. A value of −∞ at A_ij blocks token x_i from attending to token x_j, while a value of 0 allows x_i to attend to x_j. In our case, we modify the attention mask matrix such that

A_ij = 0 if (x_i ∈ X and x_j ∈ X), or (x_i ∈ D(e_k) and x_j ∈ D(e_k) for the same entity e_k), or (x_i is the token at the starting position s_k of entity e_k in X and x_j ∈ D(e_k)); otherwise A_ij = −∞.   (2)

where x_i and x_j are tokens from the concatenation of X and the descriptions D(e_1), ..., D(e_m). In other words, x_i can attend to x_j if: both tokens belong to the input X, or both tokens belong to the description of the same entity e_k, or x_i is the token at the starting position of entity e_k in X and x_j is from its description text D(e_k).

Figure 2: Illustration of attention matrix in KT-Attn. In this example, K-Text1 describes the token starting at position 1 and K-Text2 describes the token starting at position 3.

Figure 2 illustrates the attention matrix given an input text and two entity descriptions. We refer to this approach as Knowledge as Text with Attention (KT-Attn).
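To make the visibility constraint of Eq. (2) concrete, the following sketch builds the attention mask for an input concatenated with one description per entity, using 0 for allowed attention and −∞ for blocked attention. The function and argument names are illustrative, and whether the entity-to-description rule is applied symmetrically is an implementation choice we assume here.

```python
import numpy as np

NEG_INF = float("-inf")

def build_kt_attn_mask(n_input, desc_spans):
    """Build the KT-Attn attention mask of Eq. (2).

    n_input:    number of tokens in the original input X
    desc_spans: list of (entity_start, desc_start, desc_end) tuples giving, for each
                entity, the position of its first token in X and the (inclusive) span
                of its description in the concatenated sequence.
    Returns an (L, L) matrix with 0 where attention is allowed and -inf where blocked.
    """
    total_len = n_input + sum(end - start + 1 for _, start, end in desc_spans)
    mask = np.full((total_len, total_len), NEG_INF)

    # Rule 1: tokens of the input X can attend to each other.
    mask[:n_input, :n_input] = 0.0

    for ent_start, d_start, d_end in desc_spans:
        # Rule 2: tokens within the same description can attend to each other.
        mask[d_start:d_end + 1, d_start:d_end + 1] = 0.0
        # Rule 3: the entity's starting token in X and its description see each other
        # (applied symmetrically here; the exact directionality is an implementation choice).
        mask[ent_start, d_start:d_end + 1] = 0.0
        mask[d_start:d_end + 1, ent_start] = 0.0
    return mask

# Example: 5 input tokens; the entity at position 1 has a 3-token description at 5-7.
print(build_kt_attn_mask(5, [(1, 5, 7)]))
```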

Knowledge as Embedding. Here, we first represent each entity e_i by an embedding vector v_i. When a knowledge graph of entities is available, we can obtain a graphical embedding for each entity node. In our experiments, we use pre-trained TransE embeddings Bordes et al. (2013) (available from http://openke.thunlp.org/) to obtain the embedding of each entity in the Wikidata knowledge graph.

We feed v_i into a multi-layer perceptron (MLP) to align it with the input embeddings of the language model. We then linearly combine the transformed embedding with the input token embedding w_{s_i} at position s_i.

As the language model was not exposed to this additional entity embedding during pre-training, we initialize its weight to zero and linearly increase the weight during the whole fine-tuning. Defining ŵ_{s_i} as the vector representation that is fed into the language model, we have

ŵ_{s_i} = w_{s_i} + λ · MLP(v_i),

where λ is linearly annealed from 0 to its maximum value over the course of fine-tuning. We refer to this integration method as Knowledge as Graph Embedding (KG-Emb).
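A minimal PyTorch sketch of the KG-Emb combination is given below. The MLP architecture, the module name, and taking lambda_max to correspond to the warmup weight searched in Section 4.1 are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class KGEmbCombiner(nn.Module):
    """Blend a TransE entity embedding into the token embedding at the entity's start."""

    def __init__(self, kg_dim, lm_dim, lambda_max=0.2):
        super().__init__()
        # MLP aligning the graph embedding with the LM's input embedding space.
        self.mlp = nn.Sequential(nn.Linear(kg_dim, lm_dim), nn.GELU(),
                                 nn.Linear(lm_dim, lm_dim))
        self.lambda_max = lambda_max

    def forward(self, token_embs, entity_embs, entity_positions, progress):
        """
        token_embs:       (batch, seq_len, lm_dim) LM input token embeddings
        entity_embs:      (batch, n_entities, kg_dim) pre-trained TransE embeddings
        entity_positions: (batch, n_entities) start position s_i of each entity
        progress:         scalar in [0, 1], fraction of fine-tuning completed
        """
        lam = self.lambda_max * progress          # weight annealed from 0 upward
        projected = self.mlp(entity_embs)         # (batch, n_entities, lm_dim)
        out = token_embs.clone()
        for b in range(token_embs.size(0)):
            for k in range(entity_positions.size(1)):
                pos = entity_positions[b, k]
                # w_hat = w + lambda * MLP(v), fed into the LM instead of w.
                out[b, pos] = token_embs[b, pos] + lam * projected[b, k]
        return out
```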

When a knowledge graph is not available, we can use entity descriptions to produce the entity embedding v_i. Here, we use the language model M to encode the entity description D(e_i) into a contextual representation. As shown in Table 1, the knowledge text always starts with the token being explained, e.g. sponge: Any of various marine…. Therefore, we use the contextual representation of the first token in D(e_i) as the entity embedding v_i. We then use v_i in the same way as in KG-Emb. Compared with existing work (Yu et al., 2020; Wang et al., 2019b), our approach does not require pre-training with external knowledge and can be easily applied to any pre-trained language model in a non-invasive way. We refer to this method as Knowledge as Textual Embedding (KT-Emb).
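The following sketch shows one way to obtain such a description-based entity embedding with a HuggingFace encoder. The model name is only an example, and taking the representation at position 1 (the first wordpiece after the special start token) is our reading of "the first token in D(e_i)"; in the paper the description is encoded by the same language model M being fine-tuned, whereas the sketch freezes the encoder for brevity.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def kt_emb(description: str) -> torch.Tensor:
    """Encode an entity description and return the representation of its first token.

    The description starts with the entity being explained, e.g.
    "sponge: Any of various marine invertebrates ...".
    """
    inputs = tokenizer(description, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
    # Position 0 is the special <s> token, so position 1 is the first description token.
    return hidden[0, 1]

entity_vec = kt_emb("sponge: Any of various marine invertebrates ...")
# entity_vec is then blended with the input token embedding exactly as in KG-Emb.
```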

4 Experiments

In this section, we perform extensive experiments to examine the aforementioned knowledge integration methods in different pre-trained language models (LMs) on a variety of NLU tasks.

Dataset #Train #Val Task
CoLA 8.5K 1K regression
SST-2 67K 1.8K classification
MNLI 393K 20K classification
QQP 364K 391K classification
QNLI 105K 4K classification
STS-B 7K 1.4K regression
MRPC 3.7K 1.7K classification
RTE 2.5K 3K classification
POS 38.2K 5.5K sequence labeling
NER 14K 3.3K sequence labeling
Table 2: Statistics of the datasets. #Train and #Val are the number of samples for training and validation.
Model CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE Avg
Metrics Matt. corr. Acc. Acc. Pear. corr. Acc. Acc. Acc. Acc.
RoBERTa-Large Liu et al. (2019) 68.0 96.4 90.9 92.4 92.2 90.2 94.7 86.6 88.93
RoBERTa-Large (ours) 67.02 96.22 90.93 92.71 92.15 90.59 94.73 90.61 89.37
+ KT 68.52 96.33 89.22 92.39 92.01 90.49 94.55 90.97 89.31
+ KT-Attn 68.84 96.44 91.18 92.61 92.09 90.63 94.67 91.7 89.77
+ KT-Emb 68.22 96.56 90.69 92.8 92.08 90.56 94.73 90.97 89.58
+ KG-Emb 68.03 96.44 90.69 92.42 92.19 90.63 94.55 90.61 89.45
RoBERTa-Base (ours) 60.07 94.72 89.71 90.95 91.58 87.73 92.84 75.09 85.34
+ KT 62.89 94.72 88.24 89.87 91.57 87.78 92.75 69.68 84.69
+ KT-Attn 62.35 94.84 89.22 90.98 91.58 87.92 92.90 76.17 85.75
+ KT-Emb 62.43 94.84 89.71 90.9 91.49 88.02 92.77 73.29 85.43
+ KG-Emb 61.62 95.18 88.97 90.45 91.5 88.01 93.06 73.65 85.31
Table 3: Results for RoBERTa on classification and regression (CR) tasks. All results are medians over five runs with different seeds on the development set. To validate our results, we follow RoBERTa Liu et al. (2019) to finetune starting from the MNLI model for RoBERTa-large instead of the baseline pretrained model on RTE, STS-B and MRPC tasks. Complete results on other pretrained language models can be found in the Appendix B.
Model POS NER Avg
RoBERTa-Large 96.95 96.33 96.64
+ KT 97.06 / 96.93 96.21 / 96.72 96.89
+ KT-Attn 97.06 / 96.94 96.18 / 96.32 96.69
+ KT-Emb 96.98 / 96.97 96.67 / 96.62 96.83
+ KG-Emb 96.95 96.64 96.80
RoBERTa-Base 96.88 95.30 96.09
+ KT 97.03 / 96.87 95.07 / 95.57 96.3
+ KT-Attn 97.05 / 96.87 95.21 / 95.3 96.18
+ KT-Emb 97.03 / 96.91 95.70 / 95.69 96.37
+ KG-Emb 96.91 95.75 96.33
Table 4: Results for RoBERTa on two sequence labeling (SL) tasks. For KT, KT-Attn and KT-Emb, we also experiment with extracting knowledge descriptions for tokens obtained via entity linking; these results are shown to the right of each slash. We report F1 for both tasks. Reported results are medians over five runs on the development set. Complete results on other pre-trained language models can be found in Appendix B.

4.1 Experimental Setup

Table 2 lists the 10 datasets in our study, including 8 classification and regression (CR) tasks from the GLUE Wang et al. (2018) benchmark and two sequence labeling (SL) tasks from Penn Treebank Marcus et al. (1993) and CoNLL-2003 shared task data Tjong Kim Sang and De Meulder (2003). We study on 4 different LMs: RoBERTa Liu et al. (2019); BERT Devlin et al. (2019), ALBERT Lan et al. (2020) and ELECTRA Clark et al. (2020). For each language model, we experiment with both base and large models. Details of datasets and LMs can be found in Appendix A.

Our implementation is based on HuggingFace’s Transformers Wolf et al. (2020). We conduct all experiments on 8 Nvidia A100-40GB GPU cards. We fix the training epochs and batch size for each task and run a limited hyperparameter sweep over learning rates {1e-5, 2e-5, 3e-5}. For KT-Emb and KG-Emb, we search the warmup weight over {0.1, 0.2, 0.3}. For CR tasks, the training epochs are set to 10; due to the sufficient training data of MNLI and QQP, we set their epochs to 5. For SL tasks, we set the training epochs to 3. The batch size is set to 128 for CR tasks, except that we search the batch size in {16, 32, 128} for CoLA and STS-B on RoBERTa-Base due to their small training data and then fix it for fair comparison (CoLA and STS-B use batch size 32 and 16, respectively). For SL tasks, the batch size is set to 16. We report the median of results on the development set over five fixed random seeds for all tasks.

To extract the knowledge description, we first use spaCy (https://spacy.io/) to annotate the input and select nouns, verbs, or adjectives as the knowledge entities. For KG-Emb, we use REL van Hulst et al. (2020) to link entities to Wikidata. We leverage the external knowledge source Wiktionary (https://en.wiktionary.org/wiki/Wiktionary:Main_Page) to obtain the description for each entity.
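A rough sketch of this extraction step is shown below, assuming the en_core_web_sm spaCy model and an offline dictionary built from Wiktionary; the lookup object and function names are placeholders, not a real Wiktionary API.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_knowledge_entities(sentence, dictionary):
    """Select nouns, verbs and adjectives as knowledge entities and attach descriptions.

    dictionary: a dict-like lookup from lemma to a short definition string
                (e.g. built offline from a Wiktionary dump); tokens without a
                definition are skipped.
    """
    entities = []
    for token in nlp(sentence):
        if token.pos_ in {"NOUN", "VERB", "ADJ"}:
            definition = dictionary.get(token.lemma_)
            if definition:
                entities.append({"text": token.text,
                                 "start": token.i,  # token position in the sentence
                                 "description": f"{token.lemma_}: {definition}"})
    return entities

toy_dictionary = {"sponge": "Any of various marine invertebrates ...",
                  "soak": "To be saturated with liquid ..."}
print(extract_knowledge_entities("The sponge soaked up the water.", toy_dictionary))
```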

Figure 3: DiffMask plot for the CoLA task with the RoBERTa-Base model. The CoLA task is to predict the linguistic acceptability of a sentence. A purple cell means that the corresponding layer of the model considers the token on the left not important for the end task, so it can be ignored. A yellow cell means the opposite, and green cells mean neutrality. The left column shows the result from the vanilla RoBERTa-Base, and the right column shows KT-Attn, one of our knowledge-integrated language models. Clearly, KT-Attn has a better understanding of the end task, as it correctly identifies words and phrases such as “not” and “and Sue to stay” that would not change the linguistic acceptability if ignored.

4.2 Knowledge Integration Results

In this section, we present the results of different knowledge integration methods on 8 pretrained language models. Table 3 and Table 4 list detailed numbers on the 10 NLU tasks for the RoBERTa base and large models. Figure 4 summarizes our results on all LMs. From these results, we aim to answer the following questions.

Figure 4: Effectiveness of knowledge integration methods on different tasks and language models. Figure (a) shows for each task the number of language models for which our knowledge integration methods improve accuracy (the maximum is 8, which means it helps all LMs on that task). Figure (b) shows for each language model the number of tasks on which knowledge integration improves accuracy (the maximum is 10, which means it helps all tasks with that language model). Figure (c) shows for each knowledge integration method the number of tasks on which it performs best (the maximum is 10, which means it always performs best among the 4 integration methods across different LMs). Figure (d) shows for each language model the maximum average gains over CR and SL tasks. ‘B’ and ‘L’ stand for the base and large model respectively. The dashed lines in figures (a), (b) and (c) represent the upper bound. The detailed performance numbers on each task are in Appendix B.

Q1: Does knowledge help general NLU tasks? Overall, we find that knowledge can help general NLU tasks. Firstly, Table 3 shows that KT-Attn outperforms both the RoBERTa base and large baselines by about 0.4 points on average for CR tasks. For SL tasks, KT and KT-Emb outperform the baselines by about 0.25 and 0.28 points on average. Secondly, Figure 4(a) clearly shows that all tasks can benefit from knowledge across the 8 different LMs. For example, KT-Attn improves all LMs through the introduction of knowledge on the SST-2 and POS tasks. Thirdly, the average gain over all LMs on the 10 NLU tasks with the introduction of knowledge is about 0.46 points. Figure 4(d) also shows the average gains for each LM on CR and SL tasks.

Q2: Which tasks benefit the most from knowledge integration? For CR tasks, Table 3 shows that CoLA, SST-2 and RTE gain the most. For SL tasks, Table 4 shows that both POS and NER achieve considerable improvements over their strong baselines. In terms of the number of LMs that knowledge can help, Figure 4(a) shows that SST-2, POS, and NER benefit the most, as they improve on all language models with the introduction of knowledge.

Q3: What is the best way to combine external knowledge with contextualized word representations for different NLU tasks? Firstly, Figure 4(a) shows that KT-Attn and KT-Emb help the most LMs on each task. Secondly, in terms of the best knowledge integration method on each task, Figure 4(c) shows that KT-Attn and KT-Emb account for more best-performing cases than the other two methods across all LMs. Thirdly, in terms of how to select entities for knowledge extraction, Table 4 shows that the POS-based selection performs better than the entity-linking-based one for the POS task, while the opposite holds for the NER task.

Q4: Which large-scale pre-trained language models benefit the most from external knowledge? In terms of the number of tasks that benefit, Figure 4(b) shows that the BERT-Large model improves on 9 tasks with the KT-Attn method. In terms of performance gains, Figure 4(d) shows that the BERT-Base model improves the most for CR tasks, while the ELECTRA-Base model improves the most for SL tasks.

4.3 Analysis

In addition to measuring the performance of knowledge integration methods on NLU tasks, it is also of great value to understand how and why knowledge integration methods help with language models. In particular, we answer the following two questions.

Q1: Is there any indicator, besides end-to-end performance, that tells whether knowledge is helpful? Wang et al. (2021) propose to enforce local modules to retain as much information about the input as possible while progressively discarding task-irrelevant parts. Inspired by this, we utilize mutual information (MI) to reflect the difference brought by knowledge.

Specifically, we use the mutual information I(h^l; X) to measure the amount of information retained in the l-th layer hidden states h^l about the raw input X, and I(h^l; Y) to measure the amount of retained task-relevant information about the label Y.

We then calculate the per-layer differences ΔI(h^l; X) and ΔI(h^l; Y) between each knowledge integration method and the baseline. If ΔI(h^l; X) > 0, the knowledge integration helps to retain more information about X at layer l than the baseline; if ΔI(h^l; X) < 0, the knowledge helps to discard more task-irrelevant information at layer l.

To estimate I(h^l; X), we follow the common practice Vincent et al. (2008); Rifai et al. (2012) of using the expected error of reconstructing X from h^l as an approximation: I(h^l; X) ≈ H(X) − R(X | h^l), where R(X | h^l) is the reconstruction error, estimated by masked language modeling to recover the masked tokens, and H(X) denotes the marginal entropy of X, treated as a constant.

To estimate I(h^l; Y), we follow Wang et al. (2021) and compute I(h^l; Y) ≈ H(Y) − L_CE(Y | h^l), where L_CE is the cross-entropy classification loss.

Both estimations of I(h^l; X) and I(h^l; Y) require an auxiliary classifier layer connected to each LM Transformer layer’s output. We place more details in Appendix C.
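Because the marginal entropies H(X) and H(Y) are constants shared by all models, the per-layer differences reduce to differences of the auxiliary losses. A small sketch of this computation (variable names are ours) is:

```python
def mi_differences(baseline_losses, method_losses):
    """Per-layer mutual-information differences of a knowledge method vs. the baseline.

    Each argument is a list of (reconstruction_loss, classification_loss) pairs, one
    per layer, measured with the auxiliary heads described in Appendix C. Since
    I(h^l; X) ~ H(X) - R(X | h^l) and I(h^l; Y) ~ H(Y) - L_CE(Y | h^l), the entropy
    terms cancel in the difference, so a lower auxiliary loss means higher MI.
    """
    delta_x, delta_y = [], []
    for (r_base, ce_base), (r_m, ce_m) in zip(baseline_losses, method_losses):
        delta_x.append(r_base - r_m)    # > 0: more information about the input X retained
        delta_y.append(ce_base - ce_m)  # > 0: more task-relevant information about Y kept
    return delta_x, delta_y
```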

Figure 5 shows the mutual information differences ΔI(h^l; X) and ΔI(h^l; Y) between each knowledge integration method and the vanilla RoBERTa-Base baseline on the CoLA dataset. We observe the following: KT and KT-Attn lead to higher ΔI(h^l; X) and ΔI(h^l; Y), indicating that they retain more information about the input while discarding task-irrelevant parts. All knowledge integration methods gradually discard task-irrelevant information and keep more information about the output after the first six layers.

Figure 5: Estimated mutual information differences ΔI(h^l; X) and ΔI(h^l; Y) between each knowledge integration method and the RoBERTa-Base baseline on CoLA. ΔI(h^l; X) > 0 means the integration helps retain more information about X at layer l than the baseline; ΔI(h^l; X) < 0 means that it helps to discard more task-irrelevant information.

Figure 6: The average number of Transformer layers in (a) RoBERTa-Base and (b) KT-Attn that deem words of certain part-of-speech as important for the CoLA task of linguistic acceptability. Results are obtained from the DiffMask model De Cao et al. (2020).

Q2: How does the introduction of knowledge change the way language models make decisions? We employ DiffMask De Cao et al. (2020), an interpretation tool that shows how decisions emerge across the Transformer layers of a language model. DiffMask learns to mask out subsets of the input while keeping the output of the network unchanged. A mask value z_{i,l} is computed for every token position i at every layer l by feeding the hidden states up to the l-th Transformer layer into an auxiliary classifier. A value of z_{i,l} = 0 at token i and layer l means that masking token i in the original input will not affect the model prediction, i.e. the model ‘knows’ at layer l that token i would not influence the final output. A value of z_{i,l} towards 1 means the opposite. The technical details of DiffMask are described in Appendix D.1.

In Figure 3 we plot the mask heatmaps of RoBERTa-Base and KT-Attn for two example inputs. As shown, KT-Attn shows better generalization ability since it correctly learns that the negation word in the first example ("not") and the phrase in the second example ("Sue to stay") would not affect the prediction for linguistic acceptability.

In Figure 6, we show the average number of Transformer layers in RoBERTa-Base and KT-Attn that deem words of certain parts of speech as important for the CoLA task. We can see that although the knowledge is only applied to verbs, nouns, and adjectives, it affects the behavior of the language model on other words as well. For example, the average number of Transformer layers increases for almost all POS tags. In terms of relative ranking, PRON (pronoun), ADP (adposition), and NUM (numeral) also change significantly after KT-Attn introduces external knowledge. We include some additional analysis based on DiffMask in Appendix D.2.

5 Conclusion

In this paper, we have presented a large-scale empirical study of various knowledge integration methods on 10 general NLU tasks. We show that knowledge brings more pronounced benefits than previously thought for general NLU tasks: introducing it yields gains across a variety of vanilla pretrained language models and significantly improves the results on certain tasks while having no adverse effects on other tasks. Our analysis with MI and DiffMask further helps understand how and why knowledge integration methods can help language models.

References

  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1544. Cited by: §1.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26, pp. . External Links: Link Cited by: §3.2.1.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 1–14. External Links: Link, Document Cited by: Appendix A.
  • E. Choi, O. Levy, Y. Choi, and L. Zettlemoyer (2018) Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 87–96. Cited by: §1.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, External Links: Link Cited by: §1, §4.1.
  • N. De Cao, M. S. Schlichtkrull, W. Aziz, and I. Titov (2020) How do decisions emerge across layers in neural models? interpretation with differentiable masking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 3243–3255. External Links: Link, Document Cited by: Figure 7, Figure 8, §D.1, Figure 6, §4.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2, §4.1.
  • W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), External Links: Link Cited by: Appendix A.
  • X. Han, H. Zhu, P. Yu, Z. Wang, Y. Yao, Z. Liu, and M. Sun (2018) FewRel: a large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In EMNLP, Cited by: §1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, External Links: Link Cited by: §1, §4.1.
  • X. Ling and D. Weld (2012) Fine-grained entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 26. Cited by: §1.
  • W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang (2020) K-bert: enabling language representation with knowledge graph.. In AAAI, pp. 2901–2908. Cited by: §1, §2, §3.2.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §2, §2, §4.1, Table 3.
  • M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19 (2), pp. 313–330. External Links: Link Cited by: Appendix A, §4.1.
  • B. W. Matthews (1975) Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405 (2), pp. 442–451. Cited by: Appendix A.
  • M. E. Peters, M. Neumann, R. Logan, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019) Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 43–54. Cited by: §1, §2.
  • N. Poerner, U. Waltinger, and H. Schütze (2020) E-BERT: efficient-yet-effective entity embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 803–818. External Links: Link, Document Cited by: §1, §2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. External Links: Link, Document Cited by: Appendix A.
  • S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza (2012) Disentangling factors of variation for facial expression recognition. In European Conference on Computer Vision, pp. 808–822. Cited by: §4.3.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642. External Links: Link Cited by: Appendix A.
  • Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu (2019) Ernie: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. Cited by: §2.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158. Cited by: §1.
  • E. F. Tjong Kim Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. External Links: Link Cited by: Appendix A, §4.1.
  • J. M. van Hulst, F. Hasibi, K. Dercksen, K. Balog, and A. P. de Vries (2020) REL: an entity linker standing on the shoulders of giants. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2197–2200. Cited by: §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §1.
  • P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. Cited by: §4.3.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019a) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems 32. Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. External Links: Link, Document Cited by: Appendix A, §1, §4.1.
  • R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, C. Cao, D. Jiang, M. Zhou, et al. (2020) K-adapter: infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808. Cited by: §1, §2.
  • X. Wang, T. Gao, Z. Zhu, Z. Liu, J. Li, and J. Tang (2019b) KEPLER: a unified model for knowledge embedding and pre-trained language representation. arXiv preprint arXiv:1911.06136. Cited by: §1, §2, §3.2.1.
  • Y. Wang, Z. Ni, S. Song, L. Yang, and G. Huang (2021) Revisiting locally supervised learning: an alternative to end-to-end training. In International Conference on Learning Representations, External Links: Link Cited by: Appendix C, §4.3, §4.3.
  • A. Warstadt, A. Singh, and S. R. Bowman (2019) Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7, pp. 625–641. External Links: Link, Document Cited by: Appendix A.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122. External Links: Link, Document Cited by: Appendix A.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link Cited by: §4.1.
  • W. Xiong, J. Du, W. Y. Wang, and V. Stoyanov (2020) Pretrained encyclopedia: weakly supervised knowledge-pretrained language model. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • I. Yamada, A. Asai, H. Shindo, H. Takeda, and Y. Matsumoto (2020) LUKE: deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6442–6454. Cited by: §2.
  • D. Yu, C. Zhu, Y. Yang, and M. Zeng (2020) Jaket: joint pre-training of knowledge graph and language understanding. arXiv preprint arXiv:2010.00796. Cited by: §1, §2, §3.2.1.
  • Y. Zhang, V. Zhong, D. Chen, G. Angeli, and C. D. Manning (2017) Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 35–45. Cited by: §1.
  • Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu (2019) ERNIE: enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1441–1451. Cited by: §1, §2.

Appendix A Datasets and Pretrained Language Models

CoLA Warstadt et al. (2019): The Corpus of Linguistic Acceptability is a dataset in which each sentence is annotated for whether it is a grammatical English sentence. We use the Matthews correlation coefficient Matthews (1975) as the evaluation metric.

SST-2 Socher et al. (2013): The dataset of Stanford Sentiment Treebank is a sentiment classification dataset.

MRPC Dolan and Brockett (2005): The task of the Microsoft Research Paraphrase Corpus is to predict whether two sentences are semantically equivalent.

STS-B Cer et al. (2017): The Semantic Textual Similarity Benchmark is another regression dataset, which measures the similarity between sentence pairs.

QQP (https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs): The Quora Question Pairs dataset, collected from the community question-answering website Quora, is another dataset for determining whether two sentences are semantically equivalent.

MNLI Williams et al. (2018): The Multi-Genre Natural Language Inference Corpus is a textual entailment dataset; the goal is to classify the relationship between premise and hypothesis sentences into three classes: entailment, contradiction, and neutral.

QNLI Rajpurkar et al. (2016): A sentence-pair classification dataset about question answering derived from the Stanford Question Answering Dataset. The task is to determine whether the context contains the answer to the question.

RTE Wang et al. (2018): The Recognizing Textual Entailment (RTE) dataset is another textual entailment dataset.

POS: Part-of-speech tagging classifies each word in a text into a particular part of speech. We use the Penn Treebank Marcus et al. (1993) for this task.

NER: Named entity recognition identifies the named entities in a given sentence. We use the CoNLL-2003 shared task data Tjong Kim Sang and De Meulder (2003).

Models #Params
RoBERTa-Base 125M
RoBERTa-Large 355M
BERT-Base-Cased 109M
BERT-Large-Cased 335M
ALBERT-Base-v2 11M
ALBERT-Large-v2 17M
ELECTRA-Base 110M
ELECTRA-Large 335M
Table 5: Number of parameters for each pretrained language model in our experiments.

Table 5 lists all pretrained language models in our experiments.

Appendix B Knowledge Integration results

Model CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE Avg
Metrics Matt. corr. Acc. Acc. Pear. corr. Acc. Acc. Acc. Acc.
BERT-Base-Cased 60.83 92.2 84.31 89.02 90.81 83.83 91.09 66.79 82.36
+ KT 58.38 91.86 78.68 89.01 90.88 83.83 90.76 63.18 80.82
+ KT-Attn 59.76 92.66 85.05 89.33 90.84 83.91 90.88 66.79 82.40
+ KT-Emb 61.69 92.66 86.03 89.81 90.79 84.03 90.98 68.59 83.07
+ KG-Emb 61.25 92.43 84.56 87.88 90.83 83.9 90.92 66.43 82.28
BERT-Large-Cased 64.84 93.92 86.27 90.29 91.5 86.51 92.51 71.48 84.67
+ KT 63.61 93.92 76.47 89.4 91.46 86.48 92.46 66.06 82.48
+ KT-Attn 65.35 94.04 87.25 90.56 91.43 86.63 92.57 73.65 85.19
+ KT-Emb 64.91 93.92 87.01 90.03 91.46 86.48 92.48 71.12 84.68
+ KG-Emb 64.27 93.81 85.54 88.57 91.5 86.38 92.46 71.84 84.30
ALBERT-Base-v2 56.31 92.83 87.75 90.88 90.63 85.13 91.76 72.2 83.44
+ KT 56.08 93.0 87.99 90.56 90.55 85.18 91.76 75.09 83.78
+ KT-Attn 57.42 93.0 88.48 90.81 90.54 85.08 91.91 74.73 84.00
+ KT-Emb 55.52 93.23 88.24 90.73 90.51 85.08 91.69 74.01 83.63
+ KG-Emb 54.53 92.43 87.75 90.21 90.54 85.12 91.69 73.29 83.20
ALBERT-Large-v2 60.16 94.38 89.22 91.39 90.88 87.18 92.59 79.42 85.65
+ KT 59.1 94.61 84.31 90.71 90.71 87.19 92.75 74.73 84.26
+ KT-Attn 60.55 95.07 89.71 91.21 90.9 87.1 92.62 79.42 85.82
+ KT-Emb 61.31 94.72 89.95 91.4 90.96 87.12 92.37 80.87 86.09
+ KG-Emb 60.02 94.61 89.22 91.08 90.93 87.1 92.39 78.7 85.51
ELECTRA-Base 68.61 95.3 88.48 90.99 91.93 88.9 93.1 78.7 87.00
+ KT 69.7 94.95 87.5 90.41 91.84 88.81 93.01 76.9 86.64
+ KT-Attn 69.45 95.76 88.97 90.89 91.84 88.98 93.03 80.51 87.43
+ KT-Emb 70.69 95.64 88.73 91.15 91.86 88.81 93.12 76.53 87.07
+ KG-Emb 69.68 95.53 88.24 89.88 91.88 88.86 92.99 75.81 86.61
ELECTRA-Large 72.13 96.67 90.93 92.57 92.45 91.32 95.15 88.45 89.96
+ KT 70.27 96.9 89.46 92.29 92.55 91.09 95.19 88.81 89.57
+ KT-Attn 69.25 96.79 90.44 92.23 92.58 91.28 94.98 87.73 89.41
+ KT-Emb 68.81 97.02 90.93 91.86 92.67 91.2 95.15 88.81 89.56
+ KG-Emb 59.35 96.9 88.48 91.28 92.25 91.13 95.15 89.17 87.96
Table 6: Results on classification and regression (CR) tasks for BERT, ALBERT and ELECTRA.
Model POS NER Avg
BERT-Base-Cased 96.8 94.32 95.56
+ KT 96.77 / 96.79 93.82 / 94.46 95.62
+ KT-Attn 96.93 / 96.79 94.21 / 94.33 95.63
+ KT-Emb 96.9 / 96.82 95.02 / 95.03 95.97
+ KG-Emb 96.83 94.94 95.88
BERT-Large-Cased 96.85 95.39 96.12
+ KT 96.88 / 96.86 95.42 / 95.76 96.32
+ KT-Attn 96.99 / 96.86 95.53 / 95.44 96.26
+ KT-Emb 96.92 / 96.88 95.85 / 95.9 96.41
+ KG-Emb 96.88 95.95 96.41
ALBERT-Base 96.17 93.66 94.91
+ KT 96.76 / 96.23 93.91 / 94.4 95.58
+ KT-Attn 96.79 / 96.19 94.51 / 93.62 95.65
+ KT-Emb 96.57 / 96.23 94.62 / 94.04 95.59
+ KG-Emb 96.22 93.75 94.98
ALBERT-Large 96.29 93.93 95.11
+ KT 96.81 / 96.39 94.73 / 94.89 95.85
+ KT-Attn 96.84 / 96.31 95.09 / 93.95 95.97
+ KT-Emb 96.73 / 96.34 95.17 / 94.45 95.95
+ KG-Emb 96.34 94.43 95.39
ELECTRA-Base 96.35 94.09 95.22
+ KT 96.77 / 96.37 94.91 / 94.58 95.84
+ KT-Attn 96.8 / 96.34 94.79 / 94.25 95.80
+ KT-Emb 96.86 / 96.49 95.71 / 94.89 96.28
+ KG-Emb 96.47 94.92 95.69
ELECTRA-Large 96.55 95.32 95.94
+ KT 96.9 / 96.56 95.67 / 95.8 96.35
+ KT-Attn 96.7 / 96.58 95.15 / 95.21 95.95
+ KT-Emb 96.86 / 96.64 96.14 / 95.72 96.50
+ KG-Emb 96.57 95.51 96.04
Table 7: Results on sequence labeling (SL) tasks for BERT, ALBERT and ELECTRA.

Table 6 and Table 7 list detailed numbers for CR and SL tasks on BERT, ALBERT and ELECTRA.

Appendix C Mutual Information Implementation Details

In our implementation, we stack one Transformer layer followed by two fully-connected layers on top of the intermediate hidden states h^l and optimize the newly added layers to predict the label y. Following Wang et al. (2021), we simply use the test accuracy as the estimate of I(h^l; Y).

Appendix D DiffMask

D.1 Implementation Details

DiffMask attaches an MLP classifier to each LM layer’s output, including the token embedding layer as layer 0. The l-th classifier takes the hidden states up to the l-th layer as input to predict a binary mask vector z^(l) ∈ {0, 1}^n, where n is the number of input tokens.

Then, the token mask z_{i,l} for each input token i is defined as the product of all binary masks up to the l-th layer: z_{i,l} = ∏_{k ≤ l} z_i^(k). The embedding of a masked token is replaced by a learned baseline vector b, i.e. ŵ_i = z_{i,l} · w_i + (1 − z_{i,l}) · b. The masked embeddings are input to the finetuned model M to get M(ŵ). Here we assume M could take either tokens or token embeddings as input. The objective of DiffMask is to estimate the parameters of the masking networks and the baseline b so as to mask out as many input tokens as possible while keeping M(ŵ) ≈ M(X), i.e. keeping the output with masked tokens close to the original output without masks.

According to De Cao et al. (2020), the learned masks reveal what the network “knows” at layer l about the NLU task. We can therefore plot a heatmap over the mask values z_{i,l}. If z_{i,l} = 0, it means that masking the i-th input token will not affect the model prediction, i.e. the model ‘knows’ at layer l and higher that token i would not influence the final output.
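The gist of the masked forward pass can be sketched as follows; this is a simplified deterministic version with illustrative names, whereas DiffMask itself optimizes stochastic gates with the L0 relaxation of De Cao et al. (2020).

```python
import torch

def diffmask_forward(model, token_embs, layer_masks, baseline):
    """Forward pass of the finetuned model with 'unimportant' tokens replaced.

    model:       a model that accepts a batch of token embeddings
    token_embs:  (seq_len, dim) original input token embeddings
    layer_masks: (n_layers, seq_len) per-layer gate values z in [0, 1]
    baseline:    (dim,) learned baseline vector b
    """
    # Token-level mask: product of the gates predicted up to each layer.
    z = layer_masks.prod(dim=0)                                     # (seq_len,)
    # Replace masked tokens by the baseline: w_hat_i = z_i * w_i + (1 - z_i) * b.
    masked = z.unsqueeze(-1) * token_embs + (1.0 - z).unsqueeze(-1) * baseline
    # Training pushes z towards 0 for as many tokens as possible while keeping
    # model(masked) close to model(token_embs).
    return model(masked.unsqueeze(0))
```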

D.2 Additional Analysis

In Figure 9, we plot the DiffMask heatmaps of an example input sentence from the RTE textual entailment task. Given two sentences concatenated into a single sequence, the language model RoBERTa-Base is finetuned to predict whether the two sentences entail each other. In this example, the first sentence is verbose while the second one is concise. Therefore, for the entailment judgment, a model with good generalization power should focus on the tokens containing the key information: "Jack Kevorkian", "famed as", "real name", and "Dr. Death". KT-Attn and KT-Emb rely more on this key information than vanilla RoBERTa. In Figure 7, we can also see that the difference made by introducing knowledge into the finetuning of the language model is not limited to the tokens where knowledge is explicitly incorporated.

For STSB, where incorporating knowledge did not show a significant improvement in end-to-end performance, we plot one example in Figure 10, and the average number of Transformer layers that deem words of certain parts of speech as important for the STSB task in Figure 8. In Figure 10, KT-Attn and KT-Emb still show better generalization ability by identifying the keywords "a boy" and "her baby" better than the vanilla RoBERTa model, but the difference is slim since the vanilla RoBERTa model also captures "a" and "her" as evidence for the final prediction. In Figure 8, we can observe a smaller difference between vanilla RoBERTa and its two knowledge-enhanced versions, which indicates that the language models adapt to external knowledge less aggressively for certain tasks than for others.

Figure 7: The average number of Transformer layers in (a) RoBERTa-Base and (b) KT-Attn (c) KT-Emb that deem words of certain part-of-speech as important for the RTE task. Results are obtained from the DiffMask model De Cao et al. (2020).

Figure 8: The average number of Transformer layers in (a) RoBERTa-Base and (b) KT-Attn (c) KT-Emb that deem words of certain part-of-speech as important for the STSB task. Results are obtained from the DiffMask model De Cao et al. (2020).

Figure 9: DiffMask plot for RTE task with RoBERTa-Base model. RTE task is to predict whether two sentences entail each other.

Figure 10: DiffMask plot for STSB task with RoBERTa-Base model. STSB task is to predict the semantic textual similarity of two sentences.