In recent years, pre-trained language models (PLMs) have revolutionized the field of natural language processing (NLP), yielding remarkable performance on various downstream tasks (qiu2020pre). For example, BERT (devlin2019bert) and its variants such as RoBERTa (liu2019roberta) and XLNet (yang2019xlnet) capture syntactic and semantic knowledge mainly from the pre-training task of masked language modeling (MLM). However, these PLMs lack domain-specific knowledge when completing many real-world tasks. To address this issue, some existing methods have incorporated knowledge from external resources to enrich the language representation, ranging from linguistic (wang2021k), commonsense (guan2020knowledge), and factual (wang2021kepler) to domain-specific knowledge (liu2020k; yu2020jaket).
Nevertheless, rare words (schick2020rare) and unseen words (cui2021knowledge) are still blind spots of pre-trained language models when they are fine-tuned on downstream tasks. For instance, in a dialogue system, users often talk to chatbots about recent hot topics, e.g., “Covid-19”, which may not appear in the pre-training corpus (cui2021knowledge). Because PLMs learn word semantics from the contexts in which words occur in the pre-training corpus, they usually perform poorly when a user mentions such novel words (wu2021taking; ruzzetti2021lacking). As indicated by wu2021taking, the quality of word representations highly depends on the word frequency in the pre-training corpus, which typically follows a heavy-tail distribution. Thus, a large proportion of words appear very few times, and the embeddings of these rare words are poorly optimized (gong2018frage; schick2020rare). Such embeddings usually carry inadequate semantic meaning, which complicates the understanding of input text and can even hurt the pre-training of the entire model.
In this work, we focus on enhancing language model pre-training by leveraging rare word definitions in English dictionaries (e.g., Wiktionary). Definitions in dictionaries are intended to describe the meaning of a word to a human reader. We append the definitions of rare words to the end of the input text and encode the whole sequence with a Transformer encoder. The pre-training tasks are mainly based on the alignment between the input text and the appended word definitions, some of which are “polluted”, i.e., they belong to randomly sampled words and do not explain the input. We propose two types of pre-training objectives: (1) a word-level contrastive objective that aims to maximize the mutual information between the Transformer representations of a rare word appearing in the input text and its dictionary definition; and (2) a sentence-level discriminative objective that aims to differentiate between correct and polluted word definitions. During downstream fine-tuning, to keep the appended rare word definitions from diverting the sentence from its original meaning, we employ a knowledge attention mechanism that makes word definitions visible only to the corresponding words in the input text sequence. We name our method Dict-BERT. Notably, Dict-BERT is general and model-agnostic, in the sense that any pre-trained language model (e.g., BERT, RoBERTa) can be used as the backbone.
Overall, our main contributions can be summarized as follows:
This is the first work to integrate word definitions from a dictionary into PLMs.
We propose two novel pre-training tasks on word-level and sentence-level alignment between the input text sequence and rare word definitions to enhance language modeling with a dictionary.
We evaluate Dict-BERT on the GLUE benchmark (wang2019glue); our model pre-trained from scratch improves accuracy by +1.15% on average over the vanilla BERT.
We follow the domain adaptive pre-training (DAPT) setting (gururangan2020don), where language models are continually pre-trained on in-domain data, and evaluate Dict-BERT on eight specialized domain benchmark datasets. Our method improves the F1 score by +0.5% and +0.7% on average over the BERT-DAPT and RoBERTa-DAPT settings, respectively.
2 Related Work
Representation of rare words in language models.
Pre-trained language models capture word semantics in different contexts to address polysemy and the context-dependent nature of words. Therefore, the quality of word representations highly depends on the word frequency in the corpus, which often follows a heavy-tail distribution. Many recent works have shown that rare words, which are not frequently covered in the corpus, can hinder the understanding of specific yet important sentences (schick2020rare; wu2021taking; ruzzetti2021lacking; dong2021injecting). Due to the poor quality of rare word representations, a pre-trained model built on top of them suffers from noisy input semantic signals, which leads to inefficient training. gao2019representation provides a theoretical understanding of the rare word problem, illustrating that the problem lies in the sparse stochastic optimization of neural networks. schick2020rare adapts attentive mimicking to explicitly learn rare word embeddings for language models. Specifically, it introduces one-token approximation, a procedure that allows attentive mimicking to be used even when the underlying language model uses subword-based tokenization. wu2021taking proposes to take notes for rare words on the fly (TNF) during pre-training. Specifically, TNF maintains a note dictionary and saves a rare word’s contextual information as notes when the rare word occurs in a sentence. When the same rare word occurs again during training, the note information saved beforehand can be employed to enhance the semantics of the current sentence. Different from wu2021taking, which keeps a fixed vocabulary of rare words during pre-training and fine-tuning, our method can dynamically adjust the vocabulary of rare words, and obtain and represent their definitions from a dictionary in a plug-and-play manner.
Language model pre-training and knowledge-enhanced methods
In recent years, pre-trained language models (PLMs) such as BERT (devlin2019bert) and T5 (raffel2020exploring) have achieved remarkable performance in various NLP downstream tasks. However, these PLMs lack domain-specific knowledge when completing many real-world tasks (yu2020survey). For example, BERT cannot realize its full potential when dealing with electronic medical record analysis tasks in the medical field (liu2020k). Many efforts have been made to investigate how to integrate knowledge into PLMs (yu2020jaket; liu2021kg; gunel2020mind; xiong2020pretrained; guan2020knowledge; zhou2021pre). Overall, these approaches can be grouped into two categories. The first is to explicitly inject entity representations into PLMs, where the representations are pre-computed from external sources (zhang2019ernie; liu2021kg). For example, KG-BART encoded the graph structure of knowledge graphs (KGs) with knowledge embedding algorithms like TransE (bordes2013translating), and then took the informative entity embeddings as auxiliary input (liu2021kg). However, it has been argued that, with explicit injection, the embedding vectors of words in text and of entities in the KG are obtained in separate ways, making their vector spaces inconsistent (liu2020k). The second is to implicitly model knowledge in PLMs by performing knowledge-related tasks, such as concept order recovering (zhou2021pre) and entity category prediction (yu2020jaket). For example, CALM proposed a novel contrastive objective for packing more commonsense knowledge into the parameters, and jointly pre-trained both generative and contrastive objectives for enhancing commonsense language understanding and generation tasks (zhou2021pre).
3 Proposed Method
In this section, we introduce the details of our model Dict-BERT. We first describe the notations and how to incorporate rare word definitions as a part of input. Then we detail the two novel self-supervised pre-training objectives. Finally, we introduce the knowledge attention during fine-tuning.
3.1 Notation and Problem Definition
Given the input text sequence $X = [\text{[CLS]}, x_1, x_2, \ldots, x_L, \text{[SEP]}]$ with $L$ tokens, a language model produces the contextual word representation $H = [h_{\text{CLS}}, h_1, h_2, \ldots, h_L, h_{\text{SEP}}]$. For a specific downstream task, a header function $f$ further uses $H$ and generates the prediction as $\hat{y} = f(h_{\text{CLS}})$ for sequence classification or $\hat{y} = f(H)$ for token classification.
The goal of our work is to learn better contextual word representations by leveraging definitions of the rare words in dictionaries (e.g., Wiktionary). Suppose $S = \{s_1, \ldots, s_M\}$ and $C = \{c^{(1)}, \ldots, c^{(M)}\}$ are the sets of rare words in the input text sequence and their definitions in the dictionary. When a rare word $s_k$ appears in the input text sequence, we fetch its definition from the dictionary as $c^{(k)} = [c^{(k)}_1, \ldots, c^{(k)}_{N_k}]$ with $N_k$ tokens, and append it to the end of the input text sequence. An input sequence with appended definitions of rare words can thus be written as $\tilde{X} = [X; c^{(1)}; \ldots; c^{(M)}]$, and the corresponding contextual representation generated by the language model as $\tilde{H}$. For a specific downstream task, a header function $f$ still uses $\tilde{H}$ to generate the prediction as $\hat{y} = f(\tilde{h}_{\text{CLS}})$ for sequence classification or $\hat{y} = f(\tilde{H})$ for token classification.
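The construction of the extended input can be illustrated with a minimal sketch. The toy dictionary, the whitespace tokenization, the choice to close each definition with one [SEP] token, and the helper name `build_input` are all illustrative assumptions, not the authors' implementation.

```python
# Toy stand-in for a dictionary such as Wiktionary (hypothetical entries).
TOY_DICTIONARY = {
    "covid-19": "a disease caused by a coronavirus",
    "ontology": "the study of the nature of being",
}


def build_input(tokens, rare_words, dictionary, cls="[CLS]", sep="[SEP]"):
    """Append the definition of every rare word in `tokens` to the sequence.

    Returns (sequence, spans). Each span is a tuple
    (rare_word_position, definition_start, definition_end) with an
    exclusive end; all three values index into `sequence`.
    """
    sequence = [cls] + list(tokens) + [sep]
    spans = []
    for pos, token in enumerate(tokens):
        key = token.lower()
        if key in rare_words and key in dictionary:
            start = len(sequence)
            # Whitespace tokenization is an assumption made for brevity.
            sequence.extend(dictionary[key].split())
            # Assumption: one [SEP] token closes each appended definition.
            sequence.append(sep)
            # pos + 1 accounts for the leading [CLS] token.
            spans.append((pos + 1, start, len(sequence)))
    return sequence, spans


sequence, spans = build_input(
    ["people", "discuss", "Covid-19"], {"covid-19"}, TOY_DICTIONARY
)
print(sequence)
print(spans)
```

The returned spans record which definition belongs to which rare word, which is the information a visibility-restricted attention mask would need later.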
3.2 Choosing the Rare Words
There are different ways to choose the rare word set in a pre-training corpus. One way is to use a pre-defined absolute frequency value as the threshold. wu2021taking used 500 as the threshold to divide frequent words and rare words, and maintained a fixed vocabulary of rare words during pre-training and fine-tuning. However, rare words can vary greatly across corpora. For example, rare words in the medical domain are very different from those in the general domain (lee2020biobert). Besides, keeping a large threshold for a small downstream dataset makes the vocabulary of rare words too large. For example, only 51 words in the RTE dataset have a frequency of more than 500.
Therefore, we propose to choose specialized rare words for each pre-training corpus and downstream task. Specifically, we rank all words by frequency from smallest to largest and add them to the rare word list one by one until the cumulative frequency of the added words reaches 10% of the total word frequency. Compared with wu2021taking, which maintained a fixed vocabulary, our method can dynamically adjust the vocabulary of rare words, and obtain and represent their definitions from a dictionary in a plug-and-play manner. To fetch the definitions of rare words, we leveraged the largest online dictionary, i.e., Wiktionary, and collected a dump of Wiktionary (https://www.wiktionary.org/) that includes definitions of 999,614 concepts.
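The tail-based selection rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `select_rare_words` is hypothetical, ties are broken alphabetically, and the loop stops before the cumulative count would exceed the budget (the text does not specify how the boundary word is handled).

```python
from collections import Counter


def select_rare_words(tokens, tail_ratio=0.10):
    """Return the least frequent words whose cumulative count stays within
    `tail_ratio` of all word occurrences in `tokens`."""
    counts = Counter(tokens)
    budget = tail_ratio * sum(counts.values())
    rare, cumulative = set(), 0
    # Ascending frequency; alphabetical tie-break keeps the result deterministic.
    for word, freq in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        if cumulative + freq > budget:
            break  # adding this word would exceed the tail budget
        rare.add(word)
        cumulative += freq
    return rare


corpus = ["the"] * 50 + ["cat"] * 43 + ["zyx"] * 4 + ["qux"] * 3
print(select_rare_words(corpus))        # 10% tail
print(select_rare_words(corpus, 0.05))  # 5% tail
```

Because the rule depends only on the corpus at hand, the same function can be applied to a pre-training corpus or to a small downstream dataset without a fixed absolute threshold.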
3.3 Preliminary: BERT Pre-training
We use the BERT (devlin2019bert) model as an example to introduce the basics of the model architecture and training objective of PLMs. BERT is developed on a multi-layer bidirectional Transformer (vaswani2017attention) encoder. The Transformer encoder is a stack of multiple identical layers, where each layer has two sub-layers: a self-attention sub-layer and a position-wise feed-forward sub-layer. The self-attention sub-layer produces outputs by calculating the scaled dot products of queries and keys as the coefficients of the values, i.e.,

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \tag{1}$$

where $Q$ (Query), $K$ (Key), and $V$ (Value) are the hidden representations produced by the previous layer and $d$ is the dimension of the hidden representations. Transformer also extends the aforementioned self-attention layer to a multi-head version in order to jointly attend to information from different representation subspaces.
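For readers who prefer code to notation, scaled dot-product self-attention for a single head can be sketched in plain Python. The helper names and the list-of-lists representation are assumptions chosen to keep the example dependency-free; real implementations use batched tensor operations.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for a single head.

    Q, K, V are lists of equal-length vectors (lists of floats);
    d is taken as the query dimension.
    """
    d = len(Q[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[col] for w, v in zip(weights, V)) for col in range(len(V[0]))]
        )
    return outputs


# A zero query scores every key equally, so the output is the mean of the values.
out = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
print(out)
```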
BERT uses the Transformer model as its backbone neural network architecture and trains the model parameters with the masked language modeling (MLM) objective on large text corpora. In the masked language modeling task, a random sample of the words in the input text sequence is selected. Each selected position is either replaced by the special token [MASK], replaced by a randomly picked token, or left unchanged. The objective of masked language modeling is to correctly predict the original words at the selected positions given the corrupted sentence.
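The corruption step can be sketched as below. The 15% selection rate and the 80%/10%/10% split between [MASK], a random token, and the unchanged token come from the original BERT paper rather than from this text; the function name `mask_tokens` and the word-level granularity are illustrative assumptions.

```python
import random


def mask_tokens(tokens, vocab, select_prob=0.15, seed=0, mask_token="[MASK]"):
    """Corrupt `tokens` for masked language modeling.

    Returns (corrupted, labels). `labels[i]` holds the original token at
    selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)   # position not selected for prediction
            labels.append(None)
            continue
        labels.append(token)          # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the token unchanged
    return corrupted, labels


sentence = "the patient was given a new antiviral drug".split()
corrupted, labels = mask_tokens(sentence, vocab=["cat", "blue", "run"])
print(corrupted)
print(labels)
```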
3.4 Dict-BERT: Language Model Pre-training with Dictionary
Dict-BERT is based on the BERT architecture and can be initialized either randomly or from a pre-trained checkpoint with the same structure. It is worth noting that we slightly modify the type embedding, so that the input text and the dictionary definitions are assigned different type embeddings. In addition, we use absolute positional embeddings.
We represent each pair of input text sequence and dictionary definitions as a tuple $(X, C)$, where $X$ denotes the input text sequence and $C$ the appended definitions. The semantics of a word in the input text depends on the current context, while the semantics of a word in the dictionary is standardized by linguistic experts. In order to better align the representations between them, we propose two novel pre-training tasks on word-level and sentence-level alignment between the input text sequence and rare word definitions to enhance pre-trained language models with a dictionary.
3.4.1 Word-level Mutual Information Maximization.
Recently, there has been a revival of approaches inspired by the InfoMax principle (oord2018representation; tschannen2020mutual): maximizing the mutual information (MI) between the input and its representation. MI measures the amount of information obtained about a random variable by observing another random variable. As the input text sequence and rare word definitions are obtained from different sources, in order to better align the representations, we propose to maximize the MI between a rare word $s$ in the input sequence and its well-defined meaning $c$ in the dictionary. The MI between $s$ and $c$, with joint density $p(s, c)$ and marginal densities $p(s)$ and $p(c)$, is defined as the Kullback–Leibler (KL) divergence between the joint and the product of the marginals,

$$I(s; c) = D_{\mathrm{KL}}\big(p(s, c) \,\|\, p(s)\,p(c)\big) = \mathbb{E}_{p(s, c)}\left[\log \frac{p(s, c)}{p(s)\,p(c)}\right]. \tag{2}$$
The intuition behind maximizing the mutual information between a rare word $s$ appearing in the input text sequence and its definition $c$ in the dictionary is to encode the underlying shared information and align the semantic representation between the contextual meaning and the well-defined meaning of a word. Nevertheless, estimating MI in high-dimensional spaces is a notoriously difficult task, and in practice one often maximizes a tractable lower bound on this quantity (poole2019variational). Intuitively, if a classifier can accurately distinguish between samples drawn from the joint $p(s, c)$ and those drawn from the product of marginals $p(s)\,p(c)$, then $s$ and $c$ have a high mutual information.
In order to approximate the mutual information, we adopt InfoNCE (oord2018representation), which is one of the most commonly used estimators in the representation learning literature, defined as

$$I(s; c) \ge \mathbb{E}\left[\frac{1}{B}\sum_{i=1}^{B} \log \frac{e^{g(s_i, c^{(i)})}}{\frac{1}{B}\sum_{j=1}^{B} e^{g(s_i, c^{(j)})}}\right] \triangleq I_{\mathrm{NCE}}(s; c), \tag{3}$$

where the expectation is over $B$ independent samples $\{(s_i, c^{(i)})\}_{i=1}^{B}$ from the joint distribution $\prod_{i} p(s_i, c^{(i)})$ (poole2019variational). Intuitively, the critic function $g(\cdot, \cdot)$ measures the similarity (e.g., inner product) between two word representations. The model should assign high values to the positive pairs $(s_i, c^{(i)})$ and low values to all negative pairs $(s_i, c^{(j)})$ with $j \neq i$. We compute InfoNCE using Monte Carlo estimation by averaging over multiple batches of samples (chen2020simple). By maximizing the mutual information between the encoded representations, we extract the underlying latent variables that the rare words in the input text sequence and their dictionary definitions have in common.
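A single-batch InfoNCE estimate with an inner-product critic can be sketched in plain Python. The function name `info_nce`, the use of raw lists as representations, and the pairing convention (row i of each list forms the positive pair) are assumptions made for illustration; a training loss would be the negative of this value.

```python
import math


def dot(u, v):
    """Inner-product critic between two representation vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(word_vecs, def_vecs):
    """InfoNCE estimate for one batch.

    word_vecs[i] and def_vecs[i] form the positive pair; every def_vecs[j]
    with j != i serves as a negative for word_vecs[i]. The estimate is
    upper-bounded by log(batch size).
    """
    batch = len(word_vecs)
    total = 0.0
    for i in range(batch):
        scores = [dot(word_vecs[i], def_vecs[j]) for j in range(batch)]
        peak = max(scores)
        # log of the mean of exp(scores), computed stably
        log_mean_exp = peak + math.log(sum(math.exp(s - peak) for s in scores) / batch)
        total += scores[i] - log_mean_exp
    return total / batch


words = [[5.0, 0.0], [0.0, 5.0]]
aligned = info_nce(words, [[5.0, 0.0], [0.0, 5.0]])   # definitions match their words
swapped = info_nce(words, [[0.0, 5.0], [5.0, 0.0]])   # definitions are mismatched
print(aligned, swapped)
```

Well-aligned pairs push the estimate toward its ceiling of log(batch size), while mismatched pairs drive it negative, which is the behavior the contrastive objective exploits.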
3.4.2 Sentence-level Definition Discrimination
Instead of locally aligning the semantic representation, learning to differentiate between correct and polluted word definitions helps the language model capture global information of the input text and dictionary definitions. We denote the set of definitions of the rare words in the input text as $C$. We then sample a set of “polluted” definitions $\hat{C}$ from the dictionary by replacing each rare word, with probability 50%, with a different word randomly sampled from the entire vocabulary together with its definition. Since the last-layer representation on the special token [SEP] is the fused representation of a word definition, we apply a multi-layer perceptron (MLP) as a binary classifier $f_{\mathrm{DD}}$ to predict whether the appended definition is for a rare word in the input text sequence ($y = 1$) or a polluted one ($y = 0$). Therefore, the discriminative objective can be formally defined as follows,

$$\mathcal{L}_{\mathrm{DD}} = -\mathbb{E}\left[\, y \log f_{\mathrm{DD}}(h_{\mathrm{SEP}}) + (1 - y) \log\big(1 - f_{\mathrm{DD}}(h_{\mathrm{SEP}})\big) \right], \tag{4}$$

where $h_{\mathrm{SEP}}$ is the [SEP] representation of an appended definition and the expectation is taken over the appended (correct and polluted) definitions.
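The sampling of polluted definitions can be sketched as follows. The function name `pollute_definitions`, the toy dictionary, and the label convention (1 for a correct definition, 0 for a polluted one) are assumptions for illustration only.

```python
import random


def pollute_definitions(rare_words, dictionary, p=0.5, seed=0):
    """Build training examples for definition discrimination.

    For each rare word, with probability `p` its entry is swapped for a
    different randomly chosen dictionary word and that word's definition
    (label 0); otherwise the correct definition is kept (label 1).
    Returns a list of (word, definition, label) tuples.
    """
    rng = random.Random(seed)
    vocabulary = sorted(dictionary)
    examples = []
    for word in rare_words:
        candidates = [w for w in vocabulary if w != word]
        if candidates and rng.random() < p:
            other = rng.choice(candidates)
            examples.append((other, dictionary[other], 0))   # polluted
        else:
            examples.append((word, dictionary[word], 1))     # correct
    return examples


toy_dictionary = {
    "covid-19": "a disease caused by a coronavirus",
    "ontology": "the study of the nature of being",
    "quark": "an elementary particle of matter",
}
for example in pollute_definitions(["covid-19", "quark"], toy_dictionary):
    print(example)
```

A binary classifier on top of each definition's [SEP] representation would then be trained on the returned labels.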
3.4.3 Overall objective.
Now we present the overall training objective of Dict-BERT. To avoid catastrophic forgetting (mccloskey1989catastrophic) of general language understanding ability, we train the masked language modeling task together with the word-level mutual information maximization (MIM) and definition discrimination (DD) tasks. We denote $\mathcal{L}_{\mathrm{MIM}}$ as the loss function of the MIM task, which is the negative of the expectation in Equation 3. Hence, the overall learning objective is formulated as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \lambda_1 \mathcal{L}_{\mathrm{MIM}} + \lambda_2 \mathcal{L}_{\mathrm{DD}}, \tag{5}$$

where $\lambda_1$ and $\lambda_2$ are introduced as hyperparameters to control the importance of each task.
3.5 Dict-BERT: Fine-tuning with Knowledge-visible Attention
Most existing work uses the final hidden state of the first token (i.e., the [CLS] token) as the sequence representation (devlin2019bert; liu2019roberta; yang2019xlnet). For a sequence classification task, a multi-layer perceptron $f$ takes the output representation of the [CLS] token, $h_{\text{CLS}}$, as input and generates the prediction as $\hat{y} = f(h_{\text{CLS}})$. Notably, when fine-tuning a language model on downstream tasks, there could be many rare or unseen words in the dataset. Therefore, in the fine-tuning stage, when encountering a rare word in the input text, we append its definition to the end of the input text, as in pre-training.
However, the appended dictionary definitions may change the meaning of the original sentence, since the [CLS] token attends to information from both the input text and the dictionary descriptions. As pointed out in liu2020k and xu2021does, too much knowledge incorporation may divert the sentence from its original meaning by introducing a lot of noise. This is more likely to happen if there are multiple rare words in the input text. To address this issue, we adopt the visibility matrix (liu2020k) to limit the impact of definitions on the original text. In BERT, an attention mask matrix $M$ is added to the self-attention weights before the softmax. If token $x_j$ is not supposed to be visible to token $x_i$, we set the corresponding entry of the attention mask to $-\infty$ (i.e., $M_{ij} = -\infty$).
As shown in Figure 2, we modify the attention mask matrix such that a token $x_i$ can attend to another token $x_j$ only if: (1) both tokens belong to the input text sequence, or (2) both tokens belong to the definition of the same rare word, or (3) $x_i$ is a rare word in the input text and $x_j$ is from its definition.
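The three visibility rules can be turned into an additive attention mask with a short sketch. The function name `visibility_mask` and the span format (rare word position, definition start, exclusive definition end) are assumptions for illustration; the mask uses 0.0 for visible pairs and negative infinity for hidden ones, to be added to the attention scores before the softmax.

```python
NEG_INF = float("-inf")


def visibility_mask(seq_len, text_len, spans):
    """Additive attention mask implementing the three visibility rules.

    mask[i][j] is 0.0 if token i may attend to token j and -inf otherwise.
    `text_len` is the length of the input text part (including special
    tokens); `spans` holds (rare_word_position, def_start, def_end) tuples
    with an exclusive end.
    """
    mask = [[NEG_INF] * seq_len for _ in range(seq_len)]
    # Rule (1): tokens of the input text see each other.
    for i in range(text_len):
        for j in range(text_len):
            mask[i][j] = 0.0
    for rare_pos, start, end in spans:
        # Rule (2): tokens inside the same definition see each other.
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = 0.0
        # Rule (3): the rare word sees the tokens of its own definition.
        for j in range(start, end):
            mask[rare_pos][j] = 0.0
    return mask


# Text occupies positions 0-3 (rare words at 1 and 2); two definitions follow.
mask = visibility_mask(seq_len=10, text_len=4, spans=[(1, 4, 7), (2, 7, 10)])
for row in mask:
    print(["." if value == 0.0 else "x" for value in row])
```

In this layout the [CLS] position never sees definition tokens directly, so definitions can influence the sequence representation only through the rare words they explain.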
4 Experiments
4.1 Overall Setting
To show the wide adaptability of our Dict-BERT, we conducted experiments on 16 NLP benchmark datasets. We use BERT (devlin2019bert) and RoBERTa (liu2019roberta) as the backbone pre-trained language models. First, we followed liu2019roberta and wu2021taking to use the GLUE benchmark. Second, we followed gururangan2020don to use 8 specialized domain tasks, including Chemprot, RCT-20k, ACL-ARC, SciERC, HyperPartisan, AGNews, Helpfulness, and IMDB.
4.2 Rare Word Collection
Here, we briefly introduce the statistics of rare words in the BERT pre-training corpus: English Wikipedia and BookCorpus. By concatenating these two datasets, we obtained a corpus of roughly 16GB. The total number of unique words in the pre-training corpus is 504,812, of which 112,750 (22.33%) are defined as frequent words; that is, the occurrences of these 112,750 words account for 90% of all word occurrences in the corpus. We look up definitions of the remaining 392,062 (77.67%) words in Wiktionary, and 252,581 of them (50.03% of all unique words) can be found. The average length of a definition is 9.57 words.
Since BERT-TNF (wu2021taking) has not released open-source code, we report the relative improvement of BERT-TNF and Dict-BERT over the original BERT.
4.3 Pre-training Corpus and Tasks
Experiments on the GLUE benchmark:
The language model is first pre-trained on the general domain corpus, and then fine-tuned on the training set of different GLUE tasks. Following BERT (devlin2019bert), we used the English Wikipedia and BookCorpus as the pre-training corpus. We removed the next sentence prediction (NSP) as suggested in RoBERTa (liu2019roberta), and kept masked language modeling (MLM) as the objective for pre-training a vanilla BERT.
Experiments on specialized domain datasets:
The language model is not only pre-trained on the general domain corpus, but also pre-trained on a domain-specific corpus before being fine-tuned on domain-specific tasks. To this end, we initialized our model with the checkpoint from pre-trained BERT/RoBERTa and continued pre-training on the domain-specific corpus (gururangan2020don). The four domains we focus on are biomedical (BIOMED) papers, computer science (CS) papers, news text from REALNEWS, and e-commerce reviews from AMAZON.
4.4 Baseline Methods
Vanilla BERT/RoBERTa. We use the off-the-shelf BERT-base and RoBERTa-base models and perform supervised fine-tuning of their parameters for each downstream task.
BERT-DAPT/RoBERTa-DAPT. It continues pre-training BERT/RoBERTa on a large corpus of unlabeled domain-specific text (e.g., BioMed) using masked language modeling (MLM).
BERT-TNF. It takes notes for rare words on the fly during pre-training to help the model understand them when they occur next time. Specifically, it maintains a note dictionary and saves a rare word’s contextual information in it as notes when the rare word occurs in a sentence.
4.5 Ablation Settings
Dict-BERT-P/Dict-BERT-F indicate using the dictionary only in the pre-training/fine-tuning stage. Dict-BERT-PF indicates using the dictionary in both the pre-training and fine-tuning stages. Furthermore, Dict-BERT w/o MIM removes the word-level mutual information maximization task and Dict-BERT w/o DD removes the sentence-level definition discrimination task during pre-training.
4.6 Evaluation Metrics
For GLUE, we followed RoBERTa (liu2019roberta) and reported Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for other tasks. For specialized tasks, we followed gururangan2020don and reported Micro-F1 for Chemprot and RCT-20k, and Macro-F1 for other tasks. For WNLaMPro, we followed schick2020rare and reported MRR and Precision@K.
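For reference, the two ranking metrics used for WNLaMPro can be sketched with their standard definitions. The function names and the toy predictions are illustrative assumptions; the benchmark's official evaluation script may differ in details such as candidate filtering.

```python
def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct candidate, or 0.0 if none is correct."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_candidates, gold_set) pairs."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in queries) / len(queries)


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k candidates that are correct."""
    return sum(1 for candidate in ranked[:k] if candidate in gold) / k


# Toy predictions for two cloze-style queries (hypothetical model outputs).
queries = [
    (["blame", "innocence", "shame"], {"innocence"}),
    (["hot", "cold", "warm"], {"hot"}),
]
print(mean_reciprocal_rank(queries))
print(precision_at_k(queries[0][0], queries[0][1], k=3))
```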
4.7 Experimental Results
Only using Dictionary during Fine-tuning.
As shown in Table 1, comparing the vanilla BERT with Dict-BERT-F, we observe that using the dictionary only during fine-tuning does not improve model performance on the GLUE benchmark. This indicates that the pre-trained language model cannot quickly learn to use rare word definitions from the dictionary to improve downstream task performance. Furthermore, the pre-trained language model might even be misled by noisy explanations in the dictionary. Therefore, it is important to integrate the dictionary into the language model during pre-training so that the dictionary definitions can be better utilized.
Dict-BERT vs. Baseline Methods.
As shown in Table 1, Dict-BERT-PF outperforms the vanilla BERT on the GLUE benchmark, improving accuracy by +1.15% on average. The BERT performance reported by wu2021taking is higher than that of our implemented BERT; however, they have not released open-source code for reproducing their experimental results. Although Dict-BERT-PF and BERT-TNF achieve very close performance on the GLUE benchmark (83.80% and 83.90%, respectively), Dict-BERT-PF achieves a greater relative improvement than BERT-TNF (+1.15% vs. +0.80%). In addition, BERT-TNF keeps a fixed note dictionary, so it cannot add unseen words to the note dictionary during fine-tuning. In contrast, Dict-BERT can dynamically adjust the vocabulary of rare words, and obtain and represent their definitions from a dictionary in a plug-and-play manner. On RTE, Dict-BERT-P obtains the biggest performance improvement over the vanilla BERT. On CoLA, another small-data sub-task, Dict-BERT-PF also outperforms the baseline by a considerable margin. This indicates that the improvement is particularly significant when Dict-BERT is fine-tuned on a small downstream dataset. Besides, as shown in Table 2, Dict-BERT-DAPT outperforms BERT-DAPT on the specialized domain datasets, improving F1 by +0.68% on average. The same observation holds in the RoBERTa setting.
Fine-tuning with Dictionary vs. without Dictionary.
As shown in Table 1, we compared model performance with and without the dictionary in fine-tuning. First, after pre-training the language model with the dictionary, performance is greatly improved even without using the dictionary in fine-tuning. This indicates that pre-training the language model with a dictionary can generally improve the language representation and provide a better initialization before fine-tuning on downstream tasks. Besides, we also observe that using the dictionary in fine-tuning performs slightly better on the GLUE benchmark. We hypothesize that the reason is the distribution discrepancy between the pre-training and fine-tuning data.
Ablation Study.
As shown in Table 1 and Table 2, we conducted an ablation study on both the GLUE benchmark and the specialized domain datasets. First, both MIM and DD help the model learn knowledge from the dictionary and improve language model pre-training. Specifically, DD yields a larger average improvement than MIM on the two benchmarks: the average improvements on the GLUE benchmark brought by DD and MIM are +0.63% and +0.52%, respectively. Second, combining MIM and DD achieves the highest performance on the GLUE benchmark, where the average gain increases to +1.15%. For the specialized domain datasets, we make the same observations.
Knowledge Attention vs. Full Attention.
As mentioned in Section 3.5, too much knowledge incorporation may divert the sentence from its original meaning by introducing noise. This is more likely to happen if multiple rare words appear in the input text. Therefore, we compared model performance between knowledge attention and full attention. As shown in Figure 3(a), knowledge attention consistently performs better than full attention during the fine-tuning stage on the CoLA, RTE, STSB and MRPC datasets. Besides, Dict-BERT with full attention even underperforms the vanilla BERT without any dictionary definitions, which indicates that the appended descriptions from the dictionary may change the meaning of the original sentence. For example, STSB measures the similarity between two sentences. Full attention mixes the semantic meanings of the definitions into the sentence representation, which might reduce the sentence similarity score and hurt model performance.
Learning with Different Rare Word Proportions.
As mentioned in Section 3.2, we select rare words for each downstream task by truncating the tail of the word frequency distribution. To verify the impact of using different tail proportions of rare words on the downstream tasks, we selected three different ratios (i.e., 5%, 10%, and 15%) and experimented on the CoLA, RTE, STSB and MRPC datasets. As shown in Figure 3(b), on the CoLA and STSB datasets, the model achieves the best performance when using the 10% of words at the tail as rare words. On the MRPC dataset, there is no significant difference in model performance across different proportions of rare words. However, the performance on the RTE dataset shows a trend: the more rare words selected, the worse the performance of the model. This is consistent with the comparison in Table 1 of fine-tuning with and without the dictionary, where not using the dictionary performs better than using it on the RTE dataset. Overall, the proportion of tail words selected as rare words shows no obvious correlation with model performance on downstream tasks.
Table 3: Results on WNLaMPro by word frequency. Columns: Methods | Rare (0, 10) | Frequent (100, +) | Overall (0, +).
Unsupervised Language Model Probing.
In order to assess the ability of language models to understand words as a function of their frequency, we used the WordNet Language Model Probing (WNLaMPro) dataset (schick2020rare), which tests how well a language model understands a given word by asking it for properties of that word in natural language. For example, a language model that understands the concept of “guilt” should be able to correctly complete the sentence “Guilt is the opposite of ___” with the word “innocence”. WNLaMPro contains four different kinds of relations: antonym, hypernym, cohyponym+, and corruption. Based on the word frequency in English Wikipedia, WNLaMPro defines three subsets by keyword counts: WNLaMPro-rare (0, 10), WNLaMPro-medium (10, 100), and WNLaMPro-frequent (100, +). As shown in Table 3, Dict-BERT greatly improves the word representation compared with the vanilla BERT, which does not use a dictionary during pre-training. Broken down by word frequency, we observe that Dict-BERT significantly helps learn rare word representations. Compared to the vanilla BERT, Dict-BERT achieves relative improvements of +23.93% in MRR and +28.30% in P@3. In addition, Dict-BERT also learns better frequent word representations, although we did not directly take frequent word definitions as part of the input. Because rare words are easier for Dict-BERT to predict than for the vanilla BERT, it spends less model capacity on memorizing rare words, and the saved capacity can be used to memorize facts involving popular words and the interactions between them.
5 Conclusion
Enhancing the representation of rare words in language models is an important yet challenging task. To address the rare word problem, in this work, we leveraged rare word definitions in English dictionaries to improve rare word representations. During the pre-training stage, when the language model encounters a rare word in the input text, we fetch its definition from Wiktionary and append it to the end of the input text. To better model the interactions between the input text and rare word definitions, we proposed two novel self-supervised training tasks that help the language model learn better representations for rare words during the pre-training stage. Experiments on the GLUE benchmark and eight specialized domain datasets demonstrate that our method can significantly improve the understanding of rare words and boost model performance on various downstream tasks.