Efficient Domain Adaptation of Language Models via Adaptive Tokenization

September 15, 2021 · Vin Sachidananda et al. (Stanford University, Amazon)

Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on tasks involving data from domains different from that on which they were pretrained can lead to suboptimal performance. Recent work has explored approaches to adapt pretrained language models to new domains by incorporating additional pretraining using domain-specific corpora and task data. We propose an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers. We show that domain-specific subword sequences can be efficiently determined directly from divergences in the conditional token distributions of the base and domain-specific corpora. In datasets from four disparate domains, we find that adaptive tokenization on a pretrained RoBERTa model provides >97% of the performance benefits of domain-specific pretraining. Our approach produces smaller models and requires less training and inference time than other approaches using tokenizer augmentation. While adaptive tokenization incurs a 6% increase in model parameters in our experimentation, due to the introduction of 10k new domain-specific tokens, our approach, using 64 vCPUs, is 72x faster than further pretraining the language model on domain-specific corpora on 8 TPUs.


1 Introduction

Pretrained language models (PLMs) trained on large “base” corpora, oftentimes 100GB of uncompressed text roberta; gpt3, are used in many NLP tasks. These models first learn contextual representations in an unsupervised manner by minimizing a masked language modeling objective over a base corpus. This stage of unsupervised language model training is referred to as “pretraining”. Subsequently, for supervised classification tasks, the output head of this pretrained model is swapped for a lightweight classifier and trained further on a classification objective over labeled data, a stage referred to as “fine-tuning”.
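As a concrete illustration of this two-stage pattern, the sketch below fine-tunes a pretrained RoBERTa checkpoint with a freshly initialized classification head using the Hugging Face transformers library; the label count, example data, and hyperparameters are illustrative placeholders rather than settings used elsewhere in this paper.

```python
# Minimal fine-tuning sketch: the pretrained encoder is reused and its MLM output head is
# replaced by a lightweight classification head trained on labeled task data.
# Hyperparameters and data here are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

batch = tokenizer(["an example in-domain training sentence"], return_tensors="pt", padding=True)
labels = torch.tensor([1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss   # cross-entropy over the classification head
loss.backward()
optimizer.step()
```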

Recent work has examined the transferability of PLMs dontstop and their contextual representations to domains differing from their base corpora. On text classification tasks from four different domains, it was shown that continuing to pretrain RoBERTa’s contextual embeddings on additional domain-specific (DAPT) and/or task-specific data (TAPT) resulted in performance gains over only fine-tuning a baseline RoBERTa model. These performance gains, however, were inferior to each task’s state-of-the-art metrics, which were largely based on training versions of RoBERTa, or other LMs, from scratch on a large sample of in-domain data.

These performance gains come at substantial financial, time, and environmental costs in the form of increased computation, with pretraining an LM from scratch being the most expensive, additional pretraining in the middle, and only fine-tuning an off-the-shelf model the most economical.

One observed advantage pubmedbert that pretraining from scratch on in-domain data has over continual pretraining is that the tokenizer’s vocabulary captures domain-specific terms. This allows the semantics of those terms to be directly learned in their fixed embeddings, and relieves the language model from having to encode these semantics through the contextual embeddings of these domain-specific terms’ subwords. Recent work zhang-etal-2020-multi-stage; poerner-etal-2020-inexpensive has shown that adding whole words common to the target domain but absent from a PLM’s tokenizer improves performance on single tasks. In this work, we show that augmenting a PLM with statistically derived subword tokens selected for domain association, with simple embedding initializations and no further pretraining, provides an effective means of adapting a PLM across tasks and domains. In contrast, both zhang-etal-2020-multi-stage and poerner-etal-2020-inexpensive add inefficiencies by respectively requiring further masked language model (MLM) pretraining and doubling the resources needed for inference.

In this paper, we efficiently adapt a PLM by simply augmenting its vocabulary with domain-specific token sequences. We find that this adaptation, which requires no further pretraining, rivals the accuracy of domain and task-adapted pretraining approaches proposed in dontstop but requires only a small fraction of the compute cost.

2 Related work

dontstop describes two complementary methods using a task’s training data or a separate unlabeled domain-specific corpus to further pretrain an LM, denoted as Task-Adaptive Pretraining (TAPT) and Domain-Adaptive Pretraining (DAPT) respectively. That work shows the value of employing additional in-domain data in pretraining on four domains relative to only fine-tuning a PLM. Our approach is directly comparable to DAPT, as we only use in-domain corpora for adaptation.

zhang-etal-2020-multi-stage augment RoBERTa’s vocabulary with in-domain OOV whole words. The most frequently occurring whole words are added until the OOV rate drops to 5% on the task corpus. They randomly initialize the added embeddings and continue pretraining. This improves performance on TechQA and AskUbuntu. tai-etal-2020-exbert also augmented BERT with tokens selected by frequency (12k OOV wordpieces were used) and pretrained a modified version of BERT which allowed only the new tokens’ embeddings to be modified while the original embeddings remained fixed. They found that using more than 12k augmented tokens did not improve their biomedical NER and relation extraction performance, and that, once augmented, performance improved with more pretraining (4–24 hours were studied).

poerner-etal-2020-inexpensive augment BERT’s vocabulary with all in-domain OOV whole words, adding approximately 31K tokens to bert-base-cased’s roughly 29K wordpieces. They trained a word2vec model on an in-domain corpus and fit a linear transformation to project the word embeddings into the model’s input embedding space. No further pretraining is done, but during fine-tuning, the original tokenizer and the adapted tokenizer are both used. For inference, the fine-tuned model is run with both the original tokenizer and the adapted tokenizer and the outputs are averaged. Their F1 score outperforms BERT on all eight biomedical NER tasks studied. The approach has the disadvantage of increasing the parameter size of bert-base-cased by 2.2x due to the embeddings of added tokens, and it doubles the resources needed for inference.
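To make the inference-time overhead of this approach concrete, the following is a hedged sketch (not the authors’ released code) of running one fine-tuned classifier over two tokenizations of the same input and averaging the resulting logits:

```python
# Sketch of dual-tokenizer inference: the same fine-tuned model scores the input under the
# original and the adapted tokenizer, and the two sets of logits are averaged.
# Roughly twice the inference cost of a single forward pass.
import torch

def dual_tokenizer_logits(model, original_tok, adapted_tok, text):
    outputs = []
    for tok in (original_tok, adapted_tok):
        batch = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs.append(model(**batch).logits)
    return torch.stack(outputs).mean(dim=0)
```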

superbizarre demonstrates that WordPiece tokenization does not capture the semantics of derivationally complex words as well as an approach using a modified version of WordPiece designed to produce subword segmentations consisting of linguistic prefixes, suffixes, and affixes dagobert. This subword tokenizer outperformed WordPiece in determining words’ polarity or their source domains. Experiments were conducted on embedding novel tokens in BERT via approaches including a projection-based method and mean pooling (both similar to §3.3).

Training language models from scratch in the domain of interest has been shown to provide improved in-domain performance when compared to out-of-domain PLMs clinicalbert. In addition to dontstop, prior work has shown the effectiveness of continued pretraining for domain adaptation of PLMs alsentzer-publicly; chakrabarty-imho; lee-biobert. For the task of Aspect-Target Sentiment Classification, rietzler-etal-2020-adapt uses both DAPT and task-specific fine-tuning in order to adapt language model representations. Identifying domain-characteristic words is a well-studied problem, and many metrics have been proposed for this task through comparing the distributions of tokens in contrasting corpora keyness; mcq; kessler-2017-scattertext. muthukrishnan-etal-2008-detecting used the pointwise KL-divergence to distinguish the informativeness of key phrase candidates in a domain corpus relative to a background.

3 Adaptive tokenization of contextual embeddings

We define adaptive tokenization (AT) as the process of augmenting a PLM’s tokenizer and fixed subword embeddings with new entries taken from a novel corpus. AT involves two steps. First, domain-specific token sequences with which to augment the pretrained tokenizer must be selected from an in-domain corpus. Second, an appropriate initialization in the input space of the contextual embedding model must be determined for each addition to the tokenizer vocabulary. In this section, we detail approaches for each of these linked tasks.

3.1 Tokenizer vocabulary augmentation

In this section, we detail approaches for identifying domain-specific token sequences to be added during tokenizer augmentation. Common tokenization schemes such as Byte Pair Encoding bpe and WordPiece wordpiece; wordpiece2 are greedy algorithms and, as a result, merge subwords into individual tokens if such a sequence occurs with high relative frequency. When adapting a tokenizer, our goal is to identify subword sequences which occur with high relative frequency in a domain-specific corpus compared to the pretraining corpus. In Table 1, we provide the corpora for each domain in which experimentation is conducted. Next, we show how to operationalize this framework to find domain-specific token sequences.

3.2 Identifying domain-specific token sequences

In this section, we detail our approach for selection of token sequences which are both difficult to represent in a base tokenizer and have large disparities in occurrence between domain-specific and base corpora. Conceptually, we would like to add new tokens to the source tokenizer which are sequences of existing tokens and, in the in-domain corpus, are extensions of existing token sequences.

Input: base corpus C_RoBERTa, domain corpus C_Domain, source tokenizer, maximum sequence length n_max, minimum count c_min, number of augmentations K
(I) Computing Empirical Token Sequence Distributions
for word, count in WhitespaceTokenCounts(C_RoBERTa) do      ▷ Do the same for Domain Corpus C_Domain
     t_1, …, t_n ← SourceTokenizer(word)
     for i in [1, min(n, n_max)] do
          N_{C_RoBERTa}(t_1 … t_i) ← N_{C_RoBERTa}(t_1 … t_i) + count
     end for
end for
Normalize Sequence Distributions: P_C(s) ← N_C(s) / N_C(s_{1:|s|−1}) for each corpus C
(II) Domain shift scoring of Token Seq. Dists. with Conditional KL Divergence
for Seq in sequences observed in C_Domain do
     R(Seq) ← P_{C_Domain}(Seq) · log( P_{C_Domain}(Seq) / P_{C_RoBERTa}(Seq) )
end for
(III) Selection of Token Sequences for Augmentation
for Seq in sequences sorted by descending R(Seq) do
     if |Augmentations| = K then
          break
     end if
     if N_{C_Domain}(Seq) ≥ c_min AND N_{C_RoBERTa}(Seq) ≥ c_min AND Seq is not already a single token in the source vocabulary then
          Augmentations.append(Seq)
     end if
end for
return Augmentations
Algorithm 1 Selection of Domain-Specific Token Sequences for Tokenizer Augmentation

(I) Computing Empirical Token Sequence Distributions We first compute counts of sequences of subword tokens in each corpus: the source corpus on which RoBERTa was pretrained (C_RoBERTa) and the in-domain corpus which is the target of our adaptation (C_Domain). The source language model’s tokenizer (namely Roberta-base) is used as the source of subword tokens. The count of each subtoken sequence s in corpus C is represented as N_C(s); if s does not appear in C, then N_C(s) = 0. We only retain sequences occurring at least c_min times in one corpus, and the maximum subword token sequence length is capped at n_max subtokens. We limit subtoken sequences to word boundaries as detected through whitespace tokenization.
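A hedged sketch of this counting step follows; `MAX_LEN` and `MIN_COUNT` are illustrative stand-ins for n_max and c_min (whose exact values are not reproduced here), and the whitespace word-boundary restriction is implemented by tokenizing each whitespace-delimited word separately.

```python
# Step (I) sketch: count prefixes of each word's subtoken sequence in a corpus.
from collections import Counter
from transformers import AutoTokenizer

source_tok = AutoTokenizer.from_pretrained("roberta-base")
MAX_LEN = 4       # illustrative cap on subtoken sequence length (n_max)
MIN_COUNT = 100   # illustrative minimum occurrence threshold (c_min)

def sequence_counts(corpus_lines):
    counts = Counter()
    for line in corpus_lines:
        for word in line.split():                        # word boundaries via whitespace
            subtokens = source_tok.tokenize(" " + word)  # leading space for RoBERTa's BPE
            for i in range(1, min(MAX_LEN, len(subtokens)) + 1):
                counts[tuple(subtokens[:i])] += 1        # N_C(t_1 ... t_i)
    return counts

def retain_frequent(counts):
    # keep sequences above the occurrence threshold (the paper applies this across both corpora)
    return {s: c for s, c in counts.items() if c >= MIN_COUNT}
```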

Next, we predict how “phrase-like” a sequence of tokens s = (t_1, …, t_n) is in corpus C, using a probability P_C(s). Define

P_C(s) = N_C(s) / N_C(s_{1:n−1}),

where s_{1:n−1} = (t_1, …, t_{n−1}) is the sequence of all but the last subtoken of s. These probabilities capture how expected the final subtoken of s is given its preceding subtokens in the corpus being counted, and are indicative of how phrase-like s is.

As an example, consider a hypothetical corpus consisting of documents written about classical music. Roberta-base’s tokenizer splits “oboe” into the subtokens [“ob”, “oe”]. In this classical music corpus, the proportion of tokens following “ob” which are “oe” (combining into the word “oboe”) is surely much higher than in a general base corpus, where other words starting with the “ob” subtoken, such as “obama” (tokenized as [“ob”, “ama”]), are much more frequent and “oboe” much less so.
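The toy computation below mirrors this example with invented counts; `phrase_prob` implements P_C(s) as defined above.

```python
# Toy illustration of the "oboe" example with invented counts: the conditional probability of
# "oe" following "ob" is far higher in a classical-music corpus than in a general base corpus.
def phrase_prob(counts, seq):
    """P_C(seq): count of seq divided by the count of its prefix (all but the last subtoken)."""
    prefix = seq[:-1]
    if counts.get(prefix, 0) == 0:
        return 0.0
    return counts.get(seq, 0) / counts[prefix]

music_counts = {("ob",): 1000, ("ob", "oe"): 900, ("ob", "ama"): 10}
base_counts  = {("ob",): 1000, ("ob", "oe"): 5,   ("ob", "ama"): 600}

print(phrase_prob(music_counts, ("ob", "oe")))  # 0.9   -> "oboe" is phrase-like in-domain
print(phrase_prob(base_counts, ("ob", "oe")))   # 0.005 -> rare continuation in the base corpus
```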

(II) Domain shift scoring of Token Sequence Distributions with Conditional KL Divergence In order to characterize these differences in probabilities, we use the pointwise KL-divergence. Letting p and q be probabilities, the pointwise KL-divergence is defined as:

D_KL(p ‖ q) = p · log(p / q)

Let the sequence relevance score for a sequence s be defined as

R(s) = D_KL( P_{C_Domain}(s) ‖ P_{C_RoBERTa}(s) ).

R(s) indicates how much the phrase-like probability of sequence s in the in-domain corpus (C_Domain) diverges from the baseline phrase-like probability of s in the base corpus C_RoBERTa.
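A short sketch of this scoring step, reusing `phrase_prob` from the previous snippet; the epsilon guard against zero counts is our addition.

```python
# Step (II) sketch: score each candidate sequence by the pointwise KL divergence between its
# in-domain and base-corpus conditional probabilities.
import math

def pointwise_kl(p, q, eps=1e-12):
    return p * math.log((p + eps) / (q + eps))   # D_KL(p || q) = p * log(p / q)

def relevance_score(seq, domain_counts, base_counts):
    p = phrase_prob(domain_counts, seq)   # P_{C_Domain}(seq)
    q = phrase_prob(base_counts, seq)     # P_{C_RoBERTa}(seq)
    return pointwise_kl(p, q)             # R(seq)
```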

(III) Selection of Token Sequences for Tokenizer Augmentation For all experiments, we add the 10K sequences with the largest R(s), sorted irrespective of sequence length, to the domain-augmented tokenizer.

This introduces 7.68M parameters (embedding size 768 × 10K new tokens), a 6% increase over Roberta-base’s 125M parameters.[1]

[1] github.com/pytorch/fairseq/tree/master/examples/roberta
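A hedged sketch of the selection and vocabulary-augmentation step using the Hugging Face API is shown below; `scores` stands for the R(s) values from step (II), the toy entries are placeholders, and the additional minimum-count filters from Algorithm 1 are omitted for brevity.

```python
# Step (III) sketch: keep the 10K highest-scoring sequences, add them to the tokenizer, and
# resize the embedding matrix (roughly 768 * 10,000 = 7.68M new parameters).
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

scores = {("ph", "osph", "ory"): 4.2, ("inc", "ub", "ated"): 3.7}   # placeholder R(s) values

def select_augmentations(scores, k=10_000):
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Join each subtoken sequence back into a surface string to register it as a single token.
    return [tokenizer.convert_tokens_to_string(list(seq)).strip() for seq in ranked[:k]]

new_tokens = select_augmentations(scores)
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))   # new rows are initialized as described in §3.3
```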

3.3 Initialization approaches for AT

In this section, we provide two approaches to impute contextual embedding input representations for the tokens added in §3.1.

Subword-based initialization In this common initialization casanueva-etal-2020-efficient; Vuli2020ProbingPL; superbizarre, additions to the tokenizer are embedded as the mean of their Roberta-base fixed subword embeddings. In cases where all of a novel word’s subwords are unrelated to its specific, in-domain meaning, this initialization may cause unwanted model drift during fine-tuning for unrelated tokens with similar fixed embeddings.
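Continuing the sketch from §3.2, the mean-subword initialization can be written as follows; `model`, the augmented `tokenizer`, and `new_tokens` are carried over from the previous snippet, and a fresh un-augmented Roberta-base tokenizer supplies each added token’s original subword decomposition.

```python
# Mean-subword initialization sketch: each augmented token's input embedding is set to the
# mean of the Roberta-base embeddings of its original subwords.
import torch
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("roberta-base")   # original, un-augmented vocabulary
emb = model.get_input_embeddings().weight                  # shape: (len(tokenizer), 768)

with torch.no_grad():
    for token in new_tokens:
        sub_ids = base_tok(" " + token, add_special_tokens=False)["input_ids"]
        emb[tokenizer.convert_tokens_to_ids(token)] = emb[sub_ids].mean(dim=0)
```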

Input: RoBERTa input embeddings E_RoBERTa, word2vec embeddings E_w2v (base and domain), augmented tokens A from §3.2, and embedding size 768
(I) Learn mapping W: E_w2v → E_RoBERTa with SGD, minimizing ‖W(E_w2v(t)) − E_RoBERTa(t)‖_2 over the original RoBERTa tokens t
(II) Get initializations for augmented tokens using W: E_init(a) ← W(E_w2v(a)) for each a in A
return E_init
Algorithm 2 Projection-Based Initialization of Augmented Tokens
Domain | Pretrain Corpus [# Tokens] | Task | Task Type | Train (Lab.) | Dev. | Test | Classes
BioMed | 1.8M papers from S2ORC [5.1B] | ChemProt | relation classification | 4169 | 2427 | 3469 | 13
BioMed | 1.8M papers from S2ORC [5.1B] | RCT | abstract sent. roles | 18040 | 30212 | 30135 | 5
CS | 580K papers from S2ORC [2.1B] | ACL-ARC | citation intent | 1688 | 114 | 139 | 6
CS | 580K papers from S2ORC [2.1B] | SciERC | relation classification | 3219 | 455 | 974 | 7
News | 11.9M articles [6.7B] | HyperPartisan | partisanship | 515 | 65 | 65 | 2
Reviews | 24.75M Amazon reviews [2.1B] | IMDB | review sentiment | 20000 | 5000 | 25000 | 2
Table 1: Specifications of the various target task and pretraining datasets to replicate experiments in dontstop. Due to the restrictions on accessible papers in S2ORC, we are using versions of BioMed and CS which are approximately 33% and 74% smaller than were used in dontstop. Sources: S2ORC s2orc, News fakenews, Amazon reviews amznrev, CHEMPROT chemprot, RCT rct, ACL-ARC aclarc, SCIERC scierc, HYPERPARTISAN hyperpartisan, and IMDB imdb.

Projection-based initialization To mitigate possible issues with averaging subword embeddings, we also consider projections from static token embeddings to the input space of contextual embeddings, similar to poerner-etal-2020-inexpensive.

To summarize this approach, our goal is to learn a mapping between the input token embeddings in RoBERTa, E_RoBERTa, and word2vec token embeddings learned independently on the base and domain-specific corpora, E_w2v.[2] The tokens in E_RoBERTa include the original RoBERTa tokens, while those in E_w2v include both the original RoBERTa tokens and the augmented tokens found using adaptive tokenization as detailed in §3.2. First, a mapping W, parametrized as a single-layer fully connected network, from E_w2v to E_RoBERTa is learned which minimizes L2 distances on the original set of tokens in RoBERTa. The goal of this mapping is to learn a function which can translate word2vec token embeddings to the input space of RoBERTa. Then, the learned mapping W is applied to the word2vec embeddings of the augmented tokens found using the approach in §3.2 in order to obtain initializations in the input space of RoBERTa. The operations involved in this approach are detailed in Algorithm 2.

[2] See §5.4 for how the RoBERTa source corpus is approximated to form our base corpus.
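The sketch below is one way to implement this projection with PyTorch and a gensim 4.x KeyedVectors object; it assumes the word2vec model was trained over text tokenized consistently with the (augmented) RoBERTa vocabulary so that the two vocabularies overlap, and all names here are illustrative.

```python
# Projection-based initialization sketch (cf. Algorithm 2): fit a single linear layer mapping
# word2vec vectors into RoBERTa's input embedding space on shared original tokens, then apply
# it to the augmented tokens.
import torch

def fit_projection(w2v, tokenizer, model, steps=1000, lr=1e-3):
    emb = model.get_input_embeddings().weight.detach()
    shared = [t for t in tokenizer.get_vocab() if t in w2v.key_to_index]  # original tokens
    X = torch.tensor([w2v[t] for t in shared], dtype=torch.float32)
    Y = torch.stack([emb[tokenizer.convert_tokens_to_ids(t)] for t in shared])
    proj = torch.nn.Linear(X.shape[1], Y.shape[1])
    opt = torch.optim.SGD(proj.parameters(), lr=lr)
    for _ in range(steps):                       # minimize L2 distance on shared tokens
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(proj(X), Y)
        loss.backward()
        opt.step()
    return proj

def initialize_augmented(proj, w2v, tokenizer, model, new_tokens):
    emb = model.get_input_embeddings().weight
    with torch.no_grad():
        for t in new_tokens:
            if t in w2v.key_to_index:
                emb[tokenizer.convert_tokens_to_ids(t)] = proj(
                    torch.tensor(w2v[t], dtype=torch.float32))
```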

Domain | Task | RoBERTa | DAPT | TAPT | DAPT + TAPT | AT (Mean) | AT (Proj) | State-of-the-art (in 2020)
BioMed | ChemProt | 81.9 | 84.2 | 82.6 | 84.4 | 83.6 | 83.1 | 84.6
BioMed | RCT | 87.2 | 87.6 | 87.7 | 87.8 | 87.5 | 87.6 | 92.9
CS | ACL-ARC | 63.0 | 75.4 | 67.4 | 75.6 | 70.1 | 68.9 | 71.0
CS | SciERC | 77.3 | 80.8 | 79.3 | 81.3 | 81.4 | 81.2 | 81.8
News | HyperPartisan | 86.6 | 88.2 | 90.4 | 90.0 | 93.1 | 91.6 | 94.8
Reviews | IMDB | 95.0 | 95.4 | 95.5 | 95.6 | 95.4 | 95.5 | 96.2
Table 2: Results of different adaptive pretraining methods compared to the baseline RoBERTa. AT with mean subword and projective initializations are denoted as AT (Mean) and AT (Proj) respectively. Stddevs are from 5 seeds. Results for DAPT, TAPT, DAPT+TAPT, and state-of-the-art are quoted from dontstop. The highest non-state-of-the-art result is bolded, since the state-of-the-art functions as a performance ceiling, leveraging both domain-specific pretraining and an adapted tokenizer. The best of the three approaches which utilize only source and domain data before fine-tuning (i.e., DAPT and AT) is underlined. *Due to restrictions on accessible papers in S2ORC, the BioMed and CS pretraining corpora used were respectively 33% and 74% smaller than the versions in dontstop. Note that state-of-the-art numbers are current at the time of dontstop, and are from the following works: ChemProt: S2ORC-BERT s2orc, RCT: Sequential Sentence Classification rct_sota, ACL-ARC: SciBert arc_sota, SciERC: S2ORC-BERT s2orc, HyperPartisan: Longformer longformer, IMDB: XLNet Large xlnet-large.
Method | Hardware Specs. | Runtime [h:m:s]
DAPT | 8x TPU V-3 | 94 hours
AT (Mean) | 64x vCPUs | 1:17:35
AT (Projection) | 64x vCPUs | 4:54:58
Table 3: Runtime and hardware specifications for AT compared to DAPT. The vast majority of the time is spent reading the corpus and creating token distributions. Runtimes are based on the CS 8.1B token corpus. The DAPT runtime is mentioned in Github Issue 16 in dontstop and the AT runtimes are linearly extrapolated (an overestimate) from our observed runtime on the open version of CS, a 2.1B token corpus. We needed to perform this extrapolation since the full CS corpus which was used to benchmark dontstop is unavailable in S2ORC. “64x vCPUs” indicates that the equivalent of an AWS ml.m5.16xlarge EC2 instance was used to determine which subtoken sequences to use for vocabulary augmentation and to compute their embeddings. The times reported for AT (Mean) and AT (Projection) were from a single run, with precomputed base corpus token counts and embeddings.

4 Experimentation

BioMed | CS | News | Reviews
[inc, ub, ated] incubated | [The, orem] Theorem | [t, uesday] tuesday | [it, ’s] it’s
[trans, fect] transfect | [L, em, ma] Lemma | [ob, ama] obama | [that, ’s] that’s
[ph, osph, ory] phosphory | [vert, ices] vertices | [re, uters] reuters | [sh, oes] shoes
[mi, R] miR | [E, q] Eq | [iph, one] iphone | [doesn, ’t] doesn’t
[st, aining] staining | [cl, ust, ering] clustering | [ny, se] nyse | [didn, ’t] didn’t
[ap, opt, osis] apoptosis | [H, ence] Hence | [get, ty] getty | [can, ’t] can’t
[G, FP] GFP | [Seg, mentation] Segmentation | [inst, agram] instagram | [I, ’ve] I’ve
[pl, asm] plasm | [class, ifier] classifier | [bre, xit] brexit | [b, ought] bought
[ass, ays] assays | [Ga, ussian] Gaussian | [nas, daq] nasdaq | [you, ’ll] you’ll
[ph, osph, ory, lation] phosphorylation | [p, olyn] polyn | [ce, o] ceo | [kind, le] kindle
Table 4: Samples of token sequences with large divergence between base and domain corpora sequence distributions (§3.2); all of these sequences were added to the Roberta-base tokenizer during AT.

In this section, we evaluate our adaptation approach on six natural language processing tasks in four domains (BioMedical, Computer Science, News, and Reviews), following the evaluations in dontstop. Due to resource constraints, we perform experimentation on all datasets in dontstop excluding the Helpfulness dataset from the Reviews domain and the AGNews dataset from the News domain. Each of the excluded datasets contains greater than 100K training examples, which would require more than 12 hours of fine-tuning time on 8 Tesla V100 GPUs for a single seed.

Approaches Roberta-base, a commonly used PLM with strong performance, is used as a baseline, on which supervised fine-tuning is performed separately for each dataset. Additionally, we compare AT to the DAPT method from dontstop. As we do not make use of task-specific data (i.e., the training data used in fine-tuning), AT is comparable to DAPT in terms of the data utilized. We focus on the large, in-domain data sets commonly used in further pretraining (rather than variably sized task data), since their size allows for reliable extraction of characteristic subtoken sequences to use in tokenizer augmentation. Adaptive tokenization on task-specific data is left to future work.

Classification Architecture We use the same classification architecture as in dontstop, originally proposed in bert, in which the final layer’s [CLS] token representation is passed to a task-specific feed-forward layer for prediction. All hyperparameters used in experimentation are equivalent to either the “mini”, “small”, or “big” hyperparameter sets from dontstop.

Results We find that adaptive tokenization improves performance over the baseline RoBERTa model in all four of the domains on which experimentation is performed. AT provides 97% of the aggregate relative improvement attained by DAPT over Roberta-base while providing an order-of-magnitude efficiency gain, detailed in Table 3. We do not see a significant difference in the performance of AT models based on the Mean or Proj initialization schemes. Given that Mean initialization requires substantially less time than Proj, we recommend its use over Proj.

5 Discussion

5.1 Resource Efficiency in LM Adaptation

Current approaches for training and adapting LMs have resulted in negative environmental impact and high computational resource budgets for researchers. PLMs incur significant compute time during pretraining, typically requiring numerous days of training on GPUs or TPUs roberta; bert; dontstop. In Table 3, we provide a runtime comparison between continued pretraining and AT. We find that AT provides a 72x speedup compared to DAPT and does not require a GPU or TPU to run. The most resource-intensive portion of this procedure involves indexing the corpora and conducting subtoken sequence counts.

In addition to time and resources, pretraining BERT with a single set of hyperparameters incurs a carbon footprint of approximately 1.5K pounds of CO2 emissions, more than the average monthly emissions of an individual strubell2019. Continued pretraining, which has a similar resource budget to BERT, exacerbates this problem greenai. Lastly, we find that the cloud computing costs associated with continued pretraining for a single domain and set of hyperparameters are approximately $750, compared to around $4.77 for AT (using an ml.m5.16xlarge EC2 instance for 1:17) on cloud computing platforms when using non-preemptible instances. The high costs associated with training NLP models have led to inequity in the research community in favor of industry labs with large research budgets strubell2019.
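The figures above can be sanity-checked with a short back-of-the-envelope calculation; the hourly rate used here is an assumed on-demand price for an ml.m5.16xlarge instance, not an official quote.

```python
# Rough check of the quoted speedup and cost figures.
dapt_hours = 94                              # DAPT runtime from Table 3
at_mean_hours = 1 + 17 / 60 + 35 / 3600      # AT (Mean) runtime of 1:17:35 -> ~1.29 h
print(f"speedup ~ {dapt_hours / at_mean_hours:.1f}x")        # ~72.7x, i.e. the quoted ~72x

assumed_hourly_rate = 3.69                   # assumed USD/hour for ml.m5.16xlarge
print(f"AT (Mean) cost ~ ${at_mean_hours * assumed_hourly_rate:.2f}")   # ~$4.77
```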

5.2 Augmented Token Sequences selected in each domain

In Table 4, we provide examples of the augmented vocabulary selected by our adaptive tokenization algorithm for each of the four domains used in experimentation. In each domain, the augmented tokens identified by AT correspond to domain-specific language. For instance, augmented token sequences in the Reviews domain often contain contractions such as “I’ve” and “it’s”, which are frequently used in informal language. In the News domain, augmented tokens include financial terms such as “NYSE” and “Nasdaq” along with media outlets such as “Reuters” and “Getty”. Many of the augmented tokens in the Computer Science domain are mathematical and computing terms such as “Theorem”, “Lemma”, “Segmentation”, and “Gaussian”. Lastly, augmented tokens in the BioMedical domain are largely concerned with biological mechanisms and medical procedures, such as “phosphorylation”, “assays”, and “transfect”.

5.3 Future directions

While we have evaluated this approach on Roberta-base, it can be used with any PLM that uses subword tokenization. It would be interesting future work to see whether the performance gain holds on larger PLMs with richer vocabularies or on smaller PLMs. One may speculate that the benefit of AT comes from encoding non-compositional subword tokens in the input embedding space, which lifts some of the responsibility for encoding their semantics from the LM’s interior weights. Since these non-compositional tokens are characteristic of the domain corpus, their representations may be important to the end task and need to be learned or improved during fine-tuning. If this is the case, then perhaps models with fewer interior weights benefit more from AT, since the non-compositional tokens are built into the input, allowing the interior weights to better learn the semantics of novel non-compositional tokens as opposed to also having to learn how their component tokens combine.

While this work tests AT on an English-language PLM, it can hypothetically be applied to any PLM regardless of its source language(s). Exploring how AT can work with additional pretraining on domain data is clear future work. tai-etal-2020-exbert show that specialized further pretraining on domain data using a model augmented with domain-characteristic whole-word tokens results in an improved performance/pretraining-time curve. It would also be fruitful to explore how that curve changes when using more efficient pretraining techniques such as in clark2020electra.

While we compared different embedding techniques for novel token sequences, we did not study different ways of identifying subtoken sequences to add. Comparing AT to approaches such as adding whole-word tokens tai-etal-2020-exbert would help confirm our hypothesis that phrase-like token sequences are useful.

Experimenting with the number of subtoken sequences added to the tokenizer (fixed at 10K in this work) may also be worthwhile. While tai-etal-2020-exbert found 12K token additions optimal, poerner-etal-2020-inexpensive added approximately 31K tokens. Seeing the trade-off between added tokens and performance would be useful, as each additional token increases the model size.

Our approach requires new tokens to appear a minimum number of times in both the source and domain corpora. While this was necessary in order to produce source-corpus word embeddings for Proj, it does not allow for domain-exclusive subtoken sequences to be added to the tokenizer. Abandoning this requirement for Mean may lead to a better set of token augmentations.

We can also experiment with other subtoken candidate selection techniques. For example, Schwartz2013 used pointwise mutual information (PMI) to determine how phrase-like candidate word sequences were. PMI is the log ratio of the probability of a phrase versus the product of the probabilities of its component unigrams. While our approach considers the probability of a subtoken given a preceding sequence, it, unlike PMI, does not consider the probability of that following subtoken in isolation. This may allow sequences whose final subtoken is simply frequent in isolation, rather than characteristic of a phrase, to sneak into the augmented vocabulary, such as the contraction tokens added to the Reviews tokenizer in Table 4.
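For reference, the two scoring notions contrasted above can be written side by side (notation ours, matching §3.2): PMI normalizes a sequence’s probability by each component token’s unigram probability, whereas our phrase-likeness score conditions only on the preceding subtokens.

PMI(t_1 … t_n) = log [ P(t_1 … t_n) / ∏_{i=1}^{n} P(t_i) ]    vs.    P_C(t_1 … t_n) = N_C(t_1 … t_n) / N_C(t_1 … t_{n−1})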

5.4 Implementation details

The code is in preparation for release. The hyperparameter search used was ROBERTA_CLASSIFIER_MINI from the dontstop codebase, https://github.com/allenai/dont-stop-pretraining. Token counts for RoBERTa-base were estimated using English Wikipedia (20200501.en) and an open-source book corpus from https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2. Word2vec embeddings were computed with Gensim rehurek2011gensim, using the following parameters:
Word2Vec(..., size=768, window=5, min_count=100, epochs=2, sample=1e-5)

6 Conclusion

In this paper, we introduced adaptive tokenization (AT), a method for efficiently adapting pretrained language models that use subword tokenization to new domains. AT augments a PLM’s tokenization vocabulary to include domain-specific token sequences. We provide two approaches for initializing the augmented tokens: mean subword embeddings and projections from static subword embeddings. AT requires no further language model pretraining on domain-specific corpora, resulting in a 38x speedup over pretraining on the corpora, without specialized hardware. Across four domains, AT provides >97% of the performance improvement of further pretraining on domain-specific data over Roberta-base. This initial work suggests that adapting the subword tokenization scheme of PLMs is an effective means of transferring models to new domains. Future work entails hybrid approaches using both AT and small amounts of LM pretraining, alternative metrics for augmented token selection, improved initialization of augmented token representations, and the use of task data.

Acknowledgements

We thank Yi Zhang, William Headden, Max Harper, Chandni Singh, Anuj Ahluwalia, Sushant Sagar, Jay Patel, Sachin Hulyalkar, and the anonymous reviewers for their valuable feedback.

Ethics statement

As mentioned in §5, pretrained language models incur significant costs with respect to time, computational resources, and environmental impact. Continued domain-specific pretraining, which has a similar resource budget to BERT pretraining, exacerbates this problem greenai. In this work, we provide an approach, Adaptive Tokenization, for adapting pretrained language models to new domains while minimizing the costs associated with continued domain-specific pretraining. It should be noted that we do not decrease the resource and environmental costs associated with pretraining itself, only those of domain-adaptive pretraining, which are nevertheless sizable (e.g., 32 TPU-days for DAPT).

Additionally, we find that the cloud computing costs associated with continued domain-specific pretraining on a single domain and set of hyperparameters are around $750, compared to around $5 for AT on a cloud computing platform. The high costs associated with training NLP models have led to inequity in the research community in favor of industry labs with large research budgets strubell2019, a problem we seek to ameliorate.

This work does not address the high resource cost of fine-tuning PLMs. A risk associated with this paper is that it may encourage the use of PLMs in more settings, such as domains with small amounts of data, and thereby introduce the potentially harmful inductive biases that have been found in many commonly used PLMs.

We include statistics about the data sets used in Table 1; these data sets were introduced in dontstop and are open source.

References