Pre-train or Annotate? Domain Adaptation with a Constrained Budget

09/10/2021 ∙ by Fan Bai, et al. ∙ Georgia Institute of Technology

Recent work has demonstrated that pre-training in-domain language models can boost performance when adapting to a new domain. However, the costs associated with pre-training raise an important question: given a fixed budget, what steps should an NLP practitioner take to maximize performance? In this paper, we study domain adaptation under budget constraints, and approach it as a consumer choice problem between data annotation and pre-training. Specifically, we measure the annotation cost of three procedural text datasets and the pre-training cost of three in-domain language models. Then we evaluate the utility of different combinations of pre-training and data annotation under varying budget constraints to assess which combination strategy works best. We find that, for small budgets, spending all funds on annotation leads to the best performance; once the budget becomes large enough, a combination of data annotation and in-domain pre-training works best. We therefore suggest that task-specific data annotation should be part of an economical strategy when adapting an NLP model to a new domain.




1 Introduction

The conventional wisdom on semi-supervised learning and unsupervised domain adaptation is that labeled data is expensive; therefore, training on a combination of labeled and unlabeled data is an economical approach to improve performance when adapting to a new domain (Blum and Mitchell, 1998; Daume III and Marcu, 2006; Hoffman et al., 2018; Chen et al., 2020).

Figure 1: We view domain adaptation as a consumer choice problem (Becker, 1965; Lancaster, 1966). The NLP practitioner (consumer) is faced with the problem of choosing an optimal combination of annotation and pre-training under a constrained budget. This figure is purely for illustration and is not based on experimental data.

Recent work has shown that pre-training in-domain Transformers is an effective method for unsupervised adaptation (Han and Eisenstein, 2019; Wright and Augenstein, 2020) and even boosts performance when large quantities of in-domain data are available (Gururangan et al., 2020). However, modern pre-training methods incur substantial costs (Izsak et al., 2021), and generate carbon emissions (Strubell et al., 2019; Schwartz et al., 2020; Bender et al., 2021). This raises an important question: given a fixed budget to improve a model’s performance, what steps should an NLP practitioner take? On one hand, they could hire annotators to label in-domain task-specific data, while on the other, they could buy or rent GPUs or TPUs to pre-train large in-domain language models. In this paper, we empirically study the best strategy for adapting to a new domain given a fixed budget.

We view the NLP practitioner’s dilemma of how to adapt to a new domain as a problem of consumer choice, a classical problem in microeconomics (Becker, 1965; Lancaster, 1966). As illustrated in Figure 1, the NLP practitioner (consumer) can obtain $X_a$ annotated documents (by hiring annotators) at a cost of $C_a$ each, and $X_p$ hours of pre-training (by renting GPUs or TPUs) at a cost of $C_p$ per hour. Given a fixed budget $B$, the consumer may choose any combination that fits within the budget constraint $X_a C_a + X_p C_p \le B$. The goal is to choose a combination that maximizes the utility function, $U(X_a, X_p)$, which can be defined using an appropriate performance metric, such as $F_1$ score, that is achieved after pre-training for $X_p$ hours and then fine-tuning on $X_a$ in-domain documents.

To empirically estimate the cost of annotation, we hire annotators to label domain-specific documents for supervised fine-tuning in three procedural text domains: wet-lab protocols, paragraphs describing scientific procedures in PubMed articles, and chemical synthesis procedures described in patents. We choose to target natural language understanding for scientific procedures in this study, because there is an opportunity to help automate lab protocols and support more reproducible scientific experiments, yet few annotated datasets currently exist in these domains. Furthermore, annotation of scientific procedures is not easily amenable to crowdsourcing, making this an ideal testbed for pre-training-based domain adaptation. We measure the cost of in-domain pre-training on a large collection of unlabeled procedural texts using Google’s Cloud TPUs. Model performance is then evaluated under varying budget constraints in six source and target domain combinations.

Our analysis suggests that given current costs of pre-training large Transformer models, such as BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019), in-domain data annotation should always be part of an economical strategy when adapting a single NLP system to a new domain. For small budgets (e.g. less than $800 USD), spending all funds on annotation is the best policy; however, as more funding becomes available, a combination of pre-training and annotation is the best choice.

This paper addresses a specific question that is often faced by NLP practitioners working on applications: what is the most economical approach to adapt an NLP system to a new domain when no pre-trained models or task-annotated datasets are initially available? If multiple NLP systems need to be adapted to a single target domain, model costs can be amortized, making pre-training an attractive option for smaller budgets.

2 Scope of the Study

In this study, we focus on a typical scenario faced by an NLP practitioner: adapting a single NLP system to a single new domain while maximizing performance within a constrained budget. We consider only the direct benefit on the target task in our main analysis (§5); however, we do provide additional analysis of positive externalities on other related tasks that may benefit from a new pre-trained model in §6.

We estimate cost based on two major expenses: annotating task-specific data (§3) and pre-training domain-specific models using TPUs (§4). Note that fine-tuning costs are not included in our analysis, as they are nearly equal whether budget is invested into pre-training or annotation. (These are also not a significant portion of overall costs: we estimate $1.95 based on Google Cloud rates for P100 GPUs.) We assume a generic BERT is the closest zero-cost model that is initially available, which is likely the case in real-world domain adaptation scenarios (especially for non-English languages). Our experiments are designed to simulate a scenario where no domain-specific model is initially available. We also assume that the NLP engineer’s salary is a fixed cost; in other words, their salary will be the same whether they spend time pre-training models or managing a group of annotators. (Based on our experience, pre-training in-domain models, for which collecting a training corpus is non-trivial, and managing a team of annotators are roughly comparable in terms of effort.) Our primary concerns are about financial and environmental costs, rather than the overall time needed to obtain the adapted model. If the timeline is an important factor, the annotation process can possibly be sped up by hiring more annotators.

3 Estimating Annotation Cost ($C_a$)

In this section, we present our estimates of the annotation cost for three procedural text datasets from specialized scientific domains, which enable a comparison of model performance under varying budget constraints (§5.4).

| Dataset | Domain | Task | #Files (train/dev/test) | #Sentences | #Cases | #Classes | Total Cost | Price/File | Price/Sent. |
|---|---|---|---|---|---|---|---|---|---|
| WLP (Tabassum et al., 2020) | biology | NER | 726 (492/123/111) | 17,658 | 185,313 | 20 | $7,820 | $10.8 | $0.44 |
| | | RE | | | 124,803 | 16 | | | |
| PubMedM&M (this work) | biomed | NER | 191 (100/41/50) | 1,699 | 12,131 | 24 | $1,730 | $9.1 | $1.02 |
| | | RE | | | 8,987 | 17 | | | |
| ChemSyn (this work) | chemistry | NER | 992 (793/99/100) | 8,331 | 53,423 | 24 | $5,000 | $4.7 | $0.60 |
| | | RE | | | 46,878 | 17 | | | |

Table 1: Statistics and examples of three procedural text datasets.

Annotated Procedural Text Datasets.

We experiment with three procedural text corpora, including Wet Lab Protocols (WLP; Tabassum et al., 2020) and two new datasets we created for this study, which include scientific articles and chemical patents. Statistics of the three datasets are shown in Table 1. The WLP corpus includes 726 wet lab experiment instructions, which are annotated using an inventory of 20 entity types and 16 relation types. Following the same annotation scheme, we annotate PubMedM&M and ChemSyn. The PubMedM&M corpus consists of 191 double-annotated experimental paragraphs extracted from the Materials and Methods section of PubMed articles. The ChemSyn corpus consists of 992 chemical synthesis procedures described in patents, 500 of which are double-annotated. Unlike the succinct, informal language style in WLP, PubMedM&M represents an academic writing style, as it comes from published research papers (see Table 1). More details on data pre-processing, annotation and inter-annotator agreement scores can be found in Appendix A.

Annotation Cost.

We recruit undergraduate students to annotate the datasets using the BRAT annotation tool. Annotators are paid 13 USD / hour throughout the process, which is the standard rate for undergraduate students at our university. Estimates of the cost of annotation, $C_a$, per sentence are presented in Table 1. (The average time to annotate each sentence varies across datasets based on the complexity of the text, average length of sentences, etc.)

4 Estimating Pre-training Cost ($C_p$)

To evaluate varied strategies for combining pretraining and annotation given a fixed budget, we need accurate estimates of the cost of annotation, $C_a$, and pretraining, $C_p$. Having estimated the cost of annotating in-domain procedural text corpora in §3, we now turn to estimating the cost of in-domain pretraining. Specifically, we consider two popular approaches: 1) training an in-domain language model from scratch; 2) continued pre-training using an off-the-shelf model.

Procedure Corpus Collection.

To pre-train our models, we create a novel collection of procedural texts from the same domains as the annotated data in §3, hereinafter referred to as the Procedure corpus.

Specially trained classifiers were used to identify paragraphs describing experimental procedures. For PubMed, we fine-tuned SciBERT (Beltagy et al., 2019) on the SciSeg dataset (Dasigi et al., 2017), which is annotated with scientific discourse structure, to extract procedures from the Materials and Methods section of 680k articles. For the chemical synthesis domain, the chemical reaction extractor developed by Lowe (2012) was applied to the Description section of 303k patents (174k U.S. and 129k European) we collected from USPTO and EPO. More details of our data collection process can be found in Appendix B.

Cooking recipes are also an important domain for research on procedural text understanding; therefore, we include the text component of the Recipe1M+ dataset (Marín et al., 2021) in the Procedure pre-training corpus. In total, our Procedure collection contains around 1.1 billion words; more statistics are shown in Table 2. In addition, we create an extended version, Procedure+, consisting of 12 billion words, where we up-sample the procedural paragraphs 6 times and combine them with the original full text of 680k PubMed articles and 303k chemical patents. This up-sampling ensures at least half of the text is procedural.

Pre-training Process and Cost.

We train two procedural domain language models on the Google Cloud Platform using 8-core v3 TPUs: 1) ProcBERT, a BERT model pre-trained from scratch using our Procedure+ corpus, and 2) Proc-RoBERTa, for which we continued pre-training RoBERTa on the Procedure corpus following Gururangan et al. (2020).

We pre-train ProcBERT using the TensorFlow codebase of BERT. Following Devlin et al. (2019), we deploy a two-step regime: the model is trained with sequence length 128 and batch size 512 for 1 million steps at a rate of 4.71 steps/second. Then, it is trained for 100k more steps using sequences of length 512 and a batch size of 256 at a rate of 1.83 steps/second. The pretraining process takes about 74 hours, and the total cost is about 620 USD, which includes the price for on-demand TPU-v3s (8 USD/hour) plus auxiliary costs for virtual machines and data storage.
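The reported training time follows from the step counts and throughput quoted above. The sketch below is only illustrative arithmetic: the helper name `training_hours` is hypothetical, and the auxiliary cost is inferred as the gap between the reported total of about 620 USD and the TPU charge, not a figure reported separately in the text.

```python
def training_hours(steps: int, steps_per_second: float) -> float:
    """Wall-clock hours needed to run `steps` updates at a given throughput."""
    return steps / steps_per_second / 3600.0


# Two-stage schedule quoted in the text.
stage1_hours = training_hours(1_000_000, 4.71)  # sequence length 128, batch size 512
stage2_hours = training_hours(100_000, 1.83)    # sequence length 512, batch size 256
total_hours = stage1_hours + stage2_hours

TPU_USD_PER_HOUR = 8.0        # on-demand TPU v3 price quoted in the text
REPORTED_TOTAL_USD = 620.0    # reported total, including auxiliary costs

tpu_cost_usd = total_hours * TPU_USD_PER_HOUR
# Inferred remainder for virtual machines and storage (an assumption, not a reported number).
implied_auxiliary_usd = REPORTED_TOTAL_USD - tpu_cost_usd

print(f"stage 1: {stage1_hours:.1f} h, stage 2: {stage2_hours:.1f} h, total: {total_hours:.1f} h")
print(f"TPU cost: {tpu_cost_usd:.0f} USD, implied auxiliary cost: {implied_auxiliary_usd:.0f} USD")
```

The two stages sum to roughly 74 hours, in line with the reported duration, and the TPU charge alone accounts for most of the reported total.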

We considered the possibility of evaluating checkpoints of partially pre-trained models, for fine-grained variation of the pre-training budget; however, after some investigation, we chose to report results only on fully pre-trained models, using established training protocols (learning rate, number of parameter updates, model size, sequence length, etc.) to ensure fair comparison.

In addition to pre-training from scratch, we also experiment with Domain-Adaptive Pre-training, using the codebase released by AI2 to train Proc-RoBERTa. Similar to Gururangan et al. (2020), we continue pre-training RoBERTa on our collected Procedure corpus for 12.5k steps at an average speed of 27.27 seconds per step, which leads to a TPU time of 95 hours (this is comparable to the number reported by the authors of Gururangan et al. (2020) on GitHub). Thus, the total cost of Proc-RoBERTa is around 800 USD after including the auxiliary expenses.

Finally, we estimate the cost of training for SciBERT (Beltagy et al., 2019), which was also trained on an 8-core TPU v3 using a two-stage training process similar to ProcBERT. The overall training of SciBERT took 7 days (5 days for the first stage and 2 days for the second stage) with an estimated cost of 1,340 USD.

Carbon Footprint.

Apart from the financial cost, we also estimate the carbon footprint of each in-domain pre-trained language model to assess its environmental impact. We measure the energy consumption in kilowatt-hours (KWh) as in Patterson et al. (2021):

$$E = \frac{t \times n \times P \times \mathrm{PUE}}{1000}$$

where $t$ is the number of training hours, $n$ is the number of processors used, $P$ is the average power per processor in watts (unlike Strubell et al. (2019), which measured GPU, CPU and DRAM power separately, Patterson et al. (2021) measured the power of a processor together with other components including fans, network interface, host CPU, etc.), and $\mathrm{PUE}$ (Power Usage Effectiveness) indicates the energy usage efficiency of a data center. In our case, the average power per TPU v3 processor is 283 watts, and we use a PUE coefficient of 1.10, which is the average trailing twelve-month PUE reported for all Google data centers in Q1 2021. Once we know the energy consumption, we can estimate the CO2 emissions (CO2e) as follows:

$$\mathrm{CO_2e} = E \times c$$

where $c$ measures the amount of CO2e emitted when consuming 1 KWh of energy, which is 474g/KWh for our pre-training (our models were pre-trained in Google's data center in the Netherlands). For example, ProcBERT is pre-trained on a single 8-core TPU v3 for 74 hours, resulting in CO2 emission of $74 \times 8 \times 283 \times 1.10 / 1000 \times 0.474 \approx 87.4$ kg. The estimated CO2 emissions for three in-domain language models are shown in Table 3.
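The energy and emission estimates described above can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the function and constant names are hypothetical, each TPU v3 core is counted as one processor (8 per device), and the training times are the 74 hours, 95 hours and 7 days quoted in the text.

```python
def energy_kwh(hours: float, n_processors: int, watts_per_processor: float, pue: float) -> float:
    """Energy in kilowatt-hours: hours x processors x average watts x PUE / 1000."""
    return hours * n_processors * watts_per_processor * pue / 1000.0


def co2e_kg(kwh: float, grams_co2e_per_kwh: float) -> float:
    """Emissions in kilograms of CO2e for a given energy use and grid carbon intensity."""
    return kwh * grams_co2e_per_kwh / 1000.0


TPU_V3_WATTS = 283.0          # average power per TPU v3 processor (from the text)
PUE = 1.10                    # Power Usage Effectiveness (from the text)
GRAMS_CO2E_PER_KWH = 474.0    # carbon intensity of the data center (from the text)
N_PROCESSORS = 8              # assumption: one 8-core TPU v3 counted as 8 processors

# Training times quoted in the text, in hours.
tpu_hours = {"ProcBERT": 74, "Proc-RoBERTa": 95, "SciBERT": 7 * 24}

footprint_kg = {
    model: co2e_kg(energy_kwh(hours, N_PROCESSORS, TPU_V3_WATTS, PUE), GRAMS_CO2E_PER_KWH)
    for model, hours in tpu_hours.items()
}

for model, kg in footprint_kg.items():
    print(f"{model}: {kg:.1f} kg CO2e")
```

Under these assumptions the three estimates agree with the values listed in Table 3 to one decimal place.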

| Corpus | #Tokens | Text Size | Pre-trained Model |
|---|---|---|---|
| Wiki + Books | 3.3B | 16GB | BERT |
| Web crawl | - | 160GB | RoBERTa |
| PMC + CS | 3.2B | - | SciBERT |
| BioMed | 7.6B | 47GB | BioMed-RoBERTa |
| Procedure | 1.05B | 6.5GB | Proc-RoBERTa |
| - PubMed | 0.32B | 2.0GB | |
| - Chem. patent | 0.61B | 3.9GB | |
| - Cook. recipe | 0.11B | 0.6GB | |
| Procedure+ | 12B | 77GB | ProcBERT |
| - Procedure (×6) | 6.3B | 39GB | |
| - Full articles | 5.7B | 38GB | |

Table 2: Statistics of our newly created Procedure and Procedure+ corpora, which are used for pre-training Proc-RoBERTa and ProcBERT, respectively.

| | CO2e (kg) |
|---|---|
| Air travel, one person, SF↔NY | 1200 (Source: Google Flights (Patterson et al., 2021)) |
| SciBERT (Beltagy et al., 2019) | 198.3 |
| Proc-RoBERTa | 112.1 |
| ProcBERT | 87.4 |

Table 3: Carbon footprint of three in-domain pre-trained language models. CO2e is the number of metric tons of CO2 emissions with the same global warming potential as one metric ton of another greenhouse gas.

5 Measuring Utility under Varying Budget Constraints

Given the estimated unit costs of annotation ($C_a$; §3) and pre-training ($C_p$; §4), we now empirically evaluate the utility, $U(X_a, X_p)$, of various budgets and pre-training strategies to find an optimal policy for domain adaptation that fits within the budget constraint $X_a C_a + X_p C_p \le B$.

5.1 NLP Tasks and Models

We experiment with two NLP tasks, Named Entity Recognition (NER) and Relation Extraction (RE). For NER, we follow Devlin et al. (2019) to feed the contextualized embedding of each token into a linear classification layer. For RE, we follow Zhong and Chen (2020), inserting four special tokens specifying positions and types of each entity-pair mention, which are included as input to a pre-trained sentence encoder. Gold entity mentions are used in our relation extraction experiments, to reduce variance due to entity recognition errors.

5.2 Budget-constrained Experimental Setup

As we have three procedural text datasets (§3) annotated with entities and relations, we can experiment with six source→target adaptation settings. For each domain pair, we compare five different pre-trained language models when adapted to the procedural text domain under varying budgets.

Based on the estimates of the annotation costs (§3) and pre-training costs (§4), we conduct various budget-constrained domain adaptation experiments. For example, if we have $1,500 and the PubMedM&M corpus, to build an NER model that works best for the ChemSyn domain (PubMedM&M→ChemSyn), we can spend all $1,500 to annotate 2,500 in-domain sentences to fine-tune off-the-shelf BERT. Alternatively, we could first spend $800 to pre-train Proc-RoBERTa, then fine-tune it on 1,166 sentences annotated in the ChemSyn domain using the remaining $700. Under both budgeting strategies, an additional experiment is performed to choose one of two domain adaptation methods that maximizes performance: 1) a model that simply uses the annotated data in the target domain for fine-tuning; or 2) a model which is fine-tuned using a variant of EasyAdapt (Daumé III, 2007) to leverage annotated data in both the source and target domains (see below for details). We select the approach that has better development set performance and report its test set result in Tables 4 and 5 (see Appendix D for more details about hyper-parameters).
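The budget split in this example is simple arithmetic: whatever remains after paying for pre-training is divided by the per-sentence annotation price. The sketch below illustrates this with the per-sentence prices from Table 1 and the pre-training costs from §4; the function and dictionary names are hypothetical, "BERT" stands for any zero-cost off-the-shelf model, and the sketch does not model the case where a dataset runs out of sentences before the budget is spent.

```python
from typing import Optional

# Per-sentence annotation prices from Table 1, in cents to keep the arithmetic exact.
PRICE_CENTS_PER_SENTENCE = {"WLP": 44, "PubMedM&M": 102, "ChemSyn": 60}

# Pre-training costs in USD from Section 4; off-the-shelf BERT is treated as free.
PRETRAIN_COST_USD = {"BERT": 0, "ProcBERT": 620, "Proc-RoBERTa": 800, "SciBERT": 1340}


def affordable_sentences(budget_usd: int, model: str, target: str) -> Optional[int]:
    """Sentences that can be annotated in `target` after paying to pre-train `model`.

    Returns None when pre-training leaves nothing for annotation.
    """
    remaining_usd = budget_usd - PRETRAIN_COST_USD[model]
    if remaining_usd <= 0:
        return None
    return (remaining_usd * 100) // PRICE_CENTS_PER_SENTENCE[target]


for model in PRETRAIN_COST_USD:
    print(model, affordable_sentences(1500, model, "ChemSyn"))
```

For a $1,500 budget and ChemSyn as the target, this gives 2,500 sentences for off-the-shelf BERT and 1,166 for Proc-RoBERTa, which are the sentence counts listed in Table 4.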

| Source→Target | Budget | $0 | $0 | $620 (ProcBERT) | $800 (Proc-RoBERTa) | $1340 (SciBERT) |
|---|---|---|---|---|---|---|
| | | F1 (#sent) | F1 (#sent) | F1 (#sent) | F1 (#sent) | F1 (#sent) |
| PubMedM&M→ChemSyn | $700 | 92.99±0.2 (1166) | 93.32±0.3 (1166) | 91.07±0.4 (133) | N/A | N/A |
| | $1500 | 93.58±0.3 (2500) | 94.14±0.2 (2500) | 94.73±0.1 (1466) | 93.83±0.1 (1166) | 92.46±0.2 (266) |
| | $2300 | 94.47±0.2 (3833) | 94.81±0.1 (3833) | 95.17±0.1 (2800) | 94.71±0.2 (2500) | 94.39±0.2 (1600) |
| WLP→ChemSyn | $700 | 93.31±0.3 (1166) | 93.59±0.3 (1166) | 90.83±0.3 (133) | N/A | N/A |
| | $1500 | 94.31±0.3 (2500) | 94.44±0.2 (2500) | 94.88±0.1 (1466) | 94.20±0.2 (1166) | 92.39±0.2 (266) |
| | $2300 | 94.47±0.2 (3833) | 95.09±0.3 (3833) | 95.52±0.1 (2800) | 94.63±0.3 (2500) | 94.39±0.2 (1600) |
| →PubMedM&M | $700 | 74.41±0.8 (686) | 75.41±0.5 (686) | 68.30±0.7 (78) | N/A | N/A |
| | $1500 | N/A | N/A | 76.76±0.7 (862) | 76.02±0.5 (686) | 72.66±0.4 (156) |
| →PubMedM&M | $700 | 75.10±0.5 (686) | 75.71±0.4 (686) | 72.85±0.6 (78) | N/A | N/A |
| | $1500 | N/A | N/A | 77.62±0.5 (862) | 76.28±0.5 (686) | 73.93±0.4 (156) |
| →WLP | $700 | 72.23±0.4 (1590) | 73.21±0.2 (1590) | 72.42±0.4 (181) | N/A | N/A |
| | $1500 | 73.30±0.4 (3409) | 73.46±0.4 (3409) | 75.15±0.4 (2000) | 73.90±0.5 (1590) | 72.48±0.4 (363) |
| | $2300 | 73.18±0.4 (5227) | 74.12±0.2 (5227) | 75.88±0.3 (3818) | 74.67±0.4 (3409) | 74.68±0.5 (2181) |
| →WLP | $700 | 72.66±0.3 (1590) | 73.18±0.7 (1590) | 72.91±0.3 (181) | N/A | N/A |
| | $1500 | 73.62±0.4 (3409) | 73.46±0.4 (3409) | 75.25±0.1 (2000) | 73.98±0.2 (1590) | 72.58±0.2 (363) |
| | $2300 | 73.73±0.3 (5227) | 74.26±0.2 (5227) | 75.78±0.3 (3818) | 74.68±0.2 (3409) | 74.80±0.2 (2181) |

Table 4: Experiment results for Named Entity Recognition (NER). Each cell reports mean F1 ± standard deviation, with #sent in parentheses. With higher budgets ($1500 and $2300), our in-domain pre-training of ProcBERT achieves the best results in combination with data annotation. For a smaller budget ($700), investing all funds in annotation and fine-tuning the standard BERT (considered as cost-free) will yield the best outcome. #sent is the number of sentences from the target domain, annotated under the given budget, used for training. Some results are obtained using EasyAdapt (§5.3), where source domain data helps.

5.3 EasyAdapt

In most of our experiments, we have access to a relatively large amount of labeled data from a source domain, and varying amounts of data from the target domain. Instead of simply concatenating the source and target datasets for fine-tuning, we propose a simple, yet novel variation of EasyAdapt (Daumé III, 2007) for pre-trained Transformers. More specifically, we create three copies of the model’s contextualized representations: one represents the source domain, one represents the target, and the third is domain-independent. These contextualized vectors are then concatenated and fed into a linear layer that is 3 times as large as the base model’s. When encoding data from a specific domain (e.g. ChemSyn), the other domain’s representations are zeroed out (1/3 of the new representations will always be 0.0). This enables the domain-specific block of the linear layer to encode information specific to that domain, while the domain-independent parameters can learn to represent information that transfers across domains. This is similar to prior work using EasyAdapt (Kim et al., 2016) for LSTMs.
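The feature augmentation described above can be sketched in a few lines. This is an illustrative, framework-free sketch rather than the actual implementation: plain Python lists stand in for a token's contextualized vector, the block order (source, target, shared) and all names are assumptions, and the toy weights are made up to show the effect of zeroing.

```python
from typing import List

DOMAINS = ("source", "target")


def easyadapt_features(h: List[float], domain: str) -> List[float]:
    """Concatenate [source block | target block | shared block] for one token vector.

    The block belonging to the other domain is zeroed out, so one third of the
    augmented vector is always 0.0.
    """
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    zeros = [0.0] * len(h)
    source_block = list(h) if domain == "source" else zeros
    target_block = list(h) if domain == "target" else zeros
    return source_block + target_block + list(h)


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """A plain linear layer; its input size is 3x the encoder's hidden size."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


h = [0.5, -1.0]                      # toy 2-dimensional token representation
x_target = easyadapt_features(h, "target")
x_source = easyadapt_features(h, "source")

# One output unit with separate weights per block: source, target, shared.
weights = [[1.0, 1.0, 2.0, 2.0, 3.0, 3.0]]
bias = [0.0]
print(x_target, linear(x_target, weights, bias))
print(x_source, linear(x_source, weights, bias))
```

For a target-domain token the source-block weights receive zero input and therefore no gradient, while the shared block is updated by examples from both domains.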

5.4 Experimental Results and Analysis

We present the test set NER and RE results with five annotation and pre-training combination strategies under six domain adaptation settings in Tables 4 and 5, respectively. We report averages across five random seeds together with their standard deviations. If the pre-training cost exceeds the total budget, or the available data is insufficient to use up the annotation budget, we indicate the result as “N/A”. We now discuss a set of key questions regarding pre-training-based domain adaptation under a constrained budget, which our experiments can shed some light on.

Should we prioritize pre-training or data annotation for domain adaptation?

For all six domain adaptation settings in Table 4, spending the entire budget on annotation and using the off-the-shelf language model BERT works best for NER when the budget is 700 USD, showing the effectiveness of data annotation in low-resource scenarios. As the budget increases, performance gains from labelling additional data diminish, and pre-training in-domain language models takes the lead. ProcBERT, which is pre-trained from scratch on the Procedure+ corpus at a cost of only 620 USD, performs best at budgets of 1500 and 2300 USD. This demonstrates that combining domain-specific pre-training with data annotation is the best strategy in high-resource settings. Similarly for RE, as shown in Table 5, using all funds for data annotation and working with off-the-shelf models achieves better performance at lower budgets, while domain-specific pre-training starts to excel as the budget increases past a certain point.

| Source→Target | Budget | $0 | $0 | $620 (ProcBERT) | $800 (Proc-RoBERTa) | $1340 (SciBERT) |
|---|---|---|---|---|---|---|
| | | F1 (#sent) | F1 (#sent) | F1 (#sent) | F1 (#sent) | F1 (#sent) |
| PubMedM&M→ChemSyn | $700 | 90.12±0.4 (1166) | 90.71±0.8 (1166) | 88.21±0.6 (133) | N/A | N/A |
| | $1500 | 91.30±0.2 (2500) | 91.84±0.5 (2500) | 92.06±0.4 (1466) | 91.25±0.7 (1166) | 89.58±0.5 (266) |
| | $2300 | 91.81±0.2 (3833) | 92.90±0.4 (3833) | 92.57±0.2 (2800) | 92.07±0.3 (2500) | 91.55±0.2 (1600) |
| WLP→ChemSyn | $700 | 90.20±0.7 (1166) | 90.15±0.9 (1166) | 88.01±0.6 (133) | N/A | N/A |
| | $1500 | 91.34±0.3 (2500) | 91.61±0.5 (2500) | 92.27±0.2 (1466) | 91.77±0.2 (1166) | 89.16±0.7 (266) |
| | $2300 | 92.08±0.4 (3833) | 92.73±0.4 (3833) | 92.85±0.2 (2800) | 92.44±0.6 (2500) | 91.42±0.4 (1600) |
| →PubMedM&M | $700 | 77.74±0.7 (686) | 79.33±1.3 (686) | 74.63±0.4 (78) | N/A | N/A |
| | $1500 | N/A | N/A | 80.10±0.6 (862) | 79.33±1.3 (686) | 75.70±1.1 (156) |
| →PubMedM&M | $700 | 77.02±0.6 (686) | 77.31±0.4 (686) | 73.92±0.8 (78) | N/A | N/A |
| | $1500 | N/A | N/A | 79.50±0.7 (862) | 77.12±1.2 (686) | 75.43±0.6 (156) |
| →WLP | $700 | 78.81±0.6 (1590) | 79.80±0.6 (1590) | 78.97±0.7 (181) | N/A | N/A |
| | $1500 | 79.91±0.4 (3409) | 80.15±0.8 (3409) | 80.87±0.7 (2000) | 79.69±0.4 (1590) | 79.44±0.8 (363) |
| | $2300 | 80.54±0.6 (5227) | 80.97±0.2 (5227) | 81.15±0.6 (3818) | 81.33±0.5 (3409) | 80.16±0.7 (2181) |
| →WLP | $700 | 78.34±0.8 (1590) | 78.44±1.2 (1590) | 77.98±0.8 (181) | N/A | N/A |
| | $1500 | 79.40±0.6 (3409) | 79.78±0.6 (3409) | 80.45±0.2 (2000) | 78.79±1.1 (1590) | 78.76±0.5 (363) |
| | $2300 | 79.93±0.5 (5227) | 80.04±0.8 (5227) | 80.85±0.6 (3818) | 79.49±0.7 (3409) | 80.33±0.3 (2181) |

Table 5: Experiment results for Relation Extraction (RE). Each cell reports mean F1 ± standard deviation, with #sent in parentheses. Similar to the observations in Table 4, regardless of a small or large budget, prioritizing data annotation in the target domain is the most beneficial. Some results are obtained using EasyAdapt (§5.3), where source domain data helps.

What is the starting budget to consider pre-training an in-domain language model?

To answer this question, we plot test set NER performance for two strategies, BERT (investing all funds on annotation) and ProcBERT (combining annotation with pre-training), against varying budgets in Figure 2. Specifically, the budget of each strategy starts with the pre-training cost of its associated language model, and is increased in 155 USD increments until the total budget reaches the total cost of available data. (We also add a few points at the start of each curve to make them smoother: 50 USD, 75 USD and 100 USD for BERT, and 670 USD, 695 USD and 720 USD for ProcBERT.) We observe a similar trend for both PubMedM&M→ChemSyn and ChemSyn→WLP: annotation alone works better at lower budgets, while in-domain pre-training (ProcBERT) excels at higher budgets. However, the curves for these two strategies intersect at different points, around 1085 USD and 775 USD, respectively.
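The budget points on each curve can be enumerated directly from this description. The sketch below is a hypothetical helper, not the authors' plotting code: `budget_grid` and its parameter names are assumptions, and the upper limits used in the example (1,500 and 400 USD) are placeholders for the dataset-specific total annotation cost.

```python
from typing import List, Sequence


def budget_grid(
    pretrain_cost: int,
    max_budget: int,
    step: int = 155,
    early_offsets: Sequence[int] = (50, 75, 100),
) -> List[int]:
    """Budgets (USD) at which a strategy is evaluated.

    The grid starts at the model's pre-training cost and grows in fixed
    increments; a few extra early points are added for a smoother curve.
    """
    regular = set(range(pretrain_cost, max_budget + 1, step))
    early = {pretrain_cost + off for off in early_offsets if pretrain_cost + off <= max_budget}
    return sorted(regular | early)


procbert_grid = budget_grid(pretrain_cost=620, max_budget=1500)  # placeholder upper limit
bert_grid = budget_grid(pretrain_cost=0, max_budget=400)         # placeholder upper limit

print(procbert_grid)
print(bert_grid)
```

Because the ProcBERT grid starts at 620 USD and advances in 155 USD steps, the crossover budgets discussed in this section (775, 1085 and 1395 USD) all fall on grid points, so they are accurate only to the grid resolution.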

Figure 2: Comparison of two domain adaptation strategies: 1) allocate all available funds to data annotation; 2) pre-train ProcBERT on in-domain data, then use the remaining budget for annotation. For small budgets, the former yields the best performance on NER, but as the budget increases, the latter becomes the best choice. Panels: (a) PubMedM&M→ChemSyn; (b) ChemSyn→WLP.

Figure 3: Comparison of spending the entire budget on data annotation (BERT) and pre-training followed by in-domain annotation (ProcBERT), where models are trained on target domain labeled data only. The crossover point for WLP moves from 775 USD (adapted from ChemSyn) to around 1395 USD (WLP only), demonstrating that a large source domain dataset can reduce the need for target domain annotation. Panels: (a) ChemSyn; (b) WLP.

We hypothesize there are two reasons for this difference: 1) each target domain may require different amounts of labeled data to generalize well; 2) the quantity of labeled data from the source domain may also impact the need for data annotation in the target domain. To test these hypotheses, we evaluate the utility of annotation vs. pre-training where no source-domain data is available in Figure 3. This is almost identical to the setting of Figure 2, except that models are trained on target domain labeled data only. Here, we observe the intersection for ChemSyn is still around 1085 USD, while the crossover point for WLP moves from the original 775 USD (in Figure 2) to around 1395 USD. Our hypothesis is that WLP is a broader domain compared to ChemSyn (WLP covers a more diverse range of protocols that include cell cultures, DNA sequencing, etc.), so it requires more annotated data to perform well under the setting of Figure 3. However, when adapted from a large source domain dataset like ChemSyn, the need for annotated WLP data is reduced, so that ProcBERT can outperform BERT at a lower budget.

Note that our estimated annotation cost, $C_a$ (per sentence), includes the annotation of both entity mentions and relations, so our analysis amortizes the cost of pre-training across both tasks. In a scenario where more tasks need to be adapted for the target domain, this could be accounted for simply by dividing the cost of pre-training among tasks, which would shift the black curves in Figure 2 and Figure 3 to the left, making pre-training an economical choice at lower budgets. (For more discussion on our assumptions, see §2.)

| Target | Entities | BERT ($0) | ProcBERT ($620) |
|---|---|---|---|
| ChemSyn (budget $1085) | seen by both | 97.63 | 97.45 |
| | unseen by both | 84.96 | 85.65 |
| | seen by BERT | 94.07 | 93.44 |
| | All | 93.78 | 93.82 |
| WLP (budget $1395) | seen by both | 85.42 | 84.78 |
| | unseen by both | 55.75 | 57.19 |
| | seen by BERT | 76.50 | 76.27 |
| | All | 73.51 | 73.51 |

Table 6: Test set F1 on NER for entities seen and unseen in the training data for BERT and ProcBERT, when the two achieve very similar overall performance under the same budget constraints in Figure 3. ProcBERT performs better on the unseen entities.

When using the same budget and achieving similar F1, how do pre-training and annotation differ?

In the previous experiments, we show that in-domain pre-training is an effective domain adaptation method, especially in high-budget settings. ProcBERT can work very well when trained with less labeled data. A plausible explanation is that in-domain pre-training improves generalization to new entities in the target domain, whereas additional annotation improves the performance on entities that are observed in the training corpus. To evaluate this hypothesis, we compare model predictions of the two strategies at the crossover points in Figure 3 (we choose Figure 3 for this analysis instead of Figure 2 because we want to isolate the impact of source domain labeled data), and consider each entity in the test set as "Seen" or "Unseen" based on whether it was observed in the training set. Then, we calculate the F1 score for each category as shown in Table 6. Although the models that are compared achieve nearly identical overall performance, the decomposition of performance on seen and unseen entities in Table 6 clearly suggests that in-domain pre-training leads to better generalization on unseen entities, whereas allocating more budget to annotation boosts performance on entities that were seen during training. This may help provide an explanation for the main finding of this paper: in-domain pre-training results in better generalization on unseen mentions, leading to better marginal utility, but only after enough in-domain annotations are observed to fully cover the head of the distribution.
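A decomposition of this kind can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: an entity counts as "seen" if its lowercased surface string occurs among the training mentions, predictions are scored by exact match on span and type, and only a single training set is considered (Table 6 additionally distinguishes entities seen by only one of the two models). All names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# (sentence id, start token, end token, entity type, surface text)
Entity = Tuple[int, int, int, str, str]


def f1_score(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Exact-match F1 between gold and predicted entity sets."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def f1_by_seen_status(
    gold: Iterable[Entity], pred: Iterable[Entity], train_surface_forms: Iterable[str]
) -> Dict[str, float]:
    """F1 on entities whose surface form was / was not observed in training."""
    seen_forms = {form.lower() for form in train_surface_forms}
    gold, pred = set(gold), set(pred)

    def is_seen(entity: Entity) -> bool:
        return entity[4].lower() in seen_forms

    return {
        "seen": f1_score({e for e in gold if is_seen(e)}, {e for e in pred if is_seen(e)}),
        "unseen": f1_score({e for e in gold if not is_seen(e)}, {e for e in pred if not is_seen(e)}),
        "all": f1_score(gold, pred),
    }


# Toy example with made-up annotations, for illustration only.
gold = [(0, 0, 1, "Reagent", "ethanol"), (0, 3, 4, "Device", "centrifuge"),
        (1, 0, 2, "Reagent", "sodium azide")]
pred = [(0, 0, 1, "Reagent", "ethanol"), (0, 3, 4, "Device", "centrifuge"),
        (1, 0, 1, "Reagent", "sodium")]
scores = f1_by_seen_status(gold, pred, train_surface_forms=["Ethanol", "centrifuge"])
print(scores)
```

In the toy example the model is perfect on seen mentions and fails on the single unseen one, the pattern that additional annotation alone would be expected to produce.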

6 Positive Externalities of Pretraining

| Model | X-WLP Core | X-WLP Non-Core | ChEMU NER | ChEMU EE | Recipe ET | WLP NER | WLP RE | PubMedM&M NER | PubMedM&M RE | ChemSyn NER | ChemSyn RE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 74.79±0.6 | 77.57±0.6 | 95.14±0.1 | 91.93±0.6 | 80.62±0.7 | 73.87±0.4 | 80.25±0.4 | 74.80±0.7 | 78.73±0.9 | 95.09±0.2 | 92.63±0.2 |
| | 75.53±1.7 | 76.77±0.5 | 95.10±0.2 | 92.10±0.9 | 81.53±0.5 | 74.97±0.3 | 81.39±0.5 | 77.06±0.3 | 78.44±1.3 | 95.26±0.1 | 92.87±0.5 |
| | 75.04±0.8 | 77.24±0.6 | 95.05±0.1 | 92.54±1.2 | 83.41±0.1 | 74.97±0.5 | 80.94±0.5 | 76.21±0.3 | 78.95±0.7 | 95.30±0.2 | 93.39±0.3 |
| | 73.77±1.6 | 74.37±0.2 | 95.16±0.2 | 92.10±1.1 | 84.54±0.8 | 76.37±0.5 | 79.76±0.3 | 78.70±0.6 | 75.66±0.3 | 95.66±0.2 | 92.87±0.2 |
| | 75.48±0.7 | 78.51±0.6 | 95.63±0.0 | 91.80±0.5 | 81.75±0.4 | 75.89±0.4 | 81.29±0.5 | 77.81±0.2 | 79.54±0.9 | 95.82±0.2 | 93.27±0.2 |
| | 74.89±0.6 | 76.39±0.8 | 95.32±0.2 | 92.42±0.6 | 82.92±0.3 | 75.55±0.2 | 81.56±0.4 | 77.14±0.1 | 78.44±2.0 | 95.38±0.2 | 93.16±0.3 |
| | 74.76±0.7 | 76.12±0.9 | 95.49±0.1 | 91.55±0.6 | 84.19±0.3 | 75.76±0.4 | 80.79±0.9 | 76.91±0.7 | 79.16±0.9 | 95.67±0.1 | 93.31±0.2 |
| | 76.73±0.9 | 78.57±0.8 | 96.19±0.1 | 92.32±0.2 | 84.10±0.3 | 76.04±0.2 | 81.44±0.4 | 77.31±0.5 | 80.19±0.6 | 95.97±0.2 | 93.57±0.2 |
| SOTA | 76.5 | 78.1 | 95.70 | 95.36 | 81.96 | 77.99 | 80.46 | | | | |

Table 7: Test set F1 on six procedural text datasets. The best task performance is boldfaced, and the second-best performance is underlined. For the SOTA model of each dataset, we refer readers to the corresponding paper for further details: Tamari et al. (2021) for XWLP, Wang et al. (2020) for ChEMU, Gupta and Durrett (2019) for Recipe, Knafou et al. (2020) for NER on WLP, and Sohrab et al. (2020) for RE on WLP.

So far, we have discussed domain adaptation as a consumer choice problem where annotation and pre-training costs are balanced to maximize performance in a target domain. However, pre-training on large quantities of natural language instructions can improve performance on additional tasks in the procedural text domain, as demonstrated in the following subsections.

6.1 Ancillary Procedural NLP Tasks

In addition to the procedural text datasets discussed in §5, we experiment with three ancillary procedural text corpora, to explore how in-domain pretraining can benefit other tasks.

The ChEMU corpus (Nguyen et al., 2020) contains NER and event annotations for 1500 chemical reaction snippets collected from 170 English patents. Its NER task focuses on identifying chemical compounds, and its event extraction (EE) task aims at detecting chemical reaction events including trigger detection and argument role labeling.

The XWLP corpus (Tamari et al., 2021) provides the Process Event Graphs (PEG) of 279 wet-lab biochemistry protocols. The PEG is a document-level graph-based representation specifying the involved experimental objects.

The Recipe corpus (Kiddon et al., 2016) includes annotations of entity states for 866 cooking recipes. It supports the entity tracking (ET) task, which predicts whether or not a specific ingredient is involved in each step of the recipe.

6.2 Experiments on Ancillary Tasks

For ChEMU, gold arguments are provided, so we only need to identify the event trigger and predict the role of the gold arguments. An event prediction is correct if the event trigger, the associated arguments, and their roles all match the gold event mention. We tackle this task using a pipeline model similar to Zhong and Chen (2020). For XWLP, we focus on the operation argument role labeling task, where gold entities are provided as input. Following Tamari et al. (2021), we decompose the results into "Core" and "Non-Core" roles. For the Recipe task, we follow the data splits and fine-tuning architecture of Gupta and Durrett (2019). The predicted state of an ingredient in a cooking step is correct if it matches the gold label, either present or absent.
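The ChEMU correctness criterion can be illustrated with a small exact-match scorer. This is a sketch under assumptions, not the official ChEMU evaluation script: the `Event` record, the span encoding, and the example type and role names are hypothetical, and an event counts as correct only if its trigger, event type, and full set of (argument, role) pairs equal those of a gold event.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


class Event(NamedTuple):
    """Hypothetical event record used only for this illustration."""
    doc_id: str
    trigger: Span
    event_type: str
    arguments: FrozenSet[Tuple[Span, str]]  # (argument span, role label)


def event_prf(gold: Set[Event], pred: Set[Event]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 where a prediction must match a gold event
    exactly (trigger, type, and every argument with its role)."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

Because arguments are stored as a frozen set, a single wrong role label makes the whole event a miss, which mirrors the strict matching described in the text.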


Test set results of eight pre-trained language models on six procedural text datasets are presented in Table 7 (for ChEMU, we report the development set results because its test set is not publicly available). ProcBERT performs best in most tasks and even achieves state-of-the-art performance on operation argument role labeling ("Core" and "Non-Core") of XWLP, showing the effectiveness of in-domain pre-training.

7 Conclusion

In this paper, we address a number of questions related to the costs of adapting an NLP model to a new domain (Blitzer et al., 2006; Han and Eisenstein, 2019), an important and well-studied problem in NLP. We frame domain adaptation under a constrained budget as a problem of consumer choice. Experiments are conducted using several pre-trained models in three procedural text domains to determine when it is economical to pre-train in-domain transformers (Gururangan et al., 2020), and when it is better to spend available resources on annotation. Our results suggest that when a small number of NLP models need to be adapted to a new domain, pre-training, by itself, is not an economical solution.


Acknowledgments

We are grateful to the anonymous reviewers for helpful feedback on an earlier draft of this paper. We also thank Vardaan Pahuja for assistance with extracting experimental paragraphs from PubMed, and John Niekrasz for sharing the output of Daniel Lowe’s reaction extraction tool on European patents. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0108, in addition to the NSF (IIS-1845670) and IARPA via the BETTER program (2019-19051600004). The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense, IARPA or the U.S. Government. This work is approved for Public Release, Distribution Unlimited.


References

  • G. S. Becker (1965) A Theory of the Allocation of Time. The Economic Journal (299), pp. 493–517.
  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3615–3620.
  • E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell (2021) On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.
  • J. Blitzer, R. McDonald, and F. Pereira (2006) Domain Adaptation with Structural Correspondence Learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 120–128.
  • A. Blum and T. Mitchell (1998) Combining Labeled and Unlabeled Data with Co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.
  • J. Chen, Z. Yang, and D. Yang (2020) MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • P. Dasigi, G. Burns, E. Hovy, and A. D. Waard (2017) Experiment Segmentation in Scientific Discourse as Clause-level Structured Prediction using Recurrent Neural Networks. arXiv.
  • H. Daume III and D. Marcu (2006) Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research.
  • H. Daumé III (2007) Frustratingly Easy Domain Adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp. 256–263.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, pp. 4171–4186.
  • A. Gupta and G. Durrett (2019) Effective Use of Transformer Networks for Entity Tracking. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 759–769.
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8342–8360.
  • X. Han and J. Eisenstein (2019) Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4238–4248.
  • J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: Cycle-Consistent Adversarial Domain Adaptation. In International Conference on Machine Learning (ICML).
  • P. Izsak, M. Berchansky, and O. Levy (2021) How to Train BERT with an Academic Budget. arXiv.
  • C. Kiddon, L. Zettlemoyer, and Y. Choi (2016) Globally Coherent Text Generation with Neural Checklist Models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 329–339.
  • Y. Kim, K. Stratos, and R. Sarikaya (2016) Frustratingly Easy Neural Domain Adaptation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics.
  • D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
  • J. Knafou, N. Naderi, J. Copara, D. Teodoro, and P. Ruch (2020) BiTeM at WNUT 2020 Shared Task-1: Named Entity Recognition over Wet Lab Protocols using an Ensemble of Contextual Language Models. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Online, pp. 305–313.
  • K. J. Lancaster (1966) A New Approach to Consumer Theory. Journal of Political Economy (2), pp. 132–157.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a Robustly Optimized BERT Pretraining Approach. arXiv.
  • D. M. Lowe (2012) Extraction of Chemical Structures and Reactions from the Literature.
  • J. Marín, A. Biswas, F. Ofli, N. Hynes, A. Salvador, Y. Aytar, I. Weber, and A. Torralba (2021) Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 187–203.
  • D. Q. Nguyen, Z. Zhai, H. Yoshikawa, B. Fang, C. Druckenbrodt, C. Thorne, R. Hoessel, S. Akhondi, T. Cohn, T. Baldwin, and K. Verspoor (2020) ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. Advances in Information Retrieval, pp. 572–579.
  • D. Patterson, J. Gonzalez, Q. Le, C. Liang, L. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean (2021) Carbon Emissions and Large Neural Network Training. arXiv.
  • M. E. Peters, M. Neumann, R. Logan, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019) Knowledge Enhanced Contextual Word Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 43–54.
  • R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni (2020) Green AI. Communications of the ACM, pp. 54–63.
  • M. G. Sohrab, A. Duong Nguyen, M. Miwa, and H. Takamura (2020) Mgsohrab at WNUT 2020 Shared Task-1: Neural Exhaustive Approach for Entity and Relation Recognition Over Wet Lab Protocols. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Online, pp. 290–298.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3645–3650.
  • J. Tabassum, W. Xu, and A. Ritter (2020) WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Online, pp. 260–267.
  • R. Tamari, F. Bai, A. Ritter, and G. Stanovsky (2021) Process-Level Representation of Scientific Protocols with Interactive Annotation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Online, pp. 2190–2202.
  • J. Wang, Y. Ren, Z. Zhang, and Y. Zhang (2020) Melaxtech: A Report for CLEF 2020 - ChEMU Task of Chemical Reaction Extraction from Patent. In Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, L. Cappellato, C. Eickhoff, N. Ferro, and A. Névéol (Eds.), CEUR Workshop Proceedings, Vol. 2696.
  • D. Wright and I. Augenstein (2020) Transformer Based Multi-Source Domain Adaptation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7963–7974.
  • Z. Zhong and D. Chen (2020) A Frustratingly Easy Approach for Joint Entity and Relation Extraction. arXiv.

Appendix A Data Annotation

We annotate two datasets, PubMedM&M and ChemSyn, in the domains of scientific articles and chemical patents, mainly following the annotation scheme of the Wet Lab Protocols corpus (WLP; Tabassum et al., 2020). On top of the 20 entity types and 16 relation types in WLP, we add four entity types (Company, Software, Data-Collection and Info-Type) and one relation type (Belong-To) due to two key features of our corpus: 1) scientific articles usually specify the provenance of reagents for better reproducibility; 2) it covers a broader range of procedures such as computer simulation and data analysis.

We recruit four undergraduate students to annotate the datasets using the BRAT annotation tool. We double-annotate all files in PubMedM&M and half of the files in ChemSyn. For the double-annotated files, the coordinator discusses the annotations with each annotator to make sure they follow the guidelines and to resolve disagreements. To compute inter-annotator agreement (IAA), we treat the annotation from one of the two annotators as the gold label and the other annotation as the predicted label, and then use the F1 scores of the Entity (Action) and Relation evaluations as the final IAA scores, which are shown in Table 8. ChemSyn has higher IAA scores, for which there are two potential reasons: 1) we annotate PubMedM&M first, so the annotators might be more experienced when they annotate ChemSyn; 2) PubMedM&M contains more diverse content, such as wet lab experiments and computer simulation procedures, while ChemSyn is mainly about chemical synthesis.

Dataset Entities/Actions Relations
PubMedM&M 70.54 51.94
ChemSyn 79.87 87.20

Table 8: Inter-Annotator Agreement (F1 scores on Entity/Action Identification and Relation Extraction).
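The F1-based agreement score can be sketched as follows. This is an illustrative reading of the procedure rather than the script used to produce Table 8; the tuple encoding of an annotation is an assumption. One property worth noting is that F1 is symmetric in its two inputs, so the choice of which annotator plays the role of "gold" does not change the score.

```python
from typing import Hashable, Set


def pairwise_f1(annotator_a: Set[Hashable], annotator_b: Set[Hashable]) -> float:
    """F1 between two annotation sets, treating one as gold and the other as
    prediction. With tp = |A ∩ B|, precision = tp / |B| and recall = tp / |A|,
    so F1 simplifies to 2 * tp / (|A| + |B|)."""
    total = len(annotator_a) + len(annotator_b)
    if total == 0:
        # Neither annotator marked anything; agreement is undefined,
        # and 0.0 is returned here purely by convention.
        return 0.0
    return 2 * len(annotator_a & annotator_b) / total


# Hypothetical entity annotations encoded as (file, start, end, type).
ann_a = {("f1", 0, 5, "Action"), ("f1", 6, 9, "Reagent"), ("f1", 10, 12, "Amount")}
ann_b = {("f1", 0, 5, "Action"), ("f1", 6, 9, "Device")}
agreement = pairwise_f1(ann_a, ann_b)
```

The same function applies to relations if each relation is encoded as a hashable tuple, for example (file, head span, tail span, relation type).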

Appendix B Procedural Corpus Collection

PubMed Articles.

The first source of our procedural corpus is PubMed articles because they contain a large number of freely accessible experimental procedures. Specifically, we extract procedural paragraphs from the Materials and Methods section of articles within the Open Access Subset of PubMed. XML files containing the full text of articles are downloaded from NCBI and then processed to obtain all the paragraphs within the Materials and Methods section.

To improve the quality of our collected corpus, we develop a procedural paragraph extractor by fine-tuning SciBERT (Beltagy et al., 2019) on the SciSeg dataset (Dasigi et al., 2017), which includes discourse labels ({Goal, Fact, Result, Hypothesis, Method, Problem, Implication}) for PubMed articles. This extractor achieves an average F1 score of 72.65% in five-fold cross validation, and we run it on all acquired paragraphs. We consider a paragraph a valid procedure if at least 40% of its clauses are labeled as Method. This threshold is obtained by manual inspection of a randomly sampled subset of the data.
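The 40% filtering rule is simple enough to state as code. The sketch below assumes that a clause-level classifier has already produced one discourse label per clause; the function name and the handling of empty paragraphs are assumptions for illustration.

```python
from typing import Sequence


def is_procedural(clause_labels: Sequence[str], threshold: float = 0.4) -> bool:
    """Return True if at least `threshold` of the clauses are labeled 'Method'.

    `clause_labels` holds one predicted discourse label per clause, drawn from
    {Goal, Fact, Result, Hypothesis, Method, Problem, Implication}.
    """
    if not clause_labels:
        # Assumption: a paragraph with no clauses is never kept.
        return False
    method_share = sum(label == "Method" for label in clause_labels) / len(clause_labels)
    return method_share >= threshold
```

For instance, a five-clause paragraph with two Method clauses sits exactly at the threshold and is kept, while a three-clause paragraph with one Method clause is discarded.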

In total, the PubMed Open Access Subset contains 2,542,736 articles, of which about 680k contain a Materials and Methods section. After running our trained procedural paragraph extractor, we retain a set of 1,785,923 procedural paragraphs. Based on a manual inspection of the extracted paragraphs, we estimate that 92% consist of instructions for carrying out experimental procedures.

Chemical Patents.

The second source of our corpus is patent data, because chemical patents usually include detailed procedures of chemical synthesis. We download U.S. patent data (1976-2016) from USPTO and European data (1978-2020) from EPO as XML files. Then we apply the reaction extractor developed by Lowe (2012), a trained Naive Bayes classifier, to the Description section of our collected patents. Note that the U.S. patent data has two subsets, "Grant" (1976-2016) and "Application" (2001-2016). The "Application" subset covers the "Grant" subset from the same year, so for those overlapping years (2001-2016), we only use the U.S. patents from the "Application" subset. As a result, we get 2,435,999 paragraphs from 174,554 U.S. patents and 1,603,606 paragraphs from 129,035 European patents. Lastly, we use the language identification tool langid to build an English-only corpus, which includes 3,671,482 paragraphs (90.9%).
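The subset-selection rule for U.S. patents and the reported English share can be checked with a short sketch. The function name and the string labels for the subsets are assumptions; the paragraph counts are those stated above, and the language-identification step itself is not reproduced here.

```python
def keep_us_patent(subset: str, year: int) -> bool:
    """Decide whether a U.S. patent file is used, given its subset and year.

    For the overlapping years (2001-2016) only the "Application" subset is
    kept; for earlier years (1976-2000) only "Grant" data exists.
    """
    if 2001 <= year <= 2016:
        return subset == "Application"
    return subset == "Grant" and 1976 <= year <= 2000


# Paragraph counts reported in the text.
US_PARAGRAPHS = 2_435_999
EP_PARAGRAPHS = 1_603_606
ENGLISH_PARAGRAPHS = 3_671_482

# Share of extracted paragraphs retained by the English-only filter.
english_share = ENGLISH_PARAGRAPHS / (US_PARAGRAPHS + EP_PARAGRAPHS)
print(f"{english_share:.1%}")  # prints 90.9%
```

The computed share agrees with the 90.9% figure, which indicates that the percentage is taken over the combined U.S. and European paragraphs.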

Appendix C Pre-training Details

We pre-train ProcBERT using the TensorFlow codebase of BERT (Devlin et al., 2019). We use the Adam optimizer (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and weight decay of 0.01. Following Devlin et al. (2019), we deploy the two-step regime. In the first step, we pre-train the model with sequence length 128 and batch size 512 for 1 million steps. The learning rate is warmed up over the first 100k steps to a peak value of 1e-4, then linearly decayed. In the second step, we train for 100k more steps with sequence length 512 and batch size 256 to learn the positional embeddings, with a peak learning rate of 2e-5. We use the original sub-word masking strategy, and we mask 15% of tokens in the sequence for both training steps.
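The first-step learning rate schedule can be sketched as a function of the training step. This is one common form of linear warm-up followed by linear decay, written under the assumption that the rate decays from the peak to zero at the final step; the exact schedule implemented in the BERT codebase may differ in detail, and the function name is hypothetical.

```python
def learning_rate(step: int,
                  peak_lr: float = 1e-4,
                  warmup_steps: int = 100_000,
                  total_steps: int = 1_000_000) -> float:
    """Linear warm-up to `peak_lr`, then linear decay to zero (assumed form)."""
    if step < warmup_steps:
        # Warm-up phase: rate grows linearly from 0 to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: rate shrinks linearly and reaches 0 at `total_steps`.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

With the default arguments the rate is half of the peak at step 50k during warm-up and again at step 550k during decay.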

For Proc-RoBERTa, we use the codebase from AI2, which enables language model pre-training on TPUs with PyTorch. Similar to Gururangan et al. (2020), we train RoBERTa on our collected procedural text corpus for 12.5k steps with a learning rate of 3e-5 and an effective batch size of 2048, which is achieved by accumulating the gradients of 128 steps with a basic batch size of 16. The input sequence length is 512 throughout the whole process, and 15% of words are masked for prediction.
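The effective batch size follows from gradient accumulation: 128 micro-batches of 16 examples give 128 × 16 = 2048 examples per update. The toy example below, which uses a one-parameter linear model in plain Python rather than the actual training code, illustrates why averaging the micro-batch gradients reproduces the full-batch gradient. All names and the toy data are assumptions.

```python
import random
from typing import List


def grad_mse(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient w.r.t. w of the mean of 0.5 * (w * x - y)^2 over a batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: List[float], ys: List[float],
                     micro_batch: int = 16, accum_steps: int = 128) -> float:
    """Average the gradients of `accum_steps` micro-batches before one update."""
    assert len(xs) == micro_batch * accum_steps
    total = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Each micro-batch gradient is scaled by 1 / accum_steps.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(16 * 128)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys)
```

The two gradients agree up to floating-point rounding because all micro-batches have the same size, so the mean of micro-batch means equals the full-batch mean.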

Appendix D Hyper-parameters for Downstream Tasks

We use the same five random seeds as Peters et al. (2019) for all our experiments in §5 and §6.

We select the best hyperparameter values by grid search, based on the average development set performance over the five random seeds. For models with the BERT-base or RoBERTa-base architecture, the search range includes learning rate (1e-5, 2e-5), batch size (16, 48, 64, 128), max sequence length (128, 256, 512) and epoch number (5, 20, 60); the hyperparameter values used in the budget-constrained domain adaptation experiments (denoted as "Budget") (§5) and the ancillary tasks (§6) are shown in Table 9. For BERT-large and RoBERTa-large, the search range differs in learning rate (5e-6, 1e-5), batch size (4, 8, 12, 24, 64) and epoch number (3, 5, 10, 20), and the values used are shown in Table 10.
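The selection procedure can be sketched as an exhaustive grid search that averages a development-set score over seeds. The search space below is the first range listed above; `train_and_eval` is a hypothetical callback standing in for fine-tuning and evaluating one model, and the seed values passed in are placeholders, since the actual seeds are not given in this section.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

SEARCH_SPACE = {
    "learning_rate": [1e-5, 2e-5],
    "batch_size": [16, 48, 64, 128],
    "max_seq_length": [128, 256, 512],
    "num_epochs": [5, 20, 60],
}


def grid_search(train_and_eval: Callable[[Dict, int], float],
                seeds: Sequence[int],
                space: Dict[str, list] = SEARCH_SPACE) -> Tuple[Dict, float]:
    """Return the configuration with the best seed-averaged dev score."""
    keys = list(space)
    best_config, best_score = None, float("-inf")
    for values in product(*(space[key] for key in keys)):
        config = dict(zip(keys, values))
        # Average the development-set metric over all random seeds.
        score = mean(train_and_eval(config, seed) for seed in seeds)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

This grid contains 2 × 4 × 3 × 3 = 72 configurations, so with five seeds a full search amounts to 360 fine-tuning runs per task.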

Hyperparam. Budget X-WLP ChEMU Recipe WLP PubMedM&M ChemSyn

Learning Rate 1e-5 1e-5 1e-5 1e-5 1e-5 2e-5 1e-5 1e-5 1e-5 1e-5 1e-5 1e-5
Batch Size 16 48 16 16 16 16 16 128 16 64 16 48
Max Seq. Length 512 256 512 512 512 512 256 128 512 256 512 256
Max Epoch 20 5 5 20 5 20 20 5 60 5 20 5

Table 9: Hyperparameters for models with the BERT-base or RoBERTa-base architecture on budget-constrained domain adaptation experiments (denoted as "Budget") (§5) and ancillary tasks (§6).
Hyperparam. Budget X-WLP ChEMU Recipe WLP PubMedM&M ChemSyn

Learning Rate 1e-5 5e-6 5e-6 1e-5 5e-6 5e-6 1e-5 5e-6 1e-5 5e-6 1e-5 5e-6
Batch Size 4 12 4 4 4 4 4 64 4 24 4 12
Max Seq. Length 512 256 512 512 512 512 256 128 512 256 512 256
Max Epoch 10 3 3 10 3 5 10 3 20 5 10 3

Table 10: Hyperparameters for BERT-large and RoBERTa-large on budget-constrained domain adaptation experiments (denoted as "Budget") (§5) and ancillary tasks (§6).