1 Introduction
Unsupervised pretraining—e.g., BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019)—has recently pushed the state of the art on many natural language understanding tasks. One method of further improving pretrained models that has been shown to be broadly helpful is to first fine-tune a pretrained model on an intermediate task, before fine-tuning again on the target task of interest (Phang et al., 2018; Wang et al., 2019; Clark et al., 2019; Sap et al., 2019), also referred to as STILTs. However, this approach does not always improve target task performance, and it is unclear under what conditions it does.

This paper offers a large-scale empirical study aimed at addressing this open question. We perform a broad survey of intermediate and target task pairs, following an experimental pipeline similar to Phang et al. (2018) and Wang et al. (2019). Our study differs from previous work in that we use a larger and more diverse set of intermediate and target tasks, introduce additional analysis-oriented probing tasks, and use a better-performing base model, RoBERTa (Liu et al., 2019). We aim to answer the following specific questions:
- What kind of tasks tend to make good intermediate tasks across a wide variety of target tasks?
- Which linguistic skills does a model learn from intermediate-task training?
- Which skills learned from intermediate tasks help the model succeed on which target tasks?
The first question is the most straightforward: it can be answered by a sufficiently exhaustive search over possible intermediate–target task pairs. The second and third questions address the why rather than the when, and differ in a crucial detail: A model might learn skills by training on an intermediate task, but those skills might not help it to succeed on a target task.
Our search for intermediate tasks focuses on natural language understanding tasks in English. In particular, we run our experiments on 11 intermediate tasks and 10 target tasks, which results in a total of 110 intermediate–target task pairs. We use 25 probing tasks—tasks that each target a narrowly defined model behavior or linguistic phenomenon—to shed light on which skills are learned from each intermediate task.
Our findings include the following: (i) Natural language inference tasks, as well as QA tasks that involve commonsense reasoning, are generally useful as intermediate tasks. (ii) SocialIQA and QQP as intermediate tasks are not helpful for teaching the skills captured by our probing tasks, while fine-tuning first on MNLI or Cosmos QA results in improvements across all skills. (iii) While a model's performance on input-noising probing tasks correlates with target task performance, lower-level skills, such as preserving a sentence's raw content or detecting surface attributes of input sentences like the tense of the main verb or sentence length, are less correlated with target task performance. This suggests that a model's ability to do well on the masked language modeling (MLM) task is important for downstream performance. Furthermore, we conjecture that a portion of our analysis is affected by catastrophic forgetting of knowledge learned during pretraining.
2 Methods
2.1 Experimental Pipeline
Our experimental pipeline (Figure 1) consists of two steps, starting with a pretrained model: intermediate-task training, and fine-tuning on a target or probing task.
Intermediate Task Training
We fine-tune RoBERTa on each intermediate task. The training procedure follows the standard procedure of fine-tuning a pretrained model on a target task, as described in Devlin et al. (2019). We opt for single intermediate-task training as opposed to multi-task training (cf. Liu et al., 2019) to isolate the effect of skills learned from individual intermediate tasks.
Target and Probing Task Fine-Tuning
After intermediate-task training, we fine-tune our models on each target and probing task individually. Target tasks are tasks of interest to the general community, spanning various facets of natural language, domains, and sources. Probing tasks, while potentially similar in data source to target tasks such as with CoLA, are designed to isolate the presence of particular linguistic capabilities or skills. For instance, solving the target task BoolQ (Clark et al., 2019) may require various skills including coreference and commonsense reasoning, while probing tasks like the SentEval probing suite (Conneau et al., 2018) target specific syntactic and metadata-level phenomena such as subject-verb agreement and sentence length detection.
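As a rough illustration of this two-phase pipeline, the sketch below fine-tunes a RoBERTa classifier on an intermediate task and then fine-tunes separate copies of the result on several target tasks. It uses Hugging Face Transformers directly rather than the jiant toolkit used in our experiments (Section 3); the `make_dataloader` helper, the task list, and the hyperparameters are placeholders, not our actual configuration.

```python
# Illustrative sketch only: the actual experiments use the jiant toolkit.
# `make_dataloader` is a hypothetical helper that tokenizes a task's training
# set and returns batches of input tensors plus labels.
import copy
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def fine_tune(model, dataloader, lr=1e-5, epochs=3, device="cuda"):
    """Standard supervised fine-tuning loop."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss          # cross-entropy over the task's labels
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=3)

# Phase 1: intermediate-task training (e.g., MNLI).
intermediate_model = fine_tune(model, make_dataloader("mnli", tokenizer))

# Phase 2: fine-tune a separate copy of the intermediate-trained model on each
# target and probing task (in practice the task-specific head is reinitialized
# per task to match that task's label set).
for task in ["rte", "boolq", "wic"]:
    target_model = copy.deepcopy(intermediate_model)
    fine_tune(target_model, make_dataloader(task, tokenizer))
```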
Table 1: Overview of the intermediate and target tasks used in our experiments.

| Name | Train | Dev | Task | Metrics | Genre/Source |
|---|---|---|---|---|---|
| Intermediate Tasks | | | | | |
| CommonsenseQA | 9,741 | 1,221 | question answering | acc. | ConceptNet |
| SciTail | 23,596 | 1,304 | natural language inference | acc. | science exams |
| Cosmos QA | 25,588 | 3,000 | question answering | acc. | blogs |
| SocialIQA | 33,410 | 1,954 | question answering | acc. | crowdsourcing |
| CCG | 38,015 | 5,484 | tagging | acc. | Wall Street Journal |
| HellaSwag | 39,905 | 10,042 | sentence completion | acc. | video captions & Wikihow |
| QA-SRL | 44,837 | 7,895 | question answering | F1/EM | Wikipedia |
| SST-2 | 67,349 | 872 | sentiment classification | acc. | movie reviews |
| QAMR | 73,561 | 27,535 | question answering | F1/EM | Wikipedia |
| QQP | 363,846 | 40,430 | paraphrase detection | acc./F1 | Quora questions |
| MNLI | 392,702 | 20,000 | natural language inference | acc. | fiction, letters, telephone speech |
| Target Tasks | | | | | |
| CB | 250 | 57 | natural language inference | acc./F1 | Wall Street Journal, fiction, dialogue |
| COPA | 400 | 100 | question answering | acc. | blogs, photography encyclopedia |
| WSC | 554 | 104 | coreference resolution | acc. | hand-crafted |
| RTE | 2,490 | 278 | natural language inference | acc. | news, Wikipedia |
| MultiRC | 5,100 | 953 | question answering | F1/EM | crowd-sourced |
| WiC | 5,428 | 638 | word sense disambiguation | acc. | WordNet, VerbNet, Wiktionary |
| BoolQ | 9,427 | 3,270 | question answering | acc. | Google queries, Wikipedia |
| CommonsenseQA | 9,741 | 1,221 | question answering | acc. | ConceptNet |
| Cosmos QA | 25,588 | 3,000 | question answering | acc. | blogs |
| ReCoRD | 100,730 | 10,000 | question answering | F1/EM | news (CNN, Daily Mail) |
2.2 Tasks
Table 1 presents an overview of the intermediate and target tasks.
2.2.1 Intermediate Tasks
We curate a diverse set of tasks that either represent an especially large annotation effort or that have been shown to yield positive transfer in prior work. The resulting set of tasks covers question answering, commonsense reasoning, and natural language inference.
QAMR
The Question–Answer Meaning Representations dataset (Michael et al., 2018) is a crowdsourced QA task consisting of question–answer pairs that correspond to predicate–argument relationships. It is derived from Wikinews and Wikipedia sentences. For example, if the sentence is Ada Lovelace was a computer scientist., a potential question is What is Ada’s last name?, with the answer being Lovelace.
CommonsenseQA
CommonsenseQA (Talmor et al., 2019) is a multiple-choice QA task in which questions are written by crowdworkers around concepts drawn from ConceptNet (Speer et al., 2017), so that answering them requires commonsense knowledge rather than reading comprehension over a passage.
SciTail
SciTail (Khot et al., 2018) is a textual entailment task built from multiple-choice science questions from 4th grade and 8th grade exams, as well as crowdsourced questions (Welbl et al., 2017). The task is to determine whether a hypothesis, which is constructed from a science question and its corresponding answer, is entailed or not (neutral) by the premise.
Cosmos QA
Cosmos QA (Huang et al., 2019) is a commonsense-based reading comprehension task formulated as multiple-choice QA. The questions concern the causes and effects of events, and answering them requires reasoning that goes beyond the exact text spans in the context to broad abstractive commonsense reasoning. It differs from CommonsenseQA in that it focuses on causal and deductive commonsense reasoning and requires reading comprehension over an auxiliary passage, rather than answering a freestanding question.
SocialIQA
SocialIQA (Sap et al., 2019) is a multiple-choice QA task that tests for reasoning about emotional and social intelligence in everyday situations.
CCG
CCGbank (Hockenmaier and Steedman, 2007) is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations. We use the CCG supertagging task: assigning tags to individual word tokens that jointly determine the parse of the sentence.
HellaSwag
HellaSwag (Zellers et al., 2019) is a commonsense sentence-completion task in which a model must choose the most plausible continuation of a context drawn from video captions or WikiHow articles. It is a more difficult successor to SWAG (Zellers et al., 2018), constructed with adversarial filtering.
QA-SRL
The question-answer driven semantic role labeling dataset (QA-SRL; He et al., 2015) is a QA task derived from a semantic role labeling task. Each example, which consists of a set of questions and answers, corresponds to a predicate-argument relationship in the sentence it is derived from. Unlike QAMR, which covers all words in the sentence, QA-SRL focuses specifically on verbs.
SST-2
The Stanford sentiment treebank (Socher et al., 2013) is a sentiment classification task based on movie reviews. We use the binary sentence classification version of the task.
QQP
The Quora Question Pairs dataset (http://data.quora.com/First-Quora-DatasetRelease-Question-Pairs) is constructed based on questions posted on the community question-answering website Quora. The task is to determine if two questions are semantically equivalent.
MNLI
The Multi-Genre Natural Language Inference dataset (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations across a variety of genres.
2.2.2 Target Tasks
We use ten target tasks, eight of which are drawn from the SuperGLUE benchmark (Wang et al., 2019a). The tasks in the SuperGLUE benchmark cover question answering, entailment, word sense disambiguation, and coreference resolution and have been shown to be easy for humans but difficult for models like BERT. Although we offer a brief description of the tasks below, we refer readers to the SuperGLUE paper for a more detailed description of the tasks.
CommitmentBank (CB; de Marneffe et al., 2019) is a three-class entailment task that consists of texts and an embedded clause that appears in each text, in which models must determine whether that embedded clause is entailed by the text. Choice of Plausible Alternatives (COPA; Roemmele et al., 2011) is a classification task that consists of premises and a question that asks for the cause or effect of each premise, in which models must correctly pick between two possible choices. Winograd Schema Challenge (WSC; Levesque et al., 2012)
is a sentence-level commonsense reasoning task that consists of texts, a pronoun from each text, and a list of possible noun phrases from each text. The dataset has been designed such that world knowledge is required to determine which of the possible noun phrases is the correct referent to the pronoun. We use the SuperGLUE binary classification cast of the task, where each example consists of a text, a pronoun, and a noun phrase from the text, which models must classify as being coreferent to the pronoun or not.
Recognizing Textual Entailment (RTE; Dagan et al., 2005, et seq.) is a textual entailment task. Multi-Sentence Reading Comprehension (MultiRC; Khashabi et al., 2018) is a multi-hop QA task that consists of paragraphs, a question on each paragraph, and a list of possible answers, in which models must distinguish which of the possible answers are true and which are false. Word-in-Context (WiC; Pilehvar and Camacho-Collados, 2019) is a binary classification word sense disambiguation task. Examples consist of two text snippets with a polysemous word that appears in both, and models must determine whether the same sense of the word is used in both contexts. BoolQ (Clark et al., 2019) is a QA task that consists of passages and a yes/no question associated with each passage. Reading Comprehension with Commonsense Reasoning (ReCoRD; Zhang et al., 2018) is a multiple-choice QA task built on news articles. For each article, models are given a question with one entity masked out and a list of possible entities from the article, and the goal is to correctly identify the masked entity from the list. Additionally, we use CommonsenseQA and Cosmos QA from our set of intermediate tasks as target tasks, due to their combination of small dataset size and high difficulty for strong models like BERT.
2.2.3 Probing Tasks
We use well-established datasets for our probing tasks, including the edge-probing suite from Tenney et al. (2019), function word oriented tasks from Kim et al. (2019), and sentence-level probing datasets (SentEval; Conneau et al., 2018).
Acceptability Judgment Tasks
This set of binary classification tasks is designed to investigate whether a model can judge the grammatical acceptability of a sentence. We use the following five datasets: AJ-CoLA tests a model's understanding of general grammaticality using the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2019), which is drawn from 22 theoretical linguistics publications. The other tasks concern the behaviors of specific classes of function words, using the datasets of Kim et al. (2019): AJ-WH tests a model's ability to detect whether a wh-word in a sentence has been swapped with another wh-word, which probes the model's ability to identify the antecedent associated with the wh-word. AJ-Def tests a model's ability to detect whether the definite/indefinite articles in a given sentence have been swapped. AJ-Coord tests a model's ability to detect whether a coordinating conjunction has been swapped, which probes the model's understanding of how the ideas in the various clauses relate to each other. AJ-EOS tests a model's ability to identify grammatical sentences without indicators such as punctuation marks and capitalization; it consists of grammatical text with punctuation and capitalization removed.
Edge-Probing Tasks
The edge probing (EP) tasks are a set of core NLP labeling tasks, collected by Tenney et al. (2019) and cast into Boolean classification. These tasks focus on the syntactic and semantic relations between spans in a sentence. The first five tasks use the OntoNotes corpus (Hovy et al., 2006): Part-of-speech tagging (EP-POS) tests a model's ability to predict the syntactic category (noun, verb, adjective, etc.) of each word in the sentence. Named entity recognition (EP-NER) tests a model's ability to predict the category of an entity in a given span. Semantic role labeling (EP-SRL) tests a model's ability to assign a label to a given span of words that indicates its semantic role (agent, goal, etc.) in the sentence. Coreference (EP-Coref) tests a model's ability to classify whether two spans of tokens refer to the same entity or event.
The other datasets can be broken down into syntactic and semantic probing tasks. Constituent labeling (EP-Const) tests a model's ability to assign a non-terminal label (e.g., noun phrase or verb phrase) to a span of tokens. Dependency labeling (EP-UD) tests a model on the functional relationship of one token relative to another; we use the English Web Treebank portion of the Universal Dependencies 2.2 release (Silveira et al., 2014) for this task. Semantic proto-role labeling (SPR) tests a model's ability to predict the fine-grained, non-exclusive semantic attributes of a given span; edge probing uses two datasets for SPR: SPR1 (EP-SPR1; Teichert et al., 2017), derived from the Penn Treebank, and SPR2 (EP-SPR2; Rudinger et al., 2018), derived from the English Web Treebank. Relation classification (EP-Rel) tests a model's ability to predict the relation between two entities; we use the SemEval 2010 Task 8 dataset (Hendrickx et al., 2009) for this task. For example, the relation between Yeri and Korea in Yeri is from Korea is ENTITY-ORIGIN. The Definite Pronoun Resolution dataset (EP-DPR; Rahman and Ng, 2012) tests a model's ability to handle coreference, and differs from the OntoNotes-based EP-Coref in that it focuses on difficult cases of definite pronouns.
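As a simplified sketch of how such span probes can be built on top of the encoder's token representations, the module below mean-pools each span and scores the concatenation with a small classifier; the actual edge-probing models of Tenney et al. (2019) use attention pooling and projection layers, so this is illustrative only.

```python
import torch
import torch.nn as nn

class SimpleSpanProbe(nn.Module):
    """Probe that labels a pair of spans from contextual token representations.
    Single-span tasks (e.g., EP-POS, EP-NER) can pass the same span twice."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_size, 256), nn.ReLU(), nn.Linear(256, num_labels))

    @staticmethod
    def pool(hidden_states, span):
        start, end = span                              # end index is exclusive
        return hidden_states[:, start:end, :].mean(dim=1)

    def forward(self, hidden_states, span1, span2):
        # hidden_states: [batch, seq_len, hidden] from the RoBERTa encoder;
        # for simplicity, the same span indices are assumed across the batch.
        pooled = torch.cat([self.pool(hidden_states, span1),
                            self.pool(hidden_states, span2)], dim=-1)
        return self.scorer(pooled)                     # one logit per candidate label
```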
SentEval Tasks
The SentEval probing tasks (SE; Conneau et al., 2018) are cast in the form of single-sentence classification. Sentence Length (SE-SentLen) is a task that tests a model's ability to classify the length of a sentence. Word Content (SE-WC) is a task that tests a model's ability to identify which of a set of 1,000 potential words appear in a given sentence. Tree Depth (SE-TreeDepth) is a task that tests a model's ability to estimate the maximum depth of the constituency parse tree of the sentence. Top Constituents (SE-TopConst) is a task that tests a model's ability to identify the high-level syntactic structure of the sentence by choosing among 20 constituent sequences (the 19 most common, plus an other category). Bigram Shift (SE-BShift) is a task that tests a model's ability to classify if two consecutive tokens in the same sentence have been reordered. Coordination Inversion (SE-CoordInv) is a task that tests a model's ability to identify if two coordinating clausal conjoints are swapped (e.g., "he knew it, and he deserved no answer."). Past-Present (SE-Tense) is a task that tests a model's ability to classify the tense of the main verb of the sentence. Subject Number (SE-SubjNum) and Object Number (SE-ObjNum) are tasks that test a model's ability to classify whether the subject or direct object of the main clause is singular or plural. Odd-Man-Out (SE-SOMO) is a task that tests the model's ability to predict whether a sentence has had one of its content words randomly replaced with another word of the same part of speech.
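To make the input-noising probing tasks concrete, the snippet below constructs a Bigram Shift-style example by swapping two adjacent tokens; this follows the general recipe of Conneau et al. (2018) but is not their exact data-generation code.

```python
import random

def make_bigram_shift_example(sentence: str, rng: random.Random):
    """With probability 0.5, swap one pair of adjacent tokens; the label
    records whether the sentence was perturbed (the SE-BShift setup)."""
    tokens = sentence.split()
    perturb = len(tokens) > 2 and rng.random() < 0.5
    if perturb:
        i = rng.randrange(len(tokens) - 1)     # position of the bigram to flip
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens), "inverted" if perturb else "original"

rng = random.Random(0)
print(make_bigram_shift_example("he knew it and he deserved no answer", rng))
```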
3 Experiments
Training and Optimization
We use the large-scale pretrained model RoBERTa in all experiments. For each intermediate, target, and probing task, we perform a hyperparameter sweep, varying the peak learning rate and the dropout rate. After choosing the best learning rate and dropout rate, we apply the best configuration for each task for all runs. For each task, we use the batch size that maximizes GPU usage, and use a maximum sequence length of 256. Aside from these details, we follow the RoBERTa paper for all other training hyperparameters. We use NVIDIA P40 GPUs for our experiments.

A complete pipeline with one intermediate task works as follows: First, we fine-tune RoBERTa on the intermediate task. We then fine-tune copies of the resulting model separately on each of the 10 target tasks and 25 probing tasks and test on their respective validation sets. We run the same pipeline three times for the 11 intermediate tasks, plus a set of baseline runs without intermediate training. This gives us 35 × 12 × 3 = 1,260 observations.
We train our models using the Adam optimizer (Kingma and Ba, 2015) with linear learning-rate decay and early stopping. We run training for a maximum of 10 epochs when more than 1,500 training examples are available, and 40 epochs otherwise, to ensure that models are sufficiently trained on small datasets.
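A minimal sketch of this optimization setup is shown below, assuming Transformers' linear scheduler and a simple patience-based early-stopping rule; the warmup setting and patience value are illustrative assumptions rather than our exact configuration.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimization(model, num_train_examples: int, batch_size: int, lr: float):
    # Epoch budget depends on dataset size, as described above.
    max_epochs = 10 if num_train_examples > 1500 else 40
    steps_per_epoch = max(1, num_train_examples // batch_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(      # linear decay; warmup steps assumed to be 0
        optimizer, num_warmup_steps=0,
        num_training_steps=max_epochs * steps_per_epoch)
    return optimizer, scheduler, max_epochs

class EarlyStopping:
    """Stop when validation performance has not improved for `patience` evaluations."""
    def __init__(self, patience: int = 5):            # patience value is an assumption
        self.patience, self.best, self.bad = patience, float("-inf"), 0

    def should_stop(self, metric: float) -> bool:
        if metric > self.best:
            self.best, self.bad = metric, 0
        else:
            self.bad += 1
        return self.bad >= self.patience
```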
We use the jiant NLP toolkit (Wang et al., 2019b), based on PyTorch (Paszke et al., 2019), Hugging Face Transformers (Wolf et al., 2019), and AllenNLP (Gardner et al., 2017), for all of our experiments.

4 Results and Analysis
4.1 Investigating Transfer Performance
Figure 2 shows the differences in target and probing task performances (deltas) between the baselines and models trained with intermediate-task training, each averaged across three restarts. A positive delta indicates successful transfer.
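Concretely, the deltas can be computed from a table of run-level results as in the sketch below, which assumes a pandas DataFrame with one row per (intermediate task, evaluation task, restart); the column names are illustrative.

```python
import pandas as pd

def transfer_deltas(results: pd.DataFrame) -> pd.DataFrame:
    """results has columns: intermediate_task, eval_task, seed, score.
    Baseline runs (no intermediate training) use intermediate_task == "none"."""
    mean_scores = (results
                   .groupby(["intermediate_task", "eval_task"])["score"]
                   .mean()                            # average over the three restarts
                   .unstack("eval_task"))             # rows: intermediate tasks, cols: eval tasks
    baseline = mean_scores.loc["none"]
    return mean_scores.drop(index="none") - baseline  # positive delta = successful transfer
```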
Target Task Performance
We define good intermediate tasks as ones that lead to positive transfer in target task performance. We observe that tasks that require complex reasoning and inference tend to make good intermediate tasks. These include MNLI and commonsense-oriented tasks such as CommonsenseQA, HellaSwag, and Cosmos QA (with our poor performance on the similar SocialIQA serving as a surprising exception). SocialIQA, CCG, and QQP as intermediate tasks lead to negative transfer on all target tasks and the majority of probing tasks.
We investigate the role of intermediate-task dataset size in downstream target task performance by additionally running experiments with varying amounts of training data for five intermediate tasks, shown in the Appendix. We do not find differences in intermediate-task dataset size to have any substantial, consistent impact on downstream target task performance.
In addition, we find that smaller target tasks such as RTE, BoolQ, MultiRC, WiC, and WSC benefit the most from intermediate-task training. (The deltas for experiments with the same intermediate and target task are not 0, as one might expect, because we still perform both training phases in these cases, with optimizer states and stopping criteria reset between the intermediate and target phases.) There are no instances of positive transfer to CommitmentBank, since our baseline model already performs at or near ceiling accuracy.
Probing Task Performance
Looking at the probing task performance, we find that intermediate-task training affects performance on low-level syntactic probing tasks uniformly across intermediate tasks; we observe little to no improvement for the SentEval probing tasks and higher improvement for acceptability judgment probing tasks, except for AJ-CoLA. This is also consistent with Phang et al. (2018), who find negative transfer with CoLA in their experiments.
Variation across Intermediate Tasks
Performance varies across higher-level syntactic and semantic tasks such as the edge-probing and SentEval tasks. SocialIQA and QQP show negative transfer for most of the edge-probing tasks, while Cosmos QA and QA-SRL see drops in performance only for EP-Rel. While intermediate-task trained models improve performance on EP-SRL and EP-DPR across the board, there is little to no gain on the SentEval probing tasks from any intermediate task. Additionally, the intermediate tasks that improve performance on the largest number of probing tasks also tend to perform well as intermediate tasks.
Degenerate Runs
We find that the model may not exceed chance performance in some training runs. This mostly affects the baseline (no intermediate training) runs on the acceptability judgment probing tasks, excluding AJ-CoLA, which all have very small training sets. We include these degenerate runs in our analysis to reflect this phenomenon. Consistent with Phang et al. (2018), we find that intermediate-task training reduces the likelihood of degenerate runs, leading to ostensibly positive transfer results on those four acceptability judgment tasks across most intermediate tasks. On the other hand, extremely negative transfer from intermediate-task training can also result in a higher frequency of degenerate runs in downstream tasks, as we observe in the cases of using QQP and SocialIQA as intermediate tasks. We also observe a number of degenerate runs on the EP-SRL task as well as the EP-Rel task. These degenerate runs decrease positive transfer in probing tasks, such as with SocialIQA and QQP probing performance, and also decrease the average amount of positive transfer we see in target task performance.
4.2 Correlation Between Probing and Target Task Performance

Next, we investigate the relationship between target and probing tasks in an attempt to understand why certain intermediate-task models perform better on certain target tasks.
We use probing task performance as an indicator of the acquisition of particular language skills. We compute the Spearman correlation between probing-task and target-task performance across training on different intermediate tasks and multiple restarts, as shown in Figure 3. We test for statistical significance and apply a Holm-Bonferroni correction for multiple testing, omitting correlations that are not statistically significant. We opt for Spearman rather than Pearson correlation because of the wide variety of metrics used for the different tasks. (Full correlation tables across all target and probing tasks, with both Spearman and Pearson correlations, can be found in the Appendix.)
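The correlation analysis can be reproduced roughly as follows, using SciPy for the Spearman correlations and statsmodels for the Holm-Bonferroni correction; the data layout (one score per intermediate task and restart) and the 0.05 significance threshold are assumptions for illustration.

```python
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def significant_probe_target_correlations(probe_scores, target_scores, alpha=0.05):
    """probe_scores[p] and target_scores[t] are aligned score vectors, one entry
    per (intermediate task, restart) run. Returns Spearman rhos that survive Holm."""
    pairs, rhos, pvals = [], [], []
    for probe, p_vec in probe_scores.items():
        for target, t_vec in target_scores.items():
            rho, pval = spearmanr(p_vec, t_vec)
            pairs.append((probe, target))
            rhos.append(rho)
            pvals.append(pval)
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    # Keep only correlations that survive the multiple-testing correction.
    return {pair: rho for pair, rho, keep in zip(pairs, rhos, reject) if keep}
```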
We find that acceptability judgment probing task performance is generally uncorrelated with the target task performance, except for AJ-CoLA. Similarly, many of the SentEval tasks do not correlate with the target tasks, except for Bigram Shift (SE-BShift), Odd-Man-Out (SE-SOMO) and Coordination Inversion (SE-CoordInv). These three tasks are input noising tasks—tasks where a model has to predict if a given input sentence has been randomly modified—which are, by far, the most similar tasks we study to the masked language modeling task that is used for training RoBERTa. This may explain the strong correlation with the performance of the target tasks.
We also find that some of these strong correlations, such as with SE-SOMO and SE-CoordInv, are almost entirely driven by variation in the degree of negative transfer, rather than any positive transfer. Intuitively, fine-tuning RoBERTa on an intermediate task can cause the model to forget some of its ability to perform the MLM task. Thus, a future direction for potential improvement for intermediate-task training may be integrating the MLM objective into intermediate-task training or bounding network parameter changes to reduce catastrophic forgetting (Kirkpatrick et al., 2016; Chen et al., 2019).
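One possible instantiation of this idea, sketched below under our own assumptions rather than as an evaluated method, is to share the RoBERTa encoder between the intermediate-task classifier and an MLM head and to add a scaled MLM loss over dynamically masked copies of the same inputs at each step; the mixing weight is an arbitrary illustrative value.

```python
import torch.nn as nn
from transformers import RobertaForMaskedLM, RobertaForSequenceClassification

class ClassifierWithMLMAuxiliary(nn.Module):
    """Intermediate-task classifier with an auxiliary MLM loss, one possible way
    to reduce forgetting of the pretraining objective (not evaluated in this paper)."""
    def __init__(self, name="roberta-large", num_labels=3, mlm_weight=0.1):
        super().__init__()
        self.clf = RobertaForSequenceClassification.from_pretrained(name, num_labels=num_labels)
        self.mlm = RobertaForMaskedLM.from_pretrained(name)
        self.mlm.roberta = self.clf.roberta        # share the encoder between both heads
        self.mlm_weight = mlm_weight               # mixing weight is an assumption

    def forward(self, input_ids, attention_mask, labels, mlm_input_ids, mlm_labels):
        # mlm_input_ids/mlm_labels come from standard dynamic masking of the same
        # batch (masked positions hold the original ids; all other positions are -100).
        task_loss = self.clf(input_ids=input_ids, attention_mask=attention_mask,
                             labels=labels).loss
        mlm_loss = self.mlm(input_ids=mlm_input_ids, attention_mask=attention_mask,
                            labels=mlm_labels).loss
        return task_loss + self.mlm_weight * mlm_loss
```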
Interestingly, while intermediate tasks such as SocialIQA, CCG and QQP, which show negative transfer on target tasks, tend to have negative transfer on these three probing tasks, the intermediate tasks with positive transfer, such as CommonsenseQA tasks and MNLI, do not appear to adversely affect the performance on these probing tasks. This asymmetric impact may indicate that, beyond the similarity of intermediate and target tasks, avoiding catastrophic forgetting of pretraining is critical to successful intermediate-task transfer.
The remaining SentEval probing tasks have similar delta values (Figure 2), which may indicate that there is insufficient variation in transfer performance to derive significant correlations. Among the edge-probing tasks, the more semantic tasks, such as coreference (EP-Coref and EP-DPR), semantic proto-role labeling (EP-SPR1 and EP-SPR2), and relation classification (EP-Rel), show the highest correlations with our target tasks. As our set of target tasks is also oriented towards semantics and reasoning, this is to be expected.
On the other hand, among the target tasks, we find that ReCoRD, CommonsenseQA and Cosmos QA—all commonsense-oriented tasks—exhibit both high correlations with each other as well as a similar set of correlations with the probing tasks. Similarly, BoolQ, MultiRC, and RTE correlate strongly with each other and have similar patterns of probing-task performance.
5 Related Work
Within the paradigm of training large pretrained Transformer language representations via intermediate-stage training before fine-tuning on a target task, positive transfer has been shown in both sequential task-to-task (Phang et al., 2018) and multi-task-to-task (Liu et al., 2019; Raffel et al., 2019) formats. Wang et al. (2019) perform an extensive study on transfer with BERT, finding language modeling and NLI tasks to be among the most beneficial tasks for improving target-task performance. Talmor and Berant (2019) perform a similar cross-task transfer study on reading comprehension datasets, finding similar positive transfer in most cases, with the biggest gains stemming from a combination of multiple QA datasets. Our work covers a larger and more diverse set of intermediate-target task pairs, and we also use probing tasks to shed light on the skills learned from the intermediate tasks.
Among the prior work on predicting transfer performance, Bingel and Søgaard (2017) is the most similar to ours. They do a regression analysis that predicts target-task performance on the basis of various features of the source and target tasks and task pairs. They focus on a multi-task training setting without self-supervised pretraining, as opposed to our single-intermediate-task, three-step procedure.
Similar work (Lin et al., 2019) has been done on cross-lingual transfer—the analogous challenge of transferring learned knowledge from a high-resource to a low-resource language.
Many recent works have attempted to understand the knowledge and linguistic skills BERT learns, for instance by analyzing the language model surprisal for subject–verb agreements (Goldberg, 2018), identifying specific knowledge or phenomena encapsulated in the representations learned by BERT using probing tasks (Tenney et al., 2019, 2019; Warstadt et al., 2019; Lin et al., 2019; Hewitt and Manning, 2019; Jawahar et al., 2019), analyzing the attention heads of BERT (Clark et al., 2019; Coenen et al., 2019; Lin et al., 2019; Htut et al., 2019), and testing the linguistic generalizations of BERT across runs (McCoy et al., 2019). However, relatively little work has been done to analyze fine-tuned BERT-style models (Wang et al., 2019; Warstadt et al., 2019).
6 Conclusion and Future Work
This paper presents a large-scale study on when and why intermediate-task training works with pretrained models. We perform experiments on RoBERTa with a total of 110 pairs of intermediate and target tasks, and perform an analysis using 25 probing tasks, covering different semantic and syntactic phenomena. Most directly, we observe that tasks like Cosmos QA and HellaSwag, which require complex reasoning and inference, tend to work best as intermediate tasks.
Turning to our probing analysis, we find that the intermediate tasks that help RoBERTa improve across the board on probing tasks also show the most positive transfer to downstream tasks. However, it is difficult to draw definite conclusions about the specific skills that drive positive transfer. Intermediate-task training may help improve the handling of syntax, but there is little to no correlation between target-task and probing-task performance for these skills. Probes for higher-level semantic abilities tend to have a higher correlation with target-task performance, but these results are too diffuse to yield more specific conclusions. Future work in this area would benefit greatly from improvements to both the breadth and depth of available probing tasks.
We also observe a worryingly high correlation between target-task performance and the two probing tasks which most closely resemble RoBERTa’s masked language modeling pretraining objective. Thus, the results of our intermediate-task training analysis may be driven in part by forgetting of knowledge acquired during pretraining. Our results therefore suggest a need for further work on efficient transfer learning mechanisms.
Acknowledgments
This project has benefited from support to SB by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program) and by Samsung Research (under the project Improving Deep Learning using Latent Structure).
References
- Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 164–169.
- Catastrophic forgetting meets negative transfer: batch spectral shrinkage for safe transfer learning. In Advances in Neural Information Processing Systems 32, pp. 1906–1916.
- BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2924–2936.
- What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, pp. 276–286.
- Visualizing and measuring the geometry of BERT. Unpublished manuscript available on arXiv.
- What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2126–2136.
- The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177–190.
- The CommitmentBank: investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, Vol. 23, pp. 107–124.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- AllenNLP: a deep semantic natural language processing platform. Unpublished manuscript available on arXiv.
- Assessing BERT's syntactic abilities. Unpublished manuscript available on arXiv.
- Question-answer driven semantic role labeling: using natural language to annotate natural language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 643–653.
- SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), Boulder, Colorado, pp. 94–99.
- A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4129–4138.
- CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics 33 (3), pp. 355–396.
- OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, New York City, USA, pp. 57–60.
- Do attention heads in BERT track syntactic dependencies? Unpublished manuscript available on arXiv.
- Cosmos QA: machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2391–2401.
- What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3651–3657.
- Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 252–262.
- SciTail: a textual entailment dataset from science question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), Minneapolis, Minnesota, pp. 235–249.
- Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, Conference Track Proceedings.
- Overcoming catastrophic forgetting in neural networks. In Proceedings of the National Academy of Sciences (PNAS).
- The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR'12, pp. 552–561.
- Open sesame: getting inside BERT's linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, pp. 241–253.
- Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3125–3135.
- Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4487–4496.
- RoBERTa: a robustly optimized BERT pretraining approach. Unpublished manuscript available on arXiv.
- BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. Unpublished manuscript available on arXiv.
- Crowdsourcing question-answer meaning representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 560–568.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
- Sentence encoders on STILTs: supplementary training on intermediate labeled-data tasks. Unpublished manuscript available on arXiv.
- WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1267–1273.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Unpublished manuscript available on arXiv.
- Resolving complex cases of definite pronouns: the Winograd Schema Challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 777–789.
- Choice of plausible alternatives: an evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
- Neural-Davidsonian semantic proto-role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 944–955.
- Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4453–4463.
- A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, pp. 2897–2904.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642.
- ConceptNet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
- MultiQA: an empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4911–4921.
- CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158.
- Semantic proto-role labeling. In Thirty-First AAAI Conference on Artificial Intelligence.
- BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4593–4601.
- What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.
- Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4465–4476.
- SuperGLUE: a multi-task benchmark and analysis platform for natural language understanding. In Advances in Neural Information Processing Systems 32, pp. 3261–3275.
- jiant 1.2: a software toolkit for research on general-purpose text understanding models.
- Investigating BERT's knowledge of language: five analysis methods with NPIs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2870–2880.
- Neural network acceptability judgments. Transactions of the Association for Computational Linguistics (TACL) 7, pp. 625–641.
- Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, Copenhagen, Denmark, pp. 94–106.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122.
- HuggingFace's Transformers: state-of-the-art natural language processing. Unpublished manuscript available on arXiv.
- SWAG: a large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 93–104.
- HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4791–4800.
- ReCoRD: bridging the gap between human and machine commonsense reading comprehension. Unpublished manuscript available on arXiv.
Appendix A Correlation Between Probing and Target Task Performance
Appendix B Effect of Intermediate Task Size on Target Task Performance
Figure 6 shows the effect of intermediate-task dataset size on downstream target task performance for five intermediate tasks, chosen to span a range of original dataset sizes and transfer effectiveness.


