Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?

May 1, 2020 · Yada Pruksachatkun et al. · New York University

While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.


1 Introduction

Unsupervised pretraining—e.g., BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019)—has recently pushed the state of the art on many natural language understanding tasks. One method of further improving pretrained models that has been shown to be broadly helpful is to first fine-tune a pretrained model on an intermediate task, before fine-tuning again on the target task of interest (Phang et al., 2018; Wang et al., 2019; Clark et al., 2019; Sap et al., 2019), also referred to as STILTs. However, this approach does not always improve target task performance, and it is unclear under what conditions it does.

Figure 1: Our experimental pipeline with intermediate-task transfer learning and subsequent fine-tuning on target and probing tasks.

This paper offers a large-scale empirical study aimed at addressing this open question. We perform a broad survey of intermediate and target task pairs, following an experimental pipeline similar to Phang et al. (2018) and Wang et al. (2019). This differs from previous work in that we use a larger and more diverse set of intermediate and target tasks, introduce additional analysis-oriented probing tasks, and use a better-performing base model, RoBERTa (Liu et al., 2019). We aim to answer the following specific questions:

  • What kind of tasks tend to make good intermediate tasks across a wide variety of target tasks?

  • Which linguistic skills does a model learn from intermediate-task training?

  • Which skills learned from intermediate tasks help the model succeed on which target tasks?

The first question is the most straightforward: it can be answered by a sufficiently exhaustive search over possible intermediate–target task pairs. The second and third questions address the why rather than the when, and differ in a crucial detail: A model might learn skills by training on an intermediate task, but those skills might not help it to succeed on a target task.

Our search for intermediate tasks focuses on natural language understanding tasks in English. In particular, we run our experiments on 11 intermediate tasks and 10 target tasks, which results in a total of 110 intermediate–target task pairs. We use 25 probing tasks—tasks that each target a narrowly defined model behavior or linguistic phenomenon—to shed light on which skills are learned from each intermediate task.

Our findings include the following: (i) Natural language inference tasks, as well as QA tasks that involve commonsense reasoning, are generally useful as intermediate tasks. (ii) SocialIQA and QQP as intermediate tasks do not help teach the skills captured by our probing tasks, while fine-tuning first on MNLI or Cosmos QA results in an increase across all skills. (iii) While performance on input-noising probing tasks correlates with target task performance, lower-level skills, such as preserving a sentence's raw content or detecting surface attributes of input sentences like the tense of the main verb or sentence length, are less correlated with target task performance. This suggests that a model's ability to do well on the masked language modeling (MLM) task is important for downstream performance. Furthermore, we conjecture that part of our analysis is affected by catastrophic forgetting of knowledge learned during pretraining.

2 Methods

2.1 Experimental Pipeline

Our experimental pipeline (Figure 1) consists of two steps, starting with a pretrained model: intermediate-task training, and fine-tuning on a target or probing task.

Intermediate Task Training

We fine-tune RoBERTa on each intermediate task. The training procedure follows the standard procedure of fine-tuning a pretrained model on a target task, as described in Devlin et al. (2019). We opt for single intermediate-task training as opposed to multi-task training (cf. Liu et al., 2019) to isolate the effect of skills learned from individual intermediate tasks.
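As a minimal illustration of this first stage (a sketch, not our actual jiant configuration), intermediate-task training can be written with the Hugging Face Transformers API as follows; the data loader, label count, epoch count, learning rate, and output directory are illustrative placeholders.

```python
# Minimal sketch of intermediate-task training (stage 1). The data loader,
# label count, epoch count, and learning rate are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

def train_intermediate(task_loader, num_labels, output_dir, epochs=10, lr=1e-5):
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=num_labels)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0,
        num_training_steps=epochs * len(task_loader))
    model.train()
    for _ in range(epochs):
        for batch in task_loader:          # dict with input_ids, attention_mask, labels
            loss = model(**batch).loss     # cross-entropy from the classification head
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
    model.save_pretrained(output_dir)      # checkpoint reused in the second stage
    return model
```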

Target and Probing Task Fine-Tuning

After intermediate-task training, we fine-tune our models on each target and probing task individually. Target tasks are tasks of interest to the general community, spanning various facets of natural language, domains, and sources. Probing tasks, while potentially similar in data source to target tasks such as with CoLA, are designed to isolate the presence of particular linguistic capabilities or skills. For instance, solving the target task BoolQ (Clark et al., 2019) may require various skills including coreference and commonsense reasoning, while probing tasks like the SentEval probing suite (Conneau et al., 2018) target specific syntactic and metadata-level phenomena such as subject-verb agreement and sentence length detection.
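The second stage can be sketched as follows, assuming the stage-1 checkpoint directory from the sketch above: the RoBERTa encoder weights are reused, while a fresh classification head is initialized for each target or probing task before fine-tuning proceeds as in stage 1. The checkpoint path and label count are illustrative.

```python
# Sketch of target/probing fine-tuning (stage 2); the checkpoint path and
# label count are illustrative. ignore_mismatched_sizes discards the stage-1
# task head when the new task has a different label space.
from transformers import AutoModelForSequenceClassification

target_model = AutoModelForSequenceClassification.from_pretrained(
    "checkpoints/roberta-large-mnli-intermediate",  # stage-1 output_dir
    num_labels=2,                                   # e.g., a binary target task such as BoolQ
    ignore_mismatched_sizes=True,
)
# Fine-tune target_model on the target or probing task with a fresh optimizer,
# then evaluate on that task's validation set.
```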

Intermediate Tasks

Name | Train | Dev | Task | Metrics | Genre/Source
CommonsenseQA | 9,741 | 1,221 | question answering | acc. | ConceptNet
SciTail | 23,596 | 1,304 | natural language inference | acc. | science exams
Cosmos QA | 25,588 | 3,000 | question answering | acc. | blogs
SocialIQA | 33,410 | 1,954 | question answering | acc. | crowdsourcing
CCG | 38,015 | 5,484 | tagging | acc. | Wall Street Journal
HellaSwag | 39,905 | 10,042 | sentence completion | acc. | video captions & Wikihow
QA-SRL | 44,837 | 7,895 | question answering | F1/EM | Wikipedia
SST-2 | 67,349 | 872 | sentiment classification | acc. | movie reviews
QAMR | 73,561 | 27,535 | question answering | F1/EM | Wikipedia
QQP | 363,846 | 40,430 | paraphrase detection | acc./F1 | Quora questions
MNLI | 392,702 | 20,000 | natural language inference | acc. | fiction, letters, telephone speech

Target Tasks

Name | Train | Dev | Task | Metrics | Genre/Source
CB | 250 | 57 | natural language inference | acc./F1 | Wall Street Journal, fiction, dialogue
COPA | 400 | 100 | question answering | acc. | blogs, photography encyclopedia
WSC | 554 | 104 | coreference resolution | acc. | hand-crafted
RTE | 2,490 | 278 | natural language inference | acc. | news, Wikipedia
MultiRC | 5,100 | 953 | question answering | F1/EM | crowd-sourced
WiC | 5,428 | 638 | word sense disambiguation | acc. | WordNet, VerbNet, Wiktionary
BoolQ | 9,427 | 3,270 | question answering | acc. | Google queries, Wikipedia
CommonsenseQA | 9,741 | 1,221 | question answering | acc. | ConceptNet
Cosmos QA | 25,588 | 3,000 | question answering | acc. | blogs
ReCoRD | 100,730 | 10,000 | question answering | F1/EM | news (CNN, Daily Mail)

Table 1: Overview of the intermediate tasks (top) and target tasks (bottom) in our experiments. EM is short for Exact Match. The F1 metric for MultiRC is calculated over all answer options.

2.2 Tasks

Table 1 presents an overview of the intermediate and target tasks.

2.2.1 Intermediate Tasks

We curate a diverse set of tasks that either represent an especially large annotation effort or have been shown to yield positive transfer in prior work. The resulting set of tasks covers question answering, commonsense reasoning, and natural language inference.

QAMR

The Question–Answer Meaning Representations dataset (Michael et al., 2018) is a crowdsourced QA task consisting of question–answer pairs that correspond to predicate–argument relationships. It is derived from Wikinews and Wikipedia sentences. For example, if the sentence is Ada Lovelace was a computer scientist., a potential question is What is Ada’s last name?, with the answer being Lovelace.

CommonsenseQA

CommonsenseQA (Talmor et al., 2019) is a multiple-choice QA task derived from ConceptNet (Speer et al., 2017) with the help of crowdworkers, and is designed to test a range of commonsense knowledge.

SciTail

SciTail (Khot et al., 2018) is a textual entailment task built from multiple-choice science questions from 4th grade and 8th grade exams, as well as crowdsourced questions (Welbl et al., 2017). The task is to determine whether a hypothesis, which is constructed from a science question and its corresponding answer, is entailed or not (neutral) by the premise.

Cosmos QA

Cosmos QA (Huang et al., 2019) is a commonsense-based reading comprehension task formulated as multiple-choice questions. The questions concern the causes or effects of events and require reasoning not only over the exact text spans in the context, but also wide-ranging abstractive commonsense reasoning. It differs from CommonsenseQA in that it focuses on causal and deductive commonsense reasoning and in that it requires reading comprehension over an accompanying passage, rather than answering a freestanding question.

SocialIQA

SocialIQA (Sap et al., 2019) is a multiple-choice QA task that tests reasoning about emotional and social intelligence in everyday situations.

CCG

CCGbank (Hockenmaier and Steedman, 2007) is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations. We use the CCG supertagging task, in which tags assigned to individual word tokens jointly determine the parse of the sentence.

HellaSwag

HellaSwag (Zellers et al., 2019) is a commonsense reasoning task that tests a model's ability to choose the most plausible continuation of a story. It is built using adversarial filtering (Zellers et al., 2018) with BERT to create challenging negative examples.

QA-SRL

The question-answer driven semantic role labeling dataset (QA-SRL; He et al., 2015) is a QA task derived from a semantic role labeling task. Each example, which consists of a set of questions and answers, corresponds to a predicate-argument relationship in the sentence it is derived from. Unlike QAMR, which covers all words in the sentence, QA-SRL focuses specifically on verbs.

SST-2

The Stanford sentiment treebank (Socher et al., 2013) is a sentiment classification task based on movie reviews. We use the binary sentence classification version of the task.

QQP

The Quora Question Pairs dataset (http://data.quora.com/First-Quora-DatasetRelease-Question-Pairs) is constructed based on questions posted on the community question-answering website Quora. The task is to determine if two questions are semantically equivalent.

MNLI

The Multi-Genre Natural Language Inference dataset (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations across a variety of genres.

2.2.2 Target Tasks

We use ten target tasks, eight of which are drawn from the SuperGLUE benchmark (Wang et al., 2019a). The tasks in the SuperGLUE benchmark cover question answering, entailment, word sense disambiguation, and coreference resolution and have been shown to be easy for humans but difficult for models like BERT. Although we offer a brief description of the tasks below, we refer readers to the SuperGLUE paper for a more detailed description of the tasks.

CommitmentBank (CB; de Marneffe et al., 2019) is a three-class entailment task that consists of texts and an embedded clause that appears in each text, in which models must determine whether that embedded clause is entailed by the text. Choice of Plausible Alternatives (COPA; Roemmele et al., 2011) is a classification task that consists of premises and a question that asks for the cause or effect of each premise, in which models must correctly pick between two possible choices. Winograd Schema Challenge (WSC; Levesque et al., 2012) is a sentence-level commonsense reasoning task that consists of texts, a pronoun from each text, and a list of possible noun phrases from each text. The dataset has been designed such that world knowledge is required to determine which of the possible noun phrases is the correct referent of the pronoun. We use the SuperGLUE binary classification cast of the task, where each example consists of a text, a pronoun, and a noun phrase from the text, which models must classify as being coreferent with the pronoun or not.

Recognizing Textual Entailment (RTE; Dagan et al., 2005, et seq) is a textual entailment task. Multi-Sentence Reading Comprehension (MultiRC; Khashabi et al., 2018) is a multi-hop QA task that consists of paragraphs, a question on each paragraph, and a list of possible answers, in which models must distinguish which of the possible answers are true and which are false. Word-in-Context (WiC; Pilehvar and Camacho-Collados, 2019) is a binary classification word sense disambiguation task. Examples consist of two text snippets, with a polysemous word that appears in both. Models must determine whether the same sense of the word is used in both contexts. BoolQ (Clark et al., 2019) is a QA task that consists of passages and a yes/no question associated with each passage. Reading Comprehension with Commonsense Reasoning (ReCoRD; Zhang et al., 2018) is a multiple-choice QA task that consists of news articles. For each article, models are given a question about each article with one entity masked out and a list of possible entities from the article, and the goal is to correctly identify the masked entity out of the list.

Additionally, we reuse CommonsenseQA and Cosmos QA from our set of intermediate tasks as target tasks, due to their unique combination of small dataset size and high level of difficulty for high-performing models like BERT.

2.2.3 Probing Tasks

We use well-established datasets for our probing tasks, including the edge-probing suite from Tenney et al. (2019), function word oriented tasks from Kim et al. (2019), and sentence-level probing datasets (SentEval; Conneau et al., 2018).

Acceptability Judgment Tasks

This set of binary classification tasks is designed to investigate whether a model can judge the grammatical acceptability of a sentence. We use the following five datasets: AJ-CoLA is a task that tests a model's understanding of general grammaticality using the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2019), which is drawn from 22 theoretical linguistics publications. The other tasks concern the behavior of specific classes of function words, using the datasets of Kim et al. (2019): AJ-WH is a task that tests a model's ability to detect whether a wh-word in a sentence has been swapped with another wh-word, which requires identifying the antecedent associated with the wh-word. AJ-Def is a task that tests a model's ability to detect whether the definite/indefinite articles in a given sentence have been swapped. AJ-Coord is a task that tests a model's ability to detect whether a coordinating conjunction has been swapped, which requires understanding how the ideas in the coordinated clauses relate to each other. AJ-EOS is a task that tests a model's ability to identify grammatical sentences without indicators such as punctuation and capitalization, and consists of grammatical text with punctuation removed.

Edge-Probing Tasks

The edge probing (EP) tasks are a set of core NLP labeling tasks, collected by Tenney et al. (2019) and cast into Boolean classification. These tasks focus on the syntactic and semantic relations between spans in a sentence. The first five tasks use the OntoNotes corpus (Hovy et al., 2006): Part-of-speech tagging (EP-POS) is a task that tests a model's ability to predict the syntactic category (noun, verb, adjective, etc.) for each word in the sentence. Named entity recognition (EP-NER) is a task that tests a model's ability to predict the category of an entity in a given span. Semantic role labeling (EP-SRL) is a task that tests a model's ability to assign a label to a given span of words that indicates its semantic role (agent, goal, etc.) in the sentence. Coreference (EP-Coref) is a task that tests a model's ability to classify whether two spans of tokens refer to the same entity or event.

The other datasets can be broken down into syntactic and semantic probing tasks. Constituent labeling (EP-Const) is a task that tests a model's ability to predict the non-terminal label for a span of tokens (e.g., noun phrase, verb phrase, etc.). Dependency labeling (EP-UD) is a task that tests a model on the functional relationship of one token relative to another. We use the English Web Treebank portion of the Universal Dependencies 2.2 release (Silveira et al., 2014) for this task. Semantic proto-role labeling (SPR) is a task that tests a model's ability to predict the fine-grained, non-exclusive semantic attributes of a given span. Edge probing uses two datasets for SPR: SPR1 (EP-SPR1; Teichert et al., 2017), derived from the Penn Treebank, and SPR2 (EP-SPR2; Rudinger et al., 2018), derived from the English Web Treebank. Relation classification (EP-Rel) is a task that tests a model's ability to predict the relation between two entities. We use the SemEval 2010 Task 8 dataset (Hendrickx et al., 2009) for this task. For example, the relation between Yeri and Korea in Yeri is from Korea is ENTITY-ORIGIN. The Definite Pronoun Resolution dataset (EP-DPR; Rahman and Ng, 2012) is a task that tests a model's ability to handle coreference, and differs from EP-Coref in that it focuses on difficult cases of definite pronouns.

SentEval Tasks

The SentEval probing tasks (SE; Conneau et al., 2018) are cast in the form of single-sentence classification. Sentence Length (SE-SentLen) is a task that tests a model's ability to classify the length of a sentence. Word Content (SE-WC) is a task that tests a model's ability to identify which of a set of 1,000 potential words appear in a given sentence. Tree Depth (SE-TreeDepth) is a task that tests a model's ability to estimate the maximum depth of the constituency parse tree of the sentence.

Top Constituents (SE-TopConst) is a task that tests a model’s ability to identify the high-level syntactic structure of the sentence by choosing among 20 constituent sequences (the 19 most common, plus an other category). Bigram Shift (SE-BShift) is a task that tests a model’s ability to classify if two consecutive tokens in the same sentence have been reordered. Coordination Inversion (SE-CoordInv) is a task that tests a model’s ability to identify if two coordinating clausal conjoints are swapped (ex: “he knew it, and he deserved no answer.”). Past-Present (SE-Tense) is a task that tests a model’s ability to classify the tense of the main verb of the sentence. Subject Number (SE-SubjNum) and Object Number (SE-ObjNum) are tasks that test a model’s ability to classify whether the subject or direct object of the main clause is singular or plural. Odd-Man-Out (SE-SOMO) is a task that tests the model’s ability to predict whether a sentence has had one of its content words randomly replaced with another word of the same part of speech.
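As a toy illustration of how an input-noising probing example of this kind can be constructed (this mirrors the task definition above, not SentEval's exact data pipeline), a Bigram Shift instance swaps two adjacent tokens and asks the model to detect whether the sentence was modified:

```python
# Toy construction of a Bigram Shift (SE-BShift) style example: label 1 if two
# adjacent tokens were swapped, 0 if the sentence is left unchanged.
import random

def make_bshift_example(tokens, rng=random.Random(0)):
    tokens = list(tokens)
    if rng.random() < 0.5 and len(tokens) > 1:
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
        return " ".join(tokens), 1    # noised sentence
    return " ".join(tokens), 0        # original sentence

print(make_bshift_example("the cat sat on the mat".split()))
```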

Figure 2: Transfer learning results between intermediate and target/probing tasks. Baselines (rightmost column) are models fine-tuned without intermediate-task training. Each cell shows the difference in performance (delta) between the baseline and model with intermediate-task training. We use the macro-average of each task’s metrics as the reported performance. Refer to Table 1 for target task metrics.

3 Experiments

Training and Optimization

We use the large-scale pretrained model RoBERTa in all experiments. For each intermediate, target, and probing task, we perform a hyperparameter sweep, varying the peak learning rate and the dropout rate. After choosing the best learning rate and dropout rate, we apply the best configuration for each task to all runs. For each task, we use the batch size that maximizes GPU usage, and use a maximum sequence length of 256. Aside from these details, we follow the RoBERTa paper for all other training hyperparameters. We use NVIDIA P40 GPUs for our experiments.
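The sweep can be pictured as the loop below; the candidate learning rates and dropout rates shown are placeholders rather than the grids we actually searched, and train_and_eval stands for any routine that fine-tunes with a given configuration and returns a validation score.

```python
# Illustrative hyperparameter sweep; the candidate values are placeholders,
# not the actual grids searched in this work.
def sweep(train_and_eval, lrs=(1e-5, 2e-5, 3e-5), dropouts=(0.1, 0.2)):
    best_config, best_score = None, float("-inf")
    for lr in lrs:
        for dropout in dropouts:
            score = train_and_eval(lr=lr, dropout=dropout, max_seq_length=256)
            if score > best_score:
                best_config, best_score = (lr, dropout), score
    return best_config   # this configuration is then reused for all runs of the task
```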

A complete pipeline with one intermediate task works as follows: First, we fine-tune RoBERTa on the intermediate task. We then fine-tune copies of the resulting model separately on each of the 10 target tasks and 25 probing tasks and test on their respective validation sets. We run the same pipeline three times for each of the 11 intermediate tasks, plus a set of baseline runs without intermediate training. This gives us 35 × (11 + 1) × 3 = 1,260 observations.
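Structurally, the full grid of runs can be sketched as follows, where finetune and evaluate are caller-supplied stand-ins for the training and evaluation routines (an illustration, not our actual jiant driver), and None marks the no-intermediate-training baseline:

```python
# Structural sketch of the experimental grid described above.
def run_grid(finetune, evaluate, intermediate_tasks, downstream_tasks, restarts=3):
    results = {}
    for restart in range(restarts):
        for inter in list(intermediate_tasks) + [None]:      # None = baseline
            encoder = finetune("roberta-large", task=inter) if inter else "roberta-large"
            for task in downstream_tasks:                     # 10 target + 25 probing tasks
                model = finetune(encoder, task=task)
                results[(inter, task, restart)] = evaluate(model, task)
    return results   # 35 tasks x 12 intermediate settings x 3 restarts = 1,260 entries
```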

We train our models using the Adam optimizer (Kingma and Ba, 2015) with linear decay and early stopping. We run training for a maximum of 10 epochs when more than 1,500 training examples are available, and 40 epochs otherwise, to ensure models are sufficiently trained on small datasets. We use the jiant NLP toolkit (Wang et al., 2019b), based on PyTorch (Paszke et al., 2019), Hugging Face Transformers (Wolf et al., 2019), and AllenNLP (Gardner et al., 2017), for all of our experiments.

4 Results and Analysis

4.1 Investigating Transfer Performance

Figure 2 shows the differences in target and probing task performances (deltas) between the baselines and models trained with intermediate-task training, each averaged across three restarts. A positive delta indicates successful transfer.
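Each cell of Figure 2 then corresponds to a computation like the following over the results dictionary from the grid sketch above (a sketch; as noted in the caption, the scores themselves are macro-averages of each task's metrics):

```python
# Delta computation behind Figure 2, over the results dictionary produced by
# the grid sketch above; None is the no-intermediate-training baseline.
import statistics

def delta(results, intermediate, task, restarts=range(3)):
    with_inter = statistics.mean(results[(intermediate, task, r)] for r in restarts)
    baseline = statistics.mean(results[(None, task, r)] for r in restarts)
    return with_inter - baseline   # positive delta = successful transfer
```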

Target Task Performance

We define good intermediate tasks as ones that lead to positive transfer in target task performance. We observe that tasks that require complex reasoning and inference tend to make good intermediate tasks. These include MNLI and commonsense-oriented tasks such as CommonsenseQA, HellaSwag, and Cosmos QA (with the poor performance of the similar SocialIQA serving as a surprising exception). SocialIQA, CCG, and QQP as intermediate tasks lead to negative transfer on all target tasks and the majority of probing tasks.

We investigate the role of intermediate-task dataset size in downstream target task performance by additionally running a set of experiments with varying amounts of training data for five intermediate tasks, shown in the Appendix. We do not find differences in intermediate-task dataset size to have any substantial, consistent impact on downstream target task performance.
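These data-size experiments subsample each intermediate task's training set at several sizes before stage-1 training; a minimal sketch, assuming an in-memory list of training examples (the example size value is illustrative):

```python
# Minimal sketch of the intermediate-task data-size ablation: train on random
# subsets of the intermediate task before target fine-tuning (see Appendix B).
import random

def subsample(examples, size, seed=0):
    rng = random.Random(seed)
    return rng.sample(examples, min(size, len(examples)))

# e.g., run stage 1 on subsample(mnli_train, 10_000); the size here is illustrative
```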

In addition, we find that smaller target tasks such as RTE, BoolQ, MultiRC, WiC, and WSC benefit the most from intermediate-task training. (The deltas for experiments with the same intermediate and target tasks are not 0, as might be expected. This is because we perform both intermediate and target training phases in these cases, with optimizer states and stopping criteria reset between the two phases.) There are no instances of positive transfer to CommitmentBank, since our baseline model already achieves near-perfect accuracy.

Probing Task Performance

Looking at the probing task performance, we find that intermediate-task training affects performance on low-level syntactic probing tasks uniformly across intermediate tasks; we observe little to no improvement for the SentEval probing tasks and higher improvement for acceptability judgment probing tasks, except for AJ-CoLA. This is also consistent with Phang et al. (2018), who find negative transfer with CoLA in their experiments.

Variation across Intermediate Tasks

There is variable performance across higher-level syntactic and semantic tasks such as the edge-probing and SentEval tasks. SocialIQA and QQP show negative transfer on most of the edge-probing tasks, while Cosmos QA and QA-SRL see drops in performance only for EP-Rel. While intermediate-task trained models improve performance on EP-SRL and EP-DPR across the board, there is little to no gain on the SentEval probing tasks from any intermediate task. Additionally, the tasks that improve performance on the largest number of probing tasks also perform well as intermediate tasks.

Degenerate Runs

We find that the model may not exceed chance performance in some training runs. This mostly affects the baseline (no intermediate training) runs on the acceptability judgment probing tasks, excluding AJ-CoLA, which all have very small training sets. We include these degenerate runs in our analysis to reflect this phenomenon. Consistent with Phang et al. (2018), we find that intermediate-task training reduces the likelihood of degenerate runs, leading to ostensibly positive transfer results on those four acceptability judgment tasks across most intermediate tasks. On the other hand, extremely negative transfer from intermediate-task training can also result in a higher frequency of degenerate runs in downstream tasks, as we observe in the cases of using QQP and SocialIQA as intermediate tasks. We also observe a number of degenerate runs on the EP-SRL task as well as the EP-Rel task. These degenerate runs decrease positive transfer in probing tasks, such as with SocialIQA and QQP probing performance, and also decrease the average amount of positive transfer we see in target task performance.

4.2 Correlation Between Probing and Target Task Performance

Figure 3: Correlations between probing and target task performances. Each cell contains the Spearman correlation between probing-task and target-task performances across training on different intermediate tasks and random restarts. We test for statistical significance with Holm-Bonferroni correction, and omit the correlations that are not statistically significant.

Next, we investigate the relationship between target and probing tasks in an attempt to understand why certain intermediate-task models perform better on certain target tasks.

We use probing task performance as an indicator of the acquisition of particular language skills. We compute the Spearman correlation between probing-task and target-task performances across training on different intermediate tasks and multiple restarts, as shown in Figure 3. We test for statistical significance and apply Holm-Bonferroni correction for multiple testing. We omit correlations that are not statistically significant. We opt for Spearman rather than Pearson correlation because of the wide variety of metrics used for the different tasks. (Full correlation tables across all target and probing tasks, with both Spearman and Pearson correlations, can be found in the Appendix.)
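A sketch of this correlation computation, assuming per-run score lists aligned across the same intermediate tasks and restarts; the scipy and statsmodels calls are standard, but the significance threshold shown is a placeholder rather than the level used in our analysis:

```python
# Spearman correlations with Holm-Bonferroni correction, as in Figure 3.
# probing_scores and each entry of target_scores_by_task are score lists
# aligned across the same (intermediate task, restart) runs; alpha is a placeholder.
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def significant_correlations(probing_scores, target_scores_by_task, alpha=0.05):
    names, rhos, pvals = [], [], []
    for task, target_scores in target_scores_by_task.items():
        rho, p = spearmanr(probing_scores, target_scores)
        names.append(task)
        rhos.append(rho)
        pvals.append(p)
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return {name: rho for name, rho, keep in zip(names, rhos, reject) if keep}
```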

We find that acceptability judgment probing task performance is generally uncorrelated with the target task performance, except for AJ-CoLA. Similarly, many of the SentEval tasks do not correlate with the target tasks, except for Bigram Shift (SE-BShift), Odd-Man-Out (SE-SOMO) and Coordination Inversion (SE-CoordInv). These three tasks are input noising tasks—tasks where a model has to predict if a given input sentence has been randomly modified—which are, by far, the most similar tasks we study to the masked language modeling task that is used for training RoBERTa. This may explain the strong correlation with the performance of the target tasks.

We also find that some of these strong correlations, such as with SE-SOMO and SE-CoordInv, are almost entirely driven by variation in the degree of negative transfer, rather than any positive transfer. Intuitively, fine-tuning RoBERTa on an intermediate task can cause the model to forget some of its ability to perform the MLM task. Thus, a potential direction for improving intermediate-task training may be to integrate the MLM objective into intermediate-task training or to bound network parameter changes so as to reduce catastrophic forgetting (Kirkpatrick et al., 2016; Chen et al., 2019).
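One possible instantiation of the first of these ideas, offered only as a sketch and not something evaluated in this work, is to mix a masked language modeling loss into each intermediate-task training step so that the encoder is penalized for drifting away from its pretraining behavior; the mixing weight and the assumption that the two heads share a single RoBERTa encoder are illustrative:

```python
# Hypothetical multi-task step mixing the intermediate-task loss with an MLM
# loss on unlabeled text; task_model and mlm_model are assumed to share the
# same RoBERTa encoder, and lam is an illustrative mixing weight.
def multitask_step(task_model, mlm_model, task_batch, mlm_batch, optimizer, lam=0.5):
    task_loss = task_model(**task_batch).loss   # intermediate-task objective
    mlm_loss = mlm_model(**mlm_batch).loss      # masked language modeling objective
    loss = task_loss + lam * mlm_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```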

Interestingly, while intermediate tasks such as SocialIQA, CCG, and QQP, which show negative transfer on target tasks, tend to show negative transfer on these three probing tasks as well, the intermediate tasks with positive transfer, such as the commonsense QA tasks and MNLI, do not appear to adversely affect performance on these probing tasks. This asymmetric impact may indicate that, beyond the similarity of intermediate and target tasks, avoiding catastrophic forgetting of pretraining is critical to successful intermediate-task transfer.

The remaining SentEval probing tasks have similar delta values (Figure 2), which may indicate that there is insufficient variation in transfer performance to derive significant correlations. Among the edge-probing tasks, the more semantic tasks such as coreference (EP-Coref and EP-DPR), semantic proto-role labeling (EP-SPR1 and EP-SPR2), and relation classification (EP-Rel) show the highest correlations with our target tasks. As our set of target tasks is also oriented towards semantics and reasoning, this is to be expected.

On the other hand, among the target tasks, we find that ReCoRD, CommonsenseQA and Cosmos QA—all commonsense-oriented tasks—exhibit both high correlations with each other as well as a similar set of correlations with the probing tasks. Similarly, BoolQ, MultiRC, and RTE correlate strongly with each other and have similar patterns of probing-task performance.

5 Related Work

Within the paradigm of training large pretrained Transformer language representations via intermediate-stage training before fine-tuning on a target task, positive transfer has been shown in both sequential task-to-task (Phang et al., 2018) and multi-task-to-task (Liu et al., 2019; Raffel et al., 2019) formats. Wang et al. (2019) perform an extensive study on transfer with BERT, finding language modeling and NLI tasks to be among the most beneficial tasks for improving target-task performance. Talmor and Berant (2019) perform a similar cross-task transfer study on reading comprehension datasets, finding similar positive transfer in most cases, with the biggest gains stemming from a combination of multiple QA datasets. Our work covers a larger, more diverse set of intermediate–target task pairs. We also use probing tasks to shed light on the skills learned from the intermediate tasks.

Among the prior work on predicting transfer performance, Bingel and Søgaard (2017) is the most similar to ours. They perform a regression analysis that predicts target-task performance on the basis of various features of the source and target tasks and task pairs. They focus on a multi-task training setting without self-supervised pretraining, as opposed to our single-intermediate-task, three-step procedure.

Similar work (Lin et al., 2019) has been done on cross-lingual transfer—the analogous challenge of transferring learned knowledge from a high-resource to a low-resource language.

Many recent works have attempted to understand the knowledge and linguistic skills BERT learns, for instance by analyzing the language model surprisal for subject–verb agreements (Goldberg, 2018), identifying specific knowledge or phenomena encapsulated in the representations learned by BERT using probing tasks (Tenney et al., 2019, 2019; Warstadt et al., 2019; Lin et al., 2019; Hewitt and Manning, 2019; Jawahar et al., 2019), analyzing the attention heads of BERT (Clark et al., 2019; Coenen et al., 2019; Lin et al., 2019; Htut et al., 2019), and testing the linguistic generalizations of BERT across runs (McCoy et al., 2019). However, relatively little work has been done to analyze fine-tuned BERT-style models (Wang et al., 2019; Warstadt et al., 2019).

6 Conclusion and Future Work

This paper presents a large-scale study on when and why intermediate-task training works with pretrained models. We perform experiments on RoBERTa with a total of 110 pairs of intermediate and target tasks, and perform an analysis using 25 probing tasks, covering different semantic and syntactic phenomena. Most directly, we observe that tasks like Cosmos QA and HellaSwag, which require complex reasoning and inference, tend to work best as intermediate tasks.

Looking to our probing analysis, intermediate tasks that help RoBERTa improve across the board show the most positive transfer in downstream tasks. However, it is difficult to draw definite conclusions about the specific skills that drive positive transfer. Intermediate-task training may help improve the handling of syntax, but there is little to no correlation between target-task and probing-task performance for these skills. Probes for higher-level semantic abilities tend to have a higher correlation with the target-task performance, but these results are too diffuse to yield more specific conclusions. Future work in this area would benefit greatly from improvements to both the breadth and depth of available probing tasks.

We also observe a worryingly high correlation between target-task performance and the two probing tasks which most closely resemble RoBERTa’s masked language modeling pretraining objective. Thus, the results of our intermediate-task training analysis may be driven in part by forgetting of knowledge acquired during pretraining. Our results therefore suggest a need for further work on efficient transfer learning mechanisms.

Acknowledgments

This project has benefited from support to SB by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program), by Samsung Research (under the project Improving Deep Learning using Latent Structure), by Intuit, Inc., and by NVIDIA Corporation (with the donation of a Titan V GPU).

References

  • J. Bingel and A. Søgaard (2017) Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 164–169. External Links: Link Cited by: §5.
  • X. Chen, S. Wang, B. Fu, M. Long, and J. Wang (2019) Catastrophic forgetting meets negative transfer: batch spectral shrinkage for safe transfer learning. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 1906–1916. External Links: Link Cited by: §4.2.
  • C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2924–2936. External Links: Link, Document Cited by: §1, §2.1, §2.2.2.
  • K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019) What does BERT look at? an analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, pp. 276–286. External Links: Link, Document Cited by: §5.
  • A. Coenen, E. Reif, A. Yuan, B. Kim, A. Pearce, F. B. Viégas, and M. Wattenberg (2019) Visualizing and measuring the geometry of BERT. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §5.
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2126–2136. External Links: Link, Document Cited by: §2.1, §2.2.3, §2.2.3.
  • I. Dagan, O. Glickman, and B. Magnini (2005) The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177–190. Cited by: §2.2.2.
  • M. de Marneffe, M. Simons, and J. Tonhauser (2019) The CommitmentBank: investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, Vol. 23, pp. 107–124. Cited by: §2.2.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2.1.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer (2017) AllenNLP: a deep semantic natural language processing platform. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §3.
  • Y. Goldberg (2018) Assessing BERT’s syntactic abilities. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §5.
  • L. He, M. Lewis, and L. Zettlemoyer (2015) Question-answer driven semantic role labeling: using natural language to annotate natural language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 643–653. External Links: Link, Document Cited by: §2.2.1.
  • I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz (2009) SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), Boulder, Colorado, pp. 94–99. External Links: Link Cited by: §2.2.3.
  • J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4129–4138. External Links: Document Cited by: §5.
  • J. Hockenmaier and M. Steedman (2007) CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn treebank. Computational Linguistics 33 (3), pp. 355–396. External Links: Link, Document Cited by: §2.2.1.
  • E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel (2006) OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, New York City, USA, pp. 57–60. External Links: Link Cited by: §2.2.3.
  • P. M. Htut, J. Phang, S. Bordia, and S. R. Bowman (2019) Do attention heads in bert track syntactic dependencies?. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §5.
  • L. Huang, R. Le Bras, C. Bhagavatula, and Y. Choi (2019) Cosmos QA: machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2391–2401. External Links: Link, Document Cited by: §2.2.1.
  • G. Jawahar, B. Sagot, and D. Seddah (2019) What does BERT learn about the structure of language?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3651–3657. External Links: Link, Document Cited by: §5.
  • D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth (2018) Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 252–262. External Links: Link, Document Cited by: §2.2.2.
  • T. Khot, A. Sabharwal, and P. Clark (2018) SciTail: a textual entailment dataset from science question answering. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.2.1.
  • N. Kim, R. Patel, A. Poliak, P. Xia, A. Wang, T. McCoy, I. Tenney, A. Ross, T. Linzen, B. Van Durme, S. R. Bowman, and E. Pavlick (2019) Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), Minneapolis, Minnesota, pp. 235–249. External Links: Link, Document Cited by: §2.2.3, §2.2.3.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §3.
  • J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2016) Overcoming catastrophic forgetting in neural networks. In Proceedings of the national academy of sciences (PNAS), External Links: Link Cited by: §4.2.
  • H. J. Levesque, E. Davis, and L. Morgenstern (2012) The winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR’12, pp. 552–561. External Links: ISBN 978-1-57735-560-1, Link Cited by: §2.2.2.
  • Y. Lin, Y. C. Tan, and R. Frank (2019) Open sesame: getting inside BERT’s linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, pp. 241–253. External Links: Link, Document Cited by: §5.
  • Y. Lin, C. Chen, J. Lee, Z. Li, Y. Zhang, M. Xia, S. Rijhwani, J. He, Z. Zhang, X. Ma, A. Anastasopoulos, P. Littell, and G. Neubig (2019) Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3125–3135. External Links: Link, Document Cited by: §5.
  • X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4487–4496. External Links: Link, Document Cited by: §2.1, §5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §1, §1.
  • R. T. McCoy, J. Min, and T. Linzen (2019) BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. Note: Unpublished manuscript available on arXiv External Links: Link, 1911.02969 Cited by: §5.
  • J. Michael, G. Stanovsky, L. He, I. Dagan, and L. Zettlemoyer (2018) Crowdsourcing question-answer meaning representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 560–568. External Links: Link, Document Cited by: §2.2.1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §3.
  • J. Phang, T. Févry, and S. R. Bowman (2018) Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §1, §1, §4.1, §4.1, §5.
  • M. T. Pilehvar and J. Camacho-Collados (2019) WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1267–1273. External Links: Link, Document Cited by: §2.2.2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §5.
  • A. Rahman and V. Ng (2012) Resolving complex cases of definite pronouns: the Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 777–789. External Links: Link Cited by: §2.2.3.
  • M. Roemmele, C. A. Bejan, and A. S. Gordon (2011) Choice of plausible alternatives: an evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, Cited by: §2.2.2.
  • R. Rudinger, A. Teichert, R. Culkin, S. Zhang, and B. Van Durme (2018) Neural-Davidsonian Semantic Proto-role Labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 944–955. External Links: Link, Document Cited by: §2.2.3.
  • M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019) Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4453–4463. External Links: Link, Document Cited by: §1, §2.2.1.
  • N. Silveira, T. Dozat, M. de Marneffe, S. Bowman, M. Connor, J. Bauer, and C. Manning (2014) A Gold Standard Dependency Corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 2897–2904. External Links: Link Cited by: §2.2.3.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642. External Links: Link Cited by: §2.2.1.
  • R. Speer, J. Chin, and C. Havasi (2017) Conceptnet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.2.1.
  • A. Talmor and J. Berant (2019) MultiQA: an empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4911–4921. External Links: Link, Document Cited by: §5.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158. External Links: Link, Document Cited by: §2.2.1.
  • A. Teichert, A. Poliak, B. Van Durme, and M. R. Gormley (2017) Semantic proto-role labeling. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.2.3.
  • I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4593–4601. External Links: Link, Document Cited by: §5.
  • I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das, and E. Pavlick (2019) What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, External Links: Link Cited by: §2.2.3, §2.2.3, §5.
  • A. Wang, J. Hula, P. Xia, R. Pappagari, R. T. McCoy, R. Patel, N. Kim, I. Tenney, Y. Huang, K. Yu, S. Jin, B. Chen, B. Van Durme, E. Grave, E. Pavlick, and S. R. Bowman (2019) Can you tell me how to get past sesame street? sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4465–4476. External Links: Link, Document Cited by: §1, §1, §5, §5.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019a) SuperGLUE: a multi-task benchmark and analysis platform for natural language understanding. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3261–3275. External Links: Link Cited by: §2.2.2.
  • A. Wang, I. F. Tenney, Y. Pruksachatkun, K. Yu, J. Hula, P. Xia, R. Pappagari, S. Jin, R. T. McCoy, R. Patel, Y. Huang, J. Phang, E. Grave, H. Liu, N. Kim, P. M. Htut, T. Févry, B. Chen, N. Nangia, A. Mohananey, K. Kann, S. Bordia, N. Patry, D. Benton, E. Pavlick, and S. R. Bowman (2019b) jiant 1.2: a software toolkit for research on general-purpose text understanding models. External Links: Link Cited by: §3.
  • A. Warstadt, Y. Cao, I. Grosu, W. Peng, H. Blix, Y. Nie, A. Alsop, S. Bordia, H. Liu, A. Parrish, S. Wang, J. Phang, A. Mohananey, P. M. Htut, P. Jeretic, and S. R. Bowman (2019) Investigating BERT’s knowledge of language: five analysis methods with NPIs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2870–2880. External Links: Link, Document Cited by: §5.
  • A. Warstadt, A. Singh, and S. R. Bowman (2019) Neural network acceptability judgments. Transactions of the Association for Computational Linguistics (TACL) 7, pp. 625–641. External Links: Link Cited by: §2.2.3.
  • J. Welbl, N. F. Liu, and M. Gardner (2017) Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, Copenhagen, Denmark, pp. 94–106. External Links: Link, Document Cited by: §2.2.1.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. External Links: Link Cited by: §2.2.1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §3.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) SWAG: a large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 93–104. External Links: Link, Document Cited by: §2.2.1.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4791–4800. External Links: Link, Document Cited by: §2.2.1.
  • S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. V. Durme (2018) ReCoRD: bridging the gap between human and machine commonsense reading comprehension. Note: Unpublished manuscript available on arXiv External Links: Link Cited by: §2.2.2.

Appendix A Correlation Between Probing and Target Task Performance

Figure 4 shows the correlation matrix using Spearman correlation and Figure 5 shows the matrix using Pearson correlation.

Appendix B Effect of Intermediate Task Size on Target Task Performance

Figure 6 shows the effect of intermediate-task dataset size on downstream target task performance for five intermediate tasks, which were chosen to span a variety of original dataset sizes and degrees of transfer effectiveness.

Figure 4: Correlations between probing and target task performances. Each cell contains the Spearman correlation between probing and target tasks performances across training on different intermediate tasks and random restarts.
Figure 5: Correlations between probing and target task performances. Each cell contains the Pearson correlation between probing and target tasks performances across training on different intermediate tasks and random restarts.
Figure 6: Results of experiments on the impact of intermediate-task dataset size on downstream target task performance. For each subfigure, we fine-tune RoBERTa on a range of dataset sizes (sampled randomly from the dataset). We report the macro-average of each target task's performance metrics after fine-tuning on each dataset-size split.