Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

November 2, 2018 · Jason Phang, et al. · NYU

Pretraining with language modeling and related unsupervised tasks has recently been shown to be a very effective enabling technology for the development of neural network models for language understanding tasks. In this work, we show that although language model-style pretraining is extremely effective at teaching models about language, it does not yield an ideal starting point for efficient transfer learning. By supplementing language model-style pretraining with further training on data-rich supervised tasks, we are able to achieve substantial additional performance improvements across the nine target tasks in the GLUE benchmark. We obtain an overall score of 76.9 on GLUE, a 2.3-point improvement over our baseline system adapted from Radford et al. (2018) and a 4.1-point improvement over Radford et al.'s reported score. We further use training data downsampling to show that the benefits of this supplementary training are even more pronounced in data-constrained regimes.


1 Introduction

It has become clear over the last year that pretraining sentence encoder neural networks on unsupervised tasks, such as language modeling, then fine-tuning them on individual target tasks, can yield significantly better target task performance than could be achieved using target task training data alone (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). Large-scale unsupervised pretraining in experiments like these seems to produce pretrained sentence encoders with substantial knowledge of the target language (which, so far, is generally English). These works have shown that a mostly task-agnostic, one-size-fits-all approach to fine-tuning a large pretrained model with a thin output layer for a given task can achieve results superior to individually optimized models.

However, it is not obvious that the model parameters obtained during unsupervised pretraining should be ideally suited to supporting this kind of transfer learning. Especially when only a small amount of training data is available for the target task, experiments of this kind are potentially brittle, and rely on the pretrained encoder parameters being reasonably close to an optimal setting for the target task. During target task training, the encoder must learn and adapt enough to be able to solve the target task—potentially involving a very different input distribution and output label space than was seen in pretraining—but it must avoid adapting so much that it overfits and ceases to take advantage of what was learned during pretraining.

This work explores the possibility that the use of a second stage of pretraining with data-rich intermediate supervised tasks might mitigate this brittleness and improve both the robustness and effectiveness of the resulting target task model. We name this approach Supplementary Training on Intermediate Labeled-data Tasks (STILTs).

Experiments with sentence encoders on STILTs take the following form: (i) A model is first trained on an unlabeled-data task like language modeling that can teach it to handle data in the target language; (ii) The model is then further trained on an intermediate, labeled-data task for which ample labeled data is available; (iii) The model is finally fine-tuned further on the target task and evaluated. Our experiments evaluate STILTs as a means of improving target task performance on the GLUE benchmark suite (Wang et al., 2018)—a collection of target tasks drawn from the NLP literature—using the publicly distributed OpenAI generatively pretrained Transformer (GPT) language model (Radford et al., 2018) as our pretrained encoder. We follow Radford et al. in our basic mechanism for fine-tuning on both the intermediate and final tasks, and use the following intermediate tasks: the Multi-Genre NLI Corpus (MNLI; Williams et al., 2018), the Stanford NLI Corpus (SNLI; Bowman et al., 2015), the Quora Question Pairs (QQP) dataset, and a custom fake-sentence-detection task based on the BooksCorpus dataset (Zhu et al., 2015) using a method adapted from Warstadt et al. (2018). We show that using STILTs yields significant gains across most of the GLUE tasks.
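To make this pipeline concrete, the following is a minimal schematic sketch of the three STILTs stages; the helper functions (load_pretrained_lm, train_on_task, evaluate_on) are hypothetical stand-ins rather than functions from any released codebase.

```python
# Schematic sketch of STILTs; helper functions are hypothetical stand-ins.

def stilts(intermediate_task, target_task):
    # (i) Unsupervised pretraining: start from a sentence encoder trained
    #     with language modeling (here, the released GPT model).
    model = load_pretrained_lm("openai-gpt")

    # (ii) Supplementary training on a data-rich intermediate labeled task,
    #      e.g. MNLI, SNLI, QQP, or fake-sentence detection.
    model = train_on_task(model, intermediate_task)

    # (iii) Fine-tune on the target task (e.g. a GLUE task such as RTE)
    #       and evaluate on its development or test set.
    model = train_on_task(model, target_task)
    return evaluate_on(model, target_task)
```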

As we expect that any kind of pretraining will be most valuable in a limited training data regime, we also conduct a set of fine-tuning experiments in which the model is fine-tuned on only a 1k- or 5k-example sample of the target task training set. The results show that using STILTs substantially improves model performance across most tasks in this downsampled data setting. For target tasks such as MRPC, using STILTs is critical to obtaining good performance.

2 Related Work

Zhang and Bowman (2018) compare several pretraining tasks for syntactic target tasks, and find that language model pretraining reliably performs well. Peters et al. (2018) investigate the architectural choices behind ELMo-style pretraining with a fixed encoder, and find that the precise choice of encoder architecture strongly influences training speed but has a relatively small impact on performance. In a publicly available ICLR 2019 submission, Anonymous (2019) compare a variety of tasks for pretraining in an ELMo-style setting with no encoder fine-tuning. They conclude that language modeling generally works best among candidate single tasks for pretraining, but show some cases in which a cascade of a model pretrained on language modeling followed by another model pretrained on tasks like MNLI can work well. The paper introducing BERT (Devlin et al., 2018) briefly mentions encouraging results in a direction similar to ours: one footnote notes that unpublished experiments show “substantial improvements on RTE from multi-task training with MNLI”.

In the area of sentence-to-vector sentence encoding, Conneau et al. (2018) offer one of the most comprehensive suites of diagnostic tasks, and highlight the importance of ensuring that these models preserve lexical content information.

In earlier work less closely tied to the unsupervised pretraining setup used here, Bingel and Søgaard (2017) investigate the conditions under which tasks can be productively combined in multi-task learning, and show that the success of a task combination can be determined by the shape of each task's learning curve during training. In their words: “Multi-task gains are more likely for target tasks that quickly plateau with non-plateauing auxiliary tasks”.

In the area of word representations, this work shares motivations with work on embedding-space retrofitting (Faruqui et al., 2015), in which a labeled dataset like WordNet is used to refine representations learned by an unsupervised embedding learning algorithm before they are used in a target task.

Avg AvgEx CoLA SST MRPC QQP STS MNLI QNLI RTE WNLI
Training Set Size 8.5k 67k 3.7k 364k 7k 393k 108k 2.5k 634
Development Set Scores
LM 75.4 72.4 50.2 93.2 80.1/85.9 89.4/85.9 86.4/86.5 81.2 82.4 58.1 56.3
LM→QQP 76.0 73.1 48.3 93.1 83.1/88.0 *89.4/85.9 87.0/86.9 80.7 82.6 62.8 56.3
LM→MNLI 76.7 74.2 45.7 92.2 87.3/90.8 89.2/85.3 88.1/88.0 *81.2 82.6 67.9 56.3
LM→SNLI 76.0 73.1 41.5 91.9 86.0/89.9 89.9/86.6 88.7/88.6 81.1 82.2 65.7 56.3
LM→Real/Fake 76.6 73.9 49.5 91.4 83.6/88.6 90.1/86.9 87.9/87.8 81.0 82.5 66.1 56.3
Overall Best 77.6 75.5 50.2 93.2 87.3/90.8 90.1/86.9 88.7/88.6 81.1 83.1 67.9 56.3
Test Set Scores
LM 74.6 72.9 47.2 93.1 84.7/79.1 87.7/69.3 81.5/79.6 80.7 86.6 57.8 65.1
LM→QQP 75.4 74.1 45.5 92.6 79.8/85.2 *87.7/69.3 84.0/82.7 80.1 86.9 64.0 65.1
LM→MNLI 76.5 75.3 43.8 93.0 87.7/83.7 88.4/70.0 84.7/83.9 *80.7 87.2 69.1 65.1
LM→SNLI 75.7 74.3 39.4 91.0 84.1/88.2 88.2/70.3 85.3/84.8 80.5 86.6 67.8 65.1
LM→Real/Fake 76.2 75.0 47.9 92.8 82.8/87.6 88.1/70.1 84.0/82.7 80.8 86.4 65.5 65.1
Best based on Dev 76.9 75.9 47.2 93.1 87.7/83.7 88.1/70.1 85.3/84.8 80.7 87.2 69.1 65.1
Table 1: GLUE results with and without STILTs, fine-tuning on the full training data of each target task with an auxiliary language modeling objective. Bold indicates the best within each section. * indicates cases where the intermediate task is the same as the target task: we substitute the baseline result for that cell. AvgEx is the average excluding MNLI and QQP because of their overlap with the intermediate tasks. See text for discussion of WNLI results.

3 Methods

Pretrained Language Model

We primarily use the pretrained Transformer language model, or GPT model, distributed by Radford et al. (2018), which was the best publicly available pretrained sentence encoder as measured by GLUE benchmark performance at the time we started this work. (We plan to perform parallel experiments with the newer BERT on STILTs in the near future.) They apply the decoder-only variant of the Transformer architecture (Vaswani et al., 2017) to language modeling on the BooksCorpus dataset. We follow this line of work in using an inductive approach to transfer learning, in which the model parameters learned during pretraining are used to initialize a target task model but are not fixed and do not constrain the solution that is learned for the target task. This stands in contrast to the approach used for methods like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017) and for earlier sentence-to-vector methods like that of Subramanian et al. (2018), in which a sentence encoder component is pretrained and then attached to a target task model as a fixed input layer that cannot be further trained.
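As a rough illustration of this distinction (not code from the paper), the PyTorch sketch below contrasts the inductive setup, in which all encoder parameters stay trainable, with an ELMo/CoVe-style setup in which the pretrained encoder is frozen; the encoder itself is a hypothetical stand-in for a pretrained model such as GPT.

```python
# Minimal PyTorch sketch contrasting the two transfer styles; `encoder` is a
# hypothetical stand-in for a pretrained sentence encoder such as the GPT model.
import torch.nn as nn

class TargetTaskModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_labels: int,
                 freeze_encoder: bool = False):
        super().__init__()
        self.encoder = encoder  # parameters initialized from pretraining
        self.classifier = nn.Linear(hidden_dim, num_labels)  # thin task-specific head
        if freeze_encoder:
            # ELMo/CoVe-style: the encoder acts as a fixed input layer.
            for param in self.encoder.parameters():
                param.requires_grad = False
        # With freeze_encoder=False (the GPT/STILTs setting), every encoder
        # parameter continues to be updated during intermediate-task and
        # target-task training.

    def forward(self, inputs):
        features = self.encoder(inputs)  # e.g. the final-token representation
        return self.classifier(features)
```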

To implement intermediate-task and target-task training, we use the jiant transfer learning toolkit (https://github.com/jsalt18-sentence-repl/jiant), which is built on AllenNLP (Gardner et al., 2017) and PyTorch (Paszke et al., 2017).

Target Tasks and Evaluation

We evaluate on the nine target tasks in the GLUE benchmark (Wang et al., 2018). These include MNLI, QQP, and seven others: acceptability classification with CoLA (Warstadt et al., 2018); binary sentiment classification with SST (Socher et al., 2013); semantic similarity with the MSR Paraphrase Corpus (MRPC; Dolan and Brockett, 2005) and STS-Benchmark (STS; Cer et al., 2017); and textual entailment with a subset of the RTE challenge corpora (Dagan et al., 2006, et seq.), and data from SQuAD (QNLI, Rajpurkar et al., 2016) and the Winograd Schema Challenge (WNLI, Levesque et al., 2011) converted to entailment format as in White et al. (2017). Because of the adversarial nature of WNLI, our models do not generally perform better than chance, and we follow the recipe of Devlin et al. (2018) by predicting the most frequent label for all examples.

Avg AvgEx CoLA SST MRPC QQP STS MNLI QNLI RTE WNLI
Training Set Size 8.5k 67k 3.7k 364k 7k 393k 108k 2.5k 634
At Most 5k Training Examples for Target Tasks
LM 71.5 70.9 50.2 91.1 80.1/85.9 80.6/75.2 86.1/85.9 68.5 72.8 58.1 56.3
LM→QQP 70.9 70.4 40.4 89.7 83.1/88.0 *80.6/75.2 86.9/86.7 67.4 72.6 62.8 56.3
LM→MNLI 73.0 72.4 44.2 88.9 87.3/90.8 82.2/77.1 87.3/87.2 *68.8 74.8 67.9 56.3
LM→SNLI 72.7 70.5 33.1 89.7 86.0/89.9 81.9/76.4 88.4/88.3 79.1 75.3 65.7 56.3
LM→Real/Fake 73.8 72.2 48.3 89.7 83.6/88.6 81.5/76.3 87.3/87.1 78.0 73.4 66.1 56.3
Overall Best 76.3 74.0 50.2 91.1 87.3/90.8 82.2/77.1 88.4/88.3 79.1 75.3 67.9 56.3
At Most 1k Training Examples for Target Tasks
LM 58.3 60.2 0.0 88.4 73.3/82.8 73.0/66.7 81.0/80.6 35.0 63.0 53.1 56.3
LM→QQP 63.8 63.0 12.1 85.8 73.8/82.0 *73.0/66.7 79.5/79.9 60.4 69.1 61.7 56.3
LM→MNLI 62.8 64.6 9.4 86.5 78.9/85.2 78.0/71.6 83.5/83.5 *35.0 71.7 66.1 56.3
LM→SNLI 67.7 65.0 11.7 86.1 83.6/88.5 77.8/69.9 85.1/85.1 77.5 69.5 63.2 56.3
LM→Real/Fake 70.4 68.7 40.9 86.8 77.5/84.8 77.2/69.7 81.6/81.6 77.8 70.0 65.7 56.3
Overall Best 73.7 70.7 40.9 88.4 83.6/88.5 78.0/71.6 85.1/85.1 77.8 71.7 66.1 56.3
Table 2: GLUE results on the development set, based on fine-tuning on only a subset of the target-task data to simulate data-constrained scenarios, with an auxiliary language modeling objective. Bold indicates the best within each section. * indicates cases where the intermediate task is the same as the target task: we substitute the baseline result for that cell. AvgEx is the average excluding MNLI and QQP because of their overlap with the intermediate tasks. See text for discussion of WNLI results.

Most of our experiments—including all of our experiments using downsampled training sets for our target tasks—are evaluated on the development set of GLUE. Based on the results on the development set, we choose the best intermediate-task training scheme for each task and submit the best-per-task model for evaluation on the test set on the public leaderboard.
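This per-task selection is simple to express; the sketch below illustrates it using a few development-set numbers copied from Table 1 (RTE and the first MRPC metric), purely as an example.

```python
# "Best based on Dev": pick, for each target task, the supplementary training
# scheme with the highest development-set score; its test score is then reported.
dev_scores = {
    # numbers taken from the development-set section of Table 1
    "RTE":  {"LM": 58.1, "LM→QQP": 62.8, "LM→MNLI": 67.9, "LM→SNLI": 65.7, "LM→Real/Fake": 66.1},
    "MRPC": {"LM": 80.1, "LM→QQP": 83.1, "LM→MNLI": 87.3, "LM→SNLI": 86.0, "LM→Real/Fake": 83.6},
}

best_scheme = {task: max(schemes, key=schemes.get) for task, schemes in dev_scores.items()}
print(best_scheme)  # {'RTE': 'LM→MNLI', 'MRPC': 'LM→MNLI'}
```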

Intermediate Task Training

Most of our experiments follow the GPT approach from Radford et al. (2018), except that we add a supplementary training phase on an intermediate task before target-task fine-tuning. We call this approach GPT on STILTs. We evaluate four intermediate tasks, which are collectively meant to represent a small sample of readily available data-rich sentence-level tasks similar to those in GLUE: (i) textual entailment with MNLI; (ii) textual entailment with SNLI; (iii) paraphrase detection with QQP; and (iv) a custom fake-sentence-detection task.

Our use of MNLI is motivated by prior successes with MNLI pretraining by Conneau et al. (2018) and Subramanian et al. (2018). We include the single-genre captions-based SNLI in addition to the multi-genre MNLI to disambiguate between the benefits of domain-shift and task-shift from supplementary training on MNLI. QQP is included as we believed it could improve performance on sentence similarity tasks. Lastly, we include a fake-sentence-detection task based on the BooksCorpus dataset–this is a simple single-sentence task that enables us to isolate the impact of task shift from corpus shift altogether, as the Radford et al.’s GPT model was similarly trained on BooksCorpus. The fake-sentence-detection task is constructed by sampling sentences from BooksCorpus, and fake sentences are generating by randomly swapping 2–4 pairs of words in the sentence. We generate a dataset of 600,000 sentences with a 50/50 real/fake split for this intermediate task.
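A minimal sketch of how such fake sentences could be generated, under our reading of the description above (the exact tokenization and sampling details are illustrative assumptions):

```python
# Generate a "fake" sentence by swapping 2-4 randomly chosen pairs of word positions.
import random

def make_fake_sentence(sentence: str, rng: random.Random) -> str:
    tokens = sentence.split()
    for _ in range(rng.randint(2, 4)):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

rng = random.Random(0)
print(make_fake_sentence("the cat sat on the mat near the front door", rng))
```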

Training Details

Unless otherwise stated, we follow the model formulation and training regime of the GPT model specified in Radford et al. (2018), including using the same optimizer, learning rate schedule, and weight decay. We halve the batch size for sentence-pair tasks so that the model fits in a single-GPU setting. We use a three-epoch training limit for both supplementary training and target-task fine-tuning, and use a fresh optimizer each time. For each task, we add only a small task-specific output layer to the pretrained Transformer model, and we follow Radford et al. in the choice of output layer and the method for handling multiple-sentence input. Radford et al. also formulate two regimes for training their model: with and without an auxiliary language modeling objective during fine-tuning. We show results from models trained using the auxiliary language modeling objective in both supplementary training and fine-tuning. Results with both supplementary training and fine-tuning performed without the auxiliary language modeling objective can be found in the Appendix.
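The sketch below shows the shape of one training phase as described above (at most three epochs, a fresh optimizer per phase, and an optional auxiliary language modeling loss); the optimizer choice, learning rate, and loss weight shown are illustrative placeholders, not the exact settings of Radford et al. (2018).

```python
# Sketch of a single training phase (supplementary training or target-task
# fine-tuning). Hyperparameters here are placeholders, not the GPT settings.
import torch

def train_phase(model, dataloader, task_loss_fn, lm_loss_fn=None,
                aux_lm_weight=0.5, epochs=3, lr=6.25e-5):
    # A fresh optimizer is created for every phase.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):  # three-epoch training limit
        for batch in dataloader:
            task_logits, lm_logits = model(batch["inputs"])
            loss = task_loss_fn(task_logits, batch["labels"])
            if lm_loss_fn is not None:  # optional auxiliary LM objective
                loss = loss + aux_lm_weight * lm_loss_fn(lm_logits, batch["inputs"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```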

For our baseline, we do not fine-tune on any intermediate task: this is equivalent to the formulation presented in Radford et al. (2018) and serves as our attempt to replicate their results using jiant's fine-tuning code. Our replication attains a test score of 74.6, an average increase of 0.5 points over theirs. (Adjusting their 72.8 public leaderboard score to account for WNLI gives them a more comparable score of 74.1.)

4 Results

Table 1 shows our results on GLUE with and without STILTs. Supplementary training boosts performance across many of the sentence-pair tasks. Each of our models trained with STILTs improves the overall average GLUE score on the development set. For the MNLI and QNLI target tasks, we observe marginal or no gains, likely because these two tasks already have large training sets. For the two single-sentence tasks (the syntax-oriented CoLA task and the SST sentiment task), we find somewhat deteriorated performance. For CoLA, this mirrors results reported in Anonymous (2019), who show that few pretraining tasks other than language modeling offer any advantage for CoLA. The Overall Best score is computed by taking the best score for each task.

On the test set, we show similar performance gains across most tasks. Here, we also compute Best based on Dev, which reports scores obtained by choosing the best supplementary training scheme for each task according to the corresponding development set score. This is a more realistic estimate of test set performance, attaining a GLUE score of 76.9, a 2.3-point gain over the score of our baseline system adapted from Radford et al. This significantly closes the gap between Radford et al.'s model and the BERT (Devlin et al., 2018) variant with a similar number of parameters and layers, which attains a GLUE score of 78.3.

We perform the same experiment on the development set without the auxiliary language modeling objective. The results are shown in Table 3 in the Appendix. We similarly find improvements across many tasks by applying STILTs, showing that the benefits of supplementary training do not require language modeling at either the supplementary training or the fine-tuning stage.

Limited Target-Task Data Experiments

Table 2 shows the same models fine-tuned on at most 5k and at most 1k training examples for each task. For tasks with training sets that are already smaller than these limits, we use the training sets as-is. The benefits of supplementary training are generally more pronounced in these settings, with many tasks showing improvements of more than 10 points. CoLA and SST are again the exceptions: both tasks deteriorate moderately with supplementary training, and CoLA trained with the auxiliary language modeling objective in particular shows highly unstable results when trained on small amounts of data.
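The downsampling itself is straightforward; a minimal sketch (with an arbitrary random seed, since we do not specify the sampling procedure in detail here) is:

```python
# Cap a target task's training set at `max_examples`; tasks already smaller
# than the cap (e.g. RTE and WNLI at the 5k limit) are used as-is.
import random

def downsample(train_examples, max_examples=5000, seed=42):
    if len(train_examples) <= max_examples:
        return list(train_examples)
    return random.Random(seed).sample(train_examples, max_examples)
```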

We see one obvious area for potential improvement: in our experiments, we follow the recipe for fine-tuning from Radford et al.'s GPT as closely as possible. Particularly in the case of the artificially data-constrained tasks, we believe that performance can be improved with more careful tuning of the training duration (the three-epoch cut-off) and learning rates to account for the small number of training examples.

Taking into account Radford et al. (2018)'s comment that the auxiliary language modeling objective can be detrimental in data-constrained settings, we also report scores without the auxiliary objective in Table 4 in the Appendix. Although we find that, under data-constrained settings, the baseline scores without the auxiliary objective are higher than their counterparts trained with it, intermediate-task training still improves performance across many tasks. Furthermore, although the average scores for each supplementary training regime are higher with auxiliary language modeling, the Overall Best without auxiliary language modeling is higher, owing to better performance on CoLA and RTE.

ELMo on STILTs

To assess whether our improvements can help language models with different architectures, we investigate the impact of supplementary training on a model with an ELMo-based architecture (Peters et al., 2018). On the development set, the baseline model reaches 63.8, and supplementary training on MNLI (resp. QQP) brings a 2.6 (resp. 1.0) point improvement. Appendix A details the training setup for ELMo, and Table 5 shows the detailed scores. Despite large differences in language model architecture, task setup, and hyperparameters, supplementary training also helps in this setting. We leave the investigation of benefits for other language-model-like architectures such as BERT (Devlin et al., 2018) for future work.

5 Discussion

We find that sentence-pair tasks seem to benefit more from supplementary training than single-sentence ones. This is true even for the case of supplementary training on the single-sentence fake-sentence-detection task, so the benefits cannot be wholly attributed to task similarity. We also find that data-constrained tasks benefit much more from supplementary training. Indeed, when applied to RTE, supplementary training on MNLI leads to an eleven-point increase in test set score, pushing the performance of Radford et al.'s GPT model with supplementary training above that of the BERT model of similar size, which achieves a test set score of 66.4. Based on the improvements seen from applying supplementary training on the fake-sentence-detection task, which is built on the same BooksCorpus dataset that the GPT model was trained on, it is also clear that the benefits from supplementary training do not entirely stem from the trained model being exposed to different textual domains.

Applying STILTs also comes with little complexity or computational overhead. The same infrastructure used to fine-tune the GPT model can be used to implement the supplementary training. The computational cost of supplementary training is that of one additional phase of fine-tuning, which is small compared to the cost of training the original pretrained model.

However, using STILTs is not always beneficial. In particular, we show that most of our intermediate tasks were actually detrimental to the single-sentence tasks in GLUE. The interaction between the intermediate task, the target task, and the use of the auxiliary language modeling objective is a subject for further investigation. Therefore, for the best target task performance, we recommend experimenting with supplementary training on several closely related, data-rich tasks and using the development set to select the most promising approach for each task, as in the Best based on Dev formulation shown in Table 1.

6 Conclusion

This work represents only an initial investigation into the benefits of supplementary supervised pretraining. More work remains to be done to firmly establish when methods like STILTs can be productively applied and what criteria can be used to predict which combinations of intermediate and target tasks should work well. Nevertheless, in our initial work with four example intermediate training tasks, GPT on STILTs achieves a test set GLUE score of 76.9, which markedly improves on our strong pretrained Transformer baseline. We also show that in data-constrained regimes, the benefits of using STILTs are even more pronounced.

Acknowledgments

We would like to thank Nikita Nangia for her helpful feedback.


Appendix A ELMo on STILTs

Experiment setup

We use the same architecture as Peters et al. (2018) for the non-task-specific parameters. For the task-specific parameters, we use the layer weights and the task weights described in that paper, as well as a classifier composed of pooling with projection and a logistic regression classifier. In contrast to the GLUE baselines and to Anonymous (2019), we refrain from adding many non-LM-pretrained parameters by using neither pair attention nor an additional encoding layer. The whole model, including the ELMo parameters, is trained during both supplementary training on the intermediate task and target-task tuning. For two-sentence tasks, we follow the model design of Wang et al. (2018) rather than that of Radford et al. (2018), since early experiments showed better performance with the former. Consequently, we run the shared encoder on the two sentences independently and then combine the resulting representations as input to our task-specific classifier. We use the default optimizer and learning rate schedule from jiant.

Avg AvgEx CoLA SST MRPC QQP STS MNLI QNLI RTE WNLI
Training Set Size 8.5k 67k 3.7k 364k 7k 393k 108k 2.5k 634
LM 63.8 59.4 15.6 84.9 69.9/80.6 86.4/82.2 64.5/64.4 69.4 73.0 50.9 56.3
LM→QQP 64.8 61.7 16.6 87.0 73.5/82.4 *86.4/82.2 71.6/72.0 63.9 73.4 52.0 56.3
LM→MNLI 66.4 62.8 16.4 87.6 73.5/83.0 87.2/83.1 75.2/75.8 *69.4 72.4 56.3 56.3
Table 5: Results on the GLUE development set for ELMo training. Bold results are the best overall; * indicates cases where the intermediate task is the same as the target task: We substitute the baseline result for that cell. AvgEx is the average excluding MNLI and QQP. See text for discussion of WNLI results.