It has become clear over the last year that pretraining sentence encoder neural networks on unsupervised tasks, such as language modeling, then fine-tuning them on individual target tasks, can yield significantly better target task performance than could be achieved using target task training data alone (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). Large-scale unsupervised pretraining in experiments like these seems to produce pretrained sentence encoders with substantial knowledge of the target language (which, so far, is generally English). These works have shown that a mostly task-agnostic, one-size-fits-all approach to fine-tuning a large pretrained model with a thin output layer for a given task can achieve results superior to individually optimized models.
However, it is not obvious that the model parameters obtained during unsupervised pretraining should be ideally suited to supporting this kind of transfer learning. Especially when only a small amount of training data is available for the target task, experiments of this kind are potentially brittle, and rely on the pretrained encoder parameters being reasonably close to an optimal setting for the target task. During target task training, the encoder must learn and adapt enough to be able to solve the target task (potentially involving a very different input distribution and output label space than was seen in pretraining), but it must avoid adapting so much that it overfits and ceases to take advantage of what was learned during pretraining.
This work explores the possibility that the use of a second stage of pretraining with data-rich intermediate supervised tasks might mitigate this brittleness and improve both the robustness and effectiveness of the resulting target task model. We name this approach Supplementary Training on Intermediate Labeled-data Tasks (STILTs).
Experiments with sentence encoders on STILTs take the following form: (i) A model is first trained on an unlabeled-data task like language modeling that can teach it to handle data in the target language; (ii) The model is then further trained on an intermediate, labeled-data task for which ample labeled data is available; (iii) The model is finally fine-tuned further on the target task and evaluated. Our experiments evaluate STILTs as a means of improving target task performance on the GLUE benchmark suite (Wang et al., 2018), a collection of target tasks drawn from the NLP literature, using the publicly distributed OpenAI generatively pretrained Transformer (GPT) language model (Radford et al., 2018) as our pretrained encoder. We follow Radford et al. in our basic mechanism for fine-tuning on both the intermediate and final tasks, and use the following intermediate tasks: the Multi-Genre NLI Corpus (MNLI; Williams et al., 2018), the Stanford NLI Corpus (SNLI; Bowman et al., 2015), the Quora Question Pairs (QQP) dataset, and a custom fake-sentence-detection task based on the BooksCorpus dataset (Zhu et al., 2015) using a method adapted from Warstadt et al. (2018). We show that using STILTs yields significant gains across most of the GLUE tasks.
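The three steps above can be sketched in code. The following is a minimal illustration rather than our actual training code: `Stage`, `run_stilts`, `make_toy_trainer`, and the dictionary-based `Params` type are hypothetical names, and the toy trainer only counts updates instead of performing gradient descent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Params = Dict[str, float]  # hypothetical stand-in for real encoder weights


@dataclass
class Stage:
    name: str
    # Hypothetical training routine: takes encoder parameters and a freshly
    # initialized task head, and returns the updated encoder parameters.
    train: Callable[[Params, Params], Params]


def run_stilts(pretrained: Params, intermediate: Stage, target: Stage) -> Params:
    """Thread encoder parameters through steps (ii) and (iii).

    Step (i), unsupervised pretraining, is assumed to have produced `pretrained`.
    """
    encoder = dict(pretrained)  # copy so the pretrained weights stay untouched
    for stage in (intermediate, target):
        head: Params = {}  # each task starts from its own new output layer
        encoder = stage.train(encoder, head)
    return encoder


def make_toy_trainer(task: str, log: List[str]) -> Callable[[Params, Params], Params]:
    """Build a toy trainer that records its task name and counts one update."""
    def train(encoder: Params, head: Params) -> Params:
        log.append(task)
        updated = dict(encoder)
        updated["updates"] = updated.get("updates", 0.0) + 1.0
        return updated
    return train
```

In this sketch the encoder parameters are the only state carried from one stage to the next, while the output layer is re-created for each task.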
As we expect that any kind of pretraining will be most valuable in a limited training data regime, we also conduct a set of fine-tuning experiments where the model is fine-tuned on only a 1k- or 5k-example sample of the target task training set. The results show that STILTs substantially improve model performance across most tasks in this downsampled data setting. For target tasks such as MRPC, using STILTs is critical to obtaining good performance.
2 Related Work
Zhang and Bowman (2018) compare several pretraining tasks for syntactic target tasks, and find that language model pretraining reliably performs well. Peters et al. (2018) investigate the architectural choices behind ELMo-style pretraining with a fixed encoder, and find that the precise choice of encoder architecture strongly influences training speed, but has a relatively small impact on performance. In a publicly available ICLR 2019 submission, Anonymous (2019) compare a variety of tasks for pretraining in an ELMo-style setting with no encoder fine-tuning. They conclude that language modeling generally works best among candidate single tasks for pretraining, but show some cases in which a cascade of a model pretrained on language modeling followed by another model pretrained on tasks like MNLI can work well. The paper introducing BERT (Devlin et al., 2018) briefly mentions encouraging results in a direction similar to ours: One footnote notes that unpublished experiments show “substantial improvements on RTE from multi-task training with MNLI”.
In the area of sentence-to-vector sentence encoding, Conneau et al. (2018) offer one of the most comprehensive suites of diagnostic tasks, and highlight the importance of ensuring that these models preserve lexical content information.
In earlier work less closely tied to the unsupervised pretraining setup used here, Bingel and Søgaard (2017) investigate the conditions under which tasks can be productively combined in multi-task learning, and show that the success of a task combination can be determined by the shape of the learning curve during training for each task. In their words: “Multi-task gains are more likely for target tasks that quickly plateau with non-plateauing auxiliary tasks”.
In word representations, this work shares motivations with work on embedding space retrofitting (Faruqui et al., 2015), in which a labeled dataset like WordNet is used to refine representations learned by an unsupervised embedding learning algorithm before those representations can then be used in a target task.
|Training Set Size||8.5k||67k||3.7k||364k||7k||393k||108k||2.5k||634|
|Development Set Scores|
|Test Set Scores|
|Best based on Dev||76.9||75.9||47.2||93.1||87.7/83.7||88.1/70.1||85.3/84.8||80.7||87.2||69.1||65.1|
Pretrained Language Model
We primarily use the pretrained Transformer language model, or GPT model, distributed by Radford et al. (2018), which was the best publicly available pretrained sentence encoder, as measured by GLUE benchmark performance, at the time we started work. (We plan to perform parallel experiments with the newer BERT on STILTs in the near future.) Radford et al. apply the decoder-only variant of the Transformer architecture (Vaswani et al., 2017) to language modeling on the BooksCorpus dataset. We follow this line of work in using an inductive approach to transfer learning in which the model parameters learned during pretraining are used to initialize a target task model, but are not fixed and do not constrain the solution that is learned for the target task. This stands in contrast to the approach used for methods like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017) and for earlier sentence-to-vector methods like that of Subramanian et al. (2018), in which a sentence encoder component is pretrained and then attached to a target task model as a fixed input layer that cannot be further trained.
Target Tasks and Evaluation
We evaluate on the nine target tasks in the GLUE benchmark (Wang et al., 2018). These include MNLI, QQP, and seven others: acceptability classification with CoLA (Warstadt et al., 2018); binary sentiment classification with SST (Socher et al., 2013); semantic similarity with the MSR Paraphrase Corpus (MRPC; Dolan and Brockett, 2005) and STS-Benchmark (STS; Cer et al., 2017); and textual entailment with a subset of the RTE challenge corpora (Dagan et al., 2006, et seq.), and data from SQuAD (QNLI, Rajpurkar et al., 2016) and the Winograd Schema Challenge (WNLI, Levesque et al., 2011) converted to entailment format as in White et al. (2017). Because of the adversarial nature of WNLI, our models do not generally perform better than chance, and we follow the recipe of Devlin et al. (2018) by predicting the most frequent label for all examples.
|Training Set Size||8.5k||67k||3.7k||364k||7k||393k||108k||2.5k||634|
|At Most 5k Training Examples for Target Tasks|
|At Most 1k Training Examples for Target Tasks|
Most of our experiments—including all of our experiments using downsampled training sets for our target tasks—are evaluated on the development set of GLUE. Based on the results on the development set, we choose the best intermediate-task training scheme for each task and submit the best-per-task model for evaluation on the test set on the public leaderboard.
Intermediate Task Training
Most of our experiments follow the GPT approach from Radford et al. (2018), except that we add a supplementary training phase on an intermediate task before target-task fine-tuning. We call this approach GPT on STILTs. We evaluate four intermediate tasks, which are collectively meant to represent a small sample of readily available data-rich sentence-level tasks similar to those in GLUE: (i) textual entailment with MNLI; (ii) textual entailment with SNLI; (iii) paraphrase detection with QQP; and (iv) a custom fake-sentence-detection task.
Our use of MNLI is motivated by prior successes with MNLI pretraining by Conneau et al. (2018) and Subramanian et al. (2018). We include the single-genre captions-based SNLI in addition to the multi-genre MNLI to disambiguate between the benefits of domain shift and task shift from supplementary training on MNLI. QQP is included as we believed it could improve performance on sentence similarity tasks. Lastly, we include a fake-sentence-detection task based on the BooksCorpus dataset. This is a simple single-sentence task that enables us to isolate the impact of task shift from corpus shift altogether, as Radford et al.’s GPT model was similarly trained on BooksCorpus. The fake-sentence-detection task is constructed by sampling sentences from BooksCorpus, and fake sentences are generated by randomly swapping 2–4 pairs of words in the sentence. We generate a dataset of 600,000 sentences with a 50/50 real/fake split for this intermediate task.
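The construction of the fake-sentence-detection data can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the script we used: the function names `make_fake` and `build_dataset`, whitespace tokenization, the label convention (1 for real, 0 for fake), and the seeding are all assumptions. Successive swaps may occasionally cancel out and leave a sentence unchanged, which a more careful implementation would need to reject.

```python
import random
from typing import List, Tuple


def make_fake(sentence: str, rng: random.Random) -> str:
    """Corrupt a sentence by swapping 2-4 randomly chosen pairs of word positions."""
    words = sentence.split()  # assumption: simple whitespace tokenization
    if len(words) < 2:
        return sentence  # nothing to swap
    for _ in range(rng.randint(2, 4)):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def build_dataset(sentences: List[str], seed: int = 0) -> List[Tuple[str, int]]:
    """Keep half of the sentences as real (label 1) and corrupt the rest (label 0)."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    data = [(s, 1) for s in shuffled[:half]]
    data += [(make_fake(s, rng), 0) for s in shuffled[half:]]
    rng.shuffle(data)  # mix real and fake examples
    return data
```

Because only word order is changed, a fake sentence always contains exactly the same words as its source, so the classifier cannot rely on vocabulary alone.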
Unless otherwise stated, we follow the model formulation and training regime of the GPT specified in Radford et al., including using the same optimizer, learning rate schedule, and weight decay. We halve the batch size for sentence-pair tasks so that the model can run on a single GPU. We use a three-epoch training limit for both supplementary training and target-task fine-tuning, and use a fresh optimizer each time. For each task, we add only a small task-specific output layer to the pretrained Transformer model, and we follow Radford et al. in the choice of output layer and the method for handling multiple-sentence input. Radford et al. also formulate two regimes for training their model: with and without an auxiliary language modeling objective during fine-tuning. We show results from models trained using the auxiliary language modeling objective in both supplementary training and fine-tuning. Results with supplementary training and fine-tuning both without the auxiliary language modeling objective can be found in the Appendix.
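For reference, the two training regimes can be written as a single objective. The notation below follows our reading of Radford et al. (2018) and is not defined elsewhere in this paper: $L_2$ denotes the supervised task loss on the labeled dataset $\mathcal{C}$, $L_1$ the language modeling loss on the same input text, and $\lambda$ the weight of the auxiliary term.

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```

Setting $\lambda = 0$ recovers the regime without the auxiliary language modeling objective.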
For our baseline, we do not fine-tune on any intermediate task: This is equivalent to the formulation presented in Radford et al. (2018) and serves as our attempt to replicate their results using jiant’s fine-tuning code. Our replication gets a test score of 74.6, which is 0.5 points above theirs once their 72.8 public leaderboard score is adjusted to account for WNLI, giving them a more comparable score of 74.1.
Table 1 shows our results on GLUE with and without STILTs. Our addition of supplementary training boosts performance across many of the two-sentence tasks. For each of our models trained with STILTs, we show improved overall average GLUE scores on the development set. For the MNLI and QNLI target tasks, we observe marginal or no gains, likely owing to the two tasks already having large training sets. For the two single-sentence tasks (the syntax-oriented CoLA task and the SST sentiment task), we find somewhat deteriorated performance. For CoLA, this mirrors results reported in Anonymous (2019), who show that few pretraining tasks other than language modeling offer any advantage for CoLA. The Overall Best score is computed by taking the best score for each task.
On the test set, we show similar performance gains across most tasks. Here, we compute Best based on Dev, which shows scores based on choosing the best supplementary training scheme for each task based on the corresponding development set score. This is a more realistic estimate of test set performance, attaining a GLUE score of 76.9, a 2.3 point gain over the score of our baseline system adapted from Radford et al. This significantly closes the gap between Radford et al.’s model and the BERT model (Devlin et al., 2018) variant with a similar number of parameters and layers, which attains a GLUE score of 78.3.
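The Best based on Dev selection can be sketched as follows. This is a minimal illustration with hypothetical names (`Scores`, `best_based_on_dev`), and the numbers in the example are made up for illustration only; they are not results from this paper.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # training scheme -> task -> score


def best_based_on_dev(dev: Scores, test: Scores) -> Dict[str, float]:
    """For each task, report the test score of the scheme with the best dev score."""
    tasks = next(iter(dev.values())).keys()
    chosen: Dict[str, float] = {}
    for task in tasks:
        # The scheme is picked using development scores only.
        best_scheme = max(dev, key=lambda scheme: dev[scheme][task])
        chosen[task] = test[best_scheme][task]
    return chosen


# Made-up numbers, for illustration only.
dev = {"baseline": {"RTE": 50.0, "CoLA": 40.0}, "MNLI": {"RTE": 60.0, "CoLA": 35.0}}
test = {"baseline": {"RTE": 48.0, "CoLA": 41.0}, "MNLI": {"RTE": 58.0, "CoLA": 36.0}}
selected = best_based_on_dev(dev, test)
average = sum(selected.values()) / len(selected)
```

Since test scores never influence which scheme is chosen, the resulting average is not inflated by selecting on the test set, unlike a best-per-task score computed on the same data it is reported on.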
We perform the same experiment on the development set without the auxiliary language modeling objective. The results are shown in Table 3 in the Appendix. We similarly find improvements across many tasks by applying STILTs, showing that the benefits of supplementary training do not require language modeling at either the supplementary training or the fine-tuning stage.
Limited Target-Task Data Experiments
Table 2 shows the same models fine-tuned on 5k training examples and 1k examples for each task. For tasks with training sets that are already smaller than these limits, we use the training sets as-is. The benefits of supplementary training are generally more pronounced in these settings, with performance in many tasks showing improvements of more than 10 points. CoLA and SST are again the exceptions: Both tasks deteriorated moderately with supplementary training, and CoLA trained with the auxiliary language modeling objective in particular showed highly unstable results when trained on small amounts of data.
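The downsampling rule can be sketched as follows. This is a minimal illustration: the function name `downsample`, uniform sampling without replacement, and the fixed seed are assumptions rather than details specified above.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def downsample(train: Sequence[T], limit: int, seed: int = 0) -> List[T]:
    """Return at most `limit` training examples.

    Training sets that already fit within the limit are returned unchanged.
    """
    if len(train) <= limit:
        return list(train)
    # Assumption: uniform sampling without replacement, with a fixed seed.
    return random.Random(seed).sample(list(train), limit)
```

For example, under the 1k limit a training set of 634 examples is used as-is, while a training set of 8.5k examples is reduced to 1,000.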
We see one obvious area for potential improvement: In our experiments, we follow the recipe for fine-tuning from Radford et al.’s GPT as closely as possible. Particularly in the case of the artificially data-constrained tasks, we believe that performance can be improved with more careful tuning of the training duration (the three epoch cut-off) and learning rates to account for the small number of samples.
Taking into account Radford et al. (2018)’s comment that using the auxiliary language model objective can be detrimental in data-constrained settings, we also report scores without the auxiliary objective in Table 4 in the Appendix. Although we find that under data-constrained settings, the baseline scores without the auxiliary objective are higher than those of the counterparts trained with it, intermediate task training still improves performance across many tasks. Furthermore, although the average scores on each supplementary training regime are higher with auxiliary language modeling, the Overall Best without auxiliary language modeling is higher based on better performance on CoLA and RTE.
ELMo on STILTs
To assess whether our improvements can help language models with different architectures, we investigate the impact of supplementary training on a language model with an ELMo-based architecture (Peters et al., 2018). On the development set, the baseline model reaches 63.8 and supplementary training on MNLI (resp. QQP) brings a 2.6 (resp. 1.0) point improvement. Appendix A details the training setup for ELMo and Table 5 shows the detailed scores. Despite large differences in language model architecture, task setup, and hyperparameters, supplementary training also helps in this setting. We leave the investigation of benefits on other language-model-like architectures such as BERT (Devlin et al., 2018) for future work.
We find that sentence-pair tasks seem to benefit more from supplementary training than single-sentence ones. This is true even for the case of supplementary training on the single-sentence fake-sentence-detection task, so the benefits cannot be wholly attributed to task similarity. We also find that data-constrained tasks benefit much more from supplementary training. Indeed, when applied to RTE, supplementary training on MNLI leads to an eleven-point increase in test set score, pushing the performance of Radford et al.’s GPT model with supplementary training above the BERT model of similar size, which achieves a test set score of 66.4. Based on the improvements seen from applying supplementary training on the fake-sentence-detection task, which is built on the same BooksCorpus dataset that the GPT model was trained on, it is also clear that the benefits from supplementary training do not entirely stem from the trained model being exposed to different textual domains.
Applying STILTs also comes with little complexity or computational overhead. The same infrastructure used to fine-tune the GPT model can be used to implement the supplementary training. The computational cost of the supplementary training phase is another phase of fine-tuning, which is small compared to the cost of training the original model.
However, using STILTs is not always beneficial. In particular, we show that most of our intermediate tasks were actually detrimental to the single-sentence tasks in GLUE. The interaction between the intermediate task, the target task, and the use of the auxiliary language modeling objective is a subject due for further investigation. Therefore, for best target task performance, we recommend experimenting with supplementary training on several closely related data-rich tasks and using the development set to select the most promising approach for each task, as in the Best based on Dev formulation shown in Table 1.
This work represents only an initial investigation into the benefits of supplementary supervised pretraining. More work remains to be done to firmly establish when methods like STILTs can be productively applied and what criteria can be used to predict which combinations of intermediate and target tasks should work well. Nevertheless, in our initial work with four example intermediate training tasks, GPT on STILTs achieves a test set GLUE score of 76.9, which markedly improves on our strong pretrained Transformer baseline. We also show that in data-constrained regimes, the benefits of using STILTs are even more pronounced.
We would like to thank Nikita Nangia for her helpful feedback.
- Anonymous (2019) Anonymous. 2019. Looking for ELMo’s friends: Sentence-level pretraining beyond language modeling. Under review.
- Bingel and Søgaard (2017) Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 164–169, Valencia, Spain. Association for Computational Linguistics.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.
- Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14. Association for Computational Linguistics.
- Conneau et al. (2018) Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.
- Dagan et al. (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine learning challenges. Evaluating predictive uncertainty, visual object classification, and recognising textual entailment, pages 177–190. Springer.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint 1810.04805.
- Dolan and Brockett (2005) William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
- Faruqui et al. (2015) Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615. Association for Computational Linguistics.
- Gardner et al. (2017) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339. Association for Computational Linguistics.
- Levesque et al. (2011) Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In Aaai spring symposium: Logical formalizations of commonsense reasoning, volume 46, page 47.
- McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Unpublished manuscript accessible via the OpenAI Blog.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Subramanian et al. (2018) Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint 1804.07461.
- Warstadt et al. (2018) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint 1805.12471.
- White et al. (2017) Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 996–1005.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
- Zhang and Bowman (2018) Kelly Zhang and Samuel R. Bowman. 2018. Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis. arXiv preprint 1809.10040.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.
Appendix A ELMo on STILTs
We use the same architecture as Peters et al. (2018) for the non-task-specific parameters. For task-specific parameters, we use the layer weights and the task weights described in that paper, as well as a classifier composed of pooling with projection and a logistic regression classifier. In contrast to the GLUE baselines and to Anonymous (2019), we refrain from adding many non-LM pretrained parameters by using neither pair attention nor an additional encoding layer. The whole model, including ELMo parameters, is trained during both supplementary training on the intermediate task and target-task tuning. For two-sentence tasks, we follow the model design of Wang et al. (2018) rather than that of Radford et al. (2018), since early experiments showed better performance with the former. Consequently, we run the shared encoder on the two sentences u and v independently and then use [u; v; |u - v|; u * v] for our task-specific classifier. We use the default optimizer and learning rate schedule from jiant.
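One way to combine two independently encoded sentences into a single classifier input is sketched below. The feature layout [u; v; |u - v|; u * v] reflects our understanding of the sentence-pair design in the GLUE baselines and should be treated as an assumption here, as should the function name `pair_features` and the use of plain Python lists in place of tensors.

```python
from typing import List


def pair_features(u: List[float], v: List[float]) -> List[float]:
    """Concatenate [u; v; |u - v|; u * v] for two same-length sentence vectors."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same length")
    diff = [abs(a - b) for a, b in zip(u, v)]  # element-wise absolute difference
    prod = [a * b for a, b in zip(u, v)]       # element-wise product
    return u + v + diff + prod
```

The output is four times the size of a single sentence vector, and no parameters are introduced before the classifier, in keeping with the aim of adding few non-LM pretrained parameters.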