An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models

02/27/2019 ∙ by Alexandra Chronopoulou, et al. ∙ National Technical University of Athens

A growing number of state-of-the-art transfer learning methods employ language models pretrained on large generic corpora. In this paper we present a conceptually simple and effective transfer learning approach that addresses the problem of catastrophic forgetting. Specifically, we combine the task-specific optimization function with an auxiliary language model objective, which is adjusted during the training process. This preserves language regularities captured by language models, while enabling sufficient adaptation for solving the target task. Our method does not require pretraining or finetuning separate components of the network and we train our models end-to-end in a single step. We present results on a variety of challenging affective and text classification tasks, surpassing well established transfer learning methods with greater level of complexity.




1 Introduction

Pretrained word representations captured by Language Models (LMs) have recently become popular in Natural Language Processing (NLP). Pretrained LMs encode contextual information and high-level features of language, modeling syntax and semantics, and produce state-of-the-art results across a wide range of tasks, such as named entity recognition Peters et al. (2017), machine translation Ramachandran et al. (2017) and text classification Howard and Ruder (2018).

However, in cases where contextual embeddings from language models are used as additional features (e.g. ELMo Peters et al. (2018)), results come at a high computational cost and require task-specific architectures. At the same time, approaches that rely on fine-tuning a LM to the task at hand (e.g. ULMFiT Howard and Ruder (2018)) depend on pretraining the model on an extensive vocabulary and on employing a sophisticated slanted triangular learning rate scheme to adapt the parameters of the LM to the target dataset.

We propose a simple and effective transfer learning approach that leverages LM contextual representations and does not require any elaborate scheduling schemes during training. We first train a LM on a Twitter corpus and then transfer its weights. We add a task-specific recurrent layer and a classification layer. The transferred model is trained end-to-end using an auxiliary LM loss, which allows us to explicitly control the weighting of the pretrained part of the model and ensure that the distilled knowledge it encodes is preserved.

Our contributions are summarized as follows: 1) We show that transfer learning from language models can achieve competitive results, while also being intuitively simple and computationally efficient. 2) We address the problem of catastrophic forgetting by adding an auxiliary LM objective and using an unfreezing method. 3) Our results show that our approach is competitive with more sophisticated transfer learning methods. Our code is publicly available (will be released in future version).

2 Related Work

Unsupervised pretraining has played a key role in deep neural networks, building on the premise that representations learned for one task can be useful for another task. In NLP, pretrained word vectors Mikolov et al. (2013); Pennington et al. (2014) are widely used, improving performance in various downstream tasks, such as part-of-speech tagging Collobert et al. (2011) and question answering Xiong et al. (2016).

Aiming to learn from unlabeled data, Dai and Le (2015) use unsupervised objectives such as sequence autoencoding and language modeling for pretraining representations. Ramachandran et al. (2017) also pretrain encoder-decoder pairs using language models and fine-tune them to a specific task. ELMo embeddings Peters et al. (2018) are obtained from character-based bidirectional language models and, used as additional contextual representations, improve the results in a variety of tasks.

In the same direction, ULMFiT Howard and Ruder (2018) shows impressive results on a variety of tasks by employing pretrained LMs. The proposed pipeline requires three distinct steps: pretraining the LM, fine-tuning it on a target dataset with an elaborate scheduling procedure, and transferring it to a classification model.

Multi-Task Learning (MTL) via hard parameter sharing Caruana (1993) in neural networks has proven to be effective in many NLP problems Collobert and Weston (2008). More recently, alternative approaches have been suggested that only share parameters across lower layers Søgaard and Goldberg (2016). By introducing part-of-speech tags at the lower levels of the network, the proposed model achieves competitive results on chunking and CCG super tagging. Our auxiliary language model objective follows this line of thought and intends to boost the performance of the higher classification layer.

3 Our Model

We introduce SiATL, which stands for Single-step Auxiliary loss Transfer Learning. In our proposed approach, we first train a LM. We then transfer its weights and add a task-specific recurrent layer to the final classifier. We also employ an auxiliary LM loss to avoid catastrophic forgetting.

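LM Pretraining. We train a word-level language model, which consists of an embedding LSTM layer Hochreiter and Schmidhuber (1997), 2 hidden LSTM layers and a linear layer. We want to minimize the negative log-likelihood of the LM:

$$\mathcal{L}(\hat{p}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T^n} \log \hat{p}\left(x_t^n \mid x_1^n, \dots, x_{t-1}^n\right)$$

where $\hat{p}(x_t^n \mid x_1^n, \dots, x_{t-1}^n)$ is the distribution of the $t$-th word in the $n$-th sentence given the $t-1$ words preceding it, $T^n$ is the length of the $n$-th sentence and $N$ is the total number of sentences.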
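As a rough illustration of this objective, the sketch below computes the sentence-averaged negative log-likelihood from per-token probabilities. The input format (one list of probabilities per sentence, each being the probability the LM assigned to the observed word given its predecessors) and the function name are assumptions made for the example, not part of the original implementation.

```python
import math


def lm_negative_log_likelihood(sentence_token_probs):
    """Sum the negative log-probabilities of all tokens, averaged over sentences.

    sentence_token_probs: list of sentences, where each sentence is a list of
    the probabilities assigned to the observed token at each position
    (hypothetical input format chosen for this sketch).
    """
    num_sentences = len(sentence_token_probs)
    total_log_prob = 0.0
    for token_probs in sentence_token_probs:
        # Inner sum over the words of one sentence.
        total_log_prob += sum(math.log(p) for p in token_probs)
    # Outer average over sentences, with the sign flipped.
    return -total_log_prob / num_sentences


# Two toy sentences with made-up token probabilities.
example_loss = lm_negative_log_likelihood([[0.5, 0.5], [0.25]])
```

A perfectly confident and correct model (all probabilities equal to 1) would give a loss of 0 under this definition.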

Transfer & auxiliary loss. We transfer the weights of the pretrained model and add one LSTM with a self-attention mechanism Lin et al. (2017); Bahdanau et al. (2015).

Figure 1: High-level overview of our proposed TL architecture. We transfer the pretrained LM and add an extra recurrent layer and an auxiliary LM loss.

In order to adapt the contribution of the pretrained model to the task at hand, we introduce an auxiliary LM loss during training. The joint loss $\mathcal{L}$ is the weighted sum of the task-specific loss $\mathcal{L}_{task}$ and the auxiliary LM loss $\mathcal{L}_{LM}$, where $\gamma$ is a weighting parameter to enable adaptation to the target task but at the same time keep the useful knowledge from the source task. Specifically:

$$\mathcal{L} = \mathcal{L}_{task} + \gamma \, \mathcal{L}_{LM}$$

Exponential decay of $\gamma$. An advantage of the proposed TL method is that the contribution of the LM can be explicitly controlled in each training epoch. In the first few epochs, the LM should contribute more to the joint loss of SiATL so that the task-specific layers adapt to the new linguistic distribution. After the knowledge of the pretrained LM is transferred to the new domain, the task-specific component of the loss function is more important and the weight $\gamma$ of the auxiliary LM loss should become smaller. In this paper, we use an exponential decay for $\gamma$ over the training epochs.
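The weighting scheme described here can be sketched as follows. The text does not give the exact decay formula, so the geometric interpolation between a start and an end value is an assumption, as are the function names and the illustrative values 0.2 and 0.1.

```python
def gamma_at_epoch(epoch, num_epochs, gamma_start, gamma_end):
    """Exponentially interpolate the LM-loss weight across training.

    epoch is 0-indexed. The weight equals gamma_start at epoch 0 and
    gamma_end at the last epoch (assumed form of the decay).
    """
    if num_epochs == 1:
        return gamma_start
    progress = epoch / (num_epochs - 1)
    return gamma_start * (gamma_end / gamma_start) ** progress


def joint_loss(task_loss, lm_loss, gamma):
    """Task-specific loss plus the auxiliary LM loss scaled by gamma."""
    return task_loss + gamma * lm_loss


# Illustrative values only: decay the weight to half its start over 5 epochs.
schedule = [gamma_at_epoch(e, 5, gamma_start=0.2, gamma_end=0.1) for e in range(5)]
```

With this form the weight shrinks by the same factor every epoch, so the LM term dominates early and the task term dominates late.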

Sequential Unfreezing. Instead of fine-tuning all the layers simultaneously, we propose unfreezing them sequentially, according to Chronopoulou et al. (2018). We first fine-tune only the extra, randomly initialized LSTM and the output layer for $n-1$ epochs. At the $n$-th epoch, we unfreeze the pretrained hidden layers. We let the model fine-tune until epoch $k-1$. Finally, at epoch $k$, we also unfreeze the embedding layer and let the network train until convergence. The values of $n$ and $k$ are obtained through hyperparameter tuning. We find the sequential unfreezing scheme important, as it minimizes the risk of overfitting to small datasets.
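The unfreezing schedule can be expressed as a small function of the current epoch. This is a minimal sketch: the two tuned thresholds are written here as n and k, epochs are assumed to be 1-indexed, and the layer-group names are hypothetical labels rather than identifiers from the original code.

```python
def trainable_groups(epoch, n, k):
    """Return the layer groups that receive gradient updates at a given epoch.

    epoch is 1-indexed and the thresholds must satisfy 1 <= n <= k.
    Group names are placeholders chosen for this sketch.
    """
    # The new, randomly initialized layers are trained from the start.
    groups = ["task_lstm", "output_layer"]
    if epoch >= n:
        # From epoch n on, the pretrained hidden LSTM layers are unfrozen.
        groups.append("pretrained_hidden_layers")
    if epoch >= k:
        # From epoch k on, the embedding layer is unfrozen as well.
        groups.append("embedding_layer")
    return groups


# Example with made-up thresholds n=3, k=5 over 6 epochs.
plan = {epoch: trainable_groups(epoch, n=3, k=5) for epoch in range(1, 7)}
```

In a PyTorch training loop, the returned groups would typically be mapped to setting `requires_grad` on the corresponding parameters at the start of each epoch.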


Model                     Irony18      Sent17       SCv2         SCv1         PsychExp
BoW                       43.7         61.0         65.1         60.9         25.8
NBoW                      45.2         63.0         61.1         51.9         20.3
P-LM                      42.7 ± 0.6   61.2 ± 0.7   69.4 ± 0.4   48.5 ± 1.5   38.3 ± 0.3
P-LM + su                 41.8 ± 1.2   62.1 ± 0.8   69.9 ± 1.0   48.4 ± 1.7   38.7 ± 1.0
P-LM + aux                45.5 ± 0.9   65.1 ± 0.6   72.6 ± 0.7   55.8 ± 1.0   40.9 ± 0.5
SiATL (P-LM + aux + su)   47.0 ± 1.1   66.5 ± 0.2   75.0 ± 0.7   56.8 ± 2.0   45.8 ± 1.6
ULMFiT (Wiki-103)         -            -            68.7 ± 0.6   56.6 ± 0.5   21.8 ± 0.3
ULMFiT (Twitter)          41.6 ± 0.7   65.6 ± 0.4   67.2 ± 0.9   44.0 ± 0.7   40.2 ± 1.1
State of the art          53.6         68.5         76.0         69.0         57.0
State-of-the-art sources: Baziotis et al. (2018), Cliche (2017), Ilic et al. (2018), Felbo et al. (2017)


Table 1: Ablation study on various downstream datasets. Average over five runs with standard deviation. BoW stands for Bag of Words, NBoW for Neural Bag of Words. P-LM stands for a classifier initialized with our pretrained LM, su for sequential unfreezing and aux for the auxiliary LM loss. In all cases, $F_1$ is employed.


We use Stochastic Gradient Descent (SGD) for the pretrained LM with a small learning rate, in order to preserve its contextual information and avoid catastrophic forgetting. However, we want the extra LSTM and softmax layer to train fast and adapt to the target task, so in that case Adam Kingma and Ba (2015) is employed.


Dataset    Domain          # classes   # examples
Irony18    Tweets          4           4618
Sent17     Tweets          3           61854
SCv2       Debate Forums   2           3260
SCv1       Debate Forums   2           1995
PsychExp   Experiences     7           7480


Table 2: Datasets used for the downstream tasks.

4 Experiments and Results

4.1 Datasets

To pretrain the language model, we collect a dataset of 20 million English Twitter messages, including approximately 2M unique tokens. We use the 70K most frequent tokens as vocabulary. We evaluate our model on five datasets: Sent17 for sentiment analysis Rosenthal et al. (2017), PsychExp for emotion recognition Wallbott and Scherer (1986), Irony18 for irony detection Van Hee et al. (2018), SCv1 and SCv2 for sarcasm detection Oraby et al. (2016); Lukin and Walker (2013). More details about the datasets can be found in Table 2.
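Restricting the vocabulary to the most frequent tokens can be sketched as below. The reserved `<unk>` entry for out-of-vocabulary tokens, the function names and the toy corpus are assumptions; the text does not say how rare tokens were handled or whether special tokens count toward the 70K.

```python
from collections import Counter


def build_vocab(tokenized_texts, max_size, unk_token="<unk>"):
    """Map the max_size most frequent tokens to integer ids.

    Id 0 is reserved for unknown tokens (an assumption of this sketch).
    """
    counts = Counter(token for text in tokenized_texts for token in text)
    vocab = {unk_token: 0}
    for token, _ in counts.most_common(max_size):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="<unk>"):
    """Replace each token by its id, falling back to the unknown id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


# Toy corpus; for the setting described in the text, max_size would be 70000.
toy_corpus = [["a", "b", "a"], ["a", "c", "b"]]
toy_vocab = build_vocab(toy_corpus, max_size=2)
toy_ids = encode(["a", "c", "b"], toy_vocab)
```

Tokens outside the retained set share a single id, so the LM's output layer stays at a fixed, manageable size regardless of how many distinct tokens the raw corpus contains.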

4.2 Experimental Setup

To preprocess the tweets, we use Ekphrasis (Baziotis et al., 2017). For the generic datasets, we use NLTK Loper and Bird (2002). For the NBoW baseline, we use word2vec Mikolov et al. (2013) 300-dimensional embeddings as features. For the neural models, we use an LM with an embedding size of 400, 2 hidden layers, 1000 neurons per layer, embedding dropout 0.1, hidden dropout 0.3 and batch size 32. We add Gaussian noise of size 0.01 to the embedding layer. A clip norm of 5 is applied, as an extra safety measure against exploding gradients. For each text classification neural network, we add on top of the transferred LM an LSTM layer of size 100 with self-attention and a softmax classification layer. For developing our models, we use PyTorch Paszke et al. (2017) and Scikit-learn Pedregosa et al. (2011).
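To make the gradient clipping step concrete, here is a plain-Python sketch of clipping by global norm. It assumes the clip norm of 5 refers to the L2 norm taken over all gradients together, which the text does not state explicitly; in PyTorch the usual way to do this is `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale gradients so that their joint L2 norm is at most max_norm.

    gradients: list of flat gradient vectors, one per parameter tensor
    (a simplified stand-in for real tensors).
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if total_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(grad) for grad in gradients], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in gradients], total_norm


# Made-up gradients whose joint norm is 10, clipped to the norm of 5 used above.
clipped, norm_before = clip_by_global_norm([[6.0], [8.0]], max_norm=5.0)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is bounded.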

5 Results & Discussion

Baselines and Comparison. Table 1 summarizes our results. The top two rows detail the baseline performance of the BoW and NBoW models. We observe that when enough data is available (e.g. Sent17), baselines provide decent results. Next, the results for the generic classifier initialized from a pretrained LM (P-LM) are shown with and without sequential unfreezing, followed by the results of the proposed model SiATL. SiATL is also directly compared with its close relative ULMFiT (trained on Wiki-103 or Twitter), which also fine-tunes a LM for classification tasks, and with the state of the art for each task. The proposed SiATL method consistently outperforms the baselines, the P-LM method and ULMFiT in all datasets. Even though we do not perform any elaborate learning rate scheduling and we limit ourselves to pretraining on Twitter, we obtain higher results in two Twitter datasets and three generic ones.

Figure 2: Results of our proposed approach (SiATL) (o) and ULMFiT (+) for different datasets as a function of the number of training examples.

Auxiliary LM objective. The effect of the auxiliary objective is highlighted in very small datasets, such as SCv1, where it results in an impressive boost in performance (7%). We hypothesize that when the classifier is simply initialized with the pretrained LM, it overfits quickly, as the target vocabulary is very limited. The auxiliary LM loss, however, permits refined adjustments to the model and fine-grained adaptation to the target task.

Exponential decay of $\gamma$. For the optimal $\gamma$ interval, we empirically find that exponentially decaying $\gamma$ to half of its initial value over the number of training epochs provides the best results for our classification tasks. A heatmap of $\gamma$ is depicted in Figure 3. We observe that small values of $\gamma$ should be employed, in order to scale the LM loss to the same order of magnitude as the classification loss over the training period.

Sequential Unfreezing. Results show that sequential unfreezing is crucial to the proposed method, as it allows the pretrained LM to adapt to the target word distribution. The performance improvement is more pronounced when there is a mismatch between the LM and task domains, i.e., the non-Twitter domain tasks. Specifically, for the PsychExp and SCv2 datasets, sequential unfreezing yields a significant improvement in $F_1$, in line with our intuition.

Number of training examples. Transfer learning is particularly useful when limited training data are available. We notice that for our largest dataset, Sent17, SiATL outperforms ULMFiT only by a small margin when trained on all the training examples available (see Table 1), while for the small SCv2 dataset, SiATL outperforms ULMFiT by a large margin and ranks very close to the state-of-the-art model Ilic et al. (2018). Moreover, the performance of SiATL vs ULMFiT as a function of the training dataset size is shown in Figure 2. Note that the proposed model achieves competitive results with fewer than 1000 training examples for the Irony18, SCv2, SCv1 and PsychExp datasets, demonstrating the robustness of SiATL even when trained on a handful of training examples.

Figure 3: Heatmap of the effect of $\gamma$ on the $F_1$-score, evaluated on SCv2. The horizontal axis depicts the initial value of $\gamma$ and the vertical axis the final value of $\gamma$.

6 Conclusions and Future Work

We introduce SiATL, a simple and efficient transfer learning method for text classification tasks. Our approach is based on pretraining a LM and transferring its weights to a classifier with a task-specific layer. The model is trained using a task-specific loss combined with an auxiliary LM loss. SiATL avoids catastrophic forgetting of the language distribution learned by the pretrained LM. Experiments on various text classification tasks yield results competitive with the state of the art, demonstrating the efficacy of our approach. Furthermore, our method outperforms more sophisticated transfer learning approaches, such as ULMFiT, in all tasks.

In future work, we plan to incorporate subword information in our LMs and experiment with bidirectional architectures. Furthermore, we plan to pretrain LMs on generic corpora. Finally, we will investigate adaptive schemes for sequential unfreezing and for the decay of $\gamma$ over training epochs.

7 Acknowledgements

We would like to thank Katerina Margatina and Georgios Paraskevopoulos for their helpful suggestions and comments. This work has been partially supported by computational time granted from the Greek Research & Technology Network (GR-NET) in the National HPC facility - ARIS. Also, the authors would like to thank NVIDIA for supporting this work by donating a TitanX GPU.