One ultimate goal of language modelling is to construct a model that, like a human, grasps general, flexible and robust meaning in language. One sign of such a model is the ability to master new tasks, or new domains of the same task, quickly. However, NLU models are typically built for a specific task on a given data domain and fail when dealing with out-of-domain data or when performing a new task. To combat this issue, several research areas in transfer learning, including domain adaptation, cross-lingual learning, multi-task learning and sequential transfer learning, have been developed to extend models to multiple tasks. However, transfer learning tends to favor high-resource tasks if not trained carefully, and it is also computationally expensive dou:19. Meta-learning algorithms try to solve this problem by training a model on a variety of tasks, which equips the model with the ability to adapt to new tasks with only a few samples.
In our case, we adopt the idea of model-agnostic meta-learning (MAML), an optimization-based meta-learning method that directly optimizes the model by constructing a useful initial representation that can be efficiently trained to perform well on various tasks finn:18. However, in a continual learning setting, where data arrives at the model sequentially, there is still the potential problem of catastrophic forgetting, in which a model trained on new tasks starts to perform worse on previous tasks. The two objectives of designing a continual learning architecture are to accelerate future learning, by exploiting existing knowledge of a task together with general knowledge from previous tasks to learn predictions on new samples quickly, and to avoid interference, so that updates from new tasks do not degrade performance on previous tasks MLRCL:19.
In this paper, we use an algorithm derived from Javed and White MLRCL:19, which applies meta-learning under continual learning. Our objective is to apply this framework to the NLP field, specifically to NLU tasks. Because the approach is model-agnostic, meta-learning under continual learning should be applicable to any language model that is optimized by gradient-based methods. We compare our results with Dou et al. dou:19, who apply meta-learning to GLUE tasks, and our MAML-Rep shows comparable results. We hope to open a new research direction in NLP focusing on such methods. Our implementation can be found at https://github.com/lexili24/NLUProject.
This section examines implementations of these methods, alone or combined, in natural language processing and other fields, which led us to develop our framework for tackling NLU tasks. Plenty of research has focused on meta-learning and continual learning individually, and some efforts have succeeded in combining the two in other fields.
There has been success in implementing MAML for NLU tasks dou:19. In that work, the authors explored the model-agnostic meta-learning algorithm (MAML) and its variants for low-resource NLU tasks and obtained impressive results on the GLUE benchmark glue:19. This shows that MAML can be applied to NLU tasks and can achieve comparable results on complex architectures like BERT and MT-DNN. However, this method does not address the potential problem of catastrophic forgetting, as the meta-trained model is tested on one task at a time.
In addition, meta-learning has been shown to excel in other natural language domains. Mi et al. mi:19 have shown promising results from incorporating MAML into natural language generation (NLG). NLG models, like many NLU models, are heavily affected by the domain they are trained on and are data-intensive, yet data is scarce because of high annotation cost. Therefore, the authors mi:19 generalize an NLG model with MAML, training on the optimization procedure to derive a meaningful initialization that adapts efficiently to new low-resource NLG scenarios. In comparison, the meta-learning approach outperformed the multi-task approach, with a higher BLEU score and a lower error percentage.
2.2 Continual Learning
Continual learning is shown to boost model performance in Liu et al. liu:19's work on computing sentence similarities. Liu et al. liu:19 leveraged continual learning to construct a simple linear sentence encoder that learns representations and computes similarities between sentences; such an application can be fed into a chat bot. A general concern is that in practice the encoder is fed a series of inputs from inconsistent corpora, and its performance might degrade if it fails to generalize common knowledge across domains. Continual learning enables zero-shot learning and allows a sentence encoder to perform well on new text domains while avoiding catastrophic forgetting liu:19. The authors evaluate results on semantic textual similarity (STS) datasets with the Pearson correlation coefficient (PCC). With a structure using a continual learning approach, Liu et al. liu:19 showed consistent results across various corpora.
Continual learning implemented for NLU tasks on top of transfer learning, as presented by Yogatama et al. Yogatama:2019, did not show generalization of the model. Yogatama et al. followed the continual learning setup to train a new task on the best SQuAD-trained BERT and ELMo models, and both architectures show catastrophic forgetting after TriviaQA or MNLI is trained, which degrades model performance on the SQuAD dataset. Their work is an attempt to derive a generative language model and provides solid ground for continual learning in language modelling.
An implementation of meta-learning under a continual framework is proposed in reinforcement learning (RL) by Al-Shedivat et al. Al-Shedivat:2017. In their paper, MAML is shown to be a complementary solution that adds onto continual adaptation in RL. Al-Shedivat et al. Al-Shedivat:2017 treated nonstationary environments as sequences of stationary tasks for RL agents, which turns a nonstationary environment into a learning-to-learn problem. They developed a gradient-based meta-learning algorithm for quick adaptation to continuously changing environments, and found that meta-learning is capable of adapting far more efficiently than baseline models in the few-shot regime. Although the implementation is outside the domain of natural language processing, it is worth noting that researchers from a different domain have implemented this method, which sheds light on how we might implement it for NLU tasks.
To sum up, MAML and continual learning have each been applied to NLP tasks separately, but not together. In reinforcement learning, meta-continual learning can handle nonstationary environments Al-Shedivat:2017. In the next section, we extend the work of Javed and White MLRCL:19 and propose an implementation combining both methods for NLP tasks.
3 Problem and Method
3.1 Problem Formation
Consider input data that consists of a stream of data $(X_1, Y_1), (X_2, Y_2), \ldots, (X_t, Y_t)$ for inputs $X_t$ and targets $Y_t$; for a continual learning task this stream can be extended to an unending stream. In our case, we concatenate batches of data in order, each batch consisting of data from one task in the GLUE or SuperGLUE benchmark glue:19 superglue:19. We follow Dou et al. dou:19's way of defining meta-training and meta-testing tasks. For high-resource tasks in meta-training, we use SST-2, QQP, MNLI and QNLI. For low-resource auxiliary tasks in meta-testing, we add RTE, BoolQ, CB, COPA, WiC and WSC from SuperGLUE to the original set of meta-testing tasks: CoLA, MRPC, STS-B and RTE.
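The ordered concatenation of task batches described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's data loader: the function name `task_stream`, the `Example` type and the toy sentences are all hypothetical stand-ins for real GLUE examples.

```python
from typing import Dict, Iterator, List, Tuple

# (text, label): a simplified, hypothetical stand-in for a GLUE example.
Example = Tuple[str, int]


def task_stream(tasks: Dict[str, List[Example]], order: List[str],
                batch_size: int) -> Iterator[Tuple[str, List[Example]]]:
    """Yield (task_name, batch) pairs, one task after another, in the given order."""
    for name in order:
        examples = tasks[name]
        for start in range(0, len(examples), batch_size):
            yield name, examples[start:start + batch_size]


# Toy data standing in for two meta-training tasks.
toy_tasks = {
    "SST-2": [("good film", 1), ("dull plot", 0), ("great cast", 1)],
    "QNLI": [("q1 [SEP] s1", 1), ("q2 [SEP] s2", 0)],
}
stream = list(task_stream(toy_tasks, order=["SST-2", "QNLI"], batch_size=2))
```

Each batch in the stream contains examples from a single task, and all batches of one task are seen before the next task begins, which is what makes forgetting of earlier tasks possible.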
Javed and White MLRCL:19 proposed a methodology that achieves meta-learning under the continual learning setting. The representation learnt from existing knowledge by meta-learning enables the model to learn new tasks quickly. Traditional MAML, proposed by Finn et al. finn:18, takes a task $\mathcal{T}_i$ sampled from $p(\mathcal{T})$ during the meta-training phase; the model is trained with $K$ samples and feedback from the corresponding loss $\mathcal{L}_{\mathcal{T}_i}$, and then tested on new samples selected from $\mathcal{T}_i$. The model is improved by looking at how the test error on new data changes with respect to the parameters. Finn's MAML learns an effective initialization, which Javed and White MLRCL:19 reframed as MAML-Rep so that it works in the online setting. OML is another approach, which attempts to alleviate catastrophic forgetting by updating online in the meta-training phase and uses meta-testing tasks to improve the accuracy of the model in general. Because most neural networks are highly sparse, OML takes advantage of this to update its parameters by constructing representations of the incoming online data points of different tasks that are either parallel, where some updates can be beneficial for many tasks, or orthogonal, where updates from later tasks do not interfere with previous tasks.
Our model architecture strictly follows the architecture proposed in MLRCL:19: both the MAML-Rep and OML objectives are tested on NLU tasks by training a pre-trained BERT model, and we call the models produced by these objectives MAML-BERT and OML-BERT. We use BERT because it is a state-of-the-art language model built on the Transformer architecture bert:18. A pre-trained BERT is chosen instead of an untrained (empty) BERT because Yogatama et al. Yogatama:2019 have shown that training a BERT with supervised tasks instead of unsupervised tasks critically degrades model performance, and this paper focuses on supervised tasks only. To explain our training and evaluation methods, a brief overview of both objectives is given below.
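For meta-training, we consider two meta-objectives to minimize: (1) a MAML-like objective and (2) the OML objective. Following MLRCL:19, the OML objective is defined as
$$\min_{W,\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathrm{OML}(W,\theta) = \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \; \sum_{\mathcal{S}_k^j \sim p(\mathcal{S}_k \mid \mathcal{T}_i)} \mathcal{L}_{CLP_i}\big(U(W,\theta,\mathcal{S}_k^j)\big),$$
where $\mathcal{S}_k^j$ is a trajectory of $k$ consecutive samples sampled from distribution $p(\mathcal{S}_k \mid \mathcal{T}_i)$, $\theta$ and $W$ are the parameters of the representation and prediction networks, $U$ applies $k$ SGD updates to $W$ along the trajectory, and $\mathcal{L}_{CLP_i}$ is the loss of the continual learning prediction problem for task $\mathcal{T}_i$. Specifically, inspired by MLRCL:19, we split our mBERT (modified BERT) model into two parts: the BERT-Base network (Representation Learning Network, RLN) and the task-specific network (Prediction Learning Network, PLN), as illustrated in Figure 1.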
Both approaches update only the PLN in the meta-training inner loop, using a support set of around 100 data points for several inner steps, and update the RLN in the outer loop using a query set of 80 data points. For meta-testing, 100 data points are fed into the model and both the RLN and PLN are updated for several inner steps. At the end of meta-testing, we compare the performance of the model right after it is trained on a specific task with its performance after it has been updated on all meta-testing tasks, to test whether the model forgets.
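The forgetting check at the end of meta-testing amounts to a per-task difference between two scores. A minimal sketch follows, assuming scores are stored in dictionaries keyed by task name; the function name `forgetting` and all numbers are hypothetical and are not results from this paper.

```python
from typing import Dict


def forgetting(score_after_task: Dict[str, float],
               score_after_all: Dict[str, float]) -> Dict[str, float]:
    """Per-task score drop: score right after training on the task
    minus score after training on every meta-testing task."""
    return {task: score_after_task[task] - score_after_all[task]
            for task in score_after_task}


# Hypothetical numbers, for illustration only (not results from the paper).
just_trained = {"task_a": 0.80, "task_b": 0.70}
after_all_tasks = {"task_a": 0.65, "task_b": 0.70}
drops = forgetting(just_trained, after_all_tasks)
```

A positive value means the model lost performance on that task after later updates; a value near zero means the task was retained.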
To minimize the objective above, our MAML-BERT model for continual learning is elaborated in Algorithm 1. For the OML-BERT continual learning model, we propose the same strategy, except that the inner update step in Algorithm 1 is changed to train on only one sample of the training data at a time. With this method, we try to mimic stochastic gradient descent in order to avoid catastrophic forgetting.
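The difference between the two inner loops can be sketched on a toy model. The sketch below is not Algorithm 1: it assumes a scalar "RLN" weight `theta` and a scalar "PLN" weight `w` in place of BERT, a squared-error loss, and a first-order approximation of the outer gradient (the dependence of the adapted `w` on `theta` is ignored). All function and parameter names are hypothetical.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def grads(theta: float, w: float, batch: Batch) -> Tuple[float, float, float]:
    """Mean squared error of y_hat = w * (theta * x), with gradients w.r.t. theta and w."""
    n = len(batch)
    loss = g_theta = g_w = 0.0
    for x, y in batch:
        err = w * theta * x - y
        loss += err * err / n
        g_theta += 2 * err * w * x / n
        g_w += 2 * err * theta * x / n
    return loss, g_theta, g_w


def meta_train_step(theta: float, w: float, support: Batch, query: Batch,
                    inner_lr: float, outer_lr: float, inner_steps: int,
                    one_sample_at_a_time: bool = False) -> Tuple[float, float]:
    """Adapt the PLN weight on the support set, then update the RLN weight on the query set."""
    w_fast = w
    if one_sample_at_a_time:
        # OML-style inner loop: one SGD update per support example, in stream order.
        for example in support:
            _, _, g_w = grads(theta, w_fast, [example])
            w_fast -= inner_lr * g_w
    else:
        # MAML-Rep-style inner loop: a few updates on the whole support batch.
        for _ in range(inner_steps):
            _, _, g_w = grads(theta, w_fast, support)
            w_fast -= inner_lr * g_w
    # Outer loop (first-order approximation): gradient of the query loss
    # w.r.t. theta, evaluated at the adapted PLN weight.
    query_loss, g_theta, _ = grads(theta, w_fast, query)
    return theta - outer_lr * g_theta, query_loss


support = [(1.0, 2.0), (2.0, 4.0)]  # toy task: y = 2x
query = [(3.0, 6.0)]
theta, losses = 0.5, []
for _ in range(20):
    theta, loss = meta_train_step(theta, 1.0, support, query,
                                  inner_lr=0.05, outer_lr=0.01, inner_steps=5)
    losses.append(loss)
```

Only the inner loop differs between the two variants; the outer update of the representation is shared. Passing `one_sample_at_a_time=True` switches to the per-sample inner loop.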
We present our results along with Dou's and our baseline, which uses one model to simply train on the auxiliary tasks in sequence. Even with forgetting that degrades model performance, our MAML-Rep still outperforms the other approaches. We experimented with different inner learning rates, inner update steps and meta-testing batch sizes, but OML did not seem to generalize well to the meta-testing tasks. CoLA is evaluated with Matthews correlation, STS-B with Pearson correlation, and the remaining two tasks with accuracy.
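For reference, the three evaluation metrics named above can be written in a few lines of plain Python. This is a sketch of the standard definitions, assuming binary 0/1 labels for Matthews correlation; it is not the evaluation code used for the reported results, and the function names are illustrative.

```python
import math
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation for binary labels in {0, 1}; returns 0.0 if undefined."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def pearson_corr(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Matthews correlation is used for CoLA because its classes are imbalanced, so accuracy alone can look good for a model that always predicts the majority class.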
In addition, we swapped the meta-testing tasks for low-resource SuperGLUE tasks and expanded the number of tasks to 5, as shown in Table 2. All tasks are evaluated with accuracy. MAML-Rep outperforms on three out of four tasks. Note that OML still struggles to do better than random guessing on the WSC task and behaves like random guessing on the WiC and COPA tasks.
In this work, we extend meta-learning under the continual learning framework to learn a general representation that is robust on a set of continual tasks and efficient to train. We replicate the method of MLRCL:19 and implement it on NLU tasks. Results show that with fewer data points we can derive a MAML-like model that is robust on testing tasks; however, when it is extended to the continual setting during the training phase, performance worsens drastically. Future directions include extending this approach to other language models, as well as experimenting with combinations of high- and low-resource tasks beyond the GLUE and SuperGLUE benchmarks to evaluate model performance.
Appendix A Appendices: Implementation Details
Our implementation is based on PyTorch, with Hugging Face models as the backbone. We use the Adam optimizer, with a batch size of 16 for both meta-training and meta-testing. The maximum sentence length is set to 64. In the meta-training and meta-testing stages, we use separate learning rates for the outer loop, where we update the RLN, and for the inner loop, where we update the PLN. We use cosine annealing in meta-training as the scheduler that updates the optimizer's learning rate. A dropout of 0.1 is applied to the PLN where applicable. We set the number of inner update steps to 5 for meta-training and 7 for meta-testing. We use a total of 128 and 112 samples for the support and query sets during meta-training, and 100 samples and the entire dev set during meta-testing.
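The cosine annealing schedule mentioned above has a simple closed form, sketched here in plain Python for illustration. The base learning rate, minimum learning rate and step count are placeholder values chosen for the example, not the settings used in our experiments, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int,
                        base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate after `step` steps of cosine annealing without restarts."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine


# Placeholder values, for illustration only.
schedule = [cosine_annealing_lr(s, total_steps=10, base_lr=1e-3) for s in range(11)]
```

The rate starts at `base_lr`, decays slowly at first, fastest around the midpoint, and flattens out as it approaches `min_lr` at the final step.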