In the continual learning framework of machine learning, a neural network performs poorly on the tasks it was trained on earliest in a sequence of successive tasks. This problem is termed catastrophic forgetting [french1999catastrophic]. Recently, various approaches have been proposed to address it, such as Hard Attention to Task (HAT) [serra2018overcoming]
and Incremental Moment Matching (IMM) [lee2017overcoming]. Elastic Weight Consolidation (EWC) [kirkpatrick2017overcoming] is one such regularization-based approach, which has been shown to preserve performance effectively even after training on multiple tasks. Section II
discusses other works in natural language processing (NLP) that have utilized EWC for overcoming catastrophic forgetting and more. However, the proposed framework uses EWC for a reason distinct from catastrophic forgetting. EWC relies on the idea of over-parameterization [kirkpatrick2017overcoming], which suggests that a neural network has more than one optimal solution for the same task. Hence, given two tasks A and B, there can be a solution to task B that lies in the low-error space of task A. EWC tries to approach this solution to maximize performance on both tasks, as seen in Figure 1. The proposed framework, Sequential Domain Adaptation (SDA), appropriates this idea for Domain Adaptation. We hypothesize that, for sentiment analysis on a specific domain, multiple solutions exist, at least one of which lies closest to the general-domain optimum. We do not have any general-domain data; instead, we rely on continual training across multiple domains to get as close as possible to the general-domain optimum, which gives the best performance on an unseen domain. Section III explains EWC and its utilization within the proposed framework. We experiment with SDA under varying continual learning strategies, orderings of training domains, and architectures, and compare the results against state-of-the-art (SotA) models. Section IV details these experiments. Our experiments show that sequential training of domains in a hardest-first, or Anti-Curriculum, ordering outperforms SotA. Curriculum Learning [bengio2009curriculum] proposes that providing a model with easier samples first leads to better generalization; Anti-Curriculum, as the name suggests, advises that training from harder to easier examples does. We discuss the effect of Curriculum and Anti-Curriculum ordering, along with other results, in Section VI.
The overall contributions of our paper are as follows:
We propose a framework, SDA, that outperforms SotA systems while employing a stricter, resource-constrained definition of Domain Adaptation.
The proposed framework is architecture-invariant, enabling even simple and fast architectures to beat complex SotA architectures.
Since the proposed framework draws from catastrophic forgetting, we contrast catastrophic forgetting and domain adaptation by comparing advanced continual learning methods with EWC, the method employed in SDA for Domain Adaptation.
Observed results show the effectiveness of an Anti-Curriculum domain ordering for better generalization across unseen domains.
II Related Work
Domain Adaptation: Most existing domain adaptation works focus on learning general-domain knowledge from just one domain rather than utilizing all of them. Recent SotA performance is recorded by pivot-based methods [ziser2018pivot, ziser2019task], and by adversarial [qu2019adversarial], semi-supervised [he2018adaptive], and domain classification [liu2018learning] based techniques. We describe the recent methods used as baselines in Section IV-D.
Elastic Weight Consolidation in NLP: In recent times, EWC [kirkpatrick2017overcoming] has been used in various Natural Language Processing tasks, such as machine translation [varis2019unsupervised, saunders2019domain, thompson2019overcoming], Language Modeling [wolf2018continuous], and Sentiment Analysis [lv2019sentiment]. Most of these works utilize Elastic Weight Consolidation to tackle catastrophic forgetting. [saunders2019domain, thompson2019overcoming]
tackle Domain Adaptation in Neural Machine Translation (NMT). They train their translation model on successive parallel corpora from different domains and use EWC to retain performance on older domains. However, these works assume the existence of general-domain data. [wolf2018continuous] use EWC to reduce forgetting between distinct high-level and low-level language modeling tasks. [lv2019sentiment] leverage shared knowledge between successive domains with EWC to maximize performance on the trained domains. They aim to improve sentiment analysis with the help of data from additional domains, an idea also reflected in earlier works such as [yang2016leveraging]. However, the proposed work strictly keeps the test domain unseen, as the aim is to use EWC to get as close as possible to a general-domain optimal solution while only observing data from multiple training domains.
Furthermore, there have also been works on domain adaptation such as incremental domain adaptation (IDA) [asghar2018progressive]. While our work resembles IDA in taking advantage of multiple domains for domain adaptation, it differs in the definition of domain adaptation. Within the IDA framework, the goal is to build a unified model that performs best on the domains the model has observed so far; hence, for [asghar2018progressive]'s work, overcoming catastrophic forgetting becomes an important task. However, in the proposed problem setting, the application of EWC is not to overcome catastrophic forgetting itself, since we are not concerned with performance on source domains, but rather to provide a framework that maximizes performance on an entirely unseen target domain. This difference between Domain Adaptation and Catastrophic Forgetting is discussed with results in Section VI-C.
III Sequential Domain Adaptation Framework and Elastic Weight Consolidation
III-A Elastic Weight Consolidation
The EWC [kirkpatrick2017overcoming] loss function is devised to train on a new task without forgetting the parameters of a previous task. This is done by adding a penalty on the squared difference between the current parameters and the previous task's parameters, weighted by each parameter's importance to the previous task. Suppose a neural network is trained on task A, with θ*_A being the optimal value of the parameters on this task. Then, while training for task B, the following loss function is used:

L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})²
L_B(θ) is the standard cross-entropy loss on task B in isolation. Within the EWC term itself, λ is a hyperparameter. The summation is over all parameters, with θ_i being a single parameter and F_i the corresponding diagonal element of the Fisher information matrix computed after training on task A. F_i correlates with the squared gradients of the parameter on task A. After training on task A, a high gradient for a parameter implies that even small changes in its value result in large changes in the predictions. Hence, the model is sensitive to such parameters, reflecting their importance to task A. Conversely, a lower gradient implies lower significance for task A; such a parameter's regularization is weighted less, and it can be updated more flexibly when training on task B.
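The computation above can be sketched in a few lines. The following numpy sketch (an illustration, not the paper's implementation) approximates the diagonal Fisher information F_i by averaging squared per-sample gradients on task A, and adds the quadratic EWC penalty to the task-B loss:

```python
import numpy as np

def fisher_diagonal(grads):
    """Approximate the diagonal Fisher information from per-sample
    gradients on task A: F_i is the mean of the squared gradients.
    grads: array of shape (n_samples, n_params)."""
    return np.mean(np.square(grads), axis=0)

def ewc_loss(loss_b, theta, theta_star_a, fisher, lam):
    """Total loss: the task-B loss plus the EWC quadratic penalty that
    anchors each parameter to its task-A optimum, weighted by F_i."""
    penalty = 0.5 * lam * np.sum(fisher * np.square(theta - theta_star_a))
    return loss_b + penalty
```

Parameters with high Fisher values are thus held close to their task-A optimum, while low-Fisher parameters remain free to move toward the task-B optimum.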
Using Figures 1 and 2, we explain our framework, which enables us to exploit Elastic Weight Consolidation for Domain Adaptation in a manner completely invariant to the model itself. Figure 1 illustrates the original EWC reasoning. [kirkpatrick2017overcoming] suggest that due to over-parameterization, there exists a low-error region in the parameter space for any task, and all parameter settings within this region are equally optimal. When training on task B, EWC moves the solution to the intersection with the low-error region of task B. We adapt this idea to domain adaptation, where each task is interpreted as a domain. If there exists a general-domain low-error space, then successive EWC training across various domains will push the solution toward this general-domain low-error space. Figure 2 demonstrates this idea: where A, B, and C are the source domains the model has been trained on, the resulting solution is pushed close to the optimum of G, the ideal general domain for which we lack any actual data. We hypothesize that, as long as training is done on a sufficient number of domains, the distance to the general-domain optimum attenuates. Such a general-domain solution maximizes performance on unseen domains, solving the problem of domain adaptation. Since the framework only uses the custom loss function from EWC, it is independent of the model architecture itself.
Given a set of training or "source" domains S = {S_1, …, S_n}, where n is the number of source domains, the proposed framework is as follows. For a model architecture f and continual learning method C, f is trained iteratively on an ordered set of the source domains determined by a domain ordering strategy O: if O yields the order (S_1, S_2, …, S_n), then training is done sequentially in that order. We test all possible orderings of domains and report our results in Tables IV and V. We get the best performance when O is Anti-Curriculum, or hardest-first, in nature; we explain the ordering strategy in Section VI-B. For every successive pair of domains S_i and S_{i+1}, the parameter training of f is constrained by the continual learning method C such that the performance of f on both S_i and S_{i+1} is maximized. The aim of domain adaptation is to develop f to perform best on a completely unseen "target" domain not contained in S. In SDA specifically, C is chosen to be EWC, but we also experiment with other, more recent continual learning methods, as explained in Section IV-C. Algorithm 1 outlines SDA.
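The sequential training loop described above can be sketched as follows. This is an illustrative skeleton, not Algorithm 1 itself; `train_on_domain` and `order_by_difficulty` are hypothetical helpers standing in for EWC-regularized training and the ordering strategy O:

```python
def sda_train(domains, order_by_difficulty, train_on_domain):
    """Sketch of the SDA loop: sort the source domains hardest-first
    (anti-curriculum), then train sequentially, carrying forward the
    parameters and the EWC anchor (the previous domain's optimum).
    `train_on_domain(domain, params, anchor)` is a hypothetical helper
    that trains with the EWC-regularized loss and returns new params."""
    params, anchor = None, None
    for domain in sorted(domains, key=order_by_difficulty, reverse=True):
        params = train_on_domain(domain, params, anchor)
        anchor = params  # new EWC anchor for the next source domain
    return params
```

The final parameters are then evaluated on the unseen target domain; no target data enters the loop at any point.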
In the following sections, we explain our experimental details and the choices of the model architecture f, the continual learning method C, and the domain ordering O.
IV Experimental Setup
We perform experiments on the standard Multi-Domain Sentiment Dataset [blitzer2007biographies]. It contains reviews from 4 domains, namely Books (B), DVD (D), Electronics (E), and Kitchen (K). Each domain has 2000 reviews, of which 1000 belong to the positive polarity and 1000 to the negative polarity. In each domain, 1280 reviews are used for training, 320 for validation, and 400 for testing. All reported results are averaged over five runs. While reporting results on a target domain, the training sequence of source domains is denoted by their initials in training order (e.g., DBE denotes training on DVD, then Books, then Electronics).
We experimented with the following neural network architectures for performing SDA. Hyperparameters of the architectures are specified in Section V:
IV-B1 CNN
This architecture is based on the popular CNN-non-static model used for sentence classification [kim2014convolutional].
IV-B2 LSTM
We used a single-layer LSTM model [hochreiter1997long]; the output at the last time-step is fed into a fully connected output layer.
IV-B3 Attention LSTM (ALSTM)
This architecture is the same as the LSTM above, except that an attention mechanism [bahdanau2014neural] is applied on the outputs across all time-steps. The weighted sum is fed into a fully connected output layer.
IV-B4 Transformer Encoder (TE)
This architecture uses a Bidirectional Transformer, also known as a Transformer Encoder [vaswani2017attention], which has been widely used for text classification [devlin2018bert]. The output from the Encoder is averaged across time-steps and fed into a fully connected output layer. We do not initialize our architecture with the weights of popular transformer encoders such as BERT [devlin2018bert]. These models first pretrain their architecture on large general-domain data; however, the proposed models assume the absence of any such general data. Hence, for a fairer comparison, the transformer encoder is randomly initialized.
IV-C Continual Learning Baselines
We compare our approach with the following continual learning methods C. Note that these baselines use the same architectures mentioned in Section IV-B; only the training procedure of SDA differs among these baselines.
IV-C1 Weight Initialization (Init)
Let S_i and S_{i+1} be two domains trained on sequentially using the neural network architecture f. The parameters of f for training on S_{i+1} are initialized with the parameters of f trained on S_i.
IV-C2 Combined Training (Comb)
In this baseline, the data from the three source domains are combined to train the neural network just once.
IV-C3 Incremental Moment Matching (IMM)
Incremental Moment Matching (IMM) [lee2017overcoming] is a method proposed for overcoming catastrophic forgetting between tasks. It incrementally matches the moments of the posterior distributions of the neural network trained on subsequent tasks. We compare our approach with two IMM techniques: IMM-Mean (Mean) and IMM-Mode (Mode).
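For intuition, Mean-IMM merges the parameters of models trained on successive tasks by a (weighted) average. A minimal numpy sketch, not the authors' implementation:

```python
import numpy as np

def imm_mean(param_list, weights=None):
    """Mean-IMM sketch: merge models trained on successive tasks by a
    (weighted) average of their parameter vectors. With no weights
    given, a plain mean is used."""
    params = np.stack(param_list)  # (n_tasks, n_params)
    if weights is None:
        weights = np.full(len(param_list), 1.0 / len(param_list))
    return np.tensordot(weights, params, axes=1)
```

Mode-IMM instead weights each parameter by its (approximate) Fisher information, favoring the task for which that parameter mattered more.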
IV-C4 Hard Attention to Task (HAT)
Hard Attention to Task (HAT) [serra2018overcoming] is also a method proposed for overcoming catastrophic forgetting between tasks. A hard attention mask is learned concurrently with every task, which preserves the previous tasks' information without affecting the learning of the current task.
IV-D State-of-the-art Baselines
The URLs of the code used for performing experiments on the SotA architectures are given in the Appendix.
Pivot Based Language Model (PBLM) [ziser2018pivot]
is a representation learning model that combines pivot-based learning with Neural Networks in a structure-aware manner. The output from PBLM model consists of a context-dependent representation vector for every input word.
[liu2018learning] learn Domain-Specific Representations (DSR) for each domain, which are then used to map adversarially trained general Bi-LSTM representations into domain-specific representations. This domain knowledge is further extended by training a memory network on a series of source domains; the memory network holds domain-specific representations for each of the source domains.
Bilingual Sentiment Embeddings (BLSE) [barnes2018projecting] casts the Domain Adaptation problem as an embedding projection task. The model takes as input embeddings from two domains and projects them into a space representing both. This projection is jointly learned with the objective of predicting the sentiment.
Domain Adaptive Semi-supervised learning (DAS) [he2018adaptive] minimizes the distance between the source and target domains in an embedded feature space. To exploit additional information about the target domain, unlabelled target-domain data is used in semi-supervised training, employing the regularization methods of entropy minimization and self-ensemble bootstrapping.
Adversarial Category Alignment Network (ACAN) [qu2019adversarial] trains an adversarial network using labeled source data and unlabelled target data. It first produces ambiguous features near the decision boundary, reducing the domain discrepancy. A feature encoder is further trained to generate the features appearing near the decision boundary.
V Implementation Details
Let x = (w_1, w_2, …, w_T) denote the input sentence represented as a sequence of words, where T is the maximum sentence length. Let V be the vocabulary of words and E ∈ R^{|V|×d} denote the pretrained word embeddings, where d is the dimension of the word embeddings. Words present in the vocabulary are initialized to their corresponding word embeddings; words not present are initialized to zeros and are updated during training. The input is thereby converted to a sequence of embedding vectors in R^{T×d}. We used a maximum sentence length of 40 and GloVe 840B pretrained embeddings (https://nlp.stanford.edu/projects/glove/).
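The embedding initialization described above can be sketched as follows. This is a simplified illustration; `vocab` and `pretrained` are hypothetical stand-ins for the dataset vocabulary and the GloVe lookup table:

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim=300):
    """Sketch of the input-layer setup: words found in the pretrained
    embeddings (e.g., GloVe 840B, dim=300) take their vectors; OOV
    words are initialized to zeros and updated during training."""
    matrix = np.zeros((len(vocab), dim))
    for idx, word in enumerate(vocab):
        if word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix
```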
V-a CNN Architecture
Convolutional layers with kernel sizes 3, 4, and 5 are applied on the input simultaneously. The number of filters is 100, and the activation function in the convolutional layers is ReLU. The outputs of the convolutional layers are concatenated and fully connected to the sigmoid layer. A dropout of 0.5 is applied on the fully connected layer. The CNN is trained for 30 epochs with a batch size of 16. Early stopping is applied if the validation loss does not decrease for 10 epochs. AdamW is used as the optimizer with a learning rate of 0.001.
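The forward computation of this architecture can be sketched in numpy (valid 1-D convolutions over the word dimension, ReLU, max-over-time pooling, concatenation). This is an illustration of the computation's shape, not the trained model:

```python
import numpy as np

def cnn_feature_dim(kernel_sizes=(3, 4, 5), n_filters=100):
    """Each kernel size yields n_filters max-pooled features; the
    concatenated feature vector feeds the sigmoid output layer."""
    return len(kernel_sizes) * n_filters

def conv1d_max_pool(x, kernels):
    """Minimal sketch of the CNN-non-static pipeline. x: (seq_len,
    emb_dim); kernels: list of (k, emb_dim, n_filters) weight tensors.
    Applies valid 1-D convolution, ReLU, then max-over-time pooling,
    and concatenates the pooled features across kernel sizes."""
    feats = []
    for w in kernels:
        k, emb_dim, _ = w.shape
        windows = np.stack([x[i:i + k].ravel()
                            for i in range(len(x) - k + 1)])
        out = np.maximum(windows @ w.reshape(k * emb_dim, -1), 0.0)  # ReLU
        feats.append(out.max(axis=0))  # max-over-time pooling
    return np.concatenate(feats)
```

With kernel sizes (3, 4, 5) and 100 filters each, the concatenated feature vector fed into the sigmoid layer is 300-dimensional.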
V-B LSTM Architecture
The input is passed through a single-layer LSTM with a hidden dimension of 100. The LSTM is trained for 25 epochs with a batch size of 15. AdamW is used as the optimizer with a learning rate of 0.001.
V-C ALSTM Architecture
The input is passed through a single-layer LSTM, and the attention mechanism is applied on the outputs of all the time-steps. The hidden dimension of the LSTM is 100, and the hidden-layer dimension of the attention mechanism is 64. The output from attention is fully connected to the sigmoid output layer. The ALSTM is trained for 30 epochs with a batch size of 35. AdamW is used as the optimizer with a learning rate of 0.0081.
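The attention pooling over LSTM outputs can be sketched as follows, assuming an additive scoring function in the style of [bahdanau2014neural]; the exact parameterization used in the paper may differ:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, W, v):
    """Sketch of additive attention pooling for ALSTM: score each
    time-step's hidden state h_t with v^T tanh(W^T h_t), softmax the
    scores into weights, and return the weighted sum of hidden states.
    H: (T, hidden); W: (hidden, attn_dim); v: (attn_dim,)."""
    scores = np.tanh(H @ W) @ v  # (T,) unnormalized scores
    alpha = softmax(scores)      # attention weights, sum to 1
    return alpha @ H, alpha      # context vector and weights
```

It is these weights alpha that are visualized in the attention-score examples of Section VI.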
V-D Bi-directional Transformer Encoder Architecture (TE)
The input is passed through a Bi-directional Transformer Encoder. The outputs from the transformer encoder are averaged across time-steps and fully connected to the sigmoid output layer. The number of encoder layers is 2, the number of heads in the multi-head attention is 5, and the dimension of the feedforward network is 256. The TE is trained for 30 epochs with a batch size of 35. AdamW is used as the optimizer with a learning rate of 0.001.
VI Results and Discussion
Figure 3 visualizes the attention scores from the ALSTM model on samples from the Kitchen (K) and DVD (D) target domains. The attention scores are shown at each step of SDA as the model encounters a new source domain. In these examples, the model correctly classifies the sentiment only after observing the third source domain. As explained in Section III, the target domain is always unseen by the model. These examples help us understand how SDA works internally. We note the following crucial points.
In the first example, where Kitchen is the target domain, after observing only DVD, the model attends to "thin", "washing", and "drying". These words describe pillowcases; even though the sentence is a washing machine review, attending to them produces an incorrect positive prediction. On encountering the second domain, the model learns that the actual sentiment of the sentence is not conveyed by these terms. However, it is only on observing the Electronics domain that the attention focuses on the crucial terms "used" and "unsatisfactory", giving the real sentiment. (As part of preprocessing, stop-words have been removed.)
In the second example, where DVD is the target domain, after encountering just Books, the focus is arbitrary, resulting in the wrong prediction. The focus improves on encountering Electronics, and lands precisely on "highly recommended movie" when encountering the final source domain.
In both examples, we see how the model, with sequential EWC training, moves from arbitrary domain-specific attention to more general domain knowledge, helping it predict a sample from an unseen domain. Without the framework, the model fails to generalize from a single source domain and makes incorrect predictions.
In the experiments described in Section IV, we compare the proposed approach with the SotA architectures; the results are charted in Table II. The choice of EWC as the continual learning method is justified in Tables I, III, VIII, and IX. Furthermore, Tables IV, V, VI, and VII compare the anti-curriculum order of training with other orders across models in SDA.
VI-A SDA and State-of-the-Art Domain Adaptation
As shown in Table II, the models trained with the SDA framework outperform the recent state-of-the-art in Domain Adaptation for Sentiment Analysis. PBLM, DAS, and ACAN [ziser2019task, ziser2018pivot, he2018adaptive, qu2019adversarial] perform Domain Adaptation in a semi-supervised setting; in other words, they use large quantities of unlabelled target-domain data for training their respective architectures. Our framework strictly keeps the target domain unseen and still outperforms them, except for ACAN on one target domain. This shows the effectiveness of the SDA framework in a resource-scarce setting. DSR, similar to the proposed framework, uses multiple source domains for learning domain representations. It relies heavily on domain classification; however, training a robust domain classifier requires much more data, so in our resource-constrained setting it performs poorly. While BLSE outperforms SDA on one target domain, it utilizes labelled target-domain data for training, whereas SDA keeps the target domain strictly unseen. This makes the domain adaptation definition of the proposed framework much more stringent. Apart from DSR, all other architectures observe the target domain in some way; the EWC-regularized training of SDA removes the need to observe the target domain. EWC-CNN and EWC-LSTM closely follow the highest results when not the highest themselves. The other architectures, EWC-ALSTM and EWC-TE, may suffer from the small dataset size, failing to generalize to unseen domains.
VI-B Effectiveness of the Anti-Curriculum Order of Training
Since SDA trains in a sequential manner, there is a multitude of orders in which the model can observe the domains: n source domains entail n! orders, i.e., six per target domain in our setting. Tables IV, V, VI, and VII show the performance of all possible domain orderings with Kitchen and Electronics as test domains, respectively (due to the page-length constraint, the domain-ordering tables for the DVD and Books test domains are kept in the appendix). For EWC, we observe that a certain ordering largely performs best. Comparing the single-source-domain setting in Table III, we see that for Kitchen as a target domain, Electronics is the easiest source domain, followed by Books and then DVD. Hence, DBE can be characterized as an "Anti-Curriculum" strategy, where the hardest domains are provided first, followed by easier ones; conversely, EBD is a curriculum strategy. DBE is followed closely by BDE, which shows that while anti-curriculum works best, there is more bias toward the last domain, and not all positions in the ordering are equally important. While the effect of curriculum strategies in text classification is well documented [cirik2016visualizing, han2017tree], the results for anti-curriculum strategies in the literature have been mixed. [hacohen2019power, weinshall2018curriculum] show that anti-curriculum strategies perform worst compared to both no-curriculum and curriculum strategies; however, the effectiveness of anti-curriculum has been demonstrated by others, such as [mccann2018natural]. Our results demonstrate that an anti-curriculum ordering of domains works best; furthermore, curriculum ordering gives one of the most unsatisfactory outcomes, even lower than the results when the model observes a single domain, as shown in Table II. This indicates that the choice between curriculum and anti-curriculum is heavily task-dependent.
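Once single-source results on a target domain are available, the anti-curriculum ordering follows directly: rank the source domains by their single-source performance, hardest (lowest accuracy) first. A sketch, where the accuracies are illustrative and not the paper's numbers:

```python
def anti_curriculum_order(single_source_acc):
    """Order source domains hardest-first, where difficulty is read
    off single-source performance on the target domain (lower
    accuracy = harder). Returns the training order for SDA."""
    return sorted(single_source_acc, key=single_source_acc.get)
```

For example, if for the Kitchen target DVD is hardest and Electronics easiest, this yields the DBE ordering discussed above.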
Previous works [hacohen2019power, weinshall2018curriculum, mccann2018natural] and the current work show that if either curriculum or anti-curriculum works, the converse strategy leads to performance lower than no curriculum.
VI-C Catastrophic Forgetting and SDA
The problems of Catastrophic Forgetting and of Domain Adaptation within the SDA framework are fundamentally different. This is shown in Tables I, III, VIII, and IX, where various recent continual learning methods are compared against EWC. Previous works [serra2018overcoming, lee2017overcoming] have established that these methods outperform EWC at overcoming catastrophic forgetting and are better at remembering multiple tasks learnt in a continual learning setting. However, across various architectures, we find that these methods are not good at Domain Adaptation. Despite being able to remember previous domain information, they fail to push the model parameters closer to a general-domain low-error region, which is what solves the problem of domain adaptation. These methods perform even worse than settings where the model encounters only a single source domain. A Domain Adaptation setting much more dependent on catastrophic forgetting is IDA [asghar2018progressive], where the model can only train on a single domain at a time, similar to the proposed SDA; however, there the model needs to perform best on all trained domains. Hence, remembering the disparate domain knowledge observed by the model becomes key to building solutions in that setting. One question that arises when studying the SDA framework is: "If the model is observing data from multiple source domains anyway, why not combine all the source domains, call the result General Domain data, and train on that?" However, as we see in Table III, the proposed framework outperforms the combined baseline across all architectures. This is because, as long as the target domain is excluded, the combined source data is never really the General Domain. A model trained on the combined source domains may perform well on all the domains contained within it, but not outside it. As explained in Figures 1 and 2, in SDA we use EWC to move to a solution space as close as possible to the actual general-domain solution, for which we can never have the data, since the target domain is unseen.
VI-D Training Time of SotA Models
Since the proposed framework enables even simple, low-parameter models such as the CNN to outperform state-of-the-art models, we also gain an advantage in training time, as shown in Figure 4. Apart from DSR, all baseline architectures use only a single target domain and yet take much more time to train. Among the proposed architectures for SDA, the CNN not only performs best, as we saw in Table II, but is also the quickest to train by a large margin. On the Electronics domain, the LSTM takes five times longer than the CNN, yet its training time is still five to thirty times less than that of PBLM and ACAN. As we saw in Table II, BLSE and ACAN outperformed SDA on DVD and Books; however, they are also ten times slower than EWC-CNN, which takes 15 and 16 seconds on these domains, respectively. This shows the efficiency of SDA at empowering low-parameter models such as the CNN to match large architectures.
VII Conclusion
In this paper, we present a new framework, SDA, to enable Domain Adaptation for Sentiment Analysis models. The proposed framework utilizes data from multiple domains but strictly keeps target-domain data unseen. The approach trains models sequentially on the source domains, with Elastic Weight Consolidation applied between successive training steps. In doing so, the framework empowers even simple model architectures to outperform complex state-of-the-art systems. SDA is tested on various models, showing its independence from the architecture. An anti-curriculum ordering of domains leads to the best performance. However, the shortcoming of such an ordering is that it requires all source domains beforehand, and that each must be individually tested against the target domain to compute its difficulty or easiness. Future work could investigate how to make SDA invariant to the domain ordering as well.
VIII Appendix: Results and Discussions
The effect of domain ordering in these algorithms is presented for the DVD target domain in Tables X and XI, and for the Books target domain in Tables XII and XIII, across all architectures. As we can observe from the aforementioned tables, our hypothesis from the results on the Kitchen and Electronics target domains holds true across the other test domains as well.
VIII-A Baseline Architectures
We provide the URLs of the code used for running the baseline architectures and continual learning methods.
IMM Mean and Mode
VIII-B SotA Implementations
We used a maximum sentence length of 40 for DAS and ACAN instead of 3000. We fine-tuned PBLM by training for 30 epochs, with early stopping imposed with a patience of 10. IMM-Mean, IMM-Mode, and HAT are trained with a batch size of 32 instead of 64.