Most previous NLP studies focus on open-domain tasks. However, due to the variety and ambiguity of natural language (Glorot et al., 2011; Song et al., 2011), models trained for one domain usually incur more errors when adapted to another domain. This is even worse for neural networks, since embedding-based neural network models usually suffer from overfitting (Peng et al., 2015). Because existing NLP models are usually trained on open-domain data, they suffer from severe performance degradation when adapted to specific domains. This motivates us to train specific models for specific domains.
The key issue in training a model for a specific domain is the insufficiency of labeled data. Transfer learning is one promising way to address this insufficiency (Jiang & Zhai, 2007). Existing studies (Daumé III, 2007; Jiang & Zhai, 2007) have shown that (1) NLP models in different domains still share many common features (e.g. common vocabularies, similar word semantics, similar sentence syntax), and (2) the corpus of the open domain is usually much richer than that of a specific domain.
Our transfer learning model follows the pre-training framework. We first pre-train the model for the source domain. Then we fine-tune the model for the target domain. Recently, some pre-trained models (e.g. BERT (Devlin et al., 2018), ELMo (Peters et al., 2018), GPT-2 (Radford et al., 2019)) have successfully learned general knowledge for text. The difference is that these models use a large-scale, domain-independent corpus for pre-training. In this paper, we use a small-scale but domain-dependent corpus as the source domain for pre-training. We argue that, for the pre-training corpus, the domain relevance will overcome the disadvantage of limited scale.
Most previous transfer learning approaches (Li et al., 2018; Ganin et al., 2016) only transfer information at the level of whole layers. This loses the information carried by individual cells in the source domain. Here “layer-wise transfer learning” means that the approach represents the whole sentence by a single vector, so the transfer mechanism is applied only to that vector. We highlight the effectiveness of precisely capturing and transferring the information of each cell from the source domain in two cases. First, in seq2seq (e.g. machine translation) or sequence labeling (e.g. POS tagging) tasks, all cells directly affect the results, so layer-wise information transfer is infeasible for these tasks. Second, even for sentence classification, cells in the source domain provide more fine-grained information for understanding the target domain. For example, in figure 1, the parameters for “hate” are insufficiently trained. The model transfers the state of “hate” from the source domain to understand it better.
Besides transferring the corresponding position’s information, the transfer learning algorithm captures the cross-domain long-term dependency. Two words that have a strong dependency on each other can have a long gap between them. Being in the insufficiently trained target domain, a word needs to represent its precise meaning by incorporating the information from its collocated words. Here “collocate” indicates that a word’s semantics can have long-term dependency on other words. To understand a word in the target domain, we need to precisely represent its collocated words from the source domain. We learn the collocated words via the attention mechanism (Bahdanau et al., 2015). For example, in figure 1, “hate” is modified by the adverb “sometimes”, which implies the act of hating is not serious. But the “sometimes” in the target domain is trained insufficiently. We need to transfer the semantics of “sometimes” in the source domain to understand the implication. Therefore, we need to carefully align word collocations between the source domain and the target domain to represent the long-term dependency.
In this paper, we propose ART (aligned recurrent transfer), a novel transfer learning mechanism that transfers cell-level information by learning to collocate cross-domain words. ART enables cell-level information transfer by directly extending each RNN cell. Each extended cell incorporates the hidden state representation corresponding to the same position, together with a function of the hidden states for all words, weighted by their attention scores.
Cell-Level Recurrent Transfer ART extends each recurrent cell by taking the states from the source domain as an extra input. While traditional layer-wise transfer learning approaches discard states of the intermediate cells, ART uses cell-level information transfer, which means each cell is affected by the transferred information. For example, in figure 1, the state of “hate” in the target domain is affected by “sometimes” and “hate” in the source domain. Thus ART transfers more fine-grained information.
Learn to Collocate and Transfer For each word in the target domain, ART learns to incorporate two types of information from the source domain: (a) the hidden state corresponding to the same word, and (b) the hidden states for all words in the sequence. Information (b) enables ART to capture the cross-domain long-term dependency. ART learns to incorporate information (b) based on the attention scores (Bahdanau et al., 2015) of all words from the source domain. Before learning to transfer, we first pre-train the neural network of the source domain. Therefore ART is able to leverage the pre-trained information from the source domain.
2 Aligned Recurrent Transfer
In this section, we elaborate on the general architecture of ART. We will show that ART precisely learns to collocate words from the source domain and to transfer their cell-level information to the target domain.
Architecture The source domain and the target domain share an RNN layer, from which the common information is transferred. We pre-train the neural network of the source domain. Therefore the shared RNN layer represents the semantics of the source domain. The target domain has an additional RNN layer. Each cell in it accepts transferred information through the shared RNN layer. Such information consists of (1) the information of the same word in the source domain (the red edge in figure 2); and (2) the information of all its collocated words (the blue edges in figure 2). ART uses attention (Bahdanau et al., 2015) to decide the weights of all candidate collocations. The RNN cell controls the weights between (1) and (2) by an update gate.
Figure 2 shows the architecture of ART. The yellow box contains the neural network for the source domain, which is a classical RNN. The green box contains the neural network for the target domain. $S_i$ and $T_i$ are the cells for the source domain and the target domain, respectively. $T_i$ takes $x_i$ as the input, which is usually a word embedding. The two neural networks overlap each other. The source domain’s neural network transfers information through the overlapping modules. We will describe the details of the architecture below. Note that although ART is only deployed over RNNs in this paper, its attentive transfer mechanism can easily be deployed over other structures (e.g. Transformer (Vaswani et al., 2017)).
RNN for the Source Domain The neural network of the source domain consists of a standard RNN. Each cell $S_i$ captures the information $h^S_{i-1}$ from the previous time step, then computes $h^S_i$ and passes it to the next time step. More formally, we define the computation of $S_i$ as:
$$h^S_i = \mathrm{RNN}(h^S_{i-1}, x^S_i; \theta_S)$$
where $\theta_S$ is the parameter (recurrent weight matrix).
Information Transfer for the Target Domain Each RNN cell in the target domain leverages the transferred information from the source domain. Unlike in the source domain, the $i$-th hidden state $h^T_i$ in the target domain is computed by:
$$h^T_i = \mathrm{RNN}(\tilde{h}^T_{i-1}, x^T_i; \theta_T)$$
where $\tilde{h}^T_{i-1}$ contains the information passed from the previous time step in the target domain ($h^T_{i-1}$) and the transferred information from the source domain ($\psi_i$). We compute it by:
$$\tilde{h}^T_{i-1} = f(h^T_{i-1}, \psi_i \mid \theta_f)$$
where $\theta_f$ is the parameter for $f$.
Note that both domains use the same RNN function with different parameters ($\theta_S$ and $\theta_T$). Intuitively, we always want to transfer the common information across domains, and we believe it is easier to represent common and shareable information with an identical network structure.
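The target-side recurrence described above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_target_rnn`, `transfer`, `merge` and `rnn_step` are hypothetical names, the source-domain states are assumed to be precomputed for the same sentence, and the three callables stand in for the transfer computation, the merge function and the RNN cell.

```python
def run_target_rnn(x_tgt, h_src, rnn_step, transfer, merge, h0):
    """Run the target-domain recurrence with cell-level transfer.

    x_tgt    : target-side inputs, one per position
    h_src    : source-domain hidden states for the same sentence (precomputed)
    transfer : (i, h_prev, h_src) -> transferred information for position i
    merge    : (h_prev, psi_i) -> merged previous state
    rnn_step : (merged_prev, x_i) -> new hidden state
    """
    states, h_prev = [], h0
    for i, x_i in enumerate(x_tgt):
        psi_i = transfer(i, h_prev, h_src)  # information taken from the source domain
        h_tilde = merge(h_prev, psi_i)      # combine it with the previous target state
        h_prev = rnn_step(h_tilde, x_i)     # ordinary recurrent update
        states.append(h_prev)
    return states


# Toy usage with scalar "states" so the data flow is easy to follow.
toy_states = run_target_rnn(
    x_tgt=[1, 2],
    h_src=[10, 20],
    rnn_step=lambda h, x: h + x,
    transfer=lambda i, h_prev, h_src: h_src[i],
    merge=lambda h_prev, psi: h_prev + psi,
    h0=0,
)
```

Because every position calls `transfer`, each target cell receives source-domain information, which is the difference from a layer-wise scheme that would only pass a single sentence vector.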
Learn to Collocate and Transfer We compute $\psi_i$ by aligning the $i$-th target cell with its collocations in the source domain. We consider two kinds of alignments: (1) The alignment with the corresponding position. This makes sense since the corresponding position holds the corresponding information of the source domain. (2) The alignments with all collocated words of the source domain. These alignments represent the long-term dependency across domains. We use a “concentrate gate” $u_i$ to control the ratio between the corresponding position and the collocated positions. We compute $\psi_i$ by:
$$u_i = \sigma(W_u h^S_i + C_u \pi_i), \qquad \psi_i = (1 - u_i) \circ \pi_i + u_i \circ h^S_i$$
where $\pi_i$ denotes the transferred information from collocated words, $\sigma$ is the sigmoid function, $\circ$ denotes element-wise multiplication, and $W_u$ and $C_u$ are parameter matrices.
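As an illustration of the concentrate gate, the sketch below mixes the same-position source state with the collocation summary. It assumes that the gate is a sigmoid over a linear function of the two inputs and that the gate weights the same-position state; the function names and the list-of-lists matrix layout are assumptions made for this example only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def concentrate_gate(h_src_i, pi_i, W_u, C_u):
    """Blend the same-position source state with the collocation summary.

    Returns (psi_i, u_i), where u_i is the element-wise gate in (0, 1).
    """
    a = matvec(W_u, h_src_i)
    b = matvec(C_u, pi_i)
    u = [sigmoid(x + y) for x, y in zip(a, b)]
    psi = [(1.0 - g) * p + g * h for g, p, h in zip(u, pi_i, h_src_i)]
    return psi, u


# With all-zero gate weights the gate is 0.5 everywhere, so psi is the midpoint.
zero = [[0.0, 0.0], [0.0, 0.0]]
psi_demo, u_demo = concentrate_gate([1.0, 3.0], [3.0, 1.0], zero, zero)
```

A gate value near 1 keeps mostly the same-position state, while a value near 0 relies on the collocated positions.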
In order to compute $\pi_i$, we use attention (Bahdanau et al., 2015) to incorporate the information of all candidate positions in the source-domain sequence. We denote the strength of the collocation intensity to position $j$ in the source domain as $\alpha_{ij}$. We merge all information of the source domain by a weighted sum according to the collocation intensity. More specifically, we define $\pi_i$ as:
$$\pi_i = \sum_{j=1}^{n} \alpha_{ij} h^S_j, \qquad \alpha_{ij} = \frac{\exp\big(a(h^T_{i-1}, h^S_j)\big)}{\sum_{j'=1}^{n} \exp\big(a(h^T_{i-1}, h^S_{j'})\big)}$$
Here $a(h^T_{i-1}, h^S_j)$ denotes the collocation intensity between the $i$-th cell in the target domain and the $j$-th cell in the source domain. The model needs to be evaluated $O(n^2)$ times for each sentence, due to the enumeration of $n$ indexes for the source domain and $n$ indexes for the target domain. Here $n$ denotes the sentence length. Following Bahdanau et al. (2015), we use a single-layer perceptron:
$$a(h^T_{i-1}, h^S_j) = \tanh(W_a h^T_{i-1} + U_a h^S_j)$$
where $W_a$ and $U_a$ are the parameter matrices. Since $U_a h^S_j$ does not depend on $i$, we can pre-compute it to reduce the computation cost.
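The attention step and the pre-computation trick can be sketched as follows. This is an illustrative sketch under assumptions: the scores are taken to be scalar, so the two weight sets are single rows (`w_a`, `u_a`); the scores are normalized with a softmax; and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def precompute_source_terms(h_src, u_a):
    """Source-side part of every score; it does not depend on the target position."""
    return [dot(u_a, h_j) for h_j in h_src]


def collocation_summary(h_tgt_prev, h_src, w_a, source_terms):
    """Return (pi_i, alpha_i): the attention-weighted sum of source states."""
    target_term = dot(w_a, h_tgt_prev)
    scores = [math.tanh(target_term + s) for s in source_terms]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(h_src[0])
    pi = [sum(a * h[k] for a, h in zip(alpha, h_src)) for k in range(dim)]
    return pi, alpha


# Toy usage: zero weights give uniform attention over the two source positions.
h_src_demo = [[1.0, 0.0], [3.0, 4.0]]
terms_demo = precompute_source_terms(h_src_demo, [0.0, 0.0])  # computed once per sentence
pi_demo, alpha_demo = collocation_summary([0.5, 0.5], h_src_demo, [0.0, 0.0], terms_demo)
```

`precompute_source_terms` runs once per sentence, while `collocation_summary` runs once per target position and reuses the cached terms, so only the target-side product is recomputed inside the loop.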
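Update New State To compute $f(h^T_{i-1}, \psi_i \mid \theta_f)$, we use an update gate $z_i$ to determine how much of the source domain’s information $\psi_i$ should be transferred. The candidate state $\tilde{\psi}_i$ is computed by merging the original input $x^T_i$, the previous cell’s hidden state $h^T_{i-1}$ and the transferred information $\psi_i$. We use a reset gate $r_i$ to determine how much of $h^T_{i-1}$ should be reset to zero for $\tilde{\psi}_i$. More specifically, we define $f(h^T_{i-1}, \psi_i \mid \theta_f)$ as:
$$\begin{aligned} f(h^T_{i-1}, \psi_i \mid \theta_f) &= (1 - z_i) \circ h^T_{i-1} + z_i \circ \tilde{\psi}_i \\ z_i &= \sigma(W_z x^T_i + U_z h^T_{i-1} + C_z \psi_i) \\ \tilde{\psi}_i &= \tanh\big(W_\psi x^T_i + U_\psi (r_i \circ h^T_{i-1}) + C_\psi \psi_i\big) \\ r_i &= \sigma(W_r x^T_i + U_r h^T_{i-1} + C_r \psi_i) \end{aligned}$$
Here the $W$, $U$ and $C$ terms are parameter matrices.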
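The gated merge can be illustrated with a GRU-style sketch. It assumes sigmoid update and reset gates that each see the input, the previous state and the transferred information, and a tanh candidate state; the parameter container `params` and its keys (`"Wz"`, `"Uz"`, `"Cz"`, and so on) are hypothetical names chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def add3(a, b, c):
    return [x + y + z for x, y, z in zip(a, b, c)]


def merge_state(x_i, h_prev, psi_i, params):
    """GRU-style merge of the previous target state with transferred information."""
    p = params
    # Update gate: how much of the candidate (which carries psi_i) to let in.
    z = [sigmoid(v) for v in add3(matvec(p["Wz"], x_i), matvec(p["Uz"], h_prev), matvec(p["Cz"], psi_i))]
    # Reset gate: how much of the previous state feeds the candidate.
    r = [sigmoid(v) for v in add3(matvec(p["Wr"], x_i), matvec(p["Ur"], h_prev), matvec(p["Cr"], psi_i))]
    reset_h = [g * h for g, h in zip(r, h_prev)]
    cand = [math.tanh(v) for v in add3(matvec(p["Wc"], x_i), matvec(p["Uc"], reset_h), matvec(p["Cc"], psi_i))]
    return [(1.0 - g) * h + g * c for g, h, c in zip(z, h_prev, cand)]


# Toy usage: with all-zero parameters the gates are 0.5 and the candidate is 0,
# so the merged state is half of the previous state.
zero = [[0.0, 0.0], [0.0, 0.0]]
params_demo = {k: zero for k in ("Wz", "Uz", "Cz", "Wr", "Ur", "Cr", "Wc", "Uc", "Cc")}
merged_demo = merge_state([1.0, 1.0], [2.0, -4.0], [0.5, 0.5], params_demo)
```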
Model Training: We first pre-train the parameters of the source domain on its training samples. Then we fine-tune the pre-trained model together with the additional layers of the target domain. The fine-tuning uses the training samples of the target domain. All parameters are jointly fine-tuned.
3 ART over LSTM
In this section, we illustrate how we deploy ART over LSTM. LSTM specifically addresses the issue of learning long-term dependency in RNN. Instead of using one hidden state for the sequential memory, each LSTM cell has two hidden states for the long-term and short-term memory. So the ART adaptation needs to separately represent information for the long-term memory and short-term memory.
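The source domain of ART over LSTM uses a standard LSTM layer. The computation of the $i$-th cell follows the standard LSTM update, written compactly as:
$$(h^S_i, c^S_i) = \mathrm{LSTM}(h^S_{i-1}, c^S_{i-1}, x^S_i; \theta_S)$$
Here $h^S_i$ and $c^S_i$ denote the short-term memory and long-term memory, respectively.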
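In the target domain, we separately incorporate the short-term and long-term memory from the source domain. More formally, we compute the $i$-th cell in the target domain by:
$$(h^T_i, c^T_i) = \mathrm{LSTM}(\tilde{h}^T_{i-1}, \tilde{c}^T_{i-1}, x^T_i; \theta_T)$$
where $\tilde{h}^T_{i-1} = f(h^T_{i-1}, \psi_{h,i} \mid \theta_{fh})$ and $\tilde{c}^T_{i-1} = f(c^T_{i-1}, \psi_{c,i} \mid \theta_{fc})$ are computed by Eq. (6) with parameters $\theta_{fh}$ and $\theta_{fc}$, respectively. $\psi_{h,i}$ and $\psi_{c,i}$ are the transferred information from the short-term memory (Eq. (12)) and the long-term memory (Eq. (13)), respectively.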
Bidirectional Network We use a bidirectional architecture so that each cell can reach the information of all words. The backward neural network accepts the inputs in reverse order. We compute the final output of ART over LSTM by concatenating, for each cell, the states from the forward neural network and the backward neural network.
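The per-cell concatenation of forward and backward states can be sketched as below. This is a generic illustration, not the authors' code: `bidirectional_states`, `forward_step` and `backward_step` are hypothetical names, and states are plain Python lists so that concatenation is list addition.

```python
def bidirectional_states(inputs, forward_step, backward_step, h0):
    """Return one concatenated [forward ; backward] state per input position."""
    forward, h = [], h0
    for x in inputs:
        h = forward_step(h, x)
        forward.append(h)

    backward, h = [], h0
    for x in reversed(inputs):  # the backward network reads the sequence in reverse
        h = backward_step(h, x)
        backward.append(h)
    backward.reverse()          # re-align backward states with the original positions

    return [f + b for f, b in zip(forward, backward)]


# Toy usage: a running sum as the "recurrent" step, with one-dimensional states.
def running_sum(h, x):
    return [h[0] + x]


bi_demo = bidirectional_states([1, 2, 3], running_sum, running_sum, [0])
```

In the toy run, the first position sees only its own input in the forward direction but the whole sequence in the backward direction, which is why the concatenated state covers all words.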
We evaluate our proposed approach on sentence classification (sentiment analysis) and sequence labeling tasks (POS tagging and NER).
All the experiments were run on a computer with an Intel Core i7 4.0GHz CPU, 32GB RAM, and a GeForce GTX 1080 Ti GPU.
Network Structure: We use a very simple network structure. The neural network consists of an embedding layer, an ART layer as described in section 3, and a task-specific output layer for the prediction. We will elaborate on the output layer and the loss function for each of the tasks below. We use 100d GloVe vectors (Pennington et al., 2014) as the initialization for ART and all its ablations.
Competitor Models We compare ART with the following ablations.
LSTM (no transfer learning): It uses a standard LSTM without transfer learning. It is only trained by the data of the target domain.
LSTM-u It uses a standard LSTM and is trained on the union of the data from the source domain and the target domain.
LSTM-s It uses a standard LSTM and is trained only on the data of the source domain. The learned parameters are then used to predict the outputs of samples in the target domain.
Layer-Wise Transfer (LWT) (no cell-level information transfer): It consists of a layer-wise transfer learning neural network. More specifically, only the last cell of the RNN layer transfers information. This cell works as in ART. LWT only works for sentence classification. We use LWT to verify the effectiveness of the cell-level information transfer.
Corresponding Cell Transfer (CCT) (no collocation information transfer): It only transfers information from the corresponding position of each cell. We use CCT to verify the effectiveness of collocating and transferring from the source domain.
We also compare ART with state-of-the-art transfer learning algorithms. For sequence labeling, we compare with hierarchical recurrent networks (HRN) (Yang et al., 2017) and FLORS (Schnabel & Schütze, 2014; Yin, 2015). For sentence classification, we compare with DANN (Ganin et al., 2016), DAmSDA (Ganin et al., 2016), AMN (Li et al., 2017), and HATN (Li et al., 2018). Note that FLORS, DANN, DAmSDA, AMN and HATN use labeled samples of the source domain and unlabeled samples of both the source domain and the target domain for training. Instead, ART and HRN use labeled samples of both domains.
4.2 Sentence Classification: Sentiment Analysis
Datasets: We use the Amazon review dataset (Blitzer et al., 2007), which has been widely used for cross-domain sentence classification. It contains reviews for four domains: books, dvd, electronics, kitchen. Each review is either positive or negative. We list the detailed statistics of the dataset in Table 1. We use the training data and development data from both domains for training and validating. And we use the testing data of the target domain for testing.
To adapt ART to sentence classification, we use a max pooling layer to merge the states of different words. Then we use a perceptron and a sigmoid function to score the probability of the given sentence being positive. We use binary cross entropy as the loss function. The dimension of each LSTM is set to 100. We use the Adam (Kingma & Ba, 2015) optimizer. We use a dropout probability of 0.5 on the max pooling layer.
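The classification head described above can be sketched in a few lines. This is a simplified illustration: the function names are hypothetical, the perceptron is assumed to be a single linear unit with weights `w` and bias `b`, and dropout is omitted.

```python
import math


def max_pool(states):
    """Element-wise maximum over the per-word state vectors."""
    return [max(column) for column in zip(*states)]


def positive_probability(states, w, b):
    """Max-pool the word states, then apply a linear unit and a sigmoid."""
    pooled = max_pool(states)
    logit = sum(wi * pi for wi, pi in zip(w, pooled)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(p, label, eps=1e-12):
    """Loss for one sentence; label is 1 (positive) or 0 (negative)."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from zero
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy usage: two word states of dimension 2 and an untrained (all-zero) head.
states_demo = [[1.0, 5.0], [3.0, 2.0]]
prob_demo = positive_probability(states_demo, [0.0, 0.0], 0.0)
loss_demo = binary_cross_entropy(prob_demo, 1)
```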
Results: We report the classification accuracy of different models in Table 2. On average, ART outperforms the no-transfer LSTM; the average accuracies and the margin are listed in Table 2. ART also outperforms its ablations and other competitors. Overall, this verifies the effectiveness of ART.
Effectiveness of Cell-Level Transfer LWT only transfers layer-wise information, whereas ART and CCT transfer more fine-grained cell-level information. On average, both CCT and ART outperform LWT (see Table 2 for the accuracies). This verifies the effectiveness of the cell-level transfer.
Effectiveness of Collocation and Transfer CCT only transfers the corresponding position’s information from the source domain. ART outperforms CCT on average (see Table 2). ART provides a more flexible way to transfer a set of positions in the source domain and to represent the long-term dependency. This verifies the effectiveness of ART in representing long-term dependency by learning to collocate and transfer.
Minimally Supervised Domain Adaptation We also evaluate ART when the target domain has far fewer training samples than the source domain. For each target domain in the Amazon review dataset, we combine the training/development data of the remaining three domains as the source domain. We show the results in Table 3. ART outperforms all the competitors by a large margin. This verifies its effectiveness in the setting of minimally supervised domain adaptation.
4.3 Sequence Labeling
We evaluate the effectiveness of ART w.r.t. sequence labeling. We use two typical tasks: POS tagging and NER (named entity recognition). The goal of POS tagging is to assign a part-of-speech tag to each word of the given sentence. The goal of NER is to extract and classify the named entities in the given sentence.
Model Settings: POS tagging and NER are multi-class labeling tasks. To adapt ART to them, we follow HRN (Yang et al., 2017) and use a CRF layer to compute the tag distribution of each word. We predict the tag with maximized probability for each word. We use categorical cross entropy as the loss function. The dimension of each LSTM cell is set to 300. By following HRN, we use the concatenation of 50d word embeddings and 50d character embeddings as the input of the ART layer. We use 50 1d filters for CNN char embedding, each with a width of 3. We use the Adagrad (Duchi et al., 2011) optimizer. We use a dropout probability of 0.5 on the max pooling layer.
Dataset: We use the same dataset settings as Yang et al. (2017). For POS tagging, we use Penn Treebank (PTB) POS tagging and a Twitter corpus (Ritter et al., 2011) as different domains. For NER, we use CoNLL 2003 (Tjong Kim Sang & De Meulder, 2003) and Twitter (Ritter et al., 2011) as different domains. Their statistics are shown in Table 4. Following Ritter et al. (2011), we use 10% of the training samples of Twitter (Twitter/0.1), 1% of the training samples of Twitter (Twitter/0.01), and 1% of the training samples of CoNLL (CoNLL/0.01) as the training data for the target domains to simulate a low-resource setting. Note that the label spaces of these tasks are different, so some baselines (e.g. LSTM-u, LSTM-s) cannot be applied.
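A low-resource split such as Twitter/0.1 or Twitter/0.01 could be simulated as sketched below. How the subsets were actually drawn is not stated here; uniform sampling without replacement with a fixed seed is an assumption, and `subsample` is a hypothetical helper.

```python
import random


def subsample(sentences, fraction, seed=0):
    """Draw a fixed-size random subset of the training sentences."""
    rng = random.Random(seed)  # local generator keeps the split reproducible
    k = max(1, round(len(sentences) * fraction))
    return rng.sample(sentences, k)


# Toy usage: 1000 placeholder sentence ids, reduced to 10% and 1%.
corpus_demo = list(range(1000))
ten_percent = subsample(corpus_demo, 0.1)
one_percent = subsample(corpus_demo, 0.01)
```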
Table 4: Dataset statistics (columns: Benchmark, Task, # Training Tokens, # Dev Tokens, # Test Tokens).
Table 5 shows the per-word accuracy (for POS tagging) and F1 (for NER) of different models. From the table, we see that ART improves the performance on all tasks. For example, when transferring from PTB to Twitter/0.1, ART outperforms HRN by 2.2%. ART performs the best among all competitors in almost all cases. This verifies the effectiveness of ART w.r.t. sequence labeling. Note that FLORS is independent of the target domain. If the training corpus of the target domain is very small (Twitter/0.01), FLORS performs better. But with richer training data for the target domain (Twitter/0.1), ART outperforms FLORS.
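For reference, the two metrics can be computed as in the following sketch. The helper names are hypothetical, and how predicted entities are matched against gold entities for F1 is left to the caller, who supplies the counts of true positives, false positives and false negatives.

```python
def per_word_accuracy(gold, pred):
    """Fraction of words whose predicted tag equals the gold tag."""
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy usage: two tagged sentences with one wrong tag out of three words.
acc_demo = per_word_accuracy([["N", "V"], ["D"]], [["N", "N"], ["D"]])
f1_demo = f1_score(tp=1, fp=1, fn=1)
```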
4.4 Visualization of the Aligned Transfer
ART aligns and transfers information from different positions in the source domain. Intuitively, we use the alignment and attention matrices to represent cross-domain word dependencies. So positions with stronger dependencies will be highlighted during the transfer. We visualize the attention matrix for sentiment analysis to verify this. We show the attention matrices for the short-term memory and for the long-term memory in figure 3.
Effectiveness of the Cell-Level Transfer In figure 3, words with stronger emotions receive more attention. For example, the word “pleased” for the long-term memory and “easy to move” for the short-term memory have strong attention for almost all words, which sits well with the intuition of cell-level transfer. The target domain wants to accept more information from the meaningful words in the source domain, not from the whole sentence. Notice that the short-term memory and the long-term memory have different attention patterns. Thus the two attention matrices represent discriminative features.
Effectiveness in Representing the Long-Term Dependency We find that the attention reflects the long-term dependency. For example, in figure 3 (b), although all words in the target domain are affected by the word “easy”, the word “very” has the highest attention. This makes sense because “very” is actually the adverb for “easy”, although they are not adjacent. ART highlights cross-domain word dependencies and therefore gives a more precise understanding of each word.
5 Related work
Neural network-based transfer learning The layer-wise transfer learning approaches (Glorot et al., 2011; Ajakan et al., 2014; Zhou et al., 2016) represent the input sequence by a non-sequential vector. These approaches cannot be applied to seq2seq or sequence labeling tasks. To tackle this problem, algorithms must transfer cell-level information in the neural network (Yang et al., 2017). Some approaches use RNNs to represent cell-level information. Ying et al. (2017) train the RNN layer with domain-independent auxiliary labels. Ziser & Reichart (2018) train the RNN layer with pivots. However, the semantics of a word can depend on its collocated words, and these approaches cannot represent the collocated words. In contrast, ART successfully represents the collocations by attention.
Pre-trained models ART uses a pre-trained model from the source domain, and fine-tunes the model with additional layers for the target domain. Recently, pre-trained models with additional layers have been shown to be effective for many downstream models (e.g. BERT (Devlin et al., 2018), ELMo (Peters et al., 2018)). As a pre-trained model, ELMo uses bidirectional LSTMs to generate contextual features. Instead, ART uses an attention mechanism within the RNN, so that each cell in the target domain directly accesses the information of all cells in the source domain. ART and these pre-trained models have different goals. ART aims at transfer learning for one task in different domains, while BERT and ELMo focus on learning general word representations or sentence representations.
In this paper, we study the problem of transfer learning for sequences. We propose the ART model to collocate and transfer cell-level information. ART has three advantages: (1) it transfers more fine-grained cell-level information, and thus can be adapted to seq2seq or sequence labeling tasks; (2) it aligns and transfers a set of collocated words in the source sentence to represent the cross-domain long-term dependency; (3) it is general and can be applied to different tasks. Besides, ART verifies the effectiveness of pre-training models on a limited but relevant training corpus.
Guangyu Zheng and Wei Wang were supported by Shanghai Software and Integrated Circuit Industry Development Project (170512).
- Ajakan et al. (2014) Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. Domain-adversarial neural networks. In NeurIPS, 2014.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, pp. 440–447, 2007.
- Daumé III (2007) Hal Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In JMLR, volume 12, pp. 2121–2159, 2011.
- Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. In JMLR, volume 17, pp. 2096–2030, 2016.
- Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011.
- Jiang & Zhai (2007) Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in NLP. In ACL, volume 7, pp. 264–271, 2007.
- Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Li et al. (2017) Zheng Li, Yu Zhang, Ying Wei, Yuxiang Wu, and Qiang Yang. End-to-end adversarial memory network for cross-domain sentiment classification. In IJCAI, 2017.
- Li et al. (2018) Zheng Li, Ying Wei, Yu Zhang, and Qiang Yang. Hierarchical attention transfer network for cross-domain sentiment classification. In AAAI, 2018.
- Peng et al. (2015) Hao Peng, Lili Mou, Ge Li, Yunchuan Chen, Yangyang Lu, and Zhi Jin. A comparative study on regularization strategies for embedding-based neural networks. In EMNLP, 2015.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, pp. 1532–1543, 2014.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
- Ritter et al. (2011) Alan Ritter, Sam Clark, Oren Etzioni, et al. Named entity recognition in tweets: an experimental study. In EMNLP, pp. 1524–1534, 2011.
- Schnabel & Schütze (2014) Tobias Schnabel and Hinrich Schütze. FLORS: Fast and simple domain adaptation for part-of-speech tagging. In TACL, volume 2, pp. 15–26, 2014.
- Song et al. (2011) Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. Short text conceptualization using a probabilistic knowledgebase. In IJCAI, pp. 2330–2336, 2011.
- Tjong Kim Sang & De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In NAACL, pp. 142–147, 2003.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pp. 5998–6008, 2017.
- Yang et al. (2017) Zhilin Yang, Ruslan Salakhutdinov, and William W Cohen. Transfer learning for sequence tagging with hierarchical recurrent networks. In ICLR, 2017.
- Yin (2015) Wenpeng Yin, Tobias Schnabel, and Hinrich Schütze. Online updating of word representations for part-of-speech tagging. In EMNLP, 2015.
- Ying et al. (2017) Ding Ying, Jianfei Yu, and Jing Jiang. Recurrent neural networks with auxiliary labels for cross-domain opinion target extraction. In AAAI, 2017.
- Zhou et al. (2016) Guangyou Zhou, Zhiwen Xie, Jimmy Xiangji Huang, and Tingting He. Bi-transferring deep neural networks for domain adaptation. In ACL, 2016.
- Ziser & Reichart (2018) Yftah Ziser and Roi Reichart. Pivot based language modeling for improved neural domain adaptation. In NAACL, volume 1, pp. 1241–1251, 2018.