Multi-task learning is an effective approach to improve the performance of a single task with the help of other related tasks. Recently, neural-based models for multi-task learning have become very popular, ranging from computer vision Misra et al. (2016); Zhang et al. (2014) to natural language processing Collobert and Weston (2008); Luong et al. (2015), since they provide a convenient way of combining information from multiple tasks.
However, most existing work on multi-task learning Liu et al. (2016c, b) attempts to divide the features of different tasks into private and shared spaces, merely based on whether parameters of some components should be shared. As shown in Figure 1-(a), the general shared-private model introduces two feature spaces for any task: one is used to store task-dependent features, the other is used to capture shared features. The major limitation of this framework is that the shared feature space could contain some unnecessary task-specific features, while some sharable features could also be mixed in private space, suffering from feature redundancy.
Take the following two sentences as examples, extracted from two different sentiment classification tasks: movie reviews and baby product reviews.
|The infantile cart is simple and easy to use.|
|This kind of humour is infantile and boring.|
The word “infantile” indicates negative sentiment in Movie task while it is neutral in Baby task. However, the general shared-private model could place the task-specific word “infantile” in a shared space, leaving potential hazards for other tasks. Additionally, the capacity of shared space could also be wasted by some unnecessary features.
To address this problem, in this paper we propose an adversarial multi-task framework, in which the shared and private feature spaces are inherently disjoint by introducing orthogonality constraints. Specifically, we design a generic shared-private learning framework to model the text sequence. To prevent the shared and private latent feature spaces from interfering with each other, we introduce two strategies: adversarial training and orthogonality constraints. The adversarial training is used to ensure that the shared feature space simply contains common and task-invariant information, while the orthogonality constraint is used to eliminate redundant features from the private and shared spaces.
The contributions of this paper can be summarized as follows.
The proposed model divides the task-specific and shared spaces in a more precise way, rather than roughly sharing parameters.
We extend the original binary adversarial training to multi-class, which not only enables multiple tasks to be jointly trained, but allows us to utilize unlabeled data.
We can condense the shared knowledge among multiple tasks into an off-the-shelf neural layer, which can be easily transferred to new tasks.
2 Recurrent Models for Text Classification
There are many neural sentence models that can be used for text modelling, including recurrent neural networks Sutskever et al. (2014); Chung et al. (2014); Liu et al. (2015a), convolutional neural networks Collobert et al. (2011); Kalchbrenner et al. (2014), and recursive neural networks Socher et al. (2013). Here we adopt a recurrent neural network with long short-term memory (LSTM) due to its superior performance in various NLP tasks Liu et al. (2016a); Lin et al. (2017).
Long Short-term Memory
Long short-term memory network (LSTM) Hochreiter and Schmidhuber (1997) is a type of recurrent neural network (RNN) Elman (1990), and specifically addresses the issue of learning long-term dependencies. While there are numerous LSTM variants, here we use the LSTM architecture used by Jozefowicz et al. (2015), which is similar to the architecture of Graves (2013) but without peep-hole connections.
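We define the LSTM units at each time step $t$ to be a collection of vectors in $\mathbb{R}^d$: an input gate $\mathbf{i}_t$, a forget gate $\mathbf{f}_t$, an output gate $\mathbf{o}_t$, a memory cell $\mathbf{c}_t$ and a hidden state $\mathbf{h}_t$. $d$ is the number of the LSTM units. The elements of the gating vectors $\mathbf{i}_t$, $\mathbf{f}_t$ and $\mathbf{o}_t$ are in $[0, 1]$.
The LSTM is precisely specified as follows.
$$\begin{bmatrix} \tilde{\mathbf{c}}_t \\ \mathbf{o}_t \\ \mathbf{i}_t \\ \mathbf{f}_t \end{bmatrix} = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} \left( \mathbf{W}_p \begin{bmatrix} \mathbf{x}_t \\ \mathbf{h}_{t-1} \end{bmatrix} + \mathbf{b}_p \right), \tag{1}$$
$$\mathbf{c}_t = \tilde{\mathbf{c}}_t \odot \mathbf{i}_t + \mathbf{c}_{t-1} \odot \mathbf{f}_t, \tag{2}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t), \tag{3}$$
where $\mathbf{x}_t$ is the input at the current time step; $\mathbf{W}_p$ and $\mathbf{b}_p$ are parameters of affine transformation; $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication.
The update of each LSTM unit can be written compactly as
$$\mathbf{h}_t = \mathrm{LSTM}(\mathbf{h}_{t-1}, \mathbf{x}_t, \theta_p), \tag{4}$$
where $\mathrm{LSTM}(\cdot)$ is a shorthand for Eqs. (1)-(3) and $\theta_p$ represents all the parameters of the LSTM.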
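The recurrence described above can be illustrated with a minimal sketch of one LSTM step without peephole connections. It uses only the Python standard library; the function name `lstm_step`, the list-based vectors, and the row order of the stacked weight matrix (candidate cell, output gate, input gate, forget gate) are assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step without peephole connections.

    x_t: input vector (length e); h_prev, c_prev: previous hidden state and
    memory cell (length d). W has 4*d rows and e + d columns, b has 4*d
    entries. Assumed row order: candidate cell, output, input, forget gate.
    """
    d = len(h_prev)
    z = list(x_t) + list(h_prev)  # concatenation [x_t; h_{t-1}]
    # Single affine transformation producing all four pre-activations.
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    c_tilde = [math.tanh(p) for p in pre[0:d]]
    o = [sigmoid(p) for p in pre[d:2 * d]]
    i = [sigmoid(p) for p in pre[2 * d:3 * d]]
    f = [sigmoid(p) for p in pre[3 * d:4 * d]]
    # New memory cell mixes the candidate and the previous cell elementwise.
    c_t = [ct * it + cp * ft for ct, it, cp, ft in zip(c_tilde, i, c_prev, f)]
    # Hidden state is the gated, squashed memory cell.
    h_t = [ot * math.tanh(c) for ot, c in zip(o, c_t)]
    return h_t, c_t
```

Running the step over the words of a sentence and keeping the last hidden state gives the sentence representation used in the following subsection.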
Text Classification with LSTM
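Given a text sequence $x = \{x_1, x_2, \cdots, x_T\}$, we first use a lookup layer to get the vector representation (embedding) $\mathbf{x}_i$ of each word $x_i$. The output at the last moment $\mathbf{h}_T$ can be regarded as the representation of the whole sequence, which is passed to a fully connected layer followed by a softmax non-linear layer that predicts the probability distribution over classes.
$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{W} \mathbf{h}_T + \mathbf{b}), \tag{5}$$
where $\hat{\mathbf{y}}$ is the vector of prediction probabilities, $\mathbf{W}$ is the weight matrix to be learned, and $\mathbf{b}$ is a bias term.
Given a corpus with $N$ training samples $(x_i, y_i)$, the parameters of the network are trained to minimise the cross-entropy of the predicted and true distributions.
$$L(\hat{y}, y) = - \sum_{i=1}^{N} \sum_{j=1}^{C} y_i^j \log(\hat{y}_i^j), \tag{6}$$
where $y_i^j$ is the ground-truth label; $\hat{y}_i^j$ is the prediction probability, and $C$ is the class number.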
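As a hedged illustration of the output layer and training objective just described, the sketch below applies a fully connected layer and a softmax to the last hidden state and sums the cross-entropy over a batch. The function names and the one-hot label encoding are assumptions for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h_last, W, b):
    """Class probabilities from the last hidden state: softmax(W h + b)."""
    logits = [sum(w * h for w, h in zip(row, h_last)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


def cross_entropy(pred_batch, label_batch):
    """Cross-entropy summed over samples; labels are one-hot lists."""
    loss = 0.0
    for probs, onehot in zip(pred_batch, label_batch):
        # Only the entry of the true class contributes for one-hot labels.
        loss -= sum(y * math.log(p) for y, p in zip(onehot, probs) if y > 0)
    return loss
```

With all-zero weights the two-class prediction is uniform, so the loss for a single sample equals log 2, which is a convenient sanity check before training.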
3 Multi-task Learning for Text Classification
The goal of multi-task learning is to utilize the correlation among related tasks to improve classification by learning the tasks in parallel. To facilitate this, we first explain the notation used in this paper. Formally, we refer to $D_k$ as a dataset with $N_k$ samples for task $k$. Specifically,
$$D_k = \{(x_i^k, y_i^k)\}_{i=1}^{N_k},$$
where $x_i^k$ and $y_i^k$ denote a sentence and the corresponding label for task $k$.
3.1 Two Sharing Schemes for Sentence Modeling
The key factor of multi-task learning is the sharing scheme in latent feature space. In neural network based models, the latent features can be regarded as the states of hidden neurons. Specific to text classification, the latent features are the hidden states of the LSTM at the end of a sentence. The sharing schemes therefore differ in how they group the shared features. Here, we first introduce two sharing schemes for multi-task learning: the fully-shared scheme and the shared-private scheme.
Fully-Shared Model (FS-MTL)
In the fully-shared model, we use a single shared LSTM layer to extract features for all the tasks. For example, given two tasks $m$ and $n$, it takes the view that the features of task $m$ can be totally shared by task $n$ and vice versa. This model ignores the fact that some features are task-dependent. Figure 2(a) illustrates the fully-shared model.
Shared-Private Model (SP-MTL)
As shown in Figure 2(b), the shared-private model introduces two feature spaces for each task: one is used to store task-dependent features, the other is used to capture task-invariant features. Accordingly, each task is assigned a private LSTM layer and a shared LSTM layer. Formally, for any sentence in task $k$, we can compute its shared representation $\mathbf{s}_t^k$ and task-specific representation $\mathbf{h}_t^k$ as follows:
$$\mathbf{s}_t^k = \mathrm{LSTM}(\mathbf{s}_{t-1}^k, \mathbf{x}_t, \theta_s),$$
$$\mathbf{h}_t^k = \mathrm{LSTM}(\mathbf{h}_{t-1}^k, \mathbf{x}_t, \theta_k),$$
where $\mathrm{LSTM}(\cdot, \theta)$ is defined as Eq. (4).
The final features are the concatenation of the features from the private space and the shared space.
3.2 Task-Specific Output Layer
For a sentence in task $k$, its feature $\mathbf{h}^{(k)}$, emitted by the deep multi-task architectures, is ultimately fed into the corresponding task-specific softmax layer for classification or other tasks.
The parameters of the network are trained to minimise the cross-entropy of the predicted and true distributions on all the tasks. The loss can be computed as:
$$L_{Task} = \sum_{k=1}^{K} \alpha_k L(\hat{y}^{(k)}, y^{(k)}),$$
where $\alpha_k$ is the weight for task $k$ and $K$ is the number of tasks. $L(\hat{y}, y)$ is defined as Eq. 6.
4 Incorporating Adversarial Training
Although the shared-private model separates the feature space into shared and private spaces, there is no guarantee that sharable features cannot exist in the private feature space, or vice versa. Thus, some useful sharable features could be ignored in the shared-private model, and the shared feature space is also vulnerable to contamination by task-specific information.
Therefore, a simple principle can be applied to multi-task learning: a good shared feature space should contain more common information and no task-specific information. To address this problem, we introduce adversarial training into the multi-task framework as shown in Figure 3 (ASP-MTL).
4.1 Adversarial Network
Adversarial networks have recently surfaced and were first used for generative models Goodfellow et al. (2014). The goal is to learn a generative distribution $p_G(x)$ that matches the real data distribution $P_{data}(x)$. Specifically, GAN learns a generative network $G$ and a discriminative model $D$, in which $G$ generates samples from the generator distribution $p_G(x)$, and $D$ learns to determine whether a sample is from $p_G(x)$ or $P_{data}(x)$. This min-max game can be optimized by the following risk:
$$\phi = \min_G \max_D \left( \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \right)$$
While originally proposed for generating random samples, adversarial networks can be used as a general tool to measure equivalence between distributions Taigman et al. (2016). Formally, Ajakan et al. (2014) linked the adversarial loss to the $\mathcal{H}$-divergence between two distributions and successfully achieved unsupervised domain adaptation with adversarial networks. This is motivated by the theory of domain adaptation Ben-David et al. (2010, 2007); Bousmalis et al. (2016), which holds that a transferable feature is one for which an algorithm cannot learn to identify the domain of origin of the input observation.
4.2 Task Adversarial Loss for MTL
Inspired by adversarial networks Goodfellow et al. (2014), we propose an adversarial shared-private model for multi-task learning, in which a shared recurrent neural layer works adversarially against a learnable multi-layer perceptron, preventing it from making an accurate prediction about the types of tasks. This adversarial training encourages the shared space to be purer and ensures that the shared representation is not contaminated by task-specific features.
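The discriminator is used to map the shared representation of sentences into a probability distribution, estimating which task the encoded sentence comes from.
$$D(\mathbf{s}_T^k, \theta_D) = \mathrm{softmax}(\mathbf{b} + \mathbf{U} \mathbf{s}_T^k),$$
where $\mathbf{U}$ is a learnable parameter and $\mathbf{b}$ is a bias.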
Different from most existing multi-task learning algorithms, we add an extra task adversarial loss $L_{Adv}$ to prevent task-specific features from creeping into the shared space. The task adversarial loss is used to train a model to produce shared features such that a classifier cannot reliably predict the task based on these features. The original loss of adversarial networks is limited since it can only be used in the binary situation. To overcome this, we extend it to a multi-class form, which allows our model to be trained together with multiple tasks:
$$L_{Adv} = \min_{\theta_s} \max_{\theta_D} \left( \sum_{k=1}^{K} \sum_{i=1}^{N_k} d_i^k \log \left[ D(E(\mathbf{x}^k)) \right] \right),$$
where $d_i^k$ denotes the ground-truth label indicating the type of the current task and $E$ denotes the shared feature extractor. Here, there is a min-max optimization and the basic idea is that, given a sentence, the shared LSTM generates a representation to mislead the task discriminator. At the same time, the discriminator tries its best to make a correct classification on the type of task. After the training phase, the shared feature extractor and task discriminator reach a point at which neither can improve and the discriminator is unable to differentiate among all the tasks.
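To make the multi-class adversarial objective more concrete, the following sketch computes the quantity both players care about: the summed log-probability that the discriminator assigns to the true task of each shared representation. It assumes a linear layer followed by a softmax as the discriminator and integer task ids; the names `task_discriminator` and `discriminator_log_likelihood` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def task_discriminator(s, U, b):
    """Distribution over tasks for one shared representation: softmax(b + U s)."""
    logits = [bias + sum(u * v for u, v in zip(row, s))
              for row, bias in zip(U, b)]
    return softmax(logits)


def discriminator_log_likelihood(shared_reprs, task_ids, U, b):
    """Sum of log-probabilities assigned to the true task of each sentence.

    The discriminator updates U and b to maximise this value, while the
    shared encoder is updated to minimise it. Only the task id of each
    sentence is needed, not its sentiment label.
    """
    total = 0.0
    for s, k in zip(shared_reprs, task_ids):
        total += math.log(task_discriminator(s, U, b)[k])
    return total
```

When the discriminator can do no better than chance over K tasks, each term equals log(1/K), which corresponds to the equilibrium described in the text.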
Semi-supervised Multi-task Learning
We notice that $L_{Adv}$ requires only the input sentence $x$ and does not require the corresponding label $y$, which makes it possible to combine our model with semi-supervised learning. In this semi-supervised multi-task learning framework, our model can not only utilize the data from related tasks, but also employ abundant unlabeled corpora.
4.3 Orthogonality Constraints
We notice that there is a potential drawback of the above model. That is, the task-invariant features can appear both in shared space and private space.
Motivated by recent work Jia et al. (2010); Salzmann et al. (2010); Bousmalis et al. (2016) on shared-private latent space analysis, we introduce orthogonality constraints, which penalize redundant latent representations and encourage the shared and private extractors to encode different aspects of the inputs.
After exploring many optional methods, we find that the loss below, which is used by Bousmalis et al. (2016), is optimal and achieves a better performance:
$$L_{Diff} = \sum_{k=1}^{K} \left\| {\mathbf{S}^k}^{\top} \mathbf{H}^k \right\|_F^2,$$
where $\| \cdot \|_F^2$ is the squared Frobenius norm. $\mathbf{S}^k$ and $\mathbf{H}^k$ are two matrices, whose rows are the outputs of the shared extractor and the task-specific extractor for an input sentence.
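The penalty can be sketched as follows, assuming it takes the form used in domain separation networks: the squared Frobenius norm of the product of the transposed shared feature matrix and the private feature matrix. The function name and the nested-list matrix representation are assumptions for this example.

```python
def orthogonality_penalty(S, H):
    """Squared Frobenius norm of S^T H.

    S and H are lists of rows with the same number of rows; row r holds the
    shared (S) or task-specific (H) features of the r-th input. The penalty
    is zero when every shared feature column is orthogonal to every private
    feature column.
    """
    n_rows = len(S)
    total = 0.0
    for a in range(len(S[0])):
        for b in range(len(H[0])):
            # Entry (a, b) of S^T H is the inner product of column a of S
            # with column b of H.
            entry = sum(S[r][a] * H[r][b] for r in range(n_rows))
            total += entry * entry
    return total
```

For a model with several tasks, the per-task penalties would be summed, one call per task.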
4.4 Put It All Together
The final loss function of our model can be written as:
$$L = L_{Task} + \lambda L_{Adv} + \gamma L_{Diff},$$
where $\lambda$ and $\gamma$ are hyper-parameters.
The networks are trained with backpropagation, and this minimax optimization becomes possible via the use of a gradient reversal layer Ganin and Lempitsky (2015).
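A gradient reversal layer acts as the identity in the forward pass and flips the sign of the gradient in the backward pass, so a single backpropagation step lets the discriminator improve while pushing the shared encoder in the opposite direction. The sketch below shows that behaviour with explicit `forward` and `backward` methods, together with a weighted sum of the three losses. The class name, the `scale` parameter, and the argument names `lam` and `gamma` are assumptions for illustration, and the weighted-sum form is assumed from the two hyper-parameters mentioned in the text.

```python
class GradientReversal:
    """Identity in the forward pass, negated (scaled) gradient backward."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged on the way to the discriminator.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the shared encoder is reversed.
        return [-self.scale * g for g in grad_output]


def total_loss(task_loss, adv_loss, diff_loss, lam, gamma):
    """Weighted sum of task, adversarial and orthogonality losses."""
    return task_loss + lam * adv_loss + gamma * diff_loss
```

In an automatic differentiation framework the same idea is usually expressed as a custom operation with a user-defined backward rule placed between the shared encoder and the task discriminator.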
To make an extensive evaluation, we collect 16 different datasets from several popular review corpora.
The first 14 datasets are product reviews, which contain Amazon product reviews from different domains, such as Books, DVDs, Electronics, etc. The goal is to classify a product review as either positive or negative. These datasets are collected based on the raw data (https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) provided by Blitzer et al. (2007). Specifically, we extract the sentences and corresponding labels from the unprocessed original data (Blitzer et al. (2007) also provide two extra processed datasets in Bag-of-Words format, which are not suitable for neural-based models). The only preprocessing applied to these sentences is tokenization with the Stanford tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml).
The remaining two datasets are about movie reviews. The IMDB dataset (https://www.cs.jhu.edu/~mdredze/datasets/sentiment/unprocessed.tar.gz) consists of movie reviews with binary classes Maas et al. (2011). One key aspect of this dataset is that each movie review has several sentences. The MR dataset also consists of movie reviews, taken from the Rotten Tomatoes website, with two classes (https://www.cs.cornell.edu/people/pabo/movie-review-data/) Pang and Lee (2005).
All the datasets in each task are partitioned randomly into training set, development set and testing set with the proportion of 70%, 20% and 10% respectively. The detailed statistics about all the datasets are listed in Table 1.
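The random 70/20/10 partition can be sketched as below. The function name `split_dataset` and the fixed seed are assumptions for reproducibility of the example; the source does not state how the shuffling was done.

```python
import random


def split_dataset(samples, seed=0):
    """Randomly partition samples into train/dev/test with a 70/20/10 ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(items) * 7 // 10
    n_dev = len(items) * 2 // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test
```

Applying the function separately to each of the 16 datasets keeps the three subsets disjoint within every task.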
Table 2 (column groups): |Task|Single Task|Multiple Tasks|
5.2 Competitor Methods for Multi-task Learning
Previous work has proposed various multi-task frameworks, but not all of them can be applied to the tasks we focus on. Nevertheless, we choose the two most related neural models for multi-task learning and implement them as competitor methods.
MT-CNN: This model is proposed by Collobert and Weston (2008) with a convolutional layer, in which lookup-tables are shared partially while other layers are task-specific.
MT-DNN: This model is proposed by Liu et al. (2015b) with bag-of-words input and multi-layer perceptrons, in which a hidden layer is shared.
The word embeddings for all of the models are initialized with the 200d GloVe vectors Pennington et al. (2014). The other parameters are initialized by randomly sampling from uniform distribution in. The mini-batch size is set to 16.
For each task, we take the hyperparameters which achieve the best performance on the development set via a small grid search over combinations of the initial learning rate, $\lambda$, and $\gamma$. Finally, we chose the learning rate as , $\lambda$ as and $\gamma$ as .
Table 3 (column groups): |Source Tasks|Single Task|Transfer Models|
5.4 Performance Evaluation
Table 2 shows the error rates on 16 text classification tasks. The column of “Single Task” shows the results of vanilla LSTM, bidirectional LSTM (BiLSTM), stacked LSTM (sLSTM) and the average error rates of the previous three models. The column of “Multiple Tasks” shows the results achieved by the corresponding multi-task models. From this table, we can see that the performance of most tasks can be improved by a large margin with the help of multi-task learning, in which our model achieves the lowest error rates. More concretely, compared with SP-MTL, ASP-MTL achieves average improvement surpassing SP-MTL with , which indicates the importance of adversarial learning. It is noteworthy that for FS-MTL, the performance on some tasks is degraded, since this model puts all private and shared information into a unified space.
5.5 Shared Knowledge Transfer
With the help of adversarial learning, the shared feature extractor can generate purer task-invariant representations, which can be considered as off-the-shelf knowledge and then be used for unseen new tasks.
To test the transferability of our learned shared extractor, we also design an experiment in which we take turns choosing 15 tasks to train our model with multi-task learning; the learned shared layer is then transferred to a second network that is used for the remaining task. The parameters of the transferred layer are kept frozen, and the rest of the parameters of the network are randomly initialized.
More formally, we investigate two mechanisms towards the transferred shared extractor, as shown in Figure 4. The first, the Single Channel (SC) model, consists of one shared feature extractor transferred from the source tasks; the extracted representation is then sent to an output layer. By contrast, the Bi-Channel (BC) model introduces an extra LSTM layer to encode more task-specific information. To evaluate the effectiveness of our introduced adversarial training framework, we also make a comparison with the vanilla multi-task learning method.
Results and Analysis
As shown in Table 3, we can see the shared layer from ASP-MTL achieves a better performance compared with SP-MTL. Besides, for the two kinds of transfer strategies, the Bi-Channel model performs better. The reason is that the task-specific layer introduced in the Bi-Channel model can store some private features. Overall, the results indicate that we can save the existing knowledge into a shared recurrent layer using adversarial multi-task learning, which is quite useful for a new task.
To get an intuitive understanding of how the introduced orthogonality constraints work compared with the vanilla shared-private model, we design an experiment to examine the behaviors of neurons from the private layer and the shared layer. More concretely, we refer to $h_{tj}$ as the activation of the $j$-th neuron at time step $t$, where $t \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, d\}$. By visualizing the hidden state and analyzing the maximum activation, we can find what kinds of patterns the current neuron focuses on.
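The maximum-activation analysis can be sketched as follows: for each neuron, find the time step at which its activation peaks and report the word read at that step. The function name `strongest_words` and the nested-list layout of the hidden states (one row per time step, one column per neuron) are assumptions for this example.

```python
def strongest_words(hidden_states, words):
    """For each neuron, return the (word, activation) pair at its peak.

    hidden_states[t][j] is the activation of neuron j after reading words[t].
    """
    n_steps = len(hidden_states)
    n_neurons = len(hidden_states[0])
    result = []
    for j in range(n_neurons):
        # Time step with the largest activation for neuron j.
        t_best = max(range(n_steps), key=lambda t: hidden_states[t][j])
        result.append((words[t_best], hidden_states[t_best][j]))
    return result
```

Collecting these peak words over many sentences gives, for each neuron, the kind of pattern list that the analysis compares between the shared and task-specific layers.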
Figure 5 illustrates this phenomenon. Here, we randomly sample a sentence from the validation set of the Baby task and analyze the changes of the predicted sentiment score at different time steps, which are obtained by SP-MTL and our proposed model. Additionally, to get more insight into how neurons in the shared layer behave towards different input words, we visualize the activation of two typical neurons. For the positive sentence “Five stars, my baby can fall asleep soon in the stroller”, both models capture the informative pattern “Five stars” (for this case, the vanilla LSTM also gives a wrong answer due to ignoring the feature “Five stars”). However, SP-MTL makes a wrong prediction due to misunderstanding the word “asleep”.
By contrast, our model makes a correct prediction, and the reason can be inferred from the activation in Figure 5-(b), where the shared layer of SP-MTL is so sensitive that many features related to other tasks are included, such as “asleep”, which misleads the final prediction. This indicates the importance of introducing adversarial learning to prevent the shared layer from being contaminated by task-specific features.
We also list some typical patterns captured by neurons from the shared layer and the task-specific layer in Table 4, and we observe that: 1) for SP-MTL, if some patterns are captured by the task-specific layer, they are also likely to be placed into the shared space. Clearly, if many tasks are trained jointly, the shared layer bears much pressure and must sacrifice a substantial amount of capacity to capture patterns that the other tasks do not actually need. Furthermore, some typical task-invariant features also go into the task-specific layer. 2) for ASP-MTL, we find that the features captured by the shared and task-specific layers have only a small intersection, which allows these two kinds of layers to work effectively.
6 Related Work
There are two threads of related work. One thread is multi-task learning with neural networks. Neural network based multi-task learning has been proven effective in many NLP problems Collobert and Weston (2008); Glorot et al. (2011).
Liu et al. (2016c) first utilize different LSTM layers to construct a multi-task learning framework for text classification. Liu et al. (2016b) propose a generic multi-task framework, in which different tasks can share information through an external memory and communicate by a reading/writing mechanism. These works have the potential limitation of learning a shared space solely by sharing parameters, while our model introduces two strategies to learn a clear and non-redundant shared-private space.
Another thread of work is adversarial networks. Adversarial networks have recently surfaced as a general tool to measure equivalence between distributions and have proven to be effective in a variety of tasks. Ajakan et al. (2014); Bousmalis et al. (2016) applied adversarial training to domain adaptation, aiming at transferring the knowledge of one source domain to a target domain. Park and Im (2016) proposed a novel approach for multi-modal representation learning which uses the adversarial back-propagation concept.
Different from these models, our model aims to find task-invariant sharable information for multiple related tasks using an adversarial training strategy. Moreover, we extend binary adversarial training to multi-class, which enables multiple tasks to be jointly trained.
In this paper, we have proposed an adversarial multi-task learning framework, in which the task-specific and task-invariant features are learned non-redundantly, therefore capturing the shared-private separation of different tasks. We have demonstrated the effectiveness of our approach by applying our model to 16 different text classification tasks. We also perform extensive qualitative analysis, deriving insights and indirectly explaining the quantitative improvements in the overall performance.
We would like to thank the anonymous reviewers for their valuable comments and thank Kaiyu Qian, Gang Niu for useful discussions. This work was partially funded by National Natural Science Foundation of China (No. 61532011 and 61672162), the National High Technology Research and Development Program of China (No. 2015AA015408), Shanghai Municipal Science and Technology Commission (No. 16JC1420401).
- Ajakan et al. (2014) Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. 2014. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446 .
- Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine learning 79(1-2):151–175.
- Ben-David et al. (2007) Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. 2007. Analysis of representations for domain adaptation. Advances in neural information processing systems 19:137.
- Blitzer et al. (2007) John Blitzer, Mark Dredze, Fernando Pereira, et al. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL. volume 7, pages 440–447.
- Bousmalis et al. (2016) Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 2016. Domain separation networks. In Advances in Neural Information Processing Systems. pages 343–351.
- Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 .
- Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML.
- Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The JMLR 12:2493–2537.
- Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive science 14(2):179–211.
- Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). pages 1180–1189.
- Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). pages 513–520.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. pages 2672–2680.
- Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 .
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Jia et al. (2010) Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. 2010. Factorized latent spaces with structured sparsity. In Advances in Neural Information Processing Systems. pages 982–990.
- Jozefowicz et al. (2015) Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In Proceedings of The 32nd International Conference on Machine Learning.
- Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.
- Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 .
- Liu et al. (2016a) Pengfei Liu, Xipeng Qiu, Jifan Chen, and Xuanjing Huang. 2016a. Deep fusion LSTMs for text semantic matching. In Proceedings of ACL.
- Liu et al. (2015a) Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. 2015a. Multi-timescale long short-term memory neural network for modelling sentences and documents. In Proceedings of the Conference on EMNLP.
- Liu et al. (2016b) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016b. Deep multi-task learning with shared memory. In Proceedings of EMNLP.
- Liu et al. (2016c) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016c. Recurrent neural network for text classification with multi-task learning. In Proceedings of International Joint Conference on Artificial Intelligence.
- Liu et al. (2015b) Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015b. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In NAACL.
- Luong et al. (2015) Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114 .
- Maas et al. (2011) Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the ACL. pages 142–150.
- Misra et al. (2016) Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 3994–4003.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, pages 115–124.
- Park and Im (2016) Gwangbeen Park and Woobin Im. 2016. Image-text multi-modal representation learning by adversarial backpropagation. arXiv preprint arXiv:1612.08354 .
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. Proceedings of the EMNLP 12:1532–1543.
- Salzmann et al. (2010) Mathieu Salzmann, Carl Henrik Ek, Raquel Urtasun, and Trevor Darrell. 2010. Factorized orthogonal latent spaces. In AISTATS. pages 701–708.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in NIPS. pages 3104–3112.
- Taigman et al. (2016) Yaniv Taigman, Adam Polyak, and Lior Wolf. 2016. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200 .
- Zhang et al. (2014) Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2014. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision. Springer, pages 94–108.