Multi-Task Label Embedding for Text Classification

10/17/2017
by   Honglun Zhang, et al.
Shanghai Jiao Tong University

Multi-task learning in text classification leverages implicit correlations among related tasks to extract common features and yield performance gains. However, most previous works treat the labels of each task as independent and meaningless one-hot vectors, which causes a loss of potential label information and makes it difficult for these models to jointly learn three or more tasks. In this paper, we propose Multi-Task Label Embedding to convert labels in text classification into semantic vectors, thereby turning the original tasks into vector matching tasks. We implement unsupervised, supervised and semi-supervised models of Multi-Task Label Embedding, all utilizing semantic correlations among tasks and making it particularly convenient to scale and transfer as more tasks are involved. Extensive experiments on five benchmark datasets for text classification show that our models can effectively improve the performance of related tasks with semantic representations of labels and additional information from each other.


Introduction

Text classification is a common Natural Language Processing (NLP) task that tries to infer the most appropriate label for a given sentence or document, for example, sentiment analysis, topic classification and so on. With the development of Deep Learning [Bengio, Courville, and Vincent2013], many neural network based models have been explored in a large body of literature and have achieved inspiring performance gains on various text classification tasks. These models are robust in automatic feature extraction and can represent word sequences as fixed-length vectors with rich semantic information, which makes them notably suitable for subsequent NLP tasks.

Due to the numerous parameters to train, neural network based models rely heavily on adequate amounts of annotated corpora, which cannot always be met, as constructing large-scale high-quality labeled datasets is extremely time-consuming and labor-intensive. Multi-Task Learning addresses this problem by jointly training multiple related tasks, leveraging potential correlations among them to implicitly increase the corpus size, extract common features and yield classification improvements. Inspired by [Caruana1997], there is a large body of work dedicated to multi-task learning with neural network based models [Collobert and Weston2008, Liu et al.2015b, Liu, Qiu, and Huang2016a, Liu, Qiu, and Huang2016b, Zhang et al.2017]. These models usually contain a pre-trained lookup layer that maps words into dense, low-dimensional, real-valued vectors with semantic implications, which is known as Word Embedding [Mikolov et al.2013b], and utilize some lower layers to capture common features that are further fed to follow-up task-specific layers. However, most existing models have the following three disadvantages:

  • Lack of Label Information. Labels of each task are represented by independent and meaningless one-hot vectors, for example, positive and negative in sentiment analysis encoded as [1, 0] and [0, 1], which may cause a loss of potential label information.

  • Incapable of Scaling. Network structures are elaborately designed to model various correlations for multi-task learning, but most of them are structurally fixed and can only deal with interactions between two tasks, namely pair-wise interactions. When new tasks are introduced, the network structure has to be modified and the whole network has to be trained again.

  • Incapable of Transferring. Human beings can handle a completely new task without further effort after learning several related tasks, a capability known as Transfer Learning [Ling et al.2008]. As discussed above, the network structures of most previous models are fixed, so they are not compatible with new tasks and fail to tackle them.

In this paper, we propose Multi-Task Label Embedding (MTLE) to map labels of each task into semantic vectors as well, similar to how Word Embedding represents word sequences, thereby converting the original text classification tasks into vector matching tasks. Based on MTLE, we implement unsupervised, supervised and semi-supervised multi-task learning models for text classification, all utilizing semantic correlations among tasks and effectively solving the problems of scaling and transferring when new tasks are involved.

We conduct extensive experiments on five benchmark datasets for text classification. Compared to learning separately, jointly learning multiple related tasks based on MTLE demonstrates significant performance gains for each task.

Our contributions are four-fold:

  • Our models efficiently leverage the potential label information of each task by mapping labels into dense, low-dimensional, real-valued vectors with semantic implications.

  • It is particularly convenient for our models to scale when new tasks are involved. The network structures need no modifications and only data from the new tasks require training.

  • After training on several related tasks, our models can also naturally transfer to deal with completely new tasks without any additional training, while still achieving appreciable performances.

  • We consider different scenarios of multi-task learning and demonstrate strong results on several benchmark datasets for text classification. Our models outperform most state-of-the-art baselines.

Problem Statements

Single-Task Learning

In a supervised text classification task, the input is a word sequence denoted by $x = \{x_1, x_2, \dots, x_T\}$ and the output is the class label $y$ or its one-hot representation $\mathbf{y}$. A pre-trained lookup layer is used to get the embedding vector $\mathbf{x}_t$ for each word $x_t$. A text classification model $f$ is trained to produce the predicted distribution $\hat{\mathbf{y}}$ for each $x$,

$$\hat{\mathbf{y}} = f(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T)$$

and the training objective is to minimize the total cross-entropy over all samples,

$$L = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}$$

where $N$ denotes the number of training samples and $C$ is the class number.
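
As a concrete illustration (a minimal NumPy sketch with our own variable names, not the authors' code), the total cross-entropy above can be computed as:

```python
import numpy as np

def total_cross_entropy(y_true, y_pred, eps=1e-12):
    """Total cross-entropy over N samples.
    y_true: (N, C) one-hot labels; y_pred: (N, C) predicted distributions."""
    return -np.sum(y_true * np.log(y_pred + eps))

# toy check with N = 2 samples and C = 3 classes
y_true = np.array([[1., 0., 0.], [0., 1., 0.]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(total_cross_entropy(y_true, y_pred))  # -(log 0.7 + log 0.8) ≈ 0.58
```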

Multi-Task Learning

Given $K$ supervised text classification tasks $T_1, T_2, \dots, T_K$, a multi-task learning model $F$ is trained to transform each input $x^{(k)}$ from task $T_k$ into multiple predicted distributions $\hat{\mathbf{y}}^{(1)}, \dots, \hat{\mathbf{y}}^{(K)}$,

$$\left(\hat{\mathbf{y}}^{(1)}, \dots, \hat{\mathbf{y}}^{(K)}\right) = F\left(\mathbf{x}_1^{(k)}, \mathbf{x}_2^{(k)}, \dots, \mathbf{x}_T^{(k)}\right)$$

where only $\hat{\mathbf{y}}^{(k)}$ is used for loss computation. The overall training loss is a weighted linear combination of the costs for each task,

$$L = -\sum_{k=1}^{K} \lambda_k \sum_{i=1}^{N_k} \sum_{j=1}^{C_k} y_{ij}^{(k)} \log \hat{y}_{ij}^{(k)}$$

where $\lambda_k$, $N_k$ and $C_k$ denote the linear weight, the number of samples and the class number for task $T_k$ respectively.
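
The weighted combination can be written out directly; the sketch below (NumPy, illustrative names) treats each task's cost as the same cross-entropy used in the single-task case and sums the weighted costs:

```python
import numpy as np

def task_cost(y_true, y_pred, eps=1e-12):
    """Cross-entropy cost of one task over its N_k samples and C_k classes."""
    return -np.sum(y_true * np.log(y_pred + eps))

def multi_task_loss(per_task_outputs, lambdas):
    """Weighted linear combination of per-task costs.
    per_task_outputs: one (y_true, y_pred) pair per task; lambdas: linear weights."""
    return sum(lam * task_cost(y, p)
               for lam, (y, p) in zip(lambdas, per_task_outputs))
```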

Three Perspectives of Multi-Task Learning

Text classification tasks can differ in the characteristics of the input word sequence $x$ or the output label $y$. There are lots of benchmark datasets for text classification, and three different perspectives of multi-task learning can be distinguished.

  • Multi-Cardinality Tasks are similar apart from cardinalities, for example, movie review datasets with different average sequence lengths and class numbers.

  • Multi-Domain Tasks are different in domains of corpora, for example, product review datasets on books, DVDs, electronics and kitchen appliances.

  • Multi-Objective Tasks are targeted for different objectives, for example, sentiment analysis, topic classification and question type judgment.

The simplest multi-task learning scenario is that all tasks share the same cardinality, domain and objective, and simply come from different sources. On the contrary, when tasks vary in cardinality, domain and even objective, the correlations and interactions among them can be quite complicated and implicit. When implementing multi-task learning, both the model used and the tasks involved have significant influences on the ideal performance gains for each task. We will further investigate the scaling and transferring capabilities of MTLE in different scenarios in the Experiment section.

Methodology

Neural network based models have attracted substantial interest in many NLP tasks for their capability to represent variable-length word sequences as fixed-length vectors, for example, Neural Bag-of-Words (NBOW), Recurrent Neural Networks (RNN), Recursive Neural Networks (RecNN) and Convolutional Neural Networks (CNN). These models mostly first map sequences of words, n-grams or other semantic units into embedding representations with a pre-trained lookup layer, then comprehend the vector sequences with neural networks of different structures and mechanisms, and finally utilize a softmax layer to predict the categorical distribution for the specific text classification task. For RNN, input vectors are absorbed one by one in a recurrent manner, which resembles the way human beings read texts and makes RNN notably suitable for NLP tasks.

Recurrent Neural Network

RNN maintains an internal hidden state vector $\mathbf{h}_t$ that is recurrently updated by a transition function $f$. At each time step $t$, the hidden state is updated according to the current input vector $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$,

$$\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t)$$

where $f$ is usually a composition of an element-wise nonlinearity with an affine transformation of both $\mathbf{x}_t$ and $\mathbf{h}_{t-1}$. In this way, RNN can accept a word sequence of arbitrary length and produce a fixed-length vector, which is fed to a softmax layer for text classification or other NLP tasks. However, the gradient of $\mathbf{h}_t$ may grow or decay exponentially over long sequences during training, namely the gradient exploding or vanishing problem, which hinders RNN from effectively learning long-term dependencies and correlations.
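
For concreteness, a minimal NumPy sketch of this recurrent update, assuming a tanh nonlinearity over the affine transformation (the particular nonlinearity is our illustrative choice):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One recurrent update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def encode_sequence(embeddings, W_h, W_x, b):
    """Consume an arbitrary-length sequence of word embeddings and
    return the final fixed-length hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in embeddings:
        h = rnn_step(h, x_t, W_h, W_x, b)
    return h
```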

[Hochreiter and Schmidhuber1997] proposed the Long Short-Term Memory Network (LSTM) to solve the above problems. Besides the internal hidden state $\mathbf{h}_t$, LSTM also maintains an internal memory cell $\mathbf{c}_t$ and three gating mechanisms. While there are numerous variants of the standard LSTM, in this paper we follow the implementation of [Graves2013]. At each time step $t$, the state of the LSTM can be fully described by five vectors in $\mathbb{R}^{d_h}$: an input gate $\mathbf{i}_t$, a forget gate $\mathbf{f}_t$, an output gate $\mathbf{o}_t$, the hidden state $\mathbf{h}_t$ and the memory cell $\mathbf{c}_t$, which adhere to the following transition equations,

$$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i)$$
$$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f)$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o)$$
$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c \mathbf{x}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c)$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

where $\mathbf{x}_t$ is the current input, $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes element-wise multiplication. By strictly controlling how to accept $\mathbf{x}_t$ and the portions of $\mathbf{c}_{t-1}$ to update, forget and expose at each time step, LSTM can better understand long-term dependencies according to the labels of the whole sequences.
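
The following NumPy sketch implements one step of a standard LSTM cell consistent with the transition equations above, with the four gate blocks stacked into single parameter matrices (a sketch, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, d), U: (4H, H), b: (4H,) hold the stacked
    parameters of the input (i), forget (f), output (o) gates and the
    candidate memory (g)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2*H])          # forget gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate memory
    c = f * c_prev + i * g         # updated memory cell
    h = o * np.tanh(c)             # new hidden state
    return h, c
```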

Multi-Task Label Embedding

Labels of text classification tasks are made up of word sequences as well, for example, positive and negative in binary sentiment classification, or very positive, positive, neutral, negative and very negative in 5-categorical sentiment classification. Inspired by Word Embedding, we propose Multi-Task Label Embedding (MTLE) to convert the labels of each task into dense, low-dimensional, real-valued vectors with semantic implications, thereby disclosing potential intra-task and inter-task label correlations.

Figure 1 illustrates the general idea of MTLE for text classification, which mainly consists of three parts, the Input Encoder, the Label Encoder and the Matcher.

In the Input Encoder, each input sequence $x^{(k)}$ from task $T_k$ is transformed into its embedding representation by the Lookup Layer $L_I$. The Learning Layer $F_I$ is applied to recurrently comprehend the embedding sequence and generate a fixed-length vector $\mathbf{h}$, which can be regarded as an overall representation of the original input sequence $x^{(k)}$.

Figure 1: General idea of MTLE for text classification

In the Label Encoder, the labels of each task are mapped and learned to produce fixed-length representations as well. There are $C_k$ labels in task $T_k$, namely $y_1^{(k)}, \dots, y_{C_k}^{(k)}$, where each $y_j^{(k)}$ is also a word sequence, for example, very positive, and is mapped into a vector sequence by the Lookup Layer $L_L$. The Learning Layer $F_L$ further absorbs this vector sequence to generate a fixed-length vector $\mathbf{l}_j^{(k)}$, which can be regarded as an overall semantic representation of the original label $y_j^{(k)}$.

In order to classify a sample from task $T_k$, the Matcher obtains the corresponding $\mathbf{h}$ from the Input Encoder and all $\mathbf{l}_j^{(k)}$ from the Label Encoder, and then conducts vector matching to select the most appropriate class label.
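
To make the data flow concrete, here is a compact PyTorch sketch of how the three parts could be wired together. The shared lookup table, single-layer LSTMs and the one-layer matcher over concatenated representations are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MTLE(nn.Module):
    """Input Encoder + Label Encoder + Matcher (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, embed_dim)                      # L_I / L_L (shared here)
        self.input_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # F_I
        self.label_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # F_L
        self.matcher = nn.Linear(2 * hidden_dim, 1)                            # scores one (sequence, label) pair

    def forward(self, seq_ids, label_ids):
        # seq_ids: (B, T) word ids of B input sequences
        # label_ids: (C, T_l) word ids of the C label phrases of the current task
        _, (h_seq, _) = self.input_encoder(self.lookup(seq_ids))    # h_seq: (1, B, H)
        _, (h_lab, _) = self.label_encoder(self.lookup(label_ids))  # h_lab: (1, C, H)
        h_seq, h_lab = h_seq.squeeze(0), h_lab.squeeze(0)
        B, C = h_seq.size(0), h_lab.size(0)
        pairs = torch.cat([h_seq.unsqueeze(1).expand(B, C, -1),
                           h_lab.unsqueeze(0).expand(B, C, -1)], dim=-1)
        return self.matcher(pairs).squeeze(-1)                      # (B, C) matching scores
```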

Based on the idea of MTLE, we implement unsupervised, supervised and semi-supervised models to investigate and explore different possibilities of multi-task learning in text classification.

Model-I: Unsupervised

Suppose that for each task $T_k$ we only have the input sequences and the classification labels, but lack the specific annotation linking each input sequence to its corresponding label. In this case, we can only implement MTLE in an unsupervised manner.

Word Embedding [Mikolov et al.2013b] leverages contextual features of words and trains them into semantic vectors so that words sharing synonymous meanings result in vectors of similar values. In the unsupervised model, we utilize all available input sequences and classification labels as the whole corpora and train an embedding model [Mikolov et al.2013a] that covers contextual features of the different tasks. The embedding model is employed as both $L_I$ and $L_L$.

We realize $F_I$ and $F_L$ simply by summing up the vectors in a sequence and calculating the average, since we do not have any supervised annotations. After obtaining $\mathbf{h}$ for each input sample and all $\mathbf{l}_j^{(k)}$ for a certain task $T_k$, we apply unsupervised vector matching methods, for example, Cosine Similarity or distance-based measures, to select the most appropriate label for each input.
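
A minimal sketch of this unsupervised matching (Python; w2v stands for any word-to-vector mapping such as the embedding model trained on the combined corpora, and the helper names are hypothetical):

```python
import numpy as np

def sequence_vector(tokens, w2v):
    """Average the word vectors of a sequence (no supervision available)."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def predict_label(tokens, label_phrases, w2v):
    """Pick the label phrase whose averaged embedding best matches the input."""
    h = sequence_vector(tokens, w2v)
    scores = [cosine(h, sequence_vector(phrase, w2v)) for phrase in label_phrases]
    return int(np.argmax(scores))

# e.g. predict_label("the movie was wonderful".split(),
#                    [["negative"], ["positive"]], w2v)   # w2v: dict word -> vector
```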

In conclusion, the unsupervised model of MTLE exploits contextual and semantic information of both the input sequences and the classification labels. Model-I may fail to achieve fully satisfactory performance due to its reliance on entirely unsupervised methods, but it can still provide useful insights when no annotations are available at all.

Model-II: Supervised

Given the specific annotations for each input sequence and its corresponding label, we can better train the Input Encoder and the Label Encoder in a supervised manner.

The $L_I$ and the $L_L$ are both fully-connected layers whose weights are matrices of size $V \times d$, where $V$ denotes the vocabulary size and $d$ is the embedding size. We can utilize the embeddings obtained in Model-I or other pre-trained lookup tables to initialize $L_I$ and $L_L$, and further tune their weights during training.

The $F_I$ and the $F_L$ should be trainable models that can transform a vector sequence of arbitrary length into a fixed-length vector. We apply the LSTM implementation of [Graves2013] for both, with hidden size $d_h$. More complicated but effective sequence learning models could also be tried, but in this paper we mainly focus on the idea and effects of MTLE, so we choose a common implementation and spend more effort on explorations of MTLE.

We utilize another fully-connected layer, denoted by $f_M$, to achieve the Matcher, which accepts the outputs of the $F_I$ and the $F_L$ to produce a matching score. Given the matching scores of each label, we implement the idea of cross-entropy and calculate the loss function for a sample $x_i$ from task $T_k$ as follows,

$$\mathbf{z}_{ij} = \mathbf{h}_i \oplus \mathbf{l}_j^{(k)}$$
$$s_{ij} = f_M(\mathbf{z}_{ij})$$
$$\hat{p}_{ij} = \frac{\exp(s_{ij})}{\sum_{j'=1}^{C_k} \exp(s_{ij'})}$$
$$L_i^{(k)} = -\sum_{j=1}^{C_k} y_{ij} \log \hat{p}_{ij}$$

where $\oplus$ denotes vector concatenation and $\mathbf{y}_i$ is the true label in one-hot representation for $x_i$, with components $y_{ij}$. The overall training objective is to minimize the weighted linear combination of costs for samples from all tasks,

$$L = \sum_{k=1}^{K} \lambda_k \sum_{i=1}^{N_k} L_i^{(k)}$$

where $\lambda_k$ and $N_k$ denote the linear weight and the number of samples for each task, as in the multi-task learning formulation above. The network structure of the supervised model for MTLE is illustrated in Figure 2.
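
A minimal PyTorch sketch of the per-sample loss just described, assuming the Matcher is a single linear layer scoring the concatenation of the sequence and label representations (variable names are ours):

```python
import torch
import torch.nn.functional as F

def matching_loss(h_seq, h_labels, matcher, target):
    """Cross-entropy over matching scores for one sample.
    h_seq: (H,) sequence representation; h_labels: (C, H) label representations;
    matcher: torch.nn.Linear(2H, 1); target: index of the true label (LongTensor)."""
    C = h_labels.size(0)
    pairs = torch.cat([h_seq.unsqueeze(0).expand(C, -1), h_labels], dim=-1)  # (C, 2H)
    scores = matcher(pairs).squeeze(-1)                                      # (C,)
    return F.cross_entropy(scores.unsqueeze(0), target.view(1))              # softmax + log-loss
```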

Model-II provides a simple and intuitive way to realize multi-task learning, where input sequences and classification labels from different tasks are jointly learned and compactly fused. During training, $L_I$ and $L_L$ learn a better understanding of word semantics across different tasks, while $F_I$ and $F_L$ obtain stronger capabilities of sequence representation.

Figure 2: Supervised model for MTLE

When new tasks are involved, it is extremely convenient for Model-II to scale, as the whole network structure needs no modification. We can continue training Model-II and further tune the parameters based on samples from the new tasks, which we define as Hot Update, or re-train Model-II from scratch based on samples from all tasks, which is defined as Cold Update. We investigate the performance of these two scaling methods in detail in the Experiment section.
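
The distinction between the two scaling methods can be summarized in a short sketch; build_model and train_fn below are hypothetical helpers standing in for the usual model construction and training routines:

```python
def cold_update(build_model, old_tasks, new_tasks, train_fn):
    """Cold Update: re-train from scratch on samples from all tasks."""
    model = build_model()
    train_fn(model, old_tasks + new_tasks)
    return model

def hot_update(trained_model, new_tasks, train_fn):
    """Hot Update: keep existing parameters and continue training
    only on samples from the new tasks."""
    train_fn(trained_model, new_tasks)
    return trained_model
```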

Model-III: Semi-Supervised

Human beings can handle a completely new task without further effort and achieve appreciable performance after learning several related tasks, which we conclude as the capability to transfer.

We propose Model-III for semi-supervised learning based on MTLE. The only difference between Model-II and Model-III is how they deal with new tasks, annotated or not. If the new tasks are provided with annotations, we can choose to apply the Hot Update or Cold Update of Model-II. If the new tasks are completely unlabeled, we can still employ Model-II for vector mapping and find the best label for each input sequence without any further training, which we define as Zero Update. To avoid confusion, we specifically use Model-III to denote the cases where annotations of new tasks are unavailable and only Zero Update is applicable, which corresponds to the transferring and semi-supervised learning capability of human beings. The differences among Hot Update, Cold Update and Zero Update are illustrated in Figure 3, where Before Update denotes the model trained on the old tasks before the new tasks are introduced. We further investigate these three updating methods in the Experiment section.
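
Zero Update requires no gradient steps at all: the frozen model simply matches inputs of the new task against the embeddings of the new task's label phrases. A hedged PyTorch sketch, assuming a model with the same interface as the MTLE sketch earlier:

```python
import torch

@torch.no_grad()
def zero_update_predict(model, seq_ids, new_label_ids):
    """Zero Update: classify samples of a completely unlabeled new task with an
    already-trained, frozen model (no further parameter updates). The model is
    assumed to map (seq_ids, label_ids) to a (B, C) matrix of matching scores."""
    model.eval()
    scores = model(seq_ids, new_label_ids)   # match inputs against the new labels
    return scores.argmax(dim=-1)             # best-matching label index per sample
```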

Figure 3: Differences among three updating methods
Dataset | Description | Type | Average Length | Classes | Objective
SST | Movie reviews in Stanford Sentiment Treebank, including SST-1 and SST-2 | Sentence | 19 / 19 | 5 / 2 | Sentiment
IMDB | Internet Movie Database | Document | 279 | 2 | Sentiment
MDSD | Product reviews on books, DVDs, electronics and kitchen appliances | Document | 176 / 189 / 115 / 97 | 2 | Sentiment
RN | Reuters Newswire topics classification | Document | 146 | 46 | Topics
QC | Question Classification | Sentence | 10 | 6 | Question Types
Table 1: Five benchmark text classification datasets: SST, IMDB, MDSD, RN, QC.

Experiment

In this section, we design extensive experiments with multi-task learning based on five benchmark datasets for text classification. We investigate the empirical performances of our models and compare them to existing state-of-the-art baselines.

Datasets

As Table 1 shows, we select five benchmark datasets for text classification and design three experiment scenarios to evaluate the performances of Model-I and Model-II.

Hyperparameters and Training

Training of Model-II is conducted through back propagation with stochastic gradient descent [Amari1993]. Besides the embeddings obtained in Model-I, we also obtain a pre-trained lookup table by applying Word2Vec [Mikolov et al.2013a] to the Google News corpus, which contains more than 100B words with a vocabulary size of about 3M. During each epoch, we randomly divide the samples from different tasks into batches of fixed size. For each iteration, we randomly select one task, choose an untrained batch from that task, calculate the gradient and update the parameters accordingly.
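
The batch scheduling just described can be sketched as follows (PyTorch-style optimizer; compute_loss is a hypothetical helper that runs the forward pass for one task's batch and returns a scalar loss):

```python
import random

def train_one_epoch(model, optimizer, task_batches, compute_loss):
    """Repeatedly pick a random task that still has untrained batches,
    take one of its batches, and update the parameters."""
    remaining = {task: list(batches) for task, batches in task_batches.items()}
    while any(remaining.values()):
        task = random.choice([t for t, b in remaining.items() if b])
        batch = remaining[task].pop()
        optimizer.zero_grad()
        loss = compute_loss(model, task, batch)
        loss.backward()
        optimizer.step()
```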

All involved parameters of the neural layers are randomly initialized from a truncated normal distribution with zero mean and a fixed standard deviation. We apply 10-fold cross-validation and investigate different combinations of hyperparameters, of which the best one is described in Table 2.

Embedding size
Hidden layer size of LSTM
Batch size
Initial learning rate
Regularization weight
Table 2: Hyperparameter settings

Results of Model-I and Model-II

We compare the performances of Model-I and Model-II with the implementation of [Graves2013] as shown in Table 3.

It is expected that Model-I falls behind [Graves2013] as no annotations are available at all. However, with contextual information of both sequences and labels, Model-I still achieves considerable margins against random choices. Model-I performs better on tasks of shorter lengths, for example, SST-1 and SST-2, as it is difficult for unsupervised methods to learn long-term dependencies.

Model-II obtains significant performance gains with label information and additional correlations from related tasks. Multi-Domain, Multi-Cardinality and Multi-Objective benefit from MTLE with average improvements of 5.8%, 3.1% and 1.7% respectively, as they contain increasingly weaker relevance among tasks. The result of Model-II for IMDB in Multi-Cardinality is slightly better than that in Multi-Objective (91.3 against 90.9), as SST-1 and SST-2 share more semantically useful information with IMDB than RN and QC do.

Model | SST-1 | SST-2 | IMDB | Books | DVDs | Electronics | Kitchen | IMDB | RN | QC | Avg
(Multi-Cardinality: SST-1, SST-2, IMDB; Multi-Domain: Books, DVDs, Electronics, Kitchen; Multi-Objective: IMDB, RN, QC)
Single Task | 45.9 | 85.8 | 88.5 | 78.0 | 79.5 | 81.2 | 81.8 | 88.5 | 83.6 | 92.5 | -
Random | 20.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 2.2 | 16.7 | -41.6
Model-I | 31.4 | 71.6 | 67.5 | 68.8 | 67.0 | 69.1 | 69.3 | 67.2 | 70.4 | 52.3 | -17.1
Model-II | 49.8 | 88.4 | 91.3 | 84.5 | 85.2 | 87.3 | 86.9 | 90.9 | 85.5 | 93.2 | +3.7
Table 3: Results of Model-I and Model-II on different scenarios
Model | SST-1 | SST-2 | IMDB | Books | DVDs | Electronics | Kitchen | IMDB | RN | QC
(Case 1: SST-1, SST-2, IMDB; Case 2: Books, DVDs, Electronics, Kitchen; Case 3: IMDB, RN, QC)
Before Update | 48.6 | 87.6 | - | 83.7 | 84.5 | 85.9 | - | - | 84.8 | 93.4
Cold Update | 49.8 | 88.3 | 91.4 | 84.4 | 85.2 | 87.2 | 86.9 | 91.0 | 85.5 | 93.2
Hot Update | 49.5 | 88.0 | 91.3 | 84.1 | 84.8 | 86.9 | 87.0 | 90.9 | 85.1 | 92.9
Zero Update | - | - | 89.9 | - | - | - | 86.3 | 74.2 | - | -
Table 4: Results of Hot Update, Cold Update and Zero Update in different cases

Scaling and Transferring Capability of MTLE

In order to investigate the scaling and transferring capability of MTLE, we use the notation A + B → C to denote the case where Model-II is first trained on tasks A and B, while C is the newly involved task. We design three cases based on different scenarios and compare the influences of Hot Update, Cold Update and Zero Update on each task:

  • Case 1: SST-1 + SST-2 → IMDB.

  • Case 2: Books + DVDs + Electronics → Kitchen.

  • Case 3: RN + QC → IMDB.

where in Zero Update, we ignore the training set of the new task and directly utilize its test set for evaluation.

As Table 4 shows, Before Update denotes the model trained on the old tasks before the new tasks are involved, so only evaluations on the old tasks are conducted, which outperform the Single Task in Table 3 by 3.1% on average.

Cold Update re-trains Model-II from scratch on both the old tasks and the new tasks, thus achieving performances similar to those of Model-II in Table 3. Different from Cold Update, Hot Update resumes training only on the new tasks and requires much less training time, while still obtaining results competitive with Cold Update. New tasks like IMDB and Kitchen benefit more from Hot Update than the old tasks do, as the parameters are further tuned according to annotations from these new tasks. Based on Cold Update and Hot Update, MTLE can easily scale and needs no structural modifications when new tasks are introduced.

Zero Update provides inspiring possibilities for completely unlabeled tasks. There are no more annotations available for additional training from the new tasks, so we can only employ the models of Before Update for evaluations on the new tasks. Zero Update achieves competitive performances in Case 1 (89.9 for IMDB) and Case 2 (86.3 for Kitchen), as tasks from these two cases all belong to sentiment datasets of different cardinalities or domains that contain rich semantic correlations with each other. However, the result for IMDB in Case 3 is only 74.2, as sentiment shares less relevance with topic classification and question type judgment, thus resulting in poor transferring performances.

Multi-Task or Label Embedding

MTLE mainly consists of two parts, label embedding and multi-task learning, so both implicit information from labels and potential correlations from other tasks make differences. In this section, we conduct experiments to explore the respective contributions of label embedding and multi-task learning.

We choose the four tasks from the Multi-Domain scenario and train Model-II on each task separately. Given that each task is trained separately, their performances in this case are only influenced by label embedding. Then we re-train Model-II from scratch for every pair and every triple of these tasks and record the performance of each task in the different cases, where both label embedding and multi-task learning matter.

Model | SST-1 | SST-2 | IMDB | Books | DVDs | Electronics | Kitchen | QC
NBOW | 42.4 | 80.5 | 83.6 | - | - | - | - | 88.2
PV | 44.6 | 82.7 | 91.7 | - | - | - | - | 91.8
MT-CNN | - | - | - | 80.2 | 81.0 | 83.4 | 83.0 | -
MT-DNN | - | - | - | 79.7 | 80.5 | 82.5 | 82.8 | -
MT-RNN | 49.6 | 87.9 | 91.3 | - | - | - | - | -
DSM | 49.5 | 87.8 | 91.2 | 82.8 | 83.0 | 85.5 | 84.0 | -
GRNN | 47.5 | 85.5 | - | - | - | - | - | 93.8
Model-II | 49.8 | 88.4 | 91.3 | 84.5 | 85.2 | 87.3 | 86.9 | 93.2
Table 5: Comparisons of Model-II against state-of-the-art models

The results are illustrated in Figure 4, where B, D, E, K are short for Books, DVDs, Electronics and Kitchen. The first three graphs show the results of Model-II trained on every single task, every pair of tasks and every triple of tasks. In the first graph, the four tasks are trained separately and achieve improvements of 3.2%, 3.3%, 3.5% and 2.5% respectively compared to the baseline [Graves2013]. As more tasks are involved step by step, Model-II produces increasing performance gains for each task and achieves an average improvement of 5.9% when all four tasks are trained together. It can thus be concluded that both label information and correlations from other tasks account for a considerable part of the improvements, and we integrate both of them into MTLE with the capabilities of scaling and transferring.

In the last graph, diagonal cells denote the improvements of each single task, while off-diagonal cells denote the average improvements of each pair of tasks, so an off-diagonal cell of darker color indicates stronger correlations between the corresponding two tasks. An interesting finding is that Books is more related to DVDs and Electronics is more relevant to Kitchen. A possible reason may be that Books and DVDs are products targeted at reading or watching, while customers care more about appearance and functionality when talking about Electronics and Kitchen.

Figure 4: Performance gains of each task in different cases

Comparisons with State-of-the-art Models

We compare Model-II against the following state-of-the-art models:

  • NBOW Neural Bag-of-Words that sums up embedding vectors of all words and applies a non-linearity followed by a softmax layer.

  • PV Paragraph Vectors followed by logistic regression [Le and Mikolov2014].

  • MT-CNN Multi-Task learning with Convolutional Neural Networks [Collobert and Weston2008] where lookup tables are partially shared.

  • MT-DNN Multi-Task learning with Deep Neural Networks [Liu et al.2015b] that utilizes bag-of-word representations and a hidden shared layer.

  • MT-RNN Multi-Task learning with Recurrent Neural Networks by a shared-layer architecture [Liu, Qiu, and Huang2016b].

  • DSM Deep multi-task learning with Shared Memory [Liu, Qiu, and Huang2016a] where an external memory and a reading/writing mechanism are introduced.

  • GRNN Gated Recursive Neural Network for sentence modeling and text classification [Chen et al.2015].

As Table 5 shows, MTLE achieves competitive or better performances on all tasks except QC, as it has weaker correlations with the other tasks. PV slightly surpasses MTLE on IMDB (91.7 against 91.3), as sentences from IMDB are much longer than those from SST and MDSD, which requires stronger capabilities of long-term dependency learning. In this paper, we mainly focus on the idea and effects of integrating label embedding with multi-task learning, so we simply apply [Graves2013] to realize $F_I$ and $F_L$, which could be further implemented with other more effective sentence learning models [Liu et al.2015a, Chen et al.2015] to produce better performances.

Related Work

There are a large body of literatures related to multi-task learning with neural networks in NLP [Collobert and Weston2008, Liu et al.2015b, Liu, Qiu, and Huang2016a, Liu, Qiu, and Huang2016b, Zhang et al.2017].

[Collobert and Weston2008] utilizes a shared lookup layer for common features, followed by task-specific layers for several traditional NLP tasks including part-of-speech tagging and semantic parsing. They use a fixed-size window to solve the problem of variable-length input sequences, which can be better addressed by RNN.

[Liu et al.2015b, Liu, Qiu, and Huang2016a, Liu, Qiu, and Huang2016b, Zhang et al.2017] all investigate multi-task learning for text classification. [Liu et al.2015b] applies bag-of-words representations, so information about word order is lost. [Liu, Qiu, and Huang2016a] introduces an external memory for information sharing with a reading/writing mechanism for communication. [Liu, Qiu, and Huang2016b] proposes three different models for multi-task learning with RNN and [Zhang et al.2017] constructs a generalized architecture for RNN based multi-task learning. However, the models of these papers ignore essential label information and mostly address only pair-wise interactions between two tasks. Their network structures are also fixed, thereby failing to scale or transfer when new tasks are involved.

Different from the above works, our models map labels of text classification tasks into semantic vectors and provide a more intuitive way to realize multi-task learning with the capabilities of scaling and transferring. Input sequences from three or more tasks are jointly learned together with their labels, benefitting from each other and obtaining better sequence representations.

Conclusion

In this paper, we propose Multi-Task Label Embedding to map labels of text classification tasks into semantic vectors. Based on MTLE, we implement unsupervised, supervised and semi-supervised models to facilitate multi-task learning, all utilizing semantic correlations among tasks and effectively solving the problems of scaling and transferring when new tasks are involved. We explore three different scenarios of multi-task learning and our models can improve performances of most tasks with additional related information from others in all scenarios.

In future work, we would like to explore quantifications of task correlations and generalize MTLE to address other NLP tasks, for example, sequence labeling and sequence-to-sequence learning.

References

  • [Amari1993] Amari, S. 1993. Backpropagation and stochastic gradient descent method. Neurocomputing 5(3):185–196.
  • [Apté, Damerau, and Weiss1994] Apté, C.; Damerau, F.; and Weiss, S. M. 1994. Automated Learning of Decision Rules for Text Categorization. ACM Trans. Inf. Syst. 12(3):233–251.
  • [Bengio, Courville, and Vincent2013] Bengio, Y.; Courville, A. C.; and Vincent, P. 2013. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8):1798–1828.
  • [Blitzer, Dredze, and Pereira2007] Blitzer, J.; Dredze, M.; and Pereira, F. 2007. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In ACL.
  • [Caruana1997] Caruana, R. 1997. Multitask Learning. Machine Learning 28(1):41–75.
  • [Chen et al.2015] Chen, X.; Qiu, X.; Zhu, C.; Wu, S.; and Huang, X. 2015. Sentence Modeling with Gated Recursive Neural Network. In EMNLP, 793–798.
  • [Collobert and Weston2008] Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 160–167.
  • [Graves2013] Graves, A. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8):1735–1780.
  • [Le and Mikolov2014] Le, Q. V., and Mikolov, T. 2014. Distributed Representations of Sentences and Documents. In ICML, 1188–1196.
  • [Li and Roth2002] Li, X., and Roth, D. 2002. Learning Question Classifiers. In COLING.
  • [Ling et al.2008] Ling, X.; Dai, W.; Xue, G.; Yang, Q.; and Yu, Y. 2008. Spectral domain-transfer learning. In ACM SIGKDD, 488–496.
  • [Liu et al.2015a] Liu, P.; Qiu, X.; Chen, X.; Wu, S.; and Huang, X. 2015a. Multi-Timescale Long Short-Term Memory Neural Network for Modelling Sentences and Documents. In EMNLP, 2326–2335.
  • [Liu et al.2015b] Liu, X.; Gao, J.; He, X.; Deng, L.; Duh, K.; and Wang, Y. 2015b. Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. In NAACL HLT, 912–921.
  • [Liu, Qiu, and Huang2016a] Liu, P.; Qiu, X.; and Huang, X. 2016a. Deep Multi-Task Learning with Shared Memory for Text Classification. In EMNLP, 118–127.
  • [Liu, Qiu, and Huang2016b] Liu, P.; Qiu, X.; and Huang, X. 2016b. Recurrent Neural Network for Text Classification with Multi-Task Learning. In IJCAI, 2873–2879.
  • [Maas et al.2011] Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning Word Vectors for Sentiment Analysis. In NAACL HLT, 142–150. Association for Computational Linguistics.
  • [Mikolov et al.2013a] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781.
  • [Mikolov et al.2013b] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, 3111–3119.
  • [Socher et al.2013] Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In EMNLP, 1631–1642. Stroudsburg, PA: Association for Computational Linguistics.
  • [Zhang et al.2017] Zhang, H.; Xiao, L.; Wang, Y.; and Jin, Y. 2017. A generalized recurrent neural architecture for text classification with multi-task learning. In IJCAI-17, 3385–3391.