A Generalized Recurrent Neural Architecture for Text Classification with Multi-Task Learning

by Honglun Zhang et al.
Shanghai Jiao Tong University

Multi-task learning leverages potential correlations among related tasks to extract common features and yield performance gains. However, most previous works only consider simple or weak interactions, thereby failing to model complex correlations among three or more tasks. In this paper, we propose a multi-task learning architecture with four types of recurrent neural layers to fuse information across multiple related tasks. The architecture is structurally flexible and considers various interactions among tasks, and it can be regarded as a generalized case of many previous works. Extensive experiments on five benchmark datasets for text classification show that our model can significantly improve the performance of related tasks with additional information from others.






1 Introduction

Neural network based models have been widely exploited with the prosperity of Deep Learning [Bengio et al.2013] and have achieved inspiring performances on many NLP tasks, such as text classification [Chen et al.2015, Liu et al.2015a], semantic matching [Liu et al.2016d, Liu et al.2016a] and machine translation [Sutskever et al.2014]. These models are robust at feature engineering and can represent words, sentences and documents as fixed-length vectors, which contain rich semantic information and are ideal for subsequent NLP tasks.

One formidable constraint of deep neural networks (DNNs) is their strong reliance on large amounts of annotated corpora, owing to the substantial number of parameters to train. A DNN trained on limited data is prone to overfitting and incapable of generalizing well. However, constructing large-scale, high-quality labeled datasets is extremely labor-intensive. To mitigate the problem, these models usually employ a pre-trained lookup table, also known as Word Embedding [Mikolov et al.2013b], to map words into vectors with semantic implications. However, this method merely introduces extra knowledge and does not directly optimize the targeted task, so the problem of insufficient annotated resources remains unsolved.

Multi-task learning leverages potential correlations among related tasks to extract common features, implicitly increase corpus size and yield classification improvements. Inspired by [Caruana1997], there is a large body of literature dedicated to multi-task learning with neural network based models [Collobert and Weston2008, Liu et al.2015b, Liu et al.2016b, Liu et al.2016c]. These models basically share some lower layers to capture common features and further feed them to subsequent task-specific layers. They can be classified into three types:

  • Type-I One dataset annotated with multiple labels and one input with multiple outputs.

  • Type-II Multiple datasets with respective labels and one input with multiple outputs, where samples from different tasks are fed one by one into the models sequentially.

  • Type-III Multiple datasets with respective labels and multiple inputs with multiple outputs, where samples from different tasks are jointly learned in parallel.

In this paper, we propose a generalized multi-task learning architecture with four types of recurrent neural layers for text classification. The architecture focuses on Type-III, which involves more complicated interactions but has not been researched yet. All the related tasks are jointly integrated into a single system and samples from different tasks are trained in parallel. In our model, every two tasks can directly interact with each other and selectively absorb useful information, or communicate indirectly via a shared intermediate layer. We also design a global memory storage to share common features and collect interactions among all tasks.

We conduct extensive experiments on five benchmark datasets for text classification. Compared to learning each task separately, jointly learning multiple related tasks in our model demonstrates significant performance gains for each of them.

Our contributions are threefold:

  • Our model is structurally flexible and considers various interactions, and can be regarded as a generalized case of many previous works with deliberate designs.

  • Our model allows for interactions among three or more tasks simultaneously and samples from different tasks are trained in parallel with multiple inputs.

  • We consider different scenarios of multi-task learning and demonstrate strong results on several benchmark classification datasets. Our model outperforms most state-of-the-art baselines.

2 Problem Statements

2.1 Single-Task Learning

For a single supervised text classification task, the input is a word sequence denoted by x = {x_1, x_2, …, x_T}, and the output is the corresponding class label y or class distribution ŷ. A lookup layer is used first to get the embedding vector of each word x_i. A classification model F is trained to transform each x into a predicted distribution ŷ:

ŷ = F(x)

and the training objective is to minimize the total cross-entropy of the predicted and true distributions over all samples:

L = −∑_{i=1}^{N} ∑_{j=1}^{C} y_{ij} log ŷ_{ij}

where N denotes the number of training samples and C is the class number.
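As a minimal NumPy sketch (function and variable names are ours, purely illustrative), the cross-entropy objective above can be computed as:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Total cross-entropy between one-hot labels y_true (N x C)
    and predicted distributions y_pred (N x C)."""
    return -np.sum(y_true * np.log(y_pred + eps))

# two samples, three classes
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = cross_entropy(y_true, y_pred)  # only the true-class log-probs contribute
```

Only the entries where y_{ij} = 1 contribute, so here the loss reduces to −(log 0.7 + log 0.8).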

2.2 Multi-Task Learning

Given K supervised text classification tasks T_1, T_2, …, T_K, a jointly learning model F is trained to transform multiple inputs into a combination of predicted distributions in parallel:

(ŷ^(1), ŷ^(2), …, ŷ^(K)) = F(x^(1), x^(2), …, x^(K))

where x^(k) are sequences from each task and ŷ^(k) are the corresponding predictions.

The overall training objective of F is to minimize the weighted linear combination of costs for all tasks:

L = −∑_{i=1}^{N} ∑_{k=1}^{K} λ_k ∑_{j=1}^{C_k} y_{ij}^{(k)} log ŷ_{ij}^{(k)}

where N denotes the number of sample collections, and C_k and λ_k are the class numbers and weights for each task T_k respectively.
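The weighted combination of per-task costs can be sketched as follows (the per-task loss values and weights below are made-up numbers for illustration):

```python
import numpy as np

def multitask_loss(task_losses, weights):
    """Weighted linear combination of per-task cross-entropy losses."""
    return float(np.dot(weights, task_losses))

per_task = np.array([0.58, 1.20, 0.35])   # hypothetical per-task losses
lambdas  = np.array([0.5, 0.3, 0.2])      # task weights lambda_k
total = multitask_loss(per_task, lambdas)
```

In practice the weights λ_k let one bias training toward the tasks that matter most.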

2.3 Three Perspectives of Multi-Task Learning

Different tasks may differ in characteristics of the word sequences x or the labels y. We compare many benchmark tasks for text classification and identify three different perspectives of multi-task learning.

  • Multi-Cardinality Tasks are similar except for cardinality parameters, for example, movie review datasets with different average sequence lengths and class numbers.

  • Multi-Domain Tasks involve contents of different domains, for example, product review datasets on books, DVDs, electronics and kitchen appliances.

  • Multi-Objective

    Tasks are designed for different objectives, for example, sentiment analysis, topics classification and question type judgment.

The simplest multi-task learning scenario is that all tasks share the same cardinality, domain and objective but come from different sources, so it is intuitive that they can obtain useful information from each other. In the most complex scenario, however, tasks may vary in cardinality, domain and even objective, where the interactions among different tasks can be quite complicated and implicit. We evaluate our model on different scenarios in the Experiment section.

3 Methodology

Recently, neural network based models have attracted substantial interest in many natural language processing tasks for their capability to represent variable-length text sequences as fixed-length vectors, for example, Neural Bag-of-Words (NBOW), Recurrent Neural Networks (RNNs), Recursive Neural Networks (RecNNs) and Convolutional Neural Networks (CNNs). Most of them first map sequences of words, n-grams or other semantic units into embedding representations with a pre-trained lookup table, then fuse these vectors with different architectures of neural networks, and finally utilize a softmax layer to predict the categorical distribution for a specific classification task. In recurrent neural networks, input vectors are absorbed one by one in a recurrent way, which makes RNNs particularly suitable for natural language processing tasks.

3.1 Recurrent Neural Network

A recurrent neural network maintains an internal hidden state vector h_t that is recurrently updated by a transition function f. At each time step t, the hidden state h_t is updated according to the current input vector x_t and the previous hidden state h_{t−1}:

h_t = f(h_{t−1}, x_t)

where f is usually a composition of an element-wise nonlinearity with an affine transformation of both x_t and h_{t−1}.

In this way, recurrent neural networks can compress a sequence of arbitrary length into a fixed-length vector and feed it to a softmax layer for text classification or other NLP tasks. However, gradients of h_t can grow or decay exponentially over long sequences during training, also known as the gradient exploding or vanishing problems, which makes it difficult for RNNs to learn long-term dependencies and correlations.
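The plain recurrence above can be sketched in a few lines of NumPy (random parameters, illustrative names of our own):

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U, b):
    """One vanilla RNN transition: h_t = tanh(W x_t + U h_prev + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = rng.standard_normal((d_h, d_in))
U = rng.standard_normal((d_h, d_h))
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):  # a length-5 input sequence
    h = rnn_step(h, x_t, W, U, b)
# h is now a fixed-length representation of the whole sequence
```

The final h would be fed to a softmax layer for classification; repeated multiplication by U in backpropagation is exactly what causes the exploding/vanishing gradients mentioned above.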

[Hochreiter and Schmidhuber1997] proposed the Long Short-Term Memory network (LSTM) to tackle the above problems. Apart from the internal hidden state h_t, an LSTM also maintains an internal memory cell c_t and three gating mechanisms. While there are numerous variants of the standard LSTM, here we follow the implementation of [Graves2013]. At each time step t, the state of the LSTM can be fully represented by five vectors: an input gate i_t, a forget gate f_t, an output gate o_t, the hidden state h_t and the memory cell c_t, which adhere to the following transition functions:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where x_t is the current input, σ denotes the logistic sigmoid function and ⊙ denotes element-wise multiplication. By selectively controlling which portions of the memory cell to update, erase and forget at each time step, an LSTM can better comprehend long-term dependencies with respect to labels of whole sequences.
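A minimal sketch of one LSTM transition, assuming the standard variant without peephole connections (parameter names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM transition: gates i, f, o and candidate memory update."""
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])
    g = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])
    c = f * c_prev + i * g          # selectively forget and write memory
    h = o * np.tanh(c)              # expose a gated view of the memory
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
P = {}
for k in "ifoc":
    P["W" + k] = rng.standard_normal((d_h, d_in))
    P["U" + k] = rng.standard_normal((d_h, d_h))
    P["b" + k] = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):
    h, c = lstm_step(x_t, h, c, P)
```

Because c_t is updated additively through the forget gate rather than squashed at every step, gradients can flow over much longer spans than in a vanilla RNN.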

3.2 A Generalized Architecture

Based on the LSTM implementation of  [Graves2013], we propose a generalized multi-task learning architecture for text classification with four types of recurrent neural layers to convey information inside and among tasks. Figure 1 illustrates the structure design and information flows of our model, where three tasks are jointly learned in parallel.

As Figure 1(a) shows, each task owns an LSTM-based Single Layer for intra-task learning. Pair-wise Coupling Layers and Local Fusion Layers are designed for direct and indirect inter-task interactions, and we further utilize a Global Fusion Layer to maintain a global memory for information shared among all tasks.

(a) Overall architecture with Single Layers, Coupling Layers, Local Fusion Layers and Global Fusion Layer
(b) Details of Coupling Layer between tasks T_i and T_j
(c) Details of Local Fusion Layer between tasks T_i and T_j
Figure 1: A generalized recurrent neural architecture for modeling text with multi-task learning

3.2.1 Single Layer

Each task T_k owns an LSTM-based Single Layer with a separate collection of parameters, following the LSTM transition functions in Section 3.1.

Input sequences of each task are transformed into vector representations, which are then recurrently fed into the corresponding Single Layers. The hidden state at the last time step of each Single Layer can be regarded as a fixed-length representation of the whole sequence, which is followed by a fully connected layer and a softmax non-linear layer to produce the class distribution:

ŷ^(k) = softmax(W^(k) h_T^(k) + b^(k))

where ŷ^(k) is the predicted class distribution for task T_k.

3.2.2 Coupling Layer

Besides Single Layers, we design Coupling Layers to model direct pair-wise interactions between tasks. For each pair of tasks, the hidden states and memory cells of the Single Layers can obtain extra information directly from each other, as shown in Figure 1(b).

We re-define the memory cell update and utilize a gating mechanism to control the portion of information flowing from one task to another, so that the memory content of each Single Layer is updated on the leverage of pair-wise couplings:

c_t^(i) = f_t^(i) ⊙ c_{t−1}^(i) + i_t^(i) ⊙ c̃_t^(i) + ∑_{j≠i} g_t^(j→i) ⊙ tanh(W^(j→i) h_{t−1}^(j))

where the gate g_t^(j→i) controls the portion of information flow from task T_j to task T_i, based on the correlation strength between the two tasks' hidden states at the current time step.

In this way, the hidden states and memory cells of each Single Layer can obtain extra information from other tasks, and stronger relevance results in higher chances of reception.
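A toy sketch of the gated coupling idea: task i's memory update is augmented with a contribution from task j's hidden state, scaled by a sigmoid gate. All names and the exact gate parameterization here are our own illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coupled_memory_update(c_i, cand_i, i_gate, f_gate, h_j, W_g, W_c):
    """Task i's memory update plus a gated contribution from task j
    (illustrative sketch, not the paper's exact parameterization)."""
    g = sigmoid(W_g @ h_j)                                  # how much of task j to accept
    return f_gate * c_i + i_gate * cand_i + g * np.tanh(W_c @ h_j)

rng = np.random.default_rng(2)
d = 3
c_i, cand_i = rng.standard_normal(d), rng.standard_normal(d)
i_gate = sigmoid(rng.standard_normal(d))
f_gate = sigmoid(rng.standard_normal(d))
h_j = rng.standard_normal(d)                                # other task's hidden state
W_g, W_c = rng.standard_normal((d, d)), rng.standard_normal((d, d))
c_new = coupled_memory_update(c_i, cand_i, i_gate, f_gate, h_j, W_g, W_c)
```

When the two tasks are unrelated, the gate g can learn to stay near zero, so the coupled update degenerates to the single-task LSTM update.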

3.2.3 Local Fusion Layer

Different from Coupling Layers, Local Fusion Layers introduce a shared bi-directional LSTM layer to model indirect pair-wise interactions between tasks. For each pair of tasks, we feed the Local Fusion Layer with the concatenation of both inputs, as shown in Figure 1(c). We denote the output of the Local Fusion Layer as the concatenation of the hidden states from the forward and backward LSTMs at each time step.

Similar to Coupling Layers, the hidden states and memory cells of the Single Layers can selectively decide how much information to accept from the pair-wise Local Fusion Layers. We re-define the memory cell update by adding a local fusion term alongside the coupling term, where again a gating mechanism controls the portion of information flowing from the Local Fusion Layers into each memory cell.

3.2.4 Global Fusion Layer

Indirect interactions between Single Layers can be pair-wise or global, so we further propose the Global Fusion Layer as a shared memory storage among all tasks. The Global Fusion Layer consists of a bi-directional LSTM layer that takes the concatenation of the inputs of all tasks and produces a global fusion term, which contributes to the memory content of each Single Layer through another gating mechanism.

As a result, our architecture covers complicated interactions among different tasks. It is capable of mapping a collection of input sequences from different tasks into a combination of predicted class distributions in parallel, as formulated in Section 2.2.

3.3 Sampling & Training

Most previous multi-task learning models [Collobert and Weston2008, Liu et al.2015b, Liu et al.2016b, Liu et al.2016c] belong to Type-I or Type-II, where the total number of input samples is the sum of the sample numbers of each task.

However, our model focuses on Type-III and requires a 4-D tensor as input, whose dimensions are the total number of input collections, the task number, the sequence length and the embedding size respectively. Samples from different tasks are jointly learned in parallel, so the total number of all possible input collections is the product of the sample numbers of each task. We propose a Task Oriented Sampling (TOS) algorithm to generate sample collections for improvements on a specific task.

Input: N_k samples from each task T_k; i, the oriented task index; c, an upsampling coefficient s.t. c·N_i ≥ max(N_1, …, N_K)
Output: sequence collections X and label combinations Y
1: for each task T_k do
2:     generate a set S_k with c·N_i samples for task T_k:
3:     if k = i then
4:         repeat each sample for c times
5:     else if N_k ≥ c·N_i then
6:         randomly select c·N_i samples without replacement
7:     else
8:         randomly select c·N_i samples with replacement
9:     end if
10: end for
11: for t = 1 to c·N_i do
12:     randomly select a sample from each S_k without replacement
13:     combine their features and labels as x_t and y_t
14: end for
15: merge all x_t and y_t to produce the sequence collections X and label combinations Y
Algorithm 1 Task Oriented Sampling
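A runnable sketch of Task Oriented Sampling (function and variable names are ours; shuffling each pool and zipping across it implements the "select from each set without replacement" step):

```python
import random

def task_oriented_sampling(datasets, oriented, c, seed=0):
    """datasets: list of per-task sample lists (each sample is (features, label)).
    oriented: index i of the task to improve; c: upsampling coefficient.
    Returns c * N_i cross-task sample collections."""
    rng = random.Random(seed)
    target = c * len(datasets[oriented])          # c * N_i samples per task
    pools = []
    for k, data in enumerate(datasets):
        if k == oriented:
            pool = [s for s in data for _ in range(c)]        # repeat c times
        elif len(data) >= target:
            pool = rng.sample(data, target)                   # without replacement
        else:
            pool = [rng.choice(data) for _ in range(target)]  # with replacement
        rng.shuffle(pool)   # shuffled pool -> draws below are without replacement
        pools.append(pool)
    # combine one sample per task into each collection
    return [tuple(pool[t] for pool in pools) for t in range(target)]

tasks = [[("a%d" % n, 0) for n in range(4)],    # oriented task, N_0 = 4
         [("b%d" % n, 1) for n in range(10)],   # larger task
         [("c%d" % n, 0) for n in range(3)]]    # smaller task
collections = task_oriented_sampling(tasks, oriented=0, c=2)
```

With c = 2 and N_0 = 4, each of the 8 collections bundles one sample from every task, features and labels combined per Algorithm 1.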

Given the generated sequence collections X and label combinations Y, the overall loss function can be calculated as the weighted cross-entropy defined in Section 2.2. Training is conducted in a stochastic manner until convergence: for each loop, we randomly select a collection from the candidates and update the parameters by taking a gradient step.

4 Experiment

In this section, we design three different scenarios of multi-task learning based on five benchmark datasets for text classification. We investigate the empirical performance of our model and compare it with existing state-of-the-art models.

Dataset Description Type Length Class Objective
SST Movie reviews in Stanford Sentiment Treebank including SST-1 and SST-2 Sentence 19 / 19 5 / 2 Sentiment
IMDB Internet Movie Database Document 279 2 Sentiment
MDSD Product reviews on books, DVDs, electronics and kitchen appliances Document 176 / 189 / 115 / 97 2 Sentiment
RN Reuters Newswire topics classification Document 146 46 Topics
QC Question Classification Sentence 10 6 Question Types
Table 1: Five benchmark classification datasets: SST, IMDB, MDSD, RN, QC.

4.1 Datasets

As Table 1 shows, we select five benchmark datasets for text classification and design three experiment scenarios to evaluate the performances of our model.

  • Multi-Cardinality Movie review datasets with different average lengths and class numbers, including SST-1 [Socher et al.2013], SST-2 and IMDB [Maas et al.2011].

  • Multi-Domain Product review datasets on different domains from Multi-Domain Sentiment Dataset [Blitzer et al.2007], including Books, DVDs, Electronics and Kitchen.

  • Multi-Objective Classification datasets with different objectives, including IMDB, RN [Apté et al.1994] and QC [Li and Roth2002].

4.2 Hyperparameters and Training

The whole network is trained through back propagation with stochastic gradient descent [Amari1993]. We obtain a pre-trained lookup table by applying Word2Vec [Mikolov et al.2013a] to the Google News corpus, which contains more than 100B words with a vocabulary size of about 3M. All involved parameters are randomly initialized from a truncated normal distribution with zero mean and a fixed standard deviation.

For each task, we conduct TOS with that task as the oriented one to improve its performance. After training our model on the generated sample collections, we evaluate the task by comparing its performance when learned jointly against when learned individually on the test set. We apply 10-fold cross-validation and investigate different combinations of hyperparameters, of which the best one, as shown in Table 2, is reserved for comparisons with state-of-the-art models.

Embedding size
Hidden layer size of LSTM
Initial learning rate
Regularization weight
Table 2: Hyperparameter settings

4.3 Results

We compare the performance of our model with the implementation of [Graves2013], and the results are shown in Table 3. Our model obtains better performances in the Multi-Domain scenario with an average improvement of 4.5%, where the datasets are product reviews on different domains with similar sequence lengths and the same class number, thus producing stronger correlations. The Multi-Cardinality scenario also achieves a significant improvement of 2.77% on average, where the datasets are movie reviews with different cardinalities.

However, the Multi-Objective scenario benefits less from multi-task learning due to the lack of salient correlations among sentiment, topic and question type. The QC dataset aims to classify each question into six categories, and its performance even gets worse, which may be caused by potential noise introduced by other tasks. In practice, the structure of our model is flexible, as couplings and fusions between empirically unrelated tasks can be removed to alleviate computation costs.

Model Multi-Cardinality Multi-Domain Multi-Objective
SST-1 SST-2 IMDB Books DVDs Electronics Kitchen IMDB RN QC
Single Task 45.9 85.8 88.5 78.0 79.5 81.2 81.8 88.5 83.6 92.5
Our Model 49.2 87.7 91.6 83.5 84.0 86.2 84.8 89.7 84.2 92.3
Table 3: Results of our model on different scenarios

4.3.1 Influences of c in TOS

We further explore the influence of c in TOS on our model, where c can be any positive integer. A higher c means larger and more varied sample combinations, but requires higher computation costs.

Figure 2 shows the performances on the datasets of the Multi-Domain scenario with different c. Our model achieves considerable improvements for larger c as more sample combinations become available. However, there are no further salient gains as c grows larger, and potential noise from other tasks may lead to performance degradation. As a trade-off between efficiency and effectiveness, we determine a moderate c as the optimal value for our experiments.

Figure 2: Influences of c in TOS on different datasets

4.3.2 Pair-wise Performance Gain

In order to measure the correlation strength between two tasks T_i and T_j, we learn them jointly with our model and define the Pair-wise Performance Gain (PPG) based on the relative improvements of both tasks, comparing their performances when learned individually and when learned jointly.

We calculate PPGs for every two tasks in Table 1 and illustrate the results in Figure 3, where darker colors indicate stronger correlations. It is intuitive that the datasets of the Multi-Domain scenario obtain relatively higher PPGs with each other, as they share similar cardinalities and abundant low-level linguistic characteristics. Sentences of the QC dataset are much shorter and convey characteristics distinct from the other tasks, thus resulting in considerably lower PPGs.
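Since the exact PPG formula is lost in extraction, one plausible instantiation (an assumption of ours) takes the geometric mean of the two tasks' relative gains; the accuracies below are the Books/DVDs numbers from Table 3, used only for illustration:

```python
import math

def ppg(p_i, p_j, p_i_joint, p_j_joint):
    """Pair-wise Performance Gain as the geometric mean of both tasks'
    relative improvements (assumed form, not the paper's exact formula)."""
    gain_i = (p_i_joint - p_i) / p_i
    gain_j = (p_j_joint - p_j) / p_j
    return math.sqrt(gain_i * gain_j) if gain_i > 0 and gain_j > 0 else 0.0

# Books and DVDs accuracies, single-task vs. jointly learned (Table 3)
score = ppg(78.0, 79.5, 83.5, 84.0)
```

A symmetric measure like this only rewards pairs where both tasks improve, matching the intuition that correlated tasks help each other.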

Figure 3: Visualization of Pair-wise Performance Gains
Model SST-1 SST-2 IMDB Books DVDs Electronics Kitchen QC
NBOW 42.4 80.5 83.62 - - - - 88.2
PV 44.6 82.7 91.7 - - - - 91.8
MT-RNN 49.6 87.9 91.3 - - - - -
MT-CNN - - - 80.2 81.0 83.4 83.0 -
MT-DNN - - - 79.7 80.5 82.5 82.8 -
GRNN 47.5 85.5 - - - - - 93.8
Our Model 49.2 87.7 91.6 83.5 84.0 86.2 84.8 92.3
Table 4: Comparisons with state-of-the-art models

4.4 Comparisons with State-of-the-art Models

We apply the optimal hyperparameter settings and compare our model against the following state-of-the-art models:

  • NBOW Neural Bag-of-Words that simply sums up embedding vectors of all words.

  • PV

    Paragraph Vectors followed by logistic regression 

    [Le and Mikolov2014].

  • MT-RNN Multi-Task learning with Recurrent Neural Networks by a shared-layer architecture [Liu et al.2016c].

  • MT-CNN Multi-Task learning with Convolutional Neural Networks [Collobert and Weston2008] where lookup tables are partially shared.

  • MT-DNN Multi-Task learning with Deep Neural Networks [Liu et al.2015b] that utilizes bag-of-word representations and a hidden shared layer.

  • GRNN Gated Recursive Neural Network for sentence modeling [Chen et al.2015].

As Table 4 shows, our model obtains competitive or better performances on all tasks except for the QC dataset, as it correlates poorly with the other tasks. MT-RNN slightly outperforms our model on SST, as sentences from this dataset are much shorter than those from IMDB and MDSD; another possible reason may be that our model is more complex and requires more data for training. Our model proposes designs of various interactions including coupling, local fusion and global fusion, which can be further implemented by other state-of-the-art models to produce better performances.

5 Related Work

There is a large body of literature related to multi-task learning with neural networks in NLP [Collobert and Weston2008, Liu et al.2015b, Liu et al.2016b, Liu et al.2016c].

[Collobert and Weston2008] belongs to Type-I and utilizes shared lookup tables for common features, followed by task-specific neural layers, for several traditional NLP tasks such as part-of-speech tagging and semantic parsing. They use a fixed-size window to cope with variable-length texts, a problem better handled by recurrent neural networks.

[Liu et al.2015b, Liu et al.2016b, Liu et al.2016c] all belong to Type-II, where samples from different tasks are learned sequentially. [Liu et al.2015b] applies bag-of-word representations, so word-order information is lost. [Liu et al.2016b] introduces an external memory for information sharing with a reading/writing mechanism for communication, and [Liu et al.2016c] proposes three different models for multi-task learning with recurrent neural networks. However, the models of these two papers only involve pair-wise interactions, which can be regarded as specific implementations of the Coupling Layer and Fusion Layer in our model.

Different from the above models, our model focuses on Type-III and utilizes recurrent neural networks to comprehensively capture various interactions among tasks, both direct and indirect, local and global. Three or more tasks are learned simultaneously, and samples from different tasks are trained in parallel, benefiting from each other and thus obtaining better sentence representations.

6 Conclusion and Future Work

In this paper, we propose a multi-task learning architecture for text classification with four types of recurrent neural layers. The architecture is structurally flexible and can be regarded as a generalized case of many previous works with deliberate designs. We explore three different scenarios of multi-task learning, and in all scenarios our model can improve the performance of most tasks with additional related information from the others.

In future work, we would like to investigate further implementations of couplings and fusions, and explore more perspectives of multi-task learning.


  • [Amari1993] Shun-ichi Amari. Backpropagation and stochastic gradient descent method. Neurocomputing, 5(3):185–196, 1993.
  • [Apté et al.1994] Chidanand Apté, Fred Damerau, and Sholom M. Weiss. Automated Learning of Decision Rules for Text Categorization. ACM Trans. Inf. Syst., 12(3):233–251, 1994.
  • [Bengio et al.2013] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.
  • [Blitzer et al.2007] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In ACL, 2007.
  • [Caruana1997] Rich Caruana. Multitask Learning. Machine Learning, 28(1):41–75, 1997.
  • [Chen et al.2015] Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Shiyu Wu, and Xuanjing Huang. Sentence Modeling with Gated Recursive Neural Network. In EMNLP, pages 793–798, 2015.
  • [Collobert and Weston2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, pages 160–167, 2008.
  • [Graves2013] Alex Graves. Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850, 2013.
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
  • [Le and Mikolov2014] Quoc V. Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1188–1196, 2014.
  • [Li and Roth2002] Xin Li and Dan Roth. Learning Question Classifiers. In COLING, 2002.
  • [Liu et al.2015a] Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. Multi-Timescale Long Short-Term Memory Neural Network for Modelling Sentences and Documents. In EMNLP, pages 2326–2335, 2015.
  • [Liu et al.2015b] Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. In NAACL HLT, pages 912–921, 2015.
  • [Liu et al.2016a] Pengfei Liu, Xipeng Qiu, Jifan Chen, and Xuanjing Huang. Deep Fusion LSTMs for Text Semantic Matching. In ACL, 2016.
  • [Liu et al.2016b] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Deep Multi-Task Learning with Shared Memory for Text Classification. In EMNLP, pages 118–127, 2016.
  • [Liu et al.2016c] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent Neural Network for Text Classification with Multi-Task Learning. In IJCAI, pages 2873–2879, 2016.
  • [Liu et al.2016d] Pengfei Liu, Xipeng Qiu, Yaqian Zhou, Jifan Chen, and Xuanjing Huang. Modelling Interaction of Sentence Pair with Coupled-LSTMs. In EMNLP, pages 1703–1712, 2016.
  • [Maas et al.2011] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning Word Vectors for Sentiment Analysis. In NAACL HLT, pages 142–150. Association for Computational Linguistics, June 2011.
  • [Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781, 2013.
  • [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
  • [Socher et al.2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In EMNLP, pages 1631–1642, Stroudsburg, PA, October 2013. Association for Computational Linguistics.
  • [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.