Double Path Networks for Sequence to Sequence Learning

06/13/2018 · Kaitao Song et al. · Nanjing University, Microsoft, Peking University

Encoder-decoder based Sequence to Sequence learning (S2S) has made remarkable progress in recent years. Different network architectures have been used in the encoder/decoder. Among them, Convolutional Neural Networks (CNN) and Self Attention Networks (SAN) are the prominent ones. The two architectures achieve similar performances but encode and decode context in very different ways: CNN uses convolutional layers to focus on the local connectivity of the sequence, while SAN uses self-attention layers to focus on global semantics. In this work we propose Double Path Networks for Sequence to Sequence learning (DPN-S2S), which leverage the advantages of both models through double path information fusion. During the encoding step, we develop a double path architecture that maintains the information coming from different paths with convolutional layers and self-attention layers separately. To effectively use the encoded context, we develop a cross attention module with gating and use it to automatically pick up the information needed during the decoding step. By deeply integrating the two paths with cross attention, both types of information are combined and well exploited. Experiments show that our proposed method can significantly improve the performance of sequence to sequence learning over state-of-the-art systems.


1 Introduction

This work was done when K. Song was an intern at Microsoft Research.
Our implementation is built upon https://github.com/StillKeepTry/Transformer-PyTorch, which is based on fairseq.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

Sequence to Sequence learning (S2S) [Cho et al.2014, Sutskever et al.2014] is one of the challenging tasks in artificial intelligence, covering machine translation, document summarization, question answering, etc. It has attracted more and more attention in recent years due to the success of deep learning. Sequence to sequence learning is usually developed based on a general encoder-decoder framework: the encoder reads the source sequence and generates a set of representations. After that, the decoder estimates the conditional probability of each target token given the source representations and its preceding tokens.

Different network architectures for sequence to sequence modelling have been designed based on this framework. The Long Short-Term Memory (LSTM) based model [Cho et al.2014] is the most popular. In the LSTM based model, the tokens are encoded/decoded using LSTM units, which can effectively summarize the temporal information of the sentences in the source/target domains. However, because of its recurrent nature, the LSTM based model is difficult to parallelize, so training efficiency becomes the major challenge. Recently, Convolutional Neural Network (CNN) based [Gehring et al.2017a] and Self Attention Network (SAN) based [Vaswani et al.2017] models have been proposed, which achieve significant improvements in accuracy. Both models replace the recurrent structure with networks that are easy to parallelize, leading to faster training.

CNN based and SAN based sequence to sequence models encode and decode information in very different ways. The CNN based model treats sequences similarly to images: the convolutional layer is used to summarize information for a local region of the sequence, which corresponds to a subsequence of tokens. Different from the CNN based model, the SAN based model uses self-attention layers to abstract the context based on semantic meanings: in each layer, for each token, the similarity between the token and all other tokens in the sequence is computed. Then the contexts are combined with weights based on the global semantic similarity. By stacking multiple convolutional layers or self-attention layers, the two models obtain the whole contextual information during the encoding and decoding steps.

As we can see, the convolutional layer focuses more on abstracting information according to local connectivity, while the self-attention layer focuses more on using global semantics. Both models achieve similar accuracy on several sequence to sequence learning tasks. A natural question then arises: is there any way to integrate the two methods to achieve higher accuracy? This is exactly the purpose of this paper.

In this work, we investigate the possibility of combining the two models into a unified framework and develop an efficient architecture that leverages the advantages of both. In particular, we propose Double Path Networks for Sequence to Sequence learning (DPN-S2S), which contain a convolutional path and a self-attention path with attention information fusion between the encoder and the decoder. In the encoding step, we use the convolutional path and the self-attention path independently, so that we can summarize the context of the source sequence from different aspects. In the decoding step, we develop a cross attention module with gating to automatically select the appropriate context, and use the context in a similar double path structure. As there is no recurrent structure in convolutional layers and self-attention layers, the model remains easy to parallelize. Since we abstract the hidden representations in different ways, the contexts we obtain are more diverse, making it possible to achieve higher accuracy with the same number of parameters.

Experimental results show that our proposed approach achieves significant improvements over the single path based S2S framework. The results are summarized as follows:

  • With a similar number of parameters, our model outperforms both the CNN based and SAN based models. More specifically, on the Nist Chinese-English translation task, we outperform the strong CNN based baseline by 0.46-2.96 BLEU and the SAN based baseline by 0.74-0.99 BLEU on different Nist test sets. Furthermore, our model also achieves state-of-the-art BLEU/ROUGE scores on the IWSLT14 German-English translation task and the English Gigaword text summarization task.

  • Through ablation studies and analyses, we show that both the CNN and SAN paths are important in the encoder and the decoder. Removing any part causes a performance drop, which demonstrates the effectiveness of our model.

2 Background

2.1 Encoder-Decoder Framework

Denote $x$ and $y$ as the sentence pair, in which $x_i$ and $y_j$ are the $i$-th and $j$-th tokens of sequences $x$ and $y$, and $T_x$ and $T_y$ are the lengths of the sequences. The sequence to sequence model learns to estimate the conditional probability $P(y|x)$. Denote the model parameter as $\theta$ and the training corpus as $\mathcal{D}$. One of the most popularly used objective functions is the log likelihood function:

$$L(\theta; \mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} \log P(y|x; \theta) \qquad (1)$$

According to the probability chain rule, we have:

$$P(y|x; \theta) = \prod_{t=1}^{T_y} P(y_t | y_{<t}, x; \theta) \qquad (2)$$

where $y_{<t}$ is the sequence of preceding tokens before position $t$. To model the conditional probability, an encoder-decoder framework is usually employed [Cho et al.2014, Bahdanau et al.2015]. In the encoding phase, $z$ is learnt as the representation of the source sequence:

$$z = \mathrm{Encoder}(x) \qquad (3)$$

where $\mathrm{Encoder}(\cdot)$ can be implemented as an RNN, CNN or SAN. During the decoding phase, the decoder generates the target token $y_t$ at position $t$ using the previously generated tokens $y_{<t}$ and the source representations $z$.

An attention mechanism is usually adopted to choose the positions in the source representations to focus on. Several types of attention have been used in the literature, such as the additive and dot-product variants [Luong et al.2015, Bahdanau et al.2015]. In this paper, we mainly adopt attention formulated in the query-key-value (q-k-v) form:

$$\mathrm{Attention}(q, k, v) = \mathrm{softmax}\big(\alpha(k, q)\big)\, v \qquad (4)$$

where the function $\alpha(\cdot, \cdot)$ computes the attention coefficients, which evaluate how well the source representations $k$ match the target query $q$. Here $q$ is usually the hidden state in the current decoder layer, and $k$, $v$ are usually the same as the source representation $z$.
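For concreteness, the following is a minimal PyTorch sketch of attention in this q-k-v form, using a scaled dot product as the scoring function $\alpha$; the specific choice of score function, the helper name and the tensor shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def qkv_attention(q, k, v, mask=None):
    """Minimal q-k-v attention sketch (Eq. 4).

    q: (batch, tgt_len, d)   decoder hidden states (queries)
    k: (batch, src_len, d)   source representations (keys)
    v: (batch, src_len, d)   source representations (values)
    """
    d = q.size(-1)
    # alpha(k, q): scaled dot-product scores between queries and keys (an assumption)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # softmax over source positions gives the alignment distribution
    align = F.softmax(scores, dim=-1)
    # weighted combination of the values
    return torch.matmul(align, v), align
```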

2.2 CNN based Encoder-Decoder

The CNN based encoder-decoder framework [Gehring et al.2017b] typically uses one-dimensional convolution in each layer to perform a nonlinear transformation. Denote $h_i^l$ with dimension $d$ as the $i$-th hidden state in the $l$-th CNN layer and set $h_i^0 = e_i$, where $e_i$ is the input embedding of the sequence. The hidden state is computed by the convolution operation:

$$h_i^l = \mathrm{GLU}\big(W^l\,[h_{i-\lfloor k/2 \rfloor}^{l-1}; \dots; h_{i+\lfloor k/2 \rfloor}^{l-1}] + b^l\big) \qquad (5)$$

where $W^l$ is the convolution filter with kernel size $k$, $b^l$ is the bias and $d$ is the hidden dimension. The convolution input is the concatenation of $k$ elements in the lower layer with hidden dimension $d$. We use Gated Linear Units (GLU) [Dauphin et al.2017] as the activation function, which maps a $2d$-dimensional input vector to a $d$-dimensional output vector.
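Below is a minimal PyTorch sketch of one such convolutional layer, assuming the ConvS2S-style layout in which the convolution produces $2d$ channels that the GLU gates back down to $d$; the class name, default kernel size and the simplified residual connection are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvGLULayer(nn.Module):
    """One CNN-path layer: 1-D convolution followed by a GLU (Eq. 5), sketch only."""

    def __init__(self, d, kernel_size=3):
        super().__init__()
        # The convolution outputs 2*d channels; GLU splits them into value and gate halves.
        self.conv = nn.Conv1d(d, 2 * d, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, seq_len, d) -> Conv1d expects (batch, d, seq_len)
        h = self.conv(x.transpose(1, 2))
        h = F.glu(h, dim=1)              # gated linear unit: 2d channels -> d channels
        return h.transpose(1, 2) + x     # simple residual connection for stacking layers
```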

2.3 SAN based Encoder-Decoder

The SAN system was first proposed by Vaswani et al. [Vaswani et al.2017]. Each self-attention layer has two sublayers: a multi-head self-attention layer and a feed forward network. Both sublayers are stacked using residual connections [He et al.2016] and layer normalization [Ba et al.2016]. The multi-head self-attention layer is formulated as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^O, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \qquad (6)$$

Each head uses weights $W_i^Q$, $W_i^K$ and $W_i^V$ to linearly transform the inputs $Q$, $K$, $V$, and then performs the attention of Equation 4 with a scale factor equal to $1/\sqrt{d_k}$, where $d_k$ is the hidden size of the query and $H$ is the number of heads.

For the feed forward network, we use a simple two-layer, fully connected network with a ReLU activation in the middle layer:

$$\mathrm{FFN}(x) = W_2\, \mathrm{ReLU}(W_1 x + b_1) + b_2 \qquad (7)$$

where $W_1$, $b_1$ and $W_2$, $b_2$ are the parameters of the feed forward network. The feed forward network applies the non-linear transformation to each position identically and separately.
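For reference, here is a compact PyTorch sketch of one SAN layer (multi-head self-attention plus the position-wise feed forward network of Eq. 7, with residual connections and layer normalization). The default sizes mirror the IWSLT setting described later (hidden 256, filter 1024); the post-norm arrangement, head count and class name are assumptions for illustration.

```python
import torch.nn as nn

class SANLayer(nn.Module):
    """One SAN-path layer: multi-head self-attention + feed forward (Eqs. 6-7), sketch only."""

    def __init__(self, d=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        # batch_first=True requires a reasonably recent PyTorch (>= 1.9).
        self.self_attn = nn.MultiheadAttention(d, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, attn_mask=None):
        # Sublayer 1: multi-head self-attention with residual connection + layer norm
        a, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + a)
        # Sublayer 2: position-wise feed forward with residual connection + layer norm
        return self.norm2(x + self.ffn(x))
```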

3 Double Path Networks

In our work, we introduce double path networks which incorporate CNN and SAN for sequence to sequence learning. A detailed model structure is shown in Figure 1.

Figure 1: DPN-S2S architecture. The CNN and SAN paths are depicted with green and blue modules, respectively. The encoder part is colored dark grey, the attention fusion part light yellow, and the decoder part light grey. The blue and green lines show the information flow from encoder to decoder.

3.1 Double Path Encoder

To leverage the advantages of the CNN and SAN layers, we design an encoder that includes a CNN path and a SAN path, introduced in Sections 2.2 and 2.3 respectively. We sum the position embedding and the word embedding as the input of the encoder, to give the model a sense of each token's position in the sentence. For each path, we employ residual connections to stack multiple layers for a deeper network. Consequently, we collect two different context representations from the encoder.
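A minimal sketch of the double path encoder, reusing the ConvGLULayer and SANLayer sketches above: shared word and position embeddings are summed once and then fed to the two paths independently. The class and parameter names are hypothetical, and the layer counts follow the IWSLT configuration (4 CNN layers, 2 SAN layers) only as an example.

```python
import torch
import torch.nn as nn

class DoublePathEncoder(nn.Module):
    """Sketch: shared embeddings feed a CNN path and a SAN path independently."""

    def __init__(self, vocab_size, d=256, max_len=1024, n_cnn=4, n_san=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.pos_emb = nn.Embedding(max_len, d)
        self.cnn_path = nn.ModuleList([ConvGLULayer(d) for _ in range(n_cnn)])
        self.san_path = nn.ModuleList([SANLayer(d) for _ in range(n_san)])

    def forward(self, tokens):
        # Sum of word embedding and position embedding as the encoder input.
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.word_emb(tokens) + self.pos_emb(pos)
        h_cnn, h_san = x, x
        for layer in self.cnn_path:
            h_cnn = layer(h_cnn)
        for layer in self.san_path:
            h_san = layer(h_san)
        return h_cnn, h_san   # two different context representations of the source
```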

3.2 Cross Attention with Gating

Between the encoder and the decoder, we develop a cross attention module with gating fusion, to automatically pick up the information needed to decode the target word. As shown in Figure 1, the attention fusion module consists of two submodules: an attention submodule and a gated fusion submodule. We now give detailed descriptions of the two submodules.

Each path in the decoder takes the encoder's double-path outputs as context for attention. Thus, there are four types of information flows from encoder to decoder. The attention submodule in the decoder uses Eq. (4) to compute the attention results, denoted as:

$$A^{p,q} = \mathrm{Attention}\big(Q^{p}_{dec}, K^{q}_{enc}, V^{q}_{enc}\big), \quad p, q \in \{\mathrm{cnn}, \mathrm{san}\} \qquad (8)$$

For example, $A^{\mathrm{cnn},\mathrm{san}}$ means the attention result with the attention query from the CNN path in the decoder and the attention key and value from the SAN path in the encoder.

In order to fully exploit the information captured by the different encoder paths, we design a gating mechanism to fuse them, balancing information usage between local connectivity and global semantics. This fusion gate is shown as the gating module on the right side of Figure 1 and is formulated as follows (for each decoder path $p \in \{\mathrm{cnn}, \mathrm{san}\}$):

$$g_{\mathrm{cnn}} = \sigma\big(W_{\mathrm{cnn}}\,[A^{p,\mathrm{cnn}}; A^{p,\mathrm{san}}] + b_{\mathrm{cnn}}\big), \quad g_{\mathrm{san}} = \sigma\big(W_{\mathrm{san}}\,[A^{p,\mathrm{cnn}}; A^{p,\mathrm{san}}] + b_{\mathrm{san}}\big), \quad C^{p} = g_{\mathrm{cnn}} \cdot A^{p,\mathrm{cnn}} + g_{\mathrm{san}} \cdot A^{p,\mathrm{san}} \qquad (9)$$

where $g_{\mathrm{cnn}}$ and $g_{\mathrm{san}}$ are scalar attention gates for the CNN context and the SAN context; $[\cdot\,; \cdot]$ is an element concatenation operation; $W_{\mathrm{cnn}}$ and $W_{\mathrm{san}}$ are the weight matrices; $b_{\mathrm{cnn}}$ and $b_{\mathrm{san}}$ are the scalar biases of the attention gates; and $C^{\mathrm{cnn}}$ and $C^{\mathrm{san}}$ are the fusion context vectors for the current CNN layer and SAN layer. For the gates $g_{\mathrm{cnn}}$ and $g_{\mathrm{san}}$, both attention results are used as inputs to the gating computation.
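The following sketch shows one plausible PyTorch realization of this gated fusion for a single decoder path, reusing the qkv_attention sketch from Section 2.1. The per-position scalar gates, the class name and the parameter shapes are assumptions consistent with the description above, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Cross attention with gating (Eqs. 8-9), sketch only.

    For one decoder path, attend to both encoder paths and fuse the two
    attention results with sigmoid gates computed from their concatenation.
    """

    def __init__(self, d):
        super().__init__()
        self.gate_cnn = nn.Linear(2 * d, 1)   # scalar gate for the CNN-path context
        self.gate_san = nn.Linear(2 * d, 1)   # scalar gate for the SAN-path context

    def forward(self, query, enc_cnn, enc_san):
        # Eq. 8: two of the four information flows (for this decoder path).
        a_cnn, _ = qkv_attention(query, enc_cnn, enc_cnn)
        a_san, _ = qkv_attention(query, enc_san, enc_san)
        # Eq. 9: gates balance local (CNN) vs. global (SAN) information.
        both = torch.cat([a_cnn, a_san], dim=-1)
        g_cnn = torch.sigmoid(self.gate_cnn(both))
        g_san = torch.sigmoid(self.gate_san(both))
        return g_cnn * a_cnn + g_san * a_san   # fused context for this decoder path
```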

3.3 Double Path Decoder

The decoder is illustrated on the right side of Figure 1. It is similar to the encoder except for the decoder-to-encoder attention. Each layer in the CNN and SAN paths contains two decoder-to-encoder attention modules, as described in the last subsection. We also need to mask the future information in the convolution and self-attention operations in order to be consistent with the decoding phase, since the target sentence is generated sequentially, without any prior knowledge about the next word.

We also apply a gating mechanism to combine the outputs of the two decoder paths and feed the combined result into the output layer to generate the next word:

$$d = g_{\mathrm{cnn}} \cdot d_{\mathrm{cnn}} + g_{\mathrm{san}} \cdot d_{\mathrm{san}}, \quad P(y_t | y_{<t}, x) = \mathrm{softmax}(W d + b) \qquad (10)$$

where $d_{\mathrm{cnn}}$ and $d_{\mathrm{san}}$ are the outputs of the CNN and SAN paths, respectively; $d$ is the fusion output; the gates are computed as in Eq. (9); and $W$, $b$ are the linear transform weights and biases before the softmax.
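A corresponding sketch of the output-side fusion, under the same assumptions as Eq. (10) above (two scalar gates and a single linear transform before the softmax); class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class DecoderOutputFusion(nn.Module):
    """Gated fusion of the two decoder path outputs before the softmax (Eq. 10), sketch only."""

    def __init__(self, d, vocab_size):
        super().__init__()
        self.gate_cnn = nn.Linear(2 * d, 1)
        self.gate_san = nn.Linear(2 * d, 1)
        self.out = nn.Linear(d, vocab_size)   # the linear transform (W, b) before the softmax

    def forward(self, d_cnn, d_san):
        both = torch.cat([d_cnn, d_san], dim=-1)
        g_cnn = torch.sigmoid(self.gate_cnn(both))
        g_san = torch.sigmoid(self.gate_san(both))
        fused = g_cnn * d_cnn + g_san * d_san
        return torch.log_softmax(self.out(fused), dim=-1)   # distribution over the next word
```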

4 Experiment Settings

In this section, we introduce the datasets used in experiments, model settings, and training details.

4.1 Datasets

We evaluate our proposed method on two neural machine translation tasks and one abstractive text summarization task. The datasets are described as follows:

IWSLT2014 German-English

We use the dataset of the IWSLT14 German-English machine translation track [Cettolo et al.2014] for training and evaluation. We tokenize the data, which come from TED and TEDx talks. After preprocessing, the dataset contains 160K training sentences and 7K development sentences (https://github.com/facebookresearch/fairseq-py/blob/master/data/prepare-iwslt14.sh). We concatenate dev2010, tst2010, tst2011 and tst2012 as the test set. We apply a joint source and target byte-pair encoding (BPE) with 10K types [Sennrich et al.2016].

Nist Chinese-English

We use a corpus consisting of 1.25M sentence pairs extracted from LDC corpora (LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06). This dataset contains 27.9M Chinese words and 34.5M English words. We learn a 25K subword dictionary based on BPE for the source and target languages separately. After tokenization, we get a source vocabulary with 37K words and a target vocabulary with 25K words. We choose Nist-2003 as the development set, and Nist-{2004, 2005, 2006, 2008, 2012} as the test sets.

Text Summarization

We use the Gigaword corpus [Graff et al.2003] and follow the same preprocessing as [Rush et al.2015]. We obtain 3.8M training samples and 190K validation samples. We evaluate our approach on the Gigaword test set, which is widely used and contains 2000 article-title pairs. Similar to [Gehring et al.2017a], we use source and target vocabularies of 30K words each. Following [Gehring et al.2017a], we require the generated sequence to be at least 14 words long.

4.2 Model Settings

IWSLT14 German-English

For the CNN path, we adopt 4 identical convolutional layers with 256 neurons in both the encoder and the decoder. As one self-attention layer is comprised of 2 sublayers, we set 2 layers for the SAN path to keep the network depths consistent. The hidden size and the filter size of the SAN block are set as 256 and 1024 respectively. The word embedding dimension is 256, and the dropout rate is set as 0.1. In order to demonstrate the efficiency of our method, we compare it with the four following baselines: (i) Deeper CNN Model, a standard CNN based model with 8 layers of 256 neurons; (ii) Wider CNN Model, a standard CNN based model with 4 layers of 512 neurons; (iii) Wider SAN Model, a standard SAN based model with 2 layers of 512 neurons; (iv) Deeper SAN Model, a standard SAN based model with 4 layers of 256 neurons. Besides, we also list some related works for comparison.

Nist Chinese-English

We train a deep model on the Nist Chinese-English translation task. This model adopts 12 stacked layers for the CNN path and 6 stacked layers for the SAN path in both the encoder and the decoder. The hidden size of both the CNN and SAN layers is set as 256 and the filter size of the SAN layer is 1024. We set 256 neurons for the word embedding. The dropout rate is set as 0.2. For comparison, we choose two model settings as baselines: (i) CNN Model, a CNN based model with 12 stacked layers of 512 neurons; (ii) SAN Model, a SAN based model with 6 stacked layers of 512 neurons and a filter size of 2048. These model settings contain approximately the same number of parameters as our model. We also list some previous works for comparison.

Text Summarization

We follow the same model setting as described for the IWSLT14 German-English translation task. For this task, we adopt RNN MRT [Ayana et al.2016], WFE [Suzuki and Nagata2017] and ConvS2S [Gehring et al.2017a] as the baselines.

Model Params BLEU
AC+LL [Bahdanau et al.2016] - 28.53
NPMT+LM [Huang et al.2017] - 29.16
Wider CNN 20.37M 30.63
Wider SAN 27.06M 31.29
Deeper CNN 13.82M 30.70
Deeper SAN 13.54M 31.43
DPN-S2S 11.57M 31.99
Table 1: IWSLT14 German-English translation performance.

4.3 Training and Evaluation

We adopt Nesterov Accelerated Gradient (NAG) [Bengio et al.2012] as the training optimizer. The initial learning rate is set as 0.25 for IWSLT14 German-English and text summarization, and 0.50 for Nist Chinese-English, with a decay schedule that shrinks the learning rate by a factor of 10 when the validation loss stops decreasing. Each training batch contains approximately 4000 source and target tokens. During inference, we set the beam size as 5 for IWSLT14 German-English and text summarization, and 10 for Nist Chinese-English. All models are implemented in PyTorch based on fairseq-py (https://github.com/facebookresearch/fairseq-py). We use one Nvidia Titan X Pascal GPU for IWSLT14 German-English and text summarization, and 4 GPUs for Nist Chinese-English.
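The optimizer and decay schedule described here can be wired up in PyTorch roughly as follows; the stand-in model, the momentum value (0.99, borrowed from the ConvS2S recipe) and the dummy training/validation steps are placeholders, not the actual fairseq training loop.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; only the optimizer/schedule wiring is the point.
model = nn.Linear(256, 256)

# Nesterov Accelerated Gradient is SGD with Nesterov momentum in PyTorch.
optimizer = torch.optim.SGD(model.parameters(), lr=0.25, momentum=0.99, nesterov=True)

# Decay schedule: shrink the learning rate by 10x when the validation loss plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

def evaluate_on_dev():
    # Placeholder returning a fake dev loss; replace with real validation.
    return torch.rand(1).item()

for epoch in range(5):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 256)).pow(2).mean()   # dummy training step
    loss.backward()
    optimizer.step()
    scheduler.step(evaluate_on_dev())
```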

5 Results

In this section, we report our results on translation and summarization tasks. In addition, we also conduct a series of extensive analyses for a better understanding of our model.

5.1 IWSLT2014 German-English

We evaluate translation accuracy by BLEU [Papineni et al.2002], computed with multi-bleu.perl (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl). Table 1 shows the experimental results on the German-English translation task. We list the results of the wider networks, the deeper networks and some related works for comparison. The results can be summarized as follows:

When compared with the wider networks, DPN-S2S achieves an improvement of 1.36 BLEU and 0.70 BLEU over the CNN and SAN models, respectively. When compared with the deeper networks, DPN-S2S achieves an improvement of 1.29 BLEU and 0.56 BLEU over the CNN and SAN models. Generally, we note that DPN-S2S produces better performance than wider or deeper networks with fewer parameters. In addition, we compare with two methods that use an actor-critic algorithm [Bahdanau et al.2016] and a phrase-based model with an auxiliary language model [Huang et al.2017], respectively; our method also outperforms these approaches. All of these comparisons indicate the effectiveness of our method.

System Nist04 Nist05 Nist06 Nist08 Nist12
MC-NMT [Xiong et al.2018] 40.79 38.49 - 31.51 26.90
NMT + Distortion [Zhang et al.2017] 40.52 36.81 35.77 - -
SD-NMT [Chen et al.2017] 39.81 36.74 34.63 28.61 -
SAN 40.39 40.38 38.06 31.67 30.32
CNN 40.80 38.16 37.89 30.70 29.55
DPN-S2S 41.26 41.12 38.87 32.66 31.17
Table 2: Performance of different systems on Nist Chinese-English translation task.

5.2 Nist Chinese-English

Table 2 lists the results of our model on each Nist Chinese-English test set, together with several NMT baselines. These baselines are RNNSearch models with some well acknowledged techniques, including: 1) using multi-channel information in the encoder (MC-NMT) [Xiong et al.2018]; 2) incorporating word reordering information into NMT (NMT+Distortion) [Zhang et al.2017]; 3) an attention mechanism with directed syntax (SD-NMT) [Chen et al.2017]. We observe that our method surpasses the SAN and CNN baselines by 0.74-0.99 and 0.46-2.96 BLEU on different test sets. In addition, we also compare our results with some related works: our model outperforms MC-NMT by 0.47-4.27 BLEU, NMT+Distortion by 0.74-4.31 BLEU and SD-NMT by 1.44-4.38 BLEU on the Nist test sets. These results on the Nist Chinese-English dataset show that our model also achieves improvements on a large dataset.

Gigaword
RG-1 (F) RG-2 (F) RG-L (F)
RNN MRT [Ayana et al.2016] 36.54 16.59 33.44
WFE [Suzuki and Nagata2017] 36.30 17.31 33.88
ConvS2S [Gehring et al.2017a] 35.88 17.48 33.29
DPN-S2S 36.92 17.91 34.32
Table 3: Accuracy on Gigaword text summarization task in terms of ROUGE-1 (RG-1), ROUGE-2 (RG-2), and ROUGE-L (RG-L). F stands for F1-score.

5.3 Text Summarization

We employ three variants of ROUGE [Lin2004] as evaluation metrics for text summarization: ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest common subsequence).

ID   Encoder        Decoder        BLEU
     CNN    SAN     CNN    SAN
M1   ✓      –       ✓      –       30.38
M2   ✓      –       –      ✓       31.00
M3   ✓      –       ✓      ✓       31.26
M4   –      ✓       ✓      –       29.80
M5   –      ✓       –      ✓       30.63
M6   –      ✓       ✓      ✓       31.11
M7   ✓      ✓       ✓      –       31.03
M8   ✓      ✓       –      ✓       31.43
M9   ✓      ✓       ✓      ✓       31.99
Table 4: Ablation studies of DPN-S2S on IWSLT14 German-English. ✓ means the model contains this component.

We compare our model with some previous works including RNN MRT [Ayana et al.2016], which adopts RNNSearch on this task, WFE [Suzuki and Nagata2017], which addresses the problem of repeated sequence generation, and ConvS2S [Gehring et al.2017a]. Table 3 shows the text summarization results on the Gigaword dataset. DPN-S2S outperforms all three models by 0.43-1.04 ROUGE-1, 0.43-1.32 ROUGE-2 and 0.64-1.23 ROUGE-L in terms of F1 score.

5.4 Ablation Study

To understand to what extent each path in the encoder/decoder affects model performance, we conduct ablation experiments. For conciseness we choose one task (IWSLT14 German-English) for this study. Table 4 shows the experiment settings and results. All the models use the same dimensions for hidden states and word embeddings. Note that model M9 is our proposed DPN-S2S. We have the following observations.

  • Comparing model M4 with M7, M5 with M8, and M6 with M9, we can see that adding the CNN encoder leads to 0.80-1.23 BLEU gain.

  • Comparing model M2 with M3, M5 with M6, and M8 with M9, we observe that adding the CNN decoder leads to 0.26-0.56 BLEU gain.

  • Comparing model M1 with M7, M2 with M8, and M3 with M9, we find that adding the SAN encoder results in 0.43-0.73 BLEU improvement.

  • Similarly comparing model M1 with M3, M4 with M6, and M7 with M9, we see 0.88-1.31 BLEU gain by adding the SAN decoder.

These results suggest that both the CNN path and the SAN path make positive contributions to the overall architecture, and they are needed in both the encoder and decoder.

5.5 Attention Visualization

Encoder \ Decoder   CNN     SAN
CNN                 2.208   1.406
SAN                 1.886   1.377
Table 5: The entropy of the alignments of the CNN/SAN decoder to the CNN/SAN encoder on IWSLT14 German-English (rows: encoder path, columns: decoder path).

To demonstrate the different characteristics of the CNN and SAN paths, we analyze the alignments (attention coefficients) of the four types of decoder-to-encoder attention. The alignments are computed by the softmax term in Eq. (4), and the alignments from each token in the decoder form a distribution over the source tokens. We use the entropy of this distribution to depict to what extent the target token focuses on the source tokens. Small entropy means less uncertainty, i.e., the attention is more concentrated.

We take the 6750 sentences in the German-English test set and calculate the entropy of each target token's alignment distribution, averaged over the target tokens in each sentence and then averaged over sentences, as shown in Table 5. The entropy of alignments from both paths in the decoder to the CNN path in the encoder (the first row of the table) is larger than that to the SAN path in the encoder. This is consistent with our intuition: CNN focuses more on local information, so the decoder needs to attend to multiple hidden states of the CNN path in the encoder to extract the information necessary for target word generation; in contrast, each hidden state in the SAN path already includes global semantics, so the decoder only needs to focus on a small number of hidden states of the SAN path in the encoder, resulting in more focused alignments and smaller entropy (the second row of the table).
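A small sketch of how this entropy can be computed, assuming each alignment is a (target_len, source_len) matrix of attention weights taken from the softmax in Eq. (4); the function name and the corpus-level averaging shown in the comment are illustrative.

```python
import torch

def alignment_entropy(align):
    """Average entropy of the per-target-token alignment distributions.

    align: (tgt_len, src_len) attention weights; each row sums to 1.
    """
    eps = 1e-9
    per_token = -(align * (align + eps).log()).sum(dim=-1)   # entropy of each row
    return per_token.mean()                                   # average over target tokens

# Averaging over sentences (hypothetical list of per-sentence alignment matrices):
# corpus_entropy = torch.stack([alignment_entropy(a) for a in alignments]).mean()
```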

For an intuitive understanding, we visualize the alignments of a sentence pair in Figure 2. Comparing Figure 2(b) with 2(a), we can see that the attention from the CNN path of the decoder to the SAN path of the encoder is more focused than that to the CNN path of the encoder. The same phenomenon can be found for the attention from the SAN path of the decoder to the encoder, by comparing Figure 2(d) with 2(c).

(a) CNN to CNN
(b) CNN to SAN
(c) SAN to CNN
(d) SAN to SAN
Figure 2: The alignments of the CNN/SAN decoder to the CNN/SAN encoder from DPN-S2S. The x-axis and y-axis of each plot correspond to the words in the source language and target language. Each pixel shows the alignment score of a target word to a source word.
Translation
Source (De) das ist die linke hälfte, die die logische seite ist und dann die rechte hälfte, die die intuitive seite ist.
Target (En) there ’s the left half, which is the logical side, and then the right half, which is the intuitive.
CNN this is half the left half, which is the logical side, and then half the rights that is the intuitive side.
SAN that’s the left half, which is the logical side , and then half, that ’s the intuitive side.
DPN-S2S that’s the left half, which is the logical side, and then the right half, which is the intuitive side.
Table 6: A German-English translation case to demonstrate DPN-S2S outperforms CNN and SAN in terms of translation accuracy.

5.6 Case Study

To better understand the advantage of DPN-S2S over single path models, Table 6 shows a case from the German-English test set. We list the target sentences generated by the CNN based model, the SAN based model and DPN-S2S, respectively. As can be seen, the CNN based model generates detailed phrases such as "half the left half" and "half the rights", without a global view to correctly organize these local meanings. The SAN based model misses the local details about "the right half". Our DPN-S2S combines the local information from the CNN path and the global information from the SAN path well, resulting in a better translation.

6 Conclusion

In this paper, we have proposed double path networks for sequence to sequence learning, named DPN-S2S, which use a cross attention module to leverage the advantages of two different models, and achieve state-of-the-art performance on several sequence to sequence tasks.

We plan to explore the following directions in the future. First, we will test the new model on more language pairs for neural machine translation and other sequence to sequence tasks. Second, it is interesting to investigate how to design better structures for information fusion. Third, we would like to extend the current double path model to multiple paths.

Acknowledgments

This work was supported by The National Key Research and Development Program of China under Grant 2018YFB1004904.

References

  • [Ayana et al.2016] Ayana, Shiqi Shen, Zhiyuan Liu, and Maosong Sun. 2016. Neural headline generation with minimum risk training. arXiv.
  • [Ba et al.2016] Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv.
  • [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR 2015.
  • [Bahdanau et al.2016] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv.
  • [Bengio et al.2012] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. 2012. Advances in optimizing recurrent networks. ICASSP.
  • [Cettolo et al.2014] Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th iwslt evaluation campaign. In Proceedings of the 11th IWSLT.
  • [Chen et al.2017] Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, and Tiejun Zhao. 2017. Syntax-directed attention for neural machine translation. AAAI.
  • [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014, pages 1724–1734.
  • [Dauphin et al.2017] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In ICML 2017.
  • [Gehring et al.2017a] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017a. Convolutional sequence to sequence learning. In ICML 2017.
  • [Gehring et al.2017b] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017b. Convolutional sequence to sequence learning. ICML 2017.
  • [Graff et al.2003] David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium.
  • [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR 2016.
  • [Huang et al.2017] Po-Sen Huang, Chong Wang, Dengyong Zhou, and Li Deng. 2017. Neural phrase-based machine translation. arXiv.
  • [Lin2004] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proc. ACL workshop on Text Summarization Branches Out.
  • [Luong et al.2015] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015.
  • [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
  • [Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP 2015.
  • [Sennrich et al.2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL 2016.
  • [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014.
  • [Suzuki and Nagata2017] Jun Suzuki and Masaaki Nagata. 2017. Rnn-based encoder-decoder approach with word frequency estimation. arXiv.
  • [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. NIPS.
  • [Xiong et al.2018] Hao Xiong, Zhongjun He, Xiaoguang Hu, and Hua Wu. 2018. Multi-channel encoder for neural machine translation. AAAI.
  • [Zhang et al.2017] Jinchao Zhang, Mingxuan Wang, Qun Liu, and Jie Zhou. 2017. Incorporating word reordering knowledge into attention-based neural machine translation. In ACL.