1 Introduction
Sequence to sequence models are usually trained with a simple token-level likelihood loss (Sutskever et al., 2014; Bahdanau et al., 2014). However, at test time, these models do not produce a single token but a whole sequence. In order to resolve this inconsistency and to potentially improve generation, recent work has focused on training these models at the sequence level, for instance using REINFORCE (Ranzato et al., 2015), actor-critic (Bahdanau et al., 2016), or beam search optimization (Wiseman and Rush, 2016).
Before the recent work on sequence-level training for neural networks, there has been a large body of research on training linear models at the sequence level. For example, direct loss optimization has been popularized in machine translation with the Minimum Error Rate Training algorithm (MERT; Och, 2003) and expected risk minimization has an extensive history in NLP (Smith and Eisner, 2006; Rosti et al., 2010; Green et al., 2014). This paper revisits several objective functions that have been commonly used for structured prediction tasks in NLP (Gimpel and Smith, 2010) and applies them to a neural sequence to sequence model (Gehring et al., 2017b) (§2). Specifically, we consider likelihood training at the sequence level, a margin loss, as well as expected risk training. We also investigate several combinations of global losses with token-level likelihood. This is, to our knowledge, the most comprehensive comparison of structured losses in the context of neural sequence to sequence models (§3). We experiment on the IWSLT'14 German-English translation task (Cettolo et al., 2014) as well as the Gigaword abstractive summarization task (Rush et al., 2015). We achieve the best reported accuracy to date on both tasks. We find that the sequence-level losses we survey perform similarly to one another and outperform beam search optimization (Wiseman and Rush, 2016) on a comparable setup. On WMT'14 English-French, we also illustrate the effectiveness of risk minimization on a larger translation task. Classical losses for structured prediction are still very competitive and effective for neural models (§5, §6).
2 Sequence to Sequence Learning
The general architecture of our sequence to sequence models follows the encoder-decoder approach with soft attention first introduced in Bahdanau et al. (2014). As a main difference, in most of our experiments we parameterize the encoder and the decoder as convolutional neural networks instead of recurrent networks (Gehring et al., 2017a,b). Our use of convolution is motivated by computational and accuracy considerations. However, the objective functions we present are model agnostic and equally applicable to recurrent and convolutional models. We demonstrate the applicability of our objective functions to recurrent models (LSTM) in our comparison to Wiseman and Rush (2016) in §6.6.

Notation. We denote the source sentence as x, an output sentence of our model as u, and the reference or target sentence as t. For some objectives, we choose a pseudo-reference u* instead, such as a model output with the highest BLEU or ROUGE score among a set of candidate outputs, U(x), generated by our model.
Concretely, the encoder processes a source sentence x = (x_1, …, x_m) containing m words and outputs a sequence of states z = (z_1, …, z_m). The decoder takes z and generates the output sequence u = (u_1, …, u_n) left to right, one element at a time. For each output u_i, the decoder computes a hidden state h_i based on the previous state h_{i−1}, an embedding g_i of the previous target language word u_{i−1}, as well as a conditional input c_i derived from the encoder output z. The attention context c_i is computed as a weighted sum of (z_1, …, z_m) at each time step. The weights of this sum are referred to as attention scores and allow the network to focus on the most relevant parts of the input at each generation step. Attention scores are computed by comparing each encoder state z_j to a combination of the previous decoder state and the last prediction; the result is normalized to be a distribution over input elements. At each generation step, the model scores the V possible next target words u_i by transforming the decoder output h_i via a linear layer with weights W and bias b: s_i = W h_i + b. This is turned into a distribution via a softmax: p(u_i | u_1, …, u_{i−1}, x) = softmax(s_i).
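The final scoring step can be made concrete with a short sketch. This is a minimal PyTorch illustration rather than the paper's fairseq-py code; the dimensions and the random decoder state are placeholder values.

```python
import torch
import torch.nn.functional as F

hidden_dim, vocab_size = 256, 14000                          # illustrative sizes
output_projection = torch.nn.Linear(hidden_dim, vocab_size)  # weights W and bias b

h_i = torch.randn(1, hidden_dim)   # decoder output at generation step i (placeholder)
s_i = output_projection(h_i)       # s_i = W h_i + b: one score per vocabulary word
p_i = F.softmax(s_i, dim=-1)       # distribution over the next target word u_i
```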
Our encoder and decoder use gated convolutional neural networks which enable fast and accurate generation (Gehring et al., 2017b). Fast generation is essential to efficiently train on the model output, as is done in this work, since sequence-level losses require generation at training time. Both encoder and decoder networks share a simple block structure that computes intermediate states based on a fixed number of input tokens, and we stack several blocks on top of each other. Each block contains a 1D convolution that takes as input k feature vectors and outputs another vector; subsequent layers operate over the k output elements of the previous layer. The output of the convolution is then fed into a gated linear unit (Dauphin et al., 2017). In the decoder network, we rely on causal convolutions, which use only states from previous time steps. The parameters θ of our model are all the weight matrices in the encoder and decoder networks. Further details can be found in Gehring et al. (2017b).

3 Objective Functions
We compare several objective functions for training the model architecture described in §2. The corresponding loss functions are either computed over individual tokens (§3.1), over entire sequences (§3.2), or over a combination of tokens and sequences (§3.3). An overview of these loss functions is given in Figure 1.

\mathcal{L}_{TokNLL} = -\sum_{i=1}^{|t|} \log p(t_i \mid t_1, \ldots, t_{i-1}, x)    (1)

\mathcal{L}_{TokLS} = -\sum_{i=1}^{|t|} \Big[ \log p(t_i \mid t_1, \ldots, t_{i-1}, x) - D_{KL}\big(f \,\big\|\, p(\cdot \mid t_1, \ldots, t_{i-1}, x)\big) \Big]    (2)

\mathcal{L}_{SeqNLL} = -\log p(u^* \mid x) + \log \sum_{u \in \mathcal{U}(x)} p(u \mid x)    (3)

\mathcal{L}_{Risk} = \sum_{u \in \mathcal{U}(x)} \mathrm{cost}(t, u) \, \frac{p(u \mid x)}{\sum_{u' \in \mathcal{U}(x)} p(u' \mid x)}    (4)

\mathcal{L}_{MaxMargin} = \max\Big[ 0,\ \mathrm{cost}(t, \hat{u}) - \mathrm{cost}(t, u^*) - s(u^* \mid x) + s(\hat{u} \mid x) \Big]    (5)

\mathcal{L}_{MultiMargin} = \sum_{u \in \mathcal{U}(x)} \max\Big[ 0,\ \mathrm{cost}(t, u) - \mathrm{cost}(t, u^*) - s(u^* \mid x) + s(u \mid x) \Big]    (6)

\mathcal{L}_{SoftmaxMargin} = -\log p(u^* \mid x) + \log \sum_{u \in \mathcal{U}(x)} \exp\big[ s(u \mid x) + \mathrm{cost}(t, u) \big]    (7)

Figure 1: Token-level (Equations 1-2) and sequence-level (Equations 3-7) objective functions. Here u* denotes the pseudo-reference, û the highest-scoring candidate, and s(u | x) the unnormalized model score of u.
3.1 Token-Level Objectives
Most prior work on sequence to sequence learning has focused on optimizing token-level loss functions, i.e., functions for which the loss is computed additively over individual tokens.
Token Negative Log Likelihood (TokNLL)
Token-level likelihood (TokNLL, Equation 1) minimizes the negative log likelihood of individual reference tokens t_i. It is the most common loss function optimized in related work and serves as a baseline for our comparison.
Token NLL with Label Smoothing (TokLS)
Likelihood training forces the model to make extreme zero or one predictions to distinguish between the ground truth and alternatives. This may result in a model that is too confident in its training predictions, which may hurt its generalization performance. Label smoothing addresses this by acting as a regularizer that makes the model less confident in its predictions. Specifically, we smooth the target distribution with a prior distribution f that is independent of the current input (Szegedy et al., 2015; Pereyra et al., 2017; Vaswani et al., 2017). We use a uniform prior distribution over all words in the vocabulary V. One may also use a unigram distribution, which has been shown to work better on some tasks (Pereyra et al., 2017). Label smoothing is equivalent to adding the KL divergence between f and the model prediction to the negative log likelihood (TokLS, Equation 2). In practice, we implement label smoothing by modifying the ground truth distribution to assign probability 1 − ε to the reference word and ε/|V| to every other word, instead of 1 and 0 respectively, where ε is a smoothing parameter.
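A minimal sketch of this smoothed objective, assuming a uniform prior and PyTorch tensors; the function name and the setting eps=0.1 are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def tok_ls_loss(logits, target, eps=0.1):
    """Token-level NLL with uniform label smoothing (sketch).

    logits: (n_tokens, vocab_size) unnormalized scores; target: (n_tokens,) ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, target.unsqueeze(1)).squeeze(1)
    # With a uniform prior, the smoothed objective mixes the NLL of the
    # reference word with the average NLL over the whole vocabulary.
    smooth = -log_probs.mean(dim=-1)
    return ((1.0 - eps) * nll + eps * smooth).sum()
```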
3.2 Sequence-Level Objectives
We also consider a class of objective functions that are computed over entire sequences, i.e., sequence-level objectives. Training with these objectives requires generating and scoring multiple candidate output sequences for each input sequence during training, which is computationally expensive but allows us to directly optimize task-specific metrics such as BLEU or ROUGE.
Unfortunately, these objectives are also typically defined over the entire space of possible output sequences, which is intractable to enumerate or score with our models. Instead, we compute our sequence losses over a subset of the output space, U(x), generated by the model. We discuss approaches for generating this subset in §4.
Sequence Negative Log Likelihood (SeqNLL)
Similar to TokNLL, we can minimize the negative log likelihood of an entire sequence rather than individual tokens (SeqNLL, Equation 3). The log-likelihood of a sequence u is the sum of individual token log probabilities, normalized by the number of tokens to avoid a bias towards shorter sequences:

\log p(u \mid x) = \frac{1}{|u|} \sum_{i=1}^{|u|} \log p(u_i \mid u_1, \ldots, u_{i-1}, x)
As target we choose a pseudo-reference u* amongst the candidates which maximizes either BLEU or ROUGE with respect to t, the gold reference:

u^* = \arg\max_{u \in \mathcal{U}(x)} \mathrm{BLEU}(t, u)

(Another option is to use the gold reference target t, but in practice this can lead to degenerate solutions in which the model assigns low probabilities to nearly all outputs; this is discussed further in §4.)
As is common practice when computing BLEU at the sentence level, we smooth all initial counts to one (except for unigram counts) so that the geometric mean is not dominated by zero-valued n-gram match counts (Lin and Och, 2004).
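A minimal sketch of SeqNLL over a precomputed candidate set; the tensor inputs (length-normalized candidate log-probabilities and smoothed sentence BLEU scores) are assumed to be computed elsewhere and the function name is illustrative.

```python
import torch

def seq_nll_loss(cand_log_probs, cand_bleu):
    """SeqNLL over a candidate set (sketch of Equation 3).

    cand_log_probs: (k,) length-normalized log p(u|x) for k candidates;
    cand_bleu: (k,) smoothed sentence BLEU of each candidate against t.
    """
    u_star = cand_bleu.argmax()  # pseudo-reference u*: highest-BLEU candidate
    # -log p(u*|x) + log sum_u p(u|x), with the sum computed in log space.
    return -cand_log_probs[u_star] + torch.logsumexp(cand_log_probs, dim=0)
```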
Expected Risk Minimization (Risk)
This objective minimizes the expected value of a given cost function over the space of candidate sequences (Risk, Equation 4). In this work we use task-specific cost functions designed to maximize BLEU or ROUGE (Lin, 2004), e.g., cost(t, u) = 1 − BLEU(t, u), for a given candidate sequence u and target t. In contrast to SeqNLL (§3.2), this loss may increase the score of several candidates that have low cost, instead of focusing on a single sequence which may only be marginally better than the alternatives. Optimizing this loss is a particularly good strategy if the reference is not always reachable, although compared to classical phrase-based models this is less of an issue for neural sequence to sequence models, which predict individual words or even subword units.
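Under the same assumptions as the SeqNLL sketch above, the Risk objective reduces to a softmax-weighted sum of costs over the candidate set, since renormalizing p(u|x) over candidates is a softmax over their log-probabilities. This is an illustrative sketch, not the paper's implementation.

```python
import torch

def risk_loss(cand_log_probs, cand_costs):
    """Expected risk over a candidate set (sketch of Equation 4).

    cand_log_probs: (k,) length-normalized log p(u|x);
    cand_costs: (k,) costs, e.g. 1 - BLEU(t, u).
    """
    # Renormalize over the candidate set so the weights form a distribution.
    weights = torch.softmax(cand_log_probs, dim=0)
    return (weights * cand_costs).sum()
```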
The Risk objective is similar to the REINFORCE objective used in Ranzato et al. (2015), since both objectives optimize an expected cost or reward (Williams, 1992). However, there are a few important differences: (1) whereas REINFORCE typically approximates the expectation with a single sampled sequence, the Risk objective considers multiple sequences; (2) whereas REINFORCE relies on a baseline reward to determine the sign of the gradients for the current sequence (Ranzato et al. (2015) estimate this baseline with a separate linear regressor over the model's current hidden state), for the Risk objective we instead estimate the expected cost over a set of candidate output sequences (see §4); and (3) while the baseline reward is different for every word in REINFORCE, the expected cost is the same for every word in risk minimization, since it is computed at the sequence level based on the actual cost.

MaxMargin
MaxMargin (Equation 5) is a classical margin loss for structured prediction (Taskar et al., 2003; Tsochantaridis et al., 2005) which enforces a margin between the model score of the highest-scoring candidate sequence û and that of a reference sequence. We replace the human reference with a pseudo-reference u*, since this setting performed slightly better in early experiments; u* is the candidate sequence with the highest BLEU score. The size of the margin varies between samples and is given by the difference between the cost of u* and the cost of û. In practice, we scale the margin by a hyperparameter β determined on the validation set: β(cost(t, û) − cost(t, u*)). For this loss we use the unnormalized scores s(u | x) computed by the model before the final softmax.
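A minimal sketch of MaxMargin over a candidate set, again assuming precomputed unnormalized scores and costs; names and defaults are illustrative.

```python
import torch

def max_margin_loss(cand_scores, cand_costs, beta=1.0):
    """Max-margin over a candidate set (sketch of Equation 5).

    cand_scores: (k,) unnormalized model scores s(u|x);
    cand_costs: (k,) costs, e.g. 1 - BLEU(t, u); beta scales the margin.
    """
    u_star = cand_costs.argmin()   # pseudo-reference u*: lowest-cost candidate
    u_hat = cand_scores.argmax()   # û: highest-scoring candidate
    margin = beta * (cand_costs[u_hat] - cand_costs[u_star])
    return torch.clamp(margin - cand_scores[u_star] + cand_scores[u_hat], min=0.0)
```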
MultiMargin
MaxMargin only updates two elements in the candidate set. We therefore consider MultiMargin (Equation 6), which enforces a margin between every candidate sequence u ∈ U(x) and a reference sequence (Herbrich et al., 1999), hence the name MultiMargin. Similar to MaxMargin, we replace the reference with the pseudo-reference u*.
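The multi-candidate variant differs from the previous sketch only in summing the hinge term over all candidates (same assumptions as above).

```python
import torch

def multi_margin_loss(cand_scores, cand_costs, beta=1.0):
    """Multi-margin (sketch of Equation 6): margin against every candidate."""
    u_star = cand_costs.argmin()   # pseudo-reference u*
    margins = beta * (cand_costs - cand_costs[u_star])
    return torch.clamp(margins - cand_scores[u_star] + cand_scores, min=0.0).sum()
```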
SoftmaxMargin
Finally, SoftmaxMargin (Equation 7) is another classic loss, proposed by Gimpel and Smith (2010) as yet another way to optimize task-specific costs. The loss augments the scores inside the exp of SeqNLL (Equation 3) by a cost term. The intuition is that we want to penalize high-cost outputs in proportion to their cost.
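A sketch of the candidate-set approximation of this loss, under the same assumed inputs as the margin sketches above; the cost augmentation enters the log-sum-exp term directly.

```python
import torch

def softmax_margin_loss(cand_scores, cand_costs):
    """Softmax-margin (sketch of Equation 7) over a candidate set."""
    u_star = cand_costs.argmin()   # pseudo-reference u*
    # High-cost candidates contribute more to the partition term, so the
    # model is pushed away from them in proportion to their cost.
    return -cand_scores[u_star] + torch.logsumexp(cand_scores + cand_costs, dim=0)
```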
3.3 Combined Objectives
We also experiment with two variants of combining sequence-level objectives (§3.2) with token-level objectives (§3.1). First, we consider a weighted combination (Weighted) of a sequence-level and a token-level objective (Wu et al., 2016); e.g., for TokLS and Risk we have:
\mathcal{L}_{Weighted} = \alpha \, \mathcal{L}_{TokLS} + (1 - \alpha) \, \mathcal{L}_{Risk}    (8)
where α is a scaling constant that is tuned on a held-out validation set.
Second, we consider a constrained combination (Constrained), where for any given input we use either the token-level or the sequence-level loss, but not both. The motivation is to maintain good token-level accuracy while optimizing on the sequence level. In particular, a sample is processed with the sequence loss if the token loss under the current model θ is at least as good as the token loss of a baseline model θ_B. Otherwise, we update according to the token loss:
\mathcal{L}_{Constrained} = \begin{cases} \mathcal{L}_{Risk} & \text{if } \mathcal{L}_{TokLS}(\theta) \le \mathcal{L}_{TokLS}(\theta_B) \\ \mathcal{L}_{TokLS} & \text{otherwise} \end{cases}    (9)
In this work we use a fixed baseline model θ_B that was trained with a token-level loss to convergence.
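Both combination strategies are simple to express. The following sketch assumes the token-level and sequence-level losses are computed elsewhere; the function names are illustrative.

```python
def weighted_loss(tok_loss, seq_loss, alpha):
    """Weighted combination (sketch of Equation 8); alpha tuned on validation."""
    return alpha * tok_loss + (1.0 - alpha) * seq_loss

def constrained_loss(tok_loss, baseline_tok_loss, seq_loss):
    """Constrained combination (sketch of Equation 9): use the sequence-level
    loss only while the current token-level loss is at least as good as the
    fixed baseline model's token-level loss."""
    return seq_loss if tok_loss <= baseline_tok_loss else tok_loss
```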
4 Candidate Generation Strategies
The sequence-level objectives we consider (§3.2) are defined over the entire space of possible output sequences, which is intractable to enumerate or score with our models. We therefore use a subset of candidate sequences U(x), which we generate with our models.
We consider two search strategies for generating the set of candidate sequences. The first is beam search, a greedy breadth-first search that maintains a "beam" of the top scoring candidates at each generation step. Beam search is the de facto decoding strategy for achieving state-of-the-art results in machine translation. The second strategy is sampling (Chatterjee and Cancedda, 2010), which produces independent output sequences by sampling from the model's conditional distribution. Whereas beam search focuses on high-probability candidates, sampling introduces more diverse candidates (see the comparison in §6.5).
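The sampling strategy can be sketched as ancestral sampling from the model's conditional distribution. Here `step_fn`, `bos`, and `eos` are assumed placeholders (a callable returning next-token probabilities given a prefix, and special token ids), not fairseq-py API.

```python
import torch

def sample_candidates(step_fn, bos, eos, k=16, max_len=200):
    """Draw k candidate sequences by ancestral sampling (sketch)."""
    candidates = []
    for _ in range(k):
        seq = [bos]
        for _ in range(max_len):
            probs = step_fn(seq)  # assumed: (vocab_size,) next-token distribution
            token = torch.multinomial(probs, num_samples=1).item()
            seq.append(token)
            if token == eos:
                break
        candidates.append(seq)
    return candidates
```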
We also consider both online and offline candidate generation settings in §6.4. In the online setting, we regenerate the candidate set every time we encounter an input sentence during training. In the offline setting, candidates are generated before training and are never regenerated. Offline generation is also embarrassingly parallel because all samples use the same model. The disadvantage is that the candidates become stale: our model may learn to perfectly discriminate between them after only a single update, which hinders the ability of the loss to correct remaining search errors. (We could mitigate this issue by regenerating infrequently, i.e., once every k batches, but we leave this to future work.)
Finally, while some past work has added the reference target to the candidate set, i.e., t ∈ U(x), we find this can destabilize training: the model learns to assign low probabilities nearly everywhere, ruining the candidates generated by the model, while still assigning a slightly higher score to the reference (cf. Shen et al. (2016)). Accordingly, we do not add the reference translation to our candidate sets.
5 Experimental Setup
5.1 Translation
We experiment on the IWSLT'14 German to English task (Cettolo et al., 2014) using a similar setup as Ranzato et al. (2015), which allows us to compare to other recent studies that also adopted this setup, e.g., Wiseman and Rush (2016). (Different to Ranzato et al. (2015), we train on sentences of up to 175 rather than 50 tokens.) The training data consists of 160K sentence pairs and the validation set comprises 7K sentences randomly sampled and held out from the training data. We test on the concatenation of all available test and dev sets of IWSLT 2014, that is TED.tst2010, TED.tst2011, TED.tst2012, TED.dev2010 and TEDX.dev2012, which is of similar size to the validation set. (In a previous version of this paper, we erroneously quoted the use of tst2013; we are using TEDX.dev2012 instead.) All data is lowercased and tokenized with a byte-pair encoding (BPE) of 14,000 types (Sennrich et al., 2016) and we evaluate with case-insensitive BLEU.
We also experiment on the much larger WMT'14 English-French task. We remove sentences longer than 175 words as well as pairs with a source/target length ratio exceeding 1.5, resulting in 35.5M sentence pairs for training. The source and target vocabulary is based on 40K BPE types. Results are reported on both newstest2014 and a validation set held out from the training data comprising 26,658 sentence pairs.
We modify the fairseq-py toolkit to implement the objectives described in §3 (code available at https://github.com/pytorch/fairseq/tree/classic_seqlevel). Our translation models have four convolutional encoder layers and three convolutional decoder layers with a kernel width of 3 and 256-dimensional hidden states and word embeddings. We optimize these models using Nesterov's accelerated gradient method (Sutskever et al., 2013) with a learning rate of 0.25 and momentum of 0.99. Gradient vectors are renormalized to norm 0.1 (Pascanu et al., 2013).
We train our baseline token-level models for 200 epochs and then anneal the learning rate by shrinking it by a factor of 10 after each subsequent epoch until it falls below 10^{-4}. All sequence-level models are initialized with the parameters of a token-level model before annealing. We then train sequence-level models for another 10 to 20 epochs depending on the objective. Our batches contain 8K tokens and we normalize gradients by the number of non-padding tokens per mini-batch. We use weight normalization for all layers except for lookup tables (Salimans and Kingma, 2016). Besides dropout on the embeddings and the decoder output, we also apply dropout to the input of the convolutional blocks at a rate of 0.3 (Srivastava et al., 2014). We tuned the various parameters above and report accuracy on the test set by choosing the best configuration based on the validation set.

We length-normalize all scores and probabilities in the sequence-level losses by dividing by the number of tokens in the sequence so that scores are comparable between different lengths. Additionally, when generating candidate output sequences during training we limit the output sequence length to be less than 200 tokens for efficiency. We generally use 16 candidate sequences per training example, except for the ablations where we use 5 for faster experimental turnaround.
5.2 Abstractive Summarization
For summarization we use the Gigaword corpus as training data (Graff et al., 2003) and preprocess it identically to Rush et al. (2015), resulting in 3.8M training and 190K validation examples. We evaluate on a Gigaword test set of 2,000 pairs identical to the one used by Rush et al. (2015) and report F1 ROUGE, similar to prior work. Our results are in terms of three variants of ROUGE (Lin, 2004), namely ROUGE-1 (RG-1, unigrams), ROUGE-2 (RG-2, bigrams), and ROUGE-L (RG-L, longest common subsequence). Similar to Ayana et al. (2016), we use a source and target vocabulary of 30K words. Our models for this task have 12 layers in the encoder and decoder, each with 256 hidden units and kernel width 3. We train on batches of 8,000 tokens with a learning rate of 0.25 for 20 epochs and then anneal as in §5.1.
6 Results
6.1 Comparison of Sequence-Level Losses
First, we compare all objectives based on a weighted combination with token-level label smoothing (Equation 8). We also show the likelihood baseline (MLE) of Wiseman and Rush (2016), their beam search optimization method (BSO), the actor-critic result of Bahdanau et al. (2016), as well as the best reported result on this dataset to date by Huang et al. (2017). We show a like-for-like comparison to Wiseman and Rush (2016) with a similar baseline model below (§6.6).
Table 1 shows that all sequence-level losses outperform token-level losses. Our baseline token-level results are several points above other figures in the literature, and we further improve these results by up to 0.61 BLEU with Risk training.
                                  test    std
MLE (W & R, 2016) [T]            24.03
BSO (W & R, 2016) [S]            26.36
Actor-critic (B, 2016) [S]       28.53
Huang et al. (2017) [T]          28.96
Huang et al. (2017) (+LM) [T]    29.16
TokNLL [T]                       31.78    0.07
TokLS [T]                        32.23    0.10
SeqNLL [S]                       32.68    0.09
Risk [S]                         32.84    0.08
MaxMargin [S]                    32.55    0.09
MultiMargin [S]                  32.59    0.07
SoftmaxMargin [S]                32.71    0.07

Table 1: Accuracy on the IWSLT'14 German-English test set. [S] indicates sequence-level training and [T] token-level training. We report averages and standard deviations over five runs with different random initialization.
6.2 Combination with Token-Level Loss
Next, we compare various strategies to combine sequence-level and token-level objectives (cf. §3.3). For these experiments we use 5 candidate sequences per training example for faster experimental turnaround. We consider Risk as the sequence-level loss and label smoothing as the token-level loss. Table 2 shows that combined objectives perform better than pure Risk. The weighted combination (Equation 8) performs best, outperforming the constrained combination (Equation 9). We also compare to randomly choosing between token-level and sequence-level updates and find that it underperforms the more principled constrained strategy. In the remaining experiments we use the weighted strategy.
               valid    test
TokLS          33.11    32.21
Risk only      33.55    32.45
Weighted       33.91    32.85
Constrained    33.77    32.79
Random         33.70    32.61
6.3 Effect of Initialization
So far we initialized sequence-level models with parameters from a token-level model trained with label smoothing. Table 3 shows that initializing weighted Risk with token-level label smoothing achieves 0.7-0.8 better BLEU compared to initializing with parameters from token-level likelihood. The improvement of initializing with TokNLL is only 0.3 BLEU with respect to the TokNLL baseline, whereas the improvement from initializing with TokLS is 0.6-0.8 BLEU. We believe that the regularization provided by label smoothing leads to models with less sharp distributions that are a better starting point for sequence-level training.
                          valid    test
TokNLL                    32.96    31.74
Risk init with TokNLL     33.27    32.07
  Improvement             +0.31    +0.33
TokLS                     33.11    32.21
Risk init with TokLS      33.91    32.85
  Improvement             +0.80    +0.64
6.4 Online vs. Offline Candidate Generation
Next, we consider the question of whether refreshing the candidate subset at every training step (online) results in better accuracy than generating candidates before training and keeping the set static throughout training (offline). Table 4 shows that offline generation gives lower accuracy. However, the online setting is much slower, since regenerating the candidate set requires incremental (left-to-right) inference with our model, which is very slow compared to efficient forward/backward passes over large batches of pre-generated hypotheses. In our setting, offline generation has 26 times higher throughput than the online generation setting, despite the high inference speed of fairseq (Gehring et al., 2017b).
                      valid    test
Online generation     33.91    32.85
Offline generation    33.52    32.44
6.5 Beam Search vs. Sampling and Candidate Set Size
So far we generated candidates with beam search; however, we may also sample to obtain a more diverse set of candidates (Shen et al., 2016). Figure 2 compares beam search and sampling for various candidate set sizes on the validation set. Beam search performs better for all candidate set sizes considered. In other experiments, we rely on a candidate set size of 16, which strikes a good balance between efficiency and accuracy.
6.6 Comparison to Beam Search Optimization
                        BLEU
MLE                     24.03
 + BSO                  26.36    +2.33
MLE Reimplementation    23.93
 + Risk                 26.68    +2.75

Table 5: Comparison to beam search optimization (BSO; Wiseman and Rush, 2016) on IWSLT'14 German-English translation with a comparable LSTM baseline.

                 RG-1     RG-2     RG-L
ABS+ [T]         29.78    11.89    26.97
RNN MLE [T]      32.67    15.23    30.56
RNN MRT [S]      36.54    16.59    33.44
WFE [T]          36.30    17.31    33.88
SEASS [T]        36.15    17.54    33.63
DRGD [T]         36.27    17.57    33.62
TokLS            36.53    18.10    33.93
 + Risk RG-1     36.96    17.61    34.18
 + Risk RG-2     36.65    18.32    34.07
 + Risk RG-L     36.70    17.88    34.29

Table 6: F1 ROUGE on the Gigaword abstractive summarization test set (cf. §5.2 and §6.8).
Next, we compare classical sequence-level training to the recently proposed beam search optimization (Wiseman and Rush, 2016). To enable a fair comparison, we reimplement their baseline: a single-layer LSTM encoder/decoder model with 256-dimensional hidden layers and word embeddings, as well as attention and input feeding (Luong et al., 2015). This baseline is trained with Adagrad (Duchi et al., 2011) for five epochs, with batches of 64 sequences. For sequence-level training we initialize weights with the baseline parameters and train with Adam (Kingma and Ba, 2014) for another 10 epochs with 16 candidate sequences per training example. We conduct experiments with Risk, since it performed best in trial experiments.
Different from other sequence-level experiments (§5), we rescale the BLEU scores in each candidate set by the difference between the maximum and minimum scores of each sentence. This avoids short sentences dominating the sequence updates, since candidate sets for short sentences have a wider range of BLEU scores compared to longer sentences; a similar rescaling was used by Bahdanau et al. (2016).
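A minimal sketch of this rescaling, assuming the candidate BLEU scores are given as a tensor; the epsilon guard against a zero range is an added safeguard, not from the paper.

```python
import torch

def rescale_bleu(cand_bleu, eps=1e-6):
    """Rescale sentence BLEU within a candidate set by its max-min range,
    so short sentences with wide BLEU ranges do not dominate the updates."""
    span = cand_bleu.max() - cand_bleu.min()
    return cand_bleu / torch.clamp(span, min=eps)
```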
Table 5 shows the results from Wiseman and Rush (2016) for their token-level likelihood baseline (MLE) and their best beam search optimization results (BSO), as well as our reimplemented baseline. Risk significantly improves BLEU compared to our baseline, at +2.75 BLEU, which is slightly better than the +2.33 BLEU improvement reported for beam search optimization (cf. Wiseman and Rush (2016)). This shows that classical objectives for structured prediction are still very competitive.
6.7 WMT'14 English-French Results
Next, we experiment on the much larger WMT'14 English-French task using the same model setup as Gehring et al. (2017b). We train with TokLS for 15 epochs and then switch to sequence-level training for another epoch. Table 7 shows that sequence-level training can improve an already very strong model by another +0.37 BLEU. Next, we improve the baseline by adding self-attention (Paulus et al., 2017; Vaswani et al., 2017) to the decoder network (TokLS + self-att), which results in a smaller gain of +0.2 BLEU from Risk. If we train Risk only on the news-commentary portion of the training data, then we achieve a state-of-the-art result on this dataset of 41.5 BLEU (cf. Xia et al., 2017).
                       valid    test
TokLS                  34.06    40.58
 + Risk                34.20    40.95
TokLS + self-att       34.24    41.02
 + in domain           34.51    41.26
 + Risk                34.30    41.22
 + Risk in domain      34.50    41.47
6.8 Abstractive Summarization
Our final experiment evaluates sequence-level training on Gigaword headline summarization. There has been much prior work on this dataset, originally introduced by Rush et al. (2015), who experiment with a feed-forward network (ABS+). Ayana et al. (2016) report a likelihood baseline (RNN MLE) and also experiment with risk training (RNN MRT). In contrast to their setup, we did not find a softmax temperature to be beneficial, and we use beam search instead of sampling to obtain the candidate set (cf. §6.5). Suzuki and Nagata (2017) improve over an MLE RNN baseline by limiting the generation of repeated phrases. Zhou et al. (2017) also consider an MLE RNN baseline and add an additional gating mechanism for the encoder. Li et al. (2017) equip the decoder of a similar network with additional latent variables to accommodate the uncertainty of this task.
Table 6 shows that our baseline (TokLS) outperforms all prior approaches in terms of ROUGE-2 and ROUGE-L, and it is on par with the best previous result for ROUGE-1. We optimize all three ROUGE metrics separately and find that Risk can further improve our strong baseline. We also compared Risk-only training to Weighted on this dataset (cf. §6.2), but accuracy was generally lower on the validation set: RG-1 (36.59 Risk only vs. 36.67 Weighted), RG-2 (17.34 vs. 18.05), and RG-L (33.66 vs. 33.98).
7 Conclusion
We present a comprehensive comparison of classical losses for structured prediction and apply them to a strong neural sequence to sequence model. We found that combining sequence-level and token-level losses is necessary to perform best, as is training on candidates decoded with the current model.
We show that sequence-level training improves state-of-the-art baselines both for IWSLT'14 German-English translation and Gigaword abstractive sentence summarization. Structured prediction losses are very competitive with recent work on reinforcement learning and beam search optimization. Classical expected risk can slightly outperform beam search optimization (Wiseman and Rush, 2016) in a like-for-like setup. Future work may investigate better use of already-generated candidates, since invoking generation for each batch slows down training by a large factor, e.g., mixing fresh and older candidates, inspired by MERT (Och, 2003).
References
Ayana, Shiqi Shen, Yu Zhao, Zhiyuan Liu, Maosong Sun, et al. 2016. Neural headline generation with sentence-wise optimization. arXiv preprint arXiv:1604.01904.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An Actor-Critic Algorithm for Sequence Prediction. arXiv preprint arXiv:1607.07086.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT evaluation campaign. In Proc. of IWSLT.

Samidh Chatterjee and Nicola Cancedda. 2010. Minimum error rate training by sampling the translation lattice. In Proc. of EMNLP.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language Modeling with Gated Convolutional Networks. In Proc. of ICML.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.

Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. 2017a. A Convolutional Encoder Model for Neural Machine Translation. In Proc. of ACL.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017b. Convolutional Sequence to Sequence Learning. In Proc. of ICML.

Kevin Gimpel and Noah Smith. 2010. Softmax-margin CRFs: Training log-linear models with cost functions. In Proc. of ACL.

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia.

Spence Green, Daniel Cer, and Christopher Manning. 2014. An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation. In Proc. of WMT.

Ralf Herbrich, Thore Graepel, and Klaus Obermayer. 1999. Support vector learning for ordinal regression. In Proc. of ICANN.

Po-Sen Huang, Chong Wang, Dengyong Zhou, and Li Deng. 2017. Neural Phrase-based Machine Translation. arXiv preprint arXiv:1706.05565.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In Proc. of ICLR.

Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. 2017. Deep recurrent generative decoder for abstractive text summarization. arXiv.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.

Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In Proc. of COLING.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL, pages 160–167.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proc. of ICML, pages 1310–1318.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In Proc. of ICLR Workshop.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence Level Training with Recurrent Neural Networks. In Proc. of ICLR.

Antti-Veikko I. Rosti, Bing Zhang, Spyros Matsoukas, and Richard Schwartz. 2010. BBN System Description for WMT10 System Combination Task. In Proc. of WMT, pages 321–326.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proc. of EMNLP.

Tim Salimans and Diederik P. Kingma. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proc. of ACL.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum Risk Training for Neural Machine Translation. In Proc. of ACL.

David A. Smith and Jason Eisner. 2006. Minimum Risk Annealing for Training Log-Linear Models. In Proc. of ACL.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent Neural Networks from overfitting. JMLR 15:1929–1958.

Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proc. of ICML.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proc. of NIPS, pages 3104–3112.

Jun Suzuki and Masaaki Nagata. 2017. Cutting-off redundant repeating generations for neural abstractive summarization. arXiv preprint arXiv:1701.00138.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. arXiv.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-margin Markov networks. In Proc. of NIPS.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6:1453–1484.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229–256.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proc. of EMNLP.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.

Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Proc. of NIPS.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. arXiv.