1 Introduction
Sequence-to-sequence (seq2seq) models have been successfully used for many sequential decision tasks such as machine translation [Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2015], parsing [Dyer et al. 2016; Dyer et al. 2015], summarization [Rush, Chopra, and Weston 2015], dialog generation [Serban et al. 2015], and image captioning [Xu et al. 2015]. Beam search is a desirable choice of test-time decoding algorithm for such models because it potentially avoids search errors made by simpler greedy methods. However, the typical approach to training neural sequence models is to use a locally normalized maximum likelihood objective (cross-entropy training) [Sutskever, Vinyals, and Le 2014]. This objective does not directly reason about the behaviour of the final decoding method. As a result, for cross-entropy trained models, beam decoding can sometimes yield reduced test performance when compared with greedy decoding [Koehn and Knowles 2017; Neubig 2017; Cho et al. 2014]. These negative results are not unexpected. The training procedure was not search-aware: it was not able to consider the effect that changing the model’s scores might have on the ease of search, whether decoding is done with a beam, greedily, or otherwise.
We hypothesize that the underperformance of beam search in certain scenarios can be resolved by using a better-designed training objective. Because beam search potentially offers more accurate search when compared to greedy decoding, we hope that appropriately trained models should be able to leverage beam search to improve performance. In order to train models that can more effectively make use of beam search, we propose a new training procedure that focuses on the final loss metric (e.g. Hamming loss) evaluated on the output of beam search. While well-defined and a valid training criterion, this “direct loss” objective is discontinuous and thus difficult to optimize. Hence, in our approach, we form a sub-differentiable surrogate objective by introducing a novel continuous approximation of the beam search decoding procedure. In experiments, we show that optimizing this new training objective yields substantially better results on two sequence tasks (Named Entity Recognition and CCG Supertagging) when compared with both cross-entropy trained greedy decoding and cross-entropy trained beam decoding baselines.
Several related methods, including reinforcement learning [Ranzato et al. 2016; Bahdanau et al. 2017], imitation learning [Daumé, Langford, and Marcu 2009; Ross, Gordon, and Bagnell 2011; Bengio et al. 2015], and discrete search based methods [Wiseman and Rush 2016; Andor et al. 2016; Daumé III and Marcu 2005; Gormley, Dredze, and Eisner 2015], have also been proposed to make training search-aware. These methods include approaches that forgo direct optimization of a global training objective, instead incorporating credit assignment for search errors by using methods like early updates [Collins and Roark 2004] that explicitly track the reachability of the gold target sequence during the search procedure. While we address a related problem (credit assignment for search errors during training), our approach has a novel property: we directly optimize a continuous and global training objective using backpropagation. As a result, in our approach, credit assignment is handled directly via gradient optimization in an end-to-end computation graph. The most closely related work to our own approach was proposed by Goyal, Dyer, and Berg-Kirkpatrick [2017]. They do not consider beam search, but develop a continuous approximation of greedy decoding for scheduled sampling objectives. Other related work involves training a generator with a Gumbel reparameterized sampling module to more reliably find the MAP sequences at decode-time [Gu, Im, and Li 2017], and constructing surrogate loss functions [Bahdanau et al. 2016] that are close to task losses.
2 Model
We denote the seq2seq model parameterized by $\theta$ as $M(\theta)$. We denote the input sequence as $x$, the gold output sequence as $y^*$ and the result of beam search over $M(\theta)$ as $\hat{y}$. Ideally, we would like to directly minimize a final evaluation loss, $L(\hat{y}, y^*)$, evaluated on the result of running beam search with input $x$ and model $M(\theta)$. Throughout this paper we assume that the evaluation loss decomposes over time steps $t$ as $L(\hat{y}, y^*) = \sum_{t=1}^{T} d(\hat{y}_t, y^*, t)$. [Footnote 1: This assumption does not hold for some popular evaluation metrics (e.g. BLEU). In these cases, surrogate evaluation losses such as Hamming distance can be used.] We refer to this idealized training objective that directly evaluates prediction loss as the “direct loss” objective and define it as:
$$\min_{\theta} \; L(\mathrm{Beam}(x, M(\theta)), y^*) \qquad (1)$$
Unfortunately, optimizing this objective using gradient methods is difficult because the objective is discontinuous. The two sources of discontinuity are:
1. As we describe later in more detail, beam search decoding (referred to as the function Beam) involves discrete argmax decisions and thus represents a discontinuous function.
2. The output, $\hat{y}$, of the Beam function, which is the input to the loss function, $L(\hat{y}, y^*)$, is discrete and hence the evaluation of the final loss is also discontinuous.
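The effect of these discontinuities can be seen in a small numerical sketch. It uses greedy decoding (beam size 1) as a stand-in for Beam, and the scores, gold tags and helper names (`hamming_loss`, `greedy_decode`, `loss_after_shift`) are made up for illustration. The loss stays constant while one score is nudged, then jumps when the argmax flips, so the gradient is zero almost everywhere.

```python
import numpy as np

def hamming_loss(pred, gold):
    """Number of positions where the predicted tag differs from the gold tag."""
    return int(np.sum(np.asarray(pred) != np.asarray(gold)))

def greedy_decode(scores):
    """Hard argmax over the vocabulary at every time step (beam size 1)."""
    return np.argmax(scores, axis=1)

# Toy local scores for T=2 time steps and a 3-word vocabulary (made-up numbers).
scores = np.array([[1.0, 0.2, 0.1],
                   [0.3, 0.9, 0.8]])
gold = np.array([0, 2])

def loss_after_shift(delta):
    """Loss after adding delta to the score of word 2 at the second time step."""
    shifted = scores.copy()
    shifted[1, 2] += delta
    return hamming_loss(greedy_decode(shifted), gold)

losses = [loss_after_shift(d) for d in (0.0, 0.05, 0.09, 0.11, 0.5)]
print(losses)  # [1, 1, 1, 0, 0]
```

The loss is a step function of the perturbed score: flat on either side of the point where the argmax changes.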
We introduce a surrogate training objective that avoids these problems and as a result is fully continuous. In order to accomplish this, we propose a continuous relaxation to the composition of our final loss metric, $L$, and our decoder function, Beam:
$$\mathrm{softLB}(x, M(\theta), y^*) \approx L(\mathrm{Beam}(x, M(\theta)), y^*)$$
Specifically, we form a continuous function softLB that seeks to approximate the result of running our decoder on input $x$ and then evaluating the result against $y^*$ using $L$. By introducing this new module, we are now able to construct our surrogate training objective:
$$\min_{\theta} \; \mathrm{softLB}(x, M(\theta), y^*) \qquad (2)$$
As specified in more detail in Section 2.3, our surrogate objective in Equation 2 additionally takes a hyperparameter $\alpha$ that trades approximation quality for smoothness of the objective. Under certain conditions, Equation 2 converges to the objective in Equation 1 as $\alpha$ is increased. We first describe the standard discontinuous beam search procedure and then our training approach (Equation 2) involving a continuous relaxation of beam search.
2.1 Discontinuity in Beam Search
Formally, beam search is a procedure with hyperparameter $k$ that maintains a beam of $k$ elements at each time step and expands each of the $k$ elements to find the $k$ best candidates for the next time step. The procedure finds an approximate argmax of a scoring function defined on output sequences.
We describe beam search in the context of seq2seq models in Algorithm 1, more specifically for an encoder-decoder [Sutskever, Vinyals, and Le 2014] model with a nonlinear autoregressive decoder (e.g. an LSTM [Hochreiter and Schmidhuber 1997]). We define the global model score of a sequence $y$ with length $T$ to be the sum of local output scores at each time step of the seq2seq model: $\mathrm{score}(x, y) = \sum_{t=1}^{T} s(x, y_t)$. In neural models, the function $s$ is implemented as a differentiable mapping, $f: \mathbb{R}^{|h|} \to \mathbb{R}^{|V|}$, which yields scores for vocabulary elements using the recurrent hidden states at corresponding time steps. In our notation, $h_{t,i}$ is the hidden state of the decoder at time step $t$ for beam element $i$, $e_{t,i}$ is the embedding of the output symbol at time step $t$ for beam element $i$, and $s_t[i]$ is the cumulative model score at step $t$ for beam element $i$. In Algorithm 1, we denote by $S_t \in \mathbb{R}^{k \times |V|}$ the cumulative candidate score matrix, which represents the model score of each successor candidate in the vocabulary for each beam element. This score is obtained by adding the local output score (computed as $f(h_{t,i})$) to the running total of the score for the candidate. The function $r$ in Algorithms 1 and 3 yields successive hidden states in recurrent neural models like RNNs, LSTMs etc. The embedding operation maps a word in the vocabulary, $w \in V$, to a continuous embedding vector. Finally, backpointers at each time step to the beam elements at the previous time step are also stored for identifying the best sequence, $\hat{y}$, at the conclusion of the search procedure. A backpointer at time step $t$ for a beam element $i$ is denoted by $b_{t,i} \in \{1, \ldots, k\}$, which points to one of the $k$ elements at the previous beam. We denote a vector of backpointers for all the beam elements by $b_t$. The follow-backpointer operation takes as input backpointers ($b_*$) and candidates ($w_*$) for all the beam elements at each time step and traverses the sequence in reverse (from time step $T$ through 1), following backpointers at each time step and identifying candidate words associated with each backpointer, which results in a sequence $\hat{y}$ of length $T$.
The procedure described in Algorithm 1 is discontinuous because of the top-k-argmax procedure that returns a pair of vectors corresponding to the $k$ highest-scoring indices for backpointers and vocabulary items from the score matrix $S_t$. This index selection results in hard backpointers at each time step, which restrict the gradient flow during backpropagation. In the next section, we describe a continuous relaxation to the top-k-argmax procedure, which forms the crux of our approach.
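As a rough illustration of hard beam search with backpointers, the following sketch replaces the recurrent decoder with a toy first-order scorer. The names `beam_search`, `emit` and `trans` are hypothetical and are not taken from Algorithm 1; the hard top-k selection is done with a sort over the flattened candidate score matrix.

```python
import numpy as np

def beam_search(emit, trans, k):
    """Hard beam search with backpointers for a toy first-order scorer.

    emit: (T, V) per-step word scores; trans: (V, V) score of word w following
    word p (a stand-in for the recurrent decoder). Returns (sequence, score).
    """
    T, V = emit.shape
    cum = np.zeros(1)          # cumulative score of each beam element
    prev = None                # word chosen by each beam element at the last step
    backptrs, words = [], []
    for t in range(T):
        local = emit[t][None, :] if prev is None else emit[t][None, :] + trans[prev]
        S = cum[:, None] + local                              # candidate score matrix
        flat = np.argsort(-S, axis=None, kind="stable")[:k]   # hard top-k-argmax
        b, w = np.unravel_index(flat, S.shape)                # backpointers, chosen words
        backptrs.append(b)
        words.append(w)
        cum, prev = S[b, w], w
    # Follow backpointers from the final step back to the first.
    i = int(np.argmax(cum))
    score = float(cum[i])
    seq = []
    for t in range(T - 1, -1, -1):
        seq.append(int(words[t][i]))
        i = int(backptrs[t][i])
    return seq[::-1], score

emit = np.array([[1.0, 0.5], [0.0, 0.0]])
trans = np.array([[0.0, 0.0], [0.0, 1.0]])
print(beam_search(emit, trans, k=1))  # ([0, 0], 1.0)
print(beam_search(emit, trans, k=2))  # ([1, 1], 1.5)
```

In this toy case greedy decoding (k=1) commits to word 0 at the first step and misses the higher-scoring sequence that a beam of size 2 recovers. The sort and index selection are the discrete operations through which no gradient flows.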
2.2 Continuous Approximation to top-k-argmax
The key property that we use in our approximation is that for a real-valued vector $z$, the argmax with respect to a vector of scores, $s$, can be approximated by a temperature-controlled softmax operation. The argmax operation can be represented as:
$$z_m = \sum_i z_i \, \mathbb{1}[\forall i' \neq i,\; s_i > s_{i'}], \qquad m = \arg\max_i s_i$$
which can be relaxed by replacing the indicator function with a peaked-softmax operation with hyperparameter $\alpha$:
$$\tilde{z} = \sum_i z_i \, \frac{\exp(\alpha s_i)}{\sum_{i'} \exp(\alpha s_{i'})}$$
As $\alpha \to \infty$, $\tilde{z} \to z_m$ so long as there is only one maximum value in the vector $s$. This peaked-softmax operation has been shown to be effective in recent work [Maddison, Mnih, and Teh 2017; Jang, Gu, and Poole 2016; Goyal, Dyer, and Berg-Kirkpatrick 2017] involving continuous relaxation to the argmax operation, although to our knowledge, this is the first work to apply it to approximate the beam search procedure.
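A minimal NumPy sketch of the peaked-softmax relaxation follows. The function names `peaked_softmax` and `soft_select` and the example numbers are illustrative assumptions; the code only shows how the weighted average approaches the value at the argmax as the peakedness hyperparameter grows.

```python
import numpy as np

def peaked_softmax(scores, alpha):
    """Softmax of alpha * scores, computed stably by subtracting the max score."""
    z = alpha * (scores - np.max(scores))
    w = np.exp(z)
    return w / w.sum()

def soft_select(values, scores, alpha):
    """Weighted average of `values`; approaches values[argmax(scores)] as alpha grows."""
    return float(np.dot(peaked_softmax(scores, alpha), values))

scores = np.array([2.0, 1.0, 0.5])     # unique maximum at index 0
values = np.array([10.0, 20.0, 30.0])
for alpha in (0.1, 1.0, 10.0, 100.0):
    print(alpha, round(soft_select(values, scores, alpha), 4))
```

For small `alpha` the result is close to a plain average of `values`; for large `alpha` it is numerically indistinguishable from `values[0]`, the entry selected by the hard argmax.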
Using this peaked-softmax operation, we propose an iterative algorithm for computing a continuous relaxation to the top-k-argmax procedure in Algorithm 2, which takes as input a score matrix of size $k \times |V|$ and returns $k$ peaked matrices $P^1, \ldots, P^k$ of size $k \times |V|$. Each matrix $P^i$ represents the index of the $i$-th max. For example, $P^1$ will have most of its mass concentrated on the index in the matrix that corresponds to the argmax, while $P^2$ will have most of its mass concentrated on the index of the 2nd-highest scoring element. Specifically, we obtain matrix $P^i$ by computing the squared difference between the $i$-th highest score and all the scores in the matrix and then using the peaked-softmax operation over the negative squared differences. As a result, scores closer to the $i$-th highest score receive higher mass than scores far away from it.
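The construction described above can be sketched as follows. This is a forward-only NumPy illustration under the assumption that the top-k scores are distinct; the function name `soft_topk_argmax` and the way the next-highest score is found (a max over the scores strictly below the previous one) are choices made for this sketch and may differ from Algorithm 2.

```python
import numpy as np

def soft_topk_argmax(S, k, alpha):
    """Return k peaked matrices, the j-th concentrating on the j-th highest score.

    Assumes the top-k scores in S are distinct (no ties).
    """
    S = np.asarray(S, dtype=float)
    peaked = []
    upper = np.inf
    for _ in range(k):
        m = S[S < upper].max()             # j-th highest score via a max over what is left
        logits = -alpha * (S - m) ** 2     # negative squared differences, largest (0) at m
        weights = np.exp(logits)
        peaked.append(weights / weights.sum())
        upper = m
    return peaked

S = np.array([[1.0, 5.0, 2.0],
              [4.0, 0.0, 3.0]])
P1, P2 = soft_topk_argmax(S, k=2, alpha=50.0)
row_mass = P2.sum(axis=1)   # mass that each row (beam element) contributes to P2
```

With a large `alpha`, `P1` puts nearly all of its mass on the entry holding 5.0 and `P2` on the entry holding 4.0; summing `P2` over rows yields a length-k vector on the probability simplex that is concentrated on the second beam element.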
Hence, the continuous relaxation to the top-k-argmax operation can be simply implemented by iteratively using the max operation, which is continuous and allows for gradient flow during backpropagation. As $\alpha \to \infty$, each peaked matrix $P^i$ converges to a hard index pair representing the hard backpointer and successor candidate described in Algorithm 1. For finite $\alpha$, we introduce a notion of a soft backpointer, represented as a vector of length $k$ in the probability simplex, which represents the contribution of each beam element from the previous time step to a beam element at the current time step. This is obtained by a row-wise sum over $P^i$ to get $k$ values representing soft backpointers.
2.3 Training with Continuous Relaxation of Beam Search
We describe our approach in detail in Algorithm 3 and illustrate the soft beam recurrence step in Figure 1. For composing the loss function and the beam search function for our optimization as proposed in Equation 2, we make use of the decomposability of the loss function across time steps. Thus for a sequence $y$, the total loss is $L(y, y^*) = \sum_{t=1}^{T} d(y_t, y^*, t)$. In our experiments, $d$ is the Hamming loss, which can be easily computed at each time step by simply comparing the gold $y^*_t$ with $y_t$. While the exact computation of $d$ will vary according to the loss, our proposed procedure will be applicable as long as the total loss is decomposable across time steps. While decomposability of loss is a strong assumption, existing literature on structured prediction [Taskar, Guestrin, and Koller 2004; Tsochantaridis et al. 2005] has made do with this assumption, often using decomposable losses as surrogates for non-decomposable ones. We detail the continuous relaxation to beam search in Algorithm 3 with $D_{t,i}$ being the cumulative loss of beam element $i$ at time step $t$ and $E$ being the embedding matrix of the target vocabulary, which is of size $|V| \times l$ where $l$ is the size of the embedding vector.
In Algorithm 3, all the discrete selection functions have been replaced by their soft, continuous counterparts, which can be backpropagated through. As a result, all the operations are matrix and vector operations, which is ideal for a GPU implementation. An important aspect of this algorithm is that we no longer rely on exactly identifying a discrete search prediction, since we are only interested in a continuous approximation to the direct loss (line 18 of Algorithm 3), and all the computation is expressed via the soft beam search formulation, which eliminates all the sources of discontinuities associated with the training objective in Equation 1. The computational complexity of our approach for training scales linearly with the beam size and hence is roughly $k$ times slower than standard CE training for beam size $k$. Since we have established the pointwise convergence of peaked-softmax to argmax as $\alpha \to \infty$ for all vectors that have a unique maximum value, we can establish pointwise convergence of the objective in Equation 2 to the objective in Equation 1 as $\alpha \to \infty$, as long as there are no ties among the top-k scores of the beam expansion candidates at any time step. We posit that absolute ties are unlikely due to random initialization of weights and the domain of the scores being $\mathbb{R}$. Empirically, we did not observe any noticeable impact of potential ties on the training procedure and our approach performed well on the tasks as discussed in Section 4.
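One step of such a soft beam recurrence might look like the sketch below. It is an interpretation based on the description in the text, not a transcription of Algorithm 3: the helper names (`soft_topk_argmax`, `soft_beam_step`), the argument layout, and the omission of the recurrent update that would consume the blended hidden state and embedding are all assumptions made for illustration.

```python
import numpy as np

def soft_topk_argmax(S, k, alpha):
    """k peaked matrices over S; assumes the top-k scores are distinct."""
    peaked, upper = [], np.inf
    for _ in range(k):
        m = S[S < upper].max()
        w = np.exp(-alpha * (S - m) ** 2)
        peaked.append(w / w.sum())
        upper = m
    return peaked

def soft_beam_step(S, H, D, step_cost, E, k, alpha):
    """One relaxed beam update (hypothetical helper, not the paper's Algorithm 3).

    S: (k, V) candidate scores, H: (k, n) previous hidden states,
    D: (k,) cumulative losses, step_cost: (V,) loss of emitting each word now,
    E: (V, l) target embedding matrix.
    """
    scores, hiddens, embeds, losses = [], [], [], []
    for P in soft_topk_argmax(S, k, alpha):
        b = P.sum(axis=1)                            # soft backpointer over beam elements
        a = P.sum(axis=0)                            # soft choice over vocabulary items
        scores.append(float(np.sum(P * S)))          # expected cumulative score
        hiddens.append(b @ H)                        # blended previous hidden state
        embeds.append(a @ E)                         # blended embedding of the chosen word
        losses.append(float(b @ D + a @ step_cost))  # carried loss plus expected step loss
    return np.array(scores), np.stack(hiddens), np.stack(embeds), np.array(losses)

S = np.array([[1.0, 5.0, 2.0], [4.0, 0.0, 3.0]])
H = np.eye(2)
D = np.array([0.0, 1.0])
step_cost = np.array([1.0, 0.0, 1.0])   # Hamming cost when the gold word is index 1
E = np.eye(3)
scores, hiddens, embeds, losses = soft_beam_step(S, H, D, step_cost, E, k=2, alpha=50.0)
```

Every quantity is a weighted sum, so the step consists only of matrix and vector operations. With a large `alpha` the outputs coincide (up to numerical precision) with what hard selection would return: scores 5 and 4, the hidden states of beam elements 0 and 1, and cumulative losses 0 and 2.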
(3) 
We experimented with different annealing schedules for $\alpha$, starting with non-peaked softmax and moving toward peaked-softmax across epochs so that learning is stable with informative gradients. This is important because cost functions like Hamming distance with very high $\alpha$ tend to be non-smooth: they are generally flat in regions far away from changepoints and have a very large gradient near the changepoints, which makes optimization difficult.
2.4 Decoding
The motivation behind our approach is to make the optimization aware of beam search decoding while maintaining the continuity of the objective. However, since our approach doesn’t introduce any new model parameters and optimization is agnostic to the architecture of the seq2seq model, we were able to experiment with various decoding schemes like locally normalized greedy decoding, and hard beam search, once the model has been trained.
However, to reduce the gap between the training procedure and test procedure, we also experimented with soft beam search decoding. This decoding approach closely follows Algorithm 3, but along with soft backpointers, we also compute hard backpointers at each time step. After computing all the relevant quantities like model score, loss etc., we follow the hard backpointers to obtain the best sequence $\hat{y}$. This is very different from hard beam decoding because at each time step, the selection decisions are made via our soft continuous relaxation, which influences the scores, LSTM hidden states and input embeddings at subsequent time steps. The hard backpointers are essentially the MAP estimate of the soft backpointers at each step. With small, finite $\alpha$, we observe differences between soft beam search and hard beam search decoding in our experiments.
2.5 Comparison with Max-Margin Objectives
Max-margin based objectives are typically motivated as another kind of surrogate training objective that avoids the discontinuities associated with direct loss optimization. Hinge loss for structured prediction typically takes the form:
$$\min_{\theta} \; \max_{y \in \mathcal{Y}} \left( \Delta(y, y^*) + \mathrm{score}(x, y) \right) - \mathrm{score}(x, y^*)$$
where $x$ is the input sequence, $y^*$ is the gold target sequence, $\mathcal{Y}$ is the output search space, $\mathrm{score}(x, \cdot)$ is the model score of an output sequence, and $\Delta(y, y^*)$ is the discontinuous cost function, which we assume is decomposable across the time steps of a sequence. Finding the cost-augmented maximum score is generally difficult in large structured models and often involves searching over the output space and computing the approximate cost-augmented maximal output sequence and the score associated with it via beam search. This procedure introduces discontinuities in the training procedure of structured max-margin objectives and renders them not amenable to training via backpropagation. Related work [Wiseman and Rush 2016] on incorporating beam search into the training of neural sequence models does involve a cost-augmented max-margin loss, but it relies on discontinuous beam search forward passes and an explicit mechanism to ensure that the gold sequence stays in the beam during training, and hence does not involve backpropagation through the beam search procedure itself.
Our continuous approximation to beam search can very easily be modified to compute an approximation to the structured hinge loss so that it can be trained via backpropagation if the cost function is decomposable across timesteps. In Algorithm 3, we only need to modify line 5 as:
and instead of computing in Algorithm 3, we first compute the cost augmented maximum score as:
and also compute the target score by simply running the forward pass of the LSTM decoder over the gold target sequence. The continuous approximation to the hinge loss to be optimized is then: . We empirically compare this approach with the proposed approach to optimize direct loss in experiments.
3 Experimental Setup
Since our goal is to investigate the efficacy of our approach for training generic seq2seq models, we perform experiments on two NLP tagging tasks with very different characteristics and output search spaces: Named Entity Recognition (NER) and CCG supertagging. While seq2seq models are appropriate for the CCG supertagging task because of the long-range correlations between the sequential output elements and a large search space, they are not ideal for NER, which has a considerably smaller search space and weaker correlations between predictions at subsequent time steps. In our experiments, we observe improvements from our approach on both of the tasks. We use a seq2seq model with a bidirectional LSTM encoder (1 layer with tanh activation function) for the input sequence $x$, and an LSTM decoder (1 layer with tanh activation function) with a fixed attention mechanism that deterministically attends to the $t$-th input token when decoding the $t$-th output, and hence does not involve learning of any attention parameters. Since the computational complexity of our approach for optimization scales linearly with beam size for each instance, it is impractical to use very large beam sizes for training. Hence, beam size for all the beam search based experiments was set to 3, which resulted in improvements on both the tasks as discussed in the results. For both tasks, the direct loss function was the Hamming distance cost, whose minimization maximizes word-level accuracy.
3.1 Named Entity Recognition
For named entity recognition, we use the CoNLL 2003 shared task data [Tjong Kim Sang and De Meulder 2003] for the German language and use the provided data splits. We perform no preprocessing on the data. The output vocabulary size (label space) is 10. A peculiar characteristic of this problem is that the training data is naturally skewed toward one default label (‘O’) because sentences typically do not contain many named entities and the evaluation focuses on the performance of recognizing entities. Therefore, we modify the Hamming cost such that incorrect prediction of ‘O’ is doubly penalized compared to other incorrect predictions. We use hidden layers of size 64 and label embeddings of size 8. As mentioned earlier, seq2seq models are not an ideal choice for NER (tag-level correlations are short-ranged in NER, and the unnecessary expressivity of full seq2seq models over simple encoder-classifier neural models makes training harder). However, we wanted to evaluate the effectiveness of our approach on different instantiations of seq2seq models.
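One possible reading of the modified Hamming cost is sketched below: a wrong prediction costs 1, except that wrongly predicting the default label ‘O’ costs 2. The function names, the `penalty` parameter and the example tags are illustrative assumptions rather than details taken from the experiments.

```python
def ner_step_cost(pred, gold, default_label="O", penalty=2.0):
    """Per-token cost: 0 if correct, `penalty` if 'O' is predicted wrongly, else 1."""
    if pred == gold:
        return 0.0
    return penalty if pred == default_label else 1.0

def ner_sequence_cost(pred_seq, gold_seq):
    """Total cost, summed over time steps (the cost decomposes per token)."""
    return sum(ner_step_cost(p, g) for p, g in zip(pred_seq, gold_seq))

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-ORG"]
print(ner_sequence_cost(pred, gold))  # 3.0
```

In the example, missing the second token of the person entity by predicting ‘O’ costs 2, while confusing two entity types costs 1. The cost remains decomposable across time steps, which is the property the training procedure relies on.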
3.2 CCG Supertagging
We used the standard splits of CCGbank [Hockenmaier and Steedman 2002] for training, development, and testing. The label space of supertags has size 1,284, which is much larger than that of NER. The distribution of supertags in the training data exhibits a long tail because these supertags encode specific syntactic information about the words’ usage. The supertag labels are correlated with each other and many tags encode similar information about the syntax. Moreover, this task is sensitive to long-range sequential decisions and search effects because of how it holistically encodes the syntax of the entire sentence. We perform minor preprocessing on the data similar to the preprocessing in [Vaswani et al. 2016]. For this task, we used hidden layers of size 512 and the supertag label embeddings were also of size 512. The standard evaluation metric for this task is the word-level label accuracy, which directly corresponds to Hamming loss.
3.3 Hyperparameter tuning
For tuning all the hyperparameters related to optimization, we trained our models for 50 epochs and picked the models with the best performance on the development set. We also ran multiple random restarts for all the systems evaluated to account for performance variance across randomly started runs. We pretrained all our models with standard cross-entropy training, which was important for stable optimization of the non-convex neural objective with a large parameter search space. This warm starting is a common practice in prior work on complex neural models
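[Ranzato et al. 2016; Rush, Chopra, and Weston 2015; Bengio et al. 2015].

Table 1: Results on CCG supertagging (accuracy) on the development and test sets.

| Training procedure | Greedy Dev | Greedy Test | Hard Beam Search Dev | Hard Beam Search Test | Soft Beam Search Dev | Soft Beam Search Test |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline CE | 80.15 | 80.35 | 82.17 | 82.42 | 81.62 | 82.00 |
| Max-margin (Section 2.5), annealed $\alpha$ | - | - | 83.03 | 83.54 | 82.82 | 83.05 |
| Max-margin (Section 2.5), $\alpha$=1.0 | - | - | 83.02 | 83.36 | 82.49 | 82.85 |
| Direct loss (Section 2.3), $\alpha$=1.0 | - | - | 83.23 | 82.65 | 82.58 | 82.82 |
| Direct loss (Section 2.3), annealed $\alpha$ | - | - | 85.69 | 85.82 | 85.58 | 85.78 |

Table 2: Results on Named Entity Recognition (F1) on the development and test sets.

| Training procedure | CE Greedy Dev | CE Greedy Test | Hard Beam Search Dev | Hard Beam Search Test | Soft Beam Search Dev | Soft Beam Search Test |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline CE | 50.21 | 54.92 | 46.22 | 51.34 | 47.50 | 52.78 |
| Max-margin (Section 2.5), annealed $\alpha$ | - | - | 41.10 | 45.98 | 41.24 | 46.34 |
| Max-margin (Section 2.5), $\alpha$=1.0 | - | - | 40.09 | 44.67 | 39.67 | 43.82 |
| Direct loss (Section 2.3), $\alpha$=1.0 | - | - | 49.88 | 54.08 | 50.73 | 54.77 |
| Direct loss (Section 2.3), annealed $\alpha$ | - | - | 51.86 | 56.15 | 51.96 | 56.38 |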
3.4 Comparison
We report performance on validation and test sets for both the tasks in Tables 1 and 2. The baseline model is a cross-entropy trained seq2seq model (Baseline CE), which is also used to warm start the proposed optimization procedures in this paper. This baseline is compared against the approximate direct loss training objective (Section 2.3) and the approximate max-margin training objective (Section 2.5), each of which has its own rows in the tables. Results are reported for models trained with annealed $\alpha$, and also with a constant setting of $\alpha = 1.0$, which is a very smooth but inaccurate approximation of the original direct loss that we aim to optimize. [Footnote 2: Our pilot experiments that involved training with a very large constant $\alpha$ resulted in unstable optimization.] Comparisons are made on the basis of performance of the models under different decoding paradigms (represented as different columns in the tables): locally normalized decoding (CE greedy), hard beam search decoding and soft beam search decoding described in Section 2.4.
4 Results
As shown in Tables 1 and 2, our approach shows significant improvements over the locally normalized CE baseline with greedy decoding for both the tasks (+5.5 accuracy points for supertagging and +1.5 F1 points for NER). The improvement is more pronounced on the supertagging task, which is not surprising because: (i) the evaluation metric is tag-level accuracy, which is congruent with the Hamming loss that our objective directly optimizes, and (ii) the supertagging task itself is very sensitive to the search procedure because tags across time steps tend to exhibit long-range dependencies as they encode specialized syntactic information about word usage in the sentence.
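The quoted gains can be checked against the test-set columns of the tables. The short calculation below assumes that the last row of each table is the annealed direct loss model, and that the quoted gains use hard beam search decoding for supertagging and soft beam search decoding for NER; the variable names are illustrative.

```python
# Test-set numbers copied from Tables 1 and 2 (row identities are assumed).
ccg_ce_greedy, ccg_ce_beam, ccg_direct_hard_beam = 80.35, 82.42, 85.82
ner_ce_greedy, ner_ce_beam, ner_direct_soft_beam = 54.92, 51.34, 56.38

ccg_gain = round(ccg_direct_hard_beam - ccg_ce_greedy, 2)   # gain over CE greedy, supertagging
ner_gain = round(ner_direct_soft_beam - ner_ce_greedy, 2)   # gain over CE greedy, NER
ccg_beam_effect = round(ccg_ce_beam - ccg_ce_greedy, 2)     # effect of beam search on the CE model
ner_beam_effect = round(ner_ce_beam - ner_ce_greedy, 2)
print(ccg_gain, ner_gain, ccg_beam_effect, ner_beam_effect)
```

Under these assumptions the differences are about 5.47 and 1.46 points, consistent with the rounded figures in the text, while beam search on the cross-entropy baseline helps supertagging by about 2 points and hurts NER by about 3.6 points.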
Another common trend is that annealing $\alpha$ always results in better performance than training with a constant $\alpha$ for both the approximate direct loss objective (Section 2.3) and the approximate max-margin objective (Section 2.5). This shows that a stable training scheme that smoothly approaches minimizing the actual direct loss is important for our proposed approach. Additionally, we did not observe a large difference when our soft approximation is used for decoding (Section 2.4) compared to hard beam search decoding, which suggests that our approximation to the hard beam search is as effective as its discrete counterpart.
For supertagging, we observe that the baseline cross-entropy trained model improves its predictions with beam search decoding compared to greedy decoding by 2 accuracy points, which suggests that beam search is already helpful for this task, even without search-aware training. Both the optimization schemes proposed in this paper improve upon the baseline, with soft direct loss optimization performing better than the approximate max-margin approach. [Footnote 3: Separately, we also ran experiments with a max-margin objective that used hard beam search to compute loss-augmented decodes. This objective is discontinuous, but we evaluated the performance of gradient optimization nonetheless. While not included in the result tables, we found that this approach was unstable and considerably underperformed both approximate max-margin and direct loss objectives.]
For NER, we observe that optimizing the approximate direct loss objective outperforms all the other approaches, but we also observe interesting behaviour of beam search decoding and the approximate max-margin objective for this task. The pretrained CE baseline model yields worse performance when beam search is done instead of greedy locally normalized decoding. This is because the training data is heavily skewed toward the ‘O’ label and hence the absolute score resolution between different tags at each time step during decoding is not enough to avoid leading beam search toward a wrong hypothesis path. We observed in our experiments that hard beam search resulted in predicting more ‘O’s, which also hurt the prediction of tags at future time steps and hurt precision as well as recall. Encouragingly, direct loss optimization, even though warm started with a CE trained model that performs worse with beam search, led to the NER model becoming more search-aware, which resulted in superior performance. However, we also observe that the approximate max-margin approach performs poorly here. We attribute this to a deficiency in the max-margin objective when coupled with approximate search methods like beam search that do not provide guarantees on finding the supremum: one way to drive this objective down is to learn model scores such that the search for the best hypothesis is difficult, so that the value of the loss-augmented decode is low, while the gold sequence maintains a higher model score. Because we also warm started with a pretrained model that results in worse performance with beam search decoding than with greedy decoding, we observe the adverse effect of this deficiency. The result is a model that scores the gold hypothesis highly, but yields poor decoding outputs. This observation indicates that using max-margin based objectives with beam search during training may actually achieve the opposite of our original intent: the objective can be driven down by introducing search errors.
The observation that our optimization method led to improvements on both the tasks by making the optimization more search-aware, even on NER, for which hard beam search decoding on a CE trained model hurt performance, indicates the effectiveness of our approach for training seq2seq models.
5 Conclusion
While beam search is a method of choice for performing search in neural sequence models, as our experiments confirm, it is not necessarily guaranteed to improve accuracy when applied to cross-entropy-trained models. In this paper, we propose a novel method for optimizing model parameters that directly takes into account the process of beam search itself through a continuous, end-to-end sub-differentiable relaxation of beam search composed with the final evaluation loss. Experiments demonstrate that our method is able to improve overall test-time results for models using beam search as a test-time inference method, leading to substantial improvements in accuracy.
References

[Andor et al. 2016] Andor, D.; Alberti, C.; Weiss, D.; Severyn, A.; Presta, A.; Ganchev, K.; Petrov, S.; and Collins, M. 2016. Globally normalized transition-based neural networks. In Association for Computational Linguistics.
[Bahdanau et al. 2016] Bahdanau, D.; Serdyuk, D.; Brakel, P.; Ke, N. R.; Chorowski, J.; Courville, A.; and Bengio, Y. 2016. Task loss estimation for structured prediction.
[Bahdanau et al. 2017] Bahdanau, D.; Brakel, P.; Xu, K.; Goyal, A.; Lowe, R.; Pineau, J.; Courville, A.; and Bengio, Y. 2017. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations.
[Bahdanau, Cho, and Bengio 2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
[Bengio et al. 2015] Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 1171–1179.
[Cho et al. 2014] Cho, K.; Van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
[Collins and Roark 2004] Collins, M., and Roark, B. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 111. Association for Computational Linguistics.
[Daumé III and Marcu 2005] Daumé III, H., and Marcu, D. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd international conference on Machine learning, 169–176. ACM.
[Daumé, Langford, and Marcu 2009] Daumé, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine learning 75(3):297–325.
[Dyer et al. 2015] Dyer, C.; Ballesteros, M.; Ling, W.; Matthews, A.; and Smith, N. A. 2015. Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075.
[Dyer et al. 2016] Dyer, C.; Kuncoro, A.; Ballesteros, M.; and Smith, N. A. 2016. Recurrent neural network grammars. In Proceedings of NAACL-HLT, 199–209.
[Gormley, Dredze, and Eisner 2015] Gormley, M. R.; Dredze, M.; and Eisner, J. 2015. Approximation-aware dependency parsing by belief propagation. Transactions of the Association for Computational Linguistics (TACL).
[Goyal, Dyer, and Berg-Kirkpatrick 2017] Goyal, K.; Dyer, C.; and Berg-Kirkpatrick, T. 2017. Differentiable scheduled sampling for credit assignment. In Association for Computational Linguistics.
[Gu, Im, and Li 2017] Gu, J.; Im, D. J.; and Li, V. O. 2017. Neural machine translation with gumbel-greedy decoding. arXiv preprint arXiv:1706.07518.
[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
[Hockenmaier and Steedman 2002] Hockenmaier, J., and Steedman, M. 2002. Acquiring compact lexicalized grammars from a cleaner treebank. In LREC.
[Jang, Gu, and Poole 2016] Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations.
[Koehn and Knowles 2017] Koehn, P., and Knowles, R. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.
[Maddison, Mnih, and Teh 2017] Maddison, C. J.; Mnih, A.; and Teh, Y. W. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations.
[Neubig 2017] Neubig, G. 2017. Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619.
[Ranzato et al. 2016] Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations.
[Ross, Gordon, and Bagnell 2011] Ross, S.; Gordon, G. J.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 1, 6.
[Rush, Chopra, and Weston 2015] Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. In Empirical Methods in Natural Language Processing.
[Serban et al. 2015] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2015. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI’16 Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
[Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104–3112.
[Taskar, Guestrin, and Koller 2004] Taskar, B.; Guestrin, C.; and Koller, D. 2004. Max-margin markov networks. In Advances in neural information processing systems, 25–32.
[Tjong Kim Sang and De Meulder 2003] Tjong Kim Sang, E. F., and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, Volume 4, 142–147. Association for Computational Linguistics.
[Tsochantaridis et al. 2005] Tsochantaridis, I.; Joachims, T.; Hofmann, T.; and Altun, Y. 2005. Large margin methods for structured and interdependent output variables. Journal of machine learning research 6(Sep):1453–1484.
[Vaswani et al. 2016] Vaswani, A.; Bisk, Y.; Sagae, K.; and Musa, R. 2016. Supertagging with LSTMs. In Proceedings of NAACL-HLT, 232–237.
[Wiseman and Rush 2016] Wiseman, S., and Rush, A. M. 2016. Sequence-to-sequence learning as beam-search optimization. In Empirical Methods in Natural Language Processing.
[Xu et al. 2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A. C.; Salakhutdinov, R.; Zemel, R. S.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, 77–81.