1 Introduction
We describe the University of Maryland systems for bandit machine translation, our submission to the bandit learning shared task of the Second Conference on Machine Translation (WMT17), held at EMNLP 2017. This shared task was set up, consistent with Kreutzer et al. (2017), simultaneously as a bandit learning problem and a domain adaptation problem. This raises the natural question: can we combine these potentially complementary information sources?
To investigate this question, we started from a standard neural machine translation (NMT) setup (§ 2; our implementation is based on OpenNMT (Klein et al., 2017), an open-source toolkit for neural MT), and then we:

trained robust reinforcement learning algorithms that can effectively learn from bandit feedback (§ 3); this allows our model to “test” proposed generalizations and adapt from the provided feedback signals.
Tackling the problem of learning with bandit feedback is important because neural machine translation systems, like other natural language processing technology, currently learn almost exclusively from labeled data for a specific domain. While this approach is useful, it cannot scale to a broad variety of languages and domains, as linguistic systems often cannot generalize well beyond their training data. Machine translation systems need to be able to learn to improve their performance from naturalistic interaction with users in addition to labeled data.
Bandit feedback (Robbins, 1985) offers systems the opportunity to “test” proposed generalizations and receive feedback on their performance; particularly interesting are contextual bandit systems, which make predictions based on a given input context (Auer et al., 2002; Langford and Zhang, 2008; Beygelzimer et al., 2010; Dudik et al., 2011). For example, a neural translation system trained on parliament proceedings often performs quite poorly at translating anything else. However, a translation system that is deployed to facilitate conversations between users might receive either explicit feedback (e.g. thumbs up/down) on its translations, or even implicit feedback, for example, the conversation partner asking for clarifications. There has recently been a flurry of work specifically addressing the bandit structured prediction problem (Chang et al., 2015; Sokolov et al., 2016a, b), of which machine translation is a special case.
Because this task is, at its core, a domain adaptation problem (for which a bandit learning signal is available to “help”), we also explored the use of standard domain adaptation techniques. We make a strong assumption that a sizable amount of monolingual, source-language data is available before bandit feedback begins. (This raises a natural question: in the cases where this assumption is unreasonable, could we do adaptation online? We believe that in many realistic settings, one can at least get some amount of unlabeled data to begin with (we consider sentences).) Using this monolingual data, we use data selection on a large corpus of parallel out-of-domain data (Europarl, NewsCommentary, CommonCrawl, Rapid) to seed an initial translation model.
Overall, the results support the following conclusions (§ 5), based on the limited setting of one new domain and one language pair:

data selection for domain adaptation alone improves translation quality by about 1.5 Bleu points (from 18.70 to 20.16 on the development server; see Table 2).

on top of the domain adaptation, reinforcement learning (which requires exploration) leads to an initial degradation of about Bleu points, which is recovered (on development data) after approximately sentences of bandit feedback. (Unfortunately, due to a bug in our implementation, our evaluation on the test server is incomplete for the reinforcement learning setting; see § 5.3 for a discussion.)
2 Neural MT architecture
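We closely follow Luong et al. (2015) for the structure of our neural machine translation (NMT) systems. Our NMT model consists of an encoder and a decoder, each of which is a recurrent neural network (RNN). We use a bidirectional RNN as the encoder and a unidirectional RNN as the decoder. The model directly estimates the posterior distribution $P_\theta(y \mid x)$ of translating a source sentence $x = (x_1, \dots, x_n)$ to a target sentence $y = (y_1, \dots, y_m)$:

$$P_\theta(y \mid x) = \prod_{t=1}^{m} P_\theta(y_t \mid y_{<t}, x) \tag{1}$$

where $y_{<t}$ are all tokens in the target sentence prior to $y_t$.
Each local distribution $P_\theta(y_t \mid y_{<t}, x)$ is modeled as a multinomial distribution over the target language vocabulary. We represent this as a linear transformation followed by a softmax function on the decoder’s output vector $\tilde{h}^{dec}_{t-1}$:

$$P_\theta(y_t \mid y_{<t}, x) = \mathrm{softmax}\left(W_s\, \tilde{h}^{dec}_{t-1} / \tau\right) \tag{2}$$

$$\tilde{h}^{dec}_{t-1} = \tanh\left(W_o\, [h^{dec}_{t-1}; c_{t-1}]\right) \tag{3}$$

$$c_{t-1} = \mathrm{attend}\left(h^{enc}_{1:n}, h^{dec}_{t-1}\right) \tag{4}$$

where $[\cdot\,;\cdot]$ is the concatenation of two vectors, $\mathrm{attend}(\cdot, \cdot)$ is an attention mechanism (we use the “concat” mechanism in Luong et al. (2015)), $\tau$ is the temperature hyperparameter of the softmax function, and $h^{enc}$ and $h^{dec}$ are the hidden vectors generated by the encoder and the decoder, respectively.
During training, the encoder first encodes $x$ to a continuous vector $\Phi(x)$, which is used as the initial hidden vector $h^{dec}_0$ for the decoder. The decoder performs RNN updates to produce a sequence of hidden vectors:

$$h^{dec}_t = f_\theta\left(h^{dec}_{t-1}, [\tilde{h}^{dec}_{t-1}; e(y_t)]\right) \tag{5}$$

where $e(\cdot)$ is a word embedding lookup operation and $f_\theta$ is an LSTM cell. (Feeding $\tilde{h}^{dec}_{t-1}$ to the next step is “input feeding.”)
At prediction time, the ground-truth token $y_t$ in Eq. 5 is replaced by the model’s own prediction $\hat{y}_t$:

$$h^{dec}_t = f_\theta\left(h^{dec}_{t-1}, [\tilde{h}^{dec}_{t-1}; e(\hat{y}_t)]\right) \tag{6}$$

In a supervised learning framework, an NMT model is typically trained under the maximum log-likelihood objective:

$$\max_\theta \mathcal{L}_{sup}(\theta) = \max_\theta \mathbb{E}_{(x, y) \sim D_{tr}} \left[\log P_\theta(y \mid x)\right] \tag{7}$$

where $D_{tr}$ is the training set.
However, this learning framework is not applicable to our problem since reference translations are not available.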
3 Reinforcement Learning
The translation process of an NMT model can be viewed as a Markov decision process operating on a continuous state space. The states are the hidden vectors $h^{dec}_t$ generated by the decoder. The action space is the target language’s vocabulary.
3.1 Markov decision process formulation
To generate a translation from a source sentence $x$, an NMT model commences at an initial state $h^{dec}_0$, which is a representation of $x$ computed by the encoder. At time step $t$, the model decides the next action to take by defining a stochastic policy $P_\theta(y_t \mid y_{<t}, x)$, which is directly parametrized by the parameters $\theta$ of the model. This policy takes the previous state $h^{dec}_{t-1}$ as input and produces a probability distribution over all actions (words in the target vocabulary). The next action $\hat{y}_t$ is chosen either by taking the $\arg\max$ of this policy or by sampling from it. The decoder computes the current state $h^{dec}_t$ by applying an RNN update on the previous state $h^{dec}_{t-1}$ and the next action taken $\hat{y}_t$ (Eq. 5).
The objective of bandit NMT is to find a policy that maximizes the expected quality of translations sampled from the model’s policy:

$$\max_\theta \mathcal{L}_{pg}(\theta) = \max_\theta \mathbb{E}_{x \sim D_{tr},\ \hat{y} \sim P_\theta(\cdot \mid x)} \left[R(\hat{y}, x)\right] \tag{8}$$

where $R$ is a reward function that returns a score in $[0, 1]$ reflecting the quality of the input translation.
We optimize this objective function by policy gradient methods. The gradient of the objective in Eq. 8 with respect to $\theta$ is (for notational brevity, we omit $x$ from this equation; the expectations are also taken over all $x$ given $D_{tr}$):

$$\nabla_\theta \mathcal{L}_{pg}(\theta) = \mathbb{E}_{\hat{y} \sim P_\theta(\cdot)} \left[R(\hat{y})\, \nabla_\theta \log P_\theta(\hat{y})\right] = \mathbb{E}_{\hat{y} \sim P_\theta(\cdot)} \left[\sum_{t=1}^{m} R(\hat{y})\, \nabla_\theta \log P_\theta(\hat{y}_t \mid \hat{y}_{<t})\right] \tag{9}$$
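3.2 Advantage Actor-Critic
We follow the approach of the advantage actor-critic (A2C) algorithm (Mnih et al., 2016), which combines the REINFORCE algorithm (Williams, 1992) with actor-critic. The algorithm approximates the gradient in Eq. 9 by a single-point sample and normalizes the rewards by $V$ values to reduce variance:

$$\nabla_\theta \mathcal{L}_{pg}(\theta) \approx \sum_{t=1}^{m} \bar{R}_t(\hat{y}, x)\, \nabla_\theta \log P_\theta(\hat{y}_t \mid \hat{y}_{<t}, x) \quad \text{with } x \sim D_{tr},\ \hat{y} \sim P_\theta(\cdot \mid x) \tag{10}$$

where $\bar{R}_t(\hat{y}, x) \equiv R(\hat{y}, x) - V(\hat{y}_{<t}, x)$ and $V(\hat{y}_{<t}, x)$ is a baseline that estimates the expected future reward given $x$ and $\hat{y}_{<t}$.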
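As an illustration of the single-sample estimate in Eq. 10, the following minimal Python sketch computes the advantage-weighted gradient with respect to the per-step logits of a softmax policy. The function names, the list-based representation, and the choice to differentiate with respect to raw logits (rather than the full NMT parameters) are assumptions made for this example; it is not our OpenNMT-based implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_actor_logit_grads(step_logits, actions, reward, baselines):
    """Gradient of sum_t (R - V_t) * log p(a_t) w.r.t. each step's logits.

    step_logits: one list of logits per time step (hypothetical layout).
    actions:     index of the sampled word at each time step.
    reward:      scalar sentence-level reward R for the sampled translation.
    baselines:   critic estimate V_t for each time step.
    For a softmax policy, d log p(a) / d logits = onehot(a) - p.
    """
    grads = []
    for logits, action, value in zip(step_logits, actions, baselines):
        probs = softmax(logits)
        advantage = reward - value
        grads.append([
            advantage * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)
        ])
    return grads
```

When the reward equals the baseline at every step, the advantage is zero and the sample contributes no gradient, which is the variance-reduction effect the baseline is meant to provide.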
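We train a critic model $V_\omega$ to estimate the $V$ values. This model is an attention-based encoder-decoder model that encodes a source sentence $x$ and decodes a predicted translation $\hat{y}$. At time step $t$, it computes $V_\omega(\hat{y}_{<t}, x) = W_{crt}\, h^{dec}_{t-1}$, where $h^{dec}_{t-1}$ is the hidden state of the RNN decoder, and $W_{crt}$ is a matrix that transforms a vector into a scalar. (We abuse the notation $h^{dec}_{t-1}$ to denote the decoder output. But since the translation model and the critic model do not share parameters, their decoder outputs are distinct.)
The critic model is trained to minimize the MSE between its estimates and the true values:

$$\mathcal{L}_{crt}(\omega) = \mathbb{E}_{x \sim D_{tr},\ \hat{y} \sim P_\theta(\cdot \mid x)} \left[\sum_{t=1}^{m} \left(R(\hat{y}, x) - V_\omega(\hat{y}_{<t}, x)\right)^2\right] \tag{11}$$

Given a fixed $x$, the gradient with respect to $\omega$ of this objective is:

$$\nabla_\omega \mathcal{L}_{crt}(\omega) = \mathbb{E}_{\hat{y} \sim P_\theta(\cdot \mid x)} \left[\sum_{t=1}^{m} 2 \left(V_\omega(\hat{y}_{<t}, x) - R(\hat{y}, x)\right) \nabla_\omega V_\omega(\hat{y}_{<t}, x)\right] \tag{12}$$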
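To make the critic objective concrete, here is a small Python sketch of the squared-error loss and its gradient for a single sampled translation, assuming a purely linear critic that maps each decoder hidden state to a scalar. The function name and the plain-list representation of vectors are hypothetical; in practice the hidden states come from the critic's own encoder-decoder and the gradient is also propagated through it.

```python
def critic_loss_and_grad(w, hidden_states, reward):
    """Squared-error loss sum_t (R - w.h_t)^2 and its gradient w.r.t. w.

    w:             weight vector of the (assumed) linear value head.
    hidden_states: one hidden vector per time step of the sampled translation.
    reward:        scalar sentence-level reward R, shared by all time steps.
    """
    loss = 0.0
    grad = [0.0] * len(w)
    for h in hidden_states:
        value = sum(wi * hi for wi, hi in zip(w, h))
        error = value - reward          # V_t - R
        loss += error ** 2
        for i, hi in enumerate(h):
            grad[i] += 2.0 * error * hi  # d/dw (V_t - R)^2
    return loss, grad
```

A single gradient-descent step, `w = [wi - lr * gi for wi, gi in zip(w, grad)]` with a small step size `lr`, moves the estimates toward the observed reward.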
4 Domain Adaptation
We performed domain adaptation by choosing the best out-of-domain parallel data for training, using the cross-entropy-based data selection technique of Moore and Lewis (2010).
Cross-Entropy Difference
The Moore and Lewis method uses the cross-entropy difference $H_I(s) - H_O(s)$ for scoring a given sentence $s$, based on an in-domain language model $LM_I$ and an out-of-domain language model $LM_O$ (Moore and Lewis, 2010). We trained $LM_O$ using the German-English Europarl, NewsCommentary, CommonCrawl and Rapid (i.e., out-of-domain) data sets and $LM_I$ using the e-commerce domain data provided by Amazon. After training both language models, we follow the Moore and Lewis method by applying the cross-entropy difference to score each sentence in the out-of-domain data. The cross-entropy is mathematically defined as:

$$H(W) = -\frac{1}{n} \sum_{i=1}^{n} \log P_{LM}(w_i \mid w_1, \dots, w_{i-1})$$

where $P_{LM}$ is the probability of a LM for the word sequence $W = (w_1, \dots, w_n)$ and $w_1, \dots, w_{i-1}$ represents the history of the word $w_i$.
Sentences with the lowest cross-entropy difference scores are the most relevant because they are the most similar to the in-domain data and the least similar to the average of the out-of-domain data. Using this criterion, the top $n$ out-of-domain sentences are used to create the training set $D_{tr}$. In this work we consider various sizes $n$, selecting the $n$ that provides the best performance on the validation set.
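The selection procedure can be sketched in a few lines of Python. This toy version replaces the language models with add-one-smoothed unigram models purely to keep the example self-contained; that substitution, the function names, and the whitespace tokenization are all assumptions for illustration, and sentences are assumed to be non-empty.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Toy stand-in for a language model: unigram counts and their total."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, lm, vocab_size):
    """Per-token cross-entropy of a sentence under an add-one-smoothed LM."""
    counts, total = lm
    tokens = sentence.split()
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_top_n(in_domain, out_of_domain, n):
    """Return the n out-of-domain sentences with the lowest H_I(s) - H_O(s)."""
    lm_in = train_unigram_lm(in_domain)
    lm_out = train_unigram_lm(out_of_domain)
    # Shared vocabulary size, plus one slot for unseen words.
    vocab_size = len(set(lm_in[0]) | set(lm_out[0])) + 1

    def score(s):
        return cross_entropy(s, lm_in, vocab_size) - cross_entropy(s, lm_out, vocab_size)

    return sorted(out_of_domain, key=score)[:n]
```

Sweeping `n` and retraining on each selected subset corresponds to the search over selection sizes described above.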
5 Experiments
This section describes the experiments we conducted to assess the challenges posed by bandit machine translation, and our exploration of efficient algorithms to improve machine translation systems using bandit feedback.
As explained in previous sections, this task requires performing domain adaptation for machine translation through bandit feedback. With this in mind, we experimented with two types of models: simple domain adaptation without using the feedback, and reinforcement learning models that leverage the feedback. In the following sections, we explain how we train the regular NMT model, how we select training data for domain adaptation, and how we use reinforcement learning to leverage the bandit feedback.
We trained our systems using only the out-of-domain parallel data allowed by the shared task. The entire out-of-domain dataset contains 4.5 million parallel German-English sentences from Europarl, NewsCommentary, CommonCrawl and Rapid data for the News Translation (constrained) task. Our NMT model is based on OpenNMT’s (Klein et al., 2017) PyTorch implementation of an attention-based encoder-decoder model. We extended their implementation and added our implementation of the A2C algorithm. Details of the model configuration and training hyperparameters are listed in Table 1.
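Table 1: Model configuration and training hyperparameters.

| Hyperparameter | Value |
|---|---|
| Word embedding size | 500 |
| Hidden vector size | 500 |
| Number of LSTM layers | 2 |
| Batch size | 64 |
| Epochs | 13 |
| Optimizer | SGD |
| Initial learning rate | 1 |
| Dropout | 0.3 |
| BPE size | 20000 |
| Vocab size | 25k (*) |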
5.1 Subword Unit for Neural Machine Translation
Neural machine translation (NMT) relies on first mapping each word into a vector space, and traditionally we have a word vector corresponding to each word in a fixed vocabulary. Because of data scarcity, it is hard for the system to learn high-quality representations for rare words. To address this problem, with the goal of open-vocabulary NMT, Sennrich et al. (2015) proposed to learn subword units and perform translation on a subword level. We incorporated this approach in our system as a preprocessing step. We generate the so-called byte-pair encoding (BPE), which is a mapping from words to subword units, on the whole training set (WMT15), for both the source and target languages. The same mapping is used for all the training sets in our system. After the translation, we do an extra postprocessing step to convert the target-language subword units back to words. With BPE, the vocabulary size is reduced dramatically and we no longer need to prune the vocabularies. We find this approach to be very helpful and use it for all our systems.
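For readers unfamiliar with BPE, the following Python sketch shows the core merge-learning loop in the spirit of Sennrich et al. (2015): repeatedly count adjacent symbol pairs and merge the most frequent one. It is a simplified illustration, not the tool we ran; the function names, the toy word-frequency dictionary in the usage note, and the tie-breaking rule are assumptions.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            key = merge_pair(symbols, best)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply learned merges, in order, to split a word into subword units."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, with `merges = learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, 1)`, the unseen word `"bug"` is segmented by `segment("bug", merges)` into the units `b`, `ug`, and the end-of-word marker. Postprocessing then amounts to concatenating units and splitting at the end-of-word markers.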
5.2 Domain Adaptation
As explained in Section 4, we use the data selection method of Moore and Lewis (2010) for domain adaptation. We use the kenlm toolkit (Heafield, 2011) to build all the language models used for the data selection. We train 4-gram language models. For computing the cross-entropy difference scores, we use the XenC (Rousseau, 2013) open-source data selection tool. We use the monolingual data selection mode of XenC on the in-domain and out-of-domain source sentences.
We have two parameters in this data selection process: the size of the in-domain dataset that is used for training the in-domain language model, and the size of the out-of-domain training data that we select. We experimented with different configurations, and the results on the development server are listed in Table 2. For obtaining the in-domain data, we prefetch the source sentences from the development and training servers. For the training server, we do not have enough keys to test all combinations, so we picked several configurations and, for each sentence, we randomly select a system to translate it. In addition, we compare translating with and without beam search. The purpose of this is to provide another comparable baseline for the later reinforcement learning model, for which beam search cannot be used. Thus, the domain adaptation system that we submit to the training server is the uniformly random combination of 6 systems, and their individual average Bleu scores are listed in Table 3.
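Table 2: Bleu scores on the development server for different in-domain data sizes (columns) and percentages of selected out-of-domain (o.o.d.) data (rows).

| o.o.d. % | in-domain 40k | in-domain 200k | in-domain 800k |
|---|---|---|---|
| 10% | 18.50 | 18.57 | 18.85 |
| 20% | 19.56 | 19.41 | 19.23 |
| 30% | 19.54 | 20.16 | 19.11 |
| 40% | 19.58 | 19.37 | 19.36 |
| 60% | 18.88 | 18.81 | 19.59 |
| 85% | 19.12 | 18.69 | 18.26 |
| (*) 100% | 18.70 | 18.70 | 18.70 |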
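Table 3: Average Bleu scores of the individual systems submitted to the training server, with (beam=5) and without (beam=1) beam search.

| i.d. size | o.o.d. % | beam=1 | beam=5 |
|---|---|---|---|
| 0 | 100% | 18.07 | 18.65 (+0.58) |
| 40k | 40% | 18.77 | 19.51 (+0.74) |
| 200k | 30% | 19.67 | 20.13 (+0.46) |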
These results show that most data selection configurations improve the overall Bleu score. The model without data selection achieves 18.70 Bleu on the development server, while the best data selection configuration (200k in-domain sentences, 30% of the out-of-domain data) achieves 20.16 (Table 2); the corresponding training-server scores with and without data selection are listed in Table 3. Table 3 also shows that beam search improves the Bleu score.
5.3 Reinforcement Learning Results
While submitting translations from the domain adaptation models to the development server, we collected 320,000 triples of (source sentence, translation, feedback) from 8 submitted systems. We use these triples to pretrain the critic in the A2C algorithm, and we use the same pretrained critic for all A2C-trained systems. The critic for each model is then updated jointly with its actor. We use Adam (Kingma and Ba, 2014) with learning rate of to update both the translation model and the critic model. We do not use dropout (Srivastava et al., 2014) during training with A2C as it makes learning less stable.
We note that the A2C algorithm has some drawbacks when it comes to generating translations. Normally we generate translations by greedy decoding, which means that at each time step we pick the word with the highest probability from the distribution produced by the model. But with A2C, we need to sample from the distribution of words to ensure exploration. As a direct consequence, it is not clear how to apply beam search for A2C (and for policy gradient methods in general). To control the trade-off between exploration and exploitation, we use the temperature hyperparameter $\tau$ in the softmax function. In our experiments $\tau$ is set to , which produces a peakier distribution and makes the model explore less.
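The effect of the temperature on exploration can be seen in a short Python sketch. The function names and the example logits are hypothetical, and the temperature values in the usage note are arbitrary illustrations rather than the setting used in our experiments.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(logits):
    """Greedy decoding: index of the highest-scoring word."""
    return max(range(len(logits)), key=lambda k: logits[k])


def sample_action(logits, temperature, rng):
    """Sampling-based decoding: draw a word index from the tempered policy."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling `softmax_with_temperature([2.0, 1.0, 0.0], 0.5)` concentrates more probability on the first word than the same call with temperature `1.0`, so samples drawn with `sample_action(..., rng=random.Random(0))` agree with `greedy_action` more often as the temperature decreases.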
Batching is important for stable bandit training. Because the submission servers return only a single reward per request, we had to devise a method for batching the feedback from the server: we cache the rewards until we reach the batch size, then do a batch update. However, due to some bugs in our implementation of this method, some sentences were not submitted in the correct order, and at some test points on the training server the scores are near or equal to zero.
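A minimal Python sketch of this caching scheme is shown below. The class name, its methods, and the `update_fn` callback are hypothetical and stand in for whatever performs the actual actor and critic update; the sketch only illustrates accumulating single-sentence feedback, in arrival order, until a full batch is available.

```python
class RewardBatcher:
    """Caches (source, translation, reward) triples and triggers batch updates."""

    def __init__(self, batch_size, update_fn):
        if batch_size < 1:
            raise ValueError("batch_size must be positive")
        self.batch_size = batch_size
        self.update_fn = update_fn  # called with a list of cached triples
        self._cache = []

    def add(self, source, translation, reward):
        """Store one feedback triple; return True if a batch update was run."""
        self._cache.append((source, translation, reward))
        if len(self._cache) < self.batch_size:
            return False
        # Hand over the full batch in arrival order and start a fresh cache.
        batch, self._cache = self._cache, []
        self.update_fn(batch)
        return True

    def pending(self):
        """Number of cached triples still waiting for a batch update."""
        return len(self._cache)
```

Keeping the triples in a single ordered list makes it easier to check that each reward is paired with the sentence that produced it, which is the property the ordering bugs mentioned above violated.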
In Figure 1 we present some results from the development server. We use a data selection model (200k in-domain sentences, 30% of the out-of-domain training data) as the baseline translation model, which we then try to improve further with the A2C algorithm. From this model, we generate translations with both sampling and greedy decoding to see how much the exploration required by the A2C algorithm hurts the performance. Figure 1 shows the average Bleu score over every 2000 sentences from the development server. A2C loses at the beginning because of exploration, and catches up as it sees more examples. Using sampling instead of greedy decoding hurts performance at first, but exploration eventually improves the model.
6 Conclusion
We present the University of Maryland neural machine translation systems for the WMT17 bandit MT shared task. We employ two approaches: out-of-domain data selection and reinforcement learning. Experiments show that the best performance is achieved with a model pretrained with only one-third of the available out-of-domain data. When applying reinforcement learning to further improve this model with bandit feedback, the model performance degrades initially due to exploration but gradually improves over time. Future work is to determine whether reinforcement learning is more effective on a larger bandit learning dataset.
Acknowledgements
The authors thank the anonymous reviewers for many helpful comments. We would like to thank the task organizers: Pavel Danchenko, Hagen Fuerstenau, Julia Kreutzer, Stefan Riezler, Artem Sokolov, Kellen Sunderland, and Witold Szymaniak for organizing the task and for their help throughout the process.
This work was supported by NSF grants IIS-1320538 and IIS-1618193, as well as an Amazon Research Award and LTS grant DO-0032. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor(s).
References
 Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1):48–77.
 Beygelzimer et al. (2010) Alina Beygelzimer, Lihong Li, Robert E Schapire, John Langford, and Lev Reyzin. 2010. An optimal high probability algorithm for the contextual bandit problem. Technical report.
 Chang et al. (2015) Kai-Wei Chang, He He, Hal Daumé III, and John Langford. 2015. Learning to search for dependencies. arXiv preprint arXiv:1503.05615.
 Dudik et al. (2011) Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.
 Heafield (2011) Kenneth Heafield. 2011. Kenlm: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, pages 187–197.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.
 Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. Opennmt: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
 Kreutzer et al. (2017) Julia Kreutzer, Artem Sokolov, and Stefan Riezler. 2017. Bandit structured prediction for neural sequence-to-sequence learning. In Association of Computational Linguistics.
 Langford and Zhang (2008) John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems. pages 817–824.
 Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Lisbon, Portugal, pages 1412–1421. http://aclweb.org/anthology/D15-1166.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.
 Moore and Lewis (2010) Robert C Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers. Association for Computational Linguistics, pages 220–224.
 Robbins (1985) Herbert Robbins. 1985. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, Springer, pages 169–177.
 Rousseau (2013) Anthony Rousseau. 2013. Xenc: An open-source tool for data selection in natural language processing. The Prague Bulletin of Mathematical Linguistics (100):73–82.
 Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
 Sokolov et al. (2016a) Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016a. Learning structured predictors from bandit feedback for interactive NLP. ACL.
 Sokolov et al. (2016b) Artem Sokolov, Julia Kreutzer, Stefan Riezler, and Christopher Lo. 2016b. Stochastic structured prediction under bandit feedback. In Advances in Neural Information Processing Systems. pages 1489–1497.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.