Bi-Directional Differentiable Input Reconstruction for Low-Resource Neural Machine Translation

11/02/2018 ∙ by Xing Niu, et al. ∙ University of Maryland

We aim to better exploit the limited amounts of parallel text available in low-resource settings by introducing a differentiable reconstruction loss for neural machine translation (NMT). We reconstruct the input from sampled translations and leverage differentiable sampling and bi-directional NMT to build a compact model that can be trained end-to-end. This approach achieves small but consistent BLEU improvements on four language pairs in both translation directions, and outperforms an alternative differentiable reconstruction strategy based on hidden states.




1 Introduction

Neural Machine Translation (NMT) performance degrades sharply when parallel training data is limited (Koehn and Knowles, 2017). Past work has addressed this problem by leveraging monolingual data (Sennrich et al., 2016a; Ramachandran et al., 2017) or multilingual parallel data (Zoph et al., 2016; Johnson et al., 2017; Gu et al., 2018a). We explore a complementary direction: how can we better exploit the limited bilingual training data that is available?

Our approach builds on the bi-directional NMT model (Niu et al., 2018), which improves low-resource translation by jointly modeling translation in both directions (e.g., Swahili to English and English to Swahili). We propose a new training objective for this model by augmenting the standard translation cross-entropy loss with a differentiable input reconstruction loss to further exploit the source side of parallel samples.

Input reconstruction is motivated by the idea of round-trip translation. Suppose sentence $f$ is translated forward to $e$ using model $\theta_{fe}$ and then translated back to $\hat{f}$ using model $\theta_{ef}$; then $e$ is more likely to be a good translation if the distance between $\hat{f}$ and $f$ is small (Brislin, 1970). Prior work applied round-trip translation to monolingual examples and sampled the intermediate translation $e$ from a $K$-best list generated by model $\theta_{fe}$ using beam search (Cheng et al., 2016; He et al., 2016). However, beam search is not differentiable, which prevents back-propagating reconstruction errors to $\theta_{fe}$. As a result, reinforcement learning algorithms or independent updates to $\theta_{fe}$ and $\theta_{ef}$ were required.

In this paper, we focus on the problem of making input reconstruction differentiable to simplify training. In past work, Tu et al. (2017) addressed this issue by reconstructing source sentences from the decoder’s hidden states. However, this reconstruction task can be artificially easy if hidden states over-memorize the input.

We propose instead to combine benefits from differentiable sampling and bi-directional NMT to obtain a compact model that can be trained end-to-end with back-propagation. Specifically,

  • Translations are sampled using the Straight-Through Gumbel Softmax (STGS) estimator (Jang et al., 2017; Bengio et al., 2013), which allows back-propagating reconstruction errors.

  • A single bi-directional model is used as a translator and a reconstructor (i.e. the forward and backward models share parameters, $\theta_{fe} = \theta_{ef}$). By contrast, uni-directional models would require a distinct reconstructor, which would introduce additional parameters.

Experiments show that our approach outperforms reconstruction from hidden states. It achieves consistent improvements across various low-resource language pairs and directions, showing its effectiveness in making better use of limited parallel data.

2 Related Work

Using round-trip translations ($f \rightarrow e \rightarrow \hat{f}$, i.e. source, translation, reconstructed source) as a training signal for NMT usually requires auxiliary models to perform back-translation and cannot be trained end-to-end without reinforcement learning. For instance, Cheng et al. (2016) added a reconstruction loss for monolingual examples to the training objective, but did not back-propagate errors to the forward translator. He et al. (2016) evaluated the quality of $e$ by a language model and of $\hat{f}$ by a reconstruction likelihood. Both approaches have symmetric forward and backward translation models which are updated alternately. This requires policy gradient algorithms for training, which are not always stable.

Back-translation (Sennrich et al., 2016a) performs half of the reconstruction process, by generating a synthetic source side for monolingual target language examples: $e \rightarrow \hat{f}$. It uses an auxiliary backward model to generate the synthetic data but only the primary forward model is updated by training on it (i.e. reconstructing $\hat{f} \rightarrow e$). Forward and backward models can be updated iteratively (Zhang et al., 2018; Niu et al., 2018); however, this is an expensive process as back-translations are regenerated at each iteration.

Prior work has sought to simplify the optimization of reconstruction losses by side-stepping beam search. Tu et al. (2017) first proposed reconstructing the NMT input from the decoder’s hidden states, while Wang et al. (2018a, b) suggested using both encoder and decoder hidden states to improve the translation of dropped pronouns. However, these models might achieve low reconstruction errors by learning to copy the input to hidden states. To avoid copying the input, Artetxe et al. (2018) and Lample et al. (2018) used denoising autoencoders (Vincent et al., 2008) in unsupervised NMT.

Our approach is based instead on the Gumbel Softmax (Jang et al., 2017; Maddison et al., 2017), which facilitates differentiable sampling of sequences of discrete tokens. It has been successfully applied in many sequence generation tasks, including learning to communicate (Havrylov and Titov, 2017), composing tree structures from text (Choi et al., 2018), and tasks under the umbrella of generative adversarial networks (Goodfellow et al., 2014) such as generating context-free grammars (Kusner and Hernández-Lobato, 2016), machine comprehension (Wang et al., 2017) and machine translation (Gu et al., 2018b).

3 Approach

NMT is framed as a conditional language model, where the probability of predicting target token $e_t$ at step $t$ is conditioned on the previously generated sequence of tokens $e_{<t}$ and the source sequence $f$, given the model parameters $\theta$. Suppose each token is indexed and represented as a one-hot vector; its probability is then realized as a softmax function over a linear transformation $a(h_t)$, where $h_t$ is the decoder’s hidden state at step $t$:

$$P(e_t \mid e_{<t}, f; \theta) = \mathrm{softmax}\big(a(h_t)\big)^{\top} e_t$$

The hidden state is calculated by neural network layers $g$ given the embeddings $E(e_{<t})$ of the previous target tokens $e_{<t}$ in the embedding matrix $E$ and the context $c_t$ coming from the source:

$$h_t = g\big(E(e_{<t}),\, c_t\big)$$

In our bi-directional model, the source sentence can be in either language ($f$ or $e$) and is translated into the other ($e$ or $f$). The language is marked by a tag (e.g., <en>) at the beginning of each source sentence (Niu et al., 2018; Johnson et al., 2017). To facilitate symmetric reconstruction, we also add language tags to target sentences. The training corpus is then built by swapping the source and target sentences of a parallel corpus and appending the swapped version to the original.
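The corpus construction described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper names are hypothetical, and the sketch assumes that each tag names the language of the sentence it is attached to (the text does not say whether a tag names the sentence's own language or the desired output language).

```python
def tag(sentence, lang):
    """Prefix a sentence with a language tag such as <en> (assumed tag format)."""
    return f"<{lang}> {sentence}"


def build_bidirectional_corpus(pairs, src_lang, tgt_lang):
    """Tag both sides of every pair, then append the swapped copy of the corpus."""
    forward = [(tag(s, src_lang), tag(t, tgt_lang)) for s, t in pairs]
    backward = [(t, s) for s, t in forward]  # same pairs, opposite direction
    return forward + backward


# Placeholder sentences, for illustration only.
corpus = build_bidirectional_corpus(
    [("sw sentence 1", "en sentence 1")], src_lang="sw", tgt_lang="en"
)
```

Because the swapped copy is appended, a single model sees every pair in both directions, which is what allows it to act as its own reconstructor.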

3.1 Bi-Directional Reconstruction

Our bi-directional model performs both forward translation and backward reconstruction. By contrast, uni-directional models require an auxiliary reconstruction module, which introduces additional parameters. This module can be either a decoder-based reconstructor (Tu et al., 2017; Wang et al., 2018a, b) or a reversed dual NMT model (Cheng et al., 2016; He et al., 2016; Wang et al., 2018c; Zhang et al., 2018).

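Here the reconstructor, which shares the same parameters $\theta$ with the translator $T(\cdot)$, can also be trained end-to-end by maximizing the log-likelihood of reconstructing $f$:

$$\mathcal{L}_R = \sum_{f} \log P\big(f \mid T(f; \theta); \theta\big)$$

Combining with the forward translation likelihood

$$\mathcal{L}_T = \sum_{(f, e)} \log P(e \mid f; \theta)$$

we use $\mathcal{L} = \mathcal{L}_T + \mathcal{L}_R$ as the final training objective for $f \rightarrow e$. The dual $e \rightarrow f$ model is trained simultaneously by swapping the language direction in bi-directional NMT.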

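Reconstruction is reliable only with a model that produces reasonable base translations. Following prior work (Tu et al., 2017; He et al., 2016; Cheng et al., 2016), we pre-train a base model with the translation objective $\mathcal{L}_T$ and fine-tune it with the combined translation and reconstruction objective $\mathcal{L}_T + \mathcal{L}_R$.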
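As a rough illustration of the two training stages, the sketch below treats each objective as a sum of token log-probabilities for one sentence pair. The function names and the probability values are hypothetical; in an actual system the reconstruction probabilities would come from decoding the sampled translation back into the source language.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum of the log-probabilities assigned to each reference token."""
    return sum(math.log(p) for p in token_probs)


def training_objective(translation_probs, reconstruction_probs=None):
    """Translation term alone (pre-training), or translation plus
    reconstruction (fine-tuning) when reconstruction_probs is given."""
    objective = sequence_log_likelihood(translation_probs)
    if reconstruction_probs is not None:
        objective += sequence_log_likelihood(reconstruction_probs)
    return objective


# Hypothetical per-token probabilities, for illustration only.
pretrain_value = training_objective([0.5, 0.25])
finetune_value = training_objective([0.5, 0.25], [0.5, 0.5])
```

Both terms are log-likelihoods to be maximized, so a toolkit that minimizes a loss would negate the returned value.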

3.2 Differentiable Sampling

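We use differentiable sampling to side-step beam search and back-propagate error signals. We sample a translation token at each time step with the probability reparameterized using the Gumbel-Max trick (Maddison et al., 2014):

$$\hat{e}_t = \text{one-hot}\Big(\arg\max_k \big(a(h_t)_k + G_k\big)\Big)$$

where $a(h_t)_k$ is the output-layer score of vocabulary item $k$ and $G_k$ is i.i.d. and drawn from $\mathrm{Gumbel}(0, 1)$ (i.e. $G_k = -\log(-\log(u_k))$ and $u_k \sim \mathrm{Uniform}(0, 1)$). We use Gumbel noise scaled with parameter $\beta$, i.e. $\mathrm{Gumbel}(0, \beta)$, to control the randomness. The sampling becomes deterministic (which is equivalent to greedy search) as $\beta$ approaches 0.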
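The sampling step can be sketched in a few lines of standard-library Python. The function names and the example scores are assumptions made for illustration; the sketch only shows the Gumbel-Max trick with a noise scale, not the full decoder.

```python
import math
import random


def sample_gumbel(rng, beta=1.0):
    """Draw one sample from Gumbel(0, beta) via G = -beta * log(-log(u))."""
    u = rng.random()
    while u == 0.0:  # log(0) is undefined; redraw in that rare case
        u = rng.random()
    return -beta * math.log(-math.log(u))


def gumbel_max_sample(logits, rng, beta=1.0):
    """Return argmax_k (logits[k] + G_k) with G_k ~ Gumbel(0, beta)."""
    perturbed = [x + sample_gumbel(rng, beta) for x in logits]
    return max(range(len(logits)), key=perturbed.__getitem__)


rng = random.Random(0)
scores = [1.0, 3.0, 2.0]  # hypothetical output-layer scores
greedy_choice = gumbel_max_sample(scores, rng, beta=0.0)
```

With `beta=0.0` the noise vanishes and the function returns the greedy choice. With `beta=1.0` the returned index is distributed according to the softmax of the scores, which is the property the Gumbel-Max trick relies on.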

Since $\arg\max$ is not a differentiable operation, we approximate its gradient with the Straight-Through Gumbel Softmax (STGS) (Jang et al., 2017; Bengio et al., 2013): $\nabla_\theta \hat{e}_t \approx \nabla_\theta \tilde{e}_t$, where

$$\tilde{e}_t = \mathrm{softmax}\big((a(h_t) + G) / \tau\big)$$

As $\tau$ approaches 0, $\mathrm{softmax}$ is closer to $\arg\max$ but training might be more unstable. While the STGS estimator is biased when $\tau$ is large, it performs well in practice (Gu et al., 2018b; Choi et al., 2018) and is sometimes faster and more effective than reinforcement learning (Havrylov and Titov, 2017).
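To make the straight-through idea concrete, the sketch below writes the forward and backward passes by hand instead of relying on an automatic differentiation framework. All names are hypothetical. The forward pass emits the hard one-hot sample, while the backward pass uses the Jacobian of the temperature-scaled softmax as a stand-in for the gradient of the argmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def stgs_forward(logits, gumbel_noise, tau):
    """Return (hard one-hot sample, soft relaxation) for one time step."""
    perturbed = [l + g for l, g in zip(logits, gumbel_noise)]
    soft = softmax([p / tau for p in perturbed])
    k = max(range(len(perturbed)), key=perturbed.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(perturbed))]
    return hard, soft


def stgs_backward(soft, upstream_grad, tau):
    """Straight-through backward pass: use d(soft)/d(logits) in place of
    d(hard)/d(logits). The softmax Jacobian is (diag(soft) - soft soft^T) / tau,
    so the gradient for logit j is soft[j] * (upstream[j] - <soft, upstream>) / tau.
    """
    dot = sum(s * u for s, u in zip(soft, upstream_grad))
    return [s * (u - dot) / tau for s, u in zip(soft, upstream_grad)]
```

The `1 / tau` factor in the backward pass hints at why small temperatures can destabilize training: the soft sample approaches the hard one, but the gradient scale can grow sharply wherever the distribution is not yet saturated.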

To rectify errors in the model’s own predictions, the decoder used for sampling translations only consumes its previously predicted tokens $\hat{e}_{<t}$. This contrasts with the usual teacher forcing strategy (Williams and Zipser, 1989), which always feeds in the ground-truth previous tokens $e_{<t}$ when predicting the current token $\hat{e}_t$, and would force the decoder to replicate the ground-truth sequence during training.
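The difference between the two decoding regimes can be illustrated with a toy decoder. Everything here is hypothetical: `step_fn` stands in for one decoder step, and the lookup table is a deliberately flawed "model" that makes one mistake.

```python
def decode(step_fn, max_len, reference=None, bos="<s>"):
    """Run a decoder for max_len steps.

    step_fn(prefix) returns the next token given the history so far.
    With a reference, the history is the ground truth (teacher forcing);
    without one, the decoder consumes its own previous predictions.
    """
    predictions = []
    for t in range(max_len):
        if reference is not None:
            prefix = [bos] + list(reference[:t])  # ground-truth history
        else:
            prefix = [bos] + predictions  # model's own history
        predictions.append(step_fn(prefix))
    return predictions


# Toy model: predicts the next token from the last one, wrongly mapping a -> x.
table = {"<s>": "a", "a": "x", "b": "c", "x": "y"}
toy_step = lambda prefix: table[prefix[-1]]

teacher_forced = decode(toy_step, 3, reference=["a", "b", "c"])
free_running = decode(toy_step, 3)
```

Under teacher forcing the mistake at the second step does not affect the third prediction, whereas in free-running mode it propagates. Sampling in free-running mode therefore exposes the reconstruction loss to the errors the model actually makes.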

4 Experiments

         Training   Dev.    Test
SW↔EN      60,570    500   3,000
TL↔EN      70,703    704   3,000
SO↔EN      68,550    844   3,000
TR↔EN     207,021  1,001   3,007
Table 1: Number of sentences comprising training, development and test sets.
Baseline   33.60 ± 0.14   30.70 ± 0.19   27.23 ± 0.11   32.15 ± 0.21   12.25 ± 0.08   20.80 ± 0.12   12.90 ± 0.04   15.32 ± 0.11
Hidden     33.41 ± 0.15   30.91 ± 0.19   27.43 ± 0.14   32.20 ± 0.35   12.30 ± 0.11   20.72 ± 0.16   12.77 ± 0.11   15.34 ± 0.10
  Δ        -0.19 ± 0.24    0.21 ± 0.14    0.19 ± 0.13    0.04 ± 0.17    0.05 ± 0.11   -0.08 ± 0.12   -0.13 ± 0.13    0.01 ± 0.07
Ours       33.92 ± 0.10   31.37 ± 0.18   27.65 ± 0.09   32.75 ± 0.32   12.47 ± 0.08   21.14 ± 0.19   13.26 ± 0.07   15.60 ± 0.19
  Δ         0.32 ± 0.12    0.66 ± 0.11    0.42 ± 0.16    0.59 ± 0.13    0.22 ± 0.04    0.35 ± 0.15    0.36 ± 0.09    0.28 ± 0.11
Ours       33.97 ± 0.08   31.39 ± 0.09   27.65 ± 0.10   32.65 ± 0.24   12.48 ± 0.09   21.20 ± 0.14   13.16 ± 0.08   15.52 ± 0.07
  Δ         0.37 ± 0.09    0.69 ± 0.11    0.42 ± 0.11    0.50 ± 0.08    0.23 ± 0.03    0.41 ± 0.13    0.25 ± 0.09    0.19 ± 0.05
Table 2: BLEU scores on eight translation directions. The numbers before and after ‘±’ are the mean and standard deviation over five randomly seeded models. Our proposed methods (the two rows labeled Ours) achieve small but consistent improvements. Δ BLEU scores are in bold if mean − std is above zero while in red if the mean is below zero.
Figure 1: Training curves of perplexity on (a) the training set and (b) the development set for TR↔EN. Reconstructing from hidden states and reconstructing from sampled translations are compared. Reconstructing from hidden states achieves extremely low training perplexity and suffers from unstable training during the early stage.

4.1 Tasks and Data

We evaluate our approach on four low-resource language pairs. Parallel data for Swahili↔English (SW↔EN), Tagalog↔English (TL↔EN) and Somali↔English (SO↔EN) contains a mixture of domains such as news and weblogs and is collected from the IARPA MATERIAL program, the Global Voices parallel corpus, Common Crawl (Smith et al., 2013), and the LORELEI Somali representative language pack (LDC2018T11). The test samples are extracted from the held-out ANALYSIS set of MATERIAL. Parallel Turkish↔English (TR↔EN) data is provided by the WMT news translation task (Bojar et al., 2018). We use pre-processed corpus/newsdev2016/newstest2017 as training/development/test sets.

We apply normalization, tokenization, true-casing, joint source-target BPE with 32,000 operations (Sennrich et al., 2016b) and sentence filtering (length cutoff of 80) to the parallel data. Itemized data statistics after preprocessing can be found in Table 1. We report case-insensitive BLEU with the WMT standard ‘13a’ tokenization using SacreBLEU (Post, 2018).
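The sentence-filtering step might look like the sketch below. The details are assumptions, since the text only gives the cutoff value: length is counted in whitespace-separated tokens, the cutoff is inclusive, and a pair is dropped if either side exceeds it.

```python
def filter_by_length(pairs, max_len=80):
    """Keep sentence pairs in which both sides have at most max_len tokens."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len
    ]


# Placeholder data, for illustration only.
example_pairs = [
    ("a b c", "x y"),
    (" ".join(["w"] * 81), "x"),  # source too long, dropped
    ("a", " ".join(["w"] * 80)),  # exactly at the cutoff, kept
]
kept = filter_by_length(example_pairs)
```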

4.2 Model Configuration and Baseline

We build NMT models upon the attentional RNN encoder-decoder architecture (Bahdanau et al., 2015) implemented in the Sockeye toolkit (Hieber et al., 2017). Our translation model uses a bi-directional encoder with a single LSTM layer of size 512, multilayer perceptron attention with a layer size of 512, and word representations of size 512. We apply layer normalization (Ba et al., 2016) and add dropout to embeddings and RNNs of the encoder and decoder with probability 0.2. We train using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 48 sentences and checkpoint the model every 1000 updates. The learning rate for baseline models is initialized to 0.001 and reduced by 30% after 4 checkpoints without improvement of perplexity on the development set. Training stops after 10 checkpoints without improvement.
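The learning-rate schedule and stopping rule can be sketched as a small state machine. This is not Sockeye's implementation: the class is hypothetical, and it assumes that the reduction counter restarts after each reduction while the stopping counter tracks checkpoints since the best development perplexity.

```python
class PlateauSchedule:
    """Reduce the learning rate on plateaus and signal early stopping."""

    def __init__(self, lr=0.001, reduce_factor=0.7, reduce_patience=4, stop_patience=10):
        self.lr = lr
        self.reduce_factor = reduce_factor  # "reduced by 30%" means multiply by 0.7
        self.reduce_patience = reduce_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.since_best = 0  # checkpoints since the best dev perplexity
        self.since_reduce = 0  # non-improving checkpoints since the last reduction

    def update(self, dev_perplexity):
        """Record one checkpoint; return True if training should stop."""
        if dev_perplexity < self.best:
            self.best = dev_perplexity
            self.since_best = 0
            self.since_reduce = 0
        else:
            self.since_best += 1
            self.since_reduce += 1
            if self.since_reduce >= self.reduce_patience:
                self.lr *= self.reduce_factor
                self.since_reduce = 0
        return self.since_best >= self.stop_patience
```

Under these assumptions, a run that never improves after its first checkpoint has its learning rate reduced twice (after the 4th and 8th stagnant checkpoints) before stopping at the 10th.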

The bi-directional NMT model ties source and target embeddings to yield a bilingual vector space. It also ties the output layer’s weight matrix and embeddings to achieve better performance in low-resource scenarios (Press and Wolf, 2017; Nguyen and Chiang, 2018).

We train five randomly seeded bi-directional baseline models by optimizing the forward translation objective $\mathcal{L}_T$ and report the mean and standard deviation of test BLEU. We fine-tune baseline models with the combined translation and reconstruction objective $\mathcal{L}_T + \mathcal{L}_R$, inheriting all settings except the learning rate, which is re-initialized to 0.0001. Each randomly seeded model is fine-tuned independently, so we are able to report the standard deviation of the BLEU difference (Δ).
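The reporting scheme can be sketched as follows, assuming that improvements are computed per seed (each fine-tuned model against the baseline it started from) and that the sample standard deviation is used; neither detail is stated explicitly. The scores below are hypothetical and are not results from the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return mean(scores), stdev(scores)


def paired_delta(baseline_scores, finetuned_scores):
    """Summarize per-seed improvements; entry i of both lists must share a seed."""
    deltas = [f - b for b, f in zip(baseline_scores, finetuned_scores)]
    return summarize(deltas)


# Hypothetical per-seed BLEU scores, for illustration only.
baseline = [20.0, 20.5, 19.5, 20.2, 19.8]
finetuned = [20.4, 20.8, 19.9, 20.5, 20.2]

baseline_mean, baseline_std = summarize(baseline)
delta_mean, delta_std = paired_delta(baseline, finetuned)
```

Pairing by seed matters because the spread of the differences can be much smaller than the spread of either set of scores, which makes a small but consistent gain visible.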

4.3 Contrastive Reconstruction Model

We compare our approach with reconstruction from hidden states. Following the best practice of Wang et al. (2018a), two reconstructors are used to take hidden states from both the encoder and the decoder. The corresponding two reconstruction losses and the canonical translation loss were originally uniformly weighted, but we found in preliminary experiments that balancing the reconstruction and translation losses yields better results.

We use the reconstructor exclusively to compute the reconstruction training loss. It has also been used to re-rank translation hypotheses in prior work, but Tu et al. (2017) showed in ablation studies that the gains from re-ranking are small compared to those from training.

4.4 Results

Table 2 shows that our reconstruction approach achieves small but consistent BLEU improvements over the baseline on all eight tasks.

We evaluate the impact of the Gumbel Softmax hyperparameters on the development set. We select the temperature $\tau$ and the noise scale $\beta$ based on training stability and BLEU. Interestingly, increased randomness in sampling does not have a strong impact on BLEU. Sampling by greedy search (i.e. $\beta = 0$) performs similarly to sampling with increased Gumbel noise (i.e. more random translation selection when $\beta > 0$, possibly including lower quality samples).

The scale of improvement is influenced by the distance between training and test data. For example, the out-of-vocabulary (OOV) rate of TR↔EN is more than twice the OOV rate of SW↔EN and the latter obtains higher BLEU. This suggests that reconstructing the input helps to better fit the limited parallel training data.
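One common way to measure this distance is the token-level OOV rate, sketched below. The text does not say whether OOV is counted over tokens or over word types, nor at which stage of preprocessing, so the token-level definition here is an assumption.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training data."""
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Placeholder tokens, for illustration only.
rate = oov_rate(["a", "b", "c"], ["a", "d", "d", "b"])
```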

Reconstructing from hidden states yields more mixed results. It fails to improve BLEU in more difficult cases, such as TR↔EN with high OOV rates. We observe extremely low training perplexity for reconstructing from hidden states compared with our proposed approach (Figure 1a). This suggests that reconstructing from hidden states yields representations that memorize the input rather than improve output representations.

Another advantage of our approach is that all parameters are jointly pre-trained, which results in more stable training behavior. By contrast, reconstructing from hidden states requires initializing the reconstructors independently and suffers from unstable early training behavior (Figure 1).

5 Conclusion

We studied reconstructing the input of NMT from its intermediate translations to better exploit training samples in low-resource settings. We used a bi-directional NMT model and the Straight-Through Gumbel Softmax to build a fully differentiable reconstruction model that does not require any additional parameters. We empirically demonstrated that our approach is effective in low-resource scenarios. In future work, we will investigate the use of differentiable reconstruction from sampled sequences in unsupervised and semi-supervised sequence generation tasks.


This research is based upon work supported in part by an Amazon Web Services Machine Learning Research Award, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract #FA8650-17-C-9117. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.