Regressing Word and Sentence Embeddings for Regularization of Neural Machine Translation

In recent years, neural machine translation (NMT) has become the dominant approach in automated translation. However, like many other deep learning approaches, NMT suffers from overfitting when the amount of training data is limited. This is a serious issue for low-resource language pairs and many specialized translation domains that are inherently limited in the amount of available supervised data. For this reason, in this paper we propose regressing word (ReWE) and sentence (ReSE) embeddings at training time as a way to regularize NMT models and improve their generalization. During training, our models are trained to jointly predict categorical (words in the vocabulary) and continuous (word and sentence embeddings) outputs. An extensive set of experiments over four language pairs of variable training set size has shown that ReWE and ReSE can outperform strong state-of-the-art baseline models, with an improvement that is larger for smaller training sets (e.g., up to +5.15 BLEU points in Basque-English translation). Visualizations of the decoder's output space show that the proposed regularizers improve the clustering of unique words, facilitating correct predictions. In a final experiment on unsupervised NMT, we show that ReWE and ReSE are also able to improve the quality of machine translation when no parallel data are available.



I Introduction

Machine translation (MT) is a field of natural language processing (NLP) focusing on the automatic translation of sentences from a source language to a target language. In recent years, the field has been progressing quickly mainly thanks to the advances in deep learning and the advent of neural machine translation (NMT). The first NMT model was presented in 2014 by Sutskever et al. [43] and consisted of a plain encoder-decoder architecture based on recurrent neural networks (RNNs). In the following years, a series of improvements has led to major performance increases, including the attention mechanism (a word-alignment model between words in the source and target sentences) [4, 30] and the transformer (a non-recurrent neural network that offers an alternative to RNNs and makes NMT highly parallelizable) [44]. As a result, NMT models have rapidly outperformed traditional approaches such as phrase-based statistical machine translation (PBSMT) [24] in challenging translation contexts (e.g., the WMT conference series). Nowadays, the majority of commercial MT systems utilise NMT in some form.

However, NMT systems are not exempt from limitations. The main one is their tendency to overfit the training set due to their large number of parameters. This issue is common to many other tasks that use deep learning models, and it is caused to a large extent by the way these models are trained: maximum likelihood estimation (MLE). As pointed out by Elbayad et al. [15], in the case of machine translation, MLE has two clear shortcomings that contribute to overfitting:

  1. Single ground-truth reference: Usually, NMT models are trained with translation examples that have a single reference translation in the target language. MLE tries to give all the probability to the words of the ground-truth reference and zero to all others. Nevertheless, a translation that uses different words from the reference (e.g., paraphrases, synonyms) can be equally correct. Standard MLE training is not able to leverage this type of information since it treats every word other than the ground truth as completely incorrect.

  2. Exposure bias [5]: NMT models are trained with “teacher forcing”, which means that the previous word from the reference sentence is given as input to the decoder for the prediction of the next. This is done to speed up training convergence and avoid prediction drift. However, at test time, since the reference is not available, the model has to rely on its own predictions and the performance can be drastically lower.
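The first shortcoming can be illustrated with a toy example (the 3-dimensional embeddings below are made up for illustration): under MLE, moving probability mass onto a synonym of the reference word is penalized exactly as heavily as moving it onto an unrelated word, whereas a distance in embedding space tells the two apart.

```python
import math

def nll(prob_of_reference):
    # Token-level negative log-likelihood: only the probability
    # assigned to the single reference word matters.
    return -math.log(prob_of_reference)

def cosine_distance(u, v):
    # 1 - cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

# Hypothetical 3-d embeddings: "couch" is close to the reference
# "sofa", while "banana" is unrelated.
emb = {
    "sofa":   [0.9, 0.1, 0.0],
    "couch":  [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.9],
}

# If the model leaves only a little probability on the reference "sofa",
# the NLL is the same whether the rest went to "couch" or to "banana" ...
p_on_reference = 0.01
loss = nll(p_on_reference)          # identical in both cases

# ... but a distance in embedding space distinguishes the two outcomes:
d_synonym = cosine_distance(emb["couch"], emb["sofa"])    # small
d_unrelated = cosine_distance(emb["banana"], emb["sofa"]) # large
```

This asymmetry between the categorical loss and the embedding-space distance is precisely the extra signal that ReWE exploits.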

Both these limitations can be mitigated with sufficient training data. In theory, MLE could achieve optimal performance with infinite training data, but in practice this is impossible as the available resources are always limited. In particular, when the training data are scarce, such as in low-resource language pairs or specific translation domains, NMT models display modest performance, and other traditional approaches (e.g., PBSMT) [25] often obtain better accuracies. As such, the generalization of NMT systems still calls for significant improvement.

In our recent work [20], we have proposed a novel regularization technique that is based on co-predicting words and their embeddings (“regressing word embeddings”, or ReWE for short). ReWE is a module added to the decoder of a sequence-to-sequence model so that, during training, the model is trained to jointly predict the next word in the translation (categorical value) and its pre-trained word embedding (continuous value). This approach can leverage the contextual information embedded in pre-trained word vectors to achieve more accurate translations at test time. ReWE has been shown to be very effective over low/medium-size training sets [20]. In this paper, we extend this idea to its natural counterpart: sentence embeddings. We propose regressing sentence embeddings (ReSE) as an additional regularization method to further improve the accuracy of the translations. ReSE uses a self-attention mechanism to infer a fixed-dimensional sentence vector for the target sentence. During training, the model is trained to regress this inferred vector towards the pre-trained sentence embedding of the ground-truth sentence. The main contributions of this paper are:

  • The proposal of a new regularization technique for NMT based on sentence embeddings (ReSE).

  • Extensive experimentation over four language pairs of different dataset sizes (from small to large) with both word and sentence regularization. We show that using both ReWE and ReSE can outperform strong state-of-the-art baselines based on long short-term memory networks (LSTMs) and transformers.

  • Insights on how ReWE and ReSE help to improve NMT models. Our analysis shows that these regularizers improve the organization of the decoder’s output vector space, likely facilitating correct word classification.

  • Further experimentation of the regularizer on unsupervised machine translation, showing that it can improve the quality of the translations even in the absence of parallel training data.

The rest of this paper is organized as follows. Section II presents and discusses the related work. Section III describes the model used as baseline while Section IV presents the proposed regularization techniques, ReWE and ReSE. Section V describes the experiments and analyzes the experimental results. Finally, Section VI concludes the paper.

II Related Work

The related work is organized over the three main research subareas that have motivated this work: regularization techniques, word and sentence embeddings and unsupervised NMT.

II-A Regularization Techniques

In recent years, the research community has dedicated much attention to the problem of overfitting in deep neural models. Several regularization approaches have been proposed in turn such as dropout [42, 17], data augmentation [16] and multi-task learning [18, 9]. Their common aim is to encourage the model to learn parameters that allow for better generalization.

In NMT, too, mitigating overfitting has been the focus of much research. As mentioned above, the two main acknowledged problems are the single ground-truth reference and the exposure bias. For the former, Fadaee et al. [16] have proposed augmenting the training data with synthetically-generated sentence pairs containing rare words. The intuition is that the model will be able to see the vocabulary's words in more varied contexts during training. Kudo [26] has proposed using variable word segmentations to improve the model's robustness, achieving notable improvements in low-resource languages and out-of-domain settings. Another line of work has focused on “smoothing” the output probability distribution over the target vocabulary [15, 8]. These approaches use token-level and sentence-level reward functions that push the model to distribute the output probability mass over words other than the ground-truth reference. Similarly, Ma et al. [31] have added a bag-of-words term to the training objective, assuming that the set of correct translations share similar bag-of-words vectors.

There has also been extensive work on addressing the exposure bias problem. An approach that has proved effective is the incorporation of predictions in the training, via either imitation learning [11, 39, 29] or reinforcement learning [38, 3]. Another approach, computationally more efficient, leverages scheduled sampling to obtain a stochastic mixture of words from the reference and the predictions [5]. In turn, Wu et al. [46] have proposed a soft alignment algorithm to alleviate the mismatches between the reference translations and the predictions obtained with scheduled sampling; and Zhang et al. [48] have introduced two regularization terms based on the Kullback-Leibler (KL) divergence to improve the agreement between sentences predicted from left-to-right and right-to-left.

II-B Word and Sentence Embeddings

Fig. 1: Baseline NMT model. (Left) The encoder receives the input sentence and generates a context vector for each decoding step using an attention mechanism. (Right) The decoder generates the output vectors one by one; each represents the probability distribution over the target vocabulary. During training, the decoder input is a token from the ground-truth sentence, but during inference the model uses its own predictions.

Word vectors or word embeddings [33, 35, 6] are ubiquitous in NLP since they provide effective input features for deep learning models. Recently, contextual word vectors such as ELMo [36], BERT [13] and the OpenAI transformer [37] have led to remarkable performance improvements in several language understanding tasks. Additionally, researchers have focused on developing embeddings for entire sentences and documents as they may facilitate several textual classification tasks [22, 10, 7, 2].

In NMT models, word embeddings play an important role as input of both the encoder and the decoder. A recent paper has shown that contextual word embeddings provide effective input features for both stages [14]. However, very little research has been devoted to using word embeddings as targets. Kumar and Tsvetkov [27] have removed the typical output softmax layer, forcing the decoder to generate continuous outputs. At inference time, they use a nearest-neighbour search in the word embedding space to select the word to predict. Their model allows for significantly faster training while performing on par with state-of-the-art models. Our approach differs from [27] in that our decoder generates continuous outputs in parallel with the standard softmax layer, and only during training to provide regularization. At inference time, the continuous output is ignored and prediction operates as in a standard NMT model. To the best of our knowledge, our model is the first to use embeddings as targets for regularization, and at both word and sentence level.

II-C Unsupervised NMT

The amount of available parallel, human-annotated corpora for training NMT systems is at times very scarce. This is the case for many low-resource languages and specialized translation domains (e.g., health care). Consequently, there has been a growing interest in developing unsupervised NMT models [28, 1, 47] which do not require annotated data for training. Such models learn to translate by only using monolingual corpora, and even though their accuracy is still well below that of their supervised counterparts, they have started to reach interesting levels. The architecture of unsupervised NMT systems differs from that of supervised systems in that it combines translation in both directions (source-to-target and target-to-source). Typically, a single encoder is used to encode sentences from both languages, and a separate decoder generates the translations in each language. The training of such systems follows three stages: 1) building a bilingual dictionary and word embedding space, 2) training two monolingual language models as denoising autoencoders [45], and 3) converting the unsupervised problem into a weakly-supervised one by use of back-translations [40]. For more details on unsupervised NMT systems, we refer the reader to the original papers [28, 1, 47].

In this paper, we explore using the proposed regularization approach also for unsupervised NMT. Unsupervised NMT models still require very large amounts of monolingual data for training, and often such amounts are not available. Therefore, these models, too, are expected to benefit from improved regularization.

Fig. 2: Full model: Baseline + ReWE + ReSE. (Left) The encoder with the attention mechanism generates the context vectors in the same way as the baseline system. (Right) The decoder generates the output vectors one by one, which represent the probability distribution over the target vocabulary, together with a continuous word vector. Additionally, the model can also generate another continuous vector, r, which represents the sentence embedding.

III The Baseline NMT Model

In this section, we describe the NMT model that has been used as the basis for the proposed regularizer. It is a neural encoder-decoder architecture with attention [4] that can be regarded as a strong baseline as it incorporates both LSTMs and transformers as modules. Let us assume that x = {x_1, ..., x_n} is the source sentence with n tokens and y = {y_1, ..., y_m} is the target translated sentence with m tokens. First, the words in the source sentence are encoded into their word embeddings, x^e = {x^e_1, ..., x^e_n}, by an embedding layer:

x^e_i = Emb_enc(x_i),  i = 1, ..., n    (1)

and then the source sentence is encoded by a sequential module into its hidden vectors, h = {h_1, ..., h_n}:

h = Enc(x^e)    (2)

Next, for each decoding step j = 1, ..., m, an attention network provides a context vector c_j as a weighted average of all the encoded vectors, h, conditional on the decoder output at the previous step, s_{j-1} (Eq. 3). For this network, we have used the attention mechanism of Bahdanau et al. [4].

c_j = Attn(h, s_{j-1})    (3)

Given the context vector, c_j, the decoder output at the previous step, s_{j-1}, and the word embedding of the previous word in the target sentence, y^e_{j-1} (Eq. 4), the decoder generates vector s_j (Eq. 5). This vector is later transformed into a larger vector of the same size as the target vocabulary via learned parameters W, b and a softmax layer (Eq. 6). The resulting vector, p_j, is the inferred probability distribution over the target vocabulary at decoding step j. Fig. 1 depicts the full architecture of the baseline model.

y^e_{j-1} = Emb_dec(y_{j-1})    (4)

s_j = Dec(s_{j-1}, c_j, y^e_{j-1})    (5)

p_j = softmax(W s_j + b)    (6)

The model is trained by minimizing the negative log-likelihood (NLL) which can be expressed as:

L_NLL = - Σ_{j=1}^{m} log p_j(y_j)    (7)

where p_j(y_j) denotes the probability assigned to the ground-truth word, y_j. Minimizing the NLL is equivalent to MLE and results in assigning maximum probability to the words in the reference translation, y. The training objective is minimized with standard backpropagation over the training data, and at inference time the model uses beam search for decoding.
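As an illustration of Eqs. 3 and 6, the attention and output layers can be sketched numerically with random placeholder weights; for brevity, the sketch feeds the context vector straight into the output projection instead of going through a full decoder step (Eq. 5):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over a vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(h, s_prev, Wa):
    # Eq. 3: score each encoder state against the previous decoder
    # state, normalize, and return the weighted average (the context).
    scores = h @ Wa @ s_prev          # one score per source position
    alpha = softmax(scores)           # attention weights
    return alpha @ h                  # context vector

n, d, vocab = 5, 8, 50                # toy sizes
h = rng.normal(size=(n, d))           # encoder hidden vectors
s_prev = rng.normal(size=d)           # previous decoder output
Wa = rng.normal(size=(d, d))          # placeholder attention parameters

c = attention_context(h, s_prev, Wa)

# Eq. 6: project to vocabulary size and normalize with a softmax.
W, b = rng.normal(size=(vocab, d)), np.zeros(vocab)
p = softmax(W @ c + b)                # distribution over the target vocabulary

# Token-level negative log-likelihood of a reference word index.
nll = -np.log(p[7])
```

In the real model, h and the decoder states come from learned LSTM or transformer layers rather than random draws; only the shapes and the flow of information are faithful here.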

IV Regressing Word and Sentence Embeddings

As mentioned in the introduction, MLE suffers from some limitations when training a neural machine translation system. To alleviate these shortcomings, in our recent paper [20] we have proposed a new regularization method based on regressing word embeddings. In this paper, we extend this idea to sentence embeddings.

IV-A ReWE

Pre-trained word embeddings are trained on large monolingual corpora by measuring the co-occurrences of words in text windows (“contexts”). Words that occur in similar contexts are assumed to have similar meaning and, hence, similar vectors in the embedding space. Our goal with ReWE is to incorporate the information embedded in the word vectors into the loss function to encourage model regularization.

In order to generate continuous vector representations as outputs, we have added a ReWE block to the NMT baseline (Fig. 2). At each decoding step, the ReWE block receives the hidden vector from the decoder, s_j, as input and outputs another vector, e_j, of the same size as the pre-trained word embeddings:

e_j = W_2 ReLU(W_1 s_j + b_1) + b_2    (8)

where W_1, b_1, W_2 and b_2 are the learnable parameters of a two-layer feed-forward network with a Rectified Linear Unit (ReLU) as activation function between the layers. Vector e_j aims to reproduce the word embedding of the target word, and thus the distributional properties (or co-occurrences) of its contexts.

During training, the model is guided to regress the predicted vector, e_j, towards the word embedding of the ground-truth word, y^e_j. This is achieved by using a loss function that computes the distance between e_j and y^e_j (Eq. 9). Previous work [20] has shown that the cosine distance is empirically an effective distance between word embeddings and it has thus been adopted as loss. This loss and the original NLL loss are combined together with a tunable hyper-parameter, λ (Eq. 10). Therefore, the model is trained to jointly predict both a categorical and a continuous representation of the words. Even though the system is performing a single task, this setting could also be interpreted as a form of multi-task learning with different representations of the same targets.

L_ReWE = Σ_{j=1}^{m} (1 − cos(e_j, y^e_j))    (9)

L = L_NLL + λ L_ReWE    (10)
The word vectors of both the source and target vocabularies are initialized with pre-trained embeddings, but updated during training. At inference time, we ignore the outputs of the ReWE block and we perform translation using only the categorical prediction.
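The combined objective can be sketched as follows, with the cosine-distance term of Eq. 9 added to the NLL via the λ of Eq. 10 (shapes, weights and the λ value are illustrative placeholders, not the tuned settings):

```python
import numpy as np

def cosine_loss(e_pred, e_gold, eps=1e-8):
    # Eq. 9: cosine distance between each regressed vector and the
    # pre-trained embedding of the corresponding ground-truth word.
    num = (e_pred * e_gold).sum(axis=-1)
    den = np.linalg.norm(e_pred, axis=-1) * np.linalg.norm(e_gold, axis=-1)
    return 1.0 - num / (den + eps)

def rewe_block(s, W1, b1, W2, b2):
    # Two-layer feed-forward network with a ReLU in between:
    # maps each decoder state to a predicted word embedding.
    hidden = np.maximum(0.0, s @ W1 + b1)
    return hidden @ W2 + b2

def joint_loss(log_p_ref, e_pred, e_gold, lam=20.0):
    # Eq. 10: L = L_NLL + lambda * L_ReWE, summed over decoding steps.
    # (lam = 20.0 is an illustrative value; the paper tunes it on a
    # validation set.)
    nll = -log_p_ref.sum()
    rewe = cosine_loss(e_pred, e_gold).sum()
    return nll + lam * rewe

rng = np.random.default_rng(1)
m, d_dec, d_emb = 6, 16, 8                 # toy sentence length and dims
s = rng.normal(size=(m, d_dec))            # decoder states s_1..s_m
W1, b1 = rng.normal(size=(d_dec, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, d_emb)), np.zeros(d_emb)
e_pred = rewe_block(s, W1, b1, W2, b2)
e_gold = rng.normal(size=(m, d_emb))       # pre-trained embeddings of y_1..y_m
log_p_ref = np.log(rng.uniform(0.1, 0.9, size=m))
loss = joint_loss(log_p_ref, e_pred, e_gold)
```

Since both terms are differentiable, the combined loss trains end-to-end with standard backpropagation, exactly as the NLL alone would.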


IV-B ReSE

Sentence vectors, too, have been extensively used as input representations in many NLP tasks such as text classification, paraphrase detection, natural language inference and question answering. The intuition behind them is very similar to that of word embeddings: sentences with similar meanings are expected to be close to each other in vector space. Many off-the-shelf sentence embedders are currently available and they can be easily integrated in deep learning models. Based on similar assumptions to the case of word embeddings, we have hypothesized that an NMT model could also benefit from a regularization term based on regressing sentence embeddings (the ReSE block in Fig. 2).

The main difference of ReSE compared to ReWE is that there has to be a single regressed vector per sentence rather than one per word. Thus, ReSE first uses a self-attention mechanism to learn a weighted average of the decoder's hidden vectors, s = {s_1, ..., s_m}:

r' = Σ_{j=1}^{m} α_j s_j    (11)

where the attention weights α_j are obtained from Eqs. 12 and 13, and W_5 and b_5 are learnable parameters. Then, a two-layered neural network similar to ReWE's predicts the sentence vector, r (Eq. 14). Parameters W_6, b_6, W_7 and b_7 are also learned during training.

g_j = tanh(W_5 s_j + b_5)    (12)

α = softmax(g)    (13)

r = W_7 ReLU(W_6 r' + b_6) + b_7    (14)

Similarly to ReWE, a loss function computes the cosine distance between the predicted sentence vector, r, and the sentence vector inferred with the off-the-shelf sentence embedder, r^e (Eq. 15). This loss is added to the previous objective as an extra term with an additional, tunable hyper-parameter, β:

L_ReSE = 1 − cos(r, r^e)    (15)

L = L_NLL + λ L_ReWE + β L_ReSE    (16)

Since the number of sentences is significantly lower than that of the words, β typically needs to be higher than λ. Nevertheless, we tune it blindly using the validation set. The reference sentence embedding, r^e, can be inferred with any off-the-shelf pre-trained embedder. At inference time, the model solely relies on the categorical prediction and ignores the predicted word and sentence vectors.
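The ReSE block (self-attention pooling followed by a two-layer network, Eqs. 11-14) and its cosine loss (Eq. 15) can be sketched with illustrative shapes and random placeholder weights:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rese_block(s, w5, b5, W6, b6, W7, b7):
    # Self-attention pooling (Eqs. 11-13): score each decoder state,
    # normalize the scores into weights, and take the weighted average.
    scores = np.tanh(s @ w5 + b5)         # one scalar per decoding step
    alpha = softmax(scores)               # attention weights over steps
    pooled = alpha @ s                    # weighted average of s_1..s_m
    # Two-layer network (Eq. 14) predicting the sentence vector r.
    hidden = np.maximum(0.0, pooled @ W6 + b6)
    return hidden @ W7 + b7

def rese_loss(r_pred, r_gold, eps=1e-8):
    # Eq. 15: cosine distance to the off-the-shelf sentence embedding.
    den = np.linalg.norm(r_pred) * np.linalg.norm(r_gold)
    return 1.0 - float(r_pred @ r_gold) / (den + eps)

rng = np.random.default_rng(2)
m, d_dec, d_sent = 6, 16, 4               # toy sizes
s = rng.normal(size=(m, d_dec))           # decoder states s_1..s_m
w5, b5 = rng.normal(size=d_dec), 0.0      # scoring parameters
W6, b6 = rng.normal(size=(d_dec, 32)), np.zeros(32)
W7, b7 = rng.normal(size=(32, d_sent)), np.zeros(d_sent)

r_pred = rese_block(s, w5, b5, W6, b6, W7, b7)
r_gold = rng.normal(size=d_sent)          # e.g., an embedding of the reference
loss = rese_loss(r_pred, r_gold)
```

In training, this single per-sentence term would be weighted by β and added to the per-word NLL and ReWE terms.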

V Experiments

We have carried out a wide range of experiments to probe the performance of the proposed regularization approaches. This section describes the datasets, the models and the hyper-parameters used, and presents and discusses all results.

V-A Datasets

Four different language pairs have been selected for the experiments. The datasets’ size varies from tens of thousands to millions of sentences to test the regularizers’ ability to improve translation over a range of low-resource and high-resource language pairs.

  • De-En: The German-English dataset (de-en) has been taken from the WMT18 news translation shared task. The training set contains over 5M sentence pairs collected from the Europarl, CommonCrawl and News Commentary parallel corpora. As validation and test sets, we have used the newstest2017 and the newstest2018 datasets, respectively. We consider this dataset as a high-resource case.

  • En-Fr: The English-French dataset (en-fr) has been sourced from the IWSLT 2016 translation shared task. This corpus contains translations of TED talks on very diverse topics. The size of the training data provided by the organizers allows us to categorize this dataset as low/medium-resource. Following Denkowski and Neubig [12], the validation set has been formed by merging the 2013 and 2014 test sets from the same shared task, and the test set has been formed with the 2015 and 2016 test sets.

  • Cs-En: The Czech-English dataset (cs-en) is also from the IWSLT 2016 TED talks translation task. However, this dataset is approximately half the size of en-fr. Again following Denkowski and Neubig [12], the validation set has been formed by merging the 2012 and 2013 test sets, and the test set by merging the 2015 and 2016 test sets. We regard this dataset as low-resource.

  • Eu-En: The Basque-English dataset (eu-en) has been collected from the WMT16 IT-domain translation shared task. This is the smallest dataset; moreover, only part of the sentence pairs in the training set have been translated by human annotators, while the remaining pairs are translations of IT-domain short phrases and Wikipedia titles. Therefore, we consider this dataset as extremely low-resource. It must be said that translations in the IT domain are somewhat easier than in the news domain, as this domain is very specific and the wording of the sentences is less varied. For this dataset, we have used the validation and test sets provided in the shared task.

All the datasets have been pre-processed with the Moses tokenizer. Additionally, words have been split into subword units using byte pair encoding (BPE) [41]. For the BPE merge operations parameter, we have used the default value for all the datasets, except for eu-en where we have set it to a lower value since this dataset is much smaller. Experiments have been performed at both word and subword level since morphologically-rich languages such as German, Czech and Basque can benefit greatly from operating the NMT model at subword level.
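BPE learns its merge operations by repeatedly merging the most frequent pair of adjacent symbols in the vocabulary; a minimal sketch of the learning procedure on a toy vocabulary (after Sennrich et al. [41]):

```python
import re
from collections import Counter

def get_stats(vocab):
    # Count the frequency of each pair of adjacent symbols.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Rewrite the vocabulary with the chosen pair merged into one symbol.
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated sequence of characters
# plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):            # 10 merge operations (tens of thousands in practice)
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    merges.append(best)
```

After a few merges, frequent words such as "newest" become single symbols while rare words stay split into smaller units, which is what lets morphologically-rich languages share subwords across word forms.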

V-B Model Training and Hyper-Parameter Selection

To implement ReWE and ReSE, we have modified the popular OpenNMT open-source toolkit [23] (our code is publicly available on GitHub, and we will also release it on CodeOcean). Two variants of the standard OpenNMT model have been used as baselines: the LSTM and the transformer, described hereafter.

Models Word/BPE BLEU
LSTM word 34.21
LSTM + ReWE() word 35.43
Transformer word 34.56
Transformer + ReWE() word 35.3
LSTM bpe 34.06
LSTM + ReWE() bpe 35.09
Transformer bpe 35.31
Transformer + ReWE() bpe 36.3
TABLE I: BLEU scores over the En-Fr test set. The reported results are the average of 5 independent runs.
Models Word/BPE BLEU
LSTM word 20.48
+ ReWE() word 21.81
+ ReWE() + ReSE() word 21.98
Transformer word 20.56
+ ReWE() word 21.16
+ ReWE() + ReSE() word 20.05
LSTM bpe 22.56
+ ReWE() bpe 23.72
+ ReWE() + ReSE() bpe 23.56
Transformer bpe 21.02
+ ReWE() bpe 22.19
+ ReWE() + ReSE() bpe 20.53
TABLE II: BLEU scores over the Cs-En test set. The reported results are the average of 5 independent runs.
Models Word/BPE BLEU
LSTM word 10.87
+ ReWE() word 13.83
+ ReWE() + ReSE() word 16.02
Transformer word 12.15
+ ReWE() word 13.53
+ ReWE() + ReSE() word 6.92
LSTM bpe 17.14
+ ReWE() bpe 19.54
+ ReWE() + ReSE() bpe 20.29
Transformer bpe 12.70
+ ReWE() bpe 13.21
+ ReWE() + ReSE() bpe 9.63
TABLE III: BLEU scores over the Eu-En test set. The reported results are the average of 5 independent runs.
Models Word/BPE BLEU
LSTM word 29.75
+ ReWE() word 30.17
+ ReWE() + ReSE() word 30.23
LSTM bpe 34.03
+ ReWE() bpe 33.66
+ ReWE() + ReSE() bpe 33.91
TABLE IV: BLEU scores over the De-En test set. The reported results are the average of 5 independent runs.
  • LSTM: A strong NMT baseline was prepared by following the indications given by Denkowski and Neubig [12]. The model uses a bidirectional LSTM [19] for the encoder and a unidirectional LSTM for the decoder, with two layers each. The size of the word embeddings was set to 300d and that of the sentence embeddings to 512d. The sizes of the hidden vectors of both LSTMs and of the attention network were set to 1024d. Dropout was applied to the LSTMs, and the training batch size was set to 40 sentences. As optimizer, we have used Adam [21]. During training, the learning rate was halved with simulated annealing upon convergence of the perplexity over the validation set, which was evaluated at regular intervals of training sentences. Training was stopped after a fixed number of halvings of the learning rate.

  • Transformer: The transformer network [44] has become the de-facto neural network for the encoder and decoder of NMT pipelines thanks to its strong empirical accuracy and highly-parallelizable training. For this reason, we have used it as another baseline for our model. For its hyper-parameters, we have used the default values set by the developers of OpenNMT. Both the encoder and the decoder are formed by a 6-layer network. The sizes of the word embeddings, the hidden vectors and the attention network have all been set to either 300d or 512d, depending on the best results over the validation set. The head count has been set correspondingly to either 6 or 8, and the dropout rate to the same value as for the LSTM. The model was also optimized using Adam, but with a much higher learning rate (OpenAI default). For this model, we have not used simulated annealing since some preliminary experiments showed that it penalized performance. The batch size (in words) was again selected based on the accuracy over the validation set. Training was stopped upon convergence in perplexity over the validation set, which was evaluated at every epoch.

In addition, the word embeddings for both models were initialized with pre-trained fastText embeddings [6]. For the 300d word embeddings, we have used the word embeddings available on the official fastText website. For the 512d embeddings and the subword units, we have trained our own pre-trained vectors using the fastText embedder over large monolingual corpora from Wikipedia and the training data. Both models have used the same sentence embeddings, which have been computed with the Universal Sentence Encoder (USE). However, the USE is only available for English, so we have only been able to use ReSE with the datasets where English is the target language (i.e., de-en, cs-en and eu-en). When using BPE, the subwords of every sentence have been merged back into words before passing them to the USE. The BLEU score for the BPE models has also been computed after post-processing the subwords back into words. Finally, the regularization hyper-parameters have been tuned only once for all datasets by using the en-fr validation set. This was done in order to save the significant computational time that would have been required by further hyper-parameter exploration. However, in the de-en case the initial results were far from the state of the art, and we therefore repeated the selection with its own validation set. For all experiments, we have used an Intel Xeon E5-2680 v4 with an NVIDIA Quadro P5000 GPU card. On this machine, the training time of the transformer has been approximately an order of magnitude larger than that of the LSTM.
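The learning-rate schedule used for the LSTM above (halve the rate whenever the validation perplexity stops improving, stop after a fixed number of halvings) can be sketched with a hypothetical helper; the option names in the actual OpenNMT toolkit differ:

```python
def anneal_schedule(dev_ppls, lr=1e-3, max_halvings=5):
    # Halve the learning rate whenever the dev-set perplexity fails to
    # improve on its best value so far; stop after max_halvings halvings.
    # dev_ppls is the stream of perplexities evaluated at fixed intervals.
    best = float("inf")
    halvings = 0
    trajectory = []
    for ppl in dev_ppls:
        if ppl >= best:
            lr /= 2.0
            halvings += 1
            if halvings > max_halvings:
                break              # training would stop here
        else:
            best = ppl
        trajectory.append(lr)
    return trajectory

# Example: perplexity plateaus twice, so the rate is halved twice.
schedule = anneal_schedule([10.0, 8.0, 8.5, 7.0, 7.2], lr=1.0)
# schedule == [1.0, 1.0, 0.5, 0.5, 0.25]
```

The transformer baseline skips this schedule entirely, as noted above, since annealing penalized its performance in preliminary runs.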

Example 1:
Src: Sakatu Fitxategia fitxa Oihal atzeko ikuspegia atzitzeko ; sakatu Berria . Hautatu txantiloia eta sakatu Sortu hautatutako txantiloia erabiltzeko .
Ref: Click the File tab to access Backstage view , select New . Select a template and click Create to use the selected template .
Baseline: Click the default tab of the tab that you want to open the tab tab . Select the template and select the selected template .
Baseline + ReWE: Press the File tab to access the view view ; click New . Select the template and click Add to create the selected template .
Baseline + ReWE + ReSE: Press the File tab to access the chart view ; press New . Select the template and click Create to use the selected template .
TABLE V: Translation examples. Example 1: Eu-En and Example 2: Cs-En.
Example 2:
Src: Na tomto projektu bylo skvělé , že žáci viděli lokální problém a bum – okamžitě se s ním snaží vyrovnat .
Ref: What was really cool about this project was that the students saw a local problem , and boom – they are trying to immediately address it .
Baseline: In this project , it was great that the kids had seen local problems and boom – immediately he’s trying to deal with him .
Baseline + ReWE: In this project , it was great that the kids saw a local issue , and boom – they immediately try to deal with it .
Baseline + ReWE + ReSE: What was great about this project was that the students saw a local problem, and boom , they’re trying to deal with him .

V-C Results

We have carried out a number of experiments with both baselines. The scores reported are an average of the BLEU scores (in percentage points, or pp) [34] over the test sets of 5 independently trained models. Table I shows the results over the en-fr dataset. In this case, the models with ReWE have outperformed the LSTM and transformer baselines consistently. The LSTM did not benefit from using BPE, but the transformer+ReWE with BPE reached 36.30 BLEU pp (a 0.99 pp improvement over the best model without ReWE). For this dataset we did not use ReSE because French was the target language.

Table II reports the results over the cs-en dataset. Also in this case, all the models with ReWE have improved over the corresponding baselines. The LSTM+ReWE has achieved the best results (23.72 BLEU pp; an improvement of 1.16 pp over the best model without ReWE). This language pair has also benefited more from the BPE pre-processing, likely because Czech is a morphologically-rich language. For this dataset, it was possible to use ReSE in combination with ReWE, with an improvement for the LSTM at word level (21.98 BLEU pp), but not for the remaining cases. We had also initially tried to use ReSE without ReWE (i.e., with the ReWE term disabled), but the results were not encouraging and we did not continue with this line of experiments.

Fig. 3: BLEU scores over the De-En test set for models trained with training sets of different size.
(a) Baseline
(b) Baseline + ReWE
Fig. 4: Visualization of the vectors from the decoder for a subset of the cs-en test set. Please refer to Section V-D for explanations. This figure should be viewed in color.

For the eu-en dataset (Table III), the results show that, again, ReWE outperforms the baselines by a large margin. Moreover, ReWE+ReSE has been able to improve the results even further, both with BPE and at word level, over the corresponding baselines. Basque, too, is a morphologically-rich language, and using BPE has proved very beneficial compared to the best word-level model. As noted before, the eu-en dataset is very low-resource and it is more likely that the baseline models generalize poorly. Consequently, regularizers such as ReWE and ReSE are more helpful, with larger margins of improvement with respect to the baselines. On a separate note, the transformer has unexpectedly performed well below the LSTM on this dataset, and especially so with BPE. We speculate that it may be more sensitive than the LSTM to the dataset’s much smaller size, or in need of more refined hyper-parameter tuning.

Finally, Table IV shows the results over the de-en dataset, which we categorize as high-resource (5M+ sentence pairs). For this dataset, we have only been able to perform experiments with the LSTM due to the exceedingly long training times of the transformer. At word level, both ReWE and ReWE+ReSE have been able to outperform the baseline, although the margins of improvement have been smaller than for the other language pairs. However, when using BPE both ReWE and ReWE+ReSE have performed slightly below the baseline. This suggests that when the training data are abundant, ReWE and ReSE may not be beneficial. To probe this further, we have repeated these experiments by training the models over subsets of the training set of increasing size (200K, 500K, 1M, and 2M sentence pairs). Fig. 3 shows the BLEU scores achieved by the baseline and the regularized models for the different training data sizes. The plot clearly shows that the performance margin increases as the training data size decreases, as expected from a regularized model.

Table V shows two examples of the translations made by the different LSTM models for eu-en and cs-en. A qualitative analysis of these examples shows that both ReWE and ReWE+ReSE have improved the quality of these translations. In the eu-en example, ReWE has correctly translated “File tab”, and ReSE has correctly added “click Create”. In the cs-en example, the model with ReWE has picked the correct subject “they”, and only the model with ReWE and ReSE has correctly translated “students” and captured the opening phrase “What was…about this…”.

V-D Understanding ReWE and ReSE

The quantitative experiments have proven that ReWE and ReSE can act as effective regularizers for low- and medium-resource NMT. Yet, it would be very interesting to understand how they influence training to achieve the improved models. For that purpose, we have conducted an exploration of the values of the hidden vectors on the decoder end (Eq. 5). These values form the “feature space” used by the final classification block (a linear transformation and a softmax) to generate the class probabilities, and can provide insights on the model. For this reason, we have considered the cs-en test set and stored all the hidden vectors with their respective word predictions. Then, we have used t-SNE [32] to reduce the dimensionality of the vectors to a visualizable 2d. Finally, we have chosen a particular word (architecture) as the center of the visualization, and plotted all the vectors within a chosen neighborhood of this center word (Fig. 4). To avoid cluttering the figure, we have not superimposed the predicted words on the vectors, but only used a different color for each distinct word. The center word in the two subfigures (a: baseline; b: baseline+ReWE) is the same (architecture) and from the same source sentence, so the visualized regions are comparable. The visualizations also display all other predicted instances of the word architecture in the neighborhood.

These visualizations show two interesting behaviors: 1) from eye judgment, the points predicted by the ReWE model seem more uniformly spread out; 2) instances of the same word have vectors that are close to each other. For instance, several instances of the word architecture are close to each other in Fig. 4(b), while only a single instance appears in Fig. 4(a). The overall observation is that the ReWE regularizer leads to a vector space that is easier to discriminate, i.e., to find class boundaries for, facilitating the final word prediction. In order to confirm this observation, we have computed various clustering indexes over the clusters formed by the vectors with an identical predicted word. As indexes, we have used the silhouette and Davies-Bouldin indexes, two well-known unsupervised metrics for clustering. The silhouette index ranges from -1 to +1, where values closer to +1 mean that the clusters are compact and well separated. The Davies-Bouldin index is an unbounded non-negative value, with values closer to 0 meaning better clustering. Table VI shows the values of these clustering indexes over the entire cs-en test set for the LSTM models. As the table shows, the models with ReWE and ReWE+ReSE have reported the best values. This confirms that applying ReWE and ReSE has a positive impact on the decoder’s hidden space, ultimately justifying the increase in word classification accuracy.
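As a concrete reference for the clustering analysis, the silhouette index can be computed directly from its definition. The following self-contained numpy sketch is only meant to illustrate the metric (the analysis in the paper is over high-dimensional decoder vectors):

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient over all points.

    For each point: a = mean distance to the other points of its own
    cluster, b = smallest mean distance to the points of any other
    cluster; its score is (b - a) / max(a, b).  Values near +1 indicate
    compact, well-separated clusters; values near -1 indicate points
    lying closer to a foreign cluster than to their own.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        if np.sum(labels == li) < 2:
            continue  # silhouette is undefined for singleton clusters
        # Mean intra-cluster distance (excluding the point itself).
        a = np.mean([np.linalg.norm(p - points[j])
                     for j in range(len(points)) if labels[j] == li and j != i])
        # Smallest mean distance to any other cluster.
        b = min(np.mean(np.linalg.norm(points[labels == lo] - p, axis=1))
                for lo in set(labels.tolist()) if lo != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Two tight, well-separated clusters score close to +1, while interleaved clusters score negative, matching the interpretation used for Table VI.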

Model Silhouette Davies-Bouldin
LSTM -0.19 1.87
+ ReWE() -0.17 1.80
+ ReWE() + ReSE() -0.16 1.80
TABLE VI: Clustering indexes of the LSTM models over the cs-en test set. The reported results are the average of 5 independent runs.

For further exploration, we have created another visualization of the decoder’s hidden vectors and their predictions over a smaller neighborhood (Fig. 5). The same word (architecture) has been used as the center word of the plot. Then, we have “vibrated” each of the vectors by small increments (between 0.05 and 8 units) in each of their dimensions, creating several new synthetic instances of hidden vectors which are very close to the original ones. These synthetic vectors have then been decoded with the trained NMT model to obtain their predicted words. Finally, we have used t-SNE to reduce the dimensionality to 2d, and visualized all the vectors and their predictions in a small neighborhood around the center word. Fig. 5 shows that, with the ReWE model, all the hidden vectors surrounding the center word predict the same word (architecture). Conversely, with the baseline, the surrounding points predict different words (power, force, world). This is additional evidence that the hidden vector space is evened out by the use of the proposed regularizer.
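The “vibration” probe can be mimicked with a toy decoder. Below, the final classification block is reduced to a single linear map followed by an argmax (a hypothetical stand-in for the trained model’s output layer), and random perturbations of a hidden vector are decoded to check whether the whole neighborhood predicts a single word:

```python
import numpy as np

def decode(s, W):
    """Final classification block: linear transform + argmax.
    The softmax is omitted since it does not change the argmax."""
    return int(np.argmax(W @ s))

def probe_neighborhood(s, W, radius=0.05, n=100, seed=0):
    """Decode n synthetic vectors obtained by randomly perturbing s
    within +/- radius per dimension; return the set of predicted
    word ids observed in the neighborhood."""
    rng = np.random.default_rng(seed)
    return {decode(s + rng.uniform(-radius, radius, size=s.shape), W)
            for _ in range(n)}
```

A vector that lies well inside a class region yields a singleton set of predictions, while a vector near a decision boundary yields several distinct words, which is the qualitative difference Fig. 5 shows between the ReWE model and the baseline.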

(a) Baseline
(b) Baseline + ReWE
Fig. 5: Visualization of the vectors in a smaller neighborhood of the center word.

V-E Unsupervised NMT

(a) en-fr
(b) fr-en
Fig. 6: BLEU scores over the test set. The reported results are the average of 5 independent runs. The red line represents the baseline model and the blue line the baseline + ReWE.

Finally, we have also experimented with the use of ReWE and ReWE+ReSE in an unsupervised NMT task. For this experiment, we have used the open-source model provided by Lample et al. [28], which is currently the state of the art for unsupervised NMT, and also adopted its default hyper-parameters and pre-processing steps, which include 4-layer transformers for the encoder and both decoders, and BPE subword learning. The experiments have been performed using the WMT14 English-French test set for testing in both language directions (en-fr and fr-en), and the monolingual data from that year’s shared task for training.

As described in Section II-C, an unsupervised NMT model contains two decoders so as to be able to translate into both languages. The model is trained by iterating over two alternate steps: 1) training using the decoders as monolingual, de-noising language models (e.g., en-en, fr-fr), and 2) training using back-translations (e.g., en-fr-en, fr-en-fr). Each step requires an objective function, which is usually an NLL loss. Moreover, each step is performed in both directions (en→fr and fr→en), which means that an unsupervised NMT model uses a total of four different objective functions. Potentially, the regularizers could be applied to each of them. However, the pre-trained USE sentence embeddings are only available in English, not in French, and for this reason we have limited our experiments to ReWE alone. In addition, the initial results have shown that ReWE is actually detrimental in the de-noising language model step, so we have limited its use to both language directions in the back-translation step, with the hyper-parameter tuned over the validation set.
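One iteration of the unsupervised training can thus be summarized by the following loss combination. The function below is a hedged sketch: the four NLL terms and the placement of ReWE on the two back-translation losses follow the description above, but the single additive weight `lam` is an illustrative assumption.

```python
def unsupervised_step_losses(nll_denoise_en, nll_denoise_fr,
                             nll_bt_en, nll_bt_fr,
                             rewe_bt_en, rewe_bt_fr, lam=1.0):
    """Total objective of one unsupervised training iteration.

    The ReWE regression term is added only to the two back-translation
    losses, not to the de-noising ones (where it proved detrimental).
    All arguments are scalar loss values already computed elsewhere.
    """
    # Step 1: de-noising language modeling in each language (no ReWE).
    denoise = nll_denoise_en + nll_denoise_fr
    # Step 2: back-translation in each direction, regularized with ReWE.
    backtrans = (nll_bt_en + lam * rewe_bt_en) + (nll_bt_fr + lam * rewe_bt_fr)
    return denoise + backtrans
```

Setting `lam` to zero recovers the unregularized baseline, which makes it easy to reproduce both curves of Fig. 6 from the same training loop.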

To probe the effectiveness of the regularized model, Fig. 6 shows the results over the test set for the different models trained with increasing amounts of monolingual data (50K, 500K, 1M, 2M, 5M and 10M sentences in each language). The model trained with ReWE has been able to consistently outperform the baseline in both language directions. The trend we had observed in the supervised case has applied to these experiments, too: the performance margin has been larger for smaller training data sizes. For example, in the en-fr direction the margin has been largest with 50K training sentences and has narrowed when training with 10M sentences. Again, this behavior is in line with the regularizing nature of the proposed regressive objectives.

VI Conclusion

In this paper, we have proposed regressing continuous representations of words and sentences (ReWE and ReSE, respectively) as novel regularization techniques for improving the generalization of NMT models. Extensive experiments over four language pairs of different training data size (from 89K to 5M sentence pairs) have shown that both ReWE and ReWE+ReSE have improved the performance of NMT models, particularly in the low- and medium-resource cases. In addition, we have presented a detailed analysis showing how the proposed regularization modifies the decoder’s output space, enhancing the clustering of the vectors associated with unique words. Finally, we have shown that the regularized models have also outperformed the baselines in experiments on unsupervised NMT. As future work, we plan to explore how the categorical and continuous predictions from our model could be jointly utilized to further improve the quality of the translations.


The authors would like to thank the RoZetta Institute (formerly CMCRC) for providing financial support to this research.


  • [1] M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2017) Unsupervised neural machine translation. Proc. Int. Conf. Learn. Representations (ICLR). Cited by: §II-C.
  • [2] M. Artetxe and H. Schwenk (2018) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv:1812.10464 [cs]. Cited by: §II-B.
  • [3] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio (2017) An actor-critic algorithm for sequence prediction. Proc. Int. Conf. Learn. Representations (ICLR). Cited by: §II-A.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proc. Int. Conf. Learn. Representations (ICLR), Cited by: §I, §III, §III.
  • [5] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 1171–1179. Cited by: item 2, §II-A.
  • [6] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics (TACL) 5, pp. 135–146. Cited by: §II-B, §V-B.
  • [7] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. (2018) Universal sentence encoder. arXiv:1803.11175 [cs]. Cited by: §II-B.
  • [8] K. Chousa, K. Sudoh, and S. Nakamura (2018) Training neural machine translation using word embedding-based loss. arXiv:1807.11219 [cs]. Cited by: §II-A.
  • [9] K. Clark, M. Luong, C. D. Manning, and Q. V. Le (2018) Semi-supervised sequence modeling with cross-view training. Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP). Cited by: §II-A.
  • [10] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP). Cited by: §II-B.
  • [11] H. Daumé, J. Langford, and D. Marcu (2009) Search-based structured prediction. Mach. Learn. 75 (3), pp. 297–325. Cited by: §II-A.
  • [12] M. Denkowski and G. Neubig (2017) Stronger baselines for trustable results in neural machine translation. In Proc. First Workshop Neural Machine Translation, pp. 18–27. Cited by: §V-A, §V-A, §V-B.
  • [13] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. Proc. North American Chapter Assoc. Comput. Linguistics (NAACL). Cited by: §II-B.
  • [14] S. Edunov, A. Baevski, and M. Auli (2019) Pre-trained language model representations for language generation. Proc. North American Chapter Assoc. Comput. Linguistics (NAACL). Cited by: §II-B.
  • [15] M. Elbayad, L. Besacier, and J. Verbeek (2018) Token-level and sequence-level loss smoothing for RNN language models. In Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (ACL), pp. 2094–2103. Cited by: §I, §II-A.
  • [16] M. Fadaee, A. Bisazza, and C. Monz (2017) Data augmentation for low-resource neural machine translation. In Proc. 55th Annu. Meeting Assoc. Comput. Linguistics (ACL), pp. 567–573. Cited by: §II-A, §II-A.
  • [17] Y. Gal and Z. Ghahramani (2016) A theoretically grounded application of dropout in recurrent neural networks. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 1019–1027. Cited by: §II-A.
  • [18] J. Gu, Y. Wang, Y. Chen, K. Cho, and V. O. Li (2018) Meta-learning for low-resource neural machine translation. Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP). Cited by: §II-A.
  • [19] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. Cited by: §V-B.
  • [20] I. Jauregi Unanue, E. Zare Borzeshi, N. Esmaili, and M. Piccardi (2019) ReWE: regressing word embeddings for regularization of neural machine translation systems. In Proc. North American Chapter Assoc. Comput. Linguistics (NAACL), Cited by: §I, §IV-A, §IV.
  • [21] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proc. Int. Conf. Learn. Representations (ICLR), Cited by: §V-B.
  • [22] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 3294–3302. Cited by: §II-B.
  • [23] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. arXiv:1701.02810 [cs]. Cited by: §V-B.
  • [24] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. (2007) Moses: open source toolkit for statistical machine translation. In Proc. 45th Annu. Meeting Assoc. Comput. Linguistics (ACL), pp. 177–180. Cited by: §I.
  • [25] P. Koehn and R. Knowles (2017) Six challenges for neural machine translation. In Proc. First Workshop Neural Machine Translation, pp. 28–39. Cited by: §I.
  • [26] T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (ACL), pp. 66–75. Cited by: §II-A.
  • [27] S. Kumar and Y. Tsvetkov (2018) Von Mises-Fisher loss for training sequence to sequence models with continuous outputs. Proc. Int. Conf. Learn. Representations (ICLR). Cited by: §II-B.
  • [28] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato (2018) Phrase-based & neural unsupervised machine translation. Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP). Cited by: §II-C, §V-E.
  • [29] R. Leblond, J. Alayrac, A. Osokin, and S. Lacoste-Julien (2018) SEARNN: training RNNs with global-local losses. Proc. Int. Conf. Learn. Representations (ICLR). Cited by: §II-A.
  • [30] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), pp. 1412–1421. Cited by: §I.
  • [31] S. Ma, X. Sun, Y. Wang, and J. Lin (2018) Bag-of-words as target for neural machine translation. In Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (ACL), pp. 332–338. Cited by: §II-A.
  • [32] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. J. Mach. Learn. Res. 9 (Nov), pp. 2579–2605. Cited by: §V-D.
  • [33] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 3111–3119. Cited by: §II-B.
  • [34] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annu. Meeting Assoc. Comput. Linguistics (ACL), Cited by: §V-C.
  • [35] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), pp. 1532–1543. Cited by: §II-B.
  • [36] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. Proc. North American Chapter Assoc. Comput. Linguistics (NAACL). Cited by: §II-B.
  • [37] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: §II-B.
  • [38] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2016) Sequence level training with recurrent neural networks. Proc. Int. Conf. Learn. Representations (ICLR). Cited by: §II-A.
  • [39] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. 14th Int. Conf. Artif. Intell. Statist. (AISTATS), pp. 627–635. Cited by: §II-A.
  • [40] R. Sennrich, B. Haddow, and A. Birch (2016) Improving neural machine translation models with monolingual data. Proc. 54th Annu. Meeting Assoc. Comput. Linguistics (ACL). Cited by: §II-C.
  • [41] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proc. 54th Annu. Meeting Assoc. Comput. Linguistics (ACL), pp. 1715–1725. Cited by: §V-A.
  • [42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. Cited by: §II-A.
  • [43] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 3104–3112. Cited by: §I.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 5998–6008. Cited by: §I, §V-B.
  • [45] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proc. 25th Int. Conf. Mach. Learn. (ICML), pp. 1096–1103. Cited by: §II-C.
  • [46] W. Xu, X. Niu, and M. Carpuat (2019) Differentiable sampling with flexible reference word order for neural machine translation. Proc. North American Chapter Assoc. Comput. Linguistics (NAACL). Cited by: §II-A.
  • [47] Z. Yang, W. Chen, F. Wang, and B. Xu (2018) Unsupervised neural machine translation with weight sharing. Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (ACL). Cited by: §II-C.
  • [48] Z. Zhang, S. Wu, S. Liu, M. Li, M. Zhou, and E. Chen (2018) Regularizing neural machine translation by target-bidirectional agreement. Proc. AAAI Conf. Artif. Intell.. Cited by: §II-A.