
Understanding Neural Machine Translation by Simplification: The Case of Encoder-free Models

by Gongbo Tang, et al.

In this paper, we try to understand neural machine translation (NMT) by simplifying NMT architectures and training encoder-free NMT models. In an encoder-free model, the sums of word embeddings and positional embeddings represent the source. The decoder is a standard Transformer or recurrent neural network that directly attends to embeddings via attention mechanisms. Experimental results show (1) that the attention mechanism in encoder-free models acts as a strong feature extractor, (2) that the word embeddings in encoder-free models are competitive with those in conventional models, (3) that non-contextualized source representations lead to a big performance drop, and (4) that encoder-free models have different effects on alignment quality for German-English and Chinese-English.




1 Introduction

Neural machine translation (NMT) Kalchbrenner and Blunsom (2013); Sutskever et al. (2014); Bahdanau et al. (2015); Luong et al. (2015) has emerged in the last few years and has achieved new state-of-the-art performance. However, NMT models are black boxes for humans and are hard to interpret. NMT models employ encoder-decoder architectures where an encoder encodes source-side sentences and an attentional decoder generates target-side sentences based on the outputs of the encoder. In this paper, we attempt to obtain a more interpretable NMT model by simplifying the encoder-decoder architecture. We train encoder-free models where the sums of word embeddings and sinusoid embeddings Vaswani et al. (2017) represent the source. The decoder is a standard Transformer Vaswani et al. (2017) or recurrent neural network (RNN) that attends to embeddings via attention mechanisms.
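The encoder-free source representation described above can be illustrated with a minimal sketch. It assumes the interleaved sine/cosine formulation of Vaswani et al. (2017); the function names and the toy embedding table are hypothetical, and details such as embedding scaling or the exact layout used by a particular toolkit are not taken from the paper.

```python
import math


def sinusoid_position(pos, dim):
    """Sinusoidal positional embedding for one position (interleaved sin/cos)."""
    vec = []
    for i in range(dim):
        # Dimensions 2k and 2k+1 share the wavelength 10000^(2k/dim).
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def encoder_free_source(token_ids, embedding_table):
    """Represent each source token as word embedding + positional embedding.

    No encoder layer is applied, so every vector depends only on the token
    identity and its position, not on the rest of the sentence.
    """
    dim = len(embedding_table[0])
    source = []
    for pos, tok in enumerate(token_ids):
        word = embedding_table[tok]
        position = sinusoid_position(pos, dim)
        source.append([w + p for w, p in zip(word, position)])
    return source


# Toy table with two tokens and 4-dimensional embeddings (illustrative values).
toy_table = [[0.0] * 4, [1.0] * 4]
toy_source = encoder_free_source([1, 0], toy_table)
```

The decoder's attention would then read `toy_source` directly, where a conventional model would read the encoder's hidden states.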

As motivation for our architecture simplification, consider the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), which was introduced to extract features from the hidden representations in encoders dynamically. (We refer to the encoder-decoder attention mechanism unless otherwise specified.) Attention and alignment were initially used interchangeably, but it was soon discovered that the attention mechanism can behave very differently from traditional word alignment (see Ghader and Monz, 2017; Koehn and Knowles, 2017). One reason for this discrepancy is that the attention mechanism operates on representations that potentially include information from the whole sentence due to the encoder's recurrent or self-attentional architecture. Intuitively, bypassing these encoder layers and attending to word embeddings directly could lead to more alignment-like, and thus more predictable and interpretable, behavior of the attention model.

By comparing encoder-free models with conventional models, we can better understand the working mechanism of NMT, figure out which components are more crucial, and learn lessons for improvement. Experimental results show that there is a significant gap between the two models. We focus on exploring what leads to the big gap.

As the embeddings in encoder-free Transformers (Trans-noEnc) are only influenced by attention mechanisms, without the help of encoders, we hypothesize that the quality of embeddings leads to the gap between Transformers and Trans-noEnc models. Thus we conduct both qualitative and quantitative evaluations of the embeddings from Transformers and Trans-noEnc models. We also hypothesize that the attention distribution in Trans-noEnc is not spread out enough for extracting contextual features. However, we find that word embeddings and attention distributions are not the major causes of the gap. We further explore NMT encoders. We find that even NMT models with a one-layer encoder improve significantly over encoder-free models, which indicates that non-contextualized source representations cause the gap.

In encoder-free models, the attention attends to source embeddings rather than hidden representations fused with the context. We hypothesize that encoder-free models generate better alignments than default models. We evaluate the alignments generated on German–English (DE–EN) and Chinese–English (ZH–EN). We find that encoder-free models improve the alignments for DE–EN but worsen the alignments for ZH–EN.

2 Related Work

2.1 Understanding NMT

The attention mechanism has been introduced as a way to learn an alignment between the source and target text, and improves encoder-decoder models significantly, while also providing a way to interpret the inner workings of NMT models. However, Ghader and Monz (2017) and Koehn and Knowles (2017) have shown that the attention mechanism is different from a word alignment. While there are linguistically plausible explanations in some cases (when translating a verb, knowledge about the subject, object, etc. may be relevant information), other cases are harder to explain, such as an off-by-one mismatch between attention and word alignment for some models. We suspect that such a pattern can be learned if relevant information is passed to neighboring representations via recurrent or self-attentional connections.

Ding et al. (2017) show that only using attention is not sufficient for deep interpretation and propose to use layer-wise relevance propagation to better understand NMT. Wang et al. (2018) replace the attention model with an alignment model and a lexical model to make NMT models more interpretable. The proposed model is not superior to the attentional model but on a par with it. They clarify the difference between alignment models and attention models: the alignment model identifies translation equivalents, while the attention model predicts the next target word.

In this paper, we try to understand NMT by simplifying the model. We explore the importance of different NMT components and what causes the performance gap after model simplification.

2.2 Alignments and Source Embeddings

Nguyen and Chiang (2018) introduce a lexical model to generate a target word directly based on the source words. With the lexical model, NMT models generate better alignments. Kuang et al. (2018) propose three different methods to bridge source and target word embeddings. The bridging methods can significantly improve the translation quality. Moreover, the word alignments generated by the model are improved as well.

Our encoder-free model is a simplification and only attends to the source word embeddings. We aim to interpret NMT models rather than pursuing better performance.

Different from previous work, Zenkel et al. (2019) introduce a separate alignment layer directly optimizing the word alignment. The alignment layer is an attention network learning to attend to source tokens given a target token. The attention network can attend to either the word embeddings or the hidden representations or both of them. The proposed model significantly improves the alignment quality and performs as well as the aligners based on traditional IBM models.

3 Experiments

In addition to training Transformer and Trans-noEnc models, we also compare Trans-noEnc with NMT models based on RNNs (RNNS2S). We train RNNS2S models without encoders (RNNS2S-noEnc), without attention mechanisms (RNNS2S-noAtt), and without both encoders and attention mechanisms (RNNS2S-noAtt-noEnc) to explore which component is more important for NMT. We also investigate the importance of positional embeddings in Trans-noEnc.

3.1 Experimental Settings

We use the Sockeye Hieber et al. (2017) toolkit, which is based on MXNet Chen et al. (2015), to train models. Each encoder/decoder has 6 layers. For RNNS2S, we choose long short-term memory (LSTM) RNN units. Transformers have 8 attention heads. The size of embeddings and hidden states is 768. We tie the source, target, and output embeddings. The dropout rate of embeddings and Transformer blocks is set to 0.1. The dropout rate of RNNs is 0.2. All the models are trained with a single GPU. During training, each mini-batch contains 2,048 tokens. A model checkpoint is saved every 1,000 updates. We use Adam Kingma and Ba (2015) as the optimizer. The initial learning rate is set to 0.0001. If the performance on the validation set has not improved for 8 checkpoints, the learning rate is multiplied by 0.7. We set the early stopping patience to 32 checkpoints.
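The learning-rate reduction and early-stopping rule can be sketched as follows. This is an illustration under stated assumptions, not Sockeye's implementation: the function name is hypothetical, validation scores are assumed to be "lower is better" (e.g. perplexity), and the reduction counter is assumed to restart after each reduction.

```python
def plateau_schedule(val_scores, lr=1e-4, factor=0.7,
                     reduce_patience=8, stop_patience=32):
    """Replay per-checkpoint validation scores (lower is better).

    Returns (final learning rate, index of the last checkpoint trained).
    """
    best = float("inf")
    since_best = 0    # checkpoints since the last improvement
    since_reduce = 0  # checkpoints since the last improvement or reduction
    for step, score in enumerate(val_scores):
        if score < best:
            best = score
            since_best = 0
            since_reduce = 0
            continue
        since_best += 1
        since_reduce += 1
        if since_reduce >= reduce_patience:
            lr *= factor  # multiply the learning rate by 0.7
            since_reduce = 0
        if since_best >= stop_patience:
            return lr, step  # early stopping
    return lr, len(val_scores) - 1


# One improving checkpoint followed by eight checkpoints without improvement
# triggers a single reduction from 1e-4 to 7e-5.
final_lr, last_step = plateau_schedule([10.0] + [11.0] * 8)
```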

The training data is from the WMT15 shared task Bojar et al. (2015) on Finnish–English (FI–EN). We choose newsdev2015 as the validation set and use newstest2015 as the test set. All the BLEU Papineni et al. (2002) scores are measured by SacreBLEU Post (2018). There are about 2.1M sentence pairs in the training set after preprocessing. We learn a joint BPE model with 32K subword units Sennrich et al. (2016). We employ the models that have the best perplexity on the validation set for the evaluation. We set the beam size to 8 during inference.

To test the universality of our findings, we conduct experiments on DE–EN and ZH–EN as well. For DE–EN, we use the training data from the WMT17 shared task Bojar et al. (2017). We use newstest2013 as the validation set and newstest2017 as the test set. We learn a joint BPE model with 32K subword units. For ZH–EN, we choose the CWMT parallel data of the WMT17 shared task for training. We use newsdev2017 as the validation set and newstest2017 as the test set. We apply Jieba for Chinese word segmentation. We then learn 60K subword units for Chinese and English separately. After preprocessing, there are about 5.9M and 9M sentence pairs in the training sets for DE–EN and ZH–EN, respectively.

3.2 Results

Table 1 shows the performance of all the trained models. Encoder-free models (NMT-noEncs) perform rather poorly compared to conventional NMT models. (We also trained a Transformer with fewer parameters (64.3M). This Transformer still achieved a significantly better BLEU score (18.2) than Trans-noEnc, which means that the number of parameters is not the primary factor in this case.) It is interesting that Trans-noEnc obtains a BLEU score similar to the RNNS2S model. Even though the attention networks only attend to the non-contextualized word embeddings, Trans-noEnc still performs as well as RNNS2S by paying attention to the context with multiple attention layers. Tang et al. (2018) find that the superiority of Transformer models is attributed to the self-attention network, which is a powerful semantic feature extractor. Given our results, we conclude that the attention mechanism is also a strong feature extractor in Trans-noEnc, even without self-attention in the encoder.

Model                Param.    PPL    BLEU
Transformer          104.4M    9.6    18.9
Trans-noEnc           71.4M   11.7    15.9
RNNS2S                91.5M   14.9    15.9
RNNS2S-noEnc          64.3M   25.2    12.5
RNNS2S-noAtt          90.3M   33.3     8.2
RNNS2S-noAtt-noEnc    63.1M   53.7     4.1
Trans-noEnc-noPos     71.4M   26.6     7.1
Table 1: The performance of NMT models. PPL is the perplexity on the development set. BLEU scores are evaluated on newstest2015. “Param.” denotes the number of parameters.
Word: more
  Transformer neighbors: less, better, greater, most, further
  Trans-noEnc neighbors: less, greater, better, fewer, most
Word: for
  Transformer neighbors: to, in, on, of, with
  Trans-noEnc neighbors: to, in, of, on, towards
Word: ole (not)
  Transformer neighbors: olekaan (not the), kykene (unable to), kuulu (part of), pysty (upright), ollut (been)
  Trans-noEnc neighbors: olekaan, kuulu (part of), ei (no/not), ene (a suffix), liity (sign up)
Word: Arvoisa (honorable)
  Transformer neighbors: arvoisa, Arvoisat (honorable), arvoisaa, arvoisan (honorable), hyvät (honorable)
  Trans-noEnc neighbors: arvoisa, arvoisat, hyvät, Arvoisat, Hyvä (honorable)
Table 2: Neighbors of the selected word embeddings. Bold words are distinct neighbors.

The attention mechanism improves encoder-decoder architectures significantly. However, there are no empirical results to clarify whether encoders or attention mechanisms are more important for NMT models. We compare RNNS2S-noAtt, RNNS2S-noEnc, and RNNS2S-noAtt-noEnc to explore which component contributes more to NMT models. (Because the encoders and decoders in Transformers are only connected via attention, we only conduct this experiment on RNNS2S models.) In Table 1, RNNS2S-noEnc performs much better than RNNS2S-noAtt. Moreover, the gap between RNNS2S-noEnc and RNNS2S-noAtt-noEnc is distinctly larger than the gap between RNNS2S-noAtt and RNNS2S-noAtt-noEnc. These results hint that attention mechanisms are more powerful than encoders in NMT.
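The size of the two gaps can be checked directly from the BLEU scores in Table 1. The short sketch below only redoes that arithmetic; the dictionary and function names are illustrative.

```python
# BLEU scores copied from Table 1 (newstest2015).
BLEU = {
    "RNNS2S": 15.9,
    "RNNS2S-noEnc": 12.5,
    "RNNS2S-noAtt": 8.2,
    "RNNS2S-noAtt-noEnc": 4.1,
}


def bleu_gap(better, worse):
    """BLEU difference between two systems, rounded to one decimal place."""
    return round(BLEU[better] - BLEU[worse], 1)


# Gain from attention alone: noEnc has attention, noAtt-noEnc has neither.
gain_attention = bleu_gap("RNNS2S-noEnc", "RNNS2S-noAtt-noEnc")
# Gain from the encoder alone: noAtt has an encoder, noAtt-noEnc has neither.
gain_encoder = bleu_gap("RNNS2S-noAtt", "RNNS2S-noAtt-noEnc")
print(gain_attention, gain_encoder)
```

Attention alone adds 8.4 BLEU to the model that has neither component, while the encoder alone adds 4.1 BLEU.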

The positional embedding, which holds the sequential information, is also very important to Transformers. We are interested in the extent to which the positional embedding affects the translation performance. We further simplify the model by removing the positional embedding in the source (Trans-noEnc-noPos). Trans-noEnc-noPos has a dramatic drop in BLEU score (from 15.9 to 7.1). It is even worse than RNNS2S-noAtt. This result indicates that positional information is indeed crucial for Transformers.

4 Analysis

Trans-noEnc is obviously inferior to Transformer but we are more interested in investigating what causes the performance gap. In this section, we will test our hypotheses on embedding quality and attention distributions.

4.1 Embeddings

Word embeddings are randomly initialized by default and learned during training. As the embeddings in Trans-noEnc are only updated by attention mechanisms, we hypothesize that embeddings in Trans-noEnc are not well learned and therefore affect translation performance. We test our hypothesis by (1) evaluating the embeddings in the two models manually and (2) initializing Trans-noEnc with the learned embeddings in Transformer as pre-trained embeddings.

Qualitative Evaluation

We select the 150 most frequent tokens from the vocabulary and then manually evaluate the quality of embeddings by comparing the 5 nearest neighbors.
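A nearest-neighbor lookup of this kind can be sketched as below. The similarity measure is not stated in the text, so cosine similarity is an assumption here, and the toy embedding values and function names are purely illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, embeddings, k=5):
    """Return the k tokens whose embeddings are most similar to `word`."""
    query = embeddings[word]
    scored = [(cosine(query, vec), tok)
              for tok, vec in embeddings.items() if tok != word]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tok for _, tok in scored[:k]]


# Toy 2-dimensional embeddings (made-up values, not from the trained models).
toy_embeddings = {
    "more": [1.0, 0.0],
    "most": [0.9, 0.1],
    "less": [0.8, 0.3],
    "for": [0.0, 1.0],
}
neighbors = nearest_neighbors("more", toy_embeddings, k=2)
```

Running the same lookup on the embedding matrices of both models gives neighbor lists that can be compared side by side, as in Table 2.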

The quality of English word embeddings is quite good based on the output of neighbors. Finnish word embeddings are not as good as English word embeddings. Table 2 exhibits four examples: two English words, “more” and “for”, and two Finnish words, “ole” (not) and “Arvoisa” (honorable). The neighbors of “more” in Transformer and Trans-noEnc are all quite related words, including comparatives and “most”, which is the superlative of “more”. The words “further” and “fewer” are the distinct neighbors, but both are related to “more”. For the Finnish word “ole” (not), both models have negative words as neighbors, but there are different unrelated words as well. We can see that the qualities of neighbors in the two embedding matrices are close. We cannot easily distinguish which embedding matrix is better based on the neighbors.

Quantitative Evaluation

In addition to the qualitative evaluation, we also conduct a quantitative evaluation. We first employ the learned embeddings from Transformer to initialize the embedding parameters in Trans-noEnc. The pre-trained embeddings can be either fixed or not fixed during training. Table 3 gives the BLEU scores of these models. The pre-trained embeddings slightly improve the BLEU score.

Embeddings Random Fixed Not-fixed
BLEU 15.9 16.1 16.2
Table 3: BLEU scores of Trans-noEncs with different embedding initialization. “Random” means no pre-trained embeddings. “Fixed” and “Not-fixed” denote using pre-trained embeddings.

The evaluation reveals that the embeddings from Trans-noEnc are competitive with those of Transformer: replacing them with Transformer embeddings gains at most 0.3 BLEU. Thus, we can rule out differences in embedding quality as the main factor for the performance drop.

4.2 Attention Distribution

The attention networks in Trans-noEnc only attend to word embeddings. To better capture the sentence-level context, the attention networks need to distribute more attention to the context. We test our hypothesis that the attention distributions in Trans-noEnc are not as distributed as those in Transformer. If the attention distributions in Transformer are more spread out than those in Trans-noEnc, it means that smaller weights are distributed to contextual features by Trans-noEnc.


We use attention entropy (Equation 1) Ghader and Monz (2017) to measure the concentration of the attention distribution at timestep $t$:

$$E_{At}(y_t) = -\sum_{i=1}^{|x|} At(x_i, y_t) \log At(x_i, y_t) \qquad (1)$$

We then average the attention entropy at all the timesteps as the final attention entropy. $x_i$ denotes the $i$-th source token, $y_t$ is the prediction at timestep $t$, and $At(x_i, y_t)$ represents the attention distribution at timestep $t$. The attention mechanism in Transformer has multiple layers, and each layer has multiple heads. In each layer, we average the attention weights from all the heads.
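The entropy measurement can be sketched in a few lines. This sketch assumes Shannon entropy with the natural logarithm over the attention weights at one timestep; the function names and the nested-list layout (`heads[h][t][i]` for head `h`, timestep `t`, source position `i`) are illustrative choices, not the authors' code.

```python
import math


def attention_entropy(dist):
    """Entropy of one attention distribution over the source tokens."""
    # Zero weights contribute nothing (the limit of p * log p as p -> 0 is 0).
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def mean_attention_entropy(attention):
    """Average the entropy over all timesteps; attention[t] sums to 1."""
    return sum(attention_entropy(row) for row in attention) / len(attention)


def average_heads(heads):
    """Average attention weights over the heads of one layer."""
    n_heads = len(heads)
    n_steps = len(heads[0])
    return [
        [sum(h[t][i] for h in heads) / n_heads for i in range(len(heads[0][t]))]
        for t in range(n_steps)
    ]


# A uniform distribution over 4 tokens has entropy log(4); a one-hot
# distribution has entropy 0, so higher entropy means more spread-out attention.
layer = average_heads([[[1.0, 0.0]], [[0.0, 1.0]]])
```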

Figure 1 shows the entropy of attention distributions in both models. The attention distributions are consistent with the finding in Tang et al. (2018) that the distribution gets concentrated first and then becomes distributed again. Transformer has lower entropy, potentially because the contextual information has already been encoded in the hidden representations. The attention entropy of Trans-noEnc is clearly higher than that of Transformer in each attention layer, which contradicts our hypothesis. The attention in Trans-noEnc tends to extract features from source tokens more uniformly, which indicates that the attention mechanism compensates for the non-contextualized embeddings by distributing attention across more tokens.

Figure 1: The attention entropy of each attention layer and the entire attention mechanism.

4.3 Encoders

We have shown that embeddings and attention distributions are not the primary reasons causing the gap between Transformer and Trans-noEnc. Therefore, we move to explore encoders.

Encoders are responsible for providing source hidden representations to the decoder. Encoder-free models have to use word embeddings to represent source tokens without the help of encoders. Thus, the source-side representations probably lead to the performance gap.

We train NMT models with different encoder layers. Table 4 displays the performance of Transformer models that have different layers in the encoder. It is clear that even the model with only a 1-layer encoder outperforms Trans-noEnc (0-layer) by 1.7 BLEU points, which accounts for 56.7% of the performance gap. The results seem to show that source-side hidden representations are crucial in NMT.

Layers   Param.    PPL    BLEU
0         71.4M   11.7    15.9
1         76.9M   10.3    17.6
3         87.9M    9.9    18.4
5         98.9M    9.5    18.6
6        104.4M    9.6    18.9
Table 4: The performance of Transformer models that have different layers in the encoder, including the perplexity (PPL) on the development set and the BLEU scores on newstest2015.
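The share of the performance gap closed by each encoder depth follows from the BLEU column of Table 4. The sketch below redoes that arithmetic; the names are illustrative.

```python
# BLEU by number of encoder layers, copied from Table 4.
BLEU_BY_LAYERS = {0: 15.9, 1: 17.6, 3: 18.4, 5: 18.6, 6: 18.9}


def gap_recovered(layers, full=6, none=0):
    """Share of the BLEU gap between `none` and `full` layers closed by `layers`."""
    gap = BLEU_BY_LAYERS[full] - BLEU_BY_LAYERS[none]
    return (BLEU_BY_LAYERS[layers] - BLEU_BY_LAYERS[none]) / gap


for n in sorted(BLEU_BY_LAYERS):
    print(n, f"{100 * gap_recovered(n):.1f}%")
```

With a full gap of 3.0 BLEU, the 1-layer encoder's 1.7-point gain corresponds to 1.7 / 3.0 ≈ 56.7% of the gap.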

It has been shown that encoders could extract syntactic and semantic features in NMT Belinkov et al. (2017a, b); Poliak et al. (2018). In the meantime, contextual information is encoded in hidden representations as well. Hence we conclude that the quality of source representations is the main factor causing the big gap between Transformer and Trans-noEnc.

In Table 5, our additional experiments on DE–EN and ZH–EN confirm that models with contextualized representations are much better. Transformer models always outperform Trans-noEnc models substantially.

Lang. pair   Trans-noEnc   Transformer   Impr.
DE–EN        29.5          32.6          10.5%
ZH–EN        18.5          20.9          13.0%
Table 5: The relative BLEU improvement (Impr.) of employing encoders in Trans-noEncs on DE–EN and ZH–EN.

5 Alignment

The weights of the attention mechanism can be interpreted as an alignment between the source and target text. We further explore whether encoder-free models have better alignments than default models. We evaluate the alignments on two manually annotated alignment data sets. The first one has been provided by RWTH and consists of 508 DE–EN sentence pairs. The other one is from Liu and Sun (2015) and contains 900 ZH–EN sentence pairs. We apply alignment error rate (AER) Och and Ney (2003) as the evaluation metric.
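For reference, AER as defined by Och and Ney (2003) compares a predicted alignment A against sure links S and possible links P (with S contained in P): AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A minimal sketch follows; the function name, the link representation as (source index, target index) pairs, and the handling of the empty case are assumptions made for illustration.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); lower is better."""
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible) | sure  # sure links also count as possible
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # assumption: nothing to align and nothing predicted
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Toy example with links as (source index, target index) pairs.
toy_aer = alignment_error_rate(
    predicted={(0, 0), (1, 1), (2, 1)},
    sure={(0, 0), (1, 1)},
    possible={(2, 2)},
)
```

In the toy example two of the three predicted links are correct, giving 1 - 4/5 = 0.2.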

Following Luong et al. (2015) and Kuang et al. (2018), we also force the models to produce the reference target words during inference to get the alignment between input sentences and their reference outputs. We merge the subwords after translation following the method in Koehn and Knowles (2017): (1) if an input word is split into subwords, we sum their attention weights; (2) if a target word is split into subwords, we average their attention weights. We sum the attention weights in all attention heads in each attention layer. (Following Tang et al. (2018), we also tried taking the maximum of the attention weights but got worse alignment quality.) Given a target token, the source token with the highest attention weight is viewed as the alignment of the current target token Luong et al. (2015). However, a source token may be aligned to multiple target tokens and vice versa. Therefore, we also align a source token to the target token that has the highest attention weight given the source token. Experimental results show that the bidirectional method achieves higher alignment quality.
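The subword merging and the bidirectional extraction can be sketched as follows. This is an illustration under assumptions: the function names and the index-group representation of words are hypothetical, and the two directions are assumed to be combined by taking the union of their links.

```python
def merge_subwords(attn, src_groups, tgt_groups):
    """Turn subword-level attention into word-level attention.

    attn[t][s] is the weight of target subword t on source subword s;
    each group lists the subword indices that belong to one word.
    """
    merged = []
    for tgt in tgt_groups:
        row = []
        for src in src_groups:
            # Sum over source subwords, average over target subwords.
            total = sum(attn[t][s] for t in tgt for s in src)
            row.append(total / len(tgt))
        merged.append(row)
    return merged


def bidirectional_alignment(word_attn):
    """Union of both argmax directions, as (source index, target index) links."""
    n_tgt, n_src = len(word_attn), len(word_attn[0])
    links = set()
    for t in range(n_tgt):  # best source word for each target word
        s = max(range(n_src), key=lambda i: word_attn[t][i])
        links.add((s, t))
    for s in range(n_src):  # best target word for each source word
        t = max(range(n_tgt), key=lambda j: word_attn[j][s])
        links.add((s, t))
    return links


# Toy case: source word 0 and target word 1 each consist of two subwords.
toy_attn = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8], [0.2, 0.2, 0.6]]
toy_words = merge_subwords(toy_attn, [[0, 1], [2]], [[0], [1, 2]])
toy_links = bidirectional_alignment(toy_words)
```

The second direction matters when a source word is never the top choice of any target word: it still receives a link to the target word that attends to it most.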

Figure 2 displays the evaluation results. The alignment in the fourth attention layer achieves the best performance. Therefore, we only compare the alignments in the fourth layer. In DE–EN, the encoder-free model has a lower (better) AER score (0.41) than the default model (0.43), which accords with our hypothesis. However, in ZH–EN, the alignment quality of the encoder-free model (0.46) is worse than that of the default model (0.43). Given the limited number of language pairs, the effect of encoder-free models on alignment quality is not clear-cut.

Figure 2: The AER scores of alignments in different attention layers on DE–EN and ZH–EN.

6 Conclusion

To better understand NMT, we simplify the attentional encoder-decoder architecture by training encoder-free NMT models in this paper. The non-contextualized source representations in encoder-free models cause a big performance drop, but the word embeddings in encoder-free models are shown to be competitive with those in default models. Also, we find that the attention component in encoder-free models is a powerful feature extractor, and can partially compensate for the lack of contextualized encoder representations.

Regarding the interpretability of attention, our results do not show that the attention mechanism in encoder-free models is consistently more alignment-like: only attending to source embeddings improves the alignment quality on DE–EN but makes the alignment quality worse on ZH–EN.


Acknowledgments

We thank all reviewers for their valuable and insightful comments. Gongbo Tang is mainly funded by the Chinese Scholarship Council (grant number 201607110016).