Training Deeper Neural Machine Translation Models with Transparent Attention

08/22/2018 ∙ by Ankur Bapna, et al. ∙ Google

While current state-of-the-art NMT models, such as RNN seq2seq and Transformers, possess a large number of parameters, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper Transformer and Bi-RNN encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in consistent gains of 0.7-1.1 BLEU on the benchmark WMT'14 English-German and WMT'15 Czech-English tasks for both architectures.




1 Introduction

The past few years have seen significant advances in the quality of machine translation systems, owing to the advent of neural sequence-to-sequence models. While current state-of-the-art models come in different flavours, including Transformers Vaswani et al. (2017), convolutional seq2seq models Gehring et al. (2017) and LSTMs Chen et al. (2018), all of these models follow the seq2seq with attention Bahdanau et al. (2015) paradigm.

While revolutionary new architectures have contributed significantly to these quality improvements, the importance of larger model capacities cannot be downplayed. The first major improvement in NMT quality since the switch to neural models, amongst other factors, was brought about by a huge scale up in model capacity Zhou et al. (2016); Wu et al. (2016). While there are multiple approaches to increase capacity, deeper models have been shown to extract more expressive features Mhaskar et al. (2016); Telgarsky (2016); Eldan and Shamir (2015), and have resulted in significant gains for vision tasks over the past few years He et al. (2015); Srivastava et al. (2015).

Despite this being an obvious avenue for improvement, research in deeper models is often restricted by computational constraints. Additionally, deep models are often plagued by trainability concerns like vanishing or exploding gradients Bengio et al. (1994). These issues have been studied in the context of capturing long range dependencies in recurrent architectures Pascanu et al. (2012); Hochreiter et al. (2001), but resolving these deficiencies in Transformers or LSTM seq2seq models deeper than 8 layers is unfortunately under-explored Wang et al. (2017); Barone et al. (2017); Devlin (2017).

In this study we take the first step towards training extremely deep models for translation, by training deep encoders for Transformer and LSTM based models. As we increase the encoder depth the vanilla Transformer models completely fail to train. We also observe sub-optimal performance for LSTM models, which we believe is associated with trainability issues. To ease optimization we propose an enhancement to the attention mechanism, which allows us to train deeper models and results in consistent gains on the WMT’14 EnDe and WMT’15 CsEn tasks.

2 Transparent Attention

While the effect of attention on the forward pass is exalted with visualizations and linguistic interpretations, its influence on the gradient flow is often forgotten. Consider the original seq2seq model without attention Sutskever et al. (2014). To propagate the error signal from the last layer of the decoder to the first layer of the encoder, it has to pass through multiple time-steps in the decoder, survive the encoder-decoder bottleneck, and pass through multiple time-steps in the encoder, before reaching the parameter to be updated. There is some loss of information at every step, especially in the early stages of training. Attention Bahdanau et al. (2015) creates a direct path from the decoder to the topmost layer of the encoder, ensuring efficient dispersal of the error signal over time. This increase in inter-connectivity significantly shortens the credit-assignment path Britz et al. (2017), making the network less susceptible to optimization pathologies like vanishing gradients.

Figure 1: Grad-norm ratio ($r_t$) vs training step ($t$) comparison for a 6 layer (blue) and 20 layer (red) Transformer trained on WMT 14 EnDe.

For deeper networks the error signal also needs to traverse along the depth of the encoder. We propose an extension to the attention mechanism that behaves akin to creating weighted residual connections along the encoder depth, allowing the dispersal of error signal simultaneously over encoder depth and time. Using trainable weights, this ‘transparent’ attention allows the model the flexibility to adjust the gradient flow to different layers in the encoder depending on its training phase.

2.1 Experimental Setup

We train our models on the standard WMT’14 EnDe dataset. Each sentence is tokenized with the Moses tokenizer before breaking into sub-word units similar to Sennrich et al. (2016). We use a shared vocabulary of 32k units for each language pair. We report all our results on newstest 2014, and use a combination of newstest 2012 and newstest 2013 for validation. To verify our results, we also evaluate our models on WMT’15 CsEn. Here we use newstest 2013 for validation and newstest 2015 as the test set. To evaluate the models we compute BLEU on the tokenized, true-case output. We report the mean post-convergence score over a window of 21 checkpoints, obtained using dev performance, following Chen et al. (2018).
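The reporting rule above (a mean score over a window of 21 checkpoints chosen using dev performance) can be sketched as follows. The text does not spell out how the window is selected, so the rule used here, picking the run of consecutive checkpoints with the highest mean dev score, is an assumption for illustration only; Chen et al. (2018) should be consulted for the exact procedure. The function name and the toy scores are hypothetical.

```python
def window_mean_score(dev_scores, test_scores, window=21):
    """Mean test score over the `window` consecutive checkpoints whose
    mean dev score is highest.

    ASSUMPTION: this selection rule is one plausible reading of
    "obtained using dev performance"; it is not confirmed by the text.
    """
    if len(dev_scores) != len(test_scores):
        raise ValueError("dev and test score lists must have equal length")
    if len(dev_scores) < window:
        raise ValueError("not enough checkpoints for the requested window")

    best_start, best_dev_mean = 0, float("-inf")
    for start in range(len(dev_scores) - window + 1):
        dev_mean = sum(dev_scores[start:start + window]) / window
        if dev_mean > best_dev_mean:
            best_start, best_dev_mean = start, dev_mean

    chosen = test_scores[best_start:best_start + window]
    return sum(chosen) / window


# Toy example with a window of 3 instead of 21 (scores are made up).
dev = [20.0, 24.0, 26.0, 26.5, 26.4]
test = [19.5, 23.0, 25.0, 25.5, 25.3]
reported = window_mean_score(dev, test, window=3)
```

Averaging over a window rather than reporting a single best checkpoint reduces the influence of checkpoint-to-checkpoint noise on the reported number.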

2.2 Baseline Experiments

We base our study on two architectures: Transformer Vaswani et al. (2017) and RNMT+ Chen et al. (2018). We choose a smaller version of each model to fit deep encoders with up to 20 layers on a single GPU. All our models are trained on eight P100 GPUs with synchronous training, and optimized using Adam Kingma and Ba (2014). For both architectures we train four models, with 6, 12, 16 and 20 encoder layers. We use 6 and 8 decoder layers for all our transformers and RNMT+ experiments respectively. We also report performance for the standard Transformer Big and RNMT+ setups, as described in Chen et al. (2018), for comparison against higher capacity models.

Transformer: We use the latest version of the Transformer base model, using the implementation from Chen et al. (2018). We modify the learning rate schedule, changing both the learning rate and the number of warmup steps.

RNMT+: We implemented a smaller version of the EnDe RNMT+ model based on the description in Chen et al. (2018), with 512 LSTM nodes in both encoder and decoder.

Figure 2: Grad-norm ratio ($r_t$) vs training step ($t$) comparison for a 6 layer (blue) and 20 layer (red) RNMT+ model trained on WMT 14 EnDe.

2.3 Analysis

From Tables 1 and 2, we notice that the deeper Transformer encoders completely fail to train. To understand what goes wrong we keep track of the grad norm ratio $r_t = \|\nabla_{h_1} L^{(t)}\| / \|\nabla_{h_N} L^{(t)}\|$, where $L^{(t)}$ is the loss at time step $t$, $N$ is the number of layers in the encoder, $h_1$ is the output of the first encoder layer, $h_N$ is the output of the $N$-th encoder layer, $t \in [0, T]$, and $T$ is the total number of training steps. We use $r_t$ as a diagnostic measure for two reasons: First, it indicates if training is suffering from exploding or vanishing gradients. Second, when a network is properly trained the lowest layers usually converge quickly, whereas the top-most layers take longer Raghu et al. (2017). We therefore expect that, for a healthy training process, $r_t$ is relatively large during the early stages of training, when updates to lower layers are larger than those to upper layers. We observe this in most successful Transformer and RNMT+ training runs.
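As a rough illustration of this diagnostic, the sketch below computes the ratio of two gradient norms: the norm of the loss gradient with respect to the first encoder layer's output, divided by the norm with respect to the top encoder layer's output. It assumes the gradients have already been obtained from a training framework and flattened into lists of floats; the function names, the `eps` guard, and the toy values are all hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def grad_norm_ratio(grad_first_layer, grad_top_layer, eps=1e-12):
    """Gradient norm at the first encoder layer's output divided by the
    gradient norm at the top encoder layer's output.

    `eps` avoids division by zero; it is an illustrative choice and not
    part of the definition in the text.
    """
    return l2_norm(grad_first_layer) / (l2_norm(grad_top_layer) + eps)


# Toy gradients for a single training step (values are made up).
grad_h1 = [0.3, -0.4]  # norm 0.5
grad_hN = [3.0, 4.0]   # norm 5.0
ratio = grad_norm_ratio(grad_h1, grad_hN)
```

Logging this value at every training step gives curves of the kind compared in Figures 1 and 2. A value that stays far below 1 suggests that little gradient reaches the bottom of the encoder.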

Figure 1 illustrates the grad-norm ratio ($r_t$) curves for the 6-layer and 20-layer Transformers. As expected, the shallow model has a high $r_t$ value during early stages of training. For the deep model, however, $r_t$ remains flat at a much smaller value throughout training. We also observe that $r_t$ remains low in absolute terms for both models, although the problem seems much less severe for the shallow model.

From Tables 3 and 4, we also observe that the performance of deep RNMT+ encoders is not significantly impacted, reaching the level of the 6 layer model. This is supported by the RNMT+ curves in Figure 2, which indicate few differences in the learning dynamics of the shallow and deep models. This contrasts with the Transformer experiments, where increasing the depth leads to an unstable training process.

Figure 3: Grad-norm ratio ($r_t$) vs training step for 20 layer Transformer with transparent attention.

To gain further insights into the stability of the two architectures we completely remove the residual connections from their encoders. Residual connections have been shown, in theory and practice, to improve training stability and performance of deeper networks (see He et al. (2015); Philipp et al. (2017); Hardt and Ma (2017); Orhan (2017)). Removing residual connections leads to disastrous results for the Transformer, where the training process either does not converge or results in significantly worse results. On the other hand, the 6 layer RNMT+ converges with only a slight degradation in quality. Deeper versions of RNMT+ fail to train in the absence of residual connections.

EnDe WMT 14             Transformer (Base)                  Transformer Big
Encoder layers          6        12       16       20       6
Num. Parameters         94M      120M     137M     154M     375M
Baseline                27.26    *        *        *        27.94
Baseline - residuals    *        6.00     *        *        N/A
Transparent             27.52    27.79    28.04    27.96    N/A

Table 1: BLEU scores on EnDe newstest 2014 with Transformers. * indicates that a model failed to train.

CsEn WMT 15             Transformer (Base)                  Transformer Big
Encoder layers          6        12       16       20       6
Num. Parameters         94M      120M     137M     154M     375M
Baseline                27.20    *        *        *        27.76
Baseline - residuals    25.83    *        *        *        N/A
Transparent             27.41    27.69    27.93    27.80    N/A

Table 2: BLEU scores on CsEn newstest 2015 with Transformers. * indicates that a model failed to train.

EnDe WMT 14             RNMT+ (512)                         RNMT+ (standard)
Encoder layers          6        12       16       20       6
Num. Parameters         128M     165M     191M     216M     379M
Baseline                26.63    26.32    26.49    26.33    28.49
Baseline - residuals    26.37    *        *        *        N/A
Transparent             26.61    26.87    27.07    27.33    N/A

Table 3: BLEU scores on EnDe newstest 2014 with RNMT+. * indicates that a model failed to train.

CsEn WMT 15             RNMT+ (512)                         RNMT+ (standard)
Encoder layers          6        12       16       20       6
Num. Parameters         128M     165M     191M     216M     379M
Baseline                25.77    25.86    26.02    25.75    26.66
Baseline - residuals    25.43    *        *        *        N/A
Transparent             26.69    26.74    26.79    26.72    N/A

Table 4: BLEU scores on CsEn newstest 2015 with RNMT+. * indicates that a model failed to train.

2.4 Regulating Deep Encoder Gradients with Transparent Attention

Our baseline experiments reveal that mechanisms to regulate gradient flow can be critical to improving the optimization of deeper encoders. Since the only difference between our shallow and deep models is the number of layers in the encoder, the trainability issues are likely to be associated with gradient flow through the encoder.

Figure 4: Plot illustrating the variations in the learned attention weights for the 20 layer Transformer encoder over the training process.

To improve gradient flow we let the decoder attend to weighted combinations of all encoder layer outputs, instead of just the top encoder layer. Similar approaches have been found to be useful in deep convolutional networks, for example Shen and Zeng (2016); Huang et al. (2016a); Srivastava et al. (2015); Huang et al. (2016b), but this remains uninvestigated in sequence-to-sequence models. We formulate our proposal below.

Assume the model has $N$ encoder layers and $M$ encoder-decoder attention modules. For Transformer models each decoder layer attends to the encoder, so $M$ is equivalent to the number of decoder layers ($M = 6$). For RNMT+, attention is only applied in the first decoder layer, thus $M = 1$. Let the activations from the $i$-th encoder layer be $h_i$, and let the embeddings be layer $0$, with activations $h_0$. Then the traditional attention module attends to $h_N$. In transparent attention we evaluate $M$ weighted combinations of the encoder outputs, one corresponding to each attention module. We define an $(N+1) \times M$ weight matrix $W$ (one weight vector of length $N+1$ per attention module), which is learned during training. Here the $+1$ is for the embedding layer. We apply dropout to $W$ since we empirically found it helpful to stabilize training. We then compute a softmax $s$ over the layers to normalize the weights:

$$s_{i,j} = \frac{e^{W_{i,j}}}{\sum_{k=0}^{N} e^{W_{k,j}}}$$

We now define

$$z_j = \sum_{i=0}^{N} s_{i,j} \, h_i$$

Now attention module $j$ attends to $z_j$. Since in RNMT+ a projection is applied to the encoder final layer output, we apply a projection to the weighted combination of encoder outputs before the attention module.
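A minimal sketch of the weighted combination described above, written in plain Python for a single source position. For each attention module, the learned scalar weights over the embedding layer and all encoder layers are softmax-normalized, and the layer outputs are mixed with the resulting weights. This is an illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the weights are fixed numbers rather than trained parameters, and both the dropout on the weights and the RNMT+ projection are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transparent_combination(layer_outputs, weights):
    """Mix encoder layer outputs with softmax-normalized scalar weights.

    layer_outputs: list of N+1 vectors (embeddings first, then encoder
        layers 1..N), each a list of floats of equal length.
    weights: nested list of shape (N+1) x M; weights[i][j] is the scalar
        for layer i and attention module j.
    Returns a list of M vectors, one per attention module.
    """
    num_layers = len(layer_outputs)
    num_modules = len(weights[0])
    dim = len(layer_outputs[0])
    combined = []
    for j in range(num_modules):
        # Normalize over layers separately for each attention module.
        s = softmax([weights[i][j] for i in range(num_layers)])
        z = [
            sum(s[i] * layer_outputs[i][d] for i in range(num_layers))
            for d in range(dim)
        ]
        combined.append(z)
    return combined


# Toy example: embeddings plus 2 encoder layers, 2 attention modules.
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
W = [[0.0, -50.0], [0.0, -50.0], [0.0, 50.0]]
z = transparent_combination(h, W)
```

In the toy example the first module has equal weights, so it attends to the plain average of all layers. The second module puts almost all of its weight on the top layer, which recovers the traditional attention input. In a real model each layer output is a sequence of vectors, and the same scalar weights would be applied at every position.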

3 Results and Analysis

Our results in Tables 1 and 2 indicate that adding transparent attention improves the performance of most of our Transformer experiments, with the gains most pronounced for deeper models. While the baseline Transformer fails to train with encoders of 12 or more layers, transparent attention allows us to train encoders with up to 20 layers, improving by more than 0.7 BLEU points on both datasets. Relative to Transformer Big, deeper models seem to result in better or comparable performance with less than half the model capacity.

We also observe gains of 0.7 and 1.0 BLEU for RNMT+ models, on EnDe and CsEn respectively, as indicated by Tables 3 and 4. However, experiments comparing wide models against deeper ones are inconclusive. While deeper models perform slightly better than a wide model with double their capacity on CsEn, they are clearly outperformed by the larger model on EnDe.

The plot in Figure 3 also indicates that the learning dynamics now resemble what we expect to see with stable training. We also notice that the scale of the grad-norm ratio $r_t$ now resembles that of the RNMT+ model, although the lower layers converge more slowly for the Transformer, possibly because it uses a much smaller learning rate.

A plot of the learned, softmax-normalized weights, in Figure 4, also seems to support our findings. The scalar weights for the lowest (embedding) layer grow rapidly in the early stages of training, but once the lower layers converge the weights for layers 16 and 20 become much larger. The weights for the top few layers remain comparable at convergence, suggesting that the observed gains in performance might also be partially associated with an ensembling effect of the encoder features, similar to the effect observed in Peters et al. (2018).

4 Conclusions and Future Work

In this work we explore deeper encoders for Transformer and RNMT+ based machine translation models. We observe that Transformer models are extremely difficult to train when encoder depth is increased to 12 layers or more. While baseline RNMT+ models train with deeper encoders, we did not observe any large performance improvements.

We associated the difficulty in training deeper encoders with hindered gradient flow, and resolved it by proposing the transparent attention mechanism. This enabled us to successfully train deeper Transformer and RNMT+ models, resulting in consistent gains in translation quality on both WMT’14 EnDe and WMT’15 CsEn.

Our results show that there is potential for improvement in translation quality by training deeper architectures, even though they pose optimization challenges. While this study explores training deeper encoders for narrow models, we plan to further study extremely deep and wide models to utilize the full strength of these architectures.

5 Acknowledgments

We would like to thank the Google Brain and Google Translate teams for their foundational contributions to this project.