Weighted Transformer Network for Machine Translation

by Karim Ahmed, et al.

State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recursion. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose Weighted Transformer, a Transformer with modified attention layers, that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention by multiple self-attention branches that the model learns to combine during the training process. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT 2014 English-to-German translation task and by 0.4 on the English-to-French translation task.



1 Introduction

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997), form an important building block for many tasks that require modeling of sequential data. RNNs have been successfully employed for several such tasks including language modeling (Melis et al., 2017; Merity et al., 2017), speech recognition (Xiong et al., 2017; Graves et al., 2013), and machine translation (Wu et al., 2016; Bahdanau et al., 2014). RNNs make output predictions at each time step by computing a hidden state vector $h_t$ based on the current input token and the previous states. This sequential computation underlies their ability to map arbitrary input-output sequence pairs. However, because of their auto-regressive property of requiring previous hidden states to be computed before the current time step, they cannot benefit from parallelization.

Variants of recurrent networks that use strided convolutions eschew the traditional time-step based computation (Kaiser & Bengio, 2016; Lei & Zhang, 2017; Bradbury et al., 2016; Gehring et al., 2016, 2017; Kalchbrenner et al., 2016). However, in these models, the operations needed to learn dependencies between distant positions can be difficult to learn (Hochreiter et al., 2001; Hochreiter, 1998). Attention mechanisms, often used in conjunction with recurrent models, have become an integral part of complex sequential tasks because they facilitate learning of such dependencies (Luong et al., 2015; Bahdanau et al., 2014; Parikh et al., 2016; Paulus et al., 2017; Kim et al., 2017).

In Vaswani et al. (2017), the authors introduce the Transformer network, a novel architecture that avoids the recurrence equation and maps the input sequences into hidden states solely using attention. Specifically, the authors use positional encodings in conjunction with a multi-head attention mechanism. This allows for increased parallel computation and reduces time to convergence. The authors report results for neural machine translation showing that the Transformer network achieves state-of-the-art performance on the WMT 2014 English-to-German and English-to-French tasks while being orders of magnitude faster than prior approaches.

Transformer networks still require a large number of parameters to achieve state-of-the-art performance. In the case of the newstest2013 English-to-German translation task, the base model required 65M parameters, and the large model required 213M parameters. We propose a variant of the Transformer network, which we call Weighted Transformer, that uses self-attention branches in lieu of the multi-head attention. The branches replace the multiple heads in the attention mechanism of the original Transformer network, and the model learns to combine these branches during training. This branched architecture enables the network to achieve comparable performance at a significantly lower computational cost. Indeed, through this modification, we improve the state-of-the-art performance by 0.5 and 0.4 BLEU points on the WMT 2014 English-to-German and English-to-French tasks, respectively. Finally, we present evidence that suggests a regularizing effect of the proposed architecture.

2 Related Work

Most architectures for neural machine translation (NMT) use an encoder and a decoder that rely on deep recurrent neural networks like the LSTM (Luong et al., 2015; Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016; Barone et al., 2017; Cho et al., 2014). Several architectures have been proposed to reduce the computational load associated with recurrence-based computation (Gehring et al., 2016, 2017; Kaiser & Bengio, 2016; Kalchbrenner et al., 2016). Self-attention, which relies on dot-products between elements of the input sequence to compute a weighted sum (Lin et al., 2017; Bahdanau et al., 2014; Parikh et al., 2016; Kim et al., 2017), has also been a critical ingredient in modern NMT architectures. The Transformer network (Vaswani et al., 2017) avoids the recurrence completely and uses only self-attention.

We propose a modified Transformer network wherein the multi-head attention layer is replaced by a branched self-attention layer. The contributions of the various branches are learned as part of the training procedure. The idea of multi-branch networks has been explored in several domains (Ahmed & Torresani, 2017; Gastaldi, 2017; Shazeer et al., 2017; Xie et al., 2016). To the best of our knowledge, this is the first model using a branched structure in the Transformer network. In Shazeer et al. (2017), the authors use a large network, with billions of weights, in conjunction with a sparse expert model to achieve competitive performance. Ahmed & Torresani (2017) analyze learned branching, through gates, in the context of computer vision, while in Gastaldi (2017), the author analyzes a two-branch model with randomly sampled weights in the context of image classification.

2.1 Transformer Network

The original Transformer network uses an encoder-decoder architecture with each layer consisting of a novel attention mechanism, which the authors call multi-head attention, followed by a feed-forward network. We describe both these components below.

From the source tokens, learned embeddings of dimension $d_{model}$ are generated which are then modified by an additive positional encoding. The positional encoding is necessary because the network contains no recurrence or convolution and therefore has no other means of leveraging the order of the sequence. The authors use an additive encoding which is defined as:

$$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

where $pos$ is the position of a word in the sentence and $i$ is the dimension of the vector. The authors also experiment with learned embeddings (Gehring et al., 2016, 2017) but found no benefit in doing so. The encoded word embeddings are then used as input to the encoder which consists of $N$ layers each containing two sub-layers: (a) a multi-head attention mechanism, and (b) a feed-forward network.
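The sinusoidal positional encoding of Vaswani et al. (2017) can be sketched in a few lines. This is a minimal, dependency-free illustration; the function name `positional_encoding` and the toy sizes are assumptions made for the example, not part of the original work.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Toy sizes chosen only for illustration.
pe = positional_encoding(num_positions=4, d_model=8)
```

Each row of `pe` would be added element-wise to the embedding of the token at that position.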

A multi-head attention mechanism builds upon scaled dot-product attention, which operates on a query $Q$, key $K$ and a value $V$:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (1)$$

where $d_k$ is the dimension of the key.

In the first layer, the inputs are concatenated such that each of $(Q, K, V)$ is equal to the word vector matrix. This is identical to dot-product attention except for the scaling factor $1/\sqrt{d_k}$, which improves numerical stability.
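Scaled dot-product attention can be illustrated with plain Python lists. The sketch below is not an efficient implementation; the helper names and the toy matrices are assumptions chosen for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot product of the query with every key, scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Toy inputs: each query matches exactly one key.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

Because the softmax weights are positive and sum to one, each output row is a convex combination of the value rows, pulled toward the value whose key best matches the query.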

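Multi-head attention mechanisms obtain $h$ different representations of ($Q$, $K$, $V$), compute scaled dot-product attention for each representation, concatenate the results, and project the concatenation with a feed-forward layer. This can be expressed in the same notation as Equation (1):

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (2)$$
$$\text{MultiHead}(Q, K, V) = \text{Concat}_i(\text{head}_i)W^O \quad (3)$$

where the $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are parameter projection matrices that are learned. Note that $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$, where $h$ denotes the number of heads in the multi-head attention. Vaswani et al. (2017) proportionally reduce $d_k = d_v = d_{model}/h$ so that the computational load of the multi-head attention is the same as simple self-attention.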

The second component of each layer of the Transformer network is a feed-forward network. The authors propose using a two-layered network with a ReLU activation. Given trainable weights $W_1$, $W_2$, $b_1$, $b_2$, the sub-layer is defined as:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \quad (4)$$

The dimension of the inner layer is $d_{ff}$, which is set to 2048 in their experiments. For the sake of brevity, we refer the reader to Vaswani et al. (2017) for additional details regarding the architecture.

For regularization and ease of training, the network uses layer normalization (Ba et al., 2016) after each sub-layer and a residual connection around each full layer (He et al., 2016). Analogously, each layer of the decoder contains the two sub-layers mentioned above as well as an additional multi-head attention sub-layer that receives as input the output of the corresponding encoding layer. In the case of the decoder multi-head attention sub-layers, the scaled dot-product attention is masked to prevent future positions from being attended to, or in other words, to prevent illegal leftward information flow.

One natural question regarding the Transformer network is why self-attention should be preferred to recurrent or convolutional models. Vaswani et al. (2017) state three reasons for the preference: (a) computational complexity of each layer, (b) concurrency, and (c) path length between long-range dependencies. Assuming a sequence length of $n$ and vector dimension $d$, the complexity of each layer is $O(n^2 d)$ for self-attention layers while it is $O(n d^2)$ for recurrent layers. Given that typically $n < d$, the complexity of self-attention layers is lower than that of recurrent layers. Further, the number of sequential computations is $O(1)$ for self-attention layers and $O(n)$ for recurrent layers. This improves utilization of parallel computing architectures. Finally, the maximum path length between dependencies is $O(1)$ for the self-attention layer while it is $O(n)$ for the recurrent layer. This difference is instrumental in impeding recurrent models' ability to learn long-range dependencies.
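To make the per-layer comparison concrete, the sketch below plugs numbers into the costs reported by Vaswani et al. (2017), namely on the order of n^2 * d operations for self-attention and n * d^2 for a recurrent layer. The sentence length of 50 and the dimension of 512 are illustrative assumptions, and constant factors are ignored.

```python
def self_attention_cost(n, d):
    """Leading-order per-layer cost of self-attention: n^2 * d."""
    return n * n * d


def recurrent_cost(n, d):
    """Leading-order per-layer cost of a recurrent layer: n * d^2."""
    return n * d * d


# Illustrative values: a 50-token sentence and 512-dimensional vectors.
n, d = 50, 512
ratio = recurrent_cost(n, d) / self_attention_cost(n, d)  # equals d / n
```

With these values the recurrent layer costs d / n = 10.24 times as much as the self-attention layer, which is the sense in which n < d favors self-attention.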

Figure 1: Our proposed network architecture.

3 Proposed Network Architecture

We now describe the proposed architecture, the Weighted Transformer, which is more efficient to train and makes better use of representational power.

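In Equations (3) and (4), we described the attention layer proposed in Vaswani et al. (2017) comprising the multi-head attention sub-layer and a FFN sub-layer. For the Weighted Transformer, we propose a branched attention that modifies the entire attention layer in the Transformer network (including both the multi-head attention and the feed-forward network). The proposed attention layer can be described as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (5)$$
$$\overline{\text{head}}_i = \text{head}_i W^{O_i} \kappa_i \quad (6)$$
$$\text{BranchedAttention}(Q, K, V) = \sum_{i=1}^{M} \alpha_i \, \text{FFN}(\overline{\text{head}}_i) \quad (7)$$

where $M$ denotes the total number of branches, $\kappa_i, \alpha_i \in \mathbb{R}_+$ are learned parameters and $W^{O_i} \in \mathbb{R}^{d_v \times d_{model}}$. The FFN function above is identical to Equation (4). Further, we require that $\sum_i \kappa_i = 1$ and $\sum_i \alpha_i = 1$ so that Equation (7) is a weighted sum of the individual branch attention values.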
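The branch combination can be sketched for a single position as follows. This is a toy illustration under stated assumptions: each branch input is taken to be the already-projected head output, a single feed-forward function is reused for every branch, and all names (`ffn`, `branched_attention`, `toy_ffn`) and values are hypothetical.

```python
def ffn(x, W1, b1, W2, b2):
    """Two-layer ReLU network; W1 and W2 are stored as lists of output rows."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def branched_attention(projected_heads, kappa, alpha, ffn_fn):
    """Combine branches for one position.

    projected_heads[i] is assumed to already hold the projected output of
    head i, a vector of length d_model.
    """
    out = [0.0] * len(projected_heads[0])
    for head, k, a in zip(projected_heads, kappa, alpha):
        scaled = [k * h for h in head]   # scale the branch by kappa_i
        branch = ffn_fn(scaled)          # feed-forward network on the branch
        out = [o + a * b for o, b in zip(out, branch)]  # alpha-weighted sum
    return out


# Toy FFN with identity weights, so positive inputs pass through unchanged.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]


def toy_ffn(x):
    return ffn(x, identity, zeros, identity, zeros)


heads = [[1.0, 2.0], [3.0, 4.0]]
kappa = [0.5, 0.5]    # non-negative, sums to 1
alpha = [0.25, 0.75]  # non-negative, sums to 1
combined = branched_attention(heads, kappa, alpha, toy_ffn)
```

In this toy setting the result is the vector [1.25, 1.75]: the second branch dominates because it carries the larger addition weight.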

In the equations above, $\kappa$ can be interpreted as a learned concatenation weight and $\alpha$ as the learned addition weight. Indeed, $\kappa$ scales the contribution of the various branches before $\alpha$ is used to sum them in a weighted fashion. We ensure that all bounds are respected during each training step by projection.
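The text does not specify which projection is used to keep the weights within their bounds. One plausible choice, assumed here purely for illustration, is the Euclidean projection onto the probability simplex (non-negative entries summing to one), computed with the standard sort-based procedure. The function name `project_to_simplex` is hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}.

    Assumption: this is one possible projection, not necessarily the one
    used in the original experiments.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        # Keep the threshold from the largest j that leaves u_j positive.
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


# Weights that drifted outside the bounds after a gradient step.
projected = project_to_simplex([0.7, 0.6, -0.1])
```

Applying such a projection after every optimizer update would keep each weight vector interpretable as a probability mass over the branches.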

While it is possible that $\alpha$ and $\kappa$ could be merged into one variable and trained, we found better training outcomes by separating them. It also improves the interpretability of the model given that $(\alpha, \kappa)$ can be thought of as probability masses on the various branches.

It can be shown that if $\kappa_i = 1$ and $\alpha_i = 1$ for all $i$, we recover the equation for the multi-head attention (3). However, given the $\sum_i \kappa_i = 1$ and $\sum_i \alpha_i = 1$ bounds, these values are not permissible in the Weighted Transformer. One interpretation of our proposed architecture is that it replaces the multi-head attention by a multi-branch attention. Rather than concatenating the contributions of the different heads, they are instead treated as branches that a multi-branch network learns to combine.

This mechanism adds $O(M)$ trainable weights. This is an insignificant increase compared to the total number of weights. Indeed, in our experiments, the proposed mechanism added 192 weights to a model containing 213M weights already. Without these additional trainable weights, the proposed mechanism is identical to the multi-head attention mechanism in the Transformer. The proposed attention mechanism is used in both the encoder and decoder layers and is masked in the decoder layers as in the Transformer network. Similarly, the positional encoding, layer normalization, and residual connections in the encoder-decoder layers are retained. We eliminate these details from Figure 1 for clarity. Instead of using $(\alpha, \kappa)$ learned weights, it is possible to also use a mixture-of-experts normalization via a softmax layer (Shazeer et al., 2017). However, we found this to perform worse than our proposal.

Unlike the Transformer, which weighs all heads equally, the proposed mechanism allows for ascribing importance to different heads. This in turn prioritizes their gradients and eases the optimization process. Further, as is known from multi-branch networks in computer vision (Gastaldi, 2017), such mechanisms tend to cause the branches to learn decorrelated input-output mappings. This reduces co-adaptation and improves generalization. This observation also forms the basis for mixture-of-experts models (Shazeer et al., 2017).

4 Experiments

4.1 Training Details

The weights $\kappa$ and $\alpha$ are initialized randomly, as with the rest of the Transformer weights.

In addition to the layer normalization and residual connections, we use label smoothing with $\epsilon_{ls} = 0.1$, attention dropout, and residual dropout with probability $P_{drop} = 0.1$. Attention dropout randomly drops out elements (Srivastava et al., 2014) from the softmax in (1).
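Label smoothing replaces the one-hot training target with a slightly softened distribution. The sketch below uses one common formulation, in which a fraction epsilon of the probability mass is spread uniformly over all classes; whether the experiments used exactly this variant is an assumption, as are the function name and the toy sizes.

```python
def smooth_labels(true_index, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon * uniform as a list."""
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[true_index] += 1.0 - epsilon
    return target


# Toy vocabulary of 4 tokens where the correct token has index 2.
smoothed = smooth_labels(true_index=2, num_classes=4, epsilon=0.1)
```

The smoothed target still sums to one, but the model is no longer rewarded for pushing the probability of the correct token all the way to one.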

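As in Vaswani et al. (2017), we used the Adam optimizer (Kingma & Ba, 2014) with $(\beta_1, \beta_2) = (0.9, 0.98)$ and $\epsilon = 10^{-9}$. We also use the learning rate warm-up strategy for Adam wherein the learning rate $lr$ takes on the form:

$$lr = d_{model}^{-0.5} \cdot \min(\text{iterations}^{-0.5}, \text{iterations} \cdot 4000^{-1.5})$$

for all parameters except $(\alpha, \kappa)$ and

$$lr = (d_{model}/N)^{-0.5} \cdot \min(\text{iterations}^{-0.5}, \text{iterations} \cdot 400^{-1.5})$$

for $(\alpha, \kappa)$.

This corresponds to the warm-up strategy used for the original Transformer network except that we use a larger peak learning rate for $(\alpha, \kappa)$ to compensate for their bounds. Further, we found that freezing the weights $(\kappa, \alpha)$ in the last 10K iterations aids convergence. During this time, we continue training the rest of the network. We hypothesize that this freezing process helps stabilize the rest of the network weights given the weighting scheme.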
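The warm-up schedule can be written as a small function: the rate grows linearly for a fixed number of warm-up steps and then decays with the inverse square root of the step count. The function name, the use of 512 and 4000 for the main parameters, and the smaller scale (512/6) and shorter warm-up (400) for the branch weights are illustrative assumptions used to show how a larger peak rate arises.

```python
def warmup_lr(step, scale_dim, warmup_steps):
    """Linear warm-up followed by inverse-square-root decay (step >= 1)."""
    return scale_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Peak rate for the main parameters (assumed d_model = 512, 4000 warm-up steps).
base_peak = warmup_lr(4000, 512, 4000)

# Peak rate for the branch weights under the assumed smaller scale and
# shorter warm-up; both choices raise the peak.
branch_peak = warmup_lr(400, 512 / 6, 400)
```

The peak is reached exactly at the end of warm-up, where both arguments of `min` coincide, and it is larger for the branch weights under these assumed settings.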

We note that the number of iterations required for convergence to the final score is substantially reduced for the Weighted Transformer. We found that the Weighted Transformer converges 15-40% faster as measured by the total number of iterations to achieve optimal performance. We train the baseline model for 100K steps for the smaller variant and 300K for the larger. We train the Weighted Transformer for the respective variants for 60K and 250K iterations. We found that the objective did not significantly improve by running it for longer. Further, we do not use any of the averaging strategies employed in Vaswani et al. (2017) and simply return the final model for testing purposes.

In order to reduce the computational load associated with padding, sentences were batched such that they were approximately of the same length. All sentences were encoded using byte-pair encoding (Sennrich et al., 2015) and shared a common vocabulary. Weights for word embeddings were tied to corresponding entries in the final softmax layer (Inan et al., 2016; Press & Wolf, 2016). We trained all our networks on NVIDIA K80 GPUs with a batch containing roughly 25,000 source and target tokens.

4.2 Results on Benchmark Data Sets

We benchmark our proposed architecture on the WMT 2014 English-to-German and English-to-French tasks. The WMT 2014 English-to-German data set contains 4.5M sentence pairs. The English-to-French data set contains 36M sentence pairs.

Results of our experiments are summarized in Table 1. The Weighted Transformer achieves a 1.1 BLEU score improvement over the state-of-the-art on the English-to-German task for the smaller network and a 0.5 BLEU improvement for the larger network. In the case of the larger English-to-French task, we note a 0.8 BLEU improvement for the smaller model and a 0.4 improvement for the larger model. Also, note that the performance of the smaller model for Weighted Transformer is close to that of the larger baseline model, especially for the English-to-German task. This suggests that the Weighted Transformer better utilizes available model capacity since it needs only 30% of the parameters of the baseline Transformer to match its performance. Our relative improvements do not hinge on using the BLEU scores for comparison; experiments with the GLEU score proposed in Wu et al. (2016) also yielded similar improvements.

Finally, we comment on the regularizing effect of the Weighted Transformer. Given the improved results, a natural question is whether the results stem from improved regularization of the model. To investigate this, we report the testing loss of the Weighted Transformer and the baseline Transformer against the training loss in Figure 2. Models which have a regularizing effect tend to have lower testing losses for the same training loss. We see this effect in our experiments, suggesting that the proposed architecture may have better regularizing properties. This is not unexpected given similar outcomes for other branching-based strategies such as Shake-Shake (Gastaldi, 2017) and mixture-of-experts (Shazeer et al., 2017).

Figure 2: Testing v/s Training Loss for the newstest2013 English-to-German task. The Weighted Transformer has lower testing loss compared to the baseline Transformer for the same training loss, suggesting a regularizing effect.

Model                                         EN-DE BLEU   EN-FR BLEU
Transformer (small) (Vaswani et al., 2017)    27.3         38.1
Weighted Transformer (small)                  28.4         38.9
Transformer (large) (Vaswani et al., 2017)    28.4         41.0
Weighted Transformer (large)                  28.9         41.4
ByteNet (Kalchbrenner et al., 2016)           23.7         -
Deep-Att+PosUnk (Zhou et al., 2016)           -            39.2
GNMT+RL (Wu et al., 2016)                     24.6         39.9
ConvS2S (Gehring et al., 2017)                25.2         40.5
MoE (Shazeer et al., 2017)                    26.0         40.6

Table 1: Experimental results on the WMT 2014 English-to-German (EN-DE) and English-to-French (EN-FR) translation tasks. Our proposed model outperforms the state-of-the-art models including the Transformer (Vaswani et al., 2017). The small model corresponds to configuration (A) in Table 2 while large corresponds to configuration (B).

Model                      N   d_model   d_ff   h    M    P_drop   train steps   BLEU
Transformer (C)            2   512       2048   8    NA   0.1      100K          23.7
Weighted Transformer (C)   2   512       2048   8    8    0.1      60K           24.8
Transformer                4   512       2048   8    NA   0.1      100K          25.3
Weighted Transformer       4   512       2048   8    8    0.1      60K           26.2
Transformer (A)            6   512       2048   8    NA   0.1      100K          25.8
Weighted Transformer (A)   6   512       2048   8    8    0.1      60K           26.5
Transformer                8   512       2048   8    NA   0.1      100K          25.5
Weighted Transformer       8   512       2048   8    8    0.3      60K           25.6
Transformer (B)            6   1024      4096   16   NA   0.3      300K          26.4
Weighted Transformer (B)   6   1024      4096   16   16   0.3      250K          27.2

Table 2: Experimental comparison between different variants of the Transformer (Vaswani et al., 2017) architecture and our proposed Weighted Transformer. Reported BLEU scores are evaluated on the English-to-German translation development set, newstest2013.

4.3 Sensitivity Analysis

In Table 2, we report sensitivity results on the newstest2013 English-to-German task. Specifically, we vary the number of layers in the encoder/decoder and compare the performance of the Weighted Transformer and the Transformer baseline. The results clearly demonstrate the benefit of the branched attention; for every experiment, the Weighted Transformer outperforms the baseline Transformer, in some cases by up to 1.1 BLEU points. As in the case of the baseline Transformer, increasing the number of layers does not necessarily improve performance; a modest improvement is seen when the number of layers $N$ is increased from 2 to 4 and 4 to 6 but the performance degrades when $N$ is increased to 8. Increasing the number of heads from 8 to 16 in configuration (A) yielded an even better BLEU score. However, preliminary experiments with $h = 16$ and $h = 32$, like in the case with $N$, degrade the performance of the model.

In Figure 3, we present the behavior of the weights $(\alpha, \kappa)$ for the second encoder layer of the configuration (C) for the English-to-German newstest2013 task. The figure shows that, in terms of relative weights, the network does prioritize some branches more than others; circumstantially by as much as $2\times$. Further, the relative ordering of the branches changes over time, suggesting that the network is not purely exploitative. A purely exploitative network, which would learn to exploit a subset of the branches at the expense of the rest, would not be preferred since it would effectively reduce the number of available parameters and limit the representational power. Similar results are seen for other layers, including the decoder layers; we omit them for brevity.

Figure 3: Convergence of the weights for the second encoder layer of Configuration (C) for the English-to-German newstest2013 task. We smoothen the curves using a mean filter. This shows that the network does prioritize some branches more than others and that the architecture does not exploit a subset of the branches while ignoring others.

4.4 Randomization Baseline

The proposed modification can also be interpreted as a form of Shake-Shake regularization proposed in Gastaldi (2017). In this regularization strategy, random weights are sampled during forward and backward passes for weighing the various branches in a multi-branch network. During test time, they are weighed equally. In our strategy, the weights are learned instead of being sampled randomly. Consequently, no changes to the model are required during test time.

In order to better understand whether the network benefits from the learned weights or if, at test time, random or uniform weights suffice, we propose the following experiment: the weights for the Weighted Transformer, including $(\alpha, \kappa)$, are trained as before, but, during test time, we replace them with (a) randomly sampled weights, and (b) $1/M$ where $M$ is the number of incoming branches. In Table 3, we report experimental results on the configuration (C) of the Weighted Transformer on the English-to-German newstest2013 data set (see Table 2 for details regarding the configuration). It is evident that random or uniform weights cannot replace the learned weights during test time. Preliminary experiments suggest that a Shake-Shake-like strategy where the weights are sampled randomly during training also leads to inferior performance.
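The two test-time replacements can be sketched as simple weight generators. How the random weights were sampled is not stated in the text, so the scheme below (uniform draws renormalized to sum to one) is an assumption, as are the function names and the choice of 8 branches.

```python
import random


def uniform_weights(num_branches):
    """Every branch receives the same weight, 1 / num_branches."""
    return [1.0 / num_branches] * num_branches


def random_weights(num_branches, rng):
    """Assumed scheme: uniform draws renormalized to sum to one."""
    raw = [rng.random() for _ in range(num_branches)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
uniform = uniform_weights(8)
sampled = random_weights(8, rng)
```

Either list could be substituted for the learned branch weights at inference time to reproduce the spirit of this ablation.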

Weights    BLEU
Learned    24.8
Random     21.1
Uniform    23.4

Table 3: Performance of the architecture with random and uniform normalization weights on the newstest2013 English-to-German task for configuration (C). This shows that the learned weights of the Weighted Transformer are crucial to its performance.

4.5 Gating

In order to analyze whether a hard (discrete) choice through gating will outperform our normalization strategy, we experimented with using gates instead of the proposed concatenation-addition strategy. Specifically, we replaced the summation in Equation (7) by a gating structure that sums up the contributions of the top $k$ branches with the highest probabilities. This is similar to the sparsely-gated mixture of experts model in Shazeer et al. (2017). Despite significant hyper-parameter tuning of $k$ and $M$, we found that this strategy performs worse than our proposed mechanism by a large margin. We hypothesize that this is due to the fact that the number of branches is low, typically less than 16. Hence, sparsely-gated models lose representational power due to reduced capacity in the model. We plan to investigate the setup with a large number of branches and sparse gates in future work.
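A hard gate of this kind can be sketched as keeping only the highest-weighted branches in the sum. The details below are assumptions: the kept branches retain their original weights without renormalization, and the function name `top_k_gate` and the toy values are hypothetical.

```python
def top_k_gate(branch_outputs, weights, k):
    """Sum the weighted outputs of the k branches with the largest weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept = ranked[:k]
    dim = len(branch_outputs[0])
    # Branches outside the top k contribute nothing to the output.
    return [sum(weights[i] * branch_outputs[i][j] for i in kept)
            for j in range(dim)]


# Three toy branches; only the two with the largest weights are kept.
outputs = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
weights = [0.2, 0.5, 0.3]
gated = top_k_gate(outputs, weights, k=2)
```

With few branches, discarding even one removes a sizeable share of the available capacity, which is consistent with the hypothesis stated above.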

5 Conclusions

We present the Weighted Transformer that trains faster and achieves better performance than the original Transformer network. The proposed architecture replaces the multi-head attention in the Transformer network by multiple self-attention branches whose contributions are learned as a part of the training process. We report numerical results on the WMT 2014 English-to-German and English-to-French tasks and show that the Weighted Transformer improves the state-of-the-art BLEU scores by 0.5 and 0.4 points respectively. Further, our proposed architecture trains faster than the baseline Transformer. Finally, we present evidence suggesting the regularizing effect of the proposal and emphasize that the relative improvement in BLEU score is observed across various hyper-parameter settings for both small and large models.