Multi-Layer Softmaxing during Training Neural Machine Translation for Flexible Decoding with Fewer Layers

08/27/2019 ∙ by Raj Dabre, et al. ∙ 0

This paper proposes a novel procedure for training an encoder-decoder based deep neural network which compresses NxM models into a single model, enabling us to dynamically choose the number of encoder and decoder layers for decoding. Usually, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute softmax loss. Instead, our method computes a single loss consisting of NxM losses: the softmax loss for the output of each of the M decoder layers derived using the output of each of the N encoder layers. A single model trained by our method can be used for decoding with an arbitrarily smaller number of encoder and decoder layers. In practical scenarios, this (a) enables faster decoding with insignificant losses in translation quality and (b) alleviates the need to train NxM models, thereby saving space. We take neural machine translation as a case study, show the advantages of our approach, and give a cost-benefit analysis.




1 Introduction

Deep neural networks, which allow for end-to-end training, typically consist of an encoder and a decoder coupled via an attention mechanism. Whereas the very first deep models used stacked recurrent neural networks (RNN) Sutskever et al. (2014); Cho et al. (2014); Bahdanau et al. (2015) in the encoder and decoder, the recent Transformer model Vaswani et al. (2017) constitutes the current state-of-the-art approach, owing to its better context generation mechanism via multi-head self- and cross-attentions.

Given an encoder-decoder architecture and its hyper-parameters, such as the number of encoder and decoder layers and the sizes of vocabularies (in the case of text-based models) and hidden layers, the parameters of the model, i.e., the matrices and biases for non-linear transformations, are optimized by iteratively updating them so that the loss on the training data is minimized. The hyper-parameters can also be tuned, for instance, by maximizing an automatic evaluation score on development data. In general, however, it is not guaranteed (and indeed highly unlikely) that a single set of hyper-parameters satisfies diverse cost-benefit demands at the same time. For instance, in practical low-latency scenarios, it is often acceptable to sacrifice output quality for speed. Once a model has been trained, using fewer layers for faster decoding is theoretically possible. Note also that an optimal set of hyper-parameters does not guarantee the best translation for every input. Hosting multiple models simultaneously for flexible decoding is impractical, since it requires an unreasonably large amount of memory.

(a) Multiple tied-layer vanilla models.
(b) Collapsing tied layers into one.
Figure 1: The general concept of multi-layer softmaxing for training multi-layer neural models, with an example of a 4-layer model. Figure 1(a) depicts our idea in the form of multiple vanilla models whose layers are tied together. Figure 1(b) shows the result of collapsing all tied layers into a single layer. The red lines indicate the flow of gradients; hence the lowest layer in the stack receives the most updates.

To this end, we propose to train multi-layer neural models referring to the output of all layers during training. Conceptually, this approach is equal to tying the parameters of multiple models with different numbers of layers, as illustrated in Figure 1, and is not specific to any particular type of multi-layer neural model. In this paper, however, we specifically focus on encoder-decoder models with N encoder and M decoder layers, and compress NxM models (rather than casting the encoder-decoder model into a single-column model with (N+M) layers) into a single model, where a total of NxM losses are computed by softmaxing the output of each of the M decoder layers, with each decoder layer attending to the output of each of the N encoder layers. Each decoder layer is thus updated referring to a direct signal from the overall loss, and so is each encoder layer from all the decoder layers. The number of parameters of the resultant encoder-decoder model is equivalent to that of the most complex subsumed model, with N encoder and M decoder layers. Yet, we can now perform faster decoding using fewer encoder and decoder layers, given that the shallower layers are better trained.

In this paper, we take the case study of neural machine translation (NMT) Cho et al. (2014); Bahdanau et al. (2015), where we focus on the numbers of encoder and decoder layers of the Transformer model Vaswani et al. (2017), and demonstrate that it is possible to train a single model with N encoder and M decoder layers that can be used for decoding with flexibly fewer layers than N and M without appreciable loss in quality. We evaluate our proposed approach on the WMT18 English-to-German translation task and give a cost-benefit analysis of translation quality vs. decoding speed.

Although we apply our method to encoder-decoder models and evaluate it on an NMT task, the method should potentially be applicable to any general multi-layer neural models.

2 Related Work

There are studies that exploit multiple layers simultaneously. Wang et al. (2018) fused hidden representations of multiple layers in order to improve the translation quality. Belinkov et al. (2017) and Dou et al. (2018) focused on identifying which encoder or decoder layer can generate useful representations for different natural language processing tasks. There are also notable approaches for speeding up decoding: knowledge distillation Hinton et al. (2015); Freitag et al. (2017), average attention networks Xiong et al. (2018), and binary code prediction Oda et al. (2017).

However, to the best of our knowledge, none of them has tackled the issue of training a flexible translation model.

3 Multi-Layer Softmaxing

Figure 1 gives a simple overview of the concept of multi-layer softmaxing for training a generic 4-layer model. This model takes an input, passes it through 4 layers (we make no assumptions about the nature of the layers), and then into the softmax layer to predict the output. Typically, one would apply softmax to the 4th layer only, compute the loss, and then back-propagate gradients to update the weights. Instead, we propose to apply softmax to each layer, aggregate the computed losses, and then back-propagate the aggregated loss. This ensures that during decoding we can choose any layer instead of only the topmost layer.
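As a concrete illustration, the following pure-NumPy sketch applies a softmax loss to every layer of a toy 4-layer stack and averages the losses. All layer and loss definitions here are hypothetical stand-ins (not the paper's implementation), and the backward pass is omitted for brevity:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target):
    """Negative log-likelihood of the gold class."""
    return -np.log(probs[target] + 1e-12)

rng = np.random.default_rng(0)
num_layers, dim, num_classes = 4, 8, 5
# One toy weight matrix per layer, plus a shared output projection.
layers = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(num_layers)]
out_proj = rng.normal(size=(dim, num_classes)) * 0.1

x = rng.normal(size=dim)   # a single input vector
target = 2                 # gold output class

# Vanilla training would softmax only the 4th layer's output.
# Multi-layer softmaxing instead computes one loss per layer and
# aggregates them, so every layer receives a direct training signal.
h, losses = x, []
for W in layers:
    h = np.tanh(h @ W)                  # one generic layer
    probs = softmax(h @ out_proj)       # softmax THIS layer's output
    losses.append(cross_entropy(probs, target))

aggregated_loss = np.mean(losses)       # simple average, as in the paper
```

At decoding time one can then stop at any layer's softmax output, not just the topmost one.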

Extending this to a multi-layer encoder-decoder model is straightforward. In encoder-decoder models, the encoder comprises an embedding layer for the input (source language for NMT) and N stacked transformation layers. The decoder consists of an embedding layer and a softmax layer for generating the output (target language for NMT) along with M stacked transformation layers. Let X be the input to the N-layer encoder, Y the anticipated output of the M-layer decoder as well as the input to the decoder (for training), and Ŷ the predicted output of the decoder. The pseudo-code for our proposed approach is shown in Algorithm 1. Line 3 represents the process performed by the i-th encoder layer (1 ≤ i ≤ N), and line 5 does the same for the j-th decoder layer (1 ≤ j ≤ M). In simple words, we compute a loss using the output of each of the M decoder layers, which in turn is computed using the output of each of the N encoder layers. In line 10, the losses are aggregated (we averaged the multiple losses in our experiment, but there are a number of options, such as weighted averaging) before back-propagation. Henceforth, we will refer to this as the NxM model.

For a comparison, the vanilla model is formulated in Algorithm 2.

 1: H_0 ← EncoderEmbedding(X); S_0 ← DecoderEmbedding(Y)
 2: for i in 1 to N do
 3:     H_i ← EncoderLayer_i(H_{i-1})
 4:     for j in 1 to M do
 5:         S_j ← DecoderLayer_j(S_{j-1}, H_i)
 6:         Ŷ_{i,j} ← Softmax(S_j)
 7:         L_{i,j} ← Loss(Ŷ_{i,j}, Y)
 8:     end for
 9: end for
10: L ← Aggregate({L_{i,j} : 1 ≤ i ≤ N, 1 ≤ j ≤ M})
11: Back-propagate using L
Algorithm 1 Training an NxM model
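The NxM double loop of Algorithm 1 can be sketched in a few lines of NumPy. This is a toy forward pass only: the sizes are small, a real decoder layer would attend over the encoder states (here the encoder state is simply added in), and all function and variable names are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, dim, vocab = 3, 2, 8, 6        # toy sizes, not the paper's 6x6

enc_W = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(N)]
dec_W = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(M)]
out_proj = rng.normal(size=(dim, vocab)) * 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

x = rng.normal(size=dim)             # stand-in for the embedded source
y_emb = rng.normal(size=dim)         # stand-in for the embedded target
target = 3                           # gold output token id

losses = {}
h = x
for i in range(N):                           # each encoder depth (line 2)
    h = np.tanh(h @ enc_W[i])                # i-th encoder layer (line 3)
    s = y_emb
    for j in range(M):                       # each decoder depth (line 4)
        # a real decoder layer would cross-attend over h; we just add it
        s = np.tanh(s @ dec_W[j] + h)        # j-th decoder layer (line 5)
        probs = softmax(s @ out_proj)        # softmax (line 6)
        losses[(i + 1, j + 1)] = -np.log(probs[target] + 1e-12)

loss = np.mean(list(losses.values()))        # aggregate all N*M losses
```

Decoding with fewer layers then amounts to stopping the loops early at any (i, j) pair.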
              36 individual vanilla models         |            Our single NxM model
Dec\Enc     1      2      3      4      5      6   |    1      2      3      4      5      6
   1     27.07  30.25  31.63  31.61  31.48  32.11  | 24.24  28.85  30.24  30.55  30.91  30.93
   2     29.12  32.05  32.74  32.94  33.09  32.81  | 27.11  31.63  33.00  33.36  33.61  33.76
   3     29.64  32.64  33.36  33.92  34.09  33.80  | 28.31  32.79  34.11  34.52  34.73  34.64
   4     30.29  33.61  34.33  34.44  34.16  34.39  | 28.96  33.30  34.48  34.76  34.81  34.76
   5     31.00  33.94  34.37  35.27  34.08  34.94  | 29.19  33.47  34.52  34.71  34.95  34.89
   6     31.48  34.07  34.31  35.35  34.71  34.87  | 29.35  33.61  34.52  34.61  34.91  34.87
Table 1: BLEU scores for the WMT EnDe task (rows: number of decoder layers; columns: number of encoder layers). The scores on the left are for the 36 individual vanilla models trained separately. The scores on the right are for our proposed single NxM model.

Dec\Enc     1       2       3       4       5       6
   1      95.86   95.47   92.31   94.22   95.38   94.30
   2     110.85  114.44  114.82  116.28  116.57  117.05
   3     148.61  155.78  153.35  151.79  157.17  155.39
   4     181.99  182.00  192.75  198.58  198.48  195.85
   5     214.49  223.05  225.72  223.57  245.10  241.72
   6     247.78  257.47  265.33  264.81  259.94  264.76
Table 2: Decoding time (in seconds) for the different layer configurations (rows: number of decoder layers; columns: number of encoder layers). Given that a vanilla model with n encoder and m decoder layers and our model used with n encoder and m decoder layers involve the same amount of computation, we show only one set of decoding times.

4 Experiments

We trained the following two types of models and evaluated them on both translation quality and decoding speed.

Vanilla models: 36 vanilla models with 1 to 6 encoder and 1 to 6 decoder layers, each trained referring only to the last layer for computing the loss.

NxM model: A single model with N = 6 encoder and M = 6 decoder layers, trained by our multi-layer softmaxing.

4.1 Datasets and Preprocessing

We experimented with the WMT18 English-to-German (EnDe) translation task. We used all the parallel corpora available for WMT18, except the ParaCrawl corpus, consisting of 5.58M sentence pairs as the training data, and the 2,998 sentences in newstest2018 as the test data.

The English and German sentences were pre-processed using the tokenizer.perl and lowercase.perl scripts in Moses.

 1: H_0 ← EncoderEmbedding(X); S_0 ← DecoderEmbedding(Y)
 2: for i in 1 to N do
 3:     H_i ← EncoderLayer_i(H_{i-1})
 4: end for
 5: for j in 1 to M do
 6:     S_j ← DecoderLayer_j(S_{j-1}, H_N)
 7: end for
 8: Ŷ ← Softmax(S_M)
 9: L ← Loss(Ŷ, Y)
10: Back-propagate using L
Algorithm 2 Training a vanilla model

4.2 Model Training

Our multi-layer softmaxing method was implemented on top of an open-source toolkit of the Transformer model Vaswani et al. (2017), the version 1.6 branch of tensor2tensor. For training, we used the default model settings corresponding to transformer_base_single_gpu in the implementation, except for what follows. We used a shared sub-word vocabulary of 32k (we determined the sub-word vocabularies using the internal sub-word segmenter of tensor2tensor, for simplicity) and trained the models for 300k iterations. We trained the vanilla models on 1 GPU and our model on 2 GPUs with halved batch size, to ensure that both models see the same amount of training data.

We averaged the last 10 checkpoints, saved after every 1k updates, and decoded the test sentences with a fixed beam size of 4 and a length penalty, α, of 0.6. (One can realize faster decoding by narrowing the beam width; this approach is orthogonal to ours, and in this paper we do not argue which is superior.) We evaluated our models using the BLEU metric Papineni et al. (2002) implemented in tensor2tensor as t2t_bleu: case-sensitive and detokenized BLEU. We also report the time (in seconds) consumed to translate the test set, which includes the times for model creation, loading the checkpoints, sub-word splitting and indexing, decoding, and sub-word de-indexing and merging, whereas the time for detokenization is not taken into account.

Note that we did not use any development data, for two reasons. First, we train all models for the same number of iterations (in our opinion, this is a fair training method because it ensures that each model sees roughly the same number of training examples). Second, we use checkpoint averaging before decoding, so a development set for early stopping is not needed. We use this training and decoding approach because it is known to give the best results for NMT with the Transformer implementation we use Vaswani et al. (2017).
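Checkpoint averaging itself is a simple element-wise mean over saved parameter sets. A minimal NumPy sketch (with hypothetical parameter names, not tied to tensor2tensor's actual checkpoint format):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average each named parameter across the saved checkpoints."""
    names = checkpoints[0].keys()
    return {n: np.mean([ckpt[n] for ckpt in checkpoints], axis=0)
            for n in names}

rng = np.random.default_rng(2)
# Ten toy "checkpoints", each a dict of parameter arrays, standing in
# for the last 10 checkpoints saved every 1k updates.
ckpts = [{"W": rng.normal(size=(4, 4)), "b": rng.normal(size=4)}
         for _ in range(10)]
averaged = average_checkpoints(ckpts)   # parameters used for decoding
```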

4.3 Results

Table 1 gives the BLEU scores and Table 2 gives the decoding times of the models. These summarize the cost-benefit properties of our model in comparison with the corresponding vanilla models. When our model was used for decoding with 5 encoder and 5 decoder layers, it achieved a BLEU score of 34.95, which is comparable with the BLEU score of 35.35 of the best vanilla model with a 4-layer encoder and 6-layer decoder, even though the objective function for our proposed model is substantially more complex than the one for the vanilla model. Note that the vanilla models give significantly better results than our model when using a single decoder layer. However, when the number of decoder layers is increased, there is no statistically significant difference between the performance of the vanilla models and our model; the difference is less than 1.0 BLEU points in most configurations. We have essentially compressed 36 models into one.

Regarding the cost-benefit property of our model, two points must be noted:

  • BLEU score and decoding time increase only slightly when we use more encoder layers.

  • The bulk of the decoding time is consumed by the decoder, since it works in an auto-regressive manner. We can substantially cut down decoding time by using fewer decoder layers, although this does lead to somewhat sub-optimal translation quality.

Consider our model used with 4 encoder and 3 decoder layers, which gives a BLEU of 34.52. Compared to the best vanilla model (with 4 encoder and 6 decoder layers; 35.35 BLEU), it decodes 1.7 times faster (151.79s vs. 264.81s) at the cost of 0.83 BLEU points. This loss in BLEU is statistically significant, but in real-time low-latency scenarios it need not degrade the quality of service (several researchers (Tan et al., 2015; Nakazawa et al., 2018) have shown that BLEU score is often not correlated with actual translation quality as judged through human evaluation). For instance, when choosing our model used with 6 encoder and 2 decoder layers, we lose 1.59 BLEU points, but this might not have a massive impact on human evaluation. As such, we can choose this configuration and afford to decode more than twice as fast (117.05s vs. 264.81s).
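The quoted trade-offs can be recomputed directly from the entries of Tables 1 and 2:

```python
# Decoding times (s) and BLEU scores from Tables 1 and 2 for the
# configurations discussed in the text.
best_vanilla_time, best_vanilla_bleu = 264.81, 35.35   # vanilla, 4 enc / 6 dec
ours_4x3_time, ours_4x3_bleu = 151.79, 34.52           # our model, 4 enc / 3 dec
ours_6x2_time, ours_6x2_bleu = 117.05, 33.76           # our model, 6 enc / 2 dec

speedup_4x3 = best_vanilla_time / ours_4x3_time        # ≈ 1.7x faster
speedup_6x2 = best_vanilla_time / ours_6x2_time        # ≈ 2.3x faster
bleu_drop_4x3 = best_vanilla_bleu - ours_4x3_bleu      # 0.83 BLEU points
bleu_drop_6x2 = best_vanilla_bleu - ours_6x2_bleu      # 1.59 BLEU points
```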

(a) 36 individual vanilla models.
(b) Our single NxM model.
Figure 2: Distribution of oracle translations among the 36 combinations of encoder and decoder layers for WMT EnDe (2,998 sentences).

One may argue that training a single vanilla model with the optimal numbers of encoder and decoder layers is enough. However, as discussed in Section 1, it is impossible to know a priori which configuration is the best. More importantly, a single vanilla model cannot satisfy diverse cost-benefit demands and cannot guarantee the best translation for every input (see Section 5.3). Recall that we aim at a flexible model: all the results in Table 1 have been obtained using our single model, albeit with different numbers of encoder and decoder layers for decoding.

5 Analysis and Discussion

In this section, to better understand the nature of our proposed method, we analyze our model from the perspectives of training time, model size, and decoding behavior, in comparison with vanilla models.

5.1 Training Time

All our models were trained for 300k iterations. We thus compare training times between two models by comparing the time in seconds required to complete 100 iterations of training computations (this is the time reported by tensor2tensor by default). As such, the training time for our model was about 9.5 times that of the vanilla model with 6 encoder and 6 decoder layers. In contrast, the total training time for all 36 individually subsumed vanilla models was 25.54 times that of the vanilla model with 6 encoder and 6 decoder layers (we measured the total elapsed time for a fair comparison, assuming that all individual models are trained on a single GPU one after another, even though one could use 36 GPUs to train the 36 subsumed models in parallel). Note that this time is calculated by adding the times required to complete 100 iterations of training computations for all individual vanilla models. Consequently, our proposed method of training an NxM model is computationally much more efficient than independently training all 36 subsumed models with different numbers of layers.
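The comparison reduces to a quick calculation over the relative costs reported above:

```python
# Per-100-iteration training cost, relative to a single 6x6 vanilla model.
ours_relative = 9.5        # our single 6x6 multi-layer-softmax model
all_36_relative = 25.54    # 36 subsumed vanilla models, trained one by one

# How much cheaper our single training run is than training all 36 models.
efficiency_gain = all_36_relative / ours_relative   # ≈ 2.7x
```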

5.2 Model Size

Our proposed method trains a model whose number of parameters is exactly the same as that of a vanilla model with N encoder and M decoder layers. If we instead train a set of separate models with different numbers of encoder and decoder layers, we end up with significantly more parameters. For instance, in the case of N = M = 6 in our experiment, we have 26.45 times more parameters: a total of about 5,500M for the 36 subsumed models against 207M for our model.

5.3 Decoding Behavior

Figure 2 gives the distribution of the test sentences that were best translated (oracle translations) by different combinations of encoder and decoder layers during decoding with the vanilla models and our model. We observed the following.

  • Using the NxM model, around 50% of the test set is best translated using 1 to 2 encoder and decoder layers, despite the low corpus-level BLEU scores with these configurations.

  • In contrast, among the 36 individual models, those with 1 to 2 encoder and 1 to 2 decoder layers give the best translation for only 30% of the test set. To cover 50% of the test set, we have to consider the models with up to 3 encoder and 3 decoder layers.

  • The distribution of best-performing combinations is quite sharp for the NxM model, unlike for the individual vanilla models.

Currently, we do not have an explanation for the difference in behavior between our model and the vanilla models in terms of the distribution of optimal layer combinations. However, it is clear that our model can essentially do what 36 individually trained models can do.

In addition, if we can predict an appropriate layer combination for decoding each given input, we can automatically decode with a variable number of layers and save a significant amount of computation. We leave further analyses and the design of such a layer-choosing mechanism for future work.

6 Conclusion

In this paper, we have proposed a novel procedure for training encoder-decoder models, where we softmax the output of each of the decoder layers derived using the output of each of the encoder layers. This compresses NxM models into a single model that can be used for decoding with a variable number of encoder (up to N) and decoder (up to M) layers. This model can be used in different latency scenarios and hence is highly versatile. We have experimented with NMT as a case study of encoder-decoder models and given a cost-benefit analysis of our method.

In our future work, we will make an in-depth analysis of the nature of our models, such as the diversity of hypotheses generated by different layers. We will focus on approaches to automatically choose layer combinations depending on the input, thereby saving decoding time by performing the minimal number of computations needed to obtain the best output. For further speed-up in decoding as well as model compaction, we plan to combine our approach with other techniques, such as those mentioned in Section 2. Although we have only tested our idea on NMT, it should be applicable to other tasks based on deep neural networks.