Deep neural networks, which allow for end-to-end training, typically consist of an encoder and a decoder coupled via an attention mechanism. Whereas the first deep models used stacked recurrent neural networks (RNNs) (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015) in the encoder and decoder, the recent Transformer model (Vaswani et al., 2017) constitutes the current state-of-the-art approach, owing to its better context generation mechanism via multi-head self- and cross-attention.
Given an encoder-decoder architecture and its hyper-parameters, such as the numbers of encoder and decoder layers and the sizes of vocabularies (in the case of text-based models) and hidden layers, the parameters of the model, i.e., matrices and biases for non-linear transformations, are optimized by iteratively updating them so that the loss on the training data is minimized. The hyper-parameters can also be tuned, for instance, by maximizing the automatic evaluation score on the development data. In general, however, it is not guaranteed (and is highly unlikely) that a single set of hyper-parameters satisfies diverse cost-benefit demands at the same time. For instance, in practical low-latency scenarios, it is often acceptable to sacrifice output quality for speed. Once a model has been trained, using fewer layers for faster decoding is theoretically possible. Note also that an optimal set of hyper-parameters does not guarantee the best translation for every input. Hosting multiple models simultaneously for flexible decoding is impractical, since it requires an unreasonably large amount of memory.
To this end, we propose to train multi-layer neural models referring to the output of all layers during training. Conceptually, this approach is equivalent to tying the parameters of multiple models with different numbers of layers, as illustrated in Figure 1, and is not specific to any particular type of multi-layer neural model. In this paper, however, we specifically focus on encoder-decoder models with N encoder and M decoder layers, and compress N×M models (rather than casting the encoder-decoder model into a single-column model with N+M layers) into a single model to be updated, where a total of N×M losses are computed by softmaxing the output of each of the M decoder layers, each of which attends to the output of each of the N encoder layers. Each decoder layer is thus updated with a direct signal from the overall loss, as is each encoder layer from all the decoder layers. The number of parameters of the resultant encoder-decoder model is equivalent to that of the most complex subsumed model, i.e., the one with N encoder and M decoder layers. Yet, we can now perform faster decoding using fewer encoder and decoder layers, given that the shallower layers are better trained.
In this paper, we take the case study of neural machine translation (NMT) (Cho et al., 2014; Bahdanau et al., 2015), where we focus on the numbers of encoder and decoder layers of the Transformer model (Vaswani et al., 2017), and demonstrate that it is possible to train a single model with N encoder and M decoder layers that can be used for decoding with flexibly fewer layers than N and M without appreciable quality loss. We evaluate our proposed approach on the WMT18 English-to-German translation task, and give a cost-benefit analysis of translation quality vs. decoding speed.
Although we apply our method to encoder-decoder models and evaluate it on an NMT task, it should be applicable to multi-layer neural models in general.
2 Related Work
There are studies that exploit multiple layers simultaneously. Wang et al. (2018) fused hidden representations of multiple layers in order to improve translation quality. Belinkov et al. (2017) and Dou et al. (2018) focused on identifying which encoder or decoder layers generate useful representations for different natural language processing tasks. There are also notable approaches to speeding up decoding: knowledge distillation (Hinton et al., 2015; Freitag et al., 2017), average attention networks (Xiong et al., 2018), and binary code prediction (Oda et al., 2017).
However, to the best of our knowledge, none of them has tackled the issue of training a flexible translation model.
3 Multi-Layer Softmaxing
Figure 1 gives a simple overview of the concept of multi-layer softmaxing for training a generic 4-layer model. This model takes an input, passes it through 4 layers (we make no assumptions about the nature of the layers), and then into the softmax layer to predict the output. Typically, one would apply softmax to the 4th layer only, compute the loss, and then back-propagate gradients to update the weights. Instead, we propose to apply softmax to each layer, aggregate the computed losses, and then back-propagate the aggregated loss. This ensures that during decoding we can choose any layer instead of only the topmost one.
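The idea above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the layer definitions (tanh of a linear map) and the shared output projection are our assumptions, since the paper makes no assumptions about the nature of the layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, target):
    return -np.log(probs[target])

# A toy 4-layer feed-forward "model": each layer is a linear map + tanh,
# with a shared output projection (these choices are illustrative only).
dim, n_classes = 8, 5
Ws = [rng.normal(size=(dim, dim)) for _ in range(4)]
W_out = rng.normal(size=(dim, n_classes))

def multi_layer_loss(x, target):
    """Apply softmax to EVERY layer's output and average the losses,
    instead of softmaxing only the topmost (4th) layer."""
    losses = []
    h = x
    for W in Ws:
        h = np.tanh(h @ W)               # one transformation layer
        probs = softmax(h @ W_out)       # softmax at this layer
        losses.append(cross_entropy(probs, target))
    return sum(losses) / len(losses)     # aggregate by averaging

loss = multi_layer_loss(rng.normal(size=dim), target=2)
```

Because every layer receives a direct training signal, any prefix of the stack can later be used as a stand-alone predictor.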
Extending this to a multi-layer encoder-decoder model is straightforward. In encoder-decoder models, the encoder comprises an embedding layer for the input (the source language for NMT) and stacked transformation layers. The decoder consists of an embedding layer and a softmax layer for generating the output (the target language for NMT) along with stacked transformation layers. Let X be the input to the N-layer encoder, Y the anticipated output of the M-layer decoder as well as the input to the decoder (for training), and Ŷ the output predicted by the decoder. The pseudo-code for our proposed approach is shown in Algorithm 1. Line 3 represents the process performed by the i-th encoder layer, and line 5 does the same for the j-th decoder layer. In simple words, we compute a loss using the output of each of the M decoder layers, which in turn is computed using the output of each of the N encoder layers. In line 10, the losses are aggregated (we averaged the losses in our experiment, but there are other options, such as weighted averaging) before back-propagation. Henceforth, we will refer to this as the N×M model.
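The loop structure of Algorithm 1 can be sketched as follows. This is a hedged toy version, not the actual Transformer code: the "layers" are simple tanh transformations and cross-attention is simplified to adding the encoder state, but the N×M grid of losses and their averaging mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, dim, vocab = 2, 2, 8, 6   # tiny sizes for illustration

enc_W = [rng.normal(size=(dim, dim)) for _ in range(N)]
dec_W = [rng.normal(size=(dim, dim)) for _ in range(M)]
W_out = rng.normal(size=(dim, vocab))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def nxm_losses(x, y_prev, target):
    """Collect the N*M losses: for each encoder depth i, run the decoder
    from scratch and softmax the output of each decoder depth j
    ("attention" is simplified here to adding the encoder state)."""
    losses = []
    enc = x
    for i in range(N):
        enc = np.tanh(enc @ enc_W[i])        # output of encoder depth i
        dec = y_prev
        for j in range(M):
            dec = np.tanh((dec + enc) @ dec_W[j])
            probs = softmax(dec @ W_out)     # softmax decoder depth j
            losses.append(-np.log(probs[target]))
    return losses

losses = nxm_losses(rng.normal(size=dim), rng.normal(size=dim), target=3)
loss = sum(losses) / (N * M)   # aggregate by averaging before back-propagation
```

In the real model, back-propagating this averaged loss updates each decoder layer with a direct signal and each encoder layer with signals from all decoder layers.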
For comparison, the vanilla model is formulated in Algorithm 2.
[Figure 1: 36 individual vanilla models vs. our single model]
4 Experiments

We trained the following two types of models, and evaluated them on both translation quality and decoding speed.
- Vanilla models: 36 vanilla models with 1 to 6 encoder and 1 to 6 decoder layers, each trained referring only to the last layer for computing the loss.
- N×M model: a single model with N=6 encoder and M=6 decoder layers, trained by our multi-layer softmaxing.
4.1 Datasets and Preprocessing
We experimented with the WMT18 English-to-German (En-De) translation task. We used all the parallel corpora available for WMT18 (http://www.statmt.org/wmt18/translation-task.html), except the ParaCrawl corpus, consisting of 5.58M sentence pairs as the training data, and the 2,998 sentences in newstest2018 as the test data.
The English and German sentences were pre-processed using the tokenizer.perl and lowercase.perl scripts in Moses (http://www.statmt.org/moses).
4.2 Model Training
Our multi-layer softmaxing method was implemented on top of the Transformer model (Vaswani et al., 2017) in the version 1.6 branch of the open-source tensor2tensor toolkit (https://github.com/tensorflow/tensor2tensor). For training, we used the default settings corresponding to transformer_base_single_gpu in the implementation, except for what follows. We used a shared sub-word vocabulary of 32k (determined using the internal sub-word segmenter of tensor2tensor, for simplicity) and trained the models for 300k iterations. We trained the vanilla models on 1 GPU and our model on 2 GPUs with halved batch size, to ensure that both types of models see the same amount of training data.
We averaged the last 10 checkpoints, saved after every 1k updates, and decoded the test sentences with a beam size of 4 and a length penalty, α, of 0.6. (One can realize faster decoding by narrowing the beam width; this approach is orthogonal to ours, and in this paper we do not argue which is superior.) We evaluated our models using the BLEU metric (Papineni et al., 2002) implemented in tensor2tensor as t2t_bleu: case-sensitive and detokenized BLEU. We also report the time (in seconds) consumed to translate the test set, which includes the times for model creation, loading the checkpoints, sub-word splitting and indexing, decoding, and sub-word de-indexing and merging; the time for detokenization is not taken into account.
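Checkpoint averaging, as used above, simply averages each named parameter across the saved checkpoints. A minimal sketch (the dict-of-arrays checkpoint format is our assumption; tensor2tensor's own utility handles this internally):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average model parameters across checkpoints, name by name."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

# Toy stand-ins for the last 10 saved checkpoints,
# each a dict mapping parameter names to arrays.
rng = np.random.default_rng(2)
ckpts = [{"W": rng.normal(size=(3, 3)), "b": rng.normal(size=3)}
         for _ in range(10)]
avg = average_checkpoints(ckpts)
```

Averaging smooths out the noise of the final optimization steps, which is why a development set for early stopping becomes unnecessary.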
Note that we did not use any development data, for two reasons. First, we train all models for the same number of iterations. (In our opinion, this is a fair training method because it ensures that each model sees roughly the same number of training examples.) Second, we use checkpoint averaging before decoding, so a development set for early stopping is not needed. We use this training and decoding approach because it is known to give the best results for NMT with the Transformer implementation we use (Vaswani et al., 2017).
4.3 Results

Table 2 gives the BLEU scores and Table 3 the decoding times of the models. These summarize the cost-benefit property of our model in comparison with the corresponding vanilla models. When our model was used for decoding with 5 encoder and 5 decoder layers, it achieved a BLEU score of 34.95, which is comparable with the 35.35 of the best vanilla model with a 4-layer encoder and a 6-layer decoder, even though the objective function of our proposed model is substantially more complex than that of the vanilla model. Note that the vanilla models give significantly better results than our model when using a single decoder layer. However, when the number of decoder layers is increased, there is no statistically significant difference between the performance of the vanilla models and our model; the difference is less than 1.0 BLEU point in most configurations. We have essentially compressed 36 models into one.
Regarding the cost-benefit property of our model, two points must be noted:
- BLEU score and decoding time increase only slightly when we use more encoder layers.
- The bulk of the decoding time is consumed by the decoder, since it works in an auto-regressive manner. We can substantially cut down decoding time by using fewer decoder layers, although this does lead to sub-optimal translation quality.
Consider our model used with 4 encoder and 3 decoder layers, which gives a BLEU score of 34.52. Compared to the best vanilla model (with 4 encoder and 6 decoder layers; 35.35 BLEU), it decodes 1.7 times faster (151.79s vs. 264.81s) at the cost of 0.83 BLEU points. This loss in BLEU is statistically significant, but in real-time, low-latency scenarios it is unlikely to have a negative impact on the quality of service. (Several researchers (Tan et al., 2015; Nakazawa et al., 2018) have shown that the BLEU score is often not correlated with actual translation quality as judged through human evaluation.) For instance, when choosing our model used with 6 encoder and 2 decoder layers, we lose 1.59 BLEU points, but this might not have a massive impact on human evaluation. With this configuration we can afford to decode more than twice as fast (117.05s vs. 264.81s).
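The trade-offs above follow directly from the reported figures; a quick check of the arithmetic:

```python
# Figures reported above (times in seconds on the WMT18 En-De test set).
BEST_VANILLA = {"bleu": 35.35, "time": 264.81}   # vanilla, 4 enc / 6 dec
OURS_4_3 = {"bleu": 34.52, "time": 151.79}       # our model, 4 enc / 3 dec
OURS_6_2_TIME = 117.05                           # our model, 6 enc / 2 dec

# Speed-up and BLEU cost of the 4x3 configuration vs. the best vanilla model.
speedup_4_3 = BEST_VANILLA["time"] / OURS_4_3["time"]
bleu_cost_4_3 = round(BEST_VANILLA["bleu"] - OURS_4_3["bleu"], 2)

# The 6x2 configuration trades more BLEU for decoding over twice as fast.
speedup_6_2 = BEST_VANILLA["time"] / OURS_6_2_TIME
```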
One may argue that training a single vanilla model with optimal numbers of encoder and decoder layers is enough. However, as discussed in Section 1, it is impossible to know a priori which configuration is best. More importantly, a single vanilla model can neither satisfy diverse cost-benefit demands nor guarantee the best translation for every input (see Section 5.3). Recall that we aim at a flexible model: all the results in Table 2 have been obtained using our single model, albeit with different numbers of encoder and decoder layers for decoding.
5 Analysis and Discussion
In this section, to better understand the nature of our proposed method, we analyze our model from the perspectives of training time, model size, and decoding behavior, in comparison with the vanilla models.
5.1 Training Time
All our models were trained for 300k iterations. We thus compare the training times of two models by comparing the time in seconds required to complete 100 iterations of training computations (the time reported by tensor2tensor by default). By this measure, the training time of our model was about 9.5 times that of the vanilla model with 6 encoder and 6 decoder layers. In contrast, the total training time of all 36 individually trained subsumed vanilla models was 25.54 times that of the vanilla model with 6 encoder and 6 decoder layers. (We measured the elapsed time for a fair comparison; we assumed that all individual models are trained on a single GPU one after another, even though one may be able to use 36 GPUs to train the 36 subsumed models in parallel.) This total is calculated by adding the times required to complete 100 iterations of training computations for each individual vanilla model. Consequently, our proposed method of training an N×M model is computationally much more efficient than independently training all the 36 subsumed models with different numbers of layers.
5.2 Model Size
Our proposed method trains a model whose number of parameters is exactly the same as that of a vanilla model with N encoder and M decoder layers. If we instead train a set of separate models with different numbers of encoder and decoder layers, we end up with significantly more parameters. For instance, in the case of N=M=6 in our experiment, we have 26.45 times more parameters: a total of 5,500M for the 36 subsumed models against 207M for our model.
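The source of the blow-up is easy to see with a back-of-the-envelope model of parameter counts. The per-component numbers below are purely hypothetical (NOT the actual Transformer-base figures); only the structure of the computation matters.

```python
# Hypothetical parameter counts in millions (illustrative numbers only).
P_SHARED = 30           # embeddings + output softmax, independent of depth
P_ENC, P_DEC = 12, 18   # per encoder layer / per decoder layer (assumed)

def params(n_enc, n_dec):
    """Parameters of a model with n_enc encoder and n_dec decoder layers."""
    return P_SHARED + n_enc * P_ENC + n_dec * P_DEC

single = params(6, 6)   # one tied 6x6 model: a single set of parameters
separate = sum(params(n, m)          # 36 separate models duplicate the
               for n in range(1, 7)  # shared parts and all shallow layers
               for m in range(1, 7))
ratio = separate / single
```

With these toy numbers the separate models already need over 20 times the parameters of the single tied model, which is the same order as the 26.45× factor observed in our experiment.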
5.3 Decoding Behavior
Figure 2 gives the distribution of the test sentences that were best translated (oracle translations) by different combinations of encoder and decoder layers during decoding with the vanilla models and ours. We observed the following.
- Using the N×M model, around 50% of the test set is best translated using 1 to 2 encoder and decoder layers, despite the low corpus-level BLEU scores with these configurations.
- In contrast, among the 36 individual models, those with 1 to 2 encoder and 1 to 2 decoder layers give the best translation for only 30% of the test set. To cover 50% of the test set, we have to consider the models with up to 3 encoder and 3 decoder layers.
- The distribution of best-performing combinations is quite sharp for the N×M model, unlike for the individual vanilla models.
Currently, we do not have an explanation for the difference in behavior between our model and the vanilla models in terms of the distribution of optimal layer combinations. However, it is clear that our model can essentially do what 36 individually trained models can do.
In addition, if we could predict an appropriate layer combination for decoding each given input, we could automatically decode with a variable number of layers and save a significant amount of computation. We leave further analyses and the design of such a layer-choosing mechanism for future work.
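The oracle analysis above amounts to picking, per sentence, the (encoder, decoder) combination with the best sentence-level score. A minimal sketch with hypothetical scores (the real quantity would be a sentence-level quality metric; the predictor that would replace this oracle is future work):

```python
# Toy sentence-level scores per (enc, dec) layer combination
# (hypothetical numbers, not real experimental data).
scores = {
    "sent-0": {(1, 1): 20.0, (2, 2): 25.0, (6, 6): 24.0},
    "sent-1": {(1, 1): 31.0, (2, 2): 29.0, (6, 6): 30.5},
}

def oracle_combination(sent_scores):
    """Pick the (enc, dec) combination with the best score for a sentence."""
    return max(sent_scores, key=sent_scores.get)

best = {sid: oracle_combination(s) for sid, s in scores.items()}
```

Replacing the oracle with a learned predictor of the best combination would allow the saved computation without knowing the reference translation.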
6 Conclusion

In this paper, we have proposed a novel procedure for training encoder-decoder models, in which we softmax the output of each of the M decoder layers, derived using the output of each of the N encoder layers. This compresses N×M models into a single model that can be used for decoding with a variable number of encoder (up to N) and decoder (up to M) layers. This model can be used in different latency scenarios and hence is highly versatile. We have experimented with NMT as a case study of encoder-decoder models and given a cost-benefit analysis of our method.
In our future work, we will conduct an in-depth analysis of the nature of our models, such as the diversity of the hypotheses generated by different layers. We will focus on approaches for automatically choosing layer combinations depending on the input, thereby saving decoding time by performing the minimal number of computations required to obtain the best output. For further speed-up in decoding as well as model compaction, we plan to combine our approach with other techniques, such as those mentioned in Section 2. Although we have only tested our idea on NMT, it should be applicable to other tasks based on deep neural networks.
References

- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA.
- Belinkov et al. (2017) Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Taipei, Taiwan.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar.
- Dou et al. (2018) Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Shuming Shi, and Tong Zhang. 2018. Exploiting deep representations for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4253–4262, Brussels, Belgium.
- Freitag et al. (2017) Markus Freitag, Yaser Al-Onaizan, and Baskaran Sankaran. 2017. Ensemble distillation for neural machine translation. CoRR, abs/1702.01802.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.
- Nakazawa et al. (2018) Toshiaki Nakazawa, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Win Pa Pa, Isao Goto, Hideya Mino, Katsuhito Sudoh, and Sadao Kurohashi. 2018. Overview of the 5th workshop on Asian translation. In Proceedings of the 5th Workshop on Asian Translation, Hong Kong, China.
- Oda et al. (2017) Yusuke Oda, Philip Arthur, Graham Neubig, Koichiro Yoshino, and Satoshi Nakamura. 2017. Neural machine translation via binary code prediction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 850–860, Vancouver, Canada.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Philadelphia, USA.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th Neural Information Processing Systems Conference, pages 3104–3112, Montréal, Canada.
- Tan et al. (2015) Liling Tan, Jon Dehdari, and Josef van Genabith. 2015. An awkward disparity between BLEU / RIBES scores and human judgements in machine translation. In Proceedings of the 2nd Workshop on Asian Translation, pages 74–81, Kyoto, Japan.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 30th Neural Information Processing Systems Conference, pages 5998–6008, Long Beach, USA.
- Wang et al. (2018) Qiang Wang, Fuxue Li, Tong Xiao, Yanyang Li, Yinqiao Li, and Jingbo Zhu. 2018. Multi-layer representation fusion for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3015–3026, Santa Fe, USA.
- Xiong et al. (2018) Deyi Xiong, Biao Zhang, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Long Papers, pages 1789–1798, Melbourne, Australia.