Neural machine translation (NMT) (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) allows for end-to-end training of a translation system without the need for word alignments, translation rules, and complicated decoding algorithms, all of which are integral to statistical machine translation (SMT) (Koehn et al., 2007). In encoder-decoder NMT models, one of the most common practices is to stack multiple recurrent (across time-steps), convolutional, or self-attentional feed-forward layers in the encoder and decoder, with each layer having its own parameters. It has been empirically shown that such stacking improves translation quality, especially in resource-rich scenarios. However, it also increases the size of the model by a significant amount.
In this paper, we propose to reduce the number of model parameters by sharing parameters across layers. In other words, our Recurrently Stacked NMT (RSNMT) model has the same size as a single-layer NMT model. We evaluate our method on several publicly available data-sets and show that an RSNMT model with 6 recurrence steps gives results comparable to a 6-layer NMT model that does not use any recurrence. The contributions of this paper are as follows:
We propose a novel modification to the NMT architecture where parameters are shared across layers which we call Recurrently Stacked NMT or RSNMT.
We use the Transformer architecture (Vaswani et al., 2017), but our method is architecture independent.
We experiment with several publicly available data-sets and empirically show the effectiveness of our approach. The language directions we experimented with are: Turkish-English and English-Turkish (WMT), Japanese-English (ALT, KFTT, GCP) and English-Japanese (GCP).
We also experimented with using back-translated corpora and show that our method further benefits from the additional data.
To the best of our knowledge, this is the first work that shows that it is possible to reduce the NMT model size by sharing parameters across layers and yet achieve results that are comparable to a model that does not share parameters across layers.
2 Related Work
The most prominent way of reducing the size of a neural model is knowledge distillation (Hinton et al., 2015), which requires first training a parent model, which can be time-consuming. The work on zero-shot NMT (Johnson et al., 2016) shows that multiple language pairs can share a single encoder and decoder without an appreciable loss in translation quality. However, that work does not consider sharing parameters across the stacked layers of the encoder or the decoder. The work on the Universal Transformer (Dehghani et al., 2018) shows that feeding the output of the multi-layer encoder (and decoder) back to itself repeatedly improves quality for English-German translation. Our method is similar, except that our RSNMT model has the same size as a 1-layer NMT model and yet approaches the translation quality of a 6-layer NMT model. We additionally show that recurrent stacking of layers can benefit from back-translated data.
3 Recurrent Stacked NMT
Figure 1 illustrates our approach. The left-hand side shows the vanilla stacking of N layers, where each layer in the neural network has its own parameters. The right-hand side shows our approach of stacking N layers using the same parameters for all layers. As a result of this sharing, the resulting neural model is technically a single-layer model in which the same layer is recurrently stacked N times. This leads to a massive reduction in the size of the model.
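As a minimal sketch of this idea (in plain NumPy, with a single residual linear map standing in for a full Transformer layer; the simplification and names are ours, not from the paper), recurrent stacking reuses one set of layer parameters at every depth step, so the parameter count stays that of a single layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def layer(x, W):
    # Stand-in for one encoder layer, with a residual connection.
    return x + np.tanh(x @ W)

def recurrently_stacked(x, W, n_steps):
    # RSNMT-style stacking: the SAME parameters W are reused at every depth.
    for _ in range(n_steps):
        x = layer(x, W)
    return x

def vanilla_stacked(x, Ws):
    # Vanilla stacking: each depth step has its own parameters.
    for W in Ws:
        x = layer(x, W)
    return x

x = rng.standard_normal(d)
W_shared = rng.standard_normal((d, d)) * 0.1
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(6)]

y_shared = recurrently_stacked(x, W_shared, 6)   # parameters: d*d
y_vanilla = vanilla_stacked(x, Ws)               # parameters: 6*d*d
```

Both variants compute a depth-6 transformation, but the recurrently stacked version carries only one sixth of the layer parameters.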
4 Experimental Settings
4.1 Data-sets and Languages
We experimented with Japanese-English translation using the Asian Language Treebank (ALT) parallel corpus (Thu et al., 2016; http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/index.html) and the Kyoto Free Translation Task (KFTT) corpus (Neubig, 2011; http://www.phontron.com/kftt), both of which are publicly available. The ALT-JE task contains 18088, 1000, and 1018 sentences for training, development, and testing, respectively. The KFTT-JE task contains 440288, 1166, and 1160 sentences for training, development, and testing, respectively. We also experimented with Turkish-English and English-Turkish translation using the WMT 2018 corpus (http://www.statmt.org/wmt18/translation-task.html), which contains 207678, 3007, and 3010 sentences for training, development, and testing, respectively. Finally, we experimented with Japanese-English and English-Japanese translation using an in-house parallel corpus called the GCP corpus (Imamura and Sumita, 2018; Imamura et al., 2018), which consists of 400000, 2000, and 2000 sentences for training, development, and testing, respectively. In addition, there are 1552475 lines of monolingual corpora for both Japanese and English, which we use for back-translation experiments.
We tokenized the Japanese sentences in the KFTT and ALT corpora using the JUMAN morphological analyzer (Kurohashi et al., 1994; http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN). We tokenized and lowercased the English sentences in KFTT and ALT using the tokenizer.perl and lowercase.perl scripts in Moses (http://www.statmt.org/moses). The GCP corpora were available to us in a pre-tokenized and lowercased form. We did not tokenize the Turkish-English data in any way (tensor2tensor has an internal tokenization mechanism, which was used for this language pair).
4.2 NMT models
We trained and evaluated the following NMT models:
A 6-layer model without any parameters shared across layers.
1-, 2-, 3-, 4-, 5-, and 6-layer models with parameters shared across all layers. These are referred to as 1, 2, 3, 4, 5, and 6 recurrently stacked NMT models.
4.3 Implementation and Model Settings
We used the open-source implementation of the Transformer model (Vaswani et al., 2017) in tensor2tensor (https://github.com/tensorflow/tensor2tensor) for all our NMT experiments, implementing our approach on the version 1.6 branch. We used the Transformer because it is the current state-of-the-art NMT model; however, our approach of sharing parameters across layers is implementation and model independent. For training, we used the default settings corresponding to transformer_base_single_gpu in the implementation and to base_model in Vaswani et al. (2017), with the exception of the sub-word vocabulary size, the number of training iterations, and the number of GPUs, which vary because we train the models to convergence. We used the tensor2tensor internal sub-word segmenter for simplicity. For the GCP corpora, we used separate 16000 sub-word vocabularies and trained all models on 1 GPU, for 60000 iterations for English-Japanese and 120000 iterations for Japanese-English. For the KFTT corpus, we used separate 16000 sub-word vocabularies and trained all models on 1 GPU for 160000 iterations. For the ALT corpus, we used separate 8000 sub-word vocabularies and trained all models on 1 GPU for 40000 iterations. For the WMT corpus, we used a joint 16000 sub-word vocabulary (to exploit cognates across Turkish and English) and trained all models on 4 GPUs for 50000 iterations.
We averaged the last 10 checkpoints and decoded the test sets with a beam size of 4; the length penalty differed between the KFTT Japanese-English experiments and the rest. We evaluate our models using the BLEU metric (Papineni et al., 2002) as implemented in tensor2tensor (t2t_bleu). To generate pseudo-parallel corpora by back-translating the GCP monolingual corpora, we used the 1-layer NMT models for decoding; to save time, we performed greedy decoding (we could translate approximately 1.5 million lines in approximately 40 minutes using 8 GPUs). For the GCP English-Japanese translation direction, we also examined what happens when a model trained with N layers of recurrence is decoded using fewer than N or more than N layers of recurrence.
5 Results and Discussion
5.1 Main Results
Refer to Table 1 for the results of the experiments using up to 6 recurrently stacked layers on the WMT, ALT, KFTT, and GCP data-sets. We observed that, regardless of the data-set, translation quality improves as the same parameters are recurrently used in a depth-wise fashion. The most surprising result is that the performance of our 6-layer recurrently stacked model, with shared parameters across all layers, approaches that of the vanilla 6-layer model without any parameter sharing. The most probable explanation of this phenomenon is that parameter sharing forces the higher layers of the NMT model to learn more complex features. Note that dropout is applied by default in the Transformer implementation we use; thus, at each stage, the same set of parameters has to make do with less reliable representations. This means that the representations at the topmost layers are very robust, which enables better translation quality. The gains in translation quality diminish as the number of recurrently stacked layers increases.
In the case of the ALT corpus, the performance trends are less clear. Although there are improvements in translation quality with each level of recurrent stacking, they are not as sharp as those observed for the other language directions. We suspect that this is because of the extremely low-resource setting, in which NMT training is highly unreliable. Nevertheless, we do not see any detrimental effects of recurrently stacking layers.
5.2 Decoding Using Different Recurrence Steps
In order to understand what each step of recurrent stacking contributes, we trained an N-layer recurrently stacked model and, during decoding, performed recurrence up to N-1 times. Refer to Table 2 for the results on GCP English-Japanese translation. It can be seen that once the NMT model has been trained to use N layers of recurrent stacking, it is unable to perform optimally with fewer than N levels of stacking at decoding time. Although this is expected, we make three crucial observations.
Firstly, the computation of the most useful, and hence most complex, features takes place at the higher levels of recurrence. For a 6-layer recurrently stacked model, using just 1 layer (no recurrence) during decoding gives a BLEU of 2.56; however, as we perform more steps of recurrence, BLEU jumps drastically. This could imply that the NMT model delays the learning of extremely complex features until the very end. Secondly, for the same model, the difference between using the full 6 layers of recurrent stacking and 5 layers is not very significant. This means that when a model is trained with many recurrent stacking steps, it is possible to use fewer steps for decoding. Thirdly, when we use more than 6 layers of recurrent stacking, the BLEU score starts dropping again (25.53 and 24.87 for 7 and 8 layers during decoding). This indicates that the model has not learned to extract complex features beyond what it was trained for. However, once the model has been trained for a non-zero number of recurrences, the drop in quality is less severe, as can be seen for a 3-layer recurrent model decoded with more than 3 layers of recurrence. In the future, we plan to train models with more than 6 layers of recurrent stacking, for both training and decoding, to identify the limits of recurrence.
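Because the shared parameters define a single function that can be unrolled to any depth, the number of recurrence steps at decoding time is a free choice. A toy sketch of this (plain NumPy, with a residual map as our own stand-in for an encoder layer) unrolls one parameter set to different depths:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def layer(x, W):
    # Stand-in for one encoder layer, with a residual connection.
    return x + np.tanh(x @ W)

def unroll(x, W, k):
    # Apply the shared parameters W for k recurrence steps.
    for _ in range(k):
        x = layer(x, W)
    return x

W = rng.standard_normal((d, d)) * 0.1  # one shared parameter set
x = rng.standard_normal(d)

# A model trained with 6 recurrence steps can still be run with any k;
# the results above show quality degrades for k < 6 and for k > 6.
outputs = {k: unroll(x, W, k) for k in (1, 3, 5, 6, 7, 8)}
```

No retraining or surgery on the model is needed to vary k, which is what makes this analysis cheap to run.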
5.3 Using Back-translated Corpora
As mentioned in the experimental section, for the GCP corpus setting, we generated pseudo-parallel corpora by translating the monolingual corpora of 1552475 lines using the 1-layer models (to translate the Japanese monolingual corpus, we used the 1-layer Japanese-English model). We added these pseudo-parallel corpora to the original parallel corpora of 400000 lines and trained all models mentioned in Section 4.2 from scratch. To compensate for the additional data, we trained both the Japanese-English and the English-Japanese models for 200000 iterations on 1 GPU. Table 3 provides the results for models with up to 6 recurrently stacked layers.
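The back-translation setup can be sketched as follows (the reverse_translate argument is a stub standing in for the 1-layer reverse-direction model; all names here are ours, for illustration only):

```python
def back_translate(mono_target, reverse_translate):
    # reverse_translate: a target->source decoder. In our setting this is
    # the 1-layer model for the reverse direction, decoded greedily for speed.
    # Each pair is (machine-translated source, genuine target sentence).
    return [(reverse_translate(t), t) for t in mono_target]

# Toy stand-ins for the corpora.
parallel = [("source sentence", "target sentence")]
monolingual_target = ["extra target 1", "extra target 2"]

pseudo_parallel = back_translate(monolingual_target, lambda t: "<mt> " + t)

# Train from scratch on the concatenation of genuine and pseudo-parallel data.
training_data = parallel + pseudo_parallel
```

The key point is that the genuine target side is always kept, so the decoder trains on clean data while the encoder sees machine-translated input.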
We can see that despite no increase in the number of parameters, the presence of back-translated data improves translation quality for both translation directions. For English-Japanese translation, the quality of a 3-layer recurrently stacked model trained with additional back-translated data matches the quality of a vanilla 6-layer model trained on the original parallel corpus of 400000 lines. Furthermore, the 6-layer recurrently stacked model beats the 6-layer model trained on the original parallel corpus of 400000 lines. It is clear that the gains from additional layers of recurrence in a low-resource scenario are much higher than the gains in a resource-rich scenario.
5.4 Parameter Reduction Due to Sharing
The number of parameters in a vanilla 6-layer Turkish-English model is 158894599, whereas the number of parameters in the recurrently stacked models is 48640519, no matter how many layers are in the stack. This corresponds to a 3.26-fold reduction in the number of parameters. Knowledge distillation could help reduce this size even further. Similarly, for English-Japanese, the recurrently stacked models are 2.12 times smaller than the vanilla 6-layer model (note that the Turkish-English models share a single matrix for the encoder embedding, decoder embedding, and softmax, and thus show greater savings in terms of parameters).
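The two reported Turkish-English totals let us back out the per-layer and shared (embedding/softmax and other non-layer) parameter counts, since the vanilla model holds six copies of the layer parameters while the recurrently stacked model holds one (a small worked derivation using only numbers stated above):

```python
vanilla_6layer = 158_894_599   # reported vanilla 6-layer model size
shared_stacked = 48_640_519    # reported recurrently stacked model size

# vanilla = non_layer + 6 * per_layer
# shared  = non_layer + 1 * per_layer
# Subtracting: vanilla - shared = 5 * per_layer
per_layer = (vanilla_6layer - shared_stacked) // 5
non_layer = shared_stacked - per_layer

# Overall size ratio, consistent with the reported 3.26-fold reduction.
reduction = vanilla_6layer / shared_stacked
```

This also makes clear why the reduction factor is below 6: the embedding and softmax parameters are identical in both models and dilute the 6-to-1 saving on layer parameters.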
6 Conclusion
In this paper, we have proposed a novel modification to the NMT architecture in which we share parameters across the layers of an N-layer model, yielding a recurrently stacked NMT model. As a result, our model has the same size as a single-layer NMT model and gives performance comparable to a 6-layer NMT model whose layers do not share parameters. This shows that it is possible to train compact NMT models without a significant loss in translation quality. We also showed that our approach benefits from pseudo-parallel (back-translated) corpora, which, when added to the original parallel corpora, lead to further improvements in translation quality. We believe that our work will promote research on techniques that rely on the reusability of parameters and hence simplify existing NMT architectures. In the future, we will perform an in-depth analysis of the limits of recurrently stacked layers, in addition to combining our method with knowledge distillation for high-performance compact NMT modeling. We also plan to experiment with more complex mechanisms for computing the recurrent information during stacking to further improve NMT performance.
References
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, USA. https://arxiv.org/pdf/1409.0473.pdf.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1724–1734. http://www.aclweb.org/anthology/D14-1179.
- Dehghani et al. (2018) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2018. Universal transformers. CoRR abs/1807.03819. http://arxiv.org/abs/1807.03819.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop. http://arxiv.org/abs/1503.02531.
- Imamura et al. (2018) Kenji Imamura, Atsushi Fujita, and Eiichiro Sumita. 2018. Enhancement of encoder and attention using target monolingual corpora in neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. Association for Computational Linguistics, pages 55–63. http://aclweb.org/anthology/W18-2707.
- Imamura and Sumita (2018) Kenji Imamura and Eiichiro Sumita. 2018. Multilingual parallel corpus for global communication plan. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Paris, France, pages 3453–3458. http://www.lrec-conf.org/proceedings/lrec2018/pdf/104.pdf.
- Johnson et al. (2016) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR abs/1611.04558. http://arxiv.org/abs/1611.04558.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Association for Computational Linguistics, Prague, Czech Republic, pages 177–180. http://www.aclweb.org/anthology/P/P07/P07-2045.
- Kurohashi et al. (1994) Sadao Kurohashi, Toshihisa Nakamura, Yuji Matsumoto, and Makoto Nagao. 1994. Improvements of Japanese morphological analyzer JUMAN. In Proceedings of the International Workshop on Sharable Natural Language. pages 22–28. https://ci.nii.ac.jp/naid/10027016015/.
- Neubig (2011) Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’02, pages 311–318. https://doi.org/10.3115/1073083.1073135.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, Cambridge, MA, USA, NIPS’14, pages 3104–3112. http://dl.acm.org/citation.cfm?id=2969033.2969173.
- Thu et al. (2016) Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Introducing the Asian language treebank (ALT). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Paris, France. http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ALT-Parallel-Corpus-20171201/ALT-O-COCOSDA.pdf.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pages 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.