Large and deep Transformer models have dominated machine translation (MT) tasks in recent years Vaswani et al. (2017); Edunov et al. (2018); Wang et al. (2019); Raffel et al. (2020). Despite their high accuracy, these models are inefficient and difficult to deploy Wang et al. (2020a); Hu et al. (2021); Lin et al. (2021b). Many efforts have been made to improve the translation efficiency, including efficient architectures Li et al. (2021a, b), quantization Bhandare et al. (2019); Lin et al. (2020), and knowledge distillation Li et al. (2020); Lin et al. (2021a).
This work investigates efficient Transformer architectures and optimizations specialized for different hardware platforms. In particular, we study Transformer models with deep encoders and shallow decoders and optimize them for both GPUs and CPUs. Starting from an ensemble of three deep Transformer teacher models, we train various student models via sequence-level knowledge distillation (Skd) Hinton et al. (2015); Kim and Rush (2016); Li et al. (2021a) and data augmentation Shen et al. (2020). We find that using a deep encoder (6 layers) and a shallow decoder (1 layer) gives reasonable speed improvements while maintaining high translation quality. We further improve the student models' efficiency by removing unimportant components, including the FFN sub-layers and the multi-head mechanism. We also explore other model-agnostic optimizations, including graph optimization, dynamic batching, parallel pre/post-processing, 8-bit matrix multiplication on CPUs, and 16-bit computation on GPUs.
Section 2 describes the training procedures of the deep teacher models. Then, Section 3 presents various optimizations for reducing model size and improving performance and efficiency. Finally, Section 4 details the accuracy and efficiency results of our submissions for the shared efficiency task.
2 Model Overview
Following Hu et al. (2020), Li et al. (2021a) and Lin et al. (2021a), we use the Skd method to train our models. Our experiments also show that Skd obtains better performance than word-level knowledge distillation (Wkd), similar to Kim and Rush (2016). Therefore, all student models are optimized using the interpolated Skd method Kim and Rush (2016) and trained on data generated from the teacher models.
2.1 Deep Transformer Teacher Models
Recently, researchers have explored deeper models to improve translation quality Wang et al. (2019); Li et al. (2020); Dehghani et al. (2019); Wang et al. (2020b). Inspired by these works, we employ deep Transformers as teacher models. More specifically, we train three teachers with different configurations: Deep-30, Deep-12-768, and Skipping Sublayer-40. We also utilize Li et al. (2019)'s ensemble strategy to boost the teachers.
Deep-30 Transformer Model:
We set the number of encoder layers to 30 in the Transformer model. Other hyper-parameters are identical to the vanilla Transformer.
Deep-12-768 Transformer Model:
This model sets the number of encoder layers, the hidden size, and the embedding size to 12, 3,072 and 768, respectively, making the Transformer deeper and wider. Other hyper-parameters are the same as the vanilla Transformer.
Skipping Sublayer-40 Transformer Model:
This model uses a simple training procedure that samples one streaming configuration in each iteration Li et al. (2021a). The number of encoder layers is 40, and other setups are the same as Li et al. (2021a).
We adopt the relative position representation (RPR) Shaw et al. (2018) to further improve the teacher models and set the key’s relative length to 8.
2.2 Lightweight Transformer Student Models
Although the ensemble teacher model delivers excellent performance, our goal is to learn lightweight models. The natural idea is to compress knowledge from an ensemble into the lightweight model using knowledge distillation Hinton et al. (2015). We employ sequence-level knowledge distillation on the ensemble teacher model described in Section 2.1.
Sequence-level Knowledge Distillation
Skd trains the student model to mimic the teacher's behavior at the sequence level; the method considers the sequence-level distribution specified by the model over all possible sequences $\mathcal{T}$. Following Kim and Rush (2016), the loss function of the Skd method for training students is

$$\mathcal{L}_{\textsc{Skd}} = -\sum_{t \in \mathcal{T}} \mathbb{1}\{t = \hat{y}\}\, \log p(t \mid s) = -\log p(\hat{y} \mid s),$$

where $\mathbb{1}\{\cdot\}$ is the indicator function, $\hat{y}$ is the output of the teacher model using beam search, $s$ symbolizes the source sentence, and $p(t \mid s)$
denotes the conditional probability. We use the ensemble teacher model to generate multiple translations of the raw English sentences. In particular, we collect the 5-best list for each sentence, along with the original target, to create the synthetic training data. However, we select only 12 million synthetic sentence pairs to train our student models to reduce training costs; we find that the student models do not achieve better performance when the amount of training data is increased further.
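The distillation data construction above can be sketched as follows; `teacher_nbest` is a hypothetical stand-in for the ensemble's beam search (our systems actually use Fairseq's generation tools):

```python
def build_skd_corpus(sources, teacher_nbest, n_best=5, budget=12_000_000):
    """Pair each source sentence with the teacher's n-best translations,
    stopping once the synthetic-pair budget is reached."""
    corpus = []
    for src in sources:
        for hyp in teacher_nbest(src, n_best):  # teacher beam-search outputs
            corpus.append((src, hyp))
            if len(corpus) >= budget:
                return corpus
    return corpus
```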
Fast Student Models
As suggested in Hu et al. (2020), the decoder is the bottleneck of translation efficiency. Hence, we accelerate decoding by reducing the number of decoder layers and removing the multi-head mechanism (although the multi-head mechanism does not increase the number of model parameters, it brings non-negligible computational costs). Inspired by Hu et al. (2021), we design a lightweight Transformer student model with one decoder layer. We further remove the multi-head mechanism from the decoder's attention modules. Table 1 shows that a Transformer student model with fewer decoder layers and decoder attention heads can achieve translation quality similar to the baseline. Therefore, we train four different student models based on the Transformer architecture with one decoder layer and a single decoder attention head. These student models are described in detail in Table 2. Moreover, experiments show that adding encoder layers beyond 12 does not improve performance. Therefore, our submissions have at most 12 encoder layers.
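For illustration, single-head attention is simply scaled dot-product attention without the head split/merge reshapes of the multi-head variant. A minimal NumPy sketch (not the NiuTensor implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(q, k, v):
    """Scaled dot-product attention with a single head: with the head
    count set to 1, the split/merge reshapes and transposes of multi-head
    attention disappear, while the projection sizes stay the same."""
    d_k = q.shape[-1]
    scores = (q / np.sqrt(d_k)) @ k.transpose(0, 2, 1)  # (batch, tgt, src)
    return softmax(scores) @ v
```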
2.3 Data and Training Details
Our data is constrained by the conditions of the WMT 2021 English-German news translation task (https://www.statmt.org/wmt21/translation-task.html), and we use the same data filtering method as Zhang et al. (2020). After filtering all officially released parallel datasets (without the official synthetic datasets), we select 20 million sentence pairs to train our teacher models. The data is tokenized with Moses Koehn et al. (2007) and jointly byte-pair encoded (BPE) Sennrich et al. (2016) with 32K merge operations using a shared vocabulary. After decoding, we remove the BPE separators and de-tokenize with Moses Koehn et al. (2007).
Teacher Models Training
We train three teacher models with Fairseq Ott et al. (2019), using newstest19 as the development set. We share the source-side and target-side embeddings with the decoder output weights. We use the Adam optimizer Kingma and Ba (2015), together with gradient accumulation due to the high GPU memory footprint. Each model is trained on 8 TITAN V GPUs for up to 11 epochs. The learning rate is decayed based on the inverse square root of the update number after 16,000 warm-up steps, and the maximum learning rate is 0.002. After training, we average the last five checkpoints for each model. Similar to Zhang et al. (2020), we train our teacher models with a round of back-translation using 12 million monolingual sentences selected from News crawl and News Commentary. We train three De-En models with the same method and model setup to generate the pseudo-data. Table 3 shows the results of all teacher models and their ensemble, where we report SacreBLEU Post (2018) and the model size. Our final ensemble teacher model achieves a BLEU score of 33.4 on newstest20.
Student Models Training
The training settings for the student models are the same as for the teacher models, except that the learning rate is 7e and the warm-up is 8,000 updates. In addition, we use the cutoff method Shen et al. (2020) (https://github.com/stevezheng23/fairseq_extension/tree/master/examples/translation/augmentation) to boost our student models, and we train them for 21 epochs. Table 2 shows the results of all student models. Our student models yield significant speedups (2–2.6×) with a modest sacrifice in BLEU (0.2–0.9 on newstest20).
2.4 Interpretation of Results
After training the final student models, we evaluate their BLEU scores on the English-German newstest20, newstest19, and newstest18 before any inference optimization. The results show that the student models achieve performance very similar to the teachers. For instance, the Student-12-1-512 model loses only 0.2 BLEU compared to the ensemble of teacher models.
3 Optimizations for Decoding
Our optimizations for decoding are implemented in NiuTensor (https://github.com/NiuTrans/NiuTensor). The optimizations can be divided into three parts: optimizations for CPUs, optimizations for GPUs, and device-independent techniques.
3.1 Optimizations for GPUs
For the GPU-based decoding, we mainly explore dynamic batching and FP16 inference.
Unlike on CPUs, the easiest way to reduce translation time on GPUs is to increase the batch size within a certain range. We implement a dynamic batching scheme that maximizes the number of sentences in a batch while limiting the total number of tokens. Compared to a fixed batch size, this strategy significantly accelerates inference when the sequences are short.
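A minimal sketch of such a dynamic batching scheme; the default caps below mirror the GPU sbatch/wbatch settings reported later, and the actual NiuTensor logic may differ:

```python
def dynamic_batches(sentences, max_sents=3072, max_tokens=64000):
    """Greedily fill each batch with as many sentences as possible while
    capping both the sentence count and the padded token count. Sorting
    by length (longest first) keeps padding waste low."""
    batches, cur = [], []
    for sent in sorted(sentences, key=len, reverse=True):
        longest = max(len(sent), len(cur[0]) if cur else 0)
        if cur and (len(cur) + 1 > max_sents
                    or (len(cur) + 1) * longest > max_tokens):
            batches.append(cur)  # batch full: flush and start a new one
            cur = []
        cur.append(sent)
    if cur:
        batches.append(cur)
    return batches
```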
Since the Tesla A100 GPU supports FP16 computation, our systems execute almost all operations in 16-bit floating point. To avoid overflow, we convert the data type before and after the softmax operation in the attention modules. We also reorder some operations for numerical stability. For instance, we apply the scaling operation (division by $\sqrt{d_k}$) to the query instead of the attention weights. To accelerate our systems further, we replace the vanilla layer normalization with the L1-norm variant Lin et al. (2020). Also, we find that removing the multi-head mechanism (by setting the number of heads to 1) in the student models significantly improves the throughput without performance loss.
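The L1-norm layer normalization mentioned above replaces the standard deviation with the mean absolute deviation, avoiding the squares and square roots that are costly and overflow-prone in FP16. A rough NumPy sketch of the idea (constants and details in Lin et al. (2020) may differ):

```python
import numpy as np

def l1_layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm variant normalizing by the mean absolute deviation (L1)
    instead of the standard deviation (L2): no squaring, no sqrt."""
    mu = x.mean(axis=-1, keepdims=True)
    mad = np.abs(x - mu).mean(axis=-1, keepdims=True)  # L1 "scale"
    return gamma * (x - mu) / (mad + eps) + beta
```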
3.2 Optimizations for CPUs
We employ the Student-6-1-512 and Student-3-1-512 models as our CPU submissions. We discuss two methods to speed up decoding in our CPU systems.
The Use of MKL
We use the Intel Math Kernel Library Wang et al. (2014) to optimize our NiuTensor framework, which helps our systems make full use of the Intel architecture and extract maximum performance.
8-bit Matrix Multiplication with Packing
We implement 8-bit matrix multiplication using the open-source library FBGEMM Khudia et al. (2021). Following Kim et al. (2019), we quantize each column of the weight matrix separately with different scales and offsets. The scale and offset for the $j$-th column of the weight matrix are calculated by:

$$\mathrm{scale}_j = \frac{(\mu_j + 7\sigma_j) - (\mu_j - 7\sigma_j)}{255}, \qquad \mathrm{offset}_j = \mu_j - 7\sigma_j,$$

where $\mu_j$ and $\sigma_j$ refer to the average and standard deviation of the $j$-th column. The quantization parameters for the input matrix are calculated by:

$$\mathrm{scale} = \frac{x_{\max} - x_{\min}}{255}, \qquad \mathrm{offset} = x_{\min},$$

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the matrix, respectively. With the FBGEMM API, we also execute a packing operation that changes the layout of the matrices into a form that uses the CPU more efficiently. We pre-quantize and pre-pack all weight matrices to avoid repeated operations during inference.
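A NumPy sketch of this per-column affine quantization scheme, using a mean ± 7·stddev clipping range as in Kim et al. (2019); FBGEMM's actual kernels and packing format are not reproduced here:

```python
import numpy as np

def quantize_weight_columns(w, k=7.0):
    """Quantize each column of w to uint8 with its own scale/offset:
    column j is clipped to [mu_j - k*sigma_j, mu_j + k*sigma_j] and
    mapped linearly onto [0, 255]."""
    mu, sigma = w.mean(axis=0), w.std(axis=0)
    lo, hi = mu - k * sigma, mu + k * sigma
    scale = (hi - lo) / 255.0
    q = np.clip(np.round((np.clip(w, lo, hi) - lo) / scale), 0, 255)
    return q.astype(np.uint8), scale, lo  # lo plays the role of the offset

def dequantize(q, scale, offset):
    return q.astype(np.float32) * scale + offset
```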
3.3 Other Optimizations
Furthermore, we explore other device-independent methods to optimize our systems. These methods help our systems achieve significant speed-ups without any loss of translation quality.
Computation optimization. We prune all redundant operations and reorder some operations in the computational graph. For instance, we remove the log-softmax operation in the output layer when using greedy search. We also hoist the transpose operations out of the matrix multiplications to the beginning of decoding.
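For example, dropping the final log-softmax is safe under greedy search because log-softmax only subtracts a per-row constant (the log-sum-exp) from the logits, so the argmax over the vocabulary is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 32000))        # (batch, vocab)

# full log-softmax, computed with the numerically stable log-sum-exp
m = logits.max(axis=-1, keepdims=True)
lse = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
log_probs = logits - lse

# greedy search picks the same tokens either way
assert (logits.argmax(axis=-1) == log_probs.argmax(axis=-1)).all()
```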
Memory optimization. We reuse all possible nodes to minimize memory consumption. We also reduce the memory allocation or movement with an efficient memory pool. Moreover, we sort the source sentences in descending order of length and detect the peak memory footprint before decoding.
We use GNU Parallel Tange (2011) to perform tasks in parallel. More specifically, we split the standard input into several chunks of lines and deliver them via pipelines. This method is used to accelerate pre-processing, post-processing, and decoding on CPUs. We also find that decoding speed and memory consumption are strongly correlated with the number of lines per task. To find the best number of lines for each run, we measure the time cost of different setups against the number of lines. Figure 1 shows that 2,000 lines is a relatively good choice; under this setup, the Student-6-1-512 model can translate 100,000 sentences in 102.6s on CPUs.
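A simplified Python analogue of this chunk-and-parallelize strategy (our systems use GNU Parallel itself; `process_chunk` is a hypothetical stand-in for per-chunk pre/post-processing):

```python
from multiprocessing import Pool

def chunk(lines, size=2000):
    """Split the input into fixed-size line chunks; 2,000 lines per task
    was the sweet spot found in Figure 1."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def process_chunk(lines):
    # stand-in for tokenization / BPE / decoding on one chunk
    return [line.strip().lower() for line in lines]

def parallel_map(lines, workers=4, size=2000):
    """Process chunks in parallel and re-join the results in order."""
    with Pool(workers) as pool:
        parts = pool.map(process_chunk, chunk(lines, size))
    return [line for part in parts for line in part]
```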
Better Decoding Configurations
As mentioned above, our GPU systems use a large batch size, but the batch size on the CPU is much smaller. We use two parameters, sbatch and wbatch, to restrict the number of sentences and the number of words in a mini-batch, respectively. In our GPU systems, we set sbatch/wbatch to 3,072/64,000. For our CPU systems, the number of processes is managed by the Parallel tool, which is more efficient and accurate; we use one MKL thread per process and set sbatch/wbatch to 128/2,048.
In practicing knowledge distillation, we find that our systems are insensitive to the beam size: translation quality remains good even with greedy search, which we therefore use in all submissions.
Fast Data Preparation
We use fastBPE (https://github.com/glample/fastBPE), a faster C++ version of subword-nmt (https://github.com/rsennrich/subword-nmt), to speed up BPE processing. Moreover, we use fast-mosestokenizer (https://github.com/mingruimingrui/fast-mosestokenizer) for tokenization.
3.4 Results after Optimizations
Figure 2 plots the Student-6-1-512 model’s performance with different decoding optimizations. All results show that our optimizations can significantly speed up our system without losing BLEU. Interestingly, we observe additional improvements of 0.4/0.1 BLEU points on the GPU/CPU through decoding optimizations in all our experiments. We also measure other models after decoding optimizations and find their performance is similar to the Student-6-1-512 model.
4 Submissions and Results
Our GPU submissions are compiled with CUDA 11.2. As described in Section 2.2, we set the number of decoder layers and decoder attention heads to one in all our GPU systems. The Student-12-1-512 model gives a speedup of more than 6× on the GPU with a slight decrease of 0.2 BLEU on newstest20 compared to the deep ensemble model. We refer to this system as the Base-GPU-System in the following. We further reduce the number of encoder layers for additional acceleration. The GPU system with the Student-6-1-512 model improves translation speed by 25% with six fewer encoder layers compared to the Base-GPU-System. Our fastest GPU system consists of three encoder layers and one decoder layer; it achieves 31.5 BLEU on newstest20 and a 1.6× speedup over the Base-GPU-System. We also employ the Student-6-1-0 model to create a GPU system that achieves a 1.3× speedup over the Base-GPU-System. Our systems are compiled in the 11.2.1-devel-centos7 docker image, an NVIDIA open-source image (https://hub.docker.com/r/nvidia/cuda). We copy the executable, dependency tools, and model files to the 11.2.1-base-centos7 docker image (final submission). In this way, we ensure that all of our system docker images can be executed by the organizers successfully, and we reduce the docker image size.
For the CPU track submissions, we use the test machine, which has 18 virtual cores. Our CPU version is compiled with the static MKL library, and the executable file is 23 MiB. We also use 8-bit matrix multiplication with packing to speed up the matrix multiplications in the network. We use the Student-3-1-512 and Student-6-1-512 models in our CPU systems; they achieve 31.5 and 32.8 BLEU on newstest20, respectively. For our CPU docker images, we use the base-centos7 docker image (https://hub.docker.com/_/centos) to deploy our CPU MT systems.
Furthermore, all submissions are tested with various edge cases, including dirty data, empty input, and very long sentences. The test results show that our systems handle such exceptional inputs successfully.
Our systems for the GPU-throughput track are the fastest among all submissions. Specifically, the Student-3-1-512 system can translate about 250 thousand words per second and achieves 25.5 BLEU on newstest21, close to the performance of our teacher model on WMT21. In the CPU track, our systems are also competitive. Our fastest CPU system, built on the Student-3-1-512 model, can translate about 48 thousand words per second (wall-clock) using 36 CPU cores and achieves 25.5 BLEU. We find that further reducing the number of encoder layers in the student model yields lower BLEU scores at a similar speed in our CPU systems. Moreover, we compare the cost-efficiency of GPU and CPU decoding in terms of millions of words translated per dollar according to the official evaluation results, and find that translating on GPUs is much more cost-effective than on CPUs. Notably, our GPU system with the Student-3-1-512 model can translate 300 million words per dollar with acceptable quality. Also, all of our GPU systems have the lowest RAM consumption (about 4 GB) among all submissions according to the official tests.
5 Conclusion

We have described our systems for the WMT21 shared efficiency task. We explored various efficient Transformer architectures and optimizations specialized for both CPUs and GPUs. We showed that a lightweight decoder and proper optimizations for different hardware can significantly accelerate translation with little or no loss of quality. Our fastest GPU system, with three encoder layers and one decoder layer, is 11× faster than the deep ensemble model and loses 1.9 BLEU points.
Acknowledgments

This work was supported in part by the National Science Foundation of China (Nos. 61876035 and 61732005), the National Key R&D Program of China (No. 2019QY1801), and the Ministry of Science and Technology of the PRC (Nos. 2019YFF0303002 and 2020AAA0107900). The authors would like to thank the anonymous reviewers for their comments and suggestions.
- Bhandare et al. (2019) Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, and Vikram Saletore. 2019. Efficient 8-bit quantization of transformer neural machine language translation model.
- Dehghani et al. (2019) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Hu et al. (2020) Chi Hu, Bei Li, Yinqiao Li, Ye Lin, Yanyang Li, Chenglong Wang, Tong Xiao, and Jingbo Zhu. 2020. The NiuTrans system for WNGT 2020 efficiency task. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 204–210, Online. Association for Computational Linguistics.
- Hu et al. (2021) Chi Hu, Chenglong Wang, Xiangnan Ma, Xia Meng, Yinqiao Li, Tong Xiao, Jingbo Zhu, and Changliang Li. 2021. Ranknas: Efficient neural architecture search by pairwise ranking.
- Khudia et al. (2021) Daya Khudia, Jianyu Huang, Protonu Basu, Summer Deng, Haixin Liu, Jongsoo Park, and Mikhail Smelyanskiy. 2021. Fbgemm: Enabling high-performance low-precision deep learning inference. arXiv preprint arXiv:2101.05615.
- Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
- Kim et al. (2019) Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. 2019. From research to production and back: Ludicrously fast neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 280–288, Hong Kong. Association for Computational Linguistics.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
- Li et al. (2019) Bei Li, Yinqiao Li, Chen Xu, Ye Lin, Jiqiang Liu, Hui Liu, Ziyang Wang, Yuhao Zhang, Nuo Xu, Zeyang Wang, et al. 2019. The niutrans machine translation systems for wmt19. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 257–266.
- Li et al. (2021a) Bei Li, Ziyang Wang, Hui Liu, Quan Du, Tong Xiao, Chunliang Zhang, and Jingbo Zhu. 2021a. Learning light-weight translation models from deep transformer. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 13217–13225. AAAI Press.
- Li et al. (2020) Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen Wang, and Jingbo Zhu. 2020. Shallow-to-deep training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 995–1005, Online. Association for Computational Linguistics.
- Li et al. (2021b) Yanyang Li, Ye Lin, Tong Xiao, and Jingbo Zhu. 2021b. An efficient transformer decoder with compressed sub-layers. CoRR, abs/2101.00542.
- Lin et al. (2020) Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, and Jingbo Zhu. 2020. Towards fully 8-bit integer inference for the transformer model. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 3759–3765. ijcai.org.
- Lin et al. (2021a) Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, and Jingbo Zhu. 2021a. Weight distillation: Transferring the knowledge in neural network parameters. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2076–2088. Association for Computational Linguistics.
- Lin et al. (2021b) Ye Lin, Yanyang Li, Tong Xiao, and Jingbo Zhu. 2021b. Bag of tricks for optimizing transformer efficiency.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations.
- Shen et al. (2020) Dinghan Shen, Mingzhi Zheng, Yelong Shen, Yanru Qu, and Weizhu Chen. 2020. A simple but tough-to-beat data augmentation approach for natural language understanding and generation. arXiv preprint arXiv:2009.13818.
- Tange (2011) O. Tange. 2011. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
- Wang et al. (2014) Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel math kernel library. In High-Performance Computing on the Intel® Xeon Phi™, pages 167–188. Springer.
- Wang et al. (2020a) Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020a. Hat: Hardware-aware transformers for efficient natural language processing.
- Wang et al. (2019) Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1810–1822, Florence, Italy. Association for Computational Linguistics.
- Wang et al. (2020b) Qiang Wang, Tong Xiao, and Jingbo Zhu. 2020b. Training flexible depth model by multi-task learning for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4307–4312, Online. Association for Computational Linguistics.
- Zhang et al. (2020) Yuhao Zhang, Ziyang Wang, Runzhe Cao, Binghao Wei, Weiqiao Shan, Shuhan Zhou, Abudurexiti Reheman, Tao Zhou, Xin Zeng, Laohu Wang, et al. 2020. The niutrans machine translation systems for wmt20. In Proceedings of the Fifth Conference on Machine Translation, pages 338–345.