Machine translation (MT) is a technique that uses machine learning to translate text from one language to another. Machine translation has evolved greatly in recent years, allowing users to understand text in a foreign language without any knowledge of the language itself Qiu et al. (2021).
Classical statistical machine translation (SMT) systems were initially employed for the MT task. They rely on separate lossy components, such as word aligners, translation rule extractors, and other feature extractors Dabre et al. (2020). There was no robust end-to-end model for the translation task.
MT using neural methods, referred to as neural machine translation (NMT), has become the dominant paradigm in academic and commercial research Dabre et al. (2020). This paradigm has proven very successful at learning features from data, resulting in a remarkable breakthrough in the field of MT He et al. (2021). Its success is largely attributed to the use of distributed representations of language that enable end-to-end training of an MT system. The dominant NMT approach in the last few years is Embed - Encode - Attend - Decode.
State-of-the-art NMT models are almost entirely based on the attention mechanism. Attention models make it easier to relate input sequence units regardless of their spatial or temporal distance. Additionally, they make it easier to parallelize sequence data processing He et al. (2021).
The Transformer model, proposed by Vaswani et al. (2017), represents an important contribution to natural language processing (NLP) in general and to NMT in particular. It is classified as a branch of sequential neural networks, offering a faster processing speed than recurrent neural networks (RNNs) Katharopoulos et al. (2020). The Transformer is an attention-based model with no recurrent or convolutional layers. Because of this, it has a very straightforward and nearly linear structure, which explains its rapid processing speed Qiu et al. (2021).
The main contributions of this paper include the following: (1) Investigating and carefully describing the different components constituting the Transformer model, (2) Testing the Transformer model with the popular German-English dataset from the 2013 Workshop on Statistical Machine Translation (WMT’13) WMT (2013), and (3) Proposing the inclusion of a general-domain dataset from the 2016 International Workshop on Spoken Language Translation (IWSLT’16) Roedder (2016) in the training dataset.
The rest of the paper is organized as follows: Section 2 presents the literature review, Section 3 describes the Transformer model in detail, Section 4 illustrates the dataset collection and preprocessing, Section 5 describes the experimental setup and evaluation results, and finally, Section 6 provides a summary and conclusion including possible future work.
2 Literature Review
MT follows the same pattern as other NLP fields in which deep learning became the predominant approach, here called NMT Dabre et al. (2020). In modern-day research, Transformer-based models are predominantly used in MT systems Qiu et al. (2021). They show excellent performance for many language pairs and vastly improve the performance of NMT models He et al. (2021). These models are being used for various MT tasks including the well-known WMT and IWSLT tasks.
In this paper, we adopt a neural approach based on the vanilla Transformer model by Vaswani et al. (2017). However, there have been numerous attempts to enhance the basic Transformer encoder-decoder architecture, as stated by Dabre et al. (2020). Garg et al. (2019) present an approach to train the Transformer model to produce both accurate translations and word alignments. Gordon and Duh (2020) explore the usage of general-domain data together with in-domain data for domain adaptation of the Transformer model. Xia et al. (2019) propose sharing the weights and model parts of both the encoder and decoder of the Transformer model. Moreover, Abrishami et al. (2020) utilize a weighted combination of layers’ primary input and output of the previous layers in the multi-layer encoder and decoder of the Transformer model. Liu et al. (2020) show that it is feasible to build standard Transformer-based models with up to 60 encoder layers, achieving new state-of-the-art benchmark results on the WMT English-German translation task. Araabi and Monz (2020) show that the effectiveness of the Transformer model under low-resource conditions is highly dependent on the hyperparameter settings. Bahar et al. (2020) make use of synthetic data to fine-tune the encoder and decoder models for domain adaptation. Qiu et al. (2021) propose an enhanced Transformer model for fast streaming translation. In the context of obtaining low latencies, Pham et al. (2020) employ the Transformer model but implement relative attention in the attention blocks, following the work by Dai et al. (2019), which takes into account the relative distances between words instead of using their absolute position vectors.
Inspired by some of these works, we include a general-domain dataset from IWSLT in the training data to improve the Transformer model's performance on the WMT dataset.
3 Transformer Model
3.1 Attention Mechanism
The attention mechanism deployed by Vaswani et al. (2017) is at the core of the Transformer model and relies on a few basic vector operations. It processes an input sentence to construct contextualized embeddings that better represent its true meaning and the surrounding context. This mechanism is essentially a scaled dot-product self-attention, as shown in Figure 1 by Futrzynski. It can be broken down into the following:
Keys, Queries, and Values.
The input sentence tokens are converted into embeddings of size 512. Using learned positional encodings, which have the same embedding size of 512, information about the absolute positions of the tokens is included. The two embeddings are then summed. For modeling complex relationships between token embeddings, they are represented as keys, queries, and values. Each key, query, and value has its own linear projection layer. Through these projections, we can focus on the aspects of the embeddings that matter the most in the relationship.
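As a rough illustration of this input step, the sketch below sums a token embedding with a positional embedding. It is a minimal sketch under stated assumptions, not the implementation used for the experiments: the names `token_table`, `position_table`, and `embed` are hypothetical, the tables are randomly initialized stand-ins for learned parameters, and the embedding size is reduced from 512 to 8 for readability.

```python
import random

random.seed(0)  # reproducible stand-in values

EMBED_SIZE = 8   # the paper uses 512; reduced here for readability
VOCAB_SIZE = 20  # hypothetical toy vocabulary size
MAX_LEN = 10     # hypothetical maximum sequence length


def random_table(rows, cols):
    """Stand-in for a learned embedding table."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(VOCAB_SIZE, EMBED_SIZE)
position_table = random_table(MAX_LEN, EMBED_SIZE)


def embed(token_ids):
    """Sum each token embedding with the embedding of its absolute position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


# Token 3 appears twice; its two input vectors differ because the positions differ.
embedded = embed([3, 7, 3])
```

Because the positional term differs per position, the same token receives a different input vector at each place in the sentence, which is what lets the attention layers distinguish word order.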
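The scalar dot product between the projections of keys and queries is calculated as shown in Equation 1, which follows the formulation of Vaswani et al. (2017). The dot product is typically scaled down for numerical stability, where $d_k$ is the dimension of the key projections.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$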
The projections of values corresponding to every input token are taken in proportions according to the results of the softmax function. This results in new contextualized embeddings which are more representative of the input tokens.
For example, in the term "river bank", the query of the river token is strongly related to the key of the bank token, so the value of the bank token contributes largely to the contextualized embedding of the river token. The final output embeddings therefore depend on the surrounding context.
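The scaled dot-product attention described above can be sketched in plain Python. This is a minimal illustration that assumes toy two-dimensional projections with made-up values; the function names `softmax` and `scaled_dot_product_attention` are hypothetical and are not taken from any particular implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Contextualized embedding: weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: two tokens with made-up 2-dimensional projections.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, and each output vector is a mixture of the value vectors weighted by how strongly the corresponding query matches each key.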
As shown in Figure 2 by Futrzynski, multi-head attention means that the projections of keys, queries, and values are split into heads, eight in our case, resulting in projection embeddings of size 64. Each of these projections (heads) focuses on calculating different types of relationships between the tokens and creating the corresponding contextualized embeddings. The contextualized embeddings from each attention head are concatenated to form the final output of the multi-head attention layer. Typically, multiple layers of multi-head attention are used in practice to achieve the best results.
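The head arithmetic can be made concrete with a short sketch: a projection of size 512 split into 8 heads yields chunks of size 64, and concatenating the chunks restores the original size. The helper names `split_heads` and `concat_heads` are hypothetical, and the per-head attention computation that would run between the two steps is omitted here.

```python
def split_heads(vector, num_heads):
    """Split one projected embedding into equally sized per-head chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("embedding size must be divisible by the number of heads")
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]


def concat_heads(heads):
    """Concatenate per-head vectors back into a single embedding."""
    return [x for head in heads for x in head]


# A stand-in projection of size 512, split into 8 heads of size 64.
embedding = [float(i) for i in range(512)]
heads = split_heads(embedding, 8)
restored = concat_heads(heads)
```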
3.2 Encoder
The encoder consists of a stack of identical layers, three in our case. Figure 3 shows the structure of one encoder layer. Each layer contains two sub-layers. The input is passed first into a multi-head self-attention layer with a padding mask. This mask ensures that the attention mechanism ignores the tokens used for padding.
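One common way to realize such a mask, shown here as a sketch under assumptions rather than as the exact mechanism used in this work, is to set the attention scores of padded positions to negative infinity before the softmax so that they receive zero weight. The padding id `PAD_ID` and the helper names are hypothetical.

```python
import math

PAD_ID = 0  # hypothetical id of the padding token


def padding_mask(token_ids, pad_id=PAD_ID):
    """True where attention is allowed, False at padding positions."""
    return [tok != pad_id for tok in token_ids]


def mask_scores(scores, mask):
    """Replace scores at masked positions with -inf."""
    return [s if keep else float("-inf") for s, keep in zip(scores, mask)]


def softmax(scores):
    """Softmax that assumes at least one position is not masked."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A sentence of two real tokens followed by two padding tokens.
token_ids = [5, 9, PAD_ID, PAD_ID]
raw_scores = [0.2, 1.0, 0.7, 0.1]  # made-up attention scores for one query
attn_weights = softmax(mask_scores(raw_scores, padding_mask(token_ids)))
```

The padded positions end up with a weight of exactly zero, so their value vectors do not contribute to the contextualized embedding.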
Next, the output is passed through a position-wise fully connected feed-forward layer. Each of these sub-layers is followed by a dropout layer, and a residual connection is added before going through layer normalization.
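The wrapping of each sub-layer can be sketched as LayerNorm(x + Sublayer(x)). The sketch below is illustrative only: the names `layer_norm` and `add_and_norm` are hypothetical, dropout is omitted because it acts as the identity at inference time, and the learnable gain and bias of layer normalization are left out for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    y = sublayer(x)  # dropout would be applied to y during training
    return layer_norm([a + b for a, b in zip(x, y)])


# Toy sub-layer standing in for attention or the feed-forward network.
normed = add_and_norm([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * a for a in v])
normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
```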
3.3 Decoder
The decoder is composed of a stack of identical layers, also three in our case. Figure 3 shows the structure of a single decoder layer. Its structure is similar to that of the encoder. In addition to the two sub-layers in the encoder layer, the decoder adds a third sub-layer in between, which performs multi-head attention over the encoder’s output.
First, the input is processed by a multi-head attention layer with a look-ahead mask that prevents the decoder from attending to subsequent positions. The output is then passed through the next sub-layer, which performs multi-head attention on the encoder output. The output is then passed through a position-wise fully connected feed-forward layer.
As with the encoder, dropout is used and residual connections are employed around each of the sub-layers, followed by layer normalization. Note that the output embeddings are offset by one position. Combined with the look-ahead mask, this ensures that the predictions for time step (position) $i$ can depend only on the known outputs at time steps less than $i$.
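The look-ahead mask and the one-position offset can be illustrated as follows. This is a minimal sketch with hypothetical helper names, and the start-of-sequence id `start_id` is an assumption, since the special tokens used are not specified here.

```python
def look_ahead_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


def shift_right(target_ids, start_id):
    """Decoder input: the target sequence offset by one position."""
    return [start_id] + target_ids[:-1]


mask = look_ahead_mask(4)
# Hypothetical target token ids; 1 is an assumed start-of-sequence id.
decoder_input = shift_right([11, 12, 13, 14], start_id=1)
```

With the shifted input, the decoder sees only the start token when predicting the first target token, and the lower-triangular mask keeps every later position from attending to tokens that follow it.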
Finally, to obtain the output probabilities across the vocabulary, the output of the decoder stack is passed through a linear layer and a softmax activation function.
3.4 Translation Process
In order to translate a sentence from a source language to a target language, the input tokens pass through an embedding layer before being added to their corresponding absolute positional embeddings. The sum goes through the encoder stack, each layer of which contains a multi-head attention layer and a position-wise fully connected feed-forward layer.
In the meantime, the 3-layer decoder stack receives as input the translated sequences in the target language. These input tokens are passed through the embedding and positional encoding layers, whose outputs are added together. They then pass through the first multi-head attention layer.
This layer’s outputs, along with the encoder’s outputs, are passed along to a second multi-head attention layer inside the decoder layer, which is followed by a position-wise fully connected feed-forward layer. After the output passes through the 3-layer decoder stack, it passes through a linear layer and a softmax activation function to determine the output probabilities. The token that corresponds to the maximum probability in the softmax layer is produced as an output at each time step to generate the final translated sentence.
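Selecting the maximum-probability token at each time step amounts to greedy decoding, which can be sketched as below. The callable `next_token_probs` is a hypothetical stand-in for the full encoder-decoder model, and `start_id`, `end_id`, and `max_len` are assumed parameters introduced only for this illustration.

```python
def greedy_decode(next_token_probs, start_id, end_id, max_len):
    """Generate tokens by repeatedly taking the most probable next token.

    next_token_probs(prefix) must return a list of probabilities over the
    vocabulary for the token that follows the given prefix.
    """
    output = [start_id]
    for _ in range(max_len):
        probs = next_token_probs(output)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        output.append(next_id)
        if next_id == end_id:
            break
    return output


def toy_model(prefix):
    """Toy 4-token 'model' that always prefers (last token + 1) modulo 4."""
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[(prefix[-1] + 1) % 4] = 0.7
    return probs


decoded = greedy_decode(toy_model, start_id=0, end_id=3, max_len=10)
```

Greedy decoding is the simplest strategy consistent with the description above; alternatives such as beam search keep several candidate prefixes instead of one.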
4 Dataset
The German-English dataset used in this project is from the 2013 Workshop on Statistical Machine Translation (WMT’13) datasets WMT (2013). These datasets come from the workshop organizers and are used for the competition on MT tasks. Among these datasets, the German-English news commentary dataset is used in our work. There are also updated versions from WMT after 2013, but the WMT’13 datasets are used by many other works in the literature.
The news commentary dataset originally consists of about 178,000 sentence pairs that contain the same sentence in both German and English. The exact statistics of the dataset can be found in Table 1.
|No. of Samples||178,793||179,011|
|No. of Samples Preprocessed||178,221||178,221|
|No. of Samples Cleaned||177,145||177,145|
|No. of Samples Unique||176,742||176,742|
In our experiments, we use the following data processing steps: (i) the data is preprocessed using the Moses data preprocessing tool Koehn et al. (2007), which normalizes punctuation marks and performs word segmentation, (ii) the data is cleaned by removing empty sentences, deleting sentences that are obviously not aligned, and keeping only sentences with a sequence length from 1 to 80, and (iii) finally, duplicate sentences are removed to obtain unique sentences. Afterward, the data is split into train, validation, and test sets. The statistics of the splits can be found in Table 2.
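Steps (ii) and (iii) can be approximated with a few lines of Python. This sketch is not the Moses tooling itself: the function name `clean_parallel_corpus` is hypothetical, sequence length is assumed to be measured in whitespace-separated tokens, and the check for obviously misaligned pairs is omitted because its exact criterion is not given above.

```python
def clean_parallel_corpus(pairs, min_len=1, max_len=80):
    """Filter (source, target) sentence pairs.

    Drops pairs with an empty side, pairs whose token count on either side
    falls outside [min_len, max_len], and exact duplicate pairs (the first
    occurrence is kept).
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if not (min_len <= src_len <= max_len and min_len <= tgt_len <= max_len):
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Made-up example pairs: one valid, one empty, one duplicate, one too long.
raw_pairs = [
    ("Guten Morgen .", "Good morning ."),
    ("", "Empty source ."),
    ("Guten Morgen .", "Good morning ."),
    ("wort " * 81, "word " * 81),
]
cleaned_pairs = clean_parallel_corpus(raw_pairs)
```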
|Split||Train||Validation||Test|
|No. of Samples||174,742||1,000||1,000|
For inclusion and usage of a general-domain dataset, we adopt the TED talks dataset from the IWSLT 2016 German-English translation task datasets Roedder (2016). We apply the same data processing steps as on the WMT’13 dataset. These steps are simply applying punctuation normalization, tokenization, and data cleaning using the Moses scripts. The dataset statistics can be found in Table 3.
|No. of Samples||196,884||196,884|
|No. of Samples Preprocessed||196,884||196,884|
|No. of Samples Cleaned||195,897||195,897|
|No. of Samples Unique||194,226||194,226|
This processing pipeline results in 194,226 sentence pairs. These pairs are used to provide additional data for the training set of WMT’13.
Why Include the IWSLT’16 Dataset.
We are motivated by the usage of a general-domain dataset to improve the Transformer model performance on the WMT’13 test set. Our hypothesis is that the general-domain nature of the TED talks data of the IWSLT’16 dataset should help the Transformer model learn a better representation of the textual context of both the source and target sentences. This benefits the system in general even if the evaluation is done only using the test set of WMT’13. Additional general-domain data is also helpful when the original dataset for the MT task is small, because it helps reduce the overfitting of the Transformer model to the original dataset.
5 Experimental Evaluation
In this section, we present our study on training the Transformer model on the WMT’13 news commentary German-English translation task dataset. We report the effects of including the IWSLT’16 TED talks German-English translation task dataset in training.
5.1 Experimental Setup
The settings we specify for the Transformer model can be found in Table 4. Most of these parameters were covered in our explanation of the Transformer model. The forward expansion factor refers to the number of times the embedding size is expanded in the first layer of each position-wise feed-forward network inside the Transformer model before it is collapsed again to the original embedding size in the second layer.
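The role of the forward expansion factor can be illustrated with a small position-wise feed-forward sketch. With an embedding size of 512 and a factor of 4, the hidden layer would have 2,048 units; the sketch uses an embedding size of 8 for readability. The helper names are hypothetical, the weights are random stand-ins for learned parameters, and the ReLU activation is an assumption taken from Vaswani et al. (2017).

```python
import random

random.seed(0)  # reproducible stand-in weights


def linear(x, weights, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def make_layer(in_size, out_size):
    """Random stand-in for a learned linear layer."""
    weights = [
        [random.uniform(-0.1, 0.1) for _ in range(in_size)] for _ in range(out_size)
    ]
    return weights, [0.0] * out_size


embed_size = 8         # the paper uses 512
forward_expansion = 4  # as listed in Table 4
hidden_size = embed_size * forward_expansion

w1, b1 = make_layer(embed_size, hidden_size)  # expand
w2, b2 = make_layer(hidden_size, embed_size)  # collapse


def feed_forward(x):
    """Expand to hidden_size, apply ReLU (assumed), collapse to embed_size."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


ffn_output = feed_forward([0.5] * embed_size)
```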
|Source (German) Vocabulary Size||137,485|
|Target (English) Vocabulary Size||56,225|
|No. of Heads||8|
|No. of Encoder Layers||3|
|No. of Decoder Layers||3|
|Max. Sequence Length||100|
|Forward Expansion Factor||4|
The hyperparameters used for training are shown in Table 5. We do not perform hyperparameter tuning for any of these parameters. We only experiment with modifying the number of epochs, as we see later in the results section.
|No. of Epochs||5|
The candidate translations produced by the Transformer model are compared against reference translations and assessed by the Bilingual Evaluation Understudy (BLEU) score. In the literature, the BLEU score is widely adopted for comparison across different MT models. The BLEU score ranges from 0 to 1; in Table 6 it is reported on the equivalent 0 to 100 scale. The aim is not to achieve a score close to 1; rather, the score should be close to that of a human translator Qiu et al. (2021). However, in our case, we focus on comparing the model trained on the WMT’13 dataset vs. the model trained on the WMT’13 plus IWSLT’16 datasets. When computing the BLEU score, the average of the n-gram precision is computed and denoted as $p_n$. A word’s count in a candidate translation is clipped to the count of the word in its corresponding reference translation. The total count is then divided by the total number of words in the candidate translation. Equation 3 shows how the BLEU score is computed, where $N$ is the maximum length of n-grams and $w_n$ is the n-th weight, with the sum of all the $w_n$ equal to 1.

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{3}$$

In the calculation of the BLEU score, the Brevity Penalty (BP) penalizes short candidate translations. The shorter the candidate translation, the higher the penalty. In Equation 2, $c$ represents the length of the candidate translation, while $r$ represents the length of the reference translation.

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases} \tag{2}$$
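The BLEU computation can be sketched for a single candidate and a single reference as follows. This is an illustrative simplification, not the evaluation script used for the reported scores: it assumes uniform weights of 1/N, applies no smoothing (the score is zero if any n-gram precision is zero), and works at the sentence level, whereas reported scores are normally computed over a whole test set. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(c, r):
    """1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; this sketch does not smooth
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_avg)


reference = "the cat is on the mat".split()
clipped_unigram = modified_precision(["the"] * 7, reference, 1)  # clipping: 2 of 7
perfect_score = bleu(reference, reference)
short_score = bleu("the cat is on".split(), reference)  # only the brevity penalty applies
```

In the last example every n-gram of the shortened candidate appears in the reference, so all precisions equal one and the score is determined solely by the brevity penalty.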
|Training Dataset||No. of Epochs||BLEU Score|
|WMT’13 + IWSLT’16||5||18.4|
|WMT’13 + IWSLT’16||20||23.1|
|WMT’13 + IWSLT’16||50||25.8|
We first investigate the effect of training the Transformer model on the WMT’13 training set, and then of including the IWSLT’16 dataset in the training. We report the results of applying these trained Transformer models to the test set of the WMT’13 dataset. We adopt the BLEU score in this assessment.
In addition, we experiment with using different numbers of epochs in the training. Table 6 compares these models in terms of the BLEU score on the test set of the WMT’13 dataset. Intuitively, the higher the number of epochs, the better the BLEU scores.
Moreover, for the same number of epochs, the model trained on WMT’13 plus IWSLT’16 always outperforms the model trained only on WMT’13, with a gain of around two BLEU points. We continue the experiments using the best model (trained on WMT’13 plus IWSLT’16) and train it for 50 epochs, achieving a maximum BLEU score of 25.8.
The results show that training on the WMT’13 dataset only is not sufficient and that including the IWSLT’16 dataset helps improve the BLEU scores. This indicates the value of including a general-domain dataset in the training data.
This finding is supported by Figure 4, which shows the progression of the training and validation losses over the number of epochs. The y-axis reflects the cross-entropy loss, and the x-axis reflects the number of epochs. The lines in dark and light red refer to the training and validation losses of the model trained on WMT’13 only, respectively, while the lines in dark and light blue refer to the training and validation losses of the model trained on the WMT’13 plus IWSLT’16 datasets, respectively.
If we investigate the validation losses of the two models, represented by the light red and light blue lines, we find that the model trained on WMT’13 plus IWSLT’16 consistently achieves a lower validation loss than the model trained on WMT’13 only. Its cross-entropy loss is about 0.5 lower for all the stated numbers of epochs. In addition, the dark red and dark blue lines, which show the training losses, demonstrate that the model trained on WMT’13 plus IWSLT’16 trains faster than the model trained on WMT’13 only. It reaches the same loss as the model trained on WMT’13 only in earlier epochs, before they both achieve nearly the same loss at the 20th epoch.
In this part, we take a closer look at the translations produced by the Transformer model. We aim to assess how it learns to translate an example sentence and how the quality of the translation improves over the number of epochs, as shown in Table 7.
The table shows an example sentence in the source language (German) and its ground truth reference translation in the target language (English). For the model trained for 5 epochs, the resulting translation is not accurate. It misses an important keyword which is "gold". On the other hand, the model trained for 50 epochs can capture this important keyword and the produced sentence overall is more accurate and relevant to the reference translation.
Another assessment compares the Transformer model trained on WMT’13 with the model trained on WMT’13 plus IWSLT’16 for the same number of epochs. The results are shown in Table 8, which uses the same example sentence and reference translation as the previous table.
The results show that the model trained on WMT’13 only gives a translation that misses the relevant phrase "than ever" from the reference sentence, although the overall meaning does not change substantially. However, the model trained on WMT’13 plus IWSLT’16 produces a more accurate translation that is more consistent with the reference sentence, and it conveys the complete meaning by including the phrase "than ever". This demonstrates the strength of including additional general-domain datasets in the training data to produce more relevant translations.
6 Conclusion
In this paper, we presented an explanation of the Transformer model and its components. In particular, we utilized it to obtain a German-English machine translation system. In our experiments, we investigated the effect of including a general-domain dataset in training and whether it helps or harms the performance of the model. The original dataset used to train and evaluate the model comes from the news commentary dataset of the WMT’13 German-English translation task. The additional general-domain dataset deployed is the IWSLT’16 TED talks German-English dataset. Our results showed that it is helpful to use an additional general-domain dataset in the training data. We observed that for the same number of epochs, the model trained with additional general-domain data achieves on average a gain of two BLEU points on the test set of the WMT’13 dataset. Possible future work can go in the direction of investigating methods to generate synthetic data to augment the general-domain dataset automatically. This should help the Transformer model obtain more relevant translations and generate higher-quality representations of the textual context of both the source and target sentences, which in turn should help with the translation task.
References
- WMT (2013) 2013. Proceedings of the Eighth Workshop on Statistical Machine Translation, WMT@ACL 2013, August 8-9, 2013, Sofia, Bulgaria. The Association for Computer Linguistics.
- Abrishami et al. (2020) Mahsa Abrishami, Mohammad Javad Rashti, and Marjan Naderan. 2020. Machine translation using improved attention-based transformer with hybrid input. In 2020 6th International Conference on Web Research (ICWR), pages 52–57. IEEE.
- Araabi and Monz (2020) Ali Araabi and Christof Monz. 2020. Optimizing transformer for low-resource neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 3429–3435. International Committee on Computational Linguistics.
- Bahar et al. (2020) Parnia Bahar, Patrick Wilken, Tamer Alkhouli, Andreas Guta, Pavel Golik, Evgeny Matusov, and Christian Herold. 2020. Start-before-end and end-to-end: Neural speech translation by apptek and RWTH aachen university. In Proceedings of the 17th International Conference on Spoken Language Translation, IWSLT 2020, Online, July 9 - 10, 2020, pages 44–54. Association for Computational Linguistics.
- Dabre et al. (2020) Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A survey of multilingual neural machine translation. ACM Comput. Surv., 53(5):99:1–99:38.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 2978–2988. Association for Computational Linguistics.
- (7) Romain Futrzynski. Self-attention: step-by-step video.
- Garg et al. (2019) Sarthak Garg, Stephan Peitz, Udhyakumar Nallasamy, and Matthias Paulik. 2019. Jointly learning to align and translate with transformer models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4452–4461. Association for Computational Linguistics.
- Gordon and Duh (2020) Mitchell A. Gordon and Kevin Duh. 2020. Distill, adapt, distill: Training small, in-domain models for neural machine translation. In Proceedings of the Fourth Workshop on Neural Generation and Translation, NGT@ACL 2020, Online, July 5-10, 2020, pages 110–118. Association for Computational Linguistics.
- He et al. (2021) Weihua He, Yongyun Wu, and Xiaohua Li. 2021. Attention mechanism for neural machine translation: A survey. In 2021 IEEE 5th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), volume 5, pages 1485–1489. IEEE.
- Katharopoulos et al. (2020) Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic. The Association for Computational Linguistics.
- Liu et al. (2020) Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao. 2020. Very deep transformers for neural machine translation. CoRR, abs/2008.07772.
- Pham et al. (2020) Ngoc-Quan Pham, Felix Schneider, Tuan-Nam Nguyen, Thanh-Le Ha, Thai Son Nguyen, Maximilian Awiszus, Sebastian Stüker, and Alexander H. Waibel. 2020. Kit’s IWSLT 2020 SLT translation system. In Proceedings of the 17th International Conference on Spoken Language Translation, IWSLT 2020, Online, July 9 - 10, 2020, pages 55–61. Association for Computational Linguistics.
- Qiu et al. (2021) Jiabao Qiu, Melody Moh, and Teng-Sheng Moh. 2021. Fast streaming translation using machine learning with transformer. In ACM SE ’21: 2021 ACM Southeast Conference, Virtual Event, USA, April 15-17, 2021, pages 9–16. ACM.
- Roedder (2016) Margit (IAR) Roedder. 2016. IWSLT2016. Archive Location: KIT Publisher: Roedder, Margit (IAR).
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
- Xia et al. (2019) Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied transformers: Neural machine translation with shared encoder and decoder. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 5466–5473. AAAI Press.