Machine translation is important for news translation, biomedical translation, automatic post-editing, chatbots that understand different languages, and question answering systems (Yu et al., 2019). Machine Translation (MT) provides metrics to assess (i) translation quality given a reference translation, (ii) translation quality without access to any reference, (iii) robustness of translation, and (iv) parallel and comparable corpora tasks, which improve translations by pairing a translation model with web search for translations. MT quality has increased considerably, most notably with advances in Neural Machine Translation (NMT) (Bulté and Tezcan, 2019), which learns the mapping between source and target languages via neural networks and attention mechanisms.
Neural MT models - the Recurrent Neural Network, Long Short Term Memory, Gated Recurrent Unit, Transformer (16-layer), and Transformer-Big (more hidden units) (Wang et al., 2019; Vaswani et al., 2017) - are used for translation. These models are also used to select and prepare training data from comparable corpora for MT. Recurrent Neural Network (RNN) units stacked with 1-2 layers are sufficient for a small-dataset system, which may be used in mobile applications or embedded systems. However, systems based on multi-layer self-attention have recently shown improvements on large-scale state-of-the-art datasets.
Domingo et al. (2019) presented online learning for NMT, integrating machine translation with the user interface so that the machine continuously learns from human choices and adapts the model to a specific domain. Yu et al. (2019) presented a context-aware model for machine comprehension using an encoder, a decoder and reinforcement learning.
In this paper, an application-based corpus populated with regional vocabulary, human translations, and corresponding Google Translate translations of the email content is prepared for developing a neural machine translation model. We want to show that such models are needed in comparison to commercial general-purpose translators, e.g. Google Translate. Therefore, an RNN-based Gated Recurrent Unit model with an attention decoder is used for the University email application, which predicts the next word conditioned on the previous context words (Zhang et al., 2019). The bilingual emails, collected from University communication over a period of three years, form a dataset that is small in comparison with state-of-the-art datasets, e.g. WMT-18 (English-German).
The problem requires the context of the email content to be preserved during training on a dataset whose emails may have multiple contexts. The problem poses different challenges for Malay-to-English (ML-EN) and English-to-Malay (EN-ML) translation. The model initially developed for the problem was unable to learn the context of the source and target languages within an email, even in the presence of an attention mechanism. Thus, the problem needs more effort and a different approach when the dataset has multiple contexts. The bilingual emails are compared with ML-EN and EN-ML Google translations, respectively.
Table 1 depicts the format of the email corpus for the problem undertaken in this research. The trained model's output usually has multiple reasonable translations, even when it generates a word different from the ground-truth word, for example:
reference: Dear All
candidate 1: All in all
candidate 2: Dear all
candidate 3: Respected all
Here, candidate 1 can be treated as a potential error, whereas candidates 2 and 3 are reasonable alternatives to the reference.
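To make the comparison concrete, the clipped unigram precision at the heart of BLEU (Papineni et al., 2002) can be computed in a few lines. This is a minimal sketch with lowercased tokens, not the full BLEU with higher-order n-grams and the brevity penalty:

```python
from collections import Counter

def clipped_unigram_precision(reference, candidate):
    """Clipped unigram precision, the core ingredient of BLEU:
    each candidate word counts only up to the number of times
    it appears in the reference."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    clipped = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / max(len(candidate), 1)

reference = "dear all".split()
candidates = ["all in all", "dear all", "respected all"]
scores = {c: clipped_unigram_precision(reference, c.split())
          for c in candidates}
# "all in all" matches only one clipped occurrence of "all",
# so it scores lowest of the three candidates.
```

On this example, candidate 2 ("dear all") scores 1.0, candidate 3 scores 0.5, and candidate 1 scores lowest at 1/3, matching the intuition that it is the potential error.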
We observed that splitting the input email based on context before feeding it into the RNN encoder improved performance over Google Translate by 10-20 BLEU points, i.e. the model's translation quality improved. However, we could not address all the problems observed in Google Translate.
We improved the BLEU score over the regional vocabulary while keeping the dataset small and including multiple contexts in an email. Results show that training can succeed at the application scale, even with a small dataset and a simple model rather than a very deep one. The results indicate that application-based regional models are better.
Our contributions are the following:
- A bilingual email corpus populated with regional vocabulary, with corresponding translations from Google Translate
- A trained NMT model with a higher BLEU score than Google Translate
- Context-based results of language translations when using the NMT model
- An application based on the regional Malay language
|English (human)||Malay (human)||Malay (Google Translate)||English (Google Translate)|
|Dear Students||Pelajar yang dihormati||Pelajar yang dihormati||Dear student|
2. RNN-based NMT Model
The approach can be used in applications of NMT models. We take the RNN-based NMT model to explain the method used in this application (Bahdanau et al., 2014). Assume the source email and its observed translation can be expressed as sequences of words x = (x_1, ..., x_m) and y = (y_1, ..., y_n), respectively. The core of NMT is a sequence-to-sequence (encoder-decoder) network that generates translations. The network consists of three main parts: the encoder, the attention context vector, and the decoder.
Encoder: An encoder is a stack of recurrent units, each of which accepts a single word (element) of the input sequence, processes it, and passes its state forward. The hidden states are computed as in Equation 1 from the current input x_t, the previous state h_{t-1}, and the weights of the network W:

h_t = f(x_t, h_{t-1}; W)    (1)

The last of these states, h_m, is the final hidden state of the encoder.
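The encoder recurrence of Equation 1 can be sketched in a few lines of numpy. This is a minimal illustration using a plain tanh cell and toy dimensions as assumptions, not the paper's GRU or its actual sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3          # illustrative sizes, not the paper's

# network weights W of Equation 1
W_x = rng.normal(size=(hidden_dim, embed_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))

def encoder_step(x_t, h_prev):
    """h_t = f(x_t, h_{t-1}; W), with f = tanh for simplicity
    (the paper's model uses GRU units instead)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev)

# run over a source sequence of 5 word embeddings
source = rng.normal(size=(5, embed_dim))
h = np.zeros(hidden_dim)
annotations = []
for x_t in source:
    h = encoder_step(x_t, h)
    annotations.append(h)
annotations = np.stack(annotations)   # one hidden state per source word
```

The stacked `annotations` are the per-word source states the attention mechanism later weights; the last row is the final hidden state h_m.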
Attention: The context vector aims to encapsulate input-sequence information to assist the decoder in predicting the output sequence; it also acts as the initial hidden state of the decoder. The context vector is computed as in Equations 2, 3 and 4 from the previous decoder hidden state s_{t-1} and the source annotations h_s, with the alignment scores normalized over the source sequence in Equation 3:

e_{t,s} = a(s_{t-1}, h_s)    (2)

α_{t,s} = exp(e_{t,s}) / Σ_{s'=1..m} exp(e_{t,s'})    (3)
The source context vector is the weighted sum of all source annotations and is calculated in Equation 4:

c_t = Σ_{s=1..m} α_{t,s} h_s    (4)
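Equations 2-4 amount to a score, a softmax, and a weighted sum. A minimal numpy sketch follows; for simplicity it scores alignments with a dot product, whereas Bahdanau et al. (2014) use a small feed-forward network, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, src_len = 3, 5

annotations = rng.normal(size=(src_len, hidden_dim))  # encoder states h_s
s_prev = rng.normal(size=hidden_dim)                  # previous decoder state

# Equation 2: alignment scores e_{t,s}
# (dot-product scoring stands in for the learned alignment model a)
scores = annotations @ s_prev

# Equation 3: normalize over the source sequence with a softmax
exp = np.exp(scores - scores.max())
alpha = exp / exp.sum()

# Equation 4: context vector = weighted sum of the source annotations
context = alpha @ annotations
```

The attention weights `alpha` sum to 1 over the source positions, and `context` has the same dimensionality as a single encoder state.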
Decoder: A decoder is similar to the encoder in that it comprises many recurrent unit cells, each of which predicts one output word per time step. Each recurrent unit cell accepts the previous target state s_{t-1}, the previous output y_{t-1}, and the source context vector c_t to produce the output and the next target hidden state, as represented by Equation 5:

s_t = g(s_{t-1}, y_{t-1}, c_t)    (5)
The target hidden state in the decoder is thus computed from the previous hidden state. The probability distribution over all words in the target vocabulary produced by the decoder at any time step is computed with a softmax, as in Equation 7:

P(y_t | y_{<t}, x) = softmax(W_o s_t)    (7)
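One decoder step (Equations 5 and 7) can be sketched as follows. This simplified version omits the embedded previous output word y_{t-1} and uses a tanh cell in place of the GRU; the weights, names, and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, vocab_size = 3, 7

W_s = rng.normal(size=(hidden_dim, hidden_dim))   # acts on previous state
W_c = rng.normal(size=(hidden_dim, hidden_dim))   # acts on context vector
W_o = rng.normal(size=(vocab_size, hidden_dim))   # output projection

def decoder_step(s_prev, context):
    """Equation 5 (simplified, without y_{t-1}): next target hidden state;
    Equation 7: softmax distribution over the target vocabulary."""
    s_t = np.tanh(W_s @ s_prev + W_c @ context)
    logits = W_o @ s_t
    exp = np.exp(logits - logits.max())
    p = exp / exp.sum()               # probability over target vocabulary
    return s_t, p

s_t, p = decoder_step(rng.normal(size=hidden_dim),
                      rng.normal(size=hidden_dim))
```

At each step the word with the highest probability in `p` is the decoder's best guess, which becomes the next input at inference time.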
Thus, a sequence-to-sequence model can map sequences of varying lengths to each other.
3. Approach: Neural Machine Translation model for email application with attention decoder
The main aim of our method is to feed higher-level context, i.e. to split the input text into contextual content, so that the model's output probability distribution better matches the probability distribution of the ground-truth values. This can potentially reduce the gap between training and inference by training the model to handle the situations that will appear at test time.
We discuss two methods for using NMT in the application. To predict the p-th target word y_p, the following steps are involved in our approach:
3.1. Solely-Email-Level Selection
The Solely-Email-Level Selection approach uses the full email, which may contain multiple contexts in different paragraphs.
The source context vector is a weighted sum of all source annotations over the full email and can be calculated as in Equation 10:

c_p = Σ_{s=1..m} α_{p,s} h_s    (10)
At the p-th step, the NMT model needs the ground-truth word as the context word to predict y_p; thus we need to select a predicted word similar to the ground-truth word. We used Equations 8, 9 and 10; however, the model could not converge. We used a GRU model for translation, and it could not converge because Equations 8, 9 and 10 determine the context over the whole length of the input text, while the context may not be preserved across the different paragraphs of the input text.
3.2. Contextual-Paragraph-Level Selection
One option is to optimize the length of the input sequence such that the context is preserved in the input text.
Assume the source email text, with its sequence of words, has multiple contextual paragraphs, each with its own context vector. The objective of this approach is to split the text vector comprising the words so that each split vector is based on a single context vector.
Thus, at the p-th step the model converges. A comparison of Equation 10 with Equations 11, 12 and 13 clearly depicts the convergence of the two scenarios, further explained in Section 3.6. Figure 0(b) demonstrates the proposed implementation of the model for an application to optimize, in comparison to the implementation in Figure 0(a).
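The paragraph-level split above can be sketched in plain Python. Splitting on blank lines is a simple proxy used here as an assumption; the paper's actual contextual criterion may be richer:

```python
def split_into_contextual_paragraphs(email_text):
    """Split an email on blank lines, treating each non-empty block
    as one contextual paragraph to be encoded separately."""
    paragraphs = [p.strip() for p in email_text.split("\n\n")]
    return [p for p in paragraphs if p]

# hypothetical email with three contextual blocks
email = (
    "Dear All\n\n"
    "The seminar is rescheduled to Friday.\n\n"
    "Pelajar yang dihormati"
)
chunks = split_into_contextual_paragraphs(email)
# each chunk is fed to the encoder separately instead of the full email
```

Each chunk then has a single context, so the attention weights are computed over a short, coherent span rather than the whole email.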
3.3. Results and Analysis
The objective of this research is to develop an NMT model for English-Malay emails circulated to Universiti Brunei Darussalam teaching staff and students. International staff and students are part of the University, and therefore the communication in many emails is in both languages.
We also paired each email with a translation from Google Translate. A sample of the dataset is shown in Table 1. The table has four columns: the contents of the first and second columns are from the email used in the communication; the third column is the Google Translate Malay translation of the English email content; the fourth column is the Google Translate English translation of the Malay email content.
There are a few issues in Google Translate, as shown in the table: when translating the English 'Dear All' to Malay, Google still shows 'Dear All'. Additionally, one can see in other cases that the English and Malay translations from Google do not correspond to what was used in the email. This motivates why a dataset of real emails is important: a word-by-word translation of a sentence is not what a user wants. In many cases, translation varies from person to person since, for many sentences or words, there are different ways to communicate the same matter.
3.4. Experiment Settings
We split 131 emails into contextual paragraphs. Thus, our dataset for English Malay translation and vice versa comprises 785 contextual paragraph pairs.
We divided each email into contextual paragraphs since training a model on a full email that has multiple contextual paragraphs is error-prone. We also observed this when training the model with full emails. This indicates that a Gated Recurrent Unit machine translation model does not perform well at learning patterns over a long context: it tries to find a long-range context that is in fact not there, and thus makes errors.
The dataset dictionary comprises 1808 English words and 1628 Malay words. We set the maximum input text length to 2000 to cover most sentences or paragraphs in the training process and also to terminate the output.
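A dictionary of this kind can be built with a simple word-to-index map. The sketch below is illustrative; reserving indices 0 and 1 for the SOS/EOS markers is a common seq2seq convention assumed here, not a detail taken from the paper:

```python
SOS_TOKEN, EOS_TOKEN = 0, 1   # sequence start / end markers (assumed indices)

def build_vocab(sentences):
    """Map each distinct word to an index, skipping 0 and 1,
    which are reserved for the SOS/EOS tokens."""
    word2index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word2index:
                word2index[word] = len(word2index) + 2
    return word2index

# hypothetical mini-corpus; the real corpus has 1808 EN / 1628 ML words
sentences = ["dear students", "pelajar yang dihormati"]
vocab = build_vocab(sentences)
```

In practice one such dictionary is built per language, and unknown or over-length inputs are truncated at the 2000-word maximum.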
To train the network, we pass the input sequence through the encoder and track every output and hidden state. Thus, after passing an input sequence with the initial hidden state, we obtain the encoder outputs and the final hidden state. Passing an input sequence of five words with a hidden size of 256 produces encoder outputs of tensor size (5, 256) and a final hidden state that is a 256-dimensional tensor.
The decoder is then given the SOS token as its first input, together with the final hidden state of the encoder. During training, the next decoder input can be either the decoder's own best guess or the real target output. Using target outputs as the next input is called teacher forcing, which helps the training process converge faster. We applied teacher forcing randomly with a probability of 0.5 (Lamb et al., 2016). However, during testing or evaluation, the decoder is always given its own best guess as the next input.
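The teacher-forcing decision can be sketched as below; the helper name and the dummy words are illustrative, not from the paper:

```python
import random

random.seed(0)
TEACHER_FORCING_RATIO = 0.5   # probability used in the paper

def choose_next_input(target_word, predicted_word, training):
    """During training, feed the ground-truth target word with
    probability 0.5 (teacher forcing); otherwise, and always at
    evaluation time, feed the decoder's own best guess."""
    if training and random.random() < TEACHER_FORCING_RATIO:
        return target_word
    return predicted_word

# at evaluation time the decoder always consumes its own prediction:
# choose_next_input("yang", "dan", training=False) returns "dan"
```

Mixing the two input sources in this way stabilizes early training while still exposing the decoder to its own (possibly wrong) predictions.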
Network loss is computed from the decoder output and the target tensor. Network weights are optimized with the Stochastic Gradient Descent (SGD) optimizer using an initial learning rate of 0.01. We record the loss every 100 steps to track whether the network is learning.
3.5. Model convergence
In this section, we provide results for both translation models, namely Malay-English and English-Malay, when the emails are split into contextual paragraphs. We show results in terms of NLL loss, BLEU score (Papineni et al., 2002) and a comparison with Google Translate.
The loss curve shows how training progressed over the iterations. Figure 1(a) shows the NLL loss against the number of iterations for Malay-English: 80,000 iterations reduced the loss from 4.498 to 0.023. Figure 1(b) shows the NLL loss against the number of iterations for English-Malay: 40,000 iterations reduced the loss from 4.14 to 0.106. These graphs show that training is computationally faster for English-Malay than for Malay-English.
3.6. Contextual Variation
Figures 1(c) and 1(d) show, against the number of iterations, the variation of context for English-Malay and Malay-English. We observed that the context does not converge when the email is not split (in both Figures 1(c) and 1(d)), whereas when the email is split, the context starts at a lower value and converges to 1 as the number of iterations increases. The context in the decoder depends on the attention weights calculated at each iteration. For a paragraph with multiple contexts, the attention weights take on high values and remain there while the model fails to converge. For contextual paragraphs with split context, the attention weights start at a lower value and converge to 1, thereby bringing the predicted target sequence nearer the ground-truth value.
3.7. BLEU Score Comparison
Table 2 shows a comparison of BLEU scores between (i) the Malay-English model, (ii) the English-Malay model, (iii) Malay-English Google translation, and (iv) English-Malay Google translation.
We present the BLEU score for 100 randomly chosen paragraphs from the dataset, evaluated after the model is trained to the point where the NLL loss is negligible.
The low English-to-Malay BLEU of both our model and Google Translate indicates that the Malay language has complex language features relative to English.
The lower BLEU of Google Translate in comparison to our model indicates that application-based regional models with contextual splitting are better.
Below we present a sample output from the model. The sample shows the input text, the true value expected from the model and from Google, the text predicted by our model, and the text predicted by Google Translate.
Input: dengan itu para pensyarah pegawai kakitangan dan para pelajar dan juga alumni universiti brunei darussalam adalah amat dialu alukan untuk turut serta menjadi pembimbing bagi program ini.
Truth: therefore lecturers officers staff and students as well as university brunei darussalam alumni are welcome to participate in this program .
Pred: therefore lecturers officers staff and students as well as university brunei darussalam alumni are welcome to participate in this program .
Google: therefore the faculty staff staff and students as well as the university of brunei darussalam alumni are welcome to participate in the program .
4. Related Work
Researchers (Sutskever et al., 2014) used a five-layer deep Long Short Term Memory training model, improving the existing result on the WMT-14 dataset for EN-FR translation with vectors of fixed dimensionality. Cho et al. (2014) improved the hidden unit of the LSTM Recurrent Neural Network by dropping the previous hidden state whenever it carries irrelevant information. The reset gate and update gate collectively improve the hidden state: short-term dependencies are captured by the reset gate, while long-term dependencies are captured by the update gate. The research also incorporated phrase-pair-based dependencies to improve the model.
The authors in (Ruzsics and Samardzic, 2017) improved the Long Short Term Memory Recurrent Neural Network model by integrating it with canonical segmentation of words, which provides the standard forms of morphemes. This is an improvement over a character-level language model with morphemes. A length constraint due to segmentation is also included in the model to handle variable-length sequences.
Machine Translation has been promising, but research on state-of-the-art datasets is still in progress: reducing the noise between source and target at the sentence level, and reducing overcorrection at the word level as well as the sentence level. Zhang et al. (2019) proposed to solve the overcorrection problem by selecting the predicted word, rather than the ground truth, as the next input, defining a BLEU-based measure at the word level as well as the sentence level.
A recent problem in the application of Machine Translation is inferential machine comprehension. The inferential network (Yu et al., 2019) comprises a micro-infer cell in which a master unit reads the document to locate the end of reasoning for the question according to the question's context, a reader unit uses an attention mechanism over the reasoning operation to retrieve the content, and a writer unit writes the content to a memory cell. Terminating the mechanism cannot be optimized using backpropagation, therefore reinforcement learning is used. Another approach towards improving Neural Machine Translation is to augment the source sentences (Bulté and Tezcan, 2019) with fuzzy matches from a translation memory using similarity matching.
Papineni et al. (2002) presented BLEU, a metric for the automatic evaluation of MT. Human evaluation of a translation is thorough but expensive. The authors defined n-gram precision over a corpus by comparing the n-grams of the reference sentence with the n-grams of the candidate translations in the corpus. The BLEU score is higher when the number of n-gram matches is higher.
Tan et al. (2017) used Long Short Term Memory (LSTM) for POS tagging on a Malay language dataset. The authors compared the Weighted Finite-State Transducer, Hidden Markov Model (HMM) and LSTM for tagging and found that the state transducer produced more accurate tagging with morphological information.
5. Conclusion and Future Work
In this paper, the performance of an RNN-based NMT model, namely a Gated Recurrent Unit (GRU) with attention decoder, is presented on our dataset populated with English-Malay translated emails circulated at the University. The model was unable to learn the context of the source and target languages within the input text even in the presence of an attention mechanism. Thus, a different approach, splitting the input text into contextual content, is used. A general-purpose trained model does not perform well for a specific application; thus, there is a need to develop application-oriented trained models populated with application-specific vocabulary. The model using regional email vocabulary showed a BLEU score 10-20 points better than Google Translate. The model was unable to learn when the source input contains bilingual text; thus, there is a need to update general translators for multilingual blended input text.
- Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
- Neural fuzzy repair: integrating fuzzy matches into neural machine translation. In 57th Conference of the Association for Computational Linguistics (ACL), pp. 1800–1809.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078.
- Demonstration of a neural machine translation system with online learning for translators. arXiv:1906.09000.
- Professor forcing: a new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pp. 4601–4609.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Identifying and classifying unknown words in Malay texts. Int. Sym. on Natural Language Processing, Pattaya, pp. 493–498.
- Neural sequence-to-sequence learning of internal word structure. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 184–194.
- Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- Evaluating LSTM networks, HMM and WFST in Malay part-of-speech tagging. Journal of Telecommunication, Electronic and Computer Engineering (JTEC) 9 (2-9), pp. 79–83.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Learning deep transformer models for machine translation. arXiv:1906.01787.
- Inferential machine comprehension: answering questions by recursively deducing the evidence chain from text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2241–2251.
- Bridging the gap between training and inference for neural machine translation. arXiv:1906.02448.