Machine Translation (MT) refers to automated translation, the process by which computer software translates a text from one natural language (such as English) into another (such as Spanish). Translation is a challenging task even for humans, and is therefore even more challenging for computers. High quality translation requires a thorough understanding of the syntactic and semantic properties of both the source and the target language.
The importance of studying and developing better MT systems has grown in the recent past due to rapid globalization, where people from different backgrounds, with knowledge of different languages, work together. Two paradigms are currently followed for building MT systems: one is based on statistical techniques, while the other employs artificial neural networks.
The statistical model, commonly referred to as Statistical Machine Translation (SMT) Weaver (1955), addresses this challenge by building statistical models whose parameters are derived from the analysis of parallel bilingual text corpora Mahata et al. (2017). Notable works on SMT include Al-Onaizan et al. (1999); Lopez (2008); Koehn (2009), where the authors dive deep into its challenges, working principles and possible improvements. SMT has shown good results for many language pairs and is responsible for the recent surge in the popularity of MT among the general public.
On the other hand, despite being relatively new, Neural Machine Translation (NMT) Bahdanau et al. (2014) has already shown promising results Mahata et al. (2016); Wu et al. (2016) and has therefore gained substantial attention and interest. Continuous recurrent models for translation, which do not depend on alignment or phrasal translation units, were introduced by Kalchbrenner and Blunsom (2013). The problem of rare word occurrence was addressed by Luong et al. (2014), and the effectiveness of global and local attention approaches was explored by Luong et al. (2015). He et al. (2016) demonstrated a log-linear framework that incorporates SMT features into NMT to address issues such as out-of-vocabulary words and inadequate translation. The properties of these architectures were discussed in detail in Cho et al. (2014). Given an adequate supply of training data, this approach generally produces much more accurate translations than SMT Vaswani et al. (2013); Liu et al. (2014); Doherty et al. (2010).
In the current work, we test the performance of SMT and NMT on simple sentences (see Section 2) extracted from English-Hindi (En-Hn) and English-Bengali (En-Bn) parallel corpora provided by TDIL (http://www.tdil.meity.gov.in/). These experiments were designed to explore the scenarios in which NMT and SMT outperform each other. Moreover, they help us evaluate whether using simple sentences as training data for MT models makes any difference to the quality of the MT output.
We have constrained our target languages to Hindi and Bengali, as these languages are used primarily in the Indian subcontinent. Native speakers of Hindi constitute 41.1% of the Indian population, while those of Bengali constitute 8.11%. Hindi is written in the Devanagari script (https://en.wikipedia.org/wiki/Devanagari) and Bengali is written in the Eastern Nagari script (https://en.wikipedia.org/wiki/Eastern_Nagari_script).
In order to test the effectiveness of the case study, SMT and NMT systems were also trained on the whole corpus, which consists of sentences of mixed complexity. For both the simple sentence corpus and the whole corpus, the automatic metrics BLEU Papineni et al. (2002) and TER Snover et al. (2006), as well as the manual evaluation metrics fluency and adequacy, were calculated to validate the observed results.
The paper is organized as follows. Section 2 describes the extraction of simple sentences from the parallel corpus provided by TDIL. Sections 3 and 4 describe the methodology for training the SMT and NMT models, respectively. Section 5 describes the evaluation with respect to various metrics, and finally, Section 6 draws the conclusion.
2 Extraction of Simple Sentences
Since we wanted to analyze and compare both the models, viz. SMT and NMT, with respect to how they perform on simple sentences, we first needed to extract such instances from our dataset, which contained sentences of varying complexity.
A simple sentence in this context is defined as a sentence that contains only one independent clause and no dependent clauses. Generally, whenever two or more clauses are joined by conjunctions (coordinating or subordinating), the sentence becomes compound or complex, respectively. So, to handle the conjunctions, we used the Stanford Dependency Parser (https://stanfordnlp.github.io/CoreNLP/) to chunk the English sentences into phrases (viz. NP (Noun Phrase), VP (Verb Phrase), PP (Preposition Phrase), ADJP (Adjective Phrase) and ADVP (Adverb Phrase)).
We noticed that simple sentences have a unique phrase structure consisting of some combination of NP, VP and PP. Based on this observation, we applied two approaches (viz. a rule based approach and a deep learning based approach) to extract simple sentences from the English corpus. The approaches are discussed in Section 2.1 and Section 2.2, respectively.
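As an illustration of the chunking step, the following minimal sketch derives the top-level phrase sequence of an English sentence from its constituency parse. It assumes a locally running Stanford CoreNLP server and the NLTK wrapper; the exact toolchain and tree traversal used in our pipeline may differ.

```python
# A minimal sketch (assuming a Stanford CoreNLP server on localhost:9000 and the
# NLTK wrapper): derive the top-level phrase sequence of a sentence from its parse.
from nltk.parse.corenlp import CoreNLPParser
from nltk.tree import Tree

PHRASE_TAGS = {"NP", "VP", "PP", "ADJP", "ADVP"}

def phrase_sequence(sentence, parser):
    """Return the top-level phrase labels (NP, VP, PP, ...) of a sentence."""
    tree = next(parser.raw_parse(sentence))   # (ROOT (S (NP ...) (VP ...) ...))
    clause = tree[0]                          # the S node under ROOT
    return [child.label() for child in clause
            if isinstance(child, Tree) and child.label() in PHRASE_TAGS]

if __name__ == "__main__":
    parser = CoreNLPParser(url="http://localhost:9000")
    print(phrase_sequence("The boy plays football.", parser))   # e.g. ['NP', 'VP']
```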
2.1 Rule based Approach
We subjected a total of 3046 simple sentences, extracted from various websites, to chunking using the Stanford Dependency Parser Manning et al. (2014), and identified their unique phrase structures. These structures became the rules by which we further mined simple sentences from the English corpus.
We extracted 205 unique rules, the surface forms of some of which, along with their confidence scores, are shown in Table 1.
We tested our system on 2876 sentences (1438 simple sentences and 1438 complex/compound sentences) and achieved an accuracy of 89.22%. Table 2 shows the various validation metrics. Using this system, 10,349 simple sentences from the TDIL English corpus were extracted, as shown in Table 4.
Table 1: Sample extracted rules (surface phrase structures) and their confidence scores.
|Rule||Confidence Score|
|PP NP* PP VP NP*||8.40|
|PP NP* VP PP NP*||9.49|
|ADVP NP* VP* ADVP NP*||9.36|
|NP VP PP NP PP NP||12.15|
|NP ADVP VP* NP*||11.69|
|NP* VP NP*||11.69|
|NP* PP NP VP* NP||11.46|
|NP VP PP NP*||11.23|
|VP* NP* PRP* ADVP*||4.92|
|NP VP* NP* PP* ADJP* ADVP*||9.62|
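The text does not spell out how the mined rules are matched against new sentences, nor the exact semantics of the '*' marker. The following is a purely illustrative sketch that treats '*' as "one or more occurrences" of the tag and accepts a sentence as simple if its phrase sequence matches any rule; the rule list shown is a toy subset.

```python
import re

# Toy subset of the mined rules; '*' is assumed here to mean "one or more
# occurrences" of the preceding tag (the text does not define it explicitly).
RULES = ["NP* VP NP*", "NP VP PP NP*", "PP NP* VP PP NP*"]

def rule_to_regex(rule):
    parts = []
    for tag in rule.split():
        if tag.endswith("*"):
            parts.append("(?:" + tag[:-1] + " )+")   # repeatable tag
        else:
            parts.append(tag + " ")
    return re.compile("^" + "".join(parts) + "$")

COMPILED = [rule_to_regex(r) for r in RULES]

def is_simple(phrase_seq):
    """Keep a sentence as 'simple' if its phrase sequence matches any mined rule."""
    flat = " ".join(phrase_seq) + " "
    return any(p.match(flat) for p in COMPILED)

print(is_simple(["NP", "VP", "NP"]))        # True under the toy rules above
print(is_simple(["NP", "VP", "NP", "VP"]))  # False
```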
2.2 Deep Learning based Approach
We preferred a deep learning approach over a traditional Machine Learning (ML) approach because, with ML, we could only extract syntactic features, which were already exploited in the rule based approach discussed in Section 2.1. A deep learning technique, on the other hand, learns categories incrementally through its hidden layer architecture. We wanted the deep learning framework to learn the nature of a sentence from the POS tags themselves, as it automatically clusters similar data into separate spaces.
Stochastic Gradient Descent Bottou (2010) was used as the optimizer with back-propagation. The network contained two hidden layers of 50 nodes each. The activation function used was tanh and the loss function was Mean Squared Error. The learning rate was kept at 0.001 and the number of epochs was fixed at 100. The batch size was kept at 128. The training data consisted of the phrase structures of 2876 sentences (1438 simple sentences and 1438 other complex/compound sentences). The trained model was subjected to 10-fold cross validation and yielded an accuracy of 92.11%. Table 3 shows the results with respect to other important validation metrics.
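A minimal Keras sketch of the classifier described above is given below. The input featurisation is an assumption (each sentence represented as a fixed-length, padded sequence of integer-encoded phrase tags), the sigmoid output layer is an assumption, and the data used here is random placeholder data, not the actual training set.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

MAX_LEN = 20     # assumed maximum number of phrase tags per sentence
NUM_TAGS = 6     # NP, VP, PP, ADJP, ADVP + padding

# Two hidden layers of 50 nodes, tanh activations, MSE loss, lr 0.001 (as in the text).
model = Sequential([
    Dense(50, activation="tanh", input_shape=(MAX_LEN,)),
    Dense(50, activation="tanh"),
    Dense(1, activation="sigmoid"),      # 1 = simple, 0 = other
])
model.compile(optimizer=SGD(learning_rate=0.001),
              loss="mean_squared_error", metrics=["accuracy"])

# Placeholder data standing in for the 2876 annotated sentences.
X = np.random.randint(0, NUM_TAGS, size=(2876, MAX_LEN)).astype("float32")
y = np.random.randint(0, 2, size=(2876, 1)).astype("float32")
model.fit(X, y, batch_size=128, epochs=100, verbose=0)
```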
The TDIL English corpus was fed to this model and it yielded 14,976 simple sentences as shown in Table 4.
Table 4: Number of simple and other sentences extracted from the TDIL English corpus by the rule based (RL) and deep learning (DL) approaches.
|# of sentences in the corpus|| ||49999|
|# of other sentences||RL||39650|
|# of simple sentences||RL||10349|
|# of other sentences||DL||35023|
|# of simple sentences||DL||14976|
The deep learning based approach was preferred as it resulted in better accuracy. The Bengali and Hindi counterparts of these sentences were extracted to build a parallel corpus comprising simple sentences only. The next step was to build MT models using this data, as well as the data from the whole corpus, and compare their respective results.
3 Statistical Machine Translation
Moses Koehn et al. (2007) is a statistical machine translation system that allows us to automatically train translation models for any language pair from a large collection of translated texts (a parallel corpus). Once the model has been trained, an efficient beam search algorithm quickly finds the highest probability translation among the exponential number of choices.
For training the SMT model, we used English as the source language and Bengali and Hindi as the target languages.
3.1 Preprocessing
The following steps were employed to preprocess the source and target texts.
Tokenization: Given a character sequence and a defined document unit, tokenization chops it up into pieces, called tokens. In our case, these tokens were words, punctuation marks and numbers.
Truecasing: This refers to the process of restoring case information to badly-cased or non-cased text Lita et al. (2003). Truecasing helps in reducing data sparsity.
Cleaning: Long sentences (more than 80 tokens) were removed.
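The experiments used the tokenizer, truecaser and cleaning scripts bundled with Moses; the rough Python stand-in below only illustrates the tokenization and length-based cleaning steps on a toy sentence pair.

```python
import re

MAX_TOKENS = 80

def tokenize(line):
    # Split words, numbers and punctuation marks into separate tokens.
    return re.findall(r"\w+|[^\w\s]", line)

def clean_pair(src, tgt):
    """Drop a sentence pair if either side is empty or longer than MAX_TOKENS."""
    s, t = tokenize(src), tokenize(tgt)
    if not s or not t or len(s) > MAX_TOKENS or len(t) > MAX_TOKENS:
        return None
    return s, t

print(clean_pair("The boy plays football.", "ছেলেটি ফুটবল খেলে।"))
```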
3.2 Language Model
A Language Model (LM) was built for the target language, Bengali or Hindi in our case, to ensure fluent output. KenLM Heafield (2011), which comes bundled with the Moses toolkit, was used for building this model.
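As a quick illustration of what the LM contributes, the sketch below queries a KenLM model from Python. It assumes the `kenlm` pip module and an ARPA/binary model already built from the target-side training text; the file name is hypothetical.

```python
import kenlm

lm = kenlm.Model("lm/target.bn.arpa")   # hypothetical path to the trained LM
fluent = "ছেলেটি ফুটবল খেলে"
scrambled = "ফুটবল খেলে ছেলেটি"

# Higher (less negative) log10 probability means the LM judges the string more fluent.
print(lm.score(fluent, bos=True, eos=True))
print(lm.score(scrambled, bos=True, eos=True))
```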
3.3 Word Alignment and Phrase Table Generation
For word alignment in the translation model, GIZA++ Och and Ney (2003) was used. Finally, the phrase table was created and probability scores were calculated. Training the Moses statistical MT system resulted in the generation of two models, a Phrase Model and a Translation Model. Moses scores the phrases in the phrase table with respect to a given source sentence and produces the best-scored phrases as output.
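The toy example below illustrates the basic idea behind the phrase-table probability scores: phrase translation probabilities estimated by relative frequency over extracted phrase pairs. GIZA++ and Moses do this at scale over the aligned corpus; the phrase pairs here are made up.

```python
from collections import Counter

# Made-up extracted phrase pairs (source phrase, target phrase).
phrase_pairs = [("the boy", "ছেলেটি"), ("the boy", "ছেলেটি"),
                ("the boy", "বালকটি"), ("plays football", "ফুটবল খেলে")]

pair_counts = Counter(phrase_pairs)
src_counts = Counter(src for src, _ in phrase_pairs)

def phi(tgt, src):
    """phi(tgt | src) = count(src, tgt) / count(src), i.e. relative frequency."""
    return pair_counts[(src, tgt)] / src_counts[src]

print(phi("ছেলেটি", "the boy"))    # 2/3
print(phi("বালকটি", "the boy"))    # 1/3
```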
4 Neural Machine Translation
Neural Machine Translation (NMT) is an MT approach that uses neural networks to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model. NMT departs from traditional phrase-based statistical approaches in the sense that it does not rely on separately engineered subcomponents such as language model generation, word alignment and phrase table generation. The main functionality of NMT is based on the sequence to sequence (seq2seq) architecture, which is described in Section 4.1.
4.1 Seq2Seq Model
The sequence to sequence model is a relatively new idea for sequence learning using neural networks. It has gained considerable popularity since it achieved state of the art results in the machine translation task. Essentially, the model takes a sequence $X = (x_1, x_2, \ldots, x_n)$ as input and tries to generate the target sequence $Y = (y_1, y_2, \ldots, y_m)$ as output, where $x_i$ and $y_i$ are the input and target symbols, respectively. The architecture of the seq2seq model comprises two parts, the encoder and the decoder. We experimented with two types of NMT models (word level and character level); both use the seq2seq architecture, the difference being the inputs to the encoder and decoder. They are discussed in Sections 4.1.1 and 4.1.2, respectively. The working architecture of the seq2seq model at the word level is shown in Fig. 2. We implemented both models using the Keras Chollet et al. (2015) library.
4.1.1 Word Level NMT
To build our word level NMT model, we used the seq2seq architecture with an attention mechanism. This architecture has recently been shown to achieve state of the art translation quality across many different language pairs. The details of the seq2seq model, along with the training details, are given below.
The encoder takes a variable length sequence as input and encodes it into a fixed length vector, which is supposed to summarize its meaning, taking its context into account as well. A Long Short Term Memory (LSTM) cell was used to achieve this. The directional encoder reads the sequence from one end to the other (left to right in our case), producing a hidden state at each time step,

$h_t = \mathrm{enc}\left(E_x(x_t), h_{t-1}\right)$

Here, $E_x$ is the input embedding lookup table (dictionary) and $\mathrm{enc}$ is the transfer function of the Long Short Term Memory (LSTM) recurrent unit Hochreiter and Schmidhuber (1997). A contiguous sequence of encodings $C = (h_1, h_2, \ldots, h_n)$ is constructed and then passed on to the decoder.
The decoder takes as input the context vector $C$ from the encoder and computes the hidden state at time $t$ as

$s_t = \mathrm{dec}\left(E_y(y_{t-1}), s_{t-1}, c_t\right)$

where $E_y$ is the target embedding lookup table and $c_t$ is a context vector derived from $C$. Subsequently, a parametric function $\mathrm{out}_k$ returns the conditional probability of the next target symbol being $k$,

$p(y_t = k \mid y_{<t}, X) = \frac{1}{Z} \exp\left(\mathrm{out}_k\left(E_y(y_{t-1}), s_t, c_t\right)\right)$

where $Z = \sum_{k'} \exp\left(\mathrm{out}_{k'}\left(E_y(y_{t-1}), s_t, c_t\right)\right)$ is the normalizing constant.
The entire model can be trained end-to-end by minimizing the negative log likelihood, defined as

$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p\left(y_t^n \mid y_{<t}^n, X^n\right)$

where $N$ is the number of sentence pairs, and $X^n$ and $y_t^n$ are the input sentence and the $t$-th target symbol in the $n$-th pair, respectively.
For training our model, we used the seq2seq with attention architecture, employing LSTM cells. We used two LSTM cells stacked upon each other, where one acts as the encoder and the other as the decoder. We trained our model on 14,976 sentence pairs (simple sentence corpus) and 49,999 sentence pairs (whole corpus, for both Bengali and Hindi), with the batch size set to 256, the number of epochs to 100 and the learning rate to 0.001. The activation function used was softmax, the optimizer was rmsprop and the loss at each step was calculated using categorical cross-entropy.
Neural processes involving attention Vaswani et al. (2017) have been studied extensively in computational neuroscience. The concept is very loosely based on the visual attention mechanism in humans. With an attention mechanism, the need to encode the full source sentence into a fixed length vector is removed. Rather, we allow the decoder to attend to different parts of the source sentence at each time step of the output generation. Essentially, we let the model learn what to attend to based on the input sequence and what has been predicted so far.
Mathematically, it computes the context vector $c_t$ at each time step $t$ as a weighted sum of the source hidden states,

$c_t = \sum_{i=1}^{n} \alpha_{t,i} \, h_i$

Each attention weight $\alpha_{t,i}$ represents how relevant the $i$-th source token $x_i$ is to the $t$-th target token $y_t$, and is computed as

$\alpha_{t,i} = \frac{1}{Z} \exp\left(\mathrm{score}\left(E_y(y_{t-1}), s_{t-1}, h_i\right)\right)$

where $Z = \sum_{i'=1}^{n} \exp\left(\mathrm{score}\left(E_y(y_{t-1}), s_{t-1}, h_{i'}\right)\right)$ is the normalization constant. $\mathrm{score}()$ is a feed forward neural network with a single hidden layer that scores how well the source symbol $x_i$ and the target symbol $y_t$ match, $E_y$ is the target embedding lookup table and $s_t$ is the target hidden state at time $t$. The results and evaluation of the systems are shown in Section 5.
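A condensed Keras sketch of a word-level seq2seq model with attention is given below. It follows the description above only loosely: dot-product (Luong-style) attention stands in for the single-hidden-layer score() network, the sparse variant of cross-entropy is used for convenience, and the vocabulary, embedding and hidden sizes are illustrative rather than the values used in our experiments.

```python
from keras.layers import Input, Embedding, LSTM, Dense, Attention, Concatenate
from keras.models import Model
from keras.optimizers import RMSprop

src_vocab, tgt_vocab, emb_dim, units = 8000, 8000, 128, 256  # illustrative sizes

# Encoder: embeds the source words and keeps every hidden state for attention.
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab, emb_dim)(enc_in)
enc_out, h, c = LSTM(units, return_sequences=True, return_state=True)(enc_emb)

# Decoder: initialised with the final encoder states.
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, emb_dim)(dec_in)
dec_out, _, _ = LSTM(units, return_sequences=True,
                     return_state=True)(dec_emb, initial_state=[h, c])

# Attention: context vectors computed as weighted sums of the encoder hidden states.
context = Attention()([dec_out, enc_out])
merged = Concatenate()([dec_out, context])
probs = Dense(tgt_vocab, activation="softmax")(merged)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss="sparse_categorical_crossentropy")
model.summary()
```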
4.1.2 Character Level NMT
It was observed that Character level NMT (CNMT) performs better than Word level NMT (WNMT) due to the following reasons Chung et al. (2016):
- It does not suffer from out-of-vocabulary issues.
- It is able to model different, rare morphological variants of a word.
- It does not require segmentation.
Generally, CNMT works best when the majority of the alphabets of the source and target languages overlap, i.e., both languages share a common or similar script. We still examined its performance on the simple sentence corpus and the whole corpus, even though, in our case, the Nagari scripts and the Roman script use completely different alphabets. The model has two parts (encoder and decoder), as discussed below.
In order to build the encoder, we used LSTM cells. The input to the cells was a one-hot tensor of the English sentences (character-level embeddings). From the encoder, the internal states of each cell were preserved and the outputs were discarded; the purpose of this is to preserve the information at the context level. These states were then passed on to the decoder cell as its initial states.
For the decoder, again an LSTM cell was used, with its initial states set to the hidden states from the encoder. It was designed to return both sequences and states. The input to the decoder was a one-hot tensor (character-level embeddings) of the Bengali or Hindi sentences, while the target data was identical but offset by one time-step. The information for generation is gathered from the initial states passed on by the encoder. Thus, the decoder learns to generate target data [t+1, ...] given targets [..., t], conditioned on the input sequence. It essentially predicts the output sequence, one character per time step.
For training the model, the batch size was set to 64, the number of epochs to 100 and the learning rate to 0.001; the activation function was softmax, the optimizer was rmsprop and the loss function was categorical cross-entropy. The results and evaluation of the systems are shown in Section 5.
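A condensed Keras sketch of the character-level encoder-decoder described above is shown below. It follows the standard Keras character seq2seq recipe; the latent dimension and character-set sizes are illustrative assumptions, as they are not reported in the text.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.optimizers import RMSprop

latent_dim = 256                          # assumed LSTM size
num_enc_chars, num_dec_chars = 70, 90     # English vs. Bengali/Hindi character sets

# Encoder: keep only the final LSTM states, discard the per-step outputs.
enc_inputs = Input(shape=(None, num_enc_chars))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_inputs)

# Decoder: initialised with the encoder states; predicts one character per step.
dec_inputs = Input(shape=(None, num_dec_chars))
dec_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
dec_outputs, _, _ = dec_lstm(dec_inputs, initial_state=[state_h, state_c])
dec_outputs = Dense(num_dec_chars, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy")
# model.fit([enc_one_hot, dec_one_hot_in], dec_one_hot_target,
#           batch_size=64, epochs=100)
```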
5 Evaluation and Analysis
All of our translation systems were evaluated in two ways, automatically and manually, as discussed in the sections below.
5.1 Automatic Evaluation
Automatic evaluation was done by scoring the translations using the BLEU and TER metrics. The results are shown in Tables 5 and 6. In the tables, "Bn" and "Hn" denote Bengali and Hindi, respectively, and "CNMT" and "WNMT" denote the character and word level NMT models, respectively. The presence of the attention mechanism in a model is denoted by "A" and its absence by "NA".
Tables 5 and 6: BLEU and TER scores, respectively, for the models trained on the simple sentence corpus and on the whole corpus (columns: Simple Sent., Whole Corp.).
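The BLEU scores can be reproduced with a few lines of NLTK, as sketched below; the file names are hypothetical, and TER was computed separately with its own tool.

```python
from nltk.translate.bleu_score import corpus_bleu

# One reference and one hypothesis per line, whitespace-tokenized.
with open("test.ref.bn", encoding="utf-8") as f:
    references = [[line.split()] for line in f]
with open("test.hyp.bn", encoding="utf-8") as f:
    hypotheses = [line.split() for line in f]

print("BLEU:", corpus_bleu(references, hypotheses))
```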
5.2 Manual Evaluation
Table 7: Manual evaluation scores for the English-Bengali models (columns: SMT, CNMT, WNMT (NA), WNMT (A)).
Translation quality was judged by four linguists: two native Bengali speakers (who evaluated the Bn models) and two native Hindi speakers (who evaluated the Hn models). The evaluation criteria were adequacy and fluency. Adequacy measures how much of the source meaning is expressed in the target translation. Fluency measures to what extent the translation is grammatically well-formed, correctly spelled, intuitively acceptable and sensibly interpretable by a native speaker. The evaluators were asked to rate each translation on a scale of 1-5, where 1 is the lowest and 5 is the highest. The manual evaluation scores for the English-Bengali and English-Hindi language pairs are given in Table 7 and Table 8, respectively.
We can clearly see from the results that an NMT model trained on simple sentences performs better than an SMT model trained on the same sentence pairs. At the same time, SMT outperforms NMT when trained on the whole corpus. This is because NMT does not work well with a small amount of data and highly complex sentences.
Similarly, we also see that character based NMT works better than word based NMT when dealing with a small amount of data. However, we have to keep in mind that for character based NMT to work well, it should be trained on a source-target language pair that shares a common script. Further, word based NMT with attention performs relatively better than character based NMT. We did not use attention in the character level NMT, as the attention mechanism cannot meaningfully attend to individual characters.
6 Conclusion and Future Work
In this work, we have tried to analyze the scenarios in which SMT performs better than NMT and vice versa. We have also tried to find out whether MT models give better output when trained on simple sentences rather than on sentences of various complexities.
As a future prospect, we would like to take the "other" (complex + compound) sentence pairs and simplify them, so that the MT models can be trained on more simple sentences. We would also like to increase the number of LSTM encoding and decoding layers, as well as include embeddings such as ConceptNet (https://github.com/commonsense/conceptnet-numberbatch), in our future work.
Acknowledgments
This work is supported by Media Lab Asia, MeitY, Government of India, under the Visvesvaraya PhD Scheme for Electronics & IT.
- Al-Onaizan et al. (1999) Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A Smith, and David Yarowsky. 1999. Statistical machine translation. In Final Report, JHU Summer Workshop, volume 30.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Bottou (2010) Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
- Chollet et al. (2015) François Chollet et al. 2015. Keras. https://keras.io.
- Chung et al. (2016) Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. CoRR, abs/1603.06147.
- Doherty et al. (2010) Stephen Doherty, Sharon O'Brien, and Michael Carl. 2010. Eye tracking as an mt evaluation technique. Machine translation, 24(1):1–13.
- He et al. (2016) Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with smt features. In AAAI, pages 151–157.
- Heafield (2011) Kenneth Heafield. 2011. Kenlm: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, volume 3, page 413.
- Koehn (2009) Philipp Koehn. 2009. Statistical machine translation. Cambridge University Press.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.
- Lita et al. (2003) Lucian Vlad Lita, Abe Ittycheriah, Salim Roukos, and Nanda Kambhatla. 2003. Truecasing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 152–159. Association for Computational Linguistics.
- Liu et al. (2014) Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. 2014. A recursive recurrent neural network for statistical machine translation.
- Lopez (2008) Adam Lopez. 2008. Statistical machine translation. ACM Computing Surveys (CSUR), 40(3):8.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- Luong et al. (2014) Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba. 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.
- Mahata et al. (2016) Sainik Mahata, Dipankar Das, and Santanu Pal. 2016. Wmt2016: A hybrid approach to bilingual document alignment. In WMT, pages 724–727.
- Mahata et al. (2017) Sainik Kumar Mahata, Dipankar Das, and Sivaji Bandyopadhyay. 2017. Bucc2017: A hybrid approach for identifying parallel sentences in comparable corpora. ACL 2017, page 56.
- Mahata et al. (2018) Sainik Kumar Mahata, Dipankar Das, and Sivaji Bandyopadhyay. 2018. Mtil2017: Machine translation using recurrent neural network on statistical machine translation. Journal of Intelligent Systems, pages 1–7.
- Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60.
- Och and Ney (2003) Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
- Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
- Vaswani et al. (2013) Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In EMNLP, pages 1387–1392.
- Weaver (1955) Warren Weaver. 1955. Translation. Machine translation of languages, 14:15–23.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.