JUMT at WMT2019 News Translation Task: A Hybrid approach to Machine Translation for Lithuanian to English

Sainik Kumar Mahata, et al.

In the current work, we present a description of the system submitted to the WMT 2019 News Translation Shared Task. The system was created to translate news text from Lithuanian to English. To accomplish the given task, our system used a word-embedding-based Neural Machine Translation model to post-edit the outputs generated by a Statistical Machine Translation model. The current paper documents the architecture of our model, describes its various modules, and reports the results produced. Our system garnered a BLEU score of 17.6.


1 Introduction

Machine Translation (MT) is the automated translation of text from one natural language to another by a computer. Translation itself is a very tough task for humans as well as computers, as it requires a thorough understanding of the syntax and semantics of both languages under consideration. To produce good translations, an MT system needs a good-quality parallel corpus of sufficient size Mahata et al. (2016, 2017).

In the modern context, MT systems can be categorized into Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). SMT has played a large part in making MT popular among the masses. It involves creating statistical models whose input parameters are derived from the analysis of bilingual text corpora created by professional translators Weaver (1955). The state of the art for SMT is the Moses toolkit (http://www.statmt.org/moses/), created by Koehn et al. (2007), which incorporates subcomponents such as language model generation, word alignment and phrase table generation. Considerable work has been done in SMT Lopez (2008); Koehn (2009), and it has shown good results for many language pairs.

On the other hand, NMT Bahdanau et al. (2014), though relatively new, has shown considerable improvements in translation quality when compared to SMT Mahata et al. (2018b), including better fluency of the output and better handling of the out-of-vocabulary problem. Unlike SMT, it does not depend on alignment and phrasal unit translations Kalchbrenner and Blunsom (2013). Instead, it uses an encoder-decoder approach built on recurrent neural cells Cho et al. (2014). As a result, given a sufficient amount of training data, it produces much more accurate results than SMT Doherty et al. (2010); Vaswani et al. (2013); Liu et al. (2014).

For the given task (http://www.statmt.org/wmt19/translation-task.html), we attempted to create an MT system that can translate sentences from Lithuanian to English. Since using only an SMT or an NMT model comes with one disadvantage or another, we used both in a pipeline, which improves the results over the individual use of either SMT or NMT. The main idea was to train an SMT model to translate Lithuanian into English. Thereafter, a held-out set was translated using this model, and a word-embedding-based NMT model was trained to learn the mappings between the SMT output (in English) and the gold-standard data (in English).

The organizers provided the required parallel corpora, consisting of 962,022 sentence pairs, for training the translation model. Of these, 762,022 pairs were used to train the SMT system, and 200,000 pairs were used to test the SMT system and subsequently to train the NMT system. The statistics of the parallel corpus are shown in Table 1.

# sentences in Lt corpus          962,022
# sentences in En corpus          962,022
# words in Lt corpus              11,665,937
# words in En corpus              15,622,488
# word vocab size for Lt corpus   488,593
# word vocab size for En corpus   227,131
Table 1: Statistics of the Lithuanian-English parallel corpus provided by the organizers. "#" denotes number of, "Lt" and "En" denote Lithuanian and English, respectively, and "vocab" means vocabulary of unique tokens.
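
For illustration, the train/test split described above could be reproduced with a few lines of Python; the file names below are placeholders, not the organizers' actual file names.

    # Hypothetical sketch of the 762,022 / 200,000 split described above.
    # File names are illustrative placeholders, not the official WMT2019 names.
    with open("corpus.lt-en.lt", encoding="utf-8") as f_lt, \
         open("corpus.lt-en.en", encoding="utf-8") as f_en:
        pairs = list(zip(f_lt.read().splitlines(), f_en.read().splitlines()))

    smt_train = pairs[:762022]        # used to train the Moses SMT model
    nmt_train = pairs[762022:962022]  # 200,000 pairs: SMT test set / NMT training set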

The remainder of the paper is organized as follows. Section 2 describes the methodology for creating the SMT and NMT models, including the preprocessing steps, a brief summary of the encoder-decoder approach, and the architecture of our system. This is followed by the results and conclusion in Sections 3 and 4, respectively.

2 Methodology

2.1 SMT

For designing the model, we applied some standard preprocessing steps to the 762,022 sentence pairs, which are discussed below.

2.1.1 Preprocessing

The following steps were applied to preprocess and clean the data before using it to train our statistical machine translation model. We used the NLTK toolkit (https://www.nltk.org/) to perform these steps; a minimal sketch of the pipeline is given after the list.

  • Tokenization: Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens. In our case, these tokens were words, punctuation marks and numbers. NLTK supports tokenization of Lithuanian as well as English text.

  • Truecasing: This refers to the process of restoring case information to badly-cased or non-cased text Lita et al. (2003). Truecasing helps in reducing data sparsity.

  • Cleaning: Sentences longer than a fixed token-count threshold were removed.
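
A minimal sketch of these preprocessing steps follows. The 80-token cleaning threshold and the use of lowercasing in place of a trained truecaser are illustrative assumptions, since the exact settings are not listed above.

    # Minimal preprocessing sketch; the 80-token threshold and lowercasing
    # (instead of a trained truecaser) are illustrative assumptions.
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)

    def preprocess(lt_lines, en_lines, max_tokens=80):
        cleaned = []
        for lt, en in zip(lt_lines, en_lines):
            lt_tok = word_tokenize(lt)    # tokenization
            en_tok = word_tokenize(en)
            if len(lt_tok) > max_tokens or len(en_tok) > max_tokens:
                continue                  # cleaning: drop overly long pairs
            cleaned.append((" ".join(lt_tok).lower(),   # crude case normalization
                            " ".join(en_tok).lower()))
        return cleaned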

2.1.2 Moses

Moses is a statistical machine translation system that allows one to automatically train translation models for any language pair, given a large collection of translated texts (a parallel corpus). Once the model has been trained, an efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices.

We trained Moses on the 762,022 sentence pairs provided by WMT2019, with Lithuanian as the source language and English as the target language. For building the language model we used KenLM (https://kheafield.com/code/kenlm/) Heafield (2011) with 7-grams over the target language; the English monolingual corpus from WMT2019 was used to build the language model.
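
The corresponding KenLM and Moses training steps could be driven roughly as follows; the file paths and the alignment/reordering options are assumptions based on the standard Moses baseline setup, not settings reported in this paper.

    # Rough sketch of the LM and SMT training steps, driven from Python.
    # Paths and the alignment/reordering options are illustrative assumptions.
    import os
    import subprocess

    # 7-gram language model on the English monolingual data with KenLM
    subprocess.run("lmplz -o 7 < mono.en > lm.en.arpa", shell=True, check=True)
    subprocess.run("build_binary lm.en.arpa lm.en.blm", shell=True, check=True)

    # Phrase-based Moses training on the 762,022 Lithuanian-English pairs
    lm_path = os.path.abspath("lm.en.blm")
    subprocess.run(
        "train-model.perl -root-dir train -corpus corpus.clean -f lt -e en "
        "-alignment grow-diag-final-and -reordering msd-bidirectional-fe "
        f"-lm 0:7:{lm_path}:8 -external-bin-dir tools",
        shell=True, check=True,
    )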

Training the Moses statistical MT system resulted in the generation of a phrase table and a translation model that help in translating between the source-target language pair. Given a source sentence, Moses scores the phrases in the phrase table and produces the best-scoring phrases as output.

2.2 NMT

Neural machine translation (NMT) is an approach to machine translation that uses neural networks to predict the likelihood of a sequence of words. The main functionality of NMT is based on the sequence-to-sequence (seq2seq) architecture, which is described in Section 2.2.1.

2.2.1 Sequence to Sequence Model

Sequence-to-sequence learning is a concept in neural networks that helps them learn sequences. Essentially, the model takes a sequence of tokens (words in our case)

    X = (x_1, x_2, ..., x_n)

as input and tries to generate the target sequence

    Y = (y_1, y_2, ..., y_m)

as output, where x_i and y_i are the input and target symbols, respectively.

The sequence-to-sequence architecture consists of two parts: an encoder and a decoder.

The encoder takes a variable-length sequence as input and encodes it into a fixed-length vector that is meant to summarize the sequence's meaning while also taking its context into account. A Long Short-Term Memory (LSTM) cell was used to achieve this. The uni-directional encoder reads the words of the Lithuanian text as a sequence from one end to the other (left to right in our case),

    h_t = enc(E_x[x_t], h_{t-1}).

Here, E_x is the input embedding lookup table (dictionary) and enc is the transfer function of the LSTM recurrent unit. The cell state h and the context vector C are constructed and passed on to the decoder.

The decoder takes as input the context vector C and the cell state h from the encoder, and computes its hidden state at time t as

    s_t = dec(E_y[y_{t-1}], s_{t-1}, C).

Subsequently, a parametric function out_k returns the conditional probability of the next target symbol being k,

    p(y_t = k | y_{<t}, X) = (1/Z) * exp(out_k(s_t)),

where Z is the normalizing constant, obtained by summing exp(out_j(s_t)) over all symbols j in the target vocabulary.

The entire model can be trained end-to-end by maximizing the log-likelihood, defined as

    L = (1/N) * sum_{n=1}^{N} sum_{t=1}^{T_n} log p(y_t^n | y_{<t}^n, X^n),

where N is the number of sentence pairs, and X^n and y_t^n are the input sentence and the t-th target symbol in the n-th pair, respectively.

The input to the decoder was a one-hot tensor (word-level embeddings) of the 200,000 English sentences, while the target data was identical but offset by one time step.
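
A minimal sketch of such a single-layer LSTM encoder-decoder is shown below, written here with Keras; the framework choice and all hyper-parameters (vocabulary, embedding and hidden sizes) are illustrative assumptions, as the paper does not state them.

    # Illustrative word-embedding-based seq2seq model with one LSTM layer on
    # each side; vocab/embedding/hidden sizes are placeholder assumptions.
    from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
    from tensorflow.keras.models import Model

    SRC_VOCAB, TGT_VOCAB, EMB_DIM, HID_DIM = 50000, 50000, 256, 512

    # Encoder: reads the SMT output (draft English) and keeps its final states
    enc_in = Input(shape=(None,))
    enc_emb = Embedding(SRC_VOCAB, EMB_DIM, mask_zero=True)(enc_in)
    _, state_h, state_c = LSTM(HID_DIM, return_state=True)(enc_emb)

    # Decoder: predicts the gold-standard English, conditioned on encoder states
    dec_in = Input(shape=(None,))
    dec_emb = Embedding(TGT_VOCAB, EMB_DIM, mask_zero=True)(dec_in)
    dec_out, _, _ = LSTM(HID_DIM, return_sequences=True,
                         return_state=True)(dec_emb,
                                            initial_state=[state_h, state_c])
    probs = Dense(TGT_VOCAB, activation="softmax")(dec_out)

    model = Model([enc_in, dec_in], probs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Teacher forcing: the decoder input is the gold sentence, and the target
    # is the same sentence shifted one time step ahead (the offset above).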

2.3 Architecture

Figure 1: Architecture of the proposed hybrid system (training and testing).

2.3.1 Training

For training, 762,022 preprocessed Lithuanian-English sentence pairs were fed to the Moses toolkit. This created an SMT translation model with Lithuanian as the source language and English as the target language. Thereafter, the Lithuanian sentences from the remaining 200,000 Lithuanian-English sentence pairs were given as input to the SMT model, which produced 200,000 translated English sentences as output. These 200,000 translated English sentences and the corresponding 200,000 gold-standard English sentences were then given as input to a word-embedding-based NMT model. Together, this constitutes our hybrid model.

2.3.2 Testing

For testing, 10k Lithuanian sentences were fed to the hybrid model, and the output, when scored with BLEU Papineni et al. (2002), achieved a score of 21.6. The training and testing architecture is shown in Figure 1.
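
An internal BLEU check of this kind can be computed, for example, with NLTK's corpus_bleu; the sketch below assumes whitespace-tokenized hypotheses and single references, which is not necessarily the exact scoring setup used for the figure above.

    # Hedged sketch of an internal BLEU check on the held-out sentences.
    from nltk.translate.bleu_score import corpus_bleu

    def bleu(references, hypotheses):
        # corpus_bleu expects, per sentence, a list of reference token lists
        refs = [[r.split()] for r in references]
        hyps = [h.split() for h in hypotheses]
        return 100 * corpus_bleu(refs, hyps)  # scale to the usual 0-100 range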

3 Results

WMT2019 provided us with a test set of Lithuanian sentences in .SGM format. This file was parsed and fed to our hybrid system. The output file was converted back to .SGM format and submitted to the organizers. Our system garnered a BLEU score of 17.6 when scored with automated evaluation metrics. The other scores are listed in Table 2.

Metric        Score
BLEU          17.6
BLEU-cased    16.6
TER           0.762
BEER 2.0      0.497
CharacTER     0.718
Table 2: Accuracy scores calculated using various automated evaluation metrics.
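
As noted above, the test set was distributed in .SGM format, where each sentence sits inside a <seg> element. A simple regex-based helper such as the following can extract the plain text; the actual submission tooling is not described in this paper.

    # Minimal sketch for pulling sentences out of a WMT .SGM test file.
    import re

    def read_sgm(path):
        with open(path, encoding="utf-8") as f:
            # each source sentence sits inside a <seg id="..."> ... </seg> element
            return re.findall(r"<seg[^>]*>(.*?)</seg>", f.read(), flags=re.DOTALL)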

4 Conclusion

This paper presents the workings of the translation system submitted to the WMT 2019 News Translation shared task. We used word-embedding-based NMT on top of SMT for our proposed system, with a single LSTM layer as the encoder as well as the decoder. As future work, we plan to use more LSTM layers in our model, and to create another model that incrementally trains both the SMT and NMT systems in a pipeline to improve translation quality.

Acknowledgement

The reported work is supported by Media Lab Asia, MeitY, Government of India, under the Visvesvaraya PhD Scheme for Electronics & IT.

References