Combining Advanced Methods in Japanese-Vietnamese Neural Machine Translation

05/18/2018 · by Thi-Vinh Ngo, et al.

Neural machine translation (NMT) systems have recently achieved state-of-the-art results for many popular language pairs thanks to the availability of data. For low-resourced language pairs, there is little research in this field due to the lack of bilingual data. In this paper, we attempt to build the first NMT systems for a low-resourced language pair: Japanese-Vietnamese. We also show significant improvements when combining advanced methods to reduce the adverse impact of data sparsity and improve the quality of the NMT systems. In addition, we propose a variant of the Byte-Pair Encoding algorithm to perform effective word segmentation for Vietnamese texts and alleviate the rare-word problem that persists in NMT systems.


I Introduction

Neural machine translation (NMT) [1, 2] has been widely applied to machine translation (MT) in recent years, with a focus on popular language pairs such as English-French, English-German, English-Chinese or English-Japanese. NMT has obtained state-of-the-art performance on those language pairs compared to traditional statistical machine translation (SMT) when given enough data [3, 4]. Furthermore, due to its ability to learn features, an NMT system can be trained end-to-end from pure parallel texts with minimal linguistic knowledge of the languages involved. This makes training NMT for a new language pair much easier, more scalable and more robust. Nevertheless, NMT has not been employed for many low-resourced language pairs, since in those scenarios data scarcity often limits the learning ability of neural methods. In contrast, combining complicated linguistic-driven features in a typical log-linear framework still keeps SMT the best approach in many translation directions, but it is also hard to apply to new domains or to other language pairs.

In this paper we attempt to build NMT systems for such a low-resourced language pair: Japanese-Vietnamese. Our aim is to set up the first reasonable NMT systems that can be reproduced, in order to serve as baselines for further research in this direction (all the corpora and code used in the experiments will be published later). Furthermore, we conduct experiments with some advanced methods to improve the quality of the systems. An important criterion for those methods is that they must be as scalable and language-independent as possible. This criterion preserves the basic principle of NMT as well as the reproducibility of the systems. On the other hand, the methods are chosen so that they help alleviate the data sparsity problem of NMT when applied in this low-resourced setting.

Specifically, to deal with rare-word translation problems, we experiment with translation units at different levels: subword, word and beyond. In morphologically rich languages such as English or German, using subwords as translation units is often suitable, since neural methods are able to induce the meaning or linguistic function of the sub-components constituting a word. Byte-Pair Encoding (BPE) [5] is a simple unsupervised technique for subword segmentation and it has great effect when applied to NMT training. Japanese and Vietnamese (and some other Asian languages), however, have different word segmentation issues. Hence, it would be difficult to apply BPE directly to texts in those languages and build NMT systems for subword translation without any modification. In this paper, we experiment with different segmentation methods for both languages and also propose a variant of the BPE algorithm to learn translation units for Vietnamese in an unsupervised way.

We also attempt to increase the amount of training data by using back-translated texts or mix-source data built from our small available corpus. Those data augmentation approaches have shown their effectiveness on various NMT systems, especially in under-resourced scenarios. While the back-translation technique is used to generate synthetic data from monolingual corpora, the mix-source technique utilizes human-quality corpora in a multilingual setting, leveraging the transfer learning ability across languages. Both are simple but elegantly model the relevant noise needed when training neural architectures in such low-resourced situations.

The main contributions of this paper are:

  • We created the first NMT systems for Japanese↔Vietnamese and released the dataset as well as the code to reproduce the experiments.

  • We experimented with several segmentation schemes and proposed a variant of the BPE algorithm for Vietnamese, which does not need any labeled data or linguistic resources.

  • We applied elegant data augmentation methods in order to reduce the severity of the data sparsity problem when training NMT systems on a small Japanese-Vietnamese dataset.

II Neural Machine Translation

In this section, we describe the general architecture of NMT as a kind of sequence-to-sequence modeling framework. In such a framework, an encoder encodes context information from the input sequence and a decoder generates one item of the output sequence at a time based on the context of both input and output sequences. In addition, a component named attention sits in between, deciding which parts of the input sequence the decoder should pay attention to in order to choose what to output next. In other words, this attention component calculates the context relevant to the decision of the decoder at the time under consideration. Those components as a whole constitute a large trainable neural architecture called the attention-based encoder-decoder framework, which has become popular in many sequence-to-sequence tasks.

In the field of machine translation, using the attention-based encoder-decoder framework is referred to as the Neural Machine Translation approach. As first presented in [1], the encoder and decoder in NMT are recurrent-based, in which each hidden unit is a recurrent unit such as the Long Short-Term Memory (LSTM) [6] or the Gated Recurrent Unit (GRU) [7]. Later, the encoder or decoder can also be a convolutional architecture, as in [8]. Recently, [9] introduced the Transformer architecture, in which both the encoder and decoder are built from a special variant of the attention mechanism called self-attention. In this paper, we briefly explain the recurrent NMT model, as we utilize it in our experiments.

The Recurrent NMT model follows the attention-based architecture proposed by [1]. The bidirectional recurrent encoder reads every word $x_t$ of a source sentence $x = (x_1, \dots, x_T)$ and encodes a representation $h_t$ of the sentence into a fixed-length vector concatenated from those of the forward and backward directions:

$$\overrightarrow{h}_t = f(\overrightarrow{h}_{t-1}, E_x x_t), \qquad \overleftarrow{h}_t = f(\overleftarrow{h}_{t+1}, E_x x_t), \qquad h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$

Here $x_t$ is the one-hot vector of the word and $E_x$ is the word embedding matrix which is shared across the source words. $f$ is the recurrent unit computing the current hidden state of the encoder based on the previous hidden state. $h_t$ is then called an annotation vector, which represents the information of the source sentence up to time $t$ from both forward and backward directions.

Those annotation vectors of the source sentence are combined in the attention layer in such a way that the resulting vector encodes the source context relevant to the target word the decoder should produce. Intuitively, a relevance score between the previous decoder hidden state $s_{i-1}$ (which summarizes the target words produced so far) and the annotation vectors $h_t$ corresponding to the source words can be used to form the attention:

$$e_{it} = v_a^\top \tanh(W_a s_{i-1} + U_a h_t), \qquad \alpha_{it} = \frac{\exp(e_{it})}{\sum_{k=1}^{T} \exp(e_{ik})}, \qquad c_i = \sum_{t=1}^{T} \alpha_{it} h_t$$

This specific attention mechanism, originally called the alignment model in [1], is a simple feedforward network whose first layer is learnable via the adaptation factors $W_a$, $U_a$ and $v_a$. The relevance scores $e_{it}$ are then normalized into attention weights $\alpha_{it}$, and the context vector $c_i$ is calculated as the weighted sum of all annotation vectors $h_t$. Depending on how much attention the target word at time $i$ puts on the source states $h_t$, a soft alignment is learned and a source context at time $i$ is calculated prior to the prediction of the decoder.
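As a minimal NumPy sketch of this additive attention computation (dimensions, variable names and the toy inputs are illustrative, not the exact implementation of our systems):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """s_prev: (d,) previous decoder state; H: (T, 2d) annotation vectors."""
    # relevance scores e_t between the previous decoder state and each h_t
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (T,)
    alpha = softmax(e)                              # attention weights
    return alpha @ H                                # context vector c_i

# toy example: T=5 source positions, decoder size d=4, annotation size 2d=8
rng = np.random.default_rng(0)
d, T = 4, 5
c = attention_context(rng.normal(size=d), rng.normal(size=(T, 2 * d)),
                      rng.normal(size=(d, d)), rng.normal(size=(d, 2 * d)),
                      rng.normal(size=d))
print(c.shape)  # (8,)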

Similar to the encoder, the recurrent decoder recursively generates one target word at a time to form a translated target sentence in the end. At time $i$, it takes the previous hidden state of the decoder $s_{i-1}$, the previous embedded word representation $E_y y_{i-1}$ and the time-specific context vector $c_i$ as inputs to calculate the current hidden state $s_i$:

$$s_i = g(s_{i-1}, E_y y_{i-1}, c_i)$$

Again, $g$ is the recurrent activation function of the decoder and $E_y$ is the shared word embedding matrix of the target sentences.

Given the parallel corpus $D$ consisting of $N$ training examples $(x^{(n)}, y^{(n)})$, the objective is to maximize the conditional log-probability of the correct translation given a source sentence with respect to the parameters $\theta$ of the whole model:

$$\theta^{*} = \underset{\theta}{\arg\max} \sum_{n=1}^{N} \log p\big(y^{(n)} \mid x^{(n)}; \theta\big)$$
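In practice this objective is minimized as a per-token negative log-likelihood. A minimal PyTorch-style sketch (illustrative, assuming a model that already returns per-position log-probabilities; not the exact training code of our systems):

import torch.nn.functional as F

def nmt_loss(log_probs, target, pad_id=0):
    """log_probs: (batch, tgt_len, vocab) with log p(y_i | y_<i, x); target: (batch, tgt_len)."""
    # summing the per-token NLL and minimizing it maximizes sum_n log p(y | x; theta)
    return F.nll_loss(log_probs.transpose(1, 2), target,
                      ignore_index=pad_id, reduction="sum")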

III Subword Translation

One of the most severe problems of NMT is dealing with rare words, which either are not in the short lists of the vocabularies, i.e. out-of-vocabulary (OOV) words, or do not appear in the training set at all. On one hand, we would like to have fewer OOV words by increasing the size of the short lists. On the other hand, we need our neural network to learn fast and to generalize well to unseen words.

As explained briefly in the introduction, for many languages, using subwords instead of words as translation units (TUs) has been shown not only to be effective in reducing vocabulary sizes, thus alleviating the computational burden of the large softmax layer and reducing a substantial number of parameters to be learnt, but also to enable generating unseen words. In those languages, a word can be a compound word or be composed of sub-components, each of which has its own raw meaning or carries morphological information. Segmenting words into sub-components allows NMT to learn to translate them with considerably less data. For example, there is definitely less chance to see the popular German word "Wohnungsreinigung" (English equivalent: "house cleaning") than its sub-components "Wohnung" (i.e. "house" or "flat") and "reinigung" (i.e. "cleaning") in a middle-sized German-English parallel corpus. Instead, NMT can observe and translate those sub-components ("Wohnung" and "reinigung") and combine their translations to generate the unseen word ("house cleaning"). This is achieved by segmenting words into subword units using segmentation techniques in the preprocessing phase prior to translation. There are several segmentation methods; some are complicated ones which require linguistic resources or human-crafted rules, and are thus not language-independent and expensive to obtain for low-resourced languages.

Byte-Pair Encoding, on the other hand, is a simple but robust technique for subword segmentation. Since it is an unsupervised and fast technique, it has great effect when applied to build NMT systems for morphologically rich languages. BPE was originally proposed in [10] as a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. [5] adapted Gage's algorithm to word segmentation by merging frequent characters inside a word instead of merging frequent pairs of bytes in a whole file (a sequence of bytes). Applied to translation from and to morphologically rich languages, it can automatically induce sub-components of a word (i.e. sequences of characters) which bear some meaning or morphological function without knowing much about the linguistic characteristics of those languages. Hence, the TU of the NMT systems used in those languages is at a level smaller than the word, i.e. the subword.
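A minimal sketch of the merge-learning loop described above (simplified from [5]; the toy vocabulary is illustrative, and the real implementation guards symbol boundaries instead of using a naive string replacement):

from collections import Counter

def learn_bpe(vocab, num_merges):
    """vocab: dict mapping space-separated symbol sequences to word counts."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged = "".join(best)
        # naive replacement of "a b" by "ab" in every word of the vocabulary
        vocab = {w.replace(" ".join(best), merged): f for w, f in vocab.items()}
    return merges

# toy vocabulary: characters of each word separated by spaces
print(learn_bpe({"l o w": 5, "l o w e r": 2, "n e w e s t": 6}, 3))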

On the other hand, Japanese and Vietnamese differ from those languages in the way a word is considered. In Vietnamese, the TU normally considered as a word, which is often separated from the others by white spaces, is not a word in the linguistic sense, since it does not really have its own meaning. Therefore, applying BPE to segment those TUs into smaller units without any modification is not suitable for Vietnamese. In the case of Japanese, there are no spaces to separate written texts into TUs. Thus, for Vietnamese and Japanese, as well as for languages having similar problems, before applying BPE it is necessary to have a preprocessing step to tokenize the texts into words.

III-A Vietnamese Tokenization

From the linguistic point of view, each sequence of characters between two white spaces in Vietnamese texts cannot be considered a word, since it does not always have a full meaning of its own. For example, in the sentence "hôm nay là sinh nhật của tôi" (English equivalent: "Today is my birthday"), "hôm" and "nay" are not two words; together they form one word, which means "today". Nevertheless, "hôm" and "nay" still bear some meaning: "hôm"-"day", "nay"-"now". Similarly, "sinh"-"birth" and "nhật"-"date" also form the word "sinh nhật"-"birthday", but they are not two distinct words. We could also call them subwords.

Many Vietnamese processing tasks such as Part-of-Speech Tagging, Syntax Parsing or Chunking require a step that concatenates those subwords into words, since in those tasks the word is considered the smallest unit to be processed. This step is normally referred to as word segmentation. There are various word segmentation methods; the best ones use machine learning approaches to learn from a labeled corpus. This makes the task hard and expensive to apply to other domains. Furthermore, the translation unit in machine translation does not need to be a word; it can be a subword or a sequence of subwords as long as it has its own meaning.

With this observation in mind, if we consider a subword, i.e. the sequence of characters between two white spaces, as a byte and a sentence as a sequence of bytes, we can apply the BPE algorithm straightforwardly: we iteratively find the most frequent pair of subwords (A, B) and replace it by the concatenated subword A_B. We do not merge A with B if one of them is a digit, a punctuation mark or another special symbol. The BPE learning algorithm has one argument, the minimum frequency; in practice, we set the minimum frequency to 2. Listing 1 presents this variant of BPE, which we call VNBPE.

No. | Vietnamese phrase | Segmentation using pyvi's algorithm | Segmentation using VN_BPE algorithm
1 | sẽ kết thúc | sẽ kết_thúc | sẽ_kết_thúc
2 | sự tập trung | sự tập_trung | sự_tập_trung
3 | một đống | một đống | một_đống
4 | vào lĩnh vực | vào lĩnh_vực | vào_lĩnh_vực
5 | bằng máy bay | bằng máy_bay | bằng_máy_bay
Table I: Comparing a decent segmentation algorithm and VN_BPE.
# VNBPE: merge frequent adjacent Vietnamese subwords (whitespace-separated
# syllables) into single translation units, analogously to BPE.
import re
from collections import Counter

SPECIAL = re.compile(r"^[\W\d_]+$")   # digits, punctuation and other symbols

def get_most_freq_pairs(lines, min_freq):
    """Count adjacent subword pairs, ignoring digits/punctuation/symbols."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for w1, w2 in zip(words, words[1:]):
            if SPECIAL.match(w1) or SPECIAL.match(w2):
                continue
            counts[(w1, w2)] += 1
    return [(freq, pair) for pair, freq in counts.items() if freq >= min_freq]

def update_pairs(pair, lines, codes):
    """Replace every occurrence of 'w1 w2' by 'w1_w2' and record the merge."""
    original_word = pair[0] + " " + pair[1]
    replaced_word = pair[0] + "_" + pair[1]
    codes.append(replaced_word)
    return [line.replace(original_word, replaced_word) for line in lines]

### MAIN PART ###
min_freq = 2
with open("input.vi", encoding="utf-8") as f:      # illustrative file paths
    lines = f.read().splitlines()
codes = []
pairs = get_most_freq_pairs(lines, min_freq)
pairs.sort(reverse=True)                           # decreasing order of frequency
for freq, pair in pairs:
    lines = update_pairs(pair, lines, codes)
with open("codes.vnbpe", "w", encoding="utf-8") as f:
    f.write("\n".join(codes) + "\n")
with open("output.vi", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

Listing 1: The proposed variant of BPE (VNBPE) for Vietnamese.

Compared to other word segmentation methods, which require training on labeled data, our VNBPE is a simple unsupervised method, like the original BPE algorithm. Table I shows the outputs of one decent word segmentation method and of our VNBPE.

III-B Japanese Tokenization

A Japanese written text can mix three different types of scripts: Chinese characters (kanji) and two syllabic scripts, hiragana and katakana. Each kanji character can be loosely considered a subword in the sense of the previous section, while each hiragana or katakana character can be considered similar to a Latin character in English or Vietnamese. In addition, there are no spaces in Japanese written texts to separate the characters, whether kanji, hiragana or katakana. So we cannot learn good subwords from a small corpus by directly applying BPE or the VNBPE variant to Japanese written texts. In order to learn good subwords with little knowledge about Japanese, we decided to use KyTea [11] to do Japanese word segmentation and then apply Sennrich's BPE on those word-segmented texts. Some examples of Japanese words going through word segmentation and BPE are shown in Table II.

Before BPE | After BPE | Vietnamese equiv. | English equiv.
受け入れる | 受け 入れる | Chấp nhận | Accept
崩れ落ちる | 崩れ 落ちる | Thu gọn | Collapse
姉ちゃん | 姉 ちゃん | Chị gái | Older sister
哀れん | 哀 れん | Đáng tiếc | Pity
取りかかる | 取り かかる | Để bắt đầu | To start
Table II: Examples of Japanese words after BPE.
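A hedged sketch of this tokenization-plus-BPE pipeline using the subword-nmt package (file paths, the number of merge operations and the KyTea flag are illustrative assumptions and may differ across installed versions):

import subprocess

# segment Japanese with KyTea (outputting only the segmented words)
subprocess.run("kytea -notags < train.ja > train.tok.ja", shell=True, check=True)
# learn and apply Sennrich's BPE on the word-segmented text
subprocess.run("subword-nmt learn-bpe -s 8000 < train.tok.ja > codes.ja",
               shell=True, check=True)
subprocess.run("subword-nmt apply-bpe -c codes.ja < train.tok.ja > train.bpe.ja",
               shell=True, check=True)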

IV Data Augmentation

In this section, we describe the data augmentation methods we use to increase the amount of training data in order to make our NMT systems suffer less from the low-resourced situation of Japanese↔Vietnamese translation. Although NMT systems can predict and generate the translations of words unseen in their vocabularies, they only perform this well if the parallel corpus used for training is sufficiently large. For many under-resourced languages, unfortunately, such a corpus is hardly available. In reality, although monolingual data for Vietnamese and Japanese are immensely available due to the large populations of their speakers, bilingual Japanese-Vietnamese corpora are very limited and often of low quality or restricted to narrow technical domains. Therefore, data augmentation methods that exploit monolingual data are necessary to obtain more bilingual data for NMT systems, thus improving translation quality.

IV-A Back Translation

One approach to leveraging monolingual data is to use a machine translation system to translate that data in order to create synthetic parallel data. Normally, the monolingual data in the target language is translated, hence the name of the method: Back Translation [12].

More specifically, to generate the data for an X→Y NMT system, we use the best Y→X translation system we have to translate every sentence in the monolingual data of language Y into sentences in the source language X. Then we pair each translated sentence with its original to get the synthetic data. Finally, the original bilingual data and the synthetic data are mixed to train our NMT system from scratch.
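A minimal sketch of this construction (translate_tgt_to_src stands for whatever Y→X system is available; all names are illustrative):

def build_back_translated_corpus(mono_tgt, translate_tgt_to_src):
    """mono_tgt: list of Y sentences; translate_tgt_to_src: a Y->X MT system."""
    # pair each back-translated source sentence with its original target sentence
    return [(translate_tgt_to_src(y), y) for y in mono_tgt]

def mix(original_pairs, synthetic_pairs):
    # the final training set simply mixes original and synthetic pairs
    return original_pairs + synthetic_pairs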

Back Translation can improve the estimation of the conditional probability of the target word given the previous context words by adding bilingual data whose source side is an approximate translation. Furthermore, the synthetic data might contain some translation noise from the back-translation system, and if this noise is relevant, our NMT can become more robust in learning how to translate noisy inputs. On the other hand, if the quality of the back-translation system is not adequate, using the synthetic data might have an adverse effect on our NMT.

In this paper, we subsample an amount of Vietnamese monolingual data such that the resulting synthetic corpus has the same size as the Japanese-Vietnamese parallel corpus. In the end, the data we have is double the size of the original.

IV-B Mix-Source Approach

Another data augmentation method considered useful in this low-resourced setting is the mix-source method [13]. In this method, we can utilize monolingual data of the target language in a multilingual NMT system by mixing the original source sentences with that target monolingual data. The multilingual framework then uses the information shared across source and target languages to improve the choice of target words.

Specifically, there is a small parallel corpus of the language pair X-Y which has N sentence pairs (x^(n), y^(n)) and a big monolingual corpus of the language Y which has M sentences y^(m). From the monolingual corpus we can generate a parallel corpus in which we model the identical translation Y-Y: (y^(m), y^(m)). Then we mix the two corpora to get a parallel corpus of size N + M. Similar to Back Translation, we subsample Vietnamese sentences so that the Y-Y corpus has the same size as the original parallel corpus; the size of the parallel data we have is thus also doubled.

To let the NMT system know which language a certain source sentence is in, and thus model the language information, we follow the conventions from [13]: we append language tags to every word in both the source and target sentences of the mixed corpus to indicate the language of the words. This technique has shown its effectiveness in low-resourced scenarios [14, 15], and our Japanese→Vietnamese setting is such a scenario.
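A minimal sketch of the mix-source construction with per-word language tags (the exact tag format shown here is an illustrative assumption; [13] only prescribes appending language tags to each word):

def tag(sentence, lang):
    # append a language tag to every word, e.g. "xin chào" -> "xin_vi chào_vi"
    return " ".join(f"{w}_{lang}" for w in sentence.split())

def mix_source(parallel_xy, mono_y, src_lang="ja", tgt_lang="vi"):
    """parallel_xy: list of (x, y) pairs; mono_y: list of Y sentences."""
    mixed = [(tag(x, src_lang), tag(y, tgt_lang)) for x, y in parallel_xy]
    # add identical Y-Y pairs built from the target-language monolingual data
    mixed += [(tag(y, tgt_lang), tag(y, tgt_lang)) for y in mono_y]
    return mixed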

V Experiments

V-A Data Preparation

We collected Japanese-Vietnamese parallel data from TED talks extracted from the WIT3 corpus [16]. After removing blank and duplicate lines, we obtained 106758 sentence pairs. The validation set used in all experiments is dev2010 and the test set is tst2010.

The data augmentation methods have been applied only to the Japanese→Vietnamese direction. For Back Translation, we use Vietnamese monolingual data from the VNESEcorpus of DongDu (http://viet.jnlp.org/download-du-lieu-tu-vung-corpus), which includes 349578 sentences. We shuffle the lines of VNESEcorpus and take the first 106758 sentences (the same as the number of sentence pairs in the original parallel corpus). For Mix-Source, instead of using a subsampled monolingual corpus, we use the Vietnamese part of the Japanese-Vietnamese parallel corpus in order to learn the multilingual information in the same domain. Our datasets are listed in Table III.

Dataset | Description | Num. of sentences
Training | TED | 106758
Back Translation data | Subsampled DongDu | 106758
Mix-Source data | Vietnamese part of TED | 106758
Validation | TED dev2010 | 568
Test | TED tst2010 | 1220
Table III: Statistics of the datasets used in our experiments.

V-B Preprocessing

After using KyTea to tokenize the Japanese texts, we learn and apply Sennrich's BPE on the tokenized texts. For the Vietnamese texts, first we use the Moses scripts (https://github.com/moses-smt/mosesdecoder/tree/master/scripts) to normalize digits, punctuation and special symbols. We use pyvi (https://github.com/trungtv/pyvi) for Vietnamese word segmentation, since it is one of the best tools for this task in terms of speed, robustness and performance. Alternatively, we use VNBPE as another way of doing word segmentation. Those two approaches are compared in an extrinsic evaluation of the NMT systems employing them.
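A minimal usage sketch of the pyvi segmenter, assuming its ViTokenizer API (the printed output is indicative):

from pyvi import ViTokenizer

sent = "hôm nay là sinh nhật của tôi"
print(ViTokenizer.tokenize(sent))   # e.g. "hôm_nay là sinh_nhật của tôi"
# VNBPE, by contrast, merges "hôm nay", "sinh nhật", ... only if the pair is
# frequent enough (minimum frequency 2) in the training corpus; see Listing 1.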

V-C System Architecture and Training

We implement the translation systems using the OpenNMT-py framework (https://github.com/OpenNMT/OpenNMT-py) [17]. Our system architecture includes two bidirectional LSTM layers for the encoder and two LSTM layers for the decoder, each layer with 512 hidden units. The size of the source and target embedding layers is also 512. We use the Adam optimizer [18] with a learning rate annealing scheme. We train each system for 15 epochs with a batch size of 64. The best model in terms of unigram accuracy on the validation set is used to translate the test set with a beam size of 16. Other settings are the defaults of OpenNMT-py unless otherwise noted.

V-D Results

We evaluate the translation quality of the systems based on the different approaches mentioned in the previous sections. The multi-BLEU script from Moses is used. The results are shown in Table IV.

Baseline. For the baseline systems, the training data consists of KyTea-segmented Japanese texts and pyvi-segmented Vietnamese texts. For comparison purposes, we build two baseline systems for each direction: one uses traditional phrase-based statistical machine translation (SMT), the other is an NMT system. Although our training set is small, we find that the NMT systems (2) are still more effective than the phrase-based SMT models (1) in both translation directions.

Subword NMT. We applied VNBPE and JPBPE to the baseline's data and trained NMT systems. On Vietnamese→Japanese, we observed an improvement of 0.6 BLEU points when we used our VNBPE (3) instead of pyvi's word segmentation (2). Furthermore, when we trained our NMT models using both BPE methods (4), we obtained a bigger gain of 1.15 BLEU points. Similar improvements can be found for Japanese→Vietnamese as well: 0.29 BLEU points between (3) and (2) and 0.57 BLEU points between (4) and (3). This leads to two conclusions: (i) despite using an unsupervised Vietnamese word segmentation which is fast, robust and does not require linguistic resources, our NMT systems performed better than those employing a complicated word segmentation method; (ii) BPE works significantly well for Japanese texts after tokenization.

Data Augmentation. We use the best Vietnamese→Japanese system, which is the NMT system trained on BPE-processed texts, to generate the synthetic data for Japanese→Vietnamese translation. Although we achieved some gain (from 9.04 to 9.39 BLEU points), the effectiveness of Back Translation is not on par with its application to translation systems of other language pairs. Looking into the Vietnamese→Japanese translations of the DongDu corpus and their BLEU score, we speculate that this is because the Vietnamese→Japanese system is not good enough to produce reasonable synthetic data. Meanwhile, combining Back Translation and Mix-Source brings a considerable improvement of 0.6 BLEU points compared to not using them.

Vietnamese→Japanese
System | dev2010 | tst2010
(1) SMT Baseline | - | 8.73
(2) NMT Baseline | 8.68 | 9.39
(3) + VNBPE | 9.12 | 9.89
(4) + JPBPE | 9.74 | 11.13

Japanese→Vietnamese
System | dev2010 | tst2010
(1) SMT Baseline | - | 7.73
(2) NMT Baseline | 6.85 | 8.18
(3) + VNBPE | 7.36 | 8.47
(4) + JPBPE | 7.77 | 9.04
(5) + Back Translation | 8.25 | 9.39
(6) + Mix-Source | 8.56 | 9.64

Table IV: Evaluation of Japanese↔Vietnamese NMT systems.

VI Related Works

Japanese-Vietnamese MT was first mentioned in 2005 [19]. The authors focused on the differences in embedding structures between Japanese and Vietnamese, then proposed rules for an MT system and experimented on a very small dataset (714 Japanese embedded sentences). This approach is suitable for small systems applied to a specific domain or language, but it is not easily extendable to other domains or languages due to the expense of building such rules.

The other previous work on Japanese→Vietnamese uses SMT [20]. They also conducted experiments on parallel corpora collected from TED talks. They used phrase-based and tree-to-string models and showed that an SMT system trained on French→Vietnamese obtains better results than the Japanese→Vietnamese system, because French and Vietnamese share more similarities in sentence structure than Japanese and Vietnamese do. We also built phrase-based systems on the TED data and achieved better BLEU scores when using NMT.

Recently, several works have used monolingual data to improve the accuracy of NMT systems. [12] showed significant improvements by using target-side monolingual data to generate synthetic data which is then added to the original training data. [21] showed significant improvements with a "self-learning" method that generates target sentences from source-side monolingual data and combines them with the original bilingual data for training. [13] convert a target-side monolingual corpus into a bitext by copying target sentences to the source side and then combine it with the original bilingual data during training. Our systems employ those approaches to exploit monolingual data and show improved performance for Japanese→Vietnamese translation.

VII Conclusion

We have built the first Japanese↔Vietnamese NMT systems and released the dataset as well as the associated training scripts. We have also shown that the proposed VNBPE algorithm can be used for Vietnamese word segmentation in order to conduct neural machine translation. Furthermore, by adopting Back Translation and Mix-Source, our NMT systems achieved the best improvements on the dataset. In the future, we will exploit more domain and multilingual information to improve the quality of the systems.

VIII Acknowledgments

We would like to thank the center of High-Performance Computing (HPC), University of Engineering and Technology, VNU, Vietnam, for allowing us to use their GPUs to perform the experiments mentioned in this paper. We also thank the anonymous reviewers for their careful reading of our paper and their insightful comments.

References