Sequence-to-sequence neural network models for transliteration

10/29/2016 ∙ by Mihaela Rosca, et al. ∙ Google

Transliteration is a key component of machine translation systems and software internationalization. This paper demonstrates that neural sequence-to-sequence models obtain state of the art or close to state of the art results on existing datasets. In an effort to make machine transliteration accessible, we open source a new Arabic to English transliteration dataset and our trained models.




1 Introduction

Transliteration, the conversion of proper nouns from one orthographic system to another, is an important task in multilingual text processing, useful in applications like online mapping and as a component of machine translation systems. Transliteration is determined by a collection of historical accidents, conventions, and statistical regularities: many language pairs have adopted different rules for transliteration over time, and many transliterations depend on the origin of a word. These properties make it desirable to look for high-quality, automated machine learning solutions to the problem.

A number of model-based methods for machine transliteration have been developed in the past. Such models assume that character sequences in the source orthography correspond to predictable character sequences in the target orthography, possibly depending on context. Some models additionally assume that such correspondences are influenced by phonetic information. Due to the statistical nature of the problem, components of such models frequently involve parameters and statistical modeling such as hidden Markov models, logistic regression, finite state transducers, and/or conditional random fields (CRFs). Typical examples of such models are [Ammar et al.2012, Ganesh et al.2008]; transliteration in such a system is based on an alignment step, followed by a CRF model that performs local string rewriting.

In many areas, including machine translation, end-to-end deep learning models have become good alternatives to more traditional statistical approaches. This is our motivation for taking a similar approach to transliteration. Unlike statistical models, such end-to-end systems simply take a character string in the source orthography and are trained directly to produce a character string in the target orthography. The closest approaches to the transliteration methods described in this paper are probably found in [Rao et al.2015], which uses bidirectional LSTM models together with input delays for grapheme to phoneme conversion (which can be viewed as a kind of “transliteration” from English to IPA), and in [Yao and Zweig2015], which uses attentionless sequence-to-sequence models for the same task.

This paper describes the application of two neural-network-based sequence-to-sequence models to transliteration. Both take principled and general-purpose approaches to alignment and to one-to-many or many-to-one correspondences. The first model is based on epsilon insertions and CTC [Graves et al.2006] alignment; the second is an attentional sequence-to-sequence model commonly used in end-to-end machine translation [Bahdanau et al.2014]. We report and compare both character error rates (CER) and word error rates (WER) on all problems.

2 Models and datasets

2.1 Epsilon Insertion

Epsilon insertion (EI) [Azawi et al.2013] is a simple technique for allowing sequence-to-sequence models to produce strings of different lengths from an input string. Epsilon insertion replaces the original problem of transliteration with a similar problem in which the source string has been modified by the insertion of epsilons (which we will represent as ‘_’). Transliteration is then performed by an LSTM (possibly bidirectional and deep), and the output is aligned using CTC.

  • source string: きょうと

  • source string with epsilons: __き__ょ__う__と__

  • LSTM output (after training): __ki_yo_u__to_

  • after CTC alignment: __ky_o_____to_
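The two string operations in the example above can be sketched in a few lines of Python. This is only an illustration, not the CLSTM implementation: the function names are hypothetical, the padding scheme (a fixed number of epsilons before every character and after the last one) is inferred from the example, and the collapse step assumes the standard CTC convention in which the epsilon plays the role of the blank symbol.

```python
EPSILON = "_"  # the symbol used for epsilon in the example above


def insert_epsilons(source: str, count: int = 2) -> str:
    """Put `count` epsilons before every character and after the last one."""
    pad = EPSILON * count
    return "".join(pad + ch for ch in source) + pad


def ctc_collapse(aligned: str) -> str:
    """Standard CTC readout (assumed here): merge repeated symbols, drop epsilons."""
    output = []
    previous = None
    for ch in aligned:
        # Keep a symbol only if it starts a new run and is not an epsilon.
        if ch != previous and ch != EPSILON:
            output.append(ch)
        previous = ch
    return "".join(output)


padded = insert_epsilons("きょうと")        # "__き__ょ__う__と__"
decoded = ctc_collapse("__ky_o_____to_")   # "kyoto"
print(padded)
print(decoded)
```

Padding the input gives the network more time steps than output characters, which is what lets a per-time-step model emit a target that is longer than the source.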

The implementation used for epsilon insertion models is CLSTM, an open source C++ library. We also open source all the trained EI models described in this paper.

2.2 Attentional Sequence-to-Sequence Models

Attentional sequence-to-sequence models (Seq2Seq) [Bahdanau et al.2014] work by using an encoder RNN to learn representations of the input sequence and a decoder RNN to produce the output sequence from the hidden representations the encoder created. The attention mechanism allows the decoder to focus on different parts of the input for each time step in the output sequence and can be seen as the analog of the alignment mechanism used in traditional statistical translation models. Sequence-to-sequence models do not have the implicit monotonicity assumption that unidirectional CTC models do, hence they are more flexible regarding input-output reordering. This is crucial for machine translation, but less important for transliteration, where the sound order is preserved from the source to the target.
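To make the attention step concrete, the following is a minimal pure-Python sketch of additive attention in the style of [Bahdanau et al.2014]: each encoder state is scored against the current decoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of encoder states. The function name, the parameter names (`w_dec`, `w_enc`, `v`) and the toy dimensions are assumptions made for illustration; this is not the TensorFlow implementation used in the experiments.

```python
import math
import random


def additive_attention(decoder_state, encoder_states, w_dec, w_enc, v):
    """Return (weights, context) for one decoder time step.

    score_i = v . tanh(w_dec @ decoder_state + w_enc @ encoder_states[i])
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    projected_decoder = matvec(w_dec, decoder_state)
    scores = []
    for state in encoder_states:
        projected_encoder = matvec(w_enc, state)
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_decoder, projected_encoder)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Numerically stable softmax over the input positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    size = len(encoder_states[0])
    context = [sum(w * state[j] for w, state in zip(weights, encoder_states))
               for j in range(size)]
    return weights, context


# Toy example: 4 input positions, state size 3, attention size 2.
rng = random.Random(0)
rand_matrix = lambda rows, cols: [[rng.uniform(-1, 1) for _ in range(cols)]
                                  for _ in range(rows)]
encoder_states = rand_matrix(4, 3)
decoder_state = [rng.uniform(-1, 1) for _ in range(3)]
weights, context = additive_attention(
    decoder_state, encoder_states,
    w_dec=rand_matrix(2, 3), w_enc=rand_matrix(2, 3),
    v=[rng.uniform(-1, 1) for _ in range(2)])
print(weights)
```

The weights form a distribution over input positions, which is why they can be read as a soft alignment between source and target characters.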

For sequence-to-sequence models we experimented with GRU and LSTM cells and assessed the impact of using a bidirectional encoder. As recommended in [Sutskever et al.2014], we feed the input sequence to the encoder in reverse order. For our experiments we used the implementation with an embedding layer provided by TensorFlow [Abadi et al.2016].

2.3 Datasets

We assess the proposed models on Arabic to English (AR-EN) and English to Japanese (EN-JA) transliteration, and on grapheme to phoneme conversion, specifically English to IPA (EN-IPA). The datasets we used are described in Table 1. For Arabic to English transliteration, we introduce a new corpus extracted from Wikipedia: first, we created a bilingual dataset of full names from titles of Arabic and English articles referring to the same person; second, we used it to learn alignments between name parts to create the final dataset. Since no direction-specific information was used in data gathering, the data can be used for both English to Arabic and Arabic to English transliteration. Due to the extraction process, the dataset includes names of various origins (e.g. Papadopoulos is of Greek origin), and some English tokens contain characters specific to other languages, such as ß, ø and ł.

For the transliteration datasets (EN-JA, AR-EN), the English tokens were lowercased and diacritics were removed (è becomes e, ü becomes u). The inputs and outputs of our models are Unicode codepoints: the model reads one Unicode codepoint at a time from the source string and produces Unicode codepoints.
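A preprocessing step of this kind might look like the sketch below. The paper does not say how diacritics were removed; the sketch assumes Unicode NFD decomposition followed by dropping combining marks, and the function names are hypothetical. The optional `reverse` flag illustrates the reversed encoder input described for the sequence-to-sequence models.

```python
import unicodedata
from typing import List


def normalize_english(token: str) -> str:
    """Lowercase a token and strip combining diacritics (assumed method: NFD)."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_codepoints(text: str, reverse: bool = False) -> List[int]:
    """Map a string to its Unicode codepoints, optionally in reverse order."""
    points = [ord(ch) for ch in text]
    return points[::-1] if reverse else points


print(normalize_english("Müller"))                # muller
print(to_codepoints(normalize_english("è")))      # [101]
print(to_codepoints("きょうと", reverse=True))
```

Characters that have no canonical decomposition, such as ß, ø and ł, pass through this particular normalization unchanged.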

It should be observed that datasets usually used for transliteration differ substantially in their statistical properties from datasets used in other machine learning research. In particular, transliteration datasets usually represent transliterations only once, regardless of how common the word is in the source language. A second issue is that transliteration datasets used for training contain a large number of exceptional words, whose transliteration probably cannot be learned at all. Finally, there are frequently multiple acceptable transliterations for a source word, but these are not usually represented in the training data; that is, many transliterations counted as errors during training and evaluation may be acceptable. In order to remain comparable to prior work in the area, we did not attempt to address these issues in the existing datasets or the datasets we created for this paper; we will return to the question of how this influences performance in the Discussion.

Dataset  Size
EN-IPA   123892  7.5   6.8  28  38
EN-JA    16356   10.8  6.5  29  83
AR-EN    15898   6     6.8  48  40
Table 1: Datasets on which we assessed model performance.

3 Experimental results

3.1 Training and parameters

For all experiments we used 10% of the data for testing, 10% of the remaining data for evaluation, and the rest for training. All networks were trained using gradient descent with momentum. We used gradient clipping to avoid exploding RNN gradients [Pascanu et al.2012]. EI models use a batch size of 1, a gradient clipping norm of 9, and 3 epsilons. When training EI models we randomly varied the learning rate (up to 0.1), momentum rate (0.5 to 0.99) and number of hidden units (100 to 1000). For sequence-to-sequence models we varied the following hyperparameters: learning rate (up to 10), momentum rate (0.5 to 0.99), batch size (1 to 50), gradient clipping norm (1 to 10) and number of hidden units (50 to 1000). For both model types we trained 1000 networks with different hyperparameter values but a fixed number of layers, chose the one that performed best on the evaluation set, and report its performance on the test set. We verified that, for each parameter range, optimal performance was reached within the interior of the interval explored.
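The data split and the random search over sequence-to-sequence hyperparameters can be sketched as follows. Several details are assumptions rather than facts from the paper: the data is shuffled before splitting, split sizes are rounded to the nearest integer, the learning rate is sampled log-uniformly, and the other ranges are sampled uniformly. The lower bound of the learning rate range is left as a required argument (`lr_min`) rather than given a default. All function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def split_dataset(pairs: Sequence[Pair],
                  rng: random.Random) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """10% test, 10% of the remainder for evaluation, the rest for training."""
    shuffled = list(pairs)
    rng.shuffle(shuffled)  # assumption: random split
    n_test = round(0.1 * len(shuffled))
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_eval = round(0.1 * len(rest))
    evaluation, train = rest[:n_eval], rest[n_eval:]
    return train, evaluation, test


def sample_seq2seq_config(rng: random.Random, lr_min: float) -> Dict[str, float]:
    """Draw one random configuration from the ranges listed in the text."""
    # Assumption: log-uniform learning rate between lr_min and 10.
    log_lr = rng.uniform(math.log10(lr_min), math.log10(10.0))
    return {
        "learning_rate": 10 ** log_lr,
        "momentum": rng.uniform(0.5, 0.99),
        "batch_size": rng.randint(1, 50),
        "clip_norm": rng.uniform(1.0, 10.0),
        "hidden_units": rng.randint(50, 1000),
    }


rng = random.Random(0)
toy_pairs = [("src%d" % i, "tgt%d" % i) for i in range(15898)]
train, evaluation, test = split_dataset(toy_pairs, rng)
print(len(train), len(evaluation), len(test))
```

Model selection then amounts to training one network per sampled configuration, keeping the configuration with the lowest error on the evaluation split, and scoring that single network on the test split.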

3.2 Results

Our results are described in Tables 2, 3 and 4. Table 5 compares our results against models trained on the same datasets. For completeness, we also report other transliteration results, although they are not directly comparable to our work because they used different datasets. Using statistical phonetic-based machine translation, [Finch and Sumita2008] report a 31% CER for English to Japanese transliteration. [Deselaers et al.2009] use deep neural networks for Arabic to English transliteration and report a 22.7% CER, while traditional approaches combined with a single layer perceptron achieved 11.1% on the same task [Freitag et al.2007].

Model    Layers  Bidi  Cell  CER   WER
EI       1             LSTM  18.8  52.8
EI       2             LSTM  18.1  51.1
Seq2Seq  1             GRU   22.8  57.1
Seq2Seq  2             GRU   20.2  50.2
Seq2Seq  3             GRU   22.2  55.4
Seq2Seq  1             LSTM  23.5  56
Seq2Seq  2             LSTM  22.5  55
Seq2Seq  1             GRU   22.6  54.6
Seq2Seq  2             GRU   20.5  51.8
Table 2: English to Japanese results. Test size: 1780. The RNN size reports the number of units of an individual network. For bidirectional networks the number of units should be multiplied by 2, to account for the network which sees the input in reverse order. Only bidirectional encoders were used for sequence-to-sequence models.

Model    Layers  Bidi  Cell  CER   WER
CTC      1             LSTM  22.7  79.2
CTC      2             LSTM  22.5  78.5
Seq2Seq  1             GRU   23.5  77.6
Seq2Seq  2             GRU   22.4  77.1
Seq2Seq  1             LSTM  22.9  77.2
Seq2Seq  1             GRU   22.9  78.2
Table 3: Arabic to English results. Test size: 1590.

Model    Layers  Bidi  Cell  CER   WER
EI       1             LSTM  9.2   38.7
EI       2             LSTM  8.1   34.2
Seq2Seq  1             GRU   7.8   28.8
Seq2Seq  2             GRU   7.01  26.4
Seq2Seq  3             GRU   7.01  26.6
Seq2Seq  1             LSTM  7.40  27.0
Seq2Seq  2             LSTM  7.05  26.2
Seq2Seq  1             GRU   7.45  27.6
Seq2Seq  2             GRU   7.38  28.0
Table 4: English to IPA results. Test size: 12389.

Dataset  Ours (WER)  Prior work (WER)  Reference
EN-JA    50.2        67.7              [Benson et al.2009]
EN-IPA   26.2        21.3              [Rao et al.2015]
EN-IPA   26.2        28.6              [Yao and Zweig2015]
EN-IPA   26.2        23.5              [Yao and Zweig2015]
Table 5: Comparing results with prior work on the same datasets. All our results reported here use sequence-to-sequence models. For [Yao and Zweig2015] we report results on two models, which we compare to our approach in the Discussion.

3.3 Error analysis

Table 6 shows a list of errors made by an Arabic to English transliteration model. The most common mistake that both explored models make on this task is to confuse vowels in the output. This is expected, given that Arabic has fewer vowels than English and that short vowels are often not written. Confusing “p” with “b” is another common mistake, explained by the lack of a sound in Arabic corresponding to the English “p”.

Input    Ground truth  Model output
ينس      jens          yens
هاوارد   howard        haward
فرج      faraj         farj
سميتشك   smyczek       smichk
Table 6: Example errors made by an Arabic to English model.
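The rows of Table 6 also make a convenient worked example of the two metrics reported in this paper. The sketch below uses common definitions, which are assumptions since the paper does not spell them out: CER is the total Levenshtein distance divided by the total number of reference characters, and WER is the fraction of outputs that are not an exact match. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        current = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_ch == hyp_ch else 1)
            current.append(min(previous[j] + 1,      # delete from reference
                               current[j - 1] + 1,   # insert into reference
                               substitution))
        previous = current
    return previous[-1]


def error_rates(pairs: Sequence[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (CER, WER) in percent for (reference, hypothesis) pairs."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    characters = sum(len(ref) for ref, _ in pairs)
    wrong = sum(1 for ref, hyp in pairs if ref != hyp)
    return 100.0 * edits / characters, 100.0 * wrong / len(pairs)


table6 = [("jens", "yens"), ("howard", "haward"),
          ("faraj", "farj"), ("smyczek", "smichk")]
cer, wer = error_rates(table6)
print(round(cer, 1), wer)  # 27.3 100.0
```

On these four rows every word counts as a full word error even though three of the four outputs differ from the reference by a single character, which illustrates how WER can be high while CER stays moderate.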

4 Discussion

This paper has demonstrated that end-to-end recurrent neural networks achieve high performance on cross-script transliteration for three common tasks: EN-JA, EN-IPA and AR-EN.

We have compared epsilon insertion models and attentional sequence-to-sequence models on three benchmarks. Attentional sequence-to-sequence models generally perform better, but not uniformly: they achieve lower error rates on EN-IPA, whereas epsilon insertion models achieve the lowest CER on EN-JA. For grapheme to phoneme conversion, our attention-based sequence-to-sequence models perform better than the attentionless sequence-to-sequence models used in [Yao and Zweig2015]. However, their bidirectional LSTM models, which use alignment features, outperform our attention-based models. There are two possible explanations: the alignment features learned by an alignment-specific model help more than the attention implicitly learned by our model, or the simplicity of the LSTM models is an advantage over sequence-to-sequence models.

Our results can be extended and potentially improved in a number of ways. One way is to explore other recurrent network architectures, such as adaptive computation time networks [Graves2016], or classifier combination via boosting or other methods [Rao et al.2015]. Another way is to combine the neural network cost with a target language model cost [Chan et al.2015]. Since transliteration involves a combination of orthographic and phonetic features, it might be useful to run a separate pronunciation model on the input string and then provide both the grapheme and the phoneme string as input to the transliteration model, mirroring previous non-neural approaches to transliteration [Jansche and Sproat2009].

Perhaps one of the most important areas for improvement is training data. Right now, transliteration research (including the work described here) performs training and evaluation on plain correspondences between strings in two orthographic systems. Such an approach disregards word frequencies and treats predictions involving alternative, valid transcriptions as errors. Improvements to both the training datasets and the mechanisms for handling multiple predictions will likely yield significant gains in measured model performance and metrics that correlate better with human evaluations. In addition to improving datasets, our work also points out the need for understanding the relative importance of character and word error rates in evaluating transliterations, since they appear to vary independently.

Given that transliteration is often used as part of machine translation systems, and that such systems are themselves increasingly character-based end-to-end systems, the question arises whether we need separate transliteration models at all. It appears likely that transliteration will remain a distinct submodule of such systems, since the internal graphemic and phonetic representations inside transliteration modules are likely quite different from the internal semantic representations required for translation. Experimental evidence from humans also supports the notion of separate and distinct processing of proper nouns and other nouns [Adorni et al.2014].

Beyond demonstrating a simple and novel way of constructing efficient transliteration systems, this paper presents benchmarks that should serve as a useful baseline for future work. To this end, we have open sourced a new Arabic-English transliteration dataset.

5 Acknowledgments

We would like to thank Andy Staudacher, Lara Scheidegger and Vincent Vanhoucke for their support throughout this work.