Transliteration–the conversion of proper nouns from one orthographic system to another–is an important task in multilingual text processing, useful in applications like online mapping and as a component of machine translation systems. Transliteration is determined by collection of historical accidents, conventions, and statistical regularities: many language pairs have adopted different rules for transliteration over time and many transliterations depends on the origin of a word. These properties make it desirable to look for high-quality, automated machine learning solutions to the problem.
A number of model-based methods for machine transliteration have been developed in the past. Such models assume that character sequences in the source orthography correspond to predictable character sequences in the target orthography, possibly depending on context. Some models additionally assume that such correspondences are influenced by phonetic information. Due to the statistical nature of the problem, components of such models frequently involve parameters and statistical modeling such as hidden Markov models, logistic regression, finite state transducers, and/or conditional random fields (CRFs). Typical examples of such models are[Ammar et al.2012, Ganesh et al.2008]; transliteration in such a system is based on an alignment step, followed by a CRF model that performs local string rewriting.
In many areas, including machine translation, end-to-end deep learning models have become a good alternatives to more traditional statistical approaches. This is our motivation for taking a similar approach to transliteration. Unlike statistical models, such end-to-end systems simply take a character string in the source orthography and are trained directly to produce a character string in the target orthography. The closest approaches to the transliteration methods described in this paper are probably found in[Rao et al.2015], using a bidirectional LSTM models together with input delays for grapheme to phoneme conversion (which can be viewed as a kind of “transliteration” from English to IPA) and [Yao and Zweig2015], using attentionless sequence-sequence models for the same task.
This paper describes the application of two neural-network based sequence-to-sequence models to transliteration that take principled and general-purpose approaches to alignment and one-to-many or many-to-one correspondences. The first model is based on epsilon insertions and CTC [Graves et al.2006] alignment, the second model is an attentional sequence-to-sequence model commonly used in end-to-end machine translation [Bahdanau et al.2014]. We report and compare both character (CER) and word error rates (WER) on all problems.
2 Models and datasets
2.1 Epsilon Insertion
Epsilon insertion (EI) [Azawi et al.2013] is a simple technique for allowing sequence-to-sequence models to produce strings of different lengths from an input string. Epsilon insertion replaces the original problem of transliteration with a similar problem in which the source string has been modified by the insertion of epsilons (which we will represent as ‘_‘). Transliteration is then performed by an LSTM (possibly bidirectional and deep), and the output is aligned using CTC.
source string: きょうと
source string with epsilons: __き__ょ__う__と__
LSTM output (after training): __ki_yo_u__to_
after CTC alignment: __ky_o_____to_
2.2 Attentional Sequence-to-Sequence Models
Attentional sequence-to-sequence models [Bahdanau et al.2014]
(Seq2Seq) work by using an encoder RNN to learn representations of the input sequence and a decoder RNN to produce the output sequence from the hidden representations the encoder created. The attention mechanism allows the decoder to focus on different parts of the input for each time step in the output sequence and can be seen as the analog of the alignment mechanism used in traditional statistical translation models. Sequence-to-sequence models do not have the implicit monotonicity assumption that unidirectional CTC models do, hence they are more flexible regarding input-output reordering. This is crucial for machine translation, but less important for transliteration where the sound order gets preserved from the source to the target.
For sequence-to-sequence models we experimented with GRU and LSTM cells and assessed the impact of using a bidirectional encoder. As recommended in [Sutskever et al.2014]
, we feed the input sequence to the encoder in reverse order. For our experiments we used the implementation with an embedding layer provided by TensorFlow[Abadi et al.2016] .
We asses the proposed models on Arabic to English (AR-EN), English to Japanese (EN-JA) transliteration and grapheme to phoneme conversion, specifically English to IPA (EN-IPA). The datasets we used are described in Table 1. For Arabic to English transliteration, we introduce a new corpus extracted from Wikipedia: firstly, we created a bilingual dataset of full names from titles of Arabic and English articles referring to the same person; secondly, we used it to learn alignments between name parts to create the final dataset. Since no direction specific information was used in data gathering, the data can be used both for English to Arabic and Arabic to English transliteration. Due to the extraction process, the dataset includes names of various origins (eg. Papadopoulos has Greek origin) and some English tokens contain characters specific to other languages such as: ß, ø, ł.
For the transliteration datasets (EN-JA, AR-EN), the English tokens were lowercased and diacritics removed (è becomes e, ü becomes u). The inputs and outputs of our models are unicode codepoints: the model reads one unicode codepoint at a time in the source string and produces unicode codepoints.
It should be observed that datasets usually used for transliteration differ substantially in their statistical properties from datasets used in other machine learning research. In particular, transliteration datasets usually represent transliterations only once, regardless of how common the word is in the source language. A second issue is that transliteration datasets used for training contain a large number of exceptional words, whose transliteration probably cannot be learned at all. Finally, there are frequently multiple acceptable transliterations for a source word, but these are not usually represented in the training data; that is, many transliterations counted as errors during training and evaluation may be acceptable. In order to remain comparable to prior work in the area, we did not attempt to address these issues in the existing datasets or the datasets we created for this paper; we will return to the question of how this influences performance in the Discussion.
3 Experimental results
3.1 Training and parameters
For all experiments we used 10% of data for testing, 10% of the remaining data for evaluation and the rest for training. All networks were trained using gradient descent with momentum. We used gradient clipping to avoid exploding RNN gradients[Pascanu et al.2012]. EI models use a batch size of 1, gradient clipping norm of 9 and 3 epsilons. When training EI models we randomly varied the learning rate (
to 0.1), momentum rate (0.5 to 0.99) and number of hidden units (100 to 1000). For sequence-to-sequence models we varied the following hyperparameters: learning rate (to 10), momentum rate (0.5 to 0.99), batch size (1 to 50), gradient clipping norm (1 to 10) and number of hidden units (50 to 1000). For both models we trained 1000 networks with different hyperparameter values but a fixed number of layers and chose the one that performed best on the evaluation set and reported performance on the test set. We verified that for each parameter range, optimal performance was reached within the interior of the parameter interval explored.
Our results are described in Tables 4, 4 and 4. Table 5 compares our results against models trained on the same datasets. For completeness, we report other transliteration results despite not being directly comparable to our work since they used different datasets. Using statistical phonetic based machine translation [Finch and Sumita2008] reports a 31% CER for English to Japanese transliteration. [Deselaers et al.2009]
use deep neural networks for Arabic to English transliteration and report a 22.7% CER, while traditional approaches combined with a single layer perceptron achieved 11.1% on the same task[Freitag et al.2007].
|EN-JA||50.2||67.7||[Benson et al.2009]|
|EN-IPA||26.2||21.3||[Rao et al.2015]|
|EN-IPA||26.2||28.6||[Yao and Zweig2015]|
|EN-IPA||26.2||23.5||[Yao and Zweig2015]|
3.3 Error analysis
Table 6 shows a list of errors made by a Arabic to English transliteration model. The most common mistake both explored models make on this task is to confuse vowels in the output. This is expected, given that Arabic has less vowels than English and that often short vowels are not written. Confusing “p” with “b” is another common mistake, accounted by the lack of a corresponding sound for the English “p” in Arabic.
|Input||Ground truth||Model output|
This paper has demonstrated that end-to-end recurrent neural networks achieve high performance on cross script transliteration on common transliteration tasks: EN-JA, EN-IPA and AR-EN.
We have compared epsilon insertion models and attentional sequence-to-sequence models on three benchmarks, and our results show attentional sequence-to-sequence models generally seem to perform better, but not uniformly. For grapheme to phoneme conversion our attention based sequence-to-sequence models perform better than the attentionless sequence-to-sequence models used in [Yao and Zweig2015]. However, their bidirectional LSTM models which use alignment features outperform our attention based models. The reason can be two fold: the alignment features learned by an alignment specific model help more than the attention implicitly learned by our model, or the simplicity of the LSTM models is advantage against sequence-to-sequence models.
Our results can be extended and potentially improved in a number of ways. A way is by exploring other recurrent network architectures, such as adaptive computation time networks [Graves2016]
or classifier combination via boosting or other methods[Rao et al.2015]. Another way is by combining the neural network cost with a target language model cost [Chan et al.2015]. Since transliteration involves a combination of orthographic and phonetic features, it might be useful to run a separate pronunciation model on the input string and then provide both the grapheme and the phoneme string as input to the transliteration model, mirroring previous non-neural approaches to transliteration [Jansche and Sproat2009].
Perhaps one of the most important areas of improvements is that of training data. Right now, transliteration research (including the described work) performs training and evaluation on plain correspondences between strings in two orthographic systems. Such an approach disregards word frequencies, and treats predictions involving alternative, valid transcriptions as errors. Improvements to both the training datasets and the mechanisms for handling multiple predictions will likely result in significant improvements in model performance and correlate more with human evaluations. In addition to improving datasets, our work also points out the need for understanding the relative importance of character and word error rates in evaluating transliterations, since they appear to vary independently.
Given that transliteration is often used as part of machine translation systems, and that such systems themselves are increasingly character based end-to-end system, the question arises whether we need separate transliteration models at all. It appears likely that transliteration will remain a distinct submodule of such systems, since internal graphemic and phonetic representations inside transliteration modules are likely quite different from internal semantic representations required for translation. Experimental evidence from humans also supports the notion of separate and distinct processing of proper nouns and other nouns [Adorni et al.2014].
In addition to demonstrating a simple and novel way of constructing efficient transliteration systems, the benchmarks presented in this paper should be a useful baseline for future work. To this end, we open sourced a new Arabic-English transliteration dataset.
We would like to thank Andy Staudacher, Lara Scheidegger and Vincent Vanhoucke for their support throughout this work.
- [Abadi et al.2016] Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
- [Adorni et al.2014] Roberta Adorni, Mirella Manfredi, and Alice Mado Proverbio. 2014. Electro-cortical manifestations of common vs. proper name processing during reading. Brain and language, 135:1–8.
- [Ammar et al.2012] Waleed Ammar, Chris Dyer, and Noah A Smith. 2012. Transliteration by sequence labeling with lattice encodings and reranking. In Proceedings of the 4th Named Entity Workshop, pages 66–70. Association for Computational Linguistics.
- [Azawi et al.2013] Mayce Al Azawi, Muhammad Zeshan Afzal, and Thomas M Breuel. 2013. Normalizing historical orthography for ocr historical documents using lstm. In Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, pages 80–85. ACM.
- [Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- [Benson et al.2009] Edward Benson, Stephen Pueblo, Fuming Shih, and Robert Berwick. 2009. English-japanese transliteration.
- [Chan et al.2015] William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. 2015. Listen, attend and spell. arXiv preprint arXiv:1508.01211.
- [Deselaers et al.2009] Thomas Deselaers, Saša Hasan, Oliver Bender, and Hermann Ney. 2009. A deep learning approach to machine transliteration. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 233–241. Association for Computational Linguistics.
- [Finch and Sumita2008] Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration. In Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), pages 13–18.
- [Freitag et al.2007] Dayne Freitag, Shahram Khadivi, et al. 2007. A sequence alignment model based on the averaged perceptron. pages 238–247.
- [Ganesh et al.2008] Surya Ganesh, Sree Harsha, Prasad Pingali, and Vasudeva Verma. 2008. Statistical transliterationalledfor cross language information retrieval using hmm alignment model and crf. In Proceedings of the 2nd Workshop on Cross Lingual Information Access.
- [Graves et al.2006] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM.
- [Graves2016] Alex Graves. 2016. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.
[Jansche and Sproat2009]
Martin Jansche and Richard Sproat.
Named entity transcription with pair n-gram models.In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, pages 32–35. Association for Computational Linguistics.
- [Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
[Rao et al.2015]
Kanishka Rao, Fuchun Peng, Hasim Sak, and Françoise Beaufays.
Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks.In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4225–4229. IEEE.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- [Yao and Zweig2015] Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. arXiv preprint arXiv:1506.00196.