A Comparison of Different Machine Transliteration Models

Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance.


1 Introduction

With the advent of new technology and the flood of information through the Web, it has become increasingly common to adopt foreign words into one’s language. This usually entails adjusting the adopted word’s original pronunciation to follow the phonological rules of the target language, along with modification of its orthographical form. This phonetic “translation” of foreign words is called transliteration. For example, the English word data is transliterated into Korean as ‘de-i-teo’ and into Japanese as ‘de-e-ta’. (In this paper, target language transliterations are represented in their Romanized form with single quotation marks and hyphens between syllables.) Transliteration is particularly used to translate proper names and technical terms from languages using Roman alphabets into ones using non-Roman alphabets, such as from English to Korean, Japanese, or Chinese. Because transliteration is one of the main causes of the out-of-vocabulary (OOV) problem, transliteration by means of dictionary lookup is impractical [Fujii & Tetsuya, 2001; Lin & Chen, 2002]. One way to solve the OOV problem is to use machine transliteration. Machine transliteration is usually used to support machine translation (MT) [Knight & Graehl, 1997; Al-Onaizan & Knight, 2002] and cross-language information retrieval (CLIR) [Fujii & Tetsuya, 2001; Lin & Chen, 2002]. For CLIR, machine transliteration bridges the gap between the transliterated localized form and its original form by generating all possible transliterations from the original form (or generating all possible original forms from the transliteration). (The former process is generally called “transliteration”, and the latter is generally called “back-transliteration” [Knight & Graehl, 1997].) For example, machine transliteration can assist query translation in CLIR, where proper names and technical terms frequently appear in source language queries. In the area of MT, machine transliteration helps prevent translation errors when translations of proper names and technical terms are not registered in the translation dictionary. Machine transliteration can therefore improve the performance of MT and CLIR.

Four machine transliteration models have been proposed by several researchers: the grapheme-based transliteration model [Lee & Choi, 1998; Jeong et al., 1999; Kim et al., 1999; Lee, 1999; Kang & Choi, 2000; Kang & Kim, 2000; Kang, 2001; Goto et al., 2003; Li et al., 2004], the phoneme-based transliteration model [Knight & Graehl, 1997; Lee, 1999; Jung et al., 2000; Meng et al., 2001], the hybrid transliteration model [Lee, 1999; Al-Onaizan & Knight, 2002; Bilac & Tanaka, 2004], and the correspondence-based transliteration model [Oh & Choi, 2002]. (Graphemes refer to the basic units, or the smallest contrastive units, of a written language: for example, English has 26 graphemes or letters, Korean has 24, and German has 30. Phonemes are the simplest significant units of sound, or the smallest contrastive units of a spoken language; for example, /M/, /AE/, and /TH/ in /M AE TH/, the pronunciation of math. We use the ARPAbet symbols to represent source phonemes. ARPAbet is one of the methods used for coding source phonemes into ASCII characters (http://www.cs.cmu.edu/~laura/pages/arpabet.ps). Here we denote source phonemes and pronunciation with two slashes, as in /AH/, and use pronunciation based on The CMU Pronouncing Dictionary and The American Heritage® Dictionary of the English Language.)

These models are classified in terms of the units to be transliterated. The grapheme-based model is sometimes referred to as the direct method because it directly transforms source language graphemes into target language graphemes without any phonetic knowledge of the source language words. The phoneme-based model is sometimes referred to as the pivot method because it uses source language phonemes as a pivot when it produces target language graphemes from source language graphemes. The phoneme-based model therefore usually needs two steps: 1) produce source language phonemes from source language graphemes; 2) produce target language graphemes from source phonemes. (These two steps are explicit if the transliteration system produces target language transliterations after producing the pronunciations of the source language words; they are implicit if the system uses phonemes implicitly in the transliteration stage and explicitly in the learning stage, as described elsewhere [Bilac & Tanaka, 2004].) The hybrid and correspondence-based models make use of both source language graphemes and source language phonemes when producing target language transliterations. Hereafter, we refer to a source language grapheme as a source grapheme, a source language phoneme as a source phoneme, and a target language grapheme as a target grapheme.

The transliterations produced by the four models usually differ because the models use different information. Generally, transliteration is a phonetic process, as in the phoneme-based model, rather than an orthographic one, as in the grapheme-based model [Knight & Graehl, 1997]. However, standard transliterations are not restricted to phoneme-based transliterations. For example, the standard Korean transliterations of data, amylase, and neomycin are, respectively, the phoneme-based transliteration ‘de-i-teo’, the grapheme-based transliteration ‘a-mil-la-a-je’, and ‘ne-o-ma-i-sin’, which is a combination of the grapheme-based transliteration ‘ne-o’ and the phoneme-based transliteration ‘ma-i-sin’. Furthermore, if the unit to be transliterated is restricted to either a source grapheme or a source phoneme, it is hard to produce the correct transliteration in many cases. For example, the phoneme-based model cannot easily produce the grapheme-based transliteration ‘a-mil-la-a-je’, the standard Korean transliteration of amylase, because it tends to produce ‘a-mil-le-i-seu’ based on the sequence of source phonemes /AE M AH L EY S/. Multiple transliteration models should therefore be applied to better cover the various transliteration processes. To date, however, there has been little published research regarding a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no reported comparison of the transliteration models within the same framework and using the same data, although many English-to-Korean transliteration methods based on the grapheme-based model have been compared to each other with the same data [Kang & Choi, 2000; Kang & Kim, 2000; Oh & Choi, 2002].

To address these problems, we 1) modeled a framework in which the four transliteration models can operate simultaneously, 2) compared the transliteration models under the same conditions, and 3) using the results of the comparison, developed a way to improve the performance of machine transliteration.

The rest of this paper is organized as follows. Section 2 describes previous work relevant to our study. Section 3 describes our implementation of the four transliteration models. Section 4 describes our testing and results. Section 5 describes a way to improve machine transliteration based on the results of our comparison. Section 6 describes a transliteration ranking method that can be used to improve transliteration performance. Section 7 concludes the paper with a summary and a look at future work.

2 Related Work

Machine transliteration has received significant research attention in recent years. In most cases, the source language and target language have been English and an Asian language, respectively: for example, English to Japanese [Goto et al., 2003], English to Chinese [Meng et al., 2001; Li et al., 2004], and English to Korean [Lee & Choi, 1998; Kim et al., 1999; Jeong et al., 1999; Lee, 1999; Jung et al., 2000; Kang & Choi, 2000; Kang & Kim, 2000; Kang, 2001; Oh & Choi, 2002]. In this section, we review previous work related to the four transliteration models.

2.1 Grapheme-based Transliteration Model

Conceptually, the grapheme-based transliteration model is a direct orthographical mapping from source graphemes to target graphemes. Several transliteration methods based on this model have been proposed, such as those based on a source-channel model [Lee & Choi, 1998; Lee, 1999; Jeong et al., 1999; Kim et al., 1999], a decision tree [Kang & Choi, 2000; Kang, 2001], a transliteration network [Kang & Kim, 2000; Goto et al., 2003], and a joint source-channel model [Li et al., 2004].

The methods based on the source-channel model deal with English-Korean transliteration. They use a chunk of graphemes that can correspond to a source phoneme. First, English words are segmented into chunks of English graphemes. Next, all possible chunks of Korean graphemes corresponding to each chunk of English graphemes are produced. Finally, the most relevant sequence of Korean graphemes is identified by using the source-channel model. The advantage of this approach is that it considers a chunk of graphemes representing a phonetic property of the source language word. However, errors in the first step (segmenting the English words) propagate to the subsequent steps, making it difficult to produce correct transliterations in those steps. Moreover, the time complexity is high because all possible chunks of graphemes are generated in both languages.

In the method based on a decision tree, decision trees that transform each source grapheme into target graphemes are learned and then directly applied to machine transliteration. The advantage of this approach is that it considers a wide range of contextual information, say, the left three and right three contexts. However, it does not consider any phonetic aspects of transliteration.

Kang and Kim [2000] and Goto et al. [2003] proposed methods based on a transliteration network for, respectively, English-to-Korean and English-to-Japanese transliteration. Their frameworks for constructing a transliteration network are similar: both are composed of nodes and arcs. A node represents a chunk of source graphemes and its corresponding target graphemes. An arc represents a possible link between nodes and has a weight showing its strength. Like the methods based on the source-channel model, their methods consider the phonetic aspect in the form of chunks of graphemes. Furthermore, they segment a chunk of graphemes and identify the most relevant sequence of target graphemes in one step. This means that errors are not propagated from one step to the next, as they are in the methods based on the source-channel model.
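The one-step search over such a network can be sketched as a best-path computation. The following is a minimal illustration only, not the method of either cited paper: the chunk table, the arc weights, the multiplicative scoring, and the names `best_path` and `toy_arc_weight` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Node = Tuple[str, str]  # (chunk of source graphemes, chunk of target graphemes)


def best_path(
    word: str,
    chunk_table: Dict[str, List[str]],
    arc_weight: Callable[[Optional[Node], Node], float],
) -> Optional[Tuple[float, List[Node]]]:
    """Segment `word` and pick target chunks in a single search.

    best[i] maps the last node of a partial path covering word[:i]
    to the best (score, path) found so far for that last node.
    """
    best: Dict[int, Dict[Optional[Node], Tuple[float, List[Node]]]] = {
        0: {None: (1.0, [])}
    }
    for i in range(len(word)):
        for prev, (score, path) in list(best.get(i, {}).items()):
            for j in range(i + 1, len(word) + 1):
                src = word[i:j]
                for tgt in chunk_table.get(src, []):
                    node = (src, tgt)
                    new_score = score * arc_weight(prev, node)
                    cell = best.setdefault(j, {})
                    if node not in cell or new_score > cell[node][0]:
                        cell[node] = (new_score, path + [node])
    finals = best.get(len(word), {})
    if not finals:
        return None
    return max(finals.values(), key=lambda item: item[0])


# Hypothetical toy network: the chunks and weights are made up.
toy_chunks = {"a": ["x", "y"], "b": ["z"], "ab": ["w"]}
toy_weights = {
    (None, ("a", "x")): 0.5,
    (None, ("a", "y")): 0.4,
    (None, ("ab", "w")): 0.3,
    (("a", "x"), ("b", "z")): 0.2,
    (("a", "y"), ("b", "z")): 0.9,
}


def toy_arc_weight(prev: Optional[Node], node: Node) -> float:
    # Unlisted arcs get a small default weight (an assumption).
    return toy_weights.get((prev, node), 0.1)


result = best_path("ab", toy_chunks, toy_arc_weight)
```

Because segmentation and target selection share one search, a locally attractive first chunk (here `("a", "x")`) can still lose to a path whose later arcs are stronger.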

The method based on the joint source-channel model simultaneously considers the source language and target language contexts (bigram and trigram) for machine transliteration. Its main advantage is the use of bilingual contexts.

2.2 Phoneme-based Transliteration Model

In the phoneme-based transliteration model, the transliteration key is pronunciation or the source phoneme rather than spelling or the source grapheme. This model is basically a source grapheme-to-source phoneme transformation followed by a source phoneme-to-target grapheme transformation.

Knight and Graehl [1997] modeled Japanese-to-English transliteration with weighted finite state transducers (WFSTs) by combining several parameters including romaji-to-phoneme, phoneme-to-English, English word probabilities, and so on. A similar model was developed for Arabic-to-English transliteration [Stalls & Knight, 1998]. Meng et al. [2001] proposed an English-to-Chinese transliteration method based on English grapheme-to-phoneme conversion, cross-lingual phonological rules, mapping rules between English phonemes and Chinese phonemes, and Chinese syllable-based and character-based language models. Jung et al. [2000] modeled English-to-Korean transliteration with an extended Markov window. The method transforms an English word into English pronunciation by using a pronunciation dictionary. Then it segments the English phonemes into chunks of English phonemes; each chunk corresponds to a Korean grapheme as defined by handcrafted rules. Finally, it automatically transforms each chunk of English phonemes into Korean graphemes by using an extended Markov window.

Lee [1999] modeled English-to-Korean transliteration in two steps. The English grapheme-to-English phoneme transformation is modeled in a manner similar to his method based on the source-channel model described in Section 2.1. The English phonemes are then transformed into Korean graphemes by using English-to-Korean standard conversion rules (EKSCR) [Korea Ministry of Culture & Tourism, 1995]. These rules are context-sensitive rewrite rules: each rule states that an English phoneme is rewritten as a Korean grapheme in a given left and right context, where the contexts are themselves English phonemes. For example, one such rule rewrites an English phoneme as the Korean grapheme ‘si’ if it occurs at the end of the word after any phoneme. This approach suffers from both the propagation of errors and the limitations of EKSCR. The first step, grapheme-to-phoneme transformation, usually results in errors, and the errors propagate to the next step. Propagated errors make it difficult for a transliteration system to work correctly. In addition, EKSCR does not contain enough rules to generate correct Korean transliterations since its main focus is mapping from an English phoneme to Korean graphemes without taking into account the contexts of the English grapheme.

2.3 Hybrid and Correspondence-based Transliteration Models

Attempts to use both source graphemes and source phonemes in machine transliteration led to the correspondence-based transliteration model [Oh & Choi, 2002] and the hybrid transliteration model [Lee, 1999; Al-Onaizan & Knight, 2002; Bilac & Tanaka, 2004]. The former makes use of the correspondence between a source grapheme and a source phoneme when it produces target language graphemes; the latter simply combines the grapheme-based and phoneme-based models. Specifically, the hybrid model combines the grapheme-based transliteration probability and the phoneme-based transliteration probability using linear interpolation.
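As a minimal sketch of this linear interpolation (not the implementation of any cited system), the function below mixes two candidate-to-probability tables. The name `interpolate`, the choice to attach the weight `alpha` to the phoneme-based side, and the probability values are assumptions for illustration; only the two candidate transliterations of amylase come from the text.

```python
from typing import Dict


def interpolate(
    p_grapheme: Dict[str, float],
    p_phoneme: Dict[str, float],
    alpha: float,
) -> Dict[str, float]:
    """Return alpha * p_phoneme + (1 - alpha) * p_grapheme per candidate.

    A candidate missing from one table is treated as having probability 0
    there (an assumption of this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    candidates = set(p_grapheme) | set(p_phoneme)
    return {
        c: alpha * p_phoneme.get(c, 0.0) + (1.0 - alpha) * p_grapheme.get(c, 0.0)
        for c in candidates
    }


# Hypothetical probabilities for two Korean candidates of "amylase".
p_g = {"a-mil-la-a-je": 0.7, "a-mil-le-i-seu": 0.3}
p_p = {"a-mil-la-a-je": 0.2, "a-mil-le-i-seu": 0.8}
mixed = interpolate(p_g, p_p, alpha=0.5)
best = max(mixed, key=mixed.get)
```

Note that the two probabilities are mixed independently of each other, which is the limitation discussed for the hybrid model: nothing in the combination reflects which source grapheme a given source phoneme came from.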

Oh and Choi [2002] considered the contexts of a source grapheme and its corresponding source phoneme for English-to-Korean transliteration. They used EKSCR as the basic rules in their method. Additional contextual rules are semi-automatically constructed by examining the cases in which EKSCR produced incorrect transliterations because of a lack of contexts. These contextual rules are context-sensitive rewrite rules: each rule states that a unit is rewritten as a target grapheme in a given left and right context, where the unit and both contexts represent the correspondence between an English grapheme and an English phoneme. For example, one such rule rewrites an English grapheme corresponding to a given phoneme into null Korean graphemes when it occurs after vowel phonemes and before consonant phonemes. The main advantage of this approach is the application of a sophisticated rule that reflects the context of the source grapheme and source phoneme by considering their correspondence. However, the approach lacks portability to other languages because the rules are restricted to Korean.

Several researchers [Lee, 1999; Al-Onaizan & Knight, 2002; Bilac & Tanaka, 2004] have proposed hybrid model-based transliteration methods. They model the grapheme-based and phoneme-based models with WFSTs or a source-channel model and combine the two through linear interpolation. In their phoneme-based model, several parameters are considered, such as the source grapheme-to-source phoneme probability, source phoneme-to-target grapheme probability, and target language word probability. In their grapheme-based model, the source grapheme-to-target grapheme probability is mainly considered. The main disadvantage of the hybrid model is that the dependence between the source grapheme and source phoneme is not taken into consideration in the combining process; in contrast, Oh and Choi’s approach [Oh & Choi, 2002] considers this dependence by using the correspondence between the source grapheme and phoneme.

3 Modeling Machine Transliteration Models

In this section, we describe our implementation of the four machine transliteration models (grapheme-based, phoneme-based, hybrid, and correspondence-based) using three machine learning algorithms: memory-based learning, decision-tree learning, and the maximum entropy model.

3.1 Framework for Four Machine Transliteration Models

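Figure 1 summarizes the differences among the transliteration models and their component functions. The grapheme-based model directly transforms source graphemes (S) into target graphemes (T). The phoneme-based and correspondence-based models transform source graphemes into source phonemes (P) and then generate target graphemes; each can therefore be written as the composition of a grapheme-to-phoneme function with a function that produces target graphemes. While the phoneme-based model uses only the source phonemes, the correspondence-based model uses the correspondence between the source grapheme and the source phoneme when it generates target graphemes. We describe their differences with two functions: one that maps source phonemes to target graphemes and one that maps source grapheme-phoneme correspondences to target graphemes. The hybrid model is represented as the linear interpolation of the grapheme-based and phoneme-based transliteration probabilities by means of an interpolation weight. Here, the phoneme-based transliteration probability is the probability that the phoneme-based model will produce target graphemes, while the grapheme-based transliteration probability is the probability that the grapheme-based model will produce target graphemes. We can thus regard the hybrid model as being composed of the component functions of the grapheme-based and phoneme-based models (the grapheme-to-phoneme, phoneme-to-target-grapheme, and grapheme-to-target-grapheme functions). Here we use the maximum entropy model as the machine learning algorithm for the hybrid model because the hybrid model requires both probabilities, and only the maximum entropy model among memory-based learning, decision-tree learning, and the maximum entropy model can produce the probabilities.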
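The way three of the models are assembled from component functions can be sketched as plain function composition. This is only an illustration under stated assumptions: the names `build_models`, `toy_sp`, `toy_st`, `toy_pt`, and `toy_spt` are hypothetical, and the toy component functions return placeholder strings rather than real transliterations. The hybrid model is omitted because it interpolates probabilities rather than composing functions.

```python
from typing import Callable, Dict, List

Units = List[str]


def build_models(
    sp: Callable[[Units], Units],          # source graphemes -> source phonemes
    st: Callable[[Units], Units],          # source graphemes -> target graphemes
    pt: Callable[[Units], Units],          # source phonemes  -> target graphemes
    spt: Callable[[Units, Units], Units],  # (graphemes, phonemes) -> target graphemes
) -> Dict[str, Callable[[Units], Units]]:
    """Assemble three transliteration models from component functions."""

    def grapheme_based(source: Units) -> Units:
        return st(source)

    def phoneme_based(source: Units) -> Units:
        return pt(sp(source))

    def correspondence_based(source: Units) -> Units:
        return spt(source, sp(source))

    return {
        "grapheme": grapheme_based,
        "phoneme": phoneme_based,
        "correspondence": correspondence_based,
    }


# Toy stand-ins (hypothetical): they only tag their inputs so the data
# flow is visible; they do not transliterate anything.
def toy_sp(source: Units) -> Units:
    return [f"/{g.upper()}/" for g in source]


def toy_st(source: Units) -> Units:
    return [f"T({g})" for g in source]


def toy_pt(phonemes: Units) -> Units:
    return [f"T({p})" for p in phonemes]


def toy_spt(source: Units, phonemes: Units) -> Units:
    return [f"T({g},{p})" for g, p in zip(source, phonemes)]


models = build_models(toy_sp, toy_st, toy_pt, toy_spt)
outputs = {name: model(list("ab")) for name, model in models.items()}
```

The tagged outputs make the difference visible: the phoneme-based path never sees the graphemes once phonemes are produced, while the correspondence-based path keeps each grapheme paired with its phoneme.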

Figure 1: Graphical representation of each component function and four transliteration models: S is a set of source graphemes (e.g., letters of the English alphabet), P is a set of source phonemes defined in ARPAbet, and T is a set of target graphemes.

To train each component function, we need to define the features that represent training instances and data. Table 1 shows the five feature types: source graphemes, source grapheme types, source phonemes, source phoneme types, and target graphemes. The feature types used depend on the component functions. The modeling of each component function with the feature types is explained in Sections 3.2 and 3.3.

Feature type | Description and possible values
Source graphemes | Source graphemes in S: 26 letters in English alphabet
Source grapheme types | Consonant (C) and Vowel (V)
Source phonemes | Source phonemes in P (/AA/, /AE/, and so on)
Source phoneme types | Consonant (C), Vowel (V), Semi-vowel (SV), and silence (//)
Target graphemes | Target graphemes in T
Table 1: Feature types used for transliteration models. Source graphemes and source grapheme types are used together as one combined feature type, as are source phonemes and source phoneme types.

3.2 Component Functions of Each Transliteration Model

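Component function | Feature types used | Input | Output
Source grapheme to source phoneme | source graphemes and types; source phonemes | source grapheme and its context | source phoneme
Source grapheme-phoneme correspondence to target grapheme | source graphemes and types; source phonemes and types; target graphemes | source grapheme, source phoneme, and their contexts | target grapheme
Source phoneme to target grapheme | source phonemes and types; target graphemes | source phoneme and its context | target grapheme
Source grapheme to target grapheme | source graphemes and types; target graphemes | source grapheme and its context | target grapheme
Table 2: Definition of each component function. The context of a source grapheme consists of the source graphemes to its left and right; the context of a source phoneme consists of the source phonemes to its left and right.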

Table 2 shows the definitions of the four component functions that we used. Each is defined in terms of its input and output: the first and last characters in the notation of each correspond respectively to its input and output. The role of each component function in each transliteration model is to produce the most relevant output from its input. The performance of a transliteration model therefore depends strongly on that of its component functions. In other words, the better the modeling of each component function, the better the performance of the machine transliteration system.

The modeling strongly depends on the feature type. Different feature types are used by the three component functions that produce target graphemes (from source graphemes, from source phonemes, and from grapheme-phoneme correspondences), as shown in Table 2. These three component functions thus have different strengths and weaknesses for machine transliteration. The grapheme-to-target-grapheme function is good at producing grapheme-based transliterations and poor at producing phoneme-based ones. In contrast, the phoneme-to-target-grapheme function is good at producing phoneme-based transliterations and poor at producing grapheme-based ones. For amylase and its standard Korean transliteration, ‘a-mil-la-a-je’, which is a grapheme-based transliteration, the grapheme-to-target-grapheme function tends to produce the correct transliteration; the phoneme-to-target-grapheme function tends to produce wrong ones like ‘ae-meol-le-i-seu’, which is derived from /AE M AH L EY S/, the pronunciation of amylase. In contrast, the phoneme-to-target-grapheme function can produce ‘de-i-teo’, which is the standard Korean transliteration of data and a phoneme-based transliteration, while the grapheme-to-target-grapheme function tends to give a wrong one, like ‘da-ta’.

The correspondence-to-target-grapheme function combines the advantages of the grapheme-to-target-grapheme and phoneme-to-target-grapheme functions by utilizing the correspondence between the source grapheme and source phoneme. This correspondence enables it to produce both grapheme-based and phoneme-based transliterations. Furthermore, the correspondence provides important clues for use in resolving transliteration ambiguities. (Though contextual information can also be used to reduce ambiguities, we limit our discussion here to the feature type.) For example, the source phoneme /AH/ produces much ambiguity in machine transliteration because it can be mapped to almost every vowel in the source and target languages (for instance, /AH/ corresponds to one grapheme in each of holocaust in English, ‘hol-lo-ko-seu-teu’ in its Korean counterpart, and ‘ho-ro-ko-o-su-to’ in its Japanese counterpart). If we know the correspondence between the source grapheme and source phoneme, we can more easily infer the correct transliteration of /AH/ because the correct target grapheme corresponding to /AH/ usually depends on the source grapheme corresponding to /AH/. Moreover, there are various Korean transliterations of the source grapheme a: ‘a’, ‘ae’, ‘ei’, ‘i’, and ‘o’. In this case, the English phonemes corresponding to the English grapheme can help a component function resolve transliteration ambiguities, as shown in Table 3. In Table 3, each example word in the last column contains an a that is pronounced as the English phoneme in the second column. By looking at the English grapheme and its corresponding English phoneme, we can find correct Korean transliterations more easily.

Korean grapheme | English phoneme | Example usage
‘a’ | /AA/ | adagio, safari, vivace
‘ae’ | /AE/ | advantage, alabaster, travertine
‘ei’ | /EY/ | chamber, champagne, chaos
‘i’ | /IH/ | advantage, average, silage
‘o’ | /AO/ | allspice, ball, chalk
Table 3: Examples of Korean graphemes derived from English grapheme a and its corresponding English phonemes: each example word contains an a that corresponds to the English phoneme in the second column.

Though the correspondence-to-target-grapheme function is more effective than both the grapheme-to-target-grapheme and phoneme-to-target-grapheme functions in many cases, it sometimes works poorly when the standard transliteration is strongly biased to either grapheme-based or phoneme-based transliteration. In such cases, either the source grapheme or source phoneme does not contribute to the correct transliteration, making it difficult for the function to produce the correct transliteration. Because these three functions are the core parts of the grapheme-based, phoneme-based, and correspondence-based models, respectively, the advantages and disadvantages of the three component functions correspond to those of the transliteration models in which each is used.

Transliteration usually depends on context. For example, the English grapheme a can be transliterated into Korean graphemes on the basis of its context, like ‘ei’ in the context of -ation and ‘a’ in the context of art. When context information is used, determining the context window size is important. A context window that is too narrow can degrade transliteration performance because of a lack of context information. For example, when English grapheme t in -tion is transliterated into Korean, a single right-context grapheme is insufficient because the three graphemes to its right, -ion, are needed to get the correct Korean grapheme, ‘s’. A context window that is too wide can also degrade transliteration performance because it reduces the power to resolve transliteration ambiguities. Many previous studies have determined that an appropriate context window size is 3. In this paper, we use a window size of 3, as in previous work [Kang & Choi, 2000; Goto et al., 2003]. The effect of the context window size on transliteration performance will be discussed in Section 4.

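Component function | Feature | L3 | L2 | L1 | C0 | R1 | R2 | R3 | Output
Source grapheme to source phoneme | source graphemes | $ | $ | $ | b | o | a | r | /B/
 | source grapheme types | $ | $ | $ | C | V | V | C |
 | previous outputs | $ | $ | $ | | | | |
Source grapheme to target grapheme | source graphemes | $ | $ | $ | b | o | a | r | ‘b’
 | source grapheme types | $ | $ | $ | C | V | V | C |
 | previous outputs | $ | $ | $ | | | | |
Source phoneme to target grapheme | source phonemes | $ | $ | $ | /B/ | /AO/ | // | /R/ | ‘b’
 | source phoneme types | $ | $ | $ | C | V | // | C |
 | previous outputs | $ | $ | $ | | | | |
Correspondence to target grapheme | source graphemes | $ | $ | $ | b | o | a | r | ‘b’
 | source phonemes | $ | $ | $ | /B/ | /AO/ | // | /R/ |
 | source grapheme types | $ | $ | $ | C | V | V | C |
 | source phoneme types | $ | $ | $ | C | V | // | C |
 | previous outputs | $ | $ | $ | | | | |
Table 4: Framework for each component function: $ represents start of words and an empty cell means a context unused by that component function.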
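The grapheme rows of Table 4 can be reproduced with a short sketch that pads a word with `$` and slices a window of size 3 around the current position. The function names are hypothetical, the vowel set used for grapheme types is a simplifying assumption, and padding the end of the word with `$` as well is an assumption (Table 4 only shows the start of a word).

```python
from typing import List, Sequence, Tuple

VOWELS = set("aeiou")  # simplifying assumption for grapheme types


def grapheme_type(grapheme: str) -> str:
    """Return 'V' for a vowel letter and 'C' otherwise."""
    return "V" if grapheme.lower() in VOWELS else "C"


def context_window(
    units: Sequence[str], i: int, k: int = 3, pad: str = "$"
) -> Tuple[List[str], str, List[str]]:
    """Return (left context L_k..L_1, current unit C0, right context R_1..R_k)."""
    padded = [pad] * k + list(units) + [pad] * k
    center = i + k
    left = padded[center - k:center]
    right = padded[center + 1:center + 1 + k]
    return left, padded[center], right


# First grapheme of "board": matches the source grapheme rows of Table 4.
left, current, right = context_window("board", 0)
types = [grapheme_type(g) for g in [current] + right]
```

The same helper applies unchanged to a phoneme sequence, since it only relies on positions.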

Table 4 shows how to identify the most relevant output in each component function using context information. L3-L1, C0, and R1-R3 represent the left context, current context (i.e., that to be transliterated), and right context, respectively. The grapheme-to-phoneme function produces the most relevant source phoneme for each source grapheme. If SW is an English word, SW’s pronunciation can be represented as the sequence of source phonemes that this function produces, one for each source grapheme of SW. The function transforms source graphemes into phonemes in two ways. The first one is to search in a pronunciation dictionary containing English words and their pronunciation [CMU, 1997]. The second one is to estimate the pronunciation (or automatic grapheme-to-phoneme conversion) [Andersen et al., 1996; Daelemans & van den Bosch, 1996; Pagel et al., 1998; Damper et al., 1999; Chen, 2003]. If an English word is not registered in the pronunciation dictionary, we must estimate its pronunciation. The produced pronunciation is used by the phoneme-to-target-grapheme function in the phoneme-based model and by the correspondence-to-target-grapheme function in the correspondence-based model. For training the automatic grapheme-to-phoneme conversion, we use The CMU Pronouncing Dictionary [CMU, 1997].

The three functions that produce target graphemes (from source graphemes, from source phonemes, and from grapheme-phoneme correspondences) do so using their input. Like the grapheme-to-phoneme function, these three functions use their previous outputs, which are represented by the target grapheme feature type. As shown in Table 4, they produce target grapheme ‘b’ for source grapheme b and source phoneme /B/ in board and /B AO R D/. Because b and /B/ are the first source grapheme of board and the first source phoneme of /B AO R D/, respectively, their left context is $, which represents the start of words. Source graphemes (o, a, and r) and their types (V: vowel, V: vowel, and C: consonant) can be the right context in the functions that take source graphemes as input. Source phonemes (/AO/, //, and /R/) and their types (V: vowel, //: silence, and C: consonant) can be the right context in the functions that take source phonemes as input. Depending on the feature types used in each component function and described in Table 2, the three functions produce a sequence of target graphemes for a source word and its pronunciation. For board, the output can be represented as follows. The // represents silence (null source phonemes), and null marks null target graphemes.

  • b, o, a, r, d with /B/, /AO/, //, /R/, /D/: ‘b’, ‘o’, null, null, ‘deu’

3.3 Machine Learning Algorithms for Each Component Function

In this section we describe a way to model component functions using three machine learning algorithms (the maximum entropy model, decision-tree learning, and memory-based learning). (These three algorithms are typically applied to automatic grapheme-to-phoneme conversion [Andersen et al., 1996; Daelemans & van den Bosch, 1996; Pagel et al., 1998; Damper et al., 1999; Chen, 2003].) Because the four component functions share a similar framework, we limit our focus in this section to the component function that maps grapheme-phoneme correspondences to target graphemes.

3.3.1 Maximum entropy model

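The maximum entropy model (MEM) is a widely used probability model that can incorporate heterogeneous information effectively [Berger et al., 1996]. In the MEM, an event ($ev$) is usually composed of a target event ($te$) and a history event ($he$); say $ev = \langle te, he \rangle$. Event $ev$ is represented by a bundle of feature functions, $fe_i(ev)$, which represent the existence of certain characteristics in event $ev$. A feature function is a binary-valued function. It is activated ($fe_i(ev) = 1$) when it meets its activating condition; otherwise it is deactivated ($fe_i(ev) = 0$) [Berger et al., 1996]. Let source language word SW be composed of n graphemes. SW, $P_{SW}$, and $T_{SW}$ can then be represented as $SW = s_1, \ldots, s_n$, $P_{SW} = p_1, \ldots, p_n$, and $T_{SW} = t_1, \ldots, t_n$, respectively. $P_{SW}$ and $T_{SW}$ represent the pronunciation and target language word corresponding to SW, and $p_i$ and $t_i$ represent the source phoneme and target grapheme corresponding to $s_i$. The correspondence-to-target-grapheme function, written here as $\phi_{(SP)T}$, can be represented under the maximum entropy model as

$$\phi_{(SP)T}(SW, P_{SW}) = \arg\max_{T_{SW}} \Pr(T_{SW} \mid SW, P_{SW}) \qquad (1)$$

With the assumption that $\phi_{(SP)T}$ depends on the context information in window size k, we simplify Formula (1) to

$$\Pr(T_{SW} \mid SW, P_{SW}) \approx \prod_{i=1}^{n} \Pr(t_i \mid t_{i-k}, \ldots, t_{i-1},\; p_{i-k}, \ldots, p_{i+k},\; s_{i-k}, \ldots, s_{i+k}) \qquad (2)$$

Because $t_1, \ldots, t_n$, $p_1, \ldots, p_n$, and $s_1, \ldots, s_n$ can be represented by the feature types of Table 1, written here as $f_T$ (target graphemes), $f_{P,GP}$ (source phonemes and their types), and $f_{S,GS}$ (source graphemes and their types), respectively, we can rewrite Formula (2) as

$$\Pr(T_{SW} \mid SW, P_{SW}) \approx \prod_{i=1}^{n} \Pr\big(t_i \mid f_T(i-k, i-1),\; f_{P,GP}(i-k, i+k),\; f_{S,GS}(i-k, i+k)\big) \qquad (3)$$

where $i$ is the index of the current source grapheme and source phoneme to be transliterated and $f_X(l, m)$ represents the features of feature type $f_X$ located from position $l$ to position $m$.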

An important factor in designing a model based on the maximum entropy model is to identify feature functions that effectively support certain decisions of the model. Our basic philosophy of feature function design for each component function is that the context information collocated with the unit of interest is important. We thus designed the feature functions with collocated features in each feature type and between different feature types. The features used for the correspondence-to-target-grapheme function are listed below. These features are used as activating conditions or history events of feature functions.

  • Feature types and features used for designing feature functions in the correspondence-to-target-grapheme function (context window size 3)

    • All possible features of the source grapheme, source phoneme, and target grapheme feature types within the context window

    • All possible feature combinations between features of the same feature type

    • All possible feature combinations between features of different feature types

      • between source graphemes and source phonemes

      • between source graphemes and target graphemes

      • between source phonemes and target graphemes

Generally, a conditional maximum entropy model that gives the conditional probability is represented as Formula (4) [Berger et al., 1996].

(4)
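
For reference, the general conditional maximum entropy form given by Berger et al. (1996) can be written as below. The symbols are generic notation chosen for this sketch (target event $y$, history event $x$, binary feature functions $f_j$, weights $\lambda_j$, normalizer $Z$) and may differ from the notation of Formula (4).

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{j} \lambda_j f_j(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{j} \lambda_j f_j(x, y') \Big)
```

Each weight $\lambda_j$ is estimated from the training data, and $Z(x)$ makes the probabilities of all target events sum to one for a given history event.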

In , the target event () is target graphemes to be assigned, and the history event () can be represented as a tuple . Therefore, we can rewrite Formula (3) as

(5)
Table 5: Feature functions derived from Table 4.

Table 5 shows example feature functions; Table 4 was used to derive the functions. For example, one feature function represents an event where the history event is “the source grapheme is b and the source phoneme is /B/” and the target event is “the target grapheme is ‘b’”. To model each component function based on the MEM, we use Zhang’s maximum entropy modeling tool [Zhang, 2004].

3.3.2 Decision-tree learning

Decision-tree learning (DTL) is one of the most widely used and well-known methods for inductive inference [Quinlan, 1986; Mitchell, 1997]. ID3, which is a greedy algorithm that constructs decision trees in a top-down manner, uses the information gain, which is a measure of how well a given feature (or attribute) separates training examples on the basis of their target class [Quinlan, 1993; Manning & Schütze, 1999]. We use C4.5 [Quinlan, 1993], which is a well-known tool for DTL and an extension of Quinlan’s ID3 algorithm.
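
The information gain measure mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the C4.5 implementation; the function names and the toy examples are hypothetical, and the feature names `C0` and `R1` are borrowed only to suggest context positions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(examples, labels, feature):
    """Entropy reduction obtained by splitting the examples on one feature."""
    total = len(labels)
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[feature], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: the class is fully determined by C0 and independent of R1.
examples = [
    {"C0": "b", "R1": "o"},
    {"C0": "b", "R1": "a"},
    {"C0": "o", "R1": "a"},
    {"C0": "o", "R1": "o"},
]
labels = ["B", "B", "O", "O"]

gain_c0 = information_gain(examples, labels, "C0")
gain_r1 = information_gain(examples, labels, "R1")
```

A top-down learner in the style of ID3 would test the feature with the highest gain first, which in this toy data is `C0`.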

The training data for each component function is represented by features located in L3-L1, C0, and R1-R3, as shown in Table 4. C4.5 tries to construct a decision tree by looking for regularities in the training data [Mitchell, 1997]. Figure 2 shows part of the decision tree constructed for in English-to-Korean transliteration. A set of the target classes in the decision tree for is a set of the target graphemes. The rectangles indicate the leaf nodes, which represent the target classes, and the circles indicate the decision nodes. To simplify our examples, we use only and . Note that all feature types for each component function, as described in Table 4, are actually used to construct decision trees. Intuitively, the most effective feature from among L3-L1, C0, and R1-R3 for may be located in C0 because the correct outputs of strongly depend on the source grapheme or source phoneme in the C0 position. As we expected, the most effective feature in the decision tree is located in the C0 position, that is, C0(). (Note that the first feature to be tested in decision trees is the most effective feature.) In Figure 2, the decision tree produces the target grapheme (Korean grapheme) ‘o’ for the instance by traversing the decision nodes from to represented by ‘’.

Figure 2: Decision tree for .
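
The context positions L3-L1, C0, and R1-R3 used to represent training instances can be extracted with a sliding window. The sketch below is an assumption-laden illustration: the function name `window_features` and the padding symbol `$` for positions outside the word are hypothetical choices, not taken from Table 4.

```python
def window_features(units, index, size=3, pad="$"):
    """Return the features L{size}..L1, C0, R1..R{size} around position `index`.

    `units` is a sequence of source graphemes (or source phonemes);
    positions that fall outside the word are filled with `pad`.
    """
    features = {}
    for offset in range(-size, size + 1):
        pos = index + offset
        value = units[pos] if 0 <= pos < len(units) else pad
        if offset < 0:
            name = f"L{-offset}"
        elif offset == 0:
            name = "C0"
        else:
            name = f"R{offset}"
        features[name] = value
    return features


# Features for the first grapheme of the word "board".
feats = window_features(list("board"), 0)
```

One such feature dictionary per position, paired with the target grapheme at that position, gives a training instance usable by either a decision-tree or a memory-based learner.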

3.3.3 Memory-based learning

Memory-based learning (MBL), also called “instance-based learning” and “case-based learning”, is an example-based learning method. It is based on a k-nearest neighbor algorithm [Aha et al., 1991; Aha, 1997; Cover & Hart, 1967; Devijver & Kittler, 1982]. MBL represents training data as a vector and, in the training phase, it places all training data as examples in memory and clusters some examples on the basis of the k-nearest neighbor principle. Training data for MBL is represented in the same form as training data for a decision tree. Note that the target classes that MBL outputs are target graphemes. Feature weighting to deal with features of differing importance is also done in the training phase. (TiMBL [Daelemans et al., 2004] supports gain ratio weighting, information gain weighting, chi-squared weighting, and shared variance weighting of the features.) MBL then produces an output using similarity-based reasoning between test data and the examples in memory. The similarity between a test instance and each example in memory can be estimated using a distance function. (The modified value difference metric, overlap metric, Jeffrey divergence metric, dot product metric, etc. are used as the distance function [Daelemans et al., 2004].) MBL selects the example or the cluster of examples most similar to the test instance and then assigns that example’s target class to the test instance. We use an MBL tool called TiMBL (Tilburg memory-based learner) version 5.0 [Daelemans et al., 2004].
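
The similarity-based reasoning described above can be illustrated with a small nearest-neighbor classifier that uses an overlap-style distance (the number of mismatching feature values, optionally weighted). This is a simplified sketch and not TiMBL itself; the function names, the toy examples in memory, and their class labels are hypothetical.

```python
from collections import Counter


def overlap_distance(a, b, weights=None):
    """Sum of feature weights at the positions where two instances differ."""
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(test_instance, memory, k=1, weights=None):
    """Assign the majority class among the k stored examples closest to the test instance."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(test_instance, ex[0], weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical examples in memory: (L1, C0, R1) feature tuples with toy class labels.
memory = [
    (("$", "b", "o"), "B"),
    (("b", "o", "a"), "O"),
    (("o", "a", "r"), "A"),
]

predicted = classify(("$", "b", "a"), memory)
```

Passing per-feature `weights` (for example, values derived from gain ratio) makes mismatches on important features count more than mismatches on unimportant ones.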

4 Experiments

We tested the four machine transliteration models on English-to-Korean and English-to-Japanese transliteration. The test set for the former (EKSet) [Nam, 1997] consisted of 7,172 English-Korean pairs; the number of training items was about 6,000 and that of the blind test items was about 1,000. EKSet contained no transliteration variations, meaning that there was one transliteration for each English word. The test set for the latter (EJSet) contained English-katakana pairs from EDICT [Breen, 2003] and consisted of 10,417 pairs; the number of training items was about 9,000 and that of the blind test items was about 1,000. EJSet contained transliteration variations, like micro, ‘ma-i-ku-ro’, and micro, ‘mi-ku-ro’; the average number of Japanese transliterations for an English word was 1.15. EKSet and EJSet covered proper names, technical terms, and general terms. We used the CMU Pronouncing Dictionary [CMU, 1997] for training pronunciation estimation (or automatic grapheme-to-phoneme conversion). The training for automatic grapheme-to-phoneme conversion was done ignoring the lexical stress of vowels in the dictionary [CMU, 1997]. The evaluation was done in terms of word accuracy, the evaluation measure used in previous work [Kang & Choi, 2000; Kang & Kim, 2000; Goto et al., 2003; Bilac & Tanaka, 2004]. Word accuracy can be represented as Formula (6). A generated transliteration for an English word was judged to be correct if it exactly matched a transliteration for that word in the test data.

(6)
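
Word accuracy with exact matching can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, accuracy is computed per test word, and a word with several reference transliterations (as in EJSet) counts as correct if the output matches any of them. The toy data mixes forms quoted elsewhere in this paper purely for illustration.

```python
def word_accuracy(outputs, references):
    """Fraction of test words whose generated transliteration exactly matches a reference.

    outputs:    dict mapping each source word to one generated transliteration
    references: dict mapping each source word to a set of accepted transliterations
    """
    correct = sum(
        1 for word, refs in references.items() if outputs.get(word) in refs
    )
    return correct / len(references)


# Toy data: "micro" has two accepted variants; the output for "geoid" is wrong.
references = {
    "micro": {"ma-i-ku-ro", "mi-ku-ro"},
    "geoid": {"ji-o-i-deu"},
}
outputs = {"micro": "mi-ku-ro", "geoid": "je-o-i-deu"}

accuracy = word_accuracy(outputs, references)
```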

In the evaluation, we used cross-validation with 7 folds for EKSet and 10 folds for EJSet. Each data set was divided into that number of subsets. Each subset was used in turn for testing while the remainder was used for training. The average word accuracy computed across all trials was used as the evaluation result presented in this section.
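
The cross-validation procedure can be sketched as below. The helper names and the round-robin assignment of items to folds are assumptions made for illustration; the paper does not state how the subsets were formed.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs, using each of k folds once as the test set."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


def cross_validated_score(items, k, evaluate):
    """Average of evaluate(train, test) over all k trials."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)


# Example with 14 dummy items and 7 folds: each trial tests on 2 items.
splits = list(k_fold_splits(list(range(14)), 7))
```

In practice `evaluate` would train a transliteration model on `train` and return its word accuracy on `test`.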

We conducted six tests.

  • Hybrid Model Test: Evaluation of hybrid transliteration model by varying the value of its linear interpolation parameter

  • Comparison Test I: Comparison among four machine transliteration models

  • Comparison Test II: Comparison of four machine transliteration models to previously proposed transliteration methods

  • Dictionary Test: Evaluation of transliteration models on words registered and not registered in pronunciation dictionary to determine effect of pronunciation dictionary on each model

  • Context Window-Size Test: Evaluation of transliteration models for various sizes of context window

  • Training Data-Size Test: Evaluation of transliteration models for various sizes of training data sets

4.1 Hybrid Model Test

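The objective of this test was to estimate the dependence of the performance of the hybrid model on its linear interpolation parameter. We evaluated the performance by changing the parameter from 0 to 1 at intervals of 0.1 (i.e., 0, 0.1, 0.2, ..., 0.9, 1.0). Note that the hybrid model is a linear interpolation, weighted by this parameter, between its phoneme-based and grapheme-based components. Therefore, the hybrid model is equivalent to the grapheme-based model when the parameter is 0 and to the phoneme-based model when it is 1. As shown in Table 6, the performance of the hybrid model depended on that of the grapheme-based and phoneme-based models. For example, the performance of the grapheme-based model exceeded that of the phoneme-based model for EKSet. Therefore, for EKSet the hybrid model tended to perform better with smaller parameter values, which weight the grapheme-based component more heavily, than with larger ones. The best performance was attained when the parameter was 0.4 for EKSet and 0.5 for EJSet. Hereinafter, we use 0.4 for EKSet and 0.5 for EJSet as the linear interpolation parameter for the hybrid model.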

Parameter value   EKSet   EJSet
0                 58.8%   58.8%
0.1               61.2%   60.9%
0.2               62.0%   62.6%
0.3               63.0%   64.1%
0.4               64.1%   65.4%
0.5               63.4%   65.8%
0.6               61.1%   65.0%
0.7               59.6%   63.4%
0.8               58.2%   62.1%
0.9               57.0%   61.2%
1.0               55.2%   59.2%
Table 6: Results of Hybrid Model Test.
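
The effect of the interpolation parameter can be illustrated with a simplified sketch. It assumes, consistent with Table 6, that a parameter value of 0 reproduces the grapheme-based result and a value of 1 reproduces the phoneme-based result. Interpolating scores of whole candidate words is a simplification made here for illustration; the function names and the candidate scores are hypothetical.

```python
def hybrid_scores(grapheme_scores, phoneme_scores, alpha):
    """Linearly interpolate two candidate-score dictionaries.

    alpha = 0 keeps only the grapheme-based scores;
    alpha = 1 keeps only the phoneme-based scores.
    """
    candidates = set(grapheme_scores) | set(phoneme_scores)
    return {
        cand: alpha * phoneme_scores.get(cand, 0.0)
        + (1.0 - alpha) * grapheme_scores.get(cand, 0.0)
        for cand in candidates
    }


def best_candidate(scores):
    """Return the candidate with the highest interpolated score."""
    return max(scores, key=scores.get)


# Hypothetical scores for two candidate transliterations of one word.
grapheme_scores = {"sil-lo": 0.7, "sa-il-lo": 0.3}
phoneme_scores = {"sil-lo": 0.2, "sa-il-lo": 0.8}

choices = {
    alpha: best_candidate(hybrid_scores(grapheme_scores, phoneme_scores, alpha))
    for alpha in (0.0, 0.5, 1.0)
}
```

Sweeping `alpha` over a grid and measuring word accuracy at each value is the kind of search summarized in Table 6.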

4.2 Comparison Test I

The objectives of the first comparison test were to compare performance among the four transliteration models (grapheme-based, phoneme-based, hybrid, and correspondence-based) and to compare the performance of each model with the combined performance of three of the models (grapheme-based, phoneme-based, and correspondence-based). Table 7 summarizes the performance of each model for English-to-Korean and English-to-Japanese transliteration, where DTL, MBL, and MEM represent decision-tree learning, memory-based learning, and the maximum entropy model, respectively. (For MBL, we tested all possible combinations of a distance function and a weighting scheme supported by TiMBL [Daelemans et al., 2004] and did not detect any significant differences in performance among the combinations. Therefore, we used the default setting of TiMBL: the overlap metric for the distance function and gain ratio weighting for feature weighting.)

The unit to be transliterated was restricted to either a source grapheme or a source phoneme in the grapheme-based and phoneme-based models; it was dynamically selected on the basis of the contexts in the hybrid and correspondence-based models. This means that the grapheme-based and phoneme-based models could produce an incorrect result if the source phoneme or the source grapheme, respectively, which they do not consider, holds the key to producing the correct transliteration result. For this reason, the hybrid and correspondence-based models performed better than both the grapheme-based and phoneme-based models.

Transliteration Model            EKSet                   EJSet
                          DTL     MBL     MEM     DTL     MBL     MEM
Grapheme-based            53.1%   54.6%   58.8%   55.6%   58.9%   58.8%
Phoneme-based             50.8%   50.6%   55.2%   55.8%   56.1%   59.2%
Hybrid                    N/A     N/A     64.1%   N/A     N/A     65.8%
Correspondence-based      59.5%   60.3%   65.5%   64.0%   65.8%   69.1%
Combined (three models)   72.0%   71.4%   75.2%   73.4%   74.2%   76.6%
Table 7: Results of Comparison Test I.

In the table, the last row gives the combined results for three transliteration models: grapheme-based, phoneme-based, and correspondence-based. We exclude the hybrid model from the combination because it is implemented only with the MEM (the performance of combining the four transliteration models is discussed in Section 5). In evaluating the combination, we judged the transliteration results to be correct if there was at least one correct transliteration among the results produced by the three models. Though the correspondence-based model showed the best results among the three transliteration models due to its ability to use the correspondence between the source grapheme and source phoneme, the source grapheme or the source phoneme can create noise when the correct transliteration is produced by the other one. In other words, when the correct transliteration is strongly biased to either grapheme-based or phoneme-based transliteration, the grapheme-based and phoneme-based models may be more suitable for producing the correct transliteration.

Table 8 shows example transliterations produced by each transliteration model. The grapheme-based model produced correct transliterations for cyclase and bacteroid, while the phoneme-based model did the same for geoid and silo. The correspondence-based model produced correct transliterations for saxhorn and bacteroid, and the hybrid model produced correct transliterations for geoid and bacteroid. As shown by these results, there are transliterations that only one transliteration model can produce correctly. For example, only the grapheme-based, phoneme-based, and correspondence-based models produced the correct transliterations of cyclase, silo, and saxhorn, respectively. Therefore, these three transliteration models can be used in a complementary manner to improve transliteration performance because at least one can usually produce the correct transliteration. This combination increased the performance compared to that of each of the three models alone (on average, 30.1% in EKSet and 24.6% in EJSet). In short, the grapheme-based, phoneme-based, and correspondence-based models are complementary transliteration models that together produce more correct transliterations, so combining different transliteration models can improve transliteration performance. The transliteration results produced by combining the models are analyzed in detail in Section 5.

            Grapheme-based        Phoneme-based
cyclase     si-keul-la-a-je       ‘sa-i-keul-la-a-je’
bacteroid   bak-te-lo-i-deu       ‘bak-teo-o-i-deu’
geoid       ‘je-o-i-deu’          ji-o-i-deu
silo        ‘sil-lo’              sa-il-lo
saxhorn     ‘saek-seon’           ‘saek-seu-ho-leun’

            Hybrid                Correspondence-based
cyclase     ‘sa-i-keul-la-a-je’   ‘sa-i-keul-la-a-je’
bacteroid   bak-te-lo-i-deu       bak-te-lo-i-deu
geoid       ji-o-i-deu            ‘ge-o-i-deu’
silo        ‘sil-lo’              ‘sil-lo’
saxhorn     ‘saek-seon’           saek-seu-hon
Table 8: Example transliterations produced by each transliteration model (quotation marks indicate an incorrect transliteration).
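
The combination criterion used in this test (a word counts as correct if at least one of the combined models transliterates it correctly) can be sketched as follows. The function name is hypothetical. The data are three rows of Table 8; which form is the gold standard is inferred from the discussion of which model was correct for which word, so it should be read as an assumption.

```python
def combined_accuracy(model_outputs, references):
    """Fraction of words for which at least one model output matches a reference."""
    correct = 0
    for word, refs in references.items():
        if any(outputs.get(word) in refs for outputs in model_outputs.values()):
            correct += 1
    return correct / len(references)


# Gold-standard forms (assumed from the discussion of Table 8).
references = {
    "cyclase": {"si-keul-la-a-je"},
    "silo": {"sa-il-lo"},
    "saxhorn": {"saek-seu-hon"},
}

# Outputs of the three combined models for the same words, as listed in Table 8.
model_outputs = {
    "grapheme": {"cyclase": "si-keul-la-a-je", "silo": "sil-lo", "saxhorn": "saek-seon"},
    "phoneme": {"cyclase": "sa-i-keul-la-a-je", "silo": "sa-il-lo", "saxhorn": "saek-seu-ho-leun"},
    "correspondence": {"cyclase": "sa-i-keul-la-a-je", "silo": "sil-lo", "saxhorn": "saek-seu-hon"},
}

combined = combined_accuracy(model_outputs, references)
single = {
    name: combined_accuracy({name: outputs}, references)
    for name, outputs in model_outputs.items()
}
```

On these three words each model alone is correct for exactly one word, while the combination covers all three, which is the complementary behavior described above.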

In our subsequent testing, we used the maximum entropy model as the machine learning algorithm for two reasons. First, it produced the best results of the three algorithms we tested. (A one-tail paired t-test showed that the results with the MEM were always significantly better than those of DTL and MBL, except for the grapheme-based model in EJSet; level of significance = 0.001.) Second, it can support the hybrid model.

4.3 Comparison Test II

In this test, we compared four previously proposed machine transliteration methods [Kang & Choi, 2000; Kang & Kim, 2000; Goto et al., 2003; Bilac & Tanaka, 2004] to the four transliteration models (grapheme-based, phoneme-based, hybrid, and correspondence-based), which were based on the MEM. Table 9 shows the results. We trained and tested the previous methods with the same data sets used for the four transliteration models. Table 10 shows the key features of the methods and models from the viewpoint of information type and usage. Information type indicates the type of information considered: source grapheme, source phoneme, and correspondence between the two. For example, the first three methods use only the source grapheme. Information usage indicates the context used and whether the previous output is used.

Method/Model                                  EKSet   EJSet
Previous methods   Kang and Choi (2000)       51.4%   50.3%
                   Kang and Kim (2000)        55.1%   53.2%
                   Goto et al. (2003)         55.9%   56.2%
                   Bilac and Tanaka (2004)    58.3%   62.5%
MEM-based models   Grapheme-based             58.8%   58.8%
                   Phoneme-based              55.2%   59.2%
                   Hybrid                     64.1%   65.8%
                   Correspondence-based       65.5%   69.1%
Table 9: Results of Comparison Test II.

Method/Model               Info. type    Info. usage
                           S  P  C       Context     PO
Kang and Choi (2000)       +
Kang and Kim (2000)        +             Unbounded   +
Goto et al. (2003)         +                         +
Bilac and Tanaka (2004)    +  +          Unbounded
Grapheme-based             +                         +
Phoneme-based                 +                      +
Hybrid                     +  +                      +
Correspondence-based       +  +  +                   +
Table 10: Information type and usage for previous methods and four transliteration models, where S, P, C, and PO respectively represent the source grapheme, source phoneme, correspondence between S and P, and previous output.

The tables indicate that the more information types a transliteration model considers, the better its performance tends to be. Either the source phoneme or the correspondence, neither of which is considered in the methods of Kang and Choi (2000), Kang and Kim (2000), and Goto et al. (2003), is the key to the higher performance of the method of Bilac and Tanaka (2004) and of the hybrid and correspondence-based models.

From the viewpoint of information usage, the models and methods that consider the previous output tended to achieve better performance. For example, the method of Goto et al. (2003) had better results than that of Kang and Choi (2000). Because machine transliteration is sensitive to context, a reasonable context size usually enhances transliteration ability. Note that the size of the context window for the previous methods was limited to 3 because a context window wider than 3 degrades performance [Kang & Choi, 2000] or does not significantly improve it [Kang & Kim, 2000]. Experimental results related to context window size are given in Section 4.5.

Overall, the hybrid and correspondence-based models had better performance than the previous methods (on average, 17.04% better for EKSet and 21.78% better for EJSet), the grapheme-based model (on average, 9.6% better for EKSet and 14.4% better for EJSet), and the phoneme-based model (on average, 16.7% better for EKSet and 19.0% better for EJSet). In short, a good machine transliteration model should 1) consider either the correspondence between the source grapheme and the source phoneme or both the source grapheme and the source phoneme, 2) have a reasonable context size, and 3) consider previous output. The hybrid and correspondence-based models satisfy all three conditions.

4.4 Dictionary Test

Table 11 shows the performance of each transliteration model for the dictionary test. In this test, we evaluated the four transliteration models according to how the pronunciation was obtained (from the pronunciation dictionary or through automatic grapheme-to-phoneme conversion). Registered represents the performance for words registered in the pronunciation dictionary, and Unregistered represents that for unregistered words. On average, the number of Registered words in the cross-validation test data was about 600 for EKSet and about 700 for EJSet. In other words, Registered words accounted for about 60% of the test data in EKSet and about 70% of the test data in EJSet. The correct pronunciation can always be acquired from the pronunciation dictionary for Registered words, while the pronunciation must be estimated for Unregistered words through automatic grapheme-to-phoneme conversion. However, automatic grapheme-to-phoneme conversion does not always produce correct pronunciations: only about 70% of the estimated pronunciations were correct.

Transliteration Model          EKSet                       EJSet
                        Registered  Unregistered    Registered  Unregistered
Grapheme-based          60.91%      55.74%          61.18%      50.24%
Phoneme-based           66.70%      38.45%          64.35%      40.78%
Hybrid                  70.34%      53.31%          70.20%      50.02%
Correspondence-based    73.32%      54.12%          74.04%      51.39%
ALL                     80.78%      68.41%          81.17%      62.31%
Table 11: Results of Dictionary Test: ALL means the combination of the four models.

Analysis of the results showed that the four transliteration models fall into three categories. Because the grapheme-based model does not use the source phoneme, it does not need a correct pronunciation, and its performance is not affected by pronunciation correctness. Therefore, the grapheme-based model can be regarded as the baseline performance for Registered and Unregistered. Because the phoneme-based, hybrid, and correspondence-based models depend on the source phoneme, their performance tends to be affected by the performance of pronunciation estimation. Therefore, these three models show notable differences in performance between Registered and Unregistered. However, the performance gap differs with the strength of the dependence. The phoneme-based model falls into the second category: its performance strongly depends on the correct pronunciation, so it tends to perform well for Registered and poorly for Unregistered. The hybrid and correspondence-based models fall into the third category: they depend only weakly on the correct pronunciation. Unlike the phoneme-based model, they make use of both the source grapheme and source phoneme. Therefore, they can perform reasonably well without the correct pronunciation because using the source grapheme weakens the negative effect of incorrect pronunciation in machine transliteration.

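Comparing the phoneme-based and correspondence-based models, we find two interesting things. First, the phoneme-based model was more sensitive to pronunciation estimation errors for Unregistered. Second, the correspondence-based model showed better results for both Registered and Unregistered. Because the two models share the same pronunciation estimation function, the key factor accounting for the performance gap between them is the component function that produces target graphemes: from source phonemes alone in the phoneme-based model, and from the correspondence between source graphemes and source phonemes in the correspondence-based model. From the results shown in Table 11, we can infer that the latter function performed better than the former for both Registered and Unregistered. The source grapheme corresponding to the source phonemes, which the phoneme-based function does not consider, made two contributions to the higher performance of the correspondence-based function. First, the source grapheme in the correspondence made it possible to produce more accurate transliterations. Because it considers the correspondence, the correspondence-based function has a more powerful transliteration ability than the phoneme-based function, which uses just the source phonemes, when the correspondence is needed to produce correct transliterations. This is the main reason the correspondence-based model performed better than the phoneme-based model for Registered. Second, source graphemes in the correspondence compensated for errors produced by pronunciation estimation in producing target graphemes. This is the main reason the correspondence-based model performed better than the phoneme-based model for Unregistered. In the comparison between the correspondence-based and grapheme-based models, the performances were similar for Unregistered. This indicates that the transliteration power of the correspondence-based model is similar to that of the grapheme-based model, even though the pronunciation of the source language word may not be correct. Furthermore, the performance of the correspondence-based model was significantly higher than that of the grapheme-based model for Registered. This indicates that the transliteration power of the correspondence-based model is greater than that of the grapheme-based model if the correct pronunciation is given.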

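The behavior of the hybrid model was similar to that of the correspondence-based model. For Unregistered, the grapheme-based component of the hybrid model made it possible for the hybrid model to avoid errors originating in its phoneme-based component. Therefore, it worked better than the phoneme-based model. For Registered, the phoneme-based component enabled the hybrid model to perform better than the grapheme-based model.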

The results of this test showed that the hybrid and correspondence-based models perform better than the grapheme-based and phoneme-based models while complementing them (and thus overcoming their disadvantages) by considering either the correspondence between the source grapheme and the source phoneme or both the source grapheme and the source phoneme.

4.5 Context Window-Size Test

In our testing of the effect of the context window size, we varied the size from 1 to 5. Regardless of the size, the hybrid and correspondence-based models always performed better than both the grapheme-based and phoneme-based models. When the size was 4 or 5, each model had difficulty identifying regularities in the training data. Thus, there were consistent drops in performance for all models when the size was increased from 3 to 4 or 5. Although the best performance was obtained when the size was 3, as shown in Table 12, the differences in performance were not significant in the range of 2-4. However, there was a significant difference between a size of 1 and a size of 2. This indicates that a lack of contextual information can easily lead to incorrect transliteration. For example, to produce the correct target language grapheme of t in -tion, we need the three graphemes to the right of t (or at least the first two of them): -ion (or -io). The results of this testing indicate that the context size should be more than 1 to avoid degraded performance.

EKSet
Context Size   Grapheme   Phoneme   Hybrid   Correspondence   ALL
1              44.9%      44.9%     51.8%    52.4%            65.8%
2              57.3%      52.8%     61.7%    64.4%            74.4%
3              58.8%      55.2%     64.1%    65.5%            75.8%
4              56.1%      54.6%     61.8%    64.3%            74.4%
5              53.7%      52.6%     60.4%    62.5%            73.9%

EJSet
Context Size   Grapheme   Phoneme   Hybrid   Correspondence   ALL
1              46.4%      52.1%     58.0%    62.0%            70.4%
2              58.2%      59.5%     65.6%    68.7%            76.3%
3              58.8%      59.2%     65.8%    69.1%            77.0%
4              56.4%      58.5%     64.4%    68.2%            76.0%
5              53.9%      56.4%     62.9%    66.3%            75.5%
Table 12: Results of Context Window-Size Test: ALL means the combination of the four models.

4.6 Training Data-Size Test

Table 13 shows the results of the Training Data-Size Test using MEM-based machine transliteration models. We evaluated the performance of the four models and their combination (ALL) while varying the size of the training data from 20% to 100%. In general, the more training data used, the higher the system performance. However, the objective of this test was to determine whether the transliteration models perform reasonably well even for a small amount of training data. We found that the grapheme-based model was the most sensitive of the four models to the amount of training data; it had the largest difference in performance between 20% and 100%. In contrast, ALL showed the smallest performance gap. The results of this test show that combining different transliteration models is helpful in producing correct transliterations even if there is little training data.

EKSet
Training Data Size   Grapheme   Phoneme   Hybrid   Correspondence   ALL
20%                  46.6%      47.3%     53.4%    57.0%            67.5%
40%                  52.6%      51.5%     58.7%    62.1%            71.6%
60%                  55.2%      53.0%     61.5%    63.3%            73.0%
80%                  58.9%      54.0%     62.6%    64.6%            74.7%
100%                 58.8%      55.2%     64.1%    65.5%            75.8%

EJSet
Training Data Size   Grapheme   Phoneme   Hybrid   Correspondence   ALL
20%                  47.6%      51.2%     56.4%    60.4%            69.6%
40%                  52.4%      55.1%     60.7%    64.8%            72.6%
60%                  55.2%      57.3%     62.9%    66.6%            74.7%
80%                  57.9%      58.8%     65.4%    68.0%            76.7%
100%                 58.8%      59.2%     65.8%    69.1%            77.0%
Table 13: Results of Training Data-Size Test: ALL means the combination of the four models.

5 Discussion

Figures 3 and 4 show the distribution of the correct transliterations produced by each transliteration model and by the combination of models, all based on the MEM. Each model's region in the figures represents the set of correct transliterations produced by that model through cross-validation. For example, the set for the grapheme-based model has 4,220 members for EKSet and 6,121 for EJSet, meaning that the model produced 4,220 correct transliterations for the 7,172 English words in EKSet (KTG in Figure 3) and 6,121 correct ones for the 10,417 English words in EJSet (JTG in Figure 4). An important factor in designing a transliteration model is to reflect the dynamic transliteration behaviors, which means that a transliteration process dynamically uses the source grapheme and source phoneme to produce transliterations. Due to these dynamic behaviors, a transliteration can be grapheme-based transliteration, phoneme-based transliteration, or some combination of the two. The forms of transliterations are classified on the basis of the information upon which the transliteration process mainly relies (either a source grapheme or a source phoneme or some combination of the two). Therefore, an effective transliteration system should be able to produce various types of transliterations at the same time. One way to accommodate the different dynamic transliteration behaviors is to combine different transliteration models, each of which can handle a different behavior. Synergy can be achieved by combining models so that one model can produce the correct transliteration when the others cannot. Naturally, if the models tend to produce the same transliteration, less synergy can be realized from combining them. Figures 3 and 4 show the synergy gained from combining transliteration models in terms of the size of the intersection and the union of the transliteration models.

Figure 3, panels (a)-(d): Distributions of correct transliterations produced by models for English-to-Korean transliteration. KTG represents “Korean Transliterations in the Gold standard”. Note that the union of the four models' correct transliterations has 5,439 members, their intersection has 3,047 members, and KTG has 7,172 members.
Figure 4, panels (a)-(d): Distributions of correct transliterations produced by models for English-to-Japanese transliteration. JTG represents “Japanese Transliterations in the Gold standard”. Note that the union of the four models' correct transliterations has 8,021 members, their intersection has 4,786 members, and JTG has 10,417 members.

The figures show that, as the area of intersection between different transliteration models becomes smaller, the size of their union tends to become bigger. The main characteristics obtained from these figures are summarized in Table 14.

EKSet EJSet
4,202 6,118
3,947 6,158
4,583 6,846
4,680 7,189
3,133 4,937
3,731 5,601
4,025 5,731
4,136 6,360
3,675 5,759
3,583 5,841
5,051 7,345
5,188 7,712
4,796 7,239
5,164 7,681
4,988 7,594
4,982 7,169
Table 14: Main characteristics obtained from Figures 3 and 4.
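
Intersection and union sizes of the kind summarized in Table 14 can be computed directly from the sets of correctly transliterated words. The sketch below is illustrative only; the function name and the toy word sets are hypothetical and are not the EKSet or EJSet results.

```python
from itertools import combinations


def pairwise_overlap(correct_sets):
    """Map each pair of model names to (intersection size, union size)."""
    stats = {}
    for a, b in combinations(sorted(correct_sets), 2):
        stats[(a, b)] = (
            len(correct_sets[a] & correct_sets[b]),
            len(correct_sets[a] | correct_sets[b]),
        )
    return stats


# Hypothetical sets of correctly transliterated words for three models.
correct_sets = {
    "grapheme": {"cyclase", "bacteroid"},
    "phoneme": {"geoid", "silo"},
    "correspondence": {"saxhorn", "bacteroid"},
}

stats = pairwise_overlap(correct_sets)
all_union = set().union(*correct_sets.values())
```

A small intersection together with a large union indicates that two models are correct on different words, which is where combining them pays off.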

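The first thing to note is that the intersection of the grapheme-based and phoneme-based models is clearly smaller than any other intersection. The main reason for this is that these two models use no common information (the grapheme-based model uses source graphemes while the phoneme-based model uses source phonemes). In contrast, every other pair of models shares at least one of the source grapheme and the source phoneme (source graphemes are information common to the grapheme-based, hybrid, and correspondence-based models, while source phonemes are information common to the phoneme-based, hybrid, and correspondence-based models). Therefore, we can infer that the synergy derived from combining the grapheme-based and phoneme-based models is greater than that derived from the other combinations. However, the sizes of the unions between the various pairs of transliteration models in Table 14 show that two other unions are bigger than the union of the grapheme-based and phoneme-based models. The main reason for this might be the higher transliteration power of the hybrid and correspondence-based models compared to that of the grapheme-based and phoneme-based models: the hybrid and correspondence-based models cover more of the KTG and JTG than both the grapheme-based and phoneme-based models. The second thing to note is that the contribution of each transliteration model to the union of all four models can be estimated from the difference between that union and the union of the three other transliteration models. For example, the contribution of a given model is the number of correct transliterations that appear in the union of all four models but not in the union of the other three. As shown in Figures 3(a) and 4(a),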