This work presents a novel framework for Neural Machine Translation (NMT) that uses phonetic information 'computed' by human interaction throughout the evolution of spoken language. Our overarching goal is to improve machine translation accuracy. We view social interaction as a computational device that generates precomputed knowledge. We systematically study the possible reasons why this yields improved machine translation systems and, based on this, construct artificial analogs of phonetic encoding, applying these ideas to a wide range of human language pairs and domains. Our empirical study shows substantial improvements over the state-of-the-art. In what follows, we discuss our motivation, give an overview of our methods, and summarize our results.
Machine translation is one of the areas in which we have seen remarkable advances with the revival of neural networks. Deep neural networks are believed to learn the underlying features automatically. To that end, neural networks typically have an extensive number of parameters that must be estimated from the data. The data can come from a biased domain or contain noise such as typos, so it is challenging to rely purely on neural networks to extract all hidden features in NMT. Moreover, we are not aware of a well-known general mathematical explanation of either the effectiveness or the shortcomings/bottlenecks of artificial neural networks. Adding another representation as input potentially allows a more straightforward network structure. Therefore, we ask:
“Can we find a new representation of natural language data beyond the text that is less sensitive to styles or errors (without an additional input source)?”
Phonetics is another representation of language (besides text). Consider how humans communicate a message. Our first recourse, far earlier than any written language, was to encode our thoughts in sound [blevins2004evolutionary]. A similar phenomenon occurs in first language acquisition, when children learn to talk before they learn to read. During human language development, phonology carries semantic meaning; see [tyler1996interaction, beaver2007semantics]. These studies coincide with neuroscientific discoveries about the correlation between phonology and semantics in the human brain [wang2016neural, Amenta2017]. The phonological structure is realized through phonetic implementation, as in [cohn2014interface].
What is phonetics? A phonetic algorithm (encoding) is an algorithm for indexing words by their pronunciation. Table 1 shows two examples of phonetic encodings: Pinyin and Soundex. We view these encodings as many-to-one functions that map multiple words to one representation (formally speaking, a projection of the text). We introduce Soundex, NYSIIS, and Metaphone for Western languages, and use Pinyin for Chinese. In addition to phonetics, we consider other forms of 'projection' for Chinese, in particular a logogram encoding, Wubi.
Can the coupled representation of phonetics and text benefit MT? We use speech, in the form of a phonetic sequence, as a language representation independent of the text. Note that it can be computed efficiently from the given text input; throughout this paper we do not use any additional input or information, nor any additional labeled dataset or other sources. Phonetics has a smaller vocabulary than written text. Adding a second representation of the foreign sentences alongside their text form increases the tolerance to noise and other lexical variations.
How to use phonetics in MT? We give this new phonetic sentence representation together with its written text form as input when training and decoding the neural networks. We use (1) concatenation of the phonetic and word sequences; (2) multi-source encoding of the phonetics and the text form in words. Figure 1 shows the workflow of our methods. We first apply phonetic encoding, logogram encoding, or random clustering to the foreign sentences. Then we apply Byte-Pair-Encoding and learn a word embedding (marked as empty boxes). Source and target embeddings are trained jointly. Finally, we concatenate, or combine with a multi-source encoder, the embeddings of the original text and the phonetics to feed into NMT. We treat the NMT system as a black box, so it is technically easy to adapt our methods to any new NMT or preprocessing tools.
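As a minimal sketch of the concatenation variant, the encoded sequence can simply be appended to the word sequence as a single NMT source line. The `<sep>` marker and the `encode` callback below are our own illustrative assumptions; the paper does not specify a separator token:

```python
def build_joint_input(tokens, encode):
    """Concatenation variant: the original word sequence followed by its
    phonetic sequence, fed to the NMT encoder as one source line.
    The <sep> marker is an illustrative assumption, not from the paper."""
    return tokens + ["<sep>"] + [encode(t) for t in tokens]
```

For example, with a toy encoder `str.upper`, `build_joint_input(["hello", "world"], str.upper)` yields `["hello", "world", "<sep>", "HELLO", "WORLD"]`.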
Table 1 (excerpt): the Pinyin 'xiao4' maps to 笑 (smile), 校 (school), 孝 (filial), and 效 (imitate).
We achieved substantial and consistent improvements over the state-of-the-art on all language pairs with which we experimented. In extensive experiments, we gained up to 4 BLEU points over the state-of-the-art on the medium-scale IWSLT'17 and the large-scale WMT'18 Biomedical tasks, in the translation directions English-German, German-English, English-French, French-English, and Chinese-English. In particular, this led to higher accuracy even on test sets whose distribution is unknown at training time. We verified that our approaches are more robust in French-English experiments, with about a 5 BLEU point improvement on a test set from an unknown domain. Our approach is general and can potentially benefit any language with a phonetic encoding and any NMT system.
Why does phonetics help so significantly in NMT? We investigate the structure of the phonetic encodings, aiming to explain the translation improvements. We performed a systematic empirical analysis to understand the effect of phonetic encodings. We introduce, and verify with three quantitative analyses, our hypothesis of semantic diversity by phonetics, stated as:
“One phonetic representation usually corresponds to characters/words that are semantically diverse.”
From phonetics to artificial encodings based on our hypothesis.
To test the phonetic diversity hypothesis in NMT, we develop an artificial encoding: a new random clustering algorithm that casts words or characters into classes at random. Note that a random mapping is still a mapping (i.e., a word is mapped to a random class, but the same word seen later in the dataset is mapped to the same class). Random clustering groups words/characters that are more semantically diverse than those grouped by typical clustering methods such as K-Means. This procedure is similar to how words with different meanings map to one phonetic representation according to our hypothesis. The distribution of the random cluster sizes follows the number of words mapped to each Metaphone code, and we sample the words for each cluster uniformly at random. Our experimental results show that random clustering achieves improvements comparable to those of phonetic encodings. This finding aligns with the empirical justification of our hypothesis on why phonetic encoding improves NMT.
This work contains two areas of study: phonetic encoding and random clustering, where the former inspires the latter per our hypothesis. Here are our main contributions:
Phonetic and logogram encodings. We introduce phonetic and logogram encoding to enrich the text representation of languages. We experimented with English, French, German, and Chinese on various encodings such as Soundex, NYSIIS, Metaphone, Pinyin, and Wubi. These encodings, when used as auxiliary inputs, improve state-of-the-art NMT with up to 4 and 5 BLEU points on an in-domain and an unknown out-of-domain test set, respectively.
The phonetics diversity hypothesis. We empirically analyze our hypothesis. We find and verify that a phonetic representation corresponds to words with diverse semantic meanings. Phonetics is a function that groups semantically different words. Note that this finding could be of independent interest to the field of Linguistics.
Random clustering. Inspired by our hypothesis, we introduce random clustering. The considerable improvements over state-of-the-art NMT systems can be explained by the validity of our hypothesis and by how it enables the training of a more accurate yet simpler neural network.
In the remainder of the paper, we first introduce phonetic and logogram encodings. Then, we study why adding them (each derived from the original text) improves NMT and propose our hypothesis. Subsequently, we introduce an artificial method, random clustering to generalize text encoding. Finally, we empirically demonstrate that all these approaches significantly boost the NMT accuracy and robustness.
NMT is an approach to MT using neural networks, which takes as input a source sentence x = (x_1, ..., x_n) and generates its translation y = (y_1, ..., y_m), where x_i and y_j are source and target words, respectively. NMT models with attention have three components, namely, an encoder, a decoder, and an attention mechanism. The encoder summarizes the meaning of the input sequence by encoding it with a bidirectional recurrent neural network (RNN). We apply the sequence-to-sequence learning architecture of [Gehring17], where the encoder and decoder states are calculated using convolutional neural networks (CNNs).
3 Phonetic Encodings
A phonetic algorithm is used to index words by their pronunciation. We apply the phonetic algorithm to each word in a sentence and output a sequence of phonetic encodings.
Soundex is a widely known phonetic algorithm for indexing names by sound that mitigates misspelling and alternative-spelling problems. It maps homophones to the same representation despite minor differences in spelling [soundex:07]. Continental European family names share the 26 letters (A to Z) used in English. Soundex clusters these letters, with exceptions. For example, the Soundex key letter code maps 'b, f, p, v' to '1', 'c, g, j, k, q, s, x, z' to '2', and 'd, t' to '3'.
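Using the letter groups above, American Soundex can be sketched in a few lines. This is a simplified illustration (it omits the rule of the full algorithm that treats codes separated only by 'h' or 'w' as adjacent), not the implementation used in the experiments:

```python
def soundex(word: str) -> str:
    """Simplified American Soundex: first letter plus up to three digits.
    Letters with no digit code (vowels, h, w, y) separate runs of equal
    codes. Simplification: the full algorithm treats codes separated
    only by h or w as adjacent; that rule is skipped here."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    code = {c: d for letters, d in groups.items() for c in letters}
    word = word.lower()
    result = word[0].upper()
    prev = code.get(word[0], "")
    for ch in word[1:]:
        d = code.get(ch, "")
        if d and d != prev:
            result += d
        prev = d  # letters with no code reset prev, splitting equal runs
    return (result + "000")[:4]
```

For example, 'Robert' and 'Rupert' both map to R163, illustrating the many-to-one projection that Soundex performs on the text.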
The New York State Identification and Intelligence System Phonetic Code (NYSIIS) is a phonetic algorithm devised in 1970 [rajkovic2007adaptation]. It features an accuracy increase of 2.7% over Soundex and takes special care to handle phonemes that occur in European and Hispanic surnames by adding rules to Soundex. For example, if the last letters of the name are 'EE', these letters are changed to 'Y'.
Metaphone [philips1990hanging] is another algorithm that improves on earlier systems such as Soundex and NYSIIS. The Metaphone algorithm is significantly more complicated than the others because it includes special rules for handling spelling inconsistencies and for looking at combinations of consonants in addition to some vowels.
Hanyu Pinyin (Pinyin) is the official romanization system for Standard Chinese in mainland China. Pinyin, which means ‘spelled sound’, was developed to teach Mandarin. One Pinyin corresponds to multiple Chinese characters. One Chinese word is usually composed of one, two, or three Chinese characters.
Logogram Encoding: Chinese Wubi
The Wubizixing (Wubi or Wubi Xing) is a Chinese character input method primarily used to input Chinese text efficiently with a keyboard. The Wubi method decomposes a character based on its structure rather than its pronunciation. Every character can be written with at most 4 keystrokes, built from the basic stroke types 一 (horizontal), 丨 (vertical), 丿 (left-falling), 丶 (dot), and the hook, in various combinations.
4 Random Clustering
Driven by our hypothesis, which we elaborate in Section 5, we further introduce an artificial method to encode text that simulates a 'natural' encoding (i.e., phonetics or logograms). We call this random clustering, described in Algorithm 1. We cluster words (or characters) uniformly at random. The cluster sizes follow the distribution of how many words/characters are associated with each code of a phonetic algorithm, here Metaphone. For example, in Chinese, each Pinyin forms a cluster: each cluster size equals the number of characters mapped to that Pinyin, and the number of clusters equals the number of unique Pinyins.
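The idea can be sketched as follows (function and variable names are ours; Algorithm 1 in the paper is the authoritative version). The vocabulary is shuffled once and partitioned into clusters whose sizes follow the phonetic group-size distribution, so each word deterministically keeps its cluster:

```python
import random

def random_clustering(vocab, size_dist, seed=0):
    """Map each word to a random cluster id. Cluster sizes follow
    size_dist, e.g. the number of words sharing each Metaphone code.
    The mapping is fixed once built: a word always gets the same cluster."""
    rng = random.Random(seed)
    words = list(vocab)
    rng.shuffle(words)                 # uniform random assignment
    sizes = iter(size_dist)
    clusters, i, cid = {}, 0, 0
    while i < len(words):
        size = max(1, next(sizes, 1))  # fall back to singleton clusters
        for w in words[i:i + size]:
            clusters[w] = cid
        i += size
        cid += 1
    return clusters
```

With a fixed seed the mapping is reproducible, which mirrors the requirement that the same word seen later in the dataset maps to the same class.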
One phonetic representation (for example, Pinyin in Chinese) usually corresponds to characters/words that are semantically diverse.
At first, this hypothesis may seem counter-intuitive. However, spoken language is shaped to reduce ambiguity in oral communication; otherwise, humans would not be able to communicate effectively due to confusion. For example, red (Pinyin: 'hong') and green (Pinyin: 'lv') in Chinese appear in similar contexts, so they need distinct sounds. It seems plausible that part of the development of phonetics is that a sound is re-used only when context can distinguish among the multiple interpretations, for example, 'to' versus 'two.'
How do we set up experiments to verify this?
We test our hypothesis using geometric interpretations of semantics, precisely, word embeddings [bengio2014word]. Intuitively, an embedding [mikolov2013distributed, arora2016linear] preserves pairwise semantic distances: two words/characters are close if they are semantically similar and far apart otherwise. See the work of [zouzias2010low, molitor2017remarks] on volume-preserving embeddings, which formalizes this concept. For example, if we have a set of words that all correspond to one Pinyin, the points themselves may mean nothing, but the distances among the points are our focus. Typically in geometry, three points in space suffice to quantify a volume. We embed each word or character from the Chinese-English translation data (in Section 6) into 100 dimensions and then project this embedding into two dimensions using PCA. Algorithm 2 describes how we compute a smooth convex hull of points. The convex hull of a Pinyin is the convex hull of the embeddings of all words or characters pronounced with this Pinyin.
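The 2-D 'volume' (area) of a group's convex hull can be computed with Andrew's monotone chain and the shoelace formula. This is a generic sketch, not the paper's Matlab implementation, and it skips the smoothing step of Algorithm 2:

```python
def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    hull = []
    for seq in (pts, reversed(pts)):   # lower hull, then upper hull
        part = []
        for p in seq:
            while len(part) >= 2 and cross(part[-2], part[-1], p) <= 0:
                part.pop()
            part.append(p)
        hull += part[:-1]              # drop endpoint shared with next part
    return hull

def hull_area(points):
    """Shoelace area of the convex hull (the 2-D 'volume' of a group)."""
    h = convex_hull(points)
    n = len(h)
    return abs(sum(h[i][0] * h[(i + 1) % n][1] - h[(i + 1) % n][0] * h[i][1]
                   for i in range(n))) / 2
```

Applied to the 2-D PCA projections of one Pinyin's character embeddings, `hull_area` gives the per-group volume compared across Pinyin, random clustering, and K-Means below.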
Figure 2 shows all embedded Chinese characters as red dots; the black dots are the Chinese character(s) of one random Pinyin in each plot. We can see that characters with the same pronunciation tend to have distributed meanings, that is, meanings that are well-distributed over the Euclidean plane.
In Figure 3, we measure the convex hull (the smallest convex set that contains all points, implemented in Matlab) of all characters. We exclude outliers (blue dots) by removing every point that has fewer than a threshold number of other points within a ball of fixed radius around it. The first plot shows the hull enclosing the characters of one random group (either a cluster or a Pinyin). The second plot adds the characters of a second random group to the first, and analogously for the third plot. The convex hull volume (here, 2-D volume) increases as we add groups. We can see that Soundex, Pinyin, and random grouping cover the space faster than K-Means clustering as the number of groups increases.
We verify this with three experiments. First, Figure 4 shows the empirical CDF of the convex hull volume of the characters of each Pinyin, random cluster, and K-Means cluster, where the x-axes indicate the volume and the y-axes the cumulative frequency. Random clustering and Pinyin grouping both have larger volumes than K-Means. Within each group, Pinyin is slightly better distributed (more widespread) than uniformly random clustering, and both are better distributed than K-Means. This phenomenon is striking and is probably due to the isoperimetry of uniform random sampling for these data points.
Second, we define the concentration factor of a grouping as the average, over all groups (either clusters or Pinyins), of the inverse spread of the embedded points within each group. The smaller the value, the more distributed the points are in each cluster. The concentration factor is 9350 for K-Means, 3.783 for Pinyin, and 1.476 for the random clustering in Chinese; 3543K for K-Means, 0.3674 for Soundex, and 0.0191 for random clustering.
Finally, we define the density measure as in Algorithm 3, which is intuitively a more robust test. For each point q in the smoothed convex hull of all words/characters, let d_k(q) be the distance between q and the k-th nearest neighbor of q in the point set. We then look at either the maximum of d_k(q) over all q, or the average. A larger k captures the 'density' of the point set at larger scales; this parameter can be tuned to be more robust against noise. We numerically integrate over the convex hull surface by randomly sampling points that are linear combinations of the convex hull corners, weighted uniformly at random. Table 2 shows the density results, which are consistent with the CDF in Figure 4: Pinyin is the most well-distributed, then random clustering, followed by K-Means.
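The d_k statistic itself can be sketched with a brute-force nearest-neighbor search. This is a simplified stand-in for Algorithm 3, which additionally samples the smoothed hull surface:

```python
import math

def knn_density(points, k):
    """For each point q, compute d_k(q), the distance to its k-th nearest
    neighbor; return (max, average) over all points. O(n^2 log n) sketch."""
    stats = []
    for i, q in enumerate(points):
        dists = sorted(math.dist(q, p) for j, p in enumerate(points) if j != i)
        stats.append(dists[k - 1])
    return max(stats), sum(stats) / len(stats)
```

A smaller maximum or average d_k(q) over a region means the points cover that region more densely.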
Adding auxiliary information:
First, we convert all the words or characters in the source sentences (in the training, development, and test sets) into phonetic or other encodings. Then, we segment those sequences of encodings with the Byte Pair Encoding (BPE) compression algorithm [Sennrich16, gage1994new]. We learn a new embedding on this encoded training data only (for example, on the Metaphone encodings). Next, the embedded vectors are either concatenated with the original sentence embedding (after BPE), combined with it via multi-source encoding, or used alone as inputs to the encoder of the CNN neural translator.
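A single BPE operation [gage1994new, Sennrich16] can be sketched as follows. This is a minimal illustration of one merge step, not the subword-nmt implementation used in the experiments:

```python
from collections import Counter

def bpe_merge_step(corpus):
    """Perform one BPE operation: merge the most frequent adjacent symbol
    pair everywhere in the corpus (a list of symbol-sequence lists)."""
    pairs = Counter()
    for word in corpus:
        pairs.update(zip(word, word[1:]))
    if not pairs:
        return corpus, None
    (a, b), _ = pairs.most_common(1)[0]
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged, (a, b)
```

Learning N BPE operations amounts to repeating this step N times; the same procedure applies to phonetic code sequences as to word sequences.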
Datasets and Vocabularies
We carried out experiments for five translation directions: Chinese to English (ZH-EN), English to French (EN-FR), French to English (FR-EN), English to German (EN-DE), and German to English (DE-EN). We used the IWSLT'17 [iwslt17] and the WMT'18 Biomedical training data. Table 3 shows vocabulary statistics on the source/target tokenized text [chinesetokenizer] before and after applying the encodings [jellyfishpackage]. We apply BPE with 89K and 16K [denkowski2017stronger] operations for FR, 89K for DE, and 18K operations for ZH.
For each encoding scheme, we carried out two experiments: one with only the encoded sentences and one with the source sentence concatenated with the encoded sentence. For example, W+Soundex means the source sentence in words concatenated with all words converted into Soundex as the input to NMT. As the Soundex algorithm does not support French, we do not report it for French. We evaluate translation quality with BLEU as implemented by multi-bleu [multibleu].
|Baseline: Words [Gehring17] (W)||35.01||36.21||34.37||36.78||27.79||25.12|
|Baseline: Words [Vaswani17]||36.81|
Table 4 shows that an encoding used as an auxiliary input (concatenated with the original sentence) significantly improves the translation quality in all language directions in our experiments. W+Metaphone indicates adding Metaphone to the word-based NMT baseline, which gives the best results for EN-FR and DE-EN, with improvements of 1.71 and 1.2 BLEU points, respectively. In our experiments, random clustering consistently improves over the baselines for all languages. The non-uniform random clustering method in Algorithm 1 achieves a higher BLEU score (37.95%) than uniform random clustering after tuning the cluster size. For the EN-FR data in Table 5, we sample words uniformly at random for each cluster. We obtain BLEU scores of 37.74%, 37.77%, 37.38%, and 37.63% when setting the number of clusters to 20%, 40%, 60%, and 80% of the vocabulary size (63,615 words), i.e., an average cluster size of 5, 2.5, 1.6, and 1.25, respectively.
However, for most languages, the best encodings are phonetic ones. Since linguistic information is typically language-dependent, some phonetic encodings may be more suitable for certain languages than others. NYSIIS handles phonemes that occur in European and Hispanic surnames; thus, it performs best for French. Metaphone is an advanced algorithm that handles spelling variations and inconsistencies; hence, it works best for English and German (both Germanic languages).
Table 7 shows the results of the ZH-EN translation system. We apply Pinyin [pinyinwubipackage], Pinyin segmented into letters, and Wubi encoding [pinyinwubipackage]. We achieve significant improvements over the baseline by adding auxiliary information: 0.87 BLEU points with Pinyin, 1.68 BLEU points with Pinyin in letters, and 1.11 BLEU points with Wubi, respectively. Random clustering on Chinese characters and on words improves the baseline by 1.49 and 1.47 BLEU points, respectively, a more significant improvement than that of K-Means clustering.
Additionally, we experimented on English-French and French-English with a large dataset: the WMT’18 Biomedical task that contains more than 2 million sentences. We show the translation results in Table 5. We achieved a significant improvement of 2.21 BLEU points for English-French and 4.27 BLEU points for French-English, respectively.
|Baseline: Words [Gehring17]||31.10||28.19|
We tune the dropout parameter for three experiments on ZH-EN: Words, W+Pinyin, and W+Pinyin letters. The dropout value is set by default to 0.2, and the beam size to 12. Figure 5 shows how translation accuracy changes as the dropout value varies. The highest BLEU score is at a dropout of 0.05 for the baseline, but between 0.2 and 0.3 for our approach. A higher optimal dropout value means that fewer nodes in the neural network are needed for optimal NMT quality. This implies that adding auxiliary inputs reduces the model complexity.
Training time in minutes/ number of epochs
Table 6 shows the system training time for ZH-EN. The total time (in minutes) is in the first column, and the number of epochs is in the second. The auxiliary information reduces the model complexity, as described in Section 6. Therefore, the training becomes more efficient and needs fewer epochs to converge. The total training time of our approaches is comparable to that of the baselines, and sometimes even less.
|Baseline: Words [Gehring17]||17.00|
|Pinyin in letters||12.51|
|W+Pinyin in letters||18.68|
|random clustering words||17.35|
|random clustering characters||15.84|
|W+random clustering words||18.47|
|W+random clustering characters||18.49|
|Source Data||No. of Operations|
We test the system's robustness with noise-augmented data on IWSLT'17 EN-FR. In training, we randomly sample 20% of the words and substitute each of them with another word, selected at random according to its word2vec similarity to the substituted word. We concatenate the clean and noisy data and their phonetically encoded data. For each test sentence, we perform 0, 1, 2, 3, 4, or 5 edit-distance operations, sampling uniformly at random the word(s), the type of edit (deletion, substitution, or insertion), and the substitution word. We train models on the following data: 1. Clean data (C); 2. Noisy data (N); 3. Noisy data concatenated with its Soundex encoding (N+S); 4. Combined clean and noisy data (C+N), as in [cheng2018towards]. Table 8 (BPE 89K) shows that adding Soundex helps on most test sets across different numbers of operations.
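The test-side noise can be sketched as follows. This is a simplified stand-in: substitution words here are drawn uniformly from a vocabulary list, rather than by word2vec similarity as in the training-side augmentation:

```python
import random

def perturb(tokens, n_ops, vocab, seed=0):
    """Apply n_ops edit-distance operations to a token sequence, sampling
    the position, the edit type (delete / substitute / insert), and any
    replacement word uniformly at random."""
    rng = random.Random(seed)
    toks = list(tokens)
    for _ in range(n_ops):
        op = rng.choice(["delete", "substitute", "insert"])
        if op == "delete" and toks:
            toks.pop(rng.randrange(len(toks)))
        elif op == "substitute" and toks:
            toks[rng.randrange(len(toks))] = rng.choice(vocab)
        else:
            toks.insert(rng.randrange(len(toks) + 1), rng.choice(vocab))
    return toks
```

Running this with n_ops from 0 to 5 produces the increasingly noisy test sets evaluated in Table 8.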
Furthermore, we test the system's robustness on a test set whose distribution is unknown during training (unlike the common robustness task); we aim to evaluate how a system behaves in a real-life scenario. We verified this on the English-French systems in Table 4 trained on IWSLT'17. The test sets are the out-of-domain News WMT'15 [Bojar2015findings] and the informal-language MTNT test sets [wmt2019robust] released for the WMT'19 robustness shared task. As Table 9 shows, all of our approaches achieved higher accuracy than the baselines. W+Metaphone outperforms all other systems and improves over the baseline by about 5 BLEU points.
7 Related Work
[hayes1996phonetically, johnson2015sign] applied phonological rules or constraints to tasks such as word segmentation. Phonetics involves gradient and variable phenomena, while phonology has rules and patterns. In neural networks, we can learn directly from phonetic data and leave the network structure to discover the hidden phonetic features that optimize NMT performance, instead of optimizing towards phonological constraints. Discriminatively learned phonetic features have demonstrated success in various language technology applications. The work of [huang2004improving] used phonetic information to improve named entity recognition. [bengio2014word] integrates speech information into word embedding and subword unit models. [du2017Pinyin] converted Chinese characters to subword units using Pinyin to reduce unknown words in a factored model. Our work improves NMT overall rather than only the translation of unknown Chinese words. We are the first to introduce the use of Soundex, NYSIIS, Metaphone, and Wubi in NMT, and we develop random clustering driven by our empirically verified hypothesis.
Prior research has investigated auxiliary information for NLP tasks, such as polysemous word embedding structures [arora2016linear], factored models [martnez2016factored, sennrich2016linguistic], and feature compilation [sennrich2016linguistic]. We apply both concatenation and multi-source encoding to combine phonetic inputs.
Closely related, but independent of this work, are word segmentation and character-based NMT approaches such as [ling2015character, chung2016character], which focus on the decomposition of the translation unit. Smaller text granularity helps with unseen words, while longer translation units reduce input lengths. The use of Pinyin and logogram encodings before BPE further decomposes Chinese characters, while Soundex, NYSIIS, and Metaphone have the effect of grouping letters. Both cases differ from the typical word segmentation task. We take a different angle and view the MT input as an information source encoded in various forms: we study source sentence representations other than text, such as phonetic encodings, which work surprisingly well when combined with word segmentation methods.
We introduced phonetic and logogram encodings that substantially improve NMT translation quality (4 BLEU points) and robustness (5 BLEU points). In seeking the reasons for this improvement, we found and verified our hypothesis of semantic diversity by phonetics, which led to a new encoding algorithm, random clustering, that also significantly improves NMT.