Robust Neural Machine Translation with Joint Textual and Phonetic Embedding

10/15/2018 ∙ by Hairong Liu, et al. ∙ Baidu, Inc.

Neural machine translation (NMT) is notoriously sensitive to noise, but noise is almost inevitable in practice. One special kind of noise is homophone noise, where words are replaced by other words with the same (or similar) pronunciation. Homophone noise arises frequently in many real-world scenarios upstream of translation, such as automatic speech recognition (ASR) or phonetic-based input systems. We propose to improve the robustness of NMT to homophone noise by 1) jointly embedding both the textual and phonetic information of source sentences, and 2) augmenting the training dataset with homophone noise. Interestingly, we found that to achieve the best translation quality, most (though not all) of the weight should be put on the phonetic rather than the textual information, with the latter used only as auxiliary information. Experiments show that our method not only significantly improves the robustness of NMT to homophone noise, as expected, but also, surprisingly, improves translation quality on clean test sets.




1 Introduction

In recent years we have witnessed rapid progress in the field of neural machine translation (NMT). Sutskever et al. (2014) proposed a general end-to-end approach for sequence learning and demonstrated promising results on the task of machine translation. Cho et al. (2014) proposed the well-known Encoder-Decoder architecture: a neural network with two components, an encoder network which encodes an input sequence into a fixed-length vector representation, and a decoder network which generates the output sequence from that fixed-length vector. To overcome the limitations of the fixed-length representation, the attention mechanism Bahdanau et al. (2014); Luong et al. (2015) was proposed and has been extensively studied, resulting in significant improvements in NMT. Further improvements, such as replacing recurrent units with convolutional units Gehring et al. (2017) or self-attention Vaswani et al. (2017), have pushed the boundary of NMT; in particular, the transformer network proposed in Vaswani et al. (2017) achieves state-of-the-art (SOTA) results on many tasks, including machine translation.

However, these NMT models are very sensitive to noise in input sentences. The causes of this vulnerability are complicated, and some of them are: 1) neural networks are inherently sensitive to noise, as demonstrated by adversarial examples Goodfellow et al. (2014); Szegedy et al. (2013); 2) the global effect of attention, where every input word can affect every output word generated by the decoder; and 3) the embedding of input words is very sensitive to noise. To improve the robustness of NMT, Cheng et al. (2018) recently proposed adversarial stability training to improve the robustness of both the encoder and the decoder to perturbed inputs.

Clean Input 目前已发现109人死亡, 另有57人获救
Output of Transformer at present, 109 people have been found dead and 57 have been rescued
Noisy Input 目前已发现109人死亡, 另又57人获救
Output of Transformer the hpv has been found dead so far and 57 have been saved
Output of Our Method so far, 109 people have been found dead and 57 others have been rescued
Table 1: Translation results on a Mandarin sentence without and with homophone noise. The word ‘有’ (yǒu, “have”) in the clean input is replaced by one of its homophones, ‘又’ (yòu, “again”), to form the noisy input. This seemingly minor change completely fools the Transformer into generating something irrelevant (“hpv”). Our method, by contrast, is very robust to homophone noise thanks to the phonetic information.

In this paper, we focus on one special kind of noise, which we call homophone noise, where a word is replaced by another word with the same (or very similar) pronunciation. Homophone noise is common in many real-world systems. One example is speech translation, where the ASR system may output correct phoneme sequences but transcribe some words into their homophones. Another example is phonetic input systems for non-phonetic writing systems, such as Pinyin for Chinese or Katakana/Hiragana for Japanese, where it is very common for a user to choose a homophone instead of the correct word. Existing NMT systems are very sensitive to homophone noise, and Table 1 illustrates such an example. The transformer model correctly translates the clean input sentence; however, when one character, ‘有’, is replaced by one of its homophones, ‘又’, it generates a strange and irrelevant translation. The method proposed in this paper generates correct results under this kind of noise, since it mainly relies on phonetic information.

Since words are discrete signals, a common practice for feeding them into a neural network is to encode them as real-valued vectors through embedding. However, if a word is replaced by another word due to noise, its embedding usually changes dramatically, which inevitably changes the output. For homophone noise, since the correct phonetic information is preserved, we can exploit it to make translation much more robust.

In this paper, we propose to improve the robustness of NMT models to homophone noise by jointly embedding both textual and phonetic information. In our approach, the input includes both the source sentence and the pronunciation of each word in the source language, while the output is still a sentence in the target language. Both words and their corresponding pronunciations are embedded and then combined before being fed into the neural network. This approach has the following advantages:

  • First, it is a simple but general approach that is easy to implement. It can be applied to many NMT models and only requires modifying the embedding layer, where the phonetic information is embedded and combined with the word embeddings.

  • Second, it dramatically improves the robustness of NMT models to homophone noise. This is because the final embedding in our approach is a combination of the word embedding and the phonetic embedding. When most weight is placed on the phonetic embeddings, the final embeddings remain robust even if some word embeddings are incorrect.

  • Third, it also improves translation performance on clean test sets. This is surprising, since no additional information is provided. Although the real reason is unknown, we suspect some kind of regularization effect from the phonetic embeddings.

To further improve the robustness of NMT models to homophone noises, we use data augmentation to expand the training datasets, by randomly selecting training instances and adding homophone noises to them. The experimental results clearly show that data augmentation improves the robustness of NMT models to homophone noises.

2 Background

In theory, the goal of NMT is to optimize the translation probability of a target sentence y = (y_1, …, y_m) conditioned on the corresponding source sentence x = (x_1, …, x_n):

P(y | x; θ) = ∏_{t=1}^{m} P(y_t | y_{<t}, x; θ),   (1)

where θ represents the model parameters and y_{<t} represents the partial translation with the first t − 1 target words.
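As a concrete illustration, the chain-rule factorization in (1) can be evaluated from the decoder's per-step distributions; the sketch below (the function name and toy distributions are our own, not the paper's) computes the log-probability of a target sequence.

```python
import math

def sequence_log_prob(step_distributions, target_ids):
    """log P(y | x) = sum over t of log P(y_t | y_<t, x), where
    step_distributions[t] is the decoder's distribution over the
    target vocabulary at step t, and target_ids[t] is the index of y_t."""
    assert len(step_distributions) == len(target_ids)
    return sum(math.log(dist[y]) for dist, y in zip(step_distributions, target_ids))

# Toy example: a 3-word target sentence over a 4-word vocabulary.
dists = [
    [0.7, 0.1, 0.1, 0.1],  # P(y_1 | x)
    [0.1, 0.6, 0.2, 0.1],  # P(y_2 | y_1, x)
    [0.2, 0.2, 0.5, 0.1],  # P(y_3 | y_<3, x)
]
log_p = sequence_log_prob(dists, [0, 1, 2])  # log(0.7) + log(0.6) + log(0.5)
```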

NMT uses neural networks to model the objective function (1), usually with two core components: an encoder and a decoder. The encoder encodes the source sentence into a sequence of hidden representations, and the decoder generates the target sequence based on this sequence of hidden representations. Both the encoder and the decoder can be modeled by various neural networks, such as RNNs, CNNs, and self-attention.

Both the source sequence and the target sequence are sequences of words, which are discrete signals; however, neural networks need continuous signals as inputs. A common practice is to first embed words into a real coordinate space, denoted by R^d, where d is the dimension of the space. After embedding, each word is represented by a d-dimensional real vector, which can be fed into a neural network.

Embedding is very sensitive to noise, which is almost self-evident: when a word is replaced by another word due to noise, the resulting embedding vector usually changes dramatically (from the embedding vector of the original word to that of the replacement). It is hard to make the embedding robust to arbitrary noise; however, for homophone noise, thanks to the preserved correct phonetic information, the embedding can be made extremely robust by putting most of the weight on the phonetic information and using the textual information only as auxiliary information. This is reasonable, since humans can communicate with each other with no knowledge of written text.

3 Joint Embedding

For a word w in the source language, suppose its pronunciation can be expressed by a sequence of k pronunciation units, such as phonemes or syllables, denoted by p(w) = (p_1, …, p_k). Note that we use the term “word” loosely here: w may in fact be a word, a subword, or even a character in languages where each character has a pronunciation, such as Mandarin.

We embed both pronunciation units and words. For a pronunciation unit p, its embedding is denoted by e(p), and for a word w, its embedding is denoted by e(w). For the pair of a word w and its pronunciation sequence p(w) = (p_1, …, p_k), we thus have k + 1 embedding vectors, namely e(w), e(p_1), …, e(p_k). Since k differs across words, and even across different pronunciations of the same word, we need to combine them into a vector whose length does not depend on k. There are obviously multiple ways to do this; in this paper, we proceed as follows:

  • First, both words and pronunciation units are embedded into the same space R^d.

  • Second, the pronunciation embeddings e(p_1), …, e(p_k) are merged into a single embedding by averaging, denoted by e_p(w), that is, e_p(w) = (1/k) ∑_{i=1}^{k} e(p_i).

  • Third, the word embedding and the averaged pronunciation embedding are combined:

    e(w, p(w)) = (1 − β) e(w) + β e_p(w),

    where β ∈ [0, 1] is a parameter which controls the relative contributions of words and their pronunciations. When β = 0, only the textual embedding is used; when β = 1, only the phonetic embedding is used. The best balance, as demonstrated by our experiments, is a β close to, but not equal to, 1.
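In pure-Python pseudocode, the three steps above amount to the following minimal sketch (the function name, the toy vectors, and the particular weight value are our own illustration, not the paper's implementation):

```python
def joint_embed(word_vec, pron_vecs, beta):
    """Combine a word embedding with the average of its pronunciation-unit
    embeddings; beta is the weight on the phonetic side (a beta close to 1
    relies mostly on pronunciation)."""
    d = len(word_vec)
    k = len(pron_vecs)
    # Step 2: average the k pronunciation-unit embeddings.
    pron_avg = [sum(v[i] for v in pron_vecs) / k for i in range(d)]
    # Step 3: convex combination of textual and phonetic embeddings.
    return [(1.0 - beta) * w + beta * p for w, p in zip(word_vec, pron_avg)]

# Toy 2-dimensional example: a word with two pronunciation units.
e_word = [1.0, 0.0]
e_prons = [[0.0, 1.0], [0.0, 3.0]]                 # average is [0.0, 2.0]
combined = joint_embed(e_word, e_prons, beta=0.5)  # -> [0.5, 1.0]
```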

4 Data Augmentation

Data augmentation is widely known to be extremely useful in training models Shotton et al. (2011); Krizhevsky et al. (2012), especially data-hungry neural networks. The homophone noise tackled in this paper is easy to simulate, and can also be collected from the output of ASR systems and pronunciation-based input methods.

In this paper, we augment the training datasets in a very simple way:

  • First, for each word w, build a set H(w) containing the homophones of w that real systems may realistically mistake for w.

  • Second, randomly pick training pairs from the training datasets, and revise their source sentences by randomly replacing some words with their homophones.

After augmentation, the robustness of the models is significantly improved. Together with the embedding of pronunciation information, the resulting NMT models are very robust to homophone noise.
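The two augmentation steps above can be sketched as follows (the homophone table here is a tiny hand-made example, not the one actually used in the paper):

```python
import random

# Hypothetical homophone sets H(w); a real table would be built from a
# pronunciation dictionary and from errors observed in real systems.
HOMOPHONES = {
    "有": ["又", "右"],
    "世纪": ["实际", "事迹"],
}

def add_homophone_noise(src_words, replace_prob, rng):
    """Replace each word that has homophones with one of them,
    independently with probability replace_prob."""
    return [
        rng.choice(HOMOPHONES[w]) if w in HOMOPHONES and rng.random() < replace_prob
        else w
        for w in src_words
    ]

def augment(pairs, n_samples, replace_prob, rng):
    """Randomly pick training pairs and add homophone noise to the source side."""
    noisy = []
    for _ in range(n_samples):
        src, tgt = rng.choice(pairs)
        noisy.append((add_homophone_noise(src, replace_prob, rng), tgt))
    return pairs + noisy

rng = random.Random(0)
pairs = [(["目前", "有", "人"], ["currently", "there", "are", "people"])]
augmented = augment(pairs, n_samples=1, replace_prob=1.0, rng=rng)
```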

5 Experiments

5.1 Models

In our experiments, we use the Transformer as the baseline, because the Transformer achieves SOTA results on many machine translation datasets and has well-maintained open-source codebases, such as Tensor2Tensor and OpenNMT. Specifically, we use the PyTorch version (PyTorch 0.4.0) of OpenNMT. All models are trained on GPUs. Unless otherwise specified, the hyperparameters are: 6 layers, 8 attention heads, 2048 neurons in the feed-forward layer, and 512 neurons in the other layers, with dropout, label smoothing, and the Adam optimizer with NOAM learning-rate decay.

5.2 Translation Tasks

We evaluated our method on Mandarin-English translation tasks, and report the 4-gram BLEU score Papineni et al. (2002) as calculated by the multi-bleu.perl script.

We used an extended NIST corpus of Mandarin-English sentence pairs. One NIST dataset is used as the dev set to select the best model, and the remaining NIST datasets are used as test sets.

We apply byte-pair encoding (BPE) Sennrich et al. (2016) on both the Chinese and English sides to reduce the vocabulary sizes to 18K and 10K, respectively, and exclude sentences longer than 256 subwords. For Mandarin, we use pinyin syllables as pronunciation units. For symbols and entries in the Mandarin dictionary without a pronunciation or with an unknown pronunciation, a special pronunciation unit is used.
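The character-to-pinyin lookup can be sketched as below; the tiny table and the unknown-pronunciation token name are our own illustration, whereas the paper's full inventory comes from a Mandarin dictionary.

```python
# Tiny illustrative pronunciation table (toneless pinyin).
PINYIN = {"目": "mu", "前": "qian", "已": "yi", "发": "fa", "现": "xian"}
UNK_PRON = "<unk>"  # special unit for symbols / unknown pronunciations

def to_pinyin(chars):
    """Map each Mandarin character to its pinyin pronunciation unit,
    falling back to the special unknown unit."""
    return [PINYIN.get(c, UNK_PRON) for c in chars]

units = to_pinyin("目前X")  # -> ['mu', 'qian', '<unk>']
```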

5.3 Translation Results

Figure 1: BLEU scores on the dev set (NIST test set) for the baseline model (Transformer-base) and our models with different values of the phonetic weight β. The x-axis is the number of iterations and the y-axis is the case-insensitive BLEU score against multiple references. Although there are some fluctuations in the BLEU scores, one observation is clear: the models combining both textual and phonetic information perform much better than the model without phonetic information (the baseline, blue line) and the model without textual information (our model with β = 1, black line).

In Figure 1, we compare the performance, measured by BLEU scores against multiple references, of the baseline model and our models with different values of the phonetic weight β. Note that our model is almost exactly the same as the baseline model, differing only in the source embeddings. In theory, when β = 0, our model is identical to the baseline. In practice, there is a slight difference: when β = 0, the phonetic embedding parameters still exist, which affects the optimization procedure even though no gradients flow back to them. When β = 1, only phonetic information is used. There are some interesting observations from Figure 1. First, combining textual and phonetic information improves translation performance over the baseline. Second, the phonetic information plays a very important role in translation: even when most of the weight is on the phonetic embedding and only a small weight is on the word embedding, the performance is still very good, and in fact our best BLEU score is achieved with a β close to 1. However, the word embedding is still important: when only phonetic information is used (β = 1), the performance becomes worse, almost the same as the baseline (which uses only textual information). As mentioned in Section 2, humans seem to need only phonetic information to communicate with each other; this is probably because we are better at understanding context than machines, and thus do not need the help of textual information.

Table 2: Translation results on NIST Mandarin-English dev and test sets

We use one NIST dataset as the dev set to select the best models for both the baseline and our model with different values of β, and test them on the remaining NIST test sets. The results are reported in Table 2. They corroborate our findings from Figure 1, and here we emphasize the most important one: although we can get good translations from either the textual or the phonetic information of the source sentence, combining them yields much better results.

Figure 2: Visualization of a small region in the embedding space. Note that pinyins with similar pronunciations are close in the embedding space.

To understand why phonetic information helps translation, it is helpful to visualize the embeddings of the pronunciation units. We project the whole pinyin embedding space into a two-dimensional space using the t-SNE technique Maaten and Hinton (2008), and illustrate a small region of it in Figure 2. An intriguing property of the embedding is that pinyins with similar pronunciations are close to each other, such as ZHEN and ZHENG, JI and QI, and MU and HU.

Pinyin Word 1 Word 2 Word 3 Word 4
SHI SHI 实施 事实 实事 适时
SHI JI 世纪 实际 时机 事迹
Table 3: Some examples of homophones in Mandarin.

Homophones are very common in Mandarin; Table 3 lists some groups of them. To test the robustness of NMT models to homophone noise, we created two noisy test sets, NoisySet1 and NoisySet2, based on one NIST Mandarin-English test set. The creation procedure is as follows: for each source sentence, we scan it from left to right, and if a word has homophones, it is replaced by one of them with a certain probability (a different probability for each noisy set).

Figure 3: BLEU scores on datasets without and with homophone noise. On both noisy test sets, as more weight is put on the phonetic embedding, that is, as β grows, the translation quality improves.

In Figure 3, we compare the performance, measured by BLEU scores, of the baseline model and our models with different values of β on the clean NIST test set and the two noisy sets. The models are chosen based on their BLEU scores on the dev set. As Figure 3 shows, as β grows, i.e., as more weight is put on the phonetic information, the performance on both noisy test sets improves almost steadily. When β = 1, as expected, homophone noise does not affect the results, since the model is trained solely on phonetic information. However, this is not the best choice, since the performance on the clean test set gets much worse. In fact, the best choice of β is a value smaller than but close to 1, which mainly focuses on phonetic information but still utilizes some textual information.

Clean Input 古巴是第一个新中国建交的拉美国家
Output of Transformer cuba was the first latin american country to
establish diplomatic relations with new china
Noisy Input 古巴是第一个新中国建交的拉美国家
Output of Transformer cuba was the first latin american country to discovering the establishment of
diplomatic relations between china and new Zealand
Output of Our Method cuba is the first latin american country to
establish diplomatic relations with new china
Clean Input 他认为, 格方俄方的指责是荒谬的
Output of Transformer he believes that georgia’s accusation against russia is absurd
Noisy Input 他认为, 格方俄方的指责是荒谬的
Output of Transformer he believes that the accusations by the russian side villains are absurd
Output of Our Method he maintained that georgia’s accusation against russia is absurd
Table 4: Two examples of homophone noise in source sentences. The textual-only embedding is very sensitive to homophone noise and thus generates weird outputs. However, when both textual and phonetic information are jointly embedded, the model is very robust.

Table 4 demonstrates the effect of homophone noise on two sentences. The baseline model translates both sentences correctly; however, when a single word (a preposition) is replaced by one of its homophones, it generates incorrect, redundant, and strange translations. This shows the vulnerability of the baseline model. Note that since the replaced words are prepositions, the meaning of the noisy source sentences is still very clear, and human understanding is not affected at all. Our method, using the model with the best β, generates reasonable translations.

Models | Before Augmentation (NIST / NoisySet1 / NoisySet2) | After Augmentation (NIST / NoisySet1 / NoisySet2)
Table 5: Comparison of models trained with and without data augmentation. Clearly, data augmentation significantly improves the robustness of models to homophone noises.

To further improve the robustness of NMT models, we augment the training dataset using the method described in Section 4, adding noisy sentence pairs to the original training dataset.

In Table 5, we report the performance of the baseline model and of our model with the best β, with and without data augmentation. Not surprisingly, data augmentation significantly improves the robustness of NMT models to homophone noise. However, the noise in the training data seems to hurt the performance of the baseline model on the clean test set, while its effect on our model is much smaller, probably because our model mainly uses the phonetic information.

Figure 4: Embedding of some pinyins before (a) and after data augmentation (b).

In Figure 4, we illustrate the embeddings of some pinyins before and after data augmentation. An interesting observation is that if two pinyins are the pronunciations of some common words, such as BAO and PU, TUO and DUO, or YE and XIE, their embeddings become much closer after data augmentation; this explains why the robustness of NMT models to homophone noise improves.

6 Conclusion

In this paper, we propose to use both textual and phonetic information in neural machine translation by combining them in the embedding layer. Such a combination not only makes NMT models much more robust to homophone noise, but also improves their performance on datasets without homophone noise. Our experimental results clearly show that both textual and phonetic information are important in neural machine translation, although the best balance relies mostly on the phonetic information. Since homophone noise is easy to simulate, we also augment the training dataset by adding homophone noise, which our experiments show to be very useful in improving the robustness of NMT models to homophone noise.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
  • Cheng et al. (2018) Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. 2018. Towards robust neural machine translation. In Proceedings of ACL.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of EMNLP.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of ICML.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP.
  • Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311–318.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL, volume 1, pages 1715–1725.
  • Shotton et al. (2011) Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. 2011. Real-time human pose recognition in parts from single depth images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1297–1304.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. In Proceedings of ICLR.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.