Language Graph Distillation for Low-Resource Machine Translation

08/17/2019 · Tianyu He, et al. · Microsoft, USTC

Neural machine translation on low-resource languages is challenging due to the lack of bilingual sentence pairs. Previous works usually address the low-resource translation problem with knowledge transfer in a multilingual setting. In this paper, we propose the concept of a Language Graph and further design a novel graph distillation algorithm that boosts the accuracy of low-resource translations in the graph with forward and backward knowledge distillation. Preliminary experiments on the TED talks multilingual dataset demonstrate the effectiveness of our proposed method. Specifically, we improve the low-resource translation pairs by more than 3.13 points in terms of BLEU score.


1 Introduction

Neural machine translation (NMT) has witnessed rapid progress in recent years (Bahdanau et al., 2015; Luong et al., 2015; Sutskever et al., 2014; Vaswani et al., 2017; He et al., 2018; Guo et al., 2018), obtaining good accuracy or even achieving human parity (Hassan et al., 2018) on rich-resource translation pairs. However, there are more than 7000 languages in the world (https://www.ethnologue.com/browse) and most language pairs are low-resource or even zero-resource, which poses challenges for data-hungry NMT models. How to improve translation accuracy with limited bilingual sentence pairs remains an open question in NMT.

Some previous works have studied this problem, including: 1) transfer learning (Zoph et al., 2016; Gu et al., 2018b; Neubig and Hu, 2018), which transfers knowledge from rich-resource to low-resource languages; 2) pivot translation (Cohn and Lapata, 2007; Utiyama and Isahara, 2007; Chen et al., 2017), which uses a third language to bridge the source-to-target translation; and 3) semi-supervised/unsupervised learning (He et al., 2016; Artetxe et al., 2017; Lample et al., 2017, 2018), which leverages monolingual data for translation. While these works can improve the accuracy of low-resource translation to some extent, they either leverage only a few rich-resource language pairs for knowledge transfer and pivoting, or only the monolingual data of the language itself, without considering the relationships between languages and the monolingual data of each language from a global perspective.

In this paper, we propose the concept of a language graph, where each node represents a language and each edge represents a translation pair. We further propose a graph distillation algorithm based on the language graph, which boosts the accuracy of low-resource translation with forward and backward knowledge distillation. The graph distillation algorithm works as follows: (1) we choose the edges (translation pairs) in the graph that have high potential for improvement; (2) for each high-potential edge, we find high-quality paths that connect the source and target language of this edge; (3) we distill the knowledge from the high-quality paths through the forward and backward translation directions to improve the high-potential edges. We conduct preliminary experiments on the TED talks multilingual dataset, which contains translation sentence pairs between more than 50 languages. Our graph distillation algorithm improves the low-resource language pairs by more than 3.13 points in terms of BLEU score.

Our contributions are as follows. (1) We propose the concept of a language graph, which models neural machine translation in a multilingual setting. (2) We design a novel graph distillation algorithm to improve low-resource machine translation. (3) Preliminary experiments on the TED talks multilingual dataset demonstrate the effectiveness of our method.

2 Related Work

Related works on low-resource machine translation can be classified into three categories. The basic idea of the first category is to transfer knowledge from rich-resource to low-resource languages (Zoph et al., 2016; Gu et al., 2018a, b; Neubig and Hu, 2018; Tan et al., 2019). The second category mainly leverages a third language as a pivot to enable the translation (Leng et al., 2019; Cohn and Lapata, 2007; Wu and Wang, 2007; Utiyama and Isahara, 2007; Firat et al., 2016; Johnson et al., 2017b; Ha et al., 2016), assuming there are enough bilingual sentence pairs connected with the pivot language. The last category mainly leverages the monolingual sentences of the low-resource language and formulates translation as a semi-supervised or unsupervised problem. He et al. (2016) proposed dual learning to solve low-resource translation based on few bilingual sentence pairs but large amounts of monolingual data. Song et al. (2019); Artetxe et al. (2017); Lample et al. (2017, 2018) leveraged purely unsupervised learning for machine translation.

There are few works on language graphs, let alone works that use a language graph for machine translation. Ronen et al. (2014) formulated a language network through the connections in book translations, multiple language editions of Wikipedia, and Twitter. Samoilenko et al. (2016) studied the network of global interconnections between language communities, based on shared co-editing interests of Wikipedia editors. These works all concentrate on the analysis of language itself with the help of a language network built from other data, such as co-editing activities in Wikipedia, while we leverage a language graph derived directly from a multilingual translation dataset for machine translation. The work in machine translation most related to, though still far from, a "graph" is pivot translation, where a third language is leveraged to bridge the translation from the source to the target language.

3 Language Graph Distillation

In this section, we first describe the concept of the language graph for machine translation, and then formulate the graph distillation algorithm for low-resource machine translation.

3.1 Language Graph

Denote a graph as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. We formulate each language as a node and each translation pair as an edge in graph $G$, and will use node and language, edge and translation pair interchangeably. Denote the weight $w_e$ as the translation accuracy of the corresponding translation pair $e \in E$. Therefore, $G$ is a directed graph where the weights of the two directions between two languages can differ. Denote $D_v$ as the data of language $v \in V$, which consists of the bilingual data $B_v$ and the monolingual data $M_v$ of language $v$; the bilingual data of a language is the union of the bilingual data of all language pairs involving this language. Similarly, denote $B_e$ as the bilingual sentence pairs of language pair $e$.

The multilingual machine translation problem on graph $G$ can be formulated as follows: given a set of languages $V$, translation pairs $E$, bilingual data $B_e$ for each $e \in E$ and monolingual data $M_v$ for each $v \in V$, the goal of multilingual translation is to develop a machine translation algorithm that maximizes the accuracy $w_e$ of each translation pair $e$.
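To make the formulation concrete, the following is a minimal sketch of the language graph as a data structure, with nodes for languages and directed, accuracy-weighted edges for translation pairs. The class and method names (LanguageGraph, add_pair) and the example values are illustrative assumptions, not taken from the paper.

from collections import defaultdict

class LanguageGraph:
    def __init__(self):
        self.nodes = set()                   # languages (V)
        self.edges = {}                      # (src, tgt) -> accuracy w_e of the direct model
        self.bilingual = defaultdict(int)    # (src, tgt) -> number of bilingual sentence pairs B_e
        self.monolingual = defaultdict(int)  # language -> number of monolingual sentences M_v

    def add_pair(self, src, tgt, num_pairs=0, accuracy=0.0):
        # Edges are directed: the accuracy may differ between the two directions.
        self.nodes.update([src, tgt])
        self.edges[(src, tgt)] = accuracy
        self.bilingual[(src, tgt)] = num_pairs

g = LanguageGraph()
g.add_pair("Ar", "Fi", num_pairs=50_000, accuracy=7.58)  # num_pairs is a placeholder value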

3.2 Graph Distillation Algorithm

In this subsection, we first describe some concepts used in our graph distillation algorithm, and then formulate detailed steps of this algorithm.

Multi-Hop Translation Path

For a source language $X$ and a target language $Y$, there exist several forward and backward paths that connect $X$ to $Y$, for example, the one-hop forward path $X \rightarrow Y$, the two-hop forward path $X \rightarrow Z \rightarrow Y$, or the two-hop backward path $Y \rightarrow Z' \rightarrow X$, where $Z$ and $Z'$ are pivot languages.
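The sketch below enumerates such forward and backward multi-hop paths for a language pair, assuming the LanguageGraph sketch above; it keeps only paths in which every hop corresponds to an existing translation pair. The function name and the hop limit are illustrative.

from itertools import permutations

def multi_hop_paths(graph, src, tgt, max_hops=2):
    """Enumerate forward paths src -> ... -> tgt and backward paths tgt -> ... -> src."""
    pivots = graph.nodes - {src, tgt}
    forward, backward = [], []
    for hops in range(1, max_hops + 1):
        for mids in permutations(pivots, hops - 1):
            for path, bucket in (((src,) + mids + (tgt,), forward),
                                 ((tgt,) + mids + (src,), backward)):
                # Keep a path only if every hop is an existing translation pair.
                if all((path[i], path[i + 1]) in graph.edges for i in range(hops)):
                    bucket.append(path)
    return forward, backward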

Multi-Hop Accuracy Table

As there are many forward and backward paths between the source and target languages, we maintain translation accuracy tables for paths with different numbers of hops between any two languages. Denote the accuracy table as $T_k$, where $k$ represents the number of hops of each path. $T_k$ is a $(k+1)$-dimensional matrix, where the first and last dimensions represent the source and target language respectively, and each entry in the matrix represents the accuracy of the corresponding $k$-hop path between a source and a target language. For example, $T_1$ is a two-dimensional matrix when $k = 1$, where each row and each column represent a source and a target language respectively, and each entry in the matrix represents the accuracy of a one-hop path.
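As a rough sketch, the accuracy tables can be stored as dictionaries keyed by the full path rather than dense $(k+1)$-dimensional matrices. How each multi-hop path is scored is not specified here; we assume a helper eval_path that measures (or estimates) the accuracy of pipelined translation along the path, e.g. BLEU on a held-out set.

def build_accuracy_tables(graph, eval_path, max_hops=2):
    """tables[k] maps a (k+1)-tuple of languages (a k-hop path) to its accuracy."""
    tables = {k: {} for k in range(1, max_hops + 1)}
    for src in graph.nodes:
        for tgt in graph.nodes - {src}:
            forward, _ = multi_hop_paths(graph, src, tgt, max_hops)
            for path in forward:
                # eval_path is an assumed scorer for pipelined translation along the path.
                tables[len(path) - 1][path] = eval_path(path)
    return tables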

Forward/Backward Distillation

For a low-resource translation pair $X \rightarrow Y$, the direct translation path is usually of low translation quality. However, some of the forward paths related to the two languages have rich-resource sentence pairs, and the diversity of these paths can provide additional information, which helps improve the direct low-resource translation pair. On the other hand, the paths in the backward translation direction of $X \rightarrow Y$ are also helpful, as back-translation is useful for neural machine translation (Sennrich et al., 2015; He et al., 2016). But different from back-translation, which only leverages the reverse direction, we also leverage multi-hop backward paths, e.g., $Y \rightarrow Z \rightarrow X$ with a pivot language $Z$.

Specifically, we use sequence-level knowledge distillation (Kim and Rush, 2016) to transfer the knowledge from the forward and backward paths to the low-resource translation pairs. The forward and backward paths whose accuracy is comparable to or better than that of the low-resource translation pair are used to translate the bilingual and monolingual sentences into pseudo translation sentence pairs. These generated pseudo sentence pairs are added to the original bilingual sentences of the low-resource pair to boost its accuracy.
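A minimal sketch of this distillation step, assuming a helper translate(model, sentences, src, tgt) that runs the current multilingual model for one hop; both function names are illustrative. Forward distillation pipelines source-side sentences along a forward path; backward distillation pipelines target-side sentences along a backward path and then flips the pairs so they supervise the low-resource direction.

def distill_along_path(model, path, source_sentences):
    """Pipeline-translate sentences along `path`, pairing the inputs with the
    final outputs as pseudo bilingual data for (path[0], path[-1])."""
    outputs = source_sentences
    for src, tgt in zip(path[:-1], path[1:]):
        outputs = translate(model, outputs, src, tgt)  # assumed one-hop translation helper
    return list(zip(source_sentences, outputs))

def backward_distill(model, backward_path, target_sentences):
    # backward_path runs Y -> ... -> X; flip the pairs so they supervise X -> Y.
    pairs = distill_along_path(model, backward_path, target_sentences)
    return [(x, y) for (y, x) in pairs]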

Distillation Path Selection

Since there are many paths in the graph, we need to be selective about which low-resource pair to improve at the current step. For the chosen low-resource pair, we also need to choose related forward and backward paths with good quality for knowledge distillation. We use a greedy strategy to choose the low-resource pair to improve: we define the potential of a language pair as the gap between the accuracy of the direct translation and that of the multi-hop paths. The more the multi-hop paths outperform the direct translation, the more potential this language pair has. For each chosen translation pair, we then choose the forward and backward paths with the top-K best accuracy respectively for knowledge distillation. For the backward paths, we also leverage the one-hop path, which can be considered as standard back-translation.
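A sketch of the selection step under the assumptions above: potential is the gap between a pair's best multi-hop accuracy and its direct one-hop accuracy, and select_paths keeps the top-K most accurate paths (calling it with the pair reversed yields the backward paths). Names and defaults are illustrative.

def potential(tables, src, tgt, max_hops=2):
    """Gap between the best multi-hop accuracy and the direct (one-hop) accuracy."""
    direct = tables[1].get((src, tgt), 0.0)
    multi_hop = [acc for k in range(2, max_hops + 1)
                 for path, acc in tables[k].items()
                 if path[0] == src and path[-1] == tgt]
    return max(multi_hop, default=direct) - direct

def select_paths(tables, src, tgt, top_k=2, max_hops=2):
    """Pick the top-K most accurate paths from src to tgt (any hop count)."""
    candidates = [(acc, path)
                  for k in range(1, max_hops + 1)
                  for path, acc in tables[k].items()
                  if path[0] == src and path[-1] == tgt]
    return [path for _, path in sorted(candidates, reverse=True)[:top_k]]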

1: Input: Graph $G = (V, E)$, which includes a set of languages $V$ and translation pairs $E$, bilingual data $B_e$ for $e \in E$ and monolingual data $M_v$ for $v \in V$; threshold of accuracy improvement $\epsilon$; maximum hop size $K$.
2: Initialize: Set iteration step $t = 0$. Train the multilingual model on the available bilingual sentences $B_e$ for $e \in E$. Set the accuracy improvement $\Delta = \infty$.
3: while $\Delta > \epsilon$ do
4:     $t = t + 1$
5:     Construct accuracy tables $T_k$ for $k = 1, \dots, K$.
6:     Select high-potential edges $E_t \subseteq E$.
7:     for $e \in E_t$ do
8:         Generate pseudo sentence pairs for $e$ with forward and backward distillation.
9:     end for
10:     Train the multilingual model for $E_t$, and get the average accuracy improvement $\Delta$.
11: end while
Algorithm 1 Language Graph Distillation

Iteration on the Graph

We conduct the forward and backward distillation iteratively on the graph. In each iteration, we choose the low-resource language pairs with the highest current potential and the associated forward and backward multi-hop paths, and train the multilingual model for the chosen low-resource pairs with the pseudo sentence pairs generated through the multi-hop paths. After the model has converged, we update the multi-hop accuracy tables for the next iteration. We repeat the iterations until the accuracy table of the one-hop translation pairs converges.

The detailed steps for the graph distillation algorithm are shown in Algorithm 1.
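For illustration, the following Python-level sketch strings the pieces above together into the loop of Algorithm 1. The helpers train_multilingual and evaluate_path, the data layout (dictionaries of sentence lists), and the number of pairs selected per iteration are all assumptions, not the paper's implementation.

def language_graph_distillation(graph, bilingual, monolingual,
                                epsilon=0.1, max_hops=2, num_pairs=3, top_k=2):
    model = train_multilingual(None, bilingual)        # initial multilingual model, t = 0
    delta = float("inf")
    while delta > epsilon:
        tables = build_accuracy_tables(graph, lambda p: evaluate_path(model, p), max_hops)
        # Greedily pick the edges with the largest potential.
        chosen = sorted(graph.edges,
                        key=lambda e: potential(tables, *e, max_hops=max_hops),
                        reverse=True)[:num_pairs]
        pseudo = {}
        for src, tgt in chosen:
            fwd = select_paths(tables, src, tgt, top_k, max_hops)
            bwd = select_paths(tables, tgt, src, top_k, max_hops)
            pairs = [p for path in fwd
                     for p in distill_along_path(model, path, monolingual[src])]
            pairs += [p for path in bwd
                      for p in backward_distill(model, path, monolingual[tgt])]
            pseudo[(src, tgt)] = pairs
        before = {e: tables[1].get(e, 0.0) for e in chosen}
        model = train_multilingual(model, {**bilingual, **pseudo})
        after = build_accuracy_tables(graph, lambda p: evaluate_path(model, p), max_hops=1)[1]
        delta = sum(after.get(e, 0.0) - before[e] for e in chosen) / len(chosen)
    return model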

4 Experiments and Results

In this section, we describe the experiment settings and show the preliminary results of our proposed graph distillation algorithm. Note that this work is still in progress.

4.1 Experiment Setup

Dataset

We use the common corpus of TED talks, which contains bilingual sentence pairs between more than 50 languages (Ye et al., 2018; https://github.com/neulab/word-embeddings-for-nmt), and also use the monolingual data from TED talks (https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus/tree/master/Monolingual_data). Since we only aim to verify the effectiveness of our framework in this paper, we randomly select 9 languages to construct the language graph in our experiments for simplicity, as illustrated in Table 1.

Fi He Nb Sk Sl
Ar
En
Fr
Ru
Table 1: Languages used in our experiments. A check mark indicates that there is bilingual data between the language in the row and the language in the column. There is bilingual data between any two of Ar, En, Fr and Ru.

Model Configurations

We use the Transformer (Vaswani et al., 2017) as the basic NMT model structure. The model hidden size, feed-forward size, and number of layers are , , and , respectively. For the basic multilingual model training, we add a special tag to the encoder input to indicate which target language to translate into, following the practice in Johnson et al. (2017a).
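A minimal sketch of such a target-language tag, assuming a "<2xx>" token format in the spirit of Johnson et al. (2017a); the exact tag format used in the paper is not specified.

def add_target_tag(source_tokens, target_lang):
    # Prepend a target-language token so one model can translate into any target language.
    return [f"<2{target_lang}>"] + source_tokens

add_target_tag(["hello", "world"], "fi")   # -> ['<2fi>', 'hello', 'world']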

Training and Inference

For the basic multilingual model training, we upsample the data of each language so that all languages have the same amount of data. The mini-batch size is set to roughly 4096 tokens. We train the models on 4 NVIDIA V100 GPUs. We follow the default parameters of the Adam optimizer (Kingma and Ba, 2014) and the learning rate schedule in Vaswani et al. (2017). During inference, we decode with beam search, setting the beam size to and the length penalty to for all languages. We evaluate the translation quality by tokenized case-sensitive BLEU (Papineni et al., 2002) with multi-bleu.pl (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl).
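A sketch of the per-language upsampling described above, assuming each language's training data is a list of examples; simple repetition with truncation is our assumption, since the exact upsampling scheme is not described.

import math

def upsample(corpora):
    """Repeat each language's corpus until all languages have the same size."""
    target = max(len(corpus) for corpus in corpora.values())
    return {lang: (corpus * math.ceil(target / len(corpus)))[:target]
            for lang, corpus in corpora.items()}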

4.2 Results

In this section, we show the preliminary experimental results of the proposed language graph distillation. We select the three language pairs with the highest potential in each iteration, perform the first two iteration steps, and show the results in Table 2. The third column is the BLEU score obtained by the basic multilingual model (Initial). The fourth, fifth, and sixth columns are the results trained with only one-hop back-translation (+BT), only forward distillation (+Forward), and our language graph distillation (+Graph: both forward and backward distillation), respectively. The +BT baseline is obtained by training the selected language pairs with one-hop back-translation data, while the +Forward baseline is obtained by training the selected language pairs with all the forward distillation data.

t   Pair     Initial   +BT   +Forward   +Graph
    Ar→Fi    7.58
0   He→Fi    9.04
    Nb→Sl    9.67
    Av.                +     +          +1.57
    Ar→Nb    13.92
1   He→Nb    18.78
    Sk→Nb    16.00
    Av.                +     +          +3.13
Table 2: The language pairs improved in the first two iterations of our method. The results demonstrate that our method achieves better accuracy than the +BT and +Forward baselines on the low-resource translation edges in the graph. t indicates the iteration step; Av. indicates averaged BLEU scores. Note that we only present preliminary results here to verify the effectiveness of our language graph distillation method.

It can be seen that at each iteration step, our method significantly outperforms all baselines in most cases. Note that for He→Nb translation, our method is slightly worse than the model trained with one-hop back-translation (+BT). The baseline model for He→Nb translation has already achieved good performance, and is thus hard to further improve with multi-hop forward/backward distillation. However, since back-translation is a special case of backward distillation in our method, we can selectively choose among the configurations above (+BT, +Forward) to achieve a higher BLEU score for each language pair.

5 Conclusion

In this paper, we introduced the concept of the language graph and further proposed a graph distillation algorithm to boost the accuracy of low-resource machine translation. The preliminary results on the multilingual, low-resource translation dataset demonstrate the effectiveness of our method and show potential for further improvement. For future work, we will optimize the iteration scheme in our algorithm and take full advantage of more edges (translation pairs) in our language graph distillation.

References

  • M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2017) Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041. Cited by: §1, §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. ICLR 2015. Cited by: §1.
  • Y. Chen, Y. Liu, Y. Cheng, and V. O. Li (2017) A teacher-student framework for zero-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1925–1935. Cited by: §1.
  • T. Cohn and M. Lapata (2007) Machine translation by triangulation: making effective use of multi-parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 728–735. Cited by: §1, §2.
  • O. Firat, B. Sankaran, Y. Al-Onaizan, F. T. Yarman-Vural, and K. Cho (2016) Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 268–277. Cited by: §2.
  • J. Gu, H. Hassan, J. Devlin, and V. O. K. Li (2018a) Universal neural machine translation for extremely low resource languages. In NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 344–354. Cited by: §2.
  • J. Gu, Y. Wang, Y. Chen, V. O. K. Li, and K. Cho (2018b) Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 3622–3631. Cited by: §1, §2.
  • J. Guo, X. Tan, D. He, T. Qin, L. Xu, and T. Liu (2018) Non-autoregressive neural machine translation with enhanced decoder input. arXiv preprint arXiv:1812.09664. Cited by: §1.
  • T. Ha, J. Niehues, and A. H. Waibel (2016) Toward multilingual neural machine translation with universal encoder and decoder. CoRR abs/1611.04798. Cited by: §2.
  • H. Hassan, A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang, M. Junczys-Dowmunt, W. Lewis, M. Li, S. Liu, T. Liu, R. Luo, A. Menezes, T. Qin, F. Seide, X. Tan, F. Tian, L. Wu, S. Wu, Y. Xia, D. Zhang, Z. Zhang, and M. Zhou (2018) Achieving human parity on automatic chinese to english news translation. CoRR abs/1803.05567. Cited by: §1.
  • D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma (2016) Dual learning for machine translation. In Advances in Neural Information Processing Systems, pp. 820–828. Cited by: §1, §2, §3.2.
  • T. He, X. Tan, Y. Xia, D. He, T. Qin, Z. Chen, and T. Liu (2018) Layer-wise coordination between encoder and decoder for neural machine translation. In Advances in Neural Information Processing Systems, pp. 7944–7954. Cited by: §1.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. (2017a) Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. Cited by: §4.1.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. B. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017b) Google’s multilingual neural machine translation system: enabling zero-shot translation. TACL 5, pp. 339–351. Cited by: §2.
  • Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1317–1327. Cited by: §3.2.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2017) Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043. Cited by: §1, §2.
  • G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato (2018) Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 5039–5049. Cited by: §1, §2.
  • Y. Leng, X. Tan, T. Qin, X. Li, and T. Liu (2019) Unsupervised pivot translation for distant languages. arXiv preprint arXiv:1906.02461. Cited by: §2.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 1412–1421. Cited by: §1.
  • G. Neubig and J. Hu (2018) Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 875–880. Cited by: §1, §2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., pp. 311–318. Cited by: §4.1.
  • S. Ronen, B. Gonçalves, K. Z. Hu, A. Vespignani, S. Pinker, and C. A. Hidalgo (2014) Links that speak: the global language network and its association with global fame. Proceedings of the National Academy of Sciences 111 (52), pp. E5616–E5622. Cited by: §2.
  • A. Samoilenko, F. Karimi, D. Edler, J. Kunegis, and M. Strohmaier (2016) Linguistic neighbourhoods: explaining cultural borders on wikipedia through multilingual co-editing activity. EPJ Data Science 5 (1), pp. 9. Cited by: §2.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709. Cited by: §3.2.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) Mass: masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450. Cited by: §2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104–3112. Cited by: §1.
  • X. Tan, Y. Ren, D. He, T. Qin, and T. Liu (2019) Multilingual neural machine translation with knowledge distillation. In International Conference on Learning Representations, Cited by: §2.
  • M. Utiyama and H. Isahara (2007) A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pp. 484–491. Cited by: §1, §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000–6010. Cited by: §1, §4.1, §4.1.
  • H. Wu and H. Wang (2007) Pivot language approach for phrase-based statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic, Cited by: §2.
  • Q. Ye, S. Devendra, F. Matthieu, P. Sarguna, and N. Graham (2018) When and why are pre-trained word embeddings useful for neural machine translation. In HLT-NAACL, Cited by: §4.1.
  • B. Zoph, D. Yuret, J. May, and K. Knight (2016) Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1568–1575. Cited by: §1, §2.