Unsupervised Pidgin Text Generation By Pivoting English Data and Self-Training

03/18/2020 · by Ernie Chang, et al. · Universität Saarland

West African Pidgin English is widely spoken across West Africa, with at least 75 million speakers. Nevertheless, machine translation systems and relevant NLP datasets for Pidgin English are virtually absent. In this work, we develop techniques that bridge the gap between Pidgin English and English in the area of data-to-text generation. Building upon the previously released monolingual Pidgin English text and a parallel English data-to-text corpus, we aim to build a system that can automatically generate Pidgin English descriptions from structured data. We first train a data-to-English text generation system, and then employ techniques from unsupervised neural machine translation and self-training to establish the Pidgin-to-English cross-lingual alignment. Human evaluation of the generated Pidgin texts shows that, though the outputs are still far from practically usable, the pivoting + self-training technique improves both fluency and relevance.


1 Introduction

Pidgin English is one of the most widely spoken languages in West Africa, with roughly 75 million speakers estimated in Nigeria and over 5 million estimated in Ghana (Ogueji and Ahia, 2019). Though variants of Pidgin English abound, the language is fairly uniform across the continent; in this work, we focus on the most commonly spoken variant, Nigerian Pidgin English. While there have been recent efforts to popularize monolingual Pidgin English, as seen in BBC Pidgin (https://www.bbc.com/pidgin), the language remains under-resourced in terms of available parallel corpora for machine translation. This low-resource scenario extends to other natural language generation (NLG) tasks such as summarization and data-to-text generation (Lebret et al., 2016; Su et al., 2018; Shen et al., 2019a, b; Zhao et al., 2019; Hong et al., 2019; de Souza et al., 2018), where Pidgin English generation is largely under-explored. The scarcity is further aggravated when a pipeline generation system includes sub-modules that compute semantic textual similarity (Shen et al., 2017; Zhuang and Chang, 2017), which exist solely for English.

Previous work on unsupervised neural machine translation for Pidgin English constructed a monolingual corpus (Ogueji and Ahia, 2019) and achieved a BLEU score of 5.18 from English to Pidgin. However, there is a domain mismatch between downstream NLG tasks and the trained machine translation system: the resulting English-to-Pidgin MT system, trained on news and Bible text, cannot be directly used to translate out-of-domain English text into Pidgin. An example of English/Pidgin text in the restaurant domain (Novikova et al., 2017) is displayed in Table 1.

English: There is a pub Blue Spice located in the centre of the city that provides Chinese food.
Pidgin: Wan pub blue spice dey for centre of city wey dey give Chinese food.
Table 1: Sample parallel English-Pidgin text from the restaurant domain.

Nevertheless, we argue that this domain-mismatch problem can be alleviated by using English text in the target domain as a pivot (Guo et al., 2019). To this end, we explore this idea on the task of neural data-to-text generation, which has been the subject of much recent research. Neural data-to-Pidgin generation is especially valuable in the African context, given that many existing data-to-text systems are English-based, e.g., weather reporting systems (Sripada et al., 2002; Belz, 2008). This work aims to bridge the gap between these English-based systems and Pidgin by training an in-domain English-to-Pidgin MT system in an unsupervised way. English-based NLG systems can then be locally adapted by translating their output English text into Pidgin. We employ the publicly available parallel data-to-text corpus E2E (Novikova et al., 2017), consisting of tabulated data and English descriptions in the restaurant domain. The in-domain MT system is trained in two steps: (1) we use the target-side English texts as the pivot and train an unsupervised NMT system directly between the in-domain English text and the available monolingual Pidgin corpus (we refer to this system as Pivot); (2) we then employ self-training (He et al., 2019) to create augmented parallel pairs and continue updating the system (Pivot + Self-Training).
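To make the intended use concrete, the following minimal Python sketch illustrates the pivoting idea under our own naming assumptions: generate_english_description, translate_english_to_pidgin, and the example meaning representation are hypothetical placeholders for the E2E generator and the unsupervised MT model, not released code.

import typing

def generate_english_description(meaning_representation: dict) -> str:
    """Placeholder for a data-to-English generator trained on the E2E corpus."""
    raise NotImplementedError

def translate_english_to_pidgin(english_text: str) -> str:
    """Placeholder for the unsupervised in-domain English-to-Pidgin MT system."""
    raise NotImplementedError

def generate_pidgin_description(meaning_representation: dict) -> str:
    # Step 1: reuse the existing English-based NLG system unchanged.
    english = generate_english_description(meaning_representation)
    # Step 2: pivot the English output into Pidgin with the unsupervised MT model.
    return translate_english_to_pidgin(english)

# Example E2E-style input (restaurant domain), matching Table 1:
# {"name": "Blue Spice", "eatType": "pub", "food": "Chinese", "area": "city centre"}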

2 Approach

The first phase of our approach trains an unsupervised NMT system following Ogueji and Ahia (2019) (PidginUNMT). As in that work, we train cross-lingual word embeddings with FastText (Bojanowski et al., 2017) on the combined Pidgin-English corpus. Next, we train an unsupervised NMT system between the in-domain English text and the monolingual Pidgin corpus, following Lample et al. (2017), Artetxe et al. (2017), and Ogueji and Ahia (2019), to obtain Pivot. We then use Pivot to construct a pseudo-parallel corpus by predicting Pidgin text for each English input, and add this dataset to the existing monolingual corpora. The self-training step continues training Pivot on the pseudo-parallel corpus together with the non-parallel monolingual corpora, yielding Pivot + Self-Training.
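As a rough illustration of the self-training step, the sketch below pairs in-domain English sentences with the pivot model's Pidgin predictions and continues training on the augmented data; the translate and continue_training methods are assumed interfaces, not an actual implementation from this or prior work.

def self_train(pivot_model, english_sentences, pidgin_monolingual):
    """One round of self-training on pseudo-parallel pairs (hypothetical API)."""
    # 1. Construct pseudo-parallel pairs by translating the in-domain English
    #    text into Pidgin with the current pivot model.
    pseudo_pairs = [
        (en, pivot_model.translate(en, direction="en->pidgin"))
        for en in english_sentences
    ]

    # 2. Continue training on the pseudo-parallel pairs together with the
    #    original non-parallel monolingual corpora (placeholder call).
    pivot_model.continue_training(parallel=pseudo_pairs,
                                  monolingual=pidgin_monolingual)
    return pivot_model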

3 Experiments and Results

We conduct experiments on the E2E corpus (Novikova et al., 2017), whose training set contains roughly 42k samples. The monolingual Pidgin corpus contains 56,695 sentences and 32,925 unique words. Human evaluation was performed on the E2E test set (630 data instances) by averaging the scores of two native Pidgin speakers for both Relevance (0 or 1, indicating whether the output is relevant to the input) and Fluency (0, 1, or 2, indicating readability); a small sketch of this aggregation is given after Table 3. Table 2 shows that Pivot + Self-Training outperforms both direct translation (PidginUNMT) and the pivot-only system in relevance, and performs on par with PidginUNMT in fluency. Table 3 displays relevant sample outputs at all levels of fluency.

System Relevance Fluency
PidginUNMT 0.038 0.827
Pivot 0.319 0.788
Pivot + Self-Training 0.434 0.814
Table 2: PidginUNMT is trained on non-parallel, out-of-domain English and Pidgin text. Pivot refers to the unsupervised NMT system trained on in-domain English text and out-of-domain Pidgin text. Pivot + Self-Training further augments Pivot with pseudo-parallel pairs obtained via self-training.
Pidgin text Fluency
Every money of money on food and at least 1 of 1 points. 0
and na one na di best food for di world. 1
People dey feel the good food but all of us no dey available. 2
Table 3: Sampled relevant (Relevance score of 1) Pidgin outputs from Pivot + Self-Training with various Fluency scores.
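For clarity, the following small sketch shows how per-instance ratings on these scales could be averaged into the corpus-level Relevance and Fluency numbers of Table 2; the dictionary layout of the ratings is our own assumption.

from statistics import mean

def aggregate_scores(ratings):
    """ratings: one dict per test instance, e.g.
    {"relevance": [1, 0], "fluency": [2, 1]}  # one value per annotator
    Returns corpus-level Relevance (0-1 scale) and Fluency (0-2 scale),
    averaged first over annotators and then over test instances."""
    relevance = mean(mean(r["relevance"]) for r in ratings)
    fluency = mean(mean(r["fluency"]) for r in ratings)
    return relevance, fluency

# Example: two instances, each rated by two native speakers.
print(aggregate_scores([
    {"relevance": [1, 1], "fluency": [2, 1]},
    {"relevance": [0, 1], "fluency": [1, 1]},
]))  # -> (0.75, 1.25)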

4 Conclusion

In this paper, we have shown that it is possible to improve Pidgin text generation in a demonstrated low-resource scenario. Using non-parallel in-domain English and out-of-domain Pidgin text together with a self-training algorithm, we show that both fluency and relevance can be improved. This work serves as a starting point for future work on Pidgin NLG in the absence of annotated data. In future work, we will also explore phrase-based statistical machine translation to further improve upon the current system.

5 Acknowledgements

This research was funded in part by the SFB 248 “Foundations of Perspicuous Software Systems”. We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.

References

  • M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2017) Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.
  • A. Belz (2008) Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering 14 (4), pp. 431–455.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
  • J. G. de Souza, M. Kozielski, P. Mathur, E. Chang, M. Guerini, M. Negri, M. Turchi, and E. Matusov (2018) Generating e-commerce product titles and predicting their quality. In Proceedings of the 11th International Conference on Natural Language Generation, pp. 233–243.
  • Y. Guo, Y. Liao, X. Jiang, Q. Zhang, Y. Zhang, and Q. Liu (2019) Zero-shot paraphrase generation with multilingual language models. arXiv preprint arXiv:1911.03597.
  • J. He, J. Gu, J. Shen, and M. Ranzato (2019) Revisiting self-training for neural sequence generation. arXiv preprint arXiv:1909.13788.
  • X. Hong, E. Chang, and V. Demberg (2019) Improving language generation from feature-rich tree-structured data with relational graph convolutional encoders. In Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019), pp. 75–80.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2017) Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
  • R. Lebret, D. Grangier, and M. Auli (2016) Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1203–1213.
  • J. Novikova, O. Dušek, and V. Rieser (2017) The E2E dataset: new challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 201–206.
  • K. Ogueji and O. Ahia (2019) PidginUNMT: unsupervised neural machine translation from West African Pidgin to English. arXiv preprint arXiv:1912.03444.
  • X. Shen, Y. Oualil, C. Greenberg, M. Singh, and D. Klakow (2017) Estimation of gap between current language models and human performance.
  • X. Shen, J. Suzuki, K. Inui, H. Su, D. Klakow, and S. Sekine (2019a) Select and attend: towards controllable content selection in text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 579–590.
  • X. Shen, Y. Zhao, H. Su, and D. Klakow (2019b) Improving latent alignment in text summarization by generalizing the pointer generator. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3753–3764.
  • S. Sripada, E. Reiter, J. Hunter, and J. Yu (2002) SumTime-Meteo: parallel corpus of naturally occurring forecast texts and weather data. Computing Science Department, University of Aberdeen, Aberdeen, Scotland, Tech. Rep. AUCS/TR0201.
  • H. Su, X. Shen, P. Hu, W. Li, and Y. Chen (2018) Dialogue generation with GAN. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Y. Zhao, X. Shen, W. Bi, and A. Aizawa (2019) Unsupervised rewriter for multi-sentence compression. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2235–2240.
  • W. Zhuang and E. Chang (2017) NeoBility at SemEval-2017 Task 1: an attention-based sentence similarity model. arXiv preprint arXiv:1703.05465.