Combining Pretrained High-Resource Embeddings and Subword Representations for Low-Resource Languages

03/09/2020 ∙ by Machel Reid, et al. ∙ The University of Tokyo

The contrast between the need for large amounts of data for current Natural Language Processing (NLP) techniques, and the lack thereof, is accentuated in the case of African languages, most of which are considered low-resource. To help circumvent this issue, we explore techniques exploiting the qualities of morphologically rich languages (MRLs), while leveraging pretrained word vectors in well-resourced languages. In our exploration, we show that a meta-embedding approach combining both pretrained and morphologically-informed word embeddings performs best in the downstream task of Xhosa-English translation.




1 Introduction & Related Work

Learning distributed representations of words (Mikolov et al., 2013; Pennington et al., 2014) has found great success in many NLP tasks. However, current methods for learning word embeddings require large amounts of data, which is a significant drawback for low-resource African languages. Shikali et al. (2019) proposed learning Swahili word embeddings by taking morphemes into account, while Grave et al. (2018) developed word embeddings for 157 languages, including multiple low-resource languages.

Learning of cross-lingual word embeddings (Ruder, 2017) and projecting monolingual word embeddings to a single cross-lingual embedding space (Artetxe et al., 2018) have also been proposed to help learn embeddings for low-resource languages.

In recent years, definitions and context words have been used to learn representations of nonce words (words created for a single occasion to solve an immediate problem of communication). Lazaridou et al. (2017) propose learning nonce words by summing the representations of context words, obtaining representations whose similarities are highly correlated with human judgements. Additionally, Hill et al. (2016) propose using word definitions to learn nonce word representations.

Our contributions are twofold: (1) inspired by the work above, we explore techniques for combining pretrained high-resource vectors and subword representations to produce better word embeddings for low-resource MRLs, and (2) we develop a new dataset, XhOxExamples, made up of 5K Xhosa-English examples that we collected from the isiXhosa Oxford Living Dictionary (2020).

2 Approach

Our approach assumes the existence of a vocabulary in our low-resource language, the corresponding translations of the words in that vocabulary in a high-resource language, and a pretrained embedding matrix for the high-resource language. A translation can consist either of an individual word, when the word in the low-resource language can be accurately mapped to a single word in the high-resource language (e.g. indoda → man; all examples are from Xhosa to English), or of a sequence of words, when the word in the low-resource language maps to more than one word in the high-resource language (e.g. bethuna → [listen, everyone]). Concretely, our objective is to use both the translations and the pretrained embedding matrix to produce vector representations for each word in the low-resource vocabulary.

To leverage the high-resource language, we embed the atomic elements of each translation using the pretrained high-resource embedding matrix and map the resulting vectors to the corresponding word in the low-resource vocabulary. When a translation is a sequence, we take an approach similar to Lazaridou et al. (2017) and sum the normalized word vectors of the words in the translation to produce a representation for the low-resource word. We refer to the resulting embedding matrix as the pretrained matrix. Additionally, we pretrain another embedding matrix, the subword matrix, on a corpus in our low-resource language using subword information to capture similarity correlated with the morphological nature of words (Bojanowski et al., 2016).
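The translation-based step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the names `l2_normalize`, `embed_translation`, `toy_vectors` and `translations` are hypothetical, the two-dimensional vectors are made-up toy values, and skipping translation words that have no pretrained vector is an assumption the text does not specify.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def embed_translation(translation, high_resource_vectors):
    """Sum the normalized high-resource vectors of the words in a translation.

    Assumption: translation words without a pretrained vector are skipped,
    and None is returned when no word of the translation is covered.
    """
    known = [high_resource_vectors[w] for w in translation
             if w in high_resource_vectors]
    if not known:
        return None
    total = [0.0] * len(known[0])
    for vec in known:
        for i, x in enumerate(l2_normalize(vec)):
            total[i] += x
    return total


# Toy 2-dimensional "pretrained" English vectors (made-up values).
toy_vectors = {"man": [3.0, 4.0], "listen": [1.0, 0.0], "everyone": [0.0, 2.0]}

# Xhosa words mapped to a single word or to a sequence of English words.
translations = {"indoda": ["man"], "bethuna": ["listen", "everyone"]}

pretrained_matrix = {
    word: embed_translation(translation, toy_vectors)
    for word, translation in translations.items()
}
```

A single-word translation simply receives the normalized vector of that word, while a multi-word translation receives the sum of the normalized vectors of its words.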

We experiment with the following four methods to initialize the word embeddings for our downstream task (note that the vocabulary for our downstream task might differ from the low-resource vocabulary described above):

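  • XhPre - Initialization with the pretrained (translation-derived) embedding matrix. Words not present in the low-resource vocabulary are initialized with the subword embedding matrix.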

  • XhSub - Initialization with the subword embedding matrix only.

  • VecMap - We learn cross-lingual word embedding mappings by taking two sets of monolingual word embeddings, the pretrained high-resource embeddings and the low-resource subword embeddings, and mapping them to a common space following Artetxe et al. (2018).

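  • XhMeta - We compute meta-embeddings for every word by taking the mean of its vectors in the pretrained (translation-derived) embedding matrix and the subword embedding matrix, following Coates and Bollegala (2018). Words not present in the low-resource vocabulary are associated with an UNK token and its corresponding vector.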
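The averaging used by XhMeta can be illustrated with a short Python sketch. It is a hedged illustration only: the function name `meta_embedding`, the toy dictionaries and the UNK vector are hypothetical, both embedding sources are assumed to share the same dimensionality, and treating a word missing from either source as UNK is an assumption.

```python
def meta_embedding(word, pretrained, subword, unk_vector):
    """Return the element-wise mean of a word's two source vectors.

    Assumptions: both sources have the same dimensionality, and a word
    missing from either source falls back to the UNK vector.
    """
    if word not in pretrained or word not in subword:
        return list(unk_vector)
    return [(a + b) / 2.0 for a, b in zip(pretrained[word], subword[word])]


# Toy 2-dimensional vectors (made-up values).
pretrained = {"indoda": [1.0, 3.0]}
subword = {"indoda": [3.0, 5.0], "bethuna": [2.0, 2.0]}
unk_vector = [0.0, 0.0]

indoda_meta = meta_embedding("indoda", pretrained, subword, unk_vector)
bethuna_meta = meta_embedding("bethuna", pretrained, subword, unk_vector)
```

Here `indoda` is covered by both sources and receives their mean, while `bethuna` is missing from the toy pretrained dictionary and falls back to the UNK vector.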

                               Bible                           XhOxExamples
Aspect                         Xhosa          English          Xhosa         English
Sentence Length (Mean ± Std)   18.44 ± 7.74   29.43 ± 12.61    5.16 ± 1.89   8.42 ± 3.04
Total # of Tokens              573571         915531           25978         42400
# of Sentences                 31102                           5033
Train/Dev/Test Ratio           70/20/10                        70/20/10
Table 1: Statistics regarding the Bible corpus and XhOxExamples

                               Bible                           XhOxExamples
Random Initialization          21.79          21.84            16.08         25.30
VecMap                         22.46          22.03            16.38         25.42
XhSub                          24.65          22.79            17.37         26.04
XhPre                          27.67          22.40            17.06         25.70
XhMeta                         29.09          23.33            17.77         26.44
Table 2: Results on the test set of the Xhosa-English Bible Corpus and XhOxExamples

3 Experiments & Results

To test the performance of the four embedding configurations, we train and evaluate on machine translation (MT) as our downstream task, using the English and Xhosa versions of the parallel Bible corpus (Christodoulopoulos and Steedman, 2015). We make use of two MT evaluation metrics, BEER (Stanojević and Sima’an, 2014) and sentence-level BLEU (Papineni et al., 2002). We also fine-tune the above pretrained models on the XhOxExamples dataset and evaluate their performance after fine-tuning, as shown on the right side of Table 2. Statistics for both datasets can be seen in Table 1.

All models are trained using OpenNMT (Klein et al., 2017), with a 2-layer 128-dimensional BiGRU for encoding and a 2-layer 128-dimensional GRU (Cho et al., 2014) for decoding, using beam search with a beam size of 5. The high-resource embedding matrix is initialized with the 840B-token version of GloVe (Pennington et al., 2014), while the subword embedding matrix is initialized with a fastText model (Bojanowski et al., 2016) trained on a combination of XhOxExamples and the Xhosa Bible Corpus.
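For readers unfamiliar with the subword model, the character n-gram decomposition described by Bojanowski et al. (2016) can be sketched as follows. This is a simplified illustration rather than the fastText implementation: the function name `char_ngrams` is hypothetical, and the default n-gram lengths of 3 to 6 follow the range reported in that paper, not a setting stated in this one. In that model, a word vector is built from the vectors of such n-grams, which is what lets morphologically related words share parameters.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers.

    The word is wrapped as "<word>" so that prefixes and suffixes are
    distinguishable from word-internal character sequences.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Trigrams of the example word used by Bojanowski et al. (2016).
where_trigrams = char_ngrams("where", min_n=3, max_n=3)

# All 3- to 6-grams of a Xhosa word from the examples above.
indoda_ngrams = char_ngrams("indoda")
```

With `min_n=3` and `max_n=3`, the word "where" yields `<wh`, `whe`, `her`, `ere` and `re>`.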

As is evident from the results shown in Table 2, XhMeta outperforms all other models on both benchmarks. We believe this is a consequence of the simple meta-embedding approach, which yields vectors that are similar both to the subword representations captured by the subword embedding matrix and to the pretrained vectors, which capture a sense of the “meaning” of the word. This indicates that both aspects are needed in this context.

4 Conclusion

In this paper, we explore different ways of combining pretrained high-resource word vectors and subword representations to produce more meaningful word embeddings for low-resource MRLs. We show that both types of representations are important in the context of the low-resourced MRL, Xhosa, and hope that this research is able to assist others when doing NLP for African languages.


We are grateful to Jorge Balazs for his fruitful corrections and comments. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.


  • M. Artetxe, G. Labaka, and E. Agirre (2018) A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2016) Enriching word vectors with subword information. CoRR abs/1607.04606.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • C. Christodoulopoulos and M. Steedman (2015) A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation 49 (2), pp. 375–395.
  • J. Coates and D. Bollegala (2018) Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 194–198.
  • E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov (2018) Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • F. Hill, K. Cho, A. Korhonen, and Y. Bengio (2016) Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics 4, pp. 17–30.
  • isiXhosa Oxford Living Dictionary (2020) isiXhosa Oxford Living Dictionary.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, pp. 67–72.
  • A. Lazaridou, M. Marelli, and M. Baroni (2017) Multimodal word meaning induction from minimal exposure to natural text. Cognitive Science 41 (S4), pp. 677–705.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
  • S. Ruder (2017) A survey of cross-lingual embedding models. CoRR abs/1706.04902.
  • C. S. Shikali, Z. Sijie, L. Qihe, and R. Mokhosi (2019) Better word representation vectors using syllabic alphabet: a case study of Swahili. Applied Sciences 9 (18), pp. 3648.
  • M. Stanojević and K. Sima’an (2014) BEER: BEtter evaluation as ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 414–419.