1 Introduction & Related Work
Learning distributed representations of words (Mikolov et al., 2013; Pennington et al., 2014) has found great success in many NLP tasks. However, current methods for learning word embeddings require large amounts of data, which is a significant obstacle for low-resource African languages. Shikali et al. (2019) proposed learning Swahili word embeddings by taking morphemes into account, while Grave et al. (2018) developed word embeddings for 157 languages, including multiple low-resource languages.
Learning cross-lingual word embeddings (Ruder, 2017) and projecting monolingual word embeddings into a single cross-lingual embedding space (Artetxe et al., 2018) have also been proposed to help learn embeddings for low-resource languages.
In recent years, definitions and context words have been used to learn representations of nonce words (words created for a single occasion to solve an immediate problem of communication). Lazaridou et al. (2017) learn nonce-word representations by summing the representations of context words, obtaining vectors that correlate highly with human similarity judgements. Additionally, Hill et al. (2016) propose using word definitions to learn nonce-word representations.
Our contributions are twofold: (1) inspired by the work above, we explore techniques for combining pretrained high-resource vectors and subword representations to produce better word embeddings for low-resource morphologically rich languages (MRLs), and (2) we develop a new dataset, XhOxExamples, comprising 5K Xhosa–English examples collected from the isiXhosa Oxford Living Dictionary (2020).
2 Method
Our approach assumes the existence of a vocabulary V_l in our low-resource language, the corresponding translations of the words in V_l in a high-resource language, referred to as V_h, and a pretrained embedding matrix E_h for the high-resource language. An entry in V_h can either consist of an individual word, in the case that the word in the low-resource language can be accurately mapped to a single word in the high-resource language (e.g. indoda → man; all examples are from Xhosa to English), or of a sequence of words, in the case that the word in the low-resource language maps to more than one word in the high-resource language (e.g. bethuna → [listen, everyone]). Concretely, our objective is to use both V_h and E_h to produce vector representations for each word in V_l.
To leverage the high-resource language, we embed the atomic elements of V_h in E_h and map the resulting vectors to the corresponding words in V_l. In the case of a translation being a sequence of words, we take a similar approach to Lazaridou et al. (2017) and sum the normalized word vectors of each word in the sequence to produce a representation for the low-resource word. We refer to the resulting embedding matrix as E_pre. Additionally, we pretrain another embedding matrix, E_sub, on a corpus in our low-resource language using subword information, to capture similarity correlated with the morphological nature of words (Bojanowski et al., 2016).
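As an illustration, the construction of the pretrained-translation embeddings described above (normalizing the high-resource vector of each translation word and summing) might be sketched as follows; the `glove` and `translations` dictionaries and all vector values are toy placeholders, not the actual resources:

```python
import math

def normalize(vec):
    """L2-normalize a vector, leaving zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def build_e_pre(translations, high_res_emb):
    """For each low-resource word, sum the L2-normalized high-resource
    vectors of its translation word(s) to produce its embedding."""
    e_pre = {}
    for lo_word, hi_words in translations.items():
        vecs = [normalize(high_res_emb[w]) for w in hi_words if w in high_res_emb]
        if vecs:
            e_pre[lo_word] = [sum(dims) for dims in zip(*vecs)]
    return e_pre

# Toy high-resource (English) embeddings -- placeholder values.
glove = {"listen": [1.0, 0.0], "everyone": [0.0, 1.0], "man": [3.0, 4.0]}
# Xhosa -> English translations, covering the single- and multi-word cases.
translations = {"indoda": ["man"], "bethuna": ["listen", "everyone"]}

e_pre = build_e_pre(translations, glove)
print(e_pre["indoda"])   # [0.6, 0.8] -- normalized "man" vector
print(e_pre["bethuna"])  # [1.0, 1.0] -- sum of normalized "listen" and "everyone"
```

Summing normalized vectors (rather than raw ones) keeps any single high-magnitude translation word from dominating the resulting representation.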
We experiment with the following 4 methods to initialize the word embeddings for our downstream task (note that the vocabulary for the downstream task might differ from V_l):
XhPre - Initialization with E_pre. Words not present in V_l are initialized with E_sub.
XhSub - Initialization with E_sub only.
VecMap - We learn cross-lingual word embedding mappings by taking two sets of monolingual word embeddings, E_h and E_sub, and mapping them to a common space, following Artetxe et al. (2018).
XhMeta - We compute meta-embeddings for every word by taking the mean of E_pre and E_sub, following Coates and Bollegala (2018). Words not present in V_l are associated with an UNK token and its corresponding vector.
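A minimal sketch of the XhMeta averaging, including the UNK fallback for words lacking a pretrained-translation vector; all tables and values below are toy placeholders:

```python
def meta_embed(word, e_pre, e_sub, unk_vec):
    """XhMeta-style meta-embedding: the mean of the pretrained-translation
    vector and the subword vector, falling back to an UNK vector when the
    word has no pretrained-translation entry."""
    pre_vec = e_pre.get(word, unk_vec)
    sub_vec = e_sub[word]
    return [(p + s) / 2.0 for p, s in zip(pre_vec, sub_vec)]

# Toy embedding tables -- placeholder values, not real embeddings.
e_pre = {"indoda": [0.5, 0.75]}
e_sub = {"indoda": [0.25, 0.25], "bethuna": [0.5, 0.25]}
unk = [0.0, 0.0]

print(meta_embed("indoda", e_pre, e_sub, unk))   # [0.375, 0.5]
print(meta_embed("bethuna", e_pre, e_sub, unk))  # [0.25, 0.125] -- averaged with UNK
```

Plain averaging is the "frustratingly easy" meta-embedding of Coates and Bollegala (2018): no training is needed, yet each output vector stays close to both source spaces.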
Table 1: Dataset statistics (Xh = Xhosa, En = English).

| | Bible (Xh) | Bible (En) | XhOxExamples (Xh) | XhOxExamples (En) |
| Sentence length (mean ± std) | 18.44 ± 7.74 | 29.43 ± 12.61 | 5.16 ± 1.89 | 8.42 ± 3.04 |
| Total # of tokens | 573,571 | 915,531 | 25,978 | 42,400 |
| # of sentences | 31,102 | 31,102 | 5,033 | 5,033 |
3 Experiments & Results
To test the performance of the 4 different embedding configurations, we train and evaluate using machine translation (MT) as our downstream task on the English and Xhosa versions of the parallel Bible corpus (Christodouloupoulos and Steedman, 2015), and make use of two MT evaluation metrics: BEER (Stanojević and Sima’an, 2014) and sentence-level BLEU (Papineni et al., 2002). We also fine-tune and evaluate the above pretrained models on the XhOxExamples dataset to measure performance after fine-tuning, as shown on the right side of Table 2. Statistics for both datasets can be seen in Table 1.
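For illustration, sentence-level BLEU can be sketched from first principles as below; this is a simplified, add-one-smoothed variant for exposition, and actual evaluation should use the cited metric implementations:

```python
import math
from collections import Counter

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU: add-one-smoothed modified n-gram
    precisions combined by a geometric mean, times a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Modified precision: clip each n-gram count by its reference count.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps a single zero count from zeroing the score.
        log_prec_sum += math.log((overlap + 1) / (total + 1))
    geo_mean = math.exp(log_prec_sum / max_n)
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hypothesis) >= len(reference) else math.exp(1.0 - len(reference) / len(hypothesis))
    return bp * geo_mean

reference = "yiva wonke umntu".split()   # toy tokenized sentences
print(sentence_bleu(reference, reference))  # 1.0 for an exact match
```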
All models are trained using OpenNMT (Klein et al., 2017), with a 2-layer, 128-dimensional BiGRU encoder and a 2-layer, 128-dimensional GRU decoder (Cho et al., 2014), decoding with beam search with a beam size of 5. The embedding matrix E_h is initialized with the 840B-token version of GloVe (Pennington et al., 2014) (https://nlp.stanford.edu/projects/glove/), while E_sub is initialized with a fastText model (Bojanowski et al., 2016) trained on a combination of XhOxExamples and the Xhosa Bible corpus.
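A hedged sketch of how such a setup might be launched with OpenNMT-py; the flag names follow OpenNMT-py 1.x conventions and may differ across versions, and all file paths and model names here are placeholders:

```shell
# Preprocess the parallel Bible corpus (paths are placeholders).
onmt_preprocess -train_src data/bible.xh -train_tgt data/bible.en \
                -valid_src data/valid.xh -valid_tgt data/valid.en \
                -save_data data/xh-en

# 2-layer, 128-dimensional BiGRU encoder / GRU decoder, with the
# pretrained embeddings converted to OpenNMT's tensor format beforehand.
onmt_train -data data/xh-en -save_model models/xh-en \
           -layers 2 -rnn_size 128 -rnn_type GRU -encoder_type brnn \
           -pre_word_vecs_enc data/embeddings.enc.pt

# Decode with beam search, beam size 5.
onmt_translate -model models/xh-en_step_100000.pt \
               -src data/test.xh -output pred.en -beam_size 5
```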
As is evident from the results shown in Table 2, XhMeta outperforms all other models on both benchmarks. We believe this is a consequence of the simplicity of the meta-embedding approach: the resulting vectors are similar both to the subword representations captured by E_sub and to the pretrained vectors of E_pre, which capture a sense of each word's "meaning", demonstrating the necessity of both aspects in this context.
4 Conclusion
In this paper, we explored different ways of combining pretrained high-resource word vectors and subword representations to produce more meaningful word embeddings for low-resource MRLs. We show that both types of representations are important in the context of the low-resource MRL Xhosa, and hope that this research can assist others doing NLP for African languages.
Acknowledgments
We are grateful to Jorge Balazs for his fruitful corrections and comments. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References
- Artetxe et al. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798.
- Bojanowski et al. (2016). Enriching word vectors with subword information. CoRR abs/1607.04606.
- Cho et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Christodouloupoulos and Steedman (2015). A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation 49(2), pp. 375–395.
- Coates and Bollegala (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 194–198.
- Grave et al. (2018). Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
- Hill et al. (2016). Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics 4, pp. 17–30.
- isiXhosa Oxford Living Dictionary (2020).
- Klein et al. (2017). OpenNMT: open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, pp. 67–72.
- Lazaridou et al. (2017). Multimodal word meaning induction from minimal exposure to natural text. Cognitive Science 41(S4), pp. 677–705.
- Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- Ruder (2017). A survey of cross-lingual embedding models. CoRR abs/1706.04902.
- Shikali et al. (2019). Better word representation vectors using syllabic alphabet: a case study of Swahili. Applied Sciences 9(18), pp. 3648.
- Stanojević and Sima’an (2014). BEER: BEtter Evaluation as Ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 414–419.