KinyaBERT: a Morphology-aware Kinyarwanda Language Model

by Antoine Nzeyimana, et al.

Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding, BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability on low-resource languages. We evaluate our proposed method on the low-resource morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score on a machine-translated GLUE benchmark. KinyaBERT fine-tuning has better convergence and achieves more robust results on multiple tasks even in the presence of translation noise.



1 Introduction

Recent advances in natural language processing (NLP) through deep learning have been largely enabled by vector representations (or embeddings) learned through language model pre-training Bengio et al. (2003); Mikolov et al. (2013); Pennington et al. (2014); Bojanowski et al. (2017); Peters et al. (2018); Devlin et al. (2019). Language models such as BERT (Devlin et al., 2019) are pre-trained on large text corpora and then fine-tuned on downstream tasks, resulting in better performance on many NLP tasks. Despite attempts to make multilingual BERT models Conneau et al. (2020), research has shown that models pre-trained on high quality monolingual corpora outperform multilingual models pre-trained on large Internet data Scheible et al. (2020); Virtanen et al. (2019). This has motivated many researchers to pre-train BERT models on individual languages rather than adopting the “language-agnostic” multilingual models. This work is partly motivated by the same findings, but also proposes an adaptation of the BERT architecture to address representational challenges that are specific to morphologically rich languages such as Kinyarwanda.

| Word (gloss) | Morphemes | Monolingual BPE | Multilingual BPE |
|---|---|---|---|
| twagezeyo (we arrived there) | tu . a . [ger] . ye . yo | twag . ezeyo | _twa . ge . ze . yo |
| ndabyizeye (I hope so) | n . ra . bi . [izer] . ye | ndaby . izeye | _ ndab . yiz . eye |
| umwarimu (teacher) | u . mu . [arimu] | umwarimu | _um . wari . mu |

Table 1: Comparison between morphemes and BPE-produced sub-word tokens. Stems are shown in square brackets.

In order to handle rare words and reduce the vocabulary size, BERT-like models use statistical sub-word tokenization algorithms such as byte pair encoding (BPE) Sennrich et al. (2016). While these techniques have been widely used in language modeling and machine translation, they are not optimal for morphologically rich languages (Klein and Tsarfaty, 2020). In fact, sub-word tokenization methods that are solely based on surface forms, including BPE and character-based models, cannot capture all morphological details. This is due to morphological alternations Muhirwe (2007) and non-concatenative morphology (McCarthy, 1981) that are often exhibited by morphologically rich languages. For example, as shown in Table 1, a BPE model trained on 390 million tokens of Kinyarwanda text cannot extract the true sub-word lexical units (i.e. morphemes) for the given words. This work addresses the above problem by proposing a language model architecture that explicitly represents most of the input words with morphological parses produced by a morphological analyzer. In this architecture BPE is only used to handle words which cannot be directly decomposed by the morphological analyzer such as misspellings, proper names and foreign language words.

Given the output of a morphological analyzer, a second challenge is in how to incorporate the produced morphemes into the model. One naive approach is to feed the produced morphemes to a standard transformer encoder as a single monolithic sequence. This approach is used by Mohseni and Tebbifakhr (2019). One problem with this method is that mixing sub-word information and sentence-level tokens in a single sequence does not encourage the model to learn the actual morphological compositionality and express word-relative syntactic regularities. We address these issues by proposing a simple yet effective two-tier transformer encoder architecture. The first tier encodes morphological information, which is then transferred to the second tier to encode sentence level information. We call this new model architecture KinyaBERT because it uses BERT’s masked language model objective for pre-training and is evaluated on the morphologically rich Kinyarwanda language.

This work also represents progress in low resource NLP. Advances in human language technology are most often evaluated on the main languages spoken by major economic powers such as English, Chinese and European languages. This has exacerbated the language technology divide between the highly resourced languages and the underrepresented languages. It also hinders progress in NLP research because new techniques are mostly evaluated on the mainstream languages and some NLP advances become less informed of the diversity of the linguistic phenomena (Bender, 2019). Specifically, this work provides the following research contributions:

  • A simple yet effective two-tier BERT architecture for representing morphologically rich languages.

  • New evaluation datasets for Kinyarwanda language including a machine-translated subset of the GLUE benchmark Wang et al. (2019) and a news categorization dataset.

  • Experimental results which set a benchmark for future studies on Kinyarwanda language understanding, and on using machine-translated versions of the GLUE benchmark.

  • Code and datasets are made publicly available for reproducibility.

2 Morphology-aware Language Model

Figure 1: KinyaBERT model architecture: Encoding of the sentence ’John twarahamusanze biradutangaza’ (We were surprised to find John there). The morphological analyzer produces morphemes for each word and assigns a POS tag to it. The two-tier transformer model then generates contextualized embeddings (blue vectors at the top). The red colored embeddings correspond to the POS tags, yellow is for the stem embeddings, green is for the variable length affixes while the purple embeddings correspond to the affix set.

Our modeling objective is to be able to express morphological compositionality in a Transformer-based (Vaswani et al., 2017) language model. For morphologically rich languages such as Kinyarwanda, a set of morphemes (typically a stem and a set of functional affixes) combine to produce a word with a given surface form. This requires an alternative to the ubiquitous BPE tokenization, through which exact sub-word lexical units (i.e. morphemes) are used. For this purpose, we use a morphological analyzer which takes a sentence as input and, for every word, produces a stem, zero or more affixes and assigns a part of speech (POS) tag to each word. This section describes how this morphological information is obtained and then integrated in a two-tier transformer architecture (Figure 1) to learn morphology-aware input representations.

2.1 Morphological Analysis and Part-of-Speech Tagging

Kinyarwanda, the national language of Rwanda, is one of the major Bantu languages (Nurse and Philippson, 2006) spoken in central and eastern Africa. Kinyarwanda has 16 noun classes. Modifiers (demonstratives, possessives, adjectives, numerals) carry a class marking morpheme that agrees with the main noun class. The verbal morphology (Nzeyimana, 2020) also includes subject and object markers that agree with the class of the subject or object. This agreement therefore enables users of the language to approximately disambiguate referred entities based on their classes. We leverage this syntactic agreement property in designing our unsupervised POS tagger.

Our morphological analyzer for Kinyarwanda was built following finite-state two-level morphology principles (Koskenniemi, 1983; Beesley and Karttunen, 2000, 2003). For every inflectable word type, we maintain a morphotactics model using a directed acyclic graph (DAG) that represents the regular sequencing of morphemes. We effectively model all inflectable word types in Kinyarwanda which include verbals, nouns, adjectives, possessive and demonstrative pronouns, numerals and quantifiers. The morphological analyzer also includes many hand-crafted rules for handling morphographemics and other linguistic regularities of the Kinyarwanda language. The morphological analyzer was independently developed and calibrated by native speakers as a closed source solution before the current work on language modeling. Similar to Nzeyimana (2020), we use a classifier trained on a stemming dataset to disambiguate between competing outputs of the morphological analyzer. Furthermore, we improve the disambiguation quality by leveraging a POS tagger at the phrase level so that the syntactic context can be taken into consideration.

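We devise an unsupervised part of speech tagging algorithm which we explain here. Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be a sequence of tokens (e.g. words) to be tagged with a corresponding sequence of tags $\mathbf{y} = (y_1, y_2, \ldots, y_n)$. A sample of actual POS tags used for Kinyarwanda is given in Table 12 in the Appendix. Using Bayes’ rule, the optimal tag sequence $\mathbf{y}^*$ is given by the following equation:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} \frac{P(\mathbf{x} \mid \mathbf{y})\, P(\mathbf{y})}{P(\mathbf{x})} = \arg\max_{\mathbf{y}} P(\mathbf{x} \mid \mathbf{y})\, P(\mathbf{y}) \qquad (1)$$

A standard hidden Markov model (HMM) can decompose the result of Equation 1 using first order Markov assumption and independence assumptions into $P(\mathbf{x} \mid \mathbf{y}) = \prod_{t=1}^{n} P(x_t \mid y_t)$ and $P(\mathbf{y}) = \prod_{t=1}^{n} P(y_t \mid y_{t-1})$. The tag sequence $\mathbf{y}^*$ can then be efficiently decoded using the Viterbi algorithm Forney (1973). A better decoding strategy is presented below.

Inspired by Tsuruoka and Tsujii (2005), we devise a greedy heuristic for decoding $\mathbf{y}^*$ using the same first order Markov assumptions but with bidirectional decoding.

First, we estimate the local emission probabilities $\tilde{P}(x_t \mid y_t)$ using a factored model given in the following equation:

$$\tilde{P}(x_t \mid y_t) = \tilde{P}_m(x_t \mid y_t)\, \tilde{P}_p(x_t \mid y_t)\, \tilde{P}_a(x_t \mid y_t) \qquad (2)$$

In Equation 2, $\tilde{P}_m(x_t \mid y_t)$ corresponds to the probability/score returned by a morphological disambiguation classifier, representing the uncertainty of the morphology of $x_t$. $\tilde{P}_p(x_t \mid y_t)$ corresponds to a local precedence weight between competing POS tags. These precedence weights are manually crafted through qualitative evaluation (See Table 12 in Appendix for examples). $\tilde{P}_a(x_t \mid y_t)$ quantifies the local neighborhood syntactic agreement between Bantu class markers. When there are two or more agreeing class markers in neighboring words, the tagger should be more confident of the agreeing parts of speech. A basic agreement score can be the number of agreeing class markers within a window of seven words around a given candidate $x_t$. We manually designed a more elaborate set of agreement rules and their weights for different contexts. Therefore, the actual agreement score is a weighted sum of the matched agreement rules. Each of the unnormalized measures $\tilde{P}$ in Equation 2 is mapped to the $[0, 1]$ range using a sigmoid function $\sigma(z \mid z_A, z_B)$ given in Equation 3, where $z$ is the score of the measure and $[z_A, z_B]$ is its estimated active range.

After estimating the local emission model, we greedily decode $y_t^* = \arg\max_{y_t} \tilde{P}(y_t \mid \mathbf{x})$ in decreasing order of $\tilde{P}(x_t \mid y_t)$ using a first order bidirectional inference of $\tilde{P}(y_t \mid \mathbf{x})$ as given in the following equation:

$$\tilde{P}(y_t \mid \mathbf{x}) = \begin{cases} \tilde{P}(x_t \mid y_t)\, \tilde{P}(y_t \mid y^*_{t-1}, y^*_{t+1}) & \text{if both } y^*_{t-1} \text{ and } y^*_{t+1} \text{ have been decoded} \\ \tilde{P}(x_t \mid y_t)\, \tilde{P}(y_t \mid y^*_{t-1}) & \text{if only } y^*_{t-1} \text{ has been decoded} \\ \tilde{P}(x_t \mid y_t)\, \tilde{P}(y_t \mid y^*_{t+1}) & \text{if only } y^*_{t+1} \text{ has been decoded} \\ \tilde{P}(x_t \mid y_t) & \text{otherwise} \end{cases} \qquad (4)$$

The first order transition measures $\tilde{P}(y_t \mid y_{t-1})$, $\tilde{P}(y_t \mid y_{t+1})$ and $\tilde{P}(y_t \mid y_{t-1}, y_{t+1})$ are estimated using count tables computed over the entire corpus by aggregating local emission marginals obtained through morphological analysis and disambiguation.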
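The greedy bidirectional decoding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy emission scores and the `PAIR` table are hypothetical, positions are visited in decreasing order of their best local emission score (one possible reading of the ordering), and the two-sided transition is approximated here by a product of pairwise scores rather than by a dedicated count table.

```python
from typing import Callable, Dict, List, Optional


def greedy_bidirectional_decode(
    emissions: List[Dict[str, float]],
    transition: Callable[[str, Optional[str], Optional[str]], float],
) -> List[Optional[str]]:
    """Easiest-first decoding.

    emissions[t][tag] is the local emission score of `tag` at position t.
    transition(tag, left, right) scores `tag` given the neighbours that have
    already been decoded (None for a neighbour that is not decoded yet).
    """
    n = len(emissions)
    decoded: List[Optional[str]] = [None] * n
    # Visit positions from the most confident local emission to the least.
    order = sorted(range(n), key=lambda t: max(emissions[t].values()), reverse=True)
    for t in order:
        left = decoded[t - 1] if t > 0 else None
        right = decoded[t + 1] if t + 1 < n else None
        decoded[t] = max(
            emissions[t],
            key=lambda tag: emissions[t][tag] * transition(tag, left, right),
        )
    return decoded


# Hypothetical pairwise scores for (left_tag, right_tag); not taken from the paper.
PAIR = {("N", "V"): 0.9, ("N", "N"): 0.1, ("V", "N"): 0.6, ("V", "V"): 0.2}


def toy_transition(tag: str, left: Optional[str], right: Optional[str]) -> float:
    score = 1.0  # no decoded neighbour: rely on the emission score alone
    if left is not None:
        score *= PAIR.get((left, tag), 0.05)
    if right is not None:
        score *= PAIR.get((tag, right), 0.05)
    return score


emissions = [
    {"N": 0.9, "V": 0.1},  # confident noun, decoded first
    {"N": 0.5, "V": 0.5},  # ambiguous token, decoded last using both neighbours
    {"N": 0.8, "V": 0.2},
]
tags = greedy_bidirectional_decode(emissions, toy_transition)
```

In this toy run the middle token is locally ambiguous and is only resolved once both of its neighbours have been tagged, which is the case covered by the first branch of the bidirectional inference.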

2.2 Morphology Encoding

The overall architecture of our model is depicted in Figure 1. This is a two-tier transformer encoder architecture made of a token-level morphology encoder that feeds into a sentence/document-level encoder. The morphology encoder is made of a small transformer encoder that is applied to each analyzed token separately in order to extract its morphological features. The extracted morphological features are then concatenated with the token’s stem embedding to form the input vector fed to the sentence/document encoder. The sentence/document encoder is made of a standard transformer encoder as used in other BERT models. The sentence/document encoder uses untied position encoding with relative bias as proposed in Ke et al. (2020).

The input to the morphology encoder is a set of embedding vectors, three vectors relating to the part of speech, one for the stem and one for each affix when available. The transformer encoder operation is applied to these embedding vectors without any positional information. This is because positional information at the morphology level is inherent since no morpheme repeats and each morpheme always occupies a known (i.e. fixed) slot in the morphotactics model. The extracted morphological features are four encoder output vectors corresponding to the three POS embeddings and one stem embedding. Vectors corresponding to the affixes are left out since they are of variable length and the role of the affixes in this case is to be attended to by the stem and the POS tag so that morphological information can be captured. The four morphological output feature vectors are further concatenated with another stem embedding at the sentence level to form the input vector for the main sentence/document encoder.
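A small sketch may help make the data flow concrete. It is a simplification under stated assumptions: a single dot-product self-attention step with identity projections stands in for the small transformer encoder, and the dimensions and random embeddings are hypothetical. It shows two properties described above: the sentence-level input is the concatenation of four morphological feature vectors with a separate stem embedding, and, because no positional information is used, reordering the affixes does not change the result.

```python
import math
import random


def self_attention(vectors):
    """Single-head dot-product self-attention, identity projections, no positions."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ])
    return outputs


def encode_word(pos_vecs, stem_vec, affix_vecs, sentence_stem_vec):
    """Build the sentence-level input vector for one analyzed word."""
    hidden = self_attention(pos_vecs + [stem_vec] + affix_vecs)
    features = hidden[: len(pos_vecs) + 1]        # keep POS and stem outputs, drop affix outputs
    flat = [x for vec in features for x in vec]   # concatenate the four feature vectors
    return flat + sentence_stem_vec               # append the sentence-level stem embedding


rng = random.Random(0)


def rand_vec(dim):
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


MORPH_DIM, STEM_DIM = 4, 8  # hypothetical sizes, chosen only for the example
pos_vecs = [rand_vec(MORPH_DIM) for _ in range(3)]
stem_vec = rand_vec(MORPH_DIM)
affix_vecs = [rand_vec(MORPH_DIM) for _ in range(5)]
sentence_stem_vec = rand_vec(STEM_DIM)

word_input = encode_word(pos_vecs, stem_vec, affix_vecs, sentence_stem_vec)
shuffled_input = encode_word(pos_vecs, stem_vec, affix_vecs[::-1], sentence_stem_vec)
```

Here `word_input` has 4 × 4 + 8 = 24 components, and `shuffled_input` matches it up to floating-point rounding even though the affixes were fed in reverse order.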

The choice of this transformer-based architecture for morphology encoding is motivated by two factors. First, Zaheer et al. (2020) has demonstrated the importance of having “global tokens” such as [CLS] token in BERT models. These are tokens that attend to all other tokens in the modeled sequence. These “global tokens” effectively encapsulate some “meaning” of the encoded sequence. Second, the POS tag and stem represent the high level information content of a word. Therefore, having the POS tag and stem embeddings be transformed into morphological features is a viable option. The POS tag and stem embeddings thus serve as the “global tokens” at the morphology encoder level since they attend to all other morphemes that can be associated with them.

In order to capture subtle morphological information, we make one of the three POS embeddings span an affix set vocabulary $V_a$ that is a subset of the power set of all affixes. We form $V_a$ from the most frequent affix combinations in the corpus. In fact, the morphological model of the language enforces constraints on which affixes can go together for any given part of speech, resulting in an affix set vocabulary that is much smaller than the power set of all affixes. Even when limiting the affix set vocabulary to a fixed size, we can still map any affix combination to $V_a$ by dropping zero or very few affixes from the combination. Note that the affix set embedding still has to attend to all morphemes at the morphology encoder level, making it adapt to the whole morphological context. The affix set embedding is depicted by the purple units in Figure 1 and a sample of $V_a$ is given in Table 13 in the Appendix.
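The mapping of an arbitrary affix combination onto a fixed-size affix set vocabulary can be sketched as follows. This is an illustrative assumption about one way to do it, not the authors' procedure: the function names are hypothetical, the affix labels are borrowed from Table 1 purely as examples, and the exhaustive subset search is only practical because a word carries few affixes.

```python
from collections import Counter
from itertools import combinations


def build_affix_set_vocab(corpus_affix_sets, size):
    """Keep the `size` most frequent affix combinations seen in the corpus."""
    counts = Counter(frozenset(affixes) for affixes in corpus_affix_sets)
    return {affix_set for affix_set, _ in counts.most_common(size)}


def map_to_vocab(affixes, vocab):
    """Return the largest subset of `affixes` found in `vocab`.

    Subsets are tried in order of increasing number of dropped affixes,
    so as few affixes as possible are discarded.
    """
    affixes = sorted(set(affixes))
    for dropped in range(len(affixes) + 1):
        for kept in combinations(affixes, len(affixes) - dropped):
            if frozenset(kept) in vocab:
                return frozenset(kept)
    return frozenset()  # nothing matched: fall back to the empty affix set


corpus = (
    [["tu", "a", "ye"]] * 3
    + [["n", "ra", "ye"]] * 2
    + [["a", "ye"]] * 2
    + [["tu", "a", "ye", "yo"]]  # rare combination, left out of the vocabulary
)
vocab = build_affix_set_vocab(corpus, size=3)
mapped = map_to_vocab(["tu", "a", "ye", "yo"], vocab)
```

The rare four-affix combination is mapped to the in-vocabulary set {tu, a, ye} by dropping a single affix.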

2.3 Pre-training Objective

Similar to other BERT models, we use a masked language model objective. Specifically, 15% of all tokens in the training set are considered for prediction, of which 80% are replaced with [MASK] tokens, 10% are replaced with random tokens and 10% are left unchanged. When prediction tokens are replaced with [MASK] or random tokens, the corresponding affixes are randomly omitted 70% of the time or left in place for 30% of the time, while the units corresponding to POS tags and affix sets are also masked. The pre-training objective is then to predict stems and the associated affixes for all tokens considered for prediction using a two-layer feed-forward module on top of the encoder output.
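The masking procedure can be sketched as below. The word representation (a dictionary with `stem`, `affixes`, `pos` and `affix_set` keys), the function name and the toy stem vocabulary are hypothetical, and replacing a token "with a random token" is interpreted here as drawing a random stem.

```python
import random

MASK = "[MASK]"


def mask_words(words, stem_vocab, rng, predict_rate=0.15):
    """Return (inputs, targets); targets[i] is None when word i is not predicted."""
    inputs, targets = [], []
    for word in words:
        if rng.random() >= predict_rate:
            inputs.append(dict(word))
            targets.append(None)
            continue
        targets.append({"stem": word["stem"], "affixes": list(word["affixes"])})
        corrupted = dict(word)
        roll = rng.random()
        if roll < 0.9:
            # 80% of prediction tokens get [MASK], 10% get a random stem.
            corrupted["stem"] = MASK if roll < 0.8 else rng.choice(stem_vocab)
            if rng.random() < 0.7:
                corrupted["affixes"] = []  # affixes omitted 70% of the time
            corrupted["pos"] = MASK        # POS tag and affix set units are masked too
            corrupted["affix_set"] = MASK
        inputs.append(corrupted)           # the remaining 10% are left unchanged
    return inputs, targets


word = {"stem": "ger", "affixes": ["tu", "a", "ye", "yo"], "pos": "V", "affix_set": "tu+a+ye+yo"}
inputs, targets = mask_words([word] * 10000, ["ger", "izer", "arimu"], random.Random(0))
```

The targets always keep the original stem and affixes, since both are predicted for every token selected for prediction.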

For the affix prediction task, we face a multi-label classification problem where for each prediction token, we predict a variable number of affixes. In our experiments, we tried two methods. For one, we use the Kullback–Leibler (KL) divergence loss function to solve a regression task of predicting the $N$-length affix distribution vector. For this case, we use a target affix probability vector in which each target affix index is assigned probability $\frac{1}{m}$ and non-target affixes are assigned probability $0$. Here $m$ is the number of affixes in the word to be predicted and $N$ is the total number of all affixes. We call this method “Affix Distribution Regression” (ADR) and the model variant KinyaBERT_ADR. Alternatively, we use cross entropy loss and just predict the affix set associated with the prediction word; we call this method “Affix Set Classification” (ASC) and the model variant KinyaBERT_ASC.
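As a worked illustration of the ADR target, the sketch below builds the target affix probability vector and evaluates a KL divergence against a softmax prediction. The helper names, the affix indices and the logits are hypothetical, and a word without affixes simply yields an all-zero target in this sketch.

```python
import math


def adr_target(target_affix_ids, num_affixes):
    """Target distribution: 1/m on each of the m target affixes, 0 elsewhere."""
    m = len(target_affix_ids)
    target = [0.0] * num_affixes
    for idx in target_affix_ids:
        target[idx] = 1.0 / m
    return target


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted):
    """KL(target || predicted), using the convention 0 * log(0 / q) = 0."""
    return sum(p * math.log(p / q) for p, q in zip(target, predicted) if p > 0.0)


# A word with m = 2 affixes (indices 1 and 4) out of N = 6 affixes in total.
target = adr_target([1, 4], num_affixes=6)
loss = kl_divergence(target, softmax([0.0, 2.0, 0.0, 0.0, 2.0, 0.0]))
```

The loss is zero only when the predicted distribution places exactly half of its mass on each of the two target affixes.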

3 Experiments

In order to evaluate the proposed architecture, we pre-train KinyaBERT (101M parameters for KinyaBERT_ADR and 105M for KinyaBERT_ASC) on 2.4 GB of Kinyarwanda text along with 3 baseline BERT models. The first baseline is a BERT model pre-trained on the same Kinyarwanda corpus and with the same position encoding Ke et al. (2020), same batch size and pre-training steps, but using the standard BPE tokenization. We call this first baseline model BERT_BPE (120M parameters). The second baseline is a similar BERT model pre-trained on the same Kinyarwanda corpus but tokenized by a morphological analyzer. For this model, the input is just a sequence of morphemes, in a similar fashion to Mohseni and Tebbifakhr (2019). We call this second baseline model BERT_MORPHO (127M parameters). For BERT_MORPHO, we found that predicting 30% of the tokens achieves better results than using 15% because of the many affixes generated. The third baseline is XLM-R Conneau et al. (2020) (270M parameters) which is pre-trained on 2.5 TB of multilingual text. We evaluate the above models by comparing their performance on downstream NLP tasks.

| Property | Value |
|---|---|
| Language | Kinyarwanda |
| Publication period | 2011 - 2021 |
| Websites/Sources | 370 |
| Documents/Articles | 840K |
| Sentences | 16M |
| Tokens/Words | 390M |
| Text size | 2.4 GB |

Table 2: Summary of the pre-training corpus.

3.1 Pre-training details

The KinyaBERT model was implemented using PyTorch version 1.9. The morphological analyzer and POS tagger were implemented in a shared library using POSIX C. Morphological parsing of the corpus was performed as a pre-processing step, taking 20 hours to segment the 390M-token corpus on a 12-core desktop machine. Pre-training was performed using RTX 3090 and RTX 2080Ti desktop GPUs. Each KinyaBERT model takes on average 22 hours to train for 1000 steps on one RTX 3090 GPU or 29 hours on one RTX 2080Ti GPU. Baseline models (BERT_BPE and BERT_MORPHO) were pre-trained on cloud tensor processing units (TPU v3-8 devices each with 128 GB memory) using the PyTorch/XLA package and a TPU-optimized fairseq toolkit Ott et al. (2019). Pre-training on TPU took 2.3 hours per 1000 steps. The baselines were trained on TPU because there were no major changes needed to the existing RoBERTa (base) architecture implemented in fairseq and the TPU resources were available and efficient. In all cases, pre-training batch size was set to 2560 sequences, with a maximum of 512 tokens in each sequence. The learning rate reaches its maximum after 2000 steps and then decays linearly. Our main results and ablation results were obtained from models pre-trained for 32K steps in all cases. Other pre-training details, model architectural dimensions and other hyper-parameters are given in the Appendix.
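The learning rate schedule described above (warm-up over 2000 steps followed by linear decay, for 32K steps in total) can be sketched as follows. The peak value is left as a parameter because it is not restated here, and decaying all the way to zero at the final step is an assumption of this sketch.

```python
def learning_rate(step, peak_lr, warmup_steps=2000, total_steps=32000):
    """Linear warm-up to peak_lr, then linear decay (assumed to reach 0 at total_steps)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# With a placeholder peak of 1.0 the function returns the schedule's shape.
shape = [learning_rate(s, peak_lr=1.0) for s in (0, 1000, 2000, 17000, 32000)]

# Upper bound on tokens per batch: 2560 sequences of at most 512 tokens.
max_tokens_per_batch = 2560 * 512
```

With these settings a batch holds at most 1,310,720 tokens.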

| Task | MRPC | QNLI | RTE | SST-2 | STS-B | WNLI |
|---|---|---|---|---|---|---|
| #Train examples | 3.4K | 104.7K | 2.5K | 67.4K | 5.8K | 0.6K |
| Translation score | 2.7/4.0 | 2.9/4.0 | 3.0/4.0 | 2.7/4.0 | 3.1/4.0 | 2.9/4.0 |
| Validation Set | | | | | | |
| XLM-R | 84.2/78.3 | 79.0 | 58.4 | 78.7 | 77.7/77.8 | 55.4 |
| BERT_BPE | 83.3/76.6 | 81.9 | 59.2 | 80.1 | 75.6/75.7 | 55.4 |
| BERT_MORPHO | 84.3/77.4 | 81.6 | 59.2 | 81.6 | 76.8/77.0 | 54.2 |
| KinyaBERT_ADR | 87.1/82.1 | 81.6 | 61.8 | 81.8 | 79.6/79.5 | 54.5 |
| KinyaBERT_ASC | 86.6/81.3 | 82.3 | 64.3 | 82.4 | 80.0/79.9 | 56.2 |
| Test Set | | | | | | |
| XLM-R | 82.6/76.0 | 78.1 | 56.4 | 76.3 | 69.5/68.9 | 63.7 |
| BERT_BPE | 82.8/76.2 | 81.1 | 55.6 | 79.1 | 68.9/67.8 | 63.4 |
| BERT_MORPHO | 82.7/75.4 | 80.8 | 56.7 | 80.7 | 68.9/67.8 | 65.0 |
| KinyaBERT_ADR | 84.4/78.7 | 81.2 | 58.1 | 80.9 | 73.2/72.0 | 65.1 |
| KinyaBERT_ASC | 84.6/78.4 | 82.2 | 58.8 | 81.4 | 74.5/73.5 | 65.0 |

Table 3: Performance results on the machine translated GLUE benchmark Wang et al. (2019). The translation score is the sample average translation quality score assigned by volunteers. For MRPC, we report accuracy and F1. For STS-B, we report Pearson and Spearman correlations. For all others, we report accuracy. The best results are shown in bold while equal top results are underlined.
Task: NER (#Train examples: 2.1K)

| Model | Validation Set | Test Set |
|---|---|---|
| XLM-R | 80.3 | 71.8 |
| BERT_BPE | 83.4 | 74.8 |
| BERT_MORPHO | 83.2 | 72.8 |
| KinyaBERT_ADR | 87.1 | 77.2 |
| KinyaBERT_ASC | 86.2 | 76.3 |

Table 4: Micro average F1 scores on Kinyarwanda NER task Adelani et al. (2021).
Task: NEWS (#Train examples: 18.0K)

| Model | Validation Set | Test Set |
|---|---|---|
| XLM-R | 83.8 | 84.0 |
| BERT_BPE | 87.6 | 88.3 |
| BERT_MORPHO | 86.9 | 86.9 |
| KinyaBERT_ADR | 88.8 | 88.0 |
| KinyaBERT_ASC | 88.4 | 88.0 |

Table 5: Accuracy results on Kinyarwanda NEWS categorization task.

3.2 Evaluation tasks

Machine translated GLUE benchmark – The General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2019) has been widely used to evaluate pre-trained language models. In order to assess KinyaBERT performance on such high level language tasks, we used Google Translate API to translate a subset of the GLUE benchmark (MRPC, QNLI, RTE, SST-2, STS-B and WNLI tasks) into Kinyarwanda. The CoLA task was left out because it is English-specific. MNLI and QQP tasks were also not translated because they were too expensive to translate with Google’s commercial API. While machine translation adds more noise to the data, evaluating on this dataset is still relevant because all models compared have to cope with the same noise. To understand this translation noise, we also ran user evaluation experiments, whereby four volunteers proficient in both English and Kinyarwanda evaluated a random sample of 6000 translated GLUE examples, and assigned a score to each example on a scale from 1 to 4 (See Table 11 in Appendix). These scores help us characterize the noise in the data and contextualize our results with regard to other GLUE evaluations. Results on these GLUE tasks are shown in Table 3.

Named entity recognition (NER) – We use the Kinyarwanda subset of the MasakhaNER dataset Adelani et al. (2021) for NER task. This is a high quality NER dataset annotated by native speakers for major African languages including Kinyarwanda. The task requires predicting four entity types: Persons (PER), Locations (LOC), Organizations (ORG), and date and time (DATE). Results on this NER task are presented in Table 4.

News Categorization Task (NEWS) – For a document classification experiment, we collected a set of categorized news articles from seven major news websites that regularly publish in Kinyarwanda. The articles were already categorized, so no more manual labeling was needed. This dataset is similar to Niyongabo et al. (2020), but in our case, we limited the number of collected articles per category to 3000 in order to have a more balanced label distribution (See Table 10 in the Appendix). The final dataset contains a total of 25.7K articles spanning 12 categories and has been split into training, validation and test sets in the ratios of 70%, 5% and 25% respectively. Results on this NEWS task are presented in Table 5.
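The dataset construction for the NEWS task (a per-category cap followed by a 70/5/25 split) can be sketched as below. Whether the original split was random or stratified is not stated, so the shuffling, the seed and the function name are assumptions made for illustration.

```python
import random
from collections import defaultdict


def cap_and_split(articles, cap=3000, ratios=(0.70, 0.05, 0.25), seed=0):
    """articles: list of (text, category) pairs. Cap each category, shuffle, split."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for text, category in articles:
        by_category[category].append((text, category))
    kept = []
    for items in by_category.values():
        rng.shuffle(items)
        kept.extend(items[:cap])  # at most `cap` articles per category
    rng.shuffle(kept)
    n_train = round(len(kept) * ratios[0])
    n_valid = round(len(kept) * ratios[1])
    train = kept[:n_train]
    valid = kept[n_train:n_train + n_valid]
    test = kept[n_train + n_valid:]  # the remainder, about 25%
    return train, valid, test


# Toy corpus: one over-represented category and one small category.
articles = [(f"politics {i}", "politics") for i in range(5000)]
articles += [(f"sports {i}", "sports") for i in range(1000)]
train, valid, test = cap_and_split(articles)
```

Applied to 25.7K articles, a 70% share corresponds to roughly 18.0K training examples, in line with the training set size reported for the NEWS task in Table 5.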

For each evaluation task, we use a two-layer feed-forward network on top of the sentence encoder as it is typically done in other BERT models. The fine-tuning hyper-parameters are presented in Table 14 in the Appendix.

3.3 Main results

The main results are presented in Table 3, Table 4, and Table 5. Each result is the average of 10 independent fine-tuning runs. Each average result is shown with the standard deviation of the 10 runs. Except for XLM-R, all other models are pre-trained on the same corpus (See Table 2) for 32K steps using the same hyper-parameters.

On the GLUE task, KinyaBERT_ASC achieves a 4.3% better average score than the strongest baseline. KinyaBERT_ASC also leads to more robust results on multiple tasks. It is also shown that having just a morphological analyzer is not enough: BERT_MORPHO still under-performs even though it uses morphological tokenization. Multilingual XLM-R achieves the lowest performance in most cases, possibly because it was not pre-trained on Kinyarwanda text and uses inadequate tokenization.

On the NER task, KinyaBERT_ADR achieves the best performance, about 3.2% better average F1 score than the strongest baseline. One of the architectural differences between KinyaBERT_ADR and KinyaBERT_ASC is that KinyaBERT_ADR uses three POS tag embeddings while KinyaBERT_ASC uses two. Assuming that POS tagging facilitates named entity recognition, this empirical result suggests that increasing the amount of POS tag information in the model, possibly through diversification (i.e. multiple POS tag embedding vectors per word), can lead to better NER performance.
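As a quick check of the NER figure, the improvement can be read as a relative gain over the strongest baseline on the test set. This reading is an assumption; the scores are the test F1 values from Table 4 (77.2 for the best KinyaBERT variant, 74.8 for the best baseline), and the helper name is hypothetical.

```python
def relative_gain(model_score, baseline_score):
    """Relative improvement of the model over the baseline, in percent."""
    return 100.0 * (model_score - baseline_score) / baseline_score


# Best KinyaBERT test F1 versus the strongest baseline test F1 in Table 4.
ner_gain = relative_gain(77.2, 74.8)
```

The absolute difference is 2.4 F1 points, which corresponds to a relative gain of about 3.2%.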

The NEWS categorization task resulted in differing performances between validation and test sets. This may be because solving such a task does not require high level language modeling but rather depends on spotting a few keywords. Previous research on a similar task Niyongabo et al. (2020) has shown that simple classifiers based on TF-IDF features suffice to achieve the best performance.

The morphological analyzer and POS tagger inherently have some level of noise because they do not always perform with perfect accuracy. While we did not have a simple way of assessing the impact of this noise in this work, we can logically expect that the lower the noise the better the results could be. Improving the morphological analyzer and POS tagger and quantitatively evaluating its accuracy is part of future work. Even though our POS tagger uses heuristic methods and was evaluated mainly through qualitative exploration, we can still see its positive impact on the pre-trained language model. We did not use previous work on Kinyarwanda POS tagging because it is largely different from this work in terms of scale, tag dictionary and dataset size and availability.

Figure 2: Comparison of fine-tuning loss curves between KinyaBERT and baselines on the evaluation tasks. KinyaBERT achieves the best convergence in most cases, indicating better effectiveness of its model architecture and pre-training objective.

We plot the learning curves during the fine-tuning process of KinyaBERT and the baselines. The results in Figure 2 indicate that KinyaBERT fine-tuning has better convergence across all tasks. Additional results also show that positional attention Ke et al. (2020) learned by KinyaBERT has more uniform and smoother relative bias while BERT_BPE and BERT_MORPHO have noisier relative positional bias (See Figure 3 in Appendix). This is possibly an indication that KinyaBERT allows learning better word-relative syntactic regularities. However, this aspect needs to be investigated more systematically in future research.

While the main sentence/document encoder of KinyaBERT is equivalent to a standard BERT “BASE” configuration on top of a small morphology encoder, overall, the model actually decreases the number of parameters by more than 12% through embedding layer savings. This is because using morphological representation reduces the vocabulary size. Using smaller embedding vectors at the morphology encoder level also significantly reduces the overall number of parameters. Table 8 in Appendix shows the vocabulary sizes and parameter count of KinyaBERT in comparison to the baselines. While the sizing of the embeddings was done essentially to match BERT “BASE” configuration, future studies can shed more light on how different model sizes affect performance.

3.4 Ablation study

| Morphology→Prediction | MRPC | QNLI | RTE | SST-2 | STS-B | WNLI | NER | NEWS |
|---|---|---|---|---|---|---|---|---|
| Validation Set | | | | | | | | |
| AFS→STEM+ASC | 86.6/81.3 | 82.3 | 64.3 | 82.4 | 80.0/79.9 | 56.2 | 86.2 | 88.4 |
| POS→STEM+ADR | 87.1/82.1 | 81.6 | 61.8 | 81.8 | 79.6/79.5 | 54.5 | 87.1 | 88.8 |
| AVG→STEM+ADR | 85.5/80.3 | 81.4 | 63.0 | 82.1 | 79.6/79.5 | 55.8 | 86.6 | 88.3 |
| STEM→STEM | 86.4/81.5 | 80.4 | 63.4 | 77.5 | 79.7/79.5 | 50.4 | 86.6 | 88.0 |
| Test Set | | | | | | | | |
| AFS→STEM+ASC | 84.6/78.4 | 82.2 | 58.8 | 81.4 | 74.5/73.5 | 65.0 | 76.3 | 88.0 |
| POS→STEM+ADR | 84.4/78.7 | 81.2 | 58.1 | 80.9 | 73.2/72.0 | 65.1 | 77.2 | 88.0 |
| AVG→STEM+ADR | 84.0/78.2 | 81.7 | 59.4 | 80.7 | 73.6/72.6 | 65.0 | 76.9 | 88.2 |
| STEM→STEM | 84.2/78.6 | 80.3 | 59.8 | 77.5 | 73.3/72.0 | 59.6 | 76.4 | 88.4 |

Table 6: Ablation results: each result is an average of 10 independent fine-tuning runs. Metrics, dataset sizes and noise statistics are the same as for the main results in Table 3, Table 4 and Table 5.

We conducted an ablation study to clarify some of the design choices made for KinyaBERT architecture. We make variations along two axes: (i) morphology input and (ii) pre-training task, which gave us four variants that we pre-trained for 32K steps and evaluated on the same downstream tasks.

  • AFS→STEM+ASC: Morphological features are captured by two POS tag vectors and one affix set vector. We predict both the stem and affix set. This corresponds to KinyaBERT_ASC presented in the main results.

  • POS→STEM+ADR: Morphological features are carried by three POS tag vectors and we predict the stem and affix probability vector. This corresponds to KinyaBERT_ADR.

  • AVG→STEM+ADR: Morphological features are captured by two POS tag vectors and the pointwise average of affix hidden vectors from the morphology encoder. We predict the stem and affix probability vector.

  • STEM→STEM: We omit the morphology encoder and train a model with only the stem parts without affixes and only predict the stem.

Ablation results presented in Table 6 indicate that using affix sets for both morphology encoding and prediction gives better results for many GLUE tasks. The under-performance of “STEM→STEM” on high resource tasks (QNLI and SST-2) is an indication that morphological information from affixes is important. However, the utility of this information depends on the task as we see mixed results on other tasks.

Due to the large design space for a morphology-aware language model, there are still a number of other design choices that can be explored in future studies. One may vary the number of POS tag embeddings used, the size of the affix set vocabulary or the dimension of the morphology encoder embeddings. One may also investigate the potential of other architectures for the morphology encoder, such as convolutional networks. Our early attempt at using recurrent neural networks (RNNs) for the morphology encoder was abandoned because it was too slow to train.

4 Related Work

BERT-variant pre-trained language models (PLMs) were initially pre-trained on monolingual high-resource languages. Multilingual PLMs that include both high-resource and low-resource languages have also been introduced Devlin et al. (2019); Conneau et al. (2020); Xue et al. (2021); Chung et al. (2020). However, it has been found that these multilingual models are biased towards high-resource languages and that their low-resource data is scarcer, of low quality and uncleaned Kreutzer et al. (2022). The included low-resource languages are also very limited because they are mainly sourced from Wikipedia articles, where languages with few articles like Kinyarwanda are often left behind Joshi et al. (2020); Nekoto et al. (2020).

Joshi et al. (2020) classify the state of NLP for Kinyarwanda as “Scraping-By”, meaning that it has been mostly excluded from previous NLP research and requires the creation of dedicated resources and models. Kinyarwanda has been studied mostly in descriptive linguistics (Kimenyi, 1976, 1978a, 1978b, 1988; Jerro, 2016). The few recent NLP works on Kinyarwanda include morphological analysis (Muhirwe, 2009; Nzeyimana, 2020), text classification (Niyongabo et al., 2020), named entity recognition (Rijhwani et al., 2020; Adelani et al., 2021; Sälevä and Lignos, 2021), POS tagging (Garrette and Baldridge, 2013; Garrette et al., 2013; Duong et al., 2014; Fang and Cohn, 2016; Cardenas et al., 2019), and parsing (Sun et al., 2014; Mielens et al., 2015). There is no prior study on pre-trained language modeling for Kinyarwanda.

There are very few works on monolingual PLMs for African languages. To the best of our knowledge, there is currently only AfriBERT (Ralethe, 2020), which has been pre-trained on Afrikaans, a language spoken in South Africa. In this paper, we aim to increase the inclusion of African languages in the NLP community by introducing a PLM for Kinyarwanda. Unlike previous works (see Table 15 in the Appendix), which solely pre-trained unmodified BERT models, we propose an improved BERT architecture for morphologically rich languages.

Recently, there has been a research push to improve sub-word tokenization by adopting character-based models (Ma et al., 2020; Clark et al., 2022). While these methods are promising for the “language-agnostic” case, they are still solely based on the surface form of words, and thus have the same limitations as BPE when processing morphologically rich languages. We leave it to future research to empirically explore how these character-based methods compare to morphology-aware models.

5 Conclusion

This work demonstrates the effectiveness of explicitly incorporating morphological information in language model pre-training. The proposed two-tier Transformer architecture allows the model to represent morphological compositionality. Experiments conducted on Kinyarwanda, a low-resource, morphologically rich language, reveal significant performance improvements on several downstream NLP tasks when using the proposed architecture. These findings should motivate more research into morphology-aware language models.


Acknowledgements

This work was supported with Cloud TPUs from Google’s TPU Research Cloud (TRC) program and Google Cloud Research Credits with the award GCP19980904. We also thank the anonymous reviewers for their insightful feedback.

References


  • Adelani et al. (2021) David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131.
  • Baly et al. (2020) Fady Baly, Hazem Hajj, et al. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15.
  • Beesley and Karttunen (2000) Kenneth R Beesley and Lauri Karttunen. 2000. Finite-state non-concatenative morphotactics. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 191–198.
  • Beesley and Karttunen (2003) Kenneth R Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford.
  • Bender (2019) Emily M Bender. 2019. The #BenderRule: On naming the languages we study and why it matters. The Gradient, 14.
  • Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Canete et al. (2020) José Canete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez. 2020. Spanish pre-trained bert model and evaluation data. PML4DC at ICLR, 2020.
  • Cardenas et al. (2019) Ronald Cardenas, Ying Lin, Heng Ji, and Jonathan May. 2019. A grounded unsupervised universal part-of-speech tagger for low-resource languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2428–2439.
  • Chan et al. (2020) Branden Chan, Stefan Schweter, and Timo Möller. 2020. German’s next language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788–6796, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  • Chung et al. (2020) Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2020. Rethinking embedding coupling in pre-trained language models. In International Conference on Learning Representations.
  • Clark et al. (2022) Jonathan H Clark, Dan Garrette, Iulia Turc, and John Wieting. 2022. Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73–91.
  • Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  • Delobelle et al. (2020) Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based Language Model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3255–3265, Online. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Duong et al. (2014) Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, and Paul Cook. 2014. What can we get from 1000 tokens? a case study of multilingual pos tagging for resource-poor languages. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 886–897.
  • Fang and Cohn (2016) Meng Fang and Trevor Cohn. 2016. Learning when to trust distant supervision: An application to low-resource pos tagging using cross-lingual projection. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 178–186.
  • Forney (1973) G David Forney. 1973. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.
  • Garrette and Baldridge (2013) Dan Garrette and Jason Baldridge. 2013. Learning a part-of-speech tagger from two hours of annotation. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: Human language technologies, pages 138–147.
  • Garrette et al. (2013) Dan Garrette, Jason Mielens, and Jason Baldridge. 2013. Real-world semi-supervised learning of pos-taggers for low-resource languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 583–592.
  • Jerro (2016) Kyle Jerro. 2016. The locative applicative and the semantics of verb class in kinyarwanda. Diversity in African languages, page 289.
  • Joshi et al. (2020) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the nlp world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293.
  • Ke et al. (2020) Guolin Ke, Di He, and Tie-Yan Liu. 2020. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations.
  • Kimenyi (1976) Alexandre Kimenyi. 1976. Subjectivization rules in kinyarwanda. In Annual Meeting of the Berkeley Linguistics Society, volume 2, pages 258–268.
  • Kimenyi (1978a) Alexandre Kimenyi. 1978a. Aspects of naming in kinyarwanda. Anthropological linguistics, 20(6):258–271.
  • Kimenyi (1978b) Alexandre Kimenyi. 1978b. A relational grammar of kinyarwanda. University of California, Publications in Linguistics Berkeley, Cal, 91:1–248.
  • Kimenyi (1988) Alexandre Kimenyi. 1988. Passiveness in kinyarwanda. In Passive and Voice, page 355. John Benjamins.
  • Klein and Tsarfaty (2020) Stav Klein and Reut Tsarfaty. 2020. Getting the ##life out of living: How adequate are word-pieces for modelling complex morphology? In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 204–209, Online. Association for Computational Linguistics.
  • Koskenniemi (1983) Kimmo Koskenniemi. 1983. Two-level model for morphological analysis. In IJCAI, volume 83, pages 683–685.
  • Koto et al. (2020) Fajri Koto, Afshin Rahimi, Jey Han Lau, and Timothy Baldwin. 2020. Indolem and indobert: A benchmark dataset and pre-trained language model for indonesian nlp. In Proceedings of the 28th International Conference on Computational Linguistics, pages 757–770.
  • Koutsikakis et al. (2020) John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis, and Ion Androutsopoulos. 2020. Greek-bert: The greeks visiting sesame street. In 11th Hellenic Conference on Artificial Intelligence, pages 110–117.
  • Kreutzer et al. (2022) Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, et al. 2022. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
  • Kuratov and Arkhipov (2019) Y Kuratov and M Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for russian language. In Komp’juternaja Lingvistika i Intellektual’nye Tehnologii, pages 333–339.
  • Le et al. (2020) Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoit Crabbe, Laurent Besacier, and Didier Schwab. 2020. Flaubert: Unsupervised language model pre-training for french. In LREC.
  • Ma et al. (2020) Wentao Ma, Yiming Cui, Chenglei Si, Ting Liu, Shijin Wang, and Guoping Hu. 2020. CharBERT: Character-aware pre-trained language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 39–50, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  • Martin et al. (2020) Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.
  • Masala et al. (2020) Mihai Masala, Stefan Ruseti, and Mihai Dascalu. 2020. Robert–a romanian bert model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6626–6637.
  • McCarthy (1981) John J McCarthy. 1981. A prosodic theory of nonconcatenative morphology. Linguistic inquiry, 12(3):373–418.
  • Mielens et al. (2015) Jason Mielens, Liang Sun, and Jason Baldridge. 2015. Parse imputation for dependency annotations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1385–1394.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
  • Mohseni and Tebbifakhr (2019) Mahdi Mohseni and Amirhossein Tebbifakhr. 2019. MorphoBERT: a Persian NER system with BERT and morphological analysis. In Proceedings of The First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers, pages 23–30, Trento, Italy. Association for Computational Linguistics.
  • Muhirwe (2007) Jackson Muhirwe. 2007. Computational analysis of kinyarwanda morphology: The morphological alternations. International Journal of computing and ICT Research, 1(1):85–92.
  • Muhirwe (2009) Jackson Muhirwe. 2009. Morphological analysis of tone marked kinyarwanda text. In International Workshop on Finite-State Methods and Natural Language Processing, pages 48–55. Springer.
  • Nekoto et al. (2020) Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160, Online. Association for Computational Linguistics.
  • Nguyen and Tuan Nguyen (2020) Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042, Online. Association for Computational Linguistics.
  • Niyongabo et al. (2020) Rubungo Andre Niyongabo, Qu Hong, Julia Kreutzer, and Li Huang. 2020. Kinnews and kirnews: Benchmarking cross-lingual text classification for kinyarwanda and kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5507–5521.
  • Nurse and Philippson (2006) Derek Nurse and Gérard Philippson. 2006. The bantu languages. Routledge.
  • Nzeyimana (2020) Antoine Nzeyimana. 2020. Morphological disambiguation from stemming data. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4649–4660, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  • Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Ralethe (2020) Sello Ralethe. 2020. Adaptation of deep bidirectional transformers for afrikaans language. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2475–2478.
  • Rijhwani et al. (2020) Shruti Rijhwani, Shuyan Zhou, Graham Neubig, and Jaime G Carbonell. 2020. Soft gazetteers for low-resource named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8118–8123.
  • Rybak et al. (2020) Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and Ireneusz Gawlik. 2020. Klej: Comprehensive benchmark for polish language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1191–1201.
  • Sälevä and Lignos (2021) Jonne Sälevä and Constantine Lignos. 2021. Mining wikidata for name resources for african languages. arXiv preprint arXiv:2104.00558.
  • Scheible et al. (2020) Raphael Scheible, Fabian Thomczyk, Patric Tippmann, Victor Jaravine, and Martin Boeker. 2020. Gottbert: a pure german language model. arXiv preprint arXiv:2012.02110.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Souza et al. (2020) Fábio Souza, Rodrigo Nogueira, and Roberto Lotufo. 2020. Bertimbau: Pretrained bert models for brazilian portuguese. In Brazilian Conference on Intelligent Systems, pages 403–417. Springer.
  • Sun et al. (2014) Liang Sun, Jason Mielens, and Jason Baldridge. 2014. Parsing low-resource languages using gibbs sampling for pcfgs with latent annotations. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 290–300.
  • Tsuruoka and Tsujii (2005) Yoshimasa Tsuruoka and Jun’ichi Tsujii. 2005. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 467–474.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
  • Virtanen et al. (2019) Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: Bert for finnish. arXiv preprint arXiv:1912.07076.
  • Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019.
  • Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
  • Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297.

Appendix A Data Tables, Hyper-parameters & Additional results


| Module | Values |
|---|---|
| Morphology Encoder: | |
| Number of Layers | 4 |
| Attention heads | 4 |
| Hidden Size | 128 |
| Attention head size | 32 |
| FFN inner hidden size | 512 |
| Morphological embedding size | 128 |
| Sentence/Document Encoder: | |
| Number of Layers | 12 |
| Attention heads | 12 |
| Hidden Size | 768 |
| Attention head size | 64 |
| FFN inner hidden size | 3072 |
| Stem embedding size | 256 |

Table 7: KinyaBERT Architectural dimensions.
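
The dimensions in Table 7 are internally consistent: in each encoder the attention head size equals the hidden size divided by the number of heads, and the FFN inner size is four times the hidden size. The following sketch checks this arithmetic. The `EncoderConfig` class and its field names are hypothetical and are not taken from the KinyaBERT code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the per-encoder values in Table 7."""
    num_layers: int
    num_heads: int
    hidden_size: int
    ffn_inner_size: int

    @property
    def head_size(self) -> int:
        # Each head attends over an equal slice of the hidden vector.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


morphology_encoder = EncoderConfig(
    num_layers=4, num_heads=4, hidden_size=128, ffn_inner_size=512
)
sentence_encoder = EncoderConfig(
    num_layers=12, num_heads=12, hidden_size=768, ffn_inner_size=3072
)

for name, cfg in [("morphology", morphology_encoder),
                  ("sentence", sentence_encoder)]:
    ratio = cfg.ffn_inner_size // cfg.hidden_size
    print(name, "head size:", cfg.head_size, "FFN ratio:", ratio)
```
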

| Model (#Params) | Vocab. Size |
|---|---|
| XLM-R (270M): | |
| Sentence-Piece tokens | 250K |
| BERT (120M): | |
| BPE Tokens | 43K |
| BERT (127M): | |
| Morphemes & BPE Tokens | 51K |
| KinyaBERT (101M): | |
| Stems & BPE Tokens | 34K |
| Affixes | 0.3K |
| POS Tags | 0.2K |
| KinyaBERT (105M): | |
| Stems & BPE Tokens | 34K |
| Affix sets | 34K |
| Affixes | 0.3K |
| POS Tags | 0.2K |

Table 8: Vocabulary sizes for embedding layers.
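
| Hyper-parameter | Values |
|---|---|
| Dropout | 0.1 |
| Attention Dropout | 0.1 |
| Warmup Steps | 2K |
| Max Steps | 200K |
| Weight Decay | 0.01 |
| Learning Rate Decay | Linear |
| Peak Learning Rate | 4e-4 |
| Batch Size | 2560 |
| Optimizer | LAMB |
| Adam $\epsilon$ | 1e-6 |
| Adam $\beta_1$ | 0.90 |
| Adam $\beta_2$ | 0.98 |
| Gradient Clipping | 0 |
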
Table 9: Pre-training hyper-parameters
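
The warmup and decay settings in Table 9 describe a learning-rate schedule that can be sketched as a function of the training step. This is an illustration only: the table gives the peak rate (4e-4), the warmup length (2K steps), the total length (200K steps), and a linear decay, but the assumptions that warmup starts from zero and that decay ends at exactly zero are ours. The function name is hypothetical.

```python
def learning_rate(step: int,
                  peak_lr: float = 4e-4,
                  warmup_steps: int = 2_000,
                  max_steps: int = 200_000) -> float:
    """Linear warmup to peak_lr, then linear decay (assumed to reach 0)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Assumption: warmup rises linearly from 0 to the peak rate.
        return peak_lr * step / warmup_steps
    if step >= max_steps:
        return 0.0
    # Linear decay from the peak at warmup_steps down to 0 at max_steps.
    return peak_lr * (max_steps - step) / (max_steps - warmup_steps)


for s in (0, 1_000, 2_000, 101_000, 200_000):
    print(s, learning_rate(s))
```

Under these assumptions the rate peaks at step 2,000 and has fallen to half the peak by step 101,000, the midpoint of the decay phase.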

| Category | #Articles |
|---|---|
| entertainment | 3000 |
| sports | 3000 |
| security | 3000 |
| economy | 3000 |
| health | 3000 |
| politics | 3000 |
| religion | 2020 |
| development | 1813 |
| technology | 1105 |
| culture | 994 |
| relationships | 940 |
| people | 852 |
| Total | 25724 |

Table 10: NEWS categorization dataset label distribution.

| Score | Translation quality |
|---|---|
| 1 | Invalid or meaningless translation |
| 2 | Invalid but not totally wrong |
| 3 | Almost valid, but not totally correct |
| 4 | Valid and correct translation |

Table 11: Machine-translated GLUE benchmark scoring prompt levels.

| POS Tag | Weight | Description | Example |
|---|---|---|---|
| V#000 | 1.8 | Infinitive verb | kuvuga ‘to say’ |
| V#001 | 1 | Gerund or verbal noun | uwavuze ‘the one who said’ |
| V#002 | 1.5 | Imperative verb | vuga ‘say’ |
| V#004 | 1.5 | Continuous present verb | aracyavuga ‘she is still saying’ |
| V#005 | 1.5 | Past tense verb | yaravuze ‘she said’ |
| V#006 | 1.5 | Future tense verb | azavuga ‘she will say’ |
| V#010 | 1.5 | Verb without tense mark | avuga ‘saying’ |
| N#011 | 1 | Noun without augment | (wa)muntu ‘person’ |
| N#012 | 2 | Noun with augment | umuntu ‘a person’ |
| DE#013 | 2 | Demonstrative ng- | nguyu ‘this is her’ |
| DE#020 | 3 | Personal demonstrative | wowe ‘you’ |
| DE#021 | 2 | Demonstrative with augment | uwo ‘this (person)’ |
| PO#025 | 2 | Possessive +augment +owner | uwawe ‘yours’ |
| QA#026 | 0.5 | Qualificative adjective +augment +bu | ubuto ‘littleness’ |
| QA#027 | 1 | Qualificative adjective +augment -bu | umuto ‘the little one’ |
| QA#028 | 2.5 | Qualificative adjective -augment | muto ‘little’ |
| QA#029 | 3 | Qualificative adjective -augment +reduplication | mutomuto ‘(kind of) little’ |
| NU#030 | 2.5 | Numeral | babiri ‘two (people)’ |
| OT#033 | 2.5 | Quoting -ti | bati: ‘they said:’ |
| NP#035 | 2 | Proper names | Yohana ‘John’ |
| DI#036 | 3 | Digits | 84 |
| AD#037 | 2.5 | Adverb | bucece ‘silently’ |
| VC#038 | 2.5 | Conjunctive adverbs | hanyuma ‘and then’ |
| CO#039 | 2.5 | Commanding expressions | cyono ‘please’ |
| CA#040 | 2.5 | Calling expressions | yewe ‘you’ |
| QU#044 | 3 | Questioning adverb | he he ‘where’ |
| SP#054 | 2.5 | Spatial | hakurya ‘over there’ |
| TE#055 | 2.5 | Temporal | kare ‘early’ |
| RL#056 | 3 | Relatives | masenge ‘my aunt’ |
| PR#057 | 3 | Prepositions | ku ‘on’ |
| OR#064 | 2.5 | Orientations | amajyaruguru ‘north’ |
| AJ#065 | 2.5 | Adjectives | rusange ‘common’ |
| NN#066 | 2.5 | Nominal loanwords | kopi ‘copy’ |
| HR#067 | 3 | Hours | (saa) mbiri ‘eight o’clock’ |
| DT#068 | 2.5 | Date | taliki ‘date’ |
| EN#069 | 3 | Common English terms | live, like, share |
| IJ#070 | 2.5 | Interjections | dorere ‘see!’ |
| CJ#071 | 3 | Conjunctions | ko ‘that’ |
| CP#078 | 3 | Copula | ni ‘it is’ |
| RE#079 | 3 | Responses | yego ‘yes’ |
| UN#083 | 3 | Measuring units | metero ‘meter’ |
| MO#084 | 4 | Months | Mutarama ‘January’ |
| PT#085 | 3 | Punctuations | . |

Table 12: Examples of POS tags used in KinyaBERT along with precedence weights in Equation 2.

| Affix Set | Example | Surface form |
|---|---|---|
| V:2:ku-V:18:a | ku-gend-a | kugenda ‘to walk’ |
| N:0:u-N:1:mu | u-mu-ntu | umuntu ‘a person’ |
| PO:1:i | i-a-cu | yacu ‘our’ |
| N:0:i-N:1:n | i-n-kiko | inkiko ‘courts’ |
| PO:1:u | u-a-bo | wabo ‘their’ |
| V:2:a-V:4:a-V:18:ye | a-a-bon-ye | yabonye ‘she saw’ |
| DE:1:u-DE:2:u | u-u-o | uwo ‘that’ |
| V:2:u-V:4:a-V:17:w-V:18:ye | u-a-vug-w-ye | wavuzwe ‘who was talked about’ |
| QA:1:ki-QA:3:ki-QA:4:re | ki-re-ki-re | kirekire ‘tall’ |

Table 13: Examples of affix sets used by KinyaBERT; there are 34K sets in total.
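
The affix set labels in Table 13 appear to be built by joining one `POS:number:affix` triple per affix with hyphens. The sketch below reproduces that string format from a list of triples. It is an illustration only: reading the middle number as a slot index, sorting the affixes by it, and the function name `affix_set_key` are assumptions inferred from the table, not a description of the authors' implementation.

```python
from typing import List, Tuple

# (POS prefix, slot index, affix form); "slot index" is our reading
# of the middle number in the labels of Table 13.
Affix = Tuple[str, int, str]


def affix_set_key(affixes: List[Affix]) -> str:
    """Join affix triples into a single label such as 'V:2:ku-V:18:a'."""
    # Assumption: affixes are ordered by ascending slot index,
    # as in every example row of Table 13.
    ordered = sorted(affixes, key=lambda a: a[1])
    return "-".join(f"{pos}:{slot}:{form}" for pos, slot, form in ordered)


# Affixes of ku-gend-a (kugenda, 'to walk'); the stem is not part of the set.
walk = affix_set_key([("V", 2, "ku"), ("V", 18, "a")])

# Affixes of a-a-bon-ye (yabonye, 'she saw'), given out of order on purpose.
saw = affix_set_key([("V", 18, "ye"), ("V", 2, "a"), ("V", 4, "a")])

print(walk)
print(saw)
```

Treating the whole combination as one vocabulary item means that words sharing the same affixes map to the same affix set entry, whatever their stem.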
Peak Learning Rate 1e-5 1e-5 2e-5 1e-5 2e-5 1e-5 5e-5 1e-5
Batch Size 16 32 16 32 16 16 32 32
Learning Rate Decay Linear Linear Linear Linear Linear Linear Linear Linear
Weight Decay 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
Max Epochs 15 15 15 15 15 15 30 15
Warmup Steps proportion 6% 6% 6% 6% 6% 6% 6% 6%
Optimizer AdamW AdamW AdamW AdamW AdamW AdamW AdamW AdamW
Table 14: Downstream task fine-tuning hyper-parameters.

| Paper | Language | Pre-training Tasks | Positional Embedding | Input Representation |
|---|---|---|---|---|
| Mohseni and Tebbifakhr (2019) | Persian | MLM+NSP | Absolute | Morphemes |
| Kuratov and Arkhipov (2019) | Russian | MLM+NSP | Absolute | BPE |
| Masala et al. (2020) | Romanian | MLM+NSP | Absolute | BPE |
| Baly et al. (2020) | Arabic | WWM+NSP | Absolute | BPE |
| Koto et al. (2020) | Indonesian | MLM+NSP | Absolute | BPE |
| Chan et al. (2020) | German | WWM | Absolute | BPE |
| Delobelle et al. (2020) | Dutch | MLM | Absolute | BPE |
| Nguyen and Tuan Nguyen (2020) | Vietnamese | MLM | Absolute | BPE |
| Canete et al. (2020) | Spanish | WWM | Absolute | BPE |
| Rybak et al. (2020) | Polish | MLM | Absolute | BPE |
| Martin et al. (2020) | French | MLM | Absolute | BPE |
| Le et al. (2020) | French | MLM | Absolute | BPE |
| Koutsikakis et al. (2020) | Greek | MLM+NSP | Absolute | BPE |
| Souza et al. (2020) | Portuguese | MLM | Absolute | BPE |
| Ralethe (2020) | Afrikaans | MLM+NSP | Absolute | BPE |
| This work | Kinyarwanda | MLM: STEM+AFFIXES | TUPE-R | Morphemes+BPE |

Table 15: Comparison between KinyaBERT and other monolingual BERT-variant PLMs. We only compare with previous works that have been published in either journals or conferences as of August 2021. We excluded some extremely high-resource languages such as English and Chinese. MLM: Masked language model; NSP: Next Sentence Prediction; WWM: Whole Word Masked.
[Figure 3 panels: two BERT models, with average non-adjacent diagonal STDEV = 0.81 and 0.80; two KinyaBERT models, with average non-adjacent diagonal STDEV = 0.75 and 0.75.]
Figure 3: Visualization of the positional attention bias (normalized) of the 12 attention heads. Each attention bias (Ke et al., 2020) indicates the positional correlation between the $i^{th}$ and $j^{th}$ words/tokens in a sentence.