A creole language arises when a pidgin (a grammatically simplified language that develops between two or more groups that do not have a language in common; both pidgins and creoles are sometimes referred to as contact languages), developed by adults for use as a second language, becomes the native and primary language of their children. Although a large portion of creole languages have their roots in Western European colonialism and slavery, creole languages still serve as important lingua francas in multi-ethnic and multilingual communities, and creoles are often an important part of the local identity. Moreover, there are more than a hundred million speakers of creole languages worldwide (fig. 1), with similar needs for technological assistance, and yet creoles are still largely absent from NLP research joshi-etal-2020-state. Haitian Creole, for example, has 9.6 million speakers today; Nigerian Pidgin English has 100 million speakers; and Singaporean Colloquial English (Singlish) has 3.5 million speakers. This paper sets out to collect existing resources for these three languages and to provide language models for them. In doing so, we wish to take the nature of creole languages into account, not necessarily assuming that our best approaches to modeling non-creole languages are also best for creoles.
The nature of creole languages has been a matter of much debate in linguistics during the last decade Sessarego2020NotAG: Some see creole languages as natural stages in language change cycles (aboh15emergence), while others see them as a distinct typological class with unique characteristics, including, for example, a very simple morphology (mcwhorter98). Another feature of creoles is that they exhibit significant variation across groups of speakers (Patrick1999). winford99variation goes as far as to call creoles a continua that cannot be captured under a single grammar.
Consider the following pair of sentences from DBLP:journals/corr/BajpaiPHC17:
John sibei hum sup one.
John very buaya sia.
Here, according to the authors, both sentences are valid utterances in Singlish, and both mean John is so lecherous, but the first would more likely come from a speaker of Chinese, and the second from a Malay speaker. From this, we derive the conjecture that creole language models can benefit from learned mixtures of source languages. (Creole languages clearly differ, though, in the dynamics that affect their drift. For example, yakpo_2021 discuss two seemingly similar creole languages, Krio (Sierra Leone) and Pichi (Equatorial Guinea). Both creoles have English as their lexifier, but while Krio is spoken alongside English, Pichi is spoken alongside Spanish. As a consequence, the two creoles exhibit a clear difference: Krio has converged increasingly toward English, while Pichi has converged toward neither English nor Spanish.) Training on mixtures of source languages has been applied to language modeling of code-switched language pratapa-etal-2018-language, and it is clear from examples such as the one in fig. 2 that creole languages, at the sentence level, share commonalities with code-switched language, with vocabularies drawn from multiple source languages. To exploit synergies with learned mixtures of source languages, and to obtain robust performance across related but unseen distributions, we explore ways of training creole language models with distributionally robust objectives oren-etal-2019-distributionally. Our results below, however, show that, somewhat surprisingly, this conjecture is probably not true, at least not in a straightforward way.
We combine existing datasets and present pretrained language models for the following creole languages: Nigerian Pidgin English, Singaporean Colloquial English (Singlish), and Haitian Creole. We perform intrinsic evaluation (word prediction), as well as extrinsic evaluation (part-of-speech tagging and named entity recognition). Comparing language models trained with empirical risk minimization to language models trained with robust objectives, we observe that training with multiple related languages does not improve creole modeling; and also, somewhat surprisingly, that models trained with empirical risk minimization are superior to models robust across domains. We hence investigate why this is: in particular, whether it is due to over-parameterization, insufficient regularization (sagawa19dro), or relative distributional stability (NIPS2006_b1b0432c). We observe no significant difference for language models with fewer parameters or a higher degree of regularization. On the other hand, we find that the underlying reason might be the relative stability of the creoles, which show no significant drift.
2 Related Work
NLP research on creoles
Despite the unique features of creoles that make them an interesting application for multilingual and cross-lingual NLP, as well as the open-ended debate about the linguistic nature of creoles Sessarego2020NotAG, little attention has been devoted to creoles in NLP. (We present the works related to the specific creoles of focus in this paper in section 3.) One relevant work by Murawaki2016StatisticalMO explored the typological status of creoles and also introduced a method for statistical modeling of creole genesis. To start, the authors reported that binary SVM classification of creole and non-creole languages failed to distinguish the two classes, even though their underlying distributions are quite different. After this, they introduce a statistical model of creoles, formulated as a mixture of its influential languages and an inferred "restructurer", which is a set of possible linguistic feature distributions observed across the languages included in their experiments. Overall, this work showcases how statistical modeling methods can be useful for investigating the language evolution of creoles; however, there is no discussion of how their findings could help others extend current NLP methods to creoles.
NLP research on pidgins and code-switching
Creoles are pidgins that have consolidated over time to become a first language for new generations of speakers. The NLP literature on pidgins is even more sparse than the literature on creoles, because many pidgins that did not undergo creolization have gone extinct, such as Maritime Polynesian Pidgin 2014PidginsandCreoles. Code-switching literature, however, is also relevant, as both pidgins and creoles also draw from other languages. Importantly, pidgins differ from code-switching or mixed language in that code-switching typically only occurs between two bilingual or highly proficient speakers of two languages. Pidgins, on the other hand, are derived from multiple languages, and spoken by those who do not fluently speak every language involved. The NLP literature on code-switching is surprisingly rich, however. We refer readers to cetinoglu-etal-2016-challenges and dogruoz-etal-2021-survey for an overview.
Computational research on language evolution
Research on creoles is more common in the field of language evolution than in NLP. In particular, work on creoles in this field typically focuses on their computational modeling, their emergence nakamura09, and their evolution ModelingtheEvolutionofCreoles; Furman2020EvolvingAA. Other creole modeling efforts in this space may be more tailored towards specific linguistic insights Parkvall2008TheSO. While these studies demonstrate that work on creoles is being done in a computational space, it is difficult to apply conclusions from them to NLP, because distinct empirical assumptions are made in these two research areas.
Distributionally robust optimization
Effectively learning to model and predict underrepresented subdistributions has always been a challenge in machine learning, e.g., when predicting rare classes (scheireretal; fei-liu-2016-breaking), classes of examples from rare domains (zheng2020outofdomain), or minority groups pmlr-v80-hashimoto18a. Often, underrepresented data is ignored or learned poorly by models NEURIPS2020_1e14bfe2, compared to its over-represented counterparts. Distributionally Robust Optimization (DRO) pmlr-v80-hashimoto18a; sagawa19dro aims to minimize the loss on all sub-populations, rather than minimizing their average BenTal2013RobustSO. DRO has been particularly useful in the domain of algorithmic fairness (pmlr-v80-hashimoto18a), but has also been found to boost performance on underrepresented domains in language modeling (oren-etal-2019-distributionally) and is generally applicable in situations with drift koh2021wilds.
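As a minimal, self-contained sketch (hypothetical helper names; not the WILDS implementation), the contrast between averaging losses (ERM) and optimizing the worst group's average loss (group DRO) can be illustrated as follows:

```python
from collections import defaultdict

def erm_loss(losses):
    """Empirical risk minimization: average loss over all examples."""
    return sum(losses) / len(losses)

def group_dro_loss(losses, groups):
    """Group DRO (in the spirit of sagawa19dro): optimize the worst
    group's average loss, so that underrepresented subdistributions
    are not averaged away."""
    per_group = defaultdict(list)
    for loss, group in zip(losses, groups):
        per_group[group].append(loss)
    group_means = [sum(v) / len(v) for v in per_group.values()]
    return max(group_means)

# Toy example: one high-loss example from a rare group
losses = [0.2, 0.3, 0.25, 1.4]
groups = ["en", "en", "en", "yo"]
```

Note how the single high-loss minority example dominates the DRO objective (1.4), while ERM averages it down to roughly 0.54.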
3 Creoles and Corpora
While creole languages are spoken by hundreds of millions of people, and are often a lingua franca within a larger community, only a handful of resources currently exist for creoles. Challenges in collecting data resources for creole languages include non-standardized orthography, e.g. for Haitian Creole hewavitharana-etal-2011-cmu, and the specific contexts in which creoles are used: a creole may not be used in official capacities for news, education, and official documents, even if it is widely used in most other aspects of life shah-sanghavi17. This, of course, complicates data collection. In this work, we focus on the following creoles, as each has a diverse linguistic makeup and some existing datasets:
Nigerian Pidgin English
West Africa is one of the world’s most linguistically diverse places, with Nigeria alone having over 400 languages UFOMATA1999315. Recent work to advance African NLP has led to the creation of several datasets in Nigerian Pidgin English (agic-vulic-2019-jw300; Ogueji2019PidginUNMTUN; NdubuisiObi2019WetinDW; caron-etal-2019-surface; oyewusi2021naijaner; adelani2021masakhaner; Oyewusi2020SemanticEO), which makes it particularly well-resourced in comparison to other creole languages. Nigerian Pidgin English, also referred to as simply Nigerian Pidgin, can further be understood as a member of the larger family of West African Pidgins, as many West African countries have their own unique variation of this creole, but all share influences from many of the same languages, such as Igbo, Hausa, and Yoruba.
The first sizeable Nigerian Pidgin dataset comes from agic-vulic-2019-jw300, who collected parallel text from several magazines written by a religious society, which have parallel translations in many languages. This dataset has been utilized in the first attempts to develop baselines for machine translation of Nigerian Pidgin English Ogueji2019PidginUNMTUN; Ahia2020TowardsSA. Ogueji2019PidginUNMTUN also introduced the first corpus of Nigerian Pidgin English to further facilitate machine translation from Nigerian Pidgin into English. NdubuisiObi2019WetinDW introduced a code-switching corpus of news articles and online comments in both Nigerian Standard English and Nigerian Pidgin. In this work, they discuss some challenges of working with Nigerian Pidgin, such as non-standardized spelling. They also find that different topics prompt code-switching to Nigerian Pidgin over Nigerian Standard English. More task-specific Nigerian Pidgin datasets have been introduced for Universal Dependency parsing caron-etal-2019-surface, named entity recognition oyewusi2021naijaner; adelani2021masakhaner; Oyewusi2020SemanticEO, and speech recognition Bigi2017DevelopingRF; Ajisafe2020TowardsET.
Singaporean Colloquial English, also known as Singlish, has English as a source language, but also draws parts of its grammar and vocabulary from languages such as Mandarin, Cantonese, Hakka, Hokkien, Malay, and Tamil. Presently, few publicly available datasets exist in Singlish, as this creole is primarily used for informal conversation between people and not for official purposes. The largest relevant corpus is The National University of Singapore SMS Corpus from SinglishSMS, which consists of over 67,000 text messages written by Singaporeans. Qualitatively, we observed that this dataset is much closer to Standard English, albeit with noise from outdated SMS language, than the example provided in fig. 2; but, within this data, we still observe many hallmark features of Singlish, such as discourse markers and vocabulary from relevant languages. tan-etal-2020-mind have also released a webcrawler that collects posts from a popular Singaporean forum about hardware, where discussion is often in Singlish. They use the resulting Singlish corpus as part of their work to investigate the role of inflection for NLP with non-standard forms of English. Beyond plain text corpora, wang-etal-2017-universal introduced the first Singlish Universal Dependency dataset, which was further expanded upon in WangSinglishUD. chau-etal-2020-parsing used this dataset as a low-resource language test case for their method of pretraining mBERT devlin-etal-2019-bert. Finally, a few studies have been done on private datasets for sentiment analysis DBLP:journals/corr/BajpaiPHC17; SinglishSenticNet, and polarity detection SenglishSenticPatterns.
| Language(s) | Dataset | Domain |
| en, fr, es, pt, yo, zh, ta | WMT-News 2020 | news |
| ms | Malay 30k News | news |
| Nigerian Pidgin | PidginUNMT Corpus | news |
| Singlish | Singapore SMS Corpus | sms |
| Haitian Creole | Disaster Response Corpus | sms |
Haitian Creole exhibits a combination of French with many West African languages (e.g. Igbo, Yoruba, Fon, etc.). Haitian Creole seized the attention of the machine translation community in the aftermath of the 2010 earthquake crisis in Haiti, during which Munro10crowdsourcedtranslation; journals-ir-Munro13 developed the Haitian Disaster Response Corpus. This is a parallel Haitian–English dataset of SMS messages related to the crisis, created to enable rapid development of machine translation systems to assist the crisis response. This dataset was included in the 2011 Workshop for Machine Translation WMT:2011, in conjunction with data from the medical domain, newswire, and a Haitian glossary (http://www.speech.cs.cmu.edu/haitian/text/). Several studies used this data to extend methods in statistical machine translation Hu2011TheVO; hu-EtAl:2011:WMT; rcostajussa-banchs:2011:WMT, as well as spell checking and data cleaning stymne:2011:WMT.
4 Datasets for Creole Language Models
We experiment with training language models for creoles with a mixture of creole data, and additional data from languages influential to each creole.
We begin with the creole datasets noted in table 1, and combine them with data from higher-resource languages that have been influential to each creole. We combine a fixed number of these examples into a Mixed-Language dataset, as described in table 2. The Mixed-Language dataset for each creole includes information about the original language of each sentence, so that we can form language-specific groups for DRO (see section 5.1 for more details on DRO grouping). The total number of train and development examples was determined by the number of sentences in the base (creole) dataset for a 95-5 train-development split. Singlish had equal representation of each language, with 53,006 examples per language, including Singlish. Haitian Creole also had equally represented languages, with 8,192 examples for Haitian and each additional language. For the Nigerian Pidgin Mixed-Language dataset, English, Portuguese, and Nigerian Pidgin were composed equally with 67,615 examples each, and Yoruba with only 27,260 examples due to the small size of the original data. Thus, we included 95% of the Yoruba WMT-News 2020 dataset.
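A sketch of how such a Mixed-Language dataset could be assembled (hypothetical helper; the actual sampling and split sizes follow the description above) might look like this. Each example is tagged with its source language, so the tags can later serve as group metadata for DRO:

```python
import random

def build_mixed_dataset(creole_sents, aux_corpora, dev_frac=0.05, seed=0):
    """Combine creole sentences with samples from auxiliary languages,
    tagging each example with its source language. Auxiliary languages
    are capped at the creole's size; smaller corpora (e.g. Yoruba)
    contribute all they have."""
    n = len(creole_sents)
    examples = [(s, "creole") for s in creole_sents]
    for lang, sents in aux_corpora.items():
        k = min(n, len(sents))
        examples += [(s, lang) for s in sents[:k]]
    random.Random(seed).shuffle(examples)
    n_dev = int(len(examples) * dev_frac)  # e.g. a 95-5 split
    return examples[n_dev:], examples[:n_dev]  # (train, dev)

train, dev = build_mixed_dataset(
    ["abeg make we go"] * 100,
    {"en": ["hello there"] * 100, "yo": ["bawo ni"] * 40},
)
```

With 240 toy examples and dev_frac=0.05, this yields 228 training and 12 development examples.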
Language identification within creoles
As we will see in section 6, training the language models on the Mixed-Language dataset with DRO fails to produce positive results. Following from this, we also create a Creole-Only dataset, composed of only the creole examples. In order to sort the creole examples into distinct groups for DRO, we label each creole example by the collection of the selected languages present in the sentence, as determined by a language identification algorithm (https://fasttext.cc/blog/2017/10/02/blog-post.html). Consider the following examples from their respective Creole-Only datasets:
"treat him makah lah"
en: 88.19%, ms: 4.34%, ta: 0.04%, and zh: 0.01%
"Pikin wey like to play wit wetin no dey common and sabi one particular subject reach ground"
en: 87.46%, pt: 0.23%, and yo: 0.03%
"Infomation sou kestion te tranble a ak lekol"
fr: 3.50%, es: 0.08%, and yo: 0.01%
While the language identification algorithm is not perfect, the confidence scores for the languages still reflect the high-level trends for the creole examples, namely, that English and Malay ("makan") are indeed present in the Singlish sample, and that English and Portuguese ("pikin", "sabi") are present in the Nigerian Pidgin example. However, for the Haitian Creole example, we see that none of our chosen languages received very high scores from the language identification algorithm, which raises the question: were there other languages in which the language identification algorithm had higher confidence?
To ensure that the languages we chose are well-represented in the creole examples, we looked at the distribution of the identified languages across examples in our Creole-Only datasets in fig. 3. From this, we observe that choosing to identify languages specifically related to the creole (i.e. the same languages we included in the Mixed-Language datasets) is more reliable than trusting the language identifier to pick the five languages with the highest confidence – there appears to be a bias toward falsely predicting European languages, even on creole data unrelated to these languages, as well as some strange outliers, such as Tagalog being the third most commonly predicted language for Haitian Creole sentences. Also, we see that Haitian Creole itself was a commonly identified language, which could explain the low confidence scores for French and Spanish in the example above. Finally, among our specifically chosen languages for the creoles, we see that, although the source language (e.g. English or French) is most dominant, the other languages are still well distributed, with the exception of Yoruba. We surmise that the densely distributed, low-confidence scores for Yoruba can probably be attributed to Yoruba being a lower-resourced language.
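To make the labeling step concrete, a small hypothetical helper can restrict a language identifier's output (here in fastText's `__label__xx` format) to the chosen set of related languages, applying the 0.1% confidence threshold used for grouping:

```python
def related_language_scores(labels, probs, chosen, threshold=0.001):
    """Keep confidence scores only for a fixed set of languages related
    to the creole, rather than trusting the identifier's top-5 output
    (which skews toward European languages). Scores below the 0.1%
    threshold are dropped."""
    scores = {}
    for label, prob in zip(labels, probs):
        lang = label.replace("__label__", "")  # fastText label prefix
        if lang in chosen and prob >= threshold:
            scores[lang] = prob
    return scores

# Illustrative identifier output for the Singlish example above
# (scores as reported in the text; Tagalog is a spurious prediction)
labels = ("__label__en", "__label__ms", "__label__tl", "__label__ta")
probs = (0.8819, 0.0434, 0.0050, 0.0004)
```

Here Tagalog is discarded because it is not in the chosen set, and Tamil because it falls below the threshold.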
5 Experiments
In this section, we detail our experimental setups. We make our code and models publicly available (https://github.com/hclent/creole-dro).
Using the datasets described above, we conduct several experiments to assess how different training strategies affect the modeling of creoles. We conduct all the experiments on both English BERT and multilingual mBERT models devlin-etal-2019-bert. As our baseline, we consider pretrained BERT and mBERT models, and evaluate them on our development splits for the creoles. We then assess the effectiveness of two popular training strategies: Empirical Risk Minimization (ERM) and Distributionally Robust Optimization (DRO). In this case, ERM consists of masked language modeling over all the data points in each dataset, in a similar fashion as done during pretraining.
For DRO, we utilize the WILDS library (DBLP:journals/corr/abs-2012-07421), which uses metadata associated with the input data to form the groups for DRO. In our case, we investigate three grouping strategies: grouping with language information as metadata (DRO-Language), as well as two additional control experiments. In the first control experiment, we assign all training examples to the same group (DRO-One), such that DRO optimizes over only one large group. In the second control experiment, we randomly assign examples to one of four groups (DRO-Random). The motivation for these control experiments is to ensure that improvements for DRO are actually grounded in the language information, and not an artifact of the WILDS grouping algorithm.
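The three grouping strategies amount to a simple metadata assignment; a sketch (hypothetical function names, not the WILDS API) could be:

```python
import random

_rng = random.Random(0)

def assign_group(example_lang, strategy):
    """Return a DRO group id under the three strategies compared here:
    language metadata, a single group, or one of four random groups."""
    if strategy == "DRO-Language":
        return example_lang        # group by source language
    if strategy == "DRO-One":
        return 0                   # everything in one large group
    if strategy == "DRO-Random":
        return _rng.randrange(4)   # one of four arbitrary groups
    raise ValueError(f"unknown strategy: {strategy}")
```

If DRO-Random or DRO-One matched DRO-Language, any gains would be attributable to the grouping machinery rather than the language information.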
In DRO-Language, information about the examples’ language makeup is used to determine the groups. In Mixed-Language, we rely on our knowledge of where the examples were sampled from, but in Creole-Only, we subdivide the creole examples depending on their etymology. Specifically, grouping is done as follows in our two data setups outlined in section 4:
Mixed-Language: Here, grouping is done over the languages in the training data. For example, in the case of Nigerian Pidgin, if a sentence originally comes from the Yoruba corpus, it is assigned to the Yoruba group, and similarly for Nigerian Pidgin and the other languages listed in table 2 for each creole.
Creole-Only: Here, as we only have the creole samples, grouping is done over the confidence scores for the collection of influential languages (see section 4). An example is assigned to one of 2^n groups, representing the possible combinations of detected languages in a sentence, where n is the number of languages listed in table 2 (Langs) for each creole. The presence of a language is derived from its confidence score under the language identifier: if there is a confidence of 0.1% or higher that the language is represented in the sentence, it is considered present.
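A sketch of this grouping (hypothetical helper name), where each of the n languages contributes one bit of the group index, so that the 2^n groups enumerate all language combinations:

```python
def group_index(scores, langs, threshold=0.001):
    """Map a creole sentence to one of 2**len(langs) groups: bit i is
    set iff langs[i] clears the 0.1% confidence threshold."""
    index = 0
    for i, lang in enumerate(langs):
        if scores.get(lang, 0.0) >= threshold:
            index |= 1 << i
    return index

# Singlish: languages en, ms, ta, zh -> 2**4 = 16 possible groups.
# With en and ms present, the sentence lands in group 0b0011 = 3.
langs = ["en", "ms", "ta", "zh"]
scores = {"en": 0.8819, "ms": 0.0434, "ta": 0.0004, "zh": 0.0001}
```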
[Table: intrinsic evaluation results for Nigerian Pidgin, Singlish, and Haitian Creole]
We perform two types of evaluation: intrinsic – based on the MLM training objective – and extrinsic – on traditional downstream NLP tasks.
We evaluate our language models intrinsically with the following metrics:
Precision at k (P@k): Precision of the language model in predicting a random masked token per sentence. This allows us to assess general performance on the training objective. In the following, we report P@1; results for larger k are in the appendix.
Dictionary-based precision at k (P@k): Due to their nature, most of the words in a creole sentence come from the corresponding source language (see fig. 3). Hence, for a more principled measurement of precision, we collect online dictionaries of our creoles (Nigerian Pidgin: http://naijalingo.com/; Haitian Creole: https://kreyol.com/dictionary.html). We perform the same MLM task as above, but this time only mask words belonging to the creole dictionaries. By doing so, we obtain a more accurate measure of what the LMs have learned. We again report results at k=1 here, and refer the reader to the appendix for larger k.
Mean pseudo-log-likelihood score (PLL): Following recent studies pmlr-v101-shin19a; wang2019bert; salazar-etal-2020-masked, we measure the pseudo-log-likelihood score of a sentence S under an MLM, given by summing the conditional log probabilities log P(w_t | S_{\t}; θ) of each token w_t in S. These are obtained in BERT by replacing w_t with the special [MASK] token. Here, we report the mean score given by:

PLL(C) = (1 / |C|) Σ_{S ∈ C} Σ_{t=1}^{|S|} log P(w_t | S_{\t}; θ),

where C denotes the evaluation corpus and θ denotes a model’s parameters.
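As a toy illustration of the two intrinsic metrics (hypothetical helper names, dummy scores; a real evaluation would obtain the probabilities from BERT by masking each token in turn):

```python
import math

def precision_at_k(scores, gold_token, k=1):
    """P@k for one masked position: 1 if the gold token is among the
    k highest-scoring vocabulary items, else 0. The reported metric
    averages this over sentences."""
    topk = sorted(scores, key=scores.get, reverse=True)[:k]
    return 1.0 if gold_token in topk else 0.0

def mean_pll(corpus_probs):
    """Mean PLL over a corpus: for each sentence, sum the conditional
    log probabilities log P(w_t | S_\\t); then average the per-sentence
    sums over the corpus (the formula above)."""
    plls = [sum(math.log(p) for p in sent) for sent in corpus_probs]
    return sum(plls) / len(plls)

# Toy MLM scores for one masked Singlish token
scores = {"lah": 2.1, "sia": 1.7, "one": 0.4}
```

Note that P@k is forgiving of near-misses as k grows, whereas PLL penalizes every low-probability token in proportion.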
We also perform an extrinsic evaluation of our models on downstream tasks, for the datasets that are available. Specifically, we train and evaluate models for Nigerian Pidgin NER and POS tagging with Universal Dependencies (nivre20universal, UPOS), as well as Singlish UPOS. We fine-tune our pretrained language models on the training sets of these two tasks and evaluate them on the corresponding test sets.
We write our code in PyTorch pytorchpaper. In particular, for language model training, we rely on the HuggingFace Transformers library transformers_paper, and on the WILDS library DBLP:journals/corr/abs-2012-07421 for DRO. Models are fine-tuned for 100,000 steps with a batch size of 16. For downstream tasks, we use MaChAmp (van-der-goot-etal-2021-massive) and train our models for a fixed number of epochs, selecting the best checkpoints based on performance on the dev sets. Unless otherwise specified, we use the default hyperparameters. Our experiments are run on one NVIDIA TitanX GPU in a shared cluster.
6 Results and Analyses
The main finding of the intrinsic evaluation is that ERM outperforms DRO for all grouping strategies across all metrics. We also observe that dictionary-based P@k is a more difficult task than standard P@k with randomly masked tokens (see appendix A for full results with both BERT and mBERT). Moreover, we find that the DRO models often have much higher perplexity than ERM. Finally, the results show that, between the Mixed-Language and Creole-Only experiments, the latter performed better, demonstrating that training on additional data was not useful for learning language models for creoles. While we only report results for BERT here, we observe the same patterns with mBERT (see appendix A).
Here, we observe the same trend as in the intrinsic evaluation: ERM performs better than DRO (see table 4). Although for Nigerian Pidgin DRO-Language performs better than ERM on both NER and UPOS, the gap between the scores is too small to draw concrete conclusions from.
[Table 4: downstream results with BERT — NER [F1] and UPOS [Acc] for Nigerian Pidgin, and UPOS [Acc] for Singlish]
There are several factors that could have influenced the DRO models to perform worse than ERM. We explore their effects below.
Over-parameterization is known to be problematic for DRO sagawa19dro. In order to investigate the role of over-parameterization in our experiments, we ran additional Mixed-Language experiments on Nigerian Pidgin English with BERT models of different sizes, among them TinyBERT jiao-etal-2020-tinybert. The results in table 5 demonstrate that over-parameterization was not a leading cause of DRO's failure; otherwise, we would expect smaller BERT versions to perform relatively better compared to the corresponding ERM runs. Instead, we see that standard BERT works fine for this task, and over-parameterization is not the cause of DRO's poor performance in our experiments.
sagawa19dro also discuss how a lack of regularization leads to problems for DRO, and how increased regularization is necessary for worst-group generalization. To investigate this potential weakness in our experiments, we run additional experiments using BERT on Mixed-Language data for Nigerian Pidgin English, trying different weight decay values (table 6). If our DRO models were suffering from insufficient regularization, we would expect increasing the weight decay to boost performance. However, we find no meaningful effect of this hyperparameter, which leads us to believe that insufficient regularization is not a driving factor in the underperformance of DRO compared to ERM.
Drift and creole stability
Creole languages arise from pidgins, which are initially developed for use as a second language. Recent years have seen renewed interest in the classic question of the relationship between pidgin and creole formation and second language acquisition Plag2009CreolesAI. To investigate the matter of creole stability, we follow NIPS2006_b1b0432c and calculate the proxy A-distance (PAD) between different domains of creole data (see table 7). Specifically, we train an SVM on the BERT encodings (our code is adapted from https://github.com/rpryzant/proxy-a-distance). Our A-distance results suggest that creole languages do not exhibit more drift than English when the data are comparable. This potentially explains why distributionally robust language models do not outperform regular language models trained with empirical risk minimization objectives.
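Once the domain classifier's test error is known, the PAD computation itself reduces to a one-line formula; a sketch under the standard definition from NIPS2006_b1b0432c:

```python
def proxy_a_distance(domain_clf_error):
    """Proxy A-distance: PAD = 2 * (1 - 2 * err), where err is the test
    error of a classifier (here, an SVM over BERT encodings) trained to
    separate examples from the two domains. PAD near 2 means the domains
    are easy to tell apart; PAD near 0 means they heavily overlap."""
    return 2.0 * (1.0 - 2.0 * domain_clf_error)
```

For instance, a classifier that cannot beat chance (error 0.5) gives PAD 0, while a perfect separator (error 0) gives the maximum of 2; the values in table 7 fall between these extremes.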
| Language | Domain A | Domain B | PAD |
| English | Disaster Response Corpus | Newswire | 1.75 |
| Haitian Creole | Disaster Response Corpus | Newswire | 1.47 |
7 Conclusion
In this paper, we bring creole languages to the attention of the NLP community. We collect data and train baseline language models for three creoles, and evaluate these models on the downstream tasks of part-of-speech tagging and named entity recognition. Based on previous work suggesting the instability of creole languages winford99variation; Patrick1999, we explore the impact of using more robust learning objectives for masked language modeling of creoles, but our results show that vanilla empirical risk minimization is superior. We show that this is not the result of over-parameterization or lack of regularization, but instead suggest it is a result of the relative stability of creole languages. We note that it still remains possible that significant improvements could be achieved by modeling dynamics specific to creole languages, i.e., the processes that govern their development, including social factors holm_2000 and second language acquisition dynamics Plag2009CreolesAI.
We would like to thank the anonymous reviewers and Samson Tan for their helpful feedback. We would also like to thank Robert Monarch and Chris Callison-Burch for their help with the Haitian data. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 801199 (for Heather Lent and Emanuele Bugliarello), the Swedish Research Council Grant 2020-00437 (for Miryam de Lhoneux), and the Google Research Award (for Heather Lent and Anders Søgaard).
Appendix A