The technique of fine-tuning a self-supervised language model has become ubiquitous in Natural Language Processing, because models trained in this way have advanced evaluation scores on many tasks Radford et al. (2018); Peters et al. (2018); Devlin et al. (2019). Arguably the most successful architecture has been BERT Devlin et al. (2019) which uses stacks of transformers to predict the identity of a masked token and to predict whether two sequences are contiguous. It has spawned many variants Liu et al. (2019); Lan et al. (2019) and the nascent field of BERTology (Jawahar et al., 2019; Chi et al., 2020; Rogers et al., 2020). Initially, BERT models were released for English and Chinese as well as a multilingual model (mBERT), trained on Wikipedia text from 104 languages. Since then, a number of monolingual BERT models have been released Nozza et al. (2020). In this paper, we describe gaBERT, a monolingual BERT model for the Irish language.
Although Irish is the first official language of the Republic of Ireland, only a minority, 1.5% of the population or 73,803 people CSO (2016), use it in their everyday lives outside of the education system. Because Irish is the less dominant language in a bilingual community, the availability of Irish language technology is important: it makes it easier for Irish speakers, and those who wish to be Irish speakers, to use the language in practice. Building on recent progress in Irish NLP Uí Dhonnchadha and Van Genabith (2006); Lynn et al. (2012, 2015); Walsh et al. (2019), we release gaBERT, an Irish BERT model trained on approximately 7.9 million sentences. We envision gaBERT being useful in many downstream applications, including grammar checking, computer-aided language learning, predictive text, search and translation.
While there is evidence to suggest that dedicated monolingual models can be superior to a single multilingual model for within-language downstream tasks (de Vries et al., 2019; Virtanen et al., 2019; Farahani et al., 2020), other studies Wu and Dredze (2020); Rust et al. (2020); Chau et al. (2020) suggest that mBERT is a good choice for low-resourced languages. We compare gaBERT to mBERT and to the monolingual Irish WikiBERT, trained on Irish language Wikipedia, on the task of universal dependency parsing. We find that parsing accuracy improves when using gaBERT – by over 4 LAS points over mBERT and WikiBERT. We then use the gaBERT training data for continued pretraining of mBERT, resulting in a recovery of 2.6 LAS points over the off-the-shelf version. We also compile a small test set which compares the models on their ability to predict a masked token. We abstract away from vocabulary differences between the models by masking pronouns, which are in all the models’ vocabularies. This evaluation clearly reveals the benefit of using our Irish data.
When training gaBERT, we examine the effect of vocabulary size, corpus filtering type and tokenisation model, finding that document-level filtering, with a SentencePiece tokeniser (Kudo and Richardson, 2018) and a vocabulary size of 30k, is the best configuration. We release the code used to run our experiments through GitHub (https://github.com/jbrry/Irish-BERT) and our models through the Hugging Face (Wolf et al., 2020) model repository (https://huggingface.co/DCU-NLP).
We use the following to train gaBERT:
The New Corpus for Ireland (Kilgarriff et al., 2006), which contains a wide range of texts in Irish, including fiction, news reports, informative texts, and official documents.
The unshuffled Irish portion of the OSCAR Corpus (Ortiz Suárez et al., 2019), which is a collection of documents from CommonCrawl sorted by language ID.
The Irish portion of ParaCrawl v7 (ga-en bitext pair) (Bañón et al., 2020), which is a collection of parallel corpora crawled from multi-lingual websites, with the sentences aligned and then filtered.
Text from Irish Wikipedia, an online encyclopedia with articles covering a wide variety of topics. We use the slightly older articles from https://dumps.wikimedia.org/gawiki/20210520/ (27.7 MB).
The sentence counts for each corpus are listed in Table 1, after tokenisation and segmentation but before the filtering described below. Some of these corpora overlap with others, e.g. the CoNLL17 corpus contains CommonCrawl data that is also used in the OSCAR corpus, and Irish Wikipedia data that is also used in the Wikipedia corpus. See Appendix A for more information on the content of these corpora, including license information.
| Corpus | Num. Sents | Size (MB) |
3 Corpus Pre-processing
This section describes the pre-processing steps required by each of the corpora included in the training data.
The CoNLL17 corpus is already tokenised, as the files are converted from CoNLL-U format to plain text where each row in the text column of a CoNLL-U sentence is treated as a token.
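The conversion from CoNLL-U to plain text can be sketched as follows. This is a minimal illustration rather than the script used for gaBERT: it assumes that the "text column" is the second CoNLL-U column (FORM), and the function name `conllu_to_plain_text` and the choice to skip multiword-token ranges and empty nodes are assumptions of the sketch.

```python
from typing import List


def conllu_to_plain_text(conllu: str) -> List[str]:
    """Return one line of space-separated tokens per CoNLL-U sentence."""
    sentences: List[str] = []
    tokens: List[str] = []
    for line in conllu.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(" ".join(tokens))
                tokens = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") carry no tokens.
            continue
        columns = line.split("\t")
        # Assumption: skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not columns[0].isdigit():
            continue
        # Assumption: column 2 (FORM) is the "text column"; each row is one token.
        tokens.append(columns[1])
    if tokens:
        sentences.append(" ".join(tokens))
    return sentences


sample = (
    "# text = Tá sé fuar.\n"
    "1\tTá\tbí\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tsé\tsé\tPRON\t_\t_\t1\tnsubj\t_\t_\n"
    "3\tfuar\tfuar\tADJ\t_\t_\t1\txcomp:pred\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)
print(conllu_to_plain_text(sample))
```

Because every row becomes one whitespace-separated token, the output needs no further tokenisation, which is consistent with this corpus bypassing the tokeniser of Section 3.1.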
IMT, OSCAR and ParaCrawl
The text files from the IMT, OSCAR and ParaCrawl corpora contain raw sentences requiring tokenisation. We describe the tokenisation process for these corpora in Section 3.1.
For the Wikipedia articles, the Irish Wikipedia dump is downloaded and the WikiExtractor tool (https://github.com/attardi/wikiextractor) is then used to extract the Wikipedia articles to plain text. Article headers are included in the extracted text files. Once the articles have been converted to plain text, they are tokenised using the tokeniser described in Section 3.1.
As many of the NCI segments marked up with <s> tags contain multiple sentences, we treat each segment boundary as a sentence boundary and further split segments into sentences recursively: we find the best split point according to a set of heuristics described in Appendix B.2, split the segment in two at that point, and apply the same procedure to each part until no suitable split point is found.
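The recursive procedure can be illustrated with a short sketch. The heuristics of Appendix B.2 are not reproduced here; `find_split` is a hypothetical stand-in that only accepts boundaries after sentence-final punctuation followed by a capitalised token and prefers the candidate closest to the middle of the segment. Only the recursion itself mirrors the description above.

```python
from typing import List, Optional

SENTENCE_FINAL = {".", "!", "?"}


def find_split(tokens: List[str]) -> Optional[int]:
    """Hypothetical heuristic: return the index at which to cut, or None."""
    candidates = [
        i + 1
        for i in range(len(tokens) - 1)
        if tokens[i] in SENTENCE_FINAL and tokens[i + 1][:1].isupper()
    ]
    if not candidates:
        return None
    middle = len(tokens) / 2
    # Prefer the candidate that divides the segment most evenly.
    return min(candidates, key=lambda i: abs(i - middle))


def split_segment(tokens: List[str]) -> List[List[str]]:
    """Split a tokenised segment recursively until no split point is found."""
    cut = find_split(tokens)
    if cut is None:
        return [tokens]
    # Both parts are non-empty and shorter than the input, so this terminates.
    return split_segment(tokens[:cut]) + split_segment(tokens[cut:])


segment = "Tá sé fuar . Tá sí te . Tá siad anseo .".split()
for sentence in split_segment(segment):
    print(" ".join(sentence))
```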
3.1 Tokenisation and Segmentation
Raw texts from the IMT, OSCAR, ParaCrawl and Wikipedia corpora are tokenised and segmented with UDPipe (Straka and Straková, 2017) trained on a combination of the Irish-IDT and English-EWT corpora from version 2.7 of the Universal Dependencies (UD) treebanks Zeman et al. (2020). We include the English-EWT treebank in the training data to expose the tokeniser to more instances of punctuation symbols, which are prevalent in our pre-training data. This also has the benefit of supporting the tokenisation of code-mixed data. We upsample the Irish-IDT treebank ten times to offset the larger size of the English-EWT treebank. This tokeniser is applied to all corpora apart from the NCI, which is already tokenised by Kilgarriff et al. (2006), and the CoNLL17 corpus, which is already tokenised in CoNLL-U format.
4 Experimental Setup
After initial corpus pre-processing, all corpora are merged and we use the WikiBERT pipeline (Pyysalo et al., 2020) to create pretraining data for BERT. We experiment with four corpus filter settings and seven subword unit vocabularies. We perform a greedy search for the optimal parameters, first choosing the filter settings, then the size of the BERT vocabulary and finally considering a WordPiece vocabulary instead of a SentencePiece vocabulary and the union of the two vocabularies.
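The greedy search described above fixes one parameter at a time, in order, keeping the best value found so far for all earlier parameters. The sketch below illustrates the idea only. `greedy_search` and `toy_score` are hypothetical names, the starting defaults are assumptions, and `toy_score` returns synthetic numbers in place of pre-training a BERT model and fine-tuning the parser five times.

```python
from statistics import median
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, object]


def greedy_search(
    stages: Sequence[Tuple[str, Sequence[object]]],
    defaults: Config,
    score: Callable[[Config], List[float]],
) -> Config:
    """Choose each parameter in turn by the median score of repeated runs."""
    best = dict(defaults)
    for name, candidates in stages:
        best[name] = max(
            candidates,
            key=lambda value: median(score({**best, name: value})),
        )
    return best


def toy_score(cfg: Config) -> List[float]:
    # Synthetic stand-in for five parser runs; the numbers are arbitrary.
    base = {"none": 0.0, "document": 1.0, "opus-basic": 0.5,
            "opus-basic-char-lang": 0.4}[cfg["filter"]]
    base -= abs(cfg["vocab_size"] - 30_000) / 100_000
    base += {"sentencepiece": 0.2, "wordpiece": 0.1, "union": 0.0}[cfg["tokeniser"]]
    return [base + delta for delta in (-0.1, -0.05, 0.0, 0.05, 0.1)]


stages = [
    ("filter", ["none", "document", "opus-basic", "opus-basic-char-lang"]),
    ("vocab_size", [15_000, 20_000, 30_000, 40_000, 50_000]),
    ("tokeniser", ["sentencepiece", "wordpiece", "union"]),
]
# Assumed starting point; the defaults used before each stage are not stated.
defaults = {"filter": "none", "vocab_size": 30_000, "tokeniser": "sentencepiece"}
print(greedy_search(stages, defaults, toy_score))
```

A greedy search of this kind evaluates 4 + 5 + 3 candidate settings instead of the 60 combinations a full grid would need, at the cost of ignoring interactions between parameters.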
4.1 Corpus Filtering
The WikiBERT pipeline contains a number of filters for document-level filtering. As we are working with data sources where there may not be clear document boundaries, or where there are no line breaks over a large number of sentences, document-level filtering may be inadequate for such texts. Consequently, we also experiment with using the OpusFilter library (Aulamo et al., 2020), which filters out individual sentences, thereby giving us the flexibility of filtering noisy sentences while not discarding full documents.
We train different BERT models with varying levels of filtering:
No-filter: all collected texts are included in the pre-training data.
Document-filter: the default document-level filtering used in the WikiBERT pipeline.
OpusFilter-basic: we use OpusFilter with basic filtering described in Appendix B.3.
OpusFilter-basic-char-lang: we use OpusFilter with basic filtering as well as character-script and language ID filters described in Appendix B.3.
For each of these filtering configurations, we train a BERT model for 500,000 pre-training steps using all of the sentences which remain after filtering.
4.2 Vocabulary Creation
To create the vocabulary for our models, we experiment with using the SentencePiece tokeniser and the WordPiece tokeniser. We take the highest-performing model from the filtering experiments in Section 4.1 and use that filtering setting for our vocabulary experiments. For the SentencePiece tokeniser, we use vocabulary sizes of 15K, 20K, 30K, 40K and 50K and train a BERT model on the data produced by those vocabularies.
We then train a WordPiece tokeniser, keeping the vocabulary size that works best for the SentencePiece tokeniser, and train a corresponding BERT model. Furthermore, we train a BERT model using the union of the two vocabularies, i.e. our best-performing SentencePiece vocabulary and the WordPiece vocabulary.
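One way to form the union of two vocabularies is sketched below. How the union was built for gaBERT is not specified here, so this is an assumption: both vocabularies are taken to be lists of entries in the same BERT vocabulary format, the order of the first list is preserved, and unseen entries from the second list are appended. The function name `union_vocab` and the toy entries are illustrative only.

```python
from typing import List


def union_vocab(first: List[str], second: List[str]) -> List[str]:
    """Return the entries of `first` followed by entries only found in `second`."""
    seen = set(first)
    merged = list(first)
    for piece in second:
        if piece not in seen:
            seen.add(piece)
            merged.append(piece)
    return merged


# Toy vocabularies; "##" marks a word-internal piece as in BERT vocabularies.
sentencepiece_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "agus", "##anna"]
wordpiece_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "agus", "##acht"]
merged = union_vocab(sentencepiece_vocab, wordpiece_vocab)
print(len(merged), merged[-1])
```

Since two vocabularies trained on the same data share most of their entries, the union is only slightly larger than either input.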
4.3 BERT Pretraining Parameters
We use the original BERT implementation of Devlin et al. (2019). For the development experiments, we train our BERT model for 500,000 steps with a sequence length of 128. We use whole word masking and the default hyperparameters and model architecture of BERT-Base Devlin et al. (2019), except for a lower batch size of 32 in order to train on NVIDIA RTX 6000 GPUs with 24 GB RAM. This corresponds to a bidirectional transformer (Vaswani et al., 2017) with 12 layers, 12 attention heads, a hidden layer size of 768, and GELU activations (Hendrycks and Gimpel, 2016).
4.4 Final Models
For the final gaBERT model, we train for 900k steps with sequence length 128 and a further 100k steps with sequence length 512. We train on a TPU-v2-8 with 128GB of memory on Google Compute Engine (TPU access was kindly provided to us through the Google Research TPU Research Cloud) and increase the batch size to 128.
Furthermore, we train ELECTRA Clark et al. (2020) on the same data on a TPU-v3-8. ELECTRA replaces the MLM pre-training objective of BERT with a binary classification task discriminating between authentic tokens and alternative tokens generated by a smaller model for higher training efficiency. We use the default settings of the “Base” configuration of their implementation (https://github.com/google-research/electra). As with BERT, we train for 1M steps and evaluate every 100k steps. However, we train on more data per step as the batch size is increased to 256 and a sequence length of 512 is used throughout.
5 Evaluation Measures
This section describes the evaluation measures we use to assess the BERT models in our experiments.
5.1 Dependency Parsing
The evaluation measure we use to make development decisions is dependency parsing LAS. To obtain this measure, we fine-tune a given BERT model on the task of dependency parsing and measure LAS on the development set of the Irish-IDT treebank Lynn and Foster (2016); McGuinness et al. (2020) in version 2.8 of UD. As parser performance also depends on the initialisation of the additional layers added by the parser, we report the median of five runs with different random initialisation, and in figures, we show a box plot. (The box plots do not show the variation to be expected when pre-training BERT with different random initialisation multiple times, as obtaining multiple BERT models for each setting would be too computationally expensive.)
The Irish-IDT consists of 4,005 training sentences, 451 sentences in the development set and 454 test sentences. For the dependency parser, we use a multitask model which uses a graph-based parser with biaffine attention (Dozat and Manning, 2016) as well as additional classifiers for predicting POS tags and morphological features. We use the AllenNLP (Gardner et al., 2018) library to develop our multitask model. The trained BERT model is used as an encoder where the BERT representations are passed to the appropriate multitask components.
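For readers unfamiliar with the measure, the sketch below shows how LAS and the median over runs can be computed. It is a simplified illustration, not the evaluation script used in the experiments: it assumes gold and predicted tokens are already aligned and compares full dependency labels. The function name `las` and all values in the example are made up.

```python
from statistics import median
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def las(gold: List[Arc], predicted: List[Arc]) -> float:
    """Percentage of tokens whose head and label are both correct."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
predicted = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "punct")]
print(las(gold, predicted))

# Placeholder scores for five runs with different random seeds (not real results).
runs = [70.0, 72.0, 71.0, 69.0, 73.0]
print(median(runs))
```

Reporting the median rather than the mean makes the summary less sensitive to a single unusually good or bad initialisation.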
5.2 Cloze Test
To compile a test set for a cloze task, 100 strings of Irish text (4–77 words in length) containing the pronouns ‘é’, ‘í’ or ‘iad’ are selected from Irish corpora and online publications. One of these pronouns is masked in each string, providing a context cue for the cloze test. The pronouns ‘é’ (‘him/it’), ‘í’ (‘her/it’) and ‘iad’ (‘them’) usually appear as single words, bound by whitespace or punctuation. Less commonly, they appear with an initial mutation e.g. ‘hí’ or suffix e.g. ‘iadsan’. Where such constructions occur in the cloze test, the language models must predict a subword unit in order to form a plausible string. Each language model is evaluated on the token it predicts for each masked token. (Due to limitations of Hugging Face’s “fill-mask” pipeline module, we can only have a single mask per input sequence and only a single subword unit can be predicted for each mask.) All the masked tokens exist in the vocabularies of the candidate BERT models and are therefore possible predictions.
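A cloze item of this kind can be built with a few lines of code. The sketch below is illustrative only: it masks the first stand-alone occurrence of one of the three pronouns, whereas the test set described above was compiled by hand, and it deliberately ignores mutated or suffixed forms such as ‘hí’ and ‘iadsan’. The function name `make_cloze_item`, the `[MASK]` default and the example sentence are assumptions.

```python
import re
from typing import Optional, Tuple

# Stand-alone pronouns only: \b prevents matches inside words such as "mé" or "inné".
PRONOUN = re.compile(r"\b(é|í|iad)\b")


def make_cloze_item(text: str, mask_token: str = "[MASK]") -> Optional[Tuple[str, str]]:
    """Return (masked text, original pronoun), or None if no pronoun is found."""
    match = PRONOUN.search(text)
    if match is None:
        return None
    masked = text[: match.start()] + mask_token + text[match.end():]
    return masked, match.group(1)


print(make_cloze_item("Chonaic mé é inné."))
```

The masked string can then be passed to any masked language model, with the second element of the pair kept as the reference answer.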
Following Rönnqvist et al. (2019), the models are evaluated on their ability to generate the original masked token, and a manual evaluation of the models is performed wherein predictions are classified in the following exclusive categories:
Match: The predicted token fits the context grammatically and semantically. This may occur when the model predicts the original token or another token which also fits the context cue. In order for a predicted pronoun to render the given cue grammatically correct, it must agree with regard to the number and gender of the noun phrase it refers to.
Mismatch: The predicted token is a valid Irish word but is unsuitable given the context.
Copy: The predicted token is an implausible repetition of another token in the context.
Gibberish: The predicted token is not a valid Irish word.
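The two evaluations can be summarised with simple counts, as in the sketch below. It is an illustration under assumptions: the function names are hypothetical, the category labels are assumed to come from a human annotator, and the example predictions and labels are invented for demonstration.

```python
from collections import Counter
from typing import Dict, List

CATEGORIES = ("match", "mismatch", "copy", "gibberish")


def original_token_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Share of items for which the model reproduced the masked token."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def category_shares(labels: List[str]) -> Dict[str, float]:
    """Share of manually assigned labels in each of the four exclusive categories."""
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return {category: counts[category] / len(labels) for category in CATEGORIES}


# Invented example: four cloze items, one model.
gold = ["é", "í", "iad", "é"]
predicted = ["é", "é", "iad", "sé"]
labels = ["match", "mismatch", "match", "mismatch"]
print(original_token_accuracy(gold, predicted))
print(category_shares(labels))
```

The two measures can diverge: a prediction that differs from the original token may still be labelled a match if it fits the context grammatically and semantically.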
6.1 Development Results
This section reports the results of our parameter search for corpus filtering and BERT vocabulary, as well as the effect of restricting the training data to publicly available data.
The number of sentences remaining after applying each filter is shown in Table 2. If no filtering is applied, we have 9,192,060 sentences. (This number differs slightly from the number in Table 1 because an earlier Wikipedia dump (20/04/2021) was used for the filtering experiment.) The Document-filter removes about 1.2 million sentences, while OpusFilter-basic only removes about 200,000 sentences. OpusFilter-basic-char-lang removes the most sentences.
The results of training a dependency parser with the gaBERT model produced by each setting are shown in Figure 1. Document-filter, which keeps all ParaCrawl articles and discards more Wikipedia articles than the other filters, has the highest LAS score. As the BERT model requires contiguous text for its next-sentence-prediction task, filtering out full documents may be more appropriate than filtering individual sentences. The two OpusFilter configurations perform marginally worse than the Document-filter. In the case of OpusFilter-basic-char-lang, perhaps the lower number of sentences overall translates to lower LAS scores. Finally, No-filter performs in the same range as the two OpusFilter configurations but has the lowest median score, suggesting that some level of filtering is beneficial.
The results of the five runs testing different vocabulary sizes are shown in Figure 2. A vocabulary size of 30K performs best for the SentencePiece tokeniser, which outperforms the WordPiece tokeniser with the same vocabulary size. The union of the two vocabularies results in 32,314 entries, and does not perform as well as either vocabulary on its own.
Final Training Runs
Figure 3 shows the development LAS of training gaBERT and gaELECTRA in their final configuration (§4.4). The best gaBERT checkpoint is reached at step 1 million, which may indicate that there are still gains to be made from training for more steps. The highest median LAS for gaELECTRA is reached at step 400k. It is worth noting that although the two models are compared at the same number of steps, gaBERT is trained with a sequence length of 128 for the first 900k steps and 512 for the remaining 100k, whereas gaELECTRA uses a sequence length of 512 throughout. We also use a larger batch size of 256 for gaELECTRA than the 128 used for gaBERT, thus the two models have been exposed to differing numbers of training tokens at each step.
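The difference in exposure can be made concrete with a small calculation from the step counts, batch sizes and sequence lengths given in §4.4. The figures are upper bounds on token positions processed, since sequences shorter than the maximum length are padded; the function name `token_positions` is ours.

```python
from typing import Iterable, Tuple


def token_positions(schedule: Iterable[Tuple[int, int, int]]) -> int:
    """Sum steps * batch size * sequence length over all training phases."""
    return sum(steps * batch * seq_len for steps, batch, seq_len in schedule)


# gaBERT: 900k steps at length 128, then 100k steps at length 512, batch size 128.
gabert = token_positions([(900_000, 128, 128), (100_000, 128, 512)])
# gaELECTRA: 1M steps at length 512, batch size 256.
gaelectra = token_positions([(1_000_000, 256, 512)])

print(gabert)                        # 21299200000
print(gaelectra)                     # 131072000000
print(round(gaelectra / gabert, 2))  # 6.15
```

Under these settings gaELECTRA processes roughly six times as many token positions as gaBERT over the full training run, which is worth keeping in mind when comparing checkpoints at equal step counts.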
6.2 Model Comparison
We compare our two final models (gaBERT and gaELECTRA) with off-the-shelf mBERT and WikiBERT-ga, as well as an mBERT model obtained with continued pre-training on our corpora.
Table 3 shows the results for dependency parsing. (A competitive parser, UDPipe, trained on the Irish-IDT Treebank 2.8 without external embeddings achieves 72.59 LAS on the test set.) Using mBERT off-the-shelf results in a test set LAS of 79.8. The WikiBERT-ga model performs at a similar level. By training mBERT for more steps on our corpora, LAS can be improved by 2.6 points. Our gaBERT model has the highest LAS of 83.9 and our gaELECTRA model follows closely behind with a score of 83.3. It is interesting to note that the gaBERT model performs better than gaELECTRA, despite seeing less training data per step. The last two rows compare gaBERT, on v2.5 of the treebank, with the system of Chau et al. (2020), who augment the mBERT vocabulary with the 99 most frequent Irish tokens and fine-tune on Irish Wikipedia.
Table 4 shows the accuracy of each model with regard to predicting the original masked token. mBERT-cp is the most accurate, with gaBERT and gaELECTRA close behind.
| Model | Original Token Prediction |
The prediction of subword units proved to be a challenge for most of the models. mBERT was the most likely to predict subword units, while gaBERT was the least likely. With regard to the pronouns selected for this task, ‘é’ tends to occur approximately 3 times more frequently in Irish text than both ‘í’ and ‘iad’, so language models can be expected to predict ‘é’ more often. mBERT was most heavily biased towards the token ‘é’, rarely predicting ‘í’ or ‘iad’, while gaBERT was the most likely of the models to predict ‘í’ and ‘iad’.
Table 5 shows the manual evaluation of the tokens generated by each model. Given that many of the context cues in the test could have several plausible answers and that there is variation with regard to the quality and quantity of context clues provided, the manual classification of predictions as either match, mismatch, copy or gibberish provides further clarity on the language models’ capabilities. These results echo those of the original masked token prediction evaluation in so far as they rank the models in the same order.
Further detail, examples and analysis of the cloze test can be found in Appendix C.
We release gaBERT, a BERT model trained on over 7.9 million Irish sentences. We plan to further explore the space of possible gaBERT models by experimenting with other fine-tuning tasks and examining the contribution of each corpus.
8 Ethical Considerations
No dataset is released with this paper; however, most of the corpora are publicly available as described in our Data Availability Statement (Appendix A). Furthermore, where an anonymised version of a dataset was available it was used. We release the gaBERT and gaELECTRA language models based on the BERT-Base (Devlin et al., 2019) and ELECTRA-Base (Clark et al., 2020) autoencoder architectures respectively. We note that an autoregressive architecture may be susceptible to training data extraction, and that larger language models may be more susceptible (Carlini et al., 2021). However, gaBERT and gaELECTRA are autoencoder architectures and smaller language models which may help mitigate this potential vulnerability.
We now address specific Ethics Questions (https://2021.emnlp.org/call-for-papers/ethics-faq).
Does the paper describe the characteristics of the dataset in enough detail for a reader to understand which speaker populations the technology could be expected to work for? Yes, §2 "Data" describes each of the six datasets used to train gaBERT and gaELECTRA. Furthermore, the use of ‘ga’, the ISO 639-1:2002 (https://www.iso.org/iso-639-language-codes.html) Language Code for Gaeilge (Irish), in the language model names gaBERT and gaELECTRA emphasises that Irish speakers, those who wish to be Irish speakers, and those with an interest in languages (particularly low resource languages) are populations that we hope these language models will benefit.
Do the claims in the paper match the experimental results, in terms of how far the results can be expected to generalize? Yes, we specify the downstream tasks on which we have trained gaBERT and gaELECTRA and presented the results. We note that many other downstream tasks can be performed which we hope will benefit from these models.
Does the paper describe the steps taken to evaluate the quality of the dataset? Yes, §2 "Data", §3 "Corpus Pre-processing", the filtering experiments, and Appendix B all speak to the quality of the datasets used. In addition, the following caveat (or similar) will be specified with the released code on GitHub, and the released model card on Hugging Face: We note that some data used to pretrain gaBERT and gaELECTRA was scraped from the web which potentially contains ethically problematic text (bias, hate, adult content, etc.). Consequently, downstream tasks/applications using gaBERT or gaELECTRA should be thoroughly tested with respect to ethical considerations.
Does the paper describe how the technology would be deployed in actual use cases? There are many downstream tasks which can use gaBERT and gaELECTRA, including Machine Translation, educational applications, predictive text, search (“information retrieval”), and games. The authors hope gaBERT and gaELECTRA will contribute to the ongoing effort to preserve the Irish language as a living language in the technological age. Supporting a low resource language like Irish in a bilingual community will make it easier for Irish speakers, and those who wish to be Irish speakers, to use the language in practice.
Does the task carried out by the computer match how it would be deployed? The fine-tuning tasks used to evaluate gaBERT and gaELECTRA are used as a proxy for other NLP downstream tasks. We release the gaBERT and gaELECTRA models and code publicly, and look forward to seeing what the development community creates.
Does the paper address possible harms when the technology is being used as intended and functioning correctly? The model could produce unsuitable text outputs when used to generate text and this could be particularly problematic if deployed in particular settings such as a school. To combat this possibility, we will include a caveat as stated in Ethical Consideration 3 above.
Does the paper address possible harms when the technology is being used as intended but giving incorrect results? As stated in the previous answer, a caveat will be included.
Does the paper address possible harms following from potential misuse of the technology? The possible harms due to potential misuse are no different to those from any small language model released publicly; we anticipate no new expected misuse potential with the release of gaBERT and gaELECTRA.
If the system learns from user input once deployed, does the paper describe checks and limitations to the learning? gaBERT and gaELECTRA are static models and do not learn from user input when deployed.
Are any of the possible harms you’ve identified likely to fall disproportionately on populations that already experience marginalization or are otherwise vulnerable? Nothing above and beyond those already there. It could be argued the Irish language is marginalized and vulnerable so we are opening it up to both benefits and potential harms. But, as with every language model released, we release it in the hope it will be used for good; while acknowledging the potential for misuse. It might be reasonable to characterize Irish speakers, and those who wish to be, as one example of “populations that already experience marginalization or are otherwise vulnerable” and that gaBERT and gaELECTRA are intended to support them and the Irish language. Go n-éirí an bóthar libh.
This research is supported by Science Foundation Ireland (SFI) through the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. This research is also supported through the SFI Frontiers for the Future programme (19/FFP/6942) and SFI Grant 18/CRT/6183 as well as by the Irish Government Department of Culture, Heritage and the Gaeltacht under the GaelTech Project. For the purpose of Open Access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
- OpusFilter: a configurable parallel corpus filtering toolbox. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, pp. 150–156.
- ParaCrawl: web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4555–4567.
- Extracting training data from large language models.
- Parsing with multilingual BERT, a small corpus, and a small treebank. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1324–1334.
- Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5564–5577.
- ELECTRA: pre-training text encoders as discriminators rather than generators. In Proceedings of The Eighth International Conference on Learning Representations (ICLR).
- Census of Population 2016 – Profile 10 Education, Skills and the Irish Language. Central Statistics Office, Dublin, Ireland.
- BERTje: a Dutch BERT model. arXiv 1912.09582v1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- A human evaluation of English-Irish statistical and neural machine translation. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, Lisboa, Portugal, pp. 431–440.
- SMT versus NMT: preliminary comparisons for Irish. In Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018), Boston, MA, pp. 12–20.
- Deep biaffine attention for neural dependency parsing. CoRR abs/1611.01734.
- ParsBERT: transformer-based model for Persian language understanding. arXiv 2005.12515v1.
- AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, pp. 1–6.
- CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
- Gaussian error linear units (GELUs). arXiv 1606.08415v1.
- What does BERT learn about the structure of language?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3651–3657.
- Efficient corpus development for lexicography: building the New Corpus for Ireland. Language Resources and Evaluation 40, pp. 127–152.
- SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71.
- ALBERT: A lite BERT for self-supervised learning of language representations. arXiv 1909.11942v6.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv 1907.11692v1.
- Irish treebanking and parsing: a preliminary evaluation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp. 1939–1946.
- Universal Dependencies for Irish. In Proceedings of the Second Celtic Language Technology Workshop (CLTW 2016), Paris, France, pp. 79–92.
- Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets. In Proceedings of the Workshop on Noisy User-generated Text, Beijing, China, pp. 1–8.
- Annotating MWEs in the Irish UD treebank. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), Barcelona, Spain (Online), pp. 126–139.
- What the [MASK]? making sense of language-specific BERT models. arXiv 2003.02912v1.
- Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), P. Bański, A. Barbaresi, H. Biber, E. Breiteneder, S. Clematide, M. Kupietz, H. Lüngen, and C. Iliadi (Eds.), Cardiff, United Kingdom, pp. 9–16.
- Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, USA, pp. 2227–2237.
- WikiBERT models: deep transfer learning for many languages. arXiv 2006.01538v1.
- Improving language understanding by generative pre-training. OpenAI.
- A primer in BERTology: what we know about how BERT works. Transactions of the Association for Computational Linguistics 8, pp. 842–866.
- Is multilingual BERT fluent in language generation?. In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, Turku, Finland, pp. 29–36.
- How good is your tokenizer? on the monolingual performance of multilingual language models.
- Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada, pp. 88–99.
- A part-of-speech tagger for Irish using finite-state morphology and constraint grammar disambiguation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy.
- Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Red Hook, NY, USA, pp. 6000–6010.
- Multilingual is not enough: BERT for Finnish. arXiv 1912.07076v1.
- Ilfhocail: a lexicon of Irish MWEs. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), Florence, Italy, pp. 162–168.
- Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45.
- Are all languages created equal in multilingual BERT?. In Proceedings of the 5th Workshop on Representation Learning for NLP, Online, pp. 120–130.
- Universal dependencies 2.7. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
- CoNLL 2017 shared task: multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada, pp. 1–19.
Appendix A Data Licenses
This Appendix provides specific details of the licence for each of the datasets used in the experiments.
The Irish annotated CoNLL17 corpus can be found here: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 Ginter et al. (2017).
The Irish Machine Translation datasets contain text from the following sources:
Text crawled from the Citizen’s Information website, contains Irish Public Sector Data licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence: https://www.citizensinformation.ie/ga/.
Text crawled from Comhairle na Gaelscolaíochta website: https://www.comhairle.org/gaeilge/.
Text crawled from the FÁS website (http://www.fas.ie/), accessed in 2017. The website has since been dissolved.
Text crawled from the Galway County Council website: http://www.galway.ie/ga/.
Text crawled from https://www.gov.ie/ga/, the central portal for government services and information.
Text crawled from articles on the Irish Times website.
Text crawled from the Kerry County Council website: https://ciarrai.ie/.
Text crawled from the Oideas Gael website: http://www.oideas-gael.com/ga/.
Text crawled from articles generated by Teagasc, available under PSI licence.
Text generated by Conradh na Gaeilge, shared with us for research purposes.
The Irish text from a parallel English–Irish corpus of legal texts from the Department of Justice. This dataset is available for reuse on the ELRC-SHARE repository under a PSI license: https://elrc-share.eu
Text reports and notices generated by Dublin City Council, shared with us for research purposes.
Text uploaded to ELRC-share via the National Relay Station, shared with us for research purposes.
Text reports and reference files generated by the Language Commissioner, available on ELRC-share under PSI license: https://elrc-share.eu/.
Text generated by the magazine Nós, shared with us for research purposes.
Irish texts available for download on OPUS, under various licenses: https://opus.nlpl.eu/
Text generated from in-house translation provided by the then titled Department of Culture, Heritage and Gaeltacht (DCHG), provided for research purposes. The anonymised dataset is available on ELRC-share, under a CC-BY 4.0 license: https://elrc-share.eu/.
Text reports created by Údarás na Gaeilge, uploaded to ELRC-share available under PSI license: https://elrc-share.eu/.
Text generated by the University Times, shared with us for research purposes.
The corpus is compiled and owned by Foras na Gaeilge, and is provided to us for research purposes.
The unshuffled version of the Irish part of the OSCAR corpus was provided to us by the authors for research purposes.
Text from ParaCrawl v7, available here: https://www.paracrawl.eu/v7. The texts themselves are not owned by ParaCrawl; the packaging of these parallel data is released under the Creative Commons CC0 licence ("no rights reserved").
The texts used are available under a CC BY-SA 3.0 licence and/or a GNU Free Documentation License.
Appendix B Corpus Pre-processing
This Appendix provides specific details on corpus pre-processing and on the OpusFilter filters used.
Foras na Gaeilge provided us with a .vert file (MD5 7be5c0e9bc473fb83af13541b1cd8d20) containing 33,088,532 tokens in 3,485 documents. We extract the raw text from the first tab-separated column and carry out the following conversions (number of events in parentheses):
Replace &quot; with a neutral double quote (4408).
Replace the standard XML/HTML entities quot, lt, gt and amp, tokenised into three tokens (e.g. & quot ;), with the appropriate characters (128).
Replace the numeric HTML entities 38, 60, 147, 148, 205, 218, 225, 233, 237, 243 and 250, again spanning three tokens (e.g. & #38 ;), with the appropriate Unicode characters (3679).
Repeat from the start until the text does not change.
We do not modify the seven occurrences of \x\x13 as it is not clear from their contexts how they should be replaced. After pre-processing and treating all whitespace as token separators, e.g. in the NCI token “go leor”, we obtain 33,472,496 tokens from the NCI.
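The repeat-until-unchanged conversion described above can be illustrated with a minimal Python sketch. It is not the script used for the NCI: the function names, the whitespace-tolerant patterns and the use of `html.unescape` for the numeric entities are assumptions made for illustration.

```python
import html
import re

# Named entities listed in the text (assumed mapping to plain characters).
NAMED = {"quot": '"', "lt": "<", "gt": ">", "amp": "&"}
# Numeric entities listed in the text.
NUMERIC = {38, 60, 147, 148, 205, 218, 225, 233, 237, 243, 250}

# Optional whitespace covers entities tokenised into three tokens ("& quot ;").
NAMED_RE = re.compile(r"&\s*(quot|lt|gt|amp)\s*;")
NUMERIC_RE = re.compile(r"&\s*#\s*(\d+)\s*;")


def _replace_numeric(match):
    code = int(match.group(1))
    if code not in NUMERIC:
        return match.group(0)  # leave entities outside the list untouched
    # html.unescape applies the HTML5 mapping, e.g. 147 -> left curly quote.
    return html.unescape(f"&#{code};")


def normalise_once(text):
    """Apply one pass of named and numeric entity replacement."""
    text = NAMED_RE.sub(lambda m: NAMED[m.group(1)], text)
    return NUMERIC_RE.sub(_replace_numeric, text)


def normalise(text):
    """Repeat the replacement pass until the text no longer changes."""
    while True:
        new_text = normalise_once(text)
        if new_text == text:
            return text
        text = new_text
```

Repeating the pass matters for doubly escaped input: `& amp ; quot ;` first becomes `& quot ;` and only becomes a double quote on the second pass.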
B.2 Sentence Boundary Detection
Many of the NCI segments marked up with s tags contain multiple sentences. We treat each segment boundary as a sentence boundary and further split segments into sentences recursively: we find the best split point according to the following heuristics, split the segment into two halves at that point, and apply the same procedure to each half until no suitable split point is found.
Reject if the left half contains no letters and is short. This covers cases where the left half is only a decimal number.
Reject if the right half has no letters and is short or is an ellipsis.
Reject if the right half’s first letter, skipping enumerations, is lowercase.
Reject if the left half only contains a Roman number (in addition to the full-stop).
Reject if inside round, square, curly or angle brackets and the brackets are not far away from the candidate split point.
If sentence-ending punctuation is followed by two quote tokens, we also consider splitting between the quotes and prefer this split point if it is not rejected by the above rules.
If sentence-ending punctuation is followed by a closing bracket, we also consider splitting after the bracket and prefer this split point if it is not rejected by the above rules.
If a question mark is followed by more question marks, we also consider splitting after the end of the sequence of question marks and prefer this split point if it is not rejected by the above rules.
If an exclamation mark is followed by more exclamation marks, we also consider splitting after the end of the sequence of exclamation marks and prefer this split point if it is not rejected by the above rules.
If a full-stop is the first full-stop in the overall segment, the preceding token is “1”, there are more tokens before this “1” and the token directly before “1” is not a comma or semi-colon we assume that this is an enumeration following a heading and prefer splitting before the “1”.
We do not insert new sentence boundaries at a full-stop after “DR”, “Prof” and “nDr”, and, if followed by a decimal number, after “No”, “Vol” and “Iml”.
Splitting after a full-stop following decimal numbers in all other cases is dispreferred, giving the largest penalty to small numbers as these are most likely to be part of enumerations. An exception is “Airteagal” followed by a token ending with a full-stop, a number, a full-stop, another number and another full-stop. Here, we implemented a preference for splitting after the first separated full-stop, assuming the last number is part of an enumeration.
Prefer a split point balancing the lengths of the halves in characters.
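The recursive procedure can be sketched in Python as follows. This is a simplified illustration, not the implementation used for the NCI: it covers only the letter checks, the lowercase check, the title abbreviations and the length-balancing preference, it ignores the "short" conditions, brackets, quotes and enumerations, and all function names are hypothetical.

```python
SENTENCE_END = {".", "?", "!"}
# Titles after which a full stop does not end a sentence (from the text).
NO_SPLIT_BEFORE_STOP = {"DR", "Prof", "nDr"}


def has_letters(tokens):
    return any(ch.isalpha() for tok in tokens for ch in tok)


def first_letter(tokens):
    for tok in tokens:
        for ch in tok:
            if ch.isalpha():
                return ch
    return None


def rejected(left, right):
    """Simplified subset of the rejection rules listed above."""
    if not has_letters(left) or not has_letters(right):
        return True
    if first_letter(right).islower():
        return True
    if len(left) >= 2 and left[-1] == "." and left[-2] in NO_SPLIT_BEFORE_STOP:
        return True
    return False


def char_len(tokens):
    return sum(len(tok) for tok in tokens)


def split_segment(tokens):
    """Recursively split a token list at the best admissible split point."""
    best, best_imbalance = None, None
    for i in range(1, len(tokens)):
        if tokens[i - 1] not in SENTENCE_END:
            continue
        left, right = tokens[:i], tokens[i:]
        if rejected(left, right):
            continue
        # Prefer the split that best balances the halves in characters.
        imbalance = abs(char_len(left) - char_len(right))
        if best is None or imbalance < best_imbalance:
            best, best_imbalance = i, imbalance
    if best is None:
        return [tokens]  # no suitable split point: keep as one sentence
    return split_segment(tokens[:best]) + split_segment(tokens[best:])
```

In this sketch the balancing preference is the only scoring criterion; the penalties and preferences described above for numbers and enumerations would need to be added to the score to approximate the full procedure.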
B.3 OpusFilter Filters
For OpusFilter-basic, we include the following filters:
LengthFilter: Filter sentences containing more than 512 words.
LongWordFilter: Filter sentences containing words longer than 40 characters.
HTMLTagFilter: Filter sentences containing HTML tags.
PunctuationFilter: Filter sentences which are over 60% punctuation.
DigitsFilter: Filter sentences which are over 60% numeric symbols.
For OpusFilter-basic-char-lang, we use the same filters as in OpusFilter-basic but include the following character script and language ID filters:
CharacterScoreFilter: Filter sentences which are below a ratio of Latin characters, where .
LanguageIDFilter: Filter sentences where the language ID tools have a lower confidence score than , where .
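To make the effect of these filters concrete, the following plain-Python sketch re-implements the basic filters and a Latin-script ratio. It does not use the OpusFilter API. How each ratio is computed (here, over non-whitespace characters, or over letters for the script ratio) is an assumption, and the Latin-ratio threshold is left as a required parameter because its value is not given above.

```python
import re
import unicodedata

HTML_TAG_RE = re.compile(r"<[^>]+>")


def char_ratio(sentence, predicate):
    """Share of non-whitespace characters satisfying the predicate."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if predicate(c)) / len(chars)


def is_punct(c):
    return unicodedata.category(c).startswith("P")


def keep_basic(sentence, max_words=512, max_word_chars=40, max_ratio=0.6):
    """Return True if the sentence passes all five basic filters."""
    words = sentence.split()
    if len(words) > max_words:                      # LengthFilter
        return False
    if any(len(w) > max_word_chars for w in words): # LongWordFilter
        return False
    if HTML_TAG_RE.search(sentence):                # HTMLTagFilter
        return False
    if char_ratio(sentence, is_punct) > max_ratio:  # PunctuationFilter
        return False
    if char_ratio(sentence, str.isdigit) > max_ratio:  # DigitsFilter
        return False
    return True


def latin_ratio(sentence):
    """Share of letters whose Unicode name marks them as Latin script."""
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for c in letters if "LATIN" in unicodedata.name(c, ""))
    return latin / len(letters)


def keep_latin(sentence, min_latin_ratio):
    """Character-script check; the threshold must be supplied by the caller."""
    return latin_ratio(sentence) >= min_latin_ratio
```

A sentence is kept by the stricter setting only if both `keep_basic` and `keep_latin` return True; the language ID check is omitted here because it depends on external language identification tools.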
Appendix C Cloze Test Examples
C.1 Prediction Classification
|Context Cue||Masked Word||Model||Prediction||Classification|
Céard [MASK] na préamhacha raidiciúla sin?
(‘What [MASK] those radical roots?’)
Agus seo [MASK] an fhadhb mhór leis an bhfógra seo.
(‘And this [MASK] the big problem with this advert.’)
Cheannaigh Seán leabhar agus léigh sé [MASK].
(‘Seán bought a book and he read [MASK].’)
Ní h[MASK] sin aidhm an chláir.
(‘[MASK] is not the aim of the programme.’)
Table 6 provides one example per classification category of masked token predictions generated by the language models during our cloze test evaluation.
In the match example in Table 6, the original meaning (‘What are those radical roots?’) differs from the meaning of the resulting string (‘What about those radical roots?’), in which the masked token is replaced by the token predicted by mBERT-cp. However, the latter construction is grammatically and semantically acceptable.
In the mismatch example in Table 6, the predicted token is a valid Irish word, but the resulting generated text is nonsensical.
Though technically grammatical, the predicted token in the copy example in Table 6 results in a string with an unnatural repetition of a noun phrase where a pronoun would be highly preferable (‘Seán bought a book and he read a book.’).
In the gibberish example in Table 6, the predicted token does not form a valid Irish word and the string resulting from the token prediction is ungrammatical and meaningless.
C.2 Effect of Length of Context on Accuracy of Prediction
In order to observe the effect that the amount of context provided has on the accuracy of the model, Table 7 shows the proportion of matches achieved by each language model when the results are segmented by the length of the context cues.
All the models tested are least accurate when tested on the group of short context cues. All except mBERT achieved the highest accuracy on the group of long sentences.
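The segmentation by context length can be computed as in the sketch below. The bucket boundaries (`short_max`, `medium_max`), the whitespace-based length and the function names are hypothetical choices for illustration; the boundaries used for Table 7 are not stated here.

```python
from collections import defaultdict


def length_bucket(cue, short_max=10, medium_max=20):
    """Assign a context cue to a length group (boundaries are assumed)."""
    n_tokens = len(cue.split())
    if n_tokens <= short_max:
        return "short"
    if n_tokens <= medium_max:
        return "medium"
    return "long"


def match_rate_by_length(results):
    """results: iterable of (context_cue, classification) pairs.

    Returns the proportion of 'match' classifications per length group.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for cue, classification in results:
        bucket = length_bucket(cue)
        totals[bucket] += 1
        if classification == "match":
            matches[bucket] += 1
    return {bucket: matches[bucket] / totals[bucket] for bucket in totals}
```

Running this once per language model over its classified predictions yields one row of proportions per model.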
C.3 Easy and Difficult Context Cues
A context cue may be considered easy or difficult based on:
Whether the tokens occur frequently in the training data
The number of context clues
The distance of the context clues from the masked token
Two Irish language context cues which vary in terms of difficulty are exemplified below.
Bean, agus í cromtha thar thralaí bia agus [MASK] ag ithe a sáithe.
‘A woman, bent over a food trolley while eating her fill.’
We can consider the above sentence to be easy for the task of token prediction due to the following context clues:
‘Bean’ is a frequent feminine singular noun.
‘í’ is a repetition of the feminine singular pronoun to be predicted.
The lack of lenition on ‘sáithe’ further indicates that the noun it refers to may not be masculine.
These clues indicate that the missing pronoun will be feminine and singular.
Seo béile aoibhinn fuirist nach dtógann ach timpeall leathuair a chloig chun [MASK] a ullmhú.
‘This is an easy, delicious meal that only takes about half an hour to prepare.’
None of the language models tested predicted a plausible token for the above sentence. This example is more challenging because the only context clue is the masculine singular noun ‘béile’, which is 11 tokens away from the masked token.
Appendix D More Future Work Ideas
The following list of future work ideas has been collected from our GitHub repository.
Effect of switching to v8 of Paracrawl
Effect of random initialisation (using the same model type and settings as in the development phase; purpose is to find out whether the observed differences between settings are meaningful; differences that can be explained with model instability and that contradict expectations can then be disregarded)
For each corpus, effect of removing it
For each pair of corpora, effect of removing them
Start with the NCI and add corpora in order of cleanliness (or data value estimate from above corpus ablation)
More on filter thresholds and/or corpus-specific filter settings
Effect of filtering (near) duplicates
Effect of increasing the weight of clean corpora
Restrict vocabulary building to clean corpora
Partition the corpus, create a vocabulary for each partition and join the vocabularies
Effect of ## glue on prefixes
Effect of adding our in-house Irish Twitter corpora, in particular on tasks involving social media content
Add a copy of the data with accents removed and/or other normalisations
Add synthetic Irish, e.g. MT output
Add Hiberno-English corpora and English side of parallel corpora to training data
The role of sentence-splitting, e.g. does BERT need to see properly formed sentences or are snippets of text sufficient?
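As an illustration of one idea from the list above, adding a copy of the data with accents removed, the following Python sketch strips combining marks after Unicode decomposition. The function names are hypothetical, and adding a copy only for sentences that actually change is an assumption rather than a tested design.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. the síneadh fada in 'Seán' -> 'Sean'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


def with_unaccented_copy(sentences):
    """Return the sentences plus an accent-free copy of each changed one."""
    augmented = []
    for sentence in sentences:
        augmented.append(sentence)
        plain = strip_accents(sentence)
        if plain != sentence:  # skip sentences without accented characters
            augmented.append(plain)
    return augmented
```

Whether such augmentation helps is an open question; accent removal also merges distinct Irish words, so any gain in robustness to unaccented input would have to be weighed against the added ambiguity.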