Recent advances in neural network architectures have enabled significant new developments within natural language processing. With the first BERT model (Bidirectional Encoder Representations from Transformers), Google's researchers demonstrated state-of-the-art performance across a variety of NLP tasks in English. This was followed by the development of M-BERT, a multilingual model capable of generalizing across languages [11, 15], which in turn prompted various monolingual models beyond English, such as CamemBERT and FinBERT, that have been shown to outperform M-BERT for deep transfer learning in those languages [10, 13, 12]. While demonstrating impressive performance on a number of tasks, these latter models have had to overcome a lack of data and testbeds for languages other than English.
Here we present the new Swedish model we have developed at the National Library of Sweden (KB): KB-BERT. As with the likes of FinBERT, this has meant finding solutions to the lack of data for a lower-resourced language. We elaborate upon how our work at the recently established KBLab for data-driven research has allowed us to address this problem, at least in part (see discussion below). Technical issues of access to data converge with wider questions about the representativeness of language models: we have deliberately used parts of the library's collections with significant elements of colloquial language, including the newspaper archive and government reports (both containing large amounts of directly reported speech from a wide spectrum of social actors), to create a BERT trained on a broad range of Swedish. We have also included historical materials from the newspaper archive dating back to 1945 to capture something approximating the living language of the national community [2, 1]. The training of KB-BERT has thus involved a significant democratization of data.
This paper makes several contributions. Firstly, we explain how we used KB’s collections to create the training data necessary for the making of our Swedish BERT. Secondly, we describe and show the results from our testing process, where we compare the performance of KB-BERT for various NLP tasks in Swedish with that of existing models. Thirdly, we evaluate this performance and highlight the particular steps for future research that we think this raises.
2 Corpus disposition
The main challenge that we faced in developing KB-BERT, apart from the amount of computational resources needed, was compiling and cleaning a sufficiently large and diverse corpus of Swedish text (cf. Linder et al., 2020). Fortunately, we have been able to draw upon KB's vast textual resources, which result from both large-scale digitization projects and the library's role as the national archive for electronic and traditional legal deposits.
To limit the corpus solely to modern Swedish, we have selected only resources created from the 1940s until late 2019. More precisely, we have included material from the following categories:
Digitized newspapers—text extracted from OCR’d newspapers from KB’s archives. This newspaper corpus contains a wide variety of text such as quotes, speech, articles, short stories and adverts.
Official Reports of the Swedish Government—the collection of government publications (Statens offentliga utredningar, or SOU) contains around 5,000 volumes, ranging from 1940 to today. These texts are carefully vetted before publication and represent a large amount of correctly written Swedish. The sentences tend to be longer than in the other texts.
Legal e-deposits—resources deposited at KB under the e-deposit law (2012:492). These include reports from Swedish authorities, e-books from publishers, and e-magazines.
Social media—comments collected from internet forums. This is by far the most challenging corpus in terms of sentence structure and generally creative use of the Swedish language. Punctuation is, for example, apparently optional in such texts.
Swedish Wikipedia—all Wikipedia articles in Swedish.
| Corpus | Words | Sentences | Size |
| --- | --- | --- | --- |
| allnews.txt | 2 997M | 226M | 16 783MB |
| Total | 3 497M | 260M | 18 341MB |
As can be seen in the table above, there is an obvious skew towards newspaper text. On the other hand, the allnews.txt corpus is the most diverse of the corpora we have included. It is also our understanding that the amount of a certain type of text is not proportional to the model's subsequent understanding of that type of text. In other words, incorporating smaller corpora of different types of text, rather than simply adding more text of an existing type, is still beneficial not only to the model's representativeness but also to its capacity and robustness as a whole.
The one notable omission is the comparatively low amount of fiction in the total corpus, a result of the small number of e-books in the combined collections. The newspaper corpus contains some fiction, but its extent is unknown and it consists mostly of short stories rather than full books. The legal e-deposit corpus does, however, include a few hundred works of fiction.
3 Text wrangling
The text in its raw form needed extensive cleaning and filtering before it could be used to create pre-training data. Text preparation was carried out using a combination of available and custom-made tools to deal with discrepancies between and within corpora, as we describe below.
3.1 OCR quality fix
One major hurdle has been the presence of a few very specific OCR errors, which most likely stem from an overly aggressive use of word lists, or a faulty word list, in combination with the OCR engine confusing the letter m with the letter r followed by n, and vice versa. This is especially problematic for the word om (meaning "about" in Swedish), which was OCRed as orn (not a Swedish word) and then replaced by örn (meaning "eagle" in Swedish). Since om is a very common word, appearing on circa 82% of all newspaper pages, and more than half of its occurrences were affected by this error, the resulting model would have ended up with a faulty understanding of Swedish.
This also became problematic with common Swedish endings such as "arna"/"erna", which would be OCRed as "ama"/"ema". For example, "tanterna" ["the old ladies"] becomes "tantema" (not a Swedish word). To correct this we developed a fix to identify candidates where m had been replaced with rn and vice versa, by checking both versions against Saldo, the morphological lexicon of Språkbanken (the Swedish Language Bank in Gothenburg). If the original word did not exist in Saldo but the transformed one did, the word was added as a candidate in a mapping file.
In the end, over 30,000 such candidates were identified and corrected, the majority appearing only a few times each. In the case of om/örn the word was simply replaced, essentially deleting mentions of actual eagles from the Swedish newspaper corpus. In total, 13 million occurrences were corrected.
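The candidate-generation step described above can be sketched as follows. The lexicon here is a toy stand-in for Saldo, and the function names are our own illustration, not from the actual pipeline.

```python
# Sketch of the rn/m OCR-correction heuristic: generate all single
# "rn" <-> "m" substitutions and accept the variant if it is a real word.

def rn_m_variants(word):
    """Yield spellings with every single 'rn' <-> 'm' substitution applied."""
    for i in range(len(word)):
        if word[i:i + 2] == "rn":
            yield word[:i] + "m" + word[i + 2:]
        if word[i] == "m":
            yield word[:i] + "rn" + word[i + 1:]

def find_candidates(tokens, lexicon):
    """Map OCR tokens that are not in the lexicon to a variant that is."""
    mapping = {}
    for tok in tokens:
        if tok in lexicon:
            continue  # already a valid word, leave it alone
        for variant in rn_m_variants(tok):
            if variant in lexicon:
                mapping[tok] = variant
                break
    return mapping

lexicon = {"tanterna", "om", "varm"}  # toy stand-in for Saldo
tokens = ["tantema", "orn", "varrn", "om"]
print(find_candidates(tokens, lexicon))
# → {'tantema': 'tanterna', 'orn': 'om', 'varrn': 'varm'}
```

In the real pipeline these candidates were written to a mapping file and reviewed before mass replacement, since a blind swap can also destroy valid words (as the om/örn case shows).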
3.2 Sentence and paragraph splitting
The pre-training data creation process ultimately demands "paragraphs". This can mean quite different things depending on the type of text at hand. In newspaper text it is usually a block of text spanning part of a page, while in monographs it is not uncommon for a paragraph to cover the whole page. In newspapers it is also not uncommon for single words or sentences to show up as paragraphs in the OCR. These are not valuable since they have little or no context. To reduce noise, we therefore set a somewhat arbitrary lower limit for a paragraph at ten sentences. This meant that we discarded a large number of paragraphs that were deemed too short. Sentences at the beginning and end of paragraphs that did not seem to be whole sentences were also discarded.
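A minimal sketch of this filtering step is shown below; the completeness heuristic is our own illustration, not the exact rule used.

```python
def clean_paragraph(sentences, min_sentences=10):
    """Trim incomplete sentences at the edges, then enforce the length floor."""
    def looks_complete(s):
        # Illustrative heuristic: starts with a capital, ends in terminal punctuation.
        s = s.strip()
        return bool(s) and s[0].isupper() and s[-1] in ".!?"
    # Discard fragments only at the beginning and end of the paragraph.
    while sentences and not looks_complete(sentences[0]):
        sentences = sentences[1:]
    while sentences and not looks_complete(sentences[-1]):
        sentences = sentences[:-1]
    # Paragraphs below the (somewhat arbitrary) floor are dropped entirely.
    return sentences if len(sentences) >= min_sentences else None
```

A paragraph of nine clean sentences would thus be discarded outright, which is why this rule removed a large share of the raw OCR paragraphs.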
However, the major obstacle we faced in creating paragraphs was splitting text into sentences. In the end, we used a custom script together with spaCy to try and recreate sentences from a stream of text. This consisted of a pre-processing step to deal with hyphenation (very common, since Swedish has fairly long compound words and columns of text can be narrow), followed by spaCy's sentence splitter, and lastly a post-processing step to deal with special cases not covered by spaCy. It is worth noting that spaCy does not yet have a statistical model for Swedish, which is partly why so much custom scripting was required.
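The hyphenation pre-processing can be illustrated with a simple rule that joins words broken across line ends before the text is handed to the sentence splitter. This is a sketch: the real pipeline also has to preserve genuine hyphens in Swedish compounds.

```python
import re

def dehyphenate(text):
    # Join "samman-\nsatt" back into "sammansatt"; any remaining
    # line break is treated as ordinary whitespace.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text.replace("\n", " ")

print(dehyphenate("en samman-\nsatt mening"))  # → "en sammansatt mening"
```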
A special case is that of emojis, which are occasionally used as punctuation in social media texts. For example, the phrase "Hej ☺Vad heter du?" ["Hi ☺What's your name?"] is two sentences, since the ☺ (U+263A WHITE SMILING FACE) works as end-of-sentence punctuation in this context. On the other hand, "Menvafan ☹☹☹varför gjorde du så!?" ["What the hell ☹☹☹why did you do that!?"] is one sentence even though it contains three emojis. We felt it important to retain emojis both in the tokenization and in their use as sentence delimiters, since they are prevalent in some types of discourse; this was necessary if the model was to have any chance of being used to analyze and/or understand conversations on social media.
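A post-processing rule along these lines might split after an emoji only when the next word is capitalized. The emoji set and the casing heuristic below are simplified illustrations, not the exact rule we used.

```python
import re

EMOJIS = "\u263A\u2639\U0001F600"  # tiny sample set for illustration

def split_on_emoji(text):
    # Break after an emoji only if the next word starts with a capital letter,
    # so runs of emojis mid-sentence are left intact.
    parts = re.split(rf"(?<=[{EMOJIS}])\s*(?=[A-ZÅÄÖ])", text)
    return [p for p in parts if p.strip()]

print(split_on_emoji("Hej ☺Vad heter du?"))                 # two sentences
print(split_on_emoji("Menvafan ☹☹☹varför gjorde du så!?"))  # one sentence
```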
3.4 Tokenization
Once the text was cleaned and split into paragraphs of sentences, we used the SentencePiece library to create a tokenizer file, and then manually added back emojis and skin-color modifiers that had been filtered out (Kudo and Richardson, 2018). Following the lead of both the German and Finnish BERT projects, we chose a dictionary size of approximately 50,000 tokens, with the rationale that Swedish contains many compound words.
4 Pre-training
We used the code and instructions provided in Google's BERT repo to create the pre-training data and subsequently pre-train the model. We trained for one million steps with a maximum sequence length (msl) of 128 and a batch size of 512, then for 100,000 steps with an msl of 512 and a batch size of 128. We continued training for approximately another million steps with an msl of 128 and a batch size of 512 to see whether this would yield better results downstream.
Model pre-training was carried out partly in-house at KBLab and partly, for unproblematic material without active copyright, with generous support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
5 Results and performance comparison
To evaluate the performance of the model and compare it with others, we used NER and POS tagging as downstream tasks. These are both token classification tasks that require some language understanding to perform well.
5.1 NER tagging
For this task, we used the Stockholm-Umeå Corpus 3.0 (suc3) dataset hosted by Språkbanken (the Swedish Language Bank in Gothenburg), converted to be used by a modified version of the run_ner.py program included in the Huggingface framework. The file was split into training, test and evaluation sets, with 70% used for training, and 20% and 10% for test and evaluation respectively.
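The partitioning itself is straightforward; a sketch of the 70/20/10 split is shown below. The seed and shuffling are illustrative, and we make no claim about the exact procedure used.

```python
import random

def split_70_20_10(examples, seed=42):
    """Shuffle and partition examples into train/test/eval splits."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    i, j = int(n * 0.7), int(n * 0.9)
    return shuffled[:i], shuffled[i:j], shuffled[j:]  # train, test, eval

train, test, eval_ = split_70_20_10(range(1000))
print(len(train), len(test), len(eval_))  # → 700 200 100
```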
The results show that our model, KB-BERT, outperforms existing BERTs trained either for multilingual understanding by Google or specifically for Swedish by Arbetsförmedlingen. Results for HFST-SweNER are also included in the table below, but a difference in evaluation method means that the numbers are not necessarily comparable.
The model from Arbetsförmedlingen is trained using only a fraction of the data used for KB-BERT, which shows in its limited performance.
5.2 Part-of-speech tagging
Using the SUC dataset, we evaluated part-of-speech tagging in the same way as NER. The results show a much smaller improvement over M-BERT (less than 1%) compared to NER tagging. This is probably because the F1 score is very high to begin with, since many words have a static mapping to a corresponding POS tag.
5.3 Downstream tasks evaluation during pre-training
To get a sense of when to stop pre-training, evaluation on the NER task was carried out continually at certain checkpoints as a measure of model performance. Results show a fairly slow rise after an initial jump at around ten thousand steps, with only marginal improvements after a few hundred thousand steps. This is in line with the results from Deepset's German BERT (Chan et al., 2019).
5.4 Results for harder NER tasks
While our KB-BERT scores high and outperforms existing models, some specific tests designed to stress language understanding in downstream tasks succeed only intermittently. For example, the latter part of the sentence "Pelle och Kalle startar företaget Pelle och Kalle" ["Pelle and Kalle start the company Pelle and Kalle"] is correctly tagged as ORG in some runs but not in others. This does not seem to stabilize with additional pre-training beyond 1-2 million steps. However, there does seem to be a difference for more complex NER tasks in the range between 350k and 1M pre-training steps. Perhaps we are pushing the boundaries of the model, or perhaps more fine-tuning data is needed. Further research is required to clarify this.
6 Conclusions
In this paper, we have described how we used KB's considerable textual resources to train a new Swedish BERT model. Looking at the performance of KB-BERT, we have demonstrated that it outperforms both Google's multilingual model, M-BERT, and existing Swedish models, such as Arbetsförmedlingen's, across a range of NLP tasks.
What conclusions can be drawn from this? Firstly, and most obviously, it highlights the importance of data volume: the more data, the better the performance. That a BERT model trained specifically for Swedish outperforms a multilingual model is hardly surprising; comparable results have been shown for other languages such as German and Finnish. However, the clear improvement in language understanding of KB-BERT relative to other Swedish models is interesting, since it demonstrates how much data is required if a model is to perform well.
Secondly, this study suggests the clear value of including a broad range of language types in creating a truly robust model. Perhaps most surprising here was that the use of apparently messy data, such as the formally unorthodox and colloquial language used on social media that incorporates emojis, seems to result in a BERT-model with a better capacity for understanding many different types of language (not just specifically that used on social media). In general terms, we believe that this supports our contention that a democratically-inclined model trained on data derived from many social domains leads to improved performance on NLP tasks. We intend to develop our analysis of this issue further in future research at KBLab.
Thirdly, our work in this project has highlighted the continued problem of a lack of testbeds for the Swedish language. In contrast to well-resourced languages such as English, this lack presents particular problems for the future development and testing of new Swedish language models. To help address this issue, we will participate over the coming year in a new project aiming to develop new and improved testbeds in Swedish, in collaboration with researchers at the Research Institutes of Sweden (RISE), the Swedish Language Bank at Gothenburg University, and AI Innovation of Sweden ("SuperLim: en svensk testmängd för språkmodeller" ["SuperLim: a Swedish test set for language models"], funded by Vinnova, 2020-2021). Our hope is that this will lead to significantly improved possibilities for the future evaluation of Swedish models.
7 Acknowledgements
National Library of Sweden / KBLab
Resources from Stockholm University, Umeå University and Swedish Language Bank at Gothenburg University were used when fine-tuning KB-BERT for NER.
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Models are hosted on S3 by Huggingface.
8 Bibliographical References
- [1] L. Ahearn (2017) Living Language: An Introduction to Linguistic Anthropology. Second edition, John Wiley & Sons, Chichester.
- [2] B. Anderson (2016) Imagined Communities: Reflections on the Origin and Spread of Nationalism. Revised edition, Verso, London.
- [3] L. Borin, M. Forsberg and L. Lönngren (2013) SALDO: a touch of yin to WordNet's yang. Language Resources and Evaluation 47 (4), pp. 1191–1211.
- [4] B. Chan, T. Möller, M. Pietsch and T. Soni (2019) Open Sourcing German BERT: Insights into pre-training BERT from scratch. deepset blog.
- [5] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
- [6] M. Honnibal and I. Montani (2017) spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
- [7] D. Kokkinakis, J. Niemi, S. Hardwick, K. Lindén and L. Borin (2014) HFST-SweNER: A New NER Resource for Swedish. LREC 2014 - Ninth International Conference on Language Resources and Evaluation, Paris, pp. 2537–2543.
- [8] T. Kudo and J. Richardson (2018) SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv:1808.06226.
- [9] L. Linder, M. Jungo, J. Hennebert, C. Musat and A. Fischer (2020) Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German. arXiv:1912.00159.
- [10] L. Martin et al. (2020) CamemBERT: a Tasty French Language Model. ACL 2020. arXiv:1911.03894.
- [11] T. Pires, E. Schlinger and D. Garrette (2019) How multilingual is Multilingual BERT? arXiv:1906.01502.
- [12] S. Rönnqvist, J. Kanerva, T. Salakoski and F. Ginter (2019) Is Multilingual BERT Fluent in Language Generation? arXiv:1910.03806.
- [13] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter and S. Pyysalo (2019) Multilingual is not enough: BERT for Finnish. arXiv:1912.07076.
- [14] T. Wolf et al. (2019) HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771.
- [15] S. Wu and M. Dredze (2019) Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. EMNLP 2019. arXiv:1904.09077.
9 Language Resource References