Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model

04/19/2021
by Per E. Kummervold, et al.

In this work, we show the process of building a large-scale training set from the digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models on several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves on mBERT's performance for other languages present in the corpus, such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while retaining strong multilingual properties. We thereby demonstrate that building high-quality models within a memory institution from somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.
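The downstream evaluations mentioned above follow the standard BERT fine-tuning recipe: a classification head is placed on top of the pretrained encoder and trained on labeled Norwegian data. The sketch below illustrates that setup with Hugging Face `transformers`; a tiny randomly initialized configuration stands in for the real pretrained checkpoint so the example runs without downloads, and the checkpoint name in the comment (`NbAiLab/nb-bert-base`) is our assumption about where the released model can be loaded from.

```python
# Minimal sketch of a sequence-classification setup for a BERT-style model,
# using a tiny random config in place of the pretrained Norwegian weights.
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    vocab_size=100,          # toy vocabulary; a real model has ~30k-120k tokens
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,            # e.g. binary sentiment for sequence classification
)
model = BertForSequenceClassification(config)

# In practice one would load the pretrained checkpoint instead, e.g. (assumed name):
# model = BertForSequenceClassification.from_pretrained("NbAiLab/nb-bert-base", num_labels=2)

input_ids = torch.randint(0, config.vocab_size, (1, 8))  # one toy sequence of 8 token ids
with torch.no_grad():
    logits = model(input_ids=input_ids).logits           # shape: (batch_size, num_labels)
print(tuple(logits.shape))
```

The same encoder can back a token-classification head (`BertForTokenClassification`) for tasks such as named-entity recognition, which is what distinguishes the token-level from the sequence-level evaluations in the abstract.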


