Neural Language Models for Nineteenth-Century English

05/24/2021
by Kasra Hosseini, et al.

We present four types of neural language models trained on a large historical dataset of books in English, published between 1760 and 1900 and comprising 5.1 billion tokens. The language model architectures include static (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances considering different time slices for BERT. Our models have already been used in various downstream tasks, where they consistently improved performance. In this paper, we describe how the models were created and outline their reuse potential.
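As a rough illustration of how such released models could be reused, the sketch below loads a static word2vec instance with gensim and queries a BERT instance via a masked-language-modeling pipeline. The file paths and model names are placeholders, not the authors' published identifiers, and the example assumes the model files have already been downloaded locally.

```python
# Minimal sketch: querying a static and a contextualized historical language model.
# Paths below are hypothetical placeholders for the released model files.

from gensim.models import Word2Vec   # static embeddings (word2vec)
from transformers import pipeline    # contextualized model (BERT)

# Static model: nearest neighbours of a word in a pre-1850 word2vec instance.
w2v = Word2Vec.load("path/to/word2vec_pre1850.model")  # hypothetical path
print(w2v.wv.most_similar("machine", topn=5))

# Contextualized model: masked-token prediction with a period BERT instance.
fill_mask = pipeline("fill-mask", model="path/to/bert_1760_1900")  # hypothetical path
for candidate in fill_mask("The [MASK] was drawn by four horses."):
    print(candidate["token_str"], round(candidate["score"], 3))
```

Comparing the neighbours or mask predictions returned by instances trained on different time slices is one simple way to surface semantic change across the period.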


