The Effects of In-domain Corpus Size on pre-training BERT

12/15/2022
by Chris Sanchez, et al.

Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulty of collecting enough in-domain data may discourage researchers from attempting such pre-training. In this paper, we conducted a series of experiments, pre-training Bidirectional Encoder Representations from Transformers (BERT) on biomedical corpora of different sizes. The results demonstrate that pre-training on a relatively small amount of in-domain data (4 GB), with a limited number of training steps, can lead to better performance on downstream domain-specific NLP tasks than fine-tuning models pre-trained only on general corpora.
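The setup described above amounts to continuing BERT's masked-language-model (MLM) objective on an in-domain corpus before fine-tuning on downstream tasks. The sketch below shows one way such continued pre-training could look using the Hugging Face transformers and datasets libraries; the corpus file name, sequence length, step count, batch size, and learning rate are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of continued in-domain MLM pre-training for BERT.
# Assumes the Hugging Face transformers/datasets libraries; values below
# (corpus path, hyperparameters) are illustrative, not the authors' setup.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical in-domain corpus: one biomedical sentence or abstract per line.
raw = load_dataset("text", data_files={"train": "biomedical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking for the MLM objective (15% of tokens, as in the original BERT).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-biomed-continued",
    max_steps=100_000,                 # "limited training steps" -- illustrative value
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

After this stage, the resulting checkpoint would be fine-tuned on the downstream domain-specific tasks in the usual way and compared against a model pre-trained only on general corpora.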

Related research

12/31/2020 - An Experimental Evaluation of Transformer-based Language Models in the Biomedical Domain
05/27/2023 - An Investigation into the Effects of Pre-training Data Distributions for Pathology Report Classification
09/15/2021 - Enhancing Clinical Information Extraction with Transferred Contextual Embeddings
09/12/2019 - UER: An Open-Source Toolkit for Pre-training Models
08/12/2022 - Pre-training Tasks for User Intent Detection and Embedding Retrieval in E-commerce Search
06/09/2023 - FPDM: Domain-Specific Fast Pre-training Technique using Document-Level Metadata
10/09/2020 - Style Attuned Pre-training and Parameter Efficient Fine-tuning for Spoken Language Understanding
