An Investigation into the Effects of Pre-training Data Distributions for Pathology Report Classification

05/27/2023
by Aliyah R. Hsu, et al.

Pre-trained transformer models have demonstrated success across many natural language processing (NLP) tasks. In applying these models to the clinical domain, a prevailing assumption is that pre-training language models from scratch on large-scale biomedical data results in substantial improvements. We test this assumption with 4 pathology classification tasks on a corpus of 2907 prostate cancer pathology reports. We evaluate 5 pre-trained transformer models that are the same size but differ in pre-training corpora. Specifically, we analyze 3 categories of models: 1) General-domain: BERT and Turing Natural Language Representation (TNLR) models, which use general corpora for pre-training; 2) Mixed-domain: BioBERT, which is obtained from BERT by continued pre-training on PubMed abstracts, and Clinical BioBERT, which additionally includes MIMIC-III clinical notes; and 3) Domain-specific: PubMedBERT, which is pre-trained from scratch on PubMed abstracts. We find that the mixed-domain and domain-specific models exhibit faster feature disambiguation during fine-tuning. However, the domain-specific model, PubMedBERT, can overfit to minority classes when presented with class imbalance, a common scenario in pathology report data, whereas the mixed-domain models are more resistant to such overfitting. Our findings indicate that general natural language and domain-specific corpora serve complementary purposes in pre-training for pathology report classification: the first confers resistance to overfitting when fine-tuning on an imbalanced dataset, while the second allows more accurate modelling of the fine-tuning domain. An expert evaluation is also conducted to reveal common outlier modes of each model. Our results could inform better fine-tuning practices in the clinical domain, such as leveraging the benefits of mixed-domain models for imbalanced downstream datasets.
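
To make the comparison described in the abstract concrete, the sketch below fine-tunes one publicly available checkpoint per pre-training category on a text-classification task with the Hugging Face transformers library. This is not the authors' code: the checkpoint identifiers, the hypothetical `train_ds`/`eval_ds` datasets (standing in for the non-public pathology report corpus, with "text" and "label" columns), and all hyperparameters are assumptions, and TNLR is omitted because, to our knowledge, no public checkpoint is available.

```python
# Hypothetical sketch: fine-tune one checkpoint per pre-training category on a
# pathology-report-style classification task. Checkpoint names, datasets, and
# hyperparameters are assumptions, not the paper's actual configuration.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

CHECKPOINTS = {
    "BERT (general-domain)": "bert-base-uncased",
    "BioBERT (mixed-domain)": "dmis-lab/biobert-v1.1",
    "Clinical BioBERT (mixed-domain)": "emilyalsentzer/Bio_ClinicalBERT",
    "PubMedBERT (domain-specific)": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
}


def fine_tune(checkpoint, train_ds, eval_ds, num_labels):
    """Fine-tune `checkpoint` on datasets with 'text' and 'label' columns."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )

    def tokenize(batch):
        # Pathology reports can be long; truncate to the 512-token BERT limit.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    args = TrainingArguments(
        output_dir=f"runs/{checkpoint.split('/')[-1]}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds.map(tokenize, batched=True),
        eval_dataset=eval_ds.map(tokenize, batched=True),
        tokenizer=tokenizer,  # enables dynamic padding via the default collator
    )
    trainer.train()
    return trainer.evaluate()


# Run the same fine-tuning recipe for each of the four public checkpoints:
# results = {name: fine_tune(ckpt, train_ds, eval_ds, num_labels=4)
#            for name, ckpt in CHECKPOINTS.items()}
```

Given the class imbalance highlighted in the abstract, a common (though not paper-prescribed) addition to such a setup is a class-weighted loss or per-class metrics such as macro-F1 during evaluation, which makes overfitting to minority classes visible early in fine-tuning.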

Related research

12/15/2022 · The Effects of In-domain Corpus Size on pre-training BERT
05/05/2023 · Harnessing the Power of BERT in the Turkish Clinical Domain: Pretraining Approaches for Limited Data Scenarios
10/23/2022 · On Cross-Domain Pre-Trained Language Models for Clinical Text Mining: How Do They Perform on Data-Constrained Fine-Tuning?
12/27/2020 · MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining
10/16/2020 · Detecting ESG topics using domain-specific language models and data augmentation approaches
03/06/2020 · Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT
