Bioformer: an efficient transformer language model for biomedical text mining

02/03/2023
by   Li Fang, et al.

Pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art performance in natural language processing (NLP) tasks. Recently, BERT has been adapted to the biomedical domain. Despite their effectiveness, these models have hundreds of millions of parameters and are computationally expensive when applied to large-scale NLP applications. We hypothesized that the number of parameters of the original BERT can be dramatically reduced with minor impact on performance. In this study, we present Bioformer, a compact BERT model for biomedical text mining. We pretrained two Bioformer models (named Bioformer8L and Bioformer16L) which reduced the model size by 60% compared to BERTBase. Bioformer uses a biomedical vocabulary and was pre-trained from scratch on PubMed abstracts and PubMed Central full-text articles. We thoroughly evaluated the performance of Bioformer as well as existing biomedical BERT models, including BioBERT and PubMedBERT, on 15 benchmark datasets covering four biomedical NLP tasks: named entity recognition, relation extraction, question answering and document classification. The results show that with 60% fewer parameters, Bioformer16L is only 0.1% less accurate than PubMedBERT, while Bioformer8L is 0.9% less accurate than PubMedBERT. Both Bioformer16L and Bioformer8L outperformed BioBERTBase-v1.1. In addition, Bioformer16L and Bioformer8L are 2-3 times as fast as PubMedBERT and BioBERTBase-v1.1. Bioformer has been successfully deployed to PubTator Central, providing gene annotations over 35 million PubMed abstracts and 5 million PubMed Central full-text articles. We make Bioformer publicly available via https://github.com/WGLab/bioformer, including pre-trained models, datasets, and instructions for downstream use.
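As a quick illustration of downstream use, the sketch below loads a Bioformer checkpoint for masked-token prediction with the Hugging Face transformers library. The model identifier "bioformers/bioformer-8L" and the example sentence are assumptions for illustration; the GitHub repository above lists the official checkpoint names and fine-tuning instructions.

```python
# Minimal sketch: masked-token prediction with a Bioformer checkpoint.
# Assumes the Hugging Face model ID "bioformers/bioformer-8L"; check the
# WGLab/bioformer repository for the official checkpoint names.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bioformers/bioformer-8L"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "BRCA1 mutations increase the risk of [MASK] cancer."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and print the top-scoring token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top_id = logits[0, mask_pos].argmax(-1).item()
print(tokenizer.decode([top_id]))
```

Because Bioformer is a standard BERT-style model, the same checkpoint can be passed to AutoModelForTokenClassification or AutoModelForSequenceClassification for fine-tuning on the named entity recognition and document classification tasks evaluated in the paper.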

