BioMegatron: Larger Biomedical Domain Language Model

10/12/2020
by Hoo-chang Shin, et al.

There has been an influx of biomedical domain-specific language models, showing that language models pre-trained on biomedical text perform better on biomedical benchmarks than those trained on general-domain corpora such as Wikipedia and Books. Yet most prior work does not study in depth the factors that affect each domain-specific language application, and the effect of model size on domain-specific models has been largely unexamined. We empirically study and evaluate several factors that can affect performance on domain language applications, such as the sub-word vocabulary, model size, pre-training corpus, and domain transfer. We show consistent improvements on benchmarks with our larger BioMegatron model trained on a larger domain corpus, contributing to our understanding of domain language model applications. We demonstrate noticeable improvements over the previous state-of-the-art (SOTA) on standard biomedical NLP benchmarks for named entity recognition, relation extraction, and question answering. Model checkpoints and code are available at https://ngc.nvidia.com and https://github.com/NVIDIA/NeMo.
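
Since the released checkpoints target downstream biomedical NLP tasks such as named entity recognition, the sketch below illustrates one way a BERT-style biomedical checkpoint could be fine-tuned for token classification using the Hugging Face `transformers` API. The checkpoint name and label set are placeholder assumptions, not the paper's actual setup; BioMegatron itself is distributed through NGC and NVIDIA NeMo.

```python
# Minimal sketch: token classification (NER) with a BERT-style encoder.
# The checkpoint name below is a placeholder assumption; substitute the
# BioMegatron checkpoint obtained from NGC / NeMo, or another biomedical model.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "bert-base-cased"  # placeholder; swap in a biomedical checkpoint
num_labels = 3                  # e.g. B-Disease / I-Disease / O for a hypothetical disease-NER task

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=num_labels)

# Pre-tokenized example sentence, as NER corpora are usually word-aligned.
tokens = ["Patients", "with", "type", "2", "diabetes", "were", "enrolled", "."]
encoding = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

# Forward pass; in practice the classification head would first be fine-tuned
# on a labeled biomedical NER corpus before predictions are meaningful.
outputs = model(**encoding)
predictions = outputs.logits.argmax(dim=-1)
print(predictions)
```

The design is the standard one for encoder-based NER: a linear classification head over per-token hidden states, trained with a labeled domain corpus on top of the pre-trained encoder.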


