Building Chinese Biomedical Language Models via Multi-Level Text Discrimination

10/14/2021
by Quan Wang, et al.

Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized the field of NLP, not only in the general domain but also in the biomedical domain. Most prior efforts in building biomedical PLMs have resorted simply to domain adaptation and focused mainly on English. In this work we introduce eHealth, a Chinese biomedical PLM built with a new pre-training framework. This framework trains eHealth as a discriminator through both token-level and sequence-level discrimination. The former detects input tokens corrupted by a generator and selects their original signals from plausible candidates, while the latter further distinguishes corruptions of the same original sequence from those of other sequences. As such, eHealth learns language semantics at both the token and sequence levels. Extensive experiments on 11 Chinese biomedical language understanding tasks of various forms verify the effectiveness and superiority of our approach. The pre-trained model is available to the public at <https://github.com/PaddlePaddle/Research/tree/master/KG/eHealth> and the code will also be released later.
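Since the authors' code has not yet been released, the following is only a minimal PyTorch-style sketch of the two discrimination objectives described in the abstract, not the actual eHealth implementation. All names here (MultiLevelDiscriminationHeads, hidden_dim, the candidate tensors, the temperature value) are illustrative assumptions: the token-level head combines replaced-token detection with selection of the original token from plausible candidates, and the sequence-level head applies an in-batch contrastive loss that treats two corruptions of the same original sequence as positives and corruptions of other sequences as negatives.

```python
# Hedged sketch of the multi-level discrimination objectives (assumed design,
# not the released eHealth code). Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelDiscriminationHeads(nn.Module):
    """Token-level and sequence-level discrimination heads on top of an encoder."""

    def __init__(self, hidden_dim: int, temperature: float = 0.07):
        super().__init__()
        self.detector = nn.Linear(hidden_dim, 1)            # replaced-token detection
        self.selector = nn.Linear(hidden_dim, hidden_dim)   # original-token selection
        self.projector = nn.Linear(hidden_dim, hidden_dim)  # sequence embedding
        self.temperature = temperature

    def token_level_loss(self, token_states, candidate_embs, replaced_mask, gold_idx):
        # token_states:   (B, L, H) encoder outputs on the corrupted input
        # candidate_embs: (B, L, K, H) embeddings of K plausible candidate tokens
        # replaced_mask:  (B, L) bool, True where the generator replaced the token
        # gold_idx:       (B, L) index of the original token among the K candidates
        detect_logits = self.detector(token_states).squeeze(-1)
        detect_loss = F.binary_cross_entropy_with_logits(
            detect_logits, replaced_mask.float())

        # Select the original token among the candidates at corrupted positions only.
        query = self.selector(token_states)                              # (B, L, H)
        select_logits = torch.einsum("blh,blkh->blk", query, candidate_embs)
        select_loss = F.cross_entropy(
            select_logits[replaced_mask], gold_idx[replaced_mask])
        return detect_loss + select_loss

    def sequence_level_loss(self, view1_cls, view2_cls):
        # view1_cls, view2_cls: (B, H) sequence representations of two independent
        # corruptions of the same batch of original sequences.
        z1 = F.normalize(self.projector(view1_cls), dim=-1)
        z2 = F.normalize(self.projector(view2_cls), dim=-1)
        sim = z1 @ z2.t() / self.temperature                   # (B, B) similarities
        targets = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
        return F.cross_entropy(sim, targets)
```

In this sketch the total pre-training loss would simply sum the two terms, e.g. `loss = heads.token_level_loss(...) + heads.sequence_level_loss(...)`; how the paper actually weights or schedules the two objectives is not stated in the abstract.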


