Building Chinese Biomedical Language Models via Multi-Level Text Discrimination

by   Quan Wang, et al.

Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized the field of NLP, not only in the general domain but also in the biomedical domain. Most prior efforts in building biomedical PLMs have resorted simply to domain adaptation and focused mainly on English. In this work we introduce eHealth, a biomedical PLM in Chinese built with a new pre-training framework. This new framework trains eHealth as a discriminator through both token-level and sequence-level discrimination. The former is to detect input tokens corrupted by a generator and select their original signals from plausible candidates, while the latter is to further distinguish corruptions of a same original sequence from those of the others. As such, eHealth can learn language semantics at both the token and sequence levels. Extensive experiments on 11 Chinese biomedical language understanding tasks of various forms verify the effectiveness and superiority of our approach. The pre-trained model is available to the public at <> and the code will also be released later.



There are no comments yet.


page 1

page 2

page 3

page 4


Conceptualized Representation Learning for Chinese Biomedical Text Mining

Biomedical text mining is becoming increasingly important as the number ...

Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets

Inspired by the success of the General Language Understanding Evaluation...

BioMegatron: Larger Biomedical Domain Language Model

There has been an influx of biomedical domain-specific language models, ...

Probing Biomedical Embeddings from Language Models

Contextualized word embeddings derived from pre-trained language models ...

CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark

Artificial Intelligence (AI), along with the recent progress in biomedic...

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

This paper presents a new pre-trained language model, DeBERTaV3, which i...

AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

Pre-trained language models such as BERT have exhibited remarkable perfo...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.