Conceptualized Representation Learning for Chinese Biomedical Text Mining

08/25/2020
by   Ningyu Zhang, et al.

Biomedical text mining is becoming increasingly important as the volume of biomedical documents and web data grows rapidly. Recently, word representation models such as BERT have gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts, as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult for language models to learn. Chinese biomedical text is more difficult still, due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (ChineseBLUE). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach can bring significant gains. We release the pre-trained model on GitHub: https://github.com/alibaba-research/ChineseBLUE.

research
01/25/2019

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Biomedical text mining is becoming increasingly important as the number ...
research
10/14/2021

Building Chinese Biomedical Language Models via Multi-Level Text Discrimination

Pre-trained language models (PLMs), such as BERT and GPT, have revolutio...
research
01/25/2019

BioBERT: pre-trained biomedical language representation model for biomedical text mining

Biomedical text mining has become more important than ever as the number...
research
01/30/2023

PaCaNet: A Study on CycleGAN with Transfer Learning for Diversifying Fused Chinese Painting and Calligraphy

AI-Generated Content (AIGC) has recently gained a surge in popularity, p...
research
11/13/2020

RethinkCWS: Is Chinese Word Segmentation a Solved Task?

The performance of the Chinese Word Segmentation (CWS) systems has gradu...
research
06/09/2022

SsciBERT: A Pre-trained Language Model for Social Science Texts

The academic literature of social sciences is the literature that record...
research
10/11/2022

HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Historical records in Korea before the 20th century were primarily writt...
