A pre-training technique to localize medical BERT and enhance BioBERT

05/14/2020 ∙ by Shoya Wada, et al. ∙ 0

Bidirectional Encoder Representations from Transformers (BERT) models for biomedical specialties, such as BioBERT and clinicalBERT, have significantly improved performance on biomedical text-mining tasks and enabled valuable information to be extracted from the biomedical literature. However, these benefits have been largely limited to English, because high-quality medical corpora such as PubMed are scarce in other languages. We therefore propose a method that produces a high-performance BERT model from a small corpus. We use the method to train BERT models on small medical corpora in English and in Japanese, and evaluate them on the Biomedical Language Understanding Evaluation (BLUE) benchmark and a Japanese medical-document-classification task, respectively. After confirming their satisfactory performance, we apply our method to develop a model that outperforms pre-existing models. Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT) achieves the best score on 7 of the 10 BLUE datasets, and its total score is 1.0 points above that of BioBERT.
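The pre-training objective underlying all of these models is BERT's masked language modeling: a fraction of input tokens (typically 15%) is selected, and of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged, with the model trained to recover the originals. As a minimal sketch of that standard corruption scheme (not the authors' exact pipeline; the `VOCAB` list and token strings below are purely illustrative):

```python
import random

MASK = "[MASK]"
# Illustrative stand-in vocabulary for random-token replacement.
VOCAB = ["cell", "gene", "protein", "tumor", "patient"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Apply BERT-style masked-language-model corruption.

    Each token is selected with probability `mask_prob`; of the
    selected positions, 80% become [MASK], 10% become a random
    vocabulary token, and 10% keep the original token. Returns
    (inputs, labels): labels hold the original token at selected
    positions and None elsewhere (ignored by the training loss).
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK            # 80%: mask out
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)  # 10%: random token
            # else: 10% keep the original token unchanged
    return inputs, labels
```

Keeping 10% of selected tokens unchanged forces the model to produce useful representations for every position, since it cannot tell which inputs were corrupted.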






Code Repositories


Implementation of the BLUE benchmark with Transformers. The details of our pre-training procedure can be found in https://arxiv.org/abs/2005.07202.
