Improved Biomedical Word Embeddings in the Transformer Era

12/22/2020
by Jiho Noh, et al.

Biomedical word embeddings are usually pre-trained on free-text corpora with neural methods that capture local and global distributional properties. They are leveraged in downstream tasks using various neural architectures designed to optimize task-specific objectives that may further tune such embeddings. Since 2018, however, there has been a marked shift from these static embeddings to contextual embeddings derived from language models (e.g., ELMo, transformers such as BERT, and ULMFiT). These dynamic embeddings have the added benefit of distinguishing homonyms and acronyms given their context. However, static embeddings remain relevant in low-resource settings (e.g., smart devices, IoT elements) and for studying lexical semantics from a computational linguistics perspective. In this paper, we jointly learn word and concept embeddings, first with the skip-gram method and then by fine-tuning them with correlational information manifesting in co-occurring Medical Subject Heading (MeSH) concepts in biomedical citations. This fine-tuning is accomplished with the BERT transformer architecture in the two-sentence input mode, using a classification objective that captures MeSH pair co-occurrence. In essence, we repurpose a transformer architecture (typically used to generate dynamic embeddings) to improve static embeddings using concept correlations. We evaluate these tuned static embeddings on multiple word-relatedness datasets developed by previous efforts. Without selectively culling concepts and terms (as was done in previous efforts), we believe we offer the most exhaustive evaluation of static embeddings to date, with clear performance improvements across the board. We provide our code and embeddings for public use in downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings
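The fine-tuning step described in the abstract can be pictured with a short, hedged sketch. This is not the authors' released BERT-CRel code; all module names, dimensions, and the toy batch below are illustrative assumptions. The idea: a static embedding table (initialized from skip-gram vectors) feeds a small BERT-style encoder that reads two concept "sentences" and predicts whether the corresponding MeSH pair co-occurs in a citation; after training, the embedding table itself is the improved static embedding set.

```python
# Hedged sketch, not the released BERT-CRel implementation.
# Vocabulary size, dimensions, layer counts, and the toy batch are assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE, DIM, MAX_LEN = 30000, 300, 32   # assumed sizes

class ConceptPairClassifier(nn.Module):
    def __init__(self, pretrained=None):
        super().__init__()
        # Static embeddings, e.g. initialized from skip-gram vectors.
        self.embed = nn.Embedding(VOCAB_SIZE, DIM, padding_idx=0)
        if pretrained is not None:
            self.embed.weight.data.copy_(pretrained)
        # Segment embeddings distinguish the two "sentences" (concept A vs. concept B).
        self.segment = nn.Embedding(2, DIM)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=DIM, nhead=6, dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(DIM, 2)  # MeSH pair co-occurs vs. not

    def forward(self, token_ids, segment_ids):
        x = self.embed(token_ids) + self.segment(segment_ids)
        h = self.encoder(x)
        return self.classifier(h[:, 0])      # [CLS]-style first position

model = ConceptPairClassifier()
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy batch standing in for "[CLS] concept-A terms [SEP] concept-B terms [SEP]".
tokens = torch.randint(1, VOCAB_SIZE, (8, MAX_LEN))
segments = torch.cat([torch.zeros(8, MAX_LEN // 2, dtype=torch.long),
                      torch.ones(8, MAX_LEN // 2, dtype=torch.long)], dim=1)
labels = torch.randint(0, 2, (8,))           # 1 = pair co-occurs in a citation

logits = model(tokens, segments)
loss = loss_fn(logits, labels)
loss.backward()
optim.step()

# After training, the tuned static embeddings are simply the embedding table.
improved_vectors = model.embed.weight.detach()
```

In this reading, the transformer is only a training scaffold: it is discarded after fine-tuning, and only the tuned embedding matrix is kept for downstream, low-resource use.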


Related research

Using Dynamic Embeddings to Improve Static Embeddings (11/07/2019)
How to build high-quality word embeddings is a fundamental research ques...

Probing Biomedical Embeddings from Language Models (04/03/2019)
Contextualized word embeddings derived from pre-trained language models ...

BERT-based Ranking for Biomedical Entity Normalization (08/09/2019)
Developing high-performance entity normalization algorithms that can all...

Distilling Semantic Concept Embeddings from Contrastively Fine-Tuned Language Models (05/16/2023)
Learning vectors that capture the meaning of concepts remains a fundamen...

BioSentVec: creating sentence embeddings for biomedical texts (10/22/2018)
Sentence embeddings have become an essential part of today's natural lan...

Unsupervised Lexical Substitution with Decontextualised Embeddings (09/17/2022)
We propose a new unsupervised method for lexical substitution using pre-...

InvBERT: Text Reconstruction from Contextualized Embeddings used for Derived Text Formats of Literary Works (09/21/2021)
Digital Humanities and Computational Literary Studies apply text mining ...
