
Probing Biomedical Embeddings from Language Models

by Qiao Jin et al.
University of Pittsburgh
Carnegie Mellon University

Contextualized word embeddings derived from pre-trained language models (LMs) show significant improvements on downstream NLP tasks. Pre-training on domain-specific corpora, such as biomedical articles, further improves their performance. In this paper, we conduct probing experiments to determine what additional information is carried intrinsically by the in-domain trained contextualized embeddings. To this end, we use the pre-trained LMs as fixed feature extractors and restrict the downstream task models to have no additional sequence-modeling layers. We compare BERT, ELMo, BioBERT and BioELMo, a biomedical version of ELMo trained on 10M PubMed abstracts. Surprisingly, while fine-tuned BioBERT outperforms BioELMo on biomedical NER and NLI tasks, BioELMo outperforms BioBERT as a fixed feature extractor in our probing tasks. We use visualization and nearest-neighbor analysis to show that this superiority stems from better encoding of entity-type and relational information.
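The fixed-feature probing setup can be sketched in a few lines. This is a minimal illustration, not the paper's code: the Gaussian vectors below are hypothetical stand-ins for frozen contextual embeddings (the real experiments use BioELMo/BioBERT features), and the probe on top is a bare logistic-regression classifier with no sequence-modeling layers.

```python
import math
import random

random.seed(0)
DIM, N = 16, 200  # feature dimension, number of tokens

def frozen_feature(shift):
    """Hypothetical frozen contextual embedding for one token."""
    return [random.gauss(shift, 1.0) for _ in range(DIM)]

# Two synthetic "entity types" (e.g. gene vs. disease mentions).
X = [frozen_feature(-0.5) for _ in range(N // 2)] + \
    [frozen_feature(+0.5) for _ in range(N // 2)]
y = [0] * (N // 2) + [1] * (N // 2)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_linear_probe(X, y, lr=0.05, epochs=100):
    """SGD on a single weight vector and bias; the features stay fixed."""
    w, b = [0.0] * len(X[0]), 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        random.shuffle(idx)
        for i in idx:
            z = sum(wi * xi for wi, xi in zip(w, X[i])) + b
            err = sigmoid(z) - y[i]
            w = [wi - lr * err * xi for wi, xi in zip(w, X[i])]
            b -= lr * err
    return w, b

w, b = train_linear_probe(X, y)
preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
```

Because the feature extractor is never updated, whatever accuracy the linear probe reaches reflects information already encoded in the embeddings themselves, which is exactly the quantity the probing experiments measure.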




Code Repositories


BioELMo is a biomedical version of embeddings from language model (ELMo), pre-trained on PubMed abstracts.
