Probing Biomedical Embeddings from Language Models

04/03/2019
by   Qiao Jin, et al.
Google
University of Pittsburgh
Carnegie Mellon University

Contextualized word embeddings derived from pre-trained language models (LMs) yield significant improvements on downstream NLP tasks. Pre-training on domain-specific corpora, such as biomedical articles, further improves their performance. In this paper, we conduct probing experiments to determine what additional information is carried intrinsically by the in-domain trained contextualized embeddings. For this, we use the pre-trained LMs as fixed feature extractors and restrict the downstream task models to have no additional sequence modeling layers. We compare BERT, ELMo, BioBERT and BioELMo, a biomedical version of ELMo trained on 10M PubMed abstracts. Surprisingly, while fine-tuned BioBERT is better than BioELMo on biomedical NER and NLI tasks, as a fixed feature extractor BioELMo outperforms BioBERT in our probing tasks. We use visualization and nearest-neighbor analysis to show that this superiority stems from better encoding of entity-type and relational information.
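To make the probing setup concrete, below is a minimal sketch of a linear probe over a frozen encoder. This is not the authors' code: the Hugging Face transformers API, the dmis-lab/biobert-base-cased-v1.1 checkpoint name, and the toy tag set are assumptions for illustration.

```python
# Probing sketch: a frozen LM as a fixed feature extractor, with only a
# linear layer (no additional sequence modeling) on top.
# Assumes the Hugging Face `transformers` library and the public
# `dmis-lab/biobert-base-cased-v1.1` checkpoint; tag set is a toy example.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)
encoder.eval()  # fixed feature extractor: never fine-tuned

for p in encoder.parameters():
    p.requires_grad = False  # freeze all encoder weights

# Linear probe: one classification layer per token (e.g. NER tags).
num_tags = 5  # toy tag-set size
probe = torch.nn.Linear(encoder.config.hidden_size, num_tags)

sentence = "Metformin is used to treat type 2 diabetes."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():  # features are extracted once, never updated
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)

logits = probe(hidden)  # only the probe's parameters are trainable
print(logits.shape)     # torch.Size([1, seq_len, num_tags])
```

Because the encoder's parameters stay frozen, any probing accuracy must come from information already present in the embeddings, which is the premise of the paper's experiments.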


Related Research

12/31/2020

An Experimental Evaluation of Transformer-based Language Models in the Biomedical Domain

With the growing amount of text in health data, there have been rapid ad...
09/17/2021

Task-adaptive Pre-training of Language Models with Word Embedding Regularization

Pre-trained language models (PTLMs) acquire domain-independent linguisti...
12/22/2020

Improved Biomedical Word Embeddings in the Transformer Era

Biomedical word embeddings are usually pre-trained on free text corpora ...
12/16/2021

Unsupervised Matching of Data and Text

Entity resolution is a widely studied problem with several proposals to ...
06/23/2021

Recognising Biomedical Names: Challenges and Solutions

The growth rate in the amount of biomedical documents is staggering. Unl...
07/24/2021

Stress Test Evaluation of Biomedical Word Embeddings

The success of pretrained word embeddings has motivated their use in the...
10/13/2022

Incorporating Context into Subword Vocabularies

Most current popular subword tokenizers are trained based on word freque...

Code Repositories

bioelmo

BioELMo is a biomedical version of Embeddings from Language Models (ELMo), pre-trained on 10M PubMed abstracts.


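As a rough sketch of how fixed BioELMo features might be extracted, assuming the allennlp 0.x ElmoEmbedder API; the option and weight file names below are placeholders, not the repository's actual paths:

```python
# Sketch: extracting fixed BioELMo features with allennlp's ElmoEmbedder.
# The options/weights file names are placeholders; substitute the files
# distributed with the bioelmo repository. Assumes allennlp 0.x.
from allennlp.commands.elmo import ElmoEmbedder

embedder = ElmoEmbedder(
    options_file="biomed_elmo_options.json",  # placeholder path
    weight_file="biomed_elmo_weights.hdf5",   # placeholder path
)

tokens = ["The", "patient", "was", "given", "metformin", "."]
# Returns a (3, num_tokens, 1024) array: one layer of character-CNN
# features plus two biLSTM layers; probes typically learn a weighted
# combination of the three layers.
vectors = embedder.embed_sentence(tokens)
print(vectors.shape)  # (3, 6, 1024)
```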