AWE-CM Vectors: Augmenting Word Embeddings with a Clinical Metathesaurus

12/05/2017
by Willie Boag, et al.

In recent years, word embeddings have proven surprisingly effective at capturing intuitive characteristics of the words they represent. These vectors achieve the best results when the training corpora are extremely large, sometimes billions of words. Clinical natural language processing datasets, however, tend to be much smaller. Even the largest publicly available dataset of medical notes is three orders of magnitude smaller than the corpus behind the oft-used "Google News" word vectors. To compensate for limited training data, we encode expert domain knowledge into our embeddings. Building on a previous extension of word2vec, we show that generalizing the notion of a word's "context" to include arbitrary features creates an avenue for encoding domain knowledge into word embeddings. We show that the word vectors produced by this method outperform their text-only counterparts across the board in correlation with clinical experts.
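The abstract does not spell out how features enter the training data, so the following is only a minimal sketch of the general idea, assuming skip-gram-style (word, context) pairs. The function name `context_pairs`, the toy lexicon `toy_features`, and the tag `CONCEPT_A` are hypothetical illustrations, not the authors' code or real metathesaurus identifiers.

```python
from typing import Dict, List, Tuple


def context_pairs(
    tokens: List[str],
    window: int,
    features: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Build (word, context) pairs where a context is either a
    neighboring word or an arbitrary feature attached to the word."""
    pairs: List[Tuple[str, str]] = []
    for i, word in enumerate(tokens):
        # Ordinary text contexts: words within the window around position i.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
        # Generalized contexts: domain-knowledge features for this word
        # (hypothetical concept tags from an assumed lexicon).
        for feat in features.get(word, []):
            pairs.append((word, feat))
    return pairs


# Toy lexicon (assumption): two surface forms share one concept tag.
toy_features = {"mi": ["CONCEPT_A"], "infarction": ["CONCEPT_A"]}
tokens = ["patient", "had", "mi"]
pairs = context_pairs(tokens, window=1, features=toy_features)
```

Under these assumptions, words that share a feature also share a context, so a model trained on such pairs would be nudged to place them near each other even when the text corpus alone offers little evidence.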

research
10/20/2020

UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

Contextual word embedding models, such as BioBERT and Bio_ClinicalBERT, ...
research
04/11/2018

Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical Abbreviation Expansion

In the medical domain, identifying and expanding abbreviations in clinic...
research
10/28/2016

Word Embeddings for the Construction Domain

We introduce word vectors for the construction domain. Our vectors were ...
research
04/10/2017

Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization

Text normalization techniques based on rules, lexicons or supervised tra...
research
01/05/2021

Integration of Domain Knowledge using Medical Knowledge Graph Deep Learning for Cancer Phenotyping

A key component of deep learning (DL) for natural language processing (N...
research
02/20/2021

Knowledge-Base Enriched Word Embeddings for Biomedical Domain

Word embeddings have been shown adept at capturing the semantic and synt...
research
09/03/2015

Encoding Prior Knowledge with Eigenword Embeddings

Canonical correlation analysis (CCA) is a method for reducing the dimens...
