Evaluating Sparse Interpretable Word Embeddings for Biomedical Domain

Word embeddings have found their way into a wide range of natural language processing tasks, including those in the biomedical domain. While these vector representations successfully capture semantic and syntactic word relations, as well as hidden patterns and trends in the data, they fail to offer interpretability. Interpretability is a key means of justification, which is integral to biomedical applications. We present an inclusive study on the interpretability of word embeddings in the medical domain, focusing on the role of sparse methods. Qualitative and quantitative measurements and metrics for the interpretability of word vector representations are provided. For the quantitative evaluation, we introduce an extensive categorized dataset that can be used to quantify interpretability based on category theory. Intrinsic and extrinsic evaluations of the studied methods are also presented. For the latter, we propose datasets that can be used for effective extrinsic evaluation of word vectors in the biomedical domain. Our experiments show that sparse word vectors are far more interpretable while preserving the performance of their original vectors in downstream tasks.




1 Introduction

Word vector representation algorithms embed words in a high-dimensional space where words maintain semantic and syntactic relationships with each other. The main idea behind most of these algorithms is to maximize the dot product of the vector representations of words appearing in the same context, thereby maximizing the similarity between them. These methods have been used in many downstream Natural Language Processing (NLP) tasks and have been shown to make a noticeable difference in performance.

Although they have improved performance in various tasks, word embedding methods fail to train word vectors with interpretable dimensions. The word relations seem relative rather than absolute, and therefore cannot be interpreted to identify what properties these algorithms capture. Moreover, interpretable word vector representations are believed to capture features similar to those typically considered in NLP. This becomes even more important in biomedical NLP (bioNLP) applications, where decisions made by machines must be reliable, and thus interpretable. However, current methods fail to fulfill this requirement, as they work like a black box.

One approach towards interpretability in word embeddings is via sparse methods. Many studies of unsupervised methods have demonstrated that interpretability is commensurate with sparsity murphy2012learning fyshe2014interpretable dahiya2016discovering. Moreover, sparse vectors are computationally faster and easier to work with. Sparse interpretable word vectors have even been shown to outperform the original vectors in some downstream tasks guo-etal-2014-revisiting. Indeed, classifiers benefit from the higher usability of sparse representations as features, and the separability of dimensions in word vectors can be profitable.

While sparse interpretable word embedding methods have been studied in the context of NLP, they are not well understood in bioNLP. This motivates us to further explore interpretability in the biomedical domain, focusing mainly on sparse interpretable word vector representations. Specifically, we first train dense word vectors on medical text via the Skip-gram mikolov2013distributed and GloVe pennington2014glove methods. The obtained dense representations are then used to train sparse interpretable word embeddings through two methods, namely, Sparse Overcomplete Word Vector Representation (SPOWV) faruqui2015sparse and Sparse Interpretable Neural Embedding (SPINE) spine. This way, we arrive at four different sets of sparse representations, which are compared to each other and to their original dense vectors in terms of their performance in downstream tasks and their interpretability.

Our contribution to this area is three-fold: 1) introducing a structured pipeline for training sparse interpretable word vectors in the biomedical field and for evaluating them from different aspects; 2) nominating classification tasks and datasets that can serve as reliable downstream benchmarks in the biomedical domain; 3) proposing a novel approach for evaluating interpretability using a comprehensive categorized dataset that can serve as a reference in the biomedical context. (Footnote 1: We have made our implementations and dataset publicly available at https://github.com/Institute-for-Artificial-Intelligence/Sparse-Interpretable-Word-Embeddings-For-Medical-Domain)

The paper is organized as follows. In Section 2 we review the related works to clarify where our work stands in the relevant field. In Section 3, the required preliminaries on the used dense and sparse word embeddings are briefly pointed out. Then, in Section 4, we explain our methods including the pre-processing steps, experiment setup, and the evaluation procedures. The results are then provided and discussed in Section 5. Finally, Section 6 concludes the paper.

2 Related Works

The problem of word vector interpretability has been an active area of research in recent years. murphy2012learning were the first to apply matrix factorization to construct a non-negative sparse embedding. Later on, faruqui2015sparse solved an optimization problem within a dictionary learning approach to create sparse interpretable word embeddings. In 2017, park2017rotated took advantage of a rotation matrix to improve the interpretability of word embeddings. In another recent study, a k-sparse autoencoder is utilized to create sparse interpretable word embeddings spine.


As for the biomedical domain, most studies on word vector representation merely look into the concept of similarity and the performance of word embeddings on downstream tasks sajadi2015domain pakhomov2016corpus chiu2018bio muneeb2015evaluating, while the interpretability of the word embeddings is not well studied.

There is, however, one recent work addressing interpretability in the biomedical context jha2018interpretable, which applies a supervised non-sparse method based on park2017rotated to improve the interpretability of the word embeddings. To the best of our knowledge, there has been no prior work on unsupervised sparse interpretable word embeddings in the biomedical context.

3 Preliminaries: Dense and Sparse Word Embeddings

In this section, first we point out the required preliminaries on the dense word embeddings as the baseline original representations used in this paper. Then, we briefly cover the fundamentals of the two aforementioned sparse methods, i.e., SPOWV and SPINE.

3.1 Dense Word Embedding Methods

Word vector representation methods have long been an integral part of NLP. These methods attempt to represent words as vectors which embody word relations from a large unlabeled corpus. They can be divided into two main categories: a) algorithms, such as latent semantic analysis (LSA) LSA, that use global co-occurrence information and matrix factorization; b) algorithms, like Word2Vec’s Skip-gram mikolov2013distributed, that use local window-based information.

There are, however, some methods such as Global Vectors for Word Representation (GloVe) pennington2014glove that try to take advantage of both classes by utilizing global and window-based information together. Since Skip-gram and GloVe have been shown to outperform other approaches in downstream tasks such as sentiment analysis, named entity recognition NER, and NP parsing parsing, in this paper we only use these two methods to train dense word embeddings on medical text.

Although both of these algorithms do well in capturing semantic and syntactic relations among words, they fail to produce an interpretable set of word vectors with meaningful individual dimensions.

3.2 SPOWV: Sparse Overcomplete Word Vector Representation

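Faruqui et al. proposed a novel approach in faruqui2015sparse that embeds distributed word representations into a higher-dimensional space using sparse transformations. The goal was to achieve a level of interpretability that would be clear to human judgment as well. To achieve this goal, they applied sparse dictionary learning to transform the matrix of word vectors $\mathbf{X} \in \mathbb{R}^{V \times L}$ into $\mathbf{A} \in \mathbb{R}^{V \times K}$, where $V$ is the size of the vocabulary, $L$ is the dimension of the original space, and $K$ is the dimension of the sparse space such that $K > L$. The objective function designed to optimize SPOWV is given by

$$\underset{\mathbf{D}, \mathbf{A}}{\arg\min} \; \sum_{i=1}^{V} \left( \lVert \mathbf{x}_i - \mathbf{a}_i \mathbf{D} \rVert_2^2 + \lambda \lVert \mathbf{a}_i \rVert_1 \right) + \tau \lVert \mathbf{D} \rVert_2^2 \tag{1}$$

where $\mathbf{D} \in \mathbb{R}^{K \times L}$ is a dictionary consisting of $K$ vector bases, $\mathbf{A}$ is the desired sparse embedding whose $i$-th row is denoted by $\mathbf{a}_i$, $\mathbf{x}_i$ is the $i$-th row of $\mathbf{X}$, and $\lambda$ and $\tau$ are coefficients for the regularization penalty of $\mathbf{A}$ and $\mathbf{D}$, respectively.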
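To make the roles of the two regularization coefficients concrete, the following minimal sketch evaluates a sparse-coding objective of this kind for given matrices. It assumes the form used by Faruqui et al. (squared reconstruction error, an l1 penalty on the sparse codes, and an l2 penalty on the dictionary) with words stored as rows; the function name `spowv_objective` and the toy values are illustrative and are not taken from the authors' implementation.

```python
def spowv_objective(X, A, D, lam, tau):
    """Objective value for row vectors x_i approximated by a_i D.

    X: V x L dense vectors, A: V x K sparse codes, D: K x L dictionary.
    lam and tau weight the l1 (codes) and l2 (dictionary) penalties.
    """
    total = 0.0
    for x, a in zip(X, A):
        # Reconstruct x from its sparse code: x_hat[l] = sum_k a[k] * D[k][l]
        x_hat = [sum(a_k * d_row[l] for a_k, d_row in zip(a, D))
                 for l in range(len(x))]
        total += sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))  # squared error
        total += lam * sum(abs(a_k) for a_k in a)                  # l1 penalty on the code
    total += tau * sum(d ** 2 for row in D for d in row)           # l2 penalty on the dictionary
    return total


# Toy example: one 2-dimensional word vector, an overcomplete dictionary with K = 3.
X = [[1.0, 0.0]]
A = [[1.0, 0.0, 0.0]]
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value = spowv_objective(X, A, D, lam=0.5, tau=0.1)
```

In this toy case the reconstruction is exact, so the value comes only from the two penalties; raising `lam` makes dense codes more expensive, which is what drives sparsity.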

3.3 SPINE: Sparse Interpretable Neural Embedding

The concept of sparse autoencoders was first introduced in Andrew Ng’s lecture slides ng2011sparse. SPINE spine utilizes $k$-sparse autoencoders makhzani2013k in order to train sparse interpretable word vector representations. Its corresponding loss function consists of three terms to induce sparsity and interpretability as follows

$$L(\mathcal{D}) = \lambda_1 \, RL(\mathcal{D}) + \lambda_2 \, ASL(\mathcal{D}) + \lambda_3 \, PSL(\mathcal{D}) \tag{2}$$

where $\mathcal{D}$ denotes the input set of representations, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are parameters that should be tuned. The terms on the right-hand side of the above loss function are explained briefly as follows.

3.3.1 Reconstruction Loss (RL)

This term, denoted by $RL(\mathcal{D})$ in Equation (2), minimizes the error of reconstructing the input from the latent space, which is also the main loss component in basic non-sparse autoencoders. Considering $\tilde{X}$ to be the reconstructed vector from the input vector $X$, the reconstruction loss is written as follows

$$RL(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{X \in \mathcal{D}} \lVert X - \tilde{X} \rVert_2^2$$

3.3.2 Average Sparsity Loss (ASL)

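This term, denoted by $ASL(\mathcal{D})$ in Equation (2), penalizes deviations from the desired sparsity as follows

$$ASL(\mathcal{D}) = \sum_{h \in H} \left( \max\left(0, \; \rho_{h,\mathcal{D}} - \rho^{*}_{h,\mathcal{D}} \right) \right)^2$$

where $H$ is the set of hidden units, $\rho^{*}_{h,\mathcal{D}}$ is the desired sparsity ratio for unit $h$ across $\mathcal{D}$, and $\rho_{h,\mathcal{D}}$ denotes the observed average activation value for unit $h$ across $\mathcal{D}$.

The idea behind the formulation of this penalty is that, in order for the latent vector to be $k$-sparse and have only $k$ active elements, the average activation of each unit should be around the desired sparsity ratio $\rho^{*}_{h,\mathcal{D}}$.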

3.3.3 Partial Sparsity Loss (PSL)

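This term, denoted by $PSL(\mathcal{D})$ in Equation (2), penalizes the elements whose values are not close to 0 or 1 as follows

$$PSL(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{X \in \mathcal{D}} \sum_{h \in H} Z^{(X)}_{h} \left( 1 - Z^{(X)}_{h} \right)$$

where $Z^{(X)}_{h}$ denotes the activation value for the hidden unit $h$, for the input representation $X$.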

The role of PSL is to enforce elements of the latent space vectors to be binary rather than being around the average desired sparsity ratio.
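The three loss terms can be illustrated with a small, framework-free sketch. It assumes hidden activations lie in [0, 1] and follows the definitions of RL, ASL, and PSL given in the SPINE paper; the function name `spine_losses` and the single shared `target_sparsity` value are illustrative simplifications, not the authors' training code.

```python
def spine_losses(inputs, reconstructions, activations, target_sparsity):
    """Return (RL, ASL, PSL) for a batch.

    inputs, reconstructions: lists of equally sized vectors.
    activations: list of hidden-layer vectors with values assumed in [0, 1].
    target_sparsity: desired average activation, shared by all units here.
    """
    n = len(inputs)

    # RL: mean squared reconstruction error over the batch.
    rl = sum(
        sum((x - r) ** 2 for x, r in zip(xs, rs))
        for xs, rs in zip(inputs, reconstructions)
    ) / n

    # ASL: penalize units whose average activation exceeds the target.
    num_units = len(activations[0])
    asl = 0.0
    for h in range(num_units):
        avg_activation = sum(z[h] for z in activations) / n
        asl += max(0.0, avg_activation - target_sparsity) ** 2

    # PSL: z * (1 - z) is zero at 0 and 1 and largest at 0.5,
    # so it pushes activations toward binary values.
    psl = sum(z_h * (1.0 - z_h) for z in activations for z_h in z) / n

    return rl, asl, psl


inputs = [[1.0, 0.0], [0.0, 1.0]]
reconstructions = [[1.0, 0.0], [0.0, 0.0]]
activations = [[1.0, 0.5, 0.0], [1.0, 0.5, 0.0]]
rl, asl, psl = spine_losses(inputs, reconstructions, activations, target_sparsity=0.5)
```

In this toy batch, the always-on first unit is the only contributor to ASL, and the half-active second unit is the only contributor to PSL, which shows how the two penalties target different failure modes.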

4 Material and Methods

An overview of our methodology is shown as a block diagram in Figure 1. In this section, we explain the blocks involved in this pipeline including the pre-processing steps, experiment setup, and the evaluation procedures.

4.1 Corpora and Pre-processing

To form a training text for developing distributed word vector representations, we used the PubMed Central (PMC) Open Access Subset (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). We believe that word vectors can benefit from domain-specific training data in the medical context. PMC’s open access repository contains over one million free full-text digital archives of biomedical and life sciences journal literature at the U.S. National Institutes of Health’s National Library of Medicine. In order to improve the quality of the trained word embeddings, we excluded non-text parts of the papers, such as tables and references, from the training text. To achieve this, we took advantage of the XML structure of the dataset and used a text extractor Achakulvisut2020 to extract the full body of each document, leaving out the undesired parts. Then, in the pre-processing step, we removed all punctuation marks from the training text, replaced all numbers with “0”, and lowercased all characters. Finally, we split the text into sentences with the GENIA Sentence Splitter genias, which is optimized for sentence segmentation of biomedical text, and shuffled the sentences.
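The normalization steps described above (punctuation removal, number replacement, lowercasing) could be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: only ASCII punctuation is stripped, each run of digits is mapped to a single "0", and the helper name `normalize_text` is hypothetical. XML extraction and GENIA sentence splitting are not shown.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(text):
    """Remove punctuation, replace numbers with "0", and lowercase."""
    no_punct = text.translate(_PUNCT_TABLE)      # drop punctuation first, as in the text
    zeroed = re.sub(r"\d+", "0", no_punct)       # assumption: each digit run becomes one "0"
    return " ".join(zeroed.lower().split())      # lowercase and collapse whitespace


example = normalize_text("IL-6 rose by 3.5% in 12 patients.")
```

Because punctuation is removed before numbers are replaced, a decimal such as "3.5" collapses into a single "0" token; a different ordering would yield "0 0" instead.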

4.2 Experiment Setup

Various factors directly affect dense word embedding methods. For instance, word embeddings trained with a larger window size are capable of capturing more complicated semantic relationships. In this regard, as mentioned earlier, we experiment with the two most commonly used dense word embedding methods, Skip-gram and GloVe. The training data contained over 6B tokens, and each word vector had 300 dimensions, trained with a window size of 15. Once the dense word vectors were trained on the whole training data, we selected the 20k most frequent words as the input to the sparse methods. The sparse interpretable word vectors were trained with 1000 dimensions.

Figure 1: Methodology overview

4.3 Hyperparameter Tuning

Because the loss functions of SPOWV in Equation (1) and SPINE in Equation (2) contain several coefficients, downstream task performance alone cannot fairly represent how interpretable the resulting vectors are. This calls for hyperparameter tuning. To do so, we design a task, inspired by prior work, that can demonstrate interpretability in a set of word vectors. It is reasonable to expect that in an interpretable word embedding space, the top words of each dimension form a semantically coherent group. Since checking this for every dimension is resource-demanding and time-consuming, we randomly pick words from biomedical terminologies that represent specific semantic groups. For each word, we identify the top words of the dimensions in which the word is active. To obtain a quantitative measurement, we sum the cosine similarities between every pair of the top words in the original space. The larger this total for the vectors obtained with a given set of parameters, the more interpretable they are on the selected words representing medical semantic groups.
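The tuning criterion described above can be sketched as a small scoring function. The sketch assumes that a dimension is "active" for a word when its value is non-zero, that "original space" refers to the dense vectors, and that `k` top words are taken per dimension; the names `coherence_score` and `top_words` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_words(sparse_vectors, dim, k):
    """The k words with the highest value in dimension `dim`."""
    ranked = sorted(sparse_vectors, key=lambda w: sparse_vectors[w][dim], reverse=True)
    return ranked[:k]


def coherence_score(probe_word, sparse_vectors, dense_vectors, k):
    """Sum of pairwise dense-space cosine similarities among the top-k words
    of every sparse dimension in which `probe_word` is active (non-zero)."""
    active_dims = [d for d, value in enumerate(sparse_vectors[probe_word]) if value != 0]
    score = 0.0
    for dim in active_dims:
        words = top_words(sparse_vectors, dim, k)
        score += sum(
            cosine(dense_vectors[a], dense_vectors[b])
            for a, b in combinations(words, 2)
        )
    return score


sparse = {"asthma": [0.9, 0.0], "copd": [0.8, 0.0], "gene": [0.0, 0.7]}
dense = {"asthma": [1.0, 0.0], "copd": [1.0, 0.0], "gene": [0.0, 1.0]}
score = coherence_score("asthma", sparse, dense, k=2)
```

Summing this score over all probe words for each hyperparameter setting, and keeping the setting with the largest total, mirrors the selection rule described in the text.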

4.4 Evaluation of Interpretability

In this section, we explain the qualitative and quantitative methods used in this paper to assess the interpretability of the word embeddings under study.

4.4.1 Qualitative Evaluation

If a word vector space is interpretable, the words active in a dimension should form a semantically or syntactically coherent group. To examine interpretability qualitatively, we pick words in the medical domain that each represent a specific medical category. For each word, we find its dominating dimension, i.e., the dimension with the highest value, and then the five words with the highest values in that dimension. The more closely these words are related to each other and to the chosen word, the more interpretable that dimension, and the embedding space, is.
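A minimal sketch of this lookup is given below. It assumes the embeddings are held in a dictionary from word to vector; the function names and the toy vocabulary are illustrative only.

```python
def dominating_dimension(vector):
    """Index of the dimension with the highest value."""
    return max(range(len(vector)), key=lambda d: vector[d])


def top_words_in_dimension(embeddings, dim, k=5):
    """The k words with the highest values in dimension `dim`."""
    ranked = sorted(embeddings, key=lambda w: embeddings[w][dim], reverse=True)
    return ranked[:k]


# Toy vocabulary with two dimensions.
embeddings = {
    "insulin": [0.1, 0.9],
    "glp": [0.0, 0.8],
    "islets": [0.2, 0.7],
    "fox": [0.9, 0.0],
}
dim = dominating_dimension(embeddings["insulin"])
neighbours = top_words_in_dimension(embeddings, dim, k=2)
```

Note that the probe word itself can appear among the top words of its dominating dimension, as it does in this toy example.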

4.4.2 Quantitative Evaluation

The most common quantitative evaluation of interpretability has been the word intrusion task, a multiple-choice question on words extracted from the word embedding dimensions. Specifically, four of the five choices are selected from the top 10 percent of a dimension, and the remaining choice, the intruder, is a word from the bottom half of that dimension that is in the top 20 percent of another dimension. The interpretability of a set of word vectors is measured by how well humans can detect the intruding words.

Although the word intrusion task is very effective in evaluating interpretability, gathering people with sufficient biomedical knowledge to perform it reliably proved to be challenging and time-consuming. Therefore, we borrowed the evaluation method first introduced in semantic_structure. As mentioned there, a sufficiently large categorized dataset can approximate the word intrusion task. In this vein, we used the UMLS Metathesaurus UMLS, a large dictionary of biomedical terminologies that includes definitions, relations, semantic categorization, etc. We took advantage of the semantic grouping available in the Metathesaurus for this purpose. There are originally 127 categories of medical concepts in the dataset. Nearly half of the 20000 words in our word vector vocabulary are present and grouped in the dataset. We discarded semantic groups with fewer than 5 or more than 250 words, since the grouping could be too specific or too general, respectively. Eventually, we ended up with 93 semantic groups containing 62 words on average. We believe the resulting dataset can be a reliable measure of interpretability in the biomedical domain.
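For illustration, one way to build a single word intrusion question from the sampling rule above is sketched here. The function name `intrusion_question`, the choice of the second dimension, and the fallback of keeping at least four top words for very small vocabularies are assumptions made for this example.

```python
import random


def ranked_words(embeddings, dim):
    """Words sorted by their value in dimension `dim`, highest first."""
    return sorted(embeddings, key=lambda w: embeddings[w][dim], reverse=True)


def intrusion_question(embeddings, dim, other_dim, rng):
    """Return (choices, intruder) for one question, or None if no intruder exists."""
    ranked = ranked_words(embeddings, dim)
    n = len(ranked)
    top_10 = ranked[: max(4, n // 10)]      # top 10 percent (at least four words)
    bottom_half = set(ranked[n // 2:])      # bottom half of the same dimension
    other_top_20 = ranked_words(embeddings, other_dim)[: max(1, n // 5)]
    # The intruder ranks high in the other dimension but low in this one.
    candidates = [w for w in other_top_20 if w in bottom_half]
    if not candidates:
        return None
    intruder = rng.choice(candidates)
    choices = rng.sample(top_10, 4) + [intruder]
    rng.shuffle(choices)
    return choices, intruder


# Toy vocabulary of 40 words whose two dimensions are ranked in opposite order.
emb = {f"w{i}": [40 - i, i] for i in range(40)}
choices, intruder = intrusion_question(emb, 0, 1, random.Random(0))
```

A human annotator would then be shown `choices` and asked to pick the word that does not belong; the fraction of correctly identified intruders is the interpretability measure.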

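Based on this approach, a score is given to each category-dimension pair as follows.

$$IS^{+}_{i,j} = \frac{\left| S_j \cap V^{+}_{i}(\lambda \times n_j) \right|}{n_j} \times 100, \qquad IS^{-}_{i,j} = \frac{\left| S_j \cap V^{-}_{i}(\lambda \times n_j) \right|}{n_j} \times 100$$

where $IS^{+}_{i,j}$ and $IS^{-}_{i,j}$ are, respectively, the positive and negative interpretability scores of the $i$-th dimension in the $j$-th category, $S_j$ is the set of words present in the $j$-th category, and $n_j$ is the size of this set. $V^{+}_{i}(\lambda \times n_j)$ and $V^{-}_{i}(\lambda \times n_j)$ are the top and bottom $\lambda \times n_j$ words of the $i$-th dimension, respectively, with $\lambda$ being a natural number controlling the strictness of the score.

The final interpretability score of the (dimension–category) pair is given by

$$IS_{i,j} = \max\left( IS^{+}_{i,j}, \; IS^{-}_{i,j} \right)$$

Now, let the interpretability score of the $i$-th dimension, denoted by $IS_i$, be the maximum score of any category in that dimension, i.e.,

$$IS_i = \max_{j} \; IS_{i,j}$$

Then, the final interpretability score of a word embedding space, denoted by $IS$, can be defined as the average interpretability score across all dimensions, i.e.,

$$IS = \frac{1}{M} \sum_{i=1}^{M} IS_i$$

where $M$ is the number of dimensions of the embedding space.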
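The scoring procedure can be sketched in a few lines of Python. This is a reading of the category-based score of semantic_structure under stated assumptions, not the authors' released code: category members missing from the vocabulary are ignored, the strictness parameter is called `lam`, and the function name `interpretability_score` is hypothetical.

```python
def interpretability_score(embeddings, categories, lam=1):
    """Average over dimensions of the best category score (in percent).

    embeddings: dict mapping word -> list of values (one per dimension).
    categories: dict mapping category name -> list of member words.
    lam: natural number; larger values make the score less strict.
    """
    num_dims = len(next(iter(embeddings.values())))
    total = 0.0
    for dim in range(num_dims):
        ranked = sorted(embeddings, key=lambda w: embeddings[w][dim], reverse=True)
        best = 0.0
        for words in categories.values():
            members = set(words) & set(embeddings)   # keep in-vocabulary members only
            n = len(members)
            if n == 0:
                continue
            k = lam * n
            # Share of the category found among the top / bottom k words.
            positive = len(members & set(ranked[:k])) / n * 100
            negative = len(members & set(ranked[-k:])) / n * 100
            best = max(best, positive, negative)
        total += best
    return total / num_dims


embeddings = {"a": [3, 0], "b": [2, 0], "c": [1, 1], "d": [0, 2]}
score_mixed = interpretability_score(embeddings, {"X": ["a", "c"]})
score_clean = interpretability_score(embeddings, {"X": ["a", "b"]})
```

In the toy example, the category {a, b} occupies one end of both dimensions and receives the maximum score, while {a, c} is split between the two ends and receives half of it.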


4.5 Intrinsic Evaluation

Word embedding algorithms such as word2vec and GloVe capture semantic and syntactic word relations to different extents. One approach to evaluating a word embedding method is intrinsic evaluation, which is typically performed on sets of word vectors to measure their overall success in maintaining the semantic and syntactic relations among words, regardless of any specific downstream task. Intrinsic evaluation tasks are mostly faster and easier to carry out than extrinsic evaluation tasks, which are discussed in Section 4.6.

In order to measure word relations quantitatively, many benchmarks have been proposed for both medical and non-medical domains. These benchmarks consist of word pairs and their corresponding human-rated similarity and relatedness scores as references. While the scores are usually scaled to 10, there are a few exceptions to this convention. The word sets that are used for intrinsic evaluation in this paper are as follows:

  • SimLex-999 hill2015simlex: This set includes 999 word pairs comprising noun, verb, and adjective pairs scored by raters recruited from Amazon Mechanical Turk. SimLex-999 provides scores based on only similarity among the pairs and not relatedness or other associations.

  • Bio-SimLex and Bio-SimVerb chiu2018bio: These sets contain 988 noun pairs and 1000 verb pairs in medical domain, respectively. They are labeled by annotators with a biology background. Same as SimLex, the ratings are based on similarity alone.

  • UMNSRS UMNSRS: In contrast to the word sets mentioned above, UMNSRS has two separate datasets for similarity and relatedness, consisting of 566 and 588 word pairs in the biomedical area, respectively. These words cover diverse fields including drugs, medicines, disorders, etc.

To evaluate the performance of the word embeddings with respect to each reference set, we compute the cosine similarity between the vectors of each pair in the reference set. Once the similarity scores are obtained, their Spearman rank correlations with the scores given in the reference set are calculated. Higher values of correlation indicate better capturing of the word relations in the embedding.
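The evaluation loop described above can be sketched without external libraries as follows. The rank correlation is implemented directly (average ranks for ties, then Pearson correlation of the ranks); skipping word pairs that are missing from the vocabulary is an assumption of this sketch, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def evaluate_similarity(embeddings, benchmark):
    """Spearman correlation between model cosine similarities and human scores."""
    model_scores, human_scores = [], []
    for word1, word2, human_score in benchmark:
        if word1 in embeddings and word2 in embeddings:  # skip out-of-vocabulary pairs
            model_scores.append(cosine(embeddings[word1], embeddings[word2]))
            human_scores.append(human_score)
    return pearson(average_ranks(model_scores), average_ranks(human_scores))


embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
benchmark = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0)]
rho = evaluate_similarity(embeddings, benchmark)
```

Because only the ranking of the pairs matters, the scale of the human ratings (for example, 0 to 10) does not affect the result.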

4.6 Extrinsic Evaluation

Extrinsic evaluation is another approach to assessing the performance of word embeddings. The idea is to evaluate the efficiency of the word vectors when used as input to a downstream (classification) task. The tasks considered in this paper for extrinsic evaluation are as follows:

  • Polarity Classification (PC) and Factuality Classification (FC): We acquired the sentiment analysis dataset constructed in factuality, which is extracted from patients’ conversations on the medical forums of the MedHelp website (www.medhelp.org). This dataset consists of 3792 labeled sentences from patients suffering from food poisoning, Crohn’s disease, and breast cancer. Each sentence is labeled in two different aspects. The first is the factuality of the sentence, which indicates what the comment is based on: “Opinion”, “Fact”, or “Experience”. The second is the polarity of the sentence, which is either “Positive”, “Negative”, or “Neutral”.

  • Question Classification (QC): This task and its accompanying dataset are inspired by the TREC dataset questionclassification for question classification as a facilitating task for question answering. Specifically, we collected a set of questions asked by patients on the MedHelp medical forums. The dataset has roughly 2000 samples covering 9 conditions related to the digestive system, such as irritable bowel syndrome (IBS) and gastroesophageal reflux disease (GERD). Each class contains more than 200 questions, and the task is to identify the corresponding condition from the question’s text.

For each task, we use Support Vector Machine (SVM), Random Forest, Gradient Boosting, Passive Aggressive, and Gaussian Naive Bayes classifiers, and the best performance in terms of accuracy is reported. For further validation of the results, we apply ten-fold cross-validation and report the average accuracy for each task.

5 Results and Discussion

In this section, we provide the results for the experiments explained in the previous section. We begin with the results for the downstream tasks related to the intrinsic and extrinsic evaluation of the word embeddings.

5.1 Downstream Tasks

Table 1 summarizes the intrinsic evaluation of the original embeddings (GloVe and Skip-gram) and their sparse versions obtained through SPINE and SPOWV. One important reason for this comparison is to make sure that the expected interpretability of the sparse methods is achieved with no loss, or only a tolerable loss, in the similarity and relatedness captured by the original versions. As can be seen, the sparse word vectors obtained by SPOWV even outperform the original vectors on several occasions. The sparse vectors obtained by SPINE, however, show a slight loss of performance, which nonetheless seems tolerable considering the high interpretability they provide, as shown in the next subsection.

As for the extrinsic evaluation, Table 2 summarizes the accuracy of the six embeddings under study on the classification tasks explained in Section 4.6. As can be observed, the classifiers benefit slightly from the sparsity of the vectors.

| Vectors | SimLex | Bio-SimLex | Bio-SimVerb | UMNSRS-Sim | UMNSRS-Rel |
|---|---|---|---|---|---|
| GloVe | 0.29 | 0.66 | 0.53 | 0.56 | 0.58 |
| SPINE GloVe | 0.28 | 0.66 | 0.47 | 0.53 | 0.52 |
| SPOWV GloVe | 0.30 | 0.66 | 0.54 | 0.56 | 0.57 |
| Skip-gram | 0.30 | 0.67 | 0.50 | 0.56 | 0.55 |
| SPINE SG | 0.25 | 0.60 | 0.45 | 0.44 | 0.43 |
| SPOWV SG | 0.31 | 0.65 | 0.48 | 0.60 | 0.58 |

Table 1: Word similarity and relatedness results
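
| Vectors | PC | FC | QC | Average |
|---|---|---|---|---|
| GloVe | 60.9 | 68.3 | 82.5 | 70.5 |
| SPINE GloVe | 61.3 | 68.0 | 82.8 | 70.7 |
| SPOWV GloVe | 62.2 | 69.2 | 82.9 | 71.4 |
| Skip-gram | 61.2 | 68.4 | 82.9 | 70.8 |
| SPINE SG | 60.4 | 67.8 | 83.1 | 70.4 |
| SPOWV SG | 62.2 | 69.5 | 83.6 | 71.7 |

Table 2: Extrinsic evaluation results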

5.2 Evaluation of Interpretability

In this section, we provide the results on evaluating the interpretability of the considered embeddings. We begin with qualitative results.

5.2.1 Qualitative Evaluation

In terms of interpretability, we begin with an example that visualizes the obtained sparse word vectors to provide an intuition of interpretability and sparsity from the distribution of values over the dimensions of each word vector.

Consider the following six words: 1) Melanoma, 2) Colorectal cancer, 3) Ewing’s sarcoma (a type of cancer that develops in bone or soft tissue), 4) Acetaminophen, 5) Aspirin, 6) Clopidogrel. The first three words belong to the class of cancer types, and the next three words are pharmaceutical drugs.

Figure 2 shows the dimensions of the vector representations of these six words, sorted by the average value across the three cancer type vectors. As can be seen, for both SPINE and SPOWV, the distribution of values over dimensions follows the same pattern for the three types of cancer, in contrast to the words that are medicines. (Note that SPOWV vectors contain negative values, while SPINE vectors are non-negative.) A similar result is observed if the vectors are sorted by the average of the drug type vectors.

As discussed earlier, for an interpretable embedding we expect dimensions to represent specific semantic groups, in the sense that the words active in an interpretable dimension form a semantically coherent group. To examine this, we ran the experiment explained in Section 4.4.1, the results of which are shown in Table 3. The medical terms used to carry out the experiment are Asthma, Alzheimer, Insulin, and NRAS (a class of genes that can become cancerous if mutated).

The results summarized in Table 3 suggest that, in the sparse representations and especially in the SPINE vectors, the top words in the dominating dimension of each selected word are related to each other and to the selected word. As can be seen, both SPOWV and SPINE vectors are much more interpretable than the original ones. For example, the top words in the dominating dimension of the vector representing Insulin in the SPINE version of GloVe are rosiglitazone, hyperinsulinemia, IGF, hyperglycemia, and pioglitazone, which are all closely related to insulin and diabetes. In contrast, the original vectors fail to display such signs of interpretability. The same is true for the other words in this experiment.

Figure 2: Visualization of the dimensions sorted by the average value of the first 3 words (the cancer vectors), where the positive, negative and zero values are shown by the red, blue, and white lines, respectively. (a) SPOWV GloVe; (b) SPINE GloVe

| Concept | GloVe | SPINE GloVe | SPOWV GloVe |
|---|---|---|---|
| Asthma | polycystic, immunotherapy, mansoni, schistosome | copd, comorbid, fibromyalgia, rhinitis, debilitating | sputum, perennial, herb, exocrine, peach |
| Alzheimer | untrained, mental, illnesses, srh, youths | neuropathology, neuroinflammation, prion, amyloid, syn | parkinson, dementia, falls, slips, levodopa |
| Insulin | crb, lithium, amiodarone, gnrh, disagreement | rosiglitazone, hyperinsulinemia, igf, hyperglycemia, pioglitazone | influencing, lobes, macrovascular, abiotic |
| NRAS | oncogenes, upa, cdx, ceacam, trophoblast | tumours, kras, hnscc, cancers, mutated | nras, pod, daf, truncation, ontogenetic |

| Concept | Skip-gram | SPINE Skip-gram | SPOWV Skip-gram |
|---|---|---|---|
| Asthma | vhl, cdh, jak, chromosomal, nonsense | bal, cf, airways, balf, inhalation | interferon, virological, isg, antiviral, virologic |
| Alzheimer | deacetylase, bioconductor, microbe, html, idd | hippocampal, ad, brains, neuropathology, neurodegeneration | montreal, james, hads, edinburgh, gba |
| Insulin | dog, alarm, fox, chimera, xenopus | insulin, glp, ins, somatostatin, islets | glu, glycated, inh, ins, freeze |
| NRAS | vhl, cdh, jak, chromosomal, nonsense | pms, mutations, lynch, pdgfra, germline | braf, thrombolysis, codons, pik, idh |

Table 3: Qualitative evaluation of interpretability

5.2.2 Quantitative Evaluation

Finally, we measure the interpretability of each method by applying the quantitative scoring approach explained in Section 4.4.2. The results are provided in Table 4, which shows the average interpretability score across all dimensions for each method. The results agree with those obtained by the qualitative evaluation. As can be seen, although SPOWV vectors performed better in downstream tasks, vectors trained by SPINE have better interpretability in the medical domain.

| Interpretability Score | Original | SPINE | SPOWV |
|---|---|---|---|
| GloVe | 8.6 | 18.1 | 9.9 |
| Skip-gram | 9.6 | 16.7 | 12.3 |

Table 4: Quantitative evaluation of interpretability

6 Conclusion and Future Work

In the context of NLP for the medical domain, we compared the interpretability of state-of-the-art word embedding methods with that of two prominent sparse interpretable word embedding methods. We proposed a novel approach for quantifying the interpretability of word vectors based on category theory. The approach relies on a dataset of semantically coherent groupings of medical terminologies that we proposed for this purpose. The trained sparse word vectors showed much more interpretability, without degrading the performance of their original vectors in downstream tasks.

For future work, several directions remain open. Since SPINE and SPOWV vectors are sparse, they are computationally efficient; therefore, areas such as energy and timing analysis can be further investigated. Moreover, as the results suggest, although SPINE improved interpretability the most, it suffered a marginal loss in preserving similarity and word associations compared to SPOWV. Extending the SPINE cost function to better preserve similarity is another direction for future studies.