Unsupervised Annotation of Phenotypic Abnormalities via Semantic Latent Representations on Electronic Health Records

11/10/2019 ∙ by Jingqing Zhang, et al. ∙ Imperial College London

The extraction of phenotype information, which is naturally contained in electronic health records (EHRs), has been found useful in various clinical informatics applications such as disease diagnosis. However, due to imprecise descriptions, the lack of gold standards and the demand for efficiency, annotating phenotypic abnormalities in millions of EHR narratives remains challenging. In this work, we propose a novel unsupervised deep learning framework to annotate phenotypic abnormalities in EHRs via semantic latent representations. The proposed framework takes advantage of the Human Phenotype Ontology (HPO), a knowledge base of phenotypic abnormalities, to standardize the annotation results. Experiments were conducted on 52,722 EHRs from the MIMIC-III dataset. Quantitative and qualitative analyses show that the proposed framework achieves state-of-the-art annotation performance and computational efficiency compared with other methods.



Source code of "Unsupervised Annotation of Phenotypic Abnormalities via Semantic Latent Representations on Electronic Health Records". BIBM 2019.
I Introduction

Electronic health records (EHRs) are the digital version of patients’ paper charts, and they are real-time and patient-centered. With the increasing adoption of EHRs in hospitals [henry2016adoption], the explosively growing information archived in EHRs has been exploited and found to be useful in clinical informatics applications [xiao2018opportunities], such as disease classification [shi2017towards] and medical image segmentation [mo2018deep].

In this paper, we focus on annotating phenotype information from EHR textual datasets for better disease understanding. In medical text, the word “phenotype” refers to deviations from normal morphology, physiology, or behavior [robinson2012deep]. EHRs serve as a rich source of phenotype information as they naturally describe the phenotypic abnormalities of patients in narratives. The annotation of phenotypic abnormalities from EHRs can improve the understanding of disease diagnosis, disease pathogenesis and genomic diagnostics [deisseroth2018clinphen, son2018deep], which is a large step towards precision medicine [shickel2018deep].

Several standardized knowledge bases have been proposed to help clinicians understand the phenotype information in EHRs systematically and consistently [hoehndorf2015role]. The Human Phenotype Ontology (HPO) [kohler2016human], a standardized and the most widely used knowledge base of phenotypic abnormalities, provides over 13,000 terms. As manually annotating such an amount of phenotypic abnormalities in millions of EHRs is extremely expensive and impractical, automatic annotation techniques based on natural language processing (NLP) are in demand.

We first analyzed the appearance of phenotype information in EHRs. Using a keyword search approach (i.e., exactly matching the names and synonyms of HPO terms) on the EHRs from MIMIC-III [johnson2016mimic], we found that on average each EHR contained 40.42 HPO terms against 11.74 ICD codes (ICD: https://www.who.int/classifications/icd/en/), and that the number of HPO terms related to a single disease also varied significantly. For example, for the disease subarachnoid hemorrhage, the number of HPO terms found in related EHRs ranged from 4 to 40. As shown in Table I, the phenotype expressions in the EHRs of patients A and B were clearly different, though both were diagnosed with subarachnoid hemorrhage. This suggests that patients diagnosed with the same disease could be further classified into different sub-groups for personalized treatment. However, the keyword search method cannot maximally extract HPO terms from free text, so more sophisticated automatic phenotype annotation methods are needed.
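The keyword search used in this analysis can be sketched as an exact match of HPO term names and synonyms against the EHR text. This is an illustrative reimplementation, not the paper's code, and the HPO ids and surface forms in the toy ontology below are made-up placeholders:

```python
def keyword_annotate(ehr_text, hpo_terms):
    """Exact keyword search: hpo_terms maps an HPO id to its name and synonyms.

    Returns the set of HPO ids whose surface forms appear verbatim in the EHR.
    """
    text = ehr_text.lower()
    found = set()
    for hpo_id, surface_forms in hpo_terms.items():
        if any(form.lower() in text for form in surface_forms):
            found.add(hpo_id)
    return found

# Hypothetical mini-ontology for illustration only (not real HPO ids).
toy_hpo = {
    "HP:TOY0001": ["headache", "cephalgia"],
    "HP:TOY0002": ["aneurysm"],
}
```

Because this matches surface forms only, implicit mentions such as “slightly decreased exercise tolerance” are missed, which motivates the semantic approach proposed in this paper.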

Average per EHR: 11.74 ICD codes vs. 40.42 HPO terms

Disease name              Phenotype quoted from EHRs
Subarachnoid hemorrhage   Patient A: “mild confusion”, “aneurysm” and “vertebral basilar junction”
                          Patient B: “neurologically stab”, “mild headache” and “pain is well controlled”
TABLE I: Analysis of phenotype information in EHRs from MIMIC-III.

Many automatic annotation methods have been developed. Information retrieval based approaches such as OBO Annotator [taboada2014automated], NCBO Annotator [jonquet2009ncbo], Bio-LarK [groza2015human] and MetaMap [aronson2010overview] rely on indexing and retrieval techniques, which require manually defined rules and can be computationally inefficient, while supervised deep learning models are effective but a gold standard for training is hard to acquire [gehrmann2018comparing]. Thus, the problem of how to automatically annotate phenotypic abnormalities from EHRs accurately and efficiently is still far from solved. First, phenotypic abnormalities may not be explicitly mentioned in EHRs, and imprecise descriptions in EHR narratives, such as abbreviations and synonyms, can also make the annotation process difficult. In addition, the reliability of methods is critical in the medical area, but it can be difficult to verify on a large-scale dataset due to the cost of collecting phenotype annotations from experts.

In this work, we propose a novel unsupervised deep learning framework to annotate phenotypic abnormalities from EHRs. Without using any labelled data, the framework is designed to integrate the human-curated phenotype knowledge in HPO (Figure 1 (a)). It is assumed that the semantics of an EHR is a composition of the semantics of phenotypic abnormalities. Based on this assumption, an auto-encoder model and a classifier are constructed to learn and constrain the semantic latent representations of EHRs. The goal is to learn which phenotypic abnormalities are semantically more important in an EHR. The overall structure of the framework is shown in Figure 1 (b). The main contributions of this work are:

  • We propose a novel unsupervised deep learning framework to exploit supportive phenotype knowledge in HPO and annotate general phenotypic abnormalities from EHRs semantically.

  • We demonstrate that our proposed method achieves state-of-the-art annotation performance and computational efficiency compared with other methods.

In the remainder of this paper, we first summarize related works in section II. The problem is formalized in section III. We explain the methodology and the deep learning framework in section IV. The experiments are introduced in section V and the paper is concluded in section VI.

II Related Works

II-A Biomedical Concepts Annotation

There are many well-established knowledge bases in the medical area, such as the International Classification of Diseases (ICD), the Human Phenotype Ontology (HPO) [kohler2016human] and Online Mendelian Inheritance in Man (OMIM, https://www.omim.org). Previous works mostly use indexing and retrieval techniques. For example, the OBO Annotator [taboada2014automated] uses linguistic patterns to retrieve relevant data and then annotates textual snippets based on the indexes of concepts from knowledge bases. Other annotators such as NCBO Annotator [jonquet2009ncbo], Bio-LarK [groza2015human] and MetaMap [aronson2010overview] follow a similar annotation pipeline. However, those methods suffer from computational inefficiency. Moreover, their evaluations were conducted on limited sets of medical documents, and none of the methods was evaluated on EHRs. The recent work [gehrmann2018comparing] shows the effectiveness of supervised deep learning models (CNNs) in annotating 10 phenotypes on 1,610 EHRs. In contrast, we propose an unsupervised deep learning framework that is more effective and efficient than previous works, and our experiments were conducted on 52,722 EHRs.

II-B Semantic Representations in NLP

Learning semantic representations in the latent space of textual data is one of the most fundamental techniques of deep learning in natural language processing. From word embeddings [mikolov2013efficient] to sentence encodings [dong2017i2t2i], the generated latent vectors should adequately represent the semantics of the text. Without labelled data, learning the representation of the semantics in text and of the supportive knowledge in knowledge bases can be essential to measure semantic similarity and difference [zhang2019integrating]. Our work adopts the idea of using prior distributions to constrain the latent space in generative models [shen2017style], but we aim to annotate phenotypic abnormalities from EHRs.

III Problem Formulation

There are two types of data sources. First, let X be a collection of EHRs, where each EHR x ∈ X consists of textual notes written by clinicians. Second, let P = {p_1, …, p_n} be the standardized general categories of human phenotypic abnormalities provided by HPO. The HPO also provides additional subclasses, which are notated as S. A simple illustration of HPO is given in Figure 1(a). Each general phenotypic abnormality p_i ∈ P and each subclass s ∈ S comes with a name and a short description. As each x is textual data, to comply with this data format, both p_i and s refer to the textual descriptions of phenotypic abnormalities.

An EHR x can include multiple phenotypic abnormalities, a single one, or none at all. Therefore, learning the annotation of phenotypic abnormalities from EHRs is essentially learning the conditional probability P(p_i | x), i.e., a binary classification for each p_i ∈ P to decide whether p_i is mentioned in x. As a whole, it is a multi-label classification over P.
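The decision rule implied by this formulation, one independent binary decision per general abnormality, can be sketched as follows (a toy illustration with our own function name; the threshold choice is discussed in section IV-D):

```python
def multi_label_annotate(probs, threshold=0.5):
    """probs[i] approximates P(p_i | x); each p_i gets an independent yes/no.

    Returns a 0/1 indicator vector over the general phenotypic abnormalities.
    """
    return [1 if p > threshold else 0 for p in probs]
```

Note that, unlike single-label classification, several entries of the output can be 1 at once, or none at all, matching the "multiple, single or none" cases above.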

IV Methodology

IV-A Semantic Latent Representations

To represent the semantics of an EHR and a phenotypic abnormality in a latent vector space, the following two assumptions are made:

  1. The general phenotypic abnormality p_i can be represented by and generated from a latent vector z_i which is sampled from some prior distribution D_i. Each p_i corresponds to its own prior distribution D_i, and the prior distributions should be ‘distinct’ enough from each other to highlight their differences.

  2. The EHR x can be represented by and generated from a latent vector z_x. It is also assumed that the semantics of x is a composition of the semantics of the p_i, so z_x = Σ_i α_i ẑ_i, where α_i is a weight that can be interpreted as the importance of p_i in the composition of z_x, and ẑ_i is a sample from the prior distribution D_i defined above.

Based on the two assumptions made above, there are two fundamental constraints which should be considered in modelling.

  1. The latent vector z_x should adequately represent the semantics of x. Likewise, z_i should adequately represent the semantics of the corresponding p_i (see section IV-B).

  2. The vectors z_i and ẑ_i are both samples from the same prior D_i, and the priors of different p_i should be ‘distinct’ enough from each other (see section IV-C).
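The weighted composition in assumption 2 can be written out in plain Python as an element-wise weighted sum (the symbols z_x, α_i and ẑ_i follow our notation above; this is an illustration, not the paper's code):

```python
def compose_latent(alphas, z_hats):
    """z_x = sum_i alpha_i * z_hat_i, with each z_hat_i given as a list of floats."""
    dim = len(z_hats[0])
    return [sum(a * z[d] for a, z in zip(alphas, z_hats)) for d in range(dim)]
```

With alphas = [1.0, 0.0], z_x collapses to the first phenotype's latent vector; intermediate weights mix the phenotype semantics, which is exactly what the annotation strategy later exploits by reading off the α_i.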

IV-B An Auto-encoder Model

As aforementioned, the annotation process is essentially learning the conditional probability P(p_i | x), and the latent space is constructed such that

    P(p_i | x) ≈ α_i,  with  z_x = Σ_i α_i ẑ_i.    (1)

Equation 1 suggests an auto-encoder model, which is also effective for learning latent representations [shen2017style]. Therefore, a general reconstruction process for all available textual data t ∈ X ∪ P ∪ S is considered. Both the encoding step and the generating step are approximated by deep neural networks: the encoding step E: T → Z and the generating step G: Z → T, where T is the textual space and Z is the latent space. The parameters of E and G are estimated by the optimization

    min over θ_E, θ_G of (L_x + L_p + L_s),

where θ_E and θ_G are the parameters of E and G respectively, and the loss terms are defined below. An illustration of E and G is shown in Figure 1(b).

Three reconstruction loss functions are considered while θ_E and θ_G are estimated. The first considers the general reconstruction loss of EHRs, i.e., x ∈ X:

    L_x = E_{x ∈ X} [ ℓ(x, G(E(x))) ],

where ℓ is a reconstruction loss (the cross entropy in our implementation). Besides, the reconstruction loss of the general phenotypic abnormalities, i.e., p_i ∈ P, can be defined similarly; in addition, when the input is a phenotypic abnormality p_i, the corresponding weight α_i should be maximized to 1 and the others (α_j, j ≠ i) should be minimized to 0:

    L_p = E_{p_i ∈ P} [ ℓ(p_i, G(E(p_i))) + Σ_j ℓ(α_j, 1[j = i]) ].

These two loss functions are theoretically sufficient to learn the latent representations of EHRs and phenotypic abnormalities. However, in practice, the short description of p_i may not be informative enough to cover all cases of the general phenotypic abnormality, and the additional subclasses can help the model better understand the general phenotypic abnormalities. Therefore, the reconstruction of the additional subclasses, i.e., s ∈ S, is also necessary, and the third loss can be defined as:

    L_s = E_{s ∈ S} [ ℓ(s, G(E(s))) + Σ_j ℓ(α_j, 1[p_j ∈ par(s)]) ],

where par(s) stands for the set of general phenotypic abnormalities of which the individual s is a subclass. As shown in Figure 1 (a) (the red circle), a subclass can belong to multiple general phenotypic abnormalities, i.e., |par(s)| ≥ 1.
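The α-targets described above (1 for the matching general abnormality of a p_i, and 1 for every parent general abnormality of a subclass, which may be several) can be illustrated with a small helper; the function name and the default `num_general=24` (the count reported in section V-A) are our own choices:

```python
def alpha_target(parent_indices, num_general=24):
    """Target alpha vector for a textual input.

    For a general abnormality p_i, pass [i] to get a one-hot target;
    for a subclass, pass the indices of all its parent general classes.
    """
    target = [0.0] * num_general
    for i in parent_indices:
        target[i] = 1.0
    return target
```

A subclass sitting under two general categories (the red circle in Figure 1 (a)) thus yields a two-hot target rather than a one-hot one.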

Input: EHRs X (training set), general phenotypic abnormalities P, and additional subclasses S.
1  Initialize E, G, C;
2  repeat
3      Sample a mini-batch of textual examples t ∈ X ∪ P ∪ S;
4      Get the latent vectors and weights α by the encoder E;
5      Reconstruct t by the generator G;
6      Calculate L_x, L_p, L_s respectively;
7      Classify the latent vectors by the classifier C;
8      Calculate the classifier loss L_c (section IV-C);
9      Update E, G, C by gradient descent on L_x + L_p + L_s + L_c;
10 until convergence;
Output: The encoder E.
Algorithm 1: The training algorithm.

IV-C Constrained and Distinct Priors

There are two requirements regarding the priors, as mentioned in section IV-A. (1) The latent vectors z_i (from encoding p_i) and ẑ_i (the corresponding component from encoding an EHR), both outputs of the encoder E, should be samples from the same prior D_i. (2) The priors of different p_i should be ‘distinct’ enough from each other because the semantics of different p_i are believed to be different.

To comply with the first requirement, one way is to apply the idea of the variational auto-encoder, which uses a KL-divergence to constrain the latent vectors. Regarding the second requirement, if latent vectors sampled from different priors can be classified into different classes, then the priors are considered ‘distinct’ enough.

Therefore, considering both requirements, a classifier C is proposed (Figure 1(b)). The classifier is designed to conduct single-label classification whose candidate classes correspond to the general phenotypic abnormalities in P. The intuition is that the individual latent vector z_i produced by the encoder E from p_i should be classified as the corresponding class i via C. Besides, z_i and z_j should be classified as two different classes i and j (i ≠ j) respectively. Thus, the loss function to constrain and differentiate the priors can be defined as:

    L_c = E_i [ ℓ(C(z_i), i) + ℓ(C(ẑ_i), i) ],

where ℓ is the cross entropy and ẑ_i is the corresponding latent component produced when encoding an EHR.
IV-D Annotation Strategy

Since the weight α_i represents the importance of p_i in the composition of z_x, in practice we use α_i to approximate P(p_i | x). A threshold τ_i is applied to each general phenotypic abnormality to decide whether p_i is mentioned in x: if α_i > τ_i, then x is annotated with p_i; otherwise, x is not annotated with p_i. The thresholds are hyper-parameters, and the value of τ_i for each p_i is decided based on the distribution of α_i in the training set (see section V-B).
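Choosing each per-abnormality threshold from the distribution of α_i over the training set (section V-B uses percentile ranges) can be sketched as follows; the nearest-rank percentile helper and the default q are our own illustrative choices:

```python
def percentile(values, q):
    """Nearest-rank percentile of a non-empty list of floats, with q in [0, 1]."""
    ordered = sorted(values)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def fit_thresholds(train_alphas, q=0.8):
    """train_alphas[i] is the list of alpha_i values observed over the training set.

    Returns one threshold tau_i per general phenotypic abnormality.
    """
    return [percentile(alphas, q) for alphas in train_alphas]
```

At annotation time, p_i is assigned to an EHR exactly when its alpha_i exceeds the fitted tau_i, so a higher q makes the annotator more conservative.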

Disease name             Description in EHR                                                                                                Target HPO                                                   Keyword  NCBO  OBO  MetaMap  Ours
Subarachnoid hemorrhage  “On arrival to [**Hospital Name**] a CT was obtained which showed subarachnoid blood.”                            HP:0001871 (Abnormality of blood and blood-forming tissues)  ✗        ✗     ✗    ✗        ✓
Mitral valve disorder    “He admits to mild DOE, slightly decreased exercise tolerance and occasional palpitations.”                       HP:0003011 (Abnormality of the musculature)                  ✗        ✗     ✗    ✓        ✓
Mitral valve disorder    “Patient presents s/p L orbit exenteration ([**Masked**]) for a history of basal cell carcinoma in her L orbit.”  HP:0000478 (Abnormality of the eye)                          ✗        ✗     ✗    ✗        ✓
TABLE II: Qualitative analysis to show the effectiveness of our method in discovering implicit phenotypes from EHRs.
Method Available (A) Open source (O) #Records Time to annotate 52,722 EHRs
OBO A, Not O 515 1.0 hour
NCBO A, Not O / 36.7 hours
MetaMap A, O / 22 days
Bio-LarK Not A, Not O 228 /
CNN [gehrmann2018comparing] Not A, Not O 1,610 /
Ours A, O 52,722 40.2 min
TABLE III: A comparison of different methods. The #Records refers to the number of textual records used in the original works. The time was measured by the duration of annotating 52,722 EHRs in inference stage with a single thread Intel i7-6850K 3.60GHz and a single NVIDIA Titan X.
Method Precision Recall F1
Random 0.5541 0.5401 0.5108
Keyword 0.6732 0.4982 0.5194
OBO 0.6817 0.5917 0.5775
NCBO 0.6782 0.5724 0.5659
MetaMap 0.7425 0.5231 0.5576
Ours 0.7113 0.6805 0.6383
TABLE IV: The performance of annotation results compared with the silver standard. All the numbers are averaged across EHRs in the testing set.

V Experiments

V-A Datasets

We conducted the experiments on two datasets. (1) We collected 52,722 discharge summaries as the EHRs from MIMIC-III [johnson2016mimic]. Each EHR also came with a disease diagnosis marked by International Classification of Diseases (ICD-9) codes. The EHRs were randomly split into a training set (70%) and a held-out set for testing (30%). (2) We downloaded the phenotype terms from the Human Phenotype Ontology (HPO) [kohler2016human] (downloaded in April 2019). In HPO, each phenotypic abnormality term has a name, synonyms and a definition. Besides, the HPO also provides the class-subclass relations between phenotypic abnormalities, as shown in Figure 1 (a). There are 24 general phenotypic abnormalities in HPO, and there are 13,795 additional subclasses. The vocabulary size was limited to the 30,000 most frequent words in both datasets, and all numbers were excluded.

V-B Implementation Details (source code: https://github.com/JingqingZ/Semantic-HPO)

The encoder E and generator G were implemented based on the Transformer [vaswani2017attention]. The encoder used a word embedding and a position embedding, which were followed by 6 stacked Transformer encoders. The hidden size, intermediate size and number of attention heads were set as 768, 3072 and 12 respectively. As there are 24 general phenotypic abnormalities, the weights α_i were calculated by a dense layer with 24 units and a sigmoid activation function. The latent vectors were calculated by 24 dense layers, each of which had 1536 units. The structure of G was identical to E. The classifier C was a CNN with three convolution layers. The convolution layers had filter sizes 8, 4, 2 and numbers of filters 4, 8, 16 respectively. The subsequent dense layer had 24 units with softmax. All the neural networks were implemented using PyTorch.


In Algorithm 1, the loss functions used the cross entropy, and the coefficients of the loss terms were set to balance their values. The Adam optimizer [kingma2014adam] was used. In the subsequent annotation, the value of each threshold τ_i was set within the range of the 70th and 95th percentiles of α_i in the training set. The training and inference (annotation) processes were run on a single NVIDIA Titan X GPU.

Since some original EHRs from MIMIC-III are lengthy, in practice, EHRs were split into fragments each of which had 32 words. After the encoder was trained, the annotation strategy was performed on each fragment and the aggregated annotations of all fragments were used as the final annotations of the EHR. As the ICD codes were reported in the EHRs at different levels, we used 3-digit level ICD codes when evaluating the annotation results for consistency.
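The fragment-then-aggregate inference described above can be sketched as follows; the 32-word window follows the text, `annotate_fragment` is a stand-in for the trained encoder plus thresholding, and we assume the aggregation is a union of the per-fragment annotation sets:

```python
def split_into_fragments(text, size=32):
    """Split an EHR into consecutive fragments of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def annotate_ehr(text, annotate_fragment, size=32):
    """Aggregate (union) the annotations of all fragments into the EHR annotation."""
    found = set()
    for fragment in split_into_fragments(text, size):
        found |= annotate_fragment(fragment)
    return found
```

A union aggregation means a phenotypic abnormality mentioned anywhere in a lengthy discharge summary is kept, even if most fragments are silent about it.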

V-C Evaluation

We considered the most influential biomedical annotation tools as baselines for performance comparison; the selection was based on their availability and their compatibility with our experimental settings. For a fair comparison, all the annotation results of the baselines were mapped to sets of the general phenotypic abnormalities in P.

  • Random choice: Each EHR was annotated by the general phenotypic abnormalities at random.

  • Keyword search: We searched the name and synonyms of each specific phenotypic abnormality in all EHRs and used the searching results as the annotations.

  • OBO Annotator [taboada2014automated] (http://www.usc.es/keam/PhenotypeAnnotation/): Java implementation.

  • NCBO Annotator [jonquet2009ncbo] (http://data.bioontology.org/documentation): the annotator web APIs.

  • MetaMap [aronson2010overview]: 2016v2.

Since it is impractical to collect a gold standard on thousands of EHRs, we created a silver standard, i.e., a mapping from ICD codes to the general phenotypic abnormalities in HPO. As there is no manually curated direct mapping between ICD codes and HPO terms, we used Online Mendelian Inheritance in Man (OMIM), a catalog of human genes and genetic disorders, as an intermediate hop to link ICD codes and HPO terms. We collected the mapping from ICD codes to OMIM phenotype entries [goh2007human, park2009impact] and the mapping from OMIM entries to HPO terms (https://hpo.jax.org/app/download/annotation). Based on these two manually curated mappings, we constructed a mapping from ICD codes to HPO terms, i.e., to the general phenotypic abnormalities. The silver standard of the annotations of EHRs was constructed by using this mapping.
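The silver standard composes two manually curated mappings (ICD to OMIM, then OMIM to HPO); the composition step itself is straightforward. The ids below are made-up placeholders, not real ICD/OMIM/HPO entries:

```python
def compose_mappings(icd_to_omim, omim_to_hpo):
    """Compose ICD->OMIM and OMIM->HPO into a direct ICD->HPO mapping.

    Each value is a set; ICD codes whose OMIM entries have no HPO terms are dropped.
    """
    icd_to_hpo = {}
    for icd, omim_entries in icd_to_omim.items():
        hpo_terms = set()
        for omim in omim_entries:
            hpo_terms |= omim_to_hpo.get(omim, set())
        if hpo_terms:
            icd_to_hpo[icd] = hpo_terms
    return icd_to_hpo
```

Taking the union over all OMIM entries of a disease means the silver standard lists every phenotypic abnormality linked to that disease through any of its genetic entries.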

The constructed silver standard provides a rich information source on diseases and their associated phenotypic characteristics. With the silver standard, we can partially evaluate the reliability of the annotation results of the different methods. We used the micro-precision, micro-recall and micro-F1, averaged across EHRs, for quantitative analysis. Some typical cases in which EHRs implicitly describe phenotypic abnormalities are shown for qualitative analysis.
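For reference, per-EHR precision, recall and F1 over annotation sets, averaged across EHRs, can be computed as below. This is a sketch of the evaluation protocol as we read it (set-based scores per EHR, then a mean over EHRs); the function names are ours:

```python
def prf1(predicted, reference):
    """Set-based precision, recall and F1 for a single EHR."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

def averaged_scores(pairs):
    """Average (precision, recall, F1) over a list of (predicted, reference) pairs."""
    scores = [prf1(p, r) for p, r in pairs]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))
```

Averaging per-EHR scores (rather than pooling all decisions) weights every record equally regardless of how many phenotypic abnormalities it contains.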

V-D Results and Discussion

Table III compares the different methods to show the scalability and efficiency of our method. Our experiments were conducted on 52,722 EHRs, significantly more records than in previous works. In addition, our method is also more computationally efficient than the baselines: the inference (annotation) stage of our method takes 40.2 minutes to annotate 52,722 EHRs, which is 33% faster than the OBO Annotator and 98% faster than the NCBO Annotator and MetaMap.

Table IV compares the accuracy of the different annotation methods against the silver standard. The proposed method achieves a precision of 0.7113, a recall of 0.6805 and an F1 of 0.6383; the F1 is significantly higher than that of every baseline. Considering the association between phenotypes and diseases in the silver standard, we believe that our method is more effective and that its annotation results can provide a better indication for disease diagnosis than the baselines.

Along with the evaluation using the silver standard, we conducted qualitative analysis to provide more insight into our annotation work. The EHRs for qualitative analysis were selected from patients with a single disease. We find that, within the same disease group, the phenotypic abnormalities vary across different EHRs, and our method can identify HPO terms that are missed by other methods. Table II shows three typical case studies from the qualitative analysis, where one EHR is from the disease subarachnoid hemorrhage and two EHRs are from the disease mitral valve disorder. In the first case, the EHR contains the keyword “subarachnoid blood”, which clearly indicates the presence of the general phenotype category “HP:0001871 (Abnormality of blood and blood-forming tissues)”, yet only our annotation method found this HPO term. The EHR in the second case describes “slightly decreased exercise tolerance”, which indicates movement impairment, and both our method and MetaMap successfully found the related general phenotype category “HP:0003011 (Abnormality of the musculature)”. In the third case, although the EHR was originally diagnosed as mitral valve disorder, its description suggests this EHR may have been wrongly diagnosed, as the patient is more likely to have eye disease. Our method annotated the general phenotype category “HP:0000478 (Abnormality of the eye)”, which is consistent with our manual investigation. These cases show that our annotation method outperforms the others in finding phenotypic abnormalities from implicit information.

Vi Conclusion and Future Work

In this work, we propose a novel unsupervised deep learning framework to annotate phenotypic abnormalities from EHRs. The proposed framework is able to learn semantic latent representations of textual data and use different prior distributions to constrain the latent space. The experiments have shown the effectiveness, efficiency and scalability of our method and we believe our method can provide a better indication for disease diagnosis than the baselines. In the future, we plan to extend the proposed framework to annotate all the 13,000 specific phenotypic abnormalities in HPO. Besides, due to the generality of the proposed framework, we believe it can be applied to annotating general concepts on plain text in general domains if a well-established knowledge base is available.


Jingqing Zhang would like to thank the support from the LexisNexis® Risk Solutions HPCC Systems® academic program and Pangaea Data.