Contextualised concept embedding for efficiently adapting natural language processing models for phenotype identification

03/10/2019 ∙ by Honghan Wu, et al. ∙ 0

Much effort has been devoted to using automated approaches, such as natural language processing (NLP), to mine or extract data from free-text medical records in order to build comprehensive patient profiles for delivering better health care. Reusing NLP models in new settings, however, remains cumbersome: it requires iterative validation and/or retraining on new data to achieve convergent results. In this paper, we formally define and analyse the NLP model adaptation problem, particularly in phenotype identification tasks, and identify two common types of unnecessary or wasted effort: duplicate waste and imbalance waste. A distributed representation approach is proposed to represent the language patterns familiar to an NLP model by learning phenotype embeddings from its training data. Computations on these language patterns are then introduced to help avoid or reduce unnecessary effort by combining geometric and semantic similarities. To evaluate the approach, we cross-validate NLP models developed for six physical morbidity studies (23 phenotypes; 17 million documents) on anonymised medical records of the South London and Maudsley NHS Trust, United Kingdom. Two metrics are introduced to quantify the reductions in duplicate and imbalance waste. We conducted various experiments on reusing NLP models in four phenotype identification tasks. Our approach can choose the best model for a given new task, which can identify up to 76% of mentions as needing no validation or model retraining while still achieving very good performance (93-97% accuracy). It can also provide guidance for validating and retraining the model for novel language patterns in new tasks, which can help save around 80% of the effort required in blind model-adaptation approaches.




1 Introduction

Compared to structured electronic health records (EHRs), free-text components constitute a much deeper and larger volume of health data. For example, in a recent geriatric syndrome study [1], unstructured EHR data accounted for a very significant part of the identified cases: 67.9% of falls, 86.6% of vision impairments, and 99.8% of cases of lack of social support. Similarly, in one of our co-morbidity studies using the CRIS database (anonymised EHRs of the South London and Maudsley NHS Foundation Trust, SLaM) [2], 1,899 cases of co-morbid depression and Type 2 Diabetes were identified from unstructured EHRs, while only 19 cases could be found using structured diagnosis tables. The value of unstructured records for selecting cohorts has been widely reported [3, 4]. Clearly, extracting clinical variables or identifying phenotypes from unstructured EHR data is essential for providing evidence to answer many clinical questions or verify research hypotheses [5, 6, 7].

However, building an NLP application is costly, especially in the clinical domain. In addition to technical hurdles (e.g., data heterogeneity, data quality, expensive labelling efforts), clinical applications face immense challenges in using information from clinical notes at scale and in ensuring the reproducibility and transparency of the approaches used to extract information from the notes. Furthermore, due to the sensitive nature of clinical data, research access to EHR data, especially free-text records, requires going through rigid ethical approval and information governance processes, which are usually found to be surprisingly complex and lengthy. To lower such barriers to building NLP applications in clinical settings, various approaches have been proposed, including general and user-friendly tools [8, 9, 10], portable solutions such as reusable algorithms for particular phenotypes [11], and web-service or cloud-based solutions [12, 13, 14].

These techniques for reducing the effort of building NLP tools can help speed up the creation of new solutions from scratch. As more and more NLP models are developed for similar tasks, a more efficient way to speed up the process is to adapt pre-trained NLP models to new settings, i.e. answering new questions or working on new data by reusing existing NLP solutions. However, extra effort is normally necessary, and reusing pre-trained NLP models is very often burdensome. This is mainly because NLP models essentially abstract language patterns (i.e. language characteristics representable in a computer-understandable way) and subsequently use them for prediction or classification tasks. These patterns are prone to change when the document set (corpus) or the text mining task (what to look up) changes. Unfortunately, in a new setting it is uncertain which patterns have not changed (or are similar enough to the previous ones) and which ones are new. Therefore, in practice, random samples are drawn to validate the performance of an existing NLP model in a new setting, and the adaptation of the model is then planned based on the validation results, i.e. whether model retraining or improvement is needed and, if so, how to do it.

Such ‘blind’ adaptation is costly in the clinical domain because of barriers to data access and expensive data labelling, which requires clinical expertise. The ‘blindness’ to the similarities and differences between the language pattern landscapes of the source setting (where the model was trained) and the target setting (the new task) causes at least two types of unnecessary or wasted effort that would otherwise be avoidable. First, for data in the target setting that share patterns appearing in the source setting, any validation or retraining effort is unnecessary because the model has already been trained or validated on these language patterns. We call this type of wasted effort “duplicate waste”. The second type of waste arises when the distribution of new language patterns in the target setting is unbalanced, i.e., data instances are unevenly distributed across language patterns. Model adaptation involves validating the model on these new data and further adjusting it when performance is not good enough. Without knowing which data instances belong to which language patterns, data instances are randomly sampled for validation and adaptation. In most cases, a minimal number of instances of every pattern needs to be processed so that convergent results can be obtained. This is usually achieved via an iterative validation and adaptation process, which inevitably causes popular language patterns to be over-represented (and therefore the model to be over-validated or over-retrained on such data). Such unnecessary effort on popular language patterns stems from the pattern imbalance in the target setting, which unfortunately is the norm in almost all real-world EHR datasets. We call this “imbalance waste”.
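The effect of pattern imbalance on blind sampling can be illustrated with a small simulation. This is only a minimal sketch and not the Imbalance Waste metric used in the evaluation: the pattern labels, the 90/5/5 split, the helper name `draws_until_covered` and the minimum of 3 samples per pattern are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def draws_until_covered(patterns, e, rng):
    """Count random draws (without replacement) needed until every pattern
    has at least `e` sampled instances, or all of its instances if fewer.

    patterns: list with one (hypothetical) pattern label per mention.
    """
    need = {p: min(e, n) for p, n in Counter(patterns).items()}
    remaining = sum(need.values())
    order = list(patterns)
    rng.shuffle(order)  # blind sampling: pattern labels are not used to choose
    for draws, p in enumerate(order, start=1):
        if need[p] > 0:
            need[p] -= 1
            remaining -= 1
            if remaining == 0:
                return draws
    return len(order)


# Toy target setting: one popular pattern and two rare ones (assumed numbers).
patterns = ["common"] * 90 + ["rare_a"] * 5 + ["rare_b"] * 5
e = 3  # assumed minimal number of samples per pattern

blind = draws_until_covered(patterns, e, random.Random(0))
guided = e * len(set(patterns))  # sampling e instances directly from each pattern
wasted = blind - guided          # extra instances processed because of blindness
print(blind, guided, wasted)
```

When patterns are known in advance, `guided` samples suffice; blind sampling has to keep drawing until the rare patterns happen to be covered, and most of the extra draws land on the popular pattern.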

The ability to make language patterns ‘visible’ and comparable will help answer whether an NLP model can be adapted to a new task or, more actionably, provide guidance on how to solve new problems effectively and efficiently by ‘smartly’ adapting existing NLP models.

In this paper, we introduce a phenotype embedding model (using vectors to represent a phenotype) to: (1) quantify how NLP models trained for different tasks might work in a new setting; (2) subsequently propose an approach to select the most suitable model; (3) provide guidance for efficiently adapting it for solving the new task.

2 Problem Definition

| Examples | Types of phenotype mentions | Context |
|---|---|---|
| 49 year old man with hepatitis c | Contextualised phenotype mention | positive mention |
| with no evidence of cancer recurrence | Contextualised phenotype mention | negated mention |
| is concerning for local lung cancer recurrence | Contextualised phenotype mention | hypothetical mention |
| PAST MEDICAL HISTORY: 1) Atrial Fibrillation, 2)… | Contextualised phenotype mention | history mention |
| Mother was A positive, hepatitis C carrier, and … | Contextualised phenotype mention | other experiencer mention |
| She visited the HIV clinic last week. | not a phenotype mention | |
| The pt. asked information about stroke. | not a phenotype mention | |

Table 1: The task of recognising contextualised phenotype mentions is to identify mentions of phenotypes in free-text records and to classify the context of each mention into 5 categories (listed in the 3rd column above). The last two rows give examples of non-phenotype mentions: the two sentences do not describe incidents of a condition.

Specifically, we address the problem of NLP model transferability in tasks of extracting phenotypes (e.g. diseases and associated traits) from free-text medical records. The task is to identify mentions of phenotypes and the contexts in which they are mentioned [10]. Table 1 explains and gives examples of contextualised phenotype mentions. The research question to be investigated is formally defined as follows.

Definition 2.1.

Given an NLP model (denoted as $m$) previously trained for some phenotype identification task(s), and a new task (denoted as $T$, where either the phenotypes to be identified are new or the dataset is new, or both are new), $m$ is used in $T$ to identify a set of phenotype mentions - denoted as $S$. The research question (as illustrated in Figure 1) is how to partition $S$ to meet the following criteria:

  1. a maximum ‘p-known’ subset $S_{known}$ where $m$’s performances can be properly predicted using prior knowledge of $m$;

  2. ‘p-unknown’ subsets: $\{S_1, \dots, S_k\}$, where they meet the following criteria:

    1. $\bigcup_{i=1}^{k} S_i = S \setminus S_{known}$;

    2. $S_i \cap S_j = \emptyset$ for all $i \neq j$;

    3. each $S_i$ can be represented by a small number of instances $R_i \subseteq S_i$ so that $m$’s overall performance on $S_i$ can be predicted by its result on $R_i$;

    4. $k$ is small.

Figure 1: Assess the transferability of a pre-trained model in solving a new task: discriminate between differently inaccurate mentions identified by the model in the new setting.

The identification of the ‘p-known’ subset (criterion 1) helps eliminate “duplicate waste” by avoiding unnecessary validation and adaptation on those phenotype mentions. Separating the rest of the annotations into ‘p-unknown’ subsets allows mentions to be processed separately according to their performance-relevant characteristics, which in turn helps avoid “imbalance waste”. Criterion 2.a ensures that all performance-unknown mentions are covered, and criterion 2.b ensures that the mention subsets do not overlap, so that no effort is duplicated on the same mentions. Criterion 2.c requires the partitioning of the mentions to be performance-relevant, meaning that model performance on a small number of samples can be generalised to the whole subset from which they are drawn. Lastly, a small number of subsets (criterion 2.d) enables efficient adaptation of a model.

2.1 Dataset & adaptable phenotype identification models

Recently, we developed SemEHR [10], a semantic search toolkit that aims to replace NLP building with interactive information retrieval functionalities: clinical researchers can use a browser-based interface to access text mining results from a generic/baseline NLP model and (optionally) keep getting better results by iteratively providing feedback to the system. A SLaM instance of this system has been trained to support 6 comorbidity studies (62,719 patients; 17,479,669 clinical notes in total), in which different combinations of physical conditions and mental disorders are extracted and analysed. Supplementary material S1 and S2 give details about the user interface and model performances. These studies effectively generated 23 phenotype identification models and relevant labelled data, which we use to study model transferability.

3 Method

Our approach is based on the following assumption about a language pattern representation model.

Assumption 3.1.

There exists a pattern representation model, $P$, for identifying language patterns of phenotype mentions with the following characteristics.

  1. Each phenotype mention can be characterised by one and only one language pattern;

  2. Patterns are largely shared by different mentions;

  3. There is a deterministic association between NLP models’ performances and such language patterns.

Theorem 3.1.

Given $P$ - a pattern model meeting Assumption 3.1, $m$ - an NLP model, and $T$ - a new task, let $P_m$ be the pattern set that $P$ identifies from the dataset(s) that $m$ was trained or validated on; let $P_T$ be the pattern set that $P$ identifies from $S$ - the set of all mentions identified by $m$ in $T$. Then, the problem defined in Definition 2.1 can be solved by the solution $(P_T \cap P_m,\ P_T \setminus P_m)$, where the mentions with patterns in $P_T \cap P_m$ form a ‘p-known’ subset and the mentions of each pattern in $P_T \setminus P_m$ form one of the ‘p-unknown’ subsets.

Theorem 3.1 can be proved as follows.

  1. Instances of $P_T \cap P_m$ in $S$ are those mentions whose patterns are the same as those $m$ has seen previously. Given the deterministic associations between patterns and $m$’s performances (item 3 in Assumption 3.1), their performances can be predicted with high confidence using prior knowledge of $m$’s performances on these patterns. Therefore, this set is a ‘p-known’ subset.

  2. $P_T \setminus P_m$ meets the definition of ‘p-unknown’ subsets, as shown by the following.

    1. $P_T \setminus P_m$ meets criterion 2.a in Definition 2.1 because all mentions with new patterns are represented by it;

    2. Based on item 1 in Assumption 3.1, the mention set of each pattern in $P_T \setminus P_m$ is disjoint from the others because a mention can only be assigned to one pattern. Therefore, criterion 2.b is met.

    3. Item 3 in Assumption 3.1 entails criterion 2.c: the performance of $m$ on all mentions of a pattern $p$ can be deduced by observing how it performs on sampled mentions of $p$.

    4. Item 2 in Assumption 3.1 assures the fulfilment of criterion 2.d - a small number of patterns.
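The construction used in the proof can be sketched in a few lines of Python. This is a minimal illustration under the assumption that every mention already carries exactly one pattern label (item 1 of Assumption 3.1); the function name `partition_mentions`, the mention ids and the pattern labels are hypothetical.

```python
from collections import defaultdict


def partition_mentions(mention_patterns, known_patterns):
    """Split mentions into one p-known subset and per-pattern p-unknown subsets.

    mention_patterns: dict mapping mention id -> its single pattern label.
    known_patterns: set of pattern labels the model has seen before.
    """
    p_known = set()
    p_unknown = defaultdict(set)
    for mention, pattern in mention_patterns.items():
        if pattern in known_patterns:
            p_known.add(mention)
        else:
            p_unknown[pattern].add(mention)
    return p_known, dict(p_unknown)


# Toy example: pattern "p1" was seen in training, "p2" and "p3" are new.
mentions = {"s1": "p1", "s2": "p1", "s3": "p2", "s4": "p3", "s5": "p3"}
p_known, p_unknown = partition_mentions(mentions, known_patterns={"p1"})

# Criterion 2.a: the p-unknown subsets cover everything outside p-known.
covered = set().union(*p_unknown.values())
# Criterion 2.b: the p-unknown subsets are pairwise disjoint.
disjoint = sum(len(s) for s in p_unknown.values()) == len(covered)
print(p_known, p_unknown, disjoint)
```

Because each mention has exactly one label, coverage and disjointness hold by construction; whether the number of subsets stays small (criterion 2.d) depends on how widely patterns are shared.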

The rest of this section gives details of a realisation of such a pattern representation model using distributed representation models.

3.1 Distributed representation for contextualised phenotype mentions

In computational linguistics, statistical language models are perhaps the most common approach to quantifying word sequences, where a distribution is used to represent the probability of a sequence of words - $P(w_1, \dots, w_n)$. Among such models, the bag-of-words (BOW) model [15] is one of the earliest and simplest forms. It has been widely used and has proven efficient and (often surprisingly) accurate in many information retrieval systems [16].

However, the simplicity of BOW models also underlies their many limitations. They cannot disambiguate semantics carried by word order (e.g., [worry, about, her, mother’s, cancer] vs. [mother’s, worry, about, her, cancer]), and the orthogonality between vector dimensions (i.e., words) means that semantic similarities between words are ignored (e.g., the model cannot tell that tumour is more similar to cancer than hepatitis is). To overcome these limitations, more complex models were introduced, such as n-gram models [17], latent semantic analysis [18] and latent Dirichlet allocation [19]. Probably the most popular alternative is the distributed representation model [20], which uses a vector space to model words so that word similarities can be represented as distances between their vectors. Such vectors are usually learnt via the unsupervised task of predicting a word from its surrounding words using a neural network based approach. This concept has since been extensively followed up, extended and shown to significantly improve NLP tasks [21, 22, 23, 24, 25, 26].
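The two BOW limitations mentioned above (loss of word order and orthogonal word dimensions) can be checked directly. This is a minimal sketch using only the Python standard library; the three-word vocabulary and the helper names are illustrative assumptions.

```python
from collections import Counter


def bag_of_words(tokens):
    """Map a token list to token counts; word order is discarded."""
    return Counter(tokens)


# The two word sequences from the text differ in meaning but not in their bags.
a = ["worry", "about", "her", "mother's", "cancer"]
b = ["mother's", "worry", "about", "her", "cancer"]
same_bag = bag_of_words(a) == bag_of_words(b)

# With one dimension per word, distinct words are orthogonal (dot product 0),
# so "tumour" is no closer to "cancer" than it is to "hepatitis".
vocab = ["tumour", "cancer", "hepatitis"]


def one_hot(word):
    return [1 if word == v else 0 for v in vocab]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


print(same_bag)
print(dot(one_hot("tumour"), one_hot("cancer")),
      dot(one_hot("tumour"), one_hot("hepatitis")))
```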

In the original distributed representation models, the semantics of a word is encoded in a single vector, which makes it impossible to disambiguate the different meanings or contexts in which the word might be used in a corpus. Recently, various (bidirectional) Long Short Term Memory (LSTM) models were proposed to learn contextualised word vectors [27, 28, 29]. However, such approaches require a large corpus [27] to generate a reasonable model that is able to capture word contexts. Furthermore, such linguistic contexts are not the phenotype contexts (see Table 1) that we are after in this paper.

Figure 2: The framework to learn contextualised phenotype embedding from labelled data that an NLP model was trained or validated on.

Inspired by the good properties of distributed representations for words, we propose a phenotype encoding approach that aims to model the language patterns of contextualised phenotype mentions. In word embedding learning, randomly initialised word vectors are gradually updated to capture word semantics while predicting next words (or words at the central positions of word sequences). Like single words, phenotype mentions in free-text medical records have their own “semantics”. However, phenotype semantics are implicitly conveyed via ‘observable’ words and are associated with a larger and more complex context than word semantics. For example, in the sentences “she worried about contracting HIV” and “she is HIV positive”, the entity HIV denotes the same concept of Human Immunodeficiency Viruses in both, but the first is a hypothetical phenotype mention and the second is a positive mention. If such implicit phenotype semantics could be explicitly marked in the text, an embedding model could probably learn to capture them using an approach similar to the word embedding learning framework.

Figure 2 illustrates our framework for extending the continuous bag-of-words [24] architecture of learning word embeddings to capture the semantics of contextualised phenotype mentions. Explicit mark-ups of phenotype mentions are added to the architecture as place-holders for phenotype semantics. Such a mark-up (e.g., C0038454_POS) is composed of two parts: phenotype identification (e.g., C0038454) and contextual description (e.g., POS). The first part identifies a phenotype using a standardised vocabulary. In our implementation, the Unified Medical Language System (UMLS) [30] was chosen for its broad concept coverage and its comprehensive synonyms for concepts. The first benefit of using a standardised phenotype definition is that it helps group together mentions of the same phenotype that use different names. For example, using the UMLS concept identifier C0038454 for STROKE helps combine mentions using Stroke, Cerebrovascular Accident, Brain Attack and 43 other synonyms. The second benefit comes from the concept relations represented in structured vocabularies, which help the transferability computation that we elaborate in a later section. The second part of the phenotype mention mark-up identifies the context of a mention. Six types of contexts (see Table 1) are supported: POS for positive mention, NEG for negated mention, HYP for hypothetical mention, HIS for history mention, OTH for other experiencer mention and NOT for not a phenotype mention.

The phenotype mention mark-ups can be populated using the labelled data that NLP models were trained or validated on. In our implementation, the mark-ups were generated from the labelled subset of SLaM EHRs (labels are the iterative feedback provided via the SemEHR user interface) on which 6 NLP models were trained and validated. With such mark-ups, the extended architecture in Figure 2 is used to learn embeddings for words and phenotype mark-ups at the same time.
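To make the mark-up idea concrete, the following is a minimal sketch of how a labelled mention could be turned into a training sequence. How exactly the mark-up enters the architecture is not spelled out in the text, so placing the mark-up token right after the mention is an assumption, as are the helper names, the example sentence and the window size of 2; only the `CUI_CONTEXT` token format and the six context codes come from the description above.

```python
CONTEXTS = {"POS", "NEG", "HYP", "HIS", "OTH", "NOT"}


def make_markup(cui, context):
    """Build a mark-up token such as 'C0038454_POS'."""
    if context not in CONTEXTS:
        raise ValueError(f"unknown context code: {context}")
    return f"{cui}_{context}"


def add_markup(tokens, start, length, cui, context):
    """Return a copy of `tokens` with the mark-up inserted right after the
    mention occupying tokens[start:start + length] (placement is assumed)."""
    end = start + length
    return tokens[:end] + [make_markup(cui, context)] + tokens[end:]


def cbow_pairs(tokens, window=2):
    """Generate (context words, target word) pairs as used in CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


# Hypothetical labelled sentence with a positive mention of stroke (C0038454).
tokens = ["patient", "had", "a", "stroke", "last", "year"]
marked = add_markup(tokens, start=3, length=1, cui="C0038454", context="POS")
pairs = cbow_pairs(marked, window=2)
print(marked)
print(pairs[4])  # the pair whose target is the mark-up token
```

With this layout the mark-up token is predicted from the same neighbouring words as the mention itself, so its vector is shaped by the words that signal the mention's context.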

3.2 Using phenotype embedding and their semantics for assessing model transferability

The embeddings learnt from the above-mentioned framework are composed of vector representations for words and, in particular, for the contextualised phenotypes mentioned in clinical notes. They are the building blocks of our language pattern representation model (introduced at the beginning of this section), which is used to compute the pattern sets of the source and target settings for assessing and guiding NLP model adaptation in new tasks.

Figure 3: Architecture of phenotype embedding based approach for transferring pre-trained natural language processing models in identifying new phenotypes or on new corpora.

Figure 3 illustrates the architecture of our approach. The double-circle shape denotes the embeddings learnt from the model’s labelled data. The documents from a new task (on the left of the figure) are annotated with phenotype mentions using the pre-trained model. The process is composed of the following steps.

  1. Vectorise phenotype mentions in a new task. Each mention in the new task is represented as a vector of real numbers, using the learnt embedding model to combine its surrounding words as context semantics. Formally, let $s$ be a mention identified by $m$ in the new task; $s$ can be represented by a function defined as follows.

    $$v(s) = f\big(\{\, E(w_i) \mid w_i \text{ in the context of } w_j, \dots, w_{j+l-1} \,\}\big) \tag{1}$$

    where $E$ is the embedding model that converts a word token into a vector, $w_i$ is the $i$-th word in a document, $j$ is the offset of the first word of $s$ in the document, $l$ is the number of words in $s$, and $f$ is a function that combines a set of vectors into a result vector (we use the average in our implementation). With such representations, all mentions are effectively put in a vector space (depicted as a two-dimensional space on the right of the figure for illustration purposes).

  2. Identify language patterns from mention vectors. In the vector space, clusters are naturally formed based on the geometric distances between mention vectors. Different clustering algorithms and parameters were tried; DBScan [31] with Euclidean distance was chosen for vector clustering in our implementation. Essentially, each cluster is a set of mentions considered to share the same (or a similar enough) underlying language pattern, meaning that the language patterns in the new task are technically the vector clusters. We chose the cluster centroid (arithmetic mean) to represent a cluster (i.e. its underlying language pattern).

  3. Choose a reference phenotype embedding based on phenotype semantics. To represent the language patterns familiar to the model, we choose a reference vector for each phenotype in the new task. Such a vector is picked or generated from the learnt phenotype embeddings to represent all language patterns (relevant to the phenotype in question) that the model has seen previously. When the phenotype to be identified in the new task is novel to the model, i.e. not in the set of phenotypes it was developed for, the reference phenotype needs to be carefully selected so that it helps produce a sensible separation between p-known and p-unknown clusters. We use semantic similarity (the distance between two concepts in the UMLS tree structure) to choose the most similar phenotype from the model’s phenotype list. Formally, let $c$ be the UMLS concept for a phenotype to be identified in the new task and $C_m$ be the set of phenotype concepts that the model was trained for; the reference phenotype choosing function is:

    $$r(c) = \underset{c' \in C_m}{\arg\min}\; D(c, c') \tag{2}$$

    where $D$ is a distance function that calculates the number of steps between two nodes in the UMLS concept tree. Once the reference phenotype has been chosen, the reference vector can be selected or generated (e.g., by averaging) from the phenotype’s contextual embeddings.

  4. Classify language patterns to guide model adaptation. When the reference phenotype is properly selected, its vector is used to classify clusters based on the distances between their centroids (the representative vectors of the clusters) and the reference vector. Once a distance threshold is chosen, this distance-based classification effectively partitions the vector space into two sub-spaces, using the reference vector as the centre: the sub-space whose distance to the centre is less than the threshold is called the p-known sub-space, and the remainder is the p-unknown sub-space. The union of the clusters whose centroids are within the p-known sub-space is p-known, meaning the model’s performance on them can be predicted without further validation (this removes duplicate waste). The other clusters are p-unknown clusters. The model can be validated and/or further trained on each p-unknown cluster separately, instead of blindly on all clusters mixed together. This removes imbalance waste.
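Steps 1, 2 and 4 above can be sketched with plain Python lists. This is a minimal illustration rather than the authors' implementation: the context window size is an assumption (the text only says that surrounding words are combined), the cluster assignments are assumed to come from a separate clustering step such as DBScan, and the toy embeddings, function names and threshold are hypothetical. The sketch thresholds on Euclidean distance as in step 4.

```python
import math


def average(vectors):
    """Component-wise arithmetic mean of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mention_vector(tokens, start, length, embeddings, window=2):
    """Average the embeddings of the mention words and of `window` words on
    either side (window size is assumed); unknown words are skipped."""
    lo = max(0, start - window)
    hi = min(len(tokens), start + length + window)
    return average([embeddings[t] for t in tokens[lo:hi] if t in embeddings])


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def split_clusters(clusters, reference, threshold):
    """Classify clusters as p-known or p-unknown by centroid distance.

    clusters: dict mapping cluster id -> list of mention vectors.
    """
    known, unknown = [], []
    for cid, vectors in clusters.items():
        distance = euclidean(average(vectors), reference)
        (known if distance < threshold else unknown).append(cid)
    return known, unknown


# Toy 2-d embeddings and one mention ("stroke") with a window of one word.
embeddings = {"of": [1.0, 0.0], "stroke": [1.0, 0.0], "no": [0.0, 1.0]}
v = mention_vector(["no", "evidence", "of", "stroke"], 3, 1, embeddings, window=1)

# Toy clusters: c1 lies close to the reference vector, c2 far away.
clusters = {"c1": [[0.0, 0.0], [0.2, 0.0]], "c2": [[5.0, 5.0], [5.0, 7.0]]}
known, unknown = split_clusters(clusters, reference=[0.0, 0.0], threshold=1.0)
print(v, known, unknown)
```

Mentions in the `known` clusters would be skipped during validation, while each cluster in `unknown` would be sampled and validated on its own.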

4 Results

4.1 Associations between embedding based Language patterns and model performances

As stated at the beginning of Section 3, our approach is based on 3 assumptions about language patterns (see Assumption 3.1). Therefore, the first evaluation quantifies to what extent the language patterns identified by our embedding-based approach meet these assumptions. The first assumption - a phenotype mention can be assigned to one and only one language pattern - is met in our approach because (a) Equation 1 is a one-to-one function and (b) the DBScan algorithm (the vector clustering function chosen in our implementation) is also a one-to-one function. Assumption 2 can be quantified by the percentage of mentions that can be assigned to a cluster. By increasing the EPS parameter of DBScan (the maximum distance between two data items for them to be considered in the same neighbourhood), a higher percentage can be achieved straightaway. However, the degree to which mentions are clustered together needs to be balanced against the resulting loss of ability to identify performance-related language patterns, which is the third assumption - associations between language patterns and model performances. To quantify such an association, we propose a metric called Separate Power, as defined in Equation 3. The aim is to measure to what extent a clustering can put a certain type of data items together in a limited number of clusters. Given a set of binary-typed data items and a clustering result over them, the separate power for data items of a given type is defined as follows.


In our scenario, we would like to see a clustering that is able to separate easy cases (where good performances are achieved) from difficult cases (where performances are bad) for a model.

Figure 4: Clustered Percentage vs Separate Power on difficult cases, for four phenotypes: (a) Diabetes (C0011849); (b) Hypertensive disease (C0020538); (c) Abscess (C0000833); (d) Blindness (C0456909). The X axis is the EPS parameter of the DBScan clustering algorithm - the maximum distance between two items for them to be considered in the same neighbourhood; the Y axis is the percentage. Two types of information (as functions of EPS) are plotted in each sub-figure: clustered percentage (solid line) and Separate Power (SP) on difficult cases (mentions with bad performances). The latter has two series: (1) SP by chance (dash-dotted line) - when clustering by randomly selecting mentions; (2) SP by clustering using phenotype embedding (dashed line).

To quantify the clustering percentage, the ability to separate mentions based on model performances, and the interplay between the two, we conducted experiments on selected phenotypes by continuously increasing the clustering parameter EPS from a low level. Figure 4 shows the results. In this experiment, we label mentions as one of two types, correct and wrong, using SemEHR labelled data on the CRIS corpus. Specifically, of the mention types in Table 1, wrong mentions are the not-a-phenotype-mention ones and all other types are labelled as correct. We chose wrong as the type in Equation 3, meaning that we evaluate the separate power on wrong mentions. The two most validated phenotypes, Diabetes and Hypertensive disease, were selected to represent phenotypes with hundreds of mentions. Two other phenotypes were chosen to represent cases where the NLP model had different levels of performance: Abscess had around 13% incorrectly identified mentions, while Blindness had 47%. The figure shows a clear trend in all cases: as EPS increases, the clustered percentage increases but the separate power decreases. This indicates a trade-off between the coverage of the identified language patterns and how good they are. Regarding separate power, the results on the two selected common phenotypes (Figure 4(a) and 4(b)) are generally worse than those on the other phenotypes: they start with less power and decrease faster as EPS increases. The main reason is that the difficult cases (mentions with bad performances) in the two popular phenotypes are relatively rare (Diabetes: 8.5%; Hypertensive disease: 5.5%). In such situations, difficult cases are harder to separate because their patterns are underrepresented. In general, however, the embedding-based clustering approach brings much better separate power than random clustering in all cases. This confirms a high-level association between the identified clusters and model performances on them. In particular, when the proportion of difficult cases is near 50% (Figure 4(d)), the approach keeps the values almost constantly near 1.0 as EPS increases. This means it can almost always group difficult cases in their own clusters.

4.2 Model adaptation guidance evaluation

| Panel | New task | Reuse model | #Mentions / #not-a-mention | Saved Imbalance Waste |
|---|---|---|---|---|
| (a) | Diabetes (C0011849) | Type 2 Diabetes (C0011860) | 268 / 23 | 40 or 83% |
| (b) | Stroke (C0038454) | Heart Attack (C0027051) | 238 / 13 | 39 or 82% |
| (c) | Heart Attack (C0027051) | Infarct (C0021308) | 54 / 11 | 11 or 78% |
| (d) | Multiple Sclerosis (C0026769) | Myasthenia Gravis (C0026896) | 104 / 4 | 14 or 85% |

Figure 5: Identifying new phenotypes by reusing NLP models pre-trained for semantically close phenotypes. The four pairs of phenotype identification models are chosen from SemEHR models trained on SLaM CRIS data; DBScan EPS value: 3.8. Imbalance Waste is calculated with a minimal sample number of 3, meaning at least 3 samples are needed for training from each language pattern. The X axis is the similarity threshold, ranging from 0.0 to 0.8; the Y axes, from top to bottom, are the proportion of saved duplicate waste over the total number of mentions, macro-accuracy, and micro-accuracy.

Technically, the guidance to model adaptation is composed of two parts: avoid duplicate waste (skip validation/training efforts on cases the model is already familiar with); and avoid imbalance waste (group new language patterns together so that validation/continuous training on each group separately can be more efficient than doing it over the whole corpus). To quantify the guidance effectiveness, the following metrics are introduced.

  • Duplicate Waste The number of mentions whose patterns fall into what the model is familiar with. A percentage can be calculated by dividing it by the total number of mentions, which gives the proportion of mentions that need no validation or retraining when reusing the model.

  • Imbalance Waste To achieve convergent performance, an NLP model needs to be trained on a minimal number of samples from each language pattern. Given the language pattern set of a new task, the following equation counts the minimal number of samples needed for getting convergent results.


    When the language patterns are identifiable, the Imbalance Waste that can be avoided is quantified as .

  • Accuracies Our approach uses a distance threshold between centroids of clusters and reference vector to determine whether the language patterns of identified clusters are known to the pre-trained NLP model. For those whose distances are within the threshold, we assume the model performs well as they are similar enough. To evaluate whether the approach can really identify familiar patterns, we quantify the accuracy of those within-threshold clusters and also those within-threshold single mentions that are not clustered. Both macro-accuracy (average of all cluster accuracies) and micro-accuracy (overall accuracy) are used - detailed explanations at [32].
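The duplicate waste proportion and the two accuracy measures can be computed as in the sketch below. This is a minimal illustration on made-up labels: the function names are hypothetical, and treating within-threshold unclustered mentions as singleton clusters is an assumption, not something the text specifies.

```python
def duplicate_waste_ratio(n_familiar, n_total):
    """Proportion of mentions whose patterns the model is already familiar with."""
    return n_familiar / n_total


def macro_micro_accuracy(clusters):
    """Compute (macro, micro) accuracy over within-threshold clusters.

    clusters: list of clusters, each a list of booleans (True when the model
    identified the mention correctly). An unclustered mention can be passed
    as a one-element cluster (assumption).
    """
    per_cluster = [sum(c) / len(c) for c in clusters]
    macro = sum(per_cluster) / len(per_cluster)       # mean of cluster accuracies
    micro = sum(sum(c) for c in clusters) / sum(len(c) for c in clusters)
    return macro, micro


# Toy example: a large accurate cluster and a small, less accurate one.
clusters = [[True] * 9 + [False], [True, False]]
macro, micro = macro_micro_accuracy(clusters)
ratio = duplicate_waste_ratio(n_familiar=12, n_total=24)
print(macro, micro, ratio)
```

Macro-accuracy weights every cluster equally, so a small poorly handled cluster pulls it down more than it pulls down micro-accuracy, which weights every mention equally.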

Figure 5 shows the results of our NLP model adaptation guidance on 4 phenotype identification tasks. To simplify the validation, we make it a binary classification task: whether an identified mention is a phenotype mention or not (all contextual phenotype mentions in Table 1 are deemed as a mention and not-a-phenotype-mention is labelled as not-a-mention). For each new phenotype identification task, the NLP model (pre-)trained for the semantically most similar (defined in Equation 2) phenotype was chosen as the reuse model. Models and labelled data for the four pairs of phenotypes were selected from six physical comobidity studies on SLaM CRIS data. Figure 5 shows that identified mentions have a high proportion of avoidable duplicate wastes in all 4 cases: Diabetes and Heart Attack start with 50% duplicate waste; Stroke and Multiple Sclerosis have more than 70%. Such avoidable duplicate waste decreases when the threshold increases. The threshold is on similarity instead of distance meaning the bigger the threshold value the smaller the space diameter. In other words, new patterns need to be more similar to the reuse model’s embeddings to be counted as familiar patterns. Therefore, it is understandable that duplicate waste decreases in such scenarios. In terms of accuracies, one would expect they increase as only more similar patterns are left when threshold increases. However, interestingly, in all cases, both macro and mico-accuracies slightly decrease a bit before increasing to reach near 1.0. This is a phenomenon worth future investigation. In general, the changes of accuracies are quite small (mostly smaller than .03 and the biggest is .08) and they are quite good (more than .92 even in worst cases) as well. Given these observation, the threshold is normally set at .01, which gives us best saved duplicate waste with ignorable accuracy sacrifices. 
Specifically, in all cases, more than half of the identified mentions (50%+ for subfigures 5(a) and 5(b); 70%+ for 5(c) and 5(d)) do not need any validation or training, with accuracy above .95. In terms of effective adaptation to new patterns, the percentages of avoidable imbalance waste in all cases are around 80%, confirming that much more efficient retraining can be achieved through language-pattern-based guidance.
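
The thresholding step discussed above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: cosine similarity stands in for the similarity measure, duplicate waste is taken to be the fraction of mentions that fall in within-threshold clusters, and all function names and example vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a cluster's mention embeddings."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def split_by_familiarity(
    clusters: Dict[str, List[Vector]], reference: Vector, threshold: float
) -> Tuple[List[str], List[str]]:
    """Split cluster ids into (familiar, novel) by centroid similarity.

    A cluster is treated as familiar when its centroid is at least
    `threshold` similar to the reuse model's reference vector, so a
    larger threshold admits fewer clusters.
    """
    familiar, novel = [], []
    for cluster_id, vectors in clusters.items():
        similarity = cosine_similarity(centroid(vectors), reference)
        (familiar if similarity >= threshold else novel).append(cluster_id)
    return familiar, novel


def duplicate_waste(clusters: Dict[str, List[Vector]], familiar: List[str]) -> float:
    """Assumed definition: share of mentions that lie in familiar clusters."""
    total = sum(len(vectors) for vectors in clusters.values())
    return sum(len(clusters[cluster_id]) for cluster_id in familiar) / total


# Toy 2-d embeddings, for illustration only.
toy_clusters = {"a": [[1.0, 0.0], [1.0, 0.1]], "b": [[0.0, 1.0]]}
familiar, novel = split_by_familiarity(toy_clusters, [1.0, 0.0], threshold=0.9)
print(familiar, novel, round(duplicate_waste(toy_clusters, familiar), 3))
```

Raising `threshold` in this sketch can only move clusters from the familiar group to the novel group, which mirrors the observation that duplicate waste decreases as the threshold increases.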

Model reuse case                          Duplicate waste  Macro-accuracy  Micro-accuracy
Diabetes by Type 2 Diabetes               0.502            0.966           0.933
Diabetes by Hyperchole                    0.477            0.965           0.930
Stroke by Heart attack                    0.711            0.948           0.955
Stroke by Fatigue                         0.220            0.884           0.938
Heart attack by Infarct                   0.569            0.989           0.966
Heart attack by Bruise                    0.529            0.821           0.889
Multiple Sclerosis by Myasthenia Gravis   0.761            0.944           0.971
Multiple Sclerosis by Diabetes            0.522            0.993           0.979
Table 2: Comparison of the performance of reused models at different levels of semantic similarity. In each pair, the semantically more similar reuse model is listed first. Similarity threshold: .01; DBSCAN EPS: .38.

Effectiveness of phenotype semantics in model reuse. When considering NLP model reuse for a new task, if no existing model has been developed for the same phenotype identification task, our approach chooses a model trained for the phenotype that is semantically most similar to it (based on Equation 2). To evaluate the effectiveness of such semantic relationships in reusing NLP models, we conducted experiments on the previous four phenotypes using phenotype models with different levels of semantic similarity. Table 2 shows the results. In all cases, reusing the model trained for the more similar phenotype identifies more duplicate waste under the same parameter settings. The first three cases in the table also achieve better accuracies, whereas Multiple Sclerosis had slightly better accuracy when reusing the Diabetes model than when reusing the semantically more similar Myasthenia Gravis model. However, the latter identified 46% more duplicate waste.
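
The 46% figure is consistent with a relative increase computed from the duplicate-waste column of Table 2. The short Python sketch below reproduces that arithmetic for all four tasks; reading "46% more" as a relative gain is an assumption, and the helper name is hypothetical.

```python
# Duplicate-waste values copied from Table 2, ordered as
# (more similar reuse model, less similar reuse model).
TABLE_2_DUPLICATE_WASTE = {
    "Diabetes": (0.502, 0.477),
    "Stroke": (0.711, 0.220),
    "Heart attack": (0.569, 0.529),
    "Multiple Sclerosis": (0.761, 0.522),
}


def relative_gain(more_similar: float, less_similar: float) -> float:
    """Relative increase in duplicate waste from the more similar model."""
    return (more_similar - less_similar) / less_similar


for task, (more, less) in TABLE_2_DUPLICATE_WASTE.items():
    print(f"{task}: {relative_gain(more, less):+.1%}")
```

For Multiple Sclerosis this gives (0.761 - 0.522) / 0.522, roughly 0.458, which rounds to the 46% quoted above.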

5 Discussion and Conclusion

Free-text medical records contain a huge amount of important social and clinical features describing patients' medical profiles, which are undoubtedly richer and more comprehensive than those in structured formats [1, 6]. Unsurprisingly, automated extraction methods (as surveyed recently by Ford et al. [33]) have been intensively investigated in this area, and many are freely available and/or open source [34, 35, 36, 10]. It is therefore sensible to consider adapting existing tools wherever possible. Reusing existing text mining models or tools is expected not only to save resources and effort in tool development itself, but also to avoid the complex process of gaining approval and setting up safe environments for data labelling and model training. However, there is no quantifiable guarantee about whether, or to what extent, an NLP model will work well in a new setting. This means validation and retraining are still necessary, which makes most of the above efficiency expectations, particularly the second, largely unrealistic. To tackle this issue, we proposed an approach that can automatically (i) identify easy cases in a new task for the reused model, on which it can achieve good performance with high confidence; and (ii) classify the remaining cases so that validation or retraining on them can be conducted much more efficiently than adapting the model on them as a whole. Specifically, in four phenotype identification tasks, we have shown that around 50-79% of all mentions are identifiable easy cases, for which our approach can choose the best reusable model, achieving more than 93% accuracy. Furthermore, for the cases that do need validation or retraining, our approach can provide guidance that saves 78-85% of the effort.

In this study, the experiments focused on binary classification of candidate phenotype mentions: whether a candidate mention is a phenotype mention or not (see Table 1 for examples). Investigating the approach on multi-class classification, i.e. on contextualised mentions of phenotypes, will be a more challenging yet more rewarding topic for future research. In addition, we did not evaluate the recall of adapted NLP models in new tasks. Although the models we chose generally achieve very good recall for identifying physical conditions (96-98%) on CRIS [10], there is no doubt that the transferability of recall is an important aspect of NLP model adaptation. In general, making language patterns visible and comparable (in a form computers understand) is the key to supporting 'smart' NLP model adaptation. The phenotype-embedding-based approach is just one way to model such patterns. Investigating novel pattern representation models is an exciting research direction for enabling automated NLP model adaptation and composition, so that clinical notes can be mined efficiently in new settings with minimal effort.

References


  • [1] Hadi Kharrazi, Laura J Anzaldi, Leilani Hernandez, Ashwini Davison, Cynthia M Boyd, Bruce Leff, Joe Kimura, and Jonathan P Weiner. The value of unstructured electronic health record data in geriatric syndrome case identification. J. Am. Geriatr. Soc., 66(8):1499–1507, August 2018.
  • [2] Gayan Perera, Matthew Broadbent, Felicity Callard, Chin-Kuo Chang, Johnny Downs, Rina Dutta, Andrea Fernandes, Richard D Hayes, Max Henderson, Richard Jackson, Amelia Jewell, Giouliana Kadra, Ryan Little, Megan Pritchard, Hitesh Shetty, Alex Tulloch, and Robert Stewart. Cohort profile of the south london and maudsley NHS foundation trust biomedical research centre (SLaM BRC) case register: current status and recent enhancement of an electronic mental health record-derived data resource. BMJ Open, 6(3):e008721, March 2016.
  • [3] Francisco S Roque, Peter B Jensen, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Thomas Werge, et al. Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS computational biology, 7(8):e1002141, 2011.
  • [4] Yajuan Wang, Kenney Ng, Roy J Byrd, Jianying Hu, Shahram Ebadollahi, Zahra Daar, Steven R Steinhubl, Walter F Stewart, et al. Early detection of heart failure with varying prediction windows by structured and unstructured data in electronic health records. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 2530–2533. IEEE, 2015.
  • [5] Swapna Abhyankar, Dina Demner-Fushman, Fiona M Callaghan, and Clement J McDonald. Combining structured and unstructured data to identify a cohort of icu patients who received dialysis. Journal of the American Medical Informatics Association, 21(5):801–807, 2014.
  • [6] Andrea V Margulis, Joan Fortuny, James A Kaye, Brian Calingaert, Maria Reynolds, Estel Plana, Lisa J McQuay, Willem Jan Atsma, Billy Franks, Stefan de Vogel, et al. Value of free-text comments for validating cancer cases using primary-care data in the united kingdom. Epidemiology, 29(5):e41–e42, 2018.
  • [7] James Bell, Cise Kilic, Reena Prabakaran, Yuan Yuan Wang, Robin Wilson, Matthew Broadbent, Anil Kumar, and Vivienne Curtis. Use of electronic health records in identifying drug and alcohol misuse among psychiatric in-patients. The Psychiatrist, 37(1):15–20, 2013.
  • [8] Richard G Jackson, Michael Ball, Rashmi Patel, Richard D Hayes, Richard J B Dobson, and Robert Stewart. TextHunter–A user friendly tool for extracting generic concepts from free text in clinical research. AMIA Annu. Symp. Proc., 2014:729–738, November 2014.
  • [9] Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J. Am. Med. Inform. Assoc., 17(5):507–513, September 2010.
  • [10] Honghan Wu, Giulia Toti, Katherine I Morley, Zina M Ibrahim, Amos Folarin, Richard Jackson, Ismail Kartoglu, Asha Agrawal, Clive Stringer, Darren Gale, Genevieve Gorrell, Angus Roberts, Matthew Broadbent, Robert Stewart, and Richard J B Dobson. SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research. J. Am. Med. Inform. Assoc., 25(5):530–537, May 2018.
  • [11] Robert J Carroll, Will K Thompson, Anne E Eyler, Arthur M Mandelin, Tianxi Cai, Raquel M Zink, Jennifer A Pacheco, Chad S Boomershine, Thomas A Lasko, Hua Xu, Elizabeth W Karlson, Raul G Perez, Vivian S Gainer, Shawn N Murphy, Eric M Ruderman, Richard M Pope, Robert M Plenge, Abel Ngo Kho, Katherine P Liao, and Joshua C Denny. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. J. Am. Med. Inform. Assoc., 19(e1):e162–9, June 2012.
  • [12] J Christoph, L Griebel, I Leb, I Engel, F Köpcke, D Toddenroth, H-U Prokosch, J Laufer, K Marquardt, and M Sedlmayr. Secure secondary use of clinical data with cloud-based NLP services. towards a highly scalable research infrastructure. Methods Inf. Med., 54(3):276–282, 2015.
  • [13] V Tablan, I Roberts, H Cunningham, and K Bontcheva. a platform for large-scale, open-source text processing on the cloud. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1983):20120071–20120071, 2012.
  • [14] Kyle Chard, Michael Russell, Yves A Lussier, Eneida A Mendonça, and Jonathan C Silverstein. A cloud-based approach to medical NLP. AMIA Annu. Symp. Proc., 2011:207–216, October 2011.
  • [15] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
  • [16] G Salton, A Wong, and C S Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, 1975.
  • [17] Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479, 1992.
  • [18] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990.
  • [19] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
  • [20] D E Rumelhart and J L Mcclelland. Distributed representations. Parallel distributed processing: Explorations in the microstructure of cognition, 1, 1986.
  • [21] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • [22] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
  • [23] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 513–520, 2011.
  • [24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [25] Stephan Gouws, Yoshua Bengio, and Greg Corrado. Bilbowa: Fast bilingual distributed representations without word alignments. In International Conference on Machine Learning, pages 748–756, 2015.
  • [26] Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483, 2016.
  • [27] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
  • [28] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.
  • [29] Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108, 2017.
  • [30] Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004.
  • [31] Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu. Dbscan revisited, revisited: why and how you should (still) use dbscan. ACM Transactions on Database Systems (TODS), 42(3):19, 2017.
  • [32] Vincent Van Asch. Macro-and micro-averaged evaluation measures [[basic draft]]. Belgium: CLiPS, 2013.
  • [33] Elizabeth Ford, John A Carroll, Helen E Smith, Donia Scott, and Jackie A Cassell. Extracting information from the text of electronic medical records to improve case detection: a systematic review. Journal of the American Medical Informatics Association, 23(5):1007–1015, 2016.
  • [34] Yonghui Wu, Joshua C Denny, S Trent Rosenbloom, Randolph A Miller, Dario A Giuse, Lulu Wang, Carmelo Blanquicett, Ergin Soysal, Jun Xu, and Hua Xu. A long journey to short abbreviations: developing an open-source framework for clinical abbreviation recognition and disambiguation (card). Journal of the American Medical Informatics Association, 24(e1):e79–e86, 2016.
  • [35] Guergana K Savova, Philip V Ogren, Patrick H Duffy, James D Buntrock, and Christopher G Chute. Mayo clinic nlp system for patient smoking status identification. Journal of the American Medical Informatics Association, 15(1):25–28, 2008.
  • [36] Daniel Albright, Arrick Lanfranchi, Anwen Fredriksen, William F Styler IV, Colin Warner, Jena D Hwang, Jinho D Choi, Dmitriy Dligach, Rodney D Nielsen, James Martin, et al. Towards comprehensive syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association, 20(5):922–930, 2013.

Supplementary Material

(a) SemEHR provides a semantic search interface to access generic/baseline natural language processing results.
(b) Feedback buttons are provided alongside the NLP annotations in the search results. Feedback provided by a researcher is used to build a dedicated, better model for her research study. Meanings of the buttons: posM - positive mention; hisM - history mention; hypoM - hypothetical mention; negM - negated mention; otherM - other experiencer mention; wrongM - not a phenotype mention.
Figure S1: The user interface for interactively adapting a generic NLP model for a research study
(a) Statistics of identified physical conditions: condition mention - mentions that are related to the condition; positive mention - mention of a condition that the patient suffered from; num concepts - number of sub-concepts (e.g. TIA, Brain haemorrhage) that constitute a physical condition (e.g. stroke).
(b) Mention accuracy (the accuracy of NLP tool identified condition mentions) of 23 physical conditions.
(c) The numbers of feedbacks needed to iteratively train a good model for a physical condition (shows the top 10 most validated conditions).
Figure S2: The user interface for interactively adapting a generic NLP model for a research study