The human body is a very complex system, and often patients admitted to hospitals with one initial prognosis or diagnosis have multiple related or unrelated chronic diseases, referred to as multi-morbidity. Modern medical practice emphasises the need to understand the patient as a whole, as multi-morbidity increases the patient’s overall burden of disease, and worsens prognosis Flegel (2018); Ryan et al. (2018); Hausmann et al. (2019); Aubert et al. (2019); Mori et al. (2019). Multi-morbidity makes the diagnosis of each disease more complicated, and physicians may be less accurate in their diagnoses Hausmann et al. (2019). The effects of different conditions may interact with each other, and complicate the management of each disease Flegel (2018). This, in turn, leads to poorer outcomes, such as increased preventable hospital re-admissions, overall hospital re-admissions, and increased total medical and long term care costs Mori et al. (2019); Aubert et al. (2019). For example, a patient newly diagnosed with HER2 (human epidermal growth factor receptor 2) positive breast cancer may also have underlying, possibly undiagnosed, heart failure. This can be crucial, as some treatments for breast cancer can cause cardiac damage. Accurately identifying the symptoms of heart failure allows the physician to best balance the risks and benefits of such treatments.
Machine learning techniques have proven to aid medical advancements and enhance overall patient care. This research uses multi-label medical text classification techniques to improve prediction of the medical codes of patients with multi-morbidity. In single-label classification only one target variable is predicted per instance, i.e., each instance is assigned one class label out of two (binary) or more (multi-class) candidates, whereas in multi-label classification the goal is to predict multiple output variables for each input instance. In the above example, the patient is an instance with potential labels such as cancer, hypertension, heart failure, cholesterol and many more related and unrelated health complications. This research focuses on medical codes due to the availability of labels in the dataset. Medical codes such as the International Classification of Diseases (ICD) are used as a way of classifying diseases, symptoms, signs and causes of diseases. Almost every health condition can be assigned a unique code.
The focus of this research is to make use of free-form medical text. Free-form medical text such as discharge summaries, consultation notes and nurses' notes is generally longitudinal and is a rich source of information about a patient’s well-being and medical history. However, electronic health records (EHR) in free-form medical text present added complexity due to the nature of the content. EHRs in free-form text contain an abundance of personal health identifiers which have to be carefully de-identified to avoid any ethical or legal issues Yogarajan et al. (2020b). Also, EHRs contain a large number of abbreviations and acronyms, which can be easily misinterpreted. For example, “Mg” is used to refer to magnesium, “MG” refers to myasthenia gravis and “mg” refers to milligram.
This research restricts itself to techniques that maximise the feature extraction from medical text at the embedding layer. An embedding layer is a mapping of discrete variables to continuous vectors, where the dimensional space of the categorical variables is reduced. The embedding layer is considered a significant component for text representation Goldberg (2017). Embeddings allow words to transform from isolated distinct symbols to mathematical representations, where the distance between vectors and distance between words can be equated, and behaviour between words can be generalised. We focus only on multi-label machine learning techniques commonly used in health-related information extraction tasks to enhance the accuracy of predicting medical codes for patients with multi-morbidity.
This paper extends the work on binary classification of medical codes presented in Yogarajan et al. (2020a). More specifically, in this paper:
we acknowledge the multi-morbidity nature of patients, and we make use of the multi-label variations of medical text classification to enhance prediction of concurrent medical codes.
we present new embeddings on the health-related text and compare several variations to embeddings models when dealing with an imbalanced multi-label medical text classification problem.
we analyse pre-processing of free-form medical text, given the nature of the medical text, and show that pre-processing yields only minimal improvements in F-measure compared with using the text ‘as is’.
we present a study exploring variations to tagging words including the traditional part-of-speech (POS).
we provide a comparison of popular machine learning classifiers used in medical text classification.
we present a detailed study and discussion of results extended by varying the formations of embeddings, size of the embeddings and number of labels considered (18, 50 and 155) for the prediction of medical codes.
we show that variations in embeddings, especially the dimensional size, influence the F-measure of the infrequent labels.
The rest of the paper is structured as follows. Section 2 presents related work. This is followed by a brief overview of medical codes in Section 3. Details of the machine learning techniques and experimental methodology are provided in Section 4, which also presents an overview of the data used for experiments. This is followed by results, where a detailed subsection of results is given for the 18 label case, followed by the 50 and 155 label cases. The paper concludes with discussions and suggestions for future work.
2 Related Work
Developments in machine learning, especially deep learning, have influenced the advancements in many fields, including health applications. The rapid growth in computational power and the availability of EHR are the main reasons for such changes. Rule-based systems have been the most favoured option by health professionals, with systems such as cTAKES and MetaMap considered the leading information extraction tools Savova et al. (2010); Garla et al. (2011); Liu et al. (2013); Reátegui and Ratté (2018); Yang and Gonçalves (2017). However, recently there has been a shift towards favouring machine learning, more specifically deep learning-based models.
Table 1 presents examples of recent developments in predicting medical codes. All systems are based on variations of deep learning models. The number of ICD-9 codes, i.e. the number of labels, used varies across systems with the best reported F1 measures around the 0.4 to 0.6 range. The number of labels and the frequency of the chosen labels influence the F1 score, with top 50 ICD-9 codes generally leading to higher F-measure. MIMIC III (Medical Information Mart for Intensive Care III) is the biggest publicly accessible de-identified dataset and is the most popular free-form medical text used in many applications, including predicting medical codes Purushotham et al. (2017); Johnson et al. (2017); Goldberger et al. (2000); Data (2016) (also evident in Table 1).
|Study||Method||Data||Averaging||F1 (number of labels)|
|Zeng et al. (2019)||Deep transfer learning, Multi-scale CNN||MIMIC III||micro avg||F1 = 0.420 (most frequent 200 labels)|
|Du et al. (2019)||ML-Net, ELMo based||MIMIC III||Best*||F1 = 0.428 (70 labels)|
|Baumel et al. (2018)||Hierarchical Attention Bi-GRU||MIMIC III||micro avg||F1 = 0.405 (6527 labels); F1 = 0.559 (1047 labels)|
|Mullenbach et al. (2018)||CNN based, Word2Vec||MIMIC III||micro avg||DR-CAML: F1 = 0.633 (most frequent 50 labels); CAML: F1 = 0.539 (8922 labels); CAML: F1 = 0.088 (8922 labels)|
|Li et al. (2018)||DeepLabeler; CNN, Doc2Vec||MIMIC III||micro avg||F1 = 0.408 (6984 labels)|
|Rios and Kavuluru (2018)||CNN, few-shot learning; Skip-gram embeddings||MIMIC III||micro avg||F1 = 0.468 (6932 labels)|
Table 1: Examples of the most recent systems for predicting ICD-9 codes. Here CNN refers to convolutional neural network, LSTM to long short-term memory, Bi-GRU to bidirectional gated recurrent unit and DR-CAML to Description Regularized - Convolutional Attention for Multi-label classification. * Du et al. (2019) do not specify whether the best F score is a micro average or a macro average.
Embeddings are the popular method used to represent text in a neural network, and all systems presented in Table 1 use embeddings from algorithms such as Word2Vec, Doc2Vec and ELMo to represent free-form medical text. Yogarajan et al. (2020a) used fastText to obtain embeddings and presented comparisons with published embeddings, both for general text and health-related text trained models. Embeddings trained on health-related text perform better than those trained on general text, and higher dimensions perform better when top-level ICD-9 groups are considered as individual binary problems Yogarajan et al. (2020a). Huggard et al. (2019) also show that embeddings obtained from fastText result in significantly higher F-measure on biomedical named entity recognition when compared to other embeddings such as those of ELMo.
We restrict this research to enhancing embeddings in a multi-label prediction setting. Our findings will aid the development of better performing neural networks. All systems presented in Table 1, and other deep learning-based models, focus predominantly on the complexity of the deep learning algorithm and very little on the representation and pre-processing of the text. Although we acknowledge the need for such developments, in this paper we constrain ourselves to text representations, as these are vital to improving predictive performance for health records. The aim is to avoid reusing the same baseline recipe for the embeddings layer, in which the embedding size is fixed (generally 100 dimensions) and the pre-processing steps are also the same Zeng et al. (2019); Mullenbach et al. (2018).
3 Medical Codes
ICD codes are widely used to describe diagnoses of patients, and are used to classify diseases, symptoms, and also causes of diseases Jensen et al. (2012). Many countries use ICD codes for billing purposes, as does the USA where insurance must cover the cost of patient care. ICD codes also provide insights on multi-morbidity of patients. We focus on predicting ICD-9 codes in this paper due to the availability of labels in the data. Generally, hospitals manually assign the correct codes to patient records based on doctors’ clinical diagnosis notes. This requires expert knowledge and is time-consuming. Hence, the use of advancements in machine learning to predict ICD codes from free-form medical text has become an important research avenue.
There are roughly 13,000 ICD-9 codes and their definitions follow a hierarchical structure. Figure 1 presents the tree structure of ICD-9. At the top level, ICD-9 codes can be grouped into 18 main categories, which then divide into 167 sub-groups and finally into roughly 13,000 individual codes.
4 Experimental Methodology
This section presents an overview of the data used in experiments and for training embeddings. We provide an overview of the machine learning and natural language processing techniques. The details of the embeddings used in this research are also presented.
This research makes use of the medical text data of more than 50,000 patients presented in the publicly available medical database Medical Information Mart for Intensive Care (MIMIC III) Johnson et al. (2016); Goldberger et al. (2000); Data (2016). MIMIC III contains de-identified medical free-form text, among other forms of medical data, of patients admitted to critical care units at the Beth Israel Deaconess Medical Center between 2001 and 2012. MIMIC III contains 15 categories of free-form text notes, including discharge summaries, nursing notes, nutrition notes and social work notes. More than 90% of the unique hospital admissions contain at least one discharge summary, with many including more than one. We make use of discharge summaries of individual hospital admissions in this research.
There are 6,984 distinct diagnosis ICD-9 codes and 2,032 distinct procedure ICD-9 codes reported in MIMIC III, among more than 50,000 patient admission records found in this database. Patient records in MIMIC III typically have more than one code assigned. However, the frequency of ICD-9 codes is extremely unevenly spread, with a large proportion of the ICD-9 codes occurring infrequently. Table 2 provides an overview of the frequency of ICD-9 codes and ICD-9 groupings. We focus on the top-level and sub-level groups in this study, along with the 50 most frequently occurring individual ICD-9 codes. Table 2 also presents frequency ranks of ICD-9 codes and sub-groups to showcase the unbalanced nature of the data. This biased nature is primarily because MIMIC III data were obtained from patients admitted to critical care.
|ICD-9 Top Level Grouping: 18 Groups|
|circ (390-459)||78.40||diges (520-579)||38.80||musc (710-739)||17.99|
|e+v (E- & V-)||69.09||bld (280-289)||33.56||pren (760-779)||17.07|
|endo (240-279)||66.51||symp (780-799)||31.36||neop (140-239)||16.37|
|resp (460-519)||46.63||ment (290-319)||29.66||skin (680-709)||12.02|
|inj (800-999)||41.42||nerv (320-389)||29.10||cong (740-759)||5.41|
|gen (580-629)||40.29||inf (001-139)||26.96||preg (630-679)||0.31|
|Top 50 ICD-9 codes|
|Frequency||ICD-9 code||%||Frequency||ICD-9 code||%|
|ICD-9 Sub-Level Grouping: 155 Groups|
|1||endo4 (270-279)||52.13||50||blood3 (288-289)||4.28|
|5||symp1 (780-789)||30.86||75||inj4 (820-829)||1.81|
|10||v6 (V40-V49)||25.76||100||cong7 (753-753)||0.56|
|25||diges6 (560-569)||11.47||155||v12 (V86-V86)||0.02|
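The top-level grouping in Table 2 can be read as a lookup from an individual ICD-9 code to one of the 18 groups. The following is a minimal sketch of that lookup, not the authors' implementation. The group names and numeric ranges are taken from Table 2; the function names and the assumption that the first three digits of a numeric code give its category (with or without a decimal point) are illustrative choices.

```python
# Top-level ICD-9 groups and their numeric ranges, as listed in Table 2.
TOP_LEVEL_RANGES = [
    ("inf", 1, 139), ("neop", 140, 239), ("endo", 240, 279),
    ("bld", 280, 289), ("ment", 290, 319), ("nerv", 320, 389),
    ("circ", 390, 459), ("resp", 460, 519), ("diges", 520, 579),
    ("gen", 580, 629), ("preg", 630, 679), ("skin", 680, 709),
    ("musc", 710, 739), ("cong", 740, 759), ("pren", 760, 779),
    ("symp", 780, 799), ("inj", 800, 999),
]


def top_level_group(code: str) -> str:
    """Map one ICD-9 code to its top-level group name (hypothetical helper)."""
    code = code.strip().upper()
    # E- and V-codes share a single group in Table 2.
    if code[:1] in ("E", "V"):
        return "e+v"
    # Assumption: the first three digits identify the category,
    # whether the code is written as "428.0" or "4280".
    category = int(code.split(".")[0][:3])
    for name, low, high in TOP_LEVEL_RANGES:
        if low <= category <= high:
            return name
    raise ValueError(f"code outside the known ICD-9 ranges: {code!r}")


def top_level_labels(codes):
    """Multi-label target for one admission: the set of groups its codes fall in."""
    return sorted({top_level_group(c) for c in codes})
```

An admission coded with heart failure (428.0), diabetes (250.00) and a V-code would therefore carry the three labels circ, endo and e+v at the top level.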
4.2 Training Embeddings
Representing words as embeddings is a common mechanism used in language processing Goldberg (2017). Embeddings obtained from algorithms such as Word2Vec, GloVe and fastText are used in text classification tasks, including medical applications. This research makes use of fastText Bojanowski et al. (2016); Joulin et al. (2016b, a), where words are represented as a bag of character n-grams, and word embeddings are obtained by summing these representations. This feature gives fastText the ability to produce vectors for words that are misspelt or concatenated. The nature of free-form medical text does benefit from this feature of fastText embeddings, and there are examples of medical applications where embeddings from fastText are shown to outperform other similar algorithms Huggard et al. (2019).
Medical codes for patients admitted to the hospital are labelled at individual admission or patient level documents rather than single words in a health record. Hence, we predict medical codes for entire documents, in this case, discharge summaries of unique hospital admission. For this research, document embeddings are obtained by computing the vector sum of the embeddings for each word in the document. This vector sum is then normalised to have length one, ensuring documents with different lengths have representations of similar magnitudes.
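The document embedding described above (sum the word vectors, then scale to unit length) can be sketched in a few lines. This is an illustrative sketch only: the `word_vectors` dictionary stands in for a trained embedding model, and skipping unknown tokens is an assumption made here for simplicity (fastText itself can build vectors for unseen words from character n-grams).

```python
import math


def document_embedding(tokens, word_vectors, dim):
    """Sum the word vectors of a document and normalise the sum to length one.

    `word_vectors` is a hypothetical dict mapping a token to a list of floats.
    """
    total = [0.0] * dim
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # assumption: tokens without a vector are ignored
        for i, value in enumerate(vector):
            total[i] += value
    norm = math.sqrt(sum(value * value for value in total))
    if norm == 0.0:
        return total  # empty document: leave the zero vector unchanged
    return [value / norm for value in total]
```

Because of the normalisation, a long and a short discharge summary with the same mix of words map to the same point, which is the property the paragraph above relies on.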
For comparison, our embeddings are trained to the exact same specifications as the fastText embeddings W300 presented in Grave et al. (2018). Table 3 presents details of the embeddings used in this research, including dimensional size, source data, model size and training time (processing was run on a 4 core Intel i7-6700K CPU @ 4.00GHz with 64GB of RAM). The word embeddings are trained using CBOW (T300, T600) and Skip-gram (T300SG, T600SG), with character n-grams of length 5, a window of size 5 and ten negative samples per positive sample. The learning rate used for training these models is 0.05.
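To make the training settings concrete, the sketch below assembles a fastText command line with the parameters stated above. It is a hedged illustration rather than the exact command used: the corpus path and output prefix are hypothetical, and mapping "character n-grams of length 5" to `-minn 5 -maxn 5` is an interpretation of the text.

```python
def fasttext_command(model, dim, corpus_path, output_prefix):
    """Build an argument list for the fastText command-line tool.

    `model` is "cbow" or "skipgram"; the paths are placeholders.
    """
    if model not in ("cbow", "skipgram"):
        raise ValueError("model must be 'cbow' or 'skipgram'")
    return [
        "fasttext", model,
        "-input", corpus_path,
        "-output", output_prefix,
        "-dim", str(dim),   # 300 or 600 in Table 3
        "-minn", "5",       # assumption: n-grams of length exactly 5
        "-maxn", "5",
        "-ws", "5",         # context window of size 5
        "-neg", "10",       # ten negative samples per positive sample
        "-lr", "0.05",      # learning rate
    ]


# The four models named in the text, with hypothetical file names.
MODELS = {
    "T300": ("cbow", 300),
    "T600": ("cbow", 600),
    "T300SG": ("skipgram", 300),
    "T600SG": ("skipgram", 600),
}
commands = {
    name: fasttext_command(model, dim, "trec_corpus.txt", name)
    for name, (model, dim) in MODELS.items()
}
```

Each list can be passed to `subprocess.run` on a machine where the fastText binary is installed.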
We make use of the data provided by TREC 2017 competitions Roberts et al. (2017) to train our embeddings. TREC 2017 provides an extensive 24G of health-related data. TREC data contains 26.8 million published abstracts of medical literature listed on PubMed Central, 241,006 clinical trials documents, and 70,025 abstracts from recent proceedings focused on cancer therapy from the American Association for Cancer Research and the American Society of Clinical Oncology Roberts et al. (2017).
|Models||Dimensions||Source Data||Train Time||Model Size|
|W300 Grave et al. (2018)||300||Wiki||-||7G|
|T300 Yogarajan et al. (2020a)||300||TREC||7 hours||13G|
|T600 Yogarajan et al. (2020a)||600||TREC||13 hours||23G|
4.3 Multi-label Classifiers
Generally, a given medical record is annotated with multiple tags for different diagnoses, procedures or treatments. That is, from a machine learning perspective, health text coding is a multi-label classification problem, where one text may belong to more than one label. For example, many ICD codes exist for matters that relate to hypertension or diabetes, and such illnesses often co-occur in individual patients, but they also occur independently. Thus, it would be useful to be able to classify a particular health text to one or the other, or both, or neither of these categories. Moreover, with approximately 13,000 ICD-9 categories for diagnoses and treatments that can combine almost arbitrarily for individual patients, the problem of multi-label classification for health records can be extraordinarily large.
In this section, we provide an overview of multi-label classifiers used for the experiments in this paper. For more details on these methods, see Read et al. (2016).
4.3.1 Binary relevance (BR)
The first and simplest multi-label classification algorithm used here is called binary relevance (BR) Godbole and Sarawagi (2004); Tsoumakas and Katakis (2007). A separate binary classification model is created for each label, such that any text with that label is a positive instance, and a negative instance otherwise (i.e. one versus all). To predict the labels for a new text, each classifier decides if the text is in or out of the class it has been trained to recognise, and the overall output for the new text is the set of all positive labels. Note that binary relevance ignores any potential relationships between labels.
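A minimal sketch of binary relevance is given below. It is not the MEKA implementation used in the experiments: the class names are hypothetical, and a toy nearest-centroid learner stands in for the base classifier so that the example is self-contained.

```python
class CentroidClassifier:
    """Toy binary base learner: predicts the class whose mean vector is closer."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, target in zip(X, y) if target == cls]
            if rows:
                self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda cls: squared_distance(self.centroids[cls]))


class BinaryRelevance:
    """One independent binary model per label (one versus all)."""

    def __init__(self, base_factory):
        self.base_factory = base_factory

    def fit(self, X, Y):
        n_labels = len(Y[0])
        # Column j of the label matrix Y is the binary target for label j.
        self.models = [
            self.base_factory().fit(X, [row[j] for row in Y])
            for j in range(n_labels)
        ]
        return self

    def predict_one(self, x):
        # Each model votes on its own label; no information is shared.
        return [model.predict_one(x) for model in self.models]


# Tiny example: two features, two labels.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
br = BinaryRelevance(CentroidClassifier).fit(X, Y)
prediction = br.predict_one([0.9, 0.1])
```

Any learner exposing `fit` and `predict_one` could be substituted for the toy classifier.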
4.3.2 Classifier Chains (CC)
BR models make their predictions independently. However, as seen with the earlier example of the strong correlation between diabetes and hypertension, a model could possibly benefit from the result of another when making its own decision. Accordingly, BR models can be ‘chained’ together into a sequence such that the predictions made by earlier classifiers are made available as additional features for the next classifier. Such a configuration is unsurprisingly called a classifier chain (CC) Read et al. (2009, 2011).
4.3.3 Ensemble of classifier chains (ECC)
The order of predictions in a classifier chain affects what advice later models have available from preceding ones when it’s their turn to make a judgment. This is a problem for multi-label classification in general, but particularly so for health records where dependencies between ICD codes are myriad, complex, and sometimes quite strong. One way to mitigate the problem of choosing a poor ordering is to create a collection of classifier chains that are each ordered randomly, then make final predictions by polling the results of all chains. Such a collection is called an ensemble of classifier chains (ECC) Read et al. (2009).
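The chaining and ensembling ideas can be sketched as follows. This is an illustrative sketch under stated assumptions, not the MEKA implementation: the class names are hypothetical, a toy nearest-centroid learner replaces the real base classifier, true labels are used as the extra features during training, and the ensemble combines chains by simple majority vote.

```python
import random


class CentroidClassifier:
    """Toy binary base learner: predicts the class whose mean vector is closer."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, target in zip(X, y) if target == cls]
            if rows:
                self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda cls: squared_distance(self.centroids[cls]))


class ClassifierChain:
    """Binary models linked in a fixed label order."""

    def __init__(self, base_factory, order):
        self.base_factory = base_factory
        self.order = list(order)

    def fit(self, X, Y):
        self.models = []
        augmented = [list(x) for x in X]
        for j in self.order:
            self.models.append(self.base_factory().fit(augmented, [row[j] for row in Y]))
            # Later models see the labels handled earlier in the chain.
            augmented = [features + [row[j]] for features, row in zip(augmented, Y)]
        return self

    def predict_one(self, x):
        augmented = list(x)
        output = [0] * len(self.order)
        for j, model in zip(self.order, self.models):
            output[j] = model.predict_one(augmented)
            augmented = augmented + [output[j]]  # feed the prediction forward
        return output


class EnsembleOfClassifierChains:
    """Several chains with random label orders, combined by majority vote."""

    def __init__(self, base_factory, n_chains=10, seed=0):
        self.base_factory = base_factory
        self.n_chains = n_chains
        self.seed = seed

    def fit(self, X, Y):
        rng = random.Random(self.seed)
        self.chains = []
        for _ in range(self.n_chains):
            order = list(range(len(Y[0])))
            rng.shuffle(order)
            self.chains.append(ClassifierChain(self.base_factory, order).fit(X, Y))
        return self

    def predict_one(self, x):
        votes = [chain.predict_one(x) for chain in self.chains]
        return [1 if 2 * sum(col) >= len(votes) else 0 for col in zip(*votes)]


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
chain = ClassifierChain(CentroidClassifier, order=[0, 1]).fit(X, Y)
ecc = EnsembleOfClassifierChains(CentroidClassifier, n_chains=5).fit(X, Y)
```

With only two labels there are just two possible chain orders; the benefit of random orderings grows with the number of labels.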
4.3.4 Multi-label k-nearest neighbor classifier (MLkNN)
MLkNN Zhang and Zhou (2005) is a multi-label variant of the standard k-nearest neighbour (kNN) algorithm that predicts the set of the most common labels among the k nearest neighbours. To guard against any anomalies inside a neighbourhood, a Bayesian calibration step refines the raw predictions. An important characteristic of this approach is its excellent scalability with respect to the number of labels: the set of nearest neighbours needs to be calculated only once for a given query text.
4.3.5 Neural Networks
As emphasised earlier, the focus of this research is only on the embeddings layer, and we use the most commonly used multi-label classifiers for prediction. However, the outcome of this research can be incorporated into a neural network, where embeddings layers are generally used to represent text Goldberg (2017). Furthermore, recent NLP techniques such as BERT (Bidirectional Encoder Representations from Transformers) and BioBERT showed significant improvements on some other biomedical tasks in 2019 Lee et al. (2019); Abacha et al. (2019). These are all worthy avenues for future research.
4.3.6 MEKA and base classifiers
All of the classification results presented in this research were obtained using MEKA Read et al. (2016), an open-source Java system specifically designed to support multi-label classification experiments. MEKA includes almost all widely-used algorithms and evaluation metrics. The default algorithm for each class (i.e. the base classifier) within MEKA is logistic regression, and this is used for the majority of our experiments; however, stochastic gradient descent (SGD) is used for tests with ECC, with ensembles of 50, 100 and 500 randomly ordered classifier chains.
4.4 Statistical assessment of differences
We perform non-parametric tests to verify if there are statistically significant differences between algorithms, as described in Demšar (2006); Garcia and Herrera (2009). First we use Davenport’s corrected Friedman test to check if we can safely reject the null hypothesis that all algorithms perform the same. If there are differences, we proceed with the post-hoc Nemenyi test to determine the critical difference (CD) that serves to identify algorithms with different performance. We include the critical difference plots in our results.
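The two-stage procedure can be sketched with the standard formulas from Demšar (2006): the Friedman statistic on average ranks, Davenport's F-distributed correction, and the Nemenyi critical difference. The sketch below is illustrative; the function names are hypothetical, and the critical value `q_alpha` must be looked up in a studentised-range table for the chosen significance level rather than computed here.

```python
import math


def average_ranks(scores):
    """scores[i][j] is the score of algorithm j on data set i (higher is better).

    Returns the mean rank of each algorithm; ties share the average rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared_rank = (pos + end) / 2 + 1
            for j in order[pos:end + 1]:
                totals[j] += shared_rank
            pos = end + 1
    return [total / len(scores) for total in totals]


def friedman_davenport(scores):
    """Return (average ranks, Friedman chi-square, Davenport-corrected F)."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
    denominator = n * (k - 1) - chi2
    # F follows an F-distribution with (k - 1) and (k - 1)(n - 1) degrees of freedom.
    f_stat = math.inf if denominator == 0 else (n - 1) * chi2 / denominator
    return ranks, chi2, f_stat


def nemenyi_cd(q_alpha, k, n):
    """Critical difference between average ranks for k algorithms on n data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))
```

Two algorithms are considered to differ when their average ranks are further apart than the value returned by `nemenyi_cd`, which is what a critical difference plot visualises.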
4.5 FastText Parameters
As mentioned in Section 4.2, the models used in this research are trained to the exact same specifications as published models trained on general text. We also present a comparison of variations of specifications for training embeddings for a multi-label medical text classification problem. The combinations of parameter choices are presented in Table 4. The learning rate used to train all of the variations presented is 0.05. For simplicity all dimensions were set to 50. The two word representation models are Skip-gram and CBOW. It is important to note that these combinations result in 18 different embeddings models, and only a selected sub-set of the experimental results are presented in this paper.
|Option||Dimensions||Window||neg||Character n-gram||Loss Function||Epoch|
The Continuous Bag-of-Words Model (CBOW) Mikolov et al. (2013) is similar to a feed-forward neural network language model with the non-linear hidden layer removed and the projection layer shared for all words. CBOW predicts the current word from the surrounding words. The Skip-gram architecture Mikolov et al. (2013) is similar to that of CBOW, but Skip-gram uses the current word to predict the words before and after within a given range. For example, for a sentence “Male patient is admitted to the hospital”, CBOW predicts the word “admitted” using the source context words (“Male”, “patient”, “is”, “to”, “the”, “hospital”), whereas Skip-gram predicts context words like “patient” or “hospital” for the source word “admitted”.
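The difference between the two architectures is easiest to see in the training examples each one generates. The sketch below is illustrative only: the function names are hypothetical, and a window of 3 is chosen so that the context of “admitted” covers the whole example sentence.

```python
def context_window(tokens, i, window):
    """Words within `window` positions before and after position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def cbow_examples(tokens, window):
    """CBOW: (context words, target word) pairs, context predicts the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: (source word, context word) pairs, source predicts each context word."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]


sentence = "Male patient is admitted to the hospital".split()
cbow = cbow_examples(sentence, window=3)
skipgram = skipgram_examples(sentence, window=3)
```

CBOW yields one example per word position, while Skip-gram yields one example per (word, neighbour) pair, so Skip-gram sees many more training pairs for the same text.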
4.6 Pre-processing Text Data
We present a comparison of F-measures between pre-processed discharge summary and text ‘as is’ as presented by MIMIC III data. MIMIC III data has been de-identified and pre-processed before being released for research access. Also, most models developed using MIMIC pre-process the text and truncate the maximum number of words Mullenbach et al. (2018). Text pre-processing includes removal of tokens without alphabetic characters, down-casing all tokens, removal of punctuation and truncating the number of tokens in a given discharge summary.
On the other hand, experiments presented in this research use MIMIC III discharge summaries ‘as is’ with minimum pre-processing. This allows us to maximise the use of features in medical free-form text as embeddings are case sensitive. It also avoids the meanings of abbreviations and acronyms used in a medical context being altered. An example of text ‘as is’ followed by an example of pre-processed text is presented below:
Medicine HISTORY OF PRESENT ILLNESS: This is an 81-year-old female with a history of emphysema, presents with 3 days of shortness of breath thought by her primary care. Medications on Admission: Omeprazole 20 mg daily, Furosemide 10mg daily. Tablet Sustained Release 24 hr PO once a day.
medicine history of present illness this is an 81yearold female with a history of emphysema presents with days of shortness of breath thought by her primary care medications on admission omeprazole mg daily furosemide 10mg daily tablet sustained release hr po once a day
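The pre-processing steps listed above (removing punctuation, down-casing, dropping tokens without alphabetic characters and truncating) can be sketched as follows. This is a minimal sketch, not the exact pipeline used in the experiments; the function name and the order of the steps are assumptions chosen so that the ‘as is’ example above maps to the pre-processed example.

```python
import re


def preprocess(text, max_tokens=None):
    """Down-case, strip punctuation, drop tokens with no letters, then truncate."""
    tokens = []
    for raw in text.split():
        # Remove every character that is not a letter or digit, then lower-case.
        token = re.sub(r"[^A-Za-z0-9]", "", raw).lower()
        # Keep only tokens that still contain at least one alphabetic character.
        if any(ch.isalpha() for ch in token):
            tokens.append(token)
    if max_tokens is not None:
        tokens = tokens[:max_tokens]
    return " ".join(tokens)
```

Note how the dose in “Omeprazole 20 mg” is lost while “10mg” survives, which illustrates the kind of information that pre-processing can alter in medical text.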
4.7 Concatenating Embeddings
We explore the option of splitting the free-form medical data into sections and concatenating the embeddings. The discharge summary is split into seven logical sections: Admission Date, Past Medical History, Pertinent Results, Brief Hospital Course, Medications on Admission, Discharge Diagnosis and Followup Instructions. Embeddings for each section can be obtained and concatenated. For example, if a 50 dimensional embeddings model is used, the resulting concatenated embedding has 350 dimensions. If the discharge summary does not include one of the sections mentioned above, then the respective embeddings are all zeros. For hospital admissions with more than one available discharge summary, all the summaries are first embedded independently, and then averaged into one final embedding.
Another variation considers concatenating statistical outcomes of the embeddings from each of the sections of a given hospital admission. For these experiments we look at the minimum, maximum, mean, standard deviation, lower quartile and upper quartile of the embeddings, and hence the resulting embedding will have six times as many dimensions as the original one.
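Both concatenation schemes can be sketched briefly. The sketch is illustrative and rests on assumptions: the function names are hypothetical, the statistics are taken element-wise across the section embeddings of one admission, the standard deviation is the population form, and the quartiles use the inclusive method of Python's `statistics.quantiles`. The text does not specify these details.

```python
import statistics

SECTIONS = [
    "Admission Date", "Past Medical History", "Pertinent Results",
    "Brief Hospital Course", "Medications on Admission",
    "Discharge Diagnosis", "Followup Instructions",
]


def concat_sections(section_vectors, dim):
    """Concatenate the seven section embeddings; missing sections become zeros."""
    combined = []
    for name in SECTIONS:
        combined.extend(section_vectors.get(name, [0.0] * dim))
    return combined


def concat_statistics(vectors):
    """Concatenate min, max, mean, sd, lower and upper quartile per dimension.

    Needs at least two vectors, because quartiles of a single point are undefined.
    """
    columns = [list(column) for column in zip(*vectors)]
    quartiles = [statistics.quantiles(c, n=4, method="inclusive") for c in columns]
    return (
        [min(c) for c in columns]
        + [max(c) for c in columns]
        + [statistics.fmean(c) for c in columns]
        + [statistics.pstdev(c) for c in columns]
        + [q[0] for q in quartiles]   # lower quartile
        + [q[2] for q in quartiles]   # upper quartile
    )
```

With 50 dimensional base embeddings, `concat_sections` returns 7 × 50 = 350 values and `concat_statistics` returns 6 × 50 = 300 values.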
4.8 Tagging Words
We explore two variations of tagging words in medical free-form text. Part-of-speech (POS) tagging is a technique where the syntactic categories of words in a given sentence are identified automatically. Common examples of such POS tags are: noun, verb, adjective, adverb, pronoun, preposition, conjunction and interjection. We make use of the Natural Language Toolkit (NLTK, http://nltk.org/) Bird et al. (2009) POS tagger. For example, if the input text is:
History of Present Illness 54 year old female with recent diagnosis of ulcerative colitis on mercaptopurine
then the tagged output is:
HistoryNN ofIN PresentNNP IllnessNNP 54CD yearNN oldJJ femaleNN withIN recentJJ diagnosisNN ofIN ulcerativeJJ colitisNN onIN mercaptopurineJJ
where NN indicates a noun, IN a preposition or conjunction, NNP a proper noun, CD a numeral and JJ an adjective or ordinal numeral.
Also, we tag the words of MIMIC III discharge summaries using the text splits presented in Section 4.7. Tokens in each of these sections are tagged with 0_, 1_, 2_, 3_, 4_, 5_, 6_ for text in the seven splits Admission Date, Past Medical History, Pertinent Results, Brief Hospital Course, Medications on Admission, Discharge Diagnosis and Followup Instructions respectively.
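The text split tagging can be sketched as a simple prefixing step. This is an illustrative sketch: the function name is hypothetical, and attaching the tag as a prefix directly to each token (for example `1_emphysema`) is an assumption about how the tags are combined with the words.

```python
SECTIONS = [
    "Admission Date", "Past Medical History", "Pertinent Results",
    "Brief Hospital Course", "Medications on Admission",
    "Discharge Diagnosis", "Followup Instructions",
]


def tag_section_tokens(sections):
    """Prefix every token with the index (0_ to 6_) of the section it came from.

    `sections` maps a section name to its raw text; absent sections are skipped.
    """
    tagged = []
    for index, name in enumerate(SECTIONS):
        for token in sections.get(name, "").split():
            tagged.append(f"{index}_{token}")
    return tagged


example = tag_section_tokens({
    "Past Medical History": "emphysema hypertension",
    "Discharge Diagnosis": "pneumonia",
})
```

Under this scheme the same word receives a different token, and therefore a different embedding, depending on the section in which it appears.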
This section presents results for the top level ICD-9 grouping, the sub-level grouping, and the overall top 50 highest frequency ICD-9 codes where the number of labels are 18, 155, and 50 respectively. The top level ICD-9 groupings are primarily used to present comparisons and detailed results for the multi-label medical text classification techniques mentioned in Section 4. Results are primarily presented at the level of individual labels to enable better understanding of the imbalanced nature of the data and to observe improvements in F-measure. We also present micro-averaged and macro-averaged F1 scores to facilitate comparisons across the variations in embeddings to the overall system.
5.1 Top-Level Groups of Medical Codes
This section presents results for the 18 top-level ICD-9 groups, as mentioned in Table 2. We also present comparisons of 18 groups treated as individual binary problems, as presented in Yogarajan et al. (2020a), where appropriate. The results in this section are aligned with the experimental methodology as described above.
5.1.1 Comparing Multi-label Classifiers
|groups||Best case||Best case|
|F1||E =||I =||F1||I =|
|endo||0.848||0.839||0.848||0.851||50||100||0.848||10, 30, 100|
|gen||0.731||0.657||0.731||0.735||500||30||0.739||10, 30, 100|
Table 5 presents a comparison between several multi-label classifiers to predict the 18 top-level ICD-9 groups. Critical difference plots are available in Figure 2. Performance when considering the 18 groups as individual binary classification problems is also presented. All experiments use the T300 word embedding and 10-fold cross validation. As anticipated, using multi-label variations does provide an advantage over the individual binary classification case. Evidently, for most ICD-9 groups ECC with logistic regression (LR) performs best. Optimising the number of iterations and epochs can improve F-measure results. ECC-LR, with a suitably chosen ridge value and number of iterations, achieves the best results overall. Experiments across a range of different ridge values provided almost identical values for F-measure, hence only a ridge value of 1 is included in Table 5.
5.1.2 FastText Parameter Choices
Table 6 presents a comparison of F-measures of the fastText parameter choices I-IX for both CBOW and Skip-gram embeddings as outlined in Section 4.5. Critical difference plots are available in Fig. 3. Results correspond to 18 embeddings × 13 classifiers × 10 cross-validation folds, for a total of 2340 tests. The best F-measure among the models using the options presented in Table 4 for CBOW is also presented. Evidently, for all 18 groups, Skip-gram out-performs CBOW. Option I has the same specifications as the W300 embeddings, and the embeddings presented in Table 3 are trained as per option I for comparison. However, it is evident from Table 6 that varying the fastText parameters impacts F-measures across all 18 ICD-9 groups, but not necessarily always for the better. Thus, care must be taken when selecting these parameters.
Figure 4 presents a comparison of text pre-processed and truncated with the text ‘as is’ within MIMIC III. As mentioned earlier, MIMIC III pre-processes and de-identifies all free-form text released to the public. Here we further process the discharge summary and truncate it to the maximum number of tokens. Generally, the option of discharge summary pre-processed and truncated to 1000 tokens maximum performs much worse than the other options. However, when comparing the text ‘as is’ to the other two pre-processed options, there is very little or no difference in the F-measures. Even for very infrequent categories the differences in F-measure are very marginal. Hence, the question of benefits over trade-off (known and unknown) with regard to pre-processing medical text, or not, remains unclear. It’s important to point out that apart from the results presented in this section, all other results presented in this paper are obtained using discharge summaries without any additional pre-processing or truncating other than that already done by MIMIC III.
5.1.4 Comparison between Embeddings
Figure 5 presents a comparison of top-level ICD-9 groups between CBOW and Skip-gram models, and between 300 and 600 dimensions. As observed in the binary case, presented in Yogarajan et al. (2020a), an increase in dimension does provide an improvement in F-measure. This is more evident with lower frequency groups such as skin, cong and preg. Skip-gram is consistently better than CBOW, as observed in Section 5.1.2. We also compared against W300 embeddings; the multi-label results mirror the binary case, where health-related pre-trained embeddings provide an advantage over general text pre-trained embeddings across all 18 groups.
Figure 6 presents a comparison of embeddings formed by concatenating embeddings as per Section 4.7. The base embeddings used here are T50. CONCAT300 is formed by concatenating the embeddings of the statistical outcomes, i.e. CONCAT300 = 50 dim × (min + max + mean + sd + q1 + q3). CONCAT350 is formed by concatenating the embeddings of the seven text splits, i.e. 7 × 50 dim. Both CONCAT300 and CONCAT350 improve F-measures relative to the base embeddings T50, except for the ICD-9 group pren. CONCAT350 generally performs better than CONCAT300. However, the T300 embeddings outperform both CONCAT300 and CONCAT350 across all 18 groups. Also, the improvements that CONCAT300 and CONCAT350 produce over T50 are not replicated for larger embeddings. For instance, when starting with T300 and generating CONCAT1800 or CONCAT2100, no significant improvements are observed. This may be due to the fact that T300 already performs much better than T50, possibly leaving little room for further improvement. Further research is needed to investigate this behaviour in more detail.
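To make the CONCAT300 construction concrete, the following is a minimal sketch of how six per-dimension statistics over a document's word vectors could be concatenated into one feature vector. It is an illustration under stated assumptions, not the implementation used in the experiments: the function names `concat_stats` and `quartiles`, the use of the population standard deviation, the "inclusive" quartile method, and the zero vector for documents with no known tokens are all hypothetical choices that Section 4.7 may define differently.

```python
import statistics
from typing import Dict, List, Sequence, Tuple


def quartiles(values: Sequence[float]) -> Tuple[float, float]:
    """Return (q1, q3); assumption: 'inclusive' method, single value is its own quartile."""
    if len(values) < 2:
        return values[0], values[0]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def concat_stats(tokens: Sequence[str],
                 vectors: Dict[str, List[float]],
                 dim: int) -> List[float]:
    """Concatenate per-dimension min, max, mean, sd, q1 and q3 of the word vectors.

    The result has 6 * dim entries, e.g. 300 for 50-dimensional base embeddings.
    """
    rows = [vectors[t] for t in tokens if t in vectors]  # skip out-of-vocabulary tokens
    if not rows:
        return [0.0] * (6 * dim)  # assumption: empty documents map to zeros
    columns = list(zip(*rows))  # one tuple of values per embedding dimension
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]  # assumption: population sd
    qs = [quartiles(c) for c in columns]
    q1s = [q[0] for q in qs]
    q3s = [q[1] for q in qs]
    return mins + maxs + means + sds + q1s + q3s


# Toy 2-dimensional vocabulary, purely illustrative.
toy_vectors = {"chest": [1.0, 10.0], "pain": [2.0, 20.0], "acute": [3.0, 30.0]}
features = concat_stats(["acute", "chest", "pain"], toy_vectors, dim=2)
print(len(features))  # 6 * dim entries
```

With 50-dimensional base embeddings the same function would yield a 300-dimensional document representation, and with 300-dimensional base embeddings an 1800-dimensional one, matching the sizes quoted above.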
5.1.5 Tagging Words
Figure 7 presents a comparison of F-measures for top level ICD-9 groups among the MIMIC III discharge summaries with POS tags, with text split tags and for the raw text without any tagging. Evidently, except for categories bld, sym, and preg, using a POS tagger does not improve the F-measures. For categories bld, sym and preg the use of the POS tagger improves the F-measure, from 0.612 to 0.616, from 0.552 to 0.555, and from 0.470 to 0.491, respectively. When using the text split tagger, the F1 score for circ is equivalent to the no-tagger case, and for category endo there is a small improvement from 0.848 to 0.850.
5.1.6 Summary: Top level ICD-9 Groups
|ICD-9||ML Classifier||Concatenating Embeddings||Text Tagger||Pre-processing|
|circ||ECC-SGD, E=100, I = 10||T50 < CONCAT300 < CONCAT350 < T300||no tagger = TextSplitTag||truncated to 2500|
|e+v||ECC-SGD, E=100, I = 30||T50 < CONCAT350 < CONCAT300 < T300||no tagger||truncated to 2500, no truncate|
|endo||ECC-SGD, E=50, I = 100||T50 < CONCAT300 < CONCAT350 < T300||TextSplitTag||no truncate|
|resp||ECC-SGD, E=500, I = 10||T50 < CONCAT300 < CONCAT350 < T300||no tagger||no truncate|
|inj||ECC-LR, R = 1, I = 30||T50 < CONCAT300 < CONCAT350 < T300||no tagger||truncated to 2500|
|gen||ECC-LR, R = 1, I = 10, 30, 100||T50 < CONCAT350 < CONCAT300 < T300||no tagger||text ‘as is’|
|diges||ECC-LR, R = 1, I = 10||T50 < CONCAT350 < CONCAT300 < T300||no tagger||text ‘as is’|
|bld||ECC-LR, R = 1, I = 10||T50 < CONCAT300 < CONCAT350 < T300||POS||no truncate|
|symp||ECC-LR, R = 1, I = 10||T50 < CONCAT300 < CONCAT350 < T300||POS||no truncate|
|ment||ECC-LR, R = 1, I = 10||T50 < CONCAT300 < CONCAT350 < T300||no tagger||no truncate|
|nerv||ECC-LR, R = 1, I = 10||T50 < CONCAT300 < CONCAT350 < T300||no tagger||no truncate|
|inf||ECC-LR, R = 1, I = 10||T50 < CONCAT350 = CONCAT300 < T300||no tagger||no truncate|
|musc||ECC-LR, R = 1, I = 10||T50 < CONCAT300 < CONCAT350 < T300||no tagger||no truncate|
|pren||BR, Log, R = 1||CONCAT350 < CONCAT300 < T50 < T300||no tagger||truncated to 2500|
|neop||ECC-LR, R = 1, I = 10||T50 < CONCAT300 < CONCAT350 < T300||no tagger||no truncate|
|skin||ECC-LR, R = 1, I = 10||T50 < CONCAT300 < CONCAT350 < T300||no tagger||truncated to 2500|
|cong||ECC-LR, R = 1, I = 30||T50 < CONCAT350 < CONCAT300 < T300||no tagger||no truncate|
|preg||BR-LR, R = 1||T50 < CONCAT300 < CONCAT350 < T300||POS||text ‘as is’|
The 18 top-level ICD-9 groups are used to present experimental results of several variations of techniques as outlined in Section 4. Skip-gram models improve the F-measures of all 18 groups compared to CBOW. We also presented comparisons of several variations of parameters to pre-trained embeddings for both CBOW and Skip-gram. The results of such modifications indicate changes to F-measures across all 18 labels, but not always for the better (see Section 5.1.2 for details of results). The T600SG embeddings are the best-performing choice for all but one of the groups. It is only for category gen that the CBOW-based T600 manages to outperform T600SG. As observed in the binary case, embeddings trained using health-related data do provide an advantage over general text pre-trained embeddings. Also, higher dimensions improve F-measures, which is especially evident in low-frequency categories. A summary of the choices which provided the best F-measure for each top-level ICD-9 group is presented in Table 7.
5.2 Highest Frequency Medical Codes
This section presents the results for the top 50 most frequent individual ICD-9 codes. Mullenbach et al. (2018), arguably the most prominent work in predicting ICD-9 codes from MIMIC III, considers the top 50 most frequent codes from both diagnosis and procedure ICD-9 codes. Hence, we also present results for the same top 50 ICD-9 codes, where the most frequent code is 401.9 and occurs in 35.13% of all cases, whereas the least frequent code V15.82 is only present in about 5% of all cases (see Table 2 for more details).
Figure 8 presents a comparison of F-measures between general text pre-trained embeddings W300 and health-related pre-trained T300 and T600 embeddings for the 50 most frequently occurring ICD-9 codes. As observed with the 18 top-level ICD-9 groups, F-measures of health-related trained embeddings are better than those of general text trained embeddings. Also, an increase in dimensions from 300 to 600 results in a considerable improvement in F-measure, far more than what was noticed in the 18 label case. For example, for ICD-9 code 530.81 the F-measure improves from 0.194 to 0.329, and for 410.71 from 0.332 to 0.404.
Figure 9 presents a comparison between CBOW and Skip-gram trained embeddings for the 50 most frequently occurring ICD-9 codes. As with the 18 label case, Skip-gram models are in general better than CBOW models, except for a few ICD-9 codes, such as 995.5 or 389.
5.3 Sub-level Groups of Medical Codes
A comparison of F-measures for sub-level ICD-9 groups is presented where the 155 labels are treated as one multi-label classification problem. We only use the sub-groups which are recorded in more than ten unique hospital admissions, hence 155 and not the entire 167 possible sub-groups. F-measures are presented from highest frequency of occurrence to the lowest. Figure 10 presents results for ICD-9 sub-groups with occurrences of more than 5%, Figure 11 for sub-level groups with occurrences between 1% and 5%, and Figure 12 contains sub-groups with less than 1% occurrences. A comparison of F-measures between W300 and T300, and between 300 and 600 dimensions is presented.
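The label selection and grouping described above can be sketched as follows. This is a hedged illustration only: the function name `bucket_labels_by_frequency`, the input layout (a mapping from a hypothetical admission identifier to its sub-group labels) and the treatment of shares exactly equal to 1% or 5% are assumptions, not details taken from the experiments.

```python
from collections import Counter
from typing import Dict, Iterable, List


def bucket_labels_by_frequency(admission_labels: Dict[str, Iterable[str]],
                               min_admissions: int = 10) -> Dict[str, List[str]]:
    """Keep labels seen in more than `min_admissions` unique admissions,
    then group them by their share of all admissions."""
    counts: Counter = Counter()
    for labels in admission_labels.values():
        counts.update(set(labels))  # count each label once per admission

    total = len(admission_labels)
    buckets: Dict[str, List[str]] = {"above_5pct": [], "1_to_5pct": [], "below_1pct": []}
    for label, n in counts.most_common():  # highest frequency first
        if n <= min_admissions:
            continue  # too rare to be included
        share = n / total
        if share > 0.05:
            buckets["above_5pct"].append(label)
        elif share >= 0.01:  # assumption: boundary values fall in the middle bucket
            buckets["1_to_5pct"].append(label)
        else:
            buckets["below_1pct"].append(label)
    return buckets
```

Applied to the sub-level groups, a procedure of this kind would leave the 155 retained labels split across the three frequency ranges used for Figures 10 to 12.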
5.4 Overall Results
In this section we present results for the overall performance of a multi-label medical text classification problem with 18, 50 and 155 labels. We present micro-averaged and macro-averaged F1 measures, aligned with prior work (see Section 2 for examples).
Table 8 presents micro- and macro-averaged F-measures for 18, 50 and 155 label multi-label medical text classification tasks with embedding variations. The overall pattern matches the observations for individual label-level F-measures, where the combination of higher dimensions and a Skip-gram model usually results in the highest performance measures. For all three groups of labels (18, 50, or 155), micro- and macro-averaged F1 scores are always better for T300 than for W300.
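As a reminder of how the two averages differ, the sketch below computes both from binary indicator matrices. Micro-averaging pools true positives, false positives and false negatives over all labels before computing F1, whereas macro-averaging computes F1 per label and then takes the unweighted mean, so rare labels weigh as much as frequent ones. The function names are hypothetical, and scoring a label with no positives and no predictions as 0 is an assumed convention.

```python
from typing import List, Sequence, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); assumption: 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Sequence[Sequence[int]],
                   y_pred: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for multi-label 0/1 indicator rows."""
    n_labels = len(y_true[0])
    tp: List[int] = [0] * n_labels
    fp: List[int] = [0] * n_labels
    fn: List[int] = [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1
    micro = f1_from_counts(sum(tp), sum(fp), sum(fn))  # pooled counts
    macro = sum(f1_from_counts(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    return micro, macro


# Label 0 is frequent and always correct; label 1 is rare and always missed.
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 1]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

In this toy case the micro-averaged score stays high because the frequent label dominates the pooled counts, while the macro-averaged score is pulled down by the rare label, which is the same pattern as the gap between the two columns for the 155-label task.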
|Label set||Model Description||Micro F1||Macro F1|
|Top-level: 18 labels||T300, truncated to 2500*||0.737||0.654|
|Top-level: 18 labels||CONCAT350||0.702||0.593|
|Top-level: 18 labels||T300, POS tag||0.730||0.646|
|Top-level: 18 labels||T300, ECC-SGD, E=500, I=100**||0.721||0.634|
|Top 50: 50 labels||T600SG||0.539||0.493|
|Sub-level: 155 labels||W300||0.534||0.293|
|Sub-level: 155 labels||T600SG||0.568||0.337|
We present a detailed analysis of clinical NLP techniques used to enhance the embeddings layer of a multi-label medical text classification task. We focus on predicting ICD-9 codes for patients with multi-morbidity, and present results for 18, 50 and 155 labels. Results and analysis are primarily done at the individual label level. Given the imbalanced nature of the data, it is evident at the individual label level that variations in embeddings, such as the use of a Skip-gram model over CBOW and higher dimensional embeddings, do result in improvements in F-measure. These improvements are more pronounced for less frequent labels. This is evident across all three setups, regardless of using 18, 50 or 155 labels. These improvements and differences are also evident in the overall micro- and macro-averaged F-measures. This research emphasises the need for enhancing text representations, and results show that there is a definite benefit in incorporating additional features. The benefits depend on the data distributions and the task at hand. This paper used predicting medical codes as an example. However, the NLP techniques used in this research can be adapted to other tasks where multi-label medical text classification is required.
Our analysis on pre-processed text only presents marginal improvements to F-measure when compared to text ‘as is’. Hence there is no clear indication that additional pre-processing is required on already pre-processed and de-identified data, such as MIMIC III, especially given the nature of the medical text.
- Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 370–379. Cited by: §4.3.5.
- Patterns of multimorbidity associated with 30-day readmission: a multinational study. BMC public health 19 (1), pp. 738. Cited by: §1.
- Multi-label classification of patient notes: case study on ICD code assignment. Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: Table 1.
- Natural Language Processing with Python. O’Reilly Media. Cited by: §4.8.
- Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. Cited by: §4.2.
- Secondary analysis of electronic health records. Springer. Cited by: §2, §4.1.
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, pp. 1–30. Cited by: §4.4.
- ML-net: multi-label classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association 26 (11), pp. 1279–1285. Cited by: Table 1.
- What we need to learn about multimorbidity. CMAJ 190 (34). Cited by: §1.
- An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research 9, pp. 2677–2694. Cited by: §4.4.
- The Yale cTAKES extensions for document classification: architecture and application. Journal of the American Medical Informatics Association 18 (5), pp. 614–620. Cited by: §2.
- Discriminative methods for multi-labeled classification. In Pacific-Asia conference on knowledge discovery and data mining, pp. 22–30. Cited by: §4.3.1.
- Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies 10 (1), pp. 1–309. Cited by: §1, §4.2, §4.3.5.
- PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220. Cited by: §2, §4.1.
- Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Cited by: §4.2, Table 3.
- Sensitivity for multimorbidity: the role of diagnostic uncertainty of physicians when evaluating multimorbid video case-based vignettes. PloS one 14 (4). Cited by: §1.
- Feature importance for biomedical named entity recognition. In Australasian Joint Conference on Artificial Intelligence, pp. 406–417. Cited by: §2, §4.2.
- Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics 13 (6), pp. 395. Cited by: §3.
- Reproducibility in critical care: a mortality prediction case study. In Machine Learning for Healthcare Conference, pp. 361–376. Cited by: §2.
- MIMIC-III, a freely accessible critical care database. Scientific data 3, pp. 160035. Cited by: §4.1.
- FastText.zip: compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §4.2.
- Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. Cited by: §4.2.
- NCUEE at MEDIQA 2019: Medical Text Inference Using Ensemble BERT-BiLSTM-Attention Model. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 528–532. Cited by: §4.3.5.
- Automated ICD-9 coding via a deep learning approach. IEEE/ACM transactions on computational biology and bioinformatics. Cited by: Table 1.
- Integrated cTAKES for Concept Mention Detection and Normalization. In CLEF (Working Notes), Cited by: §2.
- Efficient estimation of word representations in vector space. CoRR abs/1301.3781. Cited by: §4.5.
- The associations of multimorbidity with the sum of annual medical and long-term care expenditures in Japan. BMC geriatrics 19 (1), pp. 69. Cited by: §1.
- Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695. Cited by: Table 1, §2, §4.6, §5.2.
- Benchmark of deep learning models on large healthcare MIMIC datasets. arXiv preprint arXiv:1710.08531. Cited by: §2.
- Classifier chains for multi-label classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 254–269. Cited by: §4.3.2, §4.3.3.
- Classifier chains for multi-label classification. Machine learning 85 (3), pp. 333. Cited by: §4.3.2.
- MEKA: A Multi-label/Multi-target Extension to WEKA. Journal of Machine Learning Research 17 (21), pp. 1–5. Cited by: §4.3.6, §4.3.
- Comparison of MetaMap and cTAKES for entity extraction in clinical notes. BMC Medical Informatics and Decision Making 18 (3), pp. 74. Cited by: §2.
- EMR coding with semi–parametric multi–head matching networks. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, Vol. 2018, pp. 2081. Cited by: Table 1.
- Overview of the TREC 2017 precision medicine track. NIST Special Publication, pp. 500–324. Cited by: §4.2.
- Beyond the grey tsunami: a cross-sectional population-based study of multimorbidity in Ontario. Canadian Journal of Public Health 109 (5-6), pp. 845–854. Cited by: §1.
- Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association 17 (5), pp. 507–513. Cited by: §2.
- Multi-label classification: an overview. International Journal of Data Warehousing and Mining (IJDWM) 3 (3), pp. 1–13. Cited by: §4.3.1.
- UEvora at CLEF eHealth 2017 Task 3. In CLEF (Working Notes), Cited by: §2.
- Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words For Predicting Medical Codes. Proceedings of the Asian Conference on Intelligent Information and Database Systems (ACIIDS 2020). In N. T. Nguyen et al. (Eds.), Lecture Notes on Artificial Intelligence (LNAI), Springer Nature. (to appear). 12033, pp. 1–12. Cited by: §1, §2, Table 3, §5.1.4, §5.1.
- A review of automatic end-to-end de-identification: is high accuracy the only metric?. Applied Artificial Intelligence, pp. 1–19. Cited by: §1.
- Automatic ICD-9 coding via deep transfer learning. Neurocomputing 324, pp. 43–50. Cited by: Table 1, §2.
- A k-nearest neighbor based algorithm for multi-label classification. In 2005 IEEE international conference on granular computing, Vol. 2, pp. 718–721. Cited by: §4.3.4.