Seeing The Whole Patient: Using Multi-Label Medical Text Classification Techniques to Enhance Predictions of Medical Codes

by Vithya Yogarajan, et al.
University of Waikato

Machine learning-based multi-label medical text classification can be used to enhance the understanding of the human body and aid patient care. We present a broad study of clinical natural language processing techniques for maximising the features representing text when predicting medical codes for patients with multi-morbidity. We present results for multi-label medical text classification problems with 18, 50 and 155 labels. We compare several variations of embeddings, text tagging, and pre-processing. For imbalanced data we show that labels which occur infrequently benefit the most from additional features incorporated in embeddings. We also show that high-dimensional embeddings pre-trained using health-related data yield a significant improvement in a multi-label setting, similarly to the way they improve performance for binary classification. The high-dimensional embeddings from this research are made available for public use.




1 Introduction

The human body is a very complex system, and often patients admitted to hospitals with one initial prognosis or diagnosis have multiple related or unrelated chronic diseases, referred to as multi-morbidity. Modern medical practice emphasises the need to understand the patient as a whole, as multi-morbidity increases the patient’s overall burden of disease, and worsens prognosis Flegel (2018); Ryan et al. (2018); Hausmann et al. (2019); Aubert et al. (2019); Mori et al. (2019). Multi-morbidity makes the diagnosis of each disease more complicated, and physicians may be less accurate in their diagnoses Hausmann et al. (2019). The effects of different conditions may interact with each other, and complicate the management of each disease Flegel (2018). This, in turn, leads to poorer outcomes, such as increased preventable hospital re-admissions, overall hospital re-admissions, and increased total medical and long term care costs Mori et al. (2019); Aubert et al. (2019). For example, a patient newly diagnosed with HER2 (human epidermal growth factor receptor 2) positive breast cancer may also have underlying, possibly undiagnosed, heart failure. This can be crucial, as some treatments for breast cancer can cause cardiac damage. Accurately identifying the symptoms of heart failure allows the physician to best balance the risks and benefits of such treatments.

Machine learning techniques have proven to aid medical advancements and enhance overall patient care. This research uses multi-label medical text classification techniques to improve prediction of the medical codes of patients with multi-morbidity. In single-label classification only one target variable is predicted per instance, i.e., each instance is assigned a class label out of 2 (binary) or more (multi-class) candidates. In multi-label classification, by contrast, the goal is to predict multiple output variables for each input instance. In the above example, the patient is an instance with potential labels such as cancer, hypertension, heart failure, cholesterol and many more related and unrelated health complications. This research focuses on medical codes due to the availability of labels in the dataset. Medical codes such as the International Classification of Diseases (ICD) are used as a way of classifying diseases, symptoms, signs and causes of diseases. Almost every health condition can be assigned a unique code.

The focus of this research is to make use of free-form medical text. Free-form medical text such as discharge summaries, consultation notes and nurses' notes is generally longitudinal and is a rich source of information about a patient's well-being and medical history. However, electronic health records (EHRs) in free-form medical text present added complexity due to the nature of the content. They contain an abundance of personal health identifiers, which have to be carefully de-identified to avoid ethical or legal issues Yogarajan et al. (2020b). EHRs also contain a large number of abbreviations and acronyms, which can easily be misinterpreted. For example, “Mg” is used to refer to magnesium, “MG” to Myasthenia gravis and “mg” to milligram.

This research restricts itself to techniques that maximise the feature extraction of the medical text at the embedding layer. An embedding layer is a mapping of discrete variables to continuous vectors, where the dimensional space of the categorical variables is reduced. The embedding layer is considered a significant component of text representation Goldberg (2017). Embeddings transform words from isolated distinct symbols into mathematical representations, where the distance between vectors can be equated with the distance between words, and behaviour between words can be generalised. We focus only on multi-label machine learning techniques commonly used in health-related information extraction tasks, to better enhance the accuracy of predicting medical codes for patients with multi-morbidity.

This paper extends the work on binary classification of medical codes presented in Yogarajan et al. (2020a). More specifically, in this paper:

  • we acknowledge the multi-morbidity nature of patients, and we make use of the multi-label variations of medical text classification to enhance prediction of concurrent medical codes.

  • we present new embeddings on the health-related text and compare several variations to embeddings models when dealing with an imbalanced multi-label medical text classification problem.

  • we analyse pre-processing of free-form medical text, given the nature of the medical text, and show that pre-processing yields only minimal improvements in F-measure compared to using the text ‘as is’.

  • we present a study exploring variations to tagging words including the traditional part-of-speech (POS).

  • we provide a comparison of popular machine learning classifiers used in medical text classification.

  • we present a detailed study and discussion of results extended by varying the formations of embeddings, size of the embeddings and number of labels considered (18, 50 and 155) for the prediction of medical codes.

  • we show that variations in embeddings, especially the dimensional size, influence the F-measure of the infrequent labels.

The rest of the paper is structured as follows. Section 2 presents related work. This is followed by a brief overview of medical codes in Section 3. Details of the machine learning techniques and experimental methodology are provided in Section 4, which also presents an overview of the data used for experiments. This is followed by results, where detailed subsections are given for the 18-label case, followed by the 50- and 155-label cases. The paper concludes with discussion and suggestions for future work.

2 Related Work

Developments in machine learning, especially deep learning, have influenced advancements in many fields, including health applications. The rapid growth in computational power and the availability of EHRs are the main reasons for such changes. Rule-based systems have been the most favoured option among health professionals, with systems such as cTAKES and MetaMap considered the leading information extraction tools Savova et al. (2010); Garla et al. (2011); Liu et al. (2013); Reátegui and Ratté (2018); Yang and Gonçalves (2017). However, recently there has been a shift towards machine learning, and more specifically deep learning-based models.

Table 1 presents examples of recent developments in predicting medical codes. All systems are based on variations of deep learning models. The number of ICD-9 codes, i.e. the number of labels, varies across systems, with the best reported F1 measures in the 0.4 to 0.6 range. The number of labels and the frequency of the chosen labels influence the F1 score, with the top 50 ICD-9 codes generally leading to a higher F-measure. MIMIC III (Medical Information Mart for Intensive Care III) is the largest publicly accessible de-identified dataset and the most popular source of free-form medical text, used in many applications including predicting medical codes Purushotham et al. (2017); Johnson et al. (2017); Goldberger et al. (2000); Data (2016) (as is also evident in Table 1).

System | Methods | Data | Best Score | Details
Zeng et al. (2019) | Deep transfer learning, multi-scale CNN | MIMIC III | micro avg F1 = 0.420 | most frequent 200 labels
Du et al. (2019) | ML-Net, ELMo based | MIMIC III | best* F1 = 0.428 | 70 labels
Baumel et al. (2018) | Hierarchical attention, Bi-GRU | MIMIC III | micro avg F1 = 0.405 | 6527 labels
 | | | micro avg F1 = 0.559 | 1047 labels
Mullenbach et al. (2018) | CNN based, Word2Vec; DR-CAML | MIMIC III | micro avg F1 = 0.633 | most frequent 50 labels
 | CAML | | micro avg F1 = 0.539 | 8922 labels
 | CAML | | macro avg F1 = 0.088 | 8922 labels
Li et al. (2018) | DeepLabeler (CNN, Doc2Vec) | MIMIC III | micro avg F1 = 0.408 | 6984 labels
Rios and Kavuluru (2018) | CNN, few-shot learning, Skip-gram embeddings | MIMIC III | micro avg F1 = 0.468 | 6932 labels

Table 1: Examples of the most recent systems for predicting ICD-9 codes. Here CNN refers to convolutional neural network, LSTM to long short-term memory, Bi-GRU to bidirectional gated recurrent unit and DR-CAML to Description Regularized Convolutional Attention for Multi-label classification. *Du et al. (2019) do not specify whether the best F score is a micro or macro average.

Embeddings are the most popular method of representing text in a neural network, and all systems presented in Table 1 use embeddings from algorithms such as Word2Vec, Doc2Vec and ELMo to represent free-form medical text. Yogarajan et al. (2020a) used fastText to obtain embeddings and presented comparisons with published embeddings, for both general-text and health-related-text trained models. Embeddings trained on health-related text perform better than those trained on general text, and higher dimensions perform better when top-level ICD-9 groups are considered as individual binary problems Yogarajan et al. (2020a). Huggard et al. (2019) also show that embeddings obtained from fastText result in a significantly higher F-measure on biomedical named entity recognition when compared to other embeddings such as those of ELMo.

We restrict this research to enhancing embeddings in a multi-label prediction setting. Our findings will aid the development of better performing neural networks. All systems presented in Table 1, and other deep learning-based models, focus predominantly on the complexity of the deep learning algorithm and very little on the representation and pre-processing of the text. Although we acknowledge the need for such developments, in this paper we constrain ourselves to text representations, as this is vital to improving predictive performance for health records. This also avoids reusing the same baseline recipe for the embedding layer, where the embedding size (generally 100 dimensions) and the pre-processing steps are identical across systems Zeng et al. (2019); Mullenbach et al. (2018).

3 Medical Codes

ICD codes are widely used to describe diagnoses of patients, and are used to classify diseases, symptoms, and also causes of diseases Jensen et al. (2012). Many countries use ICD codes for billing purposes, as does the USA where insurance must cover the cost of patient care. ICD codes also provide insights on multi-morbidity of patients. We focus on predicting ICD-9 codes in this paper due to the availability of labels in the data. Generally, hospitals manually assign the correct codes to patient records based on doctors’ clinical diagnosis notes. This requires expert knowledge and is time-consuming. Hence, the use of advancements in machine learning to predict ICD codes from free-form medical text has become an important research avenue.

There are roughly 13,000 ICD-9 codes, and their definitions follow a hierarchical structure. Figure 1 presents the tree structure of ICD-9. At the top level, ICD-9 codes can be grouped into 18 main categories, which then divide into 167 sub-groups and finally into roughly 13,000 individual codes.

[Figure 1 depicts the ICD-9 hierarchy as a tree: 18 top-level groups (e.g. 001-139 (inf), 140-239 (neop), E & V) branching into 167 sub-level groups (e.g. 001-009 (inf1), ..., 137-139 (inf16)) and then into individual codes.]

Figure 1: ICD-9 hierarchy. The first split contains 18 groups (which we refer to as the top-level ICD-9 grouping). These top-level ICD-9 groups then split into 167 sub-groups, and the leaves represent the individual ICD-9 codes. All of the top-level groups split into individual ICD-9 codes. E & V refers to external causes of injury and supplemental classification; the ICD-9 codes that belong to this group start with E or V.

4 Experimental Methodology

This section presents an overview of the data used in experiments and for training embeddings. We provide an overview of the machine learning and natural language processing techniques. The details of the embeddings used in this research are also presented.

4.1 Data

This research makes use of the medical text data of more than 50,000 patients presented in the publicly available Medical Information Mart for Intensive Care (MIMIC III) database Johnson et al. (2016); Goldberger et al. (2000); Data (2016). MIMIC III contains de-identified free-form medical text, among other forms of medical data, for patients admitted to critical care units at the Beth Israel Deaconess Medical Center between 2001 and 2012. It includes 15 categories of free-form notes, including discharge summaries, nursing notes, nutrition notes and social work notes. More than 90% of the unique hospital admissions contain at least one discharge summary, with many including more than one. We make use of discharge summaries of individual hospital admissions in this research.

There are 6,984 distinct diagnosis ICD-9 codes and 2,032 distinct procedure ICD-9 codes reported in MIMIC III, among more than 50,000 patient admission records. Patient records in MIMIC III typically have more than one code assigned. However, the frequency of ICD-9 codes is extremely unevenly spread, with a large proportion of the ICD-9 codes occurring infrequently. Table 2 provides an overview of the frequency of ICD-9 codes and ICD-9 groupings. We focus on the top-level and sub-level groups in this study, along with the 50 most frequently occurring individual ICD-9 codes. Table 2 also presents frequency ranks of ICD-9 codes and sub-groups to showcase the imbalanced nature of the data. This bias is primarily because the MIMIC III data were obtained from patients admitted to critical care.

ICD-9 Top Level Grouping: 18 Groups
Group % Group % Group %
circ (390-459) 78.40 diges (520-579) 38.80 musc (710-739) 17.99
e+v (E- & V-) 69.09 bld (280-289) 33.56 pren (760-779) 17.07
endo (240-279) 66.51 symp (780-799) 31.36 neop (140-239) 16.37
resp (460-519) 46.63 ment (290-319) 29.66 skin (680-709) 12.02
inj (800-999) 41.42 nerv (320-389) 29.10 cong (740-759) 5.41
gen (580-629) 40.29 inf (001-139) 26.96 preg (630-679) 0.31
Top 50 ICD-9 codes
Frequency ICD-9 code % Frequency ICD-9 code %
rank rank
1 401.9 35.13 20 V05.3 9.81
5 414.01 21.09 30 496.0 7.52
10 967.1 15.11 40 305.1 5.70
15 599.0 11.13 50 V15.82 4.77
ICD-9 Sub-Level Grouping: 155 Groups
Frequency Sub-group % Frequency Sub-group %
rank rank
1 endo4 (270-279) 52.13 50 blood3 (288-289) 4.28
5 symp1 (780-789) 30.86 75 inj4 (820-829) 1.81
10 v6 (V40-V49) 25.76 100 cong7 (753-753) 0.56
25 diges6 (560-569) 11.47 155 v12 (V86-V86) 0.02
Table 2: Percentage of occurrence of ICD-9 codes and ICD-9 groupings (top-level and sub-level) in MIMIC III discharge summaries of unique hospital admissions. For the top level, all 18 groups are presented. For the sub-level groups and the top 50 ICD-9 codes, only selected frequencies are presented, with the corresponding rank of these frequencies (ordered highest to lowest). The total number of hospital admissions with a recorded discharge summary in MIMIC III is 52,710. Only ICD-9 codes or groups that occurred in at least 10 unique hospital admissions are included.

4.2 Training Embeddings

Representing words as embeddings is a common mechanism in language processing Goldberg (2017). Embeddings obtained from algorithms such as Word2Vec, GloVe and fastText are used in text classification tasks, including medical applications. This research makes use of fastText Bojanowski et al. (2016); Joulin et al. (2016b, a), where words are represented as a bag of character n-grams, and word embeddings are obtained by summing these representations. This gives fastText the ability to produce vectors for words that are misspelt or concatenated. The nature of free-form medical text benefits from this feature of fastText embeddings, and there are examples of medical applications where embeddings from fastText are shown to outperform other similar algorithms Huggard et al. (2019).

Medical codes for patients admitted to hospital are assigned at the level of individual admissions or patient documents rather than to single words in a health record. Hence, we predict medical codes for entire documents, in this case discharge summaries of unique hospital admissions. For this research, document embeddings are obtained by computing the vector sum of the embeddings for each word in the document. This vector sum is then normalised to have length one, ensuring documents of different lengths have representations of similar magnitudes.
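A minimal sketch of this document-embedding step, with a made-up three-dimensional embedding table standing in for a trained fastText model:

```python
import numpy as np

# Tiny illustrative "embedding table"; a real fastText model would
# provide 50- to 600-dimensional vectors for a large vocabulary.
emb = {
    "patient":   np.array([0.2, 0.1, 0.4]),
    "admitted":  np.array([0.0, 0.3, 0.1]),
    "discharge": np.array([0.5, 0.0, 0.2]),
}

def doc_embedding(tokens, table, dim=3):
    """Sum the word vectors of a document, then L2-normalise the sum so
    documents of different lengths have comparable magnitudes."""
    vec = np.zeros(dim)
    for tok in tokens:
        vec += table.get(tok, np.zeros(dim))  # unknown words contribute nothing here
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

v = doc_embedding(["patient", "admitted"], emb)
print(round(float(np.linalg.norm(v)), 6))  # -> 1.0 (unit length)
```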

For comparison, our embeddings are trained to the exact same specifications as the fastText embeddings W300 presented in Grave et al. (2018). Table 3 presents details of the embeddings used in this research, including dimensional size, source data, model size and training time. (Processing was run on a 4-core Intel i7-6700K CPU @ 4.00GHz with 64GB of RAM.) The word embeddings are trained using CBOW (T300, T600) and Skip-gram (T300SG, T600SG), with character n-grams of length 5, a window of size 5 and ten negative samples per positive sample. The learning rate used for training these models is 0.05.
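As a rough sketch, training embeddings to this configuration with the fastText command-line tool would look like the following; the corpus and output file names are placeholders, not the actual files used in this work:

```shell
# CBOW, 300 dimensions, matching the T300 configuration described above:
# character n-grams of length 5, window 5, 10 negative samples, lr 0.05.
./fasttext cbow -input trec_corpus.txt -output T300 \
    -dim 300 -ws 5 -neg 10 -minn 5 -maxn 5 -lr 0.05

# Skip-gram variant (T300SG): same parameters, different architecture.
./fasttext skipgram -input trec_corpus.txt -output T300SG \
    -dim 300 -ws 5 -neg 10 -minn 5 -maxn 5 -lr 0.05
```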

We make use of the data provided by the TREC 2017 competition Roberts et al. (2017) to train our embeddings. TREC 2017 provides an extensive 24GB of health-related data, containing 26.8 million published abstracts of medical literature listed on PubMed Central, 241,006 clinical trials documents, and 70,025 abstracts from recent proceedings focused on cancer therapy from the American Association for Cancer Research and the American Society of Clinical Oncology Roberts et al. (2017).

Models Dimensions Source Data Train Time Model Size
W300 Grave et al. (2018) 300 Wiki - 7G
T300 Yogarajan et al. (2020a) 300 TREC 7 hours 13G
T300SG 300 TREC 28 hours 13G
T600 Yogarajan et al. (2020a) 600 TREC 13 hours 23G
T600SG 600 TREC 51 hours 23G
Table 3: Word embeddings used in this research are presented, with dimension details, training times and embeddings model sizes.

4.3 Multi-label Classifiers

Generally, a given medical record is annotated with multiple tags for different diagnoses, procedures or treatments. That is, from a machine learning perspective, health text coding is a multi-label classification problem, where one text may belong to more than one label. For example, many ICD codes exist for matters that relate to hypertension or diabetes, and such illnesses often co-occur in individual patients, but they also occur independently. Thus, it would be useful to be able to classify a particular health text to one or the other, or both, or neither of these categories. Moreover, with approximately 13,000 ICD-9 categories for diagnoses and treatments that can combine almost arbitrarily for individual patients, the problem of multi-label classification for health records can be extraordinarily large.

In this section, we provide an overview of multi-label classifiers used for the experiments in this paper. For more details on these methods, see Read et al. (2016).

4.3.1 Binary relevance (BR)

The first and simplest multi-label classification algorithm used here is binary relevance (BR) Godbole and Sarawagi (2004); Tsoumakas and Katakis (2007). A separate binary classification model is created for each label, such that any text with that label is a positive instance, and negative otherwise (i.e. one versus all). To predict the labels for a new text, each classifier decides whether the text is in or out of the class it has been trained to recognise, and the overall output for the new text is the set of all positive labels. Note that binary relevance ignores any potential relationships between labels.
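The mechanics can be sketched in a few lines of Python. The trivial majority-vote base learner below is purely illustrative; the experiments in this paper use logistic regression (via MEKA) as the base classifier:

```python
class MajorityClassifier:
    """Stand-in base learner: predicts the label's most frequent value."""
    def fit(self, X, y):
        self.pred = 1 if sum(y) * 2 >= len(y) else 0
        return self
    def predict(self, X):
        return [self.pred] * len(X)

class BinaryRelevance:
    """One independent binary model per label (one-vs-all)."""
    def __init__(self, base=MajorityClassifier):
        self.base = base
    def fit(self, X, Y):
        n_labels = len(Y[0])
        self.models = [
            self.base().fit(X, [row[j] for row in Y]) for j in range(n_labels)
        ]
        return self
    def predict(self, X):
        per_label = [m.predict(X) for m in self.models]  # one column per label
        return [list(labels) for labels in zip(*per_label)]

X = [[0.1], [0.9], [0.5]]
Y = [[1, 0], [1, 1], [1, 0]]  # three instances, two labels
print(BinaryRelevance().fit(X, Y).predict([[0.7]]))  # -> [[1, 0]]
```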

4.3.2 Classifier Chains (CC)

BR models make their predictions independently. However, as seen with the earlier example of the strong correlation between diabetes and hypertension, a model could possibly benefit from the result of another when making its own decision. Accordingly, BR models can be ‘chained’ together into a sequence such that the predictions made by earlier classifiers are made available as additional features for the next classifier. Such a configuration is unsurprisingly called a classifier chain (CC) Read et al. (2009, 2011).
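The chaining mechanism can be sketched minimally as follows; the toy majority-vote base learner is again illustrative only, and real chains (as used in this paper) would chain logistic regression models:

```python
class MajorityLearner:
    """Toy base learner used only to show the chaining plumbing."""
    def fit(self, X, y):
        self.pred = 1 if sum(y) * 2 >= len(y) else 0
        return self
    def predict_one(self, x):
        return self.pred

class ClassifierChain:
    """Label j's model sees the original features plus labels 0..j-1."""
    def __init__(self, base=MajorityLearner):
        self.base = base
    def fit(self, X, Y):
        self.models = []
        for j in range(len(Y[0])):
            # During training, extend each feature vector with the *true*
            # earlier labels; at prediction time, predictions are used.
            Xj = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            self.models.append(self.base().fit(Xj, [y[j] for y in Y]))
        return self
    def predict_one(self, x):
        preds = []
        for m in self.models:
            preds.append(m.predict_one(list(x) + preds))
        return preds

X = [[0.1], [0.9]]
Y = [[1, 0], [1, 1]]
print(ClassifierChain().fit(X, Y).predict_one([0.5]))  # -> [1, 1]
```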

4.3.3 Ensemble of classifier chains (ECC)

The order of predictions in a classifier chain affects what advice later models have available from preceding ones when it’s their turn to make a judgment. This is a problem for multi-label classification in general, but particularly so for health records where dependencies between ICD codes are myriad, complex, and sometimes quite strong. One way to mitigate the problem of choosing a poor ordering is to create a collection of classifier chains that are each ordered randomly, then make final predictions by polling the results of all chains. Such a collection is called an ensemble of classifier chains (ECC) Read et al. (2009).

4.3.4 Multi-label k-nearest neighbor classifier (MLkNN)

MLkNN Zhang and Zhou (2005) is a multi-label variant of the standard k-nearest neighbour (kNN) algorithm that predicts the set of the most common labels among the k nearest neighbours. To guard against any anomalies inside a neighbourhood, a Bayesian calibration step refines the raw predictions. An important characteristic of this approach is its excellent scalability with respect to the number of labels: the set of nearest neighbours needs to be calculated only once for a given query text.
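The core neighbourhood vote, without MLkNN's Bayesian calibration step, can be sketched as follows (a simplified illustration, not the full MLkNN algorithm):

```python
def knn_multilabel(X_train, Y_train, x, k=3):
    """Predict each label by majority vote among the k nearest neighbours
    (squared Euclidean distance). MLkNN refines these raw votes with a
    Bayesian calibration step, omitted here for brevity."""
    order = sorted(range(len(X_train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)))
    nearest = order[:k]
    n_labels = len(Y_train[0])
    return [1 if sum(Y_train[i][j] for i in nearest) * 2 > k else 0
            for j in range(n_labels)]

X = [[0.0], [0.1], [1.0], [1.1]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(knn_multilabel(X, Y, [0.05], k=3))  # -> [1, 0]
```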

4.3.5 Neural Networks

As emphasised earlier, the focus of this research is only on the embedding layer, and we use the most commonly used multi-label classifiers for prediction. However, the outcome of this research can be incorporated into a neural network, where embedding layers are generally used to represent text Goldberg (2017). Furthermore, recent NLP techniques such as BERT (Bidirectional Encoder Representations from Transformers) and BioBERT have shown significant improvements on other biomedical tasks Lee et al. (2019); Abacha et al. (2019). These are all worthy avenues for future research.

4.3.6 MEKA and base classifiers

All of the classification results presented in this research were obtained using MEKA Read et al. (2016): an open-source Java system specifically designed to support multi-label classification experiments. MEKA includes almost all widely-used algorithms and evaluation metrics. The default algorithm for each class (i.e. the base classifier) within MEKA is logistic regression, and this is used for the majority of our experiments; however, stochastic gradient descent (SGD) is used for tests with ECC, with ensembles of 50, 100 and 500 randomly ordered classifier chains.

4.4 Statistical assessment of differences

We perform non-parametric tests to verify whether there are statistically significant differences between algorithms, as described in Demšar (2006); Garcia and Herrera (2009). First we use Davenport's corrected Friedman test to check whether we can safely reject the null hypothesis that all algorithms perform the same. If there are differences, we proceed with the post-hoc Nemenyi test to determine the critical difference (CD) that serves to identify algorithms with different performance. We include the critical difference plots in our results.
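As a sketch, the Friedman statistic and its Davenport correction over a table of per-dataset scores can be computed directly from average ranks, following the formulas in Demšar (2006):

```python
def friedman_iman_davenport(scores):
    """scores[i][j] = score of algorithm j on dataset i (higher is better).
    Returns (chi2_F, F_F): the Friedman statistic and its corrected
    F-distributed version, per Demsar (2006)."""
    N, k = len(scores), len(scores[0])
    R = [0.0] * k  # average rank per algorithm (rank 1 = best, ties averaged)
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1  # extend tie group
            avg = (i + j) / 2 + 1
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for j in range(k):
            R[j] += ranks[j] / N
    chi2 = 12 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)
    F = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, F

# Three algorithms over four datasets; algorithm 0 wins on three of them.
scores = [[3, 2, 1], [3, 2, 1], [3, 2, 1], [1, 2, 3]]
chi2, F = friedman_iman_davenport(scores)
print(chi2, F)  # -> 2.0 1.0
```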

4.5 FastText Parameters

As mentioned in Section 4.2, the models used in this research are trained to the exact same specifications as the published general-text models. We also present a comparison of variations of specifications for training embeddings for a multi-label medical text classification problem. The combinations of parameter choices are presented in Table 4. The learning rate used to train all of the variations presented is 0.05. For simplicity, all dimensions were set to 50. The two word-representation models are Skip-gram and CBOW. It is important to note that these combinations result in 18 different embedding models, and only a selected subset of the experimental results is presented in this paper.

Option Dimensions Window neg Character n-gram Loss Function Epoch
Size minn maxn
I 50 5 10 5 5 softmax 5
II 50 3 10 5 5 softmax 5
III 50 7 10 5 5 softmax 5
IV 50 5 5 5 5 softmax 5
V* 50 5 10 0 0 softmax 5
VI 50 5 10 3 3 softmax 5
VII 50 5 10 5 5 hierarchical softmax 5
VIII 50 5 10 5 5 negative sampling 5
IX 50 5 10 5 5 softmax 10
Table 4: Variations of parameter choices for embeddings trained using fastText. These options are used for both CBOW and Skip-gram. Option “I” contains the exact same parameter choices as the published model W300. “neg” refers to the number of negative samples per positive sample. [minn, maxn] refer to the minimum and maximum character n-gram lengths respectively. *Option V sets maxn = 0, which means no sub-words are used by fastText; hence, the model should give results similar to those of word2vec.

The Continuous Bag-of-Words model (CBOW) Mikolov et al. (2013) is similar to a feed-forward neural network language model with the non-linear hidden layer removed and the projection layer shared for all words. CBOW predicts the current word from the surrounding words. The Skip-gram architecture Mikolov et al. (2013) is similar to that of CBOW, but Skip-gram uses the current word to predict the words before and after it within a given range. For example, for the sentence “Male patient is admitted to the hospital”, CBOW predicts the word “admitted” using the source context words (“Male”, “patient”, “is”, “to”, “the”, “hospital”), whereas Skip-gram predicts context words like “patient” or “hospital” for the source word “admitted”.

4.6 Pre-processing Text Data

We present a comparison of F-measures between pre-processed discharge summary and text ‘as is’ as presented by MIMIC III data. MIMIC III data has been de-identified and pre-processed before being released for research access. Also, most models developed using MIMIC pre-process the text and truncate the maximum number of words Mullenbach et al. (2018). Text pre-processing includes removal of tokens without alphabetic characters, down-casing all tokens, removal of punctuation and truncating the number of tokens in a given discharge summary.

On the other hand, experiments presented in this research use MIMIC III discharge summaries ‘as is’ with minimum pre-processing. This allows us to maximise the use of features in medical free-form text as embeddings are case sensitive. It also avoids the meanings of abbreviations and acronyms used in a medical context being altered. An example of text ‘as is’ followed by an example of pre-processed text is presented below:  

Medicine HISTORY OF PRESENT ILLNESS: This is an 81-year-old female with a history of emphysema, presents with 3 days of shortness of breath thought by her primary care. Medications on Admission: Omeprazole 20 mg daily, Furosemide 10mg daily. Tablet Sustained Release 24 hr PO once a day.


medicine history of present illness this is an 81yearold female with a history of emphysema presents with days of shortness of breath thought by her primary care medications on admission omeprazole mg daily furosemide 10mg daily tablet sustained release hr po once a day
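The pre-processing steps just described can be sketched as follows; the token cap of 2500 is an arbitrary illustration, as prior work uses various truncation limits:

```python
import re

def preprocess(text, max_tokens=2500):
    """Down-case, strip punctuation, drop tokens without alphabetic
    characters, and truncate to max_tokens (an illustrative cap)."""
    out = []
    for tok in re.findall(r"\S+", text):
        tok = re.sub(r"[^\w]", "", tok).lower()   # strip punctuation
        if tok and any(c.isalpha() for c in tok): # drop purely numeric tokens
            out.append(tok)
    return out[:max_tokens]

print(preprocess("Omeprazole 20 mg daily, Furosemide 10mg daily."))
# -> ['omeprazole', 'mg', 'daily', 'furosemide', '10mg', 'daily']
```

Note how "20 mg" loses its numeral while "10mg" survives, reproducing the behaviour visible in the pre-processed example above.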


4.7 Concatenating Embeddings

We explore the option of splitting the free-form medical data into sections and concatenating the embeddings. The discharge summary is split into seven logical sections: Admission Date, Past Medical History, Pertinent Results, Brief Hospital Course, Medications on Admission, Discharge Diagnosis and Followup Instructions. Embeddings for each section can be obtained and concatenated. For example, if a 50-dimensional embedding model is used, the resulting concatenated embedding has 350 dimensions. If the discharge summary does not include one of the sub-sections mentioned above, then the respective embeddings are all zeros. For hospital admissions with more than one available discharge summary, all the summaries are first embedded independently and then averaged into one final embedding.

Another variation considers concatenating statistical outcomes of the embeddings from each of the sections of a given hospital admission. For these experiments we look at the minimum, maximum, mean, standard deviation, lower quartile and upper quartile of the embeddings, and hence the resulting embedding will have six times as many dimensions as the original one.
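The statistical pooling just described can be sketched as follows: six summary statistics over the per-section embeddings, giving a vector six times the base dimension:

```python
import numpy as np

def pool_sections(section_embs):
    """Concatenate min, max, mean, standard deviation, lower quartile and
    upper quartile of the per-section embeddings (axis 0 = sections)."""
    E = np.asarray(section_embs)  # shape: (n_sections, dim)
    stats = [E.min(axis=0), E.max(axis=0), E.mean(axis=0), E.std(axis=0),
             np.percentile(E, 25, axis=0), np.percentile(E, 75, axis=0)]
    return np.concatenate(stats)

sections = np.random.rand(7, 50)       # 7 sections, 50-d embeddings
print(pool_sections(sections).shape)   # -> (300,)
```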

4.8 Tagging Words

We explore two variations of tagging words in medical free-form text. Part-of-speech (POS) tagging is a technique whereby the syntactic categories of words in a given sentence are identified automatically. Common examples of POS tags are: noun, verb, adjective, adverb, pronoun, preposition, conjunction and interjection. We make use of the Natural Language Toolkit (NLTK) Bird et al. (2009) POS tagger, where if the input text is:

History of Present Illness 54 year old female with recent diagnosis of ulcerative colitis on mercaptopurine

output is:

History_NN of_IN Present_NNP Illness_NNP 54_CD year_NN old_JJ female_NN with_IN recent_JJ diagnosis_NN of_IN ulcerative_JJ colitis_NN on_IN mercaptopurine_JJ

where NN indicates a noun, IN a preposition or conjunction, NNP a proper noun, CD a numeral and JJ an adjective or numeral.

Also, we tag the words of MIMIC III discharge summaries using the text splits presented in Section 4.7. Tokens in each of these sections are tagged with 0_, 1_, 2_, 3_, 4_, 5_, 6_ for text in the seven splits Admission Date, Past Medical History, Pertinent Results, Brief Hospital Course, Medications on Admission, Discharge Diagnosis and Followup Instructions respectively.
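This section tagging reduces to prefixing every token with the index of the section it came from, which can be sketched as:

```python
# The seven discharge-summary sections described above, in tag order 0..6.
SECTIONS = ["Admission Date", "Past Medical History", "Pertinent Results",
            "Brief Hospital Course", "Medications on Admission",
            "Discharge Diagnosis", "Followup Instructions"]

def tag_section(tokens, section_idx):
    """Prefix every token with its section index (e.g. '4_' for tokens in
    the Medications on Admission section)."""
    return [f"{section_idx}_{tok}" for tok in tokens]

print(tag_section(["Omeprazole", "20", "mg"], 4))
# -> ['4_Omeprazole', '4_20', '4_mg']
```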

5 Results

This section presents results for the top-level ICD-9 grouping, the sub-level grouping, and the top 50 highest-frequency ICD-9 codes, where the number of labels is 18, 155 and 50, respectively. The top-level ICD-9 groupings are primarily used to present comparisons and detailed results for the multi-label medical text classification techniques mentioned in Section 4. Results are primarily presented at the level of individual labels, to enable a better understanding of the imbalanced nature of the data and to observe improvements in F-measure. We also present micro-averaged and macro-averaged F1 scores to facilitate comparisons of the embedding variations at the level of the overall system.

5.1 Top-Level Groups of Medical Codes

This section presents results for the 18 top-level ICD-9 groups, as mentioned in Table 2. We also present comparisons with the 18 groups treated as individual binary problems, as presented in Yogarajan et al. (2020a), where appropriate. The results in this section follow the experimental methodology described above.

5.1.1 Comparing Multi-label Classifiers

groups Best case Best case
F1 E = I = F1 I =
circ 0.932 0.921 0.932 0.933 100 10 0.932 30, 100
e+v 0.829 0.823 0.812 0.831 100 30 0.830 30, 100
endo 0.848 0.839 0.848 0.851 50 100 0.848 10, 30, 100
resp 0.777 0.703 0.770 0.784 500 10 0.782 30
inj 0.662 0.590 0.653 0.683 500 10 0.686 30
gen 0.731 0.657 0.731 0.735 500 30 0.739 10, 30, 100
diges 0.696 0.600 0.694 0.706 50 30 0.713 10
bld 0.571 0.494 0.571 0.577 50 100 0.612 10
symp 0.487 0.361 0.463 0.489 50 10 0.552 10
ment 0.542 0.299 0.538 0.562 500 30 0.590 10
nerv 0.543 0.376 0.522 0.530 100 10 0.589 10
inf 0.647 0.547 0.651 0.667 500 30 0.683 10
musc 0.298 0.086 0.302 0.272 500 30 0.410 10
pren 0.594 0.575 0.592 0.588 500 10 0.590 10
neop 0.703 0.500 0.705 0.709 500 10 0.718 10
skin 0.347 0.075 0.349 0.328 500 10 0.413 10
cong 0.384 0.294 0.383 0.361 500 100 0.449 30
preg 0.592 0.267 0.572 0.542 500 100 0.514 100
Table 5: Comparison of F-measures for the 18 top-level ICD-9 groups across varying multi-label classifiers. I indicates the number of iterations and E the number of epochs. T300 embeddings are used. Bold is used to indicate F-measures better than that of BR-LR, and underline to indicate the best F-measure across all presented.

Table 5 presents a comparison between several multi-label classifiers used to predict the 18 top-level ICD-9 groups. Critical difference plots are available in Figure 2. Performance when considering the 18 groups as individual binary classification problems is also presented. All experiments use the T300 word embeddings and 10-fold cross-validation. As anticipated, using multi-label variations does provide an advantage over the individual binary classification case. Evidently, for most ICD-9 groups ECC with logistic regression (LR) performs best. Optimising the number of iterations and epochs can improve F-measure results. ECC-LR with a ridge value of 1 and a tuned number of iterations achieves the best results overall. Experiments across a range of different ridge values produced almost identical F-measures, hence only a ridge value of 1 is included in Table 5.
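An ensemble of classifier chains with a logistic regression base learner can be approximated in scikit-learn as below. The paper's experiments use the MEKA implementations, so this is an illustrative stand-in only, and `fit_ecc`/`predict_ecc` are hypothetical helper names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

def fit_ecc(X, Y, n_chains=10, seed=0):
    """Fit an ensemble of classifier chains, each with a random label order."""
    rng = np.random.RandomState(seed)
    chains = [
        ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=rng)
        for _ in range(n_chains)
    ]
    for chain in chains:
        chain.fit(X, Y)
    return chains

def predict_ecc(chains, X, threshold=0.5):
    """Average per-label probabilities over the chains, then threshold."""
    probs = np.mean([c.predict_proba(X) for c in chains], axis=0)
    return (probs >= threshold).astype(int)
```

Each chain conditions later labels on earlier predictions, and averaging over random orderings is what distinguishes ECC from a single classifier chain.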

Figure 2: Critical difference plots. Nemenyi post-hoc test (95% confidence level), identifying statistical differences between methods in our tests.

5.1.2 FastText Parameter Choices

ICD-9 CBOW Skip-gram
circ 0.924 0.925 0.923 0.926 0.925 0.925 0.925 0.923 0.926 0.925
e+v 0.823 0.824 0.822 0.823 0.822 0.823 0.823 0.821 0.823 0.822
endo 0.839 0.840 0.838 0.840 0.840 0.841 0.839 0.838 0.841 0.841
resp 0.723 0.734 0.729 0.737 0.738 0.735 0.736 0.721 0.732 0.736
inj 0.607 0.614 0.612 0.621 0.616 0.612 0.614 0.608 0.618 0.619
gen 0.659 0.665 0.661 0.671 0.669 0.667 0.664 0.653 0.666 0.670
diges 0.653 0.655 0.648 0.655 0.655 0.651 0.655 0.638 0.657 0.658
bld 0.489 0.509 0.500 0.507 0.512 0.513 0.504 0.487 0.506 0.511
symp 0.413 0.421 0.416 0.414 0.422 0.414 0.403 0.396 0.422 0.421
ment 0.411 0.427 0.426 0.433 0.442 0.430 0.447 0.445 0.434 0.440
nerv 0.442 0.456 0.449 0.454 0.456 0.448 0.446 0.425 0.457 0.459
inf 0.587 0.599 0.596 0.597 0.599 0.590 0.598 0.591 0.598 0.601
musc 0.139 0.138 0.120 0.144 0.147 0.143 0.156 0.121 0.142 0.145
pren 0.578 0.579 0.580 0.578 0.578 0.578 0.578 0.578 0.579 0.578
neop 0.610 0.627 0.610 0.623 0.618 0.624 0.623 0.632 0.625 0.621
skin 0.163 0.189 0.156 0.181 0.174 0.169 0.170 0.150 0.180 0.178
cong 0.152 0.184 0.182 0.196 0.201 0.181 0.207 0.170 0.202 0.191
preg 0.234 0.318 0.241 0.312 0.323 0.286 0.306 0.202 0.319 0.327
Table 6: Comparison of F-measures for top-level ICD-9 groups for embeddings obtained by varying fastText parameters, as presented in Table 4. Options I to IX match those of Section 4.5. Embeddings are trained on TREC data and the classifier used for experiments is BR with logistic regression. Best F1 scores for CBOW models are presented on the left. Bold is used to indicate F-measures better than the best CBOW F1 score, and underline to indicate the best F1 across the options presented.
Figure 3: Critical difference plots. Nemenyi post-hoc test (95% confidence level), identifying statistical differences between learning methods. CBOW values correspond to the best performance for this embedding.

Table 6 presents a comparison of F-measures for the fastText parameter choices I-IX for both CBOW and Skip-gram embeddings, as outlined in Section 4.5. Critical difference plots are available in Fig. 3. Results correspond to 18 embeddings × 13 classifiers × 10-fold cross-validation, for a total of 2340 tests. The best F-measure among the CBOW models using the options presented in Table 4 is also included. Evidently, for all 18 groups, Skip-gram outperforms CBOW. Option I has the same specifications as the W300 embeddings, and the embeddings presented in Table 3 are trained as per option I for comparison. However, it is evident from Table 6 that varying the fastText parameters affects F-measures across all 18 ICD-9 groups, but not necessarily for the better. Thus, care must be taken when selecting these parameters.

5.1.3 Pre-processing

Figure 4: A comparison of F-measures for top-level ICD-9 groups between the MIMIC III discharge summary used ‘as is’ and pre-processed versions. Maximum length indicates pre-processed text truncated to a maximum token length. The classifier used for experiments is ECC with a logistic regression base classifier and a ridge value of one. The embeddings used are T300. Pre-processed text is presented with solid lines and dashes represent text ‘as is’.

Figure 4 presents a comparison of text pre-processed and truncated with the text ‘as is’ within MIMIC III. As mentioned earlier, MIMIC III pre-processes and de-identifies all free-form text released to the public. Here we further process the discharge summary and truncate it to a maximum number of tokens. Generally, the option of pre-processing the discharge summary and truncating it to a maximum of 1000 tokens performs much worse than the other options. However, when comparing the text ‘as is’ to the other two pre-processed options, there is very little or no difference in the F-measures. Even for very infrequent categories the differences in F-measure are marginal. Hence, whether the benefits of additional pre-processing of medical text outweigh its trade-offs (known and unknown) remains unclear. It is important to point out that, apart from the results presented in this section, all other results in this paper are obtained using discharge summaries without any pre-processing or truncation beyond that already done by MIMIC III.
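The additional pre-processing compared here can be sketched as follows; the exact cleaning steps are not specified in detail, so the lower-casing and character filtering below are assumptions, and `preprocess` is a hypothetical helper name:

```python
import re

def preprocess(text, max_tokens=None):
    """Lower-case, strip non-alphanumeric characters, optionally truncate.

    An illustrative stand-in for the extra pre-processing compared in
    Figure 4; max_tokens of e.g. 1000 or 2500 gives the truncated variants.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if max_tokens is not None:
        tokens = tokens[:max_tokens]
    return " ".join(tokens)
```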

5.1.4 Comparison between Embeddings

Figure 5: A comparison of F-measures for top level ICD-9 groups presented for embeddings trained using CBOW vs Skip-gram models with dimensions 300 and 600. All experiments used ECC with logistic regression as the base classifier, using a ridge value of one.
Figure 6: A comparison of F-measures for top level ICD-9 groups is presented. F-measures for two models T50 and T300 are indicated with dashes. Two variations of concatenations CONCAT300 and CONCAT350 are presented with solid lines. All experiments used ECC with logistic regression as the base classifier, using a ridge value of one.

Figure 5 presents a comparison of top-level ICD-9 groups between CBOW and Skip-gram models, and between 300 and 600 dimensions. As observed in the binary case, presented in Yogarajan et al. (2020a), an increase in dimensionality does provide an improvement in F-measure. This is more evident for lower-frequency groups such as skin, cong and preg. Skip-gram is consistently better than CBOW, as observed in Section 5.1.2. We also compared against the W300 embeddings; the multi-label setting shows observations similar to the binary case, where health-related pre-trained embeddings provide an advantage over general-text pre-trained embeddings across all 18 groups.

Figure 6 presents a comparison of embeddings formed by concatenation as per Section 4.7. The base embeddings used here are T50. CONCAT300 is formed by concatenating the embeddings of the statistical outcomes, i.e. CONCAT300 = 6 × 50 dimensions (min, max, mean, sd, q1 and q3). CONCAT350 is formed by concatenating the embeddings of the seven text splits, i.e. 7 × 50 dimensions. Both CONCAT300 and CONCAT350 improve F-measures relative to the base embeddings T50, except for the ICD-9 group pren. CONCAT350 generally performs better than CONCAT300. However, the T300 embeddings outperform both CONCAT300 and CONCAT350 across all 18 groups. Also, the improvements that CONCAT300 and CONCAT350 produce over T50 are not replicated for larger embeddings. For instance, when starting with T300 and generating CONCAT1800 or CONCAT2100, no significant improvements are observed. This may be due to the fact that T300 already performs much better than T50, possibly not leaving much room for further improvement. Future research is needed to investigate this behaviour in more detail.

5.1.5 Tagging Words

Figure 7: A comparison of F-measures for top level ICD-9 groups between discharge summaries with and without a POS tagger, and with text split tags. All experiments used ECC with logistic regression as the base classifier, using a ridge value of one.

Figure 7 presents a comparison of F-measures for top-level ICD-9 groups among the MIMIC III discharge summaries with POS tags, with text split tags, and for the raw text without any tagging. Evidently, except for categories bld, symp and preg, using a POS tagger does not improve the F-measures. For categories bld, symp and preg the use of the POS tagger improves the F-measure from 0.612 to 0.616, from 0.552 to 0.555, and from 0.470 to 0.491, respectively. When using the text split tagger, the F1 score for circ is equivalent to the no-tagger case, and for category endo there is a small improvement from 0.848 to 0.850.

5.1.6 Summary: Top level ICD-9 Groups

ICD-9 groups | ML classifier | Concatenating embeddings | Text tagger | Pre-processing text
circ | ECC-SGD, E=100, I=10 | T50, CONCAT300, CONCAT350, T300 | no tagger = TextSplitTag | truncated to 2500
e+v | ECC-SGD, E=100, I=30 | T50, CONCAT350, CONCAT300, T300 | no tagger | truncated to 2500, no truncate
endo | ECC-SGD, E=50, I=100 | T50, CONCAT300, CONCAT350, T300 | TextSplitTag | no truncate
resp | ECC-SGD, E=500, I=10 | T50, CONCAT300, CONCAT350, T300 | no tagger | no truncate
inj | ECC-LR, R=1, I=30 | T50, CONCAT300, CONCAT350, T300 | no tagger | truncated to 2500
gen | ECC-LR, R=1, I=10, 30, 100 | T50, CONCAT350, CONCAT300, T300 | no tagger | text ‘as is’
diges | ECC-LR, R=1, I=10 | T50, CONCAT350, CONCAT300, T300 | no tagger | text ‘as is’
bld | ECC-LR, R=1, I=10 | T50, CONCAT300, CONCAT350, T300 | POS | no truncate
symp | ECC-LR, R=1, I=10 | T50, CONCAT300, CONCAT350, T300 | POS | no truncate
ment | ECC-LR, R=1, I=10 | T50, CONCAT300, CONCAT350, T300 | no tagger | no truncate
nerv | ECC-LR, R=1, I=10 | T50, CONCAT300, CONCAT350, T300 | no tagger | no truncate
inf | ECC-LR, R=1, I=10 | T50, CONCAT350 = CONCAT300, T300 | no tagger | no truncate
musc | ECC-LR, R=1, I=10 | T50, CONCAT300, CONCAT350, T300 | no tagger | no truncate
pren | BR-LR, R=1 | CONCAT350, CONCAT300, T50, T300 | no tagger | truncated to 2500
neop | ECC-LR, R=1, I=10 | T50, CONCAT300, CONCAT350, T300 | no tagger | no truncate
skin | ECC-LR, R=1, I=10 | T50, CONCAT300, CONCAT350, T300 | no tagger | truncated to 2500
cong | ECC-LR, R=1, I=30 | T50, CONCAT350, CONCAT300, T300 | no tagger | no truncate
preg | BR-LR, R=1 | T50, CONCAT300, CONCAT350, T300 | POS | text ‘as is’
Table 7: A summary of the choices which produce the best F-measure for the top-level ICD-9 groups. R indicates the ridge value, I the number of iterations and E the number of epochs; embeddings are listed from lowest to highest F-measure, with the best option last.

The 18 top-level ICD-9 groups are used to present experimental results for several variations of techniques as outlined in Section 4. Skip-gram models improve the F-measures of all 18 groups compared to CBOW. We also presented comparisons of several parameter variations for pre-trained embeddings for both CBOW and Skip-gram. Such modifications change F-measures across all 18 labels, but not always for the better (see Section 5.1.2 for details). The T600SG embeddings are the best-performing choice for all but one of the groups; only for category gen does the CBOW-based T600 outperform T600SG. As observed in the binary case, embeddings trained using health-related data do provide an advantage over general-text pre-trained embeddings. Also, higher dimensions improve F-measures, which is especially evident for low-frequency categories. A summary of the choices which provide the best F-measure for each ICD-9 top-level group is presented in Table 7.

Figure 8: A comparison of F-measures for the top 50 most frequently occurring ICD-9 codes ordered from the highest frequency ICD-9 code 401.9 down to the 50th V15.82, is presented. Comparisons are between the W300 and T300 embeddings as well as the T300 and T600 embeddings. All experiments used ECC with logistic regression as the base classifier, using a ridge value of one.
Figure 9: A comparison of F-measures for most frequently occurring top 50 ICD-9 codes ordered from the highest frequent ICD-9 code 401.9 to the 50th V15.82 is presented. Comparisons are between CBOW trained embeddings T300 and T600 (solid lines) with Skip-gram trained embeddings T300SG and T600SG (dashes). All experiments used ECC with logistic regression as the base classifier, using a ridge value of one.
Figure 10: A comparison of F-measures between W300 and T300, and between 300 and 600 dimensions is presented for ICD-9 sub-level groups occurring in more than 5% of the cases in MIMIC III. All experiments used ECC with logistic regression as the base classifier, using a ridge value of one.
Figure 11: A comparison of F-measures between W300 and T300, and between 300 and 600 dimensions is presented for ICD-9 sub-level groups occurring in between 1% and 5% of the cases in MIMIC III. All experiments used ECC with logistic regression as the base classifier, using a ridge value of one.
Figure 12: A comparison of F-measures between W300 and T300, and between 300 and 600 dimensions is presented for ICD-9 sub-level groups occurring in less than 1% of the cases in MIMIC III. All experiments used ECC with logistic regression as the base classifier, using a ridge value of one.

5.2 Highest Frequency Medical Codes

This section presents the results for the top 50 most frequent individual ICD-9 codes. Mullenbach et al. (2018), arguably the most prominent work on predicting ICD-9 codes from MIMIC III, considers the top 50 most frequent codes from both diagnosis and procedure ICD-9 codes. Hence, we also present results for the same top 50 ICD-9 codes, where the most frequent code, 401.9, occurs in 35.13% of all cases, whereas the least frequent code, V15.82, is present in only about 5% of all cases (see Table 2 for more details).

Figure 8 presents a comparison of F-measures between the general-text pre-trained embeddings W300 and the health-related pre-trained T300 and T600 embeddings for the 50 most frequently occurring ICD-9 codes. As observed with the 18 top-level ICD-9 groups, F-measures for health-related trained embeddings are better than those for general-text trained embeddings. Also, an increase in dimensionality from 300 to 600 results in a considerable improvement in F-measure, far more than what was observed in the 18-label case. For example, for ICD-9 code 530.81 the F-measure improves from 0.194 to 0.329, and for 410.71 from 0.332 to 0.404.

Figure 9 presents a comparison between CBOW and Skip-gram trained embeddings for the 50 most frequently occurring ICD-9 codes. As with the 18-label case, Skip-gram models are in general better than CBOW models, except for a few ICD-9 codes, such as 995.5 or 389.

5.3 Sub-level Groups of Medical Codes

A comparison of F-measures for sub-level ICD-9 groups is presented, where the 155 labels are treated as one multi-label classification problem. We only use the sub-groups recorded in more than ten unique hospital admissions, hence 155 rather than all 167 possible sub-groups. F-measures are presented from the highest frequency of occurrence to the lowest. Figure 10 presents results for ICD-9 sub-groups occurring in more than 5% of cases, Figure 11 for sub-level groups occurring in between 1% and 5% of cases, and Figure 12 for sub-groups occurring in less than 1% of cases. A comparison of F-measures between W300 and T300, and between 300 and 600 dimensions, is presented.

For most of the 155 sub-groups T300 outperforms W300, and for most sub-level groups there is a definite improvement in F-measure when increasing the number of dimensions from 300 to 600. This pattern matches the results for 18 and 50 labels as presented above in Sections 5.1 and 5.2.

5.4 Overall Results

In this section we present results for the overall performance of the multi-label medical text classification problems with 18, 50 and 155 labels. We present micro-averaged and macro-averaged F1 measures, in line with prior work (see Section 2 for examples).
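The two averaging schemes differ in how they weight labels: micro-averaging pools true/false positives over all labels, while macro-averaging computes F1 per label and averages, so infrequent labels count equally. A small scikit-learn illustration, with toy matrices that are purely illustrative:

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows are instances, columns are labels (multi-label indicator matrices).
y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1]])
y_pred = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1]])

micro = f1_score(y_true, y_pred, average="micro")  # pooled TP/FP/FN
macro = f1_score(y_true, y_pred, average="macro")  # mean of per-label F1
```

Here the per-label F1 scores are 1.0, 0.5 and 1.0, giving a macro F1 of 0.833, while pooling counts gives a micro F1 of 0.8; a model that neglects rare labels is penalised more by the macro score.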

Table 8 presents micro- and macro-averaged F-measures for the 18, 50 and 155 label multi-label medical text classification tasks with embedding variations. The overall pattern matches the observations for individual label-level F-measures, where the combination of higher dimensions and a Skip-gram model usually results in the highest performance measures. For all three label set sizes (18, 50 and 155), micro- and macro-averaged F1 scores are always better for T300 than for W300.

Top-level: 18 labels
Model Description               Micro F1   Macro F1
W300                            0.730      0.648
T300                            0.734      0.653
T300SG                          0.737      0.658
T300, truncated to 2500*        0.737      0.654
CONCAT300                       0.676      0.589
CONCAT350                       0.702      0.593
T300, POS tag                   0.730      0.646
T300, TextSplitTag              0.723      0.684
T300, ECC-SGD, E=500, I=100**   0.721      0.634
T600                            0.742      0.665
T600SG                          0.745      0.674

Top 50: 50 labels
Model Description               Micro F1   Macro F1
W300                            0.484      0.434
T300                            0.497      0.445
T600                            0.532      0.486
T600SG                          0.539      0.493

Sub-level: 155 labels
Model Description               Micro F1   Macro F1
W300                            0.534      0.293
T300                            0.551      0.306
T600SG                          0.568      0.337

Table 8: Micro- and macro-averaged F-measures for the multi-label medical text classification problems with 18, 50 and 155 labels. All experiments used ECC with logistic regression as the base classifier and a ridge value of one, except for ** where the classifier is explicitly stated. * refers to “text pre-processed and truncated to 2500 tokens” as per Section 4.6. Bold is used to indicate the best measures for each case.

6 Conclusions

We present a detailed analysis of clinical NLP techniques used to enhance the embedding layer of a multi-label medical text classification task. We focus on predicting ICD-9 codes for patients with multi-morbidity, and present results for 18, 50 and 155 labels. Results and analysis are primarily conducted at the individual label level. Given the imbalanced nature of the data, it is evident at the individual label level that variations in embeddings, such as using the Skip-gram model over CBOW and using higher dimensional embeddings, do result in improvements in F-measure. These improvements are more significant for less frequent labels, and are evident across all three setups, regardless of whether 18, 50 or 155 labels are used. The improvements and differences are also visible in the overall micro- and macro-averaged F-measures. This research emphasises the need for enhancing text representations, and the results show that there is a definite benefit in incorporating additional features. The benefits depend on the data distribution and the task at hand. This paper used predicting medical codes as an example; however, the NLP techniques used in this research can be adapted to other tasks where multi-label medical text classification is required.

Our analysis of pre-processed text shows only marginal improvements in F-measure compared to text ‘as is’. Hence there is no clear indication that additional pre-processing is required for already pre-processed and de-identified data, such as MIMIC III, especially given the nature of the medical text.


  • A. B. Abacha, C. Shivade, and D. Demner-Fushman (2019) Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 370–379.
  • C. E. Aubert, J. L. Schnipper, N. Fankhauser, P. Marques-Vidal, J. Stirnemann, A. D. Auerbach, E. Zimlichman, S. Kripalani, E. E. Vasilevskis, E. Robinson, et al. (2019) Patterns of multimorbidity associated with 30-day readmission: a multinational study. BMC Public Health 19 (1), pp. 738.
  • T. Baumel, J. Nassour-Kassis, R. Cohen, M. Elhadad, and N. Elhadad (2018) Multi-label classification of patient notes: case study on ICD code assignment. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.
  • S. Bird, E. Klein, and E. Loper (2009) Natural Language Processing with Python. O’Reilly Media.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2016) Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
  • M. C. Data (2016) Secondary analysis of electronic health records. Springer.
  • J. Demšar (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, pp. 1–30.
  • J. Du, Q. Chen, Y. Peng, Y. Xiang, C. Tao, and Z. Lu (2019) ML-Net: multi-label classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association 26 (11), pp. 1279–1285.
  • K. Flegel (2018) What we need to learn about multimorbidity. CMAJ 190 (34).
  • S. Garcia and F. Herrera (2009) An extension on ”statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research 9, pp. 2677–2694.
  • V. Garla, V. L. Re III, Z. Dorey-Stein, F. Kidwai, M. Scotch, J. Womack, A. Justice, and C. Brandt (2011) The Yale cTAKES extensions for document classification: architecture and application. Journal of the American Medical Informatics Association 18 (5), pp. 614–620.
  • S. Godbole and S. Sarawagi (2004) Discriminative methods for multi-labeled classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 22–30.
  • Y. Goldberg (2017) Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies 10 (1), pp. 1–309.
  • A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C. Peng, and H. E. Stanley (2000) PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220.
  • E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov (2018) Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • D. Hausmann, V. Kiesel, L. Zimmerli, N. Schlatter, A. von Gunten, N. Wattinger, and T. Rosemann (2019) Sensitivity for multimorbidity: the role of diagnostic uncertainty of physicians when evaluating multimorbid video case-based vignettes. PLoS ONE 14 (4).
  • H. Huggard, A. Zhang, E. Zhang, and Y. S. Koh (2019) Feature importance for biomedical named entity recognition. In Australasian Joint Conference on Artificial Intelligence, pp. 406–417.
  • P. B. Jensen, L. J. Jensen, and S. Brunak (2012) Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics 13 (6), pp. 395.
  • A. E. Johnson, T. J. Pollard, and R. G. Mark (2017) Reproducibility in critical care: a mortality prediction case study. In Machine Learning for Healthcare Conference, pp. 361–376.
  • A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035.
  • A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov (2016a) Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016b) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  • L. Lee, Y. Lu, P. Chen, P. Lee, and K. Shyu (2019) NCUEE at MEDIQA 2019: medical text inference using ensemble BERT-BiLSTM-Attention model. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 528–532.
  • M. Li, Z. Fei, M. Zeng, F. Wu, Y. Li, Y. Pan, and J. Wang (2018) Automated ICD-9 coding via a deep learning approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
  • H. Liu, K. B. Wagholikar, S. Jonnalagadda, and S. Sohn (2013) Integrated cTAKES for concept mention detection and normalization. In CLEF (Working Notes).
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
  • T. Mori, S. Hamada, S. Yoshie, B. Jeon, X. Jin, H. Takahashi, K. Iijima, T. Ishizaki, and N. Tamiya (2019) The associations of multimorbidity with the sum of annual medical and long-term care expenditures in Japan. BMC Geriatrics 19 (1), pp. 69.
  • J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein (2018) Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695.
  • S. Purushotham, C. Meng, Z. Che, and Y. Liu (2017) Benchmark of deep learning models on large healthcare MIMIC datasets. arXiv preprint arXiv:1710.08531.
  • J. Read, B. Pfahringer, G. Holmes, and E. Frank (2009) Classifier chains for multi-label classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 254–269.
  • J. Read, B. Pfahringer, G. Holmes, and E. Frank (2011) Classifier chains for multi-label classification. Machine Learning 85 (3), pp. 333.
  • J. Read, P. Reutemann, B. Pfahringer, and G. Holmes (2016) MEKA: a multi-label/multi-target extension to WEKA. Journal of Machine Learning Research 17 (21), pp. 1–5.
  • R. Reátegui and S. Ratté (2018) Comparison of MetaMap and cTAKES for entity extraction in clinical notes. BMC Medical Informatics and Decision Making 18 (3), pp. 74.
  • A. Rios and R. Kavuluru (2018) EMR coding with semi-parametric multi-head matching networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Vol. 2018, pp. 2081.
  • K. Roberts, D. Demner-Fushman, E. M. Voorhees, W. R. Hersh, S. Bedrick, A. J. Lazar, and S. Pant (2017) Overview of the TREC 2017 precision medicine track. NIST Special Publication, pp. 500–324.
  • B. L. Ryan, K. B. Jenkyn, S. Z. Shariff, B. Allen, R. H. Glazier, M. Zwarenstein, M. Fortin, and M. Stewart (2018) Beyond the grey tsunami: a cross-sectional population-based study of multimorbidity in Ontario. Canadian Journal of Public Health 109 (5-6), pp. 845–854.
  • G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G. Chute (2010) Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association 17 (5), pp. 507–513.
  • G. Tsoumakas and I. Katakis (2007) Multi-label classification: an overview. International Journal of Data Warehousing and Mining (IJDWM) 3 (3), pp. 1–13.
  • H. Yang and T. Gonçalves (2017) UEvora at CLEF eHealth 2017 Task 3. In CLEF (Working Notes).
  • V. Yogarajan, H. Gouk, T. Smith, M. Mayo, and B. Pfahringer (2020a) Comparing high dimensional word embeddings trained on medical text to bag-of-words for predicting medical codes. In Proceedings of the Asian Conference on Intelligent Information and Database Systems (ACIIDS 2020), N. T. Nguyen et al. (Eds.), Lecture Notes in Artificial Intelligence (LNAI) 12033, Springer, pp. 1–12 (to appear).
  • V. Yogarajan, B. Pfahringer, and M. Mayo (2020b) A review of automatic end-to-end de-identification: is high accuracy the only metric?. Applied Artificial Intelligence, pp. 1–19.
  • M. Zeng, M. Li, Z. Fei, Y. Yu, Y. Pan, and J. Wang (2019) Automatic ICD-9 coding via deep transfer learning. Neurocomputing 324, pp. 43–50.
  • M. Zhang and Z. Zhou (2005) A k-nearest neighbor based algorithm for multi-label classification. In 2005 IEEE International Conference on Granular Computing, Vol. 2, pp. 718–721.