Convolutional Neural Networks for Medical Diagnosis from Admission Notes

12/06/2017 ∙ by Christy Li, et al. ∙ 0

Objective: Develop an automatic diagnostic system that uses only textual admission information from Electronic Health Records (EHRs) and assists clinicians with a timely and statistically proven decision tool. The hope is that the tool can be used to reduce mis-diagnosis.

Materials and Methods: We use real-world clinical notes from MIMIC-III, a freely available dataset consisting of clinical data of more than forty thousand patients who stayed in intensive care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. We propose a Convolutional Neural Network model that learns semantic features from unstructured textual input and automatically predicts the primary discharge diagnosis.

Results: The proposed model achieved an overall accuracy of 96.11% and a weighted F1 score of 80.48%, outperforming four strong baseline models by at least 12.7 points of weighted F1 score.

Discussion: Experimental results imply that the CNN model is suitable for supporting diagnosis decision making in the presence of complex, noisy and unstructured clinical data while at the same time using fewer layers and parameters than other traditional Deep Network models.

Conclusion: Our model demonstrated the capability of representing complex, medically meaningful features from unstructured clinical notes and showed prediction power for commonly misdiagnosed frequent diseases. It can be easily adopted in a clinical setting to provide timely and statistically proven decision support.

Keywords: Convolutional neural network, text classification, discharge diagnosis prediction, admission information from EHRs.




1 Background and Significance

Mis-diagnosis is one of the most severe problems in healthcare (dia, April 17, 2014), inducing significant harm to patients’ well-being. As reported by (Hardeep Singh, April 17, 2014), approximately 12 million adults are misdiagnosed in outpatient medical care, which amounts to 1 out of 20 adult patients. More importantly, around 40,000 - 80,000 patients die annually in the U.S.A due to diagnostic errors, as evidence from autopsies (dia, April 17, 2014) indicates.

While a plethora of factors can lead to mis-diagnosis, untimely diagnosis and lack of guidelines stand out as the most significant ones (Anthony T. DiPietro, May 1, 2012). Many diseases, such as asthma, chronic obstructive pulmonary disease and bronchiectasis, share similar symptoms such as dyspnea, coughing, wheezing and expectoration, which makes their differential diagnosis difficult when time is short or guidelines are lacking. However, differentiating between these diseases at an early stage, such as at the patient’s admission time, is extremely important, since different treatments could lead to very different clinical outcomes and an improper treatment plan could prove disastrous for the patient’s health. Additionally, insufficient clinical guidelines for handling rare symptoms can quite often result in failure to correctly explain a patient’s symptoms. As a result, no suitable treatment is immediately administered to the patient.

The significant improvement of Electronic Health Record (EHR) systems over the past decades has facilitated standardization of data collection by health-care professionals and allowed clinicians to access patients’ important clinical information at their earliest convenience, helping to prevent mis-diagnosis or delayed diagnosis. However, initial diagnostic decision-making based on EHRs still faces some challenges. There is no protocol for filling in an EHR, and an EHR usually contains information collected from multiple clinicians with different personal writing styles who potentially work at different health institutions. As a result, EHRs differ in length and writing style, lack structure, alternate between official names, unofficial names, medical names and abbreviations of diseases, and contain various misspellings and grammar errors. According to a recent study (Sinsky et al., 2016), clinicians were found to spend up to half of their total working time on EHR and desk work and less than a third of their time interacting directly with patients.

Convolutional Neural Network (CNN) models’ ability to identify patterns and learn appropriate latent representations from textual data has made them one of the most powerful model families for classification. In this work, we investigated the potential of a CNN model to support the clinician’s diagnostic decision making, formulating discharge diagnosis prediction as a multiclass classification problem. Our model takes as input only a subset of the information contained in a patient’s EHR upon admission and outputs a discharge diagnosis prediction. We trained a separate neural network to learn the embeddings (low-dimensional real-valued vectors that act as features) of the words in the vocabulary of our dataset and used the weights of this model to initialize the weights of the embedding layer of the CNN model. This is a critical optimization that further boosted the performance of the CNN model.

MIMIC-III is a freely available dataset containing clinical data of more than forty thousand patients who stayed in intensive care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 (Johnson et al., 2016). Preprocessing included applying coreference resolution to the disease names (finding the mentions in text that refer to the same real-world entity) and extracting the clinical notes that correspond to the most frequent diseases while ignoring the rest. Discharge diagnosis prediction was then formulated as a multiclass classification problem with the following 10 classes: coronary artery disease, hemorrhage, pneumonia, myocardial infarction, gastrointestinal bleeding, fracture, aortic stenosis, cardiac failure, prematurity and stroke. We evaluated the performance of our CNN model as well as baseline classification models, namely Support Vector Machines (SVM), Random Forest (RF), Multi Layer Perceptron (MLP) and Logistic Regression (LR), on the MIMIC-III dataset using precision, accuracy, recall and F1 score for each of the 10 diseases. Experiments indicate that our CNN model, achieving 96.11% overall accuracy and 80.48% weighted F1 score, outperforms all of the baseline models on 9 out of the 10 disease classes. Furthermore, the weighted F1 score of 80.48%, more than 12 points higher than that of the best baseline model, reflects that the CNN model performed well across disease classes independently of how frequently they appear and are diagnosed in the MIMIC corpus. We hence believe it to be a valuable tool for discharge diagnosis support.

2 Methods

2.1 CNN-based multi-class text classification

The discharge diagnosis prediction problem is cast as a multi-class text classification problem: an admission note (after appropriate preprocessing) is fed into a convolutional neural network, which classifies it into one of $K = 10$ primary discharge diseases. We now describe the different parts of the model’s architecture, as depicted in Figure 1.

Figure 1: Demonstration of the CNN model, which consists of four components: embedding layer, convolutional layer, max-pooling layer, and fully-connected layer.

The CNN, among other things, learns an underlying representation of an admission note in a high-dimensional space. The embedding layer maps each word in the admission note to its embedding (a low-dimensional real-valued vector) and acts essentially as a lookup table. The embedding layer can be represented by a $V \times d$ matrix, where $V$ is the number of different words in the vocabulary considered and $d$ is the dimension of an embedding. For an admission note consisting of $n$ words, the embedding layer outputs an $n \times d$ matrix.

The convolutional layer consists of $m$ independent filtering operations ($m$ is usually 128 or 256), with each filter trying to extract a different type of information from the embedding representation of the admission note. Writing $n$ for the number of words in the note and $d$ for the embedding dimension, the $i$-th filtering operation can be thought of as sliding an $h \times d$ matrix (filter) over the $n \times d$ output of the embedding layer and computing its dot product with the corresponding area of the admission note’s embedding representation. Performed naively, the $i$-th filtering operation outputs an $(n - h + 1)$-dimensional vector; by padding the output of the embedding layer with $h - 1$ rows of zeros, we ensure that the $i$-th filtering operation’s output is an $n$-dimensional vector. The output of the convolutional layer results from stacking these $m$ vectors together to obtain an $n \times m$ matrix; this is the “feature representation” that the CNN is learning.

The output of the convolutional layer, an $n \times m$ matrix with one column per filter and one row per word position, is then fed into a max-pooling layer. The $i$-th max-pooling operation selects the maximum value of the $i$-th convolution (the $i$-th column of the matrix). The output of the max-pooling layer is thus an $m$-dimensional vector holding one maximum value per filter. A CNN usually contains filters of different sizes that capture patterns across contexts of different sizes (number of words). Our model uses word embeddings of dimension $d$ and a total of $m$ filters, divided among filters of size 3, size 4 and size 5.
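The convolution and max-pooling steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `conv_max_pool` is hypothetical, the zero rows are appended at the end of the note, and no bias term or nonlinearity is applied.

```python
import random


def conv_max_pool(embedded, filt):
    """Slide one filter over an embedded note and max-pool its activations.

    embedded: list of n word vectors, each of length d (the n x d note matrix)
    filt:     list of h weight vectors, each of length d (an h x d filter)
    Returns (max activation, list of the n activations).
    """
    n, d, h = len(embedded), len(embedded[0]), len(filt)
    # Assumption: pad with h - 1 zero rows at the end so the output has n entries
    # instead of n - h + 1.
    padded = embedded + [[0.0] * d for _ in range(h - 1)]
    activations = []
    for start in range(n):
        window = padded[start:start + h]
        # Dot product between the filter and the h-word window.
        score = sum(w * x
                    for filt_row, word_vec in zip(filt, window)
                    for w, x in zip(filt_row, word_vec))
        activations.append(score)
    return max(activations), activations


# Toy note with n = 6 words and d = 4, and one random filter per size 3, 4, 5.
random.seed(0)
note = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
filters = [[[random.uniform(-1, 1) for _ in range(4)] for _ in range(h)]
           for h in (3, 4, 5)]
pooled = [conv_max_pool(note, f)[0] for f in filters]  # one value per filter
```

With $m$ filters, `pooled` is the $m$-dimensional vector that would be passed on to the fully-connected layer.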

The $m$-dimensional output of the max-pooling layer (one value per filter) enters a fully-connected layer. This is essentially a layer of $m$ input neurons connected to all $K$ output neurons, one per disease class. Complex co-adaptations of the fully-connected layer’s weights on the training data can cause the CNN model to overfit. To prevent overfitting, individual neurons in the fully-connected layer are either kept with probability $p$ or “dropped out” with probability $1 - p$, a technique known as dropout.

A final softmax layer converts the $K$-dimensional output $z$ of the fully-connected layer into a $K$-dimensional probability distribution vector $\hat{y}$ by applying the normalized exponential function to $z$:

$$\hat{y}_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}, \qquad k = 1, \dots, K$$

Finally, the true class of a clinical note is represented as a “one-hot” $K$-dimensional vector $y$, with $y_k = 1$ if $k$ corresponds to the disease that was actually diagnosed and $y_k = 0$ otherwise. The CNN is then trained, end-to-end, using the stochastic gradient descent algorithm in order to minimize the cross-entropy loss of $y$ and $\hat{y}$:

$$L(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
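As a small worked example of the softmax and cross-entropy computations, the sketch below uses only the Python standard library. The function names are illustrative, and subtracting the maximum score before exponentiating is a common numerical-stability step that the text does not mention.

```python
import math


def softmax(z):
    """Normalized exponential of a list of K scores."""
    shift = max(z)  # stability trick; does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_onehot, y_hat):
    """Cross-entropy between a one-hot label vector and predicted probabilities."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_hat) if y > 0)


scores = [2.0, 1.0, 0.1]          # toy output of the fully-connected layer, K = 3
probs = softmax(scores)           # probabilities that sum to 1
loss = cross_entropy([1, 0, 0], probs)
```

Because the label is one-hot, the loss reduces to the negative log-probability assigned to the true class.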


2.2 Word embedding pre-training

We employed the Skip-gram model (Mikolov et al., 2013) to train medical word embeddings for initializing the CNN-based text classification model. This is beneficial because the semantic information of clinical notes can be incorporated through embeddings pre-trained on words from clinical notes. We trained the embedding model on the whole MIMIC III dataset, using all data fields, including admission information and non-admission information such as brief hospital course, discharge instructions, discharge plan, discharge medication and lab tests during hospitalization. The Skip-gram model is designed to learn the probability distribution of words that appear close to each other in the corpus. The objective function is:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $T$ is the number of words in the input sequence and $c$ is the size of the training context. The conditional probability of a target word $w_O$ given the current word $w_I$ is defined by the softmax function:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where $W$ is the vocabulary size, and $v_w$ and $v'_w$ are the input and output vector representations of word $w$.
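To make the role of the context size concrete, the following sketch enumerates the (current word, target word) pairs over which the Skip-gram objective sums. The function name `skipgram_pairs` is hypothetical, and the sketch covers only pair generation, not the training of the vectors.

```python
def skipgram_pairs(tokens, c):
    """Return every (current word, target word) pair within a window of size c."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            # Skip the word itself and positions that fall outside the sequence.
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs


pairs = skipgram_pairs(["acute", "chest", "pain"], c=1)
```

Each pair contributes one log-probability term to the objective, so a larger context size yields more training pairs per word.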

In practice, we used fastText (Joulin et al., 2016) to train the embedding model. Words whose frequency is less than or equal to 1 were removed. When initializing the classification model, we used the trained embedding vectors for in-vocabulary words and computed the vector representations of out-of-vocabulary words using the function provided by fastText.

2.3 Disease coreference resolution

Figure 2: An example of disease coreference resolution of ST Segment Elevation Myocardial Infarction

Real-world clinical notes are characterized by lack of structure and the usage of different words referring to the same term. This is a result of doctors’ personalized writing styles as well as their tendency to alternate between official names, abbreviations, unofficial names and medical custom names when they refer to a particular term. Moreover, various misspellings and inconsistency in using lowercase/uppercase letters further increase the variety of different words used to refer to the same term. Figure 2 illustrates an example where the terms “segment elevation myocardial infarction”, “stents elevation myocardial infarction”, “st-elevation myocardial infarction”, “st elevation myocardial infarction”, “st elevated myocardial infarction”, “st segment elevation myocardial infarction”, “st elevation mi”, “st-elevation mi” and “stemi” all refer to the disease with official name “ST Segment Elevation Myocardial Infarction”.

Coreference resolution, the task of finding all expressions that refer to the same entity in a text, is an important step for many higher-level NLP tasks that involve natural language understanding, such as document summarization, question answering, and information extraction. Although there exists a plethora of machine learning algorithms for coreference resolution, both supervised and unsupervised, we used a remarkably simple, fast and still successful approach on the MIMIC dataset: we collected all discharge diagnosis disease names appearing after phrases such as “diagnosis” and “primary diagnosis”, and then (i) manually grouped words that refer to the same disease and (ii) replaced all the words appearing in the groups by the name of the disease.
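The replacement step (ii) can be sketched as a dictionary lookup applied with a regular expression. The group below reuses variants listed for Figure 2; the names `GROUPS`, `build_lookup` and `normalize_diagnosis` are hypothetical, and matching case-insensitively on word boundaries is an assumption rather than the authors' documented procedure.

```python
import re

# Manually curated groups: official disease name -> variants seen in the notes.
GROUPS = {
    "ST Segment Elevation Myocardial Infarction": [
        "st segment elevation myocardial infarction",
        "st-elevation myocardial infarction",
        "st elevation myocardial infarction",
        "st elevated myocardial infarction",
        "st elevation mi",
        "st-elevation mi",
        "stemi",
    ],
}


def build_lookup(groups):
    """Invert the groups into a variant -> official name mapping."""
    return {variant: name
            for name, variants in groups.items()
            for variant in variants}


def normalize_diagnosis(text, lookup):
    """Replace every known variant in text by its official disease name."""
    # Longest variants first, so a long form is preferred over a shorter prefix.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b"
    return re.sub(pattern,
                  lambda m: lookup[m.group(0).lower()],
                  text,
                  flags=re.IGNORECASE)


lookup = build_lookup(GROUPS)
normalized = normalize_diagnosis("Primary diagnosis: STEMI", lookup)
```

Misspelled variants such as “stents elevation myocardial infarction” would simply be added to the same group by hand.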

The MIMIC III dataset includes the International Statistical Classification of Diseases and Related Health Problems (ICD-9) codes associated with the disease(s) of each clinical note. It is important to note that even though ICD-9 is very fine-grained and contains the official names of diseases and as many sub-diseases as possible, the clinical records reflect the doctors’ tendency to not closely follow ICD-9 terminology. In a preliminary experiment, we measured the classification performance of our CNN model under two different settings with respect to the classification labels used: in the first setting, we used the disease names that result from coreference resolution whereas in the second setting ICD-9 official disease names. Given that the classification accuracy was about 5% higher in the first setting, we chose the disease names that resulted from our simple coreference resolution method as the classes for the multi-class classification problem.

3 Experiments & Results

3.1 Data

3.1.1 Dataset

We evaluated the performance of the proposed CNN model using MIMIC-III, a freely available dataset that consists of clinical data for more than 40,000 patients who stayed in the intensive care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 (Johnson et al., 2016). The dataset consists of more than 50,000 anonymized clinical records that include, among others, demographics, chief complaints, past medical history, vital signs, procedures, lab tests, medications and discharge diagnosis information. Given that our model tries to provide an accurate prediction of the discharge diagnosis by focusing solely on information available upon admission, we extracted from the clinical records only admission-time information such as chief complaints, past medical history, past surgical history, social history, family history, allergies, laboratory examination results upon admission and medications taken upon admission.

3.1.2 Data preprocessing

The filtered information, initially in text format and containing abbreviations, misspellings and grammar errors, needed to undergo several preprocessing tasks before it could be fed into the CNN model. These preprocessing tasks included the concatenation of separate lines, the removal of duplicate spaces, the splitting of words by punctuation, the transformation of words to lowercase, and the replacement of specific numbers, person names, hospital names, dates and times by “***” for the sake of anonymization. Finally, since the length (in number of words) of the resulting clinical notes was not uniform and the CNN model only operates on documents of a fixed common length, an additional preprocessing task was carried out. We computed a maximum document length as the length (in number of words) not exceeded by 90% of the preprocessed clinical notes, and truncated at exactly this length every note that was longer than this threshold.
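A minimal sketch of these preprocessing steps is given below. The function names are hypothetical, the anonymization step is omitted because the notes in MIMIC-III are already de-identified in a dataset-specific format, and the 90% threshold is computed with a simple nearest-rank rule, which is one of several reasonable readings of the description.

```python
import re


def preprocess(note):
    """Join lines, lowercase, split punctuation from words, collapse spaces."""
    text = " ".join(note.splitlines())           # concatenate separate lines
    text = text.lower()                          # lowercase all words
    text = re.sub(r"([^\w\s])", r" \1 ", text)   # isolate punctuation marks
    return text.split()                          # split() also drops extra spaces


def max_length(token_lists, coverage_pct=90):
    """Smallest length that at least coverage_pct percent of notes do not exceed."""
    lengths = sorted(len(tokens) for tokens in token_lists)
    # Nearest-rank index, computed with integers to avoid rounding surprises.
    rank = (coverage_pct * len(lengths) + 99) // 100
    return lengths[max(rank, 1) - 1]


def truncate(tokens, limit):
    """Cut a note down to the common maximum length."""
    return tokens[:limit]


tokens = preprocess("Chest  pain,\nSOB.")
```

Notes shorter than the threshold would still need to be brought to the common length, for example by padding; the text does not describe that step, so it is not shown.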

3.1.3 Classification categories

We applied coreference resolution on the MIMIC III dataset, resulting in 479 unique disease categories. We further analyzed the dataset and identified the 10 most frequent diseases, which we used as the labels for our classification problem. When a clinical note mentions multiple diseases, only the first one is counted, based on the assumption that the first disease is the most important. Furthermore, we discarded clinical notes that did not mention any of these 10 diseases, ending up with a total of 13152 EHRs. The sample distribution of the 10 diseases is shown in Figure 3 and Table 1.

disease number of samples
Coronary artery disease 3193
Hemorrhage 1955
Pneumonia 1634
Myocardial infarction 1229
Gastrointestinal bleeding 1158
Fracture 1047
Aortic stenosis 934
Cardiac failure 927
Prematurity 559
Stroke 504
Table 1: Sample distribution of the 10 disease categories.
Figure 3: Bar chart of the sample distribution of the 10 disease categories. The distribution is unbalanced: more than 23% of samples fall in the first category, coronary artery disease, while only about 4% of samples fall in each of the last two categories, prematurity and stroke.
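The class imbalance can be checked directly from the counts in Table 1. In this short sketch the shares are taken relative to the sum of the counts listed in the table; the variable names are illustrative.

```python
# Sample counts copied from Table 1.
counts = {
    "Coronary artery disease": 3193,
    "Hemorrhage": 1955,
    "Pneumonia": 1634,
    "Myocardial infarction": 1229,
    "Gastrointestinal bleeding": 1158,
    "Fracture": 1047,
    "Aortic stenosis": 934,
    "Cardiac failure": 927,
    "Prematurity": 559,
    "Stroke": 504,
}

total = sum(counts.values())
# Percentage of samples per disease category.
shares = {disease: 100 * n / total for disease, n in counts.items()}

for disease, share in shares.items():
    print(f"{disease:28s}{share:5.1f}%")
```

The largest class is more than six times the size of the smallest one, which is why the evaluation reports weighted as well as unweighted averages.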

3.2 Evaluation Approach

For each individual disease category, we evaluated the model performance on the testing set, keeping track of the following metrics: accuracy, true negative rate, false positive rate, false negative rate, precision, recall and F1 score. We additionally computed, for each of these metrics, an unweighted and a weighted average across the disease categories, where the weight for disease category $k$ is the ratio of the number of clinical notes in which disease $k$ was diagnosed to the total number of clinical notes in the testing set. The weighted averages thus reflect the distribution of diagnosed diseases in the testing set.
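The per-class metrics and the two kinds of averages can be sketched as follows. Each class is scored one-vs-rest; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def one_vs_rest_metrics(y_true, y_pred, label):
    """Metrics for one disease class, treating all other classes as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    def ratio(a, b):
        return a / b if b else 0.0  # assumed convention for empty denominators

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "ACC": ratio(tp + tn, len(pairs)),
        "TNR": ratio(tn, tn + fp),
        "FPR": ratio(fp, tn + fp),
        "FNR": ratio(fn, tp + fn),
        "Precision": precision,
        "Recall": recall,
        "F1": ratio(2 * precision * recall, precision + recall),
    }


def averages(y_true, y_pred, metric):
    """Unweighted and support-weighted average of one metric over all classes."""
    labels = sorted(set(y_true))
    scores = {k: one_vs_rest_metrics(y_true, y_pred, k)[metric] for k in labels}
    unweighted = sum(scores.values()) / len(labels)
    weighted = sum(scores[k] * y_true.count(k) / len(y_true) for k in labels)
    return unweighted, weighted


y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
f1_unweighted, f1_weighted = averages(y_true, y_pred, "F1")
```

In this toy example the weighted F1 is lower than the unweighted one because the class with the weakest F1 score carries more weight than the perfectly classified single-sample class.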

3.3 Baseline Models

For multi-class disease classification, we compare our CNN text classification model against several baseline models: support vector machine (SVM), random forest (RF), multi-layer perceptron (MLP) and logistic regression (LR). Tf-idf (term frequency-inverse document frequency) (Salton et al., 1975) is a common numerical statistic reflecting the importance of a word to a document in a corpus. For each clinical note, we first create a $V$-dimensional vector ($V$ being the size of our vocabulary) containing tf-idf values for the words that appear in it and zero values for the words that are absent. Therefore, for a set of $N$ clinical notes we obtain an $N \times V$ feature matrix. Given the large value of $V$, we further apply the Principal Components Analysis (PCA) dimensionality reduction technique to obtain an $N \times k$ feature matrix with $k < V$. The rows of this matrix are the clinical notes’ feature vectors used by each of the baseline classification models during training and testing.
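The construction of the tf-idf feature matrix can be illustrated with a small standard-library sketch. Tf-idf has several variants; this one uses the raw term count multiplied by log(N / document frequency), which is an assumption, since the text does not say which weighting was used. The PCA step is not shown.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build an N x V tf-idf matrix from a list of tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc})
    column = {word: i for i, word in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)  # words absent from the note stay at zero
        for word, tf in Counter(doc).items():
            row[column[word]] = tf * math.log(n_docs / doc_freq[word])
        rows.append(row)
    return vocab, rows


docs = [["chest", "pain"], ["chest", "fever"]]
vocab, features = tfidf_matrix(docs)
```

A word that occurs in every note, such as "chest" here, receives weight zero, while words specific to one note receive the largest weights.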

3.4 Implementation Details

We randomly split the MIMIC III dataset into training, validation and testing sets with a ratio of 7:1.5:1.5. The best trained model is selected based on its performance on the validation set and then evaluated on the testing set. We repeated this procedure 5 times for each model, each time using a different random seed for splitting the dataset, and averaged the 5 evaluation results to obtain the final evaluation results for that model.
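The repeated random splitting can be sketched as follows. The function names are hypothetical, the placeholder scores are invented solely to show the aggregation step, and the standard error is computed as the sample standard deviation divided by the square root of the number of runs, which is an assumption about how the reported errors were obtained.

```python
import math
import random
import statistics


def split_indices(n, seed, parts=(70, 15, 15)):
    """Shuffle note indices with a seed and cut them 70 / 15 / 15 percent."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * parts[0] // 100
    n_val = n * parts[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # remainder goes to the testing set
    return train, val, test


def mean_and_se(values):
    """Mean and standard error of the mean over repeated runs."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


# One split per seed; a model would be trained and evaluated on each.
splits = [split_indices(13152, seed) for seed in range(5)]

# Placeholder scores, NOT results from the paper, to show the aggregation.
example_scores = [0.80, 0.81, 0.79, 0.80, 0.82]
avg, se = mean_and_se(example_scores)
```

Using a different seed per run changes which notes land in each set, so the spread across the 5 runs reflects sensitivity to the split as well as to training randomness.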

For our CNN model, we trained the fastText (Joulin et al., 2016) word embedding model with embedding dimension $d$ on the full MIMIC dataset and used it to initialize the embedding layer weights. The remaining CNN hyperparameters are the learning rate, the number of filters of each size (3, 4 and 5) and the dropout probability; we used the Adam optimizer for gradient descent and trained on a single GPU. For the MLP model, we used two layers with 100 and 10 neurons respectively, along with ReLU activation functions and early stopping. For the SVM, we used a kernelized one-vs-all SVM with an RBF kernel and a stopping tolerance.


3.5 Results on multi-class diagnosis classification

Figure 4 shows the evaluation metrics, including the unweighted and weighted averages of accuracy, true negative rate, false positive rate, false negative rate, precision, recall and F1 score. The exact values of these metrics are shown in Table 2.


Metric | SVM | MLP | LR | RF | CNN
ACC | 93.71 ± 0.37 | 91.54 ± 1.69 | 90.74 ± 0.14 | 93.69 ± 0.13 | 96.11 ± 0.08
TNR | 96.59 ± 0.20 | 93.44 ± 2.76 | 95.21 ± 0.08 | 96.41 ± 0.07 | 97.78 ± 0.05
FPR | 3.41 ± 0.20 | 4.56 ± 0.77 | 4.79 ± 0.08 | 3.59 ± 0.07 | 2.22 ± 0.05
FNR | 21.61 ± 0.25 | 29.48 ± 5.78 | 15.88 ± 0.54 | 32.91 ± 0.49 | 20.92 ± 0.62
Precision | 60.81 ± 1.58 | 49.02 ± 9.87 | 40.41 ± 0.47 | 64.45 ± 0.58 | 78.72 ± 0.67
Recall | 72.39 ± 2.53 | 50.52 ± 12.22 | 68.12 ± 2.40 | 67.09 ± 0.49 | 79.08 ± 0.62
F1 | 60.86 ± 1.50 | 47.17 ± 10.94 | 39.48 ± 0.54 | 65.04 ± 0.61 | 78.45 ± 0.50
WACC | 92.59 ± 0.14 | 88.49 ± 3.56 | 86.10 ± 0.23 | 92.22 ± 0.17 | 95.29 ± 0.08
WTNR | 96.59 ± 0.21 | 90.26 ± 5.46 | 95.80 ± 0.11 | 95.73 ± 0.08 | 97.13 ± 0.10
WFPR | 3.41 ± 0.21 | 4.90 ± 0.64 | 4.20 ± 0.11 | 4.27 ± 0.08 | 2.87 ± 0.10
WFNR | 23.38 ± 0.93 | 30.87 ± 3.34 | 25.66 ± 0.55 | 31.96 ± 0.69 | 19.06 ± 0.36
WPrecision | 68.56 ± 1.86 | 57.68 ± 8.46 | 53.69 ± 0.70 | 68.43 ± 0.64 | 80.55 ± 0.42
WRecall | 72.21 ± 1.39 | 53.18 ± 11.90 | 64.61 ± 1.14 | 68.04 ± 0.69 | 80.94 ± 0.36
WF1 | 66.08 ± 1.23 | 53.15 ± 11.01 | 45.51 ± 0.79 | 67.77 ± 0.72 | 80.48 ± 0.41
Table 2: Evaluation metrics (in %) of the convolutional-based text classification (CNN), support vector machine (SVM), random forest (RF), multilayer perceptron (MLP) and logistic regression (LR) models on diagnosis classification. ACC, TNR, FPR, FNR, Precision, Recall and F1 denote accuracy, true negative rate, false positive rate, false negative rate, precision, recall and F1 score respectively. They are computed by averaging the corresponding metric over the individual classes. The same metric name prefixed with “W” denotes the weighted average of that metric, where the weights are the sample weights in the testing dataset. The standard errors, computed from the 5 sets of results obtained for each model with different random splits, are appended after the metric values with a ± sign. The CNN model outperforms the baseline models on every metric except the unweighted FNR, where LR achieves the lowest value.
Figure 4: Bar chart demonstrating the performance difference among the SVM, random forest, MLP, logistic regression and CNN models on discharge diagnosis classification. The CNN model outperforms the baseline models on nearly all metrics.
Figure 5: Bar chart demonstrating the F1 score difference among the SVM, random forest, MLP, logistic regression and CNN models on individual discharge diagnosis classification. The CNN model outperforms the baseline models on 9 out of 10 diagnosis categories, mostly by margins of 10% to 15%.
Model | Label | ACC | TNR | FPR | FNR | Precision | Recall | F1
SVM | 0 | 87.41 ± 0.46 | 97.00 ± 0.20 | 3.00 ± 0.20 | 32.66 ± 1.02 | 91.44 ± 0.61 | 67.34 ± 1.02 | 77.54 ± 0.71
MLP | 0 | 75.56 ± 12.84 | 74.86 ± 18.72 | 5.14 ± 1.35 | 36.31 ± 9.94 | 83.79 ± 4.29 | 63.69 ± 9.94 | 68.96 ± 7.52
LR | 0 | 64.08 ± 0.66 | 99.73 ± 0.11 | 0.27 ± 0.11 | 60.22 ± 0.59 | 99.55 ± 0.17 | 39.78 ± 0.59 | 56.84 ± 0.60
RF | 0 | 87.40 ± 0.32 | 93.47 ± 0.19 | 6.53 ± 0.19 | 29.16 ± 0.68 | 79.85 ± 0.66 | 70.84 ± 0.68 | 75.07 ± 0.55
CNN | 0 | 92.63 ± 0.27 | 94.64 ± 0.44 | 5.36 ± 0.44 | 13.93 ± 1.26 | 82.86 ± 1.41 | 86.07 ± 1.26 | 84.37 ± 0.60
SVM | 1 | 93.87 ± 0.24 | 97.72 ± 0.25 | 2.28 ± 0.25 | 25.26 ± 1.56 | 86.54 ± 1.71 | 74.74 ± 1.56 | 80.09 ± 0.58
MLP | 1 | 91.69 ± 1.43 | 94.90 ± 2.17 | 5.10 ± 2.17 | 21.39 ± 5.93 | 66.27 ± 16.70 | 58.61 ± 14.87 | 61.97 ± 15.54
LR | 1 | 91.49 ± 0.24 | 97.66 ± 0.15 | 2.34 ± 0.15 | 34.88 ± 1.25 | 86.66 ± 0.97 | 65.12 ± 1.25 | 74.34 ± 1.04
RF | 1 | 92.10 ± 0.28 | 95.79 ± 0.16 | 4.21 ± 0.16 | 28.87 ± 1.43 | 74.81 ± 1.35 | 71.13 ± 1.43 | 72.92 ± 1.34
CNN | 1 | 95.37 ± 0.28 | 97.40 ± 0.14 | 2.60 ± 0.14 | 16.10 ± 1.71 | 85.00 ± 0.76 | 83.90 ± 1.71 | 84.39 ± 0.84
SVM | 2 | 93.10 ± 0.82 | 96.60 ± 0.72 | 3.40 ± 0.72 | 24.85 ± 5.84 | 77.20 ± 5.20 | 75.15 ± 5.84 | 74.63 ± 1.74
MLP | 2 | 90.87 ± 1.36 | 94.31 ± 2.00 | 5.69 ± 2.00 | 26.03 ± 7.36 | 60.56 ± 15.27 | 53.97 ± 13.92 | 56.90 ± 14.42
LR | 2 | 91.96 ± 0.20 | 94.57 ± 0.30 | 5.43 ± 0.30 | 28.05 ± 1.56 | 63.38 ± 1.22 | 71.95 ± 1.56 | 67.31 ± 0.65
RF | 2 | 91.80 ± 0.29 | 95.96 ± 0.24 | 4.04 ± 0.24 | 33.08 ± 1.38 | 73.41 ± 1.95 | 66.92 ± 1.38 | 69.99 ± 1.50
CNN | 2 | 94.96 ± 0.20 | 97.43 ± 0.30 | 2.57 ± 0.30 | 20.83 ± 1.64 | 82.13 ± 2.39 | 79.17 ± 1.64 | 80.43 ± 0.38
SVM | 3 | 92.61 ± 0.56 | 93.12 ± 0.84 | 6.88 ± 0.84 | 16.33 ± 4.43 | 28.36 ± 8.11 | 83.67 ± 4.43 | 38.90 ± 8.09
MLP | 3 | 89.76 ± 0.71 | 93.50 ± 1.12 | 6.50 ± 1.12 | 46.26 ± 12.21 | 34.43 ± 11.69 | 33.74 ± 9.29 | 33.17 ± 10.27
LR | 3 | 90.81 ± 0.22 | 90.81 ± 0.22 | 9.19 ± 0.22 | 4.00 ± 4.00 | 0.92 ± 0.25 | 96.00 ± 4.00 | 1.81 ± 0.49
RF | 3 | 93.14 ± 0.24 | 96.03 ± 0.27 | 3.97 ± 0.27 | 36.51 ± 1.39 | 61.03 ± 2.22 | 63.49 ± 1.39 | 62.15 ± 1.47
CNN | 3 | 95.16 ± 0.27 | 97.18 ± 0.22 | 2.82 ± 0.22 | 23.35 ± 3.13 | 73.68 ± 2.40 | 76.65 ± 3.13 | 74.79 ± 1.11
SVM | 4 | 95.28 ± 1.88 | 98.42 ± 0.61 | 1.58 ± 0.61 | 20.89 ± 10.60 | 83.34 ± 6.79 | 79.11 ± 10.60 | 77.68 ± 5.86
MLP | 4 | 93.12 ± 0.93 | 96.50 ± 1.31 | 3.50 ± 1.31 | 31.47 ± 8.98 | 61.83 ± 15.86 | 48.53 ± 12.88 | 54.32 ± 14.18
LR | 4 | 93.60 ± 0.20 | 93.55 ± 0.20 | 6.45 ± 0.20 | 4.41 ± 0.72 | 28.50 ± 1.02 | 95.59 ± 0.72 | 43.87 ± 1.24
RF | 4 | 96.31 ± 0.16 | 97.96 ± 0.14 | 2.04 ± 0.14 | 20.92 ± 1.19 | 78.82 ± 1.58 | 79.08 ± 1.19 | 78.92 ± 1.16
CNN | 4 | 98.34 ± 0.09 | 99.09 ± 0.13 | 0.91 ± 0.13 | 9.02 ± 1.56 | 91.06 ± 1.20 | 90.98 ± 1.56 | 90.95 ± 0.47
SVM | 5 | 96.14 ± 0.22 | 97.35 ± 0.31 | 2.65 ± 0.31 | 20.33 ± 3.08 | 67.89 ± 4.18 | 79.67 ± 3.08 | 72.88 ± 2.46
MLP | 5 | 93.88 ± 0.97 | 95.47 ± 0.96 | 4.53 ± 0.96 | 28.15 ± 9.10 | 43.34 ± 13.30 | 51.85 ± 14.19 | 46.93 ± 13.49
LR | 5 | 94.31 ± 0.26 | 94.60 ± 0.31 | 5.40 ± 0.31 | 15.11 ± 2.45 | 32.81 ± 1.38 | 84.89 ± 2.45 | 47.24 ± 1.45
RF | 5 | 94.54 ± 0.11 | 96.29 ± 0.14 | 3.71 ± 0.14 | 31.71 ± 2.48 | 55.16 ± 1.02 | 68.29 ± 2.48 | 60.98 ± 1.52
CNN | 5 | 96.60 ± 0.16 | 98.26 ± 0.30 | 1.74 ± 0.30 | 21.74 ± 1.02 | 79.88 ± 3.83 | 78.26 ± 1.02 | 78.85 ± 1.86
SVM | 6 | 93.02 ± 0.29 | 93.31 ± 0.45 | 6.69 ± 0.45 | 12.15 ± 7.53 | 10.38 ± 6.45 | 27.85 ± 17.09 | 15.02 ± 9.25
MLP | 6 | 91.72 ± 0.40 | 94.18 ± 0.42 | 5.82 ± 0.42 | 47.28 ± 11.88 | 24.43 ± 7.14 | 32.72 ± 8.27 | 27.27 ± 7.18
LR | 6 | 92.60 ± 0.18 | 92.60 ± 0.18 | 7.40 ± 0.18 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00
RF | 6 | 91.65 ± 0.09 | 95.40 ± 0.12 | 4.60 ± 0.12 | 56.62 ± 0.99 | 42.28 ± 0.72 | 43.38 ± 0.99 | 42.80 ± 0.73
CNN | 6 | 94.50 ± 0.38 | 97.32 ± 0.16 | 2.68 ± 0.16 | 40.41 ± 2.32 | 63.55 ± 2.63 | 59.59 ± 2.32 | 61.29 ± 1.77
SVM | 7 | 95.29 ± 0.11 | 95.77 ± 0.07 | 4.23 ± 0.07 | 16.10 ± 2.12 | 44.92 ± 0.91 | 83.90 ± 2.12 | 58.49 ± 1.14
MLP | 7 | 94.49 ± 0.63 | 95.59 ± 0.93 | 4.41 ± 0.93 | 23.75 ± 6.90 | 41.76 ± 13.25 | 56.25 ± 14.50 | 46.82 ± 13.46
LR | 7 | 92.96 ± 0.13 | 92.98 ± 0.13 | 7.02 ± 0.13 | 11.34 ± 3.33 | 5.38 ± 0.54 | 88.66 ± 3.33 | 10.12 ± 0.95
RF | 7 | 94.42 ± 0.19 | 95.90 ± 0.19 | 4.10 ± 0.19 | 32.83 ± 1.93 | 47.27 ± 2.70 | 67.17 ± 1.93 | 55.41 ± 2.48
CNN | 7 | 96.95 ± 0.16 | 98.15 ± 0.09 | 1.85 ± 0.09 | 20.91 ± 1.47 | 74.07 ± 1.54 | 79.09 ± 1.47 | 76.50 ± 1.51
SVM | 8 | 90.96 ± 3.98 | 97.04 ± 0.33 | 2.96 ± 0.33 | 44.30 ± 18.58 | 27.64 ± 9.79 | 55.70 ± 18.58 | 19.85 ± 5.18
MLP | 8 | 95.99 ± 0.12 | 96.27 ± 0.19 | 3.73 ± 0.19 | 25.59 ± 16.73 | 2.95 ± 1.65 | 34.41 ± 19.19 | 4.67 ± 2.47
LR | 8 | 96.19 ± 0.16 | 96.19 ± 0.16 | 3.81 ± 0.16 | 0.00 ± 0.00 | 0.41 ± 0.25 | 40.00 ± 24.49 | 0.81 ± 0.50
RF | 8 | 96.54 ± 0.15 | 97.41 ± 0.13 | 2.59 ± 0.13 | 41.53 ± 2.39 | 33.95 ± 1.20 | 58.47 ± 2.39 | 42.85 ± 1.15
CNN | 8 | 97.76 ± 0.23 | 98.34 ± 0.12 | 1.66 ± 0.12 | 22.20 ± 5.28 | 56.33 ± 3.00 | 77.80 ± 5.28 | 65.04 ± 3.42
SVM | 9 | 99.44 ± 0.04 | 99.55 ± 0.03 | 0.45 ± 0.03 | 3.18 ± 0.59 | 90.40 ± 0.72 | 96.82 ± 0.59 | 93.49 ± 0.42
MLP | 9 | 98.29 ± 0.68 | 98.78 ± 0.68 | 1.22 ± 0.68 | 8.52 ± 6.47 | 70.87 ± 17.79 | 71.48 ± 18.88 | 70.62 ± 17.96
LR | 9 | 99.37 ± 0.05 | 99.37 ± 0.04 | 0.63 ± 0.04 | 0.75 ± 0.36 | 86.48 ± 0.95 | 99.25 ± 0.36 | 92.42 ± 0.61
RF | 9 | 98.96 ± 0.10 | 99.90 ± 0.05 | 0.10 ± 0.05 | 17.81 ± 1.46 | 97.87 ± 1.09 | 82.19 ± 1.46 | 89.33 ± 1.13
CNN | 9 | 98.83 ± 0.06 | 99.94 ± 0.03 | 0.06 ± 0.03 | 20.72 ± 1.34 | 98.63 ± 0.60 | 79.28 ± 1.34 | 87.88 ± 0.88
Table 3: Evaluation metrics (in %) on individual disease classification. The evaluation metrics ACC, TNR, FPR and FNR refer to accuracy, true negative rate, false positive rate and false negative rate. The mapping between label indices and diseases is: 0-coronary artery disease, 1-hemorrhage, 2-pneumonia, 3-myocardial infarction, 4-gastrointestinal bleeding, 5-fracture, 6-aortic stenosis, 7-cardiac failure, 8-prematurity, 9-stroke. The standard errors, computed from the 5 sets of results obtained for each model with different random splits, are appended after the metric values with a ± sign.

3.6 Results on individual diagnosis classification

The results on individual diagnosis categories are shown in Table 3. A bar chart comparing the F1 scores of the five models is provided in Figure 5.

3.7 T-test on performance across models

In addition, a t-test assuming unequal variances was performed between the weighted average F1 score of the CNN and that of each baseline model. The p-values for the comparisons with SVM, random forest, MLP and logistic regression are 1.20e-4, 3.05e-06, 6.81e-2 and 1.81e-08 respectively. The values for SVM, random forest and logistic regression are far below 0.05, indicating significant differences in performance. The value for MLP (6.81e-2) lies slightly above 0.05, which is consistent with the large variance of the MLP results across random splits (Table 2).
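A t-test with unequal variances (Welch's t-test) can be computed from summary statistics alone. The sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom from the weighted F1 values in Table 2, under the assumption that the reported ± values are standard errors of the mean over 5 runs; the paper does not detail its exact procedure, and the function name is hypothetical. Turning the statistic into a p-value requires the Student t distribution, which a statistics library would provide and which is not shown here.

```python
import math


def welch_t_from_se(mean_a, se_a, mean_b, se_b, n_a=5, n_b=5):
    """Welch t statistic and degrees of freedom from means and standard errors."""
    var_sum = se_a ** 2 + se_b ** 2                 # variance of the difference
    t = (mean_a - mean_b) / math.sqrt(var_sum)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = var_sum ** 2 / (se_a ** 4 / (n_a - 1) + se_b ** 4 / (n_b - 1))
    return t, df


# Weighted F1 of the CNN versus random forest, values taken from Table 2.
t_rf, df_rf = welch_t_from_se(80.48, 0.41, 67.77, 0.72)
# Weighted F1 of the CNN versus MLP: the large MLP error shrinks the statistic.
t_mlp, df_mlp = welch_t_from_se(80.48, 0.41, 53.15, 11.01)
```

Although the gap in mean weighted F1 is larger for MLP than for random forest, the MLP comparison yields a much smaller t statistic because of the variability of the MLP runs.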

3.8 CNN filter visualization

To visualize the patterns learned by each filter category in the convolutional layer of the proposed CNN model, we ranked all phrases in the testing dataset that are scanned by the filter window of the convolutional layer by their activation scores in descending order. The top 10 phrases for 2 randomly selected filters per filter size, from 3 to 5, are shown in Table 4.

filter 1 of trigram | filter 2 of trigram
aortic stenosis dr | hyperlipidemia degenerative joint
aortic stenosis noted | hyperlipidemia obesity tobacco
aortic stenosis referred | hyperlipidemia s p
aortic stenosis bicuspid | hyperlipidemia not known
aortic stenosis hepatitis | hyperlipidemia complete heart
aortic stenosis valve | hyperlipidemia percutaneous coronary
aortic stenosis followed | hyperlipidemia niddm tobacco
aortic stenosis most | hyperlipidemia obesity diabetes
aortic stenosis treated | hyperlipidemia aspirin allergy
aortic stenosis coronary | hyperlipidemia hypertension tobacco

filter 1 of 4-gram | filter 2 of 4-gram
cesarean section delivery membranes | impending respiratory failure injuries
spontaneous vaginal delivery in | developed respiratory failure requiring
prior to delivery besides | hypoxic respiratory failure successfully
by vaginal delivery one | hypoxic respiratory failure discharged
term vaginal delivery surgically | use respiratory failure non
spontaneous vaginal delivery required | copd respiratory failure requiring
section at delivery infant | mother respiratory failure hepatitis
induced vaginal delivery apgar | hypoxic respiratory failure a
spontaneous vaginal delivery mother | pancytopenia respiratory failure s
delivery vaginal delivery apgars | hypoxic respiratory failure transferred

filter 1 of 5-gram | filter 2 of 5-gram
was stable without chest pain | multiple loose watery bm stool
catheterization after developing chest pain | small amount of loose stool
to have substernal chest pain | noticed red blood around stool
flow she had chest pain | her last semi formed stool
patient began having chest pain | guaiac positive with red stool
her bms no chest pain | have black guaiac positive stool
who complained of chest pain | had a well formed stool
and developed sharp chest pain | and passed dark black stool
onset heavy substernal chest pain | for presumed infectious colitis stool
st elevation improved chest pain | dark brown well formed stool
Table 4: Top 10 3-grams, 4-grams and 5-grams ranked in descending order by the activation scores of convolutional filters in the proposed CNN model. Two filters are selected per filter size from 3 to 5.
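The ranking behind Table 4 can be sketched as scoring every n-gram window with one filter and sorting by activation. The toy embeddings and filter below are invented for illustration only, and the function name `top_ngrams` is hypothetical; in the actual model the embeddings and filter weights are learned.

```python
def top_ngrams(tokens, embed, filt, k=10):
    """Rank all h-word phrases in tokens by the activation of one h x d filter."""
    h = len(filt)
    scored = []
    for start in range(len(tokens) - h + 1):
        window = tokens[start:start + h]
        # Activation = dot product between the filter and the window's embeddings.
        score = sum(w * x
                    for filt_row, word in zip(filt, window)
                    for w, x in zip(filt_row, embed[word]))
        scored.append((score, " ".join(window)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings and a bigram filter (illustrative values only).
embed = {
    "the": [0.0, 0.0],
    "aortic": [1.0, 0.0],
    "stenosis": [0.0, 1.0],
    "noted": [0.0, 0.0],
}
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]   # responds to "aortic" then "stenosis"
ranking = top_ngrams(["the", "aortic", "stenosis", "noted"], embed, bigram_filter, k=2)
```

Phrases that share the pattern a filter has learned, such as the many “aortic stenosis ...” trigrams in Table 4, receive similarly high scores and cluster at the top of the ranking.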

4 Discussion

4.1 Overall results

The CNN model we presented achieves, to the best of our knowledge, state-of-the-art prediction performance in discovering complex patterns in unstructured clinical notes for diagnosis decision making. Of the models evaluated, the best performance was achieved by the proposed CNN model, which uses pretrained word embeddings and disease coreference resolution. The proposed model achieved 96.11% accuracy and an 80.48% weighted average F1 score, which is around 13 points higher than the best result from the baseline models. The proposed model also significantly outperformed traditional machine learning models that rely on bag-of-words features and the PCA dimensionality reduction technique. The result suggests that the CNN model is suitable for supporting diagnosis decision making in the presence of complex, noisy and unstructured clinical data while at the same time using fewer layers and parameters than other traditional Deep Network models.

In addition, the CNN model achieved close to 80% precision and recall, whereas all baseline models stayed below 73% and struggled to perform equally well on both measures. Balanced precision and recall matter in medical decision support systems: not only should the majority diseases be detected early and accurately, but rare diseases also need to be identified in a timely manner and with confidence. Detecting rare diseases can be even more important, since they are less likely to be detected and a wrong treatment plan could put the patient’s health or even life at risk.

It is also notable that the Logistic Regression model achieved the lowest false negative rate, which came at the expense of the lowest precision among all models. Intuitively, when the model is not very confident about whether a sample belongs to a rare class, it tends to assign the sample to the majority category, increasing the likelihood of correctly predicting samples labeled with a rare disease class.
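The precision, recall and weighted F1 figures discussed above can be illustrated with a minimal sketch. It assumes that "weighted average F1" means the per-class F1 scores averaged with weights proportional to class support, which the text does not state explicitly; the function names and the toy labels are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support (assumed definition)."""
    support = Counter(y_true)
    scores = per_class_scores(y_true, y_pred)
    return sum(support[c] * scores[c][2] for c in scores) / len(y_true)


# Toy data (hypothetical): a model that always predicts the majority class.
y_true = ["coronary artery disease"] * 8 + ["aortic stenosis"] * 2
y_pred = ["coronary artery disease"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_scores(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8 even though the rare class is never detected (its recall and F1 are 0), which is why accuracy alone says little about rare diseases.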

4.2 Individual results

According to Table 3, the CNN model achieved the highest F1 score on all 10 disease classes. Furthermore, although some models have relatively high F1 scores in categories with a large sample size (e.g., coronary artery disease, hemorrhage, pneumonia), they performed badly on categories with fewer samples (e.g., myocardial infarction, aortic stenosis, prematurity). For example, although the SVM achieved an F1 score of 80.09% for the hemorrhage class, only around 4% lower than that of the CNN model, it was only able to achieve F1 scores of 15.02% on aortic stenosis and 19.85% on prematurity. In contrast, the CNN model performed equally well on all disease classes, regardless of how frequently they appear and are diagnosed in the MIMIC corpus.

Table 3 also indicates that aortic stenosis and prematurity are the hardest disease classes to classify. The highest F1 scores achieved by any of the baseline models for aortic stenosis and prematurity are 42.80% and 42.85%, respectively. Among the baseline models, the logistic regression and the MLP models achieved an F1 score of less than 5% for the prematurity class. In contrast, the CNN model still achieved F1 scores of 61.29% and 65.04% on the same diseases. This further confirms that the proposed CNN model works remarkably well at classifying rare diseases (those that do not appear frequently in the corpus).

The Logistic Regression model had the worst performance among all models with respect to most evaluation metrics. Being essentially a linear model, Logistic Regression usually works well with data that can be separated by a hyperplane. However, the latent features of the clinical notes are complex and are very likely not linearly separable. This explains the poor performance of the Logistic Regression model on the discharge diagnosis prediction problem.

4.3 Understanding the model features

Last but not least, Table 4 shows the top 10 phrases ranked by the activation scores of convolutional filters in the CNN model. We can see from the table that each filter tends to detect a specific pattern. For example, the first filter detects phrases containing the words “aortic stenosis”, while the second filter detects phrases highly related to “hyperlipidemia”. While the 3-grams with the highest activation scores tend to preserve certain words or phrases, the 4-grams and 5-grams with high activation scores are more flexible, although they are still centered around a specific symptom or medical event described in different ways. For example, the first 5-gram filter captures chest pain under various conditions, such as substernal chest pain, sharp chest pain, onset heavy substernal chest pain and st elevation improved chest pain. We conclude that the convolutional filters of our CNN model are indeed extracting critical patterns for diagnosis, such as symptoms, lab tests, diseases, procedures and abnormalities. It is quite impressive how the CNN model learns to mimic a human clinician’s diagnostic decision-making procedure.
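The ranking behind a table of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a filter's activation on an n-gram is the ReLU of the dot product between the filter weights and the concatenated word embeddings of that n-gram, and the 2-dimensional embeddings, the filter weights and the function names are all hypothetical.

```python
def ngram_activations(tokens, embeddings, filter_weights, bias=0.0):
    """Slide one filter over the token sequence and score every n-gram.

    filter_weights holds one weight row per position in the window,
    so its length is the n-gram size of the filter.
    """
    width = len(filter_weights)
    scored = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        score = bias
        for weight_row, token in zip(filter_weights, window):
            vector = embeddings[token]
            score += sum(w * x for w, x in zip(weight_row, vector))
        # ReLU non-linearity (an assumption about the activation function).
        scored.append((max(score, 0.0), " ".join(window)))
    return scored


def top_k_ngrams(tokens, embeddings, filter_weights, k=10):
    """Return the k n-grams with the highest activation, in descending order."""
    scored = ngram_activations(tokens, embeddings, filter_weights)
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


# Toy setup (hypothetical): only "chest" and "pain" have non-zero embeddings,
# and the 2-gram filter is tuned to the sequence "chest pain".
tokens = "patient began having chest pain today".split()
toy_embeddings = {token: [0.0, 0.0] for token in tokens}
toy_embeddings["chest"] = [1.0, 0.0]
toy_embeddings["pain"] = [0.0, 1.0]
toy_filter = [[1.0, 0.0], [0.0, 1.0]]

best = top_k_ngrams(tokens, toy_embeddings, toy_filter, k=1)
```

Here `best` holds the single window the toy filter responds to most strongly, the phrase "chest pain". In a trained model the same ranking would be run over all n-grams in the corpus for each learned filter.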

5 Future work

A first direction of future work is to train a CNN model that can distinguish among more than 10 diseases. The fact that our CNN model uses only 10 classes in this work (corresponding to the 10 most frequently mentioned diseases in the MIMIC corpus) by no means implies that it is not generalizable. When training with more than 10 classes, we still have to address the problem that some diseases appear and are diagnosed significantly more often than others. To this end, we can use a slightly different loss function, in which the loss terms pertaining to different diseases are weighted in inverse proportion to the frequency of the respective diseases (penalizing mis-predictions of rare diseases more than mis-predictions of common diseases), and the total loss is the sum of these weighted losses.
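One possible form of such a loss is sketched below, as an illustration under stated assumptions rather than a tested design. It assumes a cross-entropy loss and a particular normalization of the weights (a perfectly balanced dataset would give every class weight 1); the function names and the toy counts are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class in inverse proportion to its frequency.

    Normalized (by assumption) so that balanced data gives weight 1 per class.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {c: total / (num_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Sum of per-sample cross-entropy terms, each scaled by its class weight.

    probabilities is a list of dicts mapping class name to predicted probability.
    """
    eps = 1e-12  # guards against log(0)
    return sum(
        -weights[y] * math.log(max(p[y], eps))
        for p, y in zip(probabilities, labels)
    )


# Toy data (hypothetical): nine pneumonia notes and one aortic stenosis note.
labels = ["pneumonia"] * 9 + ["aortic stenosis"]
weights = inverse_frequency_weights(labels)

# The same mistake (probability 0.1 on the true class) on one note of each class.
loss_common = weighted_cross_entropy([{"pneumonia": 0.1}], ["pneumonia"], weights)
loss_rare = weighted_cross_entropy([{"aortic stenosis": 0.1}], ["aortic stenosis"], weights)
```

With these counts the rare class receives nine times the weight of the common class, so an equally wrong prediction on a rare-disease note contributes nine times as much to the total loss.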

A second direction of future work is to extend our CNN model so that it performs multi-label classification, predicting more than one disease per clinical note, or multi-task classification, predicting, in addition to the disease itself, significant factors such as the likelihood of mortality or the severity level of the disease.
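The difference between the single-label setting and a multi-label extension can be sketched at the output layer. This is only one common way to do it, assumed here for illustration: a single-label classifier picks the class with the highest score, whereas a multi-label classifier applies an independent sigmoid to each class score and keeps every class above a threshold. The scores, the threshold of 0.5 and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_single_label(logits, classes):
    """Single-label prediction: the one class with the highest score."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return classes[best_index]


def predict_multi_label(logits, classes, threshold=0.5):
    """Multi-label prediction: every class whose sigmoid score passes the threshold."""
    return [c for c, z in zip(classes, logits) if sigmoid(z) >= threshold]


# Toy output scores (hypothetical) for one admission note.
classes = ["pneumonia", "coronary artery disease", "hemorrhage"]
logits = [2.0, 1.5, -3.0]

single = predict_single_label(logits, classes)
multi = predict_multi_label(logits, classes)
```

For these scores the single-label rule returns only "pneumonia", while the multi-label rule returns both "pneumonia" and "coronary artery disease".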

A third direction of future work would be to extend our CNN model so that it supports hierarchical disease prediction. A model that, in addition to predicting pneumonia, can specify the type of pneumonia (for example, aspiration pneumonia, bacterial pneumonia, hospital-acquired pneumonia, or community-acquired pneumonia) could provide a much more useful tool for diagnosis decision making. The performance and success of such a model depend heavily on training on a large corpus that contains adequate samples per disease type and subtype.

A fourth direction of future work is to train our model on a much larger corpus of clinical notes, in the hope that the benefits of using pretrained word embeddings to initialize the CNN model will become more evident. The word embeddings used by our model were trained on the MIMIC dataset, which contains only 50,000 documents, whereas state-of-the-art word-embedding models are usually trained on millions or billions of documents.

6 Conclusion

We have presented a novel data-driven technique for diagnosis prediction that exploits convolutional neural networks’ ability to learn latent features. Unlike many existing works, such as (Bond et al., 2012), (Grady and Berkowitz, 2011), (Ebell, 2010), and (Achour et al., 2001), we did not use any human-designed rules (based on prior medical knowledge) for learning features from clinical notes. Our approach is flexible in the sense that it can be easily adapted to other datasets or employed in various clinical settings where data availability, characteristics, format and statistical distribution of text vary. Moreover, the fact that it is based only on a subset of the information available at admission time allows our method to integrate well with the clinical workflow and provide timely feedback to the clinician. The efficiency is further improved by the fact that our models can be trained end-to-end, without a specific need for fine-tuning hyperparameters of individual components.


  • dia (April 17, 2014) About diagnostic error. Society to improve diagnosis in medicine, April 17, 2014.
  • Achour et al. (2001) Soumeya L Achour, Michel Dojat, Claire Rieux, Philippe Bierling, and Eric Lepage. A UMLS-based knowledge acquisition tool for rule-based clinical decision support system development. Journal of the American Medical Informatics Association, 8(4):351–360, 2001.
  • Anthony T. DiPietro (May 1, 2012) Anthony T. DiPietro, Esq. Types of medical diagnostic errors. Medical Malpractice: Misdiagnosis and Delayed Diagnosis, page 2, May 1, 2012.
  • Bond et al. (2012) William F Bond, Linda M Schwartz, Kevin R Weaver, Donald Levick, Michael Giuliano, and Mark L Graber. Differential diagnosis generators: an evaluation of currently available computer programs. Journal of general internal medicine, 27(2):213–219, 2012.
  • Ebell (2010) Mark Ebell. AHRQ white paper: use of clinical decision rules for point-of-care decision support. Medical Decision Making, 30(6):712–721, 2010.
  • Grady and Berkowitz (2011) Deborah Grady and Seth A Berkowitz. Why is a good clinical prediction rule so hard to find? Archives of internal medicine, 171(19):1701–1702, 2011.
  • Hardeep Singh (April 17, 2014) Hardeep Singh, Ashley N D Meyer, and Eric J Thomas. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. BMJ Quality & Safety, April 17, 2014.
  • Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific data, 3, 2016.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • Salton et al. (1975) Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
  • Sinsky et al. (2016) Christine Sinsky, Lacey Colligan, Ling Li, Mirela Prgomet, Sam Reynolds, Lindsey Goeders, Johanna Westbrook, Michael Tutty, and George Blike. Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties. Annals of Internal Medicine, 165(11):753–760, 2016.