Detecting and interpreting myocardial infarctions using fully convolutional neural networks

06/18/2018 ∙ by Nils Strodthoff, et al. ∙ University Medical Center Schleswig-Holstein

We consider the detection of myocardial infarction in electrocardiography (ECG) data as provided by the PTB ECG database without non-trivial preprocessing. The classification is carried out using deep neural networks in a comparative study involving convolutional as well as recurrent neural network architectures. The best architecture, an ensemble of fully convolutional architectures, beats state-of-the-art results on this dataset and reaches 93.3% sensitivity and 89.7% specificity evaluated with 10-fold cross-validation, which is the performance level of human cardiologists for this task. We investigate questions relevant for clinical applications such as the dependence of the classification results on the considered data channels and the considered subdiagnoses. Finally, we apply attribution methods to gain an understanding of the network's decision criteria on an exemplary basis.

I Introduction

Ischaemic heart diseases are the leading cause of death in Europe. The most prominent entity of this group is acute myocardial infarction (MI), where the blood supply to parts of the heart muscle is permanently interrupted due to an occluded coronary artery. Early detection is crucial for the effective treatment of acute myocardial infarction with percutaneous coronary intervention (PCI) or coronary artery bypass surgery. Myocardial infarction is usually diagnosed with the help of clinical findings, laboratory results, and electrocardiography. ECGs are produced by recording electrical potentials of defined positions of the body surface over time, representing the electric activity of the heart. Deviations from the usual shape of the ECG curves can be indicative of myocardial infarction as well as many other cardiac and non-cardiac conditions. ECGs are a popular diagnostic tool as they are non-invasive and inexpensive to produce but provide a high diagnostic value.

Clinically, cases of myocardial infarction fall into one of two categories, ST-elevation myocardial infarction (STEMI) and non-ST-elevation myocardial infarction (NSTEMI), depending on whether or not the ECG exhibits a specific ECG-sign called ST elevation. The former can and should be treated as soon as possible with PCI, whereas the NSTEMI diagnosis has to be confirmed with time-costly laboratory tests before specific treatment can be initiated [1]. Since waiting for these results can delay effective treatment by hours, a more detailed analysis of the ECG could speed up this process significantly.

Failure to identify high-risk ECG findings in the emergency department is common and has grave consequences [2]. To increase accuracy, speed and economic efficiency, different algorithms have been proposed to automatically detect myocardial infarction in recorded ECGs. Algorithms with adequate performance would offer significant advantages: Firstly, they could be applied by untrained personnel in situations where no cardiologist is available. Secondly, once set up they would be highly reliable and inexpensive. Thirdly, they could be tuned to specific decision boundaries, for example to high sensitivity (at the cost of lower specificity) for screening purposes.

Common ECG classification algorithms usually mimic the approach a human physician would take: First, preprocessing steps include the correction of baseline deviations, noise reduction and the segmentation of single heartbeats. In the next step, hand-engineered features such as predefined or automatically detected time intervals and voltage values are extracted from the preprocessed signal. Finally, the classification is carried out with a variety of common classifiers such as simple cutoff values, support vector machines or neural networks. Preprocessing and feature extraction are non-trivial steps with technical and methodological problems, especially with unusual heart rhythms or corrupted data, resulting in a high risk of information loss. This calls for a more unified and less biased algorithmic approach to this problem.

Deep neural networks [3, 4] and in particular convolutional neural networks have been the driving force behind the tremendous advances in computer vision [5, 6, 7, 8, 9, 10] in recent years. Consequently, related methods have also been applied to the problem of time series classification in general and ECG classification tasks specifically. Even though we focus exclusively on ECG classification in this work, we stress that the methodology put forward here can be applied to generic time series classification problems, in particular to those that satisfy the following conditions: data (a) with multiple aligned channels, (b) that arises as a continuous sequence (i.e. with no predefined start/end points in the sequence) and that exhibits a degree of periodicity, and (c) that is fed to the algorithm in unprocessed/unsegmented form. These three criteria define a subclass within general time series classification problems that is important for many real-world problems. In particular, these criteria include raw sensor data from medical monitoring such as ECG or EEG.

The main contributions presented in this paper are the following:

  1. We put forward a fully convolutional neural network for myocardial infarction detection on the PTB dataset [11, 12] focusing on the clinically most relevant case of 12 leads. It outperforms state-of-the-art literature approaches [13, 14] and reaches the performance level of human cardiologists reported in an earlier comparative study [15].

  2. We study in detail the classification performance on subdiagnoses and investigate channel selection and its clinical implications.

  3. We apply state-of-the-art attribution methods to investigate the patterns underlying the network’s decision and draw parallels to cardiologists’ rules for identifying myocardial infarction.

II Related works

Concerning time series classification in general, we focus on time series classification using deep neural networks and do not discuss traditional methods in detail, see e.g. [16] for a recent review. Hüsken and Stagge [17] use recurrent neural networks for time series classification. Wang et al. [18] use different, mainly convolutional, networks for time series classification and achieve state-of-the-art results in comparison to traditional methods applied to the UCR Time Series Classification Archive datasets [19]. Cui et al. [20] use a sliding window approach similar to the one applied in this work and feed differently downsampled series into a multi-scale convolutional neural network, also reaching state-of-the-art results on UCR datasets. Also, recurrent neural networks have been successfully applied to time series classification problems in the clinical context [21]. More recent works include attention [22] and more elaborate combinations of convolutional and recurrent architectures [23, 24]. For a more detailed account of deep learning methods for time series classification, we refer the reader to the recent review [25].

Concerning time series classification on ECG signals, the two main areas of work are the detection of either arrhythmia or infarction. It is beyond the scope of this work to review the rich body of literature on classification of ECG signals using algorithmic approaches, in particular those involving neural networks, see e.g. [26] for a classic review. While the literature on myocardial infarction is covered in detail below, we want to briefly mention some more recent works [27, 28, 29, 30] in the broad field of arrhythmia detection. For further references we refer the reader to the recent reviews [31, 32]. At this point we also want to highlight [27], where the authors trained a convolutional neural network on an exceptionally large custom dataset reaching human-level performance in arrhythmia detection.

More specifically, turning to myocardial infarction detection in ECG recordings, many proposed algorithms rely on classical machine learning methods for classification after initial preprocessing and feature extraction [33, 34, 35, 36, 37, 38, 39]. Of particular note are [40, 13], which operate on wavelet-transformed signals. Whereas the above works used neural networks at most as a classifier on top of previously extracted features [41, 36, 42, 37], there are works that apply neural networks directly as feature extractors to beat-level separated ECG signals [43, 44]. These have to be distinguished from approaches such as the one considered in this work, where deep neural networks are applied to the raw ECG with at most minor preprocessing steps. In this direction Zheng et al. [45] present an approach based on convolutional neural networks for multichannel time series classification similar to ours but apply it to ECGs in the context of congestive heart failure classification. Also [28, 46, 29] use a related approach for arrhythmia/coronary artery disease detection in a single-channel setting. The most recent work on myocardial infarction detection using deep neural networks [14] also uses convolutional architectures applied to three-channel input data. A quantitative comparison to their results is presented in Sec. V. Similar performance was reported in [47], which used LSTMs on augmented channel data obtained from a generative model.

III Dataset and medical background

The best-known collection of standard datasets for time series classification is provided by the UCR Time Series Classification Archive [19]. Although many benchmarks are available for the contained datasets [16], we intentionally decided in favor of a different dataset, as the UCR datasets contain only comparably short and not necessarily periodic sequences and are almost exclusively single-channel data. The same applies to various benchmark datasets [19, 48, 49] considered for example in [24], which do not match the criteria put forward in the introduction. In particular the requirement of continuous data with no predefined start and end points that show a certain degree of periodicity is rarely met by existing datasets, especially not in combination with the other two requirements from above.

We advocate in-depth studies of more complex datasets that are more representative for real-world situations and therefore concentrate our study on ECG data provided by the PTB Diagnostic ECG Database [11, 12]. It is one of the few freely available datasets that meet the conditions from above. The dataset comprises 549 records from 290 subjects. For this study we only aim to discriminate between healthy control and myocardial infarction. Therefore, we only take into account records classified as either of these two diagnosis classes. We excluded 22 ECGs from 21 patients with unknown localization and infarction status from our analysis.

For some patients classified as myocardial infarction the dataset includes multiple records of highly variable age and in some cases even ECGs recorded after the medical intervention. The most conservative choice would be to keep only myocardial infarction ECGs recorded before the intervention and within a preferably short time after the infarction. Such a dataset would be most representative for the detection of acute myocardial infarction in a clinical context, but it would seriously reduce the already small dataset. As a compromise, we decided to keep all healthy records but just the first ECG from patients with myocardial infarction. Note that a selection based on ECG age is not applicable here as the full metadata is not provided for all records. For the ECGs where the full metadata is provided this selection leads to a median (interquartile range) of the infarction age of 2.0 (4.5) days with 14% of them taken after intervention. In contrast, including all infarction ECGs would result in a higher median infarction age and a larger share of ECGs taken after intervention, which renders this second, most commonly employed, selection questionable for an acute infarction detection problem. In summary, our selection leaves us with a dataset of 127 records classified as myocardial infarction and 80 records (from 52 patients) classified as healthy control. Demographical and statistical information on the dataset using the selection criteria from above is compiled in Tab. I.
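The selection rule described above (all healthy records, only the first ECG of each infarction patient) can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary keys `patient`, `diagnosis` and `order` are hypothetical names for metadata one would extract from the database, and `order` is assumed to encode the acquisition order of a patient's records.

```python
def select_records(records):
    """Keep every healthy-control record and the earliest MI record per patient.

    `records` is a list of dicts with the (hypothetical) keys
    'patient', 'diagnosis' ('MI', 'HC' or anything else) and 'order'.
    Records with any other diagnosis are dropped.
    """
    selected = []
    first_mi = {}  # patient id -> earliest MI record seen so far
    for rec in records:
        if rec["diagnosis"] == "HC":
            selected.append(rec)
        elif rec["diagnosis"] == "MI":
            best = first_mi.get(rec["patient"])
            if best is None or rec["order"] < best["order"]:
                first_mi[rec["patient"]] = rec
    selected.extend(first_mi.values())
    return selected


demo = [
    {"patient": "p1", "diagnosis": "MI", "order": 2},
    {"patient": "p1", "diagnosis": "MI", "order": 1},
    {"patient": "p2", "diagnosis": "HC", "order": 1},
    {"patient": "p2", "diagnosis": "HC", "order": 2},
    {"patient": "p3", "diagnosis": "other", "order": 1},
]
kept = select_records(demo)
```

In the toy input, both healthy records of `p2` survive, while only the earlier of the two infarction records of `p1` is kept.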

| quantity | MI | HC | all |
|---|---|---|---|
| # patients | 127 | 52 | 179 |
| sex: male/female | 92/35 | 39/13 | 131/48 |
| age [years]: median (iqr)/nans | 61.0 (16.0)/0 | 38.4 (24.0)/6 | 57.0 (19.0)/6 |
| MI: untreated/treated/nans | 81/18/28 | - | - |
| MI age [days]: median (iqr)/nans | 2.0 (4.5)/4 | - | - |

TABLE I: Demographical/statistical information on the selected records from the PTB dataset. Abbreviations: MI: myocardial infarction; HC: healthy control; iqr: interquartile range; nans: records with no information provided.

| subdiagnosis/localization | # patients | # samples (selected) |
|---|---|---|
| anterior | 17 | 47 (17) |
| antero-septal | 27 | 77 (27) |
| antero-septo-lateral | 1 | 2 (1) |
| antero-lateral | 16 | 43 (16) |
| lateral | 1 | 3 (1) |
| aMI | 62 | 172 (62) |
| inferior | 30 | 89 (30) |
| infero-posterior | 1 | 1 (1) |
| infero-postero-lateral | 8 | 19 (8) |
| infero-lateral | 23 | 56 (23) |
| posterior | 1 | 4 (1) |
| postero-lateral | 2 | 5 (2) |
| iMI | 65 | 174 (65) |
| Healthy control | 52 | 80 (80) |
| MI | 127 | 346 (127) |
| all | 179 | 426 (207) |

TABLE II: Infarction localization in the PTB ECG Database.

For the case of myocardial infarction the dataset distinguishes different subdiagnoses corresponding to the localization of the infarction, see Tab. II, with smooth transitions between certain subclasses. It is therefore not reasonable to expect to be able to train a classifier that is able to distinguish records into all these subclasses based on the rather small number of records in certain cases. We therefore decided to distinguish just two classes that we colloquially designate as anterior myocardial infarction (aMI) and inferior myocardial infarction (iMI), see Tab. II for a detailed breakdown. This grouping models the most common anatomical variant of myocardial vascular supply with the left coronary artery supplying the regions noted in the aMI group and the right coronary artery supplying those in the iMI group [50]. If not noted otherwise we only use the subdiagnoses information for stratified sampling of records into cross-validation folds and just discriminate between healthy control and myocardial infarction. In Sec. V-B we specifically investigate the impact of the above subdiagnoses on the classification performance. The fact that the inferior and anterior myocardial infarction can be distinguished rather well represents a further a posteriori justification for our assignment.

The PTB Database provides 15 simultaneously measured channels for each record: six limb leads (Einthoven: I, II, III, and Goldberger: aVR, aVL, aVF), six precordial leads (Wilson: V1, V2, V3, V4, V5, V6), and the three Frank leads (vx, vy, vz). As the six limb leads are linear combinations of just two measured voltages (e.g. I and II) we discard all but two limb leads. Frank leads are rarely used in the clinical context. Consequently, in our analysis we only take into account eight leads that are conventionally available in clinical applications and non-redundant (I, II, V1, V2, V3, V4, V5, V6). This is done in spite of the fact that using the full although clinically less relevant set of channels can lead to an even higher classification performance, see the analysis in Sec. V-B3 where the lead selection is discussed in detail.
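The redundancy among the six limb leads follows from Einthoven's law and the Goldberger definitions of the augmented leads. As a small illustration (the function name is ours and not part of any library), the four discarded leads can be recomputed sample by sample from I and II:

```python
def derive_limb_leads(lead_i, lead_ii):
    """Reconstruct III, aVR, aVL and aVF from leads I and II, sample by sample."""
    derived = {"III": [], "aVR": [], "aVL": [], "aVF": []}
    for i, ii in zip(lead_i, lead_ii):
        derived["III"].append(ii - i)           # Einthoven's law: III = II - I
        derived["aVR"].append(-(i + ii) / 2.0)  # aVR = -(I + II) / 2
        derived["aVL"].append(i - ii / 2.0)     # aVL = (I - III) / 2 = I - II / 2
        derived["aVF"].append(ii - i / 2.0)     # aVF = (II + III) / 2 = II - I / 2
    return derived


# Two arbitrary voltage samples (in mV) for leads I and II.
leads = derive_limb_leads([0.2, 0.5], [0.6, 1.0])
```

Because these are fixed linear combinations, a network that sees I and II can in principle represent the remaining limb leads itself, which is the rationale for keeping only eight input channels.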

IV Classifying ECG using deep neural networks

IV-A Algorithmic procedure

As discussed in the previous section, time series classification in a realistic setting has to be able to cope with time series that are so large that they cannot be used as input to a single neural network or that cannot be downsampled to reach this state without losing large amounts of information. At this point two different procedures are conceivable: Either one uses attentional models that allow focusing on regions of interest, see e.g. [23, 24], or one extracts random subsequences from the original time series. For reasons of simplicity and with real-time on-site analysis in mind we explore only the latter possibility, which is only applicable for signals that exhibit a certain degree of periodicity. The assumption underlying this approach is that the characteristics leading to a certain classification are present in every random subsequence. We stress at this point that this procedure does not rely on the identification of beginning and endpoints of certain patterns in the window [51]. This approach can be justified a posteriori with the reasonable accuracies and specificities it achieves. Furthermore, from a medical point of view it is reasonable to assume that ECG characteristics do not change drastically within the time frame of any single recording.

The procedure leaves two hyperparameters: the choice of the window size and an optional downsampling rate to reduce the temporal input dimension for the neural network. As the dataset is not large enough for extensive hyperparameter optimizations we decided to work with a fixed window size of 4 seconds downsampled to an input size of 192 pixels for each sequence. The window size is sufficiently large to capture at least three heartbeats at normal heart rates.
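A minimal sketch of the random-window procedure is given below. Several details are assumptions on our side and are not specified in the text: the sampling rate `fs` (set to 1000 Hz here), the use of plain linear interpolation for downsampling (without any anti-aliasing filter), and the function names themselves.

```python
import random


def resample_linear(x, n_out):
    """Linearly interpolate the sequence x onto n_out evenly spaced positions."""
    n_in = len(x)
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # fractional index into x
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1.0 - frac) * x[lo] + frac * x[hi])
    return out


def random_window(record, fs=1000, window_s=4.0, n_out=192, rng=random):
    """Cut one random window from a multichannel record and downsample it.

    `record` is a list of channels, each a list of samples of equal length.
    The same random offset is used for all channels so they stay aligned.
    """
    n_win = int(round(window_s * fs))
    n_total = len(record[0])
    if n_total < n_win:
        raise ValueError("record is shorter than the requested window")
    start = rng.randrange(n_total - n_win + 1)
    return [resample_linear(ch[start:start + n_win], n_out) for ch in record]


# Toy two-channel record of 10 s length; the second channel is twice the first.
toy_record = [list(range(10000)), [2 * v for v in range(10000)]]
window = random_window(toy_record, rng=random.Random(0))
```

Drawing a fresh window on every call acts as a simple form of data augmentation during training; at test time one could evaluate several windows per record, although how window-level outputs are aggregated is not covered by this sketch.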

As discussed in Sec. III, if we consider a binary classification problem we are dealing with an imbalanced dataset with 80 healthy records in comparison to 127 records diagnosed with myocardial infarction. Several approaches have been discussed in the literature to best deal with imbalance [52, 53]. Here we follow the general recommendations and oversample the minority class of healthy patients by 2:1.
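One simple way to realize a 2:1 oversampling of the minority class is to repeat every healthy record twice, as in the following sketch. The function name and the label strings are illustrative; in a cross-validation setting one would typically apply this to the training folds only.

```python
def oversample_minority(records, labels, minority_label, factor=2):
    """Repeat every minority-class record `factor` times; keep all others once."""
    out_records, out_labels = [], []
    for rec, lab in zip(records, labels):
        repeats = factor if lab == minority_label else 1
        out_records.extend([rec] * repeats)
        out_labels.extend([lab] * repeats)
    return out_records, out_labels


# Class sizes as in the selected dataset: 80 healthy vs. 127 infarction records.
labels = ["HC"] * 80 + ["MI"] * 127
records = list(range(len(labels)))  # placeholder record ids
_, balanced_labels = oversample_minority(records, labels, "HC")
```

With the class sizes quoted above, this yields 160 healthy and 127 infarction entries, i.e. a roughly balanced training set.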

We refrain from using accuracy as target metric as it depends on the ratio of healthy and infarction ECGs under consideration. As sensitivity and specificity are the most common metrics in the medical context, we choose Youden's J-statistic as target metric for model selection, which is determined by the sum of both quantities, i.e.

$$J = \text{sensitivity} + \text{specificity} - 1 = \frac{TP}{TP+FN} + \frac{TN}{TN+FP} - 1, \tag{1}$$

where TP/FN/FP/TN denote true positive/false negative/false positive/true negative classification results. Other frequently considered observables in this context include $F_1$ or $F_\beta$ scores that are defined as combinations of positive predictive value (precision) and sensitivity (recall).
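As a worked example, Youden's J-statistic (sensitivity plus specificity minus one) can be computed directly from the entries of a binary confusion matrix. The counts below are made up for illustration; the second call shows that J, unlike accuracy, does not change when the number of negative examples is scaled.

```python
def youden_j(tp, fn, fp, tn):
    """Return (J, sensitivity, specificity) for a binary confusion matrix."""
    sensitivity = tp / (tp + fn)  # recall on the positive (infarction) class
    specificity = tn / (tn + fp)  # recall on the negative (healthy) class
    return sensitivity + specificity - 1.0, sensitivity, specificity


def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)


j_a, sens_a, spec_a = youden_j(tp=90, fn=10, fp=20, tn=80)
# Same classifier behaviour, but ten times as many healthy examples.
j_b, _, _ = youden_j(tp=90, fn=10, fp=200, tn=800)
acc_a = accuracy(90, 10, 20, 80)
acc_b = accuracy(90, 10, 200, 800)
```

Here both settings give a J of about 0.7, whereas the accuracy moves from 0.85 to roughly 0.81 purely because of the changed class ratio.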

Finally, to obtain the best possible estimate of the test set sensitivity and specificity using the given data, we perform 10-fold cross-validation on the dataset. Its size is comparably small and there are still considerable fluctuations of the final result statistics, even considering the data augmentation via random window selection. These result statistics do not necessarily reflect the variance of the estimator under consideration when applied to unseen data [54] and it is not possible to infer variance information from cross-validation scores by simple means [55]. The given dataset is not large enough to allow a train-validation-test split with reasonable respective sample sizes. Following [56], we circumvent this problem by reporting ensemble scores corresponding to models with different random initializations without performing any form of hyperparameter tuning or model selection using test set data. Compared to single initializations the ensemble score gives a more reliable estimate of the model's generalization performance on unseen data. For calculating the ensemble score we combine five identical models and report the ensemble score formed by averaging the predicted scores after the softmax layer [57].
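Averaging softmax outputs over ensemble members amounts to a few lines of code. The sketch below assumes that each member has already produced class probabilities for the same samples; the nested-list layout and the function name are our own choices, and the probabilities are invented for illustration.

```python
def ensemble_predict(member_probs):
    """Average class probabilities over ensemble members.

    member_probs[m][i][c] is the softmax output of member m for sample i
    and class c. Returns the averaged probabilities and the argmax labels.
    """
    n_models = len(member_probs)
    n_samples = len(member_probs[0])
    n_classes = len(member_probs[0][0])
    averaged = [
        [sum(member_probs[m][i][c] for m in range(n_models)) / n_models
         for c in range(n_classes)]
        for i in range(n_samples)
    ]
    labels = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, labels


# Five members (as in the text), two samples, two classes (0: healthy, 1: MI).
member_probs = [
    [[0.9, 0.1], [0.40, 0.60]],
    [[0.8, 0.2], [0.60, 0.40]],
    [[0.7, 0.3], [0.30, 0.70]],
    [[0.6, 0.4], [0.45, 0.55]],
    [[0.5, 0.5], [0.20, 0.80]],
]
avg_probs, ensemble_labels = ensemble_predict(member_probs)
```

In the second sample one member votes for the healthy class on its own, but the averaged probability still favours infarction, which illustrates how the ensemble smooths over individual initializations.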

IV-B Investigated architectures

We investigate both convolutional neural networks as well as recurrent neural network architectures. While recurrent neural networks seem to be the most obvious choice for time series data, see e.g. [17], convolutional architectures have been applied for similar tasks in early days, see e.g. [58] for applications in phoneme recognition.

We study different variants of convolutional neural networks inspired by several successful architectures applied in the image domain such as fully convolutional networks [10] and resnets [7, 59, 60], see App. A for details. In addition to architectures that are applied directly to the (downsampled) time series data, we also investigate the effect of incorporating frequency-domain input data obtained by applying a Fourier transform to the original time-domain data. We stress again that our approach operates directly on the (downsampled) input data without any preprocessing steps.

For comparison we also consider recurrent neural networks, namely LSTM [61] cells. We investigate two variants: In the first approach we feed the last LSTM output into a fully connected layer. In the second case we additionally apply a time-distributed dense layer, i.e. a layer with shared weights across all time steps, to train the network in addition on a time series prediction task, where we adjusted both loss functions to reach similar values. Similar to [17] we investigate in this way if the time series prediction task improves the classification accuracy.

V Results

V-A Network architectures

In Tab. III we compare the architectures described in the previous section based on 12-lead data. The comparison is based on cross-validated J-statistics without implying statistical significance of our findings. In light of the very small number of 20 or even fewer patients in the respective test sets, we do not report confidence intervals or similar variance measures as these would be mainly driven by the fluctuations due to the small size of the test set, see also [54, 55, 62] and the related discussion in Sec. IV-A.

The fully convolutional architecture and the resnet achieve similar performance applied to time-domain data. In contradistinction to an earlier investigation [18] that favored the fully convolutional architecture, a ranking of the two convolutional architectures is not possible on the given data. Interestingly, the convolutional architectures perform better applied to raw time-domain data than applied to frequency-domain data. In this context it might be instructive to investigate also other transformations of the input data such as Wavelet transformations as considered in [40, 13].

Both convolutional architectures show a better score than recurrent architectures. This can probably be attributed to the fact that we report just results with standard LSTMs and do not investigate more advanced mechanisms, most notably attention mechanisms, see e.g. [22, 23, 24] for recent developments in this direction. Training the recurrent neural network jointly on a classification task as well as on a time series prediction task, see also the description in App. A-B, did not lead to an improved score, whereas a significant increase was reported by [17], which might be related to the small size of the dataset.

| Model | J-Stat | sens. | spec. | prec. |
|---|---|---|---|---|
| Fully Convolutional | 0.827 | 0.933 | 0.897 | 0.936 |
| Resnet | 0.828 | 0.925 | 0.903 | 0.940 |
| Fully Convolutional (freq) | 0.763 | 0.902 | 0.860 | 0.913 |
| Resnet (freq) | 0.656 | 0.870 | 0.786 | 0.869 |
| LSTM mode (final output) | 0.743 | 0.910 | 0.833 | 0.899 |
| LSTM (final output + pred.) | 0.742 | 0.914 | 0.828 | 0.897 |

TABLE III: Classification results for different network architectures on 12-lead data. Abbreviations: sens.: sensitivity=recall; spec.: specificity; prec.: precision=positive predictive value

In the following sections we analyze particular aspects of the classification results in more detail. All subsequent investigations are carried out using the fully convolutional architecture, which achieved essentially the same performance as the best-performing resnet architecture with a much simpler architecture. If not noted otherwise we use the default setup of 12-lead data.

V-B MI localization, benchmarks, and channel selection

V-B1 MI localization and training procedure

As described in Sec. III we distinguish the aggregated subdiagnosis classes aMI and iMI. Here we examine the classification performance of a model that distinguishes these subclasses rather than training just on a common superclass myocardial infarction. We can investigate a number of different combinations of either training/evaluating with or without subdiagnoses as shown in Tab. IV.

| Data | J-Stat | sens. | spec. | prec. |
|---|---|---|---|---|
| cardiologists aMI [15] | 0.857 | 0.874 | 0.983 | - |
| cardiologists iMI [15] | 0.738 | 0.749 | 0.989 | - |
| train MI eval MI | 0.827 | 0.933 | 0.897 | 0.936 |
| train MI eval aMI | 0.877 | 0.980 | 0.897 | 0.884 |
| train MI eval iMI | 0.789 | 0.894 | 0.896 | 0.879 |
| train aMI eval aMI | 0.880 | 0.919 | 0.961 | 0.950 |
| train iMI eval iMI | 0.689 | 0.810 | 0.879 | 0.849 |
| train aMI+iMI eval MI | 0.788 | 0.912 | 0.876 | 0.947 |
| train aMI+iMI eval aMI | 0.846 | 0.966 | 0.881 | 0.906 |
| train aMI+iMI eval iMI | 0.741 | 0.861 | 0.879 | 0.897 |

TABLE IV: Classification performance on subdiagnoses (with fully convolutional architecture) on 12-lead data. Abbreviations as in Tab. III. train MI (train aMI+iMI) refers to training disregarding (incorporating) subdiagnoses and train aMI/iMI to training using just a particular subdiagnosis; analogously for eval.

Both for models trained on unspecific infarction and for models trained using subdiagnosis labels, the performance on the inferior myocardial infarction classification task turns out to be worse than the score achieved for anterior myocardial infarction. The most probable reason for this is that anterior myocardial infarctions show typical signs in most of the Wilson leads because of the proximity of the anterior myocardium to the anterior chest wall. For the more difficult task of iMI classification, the model seems to profit from general myocardial infarction data during training, as a model trained on generic MI achieves a higher score on iMI classification than a model trained specifically on iMI classification only (0.789 vs. 0.689). The converse is true for the simpler task of aMI classification, although the difference is marginal there (0.877 vs. 0.880).

Interestingly, the model trained without subdiagnoses reaches a slightly higher score both for unspecific myocardial infarction classification as well as for classification on subdiagnoses aMI/iMI only, which might just be an effect of an insufficient amount of training data. In any case, we restricted the rest of our investigations to the model trained disregarding subdiagnoses.

Fig. 1: Confusion matrix for model trained on subdiagnoses aMI and iMI.

In Fig. 1 we show the confusion matrix for the model that is trained and evaluated on the subdiagnoses aMI and iMI in addition to healthy control. The confusion matrix underlines the fact that the model is able to discriminate between the aggregated subdiagnoses whose assignments were motivated by medical arguments, see Sec. III, and represents an a posteriori justification for this choice.

The reported score for models evaluated on subdiagnoses allows a comparison to human performance on this task. We base this comparison on a study [15] that assessed the human classification performance for different diagnosis classes based on a panel of eight cardiologists. Here we only report the combined result and refer to the original study for individual results. The most appropriate comparison is the model trained on general MI and evaluated on subdiagnoses aMI and iMI. In this setup the algorithm achieves a slightly higher score on aMI classification (0.877 vs. 0.857) and a considerably higher score on iMI classification (0.789 vs. 0.738), and we therefore consider it justified to claim at least human-level performance on the given classification task. We refrain from drawing further conclusions from this comparison as it depends on the precise performance metric under consideration, the fact that different datasets were used in both studies and differences in subdiagnosis assignments. The claim of superhuman performance on this task would certainly require more thorough investigations in the future.

V-B2 Comparison to literature approaches

Our data and channel selection strategy, see Sec. III, was carefully chosen to reflect the requirements of a clinical application as closely as possible. In addition, considering the comparably small size of the dataset, a careful cross-validation strategy is of utmost importance, see Sec. IV-A. Unfortunately, most literature results do not report cross-validated scores or introduce data leakage in their cross-validation procedures. This can happen for example by sampling from beat-level segmented signals or most commonly by sampling based on ECGs rather than patients, see [13] for a detailed discussion. Both cases lead to unrealistically good performance estimates as the classification algorithm can in some form adapt to structures in the same ECG or structures in a different ECG from the same patient during the training phase.
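To make the distinction between record-level and patient-level sampling concrete, the following sketch assigns whole patients to folds, stratified by (sub)diagnosis, so that no patient contributes records to both the training and the test set. It is an illustration under our own assumptions (round-robin assignment, hypothetical patient ids), not the authors' implementation.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_labels, n_folds=10, seed=0):
    """Map each patient id to a fold index, stratified by the patient's label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pid, lab in sorted(patient_labels.items()):
        by_label[lab].append(pid)
    fold_of = {}
    offset = 0  # continue the round robin across labels to balance fold sizes
    for lab in sorted(by_label):
        pids = by_label[lab]
        rng.shuffle(pids)
        for k, pid in enumerate(pids):
            fold_of[pid] = (offset + k) % n_folds
        offset += len(pids)
    return fold_of


def split_records(record_patients, fold_of, test_fold):
    """Return (train, test) record indices; record_patients[i] is the patient of record i."""
    train = [i for i, pid in enumerate(record_patients) if fold_of[pid] != test_fold]
    test = [i for i, pid in enumerate(record_patients) if fold_of[pid] == test_fold]
    return train, test


# Hypothetical patient ids mirroring the class sizes of Tab. II.
patient_labels = {}
patient_labels.update({f"aMI_{i:03d}": "aMI" for i in range(62)})
patient_labels.update({f"iMI_{i:03d}": "iMI" for i in range(65)})
patient_labels.update({f"HC_{i:03d}": "HC" for i in range(52)})
fold_of = patient_level_folds(patient_labels)

# One record per patient, plus a second record for some healthy patients.
record_patients = sorted(patient_labels) + [f"HC_{i:03d}" for i in range(28)]
train_idx, test_idx = split_records(record_patients, fold_of, test_fold=0)
```

Sampling folds over `record_patients` directly would instead allow two records of the same healthy patient to end up on different sides of the split, which is exactly the kind of leakage discussed above.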

We refrain from presenting results for the latter setups as they do not allow disentangling a model's classification performance from its ability to reproduce already known patterns. Therefore, we only include a comparison to the most recent works [13, 14] that are to our knowledge the only works where a cross-validated score with sampling on patient level, in the literature also termed subject-oriented approach [13], is reported. To ensure comparability with [13, 14] we replicate their setup as closely as possible and modify our data selection to include limb leads and, in addition to healthy records, only genuine inferior myocardial infarction ECGs, but all of them. In this case our approach not only shows superior performance compared to literature results, see Tab. V, but also, unlike their algorithms, operates directly on the input data and does not require preprocessing with appropriate input filters.

| Benchmark | J-Stat | sens. | spec. | prec. |
|---|---|---|---|---|
| Wavelet transform + SVM [13] | 0.583 | 0.790 | 0.793 | 0.803 |
| CNN [14] | 0.694 | 0.853 | 0.841 | - |
| limb leads + inferior MI* | 0.773 | 0.874 | 0.900 | 0.932 |

TABLE V: Comparison to literature results (with fully convolutional architecture). Abbreviations as in Tab. III. Results from this work marked by asterisk.

We replicated the above setup to demonstrate the competitiveness of our approach, but for a number of reasons we are convinced that the scores presented in Tabs. III and IV are the more suitable benchmark results: Firstly, from a clinical point of view 12-lead ECGs are the default choice and the algorithm should be fed with the full set of 8 non-redundant channels. Secondly, the restriction to include only the first infarction ECG per patient is arguably more suited for the clinically most relevant problem of classifying acute myocardial infarctions, see the discussion in Sec. III. Finally, from a machine learning perspective it is beneficial to include all subdiagnoses for training, which allows the model to adapt to general patterns in infarction ECGs, and to only evaluate the trained classifier on a particular subdiagnosis of interest. For the case of iMI this procedure leads to an improved score, see the discussion of Tab. IV.

V-B3 Channel selection

By including different combinations of leads one can estimate the relative amount of information that these channels contribute to the classification decision, see Tab. VI.

| channels | J-Stat | sens. | spec. | prec. |
|---|---|---|---|---|
| all leads | 0.878 | 0.941 | 0.937 | 0.961 |
| 12 leads (default) | 0.827 | 0.933 | 0.897 | 0.936 |
| Frank leads only | 0.803 | 0.930 | 0.873 | 0.923 |
| limb leads only | 0.811 | 0.912 | 0.899 | 0.937 |
| I only | 0.703 | 0.875 | 0.828 | 0.893 |
| II only | 0.695 | 0.907 | 0.787 | 0.874 |
| III only | 0.590 | 0.855 | 0.735 | 0.841 |

TABLE VI: Channel dependence of the classification performance (with fully convolutional architecture). Abbreviations as in Tab. III.

Starting with single-lead classification results, out of leads I, II and III, lead III offers the least amount of information, possibly because its direction coincides worst with the usual electrical axis of the heart. The classification result using Frank leads achieves a score that is slightly worse than the result using limb-leads only. A further performance increase is observed when complementing the limb leads with the Wilson leads towards the standard 12-lead setup. The overall best result is achieved using all channels, which does, however, not correspond to the clinically relevant situation, where conventionally only 12 leads are available.

V-C Interpretability

A general challenge remains the interpretability of machine learning algorithms, and of deep learning approaches in particular, which is especially important for applications in medicine [63]. In the area of deep learning, there has been a lot of progress in this direction [64, 65, 66]. So far most applications have covered computer vision, whereas time series data has received only scarce attention. Interpretability methods have been applied to time series data in [18] and to ECG data in particular in [67]. A different approach towards interpretability in time series was put forward in [68].

As an exploratory study for the application of interpretability methods to time series data we investigate the application of attribution methods to the trained classification model. This allows investigating on a qualitative level if the machine learning algorithm uses similar features as human cardiologists. Our implementation makes use of the DeepExplain framework put forward in [69]. For neural networks with only ReLU activation functions it can be shown [69] that attribution maps from ‘gradient × input’ [70] coincide with attributions obtained via the ε-rule in LRP [71]. Even though we are using ELU activation functions the attribution maps show only minor quantitative differences. The same applies to the comparison to integrated gradients [65]. For definiteness, we focus our discussion on ‘gradient × input’. Different from computer vision, where conventionally attributions of all three color channels are summed up, we keep different attributions for every channel to be able to focus on channel-specific effects. We use a common normalization of all channels to be able to compare attributions across channels.
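To illustrate the idea of a gradient × input attribution that is kept separate for every channel, the following sketch uses a toy bias-free one-hidden-layer ReLU network written in plain Python. It is not the DeepExplain implementation and not the model used in this work; the network, all weights, and the function names `relu_net` and `gradient_times_input` are assumptions made for illustration.

```python
def relu_net(x, W, v):
    """Bias-free one-hidden-layer ReLU network.

    x: list of T rows with C channel values each
    W: list of H weight grids, each of shape T x C
    v: list of H output weights
    Returns the scalar output and the hidden pre-activations.
    """
    n_t, n_c = len(x), len(x[0])
    pre = [sum(Wj[t][c] * x[t][c] for t in range(n_t) for c in range(n_c))
           for Wj in W]
    out = sum(vj * max(p, 0.0) for vj, p in zip(v, pre))
    return out, pre


def gradient_times_input(x, W, v):
    """Attribution a[t][c] = x[t][c] * d(out)/d(x[t][c]), one value per channel."""
    _, pre = relu_net(x, W, v)
    attributions = []
    for t in range(len(x)):
        row = []
        for c in range(len(x[0])):
            # The ReLU derivative is 1 for active hidden units and 0 otherwise.
            grad = sum(vj * Wj[t][c] for vj, Wj, p in zip(v, W, pre) if p > 0)
            row.append(grad * x[t][c])
        attributions.append(row)
    return attributions


# Toy input: 3 time steps, 2 channels (all numbers are made up).
x = [[1.0, -2.0], [0.5, 0.0], [-1.0, 3.0]]
W = [
    [[0.2, -0.1], [0.4, 0.3], [0.0, 0.5]],
    [[-0.3, 0.2], [0.1, -0.2], [0.6, 0.1]],
]
v = [1.0, -0.5]

attr = gradient_times_input(x, W, v)
output, _ = relu_net(x, W, v)
print(attr, output)
```

For this bias-free toy network the attributions sum exactly to the network output, which makes for a convenient sanity check; with bias terms or ELU activations this identity holds only approximately.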

We stress that attributions are inherent properties of the underlying models and can therefore differ already for models with different random initializations in an otherwise identical setup. If the aim is to use attributions to identify typical indicators for a classification decision as a guide for clinicians, a more elaborate study is required. For simplicity, in this exploratory study we focus on a single model rather than the model ensemble. By visual inspection we identified the most typical attribution pattern for myocardial infarction among examples in the batch that occurred shortly after the infarction. Prototypical outcomes of this analysis are presented below.

Fig. 2: Exemplary interpretability analysis for two ECGs with myocardial infarction. The attribution score is color-coded in the background. Red (blue) areas influenced the neural network towards (against) a correct classification as myocardial infarction. See the accompanying text for details.

Fig. 2 shows examples of the interpretability analysis of selected channels of two myocardial infarction ECGs that were correctly classified. As in the clinical context ECGs are always considered in the context of the full set of twelve channels (if available), the complete set of channels is shown in Fig. 4 in the appendix. Take note that the attributions show a high consistency over beats of one ECG, even if a significant baseline shift is present. There is also a reasonable consistency with regard to similar ECG features exhibited by other patients which are not shown here.

ECG A is taken from a 74-year-old male patient one day after the infarction took place. A coronary angiography performed later confirmed an anterior myocardial infarction. ECG B is taken from a 68-year-old male patient one day after the infarction which was later confirmed to be in inferior localization. ECGs A and B are listed as s0021are and s0225lre in the PTB dataset.

Signs of ischemia and infarction are numerous and of variable specificity [72]. Highlighted areas coincide with established ECG signs of myocardial infarction. These are typically found between and including the QRS complex and the T wave, as this is when the contraction and subsequent repolarization of the ventricles take place. ST-segment elevation (STE) is the most important finding in myocardial infarction ECGs and a diagnostic criterion for ST-elevation myocardial infarction (STEMI) [73]. This sign (though not formally significant in every case) and corresponding high attributions can be found in both example ECGs at the positions marked a, e, and g. At the same time, in another channel (position c) there is no STE and the attribution is consequently inverted. The attribution at position a also coincides with pathological Q waves, which also occur in some infarction ECGs. T wave inversion, another common sign of infarction, can be found at position d. Some other attributions of the model are less conclusive. Although attributions at positions b and f fall in the T and/or U waves, i.e. regions that are relevant for the detection of infarction, it is unclear why they influenced the decision against infarction.

Note that the highlighted areas do not necessarily align perfectly with what clinicians would identify as important. For example, for a convolutional neural network to detect an ST-elevation, it must use and compare information from before and after the QRS complex, which most likely results in high attributions to the QRS complex itself and its immediate surroundings rather than to the elevated ST-segment.

Comparing the overall visual impression of the attributions across all channels (see Fig. 4), the model seems to attribute more importance to the Wilson leads in ECG A (anterior infarction) and more importance to the limb leads in ECG B (inferior infarction). This is also where clinicians would expect to find signs of infarction in these cases.

Attributions are inherently model-dependent, and indeed the attributions obtained from different models show quantitative and in some cases even qualitative differences. However, across different folds and different random initializations the attribution corresponding to the STE was always correctly and prominently identified. This is a very encouraging sign for future classification studies on ECG data based on convolutional methods, in particular in combination with attribution methods. A future study could put the qualitative findings presented in this section on a quantitative basis. This would require a segmentation of the data, possibly using another model trained on an annotated dataset as no annotations are available for the PTB dataset, and statistically evaluating attribution scores in conjunction with this information. In this context it would be interesting to see if different classification patterns arise across different models or if they can at least be enforced as in [74].

VI Summary and Conclusions

In this work, we put forward a fully convolutional neural network for myocardial infarction detection evaluated on the PTB dataset. The proposed architecture outperforms the current state-of-the-art approaches on this dataset and reaches a similar level of performance as human cardiologists for this task. We investigate the classification performance on subdiagnoses and identify two clinically well-motivated subdiagnosis classes that can be separated very well by our algorithm. We focus on the clinically most relevant case of 12-lead data and stress the importance of a careful data selection and cross-validation procedure.

Moreover, we present a first exploratory study of the application of interpretability methods in this domain, which is a key requirement for applications in the medical field. These methods can not only help to gain an understanding and thereby build trust in the network’s decision process but could also lead to a data-driven identification of important markers for certain classification decisions in ECG data that might even prove useful to human experts. Here we identified common cardiologists’ decision rules in the network’s attribution maps and outlined prospects for future studies in this direction.

Both such an analysis of attribution maps and further improvements of the classification performance would have to rely on considerably larger databases such as [75] for quantitative precision. This would also allow an extension to further subdiagnoses and other cardiac conditions such as other confounding and non-exclusive diagnoses or irregular heart rhythms.

Appendix A Network architectures

All models were implemented in TensorFlow [76]. As the only preprocessing step we normalize the input by applying batch normalization [77] to all input channels. In all cases we minimize the cross-entropy loss using the Adam optimizer [78] with learning rate 0.001.
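The effect of batch normalization on the input channels can be sketched as a per-channel standardization with statistics pooled over all signals in a batch and all sampling points. The sketch below is plain Python rather than the TensorFlow layer; it omits the learned scale and offset as well as the running averages used at inference time, and the value of `eps` and the function name `normalize_channels` are assumptions.

```python
import math


def normalize_channels(batch, eps=1e-3):
    """Normalize each channel to (approximately) zero mean and unit variance.

    batch: list of signals, each a list of T rows with C channel values.
    Statistics are pooled over all signals and time steps, per channel.
    """
    n_channels = len(batch[0][0])
    values = [[row[c] for signal in batch for row in signal]
              for c in range(n_channels)]
    means = [sum(vals) / len(vals) for vals in values]
    variances = [sum((x - m) ** 2 for x in vals) / len(vals)
                 for vals, m in zip(values, means)]
    return [
        [[(row[c] - means[c]) / math.sqrt(variances[c] + eps)
          for c in range(n_channels)]
         for row in signal]
        for signal in batch
    ]


# Two toy signals with 2 sampling points and 2 channels of very different scale.
batch = [[[1.0, 10.0], [2.0, 20.0]], [[3.0, 30.0], [4.0, 40.0]]]
normalized = normalize_channels(batch)
print(normalized)
```

After this step both channels live on a comparable scale, so no hand-crafted rescaling of individual leads is needed.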

A uniformly sampled ECG signal can be represented as a two-dimensional tensor (sampling points × input channels), as opposed to image data, which is conventionally represented as a three-dimensional tensor (height × width × input channels). Conventional CNN building blocks such as one-dimensional convolutional or pooling layers (operating on a single axis rather than on two axes as in the image case) can be applied to this representation straightforwardly. This approach is predominantly used in the literature, see e.g. [45, 46, 27].

A-A Convolutional architectures

We generally use ELU [79] as the activation function for both convolutional and fully connected layers, without batch normalization [77]; this was reported to lead to a slight performance increase compared to the standard ReLU activation with batch normalization [80]. In the architectures with fully connected layers we apply dropout [81] at a rate of 0.5 to improve the generalization capability of the model. We initialize weights according to [82]. Note that, in contrast to the case of two-dimensional data, a max pooling operation only reduces the number of couplings by a factor of 2 rather than 4, which is then fully compensated by the conventional doubling of the number of filters in the next convolutional layer. To achieve a gradual reduction of couplings we therefore keep the number of filters constant across convolutional layers. We study the following convolutional architectures, which are also depicted in Fig. 3:

  1. A fully convolutional architecture [10] with a final global average pooling layer

  2. A resnet-inspired [7, 59, 60] architecture with skip-connections
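The argument for keeping the number of filters constant in the one-dimensional case can be made concrete with a small counting exercise. The sketch below takes the feature-map size (positions × filters) after each convolution and pooling stage as a proxy for the number of couplings, which is an assumption, as are the starting values of 1024 positions and 32 filters and the function name `feature_map_sizes`.

```python
def feature_map_sizes(positions, filters, n_stages, spatial_dims, double_filters):
    """Feature-map size (positions x filters) after each conv + pool stage.

    Pooling with stride 2 halves every spatial axis, so the number of
    positions shrinks by a factor of 2 ** spatial_dims per stage.
    """
    sizes = []
    for _ in range(n_stages):
        positions //= 2 ** spatial_dims
        if double_filters:
            filters *= 2
        sizes.append(positions * filters)
    return sizes


# 1D data with filter doubling: pooling and doubling cancel each other.
one_d_doubling = feature_map_sizes(1024, 32, 3, spatial_dims=1, double_filters=True)
# 1D data with a constant number of filters: gradual reduction by 2 per stage.
one_d_constant = feature_map_sizes(1024, 32, 3, spatial_dims=1, double_filters=False)
# 2D data with filter doubling: net reduction by 2 per stage as well.
two_d_doubling = feature_map_sizes(1024, 32, 3, spatial_dims=2, double_filters=True)

print(one_d_doubling, one_d_constant, two_d_doubling)
```

Under these assumptions, a constant number of filters in one dimension reproduces the per-stage reduction that filter doubling yields for images.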

We investigate the impact of including frequency information obtained via a Fast Fourier Transformation with , where denotes the sequence length after rescaling. The independent components are used as frequency-domain input data with otherwise unchanged network architectures.

Fig. 3: Convolutional architectures: fully convolutional (left) and resnet (right).
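The defining feature of the fully convolutional variant, a final global average pooling layer over the time axis, can be illustrated in a few lines of plain Python. This is a toy sketch with a single convolutional layer and made-up kernels, not the architecture of Fig. 3; the layer count, kernel size, weights, and function names are assumptions. As in deep learning frameworks, the "convolution" is implemented as a cross-correlation.

```python
import math


def elu(z, alpha=1.0):
    """ELU activation: identity for z > 0, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def conv1d(x, kernels):
    """'Valid' 1D convolution along the time axis followed by ELU.

    x: T rows with C channels; kernels: F filters, each K rows with C channels.
    Returns T - K + 1 rows with F feature values each.
    """
    k, n_c = len(kernels[0]), len(x[0])
    return [
        [elu(sum(kern[i][c] * x[t + i][c] for i in range(k) for c in range(n_c)))
         for kern in kernels]
        for t in range(len(x) - k + 1)
    ]


def global_average_pool(features):
    """Average every feature over the time axis."""
    n = len(features)
    return [sum(row[f] for row in features) / n for f in range(len(features[0]))]


kernels = [
    [[0.5, -0.5], [0.25, 0.25]],  # filter 0 (made-up weights)
    [[-1.0, 0.0], [0.0, 1.0]],    # filter 1 (made-up weights)
]
short_signal = [[0.1 * t, 1.0] for t in range(8)]
long_signal = [[0.1 * t, 1.0] for t in range(20)]

pooled_short = global_average_pool(conv1d(short_signal, kernels))
pooled_long = global_average_pool(conv1d(long_signal, kernels))
print(len(pooled_short), len(pooled_long))
```

Because the pooled representation has one entry per filter regardless of the number of sampling points, the same weights can be applied to input sequences of different lengths.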

A-B Recurrent architectures

As an alternative architecture we investigate recurrent neural networks, namely LSTM [61] cells. We investigated stacked LSTM architectures but found no significant gain in performance. However, even for a single RNN cell, in our case with 256 hidden units, different training methods are feasible:

  1. In the first variant we feed the last LSTM output into a fully connected softmax layer.

  2. In the second variant we additionally apply a time-distributed fully connected layer, i.e. a fully connected layer with shared weights for every timestep, and train the network to predict the next element in a time series prediction task jointly with the classification task. Here we adjusted both loss functions to reach similar values. Similar to [17] we investigate in this way if the time series prediction task improves the classification accuracy.
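The joint objective of the second variant can be sketched as a weighted sum of a classification loss and a next-step prediction loss. The use of a mean squared error for the prediction task, the scalar `weight` that balances the two terms, and all function names are assumptions; the text only states that both loss functions were adjusted to reach similar values.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def mean_squared_error(predicted, target):
    """Mean squared error over all time steps and channels."""
    diffs = [(p - t) ** 2
             for p_row, t_row in zip(predicted, target)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)


def joint_loss(class_probs, label, next_step_pred, signal, weight=1.0):
    """Classification loss plus weighted next-step prediction loss.

    next_step_pred[t] is the prediction for signal[t + 1], so the last
    prediction has no target and is dropped.
    """
    ce = cross_entropy(class_probs, label)
    mse = mean_squared_error(next_step_pred[:-1], signal[1:])
    return ce + weight * mse


# Toy single-channel signal and per-timestep predictions (made-up numbers).
signal = [[0.0], [1.0], [2.0]]
prediction = [[0.0], [2.0], [0.0]]
loss = joint_loss([0.5, 0.5], 1, prediction, signal, weight=2.0)
print(loss)
```

In practice the weight would be tuned so that neither term dominates the gradient.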

During RNN training we apply gradient clipping.
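The text does not specify which form of gradient clipping is applied. One common choice is clipping by the global norm, as provided for instance by `tf.clip_by_global_norm` in TensorFlow; the plain-Python sketch below assumes this variant, and the threshold value is hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale all gradients jointly so their global L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_norm:
        return [list(grad) for grad in gradients]
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients]


# Two toy gradient vectors with global norm 5, clipped to norm 1.
clipped = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=1.0)
print(clipped)
```

Rescaling all gradients by the same factor preserves the update direction while bounding the step size, which helps against exploding gradients in recurrent networks.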

Acknowledgment

The authors thank M. Grünewald and K.-R. Müller for discussions and E. Dolman for comments on the manuscript.

References