
Analysis of a Deep Learning Model for 12-Lead ECG Classification Reveals Learned Features Similar to Diagnostic Criteria

Despite their remarkable performance, deep neural networks remain unadopted in clinical practice, which is considered to be partially due to their lack of explainability. In this work, we apply attribution methods to a pre-trained deep neural network (DNN) for 12-lead electrocardiography classification to open this "black box" and understand the relationship between model prediction and learned features. We classify data from a public data set and the attribution methods assign a "relevance score" to each sample of the classified signals. This allows us to analyze what the network learned during training, for which we propose quantitative methods: average relevance scores over a) classes, b) leads, and c) average beats. The analyses of relevance scores for atrial fibrillation (AF) and left bundle branch block (LBBB) compared to healthy controls show that their mean values a) increase with higher classification probability and correspond to false classifications when around zero, and b) correspond to clinical recommendations regarding which lead to consider. Furthermore, c) visible P-waves and concordant T-waves result in clearly negative relevance scores in AF and LBBB classification, respectively. In summary, our analysis suggests that the DNN learned features similar to cardiology textbook knowledge.







1 Introduction

The development and evaluation of algorithms for automatic interpretation of biosignals has attracted great interest in the last decade. Biosignals are time series, i.e. they are ordered sequences of measurements, which are usually acquired at successive and equally-spaced points in time. Typical examples are the electrocardiogram (ECG) representing the electrical activity of the heart or the electroencephalogram (EEG) representing brain activity. The temporal ordering distinguishes biosignals from many other biomedical data without any order (e.g. lab tests, sequencing) and introduces challenges in their interpretation by humans and algorithms alike. Besides measurement artefacts (e.g. loss of electrode contact), signals are influenced by other physiological processes (e.g. ECG by respiration) and (in)voluntary movement of the patient.

Traditionally, the field of ECG signal processing was dominated by methods based on mathematical or physical models recreating human physiology. Human experts defined semantic models or features which were used for different tasks, e.g. for generating synthetic waveforms [McSharry.2003], waveform delineation [Bock.2021], or even human identification [Israel.2005]. Evidently, this led to a plethora of proposed features and to the question of which feature set is optimal for a specific task, e.g. for ECG classification [mar.2011]. In this application, the aim is to assign a label either to individual heart beats or to a whole recording. As an example of the latter use case, the PhysioNet/CinC Challenge 2020 posed the task of automatically assigning one or multiple classes to the recordings of a large, multi-institutional database of 12-lead ECGs [perez.2020]. Many teams took part, with the most common algorithms being deep neural networks (DNNs).

In recent years, data-driven methods from the field of machine learning (ML) have become popular, with DNNs accounting for a significant percentage [Piccialli.2021]. While many early works used DNNs as classifiers with traditional, semantic features as their input, there has recently been a trend towards "end-to-end" pipelines in which the raw signal is processed and the DNN extracts relevant features by itself. Although these methods are able to produce outstanding results and outperform conventional methods in many areas [Hannun.2019, smith.2019], a pitfall lies in the fact that they are black box models often based on agnostic features. While they bear the theoretical potential to aid clinicians in diagnostics or treatment decisions, clinicians need to be able to comprehend the reasoning behind the algorithm, as a "Clever Hans" prediction [clever_hans], based on spurious or artifactual correlations, might lead to wrong decisions and thereby to adverse consequences for patients. Hence, next to issues such as inadequate performance metrics [clifford.2022] and data leakage [kapoor.2022], one of the main reasons for DNNs remaining unadopted in clinical practice is missing explainability [Yoon.2020, elul.2021].

To address this need, frameworks and methods from the field of Explainable Artificial Intelligence (XAI) are being developed and evaluated [review_XAI]. While XAI for text and tabular input data is advancing, XAI for time series data such as biosignals still needs further research [Guidotti.2019]. XAI methods for DNNs include layer-wise relevance propagation (LRP) [lrp], integrated gradients (IG) [integratedGrads], and Grad-CAM [Selvaraju.2020]. With regard to ECG classification, however, these methods are usually applied qualitatively [Taniguchi.2021, Bodini.2021, Sturm.2016] by showing individual recordings and corresponding XAI information, e.g. as pseudo-colored overlays. Such a qualitative evaluation of single recordings provides only anecdotal evidence and does not meet the requirements for integrating DNNs in clinical practice, which needs a comprehensive characterization of models and their limitations.


Figure 1: Overview of the processing pipeline: Our data set consists of healthy controls (Normal) that are compared to patients showing AF and LBBB. Each (unseen) 12-lead ECG is fed into the pre-trained DNN and subsequently results are explored with the XAI methods, yielding a relevance score for each input sample, indicated here by blue (negative relevance score), grey (neutral), and red values (positive relevance score). We propose novel analysis methods for these scores, allowing to gain insight into the DNN’s reasoning.

Hence, in this work, we address the unmet clinical need of missing explainability by proposing a quantitative analysis pipeline (Fig. 1) enabling an objective justification of a DNN’s decision. We use a state-of-the-art, pre-trained DNN proposed by Ribeiro et al. for 12-lead ECG classification [Ribeiro.2020] and apply the attribution XAI method IG [integratedGrads] to a large-scale public data set [Liu.2018]. IG assigns to each sample of the ECG time series a relevance score reflecting how much it influenced the DNN’s decision. Additionally, we apply another attribution XAI method, LRP, to compare its explanatory power to IG.

The main contributions of this work are novel analysis methods for processing relevance scores. These analyses give insight into the DNN's reasoning when classifying unseen ECG signals. By mapping the results to clinical knowledge, we investigate to what extent the DNN's features align with clinical criteria. In doing so, we also propose novel visualization methods for relevance scores, allowing an intuitive and quick assessment of DNN classifications.

2 Results

After processing the China Physiological Signal Challenge 2018 (CPSC2018) data [Liu.2018] with the DNN, is the probability that a recording shows the interrogated abnormality. The recording is classified as this abnormality if is higher than a threshold defined by Ribeiro et al., which is for atrial fibrillation (AF) and for left bundle branch block (LBBB). Applying an XAI method results in a relevance score for each input sample of a classified ECG with reflecting the XAI method, representing sample index, and representing the ECG lead.

2.1 Average Relevance Scores Over Class


(a) Normal and AF recordings. Colors denote ground truth label of data set. Values for AF range from and values for normal recordings from .


(b) Normal and LBBB recordings. Colors denote ground truth label of data set. Values for LBBB range from and values for normal recordings from .
Figure 2: Distribution of relevance scores .


(a) Atrial Fibrillation


(b) Left Bundle Branch Block
Figure 3: Distribution of for each recording as single boxplot. The bottom x-axis represents sigmoid activation output of the DNN, while the upper x-axis represents the output with linear activation. Boxplot colors denote DNN classification results and red crosses indicate false negatives.

The mean of the distributions of relevance scores for each class (Fig. 2) is close to zero, indicating that the majority of ECG samples is not relevant for the DNN's decision. Distributions for both abnormalities are very similar to those of normal recordings, although they are slightly broader and shifted to positive values. For LBBB, in the range there is a large number of more positive relevance scores compared to normal recordings (Fig. 2(b)).

The relevance scores of individual recordings are again centered close to zero and rather equally-distributed (Fig. 3). In general, AF shows larger values in positive and negative direction compared to LBBB. While the median value is always very close to zero, the mean value of relevance scores is increasing with increasing .

For AF classification (Fig. 3(a)) a large number of normal recordings correctly classified as not showing AF have a near and correctly classified AF recordings are near . In between is a "transition area" with nine false negative classifications in . The remaining seven false negatives show values close to zero. LBBB has similar properties to AF, although there is no visible transition area and the values do not get as close to (Fig. 3(b)).


(a) AF classification


(b) LBBB classification
Figure 4: Distribution of w.r.t. ECG leads, colors denoting ground truth label. For AF classification (a) and LBBB classification (b) boxplots show that the abnormal mean is higher for each lead with the highest difference in V1.

2.2 Average Relevance Scores Over Class and Lead

Analyzing model results of each lead for AF classification (Fig. 4(a)), mean relevance scores showed medians of , and ranges of and for AF and normal recordings, respectively. For LBBB classification (Fig. 4(b)), medians were , and ranges were and for LBBB and normal recordings, respectively. For each lead, the mean relevance scores were significantly higher (Wilcoxon rank-sum test, p-value 0.01) for both abnormalities compared to normal recordings. In particular, lead V1 shows the highest difference in median values.
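The averaging over recordings and leads described above can be illustrated with a minimal NumPy sketch. It assumes that relevance scores are stored as an array of shape (recordings, samples, leads) and that the leads follow the order given in `LEADS`; both the layout and the synthetic data are assumptions made for illustration and are not taken from the original pipeline.

```python
import numpy as np

# Assumed lead order; the actual order depends on the data set and model.
LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]


def mean_relevance_per_recording(relevance):
    """Mean over samples and leads: (n_rec, n_samples, n_leads) -> (n_rec,)."""
    return relevance.mean(axis=(1, 2))


def mean_relevance_per_lead(relevance):
    """Mean over samples only: (n_rec, n_samples, n_leads) -> (n_rec, n_leads)."""
    return relevance.mean(axis=1)


def median_gap_per_lead(rel_abnormal, rel_normal):
    """Per-lead difference of medians of mean relevance (abnormal minus normal)."""
    med_abnormal = np.median(mean_relevance_per_lead(rel_abnormal), axis=0)
    med_normal = np.median(mean_relevance_per_lead(rel_normal), axis=0)
    return med_abnormal - med_normal


# Synthetic demo data (not real relevance scores): the abnormal class gets a
# small positive offset in V1 only.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1e-3, size=(20, 400, 12))
abnormal = rng.normal(0.0, 1e-3, size=(20, 400, 12))
abnormal[:, :, LEADS.index("V1")] += 5e-3

gap = median_gap_per_lead(abnormal, normal)
top_lead = LEADS[int(np.argmax(gap))]
print(top_lead)
```

On real data, the per-lead means of the two classes could then be passed to a rank-sum test to obtain the kind of significance statement reported above.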


Figure 5: Left column: Average beats (black curves) and IG relevance scores for lead V1 in AF classification. Abnormal ECGs show positive relevance scores (red) distributed over the whole P-QRS-T-cycle, while negative relevance scores (blue) on normal recordings cover QRS-complexes and especially P-waves. Right column: Instead of average beats, the variance of relevance scores across recordings is shown (orange).

2.3 Average Relevance Scores Over Class, Lead, and Beats

Average beats over 200 recordings show mostly positive relevance scores for both abnormalities, and mostly negative relevance scores for normal recordings for both classifications.

In case of AF classification, QRS-complexes are the most relevant parts, especially R-peaks (Fig. 5). For normal recordings, we can observe high negative values for the area of P-waves as well. Negative values of normal recordings are higher compared to positive values of AF recordings.

For LBBB classification, QRS-complexes are most relevant as well, although P-waves are not as important for normal recordings (Fig. 6). Furthermore, the concentration of high absolute relevance scores on specific waves or peaks is clearer, such as the negative T-wave in LBBB recordings, assigned with negative relevance scores when being positive in normal recordings. In contrast, for AF we can see many smaller relevance scores with higher variance distributed on the whole beat.
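The beat-wise averaging used in this analysis can be sketched as follows. This is a minimal illustration, assuming that R-peak positions are already available from some detector and that a fixed window around each R-peak is cut out; the window lengths `pre` and `post` are hypothetical values, and the original work may segment and align beats differently.

```python
import numpy as np


def average_beat(signal, relevance, r_peaks, pre=100, post=150):
    """Average ECG beat and relevance scores around R-peaks for one lead.

    signal, relevance: 1-D arrays of equal length (one lead of one recording)
    r_peaks: sample indices of R-peaks (assumed to be given)
    pre, post: window size in samples before/after each R-peak (hypothetical)
    Returns (mean beat, mean relevance, variance of relevance) per window sample.
    """
    beats, scores = [], []
    for r in r_peaks:
        start, stop = r - pre, r + post
        if start < 0 or stop > len(signal):
            continue  # skip beats cut off at the recording edges
        beats.append(signal[start:stop])
        scores.append(relevance[start:stop])
    if not beats:
        raise ValueError("no complete beat window found")
    beats = np.asarray(beats)
    scores = np.asarray(scores)
    return beats.mean(axis=0), scores.mean(axis=0), scores.var(axis=0)


# Synthetic demo: unit spikes as "R-peaks", constant relevance at each peak.
n = 2000
r_peaks = np.array([50, 400, 800, 1200, 1600, 1950])
signal = np.zeros(n)
signal[r_peaks] = 1.0
relevance = np.zeros(n)
relevance[r_peaks] = 0.5

mean_beat, mean_rel, var_rel = average_beat(signal, relevance, r_peaks)
```

Averaging the resulting per-recording curves over all recordings of a class and lead would give class-level average beats, and the variance output corresponds to the spread of relevance scores across beats.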

2.4 Qualitative Analysis

We observed clusters of high absolute relevance scores in the area of QRS-complexes during visual inspection of single recordings visualized as heatmap (Fig. 7). For LBBB, IG seems to focus on negative S-waves and prolonged ST-segments in lead V1. Occasionally, broad and notched R-waves were also marked relevant. On the contrary, for AF recordings, the relevant parts were usually R-waves and in rare instances areas with missing P-waves.


Figure 6: Left column: Average beats and IG relevance scores for lead aVL in LBBB classification. Abnormal ECGs show positive relevance scores (red) on QRS-complexes, while negative relevance scores (blue) on normal recordings can be seen on P- and T-waves as well. Right column: Instead of average beats, the variance of relevance scores across recordings is shown (orange).


Figure 7: Positive (red) and negative (blue) relevance scores calculated with integrated gradient on correctly classified electrocardiogram (LBBB: ) from CPSC data set (ID A0977). Relevance scores normed to per lead.


Figure 8: Positive (red) and negative (blue) relevance scores calculated with integrated gradient on false negative classified electrocardiogram (AF: ) from CPSC data set (ID A1017). Relevance scores are clustered around the artefact in lead V1.


Figure 9: Relevance scores calculated with five XAI methods normed to each on lead V6 of a correctly classified electrocardiogram (AF: ) from CPSC data set (ID A0086). EPS: LRPEpsilon, AB0: LRPAlpha1Beta0, WSQ: LRPWSquare, PSA: LRPSequentialPresetA, IGR: IntegratedGradients.

When looking at individual recordings we also observed that in cases of artefacts, such as baseline drifts or noise, IG relevance scores are usually accumulated mainly in these areas. This can be seen on multiple false negative classifications, such as recordings A1017 (lead: V1, Fig. 8), A0745 (lead: V6), and A0205, A0502 (both multiple leads, mainly: V1-6) in the CPSC2018 data set. In some cases the classification was still correct despite the focus on artefacts, e.g. A0639 (lead: V1) with an AF classification of .

2.5 Comparison of XAI Methods

IG and all considered LRP methods yield diverging results for the given data set. As can be seen in Fig. 9 as an example, LRP methods and distribute high absolute relevance scores especially around R-peaks, while shows higher absolute values on waves in between as well as artefacts. IG can also concentrate high absolute relevance scores around artefacts, but generally shows more high absolute values, especially on R peaks, when comparing leads of single patients to each other.

3 Discussion

Our key finding is that the DNN’s reasoning corresponds to cardiology textbook knowledge regarding several factors, including which lead to consider, the existence/absence of certain waves, or their morphology.

Results of the first analysis show that the relevance scores follow a reasonable distribution (Fig. 2) with the majority of values being close to zero. This is expected as the majority of samples in an ECG is at baseline, e.g. the interval between two heart beats from the end of the T-wave to the beginning of the P-wave, and carries little clinically-relevant information. Comparing AF and LBBB classification shows that the AF relevance scores are more evenly spread around zero, while the LBBB relevance scores tend towards positive values. This can also be seen clearly in Fig. 2(b), which shows a distinct gap for positive relevance scores between LBBB and normal recordings. We conclude that the DNN learned a larger inter-class distance for LBBB classification.

Analyzing individual recordings (Fig. 3) shows similar distributions for both classifications. Additionally, a distinct relationship between the averaged relevance scores and the probability output by the DNN can be observed. An optimal DNN classifier would show a cluster nearby and (normal recordings) as well as a cluster nearby and (AF/LBBB). The analyzed DNN shows a sub-optimal relationship that can generally be expected, with a transition area between both clusters in which the DNN is not highly certain of its decisions (e.g. Fig. 3(a): ). Furthermore, we observed many of the false negative classifications slightly below the threshold, indicating that the thresholds might not be optimal for the CPSC2018 data set.

When analyzing individual leads, significant differences in relevance score distributions between abnormal and normal recordings were revealed (Fig. 4). This indicates which leads are most relevant for the DNN's decision. In general, for AF, the limb leads show lower relevance scores compared to the chest leads [Bollmann.2006]. For AF as well as LBBB classifications, lead V1 shows clear positive relevance scores, indicating that the DNN learned clinically-relevant features: For AF, f-waves can often be observed in V1 [langley.2000] and for LBBB a negative terminal deflection in V1, e.g. an rS-complex with a tiny R-wave and a huge S-wave, is a clear diagnostic marker [strauss.2011]. Interestingly, there is a large difference in the distributions of the precordial leads V4-V6. While for AF they show a clear tendency towards positive relevance scores, for LBBB the median is close to zero. Another sign of LBBB is prolonged R-waves and absence of Q-waves in left-sided leads (I, aVL, V5, V6) [macfarlane.2020], which might not have been learned.

For these first analyses, we used averaged mean values of relevance scores, which have been used for explanations of models that take feature based input instead of raw data [Lauritsen.2020, Jansen.2019]. However, this is a rather coarse measure. As the relevance scores are signed, values can be composed of rather low relevance scores or competing strong relevance scores for and against the respective class. Still, outliers in overall means or means of leads could be an indicator for false classification due to artefacts, for example if a lead not typically being relevant for this abnormality has the highest mean, such as in lead V6 in Fig.


As time information is lost in average means, we proposed the third analysis. As can be seen in Figs. 5 and 6, the "average beat" and "average relevance scores" of a single lead can give an even more detailed idea of the model's features. Although it is still not possible to uniquely identify the actual features learned by the DNN, positively relevant areas in case of missing P-waves for AF classification indicate a good fit to clinical criteria [langley.2000]. Additionally, for the healthy controls, there are very pronounced negatively relevant areas near P-waves, suggesting that the DNN learned that the existence of P-waves is a counter-sign for AF. As IG does not give insight into the time scale, we cannot quantify to what extent RR-interval variations impact the relevance scores. However, as the QRS complex has similar shapes in AF and normal recordings, we assume that the DNN took the arrhythmic RR-intervals of AF recordings into account.


(a) Lead V1 used for AF classification.


(b) Lead aVL used for LBBB classification.
Figure 10: Average beats (black curve) and relevance scores for individual leads in a single normal recording correctly classified by the DNN: a) Highly negative relevance scores (blue) are found during the occurrence of the P-wave. b) Negative relevance scores (blue) are found during the P-/T-waves, and especially during occurrence of the P-wave of the QRS-complex.

Moreover, when analyzing the shape of an average relevance signal, which is continuously averaged over more and more recordings in Fig. 5 (see Supplemental Material for video), it can be seen that, for AF as well as normal ECGs, the variance of relevance scores is quite low. This indicates a robustness of the DNN as it generates similar relevance scores despite the natural inter-patient variability in abnormal ECGs. Regarding LBBB classification (Fig. 6), high relevance scores around broadened QRS-complexes indicate a good fit to clinical criteria [Tan.2020]. The criterion of a T-wave displacement opposite to the major deflection of the QRS-complex [Tan.2020] can also be observed very well, although resulting in small positive relevance scores only. In contrast, for healthy controls, T-waves result in very pronounced negatively relevant areas (e.g. Fig. 10(b)). Similarly, for AF classifications, P-waves are learned as a feature speaking against AF (e.g. Fig. 10(a)). Furthermore, the robustness of the relevance scores in terms of variance is even higher than for AF.

We observed that the DNN tends towards wrong classifications in the presence of artefacts, as exemplified in Fig. 8. This effect has been observed by others as well [Taniguchi.2021]. Although we have not attempted it in this work, artefact detection based on our approach could be a promising avenue for future work. Additionally, we observed that the relevance scores form certain temporal patterns that might lend themselves to analysis methods from nonlinear signal processing [apen], which we will analyze in future work.

In summary, our analysis suggests that the model by Ribeiro et al. learned features similar to criteria used by clinicians. IG relevance scores indicate that it learned features pointing towards a disease, such as the abnormal QRS-complex in LBBB, while other features, such as the T-wave pointing in opposite direction, are not used for LBBB detection. Instead, the opposite of the feature, a T-wave pointing in expected direction, is used as a feature for detecting healthy ECGs. Our proposed analysis and visualization methods for relevance scores facilitate a rapid and effective assessment of the DNN’s learned features and were confirmed by cardiologists.

In this work, we applied the XAI attribution methods IG and LRP. There are other approaches available for explaining models for biosignal data using ante-hoc methods as in [Hu.2021, Doborjeh.2021, elul.2021], but these methods are not suitable for pre-trained DNNs where no adaptation of the model itself is possible. Other methods, such as perturbation methods [zeiler2014visualizing, zintgraf2017visualizing], focus on occluding different parts of images and then analyzing the resulting changes in activations. These methods can also be used to calculate relevance scores for every input feature, but as shown by [samek2015evaluating] they produce noisier heatmaps compared to LRP methods. Our results indicate that both methods, IG and LRP, are well suited for gaining insight into the reasoning of DNNs applied to biosignals. Additionally, we conducted a comparison of IG and LRP methods (Fig. 9) and came to the conclusion that IG gives the most distinct results.

However, a limitation of our analysis based on IG is that we cannot infer any time-dependent information of the relevance scores. Especially for AF it is not clear whether e.g. the R-peaks are marked as relevant because of their morphology or their distance to one another. Therefore, we rate our results as more robust for LBBB as a morphological abnormality compared to AF as an arrhythmic and therefore time-dependent abnormality. Another limitation of our work is that we used a single-center ECG data set which might introduce a certain bias due to the hardware used and standard operating procedures at the clinic. Therefore, using a different data set might show different results. Thus, in future work, we will verify our results with more diverse data sources.

3.1 Conclusion

Missing explainability of ML methods for ECG analysis is a pressing issue preventing the dissemination of these methods in clinical practice. In this work we aimed to analyze a state-of-the-art DNN for ECG classification with an XAI method and proposed several quantitative analysis techniques. Although this approach does not give absolute certainty about the features learned by the DNN, it allows for inferring assumptions about its decision process. For example, our results reveal that the DNN learned that clearly-visible P-waves are a counter-sign for AF and T-waves pointing in the same direction as the QRS-complex in particular leads are counter-signs for LBBB. Furthermore, decisions of the DNN for LBBB classification are based on unusual QRS-complexes. We can conclude that the DNN learned cardiology textbook knowledge covering the whole cardiac cycle (P-waves, QRS-complex, T-wave). Moreover, by using our qualitative analysis we were able to explain false classifications due to transient noise which attracts the DNN's relevance scores, leading to relevant features being ignored.

In future work, we will use the methods proposed in this work for developing an interactive tool for clinical practice which offers cardiologists an intuitive overview of the DNN’s reasoning, supporting them in their decision whether to trust the DNN’s classification, or not.

4 Methods

4.1 Physiological Introduction

An ECG measures electrical activity on a patient’s skin to monitor his/her cardiac cycle. It is a routine measurement in clinical settings, especially in emergency care as it allows a fast, accurate, and comfortable assessment of key clinical parameters. Standard parameters derived from ECGs include heart rate, lengths between different peaks and waves, as well as the heart’s electrical axis. Differences of these parameters to normal values can be interpreted as abnormalities, substantiating diagnoses. The acquisition of ECGs differs in length (resting: s, Holter: h) and circumstances (resting vs. exercise).

Raw ECG data is measured at equally-spaced points in time (samples) in units of millivolt (mV) from multiple directions (leads), which are computed from differences in electrical potentials measured at distinct electrodes. A standard resting ECG uses 10 electrodes, resulting in 12 leads, including six chest leads and six limb leads derived from electrodes on each arm and the left leg.
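How leads arise from electrode potentials can be made concrete with the standard Einthoven and Goldberger definitions of the six limb leads and the Wilson central terminal used as reference for the chest leads. The sketch below is illustrative only: the function names and the toy potentials are assumptions, and real recording devices perform this derivation internally.

```python
import numpy as np


def limb_leads(ra, la, ll):
    """Six limb leads from right arm (ra), left arm (la), left leg (ll) potentials."""
    return {
        "I": la - ra,
        "II": ll - ra,
        "III": ll - la,
        "aVR": ra - (la + ll) / 2,
        "aVL": la - (ra + ll) / 2,
        "aVF": ll - (ra + la) / 2,
    }


def chest_lead(v_electrode, ra, la, ll):
    """Chest lead: chest electrode potential minus Wilson central terminal."""
    wilson_central_terminal = (ra + la + ll) / 3
    return v_electrode - wilson_central_terminal


# Toy potentials in mV for three samples (not measured data).
ra = np.array([-0.2, -0.1, 0.0])
la = np.array([0.3, 0.2, 0.1])
ll = np.array([0.5, 0.4, 0.2])
leads = limb_leads(ra, la, ll)
v1 = chest_lead(np.array([0.1, 0.0, -0.1]), ra, la, ll)
```

Two consequences of these definitions are that lead II equals the sum of leads I and III, and that the three augmented leads sum to zero, so the six limb leads carry only two independent signals.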

The stages of the cardiac cycle, a single heart beat, are represented by characteristic waves and peaks in a P-QRS-T sequence. The P-wave represents the depolarization preceding the contraction of the atria, which is initiated by the sinus node. The QRS complex consists of the Q-, R-, and S-waves and corresponds to the ventricular systole, and the T-wave represents the ventricular repolarization.

The morphology of the different waves (e.g. amplitude, width) as well as the intervals between them are clinically relevant. For example, AF is an arrhythmia based on uncoordinated electrical impulses in the atrium of the heart and a non-functioning sinus node [Hindricks.2021] that can be diagnosed from ECGs. Criteria for diagnosis are absence of P-waves, as they are initiated by the sinus node, and irregular RR intervals [Hindricks.2021]. However, repeating fibrillatory waves, so called f-waves, mimic P-waves and can usually be observed best in leads V1-6, especially V1 [Bollmann.2006]. Another abnormality is LBBB which results from the cardiac conduction through the left bundle branch being compromised downstream from lesions of the His bundle or its derivatives. LBBB criteria for ECGs include an unusually wide QRS complex with the ST-segment and T-waves pointing in opposite directions [Tan.2020]. In left-sided leads (I, aVL, V5, V6) broad notched or slurred R waves can be observed, while Q waves are absent [Tan.2020]. Both AF and LBBB could be diagnosed using ECG acquisition with a reduced number of leads, but the gold standard for diagnosis is 12-lead ECG [harris.2012].

4.2 Technical Background

Ribeiro et al. published a residual network (ResNet) trained on more than two million ECGs from a Brazilian telehealth network, showing F1-scores of more than % for classification of six ECG abnormalities. The output from the convolutional layers in each of four residual blocks is fed into a fully connected layer with sigmoid activation function, yielding independent probabilities for six classes of ECG abnormalities [Ribeiro.2020c]. Thresholds calculated for the final classifications are available on GitHub (commit 89f929d, line 121). In previous work, we demonstrated methods and results reproducibility with a local data set [Bender.2021].

The model accepts a matrix with dimensions with and defining the number of samples and leads, respectively. denotes the number of recordings to be processed. The model outputs a matrix with dimensions assigning probabilities for six ECG abnormalities, namely first degree AV block, right bundle branch block, LBBB, sinus bradycardia, AF and sinus tachycardia.
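Turning the model's probability matrix into class decisions can be sketched as a per-class threshold comparison. The class abbreviations and the threshold values below are placeholders chosen for illustration; the actual thresholds are those published by Ribeiro et al. and are not reproduced here.

```python
import numpy as np

# Class order as listed in the text; the abbreviations are hypothetical labels.
CLASSES = ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]


def classify(probabilities, thresholds):
    """Compare sigmoid outputs (n_recordings, n_classes) with per-class thresholds.

    Returns a boolean matrix; True means the abnormality is assigned.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    if probabilities.ndim != 2 or probabilities.shape[1] != thresholds.shape[0]:
        raise ValueError("expected shape (n_recordings, n_classes)")
    return probabilities > thresholds


# Made-up probabilities for two recordings and placeholder thresholds of 0.5.
probs = np.array([
    [0.05, 0.10, 0.90, 0.02, 0.20, 0.01],
    [0.01, 0.02, 0.03, 0.04, 0.60, 0.05],
])
placeholder_thresholds = np.full(len(CLASSES), 0.5)
labels = classify(probs, placeholder_thresholds)
assigned = [[c for c, hit in zip(CLASSES, row) if hit] for row in labels]
print(assigned)
```

Because the outputs are independent sigmoid probabilities, a recording can receive none, one, or several labels.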

In medical applications like ECG diagnostics it is especially important for clinicians to understand the reasoning behind such a deep neural network. Explainable AI methods build a wrapper around the black box model, giving insight into possible features that led to the DNN’s output. In this paper, we focus on two state-of-the-art attribution methods, IG and LRP.

4.2.1 Integrated Gradients

IG attribute the prediction of a neural network on unseen data to its input features. However, IG use a baseline input for attribution calculation. The authors [integratedGrads] motivate this by noting that if we assign blame to something, we implicitly consider the absence of this thing as a baseline for comparing outcomes.

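IG are calculated as follows: Let $F: \mathbb{R}^n \rightarrow [0, 1]$ be a function that represents a neural network, $x \in \mathbb{R}^n$ the input at hand, and $x' \in \mathbb{R}^n$ the baseline input. The IG are defined as the path integral of the gradients along the straight-line path from the baseline $x'$ to the input $x$. The straight-line path can be written as $\gamma(\alpha) = x' + \alpha (x - x')$ for $\alpha \in [0, 1]$. The integrated gradient for the $i$-th input dimension is defined as

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i} \, d\alpha \tag{1}$$

where $\frac{\partial F(x)}{\partial x_i}$ is the gradient of $F$ along the $i$-th dimension.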

The property of the LRP methods that the relevance scores of the input can be summed up to approximate the prediction score (see (4)) can also be proven for IG by using the fundamental theorem of calculus for path integrals. This states that if the network function $F$ is differentiable almost everywhere (that is, $F$ is continuous everywhere and the partial derivative of $F$ along each input dimension is Lebesgue integrable, which holds for most neural networks using sigmoid, ReLU, or pooling functions), then for an input $x$ and a baseline $x'$

$$\sum_{i} \mathrm{IG}_i(x) = F(x) - F(x'). \tag{2}$$

For a baseline with prediction near zero, $F(x') \approx 0$, we can see that the sum over the IG in (2) also approximates the prediction score, similar to how the sum over the relevance scores calculated by LRP approximates the prediction score in (4). This property is termed completeness in [integratedGrads].

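For computing IG, the integration is replaced by a sum over sufficiently small intervals along the straight-line path from the baseline $x'$ to the input $x$:

$$\mathrm{IG}_i(x) \approx (x_i - x'_i) \cdot \frac{1}{m} \sum_{j=1}^{m} \frac{\partial F\left(x' + \frac{j}{m} (x - x')\right)}{\partial x_i}, \tag{3}$$

where $m$ is the number of steps in the approximation.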
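The approximation can be illustrated with a minimal NumPy sketch. It does not use the ECG model; instead it assumes a toy logistic-regression "network" with an analytic gradient, and all names (`model`, `model_grad`, `integrated_gradients`, `steps`) are hypothetical. It computes the Riemann-sum form of IG for a zero baseline and checks the completeness property numerically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy differentiable 'network': F(x) = sigmoid(w . x + b)."""
    return sigmoid(np.dot(x, w) + b)

def model_grad(x, w, b):
    """Analytic gradient of F with respect to the input x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Riemann-sum approximation of IG along the straight-line path."""
    alphas = np.arange(1, steps + 1) / steps      # j/m for j = 1..m
    diff = x - baseline
    grads = np.stack([model_grad(baseline + a * diff, w, b) for a in alphas])
    return diff * grads.mean(axis=0)              # (x - x') * average gradient

rng = np.random.default_rng(0)
w = rng.normal(size=8)          # hypothetical weights
b = -0.5                        # hypothetical bias
x = rng.normal(size=8)          # hypothetical input
baseline = np.zeros_like(x)     # zero baseline

ig = integrated_gradients(x, baseline, w, b)
# Completeness: the attributions should sum to F(x) - F(baseline).
gap = ig.sum() - (model(x, w, b) - model(baseline, w, b))
print("completeness gap:", abs(gap))
```

Increasing `steps` shrinks the gap at the cost of more gradient evaluations, which is the trade-off behind choosing the interval size.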


4.2.2 Layer-wise Relevance Propagation

LRP tries to explain the output $f(x)$ made by a classifier $f$ with respect to an input $x$ by decomposing the output in such a way that

$$f(x) \approx \sum_{d=1}^{V} R_d, \tag{4}$$

where $V$ is the input dimension. $R_d > 0$ would then indicate the presence of the structure which is to be classified and $R_d < 0$ would indicate its absence.

Propagation of relevance scores works as follows: Let $R_j^{(l+1)}$ be a known relevance score of a certain neuron $j$ in the $(l+1)$-th layer of a neural network, for a classification decision $f(x)$. The decomposition of the relevance score in terms of messages $R_{i \leftarrow j}^{(l,l+1)}$ sent to neurons of the previous layer must hold the following conservation property

$$\sum_{i} R_{i \leftarrow j}^{(l,l+1)} = R_j^{(l+1)}, \tag{5}$$

where $\sum_i$ describes the sum over all neurons in the $l$-th layer of the neural network.
One possible relevance decomposition that satisfies (5) would be to use the ratio of local and global activations:

$$R_{i \leftarrow j}^{(l,l+1)} = \frac{z_{ij}}{z_j} R_j^{(l+1)}, \qquad z_{ij} = a_i^{(l)} w_{ij}, \qquad z_j = \sum_{i} z_{ij} + b_j, \tag{6}$$

where $a_i^{(l)}$ is the activation (calculated by a non-linear activation function) of the $i$-th neuron in the $l$-th layer, $w_{ij}$ is the weight connecting neuron $i$ in the $l$-th layer to neuron $j$ in the $(l+1)$-th layer, $b_j$ is a bias term, and $\sum_i$ describes the sum over all neurons in the $l$-th layer.

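A problem with (6) is that if the denominator $z_j$ gets very small, the relevance scores can get infinitely large. To overcome this problem, the authors of [lrp] have introduced a stabilizer $\epsilon \geq 0$:

$$R_{i \leftarrow j}^{(l,l+1)} = \begin{cases} \dfrac{z_{ij}}{z_j + \epsilon} R_j^{(l+1)}, & z_j \geq 0, \\[2ex] \dfrac{z_{ij}}{z_j - \epsilon} R_j^{(l+1)}, & z_j < 0. \end{cases} \tag{7}$$

As we can see in (7), if $\epsilon$ becomes very large, the relevance scores will tend to zero, which poses another problem. To counteract this, a different treatment of positive and negative activations is proposed in [lrp]. Let $z_{ij}^{+}$ and $z_{ij}^{-}$ denote the positive and negative part of $z_{ij}$ such that $z_{ij} = z_{ij}^{+} + z_{ij}^{-}$. The same notation will be used for the positive and negative parts of $b_j$, so that $z_j^{+} = \sum_i z_{ij}^{+} + b_j^{+}$ and $z_j^{-} = \sum_i z_{ij}^{-} + b_j^{-}$. Relevance decomposition can now be defined by

$$R_{i \leftarrow j}^{(l,l+1)} = R_j^{(l+1)} \left( \alpha \frac{z_{ij}^{+}}{z_j^{+}} + \beta \frac{z_{ij}^{-}}{z_j^{-}} \right), \tag{8}$$

where $\alpha + \beta = 1$.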
A different propagation rule has been proposed by [Montavon_2017] for real-valued inputs. It redistributes the relevance score $R_j^{(l+1)}$ of neuron $j$ in the $(l+1)$-th layer to the neurons $i$ of the $l$-th layer according to the squared magnitude of the connecting weights $w_{ij}$:

$$R_{i \leftarrow j}^{(l,l+1)} = \frac{w_{ij}^{2}}{\sum_{i'} w_{i'j}^{2}} R_j^{(l+1)}. \tag{9}$$

Other papers such as [Samek2017UnderstandingAC] and [Kohlbrenner] propose a combination of different decomposition rules for different layer types, like (7) for fully connected layers to truthfully represent the decisions made via the layers' linear mapping and (8) for convolutional layers with ReLU activation functions to separately handle the positive and negative parts of the weighted activations $z_{ij}$.
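To make the propagation rules concrete, the following minimal NumPy sketch applies the stabilized rule and the squared-weight rule to a single fully connected layer. It is an illustration under simplifying assumptions (one layer, zero bias, hand-picked numbers), not the iNNvestigate implementation; the function names `lrp_epsilon_dense` and `lrp_w2_dense` are hypothetical.

```python
import numpy as np

def lrp_epsilon_dense(a, W, b, R_out, eps=1e-9):
    """Redistribute relevance from layer l+1 to layer l with the epsilon rule.

    a: activations of layer l, shape (I,)
    W: weights, shape (I, J); b: biases, shape (J,)
    R_out: relevance scores of layer l+1, shape (J,)
    """
    z_ij = a[:, None] * W                      # local contributions z_ij
    z_j = z_ij.sum(axis=0) + b                 # global pre-activations z_j
    stab = np.where(z_j >= 0, eps, -eps)       # stabilizer with the sign of z_j
    messages = z_ij / (z_j + stab) * R_out     # messages R_{i<-j}
    return messages.sum(axis=1)                # R_i = sum over j of R_{i<-j}

def lrp_w2_dense(W, R_out):
    """Redistribute relevance according to squared weights (w^2 rule)."""
    w2 = W ** 2
    return (w2 / w2.sum(axis=0) * R_out).sum(axis=1)

# Hand-picked toy layer: 3 input neurons, 2 output neurons, zero bias.
a = np.array([1.0, 2.0, 0.5])
W = np.array([[0.5, -1.0],
              [1.0,  0.5],
              [-2.0, 1.0]])
b = np.zeros(2)
R_out = np.array([0.6, 0.4])

R_eps = lrp_epsilon_dense(a, W, b, R_out)
R_w2 = lrp_w2_dense(W, R_out)
print("epsilon rule:", R_eps, "sum:", R_eps.sum())
print("w^2 rule:    ", R_w2, "sum:", R_w2.sum())
```

With zero bias and a small stabilizer, both rules conserve the total relevance of the layer. The epsilon rule can assign negative scores, whereas the squared-weight rule yields only non-negative scores for non-negative incoming relevance.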

4.3 Experimental Design

Fig. 1 shows an overview of the DNN and XAI pipeline applied. Data stems from the CPSC2018 data set acquired in eleven Chinese hospitals containing 12-lead ECGs and ground truth provided by human experts [Liu.2018]. We use a subset of each for AF, LBBB and healthy subjects showing normal signals, resulting in recordings. We investigate these two classes because AF is defined by an abnormal heart rhythm, i.e. irregular distances between heart beats, and can therefore only be diagnosed by analyzing multiple heart beats. In contrast, LBBB can be diagnosed from a single heart beat as it is characterized by distinct morphological features, e.g. a notched QRS complex.

All recordings were resampled to

Hz and trimmed or zero-padded to

samples. In the remainder of this work, we denote a single ECG sample as $x_{n,t,k}$ with $n$ representing the recording index, $t$ representing samples, and $k$ representing leads. Regarding data processing, each ECG signal is fed to the model by Ribeiro et al. [Ribeiro.2020b] for classification, resulting in a matrix with dimensions $N \times 6$ assigning probabilities for six ECG abnormalities, where $N$ is the number of recordings. In the following, we define $\hat{y}$ indicating the prediction score of the model with sigmoid activation, representing the classification probability. All computations are implemented using Python v3.6.8 and the libraries iNNvestigate v1.0.9, Tensorflow v1.12.0, neurokit2 v0.1.7, and h5py v2.10.0.
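The trimming or zero-padding step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper name `fit_length` is hypothetical, resampling is omitted, and the sketch assumes that surplus samples are cut from the end and zeros are appended at the end, which the text does not specify.

```python
import numpy as np

def fit_length(ecg, target_len):
    """Trim or zero-pad an array of shape (samples, leads) to target_len samples.

    Assumption: trimming and padding both happen at the end of the recording.
    """
    n_samples, n_leads = ecg.shape
    out = np.zeros((target_len, n_leads), dtype=ecg.dtype)
    n = min(n_samples, target_len)
    out[:n] = ecg[:n]                      # copy what fits, rest stays zero
    return out

# Two hypothetical recordings with 2 leads and different lengths.
short = np.ones((3, 2))
long = np.arange(10.0).reshape(5, 2)

padded = fit_length(short, 4)              # 3 samples -> 4 samples (one zero row)
trimmed = fit_length(long, 4)              # 5 samples -> 4 samples
batch = np.stack([padded, trimmed])        # shape (recordings, samples, leads)
print(batch.shape)
```

Bringing every recording to the same length is what allows stacking them into a single array with one axis each for recordings, samples, and leads.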

We utilize the package iNNvestigate [Alber.2019], which implements multiple XAI methods, to compute relevance scores for each sample of the input ECGs. We use the XAI methods IG and LRP, with IG being computed with baseline input zero and interval size , after changing the activation of the DNN's last layer to linear. Sigmoid activation does not change the ranking order of the predicted classes, but might obfuscate the true confidence of the model's individual class predictions (source accessed: October 14, 2022).

Both XAI methods, IG and LRP, assign a relevance score to each input sample of a classified ECG recording. By computing this for all recordings we obtain relevance scores $R$ with the same dimensions as our input ECG data $X$. Both are the basis for our analysis to compare features embedded in the DNN model to clinically relevant criteria. We analyze the obtained relevance scores with three novel quantitative methods and one qualitative method as described in the following sections. With each new analysis, we take more details into account. While in the first analysis relevance scores are binned to each class, in the second analysis we split relevance scores w.r.t. their lead and in the third analysis w.r.t. lead and heart beats.

4.3.1 Binned and Average Relevance Scores Over Class

We first analyze relevance scores for all normal, LBBB, and AF recordings separately and bin the values for their respective class, allowing us to compare the overall distribution of the relevance scores $R$ for the different classes.

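We then aggregate all leads of each recording into

$$\bar{r}_n = \frac{1}{T K} \sum_{t=1}^{T} \sum_{k=1}^{K} r_{n,t,k}, \tag{10}$$

with $t \in \{1, \dots, T\}$ and $k \in \{1, \dots, K\}$, where $r_{n,t,k}$ is the relevance score of sample $t$ in lead $k$ of recording $n$. $\bar{r}_n$ takes positive or negative values, hence a higher $\bar{r}_n$ is associated with a higher prediction score (here: model output with linear activation), termed completeness in [integratedGrads].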

4.3.2 Average Relevance Scores Over Class and Lead

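We aggregate relevance scores for each lead and each recording in

$$\bar{r}_{n,k} = \frac{1}{T} \sum_{t=1}^{T} r_{n,t,k}, \tag{11}$$

with $k \in \{1, \dots, K\}$, where $r_{n,t,k}$ is the relevance score of sample $t$ in lead $k$ of recording $n$. This allows for comparing the distribution of $\bar{r}_{n,k}$ w.r.t. class and ECG leads and thus the importance of the individual ECG leads for the DNN. This is required as the different leads show different morphologies and signal shapes that might cancel out in the first analysis.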
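The two aggregation levels, per recording and per recording and lead, can be sketched in NumPy. This is a minimal illustration that assumes the aggregation is an arithmetic mean and uses random numbers as a stand-in for real relevance scores; the array sizes and the names `R`, `r_rec`, and `r_lead` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 4, 1000, 12               # hypothetical: recordings, samples, leads
R = rng.normal(size=(N, T, K))      # stand-in for relevance scores

# First analysis: one value per recording (average over samples and leads).
r_rec = R.mean(axis=(1, 2))         # shape (N,)

# Second analysis: one value per recording and lead (average over samples).
r_lead = R.mean(axis=1)             # shape (N, K)

print(r_rec.shape, r_lead.shape)
# Averaging the per-lead values over leads recovers the per-recording value,
# which is why opposite-signed leads can cancel out in the first analysis.
print(np.allclose(r_lead.mean(axis=1), r_rec))
```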

4.3.3 Average Relevance Scores Over Class, Lead, and Beats

In the first two analysis methods, time information is lost. However, this information is crucial for explaining the DNN's decision, as we need to compare whether the agnostic features learned by the DNN reflect the clinical features described in section 4.1, such as missing P-waves or unusually wide QRS complexes.

Analyzing individual ECG records gives only anecdotal evidence. Therefore, we perform a two-step averaging procedure which averages the information over several recordings while preserving time information. First, for each ECG record and lead, we use the concept of "average beats" [hamilton.1991] by splitting the whole signal into individual heart beats (ecg_segment() function of neurokit2) and averaging them into a single, time-aligned representative beat for each lead. Then we use the exact same indices of the heart beats and perform the same steps on the relevance scores $R$, yielding an "average relevance score". All average beats and average relevance scores are then averaged for a given class. All segments are of equal size for one recording, hence we fill segments overlapping the start or end of the recording with zeros. Finally, amplitudes are normalized to . For scatter plot visualizations, relevance scores are upsampled by a factor of .
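The first averaging step can be sketched as follows. This is a simplified illustration, not the pipeline used in this work: instead of neurokit2's ecg_segment(), it cuts fixed-size windows around beat positions that are assumed to be given, and the function name `average_beat`, the window sizes, and the toy signal are hypothetical. The same beat indices are applied to the signal and to its relevance scores, and windows overlapping the start or end of the recording are filled with zeros.

```python
import numpy as np

def average_beat(signal, peaks, before, after):
    """Average fixed-size windows around each peak; zero-fill at the edges."""
    n = len(signal)
    segments = np.zeros((len(peaks), before + after))
    for row, p in enumerate(peaks):
        start, stop = p - before, p + after
        lo, hi = max(start, 0), min(stop, n)       # clip window to the signal
        segments[row, lo - start:hi - start] = signal[lo:hi]
    return segments.mean(axis=0)

# Toy single-lead "ECG" and matching relevance scores (hypothetical values).
signal = np.arange(20.0)
relevance = -signal
peaks = [2, 10]                 # assumed beat positions; first window overlaps the start

avg_beat = average_beat(signal, peaks, before=3, after=3)
avg_relevance = average_beat(relevance, peaks, before=3, after=3)
print(avg_beat)
print(avg_relevance)
```

Because both calls use the same `peaks`, each position in the average relevance score lines up with the same position in the average beat, which is what preserves the time information.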

4.3.4 Qualitative Analysis of XAI Relevance Scores

The results of all processed ECG signals were visualized as heatmap-colored scatter plots for each lead, after a normalization of the output to , keeping the center of the values at zero. Furthermore, these relevance score plots were evaluated by an experienced cardiologist.

4.3.5 Comparison to Other XAI Methods

Since both methods, IG and LRP, differ substantially in how they calculate relevance scores for the input, we believe that using both methods will help uncover important information about why the DNN made certain decisions. Hence, we compare IG results to LRP using the following LRP decomposition rules implemented in the iNNvestigate [Alber.2019] package:

  • The $\epsilon$-LRP decomposition (see (7)) with .

  • The $\alpha\beta$-LRP decomposition (see (8)) with and .

  • The $w^2$-LRP decomposition (see (9)).

  • The combination of $\alpha\beta$-LRP decomposition (see (8)) with and for convolutional layers and $\epsilon$-LRP decomposition (see (7)) with for fully connected layers.

The sigmoid function (used in the output layer) maps from $(-\infty, \infty)$ to $(0, 1)$ and thus inverts the signs of all negative values, as well as scales all values into the interval of $(0, 1)$. This results in only small and positive values being backpropagated by the LRP method, possibly resulting in small and only positive relevance scores. Thus we compared these relevance scores to those obtained by using a linear output in the last layer. Since both activations yield similar results when compared visually in heatmaps, we decided to continue with linear activation to avoid the possible sign flip.
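The effect of the output activation can be shown with a few numbers. The sketch below uses hypothetical linear outputs (logits), not values from the ECG model, and shows that the sigmoid turns negative outputs into positive ones while leaving the ranking order unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-2.0, -0.5, 0.0, 3.0])   # hypothetical linear outputs
probs = sigmoid(logits)

print("signs before sigmoid:", np.sign(logits))
print("signs after sigmoid: ", np.sign(probs))
# The ranking order of the outputs is preserved by the monotonic sigmoid.
same_order = np.array_equal(np.argsort(logits), np.argsort(probs))
print("same ranking:", same_order)
```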

4.4 Ethics approval

Human subject research: This work only makes use of the CPSC2018 data set [Liu.2018] and does not contain any additional information involving human participants obtained by the authors.

Competing Interests

The authors declare no competing financial or non-financial interests.

Data Availability

The data presented in this study is openly available in PhysioNet under "training/cpsc_2018".

Code Availability

All source code developed in this work is publicly available on GitLab:

Author Contributions

Conceptualization, T.B., N.S., D.K.; methodology, T.B., J.B., A.H., N.S.; software, T.B.; validation, C.M., T.S.; formal analysis, T.B., J.B., D.K., H.D., N.S.; investigation, T.B., N.S., D.K.; writing—original draft preparation, T.B., J.B.; visualization, T.B., N.S., D.K., J.B.; supervision, N.S., A.H.

All the authors reviewed, edited, and approved the manuscript.

Supplementary Material

We provide videos showing average beats and relevance scores for all ECGs:

  • AF classification, lead V1: "beats_V1_AF.mp4"

  • LBBB classification, lead aVL: "beats_AVL_LBBB.mp4"


This research was funded by the German Federal Ministry of Education and Research (grant no. 16TTP073 11, and HiGHmed, grant no. 01ZZ1802B) and the Lower Saxony “Vorab” of the Volkswagen Foundation and the Ministry for Science and Culture of Lower Saxony (grant no. 76211-12-1/21).