Heart diseases are among the leading causes of death in the world [benjamin2018heart]. The routine monitoring of physiological signals is deemed important in heart disease prevention. Among existing monitoring technologies, electrocardiography (ECG) is a commonly used non-invasive and convenient diagnostic tool that records the physiological activities of the heart over a period of time. Deciphering ECG signals can help detect many heart diseases such as atrial fibrillation (AF), myocardial infarction (MI), and heart failure (HF) [jama_af, yanowitz2012introduction].
An example of a real-world ECG signal is shown in Fig. 1. ECG signals from cases and controls of heart diseases show different patterns at 1) beat level, 2) rhythm level, and 3) frequency level, each representing different anomalous activities of the heart. For example, beat level morphology such as the P wave (atrial depolarization) and the QRS complex (ventricular depolarization) can reflect conditions related to the electrical conduction of the heart. Rhythm level patterns capture rhythm features across beats and reflect cardiac arrhythmia conditions (abnormal heart rhythms). Frequency level patterns capture frequency variations and shed light on the diagnosis of ventricular flutter and ventricular fibrillation. Learning these patterns to support diagnoses has been an important research area in ECG analysis [roopa2017survey, expert_1, expert_3, tateno2001automatic].
In real clinical settings, in addition to the demand for accurate classification, the interpretability of the results is equally important [tsai2003computer]. Cardiologists need to provide both a diagnosis and detailed explanations to support it [std]. Also, many heart diseases do not present abnormal ECG patterns constantly [benjamin2018heart, yanowitz2012introduction], especially during the early stage of the disease. Therefore, interpretability of the results, particularly highlighting the diagnosis-related parts of the data, is crucial for early diagnosis and better clinical decisions.
Traditional machine learning methods either learn time domain patterns, including beat level [ladavich2015rate, purerfellner2014p] and rhythm level [huang2011novel] patterns, or extract frequency patterns using signal processing techniques such as the discrete wavelet transform [garcia2016application]. However, time domain approaches are easily affected by noise or signal distortion [RODRIGUEZ2015261], while frequency domain methods cannot model rare events or some temporal dynamics that occur in the time domain. Besides, they all require laborious feature engineering, and their performance relies on the quality of the constructed features.
Recently, deep learning models showed initial success in modeling ECG data. Convolutional neural networks (CNN) were used to learn beat level patterns [tbe, 2017arXiv170701836R, hannun2019cardiologist]. Recurrent neural networks (RNN) are suitable for capturing rhythm features [schwab2017beat, hong2017encase, zihlmann2017convolutional]. Moreover, an attention mechanism has been employed to extract interpretable rhythm features [schwab2017beat]. Despite their progress, these models were either black-box or only highlighted one aspect of the patterns (such as rhythm features as in [schwab2017beat]), and thus lack the comprehensive interpretability of results needed for real clinical usage.
In this work, we propose the MultIlevel kNowledge-guided Attention model (MINA) to learn and integrate different levels of features from ECG which are aligned with clinical knowledge. For each level, MINA extracts level-specific domain knowledge features and uses them to guide the attention, including beat morphology knowledge that guides an attentive CNN and rhythm knowledge that guides an attentive RNN. MINA also performs attention fusion across the time and frequency domains. We propose new evaluation approaches that perturb ECG signals with noise and signal distortion. We evaluate the interpretability and robustness of the model by tracking intermediate reactions across layers, from the multilevel attentions to the final predictions.
Experimental results show that MINA can correctly identify critical beat locations, significant rhythm variations, and important frequency components, and remains robust in prediction under signal distortion or noise contamination. Tested on atrial fibrillation prediction, MINA achieved a PR-AUC of 0.9436 (Table 3), outperforming the best baseline PR-AUC of 0.8943, a relative improvement of about 5.5%. Finally, MINA also showed strong result interpretability and more robust performance than the baselines.
2 Related Work
Traditional methods include time domain methods such as beat level methods [ladavich2015rate, purerfellner2014p] and rhythm level ones [tateno2001automatic, huang2011novel, oster2015impact], both of which depend on segmentation by detecting the QRS complex. However, time domain methods rely on the accuracy of QRS detection and are thus easily affected by noise or signal distortion. Frequency domain approaches, on the other hand, cannot model rare events and other time-domain patterns and thus lack interpretability. Moreover, both types of features are subjective.
Recently, deep neural networks (DNNs) have been used in ECG diagnosis [tbe, 2017arXiv170701836R, hannun2019cardiologist, zihlmann2017convolutional, hong2017encase, schwab2017beat]. Many of them have demonstrated state-of-the-art performance due to their ability to extract effective features [2017arXiv170701836R, hong2017encase]. Some of them build an end-to-end classifier [tbe, 2017arXiv170701836R, zihlmann2017convolutional], while others build a mixture model which combines traditional feature engineering methods and deep models [hong2017encase, schwab2017beat]. However, existing deep models are insufficient in three aspects. First, they neglect the characteristics of ECG signals, namely beat morphology and rhythm variations, when designing the model architecture. Second, they only analyze ECG signals in the time domain. Last, they are “black-box” and thus not interpretable. In real world medical applications, interpretability is critical for clinicians to accept machine recommendations and implement interventions.
In this section, we will introduce the model design of MINA. Section 3.1 provides an overview and introduces all notations. Section 3.2 describes the basic framework, including each layer of MINA. Section 3.3 proposes our new attention mechanism which is integrated in MINA. Section 3.4 describes how we evaluate interpretability and robustness. Fig.2 depicts the architecture of MINA.
3.1 Overview of MINA
Here we briefly describe the framework and introduce the notation used throughout this paper. Assume we are given a single-lead ECG signal and use it to predict class probabilities. We first transform it into a multi-channel signal whose channels correspond to different frequency bands. We then split each channel into equal-length segments. Next, we apply a CNN and an RNN consecutively to the segments of each channel to obtain the beat level attention output and the rhythm level attention output. This is followed by a fully connected layer that transforms the rhythm level output of each channel. We then take a weighted average across all channels to obtain the frequency level attention output, which is used in prediction. To improve model accuracy and interpretability, we propose a knowledge-guided attention mechanism that learns attention vectors at the beat, rhythm, and frequency levels. More details will be described in Section 3.2. The notations are summarized in Table 1. Detailed configurations of MINA are introduced in the Implementation Details section.
|,,,||# of classes, # of frequency channels, # of segments, segment length|
|, ,||Original ECG signal, signals after transformation, th signal|
|, ,||Segment of ECG with length , segments (), th segment|
|,||CNN layer output, th column in , th segment output|
|Output of beat level attention|
|,||Output of beat level attention of segments, th segment output|
|,||Bi-LSTM layer output, th column in|
|Output of rhythm level attention|
|,||Output of rhythm level attention of channels, th channel output|
|Weight matrix and bias vector in fully connected layer|
|,||Fully connected layer output, th column in|
|Output of freq. level attention|
|Weight matrix and bias vector in prediction layer|
|, , ; , ,||Predicted probability, class weight, one-hot label; th value in each vector|
|, ,||Beat level attention weights, th value in , segment attention|
|,||Rhythm level attention weights, th value in|
|,||Frequency level attention weights, th value in|
|, ,||Beat-, rhythm-, and frequency level knowledge feature|
|,||Beat-, rhythm- level 1st layer attention weights|
|Frequency level 1st layer attention weights|
|, ,||Beat-, rhythm-, and frequency level 1st layer attention biases|
|, ,||Beat-, rhythm-, and frequency level 2nd layer attention weights|
|Function of standard deviation; function of power spectral density|
|Interfered signals, attention weights and predictions|
3.2 Description of MINA
Signal Transformation and Segmentation. To utilize frequency-domain information, we decompose the original ECG signal into different frequency bands, where each band is regarded as a channel. We can then model the signals of all channels concurrently.
Specifically, we propose a new time-frequency transformation layer that uses Finite Impulse Response (FIR) bandpass filters [Oppenheim:1996:SAS:248702] to transform a single-lead ECG signal into multi-channel ECG signals, one channel per frequency band.
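The decomposition into frequency-band channels can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Hamming-windowed sinc design, the tap count of 101, the zero padding at the signal edges, and all function names are assumptions. Only the 300 Hz sampling rate and the 0.5Hz-50Hz band come from the paper.

```python
import math

def lowpass_taps(cutoff_hz, fs_hz, num_taps=101):
    """Hamming-windowed sinc low-pass FIR taps, normalised to unit DC gain.
    num_taps is assumed odd so that the filter has a centre tap."""
    fc = cutoff_hz / fs_hz  # normalised cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for k in range(num_taps):
        t = k - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]

def bandpass_taps(low_hz, high_hz, fs_hz, num_taps=101):
    """Band-pass taps built as the difference of two low-pass designs."""
    upper = lowpass_taps(high_hz, fs_hz, num_taps)
    lower = lowpass_taps(low_hz, fs_hz, num_taps)
    return [a - b for a, b in zip(upper, lower)]

def fir_filter(signal, taps):
    """Same-length convolution; samples outside the signal are treated as zero."""
    half = len(taps) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = n + half - k
            if 0 <= idx < len(signal):
                acc += h * signal[idx]
        out.append(acc)
    return out

def to_channels(signal, bands, fs_hz, num_taps=101):
    """Return one filtered copy of the signal per (low_hz, high_hz) band."""
    return [fir_filter(signal, bandpass_taps(lo, hi, fs_hz, num_taps))
            for lo, hi in bands]
```

For example, `to_channels(x, [(0.5, 50.0)], 300.0)` would produce a single channel for the 0.5Hz-50Hz band. With only 101 taps the low cutoff is far from sharp, so a practical design would likely need longer filters or a dedicated filter design routine.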
Then, for each channel, we split the signal into a sequence of equal-length segments. Unlike previous deep models [schwab2017beat, tbe] that perform segmentation using QRS complex detection, which is easily affected by signal quality, we simply use sliding window segmentation. Cutting each channel into windows of a fixed segment length yields equal-length segments (without loss of generality, we assume that the signal length is divisible by the segment length; otherwise we cut off the remaining part at the end, which is shorter than one segment). In general, the segment length needs to be shorter than the length of one heartbeat, so that we can extract beat level patterns. Detailed configurations can be found in the Implementation Details section.
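The sliding window segmentation described above can be sketched in a few lines. The function name and the use of plain Python lists are assumptions made for illustration; the non-overlapping windows and the dropped trailing remainder follow the text.

```python
def segment(channel, seg_len):
    """Split a 1-D sequence into non-overlapping windows of seg_len samples,
    dropping any trailing remainder shorter than seg_len."""
    n_segments = len(channel) // seg_len
    return [channel[j * seg_len:(j + 1) * seg_len] for j in range(n_segments)]
```

Because no QRS detection is involved, the segment boundaries depend only on the sample index, so noise in the signal cannot shift them.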
Beat Level Attentive Convolutional Layer. For beat level patterns, we mainly consider abnormal wave shapes or edges. To locate them in the signals, we design an attentive convolutional layer. Formally, given the segments of a channel, we perform 1-D convolution on each of them and output convolved features, where the convolution weights are shared across segments. Then, instead of traditional global average pooling, which treats all features homogeneously, we propose a knowledge-guided attention to aggregate these features into the beat level attention output, computed as a weighted sum of the columns of the convolved feature matrix. Thus the model can focus more on significant signal locations and offers better beat level interpretation. Details of the knowledge-guided attention will be introduced in Section 3.3.
Rhythm Level Attentive Recurrent Layer. For rhythm level patterns, we mainly consider abnormal rhythm variation. To capture it from beat sequences, RNNs are a natural choice due to their ability to learn from data with temporal dependencies. Again, to improve interpretability and accuracy, we use knowledge-guided attention with rhythm knowledge.
Specifically, we use a bidirectional Long Short-Term Memory network [schuster1997bidirectional] (Bi-LSTM) to get rhythm level annotations of the segments. We concatenate the forward and backward outputs of the Bi-LSTM to obtain the rhythm level features. We then use knowledge-guided attention with rhythm knowledge to output the rhythm level attention, a weighted sum of the rhythm level hidden states in which each weight reflects the importance of the corresponding hidden state.
Fusion and Prediction. At the beginning, we decompose ECG signals into multiple channels (i.e., frequency bands) and learn rhythm level features from each channel. Now we perform attention fusion across all channels to obtain a more comprehensive view of the signal.
We first apply a fully connected transformation to the rhythm level outputs of all channels, where the bias vector is added to every column by broadcasting. Then, since the importance of these channels may not be homogeneous, we take a weighted average of the transformed outputs to calculate the frequency level attention, with one weight per channel. To determine the weights, we use frequency knowledge: signals with greater energy are more informative. Here we use power spectral density to measure energy.
Last, given the integrated features, we make a prediction using a softmax prediction layer and optimize the weighted cross entropy loss, which sums over all classes the class weight times the log predicted probability, with an indicator function selecting the ground truth class. The class weight vector has the same shape as the prediction vector and is adjusted to handle the class imbalance problem, which is common in the medical domain.
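As an illustration of the loss above, here is a minimal single-sample sketch of a class-weighted cross entropy. The function names are hypothetical, and the way the class weights are chosen is not specified in this section, so the weights are simply passed in.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def weighted_cross_entropy(logits, label, class_weights):
    """Loss for one sample: minus the weight of the true class times the log
    of its predicted probability. The indicator over classes reduces to
    selecting the single true-class term."""
    p = softmax(logits)
    return -class_weights[label] * math.log(p[label])
```

Giving the minority class a larger weight makes errors on that class more costly, which is one common way to counter class imbalance.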
3.3 Knowledge Guided Attention of MINA
We now describe how to compute the multilevel attention weights. The attention mechanism can be regarded as a two-layer neural network: the first fully connected layer calculates the scores used for computing the weights; the second fully connected layer computes the weights from these scores via softmax activation.
In the first layer, the scores are computed based on the following features: (1) the multilevel outputs extracted by MINA, and (2) domain knowledge features at the beat level, rhythm level, and frequency level. Concretely, the three levels of domain knowledge features are represented as follows.
Beat Level: For beat level knowledge, we mainly consider abnormal wave shapes or sharply changing points such as the QRS complex [kashani2005significance]. To represent it, we compute the first-order difference followed by a convolutional operation on each segment to extract the beat level knowledge feature. Detailed configurations of this convolution are introduced in the Implementation Details section.
Rhythm Level: Attention weights focus on rhythm level variation, such as the severe fluctuation seen in ventricular fibrillation [yanowitz2012introduction]. To characterize it, we compute the standard deviation of each segment to extract the rhythm level knowledge feature vector.
Frequency Level: At the frequency level, signals with greater energy contain more information and thus need more attention [yanowitz2012introduction]. So we use power spectral density (PSD), a popular measure of energy, to extract the frequency level knowledge feature vector, where the PSD [Oppenheim:1996:SAS:248702] of each channel is calculated using a periodogram.
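The raw ingredients of the three knowledge features can be sketched as below. This is an illustrative simplification under stated assumptions: the function names are hypothetical, the beat level convolution that follows the first-order difference is omitted, the standard deviation uses the population form, and reducing the periodogram to a single number per channel by summing over frequency bins (with a 1/n scaling and no sampling-rate factor) is an assumption rather than the paper's stated procedure.

```python
import cmath
import math

def first_difference(segment):
    """Beat level ingredient: first-order difference, large at sharp changes."""
    return [b - a for a, b in zip(segment, segment[1:])]

def segment_std(segment):
    """Rhythm level ingredient: population standard deviation of one segment."""
    mean = sum(segment) / len(segment)
    return math.sqrt(sum((v - mean) ** 2 for v in segment) / len(segment))

def periodogram_power(channel):
    """Frequency level ingredient: periodogram |X_k|^2 / n summed over all
    DFT bins. The direct O(n^2) DFT is only meant for short inputs."""
    n = len(channel)
    total = 0.0
    for k in range(n):
        x_k = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, v in enumerate(channel))
        total += abs(x_k) ** 2 / n
    return total
```

With this scaling, Parseval's theorem makes the summed periodogram equal to the sum of squared samples, which gives a quick sanity check for the frequency level value.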
Then, for each level, we concatenate the model outputs and the knowledge features to compute the scores and the attention weights. Each level has its own weights and biases in the first layer and its own weights in the second layer; the bias is applied by addition with broadcasting.
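A minimal sketch of one level of this two-layer attention is given below. It is an assumption-laden illustration, not the paper's exact formulation: the parameter names `W`, `b`, and `v`, the absence of a nonlinearity between the two layers, and the list-based data layout are all choices made here for readability.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def knowledge_guided_attention(features, knowledge, W, b, v):
    """features: one feature vector per position (e.g. per segment).
    knowledge: one knowledge vector per position.
    W: first-layer weights, shape (len(feature) + len(knowledge), hidden).
    b: first-layer bias of length hidden, shared by all positions (broadcast).
    v: second-layer weights of length hidden.
    Returns the attention weights and the attention-weighted feature sum."""
    scores = []
    for f_t, k_t in zip(features, knowledge):
        z = list(f_t) + list(k_t)  # concatenate model output and knowledge
        hidden = [sum(W[i][j] * z[i] for i in range(len(z))) + b[j]
                  for j in range(len(b))]
        scores.append(sum(v_j * h_j for v_j, h_j in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return weights, pooled
```

In this sketch, positions whose knowledge feature is large receive a higher score whenever the learned weights on the knowledge inputs are positive, which is how domain knowledge can steer the attention.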
3.4 Method for Evaluating Interpretability and Robustness
To evaluate the interpretability and robustness of MINA, we perturb the signals and observe attention weights and prediction results. The evaluation method is illustrated in Fig. 3.
Concretely, we add signal distortion (a low frequency interferer) or noise (a high frequency interferer) to the original ECG signal to obtain a perturbed signal; here we choose baseline signal distortion and white noise. For the perturbed signals, we apply MINA to generate predictions and output multilevel attention weights. We compare them with the original predictions and attention weights obtained from unperturbed data.
To evaluate the interpretability of MINA, we visually check whether the attention weights are in line with medical evidence. For the beat level attention weights of the segments, before and after perturbation, we align them to the input ECG signals, where each attention weight approximately corresponds to a short span of input samples. Then we visualize the values and verify whether high weights relate to beat level medical evidence. For the rhythm level attention weights, before and after perturbation, we align them to segments, where each weight corresponds to one segment. Then we verify whether high weights relate to rhythm level medical evidence. For the frequency level attention weights, before and after perturbation, we align them to channels, where each weight corresponds to one channel. Likewise, we check whether high weights relate to frequency level medical evidence.
We evaluate the robustness of MINA based on two tasks: (1) we visually check whether the new attention weights after perturbation are still in line with medical evidence, in the same way as above; (2) we gradually change the amplitude of the interferer and evaluate the change in overall performance. A more robust model will be less impacted. Moreover, these results can also be used to evaluate interpretability, since an interpretable model highlights meaningful information while suppressing unrelated parts.
In this section, we first describe the dataset used for the experiments, followed by the description of the baseline models. Then we discuss the model performance.
4.1 Source of Data
We conducted all experiments using real world ECG data from the PhysioNet Challenge 2017 database [clifford2017af]. The dataset contains 8,528 de-identified ECG recordings lasting from 9s to just over 60s and sampled at 300Hz by the AliveCor device, 738 from AF patients and 7,790 from controls as predefined by the challenge. We first divided the data into a training set (75%), a validation set (10%) and a test set (15%) to train and evaluate in all tasks. Then, we preprocessed the recordings to obtain equal-length data. The summary statistics of the data are given in Table 2. In this study, the objective is to discriminate records of AF patients from those of controls.
|Type||# recording||# of points|
4.2 Baseline Models
We compare MINA with the following models:
1. Expert: A combination of extracted features used in AF diagnosis, including: rhythm features such as sample entropy on the QRS interval [expert_1, tateno2001automatic]; thresholding on the median absolute deviation (MAD) of RR intervals [expert_3]; heart rate variability in the Poincare plot [park2009atrial]; morphological features such as location, area, duration, interval, amplitude and slope of the related P wave, QRS complex, ST segment and T wave; and frequency features such as frequency band power. We used the QRS segmentation method in [pan1985real] and built both a logistic regression classifier (ExpertLR) and a random forest classifier (ExpertRF) on the extracted features.
2. CNN: Convolutional layers are applied to ECG segments with shared weights. We use global average pooling to combine features, followed by a fully connected (FC) layer and a softmax layer for prediction. The model architecture is modified from [tbe] to handle ECG segments. The hyper-parameters in the CNN, FC and softmax layers are the same as in MINA to match the model complexity.
3. CRNN: We use shared-weight convolutional layers on ECG segments and replace the global average pooling with a bi-directional LSTM. Then FC and softmax layers are applied to the top hidden layer. The architecture is modified from [zihlmann2017convolutional], but keeps only one convolutional layer. Other hyper-parameters in the CNN, RNN, FC and softmax layers are the same as in MINA.
4. ACRNN: Based on CRNN, with additional beat level attentions and rhythm level attentions. Other hyper-parameters are the same as in MINA.
4.3 Implementation Details
In the convolutional layers of CNN, CRNN, ACRNN and MINA, we use one layer for each model. The number of filters is set to 64, the filter size is set to 32 and the stride is set to 2. Pooling is replaced by the attention mechanism. The convolution used to extract the beat level knowledge feature has one filter of size 32, and its stride is also 2. In the recurrent layers of CRNN, ACRNN and MINA, we also use a single layer for each model, and the number of hidden units in each LSTM is set to 32. The dropout rate in the fully connected prediction layer is set to 0.5. In sliding window segmentation, we use a non-overlapping stride. Deep models are trained with mini-batches of 128 samples for 50 iterations, which was sufficient to reach the best performance on the classification task. The final model was selected using an early stopping criterion on the validation set. We then tested each model 5 times using different random seeds, and report the mean values with standard deviations. All models were implemented in PyTorch version 0.3.1 and trained on a system equipped with 64GB RAM, 12 Intel Core i7-6850K 3.60GHz CPUs and an Nvidia GeForce GTX 1080. All models were optimized using Adam [adam], with the learning rate set to 0.003. Our code is publicly available at https://github.com/hsd1503/MINA.
4.4 Performance Comparison
Performance was measured by the Area under the Receiver Operating Characteristic curve (ROC-AUC), the Area under the Precision-Recall Curve (PR-AUC) and the F1 score. The PR-AUC is considered a better measure for imbalanced data like ours [davis2006relationship]. Table 3 shows that MINA outperforms all baselines, with a PR-AUC of 0.9436 against 0.8943 for the second-best model (CRNN).
|Model||ROC-AUC||PR-AUC||F1|
|ExpertLR||0.9350 ± 0.0000||0.8730 ± 0.0000||0.8023 ± 0.0000|
|ExpertRF||0.9394 ± 0.0000||0.8816 ± 0.0000||0.8180 ± 0.0000|
|CNN||0.8711 ± 0.0036||0.8669 ± 0.0068||0.7914 ± 0.0090|
|CRNN||0.9040 ± 0.0115||0.8943 ± 0.0111||0.8262 ± 0.0215|
|ACRNN||0.9072 ± 0.0047||0.8935 ± 0.0087||0.8248 ± 0.0229|
|MINA||0.9488 ± 0.0081||0.9436 ± 0.0082||0.8342 ± 0.0352|
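The relative PR-AUC improvement can be recomputed from the table above with a short script. This assumes that the second numeric column of the table is PR-AUC, following the order in which the metrics are listed in the text; the variable names are illustrative.

```python
# Mean PR-AUC per model, copied from the second numeric column of Table 3
# (assumed to be the PR-AUC column).
pr_auc = {
    "ExpertLR": 0.8730,
    "ExpertRF": 0.8816,
    "CNN": 0.8669,
    "CRNN": 0.8943,
    "ACRNN": 0.8935,
    "MINA": 0.9436,
}

baselines = {name: value for name, value in pr_auc.items() if name != "MINA"}
best_name = max(baselines, key=baselines.get)
relative_gain = (pr_auc["MINA"] - baselines[best_name]) / baselines[best_name]
print(f"best baseline: {best_name}, relative PR-AUC gain: {relative_gain:.2%}")
```

Under that assumption, the best baseline is CRNN and the relative gain is roughly 5.5%.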
5 Interpretability and Robustness Analysis
5.1 MINA Automatically Extracts Clinically Meaningful Patterns
When reading an ECG record (upper left in Fig. 4), cardiologists make an AF diagnosis based on the following clinical evidence: 1) the absence of the P wave, a small upward wave before the QRS complex; 2) irregular RR intervals, such as a much wider interval between two of the QRS complexes.
MINA learns these patterns automatically via beat-, rhythm-, and frequency level attention weights. From Fig. 4, the beat level attentions point to where QRS complex or absent P waves occur. The rhythm level attentions indicate the location of abnormal RR interval, which precisely matches the clinical evidence. Besides, from the frequency level attentions, we notice channel 10Hz-50Hz receives the highest attention weight so MINA pays more attention to it. In fact, QRS complex, the most significant clinical evidence in ECG diagnosis, is known to be dominant in 10Hz-50Hz [tateno2001automatic, expert_3, expert_1].
5.2 MINA Remains Interpretable and Robust Against Baseline Signal Distortion
Baseline wander distortion is a low frequency noise with slow but large changes of the signal offset. It is a common issue that degrades ECG analysis performance. In this experiment, we mimic the real world setting by distorting data and observe whether MINA can still make robust and interpretable predictions.
For the experiment, we interfered the signal in Fig. 4 with baseline wander distortion. The interfered signal is plotted in Fig. 5(a). From the original frequency attention in Fig. 4, it is easy to see that Channel 1 (<0.5Hz) has the lowest weight, while Channel 2 (0.5Hz-50Hz) is weighted much higher. Thus Channel 1 can be interpreted as the baseline component and Channel 2 as the clean signal component. MINA pays more attention to Channel 2 than to Channel 1. After signal distortion, the importance of both channels remains the same, which is also reflected in their beat level and rhythm level attentions. Channel 1 shows no significant patterns, but the more informative Channel 2 has beat and rhythm level patterns similar to those of the unperturbed data, which indicates that the interpretability of MINA is less impacted by data distortion.
To evaluate model robustness, we compare the performance change as the distortion amplitude increases on the entire test set. As shown in Fig. 5(d), MINA has a much lower performance drop even after distortion with a large amplitude, while all baselines start to show a large performance drop even with little distortion. This is mainly thanks to frequency attention fusion. During training, the model already identified Channel 1 as a baseline signal. Thus baseline distortion has less impact on the important signals in the clean signal channel. Since baseline signal distortion occurs in real clinical settings, MINA will provide more accurate predictions in these scenarios.
5.3 MINA Remains Interpretable and Robust in the Presence of Noise
High frequency noise contamination is another common issue. For this experiment, we perturbed the signal in Fig. 4 with white noise. The perturbed signal is shown in Fig. 6(a). Similar to the last experiment, from the original frequency attentions we know that Channel 3 (>50Hz) has a lower weight. It is a channel known for high noise. Channel 2 (0.5Hz-50Hz), in contrast, is weighted much higher and is known as a clean signal channel.
After noise contamination, the noise mostly impacts the noise channel, which is less important in the prediction of MINA, while the more informative Channel 2 has beat and rhythm level patterns similar to those of the unperturbed data, which indicates that the interpretability of MINA is less impacted by noise contamination. In Fig. 6(d), we compare the PR-AUC change as the noise amplitude increases on the entire test set. MINA is less impacted by noise than other methods, demonstrating more robust performance in the presence of noise thanks to frequency attention fusion.
6 Conclusion and Future Work
In this paper, we propose MINA, a deep multilevel knowledge-guided attention network that interpretably predicts heart disease from ECG signals. MINA outperformed baselines in the heart disease prediction task. Experimental results also showed robustness and strong interpretability under signal distortion and noise contamination. In the future, we can extend MINA to a broad range of diseases where ECG signals can be treated as additional information for diagnosis, on top of other health data such as electronic health records. We will then need to investigate interpretable prediction based on multimodal data, which is a potentially rewarding avenue of future research.
This work was supported by the National Science Foundation, awards IIS-1418511, CCF-1533768 and IIS-1838042, and the National Institute of Health, awards 1R01MD011682-01 and R56HL138415. We also thank Li Jiang from BOE for valuable discussions.
Appendix A Background of Electrocardiography (ECG)
Electrocardiography (ECG) is a test that measures the electrical activity of the heartbeat. With each beat, an electrical impulse travels through the heart and causes the muscle to squeeze and pump blood from the heart. The ECG signal records the timing of the upper and lower chambers.
A normal heartbeat in ECG is shown in Fig. 7. Usually a “P wave”, which is produced by the right and left atria (the upper chambers), arrives first, followed by a flat line indicating when the electrical impulse travels to the bottom chambers. The next wave, the QRS complex, represents ventricular depolarization. It is followed by ventricular repolarization (ST segment, T wave), which represents electrical recovery or return to a resting state for the ventricles. In addition, there is a “U wave” that represents papillary muscle repolarization.
ECG signals offer two types of information: 1) the time intervals measure how long the electrical wave needs to pass through the heart, showing whether the conduction is normal or slow, fast or irregular; 2) the amount of electrical activity passing through the heart shows whether parts of the heart have become abnormal in size.
The time domain features for heart disease diagnosis include beat level and rhythm level.
At the beat level, an unusual P wave may indicate diseases such as atrial fibrillation (AF), ectopic atrial pacemaker, atrial enlargement, etc. An unusual QRS complex may indicate diseases such as left/right bundle branch block and ventricular tachycardia. An unusual ST segment and T wave may indicate myocardial infarction, ischemia, and left ventricular hypertrophy.
At the rhythm level, the analysis is usually based on the intervals between QRS complexes, which are called RR intervals. A long RR interval may indicate sinus bradycardia, a short RR interval may indicate sinus tachycardia or ventricular tachycardia, while irregular RR intervals may indicate AF. However, many diseases such as AF show patterns at both the beat level and the rhythm level, so it is beneficial to combine them for analysis.
Appendix B Frequency Band for ECG Signals
The ECG signal is a mixture of the heart muscle’s electrophysiologic activities, including those of the atria, ventricles, papillary muscle and myocardium. Besides, it may also contain other electrical components from muscle, skin, respiration, body movement, etc. The frequency bands listed below are commonly considered dominant components in the ECG signal:
< 0.5 Hz: very low frequency component, mainly representing baseline wander unrelated to the heart.
0.12 Hz - 0.5 Hz: respiration.
0.5 Hz - 50 Hz: P wave, QRS complex and T wave.
0.67 Hz - 5 Hz: P wave.
1 Hz - 7 Hz: T wave.
5 Hz - 50 Hz: muscle.
10 Hz - 50 Hz: QRS complex is the most dominant component.
> 50 Hz: high frequency noise.
All: raw signal.
Notice that these frequency bands are approximate, since they cannot be cleanly separated. Besides, their significance may also vary among people. However, it is beneficial to combine frequency domain features and time domain features for disease diagnosis, since the transformation into frequency bands divides time domain ECG signals into subspaces, which helps classification tasks.
The illustration of frequency transformation is shown in Fig.8.
Appendix C Interferer Simulation Details
We simulate the baseline wander distortion signal using a sine function and the noise contamination signal using a random normal distribution. Concretely, the interfered signal is obtained by scaling the interferer by an amplitude and adding it elementwise to the original signal of the same length.
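The two interferers can be sketched as follows. The function names, the number of sine cycles across the signal, the unit-variance noise, and the fixed random seed are assumptions for illustration; the section only states that a sine function and a random normal distribution are used, scaled by an amplitude and added elementwise.

```python
import math
import random

def baseline_wander(n, amplitude, cycles=1.0):
    """Low frequency interferer: a sine wave spanning `cycles` periods over n
    samples, scaled by `amplitude` (the cycle count is an assumed parameter)."""
    return [amplitude * math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

def white_noise(n, amplitude, seed=0):
    """High frequency interferer: zero-mean, unit-variance Gaussian samples
    scaled by `amplitude`; seeded so that runs are repeatable."""
    rng = random.Random(seed)
    return [amplitude * rng.gauss(0.0, 1.0) for _ in range(n)]

def interfere(signal, interferer):
    """Elementwise addition of an interferer to a signal of the same length."""
    return [s + i for s, i in zip(signal, interferer)]
```

Sweeping `amplitude` over a range of values and re-evaluating the model on `interfere(x, baseline_wander(len(x), amplitude))` would reproduce the kind of robustness curve described in Section 3.4.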
Appendix D More Interpretability Evaluation Examples