Multiple Instance Learning for ECG Risk Stratification

12/02/2018 ∙ by Divya Shanmugam, et al. ∙ MIT berkeley college

In this paper, we apply a multiple instance learning paradigm to signal-based risk stratification for cardiovascular outcomes. In contrast to methods that require hand-crafted features or domain knowledge, our method learns a representation with state-of-the-art predictive power from the raw ECG signal. The multiple instance learning framework is particularly valuable for learning from biometric signals, where patient-level labels are available but signal segments are rarely annotated. We make two contributions in this paper: 1) reframing risk stratification for cardiovascular death (CVD) as a multiple instance learning problem, and 2) using this framework to design a new risk score, for which patients in the highest quartile are 15.9 times more likely to die of CVD within 90 days of hospital admission for an acute coronary syndrome than patients in the lower three quartiles.




1 Introduction

Machine learning has led to improved risk stratification models for a number of outcomes, including stroke [8], cancer [6], and treatment resistance [15]. We consider risk stratification for cardiovascular death (CVD) within 90 days of hospital admission. 90-day risk stratification for cardiovascular patients is especially important because the majority of patients who suffer from acute coronary syndrome (ACS) die of CVD within months. Equipped with informative short-term risk metrics, doctors can make the necessary care decisions to reduce adverse outcomes. Moreover, predicting CVD within 90 days is a common task in the risk stratification literature [10, 13], enabling direct comparison to existing methods.

In this paper, we apply multiple instance learning to predicting CVD. Multiple instance learning assumes that labels for collections of instances are available, but labels for individual instances are not. This framework lends itself to risk stratification using biometric signals because, while patient-level labels are frequently available, fine-grained annotated signals rarely are. To the best of our knowledge, this work is the first application of multiple instance learning to biometric signals.

Background The simplest means of predicting CVD is to construct a model based on easy-to-quantify patient characteristics, such as age, sex, or Left Ventricular Ejection Fraction (LVEF) [3]. However, there is strong evidence that leveraging electrocardiogram (ECG) signals, which measure electrical activity in a patient's heart, can add significant predictive power [10]. Because these signals take the form of long time series, it is not obvious how best to incorporate them into a predictive model.

One approach is to treat the entire signal as an input to a model, either by extracting summary statistics or feeding it into a recurrent neural network [14]. An alternative is to treat one ECG signal as a sequence of many examples of heartbeats. A demonstrably successful means of doing so is to extract pairs of consecutive heartbeats and represent each pair using a set of informative features [10, 12, 16]. Such a pair is a special case of what we will refer to as a Consecutive Beat Series (CBS).

This approach presents two challenges:

  1. Representation. It is not obvious what features to extract from each CBS. This is made especially difficult by the variable duration of heartbeats.

  2. Nondeterminism. Different levels of CVD risk at the patient level do not necessarily yield different characteristics at the level of CBSs. Namely, a high-risk patient might have more worrisome heartbeats than a low-risk patient, but it is unlikely that all, or even a large number, of their heartbeats will be worrisome.


We address these challenges by fusing ideas from two areas of machine learning: deep learning and Multi-Instance Learning (MIL). We address the representation challenge by learning predictive features directly from the data, with no task-specific engineering, using a compact neural network. This is in sharp contrast to most existing approaches, which extract handcrafted features.

To address the challenge of nondeterminism, we cast the task as a MIL problem. In MIL, labels are available for collections of examples, but not individual examples. In our case, the examples are consecutive beat series, the collections are the sets of CBSs extracted from a given ECG signal, and the labels are the outcomes for the associated patients. In this paper, we introduce:

  • The framing of signal-based risk stratification as a Multi-Instance Learning problem.

  • A new ECG risk score that outperforms existing risk metrics in terms of both AUC and hazard ratio, for both 60-day and 90-day risk of cardiovascular death.

  • Evidence that, contrary to common wisdom, contiguous triplets of heartbeats are more informative than contiguous pairs for cardiovascular risk stratification.

We include a summary of our method in Section 2 and an outline of our experiments in Section 3. We then discuss our results in Section 4. The paper concludes with a discussion of future work in Section 5.

2 Methods

We formulate the cardiovascular death risk stratification task as follows. We are given a collection of $N$ ECG signals $\{x_1, \dots, x_N\}$. Each signal $x_i$ is associated with a unique patient $i$ and consists of $T_i$ samples, $x_i = (x_{i,1}, \dots, x_{i,T_i})$. Each patient is associated with a label $y_i \in \{0, 1\}$, where $y_i = 1$ indicates that the patient died of cardiovascular death within $d$ days of hospital admission. Our task is to predict the label for held-out patients based on their ECG signals.

Our approach is to treat this as a multiple instance learning problem. This can be formalized as the construction of three functions:

  1. Instance Extractor. A map $E$ from the space of ECG signals to the space of collections of instances.

  2. Instance Classifier. A map $f$ from individual instances to probabilistic class predictions.

  3. Instance Aggregator. A map $g$ from predictions for each of the instances in a collection to an overall prediction for the collection.

2.1 Instance Extractor

In the first step, the instance extractor $E$ transforms a signal $x$ into a set of instances, $B = E(x)$. In the context of ECG signals, our instances are groups of $K$ consecutive heartbeats. Each group of consecutive beats taken from a given ECG signal is known as a Consecutive Beat Series (CBS). Many existing methods operate on consecutive beat pairs (i.e., $K = 2$). To extract these groups, we identify the peak of each heartbeat using a wavelet-based method [11] and take one second of data, centered on each peak. We opt for a simple instance segmentation procedure to demonstrate that our approach does not rely on carefully crafted instance segmentation protocols. We also limit the number of instances taken from each ECG signal to 1000, corresponding to roughly 15 minutes.
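
As an illustration of the segmentation described above, the following is a minimal Python sketch rather than the authors' implementation. It assumes that beat peaks have already been located by a separate peak detector and are given as sample indices. The function name `extract_cbs`, the one-beat sliding stride, and the choice to skip beats whose window runs past the signal boundary are assumptions made here for illustration.

```python
from typing import List, Sequence


def extract_cbs(
    signal: Sequence[float],
    peaks: Sequence[int],
    k: int = 2,
    fs: int = 128,
    max_instances: int = 1000,
) -> List[List[float]]:
    """Build Consecutive Beat Series (CBS) instances from one ECG signal."""
    half = fs // 2
    # One beat = one second of data (fs samples) centered on a peak.
    # Assumption: beats whose window leaves the signal are skipped.
    beats = [
        list(signal[p - half : p + half])
        for p in peaks
        if p - half >= 0 and p + half <= len(signal)
    ]
    instances: List[List[float]] = []
    # Assumption: slide over the beats one beat at a time.
    for start in range(len(beats) - k + 1):
        if len(instances) >= max_instances:
            break  # cap the number of instances per signal
        instance: List[float] = []
        for beat in beats[start : start + k]:
            instance.extend(beat)  # concatenate k consecutive beats
        instances.append(instance)
    return instances
```

With a 128 Hz signal, each beat contributes 128 samples, so an instance of three consecutive beats is a vector of 384 samples.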

2.2 Instance Classifier

We employ a compact, fully connected neural network to map each instance to its patient outcome. The model consists of one fully connected hidden layer connected to a sigmoid output. It is important to note that error in patient outcome prediction is back-propagated through the network during this step. We validated layer size across 1, 2, 5, and 10 units and discovered that performance plateaus after a layer size of 2. Thus, the results presented here are for a fully connected 1-hidden-layer network of two ReLU-activated units and one sigmoid output.
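
To make the size of this network concrete, here is a minimal sketch of its forward pass in plain Python: one hidden layer of two ReLU units followed by a sigmoid output. The names `init_params` and `predict_instance`, the Gaussian initialization, and the parameter layout are assumptions for illustration; training, in which each instance is fit against its patient's label, is omitted.

```python
import math
import random
from typing import Dict, List, Sequence


def init_params(n_inputs: int, n_hidden: int = 2, seed: int = 0) -> Dict:
    """Small random weights and zero biases (initialization is an assumption)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0.0, 0.01) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.gauss(0.0, 0.01) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def predict_instance(params: Dict, x: Sequence[float]) -> float:
    """Return the predicted probability of the positive class for one instance."""
    # Hidden layer: affine map followed by ReLU.
    hidden: List[float] = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    # Output layer: affine map followed by a sigmoid.
    z = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-z))
```

For an instance of two one-second beats sampled at 128 Hz, `init_params(256)` yields 2 × 256 + 2 hidden-layer parameters and 2 + 1 output parameters, which is what makes the model compact.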

This model has several advantages over existing approaches. First, it operates directly on the raw instances, with no feature extraction or costly alignment operations. Second, it allows the use of more than two beats as an instance; this is not the case for methods such as MV [16] and MVB [10], which specifically compute distances between pairs of beats.

2.3 Instance Aggregator

We aggregate instances based on the collective assumption in MIL. That is, instead of assuming that only collections with a positive label have any instances predicted as belonging to the positive class, we only assume that collections with a positive label have more of these instances [4]. Based on this assumption, we compute the probability of a collection $B$ having label $y = 1$ as the mean of the predictions for each of its instances. I.e.,

$$P(y = 1 \mid B) = \frac{1}{|B|} \sum_{b \in B} f(b),$$

where $f(b)$ is the instance classifier's predicted probability for instance $b$.

This formalizes our hypothesis that the ECG signals of patients likely to die within the specified number of days of hospital admission contain certain pathological beat sequences at a higher rate than those of low-risk patients. Following a convention common in clinical literature, we designate patients that fall in the upper quartile of the metric as high risk and those in the lower three quartiles as low risk.
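
The aggregation and the quartile split can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the cutoff uses one simple rank-based quartile convention among several possible ones (scores tied with the cutoff are treated as low risk).

```python
import math
from typing import List, Sequence


def patient_risk(instance_probs: Sequence[float]) -> float:
    """Patient-level score: mean of the instance-level predictions."""
    if not instance_probs:
        raise ValueError("a patient needs at least one instance")
    return sum(instance_probs) / len(instance_probs)


def flag_high_risk(scores: Sequence[float]) -> List[bool]:
    """Mark patients whose score lies strictly above the 75th-percentile rank."""
    ranked = sorted(scores)
    # Assumption: rank-based cutoff at the value closing the third quartile.
    cutoff = ranked[math.ceil(0.75 * len(ranked)) - 1]
    return [s > cutoff for s in scores]
```

For example, a patient whose three instances receive predictions 0.2, 0.4, and 0.9 gets a risk score of 0.5, even though only one instance looks worrisome on its own.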

3 Experimental Set-Up

Data We use a subset of 5396 patients from the MERLIN-TIMI dataset [13]. We filter out patients who leave the study before 90 days, leaving us with 5245 patients. Of these patients, 107 die of CVD within 90 days of hospital admission. Demographic factors for CVD and non-CVD patients are similar, as are their distributions of pre-existing conditions, such as smoking.

Each ECG signal was sampled at a rate of 128 Hz and spans approximately forty-eight consecutive hours. We apply standard ECG pre-processing: removing abnormal beats and baseline wander, and normalizing the signal. This protocol is directly borrowed from [10], and we perform the relevant steps using the same Physionet SQI package [7].

Baselines We evaluate our approach against two sets of benchmarks: existing CVD risk metrics and existing multiple instance learning methods.

We evaluate against three existing CVD risk metrics: MV [16], MVB [10], and LR-RNN [14]. Morphological variability (MV) measures risk for CVD by averaging the dynamic time warping distance between adjacent beats. Morphological variability in beat-space (MVB) improves upon MV by learning the best frequency at which to compute variability between segments in an ECG signal. The LR-RNN approach combines the output of an RNN trained on the first minute of a patient's ECG signal and the output of a logistic regression model over seven patient features.

We also evaluate against three MIL methods: STK, MIL-LR and MIL-NN. The size of our dataset far exceeds the typical scale of MIL datasets [2], disqualifying several kernel-based methods because of storage constraints. STK applies a specialized SVM, which defines collection similarity as the dot product of collection-level statistics including mean, variance, and standard deviation [5]. MIL-LR, a variation of our method, combines the MIL framework with logistic regression as the instance classifier. MIL-NN[K] combines the proposed instance aggregator with a neural network instance classifier trained on $K$ consecutive heartbeats. We test over several values of $K$.
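
To illustrate the kind of collection similarity that STK relies on, the sketch below summarizes each collection by per-feature mean, variance, and standard deviation and compares two collections by a dot product. It is a simplified illustration only: the function names are hypothetical, population variance is an assumed choice, and the specialized SVM that consumes this similarity is not shown.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence


def bag_statistics(bag: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-feature mean, variance, and standard deviation."""
    columns = list(zip(*bag))  # one tuple of values per feature
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]  # assumption: population variance
    stds = [math.sqrt(v) for v in variances]
    return means + variances + stds


def statistic_kernel(
    bag_a: Sequence[Sequence[float]], bag_b: Sequence[Sequence[float]]
) -> float:
    """Similarity of two collections: dot product of their summary statistics."""
    stats_a, stats_b = bag_statistics(bag_a), bag_statistics(bag_b)
    return sum(a * b for a, b in zip(stats_a, stats_b))
```

Because each collection is reduced to a fixed-length vector, the cost of one comparison does not depend on how many instances the collections hold.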

AUC .68 .72 .56 .72 .73 .79 .75 .72 .72 .77
90-Day HR 1.06 5.52 2.40 4.16 9.66 15.90 10.74 8.45 8.81 5.92
Table 1: Model performance predicting cardiovascular risk. We report the published performance of MV, MVB, LR-RNN and TRS.

4 Results

The MIL framework combined with a neural network instance classifier produces a state-of-the-art risk metric, as measured by hazard ratios and AUCs. We include the reported AUCs and hazard ratios (HRs) for each risk metric in Table 1 and include the Kaplan-Meier mortality curve in Figure 1.
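
For readers less familiar with AUC, it can be read as the probability that a randomly chosen patient who died receives a higher risk score than a randomly chosen patient who survived. The following is a minimal sketch of that pairwise definition in Python; the function name `auc` and the half credit given to ties are conventions assumed here, not details taken from the paper.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # assumption: ties count as half
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the worst case, which remains manageable when one class is small, as it is for a rare outcome.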

Hazard ratios are commonly reported in the literature for cardiac risk stratification; for our score, we rank patients based on the risk score and calculate the HR according to the Cox Proportional Hazards model [9]. This hazard ratio measures the time-averaged relative risk between the high-risk and low-risk cohorts. Averaged across all days, a patient with a MIL-NN3 risk score in the top quartile is approximately 16 times more likely to die of CVD within 90 days than a patient in the lower three quartiles. In comparison, the next best non-MIL approach achieves an HR of 8.81, roughly half that of the MIL-NN3 risk score. We present our results in forecasting CVD within 90 days of hospital admission in Table 1.

Figure 1: Kaplan-Meier curve for 90-day risk of cardiovascular death.
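
A Kaplan-Meier curve of this kind can be computed with the product-limit estimator. The sketch below is a minimal illustration, assuming each patient is described by a follow-up duration in days and a flag saying whether death was observed (a flag of 0 marks a censored patient, for example one still alive at day 90). The function name and input layout are assumptions.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    durations: Sequence[float], events: Sequence[int]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time."""
    event_times = sorted({d for d, e in zip(durations, events) if e})
    curve: List[Tuple[float, float]] = []
    survival = 1.0
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if e and d == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Computing one curve for the high-risk cohort and one for the low-risk cohort gives the two lines that such a figure compares.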

We also tested the model's ability to forecast CVD within 60 and 30 days of hospital admission. Our metric outperforms existing risk metrics at predicting CVD within 60 days of hospital admission, but performs slightly worse than both LR-RNN and MV in terms of 30-day HR.

Focusing on the comparison between the MIL-NN* models, we see that MIL-NN3 outperforms its counterparts that operate on differing numbers of consecutive beats. Existing CVD metrics, including MVB and MV, focus on adjacent pairs of beats. Our results suggest that contiguous triplets of heartbeats are more predictive of cardiac risk than consecutive pairs of heartbeats. Interestingly, groups of four heartbeats are worse than three. We hypothesize that this is because the larger parameter space of four contiguous beats makes it harder for the model to identify important patterns and avoid overfitting.

5 Discussion

The success of our approach suggests at least three directions for future work. First, our ability to forecast 90-day outcomes from ECG signals suggests that models for risk stratification at longer time scales, on the order of years instead of months, may be worth exploring. This has remained a relatively understudied problem [1], but may be tractable using extensions of the ideas presented here. Second, because there is nothing in our approach that is specific to ECG signals, it could be applied to risk stratification using other biometric signals as well. Third, the objective of this task is binary classification. Multiple instance regression could be of value in incorporating the date of cardiovascular death for each patient. Doing so would allow us to forecast patient outcomes at a more granular level.


  • [1] H. Bueno and R. M. Asenjo. Long-term cardiovascular risk after acute coronary syndrome, an ongoing challenge. Revista Española de Cardiología, 69(01):1–2, 2016.
  • [2] V. Cheplygina and D. M. Tax. Characterizing multiple instance datasets. In International Workshop on Similarity-Based Pattern Recognition, pages 15–27. Springer, 2015.
  • [3] G. Cintron, G. Johnson, G. Francis, F. Cobb, and J. N. Cohn. Prognostic significance of serial changes in left ventricular ejection fraction in patients with congestive heart failure. The V-HeFT VA Cooperative Studies Group. Circulation, 87(6 Suppl):VI17–23, 1993.
  • [4] J. Foulds and E. Frank. A review of multi-instance learning assumptions. The Knowledge Engineering Review, 25(1):1–25, 2010.
  • [5] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In ICML, volume 2, pages 179–186, 2002.
  • [6] M. Heidari, A. Z. Khuzani, A. B. Hollingsworth, G. Danala, S. Mirniaharikandehei, Y. Qiu, H. Liu, and B. Zheng. Prediction of breast cancer risk using a machine learning approach embedded with a locality preserving projection algorithm. Physics in Medicine & Biology, 63(3):035020, 2018.
  • [7] Q. Li, R. G. Mark, and G. D. Clifford. Robust heart rate estimation from multiple asynchronous noisy sources using signal quality indices and a Kalman filter. Physiological Measurement, 29(1):15, 2007.
  • [8] X. Li, H. Liu, X. Du, P. Zhang, G. Hu, G. Xie, S. Guo, M. Xu, and X. Xie. Integrated machine learning approaches for predicting ischemic stroke and thromboembolism in atrial fibrillation. In AMIA Annual Symposium Proceedings, volume 2016, page 799. American Medical Informatics Association, 2016.
  • [9] D. Y. Lin and L.-J. Wei. The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association, 84(408):1074–1078, 1989.
  • [10] Y. Liu, Z. Syed, B. M. Scirica, D. A. Morrow, J. V. Guttag, and C. M. Stultz. ECG morphological variability in beat space for risk stratification after acute coronary syndrome. Journal of the American Heart Association, 3(3):e000981, 2014.
  • [11] J. P. Martínez, R. Almeida, S. Olmos, A. P. Rocha, and P. Laguna. A wavelet-based ECG delineator: evaluation on standard databases. IEEE Transactions on Biomedical Engineering, 51(4):570–581, 2004.
  • [12] R. McCraty and F. Shaffer. Heart rate variability: new perspectives on physiological mechanisms, assessment of self-regulatory capacity, and health risk. Global Advances in Health and Medicine, 4(1):46–61, 2015.
  • [13] D. A. Morrow, B. M. Scirica, E. Karwatowska-Prokopczuk, S. A. Murphy, A. Budaj, S. Varshavsky, A. A. Wolff, A. Skene, C. H. McCabe, E. Braunwald, et al. Effects of ranolazine on recurrent cardiovascular events in patients with non–ST-elevation acute coronary syndromes: the MERLIN-TIMI 36 randomized trial. JAMA, 297(16):1775–1783, 2007.
  • [14] P. D. Myers, B. M. Scirica, and C. M. Stultz. Machine learning improves risk stratification after acute coronary syndrome. Scientific reports, 7(1):12692, 2017.
  • [15] R. H. Perlis. A clinical risk stratification tool for predicting treatment resistance in major depressive disorder. Biological psychiatry, 74(1):7–14, 2013.
  • [16] Z. Syed, B. M. Scirica, S. Mohanavelu, P. Sung, E. L. Michelson, C. P. Cannon, P. H. Stone, C. M. Stultz, and J. V. Guttag. Relation of death within 90 days of non-st-elevation acute coronary syndromes to variability in electrocardiographic morphology. The American journal of cardiology, 103(3):307–311, 2009.