Benefit-aware Early Prediction of Health Outcomes on Multivariate EEG Time Series

by   Shubhranshu Shekhar, et al.

Given a cardiac-arrest patient being monitored in the ICU (intensive care unit) for brain activity, how can we predict their health outcomes as early as possible? Early decision-making is critical in many applications, e.g. monitoring patients may assist in early intervention and improved care. On the other hand, early prediction on EEG data poses several challenges: (i) earliness-accuracy trade-off; observing more data often increases accuracy but sacrifices earliness, (ii) large-scale (for training) and streaming (online decision-making) data processing, and (iii) multi-variate (due to multiple electrodes) and multi-length (due to varying length of stay of patients) time series. Motivated by this real-world application, we present BeneFitter that infuses the incurred savings from an early prediction as well as the cost from misclassification into a unified domain-specific target called benefit. Unifying these two quantities allows us to directly estimate a single target (i.e. benefit), and importantly, dictates exactly when to output a prediction: when benefit estimate becomes positive. BeneFitter (a) is efficient and fast, with training time linear in the number of input sequences, and can operate in real-time for decision-making, (b) can handle multi-variate and variable-length time-series, suitable for patient data, and (c) is effective, providing up to 2x time-savings with equal or better accuracy as compared to competitors.




1. Introduction

Early decision making is critical in a variety of application domains. In medicine, early prediction of health outcomes for patients in the ICU allows the hospital to redistribute its resources (e.g., ICU bed-time, physician time, etc.) to in-need patients, and potentially achieve better health outcomes overall within the same amount of time. Of course, another critical factor in play is the accuracy of such predictions. Hastily but incorrectly predicting an unfavorable health outcome (e.g., withdrawal of life-sustaining therapies) could hinder equitable decision making in the ICU, and may also expose hospitals to very costly lawsuits.

A clinician considers patient history, demographics, etc. in addition to large amounts of real-time sensor information when making a decision. Our work is motivated by this real-world application: it would help alleviate the information overload on clinicians and aid them in early and accurate decision making in the ICU. The setting, however, is quite general. In predictive maintenance, the goal is to monitor the functioning of physical devices (e.g., workstations, industrial machines, etc.) in real-time through sensors, and to predict potential future failures as early as possible. Here again, earliness (of prediction) allows timely maintenance that prevents catastrophic halting of such systems, while hasty false alarms take away from the otherwise healthy lifetime of a device by introducing premature replacements that can be very costly.

Figure 1. BeneFitter wins: Note that BeneFitter (in red) is on the Pareto front (lotov2013interactive) of accuracy-vs.-tardiness trade-off on ECG dataset. Each point represents evaluation of a method for a setting of hyper-parameters controlling the trade-off.

As suggested by these applications, the real-time prediction problem necessitates modeling two competing goals: earliness and accuracy. They compete because observing for a longer time sacrifices earliness but provides more information (i.e., data) that can help achieve better predictive accuracy. To this end, we directly integrate a cost/benefit framework into our proposed solution, BeneFitter, toward jointly optimizing prediction accuracy and earliness. We do not tackle an explicit multi-objective optimization but rather directly model a unified target that infuses those goals.

Besides the earliness-accuracy trade-off, the prediction of health outcomes on electroencephalography (EEG) recordings of ICU patients brings additional challenges. A large number (107) of EEG signal measurements are collected from multiple electrodes, constituting high-dimensional multivariate time series (our data is 900 GB on disk). Moreover, the series can be of various lengths because patients might not survive, or might be discharged, after varying lengths of stay at the ICU. BeneFitter addresses these additional challenges: handling multi-variate and variable-length signals (i.e., time series), space-efficient modeling, scalable training, and constant-time prediction.

We summarize our contributions as follows.

  • Novel, cost-aware problem formulation: We propose BeneFitter, which infuses the incurred savings/gains from an early prediction at time $t$, as well as the cost from each misclassification, into a unified target called benefit. Unifying these two quantities allows us to directly estimate a single target, i.e., benefit, and importantly dictates exactly when BeneFitter should output a prediction: whenever the estimated benefit becomes positive.

  • Efficiency and speed: The training time for BeneFitter is linear in the number of input sequences, and it can operate under a streaming setting to update its decision based on incoming observations. Unlike existing work that trains a collection of prediction models, one for each time tick (dachraoui2015early; tavenard2016cost; mori2017early), BeneFitter employs a single model for each possible outcome, resulting in much greater space-efficiency.

  • Multi-variate and multi-length time-series: Due to hundreds of measurements from EEG signals collected from patients with variable length stays at the ICU, BeneFitter employs models that are designed to handle multiple time sequences, of varying length, which is a more general setting.

  • Effectiveness on real-world data: We apply BeneFitter on real-world (a) multi-variate health care data (our main motivating application for this work is predicting survival/death of cardiac-arrest patients based on their EEG measurements at the ICU), and (b) 11 other benchmark datasets pertaining to various early prediction tasks. On the ICU application, BeneFitter can make decisions with up to 2x time-savings as compared to competitors while achieving equal or better performance on accuracy metrics. Similarly, on benchmark datasets, BeneFitter provides the best spectrum for trading off accuracy and earliness (e.g., see Figure 1).

Reproducibility.  We share all source code and public datasets. Our EEG dataset involves real ICU patients and is under NDA with our collaborating healthcare institution.

Suitability for Real-world Use Cases

  • Business problem: Customizable, cost-aware formulation. The benefit allows domain experts to explicitly control the unit-time savings and the misclassification cost, thereby enabling them to incorporate into the predictions the knowledge gathered through experience or driven by the application domain.

  • Usability: Early prediction of health outcomes to aid physicians in their decisions.

2. Data and Problem Setting

2.1. Data Description

Our use case data are obtained from comatose patients who were resuscitated from cardiac arrest and underwent post-arrest electroencephalography (EEG) monitoring at a single academic medical center.

The raw EEG data are recorded from scalp electrodes placed in each hemisphere of the brain according to the 10–20 International System of Electrode Placement. The raw data is then used to compute quantitative EEG (qEEG) features at an interval of ten seconds, which amounts to about 900 GB of disk space for 725 patients. For our experiments, we selected 107 qEEG signals that physicians find informative from the electrode measurements corresponding to different brain regions. The 107-dimensional qEEG measurements from different electrodes on both the left and right hemispheres, including the amplitude-integrated EEG (aEEG), burst suppression ratio (SR), asymmetry, and rhythmicity, form our multivariate time series for analysis. We also record qEEG for each hemisphere as the average of the qEEG features from electrodes on the given hemisphere.

As part of preprocessing, we normalize the qEEG features to a fixed range. The EEG data contains artifacts that arise for a variety of informative (e.g., the patient wakes up) or arbitrary (e.g., device unplugged or unavailable) reasons. These result in missing values and in abnormally high or zero measurements. We filter out the zero measurements, which typically appear towards the end of each sequence, as well as abnormally high signal values at the beginning of each time series in the patient records. The zero measurements towards the end appear because the device is disconnected. Similarly, abnormally high readings at the start appear while a patient is being connected for measurement. The missing values are imputed through linear interpolation.

In this dataset, 225 patients (about 31%) out of a total of 725 patients survived, i.e., woke up from coma. Since the length of stay in the ICU depends on each individual patient, the dataset contains EEG records of length 24–96 hours. To extensively evaluate our proposed approach, we create versions of the dataset by median sampling (justusson1981median) the sequences at one-hour, 30-minute and 10-minute intervals (as summarized in §5, Table 4).
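The two preprocessing steps named above, linear interpolation of missing values and median sampling, can be illustrated with a short sketch. This is not the pipeline used for the EEG data: the function names, the NaN encoding of missing values, and the choice to drop a trailing partial window are assumptions made for illustration. With features computed every ten seconds, an hourly version would correspond to `window=360`, a 30-minute version to `window=180`, and a 10-minute version to `window=60`.

```python
import numpy as np

def interpolate_missing(x):
    """Fill NaN entries of a 1-D signal by linear interpolation over time.

    Assumes at least one value is observed; values before the first and after
    the last observed sample are held constant (np.interp behaviour).
    """
    x = np.asarray(x, dtype=float)
    idx = np.arange(len(x))
    known = ~np.isnan(x)
    return np.interp(idx, idx[known], x[known])

def median_downsample(x, window):
    """Replace each block of `window` consecutive samples by its median.

    A trailing partial block is dropped (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    n_blocks = len(x) // window
    blocks = x[: n_blocks * window].reshape(n_blocks, window)
    return np.median(blocks, axis=1)

# Toy signal with two missing readings (hypothetical values).
signal = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])
filled = interpolate_missing(signal)
coarse = median_downsample(filled, window=3)
```

For a multivariate record, the same two functions would be applied to each of the signals independently.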

2.2. Notation

A multi-variate time-series dataset is denoted as $\mathcal{D} = \{(X_i, l_i)\}_{i=1}^{n}$, consisting of observations and labels for $n$ instances. Each instance $i$ has a label $l_i \in \{1, \ldots, C\}$, where $C$ is the number of labels or classes (we use the terms label and class interchangeably throughout the paper). For example, each possible health outcome at the ICU is depicted by a class label such as survival or death. The sequence of observations is given by $X_i = \{X_{i1}, X_{i2}, \ldots, X_{iL_i}\}$ for $L_i$ equi-distant time ticks. Here, $L_i$ is the length of time series $i$ and varies from one instance to another in the general case. It is noteworthy that our proposed BeneFitter can effectively handle variable-length series in a dataset, whereas most existing early prediction techniques are limited to fixed-length time series, where $L_i = L$ for all $i$. Each observation $X_{it} \in \mathbb{R}^{d}$ is a vector of $d$ real-valued measurements, where $d$ is the number of variables or signals. We denote $X_i$'s observations from the start until time tick $t$ by $X_{i[1:t]}$.

2.3. Problem Statement

Early classification of time series seeks to generate a prediction for an input sequence $X_i$ based on its prefix $X_{i[1:t]}$ (the observations up to time tick $t$) such that $t$ is small and $X_{i[1:t]}$ contains enough information for an accurate prediction. Formally,

Problem 1 (Early classification).

Given a set of labeled multivariate time series $\mathcal{D}$, learn a function $f$ that assigns a label $l_i$ to a given time series $X_i$ based on its prefix $X_{i[1:t]}$, such that $t$ is small.

Challenges. The challenges in early classification are two-fold: domain-specific and task-specific, discussed as follows.

Domain-specific: Data preprocessing is non-trivial since raw EEG data includes various biological and environmental artifacts. Observations arrive incrementally across multiple signals, and the characteristics that are indicative of class labels may occur at different times across signals, which makes it difficult to find a decision time at which to output a label. Moreover, each time series instance can be of a different length due to the varying length of stay of patients at the ICU, which requires careful handling.

Task-specific: Accuracy and earliness of prediction are competing objectives (as noted above), since observing for a longer time sacrifices earliness but provides more signal, which is likely to yield better predictive performance.

In this work, we propose BeneFitter (see §4) that addresses all the aforementioned challenges.

3. Background and Related Work

The initial mention of early classification of time-series dates back to the early 2000s (Rodríguez et al., 2001; Bregón et al., 2005), where the authors consider the value in classifying prefixes of time sequences. However, it was formulated as a concrete learning problem only recently (Xing et al., 2008; xing2012early). Xing et al. (Xing et al., 2008) mine a set of sequential classification rules and formulate an early-prediction utility measure to select the features and rules to be used in early classification. Later they extend their work to a nearest-neighbor based time-series classifier approach to wait until a certain level of confidence is reached before outputting a decision (xing2012early). Parrish et al. (parrish2013classifying) delay the decision until a reliability measure indicates that the decision based on the prefix of time-series is likely to match that based on the whole time-series. Xing et al. (xing2011extracting) advocate the use of interpretable features called shapelets (Ye and Keogh, 2009) which have a high discriminatory power as well as occur earlier in the time-series. Ghalwash and Obradovic (ghalwash2012early) extend this work to incorporate a notion of uncertainty associated with the decision. Hatami and Chira (Hatami and Chira, 2013) train an ensemble of classifiers along with an agreement index between the individual classifiers such that a decision is made when the agreement index exceeds a certain threshold. As such, none of these methods explicitly optimize for the trade-off between earliness and accuracy.

Table 1. Qualitative comparison with prior work. ‘?’ means that the respective method, even though it does not exhibit the corresponding property originally, can possibly be extended to handle it.
Methods compared: ECTS (xing2012early); C-ECTS (dachraoui2015early; tavenard2016cost); EDSC (xing2011extracting); M-EDSC (ghalwash2012early); RelClass (parrish2013classifying); E2EL (russwurm2019end).
Properties compared: jointly optimize earliness & accuracy; distance metric agnostic; constant decision time; handles variable length series; explainable model; explainable hyper-parameter; cost aware.

Dachraoui et al. (dachraoui2015early) propose to address this limitation and introduce an adaptive and non-myopic approach which outputs a label when the projected cost of delaying the decision until a later time is higher than the current cost of early classification. The projected cost is computed from a clustering of training data coupled with nearest neighbor matching. Tavenard and Malinowski (tavenard2016cost) improve upon (dachraoui2015early) by eliminating the need for data clustering, formulating the decision to delay or not to delay as a classification problem. Mori et al. (mori2017early) take a two-step approach: in the first step, classifiers are learned to maximize accuracy, and in the second step, an explicit cost function based on accuracy and earliness is used to define a stopping rule for outputting a decision. Schafer and Leser (schafer2020teaser), instead, utilize the reliability of the predicted label as the stopping rule for outputting a decision. However, these methods require a classification-only phase followed by optimizing for the trade-off between earliness and accuracy. Recently, Hartvigsen et al. (hartvigsen2019adaptive) employ a recurrent neural network (RNN) based discriminator for classification, paired with a reinforcement learning task to learn a halting policy. The closest in spirit to our work is the recently proposed end-to-end learning framework for early classification (russwurm2019end) that employs RNNs. They use a cost function similar to (mori2017early) in a fine-tuning framework to learn a classifier and a stopping rule based on RNN embeddings for partial sequences.

Our proposed BeneFitter is a substantial improvement over all the above prior work on early classification of time series along a number of fronts, as summarized in Table 1. BeneFitter jointly optimizes for earliness and accuracy using a cost-aware benefit function. It seamlessly handles multi-variate and varying-length time-series and moreover, leads to explainable early predictions, which is important in high-stakes domains like health care.

4. BeneFitter: Proposed Method

4.1. Modeling Benefit

How should an early prediction system trade off accuracy vs. earliness? In many real-world settings, there is a natural misclassification cost, denoted $M$, associated with an inaccurate prediction, and certain savings obtained from early decision-making. We propose to construct a single variable called benefit which captures the overall value (savings minus cost) of outputting a certain decision (i.e., label) at a certain time $t$, given as

$$\text{benefit} = \text{savings} - \text{cost}.$$

We directly incorporate benefit into our model and leverage it in deciding when to output a decision: as soon as the benefit estimate becomes positive.

4.1.1. Outcome vs. Type Classification

There are two subtly different problem settings that arise in time-series classification that are worth distinguishing between.

Outcome classification: Here, the labels of time-series encode the observed outcome at the end of the monitoring period of each instance. Our motivating examples from predictive health care and system maintenance fall into this category. Typically, there are two outcomes: favorable (e.g., survival or no-failure) and unfavorable (e.g., death or failure); and we are interested in knowing when an unfavorable outcome is anticipated. In such cases, predicting an early favorable outcome does not incur any change in the course of action, and hence does not lead to any discernible savings or costs. For example, in our ICU application, a model predicting survival (as opposed to death) simply suggests to the physicians that the patient would survive provided they continue with their regular procedures of treatment. That is because the survival labels we observe in the data are recorded at the end of the observed period, only after the regular course of action has been carried out. In contrast, patients labeled death have died despite the treatments.

In outcome classification, predicting the favorable class simply corresponds to the ‘default state’ and therefore we model benefit and actively make predictions only for the unfavorable class.

Type classification: Here, the time-series labels capture the underlying process that gives rise to the sequence of observations. In other words, the class labels are prior to the time-series observations. The standard time-series early classification benchmark datasets fall into this category. Examples include predicting the type of a bird from audio recordings or the type of a flying insect (e.g., a mosquito) from their wingbeat frequencies (Batista2011). Here, prediction of any label for a time-series at a given time has an associated cost in case of misclassification (e.g., inaccurate density estimates of birds/mosquitoes) as well as potential savings for earliness (e.g., battery life of sensors). In type classification, we separately model benefit for each class.

4.1.2. Benefit Modeling for Outcome Classification

Consider the 2-class problem that arises in predictive health care of ICU patients and predictive maintenance of systems. Without loss of generality, let us denote by survive the class where the patient is discharged alive from the ICU at the end of their stay, and by death the class where the patient is deceased.

As discussed previously, survive corresponds to the ‘default state’ in which regular operations are continued. Therefore, predicting survive would not incur any time savings or misclassification cost. In contrast, predicting death would suggest that the clinician intervene to optimize quality of life for the patient. In case of an accurate prediction, say at time $t$, earliness would bring savings (e.g., ICU bed-time). Here we use a linear function of time for the savings on accurately predicting death for a patient at time $t$, where $s$ denotes the value of savings per unit time. (Note that BeneFitter is flexible enough to accommodate any other function of time, including nonlinear ones, as the savings function.) On the other hand, an inaccurate flag at $t$, while it comes with the same savings, would also incur a misclassification cost $M$ (e.g., a lawsuit).

All in all, the benefit model for the ICU scenario is given in Table 2, reflecting the relative savings minus the misclassification cost for each decision at time $t$ on time-series instance $i$. As we will detail later in §4.3, the main idea behind BeneFitter is to learn a single regressor model for the death class, estimating the corresponding benefit at each time tick $t$.



Table 2. Benefit model for ICU outcome prediction.
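The benefit model for the ICU scenario can be sketched in a few lines. The text states only that savings are linear in time and that an inaccurate flag pays the misclassification cost $M$ on top of the same savings; the specific form used below, savings proportional to the remaining time `length - t`, is an assumption of this sketch, as are the function names, the label strings, and the default values.

```python
def savings(t, length, s=1.0):
    """Assumed linear savings: s per time tick for the (length - t) ticks not observed."""
    return s * (length - t)

def benefit_of_flagging(true_label, t, length, s=1.0, M=10.0):
    """Benefit of raising the unfavorable-outcome flag at time t.

    true_label: "death" (the flag is correct) or "survive" (the flag is a false alarm).
    Predicting the default class is not modeled: it incurs neither savings nor cost.
    """
    b = savings(t, length, s)
    if true_label == "survive":
        b -= M  # an inaccurate flag has the same savings but pays the misclassification cost
    return b

# Hypothetical 24-tick stay, flag raised at tick 4.
correct_flag = benefit_of_flagging("death", t=4, length=24, s=1.0, M=10.0)
false_alarm = benefit_of_flagging("survive", t=4, length=24, s=1.0, M=10.0)
late_false_alarm = benefit_of_flagging("survive", t=20, length=24, s=1.0, M=10.0)
```

Under these assumptions a false alarm can still have positive benefit when it is raised very early, which is why the ratio between $M$ and the unit-time savings matters in practice.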

Specifying savings and cost. Here, we make the following two remarks. First, the unit-time savings $s$ and the misclassification cost $M$ are value estimates that are dictated by the specific application. For our ICU case, for example, we could use the value per unit of ICU time for $s$, and the expected cost per lawsuit for $M$. Note that $s$ and $M$ are domain-specific, explainable quantities. Second, the benefit model is most likely to differ from application to application. For example, in predictive system maintenance, savings and cost would have different semantics, assuming that early prediction of failure implies a renewal of all equipment. In that case, an early and accurate failure prediction would incur savings from the costs of a complete system halt, but also a loss of equipment lifetime value due to early replacement plus the replacement costs. On the other hand, an early but inaccurate prediction (i.e., a false alarm) would simply incur unnecessary replacement costs plus the loss of equipment lifetime value due to early replacement.

Our goal is to set up a general prediction framework that explicitly models benefit based on incurred savings and costs associated with individual decisions, whereas specifying those savings and costs is left to the practitioner. We argue that each real-world task should strive to explicitly model benefit, where earliness and accuracy of predictions translate to real-world value. In cases where the prediction task is isolated from its real-world use (e.g., benchmark datasets), one could set both to one, for unit savings per unit time of earliness and unit misclassification cost per incorrect decision. In those cases where the misclassification cost $M$ is not tied to a specific real-world value, it can be used as a “knob” (i.e., hyperparameter) for trading off accuracy with earliness: with the unit-time savings fixed, a larger $M$ nudges BeneFitter to avoid misclassifications, toward higher accuracy at the expense of delayed predictions, and vice versa.

4.1.3. Benefit Modeling for Type Classification

Compared to outcome prediction where observations give rise to the labels, in type classification problems the labels give rise to the observations. Without a default class, predictions come with associated savings and cost for each class.



Table 3. Benefit model for general two-class type prediction.

Consider the 2-class setting of predicting an insect’s type from wingbeat frequencies. An example benefit model is illustrated in Table 3, capturing the value of battery-life savings per unit time and the cost of misclassifying one insect as the other. Note that in general, the misclassification cost need not be symmetric among the classes.

For type classification problems, we train one benefit prediction model for each class. Since misclassification costs are already incorporated into benefit, we train each (regression) model independently, which allows for full parallelism.

4.2. Online Decision-making using Benefit

Next we present how to employ BeneFitter in decision making in real time. Suppose we have trained our model that produces benefit estimates per class for a new time-series instance in an online fashion. How and when should we output predictions?

Thanks to our benefit modeling, the decision-making is quite natural and intuitive: BeneFitter makes a prediction only when the estimated benefit becomes positive for a certain class, and outputs the label of that class as its prediction. That is, for our ICU scenario, the predicted label is death at the first time tick $t$ at which the estimated benefit of predicting death is positive.

For illustration, in Fig. 2 we show benefit estimates over time for an input series, where the decision time is the tick at which the estimate first becomes positive.

Note that in some cases BeneFitter may refrain from making any prediction for the entire duration of a test instance, namely when the estimated benefit never goes above zero. For outcome classification tasks, such a case is simply registered as a default-class prediction and its prediction time is recorded as the length of the series. For the ICU scenario, a non-prediction is where no flag is raised, suggesting survival and the regular course of action. For type classification tasks, in contrast, a non-prediction suggests “waiting for more data” which, at the end of the observation period, simply implies insufficient evidence for any class. We refer to those as un-classified test instances. Note that BeneFitter is different from existing prediction models that always produce a prediction; un-classified instances may be of independent interest to domain experts in the form of outliers, noisy instances, etc.
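The online decision rule for the outcome setting can be sketched as a simple loop over arriving observations. The benefit estimator is passed in as a callable; `estimate_benefit`, the `"death"` label string, and the convention of returning the series length together with `None` for a non-prediction are assumptions of this sketch, standing in for the trained regressor and the bookkeeping described above.

```python
def decide_online(stream, estimate_benefit):
    """Raise the flag at the first tick whose estimated benefit is positive.

    stream: iterable of observations, arriving one at a time.
    estimate_benefit: callable mapping the prefix seen so far to a benefit estimate
                      (hypothetical stand-in for a trained regressor).
    Returns (decision_time, label); label is None when no flag is ever raised,
    which corresponds to the default-class outcome.
    """
    prefix = []
    for t, obs in enumerate(stream, start=1):
        prefix.append(obs)
        if estimate_benefit(prefix) > 0:
            return t, "death"
    return len(prefix), None

# Toy estimator (hypothetical): benefit turns positive once the running sum exceeds 5.
toy_estimator = lambda prefix: sum(prefix) - 5

flagged = decide_online([1, 2, 3, 4], toy_estimator)
not_flagged = decide_online([0, 0], toy_estimator)
```

A real recurrent model would carry its hidden state forward instead of re-reading the whole prefix, so that each new observation costs a constant amount of work.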

Figure 2. Benefit estimate over time for a patient from the EEG dataset. We show two out of all 107 signals used by BeneFitter: amplitude of EEG (aEEG) and suppression ratio (i.e., fraction of flat EEG epochs).

4.3. Predicting Benefit

For each time-series $X_i$, we aim to predict the benefit at every time tick $t$. Consider the outcome classification problem, where we are to train one regressor model for the non-default class, say death. For each training series whose true label is survive (i.e., the default class), the benefit of predicting death at $t$ is the savings at $t$ minus the misclassification cost $M$. Similarly, for each training series whose true label is death, the benefit of predicting death at $t$ is the savings at $t$ alone. (See Table 2.) To this end, we create one training sample per time tick $t$ for each instance $i$, pairing the observations up to $t$ with the corresponding benefit. Note that the problem becomes a regression task. For type classification problems, we train a separate regression model per class with the corresponding benefit values. (See Table 3.)
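The construction of regression targets can be made concrete with a small sketch that turns one labeled series into per-tick (prefix, benefit) pairs. As before, the linear savings form `s * (L - t)`, the label strings, and the function name are assumptions made for illustration rather than the exact definitions used in our experiments.

```python
import numpy as np

def make_training_samples(X, label, s=1.0, M=10.0):
    """Build one (prefix, benefit) pair per time tick for a single series.

    X: array of shape (L, d), one row per time tick.
    label: "death" or "survive" (the default class).
    The target is the benefit of predicting "death" at tick t, with savings
    assumed to be s * (L - t).
    """
    L = X.shape[0]
    samples = []
    for t in range(1, L + 1):
        b = s * (L - t)
        if label == "survive":
            b -= M  # flagging a survivor is a misclassification
        samples.append((X[:t], b))
    return samples

X_toy = np.zeros((3, 2))  # hypothetical series: 3 ticks, 2 signals
death_samples = make_training_samples(X_toy, "death", s=1.0, M=5.0)
survive_samples = make_training_samples(X_toy, "survive", s=1.0, M=5.0)
```

Each series of length $L$ contributes $L$ samples, so the number of regression samples grows linearly with the number of input sequences for bounded series lengths.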

Model. We set up the task of benefit prediction as a sequence regression problem. We require BeneFitter to ingest multi-variate and variable-length input to estimate benefit. We investigate the use of Long Short-Term Memory (LSTM), a variant of recurrent neural networks (RNN), for the sequence (time-series) regression, since their recursive formulation allows LSTMs to handle multi-variate variable-length inputs naturally. The recurrent formulation of LSTMs is useful for BeneFitter to enable real-time predictions when new observations arrive one at a time.

Attention. Recurrent networks usually find it hard to focus on relevant information in long input sequences. For example, an EEG pattern at the beginning of a sequence may contain useful information about the patient’s outcome; however, the lossy representation of the LSTM would forget it. This issue is mostly encountered in longer input sequences (luong2015effective). The underlying idea of attention (vaswani2017attention) is to learn a context that captures the relevant information from the parts of the sequence to help predict the target. For a sequence of length $L$, given the hidden states from the LSTM and the context vector, the attention step in BeneFitter combines the information from both to produce a final attention-based hidden state. The quantities involved are the memory state of the cell at the last time step $L$, the hidden state output of the LSTM at each time $t$, the resulting attention-based hidden state, a non-linear transformation, and a learned parameter. Intuitively, the attention weights allow the model to learn to focus on specific parts of the input sequence for the task of regression. The benefit prediction is given by a single-layer neural network applied to the attention-based hidden state, i.e., a linear layer with its own learned parameters.
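To make the attention step more tangible, the following NumPy sketch shows one common way such a step can be realized. It is an assumption-laden illustration and not necessarily the exact formulation used in BeneFitter: it uses dot-product scores between each hidden state and the last cell state, a softmax over time, `tanh` as the non-linear transformation, and hypothetical parameter names `W_c`, `w_out`, and `b_out`.

```python
import numpy as np

def attention_benefit(H, c_last, W_c, w_out, b_out):
    """Attention over LSTM hidden states followed by a linear benefit head.

    H:      (L, k) hidden state outputs, one row per time tick.
    c_last: (k,)   memory (cell) state at the last time step.
    W_c:    (k, 2k) parameter of the combination layer (hypothetical name).
    w_out, b_out: parameters of the final linear layer (hypothetical names).
    Returns (benefit_estimate, attention_weights).
    """
    scores = H @ c_last                      # one dot-product score per time tick
    scores = scores - scores.max()           # stabilise the softmax numerically
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = alpha @ H                      # (k,) attention-weighted sum of hidden states
    h_tilde = np.tanh(W_c @ np.concatenate([context, c_last]))  # attention-based hidden state
    return float(w_out @ h_tilde + b_out), alpha

# Random toy inputs (hypothetical sizes: 5 ticks, hidden size 4).
rng = np.random.default_rng(0)
L_toy, k_toy = 5, 4
H_toy = rng.normal(size=(L_toy, k_toy))
c_toy = rng.normal(size=k_toy)
benefit_est, weights = attention_benefit(
    H_toy, c_toy, rng.normal(size=(k_toy, 2 * k_toy)), rng.normal(size=k_toy), 0.0
)
```

The weights returned alongside the estimate sum to one over the time ticks, so they can be plotted against the input signals to show which parts of the series drove the estimate.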

BeneFitter is used in real life decision making, where the attention mechanism could help an expert by highlighting the relevant information that guided the model to output a decision. We present model implementation details and list of tunable parameters in §5.

5. Experiments

We evaluate our method through extensive experiments on a set of benchmark datasets and on a set of datasets from real-world use cases. We next provide the details of the datasets and the experimental setup, followed by results.

Dataset Train Test Classes Length Dimension
EEG-ICU Hour 507 218 2 24–96 107
EEG-ICU 30 Min 507 218 2 48–192 107
EEG-ICU 10 Min 507 218 2 144–576 107
ECG200 100 100 2 96 1
ItalyPowerDemand 67 1029 2 24 1
GunPoint 50 150 2 150 1
TwoLeadECG 23 1139 2 82 1
Wafer 1000 6062 2 152 1
ECGFiveDays 23 861 2 136 1
MoteStrain 20 1252 2 84 1
Coffee 28 28 2 286 1
Yoga 300 3000 2 426 1
SonyAIBO 20 601 2 70 1
Endomondo 99754 42751 2 450 2
Table 4. Summary of the datasets used in this work.

5.1. Dataset Description

We apply BeneFitter on our EEG-ICU datasets (see §2.1), and on public benchmark datasets from diverse domains with varying dimensionality, length and scale. Table 4 provides a summary of the datasets used in evaluation. Note that the EEG-ICU datasets are variable-length, whereas the benchmarks often used in the literature are not. Detailed descriptions of the public datasets are included in Appx. A.1.

5.2. Experimental Setup

Baselines.  We compare BeneFitter to the following five early time-series classification methods (also see Table 1):

  1. ECTS: Early Classification on Time Series (xing2012early) uses the minimum prediction length (MPL) and makes a prediction once the observed length of the test series reaches the MPL of its top nearest neighbor (1-NN).

  2. EDSC: Early Distinctive Shapelet Classification (xing2011extracting) extracts local shapelets for classification that are ranked based on the utility score incorporating earliness and accuracy. Multivariate extension of EDSC (M-EDSC) (ghalwash2012early) provides a utility function that can incorporate multi-dimensional series.

  3. C-ECTS: Cost-aware ECTS (dachraoui2015early; tavenard2016cost) trades off a misclassification cost against a cost of delaying the prediction, and estimates the future expected cost at each time step to determine the optimal time instant to classify an incoming time series.

  4. RelClass: Reliable Classification (parrish2013classifying) uses a reliability measure to estimate the probability that the label assigned given incomplete data (at time step $t$) would be the same as the label assigned given the complete data.

  5. E2EL: End-to-end Learning for Early Classification of Time Series (russwurm2019end) optimizes a joint cost function based on accuracy and earliness, and provides a framework to estimate a stopping probability based on the cost function.

5.3. Evaluation

We design our experiments to answer the following questions:

[Q1] Effectiveness: How effective is BeneFitter at early prediction on time series compared to the baselines? What is the trade-off with respect to accuracy and earliness? How does the accuracy–earliness trade-off vary with respect to model parameters?

[Q2] Efficiency: How does the running-time of BeneFitter scale w.r.t. the number of training instances? How fast is the online response time of BeneFitter?

[Q3] Discoveries: Does BeneFitter lead to interesting discoveries on real-world case studies?

[Q1] Effectiveness

We compare BeneFitter to baselines on (1) patient outcome prediction, the main task that inspired our benefit formulation, (2) the activity prediction task on a web-scale dataset, as well as (3) the set of 10 two-class time-series classification datasets. The datasets for the first two tasks are multi-dimensional and variable-length, which many of the baselines cannot handle. Thus, on these tasks we compare BeneFitter with the E2EL and M-EDSC baselines, which can work with such time series. Comparison with M-EDSC is limited to the smaller one-hour EEG dataset since it does not scale to larger datasets. In order to compare BeneFitter to all other baselines, we conduct experiments on ten benchmark time-series datasets.

Patient Outcome Prediction.  We compare BeneFitter with the baseline E2EL on two competing criteria: performance (e.g. precision, accuracy) and earliness (tardiness – lower is better) of the decision. We report precision, recall, F1 score, accuracy, tardiness and the total benefit of each method when applied to the test set. The EEG dataset is a high-dimensional, variable-length dataset for which most of the baselines are not applicable. In our experiments, we set the misclassification cost for each of the dataset variants (sampled at one hour, 30 minutes, and 10 minutes) based on the average daily cost of ICU care and the lawsuit cost. For the baseline methods, we report the best results over the earliness-accuracy trade-off parameters, selecting the best pair of accuracy and tardiness by Euclidean distance to the ideal accuracy and ideal tardiness.
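The selection of a baseline configuration by distance to the ideal point can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the ideal point is an accuracy of 1 and a tardiness of 0, and the configuration names and scores are hypothetical.

```python
import math

# Assumption: the ideal point is perfect accuracy and zero tardiness.
IDEAL_ACCURACY = 1.0
IDEAL_TARDINESS = 0.0


def distance_to_ideal(accuracy, tardiness):
    """Euclidean distance from an (accuracy, tardiness) pair to the ideal point."""
    return math.hypot(accuracy - IDEAL_ACCURACY, tardiness - IDEAL_TARDINESS)


def select_best_config(results):
    """Pick the (name, accuracy, tardiness) triple closest to the ideal point."""
    return min(results, key=lambda r: distance_to_ideal(r[1], r[2]))


# Hypothetical evaluations of one baseline under three trade-off settings.
runs = [
    ("cfg_a", 0.79, 1.00),
    ("cfg_b", 0.75, 0.60),
    ("cfg_c", 0.60, 0.30),
]
best = select_best_config(runs)
print(best[0])
```

Because both quantities lie in [0, 1], the distance weighs a unit of accuracy and a unit of tardiness equally; a domain that values one over the other would need a weighted distance instead.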

Table 5 reports the evaluation against different performance metrics. Note that predicting the ‘default state’ for a patient does not change the behavior of the system. However, predicting death (unfavorable outcome) may prompt the clinician to intervene with alternative care. In such a decision setting, it is critical for the classifier to exhibit high precision. Our results indicate that BeneFitter achieves a significantly higher precision (according to the micro-sign test (yang1999re)) when compared to the baselines. In addition, a comparatively lower tardiness indicates that BeneFitter requires considerably fewer observations on average to output a decision (no statistical test conducted for tardiness). We also compare the total benefit accrued by each method on the test set, where BeneFitter outperforms the competition. The results are consistent across the three datasets of varying granularity, from hourly sampled data to 10-minute sampled data.







Method Precision Recall F1 score Accuracy Tardiness Benefit

EEG Hour
E2EL 0.70 0.68 0.69 0.79 1.0 -2600
M-EDSC 0.69 0.65 0.67 0.78 0.52 2497
BeneFitter 0.80* 0.68 0.73* 0.83* 0.64 2737

EEG 30Min
E2EL 0.64 0.67 0.65 0.78 1.0 -4800
BeneFitter 0.68* 0.66 0.67* 0.79 0.63 5962

EEG 10Min
E2EL 0.73 0.69 0.71 0.82 0.86 -736
BeneFitter 0.76* 0.69 0.72 0.83 0.48 18722

Table 5. Effectiveness of BeneFitter on EEG datasets. * indicates statistical significance based on the micro-sign test (yang1999re) for the performance metrics. No statistical test conducted for tardiness and total benefit.

For the hourly sampled set, we also compare our method to the multivariate EDSC baseline (M-EDSC does not scale to the 30-minute and 10-minute EEG datasets). Though M-EDSC provides a better earliness trade-off than the other two methods, its precision is the lowest, which is not desirable in this decision setup. In Table 5, we mark significant results with *, based on the comparison between BeneFitter and E2EL.

Benchmark Prediction Tasks.  To jointly evaluate the accuracy and earliness (tardiness – lower is better), we plot accuracy against the tardiness to compare the Pareto frontier for each of the competing methods over 10 different benchmark datasets. In Fig. 1 and Fig. 3, we show the accuracy and tardiness trade-off for 10 benchmark UCR datasets. Each point on the plot represents the model evaluation for a choice of trade-off parameters reported in Table A1 (§A.2). Note that BeneFitter dominates the frontiers of all the baselines in accuracy vs tardiness on five of the datasets. Moreover, our method appears on the Pareto frontier for four out of the remaining five for at least one set of parameters.
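The Pareto-frontier comparison can be made concrete with a short sketch. It assumes each evaluated configuration is summarized as a (tardiness, accuracy) pair, with lower tardiness and higher accuracy both preferred; the example points are hypothetical and not taken from the figures.

```python
def pareto_frontier(points):
    """Return the non-dominated (tardiness, accuracy) pairs, sorted by tardiness.

    A point is dominated if another point is at least as early and at least
    as accurate, and strictly better on one of the two criteria.
    """
    frontier = []
    for i, (t_i, a_i) in enumerate(points):
        dominated = any(
            (t_j <= t_i and a_j >= a_i) and (t_j < t_i or a_j > a_i)
            for j, (t_j, a_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((t_i, a_i))
    return sorted(frontier)


# Hypothetical evaluations of one method under five trade-off settings.
evaluations = [(0.2, 0.70), (0.5, 0.90), (0.6, 0.85), (0.9, 0.95), (0.3, 0.65)]
frontier = pareto_frontier(evaluations)
print(frontier)
```

One method dominates another on a dataset when every point on the second method's frontier is dominated by some point of the first.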

To further assess the methods, we report quantitative results in Table 6 in terms of accuracy at a given tolerance of tardiness. We define an acceptable level of tolerance to indicate how much delay in the decision an application domain is indifferent to. For example, a tolerance of 0.50 indicates that decisions are evaluated at half of the maximum sequence length, and any decision made up to that point is considered for evaluation. In Fig. 3, we fix the x-axis at a particular tolerance and report in Table 6 the best accuracy to the left of that tolerance. The reported tolerance level indicates the average tolerance across the test time-series sequences. BeneFitter outperforms the competition seven times out of ten for a tolerance level of 0.50, indicating that our method achieves the best performance using only half of the observations. The remaining three times our method is second best among all the competing methods. Similarly, for a tolerance of 0.75, BeneFitter is among the top two methods nine out of ten times.
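Reading accuracy at a tolerance level off the accuracy-tardiness plot can be sketched as below. This is an illustrative helper under the assumption that each configuration of a method is summarized by its mean tardiness and accuracy on the test set; the numbers are hypothetical.

```python
def accuracy_at_tolerance(evaluations, tolerance):
    """Best accuracy among configurations whose mean tardiness is within tolerance.

    Returns None when no configuration decides early enough, which corresponds
    to a '-' entry in the table.
    """
    eligible = [acc for tardiness, acc in evaluations if tardiness <= tolerance]
    return max(eligible) if eligible else None


# Hypothetical (mean tardiness, accuracy) pairs for one method on one dataset.
method_runs = [(0.45, 0.88), (0.62, 0.93), (0.80, 0.95)]
print(accuracy_at_tolerance(method_runs, 0.50))
print(accuracy_at_tolerance(method_runs, 0.75))
```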

Figure 3. Comparison of methods based on accuracy versus tardiness trade-off for benchmark prediction tasks (Sec. §5.3).











0.50 - 0.84 0.83 0.88 0.87 0.91
0.75 - 0.84 0.83 0.89 0.87 0.91


0.50 - - 0.78 0.85 0.89 0.93
0.75 - 0.85 0.94 0.95 0.89 0.93


0.50 - 0.95 0.80 - 0.93 0.97
0.75 0.91 0.95 0.84 0.91 0.96 0.97


0.50 - 0.88 0.89 - 0.79 0.98
0.75 0.73 0.89 0.94 0.73 0.86 0.98


0.50 - 0.99 0.96 1.0 0.99 0.99
0.75 1.0 0.99 0.96 1.0 0.99 0.99


0.50 - - 0.59 0.57 0.64 0.87
0.75 0.72 0.95 0.59 0.77 0.77 0.87


0.50 - 0.8 0.85 - - 0.85
0.75 - 0.8 0.85 - - 0.85


0.50 - - 0.98 0.89 0.53 0.93
0.75 - 0.75 0.98 0.89 0.53 0.93


0.50 - 0.71 0.64 - 0.79 0.76
0.75 - 0.71 0.64 - 0.79 0.76


0.50 - 0.80 0.81 0.81 0.92 0.92
0.75 0.69 0.80 0.81 0.81 0.92 0.92
Endomondo 0.50 dns 0.68 0.66
Table 6. BeneFitter wins most of the time. Accuracy on benchmark datasets at a given mean tardiness tolerance. Bold represents the best accuracy within a given tardiness tolerance, and underline represents the next best accuracy. ‘-’ indicates that on average the method requires more observations than the given tardiness tolerance. ‘✗’ specifies non-applicability of a method to the dataset, and ‘dns’ shows that a method does not scale to the dataset.

Endomondo Activity Prediction.  We run the experiments on the full Endomondo dataset (a large-scale dataset) to compare BeneFitter with the baseline E2EL (other baselines do not scale) for one set of earliness-accuracy trade-off parameters. We first compare the two methods on a smaller sampled dataset, evaluated for a choice of trade-off parameters (see Fig. 6 in Appendix A.3). We select the parameters that yield a performance closest to the ideal. With the selected parameters, the comparison of the two methods on the large-scale Endomondo activity prediction dataset is reported in Table 6 (last row). We report the accuracy of the two methods by fixing their tardiness at 0.50. Notice that the two methods are comparable in terms of prediction performance while using at most half the length of a sequence to output a decision.

The quantitative results suggest a way to choose the best classifier for a specified tolerance level for an application. In critical domains such as medicine or predictive maintenance, a lower tolerance would be preferred to save cost. In such domains, BeneFitter provides a clear choice for early decision making based on the benchmark dataset evaluation.

[Q2] Efficiency

Fig. 4 shows the scalability of BeneFitter with the number of training time series and the number of observations per time series at test time. We use the ECG200 dataset from the UCR benchmark to report results on runtime.

Linear training time:  We create ten datasets from the ECG200 dataset by retaining a fraction of the total number of training instances. For a fixed set of parameters, we train our model individually on each of the created datasets. The wall-clock running time is reported against the fraction of training sequences in Fig. 4 (left). The points in the plot align in a straight line, indicating that BeneFitter scales linearly with the number of sequences.
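One way to check the straight-line claim numerically is to fit a least-squares line to the measured times and inspect the coefficient of determination. The sketch below is only an illustration; the timings are synthetic placeholders, not measurements from the paper.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns slope, intercept, R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic example: ten training-set fractions and made-up wall-clock times.
fractions = [k / 10 for k in range(1, 11)]
seconds = [2.0 * f + 0.5 for f in fractions]
slope, intercept, r_squared = linear_fit(fractions, seconds)
print(slope, intercept, r_squared)
```

An R^2 close to 1 over a wide range of training-set sizes is consistent with linear scaling, although it does not prove it.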

Constant test time:  We now evaluate BeneFitter runtime by varying the number of observations over time. For this experiment, we retain the hidden state of an input test sequence up to time t. When a new observation arrives at time t+1, we update the hidden state of the RNN cell using the new observation and compute the predicted benefit based on the updated state. Fig. 4 (right) plots the wall-clock time against each new observation. The time is averaged over test set examples. The plot indicates that each real-time decision takes constant time.
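The streaming update can be illustrated with a toy recurrent cell. This is a simplified sketch and not the BeneFitter architecture: it uses a plain tanh recurrence without attention, and the class name, function name, and weights are hypothetical placeholders. The point is only that each new observation touches the retained hidden state once, so the per-step cost does not grow with the number of observations seen so far.

```python
import math


class StreamingBenefitEstimator:
    """Toy recurrent cell with a linear benefit read-out (hypothetical weights)."""

    def __init__(self, w_in, w_rec, w_out, bias_out=0.0):
        self.w_in = w_in          # hidden_dim x input_dim
        self.w_rec = w_rec        # hidden_dim x hidden_dim
        self.w_out = w_out        # hidden_dim
        self.bias_out = bias_out
        self.hidden = [0.0] * len(w_out)  # retained across observations

    def update(self, observation):
        """Consume one observation and return the predicted benefit."""
        new_hidden = []
        for k in range(len(self.hidden)):
            pre = sum(w * x for w, x in zip(self.w_in[k], observation))
            pre += sum(w * h for w, h in zip(self.w_rec[k], self.hidden))
            new_hidden.append(math.tanh(pre))
        self.hidden = new_hidden
        return sum(w * h for w, h in zip(self.w_out, self.hidden)) + self.bias_out


def first_positive_benefit(estimator, stream):
    """Return the 1-based time step at which the predicted benefit turns positive."""
    for t, observation in enumerate(stream, start=1):
        if estimator.update(observation) > 0:
            return t
    return None


estimator = StreamingBenefitEstimator(
    w_in=[[1.0]], w_rec=[[1.0]], w_out=[1.0], bias_out=-0.8
)
decision_time = first_positive_benefit(estimator, [[0.5]] * 5)
print(decision_time)
```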

The efficiency of our model makes it suitable for deployment for real time prediction tasks.

Figure 4. (left) BeneFitter scales linearly with the number of time-series, and (right) provides constant-time decision.

[Q3] Discoveries

In this section, we present an analysis of BeneFitter highlighting some of the salient aspects of our proposed framework on the ICU patient outcome task. In particular, we discuss how our method explains the benefit prediction by highlighting the parts of the input that contributed most to the prediction, and how our benefit formulation assists with model evaluation.

Explaining Benefit Estimation.  Our method utilizes the attention mechanism (see §4) in the RNN network for benefit regression. The model calculates a weight for each hidden state. These weights can indicate which time steps the model focuses on to estimate the benefit for the current input series. In Fig. 5, we plot one dimension of the input time series from the EEG dataset. This dimension corresponds to the amplitude of the EEG (aEEG) measured in the left hemisphere of the brain. The input sequence is taken from the hourly sampled dataset. Note that there are sharp rises and falls in the aEEG signal over two intervals. We input the multi-dimensional sequence to BeneFitter along with the aEEG signal. The model outputs the attention weights corresponding to each time step of the input, shown in Fig. 5 as a heatmap (dark colors indicate lower weights, lighter colors indicate higher weights). Although BeneFitter outputs its decision at a single time step, we also evaluate the model at further time steps. Each row of the heatmap represents one such evaluation of the input. For each evaluation, we obtain weights signifying the importance of each time step, which are plotted as the heatmap. The x-axis of the heatmap corresponds to the time dimension, and the y-axis corresponds to the evaluation time step of the input. Observe that the attention places higher weights towards the beginning of the time series, where we observe the crests and troughs of the signal.
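How per-time-step importance weights arise from hidden states can be sketched as follows. The scoring function here (a dot product with a query vector, followed by a softmax) is an assumption made for illustration; the attention form actually used is the one defined in §4. All vectors and function names are hypothetical.

```python
import math


def attention_weights(hidden_states, query):
    """Softmax-normalized importance weight for each hidden state."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_heatmap(hidden_states, query):
    """One row of weights per evaluation step, each over the prefix seen so far."""
    return [
        attention_weights(hidden_states[:step], query)
        for step in range(1, len(hidden_states) + 1)
    ]


# Hypothetical hidden states for a three-step input.
states = [[2.0, 0.0], [0.5, 0.5], [0.0, 0.1]]
rows = attention_heatmap(states, query=[1.0, 1.0])
for row in rows:
    print([round(w, 3) for w in row])
```

Each row sums to 1, so a heatmap built this way shows how the relative importance of early time steps changes as more of the sequence is observed.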

Figure 5. Input test sequence with corresponding attention weights evaluated at . Model decision at .

The visualization of importance weights is useful in critical applications where each decision involves a high cost. The estimated benefit along with the importance information can assist a clinician with the decision making.

Assisting with Model Evaluation.  In the clinical setting with comatose patients, there is a natural cost associated with an inaccurate prediction and savings obtained from knowing the labels early. The benefit modeling captures the overall value of outputting a decision. Though we use this value for learning a regression model, the benefit formulation could also be used as an evaluation metric to assess the quality of a predictive model. If we know the domain-specific unit-time savings and misclassification cost, we can evaluate a model's performance for those particular values. Table 7 reports the evaluation of BeneFitter for various values of the misclassification cost, with the model trained using the same values. We notice that increasing the misclassification cost improves the precision of the model; however, it also results in a higher penalty for any misclassification. For our model trained on hourly sampled EEG data, we observe that sufficiently large values result in an overall negative benefit averaged over the test data (in Table 7, the benefit turns negative at 400). Given the unit-time savings, this threshold translates into the maximum lawsuit cost that can be tolerated. Similarly, any model can be evaluated to assess its usefulness in critical domains using our benefit formulation as an evaluation measure.

Misclassification cost Precision Recall F1 score Accuracy Tardiness Benefit
100 0.80 0.68 0.73 0.83 0.64 2737
200 0.80 0.67 0.71 0.82 0.68 1032
300 0.82 0.67 0.74 0.84 0.68 156
400 0.82 0.69 0.75 0.84 0.70 -1326
Table 7. Training and evaluation of BeneFitter for various misclassification costs on EEG hourly sampled data.
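Using benefit as an evaluation metric can be sketched with a small helper. The per-sequence form below (savings proportional to the time steps saved, minus the misclassification cost when the prediction is wrong) is an assumed simplification for illustration and may differ from the exact definition in §4; the function name and data are hypothetical.

```python
def total_benefit(decisions, unit_savings, misclassification_cost):
    """Sum of per-sequence benefit over a test set.

    decisions: iterable of (sequence_length, decision_time, is_correct).
    Assumed form: unit_savings * (length - decision_time), minus the
    misclassification cost for every wrong prediction.
    """
    total = 0.0
    for length, decision_time, is_correct in decisions:
        total += unit_savings * (length - decision_time)
        if not is_correct:
            total -= misclassification_cost
    return total


# Hypothetical test set: one correct early decision, one wrong later decision.
decisions = [(10, 4, True), (10, 6, False)]
low_cost = total_benefit(decisions, unit_savings=1.0, misclassification_cost=3)
high_cost = total_benefit(decisions, unit_savings=1.0, misclassification_cost=100)
print(low_cost, high_cost)
```

With the same predictions, raising the misclassification cost can flip the total from positive to negative, which mirrors the sign change in the last row of Table 7.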

6. Conclusions

In this paper, we considered the benefit-aware early prediction of health outcomes for ICU patients and proposed BeneFitter, which is designed to effectively handle multi-variate and variable-length signals such as EEG recordings. We made multiple contributions.

  1. Novel, cost-aware problem formulation: BeneFitter infuses the incurred savings from an early prediction as well as the cost from misclassification into a unified target called benefit. Unifying these two quantities allows us to directly estimate a single target, i.e., benefit, and importantly dictates exactly when BeneFitter should output a prediction: whenever the estimated benefit becomes positive.

  2. Efficiency and speed: The training time for BeneFitter is linear in the number of input sequences, and it can operate under a streaming setting to update its decision based on incoming observations.

  3. Multi-variate and multi-length time-series: BeneFitter is designed to handle multiple time sequences, of varying length, suitable for various domains including health care.

  4. Effectiveness on real-world data: We applied BeneFitter to early prediction of health outcomes on ICU-EEG data, where BeneFitter provides up to 2x time-savings compared to competitors while achieving equal or better accuracy. BeneFitter also outperformed or tied with top competitors on other real-world benchmarks.


  • A. Bregón, M. A. Simón, J. J. Rodríguez, C. Alonso, B. Pulido, and I. Moro (2005) Early fault classification in dynamic systems using case-based reasoning. In CAEPIA, pp. 211–220.
  • N. Hatami and C. Chira (2013) Classifiers with a reject option for early time-series classification. In CIEL, pp. 9–16.
  • J. J. Rodríguez, C. J. Alonso, and H. Boström (2001) Boosting interval based literals. Intelligent Data Analysis 5 (3), pp. 245–262.
  • Z. Xing, J. Pei, G. Dong, and P. S. Yu (2008) Mining sequence classifiers for early prediction. In SIAM SDM, pp. 644–655.
  • L. Ye and E. Keogh (2009) Time series shapelets: a new primitive for data mining. In SIGKDD, pp. 947–956.

Appendix A

In this section, we first give a detailed description of the datasets used in the experiments. We then provide model training details for BeneFitter, and present the hyper-parameter configurations used in BeneFitter as well as the competing methods for the benchmark experiments.

A.1. Dataset Description

Benchmark Datasets.  Our benchmark datasets consist of 10 two-class time-series classification datasets from the UCR repository (UCRArchive). The datasets cover diverse domains and have a diverse range of series lengths. The UCR archive provides the train/test split for each of these datasets, which we retain in our experiments.

Endomondo Dataset.  Endomondo is a social fitness app that tracks numerous fitness attributes of the users. We use the web-scale Endomondo dataset (ni2019modeling) (See Table 4) for the early activity prediction task. The data includes various measurements such as heart rate, altitude and speed, along with contextual data such as user id and activity type. For the task of early activity prediction, we use heart rate and altitude signals for early prediction of the type of activity, specifically biking vs. running. (Note that we leave out signals like speed and its derivatives which make the classification task too easy.)

A.2. Experimental Settings

A.2.1. Model Training Details

We define the outcome prediction problem as a regression task on the benefit, as presented in §4. The training examples represent the sequences observed up to time t along with their corresponding expected benefit at time t. We then split the training examples, using part of the sequences for training the RNN model and the remainder for validation. We select our model parameters based on the evaluation on the validation set. We have two sets of hyper-parameters: one corresponding to our benefit formulation (the unit-time savings and the misclassification cost), and the other for the RNN model. The hyper-parameter grid for the RNN model includes the dimension of the hidden representation and the learning rate. For BeneFitter, we fix the unit-time savings at 1 and vary the misclassification cost for model selection; the grid for the benefit formulation is defined relative to the length of the training series. For the general multi-class problem, we tune an additional hyper-parameter, a margin used to predict the class label, which we set as a fraction of the maximum difference between the expected benefit for the two class labels for a given training series. BeneFitter outputs a decision when the predicted benefit is positive and at least this margin higher than the predicted benefit of the other labels.
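The decision rule described above can be sketched as a small function. The name `decide`, the parameter `margin`, and the example benefit values are hypothetical; the margin stands in for the additional hyper-parameter mentioned in the text.

```python
def decide(predicted_benefits, margin):
    """Return the label to output, or None to keep observing.

    predicted_benefits: dict mapping each class label to its predicted benefit.
    A label is output only if its predicted benefit is positive and exceeds
    every other label's predicted benefit by at least `margin`.
    """
    ranked = sorted(predicted_benefits.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_value = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if best_value > 0 and best_value - runner_up >= margin:
        return best_label
    return None


print(decide({"favorable": 5.0, "unfavorable": 1.0}, margin=2.0))
print(decide({"favorable": 5.0, "unfavorable": 4.0}, margin=2.0))
```

Returning None corresponds to waiting for the next observation, so a larger margin trades earliness for confidence in the chosen label.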

For the learning task, we use the mean squared error loss function and the Adam optimizer (kingma2014adam) for learning the parameters. The loss corresponding to each class is the mean squared error between the predicted benefit and the expected benefit for that class, averaged over the total count of time steps for all the sequences in the training set. We use Keras and PyTorch to implement our models.
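One plausible way to write the per-class mean squared error loss is given below. The symbols are assumptions introduced here for illustration: $i$ indexes training sequences, $L_i$ is the length of sequence $i$, $N$ is the total count of time steps over all training sequences, $b^{(c)}_{i,t}$ is the expected benefit for class $c$ at time $t$, and $\hat{b}^{(c)}_{i,t}$ is the model's prediction.

```latex
\mathcal{L}_c \;=\; \frac{1}{N} \sum_{i} \sum_{t=1}^{L_i}
  \left( b^{(c)}_{i,t} - \hat{b}^{(c)}_{i,t} \right)^{2},
\qquad N \;=\; \sum_{i} L_i
```

Under this reading, every observed prefix of every training sequence contributes one squared-error term, which is consistent with training examples being sequences observed up to each time step.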

A.2.2. Benchmark Experiments – Hyper-parameters

We compared the performance of BeneFitter against six competing methods on benchmark datasets. In Table A1, we report the different hyper-parameter configurations for each method that provide a trade-off between accuracy and earliness.

Method Model Training Hyper-parameters
ECTS support
EDSC Chebyshev parameter
C-ECTS delay cost
RelClass reliability
E2EL earliness trade-off
Table A1. Earliness and accuracy trade-off parameters for each of the methods.

A.3. Results

Figure 6. Accuracy versus tardiness for sampled Endomondo.