1. Introduction
Early decision making is critical in a variety of application domains. In medicine, early prediction of health outcomes for patients in the ICU allows a hospital to redistribute its resources (e.g., ICU bed time, physician time, etc.) to in-need patients, and potentially achieve better health outcomes overall within the same amount of time. Of course, another critical factor at play is the accuracy of such predictions. Hastily but incorrectly predicting an unfavorable health outcome (e.g., withdrawal of life-sustaining therapies) could hinder equitable decision making in the ICU, and may also expose hospitals to very costly lawsuits.
A clinician considers patient history, demographics, etc., in addition to large amounts of real-time sensor information, when making a decision. Our work is motivated by this real-world application, which would help alleviate the information overload on clinicians and aid them in early and accurate decision making in the ICU; however, the setting is quite general. In predictive maintenance, the goal is to monitor the functioning of physical devices (e.g., workstations, industrial machines, etc.) in real time through sensors, and to predict potential future failures as early as possible. Here again, earliness (of prediction) allows timely maintenance that prevents catastrophic halting of such systems, while hasty false alarms take away from the otherwise healthy lifetime of a device by introducing premature replacements that can be very costly.
As suggested by these applications, the real-time prediction problem necessitates modeling two competing goals: earliness and accuracy. They compete because observing for a longer time, while it cuts back on earliness, provides more information (i.e., data) that can help achieve better predictive accuracy. To this end, we directly integrate a cost/benefit framework into our proposed solution, BeneFitter, toward jointly optimizing prediction accuracy and earliness. We do not tackle an explicit multi-objective optimization, but rather directly model a unified target that fuses these goals.
Besides the earliness-accuracy trade-off, the prediction of health outcomes from electroencephalography (EEG) recordings of ICU patients brings additional challenges. A large number (107) of EEG signal measurements are collected from multiple electrodes, constituting high-dimensional multivariate time series (our data is 900 GB on disk). Moreover, the series in the data can be of various lengths, because patients might not survive or may be discharged after varying lengths of stay in the ICU. BeneFitter addresses these additional challenges by handling (i) multivariate and (ii) variable-length signals (i.e., time series), with (iii) space-efficient modeling, (iv) scalable training, and (v) constant-time prediction.
We summarize our contributions as follows.

Novel, cost-aware problem formulation: We propose BeneFitter, which fuses the savings/gains incurred by an early prediction at time t, as well as the cost of each misclassification, into a unified target called benefit. Unifying these two quantities allows us to directly estimate a single target, i.e., benefit, and importantly dictates to BeneFitter exactly when to output a prediction: whenever the estimated benefit becomes positive.

Efficiency and speed: The training time of BeneFitter is linear in the number of input sequences, and it can operate in a streaming setting, updating its decision based on incoming observations. Unlike existing work that trains a collection of prediction models (dachraoui2015early; tavenard2016cost; mori2017early), BeneFitter employs a single model for each possible outcome, resulting in much greater space efficiency.

Multivariate and variable-length time series: Due to hundreds of measurements from EEG signals collected from patients with variable-length stays in the ICU, BeneFitter employs models designed to handle multiple time sequences of varying length, which is a more general setting.

Effectiveness on real-world data: We apply BeneFitter to real-world (a) multivariate health care data (our main motivating application for this work is predicting survival/death of cardiac-arrest patients based on their EEG measurements in the ICU), and (b) 11 other benchmark datasets pertaining to various early prediction tasks. On the ICU application, BeneFitter can make decisions with substantial time savings compared to competitors while achieving equal or better performance on accuracy metrics. Similarly, on the benchmark datasets, BeneFitter provides the best spectrum for trading off accuracy and earliness (e.g., see Figure 1).
Reproducibility. We share all source code and public datasets at https://bit.ly/3n4wL0N. Our EEG dataset involves real ICU patients and is under an NDA with our collaborating healthcare institution.
Suitability for Real-world Use Cases

Business problem: Customizable, cost-aware formulation. The benefit target allows domain experts to explicitly control the earliness savings and the misclassification cost, thereby enabling them to incorporate into the predictions knowledge gathered through experience or driven by the application domain.

Usability: Early prediction of health outcomes to aid physicians in their decisions.
2. Data and Problem Setting
2.1. Data Description
Our use case data are obtained from comatose patients who were resuscitated from cardiac arrest and underwent post-arrest electroencephalography (EEG) monitoring at a single academic medical center between years –.
The raw EEG data are recorded at Hz from scalp electrodes, with electrodes in each hemisphere of the brain placed according to the 10–20 International System of Electrode Placement (https://en.wikipedia.org/wiki/1020_system_(EEG)). The raw data are then used to compute quantitative EEG (qEEG) features at ten-second intervals, which amounts to about 900 GB of disk space for the 725 patients. For our experiments, we selected the qEEG signals that physicians find informative from the electrode measurements corresponding to different brain regions. The 107-dimensional qEEG measurements from different electrodes on both the left and right hemispheres, including the amplitude-integrated EEG (aEEG), burst suppression ratio (SR), asymmetry, and rhythmicity, form our multivariate time series for analysis. We also record qEEG features for each hemisphere as the average of qEEG features from the electrodes on the given hemisphere.
As part of preprocessing, we normalize the qEEG features to a common range. The EEG data contain artifacts caused by a variety of informative (e.g., the patient wakes up) or arbitrary (e.g., device unplugged/unavailability of devices) reasons, which result in missing values and abnormally high or zero measurements. We filter out the zero measurements, which typically appear towards the end of each sequence, as well as the abnormally high signal values at the beginning of each time series in the patient records. The zero measurements towards the end appear because of disconnection; similarly, abnormally high readings at the start appear while a patient is being hooked up for measurements. The missing values are imputed through linear interpolation.
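The cleanup steps above can be sketched for a single channel as follows; this is a minimal illustration, not the paper's exact pipeline, and the robust-deviation threshold `spike_z` is our own assumption for detecting the abnormally high leading readings:

```python
import numpy as np

def clean_qeeg(series, spike_z=5.0):
    """Trim trailing zeros (disconnection), drop abnormally high leading
    values (hook-up artifacts), and linearly interpolate missing values
    (NaNs) in one qEEG channel."""
    x = np.asarray(series, dtype=float).copy()
    # 1) Trailing zero measurements: keep everything up to the last value
    #    not (close to) zero. NaN is not close to zero, so gaps survive.
    nonzero = np.where(~np.isclose(x, 0.0))[0]
    if nonzero.size:
        x = x[: nonzero[-1] + 1]
    # 2) Abnormally high leading readings: drop ticks before the first
    #    value within spike_z robust deviations of the channel median.
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med)) + 1e-9
    ok = np.isfinite(x) & (np.abs(x - med) <= spike_z * mad)
    if ok.any():
        x = x[np.argmax(ok):]
    # 3) Impute remaining missing values by linear interpolation.
    idx = np.arange(len(x))
    good = np.isfinite(x)
    x[~good] = np.interp(idx[~good], idx[good], x[good])
    return x
```

For example, a channel recorded as `[100, 1, 2, NaN, 4, 5, 0, 0]` is cleaned to `[1, 2, 3, 4, 5]`: the trailing zeros and the leading spike are removed, and the gap is filled by interpolation.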
In this dataset, 225 (31%) out of 725 total patients survived, i.e., woke up from coma. Since the length of stay in the ICU depends on the individual patient, the dataset contains EEG records of length 24–96 hours. To extensively evaluate our proposed approach, we create three versions of the dataset by median sampling (justusson1981median) the sequences at one-hour, 30-minute, and 10-minute intervals (as summarized in §5, Table 4).
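The median-sampling step can be sketched as a per-channel median over non-overlapping windows; since the qEEG features arrive at ten-second intervals, windows of 360, 180, and 60 ticks give the one-hour, 30-minute, and 10-minute versions, respectively:

```python
import numpy as np

def median_downsample(x, window):
    """Median-sample a (T, d) multivariate series over consecutive,
    non-overlapping windows of `window` ticks; a trailing partial
    window is summarized the same way."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.median(x[i:i + window], axis=0)
                     for i in range(0, x.shape[0], window)])
```

The median (rather than the mean) keeps each window's summary robust to residual spikes in the raw signal.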
2.2. Notation
A multivariate time-series dataset is denoted as D = {(X_i, y_i)}_{i=1}^{N}, consisting of observations X_i and labels y_i for N instances. Each instance has a label y_i ∈ {1, …, C}, where C is the number of labels or classes (we use the terms label and class interchangeably throughout the paper). For example, each possible health outcome in the ICU is depicted by a class label, such as survive or death. The sequence of observations is given by X_i = (x_i^(1), …, x_i^(T_i)) for T_i equidistant time ticks. Here, T_i is the length of the time series and varies from one instance to another in the general case. It is noteworthy that our proposed BeneFitter can effectively handle variable-length series in a dataset, whereas most existing early prediction techniques are limited to fixed-length time series, where T_i = T for all i. Each observation x_i^(t) ∈ R^d is a vector of d real-valued measurements, where d is the number of variables or signals. We denote X_i's observations from the start until time tick t by X_i^(1:t).

2.3. Problem Statement
Early classification of time series seeks to generate a prediction for an input sequence X_i based on the prefix X_i^(1:t), such that t is small and X_i^(1:t) contains enough information for an accurate prediction. Formally,
Problem 1 (Early classification).
Given a set of labeled multivariate time series D = {(X_i, y_i)}_{i=1}^{N}, learn a function f which assigns a label to a given time series based on its prefix, i.e., f: X_i^(1:t) ↦ ŷ_i, such that t is small.
Challenges. The challenges in early classification are twofold, domain-specific and task-specific, as discussed below.
• Domain-specific: Data preprocessing is non-trivial, since raw EEG data include various biological and environmental artifacts. Observations arrive incrementally across multiple signals, and the characteristics indicative of class labels may occur at different times across signals, which makes it difficult to find a decision time at which to output a label. Moreover, each time-series instance can be of a different length due to the varying lengths of stay of patients in the ICU, which requires careful handling.
• Task-specific: Accuracy and earliness of prediction are competing objectives (as noted above), since observing for a longer time, while it cuts back on earliness, provides more signal that is likely to yield better predictive performance.
In this work, we propose BeneFitter (see §4), which addresses all the aforementioned challenges.
3. Background and Related Work
The initial mention of early classification of timeseries dates back to early 2000s (Rodríguez et al., 2001; Bregón et al., 2005)
where the authors consider the value in classifying prefixes of time sequences. However, it was formulated as a concrete learning problem only recently
(Xing et al., 2008; xing2012early). Xing et al. (Xing et al., 2008) mine a set of sequential classification rules and formulate an early-prediction utility measure to select the features and rules to be used in early classification. Later, they extend their work to a nearest-neighbor based time-series classifier that waits until a certain level of confidence is reached before outputting a decision (xing2012early). Parrish et al. (parrish2013classifying) delay the decision until a reliability measure indicates that the decision based on the prefix of the time series is likely to match the one based on the whole time series. Xing et al. (xing2011extracting) advocate the use of interpretable features called shapelets (Ye and Keogh, 2009), which have high discriminatory power and occur early in the time series. Ghalwash and Obradovic (ghalwash2012early) extend this work to incorporate a notion of uncertainty associated with the decision. Hatami and Chira (Hatami and Chira, 2013) train an ensemble of classifiers along with an agreement index between the individual classifiers, such that a decision is made when the agreement index exceeds a certain threshold. As such, none of these methods explicitly optimizes the trade-off between earliness and accuracy.

Property / Method 
ECTS (xing2012early) 
CECTS (dachraoui2015early; tavenard2016cost) 
EDSC (xing2011extracting) 
MEDSC (ghalwash2012early) 
RelClass (parrish2013classifying) 
E2EL (russwurm2019end) 
BeneFitter 
Jointly optimize earliness & accuracy  ✓  ✓  
Distance metric agnostic  ✓  ✓  ✓  
Multivariate  ✓  ✓  ✓  
Constant decision time  ✓  ✓  
Handles variable length series  ✓  ✓  ✓  ✓  ✓  ✓  ✓ 
Explainable model  ✓  ✓  ✓  ✓  
Explainable hyperparameter  ✓  
Cost aware  ✓  ✓  ✓ 
Dachraoui et al. (dachraoui2015early) address this limitation and introduce an adaptive, non-myopic approach that outputs a label when the projected cost of delaying the decision until a later time is higher than the current cost of early classification. The projected cost is computed from a clustering of the training data coupled with nearest-neighbor matching. Tavenard and Malinowski (tavenard2016cost) improve upon (dachraoui2015early) by eliminating the need for data clustering, formulating the decision of whether to delay as a classification problem. Mori et al. (mori2017early) take a two-step approach: in the first step, classifiers are learned to maximize accuracy, and in the second step, an explicit cost function based on accuracy and earliness is used to define a stopping rule for outputting a decision. Schäfer and Leser (schafer2020teaser) instead use the reliability of the predicted label as the stopping rule. However, these methods require a classification-only phase followed by a separate optimization of the earliness-accuracy trade-off. Recently, Hartvigsen et al. (hartvigsen2019adaptive) employ a recurrent neural network (RNN) based discriminator for classification, paired with a reinforcement-learning task to learn a halting policy. Closest in spirit to our work is the recently proposed end-to-end learning framework for early classification (russwurm2019end), which employs RNNs. It uses a cost function similar to (mori2017early) in a fine-tuning framework to learn a classifier and a stopping rule based on RNN embeddings of partial sequences.

Our proposed BeneFitter substantially improves over all of the above prior work on early classification of time series along a number of fronts, as summarized in Table 1. BeneFitter jointly optimizes earliness and accuracy using a cost-aware benefit function. It seamlessly handles multivariate and varying-length time series and, moreover, leads to explainable early predictions, which is important in high-stakes domains like health care.
4. BeneFitter: Proposed Method
4.1. Modeling Benefit
How should an early prediction system trade off accuracy vs. earliness? In many real-world settings, there is a natural misclassification cost, denoted M, associated with an inaccurate prediction, and certain savings, denoted S(t), obtained from early decision making. We propose to construct a single variable called benefit, which captures the overall value (savings minus cost) of outputting a certain decision (i.e., label) at a certain time t, given as
(1)  b(t) = S(t) − M · 1[ŷ ≠ y],  where 1[·] is the indicator function.
We directly incorporate benefit into our model and leverage it in deciding when to output a decision: namely, when the estimate b̂(t) is positive.
4.1.1. Outcome vs. Type Classification
There are two subtly different problem settings that arise in time-series classification that are worth distinguishing.
• Outcome classification: Here, the labels of time series encode the observed outcome at the end of the monitoring period of each instance. Our motivating examples from predictive health care and system maintenance fall into this category. Typically, there are two outcomes: favorable (e.g., survive or no-failure) and unfavorable (e.g., death or failure); and we are interested in knowing when an unfavorable outcome is anticipated. In such cases, predicting an early favorable outcome does not incur any change in the course of action, and hence does not lead to any discernible savings or costs. For example, in our ICU application, a model predicting survive (as opposed to death) simply suggests to the physicians that the patient would survive provided they continue with their regular treatment procedures. That is because the labels we observe in the data are recorded at the end of the observation period, only after the entire regular course of action has been conducted. In contrast, death instances have died despite the treatments.
In outcome classification, predicting the favorable class simply corresponds to the ‘default state’; therefore, we model benefit for, and actively make predictions of, only the unfavorable class.
• Type classification: Here, the time-series labels capture the underlying process that gives rise to the sequence of observations. In other words, the class labels exist prior to the time-series observations. The standard time-series early classification benchmark datasets fall into this category. Examples include predicting the type of a bird from audio recordings, or the type of a flying insect (e.g., a mosquito) from its wingbeat frequencies (Batista2011). Here, predicting any label for a time series at a given time has an associated cost in case of misclassification (e.g., inaccurate density estimates of birds/mosquitoes), as well as potential savings for earliness (e.g., sensor battery life). In type classification, we separately model benefit for each class.
4.1.2. Benefit Modeling for Outcome Classification
Consider the two-class problem that arises in predictive health care of ICU patients and predictive maintenance of systems. Without loss of generality, let survive denote the class where the patient is discharged alive from the ICU at the end of their stay, and let death denote the class where the patient is deceased.
As discussed previously, survive corresponds to the ‘default state’ in which regular operations are continued. Therefore, predicting survive would not incur any time savings or misclassification cost. In contrast, predicting death would suggest that the clinician intervene to optimize quality of life for the patient. In the case of an accurate prediction, say at time t, earliness would bring savings (e.g., ICU bed time), denoted S(t). Here we use a linear function of time for the savings obtained upon accurately predicting death for a patient at time t, specifically
(2)  S(t) = e · (T − t)
where e denotes the value of savings per unit time. (Note that BeneFitter is flexible enough to accommodate any other function of time, including nonlinear ones, as the savings function S(t).) On the other hand, an inaccurate flag at t, while it comes with the same savings, would also incur a misclassification cost M (e.g., a lawsuit).
All in all, the benefit model for the ICU scenario is given in Table 2, reflecting the relative savings minus the misclassification cost for each decision at time t on time-series instance X_i. As we detail later in §4.3, the main idea behind BeneFitter is to learn a single regressor model for the death class, estimating the corresponding benefit at each time tick t.
                    Predicted
                    survive     death
Actual  survive       0         S(t) - M
        death         0         S(t)
Specifying e and M. Here, we make the following two remarks. First, the unit-time savings e and misclassification cost M are value estimates dictated by the specific application. For our ICU case, for example, we could set e to the value per unit of ICU time, and M to the expected cost per lawsuit. Note that e and M are domain-specific, explainable quantities. Second, the benefit model is likely to differ from application to application. For example, in predictive system maintenance, savings and cost would have different semantics, assuming that an early prediction of failure implies a renewal of all equipment. In that case, an early and accurate failure prediction would incur savings relative to the cost of a complete system halt, but also the loss of equipment lifetime value due to early replacement plus the replacement costs. On the other hand, an early but inaccurate prediction (i.e., a false alarm) would simply incur unnecessary replacement costs plus the loss of equipment lifetime value due to early replacement.
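To make the outcome-classification benefit model concrete, here is a minimal sketch of the Table 2 entries under the linear savings of Eq. (2); the symbols e (savings per unit time) and M (misclassification cost) follow the discussion above, and the numbers in the usage note are purely illustrative:

```python
def outcome_benefit(pred_death, actual_death, t, T, e, M):
    """Benefit of a decision at tick t in the outcome setting (Table 2).
    Predicting the default class ('survive') is a no-op: no savings and
    no cost. Predicting 'death' earns linear savings e*(T - t) (Eq. 2),
    minus the cost M if the flag turns out to be a false alarm."""
    if not pred_death:
        return 0.0
    savings = e * (T - t)  # the earlier the flag, the larger the savings
    return savings if actual_death else savings - M
```

For instance, with e = 1 and M = 50 on a 96-tick series, a correct death flag at t = 24 yields a benefit of 72, while a false alarm at the same tick yields 22.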
Our goal is to set up a general prediction framework that explicitly models benefit based on the savings and costs associated with individual decisions, whereas the scope of specifying those savings and costs is left to the practitioner. We argue that each real-world task should strive to explicitly model benefit, where earliness and accuracy of predictions translate to real-world value. In cases where the prediction task is isolated from its real-world use (e.g., benchmark datasets), one could set e = 1 for unit savings per unit time of earliness and M = 1 for unit misclassification cost per incorrect decision. In those cases where M is not tied to a specific real-world value, it can be used as a “knob” (i.e., hyperparameter) for trading off accuracy against earliness: fixing e, a larger M nudges BeneFitter to avoid misclassifications toward higher accuracy at the expense of delayed predictions, and vice versa.

4.1.3. Benefit Modeling for Type Classification
Compared to outcome prediction, where the observations give rise to the labels, in type classification problems the labels give rise to the observations. Without a default class, predictions come with associated savings and costs for each class.
                     Predicted
                     class 1         class 2
Actual  class 1      S(t)            S(t) - M_12
        class 2      S(t) - M_21     S(t)
Consider the two-class setting of predicting an insect's type from its wingbeat frequencies. An example benefit model is illustrated in Table 3, capturing the value of battery-life savings per unit time and depicting the cost of misclassifying one insect as the other. Note that, in general, the misclassification cost need not be symmetric among the classes.
For type classification problems, we train a total of C benefit prediction models, one for each class. Since misclassification costs are already incorporated into the benefit, we train each (regression) model independently, which allows for full parallelism.
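Because the per-class regressors share no state, their training can be dispatched in parallel; a minimal sketch, with a stub standing in for the actual per-class training routine of §4.3:

```python
from concurrent.futures import ThreadPoolExecutor

def fit_benefit_regressor(data, cls):
    # Stand-in for the per-class benefit-regressor training; here it
    # just returns a tagged placeholder "model" for illustration.
    return ("model", cls)

def train_per_class(classes, data):
    # Misclassification costs are folded into each class's benefit
    # targets, so the C regressors are independent and train in parallel.
    with ThreadPoolExecutor() as pool:
        futures = {c: pool.submit(fit_benefit_regressor, data, c)
                   for c in classes}
        return {c: f.result() for c, f in futures.items()}
```

In practice the stub would be replaced by the LSTM-regressor training loop, and a process pool (or multiple GPUs) would be used instead of threads for compute-bound training.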
4.2. Online Decisionmaking using Benefit
Next, we present how to employ BeneFitter for decision making in real time. Suppose we have trained our model(s) to produce benefit estimates per class for a new time-series instance in an online fashion. How and when should we output predictions?
Thanks to our benefit modeling, the decision making is quite natural and intuitive: BeneFitter makes a prediction only when the estimated benefit becomes positive for a certain class, and outputs the label of that class as its prediction. That is, for our ICU scenario, the predicted label is death at the first time t at which the benefit estimate b̂(t) > 0.
For illustration, in Fig. 2 we show benefit estimates over time for an input series, where the zero-crossing of the estimate corresponds to the decision time.
Note that in some cases BeneFitter may refrain from making any prediction for the entire duration of a test instance, namely when the estimated benefit never goes above zero. For outcome classification tasks, such a case is simply registered as a default-class prediction, and its prediction time is recorded as the full sequence length. For the ICU scenario, a non-prediction is one where no flag is raised, suggesting survival and the regular course of action. For type classification tasks, in contrast, a non-prediction suggests “waiting for more data”, which, at the end of the observation period, simply implies insufficient evidence for any class. We refer to those as unclassified test instances. Note that BeneFitter differs from existing prediction models that always produce a prediction; the unclassified instances may be of independent interest to domain experts in the form of outliers, noisy instances, etc.
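The streaming decision rule above can be sketched in a few lines; `estimators` is a hypothetical mapping from each (non-default) class label to a callable that scores the current prefix with an estimated benefit:

```python
def decide_online(observations, estimators):
    """Feed observations one tick at a time and emit the first class
    whose estimated benefit turns positive. Returns (label, time);
    label is None for a non-prediction (default class in the outcome
    setting, 'unclassified' in the type setting)."""
    prefix = []
    for t, obs in enumerate(observations, start=1):
        prefix.append(obs)
        for label, estimate in estimators.items():
            if estimate(prefix) > 0:
                return label, t  # first positive-benefit class wins
    return None, len(prefix)    # benefit never exceeded zero
```

With a toy estimator such as `{"death": lambda p: sum(p) - 2.5}`, the stream `[1, 1, 1, 1]` triggers a death flag at tick 3, while `[0, 0, 0]` ends as a non-prediction.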
4.3. Predicting Benefit
For each time series X_i, we aim to predict the benefit at every time tick t, denoted b_i(t). Consider the outcome classification problem, where we train one regressor model for the non-default class, say death. For each training series for which y_i = survive (i.e., the default class), the benefit of predicting death at t is b_i(t) = S(t) − M. Similarly, for each training series for which y_i = death, b_i(t) = S(t). (See Table 2.) To this end, we create training samples of the form (X_i^(1:t), b_i(t)) per instance X_i. Note that the problem becomes a regression task. For type classification problems, we train a separate regression model per class with the corresponding benefit values. (See Table 3.)
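The training-set expansion just described can be sketched for the outcome setting as follows, with targets per Table 2 under the linear savings of Eq. (2):

```python
import numpy as np

def benefit_targets(X, is_death, e, M):
    """Expand one labeled series X of shape (T, d) into T
    (prefix, target) training pairs for the death-class benefit
    regressor: the target at tick t is e*(T - t) if the series' true
    label is death, and e*(T - t) - M otherwise (a death flag on such
    a series would be a false alarm)."""
    T = X.shape[0]
    return [(X[:t], e * (T - t) - (0.0 if is_death else M))
            for t in range(1, T + 1)]
```

Each series of length T thus contributes T regression samples; the regressor sees prefixes of every length, which is what lets it score incoming prefixes at test time.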
Model. We set up the task of benefit prediction as a sequence regression problem. We require BeneFitter to ingest multivariate and variable-length input to estimate the benefit b(t). We investigate the use of Long Short-Term Memory (LSTM) networks (hochreiter1997long), a variant of recurrent neural networks (RNNs), for the sequence (time-series) regression, since their recursive formulation allows LSTMs to handle multivariate, variable-length inputs naturally. The recurrent formulation of LSTMs also enables BeneFitter to make real-time predictions as new observations arrive one at a time.

Attention. Recurrent networks usually find it hard to focus on the relevant information in long input sequences. For example, an EEG pattern at the beginning of a sequence may contain useful information about the patient's outcome, yet the lossy representation of the LSTM would forget it. This issue is mostly encountered in longer input sequences (luong2015effective). The underlying idea of attention (vaswani2017attention) is to learn a context that captures the relevant information from parts of the sequence to help predict the target. For a sequence of length T, given the hidden states h_t from the LSTM and the context vector v, the attention step in BeneFitter combines the information from both vectors to produce a final attention-based hidden state h̃, as described below:
(3)  α_t ∝ exp(h_t⊤ W_a c_T),  v = Σ_{t=1}^{T} α_t h_t
(4)  h̃ = tanh(W_c [v; h_T])
where c_T is the memory (cell) state at the last time step T, h_t is the hidden-state output of the LSTM at time t, h̃ is the attention-based hidden state, tanh is the nonlinear transformation, and W_c (along with W_a) is a learned parameter. Intuitively, the attention weights α_t allow the model to learn to focus on specific parts of the input sequence for the regression task. The benefit prediction is then given by a single-layer neural network, b̂ = w⊤h̃ + β, where w and β are the parameters of the linear layer.

BeneFitter is intended for real-life decision making, where the attention mechanism could help an expert by highlighting the relevant information that guided the model to output a decision. We present model implementation details and the list of tunable parameters in §5.
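One plausible NumPy rendering of the attention readout described above; the parameter names (`W_a`, `W_c`) and the dot-product scoring against the last cell state are our assumptions, and the paper's exact parameterization may differ:

```python
import numpy as np

def attention_readout(H, c_last, W_a, W_c):
    """H: (T, d) LSTM hidden states; c_last: (d,) last cell/memory
    state used as the query; W_a: (d, d) scoring parameter; W_c:
    (d, 2d) combination parameter. Returns the attention-based hidden
    state h~ fed to the final linear benefit regressor."""
    scores = H @ (W_a @ c_last)                  # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax attention weights
    context = weights @ H                        # (d,) context vector v
    combined = np.concatenate([context, H[-1]])  # [v; h_T]
    return np.tanh(W_c @ combined)               # (d,) final hidden state
```

The benefit estimate is then a scalar linear readout of the returned vector, e.g. `b = w @ h_tilde + beta`.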
5. Experiments
We evaluate our method through extensive experiments on a set of benchmark datasets and on a set of datasets from realworld use cases. We next provide the details of the datasets and the experimental setup, followed by results.
Dataset  Train  Test  Classes  Length  Dimension 
EEGICU Hour  507  218  2  24–96  107 
EEGICU 30 Min  507  218  2  48–192  107 
EEGICU 10 Min  507  218  2  144–576  107 
ECG200  100  100  2  96  1 
ItalyPowerDemand  67  1029  2  24  1 
GunPoint  50  150  2  150  1 
TwoLeadECG  23  1139  2  82  1 
Wafer  1000  6062  2  152  1 
ECGFiveDays  23  861  2  136  1 
MoteStrain  20  1252  2  84  1 
Coffee  28  28  2  286  1 
Yoga  300  3000  2  426  1 
SonyAIBO  20  601  2  70  1 
Endomondo  99754  42751  2  450  2 
5.1. Dataset Description
We apply BeneFitter to our EEG-ICU datasets (see §2.1), and to public benchmark datasets from diverse domains with varying dimensionality, length, and scale. Table 4 provides a summary of the datasets used in the evaluation. Note that the EEG-ICU datasets are variable-length, while the benchmarks often used in the literature are not. Detailed descriptions of the public datasets are included in Appx. A.1.
5.2. Experimental Setup
Baselines. We compare BeneFitter to the following five early time-series classification methods (also see Table 1):

ECTS: Early Classification on Time Series (xing2012early) uses the minimum prediction length (MPL) and makes a prediction once the length of the test prefix reaches the MPL of its top nearest neighbor (1NN).

EDSC: Early Distinctive Shapelet Classification (xing2011extracting) extracts local shapelets for classification, ranked by a utility score incorporating earliness and accuracy. The multivariate extension of EDSC (MEDSC) (ghalwash2012early) provides a utility function that can incorporate multi-dimensional series.

CECTS: Cost-aware ECTS (dachraoui2015early; tavenard2016cost) trades off a misclassification cost against the cost of delaying the prediction, and estimates the expected future cost at each time step to determine the optimal time at which to classify an incoming time series.

RelClass: Reliable Classification (parrish2013classifying) uses a reliability measure to estimate the probability that the label assigned given incomplete data (at time step t) would be the same as the label assigned given the complete data.
E2EL: End-to-end Learning for Early Classification of Time Series (russwurm2019end) optimizes a joint cost function based on accuracy and earliness, and provides a framework to estimate a stopping probability based on the cost function.
5.3. Evaluation
We design our experiments to answer the following questions:
[Q1] Effectiveness: How effective is BeneFitter at early prediction on time series compared to the baselines? What is the trade-off with respect to accuracy and earliness? How does the accuracy–earliness trade-off vary with respect to the model parameters?
[Q2] Efficiency: How does the running time of BeneFitter scale w.r.t. the number of training instances? How fast is the online response time of BeneFitter?
[Q3] Discoveries: Does BeneFitter lead to interesting discoveries on realworld case studies?
[Q1] Effectiveness
We compare BeneFitter to the baselines on (1) patient outcome prediction, the main task that inspired our benefit formulation, (2) an activity prediction task on a web-scale dataset, as well as (3) a set of 10 two-class time-series classification datasets. The datasets for the first two tasks are multi-dimensional and variable-length, which many of the baselines cannot handle. Thus, we compare BeneFitter with the E2EL and MEDSC baselines, which can work with such time-series sequences. The comparison with MEDSC is limited to the smaller one-hour EEG dataset, since it does not scale to larger datasets. In order to compare BeneFitter to all other baselines, we conduct experiments on ten benchmark time-series datasets.
Patient Outcome Prediction. We compare BeneFitter with the baseline E2EL on two competing criteria: performance (e.g., precision, accuracy) and earliness (tardiness; lower is better) of the decision. We report the precision, recall, F1 score, accuracy, tardiness, and total benefit of each method when applied to the test set. The EEG dataset is a high-dimensional, variable-length dataset for which most of the baselines are not applicable. In our experiments, we set the misclassification cost for each of the dataset variants (sampled at one hour, 30 minutes, and 10 minutes, respectively) based on the average daily cost of ICU care and the lawsuit cost. For the baseline methods, we report the best results over their earliness-accuracy trade-off parameters, selecting the accuracy and earliness values closest (in Euclidean distance) to the ideal accuracy and ideal tardiness.
Table 5 reports the evaluation against different performance metrics. Note that predicting the ‘default state’ for a patient does not change the behavior of the system; however, predicting death (the unfavorable outcome) may suggest that the clinician intervene with alternative care. In such a decision setting, it is critical for the classifier to exhibit high precision. Our results indicate that BeneFitter achieves significantly higher precision (according to the micro sign test (yang1999re)) compared to the baselines. At the same time, a comparatively lower tardiness indicates that BeneFitter requires markedly fewer observations on average to output a decision (no statistical test was conducted for tardiness). We also compare the total benefit accrued by each method on the test set, where BeneFitter outperforms the competition. The results are consistent across the three datasets of varying granularity, from hourly sampled data to 10-minute sampled data.
Prec. 
Recall 
F1 
Acc. 
Tardiness 
benefit 

EEG Hour 
E2EL  0.70  0.68  0.69  0.79  1.0  2600 
MEDSC  0.69  0.65  0.67  0.78  0.52  2497  
BeneFitter  0.80*  0.68  0.73*  0.83*  0.64  2737  
EEG 30Min 
E2EL  0.64  0.67  0.65  0.78  1.0  4800 
BeneFitter  0.68*  0.66  0.67*  0.79  0.63  5962  
EEG 10Min 
E2EL  0.73  0.69  0.71  0.82  0.86  736 
BeneFitter  0.76*  0.69  0.72  0.83  0.48  18722 
For the hourly sampled set, we also compare our method to the multivariate EDSC baseline (for the 30-minute and 10-minute EEG datasets, MEDSC does not scale). Though MEDSC provides a better earliness trade-off than the other two methods, the precision of its outcomes is the lowest, which is not desirable in this decision setup. In Table 5, we indicate significant results using *, based on the comparison between BeneFitter and E2EL.
Benchmark Prediction Tasks. To jointly evaluate accuracy and earliness (tardiness; lower is better), we plot accuracy against tardiness to compare the Pareto frontier of each competing method over 10 different benchmark datasets. In Fig. 1 and Fig. 3, we show the accuracy–tardiness trade-off for 10 benchmark UCR datasets. Each point on the plot represents the model evaluation for one choice of trade-off parameters, as reported in Table A1 (§A.2). Note that BeneFitter dominates the frontiers of all the baselines in accuracy vs. tardiness on five of the datasets. Moreover, our method appears on the Pareto frontier for four out of the remaining five for at least one set of parameters.
To further assess the methods, we report quantitative results in Table 6 in terms of accuracy at a given tolerance of tardiness. An acceptable level of tolerance indicates how far an application domain is indifferent to delay in a decision. For example, a tolerance of 0.50 indicates that decisions are evaluated at 0.50·T, where T is the maximum sequence length, and any decision made up to 0.50·T is considered for evaluation. In Fig. 3, we fix the x-axis at a particular tolerance and report the best accuracy to the left of that tolerance in Table 6. The reported tolerance level indicates the average tolerance across the test time-series sequences. BeneFitter outperforms the competition seven times out of ten at a tolerance level of 0.50, indicating that our method achieves the best performance using only half of the observations. In the remaining three cases, our method is second best among all competing methods. Similarly, at tolerance 0.75, BeneFitter is among the top two methods nine times out of ten.
Dataset       Tardiness  ECTS  EDSC  CECTS  RelClass  E2EL  BeneFitter
ECG200        0.50       –     0.84  0.83   0.88      0.87  0.91
              0.75       –     0.84  0.83   0.89      0.87  0.91
ItalyPower    0.50       –     –     0.78   0.85      0.89  0.93
              0.75       –     0.85  0.94   0.95      0.89  0.93
GunPoint      0.50       –     0.95  0.80   –         0.93  0.97
              0.75       0.91  0.95  0.84   0.91      0.96  0.97
TwoLeadECG    0.50       –     0.88  0.89   –         0.79  0.98
              0.75       0.73  0.89  0.94   0.73      0.86  0.98
Wafer         0.50       –     0.99  0.96   1.0       0.99  0.99
              0.75       1.0   0.99  0.96   1.0       0.99  0.99
ECGFiveDays   0.50       –     –     0.59   0.57      0.64  0.87
              0.75       0.72  0.95  0.59   0.77      0.77  0.87
MoteStrain    0.50       –     0.8   0.85   –         –     0.85
              0.75       –     0.8   0.85   –         –     0.85
Coffee        0.50       –     –     0.98   0.89      0.53  0.93
              0.75       –     0.75  0.98   0.89      0.53  0.93
Yoga          0.50       –     0.71  0.64   –         0.79  0.76
              0.75       –     0.71  0.64   –         0.79  0.76
SonyAIBO      0.50       –     0.80  0.81   0.81      0.92  0.92
              0.75       0.69  0.80  0.81   0.81      0.92  0.92
Endomondo     0.50       ✗     dns   ✗      ✗         0.68  0.66
Endomondo Activity Prediction. We run experiments on the full Endomondo dataset (a large-scale dataset) to compare BeneFitter with the baseline E2EL (the other baselines do not scale) for one set of earliness–accuracy trade-off parameters. We first compare the two methods on a sampled subset of the dataset, evaluated over a choice of trade-off parameters (see Fig. 6 in Appendix A.3), and select the parameters that yield performance closest to the ideal. With the selected parameters, the comparison of the two methods on the large-scale Endomondo activity prediction dataset is reported in Table 6 (last row). We report the accuracy of the two methods at a matched tardiness. Notice that the two methods are comparable in prediction performance while using less than half the length of a sequence to output a decision.
The quantitative results suggest a way to choose the best classifier for a specified tolerance level in an application. In critical domains such as medicine or predictive maintenance, a lower tolerance is preferred to save cost. In such domains, BeneFitter provides a clear choice for early decision making based on the benchmark dataset evaluation.
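The accuracy-at-tolerance evaluation used in Table 6 can be sketched as follows. This is a minimal illustration; the function name and the convention of scoring late decisions as errors are our own assumptions, not the paper's code:

```python
def accuracy_at_tolerance(y_true, y_pred, decision_steps, lengths, tolerance):
    """Accuracy at a given tardiness tolerance: only decisions made within
    tolerance * T steps can count as correct; later decisions are scored
    as errors (assumed convention)."""
    correct = 0
    for yt, yp, t, T in zip(y_true, y_pred, decision_steps, lengths):
        if t <= tolerance * T and yt == yp:
            correct += 1
    return correct / len(y_true)
```

For instance, a prediction made at step 8 of a length-10 series is excluded at tolerance 0.50 but included at tolerance 0.80.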
[Q2] Efficiency
Fig. 4 shows the scalability of BeneFitter with the number of training time series and with the number of observations per time series at test time. We use the ECG200 dataset from the UCR benchmark to report runtime results.
Linear training time: We create ten datasets from the ECG200 dataset by retaining a fraction of the total number of training instances. For a fixed set of parameters, we train our model individually on each of the created datasets. The wall-clock running time is reported against the fraction of training sequences in Fig. 4 (left). The points in the plot align on a straight line, indicating that BeneFitter scales linearly with the number of sequences.
Constant test time: We now evaluate BeneFitter's runtime by varying the number of observations over time. For this experiment, we retain the hidden state of an input test sequence up to time t. When a new observation arrives at time t+1, we update the hidden state of the RNN cell using the new observation and compute the predicted benefit from the updated state. Fig. 4 (right) plots the wall-clock time against each new observation, averaged over the test-set examples. The plot indicates that we get a real-time decision in constant time per observation.
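The constant-time streaming update can be sketched with a plain Elman-style RNN cell in NumPy. This is an illustrative stand-in for the actual model: the weights are random, the sizes are arbitrary, and the linear benefit head is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h = 3, 8                      # illustrative input / hidden sizes
W_x = 0.1 * rng.normal(size=(D_h, D_in))
W_h = 0.1 * rng.normal(size=(D_h, D_h))
w_out = 0.1 * rng.normal(size=D_h)    # hypothetical benefit-regression head

def step(h, x):
    """One constant-time update: fold the new observation x into the hidden
    state, then regress a benefit estimate from the updated state."""
    h_new = np.tanh(W_x @ x + W_h @ h)
    return h_new, float(w_out @ h_new)

h = np.zeros(D_h)
b_hat = 0.0
for t in range(5):                    # observations arriving one at a time
    h, b_hat = step(h, rng.normal(size=D_in))
    if b_hat > 0:                     # decision rule: act once estimated benefit is positive
        break
```

Because each arrival touches only the fixed-size state, the per-observation cost does not grow with the length of the sequence observed so far.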
The efficiency of our model makes it suitable for deployment in real-time prediction tasks.
[Q3] Discoveries
In this section, we present an analysis of BeneFitter highlighting some salient aspects of our proposed framework on the ICU patient outcome task. In particular, we discuss how our method explains the benefit prediction by highlighting the parts of the input that contributed most to the prediction, and how our benefit formulation assists with model evaluation.
Explaining Benefit Estimation Our method utilizes the attention mechanism (see §4) in the RNN network for benefit regression. The model calculates a weight corresponding to each hidden state; these weights indicate which time steps the model focuses on to estimate the benefit for the current input series. In Fig. 5, we plot one dimension of an input time series from the EEG dataset. This dimension corresponds to the amplitude of the EEG (aEEG) measured in the left hemisphere of the brain. The input sequence is taken from the hourly sampled dataset. Note the sharp rises and falls in the aEEG signal early in the recording. We input the multivariate sequence, including the aEEG signal, to BeneFitter. The model outputs the attention weights corresponding to each time step of the input, shown in Fig. 5 as a heatmap (dark colors indicate lower weights, lighter colors higher weights). BeneFitter outputs a decision at an early time step; however, we also evaluate the model at later time steps. Each row of the heatmap represents one such evaluation: for each evaluation, we obtain a weight signifying the importance of each time step, plotted as one row. The x-axis of the heatmap corresponds to the time dimension, and the y-axis to the evaluation time step of the input. Observe that the attention places higher weights towards the beginning of the time series, where we observe the crests and troughs of the signal.
The visualization of importance weights is useful in critical applications where each decision involves a high cost. The estimated benefit along with the importance information can assist a clinician with the decision making.
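The attention weighting described above can be sketched as a softmax over scores of the hidden states. This is a minimal illustration; the model's actual scoring function may differ, and `v` stands in for a learned parameter:

```python
import numpy as np

def attention_weights(H, v):
    """Softmax attention over hidden states H (shape T x D), scored by a
    vector v. The resulting per-time-step weights are what a heatmap row
    like the one in Fig. 5 visualizes."""
    scores = H @ v
    e = np.exp(scores - scores.max())   # max-shift for numerical stability
    return e / e.sum()
```

The weights are nonnegative and sum to one, so each row of the heatmap is directly comparable across time steps.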
Assisting with Model Evaluation In the clinical setting with comatose patients, there is a natural cost associated with an inaccurate prediction and savings obtained from knowing the labels early. The benefit modeling captures the overall value of outputting a decision. Although we use this value for learning a regression model, the benefit formulation can also be used as an evaluation metric to assess the quality of a predictive model. If we know the domain-specific unit-time savings and misclassification cost, we can evaluate a model's performance for those particular values. Table 7 reports the evaluation of BeneFitter for various values of the misclassification cost, with the model trained using the same values. We notice that increasing the cost improves the precision of the model; however, an increased cost also results in a higher penalty for any misclassification. For our model trained on hourly sampled EEG data, we observe that cost values above 300 result in an overall negative benefit averaged over the test data. Given the domain's unit-time savings, this bounds the misclassification (e.g., lawsuit) cost that the model can tolerate. Similarly, any model can be evaluated to assess its usefulness using our benefit formulation as an evaluation measure in critical domains.

Cost  Precision  Recall  F1 score  Accuracy  Tardiness  Benefit
100   0.80       0.68    0.73      0.83      0.64       2737
200   0.80       0.67    0.71      0.82      0.68       1032
300   0.82       0.67    0.74      0.84      0.68       156
400   0.82       0.69    0.75      0.84      0.70       −1326
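Using benefit as an evaluation metric, as in Table 7, can be sketched as follows. This is one plausible instantiation under an assumed per-step savings s and misclassification cost c; the paper's exact formulation may differ:

```python
def total_benefit(y_true, y_pred, decision_steps, lengths, s=1.0, c=100.0):
    """Total benefit over a test set (assumed form): each decision earns the
    unit-time savings s for every step saved by deciding early, and each
    misclassification is penalized by cost c."""
    total = 0.0
    for yt, yp, t, T in zip(y_true, y_pred, decision_steps, lengths):
        total += s * (T - t)      # savings from the steps not observed
        if yt != yp:
            total -= c            # misclassification penalty
    return total
```

Under this form, raising c (as in Table 7) leaves early correct decisions untouched but pushes the aggregate benefit down with every error, eventually turning it negative.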
6. Conclusions
In this paper, we considered the benefit-aware early prediction of health outcomes for ICU patients and proposed BeneFitter, which is designed to effectively handle multivariate and variable-length signals such as EEG recordings. We made multiple contributions.

Novel, costaware problem formulation: BeneFitter infuses the incurred savings from an early prediction as well as the cost from misclassification into a unified target called benefit. Unifying these two quantities allows us to directly estimate a single target, i.e., benefit, and importantly dictates BeneFitter exactly when to output a prediction: whenever estimated benefit becomes positive.

Efficiency and speed: The training time for BeneFitter is linear in the number of input sequences, and it can operate under a streaming setting to update its decision based on incoming observations.

Multivariate and variable-length time series: BeneFitter is designed to handle multiple time sequences of varying length, making it suitable for various domains including health care.

Effectiveness on real-world data: We applied BeneFitter to early prediction of health outcomes on ICU EEG data, where it provides substantial time-savings compared to competitors while achieving equal or better accuracy. BeneFitter also outperformed or tied with the top competitors on other real-world benchmarks.
Appendix A
In this appendix, we first give a detailed description of the datasets used in the experiments. We then provide model training details for BeneFitter and present the hyperparameter configurations used for BeneFitter as well as the competing methods in the benchmark experiments.
A.1. Dataset Description
• Benchmark Datasets. Our benchmark datasets consist of 10 two-class time-series classification datasets from the UCR repository (UCRArchive). The datasets cover diverse domains and a diverse range of series lengths. The UCR archive provides the train/test split for each of these datasets, which we retain in our experiments.
• Endomondo Dataset. Endomondo is a social fitness app that tracks numerous fitness attributes of the users. We use the webscale Endomondo dataset (ni2019modeling) (See Table 4) for the early activity prediction task. The data includes various measurements such as heart rate, altitude and speed, along with contextual data such as user id and activity type. For the task of early activity prediction, we use heart rate and altitude signals for early prediction of the type of activity, specifically biking vs. running. (Note that we leave out signals like speed and its derivatives which make the classification task too easy.)
A.2. Experimental Settings
A.2.1. Model Training Details
We define the outcome prediction problem as a regression task on the benefit, as presented in §4. The training examples are the sequences observed up to each time step $t$, paired with the corresponding expected benefit at time $t$. We then split the training examples, using the larger portion of the sequences to train the RNN model and holding out the remainder for validation, and select our model parameters based on the evaluation on the validation set. We have two sets of hyperparameters: one corresponding to our benefit formulation (the unit-time savings and the misclassification cost), and the other for the RNN model. The hyperparameter grid for the RNN model includes the dimension of the hidden representation and the learning rate. For BeneFitter, we fix the unit-time savings at 1 and vary the misclassification cost for model selection; the grid for the benefit formulation scales with the length of the training series. For the general multiclass problem, we tune an additional hyperparameter, a decision margin, to predict the class label: it is set as a fraction of the maximum difference between the expected benefits of the class labels for a given training series. BeneFitter outputs a decision when the predicted benefit is positive and at least this margin higher than the predicted benefit of every other label. For the learning task, we use the mean squared error loss and the Adam optimizer (kingma2014adam). The loss corresponding to class $k$ is given by $\mathcal{L}_k = \frac{1}{N}\sum_{i}\sum_{t=1}^{T_i}\bigl(\hat{b}_k(x_i^{1:t}) - b_k(x_i^{1:t})\bigr)^2$, where $N$ is the total count of time steps over all sequences in the training set, $T_i$ is the length of sequence $i$, and $b_k$ denotes the expected benefit for class $k$.
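The per-prefix regression setup above can be sketched as follows. The names `make_training_pairs`, the `expected_benefit` callback, and `mse` are illustrative, not the paper's code; the callback stands in for the benefit formulation:

```python
import numpy as np

def make_training_pairs(series, label, expected_benefit):
    """Expand one labeled series into per-prefix regression examples:
    (observations up to t, expected benefit of predicting `label` at t).
    `expected_benefit` is a caller-supplied stand-in for the paper's
    benefit formulation."""
    return [(series[: t + 1], expected_benefit(series, label, t))
            for t in range(len(series))]

def mse(pred, target):
    """Mean squared error, the regression loss used to fit the benefit."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.mean((pred - target) ** 2))
```

Each series of length $T_i$ thus contributes $T_i$ regression examples, which is why the loss normalizes by the total count of time steps rather than the number of sequences.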
A.2.2. Benchmark Experiments – Hyperparameters
We compared the performance of BeneFitter against six competing methods on the benchmark datasets. In Table A1, we report the hyperparameter configurations for each method that provide a trade-off between accuracy and earliness.
Method      Model Training Hyperparameters
ECTS        support
EDSC        Chebyshev parameter
CECTS       delay cost
RelClass    reliability
E2EL        earliness trade-off
BeneFitter