Data-driven Prognostics with Predictive Uncertainty Estimation using Ensemble of Deep Ordinal Regression Models

by Vishnu TV, et al.
Tata Consultancy Services

Prognostics or Remaining Useful Life (RUL) estimation from multi-sensor time series data is useful to enable condition-based maintenance and ensure high operational availability of equipment. We propose a novel deep learning based approach for prognostics with uncertainty quantification that is useful in scenarios where: (i) access to labeled failure data is scarce due to the rarity of failures, (ii) future operational conditions are unobserved, and (iii) inherent noise is present in the sensor readings. All three are unavoidable sources of uncertainty in the RUL estimation process, often resulting in unreliable RUL estimates. To address (i), we formulate RUL estimation as an Ordinal Regression (OR) problem, and propose LSTM-OR: a deep Long Short Term Memory (LSTM) network based approach to learn the OR function. We show that LSTM-OR naturally allows for incorporation of censored operational instances in training along with the failed instances, leading to more robust learning. To address (ii), we propose a simple yet effective approach to quantify predictive uncertainty in the RUL estimation models by training an ensemble of LSTM-OR models. Through empirical evaluation on the C-MAPSS turbofan engine benchmark datasets, we demonstrate that LSTM-OR is significantly better than the commonly used deep metric regression based approaches for RUL estimation, especially when failed training instances are scarce. Further, our uncertainty quantification approach yields high quality predictive uncertainty estimates while also leading to improved RUL estimates compared to single best LSTM-OR models.




1 Introduction

In the current digital era, streaming data is ubiquitous. In the context of Industrial Internet of Things, remote health monitoring services driven by sensor driven data analytics are becoming increasingly popular. Data-driven approaches for anomaly detection, diagnostics, prognostics and optimization have been proposed to provide operational support to engineers, ensure high reliability and availability of equipment, and to optimize the operational cost (da2014internet). Typically, a large number of sensors (order of hundreds or sometimes thousands) are installed to capture the operational behavior of complex equipment with various sub-systems interacting with each other.

Recently, deep learning approaches have been proposed for various data-driven health monitoring tasks including anomaly detection (p:lstm-ad; p:icmlLSTM-AD; gugulothu2018sparse) and prognostics (malhotra2016multi; gugulothu2017predicting; zheng2017long), yielding state-of-the-art results for RUL estimation (gugulothu2017predicting) using Recurrent Neural Networks (RNNs). In this work, we focus on the problem of prognostics or Remaining Useful Life (RUL) estimation of operational instances given the current and historical readings from various sensors capturing their behavior. Deep learning approaches for prognostics, and equipment health monitoring in general, have certain limitations as highlighted in gugulothu2018on; gugulothu2017predicting; khan2018review.

In this work, we address two important practical challenges in deep learning based RUL estimation approaches. The challenges addressed and the corresponding key contributions of this work are as follows:

Challenge-I: Deep neural networks are prone to overfitting and typically require a large number of labeled training instances to avoid overfitting. If failure time for an instance is known, a target RUL can be obtained at any time before the failure time. However, labeled training instances for RUL estimation are few as failures are rare. Also, any operational instance (or any instance for which failure time is not known, or which has not failed yet) is considered to be censored as target RUL cannot be determined for such an instance.

We note that approaches based on deep RNNs (heimes2008recurrent; malhotra2016multi; gugulothu2017predicting; zheng2017long; zhang2018long) and Convolutional Neural Networks (CNNs) formulate RUL estimation as a metric regression (MR) problem, where a normalized estimate of RUL is obtained from the time series of sensor data via a non-linear regression function learned from the data. This MR formulation of RUL estimation cannot directly leverage the censored data typically encountered in RUL estimation scenarios.

Key Contribution-I: In addition to using failed instances for training, we propose a novel approach to leverage the censored instances in a supervised learning setting, in turn increasing the training data and leading to more robust RUL estimation models. We cast RUL estimation as an ordinal regression (harrell2001ordinal) problem (instead of the typically used metric regression formulation) and propose the LSTM-OR (Long Short Term Memory Networks based Ordinal Regression) approach to RUL estimation. We show that partially labeled training instances can be generated from the readily available operational (non-failed) instances to augment the labeled training data in the ordinal regression setting to build more robust RUL estimation models. We empirically show that LSTM-OR outperforms LSTM-MR by effectively leveraging censored data when the number of failed instances available for training is small.

Challenge-II: The black-box nature of deep neural networks makes it difficult to interpret the predictions/estimates and, in turn, to gauge their reliability. It is, therefore, desirable to quantify the predictive uncertainty in deep neural network based predictions of RUL, as it can aid engineers and operators in risk assessment and decision making while accounting for the reliability of predictions.

Key Contribution-II: We propose a simple yet effective approach to quantify uncertainty based on an ensemble of LSTM-OR models (using a similar idea as in NIPS2017_7219, as detailed in Section 5). An ensemble of deep LSTM-OR models leads to improved RUL estimation performance, and the empirical standard deviation (ESD) of the predictions from the LSTM-OR models provides an approximate measure of uncertainty. We empirically show that when ESD (i.e. the uncertainty in estimation) is low, the corresponding error in estimation is also low, making ESD a useful uncertainty quantification metric.

Organization of the paper: We provide an overview of related literature in Section 2. In Section 3, we briefly introduce deep LSTM networks as used to build our deep OR models. We provide details of LSTM-OR and uncertainty quantification approaches in Sections 4 and 5, respectively. We provide experimental evaluation details and observations in Section 6, and finally conclude in Section 7.

2 Related Work

Trajectory Similarity based RUL estimation: An important class of approaches for RUL estimation is based on trajectory similarity, e.g. wang2008similarity; khelif2014rul; lam2014enhanced; malhotra2016multi; gugulothu2017predicting. These approaches compare the health index trajectory or trend of a test instance with the trajectories of failed train instances to estimate RUL using a distance metric such as Euclidean distance. Such approaches work well when trajectories are smooth and monotonic in nature but are likely to fail in scenarios when there is noise or intermittent disturbances (e.g. spikes, operating mode change, etc.) as the distance metric may not be robust to such scenarios (gugulothu2017predicting).

Metric Regression based RUL estimation: Another class of approaches is based on metric regression. Unlike trajectory similarity based methods which rely on comparison of trends, metric regression methods attempt to learn a function to directly map sensor data to RUL, e.g. heimes2008recurrent; benkedjouh2013remaining; dong2014lithium; babu2016deep; gugulothu2017predicting; zheng2017long; vishnu2018recurrent. Such methods can better deal with non-monotonic and noisy scenarios by learning to focus on the relevant underlying trends irrespective of noise. Within metric regression methods, a few methods consider non-temporal models such as Support Vector Regression for learning the mapping from values of sensors at a given time instance to RUL, e.g. benkedjouh2013remaining; dong2014lithium.

Temporal models for RUL estimation: Deep temporal models such as those based on RNNs (heimes2008recurrent; malhotra2016multi; gugulothu2017predicting; zheng2017long) or Convolutional Neural Networks (CNNs) (babu2016deep) can capture the degradation trends better compared to non-temporal models, and are proven to perform better. Moreover, these models can be trained in an end-to-end learning manner without requiring feature engineering. Despite all these advantages of deep models, they are prone to overfitting in often-encountered practical scenarios where the number of failed instances is small, and most of the data is censored. Our approach based on ordinal regression provisions for dealing with such scenarios, by using censored instances in addition to failed instances to obtain more robust models.

Ordinal Regression for Survival Analysis: Ordinal Regression has been extensively used for applications such as age estimation from facial images (chang2011ordinal; yang2013automatic; niu2016ordinal; liu2017ordinal); however, these applications are restricted to non-temporal image data using Convolutional Neural Networks. cheng2008neural; luck2017deep use feed-forward neural networks based ordinal regression for survival analysis. To the best of our knowledge, the proposed LSTM-OR approach is the first attempt to leverage ordinal regression based training using temporal LSTM networks for RUL estimation.

Deep Survival Analysis: A set of techniques for deep survival analysis have been proposed in the medical domain, e.g. katzman2018deepsurv; luck2017deep. On similar lines, an approach to combine deep learning and survival analysis for asset health management has been proposed in liao2016combining. However, it is not clear how such approaches can be adapted for RUL estimation applications, as they focus on estimating the survival probability at a given point in time, and cannot provide RUL estimates. Further, chapfuwa2018adversarial proposes an approach that leverages adversarial learning for time-to-event modeling in the health domain. On the other hand, LSTM-OR is capable of providing RUL estimates using time series sensor data.

Uncertainty quantification in RUL estimation models: Uncertainty analysis in data-driven equipment health monitoring is an active area of research and an unsolved problem. The approaches described in sankararaman2013novel, 6496971 use analytical algorithms, unlike sampling-based methods, to estimate the uncertainty in prognostics. They consider various sources of uncertainty such as the loading and operating conditions of the system at hand, inaccurate sensor measurements, etc. to quantify their combined effect on RUL predictions. The task is formulated as an uncertainty propagation problem where the various types of uncertainty are propagated through state space models until failure. Also, the future states of the system are estimated using the state space models and are used to arrive at an estimate of RUL. Unlike these approaches, we focus on estimating RUL as well as predictive uncertainty by using an ensemble of deep neural networks to model the time-series of sensor data available till a given point in time, without predicting the future states of the system. Our approach does not rely on any assumptions such as those needed in a state-space model. Further, domain knowledge of the underlying dynamics of a system is not needed to quantify uncertainty, and therefore, our approach is much simpler to adapt.

Uncertainty quantification for deep neural networks: Recently, gal2016dropout proposed the use of dropout at inference time to obtain a Bayesian approximation of the predictive uncertainty of deep neural networks. Further, NIPS2017_7219 proposed the use of an ensemble of neural networks for predictive uncertainty estimation and demonstrated its use in comparison to Bayesian methods. Similarly, we also use an ensemble of LSTM networks to estimate the empirical uncertainty in RUL predictions.

3 Background: Deep LSTM Networks

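We use a variant of LSTMs (hochreiter1997long) as described in zaremba2014recurrent in the hidden layers of the neural network. Hereafter, we denote column vectors by bold small letters and matrices by bold capital letters. For a hidden layer with $h$ LSTM units, the values for the input gate $\mathbf{i}_t$, forget gate $\mathbf{f}_t$, output gate $\mathbf{o}_t$, hidden state $\mathbf{z}_t$, and cell state $\mathbf{c}_t$ at time $t$ are computed using the current input $\mathbf{x}_t$, the previous hidden state $\mathbf{z}_{t-1}$, and the cell state $\mathbf{c}_{t-1}$, where $\mathbf{i}_t$, $\mathbf{f}_t$, $\mathbf{o}_t$, $\mathbf{z}_t$, and $\mathbf{c}_t$ are real-valued $h$-dimensional vectors.

Consider $W_{n_1,n_2}: \mathbb{R}^{n_1} \to \mathbb{R}^{n_2}$ to be an affine transform of the form $\mathbf{z} \mapsto \mathbf{W}\mathbf{z} + \mathbf{b}$ for matrix $\mathbf{W}$ and vector $\mathbf{b}$ of appropriate dimensions. In the case of a multi-layered LSTM network with $L$ layers and $h$ units in each layer, the hidden state $\mathbf{z}^l_t$ at time $t$ for the $l$-th hidden layer is obtained from the hidden state at $t-1$ for that layer, $\mathbf{z}^l_{t-1}$, and the hidden state at $t$ for the previous ($l-1$)-th hidden layer, $\mathbf{z}^{l-1}_t$. The time series goes through the following transformations iteratively at the $l$-th hidden layer for $t = 1$ through $T$, where $T$ is the length of the time series:

$$\begin{pmatrix} \mathbf{i}^l_t \\ \mathbf{f}^l_t \\ \mathbf{o}^l_t \\ \mathbf{g}^l_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W_{2h,4h} \begin{pmatrix} D(\mathbf{z}^{l-1}_t) \\ \mathbf{z}^l_{t-1} \end{pmatrix} \qquad (1)$$

where the cell state $\mathbf{c}^l_t$ is given by $\mathbf{c}^l_t = \mathbf{f}^l_t \odot \mathbf{c}^l_{t-1} + \mathbf{i}^l_t \odot \mathbf{g}^l_t$, and the hidden state $\mathbf{z}^l_t$ is given by $\mathbf{z}^l_t = \mathbf{o}^l_t \odot \tanh(\mathbf{c}^l_t)$. We use dropout for regularization (pham2014dropout), which is applied only to the non-recurrent connections, ensuring information flow across time-steps for any LSTM unit. The dropout operator $D(\cdot)$ randomly sets the dimensions of its argument to zero with probability equal to a dropout rate. The sigmoid ($\sigma$) and $\tanh$ activation functions are applied element-wise.

In a nutshell, this series of transformations for $t = 1, \ldots, T$ converts the input time series $\mathbf{x} = \mathbf{x}_1 \ldots \mathbf{x}_T$ of length $T$ to a fixed-dimensional vector $\mathbf{z}^L_T \in \mathbb{R}^h$. We, therefore, represent the LSTM network by a function $f_{LSTM}$ such that $\mathbf{z}^L_T = f_{LSTM}(\mathbf{x}; \mathbf{W})$, where $\mathbf{W}$ represents all the parameters of the LSTM network.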
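
To make the recurrence concrete, the following is a minimal pure-Python sketch of one LSTM layer in the style described above, with dropout applied only to the input coming from the layer below. It is an illustration under stated assumptions, not the implementation used in the paper: the function names (`lstm_step`, `encode`), the gate ordering inside the weight matrix, the zero initial states, and the absence of dropout rescaling are all choices made here for the example.

```python
import math
import random

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def dropout(vec, rate, rng=None, training=True):
    """Zero each dimension with probability `rate` (no rescaling; an assumption)."""
    if not training or rate == 0.0:
        return list(vec)
    rng = rng if rng is not None else random
    return [0.0 if rng.random() < rate else v for v in vec]

def affine(W, b, v):
    """Return W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, z_prev, c_prev, W, b, rate=0.0, rng=None, training=False):
    """One time step of one LSTM layer with h = len(z_prev) units.

    x_t    : input from the layer below (the only place dropout is applied)
    z_prev : hidden state of this layer at t-1 (recurrent path, no dropout)
    c_prev : cell state of this layer at t-1
    W, b   : affine map to 4h pre-activations, assumed to be ordered as
             [input gate, forget gate, output gate, candidate]
    """
    h = len(z_prev)
    inp = dropout(x_t, rate, rng, training) + list(z_prev)
    pre = affine(W, b, inp)
    i = [sigmoid(v) for v in pre[0:h]]
    f = [sigmoid(v) for v in pre[h:2 * h]]
    o = [sigmoid(v) for v in pre[2 * h:3 * h]]
    g = [math.tanh(v) for v in pre[3 * h:4 * h]]
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    z = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return z, c

def encode(series, W, b, h):
    """Run a single-layer LSTM over a series and return the final hidden state."""
    z, c = [0.0] * h, [0.0] * h  # zero initial states (an assumption)
    for x_t in series:
        z, c = lstm_step(x_t, z, c, W, b)
    return z
```

Stacking several such layers, with the hidden state of one layer serving as `x_t` for the next, gives the multi-layer network; the final hidden state of the top layer plays the role of the fixed-dimensional representation of the series.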

Figure 1: Deep Ordinal Regression versus Deep Metric Regression. Panels: (a) Metric Regression; (b) Ordinal Regression; (c) Ordinal Regression for Censored Data.
Figure 2: Steps in LSTM-OR and Ensemble of LSTM-OR. Panels: (a) process overview for LSTM-OR; (b) RUL and uncertainty estimation using an ensemble of LSTM-OR models.

4 Deep Ordinal Regression for RUL Estimation

4.1 Terminology

Consider a learning set $\mathcal{D} = \{(\mathbf{x}^i, r^i)\}_{i=1}^{n}$ of $n$ failed instances, where $r^i$ is the target RUL, $\mathbf{x}^i = \mathbf{x}^i_1 \ldots \mathbf{x}^i_{T^i} \in \mathcal{X}$ is a multivariate time series of length $T^i$, $\mathbf{x}^i_t \in \mathbb{R}^p$, and $p$ is the number of input features (sensors). The total operational life of an instance $i$ till the failure point is $F^i$, s.t. $T^i \le F^i$. Therefore, $r^i = F^i - T^i$ is the RUL in the given unit of measurement, e.g., number of cycles or operational hours. Hereafter, we omit the superscript $i$ in this section for better readability, and provide all the formulation considering an instance (unless stated otherwise).

We consider an upper bound $r_u$ on the possible values of RUL as, in practice, it is not possible to predict too far ahead in future. So if $r > r_u$, we clip the value of $r$ to $r_u$. The usually defined goal of RUL estimation via Metric Regression (MR) is to learn a mapping $f_{MR}: \mathcal{X} \to [0, r_u]$. With these definitions, we next describe the LSTM-based Ordinal Regression (LSTM-OR) approach as summarized in Figure 2(a), and then describe how we incorporate censored data into the LSTM-OR formulation.

4.2 LSTM-based Ordinal Regression

Figure 3: Target vector creation for failed versus censored instance. Panels: (a) Failed Instance; (b) Censored Instance.

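Instead of mapping an input time series to a real-valued number as in MR, we break the range $[0, r_u]$ of RUL values into $K$ intervals of length $c = r_u / K$ each, where each interval is then considered as a discrete variable. The $j$-th interval corresponds to $\left((j-1)\frac{r_u}{K},\, j\frac{r_u}{K}\right]$, and $r$ is mapped to the $k$-th interval with $k = \lceil r / c \rceil$, where $\lceil \cdot \rceil$ denotes the ceiling function.

We consider $K$ binary classification sub-problems for the $K$ discrete variables (intervals): a classifier $C_j$ solves the binary classification problem of determining whether $r \le j\frac{r_u}{K}$.

We train an LSTM network for the $K$ binary classification tasks simultaneously by modeling them together as a multi-label classification problem: We obtain the multi-label target vector $\mathbf{y} = [y_1, \ldots, y_K] \in \{0, 1\}^K$ from $r$ such that

$$y_j = \begin{cases} 0, & j < k \\ 1, & j \ge k \end{cases} \qquad (2)$$

where $j = 1, 2, \ldots, K$.

For example, consider a scenario where $K = 5$, and $r$ maps to the third interval such that $k = 3$. The target is then given by $\mathbf{y} = [0, 0, 1, 1, 1]$, as illustrated in Figure 3(a). Effectively, the goal of LSTM-OR is to learn a mapping $f_{OR}: \mathcal{X} \to \{0, 1\}^K$ by minimizing the loss function $\mathcal{L}$ given by:

$$\hat{\mathbf{y}} = \sigma(\mathbf{W}_C\, \mathbf{z}^L_T + \mathbf{b}_C), \qquad \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{K} \sum_{j=1}^{K} \left[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \right] \qquad (3)$$

where $\hat{\mathbf{y}}$ is the estimate for target $\mathbf{y}$, $\mathbf{W}$ represents the parameters of the LSTM network, and $\mathbf{W}_C$ and $\mathbf{b}_C$ are the parameters of the layer that maps $\mathbf{z}^L_T$ to the output sigmoid layer.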
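
The target construction and the multi-label loss can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the encoding in which the $j$-th target is 1 exactly when the RUL does not exceed the upper limit of the $j$-th interval, assigns an RUL of zero to the first interval, and uses hypothetical helper names (`interval_index`, `or_target`, `bce_loss`). The values 130 and 5 in the usage note are example inputs only.

```python
import math

def interval_index(r, r_u, K):
    """Map an RUL value r (clipped to r_u) to its 1-based interval index ceil(r / c)."""
    c = r_u / K                      # interval length
    r = min(r, r_u)                  # clip to the upper bound
    return max(1, math.ceil(r / c))  # assumption: r = 0 falls in the first interval

def or_target(r, r_u, K):
    """Multi-label target: 0 for intervals below the one containing r, 1 otherwise."""
    k = interval_index(r, r_u, K)
    return [0 if j < k else 1 for j in range(1, K + 1)]

def bce_loss(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy over the K classifier outputs."""
    total = 0.0
    for t, p in zip(y, y_hat):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y)
```

For instance, with an upper bound of 130 cycles split into 5 intervals of 26 cycles, an RUL of 60 falls in the third interval, so `or_target(60, 130, 5)` yields a vector whose first two entries are 0 and whose remaining entries are 1.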

4.3 Using Censored Data for Training

For any censored instance, the data is available only till a time $T$ prior to failure and the failure time $F$ is unknown (illustrated in Figure 3(b)). Therefore, the target RUL $r$ is also unknown. However, at any time $t_0$ s.t. $1 \le t_0 < T$, it is known that the RUL $r > T - t_0$ since the instance is operational at least till $T$. Considering $\mathbf{x}_1 \ldots \mathbf{x}_{t_0}$ as the input time series, we next show how we assign labels to a few of the dimensions of the target vector $\mathbf{y}$: Assuming $T - t_0$ maps to the interval $k'$, since $r > T - t_0$, we have $k \ge k'$. Since $k$ is unknown (as $r$ is unknown) and we have $k \ge k'$, the target vector $\mathbf{y}$ can only be partially obtained:

$$y_j = \begin{cases} 0, & j < k' \\ \text{unknown}, & j \ge k' \end{cases} \qquad (4)$$

For all $j \ge k'$, the corresponding binary classifier targets are masked, as shown in Figure 3(b), and the outputs from these classifiers are not included in the loss function for the instance. The loss function given by Equation 3 can thus be modified for including the censored instances in training as:

$$\mathcal{L}_m(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{K'} \sum_{j=1}^{K'} \left[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \right] \qquad (5)$$

where $K' = k' - 1$ for a censored instance and $K' = K$ for a failed instance.
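
A masked loss of this kind can be sketched as below. The sketch is illustrative and rests on assumptions not stated in the text: masked targets are represented by `None`, a known target of 0 means the RUL is known to exceed the upper limit of that interval, and an instance with no known labels contributes a loss of zero. The helper names are hypothetical.

```python
import math

def censored_target(lower_bound, r_u, K):
    """Partial target for a censored instance whose RUL exceeds `lower_bound`.

    Entries below the interval containing the lower bound are known to be 0;
    the remaining entries are unknown and marked as None (masked).
    """
    c = r_u / K
    k_prime = max(1, math.ceil(min(lower_bound, r_u) / c))
    return [0 if j < k_prime else None for j in range(1, K + 1)]

def masked_bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy averaged over the known (unmasked) targets only."""
    known = [(t, p) for t, p in zip(y, y_hat) if t is not None]
    if not known:
        return 0.0  # assumption: nothing to learn from a fully masked target
    total = 0.0
    for t, p in known:
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(known)
```

For a failed instance every target is known, so `masked_bce` reduces to the ordinary mean cross-entropy over all classifier outputs; a single loss function can therefore serve both kinds of training instance.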

4.4 Mapping OR estimates to RUL

Once trained, each of the $K$ classifiers provides a probability $\hat{y}_j$ for the RUL being no greater than the upper limit of the interval corresponding to the $j$-th classifier. We obtain the point-estimate $\hat{r}$ for $r$ from $\hat{\mathbf{y}}$ for a test instance as follows (similar to chang2011ordinal):

$$\hat{r} = r_u \left( 1 - \frac{1}{K} \sum_{j=1}^{K} \hat{y}_j \right) \qquad (6)$$

where $r_u$ is the upper bound on RUL. It is worth noting that once learned, the LSTM-OR model can be used in an online manner for operational instances: at the current time instance $t$, the sensor data from the latest $T$ time instances can be input to the model to obtain the RUL estimate at $t$.
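
As a small worked sketch of this mapping, the function below converts the vector of classifier outputs into a point estimate. It assumes that each output estimates the probability that the RUL does not exceed the upper limit of the corresponding interval; under the opposite encoding (probability of exceeding the limit) the estimate would instead be the upper bound times the mean output. The function name and the example numbers are illustrative only.

```python
def rul_from_or(y_hat, r_u):
    """Point estimate of RUL from K classifier outputs.

    Each output near 0 says the RUL is still above that interval's upper
    limit, so every such interval adds one interval length (r_u / K) to the
    estimate.
    """
    K = len(y_hat)
    return r_u * (1.0 - sum(y_hat) / K)
```

With five intervals over an upper bound of 130 cycles, outputs close to `[0, 0, 1, 1, 1]` give an estimate close to 52 cycles, while outputs that are all near zero give an estimate near the upper bound.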

5 Predictive Uncertainty Quantification using Ensemble of LSTM-OR Models

Uncertainty quantification is very important in case of RUL estimation as equipment and operations involved are often of critical nature, and reliable predictions close to (but of course, prior to) failures can help avoid catastrophic failures by generating suitable alarms beforehand. Lack of sufficient training data, inherent noise in sensor readings, and uncertainty in the future usage and operation of equipment are few sources of uncertainty in case of data-driven predictive models for RUL estimation. Quantifying uncertainty in RUL estimates can assist ground engineers and operators to arrive at more informed decisions compared to scenarios where only RUL estimates are available without any metric indicating whether the model is certain about the estimate or not. In other words, uncertainty quantification of the RUL estimate enhances the reliability of data-driven models. This is even more relevant in deep neural network-based estimation models due to their otherwise black-box nature.

An uncertainty metric can be considered to be reliable if: i) for low uncertainty values, i.e. whenever the model is confident about its estimations, the corresponding errors in the RUL estimations are low, and for high uncertainty values, the corresponding errors in the RUL estimation model should be high, ii) it produces RUL estimates with low uncertainty when a failure is approaching, i.e. the model should be able to precisely estimate the RUL with a high degree of certainty close to failures.

To quantify the predictive uncertainty in the target vector estimate $\hat{\mathbf{y}}$ and the corresponding RUL estimate $\hat{r}$, we consider training an ensemble of LSTM-OR models. We consider an ensemble learning approach similar to that introduced in NIPS2017_7219: For training an ensemble of LSTM-OR models, we consider all the training data while using different (random) initializations of the parameters ($\mathbf{W}$, $\mathbf{W}_C$ and $\mathbf{b}_C$) of the LSTM-OR models and random shuffling of the training instances to obtain different models in an ensemble. The final RUL estimate of the ensemble is given by the simple average of the RUL estimates of the models in the ensemble, and the empirical standard deviation (ESD) in the RUL estimates is used as an approximation of the predictive uncertainty in RUL estimation. More specifically, as shown in Figure 2(b), we train $m$ LSTM-OR models such that we have $m$ RUL estimates $\hat{r}_1, \ldots, \hat{r}_m$ for any instance. We obtain the point estimate $\hat{r}$ for $r$ from $\hat{r}_1, \ldots, \hat{r}_m$ for an instance as follows:

$$\hat{r} = \frac{1}{m} \sum_{l=1}^{m} \hat{r}_l \qquad (7)$$

The uncertainty in terms of ESD is given by:

$$u = \sqrt{\frac{1}{m} \sum_{l=1}^{m} (\hat{r}_l - \hat{r})^2} \qquad (8)$$

We normalize the uncertainty values ($u$) using the minimum and maximum uncertainty values across all instances in a hold-out validation set through min-max normalization. We also consider other measures of uncertainty quantification in terms of entropy (similar to park2015using) as explained in Appendix A.1 but found ESD to be the most robust measure of uncertainty. We support this with experimental evaluation in Section 6.3.
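
The ensemble estimate, the ESD, and the min-max normalization can be sketched in a few lines. This is an illustration under assumptions: the standard deviation divides by the number of models rather than by one less, a degenerate validation range maps to zero, and the function names are hypothetical.

```python
import math

def ensemble_estimate(ruls):
    """Return (mean RUL, empirical standard deviation) over the ensemble's estimates."""
    m = len(ruls)
    mean = sum(ruls) / m
    # Assumption: population form (divide by m), not the sample form (m - 1).
    esd = math.sqrt(sum((r - mean) ** 2 for r in ruls) / m)
    return mean, esd

def fit_min_max(validation_esds):
    """Minimum and maximum ESD observed on the hold-out validation set."""
    return min(validation_esds), max(validation_esds)

def normalize(esd, lo, hi):
    """Min-max normalize an ESD value using validation-set statistics."""
    if hi == lo:
        return 0.0  # assumption: a degenerate range carries no information
    return (esd - lo) / (hi - lo)
```

Because the normalization constants come from the validation set, a test instance can receive a normalized uncertainty slightly outside the range 0 to 1; whether to clip such values is a design choice left open here.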

6 Experimental Evaluation

We evaluate RUL estimation and uncertainty quantification approaches using the publicly available C-MAPSS aircraft turbofan engine benchmark datasets (saxena2008turbofan). We provide an overview of the dataset in Section 6.1. We consider metric regression models and ordinal regression models trained only on failed instances as baseline models, and compare following approaches for RUL estimation: i) MR: LSTM-MR using failed instances only (as in zheng2017long; heimes2008recurrent; gugulothu2017predicting), ii) OR: LSTM-OR using failed instances only and using loss as in Equation 3, iii) ORC: LSTM-OR leveraging censored data along with failed instances using loss as in Equation 5, iv) ORCE: simple average ensemble of ORC models. We describe RUL estimation approaches in Section 6.2. Further, to evaluate uncertainty quantification approach as described in Section 5, we study the relationship of uncertainty estimates with error and ground truth RUL in Section 6.3 while also introducing novel metrics to evaluate the efficacy of uncertainty estimates in context of prognostics.

6.1 Dataset Description

We consider datasets FD001 and FD004 from the simulated turbofan engine datasets (saxena2008turbofan). The training sets (train_FD001.txt and train_FD004.txt) of the two datasets contain time series of readings for 24 sensors (21 sensors and 3 operating condition variables) of several instances (100 in FD001 and 249 in FD004) of a turbofan engine from the beginning of usage till end of life. The time series for the instances in the test sets (test_FD001.txt and test_FD004.txt) are pruned some time prior to failure, such that the instances are operational and their RUL needs to be estimated. The actual RUL values for the test instances are available in RUL_FD001.txt and RUL_FD004.txt. We randomly sample 20% of the available training set instances, as given in Table 1, to create a validation set for hyperparameter selection.

For simulating the scenario for censored instances, a percentage $f_c \in \{0, 50, 70, 90\}$ of the training and validation instances are randomly chosen, and the time series for each chosen instance is randomly truncated at one point prior to failure. We then consider these truncated instances as censored (currently operational) and their actual RUL values as unknown. The remaining $(100 - f_c)$% of the instances are considered as failed. Further, the time series of each instance thus obtained (censored and failed) is truncated at 20 random points in the life prior to failure, and the exact RUL for failed instances and the minimum possible RUL for the censored instances (as in Section 4 and Figure 3) at the truncated points are used for obtaining the models. The number of instances thus obtained for training and validation is given in Table 2. The test set remains the same as the benchmark dataset across all scenarios (with no censored instances). The MR and OR approaches cannot utilize the censored instances as the exact RUL targets are unknown, while ORC can utilize the lower bound on RUL targets to obtain partial labels as per Equation 4.
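
The censoring simulation described above can be sketched as follows. This is not the authors' procedure in detail: uniform sampling of truncation points, rounding the censored count down, and the function names are assumptions made for the example. Rounding down is chosen because it reproduces the failed and censored counts listed in Table 3.

```python
import random

def simulate_censoring(lengths, censored_pct, rng):
    """Mark a percentage of run-to-failure instances as censored.

    lengths      : failure time (number of cycles) of each instance
    censored_pct : integer percentage of instances to censor
    Returns a list of (observed_length, is_censored) pairs.
    """
    n = len(lengths)
    n_censored = n * censored_pct // 100          # assumption: round down
    censored = set(rng.sample(range(n), n_censored))
    observed = []
    for idx, failure_time in enumerate(lengths):
        if idx in censored:
            # Keep only a random prefix that ends strictly before failure.
            observed.append((rng.randint(1, failure_time - 1), True))
        else:
            observed.append((failure_time, False))
    return observed

def truncation_points(observed_length, rng, n_points=20):
    """Draw random cut points t0 and the remaining observed life at each cut.

    For a failed instance the remaining life is the exact RUL; for a censored
    instance it is only a lower bound on the RUL.
    """
    cuts = [rng.randint(1, observed_length) for _ in range(n_points)]
    return [(t0, observed_length - t0) for t0 in cuts]
```

With 80 training instances and 20 cut points each, this yields 1600 training samples, which matches the FD001 count in Table 2.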

An engine may operate in different operating conditions and also have different failure modes at the end of its life. The number of operating conditions and failure modes for both the datasets are given in Table 1. FD001 has only one operating condition, so we ignore the corresponding three sensors (the operating condition variables) such that the number of input features is $p = 21$, whereas FD004 has six operating conditions determined by the three operating condition variables. We map these six operating conditions to a 6-dimensional one-hot vector as in zheng2017long, such that $p = 27$.

6.2 RUL Estimation

In this section, we define performance metrics to evaluate our RUL estimation models, i.e., ORC and ORCE. We then describe our experimental settings, followed by results and observations. We also compare our proposed RUL estimation models with existing RUL estimation models.

6.2.1 Performance Metrics for Evaluating RUL Estimation Models

There are several metrics proposed to evaluate the performance of prognostics models (saxena2008metrics). We measure the performance of our models in terms of Timeliness Score (S) and Root Mean Squared Error (RMSE): For a test instance $i$, the error in estimation is given by $e_i = \hat{r}_i - r_i$. The timeliness score for $N$ test instances is given by $S = \sum_{i=1}^{N} \left( \exp(\gamma \cdot |e_i|) - 1 \right)$, where $\gamma = 1/\tau_1$ if $e_i < 0$, else $\gamma = 1/\tau_2$. Usually, $\tau_1 > \tau_2$ such that late predictions are penalized more compared to early predictions. We use $\tau_1 = 13$ and $\tau_2 = 10$ as proposed in saxena2008damage. The lower the value of $S$, the better is the performance. The root mean squared error (RMSE) is given by: $RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} e_i^2}$.
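
Both metrics are straightforward to compute; the sketch below assumes the error is defined as the estimate minus the target, so that a negative error is an early prediction, and uses the time constants 13 and 10 that are standard for the C-MAPSS scoring function. The function names are illustrative.

```python
import math

def timeliness_score(estimates, targets, tau_1=13.0, tau_2=10.0):
    """Asymmetric exponential score; late predictions (positive error) cost more."""
    score = 0.0
    for r_hat, r in zip(estimates, targets):
        e = r_hat - r
        gamma = 1.0 / tau_1 if e < 0 else 1.0 / tau_2
        score += math.exp(gamma * abs(e)) - 1.0
    return score

def rmse(estimates, targets):
    """Root mean squared error of the RUL estimates."""
    n = len(estimates)
    return math.sqrt(sum((r_hat - r) ** 2 for r_hat, r in zip(estimates, targets)) / n)
```

Because the score is a sum of exponentials rather than a mean, a single badly late estimate can dominate it, which is why the score varies far more across models than RMSE does.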

Dataset Train Validation Test OC FM
FD001 80 20 100 1 1
FD004 199 50 248 6 2
Table 1: Number of train, validation and test instances. Here, OC: number of operating conditions, FM: number of fault modes.
Dataset Train Validation Test
FD001 1600 400 100
FD004 3980 1000 248
Table 2: Number of truncated instances.
FD001
Censored (%)  n_f  n_c  RMSE-MR  RMSE-OR  RMSE-ORC  RMSE-ORCE    S-MR    S-OR   S-ORC  S-ORCE
0              80    0    15.62    15.63     15.63      14.62   507.2  367.64  367.64  292.76
50             40   40    17.56    19.06     17.60      15.98   444.1  564.14  572.63  372.26
70             24   56    19.92    16.48     18.53      16.57  713.31  362.21  561.11  404.94
90              8   72    25.32    24.83     21.51      20.38       -       -       -       -
FD004
Censored (%)  n_f  n_c  RMSE-MR  RMSE-OR  RMSE-ORC  RMSE-ORCE    S-MR    S-OR   S-ORC  S-ORCE
0             199    0    26.88    28.33     28.33      27.47       -       -       -       -
50            100   99    29.71    32.85     31.48      30.62       -       -       -       -
70             60  139    33.17    33.65     32.13      31.27       -       -       -       -
90             20  179    41.23    43.88     39.75      38.41       -       -       -       -
Table 3: Comparison of various LSTM-based approaches considered in terms of RMSE and Timeliness Score (S) for FD001 and FD004 datasets. $n_f$ and $n_c$ denote number of failed and censored instances in training set, respectively.
Figure 4: Percentage gain of ORC and ORCE over MR with decreasing number of failed instances ($n_f$) in training. Panels: (a) RMSE, FD001; (b) Timeliness Score (S), FD001; (c) RMSE, FD004; (d) Timeliness Score (S), FD004.
Figure 5: Comparison of ESD and ENT as measures of uncertainty in terms of (a)-(b) Precision-Recall curves for FD001 and FD004; and (c)-(d) F1 scores for FD001 and FD004 with a varying threshold. ESD is a more robust uncertainty metric compared to ENT.
Figure 6: Performance evaluation of ESD as an uncertainty metric. Panels: (a) average error with varying uncertainty threshold, showing lower uncertainty values corresponding to low RUL estimation errors; (b) uncertainty evaluation with varying RUL, showing highly precise and correct uncertainty estimates close to failures, i.e. when RUL is low.

6.2.2 Experimental Setup

We consider an RUL upper bound of $r_u = 130$ cycles for all models, as used in babu2016deep; zheng2017long. For OR and ORC, we fix the number of intervals $K$, which determines the interval length $c = r_u / K$. For training the MR models, a normalized RUL in the range 0 to 1 (where 1 corresponds to a target RUL of 130 or more) is given as the target for each input. We use a maximum time series length of $T = 360$; for any instance with more than 360 cycles, we take the most recent 360 cycles. Also, we use the standard z-normalization to normalize the input time series sensor-wise using the mean and standard deviation of each sensor from the train set.
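
The preprocessing steps above can be sketched as follows. The sketch makes a few assumptions that the text does not specify: statistics are pooled over all time steps of all training series, a sensor with zero variance is left centered but unscaled, and the helper names are hypothetical.

```python
import math

def fit_sensor_stats(train_series):
    """Per-sensor mean and standard deviation over all steps of all training series.

    train_series: list of series, each a list of rows with p sensor values.
    """
    p = len(train_series[0][0])
    rows = [row for series in train_series for row in series]
    means = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / len(rows))
            for j in range(p)]
    return means, stds

def z_normalize(series, means, stds):
    """Standardize each sensor; constant sensors are only centered (an assumption)."""
    return [[(v - m) / (s if s > 0 else 1.0) for v, m, s in zip(row, means, stds)]
            for row in series]

def latest_window(series, max_len=360):
    """Keep only the most recent `max_len` cycles of a series."""
    return series[-max_len:]

def mr_target(r, r_u=130.0):
    """Normalized metric-regression target in [0, 1]; 1 means r_u cycles or more."""
    return min(r, r_u) / r_u
```

The same training-set statistics are reused for validation and test series, so that no information from held-out data leaks into the normalization.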

The hyperparameters $h$ (number of hidden units per layer), $L$ (number of hidden layers) and the learning rate are each chosen from a predefined set of candidate values. We use a dropout rate of 0.2 for regularization, and a batch size of 32 during training. The models are trained for a maximum of 2000 iterations with early stopping. The best hyperparameters are obtained using grid search by minimizing the respective loss function on the validation set.

For ORCE, we consider an ensemble of $m = 6$ models (we consider up to 10 models in the ensemble, and found $m = 6$ to work best across the scenarios considered). The models are trained on the best hyperparameters selected from the corresponding hyperparameter sets of ORC. While training different models, we ensure random initializations of the parameters of the neural network and random shuffling of the training instances. For selecting $m$ models from the available models, we order the models in ascending order of their respective loss values on the validation set and then select the first 6 models.
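
The model selection step amounts to a sort on validation loss. A minimal sketch, with a hypothetical function name and made-up loss values used only for illustration:

```python
def select_ensemble(validation_losses, m=6):
    """Return the indices of the m models with the lowest validation loss."""
    order = sorted(range(len(validation_losses)), key=lambda i: validation_losses[i])
    return order[:m]

# Illustrative losses for 10 independently trained models (not from the paper).
example_losses = [0.5, 0.1, 0.9, 0.3, 0.2, 0.7, 0.4, 0.6, 0.8, 0.05]
chosen = select_ensemble(example_losses)
```

Since all candidates share the same hyperparameters, the only differences between them come from the random initialization and the shuffling order, so this step mainly filters out runs that converged poorly.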

6.2.3 Results and Observations

As summarized in Table 3, we observe that: As the number of failed training instances ($n_f$) decreases, the performance of all models degrades (as expected). However, importantly, for scenarios with small $n_f$, ORCE significantly outperforms MR and OR. For example, as shown in Figure 4, with 90% of the instances censored (i.e. with $n_f = 8$ and 20 for FD001 and FD004, respectively), ORCE performs significantly better than MR, and shows 19.5% and 6.8% improvement over MR in terms of RMSE, for FD001 and FD004, respectively. The gains in terms of timeliness score are higher because of the exponential nature of $S$ (refer Section 6.2.1). It is evident from Figure 4 that ORCE performs better than ORC and MR in terms of both RMSE and $S$. The performance gap between ORCE and ORC increases significantly in terms of timeliness score ($S$) for the FD004 dataset when few failed instances are available, as shown in Figure 4(d). With fewer failed training instances ($n_f$), some models in the ensemble are not trained properly and result in high errors even for instances with low RUL. This results in very high values of $S$. For ORC, the overall value of $S$ tends to be high because we report the average of the timeliness scores of the individual models in the ensemble. For ORCE, in contrast, the instance-wise RUL estimates are obtained as the average of the estimates from the models in the ensemble, so its performance in terms of $S$ is better than that of ORC.

While MR and OR have access to only a small number of failed instances for training, ORCE and ORC have access to failed instances as well as partial labels from censored instances for training. Therefore, MR and OR models tend to overfit while ORC and ORCE models are more robust.

We also provide a comparison with existing deep CNN-based (babu2016deep) and LSTM-based (zheng2017long) MR approaches in Table 4. ORC (same as OR for 0%) performs comparably to existing MR methods. More importantly, as noted above, ORC and ORCE may be advantageous and more suitable for practical scenarios with few failed training instances.

6.3 Uncertainty Quantification

We introduce various metrics used to evaluate the performance of the proposed ensemble-based uncertainty estimation approach. Using these metrics, we demonstrate the efficacy of the proposed approach from a practical point of view. We compare the proposed ESD (Equation 8) and two variants of entropy (as introduced in Appendix A.1) for uncertainty evaluation.

6.3.1 Performance Metrics for Evaluating Uncertainty Quantification Methods

We expect our model to be certain when the RUL estimates are correct, and less certain for highly erroneous RUL estimates. We consider an RUL estimate to be correct if its absolute error is within an error threshold, and to be certain if the corresponding uncertainty estimate is within an uncertainty threshold. Also, when evaluating the uncertainty metrics, we restrict the target RUL to the maximum target RUL used during training, since the models are trained with this maximum and their estimates cannot exceed it. Otherwise, even if the model confidently estimates an RUL close to this maximum, an actual RUL much greater than the maximum would lead to a high error and prevent a proper performance evaluation of the uncertainty metrics.

Under these considerations, we measure precision and recall (Equation 9) to evaluate the uncertainty quantification approach. Precision is the fraction of test instances with uncertainty below the uncertainty threshold that also have error below the error threshold. Recall is defined as the fraction of test instances having uncertainty and error below the uncertainty and error thresholds, respectively. In both cases, the counts are the numbers of test instances satisfying the stated condition.
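A minimal sketch of these two quantities is given below. The function name, the threshold names `tau_e` and `tau_u`, and the use of non-strict inequalities are assumptions made for illustration. The recall denominator follows one literal reading of the verbal definition (all test instances); it could also be read as the number of instances with low error, so treat that choice as an assumption too.

```python
def precision_recall(errors, uncertainties, tau_e, tau_u):
    """Precision and recall of 'certain' estimates with respect to 'correct' ones.

    errors        : signed RUL estimation errors, one per test instance
    uncertainties : uncertainty estimates, one per test instance
    tau_e, tau_u  : hypothetical names for the error and uncertainty thresholds
    """
    n = len(errors)
    certain = [u <= tau_u for u in uncertainties]
    correct = [abs(e) <= tau_e for e in errors]
    both = sum(1 for c, k in zip(certain, correct) if c and k)
    n_certain = sum(certain)
    precision = both / n_certain if n_certain else float("nan")
    # Assumption: recall is taken over all test instances.
    recall = both / n if n else float("nan")
    return precision, recall


# Invented example values.
errors = [2, -15, 4, 30, -1]
uncertainties = [0.05, 0.08, 0.30, 0.40, 0.02]
p, r = precision_recall(errors, uncertainties, tau_e=10, tau_u=0.1)
```

Sweeping `tau_u` while holding `tau_e` fixed would trace out a precision-recall curve of the kind used to compare uncertainty metrics.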

Further, it is desirable to have highly certain and correct estimates close to failure, to avoid fatal consequences upon failure. To evaluate performance from this point of view, we analyze the relation between uncertainty and nearness to failure: it is desirable to have low error as well as low uncertainty when the actual RUL is low. To evaluate this aspect, we study the variation in precision for different RUL thresholds, considering only test instances with low ground-truth RUL. For given thresholds on RUL, uncertainty and error, the modified precision in this context (Equation 10) quantifies the fraction of test instances with actual RUL and uncertainty below their respective thresholds that also have error below the error threshold.
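The RUL-conditioned precision can be sketched as follows. The function and parameter names (`tau_r`, `tau_e`, `tau_u`) are hypothetical, the inequalities are assumed to be non-strict, and the data are invented.

```python
def precision_near_failure(actual_rul, errors, uncertainties, tau_r, tau_e, tau_u):
    """Among instances with actual RUL <= tau_r and uncertainty <= tau_u,
    return the fraction whose absolute error is <= tau_e."""
    hits = [
        abs(e) <= tau_e
        for r, e, u in zip(actual_rul, errors, uncertainties)
        if r <= tau_r and u <= tau_u
    ]
    return sum(hits) / len(hits) if hits else float("nan")


# Invented example values.
actual_rul = [5, 12, 40, 8, 90]
errors = [1, -12, 3, 2, 25]
uncertainties = [0.03, 0.05, 0.04, 0.5, 0.6]

p_r = precision_near_failure(actual_rul, errors, uncertainties,
                             tau_r=20, tau_e=10, tau_u=0.1)
```

Here only the first two instances are both near failure and certain, and one of them is within the error threshold, so `p_r` is 0.5.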

Method                     FD001    FD004
CNN-MR (babu2016deep)      18.45    29.16
LSTM-MR (zheng2017long)    16.14    28.17
MR (ours)                  15.62    26.88
ORC (proposed)             15.63    28.33
ORCE (proposed)            14.62    27.47

Table 4: Performance comparison of the proposed approach with existing approaches in terms of RMSE and Timeliness Score (S).

6.3.2 Results and Observations

For the sake of brevity, we restrict the results and observations to the uncensored scenario, i.e. the scenario with no censored training instances. Similar results and observations for models corresponding to censored scenarios are presented in Appendix A.2.

Comparing ESD vs Entropy (ENT) as uncertainty metric: Precision and Recall (as in Equation 9) are used to compare the two approaches for uncertainty estimation. Precision-Recall curves are obtained by varying the threshold on uncertainty while keeping the error threshold fixed. We observe that, over part of the operating range, P is higher for ESD on the FD001 dataset, as shown in Figure 5(a). Similar behavior is observed for the FD004 dataset, as shown in Figure 5(b). We further plot the score defined in Equation 9 while varying the threshold, as shown in Figures 5(c) and 5(d), which shows that ESD is a better uncertainty quantification metric than ENT. (We also analyze the instances for which ESD behaves unexpectedly, i.e. shows low uncertainty despite a high error in the RUL estimate. The observations are given in Appendix A.2.)

Relation between uncertainty and error: For a reliable model, RUL estimates with high certainty must be accurate, i.e. have low RUL estimation errors. To evaluate the uncertainty metric in this context, we consider the instances with uncertainty below a given threshold and compute the average error in RUL estimation for these instances. As shown in Figure 6(a), for low values of the uncertainty threshold the average error thus computed is also low, indicating that the model is more accurate when it is more certain. Further, as expected, the average error increases with an increasing uncertainty threshold, suggesting that the RUL estimates tend to be more erroneous when the model is uncertain.
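This curve can be computed with a few lines of code. The sketch below uses invented errors and uncertainties and a hypothetical helper name; it assumes the average is taken over absolute errors and that the threshold comparison is non-strict.

```python
def average_error_below(errors, uncertainties, tau_u):
    """Mean absolute RUL error over instances with uncertainty <= tau_u."""
    kept = [abs(e) for e, u in zip(errors, uncertainties) if u <= tau_u]
    return sum(kept) / len(kept) if kept else float("nan")


# Invented example values in which error grows with uncertainty.
errors = [1, -3, 8, -20, 35]
uncertainties = [0.02, 0.05, 0.2, 0.5, 0.9]

thresholds = (0.05, 0.2, 0.5, 1.0)
curve = [average_error_below(errors, uncertainties, t) for t in thresholds]
```

For a well-behaved uncertainty metric, the values in `curve` rise as the threshold is relaxed, which is the pattern described for Figure 6(a).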

Relation between uncertainty and actual RUL: To quantify the relationship between RUL and uncertainty, the modified precision is calculated as in Equation 10 for a varying RUL threshold, while keeping the uncertainty and error thresholds fixed. From a practical point of view, higher precision at lower RUL thresholds is desirable, so that instances approaching failure are handled correctly and confidently. A similar trend is observed in our case, as shown in Figure 6(b): at low RUL thresholds, the precision values for FD001 and FD004 indicate how often the model is both certain and accurate for instances close to failure.

7 Conclusion and Discussion

In this work, we have proposed a novel approach for RUL estimation using deep ordinal regression based on multilayered LSTM neural networks. We have argued that the ordinal regression formulation is more robust than metric regression, as the former allows for incorporation of more labeled data from censored instances. We found that leveraging censored instances significantly improves performance when the number of failed instances is small. In the future, it would be interesting to see whether a semi-supervised approach (e.g. as in yoon2017semi; gugulothu2018on) with initial unsupervised pre-training of LSTMs using failed as well as censored instances can further improve the robustness of the models. Further, an extension of the proposed approach to address the commonly encountered non-stationary scenario, using approaches similar to saurav2018online, can be considered. Note that although we have experimented with LSTMs for Ordinal Regression, our OR approach is generic enough to be useful for any neural network, e.g. CNNs.

Further, we have proposed a simple yet effective approach to quantify uncertainty in the RUL estimates by using a simple average ensemble of the deep ordinal regression models. The proposed empirical standard deviation based metric for uncertainty provides accurate predictive uncertainty estimates: we observe low errors in RUL estimation for low uncertainty values. Further, the model is found to be accurate with high certainty when the remaining useful life is very low, i.e. the instance is approaching failure. It will be interesting to see if the ensemble based approach for uncertainty quantification can be extended to metric regression models as well using uncertainty methods for regression as proposed in NIPS2017_7219.


Appendix A

A.1 Entropy as a Measure of Uncertainty

Consider the set of possible values of the multi-label target vector. In our case, since we are using OR, the number of possible combinations depends on the length of the target vector.

For entropy, we average the estimates from the models in the ensemble to obtain the final estimate. The uncertainty for a test instance is then the entropy of this averaged estimate, computed by treating the binary classifiers as independent.

We normalize the uncertainty values through min-max normalization, using the minimum and maximum uncertainty values across all instances in a hold-out validation set. For the entropy calculations, we also tried calculating the entropy for each of the classifiers individually and then averaging them to obtain the final value, but this was less effective than the approach described above.
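As a rough sketch of this computation: for independent binary classifiers, the joint entropy is the sum of the individual binary entropies, so the uncertainty can be computed from the averaged sigmoid outputs as below. The function names are hypothetical, natural logarithms are assumed, and the exact form used by the authors (for instance any additional scaling) is not confirmed by the text.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def ensemble_entropy(model_outputs):
    """model_outputs: one list of K classifier probabilities per ensemble member.

    Averages the probabilities across models, then sums the binary entropies
    of the K averaged outputs (independence assumption).
    """
    n_models = len(model_outputs)
    k = len(model_outputs[0])
    mean_p = [sum(m[j] for m in model_outputs) / n_models for j in range(k)]
    return sum(binary_entropy(p) for p in mean_p)


def min_max_normalize(value, val_min, val_max):
    """Scale using the min and max uncertainty seen on a hold-out validation set."""
    return (value - val_min) / (val_max - val_min)


# Invented outputs: two ensemble members, three binary classifiers each.
confident = ensemble_entropy([[0.99, 0.98, 0.01], [0.97, 0.99, 0.02]])
uncertain = ensemble_entropy([[0.60, 0.50, 0.40], [0.40, 0.50, 0.60]])
```

The second call yields the larger entropy because its averaged probabilities all sit at 0.5, the point of maximum binary entropy.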

Figure 7: Precision-Recall curves comparing ESD and ENT. (a) FD001 dataset. (b) FD004 dataset.

Figure 8: Average error at varying uncertainty threshold. (a) FD001 dataset. (b) FD004 dataset.

Figure 9: Uncertainty evaluation with respect to RUL. (a) FD001 dataset. (b) FD004 dataset.

A.2 Detailed Evaluation for Uncertainty Quantification

Instance Level Analysis: We perform a qualitative analysis of test instances that have low uncertainty values despite high error values, to understand the scenarios in which our proposed uncertainty quantification measure fails. For each such test engine (test instance), we consider the three nearest engines from the training set, where nearness is defined in terms of the Euclidean distance between the target vector estimates of the test and training engines. After finding the nearest training engines, we plot the first PCA component corresponding to the test engine and its nearest training engines.

For the PCA step, each multivariate reading at each timestamp is reduced to a univariate reading by taking the first principal component, as discussed in malhotra2016multi.
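The nearest-engine lookup used in this analysis can be sketched as below. The engine identifiers and target vector estimates are invented, and the helper name is hypothetical; only the rule (three nearest training engines by Euclidean distance between target vector estimates) is taken from the text. The PCA step is not shown.

```python
import math


def nearest_engines(test_vec, train_vecs, k=3):
    """Return ids of the k training engines closest to the test engine.

    test_vec   : target vector estimate for the test engine
    train_vecs : dict mapping training engine id -> target vector estimate
    """
    ranked = sorted(train_vecs, key=lambda eid: math.dist(test_vec, train_vecs[eid]))
    return ranked[:k]


# Invented target vector estimates.
test_vec = [1.0, 1.0, 0.0, 0.0]
train_vecs = {
    "e1": [1.0, 1.0, 0.0, 0.0],
    "e2": [1.0, 0.0, 0.0, 0.0],
    "e3": [1.0, 1.0, 1.0, 1.0],
    "e4": [0.0, 0.0, 1.0, 1.0],
}

neighbours = nearest_engines(test_vec, train_vecs, k=3)
```

`math.dist` requires Python 3.8 or later.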

Figure 10: Instance Level Analysis (PCA) showing instances with low uncertainty estimate but high error in RUL estimate. (a) FD001 dataset. (b) FD004 dataset.

The PCA plot for test engine #93 from FD001 is shown in Figure 10(a). Although a large number of cycles have passed for this instance, its RUL is still significantly high. The resulting total life makes it a rare instance, as such high total cycle counts are not observed in the training data. Training engines with such high RUL are rare, which leads to the high error in spite of the low uncertainty.

Similarly, the PCA plot for test engine #166 from FD004 is shown in Figure 10(b). Because the ORC formulation assumes a maximum RUL, the predicted value is limited by this maximum even though the actual value is significantly higher. This clipping effect causes the high error. Moreover, the scarcity of training engines with such high RUL further increases the error in spite of the low uncertainty.

Uncertainty Evaluation with respect to Error: We expect our model to be highly uncertain for higher error values. To evaluate this, we calculate the fraction of test instances with error and uncertainty below the error and uncertainty thresholds, respectively.

Figure 11: Uncertainty Evaluation w.r.t. Error. (a) FD001 dataset. (b) FD004 dataset.

We compute this measure at varying error thresholds while keeping the uncertainty threshold fixed. The results are shown in Figure 11. At lower error thresholds, a higher value indicates that a higher fraction of correct predictions are confident.

Figure 12: Average uncertainty with varying error threshold, indicating low uncertainty values when the error in RUL estimates is low. (a) FD001 dataset. (b) FD004 dataset.

Relationship Between Error and Uncertainty: RUL estimates with lower error values are expected to be certain. To evaluate the uncertainty quantification metric from this aspect, we plot the relationship between error and average uncertainty, shown in Figure 12. The average uncertainty at a given error threshold is computed by selecting the test instances whose RUL estimation error is below the threshold and averaging the uncertainty values of these instances. We observe that at lower error thresholds the computed average uncertainty is also low, indicating that accurate RUL estimates are accompanied by low uncertainty. Further, the increase in average uncertainty with increasing error threshold indicates reliable behavior of the RUL estimation model.