
Remaining Useful Lifetime Prediction via Deep Domain Adaptation

by   Paulo R. de O. da Costa, et al.

In Prognostics and Health Management (PHM), sufficient previously observed degradation data is usually critical for Remaining Useful Lifetime (RUL) prediction. Most previous data-driven prediction methods assume that training (source) and testing (target) condition monitoring data have similar distributions. However, due to different operating conditions, fault modes, noise and equipment updates, distribution shift exists across different data domains. This shift reduces the performance of predictive models built for specific conditions when no observed run-to-failure data is available for retraining. To address this issue, this paper proposes a new data-driven approach for domain adaptation in prognostics using Long Short-Term Memory (LSTM) neural networks. We use a time window approach to extract temporal information from time-series data in a source domain with observed RUL values and a target domain containing only sensor information. We propose a Domain Adversarial Neural Network (DANN) approach to learn domain-invariant features that can be used to predict the RUL in the target domain. The experimental results show that the proposed method can provide more reliable RUL predictions across datasets with different operating conditions and fault modes. These results suggest that the proposed method offers a promising approach to performing domain adaptation in practical PHM applications.



1 Introduction

Companies have been investing in Prognostics and Health Management (PHM) systems to increase the reliability and availability of their engineering assets and to reduce maintenance costs atamuradov2017prognostics. In particular, several works have drawn attention to the task of using data collected from sensors and IoT (Internet of Things) assets to predict maintenance events, such as fault prognostics, detection and diagnostics Jardine2006. In PHM, Remaining Useful Lifetime (RUL) is the amount of time left before a piece of equipment can no longer perform its intended function. Accurate RUL prognostics enable the interested parties to assess an equipment’s health status and to plan future maintenance actions, e.g. logistics of personnel, spare parts and services Papakostas2010.

For example, in our current industry collaboration, a leading manufacturer of medical imaging systems requires prognostic models that can leverage real-time data collected continuously over many locations. These systems are complex and modelling the precise degradation mechanisms is not possible. Also, although they perform similar procedures, the systems log different multivariate sensor data due to equipment version updates, sensor malfunctioning and timing (equipment installed at a later stage has less temporal data). Furthermore, sensor values have different distributions due to distinct usage and degradation levels. In such cases, high-dimensional temporal data has to be used directly to determine the health state of the systems, and models have to adapt to incoming changes in the data.

In the PHM literature, physics-based, statistical and machine learning approaches have been proposed to address the RUL prediction problem. Physics-based approaches build mathematical models that describe the degradation processes of the failure mechanisms. Such models require prior degradation knowledge and provide accurate RUL estimation when failure can be described using its physical properties Lei2018. Statistical methods usually attempt to fit the observations to a probabilistic model that can describe the uncertainty of the degradation process Si2011. Their shortcomings relate to assumptions about health state transitions and data distributions. On the other hand, machine learning models focus on learning the degradation patterns directly from acquired complex raw data. In general, machine learning models are non-parametric and can be applied in practice even without prior information about the underlying distributions and degradation knowledge Lei2018.

Several machine learning methods have been studied for prognostic prediction problems, including Support Vector Machines (SVM) Dong2013, Support Vector Regression (SVR) Benkedjouh2013 and Neural Networks. Neural Networks have received much attention given their ability to approximate high-dimensional non-linear functions directly from raw data Babuska2016. Moreover, several architectures are specially built to support the temporal inputs usually encountered in machine prognostics problems, e.g. Recurrent Neural Networks (RNN). Recently, deep learning methods have been proposed for prognostics problems with large amounts of time-series input data, including Convolutional Neural Networks (CNN) and variations of RNNs capable of dealing with long input sequences, such as Long Short-Term Memory networks (LSTM) ListouEllefsen2019; Zheng2017 and Gated Recurrent Units (GRU).


In classical machine learning, models need enough annotated historical data to reach a useful performance level Li2018; Zheng2017. However, interested parties presumably already apply time-based maintenance to their assets, so observed run-to-failure behaviour becomes scarce. To overcome this problem, practitioners and researchers have to find ways to handle censored data hong2009prediction or to generate (e.g. simulate) more data, which leads to imperfect models that do not represent real-world scenarios. Even when enough run-to-failure data are available, algorithms trained on one specific dataset often cannot be generalised to a different but related dataset. For example, an algorithm trained to predict a specific failure mode often does not generalise well to other modes under similar machinery conditions Lei2018. Moreover, when the input features change among different equipment versions (e.g. new sensor information is available), it is common practice to retrain the models. This retraining delays prognostics actions until enough data is available for accurate prediction.

To address these issues, predictive models that are trained with specific run-to-failure data have to adapt to data with different input features, data distributions and limited fault information, i.e. different domains. In machine learning, this situation is often referred to as domain adaptation. In general, domain adaptation methods attempt to solve the learning problem when the main learning goal (learning task) is the same, but the domains have different feature spaces or marginal probabilities Pan2010. Several algorithms have been proposed for different flavours of the domain adaptation problem, which include reducing the domain discrepancy between the source and target via instance re-weighting jiang2007instance, subspace alignment fernando2013unsupervised, and adversarial deep learning tzeng2017adversarial; ganin2016domain. Many of these approaches work well for non-sequential data but are less suitable for multivariate time-series data, as they do not usually capture the temporal dependencies present in the data. However, this type of data is prevalent in the maintenance context (e.g. in condition monitoring and equipment output data). Therefore, the general domain adaptation methods are hardly applicable to RUL prognostics.

In this work, we propose to use LSTMs hochreiter1997long to address the problem of learning temporal dependencies from time-series sensor data that can be transferred across related RUL prediction tasks with different distributions in their features. We learn from a source domain with sufficient run-to-failure annotated data and a target domain containing only sensor data. We perform adversarial learning similar to ganin2016domain and learn a common domain-invariant feature representation that can be used with the classical backpropagation (through time) algorithm werbos1990backpropagation. To the best of our knowledge, we are the first to focus on domain adaptation for the RUL regression task under varying operating conditions and fault modes. Furthermore, we use the C-MAPSS NASA Turbofan degradation datasets Saxena2008 to validate our results. We chose these datasets as they contain four run-to-failure datasets under different failure modes and operating conditions. Typically, models built, evaluated and deployed for one particular dataset may perform poorly on unseen datasets with different input feature distributions and fault modes. We show the effectiveness of the proposed method against other adapted and non-adapted models in predicting the RUL of aircraft engines. In practice, capital assets go through re-engineering and redesign during their lifetime to prevent obsolescence problems. In such scenarios, the data from a previous design could be used to adapt learned models to a new design with distinct input data.

The main contributions of this work include a new model that can handle feature distribution shift across domains under different asset operating conditions and fault modes. Unlike classic domain adaptation methods, we incorporate heterogeneous time-series data coming from multiple sensors in an RUL regression task. Furthermore, the proposed method can be easily updated as new data becomes available. We show in our experiments that the method improves prognostics predictions on unlabelled target data when compared to non-adapted methods.

The rest of the paper is structured as follows. In the next section, we briefly discuss the state-of-the-art in prognostics prediction and domain adaptation methods. In the subsequent section, we present our model, detailing the learning algorithm and how the temporal dependencies of the data are used to create domain-invariant features. In Section 4, we present the learning procedure and detail the choices of model hyperparameters. In Section 5, we compare and contrast the performance of the proposed methods using our datasets and provide an analysis of the results.

2 Related Work

2.1 Machine Learning Methods for RUL Prediction

In the prognostics literature, several artificial intelligence methods have been proposed to predict the RUL of engineering assets. In particular, authors have proposed several methods that attempt to extract the relationship between acquired sequential data and the RUL, such as linear regression He2008, Support Vector Regression (SVR) Benkedjouh2013, fuzzy-logic systems zio2010data and neural networks tian2012artificial.

Neural networks have drawn much attention given their ability to approximate complex functions directly from raw data. For example, huang2007residual proposed a Feed Forward Neural Network (FFNN) architecture for a PHM prediction problem, yielding superior results in comparison to other reliability-based approaches. Moreover, in many PHM applications, sequential time-series data are present. Neural network architectures such as Recurrent Neural Networks (RNN) are a natural fit for such problems, given that their recurrent internal structure can handle sequential patterns in the input data. However, as demonstrated by bengio1994learning, RNNs have difficulty learning long-term dependencies because gradients vanish as they are propagated back through many time steps. To address these issues, Long Short-Term Memory (LSTM) hochreiter1997long and Gated Recurrent Unit (GRU) cho2014learning networks were introduced. Such networks possess internal gates that control how information flows in the network during the learning procedure. These gates enable the network to preserve its memory state over time and counter the vanishing gradient problem while retaining information for a longer period.

In PHM, yuan2016 recently showed that LSTMs could outperform RNNs, GRUs and Adaboost-LSTM in an RUL prediction task. Zheng2017 showed that a sequence of LSTM layers followed by FFNNs could outperform other methods, including CNNs, on three distinct degradation datasets. wu2018remaining presented similar results by extracting features based on a dynamic difference procedure and later training an LSTM for RUL predictions. Results showed that the LSTM also outperforms simpler RNN and GRU architectures under similar machinery conditions. More recently, ListouEllefsen2019 showed that Restricted Boltzmann Machines could be used to extract useful weight information by pretraining on degradation data in an unsupervised manner. In this two-stage method, weights extracted in the first step are then used in a further step to fine-tune a supervised LSTM and FFNN model. A genetic algorithm (GA) is used to select the best-performing hyperparameters. The methodology holds the state-of-the-art prediction results for the C-MAPSS datasets, presenting it as an effective method for temporal degradation data prediction.

CNNs are notable for their ability to extract spatial information from 2D and 3D high-dimensional data, yielding the best-known results in several related tasks such as image segmentation, classification and captioning. CNNs can also handle 1D sequential data and extract high-level features by combining convolution and max-pooling operations while sliding a local receptive field over the input features. Several CNN architectures have been proposed for remaining useful lifetime prognostics. One architecture is a 2D deep CNN that predicts the RUL of a system based on normalised variate time series from sensor signals; it was shown to be effective in comparison to Multi-layer Perceptron (MLP), SVR and Relevance Vector Regression (RVR). Another applies 1D convolutions in sequence without pooling operations. The results show that this architecture can extract deep features from wear data by stacking only convolution operations, with competitive results on the C-MAPSS dataset and without the high training times incurred by recurrent models.

2.2 Domain Adaptation Methods for PHM

Most previous studies have focused on predicting the RUL when enough run-to-failure data is available, assuming that the training and future data come from the same distribution and feature space Li2018; ListouEllefsen2019; Zheng2017; babu2016deep; yuan2016. However, in real-life PHM scenarios, RUL values may be absent and the training and testing data may come from different marginal distributions. Examples include collecting data from different devices in varied operating conditions. Moreover, the features can also differ between training and future data, and run-to-failure data can be expensive to obtain. Unsupervised domain adaptation addresses those issues by building algorithms that can be applied when there is domain shift in the distribution and feature spaces Pan2010. Early methods for unsupervised adaptation attempted to re-weight source example losses to reflect the ones in the target distribution jiang2007instance; huang2007correcting. Re-weighting based methods often assume a restricted form of domain shift and selection bias, which restricts their applications. Subspace alignment methods fernando2013unsupervised attempt to find a linear map that minimises the Frobenius norm between a number of top eigenvectors. However, such methods fail to align the distributions among the two data sources. Maximum Mean Discrepancy (MMD) based methods (e.g., Transfer Component Analysis (TCA) pan2011domain) can be interpreted as moment matching methods and can express arbitrary statistics of the data using kernel tricks. Similarly, CORrelation ALignment (CORAL) sun2016return attempts to align the second-order statistics between source and target domains.

Recently, domain-adapted neural networks have been proposed for unsupervised domain adaptation. In general, these methods attempt to bound the target error by the source error plus a discrepancy metric between the source and the target domains Ben-David2010. For example, long2015learning; tzeng2014deep propose to incorporate the MMD metric to reduce the discrepancy between domains in classification problems. Similarly, sun2016deep propose to incorporate a CORAL loss function for the same purpose. Another approach, based on the theory by Ben-David2010, is to use a classification loss (Proxy $\mathcal{A}$-distance) to directly confuse between domains Ben-David2010; ganin2016domain; ajakan2014domain. Adversarial methods goodfellow2014generative have also been proposed for the adaptation task. For example, tzeng2017adversarial proposes to pre-train a classifier on the source domain task and use its weights in an adversarial learning task. In their implementation, a target network attempts to confuse a discriminator by generating a representation similar to the source domain. This representation is then used to do inference using the source weights.

In regression, cortes2014domain showed that the discrepancy between domains is a distance for the squared loss when the hypothesis set is the reproducing kernel Hilbert space induced by a universal kernel. lopez2012semi proposed a method to factorise a multivariate density into a product of bivariate copula functions to identify independent changes between domains (i.e., covariate shift). Therefore, changes in each of the input features can be detected and corrected to adapt a density model across different learning domains. More recently, nikzad2018domain proposed a domain-invariant Partial Least Squares regression method that uses a domain regulariser to align source and target distributions in a latent space. Despite the attention in the recent literature, few works have attempted to perform domain adaptation when the input data are time series. Similar to our work, Purushotham2016 proposed a method for time-series domain adaptation using Variational Recurrent Autoencoders and showed promising results in a classification task using healthcare data.

In PHM, Lu2017 proposed a method for domain adaptation where the Maximum Mean Discrepancy (MMD) metric is used to find a common latent space in a classification task. zhang2017new proposed to use hand-crafted and raw input features to construct a CNN with wide first-layer kernels capable of performing domain adaptation in the presence of noisy data. Another CNN approach was proposed by li2019multi, where an MMD parameterised by multiple kernels is used to address the domain shift problem. xie2016 extracted features from the time and frequency domains using Transfer Component Analysis (TCA) pan2011domain for gearbox fault diagnosis. The results showed that the proposed method could find cross-domain features using data under various operating conditions and yields better results in comparison to other dimensionality reduction methods. Recently, li2018cross proposed a deep generative model to generate fault target data using a labelled source domain and an unlabelled target domain. The method used an MMD metric and a classification loss in the generation phase to induce a common shared space between the domains. Results showed that the method could be used to solve the original target classification problem in the presence of unlabelled data.

Our work builds upon previous successful works using LSTMs to extract temporal features from time-series sensor data for RUL prediction yuan2016; Zheng2017; wu2018remaining; ListouEllefsen2019. To handle the distribution shift and different features among tasks, we perform unsupervised domain adaptation from a labelled source containing observed RUL values to unlabelled target data. This scenario is typical in PHM, as one is usually interested in predicting the RUL of an asset before failures are observed. We propose a method that can use the labelled failure data from the source domain to predict the RUL of the target domain under different failure modes and operating conditions. Similar to ganin2016domain, we use a gradient reversal layer to perform adversarial learning during training and induce a domain-invariant representation.

Unlike previous domain adaptation works in PHM that focused on classification tasks, our method focuses on a regression task, where the goal is to determine the number of remaining cycles for aircraft engines. Similar to ganin2016domain, our method utilises a single neural network capable of learning the source regression task and performing adversarial training. Typically, adversarial learning is achieved by training two or more networks pitted against each other with contrasting objectives in a two-pass optimisation procedure goodfellow2014generative. Using a single network therefore allows for an easier implementation requiring only one architecture. Moreover, weight updates can be performed using the classic backpropagation through time algorithm. Our method uses a classification loss ajakan2014domain to induce domain confusion, as it has recently been shown that it can replace the classical MMD metric in classification instances.

3 LSTM Deep Adversarial Neural Network

In this section, we present our domain adaptation model to predict the RUL of assets across domains with different fault modes and operating conditions. We first introduce the problem and the notations used in the paper and then further discuss the proposed method and its components.

3.1 Problem Definition

We denote a source domain $D_S = \{(x_S^i, y_S^i)\}_{i=1}^{N_S}$, containing $N_S$ training examples, where $x_S^i$ belongs to a feature space $\mathcal{X}_S$ and denotes multivariate time-series sensor data of length $T_i$ with $q_S$ features, i.e. $x_S^i \in \mathbb{R}^{T_i \times q_S}$. Moreover, $y_S^i \in \mathcal{Y}_S$ denotes the remaining useful lifetime values of length $T_i$, with $y_S^i \in \mathbb{R}_{\geq 0}^{T_i}$. For each $t \in \{1, \dots, T_i\}$, $x_t^i \in \mathbb{R}^{q_S}$ and $y_t^i \in \mathbb{R}_{\geq 0}$ represent the $t$-th measurement of all variables and labels, respectively. Similarly, we assume a target domain $D_T = \{x_T^i\}_{i=1}^{N_T}$, where $x_T^i \in \mathcal{X}_T$ and $x_T^i \in \mathbb{R}^{T_i \times q_T}$, but no labels are available. We assume $D_S$ and $D_T$ are sampled from distinct marginal probability distributions, $P(X_S) \neq P(X_T)$. Our goal is to learn a function $g$ such that we can approximate the corresponding RUL of the target domain examples at testing time directly from degradation data, i.e. $\hat{y}_T^i = g(x_T^i)$. Clearly, for adaptation to be possible, we assume that the true mapping between input-output pairs is somewhat similar across domains. At training time we have access to the source training samples and their real-valued labels, and we assume access to training samples from the target domain (unsupervised domain adaptation). We assign a domain label $d_i \in \{0, 1\}$ to each $i$-th training example to indicate the domain it originates from and to assist domain classification.
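The training setup described above (labelled source samples, unlabelled target samples, one domain label per example) can be sketched as follows. This is only an illustration under assumptions: the helper name `build_training_pool`, the dictionary keys and the convention of 0 for source and 1 for target are hypothetical choices, not taken from the paper.

```python
def build_training_pool(source_inputs, source_rul, target_inputs):
    """Merge source and target samples into one list of records.

    Assumed convention: d = 0 marks a source sample, d = 1 a target sample.
    Target samples carry y = None because their RUL is unobserved.
    """
    if len(source_inputs) != len(source_rul):
        raise ValueError("every source sample needs an RUL label")
    pool = [{"x": x, "y": y, "d": 0} for x, y in zip(source_inputs, source_rul)]
    pool += [{"x": x, "y": None, "d": 1} for x in target_inputs]
    return pool


# Toy data: two labelled source samples and one unlabelled target sample.
pool = build_training_pool(
    source_inputs=[[0.1, 0.2], [0.3, 0.4]],
    source_rul=[120.0, 119.0],
    target_inputs=[[0.9, 0.8]],
)
```

Keeping the RUL as `None` for target samples makes it explicit that only the domain label, never the regression label, is available for them during training.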

3.2 Time Windows Processing

To adapt to different sequence lengths and to allow information from past multivariate temporal sequences to influence the RUL prediction at a point in time, we apply a time window approach for feature extraction. The sequential input is assumed to be $x^i \in \mathbb{R}^{T_i \times q}$, where $T_i$ denotes the length of each sequence. We define a function $\phi$ that divides each sequence of size $T_i$ into sequential time windows of size $T_w$, i.e. $\phi: \mathbb{R}^{T_i \times q} \rightarrow \mathbb{R}^{T_i \times T_w \times q}$. After the transformation, at time $t$ all previous sensor data within the time window are collected to form a high-dimensional input vector used to predict $y_t$. If $t < T_w$, we apply zero-padding on the left side of the window until it has size $T_w$; this ensures that after the transformation each original time series yields $T_i$ training samples. Also, we define $n_S$ and $n_T$ as the updated number of examples after the transformation, that is $n_S = \sum_{i=1}^{N_S} T_i$ and $n_T = \sum_{i=1}^{N_T} T_i$. We keep $T_w$ fixed across source and target data so that the network sees a consistent number of time steps before a prediction, although this parameter could be made flexible between domains.
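The windowing step described above can be sketched in plain Python. This is a minimal illustration, assuming each series is a list of per-cycle feature rows; the function name `make_time_windows` and the toy values are hypothetical.

```python
def make_time_windows(series, window_size):
    """Turn one multivariate series (T rows of q features) into T windows.

    Each window has `window_size` rows and ends at its own time step;
    windows near the start of the record are left-padded with zero rows.
    """
    if not series:
        return []
    num_features = len(series[0])
    windows = []
    for t in range(len(series)):
        start = t - window_size + 1
        # Rows that would fall before the first measurement become zeros.
        padding = [[0.0] * num_features for _ in range(max(0, -start))]
        observed = [list(row) for row in series[max(0, start): t + 1]]
        windows.append(padding + observed)
    return windows


# Toy engine record: T = 4 cycles, q = 2 sensors, window of 3 cycles.
engine = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
windows = make_time_windows(engine, window_size=3)
```

Because every time step produces exactly one window, a series of length 4 yields 4 training samples regardless of the window size, which is the property the left zero-padding is meant to guarantee.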

3.3 Long Short-Term Memory Neural Network

One choice of learning function that accommodates temporal relationships between inputs and outputs is the LSTM, which has been studied in prognostics and RUL prediction tasks when enough training data is available yuan2016; ListouEllefsen2019. Such networks offer recurrent connections capable of modelling the temporal dynamics of sensor data in prognostics scenarios. Moreover, they control how information flows within the LSTM cells by updating a series of gates capable of learning long-term relationships in the input data.

Figure 1: LSTM memory cell LSTMcell.

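In our proposed model, we use LSTM layers to extract temporal features contained in the previous time windows of size $T_w$ before an RUL prediction. In an LSTM, the memory cell (Figure 1) consists of three non-linear gating units that update a cell state $C_t \in \mathbb{R}^{n}$, using a hidden state vector $h_{t-1} \in \mathbb{R}^{n}$ and inputs $x_t \in \mathbb{R}^{q}$, where $n$ is the dimension of the LSTM cells and $q$ the input dimension:

$$f_t = \sigma(W_f x_t + R_f h_{t-1} + b_f) \tag{1}$$

$$i_t = \sigma(W_i x_t + R_i h_{t-1} + b_i) \tag{2}$$

$$o_t = \sigma(W_o x_t + R_o h_{t-1} + b_o) \tag{3}$$

Here $\sigma$ is a sigmoid activation function responsible for squeezing the output to the 0-1 range, $W_k \in \mathbb{R}^{n \times q}$ are the input weight matrices, $R_k \in \mathbb{R}^{n \times n}$ are the recurrent weight matrices, and $b_k \in \mathbb{R}^{n}$ are bias vectors. The subscript $k$ can be the forget gate $f$, the input gate $i$ or the output gate $o$, depending on the activation being calculated.

After computing $f_t$, $i_t$ and $o_t$, the new cell state candidate $\tilde{C}_t$ is computed as follows:

$$\tilde{C}_t = \tanh(W_c x_t + R_c h_{t-1} + b_c) \tag{4}$$

where, similar to the gate operations: $W_c \in \mathbb{R}^{n \times q}$, $R_c \in \mathbb{R}^{n \times n}$, and $b_c \in \mathbb{R}^{n}$.

The previous cell state $C_{t-1}$ is then updated to the new cell state $C_t$:

$$C_t = f_t \otimes C_{t-1} + i_t \otimes \tilde{C}_t \tag{5}$$

where $\otimes$ denotes the element-wise multiplication.

In other words, in the previous equations, the forget gate is responsible for deciding which information will be thrown away from the cell state. Next, the input gate decides which states will be updated from a candidate cell state. The input and forget gates are then used to update a new cell state for the next time step.

Lastly, the output gate decides which information the cell will output, and the new hidden state $h_t$ is computed by applying a $\tanh$ function to the current cell state and multiplying it element-wise by the output gate results:

$$h_t = o_t \otimes \tanh(C_t) \tag{6}$$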
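The gate, cell-state and hidden-state updates described in this section can be sketched as a single time step in plain Python. This is a minimal sketch of a standard LSTM cell, not the authors' implementation; the parameter layout (a dictionary mapping `"f"`, `"i"`, `"o"`, `"c"` to an input matrix, a recurrent matrix and a bias) and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, R, b, x, h_prev):
    """Compute W x + R h_prev + b with list-of-lists matrices."""
    return [
        sum(w * xj for w, xj in zip(W[k], x))
        + sum(r * hj for r, hj in zip(R[k], h_prev))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(params, x, h_prev, c_prev):
    """One LSTM time step; returns the new hidden state and cell state."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]  # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]  # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]  # output gate
    # Candidate cell state.
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]
    # Keep part of the old state and add part of the candidate (element-wise).
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    # Hidden state: output gate times the squashed cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy cell with one unit, one input feature and all-zero parameters:
# every gate then equals 0.5 and the candidate state equals 0.
zero = ([[0.0]], [[0.0]], [0.0])
params = {name: zero for name in ("f", "i", "o", "c")}
h, c = lstm_step(params, x=[1.0], h_prev=[0.0], c_prev=[2.0])
```

With all-zero parameters the forget gate halves the previous cell state, so the toy call above illustrates how the gates alone decide how much memory is carried forward.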


3.4 LSTM Deep Adversarial Neural Network

Our model, referred to as LSTM-DANN and depicted in Figure 2, is trained to predict for each input $x_i$ a real value $y_i$ and its domain label $d_i$. Similar to ganin2016domain, we use a Domain Adversarial Neural Network (DANN) approach and decompose our learning method into three parts. We assume that the inputs are first processed by a combination of LSTM layers capable of extracting temporal relationships in the input space for the rest of the network. Our feature extractor $g_f$ embeds the inputs in a feature space $f$. We denote the vector of parameters in this layer combination as $\theta_f$, i.e. $f = g_f(x; \theta_f)$. This new feature space is first mapped to a real-valued variable via a mapping function $g_y(f; \theta_y)$ composed of fully connected layers with parameters $\theta_y$. Lastly, the same feature vector is mapped to a domain label by a mapping function $g_d(f; \theta_d)$ with parameters $\theta_d$.

Figure 2: The proposed domain adaptation deep network architecture.

During training, we aim at minimising a regression loss $L_y$ using the observed RUL values from the source domain $D_S$. Thus, the parameters of the feature extractor $\theta_f$ and regressor $\theta_y$ are optimised towards the same goal, i.e. minimising a regression loss function for the source domain task. This is performed to ensure that the features are discriminative towards the main learning task and can be used to predict the RUL at each time $t$. We also aim at finding features that are domain invariant, i.e. we want to find a feature space $f$ in which the distributions $P(g_f(X_S; \theta_f))$ and $P(g_f(X_T; \theta_f))$ are similar. To address this contrasting objective, we look at an auxiliary loss $L_d$ over the domain classifier function $g_d$. We want to estimate the dissimilarity between domains by inducing a high adversarial loss in the domain-invariant features when the domain classifier has been trained to discriminate between the two domains.

To enforce such behaviour in the network, we train the model in a two-pass adversarial procedure goodfellow2014generative. In the first pass, we learn features by updating the weights of the feature extractor $\theta_f$ in the direction that minimises the regression loss $L_y$. In the second pass, we update the same weights in the direction that maximises a domain-classification loss $L_d$, while the domain classifier itself is trained to minimise its overall domain classification error.

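In other terms, we define the model loss functions in terms of the learning functions and their parameters, and we aim at minimising a combined loss function expressed as:

$$L(\theta_f, \theta_y, \theta_d) = L_y(\theta_f, \theta_y) - \lambda L_d(\theta_f, \theta_d) \tag{7}$$

and the losses $L_y$ and $L_d$ are expressed as:

$$L_y(\theta_f, \theta_y) = \frac{1}{n_S} \sum_{i=1}^{n_S} \left| y_i - \hat{y}_i \right|^p \tag{8}$$

$$L_d(\theta_f, \theta_d) = -\frac{1}{n_S + n_T} \sum_{i=1}^{n_S + n_T} \left[ d_i \log \hat{d}_i + (1 - d_i) \log (1 - \hat{d}_i) \right] \tag{9}$$

where $\hat{y}_i$ is the RUL prediction for the $i$-th source sample (a window ending at time $t$), i.e. $\hat{y}_i = g_y(g_f(x_i; \theta_f); \theta_y)$, and $\hat{d}_i$ is the domain prediction for samples from the source and target domains, i.e. $\hat{d}_i = g_d(g_f(x_i; \theta_f); \theta_d)$. $L_y$ is a regression loss that takes the form of the Mean Absolute Error (MAE) when $p = 1$ and the Mean Squared Error (MSE) when $p = 2$. $L_d$ is the binary cross-entropy loss between domain labels, and $\lambda$ is a positive hyperparameter that weighs the domain classification loss during training. Both are commonly used loss functions in regression and classification problems.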
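The two loss terms and their combination can be sketched as below. This is an illustrative sketch under the assumption of a DANN-style objective in which the weighted domain loss is subtracted from the regression loss; the function names, the `lam` argument and the clipping constant `eps` are hypothetical.

```python
import math


def regression_loss(y_true, y_pred, p=2):
    """Mean of |y - y_hat|^p: MAE for p = 1, MSE for p = 2."""
    return sum(abs(a - b) ** p for a, b in zip(y_true, y_pred)) / len(y_true)


def domain_loss(d_true, d_prob, eps=1e-12):
    """Binary cross-entropy between domain labels and predicted probabilities."""
    total = 0.0
    for d, prob in zip(d_true, d_prob):
        prob = min(max(prob, eps), 1.0 - eps)  # guard against log(0)
        total += d * math.log(prob) + (1 - d) * math.log(1.0 - prob)
    return -total / len(d_true)


def combined_loss(y_true, y_pred, d_true, d_prob, lam=1.0, p=2):
    """Regression loss minus the weighted domain-classification loss."""
    return regression_loss(y_true, y_pred, p) - lam * domain_loss(d_true, d_prob)


# Toy values: two source RUL predictions, and a domain classifier that is
# maximally confused (probability 0.5) on one source and one target sample.
mae = regression_loss([1.0, 3.0], [2.0, 5.0], p=1)
mse = regression_loss([1.0, 3.0], [2.0, 5.0], p=2)
confusion = domain_loss([0, 1], [0.5, 0.5])
total = combined_loss([1.0, 3.0], [2.0, 5.0], [0, 1], [0.5, 0.5], lam=2.0)
```

A domain classifier that outputs 0.5 for every sample gives a cross-entropy of log 2, which is the value the feature extractor pushes towards when it succeeds in making the two domains indistinguishable.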

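We optimise the function $L$ by searching for a saddle point solution $\hat{\theta}_f, \hat{\theta}_y, \hat{\theta}_d$ of the minimax problem below:

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} L(\theta_f, \theta_y, \hat{\theta}_d) \tag{10}$$

$$\hat{\theta}_d = \arg\max_{\theta_d} L(\hat{\theta}_f, \hat{\theta}_y, \theta_d) \tag{11}$$

To update the learning weights in the network, we use gradient updates ganin2016domain of the form:

$$\theta_f \leftarrow \theta_f - \eta \left( \frac{\partial L_y}{\partial \theta_f} - \lambda \frac{\partial L_d}{\partial \theta_f} \right) \tag{12}$$

$$\theta_y \leftarrow \theta_y - \eta \frac{\partial L_y}{\partial \theta_y} \tag{13}$$

$$\theta_d \leftarrow \theta_d - \eta \lambda \frac{\partial L_d}{\partial \theta_d} \tag{14}$$

We use stochastic estimates of the updates in equations (12) - (14) via Stochastic Gradient Descent (SGD) and its variants, where the learning rate $\eta$ represents the size of the learning steps taken by the SGD algorithm as training progresses. To achieve the desired updates, we use a Gradient Reversal Layer (GRL) ganin2016domain alongside the gradient updates. This layer leaves its input unchanged during the forward pass. During backpropagation, however, it changes the sign of the gradient coming from the subsequent layers and multiplies it by a factor $\lambda$. The GRL makes it possible to learn the weights without substantial changes to current implementations of the backpropagation algorithm in common deep learning libraries.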

3.5 Dropout Regularisation

To train neural network models effectively, one has to account both for their capacity to learn complicated patterns in raw data and for their tendency to overfit the training data. Dropout srivastava2014dropout is a regularisation method that can prevent overfitting in deep neural network architectures. It provides a simple solution to the overfitting problem by randomly dropping network units and their connections during training, which prevents such units from becoming highly specialised to the training data. At testing time, a single fully connected network is used to approximate the averaging over all the thinned networks used during training.

This method significantly reduces overfitting, and it has shown considerable improvements in many prediction tasks srivastava2014dropout. In this work, we apply the dropout method independently to the weights $\theta_f$, $\theta_y$ and $\theta_d$. In the feature extraction layers, we want to prevent the weights from becoming too specialised in one of the domains without adapting to changing input data. In the remaining layers, we aim to prevent overfitting on both tasks (regression and classification). We search for the best-performing dropout fraction for our model during the hyperparameter tuning phase.
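The dropout mechanism can be sketched as follows. Note the assumption: this is the "inverted" variant commonly used in deep learning libraries, which rescales the surviving activations during training instead of rescaling weights at test time; the function name and arguments are hypothetical.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate); at test time the input is
    returned unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)
```

Scaling the kept units by `1 / (1 - rate)` keeps the expected activation the same in training and testing, which is why no correction is needed at inference in this variant.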

4 Design of Experiments

In this section, we describe the experiments using the proposed model to predict the RUL using degradation data coming from different domains. We describe the datasets used in the experiments and the details about the implementation.

4.1 C-MAPPS Datasets

The method is evaluated using the benchmark Commercial Modular Aero-Propulsion System Simulation (C-MAPPS) Saxena2008 datasets containing turbofan engine degradation data. The C-MAPPS datasets are composed of four distinct datasets that contain information coming from 21 sensors as well as 3 operational settings. Each of the four datasets contains a number of degrading engines split into training and testing data. Moreover, the datasets have run-to-failure information from multiple engines collected under various operating conditions and fault modes.

| Data | FD001 | FD002 | FD003 | FD004 |
|---|---|---|---|---|
| Engines: Training | 100 | 260 | 100 | 249 |
| Engines: Testing | 100 | 259 | 100 | 248 |
| Operating Conditions | 1 | 6 | 1 | 6 |
| Fault Modes | 1 | 1 | 2 | 2 |

Table 1: The C-MAPPS datasets. Each dataset contains a number of training engines (Engines: Training) with run-to-failure information and a number of testing engines (Engines: Testing) with information terminating before a failure is observed. Operating Conditions: each dataset has one or six operating conditions, based on altitude (0 - 42000 feet), throttle resolver angle (20 - 100) and Mach (0 - 0.84). Fault Modes: each dataset has one (HPC degradation) or two (HPC degradation and Fan degradation) fault modes.

Engines in the datasets are considered to start with various degrees of initial wear but are considered healthy at the start of each record. As the number of cycles increases the engines begin to deteriorate until they can no longer function. At this point in time the engines are considered unhealthy and cannot perform their intended function. Unlike the training datasets, the testing datasets contain temporal data that terminates some time before a system failure.

The original prediction task is to predict the RUL of the testing units using the training units Saxena2008. We expand on this goal and consider the case when one has enough run-to-failure data under a set of fault modes and operating conditions but wants to apply a learned model to a different dataset, i.e. we validate our results on a different set of operating conditions and fault modes. We motivate this setup by cases found in maintenance prediction scenarios, where run-to-failure data is available for assets under specific running conditions, but the lack of observed failures prevents the use of previously learned models in a domain with different conditions and fault modes.

The details about the four datasets are given in Table 1. We refer to the datasets as FD001, FD002, FD003 and FD004. The number of operating conditions in the datasets varies from one (sea level) in FD001 and FD003 to six in FD002 and FD004, based on different combinations of altitude (0 - 42000 feet), throttle resolver angle (20 - 100) and Mach (0 - 0.84). Also, the number of fault modes varies between one (HPC degradation) in FD001 and FD002, and two (HPC degradation and Fan degradation) in FD003 and FD004. For our experiments, we consider each one of the datasets as source and target domains and perform domain adaptation on the different source-target pairs.

Figure 3: Normalised sensor values 100 time steps before a failure for each C-MAPPS dataset. Panels: (a) Sensor 2, (b) Sensor 7, (c) Sensor 14, (d) Sensor 15. The sensor distributions are more similar within the pairs (FD001, FD003) and (FD002, FD004) due to their shared operating conditions.

4.2 Data Preprocessing

The temporal input data coming from 21 sensor values and 3 operational settings are used across the experiments. We note that for both FD001 and FD003 datasets, 7 sensor values have constant readings. However, as the constant readings are not consistent across the datasets, we keep the sensor values in our experiments to be able to consider their variations in different source and target scenarios.

Since the original distributions and feature values across the datasets are similar, we need to ensure that enough distribution shift exists so that performing adaptation would make sense. To induce a higher discrepancy between domains and aid gradient descent weight updates, we normalise the input data and RUL values by scaling each feature individually such that it is in the (0-1) range using the min-max normalisation method:


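$$\tilde{x}_{t}^{i,j} = \frac{x_{t}^{i,j} - \min(x^{j})}{\max(x^{j}) - \min(x^{j})} \qquad (15)$$

where $x_{t}^{i,j}$ denotes the original $i$-th data point of the $j$-th input feature at time $t$ and $x^{j}$ the vector of all inputs of the $j$-th feature. We perform the normalisation for each dataset individually and perform domain adaptation on the normalised input values.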
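A minimal Python sketch of the min-max scaling of a single feature is given below. The function name `min_max_normalise` and the handling of constant features (mapped to 0) are assumptions made for this example; the text above does not state how constant sensor readings are treated.

```python
def min_max_normalise(values):
    """Scale one feature column to the (0-1) range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalise([2.0, 4.0, 6.0])
constant = min_max_normalise([5.0, 5.0, 5.0])
```

Because the minimum and maximum are computed per dataset, the same raw sensor reading can map to different normalised values in different domains.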

In RUL prediction tasks it is often not straightforward to determine the health status and the remaining useful lifetime of a piece of equipment. In our datasets, RUL targets are only available at the last time step for each engine in the test datasets. As Heimes2008 has shown, it is reasonable to estimate the RUL as a constant value when the engines operate in normal conditions. Similar to other works in the literature ListouEllefsen2019; Lei2018, we propose to use a piece-wise linear degradation model to define the RUL values in the training datasets. That is, after an initial period with constant RUL values, we assume that the RUL targets decrease linearly as the number of observed cycles progresses. The constant RUL value marks the initial period in which the engines are still working in their desired conditions. A constant of 125 cycles is selected in our experiments to allow comparison to other proposed models in the literature ListouEllefsen2019; Lei2018. We point out that the choice of this constant impacts the performance of the RUL prediction methods and that further optimisation can be done to select the best-performing value.
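The piece-wise linear targets can be sketched in a few lines of Python. The function name `piecewise_rul` is hypothetical, and the convention that the RUL equals 0 at the last recorded cycle is an assumption of this example.

```python
def piecewise_rul(n_cycles, max_rul=125):
    """RUL target for each cycle of a run-to-failure record with `n_cycles` cycles."""
    # The target is capped at `max_rul` early in life and then decreases linearly
    # to 0 at the final cycle (assumed convention).
    return [min(max_rul, n_cycles - t) for t in range(1, n_cycles + 1)]


targets = piecewise_rul(200)
```

For an engine that fails after 200 cycles, the first 75 targets equal 125 and the remaining targets decrease by one per cycle.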

Moreover, we note that the same normalised sensor values coming from distinct datasets can present different distributions according to their degradation level. We show in Figure 3 four normalised sensor values of the training examples coming from each of the four datasets 100 time steps before a failure occurs. We observe a lower distribution shift within the dataset pairs (FD001, FD003) and (FD002, FD004). This is the case because these pairs have data simulated under the same operating conditions, which causes their sensor values to have similar overall distributions near failure Saxena2008. However, as can be seen in Figures 3(a), 3(b) and 3(d), there is still some distribution shift between FD001 and FD003 due to the varying fault modes. Similarly, in Figure 3(c) we observe a small distribution shift between FD002 and FD004. In practice, these distribution shifts make models data-specific, i.e. a model trained on one dataset often does not perform well on a different dataset unless the sensor values driving the fault behaviour are similar across source-target pairs.

4.3 Performance Metrics

Similar to other prognostic studies using the same datasets, we measure the performance of the proposed method on the target datasets using two metrics. We use the Root Mean Squared Error (RMSE), as it can be directly related to equations (7) and (8) and provides an estimate of how well the model is performing in the target prediction task.

Figure 4: Performance metrics plot. The Scoring performance metric overpenalises positive errors of the RUL prediction.

Moreover, we evaluate our model using a scoring function shown in equation (16) proposed by Saxena2008:


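$$s = \sum_{i=1}^{n} s_i, \qquad s_i = \begin{cases} e^{-d_i/13} - 1, & d_i < 0 \\ e^{d_i/10} - 1, & d_i \geq 0 \end{cases} \qquad (16)$$

where $n$ is the number of predictions and the constants 13 and 10 follow Saxena2008. That is, $d_i$ is the difference between predicted and observed RUL values. The scoring metric penalises positive errors more than negative errors, as these have a larger impact on RUL prognostics tasks, as can be seen in Figure 4.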
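Both metrics can be sketched in plain Python. The constants 13 and 10 used in `score` are those of the scoring function commonly reported for the C-MAPPS benchmark and should be read as an assumption of this example; the function names are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed RUL values."""
    errors = [p - o for p, o in zip(predicted, observed)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def score(predicted, observed):
    """Asymmetric score: late predictions (d > 0) cost more than early ones."""
    total = 0.0
    for p, o in zip(predicted, observed):
        d = p - o
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0  # assumed constant for early predictions
        else:
            total += math.exp(d / 10.0) - 1.0  # assumed constant for late predictions
    return total


late = score([20.0], [10.0])   # RUL overestimated by 10 cycles
early = score([0.0], [10.0])   # RUL underestimated by 10 cycles
```

With equal absolute errors, `late` is larger than `early`, whereas the RMSE treats both cases identically.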

5 Training and Hyperparameter Selection

5.1 Training Procedure

For training, we select one of the four C-MAPPS datasets as source domain and use our proposed domain adaptation method to learn the remaining useful lifetime on the remaining three target datasets. That is, we use the input features and labelled RUL values from the source data, and only the input feature values from the target datasets, as inputs to the network. The C-MAPPS datasets are normalised individually according to equation (15). We apply the time window transformation to both source and target datasets with the same window length, so that the network sees a consistent number of time steps before making a prediction. The selection of the number of time steps is based on previous literature using the same datasets Lei2018; ListouEllefsen2019. No further feature engineering is performed on the input data, as we aim to extract features automatically using the proposed method. Regularisation is applied to the weights in equation (7). Also, we separate the original training data into training (seen by the algorithm) and cross-validation (used for the stopping criterion) data containing 90% and 10% of the original dataset, respectively.
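The time window transformation can be sketched as follows, assuming each engine record is a list of per-cycle feature vectors. The function name `sliding_windows` and the toy window length of 3 are choices made for this example only.

```python
def sliding_windows(series, window):
    """Return all overlapping windows of `window` consecutive cycles."""
    # A record shorter than the window yields no samples.
    return [series[i:i + window] for i in range(len(series) - window + 1)]


# Toy record with 5 cycles and a single feature per cycle.
record = [[1.0], [2.0], [3.0], [4.0], [5.0]]
windows = sliding_windows(record, window=3)
```

Each window is paired with the RUL target of its last cycle in the source domain, while target-domain windows are used without labels.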

Figure 5: The training procedure of the LSTM-DANN

We split the training dataset into mini-batches (collections of data samples) that are used to calculate the model error and update the model coefficients. During training, we randomly select mini-batches coming from the source and target domains to update the weights in the network on each gradient pass. As can be seen in Table 1, the datasets have different numbers of training samples. To cope with this difference between domains, we oversample the smaller dataset to match the number of mini-batches coming from the larger dataset. After the network has seen all training examples coming from both domains, we consider an epoch finished. Further, we assign each training example a binary domain label indicating whether it comes from the source domain or from the target domain.
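One way to organise an epoch along these lines is sketched below. The function name `paired_batches`, the sampling with replacement used for oversampling, and the encoding of the domain label as 0 (source) and 1 (target) are assumptions of this example.

```python
import random


def paired_batches(source_batches, target_batches, rng):
    """Pair source and target mini-batches, oversampling the smaller domain."""
    n = max(len(source_batches), len(target_batches))

    def oversample(batches):
        # Keep every original batch and draw extra ones at random until there are n.
        extra = [rng.choice(batches) for _ in range(n - len(batches))]
        return list(batches) + extra

    src, tgt = oversample(source_batches), oversample(target_batches)
    rng.shuffle(src)
    rng.shuffle(tgt)
    # Assumed label encoding: 0 for the source domain, 1 for the target domain.
    return [((s, 0), (t, 1)) for s, t in zip(src, tgt)]


pairs = paired_batches(["s1", "s2"], ["t1", "t2", "t3", "t4", "t5"], random.Random(0))
```

The epoch ends once all `pairs` have been used, at which point every batch from both domains has been seen at least once.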

Next, the proposed LSTM-DANN architecture is defined, including the number of LSTM and fully connected hidden layers, the number of cells in each layer, the learning rate and the gradient update algorithm. We train using the Rectified Linear Unit (ReLU) as activation function. The SGD and RMSProp tieleman2012lecture algorithms are used to update the weights in the network. Our implementation breaks the learning into two models sharing the feature extraction layers. One model aims at learning the regression task using the source domain sensor and RUL information. The other model aims at finding domain-invariant features using the adversarial loss function in equation (7). We point out that the second model is the one responsible for the adversarial learning aspect of the method, as it includes the GRL to reverse the gradient signs. While the domain classifier weights are updated to minimise the classification loss between domains, the feature extraction weights are updated to maximise the domain classifier loss at the same time. The complete diagram of the learning procedure can be seen in Figure 5.

We train the models for a maximum of 200 epochs and interrupt training if no improvement is seen for 20 epochs. In our case, the MAE in equation (8) presents the best-performing results. In addition, a varying learning rate is adopted: we start with a fixed learning rate, and after 100 epochs the learning rate is multiplied by a factor of 0.1 to allow for stable convergence. We clip the norm of the gradients to 1 in the SGD algorithm to avoid exploding gradients. Finally, the data coming from the target domain, including the RUL values, are fed to the network to calculate the final RUL estimates and obtain the performance measures.
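The three training controls described above (step learning-rate decay, early stopping and gradient norm clipping) can be sketched independently of any deep learning library. All function names and the base learning rate of 0.01 are assumptions of this example.

```python
import math


def learning_rate(epoch, base_lr=0.01, drop_epoch=100, factor=0.1):
    """Fixed learning rate that is multiplied by `factor` from `drop_epoch` onwards."""
    return base_lr * factor if epoch >= drop_epoch else base_lr


def early_stopping_epoch(val_errors, patience=20, max_epochs=200):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation errors."""
    best, best_epoch = float("inf"), -1
    for epoch, err in enumerate(val_errors[:max_epochs]):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs.
            return epoch, best_epoch
    return min(len(val_errors), max_epochs) - 1, best_epoch


def clip_by_norm(grad, max_norm=1.0):
    """Rescale a gradient vector so that its Euclidean norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


# Toy validation curve: improves for 10 epochs, then stays flat.
curve = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1] + [5] * 190
stop_epoch, best_epoch = early_stopping_epoch(curve)
clipped = clip_by_norm([3.0, 4.0])
```

In this toy run, training stops 20 epochs after the best epoch, long before the learning rate drop at epoch 100 takes effect.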

5.2 Hyperparameter Selection

We perform grid search on the more sensitive hyperparameters, namely the optimiser (opt) and learning rates (lr), and fine-tune the remaining parameters manually. The range considered for each hyperparameter is shown in Table 2. To assess the quality of the proposed algorithm, we need to validate the hyperparameters without using the RUL values coming from the target domain. In our proposed methodology, we evaluate the performance of the adaptation task by observing the cross-validation error and the domain classifier error on the source domain. In general, we observed that performance results are better when a lower source error is achieved while the domain classification stabilises at loss values that lead to an accuracy close to a random guess. We select the hyperparameters that yield the lowest source RMSE. We report the resulting hyperparameter settings in Table 3.
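The selection rule (lowest source RMSE over a grid of optimisers and learning rates) can be sketched as below. The function `select_by_source_rmse`, the dictionary keys and the stand-in `toy_source_rmse` are hypothetical; in practice the evaluation function would train a model and return its source cross-validation RMSE.

```python
from itertools import product


def select_by_source_rmse(grid, evaluate):
    """Return the configuration with the lowest source RMSE and that RMSE."""
    best_config, best_rmse = None, float("inf")
    for opt, lr_reg, lr_dom in product(grid["opt"], grid["lr_reg"], grid["lr_dom"]):
        config = {"opt": opt, "lr_reg": lr_reg, "lr_dom": lr_dom}
        rmse = evaluate(config)
        if rmse < best_rmse:
            best_config, best_rmse = config, rmse
    return best_config, best_rmse


def toy_source_rmse(config):
    # Stand-in for training and validating a model; not a real result.
    penalty = 0.0 if config["opt"] == "SGD" else 1.0
    return abs(config["lr_reg"] - 0.01) + abs(config["lr_dom"] - 0.1) + penalty


grid = {
    "opt": ["SGD", "RMSProp"],
    "lr_reg": [0.001, 0.01, 0.1],
    "lr_dom": [0.001, 0.01, 0.1],
}
best_config, best_rmse = select_by_source_rmse(grid, toy_source_rmse)
```

No target RUL values enter the selection, which keeps the procedure consistent with the unsupervised adaptation setting.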

| Hyperparameter | Range |
|---|---|
| Learning rate (source regression) | {0.001, 0.01, 0.1} |
| Learning rate (domain classification) | {0.001, 0.01, 0.1} |
| Batch size | {256, 512, 1024} |
| Optimiser | {SGD, RMSProp} |
| Number of layers (LSTM) | {1, 2} |
| Number of units (LSTM) | {32, 64, 100, 128} |
| Number of units (fully connected) | {30, 32, 64, 128, 512} |
| Number of layers (source regression) | {1, 2} |
| Number of units (source regression) | {16, 20, 32, 64, 128} |
| Number of layers (domain classification) | {1, 2} |
| Number of units (domain classification) | {16, 20, 32, 64, 128} |
| Regularisation | {0.0, 0.01, 0.1} |
| GRL factor | {0.8, 1.0, 2.0, 3.0} |

Table 2: Hyperparameter values evaluated in the proposed methodology.
| Source | Target | Regularisation | LSTM layers (units) | LSTM dropout | Units (fully connected) | Source regression layers (units) | Source regression dropout | Domain classification layers (units) | Domain classification dropout | GRL factor | Batch size | lr source reg. | lr domain class. | opt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FD001 | FD002 | 0.01 | 1 (128) | 0.5 | 64 | 1 (32) | 0.3 | 1 (32) | 0.3 | 0.8 | 256 | 0.01 | 0.01 | SGD |
| FD001 | FD003 | 0.01 | 1 (128) | 0.5 | 64 | 1 (32) | 0.3 | 1 (32) | 0.3 | 0.8 | 256 | 0.01 | 0.01 | SGD |
| FD001 | FD004 | 0.01 | 1 (128) | 0.7 | 64 | 2 (32, 32) | 0.3 | 1 (32) | 0.3 | 1.0 | 256 | 0.01 | 0.1 | SGD |
| FD002 | FD001 | 0.01 | 1 (64) | 0.1 | 64 | 1 (32) | 0.0 | 2 (16, 16) | 0.1 | 1.0 | 512 | 0.01 | 0.01 | SGD |
| FD002 | FD003 | 0.01 | 1 (64) | 0.1 | 512 | 2 (64, 32) | 0.0 | 2 (64, 32) | 0.1 | 2.0 | 256 | 0.1 | 0.1 | SGD |
| FD002 | FD004 | 0.01 | 2 (32, 32) | 0.1 | 32 | 1 (32) | 0.0 | 1 (16) | 0.1 | 1.0 | 256 | 0.1 | 0.1 | SGD |
| FD003 | FD001 | 0.01 | 2 (64, 32) | 0.3 | 128 | 2 (32, 32) | 0.1 | 2 (32, 32) | 0.1 | 2.0 | 256 | 0.01 | 0.01 | SGD |
| FD003 | FD002 | 0.01 | 2 (64, 32) | 0.3 | 64 | 2 (32, 32) | 0.1 | 2 (32, 32) | 0.1 | 2.0 | 256 | 0.01 | 0.01 | SGD |
| FD003 | FD004 | 0.01 | 2 (64, 32) | 0.3 | 64 | 2 (32, 32) | 0.1 | 2 (32, 32) | 0.1 | 2.0 | 256 | 0.01 | 0.01 | SGD |
| FD004 | FD001 | 0.01 | 1 (100) | 0.5 | 30 | 1 (20) | 0.0 | 1 (20) | 0.1 | 1.0 | 512 | 0.01 | 0.01 | SGD |
| FD004 | FD002 | 0.01 | 1 (100) | 0.5 | 30 | 1 (20) | 0.0 | 1 (20) | 0.1 | 1.0 | 512 | 0.01 | 0.01 | SGD |
| FD004 | FD003 | 0.01 | 1 (100) | 0.5 | 30 | 1 (20) | 0.0 | 1 (20) | 0.1 | 1.0 | 512 | 0.01 | 0.01 | SGD |

Table 3: Hyperparameters for each source-target experiment pair.

We run the experiments presented in this paper on a machine with an Intel Core i5 7th generation processor, 16 GB RAM and a GeForce GTX 1070 Graphics Processing Unit (GPU). We implement the models using the Python 3.6 programming language and the Keras deep learning library with the TensorFlow tensorflow2015-whitepaper backend.

6 Experimental Results

In this section, the prognostic performance of the proposed domain adaptation method for RUL estimation is presented. All experiments consider each one of the C-MAPPS datasets as source domain and the remaining datasets as target domains. In total, we have 12 different experiments, and results are averaged over 10 trials for each experiment to reduce the effect of randomness. For each experiment we report the mean and standard deviation of each model's performance.

We start by comparing the proposed method with baseline LSTM models trained in the source domain and applied on the target domain (SOURCE-ONLY). Also, we compare to models trained in the target domain using the target domain labels (TARGET-ONLY), representing the ideal situation when target RUL values are available for prediction. Furthermore, we assess our domain adaptation method against other popular methods that attempt to align source and target domains before a prediction model is constructed. We compare against Feed Forward Neural Network (FFNN) models trained on the Transfer Component Analysis (TCA) and CORrelation ALignment (CORAL) domain-invariant spaces. We present the methodologies' effectiveness in finding representations that can adapt to the different wear distributions across domains.

Lastly, we show that our TARGET-ONLY models can be effectively used to predict the RUL values for each of the C-MAPPS datasets. We compare our methodology with the current state-of-the-art methods to assess the general effectiveness of our proposed method when both sensor and RUL information are available for training.

6.1 Comparison to Non-adapted Models under Domain Shift

In this section, we compare the proposed model with models trained in the source domain and applied on the target domain (SOURCE-ONLY), serving as a baseline for the domain-adapted models, and with models trained in the target domain using target labels (TARGET-ONLY), serving as an upper bound on performance for the proposed method. For the models where no adaptation is performed (SOURCE-ONLY and TARGET-ONLY) we train a network of the form: ReLU(LSTM(100)) + Dropout(0.5) + ReLU(Dense(30)) + Dropout(0.1) + ReLU(Dense(20)) + Dense(1) for 100 epochs using the Adam kingma2014adam optimiser with a learning rate of 0.001. We use an MSE loss function and time window lengths equal to 30, 20, 30 and 15 for FD001, FD002, FD003 and FD004, respectively. These hyperparameters are chosen because they yield the best performances in our experiments. We present, in Table 4, the performances on the target test datasets for each source-target pair in our experiments. We aim to show the effect of using a model trained only on the source domain for the target domains in comparison with the proposed method. Therefore, we present the change between SOURCE-ONLY and the LSTM-DANN method as a percentage (%). We also present the normalised RUL prediction results of several engines coming from the target cross-validation dataset in Figure 6. In the figure, we present the target RUL values as well as the predictions coming from the LSTM-DANN, SOURCE-ONLY and TARGET-ONLY models. We split the analysis by source domain, as its selection poses distinct difficulties in the adaptation results.
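The percentage change reported next to the LSTM-DANN results can be reproduced with a one-line computation. The function name `percentage_change` is hypothetical; the input values are mean RMSE values taken from Table 4.

```python
def percentage_change(source_only_rmse, adapted_rmse):
    """Relative change of the adapted RMSE with respect to the SOURCE-ONLY RMSE, in %."""
    return 100.0 * (adapted_rmse - source_only_rmse) / source_only_rmse


fd001_to_fd002 = percentage_change(71.70, 48.62)  # source FD001, target FD002
fd004_to_fd002 = percentage_change(20.88, 24.93)  # source FD004, target FD002
```

A negative value means that adaptation lowered the RMSE relative to SOURCE-ONLY, while a positive value, as in the FD004 to FD002 pair, means that it increased.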

| Source | Target | SOURCE-ONLY | LSTM-DANN | TARGET-ONLY |
|---|---|---|---|---|
| FD001 | FD002 | 71.70 ± 3.88 | 48.62 (-32%) ± 6.83 | 17.76 ± 0.43 |
| FD001 | FD003 | 51.20 ± 3.39 | 45.87 (-10%) ± 3.58 | 12.49 ± 0.29 |
| FD001 | FD004 | 73.88 ± 4.50 | 43.82 (-41%) ± 4.15 | 21.30 ± 1.06 |
| FD002 | FD001 | 164.84 ± 23.00 | 28.10 (-83%) ± 5.03 | 13.64 ± 0.80 |
| FD002 | FD003 | 154.04 ± 21.79 | 37.46 (-76%) ± 1.54 | 12.49 ± 0.29 |
| FD002 | FD004 | 37.76 ± 2.17 | 31.85 (-16%) ± 1.65 | 21.30 ± 1.06 |
| FD003 | FD001 | 49.94 ± 7.65 | 31.74 (-36%) ± 9.37 | 13.64 ± 0.80 |
| FD003 | FD002 | 70.32 ± 4.02 | 44.62 (-36%) ± 1.21 | 17.76 ± 0.43 |
| FD003 | FD004 | 69.28 ± 4.51 | 47.94 (-31%) ± 5.78 | 21.30 ± 1.06 |
| FD004 | FD001 | 188.00 ± 25.95 | 31.54 (-83%) ± 2.42 | 13.64 ± 0.80 |
| FD004 | FD002 | 20.88 ± 1.66 | 24.93 (+19%) ± 1.82 | 17.76 ± 0.43 |
| FD004 | FD003 | 157.32 ± 20.37 | 27.84 (-82%) ± 2.69 | 12.49 ± 0.29 |

Table 4: RMSE ± standard deviation. Comparison between SOURCE-ONLY, TARGET-ONLY and LSTM-DANN on the test datasets.
Figure 6: RUL predictions of the TARGET-ONLY, SOURCE-ONLY and LSTM-DANN models for one engine coming from the target domain cross-validation datasets. Panels: source FD004 with targets (a) FD001, (b) FD002, (c) FD003; source FD003 with targets (d) FD001, (e) FD002, (f) FD004; source FD002 with targets (g) FD001, (h) FD003, (i) FD004; source FD001 with targets (j) FD002, (k) FD003, (l) FD004.
  1. Source: FD004
    Several RUL prediction results for FD004 acting as source domain are presented in Figures 6(a), 6(b) and 6(c). In the figures, one can notice that the RUL predictions of the SOURCE-ONLY model for target domains FD001 (Fig. 6(a)) and FD003 (Fig. 6(c)) have a large error in comparison with the observed values. On the other hand, despite some errors between the predictions and observed values, our proposed model shows higher accuracy on the target RUL values, leading to a smaller error than that of the SOURCE-ONLY model.

    We also point out that for target domain FD002, the SOURCE-ONLY model already provides a good fit for the observed RUL values (Figure 6(b)). In this case, domain adaptation results in predictions similar to SOURCE-ONLY. Usually, this result is not known beforehand, but it can be expected since the marginal distribution shift across domains FD004 and FD002 is low. We also point out that FD004 is the dataset that contains 6 operating conditions and 2 fault modes. Therefore, the adaptation method can use the source domain to find correspondences between the operating conditions and fault modes in each source-target pair. That is of practical value, since one could use previous data coming from different conditions and fault modes to estimate better RUL values on unobserved run-to-failure data.

  2. Source: FD003
    In the examples provided, the adaptations from FD003 to FD002 (Fig. 6(e)) and FD004 (Fig. 6(f)) show worse results than the one to FD001 (Fig. 6(d)). This is the case as FD003 is much more similar to FD001, differing only in the number of fault modes. As can be seen in Table 4, the LSTM-DANN model yields a lower RMSE value than the SOURCE-ONLY model, showing that the weights learned by the network can be effectively used even when the domains are already similar.

    For target domains FD002 and FD004, the differences between domains are more prominent. FD003 has one operating condition, while FD002 and FD004 have 6 different operating conditions and different sensor values near a failure. Despite the difficulties in transferring between such distinct domains, the domain adaptation method improves on the SOURCE-ONLY model, yielding a lower prediction error on the test dataset. However, some higher errors are present, as the estimated values are noisy and do not fit the linear degradation model over its full extent.

  3. Source: FD002
    In Figures 6(g), 6(h) and 6(i) we present engines from the FD001, FD003 and FD004 cross-validation target datasets. Similar to the inverted experiment, the similarities between FD004 and FD002 make the SOURCE-ONLY model fit the target data with high accuracy. In this experiment pair, our model is also able to fit the target data with an error level similar to the one without adaptation. On the other hand, we can improve the predictions on the FD001 target dataset in comparison to SOURCE-ONLY. In this case, FD001 and FD002 share the same fault mode (HPC degradation), but FD002 has more operating conditions than FD001, which makes the algorithm able to learn the degradation function better than a model without adaptation. Our method can also produce more accurate results than SOURCE-ONLY when the target domain is FD003. Although both operating conditions and fault modes are different across domains, the predictions produce lower errors and a better fit to the linear degradation model. The result of this experiment shows that it is possible to transfer from a domain that has more operating conditions than the target domains under the same or different fault modes. This is important, as in practice one is interested in reusing previously gathered or simulated data to predict the RUL on unseen data.

  4. Source: FD001
    In our experiments, FD001 presents the highest errors when functioning as a source domain. When the target domain is FD002 or FD004 (Figs. 6(j) and 6(l)), the best solution found is one that yields a flattened curve over the entire cycle. It can be observed that the SOURCE-ONLY model has a similar behaviour when being used to predict the target dataset. However, the proposed model is capable of adjusting the learned values towards a mean RUL value. Also, we point out that FD001 is the dataset containing only one operating condition and one fault mode. That is, we are trying to transfer to domains whose fault modes and conditions are not present in the source domain. Similar to other results, this proves to be a much harder problem for the proposed methodology. When the target domain is FD003 (Fig. 6(k)), we are attempting to learn in a domain with one fault mode and predict in a domain with two fault modes and similar operating conditions. For this case, the model results in RUL prediction curves that fit the trend of the observed RUL values with a lower RMSE than the SOURCE-ONLY model.

We notice, in Table 4, that the proposed methodology improves performance over the SOURCE-ONLY models in all but one of our experiments. The results of the adapted models change depending on the information contained in the dataset acting as source domain. We achieve better results when the FD004 dataset acts as source domain, as it contains all 6 operating conditions and 2 fault modes. Also, when the source and target distributions are similar, a model trained in the source domain with no adaptation can already achieve strong performance in the learning task.

In particular, using LSTM-DANN does not add much value when the distributions between domains are very similar and the source domain contains more operating conditions and fault modes than the target domain (e.g. FD004 to FD002). For this example, even with access to the ground truth values in the target domain, the performance (RMSE) differs little among the SOURCE-ONLY, LSTM-DANN and TARGET-ONLY models, unlike in other experiments. Moreover, among our 10 runs there were cases in which the LSTM-DANN found better RMSE values than the ones found by SOURCE-ONLY, indicating that there may be better hyperparameters than the ones proposed in this paper that would result in lower RMSE values.

For the other cases, the results show that when the source data "contain" the target data, the adaptation achieves lower RMSE results and a better fit. It is expected that the observed features in the source domain can be used to improve predictions in the target domain; thus, having a source domain with similar degradation data helps to learn. On the other hand, learning from a domain with fewer operating conditions and fault modes (e.g. FD001) to one with more conditions and fault modes proves to be a harder task. The model can adjust for the mean RUL adaptation value but fails to learn the linear degradation model that emerges from the target domain. However, the results still prove to be better estimations of the RUL in a target domain when compared to SOURCE-ONLY models.

As a rule of thumb, the results presented show that the methodology could be useful whenever one has similar degradation data with a degree of distribution shift across domains. The degree to which the distributions vary can determine how accurate the adaptation can be, but further investigation has to be done to provide a reliable estimate of "when" to transfer. The model achieves the best results when enough source data under different conditions has been observed. In practice, one could use simulated data acting as a source domain with various fault scenarios. In many applications, real-world data is dissimilar to simulation data due to noise and unexpected sensor behaviour. To improve the models, one could use the LSTM-DANN method to adapt from simulated data to real-world data and achieve better results in RUL predictions.

6.2 Comparison to Domain Adaptation Approaches

To assess the quality of the model in transferring the degradation patterns from a source to a target domain, we test two well-known methodologies for unsupervised domain adaptation: Transfer Component Analysis (TCA) and CORrelation ALignment (CORAL).

  1. TCA
    Transfer Component Analysis pan2011domain is a well-known unsupervised domain adaptation method that focuses on finding a common shared feature representation between source and target domains. It transfers components across domains in a Reproducing Kernel Hilbert Space (RKHS) using Maximum Mean Discrepancy (MMD) and different kernels to construct a feature space that minimises the difference between the domains. We use the TCA feature representation to train a shallow neural network with one fully connected layer and 32 units (TCA-NN) and a deep feed forward neural network (TCA-DNN) containing the same number of layers and units as our final models. In our tests, we apply the radial basis function (RBF) kernel, extracting 20 transfer components.

  2. CORAL
    CORrelation ALignment (CORAL) sun2016return is a method that minimises domain shift by aligning the second-order statistics of the source and target distributions, without requiring target labels. After the alignment is found, the new feature space can be used to train a model on the transformed source domain. Similar to the TCA method, we use it in combination with shallow (CORAL-NN) and deep (CORAL-DNN) neural network architectures.

For all compared methods in this study, the input data are the same, the mean squared error (MSE) is used as the loss function and the Adam kingma2014adam optimisation algorithm is applied with a learning rate of 0.001.

TCA and CORAL provide out-of-the-box methods that can be easily applied in a domain adaptation setup where no observed target values are available. Neither method is entirely suitable for temporal data; for this reason, we perform domain adaptation with TCA and CORAL using the features at time $t$ and the RUL at time $t$. After the new features are computed, we train an FFNN model using the same number of units as in the SOURCE-ONLY models. The results are summarised for the target cross-validation datasets (not seen during training) in Table 5. It can be seen that, on average, our proposed method yields a lower RMSE than the methodologies tested in all but one experiment pair. This supports the claim that the proposed deep adaptation architecture is well suited for the studied prognostics prediction problem.

The LSTM structure in combination with the adversarial classification loss is capable of extracting useful temporal features from multivariate time series to perform adaptation between source and target domain pairs. In other words, the methodology performs better than the tested out-of-the-box adaptation methods, which are not tailored for time-series sensor data. The results provide a foundation for using the proposed domain adaptation method in cases when one has limited observed RUL data in one domain, but is concerned with predicting RUL targets in a similar domain with distinct operating conditions and fault modes.

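| Source | Target | TCA-NN | TCA-DNN | CORAL-NN | CORAL-DNN | LSTM-DANN |
|---|---|---|---|---|---|---|
| FD001 | FD002 | 94.1 ± 1.0 | 90.0 ± 2.9 | 99.2 ± 3.6 | 77.5 ± 4.6 | 46.4 ± 3.6 |
| FD001 | FD003 | 120.0 ± 1.0 | 116.1 ± 1.0 | 60.0 ± 0.7 | 69.6 ± 5.2 | 37.3 ± 3.4 |
| FD001 | FD004 | 120.1 ± 1.0 | 113.8 ± 6.9 | 107.7 ± 2.8 | 84.6 ± 7.0 | 43.5 ± 5.3 |
| FD002 | FD001 | 94.7 ± 1.1 | 85.6 ± 5.5 | 77.9 ± 19 | 80.9 ± 9.4 | 31.2 ± 5.4 |
| FD002 | FD003 | 107.4 ± 3.7 | 111.5 ± 7.2 | 60.9 ± 15.1 | 79.8 ± 10.1 | 32.2 ± 3.1 |
| FD002 | FD004 | 93.5 ± 2.8 | 94.4 ± 6.7 | 37.5 ± 0.5 | 43.6 ± 3.6 | 27.7 ± 1.5 |
| FD003 | FD001 | 98.7 ± 0.4 | 90.5 ± 4.6 | 26.5 ± 0.5 | 26.5 ± 1.9 | 30.6 ± 6.2 |
| FD003 | FD002 | 90.5 ± 0.3 | 80.8 ± 4.3 | 113.2 ± 4.5 | 75.6 ± 9.5 | 43.1 ± 1.4 |
| FD003 | FD004 | 78.9 ± 5.3 | 102.6 ± 3.2 | 113.9 ± 5.5 | 77.2 ± 9.1 | 49.7 ± 9.1 |
| FD004 | FD001 | 98.5 ± 0.4 | 85.6 ± 5.0 | 119.1 ± 16.7 | 94.0 ± 8.8 | 25.4 ± 4.2 |
| FD004 | FD002 | 75.3 ± 1.7 | 80.8 ± 5.8 | 37.3 ± 0.6 | 30.9 ± 1.4 | 26.9 ± 3.3 |
| FD004 | FD003 | 77.2 ± 6.0 | 102.9 ± 2.7 | 68.1 ± 11.1 | 68.6 ± 11.2 | 23.6 ± 5.0 |

Table 5: RMSE ± standard deviation. Domain adaptation models on target cross-validation data.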

6.3 Comparison to Non-adapted RUL Prediction Approaches Using Target Domain Labels

We provide the results of the TARGET-ONLY models in Table 4 as the best-case performance, where the RUL target values are used to train the models. That is, these are the best results found if we could use the RUL values in the target domain for training. We also present a comparison of the TARGET-ONLY models to the state-of-the-art results on the C-MAPPS datasets for the RMSE and Scoring performance metrics in Tables 6 and 7. We compare our method to those in the literature to show that our model achieves results similar to other proposed methods when presented with a complete labelled dataset.

| Dataset | TARGET-ONLY | GA + LSTM ListouEllefsen2019 | CNN + FFNN Li2018 | MODBNE zhang2017multiobjective | LSTM + FFNN Zheng2017 |
|---|---|---|---|---|---|
| FD001 | 13.64 | 12.56 | 12.61 | 15.04 | 16.14 |
| FD002 | 17.76 | 22.73 | 22.36 | 25.05 | 24.49 |
| FD003 | 12.49 | 12.10 | 12.64 | 12.51 | 16.18 |
| FD004 | 21.30 | 22.66 | 23.31 | 28.66 | 28.17 |

Table 6: RMSE comparison between TARGET-ONLY and other models in the literature on the C-MAPPS datasets.

We compare our TARGET-ONLY model with the LSTM methods proposed by ListouEllefsen2019 (GA + LSTM) and Zheng2017 (LSTM + FFNN). We also compare to the Convolutional Neural Network (CNN + FFNN) methodology proposed by Li2018 and the Multiobjective Deep Belief Networks Ensemble (MODBNE) proposed by zhang2017multiobjective. In Table 6 we report the results of our experiments alongside the results reported in the original papers of the compared approaches. Our model provides the best known RMSE results for datasets FD002 and FD004, showing that the proposed method performs strongly on the datasets with more operating conditions. For datasets FD001 and FD003 we obtain results similar to the best-performing models in the literature for both the RMSE and Scoring metrics. We note that these methods cannot be used directly on datasets without labels, as they are trained and tested in the same domain.

| Dataset | TARGET-ONLY | GA + LSTM (ListouEllefsen2019) | CNN + FFNN (Li2018) | MODBNE (zhang2017multiobjective) | LSTM + FFNN (Zheng2017) |
|---|---|---|---|---|---|
| FD001 | 300 | 231 | 274 | 334 | 338 |
| FD002 | 1,638 | 3,366 | 10,412 | 5,585 | 4,450 |
| FD003 | 267 | 251 | 284 | 422 | 852 |
| FD004 | 2,904 | 2,840 | 12,466 | 6,558 | 5,550 |

Table 7: Scoring comparison between TARGET-ONLY and other models in the literature on the C-MAPSS datasets.
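The two metrics compared in Tables 6 and 7 can be sketched as follows. This is a minimal illustration, assuming that the Scoring metric is the asymmetric exponential score commonly used with the C-MAPSS data (time constants 13 for early and 10 for late predictions); the function names `rmse` and `rul_score` and the toy values are hypothetical and not taken from the experiments.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted RUL values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rul_score(y_true, y_pred, a_early=13.0, a_late=10.0):
    """Asymmetric score, summed over all units (lower is better).

    d = predicted - actual. Late predictions (d > 0) are penalised more
    heavily than early ones (d < 0). The constants 13 and 10 are an
    assumption based on the usual C-MAPSS convention.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        d = p - t
        if d < 0:
            total += math.exp(-d / a_early) - 1.0
        else:
            total += math.exp(d / a_late) - 1.0
    return total


# Toy values (made up): every prediction is 10 cycles early or 10 cycles late.
actual = [100.0, 50.0, 20.0]
early = [90.0, 40.0, 10.0]
late = [110.0, 60.0, 30.0]
```

Under these assumptions, `early` and `late` have the same RMSE, but `late` receives the higher (worse) score, which is why the two metrics can rank models differently.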

7 Discussion

7.1 Relationship to Standardisation

A common and straightforward way to align source and target domains, reducing the difference between the means and variances of their input distributions, is standardisation: local mean centering (subtracting the mean) followed by division by the standard deviation of each input feature. As a result, each feature in the data has zero mean (moment matching) and unit variance.
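As a minimal sketch of this transformation, the snippet below standardises each feature of a domain using that domain's own statistics. The helper name `standardise`, the use of the population standard deviation, and the handling of constant features are assumptions made for illustration; the sensor values are made up.

```python
import statistics


def standardise(columns):
    """Standardise each feature column to zero mean and unit variance.

    columns: list of feature columns, each a list of floats from ONE domain.
    Statistics are computed locally, i.e. per domain and per feature.
    """
    result = []
    for col in columns:
        mu = statistics.fmean(col)
        sigma = statistics.pstdev(col)  # population standard deviation
        if sigma == 0.0:
            # Assumption: a constant feature carries no information,
            # so it is mapped to zeros instead of dividing by zero.
            result.append([0.0 for _ in col])
        else:
            result.append([(x - mu) / sigma for x in col])
    return result


# One sensor observed in two domains with different offset and scale.
source_domain = [[10.0, 12.0, 14.0, 16.0]]
target_domain = [[100.0, 120.0, 140.0, 160.0]]

source_std = standardise(source_domain)
target_std = standardise(target_domain)
```

In this toy case the two standardised columns coincide, because the domains differ only by an offset and a scale. Shifts in higher moments or in temporal structure are not removed by this step.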

To test whether such a transformation already suffices as an alignment strategy, we standardise the data before feeding it to the SOURCE-ONLY and TARGET-ONLY architectures. We select FD004 as the source domain, as the results have shown that adaptation is possible to all remaining C-MAPSS datasets. We present the RMSE values for each methodology evaluated on the test target datasets in Table 8. We compare the original LSTM-DANN, trained on data with min-max normalisation, to a model trained on the standardised data (LSTM-DANN-STD) and to the reference models SOURCE-ONLY and TARGET-ONLY.

| Source | Target | | | | |
|---|---|---|---|---|---|
| FD004 | FD001 | 53.31 ± 5.02 | 31.54 ± 2.42 | 32.62 ± 2.07 | 14.51 ± 1.55 |
| FD004 | FD002 | 23.22 ± 1.01 | 24.93 ± 1.82 | 21.78 ± 1.71 | 18.44 ± 0.42 |
| FD004 | FD003 | 62.76 ± 7.52 | 27.84 ± 2.69 | 40.20 ± 7.03 | 16.03 ± 0.33 |

Table 8: Test performance (RMSE ± standard deviation) of SOURCE-ONLY, TARGET-ONLY and LSTM-DANN-STD. Models are trained on standardised training data with zero mean and unit variance.

The results in Table 8 show that, on average, the RMSE of the SOURCE-ONLY models improves considerably when the data is normalised to zero mean and unit variance. However, our proposed methodology (LSTM-DANN-STD) still outperforms the baseline models and provides a better fit to the target data (Figure 7). We point out that training on standardised data causes the LSTM-DANN-STD models to saturate and start overfitting much faster than in previous experiments. This effect negatively impacted the adaptation performance on FD003, as training progressed for more epochs than necessary. To address this effect, simple changes to the models' hyperparameters could be made. For example, one could consider lowering the value of and reducing learning rates.

Figure 7: RUL predictions of the TARGET-ONLY, SOURCE-ONLY and LSTM-DANN-STD models on the standardised target cross-validation datasets. Panels: (a) target FD001, (b) target FD002, (c) target FD003.

8 Conclusion

In this paper, a deep learning method for domain adaptation in prognostics is proposed, based on a Long Short-Term Memory neural network and a Domain Adversarial Neural Network (LSTM-DANN). We use time windows to incorporate long-term sequences in the feature extraction layers. Experiments are carried out on the popular C-MAPSS dataset to show the effectiveness of the proposed method. The goal of the task is to accurately estimate the remaining useful lifetime of aircraft engine units while transferring from a source domain with observed RUL values to a target domain with only input features.
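The time-window idea can be illustrated with a short sketch. It assumes a stride of one cycle, no padding, and that each window is labelled with the RUL at its last cycle; these choices, the function name `sliding_windows` and the toy values are illustrative assumptions, not a description of the exact procedure used in the experiments.

```python
def sliding_windows(series, window, rul=None):
    """Cut one run-to-failure series into overlapping fixed-length windows.

    series: list of per-cycle sensor vectors, ordered in time.
    window: number of consecutive cycles per window.
    rul:    optional list of RUL values per cycle (source domain only).

    Returns (windows, labels). labels is None when rul is not given, as in
    a target domain that contains only sensor information.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if rul is not None and len(rul) != len(series):
        raise ValueError("series and rul must have the same length")

    ends = range(window, len(series) + 1)
    windows = [series[end - window:end] for end in ends]
    if rul is None:
        return windows, None
    # Assumption: each window is labelled with the RUL at its last cycle.
    labels = [rul[end - 1] for end in ends]
    return windows, labels


# Toy engine with 5 cycles and 2 sensors (values are made up).
engine = [[0.1, 1.0], [0.2, 0.9], [0.3, 0.8], [0.4, 0.7], [0.5, 0.6]]
engine_rul = [4, 3, 2, 1, 0]

source_windows, source_labels = sliding_windows(engine, window=3, rul=engine_rul)
target_windows, target_labels = sliding_windows(engine, window=3)
```

The same function serves both domains: with `rul` it yields labelled source windows, and without it it yields unlabelled target windows.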

We normalise the datasets independently to allow for higher distribution shifts across domains. It is worth noting that locally normalising the data to zero mean and unit variance improves the performance of SOURCE-ONLY methods on the target datasets. Nevertheless, we focus on using the raw features and an adequate training procedure to find a weight representation that can accommodate the original distribution shift between domains. In general, we achieve lower errors between the predicted and actual RUL values than a model with no adaptation features. We notice that RUL prediction transfers more effectively from datasets that have more fault modes or operating conditions than their target counterpart. Conversely, transferring from a dataset with fewer operating conditions and fault modes to one with more is a much harder task. However, even in this harder scenario the model is able to correct the RUL predictions and reduce the RMSE without using the target RUL values. Furthermore, the prognostic results obtained by the proposed method are compared with out-of-the-box domain adaptation models. In our tests, the proposed network has shown superior performance in comparison to other simple domain adaptation methods. We point out that no thorough evaluation of domain adaptation methods was performed, as most domain adaptation methodologies do not focus on sequential temporal data. Instead, we focused on methods that require little domain adaptation knowledge and architecture tweaking for comparison.

Additionally, it should be noted that in common real-world online applications, whole life-span data are usually not available. As future research, we argue that domain adaptation methods should be able to accommodate incomplete data coming from the target domain. As is the case in many real-world scenarios, the sooner an accurate RUL prediction can be made for a piece of equipment, the earlier actions can be planned to prevent equipment downtime. Furthermore, the network could be made to retain hidden states for a longer period of time by allowing states to flow between mini-batches of data. Such a modification would allow the network to "remember" for longer periods of time while training. This modification comes at a cost: padding and the number of training examples within a batch have to be carefully selected to allow states to propagate between batches. In our experiments, the varying lengths of the time sequences and numbers of training examples between domains led to small mini-batches, which made training particularly slow.
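The effect of letting states flow between mini-batches can be illustrated with a toy forward pass. The sketch below uses a scalar tanh recurrent cell rather than an LSTM, with hypothetical fixed weights `w_x` and `w_h`, and it covers only the forward computation, not training. It shows that processing a sequence in chunks reproduces the full-sequence hidden state only when the state is carried from one chunk to the next.

```python
import math


def rnn_run(xs, h0=0.0, w_x=0.5, w_h=0.8):
    """Run a scalar tanh recurrent cell over xs, starting from state h0.

    The weights are hypothetical constants chosen for illustration.
    Returns the final hidden state.
    """
    h = h0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
    return h


def run_in_chunks(xs, chunk, carry_state):
    """Process xs in consecutive chunks of length `chunk`.

    carry_state=True  passes the final state of one chunk to the next.
    carry_state=False resets the state to zero at every chunk boundary.
    """
    h = 0.0
    for start in range(0, len(xs), chunk):
        h0 = h if carry_state else 0.0
        h = rnn_run(xs[start:start + chunk], h0)
    return h


sequence = [1.0, 0.5, -0.3, 0.8, 0.1, -0.6]  # made-up sensor readings

full = rnn_run(sequence)
stateful = run_in_chunks(sequence, chunk=3, carry_state=True)
stateless = run_in_chunks(sequence, chunk=3, carry_state=False)
```

Here `stateful` matches `full`, while `stateless` does not, because the second chunk starts without the information accumulated in the first. In a mini-batch setting this only works if consecutive batches hold consecutive chunks of the same sequences in the same batch positions, which is the padding and batch-composition constraint mentioned above.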

While promising experimental results have been obtained by the proposed method, further architecture optimisation is still necessary, as the hyperparameter search was restricted. Also, deep learning methods generally carry a high computational burden, so using larger datasets could become computationally intractable. In addition, we note that the Scoring function is used only for evaluation purposes; no optimisation steps are taken towards minimising it. Further development could incorporate the function in a learning algorithm.

9 Acknowledgments

This work was supported by the “Netherlands Organisation for Scientific Research” (NWO). Project: NWO Big data - Real Time ICT for Logistics. Number: 628.009.012

10 References