1 Introduction
Companies have been investing in Prognostics and Health Management (PHM) systems to increase the reliability and availability of their engineering assets and to reduce maintenance costs atamuradov2017prognostics. In particular, several works have drawn attention to the task of using data collected from sensors and IoT (Internet of Things) devices to predict maintenance events, such as fault prognostics, detection and diagnostics Jardine2006. In PHM, the Remaining Useful Lifetime (RUL) is the amount of time left before a piece of equipment can no longer perform its intended function. Accurate RUL prognostics enable the interested parties to assess an equipment's health status and to plan future maintenance actions, e.g. the logistics of personnel, spare parts and services Papakostas2010.
For example, in our current industry collaboration, a leading manufacturer of medical imaging systems requires prognostic prediction models that can leverage real-time data collected continuously over many locations. These systems are complex, and modelling their precise degradation mechanisms is not possible. Also, although they perform similar procedures, the systems log different multivariate sensor data due to equipment version updates, sensor malfunctioning and timing (equipment installed at a later stage has less temporal data). Furthermore, sensor values have different distributions due to distinct usage and degradation levels. In such cases, high-dimensional temporal data has to be used directly to determine the health state of the systems, and models have to adapt to incoming changes in the data.
In the PHM literature, physics-based, statistical and machine learning approaches have been proposed to address the RUL prediction problem. Physics-based approaches build mathematical models that describe the degradation processes of the failure mechanisms Cubillo2016. Such models require prior degradation knowledge and provide accurate RUL estimation when failure can be described through its physical properties Lei2018. Statistical methods usually attempt to fit the observations with a probabilistic model that can describe the uncertainty of the degradation process Si2011. Their shortcomings relate to assumptions about health state transitions and data distributions. Machine learning models, on the other hand, focus on learning the degradation patterns directly from complex raw data. In general, machine learning models are non-parametric and can be applied in practice even without prior information about the underlying distributions and degradation knowledge Lei2018.

Several machine learning methods have been studied for prognostic prediction problems, including Support Vector Machines (SVM)
Dong2013, Support Vector Regression (SVR) Benkedjouh2013 and Neural Networks. Neural networks have been receiving much attention given their ability to approximate high-dimensional non-linear functions directly from raw data Babuska2016. Moreover, several architectures are specially built to support the temporal inputs usually encountered in machine prognostics problems, e.g. Recurrent Neural Networks (RNN). Recently, deep learning methods have been proposed for prognostics problems containing large amounts of time-series input data, including Convolutional Neural Networks (CNN) Li2018 and variations of RNNs capable of dealing with long input sequences, such as the Long Short-Term Memory network (LSTM) ListouEllefsen2019; Zheng2017 and the Gated Recurrent Unit (GRU) Wu2018.

In classical machine learning, models need enough annotated historical data to be trained to a significant performance level Li2018; Zheng2017. Since interested parties presumably already apply time-based maintenance to their assets, observing run-to-failure behaviour becomes scarce. To overcome this problem, practitioners and researchers have to find ways to handle censored data hong2009prediction or to generate (e.g. simulate) more data, which leads to imperfect models that do not represent real-world scenarios. Even when enough run-to-failure data are available, algorithms trained on one specific dataset cannot be generalised to a different but related dataset. For example, an algorithm trained to predict a specific failure mode often does not generalise well to other modes under similar machinery conditions Lei2018. Moreover, when the input features change among different equipment versions (e.g. new sensor information becomes available), it is common practice to retrain the models. This retraining leads to delayed prognostic actions until enough data is available for accurate prediction.
To address these issues, predictive models trained with specific run-to-failure data have to adapt to data with different input features, data distributions and limited fault information, i.e. different domains. In machine learning, this situation is often referred to as domain adaptation. In general, domain adaptation methods attempt to solve the learning problem when the main learning goal (the learning task) is the same, but the domains have different feature spaces or marginal probabilities Pan2010. Several algorithms have been proposed for different flavours of the domain adaptation problem, including reducing the discrepancy between source and target via instance reweighting jiang2007instance, subspace alignment fernando2013unsupervised, and adversarial deep learning tzeng2017adversarial; ganin2016domain. Many of these approaches work well for non-sequential data but are less suitable for multivariate time-series data, as they do not usually capture the temporal dependencies present in the data. However, this type of data is prevalent in the maintenance context (e.g. in condition monitoring and equipment output data). Therefore, the general domain adaptation methods are hardly applicable to RUL prognostics.

In this work, we propose to use LSTMs hochreiter1997long to address the problem of learning temporal dependencies from time-series sensor data that can be transferred across related RUL prediction tasks with different distributions in their features. We learn from a source domain with sufficient annotated run-to-failure data and a target domain containing only sensor data. We perform adversarial learning similar to ganin2016domain and learn a common domain-invariant feature representation that can be used with the classical backpropagation (through time) algorithm werbos1990backpropagation. To the best of our knowledge, we are the first to focus on domain adaptation for the RUL regression task under varying operating conditions and fault modes. Furthermore, we use the NASA C-MAPSS Turbofan degradation datasets Saxena2008 to validate our results. We chose these datasets as they contain four run-to-failure datasets under different failure modes and operating conditions. Typically, models built, evaluated and deployed for one particular dataset may perform poorly on unseen datasets with different input feature distributions and fault modes. We show the effectiveness of the proposed method against other adapted and non-adapted models in predicting the RUL of aircraft engines. In practice, capital assets go through re-engineering and redesign during their lifetime to prevent obsolescence problems. In such scenarios, the data from a previous design could be used to adapt learned models to a new design with distinct input data.

The main contributions of this work include a new model that can handle feature distribution shift across domains under different asset operating conditions and fault modes. Unlike classic domain adaptation methods, we incorporate heterogeneous time-series data coming from multiple sensors in an RUL regression prediction task. Furthermore, the proposed method can easily be updated as new data becomes available. We show in our experiments that the method improves prognostic predictions on unlabelled target data when compared to non-adapted methods.
The rest of the paper is structured as follows. In the next section, we briefly discuss the state of the art in prognostics prediction and domain adaptation methods. In the subsequent section, we present our model, detailing the learning algorithm and how the temporal dependencies of the data are used to create domain-invariant features. In Section 4, we present the learning procedure and detail the choices of model hyperparameters. In Section 5, we compare and contrast the performance of the proposed methods using our datasets and provide an analysis of the results.
2 Related Work
2.1 Machine Learning Methods for RUL Prediction
In the prognostics literature, several artificial intelligence methods have been proposed to predict the RUL of engineering assets. In particular, authors have proposed several methods that attempt to extract the relationships between acquired sequential data and RUL predictions, such as linear regression He2008, Support Vector Regression (SVR) Benkedjouh2013, fuzzy-logic systems zio2010data and neural networks tian2012artificial.

Neural networks have drawn much attention given their ability to approximate complex functions directly from raw data. For example, huang2007residual proposed a Feed-Forward Neural Network (FFNN) architecture for PHM prediction problems yielding superior results in comparison to other reliability-based approaches. Moreover, in many PHM applications, sequential time-series data are present. Neural network architectures such as Recurrent Neural Networks (RNN) are a natural fit for such problems, given that their recurrent internal structure can handle sequential patterns in the input data. However, as demonstrated by bengio1994learning, RNNs have issues when learning long-term dependencies because of vanishing gradients as training progresses. To address these issues, the Long Short-Term Memory (LSTM) hochreiter1997long and Gated Recurrent Unit (GRU) cho2014learningnetworks were introduced. Such networks possess internal gates that control how information flows in the network during the learning procedure. These gates enable the network to preserve its memory state over time and fight the vanishing gradient problem while retaining information for longer periods.

In PHM, yuan2016 recently showed that LSTMs could outperform RNNs, GRUs and AdaBoost-LSTM in an RUL prediction task. Zheng2017 showed that a sequence of LSTM layers followed by FFNNs could outperform other methods, including CNNs, on three distinct degradation datasets. wu2018remaining presented similar results by extracting features based on a dynamic difference procedure and later training an LSTM for RUL predictions. Results showed that the LSTM also outperforms simpler RNN and GRU architectures under similar machinery conditions. More recently, ListouEllefsen2019 showed that Restricted Boltzmann Machines could be used to extract useful weight information by pre-training on degradation data in an unsupervised manner. In this two-stage method, the weights extracted in the first step are used in a further step to fine-tune a supervised LSTM and FFNN model. A genetic algorithm (GA) is used to select the best performing hyperparameters. The methodology holds the state-of-the-art prediction results for the C-MAPSS datasets, presenting it as an effective method for temporal degradation data prediction.
CNNs are notable for being able to extract spatial information from 2D and 3D high-dimensional data, yielding the best-known results in several related tasks such as image segmentation, classification and captioning Hossain2018ACS. CNNs can also handle 1D sequential data and extract high-level features by combining convolution and max-pooling operations while sliding a local receptive field over the input features. Several CNN architectures have been proposed for remaining useful lifetime prognostics. babu2016deep proposed a 2D deep CNN to predict the RUL of a system based on normalised variate time series from sensor signals; they show the effectiveness of the CNN in comparison to the Multilayer Perceptron (MLP), SVR and Relevance Vector Regression (RVR). Li2018 proposed to apply 1D convolutions in sequence without pooling operations. The results show that the proposed architecture can extract deep features from wear data by concatenating only convolution operations. They show competitive results on the C-MAPSS dataset without incurring the high training times encountered when training recurrent models.
2.2 Domain Adaptation Methods for PHM
Most previous studies have focused on predicting the RUL when enough run-to-failure data is available, assuming the training and future data come from the same distribution and feature space Li2018; ListouEllefsen2019; Zheng2017; babu2016deep; yuan2016. However, in real-life PHM scenarios, RUL values may be absent and the marginal distributions may differ between training and testing data. Examples include collecting data from different devices in varied operating conditions. Moreover, the data can also have different features across training and future data, and run-to-failure data can be expensive to obtain. Unsupervised domain adaptation addresses these issues by building algorithms that can be applied when there is domain shift in the distribution and feature spaces Pan2010. Early methods for unsupervised adaptation attempted to reweight source example losses to reflect those in the target distribution jiang2007instance; huang2007correcting. Reweighting-based methods often assume a restricted form of domain shift and selection bias, which restricts their applications. Subspace alignment methods fernando2013unsupervised attempt to find a linear map that minimises the Frobenius norm between a number of top eigenvectors. However, such methods fail to align the distributions of the two data sources. Maximum Mean Discrepancy (MMD) based methods (e.g., Transfer Component Analysis (TCA) pan2011domain) can be interpreted as moment matching methods and can express arbitrary statistics of the data using kernel tricks. Similarly, CORrelation ALignment (CORAL) sun2016return attempts to align the second-order statistics between source and target domains.

Recently, domain-adapted neural networks have been proposed for unsupervised domain adaptation. In general, these methods attempt to bound the target error by the source error plus a discrepancy metric between the source and the target domains BenDavid2010. For example, long2015learning; tzeng2014deep propose to incorporate MMD metrics to reduce the discrepancy between domains in classification problems. Similarly, sun2016deep propose to incorporate a CORAL loss function for the same purpose. Another approach, based on the theory by BenDavid2010, is to use a classification loss (proxy distance) to directly confuse between domains BenDavid2010; ganin2016domain; ajakan2014domain. Adversarial goodfellow2014generative methods have also been proposed for the adaptation task. For example, tzeng2017adversarial propose to pre-train a classifier on the source domain task and use its weights in an adversarial learning task. In their implementation, a target network attempts to confuse a discriminator by generating a representation similar to the source domain. This representation is then used to perform inference using the source weights.

In regression, cortes2014domain showed that the discrepancy between domains is a distance for the squared loss when the hypothesis set is the reproducing kernel Hilbert space induced by a universal kernel. lopez2012semi proposed a method to factorise a multivariate density into a product of bivariate copula functions to identify independent changes between domains (i.e., covariate shift). Therefore, changes in each of the input features can be detected and corrected to adapt a density model across different learning domains. More recently, nikzad2018domain proposed a domain-invariant Partial Least Squares Regression method using a domain regulariser to align source and target distributions in a latent space. Despite the recent attention in the literature, few works have attempted to perform domain adaptation when the inputs are composed of time-series data. Similar to our work, Purushotham2016 proposed a method for time-series domain adaptation using Variational Recurrent Autoencoders and showed promising results in a classification task using healthcare data.
In PHM, Lu2017 proposed a method for domain adaptation where the Maximum Mean Discrepancy (MMD) metric is used to find a common latent space in a classification task. zhang2017new proposed to use handcrafted and raw input features to construct a CNN with wide first-layer kernels capable of performing domain adaptation in the presence of noisy data. Another CNN approach was proposed by li2019multi, where an MMD parameterised by multiple kernels is used to address the domain shift problem. xie2016 used feature extraction from the time and frequency domains with Transfer Component Analysis (TCA) pan2011domain for gearbox fault diagnosis. The results showed that the proposed method could find cross-domain features using data under various operating conditions and yields better results in comparison to other dimensionality reduction methods. Recently, li2018cross proposed a deep generative model to generate fault target data using a labelled source domain and an unlabelled target domain. The method combines an MMD metric and a classification loss in the generation phase to induce a common shared space between the domains. Results showed that the method could be used to solve the original target classification problem in the presence of unlabelled data.

Our work builds upon previous successful works using LSTMs to extract temporal features from time-series sensor data for RUL prediction yuan2016; Zheng2017; wu2018remaining; ListouEllefsen2019. To handle the distribution shift and different features among tasks, we perform unsupervised domain adaptation from a labelled source containing observed RUL values to unlabelled target data. This scenario is typical in PHM, as one is usually interested in predicting the RUL of an asset before failures are observed. We propose a method that can use the labelled failure data from the source domain to predict the RUL of the target domain under different failure modes and operating conditions. Similar to ganin2016domain, we use a gradient reversal layer to perform adversarial learning during training and induce a domain-invariant representation.

Unlike previous domain adaptation works in PHM that focused on classification tasks, our method focuses on a regression task, where the goal is to determine the number of remaining cycles for aircraft engines. Similar to ganin2016domain, our method utilises a single neural network capable of learning the source regression task while performing adversarial training. Typically, adversarial learning is achieved by training two or more networks pitted against each other with contrasting objectives in a two-pass optimisation procedure goodfellow2014generative. Our modification allows for an easier implementation requiring only one architecture. Moreover, weight updates can be performed using the classic backpropagation through time algorithm. Our method uses a classification loss ajakan2014domain to induce domain confusion, as it has recently been shown that it can replace the classical MMD metric in classification settings.
3 LSTM Deep Adversarial Neural Network
In this section, we present our domain adaptation model to predict the RUL of assets across domains with different fault modes and operating conditions. We first introduce the problem and the notations used in the paper and then further discuss the proposed method and its components.
3.1 Problem Definition
We denote a source domain $\mathcal{D}_S = \{(X_i^S, Y_i^S)\}_{i=1}^{n_S}$, containing $n_S$ training examples, where $X_i^S$ belongs to a feature space $\mathcal{X}$ and denotes multivariate time-series sensor data of length $T_i$ with $m$ features, i.e. $X_i^S \in \mathbb{R}^{T_i \times m}$. Moreover, $Y_i^S$ denotes the remaining useful lifetime values of length $T_i$, where for each $t \in \{1, \dots, T_i\}$, $x_{i,t}$ and $y_{i,t} \in \mathbb{R}$ represent the $t$th measurement of all variables and labels, respectively. Similarly, we assume a target domain $\mathcal{D}_T = \{X_j^T\}_{j=1}^{n_T}$, where $X_j^T \in \mathcal{X}$ and $X_j^T \in \mathbb{R}^{T_j \times m}$, but no labels are available. We assume $\mathcal{D}_S$ and $\mathcal{D}_T$ are sampled from distinct marginal probability distributions $P(X^S) \neq P(X^T)$. Our goal is to learn a function $f: \mathcal{X} \rightarrow \mathbb{R}$ such that we can approximate the corresponding RUL of the target domain examples at testing time directly from degradation data, i.e. $\hat{y}_{j,t} = f(x_{j,t})$. Clearly, our assumption is that the true mapping between input-output pairs is sufficiently similar across domains for adaptation to be possible. At training time, we have access to the source training samples and their real-valued labels, and we assume access to the training samples from the target domain (unsupervised domain adaptation). We assign a domain label $d \in \{0, 1\}$ to each training example to indicate the domain it originates from and to assist domain classification.

3.2 Time Windows Processing
To adapt for different sequence lengths and to allow information from past multivariate temporal sequences to influence the RUL prediction at a point in time, we apply a time-window approach for feature extraction. The sequential input is assumed to be $X_i \in \mathbb{R}^{T_i \times m}$, where $T_i$ denotes the length of each sequence. We define a function that divides each sequence of size $T_i$ into sequential time windows of size $T_w$. After the transformation, at time $t$ all previous sensor data within the time window are collected to form a high-dimensional input vector used to predict $y_t$. If $t < T_w$, we apply zero-padding on the left side of the window until it has size $T_w$; this ensures that after the transformation each original time series yields $T_i$ training samples. Accordingly, we define $N_S$ and $N_T$ as the updated numbers of examples after the transformation, that is $N_S = \sum_{i=1}^{n_S} T_i$ and $N_T = \sum_{j=1}^{n_T} T_j$. We keep $T_w$ fixed across the source and target data to allow consistency in the number of time steps seen by the network before a prediction, although this parameter could be made flexible between domains.

3.3 Long Short-Term Memory Neural Network
One choice of learning function to accommodate temporal relationships between inputs and outputs is the LSTM, which has been studied in prognostics and RUL prediction tasks when enough training data is available yuan2016; ListouEllefsen2019. Such networks offer recurrent connections capable of modelling the temporal dynamics of sensor data in prognostics scenarios. Moreover, they control how information flows within the LSTM cells by updating a series of gates capable of learning long-term relationships in the input data.

In our proposed model, we use LSTM layers to extract the temporal features contained in the previous time window of size $T_w$ before an RUL prediction. In an LSTM, the memory cell (Figure 1) consists of three non-linear gating units that update a cell state $c_t \in \mathbb{R}^n$, using a hidden state vector $h_{t-1} \in \mathbb{R}^n$ and inputs $x_t \in \mathbb{R}^m$, where $n$ is the dimension of the LSTM cells and $m$ the input dimension:
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (1)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (2)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (3)$$
where $\sigma$ is a sigmoid activation function responsible for squeezing the output to the (0, 1) range, $W_* \in \mathbb{R}^{n \times m}$ are the input weight matrices, $U_* \in \mathbb{R}^{n \times n}$ are the recurrent weight matrices, and $b_* \in \mathbb{R}^{n}$ are bias vectors. The subscript $*$ can either be the forget gate $f$, the input gate $i$ or the output gate $o$, depending on the activation being calculated. After computing $f_t$, $i_t$ and $o_t$, the new cell state candidate $\tilde{c}_t$ is computed as follows:
$$\tilde{c}_t = \phi(W_c x_t + U_c h_{t-1} + b_c) \qquad (4)$$
where $\phi$ denotes the hyperbolic tangent activation function and, similar to the gate operations, $W_c \in \mathbb{R}^{n \times m}$, $U_c \in \mathbb{R}^{n \times n}$ and $b_c \in \mathbb{R}^{n}$.
The previous cell state $c_{t-1}$ is then updated to the new cell state $c_t$:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (5)$$
where $\odot$ denotes element-wise multiplication.
In other words, in the previous equations, the forget gate is responsible for deciding which information will be thrown away from the cell state. Next, the input gate decides which states will be updated from the candidate cell state. The input and forget gates are then used to compute the new cell state for the next time step. Lastly, the output gate decides which information the cell will output, and the new hidden state $h_t$ is computed by applying the function $\phi$ to the current cell state, multiplied element-wise by the output gate result:
$$h_t = o_t \odot \phi(c_t) \qquad (6)$$
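As a concrete illustration, equations (1)-(6) can be sketched in a few lines of NumPy. This is a minimal single-step sketch, not the model implementation; the function and variable names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step implementing equations (1)-(6).

    W, U, b are dicts keyed by 'f', 'i', 'o', 'c' holding the input
    weights (n x m), recurrent weights (n x n) and biases (n,)."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])     # forget gate, eq. (1)
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])     # input gate,  eq. (2)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])     # output gate, eq. (3)
    c_cand = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate,   eq. (4)
    c_t = f_t * c_prev + i_t * c_cand                          # cell update, eq. (5)
    h_t = o_t * np.tanh(c_t)                                   # hidden state, eq. (6)
    return h_t, c_t

# toy dimensions: n = 4 LSTM units, m = 3 input features
rng = np.random.default_rng(0)
n, m = 4, 3
W = {k: rng.normal(size=(n, m)) for k in 'fioc'}
U = {k: rng.normal(size=(n, n)) for k in 'fioc'}
b = {k: np.zeros(n) for k in 'fioc'}
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=m), h, c, W, U, b)
```

Note how the gates keep every component of $h_t$ inside $(-1, 1)$, since both the output gate and $\phi$ are bounded.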
3.4 LSTM Deep Adversarial Neural Network
Our model, referred to as LSTM-DANN and depicted in Figure 2, is trained to predict for each input $x_t$ a real value $y_t$ and its domain label $d$. Similar to ganin2016domain, we use a Domain Adversarial Neural Network (DANN) approach and decompose our learning method into three parts. We assume that the inputs are first processed by a combination of LSTM layers capable of extracting temporal relationships in the input space for the rest of the network. This feature extractor $G_f$ embeds the inputs in a feature space $\mathbb{R}^k$. We denote the vector of parameters in this layer combination as $\theta_f$, i.e. $G_f(x; \theta_f)$. The new feature space is first mapped to a real-valued variable via a mapping function $G_y$ composed of fully connected layers with parameters $\theta_y$. Lastly, the same feature vector is mapped to a domain label by a mapping function $G_d$ with parameters $\theta_d$.

During training, we aim at minimising a regression loss $L_y$ using the observed RUL values from the source domain. Thus, the parameters of the feature extractor and regressor are optimised towards the same goal, i.e. minimising a regression loss function for the source domain task. This ensures that the features are discriminative towards the main learning task and can be used to predict the RUL at each time $t$. We also aim at finding features that are domain invariant, i.e. we want to find a feature space in which $G_f(X^S)$ and $G_f(X^T)$ are similar. To address this contrasting objective, we look at an auxiliary loss $L_d$ over the domain classifier function $G_d$. We estimate the dissimilarity between domains by inducing a high adversarial loss on the domain-invariant features when the domain classifier has been trained to discriminate between the two domains.

To enforce such behaviour in the network, we train the model in a two-pass adversarial goodfellow2014generative procedure. In the first pass, we update the weights of the feature extractor in the direction that minimises the regression loss $L_y$. In the second pass, we update the same weights in the direction that maximises the domain-classification loss $L_d$, while the domain classifier itself is trained to minimise its overall domain classification error.
In other terms, we define the model loss functions in terms of the learning functions and parameters, and we aim at optimising a combined loss function expressed as:
$$E(\theta_f, \theta_y, \theta_d) = L_y(\theta_f, \theta_y) - \lambda L_d(\theta_f, \theta_d) \qquad (7)$$
where the losses $L_y$ and $L_d$ are expressed as:
$$L_y(\theta_f, \theta_y) = \frac{1}{N_S} \sum_{i=1}^{N_S} \ell_y\!\left(y_i, \hat{y}_i\right) \qquad (8)$$
$$L_d(\theta_f, \theta_d) = \frac{1}{N_S + N_T} \sum_{i=1}^{N_S + N_T} \ell_d\!\left(d_i, \hat{d}_i\right) \qquad (9)$$
where $\hat{y}_i$ is the RUL prediction for the source domain, i.e. $\hat{y}_i = G_y(G_f(x_i; \theta_f); \theta_y)$, and $\hat{d}_i$ is the domain prediction for source and target examples, i.e. $\hat{d}_i = G_d(G_f(x_i; \theta_f); \theta_d)$. Here $\ell_y(y, \hat{y}) = |y - \hat{y}|^p$ is a regression loss that takes the form of the Mean Absolute Error (MAE) when $p = 1$ and the Mean Squared Error (MSE) when $p = 2$, $\ell_d$ is the binary cross-entropy loss between domain labels, and $\lambda$ is a positive hyperparameter that weighs the domain classification loss during training. Both are commonly used loss functions in regression and classification problems.
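To make the loss terms concrete, a minimal NumPy sketch of the combined objective follows. The function names and the default $\lambda$ are illustrative only, and the sketch operates on plain arrays rather than network outputs.

```python
import numpy as np

def regression_loss(y, y_hat, p=2):
    """ell_y averaged over source examples: MAE for p=1, MSE for p=2."""
    return np.mean(np.abs(y - y_hat) ** p)

def domain_loss(d, d_hat, eps=1e-12):
    """ell_d: binary cross-entropy over source and target examples."""
    d_hat = np.clip(d_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(d * np.log(d_hat) + (1 - d) * np.log(1 - d_hat))

def combined_loss(y, y_hat, d, d_hat, lam=0.5, p=2):
    """E = L_y - lambda * L_d: the negative sign on the domain term is
    what makes the feature extractor adversarial to the classifier."""
    return regression_loss(y, y_hat, p) - lam * domain_loss(d, d_hat)
```

For instance, a domain classifier stuck at $\hat{d} = 0.5$ (maximal confusion) contributes a domain loss of $\ln 2$, which the combined objective rewards via the negative sign.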
We optimise the function by searching for a saddle point solution of the minimax problem below:
$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d) \qquad (10)$$
$$\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d) \qquad (11)$$
and to update the learning weights in the network we use gradient updates ganin2016domain of the form:
$$\theta_f \leftarrow \theta_f - \mu \left( \frac{\partial L_y}{\partial \theta_f} - \lambda \frac{\partial L_d}{\partial \theta_f} \right) \qquad (12)$$
$$\theta_y \leftarrow \theta_y - \mu \frac{\partial L_y}{\partial \theta_y} \qquad (13)$$
$$\theta_d \leftarrow \theta_d - \lambda \mu \frac{\partial L_d}{\partial \theta_d} \qquad (14)$$
We use stochastic estimates of the updates in equations (12)-(14) via Stochastic Gradient Descent (SGD) and its variants, where the learning rate $\mu$ represents the learning steps taken by the SGD algorithm as training progresses. To achieve the desired updates, we use a Gradient Reversal Layer (GRL) ganin2016domain alongside the gradient updates. This layer does not perform any changes to the weights of the network during the forward pass. On gradient updates, however, it changes the sign of the gradient of the subsequent layers, multiplied by a factor $\lambda$. The GRL makes it possible to learn the weights with few modifications to current implementations of the backpropagation algorithm in common deep learning libraries.

3.5 Dropout Regularisation
To learn neural network models effectively, one has to account for their capability of learning complicated patterns seen in raw data, but also for their tendency to overfit the training data. Dropout srivastava2014dropout is a regularisation method that can prevent overfitting in deep neural network architectures. It provides a simple solution to the overfitting problem by randomly dropping network units and their connections during training, preventing such units from becoming highly specialised in the training data. At testing time, a single fully connected network is used to approximate the averaging over all the thinned networks used during training.

This method significantly reduces overfitting and has shown considerable results in many prediction tasks srivastava2014dropout. In this work, we apply the dropout method independently to the weights $\theta_f$, $\theta_y$ and $\theta_d$. In the feature extraction layers, we want to prevent the weights from becoming too specialised in one of the domains without adapting to changing input data. In the remaining layers, we aim to prevent overfitting on both tasks (regression and classification). We search for the best-performing dropout fraction for our model during the hyperparameter tuning phase.
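To summarise the adversarial training of Section 3.4, the update rules (12)-(14) and the effect of the gradient reversal can be sketched on a toy scalar model. This is not the LSTM-DANN itself: the closed-form gradients below assume an MSE regression loss and a logistic domain classifier on a one-parameter-per-component linear model, and all names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dann_step(theta_f, theta_y, theta_d, x, y, d, lam=0.3, mu=0.01):
    """One gradient step on a scalar toy model: feature f = theta_f * x,
    regressor y_hat = theta_y * f, domain classifier d_hat = sigmoid(theta_d * f)."""
    f = theta_f * x
    y_hat = theta_y * f
    d_hat = sigmoid(theta_d * f)

    # closed-form gradients of L_y = (y_hat - y)^2 and L_d = BCE(d, d_hat)
    dLy_dtheta_y = 2 * (y_hat - y) * f
    dLy_dtheta_f = 2 * (y_hat - y) * theta_y * x
    dLd_dtheta_d = (d_hat - d) * f
    dLd_dtheta_f = (d_hat - d) * theta_d * x

    # updates (12)-(14): the gradient reversal layer flips the sign of the
    # domain gradient flowing into the feature extractor
    theta_f = theta_f - mu * (dLy_dtheta_f - lam * dLd_dtheta_f)
    theta_y = theta_y - mu * dLy_dtheta_y
    theta_d = theta_d - lam * mu * dLd_dtheta_d
    return theta_f, theta_y, theta_d

new_f, new_y, new_d = dann_step(1.0, 1.0, 1.0, x=1.0, y=0.0, d=1.0)
```

The regressor and classifier each descend their own loss, while the feature extractor simultaneously descends $L_y$ and ascends $L_d$, which is exactly the saddle-point behaviour of equations (10)-(11).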
4 Design of Experiments
In this section, we describe the experiments using the proposed model to predict the RUL using degradation data coming from different domains. We describe the datasets used in the experiments and the details about the implementation.
4.1 CMAPPS Datasets
The method is evaluated using the benchmark Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) Saxena2008 datasets containing turbofan engine degradation data. C-MAPSS is composed of four distinct datasets that contain information coming from 21 sensors as well as 3 operational settings. Each of the four datasets comprises a number of degrading engines split into training and testing data. Moreover, the datasets hold run-to-failure information from multiple engines collected under various operating conditions and fault modes.
Table 1: Details of the C-MAPSS datasets.

Data                  FD001  FD002  FD003  FD004
Engines: Training       100    260    100    249
Engines: Testing        100    259    100    248
Operating Conditions      1      6      1      6
Fault Modes               1      1      2      2
Engines in the datasets are considered to start with various degrees of initial wear, but are considered healthy at the start of each record. As the number of cycles increases, the engines begin to deteriorate until they can no longer function. At this point, the engines are considered unhealthy and cannot perform their intended function. Unlike the training datasets, the testing datasets contain temporal data that terminates some time before a system failure.

The original prediction task is to predict the RUL of the testing units using the training units Saxena2008. We expand on this goal and consider the case when one has enough run-to-failure data under one set of fault modes and operating conditions but wants to apply the learned model to a different dataset, i.e. we validate our results on a different set of operating conditions and fault modes. We motivate this setup by cases found in maintenance prediction scenarios, where run-to-failure data is available for assets under specific running conditions, but unobserved failures prevent the use of previously learned models in a domain with different conditions and fault modes.
The details of the four datasets are given in Table 1. We refer to the datasets as FD001, FD002, FD003 and FD004. The operating conditions vary from one (sea level) in FD001 and FD003 to six, based on different combinations of altitude (0-42,000 feet), throttle resolver angle (20-100) and Mach number (0-0.84), in FD002 and FD004. Also, the fault modes vary between one (HPC degradation) in FD001 and FD002, and two (HPC degradation and Fan degradation) in FD003 and FD004. For our experiments, we consider each one of the datasets as source and target domains and perform domain adaptation on the different source-target pairs.
4.2 Data Preprocessing
The temporal input data coming from 21 sensor values and 3 operational settings are used across the experiments. We note that for both FD001 and FD003 datasets, 7 sensor values have constant readings. However, as the constant readings are not consistent across the datasets, we keep the sensor values in our experiments to be able to consider their variations in different source and target scenarios.
Since the original distributions and feature values across the datasets are similar, we need to ensure that enough distribution shift exists so that performing adaptation makes sense. To induce a higher discrepancy between domains and aid gradient descent weight updates, we normalise the input data and RUL values by scaling each feature individually to the [0, 1] range using the min-max normalisation method:
\[
\hat{x}_t^{\,i,j} = \frac{x_t^{\,i,j} - \min(x^{j})}{\max(x^{j}) - \min(x^{j})} \qquad (15)
\]
where $x_t^{\,i,j}$ denotes the original $i$th data point of the $j$th input feature at time $t$ and $x^{j}$ the vector of all inputs of the $j$th feature. We perform the normalisation for each dataset individually and perform domain adaptation on the normalised input values.
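As a concrete illustration, the per-dataset min-max scaling of equation (15) can be sketched as follows. This is a minimal NumPy sketch; the function name, the toy sensor matrix and the guard against constant sensor readings are our own additions.

```python
import numpy as np

def minmax_normalise(X):
    """Scale each feature (column) of one dataset to the [0, 1] range,
    as in equation (15). Normalisation is done per dataset."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard against constant sensors (max == min) to avoid division by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Toy sensor matrix: 4 samples, 2 features.
X = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]])
X_norm = minmax_normalise(X)
```

Each dataset (FD001 to FD004) would be passed through this scaling separately before adaptation.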
In RUL prediction tasks it is often not straightforward to determine the health status and the remaining useful lifetime of an equipment. In our datasets, RUL targets are only available at the last time step for each engine in the test datasets. As Heimes2008 has shown, it is reasonable to estimate the RUL as a constant value when the engines operate in normal conditions. Similar to other works in the literature ListouEllefsen2019; Lei2018, we propose to use a piecewise linear degradation model to define the RUL values in the training datasets. That is, after an initial period with constant RUL values, we assume that the RUL targets decrease linearly as the number of observed cycles progresses. We denote by $R_{early}$ the initial period in which the engines are still working in their desired conditions. A constant $R_{early}$ of 125 cycles is selected in our experiments to allow comparison to other proposed models in the literature ListouEllefsen2019; Lei2018. We point out that the choice of this constant impacts the performance of the RUL prediction methods and that further optimisation can be done to select the best performing $R_{early}$.
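The piecewise linear degradation targets can be sketched as below. The function name and the toy engine length are illustrative; the cap of 125 cycles matches the constant used in our experiments.

```python
import numpy as np

def piecewise_rul(n_cycles, r_early=125):
    """Piecewise linear RUL targets for one run-to-failure engine.

    RUL is held constant at r_early during the healthy phase and then
    decreases linearly to 0 at the last observed cycle."""
    linear = np.arange(n_cycles - 1, -1, -1)   # n-1, n-2, ..., 0
    return np.minimum(linear, r_early)

# Toy engine observed for 200 cycles.
rul = piecewise_rul(200)
```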
Moreover, we note that the same normalised sensor values coming from distinct datasets can present different distributions according to their degradation level. We show in Figure 3 four normalised sensor values of the training examples coming from each of the four datasets 100 time steps before a failure occurs. We observe a lower distribution shift between the dataset pairs (FD001, FD003) and (FD002, FD004). This is the case because these pairs have data simulated under the same operating conditions, which causes their sensor values to have similar overall distributions near failure Saxena2008. However, as can be seen in Figures 2(a), 2(b) and 2(d), there is still some distribution shift between FD001 and FD003 due to the varying fault modes. Similarly, in Figure 2(c) we observe a small distribution shift between FD002 and FD004. In practice, these distribution shifts make models data-specific, i.e. a model trained on one dataset often does not perform well on a different dataset unless the sensor values driving the fault behaviour are similar across source-target pairs.
4.3 Performance Metrics
Similar to other prognostic studies using the same datasets, we measure the performance of the proposed method on target datasets using two metrics. We use the Root Mean Squared Error (RMSE), as it can be directly related to equations (7) and (8) and provides an estimate of how well the model performs on the target prediction task.
Moreover, we evaluate our model using a scoring function shown in equation (16) proposed by Saxena2008:
\[
s = \sum_{i=1}^{n}
\begin{cases}
e^{-d_i/a_1} - 1, & d_i < 0\\
e^{\,d_i/a_2} - 1, & d_i \ge 0
\end{cases} \qquad (16)
\]
where $a_1 = 13$ and $a_2 = 10$ Saxena2008, and $d_i$ is the difference between the predicted and observed RUL values. The scoring metric penalises positive errors (late predictions) more than negative errors, as these have a greater impact on RUL prognostics tasks, as can be seen in Figure 4.
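A minimal sketch of the scoring function in equation (16); the function name is ours, and the asymmetry constants follow Saxena2008.

```python
import numpy as np

def phm_score(rul_pred, rul_true, a1=13.0, a2=10.0):
    """Asymmetric scoring function of equation (16).

    d > 0 (late prediction) is penalised more heavily than d < 0."""
    d = np.asarray(rul_pred, dtype=float) - np.asarray(rul_true, dtype=float)
    return float(np.sum(np.where(d < 0,
                                 np.exp(-d / a1) - 1.0,
                                 np.exp(d / a2) - 1.0)))
```

For example, over-predicting by 13 cycles scores worse than under-predicting by the same amount.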
5 Training and Hyperparameter Selection
5.1 Training Procedure
For training, we select one of the four CMAPPS datasets as source domain and use our proposed domain adaptation method to learn the remaining useful lifetime on the remaining three target datasets. That is, we use the input features and labelled RUL values from the source data, and only the input feature values from the target datasets as inputs to the network. The CMAPPS datasets are normalised individually according to equation (15). We apply the time window transformation to both source and target datasets with the same window size, to keep the number of time steps a network sees before making a prediction consistent. The selection of the number of time steps is based on previous literature using the same datasets Lei2018; ListouEllefsen2019. No further feature engineering is performed on the input data, as we aim to extract features automatically using the proposed method. $L_2$ regularisation is applied to the weights in equation (7). Also, we separate the original training data into training (seen by the algorithm) and cross-validation (used for the stopping criterion) sets containing 90% and 10% of the original dataset, respectively.
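The time window transformation can be sketched as follows. This is a NumPy sketch; the window length of 30 and the toy engine dimensions are illustrative, not the tuned values used in the experiments.

```python
import numpy as np

def time_windows(X, y, window):
    """Slide a fixed-length window over one engine's sensor sequence.

    X: (n_cycles, n_features); y: (n_cycles,) RUL targets.
    Returns windows of shape (n_windows, window, n_features) and the
    RUL target at each window's last time step."""
    n = X.shape[0] - window + 1
    Xw = np.stack([X[i:i + window] for i in range(n)])
    yw = y[window - 1:]
    return Xw, yw

# Toy engine: 50 cycles, 21 sensors + 3 operational settings.
X = np.random.rand(50, 24)
y = np.minimum(np.arange(49, -1, -1), 125)
Xw, yw = time_windows(X, y, window=30)
```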
We split the training dataset into minibatches (collections of data samples) that are used to calculate the model error and update the model coefficients. During training, we randomly select minibatches coming from the source and target domains to update the weights in the network on each gradient pass. As can be seen in Table 1, the datasets have different numbers of training samples. To cope with this difference between domains, we oversample the smaller dataset to match the number of minibatches coming from the larger dataset. After the network has seen all training examples coming from both domains, we consider an epoch finished. Further, we define the domain label $d_i = 0$ if the training example comes from the source domain and $d_i = 1$ if it comes from the target domain.
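The oversampling of the smaller domain and the domain labelling can be sketched as below. The function is a simplified illustration; the index-sampling details and the labelling convention (0 for source, 1 for target) follow the description above but are otherwise our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def balance_domains(n_source, n_target):
    """Index arrays that oversample the smaller domain to match the larger.

    Each epoch then draws the same number of minibatches from the
    source and target domains."""
    n = max(n_source, n_target)
    src_idx = rng.choice(n_source, size=n, replace=n_source < n)
    tgt_idx = rng.choice(n_target, size=n, replace=n_target < n)
    return src_idx, tgt_idx

# Toy case: 1000 source samples, 400 target samples.
src_idx, tgt_idx = balance_domains(1000, 400)
# Domain labels: 0 for source examples, 1 for target examples.
domain_labels = np.concatenate([np.zeros(len(src_idx)), np.ones(len(tgt_idx))])
```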
Next, the proposed LSTMDANN architecture is defined, including the number of LSTM and fully connected hidden layers, the number of cells in each layer, the learning rate and the gradient update algorithm. We use the Rectified Linear Unit (ReLU) as activation function. The SGD and RMSProp tieleman2012lecture algorithms are used to update the weights in the network. Our implementation breaks the learning into two models sharing the feature extraction layers. One model aims at learning the regression task using the source domain sensor and RUL information. The other model aims at finding domain-invariant features using the adversarial loss function in equation (7). We point out that the second model is the one responsible for the adversarial learning aspect, as it includes the GRL to swap the gradient signals. While the domain classifier weights are updated to minimise the classification loss between domains, the feature extraction weights are updated to maximise the domain classifier loss at the same time. The complete diagram of the learning procedure can be seen in Figure 5. We train the models for a maximum of 200 epochs and interrupt training if no improvement is seen for 20 epochs; the MAE in equation (8), used as the monitored metric, presents the best performing results in our case. In addition, a varying learning rate is adopted: we start with a fixed learning rate, and after 100 epochs the learning rate is multiplied by a factor of 0.1 to allow for stable convergence. We clip the norm of the gradients to 1 in the SGD algorithm to avoid exploding gradients. Finally, the data coming from the target domain, including the RUL values, are fed to the network to calculate the final RUL estimations, and the performance measures can be obtained.
5.2 Hyperparameter Selection
We perform grid search on the more sensitive hyperparameters, the optimiser (opt) and learning rates (lr), and fine-tune the remaining parameters manually. The range considered for each hyperparameter is shown in Table 2. To assess the quality of the proposed algorithm, we need to validate the hyperparameters without using the RUL values coming from the target domain. In our proposed methodology, we evaluate the performance of the adaptation task by observing the cross-validation error and the domain classifier error on the source domain. In general, we observed that performance results are better when a lower source error is achieved while the domain classification stabilises at loss values that lead to an accuracy close to a random guess. We select the hyperparameters that yield the lowest source RMSE. We report the resulting hyperparameter settings in Table 3.
| Hyperparameter | Range |
| Learning rate (source regression) | {0.001, 0.01, 0.1} |
| Learning rate (domain classification) | {0.001, 0.01, 0.1} |
| Batch size | {256, 512, 1024} |
| Optimiser | {SGD, RMSProp} |
| Number of layers (LSTM) | {1, 2} |
| Number of units (LSTM) | {32, 64, 100, 128} |
| Number of units | {30, 32, 64, 128, 512} |
| Number of layers (source regression) | {1, 2} |
| Number of units (source regression) | {16, 20, 32, 64, 128} |
| Number of layers (domain classification) | {1, 2} |
| Number of units (domain classification) | {16, 20, 32, 64, 128} |
| Regularisation | {0.0, 0.01, 0.1} |
| λ | {0.8, 1.0, 2.0, 3.0} |
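The exhaustive search over the sensitive hyperparameters can be sketched as follows; `train_eval` is a hypothetical stand-in for training the network and returning the source cross-validation RMSE, and the selection criterion follows the text: lowest source RMSE.

```python
import itertools

def grid_search(train_eval, grid):
    """Exhaustive search over a dictionary of hyperparameter ranges.

    train_eval(params) -> source cross-validation RMSE (placeholder)."""
    best, best_rmse = None, float("inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        rmse = train_eval(params)
        if rmse < best_rmse:
            best, best_rmse = params, rmse
    return best, best_rmse

# The two grid-searched hyperparameters from Table 2.
grid = {"opt": ["SGD", "RMSProp"], "lr": [0.001, 0.01, 0.1]}
# Dummy evaluation standing in for actual model training.
best, best_rmse = grid_search(lambda p: p["lr"], grid)
```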
Source: FD001
| Target | Reg. | LSTM layers (units) | Dropout | Units | Source reg. layers (units) | Dropout | Domain class. layers (units) | Dropout | λ | Batch size | lr source reg. | lr domain class. | Opt. |
| FD002 | 0.01 | 1 (128) | 0.5 | 64 | 1 (32) | 0.3 | 1 (32) | 0.3 | 0.8 | 256 | 0.01 | 0.01 | SGD |
| FD003 | 0.01 | 1 (128) | 0.5 | 64 | 1 (32) | 0.3 | 1 (32) | 0.3 | 0.8 | 256 | 0.01 | 0.01 | SGD |
| FD004 | 0.01 | 1 (128) | 0.7 | 64 | 2 (32, 32) | 0.3 | 1 (32) | 0.3 | 1.0 | 256 | 0.01 | 0.1 | SGD |

Source: FD002
| Target | Reg. | LSTM layers (units) | Dropout | Units | Source reg. layers (units) | Dropout | Domain class. layers (units) | Dropout | λ | Batch size | lr source reg. | lr domain class. | Opt. |
| FD001 | 0.01 | 1 (64) | 0.1 | 64 | 1 (32) | 0.0 | 2 (16, 16) | 0.1 | 1.0 | 512 | 0.01 | 0.01 | SGD |
| FD003 | 0.01 | 1 (64) | 0.1 | 512 | 2 (64, 32) | 0.0 | 2 (64, 32) | 0.1 | 2.0 | 256 | 0.1 | 0.1 | SGD |
| FD004 | 0.01 | 2 (32, 32) | 0.1 | 32 | 1 (32) | 0.0 | 1 (16) | 0.1 | 1.0 | 256 | 0.1 | 0.1 | SGD |

Source: FD003
| Target | Reg. | LSTM layers (units) | Dropout | Units | Source reg. layers (units) | Dropout | Domain class. layers (units) | Dropout | λ | Batch size | lr source reg. | lr domain class. | Opt. |
| FD001 | 0.01 | 2 (64, 32) | 0.3 | 128 | 2 (32, 32) | 0.1 | 2 (32, 32) | 0.1 | 2.0 | 256 | 0.01 | 0.01 | SGD |
| FD002 | 0.01 | 2 (64, 32) | 0.3 | 64 | 2 (32, 32) | 0.1 | 2 (32, 32) | 0.1 | 2.0 | 256 | 0.01 | 0.01 | SGD |
| FD004 | 0.01 | 2 (64, 32) | 0.3 | 64 | 2 (32, 32) | 0.1 | 2 (32, 32) | 0.1 | 2.0 | 256 | 0.01 | 0.01 | SGD |

Source: FD004
| Target | Reg. | LSTM layers (units) | Dropout | Units | Source reg. layers (units) | Dropout | Domain class. layers (units) | Dropout | λ | Batch size | lr source reg. | lr domain class. | Opt. |
| FD001 | 0.01 | 1 (100) | 0.5 | 30 | 1 (20) | 0.0 | 1 (20) | 0.1 | 1.0 | 512 | 0.01 | 0.01 | SGD |
| FD002 | 0.01 | 1 (100) | 0.5 | 30 | 1 (20) | 0.0 | 1 (20) | 0.1 | 1.0 | 512 | 0.01 | 0.01 | SGD |
| FD003 | 0.01 | 1 (100) | 0.5 | 30 | 1 (20) | 0.0 | 1 (20) | 0.1 | 1.0 | 512 | 0.01 | 0.01 | SGD |
We run the experiments presented in this paper on a machine with an Intel Core i5 7th generation processor, 16 GB RAM and a GeForce GTX 1070 Graphics Processing Unit (GPU). We implement the models using the Python 3.6 programming language and the Keras chollet2015keras deep learning library with the TensorFlow tensorflow2015whitepaper backend.

6 Experimental Results
In this section, the prognostic performance of the proposed domain adaptation method for RUL estimation is presented. All experiments consider each one of the CMAPPS datasets as source domain and the remaining datasets as target domains. In total we have 12 different experiments, and results are averaged over 10 trials for each experiment to reduce the effect of randomness. For each experiment we report the mean and standard deviation of each model's performance.
We start by comparing the proposed method with baseline LSTM models trained on the source domain and applied to the target domain (SOURCEONLY). Also, we compare to models trained on the target domain using the target domain labels (TARGETONLY), representing the ideal situation when target RUL values are available for prediction. Furthermore, we assess our domain adaptation method against other popular methods that attempt to align source and target domains before a prediction model is constructed. We compare against Feed Forward Neural Network (FFNN) models trained on the Transfer Component Analysis (TCA) and CORrelation ALignment (CORAL) domain-invariant spaces. We assess the methodologies' effectiveness in finding representations that can adapt to the different wear distributions across domains.
Lastly, we show that our TARGETONLY models can be effectively used to predict the RUL values for each of the CMAPPS datasets. We compare our methodology with the current stateoftheart methods to assess the general effectiveness of our proposed method when both sensor and RUL information are available for training.
6.1 Comparison to Nonadapted Models under Domain Shift
In this section, we compare the proposed model with models trained on the source domain and applied to the target domain (SOURCEONLY), serving as a baseline for the domain-adapted models, and models trained on the target domain using target labels (TARGETONLY), serving as an upper bound for the proposed methods. For the models where no adaptation is performed (SOURCEONLY and TARGETONLY), we train a network of the form ReLU(LSTM(100)) + Dropout(0.5) + ReLU(Dense(30)) + Dropout(0.1) + ReLU(Dense(20)) + Dense(1) for 100 epochs using the Adam kingma2014adam optimiser with a learning rate of 0.001. We use an MSE loss function and time window sizes equal to 30, 20, 30 and 15 for FD001, FD002, FD003 and FD004, respectively. These hyperparameters are chosen because they yield the best performances in our experiments. We present, in Table 4, the performances on the target test datasets for each source-target pair in our experiments. We aim to show the effect of using a model for the target domains trained only on the source domain in comparison with the proposed method. Therefore, we present the percentage change between SOURCEONLY and the LSTMDANN method as a relative change (%). We also present the normalised RUL prediction results of several engines coming from the target cross-validation dataset in Figure 6. In the figure, we present the target RUL values as well as the predictions coming from the LSTMDANN, SOURCEONLY and TARGETONLY models. We analyse the results separately for each source domain, as each selection poses distinct difficulties for the adaptation.
Source: FD001
| Target | SOURCEONLY | LSTMDANN (%) | TARGETONLY |
| FD002 | 71.70 ± 3.88 | 48.62 ± 6.83 (−32%) | 17.76 ± 0.43 |
| FD003 | 51.20 ± 3.39 | 45.87 ± 3.58 (−10%) | 12.49 ± 0.29 |
| FD004 | 73.88 ± 4.50 | 43.82 ± 4.15 (−41%) | 21.30 ± 1.06 |

Source: FD002
| Target | SOURCEONLY | LSTMDANN (%) | TARGETONLY |
| FD001 | 164.84 ± 23.00 | 28.10 ± 5.03 (−83%) | 13.64 ± 0.80 |
| FD003 | 154.04 ± 21.79 | 37.46 ± 1.54 (−76%) | 12.49 ± 0.29 |
| FD004 | 37.76 ± 2.17 | 31.85 ± 1.65 (−16%) | 21.30 ± 1.06 |

Source: FD003
| Target | SOURCEONLY | LSTMDANN (%) | TARGETONLY |
| FD001 | 49.94 ± 7.65 | 31.74 ± 9.37 (−36%) | 13.64 ± 0.80 |
| FD002 | 70.32 ± 4.02 | 44.62 ± 1.21 (−36%) | 17.76 ± 0.43 |
| FD004 | 69.28 ± 4.51 | 47.94 ± 5.78 (−31%) | 21.30 ± 1.06 |

Source: FD004
| Target | SOURCEONLY | LSTMDANN (%) | TARGETONLY |
| FD001 | 188.00 ± 25.95 | 31.54 ± 2.42 (−83%) | 13.64 ± 0.80 |
| FD002 | 20.88 ± 1.66 | 24.93 ± 1.82 (+19%) | 17.76 ± 0.43 |
| FD003 | 157.32 ± 20.37 | 27.84 ± 2.69 (−82%) | 12.49 ± 0.29 |

Source: FD004
Several RUL prediction results for FD004 acting as source domain are presented in Figures 5(a), 5(b) and 5(c). In the figures, one can notice that the RUL predictions of the SOURCEONLY model for target domains FD001 (Fig. 5(b)) and FD003 (Fig. 5(c)) have a large error in comparison with the observed values. On the other hand, despite some errors between the predictions and observed values, our proposed model shows higher accuracy on the target RUL values, leading to a smaller error than that of the SOURCEONLY model. We also point out that for target domain FD002, the SOURCEONLY model already provides a good fit to the observed RUL values (Fig. 5(a)). In this case, domain adaptation results in predictions similar to SOURCEONLY. Usually, this result is not known beforehand, but it can be expected, since the marginal distribution shift across domains FD004 and FD002 is low. We also point out that FD004 is the dataset that contains 6 operating conditions and 2 fault modes. Therefore, the adaptation method can use the source domain to find correspondences between the operating conditions and fault modes in each source-target pair. That is of practical value, since one could use previous data coming from different conditions and fault modes to estimate better RUL values on unobserved run-to-failure data.

Source: FD003
In the examples provided, the adaptation from FD003 to FD002 (Fig. 5(e)) and FD004 (Fig. 5(f)) shows worse results than that for FD001 (Fig. 5(d)). This is the case because FD003 is much more similar to FD001, varying only in the number of fault modes. As can be seen in Table 4, the LSTMDANN model yields a lower RMSE value than the SOURCEONLY model, showing that the weights learned by the network can be effectively used even when the domains are already similar. For target domains FD002 and FD004, the differences between domains are more prominent. FD003 has one operating condition, while FD002 and FD004 have 6 different operating conditions and different sensor values near a failure. Despite the difficulty of transferring between such distinct domains, the domain adaptation method improves on the SOURCEONLY model, showing a better prediction error on the test dataset. However, some higher errors are present, as the estimated values are noisy and do not fit the linear degradation model over its full extent.

Source: FD002
In Figures 5(g), 5(h) and 5(i) we present engines from the FD001, FD003 and FD004 cross-validation target datasets. Similar to the inverted experiment, the similarities between FD004 and FD002 make the SOURCEONLY model fit the target data with high accuracy. In this experiment pair, our model is also able to fit the target data with a similar error level to the one without adaptation. On the other hand, we can improve the predictions on the FD001 target dataset in comparison to SOURCEONLY. In this case, FD001 and FD002 share the same fault mode (HPC degradation), but FD002 has more operating conditions than FD001, which enables the algorithm to learn the degradation function better than a model without adaptation. Our method can also produce more accurate results than SOURCEONLY when the target domain is FD003. Although both the operating conditions and fault modes differ across domains, the predictions produce lower errors and a better fit to the linear degradation model. The result of this experiment shows that it is possible to transfer from a domain that has more operating conditions than the target domains under the same or different fault modes. This is important, as in practice one is interested in reusing previously gathered or simulated data to predict the RUL on unseen data.
Source: FD001
In our experiments, FD001 presents the highest errors when functioning as a source domain. When the target domains are FD002 or FD004 (Figs. 5(j) and 5(l)), the best solution found is one that yields a flattened curve over the entire cycle. It can be observed that the SOURCEONLY model has a similar behaviour when used to predict the target dataset. However, the proposed model is capable of adjusting the learned values towards a mean RUL value. Also, we point out that FD001 is the dataset containing only one operating condition and fault mode. That is, we are trying to transfer to domains whose fault modes and conditions are not present in the source domain. Similar to other results, this proves to be a much harder problem for the proposed methodology. When the target domain is FD003 (Fig. 5(k)), we are attempting to learn in a domain with one fault mode and predict in a domain with two fault modes and similar operating conditions. For this case, the model results in RUL prediction curves that fit the trend of the observed RUL values with a lower RMSE than the SOURCEONLY model.
We notice, in Table 4, that the proposed methodology is capable of improving performance over the SOURCEONLY models in all but one of our experiments. The results of the adapted models change depending on the information contained in the dataset acting as source domain. We achieve better results when the FD004 dataset acts as source domain, as it contains all 6 operating conditions and 2 fault modes. Also, when the source and target distributions are similar, a model trained on the source domain with no adaptation can already achieve strong performance on the learning task.
In particular, using LSTMDANN does not add much value when the distributions between domains are very similar and the source domain contains more operating conditions and fault modes than the target domain (e.g. FD004 to FD002). For this example, even if we had access to the ground truth values in the target domain, the performance (RMSE) among the SOURCEONLY, LSTMDANN and TARGETONLY models would not differ considerably, unlike in other experiments. Moreover, among our 10 runs there were cases when the LSTMDANN found better RMSE values than those found by SOURCEONLY, indicating that there may be better hyperparameters than the ones proposed in this paper that would result in lower RMSE values.
For the other cases, the results show that when the source data “contain” the target data, the adaptation achieves lower RMSE results and a better fit. It is expected that the observed features in the source domain can be used to improve predictions in the target domain; thus, having a source domain with similar degradation data helps learning. On the other hand, learning from a domain with fewer operating conditions and fault modes (e.g. FD001) to one with more conditions and fault modes proves to be a harder task. The model can adjust for the mean RUL value but fails to learn the linear degradation model that emerges from the target domain. However, the results still prove to be better estimations of the RUL in a target domain when compared to SOURCEONLY models.
As a rule of thumb, the results presented show that the methodology can be useful whenever one has similar degradation data with a degree of distribution shift across domains. The degree to which the distributions vary can determine how accurate an adaptation will be, but further investigation has to be done to provide a reliable estimate of “when” to transfer. The model achieves the best results when enough source data under different conditions have been observed. In practice, one could use simulated data acting as a source domain with various fault scenarios. In many applications, real-world data is dissimilar to simulation data due to noise and unexpected sensor behaviour. To improve the models, one could use the LSTMDANN method to adapt from simulated data to real-world data and achieve better results in RUL predictions.
6.2 Comparison to Domain Adaptation Approaches
To assess the quality of the model in transferring the degradation patterns from a source to a target domain, we test two well-known methodologies for unsupervised domain adaptation: Transfer Component Analysis (TCA) and CORrelation ALignment (CORAL).

TCA
Transfer Component Analysis pan2011domain is a well-known unsupervised domain adaptation method that focuses on finding a common shared feature representation between source and target domains. It transfers components across domains in a Reproducing Kernel Hilbert Space (RKHS), using the Maximum Mean Discrepancy (MMD) and different kernels to construct a feature space that minimises the difference between the domains. We use the TCA feature representation to train a shallow neural network with one fully connected layer and 32 units (TCANN) and a deep feed-forward neural network (TCADNN) containing the same number of layers and units as our final models. In our tests, we apply radial basis function (RBF) kernels, extracting 20 transfer components.
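For intuition, the MMD quantity that TCA minimises can be computed directly. The sketch below evaluates the (biased) squared MMD with an RBF kernel on toy Gaussian samples; the bandwidth `gamma`, the sample sizes and the shift magnitude are illustrative choices of ours, not the paper's settings.

```python
import numpy as np

def mmd_rbf(Xs, Xt, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy
    between samples Xs and Xt under an RBF kernel."""
    def k(A, B):
        # Pairwise squared distances, then RBF kernel.
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    return k(Xs, Xs).mean() + k(Xt, Xt).mean() - 2 * k(Xs, Xt).mean()

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. a mean-shifted one.
same = mmd_rbf(rng.normal(size=(200, 3)), rng.normal(size=(200, 3)))
shifted = mmd_rbf(rng.normal(size=(200, 3)), rng.normal(2.0, 1.0, size=(200, 3)))
```

A large MMD between source and target features signals the kind of distribution shift that the adaptation methods try to remove.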
CORAL
CORrelation ALignment (CORAL) sun2016return minimises domain shift by aligning the second-order statistics of the source and target distributions, without requiring target labels. After the alignment is found, the new feature space can be used to train a model on the transformed source domain. Similar to the TCA method, we use this method in combination with shallow (CORALNN) and deep (CORALDNN) neural network architectures.
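The CORAL alignment itself is compact enough to sketch: whiten the source features with their own covariance, then re-colour them with the target covariance. The eigendecomposition-based matrix square root and the small regulariser `eps` are implementation choices of this sketch, not prescribed by the method.

```python
import numpy as np

def coral_transform(Xs, Xt, eps=1e-5):
    """Align source features to the target's second-order statistics."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    def sqrtm(C, inv=False):
        # Matrix (inverse) square root of an SPD matrix via eigh.
        w, V = np.linalg.eigh(C)
        p = -0.5 if inv else 0.5
        return (V * w**p) @ V.T

    # Whiten with the source covariance, re-colour with the target's.
    return Xs @ sqrtm(Cs, inv=True) @ sqrtm(Ct)

rng = np.random.default_rng(1)
Xs = rng.normal(size=(500, 4)) * np.array([1.0, 5.0, 0.5, 2.0])
Xt = rng.normal(size=(500, 4))
Xs_aligned = coral_transform(Xs, Xt)
```

After the transform, the covariance of the source features matches that of the target, and a regressor can be trained on the aligned source data.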
For all compared methods in this study, the input data are the same, the mean squared error (MSE) is used as the loss function and the Adam kingma2014adam optimisation algorithm is applied with a learning rate of 0.001.
TCA and CORAL provide out-of-the-box methods that can be easily applied in a domain adaptation setup where no observed values are available. Both methods are not entirely suitable for temporal data; for this reason, we perform domain adaptation with TCA and CORAL using the input features and RUL at time $t$. After the new features are computed, we train an FFNN model using the same number of units as in the SOURCEONLY models. The results on the target cross-validation datasets (not seen during training) are summarised in Table 5. It can be seen that, on average, our proposed method yields a lower RMSE than the tested methodologies in all but one experiment pair. This supports that the proposed deep adaptation architecture is well suited for the studied prognostics prediction problem.
The LSTM structure in combination with the adversarial classification loss is capable of extracting useful temporal features from multivariate time series to perform adaptation between source and target domain pairs. In other words, the methodology performs better than the tested out-of-the-box adaptation methods not tailored for time-series sensor data. The results provide a foundation for using the proposed domain adaptation method in cases when one has limited observed RUL data in one domain, but is concerned with predicting RUL targets in a similar domain with distinct operating conditions and fault modes.
Source: FD001
| Target | TCANN | TCADNN | CORALNN | CORALDNN | LSTMDANN |
| FD002 | 94.1 ± 1.0 | 90.0 ± 2.9 | 99.2 ± 3.6 | 77.5 ± 4.6 | 46.4 ± 3.6 |
| FD003 | 120.0 ± 1.0 | 116.1 ± 1.0 | 60.0 ± 0.7 | 69.6 ± 5.2 | 37.3 ± 3.4 |
| FD004 | 120.1 ± 1.0 | 113.8 ± 6.9 | 107.7 ± 2.8 | 84.6 ± 7.0 | 43.5 ± 5.3 |

Source: FD002
| Target | TCANN | TCADNN | CORALNN | CORALDNN | LSTMDANN |
| FD001 | 94.7 ± 1.1 | 85.6 ± 5.5 | 77.9 ± 19 | 80.9 ± 9.4 | 31.2 ± 5.4 |
| FD003 | 107.4 ± 3.7 | 111.5 ± 7.2 | 60.9 ± 15.1 | 79.8 ± 10.1 | 32.2 ± 3.1 |
| FD004 | 93.5 ± 2.8 | 94.4 ± 6.7 | 37.5 ± 0.5 | 43.6 ± 3.6 | 27.7 ± 1.5 |

Source: FD003
| Target | TCANN | TCADNN | CORALNN | CORALDNN | LSTMDANN |
| FD001 | 98.7 ± 0.4 | 90.5 ± 4.6 | 26.5 ± 0.5 | 26.5 ± 1.9 | 30.6 ± 6.2 |
| FD002 | 90.5 ± 0.3 | 80.8 ± 4.3 | 113.2 ± 4.5 | 75.6 ± 9.5 | 43.1 ± 1.4 |
| FD004 | 78.9 ± 5.3 | 102.6 ± 3.2 | 113.9 ± 5.5 | 77.2 ± 9.1 | 49.7 ± 9.1 |

Source: FD004
| Target | TCANN | TCADNN | CORALNN | CORALDNN | LSTMDANN |
| FD001 | 98.5 ± 0.4 | 85.6 ± 5.0 | 119.1 ± 16.7 | 94.0 ± 8.8 | 25.4 ± 4.2 |
| FD002 | 75.3 ± 1.7 | 80.8 ± 5.8 | 37.3 ± 0.6 | 30.9 ± 1.4 | 26.9 ± 3.3 |
| FD003 | 77.2 ± 6.0 | 102.9 ± 2.7 | 68.1 ± 11.1 | 68.6 ± 11.2 | 23.6 ± 5.0 |
6.3 Comparison to Nonadapted RUL Prediction Approaches Using Target Domain Labels
We provide the results of TARGETONLY models in Table 4 as the best-case performance, where the RUL target values are used to train the models. That is, these are the best results found if we could use the RUL values in the target domain for training. We also present a comparison of the TARGETONLY models to the state-of-the-art results on the CMAPPS datasets for the RMSE and Scoring performance metrics in Tables 6 and 7. We compare our method to those in the literature to attest that our model shows results similar to other methods proposed in the literature when presented with a completely labelled dataset.
| Dataset | TARGETONLY | GA + LSTM ListouEllefsen2019 | CNN + FFNN Li2018 | MODBNE zhang2017multiobjective | LSTM + FFNN Zheng2017 |
| FD001 | 13.64 | 12.56 | 12.61 | 15.04 | 16.14 |
| FD002 | 17.76 | 22.73 | 22.36 | 25.05 | 24.49 |
| FD003 | 12.49 | 12.10 | 12.64 | 12.51 | 16.18 |
| FD004 | 21.30 | 22.66 | 23.31 | 28.66 | 28.17 |
We compare our TARGETONLY model with the LSTM methods proposed by ListouEllefsen2019 (GA + LSTM) and Zheng2017 (LSTM + FFNN). We also compare to the Convolutional Neural Network (CNN + FFNN) methodology proposed by Li2018 and the Multiobjective Deep Belief Networks Ensemble (MODBNE) proposed by zhang2017multiobjective. In Table 6 we report the results of our experiments and the results reported in the original papers of the compared approaches. We notice that our model provides the best known RMSE results for datasets FD002 and FD004, showing that the proposed method can yield strong performance on the datasets with more operating conditions. For datasets FD001 and FD003 we produce results similar to the best known performance models proposed in the literature for both the RMSE and Scoring metrics. We note that these methods cannot be directly used for datasets without labels, as they are trained and tested in the same domain.

| Dataset | TARGETONLY | GA + LSTM ListouEllefsen2019 | CNN + FFNN Li2018 | MODBNE zhang2017multiobjective | LSTM + FFNN Zheng2017 |
| FD001 | 300 | 231 | 274 | 334 | 338 |
| FD002 | 1,638 | 3,366 | 10,412 | 5,585 | 4,450 |
| FD003 | 267 | 251 | 284 | 422 | 852 |
| FD004 | 2,904 | 2,840 | 12,466 | 6,558 | 5,550 |
7 Discussion
7.1 Relationship to Standardisation
A common, straightforward way to attempt to align source and target domains and reduce the difference between the means and variances of the input distributions is standardisation: locally mean-centering (subtracting the mean) and dividing by the standard deviation of each input feature. This leads each feature in the data to have zero mean (moment matching) and unit variance.
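This transformation is the familiar z-score; a minimal sketch follows, where the guard for constant features is our own addition.

```python
import numpy as np

def standardise(X):
    """Zero-mean, unit-variance scaling of each feature (moment matching)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Guard against constant sensors to avoid division by zero.
    sigma = np.where(sigma > 0, sigma, 1.0)
    return (X - mu) / sigma

# Toy matrix: 4 samples, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
Z = standardise(X)
```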
To test whether such a transformation already suffices as an alignment strategy, we standardise the data before feeding it to the SOURCEONLY and TARGETONLY architectures. We select FD004 as source domain, as results have shown that adaptation is possible for all remaining CMAPPS datasets. We present the RMSE values for each methodology evaluated on the test target datasets in Table 8. We compare the original LSTMDANN trained on the min-max normalised data with a model trained on the standardised data (LSTMDANNSTD) and with the reference models SOURCEONLY and TARGETONLY.
Source: FD004
| Target | SOURCEONLYSTD | LSTMDANN | LSTMDANNSTD | TARGETONLYSTD |
| FD001 | 53.31 ± 5.02 | 31.54 ± 2.42 | 32.62 ± 2.07 | 14.51 ± 1.55 |
| FD002 | 23.22 ± 1.01 | 24.93 ± 1.82 | 21.78 ± 1.71 | 18.44 ± 0.42 |
| FD003 | 62.76 ± 7.52 | 27.84 ± 2.69 | 40.20 ± 7.03 | 16.03 ± 0.33 |
The results in Table 8 show that, on average, the RMSE performance of SOURCEONLY models is considerably improved when the data is normalised to zero mean and unit variance. However, our proposed methodology (LSTMDANNSTD) still outperforms the baseline models and provides a better fit to the target data (Figure 7). We point out that training on standardised data causes the LSTMDANNSTD models to saturate and start overfitting much faster than in previous experiments. This effect negatively impacted the adaptation performance on FD003, as training progressed for more epochs than necessary. To address this effect, simple changes to the model's hyperparameters could be made, for example lowering the value of λ and reducing the learning rates.
8 Conclusion
In this paper, a deep learning method for domain adaptation in prognostics is proposed based on a Long Short-Term Memory Neural Network and a Domain Adversarial Neural Network (LSTMDANN). We use time windows to incorporate long-term sequences in the feature extraction layers. Experiments are carried out on the popular CMAPPS dataset to show the effectiveness of the proposed method. The goal of the task is to estimate the remaining useful lifetime for aircraft engine units accurately while transferring from a source domain with observed RUL values to a target domain with only input features.
We normalise the datasets independently to allow for higher distribution shifts across domains. It is worth noticing that locally normalising the data to zero mean and unit variance improves the performance of SOURCEONLY methods on the target datasets. Nevertheless, we focus on using the raw features and an adequate training procedure to find the weight representation that can accommodate the original distribution shift between domains. In general, we are capable of achieving lower errors between the prediction and the actual RUL value in comparison to a model with no adaptation features. We notice that RUL prediction can be more effectively transferred from datasets that have more fault modes or operating conditions than their target counterparts. On the other hand, transferring from a dataset with fewer operating conditions and fault modes to one with more is a much harder task. However, even in this harder scenario, the model is able to correct the RUL predictions and reduce the RMSE without utilising the target RUL values. Furthermore, the prognostic results obtained by the proposed method are compared with out-of-the-box domain adaptation models. In our tests, the proposed network has shown superior performance in comparison to the other, simpler domain adaptation methods. We point out that no thorough evaluation of domain adaptation methods was performed, as most domain adaptation methodologies do not focus on sequential temporal data. Instead, we focused on methods that require little domain adaptation knowledge and architecture tweaking for comparison.
Additionally, it should be noted that in common real-world online applications, whole lifespan data are usually not available. As future research, we argue that domain adaptation methods should be able to accommodate incomplete data coming from the target domain. As is the case in many real-world scenarios, the sooner an accurate RUL prediction can be made, the earlier actions can be planned to prevent equipment downtime. Furthermore, the network could be made to retain hidden states for a longer period of time by allowing states to flow between minibatches of data. Such a modification would allow the network to “remember” for longer periods while training. This modification comes with a cost: padding and the number of training examples within a batch have to be carefully selected to allow states to propagate between batches. In our experiments, the varying sizes of the time sequences and the number of training examples between domains led to minibatches of a small size, which made training particularly slow.
While promising experimental results have been obtained by the proposed method, further architecture optimisation is still necessary, as the hyperparameter search was restricted. Also, deep learning methods generally suffer from a high computing burden, so using larger datasets could become computationally intractable. In addition, we note that the Scoring function is only used for evaluation purposes; no optimisation steps are taken towards its minimisation. Further development could incorporate the function into the learning algorithm.
9 Acknowledgments
This work was supported by the “Netherlands Organisation for Scientific Research” (NWO), project NWO Big Data - Real Time ICT for Logistics, number 628.009.012.