1 Introduction
Complex industrial equipment such as engines, turbines, and aircraft are typically instrumented with a large number of sensors that result in multivariate time series data. Most deep learning approaches model such multivariate time series data using variants of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), e.g. [7, 12, 1, 18, 10]. These approaches often follow an “end-to-end” design philosophy which emphasizes minimal a priori assumptions about the system [9], and therefore ignore or fail to leverage explicit structure. However, in most industrial setups, a complex equipment has a well-defined and documented structure: it consists of multiple interconnected modules, with the dynamics of one module affecting the dynamics of other modules. Fig. 1 shows an aircraft turbofan engine with several interconnected modules.
Existing deep learning approaches fail to leverage the underlying structure of complex equipment, and lack the explicit capacity to reason about inter-component relations in order to make decisions over a structured representation of the sensor data.
Recently, a class of models for modeling and reasoning over graphs has been proposed. These include graph neural networks (GNNs) [15, 11] and their recent generalization in the form of graph networks [2]. This class of neural networks operates on graphs and structures its computations accordingly. GNNs provide the desired relational inductive bias [2, 6] to solve problems with an underlying structure. For instance, NerveNet [17] uses Gated GNNs (GGNNs) [11] to learn structured policies by explicitly modeling the structure of the agent; these policies outperform the policies learned by models such as multilayer perceptrons (MLPs) that simply concatenate all the observations from the environment.
In this work, we explore the applicability of GGNNs to explicitly model the underlying graph-structured mechanism of IoT-enabled complex equipment. We represent the structure of a particular complex equipment as a directed graph where each node corresponds to a subset of sensors (e.g. those from the same module), and an edge models the relationship or dependency between two nodes or subsets of sensors (e.g. the dependence of a module on another module). Effectively, the multivariate time series of sensor data is then represented in the graph domain to learn a GGNN model. We use remaining useful life (RUL) estimation [16] as a target application to validate our approach.
The key contributions of this work can be summarized as follows:

We propose an approach to capture the knowledge of the structure of a complex equipment by using GGNNs. To the best of our knowledge, this is the first study to evaluate the effect of introducing such a relational inductive bias into deep learning approaches for equipment health monitoring.

We show the advantage of informed modularized processing of the multi-sensor time series data by grouping the sensors into meaningful subsets guided by the underlying graph structure and modules, rather than the commonly used approach that concatenates the observations from all sensors into one multivariate time series.

We provide insights into the working of GGNNs for RUL estimation: we observe that the modularized processing of multivariate time series using GGNNs followed by a simple attention mechanism for aggregating information across modules can potentially allow the network to focus more on the modules with impending failures.
2 Related Work
Recent works in [13, 17] implement an inductive bias for object- and relation-centric representations of complex dynamical systems such as a pendulum, cart-pole, toy cheetah, etc. In this work, we draw inspiration from such approaches, and show that leveraging such an inductive bias can improve performance in IIoT-enabled health monitoring applications, e.g. RUL estimation. To the best of our knowledge, this is the first attempt to model multivariate time series data for equipment health monitoring applications while leveraging the underlying structure and semantics of the equipment.
Another line of research focuses on incorporating the semantics of the problem into the structure of deep learning models by using ontologies. For instance, [8] propose using dense layers connected as per the ontology of the manufacturing line, followed by an RNN at the top to capture the temporal dependencies. Similarly, [19] attempt to model an equipment as a graph of sensor nodes: they assume a fully connected graph where nodes correspond to sensors, and edges capture the dependencies across sensors. However, they do not explicitly model the dependence between the various modules of an equipment. Our work can be seen as a generalization of these approaches as it uses the structure of the equipment to guide the data processing.
Several variants of deep neural networks, including CNNs and RNNs, have been proposed for equipment health monitoring and RUL estimation, e.g. [7, 12, 1, 10, 5]. However, most of these approaches consider a flat concatenation of the readings or observations from all the sensors as a multivariate input to the neural network, and ignore the structure of the underlying system or mechanism from which the data is generated. In this work, we show that grouping the sensors into meaningful interdependent subsets and processing them separately before the final concatenation step yields superior performance.
3 Problem Setup
We consider the scenario where a complex equipment consists of multiple modules (subsystems) connected to each other in a known fashion. Let $\mathcal{S}$ denote the set of all the sensors installed to monitor various parameters across the various modules of the equipment. The dynamical behavior of any module is observed via the multivariate time series corresponding to a fixed and known subset of sensors (parameters) associated with that module. We represent the equipment as a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $v_i \in \mathcal{V}$ (for $i = 1, \ldots, |\mathcal{V}|$) is a node in the graph that corresponds to a subset of sensors $\mathcal{S}_i \subseteq \mathcal{S}$ associated with the module indexed by $i$, and $e_{ij} \in \mathcal{E}$ is a directed edge from node $v_i$ to $v_j$ that models the influence of $v_i$ on $v_j$. Note that, in general, $\mathcal{S}_i \cap \mathcal{S}_j$ need not be empty, such that a sensor can be associated with more than one node: for example, a sensor measuring the ambient temperature can be associated with all nodes.
We consider the supervised learning setting: we are given a fixed graph structure $\mathcal{G}$ and a learning set of $N$ time series $\{(\mathbf{X}_n, y_n)\}_{n=1}^{N}$, where $y_n$ is the target value, and $\mathbf{X}_n$ denotes the multivariate time series associated with the nodes of $\mathcal{G}$. Here, $\mathbf{x}_i$ denotes the $d_i$-dimensional multivariate time series corresponding to node $v_i$ for time $1, \ldots, t$, where $d_i = |\mathcal{S}_i|$ denotes the number of sensors in $\mathcal{S}_i$. For the RUL estimation task, the time series are collected from one or more instances (installations) of an equipment with structure $\mathcal{G}$, and the target variable corresponds to the RUL value at time $t$, e.g. in terms of remaining cycles of operation or remaining operational hours. RUL estimation is then a metric regression task where the goal is to map $\mathbf{X}$ to the RUL $y$. Let $T$ denote the total operational life of an instance till the failure point, s.t. at any time $t \leq T$, the target RUL is given by $y = T - t$. Furthermore, as in a typical practical setting, we assume that all instances of the equipment have the same underlying graph structure $\mathcal{G}$, i.e. the different modules of the equipment are connected to each other in the same fashion.
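The graph representation described above can be sketched in code. The module names, sensor groupings, and row-normalization scheme below are illustrative assumptions, not the exact configuration used in the paper:

```python
import numpy as np

# Hypothetical 3-module equipment: node names and their sensor subsets
# (indices into the full multivariate time series). A sensor may belong
# to more than one node, e.g. an ambient-temperature sensor.
nodes = {
    "fan":        [0, 1, 9],      # sensor 9 = ambient temperature, shared
    "compressor": [2, 3, 4, 9],
    "turbine":    [5, 6, 7, 8, 9],
}
# Directed edges model the influence of one module on another.
edges = [("fan", "compressor"), ("compressor", "turbine")]

names = list(nodes)
n = len(names)
A_in = np.zeros((n, n))   # A_in[i, j] = 1 if there is an edge j -> i
A_out = np.zeros((n, n))  # A_out[i, j] = 1 if there is an edge i -> j
for src, dst in edges:
    i, j = names.index(dst), names.index(src)
    A_in[i, j] = 1.0
    A_out[j, i] = 1.0

def normalize_rows(a):
    # row-normalize so aggregated messages are averages over neighbors
    deg = a.sum(axis=1, keepdims=True)
    return np.divide(a, deg, out=np.zeros_like(a), where=deg > 0)

A_in, A_out = normalize_rows(A_in), normalize_rows(A_out)

# Slice the full T x |S| series into per-node multivariate series.
T, num_sensors = 50, 10
series = np.random.randn(T, num_sensors)
node_series = {name: series[:, idx] for name, idx in nodes.items()}
print(node_series["turbine"].shape)  # (50, 5)
```

Each node's sliced series is what the per-node encoder of Section 4 would consume, while the two normalized adjacency matrices drive the message passing.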
4 Approach
As illustrated in Fig. 1(a), each multivariate time series $\mathbf{x}_i$ is processed by a neural network $f_i$ (for $i = 1, \ldots, |\mathcal{V}|$) to obtain a fixed-dimensional initial node representation vector $\mathbf{h}_i^0$. This initial representation is then updated using the representations of the neighboring nodes defined by $\mathcal{G}$ using a message passing mechanism to obtain $\mathbf{h}_i^K$. Finally, an attention mechanism is used to combine the final node representations to obtain an RUL estimate for the equipment. We refer to this approach as GNMR (Gated Graph Neural Networks for Metric Regression). The entire computation flow of GNMR is differentiable end-to-end, and the associated parameters are learned via stochastic gradient descent. Next, we describe these steps in more detail.
Learning Node Representations from Time Series
The $d_i$-dimensional time series $\mathbf{x}_i$ at node $v_i$ is processed by $f_i$ to obtain $\mathbf{h}_i^0$. We consider a gated recurrent units (GRU)-based RNN [3] for processing this time series. In general, $d_i$ is different across the nodes, implying that we would need to learn and maintain $|\mathcal{V}|$ GRU networks. This can pose scalability issues for graphs with a large number of nodes. It is, therefore, desirable to use a common GRU, which we refer to as $f_{GRU}$, to process the multivariate time series from all the nodes so as to keep the number of trainable parameters of the network within manageable limits. To this end, any point $\mathbf{x}_{i,\tau}$ ($\tau = 1, \ldots, t$) at node $v_i$ is first processed via a node-specific feedforward network $f_i^{FF}$ to obtain a fixed $d'$-dimensional vector. Note that $d'$ is the same across nodes, allowing us to use the common $f_{GRU}$ for further processing of the resulting time series to obtain the initial representation $\mathbf{h}_i^0$. Effectively, $f_i$, consisting of $f_i^{FF}$ and $f_{GRU}$, maps the input time series $\mathbf{x}_i$ to $\mathbf{h}_i^0$.
Message Passing Across Nodes
While the node-level representation $\mathbf{h}_i^0$ obtained from $f_i$ can capture the dependencies across the sensors in $\mathcal{S}_i$, it ignores the dependencies across nodes. It is desirable to leverage the representations of neighboring nodes to capture the dependencies between nodes, and then aggregate them to obtain a representation for the overall dynamics of the system. To achieve this, the representations for each node are iteratively updated by a GGNN using the representations of the neighboring nodes, as described next.
Consider two normalized adjacency matrices $A^{in}$ and $A^{out}$ corresponding to the incoming and outgoing edges in graph $\mathcal{G}$, as illustrated in Fig. 1(a). Let $A_i^{in}$ correspond to the $i$-th row of matrix $A^{in}$, and $\mathbf{h}_i^k$ denote the representation (or embedding) for node $v_i$ after $k$ message propagation steps. GGNN takes $A^{in}$, $A^{out}$, and the initial node representations $\{\mathbf{h}_i^0\}$ as input, and returns an updated set of representations $\{\mathbf{h}_i^K\}$ after $K$ iterations of message propagation across the nodes in the graph, s.t. $\{\mathbf{h}_i^K\} = \mathcal{F}(\{\mathbf{h}_i^0\}, A^{in}, A^{out}; \theta_{\mathcal{F}})$, where $\theta_{\mathcal{F}}$ represents the parameters of the GGNN function $\mathcal{F}$. For any node $v_i$ in the graph and message propagation step $k$, the previous representation $\mathbf{h}_i^{k-1}$ of the node and the aggregated representation of its neighboring nodes (as obtained via Eqs. 1-3 below) are used to iteratively update the representation of the node $K$ times.
More specifically, the representation of node $v_i$ in the $k$-th message propagation step is updated as follows:

$\mathbf{m}_{ij}^{in,k} = g_{in}^{ij}(\mathbf{h}_j^{k-1}; \theta_{in}^{ij})$   (1)
$\mathbf{m}_{ij}^{out,k} = g_{out}^{ij}(\mathbf{h}_j^{k-1}; \theta_{out}^{ij})$   (2)
$\mathbf{a}_i^k = [A_i^{in} M_i^{in,k} ; A_i^{out} M_i^{out,k}]$   (3)
$\mathbf{z}_i^k = \sigma(W^z \mathbf{a}_i^k + U^z \mathbf{h}_i^{k-1})$   (4)
$\mathbf{r}_i^k = \sigma(W^r \mathbf{a}_i^k + U^r \mathbf{h}_i^{k-1})$   (5)
$\tilde{\mathbf{h}}_i^k = \tanh(W \mathbf{a}_i^k + U (\mathbf{r}_i^k \odot \mathbf{h}_i^{k-1}))$   (6)
$\mathbf{h}_i^k = (1 - \mathbf{z}_i^k) \odot \mathbf{h}_i^{k-1} + \mathbf{z}_i^k \odot \tilde{\mathbf{h}}_i^k$   (7)

where $A_i^{in}$ and $A_i^{out}$ denote the $i$-th row of $A^{in}$ and $A^{out}$, respectively. Here, $A^{in}$ and $A^{out}$ allow to capture the information from the upstream and downstream nodes in the system, respectively. $g_{in}^{ij}$ denotes feedforward ReLU layer(s) with parameters $\theta_{in}^{ij}$ that computes the contribution (message) from $v_j$ to $v_i$ if there is an incoming edge from $v_j$ to $v_i$, i.e. when $A_{ij}^{in} \neq 0$. Then, $\mathbf{m}_{ij}^{in,k}$ denotes the message from $v_j$ to $v_i$ corresponding to the edge $e_{ji}$. Similarly, $g_{out}^{ij}$ computes the message from $v_j$ to $v_i$ if there is an outgoing edge from $v_i$ to $v_j$, i.e. when $A_{ij}^{out} \neq 0$. $M_i^{in,k}$ and $M_i^{out,k}$ denote the matrices whose rows contain the messages from the incoming and outgoing edges with $v_i$ as ending and starting node, respectively. For $K = 0$, $\mathcal{F}$ simply returns $\{\mathbf{h}_i^0\}$. The trainable parameters $W^z$, $U^z$, $W^r$, $U^r$, $W$, and $U$ of appropriate dimensions, together with the $\theta_{in}^{ij}$s and $\theta_{out}^{ij}$s, constitute $\theta_{\mathcal{F}}$; $\sigma$ is the sigmoid function, and $\odot$ is the element-wise multiplication operator. Eqs. 4-7 are the computations equivalent to a gated recurrent unit (GRU) network.
Note that for large graphs with many nodes and edges, the number of edge functions and their associated parameters can be large. However, in many practical applications such as a power plant or a water treatment plant [4], nodes typically have an associated type, with more than one node belonging to the same type, e.g. multiple water tanks in a water treatment plant. Under such scenarios, it may be suitable to tie the parameters of the edge functions for a given pair of node types.
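As a rough NumPy illustration of Eqs. 1-7, the sketch below runs $K$ propagation steps over a small graph. For brevity it uses a single shared linear-ReLU edge function for all edges (rather than edge-specific functions), and all weights are random placeholders rather than learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4  # number of nodes, embedding dimension

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder parameters: shared edge functions (Eqs. 1-2, one ReLU layer
# here for brevity) and the GRU-style update weights of Eqs. 4-7.
W_in, W_out = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wz, Uz = rng.normal(size=(2 * d, d)), rng.normal(size=(d, d))
Wr, Ur = rng.normal(size=(2 * d, d)), rng.normal(size=(d, d))
W, U = rng.normal(size=(2 * d, d)), rng.normal(size=(d, d))

def propagate(H, A_in, A_out, K):
    """Run K message-propagation steps over node embeddings H (n x d)."""
    for _ in range(K):
        # Eqs. 1-2: messages computed from each node's previous embedding
        M_in = np.maximum(H @ W_in, 0.0)    # ReLU edge function
        M_out = np.maximum(H @ W_out, 0.0)
        # Eq. 3: aggregate neighbor messages via the normalized adjacencies
        a = np.concatenate([A_in @ M_in, A_out @ M_out], axis=1)  # n x 2d
        # Eqs. 4-7: GRU-style gated update of each node embedding
        z = sigmoid(a @ Wz + H @ Uz)
        r = sigmoid(a @ Wr + H @ Ur)
        h_cand = np.tanh(a @ W + (r * H) @ U)
        H = (1.0 - z) * H + z * h_cand
    return H  # K = 0 returns the initial embeddings unchanged

H0 = rng.normal(size=(n, d))
A_in = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float)  # chain graph
A_out = A_in.T
H = propagate(H0, A_in, A_out, K=2)
print(H.shape)  # (3, 4)
```

Note that `propagate` with `K=0` reproduces the special case discussed in the experiments, where no information is exchanged across nodes.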
Attention Mechanism to Aggregate Node-level Representations
The final representations $\{\mathbf{h}_i^K\}$ can be aggregated to get a graph-level output (the RUL estimate in our case) by using an attention mechanism, e.g. as used in [11]. In this work, we consider a simple variant of this attention mechanism: for each node, we use the concatenated vector $[\mathbf{h}_i^K; \mathbf{h}_i^0; \mathbf{o}_i; t]$ as input to two parallel feedforward layers $g_1$ and $g_2$ to obtain $e_i$ and $\tilde{y}_i$, respectively. Here, $\mathbf{o}_i$ is a one-hot vector of length $|\mathcal{V}|$ that is set to 1 at the $i$-th position, and 0 otherwise. Also, the elapsed time $t$ is used as an additional input for the RUL estimation task, as the total life passed can be a useful feature to estimate the wear-and-tear of the system. We apply softmax over the values $e_1, \ldots, e_{|\mathcal{V}|}$ from $g_1$ to obtain the attention weight $\alpha_i$ for node $v_i$. The final RUL estimate is then given by

$\hat{y} = \sum_{i=1}^{|\mathcal{V}|} \alpha_i \tilde{y}_i$   (8)

This can be interpreted as assigning a weightage $\alpha_i$ to node $v_i$, while $\tilde{y}_i$ denotes the RUL estimate as per node $v_i$.
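The aggregation of Eq. 8 reduces to a softmax-weighted sum. In the toy sketch below the per-node attention scores and per-node RUL estimates are made-up numbers standing in for the outputs of the two feedforward layers:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative values for a 3-node graph: unnormalized attention scores
# and node-level RUL estimates (stand-ins for the two parallel layers).
e = np.array([0.5, 2.0, -1.0])
y_node = np.array([90.0, 60.0, 120.0])

alpha = softmax(e)                     # attention weight per node
rul_estimate = float(alpha @ y_node)   # Eq. 8: weighted sum
print(round(rul_estimate, 1))          # ≈ 67.6
```

Because the weights sum to one, the graph-level estimate is a convex combination of the node-level estimates, which is what makes the weights interpretable as per-module importances.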
Given the training set $\{(\mathbf{X}_n, y_n)\}_{n=1}^{N}$, the loss function used for training is then given by $\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2$, with the learnable parameters being $\theta_{\mathcal{F}}$, the parameters of $f_{GRU}$, and the parameters of the feedforward layers (the $f_i^{FF}$s, $g_1$, and $g_2$) with leaky ReLU units.

5 Experimental Evaluation
We investigate if GNMR is able to leverage the graph structure to improve upon other strong baselines from the literature that consider a simple concatenation of observations as a multivariate input: CNNs [1, 10], LSTMs [20], and Deep Belief Networks [18]. We additionally implement GRU-MR: a GRU-based RUL estimation model on the lines of the metric regression approach proposed in [20], as depicted in Fig. 1(b). We ensure comparable hyperparameter settings as well as the same train, validation and test splits for GRU-MR and the proposed GNMR, as detailed later. We further study the sensitivity of the approach to the knowledge of the graph structure by synthetically combining or segregating nodes (modules) in the original graph structure. To study other ways of combining the time series of sensors, we also consider reducing the number of input dimensions for the GRU via principal components analysis within the GRU-MR framework, and refer to it as PCA-GRU-MR. We report results for 5 PCA components (capturing 85% of the variance in the data), i.e. a 5-dimensional input to GRU-MR, in Table 1.
Datasets: We use the four publicly available aircraft turbofan engine benchmark datasets FD001-FD004, as introduced in [14] (available at https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognosticdatarepository/#turbofan). We use the equipment structure information as depicted in Figs. 1 and 3 (refer [14] for details). Each dataset contains a predefined train-test split. We further use a random 80-20 split of the original train split to obtain a train and a validation set. The held-out validation set is used for hyperparameter tuning.
Table 1: RUL estimation performance (RMSE and Timeliness Score S) on the four datasets, with average ranks across datasets.

Dataset               FD001         FD002          FD003         FD004          Average Rank
Method                RMSE   S      RMSE   S       RMSE   S      RMSE   S       RMSE  S
CNN-MR [1]            18.45  1,287  30.29  13,570  19.82  1,596  29.16  7,886   8     7.75
LSTM-MR [20]          16.14  338    24.49  4,450   16.18  852    28.17  5,550   6.5   5.25
MODBNE [18]           15.04  334    25.05  5,585   12.51  422    28.66  6,558   4.75  5.25
CNN + FNN [10]        12.61  274    22.36  10,412  12.64  284    23.31  12,466  3.25  4.5
GRU-MR (ours)         15.36  481    22.43  3,391   12.52  339    22.96  2,964   3.75  3.75
PCA-GRU-MR (ours)     15.60  469    22.92  3,916   13.37  860    22.41  2,637   5     4.5
GNMR (proposed, K=0)  12.73  302    21.38  3,148   13.06  366    21.81  3,414   2.75  2.75
GNMR (proposed)       12.14  212    20.85  3,196   13.23  370    21.34  2,795   2     2.25
Table 2: Effect of varying the graph structure; |V| denotes the number of nodes.

Dataset                         FD001        FD002         FD003         FD004         Average Rank
Method                    |V|   RMSE   S     RMSE   S      RMSE   S      RMSE   S      RMSE  S
Original Graph            8     12.14  212   20.85  3,196  13.23  370    21.34  2,795  1.25  1.5
One node for all sensors  1     13.20  321   22.36  4,072  13.65  378    22.04  2,747  4.25  3
Reduced Nodes             4     12.41  299   22.85  4,798  15.26  981    21.87  3,034  4.25  4.5
Increased Nodes           13    12.15  214   22.20  4,512  13.63  623    20.95  3,813  2.25  3.75
One node per sensor       21    13.86  293   21.99  3,317  13.39  431    21.48  2,562  3     2.25
Performance Metrics: We use the standard metrics RMSE and Timeliness Score (S) as introduced in [14]. Let $d_n = \hat{y}_n - y_n$ denote the error in the RUL estimate for the $n$-th test instance. Then $RMSE = \sqrt{\frac{1}{N}\sum_{n=1}^{N} d_n^2}$ and $S = \sum_{n=1}^{N} s_n$, where $N$ is the number of test instances, and $s_n = e^{d_n/\tau_2} - 1$ if $d_n \geq 0$, else $s_n = e^{-d_n/\tau_1} - 1$. Usually, $\tau_2 < \tau_1$, such that late predictions are penalized more compared to early predictions. We consider $\tau_1 = 13$ and $\tau_2 = 10$ as used in all the baselines. Lower values of RMSE and S indicate better performance.
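Both metrics are straightforward to compute. The sketch below assumes the coefficients 13 (early predictions) and 10 (late predictions) that are commonly used with this benchmark; the toy true/predicted RUL values are illustrative:

```python
import numpy as np

def rmse(y_true, y_pred):
    d = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.sqrt(np.mean(d ** 2)))

def timeliness_score(y_true, y_pred, a_early=13.0, a_late=10.0):
    """Asymmetric exponential penalty: late predictions (d >= 0) are
    penalized more heavily than early ones since a_late < a_early."""
    d = np.asarray(y_pred) - np.asarray(y_true)
    s = np.where(d < 0, np.exp(-d / a_early) - 1.0, np.exp(d / a_late) - 1.0)
    return float(s.sum())

y_true = [100.0, 80.0, 60.0]
y_pred = [ 95.0, 80.0, 70.0]  # one early, one exact, one late estimate
print(round(rmse(y_true, y_pred), 3))              # 6.455
print(round(timeliness_score(y_true, y_pred), 3))  # 2.187
```

Note the asymmetry: an overestimate of 10 cycles contributes more to S than an underestimate of the same magnitude, reflecting that predicting failure too late is costlier in practice.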
Hyperparameters: We use a batch size of 32 and a dropout rate of 0.2 for all feedforward (leaky ReLU) layers. We use 2 leaky ReLU layers for each edge function $g_{in}^{ij}$ and $g_{out}^{ij}$ in GNMR, as well as for the pre-final leaky ReLU layers of GRU-MR (refer Fig. 2). We use the Adam optimizer with an initial learning rate of 0.001, which is reduced every 10 epochs by a constant factor. The number of hidden units (same as the node representation dimension) for all feedforward and recurrent layers is chosen from {30, 60}. We use a fixed number of message propagation steps $K$ for the GGNN. The number of hidden layers for $f_{GRU}$ in GNMR and for GRU-MR is chosen from {2, 3, 4}. All hyperparameters are selected via grid search based on the validation RMSE.

Results and Observations
(1) Comparison with baselines: From Table 1, we observe that GNMR performs better than GRU-MR, PCA-GRU-MR, as well as the other baselines across most datasets. GNMR has the best average rank based on both RMSE and S. Furthermore, the special case of GNMR with no message propagation across nodes, i.e. $K = 0$, also compares favorably to GRU-MR and the other baselines. These results suggest that meaningful grouping of the sensors into nodes based on the knowledge of the modular graph structure of the equipment is advantageous over methods that consider a concatenated vector of all the sensors as one input. Also, allowing for message propagation across nodes gives a further improvement in results, indicating the advantage of modeling the dependencies between modules.
(2) Effect of varying the graph structure: We consider two scenarios to study the sensitivity of the results to the exact knowledge of the graph structure: for the Increased Nodes scenario, any original node $v_i$ in $\mathcal{G}$ with $|\mathcal{S}_i| > 1$ is split into two nodes, say $v_{i1}$ and $v_{i2}$, by randomly distributing the sensors in $\mathcal{S}_i$ between the two nodes such that each new node gets half the sensors. Furthermore, the nodes $v_{i1}$ and $v_{i2}$ thus created are connected to each other. Also, if $e_{ij} \in \mathcal{E}$, then both $v_{i1}$ and $v_{i2}$ are additionally connected to the new nodes $v_{j1}$ and $v_{j2}$ obtained from $v_j$. For the Reduced Nodes scenario, when combining two neighboring nodes $v_i$ and $v_j$ into a new node, say $v_k$, we have $\mathcal{S}_k = \mathcal{S}_i \cup \mathcal{S}_j$, and an edge exists between any two new nodes if there was an edge between the original nodes from which the new nodes were created. We also consider the limiting cases of One node per sensor and One node for all sensors for the Increased Nodes and Reduced Nodes scenarios, respectively.
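The node-splitting perturbation can be sketched as follows. The node names, the handling of odd sensor counts, and the bidirectional connection of the two halves are illustrative assumptions rather than the paper's exact procedure:

```python
import random

def split_node(nodes, edges, name, rng=random.Random(0)):
    """Split one node into two halves that randomly share its sensors,
    inherit its edges, and are connected to each other."""
    sensors = nodes[name]
    if len(sensors) <= 1:
        return nodes, edges  # nothing to split
    shuffled = sensors[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    a, b = name + "_1", name + "_2"
    nodes = {k: v for k, v in nodes.items() if k != name}
    nodes[a], nodes[b] = shuffled[:half], shuffled[half:]
    new_edges = []
    for src, dst in edges:
        # each edge touching the split node is duplicated to both halves
        srcs = [a, b] if src == name else [src]
        dsts = [a, b] if dst == name else [dst]
        new_edges += [(s, d) for s in srcs for d in dsts]
    new_edges += [(a, b), (b, a)]  # the two halves influence each other
    return nodes, new_edges

nodes = {"fan": [0, 1], "hpc": [2, 3, 4, 5], "turbine": [6, 7]}
edges = [("fan", "hpc"), ("hpc", "turbine")]
nodes2, edges2 = split_node(nodes, edges, "hpc")
print(sorted(nodes2))  # ['fan', 'hpc_1', 'hpc_2', 'turbine']
print(len(edges2))     # 2 + 2 + 2 = 6 edges after splitting
```

Applying this repeatedly until every node holds a single sensor yields the One node per sensor limiting case, while merging neighbors in the reverse direction yields the Reduced Nodes and One node for all sensors cases.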
From Table 2, we observe that the performance degrades as we reduce the nodes in the graph, with Reduced Nodes and One node for all sensors being the worst models. On the other hand, the graph with Increased Nodes performs the same as the Original Graph on FD001 while being better on FD004. Nevertheless, the Original Graph still gives the best performance on average across datasets. Increased Nodes and One node per sensor have the second-best performance. However, it is to be noted that the models with increased nodes have a much larger number of MLPs for message propagation (refer Eqs. 1-2) due to the increased number of edges, making them computationally more expensive compared to the original graph.
(3) Preliminary Analysis of the Attention Mechanism (refer Eq. 8): For the FD001 and FD002 datasets, the faults in all instances are known to originate in the HPC module (refer Fig. 3); for FD003 and FD004, the ground truth information about the faulty node is not available, hence these are not analyzed. It is therefore expected that as the degradation increases, the behavior of the sensors associated with the HPC module will depict signatures for detecting the impending failure, and in turn, estimating the RUL. We observe that GNMR implicitly tends to capture and leverage this behavior in some cases: we analyze the attention weights $\alpha_i$ and the corresponding contribution to the RUL estimate from the HPC node versus the remaining nodes. As shown in Fig. 4(b), we observe that for the FD002 dataset, the attention for the faulty node (HPC module) increases as the target RUL decreases (i.e. as the engines approach failure), while the attention tends to decrease for the remaining nodes. However, this trend is not observed in FD001 (Fig. 4(a)). In the future, it will be interesting to see if this behavior can be explicitly ensured: while other modules may provide additional information to track the health degradation of a particular module, it may be useful to bias the attention towards the module of interest.
6 Conclusion and Future Work
We have proposed an approach to incorporate the readily available information about the modularized structure of complex equipment into deep learning models via gated graph neural networks (GGNNs). To the best of our knowledge, our work provides a first set of results for leveraging GNNs in the increasingly important area of automated equipment health monitoring. We analyze the heavily-benchmarked aircraft turbofan engine datasets through the lens of structure-aware deep learning, potentially bridging the gap between deep learning and domain-knowledge-aware approaches. We hope that this work inspires future research on leveraging equipment structure to model the behavior of complex systems (such as power plants), with interesting applications like optimization and anomaly detection. While the graph structure is readily available in most practical applications as part of domain knowledge, it may not be optimal in terms of reflecting the dependencies between sensors at a node or between sensors across nodes. It will be interesting to explore whether the optimal graph structure can itself be learned, starting from the domain-knowledge-based initial graph structure.
References
 [1] (2016) Deep convolutional neural network based regression approach for estimation of remaining useful life. In International Conference on Database Systems for Advanced Applications, pp. 214–228.
 [2] (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
 [3] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 [4] (2016) A dataset to support research in the design of secure water treatment systems. In International Conference on Critical Information Infrastructures Security, pp. 88–99.
 [5] (2017) Predicting remaining useful life using time series embeddings based on recurrent neural networks. arXiv preprint arXiv:1709.01073.
 [6] (2018) Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203.
 [7] (2008) Recurrent neural networks for remaining useful life estimation. In Prognostics and Health Management, 2008. PHM 2008., pp. 1–6.
 [8] (2019) Enhancing deep learning with semantics: an application to manufacturing time series analysis. Procedia Computer Science 159, pp. 437–446.
 [9] (2015) Deep learning. Nature 521 (7553), pp. 436–444.
 [10] (2018) Remaining useful life estimation in prognostics using deep convolution neural networks. Reliability Engineering & System Safety 172, pp. 1–11.
 [11] (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
 [12] (2015) Long short term memory networks for anomaly detection in time series. In ESANN, 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 89–94.
 [13] (2018) Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242.
 [14] (2008) Damage propagation modeling for aircraft engine run-to-failure simulation. In IEEE International Conference on Prognostics and Health Management, 2008. PHM 2008., pp. 1–9.
 [15] (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
 [16] (2011) Remaining useful life estimation – a review on the statistical data driven approaches. European Journal of Operational Research 213 (1), pp. 1–14.
 [17] (2018) NerveNet: learning structured policy with graph neural networks. In International Conference on Learning Representations.
 [18] (2016) Multi-objective deep belief networks ensemble for remaining useful life estimation in prognostics. IEEE Transactions on Neural Networks and Learning Systems 28 (10), pp. 2306–2318.
 [19] (2019) Modeling IoT equipment with graph neural networks. IEEE Access 7, pp. 32754–32764.
 [20] (2017) Long short-term memory network for remaining useful life estimation. In Prognostics and Health Management (ICPHM), 2017 IEEE International Conference on, pp. 88–95.
Datasets and Preprocessing Details
The datasets contain time series of readings for 24 variables (21 sensors and 3 operating condition variables), such that each cycle in the life of an engine provides a 24-dimensional vector. The sensor readings in the training set are available from the beginning of usage of the engine till the end of life, while those in the test split are clipped at a random time prior to failure such that the instances are still operational at the last available cycle; the goal is to estimate the RUL for these test instances. Refer to Table 3 for details of the datasets.
Table 3: Details of the four turbofan datasets.

Dataset  FD001  FD002  FD003  FD004

Instances (training set)  80  208  80  199 
Instances (validation set)  20  52  20  50 
Instances (test set)  100  259  100  248 
Operating conditions  1  6  1  6 
Fault Modes  1  1  2  2 
As commonly used in RUL estimation approaches [20, 1, 10], we also consider an upper bound $y_{max}$ on the possible values of the target RUL during training, as, in practice, it is not possible to predict too far ahead in the future. So if $y > y_{max}$, we clip the value of $y$ to $y_{max}$. The targets are then normalized by $y_{max}$, such that the targets in training are in the range 0-1. We use the min-max normalization technique to normalize the input time series sensor-wise (to the range of -1 to 1) using the minimum and maximum value of each sensor from the training set. The time series for each engine instance is then divided into fixed-length windows with a window-shift of 5; refer to Table 4 for details of the number of resulting time series after windowing. We use suitable pre-padding with the mean value of the sensor readings to ensure the same length for each time series for both GRU-MR and GNMR.
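The preprocessing steps can be sketched as follows. The values of RUL_MAX and WINDOW_LEN, as well as the toy data, are placeholders for illustration rather than the exact values used in the paper:

```python
import numpy as np

RUL_MAX, WINDOW_LEN, WINDOW_SHIFT = 130, 20, 5  # placeholder constants

def clip_rul(rul):
    # clip targets at the upper bound, then scale to [0, 1]
    return np.minimum(rul, RUL_MAX) / RUL_MAX

def minmax_normalize(train_series, series):
    # sensor-wise min-max normalization to [-1, 1], with statistics
    # computed on the training split only
    lo, hi = train_series.min(axis=0), train_series.max(axis=0)
    return 2.0 * (series - lo) / (hi - lo) - 1.0

def windows(series, length=WINDOW_LEN, shift=WINDOW_SHIFT):
    # fixed-length sliding windows with a constant shift
    return [series[s:s + length]
            for s in range(0, len(series) - length + 1, shift)]

train = np.random.rand(100, 24) * 10  # toy data: 100 cycles, 24 variables
norm = minmax_normalize(train, train)
print(float(norm.min()), float(norm.max()))  # -1.0 1.0
print(len(windows(norm)))                    # (100 - 20) // 5 + 1 = 17
```

Computing the normalization statistics on the training split alone, and reusing them for validation and test data, avoids leaking information from the evaluation splits.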
Table 4: Number of time series (windows) per split after windowing.

Dataset  FD001  FD002  FD003  FD004

Training  1,875  4,688  2,129  5,383 
Validation  411  1,287  533  1,451 
Test  100  259  100  248 