1 Introduction
Transportation systems, energy grids, and water distribution systems (WDS) constitute parts of our critical infrastructure that are vital to our society and subject to special protective measures and regulations. As they are under increasing strain in the face of limited resources and are vulnerable to attacks, their efficient management and continuous monitoring are of great importance. As an example, non-revenue water amounts to 25% on average in the EU [6], making the detection of leaks in WDS an important task. Advances in sensor technology and increasing digitalisation hold the potential for intelligent monitoring and adaptive control using AI technologies [5, 25, 13]. In addition to more classical AI approaches, deep learning technologies are increasingly being used to solve learning tasks in the context of critical infrastructures [4].
A common feature of WDS, energy networks and transport networks is that the data has a temporal and spatial character: data is generated in real time according to an underlying graph, given by the power grid, the pipe network and the transport routes, respectively. Measurements are available for some nodes that correspond to local sensors, e.g. pressure sensors or smart meters. Based on this partial information, the task is to derive corresponding quantities at every node of the graph, to identify the system state, or to derive optimal planning and control strategies. In this paper, we target the learning challenge of the first kind: inferring relevant quantities at each location of the graph based on few measurements. While classical deep learning models such as convolutional networks or recurrent models can reliably handle Euclidean data, graphs constitute non-Euclidean data that require techniques from geometric deep learning. Based on initial approaches dating back more than a decade [28, 10], a variety of graph neural networks (GNNs) have recently been proposed that are able to directly process information such as present in critical infrastructure [2, 3, 7, 14, 29]. First applications demonstrate the suitability of GNNs for this domain [23, 5, 3].
Graphs from the domain of WDS or smart grids display specific characteristics (see Fig. 3): as they are embedded in the plane, the node degree is small and the network diameter is large. These characteristics pose a challenge for GNNs, as the problems of long-range dependencies and oversmoothing occur [32, 27]. In this contribution, we design a GNN architecture capable of dealing with these specific graph structures: we show that our spatial GNN is able to effectively integrate long-range node dependencies, and we demonstrate the impact of a suitable transfer function and residual connections. As the required resources quickly become infeasible for big graphs, we also investigate a sparse multi-hop alternative and its comparability. All methods are evaluated for pressure prediction in WDS on a variety of benchmark networks, displaying promising results.
2 Related Work
The task of estimating pressures at all nodes of a WDS from pressure values available at a few nodes has recently been addressed [8]. The authors employed spectral graph convolutional neural networks (GCNs) and performed extensive experiments to demonstrate their approach. However, their methodology does not fully benefit from the available structural information of the graph; we provide further details on this in Sec. 4. We propose a spatial-GCN-based methodology that effectively utilizes the graph structure by using both node and edge features and thus produces significantly better results (see Sec. 5).
A related task of state (pressure, flow) estimation in WDS based on demand patterns and sparse pressure information has been addressed in [31]. The authors used hydraulics in the optimization objective, since the task was to model with GNNs the complex hydraulics used by the popular EPANET simulator [26]. They present promising results only on relatively small WDS; the ability to scale to larger WDS is yet to be investigated. While their model solves the task of state estimation in WDS, their approach requires demand patterns from every consumer also during inference. In contrast, our proposed model relies on pressure values computed by the EPANET solver (based on demand patterns) only during the training process. During evaluation, our model estimates pressures solely based on sparse pressure values obtained from a few sensors. Further, it successfully estimates pressures even in the case of noisy demands (see Sec. 5).
GNNs were first introduced in [28] as an extension of recursive neural networks for tree structures [11]. Since then, a number of GCN algorithms have been developed, which can be classified into spectral-based and spatial-based approaches. The work [2] introduced spectral GCNs based on spectral graph theory, which was followed by further work [14, 3, 12, 18, 20]. The counterpart are spatial GCNs, which apply a local approximation of the spectral graph kernel [9, 22, 7, 24, 32, 29]; these are also referred to as message passing neural networks. Unlike convolutional neural networks (CNNs), spatial GCNs suffer from issues like vanishing gradients, oversmoothing and overfitting when used to build deeper models. Generalized aggregation functions, residual connections and normalization layers can address these issues and improve performance on diverse GCN tasks and large-scale graph datasets [19].
To enable high-level embeddings in feedforward neural networks, self-normalizing neural networks (SNNs) were introduced in [15], based on a special activation function called the scaled exponential linear unit (SeLU). We combine residual connections [19] with SNNs, since residual connections help solve the oversmoothing problem when we use multiple GCN layers, whereas the self-normalizing property of SeLU enables the required information propagation in the case of sparse features.
3 Methodology
The main contribution of our work is a spatial GCN capable of efficiently dealing with the specific graph characteristics present in WDS. We address the estimation of missing node features based on sparse measurements. As we detail below, we employ multiple spatial GCN layers without suffering from the typical problems of vanishing gradients, oversmoothing and overfitting. For this purpose, we combine residual connections [19] with the SeLU activation function [15]. To decrease model size, we leverage GCN layers with multiple hops, realizing message passing between more distant neighbors comparable to [21]. Our model employs spatial GCNs using both node and edge features. The complete architecture is depicted in Fig. 1. Formally, a graph is represented as $G = (V, E, X, F)$, where:
- $V$ is the set of $N$ nodes,
- $E \subseteq V \times V$ is the set of edges,
- $X = \{x_i \mid v_i \in V\}$ is the set of node features, where $x_i \in \mathbb{R}^{d_v}$ and $d_v$ is the number of node features,
- $F = \{f_{ij} \mid (v_i, v_j) \in E\}$ is the set of edge features, where $f_{ij} \in \mathbb{R}^{d_e}$ and $d_e$ is the number of edge features.
Node and edge features are embedded by fully connected linear layers $L_v$ and $L_e$:
$h_i^{(0)} = L_v(x_i), \qquad z_{ij}^{(0)} = L_e(f_{ij})$ (1)
We denote intermediate model activations as $h_i^{(l)}$ for nodes and $z_{ij}^{(l)}$ for edges. Multiple GCN layers convolve the information from the neighboring nodes for the estimation of node features. Each GCN layer employs the three-step process of message generation, message aggregation and node feature update. In the $l$-th layer, the edge features are updated by
$z_{ij}^{(l)} = z_{ij}^{(l-1)} + \big| h_i^{(l-1)} - h_j^{(l-1)} \big|$ (2)
Adding the absolute difference between the features of the current node and its neighbors empirically improves the learning. Then, messages are generated as follows:
$m_{ij}^{(l)} = h_j^{(l-1)} \,\|\, z_{ij}^{(l)}$ (3)
where $\|$ denotes vector concatenation. After concatenation, we apply the SeLU activation function [15] to all messages, which is given by:
$\mathrm{SeLU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases}$ (4)
where $\lambda$ and $\alpha$ are hyperparameters as in [15]. SeLU's self-normalizing nature greatly improves learning in the light of highly sparse values at the beginning of the training process. All messages from the neighbor nodes are sum-aggregated:
$m_i^{(l)} = \sum_{j \in \mathcal{N}(i)} \mathrm{SeLU}\big(m_{ij}^{(l)}\big)$ (5)
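As a concrete illustration, the SeLU activation of Eq. (4) can be written in a few lines; the constants below are the self-normalizing values derived in [15] (this is a minimal sketch, not the paper's implementation):

```python
import math

# Constants from Klambauer et al. [15]; chosen so that activations
# are driven towards zero mean and unit variance across layers.
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x: float) -> float:
    """Scaled exponential linear unit applied to a single value."""
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)

print(selu(1.0))    # positive inputs are scaled by lambda
print(selu(-10.0))  # large negative inputs saturate near -lambda*alpha
```

Unlike ReLU, SeLU passes (scaled) information for negative inputs as well, which matters when most node features are zero at the start of training.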
Similar to [19], we add residual connections to the aggregated messages and pass these through a Multi-Layer Perceptron (MLP):
$h_i^{(l)} = \mathrm{MLP}\big(h_i^{(l-1)} + m_i^{(l)}\big)$ (6)
The overall message construction, aggregation and update is [19]:
$h_i^{(l)} = \mathrm{MLP}\Big(h_i^{(l-1)} + \sum_{j \in \mathcal{N}(i)} \mathrm{SeLU}\big(h_j^{(l-1)} \,\|\, z_{ij}^{(l)}\big)\Big)$ (7)
After employing multiple GCN layers, the resultant node embeddings are fed to a final fully connected linear layer to estimate all node features:
$\hat{x}_i = L_{\mathrm{out}}\big(h_i^{(L)}\big)$ (8)
where $\hat{x}_i$ denotes the estimated node features, $L$ is the index of the last GCN layer and $L_{\mathrm{out}}$ is modeled by the linear layer. We use the L1 loss as objective function:
$\mathcal{L} = \frac{1}{S N} \sum_{s=1}^{S} \sum_{i=1}^{N} \big| \hat{x}_i^{(s)} - x_i^{(s)} \big|$ (9)
with $x_i^{(s)}$ as ground truth, $N$ as the number of nodes and $S$ as the number of samples in a minibatch.
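The three-step layer described above (edge update, message generation with SeLU, sum aggregation, residual update) can be sketched on a toy path graph with scalar features. The fixed weights `w` stand in for the learned MLP, and the residual is added after the scalar projection so the dimensions match; this is an illustrative sketch, not the paper's code:

```python
import math

LAMBDA, ALPHA = 1.0507009873554805, 1.6732632423543772

def selu(x):
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)

# Toy graph: 4 nodes on a path 0-1-2-3, scalar node features h and
# scalar edge features z (think of an embedded pipe attribute).
edges = [(0, 1), (1, 2), (2, 3)]
h = [5.0, 0.0, 0.0, 2.0]          # sparse features: only nodes 0 and 3 "sensed"
z = {e: 1.0 for e in edges}

def gcn_layer(h, z, w=(0.5, 0.5)):
    """One spatial GCN step: edge update (Eq. 2), message generation
    with SeLU (Eqs. 3-4), sum aggregation (Eq. 5) and a residual
    node update; w is a fixed stand-in for the learned MLP."""
    # Edge update: add |h_i - h_j| to the edge feature.
    z_new = {(i, j): z[(i, j)] + abs(h[i] - h[j]) for (i, j) in z}
    h_new = []
    for i in range(len(h)):
        agg = [0.0, 0.0]  # sum-aggregated 2-d messages
        for (a, b) in edges:
            if i in (a, b):
                j = b if i == a else a
                msg = [selu(h[j]), selu(z_new[(a, b)])]  # concat + SeLU
                agg = [agg[0] + msg[0], agg[1] + msg[1]]
        # Residual update; w projects the message back to a scalar.
        h_new.append(h[i] + w[0] * agg[0] + w[1] * agg[1])
    return h_new, z_new

h1, z1 = gcn_layer(h, z)
print(h1)  # information from nodes 0 and 3 has spread one hop
```

After one layer, the zero-valued interior nodes already carry non-zero activations; stacking layers extends this receptive field hop by hop, which is why sparse sensors on a large-diameter graph demand many layers.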
Multi-hop Variation
Given the sparsity and size of a graph, our methodology requires a comparably large number of GCN layers, proportional to the size of the graph. This reduces scalability to larger graphs. To reduce the number of parameters, we propose GCN layers with multiple hops as shown in Fig. 2. Specifically, message generation and aggregation are repeated in each GCN layer before passing the result to the MLP:
$u_i^{(l,k)} = \sum_{j \in \mathcal{N}(i)} \mathrm{SeLU}\big(u_j^{(l,k-1)} \,\|\, z_{ij}^{(l)}\big), \qquad u_i^{(l,0)} = h_i^{(l-1)}$ (10)
with $K$ as the number of hops and $k = 1, \dots, K$. The embedding for the next layer is:
$h_i^{(l)} = \mathrm{MLP}\big(h_i^{(l-1)} + u_i^{(l,K)}\big)$ (11)
This enables the model to gather information from neighbors that are multiple hops away, requiring fewer GCN layers.
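The effect of repeating aggregation within one layer can be sketched as follows. The SeLU, edge features and learned MLP are dropped here (uniform weights, scalar features) purely to make the information flow visible; this is an illustration of the hop mechanism, not the paper's model:

```python
# Toy path graph 0-1-2-3-4; only node 0 carries a feature.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
h = [1.0, 0.0, 0.0, 0.0, 0.0]

def neighbors(i):
    return [b if i == a else a for (a, b) in edges if i in (a, b)]

def multihop_layer(h, hops):
    """Repeat message generation + sum aggregation `hops` times
    within a single layer before the residual update."""
    u = list(h)
    for _ in range(hops):
        u = [sum(u[j] for j in neighbors(i)) for i in range(len(h))]
    return [h[i] + u[i] for i in range(len(h))]  # residual update

# A single 3-hop layer already carries node 0's value to node 3;
# a single-hop layer would need 3 stacked layers (and 3 sets of MLP
# parameters) for the same receptive field.
print(multihop_layer(h, hops=3))
```

Since the repeated sweeps reuse the same layer's parameters, the receptive field grows by a factor of $K$ without a corresponding growth in model size.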
4 Experiments
The methodology can be applied to missing node feature estimation on any graph. Here, we investigate WDS, which are modelled as graphs by representing junctions as nodes and pipes between junctions as edges. WDS are especially challenging because pressure sensors are installed at only a few nodes due to constraints (size of the system, cost, availability, practicality) [17], resulting in graphs with sparse feature information. Additionally, the node degree in WDS is usually low (see Tab. 1). These properties can be observed in the popular L-Town WDS [30] shown in Fig. 3. Such characteristics require GNNs to model long-range dependencies between nodes to properly integrate the available information.
WDS                      Anytown        C-Town         L-Town         Richmond
Number of junctions      22             388            785            865
Number of pipes          41             429            909            949
Diameter                 5              66             79             234
Degree (min, mean, max)  (1, 3.60, 7)   (1, 2.24, 4)   (1, 2.32, 5)   (1, 2.19, 4)
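The diameter and degree statistics reported above can be computed with plain breadth-first search. The small graph below is a made-up WDS-like layout (not one of the benchmark networks), chosen to reproduce the low-degree, large-diameter profile just discussed:

```python
from collections import deque

def bfs_ecc(adj, s):
    """Eccentricity of node s: distance to the farthest reachable node."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

def graph_stats(edges):
    """Return (diameter, min degree, mean degree, max degree)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    degs = [len(adj[n]) for n in adj]
    diameter = max(bfs_ecc(adj, n) for n in adj)
    return diameter, min(degs), sum(degs) / len(degs), max(degs)

# Mostly degree-2 "pipe runs" with one branch at node 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (2, 6), (6, 7)]
print(graph_stats(edges))  # -> (5, 1, 1.75, 3)
```

For planar pipe networks the mean degree stays near 2, so the diameter grows roughly with the number of junctions, which is exactly the regime where message passing needs many hops.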
To the best of our knowledge, the task of node feature estimation in WDS using GNNs based on sparse features has only been dealt with by [8]. These authors compared their model to two non-GNN baselines: the first baseline uses the mean of the known node features as the value for unknown node features, the second baseline uses interpolated regularization [1]. The work [8] demonstrates that the GNN model significantly outperforms both baselines. Therefore, in our experiments, we compare our approach only to the GNN model, ChebNet, of [8]. We run two experiments on simulated data. First, we compare our approach to [8] on three WDS datasets: Anytown, C-Town, and Richmond. Second, we conduct an in-depth evaluation on L-Town with extensive hyperparameter tuning.
4.1 Datasets
We use a total of four WDS datasets for our experiments: Anytown, C-Town, L-Town and Richmond (https://engineering.exeter.ac.uk/research/cws/resources/benchmarks/#a8, https://www.batadal.net/data.html) [30]. Major attributes of the WDS are listed in Table 1. We use the dataset generation methodology of [8] for three of the WDS (Anytown, C-Town, Richmond) and record 1000 consecutive time steps for each of the three networks. For each network, we use three different sparsity levels, i.e. sensor ratios, of 0.05, 0.1 and 0.2. We do not evaluate on the sparsity levels of 0.4 and 0.8 used in [8], as these are easier. We sample 5 different random sensor configurations for each sparsity level and each WDS instead of 20.
Model: ChebNet        Anytown          C-Town            Richmond          L-Town
No. of layers         4                4                 4                 4
Degrees               [39, 43, 45, 1]  [200, 200, 20, 1] [240, 120, 20, 1] [240, 120, 20, 1]
No. of filters        [14, 20, 27, 1]  [60, 60, 30, 1]   [120, 60, 30, 1]  [120, 60, 30, 1]
Parameters (million)  0.038            0.780             0.958             0.929

Model: mGCN           Anytown   C-Town   Richmond   L-Town (45×1)   L-Town (10×5)
No. of GCN layers     5         33       60         45              10
No. of hops           1         2        3          1               5
No. of MLP layers     2         2        2          2               2
Latent dimension      32        32       48         96              96
Parameters (million)  0.031     0.203    0.830      2.488           0.553
For the popular L-Town network, we use only a single configuration of sensors as designed by [30], which gives a sensor ratio of 0.0422. We use two different sets of simulation settings: one with smooth toy demands and the other close to actual noisy demand patterns. The simulations are carried out using EPANET [26] as provided by the Python package wntr [16]. Samples are generated every 15 minutes, resulting in 96 samples per day. We use one month of data for training (2880 samples) and evaluate on the data of the next two months (5760 samples). The training data is divided into train-validation-test splits with a 60-20-20 ratio.
4.2 Training setup
The model parameters are summarized in Table 2. All models are implemented in PyTorch using the Adam optimizer. For the ChebNet baseline [8], we set a learning rate of 3e-4 and a weight decay of 6e-6. For our mGCN models, we use a learning rate of 1e-5 and no weight decay. We now describe the training setup of the ChebNet baseline and our mGCN model for the two experiments, respectively.
For the first experiment, the models are trained for 2000 epochs. We set an early stopping criterion such that training stops after 250 epochs if the change in loss is no larger than 1e-6. We configure ChebNet similar to [8]. The input is masked as per the sensor ratio and the mask is concatenated with the pressure values; hence there are two node features. ChebNet can only use scalar edge features, i.e. edge weights. Out of the three types of edge weights used by [8] (binary, weighted, logarithmically weighted), we use the binary weights, since the other types did not increase performance. For our model (mGCN), we did not perform an extensive hyperparameter search, since we achieved considerably better results than the ChebNet model of [8] with a set of intuitive hyperparameter values. We use a single-hop configuration for Anytown and multi-hop architectures for the C-Town and Richmond WDS. We only use masked pressure values as input, i.e. one node feature. Further, we use two edge features, namely pipe length and diameter.
For the second, in-depth evaluation on L-Town, we dropped the second node feature for ChebNet, since this significantly improved the results. We use the ChebNet model configuration used for the Richmond WDS by the authors. We train our mGCN model with two configurations: one with the default single hop and the second with multiple hops, as listed in Table 2. For both mGCN models, we add a third edge feature, namely a pressure reducing valve (PRV) mask. PRVs are used at certain connections in a WDS to reduce pressure, hence these edges should be modeled differently. We use a binary mask to pass this information to the model, which helps improve the pressure estimation at neighboring nodes. We train all three models for 5000 epochs without early stopping.
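The input masking described above can be sketched as follows; the pressure values and the helper name are made up for illustration, and the real pipeline draws sensor configurations per network as described in Sec. 4.1:

```python
import random

def mask_pressures(pressures, sensor_ratio, seed=0):
    """Zero out pressures at non-sensor nodes; return the masked values
    and the binary sensor mask (the mask is the optional second node
    feature that the ChebNet baseline concatenates to its input)."""
    rng = random.Random(seed)
    n = len(pressures)
    k = max(1, round(sensor_ratio * n))        # at least one sensor
    sensors = set(rng.sample(range(n), k))     # random sensor placement
    mask = [1.0 if i in sensors else 0.0 for i in range(n)]
    masked = [p * m for p, m in zip(pressures, mask)]
    return masked, mask

# Hypothetical pressures (metres head) at ten junctions:
pressures = [48.2, 51.7, 47.9, 50.3, 49.1, 52.6, 46.8, 50.9, 49.5, 47.2]
masked, mask = mask_pressures(pressures, sensor_ratio=0.2)
print(masked)  # non-zero only at the sampled sensor nodes
```

The model then sees zeros everywhere except at sensor locations, which is the sparse-feature regime that motivates SeLU and the deep/multi-hop receptive field.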
5 Results
WDS                Anytown                 C-Town                  Richmond
Ratio  Error (%)   ChebNet  mGCN   Diff   ChebNet  mGCN   Diff    ChebNet  mGCN  Diff
0.05   All         54.19    53.15  1.04   12.88    9.77   3.11    4.34     2.17  2.17
       Sensor      7.06     3.77   3.28   7.50     4.61   2.89    3.47     1.81  1.66
       Non-sensor  56.44    55.50  0.94   13.16    10.04  3.12    4.38     2.19  2.19
0.1    All         35.43    34.85  0.57   8.16     5.47   2.69    3.86     1.93  1.93
       Sensor      6.66     7.19   -0.53  7.10     4.83   2.27    3.45     2.02  1.43
       Non-sensor  38.30    37.62  0.68   8.28     5.55   2.73    3.90     1.92  1.98
0.2    All         14.98    13.51  1.47   7.05     5.58   1.47    3.24     1.59  1.65
       Sensor      5.40     3.06   2.34   6.46     5.46   1.00    3.03     1.62  1.40
       Non-sensor  17.11    15.83  1.28   7.20     5.61   1.59    3.29     1.59  1.71
Comparison with spectral GCN-based approach
First, we compare our model with the work of [8] using their datasets and training settings. The results of the experiments on the Anytown, C-Town and Richmond WDS are shown in Table 3. Here, we evaluate on the basis of the mean relative absolute error given by:
$\mathrm{Err} = \frac{1}{N} \sum_{i=1}^{N} \frac{\big| \hat{p}_i - p_i \big|}{p_i} \cdot 100\,\%$ (12)
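This metric can be implemented directly; the sketch below is one plausible reading (per-node relative absolute error, averaged and expressed in percent), with hypothetical pressure values for illustration:

```python
def mean_relative_absolute_error(pred, truth):
    """Mean over nodes of |pred - truth| / truth, in percent."""
    assert len(pred) == len(truth) and len(truth) > 0
    return 100.0 * sum(abs(p - t) / t for p, t in zip(pred, truth)) / len(pred)

# Hypothetical pressures (metres head) at four nodes:
truth = [50.0, 40.0, 25.0, 100.0]
pred = [49.0, 42.0, 25.0, 90.0]
print(mean_relative_absolute_error(pred, truth))  # -> 4.25
```

Relative errors weight each node by its own operating pressure, so a 1 m deviation counts more at a low-pressure junction than at a high-pressure one.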
Since Anytown is a much smaller WDS, the sensor ratios translate to very few sensors (0.05: 1 sensor, 0.1: 2 sensors, 0.2: 4 sensors). Hence, neither model accurately estimates the pressures in these cases. The number of available sensors is comparatively larger for both the C-Town and Richmond WDS, even for the smallest ratio, thus naturally increasing performance. As can be seen, mGCN outperforms ChebNet [8] by a considerable margin.
Detailed analysis on L-Town
We present a more in-depth analysis of the evaluation results on L-Town. Mean relative absolute errors for the ChebNet and single-hop mGCN models are plotted in Fig. 4. Both models are trained on smooth data and evaluated on noisy, realistic data. As can be seen, the error values for mGCN are much lower across all nodes compared to ChebNet. We plot time series of 4 days for two nodes in Fig. 5. The first node (top plot) has an installed sensor, hence the model gets the ground truth value as input and only has to reconstruct it. The second node (bottom plot) does not have an installed sensor and the model gets zero input. As depicted, mGCN is able to successfully reconstruct and estimate both nodes, whereas the results from ChebNet suffer considerable errors. There are areas in the L-Town WDS where water levels are essentially stagnant up to some noise. As shown in Fig. 6, our mGCN is able to model those nodes correctly. In contrast, spectral convolutions do not take the graph structure into account and thus end up imposing the seasonality of nodes from other areas of the graph on the nodes in this area.
Similar to our first experiment, we present mean relative absolute error values for all, sensor and non-sensor nodes for L-Town in Table 4. Our model produces significantly better results than ChebNet. Since our model is based on neighborhood aggregation, the number of required GCN layers will continue to grow with increasing graph size. In order to reduce the number of layers and model parameters, we trained our model with only 10 GCN layers with 5 hops each. As is evident, we are able to reduce the number of parameters by almost a factor of five at the expense of some performance. Nevertheless, the model is still significantly better than the baseline ChebNet model. Our main motivation for this is that the multi-hop approach makes the model more scalable to larger graphs. Further, it is a step towards developing a generalized version of the model that can work for different sensor configurations and/or different graph sizes without hyperparameter tuning and retraining.
Model          Error (%)
               All            Sensor         Non-sensor
Smooth Data
ChebNet        2.55 ± 2.87    2.38 ± 3.55    2.55 ± 2.83
mGCN (45 × 1)  0.39 ± 0.37    0.43 ± 0.52    0.39 ± 0.36
mGCN (10 × 5)  0.83 ± 0.68    0.74 ± 0.59    0.83 ± 0.69
Noisy Data
ChebNet        2.92 ± 3.35    2.78 ± 4.02    2.93 ± 3.32
mGCN (45 × 1)  0.54 ± 0.75    0.64 ± 1.06    0.53 ± 0.73
mGCN (10 × 5)  0.90 ± 0.82    0.81 ± 0.74    0.90 ± 0.83
6 Conclusion
We have proposed a spatial GCN that is particularly suited for tasks on graphs with small node degree and sparse node features, since it is able to model long-range dependencies. We have demonstrated its suitability for node pressure inference based on sparse measurement values as an important and representative task from the domain of WDS, displaying its behavior on a number of benchmarks. Notably, the model generalizes not only across time windows, but also from noiseless toy demand signals to realistic ones. In addition to a very good overall performance, we also proposed first steps to target the challenge of scalability to larger graphs by introducing multi-hop architectures with considerably fewer parameters compared to deep single-hop ones. In future work, we will investigate the behavior for larger networks based on these first results. Moreover, unlike simulation tools in the domain, the GNN has the potential to generalize over different graph structures, including partially faulty ones. We will evaluate this capability in future work.
Acknowledgements
We gratefully acknowledge funding from the European Research Council (ERC) under the ERC Synergy Grant Water-Futures (Grant agreement No. 951424). This research was also supported by the research training group "Dataninja" (Trustworthy AI for Seamless Problem Solving: Next Generation Intelligence Joins Robust Data Analysis) funded by the German federal state of North Rhine-Westphalia, and by funding from the VW Foundation for the project IMPACT in the frame of the funding line AI and its Implications for Future Society.
References
[1] (2004) Regularization and semi-supervised learning on large graphs. In Learning Theory, J. Shawe-Taylor and Y. Singer (Eds.), Berlin, Heidelberg, pp. 624–638.
[2] (2014) Spectral networks and locally connected networks on graphs. CoRR.
[3] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. NIPS 29, pp. 3844–3852.
[4] (2019) Deep learning for critical infrastructure resilience. JIS 25 (2), pp. 05019003.
[5] (2022) Traffic4cast at NeurIPS 2021 – temporal and spatial few-shot transfer learning in gridded geo-spatial processes. In Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track, Vol. 176, pp. 97–112.
[6] (2021) Europe's water in figures.
[7] (2018) Large-scale learnable graph convolutional networks. In SIGKDD, pp. 1416–1424.
[8] (2021) Reconstructing nodal pressures in water distribution systems with graph neural networks. arXiv.
[9] (2017) Inductive representation learning on large graphs. In NIPS, pp. 1025–1035.
[10] (2005) Universal approximation capability of cascade correlation for structures. Neural Comput. 17 (5), pp. 1109–1159.
[11] (2000) Learning with recurrent neural networks. Lecture Notes in Control and Information Sciences, Vol. 254, Springer.
[12] (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
[13] (2022) Leak detection methods in water distribution networks: a comparative survey on artificial intelligence applications. Journal of Pipeline Systems Engineering and Practice 13 (3), pp. 04022024.
[14] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
[15] (2017) Self-normalizing neural networks. In NIPS, pp. 972–981.
[16] (2018) An overview of the water network tool for resilience (WNTR).
[17] (2013) Two-tiered sensor placement for large water distribution network models. JIS 19 (4), pp. 465–473.
[18] (2018) CayleyNets: graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67 (1), pp. 97–109.
[19] (2020) DeeperGCN: all you need to train deeper GCNs. arXiv preprint arXiv:2006.07739.
[20] (2018) Adaptive graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[21] (2017) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. arXiv.
[22] (2017) Geometric deep learning on graphs and manifolds using mixture model CNNs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124.
[23] (2022) Graph neural network and Koopman models for learning networked dynamics: a comparative study on power grid transients prediction. arXiv.
[24] (2016) Learning convolutional neural networks for graphs. In ICML, pp. 2014–2023.
[25] (2021) Artificial intelligence techniques in smart grid: a survey. Smart Cities 4 (2), pp. 548–568.
[26] (2020) EPANET 2.2 user's manual, Water Infrastructure Division. CESER.
[27] (2020) A survey on the expressive power of graph neural networks. CoRR, arXiv:2003.04078.
[28] (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
[29] (2018) Graph attention networks. ICLR.
[30] (2020) BattLeDIM: battle of the leakage detection and isolation methods. In CCWI/WDSA Joint Conference.
[31] (2022) Graph neural networks for state estimation in water distribution systems: application of supervised and semi-supervised learning. Journal of Water Resources Planning and Management 148 (5).
[32] (2018) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.