1 Introduction
Water distribution networks (alongside electricity, transport and communication networks) are a critical infrastructure component. Thus, modeling and observability issues are of paramount importance and have to be handled through increased connectivity, automation and smart metering.
In particular, pipe leakages (assimilated hereinafter with fault events) have to be detected and isolated as soon and as precisely as possible. While apparently straightforward (the leakage leads to a measurable loss of pressure), several issues conspire to increase the scheme's difficulty. Hence, sensors have to be placed to provide network-wide relevant information, while pressure and flow information is obtained either through emulation Rossman (2000) or experimentally Perez et al. (2014). Such data-driven analysis naturally leads to heuristic implementations which come with specific caveats:

heuristic methods use the data agnostically and ignore information about network structure/particularities;

network size may lead to implementations which exhaust the available resources or are bogged down by numerical artifacts.
In light of the previous remarks it is clear that the main issues are sensor placement and the subsequent fault detection and isolation (FDI) mechanism. For the former we propose a novel graph-aware Gram-Schmidt procedure and for the latter we consider a dictionary learning (DL) approach.
Specifically, we assign the faults affecting a given node to a class and train the dictionary such that its atoms discriminate between these classes. The subsequent classification of a signal in terms of the dictionary’s atoms serves as proxy for FDI. The active atoms for a certain class are seen as a fault signature which unambiguously asserts FDI (if the signature is unique w.r.t. the other possible faults).
DL Dumitrescu and Irofti (2018) is an active research topic in the signal processing community, providing machine learning algorithms that build linear models from the given (nonlinear) input data. Its applications to classification tasks Jiang et al. (2013) in general and online classification Irofti and Băltoiu (2019a) in particular provide fast and memory-efficient implementations well suited for IoT devices and for online production usage. Our previous work Irofti and Stoican (2017); Stoican and Irofti (2019) has shown encouraging results when adapting DL classification for FDI in water networks. In this paper we propose new methods that tackle large distribution networks and lift data dimensionality limitations by employing online DL strategies. Online methods process data in small batches, which translates into lower computational complexity. This incurs a small initial cost in FDI performance, which is quickly attenuated as more data gets processed.
In simulations we consider both the proof-of-concept benchmark "Hanoi network" and a generic large-scale network Muranho et al. (2012). Furthermore, we use multiple demand profiles and fault magnitudes, and discuss different sensor placement strategies and success criteria.
2 Preliminaries
A passive water network (i.e., without active elements like pumps) consists of one or more tank nodes (whose heads remain constant; in the water network parlance, "head" denotes the height of the column of water in a node w.r.t. a common ground level) which feed a collection of junction nodes through a network of interconnected pipes. From a modeling perspective, the questions are: what are the flows through the pipes, what are the heads at the junction nodes, and how do these variables depend on user demand (outflows from some or all of the junction nodes) and unexpected events (in our case, pipe leakages)?
2.1 Steady-state behavior
The dynamics of the network are usually ignored. This is a reasonable assumption as long as demand variation is slow and unexpected events (e.g., leakages) are rare. In other words, any transient-inducing event is sufficiently rare and the transients themselves are sufficiently fast such that it is a fair approximation to consider the system at equilibrium Brdys and Ulanicki (1996). Since water is incompressible, the relevant physical laws which apply are those of mass and energy conservation.
First, the inflows and outflows passing through a junction node have to balance:

\sum_{j=1}^{m} A_{ij} q_j = d_i, \quad \forall i, \qquad (1)

where q_j is the flow through pipe j, d_i is the consumption of node i and A is the (node-pipe) adjacency matrix of the network, i.e., A_{ij} takes one of the following values:

A_{ij} = \begin{cases} 1, & \text{pipe } j \text{ enters node } i, \\ -1, & \text{pipe } j \text{ exits node } i, \\ 0, & \text{pipe } j \text{ is not connected to node } i. \end{cases} \qquad (2)
Next, the empiric Hazen-Williams formula Sanz Estapé (2016) gives the head loss between nodes linked through a pipe with index k (we assume that the pipe of index k links the i-th and j-th nodes):

h_i - h_j = \frac{10.67\, L_k}{C_k^{1.852} D_k^{4.87}}\, q_k^{1.852}, \qquad (3)

where L_k is the length in m, D_k is the diameter in m and C_k is the adimensional pipe roughness coefficient; the flow q_k is measured in m^3/s.
Using (3) we express the flow q_k in terms of the associated head difference \Delta h_k = h_i - h_j:

q_k = c_k \operatorname{sign}(\Delta h_k)\, |\Delta h_k|^{0.54}, \qquad (4)

where c_k is the pipe conductivity, defined as

c_k = \left(\frac{C_k^{1.852} D_k^{4.87}}{10.67\, L_k}\right)^{0.54}. \qquad (5)
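As a quick numerical check, relations (3)–(5) can be sketched in a few lines of Python (the SI constant 10.67 and the exponent 0.54 ≈ 1/1.852 follow the standard Hazen-Williams form; the pipe values used below are made up for illustration):

```python
import math

def conductivity(L, D, C):
    """Pipe conductivity as in (5): L, D in meters, C adimensional."""
    return (C**1.852 * D**4.87 / (10.67 * L)) ** (1 / 1.852)

def flow(dh, L, D, C):
    """Flow through a pipe, as in (4), from the head difference dh (m); result in m^3/s."""
    return conductivity(L, D, C) * math.copysign(abs(dh) ** (1 / 1.852), dh)

def head_loss(q, L, D, C):
    """Hazen-Williams head loss, as in (3), for a flow q (m^3/s)."""
    return 10.67 * L * abs(q) ** 1.852 / (C**1.852 * D**4.87)
```

By construction, `head_loss(flow(dh, ...), ...)` round-trips back to `dh`, which mirrors the inversion carried out between (3) and (4).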
Noting that the k-th line of the column vector A^\top h returns the difference h_i - h_j and combining (1) with (4) leads to the nonlinear steady-state relations:

A \left( c \circ \operatorname{sign}(A^\top h + \bar A^\top \bar h) \circ |A^\top h + \bar A^\top \bar h|^{0.54} \right) = d, \qquad (6)

where c gathers the pipe conductivities and '\circ' denotes the element-wise multiplication of two vectors (i.e., the i-th element of u \circ v is u_i v_i). Note the addition of the term \bar A^\top \bar h, which describes the influence of fixed-head nodes (the tanks which feed the network).
For further use we denote with n the number of junction nodes.
2.2 Node consumption
Assuming that all the parameters of (6) are known (gathered in the left side of the equation), there still remains the node consumption d as a source of uncertainty. Historically, user demand data has been acquired sparsely or not at all. The most common approach is to consider a consumption profile (usually with a week-long period) and scale it w.r.t. the total consumption in the network:

d_i(t) = b_i\, p(t)\, F(t) + \delta_i(t), \qquad (7)

where p(t) and F(t) are the consumption profile and, respectively, the total water fed to the network at time instant t; b_i denotes the base demand for the i-th node and \delta_i(t) covers 'nominal' (those occurring under healthy functioning) uncertainties affecting the i-th node (without being exhaustive: normal user variations, seasonal and holiday variations, small leakages).
The issue of interest is how to detect and isolate a pipe leakage. First we note that a pipe leakage means in fact a loss of flow and thus a loss of head in the network's nodes. We then interpret the pipe leakage as an additive term in the node consumption value (hereafter, when we speak about isolating a leakage we refer to identifying the node directly affected by the pipe leakage; the actual leakage isolation means checking the pipes which enter that node):

d_i^{\mathrm{fault}}(t) = d_i(t) + f_i(t), \qquad (8)

where d_i(t) is the nominal consumption from (7) and f_i(t) is the additive leakage term.
For further use we consider that the profile can take values from a finite collection of profiles.
Remark 1
This means that the active profile in (7)–(8) may be unknown at measuring time. This additional source of uncertainty may hide water losses due to pipe leakages. A partial solution is to measure total water demand at times when user demand is minimal (the middle of the night). At such times, deviations due to leakages represent a larger percentage of the total consumption (w.r.t. the uncertainties due to the profile) and thus a change from the expected value may signify that leakages are present (in FDI parlance, a fault is detected).
2.3 Leakage isolation and residual generation
The issue of leakage isolation still remains. To assess the leakage events we have to compare the "healthy" (nominal) behavior, as given in (7), with the measured (and possibly faulty, as given in (8)) behavior of the network's nodes. This is done through a residual signal which is Blanke et al. (2006): i) constructed from known quantities; ii) sensitive to a fault occurrence (hereinafter, to keep with the FDI context, we denote a 'leakage event' as a 'fault occurrence'); and iii) robust, inasmuch as possible, to normal variations.
For further use we make a couple of assumptions.
Assumption 1
We consider that there are no multiple fault occurrences in the network (i.e., the network is either under nominal functioning or with a single node under fault).
Assumption 2
Without loss of generality we assume that the fault magnitude values are the same for each node and are taken from a finite collection of possible values.
For further use we consider the nodes' head as a proxy for fault occurrences and use its nominal and measured values to construct the residual signal. The following aspects are relevant:

as per Remark 1, we consider an interval T in which the nominal head values h(t) and the measured head values \tilde h(t) remain relatively constant and average over it to obtain the "steady-state" nominal/measured head values:

\bar h = \frac{1}{|T|} \sum_{t \in T} h(t), \qquad \bar{\tilde h} = \frac{1}{|T|} \sum_{t \in T} \tilde h(t); \qquad (9)
the residual may be defined in absolute or relative form, i.e.:

r^{\mathrm{abs}} = \bar h - \bar{\tilde h}, \qquad r^{\mathrm{rel}} = \frac{\bar h - \bar{\tilde h}}{\bar h} \ \text{(element-wise)}. \qquad (10)

Whenever the residuals' type (absolute or relative) is not relevant we ignore the superscripts.

assuming that the i-th node is under fault with a given magnitude and that the network functions under a given profile, we index the measured head values and the corresponding residual accordingly. (The nominal head remains constant regardless of the profile active when it was measured, so we keep the simpler, unindexed notation for it.)
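The averaging and residual construction in (9)–(10) amount to only a few lines; the sketch below is plain Python over hypothetical head values, assuming heads sampled inside a steady night-time window:

```python
def steady_head(samples):
    """Average the head samples over a steady interval, as in (9);
    samples is a list of per-node head vectors taken inside the window."""
    n = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(n)]

def residuals(h_nom, h_meas):
    """Absolute and relative residuals, as in (10), computed element-wise per node."""
    r_abs = [hn - hm for hn, hm in zip(h_nom, h_meas)]
    r_rel = [(hn - hm) / hn for hn, hm in zip(h_nom, h_meas)]
    return r_abs, r_rel
```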
For further use we gather all the residual vectors into the residual matrix R (taking all possible combinations of faulty node, fault magnitude and active profile; for large-scale networks, or if arbitrary selections of profiles, faults and fault magnitudes are considered, the number of columns, and consequently the arranging and content of R, may differ):

R = [\, \cdots \; r^{ijk} \; \cdots \,]. \qquad (11)
Remark 2
The residual ordering inside the matrix R is not essential. It was chosen to ease the grouping by fault occurrence (all cases which correspond to a certain node under fault are stacked consecutively).
The Hanoi benchmark
To illustrate the aforementioned notions we consider an often-used benchmark in the literature: the Hanoi water network Casillas et al. (2013). As seen in Fig. 1, the network characteristics are: one tank and 31 junction nodes linked through 34 pipes (each with its own length and diameter); each junction node can be affected by a leakage and some of the nodes will have sensors mounted on them.
With the network profile (which multiplies each of the junction nodes' base demand) given in Fig. 2a we simulate the nodes' head under nominal functioning for one day (with 15-minute sampling) through the EPANET software Rossman (2000), as seen in Fig. 2b.
We observe that the empiric rule from Remark 1 holds: the head values remain steady around 3 AM, thus justifying the choice of constructing the head values with information from this time interval.
Further, we consider 9 additional perturbations of the nominal profile shown in Fig. 2a (hence 10 profiles in total) through the addition of bounded uniform noise. To illustrate the fault effects we consider such an event at node 17, with fault magnitudes taken from a finite collection.
The resulting head values are shown in Fig. 3 where we can observe, as expected, that the fault affects the node’s head value.
Averaging over the 3 AM interval as in (9) to obtain the residuals (10) leads to the plots shown in Fig. 4 (we consider the absolute residual variant).
As expected, node 17 (where the fault happens) is the most affected. Still, measurable effects appear in nodes 14, 15 and 26. This is noteworthy for the subsequent fault detection and isolation analysis as it shows fault propagation throughout the network.
2.4 Problem Statement
The main idea is to detect and isolate fault occurrences (i.e., leakages) within the water network with a limited amount of information (much fewer sensors than nodes). Due to its complexity (network size, nonlinearities, demand uncertainty, etc.), the problem is divided into two consecutive steps:

the sensor placement, i.e., where to place the limited number of sensors such that the subsequent fault detection and isolation performance is maximized;

the fault isolation procedure which provides an estimation of the fault occurrences (their location within the network).
The ideas, building on Irofti and Stoican (2017), are to provide a dictionary learning framework within which to:

implement a GramSchmidt procedure which uses the gathered data to propose candidate nodes for sensor placement;

onto the reduced residual data, apply an online dictionary learning procedure which first trains a dictionary (an overcomplete basis for the residual signals) which is further used to classify test residuals into one of the predefined classes. Associating each class to a fault event means that the classification step itself becomes the fault detection and isolation mechanism.
Both elements exploit the network’s structure and manage its numerical complexities: the network’s Laplacian penalizes the sensor placement described in Section 3 and the online DL implementation described in Section 4 allows to handle large datasets (otherwise cumbersome or downright impossible through other, offline, procedures).
3 Sensor Placement
Arguably, the main difficulty in maximizing the network's observability (and, thus, improving the FDI mechanism) comes from inadequate sensor placement: no subsequent FDI mechanism (regardless of its prowess) can overcome the handicap of inadequate input data.
The problem reduces to finding a sequence of indices, with at most a given number of elements, from within the list of available node indices, such that the FDI mechanism provides the best results. As formulated, the problem has a two-layer structure: at the bottom, the FDI mechanism is designed for a certain sensor selection and, at the top, the sensor selection is updated to reach an overall optimum. The nonlinearities and large scale of the problem mean that we have to break it into its constituent parts: first the sensors are placed (based on available data and/or model information) and, subsequently, the FDI mechanism is optimized, based on the already-computed sensor placement.
While there are multiple approaches in the literature, the sensor placement problem is still largely open. One reason is that the degree to which a node is sensitive to a fault is roughly proportional to the inverse of its distance from the node under fault (particularly so for water networks, which are stable under fault and thus avoid fault-cascading behavior). Therefore, any selection strategy which does not use the entire information provided by the network is biased towards clumping sensor locations. On the other hand, analytic solutions which consider the network as a whole are computationally expensive (or downright impractical).
These shortcomings have motivated many works for various large-scale networks Kim and Wright (2018); Krause et al. (2008) as well as for water networks specifically Meseguer et al. (2014); Perelman et al. (2016); Zan et al. (2014). While varied, the approaches can be grouped into Sela and Amin (2018): i) mixed-integer or greedy procedures which solve some variation of the set cover problem; ii) evolutionary algorithms which employ heuristic methods; and iii) topology-based methods which make use of the underlying graph structure of the network.
3.1 MSC and MTC procedures
Let us consider the residual matrix R defined as in (11) (in fact, we use only a subset of columns from (11), the so-called training set, but for simplicity we abuse the notation). To this matrix corresponds the fault signature matrix S, obtained from the former by a binarization procedure Blanke et al. (2006):

S_{ij} = \begin{cases} 1, & \text{if any residual entry of node } i \text{ under active fault } j \text{ exceeds a threshold } \tau, \\ 0, & \text{otherwise.} \end{cases} \qquad (12)

(12) should be read as follows: if any of the entries of the i-th node which correspond to the active fault j are above a pre-specified threshold, then the fault is detected by the node (i.e., S_{ij} = 1). The "any" condition can be replaced with any selection criterion deemed necessary (e.g., "all entries", "the majority of the entries").
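The binarization (12) with the "any entry" criterion can be sketched as follows (the residual values and the threshold in the test are illustrative; columns are labeled by their active fault index):

```python
def signature_matrix(R, fault_of_col, n_faults, tau):
    """Fault signature matrix, as in (12): S[i][j] = 1 if any residual entry of
    node i, in a column where fault j is active, exceeds the threshold tau."""
    n_nodes = len(R)
    S = [[0] * n_faults for _ in range(n_nodes)]
    for col, j in enumerate(fault_of_col):
        for i in range(n_nodes):
            if abs(R[i][col]) > tau:
                S[i][j] = 1
    return S
```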
With S, the fault signature matrix, given as in (12), we apply the minimum set cover (MSC) procedure from Perelman et al. (2016), in a variation of the mixed-integer form appearing in Sela and Amin (2018):

\min_{x, s}\; \sum_{i \in \mathcal{N}} x_i + \lambda \sum_{j \in \mathcal{F}} s_j \qquad (13a)
\text{s.t.}\; \sum_{i \in \mathcal{N}} S_{ij} x_i + s_j \geq 1, \quad \forall j \in \mathcal{F}, \qquad (13b)
\sum_{i \in \mathcal{N}} x_i \leq n_s, \qquad (13c)
0 \leq s_j \leq 1, \quad \forall j \in \mathcal{F}, \qquad (13d)
x_i \in \{0, 1\}, \quad \forall i \in \mathcal{N}. \qquad (13e)

\mathcal{F} and \mathcal{N} denote the lists of faults and nodes, respectively (in our case, |\mathcal{F}| = |\mathcal{N}| = n). The parameter n_s limits the number of available sensors. Taking s_j = 0, (13) reduces to finding x such that (13b), (13c) and (13e) hold: (13b) ensures that each fault j is detected by at least a node i; (13c) ensures that at most n_s selections are made and (13e) ensures that the selection is unambiguous (a node is either selected, x_i = 1, or not, x_i = 0). This formulation may prove to be infeasible (there might be no node selection which permits complete fault detection), thus requiring the addition of the slack variables s_j, their constraining in (13d) and subsequent penalization in the cost (13a).
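A greedy heuristic is a common stand-in for the exact mixed-integer solution of (13); the sketch below repeatedly picks the node that detects the most still-uncovered faults (an approximation of set cover, not the MILP itself, and the signature matrix in the test is made up):

```python
def greedy_msc(S, n_s):
    """Greedy set cover over the signature matrix S, limited to n_s sensors.
    Returns the chosen node indices and the set of faults left undetected
    (the analogue of the slack variables in (13))."""
    n_nodes = len(S)
    uncovered = set(range(len(S[0])))
    chosen = []
    while uncovered and len(chosen) < n_s:
        # gain of each unselected node = number of still-uncovered faults it detects
        gains = [sum(S[i][j] for j in uncovered) if i not in chosen else -1
                 for i in range(n_nodes)]
        best = max(range(n_nodes), key=lambda i: gains[i])
        if gains[best] <= 0:
            break  # no unselected node detects any remaining fault
        chosen.append(best)
        uncovered -= {j for j in uncovered if S[best][j]}
    return chosen, uncovered
```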
As noted in Perelman et al. (2016), (13) maximizes fault detection but does not guarantee fault isolation (an ideal solution to (13) would be a single node which detects all faults; this is, obviously, unfortunate from the viewpoint of fault isolation). The solution proposed in Perelman et al. (2016), at the cost of greatly expanding the problem size, is to construct an auxiliary matrix \bar S:

\bar S_{il} = S_{ij} \oplus S_{ik}, \qquad l \leftrightarrow (j, k),\; j < k, \qquad (14)

where l is an index enumerating all distinct unordered pairs (j, k). The idea is to construct an 'artificial' fault l and decide that node i is sensitive to it (i.e., \bar S_{il} = 1) iff only one of the faults happening at j or k is detected by node i. Replacing S with \bar S, defined as in (14), in (13) leads to the minimum test cover (MTC) procedure, which maximizes fault isolation performance.
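The auxiliary matrix (14) is a simple XOR over all fault pairs; a sketch:

```python
from itertools import combinations

def extended_signature(S):
    """Extended signature matrix for MTC, as in (14): one column per unordered
    fault pair (j, k); node i is sensitive iff it detects exactly one of the two."""
    n_faults = len(S[0])
    pairs = list(combinations(range(n_faults), 2))
    return [[row[j] ^ row[k] for (j, k) in pairs] for row in S]
```

Note how the column count jumps from n_faults to n_faults·(n_faults − 1)/2, which is the problem-size expansion mentioned above.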
While other approaches exist in the literature, the MSC and MTC procedures presented above are representative in that they highlight some common issues which lead to a degradation of the subsequent FDI mechanism:

Arguably, the application of a threshold as in (12) discards potentially useful information.

The sensor placement procedures are usually either model- or data-based. Hybrid ones which make use of both are rarely encountered.
In the following subsection we propose an iterative method which combines unadulterated data (measured/simulated residual values corresponding to multiple fault occurrences) with model information (the graph structure of the network) to decide on an optimal sensor selection.
3.2 Graph-aware Gram-Schmidt procedure
Recalling that C denotes the collection of sensor node indices, we note that the row submatrix R_C will be the only data available for performing FDI. Thus we want the low-rank matrix R_C to approximate as well as possible the full-rank matrix R. Further, if we look at sensor placement as an iterative process, then for each new sensor that we place we get access to the contents of one new row of R.
Let r_i denote row i of matrix R. In order to achieve a good matrix approximation we want to make sure that, when placing a new sensor in node i, the new row contains as much new information as possible about the water network. In other words, we want the projection of r_i on the currently selected rows to be minimal:

i^\star = \arg\min_{i \notin C} \left\| P_C\, r_i \right\|_2, \qquad (15)

where P_C denotes the orthogonal projection onto the span of the already-selected rows.
In this context, the entire iterative process can be seen as a modified Gram-Schmidt orthogonalization process where we create a sequence of orthogonal vectors chosen from the set of rows, selected as in (15).
While the process induced by (15) might be good enough for matrix approximation, ignoring the water network’s structure (even if implicitly present in the numerical data gathered in the residual matrix ) is suboptimal.
Considering the underlying undirected weighted graph (via its Laplacian matrix) we are able to compute the shortest path between all nodes using Dijkstra’s algorithm Dijkstra (1959). Thus, we update (15) to take into consideration the distances from the candidate node to the nodes from the existing set to encourage a better sensor placement spread across the network.
Let \delta_i be the vector whose elements represent the distance between node i and each of the nodes from set C. The penalized row selection criterion becomes

i^\star = \arg\min_{i \notin C}\; \left\| P_C\, r_i \right\|_2 + \gamma \sum_{j \in C} \frac{1}{\delta_{ij}}, \qquad (16)

where \gamma is a scaling parameter that we further discuss in Section 5. For \gamma = 0, (16) is equivalent to (15).
The penalty mechanism works as follows: if the projection is small and the sum of distances from node i to the nodes of C is large, then the distance penalty is also small and node i is a good candidate. On the other hand, if the sum of distances is small, then the penalty grows and the possibility of selecting i decreases.
The result is a data- and topology-aware selection process that encourages a good distribution of sensors inside the network in order to facilitate FDI. We gather the instructions necessary for sensor placement in Algorithm 1.
First, we select the row whose energy is largest (step 1) and place the first sensor there (step 2). We place the normalized row in the first column of an auxiliary matrix (step 3), where we will continue to store the orthogonal vector sequence as discussed around (15); this matrix helps us with future projection computations. From this initial state, step 4 loops until we place the remaining sensors. We begin each iteration by projecting the candidate rows onto the existing selection (step 5): each resulting element represents the inner product of a candidate row with a selected node. Step 6 completes the projection by multiplying the inner products with the corresponding orthogonal vectors. Next, we store in a vector the projection norm of each candidate row (step 7). We are given the shortest path between any two nodes in a distance matrix. In step 8 we penalize the projections by summing up the inverses of the distances from the candidate node to each of the selected nodes. The node corresponding to the smallest penalized value is found (step 9) and added to the set C. Steps 11 and 12 perform the Gram-Schmidt orthogonalization process: first the redundant information is removed from the row (by subtracting the projection on the old set) and then the resulting vector is normalized and added to the orthogonal matrix.

Remark 3
The algorithm's computations are dominated by the large matrix multiplications in steps 5 and 6; the rest of the instructions require minor computational effort in comparison. The overall cost of the loop thus scales with the number of candidate rows, their length and the number of placed sensors.
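The whole selection loop can be condensed into a short sketch (pure Python and unoptimized; the residual rows and the distance matrix in the test are made up for illustration):

```python
import math

def graph_gs_placement(R, dist, n_s, gamma):
    """Graph-aware Gram-Schmidt sensor placement (sketch of Algorithm 1).
    R: one residual row per node; dist[i][j]: shortest-path distance between
    nodes i and j; returns the list of selected node indices."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    def norm(u):
        return math.sqrt(dot(u, u))
    # steps 1-3: start with the highest-energy row, normalized, as the first basis vector
    first = max(range(len(R)), key=lambda i: norm(R[i]))
    C = [first]
    Q = [[x / norm(R[first]) for x in R[first]]]
    while len(C) < n_s:
        best, best_score = None, float("inf")
        for i in range(len(R)):
            if i in C:
                continue
            coeffs = [dot(q, R[i]) for q in Q]            # steps 5-7: projection norm
            proj = math.sqrt(sum(c * c for c in coeffs))
            penalty = sum(1.0 / dist[i][j] for j in C)    # step 8: inverse distances
            score = proj + gamma * penalty                # penalized criterion (16)
            if score < best_score:
                best, best_score = i, score
        # steps 11-12: Gram-Schmidt update of the orthonormal basis
        r = list(R[best])
        for q in Q:
            c = dot(q, r)
            r = [a - c * b for a, b in zip(r, q)]
        n = norm(r)
        if n > 1e-12:
            Q.append([x / n for x in r])
        C.append(best)
    return C
```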
Remark 4
Arguably, the weights appearing in the graph Laplacian should be proportional to the head loss between two linked nodes. This is not trivial since the head loss depends nonlinearly on pipe length, diameter and roughness coefficient, see (3).
Sensor placement in the Hanoi network
Using the example from Section 2 we now consider the three methods introduced earlier (MSC, MTC and Graph-GS) to generate sensor placements. We limit ourselves to the nominal profile case and compute the fault signature matrix as in (12) for a fixed threshold. The result is illustrated in Fig. 5, where a bullet at coordinates (i, j) means that the i-th node detects the j-th fault for at least one of its magnitudes.
Applying the MSC procedure as in (13) leads to a first sensor selection. Constructing the extended signature matrix as in (14) and using it for the MTC procedure leads to a second selection. Lastly, the Graph-GS approach retrieves its own selection for the chosen parameter \gamma. In all cases we assumed the same sensor budget (note that the MSC/MTC procedures may select fewer nodes since (13c) is an inequality).
The MSC and MTC procedures are able to run for this proof-of-concept network, but the required number of binary variables increases in lockstep with the number of junction nodes for MSC and combinatorially for MTC (e.g., in this particular example, 31 and, respectively, 465 = 31·30/2). The computation times are negligible here but they increase significantly in the case of large systems (as further seen in Section 5). Lastly, by counting the cases with nonzero slack in (13) we estimate the number of fault detection errors in MSC (3 cases) and of fault isolation errors in MTC (34 cases). On the other hand, the Graph-GS procedure is much less sensitive to problem size and can easily handle large problems. Fig. 6 illustrates the selections resulting for each method (circle, bullet and 'X' symbols for MTC, MSC and Graph-GS) for sensor numbers ranging from 2 to 10.
A word of caution is in order: regardless of the method, the performance of a sensor selection is not truly validated until the FDI block is run and its output is compared with the actual fault occurrences. This is the focus of Section 4.
4 Dictionary Learning and Classification Strategies
Dictionary learning (DL) Dumitrescu and Irofti (2018) is an active field in the signal processing community, with multiple applications such as denoising, compression, super-resolution and classification. Recent studies have also shown good results when dealing with anomaly detection in general Băltoiu et al. (2020); Irofti et al. (2019); Irofti and Băltoiu (2019b), and particularly when applied to the FDI problem in water networks Irofti and Stoican (2017); Stoican and Irofti (2019).
Dictionary Learning
Starting from a set of samples Y, our aim is to find an overcomplete base D, called the dictionary, with which we can represent the data by using only a few of its columns, also called atoms. Thus we express the DL problem as

\min_{D, X}\; \|Y - D X\|_F^2 \qquad (17a)
\text{s.t.}\; \|x_l\|_0 \leq s, \quad \forall l, \qquad (17b)
\|d_j\|_2 = 1, \quad \forall j, \qquad (17c)

where X are the sparse representations corresponding to the signals. (17b) dictates that each column of Y has a sparse representation x_l that uses at most s atoms from D (i.e., it is modeled as the linear combination of at most s columns of D). (17c) is there to avoid the multiplication ambiguity between D and X (i.e., it allows us to interpret the atoms as directions that the sparse representations follow; thus the elements in X act as scaling coefficients of these directions).
Remark 5
Solving (17) is difficult because the objective is non-convex and the problem is NP-hard. Existing methods approach the problem through iterative alternate optimization techniques. First the dictionary is fixed and we find the representations X; this is called the sparse representation phase and is usually solved by greedy algorithms, among which Orthogonal Matching Pursuit (OMP) Pati et al. (1993) is a fast, performant and popular solution. Next, we fix the representations and find the dictionary. This is called the dictionary training or dictionary refinement phase and most algorithms solve it by updating one atom at a time (and sometimes also the representations using it) while fixing the rest of the dictionary. Popular routines are K-SVD Aharon et al. (2006) and Approximate K-SVD (AK-SVD) Rubinstein et al. (2008). A few iterations of the representation and dictionary training steps usually bring us to a good solution.
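A minimal OMP sketch illustrates the sparse representation phase (with a tiny normal-equations solver; this is not the optimized implementation used in practice, and the atoms in the test are toy values):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Tiny Gauss-Jordan solver for the (small) normal equations."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def omp(atoms, y, s):
    """Orthogonal Matching Pursuit: greedily pick the atom most correlated with
    the residual, then refit the coefficients by least squares on the support."""
    support, x, residual = [], [], list(y)
    for _ in range(s):
        k = max((i for i in range(len(atoms)) if i not in support),
                key=lambda i: abs(dot(atoms[i], residual)))
        support.append(k)
        A = [atoms[i] for i in support]
        G = [[dot(a, b) for b in A] for a in A]      # Gram matrix of the support
        x = solve(G, [dot(a, y) for a in A])
        approx = [sum(x[j] * A[j][t] for j in range(len(A))) for t in range(len(y))]
        residual = [yt - at for yt, at in zip(y, approx)]
    return support, x
```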
Dictionary classification
The choice of the nonzero elements of a given column of X (also called the representation support) highlights the participating atoms. These form a specific pattern which allows us to emit certain statements about the data which led to it. For example, let us assume that we can split the input data into distinct classes. Then it follows naturally that signals from a certain class will probably use a characteristic pattern more than the signals from the other classes. Thus we can classify a signal as being part of a certain class by just looking at the atoms used in its representation. In its direct form (17), dictionary learning and classification suffers (at least from the viewpoint of FDI analysis) from a couple of shortcomings:
the procedure is not discriminative: the learned dictionary may in fact lead to classifications with overlapping patterns of atom selection;

problem dimensions can grow fast as the number of network nodes increases.
Let c be the number of classes (given that the data-items in Y are already labeled, we know to which class each belongs) and let us assume, without any loss of generality, that Y is sorted and can be split into c submatrices, each containing the data-items of a single class: Y = [Y_1 \; Y_2 \; \cdots \; Y_c]. An alternative method to (17), called Label Consistent K-SVD (LC-KSVD) Jiang et al. (2013), adds extra penalties to (17) such that the atoms inherit class-discriminative properties and, at the same time, trains a classifier W to be used afterwards, together with D, to perform classification.
Let H be the data labeling matrix with its columns corresponding to the ones in Y. If y_l belongs to class i then h_l = e_i, where e_i is the i-th column of the identity matrix. Let Q be the discriminative matrix whose rows correspond to the dictionary atoms and whose columns correspond to the training signals. Column q_l has ones in the positions corresponding to the atoms associated with the class that y_l belongs to and zeros elsewhere. Usually atoms are split equally among classes. When the training signals are sorted in class order, matrix Q consists of rectangular blocks of ones arranged diagonally. LC-KSVD solves the optimization problem

\min_{D, W, A, X}\; \|Y - D X\|_F^2 + \alpha \|H - W X\|_F^2 + \beta \|Q - A X\|_F^2 \quad \text{s.t. } \|x_l\|_0 \leq s, \quad \forall l, \qquad (18)

where the first term is the DL objective (17). The second term connects the label matrix to the sparse representations through matrix W. We can view this as a separate DL problem where the training data are the labels and the dictionary W is in fact a classifier. The trade-off between small representation error and accurate classification is tuned by the parameter \alpha. The third term learns A such that it alters the sparse representations so that their support follows the atoms allocated to their corresponding class instead of the support that minimizes the representation error.
After the learning process is over, in order to classify a signal y we need to first compute its representation x with dictionary D, again by using OMP or a similar algorithm, and then find the largest entry of W x, whose position corresponds to the class that y belongs to:

\operatorname{class}(y) = \arg\max_i\; (W x)_i. \qquad (19)
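The classification step (19) itself is just an argmax over the classifier's output; a sketch (the W and x in the test are toy values):

```python
def classify(W, x):
    """LC-KSVD style classification, as in (19): the largest entry of W @ x
    (computed row by row) gives the estimated class index."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return max(range(len(scores)), key=lambda i: scores[i])
```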
Remark 6
While (18) introduces additional variables it is, qualitatively, similar to (17), as it can be reduced to a similar formulation. Indeed, (18) can be reformulated as a "composite dictionary learning problem"

\min_{D, W, A, X}\; \left\| \begin{bmatrix} Y \\ \sqrt{\alpha} H \\ \sqrt{\beta} Q \end{bmatrix} - \begin{bmatrix} D \\ \sqrt{\alpha} W \\ \sqrt{\beta} A \end{bmatrix} X \right\|_F^2 \quad \text{s.t. } \|x_l\|_0 \leq s, \quad \forall l, \qquad (20)

where W and A are learned from the data provided by H and Q, respectively. Note that in this particular instance, after the learning process A is discarded, as it only indirectly instilled discriminative properties to dictionary D.
4.1 Online learning
When dealing with large-scale distribution networks the problem dimensions explode. Typical water networks may have hundreds or even thousands of nodes. If each node represents one class, with 3 atoms per class we may end up with, e.g., a 15,000-column dictionary for 5,000 classes. Training on 30,000 signals, we end up with a 15,000 × 30,000 representation matrix. The computations involved become prohibitive on most systems. To accommodate large-scale scenarios we propose an online learning alternative.
Online DL handles one signal at a time, thus most operations become simple vector multiplications. At time t we are given a signal which we use to update the current dictionaries D, W and A. The TODDLeR algorithm Irofti and Băltoiu (2019a) adapts objective (18) for online learning using the recursive least-squares approach Skretting and Engan (2010). Besides signal classification, its goal is to learn from all incoming signals: labeled or not. TODDLeR ensures that the model is not broken through misclassification by regulating the rate of change each signal brings.
(21)
For convenience we dropped the time superscripts above. The problem does not have a closed-form solution and is solved in two steps. First we solve the (18) problem for a single vector using the first three terms in (21). As shown in Irofti and Băltoiu (2019a), this translates to updating D, W and A through a simple rank-1 update based on the signal and its sparse representation. Keeping everything fixed in (21) except for W and A, respectively, leads to the following two objectives
(22) 
(23) 
meant to temper the updates brought by the new signal to the dictionaries in the first step. Equations (22) and (23) are simple least-squares problems. By looking at them as generalized Tikhonov regularizations, it was motivated in Irofti and Băltoiu (2019a) that good parameter choices are based on the Gram matrix of the representations, which is rank-1 updated by each incoming signal.
4.2 FDI mechanism
Recall that the DL procedure classifies an input vector w.r.t. a set of a priori defined classes. Thus, assimilating a fault to a class means that the classification step of the DL procedure actually implements fault detection and isolation. The details are provided in Algorithm 2.
Considering the residual matrix from (11) and the sensor selection obtained in Section 3, we arrive at the submatrix restricted to the selected sensor rows. To this we associate the fault labels (the active fault index for each of its columns).
Step 1 of the algorithm divides the residuals and the associated labels into disjoint ‘pretrain’, ‘train’ and ‘test’ collections. These are used in steps 2, 4 and 8 to construct and respectively update the dictionary. Step 9 handles the actual FDI procedure by selecting the class best fitted to the current test residual. The class estimations are collected at step 10 and compared with the actual test fault labels in step 12 to assess the success criterion of the FDI mechanism.
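A minimal sketch of such a classification step, assuming an OMP sparse coder (Pati et al., cited in the references) and a linear classifier matrix applied to the resulting sparse code; the names, the sparsity level and the classifier form are illustrative, not the paper's exact implementation:

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit: greedily select up to k atoms of D
    to represent y; returns the sparse code x."""
    x = np.zeros(D.shape[1])
    support, residual = [], y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # best-correlated atom
        if j not in support:
            support.append(j)
        # re-fit coefficients on the current support, update the residual
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def classify(D, W, y, k=3):
    """FDI step: sparse-code the test residual y over the dictionary D,
    then pick the class whose classifier row responds strongest
    (argmax of W x, in the spirit of (19))."""
    x = omp(D, y, k)
    return int(np.argmax(W @ x)), x
```

Here `D` plays the role of the trained dictionary and `W` the class-indicator part of the model; step 10 of the algorithm would simply collect the returned class index for each test residual.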
Remark 9
Arguably, it makes sense to tweak the success criterion to account for near misses: the classification is successful not only if the correct node is identified but also if one of its neighbors is returned by the classification procedure.
Remark 10
By construction, (19) returns the index corresponding to the largest value in the classifier vector. This ignores the relative ranking of the classifiers, as it does not attach a degree of confidence to the selected index (i.e., large if there is a clear demarcation between classifications and small if the values are closely grouped).
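One hypothetical way to attach such a degree of confidence, not prescribed by the paper, is the normalized gap between the two largest classifier values:

```python
import numpy as np

def classification_confidence(scores):
    """Gap between the two largest classifier outputs, normalized by the
    largest: close to 1 for a clear demarcation between classes, close to
    0 when the values are closely grouped (a hypothetical measure)."""
    top2 = np.sort(scores)[-2:]
    return float((top2[1] - top2[0]) / (abs(top2[1]) + 1e-12))
```

Such a score could be thresholded to flag ambiguous classifications for manual inspection instead of committing to the argmax.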
Illustration of the FDI mechanism
In our experiments each network node represents one class. During DL we used an extra shared dictionary to eliminate the commonalities within class-specific atoms Dumitrescu and Irofti (2018). We allocated 3 atoms per class, which fixes the total number of dictionary atoms. An initial dictionary was obtained through pretraining on 2480 signals. Afterwards we performed online training on 2170 signals, with the regularization parameters chosen through cross-validation. With the resulting dictionaries we tested the FDI mechanism online on 4650 unlabeled signals, applying the Graph-GS method for sensor selection.
For illustration we show in Fig. 7 the full residual signal (blue line with bullet markers) corresponding to class 22 (i.e., the case where the 22nd node is under fault) and the actually-used data (red diamond markers) at the nodes with sensors.
The actual classification was done as in (19) and resulted in a classifier vector with only a few nonzero values. Clearly, the most relevant atom in the description is the one which lies in the subset corresponding to class 22. The classification produces an indicator vector whose first and second largest values are well separated, thus showing that the procedure unambiguously produces the correct response (see Remark 10).
Further, we consider not only the success rate as defined in step 12 of the FDI algorithm but also count the cases where the fault is identified in the correct node’s neighbors and in the neighbors’ neighbors (as per Remark 9). This leads to successive increases in the success rate.
Lastly, using the MSC sensor selection procedure we arrive at success rates which proved to be significantly lower than those of the Graph-GS selection method. The MTC method does not provide a solution even for this small-scale network.
5 Validation over a generic large-scale water network
Code is available at https://github.com/pirofti/ddnetonline. To illustrate the DL-FDI mechanism we test it over a large-scale generic network obtained via the EPANET plugin WaterNetGen Muranho et al. (2012). To generate the network we considered 200 junctions, 1 tank and 253 pipes interconnecting them into a single cluster, as shown in Fig. 8.
To test our algorithms we first take a nominal profile of node demands and perturb it around its nominal values. Further, we consider that fault events are denoted by nonzero emitter values in the EPANET emulator. With the notation from the FDI algorithm, we run the EPANET software to obtain residual vectors (in absolute form) as follows:

2400 = 200 × 12 pretrain residuals: under the nominal profile, for each fault event we consider 12 emitter values from a fixed set;

2400 = 200 × 12 train residuals: for each node we consider 12 random combinations of demand profile and emitter value;

3200 test residuals: we select random combinations of profile, fault and emitter values.
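The pretrain enumeration above can be sketched as follows; the 12 emitter values used here are placeholders, since the actual value set is elided in this excerpt:

```python
# Enumerate the pretrain fault scenarios: one residual per (node, emitter
# value) pair, i.e., 200 x 12 = 2400 scenarios. The emitter values below
# are hypothetical placeholders, not the paper's actual set.
nodes = range(200)
emitter_values = [0.5 * (i + 1) for i in range(12)]
pretrain_scenarios = [(n, e) for n in nodes for e in emitter_values]
assert len(pretrain_scenarios) == 2400  # matches the pretrain count
```

Each scenario would then be fed to the EPANET emulator to produce the corresponding residual vector, with the node index serving as the fault label.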
For further use we also divide the graph into communities using the community detection tool of Blondel et al. (2008). For illustration we depict in Fig. 8 three of these communities (semi-transparent blue blobs).
The first step is to apply the sensor placement algorithm to retrieve the submatrix which gathers the pretrain residuals taken at the indices corresponding to sensor placements. The result is again visible in Fig. 8, where we plotted (red circles) the first sensor selections. Note that, due to the particularities of the Graph-GS method, each lower-order sensor selection is completely included within any larger-order sensor selection, e.g., the collection obtained when selecting fewer sensors is a subset of the one obtained when selecting more.
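A generic Gram-Schmidt greedy row selection illustrating this nesting property (a sketch only: it omits the graph/Laplacian awareness of the actual Graph-GS method):

```python
import numpy as np

def greedy_gs_selection(R, k):
    """Pick k rows of the residual matrix R (rows = candidate sensor
    locations) by repeatedly taking the row with the largest norm after
    orthogonalizing all rows against the already-selected ones."""
    residual = R.astype(float).copy()
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=1)
        if selected:
            norms[selected] = -1.0        # never reselect a chosen row
        j = int(np.argmax(norms))
        selected.append(j)
        q = residual[j] / np.linalg.norm(residual[j])
        residual = residual - np.outer(residual @ q, q)  # remove q's direction
    return selected
```

Because the greedy choices are deterministic, selecting more sensors simply extends the earlier selection, which is exactly the subset property noted above.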
As a first validation we consider that each node fault is a distinct class and apply the DL-FDI mechanism described in Section 4 to detect and isolate them. We quantify the success of the scheme in three distinct ways, by counting all the cases where:

S1) the estimated node equals the actual node under fault;

S2) the estimated node is, at most, a neighbor of the node under fault (arguably this criterion imposes no performance penalty: the fault event is in fact a pipe leakage, and associating the fault with a node is a simplification usually taken in the state of the art; in reality, if a node is labelled as being faulty, the surrounding ones need to be checked anyway);

S3) the estimated node is, at most, the once-removed neighbour of the node under fault.
The three previous criteria can be interpreted as 0-, 1- and 2-distances in the network graph. Arbitrary n-distance neighbors can be considered, but their relevance becomes progressively less important.
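These distance-based checks can be sketched with repeated one-hop expansions over the adjacency matrix (a generic illustration, not the paper's code):

```python
import numpy as np

def within_k_hops(adj, u, v, k):
    """True if node v is within k hops of node u in the graph with
    adjacency matrix adj; the criteria S1/S2/S3 correspond to k = 0, 1, 2
    with u the actual faulty node and v the estimated one."""
    n = len(adj)
    reach = np.zeros(n, dtype=int)
    reach[u] = 1                                    # 0-hop: the node itself
    for _ in range(k):
        reach = np.minimum(1, reach + adj @ reach)  # expand by one hop
    return bool(reach[v])
```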
The aforementioned community partitioning is another way of solving the FDI problem: each class corresponds to a community, i.e., any fault within the community is labelled as being part of the same class. This approach leads to an additional success criterion:

S4) the estimated class corresponds to the community within which the fault appears.
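The S4 criterion can be sketched as a simple lookup, assuming a node-to-community mapping produced by the community detection step:

```python
def community_success_rate(estimated, actual, community_of):
    """S4 criterion: fraction of test cases in which the estimated and the
    actual faulty node fall in the same community (community_of maps each
    node index to its community id; a hypothetical mapping)."""
    hits = sum(community_of[e] == community_of[a]
               for e, a in zip(estimated, actual))
    return hits / len(actual)
```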
Running the FDI algorithm for a number of selected sensors (chosen as in the placement algorithm) ranging from 5 to 30, we obtain the success rates shown in Fig. 9.
In our simulations the parameters in (21) for the classification dictionaries were set via cross-validation Irofti and Băltoiu (2019a); Jiang et al. (2013); for the update regularization we initialized the parameters and proceeded with automated parameter tuning. TODDLeR was jump-started by first running LC-KSVD Jiang et al. (2013) on a small dataset in order to obtain an initial dictionary and its representations. LC-KSVD used 20 iterations of AK-SVD Rubinstein et al. (2008) to train the atoms block belonging to each class and then 50 more iterations on the entire dictionary.
Several remarks can be drawn. First, and as expected, an increase in the number of sensors generally leads to an increase in performance. Still, care should be taken with the numbers considered: we note that even a small number of sensors (5) gives a good success rate and that beyond a certain point the performance improvements taper off. Second, the classification failures appear to be ‘near-misses’, as can be seen when comparing the S1), S2) and S3) criteria. The S2) and S3) values quickly approach 100%, which means that the difference (in the topological sense) between the estimated and the actual node under fault is small. In fact, having 24 or more sensors selected means that (as illustrated by the S3) criterion) the estimated fault location is never further away than 2 nodes from the actual fault location. Reducing the number of classes as in criterion S4) significantly reduces the computation time but also leads to a marked decrease in performance (which does not appear to improve with an increase in the number of sensors).
As stated earlier, the FDI is a classification procedure which exploits the discriminative and sparsity properties of the associated dictionary.
To highlight these properties we illustrate in Fig. 10 the active dictionary atoms obtained for each of the test residuals considered (a marker at coordinates (i,j) means that the jth atom appears in the classification of the ith residual). Note that for a better illustration we reordered the test residuals such that the faults appear contiguously.
To better illustrate the approach we take the test residuals corresponding to class 140 (faults affecting node 140), which in Fig. 10 correspond to residuals indexed from 2270 to 2290, and show them in the middle inset. We note that a reduced number of atoms describes the residuals, hence confirming the sparsity of the representation. The bottom inset plots the values of the classifier for each of the considered test residuals. We note that the classification returns 3 times class 44 (misclassification) and 18 times class 140 (correct classification) – recall that, as per (19), the largest value in the classifier vector indicates the class. For this particular class the success rates are around the average shown in Fig. 9: S1) is 18/21, while S2) and S3) are 100% since node 44 is a neighbor of node 140.
The diagonal effect in Fig. 10 is the result of how the label matrix was built. Recall that its rows correspond to the atoms in the dictionary and its columns to the training signals, and that an element is set to 1 if the corresponding atom is allocated to the class of that signal; this indirectly states that each atom has to represent the signals of its class. The signals in Fig. 10 were resorted in class order, thus the indices of the class-specific atoms also change every 20 or so residuals, resulting in the ascending diagonal aspect. This is in fact a visual confirmation that our discrimination strategy worked: residuals might use atoms from the entire dictionary, but they always use at least one from their own class.
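The block structure responsible for this diagonal effect can be illustrated by constructing an LC-KSVD-style label-consistency matrix (a sketch with assumed names; the paper uses 3 atoms per class, 2 here for brevity):

```python
import numpy as np

def build_label_consistency(labels, atoms_per_class, num_classes):
    """Build a label-consistency matrix Q in the LC-KSVD style: Q[i, j] = 1
    iff atom i belongs to the class of signal j, with atoms allocated in
    contiguous blocks of atoms_per_class per class. With signals sorted by
    class, the unit blocks climb along the diagonal."""
    Q = np.zeros((num_classes * atoms_per_class, len(labels)))
    for j, c in enumerate(labels):
        Q[c * atoms_per_class:(c + 1) * atoms_per_class, j] = 1.0
    return Q
```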
6 Conclusions
We have shown that data-driven approaches can be used successfully for sensor placement and subsequent fault detection and isolation in water networks. Performing sensor placement through a Gram-Schmidt-like procedure constrained by the network Laplacian and then using the resulting sensor data for online dictionary learning has allowed us to move forward from Irofti and Stoican (2017) and tackle large networks. Adaptive learning and classification Irofti and Băltoiu (2019a) provides the benefit of a continuous integration of new data into the existing network model, be it for learning or testing purposes.
The results have shown good accuracy and pointed towards some promising directions of study, such as: network partitioning into communities, adapting online dictionary learning to further integrate the network structure (e.g., by enforcing graph smoothness Yankelevsky and Elad (2016)) and providing synergy between the three phases of placement, learning and FDI (e.g., allowing a flexible placement scheme where the learning iteration is allowed to change the sensor nodes based on current classification results).
References
Aharon, M., Elad, M., and Bruckstein, A. (2006). K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Proc. 54(11), pp. 4311–4322.
Graph anomaly detection using dictionary learning. In The 21st World Congress of the International Federation of Automatic Control, pp. 1–8.
Diagnosis and fault-tolerant control. Vol. 2, Springer.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10), P10008.
Operational control of water systems: structures, algorithms and applications. Automatica 32(11), pp. 1619–1620.
Optimal sensor placement for leak location in water distribution networks using genetic algorithms. Sensors 13(11), pp. 14984–15005.
Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik 1(1), pp. 269–271.
Dumitrescu, B. and Irofti, P. (2018). Dictionary Learning Algorithms and Applications. Springer.
Irofti, P. and Băltoiu, A. (2019a). Malware identification with dictionary learning. In 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5.
Irofti, P. and Stoican, F. (2017). Dictionary learning strategies for sensor placement and leakage isolation in water networks. In The 20th World Congress of the International Federation of Automatic Control, pp. 1589–1594.
Unsupervised dictionary learning for anomaly detection. arXiv:2003.00293.
Fraud detection in networks: state-of-the-art. arXiv:1910.11299.
Jiang, Z., Lin, Z., and Davis, L. S. (2013). Label consistent K-SVD: learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), pp. 2651–2664.
PMU placement for line outage identification via multinomial logistic regression. IEEE Transactions on Smart Grid 9(1), pp. 122–131.
Krause, A., Singh, A., and Guestrin, C. (2008). Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. Journal of Machine Learning Research 9, pp. 235–284.
A decision support system for on-line leakage localization. Environmental Modelling & Software 60, pp. 331–345.
Muranho, J., et al. (2012). WaterNetGen: an EPANET extension for automatic water distribution network models generation and pipe sizing. Water Science and Technology: Water Supply 12(1), pp. 117–123.
Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. (1993). Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In 27th Asilomar Conf. on Signals, Systems and Computers, Vol. 1, pp. 40–44.
Sensor placement for fault location identification in water networks: a minimum test cover approach. Automatica 72, pp. 166–176.
Perez, R., et al. (2014). Leak localization in water networks: a model-based methodology using pressure sensors applied to a real network in Barcelona. IEEE Control Systems 34(4), pp. 24–36.
Rossman, L. A. (2000). EPANET 2: Users Manual. US Environmental Protection Agency, Office of Research and Development, National Risk Management Research Laboratory.
Rubinstein, R., Zibulevsky, M., and Elad, M. (2008). Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical Report CS-2008-08, Technion, Haifa, Israel.
Demand modeling for water networks calibration and leak localization. Ph.D. Thesis, Universitat Politècnica de Catalunya.
Robust sensor placement for pipeline monitoring: mixed integer and greedy optimization. Advanced Engineering Informatics 36, pp. 55–63.
Skretting, K. and Engan, K. (2010). Recursive least squares dictionary learning. IEEE Trans. Signal Proc. 58(4), pp. 2121–2130.
Aiding dictionary learning through multi-parametric sparse representation. Algorithms 12(7), 131.
Identifying sets of key nodes for placing sensors in dynamic water distribution networks. Journal of Water Resources Planning and Management 134(4), pp. 378–385.
Yankelevsky, Y. and Elad, M. (2016). Dual graph regularized dictionary learning. IEEE Transactions on Signal and Information Processing over Networks 2(4), pp. 611–624.
Event detection and localization in urban water distribution network. IEEE Sensors Journal 14(12), pp. 4134–4142.