Untrained Graph Neural Networks for Denoising

A fundamental problem in signal processing is to denoise a signal. While there are many well-performing methods for denoising signals defined on regular supports, such as images defined on two-dimensional grids of pixels, many important classes of signals are defined over irregular domains such as graphs. This paper introduces two untrained graph neural network architectures for graph signal denoising, provides theoretical guarantees for their denoising capabilities in a simple setup, and numerically validates the theoretical results in more general scenarios. The two architectures differ in how they incorporate the information encoded in the graph, with one relying on graph convolutions and the other employing graph upsampling operators based on hierarchical clustering. Each architecture implements a different prior over the targeted signals. To numerically illustrate the validity of the theoretical results and to compare the performance of the proposed architectures with other denoising alternatives, we present several experimental results with real and synthetic datasets.


I Introduction

Vast amounts of data are generated and stored every day, propelling the deployment of data-driven solutions to address a wide variety of real-world problems. Unfortunately, input data often suffer from imperfections and are corrupted with noise, oftentimes associated with the data-collection process. Noisy signals appear in a gamut of applications, with examples including the processing of voice and images, the measurements in electric, social, and transportation networks, and the monitoring of biological signals. As a result, signal denoising, the process of separating the signal from the noise, is a critical and ubiquitous task in contemporary data-science applications. While most existing works focus on the denoising of signals defined over regular domains (time and space), signals with irregular supports are becoming pervasive. Hence, designing (nonlinear) denoising schemes for signals defined over irregular domains arises as a problem worth investigating. Examples of applications that benefit from reducing the amount of noise present in the data include processing signals defined over sensor networks, signals measured in different regions of the brain, or signals related to protein structures, to name a few [26].

A versatile and tractable approach to handle information supported on irregular domains is to represent the structure of the domain as a graph, with nodes representing variables and edges encoding levels of similarity, influence, or statistical dependence among nodes. Successful examples of this approach can be found in the subareas of network analytics, machine learning over graphs, and graph signal processing (GSP) [17, 18, 9], with graph neural networks (GNNs) and GSP being particularly relevant for the architectures presented in this paper [42, 1]. Since traditional data-processing architectures may incur difficulties learning the more complex structure present in many contemporary applications, GSP provides a principled approach to handle this issue [42, 9, 26]. Assuming that the structure of the signals can be modeled by a graph, GSP uses the information encoded in the graph topology to analyze, process, and learn from the data. As a result, it is not surprising that GSP has been successfully applied to design and analyze GNNs [1, 35, 50, 12], a class of neural network (NN) architectures that incorporate the graph topology information to enhance their performance when the data is composed of signals defined over a graph.

The importance of leveraging the graph influence when using deep non-linear architectures is reflected in the wide range of GNNs that co-exist in the literature, including graph convolutional NNs (GCNNs) [33, 38, 19], graph recurrent NNs [6], graph autoencoders [45, 30, 29], [46, 20], or simplicial NNs [32, 37, 31], to name a few. Incorporating the graph structure into deep non-linear models involves a wide range of options when designing the architecture. For example, GCNNs can be defined with or without pooling layers, and the convolution over a graph can be implemented in several ways (vertex vs. frequency), each leading to an architecture with different properties and performance. In fact, one of the key questions when designing a GNN is to decide the particular way in which the graph is incorporated into the architecture.

Motivated by the previous discussion, the goal of this work is twofold. First, we propose new graph-based NN architectures to denoise (smooth) graph signals, with the difference between the architectures residing in how they incorporate the information encoded in the graph. Second, we provide theoretical guarantees for the denoising capabilities of this approach, and show that it is directly influenced by the properties of the underlying graph. The mathematical analysis, performed on particular instances of these architectures, provides guarantees on their denoising performance under specific assumptions for the original signal and its underlying graph. In addition, we numerically validate the denoising performance of our method for more general scenarios than those covered by our theory, illustrating that the proposed graph-aware untrained architectures can effectively denoise graph signals.

Since the presented architectures are untrained NNs, only one noisy observation is needed to recover the original signal and no training data is used. The underlying assumption is that, due to their architecture, the NNs are capable of learning the structure of the original signal faster than the noise. Hence, the denoising process for each observed signal is carried out by fitting the weights for a few iterations. This same phenomenon has been observed to hold true in non-graph deep learning architectures. In the context of denoising, the optimization of the overparametrized architecture is stopped early, so that overfitting to the noise is avoided.

To incorporate the topology of the graph, the first architecture multiplies the input at each layer by a fixed (non-learnable) graph filter [40], which can be seen as a generalization of a (low-pass) message-passing operation. The second architecture performs graph upsampling operations to progressively increase the size of the input until it matches the size of the observed signal. The upsampling operators are based on hierarchical clustering algorithms [16, 2, 3, 30] so that, in contrast with [10], matrix inversions are not required, avoiding the related numerical issues. Our work is substantially different from [30, 29], which deal with graph encoder-decoder architectures. Beyond our theoretical analysis and extensive numerical simulations, additional differences from prior work are that: (a) our graph decoder is an untrained network, and thus it does not need a training phase; and (b) we only require a decoder-like architecture for denoising graph signals, so it is not necessary to jointly design and train two different architectures as done in [30, 29].

Contributions and outline. In summary, the contributions of the paper are the following: (i) we present two new overparametrized and untrained GNNs for solving graph-signal denoising problems; (ii) we conduct a mathematical analysis of each architecture, offering bounds on their performance and improving our understanding of non-linear architectures and of the influence of incorporating graph structure into NNs; and (iii) the proposed architectures are evaluated and compared to other denoising alternatives through numerical experiments carried out with synthetic and real-world data.

The remainder of the paper is organized as follows. Section I-A reviews related works dealing with graph-signal denoising. Section II explains fundamental concepts leveraged along the paper. Section III formally introduces the problem at hand and presents our general approach. Sections IV and V detail the proposed architectures and provide the mathematical analysis for each of them. Numerical experiments are presented in Section VI and concluding remarks are provided in Section VII.

I-A Related works

Untrained NNs are a family of architectures that, by carefully incorporating prior information about the signals into the architecture, enable the recovery of signals without the need for training over large (or any) datasets [44, 22, 14, 13]. In [44], it is shown that fitting a standard convolutional autoencoder to only one noisy signal using early stopping enables the effective denoising of an image. For this approach to work, it is critical that the signal class (images) matches the NN architecture (a two-dimensional convolutional NN with particular filters).

Previous approaches to the graph-signal denoising task included a graph-regularization term that promoted desired properties on the estimated signals [4]. Some existing works minimize the graph total variation, pushing the signal values at neighboring nodes to be close [4, 47]. A related approach assumes that the signals are smooth on the graph and adds a regularization term based on the quadratic form of the graph Laplacian [27]. Also, in [25], the authors propose a spectral graph trilateral filter as a regularizer, based on the prior assumption that the gradient is smooth over the graph. Although these alternatives rely on imposing some notion of smoothness on the original graph signal, the actual relation between the signal and the graph may be of a different nature. Furthermore, the actual prior may be more complex than that represented by linear and quadratic terms.

More recently, non-linear solutions for denoising graph signals have been proposed to tackle the aforementioned issues. In [43], a median graph filter [41, 39] is used to denoise a set of time-varying graph signals defined over dynamic graphs. The idea is to use a smooth non-linear (median) operator that combines values of neighboring nodes, leveraging both spatial and temporal adjacency relations. A different non-linear approach is followed in [10], where a graph autoencoder is trained to recover the denoised signals. To change the size of the graph, the autoencoder relies on Kron reduction operations [11]. However, since the Kron reduction is based on the inverse of a submatrix of the graph Laplacian, it may run into numerical issues when the submatrix is singular. Moreover, both approaches need several observations to recover the noiseless signals: the median graph filter approach is constrained to the case where a time series of graph signals is given, while the autoencoder needs a sufficiently high number of observations to train its network parameters before being able to denoise the observed signals.

II Processing architectures for graph signals

In this section, we introduce notation, present the fundamentals of GSP, and discuss GNNs. A key theme throughout this section is to formalize how the properties of a given signal depend on the supporting graph, which is critical for our denoising methods.

II-A Fundamentals of GSP

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote an undirected graph (although our theoretical results assume that the graph is undirected, the architectures and algorithms proposed in this paper can tackle signals defined on directed graphs [21]), where $\mathcal{V}$ is the set of nodes with cardinality $N$, and $\mathcal{E}$ is the set of links such that $(i,j)$ belongs to $\mathcal{E}$ if nodes $i$ and $j$ are connected. The set $\mathcal{N}_i$ denotes the neighborhood of node $i$. For a given graph $\mathcal{G}$, the (sparse) symmetric adjacency matrix $A$ has non-zero entries $A_{ij}$ only if $(i,j) \in \mathcal{E}$. If $\mathcal{G}$ is unweighted, the entries of $A$ are binary; if the graph is weighted, then the value of $A_{ij}$ captures the strength of the link between nodes $i$ and $j$. In this paper, we focus on the processing of graph signals, which are signals defined on the node set $\mathcal{V}$. Graph signals can be represented as a vector $x \in \mathbb{R}^N$, where the $i$-th entry represents the value of the signal at node $i$. Since the signal $x$ is defined on $\mathcal{G}$, the core assumption of GSP is that the properties of $x$ depend on the topology of $\mathcal{G}$. For instance, consider a graph that encodes similarity. If the value of $A_{ij}$ is high, then one expects the signal values $x_i$ and $x_j$ to be similar or closely related.

Graph-shift operator (GSO). The GSO $S$ is defined as an $N \times N$ matrix whose entry $S_{ij}$ can be non-zero only if $i = j$ or $(i,j) \in \mathcal{E}$. Common choices for $S$ are the adjacency matrix $A$, or its degree-normalized alternative $\bar{A} = D^{-1/2} A D^{-1/2}$, where $D = \mathrm{diag}(A\mathbf{1})$ is the degree matrix, $\mathbf{1}$ is the vector of all ones, and $\mathrm{diag}(\cdot)$ is the diagonal operator that turns a vector into a diagonal matrix. Another common choice is the combinatorial graph Laplacian, defined as $L = D - A$ [42, 9]. The GSO accounts for the topology of the graph and, at the same time, represents a linear transformation that can be computed locally. Specifically, if $y$ is defined as $y = Sx$, then node $i$ can compute $y_i$ provided that it has access to the values of $x$ at its neighbors $j \in \mathcal{N}_i$. We also assume that the GSO is diagonalizable, so that there exists an orthonormal matrix $V$ and a diagonal matrix $\Lambda$, both of size $N \times N$, such that $S = V \Lambda V^T$.

Graph filtering. Graph filters, an important tool of GSP, are linear operators that can be expressed as a polynomial of the GSO of the form

$$ H := \sum_{m=0}^{M-1} h_m S^m, \qquad (1) $$

where $H$ is the graph filter and $\{h_m\}_{m=0}^{M-1}$ are the graph filter coefficients [40]. Since $S^m$ encodes the $m$-hop neighborhoods of the graph $\mathcal{G}$, graph filters can be used to diffuse input graph signals across the graph as $y = Hx$. Because graph filters are capable of diffusing signals across $(M-1)$-hop neighborhoods, they are widely used to generalize the convolution operation to signals defined over graphs. Furthermore, since the graph filter $H$ is a polynomial in the GSO $S$, it follows that both matrices have the same eigenvectors $V$.
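To make the filtering operation concrete, the following NumPy sketch builds a polynomial graph filter as in (1) and diffuses a delta signal over it. The 4-node path graph, the choice of the adjacency matrix as GSO, and the filter taps are illustrative assumptions, not values from the paper.

```python
import numpy as np

def graph_filter(S, h):
    """Build the polynomial filter H = sum_m h[m] * S^m of Eq. (1)."""
    H = np.zeros_like(S)
    Sm = np.eye(S.shape[0])          # S^0 = I
    for hm in h:
        H += hm * Sm
        Sm = Sm @ S
    return H

# Toy 4-node path graph, using the adjacency matrix as the GSO.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
h = [1.0, 0.5, 0.25]                 # M = 3 filter taps, fixed a priori
H = graph_filter(A, h)

# Diffuse a delta signal: with M = 3 taps, the output only reaches
# nodes within 2 hops of the source node.
x = np.array([1.0, 0.0, 0.0, 0.0])
y = H @ x
```

Note that `H[0, 3]` is exactly zero: node 3 is three hops away from node 0, beyond the reach of a degree-2 polynomial of the GSO.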

Frequency representation. The frequency domain of graph signals and filters is determined by the eigendecomposition of the GSO. More precisely, the frequency representation of the graph signal $x$ is given by the $N$-dimensional vector $\tilde{x} = V^T x$, with $V^T$ acting as the graph Fourier transform (GFT) [34]. Similarly, the frequency response of the graph filter $H$ can be defined as the $N$-dimensional vector $\tilde{h}$ collecting the eigenvalues of $H$ [34, 40].

A graph signal (filter) is said to be bandlimited (low-pass) if its frequency-domain representation satisfies $\tilde{x}_k = 0$ for $k > K$, where $K$ is referred to as the bandwidth of the signal $x$. If $x$ is bandlimited with bandwidth $K$, it holds that

$$ x = V_K\, \tilde{x}_K, \qquad (2) $$

with $\tilde{x}_K$ collecting the $K$ active frequency components and $V_K$ collecting the corresponding $K$ eigenvectors. In other words, the bandlimited representation states that the original $N$-dimensional signal lies in a subspace of reduced dimensionality related to the spectrum of the graph. This reduced-dimensionality representation, which can be generalized to graph filters as well, has been shown to bear practical relevance in real-world datasets and can be exploited in denoising and other inverse problems [5].
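As an illustration of the bandlimited model in (2), the sketch below (the random toy graph and the sizes $N = 20$, $K = 3$ are our own choices) synthesizes a $K$-bandlimited signal from the leading eigenvectors of a symmetric GSO and verifies that projecting onto that subspace recovers it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small graph: symmetric binary adjacency used as the GSO.
N, K = 20, 3
A = np.triu((rng.random((N, N)) < 0.3).astype(float), 1)
S = A + A.T

# Eigendecomposition S = V Lambda V^T (V orthonormal since S is symmetric).
lam, V = np.linalg.eigh(S)
V_K = V[:, np.argsort(-lam)[:K]]     # K leading eigenvectors

# K-bandlimited signal x = V_K x_tilde_K, cf. Eq. (2).
x_tilde_K = rng.standard_normal(K)
x = V_K @ x_tilde_K

# x lies in the K-dimensional subspace span(V_K): projecting onto it
# recovers the signal exactly.
x_proj = V_K @ (V_K.T @ x)
```

The projection residual is zero by construction, which is precisely the low-dimensional structure the denoising architectures exploit.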

II-B Fundamentals of GNNs

Generically, we represent a GNN using a parametric non-linear function $f_\Theta(\cdot|\mathcal{G})$ that depends on the graph $\mathcal{G}$. The parameters of the architecture are collected in $\Theta$, and the matrix $Z$ represents the input of the network. Although there are many possibilities for defining a specific GNN, a broad range of such architectures can be represented by recursively applying a graph-aware linear transformation followed by an entry-wise non-linearity. Then, a generic deep architecture with $L$ layers can be described as

$$ \hat{Y}^{(\ell)} = T^{(\ell)}_{\Theta^{(\ell)}}\{ Y^{(\ell-1)} \,|\, \mathcal{G} \}, \quad 1 \le \ell \le L, \qquad (3) $$
$$ Y^{(\ell)}_{ij} = g^{(\ell)}\big( \hat{Y}^{(\ell)}_{ij} \big), \quad 1 \le \ell \le L, \qquad (4) $$

where $Y^{(0)} = Z$ and $Y^{(L)}$ denote the input and output of the architecture, $T^{(\ell)}_{\Theta^{(\ell)}}$ is a graph-aware linear transformation performed at layer $\ell$, $\Theta^{(\ell)}$ are the parameters that define such a transformation, and $g^{(\ell)}$ is a scalar nonlinear transformation (e.g., a ReLU function), which is oftentimes omitted in the last layer. Moreover, $N^{(\ell)}$ and $F^{(\ell)}$ represent the number of nodes and features at layer $\ell$, $\Theta = \{\Theta^{(\ell)}\}_{\ell=1}^{L}$ collects all the parameters of the architecture, and $Y^{(L)}$ denotes the output of the GNN. Note that although the function $f_\Theta$ has been introduced as generating output signals defined in $\mathbb{R}^N$, which is the case of interest for this paper, it can be easily adapted to output graph signals with more than one feature.

III GNNs for graph-signal denoising

We now formally introduce the problem of graph-signal denoising within the GSP framework, and present our approach to tackle it using untrained GNN architectures. Given the graph $\mathcal{G}$, let us consider the observed graph signal $x \in \mathbb{R}^N$, which is a noisy version of the original graph signal $x_0$. With $n$ being a noise vector, the relation between $x$ and $x_0$ is

$$ x = x_0 + n. \qquad (5) $$

Then, the goal of graph-signal denoising is to remove as much noise as possible from the observed signal $x$ to estimate the original signal $x_0$, which is accomplished by exploiting the information encoded in $\mathcal{G}$.

A traditional approach for the graph-signal denoising task is to solve an optimization problem of the form

$$ \hat{x}_0 = \operatorname*{arg\,min}_{x_0} \; \| x - x_0 \|_2^2 + \alpha\, \mathcal{R}(x_0 | \mathcal{G}). \qquad (6) $$

The first term promotes fidelity to the signal observations, the regularizer $\mathcal{R}(\cdot|\mathcal{G})$ promotes denoised signals with desirable properties over the given graph $\mathcal{G}$, and the parameter $\alpha$ controls the influence of the regularization. Common choices for the regularizer include the quadratic form of the graph Laplacian [27], or regularizers involving high-pass graph filters that foster smoothness in the estimated signal.
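For the quadratic Laplacian regularizer, problem (6) is convex and admits a closed-form solution: setting the gradient of $\|x - x_0\|_2^2 + \alpha\, x_0^T L x_0$ to zero gives $\hat{x}_0 = (I + \alpha L)^{-1} x$. The sketch below implements this classical baseline; the toy graph, noise level, and value of $\alpha$ are illustrative assumptions.

```python
import numpy as np

def laplacian_denoise(x, A, alpha):
    """Closed-form solution of (6) with the quadratic Laplacian
    regularizer R(x0|G) = x0^T L x0:  x_hat = (I + alpha L)^{-1} x."""
    L = np.diag(A.sum(axis=1)) - A           # combinatorial Laplacian L = D - A
    return np.linalg.solve(np.eye(A.shape[0]) + alpha * L, x)

# Toy graph and a perfectly smooth (constant) original signal.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])
rng = np.random.default_rng(1)
x0 = np.ones(4)                              # constant, hence maximally smooth
x = x0 + 0.3 * rng.standard_normal(4)        # noisy observation
x_hat = laplacian_denoise(x, A, alpha=10.0)
```

Because $L\mathbf{1} = 0$, a constant signal passes through the estimator unchanged, while the non-constant (high-frequency) noise components are shrunk by factors $1/(1+\alpha\lambda_i)$.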

While those traditional approaches exhibit a number of advantages (including interpretability, mathematical tractability, and convexity), they may fail to capture more complex relations between $x_0$ and $\mathcal{G}$, motivating the development of non-linear graph-denoising approaches.

As summarized in Algorithm 1, in this paper we advocate handling the graph-signal denoising task by employing an overparametrized GNN $f_\Theta(\cdot|\mathcal{G})$ as described in (3)-(4). The weights of the architecture, collected in $\Theta$, are learned by minimizing the loss function

$$ \mathcal{L}(x, \Theta) = \tfrac{1}{2}\, \| x - f_\Theta(Z|\mathcal{G}) \|_2^2, \qquad (7) $$

applying stochastic gradient descent (SGD) and regularizing it with early stopping to avoid overfitting the noise. The entries of the parameters $\Theta$ and of the input matrix $Z$ are initialized at random using iid zero-mean Gaussian distributions, and the weights learned after a few iterations of fitting the observation $x$ are denoted as $\hat{\Theta}(x)$. Note that $Z$ is fixed to its random initialization. Finally, the denoised graph signal estimate is computed as

$$ \hat{x}_0 = f_{\hat{\Theta}(x)}(Z|\mathcal{G}). \qquad (8) $$

The intuition behind this approach is as follows: since the architecture is overparametrized, it can in principle fit any signal, including noise. However, as shown later both theoretically and empirically, the proposed architectures fit graph signals faster than noise, so with early stopping they fit most of the signal and little of the noise, enabling signal denoising.

Regarding the specific implementation of the untrained network $f_\Theta(\cdot|\mathcal{G})$, there are multiple possibilities for selecting the linear and non-linear transformations $T^{(\ell)}_{\Theta^{(\ell)}}$ and $g^{(\ell)}$ defined in equations (3) and (4), respectively. Since we are dealing with the denoising of analog signals, we set the entrywise non-linearity to be the ReLU operation, defined as $\mathrm{ReLU}(y) = \max(0, y)$, and focus on the design of the linear transformation, which is responsible for incorporating the structure of the graph. The two following sections postulate the implementation of two particular linear transformations (each giving rise to a different GNN) and analyze the resulting architectures.
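The fitting procedure of (7)-(8) can be sketched end to end in NumPy. The snippet below uses a two-layer graph-convolutional network with a fixed one-hop low-pass filter, manually coded gradient descent on the loss (7), and a fixed small iteration budget as a stand-in for early stopping. All sizes, the filter, the learning rate, and the noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, STEPS, LR = 30, 64, 300, 1e-3

# Fixed low-pass filter H = 0.5 I + 0.5 D^{-1/2} A D^{-1/2} on a random graph.
A = np.triu((rng.random((N, N)) < 0.2).astype(float), 1)
A = A + A.T
d = np.maximum(A.sum(1), 1.0)
H = 0.5 * np.eye(N) + 0.5 * A / np.sqrt(np.outer(d, d))

# Noisy observation of a smooth (twice low-pass filtered) signal.
x0 = H @ H @ rng.standard_normal(N)
x0 /= np.linalg.norm(x0)
x = x0 + 0.1 * rng.standard_normal(N)

# Untrained two-layer network f = ReLU(H Z W) w with random fixed input Z.
Z = rng.standard_normal((N, F))
W = rng.standard_normal((F, F)) / np.sqrt(F)   # learnable
w = rng.standard_normal(F) / np.sqrt(F)        # learnable

B = H @ Z                                      # fixed throughout fitting
losses = []
for _ in range(STEPS):                         # early stopping: small budget
    U = B @ W
    R = np.maximum(U, 0.0)                     # ReLU activations
    f = R @ w
    r = f - x                                  # residual
    losses.append(0.5 * r @ r)                 # loss (7)
    # Manual gradients of the loss w.r.t. W and w.
    dU = np.outer(r, w) * (U > 0)
    W -= LR * (B.T @ dU)
    w -= LR * (R.T @ r)

x_hat = np.maximum(B @ W, 0.0) @ w             # denoised estimate, cf. (8)
```

Only `W` and `w` are updated; the filter `H` and the input `Z` stay fixed at their initialization, matching the untrained-network setup.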

IV Graph convolutional generator

Our first architecture to address the graph-signal denoising task is a graph-convolutional generator (GCG) network that incorporates the topology of the graph into the NN pipeline via vertex-based graph convolutions. To formally define the GCG architecture, we select the normalized adjacency matrix $\bar{A}$ as the GSO $S$. Then, leveraging the fact that convolutions of a graph signal in the vertex domain can be represented by a graph filter [40], we define the linear transformation for the convolutional generator as (cf. (3))

$$ T^{(\ell)}_{\Theta^{(\ell)}}\{ Y^{(\ell-1)} \,|\, \mathcal{G} \} = H\, Y^{(\ell-1)}\, \Theta^{(\ell)}. \qquad (9) $$

Remember that the matrix $\Theta^{(\ell)}$ collects the learnable weights of the $\ell$-th layer, and that the graph filter $H$ is given by (1), with its coefficients fixed a priori so that $H$ is a low-pass graph filter. Using the linear transformation defined in (9), the output of the GCG with $L$ layers is given by the recursion

$$ Y^{(\ell)} = \mathrm{ReLU}\big( H\, Y^{(\ell-1)}\, \Theta^{(\ell)} \big), \quad \text{for } \ell = 1, \dots, L-1, \qquad (10) $$
$$ y^{(L)} = H\, Y^{(L-1)}\, \Theta^{(L)}, \qquad (11) $$

where $Y^{(0)} = Z$ and the ReLU is not applied in the last layer of the architecture.

With the proposed linear transformation, the GCG learns to combine the features within each node by fitting the weights of the matrices $\Theta^{(\ell)}$, while the graph filter $H$ interpolates the signal by mixing features from neighboring nodes. Therefore, since $H$ is a low-pass graph filter, the GCG promotes smooth outputs and, thus, a smooth denoised estimate $\hat{x}_0$. In addition, for a given layer, despite the linear mapping acting on all $N$ nodes and all features, we limit the degrees of freedom by imposing a Kronecker structure, so that only the $F^{(\ell-1)} F^{(\ell)}$ entries of $\Theta^{(\ell)}$ together with the fixed filter coefficients are involved (cf. (9)), and only the former need to be learned since $H$ is given.

Although we define the GCG using a graph convolutional layer, there is an important difference when comparing it with other GCNNs. In some GCNNs, the parameters of the graph filter are learned, but in the proposed architecture the graph filter is fixed so that it promotes desired properties on the estimate $\hat{x}_0$. Moreover, from the polynomial definition of $H$ in (1), it can be noted that the fixed graph filter may be interpreted as a generalization of the message-passing procedure [8], a typical approach for performing graph convolutions in NNs.
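The forward recursion (10)-(11) can be sketched directly. In the snippet below, the ring graph, the layer sizes, and the one-hop low-pass filter $H = 0.5 I + 0.5 \bar{A}$ are illustrative assumptions; the ReLU is applied after every layer except the last.

```python
import numpy as np

def gcg_forward(H, Z, Ws):
    """GCG forward pass, Eqs. (10)-(11): ReLU hidden layers,
    linear (no ReLU) last layer."""
    Y = Z
    for W in Ws[:-1]:
        Y = np.maximum(H @ Y @ W, 0.0)   # Eq. (10)
    return H @ Y @ Ws[-1]                # Eq. (11)

rng = np.random.default_rng(0)
N, F, L = 12, 8, 3

# Ring graph; every node has degree 2, so A / 2 is degree-normalized.
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
H = 0.5 * np.eye(N) + 0.25 * A           # H = 0.5 I + 0.5 (A / 2)

Z = rng.standard_normal((N, F))          # random fixed input
Ws = [rng.standard_normal((F, F)) / np.sqrt(F) for _ in range(L - 1)]
Ws.append(rng.standard_normal((F, 1)) / np.sqrt(F))

y = gcg_forward(H, Z, Ws)                # output graph signal, shape (N, 1)
```

With a single weight matrix the hidden loop is empty and the network reduces to the linear map $H Z \Theta$, which is a handy sanity check on the recursion.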

In the remainder of the section, we adopt some simplifying assumptions to provide theoretical guarantees on the denoising capability of the GCG, and then we rely on numerical evaluation to demonstrate that the results also hold in more general settings.

IV-A Guaranteed denoising with the GCG

To formally prove that the proposed architecture can successfully denoise the observed graph signal $x$, we consider a two-layer GCG given by

$$ f_\Theta(Z|\mathcal{G}) = \mathrm{ReLU}\big( H Z \Theta^{(1)} \big)\, \theta^{(2)}, \qquad (12) $$

where $\Theta^{(1)}$ and $\theta^{(2)}$ are the learnable coefficients. With $F$ denoting the number of features, we consider the overparametrized regime where $F \gg N$, and analyze the behavior and performance of denoising with the untrained network defined in (12).

We start by noting that scaling the $i$-th entry of $\theta^{(2)}$ is equivalent to scaling the $i$-th column of $\Theta^{(1)}$, so that, without loss of generality, we can set the weights of the last layer to $\theta^{(2)} = b$, where $b$ is a fixed vector of size $F$ with half of its entries set to a positive constant and the other half to its negative. Furthermore, since $Z$ is a random matrix, its column space spans $\mathbb{R}^N$ and, hence, minimizing over the product $Z\Theta^{(1)}$ is equivalent to minimizing over a free matrix $\Theta$. With these considerations in place, the optimization over (7) can be equivalently performed replacing the two-layer GCG described in (12) with its simplified form

$$ f_\Theta(H) = f_\Theta(Z|\mathcal{G}) = \mathrm{ReLU}(H\Theta)\, b. \qquad (13) $$

Note that we replaced $f_\Theta(Z|\mathcal{G})$ with $f_\Theta(H)$ since the graph influence is modeled by the graph filter $H$, and the influence of the matrix $Z$ is absorbed by the learnable weights $\Theta$.

The denoising capability of the two-layer architecture is related to the eigendecomposition of its expected squared Jacobian [14]. However, to understand which signals can be effectively denoised with the proposed architecture, we need to connect the spectral domain of the expected squared Jacobian with the spectrum of the graph, given by the eigenvectors of the GSO.

To that end, we next compute the expected squared Jacobian of the two-layer architecture in (13). Denote by $J_\Theta(H)$ the Jacobian matrix of $f_\Theta(H)$ with respect to $\Theta$, whose transpose is given by

$$ J_\Theta^T(H) = \begin{bmatrix} b_1 H^T \mathrm{diag}\big( \mathrm{ReLU}'(H\theta_1) \big) \\ \vdots \\ b_F H^T \mathrm{diag}\big( \mathrm{ReLU}'(H\theta_F) \big) \end{bmatrix} \in \mathbb{R}^{NF \times N}, \qquad (14) $$

where $\theta_i$ represents the $i$-th column of $\Theta$, and $\mathrm{ReLU}'$ is the derivative of the ReLU, which is the step function. Then, define the expected squared Jacobian matrix as

$$ X := \mathbb{E}_\Theta\big[ J_\Theta(H)\, J_\Theta^T(H) \big]. \qquad (15) $$

Taking the expectation of (14) with respect to the parameters $\Theta$, and leveraging the results from [7, Section 3.2], we obtain that the matrix $X$ is given by

$$ X = \tfrac{1}{2} \left( \mathbf{1}\mathbf{1}^T - \tfrac{1}{\pi} \arccos\!\big( C^{-1} H^2 C^{-1} \big) \right) \odot H H^T, \qquad (16) $$

where $\odot$ represents the Hadamard (entry-wise) product, $\arccos(\cdot)$ is computed entry-wise, $h_i$ represents the $i$-th column (row) of the symmetric filter $H$, and $C := \mathrm{diag}(\|h_1\|_2, \dots, \|h_N\|_2)$ is a normalization term, so that $C^{-1} H^2 C^{-1}$ is the normalized autocorrelation of the graph filter $H$.

Since $X$ is symmetric and positive (semi-)definite, it has an eigendecomposition $X = W \Sigma W^T$. Here, the columns of the orthonormal matrix $W = [w_1, \dots, w_N]$ are the eigenvectors, and the nonnegative eigenvalues $\{\sigma_i^2\}_{i=1}^N$ in the diagonal matrix $\Sigma$ are assumed to be ordered as $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_N^2$.
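Formula (16) can be evaluated numerically. The sketch below (the toy ring graph and the filter are our own choices) computes $X$ and checks two consequences of the expression: $X$ is symmetric positive semi-definite, as an expected squared Jacobian must be, and since $\arccos(1) = 0$ its diagonal equals half the diagonal of the filter autocorrelation $HH^T$.

```python
import numpy as np

def expected_sq_jacobian(H):
    """X of Eq. (16): X = 0.5 (1 1^T - arccos(C^{-1} H^2 C^{-1}) / pi) ⊙ H H^T,
    with C = diag(||h_1||, ..., ||h_N||) so the arccos argument is a
    correlation matrix with unit diagonal."""
    G = H @ H.T                                    # H symmetric, so G = H^2
    c = np.sqrt(np.diag(G))
    corr = np.clip(G / np.outer(c, c), -1.0, 1.0)  # guard rounding outside [-1, 1]
    N = H.shape[0]
    return 0.5 * (np.ones((N, N)) - np.arccos(corr) / np.pi) * G

# X for a simple low-pass filter on a toy ring graph.
N = 10
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
H = 0.5 * np.eye(N) + 0.25 * A

X = expected_sq_jacobian(H)
sigma2, W = np.linalg.eigh(X)          # eigendecomposition X = W Sigma W^T
```

The rows of $H$ are nonzero here (the filter has a nonzero diagonal), so the normalization $C$ is always invertible in this example.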

After defining the two-layer GCG and its expected squared Jacobian $X$, we formally analyze its performance when denoising bandlimited graph signals. This is particularly relevant given the importance of (approximately) bandlimited graph signals from both analytical and practical points of view [9]. For the sake of clarity, we first introduce the main result (Theorem 1) and then detail a key intermediate result (Lemma 1) that provides additional insight.

Formally, consider the $K$-bandlimited graph signal $x_0$ as described in (2), and let the architecture have a sufficiently large number of features $F$:

$$ F \ge \left( \frac{\sigma_1^2}{\sigma_N^2} \right)^{\!2} 6\, \xi^{-8} N, \qquad (17) $$

where $\xi$ is a prespecified error tolerance parameter. Then, for a specific set of graphs that is introduced later in the section (cf. Assumption 1), if we solve (7) running gradient descent with a step size $\eta$, the following result holds (see Appendix A).

Theorem 1.

Let $f_\Theta(H)$ be the network defined in equation (13), and assume it is sufficiently wide, i.e., it satisfies condition (17) for some error tolerance parameter $\xi$. Let $x_0$ be a $K$-bandlimited graph signal spanned by the eigenvectors $V_K$, and let $w_i$ and $\sigma_i^2$ be the $i$-th eigenvector and eigenvalue of $X$. Let $n$ be the noise present in $x$, and set $\delta$ and $\epsilon$ to small positive numbers. Then, for $N$ large enough, the error for each iteration $t$ of gradient descent with stepsize $\eta$ used to fit the architecture is bounded as

$$ \| x_0 - f_{\Theta^{(t)}}(H) \|_2 \le \left( (1-\eta\sigma_K^2)^t + \delta\, (1-\eta\sigma_N^2)^t \right) \| x_0 \|_2 + \xi \| x \|_2 + \sqrt{ \sum_{i=1}^{N} \left( (1-\eta\sigma_i^2)^t - 1 \right)^2 (w_i^T n)^2 }, \qquad (18) $$

with probability at least $1 - \epsilon$.

As explained next, the fitting (denoising) bound provided by the theorem first decreases and then increases with the number of iterations $t$. To be more precise, let us analyze separately each of the three terms on the right-hand side of (18). The first term captures the part of the signal that is fitted after $t$ iterations while accounting for the misalignment $\delta$ between the eigenvectors $V_K$ and $W_K$. This term decreases with $t$ and, since $\delta$ can be made arbitrarily small for sufficiently large graphs (cf. Lemma 1), it vanishes for moderately low values of $t$. The second term is an error term that is negligible if the network is sufficiently wide, so that $\xi$ can be chosen sufficiently small while condition (17) remains satisfied. Finally, the third term, which depends on the noise present in each of the spectral components of the squared Jacobian $X$, grows with $t$. More specifically, if the eigenvalue $\sigma_i^2$ associated with a spectral component is very small, the term $(1-\eta\sigma_i^2)^t$ is close to $1$ and, hence, the noise power fitted in the $i$-th frequency will be small. Only when $t$ grows very large does the coefficient $\big((1-\eta\sigma_i^2)^t - 1\big)^2$ approach one, so that the $i$-th frequency component of the noise is fitted. As a result, if the filter is designed such that the eigenvalues of the squared Jacobian decay sharply after the $K$-th one, then there will be a range of moderate-to-high values of $t$ for which: i) the first term is zero and ii) only the strongest components of the noise have been fitted, so that the third term can be approximated by the noise power in the $K$ leading spectral components. Clearly, as $t$ grows larger, the coefficient $\big((1-\eta\sigma_i^2)^t - 1\big)^2$ will also be close to one for $i > K$, meaning that additional components of the noise will be fitted as well, deteriorating the performance of the denoising architecture. This implies that if the optimization algorithm is stopped before $t$ grows too large, the original signal is fitted along with the noise that aligns with the signal, but not the noise present in other components.

In other words, Theorem 1 not only characterizes the performance of the two-layer GNN, but also illustrates that, if early stopping is adopted, our overparametrized architecture is able to effectively denoise the bandlimited graph signal.
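The trade-off described above can be reproduced in a stripped-down model. Assume, as in the linearized analysis, that gradient descent acts independently on each spectral component of the squared Jacobian; place the signal in $K = 3$ components with large eigenvalues and let noise contaminate every component. The eigenvalues, noise level, and step size below are illustrative choices. The error to the clean signal first drops, then creeps back up as the small-eigenvalue noise components are slowly fitted, which is exactly the case for early stopping.

```python
import numpy as np

# Spectrum of the (diagonalized) squared Jacobian: K = 3 large eigenvalues
# for the signal components, small ones for the rest.
sigma2 = np.array([4.0, 4.0, 4.0, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
x0 = np.array([1.0, 1.0, 1.0, 0., 0., 0., 0., 0., 0., 0.])   # K-bandlimited
n = 0.1 * np.ones(10)                                         # noise everywhere
x = x0 + n

eta = 0.1
f = np.zeros(10)
errs = []
for t in range(3000):
    errs.append(np.linalg.norm(x0 - f))   # error to the CLEAN signal
    f -= eta * sigma2 * (f - x)           # GD in the Jacobian eigenbasis

errs = np.array(errs)
t_best = errs.argmin()                    # the early-stopping sweet spot
```

The signal components (large $\sigma_i^2$) are fitted within a handful of iterations, while the noise-only components need thousands, so stopping near `t_best` retains the signal and rejects most of the noise.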

Note that a critical step to attain Theorem 1 is to relate the eigenvectors $W$ of $X$ with those of the GSO $S$, denoted as $V$. To achieve this, we assume that $\mathcal{G}$ is random and provide high-probability bounds between the leading eigenvectors of $X$ and $S$. More specifically, consider a graph $\mathcal{G}$ drawn from a stochastic block model (SBM) [24] with $K$ communities. Also, denote by $\mathcal{B}$ the SBM with expected adjacency matrix $\mathbb{E}[A]$, and by $d_{\min}$ its minimum expected degree. We then consider the class of SBMs with $N$ nodes whose minimum expected degree asymptotically dominates a prescribed function of $N$ (with dominance understood in the conventional asymptotic sense). In this context, we consider the following assumption.

Assumption 1.

The model $\mathcal{B}$ from which $\mathcal{G}$ is drawn belongs to the class of SBMs just described.

Intuitively, it is assumed that the expected minimum degree of the SBM increases as the number of nodes grows. Under these conditions, the following result holds.

Lemma 1.

Let the matrix $X$ be defined as in (16), set $\delta$ and $\epsilon$ to small positive numbers, and denote by $V_K$ and $W_K$ the $K$ leading eigenvectors in the respective eigendecompositions of $S$ and $X$. Under Assumption 1, there exists an orthonormal matrix $Q$ and an integer $N_0$ such that, for $N \ge N_0$, the bound

$$ \| V_K - W_K Q \|_F \le \delta $$

holds with probability at least $1 - \epsilon$.

The proof is provided in Appendix B. Lemma 1 guarantees that, if the size of the graph is big enough, the difference between the subspaces spanned by the leading eigenvectors of $S$ and $X$ is bounded, becoming arbitrarily small as the number of nodes increases. An inspection of (16) reveals that the result in Lemma 1 is not entirely unexpected. Indeed, since $H$ is a polynomial in $S$, so is $H^2$. This implies that the eigenvectors $V$ of $S$ are also the eigenvectors of $H^2$, and because $H$ appears twice on the right-hand side of (16), a relationship between the eigenvectors of $X$ and $S$ can be anticipated. However, the presence of the Hadamard product and the (non-Lipschitz-continuous) $\arccos$ non-linearity renders the exact analysis of the eigenvectors a challenging task. Consequently, we resorted to a stochastic framework in deriving Lemma 1.

IV-B Analyzing the deep GCG

While, for convenience, the previous section focused on analyzing the GCG architecture with $L = 2$ layers, in practice we often work with a larger number of layers. In this section, we provide numerical evidence showing that the relation between the eigenvectors of $S$ and $X$ described in Lemma 1 also holds when $L > 2$.

To that end, Figure 1 shows pairs of eigenvectors $v_i$ and $w_i$ for the leading indexes $i$, for a given graph drawn from an SBM with 4 communities. The GCG is composed of several layers and, to obtain the eigenvectors of the squared Jacobian matrix, the Jacobian is computed using the autograd functionality of PyTorch. The nodes of the graph are sorted by communities, i.e., the nodes of the first community come first, and so on. It can be clearly seen that, even for moderately small graphs, the leading eigenvectors of $S$ and $X$ are almost identical, becoming more dissimilar as the eigenvectors are associated with smaller eigenvalues. It can also be observed that the leading eigenvectors have similar values for entries associated with nodes within the same community. Moreover, Figure 2 depicts the matrix product $V_K^T W_K$, where it is observed that the product of the leading eigenvectors of both matrices is close to an orthonormal matrix. The presented numerical results strengthen the argument that the analytical results obtained for the two-layer case can be extrapolated to deeper architectures.

In addition to using an architecture with only two layers, another important assumption of Lemma 1 is that the graph is drawn from an SBM. This assumption facilitates the derivation of a bound relating the spectra of and (i.e., the subspaces spanned by the eigenvectors and ). The numerical experiments reported in Figure 3 illustrate that such a relation also holds for other types of graphs. The figure has 12 panels (3 rows and 4 columns). Each row corresponds to a different graph, namely: 1) a realization of a small-world (SW) graph [48] with nodes, 2) Zachary’s Karate graph [52] with nodes, and 3) a graph of weather stations across the United States [23]. Each of the first three columns corresponds to an matrix, namely: 1) the normalized adjacency matrix , 2) , the squared version of a low-pass graph filter with and whose coefficients are drawn from a uniform distribution and set to unit norm, and 3) the squared Jacobian matrix . Although we may observe some similarity between and , the relation between and the graph becomes apparent when comparing the matrices and . The matrix is a random graph filter used in the linear transformation of the convolutional generator , and it is clear that the vertex connectivity pattern of is related to that of . Since and are closely related, and we know that the eigenvectors of and those of are the same, we expect (the eigenvectors of ) and (the eigenvectors of ) to be related as well. To verify this, the fourth column of Figure 3 represents , i.e., the pairwise inner products of the leading eigenvectors of and those of . It can be observed that the leading eigenvectors are close to orthogonal, which means that the relation observed in the vertex domain carries over to the spectral domain and and span the same subspace. As a result, our GCG will be capable of denoising a signal that lives in the subspace spanned by for all the considered graphs.

To summarize, the presented results illustrate that the analytical characterization provided in Section IV-A, which considered a two-layer GCG operating over SBM graphs, carries over to more general setups.

V Graph upsampling decoder

The GCG architecture presented in Section IV incorporated the topology of via the vertex-based convolutions implemented by the graph filter with . In this section, we introduce the graph decoder (GD) architecture (not to be confused with the common acronym for gradient descent), a new graph-aware denoising NN that incorporates the topology of via a (nested) collection of graph upsampling operators [28]. Specifically, we propose the linear transformation for the GD denoiser to be given by

\mathcal{T}^{(\ell)}_{\boldsymbol{\Theta}^{(\ell)}}\{\mathbf{Y}^{(\ell-1)} \,|\, \mathcal{G}\} = \mathbf{U}^{(\ell)} \mathbf{Y}^{(\ell-1)} \boldsymbol{\Theta}^{(\ell)}, \qquad (19)

where , with , are graph upsampling matrices to be defined soon. Note that, compared to (9), the graph filter is replaced with the upsampling operator that depends on . Adopting the proposed linear transformation, the output of the GD with layers is given by the recursion

\mathbf{Y}^{(\ell)} = \mathrm{ReLU}\big(\mathbf{U}^{(\ell)} \mathbf{Y}^{(\ell-1)} \boldsymbol{\Theta}^{(\ell)}\big), \quad \text{for } \ell = 1, \dots, L-1, \qquad (20)
\mathbf{y}^{(L)} = \mathbf{U}^{(L)} \mathbf{Y}^{(L-1)} \boldsymbol{\theta}^{(L)}, \qquad (21)

where the ReLU is also removed from the last layer.

Similarly to the GCG, the proposed GD learns to combine the features within each node, with the interpolation of the signals being controlled by the graph upsampling operators . The size of the input is now a design parameter that will determine the implicit degrees of freedom of the architecture. Note that, from the GSP perspective, the input feature matrix represents graph signals, each of them defined over a graph with nodes. Therefore, even though the input is still a random white matrix across rows and columns, since , the dimensionality of the input is progressively increasing.
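The recursion in (20)-(21) can be sketched in a few lines of NumPy. The dimensions below are toy choices, and the upsampling matrices are random stand-ins (their actual construction from hierarchical clustering is the subject of the next section); the sketch only shows how the input is progressively interpolated from a small latent size up to the number of nodes:

```python
import numpy as np

rng = np.random.default_rng(0)

def gd_forward(Z, Us, Thetas):
    """GD recursion (20)-(21): ReLU layers, linear last layer (no ReLU)."""
    Y = Z
    for U, Th in zip(Us[:-1], Thetas[:-1]):
        Y = np.maximum(U @ Y @ Th, 0.0)      # eq. (20)
    return Us[-1] @ Y @ Thetas[-1]           # eq. (21)

# toy sizes: nodes per layer grow 4 -> 10 -> 25; F features
N = [4, 10, 25]
F = 6
Z = rng.standard_normal((N[0], F))                       # small random white input
Us = [rng.random((N[l + 1], N[l])) for l in range(2)]    # stand-in upsamplers
Thetas = [rng.standard_normal((F, F)), rng.standard_normal((F, 1))]
y = gd_forward(Z, Us, Thetas)                            # output: one graph signal
```

Note how the implicit degrees of freedom are governed by the small input size `N[0]`, as discussed above.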

When compared to the GCG, the smaller dimensionality of the input endows the GD architecture with fewer degrees of freedom, rendering the architecture more robust to noise. Furthermore, instead of relying on graph filters, the graph information is included through the graph upsampling operators . Clearly, the method used to design the graph upsampling matrices, which is the subject of the next section, will have an impact on the type of graph signals that can be efficiently denoised using the GD architecture.

V-a Graph upsampling operator from hierarchical clustering

Regular upsampling operators have been successfully used in NN architectures to denoise signals defined on regular domains [14]. While the design of upsampling operators in regular grids is straightforward, when the signals at hand are defined on irregular domains the problem becomes substantially more challenging. The approach that we put forth in this paper is to use agglomerative hierarchical clustering methods [16, 2, 3] to design a graph upsampling operator that leverages the graph topology. These methods take a graph as an input and return a dendrogram; see Figure 4. A dendrogram can be interpreted as a rooted-tree structure that shows different clusters at the different levels of resolution . At the finest resolution () each node forms a cluster of its own. Then, as increases, nodes start to group together (agglomerate) in bigger clusters and, when the resolution becomes large (coarse) enough, all nodes end up being grouped in the same cluster.

By cutting the dendrogram at resolutions, including , we obtain a collection of node sets with parent-child relationships inherited from the refinement of clusters. Since we are interested in performing graph upsampling, note that the dendrogram is interpreted from left to right. This can be observed in the example shown in Figure 4, where the three red nodes in the second graph (, layer ) are children of the red parent in the coarsest graph (, layer ). We leverage these parent-child relations to define the membership matrices , where the entry only if the -th node in layer is the child of the -th node in layer . Moreover, the clusters at layer can be understood as nodes of a graph with nodes and adjacency matrix , which represents a coarser-resolution version of the original graph . There are several ways of defining based on the original adjacency matrix . While our architecture does not focus on a particular form, in the simulations we set only if, in the original graph , there is at least one edge between nodes belonging to the cluster and nodes from cluster . In addition, the weight of the edge depends on the number of existing edges between the two clusters.
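A membership matrix of this kind can be obtained with off-the-shelf agglomerative clustering. The sketch below is illustrative only: it clusters random 2-D node features with average linkage (the paper does not prescribe a specific linkage or dissimilarity), cuts the resulting dendrogram at two resolutions, and builds the binary parent-child matrix between them:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
N = 12
coords = rng.random((N, 2))                       # stand-in node features
D = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
Zlink = linkage(squareform(D, checks=False), method="average")  # the dendrogram

# cut the dendrogram at two resolutions: coarse (parents) and fine (children)
coarse = fcluster(Zlink, t=3, criterion="maxclust")
fine = fcluster(Zlink, t=6, criterion="maxclust")

# membership matrix P: P[i, j] = 1 iff fine cluster i descends from coarse cluster j
P = np.zeros((fine.max(), coarse.max()))
for i in range(1, fine.max() + 1):
    parents = np.unique(coarse[fine == i])        # nested cuts -> exactly one parent
    P[i - 1, parents[0] - 1] = 1.0
```

Because cuts of the same dendrogram are nested, every fine cluster lies wholly within one coarse cluster, so each row of `P` has exactly one nonzero entry.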

With the definition of the membership matrix , and letting denote the degree-normalized version of the adjacency matrix , the upsampling operator of the -th layer is given by

\mathbf{U}^{(\ell)} = \big(\gamma \mathbf{I} + (1-\gamma)\tilde{\mathbf{A}}^{(\ell)}\big) \mathbf{P}^{(\ell)}, \qquad (22)

where is a pre-specified constant. Notice that in (22) copies the signal value from the parents to the children by applying the matrix and, then, every child performs a convex combination between this value and the average signal value of its neighbors. Therefore, the design of conveys a notion (prior) of smoothness on the targeted graph signals, since we promote that nodes descending from the same parent have similar (related) values.

Because the membership matrices are designed using a clustering algorithm over , and the matrices capture how strongly connected the clusters of layer are in the original graph, these two matrices are responsible for incorporating the information of into the upsampling operators . Furthermore, we remark that the upsampling operator can be reinterpreted as the application of followed by the application of a graph filter

\tilde{\mathbf{H}}^{(\ell)} = \gamma \mathbf{I} + (1-\gamma)\tilde{\mathbf{A}}^{(\ell)}, \qquad (23)

which uses $\tilde{\mathbf{A}}^{(\ell)}$ as the GSO and sets the filter coefficients to $\gamma$ and $1-\gamma$.
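Putting (22) and (23) together, the upsampling operator can be sketched as follows. The toy adjacency and membership matrix below are assumptions chosen only to make the copy-then-average behavior visible (4 fine nodes on a path graph, 2 parents):

```python
import numpy as np

def upsampling_operator(P, A, gamma=0.5):
    """Eq. (22): U = (gamma*I + (1-gamma)*A_norm) @ P.
    P: binary parent-child membership matrix (N_l x N_{l-1});
    A: adjacency of the coarsened graph at layer l (N_l x N_l)."""
    deg = np.maximum(A.sum(axis=1), 1e-12)
    A_norm = A / deg[:, None]                # degree-normalized adjacency, eq. (23) GSO
    return (gamma * np.eye(A.shape[0]) + (1 - gamma) * A_norm) @ P

# toy example: 4 fine nodes on a path, nodes {0,1} and {2,3} share a parent
P = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
U = upsampling_operator(P, A, gamma=0.5)
y_coarse = np.array([2.0, -1.0])             # one value per parent
y_fine = U @ y_coarse                        # copied from parents, then averaged
```

Each row of `U` is a convex combination (rows sum to one), so the interpolated values at the boundary between the two clusters are pulled toward each other, reflecting the smoothness prior discussed above.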

V-B Guaranteed denoising with the GD

As we did for the GCG, our goal is to theoretically characterize the denoising performance of the GNN architecture defined by (20)-(22). To achieve that goal, we replicate the approach implemented in Section IV-A. We first derive the matrix and provide theoretical guarantees when denoising a -bandlimited graph signal with the GD. Then, to gain additional insight, we detail the relation between the subspace spanned by the eigenvectors and the spectral domain of the GSO. This relation is key in deriving the theoretical analysis.

We start by introducing the two-layer GD

f_{\boldsymbol{\Theta}}(\mathbf{Z} \,|\, \mathcal{G}) = \mathbf{U}^{(2)} \,\mathrm{ReLU}\big(\mathbf{U}^{(1)} \mathbf{Z} \boldsymbol{\Theta}^{(1)}\big)\, \boldsymbol{\theta}^{(2)}. \qquad (24)

Upon following a reasoning similar to that provided after (13), optimizing the previous architecture is equivalent to optimizing its simplified version

f_{\boldsymbol{\Theta}}(\mathbf{U}) = f_{\boldsymbol{\Theta}}(\mathbf{Z} \,|\, \mathcal{G}) = \mathrm{ReLU}(\mathbf{U}\boldsymbol{\Theta})\,\mathbf{b}. \qquad (25)

An important difference with respect to the GCG presented in (13) is that the matrix has a dimension of , so it spans instead of . Since , the smaller subspace spanned by the weights of the GD renders the architecture more robust to fitting noise but, on the other hand, the number of degrees of freedom available to learn the graph signal of interest is reduced. As a result, the alignment between the targeted graph signals and the low-pass vertex-clustering architecture becomes more important.

The expected squared Jacobian is obtained following the procedure used to derive (16), arriving at the expression

\mathbf{X} = 0.5\left(\mathbf{1}\mathbf{1}^{\mathsf{T}} - \tfrac{1}{\pi}\arccos\big(\tilde{\mathbf{C}}^{-1}\mathbf{U}\mathbf{U}^{\mathsf{T}}\tilde{\mathbf{C}}^{-1}\big)\right) \odot \mathbf{U}\mathbf{U}^{\mathsf{T}}, \qquad (26)

where represents the -th row of , and is a normalization matrix.
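To make the closed form in (26) concrete, the following NumPy sketch evaluates it for a random toy upsampling matrix. It assumes (as in similar Jacobian analyses) that the normalization matrix is diagonal with the row norms of U, so that the arccos argument is a matrix of row-wise cosine similarities:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 3
U = rng.standard_normal((N, M))                    # toy stand-in for the upsampler
C_inv = np.diag(1.0 / np.linalg.norm(U, axis=1))   # assumed normalization matrix
cos = np.clip(C_inv @ U @ U.T @ C_inv, -1.0, 1.0)  # row-wise cosine similarities
# eq. (26): elementwise (Hadamard) product with U U^T
X = 0.5 * (np.ones((N, N)) - np.arccos(cos) / np.pi) * (U @ U.T)
```

On the diagonal the cosine is 1, so `arccos` vanishes and each diagonal entry reduces to half the squared row norm of `U`, a quick sanity check on the expression.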

Then, let be a -bandlimited graph signal and let have a number of features satisfying (17). If we solve (7) running gradient descent with a step size , the following result holds.

Theorem 2.

Let be the network defined in equation (25). Consider the conditions described in Theorem 1 and let match the number of communities (see Assumption 1). Then, the error for each iteration of gradient descent with stepsize used to fit the architecture is bounded as (1), with probability at least .

The proof of the theorem is analogous to the one provided in Appendix A but exploiting Lemma 2 instead of Lemma 1. Lemma 2 is fundamental in attaining Theorem 2 and is presented later in the section.

Theorem 2 formally establishes the denoising capability of the GD when is a -bandlimited graph signal and matches the number of communities in the SBM graph. When compared with the GCG, the smaller dimensionality of the input , and thus the smaller rank of the matrix , constrains the learning capacity of the architecture, making it more robust to the presence of noise. However, this additional robustness also implies that the architecture is more sensitive to model mismatch, since its capacity to learn arbitrary signals is smaller. Intuitively, the GD represents an architecture tailored for a more specific family of graph signals than the GCG. Moreover, employing the GD instead of the GCG has a significant impact on the relation between the subspaces spanned by and .

To establish the new relation between and , assume that the adjacency matrix is drawn from an SBM with communities such that , so that the SBM follows Assumption 1. In addition, set the size of the latent space to the number of communities so . Under this setting, the counterpart to Lemma 1 for the case where is a GD architecture follows.

Lemma 2.

Let the matrix be defined as in (26), set and to small positive numbers, and denote by and the leading eigenvectors in the respective eigendecompositions of and . Under Assumption 1, there exist an orthonormal matrix and an integer such that for the bound

\|\mathbf{V}_K - \mathbf{W}_K\mathbf{Q}\|_F \le \delta,

holds with probability at least .

Lemma 2 asserts that the difference between the subspaces spanned by and becomes arbitrarily small as the size of the graph increases. The proof is provided in Appendix C and the intuition behind it arises from the fact that the upsampling operator can be understood as , where is a graph filter of the specific form described in (23). Remember that is a binary matrix encoding the cluster in the layer to which the nodes in the layer belong. Since we are only considering two layers, and we have that , the matrix is encoding the node-community membership of the SBM graph and, hence, the product is a block matrix with constant entries matching the block pattern of . As shown in the proof, this property can be leveraged to bound the eigendecomposition of and .

V-C Analyzing the deep GD

The deep GD composed of layers can be constructed following the recursion presented in (20) and (21). In this case, by stacking more layers we perform the upsampling of the input signal in a progressive manner and, at the same time, add more non-linearities, which helps alleviate the rank constraint related to the input size . In the absence of non-linear functions, the maximum rank of the weights would be and, thus, only signals in a subspace of size could be learned. By properly selecting the number of layers and the input size when constructing the network, we can strike a trade-off between the robustness of the architecture and its learning capability.

In addition, the effect of adding more layers is also reflected in the smoothness assumption inherited from the construction of the upsampling operator. Adding more layers corresponds to less smooth signals, since the number of nodes in with a common parent, and thus with similar values, is smaller.

We note that numerically illustrating that the bound between and holds for the deep GD, and that its denoising capability is not limited to signals defined over SBM graphs, would yield results similar to those in Section IV-B. Therefore, instead of replicating the previous section, we directly illustrate the performance of the deep GD under more general settings in the following section, where we present the numerical evaluation of the proposed architectures.

Vi Numerical results

This section presents different experiments that numerically validate the theoretical claims introduced in the paper and illustrate the denoising performance of the GCG and the GD. The experiments are carried out using synthetic and real-world data, and the proposed architectures are compared with other graph-signal denoising alternatives. The code for the experiments and the architectures is available on GitHub; for hyper-parameter settings and implementation details, the interested reader is referred to the code available online.

Vi-a Denoising capability of graph untrained architectures

The goal of the experiment shown in Figures 5a and 5b is to illustrate that the proposed graph untrained architectures learn the structured original signal faster than the noise, which is one of the core claims of the paper. To that end, we generate an SBM graph with nodes and communities, and define 3 different signals: (i) “Signal”: a piece-wise constant signal with the value of each node being the label of its community; (ii) “Noise”: zero-mean white Gaussian noise with unit variance; and (iii) “Signal + Noise”: a noisy observation where the noise presents a normalized power of . Figures 5a and 5b show the normalized mean squared error (MSE) obtained for each realization as . The mean is computed over 100 realizations of the noise as the number of epochs increases when the different signals are fitted by the 2-layer GCG and the 2-layer GD, respectively. It can be seen how, in both cases, the error when fitting the noisy signal decreases for a few epochs until it reaches a minimum, and then starts to increase. This is because the proposed untrained architectures learn the signal faster than the noise but, if they fit the observation for too many epochs, they start learning the noise as well and, hence, the MSE increases. As stated by Theorems 1 and 2, this result illustrates that, if early stopping is applied, both architectures are capable of denoising the observed graph signals without a training step. It can also be noted that, under this setting, the GD learns the signal faster than the GCG and, at the same time, is more robust to the presence of noise. This can be seen as a consequence of the GD implicitly making stronger assumptions about the smoothness of the targeted signal.
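The fit-and-track protocol behind these curves can be sketched as below. This is not the paper's GCG/GD: a generic overparametrized net on a fixed random input stands in for the untrained architecture, and the signal, noise level, and optimizer settings are all toy assumptions. The point is the bookkeeping: fit the noisy observation while monitoring the error against the clean signal, whose minimizer is the early-stopping epoch:

```python
import torch

torch.manual_seed(0)
N = 64
x0 = torch.ones(N)
x0[N // 2:] = -1.0                            # stand-in piecewise-constant "Signal"
x_noisy = x0 + 0.3 * torch.randn(N)           # "Signal + Noise"

# generic overparametrized net on a fixed random input (stand-in for GCG/GD)
Z = torch.randn(N, 16)
net = torch.nn.Sequential(torch.nn.Linear(16, 32),
                          torch.nn.ReLU(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

losses, errs = [], []
for epoch in range(300):
    opt.zero_grad()
    x_hat = net(Z).squeeze(-1)
    loss = ((x_hat - x_noisy) ** 2).mean()    # fit the noisy observation...
    loss.backward()
    opt.step()
    losses.append(loss.item())
    errs.append(((x_hat.detach() - x0) ** 2).mean().item())  # ...track error vs clean signal

best_epoch = int(torch.tensor(errs).argmin())  # early-stopping point
```

In the paper's setting, `errs` traces the U-shaped MSE curve of Figures 5a and 5b, and `best_epoch` marks where early stopping should halt the fit.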

The second test case illustrates that the result presented in Lemma 1 is not constrained to the family of SBM graphs, but generalizes to other families of random graphs as well. Figure 5c contains the mean eigenvector similarity, measured as , as a function of the number of nodes in the graph. The eigenvector similarity is computed for 50 realizations of random graphs and the presented error is the median across all the realizations. The random graph models considered are: the SBM (“SBM”), the connected caveman graph (“CAVE”) [49], the regular graph whose fixed degree increases with its size (“REG”), the small-world graph (“SW”) [48], and the power-law cluster graph model (“PLC”) [15]. The second term in the legend denotes the number of leading eigenvectors taken into account in each case, which depends on the number of active frequency components of the specific random graph model. We can clearly observe that for most of the random graph models the eigenvector error goes to 0 as increases, with the only exception being the connected caveman graph. This illustrates that, although the conditions assumed for Lemmas 1 and 2 focus on the specific setting of the SBM, the results apply to a wider class of graphs, thus motivating the extension of the proposed theorems to more general settings as a future line of work.

Vi-B Denoising synthetic data

We now proceed to comment on the denoising performance of the proposed architectures with synthetic data. The usage of synthetic signals allows us to study how the properties of the noiseless signal influence the quality of the denoised estimate.

The first experiment, shown in Figure 6a, studies the error of the denoised estimate obtained with the 2-layer GCG as the number of epochs increases. The reported error is the normalized MSE of the estimated signal , and the figure shows the median values over 100 realizations of graphs and graph signals. The normalized power of the noise present in the data is . Graphs are drawn from an SBM with nodes and 4 communities, and the graph signals are generated as: (i) zero-mean white Gaussian noise with unit variance (“Rand”); (ii) a bandlimited signal using the leading eigenvectors of as basis (“J”); (iii) a bandlimited graph signal using the leading eigenvectors of as basis (“BL”); and (iv) a diffused white (“DW”) signal created as , where is a white vector whose entries are sampled from , is a low-pass graph filter, and represents the graph-aware median operator such that the value of node is the median of its neighborhood [43, 41, 39]. The results in Figure 6a show that the best denoising error is obtained when the signal is composed of just a small number of eigenvectors, and the performance deteriorates as the bandwidth (i.e., the number of leading eigenvectors that span the subspace where the signal lives) increases, with the worst result obtained when the signal is generated at random. This result is aligned with the theoretical claims, since these assume that the signal is bandlimited. It is also worth noting that the architecture achieves a good denoising error with the “DW” model, showcasing that the GCG is also capable of denoising other types of smooth graph signals.
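The “BL” and “DW” generative models above can be sketched as follows. Every parameter here (graph size, block probabilities, filter coefficients) is an illustrative assumption, and the low-pass filter is a simple truncated polynomial of the scaled adjacency rather than the paper's exact filter:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 40, 4
labels = np.repeat(np.arange(K), N // K)                 # toy SBM communities
probs = np.where(labels[:, None] == labels[None, :], 0.6, 0.05)
A = (rng.random((N, N)) < probs).astype(float)
A = np.triu(A, 1)
A = A + A.T                                              # symmetric, no self-loops

evals, evecs = np.linalg.eigh(A)
V_K = evecs[:, np.argsort(-evals)[:K]]                   # K leading eigenvectors

# "BL": K-bandlimited signal with a random frequency response
x_bl = V_K @ rng.standard_normal(K)

# "DW": low-pass filter applied to white noise, followed by a graph median
H = sum(0.5 ** k * np.linalg.matrix_power(A / np.abs(evals).max(), k)
        for k in range(3))                               # toy low-pass filter
x_lin = H @ rng.standard_normal(N)
x_dw = np.array([np.median(np.append(x_lin[A[i] > 0], x_lin[i]))
                 for i in range(N)])                     # median over closed neighborhood
```

By construction `x_bl` lies exactly in the span of the K leading eigenvectors, which is the bandlimited assumption used throughout the theory.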

Next, Figure 6b compares the performance of the 2-layer GCG (“2L-GCG”), the deep GCG (“GCG”) and the deep GD (“GD”) with the baseline models introduced in Section III, which are the total variation (“TV”), Laplacian regularization (“LR”) and bandlimited model (“BL”). In this setting, the graphs are SBM with nodes and 8 communities, and the signals are bandlimited with a bandwidth of 8. Since the “BL” model with captures the actual generative model of the signal , it achieves the best denoising performance. However, it is worth noting that the GCG obtains a similar result, outperforming the other alternatives. Moreover, the benefits of using the deep GCG instead of the 2-layer architecture are apparent, since it achieves a better performance in fewer epochs.

On the other hand, Figure 6c illustrates a similar experiment but with the graph signals generated as “DW”. Under this setting, it is clear that the GD outperforms the other alternatives, showcasing that it is more robust to the presence of noise when the signals are aligned with the prior implicitly captured by the GD architecture.

Vi-C Denoising temperature measurements

We now evaluate the proposed architectures using a real-world dataset. We consider a network of 316 weather stations distributed across the United States where the graph signals represent the daily temperature measured by each station on the first three months of the year 2003. Also, similar to [34], we consider the graph given by the 8-nearest neighbors. The weight of each edge is inversely proportional to the distance between the stations.

The results are presented in Figure 7, which shows the evolution of the mean MSE as the normalized noise power increases. In this experiment, we have selected as denoising alternatives the bandlimited model with 15% of the frequency components active (“BL”), a graph-aware median operator such that the value of is the median of its neighborhood (“MED”) [43], and a GCNN. It can be observed that the GD is more robust to the presence of noise, since it outperforms the other alternatives and achieves a mean MSE of when the noise power attains a value of . Moreover, note that the GCG outperforms the GCNN, showcasing the advantage of using a fixed graph filter instead of learning the filter parameters. In the absence of noise, the GCG outperforms the other alternatives, including the GD. This illustrates that the GCG can be interpreted as a less regularized architecture than the GD.

Vii Conclusion

In this paper, we addressed the relevant task of graph-signal denoising. To approach this problem, we presented two overparametrized and untrained GNNs and provided theoretical guarantees on the denoising performance of both architectures when denoising -bandlimited graph signals under some simplifying assumptions. Moreover, we numerically illustrated that the proposed architectures are also capable of denoising graph signals in more general settings. The key difference between the two architectures resides in the linear transformation that incorporates the information encoded in the graph. The GCG employs fixed (non-learnable) low-pass graph filters to model convolutions in the vertex domain, promoting smooth estimates. The GD, on the other hand, relies on a nested collection of graph upsampling operators that progressively increase the input size, limiting the degrees of freedom of the architecture and providing more robustness to noise. In addition to the aforementioned analysis, we tested the validity of the proposed theorems and evaluated the performance of both architectures with real and synthetic datasets, showcasing a better performance than other classical and non-linear methods for graph-signal denoising.

Appendix A Proof of Theorem 1

Let be a bandlimited graph signal as described in (2), which is spanned by the leading eigenvectors of the graph , with denoting its frequency representation. Denote as the bandlimited signal using as basis and whose frequency response is also . Let be an orthonormal matrix that aligns the subspaces spanned by and , and note that can be interpreted as recovering from its frequency response using instead of . Also note that represents the error between the signal and its approximation inside the subspace spanned by . With these definitions in place, we have from [14, Theorem 3] with probability at least that

\|\mathbf{x}_0 - f_{\boldsymbol{\Theta}^{(t)}}(\mathbf{Z} \,|\, \mathcal{G})\|_2 \le \|\boldsymbol{\Psi}\mathbf{x}_0\|_2 + \xi\|\mathbf{x}\|_2 + \sqrt{\sum_{i=1}^{N}\big((1-\eta\sigma_i^2)^{t}-1\big)^2 (\mathbf{w}_i^{\mathsf{T}}\mathbf{n})^2}, \qquad (27)

with , and the identity matrix. However, note that the error bound for the term provided in [14] does not apply since is not spanned by . Accordingly, we further bound this term as

\begin{aligned}
\|\boldsymbol{\Psi}\mathbf{x}_0\|_2 &= \|\boldsymbol{\Psi}(\mathbf{x}_0 + \bar{\mathbf{x}}_0 - \bar{\mathbf{x}}_0)\|_2 \\
&\overset{(i)}{=} \|\boldsymbol{\Psi}_K\bar{\mathbf{x}}_0 + \boldsymbol{\Psi}(\mathbf{V}_K - \mathbf{W}_K\mathbf{Q})\tilde{\mathbf{x}}_0\|_2 \\
&\overset{(ii)}{\le} \|\boldsymbol{\Psi}_K\bar{\mathbf{x}}_0\|_2 + \|\boldsymbol{\Psi}(\mathbf{V}_K - \mathbf{W}_K\mathbf{Q})\tilde{\mathbf{x}}_0\|_2 \\
&\overset{(iii)}{\le} \|\boldsymbol{\Psi}_K\|_2\|\bar{\mathbf{x}}_0\|_2 + \|\boldsymbol{\Psi}\|_2\|\mathbf{V}_K - \mathbf{W}_K\mathbf{Q}\|_F\|\tilde{\mathbf{x}}_0\|_2 \\
&\overset{(iv)}{\le} \big(\|\boldsymbol{\Psi}_K\|_2 + \delta\|\boldsymbol{\Psi}\|_2\big)\|\mathbf{x}_0\|_2 \\
&\overset{(v)}{=} \big((1-\eta\sigma_K^2)^t + \delta(1-\eta\sigma_N^2)^t\big)\|\mathbf{x}_0\|_2. \qquad (28)
\end{aligned}

Here, , and represents a diagonal matrix containing the first leading eigenvalues . We have that follows from being bandlimited in , so . Then, follows from the triangle inequality, and from the norm being submultiplicative and using the Frobenius norm as an upper bound for the norm. In we apply the result of Lemma 1, which holds with probability at least because , and the fact that, since both and are orthonormal matrices, we have that . We obtain from the largest eigenvalues present in and .

Finally, substituting (28) into (27) concludes the proof.

Appendix B Proof of Lemma 1

Define as and let be given by (16). Denote by a graph filter defined as a polynomial of the expected adjacency matrix , and let be the expected squared Jacobian using the graph filter , i.e.,

 ¯X=0.5