I Introduction
With the explosive growth of information and communication, data today is generated at an unprecedented rate from various sources, including social, biological and physical infrastructure [37, 45], among others. Unlike time-series signals or images, these signals possess complex, irregular structures, which can be modeled as graphs. Analyzing graph signals requires new concepts and tools to handle the underlying irregular relationships, leading to the emerging fields of graph signal processing [48] and graph neural networks [7]. Graph signal processing generalizes the classical signal processing toolbox to the graph domain and provides a series of techniques to process graph signals [57, 55], including graph-based transformations [31], graph filter bank design [44, 56] and graph topology learning [20]. On the other hand, graph neural networks extend deep learning techniques to the graph domain and provide a data-driven framework to learn from graph signals with graph structures as inductive biases
[7]. For example, graph convolutional networks and their variants have attained remarkable success in social network analysis [38], point cloud processing [67], action recognition [40] and relational inference [34]. In this work, we consider sampling and recovery of graph signals, which has attracted significant attention in both graph signal processing [11, 3, 4] and graph neural networks [69, 39]. In classical signal processing, signal sampling and recovery are key techniques that link continuous-time signals (functions of a real variable) and discrete-time signals (sequences indexed by integers) [64]. Sampling produces a sequence from a function, and recovery produces a function from a sequence. In the graph domain, sampling is a reduction of a graph signal to a small number of measurements, and recovery is a reconstruction of a graph signal from noisy, missing, or corrupted measurements. The recovery techniques are usually tied to the sampling procedure because the sampling procedure determines the properties of the measurements. Several types of graph signal sampling are considered in the literature. For example, subsampling selects one vertex in each measurement [11, 3]; local-neighborhood sampling selects a set of connected vertices in each measurement [65]; and aggregated sampling considers a time-evolving graph signal and selects one vertex at different time stamps in each measurement [42]. We focus on subsampling in this paper.
Based on the graph signal processing framework, analytical graph signal sampling and recovery operators are designed mathematically and provide optimal or near-optimal solutions for some specific graph signal model, which is a class of graph signals with certain graph-related properties. For example, [11, 2] focus on a bandlimited graph signal model and present a graph counterpart of the Shannon–Nyquist sampling theory. Recently, [61] considers a periodic graph spectrum subspace and provides a generalized sampling framework. The techniques related to analytical graph signal sampling and recovery have been broadly applied to multiscale graph filter bank design [56], sensor placement [52], semi-supervised learning [24], traffic monitoring [15], 3D point cloud processing [10], and many others.
A fundamental challenge for analytical sampling and recovery is that we may not know an appropriate graph signal model in practice. When a mathematically designed graph signal model is far from the ground-truth yet unknown graph signal model that generates real-world data, it may cause a significant performance gap between theory and practice. Furthermore, a ground-truth graph signal model could be too complicated to be precisely described mathematically, leading to computationally intensive algorithms. To solve this issue, it is desirable to obtain a data-driven model that is appropriately learnt from given graph signals. In other words, sampling and recovery strategies should have sufficient flexibility to learn from and adapt to arbitrary graph signal models.
On the other hand, from the perspective of deep neural networks, sampling, also called pooling, is critical in many learning tasks as it adjusts data resolution and enables multiscale analysis, which has achieved significant success in image and video analysis [51]. In the graph domain, graph sampling/pooling is also a key component of a graph neural network. Compared to sampling on regular lattices, sampling of graph signals is technically more challenging due to the highly irregular structures. To handle this, most previous methods directly train a sampling operator purely based on downstream tasks without explicitly exploring graph-related properties [69, 39]. It is thus hard to explain the vertex-selection strategy of the learnt sampling operator, which potentially impedes us from improving the method.
In this work, we propose new graph-neural-network-based sampling and recovery operators, which leverage the learning ability of graph neural networks and provide interpretability from a signal processing perspective. To achieve sampling, our strategy is to select those vertices that can maximally express their corresponding neighborhoods. Those vertices can be regarded as the centers of their corresponding communities. We mathematically quantify this expressiveness by the mutual information between vertex and neighborhood features. Through neural estimation of mutual information, the proposed graph neural sampling module can be optimized and naturally provides an affinity metric between a vertex and a neighborhood in a data-driven fashion. By varying the neighborhood radius, we are able to trade off between local and global mutual information. Compared to analytical graph signal sampling [11, 3, 61], the proposed graph neural sampling module is model-free and sufficiently flexible to adapt to arbitrary graph signal models and arbitrary subsequent tasks. Compared to previous graph-neural-network-based sampling [69, 39], the proposed one explicitly exploits graph-related properties and provides interpretability. Moreover, previous methods need a subsequent task to provide additional supervision, while the proposed method is rooted in estimating the dependencies between vertices and neighborhoods, and can be trained in either an unsupervised or a supervised setting.
To achieve recovery, we propose a new graph neural recovery module based on algorithm unrolling. We first propose an iterative recovery algorithm and then transform it into a graph neural network by mapping each iteration to a network layer. Compared to analytical graph recovery, the proposed trainable method is able to learn a variety of graph signal models from given graph signals by leveraging neural networks. Compared to many other neural networks [38, 62], the proposed method is interpretable as it follows analytical iterative steps.
Based on the proposed graph neural sampling and recovery modules, we further propose a new multiscale graph neural network, which is a trainable counterpart of a multiscale graph filter bank. The proposed network includes three key basic operations: graph downsampling (graph downsampling includes two parts: graph signal downsampling and graph structure downsampling; graph signal sampling is equivalent to graph signal downsampling), graph upsampling and graph filtering. Each operation is trainable and extended from its analytical counterpart. Here we implement graph downsampling and graph upsampling by the proposed graph neural sampling and recovery modules, respectively. By adjusting the last layer and the final supervision, the proposed multiscale graph neural network can handle various graph-related tasks. Compared to a conventional multiscale graph filter bank [56], the proposed multiscale graph neural network is trainable and more flexible to adapt to a new task. Compared to previous multiscale graph neural networks [27, 18], our model leverages the proposed novel graph neural sampling and recovery modules and presents a new feature-crossing layer that allows multiscale features to communicate in the intermediate network layers, further improving the learning ability. Because of the structure of the feature-crossing layer, we also call our model the graph cross network (GXN).
To illustrate the sampling strategy of the proposed graph neural sampling module, we compare the vertices selected by various sampling methods in both simulated and real-world scenarios. We find that (i) when dealing with smooth graph signals, the proposed graph neural sampling module performs similarly to analytical sampling; and (ii) the proposed graph neural sampling and recovery modules can flexibly adapt to various graph signals, such as piecewise-smooth graph signals. In a task of active-sampling-based semi-supervised learning, the proposed graph neural sampling module improves the classification accuracies over previous sampling methods on the Cora dataset.
To validate the performance of the proposed multiscale graph neural network, we conduct extensive experiments on several standard datasets for both vertex classification and graph classification. The proposed method improves the average classification accuracies over state-of-the-art methods for these two tasks.
Contributions. The main contributions of the paper include:

We propose a novel graph neural sampling module, which is designed and optimized through estimating the dependency between vertex and neighborhood features.

We propose a novel graph neural recovery module by unrolling an analytical recovery algorithm, leading to a trainable and interpretable network.

We propose a novel multiscale graph neural network, which is a trainable counterpart of a multiscale graph filter bank, by leveraging the proposed graph neural sampling and recovery modules.

We conduct experiments to illustrate the sampling strategy of the proposed graph neural sampling module and the recovery performance of the proposed graph neural recovery module.

We validate the proposed multiscale graph neural network on two tasks: graph classification and vertex classification. The proposed method improves the average classification accuracies over state-of-the-art methods for these two tasks.
The rest of the paper is organized as follows: Section II formulates the task of sampling and recovery of graph signals. Sections III and IV propose a graph neural sampling module and a graph neural recovery module, respectively. Based on these two modules, Section V further proposes a multiscale graph neural network, which is a trainable counterpart of a multiscale graph filter bank. We illustrate the effects of the proposed graph neural sampling and recovery modules on several exemplar graphs in Section VI. Finally, the experiments validating the advantages of the proposed multiscale graph neural network are provided in Section VII.
II Problem Formulation
In this section, we formulate the task of sampling and recovery of graph signals based on two approaches: an analytical approach and a neural-network-based approach. The analytical approach overviews the conventional methods in graph signal processing and designs sampling and recovery operators mathematically. The neural-network-based approach designs trainable sampling and recovery operators based on deep learning techniques, which lays a foundation for the proposed sampling and recovery methods.
II-A Basic concepts
We consider a graph $\mathcal{G} = (\mathcal{V}, A)$, where $\mathcal{V} = \{v_1, v_2, \dots, v_N\}$ is the set of $N$ vertices and $A \in \mathbb{R}^{N \times N}$ is the graph adjacency matrix, whose element $A_{ij}$ represents a weighted edge from the $j$th to the $i$th vertex and characterizes the corresponding vertex relation, such as similarity or dependency. Based on the graph $\mathcal{G}$, a graph signal is defined as a map that assigns a signal coefficient $x_i \in \mathbb{R}$ to the vertex $v_i$; that is, all the vertices carry their associated signal coefficients to form the global signal on the whole graph. A graph signal can be formulated as a length-$N$ vector $x = [x_1, x_2, \dots, x_N]^\top \in \mathbb{R}^N$, where the $i$th vector element $x_i$ is indexed by the vertex $v_i$.
The multiplication between the graph adjacency matrix and a graph signal, $A x$, replaces the signal coefficient at each vertex with a weighted linear combination of the signal coefficients of the corresponding neighbors, according to the relations represented by $A$. In other words, the graph adjacency matrix shifts the value at each vertex to its neighbors; we thus call it a graph shift operator [54].
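As a toy illustration of the graph shift (the 4-vertex cycle below is made up for this sketch), one multiplication by the adjacency matrix replaces each signal coefficient with the sum of its neighbors' coefficients:

```python
import numpy as np

# A made-up 4-vertex cycle graph: each vertex connects to its two neighbors.
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])

x = np.array([1.0, 2.0, 3.0, 4.0])   # a graph signal, one coefficient per vertex

# One application of the graph shift operator: each vertex now holds the
# weighted combination (here, a plain sum) of its neighbors' coefficients.
shifted = A @ x
```

Vertex 0, for instance, has neighbors 1 and 3, so its new value is $x_2 + x_4 = 2 + 4 = 6$.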
II-B Analytical sampling & recovery framework
The dual process of sampling and recovery for a graph signal is to use a few measurements to represent a complete graph signal, where sampling is a reduction of a graph signal to a small number of measurements, and recovery is a reconstruction of a graph signal from noisy, missing, or corrupted measurements. There are many types of sampling considered in the literature, including subsampling [11, 3], neighborhood sampling [65] and aggregated sampling [42]. Here we focus on subsampling, which takes the signal coefficient at one vertex in each measurement.
Suppose that we first sample $M$ coefficients of a graph signal $x \in \mathbb{R}^N$ to produce a sampled signal $x_\mathcal{M} \in \mathbb{R}^M$ ($M < N$), where $\mathcal{M} = (m_1, m_2, \dots, m_M)$ with $m_i$ the $i$th sampled index. Then, we interpolate the sampled signal $x_\mathcal{M}$ to obtain the reconstruction $\widehat{x} \in \mathbb{R}^N$, which recovers the original graph signal either exactly or approximately. Mathematically, the whole process of graph sampling and recovery is formulated as

$x_\mathcal{M} = \Psi x \in \mathbb{R}^M, \qquad \widehat{x} = \Phi(x_\mathcal{M}) \in \mathbb{R}^N,$

where the sampling operator $\Psi$ is an $M \times N$ matrix whose $(i,j)$th element is

(1) $\Psi_{ij} = \begin{cases} 1, & j = m_i, \\ 0, & \text{otherwise}, \end{cases}$

and the recovery operator $\Phi: \mathbb{R}^M \to \mathbb{R}^N$ maps the measurements to a reconstructed graph signal. In general, it is difficult to perfectly recover the original graph signal from its reconstruction due to the information loss during sampling.
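Concretely, the sampling operator in (1) is a binary row-selection matrix. A minimal sketch (the signal values and sampled indices are made up):

```python
import numpy as np

def sampling_operator(sampled_indices, N):
    """Build the M x N sampling matrix of (1): row i has a single 1 at
    column sampled_indices[i], and zeros elsewhere."""
    Psi = np.zeros((len(sampled_indices), N))
    Psi[np.arange(len(sampled_indices)), sampled_indices] = 1.0
    return Psi

x = np.array([5.0, 1.0, 7.0, 3.0, 9.0])          # a made-up graph signal
Psi = sampling_operator([0, 2, 4], N=5)          # hypothetical sampled vertices
x_M = Psi @ x                                    # the measurements
```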
We usually consider that an original graph signal is generated from a known and fixed graph signal model $\mathcal{X}$, which is a class of graph signals with specific properties related to the irregular graph structure $\mathcal{G}$. For example, the bandlimited and approximately bandlimited graph signal models are commonly used to model smooth graph signals [11, 14]. A main challenge of designing a pair of sampling and recovery operators is to cope with the graph signal model $\mathcal{X}$. A common prototype is to solve the following optimization problem:

(2) $\min_{\Psi,\ \Phi}\ \mathbb{E}_{x \in \mathcal{X}}\ \left\| x - \Phi\left( \Psi x \right) \right\|_2^2,$
where the expectation can be replaced by other aggregation functions, such as the maximum. The analytical solutions of (2) provide a pair of sampling and recovery operators that theoretically minimize the expected recovery error for the graph signals in the graph signal model $\mathcal{X}$. When $\mathcal{X}$ is the bandlimited graph signal model, we can design analytical sampling and recovery operators that guarantee perfect recovery [11], which extends the classical Shannon–Nyquist sampling theory to the irregular graph domain [22].
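To make the bandlimited case concrete, the following sketch builds a signal spanned by the first two Laplacian eigenvectors of a small path graph, samples two vertices, and recovers the signal exactly by least squares; the graph, bandwidth and sample set are our choices for illustration, not the specific operators of [11]:

```python
import numpy as np

# Path graph on 4 vertices and its combinatorial Laplacian.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A

# Graph Fourier basis: Laplacian eigenvectors, ordered by eigenvalue.
_, V = np.linalg.eigh(L)

K = 2                                     # bandwidth: first K spectral modes
x = V[:, :K] @ np.array([1.0, -2.0])      # a bandlimited graph signal

sampled = [0, 3]                          # sample M = K vertices
# Recover the K spectral coefficients from the sampled rows, then synthesize.
coeffs = np.linalg.solve(V[sampled, :K], x[sampled])
x_hat = V[:, :K] @ coeffs
```

Perfect recovery requires the sampled rows of the first $K$ eigenvectors to be invertible, which holds here because the Fiedler vector of a path takes distinct values at the two endpoints.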
With additional constraints, we could consider many variants of (2). For example, we could consider either deterministic or randomized methods to obtain a sampling operator [12, 50]. We could also introduce additional graph-regularization terms to guide the recovery process, such as the quadratic form of the graph Laplacian [57] and the graph total variation [9]; see more details in recent review papers [41, 60].
II-C Neural sampling & recovery framework
A fundamental limitation of the above-mentioned analytical framework is that we may not know the graph signal model in practice. It is then hard to either appropriately choose a graph signal model or precisely design a new mathematical model. To solve this issue, it is desirable to learn an appropriate model in a data-driven manner. Suppose that we are given a set of graph signals that sufficiently represents a graph signal model we are interested in. We then aim to train a pair of neural-network-based sampling and recovery operators according to the given data. Those operators can then leverage the learning ability of neural networks to adapt to an arbitrary graph signal model.
Let $\Psi_\theta(\cdot)$ and $\Phi_\theta(\cdot)$ be graph neural sampling and recovery modules, respectively, where the subscript $\theta$ reflects that both functions involve trainable parameters. We then have

(3a) $x_\mathcal{M} = \Psi_\theta(x, A) = \Psi x \in \mathbb{R}^M,$

(3b) $\widehat{x} = \Phi_\theta(x_\mathcal{M}, A) \in \mathbb{R}^N,$

where $A$ is the graph adjacency matrix and the matrix $\Psi$ follows the definition of the sampling operator (1). The graph neural sampling module $\Psi_\theta(\cdot)$ is supposed to aggregate information from both the graph structure and graph signals to determine the sampling operator $\Psi$; and the graph neural recovery module $\Phi_\theta(\cdot)$ is supposed to exploit information from both the graph structure and the measurements to construct a graph signal.
Instead of defining an explicit graph signal model mathematically, we want the network to learn a graph signal model from given data. This is a practical assumption because graph structures and graph signals are provided together in many scenarios. For example, in a social network, the users' relationships form a graph structure and the users' profile information forms graph signals. Let

(4) $X = [x_1, x_2, \dots, x_L] \in \mathbb{R}^{N \times L}$

be a matrix that contains $L$ graph signals generated from an unknown graph signal model $\mathcal{X}$. The $\ell$th column vector $x_\ell$ is the $\ell$th graph signal and the $i$th row vector collects the $L$ signal coefficients supported at vertex $v_i$.
Based on the given data, we consider the following optimization problem:

(5) $\min_{\theta}\ \sum_{\ell=1}^{L} \left\| x_\ell - \Phi_\theta\big( \Psi_\theta(x_\ell, A),\ A \big) \right\|_2^2.$
Comparing (2) and (5), there are two major differences: i) we work with $L$ given graph signals, which can be regarded as a proxy of an unknown graph signal model; and ii) the sampling and recovery operators are substituted by neural-network-based modules, which are proposed to implicitly capture an appropriate graph signal model from the given graph signals. In (2), $\Psi$ and $\Phi$ are solved analytically, while in (5), $\Psi_\theta(\cdot)$ and $\Phi_\theta(\cdot)$ are obtained through training.
We can solve the optimization problem (5) by stochastic gradient descent [29]. After training, we fix the trainable parameters in $\Psi_\theta(\cdot)$ and $\Phi_\theta(\cdot)$. Then, for an arbitrary graph signal $x$ generated from the same model $\mathcal{X}$, we can follow (3): using $\Psi_\theta(\cdot)$ to take measurements and using $\Phi_\theta(\cdot)$ to recover the complete graph signal. Based on this new trainable framework, we specifically design a graph neural sampling module and a graph neural recovery module in the following two sections, respectively.
III Graph Neural Sampling Module
In this section, we design a neural network to select a vertex set $\mathcal{S}$ that contains $M$ vertices. Two key challenges are: i) we need to explicitly describe the importance of a vertex in a complex and irregular graph; and ii) we need to consider information from two sources: graph structures and given graph signals. A trivial neural network architecture with brute-force end-to-end learning can hardly work in this case; we need to exploit graph-related properties. Our main intuition is to select those vertices that can maximally express their corresponding neighborhoods. Those vertices can be regarded as the centers of their corresponding communities. We mathematically measure this expressiveness by the mutual information between vertices' features and neighborhoods' features. These features include information from both graph structures and graph signals. The proposed graph neural sampling module can be optimized by estimating this mutual information.
III-A Mutual information neural estimation
We first define a vertex's neighborhood. For a vertex $v$, its neighborhood is a subgraph, $\mathcal{N}_v = (\mathcal{V}_v, A_v)$. The corresponding vertex set $\mathcal{V}_v$ contains the vertices whose geodesic distances to $v$ are no greater than a threshold $R$. The corresponding graph adjacency matrix $A_v$ is the corresponding submatrix of $A$. We call $\mathcal{N}_v$ vertex $v$'s neighborhood; correspondingly, $v$ is the anchor vertex of the neighborhood $\mathcal{N}_v$.
We then define a vertex's features and a neighborhood's features. For vertex $v$, the feature $x_v$ is its given signal coefficients in (4). For its neighborhood $\mathcal{N}_v$, the feature includes both the connectivity information and the vertex features in $\mathcal{N}_v$, denoted as $y_v$. Given an arbitrary vertex set $\mathcal{S} \subseteq \mathcal{V}$, we have access to the associated vertex features $\{x_v\}_{v \in \mathcal{S}}$ and the associated neighborhood features $\{y_v\}_{v \in \mathcal{S}}$, whose anchors are in $\mathcal{S}$.
We now want to quantify the dependency between the vertex features and the neighborhood features in the vertex set $\mathcal{S}$, which can be quantified by the mutual information. Let a random variable $X$ be the feature of a randomly picked vertex in $\mathcal{S}$; the distribution of $X$ is $\mathbb{P}_X$, where $x_v$ is the outcome feature when we pick vertex $v$. Similarly, let a random variable $Y$ be the neighborhood feature of a randomly picked anchor vertex in $\mathcal{S}$; the distribution of $Y$ is $\mathbb{P}_Y$, where $y_v$ is the outcome feature when we pick vertex $v$'s neighborhood. The mutual information between selected vertices and neighborhoods is the KL-divergence between the joint distribution $\mathbb{P}_{XY}$ and the product of the marginal distributions $\mathbb{P}_X \otimes \mathbb{P}_Y$; that is,

(6) $I(X; Y) = D_{\mathrm{KL}}\left( \mathbb{P}_{XY} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Y \right) \overset{(a)}{=} \sup_{f}\ \mathbb{E}_{\mathbb{P}_{XY}}\left[ f(x, y) \right] - \log \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Y}\left[ e^{f(x, y)} \right],$

where $(a)$ follows from the Donsker–Varadhan representation of the KL divergence [5]; $f(\cdot, \cdot)$ is an arbitrary function that maps the features of a pair of vertex and neighborhood to a real value, here reflecting the dependency between the vertex and the neighborhood.
Since our goal is to propose a vertex-selection criterion based on the dependencies between vertices and neighborhoods, it is not necessary to compute the exact mutual information based on the KL divergence. Instead, we can use non-KL divergences to achieve favourable flexibility and convenience in optimization, which share the same representation framework [47]. Here we consider a GAN-like divergence,

$\widehat{I}(X; Y) = \sup_{f}\ \mathbb{E}_{\mathbb{P}_{XY}}\left[ \log \sigma\big(f(x, y)\big) \right] + \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Y}\left[ \log \Big(1 - \sigma\big(f(x, y)\big)\Big) \right],$

where $\sigma(\cdot)$ is the sigmoid function. Intuitively, the function $f(\cdot, \cdot)$ evaluates the affinity between a vertex and a neighborhood. In practice, we cannot go through the entire function space; instead, we parameterize $f$ by a neural network $f_\theta$, where $\theta$ denotes the trainable parameters. Through optimizing over $\theta$, we obtain a neural estimation of $\widehat{I}(X; Y)$. We define our vertex-selection criterion function to be this neural estimation; that is,

(7) $C(\mathcal{S}) = \max_{\theta}\ \frac{1}{|\mathcal{S}|} \sum_{v \in \mathcal{S}} \log \sigma\big( f_\theta(x_v, y_v) \big) + \frac{1}{|\mathcal{S}|^2} \sum_{v \in \mathcal{S}} \sum_{u \in \mathcal{S}} \log \Big( 1 - \sigma\big( f_\theta(x_v, y_u) \big) \Big).$

In $C(\mathcal{S})$, the first term reflects the average affinities between vertices and their own neighborhoods in the vertex set $\mathcal{S}$, and the second term reflects the discrepancies between vertices and arbitrary neighborhoods. Notably, a higher score indicates that a vertex maximally reflects its own neighborhood and meanwhile minimally reflects all the other neighborhoods.
To specify the affinity network $f_\theta(\cdot, \cdot)$, we consider

$f_\theta(x_v, y_u) = g_\theta\big( e_\theta(x_v),\ h_\theta(y_u) \big),$

where the subscript $\theta$ indicates that the associated functions are trainable (the trainable parameters in $e_\theta(\cdot)$, $h_\theta(\cdot)$ and $g_\theta(\cdot, \cdot)$ do not share weights), $e_\theta(\cdot)$ and $h_\theta(\cdot)$ are embedding functions for a vertex and a neighborhood, respectively, and $g_\theta(\cdot, \cdot)$ is an affinity function that quantifies the affinity between a vertex and a neighborhood; see an illustration in Figure 2. We implement $e_\theta(\cdot)$ and $g_\theta(\cdot, \cdot)$ by multilayer perceptrons (MLPs) [29], a standard fully-connected neural network structure. We further implement $h_\theta(\cdot)$ by aggregating vertex features and neighborhood connectivities in $\mathcal{N}_v$; that is,

$h_\theta(y_v) = \sum_{r=0}^{R} \big( A_v^r\, X_v \big)_v\, W_r,$

where $W_r$ is the trainable weight matrix associated with the $r$th hop of neighbors, $X_v$ collects the vertex features in $\mathcal{N}_v$, and $(\cdot)_v$ reads off the row corresponding to the anchor vertex $v$. Here $h_\theta(\cdot)$ is a variant of a graph convolutional network [38].
In our method, the form of mutual information estimation and maximization is similar to deep graph infomax (DGI) [63]: both DGI and the proposed method apply the techniques of mutual information neural estimation [5, 32] to the graph domain; however, there are three major differences. First, DGI aims to train a graph embedding function while the proposed method aims to evaluate the importance of a vertex via its affinity to its neighborhood. Second, DGI considers the relationship between a vertex and an entire graph while we consider the relationship between a vertex and a neighborhood. By varying the neighborhood radius $R$, the proposed method is able to trade off local and global information. Third, DGI has to train on multiple graphs while the proposed method can work with an individual graph.
III-B Vertex selection optimization
We evaluate the vertex-selection criterion in (7) by solving the inner maximization problem, which naturally optimizes the internal affinity network $f_\theta(\cdot, \cdot)$. Note that i) the proposed network is based on the estimation of the dependencies between vertices and neighborhoods, exploiting graph-related properties; and ii) the proposed network aggregates information from two sources: graph structures and graph signals.
We can now select the most informative vertex set according to the criterion function (7). The vertex selection optimization problem is

(8) $\mathcal{S}^{*} = \arg\max_{\mathcal{S} \subseteq \mathcal{V},\ |\mathcal{S}| = M}\ C(\mathcal{S}).$
To solve (8), we consider the submodularity of mutual information [16] and employ a greedy algorithm. We select the first vertex with maximum $C(\{v\})$, and we then add new vertices sequentially by greedily maximizing $C(\mathcal{S} \cup \{v\})$. However, it is computationally expensive to evaluate $C(\mathcal{S})$ for two reasons: (i) for any vertex set $\mathcal{S}$, we need to solve an individual optimization problem; and (ii) the second term of $C(\mathcal{S})$ includes all the pairwise interactions, which incurs a quadratic computational cost. To address Issue (i), we set the vertex set to all the vertices in the graph and maximize $C(\mathcal{V})$ to train the affinity network $f_\theta(\cdot, \cdot)$; we then fix this network when evaluating $C(\mathcal{S})$. To address Issue (ii), we perform negative sampling to approximate the second term [43].
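Assuming the affinity network has already been trained and its pairwise vertex-to-neighborhood scores are stored in a matrix, the greedy loop can be sketched as follows (the score matrix is random for illustration, and negative sampling is omitted for brevity):

```python
import numpy as np

def criterion(S, F):
    """Set score of S given F[v, u] = sigma(f(x_v, y_u)) in (0, 1): average
    affinity of each selected vertex to its own neighborhood, plus a penalty
    for resembling the other selected neighborhoods (cf. the two terms of (7))."""
    own = np.mean([np.log(F[v, v]) for v in S])
    cross = np.mean([np.log(1.0 - F[v, u]) for v in S for u in S])
    return own + cross

def greedy_select(F, M):
    """Greedily grow the vertex set, adding the vertex that most increases
    the criterion at each step."""
    S = []
    for _ in range(M):
        rest = [v for v in range(F.shape[0]) if v not in S]
        S.append(max(rest, key=lambda v: criterion(S + [v], F)))
    return S

F = np.random.default_rng(1).uniform(0.05, 0.95, size=(6, 6))  # placeholder scores
selected = greedy_select(F, M=3)
```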
Note that the proposed neural network only provides the vertexselection criterion, which serves as the objective function in the optimization problem (8). The subsequent selection algorithm is not part of the network.
III-C Training details
After solving the vertex selection problem (8), we obtain $\mathcal{S}^{*}$, which contains $M$ unique vertices selected from $\mathcal{V}$, leading to the sampling operator $\Psi$ in (3a). In this way, we train the graph neural sampling module without considering the subsequent recovery task. Since the supervision is simply the relationships between vertices and neighborhoods in the graph structure and we do not use any additional labels as the network supervision, we consider this an unsupervised training setting.
We could also make the affinity network aware of the final recovery error. To this end, we introduce a trainable attention vector $z \in \mathbb{R}^N$ that provides a path for gradients to flow when training $f_\theta(\cdot, \cdot)$. The attention vector shares the same network with $f_\theta(\cdot, \cdot)$ and acts as a bridge connecting a graph neural sampling module and a subsequent graph neural recovery module. For vertex $v$, the attention score is

(9) $z_v = \sigma\big( f_\theta(x_v, y_v) \big),$

which measures the affinity between a vertex and its own neighborhood. With the attention vector $z$, we assign a trainable importance to each vertex, provide a path for backpropagation to flow gradients, and unify the training of a graph neural sampling module and a graph neural recovery module. Therefore, we can train the affinity network by using the supervision of both the vertex-selection criterion function and the final recovery error. Since we include the recovery loss, we consider this a supervised training setting.
When we sample a graph signal $x$, we first sample the signal coefficients supported on the vertex set $\mathcal{S}^{*}$ and then use the attention vector to adjust the measurements. The final samples are

(10) $x_\mathcal{M} = \operatorname{diag}\big( z_{\mathcal{S}^{*}} \big)\, \Psi\, x \in \mathbb{R}^M,$

where the sampling operator $\Psi$ is associated with the selected vertices in $\mathcal{S}^{*}$ and $\operatorname{diag}(z_{\mathcal{S}^{*}})$ is a diagonal matrix collecting the attention scores of the selected vertices. Note that i) $\operatorname{diag}(z_{\mathcal{S}^{*}})\,\Psi$ is a weighted sampling operator, where $\Psi$ provides the selected vertex indices and $z_{\mathcal{S}^{*}}$ provides the weights; and ii) the selected vertex indices in $\mathcal{S}^{*}$ are not a direct output from a neural network. They are obtained by solving the optimization (8), whose objective function is provided by a neural network. In (3a), we use a graph neural sampling module to represent the entire process for simplicity.
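A sketch of the weighted sampling in (10), with made-up selected indices and attention scores:

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5, 3.0, 1.5])  # a graph signal
selected = [1, 3]                          # hypothetical solution of (8)
z_S = np.array([0.9, 0.6])                 # attention scores of the selected vertices

Psi = np.zeros((len(selected), len(x)))
Psi[np.arange(len(selected)), selected] = 1.0

# Weighted measurements: subsample, then rescale by the trainable attention.
x_M = np.diag(z_S) @ Psi @ x
```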
III-D Relations to analytical sampling
Analytical sampling usually provides an optimal or near-optimal solution for sampling a graph signal generated from some specific graph signal model. Some widely-used models include the bandlimited class [11], the approximately-bandlimited class [14], the piecewise-smooth class [13] and the periodic class [61]. Many works also extend the sampling setting from subsampling to local-neighborhood sampling [65], aggregation sampling [42] and generalized sampling [61]. However, a fundamental issue of those previous works is that the ground-truth graph signal model in a specific task might be far from the assumed one, causing a significant performance gap between theory and practice.
To cope with this issue, the proposed graph neural sampling module is data-adaptive and leverages the training ability of neural networks to implicitly capture the underlying graph signal model. At the same time, our method differs from some recent graph pooling methods proposed in the deep learning literature [69, 39, 18] in three aspects. First, the previous graph pooling methods only depend on graph signals, while our method considers both the graph structure and graph signals. Second, the previous graph pooling methods purely rely on trainable neural networks to directly generate sampling operators without exploring any graph-related properties, while our method explicitly considers the dependency between vertices and local neighborhoods in a graph. A trainable neural network is only used to estimate the mutual information, and the output from this network has an interpretable semantic meaning. Third, the previous graph pooling methods require a subsequent task to provide additional supervision, while our method can be either supervised or unsupervised, as the network is optimized via mutual information neural estimation.
IV Graph Neural Recovery Module
In this section, we design a neural network to recover a complete graph signal from the few measurements collected by the graph neural sampling module. Two key challenges are: i) we need to design an appropriate architecture to recover the unknown signal coefficients; and ii) we need to enable end-to-end learning, so that both the sampling and recovery modules can be trained together. Here we design the graph neural recovery module based on the algorithm-unrolling technique, which transforms an analytical graph recovery algorithm into a neural network. In the end, we use the recovery error as the supervision to train both modules simultaneously.
IV-A General analytical graph signal recovery
We start with an analytical graph signal recovery algorithm. Let $h(A) = \sum_{k=0}^{K} h_k A^k$ be a graph filter with filter coefficients $h_k$, which is a polynomial of the graph shift operator. We aim to solve the following optimization problem for recovery:

(11a) $\widehat{x} = \arg\min_{x}\ \left\| h(A)\, x \right\|_2^2$

(11b) subject to $\Psi x = x_\mathcal{M}$.

The objective function (11a) promotes that the signal response after graph filtering should be small, and the constraint (11b) requires that when we use the same procedure to sample a recovered graph signal, the measurements should be the same as the current measurements. The design of the filter coefficients depends on a specific graph signal model. For example, to minimize the graph total variation, we can choose a high-pass graph Haar filter, that is, $h(A) = I - A$ [10]. Here we keep the general form $h(A)$ for generality.
We can obtain the closed-form solution for the optimization (11). Without loss of generality, we assume that the known coefficients correspond to the first $M$ graph vertices; this arrangement can always be achieved by reordering vertices. Then, a graph signal is

$x = \begin{bmatrix} x_\mathcal{M} \\ x_\mathcal{U} \end{bmatrix},$

where $x_\mathcal{M} \in \mathbb{R}^M$ is the measured part of the signal and $x_\mathcal{U} \in \mathbb{R}^{N-M}$ is the unknown part to be recovered. Let $H = h(A)$. Correspondingly, we represent $H$ in a block form as

$H = \begin{bmatrix} H_\mathcal{M} & H_\mathcal{U} \end{bmatrix},$

where $H_\mathcal{M} \in \mathbb{R}^{N \times M}$ and $H_\mathcal{U} \in \mathbb{R}^{N \times (N-M)}$ collect the columns of $H$ corresponding to the measured and unknown vertices, respectively.
Lemma 1.

The closed-form solution for the graph signal recovery problem (11) is given by

(12) $\widehat{x}_\mathcal{U} = -\left( H_\mathcal{U}^\top H_\mathcal{U} \right)^{-1} H_\mathcal{U}^\top H_\mathcal{M}\, x_\mathcal{M},$

where $(\cdot)^{-1}$ denotes the matrix inverse, assuming $H_\mathcal{U}$ has full column rank.
Proof. Substituting $x = [x_\mathcal{M}^\top\ \ x_\mathcal{U}^\top]^\top$ into (11a), the objective becomes $\| H_\mathcal{M} x_\mathcal{M} + H_\mathcal{U} x_\mathcal{U} \|_2^2$; setting its gradient with respect to $x_\mathcal{U}$ to zero yields (12). ∎
To avoid calculating the computationally expensive matrix inversion, we can use an iterative algorithm to obtain the same closed-form solution.
Algorithm 1.

Set the initial conditions as $t = 0$, $x^{(0)}_\mathcal{M} = x_\mathcal{M}$ and $x^{(0)}_\mathcal{U} = 0$, and do the following three steps iteratively:

(13) $y^{(t)} = H\, x^{(t)}, \qquad x^{(t+1)}_\mathcal{U} = x^{(t)}_\mathcal{U} - \alpha\, H_\mathcal{U}^\top\, y^{(t)}, \qquad x^{(t+1)}_\mathcal{M} = x_\mathcal{M},$

until convergence, where $\alpha > 0$ is an appropriate step size.
The following theorem analyzes the convergence property of Algorithm 1.
Theorem 1.

Let $\widehat{x}_\mathcal{U}$ be the closed-form solution (12) and suppose the step size satisfies $\left\| I - \alpha H_\mathcal{U}^\top H_\mathcal{U} \right\|_2 < 1$. Then the iterates of Algorithm 1 converge to $\widehat{x}_\mathcal{U}$.
Proof.
We can rewrite the updating step (13) as

$x^{(t+1)}_\mathcal{U} = \left( I - \alpha H_\mathcal{U}^\top H_\mathcal{U} \right) x^{(t)}_\mathcal{U} - \alpha H_\mathcal{U}^\top H_\mathcal{M}\, x_\mathcal{M}.$

Since the closed-form solution (12) satisfies $\widehat{x}_\mathcal{U} = \left( I - \alpha H_\mathcal{U}^\top H_\mathcal{U} \right) \widehat{x}_\mathcal{U} - \alpha H_\mathcal{U}^\top H_\mathcal{M}\, x_\mathcal{M}$, subtracting the two equations yields

$x^{(t+1)}_\mathcal{U} - \widehat{x}_\mathcal{U} = \left( I - \alpha H_\mathcal{U}^\top H_\mathcal{U} \right) \left( x^{(t)}_\mathcal{U} - \widehat{x}_\mathcal{U} \right).$

Denote $B = I - \alpha H_\mathcal{U}^\top H_\mathcal{U}$. By induction and the initialization $x^{(0)}_\mathcal{U} = 0$, we have

$\left\| x^{(t)}_\mathcal{U} - \widehat{x}_\mathcal{U} \right\|_2 = \left\| B^{t} \left( x^{(0)}_\mathcal{U} - \widehat{x}_\mathcal{U} \right) \right\|_2 \leq \| B \|_2^{\,t}\, \left\| \widehat{x}_\mathcal{U} \right\|_2,$

where the inequality follows the definition of the spectral norm. We finally obtain $x^{(t)}_\mathcal{U} \to \widehat{x}_\mathcal{U}$ whenever $\| B \|_2 < 1$.
∎
The following theorem shows that it is fairly easy to choose $\alpha$ to ensure convergence.
Theorem 2.

Let $H_\mathcal{U}$ have full column rank and let $0 < \alpha < 2 / \left\| H_\mathcal{U}^\top H_\mathcal{U} \right\|_2$. Then, $\left\| I - \alpha H_\mathcal{U}^\top H_\mathcal{U} \right\|_2 < 1$.
Proof.
Since is real and symmetric, we decompose it as
(14) 
where
is an unitary matrix and
are a diagonal matrix whose diagonal elements are are the corresponding eigenvalues, denoted as
; moreover, . Since , we have(15) 
We then bound the spectral norm of as
where comes from the property of spectral norm, follows the definition of , follows (14), comes from the unitary property of , comes from the definition of spectral norm and follows (15). ∎
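Assuming the theorem's condition amounts to a step size below 2 over the largest eigenvalue of the filter's Gram matrix (the specific constants are elided above), the contraction can be checked numerically: for a symmetric positive-definite matrix, the iteration matrix has spectral norm strictly below one for every admissible step size.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10

# Random symmetric positive-definite B = H^T H (the shift by 3I makes H
# nonsingular almost surely; an assumption for illustration).
H = rng.standard_normal((n, n)) + 3 * np.eye(n)
B = H.T @ H
lam_max = np.linalg.eigvalsh(B).max()

# Any step size in (0, 2 / lam_max) makes the iteration matrix a contraction.
for alpha in (0.25 / lam_max, 0.5 / lam_max, 0.9 / lam_max):
    M = np.eye(n) - 2 * alpha * B
    print(np.linalg.norm(M, 2) < 1)   # True for each alpha
```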
We now have an iterative algorithm to solve the recovery optimization (11). The key step (13) uses to filter the solution from the previous step and is essentially a polynomial of , which is again a graph filter. The fundamental challenge is then to choose the filter coefficients to adapt to a specific task. Previously, we had to manually design the filter coefficients for a predefined graph signal model; with the neural-network framework, we instead learn those filter coefficients adaptively from data.
IV-B Algorithm unrolling
Algorithm unrolling provides a concrete and systematic connection between iterative algorithms in signal processing and deep neural networks, paving the way to developing interpretable network architectures. The core strategy is to unroll an iterative algorithm into a graph neural network by mapping each iteration into a single network layer and stacking multiple layers together.
To unroll Algorithm 1, we substitute a trainable graph filter for the fixed . The th network layer in the neural recovery module then operates as
(16)  
where the filter coefficients at each layer are trainable parameters. We can also consider a multi-channel setting and other variants of a graph filter [25].
After forward propagation through layers, we output as the final solution of the neural recovery module.
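A minimal sketch of the unrolled forward pass follows. The adjacency matrix, shapes, and the random stand-ins for the learned polynomial coefficients are all illustrative assumptions; the point is the structure of (16): each layer applies a polynomial graph filter and keeps the sampled entries consistent.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, K, num_layers = 8, 3, 3, 4

A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)  # toy adjacency
A = A / np.linalg.norm(A, 2)          # normalize for stability (an assumption)
y = rng.standard_normal(m)            # measurements on the first m nodes

# Per-layer polynomial filter coefficients: random stand-ins for the
# trainable parameters learned during training.
theta = rng.standard_normal((num_layers, K + 1)) * 0.1

x = np.concatenate([y, np.zeros(n - m)])
for layer in range(num_layers):
    # Apply this layer's graph filter: a K-th order polynomial of A.
    z, h = np.zeros(n), x.copy()
    for k in range(K + 1):
        z += theta[layer, k] * h
        h = A @ h
    x = z
    x[:m] = y                         # keep the sampled entries consistent

print(x.shape, np.allclose(x[:m], y))   # (8,) True
```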
IV-C Training details
We can then train the proposed graph neural sampling and recovery modules together. Given a set of graph signals (4), we consider the following training loss
(17) 
where the recovery-error loss
with and following from the graph neural sampling module (10), and following from the graph neural recovery module (16), and the vertex-selection loss based on mutual-information neural estimation
(18) 
which is the vertex-selection criterion and relates to and . The hyperparameter balances the overall recovery task and the vertex-selection task. Note that our neural sampling module is trained through both and , which adapts the graph neural sampling module to the final recovery task; we consider this the supervised setting. When we use only to train the graph neural sampling module, we call it the unsupervised setting. In the entire process, the trainable components are the affinity network for sampling and the graph filter coefficients for recovery.
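The overall loss is a weighted sum of the two terms. The sketch below is only a schematic (the function name, the sign convention for the mutual-information estimate, and the weighting are illustrative assumptions; the actual vertex-selection term (18) is a mutual-information neural estimate, which is maximized and therefore enters with a negative sign here):

```python
import numpy as np

def total_loss(x_true, x_rec, mi_estimate, beta=0.5):
    """Schematic of the training loss (17): recovery error plus a weighted
    vertex-selection term. `mi_estimate` stands in for the mutual-information
    neural estimate in (18); since it is maximized, it is subtracted here.
    All names and the weighting are illustrative assumptions."""
    recovery = np.mean((x_true - x_rec) ** 2)
    return recovery - beta * mi_estimate

loss = total_loss(np.ones(4), np.zeros(4), mi_estimate=0.3)
print(round(loss, 3))   # 0.85
```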
IV-D Relations to analytical recovery
Analytical recovery usually solves an optimization problem with a predefined graph-regularization term. For example, (11a) regularizes the filter responses. Other options include the quadratic form of the graph Laplacian [57] and the form based on the graph incidence matrix [66]. However, a fundamental issue is which graph-regularization term is appropriate for a specific recovery task. An arbitrary graph-regularization term may provide a misleading inductive bias, causing a significant performance gap between theory and practice.
The proposed graph neural recovery method provides both flexibility and interpretability. On the one hand, it is data-adaptive and can learn a variety of graph signal models from given graph signals by leveraging deep neural networks. On the other hand, we unroll an analytical recovery algorithm into a graph neural network whose operations are interpretable, as they follow analytical iterative steps.
V Multiscale Model: Graph Cross Network
For analytical graph signal sampling and recovery, one of the biggest applications is to empower a graph filter bank to achieve multiscale analysis. Similarly, we can use the proposed graph neural sampling and recovery modules to design a multiscale graph neural network.
One benefit of multiscale representations is a flexible trade-off between local and global information [33, 56]. A multiscale graph filter bank is an array of band-pass graph filters that separates an input graph signal into multiple components supported at multiple graph scales. As a trainable counterpart of a multiscale graph filter bank, a multiscale graph neural network consists of three basic operations: graph downsampling, graph upsampling and graph filtering, all trainable and implemented by neural networks. Based on graph downsampling and upsampling, we achieve multiscale graph generation; based on graph filtering, we achieve multiscale feature extraction. Moreover, we propose a new feature-crossing layer to promote the fusion of intermediate features across graph scales, improving the learning ability. The cross-shaped structure of the feature-crossing layers also gives the model its name: graph cross network (GXN).
V-A Multiscale graph generation
To build a multiscale graph neural network, we first need to generate multiple scales for a graph and provide the corresponding vertextovertex association across scales. To achieve this, we need to design graph downsampling and upsampling.
Graph downsampling. Graph downsampling compresses a larger graph structure and its associated graph signal into a smaller graph structure and signal. It involves two related, yet different parts: graph-structure downsampling and graph-signal downsampling. Here graph-signal downsampling is the same as graph signal sampling. During graph downsampling, we must reduce the number of vertices and edges, causing information loss; a good graph downsampling method should preserve as much information as possible.
The proposed graph neural sampling module naturally selects informative vertices and collects the associated signal coefficients. This learnt sampling operator also provides the direct vertex-to-vertex association across scales. To further achieve graph-structure downsampling, we need to connect the selected vertices according to the original connections. Here we consider three approaches:
Direct reduction; that is, , where is the sampling operator. This is simple and straightforward, but loses significant connectivity information;
Fused reduction; that is, with , where softmax is a row-wise softmax function achieving row-wise normalization. Each row of represents the neighborhood of a selected vertex, and the intuition is to fuse the neighboring information into the selected vertices; and
Kron reduction [21], which takes the Schur complement of the graph Laplacian matrix and preserves the graph spectral properties, but is computationally expensive due to the matrix inversion. We can convert the original graph adjacency matrix to a graph Laplacian matrix, execute Kron reduction to obtain a downsampled graph Laplacian matrix, and convert it back to a downsampled graph adjacency matrix.
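The three reductions might be sketched as follows (numpy, on a toy cycle graph with an arbitrary choice of kept vertices; the row-softmax construction in the fused reduction is a simplistic stand-in for the trained variant):

```python
import numpy as np

def kron_reduction(A, keep):
    """Kron reduction: Schur complement of the graph Laplacian on the
    removed vertices, converted back to an adjacency matrix."""
    L = np.diag(A.sum(axis=1)) - A
    drop = [i for i in range(A.shape[0]) if i not in keep]
    L_red = (L[np.ix_(keep, keep)]
             - L[np.ix_(keep, drop)] @ np.linalg.solve(L[np.ix_(drop, drop)],
                                                       L[np.ix_(drop, keep)]))
    A_red = -L_red
    np.fill_diagonal(A_red, 0)
    return A_red

n = 6
A = np.zeros((n, n))                       # toy cycle graph on 6 vertices
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

keep = [0, 2, 4]
Psi = np.eye(n)[keep]                      # sampling operator

A_direct = Psi @ A @ Psi.T                 # 1) direct reduction

S = np.exp(A[keep])                        # 2) fused reduction: row-softmax of
S = S / S.sum(axis=1, keepdims=True)       #    neighborhoods, then fuse
A_fused = S @ A @ S.T

A_kron = kron_reduction(A, keep)           # 3) Kron reduction

# Kron reduction of a Laplacian yields a valid Laplacian on the kept vertices.
L_kron = np.diag(A_kron.sum(axis=1)) - A_kron
print(np.allclose(L_kron, L_kron.T), A_kron.min() >= 0)   # True True
```

Note that in this toy example the kept vertices are pairwise non-adjacent, so direct reduction returns an empty graph, while Kron reduction reconnects them through the removed vertices, illustrating the connectivity loss mentioned above.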
In our experiments, we find that Kron reduction is the most effective, direct reduction is the most efficient, and fused reduction achieves the best trade-off between effectiveness and efficiency. We thus adopt fused reduction as our default.
Overall, we can apply graph downsampling several times to generate multiscale representations of a graph. Given an input graph structure with the associated graph signal, , we first initialize the finest scale of the graph structure as with , and the associated graph signal . We then recursively apply graph downsampling times to obtain a series of coarser graph structures and the associated graph signals from and , respectively, where for . Here the superscript indexes the graph scale.
Graph upsampling. Graph upsampling is the inverse process of graph downsampling. Since we have the original graph structure and the vertex-to-vertex association across scales, we only need to design graph-signal upsampling, which is equivalent to graph signal recovery. We can directly use the proposed graph neural recovery module to obtain the signal coefficients at the unselected vertices.
After extracting features at each scale, we can use graph upsampling to lift features extracted at a coarser scale to a finer scale. For example, let and be the graph structure and the extracted feature vector at the th scale, respectively. To obtain the corresponding feature vector at the th scale (), we can recursively apply graph upsampling times based on the graph structure at each scale.
The proposed graph downsampling and upsampling together can generate multiscale representations of a graph and enable the feature conversion across scales. Note that both are trainable and adaptive to specific data and tasks.
V-B Multiscale feature extraction
Given multiple scales of a graph, we build a graph neural network at each scale to extract features. Each network consists of a sequence of trainable graph filters. After feature extraction at each scale, we combine the deep features from all scales to obtain the final representation, using graph upsampling to align features at different scales. We finally leverage a trainable graph filter to synthesize the fused multiscale features and generate the final representation for various downstream tasks, such as graph classification and vertex classification.
To further enhance information flow across scales, we propose a feature-crossing layer between two consecutive scales at various network layers, allowing multiscale features to communicate and merge in the intermediate network layers. Mathematically, let be the feature vector at the th scale and the th network layer. In the same th network layer, we downsample the feature vector at the th scale to the th scale and obtain . We also upsample the feature vector at the th scale to the th scale and obtain . Both steps are implemented using the proposed graph downsampling and graph upsampling. After a feature-crossing layer, we add the feature vectors from these three sources to obtain the fused feature vector
Since the proposed feature-crossing layers form a cross shape, we call this multiscale graph neural network architecture the graph cross network (GXN); see Figure 3. Compared to a standard multiscale graph neural network, the intermediate cross connections promote information flow across multiple scales and improve performance.
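A single feature-crossing step can be sketched as below. The fixed averaging/copying matrices are simplistic stand-ins for the trainable graph downsampling and upsampling modules, and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n_fine, n_mid, n_coarse, d = 8, 4, 2, 5

# Simplistic, fixed down/upsampling matrices standing in for the trainable
# graph neural sampling and recovery modules.
down_f2m = np.kron(np.eye(n_mid), np.ones((1, 2))) / 2    # fine -> mid: average pairs
up_c2m = np.kron(np.eye(n_coarse), np.ones((2, 1)))       # coarse -> mid: copy

F_fine = rng.standard_normal((n_fine, d))      # features at the finer scale
F_mid = rng.standard_normal((n_mid, d))        # features at this scale
F_coarse = rng.standard_normal((n_coarse, d))  # features at the coarser scale

# Feature crossing at the middle scale: fuse same-scale, downsampled and
# upsampled features by addition.
F_fused = F_mid + down_f2m @ F_fine + up_c2m @ F_coarse
print(F_fused.shape)   # (4, 5)
```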
V-C Training details
Here we consider training GXN for two tasks: vertex classification and graph classification. We use the same network architecture to implement multiscale graph generation and feature extraction for both tasks. The only differences between these two tasks are the final output and the loss function.
In vertex classification, we aim to classify each vertex into one or more predicted categories. The final output of GXN is a predicted labeling vector for vertex , where is the number of vertex categories in a graph. We then use the cross-entropy loss between the predicted and ground-truth labeling vectors to supervise the training; that is, (19)
where is the ground-truth labeling vector for vertex .
In graph classification, we aim to classify an entire graph into one or more predicted categories. After obtaining the final feature vector for each vertex, we use the standard SortPool [71] to remove the vertex dimension and obtain a graph labeling vector for a graph , where is the number of graph categories in a dataset. We then use the cross-entropy loss between the predicted and ground-truth labeling vectors to supervise the training; that is,
where is the ground-truth labeling vector for a graph .
V-D Relations to graph filter banks
A graph filter bank uses a series of band-pass graph filters to expand the input graph signal into multiple subband components [44, 58, 56, 53]. By adjusting the component in each subband, a graph filter bank can flexibly modify and reconstruct a graph signal. The expansion part is called analysis and the reconstruction part is called synthesis. To analyze and synthesize at multiple scales, a multiscale graph filter bank uses analytical graph signal sampling to achieve pyramid representations [56]. The benefit is that we can extract useful features at each scale and combine them in the end. A multiscale graph filter bank includes three building blocks: graph downsampling, graph upsampling and graph filtering.
The proposed multiscale graph neural network is essentially a trainable multiscale graph filter bank. The graph downsampling follows from graph neural sampling in Section III; the graph upsampling follows from graph neural recovery in Section IV; and the graph filtering follows the trainable graph convolution operation proposed in previous works [25, 26]. All three components are trainable and data-adaptive, and the analysis and synthesis modules are implicitly implemented during learning. By adjusting the last layer and the supervision of the proposed multiscale graph neural network, we can handle various graph-related tasks. Compared to a conventional multiscale graph filter bank, a multiscale graph neural network is more flexible for new tasks.
VI Illustration of Sampling and Recovery
In this section, we compare the analytical solutions and the neural-network solutions on a few toy examples, aiming to provide some insight into various vertex-selection strategies and the corresponding recovery performances.
VI-A Vertex selection strategy
Here we consider two experiments. First, we sample bandlimited signals on two similar, yet different graph structures, gaining intuition about how graph structures influence vertex selection. Second, we sample bandlimited and piecewise-bandlimited signals on the same graph structure, which helps us understand how the given graph signals influence vertex selection.
Effect of graph structures. We generate two graph structures with vertices based on the stochastic block model [1]. Each graph has two communities, which have and vertices, respectively. For the first graph, we set the connection probability between vertices in the first community to be , the connection probability between vertices in the second community to be , and the connection probability between vertices from different communities to be . Overall, all the vertices in this graph have similar degrees and the average degree is around ; we thus name it the similar-degree graph. For the second graph, we set the connection probability between vertices in the same community to be , and the connection probability between vertices from different communities to be . In each community, the number of edges is approximately proportional to the number of vertices; that is, the average degrees in the first and second communities are around and , respectively. We thus name it the similar-density graph. Given a sampling method, we select vertices out of the vertices. We run trials and compute the probability that the selected vertices fall into the smaller community of each graph.

We apply three sampling methods: bandlimited-space (BLS) sampling [11], spectral-proxy (SP) sampling [2] and the proposed graph neural sampling module. The vertices selected by bandlimited-space sampling aim to maximally preserve information in the bandlimited space, which is spanned by the first eigenvectors of the graph Laplacian matrix. Spectral-proxy sampling promotes a similar idea; however, instead of using the exact eigenvectors, it uses a spectral proxy to approximate the bandlimited space. Spectral-proxy sampling involves a hyperparameter, the proxy order ; a larger yields a better approximation to the bandlimited space. Both bandlimited-space sampling and spectral-proxy sampling are analytically designed based on explicit graph signal models. For the proposed neural sampling, we use the first eigenvectors of the graph Laplacian matrix as graph signals to train the neural networks; in other words, we guide the neural networks to preserve information in the bandlimited space. We therefore expect the proposed graph neural sampling module to perform similarly to bandlimited-space sampling and spectral-proxy sampling.
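A two-community stochastic block model of the kind used here can be generated as below. The community sizes and probabilities are placeholders, since the experiment's exact values are not reproduced in this sketch:

```python
import numpy as np

def sbm_two_communities(sizes, p_in, p_out, seed=0):
    """Sample a symmetric adjacency matrix from a two-block stochastic
    block model: p_in[k] is the edge probability inside community k,
    and p_out is the probability between communities."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], sizes)
    same = labels[:, None] == labels[None, :]
    P = np.where(same, np.asarray(p_in)[labels][:, None], p_out)
    U = rng.random(P.shape)
    A = np.triu((U < P).astype(float), k=1)   # sample upper triangle, no self-loops
    return A + A.T

# Placeholder sizes and probabilities (illustrative assumptions).
A = sbm_two_communities(sizes=(70, 30), p_in=(0.2, 0.4), p_out=0.02)
print(A.shape, np.allclose(A, A.T), np.trace(A) == 0)   # (100, 100) True True
```

Tuning `p_in` per community controls whether the two blocks end up with similar degrees or similar densities, mirroring the two graphs described above.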
[Table I: probability that the selected vertices fall into the smaller community, for bandlimited-space sampling, spectral-proxy sampling at three proxy orders, and neural sampling with/without recovery, on the similar-degree and similar-density graphs.]
Table I shows the probability that the selected vertices fall into the smaller community of either the similar-degree graph or the similar-density graph. In the similar-degree graph, bandlimited-space sampling selects vertices from the smaller community, which has vertices in the entire graph. The intuition is that the vertices in both communities carry similar amounts of information. In comparison, in the similar-density graph, bandlimited-space sampling selects vertices from the smaller community. The intuition is that vertices with weaker connectivity are more informative, because their information is much harder to access from other vertices. Spectral-proxy sampling shows similar trends, especially as we increase the proxy order . For the graph neural sampling module, we consider settings both with and without the recovery-error loss. Both cases show consistent performances similar to bandlimited-space sampling. This reflects that, given the same graph signal model, the proposed graph neural sampling module performs similarly to analytical sampling and adapts to the underlying graph structure. Figure 4 illustrates the vertices selected by the sampling methods in both graph structures.
Effect of graph signals. Based on a geometric graph with vertices, we generate two types of graph signals: bandlimited graph signals, which are the first eigenvectors of the graph Laplacian matrix, and piecewise-bandlimited graph signals, where we intentionally introduce a boundary by applying a mask to the bandlimited graph signals. We use bandlimited-space sampling and the proposed graph neural sampling module to select vertices, and compare the selected vertices in Figure 5. Since bandlimited-space sampling is designed solely based on graph structures, the selected vertices are the same for both bandlimited and piecewise-bandlimited graph signals. On the other hand, the proposed graph neural sampling module adapts to both graph signals and graph structures. We see that the vertices selected for piecewise-bandlimited graph signals are much closer to the boundary than those selected for bandlimited graph signals, reflecting that the vertices along the boundary are informative.
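The two signal types can be synthesized along these lines (the connection radius, bandwidth, and half-plane mask are illustrative assumptions, not the experiment's exact construction):

```python
import numpy as np

rng = np.random.default_rng(5)
n, bandwidth = 50, 5

# Random geometric graph: connect points closer than a radius.
pts = rng.random((n, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
A = ((D < 0.35) & (D > 0)).astype(float)

L = np.diag(A.sum(axis=1)) - A
eigvals, eigvecs = np.linalg.eigh(L)

# Bandlimited signal: a combination of the first few Laplacian eigenvectors.
x_bl = eigvecs[:, :bandwidth] @ rng.standard_normal(bandwidth)

# Piecewise-bandlimited signal: introduce a boundary by masking one half-plane.
mask = np.where(pts[:, 0] > 0.5, 1.0, -1.0)
x_pw = mask * x_bl

print(x_bl.shape == (n,), x_pw.shape == (n,))   # True True
```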
VI-B Recovery performance
We further validate the recovery performance of the proposed graph neural recovery module. Based on a geometric graph with vertices, we generate two types of graph signals, bandlimited and piecewise-bandlimited, to train two pairs of the proposed graph neural sampling and recovery modules, respectively. For a new bandlimited or piecewise-bandlimited graph signal, we use the proposed graph neural sampling module to select vertices and take the corresponding measurements; see Figure 6 (c) and (d), respectively. We then apply the proposed graph neural recovery module to reconstruct the original graph signals. Figure 6 (e) and (f) illustrate that the reconstructions approximate the original graph signals well. This validates that i) the proposed graph neural sampling and recovery modules can adapt to various types of graph signals; and ii) they generalize well to new data with a small amount of training data.
[Figure: (a) Cora; (b) Citeseer; (c) Pubmed.]
Table II. Vertex classification accuracies (%), under full-supervised (full) and semi-supervised (semi) settings.

Dataset              |       Cora        |     Citeseer      |      Pubmed
# Vertices (Classes) |     2708 (7)      |     3327 (6)      |     19717 (3)
Method               |  full      semi   |  full      semi   |  full      semi
DeepWalk [49]        | 78.4±1.7  67.2±2.0 | 68.5±1.8  43.2±1.6 | 79.8±1.1  65.3±1.1
ChebNet [17]         | 86.4±0.5  81.2±0.5 | 78.9±0.4  69.8±0.5 | 88.7±0.3  74.4±0.4
GCN [38]             | 86.6±0.4  81.5±0.5 | 79.3±0.5  70.3±0.5 | 90.2±0.3  79.0±0.3
GAT [62]             | 87.8±0.7  83.0±0.7 | 80.2±0.6  73.5±0.7 | 90.6±0.4  79.0±0.3
FastGCN [8]          | 85.0±0.8  80.8±1.0 | 77.6±0.8  69.4±0.8 | 88.0±0.6  78.5±0.7
ASGCN [36]           | 87.4±0.3    --     | 79.6±0.2    --     | 90.6±0.3    --
Graph UNet [27]      |   --      84.4     |   --      73.2     |   --      79.6
GXN                  | 88.9±0.4  85.1±0.6 | 80.9±0.4  74.8±0.4 | 91.8±0.3  80.2±0.3
GXN (noCross)        | 87.3±0.4  83.2±0.5 | 79.5±0.4  73.7±0.3 | 91.1±0.2  79.6±0.3
Table III. Graph classification datasets and accuracies (%).

Dataset            | IMDB-B   | IMDB-M   | COLLAB   | D&D      | PROTEINS | ENZYMES
# Graphs (Classes) | 1000 (2) | 1500 (3) | 5000 (3) | 1178 (2) | 1113 (2) | 600 (6)
Avg. # Vertices    | 19.77    | 13.00    | 74.49    | 284.32   | 39.06    | 32.63
PatchySAN [46]     | 76.27±2.6 | 69.70±2.2 | 43.33±2.8 | 72.60±2.2 | 75.00±2.8 | --
ECC [59]           | --       | --       | 67.79    | 72.54    | 72.65    | 53.50
Set2Set [28]       | --       | --       | 71.75    | 78.12    | 74.29    | 60.15
DGCNN [71]         | 70.00±0.9 | 47.83±0.9 | 73.76±0.5 | 79.37±0.9 | 73.68±0.9 | --
DiffPool [69]      | 70.40    | 47.83    | 75.84    | 80.64    | 76.25    | 62.53
Graph UNet [27]    | 72.10    | 48.33    | 77.56    | 82.43    | 77.68    | 58.57
SAGPool [39]       | 72.80    | 49.43    | 78.52    | 82.84    | 78.28    | 60.23
AttPool [35]       | 73.60    | 50.67    | 77.04    | 79.20    | 76.50    | 59.76
StructPool [70]    | 74.70    | 52.47    | 74.22    | 84.19    | 80.36    | 63.83
GXN                | 77.30±0.8 | 54.57±0.9 | 80.62±0.8 | 84.26±1.3 | 80.38±1.2 | 60.43±1.0
GXN (gPool)        | 76.40±1.0 | 53.16±0.6 | 79.85±1.1 | 83.44±1.4 | 78.74±0.8 | 59.74±1.3
GXN (SAGPool)      | 76.90±0.7 | 52.74±0.8 | 80.28±0.8 | 84.14±1.3 | 79.58±1.1 | 59.87±1.1
GXN (AttPool)      | 76.85±0.9 | 53.62±0.9 | 80.37±0.9 | 84.07±1.0 | 79.09±1.3 | 59.45±1.0
GXN ()             | 77.10±0.6 | 54.22±1.0 | 80.11±0.8 | 84.13±1.0 | 79.87±0.8 | 59.48±0.8
GXN ()             | 76.80±1.1 | 54.08±0.7 | 80.28±1.0 | 83.61±1.2 | 79.64±1.2 | 58.95±1.3
GXN (noCross)      | 74.80±1.1 | 52.68±0.9 | 79.94±0.7 | 83.64±0.9 | 79.26±0.9 | 59.37±1.2
GXN (early)        | 77.10±0.6 | 53.83±0.6 | 80.18±0.8 | 84.24±1.0 | 80.30±1.0 | 60.43±1.0
GXN (late)         | 76.30±0.9 | 54.12±1.0 | 79.88±1.1 | 83.85±1.5 | 80.03±1.2 | 59.84±0.9
VII Applications
In this section, we present three applications: active-sampling-based semi-supervised classification, vertex classification and graph classification. The first validates the quality of the vertices selected by the graph neural sampling module; the second and third show the superiority of the proposed graph cross network.
VII-A Active-sampling-based semi-supervised classification
The task is to classify each vertex into a predefined category. Here we are allowed to actively query the category labels of a few selected vertices as our training data and then classify all the remaining vertices in a semi-supervised paradigm. We compare various sampling methods followed by the same classifier; the classification accuracy thus reflects the amount of information carried by the selected vertices. The goal of this experiment is to validate the effectiveness of the proposed graph neural sampling module.
Datasets. We use three classical citation networks: Cora, Citeseer and Pubmed [38], whose vertices are articles and whose edges are references. Cora has 2,708 vertices with 7 predefined vertex categories, Citeseer has 3,327 vertices with 6 predefined vertex categories, and Pubmed has 19,717 vertices with 3 predefined vertex categories. Each dataset has multiple binary vertex features.
Experimental setup. We consider four sampling methods to actively select a few vertices: random sampling, which selects each vertex uniformly at random; bandlimited-space (BLS) sampling [11]; spectral-proxy (SP) sampling [2]; and the proposed graph neural sampling module, which uses both vertex features and graph structures. Once the training samples are selected, we use the same standard graph convolutional network [38] as the semi-supervised classifier to classify the remaining vertices.
Results. Figure 7 shows the classification accuracy as a function of the number of selected vertices on the three datasets; the x-axis is the number of selected vertices and the y-axis is the classification accuracy. We expect the classification accuracy to increase with more selected vertices. Across the three datasets, the proposed graph neural sampling module (in red) significantly outperforms the other methods. For example, on Cora, given selected vertices, the gap between the proposed graph neural sampling module and bandlimited-space sampling is more than . At the same time, the two analytical sampling methods, bandlimited-space sampling (in green) and spectral-proxy sampling (in purple), consistently outperform random sampling (in blue). The intuition is that both analytical sampling methods assume smooth graph signal models, which are beneficial but cannot perfectly fit arbitrary datasets, while the proposed graph neural sampling module combines information from both vertex features and graph structures to capture the underlying implicit graph signal model and adaptively selects informative vertices in each dataset.
VII-B Vertex classification
The task is to classify each vertex into a predefined category under both full-supervised and semi-supervised settings. The goal of this experiment is to validate the effectiveness of the proposed multiscale graph neural network, GXN.
Datasets. We use three standard citation networks: Cora, Citeseer and Pubmed [38]. We perform both full-supervised and semi-supervised vertex classification. For full-supervised classification, we label all the vertices in the training sets for model training; for semi-supervised classification, we label only a few vertices (around 7% on average) in the training sets. We use the default training/validation/test splits [38].
Experimental setup. We consider three scales in the proposed GXN, which preserve , and vertices of the original scale, respectively. For both the input and readout layers, we use one graph convolution layer [38]; for multiscale feature extraction, we use two graph convolution layers followed by ReLUs at each scale, with feature-crossing layers between any two consecutive scales at any layer. For the graph neural sampling module in GXN, we modify by preserving only the first term to improve the efficiency of solving problem (8); in this way, each vertex contributes to the vertex set independently, and the optimal solution is to select the top vertices. The hidden features are 128-dimensional across the network. In the loss function (19), the hyperparameter decays from to during training: early on, the graph neural sampling module needs fast convergence for vertex selection, and the model then gradually focuses more on the task based on the effective sampling. We use the Adam optimizer [29], with learning rates ranging from to for different datasets.
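With only the first term kept, the simplified selection reduces to scoring each vertex independently and taking the top k. A minimal sketch (the random scores are stand-ins for the learned per-vertex scores):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 10, 4
scores = rng.standard_normal(n)           # stand-ins for learned per-vertex scores

selected = np.argsort(scores)[-k:][::-1]  # indices of the k highest-scoring vertices
print(len(selected), np.all(scores[selected] >= np.sort(scores)[-k]))   # 4 True
```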
Results. We compare the proposed GXN to state-of-the-art methods for vertex classification: DeepWalk [49], GCN [38], GraphSAGE [30], FastGCN [8], ASGCN [36] and Graph UNet [27]. We reproduce these methods for both full-supervised and semi-supervised learning based on their official code. Table II compares the vertex classification accuracies of the various methods. Under both full-supervised and semi-supervised settings, the proposed GXN achieves higher average accuracy by . GXN (noCross) denotes a degraded GXN without any feature-crossing layer; overall, we see that the feature-crossing layers improve the accuracies by on average. Introducing more connections across graph scales improves the classification performance.
VII-C Graph classification
The task is to classify an entire graph into a predefined category. The goal of this experiment is, again, to validate the effectiveness of the proposed GXN.
Datasets. We use the social network datasets IMDB-B, IMDB-M and COLLAB [68], and the bioinformatic datasets D&D [19], PROTEINS [23] and ENZYMES [6]. Table III shows the dataset information. Note that no vertex features are provided in the three social network datasets, so we use one-hot vectors encoding the vertex degrees as vertex features, explicitly utilizing some structural information. We use the same dataset split as in [27], perform 10-fold cross-validation, and report the average accuracy for evaluation.
Experimental setup. We use the same setting as in the vertex classification task for the proposed GXN. The only difference is that, after the readout layers, we unify the various graph embeddings to the same dimension using the same aggregation method adopted in DGCNN [71], AttPool [35] and Graph UNet [27].
Results. We compare the proposed GXN to other GNN-based methods, including PatchySAN [46], ECC [59], Set2Set [28], DGCNN [71], DiffPool [69], Graph UNet [27], SAGPool [39], AttPool [35] and StructPool [70], most of which perform multiscale graph feature learning. Additionally, we design several variants of GXN: 1) to test the superiority of the proposed graph neural sampling module, we apply gPool [27], SAGPool