Graph Convolutional Neural Networks via Scattering

03/31/2018, by Dongmian Zou et al.

We generalize the scattering transform to graphs and consequently construct a convolutional neural network on graphs. We show that under certain conditions, any feature generated by such a network is approximately invariant to permutations and stable to graph manipulations. Numerical results demonstrate competitive performance on relevant datasets.

1 Introduction

Many interesting and modern datasets can be described by graphs. Examples include social [1], physical [2], and transportation [3] networks. The recent survey paper of Bronstein et al. [4] on geometric deep learning emphasizes the need to develop deep learning tools for such datasets and, even more importantly, to understand the mathematical properties of these tools, in particular their invariances.

They also mention two types of problems that may be addressed by such tools. The first problem is signal analysis on graphs with applications such as classification, prediction and inference on graphs. The second problem is learning the graph structure with applications such as graph clustering and graph matching.

Several recent works address the first problem [5, 6, 7, 8]. In these works, the filters of the networks are designed to be parametric functions of graph operators, such as the graph adjacency and Laplacian, and the parameters of those functions have to be trained.

The second problem is often explored with random graphs generated according to two common models: Erdős–Rényi, which is used for graph matching, and the Stochastic Block Model (SBM), which is used for community detection. Some recent graph neural networks have obtained state-of-the-art performance for graph matching [9] and community detection [10, 11] with synthetic data generated from the respective graph models. As above, the filters in these works are parametric functions of either the graph adjacency or Laplacian, where the parameters are trained.

Despite the impressive progress in developing graph neural networks for solving these two problems, the performance of these methods is poorly understood. Of main interest is their invariance or stability to basic signal and graph manipulations. In the Euclidean case, the stability of a convolutional neural network [12] to rigid transformations and deformations is best understood in view of the scattering transform [13]. The scattering transform has a multilayer structure and uses wavelet filters to propagate signals. It can be viewed as a convolutional neural network in which no training is required to design the filters; training is only required for the classifier applied to the transformed data. Nevertheless, there is freedom in the selection and design of the wavelets. The scattering transform is approximately invariant to translation and rotation. More precisely, under strong assumptions on the wavelet and scaling functions, and as the coarsest scale approaches infinity, the scattering transform becomes invariant to translations and rotations. Moreover, it is Lipschitz continuous with respect to smooth deformations. These properties are shown in [13] for signals in $L^2(\mathbb{R}^d)$ and $L^2(G)$, where $G$ is a compact Lie group.

It is interesting to note that the design of filters in existing graph neural networks is related to the design of wavelets on graphs in the signal processing literature. Indeed, the construction of wavelets on graphs uses special operators on graphs, such as the graph adjacency and Laplacian. As mentioned above, these operators are commonly used in graph neural networks. The earliest works on graph wavelets [14, 15] apply the normalized graph Laplacian to define diffusion wavelets on graphs and use them to study multiresolution decompositions of graph signals. Hammond et al. [16] use the unnormalized graph Laplacian to define analogous graph wavelets and study properties of these wavelets such as reconstructibility and locality. One can easily construct a graph scattering transform by using any of these wavelets. A main question is whether this scattering transform enjoys the desired invariance and stability properties.

In this work, we use a special instance of the graph wavelets of [16] to form a graph scattering network and establish its covariance and approximate invariance to permutations and stability to graph manipulations. We also demonstrate the practical effectiveness of this transform in solving the two types of problems discussed above.

The rest of the paper is organized as follows. The scattering transform on graphs is defined in Section 2. Section 3 shows that the full scattering transform preserves the energy of the input signal. This section also provides an absolute bound on the energy decay rate of components of the transform at each layer. Section 4 proves the permutation covariance and approximate invariance of the graph scattering transform. It also briefly discusses previously suggested candidates for the notion of translation or localization on graphs and the possible covariance and approximate invariance of the scattering transform with respect to them. Furthermore, it clarifies why some special permutations are good substitutes for Euclidean rigid transformations. Section 5 establishes the stability of the scattering transform with respect to graph manipulations. Section 6 demonstrates competitive performance of the proposed graph neural network in solving the two types of problems.

2 Wavelet graph convolutional neural network

We first review the graph wavelets of [16] in Section 2.1. We then use these wavelets and ideas of [13] to construct a graph scattering transform in Section 2.2.

2.1 Wavelets on graphs

We review the wavelet construction of Hammond et al. [16] and adapt it to our setting. Our general theory applies to what we call simple graphs, that is, weighted, undirected and connected graphs with no self-loops. We remark that we could also address self-loops, but for simplicity we exclude them. Throughout the paper we fix an arbitrary simple graph $G$ with $N$ vertices. We also consistently use uppercase boldface letters to denote matrices and lowercase boldface letters to denote vectors or vector-valued functions.

The weight matrix $\mathbf{W}$ of $G$ is an $N \times N$ symmetric matrix with zero diagonal, where $\mathbf{W}_{ij}$ denotes the weight assigned to the edge between vertices $i$ and $j$ of $G$. The degree matrix $\mathbf{D}$ of $G$ is an $N \times N$ diagonal matrix with

$\mathbf{D}_{ii} = \sum_{j=1}^{N} \mathbf{W}_{ij}, \quad i = 1, \ldots, N.$   (1)

The (unnormalized) Laplacian of $G$ is the $N \times N$ matrix

$\mathbf{L} = \mathbf{D} - \mathbf{W}.$   (2)

The eigenvalues of $\mathbf{L}$ are non-negative and the smallest one is $0$. Since the graph is connected, the eigenspace of the eigenvalue $0$ (that is, the kernel of $\mathbf{L}$) has dimension one. It is spanned by a vector with equal nonzero entries for all vertices. This vector represents a signal of the lowest possible “frequency”.

The graph Laplacian is symmetric and can be represented as

$\mathbf{L} = \sum_{k=0}^{N-1} \lambda_k\, \mathbf{u}_k \mathbf{u}_k^{*},$   (3)

where $0 = \lambda_0 < \lambda_1 \le \cdots \le \lambda_{N-1}$ are the eigenvalues of $\mathbf{L}$, $\mathbf{u}_0, \ldots, \mathbf{u}_{N-1}$ are the corresponding orthonormal eigenvectors, and $^{*}$ denotes the conjugate transpose. We remark that the phases of the eigenvectors of $\mathbf{L}$ and their order within any eigenspace of dimension larger than 1 can be arbitrarily chosen without affecting our theory for the graph scattering transform formulated below.

Let $\mathbf{f} \in \mathbb{C}^N$ be a graph signal. Note that in our setting we can regard $\mathbf{f}$ as a function on the $N$ vertices of $G$, and without further specification we shall consider $\mathbf{f} \in \mathbb{C}^N$. We define the Fourier transform $\hat{\mathbf{f}} \in \mathbb{C}^N$ of $\mathbf{f}$ by

$\hat{\mathbf{f}}(k) = \langle \mathbf{f}, \mathbf{u}_k \rangle, \quad k = 0, \ldots, N-1,$   (4)

and the inverse Fourier transform by

$\mathbf{f} = \sum_{k=0}^{N-1} \hat{\mathbf{f}}(k)\, \mathbf{u}_k.$   (5)

Let $\odot$ denote the Hadamard product, that is, for $\mathbf{a}, \mathbf{b} \in \mathbb{C}^N$, $(\mathbf{a} \odot \mathbf{b})(k) = \mathbf{a}(k)\,\mathbf{b}(k)$, $k = 0, \ldots, N-1$. Define the convolution of $\mathbf{f}$ and $\mathbf{g}$ in $\mathbb{C}^N$ as the inverse Fourier transform of $\hat{\mathbf{f}} \odot \hat{\mathbf{g}}$, that is,

$\mathbf{f} * \mathbf{g} = \sum_{k=0}^{N-1} \hat{\mathbf{f}}(k)\, \hat{\mathbf{g}}(k)\, \mathbf{u}_k.$   (6)

When emphasizing the dependence of the convolution on the graph $G$, we denote it by $*_G$.
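To make the spectral definitions above concrete, the following minimal NumPy sketch builds the Laplacian of a small illustrative graph (not one used in the paper) and implements the Fourier transform and convolution of (4)–(6); all variable and function names are our own.

```python
import numpy as np

# Weight matrix of a small simple graph (symmetric, zero diagonal); values are illustrative.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 0.],
              [0., 1., 0., 0.]])
D = np.diag(W.sum(axis=1))   # degree matrix, cf. (1)
L = D - W                    # unnormalized graph Laplacian, cf. (2)

# Eigendecomposition L = sum_k lambda_k u_k u_k^*, cf. (3); eigh returns ascending eigenvalues.
lam, U = np.linalg.eigh(L)

def gft(f):
    """Graph Fourier transform, cf. (4): coefficients <f, u_k>."""
    return U.conj().T @ f

def igft(fhat):
    """Inverse graph Fourier transform, cf. (5)."""
    return U @ fhat

def gconv(f, g):
    """Graph convolution, cf. (6): inverse transform of the Hadamard product of spectra."""
    return igft(gft(f) * gft(g))

f = np.array([1., 0., 0., 0.])
g = np.array([0., 1., 0., 0.])
print(np.allclose(gconv(f, g), gconv(g, f)))  # True: the convolution is commutative
```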

Euclidean wavelets use shifts and dilations in Euclidean space. For signals defined on graphs, which are discrete, the notions of translation and dilation need to be defined in the spectral domain. Hammond et al. [16] view $[0, \infty)$ as the spectral domain since it contains the eigenvalues of $\mathbf{L}$. Their procedure assumes a scaling function $\phi$ and a wavelet function $\psi$ [17, 18] with corresponding Fourier transforms $\hat{\phi}$ and $\hat{\psi}$. They impose only minimal assumptions on $\hat{\phi}$ and $\hat{\psi}$. In our construction, we consider dyadic wavelets, that is,

$\hat{\psi}_j(\lambda) = \hat{\psi}(2^{-j}\lambda), \quad j \in \mathbb{Z}.$   (7)

Also, we fix a scale $J$ of coarsest resolution and assume that $\psi$ and $\phi$ can be constructed from a multiresolution analysis, that is,

(8)

The graph wavelets of [16] are constructed as follows in our setting. For a scale $j \le J$, denote by $\boldsymbol{\psi}_j$ the vector in $\mathbb{C}^N$ whose Fourier coefficients are $\hat{\boldsymbol{\psi}}_j(k) = \hat{\psi}(2^{-j}\lambda_k)$, $k = 0, \ldots, N-1$. Similarly, $\hat{\boldsymbol{\phi}}_J(k) = \hat{\phi}(2^{-J}\lambda_k)$. In view of (6),

$\boldsymbol{\psi}_j * \mathbf{f} = \sum_{k=0}^{N-1} \hat{\psi}(2^{-j}\lambda_k)\, \hat{\mathbf{f}}(k)\, \mathbf{u}_k \quad \text{and} \quad \boldsymbol{\phi}_J * \mathbf{f} = \sum_{k=0}^{N-1} \hat{\phi}(2^{-J}\lambda_k)\, \hat{\mathbf{f}}(k)\, \mathbf{u}_k.$   (9)

Note that $\boldsymbol{\psi}_j * \mathbf{f}$ and $\boldsymbol{\phi}_J * \mathbf{f}$ are both in $\mathbb{C}^N$. The graph wavelet coefficients of $\mathbf{f}$ are defined by

$\boldsymbol{\Psi}_j \mathbf{f} = \boldsymbol{\psi}_j * \mathbf{f}, \qquad \boldsymbol{\Phi}_J \mathbf{f} = \boldsymbol{\phi}_J * \mathbf{f}.$   (10)

We use boldface notation for $\boldsymbol{\Psi}_j$ and $\boldsymbol{\Phi}_J$ to emphasize that they are operators, even though the individual wavelet coefficients $(\boldsymbol{\Psi}_j \mathbf{f})(n)$, $(\boldsymbol{\Phi}_J \mathbf{f})(n)$ are scalars. At last, we note that (8) implies that . Combining this fact and (9) results in

(11)
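As a rough illustration of (9)–(10), the sketch below evaluates dilated spectral profiles at the Laplacian eigenvalues and forms the wavelet and scaling coefficients. The particular choices of $\hat{\phi}$ and $\hat{\psi}$ are placeholders for demonstration only and are not the functions used in the paper; `U`, `lam` and `f` are reused from the previous sketch.

```python
import numpy as np

# Placeholder spectral profiles: a low-pass phi_hat and a band-pass psi_hat (illustrative).
phi_hat = lambda lmbda: np.exp(-lmbda ** 2)
psi_hat = lambda lmbda: lmbda * np.exp(-lmbda ** 2)

def wavelet_coeffs(f, U, lam, j, J):
    """Graph wavelet and scaling coefficients Psi_j f and Phi_J f, cf. (9)-(10):
    multiply the Fourier coefficients of f by the dilated spectral profiles."""
    fhat = U.conj().T @ f
    psi_j_f = U @ (psi_hat(2.0 ** (-j) * lam) * fhat)
    phi_J_f = U @ (phi_hat(2.0 ** (-J) * lam) * fhat)
    return psi_j_f, phi_J_f

# Example: coefficients at scale j = 0 with coarsest scale J = 3.
psi0_f, phiJ_f = wavelet_coeffs(f, U, lam, j=0, J=3)
```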

2.2 Scattering on graphs

Our construction of convolutional neural networks on graphs is inspired by Mallat’s scattering network [13]. As a feature extractor, the scattering transform defined in [13] is translation and rotation invariant in the limit where the coarsest scale approaches infinity, provided that the wavelet and scaling functions satisfy certain strong admissibility conditions. It is also Lipschitz continuous with respect to small deformations. The neural network representation of the scattering transform is the scattering network. It has been successfully used in image and audio classification problems [19].

We form a scattering network on graphs in a similar way, using the graph wavelets defined above and the following definitions. A path $p = (j_1, \ldots, j_m)$ is an ordering of the scales of wavelets $j_1, \ldots, j_m \le J$. The length of the path $p$ is $m$; the length of an empty path is zero. For a path $p$ as above and a scale $j \le J$, we define the path $p + j$ as $(j_1, \ldots, j_m, j)$. For a vector $\mathbf{f} \in \mathbb{C}^N$ we denote by $|\mathbf{f}|$ the vector of entrywise absolute values and note that the vectors $\mathbf{f}$ and $|\mathbf{f}|$ have the same norm.

For a scale $j \le J$, the one-step propagator $\mathbf{U}(j)$ is defined by

$\mathbf{U}(j)\mathbf{f} = |\boldsymbol{\psi}_j * \mathbf{f}|.$   (12)

For a path $p = (j_1, \ldots, j_m)$, the scattering propagator $\mathbf{U}[p]$ is defined by

$\mathbf{U}[p]\mathbf{f} = \mathbf{U}(j_m)\,\mathbf{U}(j_{m-1})\cdots\mathbf{U}(j_1)\,\mathbf{f}.$   (13)

For the empty path, we define $\mathbf{U}[\emptyset]\mathbf{f} = \mathbf{f}$. We note that for any path $p$ and any scale $j$,

$\mathbf{U}[p + j]\mathbf{f} = \mathbf{U}(j)\,\mathbf{U}[p]\mathbf{f}.$   (14)

The windowed scattering transform for a path $p$ is defined by

$\mathbf{S}[p]\mathbf{f} = \boldsymbol{\phi}_J * \mathbf{U}[p]\mathbf{f}.$   (15)

Let $\Lambda^m$ denote the set of all paths of length $m$, i.e., $\Lambda^m = \{p = (j_1, \ldots, j_m) : j_1, \ldots, j_m \le J\}$. The collection of all paths of finite length is denoted by $\mathcal{P} = \bigcup_{m \ge 0} \Lambda^m$. The scattering propagator and the scattering transform with respect to $\mathcal{P}$, which we denote by $\mathbf{U}[\mathcal{P}]$ and $\mathbf{S}[\mathcal{P}]$ respectively, are defined as

$\mathbf{U}[\mathcal{P}]\mathbf{f} = \{\mathbf{U}[p]\mathbf{f}\}_{p \in \mathcal{P}} \quad \text{and} \quad \mathbf{S}[\mathcal{P}]\mathbf{f} = \{\mathbf{S}[p]\mathbf{f}\}_{p \in \mathcal{P}}.$   (16)

When we emphasize the dependence of the scattering propagator and transform on the graph $G$, we denote them by $\mathbf{U}_G$ and $\mathbf{S}_G$ respectively. A naturally defined norm on $\mathbf{U}[\mathcal{P}]\mathbf{f}$ and $\mathbf{S}[\mathcal{P}]\mathbf{f}$ is

$\|\mathbf{U}[\mathcal{P}]\mathbf{f}\|^2 = \sum_{p \in \mathcal{P}} \|\mathbf{U}[p]\mathbf{f}\|^2 \quad \text{and} \quad \|\mathbf{S}[\mathcal{P}]\mathbf{f}\|^2 = \sum_{p \in \mathcal{P}} \|\mathbf{S}[p]\mathbf{f}\|^2,$   (17)

where $\|\cdot\|$ denotes the 2-norm on $\mathbb{C}^N$.

In the terminology of deep learning, the scattering transform acts as a convolutional neural network on graphs. At the $m$-th layer, where $m \ge 0$, the propagated signals are $\mathbf{U}[p]\mathbf{f}$, $p \in \Lambda^m$, and the extracted features are $\mathbf{S}[p]\mathbf{f}$, $p \in \Lambda^m$. This network is illustrated in Figure 1.

Figure 1: Network representation of the scattering transform.
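The whole network of Figure 1 amounts to interleaving the wavelet filters with the absolute value and reading off the low-pass outputs at every layer. The following sketch is a direct, unoptimized transcription of (12)–(16) that reuses the placeholder spectral profiles above; the function name and argument list are ours.

```python
import numpy as np

def scattering_features(f, U, lam, scales, J, depth, psi_hat, phi_hat):
    """Windowed graph scattering transform, cf. (12)-(16): propagate |psi_j * .|
    along every path of length <= depth and output phi_J * U[p]f for each path p."""
    to_spec = lambda x: U.conj().T @ x
    to_graph = lambda xhat: U @ xhat
    low_pass = lambda x: to_graph(phi_hat(2.0 ** (-J) * lam) * to_spec(x))
    band_pass = lambda x, j: to_graph(psi_hat(2.0 ** (-j) * lam) * to_spec(x))

    features = {(): low_pass(f)}   # empty path: S[()]f = phi_J * f, cf. (15)
    layer = {(): f}                # U[()]f = f
    for _ in range(depth):
        next_layer = {}
        for path, u in layer.items():
            for j in scales:
                u_next = np.abs(band_pass(u, j))          # one-step propagator, cf. (12)
                next_layer[path + (j,)] = u_next          # scattering propagator, cf. (13)
                features[path + (j,)] = low_pass(u_next)  # windowed output, cf. (15)
        layer = next_layer
    return features
```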

In a similar way, we can define the scattering transform for matrices of signals on graphs. Let $\mathbf{F} = [\mathbf{f}_1, \ldots, \mathbf{f}_d] \in \mathbb{C}^{N \times d}$, where for each $1 \le i \le d$, $\mathbf{f}_i$ is a complex signal of length $N$ on the same underlying graph. We define

$\mathbf{S}[p]\mathbf{F} = [\mathbf{S}[p]\mathbf{f}_1, \ldots, \mathbf{S}[p]\mathbf{f}_d]$   (18)

and

$\|\mathbf{S}[\mathcal{P}]\mathbf{F}\|^2 = \sum_{p \in \mathcal{P}} \|\mathbf{S}[p]\mathbf{F}\|_F^2.$   (19)

Note that $\|\mathbf{S}[p]\mathbf{F}\|_F$ is the Frobenius norm of the matrix $\mathbf{S}[p]\mathbf{F}$. Here and throughout the rest of the paper we denote by $\|\mathbf{A}\|_F$ the Frobenius norm of a matrix $\mathbf{A}$.

3 Energy preservation

We discuss the preservation of the energy of a given signal by the scattering transform. The signal is either $\mathbf{f} \in \mathbb{C}^N$ with the energy $\|\mathbf{f}\|^2$ or $\mathbf{F} \in \mathbb{C}^{N \times d}$ with the energy $\|\mathbf{F}\|_F^2$. We first formulate our main result.

Theorem 3.1.

The scattering transform is norm preserving. That is, for $\mathbf{f} \in \mathbb{C}^N$ or $\mathbf{F} \in \mathbb{C}^{N \times d}$,

$\|\mathbf{S}[\mathcal{P}]\mathbf{f}\| = \|\mathbf{f}\| \quad \text{and} \quad \|\mathbf{S}[\mathcal{P}]\mathbf{F}\| = \|\mathbf{F}\|_F.$   (20)

The analog of Theorem 3.1 in the Euclidean case appears in [13, Theorem 2.6]. However, the proof is different for the graph case. One basic observation analogous to the Euclidean case is the following.

Proposition 3.2.

For $\mathbf{f} \in \mathbb{C}^N$ and $m \ge 0$,

$\|\mathbf{U}[\Lambda^m]\mathbf{f}\|^2 = \|\mathbf{U}[\Lambda^{m+1}]\mathbf{f}\|^2 + \|\mathbf{S}[\Lambda^m]\mathbf{f}\|^2.$   (21)

This proposition can be rephrased as follows: the propagated energy at the $m$-th layer splits into the propagated energy at the next layer and the output energy at the current layer. In order to conclude Theorem 3.1 from Proposition 3.2, we quantify the decay rate of the propagated energy, which may be of independent interest. A fast decay rate means that few layers are sufficient to extract most of the energy of the signal. We define the decay rate of the scattering transform at a given layer as follows.

Definition 1.

For $\mathbf{f} \in \mathbb{C}^N$, $r > 0$ and $m \ge 1$, the energy decay rate of $\mathbf{S}$ at the $m$-th layer is $r$ if

$\|\mathbf{U}[\Lambda^{m+1}]\mathbf{f}\|^2 \le 2^{-r}\, \|\mathbf{U}[\Lambda^{m}]\mathbf{f}\|^2.$   (22)

In practice, different choices of graph and scale lead to different energy decay rates. Nevertheless, we establish the following generic result that applies to all graph scattering transforms under the construction in Section 2.2.

Proposition 3.3.

The scattering transform has an energy decay rate of at least $r = 1$ at all layers but the first one. This is the sharpest generic decay rate, though a better one can be obtained with additional assumptions on the graph and on the wavelet and scaling functions.

Note that in the Euclidean domain, no such generic result exists. Therefore, one has to choose the wavelets very carefully (see the admissibility condition in [13, Theorem 2.6]). Numerical results illustrating the energy decay in the Euclidean domain are given in [19]. Furthermore, theoretical rates are provided in [20] and [21], where [20] introduces additional assumptions on the smoothness of input signals and the bandwidth of filters and [21] studies time-frequency frames instead of wavelets.

In practice, the propagated energy seems to decrease much faster than the generic rate stated in Proposition 3.3. Figure 2 illustrates this claim. It considers 100 randomly selected images from the MNIST database [22]. A graph that represents the grid of pixels shared by these images is used. Details of the graph and the dataset are described in Section 6.1. The figure reports the box plots of the cumulative percentage of the output energy of the scattering transform for the first four layers and the 100 input images. That is, at layer $m$ the cumulative percentage for an image $\mathbf{f}$ is $100 \cdot \sum_{m'=0}^{m} \|\mathbf{S}[\Lambda^{m'}]\mathbf{f}\|^2 / \|\mathbf{f}\|^2$. We see that by the third layer, the scattering transform has already extracted almost all of the energy of the signal. Therefore, in practice we can estimate the graph scattering transform with a small number of layers, which is also evident in practice for the Euclidean scattering transform [19].

Figure 2: Demonstration of fast energy decay rate for the graph scattering transform on MNIST. One hundred random images are drawn from the MNIST database, and the scattering transform is applied with the graph described in Section 6.1. The box plots summarize the distribution of the cumulative energy percentages for the random images.
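For a rough numerical counterpart of Figure 2 on the toy graph above, one can sum the feature energies layer by layer, as in the sketch below. Since the placeholder filters do not satisfy the multiresolution condition (8) exactly, the reported percentages are only indicative of the qualitative decay, not of the exact statement of Theorem 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(W.shape[0])
feats = scattering_features(f, U, lam, scales=range(-2, 3), J=3,
                            depth=3, psi_hat=psi_hat, phi_hat=phi_hat)
total = np.sum(np.abs(f) ** 2)
for m in range(4):
    # Energy captured by all windowed outputs up to layer m.
    captured = sum(np.sum(np.abs(v) ** 2) for p, v in feats.items() if len(p) <= m)
    print(f"layers 0..{m}: {100 * captured / total:.1f}% of the input energy")
```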

3.1 Proof of Proposition 3.2

Application of (9) and later (8) implies that for any

(23)

Replacing with , summing over all paths with length and applying (14) yields

(24)

3.2 Proof of Proposition 3.3

Recall that and where with . Note that (8) implies that . Note further that for any , , the entries of are non-negative due to the absolute value in (12), and thus . Consequently,

(25)

Furthermore, we claim that

(26)

Indeed, in view of (11)–(13) and the form of , , where satisfies . One can easily show that the minimal value of , over all satisfying and , equals 2 and this concludes (26).

Combining (25) and (26) and summing the resulting inequality over yields

(27)

The combination of (21) and (27) concludes the proof as follows

(28)

An improvement of this decay rate is possible if and only if one may strengthen the single inequality in (25) and the inequality in (26). We show that these inequalities can be equalities for special cases and thus the stated generic decay rate is sharp. We first note that equality occurs in the inequality of (25) if, for example, is the indicator function of . Equality occurs in the second inequality when has exactly two non-zero elements, for example, when . These two cases can be simultaneously satisfied. We comment that certain choices of , , and imply different inequalities with stronger decay rates.

3.3 Proof of Theorem 3.1

We write (21) as

(29)

and sum over , while recalling that , to obtain that

(30)

Combining Proposition 3.3 and (23) yields

(31)

The first equality in (20) clearly follows from (30) and (31). The second equality in (20) is an immediate consequence of the first equality and the observation that for , .

4 Permutation covariance and invariance

When applying a transformation to a graph signal, it is natural to expect that relabeling the graph vertices and the corresponding signal’s indices before applying the transformation has the same effect as relabeling the corresponding indices after applying the transformation. More precisely, let $\pi \in S_N$ be a permutation, where $S_N$ denotes the symmetric group on $N$ letters, let $\boldsymbol{\Pi}$ be the corresponding $N \times N$ permutation matrix, and let $\pi(G)$ denote the graph obtained by relabeling the vertices of $G$ according to $\pi$. For a transformation $\mathbf{T}_G$ that depends on the graph, it is then natural to ask whether

$\mathbf{T}_{\pi(G)}(\boldsymbol{\Pi}\mathbf{f}) = \boldsymbol{\Pi}\,\mathbf{T}_G(\mathbf{f}).$   (32)

In deep learning, the property expressed in (32) is referred to as covariance to permutations. On the other hand, invariance to permutations means that

$\mathbf{T}_{\pi(G)}(\boldsymbol{\Pi}\mathbf{f}) = \mathbf{T}_G(\mathbf{f}).$   (33)

Ideally, a graph-based classifier should not be sensitive to “graph-consistent relabeling” of the signal coordinates. The analog of this ideal requirement in the Euclidean setting is that a classifier of signals defined on $\mathbb{R}^d$ should not be sensitive to their rigid transformations. In the case of classifying graph signals by first applying a feature-extracting transformation and then a standard classifier, this ideal requirement translates to permutation invariance of the initial transformation. However, permutation invariance is a very strong property that often contradicts the necessary permutation covariance. We show here that the scattering transform is permutation covariant and, if the scaling function is sufficiently smooth and the coarsest scale $2^J$ approaches infinity, it becomes permutation invariant.

We first exemplify the basic notions of covariance and invariance in Section 4.1. Section 4.2 reviews the few existing results on permutation covariance and invariance of graph neural networks and then presents our results for the scattering network. Section 4.3 explains why some permutations are natural generalizations of rigid transformations and then discusses previous broad generalizations of the notion of “graph translation” and their possible covariance and invariance properties. Sections 4.4 and 4.5 prove the main results formulated in Section 4.2.

4.1 Basic examples of graph permutations, covariance and invariance

For demonstration, we focus on the graph depicted in Figure 3(a). In this graph, each drawn edge has weight one, and thus the double edge between the first two nodes has total weight 2. The weight matrix of the graph is

(34)

The signal is depicted on the graph with different colors corresponding to different values. The following permutation is applied to the graph in Figure 3(b):

(35)

Figure 3(c) applies the permutation both to the signal and the graph.

Figure 3: Illustration of a permutation for the particular example of a graph and signal discussed in this section. Panel (a) shows the original graph and signal, panel (b) applies the permutation to the graph, and panel (c) applies the permutation to both the signal and the graph.

An example of a transformation is the replacement of the signal values at the two vertices connected by the edge of weight 2. This transformation is independent of the labeling of the graph and is thus permutation covariant. This can be formally verified as follows, using for simplicity the permutation defined in (35). On the original graph, the transformation swaps the first two entries of a signal, while on the permuted graph it swaps the second and the fourth entries (the second claim is obvious from Figure 3(b)). One can readily check that relabeling the signal and then applying the transformation on the permuted graph gives the same result as applying the transformation on the original graph and then relabeling the output, which is exactly the covariance property (32).

Another example is the transformation $\mathbf{f} \mapsto \mathbf{W}\mathbf{f}$, where $\mathbf{W}$ is the weight matrix in (34). This transformation is also independent of the labeling of the graph and thus permutation covariant. This property can also be formally verified as follows:

$\boldsymbol{\Pi}(\mathbf{W}\mathbf{f}) = (\boldsymbol{\Pi}\mathbf{W}\boldsymbol{\Pi}^T)(\boldsymbol{\Pi}\mathbf{f}).$   (36)

Similarly, $\mathbf{f} \mapsto \mathbf{L}\mathbf{f}$, where $\mathbf{L}$ is the graph Laplacian, is permutation covariant.

The above three examples of permutation covariant transformations are not permutation invariant. An example of a permutation invariant transformation , but not permutation covariant, maps the signal to the signal . Clearly the output is not affected by permutation of the input signal and is thus permutation invariant. On the other hand, zeroing out three specified signal coordinates, instead of three vertices with unique graph properties (e.g., the vertices connected by at least two edges), violates permutation covariance.

The latter example demonstrates in a very simplistic way the value of invariance for classification. Indeed, assume that there are two types of signals, with low and high values, and that a classifier tries to distinguish between the two classes according to the first coordinate of the transformed signal by checking whether it is larger than a certain threshold. Then this procedure can distinguish the two types of signals without being confused by signal relabeling. Permutation covariance does not play any role in this simplistic setting, since the classifier only considers the first coordinate of the transformed signal and ignores the rest of them.
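The covariance of $\mathbf{f} \mapsto \mathbf{W}\mathbf{f}$, and the fact that it is not an invariance, can also be checked numerically. The sketch below does so for a random weighted graph and a random permutation; the matrix, signal, and permutation are arbitrary and only serve to illustrate (36).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
A = rng.random((N, N))
W_rand = np.triu(A, 1)
W_rand = W_rand + W_rand.T            # random simple weighted graph (symmetric, zero diagonal)
f_rand = rng.standard_normal(N)

perm = rng.permutation(N)
P = np.eye(N)[perm]                   # permutation matrix Pi
W_perm = P @ W_rand @ P.T             # weight matrix of the relabeled graph

print(np.allclose(P @ (W_rand @ f_rand), W_perm @ (P @ f_rand)))  # True: covariance, cf. (36)
print(np.allclose(W_perm @ (P @ f_rand), W_rand @ f_rand))        # False in general: not invariance
```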

4.2 Permutation covariance and invariance of graph neural networks

The recent works of Gilmer et al. [23] and Kondor et al. [24] discuss permutation covariance and invariance for composition schemes on graphs, where message passing is a special case. Composition schemes are covariant to permutations since they do not depend on any labeling of the graph vertices. Moreover, if the aggregation function of the composition scheme is invariant to permutations, so is the whole scheme [24, Proposition 2]. However, aggregation leads to loss of local information, which might weaken the performance of the scheme.

Methods based on graph operators, such as the graph adjacency, weight or Laplacian matrices, are not invariant to permutations (see the demonstration in Section 4.1). Nevertheless, the scattering transform is approximately permutation invariant when the wavelet scaling function is sufficiently smooth. Furthermore, when the coarsest scale $2^J$ approaches infinity, it becomes invariant to permutations. We first formulate its permutation covariance and then its approximate permutation invariance.

Proposition 4.1.

Let $G$ be a simple graph and $\mathbf{S}_G$ be the graph scattering transform with respect to $G$. For any permutation $\pi \in S_N$, with permutation matrix $\boldsymbol{\Pi}$, and any $\mathbf{f} \in \mathbb{C}^N$,

$\mathbf{S}_{\pi(G)}(\boldsymbol{\Pi}\mathbf{f}) = \boldsymbol{\Pi}\,\mathbf{S}_G(\mathbf{f}).$   (37)
Theorem 4.2.

Let $G$ be a simple graph and $\mathbf{S}_G$ be the graph scattering transform with respect to $G$. Assume that the Fourier transform of the scaling function of $\mathbf{S}_G$ satisfies a suitable decay condition, with a constant depending on the scaling function. For any $\pi \in S_N$ and $\mathbf{f} \in \mathbb{C}^N$,

(38)

In particular, the scattering transform becomes invariant to permutations as the coarsest scale $2^J$ approaches infinity. The result also holds if $\mathbf{f} \in \mathbb{C}^N$ is replaced with $\mathbf{F} \in \mathbb{C}^{N \times d}$ and the Euclidean norm is replaced with the Frobenius norm.
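The covariance stated in Proposition 4.1 can be verified numerically for the sketch of the scattering transform given in Section 2.2: applying it on the relabeled graph to the relabeled signal returns the relabeled features. The check below reuses `W`, `U`, `lam`, the placeholder filters, and `scattering_features` from the earlier sketches.

```python
import numpy as np

N = W.shape[0]
perm = np.random.default_rng(2).permutation(N)
P = np.eye(N)[perm]                                   # permutation matrix Pi
L_perm = P @ (np.diag(W.sum(axis=1)) - W) @ P.T       # Laplacian of the relabeled graph
lam_p, U_p = np.linalg.eigh(L_perm)

f = np.random.default_rng(3).standard_normal(N)
S_G = scattering_features(f, U, lam, range(-2, 3), J=3, depth=2,
                          psi_hat=psi_hat, phi_hat=phi_hat)
S_piG = scattering_features(P @ f, U_p, lam_p, range(-2, 3), J=3, depth=2,
                            psi_hat=psi_hat, phi_hat=phi_hat)
# Covariance: features on the relabeled graph equal the relabeled features, path by path.
print(all(np.allclose(S_piG[p], P @ S_G[p]) for p in S_G))  # True
```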

4.3 Generalized graph translations

Permutation invariance on graphs is an important notion, which is motivated by concrete applications [25, 26]. It can be seen as an analog of translation invariance in Euclidean domains, which is also essential for applications [13, 12]. A different line of research asks for the most natural notion of translation on a graph [3, 27]. We show here that very special permutations of signals on graphs naturally generalize the notion of translation or rigid transformation of a signal in a Euclidean domain. More precisely, there is a planar representation of the graph on which the permutation acts like a rigid transformation. However, in general, there are many permutations that act very differently from translations or rigid transformations in a Euclidean domain, though they still preserve the graph topology. Indeed, the underlying geometry of general graphs is richer than that of the Euclidean domain. We later discuss previously suggested generalized notions of “translations” on graphs and the possible covariance and invariance of a modified graph scattering transform with respect to them.

We first present two examples of permutations of graphs that can be viewed as Euclidean translations or rigid transformations. We later provide examples of permutations of the same graphs that are different from rigid transformations. The first example, demonstrated in Figure 4(a), shows a periodic lattice graph and a signal with two values denoted by white and blue. Note that the periodic graph can be embedded in a torus, whereas the figure only shows the projection of its 25 vertices onto a 5 × 5 grid in the plane. The edges are not depicted in the figure, but they connect points to their four nearest neighbors on the torus. That is, including “periodic padding” for the 5 × 5 grid of vertices, each vertex in the plane is connected with its four nearest neighbors. For example, vertex 21 is connected with vertices 1, 16, 22 and 25. The graph signal obtains a non-zero constant value on the four vertices colored in blue (3, 4, 7 and 8) and is zero on the rest of them. Figure 4(b) demonstrates an application of a permutation to both the graph and the signal. At last, Figure 4(c) depicts the permuted graph and signal of Figure 4(b) when the indices are rearranged so that the representation of the lattice in the plane is the same as that in Figure 4(a) (this is necessary as the lattice lives in the torus and may have more than one representation in the plane). The relation between the consistent representations in Figure 4(a) and Figure 4(c) is obviously a translation. That is, the graph and signal permutation in this example corresponds to a translation. We remark that the fact that the rearranged graph in Figure 4(c) coincides with the original graph of Figure 4(a) is incidental for this particular example and does not occur in the next example.

Figure 4: Demonstration of graph permutation as Euclidean translation. Panel (a) shows a signal lying on a lattice in the torus embedded onto a planar grid. Panel (b) demonstrates a permutation of the graph and signal. Panel (c) shows a planar representation of the permuted graph and signal, with the indices rearranged as in (a), that is consistent with the representation in (a). The permutation clearly corresponds to a translation in Euclidean space.

Figure 5 depicts a different example, where a permutation of a graph signal can be viewed as a variant of a Euclidean rigid transformation. The graph and the signal are shown in Figure 5(a), where the signal is supported on the vertices marked in blue (indexed by 1, 2 and 3). Figure 5(b) demonstrates an application of a permutation to the graph and signal. Figure 5(c) shows a different representation of the permuted graph and signal, which is consistent with the representation of Figure 5(a). The comparison between Figures 5(a) and 5(c) makes it clear that the graph and signal permutation corresponds to a Euclidean rigid transformation in the planar representation of the graph. At last, Figure 5(d) demonstrates that, unlike the example in Figure 4, the rearrangement is generally different from the original graph. Indeed, the subgraph associated with the blue values of the signal is not a triangle and thus the topology is different.

Figure 5: Another example where a graph signal permutation corresponds to a Euclidean rigid transformation. Panels (a)–(c) are created analogously to Figures 4(a)–4(c); in panel (c) the indices are embedded the same way as in (a). Panel (d) shows a representation that differs from the rearrangement depicted in panel (c).

We remark that many permutations on graphs do not act like translations or rigid transformations. We demonstrate this claim using the graphs of the previous two examples. In Figure 6(a), we consider the same graph as in Figure 4(a), but with a different permutation. The difference between the permutations can be noticed by comparing the second columns of the grids in Figures 4(b) and 6(b). We note that the rearrangement of the vertices in Figure 6(c) does not yield an analog of a Euclidean translation. The reason is that the rearranged vertices do not form a grid. To demonstrate this claim, note that in Figure 6(a), label 17 is connected to 22, but in Figure 6(b), and consequently in the rearranged representation in Figure 6(c), they are disconnected.

Figure 6: A different permutation of the graph in Figure 4(a), which is not similar to a rigid motion but still preserves the graph topology. In panel (c) the indices are rearranged as in panel (a). Note that panel (c) does not maintain the planar geometry of the graph: for instance, the vertices 19 and 24 are not connected by an edge.

Figure 7(a) demonstrates a permutation that does not act like a rigid transformation with respect to the graph of Figure 5. Clearly, the rearranged graph and signal in Figure 7(c) have a different planar geometry. We remark that while the permutations demonstrated in Figures 6(a) and 7(a) do not preserve the planar geometry, they still preserve the topology of the graphs. Indeed, the notion of permutation invariance is richer than invariance to rigid transformations in the Euclidean domain.

Figure 7: A different permutation of the graph in Figure 5, which is not similar to a rigid motion but still preserves the graph topology. In panel (c) the indices are rearranged as in panel (a).

In the signal processing community, several candidates have been proposed for translating signals on graphs. Shuman et al. [3] defined a “graph translation” (or, in retrospect, a graph localization procedure) as follows:

(39)

They established useful localization properties of this operator, which justify a corresponding construction of a windowed graph Fourier transform. They also demonstrated the applicability of this tool to the Minnesota road network [3, Figure 7]. We remark that the operator in their definition may not be well-defined. To make it well-defined, one needs to assume fixed choices of the phases of the eigenvectors $\mathbf{u}_k$, and that the algebraic multiplicities of all eigenvalues equal one.

Sandryhaila and Moura [27] define a “shift” of a graph signal $\mathbf{f}$ by $\mathbf{W}\mathbf{f}$, where $\mathbf{W}$ is the weight matrix of the graph. This definition is motivated by the example of a directed cyclic graph, where an application of the weight matrix is equivalent to a shift by one vertex. Note that in this special case, the graph signal permutation advocated in this section also results in a vertex shift. We remark that it is unclear to us why this notion of shift is useful for general graphs.
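The motivating special case can be reproduced in a few lines: on a directed cycle, multiplying by the weight matrix moves every signal value to the next vertex. The construction below is only an illustration of that special case, with an arbitrary choice of signal.

```python
import numpy as np

N = 5
W_cycle = np.roll(np.eye(N), 1, axis=0)   # directed cycle: unit-weight edge i -> i+1 (mod N)
f = np.array([1., 2., 3., 4., 5.])
print(W_cycle @ f)                        # [5. 1. 2. 3. 4.]: the signal shifted by one vertex
```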

If one needs covariance and approximate invariance of a graph scattering transform with respect to the graph localization procedure defined in [3], then one may modify the nonlinearity of the scattering transform and redefine the one-step propagator accordingly. Note that

(40)

and

(41)

Therefore, the modified nonlinearity and the modified scattering transform are covariant to the localization operator. Similarly, by following the proof of Theorem 4.2, one can show that the modified scattering transform is approximately invariant to this operator as long as its energy decay is sufficiently fast.

The scattering transform cannot be adjusted to be covariant and approximately invariant to the “shift” defined by the weight matrix. The reason is that, unlike operators built from $\mathbf{L}$, the weight matrix does not in general commute with the filters defined via the eigenvectors $\mathbf{u}_k$.

4.4 Proof of Proposition 4.1

We need to show that for each path $p = (j_1, \ldots, j_m)$, where $j_1, \ldots, j_m \le J$,

(42)

Note that the Laplacian of $\pi(G)$ is $\boldsymbol{\Pi}\mathbf{L}\boldsymbol{\Pi}^T$, which has the same eigenvalues as $\mathbf{L}$ and has eigenvectors $\boldsymbol{\Pi}\mathbf{u}_k$, $k = 0, \ldots, N-1$. Equation (9) implies that for $j \le J$,

(43)

Therefore, for

(44)

Consequently, applying the absolute value pointwise,

(45)

Similarly,

(46)

Application of (45) and (46) results in the identity

(47)

In view of (13) – (15), (42) is equivalent to (47), and the proof is thus concluded.

4.5 Proof of Theorem 4.2

According to (15) and (37),

(48)

We bound the right-hand side of (48) by a function that approaches zero as $J$ approaches infinity. We first bound the first term. We apply (15) as well as the following facts: , and (since the graph is connected) to obtain that for

(49)

Hence

(50)

It remains to bound the second term. The application of (17), Proposition 3.3 and (23) results in