An Anatomy of Graph Neural Networks Going Deep via the Lens of Mutual Information: Exponential Decay vs. Full Preservation

10/10/2019 ∙ by Nezihe Merve Gürel, et al. ∙ ETH Zurich ∙ Microsoft

Graph Convolutional Network (GCN) has attracted intense interest recently. One major limitation of GCN is that it often cannot benefit from a deep architecture, while traditional CNN and an alternative Graph Neural Network architecture, namely GraphCNN, often achieve better quality with deeper architectures. How can we explain this phenomenon? In this paper, we take the first step towards answering this question. We first conduct a systematic empirical study of the accuracy of GCN, GraphCNN, and ResNet-18 on 2D images and identify the relative importance of different factors in architectural design. This inspires a novel theoretical analysis of the mutual information between the input and the output after l GCN or GraphCNN layers. We identify regimes in which GCN suffers exponentially fast information loss and show that GraphCNN requires a much weaker condition to fully preserve the information in the input.


1 Introduction

Extending convolutional neural networks (CNN) over images to graphs has attracted intense interest recently. One early attempt is the GCN model proposed by Kipf and Welling (2016a). However, when applying GCN to many practical applications, one discrepancy lingers: although a traditional CNN usually achieves higher accuracy as it goes deeper, GCN, as a natural extension of CNN, does not seem to benefit much from going deeper by stacking multiple layers together.

This phenomenon has been the focus of multiple recent papers (Li et al., 2018, 2019; Oono and Suzuki, 2019). On the theoretical side, Li et al. (2018) and Oono and Suzuki (2019) identified the problem as oversmoothing — under certain conditions, when multiple GCN layers are stacked together, the output will converge to a region that is independent of weights and inputs. On the empirical side, Li et al. (2019) showed that many techniques that were designed to train a deep CNN, e.g., the skip connections in ResNet (He et al., 2016a), can also make it easier for GCN to go deeper. However, two questions remain: Does there exist a set of techniques that make GCN at least as powerful (in terms of accuracy) as state-of-the-art CNN? If so, can we prove that the oversmoothing problem of GCN will cease to exist after these techniques are implemented?

Answering these questions requires us to put GCN and state-of-the-art CNNs on the same ground, both empirically and theoretically. In this paper, we conduct a systematic empirical study and a novel theoretical analysis as the first step to answering these questions.

Pillar 1. Empirical Study

Our work builds upon that of Such et al. (2017), who proposed the GraphCNN model, which, in principle, is more expressive than CNN. Intuitively, an (unnormalized) GCN layer has the form $\sigma(\mathbf{A}\mathbf{X}\mathbf{W})$, where $\mathbf{A}$ is the adjacency matrix of the graph, $\mathbf{X}$ is the input, and $\mathbf{W}$ is the learned weight matrix; in GraphCNN, the adjacency matrix is decomposed into multiple additive matrices, $\mathbf{A} = \sum_{k=1}^{K}\mathbf{A}_k$, and a GraphCNN layer becomes $\sigma\big(\sum_{k=1}^{K}\mathbf{A}_k\mathbf{X}\mathbf{W}_k\big)$. As illustrated in Figure 1, under one decomposition strategy, a GraphCNN layer recovers a CNN layer. Although it is not surprising that GraphCNN can match the accuracy of CNN under this decomposition strategy, we ask:

How fundamental is this decomposition step? Can GCN match the accuracy of ResNet empirically if we integrate all standard techniques and tricks, such as stride, skip connection, and average pooling?

Figure 1: Illustration of one layer in GCN and one layer under one decomposition strategy in GraphCNN. $\mathbf{A}$ is the adjacency matrix, $\mathbf{X}$ is the input, and $\mathbf{W}$ and $\mathbf{W}_k$ ($k = 1, \ldots, K$) are learnable weights. In GraphCNN, $\mathbf{A} = \sum_{k=1}^{K}\mathbf{A}_k$ and a separate weight matrix $\mathbf{W}_k$ is learned for each component $\mathbf{A}_k$. In our experiments and analysis, we follow the original paper and normalize $\mathbf{A}$ in GCN.

In this paper, we first present a systematic empirical study. We take CIFAR-10 as our dataset and convert the images into two equivalent representations: an image representation and a graph representation. We then compare GCN, GraphCNN, and ResNet at the same depths. For each model, we study the impact of (1) stride, (2) skip connections, and (3) pooling. As we will see, the decomposition step in GraphCNN is fundamental: although stride, skip connections, and pooling significantly improve the accuracy of GCN, which is consistent with Li et al. (2019), by themselves they cannot help GCN reach the state-of-the-art accuracy of ResNet. Meanwhile, GraphCNN, using all three techniques, matches the accuracy of ResNet.

Pillar 2. Theoretical Analysis

Motivated by this empirical result, we then focus on understanding the theoretical property of the decomposition step in GraphCNN. Specifically, we ask: Can we precisely analyze the benefits introduced by graph decomposition in GraphCNN, compared with GCN?

This question poses three challenges that existing analyses of GCN oversmoothing (Li et al., 2018; Oono and Suzuki, 2019) cannot handle. (1) While both frameworks reason about how closely the GCN output after $l$ layers approaches a region that is independent of weights and inputs, to answer our question we need to reason not only geometrically but also about a more direct notion of utility: being close to a bad region geometrically is definitely bad, but being far away from it does not necessarily mean the model is better (see Section 4). (2) Both theoretical frameworks provide an upper bound on this distance, but the upper bound itself is not enough to answer our question. (3) Neither of the existing analyses of GCN considers the impact of graph decomposition in GraphCNN.

In this paper, we conduct a theoretical analysis that directly reasons about the mutual information between the input and the output after $l$ layers. We show that:

  1. Under certain conditions, the MI (mutual information) between the input and the output after $l$ GCN layers with (parametric) ReLUs converges to 0 exponentially fast.

  2. Under certain conditions, the output after $l$ GCN layers with (parametric) ReLUs perfectly preserves all information in the input.

  3. Under certain conditions, the output after $l$ GraphCNN layers with (parametric) ReLUs perfectly preserves all information in the input. More importantly, compared with GCN, GraphCNN requires a much weaker condition for information to be perfectly preserved, largely because of the decomposition structure introduced in GraphCNN.

Putting these results together, we provide a precise theoretical description of the power of graph decomposition introduced in GraphCNN. To the best of our knowledge, this is one of the first results of its form.

Moving Forward.

Our analysis brings up a natural question: “How can we choose the decomposition strategy in GraphCNN? Moreover, can we learn it automatically?” We believe that this offers an interesting direction for further work and hope that this paper can help to facilitate future endeavours in this direction.

2 Related Work

Applying deep neural networks to graphs has attracted intense interest in recent years. Motivated by the success of the Convolutional Neural Network (CNN) (Krizhevsky et al., 2012), the Spectral Convolutional Neural Network (Spectral CNN) (Bruna et al., 2013) models the filters as learnable parameters based on the spectrum of the graph Laplacian. ChebNet (Defferrard et al., 2016) reduces the computational complexity by approximating the filters with Chebyshev polynomials of the diagonal matrix of eigenvalues. Graph Convolutional Network (GCN) (Kipf and Welling, 2016b) goes further, introducing a first-order approximation of ChebNet and making several simplifications. GCN and its variants have been widely applied in various graph-related applications, including semantic relationship recognition (Xu et al., 2017), graph-to-sequence learning (Beck et al., 2018), traffic forecasting (Li et al., 2017), and molecule classification (Such et al., 2017).

Although GCN and its variants have achieved promising results on various graph applications, one limitation of GCN is that its performance does not improve as the network depth increases. For instance, Kipf and Welling (2016a) show that a two-layer GCN achieves peak performance on a classic graph dataset, and stacking more layers does not improve it. Rahimi et al. (2018) develop a highway GCN for user geolocation in social media graphs, in which they add "highway" gates between layers to facilitate gradient flow; even with these gates, the authors observe performance degradation after six layers. This phenomenon is counterintuitive and blocks GCN-style models from making further improvements. A number of works (Zhou et al., 2018; Wu et al., 2019b) try to identify the reasons and provide workarounds. Wu et al. (2019a) hypothesize that the nonlinearity between GCN layers is not critical, which essentially implies that a deep GCN model lacks sufficient expressive power because it is a linear model. In addition, Li et al. (2019) show that techniques such as the skip connections in ResNet can help train deeper GCNs; however, they do not provide an empirical study of whether this modification is enough for GCN to match the quality of state-of-the-art CNNs (e.g., ResNet) on images.

Most related to our work are two recent papers: Li et al. (2018) and Oono and Suzuki (2019). Li et al. (2018) show that GCN is a special form of Laplacian smoothing, and they prove that, under certain conditions, repeatedly applying Laplacian smoothing makes the features of vertices within each connected component of the graph converge to the same values. This oversmoothing property of GCN therefore makes the features indistinguishable and hurts classification accuracy. Oono and Suzuki (2019) conduct a more involved theoretical analysis. Compared with these papers, as discussed in the previous section, the theoretical framework we develop is different because we hope to answer a different question: we reason directly about mutual information, and we are more interested in understanding the decomposition structure in GraphCNN than the oversmoothing property of GCN. Our work also builds on Such et al. (2017), which proposes GraphCNN, a model built on multiple adjacency matrices. As shown by Such et al. (2017), this formulation is more expressive than CNN. In this work, we use the same framework but focus on providing a novel empirical study and theoretical analysis to understand the behavior of GCN and the power of graph decomposition in GraphCNN.

3 Preliminaries

Notation:

Hereafter, scalars will be written in italics, vectors in bold lower-case and matrices in bold upper-case letters. For an $m \times n$ real matrix $\mathbf{A}$, the matrix element in the $i$th row and $j$th column is denoted by $\mathbf{A}_{ij}$. We vectorize a matrix $\mathbf{A}$ by concatenating its columns, and denote it by $\text{vec}(\mathbf{A})$. For matrices $\mathbf{A}$ and $\mathbf{B}$, we denote the Kronecker product of $\mathbf{A}$ and $\mathbf{B}$ by $\mathbf{A} \otimes \mathbf{B}$. Finally, we denote the $i$th largest singular value of a matrix $\mathbf{A}$ by $\sigma_i(\mathbf{A})$.

GCN

Let $G = (V, E)$ be an undirected graph with a vertex set $V$ and a set of edges $E$. We refer to individual elements of $V$ as nodes, each associated with a feature vector. We collect the node features in the matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ whose rows are given by $\mathbf{x}_i^\top$. The adjacency matrix (weighted or binary) $\mathbf{A}$ is an $N \times N$ matrix with $\mathbf{A}_{ij} \neq 0$ if $(v_i, v_j) \in E$, and $\mathbf{A}_{ij} = 0$ elsewhere.

We define the following layer-wise operator, composed of (1) a linear function parameterized by the adjacency matrix $\mathbf{A}$ and a weight matrix $\mathbf{W}^{(l)}$ at layer $l$, and (2) an activation function $\sigma_\alpha$, a parametric ReLU with $\sigma_\alpha(z) = z$ for $z \geq 0$ and $\sigma_\alpha(z) = \alpha z$ otherwise ($\alpha \in (0, 1]$), applied element-wise to the linear transformation of the previous layer. Given the input matrix $\mathbf{X}$, let $\mathbf{X}^{(0)} = \mathbf{X}$. Each layer of the network maps its input to an output of the same shape:

$\mathbf{X}^{(l+1)} = \sigma_\alpha\big(\mathbf{A}\,\mathbf{X}^{(l)}\,\mathbf{W}^{(l)}\big).$ (1)

GraphCNN

Let $\mathbf{A}$ now be decomposed into $K$ additive matrices such that $\mathbf{A} = \sum_{k=1}^{K} \mathbf{A}_k$. The layer-wise propagation rule becomes:

$\mathbf{X}^{(l+1)} = \sigma_\alpha\Big(\sum_{k=1}^{K} \mathbf{A}_k\,\mathbf{X}^{(l)}\,\mathbf{W}_k^{(l)}\Big).$ (2)

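To make the two propagation rules concrete, here is a minimal NumPy sketch of one GCN layer, rule (1), and one GraphCNN layer, rule (2); the function names, the parametric-ReLU slope, and the toy decomposition are our own illustrative choices, not the paper's code.

```python
import numpy as np

def gcn_layer(A, X, W, alpha=0.2):
    """Rule (1): sigma_alpha(A X W) with a parametric ReLU."""
    Z = A @ X @ W
    return np.where(Z >= 0, Z, alpha * Z)

def graphcnn_layer(A_list, X, W_list, alpha=0.2):
    """Rule (2): sigma_alpha(sum_k A_k X W_k) with a parametric ReLU."""
    Z = sum(A_k @ X @ W_k for A_k, W_k in zip(A_list, W_list))
    return np.where(Z >= 0, Z, alpha * Z)

# Toy graph with N = 5 nodes and d = 4 features per node.
rng = np.random.default_rng(0)
A = rng.random((5, 5)); A = (A + A.T) / 2               # symmetric adjacency
A_list = [np.tril(A), np.triu(A, k=1)]                  # one arbitrary additive decomposition of A
X = rng.standard_normal((5, 4))
W = rng.standard_normal((4, 4))
W_list = [rng.standard_normal((4, 4)) for _ in A_list]  # one weight matrix per component

print(gcn_layer(A, X, W).shape)                         # (5, 4)
print(graphcnn_layer(A_list, X, W_list).shape)          # (5, 4)
```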
4 Empirical Study

State-of-the-art CNNs go well beyond having multiple convolutional layers: to match the accuracy of such a network, e.g., ResNet, the computer vision community has developed many techniques and tricks, such as stride, skip connection, and pooling. To understand the behavior of deep graph neural networks, it is necessary not only to study the difference between GCN, CNN and GraphCNN layers, but also to take these optimization techniques/tricks into consideration.

In this section, we conduct a systematic empirical study to understand the impact of different types of layers and different techniques/tricks. As we will see, (1) the techniques/tricks designed for CNN can also improve the accuracy of GCN significantly, which is consistent with previous work (Li et al., 2019); however, (2) the graph decomposition step introduced in GraphCNN is a fundamental step whose impact cannot be offset even if we apply all techniques and tricks. This motivates the theoretical study in the next section, which describes the impact of graph decomposition formally.

Figure 2: Testing Accuracy on CIFAR-10.
Figure 3: Training Accuracy on CIFAR-10.

4.1 Experimental Setup

We take the CIFAR-10 dataset and construct an equivalent graph representation of the images. We treat each pixel as one node in the graph and take the pixels in the surrounding 8 directions, plus the pixel itself, as its neighbors, in order to mimic the behavior of a convolution. In this way, a CIFAR-10 image can be viewed as a special type of graph. CIFAR-10 consists of 60,000 images of 32×32 pixels with RGB channels: in the graph representation, each image corresponds to a graph with 1024 (= 32×32) vertices, each of which connects to its 8 neighbors plus a self-connection.

We already know multiple results from previous work — (1) a deep CNN achieves state-of-the-art quality for image classification, whereas a deep GCN cannot benefit from deep architectures; (2) a deep GraphCNN, with all optimization tricks and the right graph decomposition strategy, also matches the accuracy of state-of-the-art CNNs, simply because it is equivalent in expressiveness to a CNN. The goal of our study is to understand the relative impact of graph decomposition and useful tricks designed for CNNs.

Model Architectures.

We compare three model architectures: CNN, GCN, and GraphCNN:

1. CNN (Krizhevsky et al., 2012) The architecture is a stack of convolution layers. The first layer has 3 input channels (the RGB channels) and 128 output channels; all succeeding convolution layers have 128 input and 128 output channels.

2. GCN (Kipf and Welling, 2016b) We treat all edges in the graph equally and leverage a similar network architecture as CNN. The only difference is that we replace each convolution layer with a GCN layer.

3. GraphCNN (Such et al., 2017) We replace each convolution layer with a GraphCNN layer, decomposed as illustrated in Figure 1. Specifically, we decompose the adjacency matrix into 9 submatrices $\mathbf{A}_1, \ldots, \mathbf{A}_9$, one for each relative offset between a pixel and the nine pixels of its 3×3 neighborhood (including itself): an edge between pixels $p_1 = (i_1, j_1)$ and $p_2 = (i_2, j_2)$ is placed in the submatrix whose offset $(i_1 - i_2, j_1 - j_2) \in \{-1, 0, 1\} \times \{-1, 0, 1\}$ it matches, and the corresponding edges are set to zero in all other submatrices (see the sketch below).
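As an illustration of the decomposition described in item 3, the sketch below builds the nine submatrices for an n × n pixel grid, one per relative offset in {-1, 0, 1} × {-1, 0, 1}; the helper name and the dense-matrix representation are our own simplifications of the setup described here.

```python
import numpy as np

def grid_decomposition(n):
    """Nine adjacency submatrices A_k for an n x n pixel grid,
    one per relative offset (dr, dc) in {-1, 0, 1} x {-1, 0, 1}."""
    N = n * n
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    A_list = []
    for dr, dc in offsets:
        A_k = np.zeros((N, N))
        for r in range(n):
            for c in range(n):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    A_k[r * n + c, rr * n + cc] = 1.0
        A_list.append(A_k)
    return A_list

A_list = grid_decomposition(4)       # small grid for illustration; CIFAR-10 would use n = 32
A = sum(A_list)                      # full adjacency: 8 neighbours plus a self-connection
assert A_list[4].trace() == 16       # offset (0, 0) is the self-connection submatrix
print(int(A[5].sum()))               # an interior pixel has 9 incident edges
```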

For each of these three architectures, we experiment with four settings of different tricks designed for CNNs:

1. Original. We apply convolution or graph convolution operations in each layer, without strides or skip connections. At the last layer, we reshape the 2D feature map into a 1D embedding vector and add a fully connected layer with Softmax activation on top to generate the classification results. All hidden sizes are 128.

2. Stride. The strides are aligned with ResNet-18. Specifically, we apply a stride of 2 at the 9th, 13th, and 17th layers. After a stride is applied, both the height and width of the original image are halved. We follow the common strategy of doubling the hidden size whenever a stride is applied (original hidden size = 128). To imitate the stride behavior for GCN and GraphCNN, we perform the graph convolution first, then choose the nodes that would be retained by the stride and construct a new grid graph corresponding to the smaller image (see the sketch after this list).

3. Stride+Skip. We add skip connections between the corresponding layers, following the standard architecture of ResNet-18. Other configurations are kept the same as in the Stride setting.

4. Stride+Skip+AP. With strides and skip connections, the architecture of the 17-layer CNN looks similar to ResNet-18; the only difference is that it does not apply average pooling before the final fully connected layer. To align with ResNet-18, we also compare the models with average pooling on top, in which case the 17-layer CNN matches the architecture of a standard ResNet-18 exactly.
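A minimal sketch of the graph counterpart of a stride mentioned in setting 2; this is our own simplification, which only selects the retained nodes and leaves rebuilding the smaller grid graph to the caller.

```python
import numpy as np

def graph_stride(X, n, stride=2):
    """Imitate a strided convolution on an n x n grid graph: after the graph
    convolution, keep only the nodes whose (row, col) indices are multiples
    of `stride`; the caller rebuilds the smaller grid graph on these nodes."""
    keep = [r * n + c
            for r in range(0, n, stride)
            for c in range(0, n, stride)]
    return X[keep], n // stride      # retained node features, new grid side length

# Example: 32 x 32 grid with 128-dimensional node features -> 16 x 16 grid.
X = np.random.randn(32 * 32, 128)
X_small, n_small = graph_stride(X, n=32, stride=2)
print(X_small.shape, n_small)        # (256, 128) 16
```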

We used standard data augmentation in all experiments, including random cropping and random flipping (Simonyan and Zisserman, 2014). All models were trained using SGD with momentum (Ruder, 2016), with the momentum ratio and weight decay factor held fixed across experiments. We chose the best learning rate via grid search and initialized the network parameters using Xavier initialization (Glorot and Bengio, 2010). As in a standard ResNet (He et al., 2016b), we did not use dropout. All experiments were conducted on a Tesla P100 GPU with 16 GB of memory.

4.2 Results and Discussions

We illustrate the results in Figure 2 and Figure 3. Figure 2 shows the test accuracy and Figure 3 the training accuracy. Our findings can be summarized as follows.

Not surprisingly, GraphCNN performs as well as CNN, even without pooling layers. At depth 1, CNN and GraphCNN already outperform GCN significantly. As the depth of the architecture increases, all of the models improve, but the performance of GCN always stays below that of GraphCNN. Moreover, without any of the tricks designed for CNN (Figure 2(a) and Figure 3(a)), once the depth exceeds 9 layers GCN stops improving, while the CNN and GraphCNN counterparts still benefit from deeper architectures. GCN can thus go deep to some extent, but its optimal depth is much smaller and its accuracy much lower than those of the state-of-the-art models.

We then add the standard tricks designed for CNNs one by one. We first set the stride to 2 for some layers, following ResNet (Figure 2(b) and Figure 3(b)). The performance of all models improves, especially GCN, whose best test accuracy rises from 57.1% to 60.2%. At the same time, the comparison between different models exhibits similar trends: the GCN result is still much worse than the CNN and GraphCNN counterparts.

Previous work (Li et al., 2019) claims that skip connections are a key ingredient for training a deeper GCN. We further add skip connections (Figure 2(c) and Figure 3(c)) and find that the residual connections do have a positive effect on training a deep GCN, improving the best test accuracy from 60.2% to 64.4%. However, it is still well behind the state-of-the-art results from CNN and GraphCNN.

To fully match the architecture of a state-of-the-art ResNet, we further add an average pooling layer at the end (Figure 2(d) and Figure 3(d)). We see that the average pooling layer provides an obvious improvement for GCN. Nevertheless, there is still a significant gap between GCN and GraphCNN/CNN: the GCN model suffers from severe overfitting, obtaining only 72.8% test accuracy with a 17-layer architecture even though its training accuracy reaches 94.0%.

As we see from these experiments, the graph decomposition introduced in GraphCNN is fundamental: even if we add all the tricks developed for a deep ResNet architecture, without decomposition GCN still shows an almost 20 percentage-point gap compared with CNN, whereas with the right decomposition strategy, GraphCNN matches the accuracy of CNN. This observation motivates our theoretical study, in which we try to precisely characterize the behavior of GraphCNN and GCN.

4.3 Measuring Information Loss Empirically

As we will see in the next section, our theoretical analysis reasons about mutual information between the output after GCN/GraphCNN layers and the original input. In this section, we provide some empirical observations that will inspire our analysis.

Our goal is to empirically measure the mutual information directly and compare GCN and GraphCNN. We adopt a methodology similar to Belghazi et al. (2018) and use the architecture illustrated in Figure 4(a) as a proxy of the MI after $l$ layers. Specifically, to measure the MI after $l$ layers, we take the first $l$ GCN/GraphCNN layers and add a fully connected layer that shrinks the hidden unit size. We then add a decoder, a single fully connected layer, that reconstructs the input from the hidden units, and we measure the reconstruction error (Janocha and Czarnecki, 2017). We train the network end-to-end and optimize the hyper-parameters by random search (Bergstra and Bengio, 2012). If the network is expressive enough to preserve information after $l$ layers, we should be able to train a decoder that recovers the original input.
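The measurement procedure can be summarized in the following PyTorch-style sketch; the class and variable names are ours, `graph_layers` stands in for the first l GCN or GraphCNN layers, and mean-squared error is used as a stand-in for the reconstruction loss.

```python
import torch
import torch.nn as nn

class MIProbe(nn.Module):
    """Proxy for the MI after l layers: shrink the l-layer output with one
    fully connected layer, then try to reconstruct the original input with
    a single fully connected decoder."""
    def __init__(self, graph_layers, out_dim, hidden_dim, in_dim):
        super().__init__()
        self.graph_layers = graph_layers                # first l GCN/GraphCNN layers
        self.encode = nn.Linear(out_dim, hidden_dim)    # shrinks the hidden unit size
        self.decode = nn.Linear(hidden_dim, in_dim)     # single-FC decoder back to the input

    def forward(self, x):
        h = self.graph_layers(x)
        h = self.encode(h.flatten(start_dim=1))
        return self.decode(h)

# Toy usage with a stand-in for the graph layers (1024-node graphs, 3 features each).
layers = nn.Sequential(nn.Flatten(), nn.Linear(1024 * 3, 1024 * 3), nn.ReLU())
probe = MIProbe(layers, out_dim=1024 * 3, hidden_dim=256, in_dim=1024 * 3)
x = torch.randn(8, 1024, 3)
loss = nn.functional.mse_loss(probe(x), x.flatten(start_dim=1))
loss.backward()   # train end-to-end; lower reconstruction error indicates more preserved information
```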

Figure 4(b, c) illustrates the reconstruction results after 3 layers. The overall reconstruction error of GraphCNN (0.781) is significantly lower than that of GCN (0.818). The GCN reconstructions clearly show the phenomenon of oversmoothing, a behavior that has already been identified in previous work (Li et al., 2018; Oono and Suzuki, 2019). Meanwhile, GraphCNN is able to preserve significantly more information (when the training objective is to maintain as much information as possible). This observation inspires our theoretical analysis: as we will show in the next section, it can be made mathematically precise under certain conditions.

Figure 4: (a) The neural network architecture used to illustrate the mutual information decay after three GCN layers or three GraphCNN layers. Intuitively, the decoder estimates the MI in a similar way as MINE. (b/c) Reconstructions of test images from the output after 3 GCN/GraphCNN layers. The first row shows the input images and the second row the output images of the decoder.

4.4 Decomposition Strategy Matters

In all previous experiments, we assumed that GraphCNN uses the graph decomposition strategy that matches a 2D convolutional layer. How does a different decomposition strategy affect accuracy? Although a full answer to this question is far beyond the scope of this paper, we report the result of a simple experiment that inspired the theoretical analysis in the next section. Specifically, we tested GraphCNN with three random decomposition strategies and observed that the accuracy of a 17-layer GraphCNN drops significantly: in the Stride+Skip+AP setting, from 93.2% to 83.8% (the average performance of the three randomly decomposed GraphCNNs). This indicates that the decomposition strategy does have a significant impact on final performance. Interestingly, however, the randomly decomposed GraphCNN still outperforms the vanilla GCN (72.8% accuracy).

Moving Forward.

The above results indicate two interesting future directions. First, the observation that even GraphCNN with a random graph decomposition could outperform GCN is very promising, and it would be interesting to understand the relationship between GraphCNN with a random graph decomposition and previous work on random features. Second, the fact that GraphCNN with a random graph decomposition achieves lower accuracy than GraphCNN with a CNN-like decomposition indicates the importance of integrating domain knowledge into graph neural networks. Another interesting future direction would be to design a system abstraction to allow users to specify graph decompositions or even use some automatic approaches similar to NAS (Zoph and Le, 2016) to automatically search for the optimal graph decomposition strategies for GraphCNN.

5 Theoretical Analysis

The dramatic difference between GCN and GraphCNN can look quite counter-intuitive at first glance. Why can a simple decomposition of the adjacency matrix have such a significant impact on both the accuracy and the preservation of mutual information? In this section, we provide a theoretical analysis of the mutual information between the $l$th layer of either network and the input. Specifically, we show that (1) under certain technical conditions, the information after $l$ GCN layers with (parametric) ReLUs asymptotically converges to 0 exponentially fast; (2) under a different set of conditions, the information after $l$ GCN layers with (parametric) ReLUs can also be perfectly preserved at the output; and (3) under certain technical conditions, the mutual information after $l$ GraphCNN layers with (parametric) ReLUs can be perfectly preserved at its output. More importantly, compared with GCN, after $l$ layers GraphCNN:

  • requires a weaker condition for information to be perfectly preserved,

  • requires a stronger condition for information to be fully lost.

Our theoretical analysis suggests that GraphCNN has a better data processing capability than GCN under the same characteristics of the layer-wise weight matrices, justifying the observation that GraphCNN overcomes the over-compression introduced by GCN as we stack more layers.

5.1 GCN

In this section, our goal is to investigate the regimes where GCN (1) does not benefit from going deeper, or (2) is guaranteed to preserve all information at its output. We aim to understand this by analyzing the behaviour of mutual information between input and output layer of the network at different depths. We relegate all the proofs to the Appendix.

We begin with the following definition.

Definition 1.

Throughout the paper, we denote the vectorized input and the $l$th layer output by $\mathbf{x}^{(0)}$ and $\mathbf{x}^{(l)}$, respectively. For $n$-dimensional real random vectors $\mathbf{x}$ and $\mathbf{y}$ defined over finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, we denote the entropy of $\mathbf{x}$ by $H(\mathbf{x})$ and the mutual information between $\mathbf{x}$ and $\mathbf{y}$ by $I(\mathbf{x}; \mathbf{y})$. Moreover, the information loss after $l$ layers is defined as $H(\mathbf{x}^{(0)}) - I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)})$, i.e., the entropy of $\mathbf{x}^{(0)}$ that remains unresolved given $\mathbf{x}^{(l)}$.
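For concreteness, a small NumPy example of these quantities for discrete random variables over finite alphabets; the joint distribution below is made up purely for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution of (x, y) over a 2 x 2 alphabet (illustrative numbers).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_x = entropy(p_x)
I_xy = entropy(p_x) + entropy(p_y) - entropy(p_xy.flatten())
info_loss = H_x - I_xy   # residual uncertainty about x after observing y

print(round(H_x, 3), round(I_xy, 3), round(info_loss, 3))   # 1.0 0.278 0.722
```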

First, we reason about the non-linear activation functions. The characteristics of the layer-wise propagation rule in (1) lead us to the following result:

Lemma 1.

For a GCN with parametric ReLU activations $\sigma_\alpha$ with $\alpha \in (0, 1]$, let $\mathbf{D}^{(l)}$ be a diagonal matrix whose nonzero entries are in $\{\alpha, 1\}$ such that $\mathbf{D}^{(l)}_{ii} = 1$ if $\big[(\mathbf{W}^{(l-1)\top} \otimes \mathbf{A})\,\mathbf{x}^{(l-1)}\big]_i \geq 0$, and $\mathbf{D}^{(l)}_{ii} = \alpha$ elsewhere. Then $\mathbf{x}^{(l)}$ can be written as

$\mathbf{x}^{(l)} = \mathbf{D}^{(l)}\big(\mathbf{W}^{(l-1)\top} \otimes \mathbf{A}\big)\cdots\mathbf{D}^{(1)}\big(\mathbf{W}^{(0)\top} \otimes \mathbf{A}\big)\,\mathbf{x}^{(0)}.$
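The identity behind Lemma 1 can be checked numerically: one propagation step sigma_alpha(A X W), once vectorized, equals D (W^T kron A) vec(X) for a suitable diagonal D with entries in {alpha, 1}. The check below is our own, with arbitrary random inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, alpha = 4, 3, 0.25
A = rng.standard_normal((N, N))
X = rng.standard_normal((N, d))
W = rng.standard_normal((d, d))

Z = A @ X @ W                                   # pre-activation of one layer
out = np.where(Z >= 0, Z, alpha * Z)            # parametric ReLU, applied element-wise

z = np.kron(W.T, A) @ X.flatten(order="F")      # vec(A X W) = (W^T kron A) vec(X)
D = np.diag(np.where(z >= 0, 1.0, alpha))       # diagonal entries in {alpha, 1}
vec_out = D @ z

assert np.allclose(vec_out, out.flatten(order="F"))   # matches the linearized form in Lemma 1
```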

Following our earlier discussion, we now state our first result, which characterizes the regime in which the information transferred across the network decays to zero exponentially fast.

Theorem 1.

Let the GCN follow the propagation rule introduced in (1). Suppose $\sigma_{\max}(\mathbf{A}) \leq s_{\max}$ and $\sigma_{\max}(\mathbf{W}^{(l)}) \leq w_{\max}$ for all $l$. If $s_{\max}\, w_{\max} < 1$, then $\lim_{l \to \infty} \mathbf{x}^{(l)} = \mathbf{0}$, and hence $\lim_{l \to \infty} I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = 0$.
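A small numerical illustration of the decay regime (the matrices, the scaling, and the depth are our own choices): when the product of the largest singular values of A and of the layer weights is below one, the norm of the layer output shrinks geometrically with depth.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, alpha, L = 8, 4, 0.25, 30

A = rng.standard_normal((N, N))
A /= 2 * np.linalg.norm(A, 2)                # scale so that sigma_max(A) = 0.5
W = [rng.standard_normal((d, d)) for _ in range(L)]
W = [w / np.linalg.norm(w, 2) for w in W]    # sigma_max(W^(l)) = 1, so the product is 0.5 < 1

X = rng.standard_normal((N, d))
norms = []
for l in range(L):
    X = A @ X @ W[l]
    X = np.where(X >= 0, X, alpha * X)       # parametric ReLU
    norms.append(np.linalg.norm(X))

# The norm decays at least geometrically, so in the limit the output retains
# vanishing information about the input.
print([round(v, 4) for v in norms[::5]])
```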

There are also regimes in which GCN will perfectly preserve the information, stated as follows:

Theorem 2.

Following Theorem 1, let now $\sigma_{\min}(\mathbf{A}) \geq s_{\min}$ and $\sigma_{\min}(\mathbf{W}^{(l)}) \geq w_{\min}$ for all $l$. If $\alpha\, s_{\min}\, w_{\min} \geq 1$, then $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$.

Effect of Normalized Laplacian:

The results obtained above hold for any adjacency matrix $\mathbf{A}$. The unnormalized $\mathbf{A}$, however, comes with a major drawback: it changes the scale of the feature vectors. To overcome this problem, $\mathbf{A}$ is often normalized so that its rows sum to one. We therefore adapt our results to a GCN with the normalized Laplacian, whose largest singular value is one. We have the following result:

Corollary 1.

Let $\mathbf{D}$ denote the degree matrix such that $\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}$, and let $\hat{\mathbf{A}} = \mathbf{D}^{-1}\mathbf{A}$ be the associated normalized Laplacian. Suppose the GCN uses the mapping $\mathbf{X}^{(l+1)} = \sigma_\alpha\big(\hat{\mathbf{A}}\,\mathbf{X}^{(l)}\,\mathbf{W}^{(l)}\big)$. Let also $\sigma_{\max}(\mathbf{W}^{(l)}) \leq w_{\max}$ for all $l$. If $w_{\max} < 1$, then $\lim_{l \to \infty} \mathbf{x}^{(l)} = \mathbf{0}$, and hence $\lim_{l \to \infty} I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = 0$.

5.2 GraphCNN

Similarly as in Lemma 1, $\mathbf{x}^{(l+1)}$ can be reduced to $\mathbf{x}^{(l+1)} = \hat{\mathbf{D}}^{(l+1)}\big(\sum_{k=1}^{K}\mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k\big)\,\mathbf{x}^{(l)}$ for a diagonal matrix $\hat{\mathbf{D}}^{(l+1)}$ with entries in $\{\alpha, 1\}$ such that $\hat{\mathbf{D}}^{(l+1)}_{ii} = 1$ if $\big[\big(\sum_k \mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k\big)\,\mathbf{x}^{(l)}\big]_i \geq 0$, and $\alpha$ otherwise.

Following a similar proof for GCN, we obtain the following result for GraphCNN:

Theorem 3.

Let $\hat{s}_{\max}$ denote the maximum singular value of $\sum_{k=1}^{K}\mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k$ over all layers, i.e., $\hat{s}_{\max} = \max_l \sigma_{\max}\big(\sum_k \mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k\big)$. If $\hat{s}_{\max} < 1$, then $\lim_{l \to \infty}\mathbf{x}^{(l)} = \mathbf{0}$, and hence $\lim_{l \to \infty} I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = 0$.

Theorem 3 describes the condition on the layer-wise weight matrices under which GraphCNN fails to capture the feature characteristics at its output in the asymptotic regime. We then state the second result for GraphCNN, which guarantees full information preservation, as follows.

Theorem 4.

Consider the propagation rule in (2). Let $\hat{s}_{\min}$ denote the minimum singular value of $\sum_{k=1}^{K}\mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k$ over all layers, i.e., $\hat{s}_{\min} = \min_l \sigma_{\min}\big(\sum_k \mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k\big)$. If $\alpha\,\hat{s}_{\min} \geq 1$, then we have $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$.

In order to understand the role of decomposition in GraphCNN, we revisit the conditions for full information loss ($I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) \to 0$) and full information preservation ($I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$) under a specific choice of decomposition, which will later be used to demonstrate the information processing capability of GraphCNN.

Corollary 2.

Suppose the singular value decomposition of $\mathbf{A}$ is given by $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, and each $\mathbf{A}_k$ is set to $\mathbf{U}\boldsymbol{\Sigma}_k\mathbf{V}^\top$, where $(\boldsymbol{\Sigma}_k)_{jj} = \sigma_j(\mathbf{A})$ if $j = k$ and $0$ elsewhere. We then have the following result: for $\sigma_{\max}(\mathbf{W}_k^{(l)}) \leq w_k$ and $\sigma_k(\mathbf{A})\, w_k < 1$ for all $k$ and $l$, i.e., if each weight component fails to compensate for its respective singular value of $\mathbf{A}$, then $\lim_{l \to \infty} I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = 0$.

Corollary 3.

Let $\sigma_{\min}(\mathbf{W}_k^{(l)}) \geq w_k$ for all $l$. If $\alpha\,\sigma_k(\mathbf{A})\, w_k \geq 1$ for all $k$, then $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$.

While the universally optimal decomposition strategy is unknown and its existence is debatable, the choice of decomposition introduced above will later highlight the dramatic difference between the capabilities of GCN and GraphCNN.

Outline of the Proof:

Following Lemma 1, the next key step in proving above results is as follows.

Lemma 2.

Consider the composite linear map of Lemma 1, $\mathbf{M}^{(l)} = \mathbf{D}^{(l)}\big(\mathbf{W}^{(l-1)\top} \otimes \mathbf{A}\big)\cdots\mathbf{D}^{(1)}\big(\mathbf{W}^{(0)\top} \otimes \mathbf{A}\big)$, with singular value decomposition $\mathbf{M}^{(l)} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, and let $\mathbf{x}^{(l)} = \mathbf{M}^{(l)}\mathbf{x}^{(0)}$. We have

$I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) \overset{(1)}{=} I\big(\mathbf{V}^\top\mathbf{x}^{(0)};\, \boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{x}^{(0)}\big) \overset{(2)}{\leq} H\big(\mathbf{V}^\top\mathbf{x}^{(0)}\big) \overset{(3)}{=} H\big(\mathbf{x}^{(0)}\big),$ (3)

where (1) and (3) result from the fact that $\mathbf{U}$ and $\mathbf{V}$ are invertible, and equality holds in (2) iff $\boldsymbol{\Sigma}$ is invertible, i.e., all singular values of $\mathbf{M}^{(l)}$ are nonzero.

Theorems 1, 2, 3 and 4 can then easily be inferred from Lemma 2. That is, $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) \to 0$ iff $\mathbf{x}^{(l)} \to \mathbf{0}$ in the asymptotic regime. Similarly, iff $\boldsymbol{\Sigma}$ is invertible, $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)})$ is maximized and given by $H(\mathbf{x}^{(0)})$, hence the information loss is zero.

In particular, Theorems 1 and 3 and Corollary 2, i.e., the exponential decay to zero, also hold for the traditional ReLU with $\alpha = 0$.

Our results presented so far focus on the edge cases: complete information loss or full information preservation. While our primary goal is to understand why GraphCNN has a better capability of going deep than GCN, we note several points about Lemma 2 from the viewpoint of entropy, or uncertainty:

  1. Rigorous theoretical guarantees quantifying the amount of information preserved across the network are not straightforward, and further require knowledge of the statistical properties of the node features. Despite its simplicity, Lemma 2 forms a direct link from the information processing capability of the network to the characteristics of the weights and the entropy of the nodes,

  2. Whereas the compression and generalization capabilities of a network are closely related, we emphasize that our analysis here is meant to understand why and when GraphCNN overcomes the over-compression introduced by GCN. In the future, we plan to investigate this via the information bottleneck principle,

  3. In our formulation, we omit the effect of perturbations of the input nodes, considering that our discussion remains valid under the same perturbation characteristics,

  4. If all node features have similar entropy, for instance, $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)})$ roughly scales linearly with the rank of the composite linear map,

  5. Lifting the singular values of the layer-wise weight matrices is beneficial for better data processing from an information-theoretic viewpoint. In the next section, we demonstrate through edge cases how GraphCNN can overcome the over-compression of GCN by achieving such singular value lifting.

5.3 Discussion: GCN vs. GraphCNN

Consider the setting where $\mathbf{A}$ is fixed and the same for both GCN and GraphCNN. The discussion below revolves around the regime of singular values of the layer-wise weight matrices in which the information loss is zero, i.e., $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$, for the specific decomposition strategy used in Corollary 3.

Recall from Theorem 2 and Corollary 3 that while GCN requires the singular values of all weight matrices to compensate for the minimum singular value of $\mathbf{A}$, i.e., $\alpha\,\sigma_{\min}(\mathbf{A})\,\sigma_{\min}(\mathbf{W}^{(l)}) \geq 1$, to ensure $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$, GraphCNN relaxes this condition by introducing a milder constraint. That is, the singular values of its weight matrices need to compensate only for the singular value of their respective component $\mathbf{A}_k$: the condition $\alpha\,\sigma_k(\mathbf{A})\,\sigma_{\min}(\mathbf{W}_k^{(l)}) \geq 1$ already guarantees that $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$. In other words, the singular values of the weight matrices of GraphCNN are lower bounded by much smaller values than those of GCN for information to be fully recovered at the output layer; hence zero information loss results for GraphCNN in a much larger regime of weights.
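A numerical sketch of this singular value lifting under the SVD-based decomposition (ignoring the PReLU factor for simplicity; the matrices and the choice W_k = (1/sigma_k) I are our own illustration): each GraphCNN weight only has to compensate for its own sigma_k, and the smallest singular value of the composite map sum_k W_k^T kron A_k is lifted to one, whereas a single GCN weight matrix would have to compensate for sigma_min(A).

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 6, 3

# Build an A whose smallest singular value is tiny.
A = rng.standard_normal((N, N))
U, s, Vt = np.linalg.svd(A)
s[-1] = 1e-3
A = U @ np.diag(s) @ Vt

# GCN: sigma_min(W^T kron A) = sigma_min(W) * sigma_min(A), so the weights must
# satisfy sigma_min(W) >= 1 / sigma_min(A) to keep the composite map well conditioned.
print(1.0 / s[-1])                                  # 1000.0 here

# GraphCNN with the SVD decomposition A_k = s_k u_k v_k^T: taking W_k = (1 / s_k) * I
# already lifts every singular value of the composite map to one.
A_list = [s[k] * np.outer(U[:, k], Vt[k]) for k in range(N)]
W_list = [(1.0 / s[k]) * np.eye(d) for k in range(N)]

M = sum(np.kron(W_k.T, A_k) for A_k, W_k in zip(A_list, W_list))
print(np.linalg.svd(M, compute_uv=False).min())     # ~1.0: the composite map is invertible
```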

Under the same weight characteristics, i.e., the same singular values of the layer-wise weight matrices, GraphCNN is therefore better able to go deep than GCN, preserving more information about the node features after $l$ layers.

References

  • D. Beck, G. Haffari, and T. Cohn (2018) Graph-to-sequence learning using gated graph neural networks. arXiv preprint arXiv:1806.09835.
  • M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm (2018) Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 531–540.
  • J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, pp. 281–305.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016a) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016b) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • K. Janocha and W. M. Czarnecki (2017) On loss functions for deep neural networks in classification. arXiv preprint arXiv:1702.05659.
  • T. N. Kipf and M. Welling (2016a) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • T. N. Kipf and M. Welling (2016b) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • G. Li, M. Müller, A. Thabet, and B. Ghanem (2019) Can GCNs go as deep as CNNs? arXiv preprint arXiv:1904.03751.
  • Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Y. Li, R. Yu, C. Shahabi, and Y. Liu (2017) Graph convolutional recurrent neural network: data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.
  • K. Oono and T. Suzuki (2019) On asymptotic behaviors of graph CNNs from dynamical systems perspective. arXiv preprint arXiv:1905.10947.
  • A. Rahimi, T. Cohn, and T. Baldwin (2018) Semi-supervised user geolocation via graph convolutional networks. arXiv preprint arXiv:1804.08049.
  • S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • F. P. Such, S. Sah, M. A. Dominguez, S. Pillai, C. Zhang, A. Michael, N. D. Cahill, and R. Ptucha (2017) Robust spatial filtering with graph convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing 11 (6), pp. 884–896.
  • E. Telatar (1999) Capacity of multi-antenna Gaussian channels. European Transactions on Telecommunications 10, pp. 585–595.
  • F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger (2019a) Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019b) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419.
  • J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.
  • B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.


Appendix A Appendix

A.1 Proofs

We begin by introducing our notation. Hereafter, scalars are written in italics, vectors in bold lower-case, and matrices in bold upper-case letters. For an $m \times n$ real matrix $\mathbf{A}$, the element in the $i$th row and $j$th column is denoted by $\mathbf{A}_{ij}$, and the $i$th entry of a vector $\mathbf{a}$ by $a_i$. The $j$th column of $\mathbf{A}$ is denoted by $\mathbf{A}_{:j}$, or simply $\mathbf{a}_j$. Similarly, we denote the $i$th row by $\mathbf{A}_{i:}$. The inner product between two vectors $\mathbf{a}$ and $\mathbf{b}$ is denoted by $\langle \mathbf{a}, \mathbf{b} \rangle$.

We vectorize a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ by concatenating its columns such that $\text{vec}(\mathbf{A}) = [\mathbf{a}_1^\top, \mathbf{a}_2^\top, \ldots, \mathbf{a}_n^\top]^\top$, and denote it by $\text{vec}(\mathbf{A})$. For matrices $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{p \times q}$, we denote the Kronecker product of $\mathbf{A}$ and $\mathbf{B}$ by $\mathbf{A} \otimes \mathbf{B}$, the block matrix whose $(i, j)$ block is $\mathbf{A}_{ij}\mathbf{B}$. Note that $\mathbf{A} \otimes \mathbf{B}$ is of size $mp \times nq$.

Next, we list some existing results which we require repeatedly throughout this section.

Preliminaries.

  1. Suppose $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{X} \in \mathbb{R}^{n \times p}$ and $\mathbf{B} \in \mathbb{R}^{p \times q}$. We have

    $\text{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^\top \otimes \mathbf{A})\,\text{vec}(\mathbf{X}).$ (4)
  2. Let $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}$; then

    $\sigma_{\max}(\mathbf{A}\mathbf{B}) \leq \sigma_{\max}(\mathbf{A})\,\sigma_{\max}(\mathbf{B}) \quad\text{and}\quad \sigma_{\min}(\mathbf{A}\mathbf{B}) \geq \sigma_{\min}(\mathbf{A})\,\sigma_{\min}(\mathbf{B}).$ (5)
  3. For $\mathbf{A}$ and $\mathbf{B}$, the singular values of $\mathbf{A} \otimes \mathbf{B}$ are given by $\{\sigma_i(\mathbf{A})\,\sigma_j(\mathbf{B})\}_{i,j}$, and in particular $\sigma_{\max}(\mathbf{A} \otimes \mathbf{B}) = \sigma_{\max}(\mathbf{A})\,\sigma_{\max}(\mathbf{B})$ and $\sigma_{\min}(\mathbf{A} \otimes \mathbf{B}) = \sigma_{\min}(\mathbf{A})\,\sigma_{\min}(\mathbf{B})$.

  4. Let $\mathbf{x}$ and $\mathbf{y}$ be $n$-dimensional random vectors defined over finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. We denote the entropy of $\mathbf{x}$ by $H(\mathbf{x})$ and the mutual information between $\mathbf{x}$ and $\mathbf{y}$ by $I(\mathbf{x}; \mathbf{y})$. We list the following:

    $H(g(\mathbf{x})) \leq H(\mathbf{x}), \qquad I(\mathbf{x}; g(\mathbf{y})) \leq I(\mathbf{x}; \mathbf{y}),$ (6a, 6b)

    where $g$ is some deterministic function, and equality holds for both inequalities iff $g$ is bijective.

Proofs.

The proofs are listed below in order.

Proof of Lemma 1.

Applying vectorization to the layer-wise propagation rule introduced in (1), we have

$\mathbf{x}^{(l+1)} = \text{vec}\big(\sigma_\alpha(\mathbf{A}\mathbf{X}^{(l)}\mathbf{W}^{(l)})\big) \overset{(a)}{=} \sigma_\alpha\big(\text{vec}(\mathbf{A}\mathbf{X}^{(l)}\mathbf{W}^{(l)})\big) \overset{(b)}{=} \sigma_\alpha\big((\mathbf{W}^{(l)\top} \otimes \mathbf{A})\,\mathbf{x}^{(l)}\big) \overset{(c)}{=} \mathbf{D}^{(l+1)}\big(\mathbf{W}^{(l)\top} \otimes \mathbf{A}\big)\,\mathbf{x}^{(l)},$ (7)

where (a) follows from the element-wise application of $\sigma_\alpha$, (b) follows from (4), and (c) results from introducing a diagonal matrix $\mathbf{D}^{(l+1)}$ with diagonal entries in $\{\alpha, 1\}$ such that $\mathbf{D}^{(l+1)}_{ii} = 1$ if $\big[(\mathbf{W}^{(l)\top} \otimes \mathbf{A})\,\mathbf{x}^{(l)}\big]_i \geq 0$, and $\alpha$ elsewhere.

By a recursive application of (7c), we have $\mathbf{x}^{(l)} = \mathbf{D}^{(l)}\big(\mathbf{W}^{(l-1)\top} \otimes \mathbf{A}\big)\cdots\mathbf{D}^{(1)}\big(\mathbf{W}^{(0)\top} \otimes \mathbf{A}\big)\,\mathbf{x}^{(0)}$.

The transpose on $\mathbf{W}^{(l)}$ may be dropped without affecting the results, since the singular values of $\mathbf{W}^{(l)}$, which are our primary interest, are unchanged by transposition. ∎

Proof of Lemma 2.

Let $\mathbf{M}^{(l)}$ be the composite matrix from Lemma 1 with singular value decomposition $\mathbf{M}^{(l)} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$. Inspired by the derivation of the capacity of deterministic channels introduced by Telatar (1999), we derive the following:

$I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = I\big(\mathbf{x}^{(0)}; \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{x}^{(0)}\big) \overset{(a)}{=} I\big(\mathbf{x}^{(0)}; \boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{x}^{(0)}\big) \overset{(b)}{=} I\big(\mathbf{V}^\top\mathbf{x}^{(0)}; \boldsymbol{\Sigma}\mathbf{V}^\top\mathbf{x}^{(0)}\big) \overset{(c)}{=} I(\mathbf{y}; \boldsymbol{\Sigma}\mathbf{y}),$ (8)

where (a) and (b) are a result of (6b) and the fact that $\mathbf{U}$ and $\mathbf{V}$ are unitary, hence invertible (bijective) transformations, and (c) follows from the change of variables $\mathbf{y} = \mathbf{V}^\top\mathbf{x}^{(0)}$.

Note that $I(\mathbf{y}; \boldsymbol{\Sigma}\mathbf{y}) \leq H(\mathbf{y})$, with equality iff $\boldsymbol{\Sigma}$ is invertible. Using (6a), we further have $H(\mathbf{y}) = H(\mathbf{V}^\top\mathbf{x}^{(0)}) = H(\mathbf{x}^{(0)})$, which completes the proof. ∎

We recall that we are interested in the regimes where $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) \to 0$ and where $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$. In Lemma 2, we show that $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) \to 0$ if $\mathbf{x}^{(l)} \to \mathbf{0}$, and that $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)})$ is maximized (and given by $H(\mathbf{x}^{(0)})$) when $\boldsymbol{\Sigma}$ is invertible. Therefore, the maximum and minimum singular values of $\mathbf{M}^{(l)}$ are of primary interest.

Proof of Theorem 1.

Let $s_{\max} \geq \sigma_{\max}(\mathbf{A})$ and $w_{\max} \geq \sigma_{\max}(\mathbf{W}^{(l)})$ for all $l$. By preliminaries (3) and (5), and since the diagonal entries of $\mathbf{D}^{(l)}$ are at most $1$, the singular values of $\mathbf{M}^{(l)}$ are upper bounded by $(s_{\max} w_{\max})^{l}$. Therefore, if $s_{\max} w_{\max} < 1$, then $\mathbf{x}^{(l)} \to \mathbf{0}$ exponentially fast, and by Lemma 2 we have $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) \to 0$. ∎

Proof of Theorem 2.

We now denote $s_{\min} \leq \sigma_{\min}(\mathbf{A})$ and $w_{\min} \leq \sigma_{\min}(\mathbf{W}^{(l)})$ for all $l$. Hence $\sigma_{\min}(\mathbf{W}^{(l)\top} \otimes \mathbf{A}) \geq s_{\min} w_{\min}$. Moreover, $\sigma_{\min}(\mathbf{D}^{(l)}) \geq \alpha$. If $\alpha\, s_{\min} w_{\min} \geq 1$, then $\sigma_{\min}(\mathbf{M}^{(l)}) \geq 1$, hence $\boldsymbol{\Sigma}$ is invertible and $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$ results by Lemma 2. ∎

Proof of Corollary 1.

Let $\mathbf{D}$ denote the degree matrix such that $\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}$, and let $\hat{\mathbf{A}} = \mathbf{D}^{-1}\mathbf{A}$ be the associated normalized Laplacian. Due to the property of the normalized Laplacian that $\sigma_{\max}(\hat{\mathbf{A}}) = 1$, we have $\sigma_{\max}(\hat{\mathbf{A}})\, w_{\max} = w_{\max}$. Inserting this into Theorem 1, the corollary results. ∎

Similarly as in (7), $\mathbf{x}^{(l+1)}$ can be derived from (2) as follows:

$\mathbf{x}^{(l+1)} = \hat{\mathbf{D}}^{(l+1)}\Big(\sum_{k=1}^{K}\mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k\Big)\,\mathbf{x}^{(l)},$ (9)

where $\hat{\mathbf{D}}^{(l+1)}$ is a diagonal matrix with diagonal entries in $\{\alpha, 1\}$ such that $\hat{\mathbf{D}}^{(l+1)}_{ii} = 1$ if $\big[\big(\sum_k \mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k\big)\,\mathbf{x}^{(l)}\big]_i \geq 0$, and $\alpha$ otherwise.

Therefore, $\mathbf{x}^{(l)}$ is given by $\mathbf{x}^{(l)} = \hat{\mathbf{D}}^{(l)}\big(\sum_k \mathbf{W}_k^{(l-1)\top} \otimes \mathbf{A}_k\big)\cdots\hat{\mathbf{D}}^{(1)}\big(\sum_k \mathbf{W}_k^{(0)\top} \otimes \mathbf{A}_k\big)\,\mathbf{x}^{(0)}$.

Consider (8) where $\mathbf{W}^{(l)\top} \otimes \mathbf{A}$ is replaced with $\sum_k \mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k$. We deduce the following:

Proof of Theorem 3.

Suppose $\hat{s}_{\max}$ denotes the largest singular value of $\sum_k \mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k$ over all layers. Following the same argument as in the proofs of Theorems 1 and 2, Lemma 2 implies that if $\hat{s}_{\max} < 1$, then $\mathbf{x}^{(l)} \to \mathbf{0}$, and hence $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) \to 0$ results. ∎

Proof of Theorem 4.

We now denote by $\hat{s}_{\min}$ the minimum singular value of $\sum_k \mathbf{W}_k^{(l)\top} \otimes \mathbf{A}_k$ over all layers. By Lemma 2, it immediately follows that if $\alpha\,\hat{s}_{\min} \geq 1$, then we have $I(\mathbf{x}^{(0)}; \mathbf{x}^{(l)}) = H(\mathbf{x}^{(0)})$. ∎

Before we move on to the proofs of Corollaries 2 and 3, we state the following lemma.

Lemma 3.

Let the singular value decomposition of $\mathbf{A}$ be given by $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ and set each $\mathbf{A}_k$ to $\mathbf{U}\boldsymbol{\Sigma}_k\mathbf{V}^\top$ with $(\boldsymbol{\Sigma}_k)_{jj} = \sigma_j(\mathbf{A})$ if $j = k$ and $0$ elsewhere. For this specific decomposition, we argue that the singular values of