Pytorch Repo for DeepGCNs (ICCV'2019 Oral, TPAMI'2021), DeeperGCN (arXiv'2020) and GNN1000(ICML'2021): https://www.deepgcns.org
Convolutional Neural Networks (CNNs) have been very successful at solving a variety of computer vision tasks such as object classification and detection, semantic segmentation, activity understanding, to name just a few. One key enabling factor for their great performance has been the ability to train very deep CNNs. Despite their huge success in many tasks, CNNs do not work well with non-Euclidean data, which is prevalent in many real-world applications. Graph Convolutional Networks (GCNs) offer an alternative that allows for non-Euclidean data as input to a neural network similar to CNNs. While GCNs already achieve encouraging results, they are currently limited to shallow architectures with 2-4 layers due to vanishing gradients during training. This work transfers concepts such as residual/dense connections and dilated convolutions from CNNs to GCNs in order to successfully train very deep GCNs. We show the benefit of deep GCNs with as many as 112 layers experimentally across various datasets and tasks. Specifically, we achieve state-of-the-art performance in part segmentation and semantic segmentation on point clouds and in node classification of protein functions across biological protein-protein interaction (PPI) graphs. We believe that the insights in this work will open a lot of avenues for future research on GCNs and transfer to further tasks not explored in this work. The source code for this work is available for PyTorch and TensorFlow at https://github.com/lightaime/deep_gcns_torch and https://github.com/lightaime/deep_gcns respectively.
GCNs have become a prominent research topic in recent years. There are several reasons for this trend, but above all, GCNs promise a natural extension of CNNs to non-Euclidean data. While CNNs are very powerful when dealing with grid-like structured data, e.g. images, it turns out that their performance on more irregular data, e.g. point clouds and graphs, is sub-par. Since many real-world applications need to leverage such data, GCNs are a very natural fit. There has already been some success in using GCNs to predict individual relations in social networks, model proteins for drug discovery [2, 3], enhance predictions of recommendation engines [4, 5], and efficiently segment large point clouds. While these works show promising results, they rely on simple and shallow network architectures.
In the case of CNNs, the primary reason for their continued success and state-of-the-art performance on many computer vision tasks is the ability to train very deep network architectures reliably. Surprisingly, it is not clear how to train deep GCN architectures, and many existing works have investigated this limitation along with other shortcomings of GCNs [7, 8, 9]. As with CNNs, stacking multiple layers in GCNs leads to the vanishing gradient problem. In addition, an over-smoothing problem occurs when many GCN layers are applied repeatedly: the features of vertices within each connected component have been observed to converge to the same value and become indistinguishable. As a result, most state-of-the-art GCNs are limited to very shallow network architectures, usually no deeper than 4 layers.
Figure 1: (left) We show the square root of the training loss for GCNs with 7, 14, 28, 56 and 112 layers, with and without residual connections, over the training epochs. We note that adding more layers without residual connections translates to a substantially higher loss and, for very deep networks (e.g. 112 layers), even to divergence. (right) In contrast, training GCNs with residual connections results in consistent training stability across all depths.
The vanishing gradient problem is well known and studied in the realm of CNNs. As a matter of fact, it was the key limitation for deep convolutional networks before ResNet proposed a simple and yet effective solution. The introduction of residual connections between consecutive layers addressed the vanishing gradient problem by providing additional paths for the gradient. This enabled deep residual networks with more than a hundred layers that could be trained reliably, e.g. ResNet-152. The idea was further extended by DenseNet, where additional connections are added across layers.
Training deep networks reveals another bottleneck which is especially relevant for tasks that rely on the spatial composition of the input image, e.g. object detection, semantic segmentation, depth estimation, etc. With increased network depth, more spatial information can potentially be lost during pooling. Ideally, the receptive field should increase with network depth without loss of resolution. For CNNs, dilated convolutions were introduced to tackle this issue. The idea is again simple but effective. Essentially, the convolutions are performed across more distant neighbors as the network depth increases. In this way, multiple resolutions are seamlessly encoded in deeper CNNs. Several innovations, in particular residual/dense connections and dilated convolutions, have enabled reliable training of very deep CNNs achieving state-of-the-art performance on many tasks. This raises the following question: do these innovations have a counterpart in the realm of GCNs?
In this work, we present an extensive study of methodologies that allow training very deep GCNs. We adapt concepts that were successful in training deep CNNs, in particular residual connections, dense connections, and dilated convolutions. We show how these concepts can be incorporated into a graph framework. In order to quantify the effect of these additions, we conduct an extensive analysis of each component and its impact on accuracy and stability of deep GCNs. To showcase the potential of these concepts in the context of GCNs, we apply them to the popular tasks of semantic segmentation and part segmentation of point clouds as well as node classification of biological graphs. Adding either residual or dense connections in combination with dilated convolutions enables successful training of GCNs with a depth of 112 layers (refer to Figure 1). The proposed deep GCNs improve the state-of-the-art mIoU on the challenging point cloud dataset S3DIS and outperform previous methods in many classes of PartNet. The same deep GCN architecture also achieves a state-of-the-art F1 score on the very different PPI dataset.
Contributions. The contributions of this work comprise 3 aspects. (1) We adapt residual connections, dense connections, and dilated convolutions, which were introduced for CNNs, to GCNs. (2) We present extensive experiments on point cloud data and biological graph data, showing the effect of each component on the stability and performance of training deep GCNs. We use semantic segmentation and part segmentation on point clouds as well as node classification of biological networks as our experimental testbeds. (3) We show how these new concepts enable successful training of a 112-layer GCN, the deepest GCN architecture by a large margin. With only 28 layers, we already improve the previous best performance in terms of mIoU on the S3DIS dataset; we also outperform previous methods in the task of part segmentation for many classes of PartNet. Similarly, we set a new state-of-the-art for the PPI dataset in the task of node classification in biological networks.
A preliminary version of this work was published in ICCV 2019. This journal manuscript extends the initial version in several aspects. First, we investigate even deeper DeepGCN architectures with more than 100 layers. Interestingly, we find that when training a PlainGCN with 112 layers it diverges, while our proposed counterpart, ResGCN with skip connections and dilated convolutions, converges without problem (see Figure 1). Second, to investigate the generality of our DeepGCN framework, we perform extensive additional experiments on the tasks of part segmentation on PartNet and node classification on PPI. Third, we examine the performance and efficiency of MRGCN, a memory-efficient GCN aggregator we propose, with thorough experiments on the PPI dataset. Our results show that DeepMRGCN models are able to outperform current state-of-the-art methods. We also demonstrate that MRGCN is very memory-efficient compared to other GCN operators via GPU memory usage experiments. Finally, to ensure reproducibility of our experiments and contribute to the graph learning research community, we have published code for training, testing and visualization along with several pretrained models in both TensorFlow and PyTorch. To the best of our knowledge, our method is the first approach that successfully trains deep GCNs beyond 100 layers and achieves state-of-the-art results on both point cloud data and biological graph data.
A large number of real-world applications deal with non-Euclidean data, which cannot be systematically and reliably processed by CNNs in general. To overcome the shortcomings of CNNs, GCNs provide well-suited solutions for non-Euclidean data processing, leading to greatly increasing interest in using GCNs for a variety of applications. In social networks, graphs represent connections between individuals based on mutual interests/relations. These connections are non-Euclidean and highly irregular. GCNs help better estimate edge strengths between the vertices of social network graphs, thus leading to more accurate connections between individuals. Graphs are also used to model chemical molecule structures [2, 3]. Understanding the bio-activities of these molecules can have substantial impact on drug discovery. Another popular use of graphs is in recommendation engines [4, 5], where accurate modelling of user interactions leads to improved product recommendations. Graphs are also popular modes of representation in natural language processing [16, 17], where they are used to represent complex relations between large text units.
GCNs also find many applications in computer vision. In scene graph generation, semantic relations between objects are modelled using a graph. This graph is used to detect and segment objects in images, and also to predict semantic relations between object pairs [18, 19, 20, 21]. Scene graphs facilitate the inverse process as well, where an image is reconstructed given a graph representation of the scene. Graphs are also used to model human joints for action recognition in video [23, 24]. GCNs are a perfect candidate for 3D point cloud processing, especially since the unstructured nature of point clouds poses a representational challenge for systematic research. Several attempts at creating structure from 3D data exist, either representing it with multiple 2D views [25, 26, 27, 28] or by voxelization [29, 30, 31, 32]. More recent work focuses on directly processing unordered point cloud representations [33, 34, 35, 36, 37]. The recent EdgeConv method by Wang et al. applies GCNs to point clouds. In particular, they propose a dynamic edge convolution algorithm for semantic segmentation of point clouds. The algorithm dynamically computes node adjacency at each graph layer using the distance between point features. This work demonstrates the potential of GCNs for point cloud related applications and beats the state-of-the-art in the task of point cloud segmentation. Unlike most other works, EdgeConv does not rely on RNNs or complex point aggregation methods.
Current GCN algorithms including EdgeConv are limited to shallow depths. Recent works have attempted to train deeper GCNs. For instance, Kipf et al. trained a semi-supervised GCN model for node classification and showed how performance degrades when using more than 3 layers. Pham et al. proposed Column Network (CLN) for collective classification in relational learning and showed peak performance for 10 layers and degrading performance for deeper graphs. Rahimi et al. developed a Highway GCN for user geo-location in social media graphs, where they add "highway" gates between layers to facilitate gradient flow. Even with these gates, the authors demonstrate performance degradation after 6 layers of depth. Xu et al. developed a Jumping Knowledge Network for representation learning and devised an alternative strategy to select graph neighbors for each node based on graph structure. As with other works, their network is limited to a small number of layers. Recently, Li et al. have studied the depth limitations of GCNs and showed that deep GCNs can cause over-smoothing, which results in features at vertices within each connected component converging to the same value. Other works [8, 9] also show the limitations of stacking multiple GCN layers, which leads to highly complex back-propagation and the common vanishing gradient problem.
Many difficulties facing GCNs nowadays (e.g. vanishing gradients and limited receptive field) were also present in the early days of CNNs [10, 12]. We bridge this gap and show that the majority of these drawbacks can be remedied by borrowing several orthogonal tricks from CNNs. Deep CNNs achieved a huge boost in performance with the introduction of ResNet. By adding residual connections between inputs and outputs of layers, ResNet tends to alleviate the vanishing gradient problem. DenseNet takes this idea a step further and adds connections across layers as well. Dilated Convolutions are another recent approach that has led to significant performance gains, specifically in image-to-image translation tasks such as semantic segmentation, by increasing the receptive field without loss of resolution. In this work, we show how one can benefit from concepts introduced for CNNs, mainly residual/dense connections and dilated convolutions, to train very deep GCNs. We support our claim by extending different GCN variants to deeper versions by adapting these concepts, thereby significantly increasing their performance. Extensive experiments on the tasks of semantic segmentation and part segmentation of point clouds and node classification in biological graphs validate these ideas for general graph scenarios.
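Graph Definition. A graph $\mathcal{G}$ is represented by a tuple $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of unordered vertices and $\mathcal{E}$ is the set of edges representing the connectivity between vertices $v \in \mathcal{V}$. If $e_{i,j} \in \mathcal{E}$, then vertices $v_i$ and $v_j$ are connected to each other with an edge $e_{i,j}$.
Graph Convolution Networks. Inspired by CNNs, GCNs intend to extract richer features at a vertex by aggregating features of vertices from its neighborhood. GCNs represent vertices by associating each vertex $v$ with a feature vector $\mathbf{h}_v \in \mathbb{R}^{D}$, where $D$ is the feature dimension. Therefore, the graph $\mathcal{G}$ as a whole can be represented by concatenating the features of all the unordered vertices, i.e. $\mathbf{h}_{\mathcal{G}} = [\mathbf{h}_{v_1}, \mathbf{h}_{v_2}, \dots, \mathbf{h}_{v_N}]^{\top} \in \mathbb{R}^{N \times D}$, where $N$ is the cardinality of set $\mathcal{V}$. A general graph convolution operation $\mathcal{F}$ at the $l$-th layer can be formulated as the following aggregation and update operations,
$$\mathcal{G}_{l+1} = \mathcal{F}(\mathcal{G}_l, \mathcal{W}_l) = \mathrm{Update}\big(\mathrm{Aggregate}(\mathcal{G}_l, \mathcal{W}_l^{agg}), \mathcal{W}_l^{update}\big). \tag{1}$$
$\mathcal{G}_l = (\mathcal{V}_l, \mathcal{E}_l)$ and $\mathcal{G}_{l+1} = (\mathcal{V}_{l+1}, \mathcal{E}_{l+1})$ are the input and output graphs at the $l$-th layer, respectively. $\mathcal{W}_l^{agg}$ and $\mathcal{W}_l^{update}$ are the learnable weights of the aggregation and update functions respectively, and they are the essential components of GCNs. In most GCN frameworks, aggregation functions are used to compile information from the neighborhood of vertices, while update functions perform a non-linear transform on the aggregated information to compute new vertex representations. There are different variants of those two functions. For example, the aggregation function can be a mean aggregator, a max-pooling aggregator [33, 42, 6], an attention aggregator or an LSTM aggregator. The update function can be a multi-layer perceptron [42, 45], a gated network, etc. More concretely, the representation of vertices is computed at each layer by aggregating features of neighbor vertices for all $v_{l+1} \in \mathcal{V}_{l+1}$ as follows,
$$\mathbf{h}_{v_{l+1}} = \phi\big(\mathbf{h}_{v_l}, \rho(\{\mathbf{h}_{u_l} \mid u_l \in \mathcal{N}(v_l)\}, \mathbf{h}_{v_l}, \mathcal{W}_{\rho}), \mathcal{W}_{\phi}\big), \tag{2}$$
where $\rho$ is a vertex feature aggregation function and $\phi$ is a vertex feature update function, and $\mathbf{h}_{v_l}$ and $\mathbf{h}_{v_{l+1}}$ are the vertex features at the $l$-th layer and $(l+1)$-th layer respectively. $\mathcal{N}(v_l)$ is the set of neighbor vertices of $v$ at the $l$-th layer, and $\mathbf{h}_{u_l}$ is the feature of those neighbor vertices parametrized by $\mathcal{W}_{\rho}$. $\mathcal{W}_{\phi}$ contains the learnable parameters of these functions. For simplicity and without loss of generality, we use a max-pooling vertex feature aggregator, without learnable parameters, to pool the difference of features between vertex $v_l$ and all of its neighbors: $\rho(\cdot) = \max(\mathbf{h}_{u_l} - \mathbf{h}_{v_l} \mid u_l \in \mathcal{N}(v_l))$. We then model the vertex feature updater $\phi$ as a multi-layer perceptron (MLP) with batch normalization; this MLP concatenates $\mathbf{h}_{v_l}$ with its aggregate features from $\rho(\cdot)$ to form its input.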
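The aggregation step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `max_relative_aggregate` and `update_input`, the dictionary-based feature store, and the toy values are all hypothetical, and the learned MLP with batch normalization is left out. The sketch only shows how the feature differences to the neighbors are max-pooled and then concatenated with the vertex feature to form the updater's input.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def max_relative_aggregate(
    features: Dict[int, Vector], v: int, neighbors: Sequence[int]
) -> Vector:
    """Element-wise max over (h_u - h_v) for all neighbors u of vertex v."""
    h_v = features[v]
    if not neighbors:
        # Assumption: an isolated vertex aggregates to the zero vector.
        return [0.0] * len(h_v)
    diffs = [[hu - hv for hu, hv in zip(features[u], h_v)] for u in neighbors]
    return [max(column) for column in zip(*diffs)]


def update_input(
    features: Dict[int, Vector], v: int, neighbors: Sequence[int]
) -> Vector:
    """Concatenate h_v with its aggregated feature; this would feed the MLP."""
    return features[v] + max_relative_aggregate(features, v, neighbors)


# Toy graph: vertex 0 has neighbors 1 and 2 (values are illustrative only).
features = {0: [1.0, 2.0], 1: [3.0, 0.0], 2: [0.0, 5.0]}
aggregated = max_relative_aggregate(features, 0, [1, 2])
mlp_input = update_input(features, 0, [1, 2])
```

Because the aggregator has no learnable parameters, all trainable weights in this formulation sit in the update function.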
Dynamic Edges. As mentioned earlier, most GCNs have fixed graph structures and only update the vertex features at each iteration. Recent work [48, 6, 49] demonstrates that dynamic graph convolution, where the graph structure is allowed to change in each layer, can learn better graph representations compared to GCNs with fixed graph structure. For instance, ECC (Edge-Conditioned Convolution) uses dynamic edge-conditional filters to learn an edge-specific weight matrix. Moreover, EdgeConv finds the $k$ nearest neighbors in the current feature space to reconstruct the graph after every EdgeConv layer. In order to learn to generate point clouds, Graph-Convolution GAN (Generative Adversarial Network) also applies $k$-NN graphs to construct the neighbourhood of each vertex in every layer. We find that dynamically changing neighbors in GCNs helps alleviate the over-smoothing problem and results in an effectively larger receptive field when deeper GCNs are considered. In our framework, we propose to re-compute edges between vertices via a Dilated $k$-NN function in the feature space of each layer to further increase the receptive field.
Designing deep GCN architectures [8, 9] is an open problem in the graph learning domain. Recent work [7, 8, 9] suggests that GCNs do not scale well to deep architectures, since stacking multiple layers of graph convolutions leads to high complexity in back-propagation. As such, most state-of-the-art GCN models are usually no more than three layers deep. Inspired by the huge success of ResNet, DenseNet and Dilated Convolutions, we transfer these ideas to GCNs to unleash their full potential. This enables much deeper GCNs that reliably converge in training and achieve superior performance in inference. In what follows, we provide a detailed description of three operations that can enable much deeper GCNs to be trained: residual connections, dense connections, and dilated aggregation.
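In the original graph learning framework, the underlying mapping $\mathcal{F}$, which takes a graph as an input and outputs a new graph representation (see Equation (1)), is learned. Here, we propose a graph residual learning framework that learns an underlying mapping $\mathcal{H}$ by fitting another mapping $\mathcal{F}$. After $\mathcal{G}_l$ is transformed by $\mathcal{F}$, vertex-wise addition is performed to obtain $\mathcal{G}_{l+1}$:
$$\mathcal{G}_{l+1} = \mathcal{H}(\mathcal{G}_l, \mathcal{W}_l) = \mathcal{F}(\mathcal{G}_l, \mathcal{W}_l) + \mathcal{G}_l = \mathcal{G}_{l+1}^{res} + \mathcal{G}_l. \tag{3}$$
The residual mapping $\mathcal{F}$ learns to take a graph as input and outputs a residual graph representation $\mathcal{G}_{l+1}^{res}$ for the next layer. $\mathcal{W}_l$ is the set of learnable parameters at layer $l$. In our experiments, we refer to our residual model as ResGCN.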
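The vertex-wise addition can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the paper's code: `residual_layer` and `toy_mapping` are hypothetical names, a graph is reduced to a list of per-vertex feature vectors, and the learned mapping is replaced by a fixed scaling so the arithmetic is easy to follow.

```python
from typing import Callable, List

Graph = List[List[float]]  # one feature vector per vertex


def residual_layer(mapping: Callable[[Graph], Graph], graph: Graph) -> Graph:
    """Return F(G) + G, using vertex-wise addition."""
    residual = mapping(graph)
    if len(residual) != len(graph) or any(
        len(r) != len(h) for r, h in zip(residual, graph)
    ):
        raise ValueError("F must preserve the number of vertices and feature size")
    return [[r + h for r, h in zip(rv, hv)] for rv, hv in zip(residual, graph)]


def toy_mapping(graph: Graph) -> Graph:
    """Hypothetical stand-in for a graph convolution: halves every feature."""
    return [[0.5 * x for x in h] for h in graph]


graph = [[1.0, 2.0], [4.0, 8.0]]
output = residual_layer(toy_mapping, graph)
```

The shape check reflects a practical consequence of vertex-wise addition: the mapping has to keep the feature dimension unchanged, and the skip path itself adds no parameters.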
DenseNet was proposed to exploit dense connectivity among layers, which improves information flow in the network and enables efficient reuse of features among layers. Inspired by DenseNet, we adapt a similar idea to GCNs so as to exploit information flow from different GCN layers. In particular, we have:
$$\mathcal{G}_{l+1} = \mathcal{H}(\mathcal{G}_l, \mathcal{W}_l) = \mathcal{T}(\mathcal{F}(\mathcal{G}_l, \mathcal{W}_l), \mathcal{G}_l) = \mathcal{T}(\mathcal{F}(\mathcal{G}_l, \mathcal{W}_l), \dots, \mathcal{F}(\mathcal{G}_0, \mathcal{W}_0), \mathcal{G}_0). \tag{4}$$
The operator $\mathcal{T}$ is a vertex-wise concatenation function that densely fuses the input graph $\mathcal{G}_0$ with all the intermediate GCN layer outputs. To this end, $\mathcal{G}_{l+1}$ consists of all the GCN transitions from previous layers. Since we fuse GCN representations densely, we refer to our dense model as DenseGCN. The growth rate of DenseGCN is equal to the dimension $D$ of the output graph (similar to DenseNet for CNNs). For example, if $\mathcal{F}$ produces a $D$ dimensional vertex feature, where the vertices of the input graph $\mathcal{G}_0$ are $D_0$ dimensional, the dimension of each vertex feature of $\mathcal{G}_{l+1}$ is $D_0 + D \times (l+1)$.
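To make the effect of vertex-wise concatenation concrete, the following sketch stacks a few dense layers and records the per-vertex feature width after each one. It is only an illustration: `dense_layer` and `make_toy_mapping` are hypothetical names, the stand-in mapping is not a real graph convolution, and the sizes (input width 3, growth rate 4, 5 layers) are arbitrary. Each layer appends one growth-rate-sized block to every vertex, so the width grows linearly with depth.

```python
from typing import Callable, List

Graph = List[List[float]]  # one feature vector per vertex


def dense_layer(mapping: Callable[[Graph], Graph], graph: Graph) -> Graph:
    """Concatenate F(G) with G vertex by vertex."""
    new_features = mapping(graph)
    return [new + old for new, old in zip(new_features, graph)]


def make_toy_mapping(growth_rate: int) -> Callable[[Graph], Graph]:
    """Hypothetical stand-in for a GCN layer emitting `growth_rate` features."""

    def mapping(graph: Graph) -> Graph:
        return [[sum(h)] * growth_rate for h in graph]

    return mapping


input_dim, growth_rate, num_layers = 3, 4, 5
graph = [[1.0] * input_dim for _ in range(2)]
dims = [len(graph[0])]
for _ in range(num_layers):
    graph = dense_layer(make_toy_mapping(growth_rate), graph)
    dims.append(len(graph[0]))
```

The linear growth of the feature width is also why a naive dense implementation becomes memory-hungry as depth increases.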
Dilated wavelet convolution is an algorithm originating from the wavelet processing domain [50, 51]. To alleviate spatial information loss caused by pooling operations, Yu et al. propose dilated convolutions as an alternative to applying consecutive pooling layers for dense prediction tasks, e.g. semantic image segmentation. Their experiments demonstrate that aggregating multi-scale contextual information using dilated convolutions can significantly increase the accuracy on the semantic segmentation task. The reason behind this is the fact that dilation enlarges the receptive field without loss of resolution. We believe that dilation can also help with the receptive field of deep GCNs.
Therefore, we introduce dilated aggregation to GCNs. There are many possible ways to construct a dilated neighborhood. We use a Dilated $k$-NN to find dilated neighbors after every GCN layer and construct a Dilated Graph. In particular, for an input graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with Dilated $k$-NN and $d$ as the dilation rate, the Dilated $k$-NN returns the $k$ nearest neighbors within the $k \times d$ neighborhood region by skipping every $d$ neighbors. The nearest neighbors are determined based on a pre-defined distance metric. In our experiments, we use the $\ell_2$ distance in the feature space of the current layer. Let $\mathcal{N}^{(d)}(v)$ denote the $d$-dilated neighborhood of vertex $v$. If $(u_1, u_2, \dots, u_{k \times d})$ are the first sorted $k \times d$ nearest neighbors, vertices $(u_1, u_{1+d}, u_{1+2d}, \dots, u_{1+(k-1)d})$ are the $d$-dilated neighbors of vertex $v$ (see Figure 3), i.e.
$$\mathcal{N}^{(d)}(v) = \{u_1, u_{1+d}, u_{1+2d}, \dots, u_{1+(k-1)d}\}.$$
Hence, the edges $\mathcal{E}^{(d)}$ of the output graph are defined on the set of $d$-dilated vertex neighbors $\mathcal{N}^{(d)}(v)$. Specifically, there exists a directed edge $e \in \mathcal{E}^{(d)}$ from vertex $v$ to every vertex $u \in \mathcal{N}^{(d)}(v)$. The GCN aggregation and update functions are applied, as in Equation (1), by using the edges $\mathcal{E}^{(d)}$ created by the Dilated $k$-NN, so as to generate the feature $\mathbf{h}_v^{(d)}$ of each output vertex in $\mathcal{V}^{(d)}$. We denote this layer operation as a dilated graph convolution with dilation rate $d$, or more formally: $\mathcal{G}^{(d)} = (\mathcal{V}^{(d)}, \mathcal{E}^{(d)})$. We visualize and compare it to a conventional dilated convolution used in CNNs in Figure 3. To improve generalization, we use stochastic dilation in practice. During training, we perform the aforementioned dilated aggregations with a high probability $(1 - \epsilon)$, leaving a small probability $\epsilon$ to perform random aggregation by uniformly sampling $k$ neighbors from the set of $k \times d$ neighbors $\{u_1, u_2, \dots, u_{k \times d}\}$. At inference time, we perform deterministic dilated aggregation without stochasticity.
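A minimal sketch of the dilated neighbor search with optional stochastic dilation is given below. It is a hedged illustration, not the released implementation: the function name `dilated_knn`, its parameters, and the brute-force sort are hypothetical choices, Euclidean distance is used as the metric, and excluding the query vertex from its own neighbor list is an assumption of this sketch.

```python
import math
import random
from typing import List, Optional, Sequence


def dilated_knn(
    features: Sequence[Sequence[float]],
    v: int,
    k: int,
    d: int,
    epsilon: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return k neighbor indices of vertex v using dilation rate d.

    The k * d nearest neighbors are sorted by Euclidean distance and every
    d-th one is kept. If `rng` is given (training mode), then with
    probability `epsilon` k neighbors are instead sampled uniformly from
    the k * d candidates.
    """
    # Assumption: the query vertex is not its own neighbor.
    others = [u for u in range(len(features)) if u != v]
    if k * d > len(others):
        raise ValueError("need at least k * d candidate neighbors")
    others.sort(key=lambda u: math.dist(features[u], features[v]))
    candidates = others[: k * d]
    if rng is not None and rng.random() < epsilon:
        return rng.sample(candidates, k)
    return candidates[::d]  # u_1, u_{1+d}, ..., u_{1+(k-1)d}


# Nine points on a line; vertex 0 is the query.
features = [[float(i)] for i in range(9)]
plain = dilated_knn(features, 0, k=3, d=1)
dilated = dilated_knn(features, 0, k=3, d=2)
sampled = dilated_knn(features, 0, k=3, d=2, epsilon=1.0, rng=random.Random(0))
```

With `d=1` the function reduces to an ordinary nearest-neighbor search; larger `d` reaches further out while still returning only `k` neighbors, which is how the receptive field grows without extra aggregation cost. Omitting `rng` gives the deterministic behavior used at inference time.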
In our experiments in the paper, we mostly work with a GCN based on EdgeConv to show how very deep GCNs can be trained. However, it is straightforward to build other deep GCNs with the same concepts we propose (e.g. residual/dense graph connections and dilated graph convolutions). To show that these concepts are universal operators and can be used for general GCNs, we perform additional experiments. In particular, we build ResGCNs based on GraphSAGE and Graph Isomorphism Network (GIN). In practice, we find that EdgeConv learns a better representation than the other implementations. However, it is less efficient in terms of memory and computation. Therefore, we also propose a simple GCN operation combining the advantages of both, which we refer to as MRGCN (Max-Relative GCN). In the following, we discuss each GCN operator in detail.
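EdgeConv. Instead of aggregating neighborhood features directly, EdgeConv proposes to first get local neighborhood information for each neighbor by subtracting the feature of the central vertex from its own feature. In order to train deeper GCNs, we add residual/dense graph connections and dilated graph convolutions to EdgeConv:
$$\begin{aligned} \mathbf{h}^{res}_{v_{l+1}} &= \max\big(\{\mathrm{mlp}(\mathrm{concat}(\mathbf{h}_{v_l}, \mathbf{h}_{u_l} - \mathbf{h}_{v_l})) \mid u_l \in \mathcal{N}^{(d)}(v_l)\}\big), \\ \mathbf{h}_{v_{l+1}} &= \mathbf{h}^{res}_{v_{l+1}} + \mathbf{h}_{v_l}. \end{aligned} \tag{5}$$
GraphSAGE. GraphSAGE proposes different types of aggregator functions including a Mean aggregator, LSTM aggregator and Pooling aggregator. Their experiments show that the Pooling aggregator outperforms the others. We adapt GraphSAGE with the max-pooling aggregator to obtain ResGraphSAGE:
$$\begin{aligned} \mathbf{h}^{res}_{\mathcal{N}^{(d)}(v_l)} &= \max\big(\{\mathrm{mlp}(\mathbf{h}_{u_l}) \mid u_l \in \mathcal{N}^{(d)}(v_l)\}\big), \\ \mathbf{h}^{res}_{v_{l+1}} &= \mathrm{mlp}\big(\mathrm{concat}(\mathbf{h}_{v_l}, \mathbf{h}^{res}_{\mathcal{N}^{(d)}(v_l)})\big), \\ \mathbf{h}_{v_{l+1}} &= \mathbf{h}^{res}_{v_{l+1}} + \mathbf{h}_{v_l}. \end{aligned} \tag{6}$$
In the original GraphSAGE paper, the vertex features are normalized after aggregation. We implement two variants, one without normalization (see Equation (6)), the other one with normalization $\mathbf{h}^{res}_{v_{l+1}} = \mathbf{h}^{res}_{v_{l+1}} / \lVert \mathbf{h}^{res}_{v_{l+1}} \rVert_2$.
GIN. The main difference between GIN and other GCNs is that an $\epsilon$ is learned at each GCN layer to give the central vertex and aggregated neighborhood features different weights. Hence ResGIN is formulated as follows:
$$\begin{aligned} \mathbf{h}^{res}_{v_{l+1}} &= \mathrm{mlp}\big((1 + \epsilon) \cdot \mathbf{h}_{v_l} + \mathrm{sum}(\{\mathbf{h}_{u_l} \mid u_l \in \mathcal{N}^{(d)}(v_l)\})\big), \\ \mathbf{h}_{v_{l+1}} &= \mathbf{h}^{res}_{v_{l+1}} + \mathbf{h}_{v_l}. \end{aligned} \tag{7}$$
MRGCN. We find that first using a max aggregator to aggregate neighborhood relative features is more efficient than aggregating raw neighborhood features or aggregating features after non-linear transforms. We refer to this simple GCN as MRGCN (Max-Relative GCN). The residual version of MRGCN is as such:
$$\begin{aligned} \mathbf{h}^{res}_{v_{l+1}} &= \mathrm{mlp}\big(\mathrm{concat}(\mathbf{h}_{v_l}, \max(\{\mathbf{h}_{u_l} - \mathbf{h}_{v_l} \mid u_l \in \mathcal{N}^{(d)}(v_l)\}))\big), \\ \mathbf{h}_{v_{l+1}} &= \mathbf{h}^{res}_{v_{l+1}} + \mathbf{h}_{v_l}. \end{aligned} \tag{8}$$
Here $\mathbf{h}_{v_l}$ and $\mathbf{h}_{v_{l+1}}$ are the hidden states of vertex $v$ at layers $l$ and $l+1$, and $\mathbf{h}^{res}_{v_{l+1}}$ is the hidden state of the residual graph. All the mlp (multilayer perceptron) functions use a ReLU as activation function; all the max and sum functions above are vertex-wise feature operators; concat functions concatenate features of two vertices into one feature vector. $\mathcal{N}^{(d)}(v_l)$ denotes the neighborhood of vertex $v_l$ obtained from Dilated $k$-NN.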
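One way to see why max-pooling relative features before the non-linear transform is cheaper is to count how often the MLP runs per vertex. The sketch below is an assumption-laden illustration, not the paper's code: `CountingMLP` is a hypothetical stand-in that only applies a ReLU and counts its calls, `edgeconv_like` and `mrgcn_like` are simplified versions of the two orderings, and the residual addition is omitted.

```python
from typing import List, Sequence

Vector = List[float]


class CountingMLP:
    """Hypothetical stand-in for an MLP: a ReLU that counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector) -> Vector:
        self.calls += 1
        return [max(0.0, value) for value in x]


def relative(h_u: Vector, h_v: Vector) -> Vector:
    """Feature of neighbor u relative to the central vertex v."""
    return [a - b for a, b in zip(h_u, h_v)]


def elementwise_max(vectors: Sequence[Vector]) -> Vector:
    return [max(column) for column in zip(*vectors)]


def edgeconv_like(mlp: CountingMLP, h_v: Vector, neighbors: Sequence[Vector]) -> Vector:
    """Transform every (h_v, h_u - h_v) pair, then max-pool: k MLP calls."""
    return elementwise_max([mlp(h_v + relative(h_u, h_v)) for h_u in neighbors])


def mrgcn_like(mlp: CountingMLP, h_v: Vector, neighbors: Sequence[Vector]) -> Vector:
    """Max-pool the relative features first, then transform: one MLP call."""
    pooled = elementwise_max([relative(h_u, h_v) for h_u in neighbors])
    return mlp(h_v + pooled)


h_v = [1.0, -1.0]
neighbors = [[2.0, 0.0], [0.0, 3.0], [4.0, -2.0]]
edge_mlp, mr_mlp = CountingMLP(), CountingMLP()
edge_out = edgeconv_like(edge_mlp, h_v, neighbors)
mr_out = mrgcn_like(mr_mlp, h_v, neighbors)
```

In this toy setting the two outputs coincide only because a bare ReLU commutes with the element-wise max; with learned weights they would generally differ. The call counts, one per neighbor versus one per vertex, are the point of the comparison.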
We propose ResGCN and DenseGCN to handle the vanishing gradient problem of GCNs. To enlarge the receptive field, we define a dilated graph convolution operator for GCNs. To evaluate our framework, we conduct extensive experiments on the tasks of semantic segmentation and part segmentation on large-scale point cloud datasets and demonstrate that our methods significantly improve performance. In addition, we also perform a comprehensive ablation study to show the effect of different components of our framework.
Point cloud segmentation is a challenging task because of the unordered and irregular structure of 3D point clouds. Normally, each point in a point cloud is represented by its 3D spatial coordinates and possibly auxiliary features such as color and surface normal. We treat each point as a vertex $v$ in a directed graph $\mathcal{G}$ and we use $k$-NN to construct the directed dynamic edges between points at every GCN layer (refer to Section 3.1). In the first layer, we construct the input graph $\mathcal{G}_0$ by executing a dilated $k$-NN search to find the nearest neighbors in 3D coordinate space. At subsequent layers, we dynamically build the edges using dilated $k$-NN in feature space. For the segmentation task, we predict the categories of all the vertices at the output layer.
We use the overall accuracy (OA) and mean intersection over union (mIoU) across all classes as evaluation metrics. For each class, the IoU is computed as $IoU = \frac{TP}{T + P - TP}$, where $TP$ is the number of true positive points, $T$ is the number of ground truth points of that class, and $P$ is the number of predicted positive points. We perform the majority of our experiments on semantic segmentation of point clouds on the S3DIS dataset. To motivate the use of deep GCNs, we do a thorough ablation study on area 5 to analyze each component and provide insights. We then evaluate our proposed reference model (backbone of 28 layers with residual graph connections and stochastic dilated graph convolutions) on all 6 areas and compare it to the shallow DGCNN baseline and other state-of-the-art methods. In order to validate that our method is general and does not depend on a specific dataset, we also show results on PartNet for the task of part segmentation of point clouds.
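As a small worked example of these metrics, the sketch below computes per-class IoU with the standard definition $IoU = TP / (T + P - TP)$, then mIoU and OA, on a toy labelling. The function names and the six-point example are hypothetical, and skipping classes that appear in neither the labels nor the predictions is an assumption of this sketch, not something the text specifies.

```python
from typing import Dict, Sequence


def per_class_iou(
    labels: Sequence[int], predictions: Sequence[int], num_classes: int
) -> Dict[int, float]:
    """IoU = TP / (T + P - TP) for every class present in labels or predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    ious: Dict[int, float] = {}
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        t = sum(1 for y in labels if y == c)
        p_count = sum(1 for p in predictions if p == c)
        union = t + p_count - tp
        if union > 0:  # Assumption: absent classes are skipped, not scored as 0.
            ious[c] = tp / union
    return ious


def overall_accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of points whose predicted class matches the ground truth."""
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)


labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(labels, predictions, num_classes=3)
miou = sum(ious.values()) / len(ious)
oa = overall_accuracy(labels, predictions)
```

Here the per-class IoUs are 1/3, 2/3 and 1/2, so mIoU is 0.5 while OA is 4/6; the two metrics can diverge noticeably when classes are imbalanced.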
As shown in Figure 2, all the network architectures in our experiments have three blocks: a GCN backbone block, a fusion block and an MLP prediction block. The GCN backbone block is the only part that differs between experiments. For example, the only difference between PlainGCN and ResGCN is the use of residual skip connections for all GCN layers in ResGCN. Both have the same number of parameters. We linearly increase the dilation rate $d$ of dilated $k$-NN with network depth. For fair comparison, we keep the fusion and MLP prediction blocks the same for all architectures. The GCN backbone block takes as input a point cloud with 4096 points, extracts features by applying consecutive GCN layers to aggregate local information, and outputs a learned graph representation with 4096 vertices. The fusion and MLP prediction blocks follow a similar architecture as PointNet and DGCNN. The fusion block is used to fuse the global and multi-scale local features. It takes as input the extracted vertex features from the GCN backbone block at every GCN layer and concatenates those features, then passes them through a $1 \times 1$ convolution layer followed by max pooling. The latter layer aggregates the vertex features of the whole graph into a single global feature vector, which in turn is concatenated with the feature of each vertex from all previous GCN layers (fusion of global and local information). The MLP prediction block applies three MLP layers to the fused features of each vertex/point to predict its category. In practice, these layers are $1 \times 1$ convolutions.
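The data flow of the fusion block can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `fuse` and the toy features are hypothetical, and the learned convolution that precedes max pooling in the description above is omitted, so only the concatenate, pool, and broadcast steps are shown.

```python
from typing import List, Sequence

Graph = List[List[float]]  # one feature vector per vertex


def fuse(layer_outputs: Sequence[Graph]) -> Graph:
    """Fuse per-layer vertex features with a max-pooled global feature."""
    # Local multi-scale feature: concatenate each vertex's features across layers.
    local = [
        sum((list(h) for h in per_vertex), [])
        for per_vertex in zip(*layer_outputs)
    ]
    # Global feature: element-wise max over all vertices of the graph.
    global_feature = [max(column) for column in zip(*local)]
    # Attach the same global feature to every vertex.
    return [h + global_feature for h in local]


# Two vertices; layer 1 emits 1 feature per vertex, layer 2 emits 2.
layer1 = [[1.0], [2.0]]
layer2 = [[0.0, 5.0], [3.0, 4.0]]
fused = fuse([layer1, layer2])
```

Every vertex ends up with its own multi-layer features followed by the shared global vector, which is the input the per-vertex prediction layers would consume.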
PlainGCN. This baseline model consists of a PlainGCN backbone block, a fusion block, and an MLP prediction block. The backbone stacks EdgeConv layers with dynamic $k$-NN, each of which is similar to the one used in DGCNN. No skip connections are used here.
ResGCN. We construct ResGCN by adding dynamic dilated $k$-NN and residual graph connections to PlainGCN. These connections between all GCN layers in the GCN backbone block do not increase the number of parameters.
DenseGCN. Similarly, DenseGCN is built by adding dynamic dilated $k$-NN and dense graph connections to the PlainGCN. As described in Section 3.3, dense graph connections are created by concatenating all the intermediate graph representations from previous layers. The dilation rate schedule of our DenseGCN is the same as for ResGCN.
For semantic segmentation on S3DIS, we implement our models using TensorFlow, and for part segmentation on PartNet, we implement them using PyTorch. For fair comparison, we use the Adam optimizer with the same initial learning rate and the same learning rate schedule for all experiments; the learning rate decays at regular intervals of gradient descent steps. Batch normalization is applied to every layer. Dropout is used at the second MLP layer of the MLP prediction block. As mentioned in Section 3.4, we use dilated $k$-NN with a small random uniform sampling probability $\epsilon$ for GCNs with dilations. In order to isolate the effect of the proposed deep GCN architectures, we do not use any data augmentation or post processing techniques. We train our models end-to-end from scratch. We evaluate every epoch on the test set and report the best result for each model. The networks are trained with two NVIDIA Tesla V100 GPUs using data parallelism in semantic segmentation on S3DIS, with the same batch size on each GPU. For part segmentation on PartNet, the networks are trained with one NVIDIA Tesla V100 GPU.
For convenient referencing, we use the naming convention BackboneBlock-#Layers to denote the key models in our analysis. We focus on residual graph connections for our analysis, since ResGCN-28 is easier and faster to train, but we expect that our observations also hold for dense graph connections.
In order to thoroughly evaluate the ideas proposed in this paper, we conduct extensive experiments on the Stanford large-scale 3D Indoor Spaces Dataset (S3DIS), a large-scale indoor dataset for 3D semantic segmentation of point clouds. S3DIS covers a large indoor area with semantically annotated 3D meshes and point clouds. In particular, the dataset contains 6 large-scale indoor areas represented as colored 3D point clouds with a total of 695,878,620 points. As is common practice, we train networks on 5 out of the 6 areas and evaluate them on the left out area.
We begin with the extensive ablation study where we evaluate the trained models on area 5, after training them on the remaining ones. Our aim is to shed light on the contribution of each component of our novel network architectures. To this end, we investigate the performance of different ResGCN architectures, e.g. with dynamic dilated $k$-NN, with regular dynamic $k$-NN (without dilation), and with fixed edges. We also study the effect of different parameters, e.g. number of $k$-NN neighbors (4, 8, 16, 32), number of filters (32, 64, 128), and number of layers (7, 14, 28, 56). To ensure that our contributions (residual/dense connections, dilated graph convolutions) are general, we apply them to multiple GCN variants that have been proposed in the literature. Overall, we conduct 25 experiments and report detailed results in Table I. We also summarize the most important insights of the ablation study in Figure 4. In the following, we discuss each block of experiments.
Effect of residual graph connections. Our experiments in Table I (Reference) show that residual graph connections play an essential role in training deeper networks, as they tend to result in more stable gradients. This is analogous to the insight from CNNs. When the residual graph connections between layers are removed (i.e. in PlainGCN-28), performance dramatically degrades (-12% mIoU). Figure 5 shows the importance of residual graph connections very clearly. As network depth increases, skip connections become critical for convergence. We also show similar performance gains by combining residual graph connections and dilated graph convolutions with other types of GCN layers. These results can be seen in the ablation study Table I (GCN Variants) and are further discussed later on in this section.
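The difference between a plain layer and a residual layer can be sketched as follows. This is a toy illustration only: the mean aggregation and the scalar `weight` are hypothetical stand-ins for a real GCN operator such as EdgeConv, and the helper names are our own.

```python
def mean_aggregate(features, neighbors):
    """Average each vertex's neighbor features (toy stand-in for a GCN operator)."""
    dim = len(features[0])
    return [
        [sum(features[j][c] for j in nbrs) / len(nbrs) for c in range(dim)]
        for nbrs in neighbors
    ]


def gcn_layer(features, neighbors, weight, residual):
    """Apply F(h) = ReLU(weight * aggregate(h)); add the input h if residual."""
    transformed = [
        [max(0.0, weight * x) for x in row]
        for row in mean_aggregate(features, neighbors)
    ]
    if not residual:
        return transformed  # plain layer: h_next = F(h)
    # Residual layer: h_next = F(h) + h, so the input always has an identity path.
    return [
        [t + h for t, h in zip(t_row, h_row)]
        for t_row, h_row in zip(transformed, features)
    ]


# Hypothetical 3-vertex graph with 1-dimensional features.
features = [[1.0], [2.0], [3.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
plain_out = gcn_layer(features, neighbors, weight=0.5, residual=False)
res_out = gcn_layer(features, neighbors, weight=0.5, residual=True)
```

The identity path in the residual variant is what keeps the signal, and the gradient flowing back through it, from shrinking layer after layer in a deep stack.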
Effect of dilation. Results in Table I (Dilation) show that dilated graph convolutions account for a 2.85% improvement in mean IoU (row 3), motivated primarily by the expansion of the network’s receptive field. We find that adding stochasticity to the dilated k-NN does help performance but not to a significant extent. Interestingly, our results in Table I also indicate that dilation especially helps deep networks when combined with residual graph connections (rows 1,8). Without such connections, performance can actually degrade with dilated graph convolutions. The reason for this is probably that these varying neighbors result in ‘worse’ gradients, which further hinder convergence when residual graph connections are not used.
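A minimal sketch of dilated k-NN with optional stochasticity is given below. It assumes that dilation keeps every d-th vertex among the k·d nearest neighbors and that, with a small probability, k neighbors are instead drawn uniformly from that same candidate set. The function name, the parameter `epsilon`, and the brute-force distance sort are illustrative choices, not the released implementation.

```python
import math
import random


def dilated_knn(points, query_index, k, dilation, epsilon=0.0, rng=None):
    """Return k neighbor indices for one query point (illustrative sketch)."""
    rng = rng or random.Random()
    query = points[query_index]
    # Candidate set: the k * dilation nearest neighbors, sorted by distance.
    candidates = sorted(
        (i for i in range(len(points)) if i != query_index),
        key=lambda i: math.dist(points[i], query),
    )[: k * dilation]
    if rng.random() < epsilon:
        # Stochastic branch: sample uniformly from the candidate set.
        return rng.sample(candidates, min(k, len(candidates)))
    # Deterministic branch: keep every `dilation`-th candidate.
    return candidates[::dilation][:k]


# Hypothetical points on a line; vertex 0 is the query.
points = [(float(i), 0.0) for i in range(9)]
regular = dilated_knn(points, 0, k=3, dilation=1)
dilated = dilated_knn(points, 0, k=3, dilation=2)
```

With the same k, the dilated neighborhood reaches twice as far from the query, which is how the receptive field grows without adding neighbors.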
Effect of dynamic k-NN. While we observe an improvement when updating the nearest neighbors after every layer, we would also like to point out that it comes at a relatively high computational cost. We show different variants without dynamic edges in Table I (Fixed k-NN).
Effect of dense graph connections. We observe similar performance gains with dense graph connections (DenseGCN-28) in Table I (Connections). However, with a naive implementation, the memory cost is prohibitive. Hence, the largest model we can fit into GPU memory uses only filters and nearest neighbors, as compared to filters and neighbors in the case of its residual counterpart ResGCN-28. Since the performance of these two deep GCN variants is similar, residual connections are more practical for most use cases and hence we focus on them in our ablation study. Yet, we do expect the same insights to transfer to the case of dense graph connections.
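To see why a naive dense implementation is memory-hungry, the sketch below counts the input feature width of each layer. It assumes every layer emits the same number of channels and that a dense layer receives the concatenation of the initial features and all previous outputs. The layer and filter counts used are hypothetical and are not the configurations from Table I.

```python
def layer_input_widths(num_layers, num_filters, dense):
    """Input channel count per layer for residual (constant) or dense (growing) stacks."""
    widths = []
    for layer in range(num_layers):
        if dense:
            # Concatenation of the initial features and `layer` previous outputs.
            widths.append(num_filters * (layer + 1))
        else:
            # A residual sum keeps the width unchanged.
            widths.append(num_filters)
    return widths


# Hypothetical 4-layer stacks with 16 filters each.
dense_widths = layer_input_widths(4, 16, dense=True)
residual_widths = layer_input_widths(4, 16, dense=False)
```

The dense widths grow linearly with depth, so the total activation memory grows roughly quadratically, whereas the residual stack grows only linearly.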
Effect of nearest neighbors. Results in Table I (Neighbors) show that a larger number of neighbors helps in general. As the number of neighbors is decreased by a factor of 2 and 4, the performance drops by 2.5% and 3.3% respectively. However, a large number of neighbors only results in a performance boost if the network capacity is sufficiently large. This becomes apparent when we increase the number of neighbors by a factor of 2 and decrease the number of filters by a factor of 2 (refer to row 3 in Neighbors).
Effect of network depth. Table I (Depth) shows that increasing the number of layers improves network performance, but only if residual graph connections and dilated graph convolutions are used, as is clearly shown in Table I (Connections).
Effect of network width. Results in Table I (Width) show that increasing the number of filters leads to a similar increase in performance as increasing the number of layers. In general, a higher network capacity enables learning nuances necessary for succeeding in corner cases.
GCN Variants. Our experiments in Table I (GCN Variants) show the effect of using different GCN operators. The results clearly show that different deep GCN variants with residual graph connections and dilated graph convolutions converge better than the PlainGCN. Using our proposed MRGCN operator achieves almost the same performance as the ResGCN reference model which relies on EdgeConv while only using half of the GPU memory. The GraphSAGE operator performs slightly worse and our results also show that using normalization (i.e. GraphSAGE-N) is not essential. Interestingly, when using the GIN- operator, the network converges well during the training phase and has a high training accuracy but fails to generalize to the test set. This phenomenon is also observed in the original paper  in which they find setting to leads to the best performance.
Qualitative Results. Figure 6 shows qualitative results on S3DIS, area 5. As expected from the results in Table I, our ResGCN-28 and DenseGCN-28 perform particularly well on difficult classes such as board, beam, bookcase and door. Rows 1-4 clearly show how ResGCN-28 and DenseGCN-28 are able to segment the board, beam, bookcase and door respectively, while PlainGCN-28 completely fails. Please refer to the supplementary material for further qualitative results.
Many important parts of the objects are classified incorrectly using PlainGCN-28.
Comparison to state-of-the-art. Finally, we compare our reference network (ResGCN-28), which incorporates the ideas put forward in the methodology, to several state-of-the-art baselines in Table II. The results clearly show the effectiveness of deeper models with residual graph connections and dilated graph convolutions. ResGCN-28 outperforms DGCNN by 3.9% (absolute) in mean IoU. DGCNN has the same fusion and MLP prediction blocks as ResGCN-28 but a shallower PlainGCN-like backbone block. Furthermore, we outperform all baselines in 9 out of 13 classes. We perform particularly well in the difficult object classes such as board, where we achieve 51.1%, and sofa, where we improve state-of-the-art by about 10% mIoU.
This significant performance improvement on the difficult classes is probably due to the increased network capacity, which allows the network to learn subtle details necessary to distinguish between a board and a wall for example. The first row in Figure 6 is a representative example for this occurrence. Our performance gains are solely due to our innovation in the network architecture, since we use the same hyper-parameters and even learning rate schedule as the baseline DGCNN and only decrease the number of nearest neighbors from to and the batch size from to due to memory constraints. We outperform state-of-the-art methods by a significant margin and expect further improvement from tweaking the hyper-parameters, especially the learning schedule.
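For reference, the mean IoU metric used throughout these comparisons can be sketched as follows. This is a generic definition, not the evaluation script of the benchmark; in particular, skipping classes that appear in neither the predictions nor the labels is an assumption of this sketch.

```python
def mean_iou(predictions, labels, num_classes):
    """Mean intersection-over-union across classes for per-point class ids."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(predictions, labels))
        intersection = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both inputs
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 6-point example with 3 classes.
predictions = [0, 0, 1, 1, 2, 2]
labels = [0, 1, 1, 1, 2, 0]
score = mean_iou(predictions, labels, num_classes=3)
```

Because every class contributes equally to the mean, gains on rare and difficult classes such as board move the metric as much as gains on frequent ones.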
|Method||bed||bottle||chair||clock||dishw.||disp.||door||earph.||fauc.||knife||lamp||micro.||fridge||st. furn.||table||tr. can||vase|
We further experiment with our architecture on the task of part segmentation and evaluate it on the recently proposed large-scale PartNet  dataset. PartNet consists of over 26,671 3D models from 24 object categories with 573,585 annotated part instances. The dataset establishes three benchmarking tasks for part segmentation on 3D objects: fine-grained semantic segmentation, hierarchical semantic segmentation, and instance segmentation. For the following experiments, we focus on the fine-grained level of semantic segmentation which includes 17 out of the 24 object categories present in the PartNet dataset.
As is common practice, we use 10,000 sampled points as input to our network architecture. We use the same reference architecture, ResGCN-28, as for the S3DIS dataset. We compare it to several state-of-the-art baseline architectures with default training hyper-parameters as reported in the original papers, namely PointNet, PointNet++, SpiderCNN and PointCNN.
Qualitative results. Figure 7 shows qualitative results on 4 categories of PartNet : bottle, bed, microwave, and refrigerator. As expected from the results in Table III, ResGCN-28 performs very well compared to the baseline PlainGCN-28, where there are no residual connections between layers. Although ResGCN-28 produces some incorrect outputs compared to the ground truth in categories like microwave and refrigerator, it still outperforms PlainGCN-28 and segments the important parts of the object. We provide further qualitative results in the supplementary material.
Comparison to state-of-the-art. We summarize the results of our performance compared to other state-of-the-art methods in Table III. Our model ResGCN-28 performs very well in comparison to current state-of-the-art methods. In particular, ResGCN-28 achieves better results than previous methods in the categories dishwasher, display, door, earphone, knife, microwave, refrigerator, trash can and vase.
ResGCN-28 performs substantially better than all previous methods, including PointCNN, in the difficult door category with an improvement of 4.7% in terms of part-category mIoU. Note that PointCNN uses heavy data augmentation and hyper-parameter tuning unlike our ResGCN-28. For this reason, ResGCN-28 is outperformed by PointCNN in many categories. However, despite this huge disadvantage, ResGCN-28 outperforms PointCNN on some categories including door, display, knife and refrigerator where PointCNN achieves 37.8%, 82.5%, 34.1% and 42.9% respectively.
In order to demonstrate the generality of our contributions and specifically our deep ResGCN architecture, we conduct further experiments on general graph data. We choose the popular task of node classification on biological graph data which is very different from point cloud data. In the following experiments, we mainly study the effects of skip connections, the number of GCN layers (i.e. depth), number of filters per layer (i.e. width) and different graph convolutional operators.
The main difference between biological networks and point cloud data is that biological networks have inherent edge information and high dimensional input features. For the graph learning task on biological networks, we use the PPI  dataset to evaluate our architectures. PPI is a popular dataset for multi-label node classification, containing 24 graphs with 20 in the training set, 2 in the validation set, and 2 in the testing set. Each graph in PPI corresponds to a different human tissue, each node in a graph represents a protein and edges represent the interaction between proteins. Each node has positional gene sets, motif gene sets and immunological signatures as input features (50 in total) and 121 gene ontology sets as labels. The input of the task is a graph which contains 2373 nodes on average and the goal is to predict which labels are contained in each node.
We use essentially the same reference architecture as for point cloud segmentation described in Section 4.1 but predict multiple labels. The number of filters of the first layer and last layer are changed to adapt to this task. Instead of constructing edges by means of -NN, we use the edges provided by PPI directly. If we were to construct edges dynamically there is a chance to lose the rich information provided by the edges.
We use the micro-average F1 (m-F1) score as the evaluation metric. For each graph, the F1 score is computed as follows:
$$\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$
where $\text{precision} = \frac{TP}{TP + FP}$, $\text{recall} = \frac{TP}{TP + FN}$, $TP$ is the total number of true positive points of all the classes, $TP + FN$ is the number of ground truth positive points, and $TP + FP$ is the number of predicted positive points. We find the best model, i.e. the one with the highest accuracy (m-F1 score) on the validation set in the training phase, and then calculate the m-F1 score across all the graphs in the test dataset.
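The micro-averaged F1 computation for multi-label node classification can be sketched as follows. The function name and the convention of returning 0 when there are no true positives are assumptions of this sketch. Inputs are per-node lists of 0/1 indicators, one entry per label.

```python
def micro_f1(predictions, labels):
    """Micro-averaged F1: pool TP, FP and FN over all nodes and all labels."""
    tp = fp = fn = 0
    for pred_row, label_row in zip(predictions, labels):
        for p, t in zip(pred_row, label_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    if tp == 0:
        return 0.0  # assumption: define F1 as 0 when nothing is correctly predicted
    precision = tp / (tp + fp)  # TP over all predicted positives
    recall = tp / (tp + fn)     # TP over all ground truth positives
    return 2 * precision * recall / (precision + recall)


# Hypothetical graph with 2 nodes and 3 labels per node.
predictions = [[1, 0, 1], [0, 1, 1]]
labels = [[1, 1, 0], [0, 1, 1]]
score = micro_f1(predictions, labels)
```

Pooling the counts before computing precision and recall means frequent labels weigh more than rare ones, which is the defining property of the micro average.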
We show the performance and GPU memory usage of our proposed MRGCN and compare them with other graph convolutions, e.g. EdgeConv, GATConv, SemiGCN and GINConv. We conduct an extensive ablation study on the number of filters and the number of GCN layers to show their effect in the backbone network. Our ablation study also includes experiments to show the importance of residual graph connections and dense graph connections in DeepGCNs. To ensure a fair comparison, all networks in our ablation study share the same architecture. Finally, we compare our best models to several state-of-the-art methods.
On this biological network node classification task, we implement all our models based on PyTorch Geometric. We use the Adam optimizer with the same initial learning rate and learning rate schedule with learning rate decay of every gradient descent steps for all the experiments. The networks are trained with one NVIDIA Tesla V100 GPU with a batch size of . Dropout with a rate of is used at the first and second MLP layers of the prediction block. For fair comparison, we do not use any data augmentation or post processing techniques. Our models are trained end-to-end from scratch.
We study the effect of residual graph connections and dense graph connections on the performance of multi-label node classification. We also investigate the influence of different parameters, e.g. the number of filters and the number of layers. To show the generality of our framework, we also apply the proposed residual graph connections to multiple GCN variants.
Effect of graph connections. Our experiments in Table IV show that both residual graph connections and dense graph connections help train deeper networks. When the network is shallow, models with graph connections achieve similar performance as models without graph connections. However, as the network goes deeper, the performance of models without graph connections drops dramatically, while the performance of models with graph connections is stable or even improves further. For example, when the number of filters is 32 and the depth is 112, the performance of ResMRGCN-112 is nearly 37.66% higher than PlainMRGCN-112 in terms of the m-F1 score. We note that DenseMRGCN achieves slightly better performance than ResMRGCN with the same network depth and width.
Effect of network depth. The results in Table IV show that increasing the number of layers improves network performance if residual graph connections or dense graph connections are used. Although ResMRGCN has a slight performance drop when the number of layers reaches , the m-F1 score is still much higher than the corresponding PlainMRGCN. The performance of DenseMRGCN increases reliably as the network goes deeper; however DenseMRGCN consumes more memory than ResMRGCN due to concatenations of feature maps. Due to this memory issue we are unable to train some models and denote them with ’-’ in Table IV. Meanwhile, PlainMRGCN, which has no graph connections, only enjoys a slight performance gain as the network depth increases from to . For depths beyond layers, the performance drops significantly. Clearly, using graph connections leads to better performance, especially for deeper networks where it becomes essential.
Effect of network width. Results of each row in Table IV show that increasing the number of filters can increase performance consistently. A higher number of filters can also help convergence for deeper networks. However, a large number of filters is very memory consuming. Hence, we only consider networks with up to filters in our experiments.
Effect of GCN variants. Table V shows the effect of using different GCN operators with different model depths. Residual graph connections and GCN operators are the only difference when the number of layers is kept the same. The results clearly show that residual graph connections in deep networks can help different GCN operators achieve better performance than the PlainGCN. Interestingly, when the network goes deeper, the performance of PlainSemiGCN, PlainGAT and PlainGIN decreases dramatically; meanwhile, PlainEdgeConv and PlainMRGCN only observe a relatively small performance drop. Besides, our proposed MRGCN operator achieves the best performance among all the models.
Memory usage of GCN variants. In Figure 8 we compare the total memory usage and performance of different GCN operators. All these models share the same architecture except for the GCN operations. They all have layers, filters and use residual graph connections. We implement all the models with PyTorch Geometric and they are all trained on one NVIDIA Tesla V100. The GPU memory usage is measured when the memory usage is stable. Results show that our proposed ResMRGCN achieves the best performance and only uses around % GPU memory compared to ResEdgeConv.
Comparison to state-of-the-art. Finally, we compare our DenseMRGCN-14 and ResMRGCN-28 to several state-of-the-art baselines in Table VI. The results clearly show the effectiveness of deeper models with residual graph connections and dense graph connections. DenseMRGCN-14 and ResMRGCN-28 outperform the previous state-of-the-art Cluster-GCN  by 0.07% and 0.05% respectively. It is worth mentioning that there are a total of ten models in Table IV which surpass Cluster-GCN and achieve the new state-of-the-art.
|Number of filters||32||64||128||256|
|Number of layers||3||7||14||28||56|
|Model||m-F1 score (%)|
This work shows how proven concepts from CNNs (i.e. residual/dense connections and dilated convolutions) can be transferred to GCNs in order to make GCNs go as deep as CNNs. Adding skip connections and dilated convolutions to GCNs alleviates the training difficulty that was preventing GCNs from going deeper and impeding further progress. A large number of experiments on semantic segmentation and part segmentation of 3D point clouds and node classification on biological graphs show the benefit of deeper architectures through state-of-the-art performance. We also show that our approach generalizes across several GCN operators.
On the point cloud data we achieve the best results using EdgeConv as GCN operators for our backbone networks. Moreover, we find that dilated graph convolutions help to gain a larger receptive field without loss of resolution. Even with a small number of nearest neighbors, deep GCNs can achieve high performance on point cloud semantic segmentation. ResGCN-112 and ResGCN-56 perform very well on this task, although they only use and nearest neighbors respectively compared to for ResGCN-28. On the biological graph data we achieve the best results using the MRGCN operator which we propose as a novel memory-efficient alternative to EdgeConv. We successfully trained ResMRGCN-112 and DenseMRGCN-56; both networks converged very well and achieved state-of-the-art results on the PPI dataset.
Our results show that after solving the vanishing gradient problem plaguing deep GCNs, we can either make GCNs deeper or wider to get better performance. We expect GCNs to become a powerful tool for processing graph-structured data in computer vision, natural language processing, and data mining. We show successful cases for adapting concepts from CNNs to GCNs (i.e. skip connection and dilated convolutions). In the future, it will be worthwhile to explore how to transfer other operators (e.g. deformable convolutions), other architectures (e.g. feature pyramid architectures), etc. It will also be interesting to study different distance measures to compute dilated k-NN, constructing graphs with different k at each layer, better dilation rate schedules [63, 60] for GCNs, and combining residual and dense connections.
We also point out that, for the specific task of point cloud semantic segmentation, the common approach of processing the data in columns is sub-optimal for graph representation. A more suitable sampling approach should lead to further performance gains on this task. For the task of node classification, the existing datasets are relatively small. We expect that experimenting on larger datasets will further unleash the full potential of DeepGCNs.
The authors thank Adel Bibi for his help with the project. This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.
Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2D-3D-Semantic Data for Indoor Scene Understanding,” ArXiv e-prints, Feb. 2017.
J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Simaan, “Graph convolutional encoders for syntax-aware neural machine translation,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1957–1967.
X. Ye, J. Li, H. Huang, L. Du, and X. Zhang, “3d recurrent neural networks with context fusion for point cloud semantic segmentation,” in European Conference on Computer Vision. Springer, 2018, pp. 415–430.
We conduct a run-time experiment comparing the inference time of the reference model ResGCN-28 (28 layers, k=16) with dynamic k-NN and fixed k-NN. The inference time with fixed k-NN is 45.63ms. Computing the dynamic k-NN increases the inference time by 150.88ms. It is possible to reduce computation by updating the k-NN less frequently (e.g. computing the dynamic k-NN every 3 layers).
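The trade-off of updating the k-NN graph less frequently can be sketched with a back-of-the-envelope estimate. The sketch assumes, purely for illustration, that the 150.88ms dynamic k-NN overhead is spread evenly over the 28 layers; this was not measured, and both function names are hypothetical.

```python
def knn_update_layers(num_layers, update_every):
    """Indices of the layers at which the k-NN graph would be recomputed."""
    return [layer for layer in range(num_layers) if layer % update_every == 0]


def estimated_inference_ms(fixed_ms, dynamic_overhead_ms, num_layers, update_every):
    """Rough estimate assuming each k-NN update costs the same (an assumption)."""
    per_update_ms = dynamic_overhead_ms / num_layers
    num_updates = len(knn_update_layers(num_layers, update_every))
    return fixed_ms + per_update_ms * num_updates


# Measured values from the text: 45.63ms with fixed k-NN, +150.88ms with dynamic k-NN.
every_layer = estimated_inference_ms(45.63, 150.88, num_layers=28, update_every=1)
every_third = estimated_inference_ms(45.63, 150.88, num_layers=28, update_every=3)
```

Under this even-cost assumption, recomputing the graph at every third layer would remove roughly two thirds of the k-NN overhead; the real saving depends on the feature dimensionality at each layer.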
To showcase the consistent improvement of our framework over the baseline DGCNN, we reproduce the results of DGCNN (the results across all classes were not provided in the DGCNN paper) in Table VII and find our method outperforms DGCNN in all classes.
|Class||DGCNN ||ResGCN-28 (Ours)|