Progressive Graph Convolutional Networks for Semi-Supervised Node Classification

03/27/2020 ∙ by Negar Heidari, et al. ∙ Aarhus Universitet

Graph convolutional networks have been successful in addressing graph-based tasks such as semi-supervised node classification. Existing methods use a network structure with a fixed number of layers, defined by the user based on experimentation, and employ a layer-wise propagation rule to obtain the node embeddings. Designing an automatic process to define a problem-dependent architecture for graph convolutional networks can greatly help to reduce the computational complexity of the training process. In this paper, we propose a method to automatically build compact and task-specific graph convolutional networks. Experimental results on widely used publicly available datasets indicate that the proposed method outperforms the related graph-based learning algorithms in terms of classification performance and network compactness.




I Introduction

Convolutional Neural Networks (CNNs) [1], as an end-to-end deep learning paradigm, have been very successful in many machine learning and computer vision tasks [2, 3, 4]. While CNNs are highly capable of extracting latent representations and locally meaningful statistical patterns from data, they can only operate on Euclidean data structures such as audio, images and videos, which have a form based on 1D, 2D and 3D regular grids, respectively. Recently, there has been increasing research interest in applying deep learning approaches to non-Euclidean data structures, like graphs, which lack geometrical properties but are well suited to modeling complex irregular data such as social networks [5, 6], citation networks and knowledge graphs.


Motivated by CNNs, the notion of convolution was generalized from grid data to graph structures, which correspond to locally connected data items, by aggregating each node’s neighbors’ features with its own features [8]. By combining such a convolutional operator with a data transformation process and hierarchical structures, Graph Convolutional Networks (GCNs) are obtained, which can be trained in an end-to-end fashion to optimize an objective defined on individual graph nodes, or on the graph as a whole. As an example, the GCN method proposed in [6] is a semi-supervised node classification method consisting of two convolutional layers. Each graph convolutional layer learns nodes’ features by applying an aggregation rule on their corresponding first-order neighbors. GCNs [9] have been successful in various graph mining tasks like node classification [10], graph classification [11], link prediction [12], and visual data analysis [13, 14].

One of the main drawbacks of existing GCN-based methods is that they require an extensive hyper-parameter search process to determine a good topology of the network. This process is commonly based on extensive experimentation involving the training of multiple GCN topologies and monitoring their performance on a hold-out set. To avoid such a computationally demanding process, a common practice is to empirically define a network topology and use it for all examined problems (data sets). Such a fixed topology is not necessarily suited to every problem. As an example, a two-layer GCN model cannot gain sufficient global information, since it aggregates information from just a two-hop neighborhood of each node. Meanwhile, the results reported in [6] show that adding more layers makes the training process harder and does not necessarily improve the classification performance. It was also shown that the topology of the GCN plays a crucial role in performance, which is related to the underlying difficulty of the problem to be solved [15].

Problem-specific design of the neural network’s architecture contributes to improving the performance and the efficiency of a learning system. Recently, methods for finding an optimized network topology have been receiving much attention, and many works have been proposed to define compact network topologies by employing various learning strategies, such as compressing pre-trained networks, adding neurons progressively to the network, pruning the network’s weights and applying weight quantization [16, 17, 18, 19, 20, 21]. However, all these methods work with grid data in Euclidean spaces. In this regard, learning a compact topology for GCNs is a step towards increasing the training efficiency and reducing the computational complexity and storage needed, while achieving comparable performance with existing methods.

In this paper, we propose a method to jointly define a problem-specific GCN topology and optimize its parameters, by progressively growing the network’s structure both in width and depth. The contributions of our work are:

  • We propose a method to learn an optimized and problem-specific GCN topology progressively without user intervention. The resulting networks are compact in terms of number of parameters, while performing on par, or even better, compared to other recent GCN models.

  • We provide a convergence analysis for the proposed approach, showing that the progressively built GCN topology is guaranteed to converge to a (local) minimum.

  • We conduct experiments on widely-used graph datasets and compare the proposed method with recently proposed GCN models to demonstrate both the efficiency and competitive performance of the proposed method. Our experiments include an analysis of the effect of the network’s complexity with respect to the underlying complexity of the classification problem, highlighting the importance of network’s topology optimization.

The rest of the paper is organized as follows: Section 2 provides a description of graph-based semi-supervised classification, along with the terminology used in this paper. Section 3 reviews the GCN method [6] as the baseline method of our work. The proposed method is described in detail in Section 4. The conducted experiments are described in Section 5, and conclusions are drawn in Section 6.

II Semi-supervised graph-based classification

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected graph where $\mathcal{V}$ and $\mathcal{E}$ denote the set of nodes $v_i \in \mathcal{V}$ and the set of edges $(v_i, v_j) \in \mathcal{E}$, respectively. $\mathbf{A} \in \mathbb{R}^{N \times N}$ denotes the adjacency matrix of $\mathcal{G}$ encoding the node connections, where $N$ is the number of nodes. The elements $A_{ij}$ can be either binary values, indicating presence or absence of an edge between nodes $v_i$ and $v_j$, or real values encoding the similarity between $v_i$ and $v_j$, based on a similarity measure. Using $\mathbf{A}$, we define the degree matrix $\mathbf{D}$, which is a diagonal matrix with elements equal to $D_{ii} = \sum_j A_{ij}$. Each node $v_i$ of the graph is also equipped with a representation $\mathbf{x}_i \in \mathbb{R}^{D}$, which is used to form the feature vector matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$. When such a feature vector for each graph node is not readily available for the problem at hand (e.g. in the case of processing citation graphs), vector-based node representations are learned by using some node embedding method, like [22, 23].

Traditional graph-based semi-supervised node classification methods [24, 25, 26, 27] learn a mapping from the nodes’ feature vectors to labels, which for a $C$-class classification problem are usually represented by $C$-dimensional vectors following the $1$-of-$C$ encoding scheme. This mapping exploits a graph-based regularization term and it commonly has the form:

$$\mathcal{J} = \mathcal{J}_0\big(f(\mathbf{X}_{\mathrm{lab}}), \mathbf{T}_{\mathrm{lab}}\big) + \lambda \, \mathrm{Tr}\big(f(\mathbf{X})^T \mathbf{L} f(\mathbf{X})\big), \tag{1}$$

where $f(\cdot)$ denotes the function mapping the $D$-dimensional node representations to the $C$-dimensional class vectors, $\mathbf{T}_{\mathrm{lab}}$ is a matrix formed by the class label vectors of the labelled nodes which form the matrix $\mathbf{X}_{\mathrm{lab}}$, and $\mathbf{L} = \mathbf{D} - \mathbf{A}$ is the unnormalized graph Laplacian matrix. The first term in Eq. (1) is the classification loss of the trained model measured on the labelled graph nodes, and the second term corresponds to graph Laplacian-based regularization incorporating a smoothness prior into the optimization function $\mathcal{J}$. $\lambda$ expresses the relative importance of the two terms. By following this approach, the label information of the labelled graph nodes is propagated over the entire graph.
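To make the Laplacian-based regularizer concrete, the following minimal NumPy sketch evaluates the smoothness term for a toy graph in two equivalent ways: as a trace involving the unnormalized Laplacian, and as a weighted sum of squared differences between the outputs of connected nodes. The function names and the toy graph are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def unnormalized_laplacian(A):
    """Return L = D - A for a symmetric adjacency matrix A."""
    D = np.diag(A.sum(axis=1))
    return D - A

def smoothness(F, A):
    """Smoothness term Tr(F^T L F) for node outputs F of shape (N, C)."""
    L = unnormalized_laplacian(A)
    return float(np.trace(F.T @ L @ F))

def pairwise_smoothness(F, A):
    """Same quantity written as 0.5 * sum_ij A_ij * ||F_i - F_j||^2."""
    diff = F[:, None, :] - F[None, :, :]           # shape (N, N, C)
    return float(0.5 * np.sum(A * np.sum(diff ** 2, axis=-1)))

# Toy path graph 0 - 1 - 2 (hypothetical example data).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
# Outputs that change gradually along the path.
F = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])

print(smoothness(F, A), pairwise_smoothness(F, A))
```

Both functions return the same value, which is small when connected nodes receive similar outputs; this is the sense in which the second term of Eq. (1) acts as a smoothness prior.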

III Graph Convolutional Networks

GCNs are mainly categorized into spatial-based and spectral-based methods. The spatial-based methods update the features of each node by aggregating its spatially close neighbors’ feature vectors. In these methods, the convolution operation is defined on the graph with a specified neighborhood size which propagates the information locally [10, 28, 29, 5].

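The spectral-based GCN methods follow a graph signal processing approach [6]. Let us denote by $g_\theta$ a filter and by $\mathbf{X}$ a multi-dimensional signal defined over the nodes of the graph. The signal transformation using $g_\theta$ is given by:

$$g_\theta \star \mathbf{X} = \mathbf{U} \, g_\theta(\mathbf{\Lambda}) \, \mathbf{U}^T \mathbf{X}, \tag{2}$$

where $\star$ denotes the convolution operator, $\mathbf{U}$ is the matrix of eigenvectors of the normalized graph Laplacian $\mathbf{L}_{\mathrm{sym}} = \mathbf{I}_N - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$, with $\mathbf{\Lambda}$ being a diagonal matrix having as elements the corresponding eigenvalues, and $\mathbf{U}^T \mathbf{X}$ being the graph Fourier transform of $\mathbf{X}$. Since computing the eigen-decomposition of $\mathbf{L}_{\mathrm{sym}}$ is computationally expensive, low-rank approximations using truncated Chebyshev polynomials have been proposed [30]. The transformation in Eq. (2) corresponds to the building block of a GCN. Followed by an element-wise activation, it forms a GCN layer.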
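As an illustration of the spectral view, the sketch below filters a signal through an explicit eigen-decomposition of the normalized Laplacian. It assumes a small dense graph without isolated nodes; `spectral_filter` and the callable `g` (a function of the eigenvalues) are hypothetical names chosen for this example. The explicit decomposition is exactly the step that becomes expensive on large graphs.

```python
import numpy as np

def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2}; assumes every node has at least one edge."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def spectral_filter(A, X, g):
    """Apply U g(Lambda) U^T X, where g maps eigenvalues to filter responses."""
    eigvals, U = np.linalg.eigh(normalized_laplacian(A))
    X_hat = U.T @ X                        # graph Fourier transform of X
    return U @ (g(eigvals)[:, None] * X_hat)

# Toy path graph 0 - 1 - 2 with a 2-dimensional signal (hypothetical data).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0]])

# A simple low-pass response that attenuates high graph frequencies.
X_smooth = spectral_filter(A, X, lambda lam: 1.0 - lam / 2.0)
```

With the all-ones response the signal is returned unchanged, and with the linear response used above the result coincides with multiplying $\mathbf{X}$ by $\mathbf{I} - \mathbf{L}_{\mathrm{sym}}/2$ directly, which hints at why polynomial filters avoid the decomposition.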

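The multilayer GCN for semi-supervised node classification was proposed in [6] by stacking multiple GCN layers. To achieve a fast and scalable operation, a first-order approximation of the spectral graph convolution is proposed, leading to:

$$g_\theta \star \mathbf{X} \approx \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{X} \mathbf{\Theta}, \tag{3}$$

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Let us denote by $\mathbf{H}^{(l)}$ the graph node representations at layer $l$ of the multi-layer GCN, with $\mathbf{H}^{(0)} = \mathbf{X}$. The propagation rule for calculating the graph node representations at layer $l+1$ is given by:

$$\mathbf{H}^{(l+1)} = \sigma\big(\tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H}^{(l)} \mathbf{W}^{(l)}\big), \tag{4}$$

where $\mathbf{W}^{(l)}$ is the layer weight matrix and $\sigma(\cdot)$ denotes the activation function, such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, or $\mathrm{softmax}(\cdot)$ used for the output layer. For a two-layer GCN model this leads to:

$$\mathbf{Z} = \mathrm{softmax}\big(\hat{\mathbf{A}} \, \mathrm{ReLU}(\hat{\mathbf{A}} \mathbf{X} \mathbf{W}^{(0)}) \, \mathbf{W}^{(1)}\big), \tag{5}$$

where $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ and $\mathbf{Z} \in \mathbb{R}^{N \times C}$ denotes the predicted feature vectors for all the graph nodes in $C$ classes. The model parameters ($\mathbf{W}^{(0)}, \mathbf{W}^{(1)}$) are finetuned by minimizing the cross entropy loss over the labeled nodes.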
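The forward pass of the two-layer GCN of [6] can be sketched in a few lines of NumPy. This is only a dense, untrained illustration under stated assumptions: the weight matrices are random placeholders, the toy graph is hypothetical, and no training loop is included.

```python
import numpy as np

def normalize_adjacency(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degrees of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(Z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)    # hidden layer with ReLU
    return softmax(A_hat @ H @ W1)         # class probabilities per node

# Hypothetical toy problem: 3 nodes, 4 features, 5 hidden units, 2 classes.
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = rng.normal(size=(3, 4))
W0 = rng.normal(size=(4, 5))
W1 = rng.normal(size=(5, 2))
Z = gcn_forward(A, X, W0, W1)
```

Each row of `Z` is a probability vector over the classes. Because every layer multiplies by the normalized adjacency once, two layers mix information from at most two hops away.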

One of the drawbacks of all existing GCN methods is that they use a predefined network topology, which is selected either based on the user’s experience, or empirically by testing multiple network topologies. In the next section, we describe a method for automatically determining a problem-specific compact GCN topology based on a data-driven approach.

IV Progressive Graph Convolutional Network

PGCN follows a data-driven approach to learn a problem-dependent compact network topology, in terms of both depth and width, by progressively building the network’s topology based on a process guided by its performance. That is, the learning process of PGCN jointly determines the network’s topology and estimates its parameters to optimize the cost function defined at the output of the network.

Fig. 1: PGCN model architecture, where $\hat{\mathbf{A}}$ is the normalized adjacency matrix, $\mathbf{X}$ is the node feature matrix and $\mathbf{o}_i$ denotes the predicted label vector for the $i$-th data sample. $D$, $B$, $C$ indicate the node features’ dimension, block size and number of classes, respectively. Each layer consists of several blocks of neurons and each block contains $B$ neurons which apply graph convolution on the input graph data and transform the input features to a $B$-dimensional space. The bottom left part of the figure shows the graph convolution and activation operations applied on the hidden representation of each node in the $(l-1)$-th layer to produce its hidden representation in the $l$-th layer. The bottom right part of the figure shows the progressive learning procedure in each layer, when a new block of neurons is added to the layer. The solid lines indicate the fixed weights finetuned in previous steps, and the dashed lines indicate the randomly initialized weights trained by semi-supervised regression.

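The learning process starts with a single hidden layer formed by one block of $B$ neurons equipped with an activation function (e.g. $\mathrm{ReLU}$) and an output layer with $C$ neurons. At iteration $k = 1$, the synaptic weight matrix $\mathbf{W}_1 \in \mathbb{R}^{D \times B}$ connecting the input layer to the hidden-layer neurons is initialized randomly. The graph nodes’ representations defined at the outputs of the hidden layer $\mathbf{H}_1 \in \mathbb{R}^{N \times B}$, where the index $1$ indicates that one block of hidden-layer neurons is used, are obtained by using graph convolution:

$$\mathbf{H}_1 = \mathrm{ReLU}\big(\hat{\mathbf{A}} \mathbf{X} \mathbf{W}_1\big). \tag{6}$$

By setting $\mathbf{H} = \mathbf{H}_1$, the network’s output for all graph nodes $\mathbf{O} \in \mathbb{R}^{N \times C}$ is calculated using a linear transformation as follows:

$$\mathbf{O} = \mathbf{H} \mathbf{W}_{out}, \tag{7}$$

where $\mathbf{W}_{out} \in \mathbb{R}^{B \times C}$ denotes the weight matrix connecting the hidden layer to the output layer. The transformation matrix $\mathbf{W}_{out}$ can be calculated by minimizing the regression problem:

$$\mathcal{J}_1 = \frac{1}{2} \mathrm{Tr}\big(\mathbf{W}_{out}^T \mathbf{W}_{out}\big) + \frac{\lambda}{2} \mathrm{Tr}\big((\mathbf{H}_{\mathrm{lab}} \mathbf{W}_{out} - \mathbf{T}_{\mathrm{lab}})^T (\mathbf{H}_{\mathrm{lab}} \mathbf{W}_{out} - \mathbf{T}_{\mathrm{lab}})\big), \tag{8}$$

where $\mathrm{Tr}(\cdot)$ denotes the trace operator, $\lambda$ denotes the relative importance of model loss, $\mathbf{H}_{\mathrm{lab}}$ denotes the hidden layer representations of the labeled graph nodes and $\mathbf{T}_{\mathrm{lab}}$ is a matrix formed by the labeled nodes’ target vectors.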

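To exploit the information in both the labeled and unlabeled nodes’ feature vectors in $\mathbf{X}$, we can replace the linear regression problem in Eq. (8) by a semi-supervised regression problem exploiting the smoothness assumption of semi-supervised learning [31, 26] expressed by the term:

$$\frac{1}{2} \sum_{i,j} A_{ij} \, \| \mathbf{o}_i - \mathbf{o}_j \|_2^2 = \mathrm{Tr}\big(\mathbf{O}^T \mathbf{L} \mathbf{O}\big) = \mathrm{Tr}\big(\mathbf{W}_{out}^T \mathbf{H}^T \mathbf{L} \mathbf{H} \mathbf{W}_{out}\big). \tag{9}$$

Minimization of Eq. (9) with respect to $\mathbf{W}_{out}$ leads to node feature vectors at the output of the network which are similar for nodes connected in the graph. To incorporate this property in the optimization problem of Eq. (8), the term in Eq. (9) is added as a regularizer. Thus, $\mathbf{W}_{out}$ is obtained by minimizing:

$$\mathcal{J}_2 = \frac{1}{2} \mathrm{Tr}\big(\mathbf{W}_{out}^T \mathbf{W}_{out}\big) + \frac{\lambda}{2} \mathrm{Tr}\big((\mathbf{H}_{\mathrm{lab}} \mathbf{W}_{out} - \mathbf{T}_{\mathrm{lab}})^T (\mathbf{H}_{\mathrm{lab}} \mathbf{W}_{out} - \mathbf{T}_{\mathrm{lab}})\big) + \frac{\delta}{2} \mathrm{Tr}\big(\mathbf{W}_{out}^T \mathbf{H}^T \mathbf{L} \mathbf{H} \mathbf{W}_{out}\big), \tag{10}$$

where $N$ is the number of all (labeled and unlabeled) graph nodes and $\delta$ denotes the relative importance of Laplacian regularization and is set to $1/N^2$, which is the natural scale factor to estimate the Laplacian operator empirically [24].

The optimal solution of Eq. (10) is obtained by setting $\partial \mathcal{J}_2 / \partial \mathbf{W}_{out} = 0$, and is given by:

$$\mathbf{W}_{out} = \big(\mathbf{I} + \lambda \mathbf{H}_{\mathrm{lab}}^T \mathbf{H}_{\mathrm{lab}} + \delta \mathbf{H}^T \mathbf{L} \mathbf{H}\big)^{-1} \lambda \mathbf{H}_{\mathrm{lab}}^T \mathbf{T}_{\mathrm{lab}}, \tag{11}$$

where $\mathbf{I}$ is the identity matrix.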
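The closed-form initialization of the output weights can be sketched as a regularized least-squares solve. The snippet below is a sketch under an explicit assumption about the objective: it minimizes $\frac{1}{2}\mathrm{Tr}(\mathbf{W}^T\mathbf{W}) + \frac{\lambda}{2}\|\mathbf{H}_{\mathrm{lab}}\mathbf{W} - \mathbf{T}_{\mathrm{lab}}\|_F^2 + \frac{\delta}{2}\mathrm{Tr}(\mathbf{W}^T\mathbf{H}^T\mathbf{L}\mathbf{H}\mathbf{W})$, and the exact scaling of the terms in the original paper may differ. All names and the toy data are hypothetical.

```python
import numpy as np

def output_weights(H, labeled_idx, T_lab, L, lam, delta):
    """Closed-form minimizer of the assumed graph-regularized objective."""
    H_lab = H[labeled_idx]
    B = H.shape[1]
    # Symmetric positive definite system matrix: I + lam*Hl^T Hl + delta*H^T L H.
    M = np.eye(B) + lam * H_lab.T @ H_lab + delta * H.T @ L @ H
    return np.linalg.solve(M, lam * H_lab.T @ T_lab)

def objective_gradient(W, H, labeled_idx, T_lab, L, lam, delta):
    """Gradient of the assumed objective with respect to W."""
    H_lab = H[labeled_idx]
    return W + lam * H_lab.T @ (H_lab @ W - T_lab) + delta * H.T @ L @ H @ W

# Hypothetical toy data: ring graph with N = 6 nodes, B = 4 hidden units, C = 2.
N, B = 6, 4
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A             # unnormalized Laplacian

rng = np.random.default_rng(0)
H = np.abs(rng.normal(size=(N, B)))        # stand-in for ReLU hidden outputs
labeled_idx = [0, 3]
T_lab = np.array([[1.0, 0.0], [0.0, 1.0]]) # one-hot targets of labeled nodes

W_out = output_weights(H, labeled_idx, T_lab, L, lam=10.0, delta=1.0 / N**2)
grad = objective_gradient(W_out, H, labeled_idx, T_lab, L, 10.0, 1.0 / N**2)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical design choice of this sketch; the gradient at `W_out` vanishes up to rounding error, which confirms that the solve returns the stationary point of the assumed objective.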

After the initialization of both the hidden layer and output layer weights, these are finetuned based on Backpropagation using the loss of the model on the labeled graph nodes. Finally, the model’s performance (classification accuracy on the labeled nodes) $\mathcal{A}_1$ is recorded.

At iteration $k = 2$, the network’s topology grows by adding a second block of hidden layer neurons. We keep the weights connecting the input layer to the first block of hidden layer neurons fixed, while the weight matrix $\mathbf{W}_2$ of the newly added block is initialized randomly. The hidden layer representations $\mathbf{H}_2$ corresponding to the newly added neurons are calculated as in Eq. (6) by replacing $\mathbf{W}_1$ with $\mathbf{W}_2$. Then, we combine $\mathbf{H} = [\mathbf{H}_1, \mathbf{H}_2]$ and the output weight matrix $\mathbf{W}_{out} \in \mathbb{R}^{2B \times C}$ is calculated by using Eq. (11).

After fine-tuning the adjustable parameters and recording the model’s performance, the network’s progression is evaluated based on the rate of performance improvement given by:

$$r_b = \frac{\mathcal{A}_2 - \mathcal{A}_1}{\mathcal{A}_1}, \tag{12}$$

where $\mathcal{A}_1$ and $\mathcal{A}_2$ denote the model’s performance before and after adding the second block of hidden layer neurons. If the addition of the second GCN block does not improve the model’s performance, i.e. when $r_b < \epsilon_b$ for a threshold $\epsilon_b$, the newly added block is removed and the progression in the current hidden layer terminates. After stopping the progressive learning in the first hidden layer, all of its parameters are fixed and a new hidden layer is formed which takes as input the previous hidden layer’s output. The block-based progression of the newly added hidden layer starts by using a single block of neurons and repeats in the same way as for the first hidden layer until the model’s performance converges.

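Let us assume that at iteration $k$, the network’s topology comprises $l$ layers (the input layer corresponds to $l = 0$) and it is growing at the $l$-th layer, giving as outputs $\mathbf{H}^{(l)}$. Before adding the $k$-th block formed by $B$ neurons, the weights of all the existing blocks in the network are fixed. The newly added block in the $l$-th hidden layer takes the output of the previous hidden layer $\mathbf{H}^{(l-1)}$ as input, and the graph convolutional operation for the $k$-th block is given by:

$$\mathbf{H}^{(l)}_k = \mathrm{ReLU}\big(\hat{\mathbf{A}} \mathbf{H}^{(l-1)} \mathbf{W}^{(l)}_k\big), \tag{13}$$

where $\mathbf{H}^{(l)}_k$ and $\mathbf{W}^{(l)}_k$ denote the hidden representation of the newly added block and the randomly initialized synaptic weights connecting the $(l-1)$-th hidden layer to the $k$-th block of the $l$-th layer, respectively. Given $\mathbf{H}^{(l)} = [\mathbf{H}^{(l)}_1, \dots, \mathbf{H}^{(l)}_k]$, which denotes the hidden representations formed by using both the existing blocks and the newly added block in the $l$-th hidden layer, the model’s output is calculated using a linear transformation as follows:

$$\mathbf{O} = \mathbf{H}^{(l)} \mathbf{W}_{out}, \tag{14}$$

where $\mathbf{W}_{out}$ denotes the transformation matrix which is calculated based on semi-supervised linear regression:

$$\mathbf{W}_{out} = \big(\mathbf{I} + \lambda \mathbf{H}^{(l)T}_{\mathrm{lab}} \mathbf{H}^{(l)}_{\mathrm{lab}} + \delta \mathbf{H}^{(l)T} \mathbf{L} \mathbf{H}^{(l)}\big)^{-1} \lambda \mathbf{H}^{(l)T}_{\mathrm{lab}} \mathbf{T}_{\mathrm{lab}}, \tag{15}$$

with $\mathbf{H}^{(l)}_{\mathrm{lab}}$ holding the rows of $\mathbf{H}^{(l)}$ that correspond to the labeled nodes.

Similar to other GCN-based methods, the linear transformation for calculating the model’s output in Eq. (14) (and Eq. (7), respectively) can also be followed by softmax activation (on a node basis) as follows:

$$\hat{o}_{ic} = \frac{\exp(o_{ic})}{\sum_{c'=1}^{C} \exp(o_{ic'})}, \tag{16}$$

and instead of Mean Squared Error (MSE), the Cross-Entropy (CE) loss function can be employed for finetuning the model.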

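function PGCN(node features, normalized adjacency, graph Laplacian, labeled targets, block size, regularization weights, thresholds, maximum numbers of layers and blocks)
     for each layer l, up to the maximum number of layers do
         for each block k, up to the maximum number of blocks per layer do
              Initialize the weights of the k-th block randomly
              Compute the block’s hidden representation (Eq. 13) and the output weights (Eq. 15)
              Finetune the weights of the k-th block and the output weights
              Calculate the block-wise rate of improvement (Eq. 17)
              if the rate is below the block threshold then
                  break
              end if
         end for
         if the layer-wise rate of improvement (Eq. 18) is below the layer threshold then
              break
         end if
     end for
     Finetune all hidden-layer weights and the output weights
     Return the hidden-layer weights and the output weights
end function
Algorithm 1 Progressive Graph Convolutional Network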

After the initialization step, the synaptic weights of all the existing blocks are fixed and all the adjustable weight parameters of the newly added block are fine-tuned with respect to the labeled data using Eq. (8). To evaluate the network’s progression, the model’s performance is recorded and the rate of the improvement is given by:

$$r_b = \frac{\mathcal{A}_k - \mathcal{A}_{k-1}}{\mathcal{A}_{k-1}}, \tag{17}$$

where $\mathcal{A}_{k-1}$ and $\mathcal{A}_k$ denote the classification accuracy before and after adding the $k$-th block, respectively. If the addition of a new GCN block in step $k$ does not improve the model’s performance, i.e. when $r_b < \epsilon_b$, the progression of the $l$-th layer terminates. After stopping the block progression for the $l$-th layer, all the learned parameters are fixed and the algorithm evaluates whether the network’s performance converged using the rate of the performance before, $\mathcal{A}^{(l-1)}$, and after, $\mathcal{A}^{(l)}$, adding the new hidden layer:

$$r_l = \frac{\mathcal{A}^{(l)} - \mathcal{A}^{(l-1)}}{\mathcal{A}^{(l-1)}}. \tag{18}$$

When $r_l < \epsilon_l$, the last hidden layer is removed and the algorithm stops growing the network’s topology. Subsequently, all of the network’s parameters are finetuned. Here we should note that it is also possible to use other performance metrics, such as model loss, to evaluate the network progression process in Eqs. (17) and (18).

Algorithm 1 summarizes the PGCN algorithm. We show that the proposed approach of building GCN layers in a progressive manner converges in Appendix A.
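The control flow of the progressive growth can be sketched independently of any training code. In the sketch below, `evaluate` is a hypothetical callback that is assumed to train and score a topology given as a list of block counts per hidden layer; the thresholds `eps_block` and `eps_layer` play the role of the block-wise and layer-wise stopping criteria. The toy accuracy function only exists to exercise the loop and does not model a real dataset.

```python
def relative_gain(new, old):
    """Rate of improvement (new - old) / old, guarded against old == 0."""
    return (new - old) / max(old, 1e-12)

def grow_topology(evaluate, max_layers, max_blocks, eps_block, eps_layer):
    """Grow blocks within a layer, then layers, until the gains fall below thresholds."""
    layers = []                  # layers[i] = number of blocks in hidden layer i
    prev_layer_acc = None
    for _ in range(max_layers):
        layers.append(1)         # a new layer starts with a single block
        acc = evaluate(layers)
        while layers[-1] < max_blocks:
            layers[-1] += 1      # tentatively add one block
            new_acc = evaluate(layers)
            if relative_gain(new_acc, acc) < eps_block:
                layers[-1] -= 1  # discard the block that did not help
                break
            acc = new_acc
        if prev_layer_acc is not None and relative_gain(acc, prev_layer_acc) < eps_layer:
            layers.pop()         # discard the layer that did not help
            break
        prev_layer_acc = acc
    return layers

def toy_accuracy(layers):
    """Hypothetical score: saturates at 3 blocks in layer 1 and at 2 layers."""
    return 0.5 + 0.1 * min(layers[0], 3) + (0.05 if len(layers) >= 2 else 0.0)

topology = grow_topology(toy_accuracy, max_layers=4, max_blocks=5,
                         eps_block=0.01, eps_layer=0.01)
```

On the toy score the loop stops with three blocks in the first layer and a single block in the second, since further blocks and layers yield no gain. In a real run, `evaluate` would wrap the initialization and finetuning steps of the method.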

V Experiments

V-A Datasets

We evaluated the proposed method on the semi-supervised node classification task in the transductive setting, using four widely-used benchmark datasets: three standard citation networks, Citeseer, Cora and Pubmed [32], and a knowledge graph, NELL [33].

The citation networks represent published documents as nodes and the citation links between them as undirected edges. Each node in a citation network is represented by a sparse binary Bag-of-Words (BoW) feature vector extracted from articles’ abstract and a class label which represents the articles’ subject. The symmetric binary adjacency matrix is built using the list of undirected edges between the nodes and the task to be solved is the prediction of the articles’ subject based on the BoW features and their citations to other articles.

The knowledge graph dataset NELL [33] is a bipartite graph extracted from a knowledge graph and contains a set of entities as nodes and labeled relations between them as directed edges. Each entity is described by a sparse feature vector and the values in the adjacency matrix indicate the existence of one or more edges between different pairs of nodes.

To perform a fair comparison, we follow the same experimental setup as in [34, 6] for data configuration and preprocessing. The detailed dataset statistics are summarized in Table I. For training the model, 20 labeled nodes per class are used for each citation network and one labeled node per class is used for the NELL dataset. In Table I, label rate denotes the number of labeled training nodes divided by the total number of nodes for each dataset. In all datasets, the validation set contains 500 randomly selected samples and the trained model is evaluated on 1000 test nodes. The labels of the validation set are not used for training the model. Following the transductive learning setup, only the labels of the training nodes, but the feature vectors of all nodes, are used for training. The feature vectors are row-normalized.
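As a quick arithmetic check of the label rate definition, the sketch below recomputes it for the three citation networks from the node and class counts listed in Table I, using the 20 labeled nodes per class stated above. NELL is left out because the sketch only encodes the per-class rule given for the citation networks; the helper name is hypothetical.

```python
# (number of nodes, number of classes) for the citation networks of Table I.
CITATION_DATASETS = {
    "Citeseer": (3327, 6),
    "Cora": (2708, 7),
    "Pubmed": (19717, 3),
}
LABELED_PER_CLASS = 20

def label_rate(num_nodes, num_classes, labeled_per_class):
    """Fraction of nodes whose labels are used for training."""
    return labeled_per_class * num_classes / num_nodes

rates = {
    name: round(label_rate(n, c, LABELED_PER_CLASS), 3)
    for name, (n, c) in CITATION_DATASETS.items()
}
print(rates)
```

Rounded to three decimals, these values agree with the label rates listed for Citeseer, Cora and Pubmed in Table I.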

Dataset      Citeseer   Cora       Pubmed     NELL
Type         Citation   Citation   Citation   Knowledge
Nodes        3327       2708       19717      65775
Edges        4732       5429       44338      266144
Classes      6          7          3          210
Features     3703       1433       500        5414
Label rate   0.036      0.052      0.003      0.001
TABLE I: Summary of datasets’ statistics used in experiments
Method Citeseer Cora Pubmed NELL
GAT [35] -
N-GCN [36]
H-GCN [37]
GCN [6]
H-GCN (reproduced)
GCN (reproduced)
P-GCN (CE) 74.3 80.2
TABLE II: Node classification performance in terms of accuracy
Method        Citeseer   Cora     Pubmed    NELL
GCN           59.4k      23.1k    8.1k      360.2k
N-GCN         445.4k     173k     60.5k     677.5k
H-GCN         127.3k     54.6k    24.6k     188.7k
GAT           237.5k     92.3k    32.3k     360.2k
P-GCN (CE)    168.2k     21.8k    5.043k    572.81k
P-GCN (MSE)   74.3k      22.1k    10.1k     572.81k
TABLE III: Number of learnable parameters
Dataset    GCN                 P-GCN (CE)                  P-GCN (MSE)
Citeseer   [D, h1: 16, 6]      [D, h1: 45, h2: 30, 6]      [D, h1: 20, 6]
Cora       [D, h1: 16, 7]      [D, h1: 15, h2: 15, 7]      [D, h1: 15, h2: 25, 7]
Pubmed     [D, h1: 16, 3]      [D, h1: 10, 3]              [D, h1: 20, 3]
NELL       [D, h1: 64, 210]    [D, h1: 100, h2: 100, 210]  [D, h1: 100, h2: 100, 210]
D denotes the dimensionality of the input data.
hX denotes the X-th hidden layer.
TABLE IV: Network architectures comparison

V-B Competing Methods

We compared the proposed method with the baseline GCN [6] and the related methods N-GCN [36], H-GCN [37], and GAT [35]. N-GCN [36] trains a network of GCNs over the neighboring nodes discovered at different distances in a random walk. Each GCN module uses a different power of the adjacency matrix, $\hat{\mathbf{A}}^k$, where $k$ indicates that the statistics are collected from the $k$-th step of a random walk on the graph. It combines the information from different graph scales by feeding the weighted sum or the concatenation of all the GCNs’ outputs into a final classification layer, and finetunes the entire model for semi-supervised node classification. H-GCN [37] uses a deep hierarchical topology for GCN with multiple coarsening and refining layers to aggregate more global information from graph data. Each coarsening layer updates the nodes’ feature vectors by employing a convolutional operation, then constructs hyper-nodes by combining structurally similar nodes. In this way, it increases the graph convolution receptive field, and the nodes’ representations are captured by exploiting the local-to-global structure of the graph. The original graph structure is then reconstructed by applying symmetric graph refining layers. The graph attention network (GAT) [35] assigns different weights to different neighboring nodes by employing self-attention. The attention coefficients are shared across all graph edges, so there is no need to capture the global graph structure.

V-C Experimental settings

We implemented our method in TensorFlow [38] and trained it using Adam optimizer [39] for epochs with learning rate of . The network weight parameters are initialized randomly using a uniform distribution. To handle the effect of randomness on network performance, we ran it 3 times on each dataset. For each dataset, the set of hyper-parameters which leads to the best validation performance is selected and the corresponding performance on the test set and architectural information are reported. To avoid overfitting, dropout [40] and weight regularization techniques are employed to regularize the network. The norm constraint maximum value is selected from and regularization factor of and are selected for citation network and knowledge network datasets respectively. We apply dropout on the output of hidden layers, not on input features. The dropout rate is selected from . The regularizer is selected from . The size of the block is selected from and for citation networks and knowledge graph datasets, respectively. The maximum topology of the network is limited to layers with neurons per layer and the threshold values are set to .

The baseline and competing methods also optimized the hyper-parameters on the same data splits. We used the implementations provided by the authors to reproduce their results, following the experimental settings explained in their reports. The GCN method optimized the hyper-parameters on the Cora dataset and used the same set of hyper-parameters for the Citeseer and Pubmed datasets too. The two-layer GCN is trained for 200 epochs using Adam optimizer with a learning rate of 0.01 and early stopping of step size 10. GCN with dropout rate of , hidden layer of neurons and regularization of is applied on citation datasets. For NELL dataset, the dropout rate is , the regularizer is and the hidden layer has neurons. The reported results for the GCN method are the mean classification accuracy based on 100 runs with random weight initialization.

To reproduce the results of H-GCN and N-GCN, we followed the network configuration and hyper-parameter settings described in [37] and [36], respectively. H-GCN is trained using Adam optimizer with learning rate of for epochs. The regularization factor and the dropout rate in this method are set to and , respectively. N-GCN uses the same optimizer with learning rate of for epochs. regularization factor is set to . In the GAT method, the hyper-parameters are optimized on the Cora dataset and then reused for the Citeseer dataset. The regularization factor and the dropout rate are set to , respectively for Citeseer and Cora datasets. For the Pubmed dataset the regularization factor is set to . The dropout is applied on both the input feature vectors and the hidden layers’ output.

V-D Results

Table II shows the performance in terms of classification accuracy for all the methods on the datasets. The best performance is shown in bold fonts for each dataset. The test classification accuracy of our method is achieved by capturing the model parameters with the highest validation accuracy. We evaluate our method with both Cross Entropy (CE) and Mean Squared Error (MSE) loss functions. Running GAT on NELL dataset needs more than 64GB memory, so its performance is not reported due to the memory issues.

The obtained results indicate that the proposed method outperforms the baseline GCN method and the other competing methods on the Citeseer dataset, and that it has comparable performance on Cora and Pubmed. On the NELL dataset, H-GCN and N-GCN have the best performance. This can be explained by the fact that the NELL dataset has fewer labeled samples per class than the citation datasets, so that the labeled nodes are far away from other nodes. H-GCN and N-GCN try to increase the receptive field to explore global information as well as local information and thus propagate the label information to other nodes more effectively. However, we could not reproduce the results which are reported for H-GCN. One of the reasons is that they first train the network on a link prediction task and then use the pretrained network for node classification. Since we obtained different results for the GCN and H-GCN methods, we show both their reported results and the reproduced results in Table II.

N-GCN uses a fixed architecture for all datasets. It has blocks of GCN which use different powers of : , , , . Each GCN, as a two-layer network with hidden neurons, is replicated three times and the weighted sum of the outputs is fed to the final classification layer. The H-GCN method uses a symmetric network architecture, with coarsening and refining layers, which comprises layers for citation networks and layers for NELL dataset while the four-channel GCNs with hidden layers are applied in all the layers.

The GAT method uses a two-layer network architecture. The first layer consists of attention heads, each computing features, followed by the ELU [41] activation function, and the second layer is a single attention head computing features followed by the softmax activation function.

The model sizes, i.e., the number of model parameters, of all trained models are reported in Table III. Table IV also shows the model architectures which are learned by our proposed method and the model architectures used by the baseline method GCN.
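The relation between an architecture in Table IV and a model size in Table III can be illustrated by counting the weights of consecutive layers. The sketch below does this for the baseline GCN architectures, under the assumption (not stated in the text) that every layer also has a bias vector; `dense_param_count` is a hypothetical helper name, and the input dimensionalities are taken from the Features row of Table I.

```python
def dense_param_count(layer_sizes, use_bias=True):
    """Number of weights (and, optionally, biases) between consecutive layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out
        if use_bias:
            total += fan_out
    return total

# Baseline GCN architectures [D, h1, C] from Table IV, with D from Table I.
GCN_ARCHITECTURES = {
    "Citeseer": [3703, 16, 6],
    "Cora": [1433, 16, 7],
    "Pubmed": [500, 16, 3],
    "NELL": [5414, 64, 210],
}

counts_in_thousands = {
    name: round(dense_param_count(sizes) / 1000, 1)
    for name, sizes in GCN_ARCHITECTURES.items()
}
print(counts_in_thousands)
```

Under the bias assumption, the rounded counts match the GCN row of Table III, which suggests that the input dimensionality dominates the model size for these shallow architectures.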

The results indicate that the network topologies which are learned by the proposed method on the Cora and Pubmed datasets are much more compact than the fixed network topologies used by other methods, while the classification performance of our compact models is similar to or better than that of the others. Figure 2 illustrates the t-SNE visualization of the learned feature vectors of the Cora dataset from the last layer of the network before applying the softmax activation. On the NELL dataset, the best performance and the most compact architecture correspond to H-GCN with respect to the results reported in [37]. The second best result belongs to N-GCN, which trains more parameters than our method. The best classification accuracy on the Citeseer dataset is achieved by the proposed method. Although our trained model does not have the minimum number of parameters for this dataset, it outperforms the baseline method GCN by .

(a) Input node features
(b) Pre-softmax node features
Fig. 2: Demonstration of Cora dataset node features visualized by t-SNE

V-E Analysis on Dataset Statistics

The results of the previous section indicate that all the competing methods perform on par with the baseline GCN, which has a simple network architecture. This can be explained by the benchmark dataset statistics. The dataset complexity is defined by the number of labeled nodes $N_{\mathrm{lab}}$ and the dimensionality of the nodes’ feature vectors $D$, and the ratio $N_{\mathrm{lab}} / D$ is extremely small for all the benchmark datasets of Table I. It has been recently shown that even heavily regularized linear methods can obtain high performance on classification problems on datasets with low complexity [42, 43, 44]. Therefore, all the GCN-based methods, with simple or sophisticated network structures, can lead to comparable performance on these widely used benchmark datasets. In [15], it has been experimentally shown that the performance of the GCN-based methods heavily depends on the underlying difficulty of the problem, and that non-linear models with more complex structure perform significantly better on datasets with a higher $N_{\mathrm{lab}} / D$ ratio. That is, it is expected that the difference in performance of various methods will increase when the underlying semi-supervised classification problem becomes more complex.

Here we highlight the importance of optimizing the network structure based on the problem’s complexity. We compare the performance of the proposed method with the baseline method GCN [6], by tuning the ratio $N_{\mathrm{lab}} / D$ using different input data dimensionalities when and of nodes are labeled in citation networks and knowledge graph datasets, respectively. We use the same data splits for both methods and follow the same approach as in [15] to control the ratio by mapping the input data representations to a subspace through random projections. Specifically, we use a random sketching matrix $\mathbf{R} \in \mathbb{R}^{D \times d}$ with $d \leq D$, which is drawn from a Normal distribution, to obtain new data representations $\tilde{\mathbf{X}} \in \mathbb{R}^{N \times d}$ as follows:

$$\tilde{\mathbf{X}} = \mathbf{X} \mathbf{R}. \tag{19}$$

To avoid bias of the performance values obtained for different values of $d$, we first randomly sample a square matrix $\mathbf{R} \in \mathbb{R}^{D \times D}$ and subsequently we use its first $d$ columns to map the input data from $\mathbb{R}^{D}$ to its subspace $\mathbb{R}^{d}$. Such an approach guarantees that when a subspace of a higher dimensionality is used, it corresponds to an augmented version of the initial (lower-dimensional) subspace. We ran three experiments for each choice of $d$ on each dataset and we report the performance on the test data corresponding to the best validation performance.
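The nested random projections described above can be sketched as follows: a single square Gaussian matrix is sampled once, and each target dimensionality reuses its leading columns. The function name, the seed handling and the toy feature matrix are assumptions made for illustration.

```python
import numpy as np

def nested_projections(X, dims, seed=0):
    """Project X to several dimensionalities using leading columns of one matrix."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    R = rng.normal(size=(D, D))            # one square Gaussian sketching matrix
    return {d: X @ R[:, :d] for d in dims}

# Hypothetical toy feature matrix: 5 nodes with 8-dimensional features.
X = np.random.default_rng(1).normal(size=(5, 8))
projected = nested_projections(X, dims=[2, 4, 8])
```

Because every projection reuses the same leading columns, the 2-dimensional representation equals the first two coordinates of the 4-dimensional one, so a higher-dimensional subspace only augments the lower-dimensional one.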

(a) Classification performance comparison
(b) Difference in classification accuracy between GCN and PGCN
(c) Model complexity comparison
Fig. 3: Performance and model complexity comparison on citation networks with varying number of dimensions $d$.

Figure 3 compares the classification performance of the GCN and PGCN methods on all datasets when problems of different difficulty are considered. It can be observed that for all data dimensionalities, our method performs better than the baseline method, and that the difference in classification accuracy is larger in lower dimensionalities, i.e. when the ratio $N_{\mathrm{lab}} / D$ is higher. This is in line with the findings of [15], indicating that in high-dimensional feature spaces neural network structures tend to perform similarly irrespective of their complexity. On the other hand, when the classification problem is encoded in lower-dimensional feature spaces and, thus, becomes more complex, the structure of the neural network’s topology is important. Indeed, as can be observed in Figure 3b, PGCN outperforms GCN by a high margin when the dimensionality of the node representations is low. The comparison of model complexity with respect to the number of trainable parameters in Figure 3c shows that both methods have similar complexity when lower-dimensional feature spaces are used on the Citeseer and Cora datasets, while PGCN outperforms GCN in terms of classification performance. For the Pubmed dataset, PGCN outperforms GCN in terms of both model complexity and classification performance in all cases. For the NELL dataset, the models learned by the PGCN method are more complex, while they outperform GCN in terms of classification performance in all cases. The difference in performance is striking in the NELL dataset, as PGCN by optimizing its topology is able to outperform GCN by a margin much greater than for low-dimensional data.

VI Conclusion

In this paper, we proposed a method for progressively training graph convolutional networks for semi-supervised node classification, which jointly defines a problem-dependent compact topology and optimizes its parameters. The proposed method employs a learning process that uses the input data of each layer to grow the network's structure in terms of both depth and width. This is achieved by an efficient layer-wise propagation rule that leads to a self-organized network structure exploiting data relationships expressed by their vector representations and the adjacency matrix of the corresponding graph structure. Experimental results on four datasets commonly used for evaluating graph convolutional networks on semi-supervised classification indicate that the proposed method outperforms the baseline method GCN and performs on par with, or even better than, more complex recently proposed methods in terms of classification performance and efficiency.

Appendix A Proof of Convergence

Here we show that the progressive learning in each layer of the PGCN method converges. Let us assume that denotes the hidden representations of the data produced by using the first blocks in GCN layer and denotes the finetuned weights connecting all the blocks of layer to the output layer. We prove that the sequence of graph-regularized MSE values, obtained with and , is monotonically decreasing while it is bounded below by .

Given the fixed hidden representation , the finetuned output weights are not necessarily the optimal weights in terms of MSE. This can be expressed by the following relation:


where denotes the optimized output weights which are obtained by solving the semi-supervised linear regression problem as follows:


In the next step, when the block is added to the layer, the new hidden representation of layer would be in which is fixed from the previous step and is generated by new randomly initialized weights. The new optimal output weights, which connect the layer to the output layer, are initialized according to Eq. (21) by substituting by . The MSE after adding the block would be as follows:


Since Eq. (22) holds for all , we can replace , with , respectively to obtain the following relation:


After finetuning , , the output weights are denoted by and the MSE would be . It has been proven that stochastic gradient descent converges to a local optimum [45] with a small enough learning rate, so the following relation holds for the MSE:


According to (24) and (25), we have the following relation:


which indicates that the sequence is monotonically decreasing.
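The chain of inequalities behind this argument can be sketched as below. All notation here is an assumption introduced for illustration and may differ from the paper's own symbols: $E(H, W)$ stands for the graph-regularized MSE, $\tilde{H}_k$ and $\tilde{W}_k$ for the finetuned hidden representation and output weights after $k$ blocks, $h_{k+1}$ for the output of the newly added block, and $W^{*}_{k+1}$ for the optimal output weights on the augmented representation. The sketch also assumes that $E$ depends on $H$ and $W$ only through the product $HW$, so that padding the old output weights with zeros leaves the error unchanged.

```latex
% Assumed notation, not taken from the source (see the lead-in above).
\begin{align*}
E_k &:= E(\tilde{H}_k, \tilde{W}_k), \\
H_{k+1} &= [\,\tilde{H}_k,\; h_{k+1}\,], \qquad
W^{*}_{k+1} = \arg\min_{W} E(H_{k+1}, W), \\
% Optimality of W^*_{k+1}, compared against the zero-padded old weights:
E(H_{k+1}, W^{*}_{k+1})
  &\le E\!\left(H_{k+1}, \begin{bmatrix} \tilde{W}_k \\ 0 \end{bmatrix}\right)
   = E(\tilde{H}_k, \tilde{W}_k) = E_k, \\
% Finetuning with a small enough learning rate does not increase the error:
E_{k+1} = E(\tilde{H}_{k+1}, \tilde{W}_{k+1})
  &\le E(H_{k+1}, W^{*}_{k+1}) \le E_k .
\end{align*}
```

Under these assumptions the sequence $E_1, E_2, \dots$ is non-increasing, and a squared-error objective with a non-negative regularizer cannot fall below zero, so the sequence converges by the monotone convergence theorem.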

Based on the connection of the linear activation function combined with the mean-square error to the soft-max activation function combined with the cross-entropy criterion and maximum likelihood optimization [46], an analysis following the same steps as above can be used to show that (when the latter is employed) the sequence is also monotonically decreasing.