citation networks and knowledge graphs.
Motivated by CNNs, the notion of convolution was generalized from grid data to graph structures, which correspond to locally connected data items, by aggregating each node’s neighbors’ features with its own features. Combining such a convolutional operator with a data transformation process and hierarchical structures yields Graph Convolutional Networks (GCNs), which can be trained in an end-to-end fashion to optimize an objective defined on individual graph nodes or on the graph as a whole. As an example, the GCN method proposed in  is a semi-supervised node classification method consisting of two convolutional layers. Each graph convolutional layer learns nodes’ features by applying an aggregation rule on their corresponding first-order neighbors. GCNs have been successful in various graph mining tasks such as node classification, graph classification, link prediction, and visual data analysis [13, 14].
One of the main drawbacks of existing GCN-based methods is that they require an extensive hyper-parameter search to determine a good network topology. This process is commonly based on extensive experimentation involving the training of multiple GCN topologies and monitoring their performance on a hold-out set. To avoid such a computationally demanding process, a common practice is to empirically define a network topology and use it for all examined problems (data sets). Such a fixed topology can be a poor fit: as an example, a two-layer GCN model cannot gain sufficient global information, since it aggregates information from just a two-hop neighborhood of each node. Meanwhile, in the reported results of  it has been shown that adding more layers makes the training process harder and does not necessarily improve the classification performance. It was also shown that the topology of the GCN plays a crucial role in performance, which is related to the underlying difficulty of the problem to be solved.
Problem-specific design of a neural network’s architecture contributes to improving the performance and the efficiency of a learning system. Recently, methods for finding an optimized network topology have been receiving much attention, and many works have been proposed to define compact network topologies by employing various learning strategies, such as compressing pre-trained networks, adding neurons progressively to the network, pruning the network’s weights and applying weight quantization [16, 17, 18, 19, 20, 21]. However, all these methods work with grid data in Euclidean spaces. In this regard, learning a compact topology for GCNs is a step towards increasing the training efficiency and reducing the computational complexity and storage needed, while achieving comparable performance with existing methods.
In this paper, we propose a method to jointly define a problem-specific GCN topology and optimize its parameters, by progressively growing the network’s structure both in width and depth. The contributions of our work are:
We propose a method to learn an optimized and problem-specific GCN topology progressively without user intervention. The resulting networks are compact in terms of number of parameters, while performing on par, or even better, compared to other recent GCN models.
We provide a convergence analysis for the proposed approach, showing that the progressively built GCN topology is guaranteed to converge to a (local) minimum.
We conduct experiments on widely-used graph datasets and compare the proposed method with recently proposed GCN models to demonstrate both the efficiency and competitive performance of the proposed method. Our experiments include an analysis of the effect of the network’s complexity with respect to the underlying complexity of the classification problem, highlighting the importance of network’s topology optimization.
The rest of the paper is organized as follows: Section 2 provides a description of graph-based semi-supervised classification, along with the terminology used in this paper. Section 3 reviews the GCN method  as the baseline method of our work. The proposed method is described in detail in Section 4. The conducted experiments are described in Section 5, and conclusions are drawn in Section 6.
II Semi-supervised graph-based classification
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected graph, where $\mathcal{V}$ and $\mathcal{E}$ denote the set of nodes $v_i \in \mathcal{V}$ and the set of edges $(v_i, v_j) \in \mathcal{E}$, respectively. $\mathbf{A} \in \mathbb{R}^{N \times N}$ denotes the adjacency matrix of $\mathcal{G}$ encoding the node connections. The elements $A_{ij}$ can be either binary values, indicating presence or absence of an edge between nodes $v_i$ and $v_j$, or real values encoding the similarity between $v_i$ and $v_j$, based on a similarity measure. Using $\mathbf{A}$, we define the degree matrix $\mathbf{D}$, which is a diagonal matrix with elements equal to $D_{ii} = \sum_j A_{ij}$. Each node $v_i$ of the graph is also equipped with a representation $\mathbf{x}_i \in \mathbb{R}^{D}$, which is used to form the feature vector matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$. When such a feature vector for each graph node is not readily available for the problem at hand (e.g. in the case of processing citation graphs), vector-based node representations are learned by using some node embedding method, like [22, 23].
Traditional graph-based semi-supervised node classification methods [24, 25, 26, 27] learn a mapping from the nodes’ feature vectors to labels, which for a $C$-class classification problem are usually represented by $C$-dimensional vectors following the 1-of-$C$ encoding scheme. This mapping exploits a graph-based regularization term and it commonly has the form:
where $f(\cdot)$ denotes the function mapping the $D$-dimensional node representations to the $C$-dimensional class vectors, $\mathbf{T}_l$ is a matrix formed by the class label vectors of the labelled nodes which form the matrix $\mathbf{X}_l$, and $\mathbf{L} = \mathbf{D} - \mathbf{A}$ is the unnormalized graph Laplacian matrix. The first term in Eq. (1) is the classification loss of the trained model measured on the labelled graph nodes, and the second term corresponds to graph Laplacian-based regularization incorporating a smoothness prior to the optimization function $f(\cdot)$. $\lambda$ expresses the relative importance of the two terms. By following this approach the labels’ information of the labelled graph nodes is propagated over the entire graph.
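The Laplacian-based smoothness term discussed above can be illustrated numerically. The following is a minimal sketch, assuming a symmetric adjacency matrix and writing the node outputs as the rows of a matrix `F`: it checks that the pairwise penalty $\frac{1}{2}\sum_{i,j} A_{ij}\,\lVert \mathbf{f}_i - \mathbf{f}_j \rVert^2$ equals the trace form $\mathrm{Tr}(\mathbf{F}^\top (\mathbf{D} - \mathbf{A}) \mathbf{F})$. The function name, the toy graph and the values in `F` are illustrative assumptions.

```python
import numpy as np


def laplacian_smoothness(A, F):
    """Return the pairwise and the trace form of the graph smoothness penalty."""
    D = np.diag(A.sum(axis=1))               # degree matrix
    L = D - A                                # unnormalized graph Laplacian
    diff = F[:, None, :] - F[None, :, :]     # (N, N, C) pairwise output differences
    pairwise = 0.5 * np.sum(A * np.sum(diff ** 2, axis=2))
    trace = np.trace(F.T @ L @ F)
    return pairwise, trace


# Toy path graph 0 - 1 - 2 with two-dimensional node outputs (assumed values).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
F = np.array([[1., 0.],
              [0., 1.],
              [0., 1.]])
pairwise, trace = laplacian_smoothness(A, F)
```

Only the edge between nodes 0 and 1 connects nodes with different outputs, so the penalty is driven entirely by that edge; outputs that agree across every edge would give a penalty of zero.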
III Graph Convolutional Networks
GCNs are mainly categorized into spatial-based and spectral-based methods. The spatial-based methods update the features of each node by aggregating its spatially close neighbors’ feature vectors. In these methods, the convolution operation is defined on the graph with a specified neighborhood size which propagates the information locally [10, 28, 29, 5].
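The spectral-based GCN methods follow a graph signal processing approach . Let us denote by $g_\theta$ a filter and by $\mathbf{X}$ a multi-dimensional signal defined over the nodes of the graph. The signal transformation using $g_\theta$ is given by:
$$g_\theta \star \mathbf{X} = \mathbf{U}\, g_\theta(\boldsymbol{\Lambda})\, \mathbf{U}^\top \mathbf{X}, \tag{2}$$
where $\star$ denotes the convolution operator, $\mathbf{U}$ is the matrix of eigenvectors of the normalized graph Laplacian $\mathbf{L}_{norm} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$, with $\boldsymbol{\Lambda}$ being a diagonal matrix having as elements the corresponding eigenvalues, and $\mathbf{U}^\top \mathbf{X}$ being the graph Fourier transform of $\mathbf{X}$. Since computing the eigen-decomposition of $\mathbf{L}_{norm}$ is computationally expensive, low-order approximations using truncated Chebyshev polynomials have been proposed . The transformation in Eq. (2) corresponds to the building block of a GCN. Followed by an element-wise activation it forms a GCN layer.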
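The spectral filtering step described above can be sketched directly with an eigen-decomposition on a small graph. This is only an illustration of the definition, assuming a symmetric adjacency matrix without isolated nodes and a signal stored as an $N \times D$ matrix; the function names and the two example filters are hypothetical.

```python
import numpy as np


def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2}; assumes every node has at least one edge."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


def spectral_filter(A, X, g):
    """Filter the (N, D) signal X with the spectral response g(eigenvalues)."""
    lam, U = np.linalg.eigh(normalized_laplacian(A))  # L = U diag(lam) U^T
    X_hat = U.T @ X                                   # graph Fourier transform
    return U @ (g(lam)[:, None] * X_hat)              # filter, then inverse transform


# 4-cycle graph and an arbitrary two-channel signal (assumed toy data).
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
X = np.arange(8, dtype=float).reshape(4, 2)

identity_out = spectral_filter(A, X, lambda lam: np.ones_like(lam))
lowpass_out = spectral_filter(A, X, lambda lam: 1.0 - lam / 2.0)
```

The all-ones response returns the signal unchanged, while the response $1 - \lambda/2$ is equivalent to averaging each node's signal with its normalized neighborhood, which is the kind of localized filter that polynomial approximations obtain without computing the eigen-decomposition.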
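The multilayer GCN for semi-supervised node classification was proposed in  by stacking multiple GCN layers. To achieve a fast and scalable operation, a first-order approximation of the spectral graph convolution is proposed, leading to:
$$g_\theta \star \mathbf{X} \approx \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{X} \boldsymbol{\Theta}, \tag{3}$$
where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Let us denote by $\mathbf{H}^{(l)}$ the graph node representations at layer $l$ of the multi-layer GCN. The propagation rule for calculating the graph node representations at layer $l+1$ is given by:
$$\mathbf{H}^{(l+1)} = \sigma\!\left( \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right), \tag{4}$$
where $\mathbf{W}^{(l)}$ is the layer weight matrix and $\sigma(\cdot)$ denotes the activation function, such as $\mathrm{ReLU}(\cdot)$, or $\mathrm{softmax}(\cdot)$ used for the output layer. For a two-layer GCN model this leads to:
$$\mathbf{Z} = \mathrm{softmax}\!\left( \hat{\mathbf{A}}\, \mathrm{ReLU}\!\left( \hat{\mathbf{A}} \mathbf{X} \mathbf{W}^{(0)} \right) \mathbf{W}^{(1)} \right), \tag{5}$$
where $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ and $\mathbf{Z} \in \mathbb{R}^{N \times C}$ denotes the predicted feature vectors for all the graph nodes in $C$ classes. The model parameters ($\mathbf{W}^{(0)}, \mathbf{W}^{(1)}$) are finetuned by minimizing the cross entropy loss over the labeled nodes.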
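The two-layer model described above can be sketched in NumPy as a forward pass followed by a cross-entropy loss restricted to the labeled nodes. This is a minimal sketch rather than the reference implementation: the weights are random instead of trained, and the function names, the toy graph, the layer sizes and the choice of labeled nodes are assumptions made for illustration.

```python
import numpy as np


def renormalized_adjacency(A):
    """D~^{-1/2} (A + I) D~^{-1/2}, the adjacency with added self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)


def gcn_forward(A, X, W0, W1):
    """Two graph convolutions: ReLU on the hidden layer, softmax on the output."""
    A_hat = renormalized_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)
    return softmax(A_hat @ H @ W1)


def masked_cross_entropy(P, Y, labeled):
    """Mean cross entropy over the labeled nodes only."""
    return -np.mean(np.sum(Y[labeled] * np.log(P[labeled] + 1e-12), axis=1))


rng = np.random.default_rng(0)
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
X = rng.random((4, 5))                        # 4 nodes, 5 input features
W0 = rng.normal(scale=0.1, size=(5, 16))      # input -> hidden (untrained)
W1 = rng.normal(scale=0.1, size=(16, 3))      # hidden -> 3 classes (untrained)
Y = np.eye(3)[[0, 1, 2, 0]]                   # 1-of-C targets for every node
P = gcn_forward(A, X, W0, W1)
loss = masked_cross_entropy(P, Y, labeled=np.array([0, 2]))
```

Although the loss only sees the labeled nodes, every row of `P` depends on the features of neighboring nodes through `A_hat`, which is how unlabeled nodes take part in training.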
One of the drawbacks of all existing GCN methods is that they use a predefined network topology, which is selected either based on the user’s experience, or empirically by testing multiple network topologies. In the next section, we describe a method for automatically determining a problem-specific compact GCN topology based on a data-driven approach.
IV Progressive Graph Convolutional Network
PGCN follows a data-driven approach to learn a problem-dependent compact network topology, in terms of both depth and width, by progressively building the network’s topology based on a process guided by its performance. That is, the learning process of PGCN jointly determines the network’s topology and estimates its parameters to optimize the cost function defined at the output of the network.
The learning process starts with a single hidden layer formed by one block of neurons equipped with an activation function (e.g. ) and an output layer with neurons. At iteration , the synaptic weight matrix connecting the input layer to the hidden-layer neurons is initialized randomly. The graph nodes’ representations defined at the outputs of the hidden layer , where the index indicates that the one block of hidden-layer neurons is used, are obtained by using graph convolution:
By setting , the network’s output for all graph nodes
is calculated using a linear transformation as follows:
where denotes the weight matrix connecting hidden layer to the output layer. The transformation matrix can be calculated by minimizing the regression problem:
where denotes the trace operator, denotes the relative importance of model loss, denotes the hidden layer representations of the labeled graph nodes and is a matrix formed by the labeled nodes’ target vectors.
To exploit the information in both the labeled and unlabeled nodes’ feature vectors in
, we can replace the linear regression problem in Eq. (8) by a semi-supervised regression problem exploiting the smoothness assumption of semi-supervised learning [31, 26] expressed by the term:
Minimization of Eq. (9) with respect to leads to node feature vectors at the output of the network which are similar for nodes connected in the graph. To incorporate this property in the optimization problem of Eq. (8) the term in Eq. (9) is added as a regularizer. Thus, is obtained by minimizing:
where is the number of all (labeled and unlabeled) graph nodes and denotes the relative importance of Laplacian regularization and is set to which is the natural scale factor to estimate the Laplacian operator empirically .
The optimal solution of Eq. (10) is obtained by setting , and is given by:
where $\mathbf{I}$ is the identity matrix.
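The closed-form initialization of the output weights can be sketched as a graph-regularized ridge regression. The sketch below assumes the objective $\frac{1}{2}\lVert \mathbf{W} \rVert_F^2 + \frac{\lambda}{2}\lVert \mathbf{H}_l \mathbf{W} - \mathbf{T} \rVert_F^2 + \frac{\alpha}{2}\mathrm{Tr}(\mathbf{W}^\top \mathbf{H}^\top \mathbf{L} \mathbf{H} \mathbf{W})$, whose minimizer is $(\mathbf{I} + \lambda \mathbf{H}_l^\top \mathbf{H}_l + \alpha \mathbf{H}^\top \mathbf{L} \mathbf{H})^{-1} \lambda \mathbf{H}_l^\top \mathbf{T}$; the exact scaling of the terms in the paper may differ. The symbols `lam` and `alpha`, their values, the function names and the toy data are all assumptions.

```python
import numpy as np


def output_weights(H, L, labeled, T, lam, alpha):
    """Closed-form minimizer of the assumed graph-regularized objective."""
    H_l = H[labeled]                          # hidden outputs of the labeled nodes
    M = np.eye(H.shape[1]) + lam * H_l.T @ H_l + alpha * H.T @ L @ H
    return np.linalg.solve(M, lam * H_l.T @ T)


def objective_gradient(W, H, L, labeled, T, lam, alpha):
    """Gradient of the assumed objective; it vanishes at the minimizer."""
    H_l = H[labeled]
    return W + lam * H_l.T @ (H_l @ W - T) + alpha * H.T @ L @ H @ W


rng = np.random.default_rng(0)
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A                # unnormalized graph Laplacian
H = rng.random((4, 6))                        # hidden outputs for all 4 nodes
labeled = np.array([0, 3])                    # indices of the labeled nodes
T = np.array([[1., 0.],
              [0., 1.]])                      # 1-of-C targets of the labeled nodes
W_out = output_weights(H, L, labeled, T, lam=10.0, alpha=0.1)
grad = objective_gradient(W_out, H, L, labeled, T, lam=10.0, alpha=0.1)
```

Because the identity term makes the system matrix positive definite, the solve is well posed even when there are fewer labeled nodes than hidden neurons.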
After the initialization of both the hidden layer and output layer weights, these are finetuned with backpropagation using the loss of the model on the labeled graph nodes. Finally, the model’s performance (classification accuracy on the labeled nodes) is recorded.
At iteration , the network’s topology grows by adding a second block of hidden layer neurons. We keep the weights connecting the input layer to the first block of hidden layer neurons fixed, while the weight matrix of the newly added block is initialized randomly. The hidden layer representations corresponding to the newly added neurons are calculated as in Eq. (6) by replacing with . Then, we combine and the output weight matrix is calculated by using Eq. (11).
After fine-tuning the adjustable parameters and recording the model’s performance, the network’s progression is evaluated based on the rate of performance improvement given by:
where and denote the model’s performance before and after adding the second block of hidden layer neurons. If the addition of the second GCN block does not improve the model’s performance, i.e. when , the newly added block is removed and the progression in the current hidden layer terminates. After stopping the progressive learning in the first hidden layer, all of its parameters are fixed and a new hidden layer is formed which takes as input the previous hidden layers’ output. The block-based progression of the newly added hidden layer starts by using a single block of neurons and repeats in the same way as for the first hidden layer until model’s performance converges.
Let us assume that at iteration , the network’s topology comprises of layers (the input layer corresponds to ) and it is growing at the layer giving as outputs . Before adding the block formed by neurons, the weights of all the existing blocks in the network are fixed. The newly added block in the hidden layer takes the output of the previous hidden layer as input, and the graph convolutional operation for the block is given by:
where and denote the hidden representation of the newly added block and the randomly initialized synaptic weights connecting the hidden layer to block of layer, respectively. Given which denotes the hidden representations formed by using both the existing blocks and the newly added block in the hidden layer, the model’s output is calculated using a linear transformation as follows:
where denotes the transformation matrix which is calculated based on semi-supervised linear regression:
Similar to other GCN-based methods, the linear transformation for calculating the model’s output in Eq. (14) (and Eq. (7), respectively) can also be followed by a softmax activation (on a node basis) as follows:
and instead of Mean Squared Error (MSE), the Cross-Entropy (CE) loss function can be employed for finetuning the model.
After the initialization step, the synaptic weights of all the existing blocks are fixed and all the adjustable weight parameters of the newly added block are fine-tuned with respect to the labeled data using Eq. (8). To evaluate the network’s progression, the model’s performance is recorded and the rate of the improvement is given by:
where and denote the classification accuracy before and after adding the block, respectively. If the addition of a new GCN block in step does not improve the model’s performance, i.e. when , the progression of the layer terminates. After stopping the block progression for the layer, all the learned parameters are fixed and the algorithm evaluates whether the network’s performance converged using the rate of the performance before and after adding the new hidden layer:
When , the last hidden layer is removed and the algorithm stops growing the network’s topology. Subsequently, all of the network’s parameters are finetuned. Here we should note that it is also possible to use other performance metrics, such as the model loss, to evaluate the network progression process in Eqs. (17) and (18).
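The control flow of the block-wise and layer-wise progression described in this section can be summarized in a short sketch. It assumes that the improvement rate is the relative accuracy gain and that each candidate topology is scored by a callback; in the actual method that callback would correspond to initializing, fine-tuning and evaluating the network, which is replaced here by a hypothetical toy function. All names and threshold values are illustrative.

```python
def improvement_rate(acc_before, acc_after):
    # Assumed form of the rate: relative accuracy gain over the previous model.
    return (acc_after - acc_before) / acc_before


def grow_network(score_fn, max_layers, max_blocks, eps_block, eps_layer):
    """Grow a topology, given score_fn(topology) -> accuracy.

    A topology is a list holding the number of blocks of each hidden layer.
    """
    topology, acc_net = [], None
    for _ in range(max_layers):
        # Start the new hidden layer with a single block.
        layer_topology = topology + [1]
        acc_layer = score_fn(layer_topology)
        # Width progression: keep adding blocks while they help.
        while layer_topology[-1] < max_blocks:
            candidate = layer_topology[:-1] + [layer_topology[-1] + 1]
            acc_candidate = score_fn(candidate)
            if improvement_rate(acc_layer, acc_candidate) < eps_block:
                break  # the newly added block is discarded
            layer_topology, acc_layer = candidate, acc_candidate
        # Depth progression: keep the new layer only if it helped.
        if acc_net is not None and improvement_rate(acc_net, acc_layer) < eps_layer:
            break  # the last hidden layer is removed
        topology, acc_net = layer_topology, acc_layer
    return topology, acc_net


def toy_score(topology):
    # Hypothetical accuracy surface: up to three blocks in each of the first
    # two layers help, anything beyond that brings no gain.
    return 0.5 + 0.05 * sum(min(blocks, 3) for blocks in topology[:2])


topology, accuracy = grow_network(toy_score, max_layers=4, max_blocks=5,
                                  eps_block=1e-4, eps_layer=1e-4)
```

On this toy surface the procedure stops widening each layer after the third block and stops deepening after the second layer, so the width and depth are both decided by the same performance-driven test.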
We evaluated the proposed method on the semi-supervised node classification task in the transductive setting, using four widely used benchmark datasets: three standard citation networks (Citeseer, Cora and Pubmed) and a knowledge graph (NELL).
The citation networks represent published documents as nodes and the citation links between them as undirected edges. Each node in a citation network is represented by a sparse binary Bag-of-Words (BoW) feature vector extracted from articles’ abstract and a class label which represents the articles’ subject. The symmetric binary adjacency matrix is built using the list of undirected edges between the nodes and the task to be solved is the prediction of the articles’ subject based on the BoW features and their citations to other articles.
The knowledge graph dataset NELL  is a bipartite graph extracted from a knowledge graph and contains a set of entities as nodes and labeled relations between them as directed edges. Each entity is described by a sparse feature vector and the values in the adjacency matrix indicate the existence of one or more edges between different pairs of nodes.
To perform a fair comparison, we follow the same experimental setup as in [34, 6] for data configuration and preprocessing. The detailed dataset statistics are summarized in Table I. For training the model, 20 labeled nodes per class are used for each citation network and one labeled node per class is used for the NELL dataset. In Table I, label rate denotes the number of labeled training nodes divided by the total number of nodes for each dataset. In all datasets, the validation set contains 500 randomly selected samples and the trained model is evaluated on 1000 test nodes. The labels of the validation set are not used for training the model. Following the transductive learning setup, only the labels of the training nodes, but the feature vectors of all nodes, are used for training. The feature vectors are row-normalized.
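The two preprocessing quantities mentioned above are simple to compute. The sketch below assumes that row normalization means scaling each feature vector to sum to one, which is a common choice for bag-of-words features but is not spelled out in the text; the function names and the toy numbers are hypothetical and do not correspond to any of the benchmark datasets.

```python
import numpy as np


def row_normalize(X):
    """Scale every row of X to sum to one; all-zero rows are left unchanged."""
    row_sums = X.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    return X / row_sums


def label_rate(num_labeled, num_nodes):
    """Fraction of nodes whose labels are used for training."""
    return num_labeled / num_nodes


# Toy binary bag-of-words features for three documents (assumed data).
X = np.array([[1., 0., 1., 0.],
              [0., 0., 0., 0.],
              [1., 1., 1., 1.]])
X_norm = row_normalize(X)

# Toy setting: 3 classes with 20 labeled nodes each in a graph of 1000 nodes.
rate = label_rate(num_labeled=3 * 20, num_nodes=1000)
```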
|Dataset|GCN|PGCN (CE)|PGCN (MSE)|
|Citeseer|[D, h1: 16, 6]|[D, h1: 45, h2: 30, 6]|[D, h1: 20, 6]|
|Cora|[D, h1: 16, 7]|[D, h1: 15, h2: 15, 7]|[D, h1: 15, h2: 25, 7]|
|Pubmed|[D, h1: 16, 3]|[D, h1: 10, 3]|[D, h1: 20, 3]|
|NELL|[D, h1: 64, 210]|[D, h1: 100, h2: 100, 210]|[D, h1: 100, h2: 100, 210]|
D denotes the dimensionality of the input data.
hX denotes the X-th hidden layer.
V-B Competing Methods
We compared the proposed method with the baseline GCN and related methods N-GCN, H-GCN, and GAT. N-GCN trains a network of GCNs over the neighboring nodes discovered at different distances in a random walk. Each GCN module uses a different power of the adjacency matrix like , where indicates the statistics collected from the step of a random walk on the graph. It combines the information from different graph scales by using the weighted sum or the concatenation of all the GCNs’ outputs in a final classification layer and finetunes the entire model for semi-supervised node classification. H-GCN uses a deep hierarchical topology for GCN with multiple coarsening and refining layers to aggregate more global information from graph data. Each coarsening layer updates the nodes’ feature vectors by employing a convolutional operation, then constructs hyper-nodes by combining structurally similar nodes. In this way, it increases the graph convolution receptive field, and the nodes’ representations are captured by exploiting the local to global structure of the graph. The original graph structure is then reconstructed by applying symmetric graph refining layers. The graph attention network (GAT) assigns different weights to different neighboring nodes by employing self-attention. The attention coefficients are shared across all graph edges, so there is no need to capture the global graph structure.
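The idea of feeding different powers of the adjacency matrix to different modules, as described for N-GCN above, can be illustrated by computing multi-scale propagated features. This is only a sketch of the inputs such modules would see, not of N-GCN itself, which trains one GCN per power and combines their outputs. It assumes a row-normalized (random-walk) adjacency matrix; the function names and toy data are hypothetical.

```python
import numpy as np


def random_walk_matrix(A):
    """Row-normalized adjacency; assumes every node has at least one edge."""
    return A / A.sum(axis=1, keepdims=True)


def multi_scale_features(A_hat, X, K):
    """Concatenate A_hat^k X for k = 0, ..., K-1 along the feature axis."""
    blocks, P = [], np.eye(A_hat.shape[0])
    for _ in range(K):
        blocks.append(P @ X)   # features after k random-walk steps
        P = P @ A_hat
    return np.concatenate(blocks, axis=1)


A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
X = np.eye(4)                                 # one-hot node features (toy data)
A_hat = random_walk_matrix(A)
F = multi_scale_features(A_hat, X, K=3)
```

With one-hot features, the block for step $k$ holds the $k$-step transition probabilities of the walk, so each additional power exposes a wider neighborhood of every node.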
V-C Experimental settings
We implemented our method in TensorFlow and trained it using the Adam optimizer  for epochs with learning rate of . The network weight parameters are initialized randomly using a uniform distribution. To handle the effect of randomness on network performance, we ran it 3 times on each dataset. For each dataset, the set of hyper-parameters which leads to the best validation performance is selected, and the corresponding performance on the test set and architectural information are reported. To avoid overfitting, dropout and weight regularization techniques are employed to regularize the network. The norm constraint maximum value is selected from and regularization factor of and are selected for citation network and knowledge network datasets respectively. We apply dropout on the output of hidden layers, not on input features. The dropout rate is selected from . The regularizer is selected from . The size of the block is selected from and for citation networks and knowledge graph datasets, respectively. The maximum topology of the network is limited to layers with neurons per layer and the threshold values are set to .
The baseline and competing methods also optimized the hyper-parameters on the same data splits. We used the implementations provided by the authors to reproduce their results, following the experimental settings explained in their reports. The GCN method optimized the hyper-parameters on the Cora dataset and used the same set of hyper-parameters for the Citeseer and Pubmed datasets too. The two-layer GCN is trained for 200 epochs using the Adam optimizer with a learning rate of 0.01 and early stopping with a window size of 10. GCN with dropout rate of , hidden layer of neurons and regularization of is applied on citation datasets. For NELL dataset, the dropout rate is , the regularizer is and the hidden layer has neurons. The reported results for the GCN method are the mean classification accuracy over 100 runs with random weight initialization.
To reproduce the results of H-GCN and N-GCN, we followed the network configuration and hyper-parameter settings described in  and , respectively. H-GCN is trained using the Adam optimizer with learning rate of for epochs. The regularization factor and the dropout rate in this method are set to and , respectively. N-GCN uses the same optimizer with learning rate of for epochs. The regularization factor is set to . In the GAT method, the hyper-parameters are optimized on the Cora dataset and then reused for the Citeseer dataset. The regularization factor and the dropout rate are set to , respectively for Citeseer and Cora datasets. For the Pubmed dataset, the regularization factor is set to . The dropout is applied on both the input feature vectors and the hidden layers’ outputs.
Table II shows the performance in terms of classification accuracy for all the methods on the datasets. The best performance is shown in bold fonts for each dataset. The test classification accuracy of our method is achieved by capturing the model parameters with the highest validation accuracy. We evaluate our method with both Cross Entropy (CE) and Mean Squared Error (MSE) loss functions. Running GAT on NELL dataset needs more than 64GB memory, so its performance is not reported due to the memory issues.
The obtained results indicate that the proposed method outperforms the baseline GCN method and the other competing methods on the Citeseer dataset, and that it has comparable performance on Cora and Pubmed. On the NELL dataset, H-GCN and N-GCN have the best performance. A possible explanation is that the NELL dataset has fewer labeled samples per class than the citation datasets, so that the labeled nodes are far away from other nodes. H-GCN and N-GCN increase the receptive field to explore global as well as local information, and thus propagate the label information to other nodes more effectively. However, we could not reproduce the results reported for H-GCN. One of the reasons is that they first train the network on a link prediction task and then use the pretrained network for node classification. Since we obtained different results for the GCN and H-GCN methods, we show both their reported results and the reproduced results in Table II.
N-GCN uses a fixed architecture for all datasets. It has blocks of GCN which use different powers of : , , , . Each GCN, as a two-layer network with hidden neurons, is replicated three times and the weighted sum of the outputs is fed to the final classification layer. The H-GCN method uses a symmetric network architecture, with coarsening and refining layers, which comprises layers for citation networks and layers for NELL dataset while the four-channel GCNs with hidden layers are applied in all the layers.
The GAT method uses a two-layer network architecture. The first layer consists of attention heads, each computing features, followed by ELU  activation function and the second layer is a single attention head computing features followed by softmax activation function.
The model sizes, i.e., the number of model parameters, of all trained models are reported in Table III. Table IV shows the model architectures learned by the proposed method and the model architectures used by the baseline method GCN.
The results indicate that the network topologies learned by the proposed method on the Cora and Pubmed datasets are much more compact than the fixed network topologies used by other methods, while the classification performance of our compact models is similar to or better than that of the others. Figure 2 illustrates the t-SNE visualization of the learned feature vectors of the Cora dataset, taken from the last layer of the network before applying the softmax activation. On the NELL dataset, the best performance and the most compact architecture correspond to H-GCN with respect to the results reported in . The second best result belongs to N-GCN, which trains more parameters than our method. The best classification accuracy on the Citeseer dataset is achieved by the proposed method. Although our trained model does not have the minimum number of parameters for this dataset, it outperforms the baseline method GCN by .
V-E Analysis on Dataset Statistics
The results of the previous section indicate that all the competing methods perform on par with the baseline GCN, which has a simple network architecture. This can be explained by the benchmark dataset statistics. The dataset complexity is defined by the number of labeled nodes and the dimensionality of nodes’ feature vectors , and the ratio is extremely small for all the benchmark datasets of Table I. It has been recently shown that even heavily regularized linear methods can obtain high performance on classification problems on datasets with low complexity [42, 43, 44]. Therefore, all the GCN-based methods, with simple or sophisticated network structures, can lead to comparable performance on these widely used benchmark datasets. In , it has been experimentally shown that the performance of the GCN-based methods heavily depends on the underlying difficulty of the problem and non-linear models with more complex structure perform significantly better on datasets with higher ratio. That is, it is expected that the difference in performance of various methods will increase when the underlying semi-supervised classification problem becomes more complex.
Here we highlight the importance of optimizing the network structure based on the problem’s complexity. We compare the performance of the proposed method with the baseline method GCN , by tuning the ratio using different input data dimensionalities when and of nodes are labeled in citation networks and knowledge graph datasets, respectively. We use the same data splits for both methods and follow the same approach as in  to control the ratio by mapping the input data representations to a subspace through random projections. Specifically, we use a random sketching matrix with , which is drawn from a Normal distribution, to obtain new data representations as follows:
To avoid bias of the performance values obtained for different values of , we first randomly sample a square matrix and subsequently we use its first columns to map the input data from to its subspace . Such an approach guarantees that when a subspace of a higher dimensionality is used, it corresponds to an augmented version of the initial (lower-dimensional) subspace. We ran three experiments for each choice of on each dataset and we report the performance on the test data corresponding to the best validation performance.
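The nested construction of the random subspaces described above can be sketched as follows. The sketch assumes a standard Normal distribution without any rescaling, since the exact scaling of the sketching matrix is not stated here; the function name, the seed and the toy dimensions are hypothetical.

```python
import numpy as np


def nested_random_projections(X, dims, seed=0):
    """Project X with the first d columns of one shared square random matrix."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    R = rng.standard_normal((D, D))    # single square sketching matrix
    return {d: X @ R[:, :d] for d in dims}


X = np.random.default_rng(42).random((5, 10))   # 5 nodes, 10 input features
Z = nested_random_projections(X, dims=[2, 4, 8])
```

Because every subspace reuses the leading columns of the same matrix, the 4-dimensional representation is exactly the first four coordinates of the 8-dimensional one, so results for different dimensionalities are not affected by independent random draws.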
Figure 3 compares the classification performance of the GCN and PGCN methods on all datasets when problems of different difficulty are considered. It can be observed that, for all data dimensionalities, our method performs better than the baseline method, and that the difference in classification accuracy is larger in lower dimensionalities, i.e. when the ratio is higher. This is in line with the findings of , indicating that in high-dimensional feature spaces neural network structures tend to perform similarly irrespective of their complexity. On the other hand, when the classification problem is encoded in lower-dimensional feature spaces and, thus, becomes more complex, the structure of the neural network’s topology is important. Indeed, as can be observed in Figure 3b, PGCN outperforms GCN by a large margin when the dimensionality of the node representations is low. The comparison of model complexity with respect to the number of trainable parameters in Figure 3c shows that both methods have similar complexity when lower-dimensional feature spaces are used on the Citeseer and Cora datasets, while PGCN outperforms GCN in terms of classification performance. For the Pubmed dataset, PGCN outperforms GCN in terms of both model complexity and classification performance in all cases. For the NELL dataset, the models learned by the PGCN method are more complex while they outperform GCN in terms of classification performance in all cases. The difference in performance is striking on the NELL dataset, as PGCN, by optimizing its topology, is able to outperform GCN by a margin much greater than for low-dimensional data.
In this paper, we proposed a method for progressively training graph convolutional networks for semi-supervised node classification which jointly defines a problem-dependent compact topology and optimizes its parameters. The proposed method employs a learning process which utilizes the input data of each layer to grow the network’s structure in terms of both depth and width. This is achieved by using an efficient layer-wise propagation rule, leading to a self-organized network structure that exploits data relationships expressed by their vector representations and the adjacency matrix of the corresponding graph structure. Experimental results on four commonly used datasets for evaluating graph convolutional networks on semi-supervised classification indicate that the proposed method outperforms the baseline method GCN and performs on par with, or even better than, more complex recently proposed methods in terms of classification performance and efficiency.
Appendix A Proof of Convergence
Here we show that the progressive learning in each layer of the PGCN method converges. Let us assume that denotes the hidden representations of data produced by using the first blocks in GCN layer and denotes the finetuned weights connecting all the blocks of layer to the output layer. We prove that the sequence of graph-regularized MSE, obtained with and is monotonically decreasing while it is bounded below by .
Given the fixed hidden representation , the finetuned output weights are not necessarily the optimized weights in terms of MSE. It can be explained by the following relation:
where denotes the optimized output weights which are obtained by solving the semi-supervised linear regression problem as follows:
In the next step, when the block is added to the layer, the new hidden representation of layer would be in which is fixed from previous step and is generated by new randomly initialized weights. The new optimal output weights which connect the layer to output layer is initialized according to Eq. (21) by substituting by . The MSE after adding the block would be as follows:
Since Eq. (22) holds for all , we can replace , with , respectively to obtain the following relation:
After finetuning , , the output weights are denoted by and the MSE would be . It has been proven that stochastic gradient descent converges to a local optimum with small enough learning rate, so the following relation holds for the MSE:
which indicates that the sequence is monotonically decreasing.
Based on the connection of the linear activation function combined with the mean-square error to the soft-max activation function combined with the cross-entropy criterion and maximum likelihood optimization , an analysis following the same steps as above can be used to show that (when the latter is employed) the sequence is also monotonically decreasing.
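The key step of the argument above, that re-solving the output weights after appending a new block cannot increase the regularized objective, can be checked numerically. The sketch below assumes the objective $\frac{1}{2}\lVert \mathbf{W} \rVert_F^2 + \frac{\lambda}{2}\lVert \mathbf{H}_l \mathbf{W} - \mathbf{T} \rVert_F^2 + \frac{\alpha}{2}\mathrm{Tr}(\mathbf{W}^\top \mathbf{H}^\top \mathbf{L} \mathbf{H} \mathbf{W})$ and covers only the closed-form step, not the subsequent fine-tuning; the graph convolution is omitted for brevity, and all names, sizes and parameter values are hypothetical.

```python
import numpy as np


def regularized_objective(H, L, labeled, T, lam, alpha):
    """Minimum of the assumed graph-regularized objective for fixed H."""
    H_l = H[labeled]
    M = np.eye(H.shape[1]) + lam * H_l.T @ H_l + alpha * H.T @ L @ H
    W = np.linalg.solve(M, lam * H_l.T @ T)        # closed-form output weights
    return (0.5 * np.sum(W ** 2)
            + 0.5 * lam * np.sum((H_l @ W - T) ** 2)
            + 0.5 * alpha * np.trace(W.T @ H.T @ L @ H @ W))


rng = np.random.default_rng(1)
N, block_size, num_blocks = 12, 4, 5
A = np.triu((rng.random((N, N)) < 0.3).astype(float), k=1)
A = A + A.T                                        # random symmetric toy graph
L = np.diag(A.sum(axis=1)) - A                     # unnormalized graph Laplacian
X = rng.random((N, 8))
labeled = np.arange(6)
T = np.eye(3)[labeled % 3]                         # 1-of-C targets, 3 classes

H = np.empty((N, 0))
objectives = []
for _ in range(num_blocks):
    W_block = rng.normal(size=(X.shape[1], block_size))   # random new block
    H = np.hstack([H, np.maximum(X @ W_block, 0.0)])      # earlier blocks stay fixed
    objectives.append(regularized_objective(H, L, labeled, T, lam=10.0, alpha=0.1))
```

The sequence in `objectives` is non-increasing because the previous solution, padded with zero weights for the new block, is always a feasible candidate for the enlarged problem and yields the same objective value.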
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
-  T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv:1609.02907, 2016.
-  T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto, “Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach,” arXiv:1706.05674, 2017.
-  Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE Transactions on Neural Networks and Learning Systems (Early Access), DOI: 10.1109/TNNLS.2020.2978386, 2019.
-  E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.
-  J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
-  M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in AAAI Conference on Artificial Intelligence, 2018.
-  M. Zhang and Y. Chen, “Link prediction based on graph neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 5165–5175.
-  X. Zhang, C. Xu, X. Tian, and D. Tao, “Graph edge convolutional neural networks for skeleton-based action recognition,” IEEE Transactions on Neural Networks and Learning Systems (Early Access), DOI: 10.1109/TNNLS.2019.2935173, 2019.
-  H. Shi, Y. Zhang, Z. Zhang, N. Ma, X. Zhao, Y. Gao, and J. Sun, “Hypergraph-induced convolutional networks for visual classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 2963–2972, 2018.
-  C. Vignac, G. Ortiz-Jimenez, and P. Frossard, “On the choice of graph neural network architectures,” arXiv:1911.05384, 2019.
-  Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv:1710.09282, 2017.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
-  D. T. Tran, A. Iosifidis, and M. Gabbouj, “Improving efficiency in convolutional neural networks with multilinear filters,” Neural Networks, vol. 105, pp. 328–339, 2018.
-  D. T. Tran, S. Kiranyaz, M. Gabbouj, and A. Iosifidis, “Heterogeneous multilayer generalized operational perceptron,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 710–724, 2019.
-  S. Wiedemann, K.-R. Müller, and W. Samek, “Compact and computationally efficient representation of deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 772–785, 2019.
-  Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely automated cnn architecture design based on blocks,” IEEE Transactions on Neural Networks and Learning Systems (Early Access), DOI: 10.1109/TNNLS.2019.2919608, 2019.
-  S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi, “Watch your step: Learning graph embeddings through attention,” arXiv:1710.09599, 2017.
-  A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
-  M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, no. Nov, pp. 2399–2434, 2006.
-  J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the trade. Springer, 2012, pp. 639–655.
-  A. Iosifidis, A. Tefas, and I. Pitas, “Regularized extreme learning machine for multi-view semi-supervised action recognition,” Neurocomputing, vol. 145, pp. 250–262, 2014.
-  D. Ienco and R. G. Pensa, “Enhancing graph-based semisupervised learning via knowledge-aware data embedding,” IEEE Transactions on Neural Networks and Learning Systems (Early Access), DOI: 10.1109/TNNLS.2019.2955565, 2019.
-  C. Zhuang and Q. Ma, “Dual graph convolutional networks for graph-based semi-supervised classification,” in World Wide Web Conference, 2018, pp. 499–508.
-  F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model cnns,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5115–5124.
-  D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
-  S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2006.
-  P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, pp. 93–93, 2008.
-  A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M. Mitchell, “Toward an architecture for never-ending language learning,” in AAAI Conference on Artificial Intelligence, 2010.
-  Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” arXiv:1603.08861, 2016.
-  P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv:1710.10903, 2017.
-  S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee, “N-gcn: Multi-scale graph convolution for semi-supervised node classification,” arXiv:1802.08888, 2018.
-  F. Hu, Y. Zhu, S. Wu, L. Wang, and T. Tan, “Hierarchical graph convolutional networks for semi-supervised node classification,” arXiv:1902.06667, 2019.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv:1603.04467, 2016.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv:1511.07289, 2015.
-  F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” arXiv:1902.07153, 2019.
-  H. NT and T. Maehara, “Revisiting graph neural networks: All we have is low-pass filters,” arXiv:1905.09550, 2019.
-  J. Klicpera, A. Bojchevski, and S. Günnemann, “Predict then propagate: Graph neural networks meet personalized pagerank,” arXiv:1810.05997, 2018.
-  H. Robbins and S. Monro, “A stochastic approximation method in: Herbert robbins selected papers,” NewYork, USA: Springer, vol. 102, p. 109, 1985.
-  C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.