I Introduction
Convolutional Neural Networks (CNNs) [1], as an end-to-end deep learning paradigm, have been very successful in many machine learning and computer vision tasks [2, 3, 4]. While CNNs have a high ability to extract latent representations and local meaningful statistical patterns from data, they can only operate on Euclidean data structures such as audio, images and videos, which have a form based on 1D, 2D and 3D regular grids, respectively. Recently, there has been an increasing research interest in applying deep learning approaches to non-Euclidean data structures, like graphs, which lack grid-like geometrical properties but have a high ability to model complex irregular data such as social networks [5, 6], citation networks and knowledge graphs [7].

Motivated by CNNs, the notion of convolution was generalized from grid data to graph structures, which correspond to locally connected data items, by aggregating each node's neighbors' features with its own features [8]. Combining such a convolutional operator with a data transformation process and hierarchical structures, Graph Convolutional Networks (GCNs) are obtained, which can be trained in an end-to-end fashion to optimize an objective defined on individual graph nodes, or on the graph as a whole. As an example, the GCN method proposed in [6] is a semi-supervised node classification method consisting of two convolutional layers. Each graph convolutional layer learns nodes' features by applying an aggregation rule on their corresponding first-order neighbors. GCNs [9] have been successful in various graph mining tasks like node classification [10], graph classification [11], link prediction [12], and visual data analysis [13, 14].
One of the main drawbacks of existing GCN-based methods is that they require an extensive hyperparameter search process to determine a good topology of the network. This process is commonly based on extensive experimentation, involving the training of multiple GCN topologies and monitoring their performance on a hold-out set. To avoid such a computationally demanding process, a common practice is to empirically define a network topology and use it for all examined problems (data sets). However, a two-layer GCN model, for example, cannot gain sufficient global information by aggregating the information from just a two-hop neighborhood for each node. Meanwhile, the results reported in [6] show that adding more layers makes the training process harder and does not necessarily improve the classification performance. It was also shown that the topology of the GCN plays a crucial role in performance, which is related to the underlying difficulty of the problem to be solved [15].
Problem-specific design of a neural network's architecture contributes to improving the performance and the efficiency of a learning system. Recently, methods for finding an optimized network topology have been receiving much attention, and many works were proposed to define compact network topologies by employing various learning strategies, such as compressing pretrained networks, adding neurons progressively to the network, pruning the network's weights and applying weight quantization [16, 17, 18, 19, 20, 21]. However, all these methods work with grid data in Euclidean spaces. In this regard, learning a compact topology for GCNs makes a step towards increasing the training efficiency and reducing the computational complexity and storage needed, while achieving comparable performance with existing methods.

In this paper, we propose a method to jointly define a problem-specific GCN topology and optimize its parameters, by progressively growing the network's structure both in width and depth. The contributions of our work are:

We propose a method to learn an optimized and problemspecific GCN topology progressively without user intervention. The resulting networks are compact in terms of number of parameters, while performing on par, or even better, compared to other recent GCN models.

We provide a convergence analysis for the proposed approach, showing that the progressively built GCN topology is guaranteed to converge to a (local) minimum.

We conduct experiments on widely-used graph datasets and compare the proposed method with recently proposed GCN models to demonstrate both the efficiency and the competitive performance of the proposed method. Our experiments include an analysis of the effect of the network's complexity with respect to the underlying complexity of the classification problem, highlighting the importance of network topology optimization.
The rest of the paper is organized as follows: Section 2 provides a description of graph-based semi-supervised classification, along with the terminology used in this paper. Section 3 reviews the GCN method [6], which is the baseline of our work. The proposed method is described in detail in Section 4. The conducted experiments are described in Section 5, and conclusions are drawn in Section 6.
II Semi-supervised Graph-based Classification
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected graph, where $\mathcal{V}$ and $\mathcal{E}$ denote the set of $N$ nodes and the set of edges, respectively. $\mathbf{A} \in \mathbb{R}^{N \times N}$ denotes the adjacency matrix of $\mathcal{G}$ encoding the node connections. The elements $A_{ij}$ can be either binary values, indicating the presence or absence of an edge between nodes $v_i$ and $v_j$, or real values encoding the similarity between $v_i$ and $v_j$ based on a similarity measure. Using $\mathbf{A}$, we define the degree matrix $\mathbf{D}$, which is a diagonal matrix with elements equal to $D_{ii} = \sum_j A_{ij}$. Each node $v_i$ of the graph is also equipped with a representation $\mathbf{x}_i \in \mathbb{R}^D$, which is used to form the feature vector matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$. When such a feature vector for each graph node is not readily available for the problem at hand (e.g. in the case of processing citation graphs), vector-based node representations are learned by using some node embedding method, like [22, 23].

Traditional graph-based semi-supervised node classification methods [24, 25, 26, 27] learn a mapping from the nodes' feature vectors to labels, which for a $C$-class classification problem are usually represented by $C$-dimensional vectors following the 1-of-$C$ encoding scheme. This mapping exploits a graph-based regularization term and commonly has the form:

$\min_{f} \; \mathcal{L}\big(f(\mathbf{X}_L), \mathbf{Y}_L\big) + \lambda \, \mathrm{tr}\big(f(\mathbf{X})^\top \mathbf{L} \, f(\mathbf{X})\big)$  (1)

where $f(\cdot)$ denotes the function mapping the $D$-dimensional node representations to the $C$-dimensional class vectors, $\mathbf{Y}_L$ is a matrix formed by the class label vectors of the labelled nodes, which form the matrix $\mathbf{X}_L$, and $\mathbf{L} = \mathbf{D} - \mathbf{A}$ is the unnormalized graph Laplacian matrix. The first term in Eq. (1) is the classification loss of the trained model measured on the labelled graph nodes, and the second term corresponds to graph Laplacian-based regularization, incorporating a smoothness prior into the optimization function. $\lambda$ expresses the relative importance of the two terms. By following this approach, the label information of the labelled graph nodes is propagated over the entire graph.
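To make the smoothness prior of Eq. (1) concrete, the following NumPy sketch builds the unnormalized Laplacian of a small hypothetical graph and evaluates the regularizer both in trace form and in its equivalent pairwise form (the graph and the output matrix are illustrative toy values, not from the paper):

```python
import numpy as np

# Hypothetical 4-node undirected graph (toy example).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])

# Degree matrix D_ii = sum_j A_ij and unnormalized Laplacian L = D - A.
D = np.diag(A.sum(axis=1))
L = D - A

# Example network outputs F (4 nodes, 2 classes). The smoothness prior
# tr(F^T L F) = 1/2 * sum_ij A_ij * ||f_i - f_j||^2 penalizes outputs
# that differ across connected nodes.
F = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])

smoothness = np.trace(F.T @ L @ F)
pairwise = 0.5 * sum(A[i, j] * np.sum((F[i] - F[j]) ** 2)
                     for i in range(4) for j in range(4))
assert np.isclose(smoothness, pairwise)  # both forms agree
```

Here nodes 0 and 2 (and 1 and 3) are connected but assigned different classes, so the regularizer is positive; an assignment constant over connected components would drive it to zero.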
III Graph Convolutional Networks
GCNs are mainly categorized into spatial-based and spectral-based methods. The spatial-based methods update the features of each node by aggregating the feature vectors of its spatially close neighbors. In these methods, the convolution operation is defined on the graph with a specified neighborhood size, which propagates the information locally [10, 28, 29, 5].
The spectral-based GCN methods follow a graph signal processing approach [6]. Let us denote by $g_\theta$ a filter and by $\mathbf{X}$ a multidimensional signal defined over the nodes of the graph. The signal transformation using $g_\theta$ is given by:

$g_\theta \star \mathbf{X} = \mathbf{U} g_\theta \mathbf{U}^\top \mathbf{X}$  (2)

where $\star$ denotes the convolution operator, $\mathbf{U}$ is the matrix of eigenvectors of the normalized graph Laplacian $\mathbf{L}_{sym} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$, with $\boldsymbol{\Lambda}$ being a diagonal matrix having as elements the corresponding eigenvalues, and $\mathbf{U}^\top \mathbf{X}$ being the graph Fourier transform of $\mathbf{X}$. Since computing the eigendecomposition of $\mathbf{L}_{sym}$ is computationally expensive, low-rank approximations using truncated Chebyshev polynomials have been proposed [30]. The transformation in Eq. (2) corresponds to the building block of a GCN; followed by an element-wise activation, it forms a GCN layer.

The multi-layer GCN for semi-supervised node classification was proposed in [6] by stacking multiple GCN layers. To achieve a fast and scalable operation, a first-order approximation of the spectral graph convolution is used, leading to:
$\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$  (3)

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Let us denote by $\mathbf{H}^{(l)}$ the graph node representations at layer $l$ of the multi-layer GCN. The propagation rule for calculating the graph node representations at layer $l+1$ is given by:
$\mathbf{H}^{(l+1)} = \sigma\big(\hat{\mathbf{A}} \mathbf{H}^{(l)} \mathbf{W}^{(l)}\big)$  (4)

where $\mathbf{W}^{(l)}$ is the $l$-th layer weight matrix and $\sigma(\cdot)$ denotes the activation function, such as $\mathrm{ReLU}(\cdot)$, or $\mathrm{softmax}(\cdot)$ used for the output layer. For a two-layer GCN model this leads to:

$\mathbf{Z} = \mathrm{softmax}\big(\hat{\mathbf{A}} \, \mathrm{ReLU}\big(\hat{\mathbf{A}} \mathbf{X} \mathbf{W}^{(0)}\big) \mathbf{W}^{(1)}\big)$  (5)

where $\mathbf{H}^{(0)} = \mathbf{X}$ and $\mathbf{Z} \in \mathbb{R}^{N \times C}$ denotes the predicted feature vectors for all the graph nodes in the $C$ classes. The model parameters ($\mathbf{W}^{(0)}$, $\mathbf{W}^{(1)}$) are fine-tuned by minimizing the cross-entropy loss over the labeled nodes.
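As an illustration of Eqs. (3)-(5), the following NumPy sketch implements the renormalization trick and the two-layer forward pass; the graph, feature matrix and weight shapes are hypothetical toy values, and the weights are random rather than trained:

```python
import numpy as np

def normalize_adjacency(A):
    """Renormalization trick: A_hat = D~^{-1/2} (A + I) D~^{-1/2}, Eq. (3)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(Z):
    # Row-wise (per-node) softmax, numerically stabilized.
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def two_layer_gcn(A, X, W0, W1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1), as in Eq. (5)."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)   # first GCN layer + ReLU
    return softmax(A_hat @ H @ W1)        # output layer + softmax

# Toy example: 4 nodes, 3 input features, 8 hidden units, 2 classes.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
X = rng.standard_normal((4, 3))
Z = two_layer_gcn(A, X, rng.standard_normal((3, 8)), rng.standard_normal((8, 2)))
assert Z.shape == (4, 2) and np.allclose(Z.sum(axis=1), 1.0)
```

Each row of `Z` is a probability vector over the classes for the corresponding node; training would fit `W0` and `W1` by minimizing the cross-entropy over the labeled rows.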
One of the drawbacks of all existing GCN methods is that they use a predefined network topology, which is selected either based on the user's experience, or empirically by testing multiple network topologies. In the next section, we describe a method for automatically determining a problem-specific compact GCN topology based on a data-driven approach.
IV Progressive Graph Convolutional Network
The proposed Progressive Graph Convolutional Network (PGCN) follows a data-driven approach to learn a problem-dependent compact network topology, in terms of both depth and width, by progressively building the network's topology based on a process guided by its performance. That is, the learning process jointly determines the network's topology and estimates its parameters to optimize the cost function defined at the output of the network.

The learning process starts with a single hidden layer formed by one block of neurons equipped with an activation function (e.g. $\mathrm{ReLU}(\cdot)$) and an output layer with $C$ neurons. At the first iteration, the synaptic weight matrix $\mathbf{W}_1^{(1)}$ connecting the input layer to the hidden-layer neurons is initialized randomly. The graph nodes' representations defined at the outputs of the hidden layer, $\mathbf{H}_1^{(1)}$, where the subscript indicates that one block of hidden-layer neurons is used, are obtained by using graph convolution:
$\mathbf{H}_1^{(1)} = \sigma\big(\hat{\mathbf{A}} \mathbf{X} \mathbf{W}_1^{(1)}\big)$  (6)

By setting $\mathbf{H} = \mathbf{H}_1^{(1)}$, the network's output for all graph nodes, $\hat{\mathbf{Y}}$, is calculated using a linear transformation as follows:

$\hat{\mathbf{Y}} = \mathbf{H} \mathbf{O}$  (7)

where $\mathbf{O}$ denotes the weight matrix connecting the hidden layer to the output layer. The transformation matrix $\mathbf{O}$ can be calculated by minimizing the regression problem:

$\mathbf{O}^* = \underset{\mathbf{O}}{\arg\min} \; \frac{1}{2}\|\mathbf{O}\|_F^2 + \frac{c}{2}\,\mathrm{tr}\big((\mathbf{H}_L \mathbf{O} - \mathbf{T})^\top (\mathbf{H}_L \mathbf{O} - \mathbf{T})\big)$  (8)

where $\mathrm{tr}(\cdot)$ denotes the trace operator, $c$ denotes the relative importance of the model loss, $\mathbf{H}_L$ denotes the hidden layer representations of the labeled graph nodes and $\mathbf{T}$ is a matrix formed by the labeled nodes' target vectors.
To exploit the information in both the labeled and unlabeled nodes' feature vectors in $\mathbf{H}$, we can replace the linear regression problem in Eq. (8) by a semi-supervised regression problem exploiting the smoothness assumption of semi-supervised learning [31, 26], expressed by the term:

$\mathcal{S}(\mathbf{O}) = \mathrm{tr}\big(\mathbf{O}^\top \mathbf{H}^\top \mathbf{L} \mathbf{H} \mathbf{O}\big)$  (9)

Minimization of Eq. (9) with respect to $\mathbf{O}$ leads to node feature vectors at the output of the network which are similar for nodes connected in the graph. To incorporate this property in the optimization problem of Eq. (8), the term in Eq. (9) is added as a regularizer. Thus, $\mathbf{O}$ is obtained by minimizing:

$\mathbf{O}^* = \underset{\mathbf{O}}{\arg\min} \; \frac{1}{2}\|\mathbf{O}\|_F^2 + \frac{c}{2}\,\|\mathbf{H}_L \mathbf{O} - \mathbf{T}\|_F^2 + \frac{\lambda}{2N^2}\,\mathrm{tr}\big(\mathbf{O}^\top \mathbf{H}^\top \mathbf{L} \mathbf{H} \mathbf{O}\big)$  (10)

where $N$ is the number of all (labeled and unlabeled) graph nodes and $\lambda$ denotes the relative importance of the Laplacian regularization; the scaling by $\frac{1}{N^2}$ is the natural scale factor for estimating the Laplacian operator empirically [24].
The optimal solution of Eq. (10) is obtained by setting the gradient with respect to $\mathbf{O}$ equal to zero, and is given by:

$\mathbf{O}^* = \Big(\mathbf{I} + c\,\mathbf{H}_L^\top \mathbf{H}_L + \frac{\lambda}{N^2}\,\mathbf{H}^\top \mathbf{L} \mathbf{H}\Big)^{-1} c\,\mathbf{H}_L^\top \mathbf{T}$  (11)

where $\mathbf{I}$ is the identity matrix.
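Assuming the semi-supervised regression of Eq. (10) takes the standard Laplacian-regularized least-squares form, the closed-form solution of Eq. (11) can be computed and verified as follows; `output_weights` is a hypothetical helper operating on random toy data, not code from the paper:

```python
import numpy as np

def output_weights(H, H_lab, T, L, c=1.0, lam=1.0):
    """Closed-form semi-supervised ridge solution for the output weights:
    O = (I + c H_lab^T H_lab + (lam/N^2) H^T L H)^{-1} c H_lab^T T.
    H: hidden representations of all N nodes; H_lab: rows of H for the
    labeled nodes; T: one-hot targets; L: graph Laplacian."""
    N, d = H.shape
    M = np.eye(d) + c * (H_lab.T @ H_lab) + (lam / N**2) * (H.T @ L @ H)
    return np.linalg.solve(M, c * H_lab.T @ T)

# Random toy problem: 30 nodes, 6 hidden units, 3 classes, 10 labeled.
rng = np.random.default_rng(0)
N, n_lab = 30, 10
H = rng.standard_normal((N, 6))
H_lab = H[:n_lab]
T = np.eye(3)[rng.integers(0, 3, n_lab)]
A = (rng.random((N, N)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                  # symmetric adjacency
L = np.diag(A.sum(axis=1)) - A                  # unnormalized Laplacian

O = output_weights(H, H_lab, T, L, c=2.0, lam=0.5)
# Stationarity check: the gradient of the objective should vanish at O.
grad = O + 2.0 * H_lab.T @ (H_lab @ O - T) + (0.5 / N**2) * H.T @ L @ H @ O
assert np.allclose(grad, 0.0, atol=1e-8)
```

A single linear solve replaces iterative training of the output layer, which is what makes the repeated re-initialization during topology growth cheap.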
After the initialization of both the hidden layer and output layer weights, these are fine-tuned with Backpropagation using the loss of the model on the labeled graph nodes. Finally, the model's performance (classification accuracy on the labeled nodes), $a_1$, is recorded.

At the second iteration, the network's topology grows by adding a second block of hidden layer neurons. We keep the weights connecting the input layer to the first block of hidden layer neurons fixed, while the weight matrix $\mathbf{W}_2^{(1)}$ of the newly added block is initialized randomly. The hidden layer representations $\mathbf{H}_2^{(1)}$ corresponding to the newly added neurons are calculated as in Eq. (6) by replacing $\mathbf{W}_1^{(1)}$ with $\mathbf{W}_2^{(1)}$. Then, we combine $\mathbf{H} = [\mathbf{H}_1^{(1)}, \mathbf{H}_2^{(1)}]$ and the output weight matrix is calculated by using Eq. (11).
After fine-tuning the adjustable parameters and recording the model's performance, the network's progression is evaluated based on the rate of performance improvement given by:

$r = \frac{a_2 - a_1}{a_1}$  (12)

where $a_1$ and $a_2$ denote the model's performance before and after adding the second block of hidden layer neurons. If the addition of the second GCN block does not improve the model's performance, i.e. when $r \le \epsilon$, the newly added block is removed and the progression in the current hidden layer terminates. After stopping the progressive learning in the first hidden layer, all of its parameters are fixed and a new hidden layer is formed, which takes the previous hidden layer's output as input. The block-based progression of the newly added hidden layer starts with a single block of neurons and proceeds in the same way as for the first hidden layer, until the model's performance converges.
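The width-progression rule can be sketched as follows. This is a simplified stand-in: it grows random ReLU blocks on plain (non-graph) toy features and fits the output weights with a purely supervised ridge solution, keeping only the stopping criterion of Eq. (12) faithful to the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_output(H, T, c=10.0):
    # Ridge closed form for the output weights (supervised part only).
    d = H.shape[1]
    return np.linalg.solve(np.eye(d) + c * H.T @ H, c * H.T @ T)

def accuracy(H, O, y):
    return np.mean((H @ O).argmax(axis=1) == y)

# Toy labeled data standing in for the aggregated node features.
X = rng.standard_normal((200, 20))
y = (X[:, 0] + 0.3 * rng.standard_normal(200) > 0).astype(int)
T = np.eye(2)[y]

# Width progression: keep adding blocks of hidden neurons while the rate
# of accuracy improvement (a_new - a_old) / a_old stays above eps.
block_size, eps = 5, 1e-3
W_blocks, acc_old = [], 0.0
while True:
    W_new = rng.standard_normal((X.shape[1], block_size))
    H = np.maximum(X @ np.hstack(W_blocks + [W_new]), 0.0)  # ReLU features
    acc_new = accuracy(H, fit_output(H, T), y)
    if acc_old > 0 and (acc_new - acc_old) / acc_old <= eps:
        break  # the new block did not help: discard it and stop growing
    W_blocks.append(W_new)
    acc_old = acc_new
print(len(W_blocks), acc_old)
```

In the full method the same loop would use the graph convolution of Eq. (6) for each block, the semi-supervised solution of Eq. (11) for the output weights, and back-propagation fine-tuning between growth steps; depth progression repeats this loop layer by layer.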
Let us assume that at iteration $t$, the network's topology comprises $l$ layers (the input layer corresponds to $l=0$) and is growing at layer $l$, giving as outputs $\mathbf{H}^{(l)}$. Before adding the $k$-th block, the weights of all the existing blocks in the network are fixed. The newly added block in the hidden layer takes the output of the previous hidden layer as input, and the graph convolutional operation for the block is given by:

$\mathbf{H}_k^{(l)} = \sigma\big(\hat{\mathbf{A}} \mathbf{H}^{(l-1)} \mathbf{W}_k^{(l)}\big)$  (13)

where $\mathbf{H}_k^{(l)}$ and $\mathbf{W}_k^{(l)}$ denote the hidden representation of the newly added block and the randomly initialized synaptic weights connecting the $(l-1)$-th hidden layer to the $k$-th block of the $l$-th layer, respectively. Given $\mathbf{H}^{(l)} = [\mathbf{H}_1^{(l)}, \dots, \mathbf{H}_k^{(l)}]$, which denotes the hidden representations formed by using both the existing blocks and the newly added block in the hidden layer, the model's output is calculated using a linear transformation as follows:

$\hat{\mathbf{Y}} = \mathbf{H}^{(l)} \mathbf{O}$  (14)

where $\mathbf{O}$ denotes the transformation matrix, which is calculated based on semi-supervised linear regression as in Eq. (11), with $\mathbf{H}^{(l)}$ in place of $\mathbf{H}$:

$\mathbf{O}^* = \Big(\mathbf{I} + c\,(\mathbf{H}_L^{(l)})^\top \mathbf{H}_L^{(l)} + \frac{\lambda}{N^2}\,(\mathbf{H}^{(l)})^\top \mathbf{L} \mathbf{H}^{(l)}\Big)^{-1} c\,(\mathbf{H}_L^{(l)})^\top \mathbf{T}$  (15)
Similar to other GCN-based methods, the linear transformation for calculating the model's output in Eq. (14) (and Eq. (7), respectively) can also be followed by a softmax activation (on a node basis) as follows:

$\hat{\mathbf{Y}} = \mathrm{softmax}\big(\mathbf{H}^{(l)} \mathbf{O}\big)$  (16)

and instead of the Mean Squared Error (MSE), the Cross-Entropy (CE) loss function can be employed for fine-tuning the model.
After the initialization step, the synaptic weights of all the existing blocks are fixed and all the adjustable weight parameters of the newly added block are fine-tuned with respect to the labeled data using Eq. (8). To evaluate the network's progression, the model's performance is recorded and the rate of improvement is given by:

$r_k = \frac{a_k - a_{k-1}}{a_{k-1}}$  (17)

where $a_{k-1}$ and $a_k$ denote the classification accuracy before and after adding the $k$-th block, respectively. If the addition of a new GCN block at step $k$ does not improve the model's performance, i.e. when $r_k \le \epsilon$, the progression of the layer terminates. After stopping the block progression for the $l$-th layer, all the learned parameters are fixed and the algorithm evaluates whether the network's performance has converged, using the rate of performance improvement before and after adding the new hidden layer:

$r_l = \frac{a^{(l)} - a^{(l-1)}}{a^{(l-1)}}$  (18)

When $r_l \le \epsilon$, the last hidden layer is removed and the algorithm stops growing the network's topology. Subsequently, all the network's parameters are fine-tuned. Here we should note that it is also possible to use other performance metrics, such as the model loss, to evaluate the network progression process in Eqs. (17) and (18).
V Experiments
V-A Datasets
We evaluated the proposed method on the semi-supervised node classification task, following the transductive setting, on four widely-used benchmark datasets: three standard citation networks, Citeseer, Cora and Pubmed [32], and a knowledge graph, NELL [33].
The citation networks represent published documents as nodes and the citation links between them as undirected edges. Each node in a citation network is represented by a sparse binary Bag-of-Words (BoW) feature vector extracted from the article's abstract, and by a class label which represents the article's subject. The symmetric binary adjacency matrix is built using the list of undirected edges between the nodes, and the task to be solved is the prediction of the articles' subjects based on the BoW features and their citations to other articles.
The knowledge graph dataset NELL [33] is a bipartite graph extracted from a knowledge graph and contains a set of entities as nodes and labeled relations between them as directed edges. Each entity is described by a sparse feature vector and the values in the adjacency matrix indicate the existence of one or more edges between different pairs of nodes.
To perform a fair comparison, we follow the same experimental setup as in [34, 6] for data configuration and preprocessing. The detailed dataset statistics are summarized in Table I. For training the model, 20 labeled nodes per class are used for each citation network and one labeled node per class is used for the NELL dataset. In Table I, the label rate denotes the number of labeled training nodes divided by the total number of nodes of each dataset. In all datasets, the validation set contains 500 randomly selected samples, and the trained model is evaluated on 1000 test nodes. The labels of the validation set are not used for training the model. Following the transductive learning setup, only the labels of the training-set nodes, but all the feature vectors, are used for training. The feature vectors are row-normalized.
Table I: Dataset statistics.
Dataset  Citeseer  Cora  Pubmed  NELL 

Type  Citation  Citation  Citation  Knowledge 
Nodes  3327  2708  19717  65775 
Edges  4732  5429  44338  266144 
Classes  6  7  3  210 
Features  3703  1433  500  5414 
Label rate  0.036  0.052  0.003  0.001 
Table II: Classification accuracy.
Method  Citeseer  Cora  Pubmed  NELL 

GAT [35]    
NGCN [36]  
HGCN [37]  
GCN [6]  
HGCN (reproduced)  
GCN (reproduced)  
PGCN (CE)  74.3  80.2  
PGCN (MSE) 
Table III: Number of model parameters.
Method  Citeseer  Cora  Pubmed  NELL 
GCN  59.4k  23.1k  8.1k  360.2k 
NGCN  445.4k  173k  60.5k  677.5k 
HGCN  127.3k  54.6k  24.6k  188.7k 
GAT  237.5k  92.3k  32.3k  360.2k 
PGCN (CE)  168.2k  21.8k  5.043k  572.81k 
PGCN (MSE)  74.3k  22.1k  10.1k  572.81k 
Table IV: Learned and baseline network topologies.
Method  GCN  PGCN (CE)  PGCN (MSE) 

Citeseer  [D, h1: 16, 6]  [D, h1: 45, h2: 30, 6]  [D, h1: 20, 6] 
Cora  [D, h1: 16, 7]  [D, h1: 15, h2: 15, 7]  [D, h1: 15, h2: 25, 7] 
Pubmed  [D, h1: 16, 3]  [D, h1: 10, 3]  [D, h1: 20, 3] 
NELL  [D, h1: 64, 210]  [D, h1:100, h2:100, 210]  [D, h1:100, h2:100, 210] 
D denotes the dimensionality of the input data; hX denotes the X-th hidden layer.
V-B Competing Methods
We compared the proposed method with the baseline GCN [6] and the related methods NGCN [36], HGCN [37], and GAT [35]. NGCN [36] trains a network of GCNs over the neighboring nodes discovered at different distances in a random walk. Each GCN module uses a different power of the adjacency matrix, $\hat{\mathbf{A}}^p$, where $p$ indicates the statistics collected from the $p$-th step of a random walk on the graph. It combines the information from different graph scales by feeding the weighted sum or the concatenation of all the GCNs' outputs into a final classification layer, and fine-tunes the entire model for semi-supervised node classification. HGCN [37] uses a deep hierarchical topology for the GCN, with multiple coarsening and refining layers, to aggregate more global information from the graph data. Each coarsening layer updates the nodes' feature vectors by employing the convolutional operation and then constructs hyper-nodes by combining structurally similar nodes together. In this way, it increases the receptive field of the graph convolution, and the nodes' representations are captured by exploiting the local-to-global structure of the graph. The original graph structure is then reconstructed by applying symmetric graph refining layers. The graph attention network (GAT) [35] assigns different weights to different neighboring nodes by employing self-attention. The attention coefficients are shared across all graph edges, so there is no need to capture the global graph structure.
V-C Experimental Settings
We implemented our method in TensorFlow [38] and trained it using the Adam optimizer [39] for  epochs with a learning rate of . The network weight parameters are initialized randomly using a uniform distribution. To handle the effect of randomness on the network performance, we ran the method 3 times on each dataset. For each dataset, the set of hyperparameters which leads to the best validation performance is selected, and the corresponding performance on the test set and the architectural information are reported. To avoid overfitting, dropout [40] and weight regularization techniques are employed to regularize the network. The maximum value of the norm constraint is selected from , and regularization factors of  and  are selected for the citation network and knowledge network datasets, respectively. We apply dropout on the output of the hidden layers, not on the input features. The dropout rate is selected from . The regularizer is selected from . The size of the block is selected from  and  for the citation networks and the knowledge graph datasets, respectively. The maximum topology of the network is limited to  layers with  neurons per layer, and the threshold values are set to .

The baseline and competing methods also optimized their hyperparameters on the same data splits. We used the implementations provided by the authors to reproduce their results, following the experimental settings described in their reports. The GCN method optimized the hyperparameters on the Cora dataset and used the same set of hyperparameters for the Citeseer and Pubmed datasets too. The two-layer GCN is trained for 200 epochs using the Adam optimizer with a learning rate of 0.01 and early stopping with a step size of 10. GCN with a dropout rate of , a hidden layer of  neurons and a regularization factor of  is applied on the citation datasets. For the NELL dataset, the dropout rate is , the regularizer is  and the hidden layer has  neurons. The reported results for the GCN method are the mean classification accuracy based on 100 runs with random weight initialization.

To reproduce the results of HGCN and NGCN, we followed the network configuration and hyperparameter settings described in [37] and [36], respectively. HGCN is trained using the Adam optimizer with a learning rate of  for  epochs. The regularization factor and the dropout rate in this method are set to  and , respectively. NGCN uses the same optimizer with a learning rate of  for  epochs. The regularization factor is set to . In the GAT method, the hyperparameters are optimized on the Cora dataset and then reused for the Citeseer dataset. The regularization factor and the dropout rate are set to  and , respectively, for the Citeseer and Cora datasets. For the Pubmed dataset, the regularization factor is set to . The dropout is applied on both the input feature vectors and the hidden layers' output.
V-D Results
Table II shows the performance in terms of classification accuracy for all the methods on all the datasets. The best performance for each dataset is shown in bold. The test classification accuracy of our method is achieved by capturing the model parameters with the highest validation accuracy. We evaluate our method with both the Cross-Entropy (CE) and the Mean Squared Error (MSE) loss functions. Running GAT on the NELL dataset requires more than 64GB of memory, so its performance is not reported.
The obtained results indicate that the proposed method outperforms the baseline GCN method and the other competing methods on the Citeseer dataset, and it achieves comparable performance on Cora and Pubmed. On the NELL dataset, HGCN and NGCN have the best performance. This can be explained by the fact that the NELL dataset has fewer labeled samples per class than the citation datasets, so the labeled nodes are far away from most other nodes. HGCN and NGCN try to increase the receptive field to exploit global as well as local information, and thus propagate the label information to the other nodes more effectively. However, we could not reproduce the results reported in HGCN. One of the reasons is that the authors first train the network on a link prediction task and then use the pretrained network for node classification. Since we obtained different results for the GCN and HGCN methods, we show both their reported results and the reproduced results in Table II.
NGCN uses a fixed architecture for all datasets. It has GCN blocks which use different powers of the normalized adjacency matrix. Each GCN, a two-layer network with  hidden neurons, is replicated three times, and the weighted sum of the outputs is fed to the final classification layer. The HGCN method uses a symmetric network architecture, with coarsening and refining layers, which comprises  layers for the citation networks and  layers for the NELL dataset, while four-channel GCNs with  hidden layers are applied in all the layers. The GAT method uses a two-layer network architecture. The first layer consists of  attention heads, each computing  features, followed by the ELU [41] activation function, and the second layer is a single attention head computing  features, followed by the softmax activation function.
The model sizes, i.e. the numbers of model parameters, of all the trained models are reported in Table III. Table IV shows the model architectures learned by the proposed method and the model architectures used by the baseline GCN method.
The results indicate that the network topologies learned by the proposed method on the Cora and Pubmed datasets are much more compact than the fixed network topologies used by the other methods, while the classification performance of our compact models is similar to or better than theirs. Figure 2 illustrates the t-SNE visualization of the feature vectors of the Cora dataset learned at the last layer of the network, before applying the softmax activation. On the NELL dataset, the best performance and the most compact architecture correspond to HGCN, with respect to the results reported in [37]. The second best result belongs to NGCN, which trains more parameters than our method. The best classification accuracy on the Citeseer dataset is achieved by the proposed method. Although our trained model does not have the minimum number of parameters for this dataset, it outperforms the baseline GCN method by .
V-E Analysis of Dataset Statistics
The results of the previous section indicate that all the competing methods perform on par with the baseline GCN, which has a simple network architecture. This can be explained by the benchmark dataset statistics. The dataset complexity is characterized by the number of labeled nodes and the dimensionality of the nodes' feature vectors, and the ratio between them is extremely small for all the benchmark datasets of Table I. It has been recently shown that even heavily regularized linear methods can obtain high performance on classification problems on datasets with such low complexity [42, 43, 44]. Therefore, all the GCN-based methods, with simple or sophisticated network structures, can lead to comparable performance on these widely used benchmark datasets. In [15], it has been experimentally shown that the performance of the GCN-based methods heavily depends on the underlying difficulty of the problem, and nonlinear models with a more complex structure perform significantly better on datasets with a higher ratio. That is, it is expected that the difference in performance of the various methods will increase when the underlying semi-supervised classification problem becomes more complex.
Here we highlight the importance of optimizing the network structure based on the problem's complexity. We compare the performance of the proposed method with the baseline GCN [6] by tuning this ratio using different input data dimensionalities, when  and  of the nodes are labeled in the citation networks and the knowledge graph datasets, respectively. We use the same data splits for both methods and follow the same approach as in [15] to control the ratio, by mapping the input data representations to a subspace through random projections. Specifically, we use a random sketching matrix $\mathbf{R} \in \mathbb{R}^{D \times d}$, $d < D$, which is drawn from a Normal distribution, to obtain new data representations $\mathbf{X}' \in \mathbb{R}^{N \times d}$ as follows:

$\mathbf{X}' = \mathbf{X} \mathbf{R}$  (19)

To avoid bias in the performance values obtained for different values of $d$, we first randomly sample a square matrix $\hat{\mathbf{R}} \in \mathbb{R}^{D \times D}$ and subsequently use its first $d$ columns to map the input data from $\mathbb{R}^D$ to its subspace $\mathbb{R}^d$. This approach guarantees that when a subspace of a higher dimensionality is used, it corresponds to an augmented version of the initial (lower-dimensional) subspace. We ran three experiments for each choice of $d$ on each dataset, and we report the performance on the test data corresponding to the best validation performance.
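The nested-subspace construction described above can be sketched as follows; `nested_projections` is a hypothetical helper illustrating that each lower-dimensional view reuses the leading columns of one fixed sketching matrix:

```python
import numpy as np

def nested_projections(X, dims, seed=0):
    """Map X (N x D) to subspaces of increasing dimensionality using the
    first d columns of one fixed D x D Gaussian sketching matrix R, so
    each higher-dimensional subspace augments the lower-dimensional one."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    R = rng.standard_normal((D, D))
    return {d: X @ R[:, :d] for d in dims}

# Toy data: 5 samples with 100 features, projected to 10/25/50 dimensions.
X = np.random.default_rng(1).standard_normal((5, 100))
views = nested_projections(X, dims=[10, 25, 50])

# The 10-dim view is exactly the first 10 columns of the 25-dim view.
assert np.allclose(views[10], views[25][:, :10])
```

Because the sketching matrix is sampled once, performance differences across values of $d$ reflect the dimensionality alone rather than different random projections.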



Figure 3 compares the classification performance of the GCN and PGCN methods on all datasets when problems of different difficulty are considered. It can be observed that for all data dimensionalities our method performs better than the baseline method, while the difference in classification accuracy is larger for lower dimensionalities, i.e. when the ratio is higher. This is in line with the findings of [15], indicating that in high-dimensional feature spaces neural network structures tend to perform similarly irrespective of their complexity. On the other hand, when the classification problem is encoded in lower-dimensional feature spaces and thus becomes more complex, the structure of the neural network's topology is important. Indeed, as can be observed in Figure 3b, PGCN outperforms GCN by a high margin when the dimensionality of the node representations is low. The comparison of the model complexity with respect to the number of trainable parameters in Figure 3c shows that both methods have similar complexity when lower-dimensional feature spaces are used on the Citeseer and Cora datasets, while PGCN outperforms GCN in terms of classification performance. For the Pubmed dataset, PGCN outperforms GCN in terms of both model complexity and classification performance in all cases. For the NELL dataset, the models learned by the PGCN method are more complex, while they outperform GCN in terms of classification performance in all cases. The difference in performance is striking on the NELL dataset, as PGCN, by optimizing its topology, is able to outperform GCN by a much greater margin for low-dimensional data.
VI Conclusion
In this paper, we proposed a method for progressively training graph convolutional networks for semi-supervised node classification, which jointly defines a problem-dependent compact topology and optimizes its parameters. The proposed method employs a learning process which utilizes the data at the input of each layer to grow the network's structure, both in terms of depth and width. This is achieved by employing an efficient layer-wise propagation rule, leading to a self-organized network structure that exploits the data relationships expressed by the node vector representations and the adjacency matrix of the corresponding graph structure. Experimental results on four commonly used datasets for evaluating graph convolutional networks on semi-supervised classification indicate that the proposed method outperforms the baseline GCN and performs on par, or even better, compared to more complex recently proposed methods in terms of classification performance and efficiency.
Appendix A Proof of Convergence
Here we show that the progressive learning in each layer of the PGCN method converges. Let us assume that $\mathbf{H}_k^{(l)}$ denotes the hidden representations of the data produced by using the first $k$ blocks in GCN layer $l$, and $\mathbf{O}_k$ denotes the fine-tuned weights connecting all the blocks of layer $l$ to the output layer. We prove that the sequence of graph-regularized MSE values $E(\mathbf{H}_k^{(l)}, \mathbf{O}_k)$ is monotonically decreasing, while it is bounded below by zero.
Given the fixed hidden representation $\mathbf{H}_k^{(l)}$, the fine-tuned output weights $\mathbf{O}_k$ are not necessarily the optimal weights in terms of MSE. This can be expressed by the following relation:

$E\big(\mathbf{H}_k^{(l)}, \mathbf{O}_k\big) \ge E\big(\mathbf{H}_k^{(l)}, \mathbf{O}_k^*\big)$  (20)

where $\mathbf{O}_k^*$ denotes the optimized output weights which are obtained by solving the semi-supervised linear regression problem as follows:

$\mathbf{O}_k^* = \underset{\mathbf{O}}{\arg\min} \; \frac{1}{2}\|\mathbf{O}\|_F^2 + \frac{c}{2}\,\|\mathbf{H}_{k,L}^{(l)} \mathbf{O} - \mathbf{T}\|_F^2 + \frac{\lambda}{2N^2}\,\mathrm{tr}\big(\mathbf{O}^\top (\mathbf{H}_k^{(l)})^\top \mathbf{L} \mathbf{H}_k^{(l)} \mathbf{O}\big)$  (21)

where $\mathbf{H}_{k,L}^{(l)}$ denotes the rows of $\mathbf{H}_k^{(l)}$ corresponding to the labeled nodes.
In the next step, when the $(k{+}1)$-th block is added to the layer, the new hidden representation of layer $l$ is $\mathbf{H}_{k+1}^{(l)} = [\mathbf{H}_k^{(l)}, \bar{\mathbf{H}}_{k+1}^{(l)}]$, in which $\mathbf{H}_k^{(l)}$ is fixed from the previous step and $\bar{\mathbf{H}}_{k+1}^{(l)}$ is generated by the new randomly initialized weights. The new optimal output weights $\mathbf{O}_{k+1}^*$, which connect layer $l$ to the output layer, are initialized according to Eq. (21) by substituting $\mathbf{H}_k^{(l)}$ by $\mathbf{H}_{k+1}^{(l)}$. The MSE after adding the block satisfies:

$E\big(\mathbf{H}_{k+1}^{(l)}, \mathbf{O}_{k+1}^*\big) \le E\big(\mathbf{H}_{k+1}^{(l)}, \mathbf{O}\big), \quad \forall\, \mathbf{O}$  (22)

Since Eq. (22) holds for all $\mathbf{O}$, we can set $\mathbf{O} = [\mathbf{O}_k^{*\top}, \mathbf{0}]^\top$, i.e. the previous optimum padded with zeros for the newly added block, which attains the value $E\big(\mathbf{H}_k^{(l)}, \mathbf{O}_k^*\big)$, to obtain the following relation:

$E\big(\mathbf{H}_{k+1}^{(l)}, \mathbf{O}_{k+1}^*\big) \le E\big(\mathbf{H}_k^{(l)}, \mathbf{O}_k^*\big)$  (23)
After finetuning , , the output weights are denoted by and the MSE would be
. It has been proven that stochastic gradient descent converges to a local optimum
[45] with small enough learning rate, so the following relation holds for the MSE:(24) 
According to (24), (25) we have the following relation:
(25) 
which indicates that the sequence is monotonically decreasing.
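The key step of the argument, that enlarging the representation cannot increase the optimal readout error because zero-padding the old weights reproduces the old predictions exactly, can be checked numerically. The sketch below is illustrative: plain least squares stands in for the graph-regularized criterion, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 3))            # regression targets
H_k = rng.normal(size=(100, 8))          # hidden representation with k blocks
h_new = rng.normal(size=(100, 4))        # features of the newly added block
H_k1 = np.hstack([H_k, h_new])           # H_{k+1} = [H_k, h_{k+1}]

def mse(H, W):
    return float(np.mean((H @ W - Y) ** 2))

# Optimal readouts for the old and the enlarged representation (Eq. (21) analogue)
W_k, *_ = np.linalg.lstsq(H_k, Y, rcond=None)
W_k1, *_ = np.linalg.lstsq(H_k1, Y, rcond=None)

# Zero-padding the old weights reproduces the old error exactly, so the
# optimum over the enlarged space can only be at least as good (Eq. (23)).
W_pad = np.vstack([W_k, np.zeros((4, 3))])
assert np.isclose(mse(H_k1, W_pad), mse(H_k, W_k))
assert mse(H_k1, W_k1) <= mse(H_k, W_k) + 1e-12
```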
Based on the connection of the linear activation function combined with the mean-square error to the softmax activation function combined with the cross-entropy criterion and maximum likelihood optimization [46], an analysis following the same steps as above can be used to show that (when the latter is employed) the corresponding criterion sequence is also monotonically decreasing.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 [3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
 [4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [5] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
 [6] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv:1609.02907, 2016.
 [7] T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto, “Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach,” arXiv:1706.05674, 2017.
 [8] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE Transactions on Neural Networks and Learning Systems (Early Access), DOI: 10.1109/TNNLS.2020.2978386, 2019.
 [9] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.
 [10] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
 [11] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in AAAI Conference on Artificial Intelligence, 2018.
 [12] M. Zhang and Y. Chen, “Link prediction based on graph neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 5165–5175.
 [13] X. Zhang, C. Xu, X. Tian, and D. Tao, “Graph edge convolutional neural networks for skeleton-based action recognition,” IEEE Transactions on Neural Networks and Learning Systems (Early Access), DOI: 10.1109/TNNLS.2019.2935173, 2019.
 [14] H. Shi, Y. Zhang, Z. Zhang, N. Ma, X. Zhao, Y. Gao, and J. Sun, “Hypergraph-induced convolutional networks for visual classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 2963–2972, 2018.
 [15] C. Vignac, G. Ortiz-Jimenez, and P. Frossard, “On the choice of graph neural network architectures,” arXiv:1911.05384, 2019.
 [16] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv:1710.09282, 2017.
 [17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
 [18] D. T. Tran, A. Iosifidis, and M. Gabbouj, “Improving efficiency in convolutional neural networks with multilinear filters,” Neural Networks, vol. 105, pp. 328–339, 2018.
 [19] D. T. Tran, S. Kiranyaz, M. Gabbouj, and A. Iosifidis, “Heterogeneous multilayer generalized operational perceptron,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 710–724, 2019.
 [20] S. Wiedemann, K.-R. Müller, and W. Samek, “Compact and computationally efficient representation of deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 772–785, 2019.
 [21] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely automated CNN architecture design based on blocks,” IEEE Transactions on Neural Networks and Learning Systems (Early Access), DOI: 10.1109/TNNLS.2019.2919608, 2019.
 [22] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi, “Watch your step: Learning graph embeddings through attention,” arXiv:1710.09599, 2017.
 [23] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
 [24] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, no. Nov, pp. 2399–2434, 2006.
 [25] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.
 [26] A. Iosifidis, A. Tefas, and I. Pitas, “Regularized extreme learning machine for multi-view semi-supervised action recognition,” Neurocomputing, vol. 145, pp. 250–262, 2014.
 [27] D. Ienco and R. G. Pensa, “Enhancing graph-based semi-supervised learning via knowledge-aware data embedding,” IEEE Transactions on Neural Networks and Learning Systems (Early Access), DOI: 10.1109/TNNLS.2019.2955565, 2019.
 [28] C. Zhuang and Q. Ma, “Dual graph convolutional networks for graph-based semi-supervised classification,” in World Wide Web Conference, 2018, pp. 499–508.
 [29] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model cnns,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5115–5124.
 [30] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
 [31] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2006.
 [32] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, pp. 93–93, 2008.
 [33] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M. Mitchell, “Toward an architecture for neverending language learning,” in AAAI Conference on Artificial Intelligence, 2010.
 [34] Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” arXiv:1603.08861, 2016.
 [35] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv:1710.10903, 2017.
 [36] S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee, “N-GCN: Multi-scale graph convolution for semi-supervised node classification,” arXiv:1802.08888, 2018.
 [37] F. Hu, Y. Zhu, S. Wu, L. Wang, and T. Tan, “Hierarchical graph convolutional networks for semi-supervised node classification,” arXiv:1902.06667, 2019.
 [38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv:1603.04467, 2016.
 [39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
 [40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [41] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv:1511.07289, 2015.
 [42] F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” arXiv:1902.07153, 2019.
 [43] H. NT and T. Maehara, “Revisiting graph neural networks: All we have is low-pass filters,” arXiv:1905.09550, 2019.
 [44] J. Klicpera, A. Bojchevski, and S. Günnemann, “Predict then propagate: Graph neural networks meet personalized pagerank,” arXiv:1810.05997, 2018.
 [45] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers. New York, USA: Springer, 1985, vol. 102, p. 109.
 [46] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.