1 Introduction
Deep learning encompasses a broad class of machine learning methods that use multiple layers of nonlinear processing units in order to learn multilevel representations for detection or classification tasks [1, 2, 3, 4, 5]. The main realizations of deep multilayer architectures are so-called Deep Neural Networks (DNNs), which correspond to Artificial Neural Networks (ANNs) with multiple layers between the input and output layers. DNNs have been shown to perform successfully in processing a variety of signals with an underlying Euclidean or grid-like structure, such as speech, images and videos. Signals with an underlying Euclidean structure usually come in the form of multiple arrays [1] and are known for their statistical properties such as locality, stationarity and hierarchical compositionality from local statistics [6, 7]. For instance, an image can be seen as a function on Euclidean space (the 2D plane) sampled from a grid. In this setting, locality is a consequence of local connections, stationarity results from shift-invariance, and compositionality stems from the intrinsic multiresolution structure of many images [4]. It has been suggested that such statistical properties can be exploited by convolutional architectures via DNNs, namely (deep) Convolutional Neural Networks (CNNs) [8, 9, 10], which are based on four main ideas: local connections, shared weights, pooling, and multiple layers [1]. The role of the convolutional layer in a typical CNN architecture is to detect local features from the previous layer that are shared across the image domain, thus largely reducing the number of parameters compared with traditional fully connected feed-forward ANNs.
Although deep learning models, and in particular CNNs, have achieved highly improved performance on data characterized by an underlying Euclidean structure, many real-world data sets do not have a natural and direct connection with a Euclidean space. Recently there has been interest in extending deep learning techniques to non-Euclidean domains, such as graphs and manifolds [4]. An archetypal example is social networks, which can be represented as graphs with users as nodes and edges representing social ties between them. In biology, gene regulatory networks represent relationships between genes encoding proteins that can up- or down-regulate the expression of other genes. In this paper, we illustrate our results through examples stemming from another kind of relational data with no discernible Euclidean structure, yet with a clear graph formulation, namely, citation networks, where nodes represent documents and an edge is established if one document cites the other [11].
To address the challenge of extending deep learning techniques to graph-structured data, a new class of deep learning algorithms, broadly named Graph Neural Networks (GNNs), has been recently proposed [12, 13, 4]. In this setting, each node of the graph represents a sample, which is described by a feature vector, and we are additionally provided with relational information between the samples that can be formalized as a graph. GNNs are well suited to node (i.e., sample) classification tasks. For a recent survey of this fast-growing field, see [14].
Generalizing convolutions to non-Euclidean domains is not straightforward [15]. Recently, Graph Convolutional Networks (GCNs) have been proposed [16] as a subclass of GNNs with convolutional properties. The GCN architecture combines the full relational information from the graph together with the node features to accomplish the classification task, using the ground truth class assignment of a small subset of nodes during the training phase. GCN has shown improved performance for semi-supervised classification of documents (described by their text) into topic areas, outperforming methods that rely exclusively on text information without the use of any citation information, e.g., multilayer perceptron (MLP) [16].
However, we would not expect such an improvement to be universal. In some cases, the additional information provided by the graph (i.e., the edges) might not be consistent with the similarities between the features of the nodes. In particular, in the case of citation graphs, it is not always the case that documents cite other documents that are similar in content. As we will show below with some illustrative data sets, in those cases the conflicting information provided by the graph means that a graph-less MLP approach outperforms GCN. Here, we explore the relative importance of the graph with respect to the features for classification purposes, and propose a geometric measure based on subspace alignment to explain the relative performance of GCN against different limiting cases.
Our hypothesis is that a degree of alignment among the three layers of information available (i.e., the features, the graph and the ground truth) is needed for GCN to perform well, and that any degradation in the information content leads to an increased misalignment of the layers and worsened performance. We will first use randomization schemes to show that the systematic degradation of the information contained in the graph and the features leads to a progressive worsening of GCN performance. Second, we propose a simple spectral alignment measure, and show that this measure correlates with the classification performance in a number of data sets: (i) a constructive example built to illustrate our work; (ii) CORA, a well-known citation network benchmark; (iii) AMiner, a newly constructed citation network data set; and (iv) two subsets of Wikipedia: Wikipedia I, where GCN outperforms MLP, and Wikipedia II, where instead MLP outperforms GCN.
2 Related work
The first attempt to generalize neural networks on graphs can be traced back to Gori et al. (2005) [17], who proposed a scheme combining recurrent neural networks (RNNs) and random walk models. Their method requires the repeated application of contraction maps as propagation functions until the node representations reach a stable fixed point. This method, however, did not attract much attention when it was proposed. With the current surge of interest in deep learning, this work has been reappraised in a new and modern form: Ref. [18] introduced modern techniques for RNN training based on the original graph neural network framework, whereas Ref. [19] proposed a convolution-like propagation rule on graphs and methods for graph-level classification.
The first formulation of convolutional neural networks on graphs (GCNNs) was proposed by Bruna et al. (2013) [20]. These researchers applied the definition of convolutions to the spectral domain of the graph Laplacian. While being theoretically salient, this method is unfortunately impractical due to its computational complexity. This drawback was addressed by subsequent studies [15, 21]. In particular, Ref. [21] leveraged fast localized convolutions with Chebyshev polynomials. In [16], a GCN architecture was proposed via a first-order approximation of localized spectral filters on graphs. In that work, Kipf and Welling considered the task of semi-supervised transductive node classification where labels are only available for a small number of nodes. Starting with a feature matrix $X$ and a network adjacency matrix $A$, they encoded the graph structure directly using a neural network model $f(X, A)$, and trained on a supervised target loss function $\mathcal{L}$ computed over the subset of nodes with known labels. Their proposed GCN was shown to achieve improved accuracy in classification tasks on several benchmark citation networks and on a knowledge graph data set. Here, we study how the properties of the features and the graph interact in the model proposed by Kipf and Welling for semi-supervised transductive node classification in citation networks. The architecture and propagation rules of this method are detailed in Section 2.2.
2.1 Spectral graph convolutions
We now briefly present the key insights introduced by Bruna et al. [20] to extend CNNs to the non-Euclidean domain. For an extensive recent review, the reader should refer to [4].
We study GCN in the context of a classification task for $N$ samples. Each sample is described by a $C$-dimensional feature vector, which is conveniently arranged into the feature matrix $X \in \mathbb{R}^{N \times C}$. Each sample is also associated with the node of a given graph $\mathcal{G}$ with $N$ nodes, with edges representing additional relational (symmetric) information. This undirected graph is described by the adjacency matrix $A \in \mathbb{R}^{N \times N}$. The ground truth assignment of each node to one of $F$ classes is encoded into a 0-1 membership matrix $Y \in \{0, 1\}^{N \times F}$.
The main hurdle is the definition of a convolution operation on a graph between a filter $g_\theta$ and the node features $X$. This can be achieved by expressing $g_\theta$ onto a basis encoding information about the graph, e.g., the adjacency matrix $A$ or the Laplacian $L = D - A$, where $D = \mathrm{diag}(A \mathbf{1})$. This real symmetric matrix has an eigendecomposition $L = U \Lambda U^T$, where $U$ is the matrix of column eigenvectors with associated eigenvalues collected in the diagonal matrix $\Lambda$. The filters can then be expressed in the eigenbasis $U$ of $L$:
$$g_\theta(L) = U g_\theta(\Lambda) U^T, \tag{1}$$
with the convolution between filter and signal given by:
$$g_\theta \star X = U g_\theta(\Lambda) U^T X. \tag{2}$$
The signal is thus projected onto the space of the graph, filtered in the frequency domain, and projected back onto the nodes.
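The projection, filtering and back-projection described above can be illustrated with a minimal NumPy sketch. This is not the implementation used in this work: it assumes the combinatorial Laplacian (degree matrix minus adjacency matrix), and the function name `spectral_graph_conv`, the toy path graph and the exponential low-pass filter are all hypothetical choices made for illustration.

```python
import numpy as np

def spectral_graph_conv(A, X, filter_fn):
    """Filter the node signals X on the graph with adjacency A in the Laplacian eigenbasis.

    Assumption: the combinatorial Laplacian L = D - A is used as the graph operator.
    """
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A
    # L is real symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
    eigvals, U = np.linalg.eigh(L)
    X_hat = U.T @ X                                # project the signal onto the graph eigenbasis
    X_hat_filtered = filter_fn(eigvals)[:, None] * X_hat  # filter in the frequency domain
    return U @ X_hat_filtered                      # project back onto the nodes

# Toy example (hypothetical): a path graph on 4 nodes with an impulse on node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0], [0.0], [0.0], [0.0]])

low_pass = lambda lam: np.exp(-lam)                # illustrative low-pass filter
X_filtered = spectral_graph_conv(A, X, low_pass)
```

The explicit eigendecomposition in this sketch costs $O(N^3)$ operations, which is the computational burden that the localized approximations of [21] and [16] are designed to avoid.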
2.2 Graph Convolutional Networks
Before moving on to the specific model we used in this work, it is worth remarking on some basic properties of the GCN framework. A GCN is a semi-supervised method, in that a small subset of the node ground truth labels is used in the training phase to infer the class of unlabeled nodes. This type of learning paradigm, where only a small amount of labeled data is available, therefore lies between supervised and unsupervised learning.
Furthermore, the model architecture, and thus the learning, depends explicitly on the structure of the network. Hence the addition of any new data point (i.e., a new node in the network) will require a retraining of the model. GCN is therefore an example of a transductive learning paradigm, where the classifier cannot be generalized to data it has not already seen. Node classification using a GCN can be seen as a label propagation task: given a set of seed nodes with known labels, the task is to predict which label will be assigned to the unlabeled nodes given a certain topology and attributes.
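Layer-wise propagation rule and multilayer architecture
Our study uses the multilayer GCN proposed in [16]. Given the matrix $X$ with sample features and the (undirected) adjacency matrix $A$ of the graph $\mathcal{G}$ encoding relational information between the samples, the propagation rule between layers $l$ and $l+1$ (of size $C_l$ and $C_{l+1}$, respectively) is given by:
$$H^{l+1} = \sigma_l\left(\hat{A} H^{l} W^{l}\right), \tag{3}$$
where $H^{l} \in \mathbb{R}^{N \times C_l}$ and $H^{l+1} \in \mathbb{R}^{N \times C_{l+1}}$ are matrices of activation in the $l$-th and $(l+1)$-th layers, respectively; $\sigma_l(\cdot)$ is the threshold activation function for layer $l$; and the weights connecting layers $l$ and $l+1$ are stored in the matrix $W^{l} \in \mathbb{R}^{C_l \times C_{l+1}}$. Note that the input layer contains the feature matrix $H^{0} \equiv X$.
The graph is encoded in $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I_N$ is the adjacency matrix of a graph with added self-loops, $I_N$ is the identity matrix, and $\tilde{D} = \mathrm{diag}(\tilde{A} \mathbf{1})$ is a diagonal matrix containing the degrees of $\tilde{A}$. In the remainder of this work (and to ensure comparability with the results in [16]), we use $\hat{A}$ as the descriptor of the graph $\mathcal{G}$.
Following [16], we implement a two-layer GCN with propagation rule (3) and different activation functions for each layer, i.e., a rectified linear unit for the first layer and a softmax unit for the output layer:
$$\sigma_0: \quad \mathrm{ReLU}(x_i) = \max(x_i, 0), \tag{4}$$
$$\sigma_1: \quad \mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \tag{5}$$
where $x$ is a vector. The model then takes the simple form:
$$Z = f(X, A) = \mathrm{softmax}\left(\hat{A} \, \mathrm{ReLU}\left(\hat{A} X W^{0}\right) W^{1}\right), \tag{6}$$
where the softmax activation function is applied row-wise and the ReLU is applied element-wise. Note there is only one hidden layer with $H$ units. Hence $W^{0} \in \mathbb{R}^{C \times H}$ maps the input with $C$ features to the hidden layer and $W^{1} \in \mathbb{R}^{H \times F}$ maps these hidden units to the output layer with $F$ units, corresponding to the number of classes of the ground truth. In this semi-supervised multi-class classification, the cross-entropy error over all labeled instances is evaluated as follows:
$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}, \tag{7}$$
where $\mathcal{Y}_L$ is the set of nodes that have labels. The weights of the neural network ($W^{0}$ and $W^{1}$) are trained using gradient descent to minimize the loss $\mathcal{L}$. A visual summary of the GCN architecture is shown in Fig. 1. The reader is referred to [16] for details and in-depth analysis.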
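As an illustration of the two-layer model and the masked cross-entropy loss, the forward pass can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation of [16]: the helper names (`normalize_adjacency`, `gcn_forward`, `masked_cross_entropy`), the toy graph, the random weights and the choice of labeled nodes are hypothetical, and no training loop is included.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalized adjacency with self-loops: D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def softmax_rows(M):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    E = np.exp(M - M.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W0, W1):
    """Two-layer forward pass: softmax(A_hat ReLU(A_hat X W0) W1)."""
    hidden = np.maximum(A_hat @ X @ W0, 0.0)       # element-wise ReLU
    return softmax_rows(A_hat @ hidden @ W1)

def masked_cross_entropy(Z, Y, labeled_idx):
    """Cross-entropy summed over the labeled nodes only (small epsilon avoids log(0))."""
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx] + 1e-12))

# Toy problem (hypothetical sizes): 6 nodes, 4 features, 3 hidden units, 2 classes.
rng = np.random.default_rng(0)
N, C, H_units, F = 6, 4, 3, 2
A = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
X = rng.random((N, C))
Y = np.eye(F)[[0, 0, 0, 1, 1, 1]]                  # 0-1 membership matrix
W0 = rng.normal(size=(C, H_units))
W1 = rng.normal(size=(H_units, F))

A_hat = normalize_adjacency(A)
Z = gcn_forward(A_hat, X, W0, W1)
loss = masked_cross_entropy(Z, Y, labeled_idx=[0, 5])  # only two nodes carry labels
```

In an actual experiment the weights would be updated by gradient descent on `loss`; the sketch only shows how the graph enters the model through the normalized adjacency matrix.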
3 Methods
3.1 Randomization strategies
To test the hypothesis that a degree of alignment across information layers is crucial for a good classification performance of GCN, we gradually randomize the node features, the node connectivity, or both. By controlling the level of randomization, we monitor their effect on classification performance.
3.1.1 Randomization of the graph
The edges of the graph are randomized by rewiring a percentage of edge stubs (i.e., ‘half-edges’) under the constraint that the degree distribution remains unchanged. This randomization strategy is described in Algorithm 1, which is based on the configuration model [22]. Once a randomized realization of the graph is produced, the corresponding $\hat{A}$ is computed.
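A degree-preserving rewiring of this kind can be sketched as follows. This is only an illustration of the idea and not necessarily identical to Algorithm 1: it assumes that a fraction of the edges is selected, that each selected edge is cut into its two stubs, and that the stubs are re-paired uniformly at random. As in the configuration model, self-loops and multi-edges may appear in this sketch. The function names and the ring graph are hypothetical.

```python
import numpy as np

def rewire_stubs(edges, fraction, rng):
    """Cut a fraction of the edges into stubs and re-pair the stubs at random.

    Every node keeps its number of stubs, so the degree sequence is preserved.
    Assumption: self-loops and multi-edges are allowed in the rewired edge list.
    """
    edges = np.asarray(edges)
    n_rewire = int(round(fraction * len(edges)))
    chosen = rng.choice(len(edges), size=n_rewire, replace=False)
    kept = np.delete(edges, chosen, axis=0)
    stubs = edges[chosen].reshape(-1)              # 2 * n_rewire half-edges
    rng.shuffle(stubs)                             # random re-pairing of the stubs
    return np.vstack([kept, stubs.reshape(n_rewire, 2)])

def degree_sequence(edges, n_nodes):
    """Degree of each node, counted as the number of stubs attached to it."""
    return np.bincount(np.asarray(edges).reshape(-1), minlength=n_nodes)

# Toy example (hypothetical): a ring of 10 nodes with half of its edges rewired.
rng = np.random.default_rng(1)
ring = np.array([(i, (i + 1) % 10) for i in range(10)])
rewired = rewire_stubs(ring, 0.5, rng)
```

The rewired edge list can then be converted into an adjacency matrix and normalized in the same way as the original graph.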
3.1.2 Randomization of the features
The features were randomized by swapping feature vectors between a percentage of randomly chosen nodes following the procedure described in Algorithm 2.
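The swapping of feature vectors can be sketched as a permutation of rows of the feature matrix restricted to a random subset of nodes. This is an illustrative sketch and may differ in detail from Algorithm 2; the function name `randomize_features` and the toy matrix are hypothetical.

```python
import numpy as np

def randomize_features(X, fraction, rng):
    """Permute the feature vectors (rows of X) among a random fraction of the nodes."""
    X_rand = X.copy()
    n_nodes = X.shape[0]
    n_swap = int(round(fraction * n_nodes))
    chosen = rng.choice(n_nodes, size=n_swap, replace=False)
    # The chosen nodes exchange their feature vectors; all other nodes are untouched.
    X_rand[chosen] = X[rng.permutation(chosen)]
    return X_rand

# Toy example (hypothetical): 8 nodes with 3 features, half of the nodes randomized.
rng = np.random.default_rng(2)
X = rng.random((8, 3))
X_rand = randomize_features(X, 0.5, rng)
```

Because the rows are only reassigned to different nodes, the singular values of the feature matrix are unchanged, while the correspondence between features and graph neighborhoods is broken.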
A fundamental difference between the two randomization schemes is that the graph randomization alters the spectral properties of the graph as it gradually destroys the graph structure, whereas the randomization of the features preserves the spectral properties of the feature matrix in the principal component analysis (PCA) sense, i.e., the principal values are the same but the loadings on the components are swapped. The feature randomization nevertheless alters the classification performance because the features are reassigned to nodes that have a different environment, thereby changing the result of the convolution operation defined by the activation matrices (3).
3.2 Limiting cases
To interrogate the role that the graph plays in the classification performance of a GCN, it is instructive to consider three limiting cases:

No graph: $A = 0$. If we remove all the edges in the graph, the classifier becomes equivalent to an MLP, a classic feed-forward ANN. The classification is based solely on the information contained in the features, as no graph structure is present to guide the label propagation.

Complete graph: $A = \mathbf{1}\mathbf{1}^T - I_N$. In this case, the mixing of features is immediate and homogeneous, corresponding to a mean field approximation of the information contained in the features.

No features: $X = I_N$. In this case, the label propagation and assignment are purely based on graph topology.
An illustration of these limiting cases can be found in the top row of Table 2.
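The effect of the first two limiting cases on the normalized graph descriptor can be checked numerically. In this minimal sketch, which assumes the symmetric normalization with self-loops used in the GCN propagation rule, the empty graph yields the identity (so that the model reduces to an MLP) and the complete graph yields a uniform averaging operator (mean field). The variable names and the toy size are hypothetical, and the identity matrix is used as the featureless input by assumption.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalized adjacency with self-loops: D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

N = 5
rng = np.random.default_rng(3)
X = rng.random((N, 3))                             # hypothetical feature matrix

A_no_graph = np.zeros((N, N))                      # no edges
A_complete = np.ones((N, N)) - np.eye(N)           # every pair of nodes connected
X_no_features = np.eye(N)                          # assumption: one-hot node identifiers

A_hat_no_graph = normalize_adjacency(A_no_graph)   # identity: each node sees only itself
A_hat_complete = normalize_adjacency(A_complete)   # all entries 1/N: global averaging

mixed = A_hat_complete @ X                         # every row equals the mean feature vector
```

In the complete-graph case all nodes receive the same averaged feature vector, which is why the resulting classifier cannot discriminate between classes.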
3.3 Spectral alignment measure
In order to quantify the alignment between the features, the graph and the ground truth, we propose a measure based on the chordal distance between subspaces, as follows.
3.3.1 Chordal distance between two subspaces
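Recent work by Ye and Lim [23] has shown that the distance between two subspaces of different dimension in $\mathbb{R}^N$ is necessarily defined in terms of their principal angles.
Let $\mathcal{A}$ and $\mathcal{B}$ be two subspaces of the ambient space $\mathbb{R}^N$ with dimensions $\alpha$ and $\beta$, respectively, with $\alpha \le \beta < N$. The principal angles between $\mathcal{A}$ and $\mathcal{B}$, denoted $0 \le \theta_1 \le \theta_2 \le \dots \le \theta_\alpha \le \pi/2$, are defined recursively as follows [24, 25]:
$$\cos\theta_j = \max_{a \in \mathcal{A},\, b \in \mathcal{B}} a^T b = a_j^T b_j, \quad j = 1, \dots, \alpha,$$
subject to $\|a\| = \|b\| = 1$, $a^T a_i = 0$ and $b^T b_i = 0$ for $i = 1, \dots, j-1$.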
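If the minimal principal angle is small, then the two subspaces are nearly linearly dependent, i.e., almost perfectly aligned. A numerically stable algorithm that computes the canonical correlations (i.e., the cosines of the principal angles) between subspaces is given in Algorithm 3.
Algorithm 3 (for matrices $A$ and $B$ whose columns span $\mathcal{A}$ and $\mathcal{B}$, respectively):
1. Compute orthonormal bases $Q_A$ and $Q_B$ for $\mathcal{A}$ and $\mathcal{B}$ using the QR decomposition: $A = Q_A R_A$, $B = Q_B R_B$.
2. Compute the singular value decomposition (SVD) $Q_A^T Q_B = U \Sigma V^T$.
3. Extract the diagonal elements of $\Sigma$: $\Sigma_{jj} = \cos\theta_j$, to obtain the canonical correlations $\cos\theta_1, \dots, \cos\theta_\alpha$.
The principal angles are the basic ingredient of a number of well-defined Grassmannian distances between subspaces [23]. Here we use the chordal distance given by:
$$d_c(\mathcal{A}, \mathcal{B}) = \sqrt{\sum_{j=1}^{\alpha} \sin^2\theta_j}. \tag{8}$$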
The larger the chordal distance $d_c(\mathcal{A}, \mathcal{B})$ is, the worse the alignment between the subspaces $\mathcal{A}$ and $\mathcal{B}$.
We remark that the last inequality in $\alpha \le \beta < N$ is strict. If a subspace spans the whole ambient space (i.e., $\beta = N$), then its distance to all other strict subspaces of $\mathbb{R}^N$ is trivially zero, as it is always possible to find a rotation that aligns the strict subspace with the whole space.
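The QR and SVD steps for the canonical correlations, and the resulting chordal distance, can be sketched with NumPy as follows. This is a minimal sketch that assumes the input matrices have full column rank; the function names and the three-dimensional toy subspaces are hypothetical.

```python
import numpy as np

def canonical_correlations(A, B):
    """Cosines of the principal angles between the column spaces of A and B.

    Assumption: A and B have full column rank, so that the reduced QR factors
    provide orthonormal bases of the two subspaces.
    """
    Q_A, _ = np.linalg.qr(A)
    Q_B, _ = np.linalg.qr(B)
    singular_values = np.linalg.svd(Q_A.T @ Q_B, compute_uv=False)
    # Clip to [0, 1] to remove round-off errors before taking sines.
    return np.clip(singular_values, 0.0, 1.0)

def chordal_distance(A, B):
    """Chordal distance: square root of the sum of squared sines of the principal angles."""
    cos_theta = canonical_correlations(A, B)
    return np.sqrt(np.sum(1.0 - cos_theta**2))

# Toy example (hypothetical): a plane in R^3 compared with three different lines.
plane = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # spanned by e1 and e2
line_inside = np.array([[1.0], [0.0], [0.0]])            # e1, contained in the plane
line_orthogonal = np.array([[0.0], [0.0], [1.0]])        # e3, orthogonal to the plane
line_tilted = np.array([[1.0], [0.0], [1.0]])            # at 45 degrees to the plane

d_inside = chordal_distance(plane, line_inside)
d_orthogonal = chordal_distance(plane, line_orthogonal)
d_tilted = chordal_distance(plane, line_tilted)
```

The number of singular values returned equals the smaller of the two subspace dimensions, which matches the number of principal angles in the definition. SciPy also provides `scipy.linalg.subspace_angles`, which could be used as an alternative for the angle computation.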
3.3.2 Alignment metric
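Our task involves establishing the alignment between three subspaces associated with the features $X$, the graph $\hat{A}$, and the ground truth $Y$. To do so, we consider the distance matrix $D(X, \hat{A}, Y)$ containing all the pairwise chordal distances:
$$D(X, \hat{A}, Y) = \begin{pmatrix} 0 & d_c(X, \hat{A}) & d_c(X, Y) \\ d_c(X, \hat{A}) & 0 & d_c(\hat{A}, Y) \\ d_c(X, Y) & d_c(\hat{A}, Y) & 0 \end{pmatrix}, \tag{9}$$
and we take the Frobenius norm [25] of this matrix as our subspace alignment measure (SAM):
$$S(X, \hat{A}, Y) = \| D(X, \hat{A}, Y) \|_F. \tag{10}$$
The larger $S(X, \hat{A}, Y)$ is, the worse the alignment between the three subspaces. This alignment measure has a geometric interpretation related to the area of the triangle with sides $d_c(X, \hat{A})$, $d_c(X, Y)$ and $d_c(\hat{A}, Y)$ (blue triangle in Fig. 2).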
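Given the three pairwise chordal distances, the alignment measure is a one-line computation. The following minimal sketch assumes the distances have already been obtained; the function name and the numerical values are hypothetical.

```python
import numpy as np

def subspace_alignment_measure(d_XA, d_XY, d_AY):
    """Frobenius norm of the symmetric matrix of pairwise chordal distances."""
    D = np.array([[0.0,  d_XA, d_XY],
                  [d_XA, 0.0,  d_AY],
                  [d_XY, d_AY, 0.0]])
    return np.linalg.norm(D, ord="fro")

# Hypothetical distances between features, graph and ground truth.
sam = subspace_alignment_measure(1.0, 2.0, 2.0)
```

Since each distance appears twice in the matrix, the measure equals $\sqrt{2}$ times the Euclidean norm of the vector of the three distances.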
3.3.3 Determining the dimension of the subspaces
The feature, graph and ground truth matrices $(X, \hat{A}, Y)$ are associated with subspaces of the ambient space $\mathbb{R}^N$, where $N$ is the number of nodes (or samples). These subspaces are spanned by: the eigenvectors of $\hat{A}$, the principal components of the feature matrix $X$, and the principal components of the ground truth matrix $Y$, respectively [26]. The dimension of the graph subspace is $N$; the dimension of the feature subspace is the number of features $C < N$ (in our examples); and the dimension of the ground truth subspace is the number of classes $F$.
The pairwise chordal distances in (9) are computed from a number of minimal angles, corresponding to the smaller of the two dimensions of the subspaces being compared. Hence the dimensions of the subspaces $(k_X, k_{\hat{A}}, k_Y)$ need to be defined to compute the distance matrix $D$. Here, we are interested in finding low dimensional subspaces of features, graph and ground truth with dimensions $(k_X^*, k_{\hat{A}}^*, k_Y^*)$ such that they provide maximum discriminatory power between the original problem and the fully randomized (null) model. To do this, we propose the following criterion:
$$(k_X^*, k_{\hat{A}}^*) = \underset{k_X,\, k_{\hat{A}}}{\arg\max} \left( \left\langle \| D \|_F^{(100)} \right\rangle - \| D \|_F^{(0)} \right), \tag{11}$$
where $\| D \|_F^{(0)}$ is computed for the original problem and $\langle \| D \|_F^{(100)} \rangle$ is the average over the ensemble of fully randomized problems, both evaluated with subspace dimensions $(k_X, k_{\hat{A}}, k_Y^*)$.
We choose $k_Y^*$ equal to the number of ground truth classes since they are non-overlapping [26]. Our optimization selects $k_X^*$ and $k_{\hat{A}}^*$ such that the difference in alignment between the original problem with no randomization (0%) and an ensemble of 100 fully randomized (feature and graph, 100%) problems is maximized (see Appendix for details on the optimization scheme). This criterion maximizes the range of values that $\| D \|_F$ can take, thus augmenting the discriminatory power of the alignment measure when finding the alignment between both data sources and the ground truth, beyond what is expected purely at random. Importantly, the reduced dimensions of features and graph are found simultaneously, since our objective is to quantify the alignment (or amount of shared information) contained in the three subspaces. Our criterion effectively amounts to finding the dimensions of the subspaces that maximize the difference in the surfaces of the blue and red triangles in Fig. 2.
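The selection criterion can be illustrated with an exhaustive search over a grid of candidate dimensions. This sketch is an assumption for illustration only and is not the optimization scheme described in the Appendix: it takes as input two precomputed arrays holding the alignment measure of the original problem and the ensemble-averaged alignment measure of the fully randomized problems for each pair of candidate dimensions. The function name, the grids and the numerical values are hypothetical.

```python
import numpy as np

def select_dimensions(sam_original, sam_randomized_mean, k_X_grid, k_A_grid):
    """Pick the pair of dimensions maximizing the gap between randomized and original SAM.

    sam_original[i, j] and sam_randomized_mean[i, j] are the alignment measures
    obtained with k_X_grid[i] feature dimensions and k_A_grid[j] graph dimensions.
    """
    gap = np.asarray(sam_randomized_mean) - np.asarray(sam_original)
    i, j = np.unravel_index(np.argmax(gap), gap.shape)
    return k_X_grid[i], k_A_grid[j], gap[i, j]

# Hypothetical precomputed values on a 2 x 2 grid of candidate dimensions.
k_X_grid = [10, 20]
k_A_grid = [5, 15]
sam_original = np.array([[1.0, 1.2],
                         [0.8, 1.1]])
sam_randomized_mean = np.array([[1.5, 1.6],
                                [1.9, 1.3]])

best_k_X, best_k_A, best_gap = select_dimensions(
    sam_original, sam_randomized_mean, k_X_grid, k_A_grid
)
```

A full grid search requires one alignment computation per pair of dimensions and per randomized realization, so a coarser grid or a more efficient search may be preferable for large data sets.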
We provide the code to compute our proposed alignment measure at https://github.com/haczqyf/gcndataalignment.
4 Experiments
4.1 Data sets
Relevant statistics of the data sets, including number of nodes and edges, dimension of feature vectors, and number of ground truth classes, are reported in Table 1.
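| Data sets | Nodes ($N$) | Edges | Features ($C$) | Classes ($F$) |
|---|---|---|---|---|
| Constructive | | | | |
| CORA | | | | |
| AMiner | | | | |
| Wikipedia | | | | |
| Wikipedia I | | | | |
| Wikipedia II | | | | |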
4.1.1 Constructive example
To illustrate the alignment measure in a controlled setting, we build a constructive example, consisting of nodes assigned to planted communities of equal size. We then generate a feature matrix and a graph matrix whose structures are aligned with the ground truth assignment matrix. The graph structure is generated using a stochastic block model that reproduces the ground truth structure with some noise: two nodes are connected with a probability if they belong to the same community and otherwise. The feature matrix is constructed in a similar way. The feature vectors are dimensional and binary, i.e., a node either possesses a feature or it does not. Each ground truth cluster is associated with features that are present with a probability of . Each node also has a probability of of possessing each feature characterizing other clusters. Using the same stochastic block structure for both features and graph ensures that they are maximally aligned with the ground truth. This constructive example is then randomized in a controlled way to detect the loss of alignment and the impact this loss of alignment has on the classification performance.
4.1.2 CORA
The CORA data set is a benchmark for classification algorithms using text and citation data (https://linqs.soe.ucsc.edu/data). Each paper is labeled as belonging to one of seven categories (Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, and Theory), which gives the ground truth $Y$. The text of each paper is described by a vector indicating the absence/presence of words in a dictionary of unique words, the dimension of the feature space. The feature matrix $X$ is made from these word vectors. We extracted the largest connected component of this citation graph (undirected) to form the graph adjacency matrix $A$.
4.1.3 AMiner
For additional comparisons, we produced a new data set with similar characteristics to CORA from the academic citation site AMiner. AMiner is a popular scholarly social network service for research purposes only [27], which provides an open database (https://aminer.org/data) with more than data sets encompassing linked-up researcher, conference, and publication data. Among these, the academic social network (https://aminer.org/aminernetwork) is the largest one and includes information on papers, citations, authors, and scientific collaborations. The Chinese Computer Federation (CCF) released a catalog in 2012 including subfields of computer science. Using the AMiner academic social network, Qian et al. [28] extracted papers published from 2010 to 2012, and mapped each paper to a unique subfield of computer science according to the publication venue. Here, we use these assigned categories as the ground truth for a classification task. Using all the papers in [28] that have both abstract and references, we created a data set of similar size to CORA. We extracted the largest connected component from the citation network of all papers in seven subfields (Computer systems/high performance computing, Computer networks, Network/information security, Software engineering/software/programming language, Databases/data mining/information retrieval, Theoretical computer science, and Computer graphics/multimedia) from 2010 to 2011. The resulting AMiner citation network consists of papers with edges. Just as with CORA, we treat the citations as undirected edges, and obtain an adjacency matrix $A$. We further extracted the most frequent stemmed terms from the corpus of abstracts of papers and constructed the feature matrix $X$ for AMiner using bag-of-words.
4.1.4 Wikipedia
As a contrasting example with distinct characteristics, we produced three data sets from the English Wikipedia. Wikipedia provides a well-known, interlinked corpus of documents (articles) in different fields, which ‘cite’ each other via hyperlinks. We started by producing a large corpus of articles, consisting of a mixture of popular and random pages so as to obtain a balanced data set. We retrieved the most accessed articles during the week before the construction of the data set (July 2017), and an additional documents at random using the Wikipedia built-in random function (https://en.wikipedia.org/wiki/Wikipedia:Random). The text and subcategories of each document, together with the names of documents connected to it, were obtained using the Python library Wikipedia (https://github.com/goldsmith/Wikipedia). A few documents (e.g., those with no subcategories) were filtered out during this process. We constructed the citation network of the documents retrieved and extracted the largest connected component. The resulting citation network contained nodes and edges. The text content of each document was converted into a bag-of-words representation based on the most frequent words. To establish the ground truth, we used 12 categories from the API (People, Geography, Culture, Society, History, Nature, Sports, Technology, Health, Religion, Mathematics, Philosophy) and assigned each document to one of them. As part of our investigation, we split this large Wikipedia data set into two smaller subsets of non-overlapping categories: Wikipedia I, consisting of Health, Mathematics, Nature, Sports, and Technology; and Wikipedia II, with the categories Culture, Geography, History, Society, and People.
All six datasets used in our study can be found at https://github.com/haczqyf/gcndataalignment/tree/master/alignment/data.
4.2 GCN architecture, hyperparameters and implementation
We used the GCN implementation (https://github.com/tkipf/gcn) provided by the authors of [16], and followed closely their experimental setup to train and test the GCN on our data sets. We used a two-layer GCN as described in Section 2.2 with the maximum number of training iterations (epochs) set to [29], a learning rate of , and early stopping with a window size of , i.e., training stops if the validation loss does not decrease for consecutive epochs. Other hyperparameters used were: (i) dropout rate: ; (ii) L2 regularization: ; and (iii) number of hidden units: . We initialized the weights as described in [30], and accordingly row-normalized the input feature vectors. For the training, validation and test of the GCN, we used the following split: (i) % of instances as training set; (ii) % as validation set; and (iii) the remaining % as test set. We used this split for all data sets with the exception of the full Wikipedia data set, where we used: (i) % of instances as training set; (ii) % as validation set; and (iii) the remaining % as test set. This modification of the split was necessary to ensure the instances in the training set were evenly distributed across categories.
5 Results
The GCN performance is evaluated using the standard classification accuracy defined as the proportion of nodes correctly classified in the test set.
5.1 GCN: original graph vs. limiting cases
For each data set in Table 1, we trained and tested a GCN with the original graph and feature matrices, and GCN models under the three limiting cases described in Section 3.2. We computed the average accuracy of runs with random weight initializations (Table 2).
| Data sets | GCN (original) | No graph = MLP (only features) | No features (only graph) | Complete graph (mean field) |
|---|---|---|---|---|
| Constructive | 0.932 ± 0.006 | 0.416 ± 0.010 | 0.764 ± 0.009 | 0.100 ± 0.003 |
| CORA | 0.811 ± 0.005 | 0.548 ± 0.014 | 0.691 ± 0.006 | 0.121 ± 0.066 |
| AMiner | 0.748 ± 0.005 | 0.547 ± 0.013 | 0.591 ± 0.006 | 0.123 ± 0.045 |
| Wikipedia | 0.392 ± 0.010 | 0.450 ± 0.007 | 0.254 ± 0.037 | O.O.M. |
| Wikipedia I | 0.861 ± 0.006 | 0.796 ± 0.005 | 0.824 ± 0.003 | 0.163 ± 0.135 |
| Wikipedia II | 0.566 ± 0.021 | 0.659 ± 0.011 | 0.347 ± 0.012 | 0.155 ± 0.176 |
The GCN using all the information available in the features and the graph outperforms MLP (the no graph limit) except in the case of the large Wikipedia set. Hence using the additional information contained in the graph does not necessarily increase the performance of GCN. To investigate this issue further, we split the Wikipedia data set into two subsets: Wikipedia I, with articles in topics that tend to be more self-referential (e.g., Mathematics or Technology) and Wikipedia II, containing pages in areas that are less self-contained (e.g., Culture or Society). We observed that GCN outperforms MLP for Wikipedia I but the opposite is still true for Wikipedia II. Finally, we also observe that the performance of ‘No features’ is always lower than the performance of GCN, and, as expected, the performance of ‘Complete graph’ (i.e., mean field) is very low and close to pure chance (i.e., $1/F$).
5.2 Performance of GCN under randomization
The results above lead us to pose the hypothesis that a degree of synergy between features, graph and ground truth is needed for GCN to perform well. To investigate this hypothesis, we use the randomization schemes described in Section 3.1 to degrade systematically the information content of the graph and/or the features in our data sets. Fig. 3 presents the performance of the GCN as a function of the percent of randomization of the graph structure, the features, or both. As expected, the accuracy decreases for all data sets as the information contained in the graph, features or both is scrambled, yet with differences in the decay rate of each of the ingredients for the different examples.
Note that the chance-level performance of the ‘Complete graph’ (mean field) limiting case is achieved only when both graph and features are fully randomized, whereas the accuracy of the two other limiting cases (‘No graph = MLP’, ‘No features’) is reached around the half-point (approximately 50%) of randomization of the graph or of the features, respectively. This indicates that using the scrambled information above a certain degree of randomization becomes more detrimental to the classification performance than simply ignoring it.
5.3 Relating GCN performance and subspace alignment
We tested whether the degradation of GCN performance is linked to the increased misalignment of features, graph and ground truth given by the subspace alignment measure:
$$S^*(X, \hat{A}, Y) = \| D(X, \hat{A}, Y) \|_F \ \text{ evaluated at } (k_X^*, k_{\hat{A}}^*, k_Y^*), \tag{12}$$
which corresponds to (10) computed with the dimensions obtained using (11) (Table 3, and see Appendix for the optimization scheme used). Fig. 4 shows that the GCN accuracy is clearly (anti)correlated with the subspace alignment distance (12) in all our examples (mean correlation ). As we randomize the graph and/or features, the subspace misalignment increases and the GCN performance decreases.
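| Data sets | $k_X^*$ | $k_{\hat{A}}^*$ | $k_Y^*$ |
|---|---|---|---|
| Constructive example | 287 | 10 | 10 |
| CORA | 1,291 | 190 | 7 |
| AMiner | 500 | 57 | 7 |
| Wikipedia I | 68 | 1,699 | 5 |
| Wikipedia II | 100 | 1,125 | 5 |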
6 Discussion
Our first set of experiments (Table 2) reflects the varying amount of information that GCN can extract from features, graph and their combination, for the purpose of classification. For a classifier to perform well, it is necessary to find (possibly nonlinear) combinations of features that map differentially and distinctively onto the categories of the ground truth. The larger the difference (or distance on the projected space) between the samples of each category, the easier it is to ‘separate’ them, and the better the classifier. In the MLP setting, for instance, the weights between layers ($W^{l}$) are trained to maximize this separation. As seen by the different accuracies in the ‘No graph’ column (Table 2), the features of each example contain a variable amount of information that is mappable onto its ground truth. A similar reasoning applies to classification based on graph information alone, but in this case, it is the eigenvectors of $\hat{A}$ that need to be combined to produce distinguishing features between the categories in the ground truth (e.g., if the graph substructures across scales [31] do not map onto the separation lines of the ground truth categories, then the classification performance based on the graph will deteriorate). The accuracies in the ‘No features’ column indicate that some of the graphs contain more congruent information with the ground truth than others. Therefore, the ‘No graph’ and ‘No features’ limiting cases inform about the relative congruence of each type of information with respect to the ground truth. One can then conjecture that if the performance of the ‘No features’ case is higher than the ‘No graph’ case, GCN will yield better results than MLP.
In addition, our numerics show that although combining both sources of information generally leads to improved classification performance (‘GCN original’ column in Table 2), this is not always the case. Indeed, for the Wikipedia and Wikipedia II examples, the classification performance of the MLP (‘No graph’), which is agnostic to relationships between samples, is better than when the additional layer of relational information about the samples (i.e., the graph) is incorporated via the GCN architecture. This suggests that, for improved GCN classification, the information contained in features and graph needs to be constructively aligned with the ground truth. This phenomenon can be intuitively understood as follows. In the absence of a graph (i.e., the MLP setting) the training of the layer weights is done independently over the samples, without assuming any relationship between them. In GCN, on the other hand, the role of the graph is to guide the training of the weights by averaging the features of a node with those of its graph neighbors. The underlying assumption is that the relationships represented by the graph should be consistent with the information of their features, i.e., the features of nodes that are graph neighbors are expected to be more similar than otherwise; hence the training process is biased towards convolving the diffusing information on the graph to extract improved feature descriptions for the classifier. However, if feature similarities and graph neighborhoods (or more generally, graph communities [31]) are not congruent, this graph-based averaging during the training is not beneficial.
To explore this issue in a controlled fashion, our second set of experiments (Fig. 3) studied the degradation of the classification performance induced by the systematic randomization of graph structure and/or features. The erosion of information is not uniform across our examples, reflecting the relative salience of each of the components (features and graph) for classification. Note that the GCN is able to leverage the information present in either of the two components, and is only degraded to chance-level performance when both graph and features are fully randomized. Interestingly, this fully randomized (chance-level) performance coincides with that of the ‘Complete graph’ (or mean field) limiting case, where the classifier is trained on features averaged over all the samples, thus leading to a uniform representation that has zero discriminating power when it comes to category assignment.
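The ‘Complete graph’ limit can be checked numerically with a short sketch. It again assumes the symmetric normalization with self-loops of Kipf and Welling [16]; the sizes and the random features are hypothetical. For a complete graph, the propagation matrix has all entries equal to 1/N, so every node receives the same feature vector, namely the mean over all samples.

```python
import numpy as np

def normalized_adjacency(A):
    """D^{-1/2} (A + I) D^{-1/2}; assumed propagation matrix, as in [16]."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

N, F = 5, 3                                  # hypothetical sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(N, F))                  # hypothetical feature matrix

A_complete = np.ones((N, N)) - np.eye(N)     # every pair of nodes is linked
A_hat = normalized_adjacency(A_complete)
X_mean_field = A_hat @ X                     # all rows collapse to the feature mean
```

Because all rows of `X_mean_field` are identical, no classifier acting on them can distinguish between nodes, which is consistent with the chance-level performance reported for this limiting case.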
These results suggest that a degree of constructive alignment between the matrices of features, graph and ground truth is necessary for GCN to operate successfully beyond standard classifiers. To capture this idea, we proposed a simple subspace alignment measure (SAM), Eq. (12), that uses the minimal principal angles to capture the consistency of pairwise projections between subspaces. Fig. 4 shows that SAM correlates well with the classification performance and captures the monotonic dependence remarkably well, given that SAM is a simple linear measure being applied to the outcome of a highly nonlinear, optimized system.
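As an illustration of the ingredients involved, the sketch below computes principal angles between column spaces with the QR and SVD approach of Björck and Golub [24]. The way the angles are aggregated here (a chordal distance for each pair of subspaces, collected into a symmetric matrix whose Frobenius norm is returned) is an assumption made for illustration only and is not claimed to reproduce Eq. (12). The function names and the toy subspaces are hypothetical.

```python
import numpy as np

def principal_angles(U, V):
    """Principal angles (radians, ascending) between the column spaces of U and V.

    Returns min(rank U, rank V) angles; assumes both inputs have full column rank.
    """
    Qu, _ = np.linalg.qr(U)                  # orthonormal basis of span(U)
    Qv, _ = np.linalg.qr(V)                  # orthonormal basis of span(V)
    sigma = np.linalg.svd(Qu.T @ Qv, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))

def chordal_distance(U, V):
    """Chordal distance built from the principal angles (illustrative choice)."""
    theta = principal_angles(U, V)
    return float(np.sqrt(np.sum(np.sin(theta) ** 2)))

def pairwise_alignment(subspaces):
    """Frobenius norm of the matrix of pairwise chordal distances (assumed form)."""
    m = len(subspaces)
    D = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            D[i, j] = D[j, i] = chordal_distance(subspaces[i], subspaces[j])
    return float(np.linalg.norm(D, "fro"))

# Hypothetical subspaces of R^4 spanned by coordinate axes.
I4 = np.eye(4)
U, V, W = I4[:, [0, 1]], I4[:, [0, 2]], I4[:, [2, 3]]
score = pairwise_alignment([U, V, W])
```

With this distance-based convention, a smaller value indicates subspaces that are closer to each other; identical subspaces give zero.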
The alignment measure can be used to evaluate the relative importance of features and graph for classification without explicitly running the GCN, by comparing the SAM under full randomization of the features against the SAM under full randomization of the graph. If randomizing the features degrades the alignment more than randomizing the graph, the features play a more important role in GCN classification, indicating that MLP could potentially yield better results (e.g., in Wikipedia II). Conversely, if randomizing the graph degrades the alignment more, the graph is more important in GCN classification, suggesting that GCN should outperform MLP (e.g., the constructive example and CORA).
7 Conclusion
Here, we have introduced a subspace alignment measure (SAM), Eq. (12), to quantify the consistency between the feature and graph ingredients of data sets, and showed that it correlates well with the classification performance of GCNs. Our experiments show that a degree of alignment is needed for a GCN approach to be beneficial, and that using a GCN can actually be detrimental to the classification performance if the feature and graph subspaces associated with the data are not constructively aligned (e.g., Wikipedia and Wikipedia II). The SAM potentially has a wider range of applications in the quantification of data alignment in general. It could be used, among others, to quantify the alignment of different graphs associated with, or obtained from, particular data sets; to evaluate the quality of classifications found using unsupervised methods; or to aid in choosing the classifier architecture most advantageous computationally given a particular data set.
Acknowledgment
Yifan Qian acknowledges financial support from the China Scholarship Council program (No. 201706020176). Paul Expert and Mauricio Barahona acknowledge support through the EPSRC grant EP/N014529/1 funding the EPSRC Centre for Mathematics of Precision Healthcare.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
 [3] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
 [4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond Euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
 [5] L. Deng, D. Yu et al., “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
 [6] E. P. Simoncelli and B. A. Olshausen, “Natural image statistics and neural representation,” Annual Review of Neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
 [7] D. J. Field, “What the statistics of natural images tell us about visual coding,” in Human Vision, Visual Processing, and Digital Display, vol. 1077. International Society for Optics and Photonics, 1989, pp. 269–277.
 [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [9] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a backpropagation network,” in Advances in Neural Information Processing Systems, 1990, pp. 396–404.
 [10] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
 [11] D. Lazer, A. S. Pentland, L. Adamic, S. Aral, A. L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann et al., “Life in the network: the coming age of computational social science,” Science, vol. 323, no. 5915, p. 721, 2009.
 [12] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
 [13] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
 [14] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” arXiv preprint arXiv:1901.00596, 2019.
 [15] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
 [16] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
 [17] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings of IEEE International Joint Conference on Neural Networks, vol. 2. IEEE, 2005, pp. 729–734.
 [18] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
 [19] S. Sukhbaatar, R. Fergus et al., “Learning multiagent communication with backpropagation,” in Advances in Neural Information Processing Systems, 2016, pp. 2244–2252.
 [20] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
 [21] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
 [22] M. E. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.
 [23] K. Ye and L.-H. Lim, “Schubert varieties and distances between subspaces of different dimensions,” SIAM Journal on Matrix Analysis and Applications, vol. 37, no. 3, pp. 1176–1197, 2016.
 [24] A. Björck and G. H. Golub, “Numerical methods for computing angles between linear subspaces,” Mathematics of Computation, vol. 27, no. 123, pp. 579–594, 1973.
 [25] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2012, vol. 3.
 [26] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
 [27] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 990–998.
 [28] Y. Qian, W. Rong, N. Jiang, J. Tang, and Z. Xiong, “Citation regression analysis of computer science publications in different ranking categories and subfields,” Scientometrics, vol. 110, no. 3, pp. 1351–1374, 2017.
 [29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [30] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
 [31] R. Lambiotte, J. Delvenne, and M. Barahona, “Random walks, Markov processes and the multiscale modular organization of complex networks,” IEEE Transactions on Network Science and Engineering, vol. 1, no. 2, pp. 76–90, July 2014.
Appendix: Finding and
A key element of the subspace alignment measure described in the main text is to find lower dimensional representations of the graph, features and ground truth.
To determine the dimension of the representative subspaces, we propose the following heuristic:
(13) 
We choose the dimension of the ground-truth subspace to be equal to the number of categories in the ground truth, as the categories are non-overlapping. Thus, the dimensions of the feature and graph subspaces range from this value to their maximum values: the dimension of the feature vectors and the number of nodes in the graph, respectively.
To find the values for and that maximize , we scan different possible combinations of and . We applied two rounds of scanning. In the first scanning round, in the intervals of and , we picked equally spaced values that contain the minimum and maximum possible values for and . For example, in CORA, equals because the number of categories in the ground truth is . Thus ranges from to . At the end of the first round, the optimal values of and are and , respectively (see Fig. 4(c)). In the second scanning round, we applied a very similar process to the one just described. We set the scanning intervals of and as the neighbors of and found in the first round, respectively. For example, in CORA, for the second round, we set the intervals of and as and . Again, we split the new intervals with equally spaced values. We have also shown the scanning results for other data sets in Figure 5.
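The two-round scan described above can be sketched as a coarse-to-fine grid search. The sketch is generic: the objective is passed in as a callable and stands in for the quantity maximized in (13), which is not reproduced here. The number of grid points `n_points`, the helper names and the toy objective are all assumptions made for illustration.

```python
import numpy as np

def grid(lo, hi, n_points):
    """Equally spaced integer grid from lo to hi, both end points included."""
    return np.unique(np.linspace(lo, hi, n_points).round().astype(int))

def scan(objective, ks_x, ks_a):
    """Return (k_x, k_a, value) maximizing the objective over the Cartesian grid."""
    best = None
    for kx in ks_x:
        for ka in ks_a:
            val = objective(int(kx), int(ka))
            if best is None or val > best[2]:
                best = (int(kx), int(ka), val)
    return best

def neighbors(ks, k_best):
    """Interval spanned by the grid neighbors of k_best (clipped at the ends)."""
    i = int(np.searchsorted(ks, k_best))
    return int(ks[max(i - 1, 0)]), int(ks[min(i + 1, len(ks) - 1)])

def two_round_scan(objective, kx_range, ka_range, n_points=10):
    """Coarse scan over the full ranges, then a finer scan around the optimum."""
    ks_x, ks_a = grid(*kx_range, n_points), grid(*ka_range, n_points)
    kx1, ka1, _ = scan(objective, ks_x, ks_a)             # first round
    ks_x2 = grid(*neighbors(ks_x, kx1), n_points)         # refine around kx1
    ks_a2 = grid(*neighbors(ks_a, ka1), n_points)         # refine around ka1
    return scan(objective, ks_x2, ks_a2)                  # second round

# Hypothetical smooth objective with a single peak at (37, 210).
def toy_objective(kx, ka):
    return -((kx - 37) ** 2 + (ka - 210) ** 2)

kx_opt, ka_opt, value = two_round_scan(toy_objective, (7, 100), (7, 500))
```

Since each round evaluates only a fixed number of grid points, the cost does not grow with the size of the ranges, but the search can miss the optimum when the objective is not smooth at the scale of the first grid.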