Quantifying the alignment of graph and features in deep learning

05/30/2019, by Yifan Qian et al. (Criteo, Queen Mary University of London, Imperial College London)

We show that the classification performance of Graph Convolutional Networks is related to the alignment between features, graph and ground truth, which we quantify using a subspace alignment measure corresponding to the Frobenius norm of the matrix of pairwise chordal distances between three subspaces associated with features, graph and ground truth. The proposed measure is based on the principal angles between subspaces and has both spectral and geometrical interpretations. We showcase the relationship between the subspace alignment measure and the classification performance through the study of limiting cases of Graph Convolutional Networks as well as systematic randomizations of both features and graph structure applied to a constructive example and several examples of citation networks of different origin. The analysis also reveals the relative importance of the graph and features for classification purposes.


1 Introduction

Deep learning encompasses a broad class of machine learning methods that use multiple layers of nonlinear processing units in order to learn multi-level representations for detection or classification tasks [1, 2, 3, 4, 5]. The main realizations of deep multi-layer architectures are so-called Deep Neural Networks (DNNs), which correspond to Artificial Neural Networks (ANNs) with multiple layers between input and output layers. DNNs have been shown to perform successfully in processing a variety of signals with an underlying Euclidean or grid-like structure, such as speech, images and videos. Signals with an underlying Euclidean structure usually come in the form of multiple arrays [1] and are known for their statistical properties such as locality, stationarity and hierarchical compositionality from local statistics [6, 7]. For instance, an image can be seen as a function on Euclidean space (the 2D plane) sampled from a grid. In this setting, locality is a consequence of local connections, stationarity results from shift-invariance, and compositionality stems from the intrinsic multi-resolution structure of many images [4]. It has been suggested that such statistical properties can be exploited by convolutional architectures via DNNs, namely (deep) Convolutional Neural Networks (CNNs) [8, 9, 10], which are based on four main ideas: local connections, shared weights, pooling, and multiple layers [1]. The role of the convolutional layer in a typical CNN architecture is to detect local features from the previous layer that are shared across the image domain, thus largely reducing the parameters compared with traditional fully connected feed-forward ANNs.

Although deep learning models, and in particular CNNs, have achieved highly improved performance on data characterized by an underlying Euclidean structure, many real-world data sets do not have a natural and direct connection with a Euclidean space. Recently there has been interest in extending deep learning techniques to non-Euclidean domains, such as graphs and manifolds [4]. An archetypal example is social networks, which can be represented as graphs with users as nodes and edges representing social ties between them. In biology, gene regulatory networks represent relationships between genes encoding proteins that can up- or down-regulate the expression of other genes. In this paper, we illustrate our results through examples stemming from another kind of relational data with no discernible Euclidean structure, yet with a clear graph formulation, namely, citation networks, where nodes represent documents and an edge is established if one document cites the other [11].

To address the challenge of extending deep learning techniques to graph-structured data, a new class of deep learning algorithms, broadly named Graph Neural Networks (GNNs), has been recently proposed [12, 13, 4]. In this setting, each node of the graph represents a sample, which is described by a feature vector, and we are additionally provided with relational information between the samples that can be formalized as a graph. GNNs are well suited to node (i.e., sample) classification tasks. For a recent survey of this fast-growing field, see [14].

Generalizing convolutions to non-Euclidean domains is not straightforward [15]. Recently, Graph Convolutional Networks (GCNs) have been proposed [16] as a subclass of GNNs with convolutional properties. The GCN architecture combines the full relational information from the graph together with the node features to accomplish the classification task, using the ground truth class assignment of a small subset of nodes during the training phase. GCN has shown improved performance for semi-supervised classification of documents (described by their text) into topic areas, outperforming methods that rely exclusively on text information without the use of any citation information, e.g., multilayer perceptron (MLP) [16].

However, we would not expect such an improvement to be universal. In some cases, the additional information provided by the graph (i.e., the edges) might not be consistent with the similarities between the features of the nodes. In particular, in the case of citation graphs, it is not always the case that documents cite other documents that are similar in content. As we will show below with some illustrative data sets, in those cases the conflicting information provided by the graph means that a graph-less MLP approach outperforms GCN. Here, we explore the relative importance of the graph with respect to the features for classification purposes, and propose a geometric measure based on subspace alignment to explain the relative performance of GCN against different limiting cases.

Our hypothesis is that a degree of alignment among the three layers of information available (i.e., the features, the graph and the ground truth) is needed for GCN to perform well, and that any degradation in the information content leads to an increased misalignment of the layers and worsened performance. We will first use randomization schemes to show that the systematic degradation of the information contained in the graph and the features leads to a progressive worsening of GCN performance. Second, we propose a simple spectral alignment measure, and show that this measure correlates with the classification performance in a number of data sets: (i) a constructive example built to illustrate our work; (ii) CORA, a well-known citation network benchmark; (iii) AMiner, a newly constructed citation network data set; and (iv) two subsets of Wikipedia: Wikipedia I, where GCN outperforms MLP, and Wikipedia II, where instead MLP outperforms GCN.

2 Related work

The first attempt to generalize neural networks to graphs can be traced back to Gori et al. (2005) [17], who proposed a scheme combining recurrent neural networks (RNNs) and random walk models. Their method requires the repeated application of contraction maps as propagation functions until the node representations reach a stable fixed point. This method, however, did not attract much attention when it was proposed. With the current surge of interest in deep learning, this work has been reappraised in a new and modern form: Ref. [18] introduced modern techniques for RNN training based on the original graph neural network framework, whereas Ref. [19] proposed a convolution-like propagation rule on graphs and methods for graph-level classification.

The first formulation of convolutional neural networks on graphs (GCNNs) was proposed by Bruna et al. (2013) [20], who applied the definition of convolutions to the spectral domain of the graph Laplacian. While theoretically salient, this method is unfortunately impractical due to its computational complexity. This drawback was addressed by subsequent studies [15, 21]. In particular, Ref. [21] leveraged fast localized convolutions with Chebyshev polynomials. In [16], a GCN architecture was proposed via a first-order approximation of localized spectral filters on graphs. In that work, Kipf and Welling considered the task of semi-supervised transductive node classification where labels are only available for a small number of nodes. Starting with a feature matrix X and a network adjacency matrix A, they encoded the graph structure directly using a neural network model f(X, A), and trained on a supervised target loss function computed over the subset of nodes with known labels. Their proposed GCN was shown to achieve improved accuracy in classification tasks on several benchmark citation networks and on a knowledge graph data set. Here, we study how the properties of the features and the graph interact in the model proposed by Kipf and Welling for semi-supervised transductive node classification in citation networks. The architecture and propagation rules of this method are detailed in Section 2.2.

2.1 Spectral graph convolutions

We now briefly present the key insights introduced by Bruna et al. [20] to extend CNNs to the non-Euclidean domain. For an extensive recent review, the reader is referred to [4].

We study GCN in the context of a classification task for N samples. Each sample is described by an F-dimensional feature vector, which is conveniently arranged into the N × F feature matrix X. Each sample is also associated with a node of a given graph G with N nodes, whose edges represent additional relational (symmetric) information. This undirected graph is described by its adjacency matrix A. The ground truth assignment of each node to one of C classes is encoded into a 0-1 membership matrix Y of size N × C.

The main hurdle is the definition of a convolution operation on a graph between a filter g and the node features. This can be achieved by expressing the filter on a basis encoding information about the graph, e.g., the adjacency matrix A or the graph Laplacian L = D − A, where D is the diagonal matrix of node degrees. This real symmetric matrix has an eigendecomposition L = U Λ U^⊤, where U is the matrix of column eigenvectors with associated eigenvalues collected in the diagonal matrix Λ. The filters can then be expressed in the eigenbasis of L:

\hat{g} = U^\top g , \qquad (1)

with the convolution between filter g and signal x given by:

g \star x = U \left( \hat{g} \odot (U^\top x) \right) = U \, \mathrm{diag}(\hat{g}) \, U^\top x . \qquad (2)

The signal x is thus projected onto the space of the graph, filtered in the frequency domain, and projected back onto the nodes.
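As an illustration, the following minimal NumPy sketch (our own code, not the authors'; the helper name spectral_filter and the heat-kernel gain are illustrative choices) carries out the three steps of Eq. (2): project a node signal onto the Laplacian eigenbasis, rescale it in the frequency domain, and project it back onto the nodes.

```python
import numpy as np

def spectral_filter(A, x, spectral_gain):
    """Filter a node signal x in the spectral domain of the graph Laplacian.

    A             : (N, N) symmetric adjacency matrix
    x             : (N,) signal defined on the nodes
    spectral_gain : function mapping eigenvalues to filter coefficients
    """
    D = np.diag(A.sum(axis=1))                     # degree matrix
    L = D - A                                      # combinatorial Laplacian
    lam, U = np.linalg.eigh(L)                     # eigendecomposition L = U diag(lam) U^T
    x_hat = U.T @ x                                # project signal onto the graph spectrum
    x_hat_filtered = spectral_gain(lam) * x_hat    # filter in the frequency domain
    return U @ x_hat_filtered                      # project back onto the nodes

# Toy usage: heat-kernel-like low-pass filtering on a 4-node path graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])
print(spectral_filter(A, x, spectral_gain=lambda lam: np.exp(-lam)))
```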

2.2 Graph Convolutional Networks

Before moving on to the specific model used in this work, it is worth remarking on some basic properties of the GCN framework. A GCN is a semi-supervised method, in that a small subset of the node ground truth labels is used in the training phase to infer the class of the unlabeled nodes. This type of learning paradigm, where only a small amount of labeled data is available, therefore lies between supervised and unsupervised learning.

Furthermore, the model architecture, and thus the learning, depends explicitly on the structure of the network. Hence the addition of any new data point (i.e., a new node in the network) will require a retraining of the model. GCN is therefore an example of a transductive learning paradigm, where the classifier cannot be generalized to data it has not already seen. Node classification using a GCN can be seen as a label propagation task: given a set of seed nodes with known labels, the task is to predict which label will be assigned to the unlabeled nodes given a certain topology and attributes.

Layer-wise propagation rule and multi-layer architecture

Our study uses the multi-layer GCN proposed in [16]. Given the matrix X with sample features and the (undirected) adjacency matrix A of the graph encoding relational information between the samples, the propagation rule between layers l and l+1 (of size D_l and D_{l+1}, respectively) is given by:

H^{(l+1)} = \sigma^{(l)}\!\left( \hat{A} \, H^{(l)} \, W^{(l)} \right) , \qquad (3)

where H^{(l)} and H^{(l+1)} are the matrices of activations in the l-th and (l+1)-th layers, respectively; \sigma^{(l)} is the threshold activation function for layer l; and the weights connecting layers l and l+1 are stored in the matrix W^{(l)}. Note that the input layer contains the feature matrix, H^{(0)} = X.

The graph is encoded in Â = D̃^{-1/2} Ã D̃^{-1/2}, where Ã = A + I_N is the adjacency matrix of the graph with added self-loops, I_N is the identity matrix, and D̃ is the diagonal matrix containing the degrees of Ã. In the remainder of this work (and to ensure comparability with the results in [16]), we use Â as the descriptor of the graph G.
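For concreteness, a minimal sketch (our own helper, not taken from the GCN reference implementation) of how Â can be computed from a dense adjacency matrix:

```python
import numpy as np

def renormalized_adjacency(A):
    """Return A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2} for a symmetric adjacency A."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))   # diagonal of D_tilde^{-1/2}
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```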

Figure 1: Schematic illustration of the Graph Convolutional Network used. The graph operator Â is applied to the activations of layer l before they are funneled into the input of layer l+1. The process is repeated until the output has dimension C and produces a predicted class assignment. During the training phase, the predicted assignments are compared against a subset of values of the ground truth.

Following [16], we implement a two-layer GCN with propagation rule (3) and different activation functions for each layer, i.e., a rectified linear unit for the first layer and a softmax unit for the output layer:

\mathrm{ReLU}(x)_i = \max(0, x_i) , \qquad (4)

\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} , \qquad (5)

where x is a vector. The model then takes the simple form:

Z = \mathrm{softmax}\!\left( \hat{A} \, \mathrm{ReLU}\!\left( \hat{A} X W^{(0)} \right) W^{(1)} \right) , \qquad (6)

where the softmax activation function is applied row-wise and the ReLU is applied element-wise. Note that there is only one hidden layer with H units. Hence W^{(0)} maps the input with F features to the hidden layer, and W^{(1)} maps these hidden units to the output layer with C units, corresponding to the number of classes of the ground truth. In this semi-supervised multi-class classification, the cross-entropy error over all labeled instances is evaluated as follows:

\mathcal{L} = - \sum_{i \in \mathcal{Y}_L} \sum_{c=1}^{C} Y_{ic} \ln Z_{ic} , \qquad (7)

where \mathcal{Y}_L is the set of nodes that have labels. The weights of the neural network (W^{(0)} and W^{(1)}) are trained using gradient descent to minimize the loss \mathcal{L}. A visual summary of the GCN architecture is shown in Fig. 1. The reader is referred to [16] for details and in-depth analysis.
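To make Eqs. (6)-(7) concrete, here is a minimal NumPy sketch of the forward pass and the masked cross-entropy loss (our own illustration with random placeholder weights; the actual model in [16] trains W^{(0)} and W^{(1)} by gradient descent):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))   # row-wise, numerically stable
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W0, W1):
    """Two-layer GCN of Eq. (6): Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    H = relu(A_hat @ X @ W0)        # hidden layer activations
    return softmax(A_hat @ H @ W1)  # predicted class probabilities per node

def masked_cross_entropy(Z, Y, labelled_idx):
    """Cross-entropy of Eq. (7), evaluated over the labelled nodes only."""
    return -np.sum(Y[labelled_idx] * np.log(Z[labelled_idx] + 1e-12))

# Toy usage with random data (N=5 nodes, F=4 features, H=3 hidden units, C=2 classes)
rng = np.random.default_rng(0)
N, F, H, C = 5, 4, 3, 2
A_hat = np.eye(N)                                  # 'no graph' limiting case, for illustration
X = rng.random((N, F))
W0, W1 = rng.normal(size=(F, H)), rng.normal(size=(H, C))
Y = np.eye(C)[rng.integers(0, C, size=N)]          # one-hot ground truth
Z = gcn_forward(A_hat, X, W0, W1)
print(masked_cross_entropy(Z, Y, labelled_idx=[0, 1]))
```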

3 Methods

3.1 Randomization strategies

To test the hypothesis that a degree of alignment across information layers is crucial for a good classification performance of GCN, we gradually randomize the node features, the node connectivity, or both. By controlling the level of randomization, we monitor their effect on classification performance.

3.1.1 Randomization of the graph

The edges of the graph are randomized by rewiring a percentage of edge stubs (i.e., ‘half-edges’) under the constraint that the degree distribution remains unchanged. This randomization strategy is described in Algorithm 1, which is based on the configuration model [22]. Once a randomized realization of the graph is produced, the corresponding Â is computed.

Input: A graph G = (V, E), where V is the set of nodes and E is the set of edges, and a randomization percentage p.
Output: A randomized graph G′ = (V, E′).
1. Choose a random subset of edges E_r from E, with |E_r| set by the percentage p, and denote the unrandomized edges in E as E_u = E \ E_r.
2. Obtain the degree sequence of nodes from E_r, and build a stub list S based on the degree sequence.
3. Obtain a randomized stub list S′ by shuffling S, and randomized edges E_r′ by connecting the stubs in the corresponding positions of the two stub lists S and S′.
4. Compute E_u ∪ E_r′, remove multi-edges and self-loops, and obtain the final edge set E′.
5. Generate the randomized graph G′ from node set V and edge set E′.
Algorithm 1 Randomization of the graph
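A possible Python sketch of this scheme (our reading of Algorithm 1, not the authors' implementation): a fraction p of edges is selected, their stubs are shuffled and re-paired in the spirit of the configuration model [22], and any resulting multi-edges or self-loops are discarded.

```python
import random

def randomize_graph(edges, p, seed=0):
    """Rewire approximately p% of the edges while preserving the degree sequence.

    edges : list of (u, v) tuples of an undirected simple graph
    p     : randomization percentage in [0, 100]
    """
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_r = round(len(edges) * p / 100.0)
    chosen, kept = edges[:n_r], edges[n_r:]             # E_r and E \ E_r
    stubs = [node for edge in chosen for node in edge]  # two stubs per chosen edge
    rng.shuffle(stubs)
    rewired = list(zip(stubs[0::2], stubs[1::2]))       # re-pair consecutive stubs
    new_edges = set()
    for u, v in kept + rewired:
        if u != v:                                      # drop self-loops
            new_edges.add((min(u, v), max(u, v)))       # set of sorted pairs drops multi-edges
    return sorted(new_edges)

# Toy usage on a triangle plus a pendant node
print(randomize_graph([(0, 1), (1, 2), (0, 2), (2, 3)], p=50))
```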

3.1.2 Randomization of the features

The features were randomized by swapping feature vectors between a percentage of randomly chosen nodes following the procedure described in Algorithm 2.

Input: A feature matrix X, and a randomization percentage p.
Output: A randomized feature matrix X′.
1. Choose at random a subset of rows of X, with size set by the percentage p.
2. Swap the chosen rows at random to obtain X′.
Algorithm 2 Randomization of the features
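A minimal NumPy sketch of Algorithm 2 (our own helper; the function name randomize_features is illustrative):

```python
import numpy as np

def randomize_features(X, p, seed=0):
    """Swap the feature vectors (rows of X) of approximately p% of the nodes."""
    rng = np.random.default_rng(seed)
    X_rand = X.copy()
    n_r = round(X.shape[0] * p / 100.0)
    rows = rng.choice(X.shape[0], size=n_r, replace=False)  # nodes whose features are swapped
    permuted = rng.permutation(rows)                         # random reassignment among them
    X_rand[rows] = X[permuted]
    return X_rand
```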

A fundamental difference between the two randomization schemes is that the graph randomization alters the spectral properties of the graph as it gradually destroys its structure, whereas the randomization of the features preserves their spectral properties in the principal component analysis (PCA) sense, i.e., the principal values are the same but the loadings on the components are swapped. Nevertheless, the feature randomization still alters the classification performance, because the features are re-assigned to nodes that have a different environment, thereby changing the result of the convolution operation defined by the activation matrices (3).

3.2 Limiting cases

To interrogate the role that the graph plays in the classification performance of a GCN, it is instructive to consider three limiting cases:

  • No graph: Â = I_N. If we remove all the edges in the graph, the classifier becomes equivalent to an MLP, a classic feed-forward ANN. The classification is based solely on the information contained in the features, as no graph structure is present to guide the label propagation.

  • Complete graph: Â = (1/N) 1 1^⊤, i.e., every pair of nodes is connected. In this case, the mixing of features is immediate and homogeneous, corresponding to a mean field approximation of the information contained in the features.

  • No features: X = I_N. In this case, the label propagation and assignment are purely based on graph topology.

An illustration of these limiting cases can be found in the top row of Table 2.
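These limiting cases only change the inputs fed to the unchanged GCN architecture. A minimal sketch of how they might be constructed (our own notation; the exact matrices used in each case follow our reading of the definitions above):

```python
import numpy as np

def renormalized_adjacency(A):
    """A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2} (see Section 2.2)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def limiting_case_inputs(A, X, case):
    """Return (A_hat, X) for one of the three limiting cases."""
    N = A.shape[0]
    if case == "no_graph":          # A = 0, so A_hat reduces to the identity (MLP)
        return np.eye(N), X
    if case == "complete_graph":    # all-ones adjacency with self-loops: uniform averaging
        return np.ones((N, N)) / N, X
    if case == "no_features":       # one-hot 'featureless' input, X = I_N
        return renormalized_adjacency(A), np.eye(N)
    raise ValueError(case)
```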

3.3 Spectral alignment measure

In order to quantify the alignment between the features, the graph and the ground truth, we propose a measure based on the chordal distance between subspaces, as follows.

3.3.1 Chordal distance between two subspaces

Recent work by Ye and Lim [23] has shown that the distance between two subspaces of different dimensions in R^N is necessarily defined in terms of their principal angles.

Let 𝒜 and ℬ be two subspaces of the ambient space R^N with dimensions a and b, respectively, with a ≤ b ≤ N. The principal angles between 𝒜 and ℬ, denoted 0 ≤ θ_1 ≤ ⋯ ≤ θ_a ≤ π/2, are defined recursively as follows [24, 25]:

\cos \theta_k = \max_{u \in \mathcal{A}, \, v \in \mathcal{B}} u^\top v = u_k^\top v_k , \quad \text{subject to } \|u\| = \|v\| = 1, \; u^\top u_i = 0, \; v^\top v_i = 0, \; i = 1, \dots, k-1 .

If the minimal principal angle θ_1 is small, then the two subspaces are nearly linearly dependent, i.e., almost perfectly aligned. A numerically stable algorithm that computes the canonical correlations (i.e., the cosines of the principal angles) between subspaces is given in Algorithm 3.

Input: matrices A ∈ R^{N×a} and B ∈ R^{N×b} with a ≤ b.
Output: cosines of the principal angles between 𝒜 and ℬ, the column spaces of A and B.
1. Find orthonormal bases Q_A and Q_B for 𝒜 and ℬ using the QR decomposition: A = Q_A R_A, B = Q_B R_B.
2. Compute the singular value decomposition (SVD): Q_A^⊤ Q_B = U Σ V^⊤.
3. Extract the diagonal elements of Σ, namely σ_1, …, σ_a, to obtain the canonical correlations cos θ_k = σ_k.
Algorithm 3 Principal angles [24, 25]
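A compact NumPy version of Algorithm 3 (a sketch assuming the input matrices have full column rank; the function name principal_angle_cosines is our own):

```python
import numpy as np

def principal_angle_cosines(A, B):
    """Canonical correlations (cosines of principal angles) between col(A) and col(B).

    A : (N, a) matrix, B : (N, b) matrix with a <= b.
    """
    Qa, _ = np.linalg.qr(A)            # orthonormal basis of col(A)
    Qb, _ = np.linalg.qr(B)            # orthonormal basis of col(B)
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(sigma, 0.0, 1.0)    # guard against round-off slightly above 1

# Example: the x-y and x-z planes in R^3 share the x-axis, so the angles are (0, 90) degrees
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
print(np.degrees(np.arccos(principal_angle_cosines(A, B))))
```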

The principal angles are the basic ingredient of a number of well-defined Grassmannian distances between subspaces [23]. Here we use the chordal distance, given by:

d_c(\mathcal{A}, \mathcal{B}) = \left( \sum_{k=1}^{a} \sin^2 \theta_k \right)^{1/2} . \qquad (8)

The larger the chordal distance is, the worse the alignment between the subspaces 𝒜 and ℬ.

We remark that the last inequality in a ≤ b ≤ N must be strict, i.e., b < N. If a subspace spans the whole ambient space (i.e., b = N), then its distance to all other strict subspaces of R^N is trivially zero, as it is always possible to find a rotation that aligns the strict subspace with the whole space.

3.3.2 Alignment metric

Our task involves establishing the alignment between three subspaces associated with the features , the graph , and the ground truth . To do so, we consider the distance matrix containing all the pairwise chordal distances:

(9)

and we take the Frobenius norm [25] of this matrix as our subspace alignment measure (SAM):

(10)

The larger is, the worse the alignment between the three subspaces. This alignment measure has a geometric interpretation related to the area of the triangle with sides (blue triangle in Fig. 2).
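Putting the pieces together, a sketch of the SAM of Eqs. (8)-(10) (our own helper names; the truncated bases for features, graph and ground truth are assumed to be supplied, e.g., leading principal components and eigenvectors as described in the next subsection):

```python
import numpy as np

def chordal_distance(A, B):
    """Chordal distance of Eq. (8) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 0.0, 1.0)
    return np.sqrt(np.sum(1.0 - cosines ** 2))     # sqrt( sum_k sin^2 theta_k )

def subspace_alignment_measure(F_basis, A_basis, Y_basis):
    """Frobenius norm of the 3x3 matrix of pairwise chordal distances, Eq. (10)."""
    d_fa = chordal_distance(F_basis, A_basis)
    d_fy = chordal_distance(F_basis, Y_basis)
    d_ay = chordal_distance(A_basis, Y_basis)
    D = np.array([[0.0, d_fa, d_fy],
                  [d_fa, 0.0, d_ay],
                  [d_fy, d_ay, 0.0]])
    return np.linalg.norm(D, 'fro')
```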

3.3.3 Determining the dimension of the subspaces

The feature, graph and ground truth matrices are associated with subspaces of the ambient space , where is the number of nodes (or samples). These subspaces are spanned by: the eigenvectors of , the principal components of the feature matrix , and the principal components of the ground truth matrix , respectively [26]. The dimension of the graph subspace is ; the dimension of the feature subspace is the number of features (in our examples); and the dimension of the ground truth subspace is the number of classes .

The pairwise chordal distances in (9) are computed from a number of minimal angles, corresponding to the smaller of the two dimensions of the subspaces being compared. Hence the dimensions of the subspaces need to be defined to compute the distance matrix . Here, we are interested in finding low dimensional subspaces of features, graph and ground truth with dimensions such that they provide maximum discriminatory power between the original problem and the fully randomized (null) model. To do this, we propose the following criterion:

(11)

We choose equal to the number of ground truth classes since they are non-overlapping [26]. Our optimization selects and such that the difference in alignment between the original problem with no randomization () and an ensemble of 100 fully randomized (feature and graph, ) problems is maximized (see Appendix for details on the optimization scheme). This criterion maximizes the range of values that can take, thus augmenting the discriminatory power of the alignment measure when finding the alignment between both data sources and the ground truth, beyond what is expected purely at random. Importantly, the reduced dimension of features and graph are found simultaneously, since our objective is to quantify the alignment (or amount of shared information) contained in the three subspaces. Our criterion effectively amounts to finding the dimensions of the subspaces that maximize a difference in the surfaces of the blue and red triangles in Fig. 2.

We provide the code to compute our proposed alignment measure at https://github.com/haczqyf/gcn-data-alignment.

Figure 2: Method to determine the relevant subspaces (Eq. 11). Using the constructive example, we illustrate the subspaces representing features, graph and ground truth. The feature and ground truth matrices are decomposed via PCA and the graph matrix is similarly eigendecomposed. Fixing k_Y = C, we optimize (11) to find the dimensions k_F and k_A that maximize the difference between the area of the blue triangle, which reflects the alignment of the three subspaces of the original data, and the area of the red triangle, which corresponds to the alignment of the subspaces of the fully randomized data. The edges of the triangles correspond to the pairwise chordal distances between the subspaces.

4 Experiments

4.1 Data sets

Relevant statistics of the data sets, including number of nodes and edges, dimension of feature vectors, and number of ground truth classes, are reported in Table 1.

Data sets    Nodes (N)    Edges    Features (F)    Classes (C)
Constructive
CORA
AMiner
Wikipedia
Wikipedia I
Wikipedia II
Table 1: Some statistics of the data sets in our study.

4.1.1 Constructive example

To illustrate the alignment measure in a controlled setting, we build a constructive example, consisting of N nodes assigned to planted communities of equal size. We then generate a feature matrix and a graph matrix whose structures are aligned with the ground truth assignment matrix. The graph structure is generated using a stochastic block model that reproduces the ground truth structure with some noise: two nodes are connected with a probability p_in if they belong to the same community, and with a smaller probability p_out otherwise. The feature matrix is constructed in a similar way. The feature vectors are F-dimensional and binary, i.e., a node either possesses a feature or it does not. Each ground truth cluster is associated with a subset of the features, which its nodes possess with high probability; each node also possesses each of the features characterizing the other clusters with a smaller probability. Using the same stochastic block structure for both features and graph ensures that they are maximally aligned with the ground truth. This constructive example is then randomized in a controlled way to detect the loss of alignment and the impact this loss of alignment has on the classification performance.
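One way to generate such a constructive example is sketched below (the sizes and probabilities are illustrative placeholder values, not those used in the paper):

```python
import numpy as np

def constructive_example(n_nodes=1000, n_classes=10, n_features=500,
                         p_in=0.10, p_out=0.01, q_in=0.20, q_out=0.02, seed=0):
    """Generate an SBM graph and block-structured binary features aligned with the ground truth.

    Assumes n_nodes and n_features are multiples of n_classes; all values are placeholders.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_classes), n_nodes // n_classes)   # planted communities
    Y = np.eye(n_classes)[labels]                                    # one-hot ground truth

    # Graph: stochastic block model with intra-/inter-community probabilities p_in / p_out
    same_community = labels[:, None] == labels[None, :]
    A = (rng.random((n_nodes, n_nodes)) < np.where(same_community, p_in, p_out)).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                                                      # undirected, no self-loops

    # Features: each class 'owns' a block of features, present with prob q_in (else q_out)
    feature_class = np.repeat(np.arange(n_classes), n_features // n_classes)
    owns_feature = labels[:, None] == feature_class[None, :]
    X = (rng.random((n_nodes, n_features)) < np.where(owns_feature, q_in, q_out)).astype(float)
    return A, X, Y
```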

4.1.2 Cora

The CORA data set is a benchmark for classification algorithms using text and citation data (https://linqs.soe.ucsc.edu/data). Each paper is labeled as belonging to one of seven categories (Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, and Theory), which gives the ground truth Y. The text of each paper is described by a vector indicating the absence/presence of words in a dictionary of unique words, whose size sets the dimension F of the feature space. The feature matrix X is made from these word vectors. We extracted the largest connected component of this citation graph (taken as undirected) to form the graph adjacency matrix A.

4.1.3 AMiner

For additional comparisons, we produced a new data set with similar characteristics to CORA from the academic citation site AMiner. AMiner is a popular scholarly social network service for research purposes only [27], which provides an open database (https://aminer.org/data) with numerous data sets encompassing linked researcher, conference, and publication data. Among these, the academic social network (https://aminer.org/aminernetwork) is the largest one and includes information on papers, citations, authors, and scientific collaborations. The China Computer Federation (CCF) released a catalog in 2012 listing the subfields of computer science. Using the AMiner academic social network, Qian et al. [28] extracted papers published from 2010 to 2012, and mapped each paper to a unique subfield of computer science according to the publication venue. Here, we use these assigned categories as the ground truth for a classification task. Using all the papers in [28] that have both abstract and references, we created a data set of similar size to CORA. We extracted the largest connected component from the citation network of all papers published from 2010 to 2011 in seven subfields (Computer systems/high performance computing, Computer networks, Network/information security, Software engineering/software/programming language, Databases/data mining/information retrieval, Theoretical computer science, and Computer graphics/multimedia). The resulting AMiner citation network consists of these papers and the citation links between them. Just as with CORA, we treat the citations as undirected edges, and obtain an adjacency matrix A. We further extracted the most frequent stemmed terms from the corpus of paper abstracts and constructed the feature matrix X for AMiner using a bag-of-words representation.

4.1.4 Wikipedia

As a contrasting example with distinct characteristics, we produced three data sets from the English Wikipedia. Wikipedia provides a well-known, interlinked corpus of documents (articles) in different fields, which ‘cite’ each other via hyperlinks. We started by producing a large corpus of articles, consisting of a mixture of popular and random pages so as to obtain a balanced data set. We retrieved the most accessed articles during the week before the construction of the data set (July 2017), together with additional documents sampled using the Wikipedia built-in random function (https://en.wikipedia.org/wiki/Wikipedia:Random). The text and subcategories of each document, together with the names of the documents connected to it, were obtained using the Python library Wikipedia (https://github.com/goldsmith/Wikipedia). A few documents (e.g., those with no subcategories) were filtered out during this process. We constructed the citation network of the documents retrieved and extracted its largest connected component, which forms the full Wikipedia data set (Table 1). The text content of each document was converted into a bag-of-words representation based on the most frequent words in the corpus. To establish the ground truth, we used 12 categories from the API (People, Geography, Culture, Society, History, Nature, Sports, Technology, Health, Religion, Mathematics, Philosophy) and assigned each document to one of them. As part of our investigation, we split this large Wikipedia data set into two smaller subsets of non-overlapping categories: Wikipedia I, consisting of Health, Mathematics, Nature, Sports, and Technology; and Wikipedia II, with the categories Culture, Geography, History, Society, and People.

All six datasets used in our study can be found at https://github.com/haczqyf/gcn-data-alignment/tree/master/alignment/data.

4.2 GCN architecture, hyperparameters and implementation

We used the GCN implementation provided by the authors of [16] (https://github.com/tkipf/gcn), and followed closely their experimental setup to train and test the GCN on our data sets. We used a two-layer GCN as described in Section 2.2, trained with Adam [29] for a fixed maximum number of iterations (epochs) at a fixed learning rate, with early stopping, i.e., training stops if the validation loss does not decrease for a given number of consecutive epochs. Other hyperparameters set in our experiments were: (i) the dropout rate; (ii) the L2 regularization penalty; and (iii) the number of hidden units. We initialized the weights as described in [30], and accordingly row-normalized the input feature vectors. For the training, validation and test of the GCN, we split the instances of each data set into a training set, a validation set and a test set, with the training set comprising a small fraction of the nodes, in line with the semi-supervised setting. For the full Wikipedia data set, a different split was used; this modification was necessary to ensure the instances in the training set were evenly distributed across categories.

5 Results

The GCN performance is evaluated using the standard classification accuracy defined as the proportion of nodes correctly classified in the test set.

5.1 GCN: original graph vs. limiting cases

For each data set in Table 1, we trained and tested a GCN with the original graph and feature matrices, as well as GCN models under the three limiting cases described in Section 3.2. We computed the average accuracy over repeated runs with random weight initializations (Table 2).

Data sets       GCN (original)    No graph = MLP     No features       Complete graph
                                  (only features)    (only graph)      (mean field)
Constructive    0.932 ± 0.006     0.416 ± 0.010      0.764 ± 0.009     0.100 ± 0.003
CORA            0.811 ± 0.005     0.548 ± 0.014      0.691 ± 0.006     0.121 ± 0.066
AMiner          0.748 ± 0.005     0.547 ± 0.013      0.591 ± 0.006     0.123 ± 0.045
Wikipedia       0.392 ± 0.010     0.450 ± 0.007      0.254 ± 0.037     O.O.M.
Wikipedia I     0.861 ± 0.006     0.796 ± 0.005      0.824 ± 0.003     0.163 ± 0.135
Wikipedia II    0.566 ± 0.021     0.659 ± 0.011      0.347 ± 0.012     0.155 ± 0.176
Table 2: Classification accuracy of GCN with the original data and for the limiting cases, for all our data sets (mean ± standard deviation over repeated runs with random weight initializations; O.O.M. denotes out of memory). The GCN with the original data performs best in most cases, but is outperformed by MLP on the full Wikipedia data set and its subset Wikipedia II.

The GCN using all the information available in the features and the graph outperforms MLP (the no-graph limit) except in the case of the large Wikipedia set. Hence using the additional information contained in the graph does not necessarily increase the performance of GCN. To investigate this issue further, we split the Wikipedia data set into two subsets: Wikipedia I, with articles in topics that tend to be more self-referential (e.g., Mathematics or Technology), and Wikipedia II, containing pages in areas that are less self-contained (e.g., Culture or Society). We observed that GCN outperforms MLP for Wikipedia I, but the opposite is still true for Wikipedia II. Finally, we also observe that the performance of ‘No features’ is always lower than the performance of GCN, and, as expected, the performance of ‘Complete graph’ (i.e., mean field) is very low and close to pure chance (i.e., 1/C).

5.2 Performance of GCN under randomization

The results above lead us to pose the hypothesis that a degree of synergy between features, graph and ground truth is needed for GCN to perform well. To investigate this hypothesis, we use the randomization schemes described in Section 3.1 to degrade systematically the information content of the graph and/or the features in our data sets. Fig. 3 presents the performance of the GCN as a function of the percent of randomization of the graph structure, the features, or both. As expected, the accuracy decreases for all data sets as the information contained in the graph, features or both is scrambled, yet with differences in the decay rate of each of the ingredients for the different examples.

Note that the chance-level performance of the ‘Complete graph’ (mean field) limiting case is reached only when both graph and features are fully randomized, whereas the accuracy of the two other limiting cases (‘No graph = MLP’, ‘No features’) is reached around the half-point (≈50%) of randomization of the graph or of the features, respectively. This indicates that, above a certain degree of randomization, using the scrambled information becomes more detrimental to the classification performance than simply ignoring it.

Figure 3: Degradation of the classification performance as a function of randomization. Each panel shows the degradation of the classification accuracy as a function of the randomization of the graph, the features, or both, for a different data set. Error bars are evaluated over repeated realizations: for zero percent randomization, we report repeated runs with random weight initializations; for the other points, we report one run with a random weight initialization for each random realization of the randomization. The horizontal lines correspond to the limiting cases in Table 2. The full Wikipedia data set was not analyzed here since the eigendecomposition needed to obtain the graph subspace is computationally intensive.

5.3 Relating GCN performance and subspace alignment

We tested whether the degradation of GCN performance is linked to the increased misalignment of features, graph and ground truth, as given by the subspace alignment measure:

S^* = S(k_F^*, k_A^*, C) , \qquad (12)

which corresponds to (10) computed with the dimensions obtained using (11) (Table 3; see the Appendix for the optimization scheme used). Fig. 4 shows that the GCN accuracy is clearly anticorrelated with the subspace alignment distance (12) in all our examples. As we randomize the graph and/or the features, the subspace misalignment increases and the GCN performance decreases.

Data sets               k_F      k_A      k_Y = C
Constructive example    287      10       10
CORA                    1,291    190      7
AMiner                  500      57       7
Wikipedia I             68       1,699    5
Wikipedia II            100      1,125    5
Table 3: Dimensions of the three subspaces obtained according to Eq. 11 for our data sets.
Figure 4: Classification performance versus the subspace alignment measure. Each panel shows the accuracy of GCN versus the SAM (12) for all the runs presented in Fig. 3. Error bars are evaluated over random realizations.

6 Discussion

Our first set of experiments (Table 2) reflects the varying amount of information that GCN can extract from the features, the graph, and their combination, for the purpose of classification. For a classifier to perform well, it is necessary to find (possibly nonlinear) combinations of features that map differentially and distinctively onto the categories of the ground truth. The larger the difference (or distance in the projected space) between the samples of each category, the easier it is to ‘separate’ them, and the better the classifier. In the MLP setting, for instance, the weights between layers (the W^{(l)}) are trained to maximize this separation. As seen from the different accuracies in the ‘No graph’ column (Table 2), the features of each example contain a variable amount of information that is mappable onto its ground truth. A similar reasoning applies to classification based on graph information alone, but in this case it is the eigenvectors of Â that need to be combined to produce distinguishing features between the categories in the ground truth (e.g., if the graph substructures across scales [31] do not map onto the separation lines of the ground truth categories, then the classification performance based on the graph will deteriorate). The accuracies in the ‘No features’ column indicate that some of the graphs contain information more congruent with the ground truth than others. Therefore, the ‘No graph’ and ‘No features’ limiting cases inform us about the relative congruence of each type of information with respect to the ground truth. One can then conjecture that if the performance of the ‘No features’ case is higher than that of the ‘No graph’ case, GCN will yield better results than MLP.

In addition, our numerics show that although combining both sources of information generally leads to improved classification performance (‘GCN original’ column in Table 2), this is not always the case. Indeed, for the Wikipedia and Wikipedia II examples, the classification performance of the MLP (‘No graph’), which is agnostic to relationships between samples, is better than when the additional layer of relational information about the samples (i.e., the graph) is incorporated via the GCN architecture. This suggests that, for improved GCN classification, the information contained in the features and the graph needs to be constructively aligned with the ground truth. This phenomenon can be intuitively understood as follows. In the absence of a graph (i.e., the MLP setting), the training of the layer weights is done independently over the samples, without assuming any relationship between them. In GCN, on the other hand, the role of the graph is to guide the training of the weights by averaging the features of a node with those of its graph neighbors. The underlying assumption is that the relationships represented by the graph should be consistent with the information in the features, i.e., the features of nodes that are graph neighbors are expected to be more similar than otherwise; hence the training process is biased towards convolving the information diffusing on the graph to extract improved feature descriptions for the classifier. However, if feature similarities and graph neighborhoods (or, more generally, graph communities [31]) are not congruent, this graph-based averaging during the training is not beneficial.

To explore this issue in a controlled fashion, our second set of experiments (Fig. 3) studied the degradation of the classification performance induced by the systematic randomization of graph structure and/or features. The erosion of information is not uniform across our examples, reflecting the relative salience of each of the components (features and graph) for classification. Note that the GCN is able to leverage the information present in any of the two components, and is only degraded to chance-level performance when both graph and features are fully randomized. Interestingly, this fully randomized (chance-level) performance coincides with that of the ‘Complete graph’ (or mean field) limiting case, where the classifier is trained on features averaged over all the samples, thus leading to a uniform representation that has zero discriminating power when it comes to category assignment.

These results suggest that a degree of constructive alignment between the matrices of features, graph and ground truth is necessary for GCN to operate successfully beyond standard classifiers. To capture this idea, we proposed a simple subspace alignment measure (SAM) (12) that uses the minimal principal angles to capture the consistency of pairwise projections between subspaces. Fig. 4 shows that the SAM correlates well with the classification performance and captures the monotonic dependence remarkably well, given that the SAM is a simple linear measure applied to the outcome of a highly nonlinear, optimized system.

The alignment measure can also be used to evaluate the relative importance of the features and the graph for classification without explicitly running the GCN, by comparing the SAM under full randomization of the features against the SAM under full randomization of the graph. If the former is larger, the features play a more important role in GCN classification, indicating that MLP could potentially yield better results (e.g., in Wikipedia II). Conversely, if the SAM under full randomization of the graph is larger, the graph is more important for GCN classification, suggesting that GCN should outperform MLP (e.g., in the constructive example and CORA).

7 Conclusion

Here, we have introduced a subspace alignment measure (SAM) (12) to quantify the consistency between the feature and graph ingredients of a data set, and showed that it correlates well with the classification performance of GCNs. Our experiments show that a degree of alignment is needed for a GCN approach to be beneficial, and that using a GCN can actually be detrimental to the classification performance if the feature and graph subspaces associated with the data are not constructively aligned (e.g., Wikipedia and Wikipedia II). The SAM potentially has a wider range of applications in the quantification of data alignment in general. It could be used, among other things, to quantify the alignment of different graphs associated with, or obtained from, particular data sets; to evaluate the quality of classifications found using unsupervised methods; or to aid in choosing the classifier architecture that is computationally most advantageous for a particular data set.

Acknowledgment

Yifan Qian acknowledges financial support from the China Scholarship Council program (No. 201706020176). Paul Expert and Mauricio Barahona acknowledge support through the EPSRC grant EP/N014529/1 funding the EPSRC Centre for Mathematics of Precision Healthcare.

References

  • [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
  • [2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016.
  • [3] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
  • [4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
  • [5] L. Deng, D. Yu et al., “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
  • [6] E. P. Simoncelli and B. A. Olshausen, “Natural image statistics and neural representation,” Annual Review of Neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
  • [7] D. J. Field, “What the statistics of natural images tell us about visual coding,” in Human Vision, Visual Processing, and Digital Display, vol. 1077.   International Society for Optics and Photonics, 1989, pp. 269–277.
  • [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [9] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in Neural Information Processing Systems, 1990, pp. 396–404.
  • [10] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
  • [11] D. Lazer, A. S. Pentland, L. Adamic, S. Aral, A. L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann et al., “Life in the network: the coming age of computational social science,” Science, vol. 323, no. 5915, p. 721, 2009.
  • [12] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
  • [13] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
  • [14] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” arXiv preprint arXiv:1901.00596, 2019.
  • [15] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
  • [16] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
  • [17] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings of IEEE International Joint Conference on Neural Networks, vol. 2.   IEEE, 2005, pp. 729–734.
  • [18] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
  • [19] S. Sukhbaatar, R. Fergus et al., “Learning multiagent communication with backpropagation,” in Advances in Neural Information Processing Systems, 2016, pp. 2244–2252.
  • [20] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
  • [21] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
  • [22] M. E. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.
  • [23] K. Ye and L.-H. Lim, “Schubert varieties and distances between subspaces of different dimensions,” SIAM Journal on Matrix Analysis and Applications, vol. 37, no. 3, pp. 1176–1197, 2016.
  • [24] A. Björck and G. H. Golub, “Numerical methods for computing angles between linear subspaces,” Mathematics of Computation, vol. 27, no. 123, pp. 579–594, 1973.
  • [25] G. H. Golub and C. F. Van Loan, Matrix Computations.   JHU Press, 2012, vol. 3.
  • [26] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
  • [27] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2008, pp. 990–998.
  • [28] Y. Qian, W. Rong, N. Jiang, J. Tang, and Z. Xiong, “Citation regression analysis of computer science publications in different ranking categories and subfields,” Scientometrics, vol. 110, no. 3, pp. 1351–1374, 2017.
  • [29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [30] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
  • [31] R. Lambiotte, J. Delvenne, and M. Barahona, “Random walks, markov processes and the multiscale modular organization of complex networks,” IEEE Transactions on Network Science and Engineering, vol. 1, no. 2, pp. 76–90, July 2014.

Appendix: Finding k_F and k_A

A key element of the subspace alignment measure described in the main text is to find lower-dimensional representations of the graph, the features and the ground truth.

To determine the dimensions of the representative subspaces, we propose the following heuristic:

(k_F^*, k_A^*) = \arg\max_{k_F, k_A} \Big[ \big\langle S_{100\%}(k_F, k_A, C) \big\rangle - S_{0\%}(k_F, k_A, C) \Big] . \qquad (13)

We choose k_Y to be equal to the number of categories C in the ground truth, as they are non-overlapping. Thus, k_F and k_A range from C to their maximum values: F, the dimension of the feature vectors, and N, the number of nodes in the graph, respectively.

To find the values of k_F and k_A that maximize (13), we scan different possible combinations of k_F and k_A, applying two rounds of scanning. In the first round, we pick equally spaced values within the intervals for k_F and k_A that contain their minimum and maximum possible values. For example, in CORA, k_Y equals 7 because the number of categories in the ground truth is 7, so k_F and k_A are scanned starting from 7 up to F and N, respectively. At the end of the first round, we obtain provisional optimal values of k_F and k_A (see Fig. 5(c)). In the second round, we apply a very similar process: we set the scanning intervals of k_F and k_A to the neighborhoods of the values found in the first round, and again split the new intervals into equally spaced values. The scanning results for all data sets are shown in Figure 5.
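A sketch of this two-round scan is given below (our own helper names; `alignment_gap` stands for the objective of Eq. (13), i.e., the mean SAM of the fully randomized data minus the SAM of the original data at dimensions (k_F, k_A), and the number of grid points n_points is a placeholder, as the value used in the scans is not specified here):

```python
import numpy as np

def two_round_scan(alignment_gap, kf_range, ka_range, n_points=10):
    """Coarse-to-fine grid search for the (k_F, k_A) maximizing the alignment gap.

    alignment_gap : callable (k_F, k_A) -> float, the objective of Eq. (13)
    kf_range      : (min, max) admissible values of k_F
    ka_range      : (min, max) admissible values of k_A
    """
    def best_on_grid(kf_vals, ka_vals):
        scores = [(alignment_gap(kf, ka), kf, ka) for kf in kf_vals for ka in ka_vals]
        return max(scores)[1:]                      # (k_F, k_A) with the largest gap

    # Round 1: equally spaced values spanning the full admissible intervals
    kf_vals = np.unique(np.linspace(kf_range[0], kf_range[1], n_points).astype(int))
    ka_vals = np.unique(np.linspace(ka_range[0], ka_range[1], n_points).astype(int))
    kf1, ka1 = best_on_grid(kf_vals, ka_vals)

    # Round 2: refine around the round-1 optimum with a finer, equally spaced grid
    step_f = max(1, (kf_range[1] - kf_range[0]) // (n_points - 1))
    step_a = max(1, (ka_range[1] - ka_range[0]) // (n_points - 1))
    kf_vals = np.unique(np.clip(np.linspace(kf1 - step_f, kf1 + step_f, n_points).astype(int),
                                kf_range[0], kf_range[1]))
    ka_vals = np.unique(np.clip(np.linspace(ka1 - step_a, ka1 + step_a, n_points).astype(int),
                                ka_range[0], ka_range[1]))
    return best_on_grid(kf_vals, ka_vals)

# Usage (hypothetical): two_round_scan(my_gap_function, kf_range=(7, 1433), ka_range=(7, 2485))
```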

Figure 5: Summary of the scanning results for the subspace dimensions. Panels (a)–(j) show rounds 1 and 2 of the scan for the constructive example, CORA, AMiner, Wikipedia I and Wikipedia II, respectively.