1. Introduction
Graphs have been widely used to model real-world entities and the relationships among them. Two graph learning problems have received a lot of attention recently: node classification and graph classification. Node classification is to predict the class labels of nodes in a graph, for which many studies in the literature make use of the connections between nodes to boost the classification performance. For example, (Ramanath et al., 2018) enhances the recommendation precision in LinkedIn by taking advantage of the interaction network, and (Sen et al., 2008) improves the performance of document classification by exploiting the citation network. Graph classification, on the other hand, is to predict the class labels of graphs, for which various graph kernels (Borgwardt and Kriegel, 2005; Gärtner et al., 2003; Shervashidze et al., 2009, 2011) and deep learning approaches (Niepert et al., 2016; Narayanan et al., 2017) have been designed.
In this work, we consider a more challenging but practically useful setting, in which a node itself is a graph instance. This leads to a hierarchical graph in which a set of graph instances are interconnected via edges. This is a very expressive data representation, as it considers the relationships between graph instances rather than treating them independently. The hierarchical graph model applies to many kinds of real-world data. For example, a social network can be modeled as a hierarchical graph, in which a user group is represented by a graph instance and treated as a node in the hierarchical graph, and a number of user groups are interconnected via interactions or common members. As another example, a document collection can be modeled as a hierarchical graph, in which a document is regarded as a graph-of-words (Rousseau et al., 2015), and a set of documents are interconnected via the citation relationship. In this paper, we study graph classification in a hierarchical graph, which predicts the class labels of graph instances in a hierarchical graph.
One challenge in this problem is that a hierarchical graph is too complicated an input for building a classifier directly. To tackle this challenge, we design a new graph embedding method which embeds a graph instance of arbitrary size into a fixed-length vector. All graph instances in the hierarchical graph are transformed to embedding vectors, which are the common input format for classification. Specifically, the embedding method builds an instance-level classifier called IC from graph instances, and produces embedding vectors and predicted class probabilities of the graph instances. Another classifier, HC, at the hierarchical graph level takes the embedding vectors and their connections as input, and outputs the predicted class probabilities of all graph instances. To enforce consistency between the two classifiers, we define a disagreement loss that measures the divergence between their predictions, and we aim to minimize this loss.
Another challenge is that the amount of available class labels is usually very small in real-world data, which limits the classification performance. To address this challenge, we take a semi-supervised learning approach to solving the graph classification problem. We design an iterative algorithm framework which takes turns to build or update classifiers IC and HC. We start with the limited labeled training set and build IC, which produces the embedding vectors of graph instances. HC takes the embedding vectors as input and produces predictions. We cautiously select a subset of the labels predicted by HC with high confidence to enlarge the training set. The enlarged training set is then fed into IC in the next iteration to update its parameters, in the hope of generating more accurate embedding vectors and predictions. HC further takes the new embedding vectors for model update and class prediction. This is our proposed solution to the graph classification problem, called SEmi-supervised grAph cLassification via Cautious Iteration (SEAL-CI).
We also extend this iterative algorithm to the active learning framework, in which we iteratively select the most informative instances for manual annotation, and then update the classifiers with the newly labeled instances in a process similar to the one described above. This method is called SEAL-AI for short.
Our contributions are summarized as follows.

We study semi-supervised graph classification from a hierarchical graph perspective, which, to the best of our knowledge, has not been studied before. Our proposed solutions SEAL-C/AI achieve classification performance superior to that of state-of-the-art graph kernel and deep learning methods, even when given very few labeled training instances.

We design a novel supervised, self-attentive graph embedding method called SAGE to embed graphs of arbitrary size into fixed-length vectors, which are used as a common form of input for classification. The embedding approach not only simplifies the representation of a hierarchical graph greatly, but also provides meaningful interpretations of the underlying data in two forms: 1) embedding vectors of graph instances, and 2) node importance in a graph instance, learned through a self-attentive mechanism that differentiates the contributions of nodes in classifying a graph instance.

We evaluate SEAL-C/AI on both synthetic graphs and Tencent QQ group data. From the social networking platform Tencent QQ, we select 37,836 QQ groups with 18,422,331 unique anonymized users and classify them as “game” or “non-game” groups. SEAL-C/AI achieve Macro-F1 scores of 70.8% and 73.2% respectively with only 2.6% labeled instances. They both outperform other competing methods by a large margin.
2. Problem Definition
We denote a set of objects as $O = \{o_1, o_2, \ldots, o_N\}$, which represent real-world entities. We use $d$ attributes to describe properties of objects, e.g., age, gender, and other information of people.
We use a graph instance to model the relationship between objects in $O$, which is denoted as $g = (V, A, X)$, where $V \subseteq O$ is the node set with $|V| = n$, $A$ is an $n \times n$ adjacency matrix representing the connectivity in $V$, and $X \in \mathbb{R}^{n \times d}$ is a matrix recording the attribute values of all nodes in $V$.
A set of graph instances $\mathcal{G} = \{g_1, g_2, \ldots, g_M\}$ can be interconnected, and the connectivity between the graph instances is represented by an adjacency matrix $\Theta$. The graph instances and their connections are modeled as a hierarchical graph.
A graph instance $g$ is a labeled graph if it has a class label, represented by a vector $y \in \{0, 1\}^{c}$, where $c$ is the number of classes. A graph instance is unlabeled if its class label is unknown. Then $\mathcal{G}$ can be divided into two subsets: labeled graphs $\mathcal{G}_l$ and unlabeled graphs $\mathcal{G}_u$, where $\mathcal{G} = \mathcal{G}_l \cup \mathcal{G}_u$, $|\mathcal{G}_l| = L$ and $|\mathcal{G}_u| = U$. In this paper, we study the problem of graph classification, which determines the class labels of the unlabeled graph instances in $\mathcal{G}_u$ from the available class labels in $\mathcal{G}_l$ and the topological structure of the hierarchical graph. As the amount of available class labels is usually very limited in real-world data, we take a semi-supervised learning approach to solving this problem.
Figure 1 depicts a hierarchical graph in the context of a social network, with four user groups. One group has the class label “game”, another has the label “non-game”, while the class labels of the remaining two groups are unknown. These four groups are connected via some kind of relationship, e.g., interactions or common members. The internal structure of each user group shows the connections between individual users. From this hierarchical graph, we want to determine the class labels of the two unlabeled groups.
3. Methodology
3.1. Problem Formulation
In our problem setting, we have two kinds of information: graph instances and connections between the graph instances, which provide us with two perspectives to tackle the graph classification problem. Accordingly, we build two classifiers: a classifier IC constructed for graph instances and a classifier HC constructed for the hierarchical graph, both of which make predictions for the unlabeled graph instances.
For both classifiers, one goal is to minimize the supervised loss, which measures the distance between the predicted class probabilities and the true labels. Another goal is to minimize a disagreement loss, which measures the distance between the class probabilities predicted by IC and those predicted by HC. The purpose of this disagreement loss is to enforce consistency between the two classifiers.
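Formally, we formulate the graph classification problem as an optimization problem:
$$\min \; \zeta(\mathcal{G}_l) + \xi(\mathcal{G}_u) \qquad (1)$$
where $\zeta(\mathcal{G}_l)$ is the supervised loss for the labeled graph instances $\mathcal{G}_l$, and $\xi(\mathcal{G}_u)$ is the disagreement loss for the unlabeled graph instances $\mathcal{G}_u$.
Specifically, $\zeta(\mathcal{G}_l)$ includes two parts:
$$\zeta(\mathcal{G}_l) = \sum_{g_i \in \mathcal{G}_l} \big( \mathcal{L}(y_i, \psi_i) + \mathcal{L}(y_i, \gamma_i) \big) \qquad (2)$$
where $y_i$ is the true label of $g_i$, $\psi_i$ is a vector of predicted class probabilities by IC, and $\gamma_i$ is a vector of predicted class probabilities by HC. $\mathcal{L}(\cdot, \cdot)$ is the cross-entropy loss function.
The disagreement loss $\xi(\cdot)$ is defined as:
$$\xi(\mathcal{G}_u) = \sum_{g_i \in \mathcal{G}_u} D_{KL}(\gamma_i \,\|\, \psi_i) \qquad (3)$$
where $D_{KL}(P \,\|\, Q) = \sum_j P_j \log \frac{P_j}{Q_j}$ is the Kullback-Leibler divergence. In the following subsections, we describe our design of classifiers IC and HC, and our approach to minimizing the supervised loss and the disagreement loss.
3.2. Design of Classifiers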
Classifier IC takes a graph instance as input. As different graph instances have different numbers of nodes, IC is expected to handle graph instances of arbitrary size. Classifier HC takes the hierarchical graph as input, in which individual graph instances are the “nodes”. This is too complicated an input for a classifier. To deal with the above challenges, we propose to embed a graph instance into a fixed-length vector via IC first. Then HC can take as input the embedding vectors of graph instances and the adjacency matrix $\Theta$ between graph instances. In particular, IC takes as input the adjacency matrix $A$ and attribute matrix $X$ of an arbitrary-sized graph instance $g$, and outputs an embedding vector $e$ and a vector of predicted class probabilities $\psi$, i.e., $(e, \psi) = IC(A, X)$. HC takes the embedding vectors $E$ of all graph instances and $\Theta$, and outputs the predicted class probabilities $\Gamma$, i.e., $\Gamma = HC(E, \Theta)$. In the following, we illustrate the design of IC, which performs discriminative graph embedding, and then the design of HC, which performs graph-based classification.
3.2.1. Discriminative graph embedding
Our graph embedding task is to produce a fixed-length discriminative embedding vector of a graph instance. In the literature, graph representation techniques have recently shifted from hand-crafted kernel methods (Yanardag and Vishwanathan, 2015) to neural network based end-to-end methods, which achieve better performance in graph-structured learning tasks. In this vein, we adopt neural network methods for the graph embedding task, for which, however, we identify three challenges:

Size invariance: How to design the neural network structure to flexibly take an arbitrary-sized graph instance and produce a fixed-length embedding vector?

Permutation invariance: How to derive the representation regardless of the permutation of nodes?

Node importance: How to encode the importance of different nodes into a unified embedding vector?
In particular, the third challenge is node importance, i.e., different nodes in a graph instance have different degrees of importance. For example, in a “game” group the “core” members should be more important than the “border” members in contributing to the derived embedding vector. We need to design a mechanism to learn the node importance and then encode it in the embedding vector properly.
To this end, we propose a self-attentive graph embedding method, called SAGE, which can take a variable-sized graph instance and combine its nodes into a fixed-length vector according to their importance within the graph. In SAGE, we first utilize a multi-layer GCN (Kipf and Welling, 2017) to smooth each node’s features over the graph’s topology. Then we use a self-attentive mechanism, as proposed in (Lin et al., 2017), to learn the node importance and transform a variable number of smoothed nodes into a fixed-length embedding vector. Finally, the embedding vector is cascaded with a fully connected layer and a softmax function, in which the label information can be leveraged to discriminatively transform the embedding vector into the predicted class probability vector $\psi$. Figure 2 depicts the overall framework of SAGE.
Formally, we are given the adjacency matrix $A \in \mathbb{R}^{n \times n}$ and the attribute matrix $X \in \mathbb{R}^{n \times d}$ of a graph instance $g$ as input. In the preprocessing step, the adjacency matrix is normalized:
$$\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \qquad (4)$$
where $\tilde{A} = A + I$, $I$ is the identity matrix and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Then we apply a two-layer GCN network:
$$H = \hat{A}\, \mathrm{ReLU}\big(\hat{A} X W^{(0)}\big) W^{(1)} \qquad (5)$$
Here $W^{(0)} \in \mathbb{R}^{d \times h}$ and $W^{(1)} \in \mathbb{R}^{h \times v}$ are two weight matrices. GCN can be considered as a Laplacian smoothing operator for node features over graph structures, as pointed out in (Li et al., 2018). Then we get a set of representations $H \in \mathbb{R}^{n \times v}$ for the nodes in $g$. Note that the representation $H$ does not provide node importance, and it is size variant, i.e., its size is still determined by the number of nodes $n$. So next we utilize the self-attentive mechanism to learn node importance and encode it into a unified graph representation, which is size invariant:
$$S = \mathrm{softmax}\big(W_{s2} \tanh(W_{s1} H^{T})\big) \qquad (6)$$
where $W_{s1} \in \mathbb{R}^{d_a \times v}$ and $W_{s2} \in \mathbb{R}^{r \times d_a}$ are two weight matrices. The function of $W_{s1}$ is to linearly transform the node representation from a $v$-dimensional space to a $d_a$-dimensional space; nonlinearity is then introduced by the function tanh. $W_{s2}$ is used as $r$ views of inferring the importance of each node within the graph. It acts like inviting $r$ experts to give their opinions about the importance of each node independently. Then softmax is applied to derive a standardized importance of each node within the graph, which means that in each view the node importances sum to 1.
After that, we compute the final graph representation $e \in \mathbb{R}^{r \times v}$ by multiplying $S \in \mathbb{R}^{r \times n}$ with $H \in \mathbb{R}^{n \times v}$:
$$e = S H \qquad (7)$$
$e$ is size invariant since it does not depend on the number of nodes $n$ any more. It is also permutation invariant since the importance of each node is learned regardless of the node sequence, and is only determined by the task labels.
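The embedding steps in Eqs. (4) to (7) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`matmul`, `normalize_adjacency`, `sage_embed`, `random_matrix`) are hypothetical, the weights are random rather than trained, and the equations follow the standard GCN normalization and the self-attention form of (Lin et al., 2017) as assumed here.

```python
import math
import random

def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, so each attention view sums to 1 over the nodes."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([x / total for x in exps])
    return out

def normalize_adjacency(adj):
    """Eq. (4): add self-loops, then scale symmetrically by node degree."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def sage_embed(adj, feats, w0, w1, ws1, ws2):
    """Return (e, S): the r x v embedding and the r x n attention matrix."""
    a_hat = normalize_adjacency(adj)
    hidden = [[max(0.0, x) for x in row] for row in matmul(matmul(a_hat, feats), w0)]
    h = matmul(matmul(a_hat, hidden), w1)                      # Eq. (5): n x v
    scores = [[math.tanh(x) for x in row] for row in matmul(ws1, transpose(h))]
    s = softmax_rows(matmul(ws2, scores))                      # Eq. (6): r x n
    return matmul(s, h), s                                     # Eq. (7): r x v

def random_matrix(rows, cols, rng):
    """Untrained placeholder weights, for illustration only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
```

Because the attention weights are computed per node and then summed, the output shape depends only on the number of views and the GCN output width, and relabeling the nodes leaves the embedding unchanged up to floating-point error.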
One potential risk in SAGE is that the $r$ views of node importance may be similar. To diversify their views of node importance, a penalization term is imposed:
$$T = \big\| S S^{T} - I_r \big\|_F^2 \qquad (8)$$
Here $\|\cdot\|_F$ represents the Frobenius norm of a matrix and $I_r$ is the $r \times r$ identity matrix. We train the classifier in a supervised way with the task at hand, in the hope of minimizing both the penalization term and the cross-entropy loss.
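As a small illustration of the penalization term, the sketch below evaluates the squared Frobenius norm of the Gram matrix of the attention views minus the identity. The function name `attention_penalty` is hypothetical, and the squared form follows (Lin et al., 2017), which is an assumption about the exact form used here.

```python
def attention_penalty(s):
    """Squared Frobenius norm of S S^T - I for an r x n attention matrix S."""
    r = len(s)
    total = 0.0
    for i in range(r):
        for j in range(r):
            gram = sum(a * b for a, b in zip(s[i], s[j]))   # entry (i, j) of S S^T
            target = 1.0 if i == j else 0.0                 # entry (i, j) of I
            total += (gram - target) ** 2
    return total
```

Views that each focus on a different single node incur no penalty, whereas identical views are penalized through the off-diagonal entries, which is what pushes the views apart during training.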
To summarize, we use SAGE to construct the instance-level classifier IC. It produces not only the estimated class probability vector $\psi$, but also a graph embedding $e$, which is the input for classifier HC described in the next subsection.
3.2.2. Graph-based classification
Given the graph embeddings $E$ and the adjacency matrix $\Theta$, our next task is to infer the parameters of classifier HC and derive the predicted probabilities $\Gamma$. This problem falls into the setting of traditional graph-based learning, where $E$ can be treated as the set of node features. Recently, neural network based approaches such as (Kipf and Welling, 2017; Yang et al., 2016) have demonstrated their superiority to traditional methods such as ICA (Sen et al., 2008). In this context we make use of GCN (Kipf and Welling, 2017) again for the consideration of efficiency and effectiveness. In the following, we consider a two-layer GCN and apply the same preprocessing as in Eq. (4) to $\Theta$, i.e., $\hat{\Theta} = \tilde{D}^{-\frac{1}{2}} (\Theta + I) \tilde{D}^{-\frac{1}{2}}$, where $\tilde{D}$ is the diagonal degree matrix of $\Theta + I$. Then the model becomes:
$$\Gamma = \mathrm{softmax}\big(\hat{\Theta}\, \mathrm{ReLU}(\hat{\Theta} E W^{(0)}) W^{(1)}\big) \qquad (9)$$
where $W^{(0)}$ is an input-to-hidden weight matrix for a hidden layer with $h$ feature maps and $W^{(1)}$ is a hidden-to-output weight matrix. The softmax function is applied row-wise and we get $\Gamma$. With the predictions $\Psi$ of IC and $\Gamma$ of HC, we can compute the supervised loss in problem (2) and the disagreement loss in problem (3).
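Given the two sets of predictions, the supervised loss and the disagreement loss can be evaluated as in the following sketch. The function names are hypothetical, the small constant `eps` is added only for numerical safety, and the direction of the KL divergence (HC predictions relative to IC predictions) is an assumption of this sketch.

```python
import math

def cross_entropy(y_true, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and a probability vector."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, probs))

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)

def objective(labels, psi, gamma):
    """Return (supervised, disagreement) losses.

    labels[i] is a one-hot list for a labeled instance and None otherwise;
    psi[i] and gamma[i] are the class probabilities predicted by IC and HC.
    """
    supervised = 0.0
    disagreement = 0.0
    for y, p_ic, p_hc in zip(labels, psi, gamma):
        if y is not None:
            supervised += cross_entropy(y, p_ic) + cross_entropy(y, p_hc)
        else:
            disagreement += kl_divergence(p_hc, p_ic)
    return supervised, disagreement
```

Labeled instances contribute only to the supervised term and unlabeled instances only to the disagreement term, mirroring the split of the objective into its two parts.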
3.3. The Proposed SEAL-CI Model
In this subsection, we present our method to minimize the objective function (1). In real-world scenarios, the number of labeled graph instances can be quite small compared to the number of unlabeled instances. In this context, neural network based classifiers such as IC may suffer from the problem of overfitting. To mitigate this, we have both the disagreement loss (3) and the supervised loss (2) included in the objective function (1). The disagreement loss can be regarded as a regularization to prevent overfitting.
Problem (1) is a mixed combinatorial and continuous optimization problem. The supervised loss (2) includes two parts, i.e., the supervised loss of IC and that of HC, and the latter depends on classifier IC to provide accurate graph embeddings. All these issues make the problem highly non-convex. As such, we use an iterative algorithm that alternates between minimizing the supervised loss of IC and HC, and minimizing the disagreement loss by trusting a subset of predictions by HC in the next iteration of graph embedding by IC.
To be more specific, we combine the graph embedding algorithm in Section 3.2.1 and the graph-based classification algorithm in Section 3.2.2 into one iterative algorithm. In iteration $t$, we build IC to produce the graph embeddings $E^{(t)}$ for all graph instances, and then feed $E^{(t)}$ into HC to get the predicted probabilities $\Gamma^{(t)}$. We then make use of $\Gamma^{(t)}$ to update the parameters of IC and generate $E^{(t+1)}$, which is then used as the input of HC in iteration $t+1$. Figure 3 depicts the overall framework of this iterative process. Although this method may not reach the global optimum, similar settings (McDowell et al., 2007; Sen et al., 2008) have been shown to be effective.
3.3.1. How to utilize the predictions of HC?
To update the graph embedding vectors, a naive approach is to feed the whole set of HC predictions into the parameter update of IC, which is the idea of the original ICA (Sen et al., 2008). However, not all of the predictions are correct, and false predictions may lead the update of the embedding neural network in the wrong direction. To this end, we make use of the idea of (McDowell et al., 2007), a variant of the original ICA, and cautiously exploit a subset of the predictions to update the parameters of IC in each iteration. Specifically, in each iteration, we choose the $k$ most confident predicted labels while ignoring the less confident ones. This operation continues until all the unlabeled samples have been utilized. To further improve the efficiency, the parameters of IC are not retrained but fine-tuned based on the parameters obtained in the previous iteration. This algorithm is called SEmi-supervised grAph cLassification via Cautious Iteration (SEAL-CI) and is presented in Algorithm 1, whose parameter set comprises all the parameters of IC and HC. In each iteration of Algorithm 1, the training set for IC is enlarged by $k$ instances, which is done by “committing” these instances’ labels according to their maximum probability. In other words, the newly enrolled training instances are found by:
$$\mathrm{Top}_k\big(\max(\gamma_i)\big), \quad g_i \in \mathcal{G}_u \qquad (10)$$
Here the function $\mathrm{Top}_k(\cdot)$ is used to select the top $k$ instances and the function $\max(\cdot)$ is used to select the maximum value in the probability vector $\gamma_i$ predicted by HC.
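The cautious selection step can be sketched as follows. This is an illustrative reading of the top-$k$ rule rather than the authors' code: `select_confident` is a hypothetical helper, `gamma` holds the class probabilities predicted by HC, and ties are broken by instance id, which is an arbitrary choice of this sketch.

```python
def select_confident(gamma, unlabeled_ids, k):
    """Return up to k (instance_id, committed_label) pairs, most confident first.

    gamma[i] is the class probability vector predicted by HC for instance i.
    """
    scored = []
    for i in unlabeled_ids:
        probs = gamma[i]
        label = max(range(len(probs)), key=probs.__getitem__)  # argmax class
        scored.append((probs[label], i, label))
    # Highest confidence first; ties are broken by instance id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(i, label) for _, i, label in scored[:k]]
```

The returned pairs would be added to the training set of IC with their committed labels, while the remaining unlabeled instances wait for a later iteration.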
3.4. The Proposed SEAL-AI Model
Our proposed model is easy to extend to the active learning scenario. In case further annotation is available, we can perform active learning and ask for annotations with a budget of $B$. Denote the set of graph instances being annotated as $\mathcal{G}_a$; then the objective function in the active learning setting is rewritten as:
$$\min \; \zeta(\mathcal{G}_l \cup \mathcal{G}_a) + \xi(\mathcal{G}_u \setminus \mathcal{G}_a) \qquad (11)$$
where $|\mathcal{G}_a| \le B$. This is still a mixed combinatorial and continuous optimization problem. It is very hard to infer the model parameters and the active learning set simultaneously: the active learning set cannot be determined until the model parameters are inferred. To solve this chicken-and-egg problem, we decompose the objective function into two sub-steps, parameter optimization and candidate generation, and optimize them iteratively. This algorithm is called SEmi-supervised grAph cLassification via Active Iteration (SEAL-AI) and is shown in Algorithm 2.
At the beginning of this iterative process, we optimize the supervised loss based on the currently labeled graphs (line 3 in Algorithm 2). In active learning, the choice of candidate generator is a key component. We exploit the idea of ALFNET (Bilgic et al., 2010) and choose the candidate graph instances by maximizing the decrease of the current disagreement loss, based on the new parameters obtained in the first step (line 6 in Algorithm 2). Finally, we annotate the selected candidates and update the labeled and unlabeled sets accordingly (lines 7-9 in Algorithm 2).
It is worth noting that, from the hard example mining perspective, the disagreement score is a natural criterion for the active learning setting. Specifically, we choose the candidates by first calculating the divergence between the predictions $\Gamma$ of HC and $\Psi$ of IC on the unlabeled graph instances:
$$D_{KL}(\Gamma \,\|\, \Psi) \qquad (12)$$
Then we choose the $k$ instances with the largest KL divergence. Intuitively, the KL divergence between $\Psi$ and $\Gamma$ can be viewed as the conflict between two supervised models. A large KL divergence indicates that at least one of the models gives a wrong prediction. Therefore, the instances with a large KL divergence are more informative and help the algorithm converge more quickly.
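A minimal sketch of this candidate generation step is given below. The helper names are hypothetical, the divergence is taken from the HC prediction to the IC prediction (an assumed direction), and `eps` is a numerical safeguard that is not part of the original formulation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)

def select_informative(psi, gamma, unlabeled_ids, k):
    """Pick the k unlabeled instances on which IC and HC disagree the most.

    psi[i] and gamma[i] are the class probabilities predicted by IC and HC.
    """
    ranked = sorted(unlabeled_ids, key=lambda i: (-kl_divergence(gamma[i], psi[i]), i))
    return ranked[:k]
```

The selected instances are the ones sent for manual annotation; instances on which the two classifiers already agree are left in the unlabeled pool.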
3.5. Complexity Analysis
We analyze the computational complexity of our proposed methods. Here we only focus on Algorithm 1, since Algorithm 2 is almost the same except for the step of selecting candidate graph instances for the training set. In Algorithm 1, the intensive parts in each iteration are the updates of IC and HC as well as the selection of candidate instances. We discuss each part in detail below.
Regarding IC, the core is to compute the activation matrix in Eq. (5), where the matrix multiplications for one input graph instance cost a number of flops linear in the number of edges in the graph instance and in the input feature dimension. Going through all graph instances thus leads to a complexity that is linear in the total number of edges over all graph instances.
Next, the computation by HC in Eq. (9) requires a number of flops linear in the number of links between graph instances. Then, in candidate selection, performing comparisons between all unlabeled graph instances, given the outputs of the two classifiers IC and HC, has a complexity that grows with the number of unlabeled graph instances.
Overall, the complexity of our method scales linearly in terms of the number of edges in each graph instance, the number of links between graph instances and the number of graph instances. Thus, our method is computationally comparable to the GCN-based method (Kipf and Welling, 2017), and more efficient than PSCN (Niepert et al., 2016), which is quasi-linear with respect to the numbers of nodes and edges.
4. Experiments
We first validate the effectiveness of our graph embedding algorithm SAGE on two data sets: PROTEINS and D&D. Then we evaluate our SEAL-C/AI methods on both synthetic and Tencent QQ group data sets.
4.1. Performance of SAGE
We use two benchmark data sets, PROTEINS and D&D, to evaluate the classification accuracy of SAGE, and compare it with state-of-the-art graph kernels and deep learning approaches. PROTEINS (Borgwardt et al., 2005) is a graph data set where nodes are secondary structure elements and an edge indicates that two nodes are neighbors in the amino-acid sequence or in 3D space. D&D (Dobson and Doig, 2003) is a set of structures of enzyme and non-enzyme proteins, where nodes are amino acids and edges represent spatial closeness between nodes. Table 1 lists the statistics of these two data sets.
Table 1. Statistics of the PROTEINS and D&D data sets.
Statistic | PROTEINS | D&D
--- | --- | ---
Max number of nodes | 620 | 5748
Avg number of nodes | 39.06 | 284.32
Number of graphs | 1113 | 1178
4.1.1. Baselines and Metrics
The baselines include four graph kernels and two deep learning approaches:

the shortest-path kernel (SP) (Borgwardt and Kriegel, 2005),

the random walk kernel (RW) (Gärtner et al., 2003),

the graphlet count kernel (GK) (Shervashidze et al., 2009),

the Weisfeiler-Lehman subtree kernel (WL) (Shervashidze et al., 2011),

PATCHY-SAN (PSCN) (Niepert et al., 2016), and

graph2vec (Narayanan et al., 2017).
We follow the experimental setting described in (Niepert et al., 2016), and perform 10-fold cross validation. In each partition, the experiments are repeated 10 times. The average accuracy and the standard deviation are reported. We list the results of the graph kernels and the best reported results of PSCN according to (Niepert et al., 2016).
For SAGE, we use the same network architecture on both data sets. The first GCN layer has 128 output channels, and the second GCN layer has 8 output channels. We set , , and the penalization term coefficient to be . The dense layer has 256 rectified linear units with a dropout rate of 0.5. We use mini-batch based Adam (Kingma and Ba, 2015) to minimize the cross-entropy loss and use He-normal (He et al., 2015) as the initializer for GCN. For both data sets, the only hyperparameter we optimized is the number of epochs.
4.1.2. Results
Table 2 lists the experimental results. As we can see, SAGE outperforms all the graph kernel methods and the two deep learning methods by 1.37 to 5.59 percentage points in accuracy. This suggests that the supervised, self-attentive embeddings produced by SAGE are effective for graph classification.
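Table 2. Classification accuracy on PROTEINS and D&D ("-" marks a result that is not reported).
Approach | PROTEINS | D&D
--- | --- | ---
SP | 75.07±0.54% | -
RW | 74.22±0.42% | -
GK | 71.67±0.55% | 78.45±0.26%
WL | 72.92±0.56% | 77.95±0.70%
PSCN | 75.89±2.76% | 77.12±2.41%
graph2vec | 73.30±2.05% | -
SAGE | 77.26±2.28% | 80.88±2.33%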
4.2. SEAL-C/AI on Synthetic Data
We evaluate the performance of SEAL-C/AI on synthetic data. We first give a description of the synthetic generator, then visualize the learned embeddings and analyze the self-attentive mechanism on the generated data. Finally we compare our methods with baselines in terms of classification accuracy.
4.2.1. Synthetic Data Generation
The benchmark data set Cora (McCallum et al., 2000) contains 2708 papers which are connected by the “citation” relationship. We borrow the topological structure of Cora to provide the skeleton (i.e., edges) of our synthetic hierarchical graph. Then we generate a set of graph instances, which serve as the nodes of this hierarchical graph. Since there are 7 classes in Cora, we adopt 7 different graph generation algorithms, that is, Watts-Strogatz (Watts and Strogatz, 1998), Tree graph, Erdős-Rényi (Erdős and Rényi, 1960), Barbell (Herbster and Pontil, 2007), Bipartite graph, Barabási-Albert graph (Bollobás and Riordan, 2003) and Path graph, to generate 7 different types of graph instances, and connect them in the hierarchical graph.
Specifically, to generate a graph instance, we first randomly sample its size, i.e., its number of nodes. Then we generate its structure and assign the class label according to the graph generation algorithm. In this step, the generator parameter in Watts-Strogatz, Erdős-Rényi, Bipartite graph and Barabási-Albert graph, as well as the branching factor for Tree graph, is randomly sampled. At last, to make this problem more challenging, we randomly remove a fraction of the edges in the generated graph. The statistics of the generated graph instances are listed in Table 3.
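Table 3. Statistics of the generated graph instances. The node and edge numbers and density are the average for each type of graph.
Type | Number | Nodes | Edges | Density
--- | --- | --- | --- | ---
Watts-Strogatz | 351 | 173 | 347 | 2.3%
Tree | 217 | 127 | 120 | 1.5%
Erdős-Rényi | 418 | 174 | 3045 | 20%
Barbell | 818 | 169 | 2379 | 16.3%
Bipartite | 426 | 144 | 1102 | 10.6%
Barabási-Albert | 298 | 173 | 509 | 3.4%
Path | 180 | 175 | 170 | 1.1%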
4.2.2. Visualization
To have a better understanding of the synthesized graph instances, we split all 2708 graph instances into two parts: 1708 instances are used for training and the remaining 1000 instances are used for testing. We apply SAGE on the training set and derive the embeddings of the 1000 testing instances. We then project these learned embeddings into a two-dimensional space by t-SNE (van der Maaten and Hinton, 2008), as depicted in Figure 4. Each color in Figure 4 represents a graph type. As we can see from this two-dimensional space, the geometric distance between the graph instances reflects their graph similarity properly.
We then examine the self-attentive mechanism of SAGE. We calculate the average attention weight across the views and normalize the resulting attention weights to sum up to 1. From the testing instances, we select three examples: a Tree graph, an Erdős-Rényi graph and a Barbell graph, for which SAGE has a high confidence in predicting their class label. The three examples are depicted in Figure 5, where a bigger node implies a larger average attention weight, and a darker color implies a larger node degree. On the left is a Tree graph, in which most of the important nodes learned by SAGE are leaf nodes. This is reasonable since leaves are discriminative features to distinguish a Tree graph from the other 6 types of graphs. In the center is an Erdős-Rényi graph. We cluster its nodes into 5 groups by hierarchical clustering (Johnson, 1967), and see that SAGE tends to highlight the nodes with large degrees within each cluster. On the right is a Barbell graph, in which SAGE pays attention to two kinds of nodes. The first kind is the nodes that connect a dense graph and a path, and the second kind is the nodes that are on the path.
4.2.3. Baselines and Metrics
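We use 6 approaches as our baselines:

GK-SVM/GCN (Shervashidze et al., 2009), which computes the graphlet count kernel (GK) matrix and feeds it into SVM or GCN.

WL-SVM/GCN (Shervashidze et al., 2011), which is similar to the above but uses the Weisfeiler-Lehman subtree kernel (WL).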

graph2vec-GCN (Narayanan et al., 2017), which embeds the graph instances by graph2vec and then feeds the embeddings to GCN.

cautious-SAGE-Cheby, which is similar to SEAL-CI except that we replace GCN with Cheby-GCN (Defferrard et al., 2016).

active-SAGE-Cheby, which is similar to SEAL-AI except that we replace GCN with Cheby-GCN (Defferrard et al., 2016).

SAGE, which ignores the connections between graph instances and treats them independently.
We use 300 graph instances as the training set for all methods except SEAL-AI and active-SAGE-Cheby, for which only 140 graphs are used as the initially labeled graph instances and the annotation budget is then set for active learning. We use 1000 graph instances as the testing set. We run each method 5 times and report its average accuracy. The number of epochs for graph2vec is 1000 and the learning rate is 0.3. To avoid overfitting of SAGE on this small data set, we use a relatively small number of neurons. The first GCN layer has 32 output channels and the second GCN layer has 4 output channels. We set and . The dense layer has 48 units with a dropout rate of 0.3. We set in HC.
4.2.4. Results
Table 4 shows the experimental results for semi-supervised graph classification. Among all approaches, SEAL-C/AI achieve the best performance. In the following, we analyze the performance of all methods categorized into 4 groups.
Group *1: Both GK-SVM and WL-SVM outperform their GCN-based counterparts, indicating that SVM is more effective than GCN with the computed kernel matrix. All the embedding-based methods perform better than these two kernel methods, which suggests that embedding vectors are effective representations for graph instances and are suitable input for graph neural networks.
Group *2: graph2vec-GCN achieves 85.2% accuracy, which is comparable to that of SAGE, but lower than that of SEAL-C/AI. One possible explanation is that graph2vec is an unsupervised embedding method, which fails to generate discriminative embeddings for classification. Another possibility is that there is no iteration in this method, and the 300 training instances do not include very informative ones. These limitations of graph2vec are also motivations for us to design the supervised embedding method SAGE and the iterative framework in SEAL-CI.
Group *3: cautious-SAGE-Cheby outperforms SAGE by only 0.8%, which is not remarkable considering that it exploits many more training instances. The accuracy of active-SAGE-Cheby is 3.3% lower than that of SEAL-AI, which suggests that Cheby-GCN is inferior to GCN in this setting.
Group *4: Both SEAL-CI and SEAL-AI outperform SAGE significantly, which supports the effectiveness of our hierarchical graph based perspective and the iterative algorithm for graph classification. SEAL-AI outperforms SEAL-CI only slightly, by 1.2%. This shows that, although SEAL-CI can make use of more training samples, it is still influenced by the cases misclassified by GCN.
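Table 4. Accuracy of semi-supervised graph classification on the synthetic data.
Group | Algorithm | Accuracy
--- | --- | ---
*1 | GK-SVM/GCN | 77.8%/73.4%
*1 | WL-SVM/GCN | 83.4%/75.5%
*2 | graph2vec-GCN | 85.2%
*3 | cautious-SAGE-Cheby | 86.5%
*3 | active-SAGE-Cheby | 89.1%
*4 | SAGE | 85.7%
*4 | SEAL-CI | 91.2%
*4 | SEAL-AI | 92.4%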
4.2.5. Influence of the number of labeled training instances
We examine how the number of labeled training instances affects the performance of our methods. We train SAGE and SEAL-CI with varying label sizes. We train SEAL-AI with 140 labeled instances and then set the budget for active learning so that the three methods have the same number of labeled training instances. We set in SEAL-CI and in SEAL-AI. We run all methods 5 times, and plot their average accuracy in Figure 6. As we can see from Figure 6, when the number of labeled training instances is 140, SEAL-CI performs best since it can utilize more training samples. As the number of labeled training instances increases, the performance of SEAL-AI improves dramatically. SEAL-AI catches up with SEAL-CI at 260 labeled training instances and outperforms SEAL-CI at 300 labeled training instances. This indicates that SEAL-AI can make use of the iterations to find informative and accurate training samples. Meanwhile, SEAL-CI trusts the predictions of GCN conditionally on their confidence, which may bring some noise to the learning process. SEAL-C/AI outperform SAGE in all cases, which makes sense because SEAL-C/AI make good use of the hierarchical graph setting and consider the connections between the graph instances for classification.
4.3. SEALC/AI on Tencent QQ Group
In this section, we evaluate SEALC/AI on Tencent QQ group data. We describe the characteristics of this data set and then present the experimental results. Finally, we discuss how to construct a hierarchical graph from real-world data.
4.3.1. Data Description
Tencent QQ is a social networking platform in China with nearly 800 million monthly active users (https://www.tencent.com/enus/articles/17000391523362601.pdf). There are around 100 million active online QQ groups. In this experiment, we select 37,836 QQ groups with 18,422,331 unique anonymized users. For each user, we extract seven personal features:

number of days since registration;

most frequently active area code in the past 90 days;

number of friends;

number of active days in the past 30 days;

number of logins in the past 30 days;

number of messages sent in the past 30 days;

number of messages sent within QQ groups in the past 30 days.
We have 298,837,578 friend relationships among these users. 1,773 groups are labeled as “game” and the remaining groups are labeled as “nongame”.
We construct the hierarchical graph from this Tencent QQ group data as follows. A user is treated as an object, and a QQ group as a graph instance. The users in one group are connected by their friendship. The attribute matrix is filled with the attribute values of the users. The statistics of the graph instances are listed in Table 5. We build the hierarchical graph from the graph instances via common members across groups. That is, we connect two groups if they have more than one common member.
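The construction rule above can be illustrated with a minimal sketch. The function name, the `min_common` parameter and the toy group and user ids are hypothetical and only serve the illustration; the pairwise loop is quadratic in the number of groups, so a data set of the size described here would more likely need an inverted index from users to groups.

```python
from itertools import combinations


def build_hierarchical_edges(groups, min_common=2):
    """Connect two groups when they share at least `min_common` members.

    `groups` maps a group id to the set of its member ids. The default of 2
    corresponds to "more than one common member".
    """
    edges = set()
    # Sorting by group id makes the edge orientation deterministic.
    for (g1, m1), (g2, m2) in combinations(sorted(groups.items()), 2):
        if len(m1 & m2) >= min_common:
            edges.add((g1, g2))
    return edges


# Toy data (hypothetical group and user ids).
toy_groups = {
    "g1": {"u1", "u2", "u3"},
    "g2": {"u2", "u3", "u4"},
    "g3": {"u3", "u5"},
}
edges_strict = build_hierarchical_edges(toy_groups, min_common=2)
edges_loose = build_hierarchical_edges(toy_groups, min_common=1)
```

With the stricter threshold only g1 and g2 are connected, whereas a threshold of one common member connects all three toy groups.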
Class label  Number  Nodes  Edges  Density
game         1,773   147    395    5.48%
nongame      36,063  365    1,586  3.28%
The node and edge numbers and density are the average for each type of QQ group.
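As a hedged illustration of the density column, the sketch below assumes the usual definition for a simple undirected graph (edges divided by the number of possible node pairs); this section does not state the definition, so that is an assumption. It also shows that an average of per-graph densities generally differs from the density computed from average node and edge counts, so the two need not agree in the table. The toy graphs are hypothetical.

```python
def graph_density(num_nodes, num_edges):
    """Density of a simple undirected graph: edges divided by possible node pairs."""
    if num_nodes < 2:
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


# Density computed from the average node and edge counts of the "game" class.
density_of_averages = graph_density(147, 395)

# Average of per-graph densities for two hypothetical graph instances.
toy_graphs = [(4, 6), (10, 9)]  # (nodes, edges)
average_of_densities = sum(graph_density(n, e) for n, e in toy_graphs) / len(toy_graphs)

# Density computed from the averaged counts of the same two toy graphs.
toy_density_of_averages = graph_density(7, 7.5)
```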
4.3.2. Baselines and Metrics
We use the same set of baselines as in Section 4.2.3. For all methods except SEALAI and activeSAGECheby, 1,000 graph instances are used as labeled training instances; for these two methods, only 500 labeled training instances are available at the start, and the active learning budget is then set to 500. We use 10,000 instances for testing for all methods. We run each method 3 times and report the average result. The hyperparameters of SAGE are the same as the settings in Section 4.1.1. Since the class distribution is quite imbalanced in this data set, we report Macro-F1 instead of accuracy.
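To make the choice of metric concrete, the following sketch computes Macro-F1 as the unweighted mean of per-class F1 scores and compares it with accuracy on a toy imbalanced sample. The toy labels and the always-"nongame" classifier are hypothetical and are not taken from the experiments.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced sample: 5 "game" and 95 "nongame" instances,
# scored against a classifier that always predicts "nongame".
y_true = ["game"] * 5 + ["nongame"] * 95
y_pred = ["nongame"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is high even though the minority class is never predicted, while Macro-F1 stays below 0.5 because the "game" class contributes an F1 of zero.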
4.3.3. Results
Table 6 shows the experimental results. SEALC/AI outperform GK, WL and graph2vec by at least 12% in Macro-F1. Within our framework, GCN is better than ChebyGCN by about 6%. SEALAI outperforms SEALCI by 2.4%. Next we explain why SEALAI outperforms SEALCI on this data set. Figure 7 shows the false prediction rate (i.e., the percentage of misclassified instances) within the most confident predictions of GCN. As we can see, the false prediction rate increases as increases and it reaches when . In the framework of SEALCI, as the iterations go on, more noise is brought into the parameter update of SAGE, while all the training samples in SEALAI are informative and correct. This explains why SEALAI outperforms SEALCI on this Tencent QQ group data.
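The false prediction rate discussed above can be sketched as follows. The parameter name `k` for the number of most confident predictions, the use of the largest class probability as the confidence score, and the toy probabilities and labels are all assumptions made for illustration; they are not the values behind Figure 7.

```python
def false_prediction_rate(probabilities, true_labels, k):
    """Fraction of misclassified instances among the k most confident predictions.

    `probabilities[i]` is the list of predicted class probabilities for
    instance i; confidence is taken to be the largest predicted probability.
    """
    scored = []
    for probs, label in zip(probabilities, true_labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        scored.append((probs[predicted], predicted != label))
    # Most confident predictions first.
    scored.sort(key=lambda item: item[0], reverse=True)
    top = scored[:k]
    if not top:
        return 0.0
    return sum(is_wrong for _, is_wrong in top) / len(top)


# Hypothetical predictions for five instances and two classes.
toy_probs = [[0.99, 0.01], [0.05, 0.95], [0.80, 0.20], [0.40, 0.60], [0.55, 0.45]]
toy_labels = [0, 1, 1, 0, 1]
rates = [false_prediction_rate(toy_probs, toy_labels, k) for k in (2, 3, 5)]
```

In this toy example the rate grows as less confident predictions are included, which is the kind of trend the paragraph above describes.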
Group  Algorithm          Macro-F1
*1     GKSVM              48.8%
*1     WLSVM              47.8%
*2     graph2vecGCN       48.1%
*3     cautiousSAGECheby  64.3%
*3     activeSAGECheby    66.7%
*4     SAGE               54.7%
*4     SEALCI             70.8%
*4     SEALAI             73.2%
4.3.4. Visualization
We provide a visualization of a “game” group and its neighborhood in Figure 8. The left part is the ego network of the center “game” group. In the one-hop neighborhood of this “game” group, there are 10 “game” groups and 19 “nongame” groups. “Game” groups are densely interconnected with a density of 34.5%, whereas “nongame” groups are sparsely connected with a density of 8.8%. The much higher density among “game” groups validates that common membership is an effective way to relate them in a hierarchical graph for classification. The right part depicts the internal structure of the ego “game” group with 22 users. A bigger node indicates a larger importance, and a darker green color implies a larger node degree. These 22 members are loosely connected and there are no triangles. This makes sense because in reality online “game” groups are not acquaintance networks. Regarding the learned node importance, node 1 has the highest importance as it is the second most active member and has a large degree in this group. Node 16 is also important since it has the highest degree in this group. The “border” member 5 has a large attention weight since it has the largest number of days since registration and is quite active in this group.
4.3.5. Discussion
How to construct a hierarchical graph from raw data is an open question. In the above experiment, we connect two QQ groups if they have more than one common member. Changing this threshold directly affects the edge density in the hierarchical graph, and may influence the classification performance. For example, if we connect two QQ groups when they have one common member or more, the edge density is 2.8% compared with 0.27% in the first setting. A proper setting of this threshold is data dependent, and can be determined through a validation set.
5. Related Work
This work is related to semi-supervised classification of networked data, variable-sized graph embedding and active learning.
Most work on semi-supervised learning for networked data aims to utilize the network structure to boost the learning performance. The assumption is that network context can provide additional information that is not covered by node attributes. Ever since the pioneering work of Sen et al. (Sen et al., 2008), the Iterative Classification Algorithm (ICA) has become a paradigm for networked data with limited annotations. In ICA, for each node a local classifier takes the estimated labels of its neighborhood and its own features as input, and outputs a new estimated label. The iteration continues until the estimations in adjacent iterations stabilize. In ALFNET (Bilgic et al., 2010), the authors first cluster the network nodes into several groups, and design a content-only classifier CO and a collective classifier CC. Based on the disagreement score of CO and CC in each iteration, a candidate instance set is generated from different clusters and labeled. Then both CO and CC are retrained using the labeled set until convergence. One main difference between ICA and ALFNET is that ICA does not require human intervention while ALFNET needs human annotation in case labels of the candidate set are not available.
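The ICA loop described above can be sketched in a few lines. This is a simplified illustration, not the implementation of Sen et al.: the local classifier here is a hypothetical hand-written voting rule rather than a trained model, and the node names, features and edges are toy values.

```python
def iterative_classification(features, neighbors, known_labels, local_classifier, max_iters=10):
    """ICA-style loop: re-estimate each unlabeled node from its own features and
    its neighbors' current label estimates until the estimates stop changing."""
    labels = dict(known_labels)
    # Bootstrap: estimate unlabeled nodes from their own features only.
    for node in features:
        if node not in labels:
            labels[node] = local_classifier(features[node], [])
    for _ in range(max_iters):
        changed = False
        for node in features:
            if node in known_labels:
                continue  # observed labels stay fixed
            neighbor_labels = [labels[n] for n in neighbors.get(node, [])]
            new_label = local_classifier(features[node], neighbor_labels)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # estimates are stable across two consecutive iterations
    return labels


def toy_classifier(feature, neighbor_labels):
    """Hypothetical local classifier: a soft vote from the node's own feature
    plus one vote per neighbor label (+1 for class 1, -1 for class 0)."""
    score = (2 * feature - 1) + sum(1 if label == 1 else -1 for label in neighbor_labels)
    return 1 if score > 0 else 0


toy_features = {"a": 0.9, "e": 0.9, "b": 0.4, "c": 0.4}
toy_neighbors = {"a": ["b"], "e": ["b"], "b": ["a", "e", "c"], "c": ["b"]}
result = iterative_classification(toy_features, toy_neighbors, {"a": 1, "e": 1}, toy_classifier)
```

Nodes b and c would be assigned class 0 from their features alone, but the labels of their neighbors move both to class 1, which is the kind of benefit from network context that ICA relies on.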
Recent work has focused on using deep neural networks to further improve the performance. (Yang et al., 2016) leverages both network context and node features by jointly training node embeddings to predict the class label and the context of the network. Later, Kipf and Welling (Kipf and Welling, 2017) simplify the loss design by only considering the supervised loss while network context is exploited by the GCN operator. Our problem setting is different from all of the above, as a node is no longer a fixed-size feature vector but a variable-size graph. It can be regarded as a generalization of the previous setting, and cannot be handled by existing solutions effectively.
Representation learning on graphs has been proposed to transform instances in topological space into fixed-size vectors in Euclidean space in which geometric distance reflects their structural similarity. There are two trends on this topic, one of which is a shift from node embedding (Perozzi et al., 2014; Grover and Leskovec, 2016) to whole graph embedding. (Yanardag and Vishwanathan, 2015) uses the CBOW and skip-gram models (Mikolov et al., 2013), previously proven to be successful in natural language processing, to learn a new graph kernel. Meanwhile, some other methods focus on generating graph embeddings by integrating node embeddings. (Niepert et al., 2016) proposes a spatial-based graph CNN operator and then concatenates the obtained node representations by imposing a problem-specific node ordering. (Defferrard et al., 2016) defines a “graph coarsening” operation by first clustering the node representations and then applying a max-pooling operation. However, all these methods need some preprocessing steps such as node ordering or clustering, which is not a necessity from a data-driven perspective. Another trend is a shift from unsupervised embedding (Mikolov et al., 2013) to supervised embedding (Dai et al., 2016; Lin et al., 2017), which provides better performance for downstream classification tasks. In this sense, our embedding method SAGE performs whole graph embedding in a supervised way.
Active learning has been integrated in many collective classification methods (Settles, 2012; Bilgic et al., 2010) to find the most informative samples to be labeled. However, research that combines active learning with deep semi-supervised learning is still lacking. The closest work is (Zhou et al., 2017), in which the authors utilize active learning to incrementally fine-tune a CNN for image classification. Our solution SEALAI is different in the sense that the informative samples selected by active learning are used to update the parameters of the graph embedding network, whose output is then fed into HC in an iterative framework.
6. Conclusion
In this paper, we study semi-supervised graph classification from a hierarchical graph perspective. The hierarchical graph is too complex an input for classification, so we first design a supervised, self-attentive graph embedding method SAGE to embed graph instances into fixed-length vectors, which are a common input form for classification. We build two classifiers IC and HC at the graph instance level and the hierarchical graph level respectively to fully exploit the available information. Our semi-supervised solutions SEALC/AI adopt an iterative framework to update IC and HC alternately with an enlarged training set. Experimental results on synthetic graphs and Tencent QQ group data show that SEALC/AI outperform other competitors by a significant margin in accuracy/Macro-F1, and they also generate meaningful interpretations of the learned representations for graph instances.
Acknowledgements.
The authors would like to thank Tencent Security Platform Department for discussions and suggestions. The work described in this paper was supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China [Project No.: CUHK 14205618], Tencent AI Lab RhinoBird Focused Research Program GF201801 and the CUHK Stanley Ho Big Data Decision Analytics Research Centre.
References
 Bilgic et al. (2010) M. Bilgic, L. Mihalkova, and L. Getoor. 2010. Active learning for networked data. In ICML. 79–86.
 Bollobás and Riordan (2003) B. Bollobás and O. M. Riordan. 2003. Mathematical results on scale-free random graphs. Handbook of graphs and networks: from the genome to the internet (2003), 1–34.
 Borgwardt and Kriegel (2005) K. M. Borgwardt and H.-P. Kriegel. 2005. Shortest-path kernels on graphs. In ICDM. 74–81.
 Borgwardt et al. (2005) K. M. Borgwardt, C. S. Ong, S. Schönauer, S.V.N. Vishwanathan, A. J. Smola, and H.-P. Kriegel. 2005. Protein function prediction via graph kernels. In ISMB. 47–56.
 Dai et al. (2016) H. Dai, B. Dai, and L. Song. 2016. Discriminative embeddings of latent variable models for structured data. In ICML. 2702–2711.
 Defferrard et al. (2016) M. Defferrard, X. Bresson, and P. Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS. 3844–3852.
 Dobson and Doig (2003) P. D. Dobson and A. J. Doig. 2003. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology 330, 4 (2003), 771–783.
 Erdős and Rényi (1960) P. Erdős and A. Rényi. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5, 1 (1960), 17–60.
 Gärtner et al. (2003) T. Gärtner, P. Flach, and S. Wrobel. 2003. On graph kernels: Hardness results and efficient alternatives. In Learning theory and kernel machines. Springer, 129–143.
 Grover and Leskovec (2016) A. Grover and J. Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD. 855–864.
 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV. 1026–1034.
 Hearst et al. (1998) M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. 1998. Support vector machines. IEEE Intelligent Systems and their Applications 13, 4 (1998), 18–28.
 Herbster and Pontil (2007) M. Herbster and M. Pontil. 2007. Prediction on a graph with a perceptron. In NIPS. 577–584.
 Johnson (1967) S. C. Johnson. 1967. Hierarchical clustering schemes. Psychometrika 32, 3 (1967), 241–254.
 Kingma and Ba (2015) D. P. Kingma and J. Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
 Kipf and Welling (2017) T. N. Kipf and M. Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
 Li et al. (2018) Q. Li, Z. Han, and X. Wu. 2018. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning. In AAAI. 3538–3545.
 Lin et al. (2017) Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. 2017. A Structured Self-attentive Sentence Embedding. In ICLR.
 McCallum et al. (2000) A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore. 2000. Automating the construction of internet portals with machine learning. Information Retrieval 3, 2 (2000), 127–163.
 McDowell et al. (2007) L. K. McDowell, K. M. Gupta, and D. W. Aha. 2007. Cautious inference in collective classification. In AAAI. 596–601.
 Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111–3119.
 Narayanan et al. (2017) A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal. 2017. graph2vec: Learning Distributed Representations of Graphs. CoRR abs/1707.05005 (2017). arXiv:1707.05005
 Niepert et al. (2016) M. Niepert, M. Ahmed, and K. Kutzkov. 2016. Learning Convolutional Neural Networks for Graphs. In ICML. 2014–2023.
 Perozzi et al. (2014) B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. Deepwalk: Online learning of social representations. In KDD. 701–710.
 Ramanath et al. (2018) R. Ramanath, H. Inan, G. Polatkan, B. Hu, Q. Guo, C. Ozcaglar, X. Wu, K. Kenthapadi, and S. C. Geyik. 2018. Towards Deep and Representation Learning for Talent Search at LinkedIn. In CIKM. 2253–2261.
 Rousseau et al. (2015) F. Rousseau, E. Kiagias, and M. Vazirgiannis. 2015. Text categorization as a graph classification problem. In ACL-IJCNLP. 1702–1712.
 Sen et al. (2008) P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. 2008. Collective classification in network data. AI magazine 29, 3 (2008), 93–106.
 Settles (2012) B. Settles. 2012. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6, 1 (2012), 1–114.
 Shervashidze et al. (2011) N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12, Sep (2011), 2539–2561.
 Shervashidze et al. (2009) N. Shervashidze, S.V.N. Vishwanathan, T. Petri, K. Mehlhorn, and K. M. Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In AISTATS. 488–495.
 v. d. Maaten and Hinton (2008) L. v. d. Maaten and G. Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
 Watts and Strogatz (1998) D. J. Watts and S. H. Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature 393 (1998), 440–442.
 Yanardag and Vishwanathan (2015) P. Yanardag and S.V.N. Vishwanathan. 2015. Deep Graph Kernels. In KDD. 1365–1374.
 Yang et al. (2016) Z. Yang, W. W. Cohen, and R. Salakhutdinov. 2016. Revisiting semi-supervised learning with graph embeddings. In ICML. 40–48.
 Zhou et al. (2017) Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang. 2017. Fine-Tuning Convolutional Neural Networks for Biomedical Image Analysis: Actively and Incrementally. In CVPR. 4761–4772.