1 Introduction
Graphs are one of the most general ways to represent structured data. In general, a set of entities with some given pairwise relationships between them can be modeled as a graph $G = (V, E)$ with a corresponding node set $V$ and an edge set $E$. Real-world examples of graph-structured data are abundant and include social networks, co-citation networks or biological networks.
In addition to the graph structure, further attribute information may be provided for the entities described by the graph nodes. In an attributed graph, each node $v_i \in V$ is associated with an attribute vector $x_i$. For instance, social network users might be enriched with personal information, or documents in a co-citation network might be described by bag-of-words vectors. The increasing relevance of graph-structured data has been accompanied by an increased interest in learning algorithms which can leverage the underlying graph structure to make accurate predictions for the modeled entities.
An important semi-supervised learning task on graphs is node classification, where each node $v_i$ can be associated with a set of class labels (simply referred to as labels in the following) represented by a label vector $y_i \in \{0, 1\}^l$, where $l$ is the number of possible labels. Given a set of already labeled nodes in a graph, the goal is to predict new likely labels for unlabeled nodes. The task is semi-supervised in the sense that connectivity information about the whole graph is available and at least some of the class labels are already known. In the case of attributed graphs, attributes of all nodes can additionally be used for prediction, including those of the unlabeled nodes in the graph. Important applications include recommendation in social networks, where the node labels represent user interests, or document classification in co-citation networks, where the node labels indicate associated fields of research.
Approaches for node classification on graphs may employ additional node attributes or operate on the graph structure alone. We will refer to these approaches as attribute-based and connectivity-based approaches, respectively. Among the most successful connectivity-based methods are node embedding techniques [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. An underlying assumption of these techniques is that nodes which are closely connected in the graph should have similar labels, which is commonly referred to as homophily [11]. Our method does not rely on the homophily assumption, but is still able to relate close-by nodes. Furthermore, unlike most node embedding techniques, our new approach can be used to classify nodes unseen during training. In the attribute-based setting, the graph structure can be incorporated in different ways, for instance by using regularization [12, 13, 14, 15], combining attributes with node embeddings [16], or aggregating them over local neighborhoods [17, 18, 19, 20, 21, 22, 23, 24, 25]. While regularization-based methods rely on the homophily assumption and most of them are not able to classify instances unseen during training, all other methods focus on node attributes. In addition to connectivity and node attributes, the labels available during training provide valuable information that is in general complementary to connectivity and attribute features and is useful for improving classification. In general learning tasks on independent and identically distributed (iid) data, labels indicate that an observation is sampled from a particular distribution. In a graph, however, we have non-iid data, and thus the labels of connected objects allow for a novel use of label information which has not been exploited before for learning graph embeddings.
In this paper, we propose a label-based approach to learn a node embedding which allows for more accurate node classifications. The main idea of our approach is that there often exists a correlation between the labels of a node and the distribution of labels in its local neighborhood. Thus, considering the local label distribution when computing a node embedding can exploit this correlation to improve the descriptiveness of the learned embedding. In Figure 1, we illustrate this for a typical case in which the label of a node is determined by the labels of neighboring nodes and not by node attributes or connectivity. As an additional example, the function of a protein can be expected to correlate strongly with the functions of interacting proteins. As mentioned above, we assume that the labels of at least some of the neighboring nodes are known for each new node with unknown labels. In the majority of applications, this is realistic because new nodes usually connect to already known parts of the network. For instance, new papers usually cite established articles, and new members of a social network will usually already know multiple friends in the network to connect to.
Though labels can be considered as another type of node attribute, there exists an important difference between labels and attributes which prevents attribute-based embeddings from generalizing well on label information. While the attribute values of the predicted node may be used for learning the embedding, using the node's own labels, even in a transitive way, leads to overfitting and poor generalization performance of the learned embedding. We will discuss these issues in more detail in Section 3 and introduce a simple baseline method. In our new method, we aggregate labels from relevant nodes directly and thus can completely exclude any influence of the nodes’ own labels. In a first step, we determine the relevant neighbors of a given node based on Approximate Personalized PageRank (APPR). Since this might be an expensive task for large graphs, we use an adaptation of the highly efficient algorithm from [26]. After determining the neighborhood, we compute the label distribution within the neighborhood and classify the node based on this novel representation. We compare our new representation to state-of-the-art graph embeddings on several benchmark datasets for node classification.
The remainder of the paper is structured as follows: After providing a formal problem definition for our approach in Section 2, we introduce our new method in Section 3, starting with a discussion on the possibility of incorporating label-based features into existing models in Section 3.1. After a discussion of related work in Section 4, the performance of our model is evaluated experimentally and compared to state-of-the-art methods in Section 5. Finally, Section 6 concludes the paper and proposes directions for future work.
2 Problem Setting
We consider (possibly directed) graphs $G = (V, E)$, with node set $V = \{v_1, \dots, v_n\}$ and edge set $E \subseteq V \times V$. A graph can be represented by an adjacency matrix $A$, where $a_{ij}$ denotes the weight of the edge $(v_i, v_j)$. In case of an unweighted graph, $a_{ij} = 1$ indicates the existence and $a_{ij} = 0$ the absence of an edge between $v_i$ and $v_j$. Furthermore, we do not allow self-links, i.e., $a_{ii} = 0$ for all nodes $v_i \in V$. In an attributed graph, additional node attributes are provided in the form of an attribute vector $x_i$ for each node $v_i$. The attribute information for the whole graph can be represented by an attribute matrix $X$, where the $i$th row of $X$ corresponds to $v_i$’s attribute vector $x_i$. Let us note that an important difference between attributes and labels is that attributes are usually known for all nodes, in particular for those nodes without known labels.
Our problem setting is semi-supervised node classification, where the node set $V$ is partitioned into a set of labeled nodes $L$ and unlabeled nodes $U$, such that $L \cup U = V$ and $L \cap U = \emptyset$. Thereby, each node $v_i$ is associated with a label vector $y_i \in \{0, 1\}^l$, where $l$ is the number of possible labels and an entry of one indicates the presence of the corresponding label for a certain node. The labels available for training can be represented by an $n \times l$ label matrix $Y$, where the $i$th row of $Y$ corresponds to the label vector $y_i$ of $v_i$ if $v_i \in L$. For unlabeled nodes, we assign constant zero vectors. The task is now to train a classifier using $A$, $Y$ and possibly $X$ which accurately predicts $y_i$ for each $v_i \in U$. In multiclass classification, each node is assigned to exactly one class, such that $y_i$ is the $j$th unit vector if $v_i$ is assigned to class $j$. Multilabel classification denotes the general case, in which each node may be assigned to one or more classes and the goal is to predict all labels assigned to a particular node.
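The setting above can be illustrated with a small sketch. The toy graph, the helper name `build_label_matrix` and the dictionary input format are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def build_label_matrix(n_nodes, n_labels, labeled):
    """Return an n x l binary label matrix; unlabeled nodes keep all-zero rows.

    `labeled` maps a node index to the list of its label indices
    (a hypothetical input format chosen for this sketch).
    """
    Y = np.zeros((n_nodes, n_labels), dtype=np.float64)
    for node, labels in labeled.items():
        Y[node, labels] = 1.0
    return Y

# Toy undirected, unweighted path graph 0-1-2-3 (illustrative data only).
edges = [(0, 1), (1, 2), (2, 3)]
A = np.zeros((4, 4))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # symmetric entries, diagonal stays zero (no self-links)

# Nodes 0 and 3 are labeled, nodes 1 and 2 are unlabeled (zero rows in Y).
Y = build_label_matrix(4, 3, {0: [0], 3: [1, 2]})
```

Node 3 carries two labels here, so the toy example corresponds to the multilabel case; in the multiclass case every labeled row would be a unit vector.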
3 Semi-Supervised Learning on Graphs Based on Local Label Distributions
3.1 Labels as Attributes
The main idea of our approach is to learn a more descriptive node representation by incorporating the known labels in the neighborhood of a node. In the following, we will show why existing methods are not suitable to consider this information. Methods relying on neighborhood similarity [1, 2, 3, 4, 5, 6, 9] learn representations in an unsupervised manner and thus rely only on the topology of the graph and not on attributes or labels. The Planetoid-T model [16] considers labels by partly enforcing the similarity between members of the same class, and therefore nodes are related to each other based only on their own labels.
Graph Neural Networks [27, 28] and Graph Convolution Networks (GCN) [17, 18, 19, 20, 21, 22, 23, 24, 25] are special cases of Message Passing Neural Networks (MPNN) [29], a framework describing a family of neural network based models for attributed graphs. All MPNN methods have in common that they use some differentiable function to iteratively compute messages for each node, which are passed to all its neighbors. These messages form the input to a differentiable update function which computes new node representations:
$$m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t), \qquad h_v^{t+1} = U_t(h_v^t, m_v^{t+1}).$$
Here $t$ denotes the current iteration, $h_v^t$ is the representation of node $v$ in iteration $t$ and vector $h_v^0 = x_v$ corresponds to the input features of node $v$. $N(v)$ denotes the set of direct neighbors of node $v$, $M_t$ are the message functions and $U_t$ the update functions. The obvious way to integrate the neighborhood label information into an MPNN-based prediction model is to include the label information in the messages directed to the neighbors in the first iteration. However, even after removing self-links, each node would receive information about its own labels already in the second iteration during the training stage, because its neighbors pass the received labels back. Thus, models learned on these representations overfit on the nodes’ own labels and do not generalize well in the inference step, where the node labels are unknown. The same applies to directed graphs with cycles. Therefore, applying MPNN models to communicate neighboring labels is restricted to one iteration only. We use a corresponding model as a baseline for our experiments.
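The leakage of a node's own labels in the second iteration can be checked numerically. The following sketch is a deliberate simplification: it replaces the learned message and update functions by plain sum aggregation over the adjacency matrix, and the toy graph and labels are illustrative assumptions.

```python
import numpy as np

# Toy undirected path graph 0-1-2 without self-links (illustrative data only).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Y = np.array([[1., 0.],
              [0., 1.],
              [1., 0.]])   # one-hot labels for two classes

one_hop = A @ Y            # labels each node receives after one iteration
two_hop = A @ (A @ Y)      # labels each node receives after two iterations

# Weight with which a node's own label enters its aggregated message.
own_weight_1 = np.diag(A)        # all zeros: no leakage after one iteration
own_weight_2 = np.diag(A @ A)    # equals the node degree: the label returns via every neighbor
```

In an undirected graph without self-links the diagonal of $A^2$ equals the node degrees, so every non-isolated node sees its own label after two aggregation steps, which is why only one iteration is safe in this simplified view.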
3.2 General Approach
To present our method for semi-supervised learning on graphs using local label distributions, we first outline an efficient algorithm for computing node neighborhoods based on Approximate Personalized PageRank (APPR). Afterwards, we describe how to create node representations from the label distribution in the local neighborhood determined by APPR. Finally, the node representations can be used as feature descriptors in arbitrary classification models.
The Personalized PageRank (PPR) corresponds to the PageRank algorithm [30], where the probabilities in the starting vector are biased towards some set of nodes. The result is the “importance” of all nodes in the graph from the viewpoint of the nodes in this set.
The push algorithm described in [31] and [32] is an efficient way to compute an approximation of the Personalized PageRank (APPR) vector if the start distribution vector is sparse. The idea behind the push algorithm is to consider a node in the local neighborhood only if the probability to visit the node is significantly larger than the probability to visit any other node from the rest of the graph. This leads to a sparse solution, meaning that only relatively few nodes of the underlying graph are contained in the resulting APPR vector.
Algorithm 1 describes the computation of APPR using a variant of the push operation on lazy random walk transition matrices of undirected unweighted graphs. This algorithm was proposed in [33], where APPR is used to partition graphs. We describe an adapted version from [26] which converges faster. The algorithm maintains two vectors: the solution vector $p$ and a residual vector $r$. The vector $p$ is the current approximation of the PPR vector, and vector $r$ contains the approximation error, i.e., the not yet distributed probability mass. $p_u$ and $r_u$ are the entries in vectors $p$ and $r$ corresponding to node $u$, and $d(u)$ is the degree of node $u$. In each iteration the algorithm selects a node with sufficient probability mass in vector $r$. This probability mass is spread between the node entry in $p$ and the entries of its direct neighbors in $r$. In each step, the exact PPR is the linear combination of the current solution vector $p$ and the PPR solution for $r$, i.e., $\mathrm{ppr}(\alpha, s) = p + \mathrm{ppr}(\alpha, r)$, where $s$ is the starting vector. This procedure can also be trivially adapted to directed graphs and graphs with weighted edges. Moreover, the algorithm requires two parameters: $\alpha$, which is the teleportation parameter and determines the level of locality for each node neighborhood; and $\epsilon$, which is an approximation threshold and controls the approximation quality and the runtime. In fact, the push algorithm performs updates as long as there is one node for which at least $\epsilon$ probability mass is moved towards each of its neighbors. The complexity of the procedure is $O(\frac{1}{\epsilon \alpha})$.
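The push procedure described above can be sketched as follows. This is a minimal illustration, not a transcription of Algorithm 1: the function name `appr_push`, the adjacency-list input, the default parameter values and the queue-based node selection are assumptions. The update rule used here collapses repeated lazy-walk pushes at the same node into one step, so that the residual of the pushed node drops to zero and its mass is split between its own solution entry and the residuals of its neighbors.

```python
from collections import deque

def appr_push(adj, seed, alpha=0.15, eps=1e-4):
    """Approximate the PPR vector of `seed` on an undirected, unweighted graph.

    adj:   dict mapping each node to the list of its neighbors (no self-links).
    alpha: teleportation parameter; eps: approximation threshold.
    Returns (p, r): sparse dicts holding the solution and the residual vector.
    """
    p, r = {}, {seed: 1.0}          # all probability mass starts as residual on the seed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        mass = r.get(u, 0.0)
        if deg_u == 0 or mass < eps * deg_u:
            continue                # isolated node or not enough residual mass
        # Move part of the residual mass of u into the solution vector ...
        p[u] = p.get(u, 0.0) + 2.0 * alpha / (1.0 + alpha) * mass
        r[u] = 0.0
        # ... and spread the rest evenly over the residuals of its neighbors.
        share = (1.0 - alpha) / (1.0 + alpha) * mass / deg_u
        for v in adj[u]:
            before = r.get(v, 0.0)
            r[v] = before + share
            threshold = eps * len(adj[v])
            if before < threshold <= r[v]:
                queue.append(v)     # v just crossed the push threshold
    return p, r

# Toy graph (illustrative data only): a triangle 0-1-2 with a pendant node 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
p, r = appr_push(adj, seed=0)
```

Every push conserves probability mass, so the entries of `p` and `r` always sum to one, and on termination every residual entry is below `eps` times the node degree. Larger values of `eps` stop earlier and yield sparser vectors.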
3.2.1 Local Label Distribution
In our approach we first compute the APPR vector for each node. Before APPR is computed for node $v_i$, the corresponding entry in the starting vector is set to one and all other entries to zero. Therefore, the APPR vector of $v_i$ describes the importance of local neighbors only from its point of view.
In the APPR result matrix, each row corresponds to the APPR vector of the corresponding node. The local label distribution representation is computed by multiplying the APPR matrix with the label matrix $Y$. The diagonal of the APPR matrix is set to zero beforehand to exclude information about the node's own labels. Therefore, the entry in row $i$ and column $j$ of the resulting representation can be interpreted as the probability that a random walk starting from node $v_i$ stops at a neighbor with label $j$.
The local label distribution can be used as a node embedding vector which can be passed into an arbitrary classification algorithm. In our experiments, we employ a multilayer perceptron.
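The computation of the local label distribution can be sketched in a few lines. The dense NumPy matrices, the function name `local_label_distribution` and the toy values are assumptions for illustration; for large graphs a sparse matrix representation would be the more natural choice.

```python
import numpy as np

def local_label_distribution(P, Y):
    """Multiply the APPR matrix (diagonal zeroed) with the label matrix.

    P: n x n matrix whose i-th row is the APPR vector of node i.
    Y: n x l binary label matrix with zero rows for unlabeled nodes.
    """
    P = np.array(P, dtype=np.float64)   # copy, so the caller's matrix is not modified
    np.fill_diagonal(P, 0.0)            # exclude each node's own labels
    return P @ np.asarray(Y, dtype=np.float64)

# Toy values (illustrative only); node 2 is unlabeled.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
Y = [[1, 0],
     [0, 1],
     [0, 0]]
Z = local_label_distribution(P, Y)
```

Because the diagonal is zeroed, changing the label of a node never changes its own row of the result, which is the property that prevents overfitting on a node's own labels.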
4 Related Work
Numerous approaches for semi-supervised learning on graphs have been proposed in recent years. These can be sorted into two main categories: unsupervised node embedding techniques and semi-supervised techniques.
4.1 Unsupervised Node Embedding
A strong focus of recent developments related to learning from structural relationships has been placed on learning node embeddings, where a latent vector representation is learned for each node, reflecting its connectivity in the underlying graph. The learned node embeddings can be used as an input to a subsequent downstream task, such as node classification. Random walk based methods [1, 2] sample a number of random walks from the graph, and nodes are related if they have common neighbors. LINE [3] is another variant, which considers direct first- and second-order proximities instead of random walks. Graph2Gauss [9] learns similarity to $k$-hop neighborhoods and embeds each node as a Gaussian distribution to allow for uncertainty in the representation. GECS [34] uses connection subgraphs to determine appropriate node neighborhoods. More closely related to our approach, LASAGNE [4] relies on APPR to determine relevant context nodes. Other works perform matrix factorization. For instance, GraRep [5] factorizes a sequence of $k$-step log-probability matrices with SVD and concatenates the resulting low-dimensional node representations to form the final representations. Abu-El-Haija et al. propose matrix factorization of a random-walk occurrence matrix with different approaches to determine the context window size distribution [35]. SDNE [6] uses a multi-layer autoencoder model to capture non-linear structures based on direct first- and second-order proximities. The authors of [36] propose embeddings in hyperbolic space. HARP [37] addresses the local minima problem and introduces an iterative scheme for learning node representations which can be used with different embedding learning methods. An input graph is coarsened on different levels, node representations are learned starting with the coarsest graph, and the learned embeddings are provided as initializations for the embeddings of subsequent finer graphs. While the above methods rely on the homophily assumption, struc2vec [7] aims at learning representations which relate structurally similar nodes instead of nodes which are close in the graph. It does so by using degree sequences in neighborhoods of different sizes. All of the above approaches are transductive in the sense that labels can only be predicted for unlabeled nodes observed already at training time. The GraphSAGE [8] framework introduces inductive node embeddings. The basic idea is to learn an embedding function by sampling and aggregating node attributes in local neighborhoods. The embedding function can further be learned with a supervised loss function. Inductive models are also obtained by considering node attributes. Variational Graph Auto-Encoders [10] learn node representations using a variational autoencoder, where the encoder is a two-layer GCN. The model can be applied to attributed and non-attributed graphs.
4.2 Semi-Supervised Learning on Graphs
Compared to separately optimizing steps in a semi-supervised learning pipeline, as is the case for semi-supervised learning with pre-trained node embeddings, end-to-end training usually leads to better performance on the supervised learning objective.
One important direction is Laplacian Regularization, where the prediction loss is augmented with an unsupervised loss function based on the graph’s Laplacian matrix, encoding the homophily assumption that close-by nodes should have the same label. Related approaches include Manifold Regularization [14], a kernel-based method, and Deep Semi-Supervised Embedding [15], which incorporates node embeddings by augmenting neural network models with an embedding layer. Both of these methods generalize to attributed graphs. Label Propagation [12, 13] is more closely related to our approach. The main idea is to propagate the observed labels through the graph via the random walk transition matrix. Similarly, Collective Classification [38] starts with the observed labels and iteratively classifies nodes by a majority vote over previously inferred labels in the local neighborhood. However, neither of these two methods considers additional node attributes, and no actual learning is involved. Similarly to Collective Classification, the more recent methods Structural Neighborhood Based Classification [39] and Weighted-Vote Geometric Neighbor Classification [40] also classify nodes based on labels in local neighborhoods. They achieve this by learning a model which predicts node labels from a feature vector describing the local neighborhood. Both methods assume an unattributed graph.
Instead of imposing regularization, Planetoid [16] combines the prediction loss with node embeddings by training a joint model which predicts class labels as well as graph context for a given node. The graph context is sampled from random walks as well as from the set of nodes with shared labels. This allows Planetoid to relate nodes with similar labels even if they are not close in the graph. Thus, Planetoid does not rely on a strong homophily assumption. In addition to a connectivity-based variant, Planetoid-G, the authors propose two further architectures which incorporate node attributes. The transductive variant Planetoid-T starts with pre-trained embeddings and alternately optimizes the prediction and embedding loss functions. The inductive variant Planetoid-I, on the other hand, predicts the graph context from the node features instead.
Another important direction which has recently gained increasing attention is concerned with generalizing deep neural network architectures to graph-structured domains. As the general approach consists of incorporating graph structure into supervised learning, these models assume an attributed graph. However, they can naturally be applied to non-attributed graphs by using the identity matrix as the attribute matrix. The vast majority of neural network based models for semi-supervised learning on graphs can be described within a message-passing framework. In a Message Passing Neural Network (MPNN) [29], each node $v$ has a hidden state which is updated iteratively during training. The initial hidden state of a node corresponds to its attribute vector. In a first step, messages from $v$’s neighborhood are received and aggregated, where a message from neighbor $w$ depends on $v$’s and $w$’s hidden states. In a second step, $v$’s state is updated by combining it with the aggregated messages. An important special case are Graph Convolution Networks [17, 18, 19, 20, 22, 23, 25], which aggregate node attributes over local neighborhoods with spatially localized filters, similar to classical convolutional networks on images [41]. The ChebNet [19] aggregates messages from neighbors analogously to the eigenvectors of the graph’s Laplacian matrix. The update function ignores the previous state and applies a non-linear activation. The resulting filters are $K$-localized. The GCN [20] is a simplification of the ChebNet which only considers one-hop neighbors. Messages are aggregated according to a normalized adjacency matrix. In the update phase, the aggregated messages are multiplied with a learned filter matrix, followed by a ReLU activation. For graph convolution networks, the number of message passing iterations corresponds to the number of layers.
5 Evaluation
We evaluate our approach by performing node-label prediction and compare the quality against state-of-the-art methods in terms of the micro $F_1$ score for multiclass prediction tasks, and the micro $F_1$ and macro $F_1$ scores for multilabel prediction tasks.
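For reference, the two evaluation metrics can be computed from binary indicator matrices as sketched below. The function name `f1_scores` and the convention of scoring a label without any true or predicted positives as zero are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1) for n x l binary indicator matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0).astype(float)    # true positives per label
    fp = (~y_true & y_pred).sum(axis=0).astype(float)   # false positives per label
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)   # false negatives per label

    # Micro F1 pools the counts over all labels before computing the score.
    micro_den = 2.0 * tp.sum() + fp.sum() + fn.sum()
    micro = 2.0 * tp.sum() / micro_den if micro_den > 0 else 0.0

    # Macro F1 averages the per-label scores, so rare labels count as much as frequent ones.
    den = 2.0 * tp + fp + fn
    per_label = np.divide(2.0 * tp, den, out=np.zeros_like(tp), where=den > 0)
    return micro, per_label.mean()

# Toy predictions (illustrative only): one positive of label 0 is missed.
micro, macro = f1_scores([[1, 0], [1, 1], [0, 1]],
                         [[1, 0], [0, 1], [0, 1]])
```

In single-label multiclass prediction every error is simultaneously one false positive and one false negative, so the micro $F_1$ score coincides with accuracy there.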
For both tasks, we compare our model against the following approaches:

Adj: a baseline approach which learns node embeddings only based on the information contained in the adjacency matrix

GCN_only_L: a GCN which applies convolution on the label matrix $Y$. We use one convolution layer on the adjacency matrix without self-links, followed by a dense output layer (see Section 3.1 for an explanation of why only one convolution layer makes sense)

noFeat GCN: the standard 2-layer GCN as published by Kipf et al. [20] without using the node attributes

DeepWalk: the DeepWalk model as proposed in [1]

node2vec: the node2vec model as proposed in [2]

Planetoid-G: the Planetoid variant which does not use attribute information [16]
Unless stated otherwise, we use the parameter settings suggested by the corresponding authors for all competitors. Except for minor adaptations, e.g., to include label information in the one-layer GCN models or to make the Planetoid models applicable to multilabel prediction tasks, we use the original implementations as published by the corresponding authors.
Our model is denoted as LD (short for Label Distribution). For these experiments we train a simple feed-forward neural network which takes the label distribution based representations as input and returns class probabilities as output.
Note that we omit the comparison to label propagation [12] since Yang et al. already showed that the Planetoid model outperforms this approach [16].
5.1 Multiclass Prediction
5.1.1 Experimental Setup
For the multiclass label prediction task we use the following three text classification benchmark graph datasets [38, 42]:

Cora. The Cora dataset contains 2,708 publications from seven categories in the area of machine learning. The citation graph consists of 2,708 nodes, 5,278 edges, 1,433 attributes and 7 classes.

CiteSeer. The CiteSeer dataset contains 3,264 publications from six categories in the area of computer science. The citation graph consists of 3,264 nodes, 4,536 edges, 3,703 attributes and 6 classes.

Pubmed. The Pubmed dataset contains 19,717 publications which are related to diabetes and categorized into 3 classes. The citation graph consists of 19,717 nodes, 44,324 edges, 500 attributes and 3 classes.
For each graph, documents are represented as nodes, and undirected links between documents represent citation relationships. If node attributes are used, bag-of-words representations serve as attribute vectors for each document.
We split the data as suggested in [16], i.e., for labeled data our training sets contain 20 randomly selected instances per class, the test sets consist of 1,000 instances, and the validation sets contain 500 instances for each method. The remaining instances are used as unlabeled data. For comparison we use the prediction micro $F_1$ scores which we collected over 10 different data splits.
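A split of this shape could be generated as sketched below. The function name `split_nodes`, the seeding and the choice to draw test and validation nodes uniformly at random from the remaining nodes are assumptions; the exact sampling procedure of [16] is not reproduced here.

```python
import numpy as np

def split_nodes(labels, per_class=20, n_val=500, n_test=1000, seed=0):
    """Split node indices into training, validation, test and unlabeled sets.

    labels: 1-D array with one class index per node (multiclass setting).
    Raises ValueError if a class has fewer than `per_class` members.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Draw the same number of training nodes from every class.
    train = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in np.unique(labels)
    ])
    # Shuffle the remaining nodes and cut them into test, validation and unlabeled parts.
    rest = rng.permutation(np.setdiff1d(np.arange(labels.size), train))
    test = rest[:n_test]
    val = rest[n_test:n_test + n_val]
    unlabeled = rest[n_test + n_val:]
    return train, val, test, unlabeled

# Toy labels (illustrative only): 3,000 nodes in three equally sized classes.
toy_labels = np.arange(3000) % 3
train, val, test, unlabeled = split_nodes(toy_labels)
```

Calling the function with ten different seeds would give ten different splits of the kind used for the reported scores.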
Since the numbers of iterations for sampling the graph contexts and the label contexts for Planetoid are suggested only for the CiteSeer dataset, we adapted these values relative to the number of nodes for each graph. For node2vec, we perform grid searches over the hyperparameters $p$ and $q$ and use window size 10 as proposed by the authors. For all models except Planetoid, unless otherwise noted, we use one hidden layer with 16 neurons and regularization, learning rate and training procedure as in [20]. For our model, we compute the APPR vectors for each node using several values of the teleportation parameter $\alpha$ and a fixed approximation threshold $\epsilon$.
We present results computed on the test sets for the best performing hyperparameters. The best performing hyperparameters for all models are determined by using the validation sets.
5.1.2 Results
Figure 2 shows boxplots depicting the micro $F_1$ scores we achieved for the multiclass prediction task for each considered model on the Cora, CiteSeer and Pubmed networks.
The baseline approach GCN_only_L, i.e., the one-layer GCN model which only uses the label distributions of the neighboring nodes to predict a node’s label, shows the worst results among the considered models. However, these scores still suggest that labels can help in learning “good” representations. The baseline method which considers the corresponding rows of the adjacency matrix as node representations, i.e., Adj, achieves slightly better results for all three datasets. For the GCN and Planetoid models that do not use attribute information, i.e., noFeat GCN and Planetoid-G, respectively, the micro $F_1$ values are slightly lower than the ones achieved by DeepWalk and node2vec. Our model improves on the results produced by node2vec, which means that the label distributions are indeed a useful source of information, although the baseline GCN_only_L shows rather poor results, especially for Pubmed. A likely reason is that this baseline only considers the label distribution of a very local neighborhood (in fact, only one-hop neighbors). Collecting the label distribution from a larger neighborhood gives a significant boost in terms of prediction accuracy. Indeed, the best results for the LD approach are reached for small values of the teleportation parameter $\alpha$, which correspond to a rather extensive neighborhood exploration.
5.2 Multilabel Classification
5.2.1 Experimental Setup
We also perform multilabel node classifications on the following two multilabel networks:

BlogCatalog [43]. This is a social network graph where each of the 10,312 nodes corresponds to a user and the 333,983 edges represent the friendship relationships between bloggers. 39 different interest groups provide the labels.

IMDb Germany. This dataset is taken from [4]. It consists of 32,732 nodes, 1,175,364 edges and 27 labels. Each node represents an actor/actress who played in a German movie. Edges connect actors/actresses that were in a cast together and the node labels represent the genres that the corresponding actor/actress played.
Since the fraction of positive instances is relatively small for most of the classes, we use weighted cross-entropy as the loss function, so that the loss caused by erroneously classified positive instances is weighted higher. We use weight 10 in all our experiments. For the same reason we report micro $F_1$ and macro $F_1$ scores to measure the quality of the considered methods. We compare our model to the featureless models that we already used for the multiclass experiments. To adapt the Planetoid-G implementation for multilabel classification, we use a sigmoid activation function at the output layer and also slightly changed the embedding learning step: entities that are used as context and have the same labels as the node itself are sampled from all classes to which the node belongs.
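A weighted cross-entropy of this kind can be sketched as follows. The exact formulation used in the experiments is not spelled out in the text, so the element-wise binary form, the function name `weighted_bce` and the clipping constant are assumptions; only the positive weight of 10 is taken from the description above.

```python
import numpy as np

def weighted_bce(probs, targets, pos_weight=10.0, tiny=1e-12):
    """Mean binary cross-entropy in which positive targets are up-weighted.

    probs:   predicted probabilities (e.g. sigmoid outputs), any shape.
    targets: binary ground truth of the same shape.
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), tiny, 1.0 - tiny)
    targets = np.asarray(targets, dtype=np.float64)
    loss = -(pos_weight * targets * np.log(probs)
             + (1.0 - targets) * np.log(1.0 - probs))
    return loss.mean()

# The same uncertain prediction costs ten times more on a positive instance.
loss_pos = weighted_bce([0.5], [1])
loss_neg = weighted_bce([0.5], [0])
```

Up-weighting positives pushes the classifier towards higher recall on rare labels, which matters for the macro $F_1$ score since it averages over labels regardless of their frequency.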
We split the data into training, validation and test sets so that 70% of all nodes were used for training, 10% for validation and 20% for testing the model. Note that we could not use stratified sampling splits for these experiments since we optimize for all classes simultaneously instead of using one-vs-rest classifiers (this is why our results for node2vec and DeepWalk on the BlogCatalog network are slightly worse than those reported in [2]). The hyperparameter setting is as described above. For this set of experiments we ran each model, except for Planetoid-G, 10 times on five different data splits. Due to the long runtime of Planetoid-G we trained this model only three times on two data splits.
5.2.2 Results
The results for the BlogCatalog graph are shown in Figure 3. For this network, only using the label information from the direct neighborhood of a node is not useful to infer its labels, cf. GCN_only_L. However, incorporating the label distribution of somewhat larger neighborhoods as in our model (again, we also use the APPR matrix calculated for small values of $\alpha$ to determine the label distribution in neighborhoods that span more than one-hop neighbors) seems to improve the results for the prediction task significantly. In fact, our model achieves similar, but slightly worse performance than node2vec and DeepWalk. Given these results, we also combined the node embeddings based on local label distributions with embeddings that capture structural properties. To capture the structural properties we select a very simple approach: we multiply an embedding matrix with the preprocessed adjacency matrix as in Kipf et al. [20]. The embedding matrix is randomly initialized. Note that the structural similarity is defined via direct neighbors. The resulting representation is concatenated with the hidden layer of the LD model, and the rest of the LD model remains the same. The embedding weights are learned jointly with the rest of the model. The scores for the resulting model, denoted as LD+EMB, show that this combination further improves the outcome of the prediction.
For the IMDb Germany network, for which the results can be seen in Figure 4, the labels of even very local neighborhoods are already very expressive. Recalling how this network is constructed, we can expect the latter fact and also the superior performance of our model over the two random walk based methods. Particularly noteworthy for this network is the gain of accuracy that the combination of information from both sources, label distribution and structural properties, achieves.
5.3 Comparison to Attribute-Based Methods
To show the power of incorporating label information into the generation process for node embeddings, we also compare our model against the following state-of-the-art attribute-based methods:

Feat: a baseline approach which predicts node labels only based on the node attributes without considering the underlying graph structure (borrowed from [16])

GCN: the standard 2-layer GCN as published in [20]

Planetoid-T: the semi-supervised Planetoid framework which uses attribute information as proposed in [16]
For this set of experiments, we again perform multiclass prediction on the three benchmark text classification datasets and report the prediction accuracy in terms of micro $F_1$ scores to measure the quality of the retrieved node representations. Note that, in contrast to the competitors, our model still does not make use of the node attribute information. The results are depicted in Figure 5 and show that our model is competitive with the attribute-based methods and hence is a powerful alternative in cases where no node attributes are present.
5.4 Impact of the Teleportation Parameter
Figure 6 depicts the micro-F1 scores achieved for different values of the teleportation parameter on the three benchmark datasets. As can be seen, particularly for the Pubmed network, the model is quite sensitive to the choice of this parameter. Recall that the teleportation parameter determines how far the neighborhood of each node is taken into consideration when computing its label distribution. Therefore, it might make sense to set the parameter to a small value so that more labels are collected, which in turn leads to a more accurate estimate of the local label distribution. On the other hand, this may not hold in every scenario, for instance if the distribution of classes is heterogeneous, i.e., some classes may only appear in areas of the graph where classes are concentrated locally, while other classes may appear in areas where many classes are mixed even within local neighborhoods. An interesting direction for future work is therefore to optimize for a “good” parameter value in a data-driven manner. This may be done either by predefining a set of candidate values and searching for the best of these, or by optimizing the value during the learning procedure. Also, the underlying task, e.g., node classification, may benefit from finding “good” values for each node individually rather than relying on a global solution.
6 Conclusion
In this paper, we have introduced a novel label-based approach for semi-supervised node classification. In particular, our method aggregates labels from local neighborhoods using APPR. Most existing approaches consider nodes to be similar if they are closely related in the graph. Methods for attributed graphs additionally take attributes of the neighboring nodes into account. In contrast, our method can relate nodes even if they are not close-by in the graph and makes more effective use of the labels provided for training to improve the classification quality for graphs with and without node attributes. It is further applicable to nodes unseen during training. The results of our experiments on various real-world datasets demonstrate that local label distributions are able to significantly improve classification results in the multi-class and multi-label setting. Our model is even competitive with state-of-the-art models which take node attributes into consideration. In a first experiment on multi-label datasets, we were already able to significantly boost the performance by using a simple combination of our model with node embeddings.
For future work, we plan to address the problem of selecting a suitable teleportation parameter. The parameter controls the extent of the considered local neighborhood and often has a significant impact on the prediction quality. Performing a grid search to determine a good parameter value is a time-consuming task. Furthermore, different teleportation parameters might yield the best results for different classes.
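The effect of the teleportation parameter on the considered neighborhood can be illustrated with a small sketch. It rests on several assumptions that are not taken from the paper: personalized PageRank is computed exactly by power iteration instead of the approximate push procedure behind APPR, the label distribution is simply the PageRank-weighted sum of known labels, and all names and the toy graph are hypothetical.

```python
def personalized_pagerank(adj_list, seed, alpha, iters=100):
    """Power iteration for p = alpha * e_seed + (1 - alpha) * p W (W: random-walk matrix)."""
    n = len(adj_list)
    p = [0.0] * n
    p[seed] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[seed] += alpha  # teleport back to the seed node
        for u, nbrs in enumerate(adj_list):
            if not nbrs:
                # Dangling node: return its walk mass to the seed.
                nxt[seed] += (1.0 - alpha) * p[u]
                continue
            share = (1.0 - alpha) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        p = nxt
    return p


def label_distribution(adj_list, labels, seed, alpha, num_classes):
    """PageRank-weighted distribution over the known labels of nodes other than the seed."""
    ppr = personalized_pagerank(adj_list, seed, alpha)
    dist = [0.0] * num_classes
    for node, label in labels.items():
        if node != seed:
            dist[label] += ppr[node]
    total = sum(dist)
    return [d / total for d in dist] if total else dist


# Path graph 0 - 1 - 2 - 3 - 4 - 5; node 1 has class 0, node 5 has class 1.
path = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
known = {1: 0, 5: 1}

# Sweep over a predefined set of candidate values for seed node 2.
for alpha in (0.1, 0.5, 0.9):
    print(alpha, label_distribution(path, known, seed=2, alpha=alpha, num_classes=2))
```

In this toy setting, a smaller teleportation parameter lets the walk reach the distant node 5 more often, so its class receives a larger share of the distribution; a data-driven selection would compare such candidate values by validation performance.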
We also aim to further improve the prediction accuracy by investigating how to effectively combine label-based features with other kinds of features, such as node attributes, edge attributes or node embeddings, in a semi-supervised model. Our approach could also be extended to solve additional graph learning tasks, such as link prediction or the identification of nodes with unexpected labels for detecting labeling errors or outlier nodes.
References
 [1] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proc. of ACM SIGKDD, 2014, pp. 701–710.
 [2] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proc. of ACM SIGKDD, 2016, pp. 855–864.
 [3] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proc. of WWW. ACM, 2015, pp. 1067–1077.
 [4] E. Faerman, F. Borutta, K. Fountoulakis, and M. W. Mahoney, “Lasagne: Locality and structure aware graph node embedding,” arXiv preprint arXiv:1710.06520, 2017.
 [5] S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations with global structural information,” in Proc. of CIKM. ACM, 2015, pp. 891–900.
 [6] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proc. of ACM SIGKDD. ACM, 2016, pp. 1225–1234.
 [7] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo, “struc2vec: Learning node representations from structural identity,” in Proc. of ACM SIGKDD. ACM, 2017, pp. 385–394.
 [8] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017, pp. 1025–1035.
 [9] A. Bojchevski and S. Günnemann, “Deep gaussian embedding of attributed graphs: Unsupervised inductive learning via ranking,” arXiv preprint arXiv:1707.03815, 2017.
 [10] T. N. Kipf and M. Welling, “Variational graph autoencoders,” arXiv preprint arXiv:1611.07308, 2016.
 [11] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather: Homophily in social networks,” Annual review of sociology, vol. 27, no. 1, pp. 415–444, 2001.
 [12] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proc. of ICML, 2003, pp. 912–919.
 [13] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in neural information processing systems, 2004, pp. 321–328.

 [14] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” Journal of machine learning research, vol. 7, no. Nov, pp. 2399–2434, 2006.
 [15] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.
 [16] Z. Yang, W. Cohen, and R. Salakhudinov, “Revisiting semi-supervised learning with graph embeddings,” in Proc. of ICDM, 2016, pp. 40–48.
 [17] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” CoRR, vol. abs/1312.6203, 2013. [Online]. Available: http://arxiv.org/abs/1312.6203
 [18] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.

 [19] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., 2016, pp. 3844–3852.
 [20] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [21] J. Atwood and D. Towsley, “Search-convolutional neural networks,” CoRR, vol. abs/1511.02136, 2015. [Online]. Available: http://arxiv.org/abs/1511.02136
 [22] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model cnns,” CoRR, vol. abs/1611.08402, 2016. [Online]. Available: http://arxiv.org/abs/1611.08402
 [23] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets: Graph convolutional neural networks with complex rational spectral filters,” CoRR, vol. abs/1705.07664, 2017.
 [24] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in Proc. CVPR, 2017.
 [25] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
 [26] J. Shun, F. Roosta-Khorasani, K. Fountoulakis, and M. W. Mahoney, “Parallel local graph clustering,” Proc. VLDB Endow., vol. 9, no. 12, pp. 1041–1052, Aug. 2016.
 [27] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
 [28] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” in ICLR, 2016.
 [29] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” arXiv preprint arXiv:1704.01212, 2017.
 [30] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: bringing order to the web.” 1999.
 [31] G. Jeh and J. Widom, “Scaling personalized web search,” in Proc. of the 12th WWW. ACM, 2003, pp. 271–279.
 [32] P. Berkhin, “Bookmark-coloring algorithm for personalized pagerank computing,” Internet Mathematics, vol. 3, no. 1, pp. 41–62, 2006.
 [33] R. Andersen, F. Chung, and K. Lang, “Local graph partitioning using pagerank vectors,” in Proc. of IEEE FOCS. IEEE, 2006, pp. 475–486.
 [34] S. A. Al-Sayouri, P. Devineni, S. S. Lam, E. E. Papalexakis, and D. Koutra, “Gecs: Graph embedding using connection subgraphs,” 2016.
 [35] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi, “Watch your step: Learning graph embeddings through attention,” arXiv preprint arXiv:1710.09599, 2017.
 [36] B. P. Chamberlain, J. Clough, and M. P. Deisenroth, “Neural embeddings of graphs in hyperbolic space,” arXiv preprint arXiv:1705.10359, 2017.
 [37] H. Chen, B. Perozzi, Y. Hu, and S. Skiena, “Harp: Hierarchical representation learning for networks,” arXiv preprint arXiv:1706.07845, 2017.
 [38] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI magazine, vol. 29, no. 3, p. 93, 2008.
 [39] S. Nandanwar and M. N. Murty, “Structural neighborhood based classification of nodes in a network,” in Proc. of ACM SIGKDD, 2016, pp. 1085–1094.
 [40] W. Ye, L. Zhou, D. Mautz, C. Plant, and C. Böhm, “Learning from labeled and unlabeled vertices in networks,” in Proc. of ACM SIGKDD. ACM, 2017, pp. 1265–1274.

 [41] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
 [42] G. Namata, B. London, L. Getoor, B. Huang, and U. EDU, “Query-driven active surveying for collective classification,” in 10th International Workshop on Mining and Learning with Graphs, 2012.
 [43] L. Tang and H. Liu, “Relational learning via latent social dimensions,” in Proc. of ACM SIGKDD. ACM, 2009, pp. 817–826.