Semi-Supervised Learning on Graphs Based on Local Label Distributions

02/15/2018 ∙ by Evgeniy Faerman, et al.

In this work, we propose a novel approach for semi-supervised node classification. Specifically, we propose a method which takes the labels in local neighborhoods of different locality levels into consideration. Most previous approaches that tackle the problem of node classification consider nodes to be similar if they have shared neighbors or are close to each other in the graph. Recent methods for attributed graphs additionally take the attributes of neighboring nodes into account. We argue that the labels of the neighbors bear important information and that considering them helps to improve classification quality. Two nodes which are similar based on the labels in their neighborhoods do not need to lie close together in the graph and may even belong to different connected components. Considering labels can improve node classification for graphs with and without node attributes. However, as we will show, existing methods cannot be adapted to consider the labels of neighboring nodes in a straightforward fashion. Therefore, we propose a new method to learn label-based node embeddings which can mirror a variety of relations between the class labels of neighboring nodes. Furthermore, we propose several network architectures which combine multiple representations of the label distribution in the neighborhood at different locality levels. Our experimental evaluation demonstrates that our new methods can significantly improve the prediction quality on real-world data sets.

1 Introduction

Graphs are a very general way to represent structured data: a set of entities with given pairwise relationships between them can be modeled as a graph $G = (V, E)$ with a corresponding node set $V$ and an edge set $E \subseteq V \times V$. Real-world examples of graph-structured data are abundant and include social networks, co-citation networks and biological networks.

In addition to the graph structure, further attribute information may be provided for the entities described by the graph nodes. In an attributed graph, each node $v \in V$ is associated with an attribute vector $x_v \in \mathbb{R}^d$. For instance, social network users might be enriched with personal information, or documents in a co-citation network might be described by bag-of-words vectors. The increasing relevance of graph-structured data has been accompanied by an increased interest in learning algorithms which can leverage the underlying graph structure to make accurate predictions for the modeled entities.

An important semi-supervised learning task on graphs is node classification, where each node can be associated with a set of class labels (simply referred to as labels in the following), represented by a label vector $y_v \in \{0,1\}^{\ell}$, where $\ell$ is the number of possible labels. Given a set of already labeled nodes in a graph, the goal is to predict likely labels for the unlabeled nodes. The task is semi-supervised in the sense that connectivity information about the whole graph is available and at least some of the class labels are already known. In the case of attributed graphs, the attributes of all nodes can additionally be used for prediction, including those of the unlabeled nodes. Important applications include recommendation in social networks, where the node labels represent user interests, and document classification in co-citation networks, where the node labels indicate associated fields of research.

Approaches for node classification on graphs may employ additional node attributes or operate on the graph structure alone. We will refer to these approaches as attribute-based and connectivity-based approaches, respectively. Among the most successful connectivity-based methods are node embedding techniques [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. An underlying assumption of these techniques is that nodes which are closely connected in the graph should have similar labels, which is commonly referred to as homophily [11]. Our method does not rely on the homophily assumption, but is still able to relate close-by nodes. Furthermore, unlike most node embedding techniques, our new approach can be used to classify nodes unseen during training. In the attribute-based setting, the graph structure can be incorporated in different ways, for instance by using regularization [12, 13, 14, 15], combining attributes with node embeddings [16], or aggregating attributes over local neighborhoods [17, 18, 19, 20, 21, 22, 23, 24, 25]. While regularization-based methods rely on the homophily assumption and most of them are not able to classify instances unseen during training, all other methods focus on node attributes. In addition to connectivity and node attributes, the labels available during training provide further valuable information that is in general complementary to connectivity and attribute features and useful to improve classification. In general learning tasks on independent and identically distributed (iid) data, a label merely indicates that an observation is sampled from a particular distribution. In a graph, however, the data is non-iid, and thus the labels of connected objects allow for a novel use of label information which has not been exploited before for learning graph embeddings.

In this paper, we propose a label-based approach to learn a node embedding which allows for more accurate node classification. The main idea of our approach is that there often exists a correlation between the labels of a node and the distribution of labels in its local neighborhood. Thus, considering the local label distribution when computing a node embedding can exploit this correlation to improve the descriptiveness of the learned embedding. In Figure 1, we illustrate a typical case in which the label of a node is determined by the labels of neighboring nodes and not by node attributes or connectivity. As an additional example, the function of a protein can be expected to correlate strongly with the functions of interacting proteins. As mentioned above, we assume that the labels of at least some of the neighboring nodes are known for each new node with unknown labels. In the majority of applications this is realistic, because new nodes usually connect to already known parts of the network. For instance, new papers usually cite established articles, and new members of a social network will usually already know multiple friends in the network to connect to.

Figure 1: Consider a communication network with nodes labeled according to their device type (user, server, database, printer). Assume the labels for the database and printer nodes in the right connected component are unknown while the remaining labels are provided. Further node attributes are not given. We can observe that the roles of printers and databases are clearly defined by the labels of their neighboring nodes, e.g., printers are not connected to server nodes. Homophily-based methods would fail to classify these nodes correctly, since their labels differ from their neighbors. Further, connectivity alone does not explain the roles, since for instance the printer and database in the left part of the graph have the same degree and even their neighbors have the same degrees.

Though labels can be considered as another type of node attribute, there exists an important difference between labels and attributes which prevents attribute-based embeddings from generalizing well on label information. While the attribute values of the node to be predicted may be used for learning the embedding, using the node's own labels, even in a transitive way, leads to overfitting and bad generalization performance of the learned embedding. We will discuss these issues in more detail in Section 3 and introduce a simple baseline method. In our new method, we aggregate labels from relevant nodes directly and thus can completely exclude any influence of a node's own labels. In a first step, we determine the relevant neighbors of a given node based on Approximate Personalized PageRank (APPR). Since this might be an expensive task for large graphs, we use an adaptation of the highly efficient algorithm from [26]. After determining the neighborhood, we compute the label distribution within the neighborhood and classify the node based on this novel representation. We compare our new representation to state-of-the-art graph embeddings on several benchmark datasets for node classification.

The remainder of the paper is structured as follows: After providing a formal problem definition for our approach in Section 2, we introduce our new method in Section 3, starting with a discussion on the possibility of incorporating label-based features into existing models in Section 3.1. After a discussion of related work in Section 4, the performance of our model is evaluated experimentally and compared to state-of-the-art methods in Section 5. Finally, Section 6 concludes the paper and proposes directions for future work.

2 Problem Setting

We consider (possibly directed) graphs $G = (V, E)$, with node set $V = \{v_1, \dots, v_n\}$ and edge set $E \subseteq V \times V$. A graph can be represented by an adjacency matrix $A \in \mathbb{R}^{n \times n}$, where $a_{ij}$ denotes the weight of the edge $(v_i, v_j)$. In case of an unweighted graph, $a_{ij} = 1$ indicates the existence and $a_{ij} = 0$ the absence of an edge between $v_i$ and $v_j$. Furthermore, we do not allow self-links, i.e., $a_{ii} = 0$ for all nodes $v_i$. In an attributed graph, additional node attributes are provided in the form of an attribute vector $x_v \in \mathbb{R}^d$ for each node $v \in V$. The attribute information for the whole graph can be represented by an attribute matrix $X \in \mathbb{R}^{n \times d}$, where the $i$th row of $X$ corresponds to $v_i$'s attribute vector $x_{v_i}$. Let us note that an important difference between attributes and labels is that attributes are usually known for all nodes, in particular for those nodes without known labels.

Our problem setting is semi-supervised node classification, where the node set is partitioned into a set of labeled nodes $V_L$ and a set of unlabeled nodes $V_U$, such that $V = V_L \cup V_U$ and $V_L \cap V_U = \emptyset$. Thereby, each node $v \in V_L$ is associated with a label vector $y_v \in \{0,1\}^{\ell}$, where $\ell$ is the number of possible labels and an entry of one indicates the presence of the corresponding label for a certain node. The labels available for training can be represented by an $n \times \ell$ label matrix $Y \in \{0,1\}^{n \times \ell}$, where the $i$th row of $Y$ corresponds to the label vector of $v_i$ if $v_i \in V_L$. For unlabeled nodes, we assign constant zero vectors. The task is now to train a classifier using $A$, $Y$, and possibly $X$, which accurately predicts the label vector $y_v$ for each $v \in V_U$. In multi-class classification, each node is assigned to exactly one class, such that $y_v$ is the $c$th unit vector if $v$ is assigned to class $c$. Multi-label classification denotes the general case, in which each node may be assigned to one or more classes and the goal is to predict all labels assigned to a particular node.
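To make the notation concrete, the following toy example (hypothetical data, not one of the benchmark graphs used later) constructs $A$ and $Y$ for a small unweighted graph with one unlabeled node:

import numpy as np

# Undirected, unweighted graph with n = 4 nodes and l = 2 possible labels.
n, l = 4, 2
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3)]:   # edge set E
    A[i, j] = A[j, i] = 1               # a_ij = 1, no self-links (a_ii = 0)

Y = np.zeros((n, l))
Y[0, 0] = 1   # node 0 carries label 0
Y[1, 1] = 1   # node 1 carries label 1
Y[2, 0] = 1   # node 2 carries label 0
# node 3 is unlabeled: its row stays the constant zero vector

V_L, V_U = [0, 1, 2], [3]   # labeled and unlabeled node sets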

3 Semi-supervised learning on graphs based on local label distribution

3.1 Labels as Attributes

The main idea of our approach is to learn a more descriptive node representation by incorporating the known labels in the neighborhood of a node. In the following, we show why existing methods are not suitable to consider this information. Methods relying on neighborhood similarity [1, 2, 5, 3, 6, 4, 9] learn representations in an unsupervised manner and thus rely only on the topology of the graph, not on attributes or labels. The Planetoid-T model [16] considers labels by partly enforcing the similarity between members of the same class; therefore, nodes are related to each other based only on their own labels.

Graph Neural Networks [27, 28] and Graph Convolution Networks (GCN) [17, 18, 19, 20, 21, 22, 23, 24, 25] are special cases of the Message Passing Neural Network (MPNN) [29], a framework describing a family of neural network based models for attributed graphs. All MPNN methods have in common that they use some differentiable function to iteratively compute messages for each node which are passed to all its neighbors. These messages form the input to a differentiable update function which computes new node representations:

$$m_v^{t+1} = \sum_{u \in N(v)} M_t\left(h_v^t, h_u^t\right), \qquad h_v^{t+1} = U_t\left(h_v^t, m_v^{t+1}\right), \qquad h_v^0 = x_v$$

Here, $t$ denotes the current iteration, $h_v^t$ is the representation of node $v$ in iteration $t$, and the vector $x_v$ corresponds to the input features of node $v$. $N(v)$ denotes the set of direct neighbors of node $v$, $m_v^{t+1}$ is the message, $M_t$ the message function, and $U_t$ the update function of iteration $t$. The obvious way to integrate neighborhood label information into an MPNN-based prediction model is to include the label information in the messages sent to the neighbors in the first iteration. However, even after removing self-links, each node would receive information about its own labels already in the second iteration during the training stage. Thus, models learned on these representations overfit on the nodes' own labels and do not generalize well in the inference step, where the node labels are unknown. The same applies to directed graphs with cycles. Therefore, applying MPNN models to communicate neighboring labels is restricted to one iteration only. We use a corresponding model as a baseline for our experiments.
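To make this restriction concrete, the following is a minimal sketch of such a one-iteration label-message baseline in the spirit of our GCN_only_L model (the exact architecture used in the experiments is described in Section 5; the normalization shown here is an assumption):

import numpy as np

def one_hop_label_features(A, Y):
    # One message passing iteration on the label matrix: neighbors' labels
    # are aggregated over an adjacency matrix without self-links. A second
    # iteration would leak each node's own labels back to it.
    A = A.copy()
    np.fill_diagonal(A, 0)                       # remove self-links
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
    return (A @ Y) / deg                         # normalized neighbor label counts

# H = one_hop_label_features(A, Y) is then the input to a dense softmax
# layer trained only on the labeled nodes V_L.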

Note that this problem does not apply to label propagation algorithms [13, 12], since they do not have separate training and inference steps. However, these methods have to be recomputed for predicting new objects, which limits their usability.

3.2 General Approach

To present our method for semi-supervised learning on graphs using local label distributions, we first outline an efficient algorithm for computing node neighborhoods based on Approximated Personalized PageRank (APPR). Afterwards, we describe how to create node representations based on the label distribution in these local neighborhoods. Finally, the node representations can be used as feature descriptors in arbitrary classification models.

The Personalized PageRank (PPR) corresponds to the PageRank algorithm [30], where the probabilities in the starting vector $s$ are biased towards some set of nodes $S$. The result is the "importance" of all nodes in the graph from the viewpoint of the nodes in $S$.

The push algorithm described in [31] and [32] is an efficient way to compute an approximation of the Personalized PageRank (APPR) vector if the start distribution vector is sparse. The idea behind the push algorithm is to only consider a node in the local neighborhood if the probability of visiting it is significantly larger than the probability of visiting any other node from the rest of the graph. This leads to a sparse solution, meaning that only relatively few nodes of the underlying graph are contained in the resulting APPR vector.

Algorithm 1 describes the computation of APPR using a variant of the push operation on lazy random walk transition matrices of undirected, unweighted graphs. This algorithm was proposed in [33], where APPR is used to partition graphs. We describe an adapted version from [26] which converges faster. The algorithm maintains two vectors: the solution vector $p$ and a residual vector $r$. The vector $p$ is the current approximation of the PPR vector and the vector $r$ contains the approximation error, i.e., the not yet distributed probability mass. $p(u)$ and $r(u)$ are the entries in $p$ and $r$ corresponding to node $u$, and $d(u)$ is the degree of node $u$. In each iteration, the algorithm selects a node with sufficient probability mass in the vector $r$. This probability mass is spread between the node's entry in $p$ and the entries of its direct neighbors in $r$. In each step, the exact PPR vector is the linear combination of the current solution vector and the PPR solution for the residual, i.e., $pr(\alpha, s) = p + pr(\alpha, r)$. This procedure can also be trivially adapted to directed graphs and graphs with weighted edges. Moreover, the algorithm requires two parameters: $\alpha$, the teleportation parameter, which determines the level of locality for each node neighborhood, and $\epsilon$, an approximation threshold which controls the approximation quality and the runtime. In fact, the push algorithm performs updates as long as there is a node for which at least $\epsilon$ probability mass would be moved towards each of its neighbors. The complexity of the procedure is $O(\frac{1}{\epsilon\alpha})$.

Input:  starting vector $s$, teleportation probability $\alpha$, approximation threshold $\epsilon$
Output:  APPR vector $p$
1:  $p = \vec{0}$, $r = s$
2:  while $r(u) \geq \epsilon \, d(u)$ for some vertex $u$ do
3:     pick any $u$ where $r(u) \geq \epsilon \, d(u)$
4:     push($u$)
5:  end while
6:  return $p$
Algorithm 1: ApproximatePPR

1:  $p(u) = p(u) + \frac{2\alpha}{1+\alpha} \, r(u)$
2:  for all $v$ with $(u, v) \in E$ do
3:     $r(v) = r(v) + \frac{1-\alpha}{1+\alpha} \cdot \frac{r(u)}{d(u)}$
4:  end for
5:  $r(u) = 0$
Algorithm 2: push($u$)
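For concreteness, the following is a compact Python sketch of Algorithms 1 and 2 for an undirected, unweighted graph given as an adjacency-list dictionary; function names and the data layout are our own illustrative choices:

def approximate_ppr(neighbors, s, alpha, eps):
    p = {}                                  # sparse solution vector
    r = dict(s)                             # residual vector, initialized to s
    queue = [u for u in r if r[u] >= eps * len(neighbors[u])]
    while queue:
        u = queue.pop()
        if r.get(u, 0.0) < eps * len(neighbors[u]):
            continue                        # residual already drained
        mass = r[u]
        p[u] = p.get(u, 0.0) + (2 * alpha / (1 + alpha)) * mass  # keep share
        r[u] = 0.0
        spread = ((1 - alpha) / (1 + alpha)) * mass / len(neighbors[u])
        for v in neighbors[u]:              # push the rest to the neighbors
            r[v] = r.get(v, 0.0) + spread
            if r[v] >= eps * len(neighbors[v]):
                queue.append(v)
    return p

# Example: APPR vector of node 0 in the path graph 0-1-2-3.
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(approximate_ppr(nbrs, {0: 1.0}, alpha=0.1, eps=1e-4))

Because each push distributes the entire residual mass of the selected node at once, the result stays sparse: only nodes whose residual ever exceeds $\epsilon \, d(u)$ are touched.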

3.2.1 Local Label Distribution

In our approach, we first compute the APPR vector for each node. Before APPR is computed for node $v_i$, the corresponding entry in the starting vector $s$ is set to one and all other entries to zero. Therefore, the APPR vector of $v_i$ describes the importance of the local neighbors from its point of view only.

In the APPR result matrix $\Pi \in \mathbb{R}^{n \times n}$, each row corresponds to the APPR vector of the corresponding node. The local label distribution representation $F$ is computed by multiplying $\Pi$ with the label matrix $Y$, i.e., $F = \Pi \cdot Y$. The diagonal of $\Pi$ is set to zero beforehand to exclude information about a node's own labels. Therefore, the entry $f_{ij}$ can be interpreted as the probability that a random walk starting from node $v_i$ stops at a neighbor with label $j$.

The local label distribution can be used as a node embedding vector which can be passed into an arbitrary classification algorithm. In our experiments, we employ a multi-layer perceptron.
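As a minimal sketch (assuming the APPR vectors have been stacked row-wise into a dense matrix; in practice sparse representations would be used), the representation can be computed as follows:

import numpy as np

def local_label_distribution(appr_matrix, Y):
    P = appr_matrix.copy()
    np.fill_diagonal(P, 0.0)   # exclude every node's own labels
    return P @ Y               # row i: label mass in the neighborhood of node i

The rows of the result are then fed into the classifier, e.g., a small multi-layer perceptron trained on the labeled nodes.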

4 Related Work

Numerous approaches for semi-supervised learning on graphs have been proposed in recent years. They can be sorted into two main categories: unsupervised node embedding techniques and semi-supervised techniques.

4.1 Unsupervised Node Embedding

A strong focus of recent developments related to learning from structural relationships has been placed on learning node embeddings, where a latent vector representation is learned for each node, reflecting its connectivity in the underlying graph. The learned node embeddings can be used as input to a subsequent down-stream task, such as node classification. Random walk based methods [1, 2] sample a number of random walks from the graph, and nodes are related if they have common neighbors. LINE [3] is another variant, which considers direct first- and second-order proximities instead of random walks. Graph2Gauss [9] learns similarities to hop neighborhoods and embeds each node as a Gaussian distribution to allow for uncertainty in the representation.

GECS [34] uses connection subgraphs to determine appropriate node neighborhoods. More closely related to our approach, LASAGNE [4] relies on APPR to determine relevant context nodes. Other works perform matrix factorization. For instance, GraRep [5] factorizes a sequence of $k$-step log-probability matrices with SVD and concatenates the resulting low-dimensional node representations to form the final representations. Abu-El-Haija et al. propose factorizing a random-walk co-occurrence matrix with different approaches to determine the context window size distribution [35]. SDNE [6] uses a multi-layer auto-encoder model to capture non-linear structures based on direct first- and second-order proximities. The authors of [36] propose embeddings in hyperbolic space. HARP [37] addresses the local minima problem and introduces an iterative scheme for learning node representations which can be used with different embedding learning methods: an input graph is coarsened at different levels, node representations are learned starting with the coarsest graph, and the learned embeddings are provided as initializations for the embeddings of the subsequent finer graphs. While the above methods rely on the homophily assumption, struc2vec [7] aims at learning representations which relate structurally similar nodes instead of nodes which are close in the graph. It does so by using degree sequences in neighborhoods of different sizes. All of the above approaches are transductive in the sense that labels can only be predicted for unlabeled nodes already observed at training time. The GraphSAGE [8] framework introduces inductive node embeddings. The basic idea is to learn an embedding function by sampling and aggregating node attributes in local neighborhoods. The embedding function can further be learned with a supervised loss function. Inductive models are also obtained by considering node attributes.

Variational Graph Auto-Encoders [10] learn node representations using a variational auto-encoder, where the encoder is a two-layer GCN. The model can be applied to attributed and non-attributed graphs.

4.2 Semi-Supervised Learning on Graphs

Compared to separately optimizing steps in a semi-supervised learning pipeline, as is the case for semi-supervised learning with pre-trained node embeddings, end-to-end training usually leads to better performance on the supervised learning objective.

One important direction is Laplacian Regularization, where the prediction loss is augmented with an unsupervised loss function based on the graph's Laplacian matrix, encoding the homophily assumption that close-by nodes should have the same label. Related approaches include Manifold Regularization [14], a kernel-based method, and Deep Semi-Supervised Embedding [15], which incorporates node embeddings by augmenting neural network models with an embedding layer. Both of these methods generalize to attributed graphs. Label Propagation [12, 13] is more closely related to our approach. The main idea is to propagate the observed labels through the graph via the random walk transition matrix. Similarly, Collective Classification [38] starts with the observed labels and iteratively classifies nodes based on previously inferred labels in the local neighborhood using a majority vote. However, neither of these two methods considers additional node attributes, and no actual learning is involved. Similarly to collective classification, the more recent methods Structural Neighborhood Based Classification [39] and Weighted-Vote Geometric Neighbor Classification [40] also classify nodes based on labels in local neighborhoods. They achieve this by learning a model which predicts node labels from a feature vector describing the local $k$-neighborhood. Both methods assume an unattributed graph.
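To make the contrast to our method concrete, the following is an illustrative sketch of classical label propagation in the spirit of [12, 13]; the clamped iteration shown is one common variant, not necessarily the exact formulation of those papers:

import numpy as np

def label_propagation(A, Y, labeled, iters=50):
    # Spread observed labels over the random walk transition matrix,
    # clamping the labeled nodes after every step.
    T = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = T @ F
        F[labeled] = Y[labeled]        # clamp observed labels
    return F                           # row-wise label scores for all nodes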

Instead of imposing regularization, Planetoid [16] combines the prediction loss with node embeddings by training a joint model which predicts class labels as well as the graph context for a given node. The graph context is sampled from random walks as well as from the set of nodes with shared labels. This allows Planetoid to relate nodes with similar labels even if they are not close in the graph. Thus, Planetoid does not rely on a strong homophily assumption. In addition to a connectivity-based variant, Planetoid-G, the authors propose two further architectures which incorporate node attributes. The transductive variant Planetoid-T starts with pre-trained embeddings and alternately optimizes the prediction and embedding loss functions. The inductive variant Planetoid-I, on the other hand, predicts the graph context from the node features instead.

Another important direction which has recently gained increasing attention is concerned with generalizing deep neural network architectures to graph-structured domains. As the general approach consists of incorporating graph structure into supervised learning, these models assume an attributed graph. However, they can naturally be applied to non-attributed graphs by using the identity matrix as the attribute matrix. The vast majority of neural network based models for semi-supervised learning on graphs can be described within a message-passing framework. In a Message Passing Neural Network (MPNN) [29], each node $v$ has a hidden state which is updated iteratively during training. The initial hidden state of a node corresponds to its attribute vector. In a first step, messages from $v$'s neighborhood are received and aggregated, where a message from neighbor $u$ depends on $u$'s and $v$'s hidden states. In a second step, $v$'s state is updated by combining it with the aggregated messages. An important special case are Graph Convolution Networks [17, 18, 19, 20, 22, 23, 25], which aggregate node attributes over local neighborhoods with spatially localized filters, similar to classical convolutional networks on images [41]. The ChebNet [19] aggregates messages from neighbors using Chebyshev polynomials of the graph's Laplacian matrix. The update function ignores the previous state and applies a non-linear activation. The resulting filters are $K$-localized. The GCN [20] is a simplification of the ChebNet which only considers one-hop neighbors. Messages are aggregated according to a normalized adjacency matrix. In the update phase, the aggregated messages are multiplied with a learned filter matrix and passed through a ReLU activation. For graph convolution networks, the number of message passing iterations corresponds to the number of layers.
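As a minimal illustration, the propagation rule of a single GCN layer [20] can be written in a few lines of numpy; the weight matrix W would be learned by gradient descent:

import numpy as np

def gcn_layer(A, H, W):
    # Aggregate with the symmetrically normalized adjacency matrix
    # (self-loops added), multiply with the filter matrix W, apply ReLU.
    A_hat = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)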

(a) Micro F1 scores for Cora.
(b) Micro F1 scores for CiteSeer.
(c) Micro F1 scores for Pubmed.
Figure 2: Micro F1 scores for the three benchmark data sets.

5 Evaluation

We evaluate our approach by performing node-label prediction and compare the quality against state-of-the-art methods in terms of micro F1 scores for multiclass prediction tasks, and micro F1 and macro F1 scores for multilabel prediction tasks.

For both tasks, we compare our model against the following approaches:

  • Adj: a baseline approach which learns node embeddings only based on the information contained in the adjacency matrix

  • GCN_only_L: a GCN which applies convolution to the label matrix $Y$. We use one convolution layer on the adjacency matrix without self-links, followed by a dense output layer (see Section 3.1 for the explanation why only one convolution layer makes sense)

  • noFeat GCN: the standard 2-layer GCN as published by Kipf et al. [20] without using the node attributes

  • DeepWalk: the DeepWalk model as proposed in [1]

  • node2vec: the node2vec model as proposed in [2]

  • Planetoid-G: the Planetoid variant which does not use attribute information [16]. Unless stated differently, we use for all competitors the parameter settings suggested by the corresponding authors. Except for minor adaptations, e.g., to include label information in the one-layer GCN models or to make the Planetoid models applicable to multilabel prediction tasks, we use the original implementations as published by the corresponding authors.

Our model is denoted as LD (short for Label Distribution). For these experiments, we train a simple feed-forward neural network which takes the label distribution based representations as input and retrieves class probabilities as output.

Note that we omit the comparison to label propagation [12] since Yang et al. already showed that the Planetoid model outperforms this approach [16].

5.1 Multiclass Prediction

5.1.1 Experimental Setup

For the multiclass label prediction task we use the following three text classification benchmark graph datasets [38, 42]:

  • Cora. The Cora dataset contains 2’708 publications from seven categories in the area of ML. The citation graph consists of 2’708 nodes, 5’278 edges, 1’433 attributes and 7 classes.

  • CiteSeer. The CiteSeer dataset contains 3’264 publications from six categories in the area of CS. The citation graph consists of 3’264 nodes, 4’536 edges, 3’703 attributes and 6 classes.

  • Pubmed. The Pubmed dataset contains 19’717 publications which are related to diabetes and categorized into 3 classes. The citation graph consists of 19’717 nodes, 44’324 edges, 500 attributes and 3 classes.

For each graph, documents are denoted as nodes and undirected links between documents represent citation relationships. If node attributes are applied, bag-of-words representations are used as attribute vectors for each document.

We split the data as suggested in [16], i.e., for the labeled data our training sets contain 20 randomly selected instances per class, the test sets consist of 1'000 instances, and the validation sets contain 500 instances for each method. The remaining instances are used as unlabeled data. For comparison, we use the prediction micro F1 scores collected over 10 different data splits.

Since the numbers of iterations for sampling the graph contexts and the label contexts for Planetoid are suggested only for the CiteSeer data set, we adapted these values relative to the number of nodes for each graph. For node2vec, we perform grid searches over the hyperparameters $p$ and $q$ and use window size 10, as proposed by the authors. For all models except Planetoid, unless otherwise noted, we use one hidden layer with 16 neurons and the regularization, learning rate and training procedure from [20]. For our model, we compute the APPR vectors for each node with several values of the teleportation parameter $\alpha$ and a fixed approximation threshold $\epsilon$.

We present results computed on the test sets for the best performing hyperparameters. The best performing hyperparameters for all models are determined by using the validation sets.

5.1.2 Results

Figure 2 shows boxplots depicting the micro F1 scores achieved for the multiclass prediction task by each considered model on the Cora, CiteSeer and Pubmed networks.

The baseline approach GCN_only_L, i.e., the one-layer GCN model which only uses the label distributions of the neighboring nodes to predict a node's label, shows the worst results among the considered models. However, these scores still suggest that labels can improve the task of learning "good" representations. The baseline method which considers the corresponding rows of the adjacency matrix as node representations, i.e., Adj, achieves slightly better results on all three datasets. For the GCN and Planetoid models that do not make use of attribute information, i.e., noFeat GCN and Planetoid-G, the retrieved micro F1 values are slightly lower than the ones achieved by DeepWalk and node2vec. Our model improves on the results produced by node2vec, which means that the label distributions are indeed a useful source of information, even though the baseline GCN_only_L shows, especially for Pubmed, rather poor results. This may be explained by the fact that this model only considers the label distribution of a very local neighborhood (in fact, the one-hop neighbors). However, collecting the label distribution from a more spacious neighborhood gives a significant boost in terms of prediction accuracy. Indeed, the best results for the LD approach are reached for small values of $\alpha$, which correspond to a rather spacious neighborhood exploration.

5.2 Multilabel Classification

(a) Micro F1 scores for BlogCatalog.
(b) Macro F1 scores for BlogCatalog.
Figure 3: Micro F1 and macro F1 scores for BlogCatalog.

5.2.1 Experimental Setup

We also perform multilabel node classification on the following two multilabel networks:

  • BlogCatalog [43]. This is a social network graph where each of the 10,312 nodes corresponds to a user and the 333,983 edges represent the friendship relationships between bloggers. 39 different interest groups provide the labels.

  • IMDb Germany. This dataset is taken from [4]. It consists of 32,732 nodes, 1,175,364 edges and 27 labels. Each node represents an actor/actress who played in a German movie. Edges connect actors/actresses that were in a cast together and the node labels represent the genres that the corresponding actor/actress played.

Since the fraction of positive instances is relatively small for most of the classes, we use weighted cross-entropy as the loss function, such that the loss caused by erroneously classified positive instances is weighted higher. We use weight 10 in all our experiments. For the same reason, we report micro F1 and macro F1 scores to measure the quality of the considered methods. We compare our model to the featureless models that we already used for the multiclass experiments. (To adapt the Planetoid-G implementation for multilabel classification, we use a sigmoid activation function at the output layer and also slightly changed the embedding learning step: entities that are used as context and have the same labels as the node itself are sampled from all classes to which the node belongs.)
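For reference, a minimal sketch of such a weighted cross-entropy (shown here for multilabel outputs with sigmoid probabilities; the exact formulation in the original implementation may differ):

import numpy as np

def weighted_bce(y_true, y_prob, pos_weight=10.0, eps=1e-12):
    # Binary cross-entropy where errors on positive instances are
    # up-weighted by pos_weight (we use 10 in all experiments).
    y_prob = np.clip(y_prob, eps, 1 - eps)
    loss = -(pos_weight * y_true * np.log(y_prob)
             + (1 - y_true) * np.log(1 - y_prob))
    return loss.mean()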

(a) Micro F1 scores for IMDb Germany.
(b) Macro F1 scores for IMDb Germany.
Figure 4: Micro F1 and macro F1 scores for IMDb Germany.

We split the data into training, validation and test sets, so that 70% of all nodes are used for training, 10% for validation, and 20% for testing. Note that we could not use stratified sampling splits for these experiments, since we optimize for all classes simultaneously instead of using one-vs-rest classifiers. (This is why our results for node2vec and DeepWalk on the BlogCatalog network are slightly worse than those reported in [2].) The hyperparameter setting is as described above. For this set of experiments, we ran each model except Planetoid-G 10 times on five different data splits. Due to the long runtime of Planetoid-G, we trained this model only three times on two data splits.

(a) Micro F1 scores for Cora.
(b) Micro F1 scores for CiteSeer.
(c) Micro F1 scores for Pubmed.
Figure 5: Comparison against attribute-based methods: micro F1 scores for the three benchmark data sets.

5.2.2 Results

The results for the BlogCatalog graph are shown in Figure 3. For this network, using only the label information from the direct neighborhood of a node is not sufficient to infer its labels, cf. GCN_only_L. However, incorporating the label distribution of somewhat larger neighborhoods, as in our model (again, we use the APPR matrix calculated for small values of $\alpha$ to determine the label distribution in neighborhoods that span more than the 1-hop neighbors), improves the results for the prediction task significantly. In fact, our model achieves similar, though slightly worse, performance than node2vec and DeepWalk. Given these results, we also combined the node embeddings based on local label distributions with embeddings that capture structural properties. To capture the structural properties, we select a very simple approach: we multiply a randomly initialized embedding matrix with the preprocessed adjacency matrix as in Kipf et al. [20]. Note that the structural similarity is defined via direct neighbors. The resulting representation is concatenated with the hidden layer of the LD model; the rest of the LD model remains the same, and the embedding weights are learned jointly with the rest of the model. Looking at the scores for the resulting model, denoted as LD+EMB, this combination further improves the prediction results, as the sketch below illustrates.
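A minimal sketch of this combination (layer sizes, the initialization, and the exact preprocessing are assumptions; in the actual model the embedding matrix is a trainable parameter):

import numpy as np

def ld_emb_features(A_norm, H_ld, emb_dim=16, rng=np.random.default_rng(0)):
    # A_norm: preprocessed adjacency matrix as in [20]; H_ld: hidden layer
    # activations of the LD model. E would be learned jointly with the model.
    E = rng.normal(scale=0.1, size=(A_norm.shape[0], emb_dim))
    structural = A_norm @ E                        # direct-neighbor structure
    return np.concatenate([H_ld, structural], axis=1)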

For the IMDb Germany network, for which the results are shown in Figure 4, the labels of even very local neighborhoods are already very expressive. Recalling how this network is constructed, both this fact and the superior performance of our model over the two random walk based methods are to be expected. Particularly noteworthy for this network is the gain in accuracy achieved by combining information from both sources, label distribution and structural properties.

5.3 Comparison to Attribute-Based Methods

To show the power of incorporating label information into the generation process for node embeddings, we also compare our model against the following state-of-the-art attribute-based methods:

  • Feat: a baseline approach which predicts node labels only based on the node attributes without considering the underlying graph structure (borrowed from [16])

  • GCN: the standard 2-layer GCN as published in [20]

  • Chebyshev: the spectral convolution method which uses Chebyshev filters as presented in [19]; as in [20], we use 3rd order Chebyshev filters

  • Planetoid-T: the semi-supervised Planetoid framework which uses attribute information as proposed in [16]

For this set of experiments, we again perform multiclass prediction on the three benchmark text classification datasets and report the prediction accuracy in terms of micro F1 scores to measure the quality of the retrieved node representations. Note that, in contrast to the competitors, our model still does not make use of the node attribute information. The results are depicted in Figure 5 and clearly show that our model can compete with the attribute-based methods and hence is a powerful alternative in cases where no node attributes are available.

5.4 Impact of the Teleportation Parameter $\alpha$

(a) Micro F1 scores for Cora.
(b) Micro F1 scores for CiteSeer.
(c) Micro F1 scores for Pubmed.
Figure 6: Micro F1 scores for the three benchmark data sets when considering different locality levels for the node neighborhoods.

Figure 6 depicts the micro F1 scores achieved for different values of the teleportation parameter $\alpha$ on the three benchmark datasets. As can be seen, particularly for the Pubmed network, the model is quite sensitive to the choice of this parameter. Recall that the teleportation parameter determines how far the neighborhood of each node is taken into consideration when collecting the label distributions. Therefore, it might make sense to set the parameter $\alpha$ to a small value, so that more labels are collected, which in turn leads to a more accurate estimation of the local label distribution. On the other hand, this may not hold in every scenario, for instance if the distribution of classes is heterogeneous, i.e., some classes may only appear in areas of the graph where classes are concentrated locally, while other classes may appear in areas where many classes are mixed even within local neighborhoods. An interesting direction for future work is therefore to optimize for some "good" $\alpha$ value in a data-driven manner. This may be done either by pre-defining a set of different values of $\alpha$ and searching for the best of these, or by optimizing for some "good" value during the learning procedure. Also, the underlying task, e.g., node classification, may benefit from finding "good" values of $\alpha$ for each node individually rather than relying on a global solution.

6 Conclusion

In this paper, we have introduced a novel label-based approach for semi-supervised node classification. In particular, our method aggregates labels from local neighborhoods using APPR. Most existing approaches consider nodes to be similar if they are closely related in the graph; methods for attributed graphs additionally take the attributes of neighboring nodes into account. In contrast, our method can relate nodes even if they are not close-by in the graph and makes more effective use of the labels provided for training to improve the classification quality for graphs with and without node attributes. It is further applicable to nodes unseen during training. The results of our experiments on various real-world datasets demonstrate that local label distributions are able to significantly improve classification results in the multiclass and multilabel settings. Our model is even competitive with state-of-the-art models which take node attributes into consideration. In a first experiment on multilabel datasets, we were already able to significantly boost performance by using a simple combination of our model with node embeddings.

For future work, we plan to address the problem of selecting a suitable teleportation parameter $\alpha$. The parameter controls the extent of the considered local neighborhood and often has a significant impact on the prediction quality. Performing a grid search to determine a good parameter value is a time-consuming task. Furthermore, for different classes, varying teleportation parameters might yield the best results.

We also aim at further improving the prediction accuracy by investigating how to effectively combine label-based features with other kinds of features, such as node attributes, edge attributes or node embeddings, in a semi-supervised model. Our approach could also be extended to solve additional graph learning tasks, such as link prediction or the identification of nodes with unexpected labels for detecting labeling errors or outlier nodes.

References