Active Learning on Attributed Graphs via Graph Cognizant Logistic Regression and Preemptive Query Generation

07/09/2020, by Florence Regol et al.

Node classification in attributed graphs is an important task in multiple practical settings, but it can often be difficult or expensive to obtain labels. Active learning can improve the achieved classification performance for a given budget on the number of queried labels. The best existing methods are based on graph neural networks, but they often perform poorly unless a sizeable validation set of labelled nodes is available in order to choose good hyperparameters. We propose a novel graph-based active learning algorithm for the task of node classification in attributed graphs; our algorithm uses graph cognizant logistic regression, equivalent to a linearized graph convolutional neural network (GCN), for the prediction phase and maximizes the expected error reduction in the query phase. To reduce the delay experienced by a labeller interacting with the system, we derive a preemptive querying system that calculates a new query during the labelling process, and to address the setting where learning starts with almost no labelled data, we also develop a hybrid algorithm that performs adaptive model averaging of label propagation and linearized GCN inference. We conduct experiments on five public benchmark datasets, demonstrating a significant improvement over state-of-the-art approaches, and illustrate the practical value of the method by applying it to a private microwave link network dataset.


1 Introduction

In many classification tasks there are explicit or implicit relationships between the data points that need to be classified. One can represent such data using a graph, where an edge between two nodes (data points) indicates the presence of a relationship. The resultant task of node classification has attracted significant attention from the graph-learning research community, and numerous graph learning architectures have been developed that yield impressive performance, especially in semi-supervised settings (Defferrard et al., 2016; Kipf and Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Zhuang and Ma, 2018; Gao et al., 2018; Liu et al., 2019). In such cases, knowledge of the graph topology can compensate for scarcity of labelled data.

In practice, the semi-supervised classification task often arises in scenarios where it is challenging or expensive to obtain labels. If we have the opportunity to decide which nodes to query, then we should try to select the most informative nodes that lead to the best classification accuracy. This is, in a nutshell, the goal of active learning; as we acquire labels, we make decisions about which label to query next based on what we have learned. This is important in applications such as medical imaging, where generating labels requires considerable valuable time from domain experts (Hoi et al., 2006; Gal et al., 2017; Kurzendorfer et al., 2017). The development of active learning algorithms for node classification in graphs can be motivated by applications of graph neural networks (GNNs) in the medical field. For example, in (Parisot et al., 2018), GNNs are used to classify brain scan images, with the goal of predicting disease outcomes or detecting the presence of a disorder. Although we may have access to many brain scans and can specify relationships between them (thus building a graph), obtaining labels for them is expensive because it requires attention from medical experts.

The early research that applies active learning to graph data mainly focuses on the non-attributed graph setting (Zhu et al., 2003b; Ji and Han, 2012; Ma et al., 2013). We focus on node classification for attributed graphs, so the recently proposed GNN-based methods (Cai et al., 2017; Gao et al., 2018) are more closely aligned with the task we address. The results reported in these works are usually based on GNNs with hyperparameters that have been optimized using a large labelled validation set. This is an unrealistic setting; if we had access to such a large amount of labelled data, we would use much more of it to train the classifier. As we illustrate in the experiments in Section 5, if the hyperparameters are not optimized, but are instead chosen randomly from a reasonable range of candidate values, the performance of the GNN-based active learning methods deteriorates dramatically.

In this work, we aim to address the limitations of the GNN methods. We propose an algorithm that is based on the Expected Error Minimization (EEM) framework (Settles, 2009). In this framework, we select the query that minimizes the expected classification error according to our current model. We use the simplified graph convolution (SGC) (Wu et al., 2019), a graph-cognizant logistic regression, as the predictive model for the labels. This model, which can be derived as a linearization of the graph convolutional neural network (Wu et al., 2019), performs much better than a GNN with suboptimal hyperparameters when labelled data is limited, and achieves competitive accuracy as the number of labels increases.

Most active learning techniques involve initial training of a model and then an iterative process of (i) identifying the best query by some criterion (the core step of active learning); (ii) obtaining the label from an oracle; and (iii) updating the model. In an interactive application, this can lead to a delay if the query generation of step (i) is not extremely fast. Although it is principled and competitive with other approaches, the EEM algorithm does have the disadvantage of an increased computational overhead. However, we note that a delay is also introduced at step (ii); human labelling can take seconds (document categorization (Settles, 2009)) to minutes (cancer diagnosis from skin lesion images (Gal et al., 2017), MRI tumour segmentation (Kurzendorfer et al., 2017), or fault detection in microwave link networks). With this in mind, the interactive delay can be reduced or even eliminated if the model update and query identification steps can be started and completed while the labelling is conducted. We develop such a preemptive strategy, based on a prediction of the labelling from the oracle.

In summary, our paper makes the following contributions: (i) we propose a practical approach for active learning in graphs that does not have the unrealistic requirement of a validation set for hyperparameter tuning; (ii) we extend the proposed approach to introduce preemptive query generation in order to reduce or eliminate the delay experienced by a labeller during interaction with the system; (iii) we derive bounds on the error in risk evaluation associated with the preemptive prediction; (iv) we analyze performance on five public benchmark datasets and show a significant improvement compared to state-of-the-art GNN active learning methods (and label propagation strategies); (v) we illustrate the practical benefit of our method by demonstrating its application to a private, commercial dataset collected for the task of identifying faulty links in a microwave link network.

2 Related Research

2.1 Active learning on non-attributed graphs

Many methods for active learning on graphs without node or edge features are based on the idea of propagating label information across the graph, and we hence refer to them throughout the paper as label-propagation methods. The most successful techniques are all based on the binary Markov random field (BMRF) model. The model allows one to define a posterior on the unknown labels conditioned on the graph topology and the observed node labels. This model provides an effective mechanism for representing smoothness of labels with respect to the graph, but evaluating the posterior is a combinatorial problem. As a result, researchers have introduced relaxations or approximation strategies. (Zhu et al., 2003a) relax the BMRF to a Gaussian Random Field (GRF) model, and (Zhu et al., 2003b; Ji and Han, 2012) and subsequently (Ma et al., 2013) also employ this model to derive active learning methods. More recently, (Berberidis and Giannakis, 2018) have applied the expected change maximization strategy to a GRF model. (Jun and Nowak, 2016) take another approach by proposing a two-step approximation (TSA) of the intractable combinatorial problem rather than relaxing the BMRF model.

These strategies offer the advantage of being principled methods that directly target the quantity we want to optimize, but label propagation-based models cannot take into account node features and consequently must rely on strong assumptions regarding the relationships between the graph topology and the data. Most label propagation methods struggle if the graph is not connected and do not usually translate well to an inductive setting, because query decisions rely on the knowledge of the complete graph topology.

2.2 Graph neural network methods for active learning

(Cai et al., 2017) leverage the output of a graph convolutional network (GCN) (Kipf and Welling, 2017) to design active learning metrics. Their method alternates during the training of the GCN between adding one node to the training set and performing one epoch of training. Selection of the query node is based on a score that is a weighted mixture of three metrics covering different active learning strategies: an uncertainty metric, a density-based metric and a graph centrality metric. The uncertainty metric is obtained by taking the entropy of the softmax output given by the current GCN model. The density metric is based on the GCN node embeddings; the embeddings are clustered and the distance between each node's embedding and the centre of its cluster is computed. A more central embedding indicates a more representative node. The graph centrality metric is independent of the GCN and relies only on the position of the node in the graph. The weights change as more nodes are added to the labelled set, in order to reflect the increased confidence in the two metrics that are derived from the output of the GCN. The weight adaptation schedule in (Cai et al., 2017) is fixed; (Gao et al., 2018) propose an alternative multi-armed bandit algorithm that learns how to balance the contributions of the different metrics. They argue that this mechanism can better adapt to the varying natures of different datasets.

3 Problem Setting

3.1 Pool-based Formulation

We consider the problem of active learning on an attributed graph $G = (V, E)$ with $n$ nodes for node classification, using the feature matrix $X \in \mathbb{R}^{n \times d}$ and labels $y$. The nodes are partitioned into two sets: a small initial labelled set $L^0 \subset V$ with node labels $y_{L^0}$ ($|L^0| \ll n$), and a set $U^0 = V \setminus L^0$ consisting of the remaining unlabelled nodes. The algorithm is given a budget of $B$ nodes that it can query from $U^0$ to augment $L^0$. We denote by $L^t$ and $U^t$ the sets of labelled and unlabelled nodes, respectively, after $t$ nodes have been added to the initial labelled set. The pool-based active learning formulation that we are considering consists of three phases:

  1. Prediction step: $X$, the graph, and the current node labels $y_{L^t}$ are used to infer the labels of the nodes in $U^t$.

  2. Query step: until the budget $B$ is exhausted, select a node $v^{(t+1)} \in U^t$ to query and to add to the labelled set $L^t$.

  3. Labelling step: the oracle takes time $T$ to label $v^{(t+1)}$. We update the sets: $L^{t+1} = L^t \cup \{v^{(t+1)}\}$, $U^{t+1} = U^t \setminus \{v^{(t+1)}\}$.

The goal is to select the best node $v^{(t+1)}$ to append to $L^t$ at each iteration $t$, in order to optimize the prediction performance throughout the query process. We are not only interested in the end result after exhausting the query budget, but also in how quickly we can increase accuracy. Acquiring labels is presumed to be expensive, so a solution that reaches competitive performance with fewer labelled nodes is desirable.

In addition to the transductive setting outlined above, where we know the entire graph and all attributes, we also consider an inductive setting, where our goal is to maximize performance over an additional held-out set of nodes; we know that these nodes are connected to the graph in some fashion, but we cannot query them and we do not know their edges or features during the active learning process.
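To make the three phases concrete, the following is a minimal Python sketch of the pool-based loop; the callables train, select_query, and oracle are hypothetical stand-ins for the prediction model, the query criterion of Section 4, and the labeller, and are not part of our implementation.

```python
# A minimal sketch of the pool-based active learning loop of Section 3.1.
# `train`, `select_query`, and `oracle` are hypothetical callables supplied
# by the user; they stand in for the components developed in Section 4.
def active_learning_loop(X, graph, initial_labels, budget, train, select_query, oracle):
    labelled = dict(initial_labels)                    # L^t: node index -> label
    unlabelled = set(range(len(X))) - set(labelled)    # U^t
    for t in range(budget):
        model = train(X, graph, labelled)              # 1. prediction step
        v = select_query(model, labelled, unlabelled)  # 2. query step
        labelled[v] = oracle(v)                        # 3. labelling step (takes time T)
        unlabelled.discard(v)
    return train(X, graph, labelled)                   # final model
```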

3.2 Reducing interaction delay

With the phases outlined above, the active learning algorithm stalls while waiting for the oracle to label $v^{(t)}$ at the third phase, and then the oracle must wait while the algorithm computes the best subsequent query node $v^{(t+1)}$. This is inefficient and, for a human oracle, frustrating. We can address this by requiring the active learning algorithm to identify the query node $v^{(t+1)}$ without the label of $v^{(t)}$, i.e., using only the label set $y_{L^{t-1}}$. If the labelling time and the query computation time are similar, then neither the oracle nor the algorithm stalls for long. While the oracle is generating the label for $v^{(t)}$, this preemptive active learning algorithm identifies in parallel the best query node $v^{(t+1)}$ using $y_{L^{t-1}}$. Figure 1 compares the timelines of the standard single-query active learning procedure (the query generation algorithm waits for the oracle and vice versa) with the preemptive strategy, where labelling and query generation are performed in parallel.
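As a scheduling sketch only (the callables compute_query and oracle are hypothetical), the overlap can be realized with a single background worker: the oracle labels the current query while the main thread computes the next one.

```python
import concurrent.futures

# A sketch of the preemptive schedule: the oracle labels v^(t) in a background
# thread while the next query v^(t+1) is computed from the labels known so far.
# `compute_query` and `oracle` are hypothetical callables; `unlabelled` is a set.
def preemptive_loop(labelled, unlabelled, budget, compute_query, oracle):
    v = compute_query(labelled, unlabelled)               # first query from L^0
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for t in range(budget):
            pending = pool.submit(oracle, v)              # labelling of v^(t) starts
            unlabelled = unlabelled - {v}
            v_next = compute_query(labelled, unlabelled)  # runs while the oracle works
            labelled = {**labelled, v: pending.result()}  # block until the label arrives
            v = v_next
    return labelled
```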

Figure 1: A comparison of the timelines of the standard single-query active learning process (the oracle and the query-selection algorithm alternate, each waiting for the other) and the proposed preemptive process (labelling and query selection run in parallel).

4 Methodology

4.1 Expected Error Minimization (EEM)

An active learning algorithm based on error reduction selects the query to minimize the expected error. For a classification task, the zero-one error is a suitable choice. Denote by $U^t_{-v} = U^t \setminus \{v\}$ the set of unlabelled nodes after $t$ iterations of active learning with node $v$ removed; the labels associated with this set are $y_{U^t_{-v}}$. Following (Jun and Nowak, 2016), we can define $R^{(t)}(v, y_v)$, the risk of adding node $v$ with label $y_v$ given the current known label set $y_{L^t}$, as:

$R^{(t)}(v, y_v) = \sum_{i \in U^t_{-v}} \big(1 - \max_{y \in \mathcal{Y}} P(y_i = y \mid y_{L^t}, y_v)\big)$   (1)

Here $\hat{y}_i = \arg\max_{y \in \mathcal{Y}} P(y_i = y \mid y_{L^t}, y_v)$ is the label prediction at node $i$. We thus calculate the expected error by summing error probabilities over the unlabelled set, minus the node $v$ we are considering. Define $e_i(v, c) = 1 - \max_{y \in \mathcal{Y}} P(y_i = y \mid y_{L^t}, y_v = c)$, where $\mathcal{Y}$ is the set of classes. If the query node $v$ has label $c$, then $e_i(v, c)$ represents the probability of making an error in the prediction of the label of node $i$. If we can compute the distribution $P(y_v \mid y_{L^t})$, we can evaluate the risk of querying $v$:

$R^{(t)}(v) = \sum_{c \in \mathcal{Y}} P(y_v = c \mid y_{L^t})\, R^{(t)}(v, c)$   (2)

The query algorithm selects the risk-minimizing node $v^{(t+1)}$:

$v^{(t+1)} = \arg\min_{v \in U^t} R^{(t)}(v)$   (3)

It remains to define the probabilistic model $P(y_i = y \mid y_{L^t})$.
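The risk computation in (1)-(3) can be sketched as follows; prior and posterior are hypothetical stand-ins for the probabilistic model that Section 4.2 supplies (prior(v) holds $P(y_v = c \mid y_{L^t})$, posterior(v, c) returns the matrix of $P(y_i = y \mid y_{L^t}, y_v = c)$ for all nodes).

```python
import numpy as np

# A sketch of the zero-one EEM risk of eqs. (1)-(3). `posterior(v, c)` is
# assumed to return an (n x |Y|) array of P(y_i = y | y_L, y_v = c), and
# `prior(v)` an array of P(y_v = c | y_L); both are hypothetical stand-ins.
def expected_risk(v, unlabelled, prior, posterior):
    others = [i for i in unlabelled if i != v]            # U^t_{-v}
    risk = 0.0
    for c, p_c in enumerate(prior(v)):                    # outer sum of eq. (2)
        probs = posterior(v, c)                           # model given y_v = c
        risk += p_c * (1.0 - probs[others].max(axis=1)).sum()   # eq. (1)
    return risk

def select_query(unlabelled, prior, posterior):           # eq. (3)
    return min(unlabelled, key=lambda v: expected_risk(v, unlabelled, prior, posterior))
```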

4.2 Graph-cognizant logistic regression

We propose to use a graph-cognizant logistic regression model to obtain $P(y_i = y \mid y_{L^t})$. Such a model was introduced by (Wu et al., 2019), where the SGC is derived as a simplified (linearized) version of the graph convolutional network of (Kipf and Welling, 2017). (Wu et al., 2019) showed that the simplified model can achieve competitive performance at a significantly lower computational cost. In the EEM approach to active learning, we must learn a new model for every potential query node, so it is essential that the computational cost is relatively low. The SGC meets our requirements: its computational cost is moderate, and it takes into account both the graph structure and the node features.

For a graph with adjacency matrix $A$, let $\tilde{A} = A + I$, let $\tilde{D}$ be the degree matrix of $\tilde{A}$, and define

$S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$.

This can be interpreted as a degree-normalized symmetrized adjacency matrix (after self-loops have been added by the identity matrix $I$). The prediction model has the form:

$\hat{Y} = \mathrm{softmax}(S^K X \Theta)$   (4)

We define $\bar{X} = S^K X$; this can be interpreted as graph-based preprocessing of the node features. The parameter $K$ controls the number of hops that are considered when generating the final node representation. Usually a 2-hop ($K = 2$) neighbourhood yields good results.
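Since $\bar{X} = S^K X$ does not depend on the labels, it can be computed once as a preprocessing step; a minimal sketch with scipy sparse matrices:

```python
import numpy as np
import scipy.sparse as sp

# A sketch of the SGC preprocessing X_bar = S^K X (eq. (4)), with
# S = D~^{-1/2} (A + I) D~^{-1/2}, following (Wu et al., 2019).
def sgc_features(adj, X, K=2):
    A_tilde = adj + sp.eye(adj.shape[0])       # add self-loops
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()   # degrees of A~
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))  # D~^{-1/2} (deg >= 1 after self-loops)
    S = d_inv_sqrt @ A_tilde @ d_inv_sqrt      # normalized adjacency
    X_bar = X
    for _ in range(K):                         # K-hop propagation
        X_bar = S @ X_bar
    return X_bar
```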

4.3 Graph EEM (GEEM)

Using the SGC model, we can compute a risk for each query node. At each step $t$, we use the current known labels $y_{L^t}$ to find the weights $\Theta^{(t)}$ by minimizing the error for the predictions $\mathrm{softmax}(\bar{X}\Theta)$. We use a standard iterative algorithm for maximum likelihood logistic regression with L2 regularization (e.g., the scikit-learn liblinear solver). We can then compute $P(y_v = c \mid y_{L^t}) = \mathrm{softmax}(\bar{x}_v \Theta^{(t)})_c$, where the index $c$ indicates that we extract the $c$-th element of the vector. Then for each candidate node $v \in U^t$ and each possible class $c \in \mathcal{Y}$, we solve:

$\Theta^{(t,v,c)} = \arg\min_{\Theta}\; \ell\big(\Theta;\, \bar{X},\, y_{L^t \cup_c v}\big)$   (5)

where $\ell$ is the regularized logistic regression loss. Here the notation $L^t \cup_c v$ indicates that we are adding node $v$ to the labelled set and assigning it label $c$. For the adopted model, $P(y_i = y \mid y_{L^t}, y_v = c) = \mathrm{softmax}(\bar{x}_i \Theta^{(t,v,c)})_y$. The node to query is then the one that minimizes the risk:

$v^{(t+1)} = \arg\min_{v \in U^t} \sum_{c \in \mathcal{Y}} P(y_v = c \mid y_{L^t}) \sum_{i \in U^t_{-v}} \big(1 - \max_{y \in \mathcal{Y}} P(y_i = y \mid y_{L^t}, y_v = c)\big)$   (6)

From this formulation, we can see that we first have to evaluate $P(y_v = c \mid y_{L^t})$, and then calculate $P(y_i = y \mid y_{L^t}, y_v = c)$ for each of the $|U^t| \cdot |\mathcal{Y}|$ potential augmented labelled sets $L^t \cup_c v$. This implies a computational complexity of $O(|U^t|\,|\mathcal{Y}|\,C_{train})$, where $C_{train}$ represents the complexity associated with training the model. For logistic regression, this is the overhead involved in learning the weights $\Theta^{(t,v,c)}$. If the evaluation of $\Theta^{(t,v,c)}$ is computationally expensive, then the time required to select a query node can become prohibitive. This is a common disadvantage of expected error reduction strategies (Settles, 2009). It then becomes apparent why using the linearized version of the GCN is important.
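A simplified sketch of one GEEM query step using scikit-learn is shown below; this is an illustration under our reading of (5)-(6), not our exact implementation, and it omits the 500-node risk subset of Section 5.2 for clarity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A sketch of the GEEM query step, eqs. (5)-(6): retrain an L2-regularized
# logistic regression on X_bar for every candidate (node, class) pair and
# pick the node with the smallest expected zero-one risk.
def geem_query(X_bar, labelled, unlabelled):
    idx_l = sorted(labelled)
    y_l = [labelled[i] for i in idx_l]
    base = LogisticRegression(solver="liblinear").fit(X_bar[idx_l], y_l)
    best_v, best_risk = None, np.inf
    for v in unlabelled:
        prior = base.predict_proba(X_bar[[v]])[0]          # P(y_v = c | y_L)
        others = [i for i in unlabelled if i != v]         # U^t_{-v}
        risk = 0.0
        for p_c, c in zip(prior, base.classes_):
            aug = LogisticRegression(solver="liblinear").fit(
                X_bar[idx_l + [v]], y_l + [c])             # eq. (5): add (v, c)
            probs = aug.predict_proba(X_bar[others])
            risk += p_c * (1.0 - probs.max(axis=1)).sum()  # inner sum of eq. (6)
        if risk < best_risk:
            best_v, best_risk = v, risk
    return best_v
```

Note that the liblinear solver fits one-vs-rest L2-regularized models, which matches the one-vs-all scheme analyzed in Section 4.5.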

The proposed algorithm requires very few hyperparameter choices (only the number of hops $K$ and the logistic regression regularization parameter). This contrasts with the active learning approaches based on graph neural networks, where there are multiple hyperparameters that must be selected, and suboptimal choices can have a major impact on performance.

4.4 Preemptive Query (PreGEEM)

In many practical active learning scenarios, labelling is performed by a human, and it is often time-consuming; labelling a single data point (node) can take tens of seconds or minutes. It is desirable to have the next query identified as soon as the labeller has completed the labelling task. With the EEM algorithm formulated above, this is impossible, because the query identification in (6) uses the label associated with the previous query node.

In this section, we outline an alternative approach that performs preemptive query calculation, using the labelling time to identify the next node to query. Instead of waiting for the oracle to label $v^{(t)}$ before starting the identification of $v^{(t+1)}$, we propose to approximate the risk before knowing $y_{v^{(t)}}$. The direct approach is to replace the risk with its expectation over the possible values of $y_{v^{(t)}}$, but this increases the computational complexity by a factor of $|\mathcal{Y}|$, which is highly undesirable. To avoid this penalty, we instead approximate using the value of the risk for the label at the mode of $P(y_{v^{(t)}} \mid y_{L^{t-1}})$. Effectively, we use the predicted label $\hat{y}_{v^{(t)}} = \arg\max_{c} P(y_{v^{(t)}} = c \mid y_{L^{t-1}})$ from the previous model to form an augmented labelled set $\tilde{L}^t = L^{t-1} \cup \{v^{(t)}\}$ with labels $y_{\tilde{L}^t} = (y_{L^{t-1}}, \hat{y}_{v^{(t)}})$, and define an approximate risk:

$\tilde{R}^{(t)}(v) = \sum_{c \in \mathcal{Y}} P(y_v = c \mid y_{\tilde{L}^t}) \sum_{i \in U^t_{-v}} \big(1 - \max_{y \in \mathcal{Y}} P(y_i = y \mid y_{\tilde{L}^t}, y_v = c)\big)$   (7)

The query node is then $v^{(t+1)} = \arg\min_{v \in U^t} \tilde{R}^{(t)}(v)$. We call this new approach Preemptive Graph EEM (PreGEEM).
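In code, PreGEEM only changes how the labelled set is formed before the (unchanged) GEEM search; a sketch reusing the geem_query function above:

```python
# A sketch of PreGEEM (eq. (7)): the mode of the current predictive
# distribution for the still-unlabelled v_prev stands in for the oracle label.
def pregeem_query(X_bar, labelled, unlabelled, v_prev, base_model):
    y_hat = base_model.predict(X_bar[[v_prev]])[0]  # mode of P(y_{v^(t)} | y_{L^{t-1}})
    pseudo = {**labelled, v_prev: y_hat}            # augmented set L~^t
    return geem_query(X_bar, pseudo, unlabelled - {v_prev})
```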

Figure 2 compares the evolution of the two active learning algorithms GEEM and PreGEEM for a small subset of nodes to illustrate how using the approximated risk can impact the query process. We see that the evaluated risks are similar, and although the ordering of query nodes differs, after five steps the same nodes have been selected by both algorithms.

Figure 2: Risk comparison for GEEM vs. PreGEEM, following the risk computations for 25 nodes of the Cora dataset over one trial. The black star indicates which node was selected (following the algorithm, it is the one with the lowest expected risk).

4.5 Bounds on the PreGEEM Risk Error

We now present bounds on the risk estimation error that can arise from using a predicted label. We focus on the one-step error $|R^{(t)}(v) - \tilde{R}^{(t)}(v)|$, where the labelled sets differ for only one label. For clarity, we first derive a bound for the binary classification task. We then state the more general bound for multiclass classification. Complete proofs are provided in the supplementary material.

The following proposition, which follows straightforwardly from Theorem 1 and Lemma 1 in (Sivan et al., 2019), bounds the difference in prediction values for two label sets that differ for a set of data points. The bound is expressed in terms of the regularization parameter $\lambda$ and the graph-preprocessed feature vectors $\bar{x}_i$.

Proposition 1.

For weights $w$ and $w'$ derived by applying $\lambda$-penalized logistic regression to two datasets with common feature vectors but label sets differing for the vectors indexed by $D$, define $\Delta w = w - w'$ and $h = \frac{2}{\lambda} \sum_{i \in D} \|\bar{x}_i\|$. For any $x$, $|x^\top \Delta w| \le h \|x\|$.

The following result characterizes the risk error when we perform $\lambda$-regularized binary logistic regression on $y_{L^t}$ and $y_{\tilde{L}^t}$ to derive weights for the predictions. Define $h_v = \frac{2}{\lambda} \|\bar{x}_{v^{(t)}}\|\,\|\bar{x}_v\|$, and $H = \frac{2}{\lambda} \|\bar{x}_{v^{(t)}}\| \sum_{i \in U^t_{-v}} \|\bar{x}_i\|$.

Theorem 4.1.

The risk error arising from applying binary L2-regularized logistic regression with regularization parameter $\lambda$ to two labelled datasets $y_{L^t}$ and $y_{\tilde{L}^t}$ that differ by one label, associated with the node $v^{(t)}$, is bounded as:

$|R^{(t)}(v) - \tilde{R}^{(t)}(v)| \le \frac{H}{4} + \frac{|U^t_{-v}|}{4}\, h_v$

Sketch of Proof.

We define the random variable $Z$ which takes value $z_c = \sum_{i \in U^t_{-v}} \big(1 - \max_{y} P(y_i = y \mid y_{L^t \cup_c v})\big)$ with probability $p_c = P(y_v = c \mid y_{L^t})$ for $c \in \{0, 1\}$, and define $\tilde{Z}$ analogously for $\tilde{L}^t$. The difference in risk is:

$R^{(t)}(v) - \tilde{R}^{(t)}(v) = E[Z] - E[\tilde{Z}]$   (8)

where $E$ denotes expectation over $y_v$ conditioned on the observed label set, either $y_{L^t}$ or $y_{\tilde{L}^t}$. For query node $v$, for each $c \in \{0, 1\}$, we learn weights $w^{(c)}$ and $\tilde{w}^{(c)}$ using $y_{L^t \cup_c v}$ and $y_{\tilde{L}^t \cup_c v}$, respectively. For each $i \in U^t_{-v}$, we have:

$\big| P(y_i = y \mid y_{L^t \cup_c v}) - P(y_i = y \mid y_{\tilde{L}^t \cup_c v}) \big| \le \frac{1}{4}\big|\bar{x}_i^\top (w^{(c)} - \tilde{w}^{(c)})\big| \le \frac{1}{2\lambda} \|\bar{x}_{v^{(t)}}\|\,\|\bar{x}_i\|$   (9)

Here the first inequality follows from the definition of the sigmoid $\sigma(\cdot)$ and the property that $|\sigma(a) - \sigma(b)| \le \frac{1}{4}|a - b|$ for $a, b \in \mathbb{R}$. The second inequality follows from Proposition 1, observing that the two label sets can differ only at the node $v^{(t)}$.

Observe that for random variables $X$ and $Y$ taking values $x_c$ and $y_c$ with probabilities $p_c$ and $q_c$, respectively, $|E[X] - E[Y]| \le \max_c |x_c - y_c| + \max_c |y_c| \sum_c |p_c - q_c|$. Applying this to (8) and then employing (9) leads to the stated bound on the risk error.

The following bound applies for the case of multiclass classification. For a given label set $y_L$, we learn weights $w_c$ for each class $c \in \mathcal{Y}$ using L2-regularized binary one-vs-all logistic regression. The output prior to normalization for a given feature vector $\bar{x}_i$ is then $\sigma(\bar{x}_i^\top w_c)$. We then normalize by dividing by $\sum_{c' \in \mathcal{Y}} \sigma(\bar{x}_i^\top w_{c'})$ to obtain a probability vector. Let $w_c^{L}$ be the weight vector learned for class $c$ using label data $y_L$. Let $m_i = \sum_{c \in \mathcal{Y}} \sigma(\bar{x}_i^\top w_c^{L})$ for $i \in U^t_{-v}$. Define $m = \min_i m_i$ over the label sets considered.

Theorem 4.2.

The risk error arising from multiclass regression performed via repeated one-vs-all L2-regularized logistic regression with regularization parameter $\lambda$ applied to labelled datasets $y_{L^t}$ and $y_{\tilde{L}^t}$ that differ by one label, associated with the node $v^{(t)}$, is bounded as:

$|R^{(t)}(v) - \tilde{R}^{(t)}(v)| \le \frac{2}{m}\left(\frac{H}{4} + \frac{|U^t_{-v}|}{4}\, h_v\right)$   (10)

for $m > 0$.

The proof is similar to that of Theorem 4.1, but more involved, and is provided in the supplementary material.

4.6 Combined method

The most extreme case of active learning is when we start with only one labelled node. In this scenario, the logistic regression model cannot make useful predictions until at least some nodes have been queried. To address this scenario, we combine our algorithm with a label-propagation method. The aim is to first use label-propagation when very few node labels are available, then switch to a combination of both algorithms when more information is available, and finally transition to the more accurate graph-cognizant logistic regression. Bayesian model averaging provides a mechanism to make this transition (Minka, 2002).

In Bayesian model averaging, we have $M$ different classifiers and our belief is that one of these models is correct. We start with a prior $P(m)$ over the models. After observing data $D$, we compute the model evidence $P(D \mid m)$. Using Bayes' rule, we can compute the posterior $P(m \mid D) \propto P(D \mid m) P(m)$ and then weight the predictions:

$P(y \mid D) = \sum_{m=1}^{M} P(y \mid m, D)\, P(m \mid D)$

In the context of expected error minimization active learning, we need to evaluate the risk associated with a query. We introduce a model-dependent zero-one risk $R_m^{(t)}(v)$, and in our combined method, we employ a model-averaged risk:

$R^{(t)}(v) = \sum_{m} P(m \mid y_{L^t})\, R_m^{(t)}(v)$   (11)

In order to compute (or approximate) this expression, we need to evaluate $P(m \mid y_{L^t})$. Assuming that we have equal prior belief in the models available to us, this is equivalent to calculating the marginal likelihood $P(y_{L^t} \mid m)$.

We incorporate two models, one based on label propagation and the other based on logistic regression. For the binary Markov random field model that underpins the label propagation classifiers, there are no learnable model parameters (there is one fixed hyperparameter). Evaluating the evidence is thus equivalent to computing the probability of the observed labels $P(y_{L^t})$ under the BMRF model. This is a combinatorial problem, but we can factorize the joint probability into a chain rule of conditionals and use the same two-step approximation (TSA) that is employed in (Jun and Nowak, 2016). Additional details are provided in the supplementary material. We denote this evidence approximation by $\hat{P}_{TSA}(y_{L^t})$.

For the logistic regression model, we use $P(y_i = c \mid \Theta) = \mathrm{softmax}(\bar{x}_i \Theta)_c$. The joint probability of the complete labelled set is then evaluated as $P(y_{L^t} \mid \Theta) = \prod_{i \in L^t} \mathrm{softmax}(\bar{x}_i \Theta)_{c_i}$, where $c_i$ is the categorical index of $y_i$. We then have $P(y_{L^t}) = \int P(y_{L^t} \mid \Theta)\, p(\Theta)\, d\Theta$. To calculate the evidence, we thus need to integrate over the weight matrix $\Theta$, which is not analytically tractable. We choose to approximate $P(y_{L^t})$ by the plug-in value $P(y_{L^t} \mid \Theta^{(t)})$ at the learned weights, and we denote this as $\hat{P}_{SGC}(y_{L^t})$. This leads to a sufficiently accurate approximation of the evidence for our purpose (which is just to achieve an adaptive balance between label propagation and SGC).
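A sketch of this plug-in evidence for the SGC model, evaluated in the log domain (to avoid underflow) with a fitted scikit-learn classifier such as the one in the GEEM sketch above:

```python
# A sketch of the plug-in SGC evidence log P(y_L | Theta^(t)): the sum of the
# fitted model's log softmax outputs at the observed labels.
def log_evidence_sgc(model, X_bar, labelled):
    idx = sorted(labelled)
    log_probs = model.predict_log_proba(X_bar[idx])     # log softmax outputs
    col = {c: k for k, c in enumerate(model.classes_)}  # class -> column index
    return sum(log_probs[j, col[labelled[i]]] for j, i in enumerate(idx))
```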

We use the TSA algorithm (Jun and Nowak, 2016) as the label propagation-based estimator, leading to a new combined approach for selecting the query node. We solve

$v^{(t+1)} = \arg\min_{v \in U^t}\; \hat{q}_{SGC}\, R^{(t)}_{SGC}(v) + \hat{q}_{TSA}\, R^{(t)}_{TSA}(v)$

where $\hat{q}_{SGC}$ and $\hat{q}_{TSA}$ are our normalized estimates of the model evidences for the SGC model and the TSA label propagation model, respectively, after observing the data $y_{L^t}$.
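The combined selection rule can be sketched as follows; risk_sgc and risk_tsa are hypothetical callables for the two model-dependent EEM risks, and the log evidences come from the respective approximations above.

```python
import numpy as np

# A sketch of the combined (model-averaged) query rule of Section 4.6.
# Equal model priors are assumed, so the weights are the normalized evidences.
def combined_query(unlabelled, risk_sgc, risk_tsa, log_ev_sgc, log_ev_tsa):
    m = max(log_ev_sgc, log_ev_tsa)                       # stabilize the exponentials
    w_sgc, w_tsa = np.exp(log_ev_sgc - m), np.exp(log_ev_tsa - m)
    q_sgc, q_tsa = w_sgc / (w_sgc + w_tsa), w_tsa / (w_sgc + w_tsa)
    return min(unlabelled, key=lambda v: q_sgc * risk_sgc(v) + q_tsa * risk_tsa(v))
```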

Figure 3: Comparison of the performance of active learning algorithms for Experiment 1 (panels: (a) Cora, (b) Citeseer, (c) Amazon-Photo). Each point on a curve shows the mean classification accuracy achieved across 20 random partitions after the labelled set has been expanded to the indicated number of nodes. The shaded regions indicate 5/95 confidence intervals on the means, derived using bootstrap.

5 Experiments

We examine performance using five of the node classification benchmarks in (Shchur et al., 2018). Cora, Citeseer (Sen et al., 2008) and Pubmed (Namata et al., 2012) are citation datasets. Nodes represent journal articles, and an undirected edge is included when one article cites another. The node features are bag-of-words representations of article content. Amazon-Photo and Amazon-Computers are graphs based on customers' co-purchase history records. For each dataset we isolate the largest connected component of the graph, following (Shchur et al., 2018). Dataset statistics are shown in Table 1.

Dataset     Classes   Features   Nodes    Edges     Edge Density
Cora        7         1,433      2,485    5,069     0.04%
Citeseer    6         3,703      2,110    3,668     0.04%
Pubmed      3         500        19,717   44,324    0.01%
Am-Comp.    10        767        13,381   245,778   0.07%
Am-Photo    8         745        7,487    119,043   0.11%
Microwave   2         19         322      5,753     5.54%

Table 1: Statistics of evaluation datasets.

5.1 Baselines

We compare the following active learning algorithms:

  • Random: chooses a node to query by uniform random selection, and then performs classification using SGC.
  • AGE: the graph neural network based algorithm proposed by (Cai et al., 2017).
  • ANRMAB: the graph neural network algorithm proposed by (Gao et al., 2018), in which a multi-armed bandit is used to adapt the weights assigned to the different metrics used when constructing the score to choose a query node.
  • TSA: the label-propagation algorithm based on a two-step approximation of the BMRF model (Jun and Nowak, 2016).
  • EC-TV, EC-MSD: the label-propagation algorithms based on a Gaussian random field approximation to the BMRF model (Berberidis and Giannakis, 2018).
  • GEEM: the proposed algorithm based on SGC and expected error minimization.
  • PreGEEM: the proposed algorithm with preemptive queries.
  • Combined: the proposed combined algorithm that uses Bayesian model averaging to adaptively merge SGC and label propagation in an EEM framework.

5.2 Experimental Settings

For each experiment, we report the average over 20 trials with different random partitions. All GCNs and SGCs have 2 layers. The weight-adapting parameter of AGE is set to the values given in (Cai et al., 2017), and to 0.995 for datasets not included in that work. For the larger datasets (Am-Photo, Am-Comp and Pubmed), we reduce the computational complexity of GEEM and PreGEEM by evaluating the risk using a random subset of 500 nodes, in an approach similar to (Roy and McCallum, 2001); this has minimal impact on performance. The GCN hyperparameters are set to the values found by (Shchur et al., 2018) to be the best performing configurations. Early stopping is not employed, because access to a validation set is not a reasonable assumption in an active learning setting. We also include a "non-optimized" version of AGE; in practice, we would usually not have access to the tuned hyperparameters provided by (Shchur et al., 2018), because these are derived using a large validation set. For the non-optimized version of AGE, the hyperparameter configuration for each trial is randomly selected from the values considered in the grid search of (Shchur et al., 2018).

Figure 4: Performance comparison between the label propagation algorithms and the proposed combined model-averaging expected error minimization method when the initial label set consists of one random node (panels: (a) Cora, transductive; (b) Cora, inductive; (c) Citeseer, transductive; (d) Citeseer, inductive). In the transductive setting, accuracy is evaluated across all unlabelled nodes; in the inductive setting, accuracy is evaluated on a held-out test set of nodes.

Experiment 1: Initial Labelled Set, Transductive: Each algorithm is initially provided with a small set of randomly chosen labelled nodes. We evaluate performance on a set of test nodes comprising 20% of the unlabelled set. The algorithms cannot query nodes from this evaluation set. Algorithms have access to the entire topology and all node features. For the Cora and Citeseer datasets, we start with a small percentage of labelled nodes; for the larger datasets we reduce this percentage to achieve similar initial set sizes.

Figure 5: Performance of the active learning algorithms for detection of faulty links in the microwave link network.

Experiment 2: Single Labelled Node, Transductive and Inductive: Algorithms start with a single random labelled node. We examine two settings. In the transductive setting, algorithms know the entire graph and can access all features. Performance is assessed over all unlabelled nodes. In the inductive setting, a portion of the graph is held out for testing; the algorithms do not have access to the features and topology information for these nodes.

5.3 Results and Discussion

Experiment 1: Figure 3 and Table 2 show how the algorithms' accuracies change as more nodes are added to an initial labelled set of size 10-20 nodes. Label propagation algorithms are not shown because they are outperformed by the GCN-based methods in this scenario. For all presented datasets, the proposed algorithms outperform the other GCN-based methods. This holds even for the cases when the hyperparameters of the GCN have been optimized using a validation set. When the hyperparameters are not tuned, the performance of the AGE algorithm deteriorates dramatically; it is then better to choose the query node randomly. AGE outperforms the ANRMAB algorithm for the datasets where its weight-adapting parameter was tuned (Cora, Citeseer and Pubmed). The Random baseline and the proposed GEEM method use the same classifier, so they differ only in the nodes that are queried. Choosing an informative set of nodes using our proposed methods leads to a substantially improved accuracy in all cases. For the Cora dataset, the optimized GCN classifier initially outperforms the SGC model. However, GEEM quickly overtakes it as more nodes are queried, showing that the selection algorithm is more effective. Comparing PreGEEM performance against GEEM, it is clear that the approximation has very little impact; there is no clear performance difference between the two.

Budget                 0       1       10      30      max
Cora (max = 60)
  GEEM*                39.6    46.5    69.8*   77.2*   79.9
  PreGEEM*             39.6    46.5    68.2*   77.1*   80.3
  Random               39.6    40.2    49.7    63.0    73.3
  AGE                  46.6    52.7    61.6    74.9    79.8
  ANRMAB               46.6    47.5    59.1    72.7    78.1
Citeseer (max = 60)
  GEEM*                40.5    49.7*   65.8*   71.2    72.8
  PreGEEM*             40.5    49.7*   66.5*   71.8    73.3
  Random               40.5    44.1    53.8    64.4    70.4
  AGE                  41.2    44.7    60.5    69.1    71.4
  ANRMAB               41.2    44.1    55.7    64.6    69.4
Pubmed (max = 40)
  GEEM*                52.3    58.1    72.6    77.6    78.7
  PreGEEM*             52.3    58.1    69.3    77.2    78.0
  Random               52.3    54.1    64.7    72.3    73.9
  AGE                  57.3    60.8    70.4    76.7    78.1
  ANRMAB               57.3    58.8    69.5    74.1    75.7
Am-Photo (max = 40)
  GEEM*                59.6*   64.3*   82.4*   89.2*   90.7*
  PreGEEM*             59.6*   64.3*   80.3*   88.8*   89.6
  Random               59.6*   61.4*   72.0    82.3    87.6
  AGE                  45.5    52.0    51.5    67.8    69.3
  ANRMAB               45.5    50.6    62.6    67.8    70.0
Am-Comp. (max = 40)
  GEEM*                54.6*   59.8*   68.8*   74.8*   76.8*
  PreGEEM*             54.6*   59.8*   68.4*   76.5*   77.5*
  Random               54.6*   57.7*   65.9    72.8    73.3
  AGE                  47.1    41.5    51.6    52.4    53.3
  ANRMAB               47.1    49.4    54.6    58.7    58.5
Microwave (max = 60)
  GEEM*                76.4    77.5    80.1    82.9*   86.0*
  PreGEEM*             76.4    77.5    80.3    82.4*   86.0*
  Random               76.4    76.7    79.1    81.0    83.9
  AGE                  69.1    68.3    70.3    70.3    75.1
  ANRMAB               69.1    67.2    72.3    73.5    73.2

Table 2: Experiment 1 and Practical Application: average accuracy at different budgets; the final column corresponds to the full budget indicated in parentheses after each dataset name. Asterisks indicate that a Wilcoxon ranking test showed a statistically significant difference between the marked method and the best performing baseline.

Experiment 2: Figure 4 compares the performance of the proposed Combined method with the label propagation algorithms. In the transductive setting, the proposed method is much better than Random selection. Since it incorporates the TSA technique, its performance is similar to TSA when few nodes have been queried. As the number of labels increases, a small but significant improvement in accuracy emerges. The inability of the label propagation methods to adapt to the inductive setting is shown clearly in Figures 4(b) and 4(d): in order to choose effective nodes to query, these methods need to know the topology of the entire graph. By contrast, the Combined method, which incorporates graph-based logistic regression, achieves similar performance in both the inductive and transductive settings.

5.4 Practical Application

To give a concrete motivating application for PreGEEM, we also report the results of experiments on a private company dataset obtained from measurements of a microwave link network. Currently, faulty links are identified by human operators who must process lengthy performance log files. The identification, or labelling, of a faulty link takes a few minutes. Link performance varies substantially over time, so it is necessary to repeatedly label data. It is desirable to automate the faulty-link detection procedure by training a classifier. Active learning has the potential to substantially reduce the time an operator must devote to the labelling task each week. For graphs the size of common microwave link networks, the GEEM algorithm can return a query in approximately one to two minutes, so this is an example where the PreGEEM algorithm can compute the next query during the labelling process.

The graph is constructed directly from the physical topology, and it is important because graph-based classification significantly outperforms classification algorithms that ignore the network. The features are link characteristics such as received signal strength and signal distortion metrics. Table 1 provides the statistics of the dataset. We consider an experiment where an initial labelled set of 8 links is provided, and the active learning algorithm must identify query nodes. Table 2 and Figure 5 compare the performance of GEEM, PreGEEM, Random, AGE and ANRMAB. AGE and ANRMAB perform much worse than random selection because the GCN is inaccurate for a small number of labels. GEEM and PreGEEM achieve a small but significant improvement.

6 Conclusion

We have introduced an active learning algorithm for node classification on attributed graphs that uses SGC (a linearized GCN) in an expected error minimization framework. Numerical experiments demonstrate that the proposed method significantly outperforms existing active learning algorithms for attributed graphs without relying on a validation set. We also proposed a preemptive algorithm that can generate a query while the oracle is labelling the previous query, and showed experimentally that this approximation does not degrade performance.

References

  • D. Berberidis and G. B. Giannakis (2018) Data-adaptive active sampling for efficient graph-cognizant classification. IEEE Trans. Signal Processing 66, pp. 5167–5179. Cited by: §2.1, §5.1.
  • H. Cai, V. W. Zheng, and K. C. Chang (2017) Active learning for graph embedding. arXiv preprint arXiv:1705.05085. Cited by: §1, §2.2, §5.1, §5.2.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Advances in Neural Information Processing Systems, Barcelona, Spain, pp. 3844–3852. Cited by: §1.
  • Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep Bayesian active learning with image data. In Proc. Int. Conf. on Machine Learning, Sydney, Australia, pp. 1183–1192. Cited by: §1, §1.
  • H. Gao, Z. Wang, and S. Ji (2018) Large-scale learnable graph convolutional networks. In Proc. Int. Conf. on Knowledge Discovery & Data Mining, London, United Kingdom, pp. 1416–1424. Cited by: §1, §2.2, §5.1.
  • L. Gao, H. Yang, C. Zhou, J. Wu, S. Pan, and Y. Hu (2018) Active discriminative network representation learning. In Proc. Int. Joint Conf. Artificial Intelligence, Stockholm, Sweden, pp. 2142–2148. Cited by: §1.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proc. Advances in Neural Information Processing Systems, Long Beach, CA, US, pp. 1024–1034. Cited by: §1.
  • S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu (2006) Batch mode active learning and its application to medical image classification. In Proc. Int. Conf. on Machine Learning, Pittsburgh, PA, US, pp. 417–424. Cited by: §1.
  • M. Ji and J. Han (2012) A variance minimization criterion to active learning on graphs. In Proc. Int. Conf. Artificial Intelligence and Statistics, La Palma, Canary Islands, pp. 556–564. Cited by: §1, §2.1.
  • K. Jun and R. Nowak (2016) Graph-based active learning: a new look at expected error minimization. In Proc. IEEE Global Conf. Signal and Information Proc., pp. 1325–1329. Cited by: §2.1, §4.1, §4.6, §4.6, §5.1.
  • T. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learning Representations, Toulon, France. Cited by: §1, §2.2, §4.2.
  • T. Kurzendorfer, P. Fischer, N. Mirshahzadeh, T. Pohl, A. Brost, S. Steidl, and A. Maier (2017) Rapid interactive and intuitive segmentation of 3D medical images using radial basis function interpolation. Journal of Imaging 3, pp. 56. Cited by: §1, §1.
  • Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, and L. Song (2019) GeniePath: graph neural networks with adaptive receptive paths. In Proc. AAAI Conf. on Artificial Intelligence, Honolulu, HI, US, pp. 4424–4431. Cited by: §1.
  • Y. Ma, R. Garnett, and J. Schneider (2013) Σ-Optimality for active learning on Gaussian random fields. In Proc. Adv. Neural Inf. Proc. Systems, Lake Tahoe, NV, US, pp. 2751–2759. Cited by: §1, §2.1.
  • T.P. Minka (2002) Bayesian model averaging is not model combination. Note: MIT Media Lab Note Cited by: §4.6.
  • G. Namata, B. London, L. Getoor, and B. Huang (2012) Query-driven active surveying for collective classification. In Proc. Workshop on Mining and Learning with Graphs, Int. Conf. Machine Learning, Cited by: §5.
  • S. Parisot, S. I. Ktena, E. Ferrante, M. Lee, R. Guerrero, B. Glocker, and D. Rueckert (2018) Disease prediction using graph convolutional networks: application to autism spectrum disorder and Alzheimer’s disease. Medical Image Analysis 48, pp. 117–130. Cited by: §1.
  • N. Roy and A. McCallum (2001) Toward optimal active learning through sampling estimation of error reduction. In Proc. Int. Conf. on Machine Learning, San Francisco, CA, USA, pp. 441–448. Cited by: §5.2.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93. Cited by: §5.
  • B. Settles (2009) Active learning literature survey. Computer Sciences Technical Report Technical Report 1648, University of Wisconsin–Madison. Cited by: §1, §1, §4.3.
  • O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018) Pitfalls of graph neural network evaluation. In Relational Representation Learning Workshop, NeurIPS 2018, Montréal, Canada. Cited by: §5.2, §5.
  • H. Sivan, M. Gabel, and A. Schuster (2019) Online linear models for edge computing. In Proc. Eur. Conf. Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Cited by: §4.5.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In Proc. Int. Conf. Learning Representations, Vancouver, Canada. Cited by: §1.
  • F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. In Proc. Int. Conf. Machine Learning, Long Beach, CA, US, pp. 6861–6871. Cited by: §1, §4.2.
  • X. Zhu, Z. Ghahramani, and J. Lafferty (2003a) Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. Int. Conf. on Machine Learning, Washington, DC, USA, pp. 912–919. Cited by: §2.1.
  • X. Zhu, J. Lafferty, and Z. Ghahramani (2003b) Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In Proc. Workshop on The Continuum from Labeled to Unlabeled Data (ICML), Washington, DC, US, pp. 58–65. Cited by: §1, §2.1.
  • C. Zhuang and Q. Ma (2018) Dual graph convolutional networks for graph-based semi-supervised classification. In Proc. Int. World Wide Web Conf., Lyon, France, pp. 499–508. Cited by: §1.