1 Introduction
In many classification tasks there are explicit or implicit relationships between the data points that need to be classified. One can represent such data using a graph, where an edge between two nodes (data points) indicates the presence of a relationship. The resultant task of node classification has attracted significant attention from the graph-learning research community, and numerous graph learning architectures have been developed that yield impressive performance, especially in semi-supervised settings (Defferrard et al., 2016; Kipf and Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Zhuang and Ma, 2018; Gao et al., 2018; Liu et al., 2019). In such cases, knowledge of the graph topology can compensate for scarcity of labelled data.
In practice, the semi-supervised classification task often arises in scenarios where it is challenging or expensive to obtain labels. If we have the opportunity to decide which nodes to query, then we should try to select the most informative nodes that lead to the best classification accuracy. This is, in a nutshell, the goal of active learning; as we acquire labels, we make decisions about which label to query next based on what we have learned. This is important in applications such as medical imaging, where generating labels requires considerable valuable time from domain experts (Hoi et al., 2006; Gal et al., 2017; Kurzendorfer et al., 2017). The development of active learning algorithms for node classification in graphs can be motivated by applications of graph convolutional neural networks (GNNs) in the medical field. For example, in (Parisot et al., 2018), GNNs are used to classify brain scan images, with the goal of predicting disease outcomes or detecting the presence of a disorder. Although we may have access to many brain scans and can specify relationships between them (thus building a graph), obtaining labels for them is expensive because it requires attention from medical experts.
The early research that applies active learning to graph data mainly focuses on the non-attributed graph setting (Zhu et al., 2003b; Ji and Han, 2012; Ma et al., 2013). We focus on node classification for attributed graphs, so the recently proposed GNN-based methods (Cai et al., 2017; Gao et al., 2018) are more aligned with the task we address. The results reported in these works are usually based on GNNs with hyperparameters that have been optimized using a large labelled validation set. This is an unrealistic setting; if we have access to such a large amount of labelled data, we should use much more of it to train the classifier. As we illustrate in the experiments in Section 5, if hyperparameters are not optimized, but are chosen randomly from a reasonable range of candidate values, the performance of the GNN-based active learning methods deteriorates dramatically.
In this work, we aim to address the limitations of the GNN methods. We propose an algorithm that is based on the Expected Error Minimization (EEM) framework (Settles, 2009). In this framework, we select the query that minimizes the expected classification error according to our current model. We use the simplified graph convolution (SGC) (Wu et al., 2019), a graph-cognizant logistic regression, as a predictive model for the labels. This model, which can be derived as a linearization of the graph convolutional neural network (Wu et al., 2019), performs much better when there is limited data compared to a GNN with suboptimal hyperparameters, and achieves competitive accuracy as the number of labels increases.
Most active learning techniques involve initial training of a model and then an iterative process of (i) identifying the best query by some criterion (the core step of active learning); (ii) obtaining the label from an oracle; and (iii) updating the model. In an interactive application, this can lead to a delay if the query generation of step (i) is not extremely fast. Although it is principled and competitive with other approaches, the EEM algorithm does have the disadvantage of an increased computational overhead. However, we note that a delay will also be introduced at step (ii); human labelling can take from seconds (document categorization (Settles, 2009)) to minutes (cancer diagnosis from skin lesion images (Gal et al., 2017), MRI tumour segmentation (Kurzendorfer et al., 2017), or fault detection in microwave link networks). With this in mind, the interactive delay can be reduced or even eliminated if the model update and query identification steps can be started and completed while the labelling is conducted. We develop such a preemptive strategy, based on a prediction of the labelling from the oracle.
In summary, our paper makes the following contributions: (i) we propose a practical approach for active learning in graphs that does not have the unrealistic requirement of a validation set for hyperparameter tuning; (ii) we extend the proposed approach to introduce preemptive query generation in order to reduce or eliminate the delay experienced by a labeller during interaction with the system; (iii) we derive bounds on the error in risk evaluation associated with the preemptive prediction; (iv) we analyze performance on five public benchmark datasets and show a significant improvement compared to state-of-the-art GNN active learning methods (and label propagation strategies); (v) we illustrate the practical benefit of our method by demonstrating its application to a private, commercial dataset collected for the task of identifying faulty links in a microwave link network.
2 Related Research
2.1 Active learning on non-attributed graphs
Many methods for active learning on graphs without node or edge features are based on the idea of propagating label information across the graph, and we hence refer to them throughout the paper as label-propagation methods. The most successful techniques are all based on the Binary Markov Random Field (BMRF) model. The model allows one to define a posterior on the unknown labels conditioned on the graph topology and the observed node labels. This model provides an effective mechanism for representing smoothness of labels with respect to the graph, but evaluating the posterior is a combinatorial problem. As a result, researchers have introduced relaxations or approximation strategies. (Zhu et al., 2003a) relax the BMRF to a Gaussian Random Field (GRF) model, and (Zhu et al., 2003b; Ji and Han, 2012) and subsequently (Ma et al., 2013) also employ this model to derive active learning methods. More recently, (Berberidis and Giannakis, 2018) have applied the expected change maximization strategy to a GRF model. (Jun and Nowak, 2016) take another approach by proposing a two-step approximation (TSA) of the intractable combinatorial problem rather than relaxing the BMRF model.
These strategies offer the advantage of being principled methods that directly target the quantity we want to optimize, but label-propagation-based models cannot take into account node features and consequently must rely on strong assumptions regarding the relationships between the graph topology and the data. Most label propagation methods struggle if the graph is not connected and do not usually translate well to an inductive setting, because query decisions rely on the knowledge of the complete graph topology.
2.2 Graph neural network methods for active learning
(Cai et al., 2017) leverage the output of a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) to design active learning metrics. Their method is to alternate during the training of the GCN between adding one node to the training set and performing one epoch of training. Selection of the query node is based on a score that is a weighted mixture of three metrics covering different active learning strategies: an uncertainty metric, a density-based metric and a graph centrality metric. The uncertainty metric is obtained by taking the entropy of the softmax output given by the current GCN model. The density metric is based on the GCN node embeddings; the embeddings are clustered and the distance between each node's embedding and the centre of its cluster is computed. A more central embedding indicates a more representative node. The graph centrality metric is independent of the GCN and only relies on the position of the node in the graph. The weights change as more nodes are added to the labelled set, in order to reflect the increased confidence in the two metrics that are derived from the output of the GCN. The weight adaptation schedule in (Cai et al., 2017) is fixed; (Gao et al., 2018) propose an alternative multi-armed bandit algorithm that learns how to balance the contributions of the different metrics. They argue that this mechanism can better adapt to the varying natures of different datasets.
3 Problem Setting
3.1 Pool-based Formulation
We consider the problem of active learning on an attributed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ for node classification using feature matrix $X$ and labels $Y$. The nodes are partitioned into two sets: a small initial labelled set $\mathcal{L}_0$ from $\mathcal{V}$ with node labels $Y_{\mathcal{L}_0}$, and a set $\mathcal{U}_0$ consisting of the remaining unlabelled nodes. The algorithm is given a budget of $B$ nodes that it can query from $\mathcal{U}_0$ to augment $\mathcal{L}_0$. We denote by $\mathcal{L}_t$ and $\mathcal{U}_t$ the sets of labelled and unlabelled nodes, respectively, after $t$ nodes have been added to the initial labelled set. The pool-based active learning formulation that we are considering consists of three phases:

Prediction Step: The graph, the features $X$, and the current node labels, $Y_{\mathcal{L}_t}$, are used to infer the labels of the nodes in $\mathcal{U}_t$.

Query Step: Until the budget is exhausted, select a node $q_t \in \mathcal{U}_t$ to query and to add to the labelled set $\mathcal{L}_t$.

Labelling Step: The oracle takes time to label $q_t$. We update the sets: $\mathcal{L}_{t+1} = \mathcal{L}_t \cup \{q_t\}$, $\mathcal{U}_{t+1} = \mathcal{U}_t \setminus \{q_t\}$.

The goal is to select the best node $q_t$ to append to $\mathcal{L}_t$ at each iteration $t$, in order to optimize the prediction performance throughout the query process. We are not only interested in the end result after exhausting the query budget, but also in how quickly we can increase accuracy. Acquiring labels is presumed to be expensive, so a solution that reaches competitive performance with fewer nodes is desired.
In addition to the transductive setting outlined above, where we know the entire graph and all attributes, we also consider an inductive setting, where our goal is to maximize performance over an additional set of nodes; we know that these are connected to the graph in some fashion, but we cannot query them and we do not know the edges or features during the active learning process.
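The predict, query, and label phases of the pool-based formulation can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the callables `predict`, `select_query`, and `oracle` are hypothetical placeholders supplied by the caller, and the toy majority-vote predictor and smallest-identifier query rule exist only to make the example runnable.

```python
def pool_based_active_learning(labelled, unlabelled, budget,
                               predict, select_query, oracle):
    """Run the predict / query / label loop for at most `budget` rounds.

    labelled:   dict mapping node id -> known label
    unlabelled: set of node ids that may be queried
    The three callables are hypothetical, caller-supplied components.
    """
    labelled = dict(labelled)
    unlabelled = set(unlabelled)
    queried = []
    for _ in range(budget):
        if not unlabelled:
            break
        predictions = predict(labelled, unlabelled)           # prediction step
        q = select_query(labelled, unlabelled, predictions)   # query step
        labelled[q] = oracle(q)                               # labelling step
        unlabelled.remove(q)
        queried.append(q)
    return labelled, unlabelled, queried


# Toy stand-ins (assumptions for illustration only).
TRUE_LABELS = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a"}


def majority_predict(labelled, unlabelled):
    votes = list(labelled.values())
    guess = max(sorted(set(votes)), key=votes.count)
    return {node: guess for node in unlabelled}


def smallest_id(labelled, unlabelled, predictions):
    return min(unlabelled)


final_labelled, remaining, queried = pool_based_active_learning(
    {0: "a"}, {1, 2, 3, 4}, 2, majority_predict, smallest_id, TRUE_LABELS.get
)
```

Any query rule with the same signature can replace `smallest_id`; an active learning method differs from random sampling only in how that function is defined.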
3.2 Reducing interaction delay
With the phases outlined above, the active learning algorithm stalls while waiting for the oracle to label the current query node $q_t$ at the third phase, and then the oracle must wait while the algorithm computes the best subsequent query node $q_{t+1}$. This is inefficient and, for a human oracle, frustrating. We can address this by requiring the active learning algorithm to identify the query node $q_{t+1}$ using the label set $Y_{\mathcal{L}_t}$, which does not yet include the label of $q_t$. If the labelling time and the query computation time are similar, then neither the oracle nor the algorithm stalls for long. While the oracle is generating the label for $q_t$, this preemptive active learning algorithm identifies in parallel the best query node $q_{t+1}$ using $Y_{\mathcal{L}_t}$. Figure 1 compares the timelines of the standard single-query active learning procedure (query generation algorithm waits for the oracle and vice versa) with the preemptive strategy where labelling and query generation are performed in parallel.
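The benefit of overlapping labelling and query generation can be illustrated with a back-of-the-envelope timing model. The sketch below assumes constant per-step labelling and query times, which is a simplification, and the example values (120 s to label, 90 s to compute a query) are hypothetical.

```python
def sequential_time(budget, t_label, t_query):
    """Total wall-clock time when labelling and query generation alternate."""
    return budget * (t_label + t_query)


def preemptive_time(budget, t_label, t_query):
    """Total wall-clock time when the next query is computed during labelling.

    The first query cannot be overlapped with anything. Each of the next
    budget - 1 rounds costs the slower of the two activities. The last
    label only requires the labelling time, since no further query is needed.
    """
    if budget <= 0:
        return 0
    return t_query + (budget - 1) * max(t_label, t_query) + t_label


# Hypothetical example: 10 queries, 120 s per label, 90 s per query computation.
seq = sequential_time(10, 120, 90)
pre = preemptive_time(10, 120, 90)
```

Under this model the oracle never waits as long as the query computation is no slower than the labelling, which is the regime the preemptive strategy targets.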
4 Methodology
4.1 Expected Error Minimization (EEM)
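An active learning algorithm based on error reduction selects the query to minimize the expected error. For a classification task, the zero-one error is a suitable choice. Denote by $\mathcal{U}_t^{-q}$ the set of unlabelled nodes after $t$ iterations of active learning with node $q$ removed, i.e., $\mathcal{U}_t^{-q} = \mathcal{U}_t \setminus \{q\}$. The labels associated with this set are $Y_{\mathcal{U}_t^{-q}}$. Following (Jun and Nowak, 2016), we can define $R^{+q}_{|Y_{\mathcal{L}_t}}$, the risk of adding node $q$ given the current known label set $Y_{\mathcal{L}_t}$, as:

$$R^{+q}_{|Y_{\mathcal{L}_t}} = \mathbb{E}_{y_q}\left[\, \mathbb{E}_{Y_{\mathcal{U}_t^{-q}}}\left[ \frac{1}{|\mathcal{U}_t^{-q}|} \sum_{i \in \mathcal{U}_t^{-q}} \mathbb{1}[\hat{y}_i \neq y_i] \;\middle|\; Y_{\mathcal{L}_t}, y_q \right] \;\middle|\; Y_{\mathcal{L}_t} \right] \tag{1}$$

Here $\hat{y}_i$ is the label prediction at node $i$. We thus calculate expected error by summing error probabilities over the unlabelled set, minus the node $q$ we are considering. Define $\varphi_{i,\mathcal{L}_t,+q,y_k} \triangleq 1 - \max_{k' \in \mathcal{K}} p(y_i = k' \mid Y_{\mathcal{L}_t}, y_q = k)$, where $\mathcal{K}$ is the set of classes. If the query node $q$ has label $k$, then $\varphi_{i,\mathcal{L}_t,+q,y_k}$ represents the probability of making an error in the prediction of the label of node $i$. If we can compute the distribution $p(y \mid Y_{\mathcal{L}_t})$, we can evaluate the risk of querying $q$:

$$R^{+q}_{|Y_{\mathcal{L}_t}} = \frac{1}{|\mathcal{U}_t^{-q}|} \sum_{k \in \mathcal{K}} p(y_q = k \mid Y_{\mathcal{L}_t}) \sum_{i \in \mathcal{U}_t^{-q}} \varphi_{i,\mathcal{L}_t,+q,y_k} \tag{2}$$

The query algorithm selects the risk-minimizing node $q^*$:

$$q^* = \arg\min_{q \in \mathcal{U}_t} R^{+q}_{|Y_{\mathcal{L}_t}} \tag{3}$$

It remains to define the probabilistic model $p(y \mid Y_{\mathcal{L}_t})$.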
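The risk of a single candidate query can be computed directly once the required predictive probabilities are available. The sketch below assumes the risk is the expected error averaged over the remaining unlabelled nodes: for each possible label of the query node, sum one minus the largest predicted class probability over the other unlabelled nodes, then weight by the probability of that label. The function name `eem_risk`, the input layout, and the averaging normalization are assumptions made for illustration; since the normalization is the same for every candidate, it does not affect which node minimizes the risk.

```python
def eem_risk(p_query, conditional_probs):
    """Expected zero-one error after querying one candidate node.

    p_query[k]              : probability that the query node has class k
    conditional_probs[k][i] : class-probability vector for remaining node i,
                              under a model that assumes the query has class k
    """
    n_remaining = len(conditional_probs[0])
    risk = 0.0
    for p_k, probs_given_k in zip(p_query, conditional_probs):
        # Probability of misclassifying node i is 1 - max class probability.
        expected_errors = sum(1.0 - max(p_i) for p_i in probs_given_k)
        risk += p_k * expected_errors
    return risk / n_remaining


# Hypothetical numbers: two classes, two remaining unlabelled nodes.
p_query = [0.5, 0.5]
conditional_probs = [
    [[0.9, 0.1], [0.6, 0.4]],   # predictions if the query node has class 0
    [[0.2, 0.8], [0.5, 0.5]],   # predictions if the query node has class 1
]
risk = eem_risk(p_query, conditional_probs)
```

Here the two label hypotheses give expected error counts of 0.5 and 0.7, so the averaged risk is (0.5 * 0.5 + 0.5 * 0.7) / 2 = 0.3.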
4.2 Graphcognizant logistic regression
We propose to use a graph-cognizant logistic regression model to obtain the predictive distribution $p(y \mid Y_{\mathcal{L}_t})$. Such a model was introduced by (Wu et al., 2019), where the SGC is derived as a simplified (linearized) version of the graph convolutional network of (Kipf and Welling, 2017). (Wu et al., 2019) showed that the simplified model can achieve competitive performance for a significantly lower computational cost. In the EEM approach to active learning, we must learn a new model for every potential query node, so it is essential that the computational cost is relatively low. The SGC meets our requirements: its computational requirements are moderate and it takes into account the graph structure and node features.
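For a graph with adjacency matrix $A$, let $\tilde{A} = A + I$, $D$ be the degree matrix, and $\tilde{D} = D + I$. We then define $S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. This can be interpreted as a degree-normalized symmetrized adjacency matrix (after self-loops have been added by the identity matrix $I$). The prediction model has the form:

$$\hat{Y} = \sigma(S^K X W) \tag{4}$$

where $W$ is the weight matrix and $\sigma$ maps scores to class probabilities (the softmax in (Wu et al., 2019)). We define $\tilde{X} = S^K X$; this can be interpreted as graph-based preprocessing of node features. The parameter $K$ controls the number of hops that are considered when generating the final node representation. Usually using a 2-hop ($K = 2$) neighborhood yields good results.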
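The graph-based preprocessing of node features can be sketched in a few lines. The example below follows the SGC recipe of (Wu et al., 2019): add self-loops, normalize symmetrically by degree, and multiply the features by the normalized matrix once per hop. It uses plain Python lists to stay dependency-free; the function name `sgc_features` and the three-node path graph are illustrative assumptions, and a practical implementation would use sparse matrices.

```python
import math


def sgc_features(adjacency, features, hops=2):
    """Return S^hops X for a dense adjacency matrix given as nested lists."""
    n = len(adjacency)
    # Add self-loops: A_tilde = A + I.
    a_tilde = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Degrees of the self-looped graph.
    degree = [sum(row) for row in a_tilde]
    # Symmetric normalization: S = D_tilde^{-1/2} A_tilde D_tilde^{-1/2}.
    s = [[a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    x = [[float(v) for v in row] for row in features]
    n_features = len(x[0])
    for _ in range(hops):
        x = [[sum(s[i][m] * x[m][c] for m in range(n)) for c in range(n_features)]
             for i in range(n)]
    return x


# Path graph 0 - 1 - 2 with a single feature that is non-zero only at node 0.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [[1.0], [0.0], [0.0]]
one_hop = sgc_features(A, X, hops=1)
two_hop = sgc_features(A, X, hops=2)
```

After one hop the feature has spread only to the direct neighbour of node 0; after two hops node 2 also receives a non-zero value, which illustrates how the number of hops sets the size of the neighbourhood that contributes to each node representation.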
4.3 Graph EEM (GEEM)
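Using the SGC model, we can compute a risk for each query node. At each step $t$, we use the current known labels $Y_{\mathcal{L}_t}$ to find the weights $W_{\mathcal{L}_t}$ by minimizing the error for predictions $\sigma(\tilde{X}_{\mathcal{L}_t} W)$, where $\tilde{X}_{\mathcal{L}_t}$ contains the preprocessed features of the labelled nodes. We use a standard iterative algorithm for maximum likelihood logistic regression with L2 regularization (e.g., the scikit-learn liblinear solver). We can then compute $p(y_q = k \mid Y_{\mathcal{L}_t}) = \sigma(\tilde{x}_q W_{\mathcal{L}_t})^{(k)}$, where the index $(k)$ indicates that we extract the $k$-th element of the vector. Then for each candidate node $q$, for each possible class $k$, we solve:

$$W_{\mathcal{L}_t,+q,y_k} = \arg\min_{W} \; \ell\big(\sigma(\tilde{X}_{\mathcal{L}_t \cup \{q\}} W),\, (Y_{\mathcal{L}_t}, y_q = k)\big) \tag{5}$$

Here $\ell$ denotes the L2-regularized logistic regression loss, and the notation $+q, y_k$ indicates that we are adding node $q$ to the labelled set and assigning it label $k$. For the adopted model, $p(y_i = k' \mid Y_{\mathcal{L}_t}, y_q = k) = \sigma(\tilde{x}_i W_{\mathcal{L}_t,+q,y_k})^{(k')}$. The node to query is then the one that minimizes the risk:

$$q^* = \arg\min_{q \in \mathcal{U}_t} \frac{1}{|\mathcal{U}_t^{-q}|} \sum_{k \in \mathcal{K}} \sigma(\tilde{x}_q W_{\mathcal{L}_t})^{(k)} \sum_{i \in \mathcal{U}_t^{-q}} \Big(1 - \max_{k' \in \mathcal{K}} \sigma(\tilde{x}_i W_{\mathcal{L}_t,+q,y_k})^{(k')}\Big) \tag{6}$$

where $\mathcal{U}_t^{-q} = \mathcal{U}_t \setminus \{q\}$ and $\mathcal{K}$ is the set of classes.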
From this formulation, we can see that we first have to evaluate $p(y_q = k \mid Y_{\mathcal{L}_t})$, then retrain the model for each of the potential augmented labelled sets, one per pair of candidate node $q$ and class $k$. This implies that we have a computational complexity of $O(|\mathcal{U}_t|\,|\mathcal{K}|\,C)$, where $\mathcal{U}_t$ is the current unlabelled set, $\mathcal{K}$ is the set of classes, and $C$ represents the complexity associated with training the model. For logistic regression, this is the overhead involved in learning the weights $W$. If training is computationally expensive, then the time required to select a query node can become prohibitive. This is a common disadvantage of expected error reduction strategies (Settles, 2009). It then becomes apparent why using the linearized version of the GCN is important.
The proposed algorithm requires the choice of very few hyperparameters (only the number of hops and logistic regression parameters). This contrasts with the active learning approaches based on graph neural networks, where there are multiple hyperparameters that must be selected, and suboptimal choices can have a major impact on performance.
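The selection procedure of this section can be sketched as a double loop over candidate nodes and candidate labels, with one model refit per pair. The sketch below is illustrative rather than the authors' code: `geem_select` and `toy_fit_predict` are hypothetical names, and the toy distance-weighted voter merely stands in for the L2-regularized logistic regression on SGC features so that the example runs without external libraries.

```python
import math


def geem_select(labelled, unlabelled, classes, fit_predict):
    """Return the candidate with the smallest expected error, and that error.

    fit_predict(labelled) must return {node: [class probabilities]} for all
    nodes, after fitting a model to the given node -> class-index labels.
    """
    current = fit_predict(labelled)                 # one fit on the current labels
    best_q, best_risk = None, float("inf")
    for q in sorted(unlabelled):
        others = [i for i in unlabelled if i != q]
        risk = 0.0
        for k in classes:
            augmented = dict(labelled)
            augmented[q] = k                        # hypothesize that q has label k
            probs = fit_predict(augmented)          # one refit per (q, k) pair
            expected_errors = sum(1.0 - max(probs[i]) for i in others)
            risk += current[q][k] * expected_errors
        risk /= max(len(others), 1)
        if risk < best_risk:
            best_q, best_risk = q, risk
    return best_q, best_risk


# Toy stand-in model (an assumption): nodes sit on a line and each labelled
# node votes for its class with a weight that decays with distance.
POSITIONS = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
CALLS = {"n": 0}


def toy_fit_predict(labelled):
    CALLS["n"] += 1
    out = {}
    for i, x in POSITIONS.items():
        scores = [1.0, 1.0]                         # uniform smoothing
        for j, k in labelled.items():
            scores[k] += 4.0 * math.exp(-abs(x - POSITIONS[j]))
        total = sum(scores)
        out[i] = [s / total for s in scores]
    return out


query, risk = geem_select({0: 0, 4: 1}, {1, 2, 3}, [0, 1], toy_fit_predict)
```

With three candidates and two classes, the model is fitted 1 + 3 × 2 = 7 times for a single query, which shows why a cheap-to-train classifier matters for this strategy.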
4.4 Preemptive Query (PreGEEM)
In many practical active learning scenarios, labelling is performed by a human, and it is often time-consuming; labelling a single data point (node) can take tens of seconds or minutes. It is desirable to have the next query identified as soon as the labeller has completed the labelling task. With the EEM algorithm formulated above, this is impossible, because the query identification in (6) uses the label associated with the previous query node.
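In this section, we outline an alternative approach that performs preemptive query calculation, using the labelling time to identify the next node to query. Instead of waiting for the oracle to label $q_{t-1}$ to start the identification of $q_t$ during iteration $t$, we propose to approximate the risk before knowing $y_{q_{t-1}}$. The direct approach is to replace the risk with the expectation over the possible values of $y_{q_{t-1}}$, but this increases the computational complexity by a factor of $|\mathcal{K}|$, the number of classes, which is highly undesirable. To avoid this penalty, we further approximate using the value of risk for the label at the mode of $p(y_{q_{t-1}} \mid Y_{\mathcal{L}_{t-1}})$. Effectively, we use the predicted label $\hat{y}_{q_{t-1}}$ of the previous model to form an augmented labelled set $\tilde{Y}_{\mathcal{L}_t} = Y_{\mathcal{L}_{t-1}} \cup \{\hat{y}_{q_{t-1}}\}$ and define an approximate risk:

$$\tilde{R}^{+q}_{|\tilde{Y}_{\mathcal{L}_t}} = \frac{1}{|\mathcal{U}_t^{-q}|} \sum_{k \in \mathcal{K}} p(y_q = k \mid \tilde{Y}_{\mathcal{L}_t}) \sum_{i \in \mathcal{U}_t^{-q}} \Big(1 - \max_{k' \in \mathcal{K}} p(y_i = k' \mid \tilde{Y}_{\mathcal{L}_t}, y_q = k)\Big) \tag{7}$$

where $\mathcal{U}_t^{-q} = \mathcal{U}_t \setminus \{q\}$. The query node is then $q_t = \arg\min_{q \in \mathcal{U}_t} \tilde{R}^{+q}_{|\tilde{Y}_{\mathcal{L}_t}}$. We call this new approach the Preemptive Graph EEM (PreGEEM).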
Figure 2 compares the evolution of the two active learning algorithms GEEM and PreGEEM for a small subset of nodes to illustrate how using the approximated risk can impact the query process. We see that the evaluated risks are similar, and although the ordering of query nodes differs, after five steps the same nodes have been selected by both algorithms.
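The preemptive loop can be sketched with a worker thread that plays the role of the oracle while the main thread computes the next query from a label set augmented with the predicted label. This is a schematic under stated assumptions: `preemptive_loop`, `predict_probs`, and `select_query` are hypothetical names, the query rule is pluggable (any risk-based rule could be passed in), and the toy rule used here simply picks the smallest node identifier.

```python
from concurrent.futures import ThreadPoolExecutor


def preemptive_loop(labelled, unlabelled, budget, predict_probs, select_query, oracle):
    """Query `budget` nodes, computing each next query while the oracle labels.

    predict_probs(labelled, node) -> class-probability list for `node`
    select_query(labelled, unlabelled) -> node to query next
    """
    labelled, unlabelled = dict(labelled), set(unlabelled)
    queried = []
    if budget <= 0 or not unlabelled:
        return labelled, queried
    q = select_query(labelled, unlabelled)           # first query: true labels only
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(budget):
            pending = pool.submit(oracle, q)         # oracle starts labelling q
            unlabelled.discard(q)
            next_q = None
            if step + 1 < budget and unlabelled:
                probs = predict_probs(labelled, q)
                # Mode of the predictive distribution stands in for the true label.
                guess = max(range(len(probs)), key=probs.__getitem__)
                augmented = dict(labelled)
                augmented[q] = guess
                next_q = select_query(augmented, unlabelled)
            labelled[q] = pending.result()           # true label replaces the guess
            queried.append(q)
            if next_q is None:
                break
            q = next_q
    return labelled, queried


# Toy stand-ins (assumptions for illustration only).
TRUE = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}
labelled_out, queried = preemptive_loop(
    {0: 0}, {1, 2, 3, 4}, 3,
    lambda labelled, node: [0.5, 0.5],
    lambda labelled, unlabelled: min(unlabelled),
    TRUE.get,
)
```

Only the predicted label enters the computation of the next query; the stored label set always receives the oracle's answer, so a wrong guess affects at most the choice of the following query.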
4.5 Bounds on the PreGEEM Risk Error
We now present bounds on the risk estimation error that can arise by using a predicted label. We focus on the one-step error, where the labelled sets differ for only one label. For clarity, we first derive a bound for the binary classification task. We then state the more general bound for multiclass classification. Complete proofs are provided in the supplementary material.
The following proposition, which follows straightforwardly from Theorem 1 and Lemma 1 in (Sivan et al., 2019), bounds the difference in prediction values for two label sets that differ for data points. The bound is expressed in terms of the regularization parameter and the graph-preprocessed feature vectors.
Proposition 1.
For weights and derived by penalized logistic regression to two datasets with common feature vectors but label sets differing for vectors indexed by , define and . For any , .
The following result characterizes the risk error when we perform regularized binary logistic regression on to derive weights for . Define , and .
Theorem 4.1.
The risk error arising from applying binary L2-regularized logistic regression with regularization parameter to two labelled datasets and that differ by one label, associated with the node , is bounded as:
Sketch of Proof.
We define the random variable
which takes value with probability for . The difference in risk is:(8) 
where denotes expectation over conditioned on the observed label set, either or . For query node , for each , we learn weights and using and , respectively. For each , we have:
(9) 
Here the first inequality follows from the definition of and the property that for , . The second inequality follows from Proposition (1), observing that the labels can differ only for nodes and .
Observe that for random variables and taking values in and , respectively, . Applying this to (8) and then employing (9) leads to the stated bound on the risk error.
∎
The following bound applies for the case of multiclass classification. For a given label set , we learn weights for each class using L2-regularized binary one-vs-all logistic regression. The output prior to normalization for a given feature vector is then . We then normalize by dividing by to obtain a probability vector. Let be the weight vector learned for class using label data . Let for . Define
Theorem 4.2.
The risk error arising from multiclass regression performed via repeated one-vs-all L2-regularized logistic regression with regularization parameter to labelled datasets and that differ by one label, associated with the node , is bounded as:
(10) 
for .
The proof is similar to that of Theorem 4.1, but more involved, and is provided in the supplementary material.
4.6 Combined method
The most extreme case of active learning is when we start with only one labelled node. In this scenario, the logistic regression model cannot make useful predictions until at least some nodes have been queried. To address this scenario, we combine our algorithm with a label-propagation method. The aim is to first use label propagation when very few node labels are available, then switch to a combination of both algorithms when more information is available, and finally transition to the more accurate graph-cognizant logistic regression. Bayesian model averaging provides a mechanism to make this transition (Minka, 2002).
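In Bayesian model averaging, we have $M$ different classifiers $\mathcal{M}_1, \dots, \mathcal{M}_M$ and our belief is that one of these models is correct. We start with a prior $p(\mathcal{M}_i)$ over each model. After observing data $\mathcal{D}$, we compute the model evidence $p(\mathcal{D} \mid \mathcal{M}_i)$. Using Bayes' rule, we can compute the posterior $p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)$ and then weight the predictions:

$$p(y \mid \mathcal{D}) = \sum_{i=1}^{M} p(y \mid \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i \mid \mathcal{D})$$

In the context of expected error minimization active learning, we need to evaluate the risk associated with a query. We introduce a model-dependent zero-one risk $R^{+q}_{|Y_{\mathcal{L}_t}, \mathcal{M}_i}$, and in our combined method, we employ a model-averaged risk:

$$R^{+q}_{|Y_{\mathcal{L}_t}} = \sum_{i=1}^{M} p(\mathcal{M}_i \mid Y_{\mathcal{L}_t})\, R^{+q}_{|Y_{\mathcal{L}_t}, \mathcal{M}_i} \tag{11}$$

In order to compute (or approximate) this expression, we need to evaluate $p(\mathcal{M}_i \mid Y_{\mathcal{L}_t})$. Assuming that we have equal prior belief in the models available to us, this is equivalent to calculating the marginal likelihood $p(Y_{\mathcal{L}_t} \mid \mathcal{M}_i)$, up to normalization across the models.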
We incorporate two models, one based on label propagation and the other based on logistic regression. For the binary random field model that underpins the label propagation classifiers, there are no learnable model parameters (there is one fixed hyperparameter). Evaluating the evidence is thus equivalent to computing under the BMRF model. This is a combinatorial problem, but we can factorize the joint probability into a chain rule of conditionals and use the same two-stage approximation (TSA) that is employed in (Jun and Nowak, 2016). Additional details are provided in the supplementary material. We denote this evidence approximation by .
For the logistic regression model, we are using . The joint probability of the complete labelled set is then evaluated as , where is the categorical index of . We then have . To calculate the evidence, we thus need to integrate over the weight matrix , which is not analytically tractable. We choose to approximate by , and we denote this as . This leads to a sufficiently accurate approximation of the evidence for our purpose (which is just to achieve an adaptive balance between label propagation and SGC).
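We use the TSA algorithm (Jun and Nowak, 2016) as the label-propagation-based estimator, leading to a new combined approach for selecting the query node. We solve

$$q^* = \arg\min_{q \in \mathcal{U}_t} \Big( \hat{p}_{\mathrm{SGC}}\, R^{+q}_{\mathrm{SGC}} + \hat{p}_{\mathrm{TSA}}\, R^{+q}_{\mathrm{TSA}} \Big)$$

where $R^{+q}_{\mathrm{SGC}}$ and $R^{+q}_{\mathrm{TSA}}$ denote the risks of querying $q$ under the two models, and $\hat{p}_{\mathrm{SGC}}$ and $\hat{p}_{\mathrm{TSA}}$ are our normalized estimates for the model evidences for the SGC model and the TSA label propagation model, respectively, after observing the data $Y_{\mathcal{L}_t}$.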
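The evidence-weighted combination of the two risks can be sketched as follows. The example assumes that each model provides a per-candidate risk and a log-evidence estimate, normalizes the evidences so that they sum to one, and picks the candidate with the smallest weighted risk. The function names and all numbers are hypothetical.

```python
import math


def normalized_evidence(log_evidences):
    """Turn log-evidences into weights that sum to one (numerically stable)."""
    m = max(log_evidences)
    weights = [math.exp(v - m) for v in log_evidences]
    total = sum(weights)
    return [w / total for w in weights]


def combined_query(risks_sgc, risks_tsa, log_ev_sgc, log_ev_tsa):
    """Pick the candidate that minimizes the evidence-weighted risk."""
    w_sgc, w_tsa = normalized_evidence([log_ev_sgc, log_ev_tsa])
    combined = {q: w_sgc * risks_sgc[q] + w_tsa * risks_tsa[q] for q in risks_sgc}
    best = min(combined, key=combined.get)
    return best, combined


# Hypothetical risks for two candidate nodes under each model.
risks_sgc = {1: 0.30, 2: 0.20}
risks_tsa = {1: 0.10, 2: 0.40}

# Equal evidence: both models contribute equally.
q_equal, combined_equal = combined_query(risks_sgc, risks_tsa, 0.0, 0.0)

# Much larger evidence for the SGC model: its risk dominates the decision.
q_sgc, combined_sgc = combined_query(risks_sgc, risks_tsa, 0.0, -10.0)
```

With equal evidences the combined risks are 0.20 and 0.30, so node 1 is selected; when the SGC evidence dominates, the decision follows the SGC risks and node 2 is selected. This mirrors the intended transition from label propagation to logistic regression as labels accumulate.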
Figure 3: Comparison of performance of active learning algorithms for Experiment 1 on (a) Cora, (b) Citeseer, and (c) Amazon-photo. Each point on a curve shows the mean classification accuracy achieved across 20 random partitions after the labelled set has been expanded to the indicated number of nodes. The shaded regions indicate 5/95 confidence intervals on the means derived using bootstrap.
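A bootstrap interval on a mean accuracy can be obtained by resampling the per-trial accuracies with replacement. The sketch below uses the percentile bootstrap, which is an assumption, since the text does not state which bootstrap variant was used; the function name, the number of resamples, and the accuracy values are likewise hypothetical.

```python
import random


def bootstrap_mean_interval(values, n_resamples=2000, lower=5.0, upper=95.0, seed=0):
    """Percentile-bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Mean of each resample drawn with replacement from the observed values.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(lower / 100.0 * (n_resamples - 1))
    hi_idx = int(upper / 100.0 * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical accuracies from 20 random partitions.
accuracies = [0.78, 0.81, 0.79, 0.80, 0.77, 0.82, 0.80, 0.79, 0.83, 0.78,
              0.80, 0.81, 0.79, 0.76, 0.82, 0.80, 0.79, 0.81, 0.78, 0.80]
low, high = bootstrap_mean_interval(accuracies)
```

Fixing the seed makes the interval reproducible; the 5th and 95th percentiles of the resampled means give the lower and upper ends of the shaded region.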
5 Experiments
We examine performance using five of the node classification benchmarks in (Shchur et al., 2018). Cora, Citeseer (Sen et al., 2008) and Pubmed (Namata et al., 2012) are citation datasets. Nodes represent journal articles and an undirected edge is included when one article cites another. The node features are bag-of-words representations of article content. Amazon-Photo and Amazon-Computers are graphs based on customers' co-purchase history records. For each dataset we isolate the largest connected component in the graph following (Shchur et al., 2018). The dataset statistics are shown in Table 1.
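Table 1: Dataset statistics.

| Dataset | Classes | Features | Nodes | Edges | Edge density |
|---|---|---|---|---|---|
| Cora | 7 | 1,433 | 2,485 | 5,069 | 0.04% |
| Citeseer | 6 | 3,703 | 2,110 | 3,668 | 0.04% |
| Pubmed | 3 | 500 | 19,717 | 44,324 | 0.01% |
| Am-comp. | 10 | 767 | 13,381 | 245,778 | 0.07% |
| Am-photo | 8 | 745 | 7,487 | 119,043 | 0.11% |
| Microwave | 2 | 19 | 322 | 5,753 | 5.54% |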
5.1 Baselines
We compare the following active learning algorithms: (i) Random: This baseline chooses a node to query by uniform random selection, and then performs classification using SGC; (ii) AGE: The graph neural network based algorithm proposed by (Cai et al., 2017); (iii) ANRMAB: The graph neural network algorithm proposed by (Gao et al., 2018), in which a multi-armed bandit is used to adapt the weights assigned to the different metrics used when constructing the score to choose a query node; (iv) TSA: The label-propagation algorithm based on a two-stage approximation of the BMRF model (Jun and Nowak, 2016); (v) EC-TV, EC-MSD: The label-propagation algorithms based on a Gaussian random field approximation to the BMRF model (Berberidis and Giannakis, 2018); (vi) GEEM: The proposed algorithm based on SGC and expected error minimization; (vii) PreGEEM: The proposed algorithm with preemptive queries; (viii) Combined: The proposed combined algorithm that uses Bayesian model averaging to adaptively merge SGC and label propagation in an EEM framework.
5.2 Experimental Settings
For each experiment, we report the average over 20 trials with different random partitions. All GCNs and SGCs have 2 layers. The weight-adapting parameter of AGE is set to the values in (Cai et al., 2017) and to 0.995 for non-included datasets. For the larger datasets, Am-Photo, Am-Comp and Pubmed, we reduce computational complexity for GEEM and PreGEEM by evaluating risk using a subset of 500 nodes, selected randomly in an approach similar to (Roy and McCallum, 2001). This has minimal impact on performance. The GCN hyperparameters are set to the values found by (Shchur et al., 2018) to be the best performing hyperparameter configurations. Early stopping is not employed because access to a validation set is not a reasonable assumption in an active learning setting. We also include a "non-optimized" version of AGE; this is because, in practice, we would usually not have access to the tuned hyperparameters provided by (Shchur et al., 2018), because these are derived using a large validation set. For the non-optimized version of AGE, the hyperparameter configuration for each trial was randomly selected from the values considered in the grid search of (Shchur et al., 2018).
Experiment 1: Initial Labelled Set, Transductive: Each algorithm is initially provided with a small set of randomly chosen labelled nodes. We evaluate performance on a set of test nodes comprising 20% of the unlabelled set. The algorithms cannot query nodes from this evaluation set. Algorithms have access to the entire topology and all node features. For the Cora and Citeseer datasets, we start with of labelled nodes. We reduce this to for the larger datasets to achieve similar initial set sizes.
Experiment 2: Single Labelled Node, Transductive and Inductive: Algorithms start with a single random labelled node. We examine two settings. In the transductive setting, algorithms know the entire graph and can access all features. Performance is assessed over all unlabelled nodes. In the inductive setting, a portion of the graph is held out for testing; the algorithms do not have access to the features and topology information for these nodes.
5.3 Results and Discussion
Experiment 1: Figure 3 and Table 2 show how the algorithms' accuracies change as more nodes are added to an initial labelled set of size 10-20 nodes. Label propagation algorithm performances are not shown because they are outperformed by the GCN methods for this scenario. For all presented datasets, the proposed algorithms outperform the other GCN-based methods. This holds even for the cases when the hyperparameters of the GCN have been optimized using a validation set. When the hyperparameters are not tuned, the performance of the AGE algorithm deteriorates dramatically; in that case it is better to choose the query node randomly. AGE outperforms the ANRMAB algorithm for the datasets where its weight-adapting parameter was tuned (Cora, Citeseer and Pubmed). The Random baseline and the proposed GEEM method use the same classifier, so they differ only in the nodes that are queried. Choosing an informative set of nodes using our proposed methods leads to a substantially improved accuracy in all cases. For the Cora dataset, the optimized GCN classifier initially outperforms the SGC model. However, GEEM quickly overtakes it as more nodes are queried, showing that the selection algorithm is more effective. Comparing PreGEEM performance against GEEM, it is clear that the approximation has very little impact; there is no clear performance difference between the two.
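Table 2: Classification accuracy (%) for Experiment 1 as a function of the query budget. The final column reports accuracy at the budget shown in parentheses after each dataset name.

| Dataset (final budget) | Algorithm | 0 | 1 | 10 | 30 | Final |
|---|---|---|---|---|---|---|
| Cora (60) | GEEM* | 39.6 | 46.5 | 69.8* | 77.2* | 79.9 |
| | PreGEEM* | 39.6 | 46.5 | 68.2* | 77.1* | 80.3 |
| | Random | 39.6 | 40.2 | 49.7 | 63.0 | 73.3 |
| | AGE | 46.6 | 52.7 | 61.6 | 74.9 | 79.8 |
| | ANRMAB | 46.6 | 47.5 | 59.1 | 72.7 | 78.1 |
| Citeseer (60) | GEEM* | 40.5 | 49.7* | 65.8* | 71.2 | 72.8 |
| | PreGEEM* | 40.5 | 49.7* | 66.5* | 71.8 | 73.3 |
| | Random | 40.5 | 44.1 | 53.8 | 64.4 | 70.4 |
| | AGE | 41.2 | 44.7 | 60.5 | 69.1 | 71.4 |
| | ANRMAB | 41.2 | 44.1 | 55.7 | 64.6 | 69.4 |
| Pubmed (40) | GEEM* | 52.3 | 58.1 | 72.6 | 77.6 | 78.7 |
| | PreGEEM* | 52.3 | 58.1 | 69.3 | 77.2 | 78.0 |
| | Random | 52.3 | 54.1 | 64.7 | 72.3 | 73.9 |
| | AGE | 57.3 | 60.8 | 70.4 | 76.7 | 78.1 |
| | ANRMAB | 57.3 | 58.8 | 69.5 | 74.1 | 75.7 |
| Am-photo (40) | GEEM* | 59.6* | 64.3* | 82.4* | 89.2* | 90.7* |
| | PreGEEM* | 59.6* | 64.3* | 80.3* | 88.8* | 89.6 |
| | Random | 59.6* | 61.4* | 72.0 | 82.3 | 87.6 |
| | AGE | 45.5 | 52.0 | 51.5 | 67.8 | 69.3 |
| | ANRMAB | 45.5 | 50.6 | 62.6 | 67.8 | 70.0 |
| Am-comp. (40) | GEEM* | 54.6* | 59.8* | 68.8* | 74.8* | 76.8* |
| | PreGEEM* | 54.6* | 59.8* | 68.4* | 76.5* | 77.5* |
| | Random | 54.6* | 57.7* | 65.9 | 72.8 | 73.3 |
| | AGE | 47.1 | 41.5 | 51.6 | 52.4 | 53.3 |
| | ANRMAB | 47.1 | 49.4 | 54.6 | 58.7 | 58.5 |
| Microwave (60) | GEEM* | 76.4 | 77.5 | 80.1 | 82.9* | 86.0* |
| | PreGEEM* | 76.4 | 77.5 | 80.3 | 82.4* | 86.0* |
| | Random | 76.4 | 76.7 | 79.1 | 81.0 | 83.9 |
| | AGE | 69.1 | 68.3 | 70.3 | 70.3 | 75.1 |
| | ANRMAB | 69.1 | 67.2 | 72.3 | 73.5 | 73.2 |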
Experiment 2: Figure 4 compares the performance of the proposed Combined method with the label propagation algorithms. In the transductive setting, the proposed method is much better than Random selection. Since it incorporates the TSA technique, its performance is similar to TSA when few nodes have been queried. As the number of labels increases, there starts to be a small but significant improvement in accuracy. The inability of the label propagation methods to adapt to the inductive setting is shown clearly in Figures 4(b) and 4(d). In order to choose effective nodes to query, these methods need to know the topology of the entire graph. By contrast, the Combined method, which incorporates graph-based logistic regression, achieves similar performance in both inductive and transductive settings.
5.4 Practical Application
To give a concrete motivating application for PreGEEM, we also report the results of experiments on a private company dataset obtained from measurements of a microwave link network. Currently, faulty links are identified by human operators who must process lengthy performance log files. The identification or labelling of a faulty link takes a few minutes. Link performances vary substantially over time, so it is necessary to repeatedly label data. It is desirable to automate the faulty link detection procedure by training a classifier. Active learning has the potential to substantially reduce the time an operator must devote to the labelling task each week. For graphs the size of common microwave link networks, the GEEM algorithm can return a query in approximately one to two minutes, so this is an example where the PreGEEM algorithm can compute the next query during the labelling process.
The graph is constructed directly from the physical topology and is important because graph-based classification significantly outperforms classification algorithms that ignore the network. The features are link characteristics such as received signal strength and signal distortion metrics. Table 1 provides the statistics of the dataset. We consider an experiment where an initial labelled set of 8 links is provided, and the active learning algorithm must identify query nodes. Table 2 and Figure 5 compare the performance of GEEM, PreGEEM, Random, AGE and ANRMAB. AGE and ANRMAB perform much worse than random selection because the GCN is inaccurate for a small number of labels. GEEM and PreGEEM achieve a small but significant improvement over random selection.
6 Conclusion
We have introduced an active learning algorithm for node classification on attributed graphs that uses SGC (a linearized GCN) in an expected error minimization framework. Numerical experiments demonstrate that the proposed method, which does not rely on a validation set, significantly outperforms existing active learning algorithms for attributed graphs. We also proposed a preemptive algorithm that can generate a query while the oracle is labelling the previous query, and showed experimentally that this approximation does not degrade performance.
References
 Data-adaptive active sampling for efficient graph-cognizant classification. IEEE Trans. Signal Processing 66, pp. 5167–5179. Cited by: §2.1, §5.1.
 Active learning for graph embedding. arXiv preprint arXiv:1705.05085. Cited by: §1, §2.2, §5.1, §5.2.
 Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Advances in Neural Information Processing Systems, Barcelona, Spain, pp. 3844–3852. Cited by: §1.
 Deep Bayesian active learning with image data. In Proc. Int. Conf. on Machine Learning, Sydney, Australia, pp. 1183–1192. Cited by: §1, §1.
 Large-scale learnable graph convolutional networks. In Proc. Int. Conf. on Knowledge Discovery & Data Mining, London, United Kingdom, pp. 1416–1424. Cited by: §1, §2.2, §5.1.

 Active discriminative network representation learning. In Proc. Int. Joint Conf. Artificial Intelligence, Stockholm, Sweden, pp. 2142–2148. Cited by: §1.
 Inductive representation learning on large graphs. In Proc. Advances in Neural Information Processing Systems, Long Beach, CA, US, pp. 1024–1034. Cited by: §1.
 Batch mode active learning and its application to medical image classification. In Proc. Int. Conf. on Machine Learning, Pittsburgh, PA, US, pp. 417–424. Cited by: §1.

 A variance minimization criterion to active learning on graphs. In Proc. Int. Conf. Artificial Intelligence and Statistics, La Palma, Canary Islands, pp. 556–564. Cited by: §1, §2.1.
 Graph-based active learning: a new look at expected error minimization. In Proc. IEEE Global Conf. Signal and Information Proc., pp. 1325–1329. Cited by: §2.1, §4.1, §4.6, §4.6, §5.1.
 Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learning Representations, Toulon, France. Cited by: §1, §2.2, §4.2.

 Rapid interactive and intuitive segmentation of 3D medical images using radial basis function interpolation. Journal of Imaging 3, pp. 56. Cited by: §1, §1.
 GeniePath: graph neural networks with adaptive receptive paths. In Proc. AAAI Conf. on Artificial Intelligence, Honolulu, HI, US, pp. 4424–4431. Cited by: §1.
 Optimality for active learning on Gaussian random fields. In Proc. Adv. Neural Inf. Proc. Systems, Lake Tahoe, NV, US, pp. 2751–2759. Cited by: §1, §2.1.
 Bayesian model averaging is not model combination. Note: MIT Media Lab Note Cited by: §4.6.
 Query-driven active surveying for collective classification. In Proc. Workshop on Mining and Learning with Graphs, Int. Conf. Machine Learning. Cited by: §5.
 Disease prediction using graph convolutional networks: application to autism spectrum disorder and Alzheimer’s disease. Medical Image Analysis 48, pp. 117–130. Cited by: §1.
 Toward optimal active learning through sampling estimation of error reduction. In Proc. Int. Conf. on Machine Learning, San Francisco, CA, USA, pp. 441–448. Cited by: §5.2.
 Collective classification in network data. AI Magazine 29 (3), pp. 93. Cited by: §5.
 Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. Cited by: §1, §1, §4.3.
 Pitfalls of graph neural network evaluation. In Relational Representation Learning Workshop, NeurIPS 2018, Montréal, Canada. Cited by: §5.2, §5.
 Online linear models for edge computing. In Proc. Eur. Conf. Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD). Cited by: §4.5.
 Graph attention networks. In Proc. Int. Conf. Learning Representations, Vancouver, Canada. Cited by: §1.
 Simplifying graph convolutional networks. In Proc. Int. Conf. Machine Learning, Long Beach, CA, US, pp. 6861–6871. Cited by: §1, §4.2.
 Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. Int. Conf. on Machine Learning, Washington, DC, USA, pp. 912–919. Cited by: §2.1.
 Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In Proc. Workshop on The Continuum from Labeled to Unlabeled Data (ICML), Washington, DC, US, pp. 58–65. Cited by: §1, §2.1.
 Dual graph convolutional networks for graph-based semi-supervised classification. In Proc. Int. World Wide Web Conf., Lyon, France, pp. 499–508. Cited by: §1.