1 Introduction
The performance of a classification model depends on the quality and quantity of training data, often requiring a huge labeling effort. With ever-increasing amounts of data, active learning (AL) is gaining the attention of researchers as well as practitioners as a way to reduce the effort spent on labeling data instances. An AL algorithm selects a set of unlabeled instances based on an informativeness metric, obtains their labels, and updates the labeled dataset. Then the classification model is retrained using the acquired labels. This process is repeated until a desired level of performance (e.g. accuracy) is reached.
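The acquisition loop just described can be sketched as follows. `train`, `acquisition_score`, and `oracle_label` are hypothetical stand-ins for the model trainer, the informativeness metric, and the human annotator; this is an illustrative skeleton, not the method proposed later in the paper.

```python
# Minimal sketch of a pool-based active learning loop.
# `train`, `acquisition_score`, and `oracle_label` are hypothetical stand-ins.

def active_learning_loop(labeled, unlabeled, budget,
                         train, acquisition_score, oracle_label):
    """Iteratively query the most informative unlabeled item and retrain."""
    model = train(labeled)
    for _ in range(budget):
        # Score every unlabeled item and pick the most informative one.
        best = max(unlabeled, key=lambda x: acquisition_score(model, x))
        unlabeled.remove(best)
        labeled.append((best, oracle_label(best)))  # query the oracle
        model = train(labeled)                      # retrain with the new label
    return model
```

In practice the loop terminates either when the budget is exhausted, as here, or when a target performance level is reached.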
In this paper, we consider the task of applying AL to semi-supervised problems. In a semi-supervised learning problem, the learning algorithm can utilize all data instances, including the unlabeled ones; only the labels of the unlabeled instances are unknown. We evaluate our approach on classifying the nodes of attributed graphs. Reducing the number of labeled nodes required for node classification can benefit a variety of practical applications, such as recommender systems
(pinsage18; rubens2015active) and text classification (yao2019graph).
An acquisition function is used to evaluate the informativeness of an unlabeled instance. Since quantifying the informativeness of an instance is not straightforward, a multitude of heuristics has been proposed in the AL literature (settles2009active). For example, uncertainty sampling selects the instances which the model is most uncertain about (houlsby2011bald). The most common method is to select the unlabeled instance with the maximum entropy over the class probabilities predicted by the model. However, such heuristics cannot adapt to the distribution of the data and cannot exploit the inherent characteristics of a given dataset. Often, the performance of heuristic active learners is not consistent across different datasets, and it is sometimes worse than random selection of unlabeled instances.
Compared to applications of AL on image data, only a limited number of AL models have been developed for graph data. Previous work on applying AL to graph data (gu2012towards; bilgic2010networkdata; ji2012variance) depends on earlier classification models such as Gaussian random fields, in which node features are not used. Therefore, selecting query nodes uniformly at random coupled with a recent graph neural network (GNN) model can easily outperform such AL models. AL models that use recent GNN architectures (age2017cai; gao2018active) are limited, and they rely on linear combinations of uncertainty and various heuristics such as node centrality measures.
We overcome this problem by directly incorporating the performance of the classifier into the acquisition function in semi-supervised learning problems. Our work is motivated by the framework of expected error reduction (EER) (roy2001eer; guo2008discriminative; aodha2014hierarchical), in which the objective is to query instances which would maximize the expected performance gain. The original EER formulation is extremely time-consuming and impractical to use with neural network classifiers. We formulate this objective as a bilevel optimization problem and, based on recent advances in meta-learning (finn2017maml), utilize meta-gradients to make this optimization efficient. zugner2018adversarial propose using meta-gradients for modeling an adversarial attack on GNNs. Our motivation in using meta-gradients is the opposite: evaluating the importance of labeling each unlabeled instance. In Section 4, with empirical evidence, we show that our proposed method, MetAL, significantly outperforms existing AL algorithms.
Our contributions are:

We introduce MetAL, a novel active learning algorithm based on the expected error reduction principle.

We discuss the importance of performing exploration in AL and introduce a simple count-based exploration term.

We demonstrate that our proposed algorithm MetAL consistently outperforms state-of-the-art AL algorithms on a variety of real-world graphs.
2 Our Framework
2.1 Problem Setting
In this paper, we apply AL to the multi-class node classification of a given undirected attributed graph $G$ of $N$ nodes. The graph consists of an adjacency matrix $A \in \{0, 1\}^{N \times N}$ and a node attribute matrix $X \in \mathbb{R}^{N \times F}$, where $F$ is the number of attributes. Labels of a small set of nodes are given initially and the labels of the rest of the nodes are unknown. A labeled node is assigned a label in $\{1, \dots, K\}$, where $K$ is the number of classes. The objective of a learner is to learn a function $f$ which predicts the class label of a given test node. This function can be any node classification algorithm; graph neural networks (GNNs) (kipf2017gcn; sgc2019) are commonly used at present. Parameters $\theta$ of the model are estimated by minimizing a loss function, usually using a gradient-based optimization algorithm.
We consider a pool-based active learning setting, in which the labeled dataset $\mathcal{D}_L$ is much smaller than the large pool of unlabeled items $\mathcal{D}_U$. We can acquire the label of any unlabeled item by querying an oracle (e.g. a human annotator) at a uniform cost per item. Suppose we are given a query budget $B$, such that we are allowed to query the labels of at most $B$ unlabeled items. An optimal active learner selects the set of items which maximizes the expected performance gain of the classification model upon retraining it with their labels. Selection of items for querying is done in an iterative manner, such that in each iteration one instance is queried and the model is retrained with its label.
2.2 Optimization Problem
We define our objective as finding the unlabeled items which maximize the likelihood of the labeled instances $\mathcal{D}_L$ while minimizing the uncertainty of the label predictions of the unlabeled instances $\mathcal{D}_U$. For any $x_i \in \mathcal{D}_U$, we estimate this objective after training the model on $x_i$. Training on an item $(x_i, y_i)$ updates the model parameters to $\theta^{+}_i$ such that
$\theta^{+}_i = \operatorname{argmin}_{\theta} \mathcal{L}\left(\mathcal{D}_L \cup \{(x_i, y_i)\}; \theta\right) \quad (1)$
where $\mathcal{L}$ is the loss function (e.g. cross-entropy). We can write our objective as an optimization problem:
$x^{*} = \operatorname{argmin}_{x_i \in \mathcal{D}_U} C(\theta^{+}_i) \quad (2)$
where $C$ is a cost function defined as
$C(\theta) = \mathcal{L}(\mathcal{D}_L; \theta) + H(\mathcal{D}_U; \theta) \quad (3)$
in which we minimize the loss over the labeled instances combined with $H(\mathcal{D}_U; \theta)$, the entropy of the model's predictions on the unlabeled instances.
Since the label $y_i$ of an unlabeled instance is unknown, we compute the expected cost over all possible labels. We rewrite Equation (2) as
$x^{*} = \operatorname{argmin}_{x_i \in \mathcal{D}_U} \sum_{k=1}^{K} P(y_i = k \mid x_i)\, C(\theta^{+}_{i,k}) \quad (4)$
In this case, we select the instance which minimizes the expected value of $C$. Here $\theta^{+}_{i,k}$ denotes the parameters of a model trained with instance $x_i$ having the label $k$.
2.3 Meta-learning Approach
Since the label of an item is unknown, we use the posterior class probabilities $P(y_i = k \mid x_i)$ as a proxy for the unknown label. This approach requires training a separate model for each possible label of each unlabeled item ($K \times |\mathcal{D}_U|$ models in total). Training this many models is prohibitively time-consuming.
To remedy this issue, we estimate the impact of a query $x_i$ with label $k$ by training a model in which label $k$ is upweighted by a small perturbation, such that $\tilde{y}_i = y_i + \epsilon_{i,k}$, where $\epsilon_{i,k}$ is the perturbation added to label $k$ of instance $x_i$. This idea is motivated by the use of perturbations in the feature space for finding training instances responsible for a given prediction (koh2017influence). In contrast, our objective is to find the unlabeled instances which would have the greatest impact on the performance on test instances once their labels are known. We repurpose perturbations to understand the impact an unlabeled instance may have on the model performance if it has the label $k$. We rewrite Equation (1) as
$\theta^{\epsilon}_{i,k} = \operatorname{argmin}_{\theta} \mathcal{L}\left(\mathcal{D}_L \cup \{(x_i, \tilde{y}_i)\}; \theta\right) \quad (5)$
We quantify the impact of retraining the model with $(x_i, \tilde{y}_i)$ added to the labeled set as the change in cost
$\Delta C_{i,k} = C(\theta^{\epsilon}_{i,k}) - C(\theta) \quad (6)$
and the expected change of cost for querying the item $x_i$ by
$\mathbb{E}\left[\Delta C_i\right] = \sum_{k=1}^{K} P(y_i = k \mid x_i)\, \Delta C_{i,k} \quad (7)$
$P(y_i = k \mid x_i)$ is the posterior class probability of the current model, and it is estimated with
$P(y_i = k \mid x_i) \approx \frac{1}{T} \sum_{t=1}^{T} P(y_i = k \mid x_i; \hat{\theta}_t) \quad (8)$
where $\hat{\theta}_t$ are parameter samples obtained via $T$ stochastic forward passes with dropout (MC-dropout).
When $\epsilon_{i,k}$ is arbitrarily small, this change can be computed as the gradient of the cost with respect to the label perturbation $\epsilon_{i,k}$, i.e. $\partial C / \partial \epsilon_{i,k}$. We rewrite Equation (7) using this gradient as
$\mathbb{E}\!\left[\frac{\partial C}{\partial \epsilon_i}\right] = \sum_{k=1}^{K} P(y_i = k \mid x_i)\, \frac{\partial C}{\partial \epsilon_{i,k}} \quad (9)$
The term $\partial C / \partial \epsilon_{i,k}$ quantifies the impact of labeling a query $x_i$ with label $k$. This simplifies the active learning problem to finding the item corresponding to the minimum expected meta-gradient (Equation (9)) such that
$x^{*} = \operatorname{argmin}_{x_i \in \mathcal{D}_U} \mathbb{E}\!\left[\frac{\partial C}{\partial \epsilon_i}\right] \quad (10)$
Here, a negative-valued expected meta-gradient corresponds to a model with a lower expected loss. In other words, we need to find the query which maximizes the negative of the expected gradient ($-\mathbb{E}[\partial C / \partial \epsilon_i]$).
Equation (5) and Equation (10) form a bilevel optimization problem. Calculating the meta-gradients as in Equation (9) involves calculating two gradients in a nested order: the inner one optimizes the model parameters for the perturbed labels, and the outer one calculates the gradient with respect to the perturbation $\epsilon$. Therefore, the expected value of $\partial C / \partial \epsilon$ indirectly depends on $\epsilon$ via the inner optimization. This is similar to the computation of meta-gradients in meta-learning approaches used for few-shot learning (finn2017maml). It should be noted that, unlike in few-shot learning, we calculate meta-gradients with respect to a perturbation added to the labels instead of differentiating with respect to model parameters.
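As a concrete, simplified illustration of this bilevel structure, the sketch below estimates the meta-gradient of the cost in Equation (3) with respect to a label perturbation by finite differences on a linear-softmax classifier trained with plain gradient descent. The model, the uniform pseudo-label, and the finite-difference approximation are all illustrative assumptions; the method described in the text differentiates through the inner optimization directly rather than retraining twice.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train(X, Y_soft, steps=100, lr=0.5):
    """Inner loop of Equation (5): fit a linear-softmax classifier
    to (possibly perturbed) soft labels by gradient descent."""
    W = np.zeros((X.shape[1], Y_soft.shape[1]))
    for _ in range(steps):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y_soft) / len(X)
    return W

def cost(W, X_lab, Y_lab, X_unl):
    """C(theta) of Equation (3): labeled cross-entropy plus the
    entropy of the predictions on the unlabeled instances."""
    P_lab = softmax(X_lab @ W)
    ce = -np.mean(np.sum(Y_lab * np.log(P_lab + 1e-12), axis=1))
    P_unl = softmax(X_unl @ W)
    ent = -np.mean(np.sum(P_unl * np.log(P_unl + 1e-12), axis=1))
    return ce + ent

def meta_gradient(X_lab, Y_lab, X_unl, i, k, eps=1e-3):
    """Finite-difference stand-in for dC/d(eps_{ik}): upweight class k in
    the soft label of unlabeled item i, retrain, and measure the change
    in cost. (Starting from a uniform pseudo-label is a simplifying
    assumption made here for clarity.)"""
    n_cls = Y_lab.shape[1]
    base = np.full((1, n_cls), 1.0 / n_cls)
    pert = base.copy()
    pert[0, k] += eps
    X_aug = np.vstack([X_lab, X_unl[i:i + 1]])
    W0 = train(X_aug, np.vstack([Y_lab, base]))
    W1 = train(X_aug, np.vstack([Y_lab, pert]))
    return (cost(W1, X_lab, Y_lab, X_unl) - cost(W0, X_lab, Y_lab, X_unl)) / eps
```

A negative value indicates that labeling item $i$ as class $k$ is expected to reduce the cost, mirroring the selection rule in Equation (10).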
Calculating $\partial C / \partial \epsilon_{i,k}$ for each unlabeled node with Equation (9) is inefficient for practical applications of this algorithm. We address this problem by selecting the subset of unlabeled items having the highest prediction uncertainty to estimate the model uncertainty in Equation (3), and the remaining unlabeled items as the query items $\mathcal{D}_Q$. We add a small perturbation matrix $\mathcal{E}$ to the labels of the $\mathcal{D}_Q$ items and retrain the model with these perturbed labels. With vector notation we can rewrite Equation (5) as
$\theta^{\mathcal{E}} = \operatorname{argmin}_{\theta} \mathcal{L}\left(\mathcal{D}_L \cup (X_Q, Y_Q + \mathcal{E}); \theta\right) \quad (11)$
Then we calculate the cost $C$ and its gradient with respect to $\mathcal{E}$, $\nabla_{\mathcal{E}} C$. $\nabla_{\mathcal{E}} C$ is a real-valued matrix in which a row corresponds to an unlabeled instance $x_i$ and a column corresponds to a label $k$. For example, the meta-gradient of query instance $x_i$ belonging to class $k$ can be expressed as $[\nabla_{\mathcal{E}} C]_{i,k}$, where the notation $[\cdot]_{i,k}$ denotes the element at the $i$-th row and $k$-th column.
In our experiments, we use the top 10% of unlabeled items with the largest prediction entropy to estimate the model entropy, and the rest of the unlabeled items as $\mathcal{D}_Q$. Our algorithm is shown in Algorithm 1. We select the node corresponding to the minimal expected meta-gradient and retrieve its label from the oracle. We add this node and its label to the labeled set and retrain the model.
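The selection step just described can be sketched as follows. The array names and the helper producing `posterior` and `meta_grad` are assumptions; the split ratio follows the 10% figure above.

```python
import numpy as np

def entropy(P, eps=1e-12):
    return -np.sum(P * np.log(P + eps), axis=1)

def select_query(posterior, meta_grad, reserve_frac=0.1):
    """One acquisition step (sketch).

    posterior : (n_unlabeled, n_classes) posterior class probabilities
    meta_grad : (n_unlabeled, n_classes) meta-gradients, one per (item, label)

    The top `reserve_frac` highest-entropy items are reserved for estimating
    the entropy term of the cost and excluded from querying; among the rest,
    the item with the minimal expected meta-gradient (Equation (9)) is chosen.
    """
    n = len(posterior)
    ent = entropy(posterior)
    n_reserved = max(1, int(reserve_frac * n))
    reserved = set(np.argsort(-ent)[:n_reserved])
    candidates = [i for i in range(n) if i not in reserved]
    # Expected meta-gradient: sum_k P(y_i = k) * dC/d(eps_ik)
    expected = np.sum(posterior * meta_grad, axis=1)
    return min(candidates, key=lambda i: expected[i])
```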
2.4 The Importance of Exploration
After each acquisition step, the classifier is trained on a limited number of labeled instances, which in turn are selected by the active learner. Hence, the labeled instances tend to be biased towards instances evaluated as 'informative' by the active learner. In MetAL, the active learner selects the instance which minimizes the meta-gradient. Therefore, the distribution of labeled instances can be far from the true underlying distribution, and the active learner cannot observe the consequences of selecting an instance with lower 'informativeness'. It is therefore desirable to query a few instances in addition to the ones maximizing our selection criterion. This step is known as 'exploration', while selecting the instance maximizing the criterion is 'exploitation'. Intuitively, an active learner should perform more exploration initially, so it can obtain a better view of the true distribution of the data.
This problem is known as the exploration-exploitation trade-off in sequential decision-making problems. Solving this trade-off requires the learner to acquire potentially suboptimal instances (i.e., exploration) in addition to the optimal ones. The problem is studied under the framework of the multi-armed bandit (MAB) problem (lattimore2020bandit). A multitude of approaches has been used to solve online learning problems modeled as MAB problems; $\epsilon$-greedy, upper confidence bounds (UCB) (auer2002ucb), and Thompson sampling (thompson1933likelihood) are a few of the frequently used techniques. Influenced by count-based approaches proposed for MAB problems, we introduce a simple exploration term in addition to the exploitation performed using the meta-gradients. We define the exploration term of an instance $x_i$ as the logarithm of the number of unlabeled neighboring nodes of $x_i$. This term encourages the learner to sample nodes from less-labeled neighborhoods. Since this term and the gradient calculated in Equation (9) are on different scales, we normalize the negated expected meta-gradient and the exploration term into the range $[0, 1]$ to get $\hat{g}_i$ and $\hat{e}_i$ respectively. We linearly combine these normalized quantities to get the criterion for acquiring nodes as
$x^{*} = \operatorname{argmax}_{x_i \in \mathcal{D}_Q} \left[ (1 - \lambda)\, \hat{g}_i + \lambda\, \hat{e}_i \right] \quad (12)$
where the exploration coefficient $\lambda \in [0, 1]$ is a hyperparameter that balances exploration and exploitation. Setting $\lambda$ to 1 corresponds to pure exploration, disregarding the feedback of the classifier (i.e., the meta-gradient information). On the other hand, $\lambda = 0$ is equivalent to pure exploitation, selecting the node with the minimum expected meta-gradient. We vary the value of $\lambda$ with time, such that more exploration is performed during the initial acquisition steps, followed by more exploitation in later rounds. To achieve this effect, we assume $\lambda$ is sampled from a Beta distribution, and we linearly adjust its shape parameters over the acquisition iterations to shift its mean towards pure exploitation. As shown in zhang2017activedisc, we observe smoother performance compared to setting the value of $\lambda$ deterministically. Figure 1 shows how the value of $\lambda$ varies over time on average.
3 Experiments
3.1 Data
We evaluate our proposed approach on 6 datasets belonging to different domains. CiteSeer, PubMed, and CORA (sen2008cora) are commonly used citation graphs. In each of these graphs, documents are nodes; if one document cites another, the two are linked by an edge. Each node carries the bag-of-words features of its text as its attributes. Coauthor CS and Coauthor Physics are co-authorship graphs constructed from the Microsoft Academic Graph. Nodes are authors, and two authors are linked by an edge if they have co-authored a paper. Node features correspond to the keywords of the papers authored by a particular author, and an author's most active field of study is used as the node label. Amazon Computers is a subgraph of the Amazon co-purchase graph (mcauley2015amazon). Products are represented as nodes, and two nodes are connected by an edge if those two products are frequently bought together. Node features correspond to product reviews encoded as bag-of-words, and the product category is the node label.
For each dataset, we randomly select two nodes belonging to each label as the initial labeled set. We hold out 5% of the remaining unlabeled nodes as the test set; the rest of the unlabeled nodes are available for querying. The size of the initial labeled set and its size as a fraction of the total number of nodes (the labeling rate) are shown in Table 1.
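The split protocol above can be sketched as follows; the function name and return layout are illustrative, not taken from the paper's code.

```python
import random

def make_splits(labels, n_per_class=2, test_frac=0.05, seed=0):
    """Sketch of the split protocol: a few labeled nodes per class,
    a fraction of the remaining nodes held out for testing, and the
    rest forming the pool of query candidates."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    labeled = []
    for nodes in by_class.values():
        labeled += rng.sample(nodes, n_per_class)   # e.g. 2 nodes per class
    labeled_set = set(labeled)
    rest = [i for i in range(len(labels)) if i not in labeled_set]
    rng.shuffle(rest)
    n_test = int(test_frac * len(rest))             # e.g. 5% of the remainder
    return labeled, rest[:n_test], rest[n_test:]    # labeled, test, pool
```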
Dataset  Nodes  Classes  Features  Initial labels (rate %) 

CiteSeer  2110  6  3703  12 (0.56) 
PubMed  19717  3  500  6 (0.03) 
CORA  2485  7  1433  14 (0.56) 
Amazon Computers  13752  10  767  20 (0.14) 
Coauthor Physics  34493  5  8415  10 (0.03) 
Coauthor CS  18333  15  6805  30 (0.16) 
3.2 Model
We evaluate the effectiveness of MetAL, the proposed algorithm, using a two-layer GCN model (kipf2017gcn) with 64 hidden units and SGC (sgc2019), a simplified GNN architecture that has no hidden layer and no nonlinear activation functions. In all experiments, we use the default hyperparameters common in the GNN literature (e.g. learning rate = 0.01). We do not perform any dataset-specific hyperparameter tuning, since hyperparameter tuning while training a model with AL can lead to label inefficiency (ash2019badge). We use the following algorithms in our comparison:
Random: Selects an unlabeled node randomly.

PageRank (PR): Selects the unlabeled node with the largest PageRank centrality value.

Degree: Selects the unlabeled node with the largest degree centrality value.

Entropy: Calculates the entropy of the predictions of the current model over the unlabeled nodes and selects the node with the largest entropy value.

AGE (age2017cai): Selects the node which maximizes a linear combination of three metrics: PageRank centrality, model entropy, and information density.

BALD (gal2017bald; houlsby2011bald): Selects the node which has the largest mutual information between predictions and model posterior.

MetAL: This is our proposed algorithm. We select the node maximizing the quantity in Equation (12).
Here, entropy and BALD are uncertainty-based acquisition functions. For computing the entropy, the mutual information in BALD, and the posterior class probabilities predicted by the current model in MetAL, we use 20 iterations of MC-dropout to approximate a Bayesian model (gal2016dropout). In contrast, centrality metrics such as PageRank and degree centrality can be considered heuristics for selecting 'influential' instances in a graph dataset. Their sequence of acquisitions is determined solely by the graph structure and depends neither on the features of instances nor on the current set of labeled instances.
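For reference, both uncertainty scores can be computed from the same stack of MC-dropout predictions. The sketch below assumes a `(T, n_nodes, n_classes)` array of class probabilities from `T` stochastic forward passes; the function name is illustrative.

```python
import numpy as np

def mc_dropout_scores(probs, eps=1e-12):
    """Uncertainty scores from T stochastic forward passes (MC-dropout).

    probs : (T, n_nodes, n_classes) class probabilities from T dropout passes.
    Returns (predictive entropy, BALD mutual information) per node.
    """
    mean_p = probs.mean(axis=0)                                    # E[p]
    pred_entropy = -np.sum(mean_p * np.log(mean_p + eps), axis=1)  # H[E[p]]
    per_pass = -np.sum(probs * np.log(probs + eps), axis=2)        # H[p_t]
    bald = pred_entropy - per_pass.mean(axis=0)                    # H[E[p]] - E[H[p]]
    return pred_entropy, bald
```

Note how the two scores differ: a node whose predictions are uncertain but identical on every pass gets high entropy yet zero BALD score, since the disagreement between passes is what BALD measures.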
We acquire the label of an unlabeled node and retrain the GNN model by performing 50 steps of the Adam optimizer (kingma2014adam). We perform 40 acquisition steps and repeat this process on 10 different randomly initialized training and test splits for each dataset. We report the average macro-averaged F1 score over the test sets in each experiment. In most cases, average accuracy follows a similar trend. In MetAL, we execute 10 steps of gradient descent with momentum as the inner optimization loop and then calculate the meta-gradient matrix.
4 Results and Discussion
4.1 Comparison of AL Strategies
In Figure 2, we observe that MetAL yields the best performance when the GCN model is used as the node classifier. For clarity of visualization, we do not show degree-centrality sampling, since it exhibits the worst performance of all the acquisition functions. Figure 3 shows that MetAL performs similarly well with SGC as the classifier. However, we observe that the performance of SGC on some datasets is inferior to that of the GCN model; the lack of a hidden layer and of nonlinear activation functions may contribute to the reduced performance. Even though PageRank centrality was proposed as a heuristic for acquiring nodes of a graph in previous work (age2017cai), we observe that its performance is inferior on larger graphs such as the Amazon and co-authorship graphs. The performance of the uncertainty-based active learners (entropy and BALD) is not consistent over different datasets. It is notable that MetAL consistently outperforms AGE, the graph-specific AL benchmark, without relying on time-consuming clustering algorithms. As one of its constituent criteria, AGE computes an information density measure using the learned features of the GNN model. This step requires clustering the unlabeled instances and then calculating the Euclidean distance to the cluster centers, a process which is time-consuming, as evident in Table 2.
Figure 4 shows the results of the ablation studies we performed to understand the impact of the exploration coefficient $\lambda$. Here, we run the acquisition step in Equation (12) with different $\lambda$ values: 0, 0.1, 0.5, and 1.0. A time-dependent $\lambda$ sampled from a Beta distribution works best on most datasets. Notably, selecting nodes solely based on the meta-gradient values ($\lambda = 0$) yields competitive results in most cases, whereas pure exploration ($\lambda = 1$) results in inferior performance. This demonstrates that our proposed meta-gradient criterion is successful in finding 'informative' instances for labeling. The experiment also suggests that performance could be further improved by adaptively updating $\lambda$ based on the feedback of the active learner.
4.2 Running Time
Table 2 lists the average execution time each algorithm spends to acquire a set of 40 unlabeled instances. Even though our proposed approach MetAL consumes additional time compared to the uncertainty-based algorithms, it is several times faster than the graph-specific baseline AGE; for example, MetAL is 20 times faster than AGE on the Coauthor Physics dataset. Further, the ultimate goal of applying AL is to reduce the total human time spent on labeling instances. MetAL achieves this key objective at the cost of a slightly increased acquisition time.
Classifier  Dataset  Random  Entropy  PR  AGE  BALD  MetAL 

GCN  CiteSeer  4.2  4.8  4.8  21.5  4.8  9.7 
PubMed  6.9  7.6  25.4  1125.9  7.9  34.6  
CORA  4.2  4.5  4.6  26.8  4.5  9.8  
Coauthor CS  20.4  22.3  40.8  2154.2  23.7  61.3  
Coauthor Phy.  46.1  50.5  116.4  2436.9  50.8  125.4  
Amazon Comp.  17.5  19.1  31.8  1688.9  19.2  45.2  
SGC  CiteSeer  1.7  1.9  2.1  18.3  1.9  5.4 
PubMed  2.0  2.2  20.0  1229.2  2.2  30.6  
CORA  1.3  1.8  1.8  23.7  1.9  5.5  
Coauthor CS  16.8  19.8  33.2  2098.2  19.8  48.6  
Coauthor Phy.  35.6  40.7  90.4  2232.3  40.8  97.0  
Amazon Comp.  2.2  2.5  17.2  1134.6  2.5  22.0 
5 Related Work
5.1 Graph Neural Networks (GNNs)
GNNs (li2015gated; kipf2017gcn; sgc2019) achieve state-of-the-art performance on the node classification problem, providing a significant improvement over previously used embedding algorithms (perozzi2014deepwalk; planetoid2016revisiting). What sets GNNs apart from previous models is their ability to jointly model both structural information and node attributes. In principle, all GNN models consist of a message-passing scheme that propagates the feature information of a node to its neighbors. Most GNN architectures use a learnable parameter matrix for projecting features to a different feature space. Usually, two or more such layers are used along with a nonlinearity (e.g. ReLU). With the normalized adjacency matrix $\hat{A} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2}$, a two-layer GCN model (kipf2017gcn) can be expressed as
$Z = \operatorname{softmax}\!\left(\hat{A}\, \mathrm{ReLU}(\hat{A} X W^{(0)})\, W^{(1)}\right) \quad (13)$
where $A$ is the adjacency matrix of graph $G$, $\tilde{D}$ is the degree matrix of $A + I$, and $W^{(0)}$ and $W^{(1)}$ are the weight matrices of the two neural layers.
sgc2019 arrived at a much simpler model named SGC by removing the hidden layers and nonlinear activations of the GCN model. This model can be written as
$Z = \operatorname{softmax}\!\left(\hat{A}^{2} X W\right) \quad (14)$
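The forward passes in Equations (13) and (14) can be sketched directly in numpy, assuming dense matrices and given weight matrices (a didactic sketch; real implementations use sparse operations and learned weights):

```python
import numpy as np

def normalized_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, with self-loops added."""
    A_tilde = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN of Equation (13): softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = normalized_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)  # ReLU
    return softmax(A_hat @ H @ W1)

def sgc_forward(A, X, W, k=2):
    """SGC of Equation (14): softmax(A_hat^k X W); no hidden layer,
    no nonlinearity, only repeated propagation."""
    A_hat = normalized_adjacency(A)
    return softmax(np.linalg.matrix_power(A_hat, k) @ X @ W)
```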
5.2 Active Learning
AL research has contributed a multitude of approaches for training supervised learning models with less labeled data. We recommend settles2009active for a detailed review of AL. The objective of most existing AL approaches is to select the most informative instance for labeling. Uncertainty sampling is the most commonly used AL approach. gal2016dropout propose using dropout at evaluation time as a way to estimate the model uncertainty of convolutional neural networks (CNNs). gal2017bald provide a comparison of various acquisition functions for quantifying the model uncertainty of CNN models. The use of meta-learning for AL has been considered in a few recent works (woodward2017active; bachman2017learning). However, these algorithms are designed for the few-shot learning setting and are tied to RNN-based meta-learning models such as matching networks (vinyals2016matching). Additionally, their reliance on reinforcement learning makes training difficult. In contrast, our approach builds on model-agnostic meta-learning (MAML) (finn2017maml), which is efficient and can be used with a variety of supervised loss functions.
6 Conclusion
In this paper we introduced MetAL, a principled approach to performing active learning on graph data. We expressed the semi-supervised active learning problem as a bilevel optimization problem and demonstrated that meta-gradients can be used to make this problem tractable. Empirical performance on benchmark attributed graphs drawn from multiple domains shows that our proposed method is superior to existing heuristics-based AL algorithms. We further showed the importance of performing exploration in addition to exploitation in AL problems. Adaptively learning the exploration coefficient using the feedback from the active learner is an interesting future direction.
In this work, we acquire a single unlabeled instance in each AL step and retrain the classifier. However, acquiring a batch of instances could make the learning process more efficient by reducing the number of retraining steps, and we consider adapting MetAL for batch-mode acquisition another avenue for future work. Additionally, understanding which characteristics of an attributed graph make AL easier or harder is an open research problem; such an understanding would lead to more efficient AL algorithms.