MetAL: Active Semi-Supervised Learning on Graphs via Meta Learning

07/22/2020, by Kaushalya Madhawa, et al.

The objective of active learning (AL) is to train classification models with fewer labeled instances by selecting only the most informative instances for labeling. AL algorithms designed for other data types, such as images and text, do not perform well on graph-structured data. Although a few heuristic AL algorithms have been proposed for graphs, a principled approach is lacking. In this paper, we propose MetAL, an AL approach that selects unlabeled instances that directly improve the future performance of a classification model. For a semi-supervised learning problem, we formulate the AL task as a bilevel optimization problem. Based on recent work in meta-learning, we use meta-gradients to approximate the impact that retraining the model with any one unlabeled instance would have on model performance. Using multiple graph datasets belonging to different domains, we demonstrate that MetAL efficiently outperforms existing state-of-the-art AL algorithms.


1 Introduction

The performance of a classification model depends on the quality and quantity of its training data, which often requires a substantial labeling effort. With ever-increasing amounts of data, active learning (AL) is gaining the attention of researchers and practitioners as a way to reduce the effort spent on labeling data instances. An AL algorithm selects a set of unlabeled instances based on an informativeness metric, obtains their labels, and updates the labeled dataset. The classification model is then retrained using the acquired labels. This process is repeated until a desirable level of performance (e.g., accuracy) is reached.
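As a rough illustration, the loop described above can be sketched as follows; `train`, `acquisition_score`, and `oracle` are hypothetical stand-ins, not part of any particular implementation:

```python
def active_learning_loop(pool, labeled, oracle, train, acquisition_score, budget):
    """Generic pool-based AL loop: query the most informative instance,
    acquire its label, retrain, and repeat until the budget is spent."""
    model = train(labeled)
    for _ in range(budget):
        # Pick the unlabeled instance with the highest informativeness score.
        query = max(pool, key=lambda x: acquisition_score(model, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))   # acquire its label
        model = train(labeled)                   # retrain on the enlarged set
    return model
```

The concrete choice of `acquisition_score` is what distinguishes AL algorithms; the rest of the loop is shared.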

In this paper, we consider the task of applying AL to semi-supervised problems. In a semi-supervised learning problem, the learning algorithm can utilize all data instances, including the unlabeled ones; only the labels of the unlabeled instances are unknown. We evaluate our approach on classifying nodes of attributed graphs. Reducing the number of labeled nodes required for node classification can benefit a variety of practical applications, such as recommender systems (pinsage18; rubens2015active) and text classification (yao2019graph).

An acquisition function is used to evaluate the informativeness of an unlabeled instance. Since quantifying the informativeness of an instance is not straightforward, a multitude of heuristics have been proposed in the AL literature (settles2009active). For example, uncertainty sampling selects the instances the model is most uncertain about (houlsby2011bald). The most common method is to select the unlabeled instance with the maximum entropy over the class probabilities predicted by the model. However, such heuristics are not flexible enough to adapt to the distribution of the data and cannot exploit the inherent characteristics of a given dataset. As a result, the performance of heuristic active learners is often inconsistent across datasets, sometimes falling below random selection of unlabeled instances.
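A minimal sketch of this maximum-entropy selection rule (the instance names and dictionary format are illustrative only):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def max_entropy_query(predictions):
    """predictions: {instance_id: list of class probabilities}.
    Returns the instance the model is most uncertain about."""
    return max(predictions, key=lambda i: entropy(predictions[i]))
```

A uniform distribution has the largest entropy, so an instance predicted as [0.5, 0.5] is queried before one predicted as [0.9, 0.1].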

Compared to applications of AL on image data, only a limited number of AL models have been developed for graph data. Previous work on applying AL to graph data (gu2012towards; bilgic2010networkdata; ji2012variance) depends on earlier classification models such as Gaussian random fields, which do not use node features. Consequently, selecting query nodes uniformly at random coupled with a recent graph neural network (GNN) model can easily outperform such AL models. AL models that use recent GNN architectures (age2017cai; gao2018active) are few, and they rely on linear combinations of uncertainty and various heuristics such as node centrality measures.

We overcome this problem by directly incorporating the performance of the classifier into the acquisition function for semi-supervised learning problems. Our work is motivated by the framework of expected error reduction (EER) (roy2001eer; guo2008discriminative; aodha2014hierarchical), in which the objective is to query the instances that maximize the expected performance gain. The original EER formulation is extremely time consuming and impractical to use with neural network classifiers. We formulate this objective as a bilevel optimization problem and, based on recent advances in meta-learning (finn2017maml), utilize meta-gradients to make the optimization efficient. zugner2018adversarial propose using meta-gradients to model an adversarial attack on GNNs; our motivation for using meta-gradients is the opposite: evaluating the importance of labeling each unlabeled instance. In Section 4, we show with empirical evidence that MetAL significantly outperforms existing AL algorithms.

Our contributions are:

  1. We introduce MetAL, a novel active learning algorithm based on the expected error reduction principle.

  2. We discuss the importance of performing exploration in AL and introduce a simple count-based exploration term.

  3. We demonstrate that our proposed algorithm MetAL consistently outperforms state-of-the-art AL algorithms on a variety of real-world graphs.

2 Our Framework

2.1 Problem Setting

In this paper, we apply AL to the multi-class node classification of a given undirected attributed graph G with N nodes. The graph consists of an adjacency matrix A ∈ {0, 1}^{N×N} and a node attribute matrix X ∈ R^{N×d}, where d is the number of attributes. Labels of a small set of nodes L are given initially, and the labels of the rest of the nodes are unknown. A labeled node v_i is assigned a label y_i ∈ {1, …, K}, where K is the number of classes. The objective of a learner is to learn a function f_θ which predicts the class label of a given test node. This function can be any node classification algorithm; graph neural networks (GNNs) (kipf2017gcn; sgc2019) are commonly used at present. The parameters θ of the model are estimated by minimizing a loss function, usually with a gradient-based optimization algorithm.

We consider a pool-based active learning setting, in which the labeled dataset L is much smaller than a large pool of unlabeled items U. We can acquire the label of any unlabeled item by querying an oracle (e.g., a human annotator) at a uniform cost per item. Suppose we are given a query budget B, such that we are allowed to query the labels of at most B unlabeled items. An optimal active learner selects the set of items that maximizes the expected performance gain of the classification model upon retraining it with their labels. Items are queried iteratively: in each iteration, one instance is queried and the model is retrained with its label.

2.2 Optimization Problem

We define our objective as finding the unlabeled items which maximize the likelihood of the labeled instances L while minimizing the uncertainty of the label predictions for the unlabeled instances U. For any x ∈ U, we estimate this objective after training the model on L ∪ {(x, y)}. Training on an item x updates the model parameters θ to θ_x such that

θ_x = arg min_θ ℓ(θ; L ∪ {(x, y)})    (1)

where ℓ is the loss function (e.g., cross-entropy). We can write our objective as an optimization problem:

x* = arg min_{x ∈ U} C(θ_x)    (2)

where C is a cost function defined as

C(θ) = Σ_{(x_i, y_i) ∈ L} ℓ(f_θ(x_i), y_i) + H(θ; U)    (3)

in which we minimize the loss over the labeled instances combined with H(θ; U), the entropy of the model's predictions on the unlabeled instances.

Since the label y of an unlabeled instance x is unknown, we compute the expected cost over all possible labels. We rewrite Equation (3) as

E[C(θ_x)] = Σ_{c=1}^{K} P(y = c | x; θ) C(θ_{x,c})    (4)

In this case, we select the instance which minimizes the expected value of C. Here θ_{x,c} denotes the parameters of a model trained with instance x assigned the label c.
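The expectation in Equation (4) can be sketched as follows; `posterior` and `retrain_cost` are hypothetical stand-ins for the model's predictive distribution and for the cost obtained after retraining with (x, c) added to the labeled set:

```python
def expected_cost(posterior, retrain_cost, x):
    """Eq. (4): expectation of the retraining cost over the current
    model's posterior P(y = c | x)."""
    return sum(p_c * retrain_cost(x, c) for c, p_c in enumerate(posterior(x)))

def select_query(pool, posterior, retrain_cost):
    """EER-style selection: pick the instance minimizing expected cost."""
    return min(pool, key=lambda x: expected_cost(posterior, retrain_cost, x))
```

Note that a naive implementation calls `retrain_cost` once per (instance, label) pair, which is exactly the inefficiency the meta-gradient approximation of Section 2.3 removes.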

2.3 Meta-learning Approach

Since the label of an item is unknown, we use the posterior class probabilities P(y = c | x; θ) as a proxy. This approach requires training a separate model for each possible label of each unlabeled item, i.e., |U| × K models. Training this many models is prohibitively time consuming.

To remedy this issue, we estimate the impact of a query x with label c by training a model in which the pair (x, c) is upweighted by a small perturbation ε_c. This idea is motivated by the use of perturbations for finding the training instances responsible for a given prediction (koh2017influence). In contrast, our objective is to find the unlabeled instances which would have the greatest impact on test performance once their labels are known; we re-purpose perturbations to understand the impact an unlabeled instance x may have on model performance if it has the label c. We rewrite Equation (1) as

θ_ε = arg min_θ [ ℓ(θ; L) + ε_c ℓ(f_θ(x), c) ]    (5)

We quantify the impact of retraining the model with (x, c) added to the labeled set as the change in cost

Δ(x, c) = C(θ_ε) − C(θ)    (6)

and the expected change in cost for querying the item x as

Δ(x) = Σ_{c=1}^{K} P(y = c | x; θ) Δ(x, c)    (7)

P(y = c | x; θ) is the posterior class probability of the current model, and it is estimated with T stochastic (MC-dropout) forward passes:

P(y = c | x; θ) ≈ (1/T) Σ_{t=1}^{T} f_{θ_t}(x)_c    (8)

When ε is arbitrarily small, this change can be computed as the gradient of the cost with respect to the label perturbation ε_c, i.e., ∂C(θ_ε)/∂ε_c. We rewrite Equation (7) using this meta-gradient as

Δ(x) ≈ Σ_{c=1}^{K} P(y = c | x; θ) ∂C(θ_ε)/∂ε_c    (9)

The term ∂C(θ_ε)/∂ε_c quantifies the impact of labeling a query x with class c. This simplifies the active learning problem to finding the item with the minimum expected meta-gradient (Equation (9)):

x* = arg min_{x ∈ U} Δ(x)    (10)

Here, a negative expected meta-gradient corresponds to a model with a lower expected cost. In other words, we need to find the query which maximizes the negative of the expected meta-gradient, −Δ(x).

Equation (5) and Equation (10) form a bilevel optimization problem. Calculating the meta-gradients in Equation (9) involves two gradient computations in a nested order: the inner one optimizes the model parameters for the perturbed labels, and the outer one computes the gradient with respect to the perturbation ε. The cost C therefore depends on ε indirectly, via θ_ε. This is similar to the computation of meta-gradients in meta-learning approaches for few-shot learning (finn2017maml). Note that, unlike in few-shot learning, we calculate meta-gradients with respect to a perturbation added to the labels instead of differentiating with respect to the model parameters.
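To make the label-perturbation idea concrete, here is a toy sketch on a one-parameter logistic model. It approximates the meta-gradient by a finite difference rather than by differentiating through the inner optimization as MetAL does, and all names, data, and constants are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def inner_train(labeled, x_u, y_soft, lr=0.1, steps=10):
    """Inner problem of Eq. (5): a few gradient steps on a 1-parameter
    logistic model, fit to the labeled pairs plus the candidate point
    x_u carrying a (perturbed) soft label y_soft."""
    w = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in labeled)
        grad += (sigmoid(w * x_u) - y_soft) * x_u
        w -= lr * grad
    return w

def outer_cost(w, labeled):
    """Cross-entropy on the labeled set (the entropy term of Eq. (3)
    is omitted in this toy)."""
    return -sum(y * math.log(sigmoid(w * x)) +
                (1 - y) * math.log(1.0 - sigmoid(w * x))
                for x, y in labeled)

def meta_gradient(labeled, x_u, eps=1e-3):
    """Central finite difference for dC/d(eps), where eps nudges the
    soft label of x_u toward class 1 during inner training. A negative
    value suggests labeling x_u as class 1 would reduce the cost."""
    c_plus = outer_cost(inner_train(labeled, x_u, 0.5 + eps), labeled)
    c_minus = outer_cost(inner_train(labeled, x_u, 0.5 - eps), labeled)
    return (c_plus - c_minus) / (2 * eps)
```

On a linearly separable toy set {(1.0, 1), (−1.0, 0)}, nudging a point at x = 2.0 toward class 1 reinforces the decision boundary (negative meta-gradient), while the same nudge at x = −2.0 conflicts with it (positive meta-gradient).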

Calculating Δ(x) for each unlabeled node with Equation (9) is too inefficient for practical applications of this algorithm. We address this problem by selecting a subset U_H of unlabeled items with high prediction uncertainty to estimate the model uncertainty term in Equation (3), and treating the remaining unlabeled items as the query candidates U_Q. We add a small perturbation to the labels of the U_Q items and retrain the model with these perturbed labels. In vector notation, we can rewrite Equation (5) as

θ_E = arg min_θ [ ℓ(θ; L) + Σ_{x ∈ U_Q} Σ_{c=1}^{K} E_{x,c} ℓ(f_θ(x), c) ]    (11)

We then calculate the cost C(θ_E) and its gradient with respect to E, ∇_E C. Here ∇_E C is a real-valued |U_Q| × K matrix, in which a row corresponds to an unlabeled instance and a column corresponds to a label; for example, the meta-gradient of query instance x_i for class c is the element (∇_E C)_{i,c} at the i-th row and c-th column.

In our experiments, we use the top 10% of unlabeled items with the largest prediction entropy to estimate the model entropy (U_H) and the rest of the unlabeled items as U_Q. Our algorithm is shown in Algorithm 1. We select the node with the minimum expected meta-gradient, retrieve its label from the oracle, add the node and its label to the labeled set, and retrain the model.
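The 10%/90% split described above can be sketched as follows (the `probs` mapping and the 10% fraction are the only inputs; everything else is illustrative):

```python
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def split_pool(probs, frac=0.10):
    """Partition the unlabeled pool: the `frac` items with the largest
    prediction entropy estimate the model entropy (U_H); the rest are
    the query candidates (U_Q). `probs` maps item -> class probabilities."""
    ranked = sorted(probs, key=lambda v: entropy(probs[v]), reverse=True)
    k = max(1, int(len(ranked) * frac))
    return ranked[:k], ranked[k:]
```

With this split, only the U_Q rows appear in the perturbation matrix E of Equation (11), keeping its size manageable.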

2.4 The Importance of Exploration

After each acquisition step, the classifier is trained on a limited number of labeled instances, which are in turn selected by the active learner. Hence, the labeled instances tend to be biased toward instances the active learner evaluates as ‘informative’. In MetAL, the active learner selects the instance which minimizes the meta-gradient, so the distribution of labeled instances can drift far from the true underlying distribution. Moreover, the active learner cannot observe the consequences of selecting an instance it deems less ‘informative’. It is therefore desirable to query a few instances in addition to the ones maximizing our selection criterion. This step is known as ‘exploration’, while selecting the instance maximizing the criterion is ‘exploitation’. Intuitively, an active learner should perform more exploration initially, so that it obtains a better view of the true distribution of the data.

This problem is known as the exploration-exploitation tradeoff in sequential decision-making, and it is studied under the framework of multi-armed bandits (MAB) (lattimore2020bandit). Solving this tradeoff requires the learner to acquire potentially sub-optimal instances (i.e., exploration) in addition to the optimal ones. A multitude of approaches is used to solve online learning problems modeled as MAB problems: ε-greedy, upper confidence bounds (UCB) (auer2002ucb), and Thompson sampling (thompson1933likelihood) are among the most frequently used techniques. Influenced by the count-based approaches proposed for MAB problems, we introduce a simple exploration term in addition to the exploitation performed using the meta-gradients. We define the exploration term φ(v) of an instance v as the logarithm of the number of unlabeled neighboring nodes of v. This term encourages the learner to sample nodes from less-labeled neighborhoods. Since this term and the meta-gradient calculated in Equation (9) are on different scales, we normalize both quantities into the [0, 1] range, obtaining φ̃(v) and g̃(v) respectively. We linearly combine these normalized quantities to obtain the criterion for acquiring nodes as

a(v) = λ φ̃(v) + (1 − λ)(−g̃(v))    (12)

where the exploration coefficient λ ∈ [0, 1] is a hyper-parameter that balances exploration and exploitation. Setting λ to 1 corresponds to pure exploration, disregarding the feedback of the classifier (i.e., the meta-gradient information). On the other hand, λ = 0 is equivalent to pure exploitation, selecting the node with the minimum meta-gradient. We vary the value of λ with time, so that more exploration is performed during the initial acquisition steps, followed by more exploitation in later rounds. To achieve this effect, we sample λ from a Beta distribution, λ ∼ Beta(α, β), and linearly increase β over the acquisition iterations so that the expected value of λ decreases. As shown in zhang2017activedisc, we observe smoother performance compared to setting the value of λ deterministically. Figure 1 shows how the average value of λ varies over time.
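A sketch of the acquisition criterion in Equation (12) with a decaying Beta schedule. The exact schedule Beta(1, 1 + step) and the +1 inside the logarithm (which keeps the exploration term finite for nodes with no unlabeled neighbors) are illustrative assumptions:

```python
import math
import random

def normalize(scores):
    """Min-max normalize a dict of scores into the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def exploration_term(node, neighbors, labeled):
    """Count-based exploration: log of the number of unlabeled neighbors
    (+1 added here, as an assumption, so isolated nodes stay finite)."""
    return math.log(1 + sum(1 for n in neighbors[node] if n not in labeled))

def select_node(meta_grads, neighbors, labeled, step, rng=random):
    """Eq. (12): pick argmax of lam * explore + (1 - lam) * (-meta_grad).
    lam ~ Beta(1, 1 + step): its mean 1/(2 + step) decays, so exploration
    fades over acquisition steps (an assumed, illustrative schedule)."""
    lam = rng.betavariate(1.0, 1.0 + step)
    g = normalize(meta_grads)
    phi = normalize({v: exploration_term(v, neighbors, labeled)
                     for v in meta_grads})
    return max(meta_grads, key=lambda v: lam * phi[v] - (1.0 - lam) * g[v])
```

With λ forced to 0 the rule reduces to picking the minimum meta-gradient; with λ forced to 1 it picks the node with the most unlabeled neighbors.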

Figure 1: Variation of λ over time. Average of 10 random samples of the Beta(α, β) distribution. We set the value of α to 1 and increase β with time t.
  Input: Graph G = (A, X), query budget B, initial labels L
  Output: An improved model f_θ
  
  for t = 1 to B do
     Calculate posterior class probabilities P(y | x; θ) with the current model
     Sample a set of instances U_H from U; set U_Q = U \ U_H
     Train a model with perturbed labels of the U_Q instances using Equation (11)
     Calculate the meta-gradient ∇_E C
     Select the best instance v* using Equation (12)
     Query the oracle and retrieve the label y*
     Update the label set: L ← L ∪ {(v*, y*)}
     Retrain the model f_θ
  end for
  Return f_θ
Algorithm 1 MetAL: Meta Learning Active Node Classification.

3 Experiments

3.1 Data

We evaluate our proposed approach on 6 datasets belonging to different domains. CiteSeer, PubMed, and CORA (sen2008cora) are commonly used citation graphs. Each of these graphs consists of documents as nodes and citations as edges: if one document cites another, they are linked by an edge. Each node carries bag-of-words features of its text as attributes. Co-author CS and Co-author Physics are co-authorship graphs constructed from the Microsoft Academic Graph; nodes are authors, and two authors are linked by an edge if they have co-authored a paper. Node features correspond to the keywords of the papers authored by a particular author, and an author’s most active field of study is used as the node label. Amazon Computers is a subgraph of the Amazon co-purchase graph (mcauley2015amazon); products are represented as nodes, and two nodes are connected by an edge if those two products are frequently bought together. Node features correspond to product reviews encoded as bag-of-words, and the product category is the node label.

For each dataset, we randomly select two nodes from each class as the initial labeled set L. We set aside 5% of the remaining unlabeled nodes as the test set; the rest of the unlabeled nodes are eligible to be queried. The size of the initial labeled set and its size as a fraction of the total nodes (the labeling rate) are shown in Table 1.

Dataset Nodes Classes Features Labels (%)
CiteSeer 2110 6 3703 12 (0.56)
PubMed 19717 3 500 6 (0.03)
CORA 2485 7 1433 14 (0.56)
Amazon Computers 13752 10 767 20 (0.14)
Co-author Physics 34493 5 8415 10 (0.03)
Co-author CS 18333 15 6805 30 (0.16)
Table 1: Dataset statistics. Labeling rate as a percentage of total nodes is shown within brackets.

3.2 Model

We evaluate the effectiveness of MetAL, the proposed algorithm, using a two-layer GCN model (kipf2017gcn) with 64 hidden units and SGC (sgc2019), a simplified GNN architecture without a hidden layer or nonlinear activation functions. In all experiments, we use the default hyper-parameters from the GNN literature (e.g., learning rate = 0.01). We do not perform any dataset-specific hyper-parameter tuning, since hyper-parameter tuning while training a model with AL can lead to label inefficiency (ash2019badge). We use the following algorithms in our comparison:

  • Random: Selects an unlabeled node randomly.

  • PageRank (PR): Selects the unlabeled node with the largest PageRank centrality value.

  • Degree: Selects the unlabeled node with the largest degree centrality value.

  • Entropy: Calculates the entropy of the current model’s predictions over unlabeled nodes and selects the node with the largest entropy value.

  • AGE (age2017cai): Selects the node which maximizes a linear combination of three metrics: PageRank centrality, model entropy and information density.

  • BALD (gal2017bald; houlsby2011bald): Selects the node with the largest mutual information between predictions and the model posterior.

  • MetAL: Our proposed algorithm. We select the node maximizing the quantity in Equation (12).

Here, entropy and BALD are uncertainty-based acquisition functions. For computing the entropy, the mutual information in BALD, and the posterior class probabilities predicted by the current model in MetAL, we use 20 iterations of MC-dropout to approximate a Bayesian model (gal2016dropout). In contrast, centrality metrics such as PageRank and degree centrality can be considered heuristics for selecting ‘influential’ instances in a graph dataset: their sequence of acquisitions is determined solely by the graph structure and depends neither on the features of instances nor on the current set of labeled instances.
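Both uncertainty scores can be computed from the same set of MC-dropout samples; a minimal sketch for a single instance:

```python
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def mc_dropout_scores(samples):
    """samples: T probability vectors from T stochastic (MC-dropout)
    forward passes for one instance. Returns (predictive_entropy, bald).
    BALD is the mutual information H[mean prediction] - mean H[prediction]."""
    t = len(samples)
    mean = [sum(s[c] for s in samples) / t for c in range(len(samples[0]))]
    predictive_entropy = entropy(mean)
    expected_entropy = sum(entropy(s) for s in samples) / t
    return predictive_entropy, predictive_entropy - expected_entropy
```

The two scores disagree exactly where it matters: an instance whose samples are individually confident but mutually inconsistent gets a high BALD score, while an instance that is uniformly uncertain in every sample gets high entropy but zero BALD.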

We acquire the label of an unlabeled node and retrain the GNN model by performing 50 steps of the Adam optimizer (kingma2014adam). We perform 40 acquisition steps and repeat this process on 10 different randomly initialized training and test splits for each dataset. We report the average F1 score (macro-averaged) over the test sets in each experiment; in most cases, average accuracy follows a similar trend. In MetAL, we execute 10 steps of gradient descent with momentum as the inner optimization loop and then calculate the meta-gradient matrix.

Figure 2: Performance of active learners with a 2-layer GCN model as the node classifier. Macro-F1 score (test) versus number of acquisitions. Panels: CiteSeer, PubMed, CORA, Amazon Computers, Co-author Physics, Co-author CS.

Figure 3: Performance of active learners with an SGC model as the node classifier. Macro-F1 score (test) versus number of acquisitions. Panels: CiteSeer, PubMed, CORA, Amazon Computers, Co-author Physics, Co-author CS.

Figure 4: The importance of the exploration coefficient λ. MetAL is run with fixed values of λ (0, 0.1, 0.5, and 1.0), and the performance of MetAL with each fixed value is compared against sampling λ from a time-dependent Beta distribution. Panels: CiteSeer, PubMed, CORA, Amazon Computers, Co-author CS, Co-author Physics.

4 Results and Discussion

4.1 Comparison of AL Strategies

In Figure 2 we observe that MetAL delivers the best performance when the GCN model is used as the node classifier. For clarity of visualization, we do not show degree-centrality sampling, since it exhibits the worst performance among all acquisition functions. Figure 3 shows that MetAL performs similarly with SGC as the classifier. However, the performance of SGC on some datasets is inferior to that of the GCN model; the lack of a hidden layer and nonlinear activation functions may contribute to this reduced performance. Even though PageRank centrality has been proposed as a heuristic for acquiring nodes of a graph in previous work (age2017cai), we observe that its performance is inferior on larger graphs such as the Amazon and co-authorship graphs. The performance of the uncertainty-based active learners (entropy and BALD) is not consistent across datasets. Notably, MetAL consistently outperforms AGE, the graph-specific AL benchmark, without relying on time-consuming clustering algorithms. As one of its constituent criteria, AGE computes an information-density measure using the learned features of the GNN model; this step requires clustering the unlabeled instances and then calculating the Euclidean distance to the cluster centers, a time-consuming process, as evident in Table 2.

Figure 4 shows the results of the ablation studies we perform to understand the impact of the exploration coefficient λ. Here, we run the acquisition step in Equation (12) with different fixed values: 0, 0.1, 0.5, and 1.0. A time-dependent λ sampled from a Beta distribution works best on most datasets. Notably, selecting nodes solely based on the meta-gradient values (λ = 0) yields competitive results in most cases, whereas pure exploration (λ = 1) results in inferior performance. This demonstrates that our proposed meta-gradient criterion succeeds in finding ‘informative’ instances for labeling. The experiment also shows that performance can be further improved by adaptively updating λ based on the feedback of the active learner.

4.2 Running Time

Table 2 lists the average execution time each algorithm spends acquiring a set of 40 unlabeled instances. Even though our proposed approach MetAL consumes more time than the uncertainty-based algorithms, it is several times faster than the graph-specific baseline AGE; for example, MetAL is about 20 times faster than AGE on the Co-author Physics dataset. Ultimately, the goal of applying AL is to reduce the total human time spent labeling instances, and MetAL achieves this key objective at the cost of a slightly increased acquisition time.

Classifier Dataset Random Entropy PR AGE BALD MetAL
GCN CiteSeer 4.2 4.8 4.8 21.5 4.8 9.7
PubMed 6.9 7.6 25.4 1125.9 7.9 34.6
CORA 4.2 4.5 4.6 26.8 4.5 9.8
Co-author CS 20.4 22.3 40.8 2154.2 23.7 61.3
Co-author Phy. 46.1 50.5 116.4 2436.9 50.8 125.4
Amazon Comp. 17.5 19.1 31.8 1688.9 19.2 45.2
SGC CiteSeer 1.7 1.9 2.1 18.3 1.9 5.4
PubMed 2.0 2.2 20.0 1229.2 2.2 30.6
CORA 1.3 1.8 1.8 23.7 1.9 5.5
Co-author CS 16.8 19.8 33.2 2098.2 19.8 48.6
Co-author Phy. 35.6 40.7 90.4 2232.3 40.8 97.0
Amazon Comp. 2.2 2.5 17.2 1134.6 2.5 22.0
Table 2: Running time (seconds): average time taken to acquire 40 unlabeled instances. We run all experiments on a single Nvidia GTX 1080-Ti GPU.

5 Related Work

5.1 Graph Neural Networks (GNNs)

GNNs (li2015gated; kipf2017gcn; sgc2019) achieve state-of-the-art performance on the node classification problem, providing a significant improvement over previously used embedding algorithms (perozzi2014deepwalk; planetoid2016revisiting). What sets GNNs apart from previous models is their ability to jointly model both structural information and node attributes. In principle, all GNN models consist of a message-passing scheme that propagates the feature information of a node to its neighbors. Most GNN architectures use a learnable parameter matrix to project features into a different feature space; usually, two or more such layers are used along with a nonlinearity (e.g., ReLU). With the normalized adjacency matrix Â = D̃^(-1/2) Ã D̃^(-1/2), where Ã = A + I, a two-layer GCN model (kipf2017gcn) can be expressed as

Z = softmax( Â ReLU( Â X W^(0) ) W^(1) )    (13)

where A is the adjacency matrix of graph G, D̃ is the degree matrix of Ã, and W^(0) and W^(1) are the weight matrices of the two neural layers.

sgc2019 arrive at a much simpler model, named SGC, by removing the hidden layers and nonlinear activations of the GCN model. Its two-layer form can be written as

Z = softmax( Â² X W )    (14)
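A dependency-free sketch of the propagation step shared by Equations (13) and (14): the symmetrically normalized adjacency with self-loops, followed by an SGC-style forward pass (softmax omitted, and the tiny graph and weights below are illustrative):

```python
import math

def normalized_adjacency(adj):
    """A_hat = D~^(-1/2) (A + I) D~^(-1/2): self-loops are added
    before symmetric degree normalization."""
    n = len(adj)
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in a]
    inv_sqrt = [1.0 / math.sqrt(di) for di in d]
    return [[inv_sqrt[i] * a[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def sgc_logits(adj, features, weight, k=2):
    """Eq. (14): A_hat^k X W -- SGC collapses k propagation steps and a
    single linear layer (the final softmax is omitted here)."""
    a_hat = normalized_adjacency(adj)
    h = features
    for _ in range(k):
        h = matmul(a_hat, h)   # propagate features to neighbors
    return matmul(h, weight)   # single linear projection
```

Because the propagation Â^k X has no learnable parameters, it can be precomputed once, which is why SGC is markedly cheaper to retrain than GCN in the AL loop.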

5.2 Active Learning

AL research has contributed a multitude of approaches for training supervised learning models with less labeled data; we recommend settles2009active for a detailed review of AL. The objective of most existing AL approaches is to select the most informative instance for labeling, and uncertainty sampling is the most commonly used approach. gal2016dropout propose using dropout at evaluation time as a way to estimate the model uncertainty of convolutional neural networks (CNNs), and gal2017bald provide a comparison of various acquisition functions for quantifying the model uncertainty of CNN models. The use of meta-learning for AL has been considered in a few recent works (woodward2017active; bachman2017learning). However, these algorithms are designed for the few-shot learning setting and are tied to RNN-based meta-learning models such as matching networks (vinyals2016matching); additionally, their reliance on reinforcement learning makes training difficult. In contrast, our approach builds on model-agnostic meta-learning (MAML) (finn2017maml), which is efficient and can be used with a variety of supervised loss functions.

6 Conclusion

In this paper, we introduced MetAL, a principled approach to active learning on graph data. We expressed the semi-supervised active learning problem as a bilevel optimization problem and demonstrated that meta-gradients can be used to make this bilevel problem tractable. Empirical performance on benchmark attributed graphs drawn from multiple domains shows that our proposed method is superior to existing heuristics-based AL algorithms. We further showed the importance of performing exploration in addition to exploitation in AL problems. Adaptively learning the exploration coefficient using the feedback from the active learner is an interesting future direction.

In this work, we acquire a single unlabeled instance in each AL step and retrain the classifier. However, acquiring a batch of instances can make the learning process more efficient by reducing the number of retraining steps. We consider adapting MetAL for batch-mode acquisition as another avenue for future improvement. Additionally, understanding which characteristics of an attributed graph make AL easier or difficult is an open research problem. Such an understanding will lead to more efficient AL algorithms in the future.

References