Meta-Learning Transferable Active Learning Policies by Deep Reinforcement Learning

by   Kunkun Pang, et al.

Active learning (AL) aims to enable training high-performance classifiers with low annotation cost by predicting which subset of unlabelled instances would be most beneficial to label. The importance of AL has motivated extensive research, proposing a wide variety of manually designed AL algorithms with diverse theoretical and intuitive motivations. In contrast to this body of research, we propose to treat active learning algorithm design as a meta-learning problem and learn the best criterion from data. We model an active learning algorithm as a deep neural network that inputs the base learner state and the unlabelled point set and predicts the best point to annotate next. Training this active query policy network with reinforcement learning produces the best non-myopic policy for a given dataset. The key challenge in achieving a general solution to AL then becomes that of learner generalisation, particularly across heterogeneous datasets. We propose a multi-task dataset-embedding approach that allows dataset-agnostic active learners to be trained. Our evaluation shows that AL algorithms trained in this way can directly generalise across diverse problems.




1 Introduction

In many applications, supervision is costly relative to the data volume. Active learning (AL) aims to carefully choose training data, so that a classifier can perform well even with relatively sparse supervision. This field has collectively proposed numerous query criteria, such as margin (Tong and Koller, 2002) and uncertainty-based (Kapoor et al., 2007) sampling, representative and diversity-based (Chattopadhyay et al., 2012) sampling, or combinations thereof (Hsu and Lin, 2015). It is hard to pick a clear winner, because each is based on a reasonable and appealing, but completely different, motivation, and there is no method that consistently wins on all datasets. Rather than hand-designing a criterion, we take a learning-based approach. We treat active learning method design as a meta-learning problem and train an active learning policy represented by a neural network using deep reinforcement learning (DRL). It is natural to represent AL as a sequential decision-making problem, since each action (queried point) affects the context (available query points, state of the base learner) for the next decision. In this way the active query policy trained by RL can potentially learn a powerful and non-myopic policy. By treating the increasing accuracy of the base learner as the reward, we optimise for the final goal: the accuracy of a classifier. As the class of deep neural network (DNN) models we use includes many classic criteria as special cases, we can expect this approach to be at least as good as existing methods, and likely better due to exploiting more information and non-myopic optimisation of the actual evaluation metric.

The idea of learning the best criterion within a general function class is appealing, and very recent research has had similar inspiration (Bachman et al., 2017). Crucially, however, it does not provide a general solution to AL unless the learned criterion generalises across diverse datasets/learning problems. With DRL we can learn an excellent query policy for a given dataset, but this requires the dataset's labels; and if we had those labels we would not need to do AL in the first place. Therefore this paradigm is only useful if a dataset/learner-agnostic criterion can be trained. Thus our research question for AL moves from "what is a good criterion?" to "how to learn a criterion that generalises?". In this paper we investigate how to train AL query criteria that generalise across tasks/datasets. Our approach is to define a DNN query criterion that is parameterised by a dataset embedding. By multi-task training our DNN policy on a diverse batch of source datasets, the network learns how to calibrate its strategy according to the statistics of a given dataset. Specifically, we adapt the recently proposed auxiliary network idea (Romero et al., 2017) to define a meta-network that provides unsupervised domain adaptation. The meta network generates a dataset embedding and produces the weight matrices that parameterise the main policy. This enables an end-to-end query policy to generalise across datasets, even those with different feature space dimensionality. Finally, unlike Woodward and Finn (2017) and Bachman et al. (2017), our framework is agnostic to the base classifier. Treating the underlying learner as part of the environment to be optimised means that it can be applied to improve the label efficiency of any existing learning architecture or algorithm.

2 Preliminaries

Reinforcement Learning (RL)  In model-free reinforcement learning, an agent interacts with an environment $\mathcal{E}$ over a number of time steps $t$. At each step, it receives the state $s_t$ from the environment and selects an action $a_t$ based on its policy $\pi(a_t \mid s_t)$. The agent then receives a new state $s_{t+1}$ and reward $r_t$ from $\mathcal{E}$. The aim of RL is to maximise the discounted return $R = \sum_{t=1}^{\infty} \gamma^{t-1} r_t$. There are various ways to learn the policy $\pi$. We use direct policy search (Kober and Peters, 2009), which learns $\pi$ by gradient ascent on an objective function $J(\theta)$ of the policy parameters $\theta$.
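As a small illustration of the discounted return mentioned above, the sketch below evaluates it for a finite episode. The function name `discounted_return` and the example rewards are assumptions made for illustration only.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^(t-1) * r_t over a finite episode, t = 1..T.

    The first reward is undiscounted; later rewards are scaled by
    successive powers of the discount factor gamma.
    """
    total = 0.0
    # Accumulate from the last reward backwards (Horner's rule).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # 1.0 + 0.5 * 0.0 + 0.25 * 2.0
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```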

Active Learning (AL)  A dataset contains instances $x_i \in \mathbb{R}^{d}$ and labels $y_i$, most or all of which are unknown in advance. In AL, the data is split between a labelled set $L$ and an unlabelled set $U$, where $|L| \ll |U|$, and a classifier $f$ has been trained on $L$ so far. In each iteration, a pool-based active learner selects an instance/point $x_i$ from the unlabelled pool $U$ to query its label $y_i$, where $i \in \{1, \dots, |U|\}$. Then the selected point is removed from $U$ and added to $L$ along with its label, and the classifier is retrained based on the updated $L$.

Connection between RL and AL  We model an AL criterion as a neural network, and discovery of the ideal criterion as an RL problem. Let the world state $s_t$ contain a featurisation of the dataset and the base classifier. An AL criterion is a policy $\pi(a_t \mid s_t)$ where the discrete action $a_t$ selects a point in $U$ to query. After a query, the state is updated to $s_{t+1}$ as the point is moved from $U$ to $L$ and the classifier updates accordingly. The reward $r_t$ is the quantity we wish to maximise, e.g., $\mathrm{Acc}_t$, the accuracy at query $t$. In this paper we focus on binary classification.
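To make the mapping from AL to RL concrete, the following is a minimal sketch of a pool-based AL session written as an environment with a `step` method. Everything named here is hypothetical: the class `ActiveLearningEnv`, the nearest-centroid base learner (chosen only to keep the example dependency-free), and the use of accuracy gain as the reward, which follows the reward described later in Section 3.2.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Example = Tuple[Point, int]  # (features, binary label)


def centroid_accuracy(labelled: List[Example], test: List[Example]) -> float:
    """Accuracy of a nearest-centroid classifier fit on the labelled set."""
    sums: Dict[int, List[float]] = {}
    for (x, y), label in labelled:
        s = sums.setdefault(label, [0.0, 0.0, 0.0])
        s[0] += x
        s[1] += y
        s[2] += 1.0
    if not sums:
        return 0.0  # no labels yet, so no classifier
    centroids = {c: (s[0] / s[2], s[1] / s[2]) for c, s in sums.items()}

    def predict(p: Point) -> int:
        return min(centroids, key=lambda c: (p[0] - centroids[c][0]) ** 2
                   + (p[1] - centroids[c][1]) ** 2)

    return sum(predict(p) == label for p, label in test) / len(test)


class ActiveLearningEnv:
    """Pool-based AL as an episodic environment (illustrative only)."""

    def __init__(self, pool: List[Example], test: List[Example]) -> None:
        # Pool labels are stored but only revealed when a point is queried.
        self.unlabelled = list(pool)
        self.labelled: List[Example] = []
        self.test = test
        self.accuracy = centroid_accuracy(self.labelled, self.test)

    def step(self, action: int) -> float:
        """Query the pool item at index `action`; return the accuracy gain."""
        self.labelled.append(self.unlabelled.pop(action))
        new_accuracy = centroid_accuracy(self.labelled, self.test)
        reward = new_accuracy - self.accuracy
        self.accuracy = new_accuracy
        return reward
```

A policy interacts with this object by repeatedly choosing an index into `env.unlabelled` and calling `env.step(index)` until the labelling budget is spent.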

3 Methods

We aim to train an effective dataset-agnostic active query policy $\pi(a \mid s)$. The key challenge is how to learn a policy given that: (i) the test and training dataset statistics may differ, and moreover (ii) different datasets have different feature dimensionality $d$. This is addressed by defining the policy in terms of two sub-networks: a policy network and a meta network.

Figure 1: Left: Policy and meta-network for learning a task-agnostic active query policy. The policy inputs data points and outputs a query probability. The policy is parameterised by weights generated by the meta network based on the current state, which represents the current dataset and classifier. Middle: Illustrative active learning curves from evaluating our learned policy on the held-out UCI dataset diabetes. Right: Cross-dataset generalisation. Average performance over all training and testing sets when varying the number of training domains.

Policy Network  The policy inputs the $N$ currently unlabelled instances and outputs an $N$-way softmax for selecting the instance to query. It selects actions via the softmax $\pi(a = i \mid s) \propto \exp\big(g(W_e^{\top} z_i)\big)$, where $z_i$ is the $i$th unlabelled instance in $U$, $W_e \in \mathbb{R}^{d \times k}$ are dataset dependent weights, and $g$ denotes the remaining layers of the policy. Although dimensionality $d$ varies by dataset, the encoding $W_e^{\top} z_i$ does not, so the rest of the policy network is independent of $d$. The key is then how to obtain the encoder $W_e$, which will be provided by the meta network. Following previous work (Bachman et al., 2017; Konyushkova et al., 2017) we also allow the instances to be augmented by instance-level expert features, so $z_i = [x_i, e_i]$, where $x_i$ are the raw instances and $e_i$ are the expert features (distance furthest first and uncertainty) of each raw instance.
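The sketch below illustrates why a softmax over per-instance scores copes with pools of any size: each instance is encoded with dataset dependent weights, scored by a shared head, and the scores are normalised over the pool. The names `encode`, `query_probabilities`, and the linear `head` standing in for the rest of the policy network are assumptions for illustration, not the architecture used in the paper.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode(instance: Vector, encoder: Matrix) -> Vector:
    """Project a d-dim instance to k dims; encoder has d rows of length k."""
    k = len(encoder[0])
    return [sum(instance[j] * encoder[j][c] for j in range(len(instance)))
            for c in range(k)]


def query_probabilities(pool: Matrix, encoder: Matrix, head: Vector) -> Vector:
    """One probability per unlabelled instance, whatever the pool size."""
    scores = [sum(h * u for h, u in zip(head, encode(z, encoder)))
              for z in pool]
    return softmax(scores)


if __name__ == "__main__":
    pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 instances, d = 2
    encoder = [[1.0], [1.0]]                      # d = 2, k = 1
    print(query_probabilities(pool, encoder, head=[1.0]))
```

Only `encoder` depends on the input dimensionality; `head` operates on the fixed-size encoding, so it can be shared across datasets.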

Meta Network  The encoding parameters $W_e$ of the policy are obtained from the meta network. Following Romero et al. (2017) we also use a decoder $W_d$ to regularise this process by reconstructing the input features. Romero et al. (2017) applied meta networks for parameter reduction, with all training and testing performed on the same dataset. Here the meta network idea is used to learn how to perform unsupervised deep domain adaptation across datasets, by synthesising dataset-conditional weight matrices based on the dataset embeddings described next. Note that this deep adaptation is performed in a single feed-forward pass, in contrast to existing shallow (Csurka, 2017) or deep but iterative (Ganin and Lempitsky, 2015) methods.

3.1 Achieving Cross Dataset Generalisation

Dimension-wise Embedding  The meta-network builds a dataset-size-independent dimension-wise embedding $\phi$ of the input, shown in the light blue part of Fig. 1. Then it predicts

$$(W_e)_j = \psi_e(\phi_j)$$

Here $\phi$ is a non-linear feature embedding, $j$ indexes features, selecting the $j$th embedded feature $\phi_j$ and the $j$th row of $W_e$, and $\psi_e$ is the non-linear mapping of the meta-network, which outputs a vector of dimension $k$. Similarly, the meta-network also predicts the weight matrix $W_d$ used for auto-encoding reconstruction (Fig. 1). Although $d$ is dataset dependent, the meta network generates weights for a policy network of appropriate dimensionality ($d \times k$) to the target problem.
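The row-by-row weight generation can be sketched as follows. The point of the example is that the same small mapping is applied once per input feature, so a dataset with more features simply yields an encoder with more rows. The single `tanh` layer used for the mapping, and the names `psi` and `generate_encoder`, are assumptions for illustration; the actual meta-network is deeper.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def psi(feature_embedding: Vector, meta_weights: Matrix) -> Vector:
    """Map one E-dim feature embedding to a k-dim row of the encoder."""
    return [math.tanh(sum(w * e for w, e in zip(row, feature_embedding)))
            for row in meta_weights]


def generate_encoder(embeddings: Matrix, meta_weights: Matrix) -> Matrix:
    """Build the d x k encoder row by row, one row per input feature."""
    return [psi(e, meta_weights) for e in embeddings]


if __name__ == "__main__":
    meta_weights = [[0.1] * 4, [-0.2] * 4]        # k = 2 outputs, E = 4 inputs
    small = [[1.0, 0.0, 0.0, 0.0]] * 3            # dataset with d = 3 features
    large = [[0.0, 1.0, 0.0, 0.0]] * 5            # dataset with d = 5 features
    print(len(generate_encoder(small, meta_weights)))  # one row per feature
    print(len(generate_encoder(large, meta_weights)))
```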

Choice of Embeddings  We use 'representative' and 'discriminative' histogram embeddings.

Representative embedding: We encode each feature dimension as a histogram over the instances in that dimension. Specifically, we rescale the $j$th dimension features into $[0, 1]$ and divide the dimension into 10 bins. Then we count the proportion of labelled and unlabelled data for each bin. This gives a $1 \times 20$ histogram embedding for each dimension that encodes its moments.

Discriminative embedding: We create a 2-D histogram of 10 bins per dimension. In this histogram we count the frequency of instances with feature values within each bin (as per the previous embedding) jointly with the frequency of instances with posterior values within each bin (i.e., binning on the $[0, 1]$ posterior of the binary base classifier). Finally, counts in the grid are vectorised to $1 \times 100$. Concatenating these two embeddings, we have a 120-dimensional representation of each feature dimension for processing by the meta network.
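The two histogram embeddings can be sketched for a single feature dimension as below. The helper names are hypothetical, and normalising every count by the total number of instances is an assumption; the text only says that proportions are counted per bin.

```python
from typing import List

BINS = 10


def _bin(value: float) -> int:
    """Map a value in [0, 1] to one of BINS equal-width bins."""
    return min(int(value * BINS), BINS - 1)


def rescale(values: List[float]) -> List[float]:
    """Min-max rescale one feature dimension into [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in values]


def representative(feature: List[float], is_labelled: List[bool]) -> List[float]:
    """20-d: per-bin proportions of labelled and of unlabelled instances."""
    hist_l = [0.0] * BINS
    hist_u = [0.0] * BINS
    for value, labelled in zip(rescale(feature), is_labelled):
        (hist_l if labelled else hist_u)[_bin(value)] += 1
    n = len(feature)
    return [c / n for c in hist_l] + [c / n for c in hist_u]


def discriminative(feature: List[float], posterior: List[float]) -> List[float]:
    """100-d: joint histogram of feature value and classifier posterior."""
    grid = [0.0] * (BINS * BINS)
    for value, p in zip(rescale(feature), posterior):
        grid[_bin(value) * BINS + _bin(p)] += 1
    n = len(feature)
    return [c / n for c in grid]


def embed_feature(feature: List[float], is_labelled: List[bool],
                  posterior: List[float]) -> List[float]:
    """Concatenate both embeddings: 20 + 100 = 120 values per feature."""
    return representative(feature, is_labelled) + discriminative(feature, posterior)
```

Because the output length is fixed at 120 regardless of how many instances the dataset holds, the meta network can consume one such vector per feature dimension for any dataset.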

Training for Cross Dataset Generalisation We train policy and meta networks jointly using REINFORCE policy search (Williams, 1992) to maximise the return (active learning accuracy). To ensure that our networks achieve the desired learning problem invariance, we perform multi-task training on multiple source datasets: In every mini-batch the return is averaged over a randomly sampled subset of source datasets. Thus achieving high return means the meta network has learned to synthesise a good per-dataset policy based on the dataset embedding. We further standardise the return from each episode to compensate for diverse return scale across datasets of differing difficulty.
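Return standardisation can be sketched as below. It is assumed here that returns are grouped by source dataset and standardised within each group, so that easy and hard datasets contribute gradients on a comparable scale; the function name, the grouping, and the small `eps` guard are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def standardise_returns(returns_by_dataset: Dict[str, List[float]],
                        eps: float = 1e-8) -> Dict[str, List[float]]:
    """Zero-mean, unit-variance returns within each dataset's episodes."""
    standardised: Dict[str, List[float]] = {}
    for name, returns in returns_by_dataset.items():
        mu = mean(returns)
        sigma = pstdev(returns)
        # eps avoids division by zero when all returns are identical.
        standardised[name] = [(r - mu) / (sigma + eps) for r in returns]
    return standardised


if __name__ == "__main__":
    batch = {"easy": [0.90, 0.92, 0.94], "hard": [0.50, 0.60, 0.70]}
    print(standardise_returns(batch))
```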

3.2 Reinforcement Learning Training and Objective Functions

Reward  We define the reward as the improvement in test split accuracy: $r_t = \mathrm{Acc}_t - \mathrm{Acc}_{t-1}$. We then optimise the return of an active learning session, $R = \sum_{t=1}^{T} \gamma^{t-1} r_t$.
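A short worked derivation shows what this reward optimises. Writing $\mathrm{Acc}_t$ for the test split accuracy after the $t$th query and $T$ for the labelling budget (notation assumed here for illustration), and taking the reward to be the accuracy improvement at each step, the undiscounted return telescopes to the total accuracy gained over the session:

```latex
\begin{aligned}
R\big|_{\gamma = 1}
  &= \sum_{t=1}^{T} r_t
   = \sum_{t=1}^{T} \left( \mathrm{Acc}_t - \mathrm{Acc}_{t-1} \right)
   = \mathrm{Acc}_T - \mathrm{Acc}_0 , \\
R\big|_{\gamma < 1}
  &= \sum_{t=1}^{T} \gamma^{\,t-1} \left( \mathrm{Acc}_t - \mathrm{Acc}_{t-1} \right) .
\end{aligned}
```

With a discount factor below one, the same improvement is worth more when it arrives earlier, which favours policies whose learning curves rise quickly.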

Auxiliary Regularisation Losses  Besides optimising the obtained reward, we also optimise for two auxiliary regularisation losses. Reconstruction: the policy network should reconstruct the unlabelled input data using the decoder weights $W_d$ predicted by the meta-network (Romero et al., 2017). We optimise the mean square reconstruction error of the autoencoder.

Entropy: following Mnih et al. (2016), we also prefer a policy that maintains a high-entropy posterior over actions so as to continue to explore and avoid premature convergence to an over-confident solution.
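The entropy term can be illustrated with a few lines of code. The function name `policy_entropy` is an assumption; the quantity is the standard Shannon entropy of the action distribution, in nats.

```python
import math
from typing import Sequence


def policy_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a distribution over query actions."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]   # still exploring
    peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic
    print(policy_entropy(uniform) > policy_entropy(peaked))
```

A uniform policy over $N$ candidates attains the maximum value $\log N$, while a policy that always picks the same point has entropy zero, so rewarding entropy discourages early collapse onto a single choice.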

With these three objectives, we train both networks end-to-end, maximising the expected return and the policy entropy while minimising the reconstruction error.
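One plausible way to combine the three terms, assuming hypothetical non-negative trade-off weights $\lambda_H$ and $\lambda_A$ that are not taken from the paper, is a weighted sum in which the reconstruction error enters with a negative sign because it is to be minimised:

```latex
\mathcal{F}
  = \underbrace{\mathbb{E}_{\pi}\!\left[ R \right]}_{\text{return}}
  + \lambda_H \, \underbrace{H(\pi)}_{\text{policy entropy}}
  - \lambda_A \, \underbrace{A}_{\text{mean square reconstruction error}}
```

Here $H(\pi)$ denotes the entropy of the action distribution and $A$ the autoencoder's reconstruction error; both symbols are introduced only for this sketch.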


4 Experiments

Datasets  We experiment with a diverse set of 14 UCI datasets: austra, heart, german, ILPD, ionospheres, pima, wdbc, breast, diabetes, fertility, fourclass, haberman, liver, and planning. We use a leave-one-out (LOO) setting: training on 13 datasets, and evaluating on the held-out dataset.
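The leave-one-out protocol over datasets can be sketched as a simple generator. The function name `leave_one_dataset_out` is hypothetical; the dataset names are those listed above.

```python
from typing import Iterator, List, Tuple

DATASETS = [
    "austra", "heart", "german", "ILPD", "ionospheres", "pima", "wdbc",
    "breast", "diabetes", "fertility", "fourclass", "haberman", "liver",
    "planning",
]


def leave_one_dataset_out(names: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training datasets, held-out dataset) for every split."""
    for i, held_out in enumerate(names):
        yield names[:i] + names[i + 1:], held_out


if __name__ == "__main__":
    for train, test in leave_one_dataset_out(DATASETS):
        print(f"train on {len(train)} datasets, test on {test}")
```

Each dataset therefore appears once as the test set and 13 times as a training set, which is why training-side results are averaged over 13 occurrences.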

Architecture  The auxiliary network for encoder has fully connected (FC) layers of size () and an analogous structure for the decoder. The policy network has layers of size ( input matrix ), , , , ( -way output). All penultimate layers use ReLU activation. The first FC layer of the policy is generated by the auxiliary network. Thereafter for efficient implementation with few parameters and to deal with the variable sized input and output, the policy network is implemented convolutionally. We convolve a sized filter across the dimension of each shaped layer to obtain the next layer.


We use Adam with initial learning rate 0.001 and hyperparameters , and discount factor . During RL training, we use two tricks to stabilise the policy gradient. 1) We use a relatively large batch size of 32 episodes. 2) We smooth the gradient by accumulation where is the gradient of the in time step and the is the accumulated gradient. We train the policy and meta network end-to-end for 50,000 iterations and perform active learning over a time horizon (budget) of . As the base learner we use a linear SVM with class balancing. All results are averages over 100 trials of training and testing dataset splits.

4.1 Results

Alternatives  We compare with the classic approaches of uncertainty/margin-based sampling (US) (Tong and Koller, 2002; Kapoor et al., 2007), furthest-first (DFF) (Baram et al., 2003) and query-by-bagging (QBB) (Abe and Mamitsuka, 1998), as well as random sampling (RAND) as a lower bound. US simply queries the instance with minimum certainty. While simple, it is competitive with more sophisticated criteria and robust in the sense of hardly ever being a very poor criterion on any benchmark. We also compare with QUIRE (Huang et al., 2010) as a representative more sophisticated approach, and ALBL (Hsu and Lin, 2015), a recent (within-dataset) learning-based approach. We denote our method meta-learned policy for general active learning (MLP-GAL). As a related alternative we propose SingleRL. This is our RL approach, but without the meta-network, so a single model is learned over all datasets. Without the meta-network it can only use expert features, so that dimensionality is fixed over datasets. SingleRL can also be seen as a version of one of the few state-of-the-art learning-based alternatives (Konyushkova et al., 2017), with an important upgrade from the supervised learning used there to non-myopic reinforcement learning.
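For reference, two of the hand-designed baselines can be sketched in a few lines each. These are simplified readings of the criteria for a binary classifier, with hypothetical function names: uncertainty sampling picks the instance whose posterior is closest to 0.5, and furthest-first picks the pool instance whose nearest labelled point is farthest away.

```python
from typing import List, Sequence

Vector = Sequence[float]


def uncertainty_sampling(posteriors: Sequence[float]) -> int:
    """Index of the pool instance the binary classifier is least sure about."""
    return min(range(len(posteriors)), key=lambda i: abs(posteriors[i] - 0.5))


def furthest_first(pool: List[Vector], labelled: List[Vector]) -> int:
    """Index of the pool instance farthest from its nearest labelled point."""

    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return max(
        range(len(pool)),
        key=lambda i: min(squared_distance(pool[i], l) for l in labelled),
    )


if __name__ == "__main__":
    print(uncertainty_sampling([0.9, 0.55, 0.1]))
    print(furthest_first([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]], [[0.0, 0.0]]))
```

Both criteria are myopic and fixed in advance, which is the contrast the learned policy is meant to address.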

Multi-Task Training Evaluation  We first verify if it is possible to learn a single policy that generalises across multiple training datasets. In our leave-one-out setting, this means generalising across all 13 datasets in any split. Each result in the MLP-GAL (Tr) column of Table 1 is an average across all the 13 combinations. MLP-GAL learns an effective criterion that outperforms competitors.

Linear       MLP-GAL (Tr)  MLP-GAL (Te)  SingleRL (Te)  Uncertainty  DFF    RAND   ALBL   T-LSA  QUIRE  QBB
austra       80.14         78.09         75.72          78.24        75.63  75.87  75.31  72.98  64.46  78.58
breast       96.67         95.95         94.78          95.41        95.76  94.71  95.67  96.21  95.60  95.73
diabetes     67.53         65.99         64.78          64.18        57.31  64.05  61.35  57.34  53.75  64.46
fertility    78.26         75.09         77.86          75.79        70.44  71.28  66.92  71.18  54.93  73.87
fourclass    74.79         74.11         71.83          69.55        71.26  69.08  68.69  69.98  64.48  70.81
haberman     67.31         65.61         64.91          60.16        60.26  57.40  52.49  59.67  45.89  60.58
heart        76.68         72.77         72.84          73.38        73.99  73.06  71.78  71.52  67.07  73.36
german       68.01         64.68         63.35          63.34        61.78  62.77  61.74  58.75  51.82  64.16
ILPD         62.48         59.30         61.08          57.60        50.97  57.62  52.91  53.15  48.57  56.77
ionospheres  74.96         71.46         69.78          70.47        59.64  69.81  68.44  58.95  57.84  70.40
liver        55.66         55.51         55.62          53.45        52.87  52.87  51.25  51.36  48.11  52.13
pima         67.64         67.01         64.67          64.18        57.31  63.69  61.27  57.03  53.75  64.24
planning     60.74         58.63         56.75          55.09        52.77  54.17  49.46  52.04  39.90  55.43
wdbc         90.90         90.09         88.72          90.93        87.55  88.52  88.41  85.15  82.17  90.68
Avg          72.98         70.94         70.19          69.41        66.25  68.21  66.12  65.38  59.17  69.37
Num Wins     -             7             3              1            1      0      0      1      0      1
Table 1: Comparison of AL algorithms, leave one dataset out setting. Linear SVM base learner. AUC averages (%) over 100 trials (and 13 training occurrences for MLP-GAL (Tr)).

Cross-Task Generalisation  In our leave-one-out setting, each row in Table 1 represents a testing set, and the MLP-GAL (Te) result is the performance on this test set after training on all 13 other datasets. Our MLP-GAL outperforms alternatives in both average performance and number of wins. SingleRL is generally also effective compared to prior methods, showing the efficacy of training a policy with RL. However it does not benefit from a meta network, so is not as effective and robust as our MLP-GAL. It is also interesting to see that while sophisticated methods such as QUIRE sometimes perform very well, they also often perform very badly – even worse than random. Meanwhile the classic uncertainty and QBB methods perform consistently well. This dichotomy illustrates the challenge in building sophisticated AL algorithms that generalise to datasets that they were not engineered on. In contrast, although our MLP-GAL (Te) has not seen these datasets during training, it performs consistently well due to adapting to each dataset via the meta-network.

Dependence on Number of Training Domains  We next investigate how performance depends on the number of training domains. We train MLP-GAL with an increasing number of source datasets: 1, 4, 7 (multiple splits each), and 13 (13 splits, LOO setting). Then we compute the average performance over all training and all testing domains, in all of their multiple occurrences across the splits. From the results in Fig. 1 (right), we see that the training performance becomes worse when doing higher-way multi-task training. This is intuitive: it becomes harder to overfit to more datasets simultaneously. Meanwhile testing performance improves, demonstrating that the model learns to generalise better to held-out problems when forced to learn on a greater diversity of source datasets.

5 Discussion

We have proposed a learning-based perspective on active query criteria design. Our meta-network learns unsupervised domain adaptation: for the first time addressing the key challenge of learning deep query policies with dataset-agnostic generality, rather than requiring dataset-specific training. Our method is also base learner agnostic, unlike Bachman et al. (2017) and Woodward and Finn (2017), so it can be used with any classifier. A limitation thus far (shared by Konyushkova et al. (2017)) is that we have only focused on a binary base classifier. In the future we would like to evaluate our method with deep multi-class classifiers as base learners by designing embeddings which can represent the state of such learners, as well as explore application to the stream-based AL setting.


  • Abe and Mamitsuka [1998] Naoki Abe and Hiroshi Mamitsuka. Query learning strategies using boosting and bagging. In ICML, 1998.
  • Bachman et al. [2017] Philip Bachman, Alessandro Sordoni, and Adam Trischler. Learning algorithms for active learning. ICML, 2017.
  • Baram et al. [2003] Yoram Baram, Ran El-Yaniv, and Kobi Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5:255–291, 2003.
  • Chattopadhyay et al. [2012] Rita Chattopadhyay, Zheng Wang, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. Batch mode active sampling based on marginal probability distribution matching. In KDD. ACM, 2012.
  • Csurka [2017] Gabriela Csurka. Domain Adaptation in Computer Vision Applications. Springer, 2017.
  • Ganin and Lempitsky [2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
  • Hsu and Lin [2015] Wei-Ning Hsu and Hsuan-Tien Lin. Active learning by learning. In AAAI, 2015.
  • Huang et al. [2010] Sheng-jun Huang, Rong Jin, and Zhi-hua Zhou. Active learning by querying informative and representative examples. In NIPS. 2010.
  • Kapoor et al. [2007] Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Active learning with gaussian processes for object categorization. In ICCV, 2007.
  • Kober and Peters [2009] Jens Kober and Jan R. Peters. Policy search for motor primitives in robotics. In NIPS. 2009.
  • Konyushkova et al. [2017] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from real and synthetic data. NIPS, 2017.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  • Romero et al. [2017] Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dube, Julie G. Hussin, and Yoshua Bengio. Diet networks: Thin parameters for fat genomics. ICLR, 2017.
  • Tong and Koller [2002] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2, March 2002.
  • Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
  • Woodward and Finn [2017] Mark Woodward and Chelsea Finn. Active one-shot learning. CoRR, abs/1702.06559, 2017.