Bayesian active learning for production, a systematic study and a reusable library

06/17/2020 ∙ by Parmida Atighehchian, et al.

Active learning is able to reduce the amount of labelling effort by using a machine learning model to query the user for specific inputs. While there are many papers on new active learning techniques, these techniques rarely satisfy the constraints of a real-world project. In this paper, we analyse the main drawbacks of current active learning techniques and present approaches to alleviate them. We do a systematic study of the effects of the most common issues of real-world datasets on the deep active learning process: model convergence, annotation error, and dataset imbalance. We derive two techniques that can speed up the active learning loop: partial uncertainty sampling and larger query sizes. Finally, we present our open-source Bayesian active learning library, BaaL.




1 Introduction

The amount of data readily available for machine learning has exploded in recent years. However, for data to be used for deep learning models, labelling is often a required step. A common problem when labelling new datasets is the human effort required to perform the annotation. In particular, tasks that require specific domain knowledge, such as medical imaging, are expensive to annotate. To solve this, active learning (AL) has been proposed to label only the core set of observations useful for training.

While the active learning field includes many approaches (kirsch2019batchbald; tsymbalov2019deeper; beluch2018power; maddox2019simple), these methods are often either not scalable to large datasets or too slow to be used in a more realistic environment, e.g., a production setup. In particular, active learning applied to images or text requires deep learning models, which are slow to train and themselves require a notably large amount of data to be effective (imagenet_cvpr09; abu2016youtube).

Furthermore, deep learning models require carefully tuned hyperparameters to be effective. In a research environment, one can fine-tune and perform hyperparameter search to find the optimal combination that gives the biggest reduction in labelling effort. In a real-world setting, the hyperparameters are set at the beginning with no guarantee for the outcome.

Finally, in a real-world setup, the data is often neither cleaned nor balanced. In particular, studies have shown that humans are far from perfect when labelling, and the problem is even worse when using crowd-sourcing (ipeirotis2010quality; allahbakhsh2013quality).

Our contributions are three-fold. First, we perform a systematic study of the effect of the most common pathologies found in real-world datasets on active learning. Second, we propose several techniques that make active learning suitable for production. Finally, we present a case study of using active learning on a real-world dataset.

In addition, we present our freely available Bayesian active learning library, BaaL, which provides all the necessary setup and tools for active learning experiments at any scale.

2 Problem setting

We consider the problem of supervised learning where we observe a labelled dataset D_L = {(x_i, y_i)}_{i=1}^{N} of input–label pairs and our goal is to estimate a prediction function f : X → Y. In addition, we have unlabelled observations D_U = {x_j}_{j=1}^{M}. More specifically, we consider the problem of active learning, where the procedure is summarized in Algorithm 1.



Label randomly B points from the unlabelled pool D_U and add them to the labelled set D_L
while labelling budget is available do
       Train model to convergence on D_L
       Compute uncertainty on D_U
       Label the top-k most uncertain samples and move them from D_U to D_L
end while
Algorithm 1: Active learning process. For batch active learning, the algorithm trains a model on the labelled set D_L before estimating the uncertainty on the pool D_U; the most uncertain samples are labelled by a human before restarting the loop.
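The loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative skeleton, not the paper's code: `train_fn` and `uncertainty_fn` are placeholder callables standing in for model training and uncertainty estimation.

```python
import numpy as np

def active_learning_loop(train_fn, uncertainty_fn, labelled_idx, pool_idx,
                         query_size, budget):
    """Batch active learning loop (illustrative sketch).

    train_fn(labelled_idx)      -> trained model
    uncertainty_fn(model, idx)  -> one uncertainty score per pool index
    """
    labelled_idx, pool_idx = list(labelled_idx), list(pool_idx)
    while budget > 0 and pool_idx:
        model = train_fn(labelled_idx)              # train to convergence on D_L
        scores = uncertainty_fn(model, pool_idx)    # score the pool D_U
        k = min(query_size, budget, len(pool_idx))
        top_k = np.argsort(scores)[-k:]             # top-k most uncertain
        for i in sorted(top_k, reverse=True):       # pop from the back to keep
            labelled_idx.append(pool_idx.pop(i))    # positions valid ("oracle" labels)
        budget -= k
    return labelled_idx
```

In practice the oracle step is a human annotator; here popping an index from the pool stands in for receiving its label.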

3 Background

Active learning has received a lot of attention in the past years, specifically on classification tasks (gal2017deep). However, some work has been done on segmentation (kendall2017uncertainties), localization (miller2019evaluating), natural language processing (siddhant2018deep), and time series (peng2017acts). In this paper, we focus our attention on image classification.

Bayesian active learning

Current state-of-the-art techniques in active learning rely on uncertainty estimation to perform queries (gal2017deep). A common issue highlighted by tsymbalov2019deeper and kirsch2019batchbald is the need to retrain the model and recompute the uncertainties as often as possible; otherwise, the next selected samples may be too similar to previously annotated ones. This is problematic due to the long training time of deep learning models as well as the expensive task of uncertainty estimation. tsymbalov2019deeper, houlsby2011bayesian, and Wilson2015DeepKL proposed solutions to this issue, but they are memory-expensive and time-consuming when used on large inputs or large datasets. In reality, due to the large cost of inference and retraining on large-scale datasets, it is not feasible to recompute the uncertainties in a timely fashion. In consequence, multiple samples are annotated between retrainings. We call this framework batch active learning.

Machine learning algorithms can suffer from two types of uncertainty (kendall2017uncertainties):

1) Aleatoric uncertainty, the uncertainty intrinsic to the data, which cannot be explained away with more samples. It arises from, e.g., labelling errors, occlusion, poor data acquisition, or two highly confused classes.

2) Epistemic Uncertainty, the uncertainty about the underlying model. Obtaining more samples will provide more information about the underlying model and reduce the amount of epistemic uncertainty. Crucially, some samples are more informative than others.

Uncertainty estimations

Computing the uncertainty of deep neural networks is crucial to many applications, from medical imaging to loan applications. Unfortunately, deep neural networks are often overconfident, as they are not designed to provide calibrated predictions (scalia2019evaluating; gal2016uncertainty). Hence, researchers proposed new methods to get a trustworthy estimation of the epistemic uncertainty, such as MC-Dropout (gal2016dropout), Bayesian neural networks (blundell2015weight), or ensembles. More recently, wilson2020bayesian proposed to combine variational inference and ensembles. While this approach is state-of-the-art, it is far too computationally expensive to be used in industry.

In this paper, we use MC-Dropout (gal2016dropout). This technique keeps the Dropout layers activated at test time to sample from the posterior distribution. Hence, it can be applied to any architecture that uses Dropout, which makes it usable on a wide range of applications.
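In PyTorch, this only requires switching the Dropout modules back to training mode after calling `eval()`. The two helpers below are an illustrative sketch with names of our own choosing, not BaaL's implementation.

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> nn.Module:
    """Keep Dropout layers stochastic at test time (MC-Dropout)."""
    model.eval()                          # freeze BatchNorm statistics, etc.
    for module in model.modules():
        if isinstance(module, nn.Dropout):  # extend to Dropout2d if needed
            module.train()                  # re-enable dropout sampling
    return model

def mc_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20) -> torch.Tensor:
    """Stack n_samples stochastic softmax predictions: [n_samples, batch, classes]."""
    with torch.no_grad():
        return torch.stack([torch.softmax(model(x), dim=-1)
                            for _ in range(n_samples)])
```

Each forward pass then samples a different dropout mask, giving the Monte-Carlo samples used by the acquisition functions below.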

Acquisition functions

Many heuristics have been proposed to extract an uncertainty value from stochastic prediction sampling. We draw Monte-Carlo samples from the posterior distribution, obtaining predictions {p(y | x, ω_t)}_{t=1}^{T}, where T is the number of Monte-Carlo samples and ω_t are sampled model weights. We then compute the Bayesian model average p̄(y | x) = (1/T) Σ_{t=1}^{T} p(y | x, ω_t). When the model is highly uncertain, p̄(y | x) will be close to a uniform distribution. A naive approach to estimate the uncertainty is to compute the entropy of this distribution, H[p̄(y | x)].

A more sophisticated approach is BALD (houlsby2011bayesian), which estimates the epistemic uncertainty by computing the mutual information between the model posterior distribution and the prediction: I(y; ω | x, D_L) = H[p̄(y | x)] − E_{p(ω | D_L)}[H[p(y | x, ω)]], where p̄ denotes the Bayesian model average over weights ω sampled from the posterior.

BALD compares the entropy of the mean prediction to the mean entropy of the individual predictions. The result is high when there is high disagreement between predictions, which addresses the overconfidence issue in deep learning models.
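Both acquisition functions can be computed directly from the stack of Monte-Carlo predictions. The NumPy sketch below is our own illustration, not the paper's code; it takes an array of T stochastic softmax outputs and returns the entropy and BALD scores.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy along the last (class) axis."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def bald_score(mc_probs):
    """BALD mutual information from MC samples.

    mc_probs: array [T, N, C] of T stochastic softmax predictions.
    Returns:  [N] scores, H[mean prediction] - mean_t H[prediction_t].
    """
    mean_probs = mc_probs.mean(axis=0)              # Bayesian model average
    expected_entropy = entropy(mc_probs).mean(axis=0)
    return entropy(mean_probs) - expected_entropy
```

When all stochastic passes agree, the two terms cancel and the score is near zero; when they disagree, the entropy of the average exceeds the average entropy and the score is high.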

4 Experiments

In this paper, we want to demonstrate the usability of active learning in a real-world scenario. First, we analyze the effect of common pathologies in deep learning on active learning. Second, a common issue in active learning is the time required between steps in the active learning loop. As stated by kirsch2019batchbald, retraining as soon as possible is crucial to obtain decorrelated samples. We investigate a) whether this holds for large-scale datasets and b) what we can do to make this step faster. Implementation details can be found in the Annex. Baselines for all acquisition functions can be found in Fig. 1.

4.1 Pathologies

In this section, we verify if common pathologies in deep learning hold for active learning. Problems such as annotation error or model convergence may be hurtful to the procedure and are often overlooked in the literature. In particular, due to the small amount of annotated data, models are more at risk than when they are trained on large datasets.

Effect of annotation error

While standard datasets are of good quality, humans are far from perfect and will produce errors when labelling. This is especially true when using crowdsourcing (allahbakhsh2013quality). Because active learning relies on the training data to train a model and there are only a few labelled samples, we make the hypothesis that active learning would be highly sensitive to noise.

To confirm this hypothesis, we introduce noise by corrupting a fraction ρ of the labels. We test our hypothesis on CIFAR10 (krizhevsky2009learning). In Table 1, we can see that, depending on ρ, the active learning procedure is highly affected by labelling noise. Furthermore, when we compare to random selection, the gain of using active learning decreases when noise is involved, but it is still useful.

Dataset size	5000	10000	20000
ρ = 0
BALD	0.65 ± 0.01	0.53 ± 0.01	0.43 ± 0.02
Entropy	0.68 ± 0.03	0.52 ± 0.02	0.43 ± 0.03
Random	0.71 ± 0.02	0.58 ± 0.02	0.47 ± 0.01
ρ = 0.05
BALD	0.72 ± 0.02	0.57 ± 0.01	0.43 ± 0.02
Entropy	0.72 ± 0.02	0.54 ± 0.02	0.41 ± 0.01
Random	0.73 ± 0.03	0.61 ± 0.03	0.51 ± 0.02
ρ = 0.1
BALD	0.78 ± 0.03	0.62 ± 0.01	0.48 ± 0.01
Entropy	0.71 ± 0.02	0.57 ± 0.01	0.44 ± 0.02
Random	0.76 ± 0.02	0.64 ± 0.02	0.54 ± 0.01
Table 1: Effect of annotation error on active learning by randomly shuffling a fraction ρ of the labels. The test log-likelihood is averaged over 5 runs.
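The noise protocol of Table 1 can be reproduced with a small helper. `corrupt_labels` is an illustrative function of our own: it shuffles a fraction of the labels among themselves, so the class counts are preserved while the selected samples receive wrong labels.

```python
import numpy as np

def corrupt_labels(labels, noise_fraction, seed=0):
    """Shuffle a random fraction of the labels to simulate annotation error."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n_noisy = int(len(labels) * noise_fraction)
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    labels[idx] = labels[rng.permutation(idx)]   # permute labels within the subset
    return labels
```

With noise_fraction = 0 the labels are unchanged; with a positive fraction, the label multiset stays the same but individual samples are mislabelled.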

Effect of model convergence

Because we have no control over the training regime at each time step, it is hard to train the model to an optimal solution. With fully annotated datasets, we can fine-tune our training setup with hyper-parameter search or train for days at a time. In a production environment, we are limited in our ability to best train the model. In consequence, the model may be under or overfitted to the current dataset and provide flawed uncertainty estimations.

To confirm our hypothesis, we vary the number of epochs the model is trained for. As seen in Fig. 2, underfitted models are highly affected, while overfitted models suffer but remain performant. This is due to a poor model fit that leads to a wrong estimation of the model uncertainty. In the Annex, we present the difference in performance between BALD and Random.

Figure 2: Effect of different training schedules. By comparing overfitted and underfitted models, we assess the impact of uncertainty quality on active learning. Performance averaged over 5 runs.

In this section, we investigated the effect of two common deep learning pathologies on active learning. In summary, prior knowledge of the annotation quality and of how long to train the model can help when applying active learning.

4.2 Efficient techniques for active learning

An important problem with current active learning pipelines is the delay between active learning steps. To make active learning efficient, we propose several techniques that maintain performance while speeding up the training or inference phases.

Query size

An important hyper-parameter in batch active learning is how many samples should be labelled at each active learning step (gal2017deep; tsymbalov2019deeper). In a real-world scenario, we cannot ask the annotation team to wait between steps, especially in a crowd-sourcing environment. Therefore, we need a configuration that benefits from good uncertainty estimation quality while keeping a reasonable runtime. We present our findings in Table 2, where we tested this approach on CIFAR10. From our results, increasing the query size does decrease performance, especially at 10,000 labels, where the gap between BALD and Entropy becomes thinner as the query size grows.

Dataset size	5000	10000	20000
Random	0.71 ± 0.03	0.54 ± 0.01	0.42 ± 0.05
BALD	0.59 ± 0.01	0.46 ± 0.05	0.34 ± 0.02
Entropy	0.69 ± 0.06	0.55 ± 0.11	0.34 ± 0.00
BALD	0.61 ± 0.03	0.43 ± 0.01	0.35 ± 0.03
Entropy	0.67 ± 0.05	0.49 ± 0.04	0.35 ± 0.00
BALD	0.61 ± 0.07	0.42 ± 0.02	0.36 ± 0.01
Entropy	0.61 ± 0.07	0.47 ± 0.00	0.37 ± 0.00
BALD	0.77 ± 0.05	0.53 ± 0.03	0.37 ± 0.03
Entropy	0.87 ± 0.01	0.52 ± 0.07	0.35 ± 0.01
Table 2: Effect of increasing the query size on CIFAR10. Performance averaged over 5 runs. BALD is weaker when used with a large query size, making Entropy competitive.

Limit pool size

The most time-consuming part of active learning is the uncertainty estimation step. In particular, this step is expensive when using techniques that require Monte-Carlo sampling, such as MC-Dropout or Bayesian neural networks. Of course, this problem is embarrassingly parallel, but in a low-budget deployment one does not have access to the resources required to parallelize this task cheaply. A simple idea to solve this is to randomly select unlabelled samples from the pool instead of using the entire pool. We test this idea by varying the number of samples selected for uncertainty estimation. From our experiments (figure in the Annex), we show that the performance is not affected when using less than 25% of the pool. By doing this, we can speed up this phase by a factor of 3.
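This pool-subsampling step can be sketched as follows. `select_query` is an illustrative helper of our own (not BaaL's API): it optionally scores only a random subset of the pool before taking the top-k.

```python
import numpy as np

def select_query(uncertainty_fn, model, pool_idx, query_size,
                 max_pool=None, seed=0):
    """Pick the next query, scoring at most max_pool random pool samples."""
    rng = np.random.default_rng(seed)
    pool_idx = np.asarray(pool_idx)
    if max_pool is not None and len(pool_idx) > max_pool:
        # Limit the expensive uncertainty estimation to a random subset.
        pool_idx = rng.choice(pool_idx, size=max_pool, replace=False)
    scores = np.asarray(uncertainty_fn(model, pool_idx))
    top = np.argsort(scores)[-query_size:]      # top-k most uncertain
    return pool_idx[top]
```

With `max_pool=None` this reduces to standard top-k selection; setting `max_pool` to a fraction of the pool trades a little selection quality for a proportional speed-up of the scoring phase.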

In this section, we proposed two approaches to make active learning usable in production. First, we can increase the query size beyond values previously used in the literature. Second, we can select the next batch using only a small subset of the pool.

5 Case study: Mio-TCD

Few datasets have been proposed to mimic a "real-world" situation where the dataset suffers from labelling noise, duplicates, or class imbalance. Mio-TCD (luo2018mio) was recently proposed to showcase these issues. The dataset contains 500,000 samples split into 11 classes with heavy class imbalance; for example, the training set contains 50% Cars and 20% Background.

Benefits of active learning.

As shown by gal2017deep (and further in the Annex), active learning helps when used on imbalanced data. We can verify this by comparing the performance on underrepresented classes in Mio-TCD. From the current leaderboard, we select two difficult classes: Single-Unit Truck and Bicycle. We use the same setup as before but limit the size of the pool to 20,000 samples.

Figure 3: Performance of different active learning procedures on Mio-TCD. While any active learning method is strong against random, BALD is especially strong at the beginning of the labelling process. Performance averaged over 5 runs.

In Fig. 3, we present the F1 scores for the two most difficult classes. One can clearly see the impact of using active learning in this setting. With active learning, underrepresented classes are quickly selected and get decent performance. In addition, the performance for the most populous class Cars stays similar across acquisition functions (figure in Annex).

This experiment shows that using active learning on non-academic datasets is highly beneficial, and it highlights the need for the active learning community to adopt new benchmarks when comparing methods.

6 Conclusion

In this paper, we have investigated the impact of uncleaned data on deep active learning. We also proposed several techniques to make active learning usable in a real-world environment. Subsequently, we tested our findings on a real-world dataset, Mio-TCD, showing that active learning can be used in this setting. As a result of this study, we introduce our newly released Bayesian active learning library, which can be useful to both researchers and developers (see Annex).

In summary, we show that active learning can be used successfully in a production setting on real data. We hope this work can accelerate the application of active learning to real-world projects and improve annotation quality by extracting more information per sample. Interesting areas of research include the study of the interaction between the human and the machine during a labelling task.


Appendix A Implementation details

Our methodology is as follows. We train a VGG-16 (zhang2015accelerating) pretrained on ImageNet (imagenet_cvpr09). Our initial training set contains 500 samples. We estimate the uncertainty using 20 MC samples and label the 100 most uncertain elements. Following gal2017deep, we reset the weights to their initial value between steps.

Appendix B Imbalanced datasets

How to deal with imbalanced datasets is an entire area of research (krawczyk2016learning), but little has been done for the case where the class distribution is not known a priori. In consequence, the model may quickly overfit to the more popular classes, reducing the effectiveness of the active learning procedure. From gal2017deep, it is known that Bayesian active learning favors underrepresented classes, but we find the reported experiments too simple. We test this hypothesis in a controlled environment where we can set the number of underrepresented classes.

In Table 3, we take the standard CIFAR100 dataset and mimic an imbalanced dataset where a few classes have a high number of examples. A class selected to be underrepresented sees its number of samples reduced by 75%. When we increase the number of underrepresented classes, the gain of using MC-Dropout versus random sampling becomes more obvious. This is because regions of the learned manifold associated with underrepresented classes are highly uncertain; in consequence, these regions are selected for labelling very early in the process.
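The imbalancing protocol can be sketched as follows. `make_imbalanced` is a hypothetical helper of our own that keeps only 25% of the samples of each selected class and returns the indices of the retained samples.

```python
import numpy as np

def make_imbalanced(labels, underrepresented, keep_fraction=0.25, seed=0):
    """Return indices of a subset where selected classes are reduced."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    for c in underrepresented:
        idx = np.where(labels == c)[0]
        n_drop = int(len(idx) * (1 - keep_fraction))
        # Randomly drop 75% of this class's samples.
        keep[rng.choice(idx, size=n_drop, replace=False)] = False
    return np.where(keep)[0]
```

Applying this to the CIFAR100 labels with a growing list of underrepresented classes reproduces the controlled setting studied in Table 3.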

Dataset size	5000	10000	20000
BALD	4.39 ± 0.40	3.99 ± 0.01	3.57 ± 0.05
Entropy	4.71 ± 0.02	4.54 ± 0.07	3.94 ± 0.01
Random	4.52 ± 0.09	4.10 ± 0.03	3.71 ± 0.05
BALD	4.40 ± 0.03	4.04 ± 0.03	3.61 ± 0.08
Entropy	4.76 ± 0.02	4.68 ± 0.08	4.00 ± 0.01
Random	4.58 ± 0.08	4.18 ± 0.04	3.75 ± 0.01
BALD	4.49 ± 0.08	4.07 ± 0.02	3.66 ± 0.04
Entropy	4.83 ± 0.04	4.60 ± 0.14	4.07 ± 0.28
Random	4.62 ± 0.03	4.21 ± 0.02	3.76 ± 0.04
Table 3: Effect of using active learning on imbalanced versions of CIFAR100; each row group corresponds to an increasing number of underrepresented classes, each keeping 25% of its data. From gal2017deep, we know that BALD is robust to imbalanced datasets, but that study was not extensive. While BALD is robust to imbalanced datasets, the effect is catastrophic when using Entropy. Performance averaged over 5 runs.

Appendix C Effect of convergence

In Fig. 4, we compute the difference in performance between BALD and Random; we call this measure the active gain. When using an underfitted model, the gain becomes negative, i.e., one would be better off using random selection.

Figure 4: Gain of using active learning when varying the number of training epochs. An underfitted model harms the model training, and in this case simply using random selection would have been better.

Appendix D Effect of reducing the pool size

As part of the experiments, we test whether limiting the pool size affects the performance of active learning. Our experiments in Fig. 5 show that whether we calculate the uncertainty on the whole pool or on a randomly selected subset, the performance of active learning is not affected. This leads to the interesting outcome that limiting the uncertainty calculations (the most expensive part of an active learning loop) enables faster active learning loops in a production setup.

Figure 5: Effect of reducing the size of the pool on CIFAR100. -1 indicates no reduction. For all heuristics, the performance is not affected by the size of the pool showing that AL can be efficient when tuned properly. Performance averaged over 5 runs.
Figure 6: F1 for the class car. BALD is great for underrepresented classes while not affecting more popular classes. Entropy decreases the performance on this class.

Appendix E Bayesian active learning Library (BaaL)

All the experiments in this paper were done using our publicly available Bayesian active learning library. The goal of this library is to provide an easy-to-use but complete setup to test active learning on any project with a few lines of code. We include features that current active learning libraries do not support. In particular, Bayesian methods such as MC-Dropout or Coresets are not widely available, and there is no standard implementation of the active learning loop. Furthermore, research codebases are often hard to read and maintain. Our proposed unified API can satisfy both research and industrial users.

Our recently published open-source package, BaaL, aims at accelerating the transition from research to production. The core philosophy behind our library is to provide researchers with a well-designed API so that they can focus on their novel ideas and not on technical details. Our library proposes a task-agnostic system where one can mix and match any set of acquisition functions and uncertainty estimation methods. The library consists of three main components:

  1. Dataset management, to keep track of and manage the labelled and unlabelled data.

  2. Bayesian Methods i.e. MC-Dropout, MC-DropConnect and so on.

  3. Acquisition functions i.e. BALD, BatchBALD, Entropy and more.
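As an illustration of the first component, the labelled/unlabelled bookkeeping can be reduced to a boolean mask over the dataset indices. The class below is a simplified sketch of this idea, not the actual BaaL API.

```python
import numpy as np

class ActiveDataset:
    """Minimal labelled/unlabelled bookkeeping (illustrative, not BaaL's API)."""

    def __init__(self, n_samples):
        self._labelled = np.zeros(n_samples, dtype=bool)

    @property
    def labelled_idx(self):
        return np.where(self._labelled)[0]

    @property
    def pool_idx(self):
        return np.where(~self._labelled)[0]

    def label(self, indices):
        """Mark the given samples as labelled, removing them from the pool."""
        self._labelled[np.asarray(indices)] = True

    def label_randomly(self, n, seed=0):
        """Label n random pool samples, e.g., to build the initial training set."""
        rng = np.random.default_rng(seed)
        self.label(rng.choice(self.pool_idx, size=n, replace=False))
```

A training loop would index the underlying dataset with `labelled_idx` and score the samples at `pool_idx` with an acquisition function.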

We provide full support for PyTorch (paszke2017automatic) deep learning modules, but our acquisition functions, which are the most important part of active learning, are implemented in NumPy (oliphant2006guide) and hence can be used on any platform. Our data management module keeps track of what is labelled and what is unlabelled. We also provide facilitator methods to label a data point, update the pool of unlabelled data, and randomly label a portion of the dataset. In our Bayesian module, we provide utilities to make any PyTorch model Bayesian with a single instruction. We also provide training, testing, and active learning loops that facilitate the active training procedure. Our acquisition functions are up-to-date with state-of-the-art methods. We provide easy-to-follow tutorials for each section of the library so that the user understands how each component works. Finally, our library is a member of the PyTorch Ecosystem, which is reserved for libraries with outstanding documentation.

Our road-map is outlined in the repository. Our current focus includes model calibration and semi-supervised learning. As more researchers contribute their methods to our library, we aim to become the standard Bayesian active learning library.