DEAL: Deep Evidential Active Learning for Image Classification

07/22/2020 · Patrick Hemmer et al. · KIT

Convolutional Neural Networks (CNNs) have proven to be state-of-the-art models for supervised computer vision tasks, such as image classification. However, large labeled data sets are generally needed for the training and validation of such models. In many domains, unlabeled data is available but labeling is expensive, for instance when specific expert knowledge is required. Active Learning (AL) is one approach to mitigate the problem of limited labeled data. Through selecting the most informative and representative data instances for labeling, AL can contribute to more efficient learning of the model. Recent AL methods for CNNs propose different solutions for the selection of instances to be labeled. However, they do not perform consistently well and are often computationally expensive. In this paper, we propose a novel AL algorithm that efficiently learns from unlabeled data by capturing high prediction uncertainty. By replacing the softmax standard output of a CNN with the parameters of a Dirichlet density, the model learns to identify data instances that contribute efficiently to improving model performance during training. We demonstrate in several experiments with publicly available data that our method consistently outperforms other state-of-the-art AL approaches. It can be easily implemented and does not require extensive computational resources for training. Additionally, we are able to show the benefits of the approach on a real-world medical use case in the field of automated detection of visual signals for pneumonia on chest radiographs.






I Introduction

Over the last years, Convolutional Neural Networks (CNNs) have driven unprecedented advances in prediction accuracy in the realm of computer vision, even exceeding human-level performance for specific image classification tasks [14]. However, one major drawback is their dependency on vast amounts of labeled data. Even though more and more data is becoming available, labeling data instances is often costly. In many application domains, such as medical diagnosis or manufacturing, the knowledge of highly trained experts is essential. This results in the need for techniques that reduce data labeling effort, especially when labeling resources are scarce. Multiple techniques can significantly decrease the amount of labeled data required, e.g., Inductive Programming [29], Semi-Supervised Learning [26], External Memories [39], or Active Learning (AL). The key idea of AL is that a machine learning model can achieve a desired performance level with fewer training instances if it can select the data which is most beneficial to its learning process [35]. When applying AL, a model is initially trained on a small labeled data set. Repeatedly, new data instances are selected through an acquisition function, labeled by an expert, and added to the labeled data set until a specific labeling budget is depleted [6].
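This query-label-retrain cycle can be sketched in a few lines of Python. This is a hedged illustration only: the `model`, `acquire`, and `oracle` objects are generic placeholders, not part of any specific AL framework.

```python
import random

def active_learning_loop(model, unlabeled, acquire, oracle, batch_size, budget):
    """Generic pool-based AL loop: seed, query, label, retrain (a sketch)."""
    # Seed with a small labeled set drawn uniformly at random.
    seed = random.sample(sorted(unlabeled), batch_size)
    labeled = {x: oracle(x) for x in seed}
    unlabeled -= set(seed)
    model.fit(labeled)
    while len(labeled) < budget and unlabeled:
        # Score the remaining pool with the acquisition function; take the top batch.
        batch = sorted(unlabeled, key=lambda x: acquire(model, x), reverse=True)[:batch_size]
        for x in batch:
            labeled[x] = oracle(x)       # expert annotation
            unlabeled.discard(x)
        model.fit(labeled)               # retrain on the grown labeled set
    return model
```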

Intensive research has been conducted on AL over the past years, and it has been applied successfully to a variety of use cases, such as information extraction from text documents [32] or anomaly detection. However, one major challenge in AL is dealing with high-dimensional data, such as images [37]. Multiple researchers have addressed this challenge and developed several methods for the classification of image data with CNNs [4, 11, 33, 41, 42].

In the work at hand, we propose Deep Evidential Active Learning (DEAL), an AL algorithm that selects unlabeled data instances for annotation based on prediction uncertainty. Uncertainty estimates are derived by replacing the softmax output function of a CNN with the parameters of a Dirichlet density, as proposed by Sensoy et al. (2018) [34]. The main contributions of our work are threefold: First, we apply this modified CNN to AL, which enables the model to form a more accurate opinion about the most informative samples. In each AL round, the uncertainty estimates are used to query a new batch of unlabeled data instances for annotation until the labeling resources are depleted. Second, we demonstrate in extensive experiments on MNIST and CIFAR-10 that our proposed method consistently outperforms other state-of-the-art AL approaches. Lastly, the application of our method to the use case of detecting visual signals for pneumonia in pediatric chest X-ray images stresses the benefits of DEAL: Its implementation would lead to a 34.52% reduction in the number of labeled images necessary to achieve a test accuracy of 90%, in comparison to random acquisition.

Our remaining work is structured as follows: In Section II, we review existing AL approaches for image classification, in particular those compatible with CNNs. Our method’s theoretical foundations are outlined and formalized in Section III. In Section IV, we conduct a thorough experimental evaluation of DEAL. Section V concludes our work.

II Related Work

Generative:
  - cGAN: Mahapatra et al. (2018) [27]
  - ASAL: Mayer and Timofte (2020) [28]
Pool-Based:
  - Diversity: core-set, Sener and Savarese (2017) [33]
  - Combination: Kirsch et al. (2019) [21]; Ash et al. (2019) [3]
  - Uncertainty, Non-Ensemble, Softmax: Wang et al. (2016) [41]
  - Uncertainty, Non-Ensemble, Non-Softmax: Our work; Yoo and Kweon (2019) [42]
  - Uncertainty, Ensemble: Beluch et al. (2018) [4]; Gal et al. (2017) [11]
TABLE I: An overview of state-of-the-art AL methods for CNNs.

AL has been intensively researched over the past decades. Settles (2009) [35] provides a comprehensive overview of the most commonly used query strategy frameworks. However, it does not take into account approaches compatible with CNNs, as AL research on image data was then predominantly focused on methods such as Support Vector Machines (SVMs).

In contrast to SVMs, CNNs can capture spatial information in images—the main reason why they have become state-of-the-art technology for image classification tasks [16], and the primary motivation for researchers to develop AL methods compatible with CNNs. To the best of our knowledge, Table I summarizes the latest AL approaches for CNNs applied to image data. In the following, we divide them into generative and pool-based approaches.


Generative methods use Generative Adversarial Networks (GANs) to generate informative samples which are added to the training set. To realize this, Mahapatra et al. (2018) [27] condition their GAN (cGAN) on a real image pool. In contrast, in their method called ASAL, Mayer and Timofte (2020) [28] use the generated images to retrieve similar real-world images and add them to the training set after annotation.

Pool-based approaches make use of different acquisition strategies to sample the most informative data. We divide them into diversity- and uncertainty-based approaches as well as combinations of both. Diversity-based methods pursue the idea of selecting samples that represent the unlabeled data pool most adequately. Sener and Savarese (2017) [33] frame AL as a core-set selection problem by minimizing the Euclidean distance in the model's feature space between sampled and non-selected data points. The aim is to query a subset of samples from the data pool such that a model trained on this subset performs comparably to a (hypothetical) model trained on the whole data set. However, distance-based methods like core-set have the disadvantage that distance metrics can concentrate in high-dimensional space, so that distances between data elements appear nearly identical [8]. Uncertainty-based approaches assume that the more uncertain a model is about a prediction, the more informative the corresponding data must be for the model. Wang et al. (2016) [41] query the most informative samples by applying least confidence, minimal margin, and entropy acquisition functions to the class probabilities of the softmax output. However, a limitation of this approach is that a model can be uncertain in its predictions even with a high softmax output [12]. Therefore, Gal and Ghahramani (2016) [10] address the representation of uncertainty from a Bayesian perspective and propose a framework for modeling uncertainty in deep neural networks with dropout at inference time. It obtains prediction uncertainty estimates by conducting multiple forward passes of each data instance through the model with dropout enabled, so that the effective model and the corresponding predictions differ between forward passes. This technique, called Monte Carlo (MC-)Dropout, yields more accurate uncertainty approximations than single softmax point estimates, as it approximates a distribution over the parameters. Applied to AL, it results in the acquisition of more informative samples and thus faster model learning [11]. Instead of applying MC-Dropout, Beluch et al. (2018) [4] approximate such a distribution through a full model ensemble. Benchmarking against MC-Dropout, they demonstrate that an ensemble of CNNs infers better-calibrated predictive uncertainties and thus further enhances AL performance. An alternative approach is proposed by Yoo and Kweon (2019) [42]: they attach an extra loss prediction model to the network, which learns to predict the losses of unlabeled samples, so that the data points expected to have high losses can be queried. Moreover, several approaches propose to combine diversity- with uncertainty-based data acquisition. As CNNs are trained in a batch setting, selecting instances solely based on uncertainty entails the risk of redundancy in batch-wise queried data, which can in some cases lead to worse performance than random data selection: each of the selected points might be informative by itself, but not jointly. In this context, Kirsch et al. (2019) [21] improve the performance of the acquisition function BALD (Bayesian Active Learning by Disagreement) [17] in the batch setting (BatchBALD). They query instances by calculating the mutual information between a joint of multiple data points and the model parameters. An alternative approach, called Batch Active learning by Diverse Gradient Embeddings (BADGE) [3], incorporates both diversity and uncertainty for batch acquisition by measuring uncertainty through gradient embeddings and diversity through sampling instances via the k-MEANS++ initialization scheme.


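For illustration, the variation ratio used with MC-Dropout-style sampling can be computed from the class votes of T stochastic forward passes. The following NumPy sketch assumes the per-pass class probabilities have already been collected; the function name is ours.

```python
import numpy as np

def variation_ratio(sampled_probs):
    """Variation ratio from T stochastic forward passes (MC-Dropout style).

    sampled_probs: array of shape (T, num_classes) with the class probabilities
    from T dropout-enabled passes of one input. Returns 1 - f_m / T, where f_m
    is how often the modal (most frequent) class was predicted.
    """
    votes = np.argmax(sampled_probs, axis=1)      # predicted class per pass
    _, counts = np.unique(votes, return_counts=True)
    f_m = counts.max()                            # frequency of the modal class
    return 1.0 - f_m / len(votes)
```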
Our method DEAL belongs to the uncertainty-based AL approaches, as we acquire each data batch based on uncertainty estimates from a Dirichlet distribution that is placed on the class probabilities. Inferring uncertainty estimates from the softmax probabilities entails the risk that a model can be uncertain in its predictions even with a high softmax output [12], as mentioned earlier. Approaches mitigating this issue require either passing each data point multiple times through the network using MC-Dropout [11], inferring uncertainty estimates through an ensemble of several models [4], or attaching a separate model to the network [42]. The first two options have the drawback that each acquisition step becomes increasingly time-consuming, whereas the latter involves the implementation overhead of an additional learning loss module.

Acquiring unlabeled data points in each AL round using DEAL has (a) the advantage of deriving high-quality uncertainty estimates leading to faster learning of the model, and (b) requires only one forward pass of each data instance through the network.

III Methodology

In this section, we first outline the necessary theoretical foundations for our AL approach. Subsequently, we introduce the acquisition function concept and formally describe our method as an algorithm.

III-A Theory of Evidence

Our AL algorithm is based on the method of quantifying uncertainty in neural networks, as proposed by Sensoy et al. (2018) [34], which originates from the Dempster-Shafer Theory of Evidence (DST) [7], a generalization of the Bayesian theory to subjective probabilities. Using subjective logic, DST can be formalized as a Dirichlet distribution, and thus quantify belief masses and uncertainty [18].

A softmax function is typically used in the output layer of CNNs for classification tasks: it provides class probability estimates for each class in the form of point estimates. However, a model can be uncertain in its predictions even with a high softmax output for a particular class [12]. In contrast to Gal et al. (2017) [11] and Beluch et al. (2018) [4], who approximate a distribution over class probabilities through MC-Dropout and model ensembles, we directly model a Dirichlet posterior with its hyperparameters learned from the data. In detail, the idea is to replace the softmax function of the CNN with a nonlinear activation function such as Softsign and use the outputs as the evidence vector of a Dirichlet distribution. Moreover, the loss function is adapted such that it comprises both the output loss and a regularization term in the form of a Kullback-Leibler (KL) divergence, which regularizes the predictive distribution. In the following, we summarize the method to infer prediction uncertainty [34], as this is the basis for our AL algorithm.

First, we define $K$ mutually exclusive singletons (one per class), a non-negative belief mass $b_k$ assignable to each of them, and an overall uncertainty mass $u$. Assuming the singletons to be the outputs of a CNN ($K$-class classification), we can formulate the following equation with $b_k \geq 0$ for $k = 1, \dots, K$ and $u \geq 0$:

$$ u + \sum_{k=1}^{K} b_k = 1. \qquad (1) $$

The variable $b_k$ is derived from the $k$-th Softsign output and is interpreted as the belief mass of the $k$-th class, whereas $u$ is the uncertainty mass over the outputs. Moreover, let $e_k \geq 0$ be the evidence for the $k$-th output. Then, the belief masses and the uncertainty, with $S = \sum_{k=1}^{K} (e_k + 1)$, can be defined as:

$$ b_k = \frac{e_k}{S}, \qquad u = \frac{K}{S}. \qquad (2) $$


In this approach, evidence quantifies the support from the data that results in the classification of a sample into a particular class; it thus differs from the Bayesian nomenclature. Additionally, assigning a belief mass corresponds to a Dirichlet distribution with the parameters $\alpha_k = e_k + 1$. Therefore, a subjective opinion can be formed from the parameters of the corresponding Dirichlet distribution via $b_k = (\alpha_k - 1)/S$, where $S = \sum_{k=1}^{K} \alpha_k$ represents the Dirichlet strength. Contrary to the standard softmax classifier, which assigns a probability to each possible class of a sample, a Dirichlet distribution denotes the density of each such probability assignment on the basis of its parameters, which are derived from the evidence vector. Specifically, a Dirichlet distribution is a probability density function over possible values of the probability mass function $p = [p_1, \dots, p_K]$. It has parameters $\alpha = [\alpha_1, \dots, \alpha_K]$ and the form

$$ D(p \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \qquad (3) $$

for $p$ in the $K$-dimensional unit simplex, where $B(\alpha)$ is the $K$-dimensional multinomial beta function [22]. In the presence of an opinion, the expected probability for the $k$-th class results from the mean of the respective Dirichlet distribution:

$$ \hat{p}_k = \frac{\alpha_k}{S}. \qquad (4) $$


A CNN classifies a sample $x_i$ by conceiving an opinion as a Dirichlet distribution $D(p_i \mid \alpha_i)$, where $p_i$ describes the assigned class probabilities. Given a sample $x_i$, $f(x_i \mid \Theta)$ is the evidence vector predicted by the CNN with network parameters $\Theta$. Thus, the parameters of the Dirichlet distribution are $\alpha_i = f(x_i \mid \Theta) + 1$, and we can calculate the mean $\alpha_i / S_i$ to estimate the class probabilities.
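As a minimal illustration of Equations 2 to 4, the following NumPy sketch maps a non-negative evidence vector to belief masses, the uncertainty mass, and the expected class probabilities; the function name is ours.

```python
import numpy as np

def dirichlet_opinion(evidence):
    """Turn a non-negative evidence vector e into subjective-logic quantities:
    alpha = e + 1, S = sum(alpha), b_k = e_k / S, u = K / S, and the expected
    class probabilities p_hat = alpha / S (the mean of the Dirichlet)."""
    e = np.asarray(evidence, dtype=float)
    alpha = e + 1.0                 # Dirichlet parameters
    S = alpha.sum()                 # Dirichlet strength
    belief = e / S                  # per-class belief masses
    uncertainty = len(e) / S        # overall uncertainty mass
    p_hat = alpha / S               # expected class probabilities
    return belief, uncertainty, p_hat
```

By construction the belief masses and the uncertainty always sum to one, matching Equation 1.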

For a data sample $x_i$, the variable $y_i$ denotes the ground-truth class in the form of a one-hot encoded vector, and $\alpha_i$ represents the parameters of the Dirichlet density on the predictors. Additionally, $D(p_i \mid \alpha_i)$ is a prior on the likelihood $\mathrm{Mult}(y_i \mid p_i)$, with $\mathrm{Mult}$ being a multinomial mass function. Then, the loss function can be defined as follows, using the technique of Type II Maximum Likelihood Estimation:

$$ \mathcal{L}_i(\Theta) = \sum_{k=1}^{K} y_{ik} \left( \log S_i - \log \alpha_{ik} \right). \qquad (5) $$


To ensure that the total evidence decreases to zero for a sample that cannot be correctly classified, the KL-divergence is incorporated into the loss function with the annealing coefficient $\lambda_t = \min(1, t/10)$, where $t$ denotes the current training epoch:

$$ \mathcal{L}(\Theta) = \sum_{i=1}^{N} \mathcal{L}_i(\Theta) + \lambda_t \sum_{i=1}^{N} \mathrm{KL}\left[ D(p_i \mid \tilde{\alpha}_i) \,\big\|\, D(p_i \mid \langle 1, \dots, 1 \rangle) \right]. \qquad (6) $$

The term $D(p_i \mid \langle 1, \dots, 1 \rangle)$ refers to the uniform Dirichlet distribution, and $\tilde{\alpha}_i = y_i + (1 - y_i) \odot \alpha_i$, with $\odot$ referring to the Hadamard (element-wise) product. We can calculate the KL-divergence as follows, with $\Gamma(\cdot)$ denoting the gamma function and $\psi(\cdot)$ the digamma function:

$$ \mathrm{KL}\left[ D(p_i \mid \tilde{\alpha}_i) \,\big\|\, D(p_i \mid \langle 1, \dots, 1 \rangle) \right] = \log \frac{\Gamma\left( \sum_{k=1}^{K} \tilde{\alpha}_{ik} \right)}{\Gamma(K) \prod_{k=1}^{K} \Gamma(\tilde{\alpha}_{ik})} + \sum_{k=1}^{K} (\tilde{\alpha}_{ik} - 1) \left[ \psi(\tilde{\alpha}_{ik}) - \psi\left( \sum_{j=1}^{K} \tilde{\alpha}_{ij} \right) \right]. \qquad (7) $$


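A plain-Python sketch of the per-sample loss described above follows. The numerical digamma via a finite difference of `math.lgamma` is our shortcut to keep the example dependency-free; a production implementation would use an autodiff framework.

```python
from math import lgamma, log

def _digamma(x, h=1e-6):
    """Numerical digamma (derivative of log-gamma); adequate for a sketch."""
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def kl_to_uniform(alpha_tilde):
    """KL( Dir(alpha_tilde) || Dir(1, ..., 1) ), cf. the KL term of the loss."""
    K, S = len(alpha_tilde), sum(alpha_tilde)
    term1 = lgamma(S) - lgamma(K) - sum(lgamma(a) for a in alpha_tilde)
    term2 = sum((a - 1.0) * (_digamma(a) - _digamma(S)) for a in alpha_tilde)
    return term1 + term2

def deal_loss(evidence, y_onehot, epoch, anneal_at=10):
    """Per-sample Type II ML loss plus the annealed KL regularizer."""
    alpha = [e + 1.0 for e in evidence]
    S = sum(alpha)
    nll = sum(y * (log(S) - log(a)) for y, a in zip(y_onehot, alpha))
    # Remove the evidence of the true class before regularizing toward uniform.
    alpha_tilde = [y + (1.0 - y) * a for y, a in zip(y_onehot, alpha)]
    lam = min(1.0, epoch / anneal_at)
    return nll + lam * kl_to_uniform(alpha_tilde)
```

The annealing coefficient makes misplaced evidence increasingly costly as training progresses, which is what drives the total evidence of misclassified samples toward zero.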
A CNN with these modifications is the basis for our proposed AL algorithm.

III-B Uncertainty-Based AL

Before formally defining our AL algorithm, we introduce the concept of an acquisition function including the minimal margin uncertainty measure.

III-B1 Acquisition Function

Given a model $M$, an unlabeled data pool $D_u$, a labeled data pool $D_l$, and observations $x \in D_u$, an AL algorithm uses an acquisition function $a(x, M)$ to choose the next data sample(s) to be queried [11]. We define $x^*$ as the immediate next sample to be queried, satisfying

$$ x^* = \operatorname{arg\,max}_{x \in D_u} \; a(x, M). \qquad (8) $$

For $a$, we apply the uncertainty measure minimal margin: choose the sample with the smallest margin

$$ \hat{p}_{c_1}(x) - \hat{p}_{c_2}(x), \qquad (9) $$

where $c_1$ and $c_2$ are the first and second most probable class labels of the respective sample.

The CNN’s expected class probabilities, derived from Equation 4, serve as input. Besides the minimal margin uncertainty measure, other metrics such as Shannon entropy [36] are applicable as well. However, we found in preliminary experiments that the DEAL algorithm yields the best results with the minimal margin measure.
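Minimal-margin batch selection on top of the expected probabilities from Equation 4 can be sketched in NumPy as follows; the function names are ours.

```python
import numpy as np

def minimal_margin_scores(evidence_batch):
    """Margin between the two most probable classes per sample (Eq. 9 sketch);
    DEAL queries the samples with the *smallest* margin."""
    alpha = np.asarray(evidence_batch, dtype=float) + 1.0
    p_hat = alpha / alpha.sum(axis=1, keepdims=True)   # expected probabilities
    top2 = np.sort(p_hat, axis=1)[:, -2:]              # two largest per row
    return top2[:, 1] - top2[:, 0]

def select_batch(evidence_batch, k):
    """Indices of the k most uncertain (smallest-margin) pool samples."""
    return np.argsort(minimal_margin_scores(evidence_batch))[:k].tolist()
```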

III-B2 AL Framework

The pool-based AL setting originates from an unlabeled data pool $D_u$. Initially, a model $M$ is trained on a small data set $D_l$, which is drawn uniformly at random from $D_u$; labels are obtained from an expert. In each AL round, $b$ data samples are selected for labeling and added to $D_l$, based on the acquisition function $a$. Subsequently, the model is trained from scratch. This process is repeated until a given labeling budget $B$ is exhausted.

Deriving classification uncertainty by placing a Dirichlet distribution on the class probabilities in combination with the minimal margin uncertainty measure forms the basis of DEAL. We formalize this approach in Algorithm 1.

Input: Unlabeled set $D_u$, labeled set $D_l$, model $M$ with loss function $\mathcal{L}$, acquisition size $b$, labeling budget $B$ (#samples). Result: Updated model $M$.
$a$: acquisition function based on Equation 9;
Compute the number of AL rounds $r = \lfloor B/b \rfloor$;
Set $D_l$ = $b$ samples drawn uniformly at random from $D_u$, labeled by an expert;
Set $D_u = D_u \setminus D_l$;
Initialize parameters $\Theta$ and train $M$ on $D_l$ minimizing $\mathcal{L}$;
for $i = 1, \dots, r$ do
       for $j = 1, \dots, b$ do
             Compute $x^*$ as in Equation 8;
             Request ground-truth label $y^*$ for $x^*$;
             Set $D_l = D_l \cup \{(x^*, y^*)\}$;
             Set $D_u = D_u \setminus \{x^*\}$;
       end for
      Re-initialize $\Theta$ and train $M$ on $D_l$ minimizing $\mathcal{L}$;
end for
Algorithm 1: Deep Evidential Active Learning (DEAL)

IV Experiments

In this section, we first present the experimental scenario used for the evaluation of DEAL on MNIST [24] and CIFAR-10 [23]. We benchmark the performance of DEAL against other state-of-the-art AL approaches and highlight its advantage in terms of acquisition time. Second, we present the results of DEAL applied to a real-world medical use case in the field of automated detection of visual signs for pneumonia on chest radiographs.

Fig. 1: MNIST (upper two) and CIFAR-10 (lower two) test accuracy over the percentage of acquired labeled training data. We benchmark the performance of DEAL against the approaches introduced in Section IV-A. The solid horizontal line represents the CNN trained with all labeled training data. Shaded regions display standard deviations.

IV-A Implementation Details

We evaluate our method on the publicly available scientific data sets MNIST and CIFAR-10. The former comprises a total of 70,000 greyscale images, of which we assign 58,000 to the training set, 2,000 to the validation set, and 10,000 to the test set. The latter consists of 60,000 RGB-images with 48,000 belonging to the training set, 2,000 to the validation set, and 10,000 to the test set. Both data sets involve ten classes each. We choose these two data sets for evaluation of our approach as they differ in terms of image diversity: While MNIST contains inherently many redundant images, CIFAR-10’s images are more diverse [40].

We conduct our experiments with two popular CNN architectures on both data sets: LeNet [25] and ResNet-18 [15]. For both architectures, the standard softmax layer is replaced by a Softsign layer whose output is used as the evidence vector of the Dirichlet distribution. In each AL round, we train the networks from scratch for 100 epochs, using batch size 32 for LeNet, and for ResNet batch size 8 on MNIST and 64 on CIFAR-10. We choose a learning rate of 0.0005 and Adam as the optimizer for both networks. All experiments are implemented in TensorFlow [1], and our code is available online. The experiments are conducted in the following standardized setting: In the first AL round, we sample uniformly at random 100 MNIST and 2,000 CIFAR-10 images and train the model for the defined number of epochs. In each subsequent AL round, a new batch of images with the same acquisition size is selected, added to the labeled data pool, and the model is re-trained from scratch. We repeat this procedure until the test accuracy differs only marginally from that of a model trained on all labeled images. For the LeNet architecture, we stop the AL algorithm after the acquisition of 2,000 MNIST and 42,000 CIFAR-10 images; for the ResNet architecture, we terminate it after the selection of 1,500 MNIST and 30,000 CIFAR-10 images. Each experiment is repeated 5 times, and the average test accuracy including standard deviation is reported.
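The evidence head can be illustrated framework-independently. Note that Softsign maps into (-1, 1), so this sketch clamps negative activations at zero to obtain non-negative evidence; the clamping is our assumption, since the paper only states that Softsign replaces softmax.

```python
import numpy as np

def evidence_head(logits):
    """Map raw network outputs through Softsign, then clamp at zero so the
    result can serve as a non-negative evidence vector (clamping is an
    assumption of this sketch)."""
    softsign = logits / (1.0 + np.abs(logits))
    return np.maximum(softsign, 0.0)

def predict_probs(logits):
    """Expected class probabilities from the Dirichlet mean (alpha / S)."""
    alpha = evidence_head(logits) + 1.0
    return alpha / alpha.sum(axis=-1, keepdims=True)
```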

We compare our AL method with the following state-of-the-art approaches: minimal margin with softmax [41], core-set [33], MC-Dropout [11], and Deep Ensemble [4]. The latter two both use the variation ratio [9] as acquisition function. Gal et al. (2017) [11] and Beluch et al. (2018) [4] apply further uncertainty-based acquisition functions in their work; however, we only use the variation ratio, as it achieves the best AL performance in both papers. Random sampling serves as a baseline for all. Analogously to Beluch et al. (2018) [4], for MC-Dropout, we conduct 25 forward passes, and for Deep Ensemble, each ensemble consists of 5 models with different random initializations.

T-statistic (DEAL minimal margin vs.) | uniform | minimal margin (softmax) | core-set | MC-Dropout (variation ratio) | Deep Ensemble (variation ratio)
LeNet, MNIST | 11.5612* | 5.6201* | 10.7127* | 4.5650* | 5.1262*
LeNet, CIFAR-10 | 7.1938* | 8.2872* | 8.6366* | 7.9109* | 7.4767*
ResNet, MNIST | 7.5776* | 5.5596* | 7.0798* | 6.3425* | 3.4235*
ResNet, CIFAR-10 | 15.0278* | 8.4264* | 10.1137* | 15.0638* | 9.0240*
TABLE II: T-statistics of DEAL paired with the other AL methods for the respective model architectures and data sets. The asterisk denotes statistical significance at the 0.01 level.
Test accuracy | uniform | minimal margin (softmax) | core-set | MC-Dropout (variation ratio) | Deep Ensemble (variation ratio) | DEAL (minimal margin)
95% (MNIST) | 820 (160) | 620 (75) | 1,280 (160) | 860 (136) | 680 (75) | 540 (49)
87% (CIFAR-10) | 28,000 (2,000) | 22,500 (2,958) | 30,000 (0) | 29,000 (1,000) | 22,800 (980) | 21,200 (2,040)
TABLE III: Average number of images over 5 experimental runs to achieve a predefined model performance (ResNet) on the test set. The values in parentheses denote standard deviations.
Data set | uniform | minimal margin (softmax) | core-set | MC-Dropout (variation ratio) | Deep Ensemble (variation ratio) | DEAL (minimal margin)
MNIST | 1.38 | 12.62 | 23.41 | $N_f \cdot 1.90 + 18.37$ | $N_E \cdot \bar{T} + 18.37$ | 12.54
CIFAR-10 | 17.22 | 139.25 | 1,070.79 | $N_f \cdot 1.90 + 143.33$ | $N_E \cdot \bar{T} + 143.33$ | 139.51
TABLE IV: Average acquisition time in seconds over all AL rounds for MNIST and CIFAR-10 (ResNet). The experiments are conducted on a Tesla V100-SXM2-32GB. The number of forward passes for MC-Dropout is denoted by $N_f$, whereas $N_E$ refers to the number of ensemble members. Moreover, $\bar{T}$ is the average training time for one member over all epochs. We use the same setting as specified in Section IV-A.

IV-B Experimental Results

We benchmark DEAL with minimal margin-based acquisition function against the state-of-the-art approaches from Section IV-A using both networks. The test accuracy for MNIST is illustrated in the upper two graphs of Figure 1, whereas the lower two display the results for CIFAR-10. For both networks and data sets, DEAL consistently outperforms all other approaches over all acquisition rounds. In detail, averaged over all 5 experiments and all AL rounds, on MNIST, DEAL outperforms the second-best method by 1.01% (LeNet: minimal margin with softmax) and 1.06% (ResNet: Deep Ensemble), respectively. Concerning CIFAR-10 using LeNet, none of the other approaches yields better test accuracy than random sampling. However, DEAL outperforms this baseline, on average by 1.51%. Regarding the ResNet network, the Deep Ensemble approach yields the second best test accuracy. Here, DEAL achieves an average improvement of 0.51% relative to this method.

In order to demonstrate the statistical significance of these findings, we perform a paired t-test, where the pairs consist of the test accuracy developments over all AL rounds for DEAL and each benchmark method. Table II displays all t-statistics. For both network architectures and data sets, the t-statistic indicates statistical significance at the 0.01 level.
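The paired t-statistic over two matched accuracy curves (one accuracy value per AL round) can be computed directly; this dependency-free sketch mirrors the statistic that `scipy.stats.ttest_rel` reports.

```python
from math import sqrt

def paired_t_statistic(a, b):
    """Paired t-statistic over matched samples, e.g. per-round test accuracies
    of two AL methods evaluated on the same acquisition schedule."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / sqrt(var / n)
```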

In practice, it is often essential to reach a predefined performance threshold with as little labeling effort as possible. Thus, Table III shows the number of images required to achieve a predefined test set accuracy, highlighting the advantages of AL in general and DEAL in particular. For instance, with our approach, a labeling expert would have to label 280 fewer images for MNIST and 6,800 fewer images for CIFAR-10 (compared to random sampling) in order to achieve a test set accuracy of 95% and 87%, respectively. This means that the annotation effort can be reduced by 34.15% for MNIST and 24.29% for CIFAR-10. Lastly, the comparison with the second-best approach at this accuracy level, minimal margin with softmax, shows that for MNIST, 80 fewer images have to be labeled, corresponding to a saving of 12.91%; for CIFAR-10, 1,300 fewer images are required, a reduction of 5.78%.

IV-B1 Acquisition Time Analysis

Even though computing capacities are ever-increasing, the objective of an AL algorithm should not be limited to reaching a desired performance level with less training data, but should also make the selection of the next batch of data samples as time-efficient as possible. Thus, we compare the acquisition time of DEAL with that of the other methods. For each approach, we calculate the average sampling time in seconds (s) over all acquisition rounds for the MNIST and CIFAR-10 AL settings, respectively. The time measurements are displayed in Table IV. With 12.54s for MNIST and 139.51s for CIFAR-10, the acquisition step of DEAL is among the least time-consuming. Only random sampling, with 1.38s (MNIST) and 17.22s (CIFAR-10), requires less time. Minimal margin with softmax, at 12.62s (MNIST) and 139.25s (CIFAR-10), takes approximately the same amount of acquisition time. In contrast, the run-time of the acquisition step for MC-Dropout depends on the number of forward passes through the network at test time; using 25 forward passes, it amounts to 65.87s (MNIST) and 190.83s (CIFAR-10). Similarly, without parallel training of the ensemble members, the acquisition step of Deep Ensemble depends on the number of members and the average training time per member over all epochs, and is therefore dependent on the network and data dimensionality. With 5 members and average training times of 718.69s (MNIST) and 2,253.39s (CIFAR-10) per member, the average acquisition time amounts to 3,611.82s on MNIST and 11,410.28s on CIFAR-10. Lastly, the core-set approach also depends on the dimensionality of the data instances to be queried. While its acquisition run-time on MNIST is, on average, only 10.87s slower than that of DEAL, the difference of 931.28s becomes noticeable on CIFAR-10.

T-statistic | uniform | minimal margin (softmax) | core-set | MC-Dropout (variation ratio) | Deep Ensemble (variation ratio)
DEAL (minimal margin) | 7.3670* | 6.6137* | 9.8789* | 9.1429* | 9.2657*
TABLE V: T-statistics of DEAL paired with the other AL methods on the pediatric pneumonia data set. The asterisk denotes statistical significance at the 0.01 level.

IV-B2 Real-World Data Set

We assess the proposed approach on a real-world medical use case in the field of automated pneumonia detection from chest X-rays. Pneumonia is an infection of the lung that involves a higher global mortality rate among young children than any other infectious disease [31]. In fact, pneumonia causes more deaths among children than HIV/AIDS, malaria, and measles combined [2]. In the United States, about 50,000 people die from pneumonia each year [5]. Chest radiographs are the most commonly used method for diagnosing this disease. However, their interpretation requires the knowledge and experience of highly trained radiologists. Additionally, immediate radiologic interpretation of images is not always available, especially in regions with deficient medical infrastructure. Thus, an automated diagnosis system could not only support radiologists in image interpretation but also transfer knowledge to regions with missing expertise. Rajpurkar et al. (2017) [30] and Varshni et al. (2019) [38] successfully demonstrate the applicability of CNNs to pneumonia detection from chest X-rays. However, one general challenge for building such systems in the medical domain remains the availability of annotated data instances. In order to demonstrate the benefits of our proposed approach for reducing the necessary number of annotated images, we apply DEAL to a pediatric pneumonia data set collected by Kermany et al. (2018) [19]. From the original data set with 5,232 X-ray images of children, we randomly sample a subset of 3,100 images consisting of equal partitions of images with clear lungs and images with visual symptoms of pneumonia. We do this to achieve better comparability with the test accuracy development of the evenly balanced scientific data set analyses.

Fig. 2: The upper two chest X-rays represent lungs without any abnormal opacification in the image. The lower left chest X-ray depicts a viral pneumonia, whereas the lower right image shows a chest X-ray of a bacterial pneumonia.

Figure 2 exemplifies two healthy lungs (upper two) and two lungs suffering from pneumonia (lower two). The images of the data set are recorded at different resolutions. Therefore, we convert the images to greyscale and compress them to 128×128 pixels. We allocate 1,500 images to the training set, 200 to the validation set, and 1,400 to the test set.
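The described preprocessing can be approximated as follows. The nearest-neighbour resize is a simplified, dependency-free stand-in for a proper image-library call (e.g., PIL or OpenCV), and the channel-mean greyscale conversion is our assumption.

```python
import numpy as np

def preprocess(image, size=128):
    """Greyscale + nearest-neighbour resize to size x size, scaled to [0, 1]
    (a minimal stand-in for the paper's preprocessing pipeline)."""
    grey = image.mean(axis=-1) if image.ndim == 3 else image
    h, w = grey.shape
    rows = np.arange(size) * h // size      # nearest-neighbour source rows
    cols = np.arange(size) * w // size      # nearest-neighbour source cols
    return grey[np.ix_(rows, cols)] / 255.0
```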

Fig. 3: Pediatric pneumonia data set test accuracy over the percentage of acquired labeled training data. We benchmark the performance of DEAL against the approaches introduced in Section IV-A. The solid horizontal line represents the CNN trained with all labeled training data. Shaded regions display standard deviations.

We conduct all experiments with the ResNet architecture introduced in Section IV-A. In each AL round, the network is trained from scratch for 100 epochs with a batch size of 8. We use a learning rate of 0.0005 and the Adam optimizer. Initially, we train the model with 64 randomly sampled images. In each AL round, 64 images are queried according to the respective acquisition function. We terminate the AL algorithm after the selection of 704 images, as the test accuracy of a model trained on all images is only slightly superior.

Figure 3 displays the experimental results of DEAL in comparison with all benchmarks. Similar to the analysis on MNIST and CIFAR-10, we note that DEAL consistently outperforms all other approaches.

More precisely, averaged over all 5 experiments and all AL rounds, DEAL outperforms the second-best method—minimal margin with softmax—by 1.76%, and the baseline random sampling by 2.86%. To achieve a test accuracy of 90%, an expert using DEAL could reduce the number of images to be labeled by 64 compared to the second-best approach, and by 243 images in comparison to the baseline random sampling. This equals a saving of 12.19% and 34.52%, respectively. When performing the paired t-test (Table V), again, we find that all tested pairs—DEAL and each benchmark method—indicate statistical significance at the 0.01 level. Finally, it is noteworthy that all methods have a larger standard deviation over the individual runs compared to the experimental results on MNIST and CIFAR-10. We suspect that the reason for this might be the small number of images used to train the CNN.

V Conclusion

In this paper, we present a novel uncertainty-based AL method and demonstrate that it outperforms state-of-the-art AL approaches on the scientific data sets MNIST and CIFAR-10. By deriving class probabilities, and thus uncertainty estimates for unlabeled data instances, from the parameters of a Dirichlet distribution instead of the standard softmax output, our method identifies data points that contribute to more efficient learning. We show that, to achieve a predefined model performance, our approach not only reduces the number of required labeled data instances but is also superior in terms of acquisition time. By applying it to a real-world use case of pediatric pneumonia chest X-ray images, we demonstrate its potential to reduce the number of images that experts have to label. A shortcoming of the approach is its purely uncertainty-based selection of data points: acquiring exclusively on the basis of uncertainty risks introducing redundancy within each acquired batch, which can lead to suboptimal solutions. An interesting avenue for future research is therefore to incorporate a diversity criterion into our approach.
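The core idea summarized above, replacing the softmax output with Dirichlet parameters, can be sketched following the evidential formulation of Sensoy et al. [34]. The variable names are illustrative; the sketch assumes ReLU-style non-negative evidence, with Dirichlet strength S and K classes yielding an uncertainty mass u = K/S:

```python
import numpy as np

def dirichlet_uncertainty(logits):
    """Map raw network outputs to expected class probabilities and a
    scalar uncertainty in (0, 1], per the evidential formulation [34]."""
    evidence = np.maximum(logits, 0.0)   # non-negative evidence, e.g. ReLU
    alpha = evidence + 1.0               # Dirichlet parameters alpha_k = e_k + 1
    strength = alpha.sum()               # Dirichlet strength S = sum_k alpha_k
    probs = alpha / strength             # expected class probabilities
    uncertainty = len(alpha) / strength  # u = K / S
    return probs, uncertainty

# Zero evidence for every class yields maximum uncertainty u = 1
probs, u = dirichlet_uncertainty(np.zeros(3))
```

Instances with high u are the candidates an uncertainty-based acquisition function would query first.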


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation, pp. 265–283. Cited by: §IV-A.
  • [2] R. A. Adegbola (2012) Childhood pneumonia as a global health priority and the strategic interest of the bill & melinda gates foundation. Clinical infectious diseases 54 (suppl_2), pp. S89–S92. Cited by: §IV-B2.
  • [3] J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal (2019) Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671. Cited by: TABLE I, §II.
  • [4] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler (2018) The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9368–9377. Cited by: §I, TABLE I, §II, §II, §III-A, §IV-A.
  • [5] Centers for Disease Control and Prevention (2019) Pneumonia. External Links: Link Cited by: §IV-B2.
  • [6] D. A. Cohn, Z. Ghahramani, and M. I. Jordan (1996) Active learning with statistical models. Journal of Artificial Intelligence Research 4, pp. 129–145. Cited by: §I.
  • [7] A. P. Dempster (1968) A generalization of bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological) 30 (2), pp. 205–232. Cited by: §III-A.
  • [8] D. François (2008) High-dimensional data analysis. From Optimal Metric to Feature Selection, pp. 54–55. Cited by: §II.
  • [9] L. C. Freeman (1965) Elementary applied statistics: for students in behavioral science. John Wiley & Sons. Cited by: §IV-A.
  • [10] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §II.
  • [11] Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183–1192. Cited by: §I, TABLE I, §II, §II, §III-A, §III-B1, §IV-A.
  • [12] Y. Gal (2016) Uncertainty in deep learning. University of Cambridge 1, pp. 3. Cited by: §II, §II, §III-A.
  • [13] Z. Ghafoori, J. Bezdek, C. Leckie, and S. Karunasekera (2019) Unsupervised and active learning using maximin-based anomaly detection. In ECML PKDD. Cited by: §I.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In IEEE ICCV, pp. 1026–1034. Cited by: §I.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-A.
  • [16] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §II.
  • [17] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel (2011) Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §II.
  • [18] A. Jøsang (2018) Subjective logic: a formalism for reasoning under uncertainty. Springer. Cited by: §III-A, §III-A.
  • [19] D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, et al. (2018) Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), pp. 1122–1131. Cited by: §IV-B2.
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A.
  • [21] A. Kirsch, J. van Amersfoort, and Y. Gal (2019) Batchbald: efficient and diverse batch acquisition for deep bayesian active learning. In Advances in Neural Information Processing Systems, pp. 7024–7035. Cited by: TABLE I, §II.
  • [22] S. Kotz, N. Balakrishnan, and N. Johnson (2000) Continuous multivariate distributions. Wiley, New York. Cited by: §III-A.
  • [23] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Citeseer. Cited by: §IV.
  • [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §IV.
  • [25] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio (1999) Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision, pp. 319–345. Cited by: §IV-A.
  • [26] F. Lin and W. W. Cohen (2010) Semi-supervised classification of network data using very few labels. In 2010 International Conference on Advances in Social Networks Analysis and Mining, pp. 192–199. Cited by: §I.
  • [27] D. Mahapatra, B. Bozorgtabar, J. Thiran, and M. Reyes (2018) Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network. In MICCAI, pp. 580–588. Cited by: TABLE I, §II.
  • [28] C. Mayer and R. Timofte (2020) Adversarial sampling for active learning. In The IEEE Winter Conference on Applications of Computer Vision, pp. 3071–3079. Cited by: TABLE I, §II.
  • [29] R. Olsson (1995) Inductive functional programming using incremental program transformation. Artificial intelligence 74 (1), pp. 55–81. Cited by: §I.
  • [30] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. (2017) Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. Cited by: §IV-B2.
  • [31] I. Rudan, C. Boschi-Pinto, Z. Biloglav, K. Mulholland, and H. Campbell (2008) Epidemiology and etiology of childhood pneumonia. Bulletin of the world health organization 86, pp. 408–416B. Cited by: §IV-B2.
  • [32] T. Scheffer and S. Wrobel (2001) Active learning of partially hidden markov models. In Proceedings of the ECML/PKDD Workshop on Instance Selection. Cited by: §I.
  • [33] O. Sener and S. Savarese (2017) Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489. Cited by: §I, TABLE I, §II, §IV-A.
  • [34] M. Sensoy, L. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. In NIPS, pp. 3179–3189. Cited by: §I, §III-A, §III-A.
  • [35] B. Settles (2009) Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §I, §II.
  • [36] C. E. Shannon (1948) A mathematical theory of communication. Bell system technical journal 27 (3), pp. 379–423. Cited by: §III-B1.
  • [37] S. Tong (2001) Active learning: theory and applications. Stanford University. Cited by: §I.
  • [38] D. Varshni, K. Thakral, L. Agarwal, R. Nijhawan, and A. Mittal (2019) Pneumonia detection using cnn based feature extraction. In 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp. 1–7. Cited by: §IV-B2.
  • [39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §I.
  • [40] K. Vodrahalli, K. Li, and J. Malik (2018) Are all training examples created equal? an empirical study. arXiv preprint arXiv:1811.12569. Cited by: §IV-A.
  • [41] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin (2016) Cost-effective active learning for deep image classification. IEEE TCSVT 27 (12), pp. 2591–2600. Cited by: §I, TABLE I, §II, §IV-A.
  • [42] D. Yoo and I. S. Kweon (2019) Learning loss for active learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 93–102. Cited by: §I, TABLE I, §II, §II.