O-MedAL: Online Active Deep Learning for Medical Image Analysis

08/28/2019 ∙ by Asim Smailagic, et al. ∙ Carnegie Mellon University

Active Learning methods create an optimized labeled training set from unlabeled data. We introduce a novel Online Active Deep Learning method for Medical Image Analysis, extending our MedAL active learning framework with new results. Experiments on three medical image datasets show that our novel online active learning model requires significantly fewer labels, is more accurate, and is more robust to class imbalances than existing methods. Our method is also more accurate and computationally efficient than the baseline model. Compared to random sampling and uncertainty sampling, the method uses 275 and 200 (out of 768) fewer labeled examples, respectively. For Diabetic Retinopathy detection, our method attains a 5.88% accuracy improvement over the baseline model when 80% of the dataset is labeled, and the model reaches baseline accuracy when only 40% is labeled.

Introduction

Active Learning (AL) is an emerging technique for machine learning that aims to reduce the amount of labeled training data necessary for the learning task. AL techniques are sequential in nature, employing various sampling methods to select examples from an unlabeled set. The selected examples are labeled and then used to train the model. A carefully designed sampling method can reduce the overall number of labeled data points required to train the model and make the model robust to class imbalances [Ertekin:2007:LBA:1321440.1321461] or implicit bias [Richards_2011] in the dataset.

AL rests on two assumptions: that the training process requires labeled data, and that data is costly to label. Medical image analysis is particularly well framed by these assumptions: the domain offers many opportunities for machine learning solutions, yet labeling medical images requires an extensive investment of time and effort by trained medical personnel. AL can be especially useful in the context of deep learning for medical image analysis, where deep networks typically require large labeled training datasets [litjens2017survey].

We introduced MedAL in our prior work [smailagic2018medal]: a novel AL approach that optimizes the selection of unlabeled samples by combining predictive entropy based uncertainty sampling with a distance function on a learned feature space. MedAL's active sampling mechanism minimizes the required labels by selecting only those unlabeled images that are most informative to the model as it trains.

However, MedAL does not address computational efficiency. Each time new labeled examples are added to the training set, MedAL resets the model weights and re-trains the model on all available labeled data. As a result, the method processes training examples many more times than necessary, increasing the time between sampling steps. In a real-world application, the trained medical personnel labeling the data would have to wait for the model to finish training, reducing the interactivity and applicability of the system.

To improve the computational performance of MedAL, we introduce "Online" MedAL (O-MedAL), which trains the model incrementally using only the newly labeled examples and a subset of the previously labeled data. By minimizing the training data used in each AL iteration, O-MedAL is computationally faster and more accurate than the original baseline model while retaining all the benefits of MedAL. MedAL is experimentally validated on three medical image diagnosis tasks and modalities: diabetic retinopathy detection from retinal fundus images, breast cancer grading from histopathological images, and melanoma diagnosis from skin images. O-MedAL is compared to MedAL on the retinal fundus image dataset for diabetic retinopathy detection. To the best of our knowledge, online active learning has never been directly applied to a medical image analysis setting, nor to a deep learning setting; in fact, there is little prior work on online active learning at all. We discuss existing approaches and their drawbacks in the Related Work section.

Our main contributions are:

  • Novelty: we present a novel AL sampling method that queries the unlabeled examples maximally distant from the centroid of all training set examples in a learned feature space.

  • Accurate: we introduce an online training technique that is more accurate than the original baseline model trained on a fully labeled dataset.

  • Efficient: our method achieves better results with fewer labeled examples than competing methods.

  • Computationally Efficient: the online training technique performs fewer backpropagation updates than the original baseline model trained on the fully labeled training set.

  • Robust: we test our active learning method on binary and multi-class classification problems using balanced and unbalanced datasets.

The remainder of the paper is structured as follows: in the Related Work section, we discuss relevant prior work on active and online learning approaches. In the Proposed Method section, we describe the sampling process and online training technique. In the Experiments section, we present results comparing MedAL to common active learning approaches on three medical image datasets. We also present results comparing O-MedAL to MedAL. Finally, we discuss the implications of the results and future directions, and we provide conclusions summarizing our work.

Related Work

Most AL systems approach the challenge of labeling data by selecting unlabeled examples likely to improve predictive performance. The AL scenario assumes that an oracle can assign a label to the selected examples. In practice, the process of labeling data can be difficult, time-consuming and expensive. Therefore, it is valuable to design sampling methods that identify the unlabeled examples most informative to the learning task while querying as few labels as possible.

AL has been applied to histopathological image analysis on unbalanced data [homeyer_comparison_2012], cell nucleus segmentation [wen_comparison_2018], CT scan and MRI analysis [pace_interactive_2015], and computer-aided diagnosis of masses in digital mammograms [zhao_minimization_2018]. An AL uncertainty sampling method was created for skin lesion segmentation [gorriz2017cost]. Additionally, AL was used in a Multiple Instance Learning framework for tuberculosis detection in chest radiographs [melendez_combining_2016].

The main AL scenarios include membership query synthesis, pool-based active learning and stream-based active learning. In membership query synthesis, the learner generates an unlabeled input and queries its label [angluin2004queries]. In pool-based active learning, a pool of unlabeled instances is ranked and the top subset is labeled [settles2008curious]. Stream-based active learning assumes a stream of unlabeled data and the learner decides on-the-fly whether to label it  [dasgupta2008general].

The sample/query mechanisms mainly include Query By Committee (QBC) [freund1997selective, freund1993information], Expected Error Reduction (EER) [cohn1995active, melville2004diverse], Expected Model Change (EMC) [cai2013maximizing, freytag2014selecting], and Uncertainty Sampling (US) [yang2015multi, lewis1994heterogeneous, lewis1994sequential].

Our work mainly expands on pool-based active learning and US. US is a sampling technique that selects and labels the unlabeled examples the model is most uncertain about. Uncertainty can be computed by predictive entropy, the entropy of the model's prediction given an unlabeled input example. US selects unlabeled examples near the classification boundary without making use of labels.

Regarding online active learning, some prior work exists. The work of [murugesan2017active] proposes a multi-task learning algorithm that minimizes labeling requests by attempting to infer a label with high confidence from other tasks rather than asking the oracle for the label. The work of [Sculley2007OnlineAL] addresses a setting where the algorithm requests a label immediately after an example has been classified; the method reduces computational cost simply because it requires fewer labels. The work of [Baram2003OnlineCO] proposes a multi-armed bandit method where learners are selected from an ensemble to sample the next example to be labeled; the reward function for the bandit problem is classification entropy maximization. In some cases, the approach performs better than the other ensemble methods evaluated.

Proposed Method

AL techniques developed for classical machine learning methods are not ideal for Deep Neural Network (DNN) architectures. In particular, [zhang2016understanding] shows DNNs are capable of fitting a random labeling of the training data with low uncertainty. These findings suggest that traditional AL sampling methods based on predictive uncertainty will be less effective with DNNs.

We present a method specifically tailored to DNNs. We first describe our novel AL sampling method and then we show how to train the system via online learning.

Sampling Based on Distance between Feature Embeddings

As shown in Figure 1, let $\mathcal{D}_{train}$ be the initial training set of labeled examples and $\mathcal{D}_{oracle}$ be the unlabeled dataset. We train a model using $\mathcal{D}_{train}$ and then use the model to find a fixed number of the most informative unlabeled examples in $\mathcal{D}_{oracle}$. These examples are then labeled, added to $\mathcal{D}_{train}$, and the model is trained on the new data. Oracle examples are iteratively labeled in this fashion until the oracle set is exhausted or sufficient performance is attained.
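To make the loop concrete, the following is a minimal sketch of the iterative sampling procedure just described. The names train_fn, label_fn and select_fn are hypothetical stand-ins for the components defined in the rest of this section, not the paper's actual implementation.

```python
# Hedged sketch of the MedAL active learning loop described above.
def medal_loop(d_train, d_oracle, select_fn, label_fn, train_fn, m=20):
    """Iteratively move the m most informative oracle examples into d_train.

    d_train: list of (image, label) pairs; d_oracle: list of unlabeled images.
    """
    model = train_fn(d_train)
    while d_oracle:
        queries = select_fn(model, d_oracle, d_train, m)   # most informative
        d_train = d_train + [(x, label_fn(x)) for x in queries]
        d_oracle = [x for x in d_oracle if x not in queries]
        model = train_fn(d_train)  # MedAL retrains from scratch each iteration
    return model, d_train
```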

Uncertainty-based methods for choosing new examples evaluate how informative a given $x \in \mathcal{D}_{oracle}$ is by computing the uncertainty in the model's prediction. In this work, we depart from the standard practice of using only the model's prediction to choose examples. Instead, we propose to use a feature embedding function in conjunction with uncertainty.

In particular, we first compute the predictive entropy of each unlabeled example in $\mathcal{D}_{oracle}$ and then select only the top highest-entropy examples as a set $\mathcal{D}_e$:

$H(\tilde{y} \mid x) = -\sum_{c=1}^{C} P(\tilde{y} = c \mid x) \log P(\tilde{y} = c \mid x)$   (1)

Using this subset of $\mathcal{D}_{oracle}$ reduces the number of examples considered and ensures the active learning process ultimately selects unlabeled examples close to the decision boundary.
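As an illustration, the predictive entropy and the top-entropy subset $\mathcal{D}_e$ can be computed in a few lines of PyTorch. This is a sketch, assuming `probs` holds the model's softmax outputs for every oracle example:

```python
import torch

def top_entropy_subset(probs: torch.Tensor, k: int) -> torch.Tensor:
    """Eq. (1): return indices of the k oracle examples with highest
    predictive entropy. probs has shape (n_examples, n_classes)."""
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.topk(k).indices  # candidate set D_e
```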

To further evaluate the amount of new information an unlabeled example $x \in \mathcal{D}_e$ can add to the training set, we evaluate each example's average distance in feature space to the labeled examples by means of a feature embedding function $f$ and a distance function $d$:

$\bar{d}(x, \mathcal{D}_{train}) = \frac{1}{\lvert \mathcal{D}_{train} \rvert} \sum_{i=1}^{\lvert \mathcal{D}_{train} \rvert} d\big(f(x), f(x_i)\big)$   (2)

where $x \in \mathcal{D}_e$, $x_i \in \mathcal{D}_{train}$, $f$ is the feature embedding function and $d$ is a Euclidean distance function.

We calculate $\bar{d}$ for every example in $\mathcal{D}_e$, ask the oracle to label the example that maximizes $\bar{d}$, and add the newly labeled example to $\mathcal{D}_{train}$:

$x^* = \operatorname{arg\,max}_{x \in \mathcal{D}_e} \bar{d}(x, \mathcal{D}_{train})$   (3)

We repeat this process to select a fixed number of examples from $\mathcal{D}_e$.
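A direct implementation of Eqs. (2) and (3) might look as follows. This is a sketch; the embeddings are assumed to be precomputed with the function $f$ described in the next section:

```python
import torch

def select_most_distant(emb_candidates: torch.Tensor,
                        emb_train: torch.Tensor) -> int:
    """Return the index of the candidate in D_e whose average Euclidean
    distance to the embedded training set is largest (Eqs. 2-3)."""
    dists = torch.cdist(emb_candidates, emb_train)  # (n_candidates, n_train)
    avg_dist = dists.mean(dim=1)                    # Eq. (2)
    return int(avg_dist.argmax())                   # Eq. (3)
```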

Building on results from the original MedAL paper [smailagic2018medal], we assume the feature space of the embedding function is always Euclidean. The assumption implies we can compute $\bar{d}$ as the distance to the centroid of the labeled examples, reducing the number of distance function calls and saving computation ($O(\lvert \mathcal{D}_e \rvert)$ distance evaluations per iteration vs $O(\lvert \mathcal{D}_e \rvert \cdot \lvert \mathcal{D}_{train} \rvert)$). Each time an example $x^*$ is chosen, we update the centroid of the labeled examples to include the new example.
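Under that assumption, the per-candidate score collapses to a single distance against a running centroid. A sketch of the shortcut, including the incremental centroid update (embeddings again assumed precomputed):

```python
import torch

def select_via_centroid(emb_candidates: torch.Tensor,
                        centroid: torch.Tensor,
                        n_train: int):
    """Score candidates by distance to the training-set centroid (one
    distance per candidate instead of n_train, per the assumption above)
    and update the centroid after the chosen example joins D_train."""
    dists = (emb_candidates - centroid).norm(dim=1)  # O(|D_e|) distances
    best = int(dists.argmax())
    # incremental mean update once the selected embedding is added
    new_centroid = (centroid * n_train + emb_candidates[best]) / (n_train + 1)
    return best, new_centroid
```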

Finally, we train the model on the new data and repeat the sampling process. In the following sections, we describe in detail the feature embedding function $f$ and how we train the model with new labels.

Deep Representations as Feature Embeddings

We train a Convolutional Neural Network (CNN) based architecture to extract powerful representations from the data and simultaneously solve the image classification problem. When interpreted in metric space, the feature embeddings learned by CNNs are known to encode semantic meaning: nearby elements tend to be visually similar [mikolov2013efficient, costa2018end], and conversely, elements that are far away tend to be visually different. We therefore make the assumption that images with mostly similar features will have a smaller distance in embedding space, while images with mostly different features will have a larger distance. The assumption that the feature embedding can be interpreted in metric space is the basis for Eq. (2).

As shown in Figure 1, we define the feature embedding function $f$ as the activations of a particular CNN layer. Since $f$ is tied to the model, it evolves during training. As the model performance improves, the model extracts better representations, leading to better classification accuracy as well as more informative examples sampled from $\mathcal{D}_{oracle}$. If $f$ were statically defined, the function could be computed offline and the method would reduce to simple predictive entropy based sampling.
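In PyTorch, such an evolving embedding can be captured with a forward hook on the chosen layer. A sketch, illustrated here with ResNet18's `layer2` (the layer our O-MedAL implementation uses; MedAL itself uses Inception V3's Mixed5 layer):

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
activations = {}

def hook(module, inputs, output):
    # flatten spatial dimensions: one embedding vector per image
    activations["f"] = output.flatten(start_dim=1)

model.layer2.register_forward_hook(hook)

with torch.no_grad():
    model(torch.randn(4, 3, 224, 224))  # dummy batch of 4 images
embeddings = activations["f"]           # f(x) for each image in the batch
```

Because the hook reads activations from the live model, the embeddings automatically improve as training proceeds, which is exactly the property the sampling method relies on.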

Online Active Learning

The online learning technique describes how we train the model given a stream of newly labeled examples generated by the AL sampling method. Online MedAL (O-MedAL) introduces a few changes to MedAL. First, while MedAL re-initializes the model parameters at each AL iteration, O-MedAL keeps the parameters across AL iterations and incrementally updates the model. This change enables significant computational savings: while MedAL re-trains the model once per AL iteration, O-MedAL trains just one model across all AL iterations, and the total number of epochs used to train the model can be an order of magnitude smaller. Second, MedAL trains the model using the full training set $\mathcal{D}_{train}$, while O-MedAL trains on the newly labeled items and a random subset of previously labeled items. We have found empirically that including a non-empty subset of previously labeled data is necessary, as we discuss further in the Discussion section. Using a subset of the available training data also reduces the overall number of examples used to train the model. Both changes improve computational efficiency: the first reduces the total number of epochs required to train the model, and the second reduces the number of examples per epoch.
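A minimal sketch of how each O-MedAL iteration could assemble its training data, assuming `new_items` are the freshly labeled examples and `p` is the fraction of the previously labeled set to replay:

```python
import random

def omedal_iteration_data(new_items, previously_labeled, p):
    """Newly labeled items plus a random fraction p of previously labeled
    items; the same subset is reused for every epoch of this AL iteration."""
    k = int(p * len(previously_labeled))
    return list(new_items) + random.sample(previously_labeled, k)
```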

The number of examples used to update model weights during backpropagation grows with each AL iteration, and can be expressed as:

$X_T = \sum_{t=1}^{T} \eta_t \left( M + p \, \lvert \mathcal{D}_{train}^{(t-1)} \rvert \right)$   (4)

where $X_T$ is the cumulative number of example images used for backpropagation after $T$ AL iterations, $M$ is the number of images labeled each AL iteration, $p$ is the hyperparameter determining the fraction of the previously labeled dataset to use, $\lvert \mathcal{D}_{train}^{(t-1)} \rvert$ is the number of labeled examples in the previous AL iteration and $\eta_t$ is the number of epochs at the $t$-th AL iteration. $\lvert \mathcal{D}_{train}^{(0)} \rvert$ is the initial number of labeled examples; for O-MedAL, $\lvert \mathcal{D}_{train}^{(0)} \rvert = 1$. The equation is useful to estimate the number of images processed by backpropagation. It can also be used to estimate how many times an O-MedAL model's parameters have been updated after any given AL iteration; in this case, the cumulative number of weight updates is approximately $X_T / B$, where $B$ is the minibatch size.
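For example, Eq. (4) can be evaluated directly to budget an experiment. A sketch, assuming a constant number of epochs per AL iteration (the real $\eta_t$ may vary under early stopping):

```python
def cumulative_examples(T, M, p, n_init=1, epochs_per_iter=10):
    """Eq. (4): cumulative images processed by backpropagation after T
    AL iterations, with eta_t held constant at epochs_per_iter."""
    total, n_labeled = 0, n_init
    for _ in range(T):
        total += epochs_per_iter * (M + int(p * n_labeled))
        n_labeled += M  # M newly labeled images join D_train each iteration
    return total

# e.g. cumulative_examples(T=40, M=20, p=0.875) estimates the total number
# of images backpropagated; dividing by the minibatch size B estimates the
# number of weight updates.
```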

Experiments

We perform experiments on three medical datasets to validate the accuracy and robustness of our proposed approach. First, we introduce the datasets. Next, we describe MedAL and online MedAL implementation details. Then, we evaluate our sampling method on the three datasets. Last, we evaluate our online training approach.

Dataset Description

Figure 2: Examples from all three datasets evaluated in this work. From left to right, top to bottom: Messidor, Skin Cancer and Breast Cancer datasets. The class labels are shown below each of the images.

Active learning reduces the amount of labeled data necessary for training, and it is therefore well positioned for medical image analysis. We evaluate MedAL on three publicly available datasets. Figure 2 presents representative images from these datasets.

Messidor Dataset contains 1200 eye fundus images from 654 diabetic and 546 healthy patients, collected from three hospitals in France between 2005 and 2006. The dataset is labeled for Diabetic Retinopathy (DR) grading and for Risk of Macular Edema. We use Messidor to classify eye fundus images as healthy (DR grade 0) or diseased (DR grade 1 or higher).

Breast Cancer Diagnosis Dataset, originally presented in the ICIAR 2018 Grand Challenge [aresta2018bach], consists of 400 high resolution images of breast tissue cells evenly split into four distinct classes: Normal, Benign, in-situ carcinoma and invasive carcinoma (100 images per class).

Skin Cancer dataset contains 900 skin lesion images classified as either benign or malignant [gutman2016skin]. The dataset is highly unbalanced, with far more benign than malignant examples. MedAL makes possible the following class balancing technique: in each AL iteration, we query additional malignant training examples by making use of randomized preprocessing techniques.
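One plausible realization of this balancing step is to oversample the malignant class with randomized augmentation. The sketch below is illustrative; the function names and augmentation policy are assumptions, not the paper's exact procedure:

```python
import random

def oversample_minority(dataset, minority_label, augment, n_extra):
    """Append n_extra augmented copies of randomly chosen minority-class
    (here: malignant) examples to reduce the class imbalance."""
    minority = [(img, y) for img, y in dataset if y == minority_label]
    extra = [(augment(img), y)
             for img, y in random.choices(minority, k=n_extra)]
    return dataset + extra
```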

MedAL Implementation

The underlying convolutional neural network architecture is Inception V3 [szegedy2016rethinking], with weights pre-trained on ImageNet. We replace the top layer with Global Average Pooling followed by a Fully-Connected layer, where the number of hidden units corresponds to the number of output classes in the dataset. We use the Adam optimizer [kingma2014adam] with a fixed learning rate and the default recommended values for $\beta_1$ (0.9) and $\beta_2$ (0.999).

At the start of each AL iteration, we reset the model's weights to their initial starting values (after pre-training on ImageNet). The top layer weights, which were not pre-trained on ImageNet, are randomly initialized using the Glorot method [glorot2010understanding].

Each AL iteration, we train the model until it reaches a fixed training accuracy threshold. We perform hyperparameter selection using only the Messidor validation set and apply these same hyperparameters to the Skin Cancer and Breast Cancer datasets in order to avoid using a labeled validation set for those datasets. Avoiding a validation set is important to show the robustness of our method and, most importantly, that our method minimizes the number of labeled images used. Table 1 shows the dataset sizes and the hyperparameters that differ across the datasets.

We initialize the training set by randomly sampling one unlabeled image from $\mathcal{D}_{oracle}$, and then use the ORB initialization method described in [smailagic2018medal] to increase the size of our initial training set, $\mathcal{D}_{train}$.

For data pre-processing, we resize all images to a fixed input resolution and use the following dataset augmentation: 15 degree random rotation, random scaling in the range [0.9, 1], and random horizontal flips.
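With torchvision, this augmentation pipeline could look roughly as follows. This is a sketch: the 224-pixel input resolution is an assumption (the original resize value was not preserved), and RandomResizedCrop only approximates pure random scaling:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.RandomRotation(15),                        # 15 degree rotation
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),  # scaling in [0.9, 1]
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```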

To configure the AL sampler, we use the Euclidean distance function and obtain feature embeddings from the Mixed5 layer of Inception V3. These choices follow our prior empirical evidence that this combination of layer and distance function samples images with the highest entropy [smailagic2018medal].

Parameters                      Messidor   Breast Cancer   Skin Cancer
(train + oracle) size                768             320           700
validation set size                  240               0             0
test set size                        192              80           200
Initial training set size            100              30           100
Images labeled each AL iteration      20               5            10
Num max entropy samples               50              30            50

Table 1: MedAL Implementation Details, showing dataset sizes and the parameter values used.

Online Active Learning (O-MedAL) Implementation

We start with the configuration defined in the MedAL Implementation Details section, and then make the following changes.

First, we replace the Inception V3 network with a ResNet18 network [resnet] pre-trained on ImageNet. ResNet18 performs nearly as well as Inception V3 on the fully labeled dataset, and it is a smaller model; since one of the reasons to consider O-MedAL is computational efficiency, we chose ResNet18. We use the SGD optimizer with a Nesterov momentum of 0.9, a learning rate of 0.003, and a weight decay of 0.01. The batch size is 48. We replace the last layer with a linear layer followed by a Sigmoid. The feature embeddings are extracted from the output of the "layer 2" layer. ResNet18 with these hyperparameters is the network used for all O-MedAL evaluation experiments.
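Put together, the network and optimizer configuration described above corresponds to roughly the following PyTorch setup. This is a sketch for the binary Messidor task; the released repository holds the authoritative code:

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # ImageNet pre-training
model.fc = torch.nn.Sequential(                   # last layer: linear + sigmoid
    torch.nn.Linear(model.fc.in_features, 1),
    torch.nn.Sigmoid(),
)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.003,
    momentum=0.9,      # Nesterov momentum
    nesterov=True,
    weight_decay=0.01,
)
# batch size 48; sampling embeddings come from model.layer2 activations
```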

Second, for Messidor, we use an 80/20 random split stratified across hospitals. The split is re-computed each time a model is trained. It assigns 949 images to the combined train and oracle sets and the remaining 238 images to the test set. For these tests, we also correct the published errata associated with the Messidor dataset (this includes removal of 13 duplicate images from one of the hospitals).

We use the same pre-processing transformations used for the MedAL evaluation, and we also use the same hyperparameters for Messidor: the number of max entropy samples is 50 and the number of images added per iteration is 20. For the results described in the O-MedAL Evaluation section, we do not use ORB initialization for MedAL or O-MedAL models; the initial training set is a single randomly sampled and labeled image.

Third, we introduce a hyperparameter, $p$, that determines, at any given AL iteration, what percent of previously labeled examples in $\mathcal{D}_{train}$ should be used to train the model. The same images are used for all epochs in a given AL iteration. This parameter, previously described in Equation 4, is the independent variable of our O-MedAL experiments.

Fourth, since the method is online, we do not reset the weights of the model at the start of each AL iteration.

Last, we rewrite the code in a new implementation, migrating from Keras to PyTorch. The new code is available online at https://www.github.com/adgaudio/O-MedAL.

Distance-based Sampling Method Evaluation

Figure 3: (a) Messidor results, (b) Breast Cancer results, (c) Skin Cancer results: test set accuracy as a function of the number of labeled training images. On all three datasets, our method samples images that consistently improve the test set performance, and our method outperforms both uncertainty and random sampling. (d) AUC on the Skin Cancer dataset: our method slightly outperforms Gal et al.'s Deep Bayesian AL method [gal2017deep].

We compare the performance of our AL sampling method on the three datasets mentioned above to the performance of uncertainty sampling and random sampling. The datasets each present a different learning challenge: Messidor is a balanced binary classification task, the Breast Cancer dataset a balanced multi-class task, and the Skin Cancer dataset an unbalanced binary task. After each AL iteration, we evaluate the test accuracy of our model.

Our method consistently outperforms both uncertainty and random sampling on all three datasets, as shown in Figure 3. Random sampling chooses images uniformly at random, so on balanced datasets we expect accuracy to increase in a nearly linear fashion; any improvement over random sampling must choose more informative images earlier in the sampling process, resulting in a non-linear curve. MedAL shows large increases in accuracy with fewer images. For instance, on Messidor, Figure 3(a) shows our method obtains 80% accuracy with 425 images, whereas uncertainty sampling and random sampling require 625 and 700 images, respectively, to achieve the same accuracy. Moreover, our method obtains results comparable to the baseline accuracy using only 650 of 768 training images (84.6%).

As shown in Figure 3(b), our method is also consistently better than the competing methods on the Breast Cancer dataset, although the difference is not as striking as on Messidor. Our approach requires only 230 of 320 (71.9%) labeled images to reach the accuracy that uncertainty sampling and random sampling attain with 250 and 255 images, respectively.

Finally, as shown in Figure 3(c), our approach requires only 460 labeled images on the Skin Cancer dataset to reach the results that uncertainty sampling and random sampling attain with 570 and 640 labeled training images, respectively. Furthermore, our method achieves the baseline accuracy after being trained with 610 of 700 images (87.1%).

Figure 4: O-MedAL vs. MedAL Performance. Test accuracy as a function of the amount of labeled training data. O-MedAL (red lines, p=0.875) preserves the labeling efficiency of MedAL (green line) while attaining a higher accuracy than both MedAL and the baseline ResNet18 model (blue dashed line).
Figure 5: O-MedAL Evaluation. (a) O-MedAL computational efficiency. (b) O-MedAL accuracy. The dark gray marker shows the most accurate model (p=0.875); the light gray marker shows the most label-efficient model that reaches baseline accuracy (p=0.875, 40% labeled); the orange marker shows the most computationally efficient model that reaches baseline accuracy (p=0.125). The table below summarizes these key points and compares them to the baseline.

Experiment            Test Accuracy     Percent Labeled   Examples Processed
O-MedAL (p=0.875)     91.60% (+5.88%)   80.08%            128716 (+69.54%)
O-MedAL (p=0.875)     85.71% (+0.00%)   40.04%            31090 (-59.05%)
O-MedAL (p=0.125)     85.71% (+0.00%)   80.08%            24296 (-68.00%)
MedAL (patience=20)   91.18% (+5.46%)   73.76%            1178079 (+1451.74%)
MedAL (patience=10)   86.13% (+0.42%)   90.62%            1120482 (+1375.87%)
ResNet18 Baseline     85.71% (+0.00%)   100.00%           75920 (+0.00%)

Online Active Learning (O-MedAL) Evaluation

We evaluate O-MedAL by comparing it to MedAL and the ResNet18 baseline across three areas: test set accuracy, percent dataset labeled and computational efficiency. In particular, we conduct a study to determine whether previously labeled data is useful or necessary to train the model. Results show that the online approach significantly improves over the baseline in all three areas.

We train O-MedAL nine times independently, varying $p$, the percent of previously labeled images to include in any particular AL iteration. The nine values we consider are $p \in \{0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0\}$.

Figure 4 shows test accuracy (y-axis) as a function of percent dataset labeled (x-axis) for both O-MedAL (p=0.875) and MedAL. Percent dataset labeled is directly proportional to the AL iteration. Within an AL iteration, we compute test accuracy after each epoch; therefore, the x-axis can equivalently be labeled with (AL iteration, epoch) pairs. The dashed blue line marks the baseline ResNet18 model's highest test accuracy (85.71%). The green line is MedAL, and the red line is O-MedAL.

The plot exposes key details of MedAL's learning process. MedAL (the green line) resets the weights of the model at each AL iteration and then trains for up to 150 epochs (early stopping may end the iteration at an earlier epoch). Thus, the test accuracy after the first 3 epochs of each AL iteration is 49.58%, and the algorithm spends time re-learning the same features to regain the accuracy of the previous AL iteration. We also see that the MedAL curve reaches baseline accuracy when approximately 61% of the dataset is labeled, slightly better than our previous MedAL results with Inception V3. Possible reasons for this improvement are (a) a different dataset (the Messidor train set is larger, train and test sets are stratified across hospitals, and the duplicate images were removed), (b) a different implementation (PyTorch vs. Keras), and (c) some baseline architectures (such as ResNet18) may be better suited to MedAL than others (such as Inception V3).

As the percent of the dataset labeled approaches 100, MedAL effectively retrains the baseline model once per AL iteration. Therefore, we expect that if MedAL converges before 100% of the dataset is labeled, its maximum accuracy should exceed the baseline due to random chance across these many retrainings. The results of Figure 4 confirm this expectation; an improvement over MedAL's highest accuracy is therefore a significant improvement over the baseline.

The red lines of Figure 4 represent our most accurate O-MedAL model, which uses p = 0.875, or 87.5% of previously labeled data at each AL iteration. Each small vertical red line represents 10 epochs of training; the lines are separated in order to visually align O-MedAL epochs to the first 10 MedAL epochs at each AL iteration. The spacing between red lines shows that the model trains for fewer epochs than MedAL. O-MedAL's minimum accuracy at the start of each AL iteration is also generally increasing, which suggests that on this dataset the online method learns incrementally without forgetting what it previously learned. Another interesting finding is that the model does not perform as well as MedAL when less than 40% of the data is labeled; if this matters, using O-MedAL with early stopping should increase accuracy, at the cost of increased computation.

The two subplots of Figure 5 show how O-MedAL exposes a three-way relationship between accuracy, percent dataset labeled and computational efficiency. Figure 5(a) addresses computational efficiency while Figure 5(b) addresses accuracy; both plots share an x-axis of percent dataset labeled. The dashed blue line represents the baseline ResNet18 model trained on the fully labeled dataset for 100 epochs. The dashed black and dark blue lines are MedAL with early stopping (patience=20 and patience=10, respectively). We plot results from training nine O-MedAL models, each with a distinct $p$.

Figure 5(a) plots the cumulative number of examples processed (on a log scale) as a function of percent dataset labeled. The curve for each of the O-MedAL models can be computed statically using Equation 4; the MedAL curve cannot be computed in advance, since early stopping leads to a varying number of epochs per AL iteration. Both MedAL and O-MedAL curves grow steeply. However, in most cases, O-MedAL models process fewer examples than the baseline while meeting or exceeding its accuracy, suggesting they scale to larger datasets. Moreover, for sufficiently small $p$, O-MedAL is guaranteed to process fewer examples than the ResNet18 baseline run for 80 epochs, and for somewhat larger $p$, O-MedAL exceeds the baseline only if more than 59% of the dataset is labeled. If we used a baseline network that converges in more than 100 epochs, we expect O-MedAL's computational advantage would be even more significant. This efficiency is useful because it offers flexibility in the choice of $p$.

Figure 5(b) is a scatter plot of test accuracy as a function of percent dataset labeled. The two MedAL dashed lines bound a region containing most of the variation in MedAL's accuracy due to random chance. For each O-MedAL model, we show only the 10 AL iterations with highest test accuracy. The low accuracy of the p = 0 model shows that we need to include some amount of previously labeled data while training the model. We also found that all models except p = 0 and p = 0.125 clearly and consistently outperform the baseline, and three O-MedAL models outperform MedAL (with patience=20), which denotes a significant improvement over the baseline.

Finally, we highlight five key moments while training the MedAL and O-MedAL models, shown in the table and plots of Figure 5. The black and blue crosses correspond to the highest accuracy attained using MedAL with early stopping at a patience of 20 and 10, respectively. The two gray crosses correspond to the p = 0.875 model, which is a good choice when both accuracy and labeling less data are priorities. The orange cross corresponds to the p = 0.125 model, which is the most computationally efficient.

We can infer from Figure 5(b) that a larger $p$ generally results in a more accurate model, and from Figure 5(a) that a smaller $p$ results in a more computationally efficient model. Given a choice of $p$, training can stop when either too many samples have been labeled or sufficient accuracy is reached.

Discussion and Future Work

On each of the three datasets tested, MedAL achieves a higher overall accuracy while using less labeled training data. The results confirm that informative examples are maximally dissimilar in feature space from previously trained examples and close to the decision boundary of the current model.

MedAL uses the trained baseline model to sample unlabeled examples. As the model improves its ability to classify the data, it also naturally improves its ability to identify unlabeled examples worthy of labeling. Since MedAL samples examples with maximum uncertainty and distance to the labeled examples, we can infer that MedAL’s performance will continue to improve relative to other sampling techniques on larger unlabeled datasets. Future work could attempt to quantify how the performance changes as a function of unlabeled dataset size.

MedAL is designed to query a fixed number of examples, M, at a time. This naturally translates to a medical setting where a physician will label a batch of images at once (i.e., M > 1). O-MedAL enables an efficient alternative: if M = 1, a physician can label one image at a time without waiting on the model to request more image labels. When M = 1, the feature embedding used to query unlabeled images is always as up-to-date as possible. Future work will test the utility of O-MedAL with M = 1, as well as the possibility of near real-time interaction (in terms of wall time) with a physician.

In the future, we plan to continue our collaboration with ophthalmologists on the collection of Diabetic Retinopathy data and on integrating deep active online learning techniques into the workflows of ophthalmologists and other physicians, helping them make more informed health care decisions.

Conclusions

In this paper we extend MedAL by introducing an online learning method. The online method retains all the benefits of MedAL while improving both its accuracy and computational efficiency.

We evaluate MedAL on three distinct medical image datasets: Messidor, the Breast Cancer dataset, and the Skin Cancer dataset. Our results show that MedAL is efficient, requiring fewer labeled examples than existing methods: it achieves results comparable to a model trained on the fully labeled dataset using only 650 of 768 images, and attains 80% accuracy using 425 images, corresponding to 32% and 40% reductions in labeled data compared to uncertainty and random sampling, respectively. Finally, we show that MedAL is robust, performing well on both binary and multi-class classification problems, and on both balanced and unbalanced datasets, with consistently better performance than the competing methods on all three medical image datasets.

We compare O-MedAL to both MedAL and the underlying baseline deep network. We show that O-MedAL is more accurate, improving over the deep network's test accuracy by 5.88% while labeling 80% of the dataset and processing 69.54% more examples. O-MedAL requires fewer labels, achieving baseline accuracy with only 40% of the dataset labeled and using 59% less computation. O-MedAL is computationally efficient: our compute-optimized model processed 68% fewer examples than the baseline, with 80% of the data labeled, while exactly matching the accuracy of the deep network. O-MedAL thus offers choices between labeling efficiency, computational efficiency and accuracy, and we provide a hyperparameter, $p$, to prioritize accuracy versus computational efficiency.

Conflict Of Interest

The authors declare that they have no conflicts of interest for this work.

References