Convolutional neural networks (CNNs) have brought about a revolution in computer vision thanks to large annotated datasets such as ImageNet and Places. As evidenced by an IEEE TMI special issue and two recent books [5, 6], interest in applying CNNs to biomedical image analysis is intense and widespread, but its success is impeded by the lack of such large annotated datasets in biomedical imaging. Annotating biomedical images is not only tedious and time consuming, but also demands costly, specialty-oriented knowledge and skills, which are not easily accessible. Therefore, we seek to answer a critical question: how to dramatically reduce the cost of annotation when applying CNNs in biomedical imaging, along with an accompanying one: given a labeled dataset, how to determine whether it sufficiently covers the variations of the objects of interest. In doing so, we present a novel method called AFT that naturally integrates active learning and transfer learning into a single framework. AFT starts directly with a pre-trained CNN to seek "salient" samples from the unannotated pool for annotation, and the (fine-tuned) CNN is continuously fine-tuned on newly annotated samples combined with all misclassified samples. We have evaluated our method in three different applications, namely colonoscopy frame classification, polyp detection, and pulmonary embolism (PE) detection, demonstrating that the cost of annotation can be cut by at least half.
This outstanding performance is attributed to a simple yet powerful observation: to boost the performance of CNNs in biomedical imaging, multiple patches are usually generated automatically for each candidate through data augmentation; the patches generated from the same candidate share the same label and are naturally expected to receive similar predictions from the current CNN before they are expanded into the training dataset. As a result, their entropy and diversity provide a useful indicator of the "power" of a candidate in elevating the performance of the current CNN. However, automatic data augmentation inevitably generates "hard" samples for some candidates, injecting noisy labels; therefore, to significantly enhance the robustness of our method, we compute entropy and diversity on only a portion of the patches of each candidate, selected according to the predictions of the current CNN.
Several researchers have demonstrated the utility of fine-tuning CNNs for biomedical image analysis, but they only performed one-time fine-tuning, that is, simply fine-tuning a pre-trained CNN once with all available training samples, involving no active selection process (e.g., [9, 10, 11, 12, 13, 14, 15, 16]). To our knowledge, our proposed method is among the first to integrate active learning into fine-tuning CNNs in a continuous fashion, making CNNs more amenable to biomedical image analysis with the aim of cutting annotation cost dramatically. Compared with conventional active learning, our method, summarized as Alg. LABEL:alg:AFT, offers seven advantages:
Starting with a completely empty labeled dataset, requiring no seed labeled candidates (see Alg. LABEL:alg:AFT);
Incrementally improving the learner through continuous fine-tuning rather than repeatedly re-training (see Sec. 3.5);
Actively selecting most informative and representative candidates by naturally exploiting expected consistency among the patches within each candidate (see Sec. 3.2);
Computing selection criteria locally on a small number of patches within each candidates, saving computation time considerably (see Sec. 3.2);
Automatically handling noisy labels via majority selection (see Sec. 3.3);
Combining newly selected candidates with misclassified candidates, eliminating easy samples to improve training efficiency, and focusing on hard samples to prevent catastrophic forgetting (see Sec. 3.5);
Incorporating randomness in active selection to reach a near optimal trade-off between exploration and exploitation (see Sec. 3.4).
More importantly, our method has the potential to exert an important impact on computer-aided diagnosis (CAD) in biomedical imaging, because current regulations require that CAD systems be deployed in a "closed" environment, in which all CAD results are reviewed and any errors corrected by radiologists. As a result, all false positives are supposed to be dismissed and all false negatives supplied, providing instant on-line feedback that, given the continuous fine-tuning capability of our method, may make CAD systems self-learning and self-improving after deployment.
2 Related work
The literature on active learning, deep learning, and transfer learning is rich and deep [4, 5, 6, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]. Due to space limitations, we focus only on the most relevant works.
2.1 Transfer learning for medical imaging
Gustavo et al.  replaced the fully connected layers of a pre-trained CNN with a new logistic layer and trained only the appended layer with the labeled data while keeping the rest of the network unchanged, yielding promising results for classification of unregistered multiview mammograms. In , a fine-tuned pre-trained CNN was applied to localize standard planes in ultrasound images. Gao et al.  fine-tuned all layers of a pre-trained CNN for automatic classification of interstitial lung diseases. In , Shin et al. used fine-tuned pre-trained CNNs to automatically map medical images to document-level topics, document-level sub-topics, and sentence-level topics. In , fine-tuned pre-trained CNNs were used to automatically retrieve missing or noisy cardiac acquisition plane information from magnetic resonance imaging and predict the five most common cardiac views. Schlegl et al.  explored unsupervised pre-training of CNNs to inject information from sites or image classes for which no annotations were available, and showed that such across-site pre-training improved classification accuracy compared to random initialization of the model parameters. Tajbakhsh et al.  systematically investigated the capabilities of transfer learning in several medical imaging applications. However, they all performed one-time fine-tuning, simply fine-tuning a pre-trained CNN just once with the available training samples, involving neither an active selection process nor continuous fine-tuning.
2.2 Integrating active learning with deep learning
Research integrating active learning with deep learning is sparse: Wang and Shang  may be the first to combine the two, basing their approach on stacked restricted Boltzmann machines and stacked autoencoders. A similar idea was reported for hyperspectral image classification. Stark et al.  applied active learning to improve the performance of CNNs for CAPTCHA recognition, while Al Rahhal et al.  exploited deep learning for active electrocardiogram classification. Most recently, Yang et al.  presented an active learning framework that reduces annotation effort by judiciously suggesting the most effective areas to annotate for segmentation, based on uncertainty and similarity information provided by fully convolutional networks . Their approach is computationally very expensive, as they must train a set of models from scratch (unlike ours, which fine-tunes) via bootstrapping  in order to compute an uncertainty measure from these models' disagreements, and must solve a generalized version of the maximum set cover problem , which is NP-hard, to determine the most representative areas. In short, all the aforementioned approaches are fundamentally different from AFT: at each step they repeatedly re-train the learner from scratch, whereas we continuously fine-tune the (fine-tuned) CNN in an incremental manner, offering the advantages listed in Sec. 1 and leading to dramatic reductions in annotation cost and computation.
2.3 Our work
We presented a method that integrates active learning and deep learning via continuous fine-tuning in our CVPR paper , but it was limited to binary classification and biomedical imaging, and used all labeled samples available at each step, thereby demanding long training times and large computer memory. This paper is a significant extension of our CVPR paper , making several contributions: (1) generalizing binary classification to multi-class classification; (2) extending from computer-aided diagnosis in medical imaging to scene interpretation in natural images; (3) combining newly selected samples with hard (misclassified) samples, eliminating easy samples to reduce training time and concentrating on hard samples to prevent catastrophic forgetting; (4) injecting randomness to enhance robustness in active selection; (5) experimenting extensively with all reasonable combinations of data and models in search of an optimal strategy; (6) demonstrating consistent annotation reduction with different CNN architectures; and (7) illustratively explaining the active selection process with a gallery of patches and their associated predictions.
3 Proposed method
AFT was conceived in the context of computer-aided diagnosis (CAD) in biomedical imaging. A CAD system typically has a candidate generator, which can quickly produce a set of candidates, among which some are true positives and some are false positives. After candidate generation, the task is to train a classifier that eliminates as many false positives as possible while keeping as many true positives as possible. To train a classifier, each of the candidates must be labeled. We assume that each candidate takes one of |Y| possible labels. To boost the performance of CNNs for CAD systems, multiple patches are usually generated automatically for each candidate through data augmentation; the patches generated from the same candidate inherit the candidate's label. In other words, all labels are acquired at the candidate level. Mathematically, given a set of candidates U = {C_1, C_2, ..., C_n}, where n is the number of candidates and each candidate C_i is associated with m patches x_i^1, x_i^2, ..., x_i^m, our AFT algorithm iteratively selects a set of candidates for labeling, as illustrated in Alg. LABEL:alg:AFT.
However, AFT is generic and applicable to many tasks in computer vision and image analysis. For clarity, we illustrate the ideas behind AFT with the Places database  for scene interpretation in natural images, where no candidate generator is needed, as each image may be directly regarded as a candidate. For simplicity of illustration, yet without loss of generality, we limit ourselves to three categories (kitchen, living room, and office), and divide the Places images in each category into training (14,000 images), validation (1,000 images), and test (100 images) sets with no overlap.
Designing an active learning algorithm involves two key issues: (1) how to determine the “worthiness” of a candidate for annotation and (2) how to update the classifier/learner. Before formally addressing the first issue in Secs. 3.2–3.4 and the second one in Sec. 3.5, we illustrate our ideas in Sec. 3.1 with Fig. 3 and Tab. 1.
3.1 Illustrating active candidate selection
Fig. 3 shows the active candidate selection process for multi-class classification, while, for easier understanding, Tab. 1 illustrates it in the case of binary classification. Assuming the prediction of patch x_i^j by the current CNN is p_i^j, we call the histogram of {p_i^j, j = 1, ..., m} the prediction pattern of candidate C_i. As shown in Row 1 of Tab. 1, in binary classification there are seven typical prediction patterns:
Pattern A: The patches' predictions are mostly concentrated around 0.5, with a high degree of uncertainty. Most conventional active learning algorithms favor this type of candidate, as such candidates are effective at reducing uncertainty.
Pattern B: It is flatter than Pattern A, as the patches' predictions are spread widely from 0 to 1 with a higher degree of inconsistency among them. Since all the patches belonging to a candidate are generated via data augmentation, they (or at least their majority) are expected to have similar predictions. This type of candidate has the potential to contribute significantly to enhancing the current CNN's performance.
Pattern C: The patches' predictions are clustered at both ends, with a higher degree of diversity. This type of candidate is most likely associated with noisy labels at the patch level, as illustrated in Fig. 4 (c), and is the least favorable in active selection because it may cause confusion in fine-tuning the CNN.
Patterns D and E: The patches' predictions are clustered at one end (i.e., 0 or 1), with a higher degree of certainty. Annotation of this type of candidate should be postponed at this stage because the current CNN has most likely predicted it correctly, so it would contribute very little to fine-tuning the current CNN.
Patterns F and G:
They have a higher degree of certainty in the majority of the patches' predictions, along with some outliers. This type of candidate is valuable because it is capable of smoothly improving the CNN's performance; it might not make a dramatic contribution, but it would not cause significant harm either.
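To make these patterns concrete, the binary entropy and pairwise diversity of a candidate's patch predictions can be sketched as follows. The prediction vectors below are hypothetical, chosen only to mimic Patterns A through D; the formal multi-class definitions appear in Sec. 3.2.

```python
import numpy as np

def binary_entropy(p):
    """Mean per-patch binary entropy of a candidate's patch predictions."""
    p = np.asarray(p, dtype=float)
    return float(np.mean(-(p * np.log(p) + (1 - p) * np.log(1 - p))))

def binary_diversity(p):
    """Sum of pairwise prediction divergences among a candidate's patches."""
    p = np.asarray(p, dtype=float)
    d = 0.0
    for j in range(len(p)):
        for l in range(j + 1, len(p)):
            d += (p[j] - p[l]) * np.log(p[j] / p[l])
            d += ((1 - p[j]) - (1 - p[l])) * np.log((1 - p[j]) / (1 - p[l]))
    return float(d)

# Hypothetical prediction patterns (values are illustrative only):
patterns = {
    "A (clustered near 0.5)": [0.45, 0.50, 0.52, 0.48, 0.55],
    "B (spread over [0,1])":  [0.10, 0.30, 0.50, 0.70, 0.90],
    "C (split at both ends)": [0.02, 0.05, 0.95, 0.97, 0.98],
    "D (clustered near 0)":   [0.02, 0.03, 0.05, 0.04, 0.06],
}
for name, p in patterns.items():
    print(f"{name}: entropy={binary_entropy(p):.3f}, diversity={binary_diversity(p):.3f}")
```

Running this confirms the intuition in the text: Pattern A maximizes entropy, Pattern C maximizes diversity, and Patterns D/E score low on both.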
3.2 Seeking worthy candidates
In active learning, the key is to develop a criterion for determining the "worthiness" of a candidate for annotation. Our criterion is based on a simple yet powerful observation: all patches augmented from the same candidate (Fig. 3) share the same label, so they are expected to have similar predictions by the current CNN. As a result, their entropy and diversity provide a useful indicator of the "power" of a candidate in elevating the performance of the current CNN. Intuitively, entropy captures classification certainty (a higher uncertainty value denotes a higher degree of information, e.g., Pattern A in Tab. 1), while diversity indicates prediction consistency among the patches of a candidate (a higher diversity value denotes a higher degree of prediction inconsistency, e.g., Pattern C in Tab. 1). Formally, assuming that each candidate C_i takes one of |Y| possible labels and that the current CNN's prediction for patch x_i^j on label k is p_i^{j,k}, we define the entropy of patch x_i^j as

e_i^j = - Σ_{k=1}^{|Y|} p_i^{j,k} log p_i^{j,k},   (1)
and the diversity between patches x_i^j and x_i^l of candidate C_i as

d_i(j, l) = Σ_{k=1}^{|Y|} (p_i^{j,k} - p_i^{l,k}) log (p_i^{j,k} / p_i^{l,k}),   (2)
Combining entropy and diversity yields a score matrix R_i for each candidate C_i, whose entries are

R_i(j, l) = λ1 e_i^j if j = l, and λ2 d_i(j, l) otherwise,   (3)
where λ1 and λ2 are trade-offs between entropy and diversity. We use two parameters for convenience, so as to easily turn entropy or diversity on or off during experiments.
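A minimal sketch of Eqs. 1 through 3, assuming the patch predictions are given as softmax rows; the function name and the clipping constant are ours, not from the text:

```python
import numpy as np

def score_matrix(P, lam1=1.0, lam2=1.0, eps=1e-12):
    """Build the combined entropy/diversity score matrix for one candidate.

    P: (m, |Y|) array; row j holds the current CNN's softmax prediction
    for patch j. Diagonal entries carry scaled patch entropy (Eq. 1);
    off-diagonal entries carry pairwise diversity (Eq. 2), per Eq. 3.
    """
    P = np.clip(np.asarray(P, dtype=float), eps, 1.0)
    m = P.shape[0]
    R = np.zeros((m, m))
    for j in range(m):
        R[j, j] = lam1 * -np.sum(P[j] * np.log(P[j]))        # entropy (Eq. 1)
        for l in range(j + 1, m):
            d = np.sum((P[j] - P[l]) * np.log(P[j] / P[l]))  # diversity (Eq. 2)
            R[j, l] = R[l, j] = lam2 * d
    return R

# A candidate whose 3 patches disagree scores higher overall than one
# whose patches agree confidently:
disagreeing = score_matrix([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
agreeing = score_matrix([[0.96, 0.02, 0.02], [0.95, 0.03, 0.02], [0.97, 0.02, 0.01]])
print(disagreeing.sum() > agreeing.sum())  # True
```

Note that the diversity term is a symmetrized KL-style divergence, so R_i is symmetric by construction.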
3.3 Handling noisy labels via majority selection
Automatic data augmentation is essential to boosting a CNN's performance, but it inevitably generates "hard" samples for some candidates, as shown in Fig. 4 (c), injecting noisy labels. Therefore, to significantly enhance the robustness of our method, we compute entropy and diversity on only a portion of the patches of each candidate, selected according to the predictions of the current CNN.
Specifically, for each candidate C_i we first determine its dominant category ŷ_i, defined as the category with the highest confidence in the mean prediction, that is,

ŷ_i = argmax_k (1/m) Σ_{j=1}^{m} p_i^{j,k},
where p_i^{j,k} is the output of the current CNN for patch x_i^j on label k. After sorting the patches of C_i by their predictions on the dominant category ŷ_i, we apply Eq. 3 to the top α·100% of the patches, constructing a score matrix R_i of size αm × αm for each candidate C_i in U. Our proposed majority selection method automatically excludes the patches with noisy labels (see Tab. 1: diversity^α and diversity^{α,ω}) because of their low confidences.
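The majority selection step can be sketched as follows; `majority_select` is a simplified illustration of ours, not the paper's implementation, and its tie-breaking is an assumption:

```python
import numpy as np

def majority_select(P, alpha=0.25):
    """Keep only the top alpha*100% of a candidate's patches, ranked by
    their confidence on the candidate's dominant category (Sec. 3.3).

    P: (m, |Y|) array of patch predictions. Returns the selected rows.
    """
    P = np.asarray(P, dtype=float)
    m = P.shape[0]
    dominant = int(np.argmax(P.mean(axis=0)))  # category with highest mean prediction
    order = np.argsort(-P[:, dominant])        # most confident patches first
    keep = max(1, int(np.ceil(alpha * m)))
    return P[order[:keep]]

# Patches with noisy labels (low confidence on the dominant category)
# are automatically excluded from the entropy/diversity computation:
P = [[0.9, 0.1], [0.85, 0.15], [0.8, 0.2], [0.1, 0.9]]  # last patch is an outlier
print(majority_select(P, alpha=0.5))
```

The outlier patch, whose prediction contradicts the dominant category, never enters the score matrix, which is precisely how majority selection suppresses noisy labels.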
We should note that the idea of combining entropy and diversity was inspired by , but there is a fundamental difference: they computed their criterion across the whole unlabeled dataset, with time complexity quadratic in the total number of unlabeled samples, which is computationally very expensive, while we compute R_i locally on the selected patches within each candidate, with time complexity O((αm)²) per candidate, where m ≤ 50 in our experiments, saving considerable computation time. As a matter of fact, it is computationally infeasible to apply their method  in our three real-world applications because (a) their selection criterion involves all unlabeled samples (patches); for instance, we have 391,200 training patches for polyp detection (see Sec. 4.2), and computing their criterion matrix would demand 1.1 TB of memory (391,200 × 391,200 × 8 bytes); and (b) their batch selection algorithm is based on the truncated power method , which was unable to find a solution even for our smallest application (colonoscopy frame classification, with 42,000 training patches, in Sec. 4.1). Furthermore, as a conventional method, it lacks the advantages listed in Sec. 1. Specifically, they used an SVM  as the base classifier, which (a) cannot effectively start with a completely empty labeled dataset and (b) cannot incrementally improve the learner through continuous fine-tuning. They have no concept of candidate, and therefore (c) cannot exploit the expected consistency among the patches within each candidate for active selection and (d) cannot automatically handle noisy labels via majority selection.
3.4 Injecting randomization in active selection
As discussed in , simple random selection may outperform active selection at the beginning, because active selection depends on the current model to select examples for labeling; as a result, a poor selection made at an early stage may adversely affect the quality of subsequent selections, whereas random selection is less frequently locked into a poor hypothesis. In other words, active selection concentrates on exploiting the knowledge gained from the labels already acquired in order to further explore the decision boundary, while random selection concentrates on exploration and is thereby able to locate areas of the feature space where the classifier performs poorly. Therefore, an effective active learning strategy must strike a balance between exploration and exploitation. To this end, we inject randomization into our method by selecting candidate C_i actively according to its sampling probability p(C_i):
p(C_i) = a_i / Σ_{i'=1}^{ωb} a_{i'},

where a_i denotes the overall score of candidate C_i computed from its score matrix R_i, the candidates are sorted by a_i in descending order, and ω is named the random extension. Suppose b candidates are required for annotation. Instead of selecting the top b candidates, we extend the candidate selection pool to the top ω × b, and then select b candidates from this pool according to their sampling probabilities, thereby injecting randomization.
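Under the assumption that each candidate's sampling probability is its normalized score within the extended pool (the exact sampling distribution is our assumption), the selection step can be sketched as:

```python
import numpy as np

def random_extended_select(scores, b, omega=5, rng=None):
    """Select b candidates for annotation from the top omega*b pool,
    sampling each pool member with probability proportional to its
    worthiness score (a sketch of Sec. 3.4)."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    pool = np.argsort(-scores)[: omega * b]  # indices of the top omega*b candidates
    p = scores[pool] / scores[pool].sum()    # normalized sampling probabilities
    return rng.choice(pool, size=b, replace=False, p=p)

rng = np.random.default_rng(0)
scores = np.linspace(1.0, 0.1, num=50)  # 50 candidates, descending worthiness
picked = random_extended_select(scores, b=4, omega=5, rng=rng)
print(sorted(picked))  # 4 distinct indices, all within the top 20
```

High-scoring candidates remain the most likely picks (exploitation), while lower-ranked pool members retain a nonzero chance of selection (exploration).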
3.5 Comparing various learning strategies
From the discussion above, several active learning strategies can be derived, as summarized in Tab. II. Our comprehensive comparisons (reported in Sec. 4.4) show that (1) AFT1, which continuously fine-tunes the current model using only the newly annotated candidates, is unstable; (2) AFT2, which continuously fine-tunes the current model using all labeled candidates, needs careful parameter adjustment; and (3) AFT3 is the most reliable in comparison with AFT1 and AFT2, but it requires fine-tuning the original model from the beginning using all presently-available labeled samples at each step. To overcome this drawback, we develop an optimized version, AFT*, which continuously fine-tunes the current model using the newly annotated candidates Q enlarged by the misclassified candidates H, that is, Q ∪ H. Several researchers have demonstrated that fine-tuning offers better performance and is more robust than training from scratch. Moreover, our experiments show that AFT* saves training time by converging faster than repeatedly fine-tuning the original pre-trained CNN, and boosts performance by eliminating easy samples, focusing on hard samples, and preventing catastrophic forgetting.
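One such step can be sketched as follows, with a toy stand-in classifier; the `predict` interface is hypothetical, and in practice the model is the continuously fine-tuned CNN operating on patch sets rather than scalars:

```python
def hybrid_training_set(model, labeled, newly_annotated):
    """Build the training set for one continuous fine-tuning step: the
    newly annotated candidates plus the already-labeled candidates the
    current model still misclassifies, dropping easy (correctly
    classified) samples. A sketch with a hypothetical `model.predict`.
    """
    misclassified = [(x, y) for (x, y) in labeled if model.predict(x) != y]
    return newly_annotated + misclassified

class ToyModel:
    """Stand-in classifier: predicts 1 iff x > 0 (purely illustrative)."""
    def predict(self, x):
        return int(x > 0)

model = ToyModel()
labeled = [(-2, 0), (-1, 1), (3, 1), (4, 0)]  # two of these are misclassified
new = [(5, 1)]
print(hybrid_training_set(model, labeled, new))  # [(5, 1), (-1, 1), (4, 0)]
```

Easy samples (correctly classified by the current model) are excluded, so each fine-tuning step concentrates its budget on the new annotations and the still-hard cases.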
Fig. 2 (a) compares AFT* with RFT on the Places database. For RFT, six different sequences are generated via systematic random sampling, and the final curve plots the average performance of the six runs. As shown in Fig. 2 (a), AFT*, with only 2,906 candidate queries, can achieve the performance of RFT with 4,452 candidate queries in terms of AUC (area under the curve), while, with only 1,176 candidate queries, it can achieve the performance of full training with all 42,000 candidates. Thereby, 34.7% of the labeling cost is saved relative to RFT, and 97.2% relative to full training. When nearly 100% of the training data are used, the performance still keeps increasing; therefore, considering the 22-layer GoogLeNet architecture, the size of the dataset is still not sufficient. These results also show that AFT is a general algorithm, useful not only for biomedical datasets but also for natural-image datasets, and that it works for multi-class problems.
Active: diversity, diversity^α, diversity^ω, diversity^{α,ω}, etc.
L: labeled candidates.
Q: newly selected candidates.
H: misclassified candidates.
We apply our methods AFT3 and AFT* to three applications: colonoscopy frame classification, polyp detection, and pulmonary embolism (PE) detection. In terms of active selection criteria, eight strategies are compared in the experiments: diversity (Eq. 2), diversity^α (Sec. 3.3), diversity^ω (Sec. 3.4), diversity^{α,ω} (Secs. 3.3 and 3.4), entropy (Eq. 1), entropy^α, entropy^ω, and entropy^{α,ω}. For all applications, we set α to 1/4 and ω to 5. Tajbakhsh et al.  reported the state-of-the-art performance of fine-tuning and learning from scratch based on the whole datasets, which we use as the baseline for comparison. They also investigated the performance of (partial) fine-tuning with a sequence of partial training datasets, but their partition of the datasets was different; therefore, for a fair comparison with their idea, we introduce RFT, which at each step fine-tunes the original model from the beginning using all presently-available labeled samples L = L ∪ Q, where Q is randomly selected. In all three applications, our AFT begins with an empty training dataset and directly uses models pre-trained on ImageNet (AlexNet and GoogLeNet).
4.1 Colonoscopy Frame Classification
Image quality assessment can play a major role in the objective assessment of colonoscopy procedures. Typically, a colonoscopy video contains a large number of non-informative images with poor colon visualization that are not suitable for inspecting the colon or performing therapeutic actions. The larger the fraction of non-informative images in a video, the lower the quality of colon visualization, and thus the lower the quality of colonoscopy. Therefore, one way to measure the quality of a colonoscopy procedure is to monitor the quality of the captured images. Such quality assessment can be used during live procedures to limit low-quality examinations, or in a post-processing setting for quality monitoring purposes. Technically, image quality assessment in colonoscopy can be viewed as an image classification task whereby an input image is labeled as either informative or non-informative. Fig. 4 shows examples of informative and non-informative colonoscopy frames.
In this application, frames are regarded as candidates, as the labels (informative or non-informative) are associated with frames, as illustrated in Fig. 4 (a) and (b). In total, there are 4,000 colonoscopy candidates from 6 complete colonoscopy videos. A trained expert manually labeled the collected images as informative or non-informative (line 11 in Alg. LABEL:alg:AFT), and a gastroenterologist further reviewed the labeled images for corrections. The labeled frames are separated at the video level into training and test sets, each containing approximately 2,000 colonoscopy frames. For data augmentation, we extracted 21 patches from each frame.
Fig. 5 (a) shows that AFT* with only around 120 candidate queries (6.0%) can achieve the performance of fine-tuning AlexNet with the 100% training dataset (solid black line in the figure, AUC=.9366); with only 80 candidate queries (4.0%) it can achieve the performance of learning from scratch with the 100% training dataset (dashed black line, AUC=.9204); and with 80 candidate queries it can achieve the performance of RFT with 320 candidate queries. Thereby, around 75% of the labeling cost can be saved relative to RFT in colonoscopy frame classification. At the early steps, RFT yields better performance than most of the active selection strategies, because (1) random selection yields samples whose positive-negative ratio is compatible with that of the testing and validation datasets, and (2) the pre-trained model, having been trained on natural images, gives poor predictions in the biomedical image domain; its output probabilities are mostly confused or even opposite, yielding poor selection scores. However, with the randomness injected as described in Sec. 3.4, diversity^{α,ω} and entropy^{α,ω} perform better even at the early stages and improve quickly from step to step. In addition, AFT* performs promisingly and efficiently compared with AFT3, which always uses the whole labeled dataset and fine-tunes from the beginning (see comparisons in Tab. III).
4.2 Polyp Detection
Colonoscopy is the preferred technique for colon cancer screening and prevention. The goal of colonoscopy is to find and remove colonic polyps—precursors to colon cancer. Polyps, as shown in Fig. 6, can appear with substantial variations in color, shape, and size. The challenging appearance of polyps can often lead to misdetection, particularly during long and back-to-back colonoscopy procedures, where fatigue negatively affects the performance of colonoscopists. Polyp miss-rates are estimated to be about 4% to 12% [48, 49, 50, 51]; however, a more recent clinical study  suggests that this misdetection rate may be as high as 25%. Missed polyps can lead to late diagnosis of colon cancer, with an associated survival rate of less than 10% for metastatic colon cancer . Computer-aided polyp detection may enhance optical colonoscopy screening by reducing polyp misdetection.
The dataset contains 38 patients, with one video each. The training dataset is composed of 21 videos (11 with polyps and 10 without), while the testing dataset is composed of 17 videos (8 with polyps and 9 without). In this application, each polyp candidate is regarded as a candidate. At the video level, the candidates were divided into a training dataset (16,300 candidates) and a testing dataset (11,950 candidates). Each candidate contains 24 patches.
At each polyp candidate location, given its bounding box, we performed multi-scale data augmentation. At each scale, we extracted patches after translating the candidate by 10 percent of the resized bounding box in the vertical and horizontal directions. We further rotated each resulting patch 8 times by mirroring and flipping. All patches generated by data augmentation belong to the same candidate.
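The augmentation just described can be sketched as follows; the scale values and translation grid below are illustrative assumptions (the actual factors yield 24 patches per candidate), while the 8 mirror/flip variants follow the text:

```python
import numpy as np

def augment_candidate(image, box, scales=(1.0, 1.2, 1.4), shift=0.1):
    """Sketch of the described augmentation: crop the candidate's bounding
    box at several scales, translate each crop by `shift` (10%) of the box
    size horizontally and vertically, then produce 8 mirror/flip variants
    of each crop (4 right-angle rotations x {identity, horizontal flip}).
    """
    x, y, w, h = box
    patches = []
    for s in scales:
        sw, sh = int(w * s), int(h * s)
        for dx, dy in [(0, 0), (int(shift * sw), 0), (0, int(shift * sh))]:
            x0, y0 = max(0, x + dx), max(0, y + dy)
            crop = image[y0:y0 + sh, x0:x0 + sw]
            for k in range(4):
                rot = np.rot90(crop, k)
                patches.append(rot)           # rotated copy
                patches.append(np.fliplr(rot))  # mirrored copy
    return patches

img = np.arange(100 * 100).reshape(100, 100)
patches = augment_candidate(img, box=(20, 20, 30, 30))
print(len(patches))  # 3 scales x 3 translations x 8 = 72 patches
```

All 72 crops here descend from the same candidate and would inherit its label, which is exactly how patch-level label noise can arise for ambiguous candidates.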
Fig. 5 (b) shows that AFT* with only around 320 candidate queries (2.04%) can achieve the performance of fine-tuning AlexNet with the 100% training dataset (solid black line in the figure, AUC=.9615); with only 10 candidate queries (0.06%) it can achieve the performance of learning from scratch with the 100% training dataset (dashed black line, AUC=.9358); and with 10 candidate queries it can achieve the performance of RFT with 80 candidate queries. Thereby, nearly 87% of the labeling cost can be saved relative to RFT in polyp detection. The fast convergence and outstanding performance of AFT* are attributed to the majority selection and randomization methods, which efficiently select informative and representative candidates while excluding those with noisy labels, and which boost performance at the early stages. Diversity, which strongly favors candidates whose prediction pattern resembles Pattern C (see Tab. 1), performs even worse than RFT because of the noisy labels generated through data augmentation.
4.3 Pulmonary Embolism Detection
Pulmonary Embolism (PE) is a major national health problem, and computer-aided PE detection can improve the diagnostic capabilities of radiologists. A PE is a blood clot that travels from a lower extremity source to the lung, where it causes blockage of the pulmonary arteries. The mortality rate of untreated PE may approach 30% , but it decreases to as low as 2% with early diagnosis and appropriate treatment . CT pulmonary angiography (CTPA) is the primary means of PE diagnosis, wherein a radiologist carefully traces each branch of the pulmonary artery for any suspected PEs. CTPA interpretation is a time-consuming task whose accuracy depends on human factors, such as attention span and sensitivity to the visual characteristics of PEs. CAD can play a major role in improving PE diagnosis and decreasing the reading time of CTPA datasets.
We use a database consisting of 121 CTPA datasets with a total of 326 PEs. Each PE candidate is regarded as a candidate with 50 patches. We divided the candidates at the patient level into a training dataset with 434 true positives (199 unique PEs) and 3,406 false positives, and a testing dataset with 253 true positives (127 unique PEs) and 2,162 false positives. The overall PE probability is calculated by averaging the probabilistic predictions generated for the patches within each PE candidate after data augmentation.
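The candidate-level scoring reduces to a simple average over the candidate's patch predictions:

```python
import numpy as np

def candidate_probability(patch_probs):
    """Candidate-level PE probability: the mean of the probabilistic
    predictions over the candidate's augmented patches."""
    return float(np.mean(patch_probs))

# Hypothetical patch-level predictions for one PE candidate:
print(candidate_probability([0.9, 0.8, 0.7, 0.95]))  # 0.8375
```

Averaging over the 50 augmented patches smooths out individual noisy patch predictions before the candidate-level decision is made.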
Fig. 5 (c) shows that AFT* with 2,560 candidate queries (66.68%) can nearly achieve the performance of both fine-tuning AlexNet and learning from scratch with the 100% training dataset (solid and dashed black lines in the figure, AUC=.8763 and AUC=.8706), and with 1,280 candidate queries can achieve the performance of RFT with 2,560 candidate queries. Based on this analysis, our method can cut the cost of annotation by at least half relative to RFT in pulmonary embolism detection.
4.4 Comparison of all the methods
As summarized in Tab. II, several active learning strategies can be derived. The prediction performance is evaluated according to the ALC (Area under the Learning Curve), where a learning curve plots the AUC computed on the testing dataset as a function of the number of labels queried. Tab. III shows the ALC of AFT1, AFT2, AFT3, and AFT* compared with RFT. Our comprehensive experiments have demonstrated that:
Active fine-tuning methods using entropy and diversity with majority and randomization selection can achieve better performance than random selection.
Using only the newly selected candidates to fine-tune the current model (strategy AFT1) is unstable, because previously learned samples may be forgotten when the classifier is trained only on the newly selected samples at each step, leading to a lower ALC.
AFT2 requires careful parameter adjustment. Although its performance is acceptable, it requires the same computing time as AFT3, meaning there is no computational advantage to continuously fine-tuning the current model with all labeled samples.
AFT3 maintains the most reliable performance in comparison with AFT1 and AFT2.
The optimized version AFT* achieves performance comparable to AFT3 and occasionally outperforms it, by eliminating easy samples, focusing on hard samples, and preventing catastrophic forgetting.
Active selection with diversity performs even worse than RFT because it favors Pattern C, which is most likely associated with noisy labels injected through data augmentation (see Fig. 4 (c), (d)).
Majority selection (described in Sec. 3.3) can automatically handle noisy labels and overcome the drawback of diversity, especially for colonoscopy frame classification, which suffers from severe label noise as shown in Fig. 4 (c), (d). Majority selection also involves fewer patches per annotation unit, saving computation time.
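The ALC metric used in these comparisons can be sketched as follows; normalizing by the query range, so that curves with different budgets are comparable, is an assumption of this sketch:

```python
import numpy as np

def alc(num_queries, aucs):
    """Area under the Learning Curve (ALC): trapezoidal area under test
    AUC as a function of the number of labels queried, normalized by the
    total query range."""
    q = np.asarray(num_queries, dtype=float)
    a = np.asarray(aucs, dtype=float)
    return float(np.trapz(a, q) / (q[-1] - q[0]))

# A strategy that reaches high AUC earlier earns a larger ALC:
fast = alc([0, 100, 200, 400], [0.60, 0.90, 0.93, 0.94])
slow = alc([0, 100, 200, 400], [0.60, 0.70, 0.85, 0.94])
print(fast > slow)  # True
```

Two strategies ending at the same final AUC can thus have very different ALCs, which is why ALC rewards annotation efficiency rather than just final accuracy.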
4.5 Observations on selected patterns
We meticulously monitored the active selection process and examined the selected candidates; as an example, Fig. 8 shows the top 10 candidates selected by the four AFT methods at Step 3 in colonoscopy frame classification. From this process, we have observed the following:
Patterns A and B are dominant in the earlier stages of AFT as the CNN has not been fine-tuned properly to the target domain.
Patterns C, D and E are dominant in the later stages of AFT as the CNN has been largely fine-tuned on the target dataset.
The majority selection variants, i.e., entropy^α and diversity^α, are effective in excluding Patterns C, D, and E, while entropy (without majority selection) can also handle Patterns C, D, and E reasonably well.
Patterns B, F, and G generally make good contributions to elevating the current CNN’s performance.
Entropy and entropy^α favor Pattern A because of its higher degree of uncertainty, as shown in Fig. 8.
Diversity^α prefers Pattern B while diversity prefers Pattern C (Fig. 8). This is why diversity may cause sudden disturbances in the CNN's performance and why diversity^α should be preferred in general.
In addition, to give a visual impression of what the newly selected images look like, we show the top and bottom five images selected by four active selection strategies (i.e., diversity, diversity^α, entropy, and entropy^α) from Places at Step 11 in Fig. 11, together with their associated predictions by the current CNN in Tab. V. Such a gallery offers an intuitive way to analyze the most and least favored images and has helped us develop the different active selection strategies.
4.6 Automatically balancing positive-negative ratios
In real-world applications, datasets are usually unbalanced; to achieve good classification performance, it is better to balance the training dataset across classes. With random selection, the class ratio of the selected set is roughly that of the whole training dataset. We observe that our AFT methods automatically balance the selected training dataset. Our datasets are unbalanced, with more negatives than positives: as shown in Fig. 9, the positive-to-negative ratio is around 3:7 for colonoscopy frame classification and around 1:9 for both polyp detection and pulmonary embolism detection. Monitoring the active selection process shows that our AFT methods select at least twice as many positives as random selection. We believe this is one of the reasons they achieve better performance quickly.
4.7 Generalizability of AFT in CNN architectures
Tab. IV: Learning parameters of AlexNet for each application.

| Application | Momentum | Learning rate (fine-tuning) | Learning rate (training) | γ | Epochs |
|---|---|---|---|---|---|
| Colonoscopy frame classification | 0.9 | 0.0001 | 0.001 | 0.95 | 8 |
| Pulmonary embolism detection | 0.9 | 0.001 | 0.01 | 0.95 | 5 |
We based our experiments on AlexNet because its architecture offers a nice balance in depth: it is deep enough to let us investigate the impact of our AFT methods on the performance of pre-trained CNNs, yet shallow enough to run experiments quickly. The learning parameters used for training and fine-tuning AlexNet in our experiments are summarized in Tab. IV. Alternatively, deeper architectures, such as GoogLeNet, ResNet, and DenseNet, could have been used and have shown high performance on challenging computer vision tasks. However, the purpose of this work is not to achieve the highest performance on different biomedical imaging tasks but to answer a critical question: how to dramatically reduce the cost of annotation when applying CNNs in biomedical imaging. For this purpose, AlexNet is a reasonable architectural choice. Nevertheless, we have repeated our three applications with GoogLeNet, demonstrating consistent patterns between AlexNet and GoogLeNet, as shown in Figs. 2, 5 and 10. Given this generalizability, we could focus on comparing the prediction patterns (summarized in Tab. I) and learning strategies (summarized in Tab. II) instead of running experiments on many deep neural network architectures.
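For completeness, the optimizer behavior implied by Tab. IV, SGD with momentum and a learning rate decayed exponentially by γ per epoch (our reading of the table columns), can be sketched without any framework:

```python
def decayed_lr(base_lr, gamma, epoch):
    """Learning rate after `epoch` epochs of exponential decay."""
    return base_lr * gamma ** epoch

def sgd_momentum_step(w, grad, velocity, lr, mu=0.9):
    """One SGD-with-momentum update on a single weight: the velocity
    accumulates a decaying sum of past gradients, smoothing the
    fine-tuning trajectory of the pre-trained CNN."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
```

The 10x smaller learning rate used for fine-tuning, relative to training, keeps the updates from washing out the pre-trained ImageNet features.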
In this paper, we have developed a novel method that dramatically cuts annotation cost by integrating active learning and transfer learning. In comparison with the state-of-the-art method (random selection), our method cuts the annotation cost by at least half in three biomedical applications and by 33% on a natural image database (Places). This performance is attributed to the eight advantages of our method (detailed in Sec. 1).
We choose to select, classify, and label samples at the candidate level. Labeling at the patient level would reduce the cost of annotation even further but introduce more severe label noise; labeling at the patch level would cope with the label noise but impose a much heavier annotation burden on experts. We believe that labeling at the candidate level offers a sensible balance in our three applications.
This research has been supported partially by the NIH under Award Number R01HL128785 and by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
-  B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  H. Greenspan, B. van Ginneken, and R. M. Summers, “Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1153–1159, 2016.
-  K. Zhou, H. Greenspan, and D. Shen, Deep Learning for Medical Image Analysis. Academic Press, 2016.
-  L. Lu, Y. Zheng, G. Carneiro, and L. Yang, Deep learning and convolutional neural networks for medical image computing: precision medicine, high performance and large-scale datasets. Springer, 2017.
-  C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
-  M. Kukar, “Transductive reliability estimation for medical diagnosis,” Artificial Intelligence in Medicine, vol. 29, no. 1, pp. 81–106, 2003.
-  H. Chen, Q. Dou, D. Ni, J.-Z. Cheng, J. Qin, S. Li, and P.-A. Heng, “Automatic fetal ultrasound standard plane detection using knowledge transferred recurrent neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 507–514.
-  T. Schlegl, J. Ofner, and G. Langs, “Unsupervised pre-training across image domains improves lung tissue classification,” in Medical Computer Vision: Algorithms for Big Data. Springer, 2014, pp. 82–93.
-  H. Chen, D. Ni, J. Qin, S. Li, X. Yang, T. Wang, and P. A. Heng, “Standard plane localization in fetal ultrasound via domain transferred deep neural networks,” IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 5, pp. 1627–1636, Sept 2015.
-  G. Carneiro, J. Nascimento, and A. Bradley, “Unregistered multiview mammogram analysis with pre-trained deep learning models,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, ser. Lecture Notes in Computer Science, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Springer International Publishing, 2015, vol. 9351, pp. 652–660. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-24574-4_78
-  H.-C. Shin, L. Lu, L. Kim, A. Seff, J. Yao, and R. M. Summers, “Interleaved text/image deep mining on a very large-scale radiology database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1090–1099.
-  M. Gao, U. Bagci, L. Lu, A. Wu, M. Buty, H.-C. Shin, H. Roth, G. Z. Papadakis, A. Depeursinge, R. M. Summers et al., “Holistic classification of ct attenuation patterns for interstitial lung diseases via deep convolutional neural networks,” in the 1st Workshop on Deep Learning in Medical Image Analysis, International Conference on Medical Image Computing and Computer Assisted Intervention, at MICCAI-DLMIA’15, 2015.
-  J. Margeta, A. Criminisi, R. Cabrera Lozoya, D. C. Lee, and N. Ayache, “Fine-tuned convolutional neural nets for cardiac mri acquisition plane recognition,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, pp. 1–11, 2015.
-  N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: Full training or fine tuning?” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
-  H. Wang, Z. Zhou, Y. Li, Z. Chen, P. Lu, W. Wang, W. Liu, and L. Yu, “Comparison of machine learning methods for classifying mediastinal lymph node metastasis of non-small cell lung cancer from 18F-FDG PET/CT images,” EJNMMI Research, vol. 7, no. 1, p. 11, 2017.
-  B. Settles, “Active learning literature survey,” Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
-  I. Guyon, G. Cawley, G. Dror, V. Lemaire, and A. Statnikov, JMLR Workshop and Conference Proceedings (Volume 16): Active Learning Challenge. Microtome Publishing, 2011.
-  K. Konyushkova, R. Sznitman, and P. Fua, “Learning active learning from data,” in Advances in Neural Information Processing Systems, 2017, pp. 4226–4236.
-  A. Mosinska, J. Tarnawski, and P. Fua, “Active learning and proofreading for delineation of curvilinear structures,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 165–173.
-  S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” in Advances in neural information processing systems, 2010, pp. 892–900.
-  M. Li and I. K. Sethi, “Confidence-based active learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1251–1261, 2006.
-  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” arXiv preprint arXiv:1702.05747, 2017.
-  J. Zhang, W. Li, and P. Ogunbona, “Transfer learning for cross-dataset recognition: A survey.”
-  R. Chattopadhyay, W. Fan, I. Davidson, S. Panchanathan, and J. Ye, “Joint transfer and batch-mode active learning,” in International Conference on Machine Learning, 2013, pp. 253–261.
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
-  K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 3, no. 1, p. 9, 2016.
-  D. Wang and Y. Shang, “A new active labeling method for deep learning,” in 2014 International Joint Conference on Neural Networks (IJCNN), July 2014, pp. 112–119.
-  J. Li, “Active learning for hyperspectral image classification with a stacked autoencoders based neural network,” in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 1062–1065.
-  F. Stark, C. Hazırbas, R. Triebel, and D. Cremers, “Captcha recognition with active deep learning,” in Workshop New Challenges in Neural Computation 2015. Citeseer, 2015, p. 94.
-  M. Al Rahhal, Y. Bazi, H. AlHichri, N. Alajlan, F. Melgani, and R. Yager, “Deep learning approach for active classification of electrocardiogram signals,” Information Sciences, vol. 345, pp. 340–354, 2016.
-  L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen, “Suggestive annotation: A deep active learning framework for biomedical image segmentation,” arXiv preprint arXiv:1706.04737, 2017.
-  E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 640–651, 2017.
-  B. Efron and R. J. Tibshirani, An introduction to the bootstrap. CRC press, 1994.
-  U. Feige, “A threshold of ln n for approximating set cover,” J. ACM, vol. 45, no. 4, pp. 634–652, Jul. 1998. [Online]. Available: http://doi.acm.org/10.1145/285055.285059
-  Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang, “Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally,” in IEEE conference on computer vision and pattern recognition (CVPR), Hawaii, USA, 2017, pp. 7340–7351.
-  S. Chakraborty, V. Balasubramanian, Q. Sun, S. Panchanathan, and J. Ye, “Active batch selection via convex relaxations with guaranteed solution bounds,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 10, pp. 1945–1958, 2015.
-  X.-T. Yuan and T. Zhang, “Truncated power method for sparse eigenvalue problems,” Journal of Machine Learning Research, vol. 14, no. Apr, pp. 899–925, 2013.
-  S. R. Gunn et al., “Support vector machines for classification and regression,” ISIS Technical Report, vol. 14, pp. 85–86, 1998.
-  A. Borisov, E. Tuv, and G. Runger, “Active batch learning with stochastic query by forest,” in JMLR: Workshop and Conference Proceedings (2010). Citeseer, 2010.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” arXiv preprint arXiv:1608.06993, 2016.
-  A. Pabby, R. E. Schoen, J. L. Weissfeld, R. Burt, J. W. Kikendall, P. Lance, M. Shike, E. Lanza, and A. Schatzkin, “Analysis of colorectal cancer occurrence during surveillance colonoscopy in the dietary polyp prevention trial,” Gastrointest Endosc, vol. 61, no. 3, pp. 385–91, 2005.
-  J. van Rijn, J. Reitsma, J. Stoker, P. Bossuyt, S. van Deventer, and E. Dekker, “Polyp miss rate determined by tandem colonoscopy: a systematic review,” American Journal of Gastroenterology, vol. 101, no. 2, pp. 343–350, 2006.
-  D. H. Kim, P. J. Pickhardt, A. J. Taylor, W. K. Leung, T. C. Winter, J. L. Hinshaw, D. V. Gopal, M. Reichelderfer, R. H. Hsu, and P. R. Pfau, “Ct colonography versus colonoscopy for the detection of advanced neoplasia,” N Engl J Med, vol. 357, no. 14, pp. 1403–12, 2007.
-  D. Heresbach, T. Barrioz, M. G. Lapalus, D. Coumaros et al., “Miss rate for colorectal neoplastic polyps: a prospective multicenter study of back-to-back video colonoscopies,” Endoscopy, vol. 40, no. 4, pp. 284–290, 2008.
-  A. Leufkens, M. van Oijen, F. Vleggaar, and P. Siersema, “Factors influencing the miss rate of polyps in a back-to-back colonoscopy study,” Endoscopy, vol. 44, no. 05, pp. 470–475, 2012.
-  L. Rabeneck, H. El-Serag, J. Davila, and R. Sandler, “Outcomes of colorectal cancer in the united states: no change in survival (1986-1997).” The American journal of gastroenterology, vol. 98, no. 2, p. 471, 2003.
-  K. K. Calder, M. Herbert, and S. O. Henderson, “The mortality of untreated pulmonary embolism in emergency department patients.” Annals of emergency medicine, vol. 45, no. 3, pp. 302–310, 2005. [Online]. Available: http://dx.doi.org/10.1016/j.annemergmed.2004.10.001
-  G. Sadigh, A. M. Kelly, and P. Cronin, “Challenges, controversies, and hot topics in pulmonary embolism imaging,” American Journal of Roentgenology, vol. 196, no. 3, 2011. [Online]. Available: http://dx.doi.org/10.2214/AJR.10.5830
-  N. Tajbakhsh, M. B. Gotway, and J. Liang, “Computer-aided pulmonary embolism detection using a novel vessel-aligned multi-planar image representation and convolutional neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 62–69.
-  I. Guyon, G. C. Cawley, G. Dror, and V. Lemaire, “Results of the active learning challenge.” Active Learning and Experimental Design@ AISTATS, vol. 16, pp. 19–45, 2011.