AFT*: Integrating Active Learning and Transfer Learning to Reduce Annotation Efforts

02/03/2018 · Zongwei Zhou, et al. · Arizona State University and Mayo Foundation for Medical Education and Research

The splendid success of convolutional neural networks (CNNs) in computer vision is largely attributed to the availability of large annotated datasets, such as ImageNet and Places. However, in biomedical imaging, it is very challenging to create such large annotated datasets, as annotating biomedical images is not only tedious, laborious, and time consuming, but also demanding of costly, specialty-oriented skills, which are not easily accessible. To dramatically reduce annotation cost, this paper presents a novel method to naturally integrate active learning and transfer learning (fine-tuning) into a single framework, called AFT*, which starts directly with a pre-trained CNN to seek "worthy" samples for annotation and gradually enhance the (fine-tuned) CNN via continuous fine-tuning. We have evaluated our method in three distinct biomedical imaging applications, demonstrating that it can cut the annotation cost by at least half, in comparison with the state-of-the-art method. This performance is attributed to the several advantages derived from the advanced active, continuous learning capability of our method. Although AFT* was initially conceived in the context of computer-aided diagnosis in biomedical imaging, it is generic and applicable to many tasks in computer vision and image analysis; we illustrate the key ideas behind AFT* with the Places database for scene interpretation in natural images.


1 Introduction

Convolutional neural networks (CNNs) [1] have brought about a revolution in computer vision thanks to large annotated datasets, such as ImageNet [2] and Places [3]. As evidenced by an IEEE TMI special issue [4] and two recent books [5, 6], intense interest in applying CNNs in biomedical image analysis is widespread, but its success is impeded by the lack of such large annotated datasets in biomedical imaging. Annotating biomedical images is not only tedious and time consuming, but also demanding of costly, specialty-oriented knowledge and skills, which are not easily accessible. Therefore, we seek to answer a critical question: how can the cost of annotation be dramatically reduced when applying CNNs in biomedical imaging? We also address an accompanying question: given a labeled dataset, how can we determine whether it sufficiently covers the variations of the objects of interest? In doing so, we present a novel method called AFT* to naturally integrate active learning and transfer learning into a single framework. Our AFT* method starts directly with a pre-trained CNN to seek "salient" samples from the unannotated pool for annotation, and the (fine-tuned) CNN is continuously fine-tuned using newly annotated samples combined with all misclassified samples. We have evaluated our method in three different applications, including colonoscopy frame classification, polyp detection, and pulmonary embolism (PE) detection, demonstrating that the cost of annotation can be cut by at least half.

Algorithm 1: The AFT* algorithm.
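Since the algorithm float did not survive extraction, the following is a minimal, self-contained sketch of the AFT* loop as described in this section and Sec. 3. The stub model, the random "expert" labels, and all names are our illustrative assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K, b = 100, 8, 3, 5           # candidates, patches each, classes, batch

def predict(theta, X):
    """Stub CNN: per-patch softmax for one candidate (illustrative only)."""
    z = np.exp(X @ theta)
    return z / z.sum(axis=1, keepdims=True)

X = rng.standard_normal((n, m, 4))  # toy patch features
theta = rng.standard_normal((4, K)) # stands in for the pre-trained CNN
labeled, unlabeled = {}, set(range(n))

for step in range(5):               # main loop of Alg. 1 (sketch)
    P = {i: predict(theta, X[i]) for i in unlabeled}
    # "Worthiness" of a candidate: here the mean patch entropy; Eqs. 1-3
    # in Sec. 3.2 refine this with diversity and majority selection.
    score = {i: float(-(P[i] * np.log(P[i])).sum() / m) for i in unlabeled}
    queries = sorted(unlabeled, key=score.get, reverse=True)[:b]
    for i in queries:               # the expert annotates b candidates per step
        labeled[i] = int(rng.integers(K))
        unlabeled.discard(i)
    # Continuous fine-tuning on newly selected plus misclassified candidates,
    # stubbed here as a small parameter update; a real system would run SGD.
    theta += 0.01 * rng.standard_normal(theta.shape)

print(f"{len(labeled)} candidates annotated after 5 steps")
```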

This outstanding performance is attributed to a simple yet powerful observation: To boost the performance of CNNs in biomedical imaging, multiple patches are usually generated automatically for each candidate through data augmentation; these patches generated from the same candidate share the same label, and are naturally expected to have similar predictions by the current CNN before they are expanded into the training dataset. As a result, their entropy [7] and diversity [8] provide a useful indicator to the “power” of a candidate in elevating the performance of the current CNN. However, automatic data augmentation inevitably generates “hard” samples for some candidates, injecting noisy labels; therefore, to significantly enhance the robustness of our method, we compute entropy and diversity by selecting only a portion of the patches of each candidate according to the predictions by the current CNN.

Several researchers have demonstrated the utility of fine-tuning CNNs for biomedical image analysis, but they only performed one-time fine-tuning, that is, simply fine-tuning a pre-trained CNN once with all available training samples, involving no active selection processes (e.g., [9, 10, 11, 12, 13, 14, 15, 16]). To our knowledge, our proposed method is among the first to integrate active learning into fine-tuning CNNs in a continuous fashion to make CNNs more amenable for biomedical image analysis, with an aim to cut annotation cost dramatically. Compared with conventional active learning, our method, summarized in Alg. 1, offers eight advantages:

  1. Starting with a completely empty labeled dataset, requiring no seed-labeled candidates (see Alg. 1);

  2. Incrementally improving the learner through continuous fine-tuning rather than repeatedly re-training (see Sec. 3.5);

  3. Actively selecting the most informative and representative candidates by naturally exploiting the expected consistency among the patches within each candidate (see Sec. 3.2);

  4. Computing selection criteria locally on a small number of patches within each candidate, saving computation time considerably (see Sec. 3.2);

  5. Automatically handling noisy labels via majority selection (see Sec. 3.3);

  6. Autonomously balancing training samples among classes (see Sec. 4.6 and Fig. 9);

  7. Combining newly selected candidates with misclassified candidates, eliminating easy samples to improve training efficiency and focusing on hard samples to prevent catastrophic forgetting (see Sec. 3.5);

  8. Incorporating randomness in active selection to reach a near-optimal trade-off between exploration and exploitation (see Sec. 3.4).

More importantly, our method has the potential to exert an important impact on computer-aided diagnosis (CAD) in biomedical imaging, because current regulations require that CAD systems be deployed in a "closed" environment, in which all CAD results are reviewed and errors, if any, are corrected by radiologists. As a result, all false positives are supposed to be dismissed and all false negatives supplied, providing instant on-line feedback that, given the continuous fine-tuning capability of our method, may make CAD systems self-learning and self-improving after deployment.

Fig. 1: We illustrate the ideas behind AFT* by utilizing Places [3] for scene interpretation in natural images. For simplicity yet without loss of generality, we limit ourselves to three categories: (k) kitchen, (l) living room, and (o) office. Places has 15,100 images in each category, while, as we can imagine, the possible layouts and appearances of kitchens, living rooms, and offices vary dramatically in the real world.
Fig. 2: Referring to Alg. 1, AFT* aims to cut annotation cost by iteratively recommending the most informative and representative samples for experts to label, so as to minimize the number of labeled samples required to achieve a satisfactory performance. For Places (a), by actively selecting 2,906 images (6.92% of the whole dataset), AFT* (solid orange) can offer the same performance as 4,452 images through random selection (RFT, dashed orange) in terms of AUC (Area Under the Curve), thus saving 34.7% of the annotation cost relative to RFT. Furthermore, with 1,176 actively selected images (2.80% of the whole dataset), AFT* can reach the performance of full training (dashed black) with 42,000 images, thereby saving 97.2% of the annotation cost relative to full training. In Sec. 4, we perform thorough experiments demonstrating that, with a small subset of the labeled samples, AFT* can approximately achieve the performance of full training (dashed black) or fine-tuning (solid black) using all samples for three distinct medical imaging applications, including colonoscopy frame classification (b), polyp detection (c), and pulmonary embolism detection (d) [see Sec. 4 for details]. However, for Places, AFT* keeps improving in performance until all images available in Places have been used, indicating that the Places database is still small; this comes as no surprise, given only 15,100 images per category in Places relative to the large variations of layouts and appearances of kitchens, living rooms, and offices shown in Fig. 1. Therefore, AFT* can not only suggest samples worthy of labeling but also help determine whether a labeled dataset is sufficient in covering the variations of the objects of interest. Please note that, following the standard active learning experimental setup, both AFT* and RFT select samples from the remaining training dataset; they will eventually use the same whole training dataset, naturally yielding similar performance at the end. However, the goal of active learning is to find the sweet spots where a learner can achieve an acceptable performance with a minimal number of labeled samples.
Fig. 3: Illustrated are two images (A and B) and their augmented image patches, arranged according to the predictions on the dominant category by the CNN at Step 10 (after 3,000 image label queries). Intuitively, an image would contribute very little to boosting the current CNN's performance if the predictions of its augmented patches are highly certain and consistent; naturally, the entropy and diversity of its augmented patches provide a useful indicator to its "power" in elevating the current CNN. However, automatic data augmentation inevitably generates hard samples, and there is no need to classify them all confidently in the intermediate stages; therefore, we select only the top α·100% of the patches with the highest predictions on the dominant category in computing entropy and diversity. We have found that α = 1/4 works well across our three distinct medical imaging applications. In this case, the entropy and diversity for Images A and B are (2.17, 0.35) and (4.59, 9.32), respectively, showing that Image B is more uncertain and diverse than Image A, and therefore more worthy of labeling. As a matter of fact, its label is living room in Places; thus its augmented patches are classified mostly wrongly by the current CNN, and including it in the training set is of great value. For comparison, Image A is labeled as office and the current CNN classifies its top augmented patches as office with high confidence; thereby, labeling it would prove fruitless. As a note, computing entropy and diversity on all augmented patches yields (17.33, 297.52) for Image A and (18.50, 262.39) for Image B, which would mislead the selection, as it indicates that the two images are close in entropy (17.33 vs. 18.50) and that Image A is more diverse than Image B (297.52 vs. 262.39). Therefore, the majority selection presented in Sec. 3.3 is a critical component of AFT*.

2 Related work

The literature on active learning, deep learning, and transfer learning is rich and deep [4, 5, 6, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]. Due to space, we focus on the most relevant works only.

2.1 Transfer learning for medical imaging

Carneiro et al. [12] replaced the fully connected layers of a pre-trained CNN with a new logistic layer and trained only the appended layer with the labeled data while keeping the rest of the network unchanged, yielding promising results for the classification of unregistered multiview mammograms. In [11], a fine-tuned pre-trained CNN was applied to localizing standard planes in ultrasound images. Gao et al. [14] fine-tuned all layers of a pre-trained CNN for automatic classification of interstitial lung diseases. In [13], Shin et al. used fine-tuned pre-trained CNNs to automatically map medical images to document-level topics, document-level sub-topics, and sentence-level topics. In [15], fine-tuned pre-trained CNNs were used to automatically retrieve missing or noisy cardiac acquisition plane information from magnetic resonance imaging and predict the five most common cardiac views. Schlegl et al. [10] explored unsupervised pre-training of CNNs to inject information from sites or image classes for which no annotations were available, and showed that such across-site pre-training improved classification accuracy compared to random initialization of the model parameters. Tajbakhsh et al. [16] systematically investigated the capabilities of transfer learning in several medical imaging applications. However, they all performed one-time fine-tuning, that is, simply fine-tuning a pre-trained CNN just once with available training samples, involving neither active selection processes nor continuous fine-tuning.

2.2 Integrating active learning with deep learning

Research on integrating active learning and deep learning is sparse: Wang and Shang [29] may have been the first to incorporate active learning into deep learning, basing their approach on stacked restricted Boltzmann machines and stacked autoencoders. A similar idea was reported for hyperspectral image classification [30]. Stark et al. [31] applied active learning to improve the performance of CNNs for CAPTCHA recognition, while Al Rahhal et al. [32] exploited deep learning for active electrocardiogram classification. Most recently, Yang et al. [33] presented an active learning framework to reduce annotation effort by judiciously suggesting the most effective annotation areas for segmentation based on uncertainty and similarity information provided by fully convolutional networks [34]. Their approach is computationally very expensive, as they need to train a set of models from scratch (unlike ours, via fine-tuning) via bootstrapping [35] in order to compute an uncertainty measure based on these models' disagreements, and must solve a generalized version of the maximum set cover problem [36], which is NP-hard, to determine the most representative areas. In short, all the aforementioned approaches are fundamentally different from AFT* in that in each step they repeatedly re-train the learner from scratch, while we continuously fine-tune the (fine-tuned) CNN in an incremental manner, offering the several advantages listed in Sec. 1 and leading to dramatic annotation cost reduction and computational efficiency.

2.3 Our work

We presented a method for integrating active learning and deep learning via continuous fine-tuning in our CVPR paper [37], but it was limited to binary classification and biomedical imaging, and used all labeled samples available in each step, thereby demanding long training time and large computer memory. This paper is a significant extension of our CVPR paper [37], making several contributions: (1) generalizing binary classification to multi-class classification; (2) extending from computer-aided diagnosis in medical imaging to scene interpretation in natural images; (3) combining newly selected samples with hard (misclassified) samples, eliminating easy samples to reduce training time and concentrating on hard samples to prevent catastrophic forgetting; (4) injecting randomness to enhance robustness in active selection; (5) experimenting extensively with all reasonable combinations of data and models in search of an optimal strategy; (6) demonstrating consistent annotation reduction with different CNN architectures; and (7) illustratively explaining the active selection process with a gallery of the patches associated with predictions.

3 Proposed method

AFT* was conceived in the context of computer-aided diagnosis (CAD) in biomedical imaging. A CAD system typically has a candidate generator, which can quickly produce a set of candidates, among which some are true positives and some are false positives. After candidate generation, the task is to train a classifier to eliminate as many false positives as possible while keeping as many true positives as possible. To train a classifier, each of the candidates must be labeled. We assume that each candidate takes one of $|\mathcal{Y}|$ possible labels. To boost the performance of CNNs for CAD systems, multiple patches are usually generated automatically for each candidate through data augmentation; the patches generated from the same candidate inherit the candidate's label. In other words, all labels are acquired at the candidate level. Mathematically, given a set of candidates $\mathcal{U} = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_n\}$, where $n$ is the number of candidates and each candidate $\mathcal{C}_i$ is associated with $m$ patches $\{x_i^1, x_i^2, \ldots, x_i^m\}$, our AFT* algorithm iteratively selects a set of $b$ candidates for labeling, as illustrated in Alg. 1.
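In code, one unit of annotation might be represented as follows; this is a hypothetical structure with names of our choosing, not part of the authors' implementation:

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class Candidate:
    """One annotation unit: a CAD candidate and its m augmented patches.
    All m patches inherit the single candidate-level label."""
    patches: List[np.ndarray]      # m patches from data augmentation
    label: Optional[int] = None    # one of |Y| classes, set by the expert
```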

However, AFT* is generic and applicable to many tasks in computer vision and image analysis. For clarity, we illustrate the ideas behind AFT* with the Places database [3] for scene interpretation in natural images, where no candidate generator is needed, as each image may be directly regarded as a candidate. For simplicity of illustration, yet without loss of generality, we limit ourselves to three categories (kitchen, living room, and office), and divide the Places images in each category into training (14,000 images), validation (1,000 images), and test (100 images) sets with no overlap.

Designing an active learning algorithm involves two key issues: (1) how to determine the "worthiness" of a candidate for annotation and (2) how to update the classifier/learner. Before formally addressing the first issue in Secs. 3.2–3.4 and the second in Sec. 3.5, we illustrate our ideas in Sec. 3.1 with Fig. 3 and Tab. 1.

TABLE I: Relationships among seven prediction patterns and four methods in active candidate selection (code available at https://github.com/MrGiovanni/Active-Learning). We assume that a candidate has 11 patches and that their probabilities predicted by the current CNN are listed in Row 2. Entropy^α and diversity^α operate on the top α·100% of the candidate's patches based on the prediction on the dominant category, as described in Sec. 3.3. In this illustration, we choose α to be 1/4, meaning that the selection criteria (Eqs. 1 and 2) are computed based on 3 patches within each candidate. The first choice of each method is highlighted in blue and the second choice in light blue. Combining entropy and diversity would be highly desirable, but striking a balance between them is not trivial, as it demands application-specific λ1 and λ2 (see Eq. 3) and requires further research.

3.1 Illustrating active candidate selection

Fig. 3 shows the active candidate selection process for multi-class classification, while, for ease of understanding, Tab. 1 illustrates it in the case of binary classification. Assuming the prediction of patch $x_i^j$ by the current CNN is $p_i^j$, we call the histogram of $\{p_i^j\}_{j=1}^{m}$ the prediction pattern of candidate $\mathcal{C}_i$. As shown in Row 1 of Table 1, in binary classification there are seven typical prediction patterns (a small numerical sketch follows the list below):

  • Pattern A: The patches’ predictions are mostly concentrated at 0.5, with a higher degree of uncertainty. Most active learning algorithms [18, 19] favor this type of candidates as they are good at reducing the uncertainty.

  • Pattern B: It is flatter than Pattern A, as the patches' predictions spread widely from 0 to 1, with a higher degree of inconsistency among the patches' predictions. Since all the patches belonging to a candidate are generated via data augmentation, they (or at least their majority) are expected to have similar predictions. Candidates of this type have the potential to contribute significantly to enhancing the current CNN's performance.

  • Pattern C: The patches' predictions are clustered at both ends, with a higher degree of diversity. Candidates of this type are most likely associated with noisy labels at the patch level, as illustrated in Fig. 4 (c), and they are the least favorable in active selection because they may cause confusion in fine-tuning the CNN.

  • Patterns D and E: The patches' predictions are clustered at one end (i.e., 0 or 1), with a higher degree of certainty. The annotation of candidates of this type should be postponed at this stage, because the current CNN has most likely already predicted them correctly, and they would contribute very little to fine-tuning the current CNN.

  • Patterns F and G: They have a higher degree of certainty in some of the patches' predictions and are associated with some outliers in the patches' predictions. Candidates of this type are valuable because they are capable of smoothly improving the CNN's performance. They might not make dramatic contributions, but they would not cause significant harm in enhancing the CNN's performance.
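To make these patterns concrete, the following sketch computes entropy (Eq. 1) and diversity (Eq. 2, both defined in Sec. 3.2) for binary prediction vectors resembling Patterns A and C; the prediction values and function names are our illustrative assumptions:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Eq. 1 for binary labels: summed Shannon entropy over the patches."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    probs = np.stack([p, 1 - p], axis=1)          # per-patch [pos, neg]
    return float(-(probs * np.log(probs)).sum())

def diversity(p, eps=1e-12):
    """Eq. 2 for binary labels: pairwise symmetric KL over the patches."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    probs = np.stack([p, 1 - p], axis=1)
    d = 0.0
    for j in range(len(p)):
        for l in range(j + 1, len(p)):
            d += float(((probs[j] - probs[l]) * np.log(probs[j] / probs[l])).sum())
    return d

pattern_a = [0.45, 0.5, 0.5, 0.55, 0.5]    # uncertain: high entropy, low diversity
pattern_c = [0.05, 0.95, 0.05, 0.95, 0.5]  # split: lower entropy, high diversity
for name, pat in [("A", pattern_a), ("C", pattern_c)]:
    print(name, round(entropy(pat), 2), round(diversity(pat), 2))
```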

3.2 Seeking worthy candidates

In active learning, the key is to develop a criterion for determining the "worthiness" of a candidate for annotation. Our criterion is based on a simple yet powerful observation: all patches augmented from the same candidate (Fig. 3) share the same label; they are expected to have similar predictions by the current CNN. As a result, their entropy and diversity provide a useful indicator to the "power" of a candidate in elevating the performance of the current CNN. Intuitively, entropy captures classification certainty: a higher uncertainty value denotes a higher degree of information (e.g., Pattern A in Tab. 1); while diversity indicates prediction consistency among the patches of a candidate: a higher diversity value denotes a higher degree of prediction inconsistency (e.g., Pattern C in Tab. 1). Formally, assuming that each candidate $\mathcal{C}_i$ takes one of $|\mathcal{Y}|$ possible labels, we define the entropy of $\mathcal{C}_i$ as

$e_i = -\sum_{j=1}^{m} \sum_{k=1}^{|\mathcal{Y}|} p_i^{j,k} \log p_i^{j,k}$,   (1)

and the diversity of $\mathcal{C}_i$ as

$d_i = \sum_{j=1}^{m} \sum_{l=j+1}^{m} \sum_{k=1}^{|\mathcal{Y}|} \left(p_i^{j,k} - p_i^{l,k}\right) \log \frac{p_i^{j,k}}{p_i^{l,k}}$.   (2)

Combining entropy and diversity yields the selection score

$R_i = \lambda_1 e_i + \lambda_2 d_i$,   (3)

where $\lambda_1$ and $\lambda_2$ are trade-offs between entropy and diversity. We use two parameters for convenience, so as to easily turn entropy or diversity on or off during experiments.
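The following is a direct transcription of Eqs. 1–3 for one candidate; the clipping constant and the array layout are our assumptions:

```python
import numpy as np

def candidate_score(P, lam1=1.0, lam2=1.0, eps=1e-12):
    """Eqs. 1-3: P is an (m, K) array holding the current CNN's softmax
    output for each of a candidate's m patches over K classes."""
    P = np.clip(np.asarray(P, dtype=float), eps, 1.0)
    e = float(-(P * np.log(P)).sum())                 # entropy, Eq. 1
    d = 0.0                                           # diversity, Eq. 2
    m = P.shape[0]
    for j in range(m):
        for l in range(j + 1, m):
            d += float(((P[j] - P[l]) * np.log(P[j] / P[l])).sum())
    return lam1 * e + lam2 * d                        # combined score, Eq. 3
```

Setting lam1 or lam2 to zero recovers the pure entropy or pure diversity criteria compared in Sec. 4.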

3.3 Handling noisy labels via majority selection

Automatic data augmentation is essential to boost a CNN's performance, but it inevitably generates "hard" samples for some candidates, as shown in Fig. 4 (c), injecting noisy labels; therefore, to significantly enhance the robustness of our method, we compute entropy and diversity on only a portion of the patches of each candidate, selected according to the predictions of the current CNN.

Specifically, for each candidate $\mathcal{C}_i$ we first determine its dominant category $\hat{k}_i$, defined as the category with the highest confidence in the mean prediction, that is,

$\hat{k}_i = \arg\max_k \frac{1}{m} \sum_{j=1}^{m} p_i^{j,k}$,   (4)

where $p_i^{j,k}$ is the output of the current CNN on label $k$ given patch $x_i^j$. After sorting the patches of $\mathcal{C}_i$ according to their predictions on the dominant category $\hat{k}_i$, we apply Eq. 3 to the top α·100% of the patches to compute the selection score $R_i$ for each candidate in $\mathcal{U}$. Our proposed majority selection method automatically excludes the patches with noisy labels (see Tab. 1: entropy^α and diversity^α) because of their low confidences.
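A sketch of majority selection (Eq. 4 plus the top-α cut); it is meant to be composed with candidate_score above, and the names are ours:

```python
import numpy as np

def majority_select(P, alpha=0.25):
    """Sec. 3.3 sketch: keep the top ceil(alpha * m) patches ranked by
    their prediction on the candidate's dominant category (Eq. 4)."""
    k_hat = int(np.argmax(P.mean(axis=0)))     # dominant category, Eq. 4
    order = np.argsort(-P[:, k_hat])           # most confident patches first
    keep = max(1, int(np.ceil(alpha * P.shape[0])))
    return P[order[:keep]]

# Score a candidate on its most confidently predicted patches only:
# R_i = candidate_score(majority_select(P_i))
```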

We should note that the idea of combining entropy and diversity was inspired by [38], but there is a fundamental difference: they computed the selection criteria across the whole unlabeled dataset with time complexity $O(n^2)$ in the number of unlabeled samples, which is computationally expensive, while we compute them locally on the selected patches within each candidate, with time complexity $O(m^2)$, where $m$, the number of patches per candidate, is small in our experiments, saving computation time considerably. As a matter of fact, it is computationally infeasible to apply the method of [38] in our three real-world applications because (a) their selection criteria involve all unlabeled samples (patches); for instance, we have 391,200 training patches for polyp detection (see Sec. 4.2), and computing their pairwise criteria would demand roughly 1.1 TB of memory ($391{,}200^2$ entries at 8 bytes each); and (b) their algorithm for batch selection is based on the truncated power method [39], which was unable to find a solution even for our smallest application (colonoscopy frame classification with 42,000 training patches in Sec. 4.1). Furthermore, as a conventional method, it lacks the advantages listed in Sec. 1. Specifically, since they used SVM [40] as the base classifier, their method (a) cannot effectively start with a completely empty labeled dataset and (b) cannot incrementally improve the learner through continuous fine-tuning; and since they have no concept of candidate, it (c) cannot exploit the expected consistency among the patches within each candidate for active selection and (d) cannot automatically handle noisy labels via majority selection.

3.4 Injecting randomization in active selection

As discussed in [41], simple random selection may outperform active selection at the beginning, because active selection depends on the current model to select examples for labeling; as a result, a poor selection made at an early stage may adversely affect the quality of subsequent selections, while random selection is less frequently locked into a poor hypothesis. In other words, active selection concentrates on exploiting the knowledge gained from the labels already acquired to further explore the decision boundary, while random selection concentrates on exploration and is thereby able to locate areas of the feature space where the classifier performs poorly. Therefore, an effective active learning strategy must strike a balance between exploration and exploitation. To this end, we inject randomization into our method by selecting candidate $\mathcal{C}_i$ actively according to its sampling probability $p_i$:

$R_1 \ge R_2 \ge \cdots \ge R_{\omega b}$,   (5)

$p_i = R_i \Big/ \sum_{j=1}^{\omega b} R_j$,   (6)

where the scores $R_i$ (Eq. 3) are sorted in descending order and $\omega$ is named the random extension. Suppose $b$ candidates are required for annotation; instead of selecting the top $b$ candidates, we extend the candidate selection pool to the top $\omega b$ candidates and then select $b$ candidates from this pool according to their sampling probabilities $p_i$, thereby injecting randomization.
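A sketch of the randomized selection of Eqs. 5–6, assuming non-negative candidate scores (e.g., from candidate_score after majority selection):

```python
import numpy as np

def select_candidates(scores, b, omega=5, seed=0):
    """Sec. 3.4 sketch: sample b candidates from the top omega*b candidates,
    with probability proportional to their scores R_i (Eqs. 5-6)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)   # assumed non-negative
    pool = np.argsort(-scores)[: omega * b]    # extended pool, descending scores
    p = scores[pool] / scores[pool].sum()      # sampling probabilities, Eq. 6
    return rng.choice(pool, size=b, replace=False, p=p)
```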

3.5 Comparing various learning strategies

From the discussion above, several active learning strategies can be derived, as summarized in Tab. II. Our comprehensive comparisons (reported in Sec. 4.4) show that (1) AFT¹ is unstable; (2) AFT² needs careful parameter adjustment; and (3) AFT is the most reliable in comparison with AFT¹ and AFT², but it requires fine-tuning the original model from the beginning using all presently-available labeled samples $\mathcal{L} \cup \mathcal{Q}$ in each step. To overcome this drawback, we developed an optimized version, AFT*, which continuously fine-tunes the current model using the newly annotated candidates enlarged by the misclassified candidates, that is, $\mathcal{Q} \cup \mathcal{H}$. Several researchers have demonstrated that fine-tuning offers better performance and is more robust than training from scratch. Moreover, our experiments show that AFT* saves training time by converging faster than repeatedly fine-tuning the original pre-trained CNN, and boosts performance by eliminating easy samples, focusing on hard samples, and preventing catastrophic forgetting.

Fig. 2 (a) compares AFT* with RFT using the Places database. For RFT, six different sequences were generated via systematic random sampling, and the final curve plots the average performance over the six runs. As shown in Fig. 2 (a), AFT*, with only 2,906 candidate queries, can achieve the performance of RFT with 4,452 candidate queries in terms of AUC (area under the curve), while with only 1,176 candidate queries it can achieve the performance of full training with all 42,000 candidates. Thereby, 34.7% of the labeling cost can be saved relative to RFT, and 97.2% relative to full training. When nearly 100% of the training data are used, the performance still keeps increasing; therefore, considering the 22-layer GoogLeNet architecture, the dataset is still not large enough. AFT* is a general algorithm, useful not only for biomedical datasets but also for other datasets, and it works for multi-class problems.

Selection | Sample | Model | Terminology
Active | $\mathcal{Q}$ | current (continuously fine-tuned) | AFT¹
Active | $\mathcal{L} \cup \mathcal{Q}$ | current (continuously fine-tuned) | AFT²
Active | $\mathcal{L} \cup \mathcal{Q}$ | original (pre-trained) $M_0$ | AFT
Active | $\mathcal{Q} \cup \mathcal{H}$ | current (continuously fine-tuned) | AFT*
Random | $\mathcal{L} \cup \mathcal{Q}$ | original (pre-trained) $M_0$ | RFT

  • Active: diversity, diversity^α, diversity^ω, diversity^{α,ω}, entropy, entropy^α, entropy^ω, entropy^{α,ω}, etc.

  • $\mathcal{L}$: labeled candidates.

  • $\mathcal{Q}$: newly selected candidates.

  • $\mathcal{H}$: misclassified candidates.

  • $M_0$: pre-trained LeNet [42], AlexNet [43], GoogLeNet [44], VGG [45], ResNet [46], DenseNet [47], etc.

TABLE II: Active learning strategies.

4 Applications

We apply our methods AFT and AFT* to three applications: colonoscopy frame classification, polyp detection, and pulmonary embolism (PE) detection. In terms of active selection criteria, eight strategies are compared in the experiments, i.e., diversity (Eq. 2), diversity^α (Sec. 3.3), diversity^ω (Sec. 3.4), diversity^{α,ω} (Secs. 3.3 and 3.4), entropy (Eq. 1), entropy^α, entropy^ω, and entropy^{α,ω}. For all applications, we set α to 1/4 and ω to 5. Tajbakhsh et al. [16] reported the state-of-the-art performance of fine-tuning and learning from scratch based on the whole datasets, which is used as the baseline performance for comparison. They also investigated the performance of (partial) fine-tuning with a sequence of partial training datasets, but their partition of the datasets was different; therefore, for a fair comparison with their idea, we introduce RFT, which fine-tunes the original model from the beginning using all presently-available labeled samples $\mathcal{L} \cup \mathcal{Q}$, where $\mathcal{Q}$ is randomly selected, in each step. In all three applications, our AFT* begins with an empty training dataset and directly uses models pre-trained on ImageNet (AlexNet and GoogLeNet are adopted).

4.1 Colonoscopy Frame Classification

Image quality assessment can play a major role in the objective quality assessment of colonoscopy procedures. Typically, a colonoscopy video contains a large number of non-informative images with poor colon visualization that are not suitable for inspecting the colon or performing therapeutic actions. The larger the fraction of non-informative images in a video, the lower the quality of colon visualization, and thus the lower the quality of colonoscopy. Therefore, one way to measure the quality of a colonoscopy procedure is to monitor the quality of the captured images. Such quality assessment can be used during live procedures to limit low-quality examinations or in a post-processing setting for quality monitoring purposes. Technically, image quality assessment in colonoscopy can be viewed as an image classification task whereby an input image is labeled as either informative or non-informative. Fig. 4 shows examples of informative and non-informative colonoscopy frames.

Fig. 4: Three examples of colonoscopy frames: (a) informative, (b) non-informative, and (c) ambiguous but labeled as "informative" because experts label frames based on overall quality: if over 75% of a frame (i.e., a candidate in this application) is clear, it is considered "informative". As a result, an ambiguous candidate contains both clear and blurry parts, and automatic data augmentation generates noisy labels at the patch level. For example, the whole frame (c) is labeled as "informative", but not all of the patches (d) associated with this frame are "informative", although they inherit the "informative" label. This is the main motivation for the majority selection in our AFT* method.

In this application, frames are regarded as candidates, as the labels (informative or non-informative) are associated with frames, as illustrated in Fig. 4 (a) and (b). In total, there are 4,000 colonoscopy candidates from 6 complete colonoscopy videos. A trained expert manually labeled the collected images as informative or non-informative (line 11 in Alg. 1), and a gastroenterologist further reviewed the labeled images for corrections. The labeled frames were separated at the video level into training and test sets, each containing approximately 2,000 colonoscopy frames. For data augmentation, we extracted 21 patches from each frame.

Fig. 5: Comparison of eight active selection strategies for three medical applications, including (a) colonoscopy frame classification, (b) polyp detection, and (c) pulmonary embolism detection, on the testing dataset using AlexNet. The solid black line is for the current state-of-the-art performance of fine-tuning using full training data and the dashed black line is for the performance of training from scratch using full training data.

Fig. 5 (a) shows that AFT*, with only around 120 candidate queries (6.0%), can achieve the performance of fine-tuning AlexNet with 100% of the training dataset (solid black line; AUC = .9366); with only 80 candidate queries (4.0%), it can achieve the performance of learning from scratch with 100% of the training dataset (dashed black line; AUC = .9204); and with 80 candidate queries it can achieve the performance of RFT with 320 candidate queries. Thereby, around 75% of the labeling cost can be saved relative to RFT in colonoscopy frame classification. At the early steps, RFT yields better performance than most active selection strategies because (1) random selection produces samples with a positive-negative ratio compatible with the testing and validation datasets, and (2) the pre-trained model gives poor predictions in the biomedical image domain, as it was trained on natural images; its output probabilities are mostly confused or even opposite, yielding poor selection scores. However, with the randomness injected as described in Sec. 3.4, diversity^{α,ω} and entropy^{α,ω} perform better even at the early stages and improve quickly over the steps. In addition, AFT* performs promisingly and efficiently compared to AFT, which always uses the whole labeled dataset and fine-tunes from the beginning (see comparisons in Tab. III).

4.2 Polyp Detection

Colonoscopy is the preferred technique for colon cancer screening and prevention. The goal of colonoscopy is to find and remove colonic polyps, the precursors to colon cancer. Polyps, as shown in Fig. 6, can appear with substantial variations in color, shape, and size. The challenging appearance of polyps can often lead to misdetection, particularly during long and back-to-back colonoscopy procedures, where fatigue negatively affects the performance of colonoscopists. Polyp miss rates are estimated to be about 4% to 12% [48, 49, 50, 51]; however, a more recent clinical study [52] suggests that this misdetection rate may be as high as 25%. Missed polyps can lead to a late diagnosis of colon cancer, with an associated decreased survival rate of less than 10% for metastatic colon cancer [53]. Computer-aided polyp detection may enhance optical colonoscopy screening by reducing polyp misdetection.

The dataset contains 38 patients with one video each. The training dataset is composed of 21 videos (11 with polyps and 10 without polyps), while the testing dataset is composed of 17 videos (8 with polyps and 9 without polyps). In this application, each polyp candidate produced by the candidate generator is regarded as a candidate. At the video level, the candidates were divided into a training dataset (16,300 candidates) and a testing dataset (11,950 candidates). Each candidate contains 24 patches.

Fig. 6: Polyps in colonoscopy videos with different shape and appearance.

At each polyp candidate location with the given bounding box, we performed data augmentation by a factor $f$ across multiple scales. At each scale, we extracted patches after the candidate was translated by 10 percent of the resized bounding box in the vertical and horizontal directions. We further rotated each resulting patch 8 times by mirroring and flipping. All patches generated by data augmentation belong to the same candidate.
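A rough sketch of such a candidate-level augmentation; the scale factors are assumptions (the exact factor did not survive extraction), chosen so that 3 scales times 8 rotations/mirrorings matches the 24 patches per candidate reported above:

```python
import numpy as np

def augment_candidate(image, box, scales=(1.0, 1.2, 1.5), shift=0.1):
    """Hypothetical sketch: crop the candidate at several scales, translated
    by 10% of the resized bounding box, then generate the 8 rotation/mirror
    variants of each crop, giving 3 x 8 = 24 patches per candidate."""
    x, y, w, h = box
    patches = []
    for s in scales:
        sw, sh = int(w * s), int(h * s)
        cx = int(x + shift * sw)               # 10% horizontal translation
        cy = int(y + shift * sh)               # 10% vertical translation
        crop = image[max(cy, 0):cy + sh, max(cx, 0):cx + sw]
        for k in range(4):                     # 4 rotations ...
            r = np.rot90(crop, k)
            patches.extend([r, np.fliplr(r)])  # ... times mirroring = 8
    return patches
```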

Fig. 5 (b) shows that AFT*, with only around 320 candidate queries (2.04%), can achieve the performance of fine-tuning AlexNet with 100% of the training dataset (solid black line; AUC = .9615); with only 10 candidate queries (0.06%), it can achieve the performance of learning from scratch with 100% of the training dataset (dashed black line; AUC = .9358); and with 10 candidate queries it can achieve the performance of RFT with 80 candidate queries. Thereby, nearly 87% of the labeling cost can be saved relative to RFT in polyp detection. The fast convergence and outstanding performance of AFT* are attributed to the majority selection and randomization methods, which efficiently select informative and representative candidates while excluding those with noisy labels, and which boost the initial performance at the early stages. Plain diversity, which strongly favors candidates whose prediction pattern resembles Pattern C (see Tab. 1), performs even worse than RFT because of the noisy labels generated through data augmentation.

4.3 Pulmonary Embolism Detection

Pulmonary embolism (PE) is a major national health problem, and computer-aided PE detection can improve the diagnostic capabilities of radiologists. A PE is a blood clot that travels from a lower-extremity source to the lung, where it causes blockage of the pulmonary arteries. The mortality rate of untreated PE may approach 30% [54], but it decreases to as low as 2% with early diagnosis and appropriate treatment [55]. CT pulmonary angiography (CTPA) is the primary means of PE diagnosis, wherein a radiologist carefully traces each branch of the pulmonary artery for any suspected PEs. CTPA interpretation is a time-consuming task whose accuracy depends on human factors, such as attention span and sensitivity to the visual characteristics of PEs. CAD can play a major role in improving PE diagnosis and decreasing the reading time of CTPA datasets.

We use a database consisting of 121 CTPA datasets with a total of 326 PEs. Each PE candidate is regarded as a candidate with 50 patches. We divided the candidates at the patient level into a training dataset with 434 true positives (199 unique PEs) and 3,406 false positives, and a testing dataset with 253 true positives (127 unique PEs) and 2,162 false positives. The overall PE probability is calculated by averaging the probabilistic predictions generated for the patches within a PE candidate after data augmentation.

Fig. 7: Five different PEs in the standard 3-channel representation, as well as in the 2-channel representation [56], which was adopted in this work because it achieves greater classification accuracy and accelerates CNN training convergence. The figure is used with permission.

Fig. 5 (c) shows that AFT*, with 2,560 candidate queries (66.68%), can almost achieve the performance of both fine-tuning AlexNet and learning from scratch with 100% of the training dataset (solid and dashed black lines; AUC = .8763 and AUC = .8706, respectively), and with 1,280 candidate queries it can achieve the performance of RFT with 2,560 candidate queries. Based on this analysis, the cost of annotation can be cut by at least half by our method relative to RFT in pulmonary embolism detection.

4.4 Comparison of all the methods

Application | Method | diversity | diversity^α | diversity^ω | diversity^{α,ω} | entropy | entropy^α | entropy^ω | entropy^{α,ω}

Colonoscopy frame classification:
  AFT¹ | .8375 | .8773 | .8995 | .9160 | .8444 | .8227 | .9136 | .9061
  AFT² | .8501 | .8956 | .9083 | .9262 | .9149 | .9051 | .9033 | .9223
  AFT  | .9183 | .9253 | .9299 | .9344 | .9219 | .9180 | .9268 | .9291
  AFT* | .9048 | .9236 | .9241 | .9179 | .9198 | .9266 | .9257 | .9293

Polyp detection:
  AFT¹ | .8669 | .9023 | .8984 | .9168 | .8834 | .8656 | .9034 | .9271
  AFT² | .9195 | .9142 | .9497 | .9488 | .9204 | .9255 | .9475 | .9444
  AFT  | .9242 | .9285 | .9353 | .9355 | .9292 | .9238 | .9367 | .9522
  AFT* | .9013 | .9370 | .9116 | .9363 | .9321 | .9436 | .9196 | .9443

Pulmonary embolism detection:
  AFT¹ | .7828 | .7911 | .7690 | .7977 | .7855 | .7736 | .7296 | .7833
  AFT² | .8083 | .8176 | .7975 | .8263 | .8032 | .8086 | .8022 | .8245
  AFT  | .7650 | .7973 | .7978 | .8040 | .7917 | .7878 | .7964 | .8222
  AFT* | .8272 | .7876 | .8047 | .8245 | .8218 | .7995 | .8155 | .8205

TABLE III: Comparison of all the methods (ALC). The baseline performance is that of RFT using AlexNet: ALC = .8991 for colonoscopy frame classification, ALC = .9379 for polyp detection, and ALC = .7874 for pulmonary embolism detection. Bold values indicate the outstanding learning strategies (see Tab. II) for a given active selection criterion; red values represent the best performance taking both learning strategies and active selection criteria into consideration.
Fig. 8: Top 10 candidates selected by the four AFT methods at Step 3 in colonoscopy frame classification. Positive candidates are in red and negative candidates are in blue. Diversity^α prefers Pattern B while diversity suggests Pattern C. Both entropy and entropy^α favor Pattern A because of its higher degree of uncertainty. However, in this case at Step 3, with entropy^α, there are no more candidates with Pattern A; therefore, candidates with Pattern B are selected.

As summarized in Tab. II, several active learning strategies can be derived. The prediction performance was evaluated according to the ALC (Area under the Learning Curve); a learning curve plots the AUC computed on the testing dataset as a function of the number of labels queried [57] (a minimal ALC computation is sketched after the list below). Tab. III shows the ALC of AFT¹, AFT², AFT, and AFT* compared to RFT. Our comprehensive experiments have demonstrated that:

  1. Active fine-tuning methods using entropy and diversity with majority and randomization selection achieve better performance than random selection.

  2. Using only the newly selected candidates to fine-tune the current model (strategy AFT¹) is unstable, because previously learned samples may be forgotten when the classifier is trained only on the newly selected samples at each step, leading to a lower ALC.

  3. AFT² requires careful parameter adjustment. Although its performance is acceptable, it requires the same computing time as AFT, meaning there is no advantage to continuously fine-tuning the current model in this manner.

  4. AFT maintains the most reliable performance in comparison with AFT¹ and AFT².

  5. The optimized version AFT* achieves performance comparable to AFT and occasionally outperforms it by eliminating easy samples, focusing on hard samples, and preventing catastrophic forgetting.

  6. Active selection with plain diversity performs even worse than RFT because it favors Pattern C, which is most likely associated with noisy labels injected through data augmentation (see Fig. 4 (c), (d)).

  7. Majority selection (described in Sec. 3.3) can automatically handle noisy labels and overcome the drawback of diversity, especially for colonoscopy frame classification with its severe label noise, as shown in Fig. 4 (c), (d). Majority selection also involves fewer patches within each annotation unit and thus saves computation time.
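As a side note on the metric, a minimal ALC computation might look as follows; the trapezoidal rule and the normalization by the x-axis range are our assumptions, and the exact convention follows the challenge protocol [57]:

```python
import numpy as np

def alc(num_queries, aucs):
    """Area under the Learning Curve: AUC on the test set as a function
    of the number of label queries, normalized by the x-axis range."""
    x = np.asarray(num_queries, dtype=float)
    y = np.asarray(aucs, dtype=float)
    area = (0.5 * (y[1:] + y[:-1]) * np.diff(x)).sum()  # trapezoidal rule
    return float(area / (x[-1] - x[0]))

# e.g., alc([100, 200, 400, 800], [0.80, 0.88, 0.92, 0.94]) -> ~0.91
```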

4.5 Observations on selected patterns

We meticulously monitored the active selection process and examined the selected candidates; as an example, Fig. 8 shows the top 10 candidates selected by the four AFT methods at Step 3 in colonoscopy frame classification. From this process, we have observed the following:

  • Patterns A and B are dominant in the earlier stages of AFT* as the CNN has not yet been fine-tuned properly on the target domain.

  • Patterns C, D, and E are dominant in the later stages of AFT* as the CNN has been largely fine-tuned on the target dataset.

  • The majority selection, i.e., entropy^α and diversity^α, is effective in excluding Patterns C, D, and E, while entropy (without the majority selection) can handle Patterns C, D, and E reasonably well.

  • Patterns B, F, and G generally make good contributions to elevating the current CNN's performance.

  • Entropy and entropy^α favor Pattern A because of its higher degree of uncertainty, as shown in Fig. 8.

  • Diversity^α prefers Pattern B while diversity prefers Pattern C (Fig. 8). This is why diversity may cause sudden disturbances in the CNN's performance and why diversity^α should be preferred in general.

In addition, to create a visual impression of what newly selected images look like, we show the top and bottom five images selected by four active selection strategies (i.e., diversity, diversity^α, entropy, and entropy^α) from Places at Step 11 in Fig. 11, along with their associated predictions by the current CNN in Tab. V. Such a gallery offers an intuitive way to analyze the most and least favored images and has helped us develop the different active selection strategies.

4.6 Automatically balancing positive-negative ratios

In real-world applications, datasets are usually unbalanced. To achieve good classification performance, it is better to balance the training dataset across classes. For random selection, the ratio is roughly the same as that of the whole training dataset. We observe that our active learning methods, AFT and AFT*, are capable of automatically balancing the selected training dataset. Our datasets are unbalanced, with more negatives and fewer positives: as shown in Fig. 9, for colonoscopy frame classification, the ratio between positives and negatives is around 3:7; for polyp detection and pulmonary embolism detection, the ratios are around 1:9. Monitoring the active selection process shows that AFT and AFT* select at least twice as many positives as random selection. We believe that this is one of the reasons that AFT and AFT* improve performance so quickly.

Fig. 9: Positive-negative ratio in the candidates selected by AFT, AFT*, and RFT. Please note that the ratio for RFT approximately reflects the ratio of the whole dataset.

4.7 Generalizability of AFT in CNN architectures

Fig. 10: Comparing AFT and AFT* on GoogLeNet for our three applications, including (a) colonoscopy frame classification, (b) polyp detection, and (c) pulmonary embolism detection, demonstrates patterns consistent with AlexNet (see Fig. 5).
Applications | μ | α₁ | α₂ | γ | epochs
Colonoscopy Frame Classification | 0.9 | 0.0001 | 0.001 | 0.95 | 8
Polyp Detection | 0.9 | 0.0001 | 0.001 | 0.95 | 10
Pulmonary Embolism Detection | 0.9 | 0.001 | 0.01 | 0.95 | 5
TABLE IV: Learning parameters used for the training and fine-tuning of AlexNet for AFT in our experiments. μ is the momentum, α₁ is the learning rate of the weights in the last layer, α₂ is the learning rate of the weights in the remaining layers, and γ determines how the learning rates decrease over epochs. "epochs" indicates the number of epochs used in each step. For AFT*, all parameters are set the same as for AFT except the learning rates, which are set to 1/10 of those for AFT.

We based our experiments on AlexNet because its architecture offers a nice balance in depth: it is deep enough that we can investigate the impact of AFT and AFT* on the performance of pre-trained CNNs, and shallow enough that we can conduct experiments quickly. The learning parameters used for the training and fine-tuning of AlexNet in our experiments are summarized in Tab. IV. Alternatively, deeper architectures, such as GoogLeNet [44], ResNet [46], and DenseNet [47], could have been used and have shown relatively high performance on challenging computer vision tasks. However, the purpose of this work is not to achieve the highest performance on different biomedical imaging tasks but to answer a critical question: how to dramatically reduce the cost of annotation when applying CNNs in biomedical imaging. For this purpose, AlexNet is a reasonable architectural choice. Nevertheless, we have run our three applications on GoogLeNet as well, demonstrating consistent patterns between AlexNet and GoogLeNet, as shown in Figs. 2, 5, and 10. Given this generalizability, we could focus on comparing the prediction patterns (summarized in Tab. 1) and learning strategies (summarized in Tab. II) instead of running experiments on various deep neural network architectures.

5 Conclusion

In this paper, we have developed a novel method for dramatically cutting annotation cost by integrating active learning and transfer learning. In comparison with the state-of-the-art method [16] (random selection), our method can cut the annotation cost by at least half in three biomedical applications, and by 33% on a natural image database, relative to random selection. This performance is attributed to the eight advantages of our method (detailed in Sec. 1).

We choose to select, classify and label samples at the candidate level. Labeling at the patient level would certainly reduce the cost of annotation more but introduce more severe label noise; labeling at the patch level would cope with the label noise but impose a much heavier burden on experts for annotation. We believe that labeling at the candidate level offers a sensible balance in our three applications.

Fig. 11: Gallery of the top five and bottom five candidates actively selected at Step 11 by the methods proposed in Sec. 3.2 and Sec. 3.3 under the Places experimental setting.
TABLE V: Detailed predictions of the top five and bottom five candidates at Step 11: the three class probabilities predicted by the current CNN, where bold columns represent the dominant predictions and light blue numbers are those used to compute the different active selection criteria.

Acknowledgments

This research has been supported partially by the NIH under Award Number R01HL128785 and by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

References

  • [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [3] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [4] H. Greenspan, B. van Ginneken, and R. M. Summers, “Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1153–1159, 2016.
  • [5] S. K. Zhou, H. Greenspan, and D. Shen, Deep Learning for Medical Image Analysis.   Academic Press, 2016.
  • [6] L. Lu, Y. Zheng, G. Carneiro, and L. Yang, Deep learning and convolutional neural networks for medical image computing: precision medicine, high performance and large-scale datasets.   Springer, 2017.
  • [7] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [8] M. Kukar, “Transductive reliability estimation for medical diagnosis,” Artificial Intelligence in Medicine, vol. 29, no. 1, pp. 81–106, 2003.
  • [9] H. Chen, Q. Dou, D. Ni, J.-Z. Cheng, J. Qin, S. Li, and P.-A. Heng, "Automatic fetal ultrasound standard plane detection using knowledge transferred recurrent neural networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2015, pp. 507–514.
  • [10] T. Schlegl, J. Ofner, and G. Langs, “Unsupervised pre-training across image domains improves lung tissue classification,” in Medical Computer Vision: Algorithms for Big Data.   Springer, 2014, pp. 82–93.
  • [11] H. Chen, D. Ni, J. Qin, S. Li, X. Yang, T. Wang, and P. A. Heng, “Standard plane localization in fetal ultrasound via domain transferred deep neural networks,” Biomedical and Health Informatics, IEEE Journal of, vol. 19, no. 5, pp. 1627–1636, Sept 2015.
  • [12] G. Carneiro, J. Nascimento, and A. Bradley, “Unregistered multiview mammogram analysis with pre-trained deep learning models,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, ser. Lecture Notes in Computer Science, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds.   Springer International Publishing, 2015, vol. 9351, pp. 652–660. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-24574-4_78
  • [13] H.-C. Shin, L. Lu, L. Kim, A. Seff, J. Yao, and R. M. Summers, “Interleaved text/image deep mining on a very large-scale radiology database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1090–1099.
  • [14] M. Gao, U. Bagci, L. Lu, A. Wu, M. Buty, H.-C. Shin, H. Roth, G. Z. Papadakis, A. Depeursinge, R. M. Summers et al., “Holistic classification of ct attenuation patterns for interstitial lung diseases via deep convolutional neural networks,” in the 1st Workshop on Deep Learning in Medical Image Analysis, International Conference on Medical Image Computing and Computer Assisted Intervention, at MICCAI-DLMIA’15, 2015.
  • [15] J. Margeta, A. Criminisi, R. Cabrera Lozoya, D. C. Lee, and N. Ayache, “Fine-tuned convolutional neural nets for cardiac mri acquisition plane recognition,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, pp. 1–11, 2015.
  • [16] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: Full training or fine tuning?” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
  • [17] H. Wang, Z. Zhou, Y. Li, Z. Chen, P. Lu, W. Wang, W. Liu, and L. Yu, "Comparison of machine learning methods for classifying mediastinal lymph node metastasis of non-small cell lung cancer from 18F-FDG PET/CT images," EJNMMI Research, vol. 7, no. 1, p. 11, 2017.
  • [18] B. Settles, "Active learning literature survey," Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
  • [19] I. Guyon, G. Cawley, G. Dror, V. Lemaire, and A. Statnikov, JMLR Workshop and Conference Proceedings (Volume 16): Active Learning Challenge.   Microtome Publishing, 2011.
  • [20] K. Konyushkova, R. Sznitman, and P. Fua, “Learning active learning from data,” in Advances in Neural Information Processing Systems, 2017, pp. 4226–4236.
  • [21] A. Mosinska, J. Tarnawski, and P. Fua, “Active learning and proofreading for delineation of curvilinear structures,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2017, pp. 165–173.
  • [22] S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” in Advances in neural information processing systems, 2010, pp. 892–900.
  • [23] M. Li and I. K. Sethi, “Confidence-based active learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1251–1261, 2006.
  • [24] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” arXiv preprint arXiv:1702.05747, 2017.
  • [25] J. Zhang, W. Li, and P. Ogunbona, “Transfer learning for cross-dataset recognition: A survey.”
  • [26] R. Chattopadhyay, W. Fan, I. Davidson, S. Panchanathan, and J. Ye, “Joint transfer and batch-mode active learning,” in International Conference on Machine Learning, 2013, pp. 253–261.
  • [27] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
  • [28] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 3, no. 1, p. 9, 2016.
  • [29] D. Wang and Y. Shang, “A new active labeling method for deep learning,” in 2014 International Joint Conference on Neural Networks (IJCNN), July 2014, pp. 112–119.
  • [30] J. Li, “Active learning for hyperspectral image classification with a stacked autoencoders based neural network,” in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 1062–1065.
  • [31] F. Stark, C. Hazırbas, R. Triebel, and D. Cremers, “Captcha recognition with active deep learning,” in Workshop New Challenges in Neural Computation 2015.   Citeseer, 2015, p. 94.
  • [32] M. Al Rahhal, Y. Bazi, H. AlHichri, N. Alajlan, F. Melgani, and R. Yager, “Deep learning approach for active classification of electrocardiogram signals,” Information Sciences, vol. 345, pp. 340–354, 2016.
  • [33] L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen, “Suggestive annotation: A deep active learning framework for biomedical image segmentation,” arXiv preprint arXiv:1706.04737, 2017.
  • [34] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 640–651, 2017.
  • [35] B. Efron and R. J. Tibshirani, An introduction to the bootstrap.   CRC press, 1994.
  • [36] U. Feige, “A threshold of ln n for approximating set cover,” J. ACM, vol. 45, no. 4, pp. 634–652, Jul. 1998. [Online]. Available: http://doi.acm.org/10.1145/285055.285059
  • [37] Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang, “Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally,” in IEEE conference on computer vision and pattern recognition (CVPR), Hawaii, USA, 2017, pp. 7340–7351.
  • [38] S. Chakraborty, V. Balasubramanian, Q. Sun, S. Panchanathan, and J. Ye, “Active batch selection via convex relaxations with guaranteed solution bounds,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 10, pp. 1945–1958, 2015.
  • [39] X.-T. Yuan and T. Zhang, "Truncated power method for sparse eigenvalue problems," Journal of Machine Learning Research, vol. 14, pp. 899–925, 2013.
  • [40] S. R. Gunn et al., "Support vector machines for classification and regression," ISIS Technical Report, vol. 14, pp. 85–86, 1998.
  • [41] A. Borisov, E. Tuv, and G. Runger, “Active batch learning with stochastic query by forest,” in JMLR: Workshop and Conference Proceedings (2010).   Citeseer, 2010.
  • [42] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [43] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [45] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [46] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [47] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” arXiv preprint arXiv:1608.06993, 2016.
  • [48] A. Pabby, R. E. Schoen, J. L. Weissfeld, R. Burt, J. W. Kikendall, P. Lance, M. Shike, E. Lanza, and A. Schatzkin, “Analysis of colorectal cancer occurrence during surveillance colonoscopy in the dietary polyp prevention trial,” Gastrointest Endosc, vol. 61, no. 3, pp. 385–91, 2005.
  • [49] J. van Rijn, J. Reitsma, J. Stoker, P. Bossuyt, S. van Deventer, and E. Dekker, “Polyp miss rate determined by tandem colonoscopy: a systematic review,” American Journal of Gastroenterology, vol. 101, no. 2, pp. 343–350, 2006.
  • [50] D. H. Kim, P. J. Pickhardt, A. J. Taylor, W. K. Leung, T. C. Winter, J. L. Hinshaw, D. V. Gopal, M. Reichelderfer, R. H. Hsu, and P. R. Pfau, “Ct colonography versus colonoscopy for the detection of advanced neoplasia,” N Engl J Med, vol. 357, no. 14, pp. 1403–12, 2007.
  • [51] D. Heresbach, T. Barrioz, M. G. Lapalus, D. Coumaros et al., "Miss rate for colorectal neoplastic polyps: a prospective multicenter study of back-to-back video colonoscopies," Endoscopy, vol. 40, no. 4, pp. 284–290, 2008.
  • [52] A. Leufkens, M. van Oijen, F. Vleggaar, and P. Siersema, “Factors influencing the miss rate of polyps in a back-to-back colonoscopy study,” Endoscopy, vol. 44, no. 05, pp. 470–475, 2012.
  • [53] L. Rabeneck, H. El-Serag, J. Davila, and R. Sandler, “Outcomes of colorectal cancer in the united states: no change in survival (1986-1997).” The American journal of gastroenterology, vol. 98, no. 2, p. 471, 2003.
  • [54] K. K. Calder, M. Herbert, and S. O. Henderson, “The mortality of untreated pulmonary embolism in emergency department patients.” Annals of emergency medicine, vol. 45, no. 3, pp. 302–310, 2005. [Online]. Available: http://dx.doi.org/10.1016/j.annemergmed.2004.10.001
  • [55] G. Sadigh, A. M. Kelly, and P. Cronin, “Challenges, controversies, and hot topics in pulmonary embolism imaging,” American Journal of Roentgenology, vol. 196, no. 3, 2011. [Online]. Available: http://dx.doi.org/10.2214/AJR.10.5830
  • [56] N. Tajbakhsh, M. B. Gotway, and J. Liang, “Computer-aided pulmonary embolism detection using a novel vessel-aligned multi-planar image representation and convolutional neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2015, pp. 62–69.
  • [57] I. Guyon, G. C. Cawley, G. Dror, and V. Lemaire, “Results of the active learning challenge.” Active Learning and Experimental Design@ AISTATS, vol. 16, pp. 19–45, 2011.