Joint Weakly and Semi-Supervised Deep Learning for Localization and Classification of Masses in Breast Ultrasound Images

10/10/2017 ∙ by Seung Yeon Shin, et al. ∙ Seoul National University, Hankuk University of Foreign Studies, Soonchunhyang University

We propose a framework for localization and classification of masses in breast ultrasound (BUS) images. In particular, we simultaneously use a weakly annotated dataset and a relatively small strongly annotated dataset to train a convolutional neural network detector. We have experimentally found that mass detectors trained with small, strongly annotated datasets are easily overfitted, whereas training them with large, weakly annotated datasets alone is a non-trivial problem. To overcome these problems, we jointly use datasets with different characteristics in a hybrid manner. Consequently, a sophisticated weakly and semi-supervised training scenario is introduced with appropriate training loss selection. Experimental results show that the proposed method successfully localizes and classifies masses while requiring less effort in annotation work. The influence of each component in the proposed framework is also validated by conducting an ablative analysis. Although the proposed method is intended for masses in BUS images, it can also be applied as a general framework to train computer-aided detection and diagnosis systems for a wide variety of image modalities, target organs, and diseases.




1 Introduction

Breast cancer is the most common type of cancer among women worldwide and approximately 1 in 8 (12%) women in the US will develop invasive breast cancer during their lifetime [1, 2]. Consequently, annual breast cancer screening for the early detection of this type of cancer is crucial in every country. Ultrasound imaging is one of the modalities used for the screening and diagnosis of breast cancer. It is a popular test due to its high availability, cost-effectiveness, harmlessness, and acceptable diagnostic performance. However, given that image acquisition and interpretation using traditional handheld ultrasound devices are concurrently conducted by clinicians, the results are likely to become more subjective than those of other imaging tests. Thus, efforts have been exerted to reduce subjectivity by standardizing ultrasound imaging procedures [3].

Breast ultrasound (BUS) imaging aims to detect and classify abnormalities, such as masses, as either benign or malignant, as shown in Fig. 1. Most conventional methods are based on a sequential framework that comprises image preprocessing, region segmentation, feature extraction, and classification [4, 5]. These approaches require careful tweaking of each component, particularly because the results of preceding processes will affect the results of subsequent processes. Conventional methods also frequently require user interaction in region segmentation, which is typically conducted manually or semi-automatically. Semi-automatic methods require user-provided seeds to operate. For example, the method presented in [6] assumes manually segmented masses and uses several manually designed geometric and echo features for their classification. Existing approaches for cancer detection and classification using BUS images are summarized in [7] based on the four aforementioned general steps used in ultrasound computer-aided diagnosis systems.

Other works aim to detect lesions fully automatically. A cascaded detector for breast lesions was proposed in [8]. Haar features within an AdaBoost learning framework were used to locate potential tumor locations. Then, a support vector machine combined with quantized intensity features was utilized for refinement. In [9], the authors detected lesions in BUS images by pruning background edges from lesion-specific edges. This process requires preceding initial edge detection and subsequent contour aggregation methods to generate final detections.

Several approaches based on deep learning have also been proposed recently [10, 11, 12, 13]. Although these methods may be clinically useful, they are limited to either localizing target objects of interest [10, 11] or classifying given regions of interest as benign or malignant [12, 13], rather than performing the two tasks simultaneously.

Figure 1: Example of BUS images with masses. Bounding boxes with solid and dashed lines respectively represent ground truths (GTs) and detections using the proposed method. Boxes are colored as blue/red if the GT or predicted label is benign/malignant. All figures are best viewed in color.

Thus, we present a method for simultaneously localizing and classifying masses in BUS images. In particular, we train a convolutional neural network (CNN) that regresses the bounding box positions of masses and classifies them, thereby assigning their diagnostic labels. Training such a model typically requires a strongly supervised dataset, which includes images, bounding boxes, and box-level labels. An increase in dataset size helps avoid overfitting and maximize performance; however, considerable time and cost are required to obtain expert annotations. A dataset with only weak annotations (e.g., image-level diagnostic labels), which is frequently the case in BUS images, may be insufficient to train a model regardless of its size.

Weakly supervised learning is a mechanism for datasets with noisy or sparse, i.e., weak, annotations. Multiple-instance learning (MIL) is a paradigm that defines the label relationship between a bag and its constituent multiple instances, to learn more specific information from these weak labels. Most studies based on MIL have followed the work of Dietterich et al. [14], which was used for predicting drug activities. In line with recent successes, deep learning has also been integrated into MIL frameworks. Several pacesetting works have proposed the use of deep learning features in MIL tasks [15, 16]. Xu et al. [15] studied colon cancer classification based on histopathology images. Song et al. [16] also proposed a weakly supervised object detection method using deeply learned features. As an extension, Wu et al. [17] incorporated learning deep representation into the MIL framework rather than merely using deep learning features as input. They redesigned a typical deep learning architecture to reflect the MIL assumption in image classification and annotation. Methods that incorporate MIL into the deep learning framework have also been proposed for medical imaging problems [18, 19]. The method proposed by Yan et al. [18] addresses the body part recognition problem based on the observation that the body part of a slice is typically identified through local discriminative regions instead of the global image context. In the work of Shen et al. [19], the cancer malignancy of a patient is determined by aggregating nodule-level malignancies.

A few approaches have also been studied recently to use datasets with different supervision levels [20, 21]. In particular, the authors of [20] used weakly labeled and unlabeled images for multi-label image annotation. Different losses are utilized to harness each image with varying characteristics. Similarly, Papandreou et al. [21] used weakly and strongly labeled images simultaneously for semantic image segmentation. They developed expectation-maximization methods to train CNNs from weakly labeled images.

In the current study, we present a method based on deep learning to localize and classify masses from BUS images that is trained on a relatively small dataset with strong annotations and a large dataset with weak annotations in a hybrid manner. The proposed approach achieves a good balance between performance improvement and annotation cost reduction. This method has been developed in a typical medical image setting, where strong annotations are available for only a portion of available medical image data due to the limited resources of physicians. Although the proposed framework exhibits similarities with the methods presented in [20, 21], the technical details differ significantly due to the different domains and objectives.

The main contributions of our work are the development of i) a one-shot method for the concurrent localization and classification of masses present in BUS images and ii) a sophisticated weakly and semi-supervised training scenario for using a strongly annotated dataset, DX+Loc, which comprises BUS images, bounding box coordinates, and the diagnostic labels of masses present in the images, and a weakly annotated dataset, DX, which comprises BUS images and their image-level diagnostic labels, with appropriate training loss selection. In particular, two data streams are established during the training process, one for DX+Loc and another for DX, while images from both sets are fed into a shared network, as shown in Fig. 2(a). DX optimizes network weights via MIL loss, whereas DX+Loc computes losses based on given mass-level GT labels (Fig. 2(c)). Although various network models can be used, we provide the Faster R-CNN method [22] as an example in Fig. 2(b). The experiments show that the proposed method, which uses both datasets, improves performance compared with the method that uses only either the DX+Loc dataset or DX dataset. The proposed method can be applied as a general framework to train a computer-aided detection and diagnosis system for a wide variety of image modalities, target organs, and diseases.

Figure 2: Illustration of the proposed framework. (a) Images from two different data streams are forward-propagated into a shared network. (b) The Faster R-CNN [22] used for the “network” of (a). The network is composed of the region proposal network (RPN) and Fast R-CNN [23] with shared convolutional layers. This figure was previously presented in [22] and is reprinted in this paper for the description of the Faster R-CNN. We also note that the proposed method is a general framework; hence, other supervised approaches can also be adopted. (c) An image-level loss is used for images from DX, whereas region-level losses are used for images from DX+Loc. Refer to Subsections 2.2 and 2.3 for details.

2 Methods

2.1 Datasets

The entire dataset comprises 5424 images from 2994 clinical cases and 2449 patients. The images were acquired using ultrasound systems from multiple vendors, including Philips (ATL HDI 5000, iU22), SuperSonic Imagine (Aixplorer), and Samsung Medison (RS80A), at seven different resolutions ranging from 476x640 to 872x1280, all with 8-bit depth. All the images have pathologically proven biopsy labels indicating whether the mass is benign or malignant.

The DX+Loc subset, which has diagnostic label and mass bounding box annotations, comprises 1200 images (600 benign and 600 malignant). Only the bounding box and label of a single mass of interest (MoI) are marked, whereas other probable masses are disregarded by the operator because US images are intentionally focused only on the MoI. Although MoIs comprise positive samples of masses, we explicitly draw background (non-mass) boxes as negative samples, and thus, other probable masses are excluded.

DX+Loc is further split by patients into training and test sets, namely, DX+Loc-Train (800 images, 400 benign and 400 malignant) and DX+Loc-Test (400 images, 200 benign and 200 malignant). By contrast, the DX set comprises 3291 benign and 933 malignant images. The ratio between benign and malignant in the DX set follows natural statistics. The DX set is only used for training in experiments.
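Splitting by patients means that all images of one patient land on the same side of the split, so that test images never share a patient with training images. A minimal sketch of such a split is given below; the function name, the `image_to_patient` mapping, and the test fraction are hypothetical, and the sketch does not reproduce the class balancing (equal benign and malignant counts) reported for DX+Loc-Train and DX+Loc-Test.

```python
import random
from collections import defaultdict


def split_by_patient(image_to_patient, test_fraction=1 / 3, seed=0):
    """Split image ids into train/test lists so that no patient is in both.

    image_to_patient: dict mapping image id -> patient id (hypothetical format).
    """
    # Group images by the patient they belong to.
    by_patient = defaultdict(list)
    for image_id, patient_id in image_to_patient.items():
        by_patient[patient_id].append(image_id)

    # Shuffle patients (not images) and assign whole patients to the test set.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = round(len(patients) * test_fraction)
    test_patients = set(patients[:n_test])

    train, test = [], []
    for patient_id, images in by_patient.items():
        (test if patient_id in test_patients else train).extend(images)
    return sorted(train), sorted(test)
```

Because whole patients are assigned to one side, the image-level ratio only approximates `test_fraction` when patients contribute different numbers of images.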

2.2 Strongly Supervised Learning Using the DX+Loc Subset

Clinically, the final desired output is the image-level diagnostic label; during the process, a clinician inherently detects the MoI. Thus, we aim to jointly perform MoI localization and classification. To achieve this objective, we apply the Faster R-CNN method proposed in [22] to our problem. Other supervised approaches can also be adopted.

The Faster R-CNN is composed of the region proposal network (RPN) and the Fast R-CNN detector [23], as shown in Fig. 2(b). The fully convolutional RPN generates rectangular region proposals. The Fast R-CNN performs localization and classification on these proposals to detect objects of interest. The combined network is designed such that the RPN and the Fast R-CNN share the convolutional layers. This structure not only enables efficient region proposal generation and detection but also improves the precision of both tasks.

The loss function comprises four terms: the classification and regression losses of the RPN, $L^{\mathrm{RPN}}_{\mathrm{cls}}$ and $L^{\mathrm{RPN}}_{\mathrm{reg}}$, and those of the Fast R-CNN, $L^{\mathrm{FRCN}}_{\mathrm{cls}}$ and $L^{\mathrm{FRCN}}_{\mathrm{reg}}$. In this study, several changes are necessary to adapt them to our problem. For all the terms, losses are based on the difference between the network outputs and the corresponding GT boxes defined by overlaps. For $L^{\mathrm{RPN}}_{\mathrm{cls}}$, proposals are defined as positive if the intersection over union (IoU) overlap with GT mass boxes, whether benign or malignant, is over 0.7. A proposal is defined as negative if over 70% of its area overlaps with one of the annotated background boxes, as described in Subsection 2.1. Other proposals are disregarded. For $L^{\mathrm{FRCN}}_{\mathrm{cls}}$, the probabilities for background, benign, and malignant classes are considered. Regression losses are defined such that only positive proposals contribute to $L^{\mathrm{RPN}}_{\mathrm{reg}}$ and only detected benign or malignant boxes contribute to $L^{\mathrm{FRCN}}_{\mathrm{reg}}$. In both regression losses, bounding box similarities are measured against the GT boxes with the highest IoU.
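The proposal labeling rule for the RPN classification loss can be sketched as follows. This is an illustrative sketch only: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the function names are hypothetical, and checking the positive rule before the negative rule is an assumption, since the text does not state a precedence.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    """Area of the overlap between two boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def iou(a, b):
    """Intersection over union of two boxes."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, mass_boxes, background_boxes,
                   pos_iou=0.7, neg_cover=0.7):
    """Return 'positive', 'negative', or 'ignored' for one region proposal."""
    # Positive: IoU with any GT mass box (benign or malignant) is over 0.7.
    if any(iou(proposal, gt) > pos_iou for gt in mass_boxes):
        return "positive"
    # Negative: over 70% of the proposal's own area lies inside an
    # annotated background box.
    p_area = area(proposal)
    if p_area > 0 and any(intersection(proposal, bg) / p_area > neg_cover
                          for bg in background_boxes):
        return "negative"
    # Everything else is disregarded during training.
    return "ignored"
```

Note that the negative rule uses coverage of the proposal's own area rather than IoU, so a small proposal lying entirely inside a large background box still counts as negative.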

2.3 Weakly Supervised Learning Using the DX Subset

DX is used to prevent overfitting and improve performance. Images from the DX set are fed into a network shared with DX+Loc to produce region-wise classification results, similar to the process in supervised learning. However, we incorporate a MIL scheme to define an appropriate loss function because no region-level GT label exists. In a MIL framework, a bag that comprises multiple instances is positive if at least one instance is positive and negative if all instances are negative. This assumption suggests that we can confidently label all instances in a negative bag as negative. Moreover, at least one instance in a positive bag will be positive. Compared with a similar MIL approach in [19], each image in our problem forms one bag whose instances are all the detected mass regions. If at least one region is classified as malignant, then that image is labeled as malignant. Thus, the per-image image-level loss is defined as follows:

$$L_{\mathrm{MIL}}(I_i) = H(y_i, \hat{y}_i), \tag{1}$$

where $H(y_i, \hat{y}_i)$ is the cross entropy between GT label $y_i$ and prediction $\hat{y}_i$ for the $i$th image $I_i$. The class weights are omitted for brevity although cross entropies (1) multiplied by class weights are used in the experiments to address the class imbalance problem in the DX set.

The image-level label set comprises normal (without any mass) N, benign B, and malignant M. The image-level prediction $\hat{y}_i$ is inherited from that of the MoI $r_i^*$. Thus, $\hat{y}_i$ is defined as

$$\hat{y}_i = \left(p_N(r_i^*),\; p_B(r_i^*),\; p_M(r_i^*)\right). \tag{2}$$
The MoI can be selected based on several criteria. In particular, for images labeled B, we test four different selection criteria, each of which selects the most benign (3), malignant (4), discriminative (5), or abnormal (6) region in the image, whereas the most malignant (4) region is always selected as the MoI for images labeled M:

$$r_i^* = \arg\max_{r \in R_i} p_B(r), \tag{3}$$

$$r_i^* = \arg\max_{r \in R_i} p_M(r), \tag{4}$$

$$r_i^* = \arg\max_{r \in R_i} \max\left(p_B(r),\, p_M(r)\right), \tag{5}$$

$$r_i^* = \arg\max_{r \in R_i} \left(1 - p_N(r)\right), \tag{6}$$

where $R_i$ is the set of detected regions for image $I_i$, and $p_N(r)$, $p_B(r)$, and $p_M(r)$ are the probabilities of normality, benignancy, and malignancy, respectively, for region $r$. Our definitions are based on the assumption that a clinician only focuses on the MoI. No image in our dataset is actually labeled N due to this assumption.
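The MoI selection and the image-level MIL loss described in this subsection can be illustrated with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: each region is represented by a hypothetical `(p_normal, p_benign, p_malignant)` tuple, the "discriminative" score is assumed to be the larger of the benign and malignant probabilities, and the class weights default to 1.

```python
import math

N, B, M = 0, 1, 2  # class indices: normal, benign, malignant

# Score maximized by each selection criterion.
CRITERIA = {
    "benign": lambda p: p[B],
    "malignant": lambda p: p[M],
    "discriminative": lambda p: max(p[B], p[M]),  # assumed form
    "abnormal": lambda p: 1.0 - p[N],
}


def select_moi(region_probs, criterion):
    """Index of the region chosen as the mass of interest (MoI)."""
    score = CRITERIA[criterion]
    return max(range(len(region_probs)), key=lambda r: score(region_probs[r]))


def mil_loss(region_probs, image_label, benign_criterion="malignant",
             class_weights=(1.0, 1.0, 1.0)):
    """Weighted cross entropy between the image label and the MoI prediction.

    Images labeled M always use the 'malignant' criterion; images labeled B
    use `benign_criterion`. Returns (loss, moi_index).
    """
    criterion = "malignant" if image_label == M else benign_criterion
    moi = select_moi(region_probs, criterion)
    # The image-level prediction is inherited from the MoI.
    p_true = max(region_probs[moi][image_label], 1e-12)
    return -class_weights[image_label] * math.log(p_true), moi
```

With the "malignant" criterion applied to a benign image, the loss pushes the most malignant-looking region toward the benign class, which is consistent with the MIL assumption that every instance in a negative bag is negative.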

2.4 Joint Weakly and Semi-Supervised Learning Using the DX+Loc and DX Subsets

We jointly train our network with all the images in DX+Loc and DX using the training schemes described in Subsections 2.2 and 2.3. The proposed training algorithm is summarized in Algorithm 1. DX+Loc is used to update the entire network parameters $\theta$, which comprise $\theta_{\mathrm{s}}$ of the shared convolutional layers, $\theta_{\mathrm{RPN}}$ specific to the RPN, and $\theta_{\mathrm{FRCN}}$ specific to the Fast R-CNN. DX is used to update the same parameters except for $\theta_{\mathrm{reg}}$, which comprises the box regression layer of the Fast R-CNN. Furthermore, we scale the MIL loss (1) by a relative factor $\lambda$, which increases to 1 as training continues, to prevent the network from converging to undesired local minima by excessively focusing on the MIL loss. In addition, we introduce another, iterative training algorithm (Algorithm 2), in which images from DX+Loc and DX are fed into the network “iteratively,” whereas images are fed “simultaneously” in Algorithm 1.

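Input: DX+Loc, DX, numbers of images to sample for constituting a single mini-batch for each set $N_{\mathrm{s}}$ & $N_{\mathrm{w}}$, learning rate $\eta$, initial relative scale factor for MIL loss $\lambda_0$, initial CNN parameters $\theta$.
Output: CNN trained to localize and classify masses in BUS images.
Iterate:
  1. Get a next mini-batch of size $N_{\mathrm{s}} + N_{\mathrm{w}}$, consisting of $N_{\mathrm{s}}$ images from DX+Loc and $N_{\mathrm{w}}$ images from DX.
  2. Compute $\nabla_{\theta} L$, where $L = L_{\mathrm{sup}} + \lambda L_{\mathrm{MIL}}$, with $L_{\mathrm{sup}}$ the sum of the four supervised losses of Subsection 2.2 over the DX+Loc images and $L_{\mathrm{MIL}}$ the MIL loss (1) over the DX images, and use SGD with learning rate $\eta$ to update $\theta$.
Algorithm 1 Simultaneous Weakly and Semi-Supervised Learning of a CNN.

Input: DX+Loc, DX, respective mini-batch sizes for each set $N_{\mathrm{s}}$ & $N_{\mathrm{w}}$, learning rates for each set $\eta_{\mathrm{s}}$ & $\eta_{\mathrm{w}}$, initial relative scale factor for MIL loss $\lambda_0$, initial CNN parameters $\theta$.
Output: CNN trained to localize and classify masses in BUS images.
Iterate between
  Supervised iteration:
    1. Get a next mini-batch of size $N_{\mathrm{s}}$ for DX+Loc.
    2. Compute $\nabla_{\theta} L_{\mathrm{sup}}$, where $L_{\mathrm{sup}}$ is the sum of the four supervised losses of Subsection 2.2, and use SGD with learning rate $\eta_{\mathrm{s}}$ to update $\theta$.
  Weakly supervised iteration:
    1. Get a next mini-batch of size $N_{\mathrm{w}}$ for DX.
    2. Compute $\nabla_{\theta} L$, where $L = \lambda L_{\mathrm{MIL}}$, and use SGD with learning rate $\eta_{\mathrm{w}}$ to update $\theta$ except for the box regression layer of the Fast R-CNN.
Algorithm 2 Iterative Weakly and Semi-Supervised Learning of a CNN.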
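The combination of supervised and MIL losses in the simultaneous algorithm can be sketched as below. The text only states that the MIL scale factor increases to 1 as training continues; the linear ramp used here, the function names, and the example initial value are assumptions made for illustration.

```python
def mil_scale(step, total_steps, initial_scale):
    """Scale factor for the MIL loss at a given training step.

    Assumed schedule: a linear ramp from `initial_scale` at step 0 to 1.0
    at `total_steps`, held at 1.0 afterwards.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_scale + (1.0 - initial_scale) * progress


def joint_loss(supervised_losses, mil_losses, scale):
    """Total loss for one mixed mini-batch.

    supervised_losses: per-image sums of the four detection losses for the
        strongly annotated images in the mini-batch.
    mil_losses: per-image MIL losses for the weakly annotated images.
    """
    return sum(supervised_losses) + scale * sum(mil_losses)
```

For example, with a hypothetical initial scale of 0.5, `mil_scale(50, 100, 0.5)` gives 0.75, so early in training the weakly labeled images contribute less to the gradient than they do at the end.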

3 Results

3.1 Evaluation Details

We use the ImageNet pre-trained VGG-16 [24] model to initialize our network, and only fine-tune the layers conv3_1 and up, as done in [22]. We reduce the sizes of the two fully connected layers of the Fast R-CNN to 512 and use a weight decay of 0.0005 to further prevent overfitting. For data augmentation, we apply horizontal flipping, random brightness, and contrast adjustment to the DX+Loc-Train and DX sets, and additional image-wise random rotation and central cropping to DX. Moreover, we use a simple Adam optimizer [25] to reduce the number of hyperparameters for tuning. The remaining training parameters are kept fixed. In particular, the number of DX+Loc images per mini-batch is fixed to 1 because multiple regions from an image actually constitute a mini-batch in supervised learning.

We first conduct an ablative analysis to investigate the influence of each component in the proposed framework, particularly the components introduced in Subsections 2.3 and 2.4. We also show the applicability of the proposed method as a general framework by replacing VGG-16 with a 101-layer residual net (ResNet-101) [26]. In this study, all the parameters remain unchanged regardless of network structure. Finally, comparisons are made with fully weakly supervised and fully supervised approaches. The fully weakly supervised methods are 1) the method of Zhou et al. [27], which can produce class-wise heat maps via “class activation mapping,” along with classification; and 2) a MIL with region proposals generated via selective search [28] and our proposed loss, which is denoted as SS+MIL. In SS+MIL, the extracted region patches are resized to a fixed size and independently fed into a CNN. An image-level prediction is made by (2). The fully supervised method is that presented in [22] and introduced in Subsection 2.2. In addition, we train the proposed joint weakly and semi-supervised model with various configurations of DX+Loc-Train and DX for comparison. The evaluations are conducted on DX+Loc-Test with a correct localization (CorLoc) measure [29]. CorLoc is the percentage of images in which a method correctly localizes an object of the target class according to the PASCAL criterion (IoU > 0.5). This measure is more appropriate than mean average precision in our case because only the bounding box and the label of a single MoI are marked, whereas other probable masses are disregarded in the annotation process. An image is counted as correct if the contained MoI is correctly localized and classified. Notably, non-maximum suppression (NMS) and thresholding with class probabilities are applied to generate the final detection outputs for all the methods.
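The CorLoc measure as used here can be sketched in a few lines. The data layout below is hypothetical, and one detail is an assumption: an image is counted as correct if any detection that survives NMS and thresholding has the GT label and overlaps the GT box of the MoI with IoU above 0.5.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def corloc(ground_truths, detections, iou_threshold=0.5):
    """Fraction of images whose MoI is both localized and classified correctly.

    ground_truths: one (box, label) pair per image (the single annotated MoI).
    detections: one list per image of (box, label, score) tuples, assumed to
        be the final outputs after NMS and score thresholding.
    """
    correct = 0
    for (gt_box, gt_label), dets in zip(ground_truths, detections):
        if any(label == gt_label and iou(box, gt_box) > iou_threshold
               for box, label, _score in dets):
            correct += 1
    return correct / len(ground_truths)
```

An image with a well-placed box but the wrong diagnostic label, or with no detection at all, counts as a miss under this definition.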

3.2 Ablation Study

Table 1 shows the influence of each component in the proposed framework. We first study the effect of diverse MIL criteria, with each criterion selecting the most benign, malignant, discriminative, or abnormal region in the image as the MoI. We test the four types of criteria for images labeled B, whereas the most malignant region is always selected for images labeled M. Among these criteria, selecting the most malignant region for images labeled B as well as for images labeled M yields the best result, which we believe is supported by the following assumptions. 1) Clinicians always focus on the most “seemingly” malignant region as the MoI in conducting diagnostic tests. 2) Image-level diagnostic labels are determined based on the label of the MoI. Poorer performance is observed among the other MIL criteria because the difference in MIL criteria between images with different labels introduces inconsistencies in determining the MoI. The effectiveness of a gradually increasing scale factor for the MIL loss is highlighted by the “static value (0.5)” variant in Table 1, where a static value is used for the scale factor from start to finish. In particular, 0.5 is used because DX has less precise information than DX+Loc. The use of the gradually increasing scale factor helps prevent the network from converging to undesired local minima, which is confirmed by an increase of 0.0225 in CorLoc (0.8225 to 0.8450).

Lastly, we train the same model using an iterative training algorithm (Algorithm 2). Compared with this model, the model trained by Algorithm 1 exhibits better performance by simultaneously probing DX and DX+Loc while updating the parameters of the entire network.

Table 2 shows the results of replacing VGG-16 with a 101-layer residual net (ResNet-101) [26]. Analogous to the result reported in [22], the performance of Faster R-CNN increases from 0.8000 (VGG-16) to 0.8125 (ResNet-101). The proposed method combined with ResNet-101 (0.8325) also performs better than Faster R-CNN combined with ResNet-101. However, the proposed method with ResNet-101 scores lower than with VGG-16 (0.8450), and its gain over the fully supervised baseline shrinks from 0.0450 to 0.0200. We hypothesize that the improved feature learning obtained by replacing VGG-16 with ResNet-101 relatively reduces the effect of learning the DX set.

| Aspect | Variant | CorLoc |
|---|---|---|
| Proposed | malignant/malignant, gradual increase, simultaneous | 0.8450 |
| MIL criteria | benign/malignant | 0.7950 |
| | discriminative/malignant | 0.8050 |
| | abnormal/malignant | 0.8200 |
| MIL scale factor | static value (0.5) | 0.8225 |
| Training algorithm | iterative | 0.8250 |

“Proposed” refers to the method that uses a MIL selection criterion, where the most malignant/malignant region is selected as the MoI for images labeled B/M, with a gradually increasing scale factor, and the simultaneous version of the training algorithm.
Table 1: Ablation study showing variants of MIL criteria and the training algorithm

| Level of supervision | Strong | Strong | Weak & Semi | Weak & Semi |
|---|---|---|---|---|
| Method | [22] | [22]+ResNet-101 | Proposed | Proposed+ResNet-101 |
| #Strong | 800 | 800 | 800 | 800 |
| #Weak | - | - | 4224 | 4224 |
| CorLoc | 0.8000 | 0.8125 | 0.8450 | 0.8325 |

Replacement is applied to fully supervised (strong) and the proposed joint weakly and semi-supervised (weak and semi) methods.
Table 2: Results of replacing VGG-16 with a 101-layer residual net (ResNet-101) [26]

3.3 Quantitative Evaluation

Table 3 compares the proposed method with fully weakly supervised and fully supervised methods. We empirically found that training a fully weakly supervised model is non-trivial. We failed to train a competent mass detector due to the inferior image quality and complex patterns of BUS images. Our proposed joint weakly and semi-supervised approach clearly outperforms the other methods.

We also tested the effect of the amount of data and supervision used in the training of our network by varying the sizes of the strongly and weakly supervised sets, written here as (number of strongly supervised images, number of weakly supervised images), as follows. 1) All the images in DX and DX+Loc-Train are used for weakly supervised training (800, 5024). 2) Only a portion of the DX set is used for weakly supervised learning (800, 2000). 3-5) Only a portion of the DX+Loc-Train set is used for strongly supervised training, whereas the rest are used for weakly supervised training. Using the maximum amount of data and supervision generally provides the best result, except for the case of (800, 5024), which scores 0.8350 against 0.8450 for (800, 4224). This result unexpectedly shows that applying the same data to supervised and weakly supervised training can have a negative effect.

A comparison between the results in configurations 3-5) and the corresponding results using the same numbers of images only from the DX+Loc-Train set is shown separately in Fig. 3, and it clearly illustrates the effectiveness of the proposed method. In this study, better or comparable results can be achieved with a considerably smaller amount of strongly supervised data complemented with weakly supervised data. Thus, depending on the relative cost between the strong and weak annotations of a specific problem, an optimal ratio of strong and weak supervision may be determined to minimize annotation cost given a particular target performance. Moreover, by learning the DX set with the DX+Loc-Train set, the performance is relatively more robust when the size of the strongly supervised dataset DX+Loc-Train is small, thereby implying that weakly supervised data can have greater impact when the cost of obtaining strongly supervised data is high.

| Level of supervision | Weak | Weak | Strong | Weak & Semi | Weak & Semi | Weak & Semi |
|---|---|---|---|---|---|---|
| Method | [27] | SS+MIL | [22] | Proposed | Proposed | Proposed |
| #Strong | - | - | 800 | 800 | 800 | 800 |
| #Weak | 5024 | 5024 | - | 4224 | 5024 | 2000 |
| CorLoc | 0.1025 | 0.2125 | 0.8000 | 0.8450 | 0.8350 | 0.8350 |

Results of varying numbers of images from DX+Loc-Train and DX are also presented.
Table 3: Quantitative results of fully weakly supervised (weak), fully supervised (strong), and the proposed joint weakly and semi-supervised (weak and semi) methods

Figure 3: Results of the varying ratios of DX+Loc-Train to DX. The corresponding results using the same amounts of images only from DX+Loc-Train are also presented for comparison. The actual numbers of images of DX+Loc-Train and DX are presented for each marked point.

3.4 Qualitative Evaluation

Fig. 4 shows the qualitative results of the proposed method and those of the fully weakly supervised and fully supervised methods. The proposed method successfully detects various types of masses and classifies them more precisely than the method presented in [22].

Fig. 5 provides representative failure cases. All the methods failed to detect masses in the top two rows due to their small size and unclear boundary. In the bottom two examples, the proposed method precisely localized but failed to correctly classify masses probably due to insufficient and confusing features. For example, a mass with an irregular shape and a nonparallel orientation in the third row is likely to be seen as malignant from its image features.

Figure 4: Qualitative results. Columns, from left to right: Weak [27], Strong [22], and Weak & Semi (Proposed). Each row shows a different image. Each pair of rows presents cases with various types of masses, which can be small, large, or unclear masses.

Figure 5: Example failure cases. Columns, from left to right: Weak [27], Strong [22], and Weak & Semi (Proposed). Each row shows a different image. The top two rows present cases where localization fails. The bottom two rows present cases where localization succeeds but classification fails.

4 Conclusion

We have proposed a method for concurrently localizing and classifying masses present in BUS images. In particular, we train a CNN detector by jointly utilizing a weakly annotated dataset and a relatively small strongly annotated dataset. In this manner, a sophisticated weakly and semi-supervised training algorithm is introduced with appropriate training loss selection. The experimental results show the practical usefulness of the proposed method. For our future work, we plan to extend the proposed method to 3D automated BUS. The application of the appealing joint weakly and semi-supervised framework to a wide variety of image modalities, target organs, and diseases can also be our next step.


  • [1] (2013) Breast cancer statistics.
  • [2] (2016) Risk of developing breast cancer.
  • [3] American College of Radiology and C. D’Orsi, ACR BI-RADS Atlas: Breast Imaging Reporting and Data System; Mammography, Ultrasound, Magnetic Resonance Imaging, Follow-up and Outcome Monitoring, Data Dictionary. ACR, American College of Radiology, 2013.
  • [4] X. Shi, H. D. Cheng, L. Hu, W. Ju, and J. Tian, “Detection and Classification of Masses in Breast Ultrasound Images,” Digit. Signal Process., vol. 20, no. 3, pp. 824–836, May 2010.
  • [5] M. Minavathi, M. S, and M. Dinesh, “Classification of Mass in Breast Ultrasound Images using Image Processing Techniques,” vol. 42, pp. 29–36, 03 2012.
  • [6] G. N. Lee, D. Fukuoka, Y. Ikedo, T. Hara, H. Fujita, E. Takada, T. Endo, and T. Morita, Classification of Benign and Malignant Masses in Ultrasound Breast Image Based on Geometric and Echo Features. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 433–439.
  • [7] H. D. Cheng, J. Shan, W. Ju, Y. Guo, and L. Zhang, “Automated Breast Cancer Detection and Classification Using Ultrasound Images: A Survey,” Pattern Recogn., vol. 43, no. 1, pp. 299–317, Jan. 2010.
  • [8] P. Jiang, J. Peng, G. Zhang, E. Cheng, V. Megalooikonomou, and H. Ling, “Learning-based automatic breast tumor detection and segmentation in ultrasound images,” in 2012 9th IEEE International Symposium on Biomedical Imaging (ISBI), May 2012, pp. 1587–1590.
  • [9] P. Kisilev, E. Barkan, G. Shakhnarovich, and A. Tzadok, “Learning to detect lesion boundaries in breast ultrasound images.”
  • [10] H. Chen, Y. Zheng, J. H. Park, P.-A. Heng, and S. K. Zhou, “Iterative Multi-domain Regularized Deep Learning for Anatomical Structure Detection and Segmentation from Ultrasound Images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 487–495.
  • [11] M. H. Yap, G. Pons, J. Marti, S. Ganau, M. Sentís, R. Zwiggelaar, A. Davison, and R. Martí, “Automated Breast Ultrasound Lesions Detection using Convolutional Neural Networks,” vol. PP, pp. 1–1, 07 2017.
  • [12] B. Huynh, K. Drukker, and M. Giger, “MO-DE-207B-06: Computer-Aided Diagnosis of Breast Ultrasound Images Using Transfer Learning From Deep Convolutional Neural Networks,” vol. 43, pp. 3705–3705, 06 2016.
  • [13] J.-Z. Cheng, D. Ni, Y.-H. Chou, J. Qin, C.-M. Tiu, Y.-C. Chang, C.-S. Huang, D. Shen, and C.-M. Chen, “Computer-Aided Diagnosis with Deep Learning Architecture: Applications to Breast Lesions in US Images and Pulmonary Nodules in CT Scans,” vol. 6, p. 24454, 04 2016.
  • [14] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial Intelligence, vol. 89, no. 1, pp. 31–71, 1997.
  • [15] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, and E. I. C. Chang, “Deep learning of feature representation with multiple instance learning for medical image analysis,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1626–1630.
  • [16] H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell, “Weakly-supervised Discovery of Visual Pattern Configurations,” in Proceedings of the 27th International Conference on Neural Information Processing Systems, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, pp. 1637–1645.
  • [17] J. Wu, Y. Yu, C. Huang, and K. Yu, “Deep multiple instance learning for image classification and auto-annotation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3460–3469.
  • [18] Z. Yan, Y. Zhan, Z. Peng, S. Liao, Y. Shinagawa, D. N. Metaxas, and X. S. Zhou, Bodypart Recognition Using Multi-stage Deep Learning. Cham: Springer International Publishing, 2015, pp. 449–461.
  • [19] W. Shen, M. Zhou, F. Yang, D. Dong, C. Yang, Y. Zang, and J. Tian, Learning from Experts: Developing Transferable Deep Features for Patient-Level Lung Cancer Prediction. Cham: Springer International Publishing, 2016, pp. 124–131.
  • [20] F. Wu, Z. Wang, Z. Zhang, Y. Yang, J. Luo, W. Zhu, and Y. Zhuang, “Weakly Semi-Supervised Deep Learning for Multi-Label Image Annotation,” IEEE Transactions on Big Data, vol. 1, no. 3, pp. 109–122, Sept 2015.
  • [21] G. Papandreou, L. C. Chen, K. P. Murphy, and A. L. Yuille, “Weakly-and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 1742–1750.
  • [22] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, June 2017.
  • [23] R. Girshick, “Fast R-CNN,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 1440–1448.
  • [24] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [25] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” CoRR, vol. abs/1412.6980, 2014.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
  • [27] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 2921–2929.
  • [28] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective Search for Object Recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, Sep 2013.
  • [29] T. Deselaers, B. Alexe, and V. Ferrari, “Weakly Supervised Localization and Learning with Generic Knowledge,” International Journal of Computer Vision, vol. 100, no. 3, pp. 275–293, Dec 2012.