
A Two-Stage Multiple Instance Learning Framework for the Detection of Breast Cancer in Mammograms

by   Sarath Chandra K, et al.

Mammograms are commonly employed in the large-scale screening of breast cancer, which is primarily characterized by the presence of malignant masses. However, automated image-level detection of malignancy is a challenging task given the small size of the mass regions and the difficulty in discriminating between malignant masses, benign masses, and healthy dense fibro-glandular tissue. To address these issues, we explore a two-stage Multiple Instance Learning (MIL) framework. A Convolutional Neural Network (CNN) is trained in the first stage to extract local candidate patches in the mammograms that may contain either a benign or malignant mass. The second stage employs a MIL strategy for an image-level benign vs. malignant classification. A global image-level feature is computed as a weighted average of patch-level features learned using a CNN. Our method performed well on the task of localization of masses with an average Precision/Recall of 0.76/0.80 and achieved an average AUC of 0.91 on the image-level classification task using a five-fold cross-validation on the INbreast dataset. Restricting the MIL only to the candidate patches extracted in Stage 1 led to a significant improvement in classification performance in comparison to a dense extraction of patches from the entire mammogram.



I Introduction

Breast cancer is a leading cause of mortality in women [1]. Its early detection through large-scale screening is critical for timely intervention but requires the manual evaluation of a large number of mammograms [5]. However, only a small proportion of the screening population actually exhibits salient markers for malignancy. Thus, automated diagnostic tools can reduce the radiologist’s workload by referring only the suspicious cases to them and may also help minimize inter- and intra-expert variations in diagnosis.

Breast cancer is primarily characterized by the presence of malignant lumps or tumors called masses, in addition to secondary indications such as the distribution of microcalcifications, asymmetries, and architectural distortions [8]. A non-cancerous benign mass typically has a smooth and regular shape, while a malignant mass is usually characterized by irregular and indistinct margins. Many recent methods for screening breast cancer [2], [4] first segment the lesions and then extract features from the detected lesions for image-level classification into a benign or malignant class. In [2], the lesion segmentation maps for multi-view images of the same breast were provided as input to an ensemble of ResNet models for breast-level classification. In [4], handcrafted features were augmented with Convolutional Neural Network (CNN) features to improve classification. A transfer learning approach was employed in [10], where a patch-level classifier was first trained to detect the lesions and then extended with additional convolutional layers at the end to obtain the image-level classifier. Alternatively, a few single-stage Multiple Instance Learning (MIL) approaches [11] have been explored, which extract features from dense patches obtained from the entire image without lesion segmentation. The sparse patch-level prediction scores from a classifier are aggregated into image-level predictions using a max operation [11].

Fig. 1: Pipeline of the proposed method.
Fig. 2: The block diagram and CNN architecture used in Stage 1 to detect the bounding boxes of the mass regions in the mammogram.

The image-level screening of mammograms is challenging for several reasons. Since a mass occupies only a very small region of the entire mammogram [11], the classifier must learn to attend to localized suspicious regions while ignoring the large healthy background tissue. Moreover, it is difficult to discriminate between benign masses, malignant masses, and dense healthy fibro-glandular tissue, as they often share a similar intensity and texture profile. Finally, the explainability of an automated system, in terms of an approximate localization of the mass that led to the prediction of cancer, is critical for deployment in a real-world scenario.

To address these issues, we explore a two-stage framework depicted in Fig. 1. A localization network is trained in the first stage to extract local candidate patches in the mammograms that may contain either a benign or malignant mass. In the second stage, a Multiple Instance Learning (MIL) formulation is employed to obtain a global image-level feature representation from the extracted image patches to classify the mammograms. In contrast to the existing single-stage MIL methods, we hypothesize that restricting the MIL framework only to the regions detected in the first stage may make its task simpler, resulting in improved classification performance. The proposed method has been found to provide robust mass localization and competitive image-level classification performance on the INbreast dataset [7]. The efficacy of utilizing the localization network in improving the image-level classification performance has also been demonstrated experimentally. The details of the proposed method are presented in Section II, followed by a discussion of the experimental results in Section III and concluding remarks in Section IV.

II Method

The proposed method employs a two-stage framework (see Fig. 1) to ensure that it only attends to the clinically relevant localized regions in the mammogram for its predictions. The first stage is a localization network that detects multiple bounding boxes in each image which are the candidate regions in the mammogram that may contain either a benign or a malignant mass. A set of image patches are extracted from the detected regions and employed in the second stage for the image-level prediction. The second stage employs a MIL strategy to obtain a global image-level feature representation by computing a weighted average of patch-level features. Both the aggregation weights and the patch-level features are obtained by applying a CNN model to each image patch. Further details are discussed below.

Image Preprocessing: A tight crop around the breast region is obtained by roughly segmenting it using Otsu thresholding to remove the dark background, and the crop is resized to a fixed resolution. The images are whitened to zero mean and unit variance. During training, data augmentation is performed with random vertical and horizontal flips, translation and scaling within a range of 0.2 times the image dimensions, and random rotations to improve generalization.
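The cropping and whitening steps can be illustrated with a minimal pure-Python sketch. It assumes an 8-bit grayscale image stored as a list of rows that contains both dark background and brighter tissue; the function names `otsu_threshold`, `breast_crop`, and `whiten` are hypothetical, and resizing and augmentation are left out.

```python
from statistics import mean, pstdev

def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(levels):
        w_bg += hist[t]               # pixels with value <= t (background)
        sum_bg += t * hist[t]
        w_fg = total - w_bg           # pixels with value > t (foreground)
        if w_bg == 0 or w_fg == 0:
            continue
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

def breast_crop(image):
    """Tight crop around all pixels brighter than the Otsu threshold."""
    t = otsu_threshold([p for row in image for p in row])
    rows = [r for r, row in enumerate(image) if any(p > t for p in row)]
    cols = [c for c in range(len(image[0])) if any(row[c] > t for row in image)]
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]

def whiten(image):
    """Shift and scale the pixel values to zero mean and unit variance."""
    flat = [p for row in image for p in row]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(p - mu) / sigma for p in row] for row in image]
```

In practice an image library would perform these operations on arrays; the loops here only make the individual steps explicit.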

Stage 1, Localization of Mass: The localization network employs a Fully Convolutional Network similar to U-Net [9], with the number of convolutional filters in each layer modified for our task as depicted in Fig. 2. The output of the localization network is a softmap where the value at each pixel is the probability that it belongs to a mass. It is post-processed to obtain the detection bounding boxes as follows. First, the softmap is binarized by thresholding, and a morphological closing operation is applied to remove small spurious regions and smooth the boundaries of the detected regions. Thereafter, a tight bounding box is extracted around each of the detected connected components.
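The post-processing from softmap to bounding boxes can be sketched as follows. This is an illustrative pure-Python version, not the authors' code: the threshold of 0.5 is an assumed value, the morphological closing step is omitted for brevity, 4-connectivity is an assumption, and `bounding_boxes` is a hypothetical helper name. Boxes are returned as inclusive `(top, left, bottom, right)` pixel coordinates.

```python
from collections import deque

def bounding_boxes(softmap, threshold=0.5):
    """Binarize a softmap and return one tight box per connected component."""
    height, width = len(softmap), len(softmap[0])
    mask = [[value >= threshold for value in row] for row in softmap]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over the 4-connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```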

A linear combination of the weighted binary cross-entropy (WCE) loss (with a weight of 0.8) and the soft Dice loss [6] (with a weight of 0.2) is employed at a pixel-level to train the localization network. A weight of 28 is given to the positive class in the WCE loss based on the ratio of the number of foreground to background pixels in the Ground Truth (GT) bounding boxes.
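As a rough illustration of this objective, the sketch below combines a positive-weighted binary cross-entropy with a soft Dice term on flattened pixel lists. The weights 0.8, 0.2 and 28 come from the text; the squared-sum Dice denominator follows the form used in [6], while the averaging over pixels, the `eps` clamp, and the name `combined_loss` are assumptions made for this example.

```python
import math

def combined_loss(probs, targets, pos_weight=28.0, w_ce=0.8, w_dice=0.2, eps=1e-7):
    """0.8 * weighted BCE + 0.2 * soft Dice over flattened pixel lists."""
    wce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite
        wce -= pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    wce /= len(probs)

    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(p * p for p in probs) + sum(t * t for t in targets)
    dice_loss = 1.0 - (2.0 * intersection + eps) / (denominator + eps)

    return w_ce * wce + w_dice * dice_loss
```

With a large positive-class weight, missing a mass pixel is penalized far more heavily than a false alarm on a background pixel, which counteracts the scarcity of foreground pixels.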

Patch Extraction: Once the localization network has been trained, 5 fixed-size candidate patches (one at the center and one at each of the four corners) are extracted from each of the detected bounding boxes. We note that not all the extracted patches are malignant: some will contain benign masses, while a few may be False Positives containing healthy fibro-glandular tissue with intensity and texture characteristics similar to those of a mass. In the rare cases where no masses were detected, a set of 5 patches was randomly selected from the bright regions in the entire image.
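One possible reading of the five-patch layout is sketched below. It assumes that each patch is centered on the box center or on one of the four box corners and is shifted to stay inside the image; the text does not state the exact placement, so this is only an interpretation. The `patch_size` argument and the name `candidate_patches` are hypothetical, and the image is assumed to be at least `patch_size` pixels in each dimension. Boxes and patches are `(top, left, bottom, right)` tuples with exclusive bottom and right edges.

```python
def candidate_patches(box, patch_size, image_shape):
    """Return five square patch windows for one detected bounding box."""
    top, left, bottom, right = box
    height, width = image_shape
    half = patch_size // 2
    anchors = [
        ((top + bottom) // 2, (left + right) // 2),  # box center
        (top, left), (top, right),                   # upper corners
        (bottom, left), (bottom, right),             # lower corners
    ]
    patches = []
    for cy, cx in anchors:
        # Shift the window so that it lies completely inside the image.
        y0 = min(max(cy - half, 0), height - patch_size)
        x0 = min(max(cx - half, 0), width - patch_size)
        patches.append((y0, x0, y0 + patch_size, x0 + patch_size))
    return patches
```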

Fig. 3: The Multiple Instance Learning Framework in Fig. (a) extracts a set of local features and corresponding scalar weights by applying a single CNN model which shares the same model weights across the different patches. The image level feature is obtained as the weighted average of the local features and used for benign vs. malignant classification using a FC layer. The architecture of the patch level CNN is detailed in Fig. (b).

Stage 2, Image-level Classification: In Stage 2, our objective is to perform a benign vs. malignant classification at the image level by attending only to the set of $N$ candidate patches, denoted by $\{p_i\}_{i=1}^{N}$, which were extracted in the previous stage. $N$ may vary across different images. An image belongs to the positive class if at least one $p_i$ contains a malignant mass, while none of the $p_i$ in a benign image can have a malignant mass. However, the GT class labels are only available at the image level and are not known for each individual patch. No specific ordering is assumed among the image patches. This image-level classification task can be posed as a MIL problem where each image is treated as a bag of instances, the instances being the image patches $p_i$.

In contrast to the MIL approach in [11], which attempted to combine the individual predictions for each $p_i$ using sparsity constraints and max-pooling operations, we directly obtain the image-level prediction without explicitly classifying each $p_i$ individually. We employ an embedding-level approach similar to [3], which is depicted in Fig. 3 (a). Each image patch $p_i$ is encoded into a 512-dimensional feature vector $f_i$ using a CNN model which additionally computes a scalar attention weight $w_i$. The weights are normalized using the softmax operation, $\hat{w}_i = \exp(w_i) / \sum_{j=1}^{N} \exp(w_j)$. Finally, the global image-level feature is computed as $F = \sum_{i=1}^{N} \hat{w}_i f_i$ and used for binary classification using an FC layer with 1 neuron and sigmoid activation. The binary cross-entropy loss was used to train the entire MIL architecture in Fig. 3 (a), and the class imbalance was handled during training by oversampling the instances of the malignant class with data augmentation.

The softmax operation normalizes the attention weights to sum to 1, thereby ensuring that the scale of $F$ is invariant to $N$, and also induces a regional competition among the image patches. The same CNN model with identical network weights, whose architecture is detailed in Fig. 3 (b), is applied to each $p_i$ to obtain $f_i$ and $w_i$.
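The attention-weighted pooling can be illustrated with a small sketch. It assumes that the normalization is a softmax and that the final activation is a sigmoid, which is consistent with weights that sum to 1 and a binary cross-entropy loss. The feature vectors and raw attention scores are stand-ins for the outputs of the shared patch-level CNN, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw attention scores so that they are positive and sum to 1."""
    peak = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mil_pool(features, scores):
    """Weighted average of patch features using softmax attention weights."""
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

def classify(pooled, fc_weights, fc_bias):
    """Single-neuron FC layer followed by a sigmoid."""
    logit = sum(w * x for w, x in zip(fc_weights, pooled)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))

# Three patches with 2-dimensional stand-in features (512-dimensional in the text).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = mil_pool(features, [0.0, 0.0, 0.0])
```

Because the weights sum to 1, repeating every patch in the bag leaves the pooled feature unchanged, which is the invariance to the number of patches described above.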

III Experiments

Dataset: Similar to [11], the proposed framework has been evaluated on the INbreast dataset [7]. It consists of 410 Full-field Digital Mammographic images in both MLO and CC views from 115 subjects, out of which 310 are benign and 100 contain malignant masses. The Ground Truth (GT) contains a rough localization of the masses and the class label, benign (0) or malignant (1), for each image. If a mammogram with cancer has multiple masses, at least one of the masses will be malignant, but the GT for each mass is not available as class labels are only available at the image level.

A stratified five-fold cross-validation was performed for evaluation. In contrast to [11], which partitioned the dataset at the image level, we partitioned the five folds at the subject level, thereby ensuring that the MLO/CC views of the same breast do not occur simultaneously in both the train and test splits in any fold. This ensures that the performance evaluation is not optimistically biased by views of the same breast appearing in both splits. The implementation of [11] made available by its authors was re-trained on our fold partitions for a fair benchmark comparison.
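A subject-level partition of this kind could be produced along the following lines. This is a simplified sketch: the record layout `(image_id, subject_id, label)`, the rule that a subject counts as malignant if any of its images is malignant, and the round-robin assignment used to approximate stratification are all assumptions made for illustration, and `subject_level_folds` is a hypothetical name.

```python
def subject_level_folds(records, n_folds=5):
    """Split images into folds so that all images of a subject share a fold."""
    subject_label = {}
    for _, subject, label in records:
        # A subject is treated as malignant if any of its images is malignant.
        subject_label[subject] = max(label, subject_label.get(subject, 0))

    # Sorting by label before the round-robin spreads both classes evenly.
    ordered = sorted(subject_label, key=lambda s: (subject_label[s], s))
    fold_of = {subject: i % n_folds for i, subject in enumerate(ordered)}

    folds = [[] for _ in range(n_folds)]
    for image_id, subject, _ in records:
        folds[fold_of[subject]].append(image_id)
    return folds
```

Each fold then serves once as the test split while the remaining folds form the training split, so no subject contributes images to both.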

Training Details:

The first and second stages are trained separately. Once the localization network has been trained, it is used to extract the image patches which are then used to train the second stage. The localization network in Stage 1 is trained for 300 epochs with a batch size of 8 and 36 batch updates per epoch. The Stochastic Gradient Descent (SGD) optimizer is employed with a learning rate of 0.001, weight decay of 0.0005 and the learning rate is decayed by 0.1 after 50, 200 and 250 epochs.

The MIL network in Stage 2 is trained for 100 epochs using the SGD optimizer with a learning rate of 0.001, weight decay of 0.0005 and the learning rate is decayed by 0.1 after 50 epochs. During implementation, each training batch is of a variable size and is composed of all image patches extracted from a single image.
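The step-wise learning rate decay used in both stages can be written as a small helper. This sketch assumes that "after 50 epochs" means the reduced rate applies from epoch index 50 onwards (counting from 0); `learning_rate` is a hypothetical name.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(50, 200, 250), gamma=0.1):
    """Base rate multiplied by gamma once for every milestone already reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached

# Stage 1 uses the default milestones; Stage 2 decays once, after 50 epochs.
stage1_final = learning_rate(299)
stage2_final = learning_rate(99, milestones=(50,))
```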

Implementation Details:

The proposed method is implemented in Python 3 using the PyTorch 1.0 library and trained on a server with an Intel Xeon E5-2620 CPU and 3 Nvidia GTX TITAN X GPUs.

Performance of mass localization: The objective of the localization network in Stage 1 is not an accurate semantic segmentation but an approximate localization of the masses in terms of bounding boxes, to enable the extraction of the candidate image patches for the second stage. We also note that the MIL framework can learn to ignore the False Positives (FP) from the first stage by assigning low attention weights to these patches. A predicted bounding box is treated as a True Positive (TP) if it has an Intersection over Union (IoU) greater than 0.5 with respect to the GT bounding box. The localization network performed reasonably well with an average Precision of 0.76 and an average Recall of 0.80 across the five-fold cross-validation.
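The box-level evaluation can be illustrated as follows. The IoU threshold of 0.5 is taken from the text; the greedy one-to-one matching between predicted and GT boxes is an assumption, since the matching rule is not spelled out. Boxes are `(top, left, bottom, right)` tuples with exclusive bottom and right edges, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    inter_h = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def precision_recall(predicted, ground_truth, threshold=0.5):
    """Count a prediction as TP if it overlaps an unmatched GT box with IoU > threshold."""
    matched = set()
    true_positives = 0
    for p in predicted:
        for j, g in enumerate(ground_truth):
            if j not in matched and iou(p, g) > threshold:
                matched.add(j)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```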

The localization performance has also been evaluated at a pixel level using the FROC plots presented in Fig. 4 (a), where the TP (FP) is defined as the number of pixels in the predicted regions that lie within any (outside all) of the GT bounding boxes in each image. Additional qualitative results for mass localization are provided in the Supplementary Material (Fig. 5).

Fig. 4: (a) FROC curves for the localization of mass in Stage 1 for the five folds. (b) A qualitative result of the mass localization in Stage 1: the mammogram on the left is shown on the right with the GT marked in red and our result in green. (c) The ROC curves for the proposed image-level benign vs. malignant classification for the five folds. The ROC curves for [11] are also plotted for each fold using dashed lines of the same color.

Performance of Image-level Classification: The average image-level classification performance of the proposed method across the five folds and the corresponding ROC plots are presented in Table I and Fig. 4 (c), respectively. The performance of the proposed MIL framework alone, without the localization network in Stage 1, was also evaluated by extracting a dense set of overlapping image patches at a fixed stride from the entire breast region and retraining our MIL framework. The proposed two-stage framework was found to significantly outperform Stage 2 alone (see Table I), demonstrating the efficacy of the localization network in simplifying the classification task in the second stage by restricting it to the suspicious regions in the image.

Our method, with an average AUC of 0.91, outperforms the existing MIL-based method in [11] in terms of AUC. Moreover, our method has a significantly better Sensitivity-Specificity trade-off in comparison to [11] (see Table I), which is desirable in screening systems.

TABLE I: Average five-fold cross-validation performance for image-level Benign vs. Malignant Classification. The Sensitivity (Sens.), Specificity (Spec.), Balanced Accuracy (B. Acc.) and Area under ROC curve (AUC) are reported as mean ± standard deviation for the method in [11], Stage 2 alone, and the proposed two-stage method.
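For reference, the four reported metrics can be computed from image-level labels and predictions as sketched below. This is a generic illustration rather than the evaluation code used in the paper: the function names are hypothetical, binary predictions are assumed to come from thresholding the classifier output, and the AUC is computed with the pairwise-ranking definition in which ties count as one half.

```python
def screening_metrics(labels, predictions):
    """Sensitivity, specificity and balanced accuracy from binary labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2

def auc(labels, scores):
    """Probability that a malignant image scores higher than a benign one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

Balanced accuracy averages the per-class recall, so it is not dominated by the benign majority in an imbalanced set such as the 310 benign and 100 malignant images used here.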

IV Conclusion

Mammograms are commonly employed in the large scale screening of breast cancer which is primarily characterized by the presence of malignant masses. However, image-level detection of cancer poses many challenges due to the small size of the mass regions in the entire image and the difficulty in discriminating between malignant, benign mass and healthy dense fibro-glandular tissues.

In this work, we propose a two-stage framework to address these issues. In the first stage, bounding boxes around the mass regions are detected and used to extract a set of candidate image patches. The second stage employs a MIL strategy to obtain a global image-level feature representation by computing a weighted average of patch-level features learned using a CNN. Our method performed reasonably well on the task of localization of masses with an average Precision/Recall of 0.76/0.80 and achieved an average AUC of 0.91 on the image-level classification task, outperforming a state-of-the-art MIL-based method. Finally, performing a more fine-grained 6-level classification in the second stage to predict the BI-RADS severity scale [8] presents a challenging direction for future work.


  • [1] C. E. DeSantis, J. Ma, A. Goding Sauer, et al. (2017) Breast cancer statistics, 2017, racial disparity in mortality by state. CA: a cancer journal for clinicians 67 (6), pp. 439–448.
  • [2] N. Dhungel, G. Carneiro, and A. P. Bradley (2017) Fully automated classification of mammograms using deep residual neural networks. In IEEE ISBI, pp. 310–314.
  • [3] M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. In ICML, pp. 2132–2141.
  • [4] T. Kooi, G. Litjens, B. Van Ginneken, et al. (2017) Large scale deep learning for computer aided detection of mammographic lesions. Medical image analysis 35, pp. 303–312.
  • [5] A. B. Miller, C. Wall, C. J. Baines, et al. (2014) Twenty five year follow-up for breast cancer incidence and mortality of the Canadian National Breast Screening Study: randomised screening trial. British Medical Journal 348, pp. 366.
  • [6] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In Int. Conf. on 3D Vision (3DV), pp. 565–571.
  • [7] I. C. Moreira, I. Amaral, I. Domingues, et al. (2012) INbreast: toward a full-field digital mammographic database. Academic radiology 19 (2), pp. 236–248.
  • [8] S. G. Orel, N. Kay, C. Reynolds, and D. C. Sullivan (1999) BI-RADS categorization as a predictor of malignancy. Radiology 211 (3), pp. 845–850.
  • [9] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
  • [10] L. Shen, L. R. Margolies, J. H. Rothstein, et al. (2019) Deep learning to improve breast cancer detection on screening mammography. Nature Scientific reports 9 (1), pp. 1–12.
  • [11] W. Zhu, Q. Lou, Y. S. Vang, and X. Xie (2017) Deep multi-instance networks with sparse label assignment for whole mammogram classification. In MICCAI, pp. 603–611.

Supplementary Material: A Two-Stage Multiple Instance Learning Framework for the Detection of Breast Cancer in Mammograms

Fig. 5: Qualitative results of mass detection using the localization network in Stage 1. First column: the input image; second column: the Ground Truth (GT) bounding box in red; third column: the predicted bounding box in green; fourth column: the GT and predicted detections overlaid.