A Hypersensitive Breast Cancer Detector

by Stefano Pedemonte et al.

Early detection of breast cancer through screening mammography yields a 20-35% increase in survival rate; however, there are not enough radiologists to serve the growing population of women seeking screening mammography. Although commercial computer aided detection (CADe) software has been available to radiologists for decades, it has failed to improve the interpretation of full-field digital mammography (FFDM) images due to its low sensitivity over the spectrum of findings. In this work, we leverage a large set of FFDM images with loose bounding boxes of mammographically significant findings to train a deep learning detector with extreme sensitivity. Building upon work from the Hourglass architecture, we train a model that produces segmentation-like images with high spatial resolution, with the aim of producing 2D Gaussian blobs centered on ground-truth boxes. We replace the pixel-wise L_2 norm with a weak-supervision loss designed to achieve high sensitivity, asymmetrically penalizing false positives and false negatives while softening the noise of the loose bounding boxes by permitting a tolerance in misaligned predictions. The resulting system achieves a sensitivity for malignant findings of 0.99 with only 4.8 false positive markers per image. When utilized in a CADe system, this model could enable a novel workflow where radiologists can focus their attention with trust on only the locations proposed by the model, expediting the interpretation process and bringing attention to potential findings that could otherwise have been missed. Due to its nearly perfect sensitivity, the proposed detector can also be used as a high-performance proposal generator in two-stage detection systems.





1 Purpose

With nearly 269,000 new cases and 42,000 deaths each year, breast cancer is the most common and the second most deadly cancer for women [cancerStatistics2019]. Breast cancer screening using full-field digital mammography (FFDM) has led to a reduction in deaths from this disease [elmore2005screening]. Developed to aid radiologists in screening, traditional computer aided detection (CADe) highlighted suspicious regions in the image. However, it failed to achieve clinical utility due to its low sensitivity, high false positive rate, and reliance on hand-designed features [lehman2015diagnostic, Baker2003].

Recently, convolutional neural networks (CNNs) have achieved superhuman performance on many imaging tasks and have been adapted to FFDM to improve cancer detection rates. In particular, CNN detection models localize and classify different findings in images. However, most detection models are trained on well-labeled, balanced datasets with tight bounding boxes. These models can struggle to adapt to datasets without these qualities, such as mammography, where the cancer incidence rate is 0.51% and most bounding boxes annotated in the routine clinical workflow only loosely encapsulate the finding. Thus, we propose a new weakly-supervised loss function for a detector that marks malignant findings on mammograms, enabling extreme sensitivity with a false positive rate of less than a handful of marks per image.

2 Methods

2.1 Hourglass Model

Two principal approaches for object detection can be identified in the literature: (1) detectors that localize and predict the extent of objects by drawing bounding boxes around them [He_2017_ICCV, liu2016ssd], and (2) detectors that localize objects by estimating their center coordinates. This second class of detectors has been developed mostly to localize body joints for human pose estimation. In this context, a refreshingly simple and high-performance solution was proposed by Newell et al. [newell2016stacked]. Their solution consists of a simple CNN trained to produce a Gaussian blob at the location of each joint. This CNN, named Hourglass, is composed of several U-Nets, which are stacks of residual modules that progressively downsample and then upsample the features [ronneberger2015u]. Before each downsampling, skip connections are added across modules of identical resolution to facilitate gradient propagation. The key characteristic of the U-Net and Hourglass architectures is that they enable the model to produce outputs with high spatial resolution while considering a wide receptive field for each output pixel. These end-to-end convolutional networks are not much different from VGG-style networks, but have the advantage of removing the trade-off between the size of the field-of-view and the resolution of the model output.

The authors of Hourglass achieve state-of-the-art performance on the body joint localization task by minimizing the L_2 norm between the network output and a stack of reference images, each representing a Gaussian blob located at a different joint location. In our approach for the localization of malignant lesions on mammograms, we maintain the Hourglass architecture but introduce a new loss that promotes high sensitivity.

2.2 Blobs to coordinates

In pose estimation, where there is a fixed set of joints, the Hourglass is trained to produce one output channel per joint, each containing a single blob. We modify the last layer of the network so that the output is a single channel, capturing all potential findings at once. We then define a new loss that promotes high sensitivity and that allows for the prediction of any number of blobs at multiple locations. Conversion of the output channel to an array of 2-dimensional coordinates is achieved by applying a simple peak-finding algorithm (see Figure 2-A-D).
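The paper does not spell out the peak-finding step; a minimal sketch (pure Python; the `threshold` value and the 8-neighborhood test are illustrative assumptions, not the paper's choices) that extracts blob centers as strict local maxima of the output heatmap:

```python
def find_peaks(heatmap, threshold=0.5):
    """Return (row, col, value) for every strict local maximum above threshold.

    `heatmap` is a 2-D list of floats (the model's single-channel output);
    the 8-neighborhood is used to test for a local maximum.
    """
    peaks = []
    rows, cols = len(heatmap), len(heatmap[0])
    for r in range(rows):
        for c in range(cols):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v > n for n in neighbors):
                peaks.append((r, c, v))
    # Highest-confidence peaks first, which is also the ordering needed
    # for the top-k masking used by the background loss (Section 2.3).
    return sorted(peaks, key=lambda p: -p[2])
```

Sorting by confidence makes the same routine reusable when selecting the top-k blobs for the background mask.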

Figure 2: Hourglass network with loss that promotes high sensitivity

2.3 An innovative loss that promotes sensitivity

In Hourglass [newell2016stacked], the loss is the L_2 norm of the difference between the predicted images and a stack of images containing 2-D Gaussian blobs centered on each joint annotation. We replace the L_2 norm of the Hourglass model with a loss designed around four principles aimed at promoting sensitivity:

  (a) Loss should remain unaffected by small mis-alignments between the predicted and ground truth locations.

  (b) Loss should remain unaffected by size variations of the predicted blob.

  (c) A small number of false positives per image is acceptable.

  (d) False positive marks should be penalized less than false negative marks.

The loss that we propose is composed of two terms: a detection loss L_det and a background loss L_bkg. The detection loss operates on pixels surrounding the area of an annotation, measuring the similarity between the model output and a 2-D Gaussian blob. The background loss operates on pixels far away from annotations, measuring the similarity of the model output to zero.

To calculate the detection loss, first a patch P (marked in green in Figure 2-D,F) of a chosen fixed size centered on the ground truth annotation is extracted from the model output (in a practical implementation, this operation happens in-place to enable propagation of the loss gradient). The detection loss is then calculated as the L_2-norm of the difference between P and a reference patch Q. Following principles (a) and (b), we incorporate in the loss tolerances to errors in the location of annotations and to varying extent of the findings by constructing the reference patch Q adaptively, as a function of the model output P. Invariance of the loss to small mis-alignments between the predicted and ground truth locations (a) is obtained by generating a reference 2-D Gaussian blob centered on the model’s predicted blob (see Figure 2-F). Invariance of the loss to the predicted blob size (b) is obtained by comparing the model output with a bank of reference blobs of different sizes σ_1, …, σ_S (see Figure 2-F). Only the most similar blob (i.e. the blob that yields the smallest L_2-norm) contributes to the detection loss. Denoting with G_σ a 2-D Gaussian blob of width σ over patch coordinates, centered on the predicted blob:

    L_det = min_{σ ∈ {σ_1, …, σ_S}} ‖ P − G_σ ‖_2

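As a concrete reading of the detection loss, the following sketch (pure Python; the patch size and the bank of sigmas are illustrative values, not the paper's) builds reference blobs centered on the model's predicted peak and keeps only the best-matching size:

```python
import math

def gaussian_patch(size, center, sigma):
    """2-D Gaussian blob of width `sigma` centered at `center` = (row, col)."""
    cr, cc = center
    return [[math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
             for c in range(size)] for r in range(size)]

def l2(a, b):
    """L2 norm of the element-wise difference between two equal-size patches."""
    return math.sqrt(sum((a[r][c] - b[r][c]) ** 2
                         for r in range(len(a)) for c in range(len(a[0]))))

def detection_loss(patch, peak, sigmas=(1.0, 2.0, 4.0)):
    """Minimum, over a bank of reference blob sizes, of the L2 distance
    between the output patch and a blob centered on the model's own
    predicted peak.  Centering on the prediction gives tolerance to small
    mis-alignments (a); the min over sigmas gives size invariance (b).
    """
    size = len(patch)
    return min(l2(patch, gaussian_patch(size, peak, s)) for s in sigmas)
```

Because the reference blob follows the predicted peak, a prediction slightly offset from the loose annotation is not penalized, which is exactly the tolerance the loose bounding boxes require.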
The background loss enforces the model to produce values close to zero in areas far away from annotations, evaluating the L_2-norm of the model output in the region outside of the detector patch (see Figure 2-E). In order to promote sensitivity, we relax the effect of the background loss by allowing for a small number of false positives in each image (c). This is achieved by masking out patches of the model output centered on the top-k highest confidence blobs generated by the model. Figure 2-E presents an example of a background mask. Denoting with Y the model output and with M the binary mask that zeroes the detector patch and the top-k predicted blobs:

    L_bkg = ‖ M ⊙ Y ‖_2

where ⊙ denotes element-wise multiplication. Finally, sensitivity is promoted by penalizing false positive marks less than false negative marks (d). This is achieved by down-weighting the background loss by a factor λ < 1, producing the final loss:

    L = L_det + λ · L_bkg
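The background term and the final combination can be sketched as follows (the mask half-width and the down-weighting factor are illustrative assumptions, not the paper's values):

```python
import math

def background_loss(heatmap, annot_centers, topk_peaks, patch_half=2):
    """L2 norm of the model output outside the detection patches and outside
    the top-k predicted blobs: both regions are zeroed before taking the norm,
    so up to k confident false positives per image go unpenalized.
    """
    masked = [row[:] for row in heatmap]
    for (r0, c0) in list(annot_centers) + list(topk_peaks):
        for r in range(max(0, r0 - patch_half),
                       min(len(masked), r0 + patch_half + 1)):
            for c in range(max(0, c0 - patch_half),
                           min(len(masked[0]), c0 + patch_half + 1)):
                masked[r][c] = 0.0
    return math.sqrt(sum(v * v for row in masked for v in row))

def total_loss(det_losses, bkg, weight=0.1):
    """Final loss: detection terms plus the background term down-weighted by
    `weight` < 1, penalizing false positives less than false negatives."""
    return sum(det_losses) + weight * bkg
```

With the mask in place, a spurious high-confidence blob contributes nothing to the loss as long as it falls within the top-k, which is what lets the detector trade a handful of false positives for near-perfect sensitivity.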
2.4 Dataset

Train Val Test
Patients 49748 6214 6194
Exams 158159 19872 19556
Images 665805 84135 82450
    Benign 11930 1556 1380
    High Risk 891 124 76
    Malignant 2699 331 293
Normal Images 653232 82520 81009
Figure 3: Image-level statistics for training (Train), validation (Val), and testing (Test) datasets and annotation counts.

A total of 197,587 screening mammography exams (832,390 FFDM images) spanning 62,156 patients were collected from an academic medical center in the United States. The exams were interpreted by one of 11 radiologists with breast imaging experience ranging from 2 to 30 years. Annotations were collected as part of the routine clinical workflow. The intended clinical use case of these annotations did not require precise segmentations nor rigorous definitions of the physical extent of a finding. Therefore, the tightness of an annotation to the finding’s boundary varies from case to case. The data encompasses all types of mammographically significant findings that could be encountered in a screening setting except for breast implants, which were excluded from both training and evaluation. Screening exams were associated with subsequent biopsy events by clinical staff for regulatory compliance purposes through dedicated mammography reporting software (Magview 7.1, Burtonsville, Maryland). This provided a structured way to directly link annotations on screening exams to pathology results from biopsies. Pathology cell type information was mapped to the labels of benign, high risk, or malignant by a fellowship-trained breast imaging radiologist.

The images were classified into four classes: (1) normal, no suspicious tissue was found by a radiologist, (2) benign, benign tissue was found during screening or biopsy, (3) high risk, tissue likely to develop into cancer was found during biopsy, (4) malignant, malignant tissue was found during biopsy. During training, the model was trained to identify high risk (3) and malignant (4) as the positive cases, with normal (1) and benign (2) as the negative cases. Our analysis focused on the detection of biopsy-proven malignant findings. Therefore, though we considered high risk findings as positive examples in training to promote sensitivity, during evaluation sensitivity was computed with malignant (4) as the positive class and all others as the negative class. Patients were randomly assigned to model training, validation, or testing according to an 80:10:10 split. These splits were on the patient level, so there are no overlapping images, exams, or patients in the different datasets. Training, hyperparameter tuning, and model selection were completed using only the training and validation sets. The final performance was evaluated once on the test set after all models had been frozen. Statistics of the three data sets are reported in Figure 3.
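The patient-level 80:10:10 split can be sketched as follows (the seed and the exact shuffling procedure are assumptions; only the patient-level disjointness is from the paper):

```python
import random

def patient_level_split(patient_ids, seed=0):
    """Randomly assign patients (not images or exams) to train/val/test at
    80:10:10, so that no patient's images appear in more than one split."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))
```

Splitting on patient identifiers before gathering exams is what prevents leakage of a patient's other views or prior exams across splits.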


2.5 Training

The model was trained on FFDM images resized to a common size and then downsampled by a factor of 2.5 in each dimension. Training occurred for 80 epochs using the Adam optimizer with a learning rate of 1e-4 and the down-weighted background loss described in Section 2.3. One epoch consisted of 8,000 images evenly sampled among the three annotation types and the set of images without findings. Owing to the fully-convolutional architecture, the size of the input image can be different for each batch. Thus, in order to speed up experimentation, we trained the model in two phases. First, we pre-trained the model on fixed-size patches centered on image annotations, downsampled by 2.5, and then fine-tuned it on the whole-image input, downsampled by 2.5.
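The even per-epoch sampling across the three annotation types and the normal images might look like the following sketch (sampling with replacement for the rare classes is an assumption, as is the group structure):

```python
import random

def sample_epoch(images_by_class, per_epoch=8000, seed=0):
    """Draw one training epoch: images sampled evenly across the annotation
    classes and the set of normal images, 8,000 per epoch as in the paper.
    Rare classes (e.g. high risk) are sampled with replacement so each
    class still contributes an equal share.
    """
    rng = random.Random(seed)
    groups = list(images_by_class.values())
    per_group = per_epoch // len(groups)
    epoch = []
    for images in groups:
        epoch.extend(rng.choices(images, k=per_group))
    rng.shuffle(epoch)
    return epoch
```

This kind of stratified sampling counteracts the extreme imbalance of the dataset, where normal images outnumber malignant ones by roughly 300 to 1.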

3 Results

Method Sensitivity FPI Data Type Malignant/Total Cases Dataset
Malich et al. (2001) [Malich2001] 0.900 1.3 M, C 150/150 Private
Petrick et al. (2002) [Petrick2002] 0.870 1.5 M 156/156 Private
Baker et al. (2003) [Baker2003] 0.380 0.7 D 45/45 Private
Dhungel et al. (2015) [Dhungel2015] 0.75 4.8 M 40/40 DDSM-BCRP
Morra et al. (2015) [Morra2015] 0.890 2.7 M, C 123/175 Private
Ribli et al. (2018) [Ribli2018] 0.9 0.3 M, C, D, A 115/115 INbreast
Teuwen et al. (2018) [Teuwen2018] 0.969 3.6 M, D, A 1153/2878* Private
Moor et al. (2018) [Moor2018] 0.94 7.9 M, D, A 1153/2878* Private
Agarwal et al. (2019) [Agarwal2019] 0.980 1.7 M 211/223 DDSM + INbreast
Ours 0.990 4.8 M, C, D, A 168/19556 Private
Figure 4: Sensitivity and false positives per image (FPI) for different CADe systems. M = Masses, C = Microcalcifications, D = Architectural Distortions, A = Asymmetries. * Approximation (exact numbers not given)

The hypersensitive loss detector achieves a sensitivity for malignancies of 0.99, generating only 4.8 false positive marks per image. The Hourglass model with the original loss achieves only 0.42 sensitivity, showing that standard detection loss functions struggle on this challenging detection problem and highly imbalanced dataset. Additionally, we compare our approach to other CADe software in Figure 4. Our approach is the only method that both evaluates on a highly imbalanced dataset that reflects the natural distribution (0.86% malignancy case occurrence) and detects masses, microcalcifications, architectural distortions, and asymmetries. These two features are crucial for a clinically-functional model, as they reflect what a model would encounter in an actual screening population, where only 0.51% of exams have cancer and exams can have any type of lesion [Lehman2017]. Finally, the ablation experiments in Figure 5 demonstrate the positive effect on sensitivity of each of the four components of the loss described in Section 2.3.

Figure 5: Effect of removing the four aspects of the loss on sensitivity and false positives per image (FPI). The plots represent sensitivity as a function of the average number of false positive marks per image for increasing values of a threshold applied to the predicted Gaussian blob intensity. All aspects increase the sensitivity: (a) invariance to blob alignment, (b) invariance to blob size, (c) accepting the top-k false positives per image, (d) down-weighting the background loss. For several of the curves (baseline, (a), (c), (d)), the sensitivity does not reach a value greater than 0.96 even when all blobs are accepted, confirming the importance of each component of the loss. Blob size invariance (b) has the effect of reducing the false positive rate while maintaining maximum sensitivity.
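The threshold sweep behind such sensitivity-vs-FPI curves can be sketched as follows (matching each mark to a ground-truth finding, so that every malignant finding is counted at most once, is assumed to happen upstream):

```python
def froc_points(scored_marks, num_images, num_malignant, thresholds):
    """Sweep a threshold over predicted blob intensities and report
    (sensitivity, false positives per image) pairs.

    `scored_marks` is a list of (score, is_true_positive) tuples, one per
    predicted mark over the whole evaluation set.
    """
    points = []
    for t in thresholds:
        kept = [hit for score, hit in scored_marks if score >= t]
        tp = sum(1 for hit in kept if hit)          # detected malignancies
        fp = sum(1 for hit in kept if not hit)      # spurious marks
        points.append((tp / num_malignant, fp / num_images))
    return points
```

Lowering the threshold moves along each curve toward higher sensitivity at the cost of more marks per image, which is the trade-off the figure visualizes.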

4 Conclusions

This work presents a novel loss function for training mammography CADe models. By considering the specific properties of both the problem domain and the data, design and optimization of this loss produce a hypersensitive detector with an acceptable false positive rate. Moreover, this loss function, though inspired by mammography CADe, has components directly applicable to many image-based detection tasks. Data with large variances in both segmentation quality and physical extent of findings to be detected are exceedingly common especially in the medical imaging domain. The proposed loss function makes no assumptions about the underlying model and can be used with any segmentation-based network architecture.

The low sensitivity of existing CADe systems limits their applicability in the radiologists’ interpretation workflow. Since malignant findings can be missed by a low-sensitivity CADe system, the human reader cannot rely on software-generated annotations. Therefore, existing CADe systems are utilized as second readers that bring the radiologist’s attention back to image areas that the algorithm considers suspicious. Unlike other CADe systems, however, our model performs with nearly perfect sensitivity and an acceptable rate of false positives on a dataset that reflects the natural distribution, with a 0.86% cancer occurrence rate and all the different types of lesions: masses, microcalcifications, architectural distortions, and asymmetries. Thus, the nearly perfect sensitivity to malignant findings enabled by the new loss function described in this work could potentially allow radiologists to safely ignore regions of the images not highlighted by the algorithm. We believe that these are important steps forward for improving, simplifying, and substantially expediting radiologists’ interpretations of mammograms.

5 Future Work

The high sensitivity of this detection model allows the detection of nearly all malignant findings in all images, making it an ideal proposal generator for a two-stage detection system [NIPS2015_5638]. In the first stage, this detection model identifies the suspicious regions of each full image, ideally capturing all the malignant lesions with high sensitivity. In the second stage, a classifier is trained on image patches drawn from the proposals generated by the first stage. This second model operates solely on smaller patches and is trained specifically on these proposals, which are the most suspicious and challenging image areas to classify. Such a two-stage model could increase specificity while maintaining the high level of sensitivity.

6 Disclosure

This work has not been submitted to any journal or conference for publication or presentation consideration.