With nearly 269,000 new cases and 42,000 deaths each year, breast cancer is the most common and the second most deadly cancer for women [cancerStatistics2019]. Breast cancer screening using full-field digital mammography (FFDM) has led to a reduction in deaths from this disease [elmore2005screening]. Developed to aid radiologists in screening, traditional computer aided detection (CADe) highlighted suspicious regions in the image. However, it failed to achieve clinical utility due to its low sensitivity, high false positive rate, and reliance on hand-designed features [lehman2015diagnostic, Baker2003].
Recently, convolutional neural networks (CNNs) have achieved superhuman performance on many imaging tasks and have been adapted to FFDM to improve cancer detection rates [nyu:2019]. In particular, CNN detection models localize and classify different findings in images. However, most detection models are trained on well-labeled, balanced datasets with tight bounding boxes. These models can struggle to adapt to datasets without these qualities, such as mammography, where the cancer incidence rate is 0.51% and most bounding boxes annotated in the routine clinical workflow only loosely encapsulate the finding [Lehman2017]. Thus, we propose a new weakly-supervised loss function for a detector that marks malignant findings on mammograms. This loss enables us to achieve extremely high sensitivity at a false positive rate of only a few marks per image.
2.1 Hourglass Model
Two principal approaches to object detection can be identified in the literature: (1) detectors that localize and predict the extent of objects by drawing bounding boxes around them [He_2017_ICCV, liu2016ssd], and (2) detectors that localize objects by estimating their center coordinates [newell2016stacked]. This second class of detectors has been developed mostly to localize body joints for human pose estimation. In this context, a refreshingly simple and high-performance solution was proposed by Newell et al. [newell2016stacked]. Their solution consists of a simple CNN trained to produce a Gaussian blob at the location of each joint. This CNN, named Hourglass, is composed of several U-Nets, which are stacks of residual modules that progressively downsample and then upsample the features [ronneberger2015u]. Before each downsampling, skip connections are added across modules of identical resolution to facilitate gradient propagation. The key characteristic of the U-Net and Hourglass architectures is that they enable the model to produce outputs with high spatial resolution while considering a wide receptive field for each output pixel. These end-to-end convolutional networks are not much different from VGG-style networks, but they have the advantage of removing the trade-off between the size of the field-of-view and the resolution of the model output.
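The data flow described above can be illustrated with a toy, shape-only sketch. It is not the Hourglass implementation of Newell et al.: `residual_placeholder` is a hypothetical stand-in for the learned residual modules, and the use of average pooling and nearest-neighbour upsampling is an assumption made for brevity. The sketch only shows how a skip branch at each resolution lets the output keep the input resolution while deeper levels see a coarser, wider view.

```python
import numpy as np

def residual_placeholder(features):
    """Stand-in for a learned residual module (assumption: identity for illustration)."""
    return features

def downsample(features):
    """2x2 average pooling on an (H, W) map with even H and W."""
    h, w = features.shape
    return features.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(features):
    """Nearest-neighbour 2x upsampling of an (H, W) map."""
    return features.repeat(2, axis=0).repeat(2, axis=1)

def hourglass(features, depth):
    """Downsample `depth` times, then upsample, adding a skip branch at each resolution."""
    skip = residual_placeholder(features)      # branch kept at the current resolution
    if depth == 0:
        return skip
    low = hourglass(downsample(features), depth - 1)
    return skip + upsample(low)                # merge coarse context with fine detail

image = np.random.default_rng(0).random((64, 64))
output = hourglass(image, depth=4)             # resolutions visited: 64, 32, 16, 8, 4
```

With identity placeholders, the output is simply a sum of progressively blurred copies of the input, but its shape matches the input, which is the property the paragraph above highlights.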
The authors of Hourglass achieve state-of-the-art performance on the body joint localization task by minimizing the norm between the network output and a stack of reference images, each representing a Gaussian blob located at a different joint location. In our approach for the localization of malignant lesions on mammograms, we maintain the Hourglass architecture but introduce a new loss that promotes high sensitivity.
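As a minimal sketch of this reference-image objective, the following builds one Gaussian blob per joint and compares it with a prediction. It assumes the norm is the mean squared error (the loss reported in the Hourglass paper); the joint coordinates, image size, and `sigma` are hypothetical values chosen for illustration.

```python
import numpy as np

def gaussian_heatmap(height, width, center_row, center_col, sigma):
    """Return an (height, width) map holding a unit-peak 2-D Gaussian."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    sq_dist = (rows - center_row) ** 2 + (cols - center_col) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def heatmap_l2_loss(prediction, joints, sigma=2.0):
    """Mean squared error between a (J, H, W) prediction and per-joint Gaussian targets."""
    _, height, width = prediction.shape
    target = np.stack(
        [gaussian_heatmap(height, width, r, c, sigma) for r, c in joints]
    )
    return float(np.mean((prediction - target) ** 2))

joints = [(10, 12), (30, 40)]   # hypothetical (row, col) joint annotations
perfect = np.stack([gaussian_heatmap(64, 64, r, c, 2.0) for r, c in joints])
loss_perfect = heatmap_l2_loss(perfect, joints)               # prediction equals target
loss_blank = heatmap_l2_loss(np.zeros_like(perfect), joints)  # prediction misses both joints
```

Note that this objective requires one channel per joint and penalizes any offset between predicted and annotated locations, which motivates the changes introduced in the following sections.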
2.2 Blobs to coordinates
In pose estimation, where there is a fixed set of points, the Hourglass is trained to produce one output channel per joint location, each containing a single blob. We modify the last layer of the network such that the output has a single channel that captures all potential findings. We then define a new loss that promotes high sensitivity and that allows for the prediction of any number of blobs at multiple locations. Conversion of the output map to an array of 2-dimensional coordinates is achieved by applying a simple peak-finding algorithm (see Figure 2-A-D).
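The peak-finding step is not specified further in the text. One simple possibility, shown here as a sketch under that caveat, is to keep every pixel that exceeds a threshold and is at least as large as its eight neighbours. The function names, the threshold of 0.5, and the synthetic heatmap are assumptions for illustration.

```python
import numpy as np

def gaussian_blob(shape, center, sigma, amplitude=1.0):
    """2-D Gaussian blob on a grid of the given shape."""
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    sq_dist = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return amplitude * np.exp(-sq_dist / (2.0 * sigma ** 2))

def find_peaks(heatmap, threshold=0.5):
    """Return (row, col) local maxima above `threshold`, strongest first.

    A pixel is a peak if it is >= all 8 neighbours, so a flat plateau would
    yield several peaks; a real pipeline may need to merge them.
    """
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    is_peak = heatmap > threshold
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbour = padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
            is_peak &= heatmap >= neighbour
    coords = np.argwhere(is_peak)              # row-major order
    order = np.argsort(-heatmap[is_peak])      # same order as coords
    return [tuple(int(v) for v in coords[i]) for i in order]

# Single-channel output with two blobs of different confidence.
heatmap = gaussian_blob((32, 32), (8, 8), 2.0) + gaussian_blob(
    (32, 32), (20, 25), 2.0, amplitude=0.8
)
peaks = find_peaks(heatmap)
```

Because all findings share one channel, the number of returned coordinates is not fixed in advance, unlike the per-joint output used for pose estimation.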
2.3 An innovative loss that promotes sensitivity
In Hourglass [newell2016stacked], the loss is the norm of the difference between the predicted images and a stack of images containing 2-D Gaussian blobs centered on each joint annotation. We replace this loss with one designed around four principles aimed at promoting sensitivity:
(a) Loss should remain unaffected by small mis-alignments between the predicted and ground truth locations.
(b) Loss should remain unaffected by size variations of the predicted blob.
(c) A small number of false positives per image is acceptable.
(d) False positive marks should be penalized less than false negative marks.
The loss that we propose is composed of two terms: a detection loss and a background loss. The detection loss operates on pixels surrounding the area of an annotation, measuring the similarity between the model output and a 2-D Gaussian blob. The background loss operates on pixels far away from annotations, measuring the similarity of the model output to zero.
To calculate the detection loss, a patch of a chosen fixed size (marked in green in Figure 2-D,F), centered on the ground truth annotation, is first extracted from the model output (in a practical implementation, this operation happens in-place to enable propagation of the loss gradient). The detection loss is then calculated as the norm of the difference between this patch and a reference patch. Following principles (a) and (b), we build tolerance to errors in annotation location and to the varying extent of findings into the loss by constructing the reference patch adaptively, as a function of the model output. Invariance of the loss to small mis-alignments between the predicted and ground truth locations (a) is obtained by generating a reference 2-D Gaussian blob centered on the model’s predicted blob (see Figure 2-F). Invariance of the loss to the predicted blob size (b) is obtained by comparing the model output with a bank of reference blobs of different sizes (see Figure 2-F). Only the most similar blob (i.e., the blob that yields the smallest norm) contributes to the detection loss. Each reference blob is a 2-D Gaussian defined over the patch coordinates.
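The detection loss described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the norm is taken to be the L2 (Frobenius) norm, the predicted blob center is taken to be the maximum of the patch, the patch size and the bank of `sigmas` are hypothetical values, and the annotation is assumed to lie far enough from the image border for the patch to fit.

```python
import numpy as np

def gaussian_patch(size, center, sigma):
    """Unit-peak 2-D Gaussian over the coordinates of a (size, size) patch."""
    rows = np.arange(size)[:, None]
    cols = np.arange(size)[None, :]
    sq_dist = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def detection_loss(output, annotation, patch_size=32, sigmas=(2.0, 4.0, 8.0)):
    """Smallest norm between the annotated patch and a bank of adaptive reference blobs."""
    half = patch_size // 2
    r0, c0 = annotation[0] - half, annotation[1] - half
    patch = output[r0:r0 + patch_size, c0:c0 + patch_size]   # a view, not a copy

    # Principle (a): center the reference on the model's own peak inside the patch.
    peak = np.unravel_index(np.argmax(patch), patch.shape)

    # Principle (b): try several blob sizes and keep only the best match.
    errors = [
        np.linalg.norm(patch - gaussian_patch(patch_size, peak, sigma))
        for sigma in sigmas
    ]
    return float(min(errors))

# The model predicts a blob a few pixels away from the annotation at (64, 64).
rows = np.arange(128)[:, None]
cols = np.arange(128)[None, :]
shifted = np.exp(-((rows - 60) ** 2 + (cols - 66) ** 2) / (2.0 * 4.0 ** 2))
loss_shifted = detection_loss(shifted, annotation=(64, 64))
loss_blank = detection_loss(np.zeros((128, 128)), annotation=(64, 64))
```

In this example the shifted blob incurs no penalty, because the reference is re-centered on the prediction and one of the reference sizes matches, whereas an empty output is penalized.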
The background loss pushes the model to produce values close to zero in areas far away from annotations by evaluating the norm of the model output in the region outside of the detection patch (see Figure 2-E). In order to promote sensitivity, we relax the effect of the background loss by allowing for a small number of false positives in each image (c). This is achieved by masking out patches of the model output centered on the top-k highest-confidence blobs generated by the model (Figure 2-E presents an example of such a background mask). The background loss is then the norm of the element-wise product of the model output and this mask.
Finally, sensitivity is promoted by penalizing false positive marks less than false negative marks (d). This is achieved by down-weighting the background loss by a factor, so that the final loss is the sum of the detection loss and the down-weighted background loss.
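A sketch of the background term and of the combined loss is given below. Several details are assumptions, since the text does not fix them: the L2 norm, the greedy selection of the top-k blobs outside the detection patches, and the values of `k`, `patch_size`, and `background_weight` are all hypothetical.

```python
import numpy as np

def patch_mask(shape, centers, patch_size):
    """Mask that is 0 inside a square patch around each center and 1 elsewhere."""
    mask = np.ones(shape)
    half = patch_size // 2
    for r, c in centers:
        mask[max(r - half, 0):r + half, max(c - half, 0):c + half] = 0.0
    return mask

def top_k_peaks(output, k, patch_size):
    """Greedily pick the k strongest responses, suppressing a patch around each pick."""
    remaining = output.copy()
    half = patch_size // 2
    picks = []
    for _ in range(k):
        r, c = np.unravel_index(np.argmax(remaining), remaining.shape)
        picks.append((int(r), int(c)))
        remaining[max(r - half, 0):r + half, max(c - half, 0):c + half] = -np.inf
    return picks

def background_loss(output, annotations, k=2, patch_size=32):
    """Norm of the output outside detection patches, ignoring the top-k blobs (c)."""
    mask = patch_mask(output.shape, annotations, patch_size)
    free_passes = top_k_peaks(output * mask, k, patch_size)
    mask *= patch_mask(output.shape, free_passes, patch_size)
    return float(np.linalg.norm(output * mask))   # element-wise product with the mask

def total_loss(detection, background, background_weight=0.1):
    """Detection loss plus down-weighted background loss (d)."""
    return detection + background_weight * background

def blob(shape, center, sigma=3.0, amplitude=1.0):
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    sq_dist = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return amplitude * np.exp(-sq_dist / (2.0 * sigma ** 2))

# One annotated finding plus three spurious blobs of decreasing confidence.
shape, annotation = (128, 128), (64, 64)
output = (blob(shape, annotation) + blob(shape, (20, 20), amplitude=0.9)
          + blob(shape, (100, 30), amplitude=0.7) + blob(shape, (30, 100), amplitude=0.5))
loss_k0 = background_loss(output, [annotation], k=0)   # every false positive penalized
loss_k2 = background_loss(output, [annotation], k=2)   # two strongest tolerated
loss_k3 = background_loss(output, [annotation], k=3)   # all three tolerated
```

Raising `k` or lowering `background_weight` both trade additional false positive marks for sensitivity, which is the behaviour principles (c) and (d) aim for.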
A total of 197,587 screening mammography exams (832,390 FFDM images) spanning 62,156 patients were collected from an academic medical center in the United States. Each exam was interpreted by one of 11 radiologists with breast imaging experience ranging from 2 to 30 years. Annotations were collected as part of the routine clinical workflow. The intended clinical use case of these annotations did not require precise segmentations or rigorous definitions of the physical extent of a finding. Therefore, the tightness of an annotation to the finding’s boundary varies from case to case. The data encompass all types of mammographically significant findings that could be encountered in a screening setting, except for breast implants, which were excluded from both training and evaluation. Screening exams were associated with subsequent biopsy events by clinical staff for regulatory compliance purposes through dedicated mammography reporting software (Magview 7.1, Burtonsville, Maryland). This provided a structured way to directly link annotations on screening exams to pathology results from biopsies. Pathology cell type information was mapped to the labels of benign, high risk, or malignant by a fellowship-trained breast imaging radiologist.
The images were classified into four classes: (1) normal, no suspicious tissue was found by a radiologist; (2) benign, benign tissue was found during screening or biopsy; (3) high risk, tissue likely to develop into cancer was found during biopsy; (4) malignant, malignant tissue was found during biopsy. The model was trained to identify high risk (3) and malignant (4) findings as the positive cases, with normal (1) and benign (2) as the negative cases. Our analysis focused on the detection of biopsy-proven malignant findings. Therefore, though we considered high risk findings as positive examples in training to promote sensitivity, sensitivity was evaluated with malignant (4) as the positive class and all others as the negative class. Patients were randomly assigned to model training, validation, or testing according to an 80:10:10 split. These splits were made at the patient level, so there are no overlapping images, exams, or patients in the different datasets. Training, hyperparameter tuning, and model selection were completed using only the training and validation sets. The final performance was evaluated once on the test set after all models had been frozen. Statistics of the three data sets are reported in Table 3.
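A patient-level split of this kind can be sketched as below. The function names, the seed, and the synthetic patient and image identifiers are hypothetical; the point is only that the split is drawn over patients, and every image then inherits the split of its patient.

```python
import random

def split_patients(patient_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign unique patients to train/val/test according to `fractions`."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(round(fractions[0] * len(ids)))
    n_val = int(round(fractions[1] * len(ids)))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }

def assign_images(image_to_patient, splits):
    """Give each image the split of its patient, so no patient spans two splits."""
    patient_to_split = {
        patient: name for name, members in splits.items() for patient in members
    }
    return {
        image: patient_to_split[patient]
        for image, patient in image_to_patient.items()
    }

# Hypothetical data: 1,000 patients with four images each.
image_to_patient = {
    f"img_{p}_{i}": f"patient_{p}" for p in range(1000) for i in range(4)
}
splits = split_patients(image_to_patient.values())
image_split = assign_images(image_to_patient, splits)
```

Splitting over images or exams instead would let the same patient appear in both training and test data, which would inflate the measured performance.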
The model was trained on resized FFDM images that were then downsampled by a factor of 2.5 in each dimension. Training ran for 80 epochs using the Adam optimizer with a learning rate of 1e-4 and a fixed background-loss weight. One epoch consisted of 8,000 images evenly sampled among the three annotation types and the set of images without findings. Owing to the fully-convolutional architecture, the size of the input image can be different for each batch. Thus, in order to speed up experimentation, we trained the model in two phases. First, we pre-trained the model on fixed-size patches centered on image annotations, downsampled by 2.5, and then fine-tuned it on whole-image inputs, also downsampled by 2.5.
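The balanced epoch sampling can be sketched as follows, reading "evenly sampled" as an equal share of the 8,000 images for each of the four groups (2,000 each). Sampling with replacement is an assumption made so that small groups can still fill their share; the group names and sizes are hypothetical.

```python
import random
from collections import Counter

def sample_epoch(groups, epoch_size=8000, seed=0):
    """Draw an equal number of images from each group, with replacement, then shuffle."""
    rng = random.Random(seed)
    per_group = epoch_size // len(groups)
    epoch = []
    for images in groups.values():
        epoch.extend(rng.choices(images, k=per_group))   # with replacement
    rng.shuffle(epoch)
    return epoch

# Hypothetical image identifiers, stored as (group, index) pairs.
groups = {
    "benign": [("benign", i) for i in range(5000)],
    "high_risk": [("high_risk", i) for i in range(300)],
    "malignant": [("malignant", i) for i in range(600)],
    "no_finding": [("no_finding", i) for i in range(50000)],
}
epoch = sample_epoch(groups)
counts = Counter(group for group, _ in epoch)
```

Balancing the groups in this way exposes the model to rare malignant and high risk findings far more often than their natural frequency would.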
Table 4: Comparison with other CADe methods. FPI: false positive marks per image. Data types: M = masses, C = microcalcifications, D = architectural distortions, A = asymmetries.
| Method | Sensitivity | FPI | Data Type | Malignant/Total Cases | Dataset |
|---|---|---|---|---|---|
| Malich et al. (2001) [Malich2001] | 0.900 | 1.3 | M, C | 150/150 | Private |
| Petrick et al. (2002) [Petrick2002] | 0.870 | 1.5 | M | 156/156 | Private |
| Baker et al. (2003) [Baker2003] | 0.380 | 0.7 | D | 45/45 | Private |
| Dhungel et al. (2015) [Dhungel2015] | 0.75 | 4.8 | M | 40/40 | DDSM-BCRP |
| Morra et al. (2015) [Morra2015] | 0.890 | 2.7 | M, C | 123/175 | Private |
| Ribli et al. (2018) [Ribli2018] | 0.9 | 0.3 | M, C, D, A | 115/115 | INbreast |
| Teuwen et al. (2018) [Teuwen2018] | 0.969 | 3.6 | M, D, A | 1153/2878* | Private |
| Moor et al. (2018) [Moor2018] | 0.94 | 7.9 | M, D, A | 1153/2878* | Private |
| Agarwal et al. (2019) [Agarwal2019] | 0.980 | 1.7 | M | 211/223 | DDSM + INbreast |
| Ours | 0.990 | 4.8 | M, C, D, A | 168/19556 | Private |
The hypersensitive loss detector achieves a sensitivity for malignancies of 0.99 while generating only 4.8 false positive marks per image. The Hourglass model with the original loss achieves a sensitivity of only 0.42, showing that standard detection loss functions struggle on this challenging detection problem and highly imbalanced dataset. Additionally, we compare our approach to other CADe software in Table 4. Our approach is the only method that both is evaluated on a highly imbalanced dataset reflecting the natural distribution (0.86% malignant cases) and detects masses, microcalcifications, architectural distortions, and asymmetries. These two features are crucial for a clinically functional model, as they reflect what a model would encounter in an actual screening population, where only 0.51% of exams have cancer and exams can have any type of lesion [Lehman2017]. Finally, the experiment in Figure 5 demonstrates the positive effect on sensitivity of each of the four components of the loss described in Section 2.3.
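For concreteness, the two reported metrics can be computed from predicted marks as in the sketch below. The matching rule is an assumption, since the evaluation criterion is not spelled out here: a mark counts as a hit if it falls inside the box of a malignant finding, and every other mark counts as a false positive. The data structure and example values are hypothetical. The last line reproduces the 0.86% malignancy occurrence from the case counts in Table 4.

```python
def evaluate_marks(images):
    """Return (sensitivity over malignant findings, false positive marks per image).

    Each image is a dict with 'marks' as (row, col) tuples and 'malignant_boxes'
    as (row_min, col_min, row_max, col_max) tuples.
    """
    detected = total_findings = false_marks = 0
    for image in images:
        boxes = image["malignant_boxes"]
        hit = [False] * len(boxes)
        for r, c in image["marks"]:
            inside = [
                i for i, (r0, c0, r1, c1) in enumerate(boxes)
                if r0 <= r <= r1 and c0 <= c <= c1
            ]
            if inside:
                for i in inside:
                    hit[i] = True
            else:
                false_marks += 1       # mark outside every malignant box
        detected += sum(hit)
        total_findings += len(boxes)
    sensitivity = detected / total_findings if total_findings else float("nan")
    return sensitivity, false_marks / len(images)

images = [
    {"marks": [(15, 15), (50, 50)], "malignant_boxes": [(10, 10, 20, 20)]},  # 1 hit, 1 FP
    {"marks": [(5, 5), (6, 6)], "malignant_boxes": []},                      # 2 FP
    {"marks": [], "malignant_boxes": [(30, 30, 40, 40)]},                    # 1 miss
]
sensitivity, fpi = evaluate_marks(images)

malignant_cases, total_cases = 168, 19556          # "Ours" row of Table 4
prevalence_percent = round(100 * malignant_cases / total_cases, 2)
```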
This work presents a novel loss function for training mammography CADe models. By considering the specific properties of both the problem domain and the data, the design and optimization of this loss produce a hypersensitive detector with an acceptable false positive rate. Moreover, this loss function, though inspired by mammography CADe, has components directly applicable to many image-based detection tasks. Data with large variances in both segmentation quality and physical extent of the findings to be detected are exceedingly common, especially in the medical imaging domain. The proposed loss function makes no assumptions about the underlying model and can be used with any segmentation-based network architecture.
The low sensitivity of existing CADe systems limits their applicability in the radiologists’ interpretation workflow. Since malignant findings can be missed by a low-sensitivity CADe system, the human reader cannot rely on software-generated annotations. Therefore, existing CADe systems are used as second readers that bring the radiologist’s attention back to image areas that the algorithm considers suspicious. Unlike other CADe systems, however, our model achieves nearly perfect sensitivity and an acceptable rate of false positives on a dataset that reflects the natural distribution, with a 0.86% cancer occurrence rate and all the different types of lesions: masses, microcalcifications, architectural distortions, and asymmetries. Thus, the nearly perfect sensitivity to malignant findings enabled by the new loss function described in this work could potentially allow radiologists to safely ignore regions of the images not highlighted by the algorithm. We believe that these are important steps forward for improving, simplifying, and substantially expediting radiologists’ interpretations of mammograms.
5 Future Work
The high sensitivity of this detection model allows the detection of nearly all malignant findings, making it an ideal proposal generator for a two-stage detection system [NIPS2015_5638]. In the first stage, this detection model identifies the suspicious regions of each full image, ideally capturing all the malignant lesions with high sensitivity. In the second stage, we can train a classifier on image patches from the proposals generated by the first stage. This second model takes only these smaller patches as input and is trained specifically on the proposals, which are the most suspicious and challenging image areas to classify. Such a two-stage model could increase specificity while maintaining the high level of sensitivity.
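The hand-off between the two stages could look like the sketch below, which crops a fixed-size patch around each first-stage proposal for use by a second-stage classifier. The patch size, the zero padding at image borders, and the function name are assumptions; the classifier itself is left out because it is future work.

```python
import numpy as np

def crop_proposals(image, proposals, patch_size=64):
    """Return a (N, patch_size, patch_size) stack of patches centered on each proposal.

    `proposals` must be a non-empty list of (row, col) coordinates.
    """
    half = patch_size // 2
    padded = np.pad(image, half, mode="constant")   # zero padding on all sides
    # After padding, original pixel (r, c) sits at (r + half, c + half), so the
    # window starting at (r, c) in the padded image is centered on the proposal.
    return np.stack(
        [padded[r:r + patch_size, c:c + patch_size] for r, c in proposals]
    )

image = np.arange(100 * 80, dtype=float).reshape(100, 80)
proposals = [(0, 0), (50, 40)]      # hypothetical first-stage coordinates
patches = crop_proposals(image, proposals)
```

Because the first stage is tuned for sensitivity, most of these patches would be false positives, which is exactly the hard negative set a second-stage classifier would need to learn from.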
This work has not been submitted to any journal or conference for publication or presentation consideration.