A multi-site study of a breast density deep learning model for full-field digital mammography and digital breast tomosynthesis exams

01/23/2020 ∙ by Thomas P. Matthews, et al. ∙ 12

Purpose: To develop a Breast Imaging Reporting and Data System (BI-RADS) breast density deep learning (DL) model in a multi-site setting for synthetic 2D mammography (SM) images derived from 3D digital breast tomosynthesis (DBT) exams, using full-field digital mammography (FFDM) images and limited SM data. Materials and Methods: For this retrospective study, a DL model was trained to predict BI-RADS breast density using FFDM images acquired from 2008 to 2017 (Site 1: 57492 patients, 187627 exams, 750752 images). The FFDM model was evaluated using SM datasets from two institutions (Site 1: 3842 patients, 3866 exams, 14472 images, acquired from 2016 to 2017; Site 2: 7557 patients, 16283 exams, 63973 images, 2015 to 2019). Adaptation methods were investigated to improve performance on the SM datasets, and the effect of dataset size on each adaptation method was considered. Statistical significance was assessed using confidence intervals (CI), estimated by bootstrapping. Results: Without adaptation, the model demonstrated close agreement with the original reporting radiologists for all three datasets (Site 1 FFDM: linearly-weighted κ_w = 0.75, 95% CI: [0.74, 0.76]; Site 1 SM: κ_w = 0.71, CI: [0.64, 0.78]; Site 2 SM: κ_w = 0.72, CI: [0.70, 0.75]). With adaptation using only 500 SM images from each site, performance improved substantially for Site 2 and only slightly for Site 1 (Site 1: κ_w = 0.72, CI: [0.66, 0.79]; Site 2: κ_w = 0.79, CI: [0.76, 0.81]). Conclusion: A BI-RADS breast density DL model demonstrated strong performance on FFDM and SM images from two institutions without training on SM images, and improved further using few SM images.


1 Introduction

Breast density is an important risk factor for breast cancer [12, 19, 2] and areas of higher density can mask findings within mammograms leading to lower sensitivity [18]. Many states have passed breast density notification laws requiring clinics to inform women of their density [14]. Radiologists typically assess breast density using the Breast Imaging Reporting and Data System (BI-RADS®) lexicon, which divides breast density into four categories: A, almost entirely fatty; B, scattered areas of fibroglandular density; C, heterogeneously dense; and D, extremely dense (examples are presented in Figure 1) [25]. Unfortunately, radiologists exhibit intra- and inter-reader variability in the assessment of BI-RADS breast density, which can result in differences in clinical care and estimated risk [28, 27, 1].

Figure 1: Example synthetic 2D mammography (SM) images derived from digital breast tomosynthesis (DBT) exams for each of the four Breast Imaging Reporting and Data System (BI-RADS) breast density categories: (a) A, almost entirely fatty, (b) B, scattered areas of fibroglandular density, (c) C, heterogeneously dense, and (d) D, extremely dense. Images are normalized so that the grayscale intensity windows found in their Digital Imaging and Communications in Medicine (DICOM) headers range from 0.0 to 1.0.

Deep learning (DL) has previously been employed to assess BI-RADS breast density for film [32] and full-field digital mammography (FFDM) images [8, 21, 17, 16, 30] with some models demonstrating closer agreement with consensus estimates than individual radiologists [16]. To realize the promise of using these DL models in clinical practice, two key challenges must be met. First, as breast cancer screening is increasingly moving to digital breast tomosynthesis (DBT) [24] due to improved reader performance [7, 26, 23], DL models should be compatible with DBT exams. Figure 2 shows the differences in image characteristics between 2D images for FFDM and DBT exams. However, the relatively recent adoption of DBT at many institutions means that the datasets available for training DL models are often fairly limited for DBT exams compared with FFDM exams. Second, DL models must offer consistent performance across sites, where differences in imaging technology, patient demographics, or assessment practices could impact model performance. To be practical, this should be achieved while requiring little additional data from each site.

Figure 2: Comparison between (a) a full-field digital mammography (FFDM) image and (b) a synthetic 2D mammography (SM) image of the same breast under the same compression. A zoomed-in region, whose original location is denoted by the white box, is shown for both (c) the FFDM image and (d) the SM image to highlight the differences in texture and contrast that can occur between the two image types. Images are normalized so that the grayscale intensity windows found in their Digital Imaging and Communications in Medicine (DICOM) headers range from 0.0 to 1.0.

In this work, we present a BI-RADS breast density DL model that offers close agreement with the original reporting radiologists for both FFDM and DBT exams at two institutions. A DL model is first trained to predict BI-RADS breast density using a large-scale FFDM dataset from one institution. Then, the model is evaluated on a test set of FFDM exams as well as synthetic 2D mammography (SM) images generated as part of DBT exams (C-View, Hologic, Inc., Marlborough, MA), acquired from the same institution and from a separate institution. Adaptation techniques, requiring few SM images, are explored to improve performance on the two SM datasets.

2 Materials and Methods

This retrospective study was approved by an institutional review board for each of the two sites where data were collected (Site 1: internal institutional review board, Site 2: Western Institutional Review Board, Puyallup, WA). Informed consent was waived and all data were handled according to the Health Insurance Portability and Accountability Act.

2.1 Datasets

Mammography exams were collected from two sites: Site 1, an academic medical center located in the mid-western region of the United States, and Site 2, an out-patient radiology clinic located in northern California (Site 1 FFDM: 187627 exams, acquired from 2008 to 2017; Site 1 SM: 3866 exams, 2016 to 2017; Site 2 SM: 16283 exams, 2015 to 2019). The exams were interpreted by one of 11 radiologists with breast imaging experience ranging from 2 to 30 years for Site 1 and by one of 9 radiologists with experience ranging from 10 to 41 years for Site 2. The BI-RADS breast density assessments of the radiologists were obtained from each site’s mammography reporting software (Site 1: Magview 7.1, Magview, Burtonsville, Maryland; Site 2: MRS 7.2.0, MRS Systems Inc., Seattle, Washington). Patients were randomly selected for training (FFDM: 50700, 88%; Site 1 SM: 3169, 82%; Site 2 SM: 6056, 80%), validation (FFDM: 1832, 3%; Site 1 SM: 403, 10%; Site 2 SM: 757, 10%), or testing (FFDM: 4960, 9%; Site 1 SM: 270, 7%; Site 2 SM: 744, 10%). Since the split was performed at the patient level, the images for a given patient appear in only one of these sets. All exams with a BI-RADS breast density assessment were included. For the test sets, exams were required to have all four standard screening mammography images (the mediolateral oblique and craniocaudal views of both breasts). The distributions of the BI-RADS breast density assessments for each set are presented in Table 1 (Site 1) and Table 2 (Site 2).
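The patient-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the fixed seed, the 80/10/10 fractions, and the toy patient and image identifiers are all assumptions made for the example.

```python
import random


def split_patients(patient_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign each unique patient to train, val, or test."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }


def assign_images(images, splits):
    """Send every (patient_id, image_id) pair to its patient's split."""
    lookup = {pid: name for name, members in splits.items() for pid in members}
    routed = {name: [] for name in splits}
    for patient_id, image_id in images:
        routed[lookup[patient_id]].append((patient_id, image_id))
    return routed


# Toy data (hypothetical): 10 patients with 4 images each.
images = [(p, f"img{p}_{v}") for p in range(10) for v in range(4)]
splits = split_patients([p for p, _ in images])
routed = assign_images(images, splits)
```

Because the assignment is made per patient and images simply follow their patient, no patient can contribute images to more than one set.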

The two sites serve different patient populations. The patient cohort from Site 1 is 59% Caucasian (34192/58397), 23% African American (13201/58397), 3% Asian (1630/58397), and 1% Hispanic (757/58397) while Site 2 is 58% Caucasian (4350/7557), 1% African American (110/7557), 21% Asian (1594/7557), and 7% Hispanic (522/7557). The distribution of ages is similar for the two sites (Site 1: 54.8 ± 15.7 yr, Site 2: 55.7 ± 11.2 yr).

           FFDM Train       FFDM Val         FFDM Test        SM Train         SM Val           SM Test
Patients   50700            1832             4960             3169             403              270
Exams      168208           6157             13262            3189             407              270
Images     672704           25000            53048            11873            1519             1080
BI-RADS A  80459 (12.0%)    3465 (13.9%)     4948 (9.3%)      1160 (9.8%)      154 (10.1%)      96 (8.9%)
BI-RADS B  348878 (51.9%)   12925 (51.7%)    27608 (52.0%)    6121 (51.6%)     771 (50.8%)      536 (49.6%)
BI-RADS C  214465 (31.9%)   7587 (30.3%)     18360 (34.6%)    3901 (32.9%)     510 (33.6%)      388 (35.9%)
BI-RADS D  28902 (4.3%)     1023 (4.1%)      2132 (4.0%)      691 (5.8%)       84 (5.5%)        60 (5.6%)
Table 1: Description of the Site 1 full-field digital mammography (FFDM) and synthetic 2D mammography (SM) training (train), validation (val), and test (test) datasets. The total numbers of patients, exams, and images are given for each dataset. The number of images in each of the four Breast Imaging Reporting and Data System (BI-RADS) breast density categories is also provided.

           Train            Val              Test
Patients   6056             757              744
Exams      13061            1674             1548
Images     51241            6540             6192
BI-RADS A  7866 (15.4%)     865 (13.2%)      948 (15.3%)
BI-RADS B  20731 (40.5%)    2719 (41.6%)     2612 (42.2%)
BI-RADS C  15706 (30.7%)    2139 (32.7%)     1868 (30.2%)
BI-RADS D  6938 (13.5%)     817 (12.5%)      764 (12.3%)
Table 2: Description of the Site 2 synthetic 2D mammography (SM) training (train), validation (val), and test (test) datasets. The total numbers of patients, exams, and images are given for each dataset. The number of images in each of the four Breast Imaging Reporting and Data System (BI-RADS) breast density categories is also provided.

2.2 Deep Learning Model

The DL model and training procedure were implemented using the PyTorch DL framework (pytorch.org, version 1.0). The base model architecture is a pre-activation Resnet-34 [11, 10, 31], which accepts as input a single image corresponding to one of the views from a mammography exam, and produces estimated probabilities that the image belongs to each of the BI-RADS breast density categories. The model was trained using the FFDM dataset following the procedure described in Appendix A.

2.3 Domain Adaptation Methods

The goal of domain adaptation is to take a model trained on a dataset from one domain (source domain) and transfer its knowledge to a dataset in another domain (target domain), which is typically much smaller in size. Features learned by DL models in the early layers can be general, i.e. domain and task agnostic [33]. Depending on the similarity of domains and tasks, even deeper features learned from one domain can be reused for another domain or task.

In this work, we explore approaches for adapting the DL model trained on FFDM images (source domain) to SM images (target domain) that reuse all the features learned from the FFDM domain. First, inspired by the work of Guo et al. [9], we consider the addition of a small linear layer following the final fully-connected layer where either the 4 × 4 matrix is diagonal (vector calibration) or the 4 × 4 matrix is allowed to freely vary (matrix calibration). Second, we retrain the final fully-connected layer of the Resnet-34 model on samples from the target domain (fine-tuning).
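To make the two calibration variants concrete, the sketch below applies a small linear layer to the four class logits and then a softmax. It is an illustration under assumptions: the function names and the example logits, scales, and biases are invented, and the presence of a bias term is inferred from the parameter counts given in Section 3.2.3 (8 for vector calibration, 20 for matrix calibration) rather than stated in this section.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate(logits, weight, bias):
    """Apply a 4 x 4 linear layer (weight, bias) to the logits, then softmax."""
    adjusted = [
        sum(w * z for w, z in zip(row, logits)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(adjusted)


def diagonal(scales):
    """Build a diagonal matrix; this is the vector-calibration case."""
    n = len(scales)
    return [[scales[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


# Made-up logits for one image, ordered as BI-RADS A, B, C, D.
logits = [0.2, 1.5, 1.1, -0.7]

# Identity weights and zero bias leave the base model's output unchanged.
unchanged = calibrate(logits, diagonal([1.0] * 4), [0.0] * 4)

# Vector calibration: 4 diagonal scales + 4 biases (hypothetical values that
# shift probability toward the dense classes C and D).
vector_probs = calibrate(logits, diagonal([1.0, 1.0, 1.2, 1.2]),
                         [0.0, 0.0, 0.3, 0.3])

# Matrix calibration would pass a full 4 x 4 weight (16 entries + 4 biases).
```

In both variants the features and the original fully-connected layer stay frozen; only the small linear layer is fitted to the target domain.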

To investigate the impact of the target domain dataset size, the adaptation techniques were repeated for different SM training sets across a range of sizes. The adaptation process was repeated 10 times for each dataset size with different training data in order to investigate the uncertainty arising from the selection of the training data. For each realization, the training images were randomly selected, without replacement, from the full training set. As a reference, a Resnet-34 model was trained from scratch, i.e. random initialization, for the largest number of training samples for each SM dataset.
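The repeated subsampling can be sketched as below. The dataset sizes and the integer image identifiers are placeholders, and drawing each realization independently of the others is an assumption; the text specifies only that images within a realization are drawn without replacement and that 10 realizations are used per size. The helper for the standard error of the mean reflects how the spread over realizations is summarized in Figure 6.

```python
import random


def draw_realizations(train_ids, sizes, n_repeats=10, seed=0):
    """For each dataset size, draw n_repeats random training subsets."""
    rng = random.Random(seed)
    return {
        # random.sample draws without replacement within one subset
        size: [rng.sample(train_ids, size) for _ in range(n_repeats)]
        for size in sizes
    }


def mean_and_sem(values):
    """Mean and standard error of the mean over the realizations."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, (variance / n) ** 0.5


# Placeholder identifiers standing in for the Site 1 SM training images.
train_ids = list(range(11873))
realizations = draw_realizations(train_ids, sizes=[100, 500, 2000])
```

Each subset would then be used to fit one adapted model, and the metric of interest would be passed through `mean_and_sem` across the 10 fits.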

Further details on these methods are provided in Appendix A.

2.4 Statistical Analysis

To obtain an exam-level assessment, each image within an exam was processed by the DL model and the resulting probabilities were averaged. Several metrics were computed from these average probabilities for the 4-class BI-RADS breast density task and the binary dense (BI-RADS C+D) vs. non-dense (BI-RADS A+B) task: (1) accuracy, estimated based on concordance with the original reporting radiologists, (2) the area under the receiver operating characteristic curve (AUC), and (3) Cohen’s kappa (https://scikit-learn.org, version 0.20.0). Confidence intervals (CI) were computed by use of non-Studentized pivotal bootstrapping of the test sets for 8000 random samples [4]. For the 4-class problem, macroAUC (the average of the four AUC values from the one vs. others tasks) and Cohen’s kappa with linear weighting are reported. For the binary density tasks, the predicted dense and non-dense probabilities were computed by summing the probabilities for the corresponding BI-RADS density categories.
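The following pure-Python sketch walks through the exam-level averaging, the dense probability, linearly weighted Cohen's kappa, and a non-Studentized pivotal (basic) bootstrap interval of the form [2θ̂ − q_hi, 2θ̂ − q_lo]. It is a stand-in written for illustration: the study used scikit-learn for kappa, and the function names, toy labels, quantile indexing, and reduced number of bootstrap samples in the usage lines are assumptions.

```python
import random


def exam_probabilities(image_probs):
    """Average the per-image class probabilities of one exam."""
    n = len(image_probs)
    return [sum(p[k] for p in image_probs) / n for k in range(4)]


def dense_probability(exam_probs):
    """Dense = BI-RADS C + D (indices 2 and 3)."""
    return exam_probs[2] + exam_probs[3]


def linear_kappa(y_true, y_pred, n_classes=4):
    """Cohen's kappa with linear weights w_ij = |i - j|."""
    n = len(y_true)
    observed = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    true_freq = [y_true.count(k) / n for k in range(n_classes)]
    pred_freq = [y_pred.count(k) / n for k in range(n_classes)]
    expected = sum(
        abs(i - j) * true_freq[i] * pred_freq[j]
        for i in range(n_classes) for j in range(n_classes)
    )
    if expected == 0.0:
        # Undefined when both raters use one identical class; treating this
        # as perfect agreement is an assumption of this sketch.
        return 1.0
    return 1.0 - observed / expected


def pivotal_bootstrap_ci(y_true, y_pred, n_boot=8000, alpha=0.05, seed=0):
    """Basic bootstrap CI: reflect the bootstrap quantiles about the estimate."""
    rng = random.Random(seed)
    n = len(y_true)
    estimate = linear_kappa(y_true, y_pred)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample exams
        stats.append(linear_kappa([y_true[i] for i in idx],
                                  [y_pred[i] for i in idx]))
    stats.sort()
    q_lo = stats[int((alpha / 2) * n_boot)]
    q_hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return 2 * estimate - q_hi, 2 * estimate - q_lo


# Toy exam with two images (made-up probabilities for A, B, C, D).
exam = exam_probabilities([[0.1, 0.6, 0.2, 0.1], [0.3, 0.4, 0.2, 0.1]])

# Toy labels (0 = A, ..., 3 = D) for 40 hypothetical exams.
y_true = [0, 1, 1, 2, 2, 3, 1, 2] * 5
y_pred = [0, 1, 2, 2, 2, 3, 1, 1] * 5
kappa = linear_kappa(y_true, y_pred)
ci_low, ci_high = pivotal_bootstrap_ci(y_true, y_pred, n_boot=500)
```

Resampling is done over exams rather than images, matching the exam-level metrics reported in the results.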

3 Results

3.1 Performance on FFDM Exams

The trained model was first evaluated on a large held-out test set of FFDM exams from Site 1 (4960 patients, 13262 exams, 53048 images, mean age: 56.9, age range: 23-97). In this case, the images were from the same institution and of the same image type (FFDM) as employed to train the model. The BI-RADS breast density distribution predicted by the DL model (A: 8.5%, B: 52.2%, C: 36.1%, D: 3.2%) was similar to that of the original reporting radiologists (A: 9.3%, B: 52.0%, C: 34.6%, D: 4.0%). The DL model exhibited close agreement with the radiologists for the BI-RADS breast density task across a variety of performance measures (see Table 3), including accuracy (82.2%, 95% CI: [81.6%, 82.9%]) and linearly-weighted Cohen’s kappa (κ_w = 0.75, CI: [0.74, 0.76]). A high level of agreement was also observed for the binary breast density task (accuracy = 91.1%, CI: [90.6%, 91.6%]; AUC = 0.971, CI: [0.968, 0.973]; κ = 0.81, CI: [0.80, 0.82]). As demonstrated by the confusion matrices shown in Figure 3, the DL model is rarely off by more than one breast density category (e.g. calls an extremely dense breast scattered; 0.03%, 4/13262).

To place the results in the context of prior work, the performance on the FFDM test set is compared with results evaluated on other large FFDM datasets acquired from academic centers [16, 30] and with commercial breast density software [3] (see Table 3). While there are limitations to comparing results evaluated on different test sets (see Section 4), our FFDM DL model appears to offer competitive performance.

Model               4-class Accuracy   4-class macroAUC      4-class Linear κ_w  Binary Accuracy    Binary AUC            Binary κ
Ours                82.2 [81.6, 82.9]  0.952 [0.949, 0.954]  0.75 [0.74, 0.76]   91.1 [90.6, 91.6]  0.971 [0.968, 0.973]  0.81 [0.80, 0.82]
Lehman et al. [16]  77 [76, 78]        -                     0.67 [0.66, 0.68]   87 [86, 88]        -                     -
Wu et al. [30]      76.7               0.916                 -                   86.5               -                     0.65
Volpara v1.5.0 [3]  57                 -                     0.57 [0.55, 0.59]   78                 -                     0.64 [0.61, 0.66]
Quantra v2.0 [3]    56                 -                     0.46 [0.44, 0.47]   83                 -                     0.59 [0.57, 0.62]
Table 3: Performance of the baseline model on the test set for full-field digital mammography (FFDM) exams for both the 4-class Breast Imaging Reporting and Data System (BI-RADS) breast density task and binary density task (dense, BI-RADS C+D vs. non-dense, BI-RADS A+B). 95% confidence intervals are given in brackets. Results from prior works are shown evaluated on their respective test sets as points of comparison.
Figure 3: Confusion matrices for the (a) Breast Imaging Reporting and Data System (BI-RADS) breast density task and the (b) binary density task (dense, BI-RADS C+D vs. non-dense, BI-RADS A+B) evaluated on the full-field digital mammography (FFDM) test set. The number of test samples (exams) within each bin is shown in parentheses.

3.2 Performance on DBT Exams

3.2.1 Site 1 Results

Results are first reported for the Site 1 SM test set (270 patients, 270 exams, 1080 images, mean age: 54.6, age range: 28-72) as this avoids any differences that may occur between the two sites. Without adaptation, the model still demonstrates close agreement with the original reporting radiologists for the BI-RADS breast density task (accuracy = 79%, CI: [74%, 84%]; κ_w = 0.71, CI: [0.64, 0.78]; see Table 4). The DL model slightly underestimates breast density for SM images (see Figure 4), producing a BI-RADS breast density distribution (A: 10.4%, B: 57.8%, C: 28.9%, D: 3.0%) with more non-dense cases and fewer dense cases relative to the radiologists (A: 8.9%, B: 49.6%, C: 35.9%, D: 5.6%). Agreement for the binary density task is also quite high without adaptation (accuracy = 88%, CI: [84%, 92%]; κ = 0.75, CI: [0.67, 0.83]; AUC = 0.97, CI: [0.96, 0.99]).

Datasets  Method     4-class Accuracy   4-class macroAUC      4-class Linear κ_w  Binary Accuracy    Binary AUC            Binary κ
MM        -          82.2               0.952                 0.75                91.1               0.971                 0.81
MM → S1   None       79 [74, 84]        0.94 [0.93, 0.96]     0.71 [0.64, 0.78]   88 [84, 92]        0.97 [0.96, 0.99]     0.75 [0.67, 0.83]
MM → S1   Vector     81 [77, 86]        0.95 [0.94, 0.97]     0.73 [0.67, 0.80]   90 [87, 94]        0.97 [0.96, 0.99]     0.80 [0.73, 0.88]
MM → S1   Matrix     80 [76, 85]        0.95 [0.94, 0.97]     0.72 [0.66, 0.79]   91 [88, 95]        0.97 [0.96, 0.99]     0.82 [0.76, 0.90]
MM → S1   Fine-tune  81 [76, 86]        0.95 [0.94, 0.97]     0.73 [0.67, 0.80]   90 [87, 94]        0.97 [0.95, 0.99]     0.80 [0.73, 0.88]
MM → S2   None       76 [74, 78]        0.944 [0.938, 0.951]  0.72 [0.70, 0.75]   92 [91, 93]        0.980 [0.976, 0.986]  0.84 [0.81, 0.87]
MM → S2   Vector     79 [77, 81]        0.954 [0.949, 0.961]  0.78 [0.76, 0.80]   92 [91, 93]        0.979 [0.974, 0.985]  0.83 [0.80, 0.86]
MM → S2   Matrix     80 [78, 82]        0.956 [0.950, 0.963]  0.79 [0.76, 0.81]   92 [91, 94]        0.983 [0.978, 0.988]  0.84 [0.82, 0.87]
MM → S2   Fine-tune  80 [78, 82]        0.957 [0.952, 0.964]  0.79 [0.77, 0.81]   93 [92, 94]        0.984 [0.979, 0.988]  0.85 [0.83, 0.88]
Table 4: Performance of the proposed approaches for adapting a deep learning (DL) model trained on one dataset to another with only 500 synthetic 2D mammography (SM) images. The datasets are denoted as MM for the full-field digital mammography (FFDM) dataset, S1 for the Site 1 SM dataset, and S2 for the Site 2 SM dataset. The performance of the model trained from scratch on the FFDM dataset (672k training samples) and evaluated on its test set is also shown as a reference. 95% confidence intervals, computed by bootstrapping over the test sets, are given in brackets.

After adaptation by matrix calibration with 500 Site 1 SM images, the density distribution is more similar to that of the radiologists (A: 5.9%, B: 53.7%, C: 35.9%, D: 4.4%), while overall agreement is about the same (accuracy = 80%, CI: [76%, 85%]; κ_w = 0.72, CI: [0.66, 0.79]). Accuracy for the two dense classes is improved at the expense of the two non-dense classes (see Figure 4). A larger improvement is seen for the binary density task, where Cohen’s kappa rose from 0.75 [0.67, 0.83] to 0.82 [0.76, 0.90] (accuracy = 91%, CI: [88%, 95%]; AUC = 0.97, CI: [0.96, 0.99]).

Figure 4: Confusion matrices, evaluated on the Site 1 SM test set, for the (a) Breast Imaging Reporting and Data System (BI-RADS) breast density task and (b) the binary density task (dense, BI-RADS C+D vs. non-dense, BI-RADS A+B) without adaptation and for the (c) BI-RADS breast density task and (d) the binary density task (dense vs. non-dense) with adaptation by matrix calibration for 500 training samples. The number of test samples (exams) within each bin is shown in parentheses.

3.2.2 Site 2 Results

Close agreement between the DL model and the original reporting radiologists was also observed for the Site 2 SM test set (744 patients, 1548 exams, 6192 images, mean age: 55.2, age range: 30-92) without adaptation (accuracy = 76%, CI: [74%, 78%]; κ_w = 0.72, CI: [0.70, 0.75]; see Table 4). The BI-RADS breast density distribution predicted by the DL model (A: 5.7%, B: 48.8%, C: 36.4%, D: 9.1%) was similar to the distribution found in the Site 1 datasets. The model could have learned a prior density distribution from the Site 1 FFDM dataset that may not be optimal for Site 2 where patient demographics are different. The predicted density distribution does not appear to be skewed towards low density estimates as seen for Site 1 (see Figure 5). Agreement for the binary density task was especially strong (accuracy = 92%, CI: [91%, 93%]; κ = 0.84, CI: [0.81, 0.87]; AUC = 0.980, CI: [0.976, 0.986]).

Figure 5: Confusion matrices, evaluated on the Site 2 SM test set, for the (a) Breast Imaging Reporting and Data System (BI-RADS) breast density task and (b) the binary density task (dense, BI-RADS C+D vs. non-dense, BI-RADS A+B) without adaptation and for the (c) BI-RADS breast density task and (d) the binary density task (dense vs. non-dense) with adaptation by matrix calibration for 500 training samples. The number of test samples (exams) within each bin is shown in parentheses.

With adaptation by matrix calibration with 500 Site 2 training samples, performance for the BI-RADS breast density task on the Site 2 SM dataset substantially improved (accuracy = 80%, CI: [78%, 82%]; κ_w = 0.79, CI: [0.76, 0.81]). After adaptation, the predicted BI-RADS breast density distribution (A: 16.9%, B: 43.3%, C: 29.4%, D: 10.4%) was more similar to that of the radiologists (A: 15.3%, B: 42.2%, C: 30.2%, D: 12.3%). Less improvement was seen for the binary breast density task (accuracy = 92%, CI: [91%, 94%]; κ = 0.84, CI: [0.82, 0.87]; AUC = 0.983, CI: [0.978, 0.988]).

3.2.3 Impact of Dataset Size on Adaptation

The preferred adaptation method will depend on the number of training samples available for the adaptation, with more training samples benefiting methods with more parameters. Figure 6 shows the impact of the amount of training data on the performance of the adaptation methods, as measured by linearly weighted Cohen’s kappa and macroAUC, for both the Site 1 and Site 2 SM datasets. Each adaptation method has a range of number of samples where it offers the best performance, with the regions ordered by the corresponding number of parameters for the adaptation methods (vector calibration: 8 parameters, matrix calibration: 20, fine-tuning: 2052). This demonstrates the trade-off between the performance of the adaptation method and the amount of new training data that must be acquired. When the number of training samples is very small (e.g. 100 images), some adaptation methods negatively impact performance. Even at the largest dataset sizes, the amount of training data was too limited for the Resnet-34 model trained from scratch on SM images to exceed the performance of the models adapted from FFDM.
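The parameter counts quoted above can be reproduced with simple arithmetic, assuming each adaptation layer has four outputs plus a bias and that the final Resnet-34 feature vector has 512 entries (the standard width for that architecture). The breakdown below is an inference that happens to match the stated totals, not a layout given in the text.

```python
N_CLASSES = 4     # BI-RADS A, B, C, D
N_FEATURES = 512  # assumed input width of the final fully-connected layer


def vector_calibration_params(n_classes):
    """Diagonal of the n x n matrix plus one bias per class."""
    return n_classes + n_classes


def matrix_calibration_params(n_classes):
    """Full n x n matrix plus one bias per class."""
    return n_classes * n_classes + n_classes


def fine_tuning_params(n_features, n_classes):
    """Weights and biases of the retrained fully-connected layer."""
    return n_features * n_classes + n_classes


counts = {
    "vector calibration": vector_calibration_params(N_CLASSES),
    "matrix calibration": matrix_calibration_params(N_CLASSES),
    "fine-tuning": fine_tuning_params(N_FEATURES, N_CLASSES),
}
```

The two-orders-of-magnitude gap between calibration and fine-tuning is what makes the calibration methods attractive when only a few hundred target-domain images are available.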

Figure 6: Impact of the number of training samples in the target domain on the performance of the adapted model for the Site 1 synthetic 2D mammography (SM) test set, as measured by (a) macroAUC and (b) linearly weighted Cohen’s kappa, and for the Site 2 SM test set, as measured by (c) macroAUC and (d) linearly weighted Cohen’s kappa. Results are shown for vector and matrix calibration, and retraining the last fully-connected layer (fine-tuning). Error bars indicate the standard error of the mean computed over 10 random realizations of the training data. Performance prior to adaptation (none) and training from scratch are shown as references. For the Site 1 SM studies, the full-field digital mammography (FFDM) performance serves as an additional reference. Note that each graph is shown with its own full dynamic range in order to facilitate comparison of the different adaptation methods for a given metric and dataset.

4 Discussion

Breast Imaging Reporting and Data System (BI-RADS) breast density can be an important indicator of breast cancer risk and radiologist sensitivity, but intra- and inter-reader variability may limit the effectiveness of this measure. Deep learning (DL) models for estimating breast density can reduce this variability while still providing accurate assessments. However, in order to serve as a useful clinical tool, DL models need to demonstrate that they can be applied to digital breast tomosynthesis (DBT) exams and generalize across institutions. To overcome the limited training data for DBT exams, a DL model was trained on a large set of full-field digital mammography (FFDM) images. The model showed close agreement with the radiologists’ reported BI-RADS breast density for a test set of FFDM images (Site 1: κ_w = 0.75, 95% confidence interval (CI): [0.74, 0.76]) and for two datasets of synthetic 2D mammography (SM) images, which are generated as part of DBT exams (Site 1: κ_w = 0.71, CI: [0.64, 0.78]; Site 2: κ_w = 0.72, CI: [0.70, 0.75]). The strong performance on the SM datasets from different institutions suggests that the DL model may generalize to DBT exams and multiple sites. Further adaptation of the model for the SM datasets led to a limited improvement for Site 1 (κ_w = 0.72, CI: [0.66, 0.79]) and a more substantial improvement for Site 2 (κ_w = 0.79, CI: [0.76, 0.81]). The investigation of the impact of dataset size suggests that these adaptation methods could serve as practical approaches for adapting deep learning models if a model must be updated to account for site-specific differences.

When radiologists’ assessments are accepted as the ground truth, inter-reader variability may limit the performance that can be achieved for a given dataset. For example, the performance obtained on the Site 2 SM dataset following adaptation was higher than that obtained on the FFDM dataset used to train the model. This is likely a result of limited inter-reader variability for the Site 2 SM dataset due to over 80% of the exams having been read by only two readers.

Unlike previous works, our BI-RADS breast density DL model was evaluated on SM images from DBT exams and on data from multiple institutions. Further, as shown in Section 3.1, when evaluated on the FFDM images, the model appeared to offer competitive performance to previous DL models and commercial breast density software (κ_w = 0.75, CI: [0.74, 0.76] vs. Lehman et al. 0.67, CI: [0.66, 0.68]; Volpara 0.57, CI: [0.55, 0.59]; Quantra 0.46, CI: [0.44, 0.47]) [16, 3]. For each work, results are reported on their respective test sets, which may be more or less challenging due to varying levels of inter-reader variability or other factors.

Other measures of breast density, such as volumetric breast density, have been previously estimated by automated software for DBT exams [29, 22, 6]. Thresholds can be chosen to translate these measures to BI-RADS breast density, but this may result in lower levels of agreement than direct estimation of BI-RADS breast density (e.g. [6]). Here, BI-RADS breast density is estimated from 2D SM images instead of the 3D tomosynthesis volumes as this simplifies transfer learning from the FFDM images and mirrors the manner in which breast radiologists assess density.

This study has several limitations. First, the proposed domain adaptation approaches may be less effective when the differences between domains are larger. In this work, adaptation is between two types of mammography images produced by the same manufacturer. Second, the data from Site 1 were collected over a time period covering the transition from BI-RADS version 4 to BI-RADS version 5, during which the criteria for assessing BI-RADS breast density changed. Third, when a DL model is adapted to a new institution, adjustments may be made for differences in image content, patient demographics, or the interpreting radiologists. This last adjustment may result in a degree of inter-reader variability between the original and adapted DL models, though likely lower than the individual inter-reader variability if the model learns the consensus of each group of radiologists. As a result, the improved performance following adaptation for the Site 2 SM dataset could be due to differences in patient demographics or radiologist assessment practices compared with the FFDM dataset. The weaker improvement for the Site 1 SM dataset could be due to similarities in these same factors.

Still, the broad use of Breast Imaging Reporting and Data System (BI-RADS) breast density deep learning (DL) models holds great promise for improving clinical care. The success of the DL model without adaptation suggests that the features learned by the model are largely applicable to both full-field digital mammography (FFDM) images and synthetic 2D mammography (SM) images from digital breast tomosynthesis (DBT) exams as well as to different readers and institutions. A BI-RADS breast density DL model that can generalize across sites and image types could lead to fast, low-cost, and more consistent estimates of breast density for women.

Appendix A Training Procedure

The deep learning (DL) model, described in Section 2.2, was a pre-activation Resnet-34 network, where the batch normalization layers were replaced with group normalization layers [11, 10, 31]. It was trained using the full-field digital mammography (FFDM) dataset (see Table 1) by use of the Adam optimizer [13] with a learning rate of [...] and a weight decay of [...]. Weight decay was not applied to the parameters belonging to the normalization layers. The input was resized to 416 × 320 pixels and the pixel intensity values were normalized so that the grayscale window denoted in the Digital Imaging and Communications in Medicine (DICOM) header ranged from 0.0 to 1.0. Training was performed using mixed precision [20] and gradient checkpointing [5] with batch sizes of 256 distributed across two NVIDIA GTX 1080 Ti graphics processing units (Santa Clara, CA). Each batch was sampled such that the probability of selecting a BI-RADS B or BI-RADS C sample was four times that of selecting a BI-RADS A or BI-RADS D sample, which roughly corresponds to the distribution of densities observed nationally in the United States [15]. Horizontal and vertical flipping were employed for data augmentation. In order to obtain more frequent information on the training progress, epochs were capped at 100k samples compared with a total training set size of over 672k samples. The model was trained for 100 such epochs. Results are reported for the epoch that had the lowest cross entropy loss on the validation set, which occurred after 93 epochs.
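The class-weighted batch sampling can be sketched as below, reading the 4:1 ratio as per-class selection probabilities of 0.1, 0.4, 0.4, and 0.1 for BI-RADS A, B, C, and D. That reading, the two-stage draw (class first, then a sample within the class), and all names and toy data are assumptions made for illustration rather than the authors' implementation.

```python
import random
from collections import Counter

# Relative selection weights: B and C are four times as likely as A and D.
CLASS_WEIGHTS = {"A": 1.0, "B": 4.0, "C": 4.0, "D": 1.0}


def class_probabilities(class_weights):
    """Normalize relative weights into selection probabilities."""
    total = sum(class_weights.values())
    return {c: w / total for c, w in class_weights.items()}


def sample_batch(samples_by_class, batch_size, rng):
    """Draw a batch: pick a class by weight, then a sample within that class."""
    classes = list(CLASS_WEIGHTS)
    weights = [CLASS_WEIGHTS[c] for c in classes]
    picked = rng.choices(classes, weights=weights, k=batch_size)
    return [(c, rng.choice(samples_by_class[c])) for c in picked]


# Toy pool (hypothetical): five image identifiers per density class.
samples_by_class = {c: [f"{c}_{i}" for i in range(5)] for c in CLASS_WEIGHTS}
rng = random.Random(0)
probs = class_probabilities(CLASS_WEIGHTS)
batch = sample_batch(samples_by_class, 256, rng)
large_draw = Counter(c for c, _ in sample_batch(samples_by_class, 10000, rng))
```

Sampling by class rather than uniformly over images fixes the class mix seen during training regardless of how the classes are distributed in the training set.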

The parameters for the vector and matrix calibration methods were chosen by minimizing a cross-entropy loss function by use of the BFGS optimization method (https://scipy.org, version 1.1.0). The parameters were initialized such that the linear layer corresponded to the identity transformation. Training was stopped when the norm of the gradient was less than [...] or when the number of iterations exceeded 500. Retraining the last fully-connected layer for the fine-tuning method was performed by use of the Adam optimizer with a learning rate of [...] and weight decay of [...]. The batch size was set to 64. The fully-connected layer was trained from random initialization for 100 epochs and results were reported for the epoch with the lowest validation cross entropy loss. Training from scratch on the synthetic 2D mammography (SM) datasets was performed following the same procedure as for the base model. For fine-tuning and training from scratch, the size of an epoch was set to the number of training samples.
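The calibration objective can be written as a function of a flat parameter vector, which is the form a quasi-Newton optimizer such as BFGS expects. The sketch below covers only the objective and the identity initialization for the matrix case; the packing order of the parameters, the function names, and the toy logits and labels are assumptions, and the optimizer itself is left out.

```python
import math

N = 4  # number of BI-RADS density classes


def identity_params():
    """Flat parameters: 16 weights of the identity matrix, then 4 zero biases."""
    weight = [1.0 if i == j else 0.0 for i in range(N) for j in range(N)]
    return weight + [0.0] * N


def unpack(params):
    """Split the flat vector back into a 4 x 4 weight and a bias."""
    weight = [params[i * N:(i + 1) * N] for i in range(N)]
    return weight, params[N * N:]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]


def cross_entropy(params, logits_batch, labels):
    """Mean cross-entropy of the calibrated predictions."""
    weight, bias = unpack(params)
    loss = 0.0
    for logits, label in zip(logits_batch, labels):
        adjusted = [sum(w * z for w, z in zip(row, logits)) + b
                    for row, b in zip(weight, bias)]
        loss -= log_softmax(adjusted)[label]
    return loss / len(labels)


# Made-up base-model logits (A, B, C, D) that under-predict class C (label 2).
logits_batch = [[0.1, 1.2, 1.0, -0.8], [0.0, 0.9, 1.1, -0.5],
                [1.5, 0.4, -0.3, -1.2]]
labels = [2, 2, 0]

start = identity_params()
loss_identity = cross_entropy(start, logits_batch, labels)

shifted = list(start)
shifted[N * N + 2] += 0.5  # raise the bias of class C
loss_shifted = cross_entropy(shifted, logits_batch, labels)
```

Starting from the identity means the first evaluation equals the loss of the unadapted model, so any decrease found by the optimizer reflects a genuine gain on the adaptation set. For vector calibration, the off-diagonal weights would simply be held at zero.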

Acknowledgements

This work was supported in part by funding from Whiterabbit AI, Inc. WU has equity interests in Whiterabbit AI, Inc. and may receive royalty income and milestone payments from a “Collaboration and License Agreement” with Whiterabbit AI, Inc. to develop a technology evaluated in this research. The following authors are employed by and/or have equity interests in Whiterabbit AI, Inc.: T.P.M., S.S., B.M., J.S., M.P.S., S.P., A.L., R.M.H., N.G., D.S., and S.C.M.

The authors would like to thank Drs. Mark A. Anastasio, Catherine M. Appleton, and Curtis P. Langlotz for their insightful feedback and review of this manuscript. The authors would also like to thank Chip Schweiss for managing the research cluster with which this work was performed.

References

  • [1] W. A. Berg, C. Campassi, P. Langenberg, and M. J. Sexton. Breast Imaging Reporting and Data System: Inter- and intraobserver variability in feature analysis and final assessment. AJR, 174:1769–1777, 2000.
  • [2] N. F. Boyd, H. Guo, L. J. Martin, L. Sun, J. Stone, E. Fishell, R. A. Jong, G. Hislop, A. Chiarelli, S. Minkin, and M. J. Yaffe. Mammographic Density and the Risk and Detection of Breast Cancer. New England Journal of Medicine, 356(3):227–236, 2007.
  • [3] K. R. Brandt, C. G. Scott, L. Ma, A. P. Mahmoudzadeh, M. R. Jensen, D. H. Whaley, F. F. Wu, S. Malkov, C. B. Hruska, A. D. Norman, J. Heine, J. Shepherd, V. S. Pankratz, K. Kerlikowske, and C. M. Vachon. Comparison of Clinical and Automated Breast Density Measurements: Implications for Risk Prediction and Supplemental Screening. Radiology, 279(3):710–719, 2016.
  • [4] J. Carpenter and J. Bithell. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statistics in Medicine, 19:1141–1164, 2000.
  • [5] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174 [cs], Apr. 2016.
  • [6] D. Förnvik, H. Förnvik, A. Fieselmann, K. Lång, and H. Sartor. Comparison between software volumetric breast density estimates in breast tomosynthesis and digital mammography images in a large public screening cohort. Eur Radiol, 29(1):330–336, 2019.
  • [7] S. M. Friedewald, E. A. Rafferty, S. L. Rose, M. A. Durand, D. M. Plecha, J. S. Greenberg, M. K. Hayes, D. S. Copit, K. L. Carlson, T. M. Cink, L. D. Barke, L. N. Greer, D. P. Miller, and E. F. Conant. Breast Cancer Screening Using Tomosynthesis in Combination With Digital Mammography. JAMA, 311(24):2499, 2014.
  • [8] Z. Gandomkar, M. E. Suleiman, D. Demchig, P. C. Brennan, and M. F. McEntee. BI-RADS density categorization using deep neural networks. In Proc. SPIE 10952, Medical Imaging 2019: Image Perception, Observer Performance, and Technology Assessment, page 109520N, 2019.
  • [9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In International Conference on Machine Learning (ICML), 2017.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. In The European Conference on Computer Vision (ECCV), 2016.
  • [12] K. Kerlikowske, A. J. Cook, D. S. M. Buist, S. R. Cummings, C. Vachon, P. Vacek, and D. L. Miglioretti. Breast Cancer Risk by Breast Density, Menopause, and Postmenopausal Hormone Therapy Use. Journal of Clinical Oncology, 28:3830–3837, 2010.
  • [13] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [14] N. R. Kressin, C. M. Gunn, and T. A. Battaglia. Content, Readability, and Understandability of Dense Breast Notifications by State. JAMA, 315(16):1786–1788, 2016.
  • [15] C. D. Lehman, R. F. Arao, B. L. Sprague, J. M. Lee, D. S. M. Buist, K. Kerlikowske, L. M. Henderson, T. Onega, A. N. A. Tosteson, G. H. Rauscher, and D. L. Miglioretti. National Performance Benchmarks for Modern Screening Digital Mammography: Update from the Breast Cancer Surveillance Consortium. Radiology, 283(1):49–58, 2017.
  • [16] C. D. Lehman, A. Yala, T. Schuster, B. Dontchos, M. Bahl, K. Swanson, and R. Barzilay. Mammographic Breast Density Assessment Using Deep Learning: Clinical Implementation. Radiology, 290(1):52–58, 2019.
  • [17] X. Ma, C. E. Fisher, J. Wei, M. A. Helvie, H.-P. Chan, C. Zhou, L. M. Hadjiiski, and Y. Lu. Multi-path deep learning model for automated mammographic density categorization. In Proc. SPIE 10950, Medical Imaging 2019: Computer-Aided Diagnosis, page 109502E, 2019.
  • [18] M. T. Mandelson, N. Oestreicher, P. L. Porter, D. White, C. A. Finder, S. H. Taplin, and E. White. Breast Density as a Predictor of Mammographic Detection: Comparison of Interval- and Screen-Detected Cancers. J Natl Cancer Inst, 92(13):1081–1087, 2000.
  • [19] V. A. McCormack and I. d. S. Silva. Breast Density and Parenchymal Patterns as Markers of Breast Cancer Risk: A Meta-analysis. Cancer Epidemiol Biomarkers Prev, 15(6):1159–1169, 2006.
  • [20] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu. Mixed Precision Training. arXiv:1710.03740 [cs, stat], Oct. 2017.
  • [21] A. A. Mohamed, W. A. Berg, H. Peng, Y. Luo, R. C. Jankowitz, and S. Wu. A deep learning method for classifying mammographic breast density categories. Med Phys, 45(1):314–321, 2018.
  • [22] S. Pertuz, E. S. McDonald, S. P. Weinstein, E. F. Conant, and D. Kontos. Fully Automated Quantitative Estimation of Volumetric Breast Density from Digital Breast Tomosynthesis Images: Preliminary Results and Comparison with Digital Mammography and MR Imaging. Radiology, 279(1):65–74, 2015.
  • [23] E. A. Rafferty, J. M. Park, L. E. Philpotts, S. P. Poplack, J. H. Sumkin, E. F. Halpern, and L. T. Niklason. Diagnostic Accuracy and Recall Rates for Digital Mammography and Digital Mammography Combined With One-View and Two-View Tomosynthesis: Results of an Enriched Reader Study. American Journal of Roentgenology, 202(2):273–281, 2014.
  • [24] I. B. Richman, J. R. Hoag, X. Xu, H. P. Forman, R. Hooley, S. H. Busch, and C. P. Gross. Adoption of Digital Breast Tomosynthesis in Clinical Practice. JAMA Intern Med, 179(9):1292–1295, 2019.
  • [25] E. A. Sickles, C. J. D’Orsi, and L. W. Bassett. ACR BI-RADS Mammography. In ACR BI-RADS Atlas, Breast Imaging Reporting and Data System. American College of Radiology, Reston, VA, 5th edition, 2013.
  • [26] P. Skaane, A. I. Bandos, R. Gullien, E. B. Eben, U. Ekseth, U. Haakenaasen, M. Izadi, I. N. Jebsen, G. Jahr, M. Krager, L. T. Niklason, S. Hofvind, and D. Gur. Comparison of Digital Mammography Alone and Digital Mammography Plus Tomosynthesis in a Population-based Screening Program. Radiology, 267(1):47–56, 2013.
  • [27] M. C. Spayne, C. C. Gard, J. Skelly, D. L. Miglioretti, P. M. Vacek, and B. M. Geller. Reproducibility of BI‐RADS Breast Density Measures Among Community Radiologists: A Prospective Cohort Study. The Breast Journal, 18(4):326–333, 2012.
  • [28] B. L. Sprague, E. F. Conant, T. Onega, M. P. Garcia, E. F. Beaber, S. D. Herschorn, C. D. Lehman, A. N. Tosteson, R. Lacson, M. D. Schnall, D. Kontos, J. S. Haas, D. L. Weaver, W. E. Barlow, and on behalf of the PROSPR Consortium. Variation in Mammographic Breast Density Assessments Among Radiologists in Clinical Practice: A Multicenter Observational Study. Ann Intern Med, 165(7):457, 2016.
  • [29] A. S. Tagliafico, G. Tagliafico, F. Cavagnetto, M. Calabrese, and N. Houssami. Estimation of percentage breast tissue density: comparison between digital mammography (2D full field digital mammography) and digital breast tomosynthesis according to different BI-RADS categories. Br J Radiol, 86(1031):20130255, 2013.
  • [30] N. Wu, K. J. Geras, Y. Shen, J. Su, S. G. Kim, E. Kim, S. Wolfson, L. Moy, and K. Cho. Breast density classification with deep convolutional neural networks. arXiv:1711.03674 [cs, stat], Nov. 2017.
  • [31] Y. Wu and K. He. Group Normalization. In The European Conference on Computer Vision (ECCV), 2018.
  • [32] P. H. Yi, A. Lin, J. Wei, H. I. Sair, F. K. Hui, and S. C. Harvey. Deep-Learning Based Semantic Labeling for 2D Mammography & Comparison of Computational Complexity for Machine Learning Tasks. In SIIM Conference on Machine Intelligence in Medical Imaging, page 2, 2018.
  • [33] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3320–3328. Curran Associates, Inc., 2014.