Breast cancer is by far the most common cancer diagnosed in women worldwide. In clinical practice, contextual information and multi-view information (i.e., the medio-lateral oblique (MLO) view, a side view of the breast taken at a certain angle, and the cranio-caudal (CC) view, a top-to-bottom view of the breast) help radiologists detect masses on mammograms. However, while there has been significant progress in mammogram-based mass detection and classification using deep convolutional neural networks (CNNs), most deep convolutional architectures identify mammogram masses without taking the different views into account during model training, so the relation between the two views of a mammogram cannot be learned.
To address this limitation, this work focuses on the aggregation of two views. Breast lesions can appear at arbitrary image locations, at different scales, and in different categories, which makes it difficult to directly model the related masses between two views of mammogram images. To solve this issue, we turn our attention to the regions of interest (ROIs, i.e., candidate mass regions) detected by the region-based convolutional neural network (RCNN) architecture, and explore the hidden relationship between the ROIs from the two views. In particular, we propose a novel mammogram mass detection framework, termed cross-view relation region-based convolutional neural network (CVR-RCNN). Unlike previous deep learning work [11, 13], which does not distinguish between different mammogram views, we extend two-branch Faster RCNNs with a novel cross-view relation network and demonstrate the benefit of incorporating the relation information between the views in our framework.
Our contributions are twofold. First, to the best of our knowledge, this is the first work to model relation information between two views for mammogram mass detection. In our experiments, the cross-view detection framework outperforms the existing approaches we compare against. Second, we introduce a new cross-view relation network for modeling the interaction between mass ROIs, which imitates the way radiologists screen mammograms. Since current datasets are not large enough, we collected a large-scale dataset that contains 1,425 specimen mammograms with annotations of breast masses and evaluated the proposed model on this challenging dataset. The public Digital Database for Screening Mammography (DDSM) dataset was also used to further validate the effectiveness of the proposed model. Our experimental results show that the proposed model outperforms several state-of-the-art methods.
In principle, our approach differs fundamentally from previous mammogram mass detection methods. It is the first end-to-end cross-view modeling framework using two-branch RCNNs, which opens a new avenue for mammogram mass detection and may further reduce radiologists’ screening time.
2 The CVR-RCNN Framework
The proposed CVR-RCNN detection framework consists of two-branch extended Faster RCNNs and a novel relation network.
Two-branch Faster RCNNs. Motivated by the siamese network [12], which uses two weight-sharing feature extraction branches, we propose two-branch weight-sharing RCNNs (Figure 1, left half) connected by relation blocks to learn the latent cross-view information. For each branch, we adopt the popular object detection framework Faster RCNN [8] and extend it with several residual blocks. Further, inspired by [5], each original residual block [4] is modified to perform batch normalization (BN) and rectified linear unit (ReLU) activation before the convolution operations. Each Faster RCNN branch aims to detect the regions of interest (ROIs, i.e., the candidate mass regions) for further processing by the subsequent cross-view relation network.
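The modified residual block (normalization and ReLU placed before each convolution, followed by the identity skip connection) can be sketched as follows. This is a minimal 1-D illustration in plain Python, not the authors' implementation: the function names, the 1-D signal, the kernel values, and the per-signal normalization standing in for batch normalization are all assumptions.

```python
import math

def batch_norm(values, eps=1e-5):
    """Simplified stand-in for BN: normalize one signal to zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def conv1d_same(values, kernel):
    """1-D convolution (cross-correlation) with zero padding; odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(values))
    ]

def preact_residual_block(x, kernel1, kernel2):
    """Pre-activation ordering: BN -> ReLU -> conv, twice, then add the input."""
    out = conv1d_same(relu(batch_norm(x)), kernel1)
    out = conv1d_same(relu(batch_norm(out)), kernel2)
    return [a + b for a, b in zip(x, out)]

# Illustrative input and smoothing kernels (hypothetical values).
block_out = preact_residual_block(
    [1.0, -2.0, 3.0, 0.5], [0.25, 0.5, 0.25], [0.25, 0.5, 0.25]
)
```

Because the skip path is a plain addition, the block reduces to the identity when the convolution weights are zero.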
Cross-view Relation Networks. To discover the latent relation between the two views, inspired by the relation module proposed in [6], we designed a new relation network consisting of a number of relation blocks linking the paired mammogram views (see Figure 1, ‘CVR Networks’). The objective of the relation network is to transfer both visual and geometric information of ROIs from the second (or first) view to the first (or second) one in order to help detect masses more effectively in the first (or second) view.
Suppose the $m$-th ROI in the first view needs to be classified and its position fine-tuned by the first branch of the proposed framework. Denote the visual feature representation of the $m$-th ROI in the first view by $\mathbf{f}_m^{(1)}$, and its geometric feature by $\mathbf{g}_m^{(1)}$. In general, $\mathbf{f}_m^{(1)}$ is a feature vector output by a fully connected layer following the ROI-pooling layer in the Faster RCNN for the $m$-th ROI region, and $\mathbf{g}_m^{(1)}$ includes the coordinate ($x_m^{(1)}, y_m^{(1)}$), width ($w_m^{(1)}$), and height ($h_m^{(1)}$) of the ROI in the first view. Similarly, denote the visual feature representation of the $n$-th ROI in the second view by $\mathbf{f}_n^{(2)}$, and its geometric feature by $\mathbf{g}_n^{(2)}$.
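In order to use the visual information from the second view to help detect the $m$-th mass candidate in the first view (represented by its visual feature $\mathbf{f}_m^{(1)}$ and geometric feature $\mathbf{g}_m^{(1)}$), we need to establish both the visual and geometric relations between each mass candidate in the second view and this candidate in the first view. The strength of the visual relation, i.e., the similarity between the $m$-th candidate in the first view and the $n$-th candidate in the second view (with visual feature $\mathbf{f}_n^{(2)}$), can be represented by

$$\omega_{mn}^{v} = \frac{\left(\mathbf{W}_K \mathbf{f}_m^{(1)}\right)^{\top} \left(\mathbf{W}_Q \mathbf{f}_n^{(2)}\right)}{\sqrt{d}},$$

where $\mathbf{W}_K$ and $\mathbf{W}_Q$ are two matrices transforming the original visual features into the same feature space before measuring the similarity based on their dot product. $\mathbf{W}_K$ and $\mathbf{W}_Q$ are part of the model parameters to be learned. $d$ is the dimension of the new feature space, and $\sqrt{d}$ is used as a normalization factor.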
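A minimal sketch of this scaled dot-product similarity: both ROI features are projected into a shared space, multiplied, and divided by the square root of the projected dimension. The names `W_K`, `W_Q`, the helper `matvec`, and the toy values are illustrative assumptions, not taken from the original implementation.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def visual_relation(f_first, f_second, W_K, W_Q):
    """Scaled dot-product similarity between two ROI feature vectors."""
    k = matvec(W_K, f_first)    # project the first-view feature
    q = matvec(W_Q, f_second)   # project the second-view feature
    d = len(k)                  # dimension of the shared feature space
    return sum(a * b for a, b in zip(k, q)) / math.sqrt(d)

# Toy 3-D features projected to a 2-D shared space (hypothetical values).
W_K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_Q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
f_m = [1.0, 2.0, 3.0]   # ROI feature from the first view
f_n = [2.0, 1.0, 0.0]   # ROI feature from the second view
score = visual_relation(f_m, f_n, W_K, W_Q)
```

In a trained model the two projection matrices would be learned; here they simply select the first two feature components so the arithmetic is easy to follow.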
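To establish the geometric relationship between the candidates in the two views, inspired by the work [6], we first normalize the geometric information of the $m$-th candidate in the first view by the $n$-th candidate in the second view,

$$\mathbf{g}_{mn} = \left( \log\frac{|x_m^{(1)} - x_n^{(2)}|}{w_n^{(2)}},\ \log\frac{|y_m^{(1)} - y_n^{(2)}|}{h_n^{(2)}},\ \log\frac{w_m^{(1)}}{w_n^{(2)}},\ \log\frac{h_m^{(1)}}{h_n^{(2)}} \right)^{\top},$$

where $(x, y)$, $w$, and $h$ denote the coordinate, width, and height of an ROI, and $\log$ is used to boost the potential geometric relation between candidates by reducing the effect of differences in position and size between the two candidates. Then, similar to the work [6, 10], the cross-view normalized geometric feature $\mathbf{g}_{mn}$ is embedded into a high-dimensional feature space by the $\sin$ and $\cos$ functions of its elements at different frequencies (please refer to these references for details). The embedding process is denoted by $E(\cdot)$. The geometric relation between the two candidates is then defined as

$$\omega_{mn}^{g} = \max\left\{0,\ \mathbf{w}_g^{\top} E(\mathbf{g}_{mn})\right\},$$

where $\mathbf{w}_g$ is a vector transforming the high-dimensional feature vector $E(\mathbf{g}_{mn})$ into a scalar weight. $\mathbf{w}_g$ is part of the model parameters to be learned. The function $\max\{0, \cdot\}$ is used to trim any negative weight to $0$, thus restricting interaction to candidates satisfying certain geometric relationships.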
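The geometric pipeline described above (log-normalized box differences, a sine/cosine embedding at several frequencies, then a learned linear map clipped at zero) can be sketched as below. Several details are assumptions made for illustration only: the direction of the normalization (dividing by the second-view width and height), the `eps` clamp that avoids `log(0)`, the number of frequencies, the base wavelength of 1000, and all box coordinates and weights.

```python
import math

def normalize_geometry(box_first, box_second, eps=1e-3):
    """Log-normalize a first-view box (x, y, w, h) by a second-view box."""
    x1, y1, w1, h1 = box_first
    x2, y2, w2, h2 = box_second
    return [
        math.log(max(abs(x1 - x2), eps) / w2),  # eps clamp is an assumption
        math.log(max(abs(y1 - y2), eps) / h2),
        math.log(w1 / w2),
        math.log(h1 / h2),
    ]

def sinusoidal_embedding(values, num_freqs=4, wavelength=1000.0):
    """Embed each scalar with sin/cos at geometrically spaced frequencies."""
    embedding = []
    for v in values:
        for k in range(num_freqs):
            freq = 1.0 / (wavelength ** (k / num_freqs))
            embedding.append(math.sin(v * freq))
            embedding.append(math.cos(v * freq))
    return embedding

def geometric_relation(box_first, box_second, w_g):
    """Linear map of the embedded geometry, with negative weights trimmed to 0."""
    emb = sinusoidal_embedding(normalize_geometry(box_first, box_second))
    return max(0.0, sum(w * e for w, e in zip(w_g, emb)))

box_view1 = (120.0, 200.0, 40.0, 50.0)  # hypothetical ROI (x, y, w, h), first view
box_view2 = (100.0, 180.0, 40.0, 50.0)  # hypothetical ROI, second view
emb_dim = 4 * 4 * 2                     # 4 geometry terms x 4 freqs x (sin, cos)
weight_pos = geometric_relation(box_view1, box_view2, [0.1] * emb_dim)
weight_neg = geometric_relation(box_view1, box_view2, [-0.1] * emb_dim)
```

With the negated weight vector the linear response is negative, so the clipped geometric weight is zero and that pair of candidates would not interact.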
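Combining the visual and geometric information, the relation between the $m$-th candidate in the first view and all the candidates in the second view can be summarized as

$$\mathbf{r}_m = \sum_{n} \frac{\omega_{mn}^{g} \exp(\omega_{mn}^{v})}{Z_m}\, \mathbf{W}_V \mathbf{f}_n^{(2)}, \qquad Z_m = \sum_{k} \omega_{mk}^{g} \exp(\omega_{mk}^{v}),$$

where $\omega_{mn}^{v}$ and $\omega_{mn}^{g}$ are the visual and geometric relation weights between the $m$-th candidate in the first view and the $n$-th candidate in the second view. Here $\exp(\cdot)$ is used to make sure the visual relation is strengthened and non-negative between similar candidates (as in the reference ), and $Z_m$ is a normalization factor. Each visual feature $\mathbf{f}_n^{(2)}$ from the second view is transformed by the matrix $\mathbf{W}_V$ to the feature space in which the first-view feature $\mathbf{f}_m^{(1)}$ lies. Since $\mathbf{f}_m^{(1)}$ and $\mathbf{f}_n^{(2)}$ are often extracted similarly from corresponding ROIs, $\mathbf{W}_V$ is in general a square matrix. $\mathbf{W}_V$ is also part of the model parameters to be learned.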
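As a sketch of how the two kinds of relation combine, the snippet below multiplies each geometric weight by the exponential of the visual score, normalizes over all second-view candidates, and sums the projected second-view features. The function name, the zero-vector fallback when every candidate is gated out, and all numeric values are assumptions for illustration.

```python
import math

def relational_feature(visual_scores, geometric_weights, second_view_feats, W_V):
    """Weighted sum of projected second-view features for one first-view ROI."""
    raw = [g * math.exp(v) for g, v in zip(geometric_weights, visual_scores)]
    total = sum(raw)
    if total == 0.0:
        # Assumption: if no second-view candidate passes the geometric gate,
        # return a zero relational feature instead of dividing by zero.
        return [0.0] * len(W_V)
    weights = [r / total for r in raw]
    out = [0.0] * len(W_V)
    for weight, feat in zip(weights, second_view_feats):
        projected = [sum(w * x for w, x in zip(row, feat)) for row in W_V]
        out = [o + weight * p for o, p in zip(out, projected)]
    return out

W_V = [[1.0, 0.0], [0.0, 1.0]]   # square projection; identity for readability
feats_view2 = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
visual = [0.0, 0.0, 3.0]         # third ROI looks most similar visually...
geometric = [1.0, 1.0, 0.0]      # ...but is gated out by its geometry
r_m = relational_feature(visual, geometric, feats_view2, W_V)
```

In this toy case the third candidate contributes nothing despite its high visual score, and the remaining two candidates share the weight equally.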
Finally, the relational feature $\mathbf{r}_m$ is added to the original feature $\mathbf{f}_m^{(1)}$ to form the output of the relation block for the $m$-th mass candidate in the first view,

$$\hat{\mathbf{f}}_m^{(1)} = \mathbf{f}_m^{(1)} + \mathbf{r}_m.$$

Motivated by the skip connection in the ResNet [4], the two visual features $\mathbf{f}_m^{(1)}$ and $\mathbf{r}_m$ are summed rather than concatenated. The output of the relation block $\hat{\mathbf{f}}_m^{(1)}$ (and the original geometric feature $\mathbf{g}_m^{(1)}$) can be used as input to the next relation block when multiple relation blocks are employed in the detection framework.
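Summing instead of concatenating keeps the feature dimension fixed, which is what lets relation blocks be chained. A small sketch under assumed names follows; the `halve` function is a hypothetical stand-in for a learned relation block that returns a relational feature.

```python
def relation_block_output(original_feat, relational_feat):
    """Skip connection: the element-wise sum leaves the dimension unchanged."""
    if len(original_feat) != len(relational_feat):
        raise ValueError("features must share a dimension to be summed")
    return [f + r for f, r in zip(original_feat, relational_feat)]

def stack_relation_blocks(feat, relation_fns):
    """Feed the output of each relation block into the next one."""
    for relation_fn in relation_fns:
        feat = relation_block_output(feat, relation_fn(feat))
    return feat

def halve(feat):
    """Hypothetical relation block: returns half of its input as the relational feature."""
    return [0.5 * x for x in feat]

stacked = stack_relation_blocks([2.0, 4.0], [halve, halve])
```

After two blocks the feature still has two components, so the same classification and regression heads can consume it regardless of how many blocks are stacked.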
For each ROI from the first (or second) view, the output from the last cross-view relation block is then fed into a number of fully connected layers for ROI classification and another set of fully connected layers for bounding box offset prediction, as in the Faster RCNN. Given a set of paired (two-view) images, the two-branch Faster RCNNs with the cross-view relation network can be trained by minimizing the loss function,

$$\mathcal{L} = \sum_{i=1}^{2} \left( \mathcal{L}_{cls}^{i} + \lambda_i\, \mathcal{L}_{reg}^{i} \right),$$

where $\mathcal{L}_{cls}^{i}$ and $\mathcal{L}_{reg}^{i}$ respectively represent the classification and the regression losses from the $i$-th view. $\lambda_1$ and $\lambda_2$ are coefficients balancing the loss terms.
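One plausible way to combine the per-view losses is sketched below: each view contributes its classification loss plus a weighted regression loss. The exact weighting scheme, the function name, and the default coefficients of 1.0 are assumptions for illustration, not values reported in the text.

```python
def total_loss(cls_losses, reg_losses, reg_weights=(1.0, 1.0)):
    """Sum classification and weighted regression losses over the two views."""
    if not (len(cls_losses) == len(reg_losses) == len(reg_weights)):
        raise ValueError("expected one loss pair and one weight per view")
    return sum(
        cls + lam * reg
        for cls, reg, lam in zip(cls_losses, reg_losses, reg_weights)
    )

# Hypothetical per-view loss values for one training step.
loss_equal = total_loss((0.7, 0.5), (0.2, 0.4))
loss_skewed = total_loss((0.7, 0.5), (0.2, 0.4), reg_weights=(2.0, 0.0))
```

Because both branches share weights and are linked by the relation network, a single scalar of this kind is enough to train the whole framework end to end.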
3.1 Experimental Protocols
Private Dataset: To evaluate the proposed framework on a large breast mass dataset, we collected a large-scale dataset containing 1,425 scanned mammography images with breast mass lesions. To the best of our knowledge, this dataset is the largest cohort collected specifically for breast mass detection. The annotations (bounding boxes of each breast lesion) were labeled by 4 experienced radiologists. Each senior radiologist reviewed the annotations made by a relatively junior radiologist and made further modifications if necessary.
Data Preprocessing: To avoid over-fitting during training, the training set was augmented by affine transformations (e.g., rotations). Each image was resized to pixels and the pixel values were rescaled to the range .
Implementation Details: The MXNet library was used to build the proposed deep convolutional architecture. The model parameters were initialized from a pretrained ResNet-101 and then fine-tuned for around 20 epochs with early stopping, using a mini-batch of two images per device on 4 GPUs. The SGD optimizer was used with learning rate. And we evaluated the proposed framework with default setting (, ).
Statistical Evaluation: In both datasets, about paired images were used for training, and the remaining for testing.
The F1 score, precision and recall were used as evaluation metrics. For the public dataset, true positive rate (TPR) versus false positives per image (FPI) was used as the metric, following previous work (e.g.,).
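For reference, these detection metrics can be computed from raw counts as in the sketch below. The function names and the counts are hypothetical; how a detection is matched to an annotated mass (for example, an overlap threshold) is not specified here and is left outside the sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def tpr_and_fpi(tp, fn, fp, num_images):
    """True positive rate and average false positives per image."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return tpr, fp / num_images

# Hypothetical counts: 80 detected masses, 20 false alarms, 20 missed masses.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
tpr, fpi = tpr_and_fpi(tp=80, fn=20, fp=20, num_images=100)
```

Recall and TPR are the same quantity; reporting TPR at a given FPI fixes the operating point at which the detection rate is read off.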
3.2 Effect of the Relation Block
In this section, the effect of the number of relation blocks in the cross-view relation network was investigated on our private dataset. As shown in Table 1, adding relation blocks to the network (second to last rows) clearly improved detection performance over the network without any relation block (first row, ). Also, more relation blocks steadily led to a higher precision rate (e.g., achieving 76.56 when using relation blocks). However, alongside the higher precision rate, the recall was largely reduced when relation blocks were used (clinically, recall is relatively more important than precision). One reason could be that, by sharing features and relations more times between the two views, the network is forced to pay more attention to the relationship between the two views than to the visual features from each single view. In addition, using more relation blocks increases the computational complexity and memory usage. As a trade-off, relation blocks were used as default in the tests below.
3.3 Comparison with Non Cross-view Methods
The proposed CVR-RCNN model was further evaluated on our private dataset by comparing it with two representative detection frameworks, Faster RCNN [8] and SSD [7]. In Table 2, ‘Faster RCNN’ and ‘SSD’ indicate that the data from the two views were mixed (therefore not using view information) for training and testing. In comparison, ‘two-branch Faster RCNNs’ and ‘two-branch SSDs’ indicate that each view of data was used to train an individual detection model, and each test image was predicted by either the first or the second model depending on its view. As shown in Table 2, the two-branch Faster RCNNs and SSDs models perform better than their corresponding versions without view information; e.g., for the two-branch Faster RCNNs model, the precision rate is 65.27% vs. 64.01% and the recall rate is 71.93% vs. 70.53%. This suggests that separate models conditioned on view information outperform a single model trained on mixed-view images.
|Method||Precision (%)||Recall (%)||F1 score||FPI|
|Faster RCNN [8]||64.01||70.53||0.67||0.45|
|two-branch Faster RCNNs||65.27||71.93||0.69||0.42|
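The per-view baseline amounts to routing each test image to the detector trained on its view. A minimal sketch follows; the view labels, the dictionary of detectors, and the stub detectors themselves are hypothetical placeholders for trained models.

```python
def predict_by_view(image, view, detectors):
    """Dispatch an image to the detector trained on its view (e.g. 'CC' or 'MLO')."""
    if view not in detectors:
        raise KeyError(f"no detector trained for view {view!r}")
    return detectors[view](image)

# Stub detectors standing in for two separately trained models.
detectors = {
    "CC": lambda image: [("mass", "cc-model")],
    "MLO": lambda image: [("mass", "mlo-model")],
}
cc_result = predict_by_view("cc_image_placeholder", "CC", detectors)
mlo_result = predict_by_view("mlo_image_placeholder", "MLO", detectors)
```

In this baseline the two detectors never exchange information, which is the limitation the cross-view relation network is meant to address.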
However, the two-branch Faster RCNNs model does not consider that the lesions in the two views may be closely related to each other. By learning the relationship between the two views with the cross-view relation network, our CVR-RCNN achieved the best performance, improving the precision rate from 65.27% to 71.12% and the recall rate from 71.93% to 75.33%. One reason for this improvement may be that, by sharing the visual and geometric information of the ROIs between the two views, the network is driven to learn the different manifestations of the same lesions in the two views. For each ROI, the network can then examine whether the ROI exists in both views. With this double-check strategy, false positives are likely to be reduced, improving the precision rate and lowering the false positives per image (FPI).
|Method|| ||TPR@FPI|
|Campanini et al. [1]||-||…|
|Eltonsy et al. [2]||-||…, …, …|
|Sampat et al. [9]||-||…, …, …|
|Faster RCNN [8]|| ||…, 0.75@1.8, …|
|two-branch Faster RCNNs|| ||…, …|
|CVR-RCNN|| ||…, …, …|
3.4 Comparison with State-of-the-Art
To further justify the effectiveness of the proposed model, we performed experiments on the public DDSM dataset and compared with the results from the public DDSM leaderboard. Table 3 shows that our proposed method clearly outperforms the state-of-the-art methods for mass detection in mammograms, where the results from the competing methods are reported by their original authors. The results demonstrate that the proposed CVR-RCNN is noticeably better than the previous ones, suggesting that our CVR-RCNN is more suitable for mammogram mass detection.
In addition, the results from the Faster RCNN and the two-branch Faster RCNNs (Table 3, fourth and fifth rows) again confirm that view information helps improve detection performance, with the two-branch Faster RCNNs performing better. More importantly, comparing the results of the two-branch Faster RCNNs and the proposed CVR-RCNN shows that the model trained with the cross-view relation network obtains the best performance, demonstrating that the interaction between the two views is effective for detecting breast masses in the proposed framework.
In this work, we proposed a cross-view mammogram mass detection framework by combining the conventional CNN detection with a novel cross-view relation network. Extensive evaluations on a private dataset and a public dataset clearly demonstrated the superior performance of the proposed framework compared to state-of-the-art methods. This opens a new avenue to improve detection of mammogram mass where paired or multiple views of information are available.
[1] Campanini, R., Dongiovanni, D., Iampieri, E., Lanconelli, N., Masotti, M., Palermo, G., Riccardi, A., Roffilli, M.: A novel featureless approach to mass detection in digital mammograms based on support vector machines. Physics in Medicine & Biology 49(6), 961 (2004)
[2] Eltonsy, N.H., Tourassi, G.D., Elmaghraby, A.S.: A concentric morphology model for the detection of masses in mammography. IEEE Transactions on Medical Imaging 26(6), 880–889 (2007)
[3] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 580–587 (2014)
[4] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[5] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. pp. 630–645. Springer (2016)
[6] Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3588–3597 (2018)
[7] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European Conference on Computer Vision (2016)
[8] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
[9] Sampat, M.P., Bovik, A.C., Whitman, G.J., Markey, M.K.: A model-based framework for the detection of spiculated masses on mammography. Medical Physics 35(5), 2110–2123 (2008)
[10] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need (2017)
[11] Xi, P., Shu, C., Goubran, R.: Abnormality detection in mammography using deep convolutional neural networks (2018)
[12] Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4353–4361 (2015)
[13] Zhu, W., Lou, Q., Vang, Y., Xie, X.: Deep multi-instance networks with sparse label assignment for whole mammogram classification (2017)