Region Proposal Networks with Contextual Selective Attention for Real-Time Organ Detection

12/26/2018 ∙ by Awais Mansoor, et al. ∙ 0

State-of-the-art methods for object detection use region proposal networks (RPN) to hypothesize object location. These networks simultaneously predict object bounding boxes and objectness scores at each location in the image. Unlike natural images, for which RPN algorithms were originally designed, most medical images are acquired following standard protocols, so organs in the image typically appear at a similar location and possess similar geometrical characteristics (e.g., scale and aspect ratio). Medical image acquisition protocols therefore hold critical localization and geometric information that can be incorporated for faster and more accurate detection. This paper presents a novel attention mechanism for the detection of organs by incorporating imaging protocol information. Our selective attention approach (i) effectively shrinks the search space inside the feature map, (ii) appends useful localization information to the hypothesized proposal so that the detection architecture can learn where to look for each organ, and (iii) modifies the pyramid of regression references in the RPN by incorporating organ- and modality-specific information, which results in additional time reduction. We evaluated the proposed framework on a dataset of 668 chest X-ray images obtained from a diverse set of sources. Our results demonstrate superior performance for the detection of the lung field compared to the state of the art, both in detection accuracy, with an improvement of >7% in Dice score, and in processing time, which was reduced by 27.53% owing to fewer hypotheses.




1 Introduction

Organ detection is a critical step in various medical image analysis applications, including segmentation, semantic navigation, and query processing. The performance of these applications is contingent upon fast and accurate localization of the organ of interest. Moreover, with the application of big data technologies on the rise in medical imaging, fast and accurate detection is even more essential to provide better patient care, create population-specific atlases, and curate data accurately for artificial intelligence algorithms. However, challenges in fast and accurate organ detection continue to be a bottleneck in the development of real-time and accurate medical imaging applications.

Traditional methods for organ detection are based primarily on a sliding window approach that generates hypotheses for the organ location; a classifier then assigns labels to the hypotheses. The seminal real-time object detection algorithm proposed by Viola and Jones [1] follows this approach using a cascade of AdaBoost classifiers. In addition to impacting various computer vision applications, this method has been adopted in a number of medical imaging applications. In medical imaging, however, regression-based solutions are a more feasible route for organ detection than exhaustive search, because (i) exhaustive search is unnecessary when detecting anatomy, since medical images offer strong contextual information, and (ii) the size and resolution of normative medical and biomedical images make exhaustive search prohibitively expensive. In addition, variations in orientation and scale increase the computational complexity exponentially.

Several approaches to incorporate contextual information for efficient organ detection have been proposed in the literature. For instance, Zhou et al. presented a method based on boosting ridge regression to detect and localize the left ventricle in cardiac ultrasound [2]. Pauly et al. [3] used supervised regression from 3D local binary pattern descriptors for organ detection in multichannel magnetic resonance (MR) Dixon sequences. Zheng et al. [4] proposed marginal space learning (MSL), which breaks down the complexity of learning the similarity transformation from the image space to the projection space. Although there has been increasing interest in applying deep-learning methods to organ detection from medical images, the state-of-the-art techniques use either an exhaustive search mechanism [5, 6] or data pre-processing prior to the neural network to incorporate contextual information [7]. To the best of our knowledge, there are no works that incorporate contextual information into deep neural networks for fast object and organ detection.

In this paper, we demonstrate how contextual information from image acquisition protocols about the organ location can be incorporated into state-of-the-art neural network-based detection mechanisms, such as the RPN, resulting in faster and more accurate organ detection. The main contributions of the paper are: (i) we propose a reduced-size search space inside the convolutional feature map for proposal generation, (ii) we add useful prior information about the organ localization to the detection architecture so it can learn where to look for each organ, and (iii) we modify the pyramid of regression references in the RPN by incorporating organ- and modality-specific information, which results in additional time reduction and improved detection accuracy. We evaluate our proposed framework on the detection of the lung field from a dataset of 668 chest X-ray images obtained from diverse sources and compare it with the state of the art.

2 Methods

Fig. 1(a) provides an overview of our proposed network architecture for real-time detection of organs from medical images. The architecture is designed to include prior protocol and contextual organ information for improved performance in terms of both accuracy and speed. The framework builds on state-of-the-art region proposal networks (RPN) [8] and Faster R-CNN (Region-based Convolutional Neural Networks) [6] (Fig. 1(b)), which we briefly review next.

Figure 1: Comparative illustration of (a) the proposed network architecture based on context-aware selective attention and (b) the state-of-the-art Faster R-CNN [6].

2.1 Faster R-CNN

Faster R-CNN is the current state-of-the-art method in real-time object detection and has been used in various computer vision applications such as image recognition and visual understanding. Fig. 1(b) provides an overview diagram of Faster R-CNN, which consists of two primary modules: a deep convolutional neural network, the RPN, which provides region proposals as object location hypotheses, and a detection module that assigns labels to those region proposals.

2.1.1 Hypothesis Generation–RPN:

Recent advances in object detection in computer vision applications are largely driven by the RPN [6, 8]. The RPN takes as input a convolutional feature map of any size and provides a set of rectangular object proposals, each with an objectness score that measures the membership of the region to a set of object classes. Any deep-learning architecture can be used to generate the convolutional feature map; several architectures, such as VGG-16, have been used for this purpose [6]. To generate region proposals, the RPN slides over every location in the convolutional feature map. Each sliding window is then mapped to a lower-dimensional feature space (512-D for VGG-16), which is fed into two separate fully connected layers: a box-regression layer (reg) and a box-classification layer (cls). At each location of the sliding window, the RPN predicts multiple region proposals (bounding boxes or anchors) at different scales and aspect ratios. Let $k$ denote the maximum number of proposals at each location. Then the reg layer has $4k$ outputs representing the coordinates of the bounding boxes (proposals), and the cls layer provides $2k$ memberships indicating whether or not each bounding box contains the object(s) of interest. Hence, for a convolutional feature map of size $W \times H$, there are $WHk$ proposals in total, which can be computationally expensive for medical images.
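The proposal count described above can be illustrated with a minimal sketch. It assumes NumPy, a hypothetical helper name `generate_anchors`, a 40 × 40 feature map chosen only for illustration, and the default anchor areas (128², 256², 512² pixels) and aspect ratios (1:2, 1:1, 2:1) of Faster R-CNN [6]; it is not the implementation used in the paper.

```python
import numpy as np

def generate_anchors(feat_w, feat_h, stride, areas, ratios):
    """Enumerate anchors as (x1, y1, x2, y2) boxes in image pixels.

    One anchor is created per (location, area, ratio) combination, so the
    result has feat_w * feat_h * k rows with k = len(areas) * len(ratios).
    """
    boxes = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            # Center of the receptive cell in image coordinates.
            cx = (ix + 0.5) * stride
            cy = (iy + 0.5) * stride
            for area in areas:
                for ratio in ratios:  # ratio = width / height
                    w = np.sqrt(area * ratio)
                    h = np.sqrt(area / ratio)
                    boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

# Illustrative sizes only: a 40 x 40 feature map with stride 16 and k = 9.
anchors = generate_anchors(40, 40, 16,
                           areas=(128 ** 2, 256 ** 2, 512 ** 2),
                           ratios=(0.5, 1.0, 2.0))
print(anchors.shape)
```

With these illustrative values the array has 40 · 40 · 9 rows, one per proposal, which is the quantity that grows quickly for large medical images.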

2.1.2 Hypothesis Classification:

For object detection, Faster R-CNN adopts the detection classifier presented in [5]. As shown in Fig. 1(b), Faster R-CNN learns a unified network composed of the RPN and the detection classifier with shared convolutional layers. After the RPN step, fixed-size feature maps are extracted for each proposal using region of interest (ROI) pooling. Finally, fully connected layers provide a membership score for each possible object class.

2.2 Novel Contextual Selective Attention Region Proposal Network

Similar to Faster R-CNN, the proposed framework consists of two modules: a proposal hypothesis generation module and a detection module, as illustrated in Fig. 1(a).

2.2.1 Hypothesis Generation–Selective Attention RPN:

Medical images are acquired under specific protocols with a predetermined pose. This information can therefore be used to constrain the area of the convolutional feature map in which to look for the organ of interest, instead of scanning the entire map with the sliding window approach. In our proposed selective attention RPN, we exploit that prior information to boost the detection performance of Faster R-CNN in terms of speed and accuracy. Specifically, we use a reduced search space, the attention region, to avoid generating region proposals in areas in which the organ of interest is unlikely to be located. The attention region can be determined from prior statistical information in the training dataset and the acquisition protocol. Furthermore, we estimated the expected size and aspect ratio of the organ of interest from the training data and population statistics; this results in fewer and more accurate proposals at each location. For the application of left and right lung field detection (two separate classes), the attention region is a rectangular sub-window of the feature map of the last convolutional layer in the VGG-16 architecture, delimited by boundaries along the horizontal and vertical dimensions. Based on population statistics of lung shape (location, scale, aspect ratio), we are able to reduce the number of proposals per location to $k = 4$ (2 scales and 2 aspect ratios, see Section 3.2), as opposed to the $k = 9$ used in Faster R-CNN.

We trained our contextual selective attention RPN end-to-end using back-propagation with stochastic gradient descent optimization [9]. Unlike Faster R-CNN, the mini-batch in our approach was allowed to arise from multiple images, owing to the relative standardization of medical imaging acquisition protocols. To minimize the bias effects of having more negative proposals (i.e., proposals with no organ of interest), we randomly sampled positive proposals (i.e., proposals with the organ of interest) and negative proposals at a fixed ratio. Network weights were randomly initialized using a Gaussian distribution with zero mean and 0.01 standard deviation. Training used a fixed learning rate together with weight decay and momentum.
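To make the effect of the reduced search space concrete, the following sketch builds a boolean attention mask over the feature map and compares proposal counts. The helper `attention_mask`, the 40 × 40 feature map, and the margins trimmed from each side are hypothetical values chosen for illustration; the only figures taken from the text are 4 anchors per location for the proposed method versus 9 for Faster R-CNN.

```python
import numpy as np

def attention_mask(feat_h, feat_w, top, bottom, left, right):
    """Boolean mask that is True inside the attention region.

    top, bottom, left and right are the numbers of feature-map cells
    trimmed from each side (hypothetical values, not from the paper).
    """
    mask = np.zeros((feat_h, feat_w), dtype=bool)
    mask[top:feat_h - bottom, left:feat_w - right] = True
    return mask

mask = attention_mask(feat_h=40, feat_w=40, top=4, bottom=6, left=3, right=3)

# Every location with 9 anchors (Faster R-CNN) versus attended locations
# only with 4 anchors (2 scales x 2 aspect ratios).
full_proposals = 40 * 40 * 9
attended_proposals = int(mask.sum()) * 4
print(full_proposals, attended_proposals)
```

Proposals are then generated only at locations where the mask is true, so the saving comes from both fewer locations and fewer anchors per location.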

2.2.2 Hypothesis Classification: Contextual R-CNN

Once the region proposals are obtained using the selective attention RPN, we incorporate their coordinates, normalized by the image size, to train the detection classifier. This improves the detection accuracy by adding organ location information to the appearance information (i.e., the convolutional feature map). We named our detector with appended position information the contextual R-CNN. As demonstrated later, we found that adding this location information to the appearance information of the proposal also reduced the training time. We use the approximate joint training approach to train our network [6]. Specifically, the RPN and the contextual R-CNN networks are merged into a single framework during training, as shown in Fig. 1(a). At each iteration, the forward pass generates region proposals whose coordinates are fed to the contextual R-CNN detector. During back-propagation, the losses from the selective attention RPN and the contextual R-CNN are combined as explained below.
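One way to picture the appended position information is as a concatenation of each proposal's pooled appearance vector with its normalized box coordinates. The sketch below assumes NumPy, a hypothetical helper `append_box_context`, and an (x, y, w, h) box parameterization; the exact point in the network where the authors append the coordinates is not specified in the text.

```python
import numpy as np

def append_box_context(roi_features, boxes, image_w, image_h):
    """Concatenate normalized box coordinates to per-proposal features.

    roi_features: (n, d) array of pooled appearance features.
    boxes: (n, 4) array of (x, y, w, h) in pixels (assumed layout).
    Returns an (n, d + 4) array.
    """
    scale = np.array([image_w, image_h, image_w, image_h], dtype=np.float32)
    normalized = np.asarray(boxes, dtype=np.float32) / scale
    return np.concatenate([roi_features, normalized], axis=1)

# Two proposals with 8-D dummy features on a 600 x 600 image.
features = np.zeros((2, 8), dtype=np.float32)
boxes = np.array([[60, 120, 300, 480], [330, 120, 240, 450]])
contextual = append_box_context(features, boxes, image_w=600, image_h=600)
print(contextual.shape)
```

Normalizing by the image size keeps the four extra inputs in the range [0, 1], so they are comparable across images of different resolution.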

2.2.3 Loss Function

To train the RPN module, we assigned a binary class label (object/no object) to each proposal. A positive label was assigned to proposals whose Intersection over Union (IoU) overlap with any ground-truth bounding box exceeded an upper threshold. A negative label was assigned to proposals whose IoU remained below a lower threshold for all ground-truth bounding boxes. Proposals with IoU between the two thresholds were considered neutral and were not used for training. Using these definitions and adopting the approximate joint training approach, we minimized the objective function given as eq. (1).

In eq. (1), $i$ was the proposal index in the mini-batch and $p_i$ was the predicted probability that the $i$-th proposal was not background. $p_i^*$ was the ground-truth label for the proposal ($1$ if the proposal was positive, $0$ if negative). $t_i$ was the vector representing the 4 coordinates of the predicted bounding box, and $t_i^*$ was the coordinate vector associated with the ground-truth bounding box. The classification loss $L_{cls}$ was the logarithmic loss over two classes (object/background), while the regression loss $L_{reg}$ was the smooth $L_1$ loss defined in [5]. $L_{reg}$ was only activated for positive proposals ($p_i^* = 1$) and disabled otherwise ($p_i^* = 0$). An indicator function defined the attention region (Fig. 1(a)) of the convolutional feature map, which was our search space; it ensured that the loss in eq. (1) was calculated using only the proposals within that region. $\lambda$ was the weighting parameter between the classification and the regression loss. Although we set $\lambda$ empirically, we did not observe large effects from using different values of $\lambda$, as also reported in [6].
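The objective itself is not reproduced in the text above. A form consistent with the term definitions and with the multi-task loss of [6] can be sketched as follows. The symbol $\mathcal{A}$ for the attention region, the indicator notation $\mathbb{1}_{\mathcal{A}}(i)$, and the omission of the normalization constants used in [6] are assumptions, not the authors' exact notation.

```latex
% Sketch of eq. (1): Faster R-CNN multi-task loss restricted to the attention region A.
L(\{p_i\}, \{t_i\}) = \sum_{i} \mathbb{1}_{\mathcal{A}}(i)
  \Big[ L_{cls}(p_i, p_i^*) + \lambda \, p_i^* \, L_{reg}(t_i, t_i^*) \Big]

% Regression term: smooth L1 loss of [5], summed over the 4 box coordinates.
L_{reg}(t_i, t_i^*) = \sum_{j=1}^{4} \mathrm{smooth}_{L_1}\big(t_{i,j} - t_{i,j}^*\big),
\qquad
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
  0.5\,x^2 & \text{if } |x| < 1, \\
  |x| - 0.5 & \text{otherwise.}
\end{cases}
```

Here $\mathbb{1}_{\mathcal{A}}(i)$ equals 1 when proposal $i$ lies inside the attention region and 0 otherwise, so proposals outside the search space contribute nothing to either term.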

3 Experiments

3.1 Datasets and Reference Standards

Our experiments were conducted on both publicly available data and datasets acquired in-house, covering a wide range of devices, age groups, and pulmonary pathologies. We used 247 publicly available chest radiographs (CXRs) from the Japanese Society of Radiological Technology (JSRT) dataset and 108 from the Belarus Tuberculosis Portal (BTP). In addition, after approval from the Internal Review Board, we used 313 posterior-anterior CXRs collected at our institution. Both the JSRT radiographs and the BTP images had a digital resolution of 12 bits. The ground truth labels of the lung field were prepared using the ITK-SNAP interactive software under the supervision of two expert pulmonologists.

3.2 Implementation Details

Training and validation were performed on a single scale, similar to Faster R-CNN. The images were rescaled to a maximum of 600 pixels on the shorter side. The stride in our framework was 16 pixels. We used 2 different scales (bounding-box areas) and 2 aspect ratios, 1:2 and 3:4, to calculate the proposals. Based on the statistics of the CXR imaging protocol (location, scale, and aspect ratio of the organ of interest), the feature map was shrunk from each of its four sides, which reduced the sliding window search space. Similar to Faster R-CNN, non-maximum suppression was applied to the proposals based on their cls layer scores. Our framework was implemented using TensorFlow with Keras, and trained using an NVIDIA Titan X GPU, CUDA 8.0, and cuDNN 6.0.
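The non-maximum suppression step mentioned above can be sketched as a greedy procedure over score-sorted boxes. This is a generic NumPy illustration with a hypothetical function name `nms`; the default IoU threshold of 0.7 is the value used in Faster R-CNN [6] and is an assumption here, since the text does not state the threshold used in this work.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression.

    boxes: (n, 4) float array of (x1, y1, x2, y2); scores: (n,) array.
    Returns the indices of the kept boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        # Discard boxes that overlap the best box too strongly.
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

In the example, the second box overlaps the first with an IoU of about 0.68, so it is suppressed at a threshold of 0.5 but would survive at 0.7.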

3.3 Results

Table 1 compares the performance of the proposed method with the current state of the art (Faster R-CNN [6] using the VGG-16 architecture) in terms of the number of proposals, execution time, and detection accuracy. Our framework obtained a Dice score that was a significant improvement over Faster R-CNN (Wilcoxon rank-sum test). Interestingly, the detection accuracy of Faster R-CNN was improved just by using the optimal scales and aspect ratios for the lung field from CXR. These scales and aspect ratios were determined empirically, as described in the previous section (Implementation Details). In terms of timing, the proposed framework ran at 0.15 s per image (6.67 fps), faster than Faster R-CNN at 0.21 s per image. Fig. 2 shows our qualitative results.
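For reference, the Dice score between a detected box and a ground-truth box can be computed as twice the intersection area divided by the sum of the two areas. The sketch below assumes the score is evaluated on axis-aligned bounding boxes given as (x1, y1, x2, y2); the function name `box_dice` is hypothetical and the text does not spell out how the authors computed the metric.

```python
def box_dice(a, b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) of two boxes (x1, y1, x2, y2)."""
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    total = area_a + area_b
    return 2.0 * inter / total if total > 0 else 0.0

# A detection shifted by half its width relative to the ground truth.
print(box_dice((0, 0, 10, 10), (5, 0, 15, 10)))
```

A perfect detection scores 1, a disjoint one scores 0, and the half-shifted example above scores 0.5.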

Method                                                        # proposals   Dice score   Timing (sec)
Faster R-CNN                                                  300                        0.21
Faster R-CNN with optimal aspect ratios (2) and scales (2)    300                        0.18
Proposed method                                               154                        0.15

Table 1: Performance comparison of the proposed method with the state of the art in terms of the number of proposals, detection accuracy, and execution time. The processing time includes non-maximum suppression, pooling, fully connected, and softmax layers.
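The relative savings implied by the table entries can be checked with a few lines of arithmetic. The numbers below are the rounded values from Table 1; the variable names are chosen for illustration.

```python
# Rounded values from Table 1.
time_baseline, time_proposed = 0.21, 0.15            # seconds per image
proposals_baseline, proposals_proposed = 300, 154

fps_proposed = 1.0 / time_proposed
time_reduction = (time_baseline - time_proposed) / time_baseline
proposal_reduction = (proposals_baseline - proposals_proposed) / proposals_baseline

print(f"{fps_proposed:.2f} fps")
print(f"{100 * time_reduction:.1f}% less time")
print(f"{100 * proposal_reduction:.1f}% fewer proposals")
```

From the rounded entries, the time reduction comes out at roughly 28.6%, close to the 27.53% reported in the abstract; the small difference may stem from rounding of the tabulated timings.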

Figure 2: Qualitative results obtained using the proposed method. The bounding boxes of the detected organs of interest (left/right lung field) and the confidence scores assigned by the algorithm are shown.

4 Conclusion

In this paper, we presented a new and general RPN with a contextual attention mechanism to generate region proposals efficiently and accurately for organ detection in medical images. We illustrated the improved performance of our framework on the detection of the lung field from a cohort of diverse chest radiographs with a variety of pathologies. We increased the classification accuracy by providing organ location information to the classifier. Our results also show that, by using critical organ localization and geometric information, region proposal evaluation becomes faster, with processing time reduced by 27.53% on average, to provide real-time results. Finally, our novel architecture improves the overall detection accuracy of the lung field by more than 7% in Dice score. In future work, we plan to extend our approach to 3D volumetric images, where this improvement in time will be extremely important in applications such as query processing.


  • [1] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1.    IEEE, 2001, pp. I–I.
  • [2] S. K. Zhou, J. Zhou, and D. Comaniciu, “A boosting regression approach to medical anatomy detection,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on.    IEEE, 2007, pp. 1–8.
  • [3] O. Pauly, B. Glocker, A. Criminisi, D. Mateus, A. M. Möller, S. Nekolla, and N. Navab, “Fast multiple organ detection and localization in whole-body MR Dixon sequences,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.    Springer, 2011, pp. 239–247.
  • [4] Y. Zheng and D. Comaniciu, “Marginal space learning,” in Marginal Space Learning for Medical Image Analysis.    Springer, 2014, pp. 25–65.
  • [5] R. Girshick, “Fast R-CNN,” arXiv preprint arXiv:1504.08083, 2015.
  • [6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [7] F. C. Ghesu, E. Krubasik, B. Georgescu, V. Singh, Y. Zheng, J. Hornegger, and D. Comaniciu, “Marginal space deep learning: efficient architecture for volumetric image parsing,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1217–1228, 2016.
  • [8] A. Akselrod-Ballin, L. Karlinsky, S. Alpert, S. Hasoul, R. Ben-Ari, and E. Barkan, “A region based convolutional network for tumor detection and classification in breast mammography,” in Deep Learning and Data Labeling for Medical Applications.    Springer, 2016, pp. 197–205.
  • [9] T. S. Ferguson, “An inconsistent maximum likelihood estimate,” Journal of the American Statistical Association, vol. 77, no. 380, pp. 831–834, 1982.