Universal Lesion Detection (ULD) in computed tomography (CT) images [26, 27, 25, 32, 18, 17, 22, 7], which aims to localize different types of lesions instead of identifying lesion types [8, 10, 21, 23, 1, 16, 15, 11, 20, 31, 6, 12], plays an essential role in computer-aided diagnosis (CAD) systems. Recently, deep learning-based detection approaches have achieved excellent results for ULD using candidate bounding boxes (BBoxs), or anchors, as proposals. However, empirical evidence shows that anchor-based proposals lead to severe data imbalance (e.g., class and spatial imbalance) [13], which in turn causes a high false-positive (FP) rate in ULD. Therefore, there is an urgent need to reduce the FP proposals and improve lesion detection performance.
Most existing ULD methods are inspired by deep models that succeeded in object detection on natural images. Tang et al. [17] constructed a pseudo mask for each lesion region as extra supervision to adapt Mask R-CNN [4] for ULD. Yan et al. [22] proposed a 3D Context Enhanced (3DCE) R-CNN model that builds on a model pre-trained on ImageNet [2] for 3D context modeling. Li et al. [7] proposed MVP-Net, a multi-view feature pyramid network (FPN) [9] with position-aware attention that incorporates multi-view information for ULD. Han et al. [3] leveraged cascaded multi-task learning to jointly optimize object detection and representation learning.
All the above ULD approaches build on a two-stage anchor-based framework, i.e., proposal generation followed by classification and regression, like Faster R-CNN [14]. They achieve good performance because: (i) the anchoring mechanism is a good receptive field initialization for datasets with limited data and limited lesion categories; and (ii) the two-stage mechanism is a coarse-to-fine mechanism for the CT lesion dataset, which contains only two categories (‘lesion’ or not), i.e., it first finds lesion proposals and then removes the FP proposals. However, such a framework has two main limitations for effective ULD.
(i) The imbalanced anchors in stage-1 (e.g., class and spatial imbalance [13]). In the first stage, anchor-based methods select the positive (lesion) anchors and use them as region of interest (ROI) proposals according to the intersection over union (IoU) between anchors and ground-truth (GT) BBoxs. Specifically, an anchor is considered positive if its IoU with a GT BBox is greater than the IoU threshold and negative otherwise. Hence, the number of positive anchors is determined by the IoU threshold and the number of GT BBoxs per image. This scheme yields enough positive anchors on natural images, which may contain many GT BBoxs per image, but it is not suitable for ULD: most CT slices have only one or two GT lesion BBox(s), so the number of positive anchors is rather limited. This limitation can cause severe data imbalance and impair the training convergence of the whole network. Using a lower IoU threshold is a simple way to get more positive anchors, but labeling many low-IoU anchors as positive can also lead to a high FP rate in ULD.
(ii) The insufficient supervision in stage-2. In the second stage, each ROI proposal (selected anchor) from the first stage has one corresponding classification score that represents the probability of containing a lesion. The ROI proposals with high classification scores are chosen to obtain the final BBox prediction. ULD is challenging because lesions and other tissues have similar appearances (e.g., intensity and texture), so non-lesion regions can also receive very high scores. Hence, a single classification score can easily lead to FPs in ULD.
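To make the stage-1 labelling rule concrete, the following is a minimal sketch of IoU-based positive-anchor selection. It is an illustration rather than the implementation used in any of the cited detectors: the function names, the toy anchor grid, and the 0.7 and 0.5 thresholds are assumptions chosen for the example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_positive_anchors(anchors, gt_boxes, iou_threshold):
    """Count anchors whose best IoU with any GT box exceeds the threshold."""
    return sum(
        1 for a in anchors
        if max(iou(a, g) for g in gt_boxes) > iou_threshold
    )


# Hypothetical toy setup: 32x32 anchors on a stride-8 grid over a 128x128 slice,
# with a single GT lesion box, as is typical for a CT slice.
anchors = [(x, y, x + 32, y + 32)
           for x in range(0, 97, 8) for y in range(0, 97, 8)]
gt_boxes = [(40, 40, 72, 72)]

n_total = len(anchors)
n_pos_strict = count_positive_anchors(anchors, gt_boxes, 0.7)
n_pos_loose = count_positive_anchors(anchors, gt_boxes, 0.5)
print(n_total, n_pos_strict, n_pos_loose)
```

In this toy case only 1 of 169 anchors is positive at a threshold of 0.7, and lowering the threshold to 0.5 raises that to 5, all of them shifted copies of the GT box. This is the trade-off described above: more positives, but looser ones.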
To address the anchor-imbalance problem, anchor-free methods [19, 30] solve detection in a per-pixel prediction manner and succeed on natural images with sufficient data and object categories. For lesion detection (lesion or not) with limited data, however, they lack the needed precision. To overcome the insufficient-supervision problem, Mask R-CNN-like [4] methods add a mask branch that introduces extra segmentation supervision and hence improves detection performance, but this requires training segmentation masks that are costly to obtain.
In this paper, we present a continuous bounding map (BM) representation that enables per-pixel prediction in the first stage and introduces extra supervision in the second stage of any anchor-based detection method. Our first contribution is a new box-to-map representation, which represents a BBox by three 2D bounding maps (BMs) along three different directions (axes): the x-direction ($BM_x$), the y-direction ($BM_y$), and the xy-direction ($BM_{xy}$), as shown in Fig. 1. The pixel values in $BM_x$ and $BM_y$ decrease linearly from the centerline to the BBox borders in the x and y directions, respectively, while the pixel values in $BM_{xy}$ decrease in both directions. Compared with a sharp binary representation (e.g., binary anchor labels in the region proposal network (RPN), binary segmentation masks in Mask R-CNN [4]), such a soft continuous map provides a more detailed representation of location. This per-pixel, continuous representation encourages the network to learn more contextual information, thereby reducing FPs. Our second contribution is to expand the capability of a two-stage anchor-based detection method with our BM representation in a lightweight way. First, we use $BM_{xy}$ as the GT of a positive anchor in the first stage, as in Fig. 2, and choose a proper IoU threshold to deal with the anchor imbalance problem. Second, we add an additional branch, called the BM branch, in parallel with the BBox branch in the second stage, as in Fig. 3. The BM branch introduces extra pixel-wise supervision of detailed location to the whole network and thus decreases the FP rate in ULD. We conduct extensive experiments on the DeepLesion dataset [24] with four state-of-the-art ULD methods to validate the effectiveness of our method.
As shown in Fig. 2, we utilize BMs to reduce the ULD FP rate by replacing the original positive anchor class labels in stage-1 and adding a BM branch to introduce extra pixel-wise location supervision in stage-2. Section 2.1 details the BM representation and Section 2.2 defines the anchor labels for RPN training based on our BMs. Section 2.3 explains the newly introduced BM branch.
2.1 Bounding maps
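Motivated by [30], the BMs are formed from all-zero maps by changing only the values of pixels located within the BBox(s), as in Fig. 1. Let $(x_1^{(i)}, y_1^{(i)}, x_2^{(i)}, y_2^{(i)})$ be the $i$-th lesion GT BBox of one CT image $I_{ct} \in \mathbb{R}^{W \times H}$. The set of coordinates within the $i$-th BBox can be denoted as:
$$\Omega_i = \{(x, y) \mid x_1^{(i)} \le x \le x_2^{(i)},\; y_1^{(i)} \le y \le y_2^{(i)}\},$$
and the center point of this BBox lies at $(x_{ctr}^{(i)}, y_{ctr}^{(i)}) = \left(\frac{x_1^{(i)} + x_2^{(i)}}{2}, \frac{y_1^{(i)} + y_2^{(i)}}{2}\right)$.
Within each BBox $i$, the pixel values in $BM_x^{(i)}$ and $BM_y^{(i)}$ decrease from 1 (center line) to 0.5 (border) in a linear fashion:
$$BM_x^{(i)}(x, y) = \begin{cases} 1 - k_x^{(i)} \left| x - x_{ctr}^{(i)} \right|, & (x, y) \in \Omega_i \\ 0, & \text{otherwise} \end{cases} \qquad BM_y^{(i)}(x, y) = \begin{cases} 1 - k_y^{(i)} \left| y - y_{ctr}^{(i)} \right|, & (x, y) \in \Omega_i \\ 0, & \text{otherwise} \end{cases}$$
where $k^{(i)}$ is the slope of the linear function in the x-direction or y-direction, which is calculated according to the GT BBox's width ($w^{(i)} = x_2^{(i)} - x_1^{(i)}$) or height ($h^{(i)} = y_2^{(i)} - y_1^{(i)}$):
$$k_x^{(i)} = \frac{1}{w^{(i)}}, \qquad k_y^{(i)} = \frac{1}{h^{(i)}}.$$
We take the sum of all the $BM_x^{(i)}$s and $BM_y^{(i)}$s to obtain the total $BM_x$ and $BM_y$ of one input image, respectively:
$$BM_x = \sum_{i=1}^{N} BM_x^{(i)}, \qquad BM_y = \sum_{i=1}^{N} BM_y^{(i)},$$
where $N$ is the number of GT BBox(s) of one CT image. Then the xy-direction BM can be generated by calculating the square root of the product of $BM_x$ and $BM_y$:
$$BM_{xy} = \sqrt{BM_x \odot BM_y},$$
where $\odot$ denotes the element-wise multiplication.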
By introducing the above BMs, we expect to promote network training and reduce FPs, because the proposed BMs offer a soft continuous map of the lesion rather than a sharp binary mask. Such a map conveys more contextual information about the lesion: not only its location but also a confidence cue. These properties are favourable for object detection tasks with irregular shapes and few GT BBox(s), such as ULD.
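The construction in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `bounding_maps`, the `(x1, y1, x2, y2)` pixel-coordinate box format, and the slope of one over the box width or height (which is what makes the value fall from 1 on the center line to 0.5 on the border) are choices made for this example.

```python
import numpy as np


def bounding_maps(height, width, boxes):
    """Build BM_x, BM_y and BM_xy for boxes given as (x1, y1, x2, y2) in pixels."""
    bm_x = np.zeros((height, width), dtype=np.float64)
    bm_y = np.zeros((height, width), dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    for x1, y1, x2, y2 in boxes:
        inside = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
        x_ctr, y_ctr = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        # Slope chosen so that the value is 1 on the center line and 0.5 on
        # the border; max(..., 1) guards against zero-sized boxes.
        k_x = 1.0 / max(x2 - x1, 1)
        k_y = 1.0 / max(y2 - y1, 1)
        # Pixels outside the box keep their value (a contribution of 0).
        bm_x += np.where(inside, 1.0 - k_x * np.abs(xs - x_ctr), 0.0)
        bm_y += np.where(inside, 1.0 - k_y * np.abs(ys - y_ctr), 0.0)
    # Element-wise geometric mean of the two directional maps.
    bm_xy = np.sqrt(bm_x * bm_y)
    return bm_x, bm_y, bm_xy


# One 20x20 box on a 64x64 slice; arrays are indexed as [row (y), column (x)].
bm_x, bm_y, bm_xy = bounding_maps(64, 64, [(10, 20, 30, 40)])
print(bm_x[30, 20], bm_x[30, 10], bm_xy[20, 10])
```

For the single box above, the center pixel of `bm_x` is 1, the pixel on the left border is 0.5, and the corner value of `bm_xy` is also 0.5, since it is the square root of 0.5 times 0.5.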
2.2 Anchor label in RPN
In the original two-stage anchor-based detection frameworks, the RPN is trained to produce object bounds and objectness classification scores at each position (or anchor centerpoint) of a $\frac{W}{R} \times \frac{H}{R}$ score map for a $W \times H$ input image, where $R$ is the output stride. During training, all anchors are first divided into three categories, positive (lesion), negative, and drop-out, based on their IoUs with the GT BBoxs. The GT labels of positive, negative, and drop-out anchors are then set to 1, 0, and -1, respectively, and only the positive and negative anchors are used for loss calculation in RPN training.
In our proposed method, we still use 0 and -1 as the GT class labels of negative and drop-out anchors, but we set the class label of each positive anchor to its corresponding value in $BM_{xy}$. For size consistency, we first resize $BM_{xy}$ to $BM_{xy}^{r}$, which matches the size of the RPN score map. The GT label of an anchor is therefore given as:
$$C(x, y, IoU) = \begin{cases} BM_{xy}^{r}(x, y), & IoU > T_{pos} \\ 0, & IoU < T_{neg} \\ -1, & \text{otherwise} \end{cases}$$
where $(x, y)$ is the centerpoint coordinates of an anchor in $BM_{xy}^{r}$, $IoU$ denotes the IoU between the anchor and the GT BBox, and $T_{neg}$ and $T_{pos}$ denote the negative and positive anchor thresholds, respectively.
Anchor classification loss function: For each anchor, the original RPN loss is the sum of an anchor classification loss and a BBox regression loss. However, the number of GT BBoxs in one CT slice is usually far smaller than in a natural image. Hence a proper RPN IoU threshold is hard to find for ULD: a higher IoU threshold causes the imbalanced-anchor problem, while a lower IoU threshold assigns the label 1 to too many low-IoU anchors and can lead to a high FP rate. Therefore, we replace the original anchor classification loss with our proposed anchor classification loss:
$$L_{act} = \begin{cases} L_2\big(\hat{C}(x, y),\, C(x, y, IoU)\big), & IoU > T_{pos} \text{ or } IoU < T_{neg} \\ 0, & \text{otherwise} \end{cases}$$
where $L_2$ is the norm-2 loss and $\hat{C}(x, y)$ is the objectness score predicted for the anchor.
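As an illustration of the labelling rule and loss in this section, the sketch below assigns soft labels to anchors and evaluates a squared-error loss over the positive and negative anchors only. It is a hedged example, not the authors' implementation: the function names are hypothetical, the negative threshold of 0.3 is an assumed value, and the mean-squared form is one plausible reading of "norm-2 loss".

```python
import numpy as np


def anchor_labels(bm_xy_resized, centers, ious, t_neg=0.3, t_pos=0.5):
    """Soft anchor labels: BM_xy value (positive), 0 (negative), -1 (drop-out).

    bm_xy_resized: 2D map already resized to the RPN score-map size.
    centers:       (N, 2) integer (x, y) anchor center coordinates on that map.
    ious:          (N,) best IoU of each anchor with any GT box.
    t_neg, t_pos:  assumed thresholds; only 0.5 is stated in the text.
    """
    labels = np.full(len(ious), -1.0)
    labels[ious < t_neg] = 0.0
    pos = ious > t_pos
    # Arrays are indexed as [row (y), column (x)].
    labels[pos] = bm_xy_resized[centers[pos, 1], centers[pos, 0]]
    return labels


def anchor_cls_loss(scores, labels):
    """Mean squared error over positive and negative anchors; drop-outs ignored."""
    used = labels >= 0
    if not used.any():
        return 0.0
    return float(np.mean((scores[used] - labels[used]) ** 2))


# Toy example: one positive, one negative, and one drop-out anchor.
bm = np.zeros((4, 4))
bm[1, 2] = 0.8
centers = np.array([[2, 1], [0, 0], [3, 3]])
ious = np.array([0.6, 0.1, 0.4])
labels = anchor_labels(bm, centers, ious)
loss = anchor_cls_loss(np.array([0.6, 0.2, 0.9]), labels)
print(labels, loss)
```

The drop-out anchor (IoU 0.4) receives the label -1 and contributes nothing to the loss, while the positive anchor is regressed toward 0.8 rather than toward a hard label of 1.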
2.3 Bounding map branch
As shown in Fig. 3, the BM branch is similar to the mask branch in Mask R-CNN [4]. It runs in parallel with the BBox branch and is applied separately to each ROI. The branch consists of four convolution layers, two deconvolution layers, and one final convolution layer. It takes the ROI proposal feature map $F_{ROI}$ as input and aims to obtain the $BM_x$ and $BM_y$ proposals, denoted by $\hat{BM}_x^{ROI}$ and $\hat{BM}_y^{ROI}$, respectively:
$$[\hat{BM}_x^{ROI}, \hat{BM}_y^{ROI}] = H_{BM}(F_{ROI}),$$
where $H_{BM}$ is the function of the BM branch.
BM branch loss function: For each ROI, $BM_x$ and $BM_y$ are first cropped based on the ROI BBox and resized to the size of the BM branch output to obtain $BM_x^{ROI}$ and $BM_y^{ROI}$. We then concatenate the two BMs into a multi-channel map and use it as the ground truth for the BM branch. The loss function of the BM branch for each ROI is therefore defined as a norm-2 loss:
$$L_{BM} = L_2\big([\hat{BM}_x^{ROI}, \hat{BM}_y^{ROI}],\, [BM_x^{ROI}, BM_y^{ROI}]\big).$$
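The target construction for the BM branch (crop, resize, concatenate) can be sketched as follows. This is an illustrative example under assumptions: the helper names are hypothetical, nearest-neighbour resizing stands in for whatever interpolation is actually used, the output size of 28 is borrowed from common Mask R-CNN settings rather than taken from the text, and the ROI is assumed to have positive width and height.

```python
import numpy as np


def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize of a 2D array (stand-in for any interpolation)."""
    rows = ((np.arange(out_h) + 0.5) * arr.shape[0] / out_h).astype(int)
    cols = ((np.arange(out_w) + 0.5) * arr.shape[1] / out_w).astype(int)
    rows = np.minimum(rows, arr.shape[0] - 1)
    cols = np.minimum(cols, arr.shape[1] - 1)
    return arr[np.ix_(rows, cols)]


def bm_branch_target(bm_x, bm_y, roi, out_size=28):
    """Crop BM_x and BM_y to the ROI (x1, y1, x2, y2) and stack as 2 channels."""
    h, w = bm_x.shape
    x1, x2 = (int(np.clip(round(v), 0, w - 1)) for v in (roi[0], roi[2]))
    y1, y2 = (int(np.clip(round(v), 0, h - 1)) for v in (roi[1], roi[3]))
    crop_x = bm_x[y1:y2 + 1, x1:x2 + 1]
    crop_y = bm_y[y1:y2 + 1, x1:x2 + 1]
    return np.stack([
        resize_nearest(crop_x, out_size, out_size),
        resize_nearest(crop_y, out_size, out_size),
    ])


def bm_branch_loss(pred, target):
    """Mean squared error between predicted and ground-truth 2-channel maps."""
    return float(np.mean((pred - target) ** 2))


# Toy maps: an 8x8 ramp and its transpose, cropped to a 4x4 ROI.
bm_x = np.arange(64, dtype=float).reshape(8, 8)
bm_y = bm_x.T.copy()
target = bm_branch_target(bm_x, bm_y, (2, 2, 5, 5), out_size=4)
print(target.shape, bm_branch_loss(target, target))
```

Because the maps are cropped per ROI, a proposal that only partly covers a lesion sees an off-center, truncated ramp as its target, which is the kind of pixel-wise location cue a single classification score cannot provide.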
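Full loss function: The full loss function of our method is given as:
$$L = \frac{1}{N_{anchor}} \sum_{a} \left( L_{reg}^{a} + L_{act}^{a} \right) + \frac{1}{N_{ROI}} \sum_{r} \left( L_{BBox}^{r} + L_{BM}^{r} \right),$$
where $L_{reg}$ and $L_{BBox}$ are the original box regression loss (in RPN) of the training (negative and positive) anchors and the BBox branch loss of the positive ROIs in Faster R-CNN [14], $L_{act}$ and $L_{BM}$ denote our anchor classification loss and BM branch loss for the training anchors and positive ROIs, and $N_{anchor}$ and $N_{ROI}$ are the numbers of training anchors and positive ROIs, respectively.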
3.1 Dataset and setting
We conduct experiments using the DeepLesion dataset [24]. The dataset is a large-scale CT dataset with 32,735 lesions on 32,120 axial slices from 10,594 CT studies of 4,427 unique patients. Unlike existing datasets that typically focus on one type of lesion, DeepLesion contains a variety of lesions with a wide range of diameters (from 0.21 to 342.5 mm). We rescale the 12-bit CT intensity range to [0, 255] using the window ranges proposed in the respective frameworks. Every CT slice is resized, and the slice intervals are interpolated to 2 mm. (We use a CUDA toolkit from [5] to speed up this process.) We conduct experiments on the official training, validation, and testing sets. Sensitivity at different numbers of FPs per image (FPPI) is used as the evaluation metric, and for brevity we mainly compare the sensitivity at 4 FPPI, as in [7].
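For readers unfamiliar with the metric, sensitivity at a given FPPI can be computed by sweeping the score threshold from high to low. The sketch below is a simplified illustration, not the official evaluation script: the function name and the input format (each detection already matched to a lesion id, or to `None` for a false positive) are assumptions, and score ties are not treated specially.

```python
def sensitivity_at_fppi(detections, num_lesions, num_images, fppi):
    """Best sensitivity reachable while keeping at most `fppi` FPs per image.

    detections: list of (score, lesion_id) pairs pooled over all images;
                lesion_id is None for a false positive.
    """
    hit, fps, best = set(), 0, 0.0
    # Lowering the threshold admits detections in order of decreasing score.
    for _, lesion_id in sorted(detections, key=lambda d: d[0], reverse=True):
        if lesion_id is None:
            fps += 1
        else:
            hit.add(lesion_id)  # a lesion found twice is counted once
        if fps / num_images <= fppi:
            best = max(best, len(hit) / num_lesions)
    return best


# Toy example: 2 images, 4 lesions, 6 pooled detections.
dets = [(0.9, "a"), (0.8, None), (0.7, "b"),
        (0.6, None), (0.5, None), (0.4, "c")]
print(sensitivity_at_fppi(dets, num_lesions=4, num_images=2, fppi=0.5))
print(sensitivity_at_fppi(dets, num_lesions=4, num_images=2, fppi=4))
```

In the toy example the sensitivity is 0.5 at 0.5 FPPI and 0.75 at 4 FPPI; the fourth lesion is never detected, so no threshold reaches 1.0.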
We use only horizontal flipping for training data augmentation and train all models with stochastic gradient descent (SGD) under a step learning-rate schedule that decays the base learning rate by a constant factor after two preset epochs. The models with our method use a lower positive anchor IoU threshold of 0.5, and the other network settings are the same as in the corresponding original models.
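Horizontal flipping has to mirror the GT BBoxs together with the image, and the BMs should then be generated from the flipped boxes. A minimal sketch, assuming inclusive pixel coordinates in `(x1, y1, x2, y2)` order and a hypothetical helper name `hflip`:

```python
import numpy as np


def hflip(image, boxes):
    """Flip an (H, W) or (H, W, C) image left-right and mirror its boxes."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    out = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    out[:, 0] = width - 1 - boxes[:, 2]
    out[:, 2] = width - 1 - boxes[:, 0]
    return flipped, out


image = np.arange(12).reshape(3, 4)
flipped, flipped_boxes = hflip(image, [(0, 1, 1, 2)])
print(flipped[0], flipped_boxes)
```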
3.2 Detection performance
We perform experiments with three state-of-the-art two-stage anchor-based detection methods to evaluate the effectiveness of our approach. We also include two state-of-the-art anchor-free natural-image detection methods for comparison.
3DCE. The 3D context enhanced region-based CNN (3DCE) [22] is trained with 9 or 27 CT slices to form the 9-slice or 27-slice 3DCE.
MVP-Net. The multi-view FPN with position-aware attention network (MVP-Net) [7] is trained with 3 CT slices to form the 3-slice MVP-Net.
Faster R-CNN. Similar to MVP-Net [7], we rescale an original 12-bit CT image with window ranges of [50,449], [-505,1980] and [446,1960] to generate three rescaled CT images. We then concatenate the three rescaled CT images into three channels to train a Faster R-CNN. The other network settings are the same as in the baseline MVP-Net.
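The three-window input can be sketched as below. Two assumptions are made and should be checked against the original MVP-Net settings: the input slice is already in Hounsfield units, and each pair is read as (window level, window width) rather than as (minimum, maximum) bounds. The function names are hypothetical.

```python
import numpy as np

# Assumed to be (window level, window width) pairs; if they are instead
# (min, max) bounds, pass them directly to window_to_uint8.
WINDOWS = [(50, 449), (-505, 1980), (446, 1960)]


def window_to_uint8(hu, lo, hi):
    """Clip a slice to [lo, hi] and rescale linearly to [0, 255]."""
    clipped = np.clip(hu.astype(np.float32), lo, hi)
    return np.round((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)


def three_window_image(hu):
    """Stack three differently windowed copies of one slice as channels."""
    channels = []
    for level, width in WINDOWS:
        lo, hi = level - width / 2.0, level + width / 2.0
        channels.append(window_to_uint8(hu, lo, hi))
    return np.stack(channels, axis=-1)


hu_slice = np.array([[-2000.0, 50.0, 3000.0]])
image = three_window_image(hu_slice)
print(image.shape, image.dtype)
```

Each channel emphasizes a different intensity range of the same slice, so a standard three-channel detector can consume them without architectural changes.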
As shown in Table 1, our method consistently improves the detection performance of all baselines at different FPPIs. The improvements for Faster R-CNN [14], 9-slice 3DCE, 27-slice 3DCE, and 9-slice FPN-3DCE are more pronounced than that for MVP-Net. This is because MVP-Net is designed to reduce the FP rate in ULD and already achieves relatively high performance. The anchor-free methods yield unsatisfactory results; we think the main reason is that they completely discard the anchor and two-stage mechanisms. Fig. 4 presents a case that illustrates how our method improves the performance of Faster R-CNN.
3.3 Ablation study
We provide an ablation study of the two key components of the proposed approach, i.e., with vs. without $BM_{xy}$ in stage-1 and with vs. without the BM branch ($BM_x$ and $BM_y$) in stage-2. We also compare linear BMs with Gaussian BMs. As shown in Table 2, using $BM_{xy}$ as the class label for positive anchors yields a 2.27% improvement over the Faster R-CNN [14] baseline. Further adding a BM branch for extra pixel-wise supervision accounts for another 1.14% improvement. Using both $BM_{xy}$ and the BM branch gives the best performance. Using Gaussian BMs instead of linear BMs does not bring improvement. Our method has only a minor effect on the inference time measured on a Titan XP GPU.
In this paper, we study how to overcome the two limitations of two-stage anchor-based ULD methods: the imbalanced anchors in the first stage and the insufficient supervision in the second stage. We first propose BMs to represent a BBox in three different directions, then use them to replace the original binary GT labels of positive anchors in stage-1 and to introduce additional supervision through a new BM branch in stage-2. We conduct experiments with several state-of-the-art baselines on the DeepLesion dataset, and the results show that our method boosts the performance of all the baselines.
[1] Normal appearance autoencoder for lung cancer detection and segmentation. In MICCAI, pp. 249–256.
[2] (2009) ImageNet: a large-scale hierarchical image database. In IEEE CVPR, pp. 248–255.
[3] (2019) Tattoo image search at scale: joint detection and compact representation learning. IEEE Trans. Pattern Anal. Mach. Intell. 41 (10), pp. 2333–2348.
[4] (2017) Mask R-CNN. In IEEE ICCV, pp. 2961–2969.
[5] (2019) 3D U$^2$-Net: a 3D universal U-Net for multi-domain medical image segmentation. In MICCAI, pp. 291–299.
[6] (2020) High-resolution chest X-ray bone suppression using unpaired CT structural priors. IEEE Trans. Med. Imag.
[7] (2019) MVP-Net: multi-view FPN with position-aware attention for deep universal lesion detection. In MICCAI, pp. 13–21.
[8] (2019) Evaluate the malignancy of pulmonary nodules using the 3-D deep leaky noisy-or network. IEEE Trans. Neural Netw. Learn. Syst. 30 (11), pp. 3484–3495.
[9] (2017) Feature pyramid networks for object detection. In IEEE CVPR, pp. 2117–2125.
[10] Automated pulmonary embolism detection from CTPA images using an end-to-end convolutional neural network. In MICCAI, pp. 280–288.
[11] (2019) 3DFPN-HS: 3D feature pyramid network based high sensitivity and specificity pulmonary nodule detection. In MICCAI, pp. 513–521.
[12] (2018) 3D anisotropic hybrid network: transferring convolutional features from 2D images to 3D anisotropic volumes. In MICCAI, pp. 851–858.
[13] (2020) Imbalance problems in object detection: a review. IEEE Trans. Pattern Anal. Mach. Intell.
[14] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99.
[15] (2019) Attentive CT lesion detection using deep pyramid inference with multi-scale booster. In MICCAI, pp. 301–309.
[16] (2019) NoduleNet: decoupled false positive reduction for pulmonary nodule detection and segmentation. In MICCAI, pp. 266–274.
[17] (2019) ULDor: a universal lesion detector for CT scans with pseudo masks and hard negative example mining. In IEEE ISBI, pp. 833–836.
[18] (2019) Improving deep lesion detection using 3D contextual and spatial attention. In MICCAI, pp. 185–193.
[19] (2019) FCOS: fully convolutional one-stage object detection. In IEEE ICCV, pp. 9627–9636.
[20] (2018) Automated pulmonary nodule detection: high sensitivity with few candidates. In MICCAI, pp. 759–767.
[21] (2019) Volumetric attention for 3D medical image segmentation and detection. In MICCAI, pp. 175–184.
[22] (2018) 3D context enhanced region-based convolutional neural network for end-to-end lesion detection. In MICCAI, pp. 511–519.
[23] (2019) MULAN: multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In MICCAI, pp. 194–202.
[24] (2018) Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In IEEE CVPR, pp. 9261–9270.
[25] (2020) 3D aggregated Faster R-CNN for general lesion detection. arXiv:2001.11071.
[26] (2019) 3D anchor-free lesion detector on computed tomography scans. arXiv:1908.11324.
[27] (2019) Lesion detection by efficiently bridging 3D context. In MLMI Workshop, pp. 470–478.
[28] (2017) Deep learning for medical image analysis. Academic Press.
[29] Medical image recognition, segmentation and parsing: machine learning and multiple object approaches. Academic Press.
[30] (2019) Objects as points. arXiv:1904.07850.
[31] (2018) DeepEM: deep 3D ConvNets with EM for weakly supervised pulmonary nodule detection. In MICCAI, pp. 812–820.
[32] (2019) Improving RetinaNet for CT lesion detection with dense masks from weak RECIST labels. In MICCAI, pp. 402–410.