Lesion detection plays an important role in computer-aided detection/diagnosis (CAD) systems. Early algorithms generally focused on one or a few particular lesion types. To promote the development of universal lesion detection algorithms, Yan et al.  built a large-scale dataset named DeepLesion, comprising lesions of multiple categories. The DeepLesion dataset was annotated with the Response Evaluation Criteria in Solid Tumors (RECIST) diameters, one of the most frequently used means of recording clinically meaningful findings in radiologists' clinical routine, owing to their prevailing adoption in cancer patient monitoring. As part of the RECIST guidelines, the diameter measurement comprises two lines: the first measures the longest diameter of the lesion, and the second indicates the longest diameter perpendicular to the first in the plane of measurement (see Fig. 1
for examples). A bounding box is also provided for each lesion in DeepLesion, computed to enclose the diameter measurement with a 5-pixel padding in each direction (i.e., left, top, right, and bottom). Using the DeepLesion dataset, various methods [2, 3, 4, 5, 6] have been proposed, advancing the state of the art in universal lesion detection. Although these methods yield promising results, lesion detection with bounding boxes suffers from three prominent drawbacks. First, a bounding box may not represent a lesion well. Second, the process of tuning anchor boxes can be laborious. Third, the bounding box is not the clinical standard for measuring lesion sizes.
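For concreteness, the box construction described above can be sketched as follows; `recist_to_bbox` is a hypothetical helper name, and the endpoint coordinates in the example are illustrative:

```python
import numpy as np

def recist_to_bbox(endpoints, padding=5):
    """Compute a bounding box enclosing the RECIST diameter endpoints
    with a fixed per-side padding, as done in DeepLesion.

    endpoints: iterable of four (x, y) pairs, the two endpoints of
               each of the two diameters, in pixel coordinates.
    Returns (x_min, y_min, x_max, y_max).
    """
    pts = np.asarray(endpoints, dtype=float)
    x_min, y_min = pts.min(axis=0) - padding   # pad left and top
    x_max, y_max = pts.max(axis=0) + padding   # pad right and bottom
    return x_min, y_min, x_max, y_max

# Example: a long diameter from (30, 40) to (70, 60) and its
# perpendicular diameter from (45, 58) to (55, 42).
box = recist_to_bbox([(30, 40), (70, 60), (45, 58), (55, 42)])
```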
Few works attempted to overcome one or several of the three drawbacks described above. Tang et al.  constructed a pseudo ellipse mask from the RECIST annotation of each lesion and adopted Mask R-CNN  to predict the pseudo mask. Zlocha et al.  improved the quality of the pseudo mask with GrabCut  and employed multi-scale dense supervision via a pseudo segmentation task to aid the detection task. Two main drawbacks of these works were that the pseudo mask was often inaccurate in segmenting the lesion, and extra overhead was incurred to learn it. Different from these works, a noteworthy work  proposed to directly predict the endpoints of the RECIST diameters, which is (as far as we know) the first to model these characteristics of the RECIST annotations. However, the method was semi-automatic, requiring the region of interest as a prerequisite, and the process was quite intricate, employing an extra spatial transformer network to learn a transformation that unifies the lesion orientation.
In this paper, we propose a conceptually straightforward network named RECIST-Net to detect the four extreme points (i.e., top-most, left-most, bottom-most, and right-most) and the center point of the RECIST diameters, overcoming all the limitations mentioned above. The extreme and center points sufficiently characterize a lesion. To learn these keypoints, we borrow ideas from ExtremeNet  and employ an HourglassNet  to regress the heatmaps of the extreme and center points, treating the task as a keypoint detection problem. At test time, we propose a purely geometry-based grouping strategy to produce a bounding box for each prediction. We evaluate our method on the DeepLesion dataset  and achieve a sensitivity of 92.49% at four false positives per image, outperforming all competing methods, including those using multi-task learning [6, 5, 9].
In this section, we present the details of the proposed RECIST-Net for detecting the extreme and center points of the RECIST diameters. An overview of the RECIST-Net architecture is presented in Fig. 2. We first briefly introduce the Hourglass backbone. Then, we present the design of our detection head for learning the extreme and center points of the RECIST diameters. Lastly, we describe a geometry-based grouping strategy to produce a bounding box for each detection.
2.1 Hourglass Backbone
Our RECIST-Net adopts the HourglassNet  as the backbone to detect the extreme and center points of the RECIST diameters. For the input, considering that neighboring slices provide important contextual information for differentiating lesions from non-lesions, we group three consecutive axial slices of a CT volume into a 3-channel image. In addition, Li et al.  demonstrated that CT images rendered with different window levels and widths can improve performance in detecting subtle lesions and reducing false positives (FPs). Inspired by this, we stack images rendered with three different configurations of window level and width onto the original image as input.
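A minimal sketch of one plausible arrangement of this input construction, assuming the CT volume is given in Hounsfield units; `apply_window` and `build_input` are hypothetical helper names, and the window settings shown are illustrative rather than the exact ones used in the paper:

```python
import numpy as np

def apply_window(hu_slice, level, width):
    """Clip a CT slice (in Hounsfield units) to a display window
    and rescale it to [0, 1]."""
    lo, hi = level - width / 2.0, level + width / 2.0
    return (np.clip(hu_slice, lo, hi) - lo) / (hi - lo)

def build_input(volume, z, windows):
    """Stack three consecutive axial slices, each rendered under
    every window setting, into a multi-channel image.

    volume:  (D, H, W) CT volume in HU.
    z:       index of the key slice.
    windows: list of (level, width) pairs.
    """
    channels = []
    for dz in (-1, 0, 1):  # neighboring slices provide 3D context
        s = volume[int(np.clip(z + dz, 0, volume.shape[0] - 1))]
        for level, width in windows:  # one channel per window view
            channels.append(apply_window(s, level, width))
    return np.stack(channels, axis=0)  # (3 * len(windows), H, W)

# Illustrative (level, width) settings, e.g. soft tissue, lung, brain.
x = build_input(np.random.randn(10, 64, 64) * 200, z=5,
                windows=[(50, 350), (-600, 1500), (40, 80)])
```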
2.2 Learning Extreme and Center Points in RECIST Diameters
In this work, we grasp the central concepts of CornerNet  and ExtremeNet . For the four extreme points and the center point, we regress one heatmap per keypoint type. The training is guided by a multi-peak Gaussian heatmap, where each keypoint defines the center of a Gaussian kernel and the standard deviation is set proportional to the object size [14, 12]. In order to balance the positive and negative locations, a modified focal loss is adopted for training, as in [14, 12]:

L_{det} = -\frac{1}{N} \sum_{x,y} \begin{cases} (1 - \hat{y}_{xy})^{\alpha} \log(\hat{y}_{xy}), & \text{if } y_{xy} = 1 \\ (1 - y_{xy})^{\beta} \, (\hat{y}_{xy})^{\alpha} \log(1 - \hat{y}_{xy}), & \text{otherwise} \end{cases}    (1)

where \hat{y}_{xy} is the predicted score at location (x, y), y_{xy} is the value of the ground-truth Gaussian heatmap, \alpha and \beta are hyper-parameters fixed to 2 and 4 during training, following [14], and N is the number of objects in the image.
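The loss above can be sketched in NumPy as follows, assuming `pred` holds predicted keypoint probabilities and `gt` the multi-peak Gaussian target with values of exactly 1 at keypoint locations:

```python
import numpy as np

def modified_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced focal loss of CornerNet/ExtremeNet.
    pred, gt: arrays of the same shape; gt == 1 at keypoints and
    gt in (0, 1) on the surrounding Gaussian bumps."""
    pos = gt == 1
    neg = ~pos
    pos_loss = ((1 - pred[pos]) ** alpha) * np.log(pred[pos] + eps)
    # The (1 - gt)^beta factor reduces the penalty for negatives
    # lying close to a ground-truth peak.
    neg_loss = ((1 - gt[neg]) ** beta) * (pred[neg] ** alpha) \
        * np.log(1 - pred[neg] + eps)
    n = max(pos.sum(), 1)  # number of keypoints (objects)
    return -(pos_loss.sum() + neg_loss.sum()) / n
```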
Similar to [14, 12], we additionally regress the keypoint offset for each extreme point to recover part of the information lost in the down-sampling process of the HourglassNet . We regress the offset maps with the smooth L1 loss on the locations of the ground-truth extreme points:

\Delta = \left( \frac{x}{s} - \left\lfloor \frac{x}{s} \right\rfloor, \; \frac{y}{s} - \left\lfloor \frac{y}{s} \right\rfloor \right)    (2)

where s is the down-sampling factor of the HourglassNet and (x, y) is the coordinate of the extreme point. Note that we omit the keypoint indexing in Eq. (2) for convenience. There is no offset prediction for the center point.
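A sketch of the offset target and the smooth L1 loss; the helper names are hypothetical, and the default stride of 4 (the usual Hourglass output stride) is an assumption:

```python
import numpy as np

def offset_target(x, y, s=4):
    """Fractional part lost when a full-resolution keypoint (x, y)
    is mapped onto the stride-s output grid (s=4 is an assumed
    Hourglass output stride)."""
    return np.array([x / s - x // s, y / s - y // s])

def smooth_l1(delta_pred, delta_gt):
    """Smooth L1 (Huber) loss: quadratic below 1, linear above."""
    d = np.abs(delta_pred - delta_gt)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5).sum()

# A keypoint at (13, 6) lands at grid cell (3, 1) with a residual
# offset that the network is trained to recover.
t = offset_target(13, 6)
```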
2.3 Grouping Strategy for Inference
Next, we present how we group the predicted heatmaps into detections in a purely geometric manner for inference. We simplify the grouping strategy used in ExtremeNet  into three main steps, as described below.
First, for each heatmap of the extreme points, we extract local peaks with a max-pooling operation to suppress similar scores in neighboring windows (named ExtractPeak in ExtremeNet ). After the suppression, we preserve only the top positions with scores greater than a threshold τ_p. Second, given four extreme points extracted from the corresponding heatmaps, their geometric center is calculated from the horizontal coordinates of the left/right points and the vertical coordinates of the top/bottom points. If this center is predicted with a high response in the center-point heatmap, i.e., greater than a threshold τ_c, we consider the four extreme points a valid candidate detection. We iterate over all possible combinations of the preserved peak positions in a brute-force manner (with a runtime of O(n^4), where n is the number of preserved extreme points per heatmap; this can be accelerated on a GPU). A combination score is computed by adding up the scores of the quadruple of extreme points and twice the score of the corresponding center point, and the top-scoring combinations are preserved as the initial detections. The settings of τ_p and τ_c in ExtremeNet  are adopted in this paper. Third, we refine the coordinates of the initial detections by adding the offset predicted at the corresponding location of the offset map to each extreme point. For a fair comparison with other methods, which have been evaluated against bounding boxes, we generate a tight bounding box enclosing each grouped quadruple of detected extreme points. Different from ExtremeNet , which additionally employs a multi-scale augmentation for inference, we only use the flip augmentation for computational efficiency. Lastly, a Soft-NMS  is employed to filter all augmented detection results.
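The first two steps can be sketched as follows; the function names are hypothetical, the threshold and top-k defaults are assumptions following ExtremeNet's reported settings, and ExtremeNet's additional geometric validity checks (e.g., the top point lying above the center) are omitted for brevity:

```python
import numpy as np
from itertools import product

def extract_peaks(heatmap, k=40, tau=0.1):
    """Keep local maxima (3x3 neighborhood) with score above tau;
    return the top-k as (score, y, x) tuples."""
    h, w = heatmap.shape
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y, x]
            nb = heatmap[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
            if v > tau and v == nb.max():
                peaks.append((v, y, x))
    return sorted(peaks, reverse=True)[:k]

def group_detections(t_pk, l_pk, b_pk, r_pk, center_map, tau_c=0.1):
    """Brute-force O(n^4) grouping: a quadruple of extreme points is
    valid if the center heatmap responds at its geometric center."""
    dets = []
    for (st, ty, tx), (sl, ly, lx), (sb, by, bx), (sr, ry, rx) in \
            product(t_pk, l_pk, b_pk, r_pk):
        cy = int(round((ty + by) / 2))  # center row from top/bottom
        cx = int(round((lx + rx) / 2))  # center column from left/right
        sc = center_map[cy, cx]
        if sc > tau_c:
            # combination score: four extremes + twice the center
            score = st + sl + sb + sr + 2 * sc
            dets.append((score, (tx, ty, lx, ly, bx, by, rx, ry)))
    return sorted(dets, reverse=True)
```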
The DeepLesion dataset  is used for our experiments, with the official split (70% for training, 15% for validation, and 15% for testing). As in , 35 noisy lesion annotations are removed. We evaluate the performance of all methods on the official test set. Following the general practice in previous works [6, 5, 9, 3, 17, 4, 16, 15, 2]
, we report the sensitivity at various FPs (0.5, 1, 2, 3, 4) per scan as the evaluation metric.
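This metric (one FROC operating point) can be sketched as follows, assuming detections across the test set have already been matched to ground-truth lesions; `sensitivity_at_fp` and the `is_tp` flags are hypothetical names for illustration:

```python
import numpy as np

def sensitivity_at_fp(scores, is_tp, n_lesions, n_images, fp_rate):
    """Sensitivity at a given average number of FPs per image.

    scores:  detection confidences over the whole test set.
    is_tp:   1 if the detection hits a ground-truth lesion, else 0.
    """
    order = np.argsort(scores)[::-1]          # sort by descending score
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)                       # true positives so far
    fp = np.cumsum(1.0 - hits)                 # false positives so far
    ok = fp <= fp_rate * n_images              # within the FP budget
    return tp[ok].max() / n_lesions if ok.any() else 0.0
```

Usage: with 2 lesions over 2 images, detections scored [0.9, 0.8, 0.7, 0.6] with hit flags [1, 0, 1, 0] give a sensitivity of 1.0 at 0.5 FPs per image, since both lesions are found before the FP budget of one is exceeded.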
During training, we set the input and output resolutions of the axial slices to fixed sizes in pixels. To alleviate overfitting, we use three data augmentation methods: random horizontal and vertical flipping, random scaling between 0.6 and 1.4, and random cropping. In the test phase, the input axial image is resized to the same fixed input size. The predicted bounding-box coordinates are enlarged with a 5-pixel padding, as done with the ground truth, for computing sensitivity. The multi-view input is consistent with the window settings in . We initialize our network using the weights of ExtremeNet  trained on COCO. The network is optimized with Adam. We train our RECIST-Net on two NVIDIA GeForce RTX 2080 Ti GPUs with a batch size of 11 for 55,550 iterations.
3.3 Comparison with State-of-The-Art Methods
We show the results in Table 1. From the table, we observe that our RECIST-Net outperforms all competing methods, including those using multi-task learning [6, 5, 7, 9]. The superior performance is attributed to the conceptually straightforward RECIST-based formulation for detection, which has not been exploited in previous works. Visual examples of the detected lesions on official test images are shown in Fig. 3, with the probability threshold set to 0.32, yielding 0.5 FPs per image. We can observe that lesions of varying sizes, appearances, and types are localized accurately.
We further analyze the detection performance with respect to lesion types and image properties on the official test set according to three criteria: 1) lesion type, 2) lesion diameter, and 3) slice interval of the CT scans. We show the results for the lesion type criterion in Table 2 and those for the other two criteria in Table 3. As shown, our method achieves the best performance on all metrics except for lesions with diameters greater than 30 mm, for which it achieves the second best. This is because, for larger objects, the center response map  may not be accurate enough: a shift of a few pixels can miss a detection and result in a false negative .
3.4 Ablation Study
We conduct an ablation study with respect to the ExtractPeak grouping strategy, Soft-NMS for post-processing, flip augmentation (FlipAug) for test-time augmentation, and the multi-view input, to identify how much these components contribute to the performance. The results are shown in Table 4. We observe that ExtractPeak and Soft-NMS bring the largest improvements to RECIST-Net; both are used to suppress similar scores in neighboring windows. This result confirms the central role of non-maximum-suppression-like methods in detection algorithms. It is worth noting that flip augmentation achieves a 7.66% improvement in sensitivity at four FPs, which may imply that detection under different orientations provides complementary information. The multi-view input achieves a further 1.81% improvement. This is reasonable, since different window levels and widths are used for reading CT scans of different body parts in clinical practice, and hence a multi-view setting can boost the performance of lesion detection across the body.
In this paper, we presented a formulation for universal lesion detection, implemented as RECIST-Net. RECIST-Net detects the four extreme points (i.e., top-most, left-most, bottom-most, and right-most) and the center point of a lesion in a keypoint detection manner. We hope this work will inspire researchers to develop more methods that conform to the way lesions are annotated in clinical practice.
5 Compliance with Ethical Standards
This research study was conducted retrospectively using human subject data made available in open access by the DeepLesion dataset . Ethical approval was not required, as confirmed by the license attached to the open-access data.
This work was supported by National Natural Science Foundation of China (Grant No. 61671399) and the Fundamental Research Funds for the Central Universities (Grant No. 20720190012). Shilei Cao, Dong Wei, Kai Ma, and Yefeng Zheng are employees of Tencent. The authors have no relevant financial or non-financial interest to disclose.
-  Ke Yan, Xiaosong Wang, Le Lu, and Ronald M. Summers, “DeepLesion: Automated mining of large-scale lesion annotations and universal lesion detection with deep learning,” J. of Med. Imaging, vol. 5, no. 3, pp. 036501, 2018.
-  Ke Yan, Mohammadhadi Bagheri, and Ronald M. Summers, “3D context enhanced region-based convolutional neural network for end-to-end lesion detection,” in MICCAI. Springer, 2018, pp. 511–519.
-  Qingbin Shao, Lijun Gong, Kai Ma, Hualuo Liu, and Yefeng Zheng, “Attentive CT lesion detection using deep pyramid inference with multi-scale booster,” in MICCAI. Springer, 2019, pp. 301–309.
-  Qingyi Tao, Zongyuan Ge, Jianfei Cai, Jianxiong Yin, and Simon See, “Improving deep lesion detection using 3D contextual and spatial attention,” in MICCAI. Springer, 2019, pp. 185–193.
-  Zihao Li, Shu Zhang, Junge Zhang, Kaiqi Huang, Yizhou Wang, and Yizhou Yu, “MVP-Net: Multi-view FPN with position-aware attention for deep universal lesion detection,” in MICCAI. Springer, 2019, pp. 13–21.
-  Ke Yan, Youbao Tang, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, and Ronald M. Summers, “MULAN: Multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation,” in MICCAI. Springer, 2019, pp. 194–202.
-  Youbao Tang, Ke Yan, Yuxing Tang, Jiamin Liu, Jing Xiao, and Ronald M. Summers, “ULDor: A universal lesion detector for CT scans with pseudo masks and hard negative example mining,” arXiv preprint arXiv:1901.06359, 2019.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, “Mask R-CNN,” in ICCV, 2017, pp. 2961–2969.
-  Martin Zlocha, Qi Dou, and Ben Glocker, “Improving RetinaNet for CT lesion detection with dense masks from weak RECIST labels,” arXiv preprint arXiv:1906.02283, 2019.
-  Carsten Rother, Vladimir Kolmogorov, and Andrew Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” in ACM Trans. Graphics. ACM, 2004, vol. 23, pp. 309–314.
-  Youbao Tang, Adam P. Harrison, Mohammadhadi Bagheri, Jing Xiao, and Ronald M. Summers, “Semi-automatic RECIST labeling on CT scans with cascaded convolutional neural networks,” in MICCAI. Springer, 2018, pp. 405–413.
-  Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl, “Bottom-up object detection by grouping extreme and center points,” in CVPR, 2019, pp. 850–859.
-  Alejandro Newell, Kaiyu Yang, and Jia Deng, “Stacked hourglass networks for human pose estimation,” in ECCV. Springer, 2016, pp. 483–499.
-  Hei Law and Jia Deng, “CornerNet: Detecting objects as paired keypoints,” in ECCV, 2018, pp. 734–750.
-  Xudong Wang, Zhaowei Cai, Dashan Gao, and Nuno Vasconcelos, “Towards universal object detection by domain attention,” in CVPR, 2019, pp. 7289–7298.
-  Xudong Wang, Shizhong Han, Yunqiang Chen, Dashan Gao, and Nuno Vasconcelos, “Volumetric attention for 3D medical image segmentation and detection,” in MICCAI. Springer, 2019, pp. 175–184.
-  Ning Zhang, Dechun Wang, Xinzi Sun, Pengfei Zhang, Chenxi Zhang, Yu Cao, and Benyuan Liu, “3D anchor-free lesion detector on computed tomography scans,” arXiv preprint arXiv:1908.11324, 2019.