RECIST-Net: Lesion detection via grouping keypoints on RECIST-based annotation

by   Cong Xie, et al.
Xiamen University

Universal lesion detection in computed tomography (CT) images is an important yet challenging task due to the large variations in lesion type, size, shape, and appearance. Considering that data in clinical routine (such as the DeepLesion dataset) are usually annotated with a long and a short diameter according to the Response Evaluation Criteria in Solid Tumors (RECIST), we propose RECIST-Net, a new approach to lesion detection in which the four extreme points and the center point of the RECIST diameters are detected. By detecting a lesion as keypoints, we provide a more conceptually straightforward formulation for detection, and overcome several drawbacks (e.g., requiring extensive effort in designing data-appropriate anchors and losing shape information) of existing bounding-box-based methods while exploring a single-task, one-stage approach compared to other RECIST-based approaches. Experiments show that RECIST-Net achieves a sensitivity of 92.49% at four false positives per image, outperforming other recent methods including those using multi-task learning.



1 Introduction

Lesion detection plays an important role in computer-aided detection/diagnosis (CAD) systems. Early algorithms generally focused on one or a few particular lesion types. To promote the development of universal lesion detection algorithms, Yan et al. [1] built DeepLesion, a large-scale dataset comprising lesions of multiple categories. The DeepLesion dataset was annotated with the Response Evaluation Criteria in Solid Tumors (RECIST) diameters, one of the ways radiologists most frequently record clinically meaningful findings in clinical routine, owing to its prevailing adoption for cancer patient monitoring. As part of the RECIST guidelines, the lesion diameters include two lines, with the first measuring the longest diameter of the lesion and the second indicating the longest diameter perpendicular to the first in the plane of measurement (see Fig. 1 for examples). A bounding box is also provided for each lesion in DeepLesion, which is computed to enclose the diameter measurement with a 5-pixel padding in each direction (i.e., left, top, right, and bottom). Using the DeepLesion dataset, various methods [2, 3, 4, 5, 6] have been proposed and have advanced the state of the art for universal lesion detection. Although these methods yielded promising results, lesion detection using bounding boxes suffers from three prominent drawbacks. First, the bounding box may not be a good representation for lesions. Second, the process of tuning anchor boxes can be laborious. Third, the bounding box is not the clinical standard for measuring lesion sizes.

Figure 1: Three example lesions of the DeepLesion [1] dataset annotated with the RECIST diameters (red for the long diameters and blue for short). The bounding boxes are computed to enclose the lesion measurements with a 5-pixel padding in each direction. Left to right: large to small lesions.
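As a small illustration of how such a box follows from the annotation, the sketch below computes the padded bounding box from the four endpoints of the two diameters. The function name `recist_to_bbox` and the example coordinates are hypothetical, and clipping to the image bounds is omitted because it is not described here.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def recist_to_bbox(endpoints: Sequence[Point], padding: float = 5.0) -> Box:
    """Enclose the four diameter endpoints and pad the box on every side."""
    if len(endpoints) != 4:
        raise ValueError("expected the 4 endpoints of the long and short diameters")
    xs = [p[0] for p in endpoints]
    ys = [p[1] for p in endpoints]
    return (min(xs) - padding, min(ys) - padding,
            max(xs) + padding, max(ys) + padding)


# Hypothetical lesion: long diameter (10, 40)-(60, 20), short diameter (30, 15)-(38, 45).
box = recist_to_bbox([(10, 40), (60, 20), (30, 15), (38, 45)])
```

The four extreme coordinates used here (left-most x, top-most y, right-most x, bottom-most y) are the same quantities that the extreme-point formulation predicts directly.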

Few works attempted to overcome one or several of the three drawbacks described above. Tang et al. [7] constructed a pseudo ellipse mask from the RECIST annotation for each lesion, and adopted Mask R-CNN [8] for predicting the pseudo mask. Zlocha et al. [9] improved the quality of the pseudo mask with GrabCut [10], and employed multi-scale dense supervision by a pseudo segmentation task to aid the detection task. Two main drawbacks of these two works were that the pseudo mask was often inaccurate in segmenting the lesion, and that extra overhead was incurred to learn the pseudo mask. Different from these works, a noteworthy work [11] proposed to directly predict the endpoints of the RECIST diameters, which is (as far as we know) the first work that modeled these characteristics of the RECIST annotations. However, the method was semi-automatic, in that the region of interest had to be provided as a prerequisite, and the process was quite intricate, in that an extra spatial transformer network was employed to learn the transformation for unifying the lesion orientation.

Figure 2: An overview of the proposed RECIST-Net for detecting the extreme and center points of the RECIST diameters.

In this paper, we propose a conceptually straightforward network named RECIST-Net to detect the four extreme points (i.e., top-most, left-most, bottom-most, right-most) and the center point of the RECIST diameters, overcoming all the limitations mentioned above. The extreme and center points can characterize the lesions. To learn these keypoints, we borrow ideas from the ExtremeNet [12] and employ an HourglassNet [13] to regress the heatmaps of the extreme and center points by treating the task as a keypoint detection problem. For testing, we propose a purely geometry-based grouping strategy to produce a bounding box for each prediction. We evaluate our method on the DeepLesion dataset [1], and achieve a sensitivity of 92.49% at four false positives per image, outperforming all competing methods including methods using multi-task learning [6, 5, 9].

2 Methodology

In this section, we present the details of the proposed RECIST-Net for detection of the extreme and center points of the RECIST diameters. An overview of the RECIST-Net architecture is presented in Fig. 2. We first briefly introduce the Hourglass backbone that we use. Then, we present the design of our detection head for learning the extreme and center points of the RECIST diameters. Lastly, a geometry-based grouping strategy is explored to produce a bounding box for each detection.

2.1 Hourglass Backbone

Our RECIST-Net adopts the HourglassNet [13] as the backbone to detect the extreme and center points of the RECIST diameters. For the input, considering that neighboring slices provide important contextual information for differentiating lesions from non-lesions, we group three consecutive axial slices of a CT volume into a 3-channel image. In addition, Li et al. [5] demonstrated that CT images with different window levels and widths could improve the performance in detecting subtle lesions and reducing false positives (FPs). Inspired by this, we stack images with three different configurations of window level and width to the original image as input.
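To make the input construction concrete, the following is a minimal, dependency-free sketch of intensity windowing and channel stacking. It is not the authors' implementation: the (level, width) pairs in `EXAMPLE_WINDOWS` are placeholder values chosen for illustration (the actual settings follow [5]), the function names are hypothetical, and how the original image is normalized before stacking is not specified here, so only the windowed channels are shown.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # one axial slice in Hounsfield units (HU)

# Hypothetical (level, width) pairs in HU, for illustration only.
EXAMPLE_WINDOWS: Tuple[Tuple[float, float], ...] = (
    (50.0, 400.0), (-600.0, 1500.0), (400.0, 1800.0))


def apply_window(slice_hu: Image, level: float, width: float) -> Image:
    """Clip HU values to [level - width/2, level + width/2], rescale to [0, 1]."""
    low = level - width / 2.0
    return [[min(max((v - low) / width, 0.0), 1.0) for v in row]
            for row in slice_hu]


def build_multi_view_input(slices_hu: Sequence[Image],
                           windows=EXAMPLE_WINDOWS) -> List[Image]:
    """Return one channel per (window, slice) pair for three consecutive slices."""
    if len(slices_hu) != 3:
        raise ValueError("expected three consecutive axial slices")
    channels: List[Image] = []
    for level, width in windows:
        for slice_hu in slices_hu:
            channels.append(apply_window(slice_hu, level, width))
    return channels


# Toy 1x3 "slices" containing air-like, soft-tissue-like, and very dense values.
slices = [[[-1000.0, 50.0, 3000.0]] for _ in range(3)]
channels = build_multi_view_input(slices)
```

With three windows and three slices, the sketch yields nine channels; a real pipeline would typically do the same with array operations rather than nested lists.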

2.2 Learning Extreme and Center Points in RECIST Diameters

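In this work, we adopt the central concepts of CornerNet [14] and ExtremeNet [12]. Given four extreme points and one center point, we regress a heatmap $\hat{Y} \in (0,1)^{H \times W}$ of width $W$ and height $H$ for each keypoint. The training is guided by a multi-peak Gaussian heatmap $Y \in [0,1]^{H \times W}$, where each keypoint defines the center of a Gaussian kernel and the standard deviation is set proportional to the object size [14, 12]. In order to balance the positive and negative locations, a modified focal loss is adopted for training, as in [14, 12]:

$$L_{det} = -\frac{1}{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \begin{cases} (1 - \hat{Y}_{ij})^{\alpha} \log(\hat{Y}_{ij}) & \text{if } Y_{ij} = 1, \\ (1 - Y_{ij})^{\beta} (\hat{Y}_{ij})^{\alpha} \log(1 - \hat{Y}_{ij}) & \text{otherwise,} \end{cases} \qquad (1)$$

where $\alpha$ and $\beta$ are hyper-parameters and fixed to $\alpha = 2$ and $\beta = 4$ during training, and $N$ is the number of objects in the image.

Similar to [14, 12], we additionally regress the keypoint offset for each extreme point to recover part of the information lost in the down-sampling process of the HourglassNet [13]. We regress the offset maps with the smooth $L_1$ loss on locations of the ground truth extreme point as:

$$L_{off} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{SL}_1\left(\Delta, \; \frac{\vec{x}}{s} - \left\lfloor \frac{\vec{x}}{s} \right\rfloor\right), \qquad (2)$$

where $\Delta$ is the predicted offset at that location, $s$ is the down-sampling factor in HourglassNet ($s = 4$ in our case), and $\vec{x}$ is the coordinate of the estimated extreme point. Note that we omit the indexing of $k$ in the $\vec{x}$ of Eq. (2) for convenience. There is no offset prediction for the center point.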
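The training targets and losses of this subsection can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: it builds a multi-peak Gaussian target, evaluates the penalty-reduced focal loss in the form used by CornerNet [14], and computes the fractional offset target lost by down-sampling. The function names are hypothetical, the defaults `alpha=2`, `beta=4`, and `stride=4` are assumed from [14, 12], and `sigma` is passed in directly instead of being derived from the object size.

```python
import math
from typing import List, Sequence, Tuple

Heatmap = List[List[float]]


def gaussian_heatmap(height: int, width: int,
                     keypoints: Sequence[Tuple[int, int]], sigma: float) -> Heatmap:
    """Multi-peak target: element-wise max of Gaussians centered on (x, y) keypoints."""
    target = [[0.0] * width for _ in range(height)]
    for kx, ky in keypoints:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2.0 * sigma ** 2))
                target[y][x] = max(target[y][x], g)
    return target


def focal_loss(pred: Heatmap, target: Heatmap, num_objects: int,
               alpha: float = 2.0, beta: float = 4.0, eps: float = 1e-12) -> float:
    """Penalty-reduced focal loss, normalized by the number of objects."""
    total = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == 1.0:  # positive location (exact keypoint)
                total += (1.0 - p) ** alpha * math.log(p + eps)
            else:  # negative location, down-weighted near a keypoint
                total += (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p + eps)
    return -total / max(num_objects, 1)


def offset_target(point: Tuple[float, float], stride: int = 4) -> Tuple[float, float]:
    """Fractional part lost when a coordinate is mapped to the output grid."""
    x, y = point
    return (x / stride - math.floor(x / stride), y / stride - math.floor(y / stride))


def smooth_l1(diff: float) -> float:
    """Smooth L1 penalty on a scalar residual."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


target = gaussian_heatmap(5, 5, [(2, 2)], sigma=1.0)
good = [[min(max(v, 1e-4), 1.0 - 1e-4) for v in row] for row in target]
flat = [[0.5] * 5 for _ in range(5)]
loss_good = focal_loss(good, target, num_objects=1)
loss_flat = focal_loss(flat, target, num_objects=1)
```

In this toy example a prediction that follows the target receives a lower loss than an uninformative constant prediction, which is the behavior the loss is meant to encourage.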

2.3 Grouping Strategy for Inference

Next, we present the strategy of how to group the prediction heatmaps into detections in a purely geometric manner for inference. We simplify the grouping strategy used in ExtremeNet [12] into three big steps as described below.

First, for each heatmap of the extreme points, we extract local peaks with a max-pooling operation of kernel size $3 \times 3$ to suppress similar scores in neighborhood windows (named ExtractPeak in ExtremeNet [12]). After the suppression, we preserve only the top positions with scores greater than a threshold $\tau_p$.

Second, given four extreme points denoted by $(t, l, b, r)$ in the corresponding heatmaps, their geometric center is calculated by $c = \left(\frac{l_x + r_x}{2}, \frac{t_y + b_y}{2}\right)$. If this center is predicted with a high response in the center-point heatmap, i.e., greater than a threshold $\tau_c$, we consider the extreme points as a valid candidate detection. In this paper . We then iterate over all possible combinations of the remaining peak positions in a brute force manner (though with a runtime of $O(n^4)$, where $n$ is the number of preserved extreme points in the corresponding heatmaps, it can be accelerated on a GPU). A combination score is computed by adding up the scores of each quadruple of extreme points and twice the score of the corresponding center point, and the top candidate combinations are preserved as the initial prediction results for detection. The corresponding settings in [12] are adopted in this paper.

Third, we refine the coordinates of the initial prediction results by adding an offset predicted at the corresponding location of the offset map to each predicted extreme point.

For a fair comparison with other methods which have been evaluated against bounding boxes, we generate a tight bounding box enclosing each grouped quadruple of detected extreme points. Different from ExtremeNet [12], which additionally employed a multi-scale augmentation for inference, we only use the flip augmentation for computational efficiency. Lastly, a Soft-NMS is employed to filter all augmented detection results.
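The first two grouping steps can be sketched as below. This is a simplified, pure-Python illustration under stated assumptions, not the authors' implementation: the function names, the `top_k` and `max_dets` defaults, and the threshold values in the example are hypothetical, a 3x3 neighborhood is assumed for peak extraction, and the check that the top point lies above the bottom point (and left before right) is a sanity filter added for this sketch. Offset refinement, flip augmentation, and Soft-NMS are left out.

```python
from itertools import product
from typing import List, Tuple

Heatmap = List[List[float]]
Peak = Tuple[int, int, float]  # (x, y, score)
Detection = Tuple[Tuple[int, int, int, int], float]  # ((l, t, r, b), score)


def extract_peaks(hm: Heatmap, thresh: float, top_k: int) -> List[Peak]:
    """Keep positions that are the maximum of their 3x3 window and exceed thresh."""
    h, w = len(hm), len(hm[0])
    peaks: List[Peak] = []
    for y in range(h):
        for x in range(w):
            v = hm[y][x]
            if v <= thresh:
                continue
            window = [hm[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            if v >= max(window):
                peaks.append((x, y, v))
    peaks.sort(key=lambda p: p[2], reverse=True)
    return peaks[:top_k]


def group_extreme_points(top: Heatmap, left: Heatmap, bottom: Heatmap,
                         right: Heatmap, center: Heatmap,
                         peak_thresh: float, center_thresh: float,
                         top_k: int = 40, max_dets: int = 100) -> List[Detection]:
    """Brute-force grouping of extreme-point peaks validated by the center heatmap."""
    t_pk, l_pk, b_pk, r_pk = (extract_peaks(hm, peak_thresh, top_k)
                              for hm in (top, left, bottom, right))
    detections: List[Detection] = []
    for t, l, b, r in product(t_pk, l_pk, b_pk, r_pk):  # O(n^4) combinations
        if t[1] > b[1] or l[0] > r[0]:  # sanity filter (assumption of this sketch)
            continue
        cx, cy = (l[0] + r[0]) // 2, (t[1] + b[1]) // 2  # geometric center
        c = center[cy][cx]
        if c > center_thresh:
            # Sum of the four extreme-point scores plus twice the center score.
            score = t[2] + l[2] + b[2] + r[2] + 2.0 * c
            detections.append(((l[0], t[1], r[0], b[1]), score))
    detections.sort(key=lambda d: d[1], reverse=True)
    return detections[:max_dets]


def single_peak(x: int, y: int, value: float, size: int = 9) -> Heatmap:
    """Toy heatmap with one non-zero response, used only for the example."""
    hm = [[0.0] * size for _ in range(size)]
    hm[y][x] = value
    return hm


dets = group_extreme_points(single_peak(4, 1, 0.9), single_peak(1, 4, 0.9),
                            single_peak(4, 7, 0.9), single_peak(7, 4, 0.9),
                            single_peak(4, 4, 0.8),
                            peak_thresh=0.1, center_thresh=0.1)
```

In the toy example, the four peaks group into a single box spanning (1, 1) to (7, 7) because their geometric center coincides with the center-heatmap response.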

3 Experiments

3.1 Dataset

The DeepLesion dataset [1] is used for experiments, with the official split (70% for training, 15% for validation, and 15% for test). As in [2], 35 noisy lesion annotations are removed. We evaluate performance of the methods on the official test set. Following the general practice in previous works [6, 5, 9, 3, 17, 4, 16, 15, 2], we report the sensitivity at various FPs (0.5, 1, 2, 3, 4) per scan as the evaluation metric.
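For readers unfamiliar with this metric, the sketch below shows one common way to read sensitivity off a free-response curve. It assumes that each detection has already been matched to a ground-truth lesion (or marked as a false positive) by some overlap criterion, which is not specified here; the function name and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (confidence score, matched lesion id, or None for a false positive)
ScoredDetection = Tuple[float, Optional[int]]


def sensitivity_at_fps(detections: Sequence[ScoredDetection],
                       num_lesions: int, num_images: int,
                       fp_rates: Sequence[float] = (0.5, 1, 2, 3, 4)) -> List[float]:
    """Sensitivity reached while the average FPs per image stays within each rate."""
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    found = set()  # lesions hit so far; duplicate hits are counted once
    false_positives = 0
    curve = []  # (FPs per image, sensitivity) after each detection
    for _, lesion_id in ordered:
        if lesion_id is None:
            false_positives += 1
        else:
            found.add(lesion_id)
        curve.append((false_positives / num_images, len(found) / num_lesions))
    return [max((sens for fp, sens in curve if fp <= rate), default=0.0)
            for rate in fp_rates]


# Toy example: 4 lesions over 2 images, 6 scored detections.
toy = [(0.9, 0), (0.8, None), (0.7, 1), (0.6, None), (0.5, None), (0.4, 2)]
sens = sensitivity_at_fps(toy, num_lesions=4, num_images=2, fp_rates=(0.5, 1, 2))
```

Lowering the score threshold admits more detections, so sensitivity can only grow as the allowed number of false positives per image increases.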

3.2 Implementation

During training, we set the input resolution of the axial slices to in pixels, and output resolution to in pixels. To alleviate the overfitting problem, we use three data augmentation methods: random flipping horizontally and vertically, random scaling between 0.6 and 1.4, and random cropping. In the test phase, the input axial image is resized to a fixed size of in pixels. The predicted bounding box coordinates are enlarged with a 5-pixel padding as done with the ground truth for computing sensitivity. The multi-view input is consistent with the window settings in [5]. We initialize our network using weights of ExtremeNet [12] trained on COCO. The network is optimized with Adam with a learning rate of . We train our RECIST-Net on two NVIDIA GeForce RTX 2080 Ti GPUs with a batch size of 11 for 55,550 iterations.

3.3 Comparison with State-of-The-Art Methods

We show the results in Table 1. From the table, we can observe that our RECIST-Net outperforms all competing methods, including those using multi-task learning [6, 5, 7, 9]. The superior performance is attributed to the conceptually straightforward RECIST-based formulation for detection, which has not been exploited in previous works. Visual examples of the detected lesions on official test images are shown in Fig. 3. The probability threshold is set to 0.32, yielding 0.5 FP per image. We can observe that lesions of varying size, appearance, and type are localized accurately.

FPs per scan                            0.5     1       2       3       4
3DCE, 27 slices [2]                     62.48   73.37   80.70   -       85.65
Faster-RCNN + DA [15]                   -       -       -       -       87.29
Deformable Faster-RCNN + VA [16]        69.1    77.9    83.8    -       -
3DCE + CS_Att, 21 slices [4]            71.4    78.5    84.0    -       87.6
Anchor-Free RPN [17]                    68.73   77.10   83.54   -       88.12
FPN + MSB (weights sharing) [3]         67.0    76.8    83.7    -       89.0
Improved RetinaNet [9]                  72.15   80.07   86.40   -       90.77
MVP-Net, 9 slices [5]                   73.83   81.82   87.60   89.57   91.30
MULAN [6]                               76.12   83.69   88.76   -       92.30
RECIST-Net (original image, 3 slices)   74.33   81.80   87.66   90.02   90.68
RECIST-Net (multi-view input)           76.14   83.71   89.62   91.69   92.49

Table 1: Sensitivity (%) at different false positives (FPs) per scan on the test set of the DeepLesion dataset [1]. Values for methods in comparison were reported in cited references based on the same train/validation/test split of the dataset. Note that MULAN [6] used extra tag supervision, and MVP-Net [5] used the same multi-view input as we do.

Figure 3: Visual results for lesion detection at 0.5 FP rate using RECIST-Net. Yellow boxes/points are ground truth, green are true positives, red are false positives, and pink boxes on the bottom corners show enlarged views.

We further analyze the detection performance on the official test set according to three criteria: 1) lesion type, 2) lesion diameter, and 3) slice interval of the CT scans. We show the results per lesion type in Table 2 and those for the other two criteria in Table 3. As shown, our method achieves the best performance on all metrics, except for lesions with diameters greater than 30 mm, for which it achieves the second best. This is because, for larger objects, the center response map may not be accurate enough to perform well, as a shift of a few pixels might miss a detection and result in a false negative [12].

Lesion type                   LU      ME      LV      ST      PV      AB      KD      BN
3DCE, 27 slices [2]           89.00   88.00   90.00   74.00   84.00   84.00   82.00   75.00
3DCE_CS_Att, 15 slices [4]    92.00   88.50   91.40   80.30   85.00   84.40   84.30   75.00
Anchor-Free RPN [17]          93.00   88.00   91.00   85.00   86.00   83.00   80.00   65.00
RECIST-Net                    94.36   94.33   94.29   86.85   89.73   90.01   93.56   87.04

Table 2: Sensitivity (%) at four FPs per scan on the test set of the DeepLesion dataset [1]. We report the results on eight types of lesions as 3DCE [2], including lung (LU), mediastinum (ME), liver (LV), soft tissue (ST), pelvis (PV), abdomen (AB), kidney (KD), and bone (BN). Corresponding results reported in the literature are included for comparison.

                                  Lesion diameter (mm)      Slice interval (mm)
                                  <10     10-30   >30       <2.5    >2.5
3DCE, 27 slices [2]               80.00   87.00   84.00     86.00   86.00
Anchor-Free RPN [17]              83.00   87.00   88.00     -       -
3DCE_CS_Att, 15 slices [4]        82.30   90.00   85.00     87.60   87.60
FPN+MSB (weights sharing) [3]     86.00   91.00   -         -       -
Improved RetinaNet [9]            88.35   91.73   93.02     -       -
RECIST-Net                        90.69   93.67   90.99     93.28   91.75

Table 3: Sensitivity (%) at four FPs per scan on the test set of DeepLesion [1]. We report results with different sizes and slice intervals of CT scans as 3DCE [2]. Corresponding results reported in the literature are included for comparison.

3.4 Ablation Study

We conduct an ablation study with respect to the ExtractPeak grouping strategy, Soft-NMS for post-processing, flip augmentation (FlipAug) for test time augmentation, and multi-view input, to identify how much these components contribute to the performance. The results are shown in Table 4. We can observe that ExtractPeak and Soft-NMS bring the largest improvements to RECIST-Net; both are used to suppress similar scores in neighborhood windows. The result confirms the central role of non-maximum-suppression-like methods in detection algorithms. It is worth noting that with flip augmentation, an improvement of 7.66 percentage points (from 83.02% to 90.68%) is achieved in sensitivity at four FPs. This may imply that detection with different orientations can provide complementary information. With multi-view input, a further improvement of 1.81 percentage points (from 90.68% to 92.49%) is achieved. This is reasonable, since different window levels and widths are used for the reading of CT scans of different body parts in clinical practice, and hence a multi-view setting can boost the performance of lesion detection across the body.
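To illustrate the kind of score suppression that Soft-NMS performs, the following is a minimal sketch of the Gaussian-decay variant. Which variant and which parameters are used in RECIST-Net is not stated here, so the decay form, `sigma`, and `score_thresh` below are assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes: Sequence[Box], scores: Sequence[float],
             sigma: float = 0.5, score_thresh: float = 0.001) -> List[Tuple[Box, float]]:
    """Gaussian Soft-NMS: decay overlapping scores instead of discarding boxes."""
    remaining = list(zip(boxes, scores))
    kept: List[Tuple[Box, float]] = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        box, score = remaining.pop(best)
        kept.append((box, score))
        decayed = []
        for other, s in remaining:
            s *= math.exp(-iou(box, other) ** 2 / sigma)  # stronger decay for higher overlap
            if s >= score_thresh:
                decayed.append((other, s))
        remaining = decayed
    return kept


# Two identical boxes and one distant box (toy values).
kept = soft_nms([(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)], [0.9, 0.8, 0.7])
```

In the toy example the duplicate box is not removed but its score is reduced well below that of the distant box, so it is ranked last.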

      ExtractPeak   Soft-NMS   FlipAug   Multi-View   0.5     1       2       3       4
(a)                                                   13.80   19.97   28.79   34.32   38.42
(b)   ✓                                               56.76   64.13   70.52   74.04   76.34
(c)   ✓             ✓                                 70.87   78.68   82.90   83.00   83.02
(d)   ✓             ✓          ✓                      74.33   81.80   87.66   90.02   90.68
(e)   ✓             ✓          ✓         ✓            76.14   83.71   89.62   91.69   92.49

Table 4: Ablation study on building components of RECIST-Net, where it is incrementally built up by adding to the baseline model (row (a)) one component at a time (rows (b)–(e)). Sensitivity (%) at different FPs per scan are reported.

4 Conclusion

In this paper, we presented a formulation for universal lesion detection, implemented as RECIST-Net. RECIST-Net detects four extreme points (i.e., top-most, left-most, bottom-most, right-most) and one center point of a lesion by treating detection as a keypoint detection problem. We hope this work will inspire researchers to develop more methods that are well matched to the way lesions are annotated in clinical routine.

5 Compliance with Ethical Standards

This research study was conducted retrospectively using human subject data made available in open access by [1]. Ethical approval was not required as confirmed by the license attached with the open access data.

6 Acknowledgments

This work was supported by National Natural Science Foundation of China (Grant No. 61671399) and the Fundamental Research Funds for the Central Universities (Grant No. 20720190012). Shilei Cao, Dong Wei, Kai Ma, and Yefeng Zheng are employees of Tencent. The authors have no relevant financial or non-financial interest to disclose.