3D Anchor-Free Lesion Detector on Computed Tomography Scans

by Ning Zhang, et al.
UMass Lowell

Lesions are injuries and abnormal tissues in the human body. Detecting lesions in 3D Computed Tomography (CT) scans can be time-consuming even for very experienced physicians and radiologists. In recent years, CNN-based lesion detectors have demonstrated great potential. Most current state-of-the-art lesion detectors employ anchors to enumerate all possible bounding boxes with respect to the dataset at hand. This anchor mechanism greatly improves detection performance but also constrains the generalization ability of detectors. In this paper, we propose an anchor-free lesion detector. The anchor mechanism is removed and lesions are formalized as single keypoints. By doing so, we observe a considerable performance gain in terms of both accuracy and inference speed compared with the anchor-based baseline.




I Introduction

Computed Tomography (CT) scans capture inner details of the human body by emitting a series of narrow X-ray beams. X-ray absorption differs considerably across the tissues of the human body, which allows physicians and radiologists to examine healthy organs as well as abnormal lesions.

Lesions are injuries and abnormal tissues. They can occur in different organs and body regions such as the lungs, liver, abdomen, and bones, and are often the early-stage manifestations of fatal diseases such as cancer and tuberculosis. Detecting lesions at an early stage is believed to improve the cure rate and survival rate. Compared with healthy tissues, lesions often present distinctive visual properties in CT scans. For instance, pulmonary (lung) nodules (also referred to as coin lesions) are often small and round or oval-shaped, with an isolated absorption of X-ray (measured in Hounsfield Units). With these properties, it is possible for machines to detect lesions automatically from CT scans.

Before the advent of CNNs, people mainly resorted to different types of morphology features such as Shape Index (SI), Curvedness (CV) [6, 14, 8] and other well-designed features [16] for this task. These features are devised entirely from human knowledge and have long played an important role. However, the limitation is obvious: it is not easy for a human to enumerate all lesion appearances in the real world. One direct result is a relatively low recall rate.

In recent years, this feature engineering process has been replaced by deep convolutional neural networks [5, 7, 21, 2, 4, 1, 10, 17], where rich features are learned automatically. Most CNN-based lesion detectors [19, 18, 11] adopt the anchor mechanism to enumerate all possible bounding box templates (anchors) with respect to the dataset in terms of aspect ratio and size. These anchors can greatly improve recall but also cause massive numbers of false positives. Moreover, these false positives put heavy pressure on Non-Maximum Suppression, making inference slow. Another issue with the anchor mechanism is that the anchor configuration must fit the characteristics of the dataset well. Otherwise, a large degradation in performance can occur. This issue becomes more severe when the objects of interest are very small [3].

The anchor-free idea seems to fit our task well. One reason is that it saves the effort of finding the best anchor settings when the detector is applied to a different dataset. The other reason lies in the observation that lesions in 3D CT scans do not overlap with each other. Thus, we think overlapping anchors may not be necessary for our task.

Our major contribution in this paper is that we are the first to propose a 3D anchor-free architecture for the lesion detection task.

Fig. 1: The architecture of the whole network. The network has a “U” structure and consists of an upstream and a downstream pathway. Upstream and downstream features are concatenated before being forwarded to the next layer. Detection heads are attached to the combined features. Note that the number of proposals per location is 1 for the anchor-free setting and 3 for the anchor-based setting.

II Related Work

II-A Lesion Detection

Pulmonary nodule detection has been well studied for years. Liao et al. [11] proposed a 3D U-net [15] for the nodule detection task. This U-net contains upstream and downstream pathways, which are similar to those of the Feature Pyramid Network (FPN) [12]. However, the major differences lie in the upsampling operations in the top-down pathway and in how upstream and downstream features are combined. In particular, U-net adopts transposed convolutions while FPN employs parameter-free interpolation. To combine upstream and downstream features, the former leverages concatenation while the latter uses element-wise addition. Note that CT scans are usually too large to be processed whole. Thus, only sub-cubes of the original scans (typically 128 × 128 × 128, about 1/6 of the whole image) are fed in as the input. Zhu et al. [23] replaced the ResNet building block with the DualNet block and reported better performance.

It is reported in [19] that 3D CNNs may perform poorly when only part of the CT scan is provided, as in the DeepLesion dataset [20]. Therefore, instead of employing 3D CNNs, Yan et al. [19] proposed a 3D context enhanced 2D CNN for the general lesion detection task. The 3D context is obtained by stacking CNN features extracted from neighboring slices. One drawback is that this approach requires the “key slice” to be known in advance, whereas in real application scenarios the key slice is unknown, which limits its practicality.

II-B Anchor-Free Detectors

Recently, anchor-free detectors have demonstrated great potential and gained much attention [22, 9]. The main motivation is to remove the hassle of devising the best anchors (bounding box templates) for the dataset at hand. In these anchor-free detectors, objects are represented as either a pair of keypoints at the corners (top-left and bottom-right) [9] or a single keypoint at the center [22]. To better detect these keypoints, enriching context information has proven to be critical. To this end, special corner pooling and center pooling operations are adopted. However, one issue with these pooling operations is that they are very slow.

In our paper, we do not employ these powerful yet slow pooling operations. One reason comes from the object size. In our case, lesions are often very small and the receptive field of the detection head is relatively large compared to the object size. Moreover, lesions do not overlap with each other, making this pooling less necessary.

III Our Approach

III-A Network Architecture

We employ a U-net structure built upon DenseNet [7] building blocks. Our detector is single-stage and a detection head is attached to each feature map. These heads have an identical structure but are independent of each other (no weight sharing). The whole network architecture is illustrated in Fig. 1. In this paper, proposals in both the anchor-based and anchor-free settings are encoded as 5-element vectors (a confidence score, the centroid coordinates, and the diameter). This encoding method is also adopted in field practice by physicians and radiologists when locating lesions. Therefore, the number of output channels of each detection head (as shown in Fig. 1) is five times the number of proposals per location, which is 3 for the anchor-based setting and 1 for the anchor-free setting.
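As a small illustration of the head sizing described here, the sketch below assumes each proposal is a 5-element vector and that a head emits one such vector per proposal per location. The helper name `head_output_channels` and the constant `VECTOR_SIZE` are hypothetical and only serve the example.

```python
# Assumed layout of one proposal: (score, x, y, z, diameter).
VECTOR_SIZE = 5


def head_output_channels(proposals_per_location: int) -> int:
    """Number of output channels of one detection head (hypothetical helper)."""
    if proposals_per_location < 1:
        raise ValueError("need at least one proposal per location")
    return VECTOR_SIZE * proposals_per_location


# The anchor-based model configures three anchors per feature scale,
# while the anchor-free model predicts a single keypoint per location.
anchor_based_channels = head_output_channels(3)
anchor_free_channels = head_output_channels(1)
```

Under these assumptions the anchor-free head is three times narrower than the anchor-based one, which is one plausible source of the reduced post-processing load.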

III-B Ground Truth Assignment

In our anchor-free design, we formulate an object as a center keypoint. We define a positive cube and a non-negative cube for each ground truth box. Center points located inside the positive cube, outside the non-negative cube, and in between are assigned as positive, negative, and ignored, respectively. In particular, consider an object $(x, y, z, d)$, where $(x, y, z)$ is the centroid and $d$ is the diameter. Suppose this object is assigned to a feature map with size $D$ and stride $s$. Assuming the input is a cube for simplicity, we have $D^3$ center points, one per feature-map location. The positive cube and the non-negative cube are both centered on the object, with the positive cube contained in the non-negative cube. The sizes of the two cubes are fixed in this paper. Note that the anchor-based baseline adopts the standard IoU-based algorithm to label individual anchors.
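The assignment rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the cube side lengths are taken to be `pos_scale * d` and `neg_scale * d`, both scale factors are hypothetical placeholders rather than the values used in this paper, and each feature-map index is mapped to the center of its cell in input coordinates.

```python
from itertools import product


def assign_labels(obj, size, stride, pos_scale, neg_scale):
    """Label every center point of a cubic feature map.

    obj is (x, y, z, d) in input-voxel coordinates. pos_scale and neg_scale
    are hypothetical factors: each cube has side length scale * d.
    Returns {(i, j, k): label} with 1 = positive, 0 = negative, -1 = ignored.
    """
    x, y, z, d = obj
    half_pos = pos_scale * d / 2.0
    half_neg = neg_scale * d / 2.0
    labels = {}
    for i, j, k in product(range(size), repeat=3):
        # Assumed mapping from a feature-map index to input coordinates.
        cx, cy, cz = ((n + 0.5) * stride for n in (i, j, k))
        # Chebyshev distance decides membership in an axis-aligned cube.
        dist = max(abs(cx - x), abs(cy - y), abs(cz - z))
        if dist <= half_pos:
            labels[(i, j, k)] = 1
        elif dist > half_neg:
            labels[(i, j, k)] = 0
        else:
            labels[(i, j, k)] = -1
    return labels


# Example: a 12 mm lesion centered in a 32^3 input, feature map of stride 4.
labels = assign_labels((16.0, 16.0, 16.0, 12.0), size=8, stride=4,
                       pos_scale=0.5, neg_scale=1.0)
num_positive = sum(1 for v in labels.values() if v == 1)
num_ignored = sum(1 for v in labels.values() if v == -1)
num_negative = sum(1 for v in labels.values() if v == 0)
```

The ignored band between the two cubes keeps ambiguous center points near the lesion border out of the classification loss.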

III-C Training Loss

The training loss consists of a classification part and a localization part. For the classification part, we use Focal Loss [13] for negative samples and Cross Entropy for positive samples. In addition, we follow [9] and penalize positive center points with an unnormalized Gaussian determined by the Euclidean distance to the ground truth centroid and the size of the object. Formally, given an object and a positive center point, the weight is defined as:


We use in this paper. After this, we have the following loss defined for the classification end:


where , is the ground truth.
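One plausible reading of the classification loss is sketched below. It is not the exact formulation of this paper: the Gaussian width `sigma_scale * d`, the multiplicative use of the weight on the cross-entropy term, and the focusing parameter `gamma = 2` are all assumptions made for illustration.

```python
import math


def gaussian_weight(center, obj, sigma_scale=0.5):
    """Unnormalized Gaussian of the distance from a center point to the centroid.

    sigma_scale is a hypothetical constant: sigma = sigma_scale * d.
    """
    x, y, z, d = obj
    sq_dist = sum((c - o) ** 2 for c, o in zip(center, (x, y, z)))
    sigma = sigma_scale * d
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def classification_loss(prob, is_positive, weight=1.0, gamma=2.0, eps=1e-7):
    """Weighted cross entropy for positives, focal loss for negatives."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    if is_positive:
        return -weight * math.log(prob)
    # The focal term down-weights easy negatives (prob close to 0).
    return -(prob ** gamma) * math.log(1.0 - prob)


# A positive center point 2 voxels away from the centroid of an 8 mm lesion.
w = gaussian_weight((12.0, 10.0, 10.0), (10.0, 10.0, 10.0, 8.0))
positive_loss = classification_loss(0.5, True, weight=w)
easy_negative_loss = classification_loss(0.1, False)
hard_negative_loss = classification_loss(0.9, False)
```

With this form, a center point exactly at the centroid receives weight 1, and the weight decays smoothly for positive points further from it.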

For the localization part, the offset targets are encoded with the stride of feature maps instead of the anchors. More formally, the offset target on feature map with stride is defined as follows:


For the localization part we adopt the Smooth L1 loss.
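To illustrate stride-based offset encoding together with the Smooth L1 loss, here is a minimal sketch. The particular target form (position offsets divided by the stride, and a log-scale diameter target) is an assumption borrowed from common detector practice, not necessarily the definition used in this paper; `beta` is the usual Smooth L1 transition point.

```python
import math


def encode_offsets(obj, center, stride):
    """Encode (x, y, z, d) relative to a center point, in units of the stride."""
    x, y, z, d = obj
    cx, cy, cz = center
    return (
        (x - cx) / stride,
        (y - cy) / stride,
        (z - cz) / stride,
        math.log(d / stride),  # assumed log-scale size target
    )


def decode_offsets(offsets, center, stride):
    """Invert encode_offsets to recover (x, y, z, d)."""
    tx, ty, tz, td = offsets
    cx, cy, cz = center
    return (cx + tx * stride, cy + ty * stride, cz + tz * stride,
            stride * math.exp(td))


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the elements of two equal-length sequences."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta  # quadratic near zero
        else:
            total += diff - 0.5 * beta  # linear for large errors
    return total


target = encode_offsets((18.0, 14.0, 16.0, 8.0), (16.0, 16.0, 16.0), stride=4)
recovered = decode_offsets(target, (16.0, 16.0, 16.0), stride=4)
```

Because the targets depend only on the stride of the feature map, no anchor sizes need to be tuned for a new dataset.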

| Model / FPs per image | 0.5 | 1 | 2 | 4 | 8 | 16 | Avg. | FROC | Inference Time |
|---|---|---|---|---|---|---|---|---|---|
| 3DCE, 27 slices [19] | 62.48 | 73.37 | 80.70 | 85.65 | 89.09 | 91.06 | 80.39 | - | - |
| Anchor-Based RPN | 65.74 | 73.89 | 80.99 | 86.56 | 91.40 | 94.40 | 82.17 | 0.708 | 1.95s |
| Anchor-Free RPN | 68.73 | 77.10 | 83.54 | 88.12 | 91.94 | 94.62 | 84.01 | 0.735 | 1.74s |

TABLE I: Sensitivity (%), FROC score and inference time (s/scan) on the DeepLesion dataset. Note that the results are not directly comparable with [19] because of the different task settings (2D vs 3D).

| Model | LU | ME | LV | ST | PV | AB | KD | BN | <10 | 10-30 | >30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3DCE, 27 slices [19] | 89 | 88 | 90 | 74 | 84 | 84 | 82 | 75 | 80 | 87 | 84 |
| Anchor-Based RPN | 91 | 88 | 87 | 80 | 85 | 80 | 80 | 69 | 82 | 88 | 80 |
| Anchor-Free RPN | 93 | 88 | 91 | 85 | 86 | 83 | 80 | 65 | 83 | 87 | 88 |

TABLE II: Sensitivity@4 (%) w.r.t. lesion type and size. Types include lung (LU), mediastinum (ME), liver (LV), soft tissue (ST), pelvis (PV), abdomen (AB), kidney (KD), and bone (BN), respectively. “<10”, “10-30” and “>30” represent lesion diameter ranges (mm).

Fig. 2: Visualization of lesion types: bone, kidney and soft tissue. Red and green circles are predicted and ground truth boxes respectively. Although it may be hard to see, the predicted boxes scored 0.985 (kidney) and 0.955 (soft tissue) fit the ground truth almost perfectly.

IV Experiments

We conduct experiments on the DeepLesion [20] dataset. This dataset is designed for general lesion detection and covers various types of lesions, including lung, mediastinum, liver, soft tissue, pelvis, abdomen, kidney, and bone. It contains 10,594 CT studies from 4,427 unique patients with 32,735 annotated lesions. The official training, validation and testing sets contain 22,901, 4,887 and 4,912 lesions respectively (noisy annotations are removed). Note that, for each lesion, DeepLesion only provides a 60mm Z-context chunk centered on the annotated slice (key slice).

Preliminary attempts in [19] indicated that 3D CNNs may not work well with the DeepLesion dataset. We think the reason is threefold: (1) Large out-of-bound lesions (> 48mm, 11% of all lesions) make localization hard (both the center position and the size). (2) The z-coordinate may not be accurate. As lesions are only annotated on a single center slice, the annotations become inaccurate when the slice interval is large. (3) Small lesions tend to be assigned over-sized bounding boxes. This again introduces noise when the 3D CNN regresses the size.

IV-A Data Pre-processing

We rescale (by interpolation) the 3D CT scans to an isotropic resolution (1mm in all directions). The large black borders of the image are removed by simple value clipping. During training, random crops of size 64 × 128 × 128 (zero-padded when necessary) are fed into the network, while in testing, a sliding-window cropping strategy is adopted. Detection results on these sub-crops are assembled to form the final result. The 2D annotations are approximately converted to 3D ones of the form {X, Y, Z, Diameter}.
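The sliding-window cropping at test time could look roughly like the sketch below. The 64 × 128 × 128 window comes from the text; the (z, y, x) axis order, the `overlap` between neighboring windows, and the helper names are assumptions made for illustration.

```python
def window_starts(length, window, step):
    """Start indices of windows that cover [0, length) along one axis."""
    if length <= window:
        return [0]  # a single window; the caller zero-pads the shortfall
    starts = list(range(0, length - window + 1, step))
    if starts[-1] + window < length:
        starts.append(length - window)  # final window flush with the border
    return starts


def sliding_windows(shape, window=(64, 128, 128), overlap=16):
    """Yield the (z, y, x) origin of each crop for a volume of the given shape.

    overlap is a hypothetical number of voxels shared by neighboring windows.
    """
    axes = [window_starts(n, w, w - overlap) for n, w in zip(shape, window)]
    for z in axes[0]:
        for y in axes[1]:
            for x in axes[2]:
                yield (z, y, x)


# A 60 mm chunk at 1 mm resolution with a 300 x 128 in-plane extent.
origins = list(sliding_windows((60, 300, 128)))
```

Detections from each crop would then be shifted by the crop origin before being merged into a single result for the scan.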

IV-B Training and Testing

Unlike [19], our task remains in the 3D object detection regime. We train the 3D CNN both with and without anchors. In the anchor-based model, we configure 3 anchors for each feature scale (strides 4, 8, 16), which are {3.0, 5.0, 7.0}, {10.0, 13.0, 17.0}, and {22.0, 30.0, 40.0} respectively. During training, very large lesions (> 48mm, 11% of the training data) are removed because of the regression issues. Preliminary experiments show that if these large lesions are included, training suffers from slow convergence and oscillating losses. During testing, very large lesions are included.

IV-C Evaluation

We detect lesions on each 60-mm z-axis CT image chunk. We use the free-response receiver operating characteristic (FROC) score to evaluate the performance, following the protocol of the LUNA16 challenge [16]. This FROC score is approximated by the average recall at 7 false positive rates: 1/8, 1/4, 1/2, 1, 2, 4, and 8 false positives per scan. In our case, a predicted box is counted as a true positive if its centroid is located inside the mass of the ground truth. In other words, the distance between the predicted and the real centroid must be less than the radius of the ground truth.
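The hit criterion and the averaged sensitivity can be sketched as follows. This is a simplified illustration, not the official LUNA16 evaluation script: it pools detections over all scans, assumes at most one true-positive detection per lesion, and uses no interpolation or bootstrapping. All function names are hypothetical.

```python
import math

FP_RATES = [1 / 8, 1 / 4, 1 / 2, 1, 2, 4, 8]


def is_hit(pred_center, gt):
    """True if the predicted centroid lies within the radius of the lesion."""
    x, y, z, d = gt
    dist = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_center, (x, y, z))))
    return dist < d / 2.0


def sensitivity_at(fp_rates, detections, num_lesions, num_scans):
    """Sensitivity at each allowed average number of false positives per scan.

    detections is a list of (score, is_true_positive) pooled over all scans.
    """
    ranked = sorted(detections, key=lambda det: det[0], reverse=True)
    results = []
    for rate in fp_rates:
        allowed_fps = rate * num_scans
        tp = fp = 0
        for _, hit in ranked:
            if hit:
                tp += 1
            elif fp + 1 > allowed_fps:
                break  # accepting this detection would exceed the FP budget
            else:
                fp += 1
        results.append(tp / num_lesions)
    return results


def froc_score(detections, num_lesions, num_scans):
    """Average sensitivity over the seven false-positive rates."""
    values = sensitivity_at(FP_RATES, detections, num_lesions, num_scans)
    return sum(values) / len(values)


toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
score = froc_score(toy, num_lesions=4, num_scans=2)
```

In the toy example, three of four lesions are found once two false positives are tolerated across the two scans, so the sensitivity saturates at 0.75 from 1 false positive per scan onward.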

IV-D Overall Performance

As we can see from Table I, 3D CNNs work well on this task. Note that [19] evaluates its model at the key slice, while we detect lesions without knowing the key slice in advance. Therefore, we argue that our task setting is more challenging. In addition, our anchor-free RPN outperforms the anchor-based RPN in terms of both accuracy and inference speed. Our inference time is measured on one NVIDIA Tesla K80.

IV-E Performance w.r.t. Lesion Type and Size

Following [19], we also report the performance with respect to lesion type and diameter. All results are summarized in Table II. We find that our 3D models (with and without anchors) do not perform well on bone and kidney lesions. On the other hand, our approach does not suffer the significant performance drop seen in [19] when detecting “Soft Tissue” lesions. We visualize these types in Fig. 2. Another observation is that the anchor-free design seems to be more tolerant of very large lesions than its anchor-based counterpart (“>30” in Table II). Again, we stress that our results are not directly comparable with [19].

V Conclusions

Our anchor-free design works well on the general lesion detection task in terms of both accuracy and inference speed. Compared with the anchor-based design, the anchor-free design is more robust to large lesions (potentially reaching boundaries). Even though our results cannot be directly compared with [19], we argue that our model can work in key-slice-agnostic scenarios, which is more practical for real applications.


  • [1] M. F. Alcantara, Y. Cao, C. Liu, B. Liu, M. Brunette, N. Zhang, T. Sun, P. Zhang, Q. Chen, Y. Li, et al. (2017) Improving tuberculosis diagnostics using deep learning and mobile health technologies among resource-poor communities in Perú. Smart Health 1, pp. 66–76.
  • [2] Y. Cao, C. Liu, B. Liu, M. J. Brunette, N. Zhang, T. Sun, P. Zhang, J. Peinado, E. S. Garavito, L. L. Garcia, and W. H. Curioso (2016) Improving tuberculosis diagnostics using deep learning and mobile health technologies among resource-poor and marginalized communities. In 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), pp. 274–281.
  • [3] C. Eggert, S. Brehm, A. Winschel, D. Zecha, and R. Lienhart (2017) A closer look: small object detection in Faster R-CNN. In Multimedia and Expo (ICME), 2017 IEEE International Conference on, pp. 421–426.
  • [4] Y. Gao, N. Zhang, H. Wang, X. Ding, X. Ye, G. Chen, and Y. Cao (2016) IHear food: eating detection using commodity bluetooth headsets. In 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE).
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • [6] C. I. Henschke, D. F. Yankelevitz, R. Mirtcheva, G. McGuinness, D. McCauley, and O. S. Miettinen (2002) CT screening for lung cancer: frequency and significance of part-solid and nonsolid nodules. American Journal of Roentgenology 178 (5), pp. 1053–1057.
  • [7] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Vol. 1, pp. 3.
  • [8] C. Jacobs, E. M. van Rikxoort, T. Twellmann, E. T. Scholten, P. A. de Jong, J. Kuhnigk, M. Oudkerk, H. J. de Koning, M. Prokop, C. Schaefer-Prokop, et al. (2014) Automatic detection of subsolid pulmonary nodules in thoracic computed tomography images. Medical image analysis 18 (2), pp. 374–384.
  • [9] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. CoRR abs/1808.01244.
  • [10] P. Li, Y. Luo, N. Zhang, and Y. Cao (2015) HeteroSpark: a heterogeneous CPU/GPU Spark platform for machine learning algorithms. In 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 347–348.
  • [11] F. Liao, M. Liang, Z. Li, X. Hu, and S. Song (2017) Evaluate the malignancy of pulmonary nodules using the 3d deep leaky noisy-or network. arXiv preprint arXiv:1711.08324.
  • [12] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Vol. 1, pp. 4.
  • [13] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.
  • [14] K. Murphy, B. van Ginneken, A. M. Schilham, B. De Hoop, H. Gietema, and M. Prokop (2009) A large-scale evaluation of automatic pulmonary nodule detection in chest ct using local image features and k-nearest-neighbour classification. Medical image analysis 13 (5), pp. 757–770.
  • [15] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
  • [16] A. A. A. Setio, A. Traverso, T. De Bel, M. S. Berens, C. van den Bogaard, P. Cerello, H. Chen, Q. Dou, M. E. Fantacci, B. Geurts, et al. (2017) Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge. Medical image analysis 42, pp. 1–13.
  • [17] X. Sun, N. Zhang, Q. Chen, Y. Cao, and B. Liu (2019) People re-identification by multi-branch cnn with multi-scale features. In 2019 26th IEEE International Conference on Image Processing (ICIP).
  • [18] Z. Xie (2018) Towards single-phase single-stage detection of pulmonary nodules in chest ct imaging. arXiv preprint arXiv:1807.05972.
  • [19] K. Yan, M. Bagheri, and R. M. Summers (2018) 3d context enhanced region-based convolutional neural network for end-to-end lesion detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 511–519.
  • [20] K. Yan, X. Wang, L. Lu, L. Zhang, A. P. Harrison, M. Bagheri, and R. M. Summers (2018) Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [21] N. Zhang, Y. Cao, B. Liu, and Y. Luo (2017) Improved multimodal representation learning with skip connections. In Proceedings of the 2017 ACM on Multimedia Conference, MM ’17, New York, NY, USA, pp. 654–662.
  • [22] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. CoRR abs/1904.07850.
  • [23] W. Zhu, C. Liu, W. Fan, and X. Xie (2018) Deeplung: deep 3d dual path nets for automated pulmonary nodule detection and classification. arXiv preprint arXiv:1801.09555.