3D Aggregated Faster R-CNN for General Lesion Detection

by Ning Zhang, et al.

Lesions are damaged or abnormal regions in the tissues of the human body. Many of them can later turn into fatal diseases such as cancers. Detecting lesions is of great importance for early diagnosis and timely treatment. To this end, Computed Tomography (CT) scans often serve as the screening tool, allowing us to leverage modern object detection techniques to detect the lesions. However, lesions in CT scans are often small and sparse. The local area of lesions can be very confusing, causing the region-based classifier branch of Faster R-CNN to fail easily. Therefore, most existing state-of-the-art solutions train two types of heterogeneous networks (multi-phase) separately for candidate generation and False Positive Reduction (FPR). In this paper, we enforce an end-to-end 3D Aggregated Faster R-CNN solution by stacking an "aggregated classifier branch" on the backbone of the RPN. This classifier branch is equipped with Feature Aggregation and Local Magnification Layers. We demonstrate that our model achieves state-of-the-art performance on both the LUNA16 and DeepLesion datasets. In particular, we achieve the best single-model FROC performance on LUNA16 with an inference time of 4.2s per processed scan.



I Introduction

Lesions are damaged or abnormal regions in the tissues of the human body. They can reside in different organs such as livers, bones, lung lobes and even soft tissues. Many of these lesions may develop into cancers. For instance, pulmonary (lung) nodules can grow into lung cancers. Therefore, effective detection of lesions plays an important role in early diagnosis and timely treatment of various cancers. In CT scans, lesions often present distinctive shapes, distinct levels of X-ray absorption (Hounsfield Unit values), and other internal structures. These visual properties allow both humans and machines to identify and locate lesions.

However, one challenging aspect is that lesions can be very small and sparse, whereas the whole image is massive. For example, a pulmonary nodule can have a diameter smaller than 5mm while a chest CT scan is normally larger than 20,000cm³ (the volume ratio is smaller than 1/100,000). Moreover, positive regions are highly sparse: a typical positive/negative region (anchor) ratio can be 1/10,000. Note that these issues also commonly exist in other medical image modalities such as Diabetic Retinopathy images [3] and Hematoxylin and Eosin (H&E) stained whole-slide images [2]. When detecting such small lesions, the false positive rate inevitably becomes high. Therefore, one always has to trade off high sensitivity against a low false positive rate [26].

As a result, lesion detectors often contain two major components: (1) candidate generation (region proposal) and (2) false positive reduction (FPR). In the early days, feature-engineering based methods played an important role. Popular features include Shape Index (SI), Curvedness (CV) [11, 21, 14] and other morphology features [26]. These features help to indicate suspicious regions. Thresholding procedures are then applied to reduce the false positive rate and strike a balance between recall and precision. Recently, CNN based approaches have come to dominate this task. These CNNs can be either 2D [30, 25] or 3D [15, 35, 29]. In general, 3D CNNs often deliver better performance than 2D CNNs.

Nowadays, the prevalent pipeline of the state-of-the-art solutions [26] decouples the two components and trains independent networks for the region proposal component and the false positive reduction component respectively. The RPNs are often shallow and fed with a very large 3D input (typically 128×128×128). In contrast, FPRNs are trained with a much smaller input (typically 32×32×32, cropped from the suspicious regions detected by RPNs). Thus, FPRNs often employ a deep architecture. Though this two-step pipeline works very well, it is not an end-to-end approach and the whole system becomes very complicated, significantly slowing down inference.

Fig. 1: Small positive/negative regions are difficult to distinguish if we only look at the local area (the tiny white round spot). Though both regions appear to be similar, the left one is Negative while the right one is Positive (nodule).

Motivated by these issues, many end-to-end solutions have been proposed. Xie et al. [29] removed the FPR step by fully leveraging a powerful multi-scale 3D RPN. Yan et al. [30] maintained the balance between high sensitivity and low false positive rate by employing a 2D CNN augmented with 3D context. The 3D context information is gathered by stacking 2D CNN features from neighboring slices. Though both achieved very good accuracy, they have notable drawbacks. For instance, in [29], the authors traded off much inference speed for accuracy by adopting very complicated anchor box settings and soft Non-Maximum Suppression (NMS). In [30], the 3D object detection task is reduced to a 2D task with the “center slice” known in advance. However, in real application scenarios, the center slice is unknown. As a result, this approach can only detect lesions slice by slice, without an efficient retrieval strategy along the Z-axis.

The question that occurred to us is: why can we not share the backbone of the RPN and the FPRN, following the idea of Faster R-CNN [22]? From our point of view, three major challenges exist. (1) Local areas fail to provide sufficient discriminative information to differentiate positive and negative regions (Fig. 1). This is especially true for small lesions. (2) Objects are too small to apply RoI operations. For instance, in [15], the finest resolution of feature layers is 4mm (64mm³), while the diameters of a great portion of nodules are smaller than 5mm. (Footnote 1: In this paper, we always re-scale images to an isotropic resolution. Therefore, for simplicity, we slightly abuse notation and use 4mm instead of 64mm³ to represent the voxel resolution.) In this case, an RoI contains only one spatial point at the feature level. Besides, directly increasing the resolution of the feature layer does not necessarily help. (3) Prohibitive memory consumption constrains the design of 3D CNNs. We cannot easily adopt a very complicated design for RPNs because of the very large input. On the other hand, FPRNs are often much deeper than RPNs given the much smaller input size. Therefore, directly attaching the FPR branch to the backbone of an RPN may result in poor performance. This depth-memory dilemma also constrains our exploration in the Faster R-CNN direction.

Our major contribution is that we present the first successful approach to enforce an end-to-end, fully 3D Faster R-CNN solution for the small and sparse lesion detection task. To address the aforementioned issues and challenges: (1) we propose an adaptive Focal Loss to stabilize the training of the RPN. This adaptive Focal Loss addresses both the large variance introduced by random sub-crops and the extremely imbalanced positive/negative anchor ratio. (2) We aggregate multi-scale features to enhance the region-based classifier branch. This feature aggregation enriches the context information and thus mitigates the insufficient local area issue. (3) To tackle the failing RoI operation issue and the depth-memory dilemma, we magnify the aggregated features locally before feeding them into RoI operations (RoI Align). This local magnification avoids training finer feature layers directly and uses much less memory, which allows deeper designs for the classifier branch. At the same time, it also enlarges the RoI area. Even though all our experiments are conducted on 3D CT scans, we argue that our proposed technical components can also be applied to other small and sparse object detection tasks in other image modalities.

Fig. 2: The whole network structure with local magnification layers. Sub-cubes are cropped out as the input of a U-Net. We attach an RPN head at each feature scale. Note that we do not share parameters across these heads. Proposals from each head result in aggregated feature crops from all scales instead of a single layer. After RoI operations, these full-scale feature crops are sent to the final fully connected layers to obtain the FPR score. The final confidence score for each proposal is the average of the two confidence scores from the RPN branch and the FPR branch.

II Related Work

As our work is framed in the object detection task, we will focus on the previous work on object detection and lesion detection. For a more general review about deep learning in the biomedical image domain, one can refer to


2D Object Detection and Segmentation. Deconvolutional SSD (DSSD) [7] and Mask R-CNN [8] are probably the most related solutions in the 2D case. DSSD applies deconvolutions (transposed convolutions) globally on multiple feature scales before the actual detection layers. In contrast, our deconvolutions in Local Magnification operate locally on the RoI crops before RoI operations. This is also different from Mask R-CNN, where deconvolutions are located after RoI Align. By doing so, RoI operations can operate on larger areas. In our model, we also adopt RoI Align to reduce the “Quantization Error” introduced by RoI ops.

In the biomedical image domain, U-Net [23] was proposed for a segmentation task. Unlike a regular Fully Convolutional Network (FCN) [19], U-Net possesses a unique symmetric structure. More importantly, U-Net uses concatenation for lateral connections [9, 34, 10] so that the error signal can propagate directly to much earlier stages. This is different from Feature Pyramid Network (FPN) [16], where element-wise addition is used for the same purpose.

Nodule Detection. Before the advent of CNNs, a series of hand-crafted features had been devised for this task. Murphy et al. [21] proposed to compute Shape Index (SI) and Curvedness (CV) at every position in the lungs. A thresholding step then finds seeds lying on the surface of the nodule. After that, a merging process develops individual clusters. These clusters are later formulated as the candidates (Regions of Interest). Other tailored approaches specialized for certain types of nodules also emerged: for sub-solid nodules [14, 11] and for large nodules [24]. The main idea is to impose constraints on the Hounsfield Unit (HU) value and the diameter to filter out targeted nodules. Likewise, Tan et al. [28] proposed to use three different sets of filters specially designed for three different types of nodules: isolated, juxtavascular, and juxtapleural nodules. Note that nearly all these approaches include a re-scaling pre-processing step to produce 3D CT images of an isotropic resolution. This re-scaling step is also employed in CNN based approaches.

Recently, CNN based approaches have been gaining increasing attention. Berens et al. proposed a 2D U-Net [23]. It operates on CT scans slice by slice to determine noisy candidates. These candidates are later merged by morphological analysis. Ding et al. [4] proposed to fine-tune models pre-trained on natural images. The idea is to pack 3 consecutive slices of the scan as the RGB channels of a natural image. Ypsilantis et al. [32] exploited a recurrent 2D CNN to fully leverage the context information along the Z-axis. Compared with plain 2D CNNs, this yields a considerable performance boost. Concurrently, 3D U-Net variants [35, 29] became popular and achieved huge success. These variants differ in building blocks and in data strategy. Even though all these methods were proposed for pulmonary nodule detection, they can be easily generalized to other types of lesion detection tasks.

Divide-and-Conquer. Though 3D CNNs perform well, one fatal issue is that the input image is too large. As a result, it is not possible to feed the whole image into the network (even with the help of semantic segmentation). To address this issue, people resort to the divide-and-conquer strategy [15, 35, 29]: using sub-cubes instead of the whole image as the input. During training, random crops (about 1/6 of the whole image) are fed into the network, while at the testing phase, final results are assembled from the detection results of sliding-window pieces.

However, this divide-and-conquer mechanism introduces too much randomness during training, slowing down convergence. One straightforward explanation comes from the batch normalization [13] layers. Batch normalization relies on channel-wise statistics of the whole feature map, and with sub-cubes being fed in, these statistics become unstable. Another reason lies in the commonly used online hard negative mining mechanism [27]. In each iteration, only the hard negatives contribute to the loss. However, these hard negatives vary greatly across iterations. This large variance further affects the convergence speed. One quick remedy can be the focal loss [17], as it takes all samples into account when calculating the loss. However, the vanilla form of the focal loss suffers from the extremely small positive/negative sample ratio. Our adaptive focal loss is motivated by these observations.

False Positive Reduction. At this stage, RoIs are assumed to be ready, so that 3D cubes can be extracted to train independent CNNs. Setio et al. [25] proposed a 2D Multi-View CNN for this task. The 3D context is encoded by 9 different planes of symmetry extracted from each candidate cube. These 2D planes are then fed into 2D CNNs, and the CNN features are fused to make the final decisions. Dou et al. [5] devised an ensemble of 3 shallow but powerful 3D CNNs for this task. Each of these 3 CNNs tackles nodules of different sizes. Some other 3D CNNs also reportedly work well, such as the 3D U-Net CNN (PAtech Team) and the 3D Wide Residual Network [33]. In general, 3D CNNs seem to work better than 2D CNNs for this task, since they are the most straightforward way to leverage the power of CNNs on volumetric data.

Full Solutions. Currently, multi-phase (ensemble) solutions outperform single-phase (single-model) ones. The detection network and the FPR network are often decoupled and specialized independently [26]. In other words, the two networks do not share the backbone, which makes the whole solution complicated and slow. Most of these solutions adopt 3D CNNs for both types of networks. However, it is reported in [30] that the two-step pipeline with 3D CNNs may fail in much noisier settings, in terms of both image quality and less precise annotations (e.g. the DeepLesion dataset [31]). Yan et al. [30] adopt a 2D CNN enhanced with context from multiple neighboring slices to attack this issue and obtain better performance.

Fig. 3: Diameter Alignment and Local Magnification operations in the FPRN branch. In our case, ignoring the confidence score, each proposal is represented as a 4-dimensional vector: {Z, Y, X, Diameter}. Any proposal at any scale requires an all-scale feature crop. RoI spatial locations are calculated following the standard routines [22], while the crop size at the reference level (where the proposal is acquired) is broadcast to the other levels of the pyramid. For instance, in (a), the proposal {60.56, 52.46, 57.56; 17.23} is derived from “Resolution 16”, making the reference scale 16 and the diameter 1.08. The diameter 1.08 is then propagated top-down to the other levels. Following the same rule, in (b), the diameter is propagated bottom-up from “Resolution 4” to the other two scales. All these crops are up-sampled before they are aggregated. Note that we crop the feature maps with some margins and a fixed crop size.

III Our Approach

As illustrated in Fig. 2, our model contains two heads at each pyramid level: Region Proposal Network (RPN) and False Positive Reduction Network (FPRN). They share the same backbone of a U-Net structure with DenseNet building blocks [12]. This design allows the end-to-end training.

III-A Backbone Network

Our backbone network employs a U-Net structure. In the upstream pathway, feature map sizes are gradually reduced to extract increasingly abstract features, while in the downstream pathway, upsampling operations (transposed convolutions) recover information complementary to the upstream pathway. Note that feature maps of the same level from the two pathways are concatenated (the pink Combination Module in Fig. 2) before they propagate to the next layer.

This feature pyramid idea is also explored in Feature Pyramid Network (FPN) [16], where “element-wise addition” is used instead of “concatenation” for the lateral shortcut. This contrast of “addition” vs. “concatenation” reminds us of the difference between ResNet and DenseNet: skip connections [9, 34, 10] vs. concatenations. For this reason, we use the DenseNet Building Block in our model. Besides, in FPN, up-sampling operations are parameter-free (using interpolation), while we adopt transposed convolutions, which introduce some additional free parameters.

All detailed layer configurations are shown in Table I. We use the same notation as in [12]. Note that we configure all feature maps in the Feature Pyramid to contain 32 channels, making it convenient for head sharing across the RPN branches (in practice, we do not actually share the RPN heads) as well as for the feature aggregation in the FPRN branches.

Encoding stage    Output size
PreBlock          64 × 64 × 64
Encode (1)        64 × 64 × 64
Transition (1)    32 × 32 × 32
Encode (2)        32 × 32 × 32
Transition (2)    16 × 16 × 16
Encode (3)        16 × 16 × 16
Transition (3)    8 × 8 × 8

Decoding stage    Output size
Detector          *
Transition (1)    32 × 32 × 32
Decode (1)        32 × 32 × 32
Upsample (2)      32 × 32 × 32
Transition (2)    16 × 16 × 16
Decode (2)        16 × 16 × 16
Upsample (3)      16 × 16 × 16

Magnify stage     Output size
Magnify           5 × 5 × 5
RoI Align         2 × 2 × 2
Classifier

TABLE I: Layer configurations. “max” and “avg” denote Max Pooling and Avg Pooling respectively. The growth rate is set to 16. Following the notation of DenseNet [12], most of the “conv” layers shown in the table correspond to the sequence BN-ReLU-Conv.

III-B Region Proposal Network Branch

We explore multi-scale techniques in our model to better handle the large variance in object size. [18] suggests that sharing heads across different pyramid levels can result in some performance boost. However, we observed that keeping the branches independent of each other is a better choice in our case. Unlike 2D object detection, where bounding boxes are represented as 4-element vectors {x, y, w, h}, we use {z, y, x, diameter} to encode a bounding box here. This is because, in clinical practice, radiologists and physicians annotate lesions in this way.

Distributing anchors to different scales is important here to allow “fair hits” for nodules of different diameters. “Fair hits” means that nodules of different diameters have similar hit counts on box templates. This helps to balance the sparse positive samples and avoids small nodules being flushed out by large nodules. Another good attribute of this technique is that Regions of Interest have a similar size on their respective reference pyramid levels (Fig. 3). For instance, a lesion of 6mm diameter on stride 4mm feature maps and a lesion of 12mm on stride 8mm feature maps have the same size.
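The "same size on the reference level" property can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the multi-scale "anchor1" assignment described in the experiments section, and the helper names are hypothetical rather than taken from the authors' code.

```python
# Anchor diameters (mm) assigned to each feature stride (mm), following the
# multi-scale "anchor1" setting described in the experiments section.
ANCHORS_BY_STRIDE = {4: (4, 6), 8: (8, 12), 16: (16, 24)}


def size_in_feature_cells(diameter_mm, stride_mm):
    """Express a physical diameter in units of feature-map cells."""
    return diameter_mm / stride_mm


def relative_anchor_sizes(anchors_by_stride):
    """Map each stride to its anchor sizes measured in feature-map cells."""
    return {
        stride: tuple(size_in_feature_cells(d, stride) for d in diameters)
        for stride, diameters in anchors_by_stride.items()
    }


if __name__ == "__main__":
    # Every level sees anchors of the same relative size.
    print(relative_anchor_sizes(ANCHORS_BY_STRIDE))
```

With this assignment, every pyramid level sees anchors spanning 1.0 and 1.5 feature cells, which is the sense in which RoIs have a similar size on their respective reference levels.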

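The training loss of the RPN branches contains two parts: Bounding Box Regression and Binary Classification. For the Bounding Box Regression part, the standard Smooth L1 Loss is adopted. For the classification part, two popular options are Online Hard Example Mining (OHEM) and Focal Loss [17], which is defined as:

$$FL(p, y) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \tag{1}$$

$$L_{cls} = \frac{1}{N_{pos}} \sum_{i} FL(p_i, y_i) \tag{2}$$

where $p$ and $y$ denote the output probability and the ground truth respectively, $\alpha_t$ equals $\alpha$ for positive samples and $1 - \alpha$ for negative samples, and $N_{pos}$ is the number of positive samples. $\gamma$ and $\alpha$ are the hyperparameters that control the “loss decay” rate and the ratio between losses from positive and negative samples.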
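As a point of reference, the vanilla focal loss of [17] with its positive-count normalization can be sketched in plain Python. This is a minimal sketch of the standard formulation, not the adaptive variant proposed in this paper; the default values `gamma=2.0` and `alpha=0.25` are the defaults reported in [17], not the settings used here, and `eps` is an assumed numerical guard.

```python
import math


def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Vanilla focal loss summed over anchors, divided by the positive count.

    probs  -- predicted foreground probabilities, one per anchor
    labels -- ground truth, 1 for positive anchors and 0 for negative ones
    """
    total = 0.0
    num_pos = 0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            p_t, alpha_t = p, alpha
            num_pos += 1
        else:
            p_t, alpha_t = 1.0 - p, 1.0 - alpha
        # (1 - p_t)^gamma decays the loss of well-classified anchors.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    # Normalizing by the number of positives is the step that becomes
    # fragile when positives are extremely rare.
    return total / max(num_pos, 1)
```

Setting `gamma=0` and `alpha=0.5` reduces the expression to half of the ordinary cross entropy, which is a convenient sanity check.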

In our case, the divide-and-conquer strategy is introduced, resulting in a large batch-wise variance across training iterations. This large variance makes OHEM unstable because only a few anchors contribute to the loss. In contrast, Focal Loss [17] covers all anchors and should therefore stabilize the training. However, the vanilla form (defined as Eq. 1 and 2) does not work well in our extremely small and sparse object detection task. The reason lies in the denominator in Eq. 2. In [17], it is determined by the number of positive samples. In our case, the positive/negative ratio is extremely small (1/10000). Therefore, using the number of positive samples as the denominator does not work here. Instead of tuning the denominator directly, we adjust the loss function as follows:


where , , denote respectively the number of True Negative samples, positive samples and negative samples. is a linear factor increasing with iterations.

By doing so, we introduce a “focus shift” throughout the whole training. At the initial training phase, we place the focus on the massive negative samples because this dense updating signal should move more “safely” towards convergence. As the training proceeds, the focus shifts to positive samples to avoid “overkill”. In practice, the term would drop very fast (much faster than quadratically). Therefore, we add the linear and log multiplier to smooth out this “focus shift” process. Note that we use Focal Loss only for negative samples while we calculate Cross Entropy for positive samples, given the small number of positive samples.

III-C False Positive Reduction Network Branch

We adopt an aggregated classifier in the FPRN branch of our model. Unlike the common practice [22, 16], in which feature crops come only from the single reference-scale feature map, each proposal in our model results in a feature aggregation across the feature pyramid. This feature aggregation is realized by Diameter Alignment (Fig. 3). A good attribute of this aggregated classifier is that we can explicitly enforce the scale-invariant property for nodules of different sizes by sharing the weights of the classifier heads.

Diameter Alignment (DA) (Fig. 3) helps to incorporate more context information for small nodules and to probe into in-nodule details for large nodules. For middle size nodules, both context and in-nodule detail information is enhanced. The core idea is that the feature crop size at a certain scale of the feature pyramid will be broadcasted to other scales with the centroid remaining the same. The resulting aggregated feature would automatically incorporate more context information. Note that in our implementation, we crop with some margins to ensure the transposed convolution in later Local Magnification works correctly.
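The broadcasting rule of Diameter Alignment can be illustrated with the example proposal from Fig. 3(a). The sketch below is a simplified reading of the text: it assumes that RoI centers are obtained by dividing the physical coordinates by the stride of each level, it omits the crop margins mentioned above, and the function name is hypothetical.

```python
def diameter_aligned_rois(proposal, reference_stride, strides=(4, 8, 16)):
    """Return one RoI (z, y, x, diameter) in feature cells per pyramid level.

    proposal         -- (z, y, x, diameter) in input-image units
    reference_stride -- stride of the level that produced the proposal
    """
    z, y, x, diameter = proposal
    # Crop size measured on the reference level, in feature cells.
    reference_diameter = diameter / reference_stride
    rois = {}
    for stride in strides:
        # The centroid follows each level's own stride, while the crop size
        # is copied unchanged from the reference level.
        rois[stride] = (z / stride, y / stride, x / stride, reference_diameter)
    return rois


if __name__ == "__main__":
    rois = diameter_aligned_rois((60.56, 52.46, 57.56, 17.23), reference_stride=16)
    for stride, roi in rois.items():
        print(stride, tuple(round(v, 2) for v in roi))
```

Because the same crop size in feature cells covers a smaller physical region on finer levels and a larger one on coarser levels, a proposal from a coarse level gains in-nodule detail from the finer crops, while a proposal from a fine level gains context from the coarser crops.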

This context information enrichment is motivated by the observation that if we only look at local areas (RoI), blood vessels and small nodules can appear quite similar to each other as small white spots (Fig. 1). To handle this issue, in real clinical practice, radiologists often scroll up and down along Z-axis to examine the context when detecting confusing small nodules. We show in later ablation study that this Diameter Alignment plays an important role in improving the FPRN branch.

We adopt RoI Align [8] in our models as it can theoretically work with small regions. We follow the official implementation of RoI Align and extend it to the 3D case (from 4 neighboring points to 8 neighbor points). All RoI Align outputs will be aggregated (by concatenation) before they reach the fully connected layers. The final confidence score of each proposal will be the average of two branches.
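The step from 4 to 8 neighboring points corresponds to replacing bilinear with trilinear interpolation when sampling the feature map at fractional locations. The following sketch shows only that sampling core on a nested-list volume; a full RoI Align additionally averages several such samples per output bin. The function name and the border clamping are assumptions made for illustration.

```python
import math


def trilinear_sample(volume, z, y, x):
    """Sample a 3D grid (nested lists indexed [z][y][x]) at a fractional point."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    # Clamp the query point to the valid index range of the grid.
    z = min(max(z, 0.0), depth - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    z0, y0, x0 = math.floor(z), math.floor(y), math.floor(x)
    z1 = min(z0 + 1, depth - 1)
    y1 = min(y0 + 1, height - 1)
    x1 = min(x0 + 1, width - 1)
    fz, fy, fx = z - z0, y - y0, x - x0
    value = 0.0
    # Weighted sum over the 8 corner voxels surrounding the query point.
    for zi, wz in ((z0, 1.0 - fz), (z1, fz)):
        for yi, wy in ((y0, 1.0 - fy), (y1, fy)):
            for xi, wx in ((x0, 1.0 - fx), (x1, fx)):
                value += wz * wy * wx * volume[zi][yi][xi]
    return value
```

Since no coordinate is rounded to the nearest voxel, the sampled value varies smoothly with the RoI position, which is what removes the quantization error for RoIs that span only one or two feature cells.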

III-D Local Magnification

A finer resolution of the feature layers is another option to attack the small RoI issue. However, naive approaches can easily fail because of the prohibitive memory consumption. To circumvent this problem, we propose to up-sample RoI crops locally, which avoids the tremendous memory consumption. Conceptually, this operation provides a “closer” look at the regions of interest (Fig. 3), much like placing a magnifying lens above the RoIs. Therefore, we name this operation Local Magnification.

III-E Joint Training vs. Alternating Training Between Branches

Essentially, the whole model contains two branches: RPN and FPRN. Since there is no ready-to-use pre-trained 3D model to fine-tune, as there is in 2D scenarios, we have to train the model from scratch. We first train the RPN to acquire a good initialization for the backbone, given the “dense” error signal emitted by the RPN branch. After that, we add the FPRN branch to the model.

After we stack the FPRN branch on the model, we have multiple options for further training: joint training and alternating training between branches. In practice, we found that the former choice seems to be more stable in terms of performance. We attribute this fact to the combination of both “Global and Local” complementary losses. In a way, RPN losses are more general and global which take into account all spatial positions while FPRN losses only focus on the highly suspicious regions, making them more local.

Backbone Multi-Scale OHEM Focal Loss FROC/Sensitivity
Res18 [35] 0.834 / 0.946
DualPath [35] 0.842 / 0.958
ResBlock [29] 0.920 / -
ResBlock [29] 0.935 / -


a. DenseBlock + anchor1 0.839 / 0.896
b. DenseBlock + anchor1 0.868 / 0.941
c. DenseBlock + anchor1 0.898 / 0.956
d. DenseBlock + anchor1 0.910 / 0.982
e. DenseBlock + anchor2 0.899 / 0.970


f. DenseBlock + anchor1 0.917 / 0.977
TABLE II: Ablation study for baseline RPNs. We use the DenseNet Block as the building block for both the upstream and downstream pathways. The detailed architecture of the model can be found in Table I. Note that we do not share the RPN heads. Each pyramid level has its own RPN head of an identical structure. ABS denotes the “Anchor Based Sampling” technique, which can be interpreted as a boosting procedure. With ABS, an additional 30 epochs of training are conducted.

IV Experiment on LUNA16

We conduct a series of experiments on the task of pulmonary nodule detection with the LUNA16 dataset [26]. This dataset is a subset of the publicly available LIDC-IDRI dataset [1]. It summarizes the annotations from LIDC-IDRI: tiny nodules (diameter < 3mm) and annotations of low confidence (fewer than 2 physicians agree on) are excluded. As a result, LUNA16 contains 888 CT scans and 1186 nodules. We follow the rules of the LUNA16 challenge [26] by conducting 10-fold cross validation and evaluating the performance with the official FROC score: the average recall rate at 0.125, 0.25, 0.5, 1, 2, 4 and 8 false positives per scan. In the LUNA16 Challenge, a 3D region proposal is counted as a True Positive as long as its center is located inside a true bounding sphere, i.e. the distance between the two centers is less than the radius of the true bounding sphere. We adopt this criterion for all our experiments, including the experiments on the DeepLesion dataset.
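The hit criterion and the FROC score described above can be sketched as follows. This is a simplified illustration, not the official LUNA16 evaluation script (which, among other things, interpolates the curve): it takes the best sensitivity reachable at or below each false-positive rate, and the data layout (a list of `(score, nodule_id)` pairs with `None` for false positives) is an assumption.

```python
import math

FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)


def is_hit(proposal_center, nodule_center, nodule_diameter):
    """A proposal hits a nodule if its center lies inside the bounding sphere."""
    return math.dist(proposal_center, nodule_center) < nodule_diameter / 2.0


def froc_score(detections, num_nodules, num_scans, fp_rates=FP_RATES):
    """Average sensitivity over the given false-positives-per-scan rates.

    detections -- (score, nodule_id) pairs pooled over all scans; nodule_id
                  is None for a false positive, else the id of the hit nodule
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    found = set()
    false_positives = 0
    curve = []  # (false positives per scan, sensitivity) after each detection
    for _, nodule_id in ranked:
        if nodule_id is None:
            false_positives += 1
        else:
            found.add(nodule_id)  # repeated hits on one nodule count once
        curve.append((false_positives / num_scans, len(found) / num_nodules))
    sensitivities = []
    for rate in fp_rates:
        reachable = [sens for fps, sens in curve if fps <= rate]
        sensitivities.append(max(reachable, default=0.0))
    return sum(sensitivities) / len(sensitivities)
```

For example, with one scan, two nodules, and detections ranked as hit, false positive, hit, the sensitivity is 0.5 at the three rates below 1 false positive per scan and 1.0 at the remaining four, giving a score of 5.5/7.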

IV-A Data Preparation

As in [15], we use the officially provided segmentation masks to remove unnecessary volume from the original 3D scans. We rescale all images to an isotropic resolution. During training, we randomly crop 3D cubes out of the pre-processed images as the input. Note that the segmented scans are typically much larger than these cubes (6 to 8 times). During testing, we adopt sliding-window style cropping: we divide the whole image into overlapping cubes and merge all detection results afterwards for the final decision.
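The test-time tiling can be sketched as below. The cube size of 128 matches the "Crop 128" setting used in the experiments, whereas the overlap of 32 voxels and the function names are hypothetical choices for illustration; the paper does not state the overlap.

```python
from itertools import product


def window_starts(length, cube, overlap):
    """Start offsets of overlapping windows that together cover [0, length)."""
    if overlap >= cube:
        raise ValueError("overlap must be smaller than the cube size")
    if length <= cube:
        return [0]
    step = cube - overlap
    starts = list(range(0, length - cube, step))
    starts.append(length - cube)  # last window sits flush with the border
    return starts


def sliding_cubes(shape, cube=128, overlap=32):
    """Return ((z0, z1), (y0, y1), (x0, x1)) bounds for every cube in a volume."""
    axes = [window_starts(n, cube, overlap) for n in shape]
    return [
        tuple((start, min(start + cube, n)) for start, n in zip(corner, shape))
        for corner in product(*axes)
    ]


if __name__ == "__main__":
    cubes = sliding_cubes((300, 300, 300))
    print(len(cubes), cubes[0], cubes[-1])
```

Detections from each cube would then be shifted by the cube's start offsets back into scan coordinates before the per-scan merge.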

IV-B Baseline RPNs

We adopt a DenseNet backbone. We conduct extensive ablation experiments to evaluate the effectiveness of all the technical components. To make fair comparisons, all ablation experiments undergo 10-fold cross validation with 50 training epochs. During training, two IoU thresholds are used: regions whose IoU with the ground truth is larger than the upper threshold, smaller than the lower threshold, or in between are assigned as positive, negative and ignored respectively. The two hyperparameters of the Focal Loss are set to 0.8 and 5 respectively.

In the ablation study, when the multi-scale technique is removed, the anchors (denoted as “anchor1”) are set to {4, 6, 8, 12, 16, 24}mm at the feature layer of 4mm resolution, while with the multi-scale technique the {4, 6}mm, {8, 12}mm and {16, 24}mm anchors are distributed to the feature levels of 4mm, 8mm and 16mm resolution respectively. Since Eggert et al. [6] suggest that the anchor set matters a great deal for small object detection tasks, we also explore other anchor sets such as {5, 10}, {15, 20}, {25, 35} (denoted as “anchor2”) to validate this phenomenon. All results are summarized in Table II.

OHEM vs. Focal Loss. Our proposed Focal Loss consistently results in a large performance improvement compared with OHEM (and usually uses fewer iterations). This is clearly illustrated by comparing Row a and Row b (0.839/0.896 → 0.868/0.941), and Row c and Row d (0.898/0.956 → 0.910/0.982). Note that we do not report the experiment result with the vanilla Focal Loss because it simply does not work.

Multiscale vs. Single Scale. Leveraging the multi-scale technique substantially improves the performance (roughly 0.04 to 0.06 FROC score improvement). This is demonstrated by comparing Row a and Row c (0.839/0.896 → 0.898/0.956), and Row b and Row d (0.868/0.941 → 0.910/0.982).

DenseBlock vs. ResBlock. So far, we have set up a relatively strong baseline network. However, we cannot directly compare our Dense Block with the Res Block or with the DualPath Block. The reason is two-fold. First, our model architecture may not be the best. We do not follow the common practice of gradually increasing the number of kernels as the feature map gets smaller. This is because we want to force the RPN heads to be of identical shape, which makes it convenient to share heads and to build an aggregation of features afterward. Second, there are many other factors that also affect the performance, such as re-shuffling and cropping strategies, hyper-parameter settings and implementation details. This is well illustrated by the large performance gap between the two Res18-like networks [35, 29]. Effective modifications include replacing the ReLU activation and NMS with Randomized ReLU and soft-NMS respectively, as well as leveraging the multi-scale technique.

IV-C Anchor Based Sampling

Xie et al. [29] have shown that “Anchor Based Sampling” (ABS) can greatly improve the performance. ABS works by “boosting”, i.e. focusing more on “hard cases” which easily confuse the model. This mechanism repeats the following steps: (1) training as usual for a certain number of iterations; (2) testing each whole CT scan in the training set to locate the hard regions. These hard regions receive more focus in the next training round. We apply one round of ABS in our models and find that it works considerably well. Results are summarized in Table II and Table III.

Model DA Magnify Joint Alternating RPN FPRN Combined
Crop 96 0.911/0.970 0.889/0.946 0.917/0.970
Crop 96 0.916/0.977 0.902/0.963 0.919/0.976
Crop 96 0.908/0.972 0.894/0.962 0.912/0.972


Crop 128 0.923/0.976 0.889/0.944 0.926/0.976
Crop 128 0.925/0.983 0.905/0.957 0.930/0.981


Crop 128 0.920/0.972 0.895/0.955 0.926/0.964
Crop 128 0.930/0.983 0.908/0.963 0.939/0.985
Crop 128 0.928/0.984 0.914/0.968 0.935/0.983
Merged (Intersection) - - 0.943/0.979
Merged (Union) - - 0.942/0.991
TABLE III: Ablation study on the FPRN branch (FROC/Sensitivity). “Crop 96” and “Crop 128” denote input cube sizes of 96 and 128, respectively. “DA” is short for Diameter Alignment. If it is removed, we still have full-scale feature aggregation, but we use the standard procedure to calculate nodule diameters across all pyramid levels. “Magnify”, “Joint” and “Alternating” respectively denote Local Magnification ops, training the RPN and FPRN branches simultaneously, and training them alternately. We also report the ensemble performance of the 3D Aggregated Faster R-CNN with and without Local Magnification. We merge the results by simply averaging the overlapped proposals. Non-overlapped proposals between the two proposal sets are retained (Union) or discarded (Intersection).
Model FROC Inference Time
Res18 [35] 0.834 -
DualPath [35] 0.842 -
ResBlock [29] 0.935 15s /Scan
3D-AG (Ours) 0.939 4.2s /Scan
TABLE IV: State-of-the-art single-model solutions. Note that our inference time is measured on a much weaker GPU setup than that of [29].
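The merging rule stated in the caption of Table III (average overlapped proposals, then keep or drop the rest) can be sketched as follows. The overlap test is an assumption, since the text does not specify one: here two proposals overlap when their centres are closer than the larger of the two radii. The tuple layout `(x, y, z, diameter, prob)` and the function names are likewise hypothetical.

```python
import math

def overlaps(a, b):
    """Assumed overlap test: centres closer than the larger of the two radii."""
    return math.dist(a[:3], b[:3]) < max(a[3], b[3]) / 2.0

def merge(set_a, set_b, mode="union"):
    """Merge two proposal lists of (x, y, z, diameter, prob) tuples.

    Overlapped pairs are averaged field by field. Unmatched proposals are
    retained when mode == "union" and discarded when mode == "intersection".
    """
    merged, used_b = [], set()
    for a in set_a:
        match = next((j for j, b in enumerate(set_b)
                      if j not in used_b and overlaps(a, b)), None)
        if match is not None:
            used_b.add(match)
            merged.append(tuple((u + v) / 2.0 for u, v in zip(a, set_b[match])))
        elif mode == "union":
            merged.append(a)
    if mode == "union":
        merged.extend(b for j, b in enumerate(set_b) if j not in used_b)
    return merged

# Toy proposals (made-up numbers): one overlapped pair, one unmatched in each set.
props_a = [(10.0, 10.0, 10.0, 6.0, 0.9), (50.0, 50.0, 50.0, 4.0, 0.4)]
props_b = [(11.0, 10.0, 10.0, 6.0, 0.7), (80.0, 80.0, 80.0, 5.0, 0.6)]
union = merge(props_a, props_b, mode="union")
intersection = merge(props_a, props_b, mode="intersection")
```

With these toy inputs the union keeps three proposals and the intersection keeps only the averaged pair, which mirrors the higher sensitivity of “Merged (Union)” reported in Table III.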

IV-D False Positive Reduction Network Branch

We stack the FPRN branch over the backbone of the RPN and use joint (multi-task) training for the whole model. Unlike the regular Faster R-CNN classifier branch, our FPRN branch introduces three main modifications: (1) multi-scale Feature Aggregation, (2) Diameter Alignment when cropping RoIs, (3) Local Magnification. Note that the weights of the Local Magnification layers are shared across all pyramid levels. We conduct a full ablation study to isolate the effect of each technique. Furthermore, we compare the effects of the input crop size and of two training strategies: training the two branches jointly and alternating between the training of the two branches. We refer to our model without and with Local Magnification as “3D-AG” and “3D-AG-LM”, respectively. Results are summarized in Table III.

As we can see from Table III: (1) introducing the FPRN branch brings a consistent and considerable performance boost (0.917/0.977 → 0.939/0.985). To our surprise, it also improves the RPN branch (0.917/0.977 → 0.930/0.983). We attribute this to the strong scale-invariant constraints imposed by the FPRN branch during training, which force the backbone network to respond to both branches simultaneously. This conjecture is further supported by the fact that joint training performs better than alternating training. (2) Local Magnification layers have a consistent positive effect on the FPRN branch in terms of both FROC score and Sensitivity (0.895/0.955 → 0.914/0.968). However, in the final combined result, models with Local Magnification sometimes trail those without it. We leave the investigation of this inconsistency to future work. (3) Diameter Alignment is critical (0.926/0.964 → 0.939/0.985). Once we remove it, the RPN fails to improve over the baseline RPN (0.920/0.972 vs. 0.917/0.977) and the FPRN branch drastically loses efficacy (from 0.908/0.963 to 0.895/0.955).

IV-E LUNA16 Leader Board

Our models achieve the best FROC score among single-model solutions (Table IV). However, many better results from multi-phase (ensemble) solutions have recently been reported. Unfortunately, important details are still missing. To the best of our knowledge, all these state-of-the-art ensemble solutions consist of multiple heterogeneous networks (ensembles), such as PAtech (0.951, 1 RPN + 2 FPRNs), JianPeiCAD (0.950, 2 RPNs + 1 FPRN), and LUNA16FONOVACAD (0.947, 1 RPN + 3 FPRNs) (results can be found at https://luna16.grand-challenge.org/results/). Our Merged (Union) and Single-Model results would rank 4th and 6th on the LUNA16 Leader Board. Nevertheless, our model still works considerably well, given that it allows end-to-end training and is inherently faster than ensemble solutions.

Fig. 4: FROC Curve and True Negative Patient Scores of our models. (a) presents the original FROC evaluation curves. (b) shows detailed information when all curves hit the 0.95 sensitivity. (c) summarizes the TNP score of each model.
Fig. 5: Typical Hard Cases confusing our models. We use the “Faster R-CNN w/ Magnify” as the testing model. The probability of each region is marked on the images. Other models share a similar property. The score of 0 indicates the nodule is missed.

IV-F Inference Time

We test our models with 4 Tesla K80 GPUs (48 GB) and an Intel(R) Xeon(R) E5-2640 v2 CPU. We set the probability threshold to 0.269 (Sigmoid(-1)). The inference times for the baseline RPN models and the 3D Aggregated Faster R-CNN models without and with Local Magnification are 3.0s, 4.2s and 5.0s per pre-processed scan, respectively. In [29], 4 Titan XPs (48 GB) are adopted, which are much faster than the Tesla K80 (12.1 TFLOPs vs. 5.6 TFLOPs). Despite this inferior GPU setting, our approaches are at least 2 times faster than [29] (15s per pre-processed scan).
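The figures in this paragraph can be checked with a few lines of arithmetic. The snippet below only recomputes the stated threshold and the speed ratios from the reported timings; the variable names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

threshold = sigmoid(-1.0)       # probability threshold used at test time
speedup_3d_ag = 15.0 / 4.2      # [29] vs. 3D-AG (without Local Magnification)
speedup_3d_ag_lm = 15.0 / 5.0   # [29] vs. 3D-AG-LM (with Local Magnification)

print(round(threshold, 3), round(speedup_3d_ag, 2), round(speedup_3d_ag_lm, 2))
# prints: 0.269 3.57 3.0
```

Both ratios are consistent with the "at least 2 times faster" statement, even before accounting for the difference in GPU throughput.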

Note that detailed information about the inference time of other state-of-the-art ensemble approaches is not publicly available. Nevertheless, we argue that our approach is systematically faster, because it avoids the time needed to propagate additional cropped cubes (raw images) through very deep 3D CNNs (much deeper than the RPN backbone).

IV-G True Negative Patients

Though the FROC score is well designed for this task, it only focuses on the individual lesion level. We argue that more attention needs to be paid to the patient level, since False Positives (FP) are more acceptable for Positive Patients than for Negative Patients. To this end, we also report the performance of our models in terms of the True Negative Patient (TNP) score.

Based on LUNA16 annotations, there are 287 negative patients. We adopt the probability thresholds at which the sensitivity reaches 0.95 and calculate the TNP score for each model. As shown in Fig. 4, a higher FROC score is not necessarily associated with a better TNP score. For instance, “Merged (Union)” has a much higher FROC score than all single-model solutions while yielding a worse TNP score. This is another reason why we argue that TNP evaluation is an important complement to FROC evaluation.
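A patient-level score of this kind could be computed as sketched below. The exact definition is an assumption here: the sketch takes the TNP score to be the fraction of negative patients for whom no proposal reaches the chosen probability threshold. The function name and the toy data are hypothetical.

```python
def tnp_score(negative_patient_probs, threshold):
    """Fraction of negative patients with no proposal at or above threshold.

    negative_patient_probs : dict mapping patient id -> list of proposal
                             probabilities predicted on that patient's scan
    threshold              : probability threshold (in the text, the one at
                             which lesion-level sensitivity reaches 0.95)
    """
    if not negative_patient_probs:
        raise ValueError("need at least one negative patient")
    clean = sum(1 for probs in negative_patient_probs.values()
                if all(p < threshold for p in probs))
    return clean / len(negative_patient_probs)

# Toy example with four negative patients (made-up probabilities).
scans = {"p0": [0.10, 0.20], "p1": [0.05, 0.90], "p2": [], "p3": [0.30]}
score = tnp_score(scans, threshold=0.5)  # p1 has a false positive -> 3 of 4 clean
```

Under this definition a single confident false positive is enough to count a negative patient as wrongly flagged, which is why a model with more retained proposals can score worse on TNP despite a higher FROC score.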

IV-H Visualization of Hard Cases

We visualize hard cases in Fig. 5, including positives that are hard to detect and negatives that are easy to mistake. Typically, False Positives (Hard Negatives) are caused by small nodules (a-c), the failure of segmentation (d), and the noise and bad quality of raw images (e). As for False Negatives (Hard Positives), the model may additionally suffer from the low contrast of RoIs with the background (d, e). Therefore, we argue that better segmentation and higher-quality raw CT scans should further help the detection.

V Experiments on DeepLesion

We also evaluate our model on a more general lesion detection task with the DeepLesion [31] dataset. This dataset contains 10,594 CT studies from 4,427 unique patients. There are 32,735 lesions annotated at their key slices. The whole dataset is officially divided into training, validation, and testing sets containing 22,901, 4,887, and 4,912 lesions, respectively (noisy annotations are removed). Note that DeepLesion only provides 60mm of Z-context along with the key slice for each lesion. On the other hand, various types of lesions are included in this dataset, including lung, mediastinum, liver, soft tissue, pelvis, abdomen, kidney, and bone. This wide variety of lesions allows us to evaluate our approach on a more general scale.

However, it is reported in [30] that 3D CNNs do not work well with the DeepLesion dataset. We attribute this observation to three aspects: (1) key slice indices may not be accurate, especially when the slice interval is large (e.g., 5mm); (2) large lesions (> 48mm, 11% of the data) can easily be out-of-bound; (3) annotated bounding boxes for small lesions are usually too large compared with the actual size of the lesions. All these issues pose significant challenges to bounding box regression. To address them, we merge multi-annotated lesions (lesions with multiple annotations), remove very large lesions during training, and set the diameter of small lesions to the minimum of the long side of the bounding box and the long diameter.

V-A Data Pre-processing

Note that no semantic segmentation is applied here because whole CT scans are not available. However, we can still reduce unnecessary parts by clipping black borders. Similar to the experiments on LUNA16, each scan chunk is rescaled to an isotropic resolution (1 mm). In all experiments with DeepLesion, the training sample size is 64 × 128 × 128 (zero-padded when necessary).

We convert the 2D annotations into 3D ones as {X, Y, Z, Diameter} vectors. The Z position is calculated from the key slice index and the slice interval. In this way, the task settings become the same as for LUNA16. Moreover, despite the issues with bounding box regression, our model can still generalize well to DeepLesion.
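The conversion described above can be sketched as a small function. Several details are assumptions rather than facts from the text: the bounding box is taken to be `(x1, y1, x2, y2)` in pixels, the centre of the box gives X and Y, and the minimum rule for the diameter is applied to every lesion because the cutoff that defines a "small" lesion is not stated. All parameter names are hypothetical.

```python
def to_3d_annotation(bbox_px, key_slice_idx, slice_interval_mm,
                     pixel_spacing_mm, long_diameter_mm):
    """Convert a key-slice 2D annotation into an (X, Y, Z, Diameter) tuple in mm.

    bbox_px           : (x1, y1, x2, y2) bounding box on the key slice, in pixels
    key_slice_idx     : index of the key slice within the scan chunk
    slice_interval_mm : distance between consecutive slices
    pixel_spacing_mm  : in-plane pixel size
    long_diameter_mm  : annotated long diameter of the lesion
    """
    x1, y1, x2, y2 = bbox_px
    x = 0.5 * (x1 + x2) * pixel_spacing_mm
    y = 0.5 * (y1 + y2) * pixel_spacing_mm
    z = key_slice_idx * slice_interval_mm          # Z from slice index and interval
    long_side = max(x2 - x1, y2 - y1) * pixel_spacing_mm
    diameter = min(long_side, long_diameter_mm)    # guards against oversized boxes
    return (x, y, z, diameter)

# Toy annotation (made-up numbers): a 20 x 16 pixel box on slice 30.
annotation = to_3d_annotation(bbox_px=(100, 120, 120, 136), key_slice_idx=30,
                              slice_interval_mm=5.0, pixel_spacing_mm=0.5,
                              long_diameter_mm=7.0)
```

Here the box's long side is 10 mm while the annotated long diameter is 7 mm, so the smaller value is kept as the lesion diameter.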

V-B Training and Testing Settings

We adopt the same model architecture as with LUNA16 except for the anchor setting. The anchors on strides 4, 8 and 16 are configured as {3, 5, 7}, {10, 13, 17} and {22.0, 30.0, 40.0}, respectively. In both training and testing, we adopt the same cropping strategy as in the LUNA16 experiments. We remove very large lesions (> 48mm, 11% of the data) during training. This operation is nontrivial: our preliminary attempts show that when these large lesions are included, the regression losses of the RPN branch hardly converge. When testing, we report results both with and without very large lesions (also 11% of the testing lesions). Note that we adopt LUNA16’s criterion for the evaluation. We train the model from scratch, and results are summarized in Table V.

FPs per image 0.5 1 2 4 8 16 Avg. FROC
3DCE, 27 slices [30] 62.48 73.37 80.70 85.65 89.09 91.06 80.39 -


RPN Baseline 65.74 73.89 80.99 86.56 91.40 94.40 82.17 0.708
3D-AG 74.08 81.42 86.08 89.38 92.08 94.82 86.31 0.771


RPN Baseline * 69.09 76.75 82.81 87.29 91.20 93.84 83.50 0.735
3D-AG * 74.68 81.52 85.45 89.18 92.01 94.72 86.26 0.774
TABLE V: Sensitivity (%) and FROC score on the DeepLesion dataset. “*” indicates that large lesions (> 48 mm) are removed. Note that the performance may not be directly comparable with [30] because of the different evaluation criteria (2D vs 3D).
Model LU ME LV ST PV AB KD BN <10 10-30 >30
3DCE, 27 slices [30] 89 88 90 74 84 84 82 75 80 87 84


RPN Baseline 91 88 87 80 85 80 80 69 82 88 80
3D-AG 93 89 92 84 90 86 83 70 83 90 91


RPN Baseline * 91 88 89 80 86 82 81 68 82 88 84
3D-AG * 93 89 91 84 90 86 83 70 83 90 90
TABLE VI: Sensitivity@4 (%) on DeepLesion w.r.t. Lesion Type and Diameter. “*” indicates that large lesions (> 48 mm) are removed. The abbreviations of lesion types stand for lung (LU), mediastinum (ME), liver (LV), soft tissue (ST), pelvis (PV), abdomen (AB), kidney (KD), and bone (BN), respectively. “<10”, “10-30” and “>30” indicate lesion diameter ranges (mm).
Fig. 6: Visualization with DeepLesion. Green boxes are ground truth while red boxes are predictions. The two may be hard to tell apart when they match almost perfectly, e.g., Abdomen: box 0.936, Mediastinum: box 0.897, Liver: 0.985, Lung: box 0.896, Kidney: box 0.985, Soft Tissue: 0.865.

V-C Overall Performance

We evaluate the performance with the FROC score here. As we can see from Table V, our model generalizes well to the general lesion detection task, and the Aggregated FPRN branch consistently improves on the baseline RPN. Again, the large performance gap between the RPN with and without very large lesions supports our conjecture that our model could be sensitive to out-of-bound lesions. Nevertheless, the results show that our Aggregated FPRN is robust across a variety of lesions. Moreover, our Aggregated FPRN is less sensitive to out-of-bound lesions, which is another advantage of our model.
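For reference, the summary numbers in Table V can be reproduced with a short sketch. It assumes the LUNA16 convention, in which the FROC score is the mean sensitivity at 1/8, 1/4, 1/2, 1, 2, 4 and 8 false positives per scan; the linear interpolation between measured operating points and the function names are assumptions made for the example. The "Avg." column is simply the mean over the six listed operating points.

```python
LUNA16_FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)

def interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing.

    Values outside the range of xs are clamped to the end points.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            frac = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + frac * (ys[i + 1] - ys[i])

def froc_score(fp_rates, sensitivities, targets=LUNA16_FP_RATES):
    """Mean sensitivity at the target false-positive rates."""
    return sum(interp(t, fp_rates, sensitivities) for t in targets) / len(targets)

# "Avg." column of Table V for 3DCE [30]: mean over 0.5, 1, 2, 4, 8, 16 FPs per image.
row_3dce = [62.48, 73.37, 80.70, 85.65, 89.09, 91.06]
avg_3dce = sum(row_3dce) / len(row_3dce)
print(round(avg_3dce, 2))  # prints: 80.39
```

Because the FROC score emphasizes low false-positive rates (down to 1/8 per scan) while the "Avg." column starts at 0.5, the two summaries need not rank models identically.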

V-D Performance w.r.t. Lesion Type and Size

As in [30], we also report the performance with respect to Lesion Type and Diameter. All results are summarized in Table VI. From Table VI we can see that our model does not perform well on bone and kidney lesions. On the other hand, our approach does not suffer the significant performance drop seen in [30] when detecting “Soft Tissue” lesions. Again, we stress that our results may not be directly comparable with [30]. Nevertheless, the considerable improvement brought by the Aggregated classifier branch over the RPN baseline shows that our Aggregated Faster R-CNN generalizes well across different tasks. Some of the results are visualized in Fig. 6.

VI Conclusions and Future Work

For the training of the RPN branch, the adaptive Focal Loss demonstrates superiority to the OHEM mechanism in terms of stability, training speed and performance. The boosting strategy Anchor Based Sampling proves to be effective. Moreover, we have observed that segmentation also plays an important role by filtering out noise and reducing the unnecessary area. Inspired by these observations, we can further improve the training mechanism in three directions: (1) the sampling strategy and robust segmentation at the starting point; (2) the backbone network structure in the middle; (3) the cost at the ending point.

By stacking the FPRN branch on the RPN’s backbone, we can consistently improve the FROC performance on this lesion detection task. One surprising finding is that, with the help of the FPRN branch, the model becomes more robust to out-of-bound lesions. When isolating each technical component, we find that (1) Diameter Alignment plays a critical role by enriching the context information; (2) Local Magnification operations are effective for the FPRN branch. Sometimes, however, they may not be the best choice for the full solution. This inconsistency calls for further research. Possible directions include a better designed FPRN structure, loss balancing, and local regional constraints. Nevertheless, Local Magnification and the FPRN branch open a door towards a full diagnosis of nodules by offering more interpretable features: texture, calcification, lobulation, and even malignancy. All these features can potentially be incorporated into the FPRN branch. We will focus on this interpretability in the future.

VII Acknowledgment


  • [1] S. G. Armato III, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Hoffman, et al. (2011) The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics 38 (2), pp. 915–931. Cited by: §IV.
  • [2] B. E. Bejnordi, M. Veta, P. J. Van Diest, B. Van Ginneken, N. Karssemeijer, G. Litjens, J. A. Van Der Laak, M. Hermsen, Q. F. Manson, M. Balkenhol, et al. (2017) Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. 318 (22), pp. 2199–2210. Cited by: §I.
  • [3] Q. Chen, X. Sun, N. Zhang, Y. Cao, and B. Liu (2019) Mini lesions detection on diabetic retinopathy images via large scale cnn features. In International Conference on Tools with Artificial Intelligence (ICTAI). Cited by: §I.
  • [4] J. Ding, A. Li, Z. Hu, and L. Wang (2017) Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks. CoRR abs/1706.04303. Cited by: §II.
  • [5] Q. Dou, H. Chen, L. Yu, J. Qin, and P. Heng (2017) Multilevel contextual 3-d cnns for false positive reduction in pulmonary nodule detection. IEEE Transactions on Biomedical Engineering 64 (7), pp. 1558–1567. Cited by: §II.
  • [6] C. Eggert, S. Brehm, A. Winschel, D. Zecha, and R. Lienhart (2017) A closer look: small object detection in faster r-cnn. In Multimedia and Expo (ICME), 2017 IEEE International Conference on, pp. 421–426. Cited by: §IV-B.
  • [7] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) DSSD: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659. Cited by: §II.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980–2988. Cited by: §II, §III-C.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §II, §III-A.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §II, §III-A.
  • [11] C. I. Henschke, D. F. Yankelevitz, R. Mirtcheva, G. McGuinness, D. McCauley, and O. S. Miettinen (2002) CT screening for lung cancer: frequency and significance of part-solid and nonsolid nodules. American Journal of Roentgenology 178 (5), pp. 1053–1057. Cited by: §I, §II.
  • [12] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Vol. 1, pp. 3. Cited by: §III-A, §III.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §II.
  • [14] C. Jacobs, E. M. van Rikxoort, T. Twellmann, E. T. Scholten, P. A. de Jong, J. Kuhnigk, M. Oudkerk, H. J. de Koning, M. Prokop, C. Schaefer-Prokop, et al. (2014) Automatic detection of subsolid pulmonary nodules in thoracic computed tomography images. Medical image analysis 18 (2), pp. 374–384. Cited by: §I, §II.
  • [15] F. Liao, M. Liang, Z. Li, X. Hu, and S. Song (2017) Evaluate the malignancy of pulmonary nodules using the 3d deep leaky noisy-or network. arXiv preprint arXiv:1711.08324. Cited by: §I, §I, §II, §IV-A.
  • [16] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Vol. 1, pp. 4. Cited by: §II, §III-A, §III-C.
  • [17] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. arXiv preprint arXiv:1708.02002. Cited by: §II, §III-B, §III-B.
  • [18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Computer Vision–ECCV 2014, pp. 740–755. Cited by: §III-B.
  • [19] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §II.
  • [20] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli (2018) Applications of deep learning and reinforcement learning to biological data. 29 (6), pp. 2063–2079. Cited by: §II.
  • [21] K. Murphy, B. van Ginneken, A. M. Schilham, B. De Hoop, H. Gietema, and M. Prokop (2009) A large-scale evaluation of automatic pulmonary nodule detection in chest ct using local image features and k-nearest-neighbour classification. Medical image analysis 13 (5), pp. 757–770. Cited by: §I, §II.
  • [22] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §I, Fig. 3, §III-C.
  • [23] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §II, §II.
  • [24] A. A. Setio, C. Jacobs, J. Gelderblom, and B. Ginneken (2015) Automatic detection of large pulmonary solid nodules in thoracic ct images. Medical physics 42 (10), pp. 5642–5653. Cited by: §II.
  • [25] A. A. A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S. J. van Riel, M. M. W. Wille, M. Naqibullah, C. I. Sánchez, and B. van Ginneken (2016) Pulmonary nodule detection in ct images: false positive reduction using multi-view convolutional networks. IEEE transactions on medical imaging 35 (5), pp. 1160–1169. Cited by: §I, §II.
  • [26] A. A. A. Setio, A. Traverso, T. De Bel, M. S. Berens, C. van den Bogaard, P. Cerello, H. Chen, Q. Dou, M. E. Fantacci, B. Geurts, et al. (2017) Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge. Medical image analysis 42, pp. 1–13. Cited by: §I, §I, §I, §II, §IV.
  • [27] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769. Cited by: §II.
  • [28] M. Tan, R. Deklerck, B. Jansen, M. Bister, and J. Cornelis (2011) A novel computer-aided lung nodule detection system for ct images. Medical physics 38 (10), pp. 5630–5645. Cited by: §II.
  • [29] Z. Xie (2018) Towards single-phase single-stage detection of pulmonary nodules in chest ct imaging. arXiv preprint arXiv:1807.05972. Cited by: §I, §I, §II, §II, TABLE II, §IV-B, §IV-C, §IV-F, TABLE IV.
  • [30] K. Yan, M. Bagheri, and R. M. Summers (2018) 3d context enhanced region-based convolutional neural network for end-to-end lesion detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 511–519. Cited by: §I, §I, §II, §V-D, TABLE V, TABLE VI, §V.
  • [31] K. Yan, X. Wang, L. Lu, L. Zhang, A. P. Harrison, M. Bagheri, and R. M. Summers (2018) Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II, §V.
  • [32] P. Ypsilantis and G. Montana (2016) Recurrent convolutional networks for pulmonary nodule detection in ct imaging. arXiv preprint arXiv:1609.09143. Cited by: §II.
  • [33] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §II.
  • [34] N. Zhang, Y. Cao, B. Liu, and Y. Luo (2017) Improved multimodal representation learning with skip connections. In Proceedings of the 2017 ACM on Multimedia Conference, MM ’17, New York, NY, USA, pp. 654–662. ISBN 978-1-4503-4906-2. Cited by: §II, §III-A.
  • [35] W. Zhu, C. Liu, W. Fan, and X. Xie (2018) Deeplung: deep 3d dual path nets for automated pulmonary nodule detection and classification. arXiv preprint arXiv:1801.09555. Cited by: §I, §II, §II, TABLE II, §IV-B, TABLE IV.