MULAN: Multitask Universal Lesion Analysis Network for Joint Lesion Detection, Tagging, and Segmentation

08/12/2019 ∙ by Ke Yan, et al. ∙ National Institutes of Health ∙ Google Inc

When reading medical images such as a computed tomography (CT) scan, radiologists generally search across the image to find lesions, characterize and measure them, and then describe them in the radiological report. To automate this process, we propose a multitask universal lesion analysis network (MULAN) for joint detection, tagging, and segmentation of lesions in a variety of body parts, which greatly extends existing work of single-task lesion analysis on specific body parts. MULAN is based on an improved Mask R-CNN framework with three head branches and a 3D feature fusion strategy. It achieves the state-of-the-art accuracy in the detection and tagging tasks on the DeepLesion dataset, which contains 32K lesions in the whole body. We also analyze the relationship between the three tasks and show that tag predictions can improve detection accuracy via a score refinement layer.




1 Introduction

Detection, classification, and measurement of clinically important findings (lesions) in medical images are primary tasks for radiologists [10]. Generally, they search across the image to find lesions, and then characterize their locations, types, and related attributes to describe them in radiological reports. They may also need to measure the lesions, e.g., according to the RECIST guideline [2], for quantitative assessment and tracking. To reduce radiologists’ burden and improve accuracy, there have been many efforts in the computer-aided diagnosis area to automate this process. For example, detection, attribute estimation, and malignancy prediction of lung nodules have been extensively studied [6, 14]. Other works include detection and malignancy prediction of breast lesions [9], classification of three types of liver lesions [1], and segmentation of lymph nodes [12]. Variants of Faster R-CNN [9, 6] have been used for detection, whereas patch-based dictionaries [1] or networks [6, 14] have been studied for classification and segmentation.

Most existing work on lesion analysis focused on certain body parts (lung, liver, etc.). In practice, a radiologist often needs to analyze various lesions in multiple organs. Our goal is to build such a universal lesion analysis algorithm to mimic radiologists, which to the best of our knowledge is the first work on this problem. To this end, we attempt to integrate the three tasks in one framework. Compared to solving each task separately, the joint framework will be not only more efficient to use, but also more accurate, since different tasks may be correlated and help each other [13, 14].

We present the multitask universal lesion analysis network (MULAN), which can detect lesions in CT images, predict multiple tags for each lesion, and segment it as well. This end-to-end framework is based on an improved Mask R-CNN [3] with three branches: detection, tagging, and segmentation. The tagging (multilabel classification) branch learns from tags mined from radiological reports. We extracted 185 fine-grained and comprehensive tags describing the body part, type, and attributes of the lesions. The relation between the three tasks is analyzed by experiments in this paper. Intuitively, lesion detection can benefit from tagging, because the probability of a region being a lesion is associated with its attribute tags. We propose a score refinement layer in MULAN to explicitly fuse the detection and tagging results and improve the accuracy of both. A 3D feature fusion strategy is developed to leverage the 3D context information to improve detection accuracy.

MULAN is evaluated on the DeepLesion [17] dataset, a large-scale and diverse dataset containing measurements and 2D bounding boxes of over 32K lesions from a variety of body parts on computed tomography (CT) images. It has been adopted to learn models for universal lesion detection [15, 13], measurement [11], and classification [16]. On DeepLesion, MULAN achieves state-of-the-art accuracy in detection and tagging and performs comparably in segmentation. It outperforms the previous best detection result by over 10% in average sensitivity. We have released the code of MULAN (see footnote 1).

2 Method

Figure 1: Flowchart of MULAN and the 3D feature fusion strategy.

The flowchart of the multitask universal lesion analysis network (MULAN) is displayed in Fig. 1 (a). Similar to Mask R-CNN [3], MULAN has a backbone network to extract a feature map from the input image, which is then used in the region proposal network (RPN) to predict lesion proposals. Then, an ROIAlign layer [3] crops a small feature map for each proposal, which is used by three head branches to predict the lesion score, tags, and mask of the proposal.

2.1 Backbone with 3D Feature Fusion

A good backbone network is able to encode useful information of the input image into the feature map. In this study, we adopt the DenseNet-121 [4] in the backbone with the last dense block and transition layer removed, as we found removing them slightly improved accuracy and speed. Next, we employ the feature pyramid strategy [7] to add fine-level details into the feature map. This strategy also increases the size of the final feature map, which will benefit the detection and segmentation of small lesions. Different from the original feature pyramid network [7] which attaches head branches to each level of the pyramid, we attach the head branches only to the finest level [6, 14].

3D context information is very important when differentiating lesions from non-lesions [15]. 3D CNNs have been used for lung nodule detection [6]. However, they are memory-consuming, thus smaller networks need to be used. Universal lesion detection is much more difficult than lung nodule detection, so networks with more channels and layers are potentially desirable. Yan et al. [15] proposed 3D context enhanced region-based CNN (3DCE) and achieved better detection accuracy than a 3D CNN in the DeepLesion dataset. They first group consecutive axial slices in a CT volume into 3-channel images. The upper and lower images provide 3D context for the central image. A feature map is then extracted for each image with a shared 2D CNN. Lastly, they fuse the feature maps of all images with a convolutional (Conv) layer to produce the 3D-context-enhanced feature map for the central image and predict 2D boxes for the lesions on it.

The drawback of 3DCE is that the 3D context information is fused only in the last Conv layer, which limits the network’s ability to learn more complex 3D features. As shown in Fig. 1 (b), we improve 3DCE to relieve this issue. The basic idea is to fuse features of multiple slices in earlier Conv layers. Similar to 3DCE, feature maps (FMs) are fused with a Conv layer (i.e., the 3D fusion layer). Then, the fused central FM is used to replace the original central FM, while the upper and lower FMs are kept unchanged. All FMs are then fed to subsequent Conv layers. Because the new central FM contains 3D context information, sophisticated 3D features can be learned in subsequent layers with nonlinearity. This 3D fusion layer can be inserted between any two layers of the original 2D CNN. In MULAN, one 3D fusion layer is inserted after dense block 2 and another one after the last layer of the feature pyramid. We found that fusing 3D context at the beginning of the CNN (before dense block 2) did not work well, possibly because the CNN has not yet learned good semantic 2D features by then. At the end of the network, only the central feature map is used as the FM of the central image.

2.2 Head Branches and Score Refinement Layer

Figure 2: Illustration of the head branches and the score refinement layer of MULAN.

The structure and function of the three head branches are shown in Fig. 2. The detection branch consists of two 2048D fully connected layers (FC) and predicts the lesion score of each proposal, i.e., the probability of the proposal being a lesion. It also conducts bounding-box regression to refine the box [3].

The tagging branch predicts the body part, type, and attributes (intensity, shape, etc.) of the lesion proposal. It applies the same label mining strategy as that in LesaNet [16]. We first construct the lesion ontology based on the RadLex lexicon. To mine training labels, we tokenize the sentences in the radiological reports of DeepLesion, and then match and filter the tags in the sentences using a text mining module. The 185 tags with more than 30 occurrences in DeepLesion are kept. A weighted binary cross-entropy loss is applied on each tag. The hierarchical and mutually exclusive relations between the tags were leveraged in a label expansion strategy and a relational hard example mining loss to improve accuracy [16]. The score propagation layer and the triplet loss in [16] are not used. Due to space constraints, we refer readers to the supplementary material (sup. mat.) and [16] for more implementation details in this branch.

For the segmentation branch, we follow the method in [13] and generate pseudo-masks of lesions for training. The DeepLesion dataset does not contain lesions’ ground-truth masks. Instead, each lesion has a RECIST measurement [2], namely a long axis and a short axis annotated by radiologists. They are utilized to generate four quadrants as the estimation of the real mask [13], since most lesions have ellipse-like shapes. We use the Dice loss [14] as it works well in balancing foreground and background pixels. The predicted mask can be easily used to compute the contour and then the RECIST measurement of the lesion, see Fig. 2 for an example.

Intuitively, detection (lesion/non-lesion classification) is closely related to tagging. One way to exploit their synergy is to combine them in one branch so that they share FC features. However, this strategy led to inferior accuracy for both tasks in our experiments, probably because detecting a variety of lesions is a hard problem that requires rich features with high nonlinearity; a dedicated branch is therefore necessary. In this study, we propose to combine them at the decision level. Specifically, for each lesion proposal, we join its lesion score from the detection branch and the 185 tag scores from the tagging branch as a feature vector, then predict the lesion and tag scores again using a score refinement layer (SRL). Tag predictions can thus support detection explicitly. We also add new features as the input of the layer, including the statistics of the proposal (width, height), the patient’s gender, and age. Other relevant features such as medical history and lab results may also be considered. In MULAN, SRL is a simple FC layer, as we found that more nonlinearity did not improve results, possibly due to overfitting. The losses for detection and tagging after this layer are the same as those in the respective branches.
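As a rough illustration of the decision-level fusion, the sketch below models the SRL as a single linear layer over the concatenated scores and extra features. It is not the released implementation: the function names, the zero weights on the extra features, and the toy inputs are assumptions made for the example. With identity weights on the score inputs, the layer initially returns the scores unchanged, which is the starting point that training then refines.

```python
def make_identity_srl(num_scores, num_extra):
    """Build (weight, bias) for a linear layer mapping
    [scores + extra features] -> refined scores.

    Assumption for this sketch: identity weights on the scores and
    zero weights on the extra features, so output == input scores.
    """
    num_inputs = num_scores + num_extra
    weight = [[1.0 if col == row else 0.0 for col in range(num_inputs)]
              for row in range(num_scores)]
    bias = [0.0] * num_scores
    return weight, bias


def apply_srl(weight, bias, scores, extra_features):
    """Apply the linear layer to one proposal's feature vector."""
    x = list(scores) + list(extra_features)
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Toy proposal: 1 lesion score + 2 tag scores (MULAN uses 185 tags),
# plus hypothetical extra features (width, height, gender, age).
scores = [0.9, 0.2, 0.7]
extra = [64.0, 48.0, 1.0, 57.0]
weight, bias = make_identity_srl(len(scores), len(extra))
refined = apply_srl(weight, bias, scores, extra)
```

After training, off-diagonal weights would let a confident tag score raise or lower the lesion score, which is the mechanism the text attributes to the SRL.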

More implementation details of MULAN are described in the sup. mat.

3 Experiments and Discussion


MULAN was implemented in PyTorch based on the maskrcnn-benchmark project. The DenseNet backbone was initialized with an ImageNet pretrained model. The score refinement layer was initialized with an identity matrix so that the scores before and after it were the same when training started. Other layers were randomly initialized. Each mini-batch had 8 samples, where each sample consisted of three 3-channel images for 3D fusion (Fig. 1 (b)). We used SGD to train MULAN for 8 epochs and set the base learning rate to 0.004, then reduced it by a factor of 10 after the 4th and 6th epochs. It takes MULAN 30ms to predict a sample during inference on a Tesla V100 GPU.
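The step schedule described above can be written as a small function. This is a minimal sketch that only reproduces the stated numbers (base rate 0.004, divided by 10 after the 4th and 6th of 8 epochs); the function name and the 0-based epoch convention are assumptions, and any warm-up or other details of the actual training script are not modeled.

```python
def learning_rate(epoch, base_lr=0.004, milestones=(4, 6), gamma=0.1):
    """Learning rate for a 0-based epoch index.

    Epochs 0-3 are the 1st-4th epochs; the rate is multiplied by
    `gamma` once for every milestone that has already been reached.
    """
    num_drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** num_drops


# Rates for the 8 training epochs:
# four epochs at 0.004, two at 0.0004, two at 0.00004.
schedule = [learning_rate(e) for e in range(8)]
```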

Data: The DeepLesion dataset [17] contains 32,735 lesions and was divided into training (70%), validation (15%), and test (15%) sets at the patient level. When training, we did data augmentation for each image in three ways: random resizing with a ratio of 0.8 to 1.2; random translation of -8 to 8 pixels in the x and y axes; and 3D augmentation. A lesion in DeepLesion was annotated in one axial slice, but the actual lesion also exists in approximately the same position in several neighboring slices depending on its diameter and the slice interval. Therefore, we can do 3D augmentation by randomly shifting the slice index within half of the lesion’s short diameter. Each of these three augmentation methods improved detection accuracy by 0.2 to 0.4%. Some examples of DeepLesion are presented in Section 1 of the sup. mat.
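The 3D augmentation can be sketched as follows. The interpretation here is an assumption: the slice index is shifted up or down by a whole number of slices whose physical distance does not exceed half of the short diameter. The function and parameter names are hypothetical.

```python
import random


def shifted_slice_index(key_idx, short_diam_mm, slice_interval_mm,
                        num_slices, rng=random):
    """Pick a slice near the annotated key slice for 3D augmentation.

    The shift (in slices) is drawn uniformly so that its physical
    distance stays within half of the lesion's short diameter, and
    the result is clamped to the valid slice range of the volume.
    """
    max_shift = int((short_diam_mm / 2.0) // slice_interval_mm)
    shift = rng.randint(-max_shift, max_shift)
    return min(max(key_idx + shift, 0), num_slices - 1)


# A 10 mm short diameter at 2 mm slice interval allows shifts of up
# to 2 slices in either direction around key slice 50.
rng = random.Random(0)
samples = [shifted_slice_index(50, 10.0, 2.0, 100, rng) for _ in range(100)]
```

Small lesions get no shift at all under this rule (for example a 1 mm short diameter at 2 mm spacing), which matches the intuition that they are only visible on the annotated slice.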

Metrics: For detection, we compute the sensitivities at 0.5, 1, 2, and 4 false positives (FPs) per image [15] and average them, which is similar to the evaluation metric of the LUNA dataset [6]. For tagging, we use the 500 manually tagged lesions in [16] for evaluation. The area under the ROC curve (AUC) and F1 score are computed for each tag and then averaged. Since there are no ground-truth (GT) masks in DeepLesion except for RECIST measurements [11], we use the average distance from the endpoints of the GT measurement to the predicted contour as a surrogate criterion (see sup. mat. Section 2). The second criterion is the average error of length of the estimated RECIST diameters, which are very useful values for radiologists and clinicians [2].
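A simplified sketch of the detection metric is given below. It assumes that every detection has already been labeled as a true or false positive and that each true positive hits a distinct lesion (duplicate hits on one lesion are not handled); the function name and the toy data are illustrative only.

```python
def sensitivity_at_fps(detections, num_lesions, num_images,
                       fp_rates=(0.5, 1, 2, 4)):
    """Sensitivity at each allowed number of FPs per image.

    detections: list of (score, is_true_positive) over the whole set.
    Detections are visited in order of decreasing score; for each FP
    budget we stop right before the false positive that would exceed it.
    """
    ranked = sorted(detections, key=lambda d: -d[0])
    sensitivities = []
    for rate in fp_rates:
        max_fp = rate * num_images
        tp = fp = 0
        for _score, is_tp in ranked:
            if is_tp:
                tp += 1
            elif fp + 1 > max_fp:
                break
            else:
                fp += 1
        sensitivities.append(tp / num_lesions)
    return sensitivities


# Toy example: 2 images containing 4 lesions in total.
dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, False),
        (0.5, True), (0.4, False), (0.3, False), (0.2, True)]
sens = sensitivity_at_fps(dets, num_lesions=4, num_images=2,
                          fp_rates=(0.5, 1, 2))
avg_sens = sum(sens) / len(sens)
```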

Qualitative and quantitative results are presented in Fig. 3 and Table 1, respectively. Note that in Table 1, tagging and segmentation accuracy were calculated by predicting tags and masks based on GT bounding-boxes, so that they were under the same setting as previous studies [11, 16] and independent of the detection accuracy. We will discuss the results of each task below.

Figure 3: Examples of MULAN’s lesion detection, tagging, and segmentation results on the test set of DeepLesion. For detection, boxes in green and red are predicted TPs and FPs, respectively. The number above each box is the lesion score (confidence). For tagging, tags in black, red (underlined), and blue (italic) are predicted TPs, FPs, and FNs, respectively. They are ranked by their scores. For segmentation, the green lines are ground-truth RECIST measurements; the orange contours and lines show predicted masks and RECIST measurements, respectively. More visual examples are provided in sup. mat. Section 3. (TP: true positive; FP: false positive; FN: false negative)
Method                        Detection (%)     Tagging (%)     Segmentation (mm)
                              Avg. sensitivity  AUC     F1      Distance  Diam. err.
ULDor [13]                    69.22             -       -       -         -
3DCE [15]                     75.55             -       -       -         -
LesaNet [16] (rerun)          -                 95.12   43.17   -         -
Auto RECIST [11]              -                 -       -       -         1.7088
MULAN                         86.12             96.01   45.53   1.4138    1.9660
(a) w/o feature pyramid       79.73             95.51   43.44   1.6634    2.3780
(b) w/o 3D fusion             79.57             95.88   44.28   1.4120    1.9756
(c) w/o detection branch      -                 95.16   40.03   1.2445    1.7837
(d) w/o tagging branch        84.79             -       -       1.4230    1.9589
(e) w/o mask branch           85.21             95.87   43.76   -         -
(f) w/o score refine. layer   84.24             95.65   44.59   1.4260    1.9687
Table 1: Accuracy comparison and ablation studies on the test set of DeepLesion. Bold results are the best ones. Underlined results in the ablation studies are the worst ones, indicating the ablated strategy is the most important for the criterion.

Detection: Table 1 shows that MULAN significantly surpasses existing work on universal lesion detection by over 10% in average sensitivity. According to the ablation study, 3D fusion and feature pyramid improve detection accuracy the most. If the tagging branch is not added (ablation study (d)), the detection accuracy is 84.79%; when it is added, the accuracy slightly drops to 84.24% (ablation study (f)). However, when the score refinement layer (SRL) is added, we achieve the best detection accuracy of 86.12%. We hypothesize that SRL effectively exploits the correlation between the two tasks and uses the tag predictions to refine the lesion detection score. To verify the impact of SRL, we randomly re-split the training and validation set of DeepLesion five times and found that MULAN with SRL always outperformed MULAN without SRL.

Examples in Fig. 3 show that MULAN is able to detect true lesions with high confidence scores, although there are still FPs when normal tissues have a similar appearance to lesions. We analyzed the detection accuracy by tags and found that lung masses/nodules, mediastinal and pelvic lymph nodes, and adrenal and liver lesions are among the lesions with the highest sensitivity, while lesions in the pancreas, bone, thyroid, and extremity are relatively hard to detect. These conclusions can guide us to collect more training samples with the difficult tags in the future.

Tagging: MULAN outperforms LesaNet [16], a multilabel CNN designed for universal lesion tagging. According to ablation study (c), adding the detection branch improves tagging accuracy. This is probably because detection is hard and requires comprehensive features to be learned in the backbone of MULAN, which are also useful for tagging. Fig. 3 shows that MULAN is able to predict the body part, type, and attributes of lesions with high accuracy.

Segmentation: Our predicted RECIST diameters have an average error of 1.97mm compared with the GT diameters. From Fig. 3, we can find that MULAN performs relatively well on lesions with clear borders, but struggles on those with indistinct or irregular borders, e.g., the liver mass in Fig. 3 (c). Ablation studies show that feature pyramid is the most crucial strategy. Another interesting finding is that removing the detection branch (ablation study (c)) markedly improves segmentation accuracy. The detection task impairs segmentation, which could be a major reason why the multitask MULAN cannot beat Auto RECIST [11], a framework dedicated to lesion measurement. It implies that better segmentation results may be achieved using a single-task CNN.

More detailed results are shown in the supplementary material.

4 Conclusion and Future Work

In this paper, we proposed MULAN, the first multitask universal lesion analysis network which can simultaneously detect, tag, and segment lesions in a variety of body parts. The training data of MULAN can be mined from radiologists’ routine annotations and reports with minimum manual effort [17]. An effective 3D feature fusion strategy was developed. We also analyzed the interaction between the three tasks and discovered that: 1) Tag predictions could improve detection accuracy via a score refinement layer; 2) The detection task improved tagging accuracy but impaired segmentation performance.

Universal lesion analysis is a challenging task partially because of the large variance of appearances of the normal and abnormal tissues. Therefore, the 22K training lesions in DeepLesion are still not sufficient for MULAN to learn, which is a main reason for its FPs and FNs. In the future, more training data need to be mined. We also plan to apply or finetune MULAN on other applications of specific lesions. We hope MULAN can be a useful tool for researchers focusing on different types of lesions.

Acknowledgments: This research was supported by the Intramural Research Programs of the National Institutes of Health (NIH) Clinical Center and National Library of Medicine (NLM). It was also supported by NLM of NIH under award number K99LM013001. We thank NVIDIA for GPU card donations.

5 Appendix

5.1 Introduction to the DeepLesion Dataset

Figure 4: Examples of the CT images, annotations, and reports in DeepLesion [17]. The red and blue lines in the images are the RECIST measurements. The green boxes are the bounding-boxes. The sentences are extracted from radiological reports according to the bookmarks [16]. The tags are mined from the sentences and normalized [16].

The DeepLesion dataset [17] was mined from a hospital’s picture archiving and communication system (PACS) based on bookmarks, which are markers annotated by radiologists during their routine work to measure significant image findings. It is a large-scale dataset with 32,735 lesions on 32,120 axial slices from 10,594 CT studies of 4,427 unique patients. There are 1 – 3 lesions in each axial slice. Different from existing datasets that typically focus on one type of lesion, DeepLesion contains a variety of lesions including those in lungs, livers, kidneys, etc., and enlarged lymph nodes.

Each lesion in DeepLesion has a RECIST measurement [2], which consists of two lines: one measuring the longest diameter of the lesion and the second measuring its longest perpendicular diameter in the axial plane, see Fig. 4. From these two diameters, we can compute a 2D bounding box to train a lesion detection algorithm [17], as well as generate a pseudo-mask to train a lesion segmentation algorithm [13].
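Deriving a bounding box from the two RECIST diameters amounts to taking the extent of their four endpoints. The sketch below illustrates this; the function name and the optional `pad` margin are assumptions for the example, not a description of how the dataset boxes were actually produced.

```python
def recist_to_bbox(long_axis, short_axis, pad=0.0):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around a RECIST
    measurement.

    long_axis, short_axis: ((x1, y1), (x2, y2)) endpoint pairs in pixels.
    pad: hypothetical extra margin in pixels added on every side.
    """
    points = list(long_axis) + list(short_axis)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)


# Toy measurement: a long axis and a roughly perpendicular short axis.
long_axis = ((10, 20), (50, 40))
short_axis = ((25, 45), (35, 15))
box = recist_to_bbox(long_axis, short_axis)
padded_box = recist_to_bbox(long_axis, short_axis, pad=5)
```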

Besides measuring the lesions, radiologists often describe them in radiological reports and use a hyperlink (shown as “BOOKMARK” in Fig. 4) to link the measurement with the sentence. We can extract tags that describe the lesion in the sentence to train a lesion tagging algorithm [16]. The predicted tags can provide comprehensive and fine-grained semantic information for the user to understand the lesion.

5.2 Additional Details in Methods

5.2.1 Backbone

The backbone structure of MULAN is a truncated DenseNet-121 [4] (87 Conv layers after truncation) with feature pyramid [7] and 3D feature fusion. The finest level of the feature pyramid corresponds to dense block 1 and has stride 4 [7]. The channel number after feature pyramid is 512.

5.2.2 Detection and Segmentation

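The structures of the region proposal network (RPN), detection branch, and mask branch are similar to those in Mask R-CNN [3]. Five anchor scales (16, 24, 32, 48, 96) and three anchor ratios (1:2, 1:1, 2:1) are used in RPN. The loss function for detection and segmentation is

$$L_{\text{det+seg}} = L_{\text{cls}}^{\text{rpn}} + L_{\text{reg}}^{\text{rpn}} + L_{\text{cls}}^{\text{det}} + L_{\text{reg}}^{\text{det}} + L_{\text{seg}}, \tag{1}$$

where $L_{\text{cls}}^{\text{rpn}}$ and $L_{\text{reg}}^{\text{rpn}}$ are the classification (lesion vs. non-lesion) and bounding-box regression [3] losses of RPN; $L_{\text{cls}}^{\text{det}}$ and $L_{\text{reg}}^{\text{det}}$ are those in the detection branch; $L_{\text{seg}}$ is the Dice loss [8] in the segmentation branch.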
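To make the anchor configuration concrete, the sketch below enumerates the 15 anchor shapes implied by five scales and three ratios. It assumes the common convention in which each anchor keeps an area of scale squared and the ratio is height divided by width; whether the released code uses exactly this convention is not stated in the text.

```python
import math


def anchor_shapes(scales=(16, 24, 32, 48, 96), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) for every scale/ratio combination.

    Assumed convention: ratio = height / width, area = scale ** 2.
    """
    shapes = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


shapes = anchor_shapes()  # 5 scales x 3 ratios = 15 anchors per location
```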

5.2.3 Tagging

The label mining strategy and the loss function of the tagging branch are similar to [16], except that the score propagation layer and the triplet loss are not used. Based on the RadLex lexicon [5], we run whole-word matching in the sentences to extract the lesion tags and combine all synonyms. Some tags in the sentence are not related to the lesion in the image, so we use a text-mining module [16] to filter the irrelevant tags. The final 185 tags can be categorized into three classes [16]: 1. Body parts, which include coarse-level body parts (e.g., chest, abdomen), organs (lung, lymph node), fine-grained organ parts (right lower lobe, pretracheal lymph node), and other body regions (porta hepatis, paraspinal); 2. Types, which include general terms (nodule, mass) and more specific ones (adenoma, liver mass); and 3. Attributes, which describe the intensity, shape, size, etc., of the lesions (hypoattenuation, spiculated, large).

The tagging branch predicts a score $s_{i,c}$ for each tag $c$ of each proposal $i$. Because positive labels are sparse for most tags, we adopt a weighted cross-entropy (WCE) loss [16] for each tag as in Eq. 2, where $B$ is the number of true lesions in a minibatch, $C$ is the number of tags; $y_{i,c} \in \{0, 1\}$ is the ground-truth of lesion $i$ having tag $c$. The loss weights are $\beta_c^p = (P_c + N_c)/(2P_c)$ and $\beta_c^n = (P_c + N_c)/(2N_c)$, where $P_c$ and $N_c$ are the number of positive and negative labels of tag $c$ in the training set of DeepLesion, respectively. Similar to the segmentation branch, the tagging branch only considers proposals corresponding to true lesions in the loss function, since we do not know the ground-truth tags of non-lesions, although non-lesions can also have body parts and attributes.

$$L_{\text{WCE}} = -\sum_{i=1}^{B} \sum_{c=1}^{C} \left( \beta_c^p \, y_{i,c} \log s_{i,c} + \beta_c^n \, (1 - y_{i,c}) \log (1 - s_{i,c}) \right) \tag{2}$$
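A weighted cross-entropy of the kind described above can be sketched in plain Python. The weight formula used here, (P + N) / (2P) for positives and (P + N) / (2N) for negatives, is our reading of the LesaNet-style weighting and should be treated as an assumption, as should the function names and the clamping constant `eps`.

```python
import math


def wce_weights(num_pos, num_neg):
    """Per-tag loss weights; rarer classes get larger weights.

    Assumed form: (P + N) / (2P) for positives, (P + N) / (2N) for
    negatives, where P and N are training-set label counts.
    """
    total = num_pos + num_neg
    return total / (2.0 * num_pos), total / (2.0 * num_neg)


def wce_loss(scores, labels, pos_weights, neg_weights, eps=1e-7):
    """Weighted binary cross-entropy summed over lesions and tags.

    scores[i][c]: predicted probability of tag c for lesion i.
    labels[i][c]: 1 if lesion i has tag c, else 0.
    """
    loss = 0.0
    for score_row, label_row in zip(scores, labels):
        for c, (s, y) in enumerate(zip(score_row, label_row)):
            s = min(max(s, eps), 1.0 - eps)  # avoid log(0)
            loss -= (pos_weights[c] * y * math.log(s)
                     + neg_weights[c] * (1 - y) * math.log(1.0 - s))
    return loss


# A tag with 10 positive and 90 negative training labels.
w_pos, w_neg = wce_weights(10, 90)
example_loss = wce_loss([[0.5]], [[1]], [w_pos], [w_neg])
```

With these weights a missed positive on a rare tag costs far more than a false alarm, which counteracts the sparsity of positive labels.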


There are hierarchical and mutually exclusive relations between the tags. For example, lung is the parent of left lung (if a lesion is in the left lung, it must be in the lung), while left lung and right lung are exclusive (they cannot both be true for one lesion). These relations can be leveraged to improve tagging accuracy. Tags extracted from reports are often not complete since radiologists typically do not write down all possible characteristics. If a tag is not mentioned in the report, it may still be true. To deal with this label noise problem, first, we use the label expansion strategy [16] to infer the missing parent tags. If a child tag is mined from the report, all its parents will be set as true. Second, we use the relational hard example mining (RHEM) strategy [16] to suppress reliable negative tags. If a tag is true, all its exclusive tags must be false, so we can define a new loss term to assign higher weights to these exclusive tags.
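The two relation-based steps can be illustrated with a toy ontology. The dictionaries below are hypothetical examples, not the actual RadLex-derived relations, and the second function only identifies the reliable negative tags; the RHEM loss that reweights them is not reproduced here.

```python
def expand_labels(tags, parents):
    """Label expansion: add all ancestors of every mined tag.

    parents: dict mapping a tag to a list of its direct parent tags.
    """
    expanded = set(tags)
    stack = list(tags)
    while stack:
        tag = stack.pop()
        for parent in parents.get(tag, ()):
            if parent not in expanded:
                expanded.add(parent)
                stack.append(parent)
    return expanded


def reliable_negatives(positive_tags, exclusive):
    """Tags that must be false because an exclusive tag is true.

    exclusive: dict mapping a tag to the set of tags it excludes.
    """
    negatives = set()
    for tag in positive_tags:
        negatives |= set(exclusive.get(tag, ()))
    return negatives - set(positive_tags)


# Hypothetical toy ontology for illustration only.
parents = {"left lung": ["lung"], "lung": ["chest"]}
exclusive = {"left lung": {"right lung"}, "right lung": {"left lung"}}

positives = expand_labels({"left lung"}, parents)
negatives = reliable_negatives(positives, exclusive)
```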

5.2.4 Overall

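The overall loss function of MULAN is

$$L = L_{\text{det+seg}} + L_{\text{WCE}} + L_{\text{RHEM}} + L_{\text{cls}}^{\text{srl}} + L_{\text{WCE}}^{\text{srl}}, \tag{3}$$

where $L_{\text{det+seg}}$, the detection and segmentation loss, is defined in Eq. 1, and $L_{\text{WCE}}$ and $L_{\text{RHEM}}$ are the losses of the tagging branch. $L_{\text{cls}}^{\text{srl}}$ and $L_{\text{WCE}}^{\text{srl}}$ are the losses of the score refinement layer, which have the same forms as the detection classification loss $L_{\text{cls}}^{\text{det}}$ in Eq. 1 and $L_{\text{WCE}}$ in Eq. 2, respectively.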

5.3 Additional Details in Experiments and Results

5.3.1 Image Preprocessing Method

We rescaled the 12-bit CT intensity range to floating-point numbers in [0,255] using a single windowing (-1024–3071 HU) that covers the intensity ranges of the lung, soft tissue, and bone. Every image slice was resized so that each pixel corresponds to 0.8mm. The slice intervals of most CT scans in the dataset are either 1mm or 5mm. We interpolated in the z-axis to make the intervals of all volumes 2mm. The black borders in images were clipped for computation efficiency. We used the official data split of DeepLesion. The inputs to our experiments are 9-slice sub-volumes in DeepLesion, including the key slice that contains the lesion, 4 slices superior to it, and 4 slices inferior [15].
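The intensity windowing and in-plane resizing can be expressed as two small helpers. This is a minimal per-value sketch with hypothetical function names; a real pipeline would apply the same arithmetic to whole image arrays and use an interpolation routine for the actual resampling.

```python
def window_hu(hu, low=-1024.0, high=3071.0):
    """Map a Hounsfield-unit value to a float in [0, 255].

    Values outside the window are clipped to its ends before the
    linear rescaling.
    """
    hu = min(max(hu, low), high)
    return (hu - low) / (high - low) * 255.0


def resize_factor(pixel_spacing_mm, target_spacing_mm=0.8):
    """Zoom factor that brings a slice to the target pixel spacing."""
    return pixel_spacing_mm / target_spacing_mm


air = window_hu(-1024)      # lower end of the window
water = window_hu(0)        # about a quarter of the output range
dense_bone = window_hu(5000)  # above the window, clipped to the top
zoom = resize_factor(0.4)   # finer-than-target slices are downsampled
```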

5.3.2 Surrogate Evaluation Criteria for Lesion Segmentation

Figure 5: Illustration of the predicted mask (green contour), estimated RECIST measurement (green segments), and ground-truth RECIST measurement (orange segments with yellow endpoints) of a lesion.

There are no ground-truth (GT) masks in DeepLesion. Instead, each lesion has a GT RECIST measurement, so we use two surrogate metrics to evaluate segmentation results, see Fig. 5. First, if the predicted mask is accurate, the endpoints of the GT RECIST measurement should be on its contour. Therefore, the average distance from the endpoints of the GT measurement to the contour of the predicted mask is a useful metric (the smaller the better). Second, if the predicted mask is accurate, the lengths of the estimated RECIST measurement (diameters) should be the same as the GT diameters. The average error of lengths is thus another useful metric (the smaller the better).

RECIST measurements [11] can be easily estimated from the predicted mask. We first compute the contour of the predicted mask, then find two points on the contour with the largest distance to form the long axis. Next, we search on the contour to find the short axis that is perpendicular to the long axis and has the largest length.
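The procedure above can be sketched with a brute-force search over contour points. On a discrete contour, exactly perpendicular chords are rare, so this sketch accepts chords whose direction is within a small cosine tolerance of perpendicular; that tolerance (`max_cos`) and the function name are assumptions, and the check that the short axis crosses the long axis is omitted for brevity.

```python
import itertools
import math


def estimate_recist(contour, max_cos=0.05):
    """Estimate (long_axis, short_axis) from contour points.

    contour: list of (x, y) points on the mask boundary.
    The long axis is the farthest pair of points; the short axis is
    the longest chord that is (nearly) perpendicular to it.
    """
    pairs = list(itertools.combinations(contour, 2))
    long_axis = max(pairs, key=lambda pq: math.dist(pq[0], pq[1]))
    lx = long_axis[1][0] - long_axis[0][0]
    ly = long_axis[1][1] - long_axis[0][1]
    long_len = math.hypot(lx, ly)

    short_axis, short_len = None, 0.0
    for p, q in pairs:
        sx, sy = q[0] - p[0], q[1] - p[1]
        length = math.hypot(sx, sy)
        if length == 0.0:
            continue
        cos_angle = abs(sx * lx + sy * ly) / (length * long_len)
        if cos_angle <= max_cos and length > short_len:
            short_axis, short_len = (p, q), length
    return long_axis, short_axis


# Coarse ellipse-like contour with semi-axes 4 and 2.
contour = [(-4, 0), (-3, 1), (0, 2), (3, 1),
           (4, 0), (3, -1), (0, -2), (-3, -1)]
long_axis, short_axis = estimate_recist(contour)
long_diam = math.dist(*long_axis)
short_diam = math.dist(*short_axis)
```

The quadratic search is adequate for the few hundred points of a typical lesion contour.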

5.3.3 Additional Results

Figure 6: Examples of MULAN’s lesion detection, tagging, and segmentation results on the test set of DeepLesion. For detection, boxes in green and red are predicted TPs and FPs, respectively. The number above each box is the lesion score (confidence). For tagging, tags in black, red (underlined), and blue (italic) are predicted TPs, FPs, and FNs, respectively. They are ranked according to their scores. For segmentation, the green lines are ground-truth RECIST measurements; the orange contours and lines show predicted masks and RECIST measurements, respectively. (TP: true positive; FP: false positive; FN: false negative)
Figure 7: Free-response receiver operating characteristic (FROC) curve of various methods and variations of MULAN on the test set of DeepLesion.

We show the free-response receiver operating characteristic (FROC) curves in Fig. 7. They correspond to the detection accuracies in Table 1 of the main paper. MULAN outperforms previous methods 3DCE and ULDor. In Fig. 6, the threshold for lesion scores is 0.5. Note that some FPs in the detection results are actually TPs, because there are missing lesion annotations in the test set of DeepLesion. Some examples include the smaller lung mass in Fig. 6 (a) and the two smaller pancreatic lesions in Fig. 6 (c).

To turn tag scores into decisions, we calibrated a threshold for each tag that yielded the best F1 on the validation set, and then applied it on the test set. In Fig. 6, MULAN is able to predict the body part, type, and attributes of lesions with high accuracy. Possible reasons for tagging errors include:

  • Some attributes with variable appearances and few training samples have some FPs, such as “benign” and “diffuse” in Fig. 6 (d);

  • Adjacent body parts may be confused by the model, such as “right lower lobe” in (a) and “pancreatic head”, “pancreatic tail”, “lesser sac”, and “duodenum” in (c).
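The per-tag threshold calibration mentioned before the list can be sketched as a search over candidate thresholds on the validation scores of one tag. Using the observed scores themselves as candidates is an assumption of this example, as are the function names.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when predicting positive for scores >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    if tp == 0:
        return 0.0
    return 2.0 * tp / (2 * tp + fp + fn)


def calibrate_threshold(scores, labels):
    """Pick the candidate threshold with the best validation F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Validation scores and ground truth for a single tag.
val_scores = [0.1, 0.4, 0.35, 0.8]
val_labels = [0, 0, 1, 1]
best_threshold = calibrate_threshold(val_scores, val_labels)
best_f1 = f1_at_threshold(val_scores, val_labels, best_threshold)
```

The chosen threshold is then frozen and applied to the test-set scores of the same tag.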

The threshold for mask prediction is 0.5. In Fig. 6, MULAN performs relatively well on lesions with clear borders ((a) and (b)), but struggles on those with indistinct borders ((c) and (d)). For the latter case, GT measurements may sometimes be inaccurate or not consistent (different radiologists have different opinions).

5.4 Results using the Released Tags [16] of DeepLesion

In this paper, we used an NLP algorithm slightly different from [16] to mine tags from reports. There are 185 tags in this paper and 171 in [16]. Since the 171 tags of [16] have been released (see footnote 3), we also retrained MULAN on these 171 tags so that the results can be compared with other methods trained on the 171 tags. The results are shown in Table 2 below. The definitions of the metrics are the same as those in Table 1 of the main paper. The results are also similar to those in Table 1 of the main paper.

Method                        Detection (%)     Tagging (%)     Segmentation (mm)
                              Avg. sensitivity  AUC     F1      Distance  Diam. err.
ULDor [13]                    69.22             -       -       -         -
3DCE [15]                     75.55             -       -       -         -
LesaNet [16]                  -                 93.98   43.44   -         -
Auto RECIST [11]              -                 -       -       -         1.7088
MULAN                         85.22             95.12   46.12   1.4354    1.9619
Table 2: Accuracy comparison on the test set of DeepLesion using the 171 training tags of [16]. Note that only LesaNet and MULAN used the tags.

The detection performance of MULAN at various FPs per image is reported in Table 3. The 171 tags were used for training, so the results are slightly different from those in Table 1 of the main paper.

FPs per image  0.5    1      2      4      8      16     Avg. of [0.5,1,2,4]
3DCE [15]      62.48  73.37  80.70  85.65  89.09  91.06  75.55
ULDor [13]     52.86  64.80  74.84  84.38  87.17  91.80  69.22
MULAN          76.12  83.69  88.76  92.30  94.71  95.64  85.22
Table 3: Sensitivity (%) at various FPs per image on the test set of DeepLesion (the 171 tags were used for training MULAN).
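As a quick arithmetic check, the last column of Table 3 is the mean of the sensitivities at 0.5, 1, 2, and 4 FPs per image. The short script below recomputes it from the table values; only the variable names are introduced here.

```python
# Sensitivities (%) at 0.5, 1, 2, and 4 FPs per image, from Table 3.
sensitivities = {
    "3DCE": [62.48, 73.37, 80.70, 85.65],
    "ULDor": [52.86, 64.80, 74.84, 84.38],
    "MULAN": [76.12, 83.69, 88.76, 92.30],
}


def average_sensitivity(values):
    """Mean sensitivity over the chosen FP rates."""
    return sum(values) / len(values)


averages = {name: average_sensitivity(v) for name, v in sensitivities.items()}
# MULAN's mean is 85.2175, which the table reports rounded as 85.22.
```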

Tables 4-6 show the details of the 171 tags and the tag-wise detection and tagging accuracies. The accuracies were computed on the mined tags [16] of the validation set of DeepLesion. “# Train” and “# Test” are the numbers of positive cases in the training and validation sets, respectively.

Tag Class # Train # Test Det. Tag AUC Tag F1
chest bodypart 6813 784 76 94 72
abdomen bodypart 5752 589 71 94 69
lymph node bodypart 4752 516 74 95 67
lung bodypart 3481 414 79 97 87
upper abdomen bodypart 2957 347 71 95 66
retroperitoneum bodypart 2340 241 70 97 71
liver bodypart 1838 200 73 98 81
pelvis bodypart 1634 176 68 98 79
mediastinum bodypart 1604 170 77 96 61
right lung bodypart 1579 195 75 97 71
left lung bodypart 1306 157 81 99 82
mediastinum lymph node bodypart 1152 130 77 97 66
kidney bodypart 896 116 66 98 83
soft tissue bodypart 851 82 63 82 28
hilum bodypart 681 65 69 96 79
chest wall bodypart 672 69 51 98 72
left lower lobe bodypart 654 63 70 99 71
right lower lobe bodypart 647 79 79 98 61
right upper lobe bodypart 518 72 73 98 65
left upper lung bodypart 512 79 85 99 72
abdomen lymph node bodypart 504 44 61 92 43
axilla bodypart 493 41 62 100 85
mesentery bodypart 468 47 69 97 52
bone bodypart 462 52 49 96 65
paraaortic bodypart 462 40 74 98 46
pelvis lymph node bodypart 439 67 72 98 72
retroperitoneum lymph node bodypart 432 31 75 98 39
pancreas bodypart 395 31 56 99 52
adrenal gland bodypart 377 50 79 100 80
axilla lymph node bodypart 346 23 64 99 68
blood vessel bodypart 341 47 68 86 35
left kidney bodypart 320 45 62 99 68
hilum lymph node bodypart 318 28 73 99 68
right kidney bodypart 278 41 73 99 67
pleura bodypart 275 43 55 95 29
right mid lung bodypart 257 32 68 96 57
groin bodypart 226 32 66 98 68
pelvic wall bodypart 222 24 65 97 45
left adrenal gland bodypart 220 29 74 100 75
spleen bodypart 208 23 70 95 67
spine bodypart 176 23 63 95 57
neck bodypart 171 16 72 99 45
mesentery lymph node bodypart 170 20 59 96 38
iliac lymph node bodypart 165 29 77 99 55
lung base bodypart 164 20 71 95 23
muscle bodypart 163 21 79 95 26
porta hepatis bodypart 162 16 50 94 41
right hilum lymph node bodypart 159 13 77 100 63
groin lymph node bodypart 144 28 65 98 67
fat bodypart 140 21 70 87 11
subcarinal lymph node bodypart 138 26 83 99 56
fissure bodypart 128 13 75 96 18
body wall bodypart 127 18 86 99 28
right adrenal gland bodypart 124 20 84 100 67
subcutaneous bodypart 121 34 76 98 43
lingula bodypart 120 14 86 99 27
porta hepatis lymph node bodypart 119 11 50 93 36
Table 4: Details of the 171 tags. The detection accuracy (average sensitivity in %), tagging AUC and F1 (%) are also shown.
Tag Class # Train # Test Det. Tag AUC Tag F1
superior mediastinum bodypart 116 8 72 96 11
peritoneum bodypart 115 3 92 88 0
superior mediast. lymph node bodypart 108 16 80 96 21
paraspinal bodypart 107 11 77 91 23
external iliac lymph node bodypart 103 15 73 99 38
supraclavicular lymph node bodypart 102 9 86 96 40
breast bodypart 101 24 70 95 46
thyroid gland bodypart 100 12 69 100 55
aorticopulmonary window bodypart 99 11 75 99 44
intestine bodypart 97 9 61 94 9
anterior mediastinum bodypart 92 8 91 99 37
pancreatic head bodypart 89 10 35 99 30
common iliac lymph node bodypart 84 8 53 98 17
abdominal wall bodypart 83 7 79 100 33
left hilum lymph node bodypart 83 13 71 100 63
extremity bodypart 79 17 85 97 20
adnexa bodypart 73 7 43 98 31
paracaval lymph node bodypart 72 3 83 100 16
airway bodypart 71 10 45 97 23
aorta bodypart 71 5 45 92 0
cardiophrenic bodypart 69 4 75 99 19
rib bodypart 66 11 14 99 53
diaphragm bodypart 64 11 59 89 0
pancreatic tail bodypart 63 3 50 99 9
paraspinal muscle bodypart 59 4 88 90 13
peripancreatic lymph node bodypart 57 4 25 98 13
omentum bodypart 55 7 46 96 5
thigh bodypart 54 12 90 98 24
psoas muscle bodypart 54 4 88 89 14
thoracic spine bodypart 51 9 50 100 44
subpleural bodypart 51 7 61 97 14
vertebral body bodypart 50 8 41 100 46
retrocrural lymph node bodypart 50 4 81 79 50
lumbar bodypart 48 4 69 99 30
perihilar bodypart 48 1 100 96 0
pretracheal lymph node bodypart 47 10 83 97 16
bronchus bodypart 47 9 56 98 15
small bowel bodypart 46 8 66 96 7
anterior abdominal wall bodypart 46 3 83 100 23
pancreatic body bodypart 45 3 42 100 46
cervix bodypart 43 0 - 0 0
stomach bodypart 40 4 50 94 8
urinary bladder bodypart 40 6 63 99 17
lung apex bodypart 36 5 90 98 9
sacrum bodypart 33 3 75 100 46
gallbladder bodypart 33 2 50 100 36
biliary system bodypart 33 3 42 92 33
pelvic bone bodypart 32 3 50 95 22
sternum bodypart 31 3 0 100 26
skin bodypart 31 8 41 89 21
pericardium bodypart 29 4 63 98 7
right thyroid lobe bodypart 28 3 75 100 29
femur bodypart 25 1 50 97 0
cortex bodypart 25 3 50 95 4
trachea bodypart 23 2 25 99 33
ovary bodypart 22 3 92 98 10
subcutaneous fat bodypart 21 10 70 98 20
lesser sac bodypart 15 2 50 99 0
Table 5: Table 4 continued.
Tag Class # Train # Test Det. Tag AUC Tag F1
mass type 4037 412 77 84 21
nodule type 3336 403 76 89 53
enlargement type 996 114 76 76 2
lung nodule type 752 77 86 94 38
lymphadenopathy type 739 79 75 87 17
cyst type 584 83 74 89 35
opacity type 323 37 64 94 24
lung mass type 280 19 93 93 12
metastasis type 267 21 80 82 13
fluid type 258 29 59 91 22
cancer type 250 14 82 72 4
ground-glass opacity type 192 23 54 96 22
thickening type 176 28 46 79 24
consolidation type 165 15 62 99 31
liver mass type 160 23 85 93 28
infiltrate type 119 9 64 92 10
necrosis type 111 11 75 91 26
hemangioma type 84 8 66 95 4
solid pulmonary nodule type 64 7 71 91 4
kidney cyst type 53 3 83 98 6
scar type 48 10 63 80 5
adenoma type 32 4 94 99 14
implant type 30 1 0 66 0
expansile type 29 0 0 0 0
lobular mass type 24 3 92 81 0
simple cyst type 24 3 83 98 7
lipoma type 21 3 0 99 8
hypoattenuation attribute 1681 188 70 90 40
enhancing attribute 902 120 66 81 19
large attribute 558 46 77 83 11
prominent attribute 396 37 64 91 18
calcified attribute 375 45 71 75 3
indistinct attribute 278 18 61 82 11
solid attribute 267 28 56 87 21
hyperattenuation attribute 239 30 72 83 21
heterogeneous attribute 191 19 75 81 20
spiculated attribute 170 26 70 79 23
sclerotic attribute 168 20 54 99 58
soft tissue attenuation attribute 147 12 60 69 5
tiny attribute 126 21 67 93 11
lobular attribute 102 18 85 68 3
conglomerate attribute 98 20 78 84 16
lytic attribute 90 11 20 99 31
cavitary attribute 73 26 87 93 26
subcentimeter attribute 72 11 86 89 7
circumscribed attribute 70 2 50 63 0
diffuse attribute 62 9 67 65 4
exophytic attribute 59 9 72 85 12
oval attribute 41 4 31 68 9
fat-containing attribute 37 10 53 74 6
noncalcified attribute 35 5 95 96 6
nonenhancing attribute 34 8 72 88 4
lucent attribute 33 3 75 8 0
thin attribute 20 7 57 91 13
reticular attribute 17 6 83 93 4
patchy attribute 14 3 67 84 0
Table 6: Table 5 continued.
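
The Tag AUC and Tag F1 columns in Tables 4 to 6 are per-tag binary classification metrics. The sketch below shows how such metrics are commonly computed for a single tag in plain Python. The toy labels and scores, the 0.5 decision threshold, and the function names are assumptions for illustration; the threshold and evaluation code actually used for the tables are not specified in this section.

```python
def f1_score(labels, preds):
    """F1 = 2TP / (2TP + FP + FN) for binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5).

    Returns None when there are no positives or no negatives,
    because the AUC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example for one hypothetical tag: 1 = lesion carries the tag.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
THRESHOLD = 0.5  # assumed decision threshold, for illustration only

preds = [1 if s >= THRESHOLD else 0 for s in scores]
auc_value = auc(labels, scores)
f1_value = f1_score(labels, preds)
print(f"AUC = {auc_value:.4f}, F1 = {f1_value:.4f}")
```

Note that the AUC is undefined when a tag has no positive cases, so the values listed for tags whose “# Test” count is 0 (e.g., cervix and expansile) should not be read as measured performance.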


  • [1] I. Diamant, A. Hoogi, C. F. Beaulieu, M. Safdari, E. Klang, M. Amitai, H. Greenspan, and D. L. Rubin (2016) Improved Patch-Based Automated Liver Lesion Classification by Separate Analysis of the Interior and Boundary Regions. IEEE Journal of Biomedical and Health Informatics 20 (6), pp. 1585–1594. Cited by: §1.
  • [2] E. A. Eisenhauer, P. Therasse, J. Bogaerts, L. H. Schwartz, D. Sargent, R. Ford, J. Dancey, S. Arbuck, S. Gwyther, M. Mooney, L. Rubinstein, L. Shankar, L. Dodd, R. Kaplan, D. Lacombe, and J. Verweij (2009) New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1). European Journal of Cancer 45 (2), pp. 228–247. External Links: NIHMS150003, ISBN 0959-8049, ISSN 09598049 Cited by: §1, §2.2, §3, §5.1.
  • [3] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In ICCV, pp. 2980–2988. Cited by: §1, §2.2, §2, §5.2.2.
  • [4] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten (2017) Densely Connected Convolutional Networks. In CVPR, Cited by: §2.1, §5.2.1.
  • [5] C. P. Langlotz (2006-11) RadLex: a new method for indexing online educational materials. Radiographics 26 (6), pp. 1595–1597. Cited by: §5.2.3.
  • [6] F. Liao, M. Liang, Z. Li, X. Hu, and S. Song (2019) Evaluate the Malignancy of Pulmonary Nodules Using the 3D Deep Leaky Noisy-or Network. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1, §2.1, §2.1, §3.
  • [7] T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §2.1, §5.2.1.
  • [8] F. Milletari, N. Navab, and S. A. Ahmadi (2016) V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In International Conference on 3D Vision, pp. 565–571. Cited by: §5.2.2.
  • [9] D. Ribli, A. Horváth, Z. Unger, P. Pollner, and I. Csabai (2018) Detecting and classifying lesions in mammograms with Deep Learning. Scientific Reports 8 (1). Cited by: §1.
  • [10] B. Sahiner, A. Pezeshk, L. M. Hadjiiski, X. Wang, K. Drukker, K. H. Cha, R. M. Summers, and M. L. Giger (2018-10) Deep learning in medical imaging and radiation therapy. Medical Physics. External Links: ISSN 00942405 Cited by: §1.
  • [11] Y. Tang, A. P. Harrison, M. Bagheri, J. Xiao, and R. M. Summers (2018-06) Semi-Automatic RECIST Labeling on CT Scans with Cascaded Convolutional Neural Networks. In MICCAI, Cited by: §1, Table 1, §3, §3, §3, §5.3.2, Table 2.
  • [12] Y. Tang, S. Oh, J. Xiao, R. M. Summers, and Y. Tang (2019) CT-realistic data augmentation using generative adversarial network for robust lymph node segmentation. In SPIE, pp. 109503V. Cited by: §1.
  • [13] Y. Tang, K. Yan, Y. Tang, J. Liu, J. Xiao, and R. M. Summers (2019) ULDor: A Universal Lesion Detector for CT Scans with Pseudo Masks and Hard Negative Example Mining. In ISBI, Cited by: §1, §1, §2.2, Table 1, §5.1, Table 2, Table 3.
  • [14] B. Wu, Z. Zhou, J. Wang, and Y. Wang (2018) Joint learning for pulmonary nodule segmentation, attributes and malignancy prediction. In ISBI, pp. 1109–1113. Cited by: §1, §1, §2.1, §2.2.
  • [15] K. Yan, M. Bagheri, and R. M. Summers (2018) 3D Context Enhanced Region-Based Convolutional Neural Network for End-to-End Lesion Detection. In MICCAI, pp. 511–519. Cited by: §1, §2.1, Table 1, §3, §5.3.1, Table 2, Table 3.
  • [16] K. Yan, Y. Peng, V. Sandfort, M. Bagheri, Z. Lu, and R. M. Summers (2019) Holistic and Comprehensive Annotation of Clinically Significant Findings on Diverse CT Images: Learning from Radiology Reports and Label Ontology. In CVPR, Cited by: §1, §2.2, Table 1, §3, §3, §3, Figure 4, §5.1, §5.2.3, §5.2.3, §5.2.3, §5.4, §5.4, §5.4, Table 2.
  • [17] K. Yan, X. Wang, L. Lu, and R. M. Summers (2018) DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging 5 (3). External Links: ISSN 2329-4302 Cited by: §1, §3, §4, Figure 4, §5.1, §5.1.