A pulmonary embolism (PE) manifests as a blockage in the pulmonary arteries, caused by blood clots, air bubbles, or fat deposits, and typically occurs during surgery, pregnancy, or cancer. PE is one of the leading causes of cardiac-related mortality, and early diagnosis and treatment are expected to have a significant impact on controlling the mortality rate. Unfortunately, a large number of PE cases are missed annually in the US alone, owing to the innate complexity of its occurrence combined with non-specific symptoms Liang and Bi (2007). Computed tomographic pulmonary angiography (CTPA) is the primary diagnostic exam for detecting arterial diseases, given the high spatial resolution of CT scanners. Each CTPA study is a 3D image containing hundreds of slices, some of which show evidence of PE as irregularly shaped dark pixel regions partially surrounded by a border of lighter pixels.
In practice, each occurrence of PE can belong to one of the following broad categories: peripheral, segmental, subsegmental, lobar, or saddle type, which is typically determined based on its arterial location. In particular, subsegmental PEs are considered the hardest to detect, since they often occur subtly in subsegmental branches of the pulmonary artery. Consequently, radiologists are required to painstakingly examine every slice in a CT volume to detect PEs, making this process highly cumbersome and time-consuming. Moreover, unlike other common diseases visualized in chest CTs, such as lung nodules, which usually appear spherical, or emphysema, which can be observed across the entire lung, PEs appear much more asymmetrically, only in isolated regions of the pulmonary vasculature.
Given the aforementioned challenges in detecting PEs, computer-aided diagnostic tools Liang and Bi (2007)
have become prominent. More specifically, data-driven approaches based on deep learning have produced promising results for automatic PE detection Tajbakhsh et al. (2015). The most successful solutions in medical image analysis often comprise multi-stage pipelines tailored for a specific organ or disease. Without loss of generality, such a pipeline includes a segmentation stage for candidate generation, i.e., identifying semantically meaningful regions that are likely to correspond to the disease occurrence, and a classification stage for the actual detection. A crucial bottleneck of this approach is the need for large annotated datasets. Acquiring expert annotations for 3D volumes, where every instance of disease occurrence is annotated (often referred to as dense annotations), is time-consuming and error-prone. Furthermore, there is a dearth of standard benchmark datasets for PE detection using CTPA, and most of the research in this space is conducted on custom datasets or small-scale challenge datasets de Radiodiagnóstico and the M+Visión consortium (2013). Recently, Huang et al. showed that a deep 3D convolutional network pre-trained on a large corpus of video clips could be effectively fine-tuned for learning a PE classifier using dense annotations.
In this paper, we adopt an alternate approach of building an accurate PE detection system using only weakly (or sparsely) labeled CT volumes. More specifically, we develop a two-stage detection pipeline (shown in Figure 1) designed exclusively using 2D CNNs, wherein the candidate generation stage utilizes a novel context-augmented U-Net and the classifier stage employs a simple 2D Conv-LSTM model coupled with multiple instance learning Zhu et al. (2017); Ilse et al. (2018); Braman et al. (2018), in contrast to the much deeper 3D CNN in Huang et al. (2019). We find that, even with a significantly smaller number of parameters and with no pre-training, our approach produces state-of-the-art detection results on a challenging, large-scale real-world dataset. Further, given the large disparity across image acquisition systems and protocols, we study its generalization across hospitals and datasets, and demonstrate the proposed approach to be highly robust.
Our contributions can thus be summarized as follows:
Our approach does not require expensive dense annotations and operates exclusively on sparse annotations generated for every 10 mm of positive CT scans.
We introduce a context-augmentation strategy that enables the 2D U-Net in Stage 1 to produce high-quality masks.
By modeling each 3D CT volume as a bag of instances, i.e., features for each 2D slice obtained using a Conv-LSTM, we propose to employ multiple instance learning, based on feature aggregation, to detect PE.
For the first time, we evaluate our approach using a large-scale, multi-hospital chest CT dataset that well represents real-world scenarios through the inclusion of complex PE types and diverse imaging protocols.
We present insights from an elaborate empirical study, while discussing the impact of different architectural design choices on the generalization performance.
We show that the proposed approach achieves state-of-the-art detection performance, with high AUC scores on both the validation and test sets.
2 Dataset Description
We collected 1,874 PE-positive and 718 PE-negative anonymized, contrast-enhanced chest CT studies and their corresponding radiology reports from vRad®. Note that, due to the specific anonymization protocol used in our data curation process, we are unable to determine if two studies belong to the same patient; hence, we cannot obtain an accurate estimate of the total number of unique patients. Our dataset is curated to represent variations across multiple imaging centers and different contrast-enhanced imaging protocols, namely PE and CTA. In comparison, currently reported studies in the literature, including the state-of-the-art PENet Huang et al. (2019), focus exclusively on the PE protocol and study generalization to only two hospitals. Consequently, the problem setup we consider is significantly more challenging; further, we do not assume access to dense annotations. As part of the data preparation, we used the NLP pipeline described in Guo et al. (2017) to detect the presence or absence of PE from each patient's radiology report. Table 1 shows the sample sizes used for the train, validation and test phases of our algorithm. Further, the number of slices in each volume can vary significantly, as illustrated in Figure 2.
2.1 Weak Annotations
Clinical experts provided annotations for positive PE scans by drawing a contour around every embolism occurrence on slices approximately 10 mm apart. This process naturally leaves multiple unannotated slices between every pair of annotated slices, depending on the slice spacing. We refer to such CT studies as weakly or sparsely annotated. While each study was annotated by only one expert, multiple annotators were involved in the process overall. Of the 1,874 positive studies that were processed, a subset was discarded for reasons including the lack of definitive evidence for the presence of PE (discrepancy between the annotator and the reporting radiologist), insufficient contrast, and metal or motion artifacts.
3 Proposed Methodology
(Table 2: Qualitative comparison of ground-truth masks and CA U-Net predictions for subsegmental, segmental, and lobar PE.)
3.1 Approach Overview
We develop a two-stage approach for weakly supervised PE detection from CT images. While the first stage processes the raw CT volume to produce a mask that identifies candidate regions likely to correspond to emboli, the latter stage operates on the masked volume from Stage 1 to perform the actual detection. In contrast to existing solutions, our approach relies exclusively on 2D convolutions and does not require dense annotations. As illustrated in Figure 1, Stage 1 is implemented using a novel context-augmented 2D U-Net, and for Stage 2, we adopt a multiple instance learning (MIL) formulation, wherein each 3D volume X is viewed as a bag of instances {x_1, ..., x_T} defined using the individual 2D slices, where T denotes the total number of slices in X. Broadly, existing MIL methods focus on inferring appropriate aggregation functions either on (i) the instance-level predictions, to produce the bag-level prediction Zhu et al. (2017); Braman et al. (2018), or (ii) the instance-level latent features, to construct a bag-level feature that can subsequently be used to obtain the prediction Ilse et al. (2018). We adopt the latter approach, where the instance features are obtained using a 2D Conv-LSTM model and the feature aggregation is carried out using different functions, including mean, max, and learnable attention modules.
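To make the distinction between the two MIL styles concrete, they can be sketched as follows. This is a minimal NumPy illustration; the function names, the max-aggregation choice in style (i), and the toy classifier are our assumptions, not the paper's.

```python
import numpy as np

def bag_from_predictions(inst_probs):
    """MIL style (i): aggregate instance-level predictions p(y | x_t)
    directly into a bag-level prediction (max-aggregation shown here)."""
    return float(np.max(inst_probs))

def bag_from_features(inst_feats, classify):
    """MIL style (ii), the one adopted in this work: aggregate instance-level
    features into a bag-level feature z, then classify z."""
    z = inst_feats.mean(axis=0)          # (d,) bag-level feature
    return float(classify(z))

# Toy example: 4 slices, one with strong PE evidence.
probs = np.array([0.1, 0.05, 0.9, 0.2])
print(bag_from_predictions(probs))       # 0.9

feats = np.ones((4, 8))                  # 4 instances, 8-d features
print(bag_from_features(feats, lambda z: 1.0 / (1.0 + np.exp(-z.sum()))))
```

Style (ii) lets the classifier see a single aggregated representation, which pairs naturally with the Conv-LSTM features described below.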
3.2 Stage 1: Candidate Mask Generation
The role of Stage 1 is to segment an image and identify PE candidates, which are localized regions with semantics indicative of the disease. As an initial preprocessing step, each input CT scan is resampled to a volume with a fixed slice spacing. The architecture for the mask generator is a standard 2D U-Net Ronneberger et al. (2015), an encoder-decoder style network comprised of a contracting path that downsamples the input image while doubling the number of channels, followed by an expansive path that upsamples the image. Though using a 2D U-Net significantly simplifies the computation, processing each 2D slice independently fails to leverage crucial context information in the neighboring slices. In order to circumvent this, we extract slabs of neighboring slices from either side of each 2D slice to form a stack of slices. We treat the raw intensities of the stacked slices as the channel dimension, thus producing slabs whose size corresponds to the number of channels, height, and width. We refer to this architecture as the context-augmented U-Net (CA U-Net). We observed in our experiments that this simple augmentation strategy consistently produced high-quality masks (see example in Table 2).
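The slab construction can be sketched as follows. This is a minimal NumPy illustration; the context size `k`, edge handling by index clamping, and the function name are our assumptions, since the exact values are not specified here.

```python
import numpy as np

def extract_slabs(volume, k=1):
    """Build context-augmented slabs: for every axial slice, stack the k
    neighboring slices from either side as extra input channels.

    volume: (num_slices, H, W) array; returns (num_slices, 2k+1, H, W).
    Edge slices are handled by clamping indices to the volume boundary.
    """
    num_slices = volume.shape[0]
    slabs = []
    for i in range(num_slices):
        idx = np.clip(np.arange(i - k, i + k + 1), 0, num_slices - 1)
        slabs.append(volume[idx])        # (2k+1, H, W) channel stack
    return np.stack(slabs)

# Example: a toy 10-slice volume of 64x64 slices with k=1 context.
vol = np.random.rand(10, 64, 64).astype(np.float32)
slabs = extract_slabs(vol, k=1)
print(slabs.shape)  # (10, 3, 64, 64)
```

Each slab is fed to the U-Net as a multi-channel 2D input, so the network sees neighboring-slice context without any 3D convolutions.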
Each downblock in our U-Net architecture contains 2D convolution layers followed by a downsampling operation, while each upblock upsamples and then concatenates features from the same depth of the network, followed by a convolutional layer coupled with batch normalization and ReLU activation. The depth of the network was fixed across experiments. Upon training, the network produces output probabilities for each pixel in the middle slice of the slab, indicating the likelihood of being a PE candidate. The training objective was to achieve a high dice coefficient, defined as Dice(P, G) = 2|P ∩ G| / (|P| + |G|), a metric which describes the pixel-wise similarity between the prediction mask (P) and the ground-truth annotation mask (G), and has a range of [0, 1]. In practice, we minimize the dice loss, defined as 1 - Dice(P, G).
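A minimal NumPy sketch of the dice coefficient and the corresponding loss; the smoothing constant `eps` is our addition for numerical stability and is not mentioned in the text.

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|): pixel-wise overlap between the
    predicted mask probabilities and the binary ground-truth mask."""
    intersection = np.sum(pred * gt)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(gt) + eps)

def dice_loss(pred, gt):
    """Training objective for the mask generator: 1 - Dice."""
    return 1.0 - dice_coefficient(pred, gt)

mask = np.zeros((8, 8)); mask[2:6, 2:6] = 1.0   # toy ground-truth mask
perfect = mask.copy()
print(round(dice_loss(perfect, mask), 4))  # 0.0
```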
3.3 Stage 2: PE Detection
As described earlier, to perform the actual PE detection, we treat each CT volume as a bag of multiple 2D slices (instances). Hence, the goal of Stage 2 is to assign a prediction label to a bag, indicating the presence or absence of PE. Multiple instance learning is a well-studied problem, where each instance is processed independently and their features (or predictions) are aggregated to obtain bag-level predictions. However, we argue that processing each slice of a 3D volume independently can produce noisy predictions, since the local context is ignored. Consequently, we utilize a Conv-LSTM Xingjian et al. (2015) model to produce instance features that automatically incorporate context from neighboring slices, and perform feature aggregation as in any MIL system.
As illustrated in Figure 3, the PE detector contains an instance-level feature extractor followed by an MIL module. The feature extractor is a 2D Conv-LSTM architecture that effectively captures spatio-temporal correlations in a CT volume and produces meaningful instance-level features. All input-to-state and state-to-state transitions use convolution operations with a fixed number of filters, kernel size, and padding. The input to the detector is the masked CT volume, obtained as follows: first, the prediction masks from Stage 1 are multiplied with the raw CT volume to create a masked CT volume. We then reduce the axial (z) dimension of the masked volume for computational efficiency. To this end, we use a lung segmentation algorithm to detect the boundary axial slices that span the lung region, extract a fixed number of middle slices from within this range, crop to reduce the image height and width, and finally resize the slices, thus producing the detector input. Each instance is then transformed by the Conv-LSTM model using the standard Conv-LSTM update of Xingjian et al. (2015).
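For reference, the standard Conv-LSTM update of Xingjian et al. (2015), applied here to each instance, can be written as (with * denoting convolution and ∘ the Hadamard product):

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right)\\
f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\!\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right)\\
o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right)\\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
```

Here X_t is the t-th input slice and H_t, C_t are the hidden and cell states, so each instance feature carries context accumulated from the preceding slices.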
The features are then average-pooled to produce dense feature vectors for all slices in the volume. In order to perform feature aggregation for MIL, we explored the use of max, mean, and learnable self-attention functions. The self-attention function used is similar to the one described in Song et al. (2018), and was implemented with multiple attention heads.
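The three aggregation choices can be sketched as follows. This is a simplified single-head NumPy illustration; the actual model uses multi-head self-attention similar to Song et al. (2018), and the parameter names `w`, `V` and their shapes are illustrative stand-ins for the learnable attention parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aggregate(features, method="max", w=None, V=None):
    """Aggregate instance-level features (T, d) into one bag-level feature (d,).

    'attention' follows the general attention-MIL recipe: unnormalized scores
    from a small learnable module, softmax over instances, weighted sum.
    """
    if method == "max":
        return features.max(axis=0)
    if method == "mean":
        return features.mean(axis=0)
    if method == "attention":
        scores = w @ np.tanh(V @ features.T)   # (T,) unnormalized scores
        alpha = softmax(scores)                # attention coefficients
        return alpha @ features                # weighted sum over instances
    raise ValueError(method)

rng = np.random.default_rng(0)
feats = rng.normal(size=(24, 16))              # 24 slices, 16-d features
bag = aggregate(feats, "attention",
                w=rng.normal(size=8), V=rng.normal(size=(8, 16)))
print(bag.shape)  # (16,)
```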
The attention coefficients and the parameters of the attention module are learned jointly with the rest of the network. The aggregated feature from the multi-head self-attention is further projected using a linear layer to obtain the final bag-level features. For training the detector model, we explored both the standard binary cross-entropy (BCE) loss and the focal loss Lin et al. (2017), defined as FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t denotes the predicted probability for the true class.
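A minimal sketch of the focal loss for a single prediction; the default values alpha = 0.25 and gamma = 2 come from Lin et al. (2017), and the paper's exact settings are not stated here.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted probability of the positive class; y: label in {0, 1}.
    With gamma = 0 and alpha = 1 this reduces to binary cross-entropy."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# A confidently-correct positive prediction is down-weighted far more than
# a misclassified one, focusing training on hard examples.
print(focal_loss(0.95, 1) < focal_loss(0.3, 1))  # True
```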
At inference time, we apply the same preprocessing steps of cropping and resizing to the required spatial resolution, but make predictions for moving windows of slices (with overlap between consecutive windows) and use the maximum detection probability as the final prediction for the test CT scan.
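The sliding-window inference can be sketched as follows; the window and overlap sizes are illustrative placeholders, and the toy predictor stands in for the Stage 2 model.

```python
import numpy as np

def sliding_window_predict(volume_slices, predict_fn, window=24, overlap=12):
    """Score overlapping windows of slices with predict_fn and keep the
    maximum probability as the study-level PE prediction."""
    step = window - overlap
    n = volume_slices.shape[0]
    probs = []
    start = 0
    while start < n:
        chunk = volume_slices[start:start + window]
        probs.append(predict_fn(chunk))
        if start + window >= n:     # last window reached the end
            break
        start += step
    return max(probs)

# Toy predictor: fraction of "positive" slices in each window.
toy = np.zeros(100); toy[40:50] = 1.0
print(sliding_window_predict(toy, lambda c: float(c.mean())))
```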
4 Empirical Results
In this section, we present a detailed empirical analysis conducted to evaluate the performance of the proposed pipeline and study its behavior with respect to different architectural choices. In particular, we share insights from ablation studies focused on the effect of the number of instances used in Stage 2.
4.1 Experiment Setup
All our experiments are based on modifying the PE detector in Stage 2, while keeping the Stage 1 model fixed. Details on the sample sizes used in our empirical study are provided in Table 1. Typically, for successful adoption of detection algorithms in clinical practice, they are expected to have a high recall rate on abnormal cases (also referred to as sensitivity). However, in order to obtain a well-rounded evaluation of performance, we report the following metrics: accuracy (Acc), sensitivity or recall (Rec), and precision (Prec).
Acc = (TP + TN) / (TP + TN + FP + FN), Rec = TP / (TP + FN), and Prec = TP / (TP + FP), where TP, FP, FN, and TN correspond to the number of true positives, false positives, false negatives and true negatives respectively. Further, to summarize overall performance, we use the F1-score, measured as F1 = 2 · Prec · Rec / (Prec + Rec), and the area under the receiver operating characteristic curve (AUROC).
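These metrics can be computed directly from the confusion-matrix counts; the counts in the example are illustrative.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute the reported metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    rec = tp / (tp + fn)                 # sensitivity / recall
    prec = tp / (tp + fp)
    f1 = 2 * prec * rec / (prec + rec)   # harmonic mean of prec and rec
    return acc, rec, prec, f1

print(detection_metrics(tp=80, fp=10, fn=20, tn=90))
```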
Training: We trained the Stage 2 model using the Adam optimizer with weight decay, starting from an initial learning rate that is subsequently reduced when the validation loss plateaus. Other hyperparameters include the batch size and the number of instances, which is held fixed unless specified otherwise. All implementations were carried out in PyTorch, and we performed multi-GPU training using NVIDIA GTX GPUs.
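The plateau-based learning-rate schedule can be sketched in plain Python; the halving factor and patience values are illustrative, and in practice this corresponds to a scheduler such as PyTorch's ReduceLROnPlateau.

```python
class ReduceOnPlateau:
    """Minimal sketch of the schedule described above: halve the learning
    rate when the validation loss stops improving for `patience`
    consecutive epochs."""
    def __init__(self, lr, factor=0.5, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor   # reduce on plateau
                self.bad_epochs = 0
        return self.lr

sched = ReduceOnPlateau(lr=1e-3)
for loss in [0.9, 0.8, 0.8, 0.8, 0.8]:  # loss plateaus after epoch 2
    lr = sched.step(loss)
print(lr)  # halved once after 3 non-improving epochs
```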
Table 3: Validation performance comparison of the PE detector using different combinations of feature extractor, aggregation strategy, and loss function. Here, CL = Conv-LSTM, C = Conv. with no LSTM, SA = Self-attention, MSA = Multi-head Self-Attention, B = BCE loss, F = Focal loss, Max = max-pooling aggregation, and Mean = average-pooling aggregation.
4.2 Ablation Studies
In this section, we provide details on the various ablation studies carried out to understand the effect of each architectural component towards the validation performance of the PE detector.
Study 1 - Effect of the number of instances: Given the limited GPU memory and the large size of CT volumes, we varied the number of instances selected from the masked volume to invoke Stage 2 and studied its effect on performance. As expected, we found that increasing the number of instances improved the classifier performance, as shown in Figure 5.
Study 2 - Feature Extraction and Aggregation: We studied the effect of using an LSTM for feature extraction by training a model with a Conv-LSTM (CL) layer + Self-Attention (SA) and compared it to using only Conv. (C) + Self-Attention (SA). As expected, the Conv-LSTM model extracts more representative features from the slices, compared to treating the slices as independent, as seen in Table 3. We carried out a similar empirical analysis on the choice of feature aggregation strategy. Surprisingly, max pooling achieved the best performance, outperforming even the self-attention module with learnable parameters. This is likely because the LSTM already captures dependencies between instances in the bag, obviating a dedicated attention module.
Study 3 - Loss Functions: We also observed that using the Focal (F) loss significantly boosts the detection performance, as opposed to the conventional Binary Cross-Entropy (B) loss, by countering the inherent class imbalance in the dataset.
4.3 Test Performance - Variations across PE Types
Our dataset contains several kinds of PE with varying levels of severity, the distribution of which is shown in Figure 5(b). We report the performance of our pipeline on low-severity types, such as subsegmental and segmental PE, as well as high-severity types, namely saddle and pulmonary artery PE, as shown in Figure 7(a). As expected, our pipeline picks up evidence for high-severity PE more easily, achieving a higher AUC score than on the low-severity PEs, which are harder to find. When compared to the PENet model Huang et al. (2019), our approach achieves improved test accuracies on a dataset characterized by a larger amount of variability, while using a significantly reduced number of parameters.
4.4 Test Performance - Variations across CT Convolution Kernels
In addition, our dataset is comprised of CT images reconstructed using different convolution kernels, whose choice typically controls the image resolution and noise levels. Figure 6 shows the distribution of kernels in our dataset; though most cases use the 'Standard' kernel, the dataset includes volumes reconstructed with a wide variety of other kernels. From Figure 7(b), we find that our pipeline is robust to variations in kernels, achieving consistently high AUC across all cases.
5 Relation to Existing Work
In medical imaging applications, commonly deployed disease detection algorithms often involve multi-stage pipelines comprising both segmentation and classification models Ardila et al. (2019). An early work on PE detection used custom feature extraction based on hierarchical anatomical segmentation of structures such as vessels, the pulmonary artery, and the aorta Bouma et al. (2009). Though it appears natural to directly build a classifier model on the 3D volumes, in practice, algorithms that first identify semantically meaningful candidate regions, and subsequently extract discriminative features from those regions to perform detection, are found to be more effective. These methods are inspired by the success of two-stage methods in object detection, for example region-based CNNs (R-CNN) and their variants Girshick (2015). However, it is important to note that adapting those techniques to problems in medical imaging has proven to be non-trivial, mainly for two reasons. First, these solutions require large datasets with ground truth in the form of bounding boxes that characterize regions of interest (ROIs), or dense annotations, which are usually harder to obtain in the clinical domain. Second, the models need to be capable of handling the heavy imbalance between the number of positive ROIs and the far more prevalent negative ROIs. Consequently, weakly supervised approaches have gained research interest. Methods that leverage information ranging from single-pixel labels Anirudh et al. (2016) to approximate segmentation labels derived from class activation maps have been proposed Li et al. (2018); Zhou et al. (2018). However, in the context of PE detection, most existing methods have relied exclusively on supervised learning with dense annotations, and state-of-the-art solutions such as PENet Huang et al. (2019) utilize transfer learning from pre-trained models for effective detection.
References

- Anirudh et al. (2016). Lung nodule detection using 3D convolutional neural networks trained on weakly labeled data. In Medical Imaging 2016: Computer-Aided Diagnosis, Vol. 9785, pp. 978532. Cited by: §5.
- Ardila et al. (2019). End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine 25 (6), pp. 954. Cited by: §5.
- Bouma et al. (2009). Automatic detection of pulmonary embolism in CTA images. IEEE Transactions on Medical Imaging 28 (8), pp. 1223–1230. Cited by: §5.
- Braman et al. (2018). Disease detection in weakly annotated volumetric medical images using a convolutional LSTM network. arXiv preprint arXiv:1812.01087. Cited by: §1, §3.1.
- de Radiodiagnóstico and the M+Visión consortium (2013). Cited by: §1.
- Girshick (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448. Cited by: §5.
- Guo et al. (2017). Efficient clinical concept extraction in electronic medical records. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2.
- Huang et al. (2019). PENet. Cited by: §1, §2, §4.3, §5.
- Ilse et al. (2018). Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712. Cited by: §1, §3.1.
- Li et al. (2018). Tell me where to look: guided attention inference network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §5.
- Liang and Bi (2007). Computer aided detection of pulmonary embolism with tobogganing and multiple instance classification in CT pulmonary angiography. In Biennial International Conference on Information Processing in Medical Imaging, pp. 630–641. Cited by: §1.
- Lin et al. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988. Cited by: §3.3.
- Ronneberger et al. (2015). U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: 1st item, §3.2.
- Song et al. (2018). Attend and diagnose: clinical time series analysis using attention models. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §3.3.
- Tajbakhsh et al. (2015). Computer-aided pulmonary embolism detection using a novel vessel-aligned multi-planar image representation and convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 62–69. Cited by: §1.
- Tajbakhsh et al. (2016). Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Transactions on Medical Imaging 35 (5), pp. 1299–1312. Cited by: §1.
- Xingjian et al. (2015). Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pp. 802–810. Cited by: 1st item, §3.3.
- Zhou et al. (2018). Weakly supervised instance segmentation using class peak response. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3791–3800. Cited by: §5.
- Zhu et al. (2017). Deep multi-instance networks with sparse label assignment for whole mammogram classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 603–611. Cited by: §1, §3.1.