PatchFCN for Intracranial Hemorrhage Detection

06/08/2018 ∙ by Weicheng Kuo, et al.

This paper studies the problem of detecting acute intracranial hemorrhage on head computed tomography (CT) scans. We formulate it as a pixel-wise labeling task over the frames that constitute a single head scan. The standard approach for this task is the fully convolutional network (FCN), which runs on the whole image at both training and test time. We propose a patch-based approach that controls the amount of context available to the FCN, based on the observation that when radiologists interpret CT scans, their judgment depends primarily on local cues and does not require whole-image context. To develop and validate the system, we collected a pixel-wise labeled dataset of 591 scans that covers a wide range of hemorrhage types and real-world imaging conditions. We show that no pretraining from natural images is needed. By aggregating the pixel-wise labeling, our system is able to make region-level, frame-level, and stack-level decisions. Our final system approaches expert radiologist performance, with a high average precision (AP) of 96.5 ± 1.3, while running at 23 ms per frame.




1 Introduction

Traumatic brain injury (TBI) is a major contributor to injury-related deaths. In emergency departments, head computed tomography (CT) scans are routinely performed on patients under evaluation for suspected TBI. Existing work has shown that a computer vision system that rapidly and reliably detects emergency TBI findings, such as acute intracranial bleeding, on head CT scans can significantly reduce the time to diagnosis and potentially reduce death and long-term disability [1, 7]. Deep learning techniques have recently been successful in detecting intracranial hemorrhage, e.g. 3D classification supervised by text reports [1, 7], 2D classification [4], and instance segmentation [2]. However, to our knowledge, no semantic segmentation approach has shown performance competitive with human experts.

Figure 1: Visualization of PatchFCN segmentation. Each pair contains the PatchFCN output (left) and groundtruth labels (right). Results are randomly selected from the positive frames of the test set.

We propose to solve the detection and segmentation problem jointly as a semantic segmentation task. Segmentation offers many advantages over classification, including better interpretability and quantifiable metrics for disease prognosis [2, 4]. Unlike [2], we view hemorrhage as “stuff” (e.g. water) rather than “things” (e.g. car) due to its fluid nature. Since the clinical need is to know both whether a scan (i.e. the whole head) is positive and where the positive pixels are, semantic segmentation is the simplest way to meet both needs.

Among existing pixel-wise labeling techniques, fully convolutional networks (FCNs) [5] are successful and widely adopted both in computer vision [5] and in the medical imaging community [11, 6]. Most computer vision practitioners train their FCNs on whole images, following [5]. In contrast, patch-based FCN training has been successful in applications such as retinopathy [11], MRI [6], and X-ray/CT imaging [8, 10]. Despite this wide adoption, there exists no systematic study of why patches improve FCNs in many cases.

We propose PatchFCN and show that it outperforms the standard FCN in localizing hemorrhages. Since no public dataset is available, one important challenge we face is acquiring pixel-wise labeled data. Unlike approaches that learn from text reports [1, 7], we collect a dataset of scans annotated pixel-wise for the presence of hemorrhage by expert radiologists. Using x less data, PatchFCN significantly outperforms weakly supervised methods [1, 7] on classification tasks. Compared to the state-of-the-art segmentation method [2], our segmentation and classification results are competitive while using x less training data and a simpler system.

We analyze the following factors to better understand the performance gains of PatchFCN: 1) batch diversity, 2) amount of context, and 3) sliding-window inference. We find that PatchFCN outperforms FCN by finding an optimal trade-off between batch diversity and the amount of context. In addition, sliding-window inference helps bridge the train/test-time gap and consistently improves performance. We hope these findings will benefit other segmentation tasks where patch-based training is effective.

2 Method

The goals of hemorrhage detection are to determine: 1) whether a stack contains hemorrhage, and 2) where the hemorrhage is within the stack. In practice, these outputs may be used by radiologists and neurosurgeons to assess the risk level of a patient and triage them to immediate surgical evacuation, monitoring in the intensive care unit (ICU), or routine monitoring on the hospital ward. Inspired by existing works [6, 11, 8, 10, 3], we propose to solve both tasks with PatchFCN as follows (see Fig. 2):

Figure 2: PatchFCN trains on small patches and tests in a sliding-window fashion. The colored boxes show different patch sizes in the context of a hemorrhage.

2.0.1 Patch-based Training:

We train an FCN on small patches randomly cropped from the whole images and centered on foreground. The model learns to predict the binary pixel labels within each patch. For head CT data, the intuition for patch-based training comes from how radiologists make decisions: the morphology of a contrast region is often a crucial cue for deciding whether it represents pathology. Similarly, PatchFCN makes the network base its decision on local image information without relying on excessive context. In addition, small patches allow a larger batch size and hence higher batch diversity, which stabilizes network training. As most convolutional networks have built-in batch normalization, e.g. [9], PatchFCN leverages it by finding a good trade-off between a large minibatch and adequate context for the task.
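The patch-sampling step above can be sketched as follows. This is a minimal illustration, not the authors' released code; the foreground-biased sampling probability and the helper name are our assumptions:

```python
import numpy as np

def sample_patch(image, mask, patch=240, fg_prob=0.5, rng=None):
    """Crop a random patch from a (C, H, W) frame; with probability
    fg_prob, center the crop on a foreground (hemorrhage) pixel so
    that positive regions are well represented in each batch."""
    rng = rng or np.random.default_rng()
    H, W = mask.shape
    fg = np.argwhere(mask > 0)
    if len(fg) > 0 and rng.random() < fg_prob:
        cy, cx = fg[rng.integers(len(fg))]   # center on a labeled pixel
    else:
        cy, cx = rng.integers(H), rng.integers(W)
    # Clamp the crop so it stays inside the image.
    y0 = int(np.clip(cy - patch // 2, 0, H - patch))
    x0 = int(np.clip(cx - patch // 2, 0, W - patch))
    return (image[..., y0:y0 + patch, x0:x0 + patch],
            mask[y0:y0 + patch, x0:x0 + patch])
```

Each training batch is then assembled from patches drawn this way across several images, which is what gives the batch its diversity.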

2.0.2 Patch-based Inference:

At test time, we evaluate the images in a sliding-window fashion, as opposed to the typical fully convolutional inference. Sliding-window inference at test time avoids the domain shift that occurs when training on small patches and evaluating fully convolutionally on the whole image: the padding in convolution layers makes a patch in the context of a whole image different from the patch by itself. Let the input image be of size W × W and the patch size w × w; sliding the window with stride w/k, where k is an adjustable parameter controlling the window overlap, gives (⌈(W − w)k/w⌉ + 1)² windows in total. As multiple predictions are made for each pixel, we simply average their scores. The frame-level score is obtained by averaging the pixel scores within the frame. To get stack-level scores from pixel scores, we first take the p-norm over each frame to obtain a stack-frame score. The stack score is defined as the maximum stack-frame score within the stack. p is treated as a hyper-parameter and tuned on the trainval set.
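The inference and aggregation steps above might look like the following sketch. The stride rule `patch/k` and the function names are our assumptions, and `model` stands for any callable returning per-pixel scores:

```python
import numpy as np

def sliding_window_predict(image, model, patch=240, k=2):
    """Average per-pixel scores over overlapping windows.
    The stride is patch/k, so k controls the window overlap."""
    H, W = image.shape[-2:]
    stride = patch // k
    scores = np.zeros((H, W))
    counts = np.zeros((H, W))
    ys = list(range(0, H - patch + 1, stride))
    xs = list(range(0, W - patch + 1, stride))
    if ys[-1] != H - patch:
        ys.append(H - patch)   # make sure the bottom border is covered
    if xs[-1] != W - patch:
        xs.append(W - patch)   # make sure the right border is covered
    for y in ys:
        for x in xs:
            scores[y:y + patch, x:x + patch] += model(image[..., y:y + patch, x:x + patch])
            counts[y:y + patch, x:x + patch] += 1
    return scores / counts   # each pixel gets the mean of its window scores

def stack_score(pixel_scores, p=2):
    """p-norm over each frame, then max over frames; p is a tuned
    hyper-parameter (its value did not survive in this copy)."""
    frame_scores = [np.linalg.norm(frame.ravel(), ord=p) for frame in pixel_scores]
    return max(frame_scores)
```

The count map handles the uneven overlap near the image border, where fewer or extra windows contribute to a pixel.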

2.0.3 Data Collection:

Our dataset consists of clinical head CT scans performed over 7 years from 2010 to 2017 on 4 different 64-detector-row CT scanners (GE Healthcare, Siemens) at our affiliated hospitals. We use the word “stack” for each patient’s head CT scan, and the word “frame” for each individual slice of the stack. The scans were anonymized by removing patient-related meta-data, skull, scalp, and face. Board-certified senior neuroradiologists who specialize in TBI identified all areas of hemorrhage in our dataset. Our data contains the typical spectrum of technical limitations seen in clinical practice (e.g. motion artifact, “streak” artifact near the skull base or in the presence of metal), and also contains all of the subtypes of acute intracranial hemorrhage, including epidural hematoma, subdural hematoma, subarachnoid hemorrhage, hemorrhagic contusion, and intracerebral hemorrhage (see Fig. 3 for examples). We randomly split the data into a trainval/test set of / stacks for development and internal validation. The hyper-parameters of PatchFCN are tuned within the trainval set.

2.0.4 Implementation Details:

We choose a DRN-38 backbone because it performs competitively among many network designs [9]. For the inputs, we clip the dynamic range of the raw data at and Hounsfield units (HU), and then rescale the intensity to lie within . Image size is 480 × 480. At both training and test time, we use a patch size of 240 unless stated otherwise. We utilize the z-axis context by fusing the adjacent frames with the center frame at the input (3 channels in total). Optimization is done by SGD with momentum, following the setup of [9]. We train the network from scratch, as we do not observe any gains from ImageNet pretraining. We re-weight the positive class loss to balance the dominant negative class loss. The learning rate starts at and decreases by a factor of after and of the complete training iterations. At test time, we choose the window stride to ensure good overlap between adjacent sliding windows. To compute the stack-level score, we select the order of the norm. All parameters were found by cross-validation on the trainval set.
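As a sketch, the input preprocessing described above (HU windowing, intensity rescaling, and 3-channel fusion of adjacent frames) could be implemented as below. The window bounds `lo`/`hi` are placeholders, since the paper's exact clipping values did not survive in this copy:

```python
import numpy as np

def preprocess_stack(hu_stack, lo=0, hi=100):
    """hu_stack: (frames, H, W) array of raw Hounsfield units.
    Clip to the [lo, hi] HU window, rescale to [0, 1], then fuse each
    frame with its two axial neighbors into a 3-channel input."""
    x = np.clip(hu_stack.astype(np.float32), lo, hi)
    x = (x - lo) / (hi - lo)
    prev_f = np.concatenate([x[:1], x[:-1]], axis=0)   # previous frame, edge-replicated
    next_f = np.concatenate([x[1:], x[-1:]], axis=0)   # next frame, edge-replicated
    return np.stack([prev_f, x, next_f], axis=1)       # (frames, 3, H, W)
```

Edge replication at the first and last slice keeps every frame's input at exactly 3 channels.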

3 Experiments

3.1 Stack-level Benchmark with Human Experts

The first-order task of hemorrhage detection is to determine whether a stack contains hemorrhage. We conduct both internal and external validation of PatchFCN at the stack level, as shown in Figure 3. The human expert is a neuroradiologist certified by the American Board of Radiology with 15 years of attending experience. The expert was instructed to examine each scan with the same level of care as a clinical scan and allowed to take as much time as needed. The expert could modify their reads before submitting final answers on the whole data set. The groundtruths are determined by at least one neuroradiologist with more than 10 years of neuroradiology attending experience.

3.1.1 Internal (Retrospective) Validation:

We report the ROC curve of PatchFCN on the test set and compare it with a human expert (a 15-year attending) in a retrospective setting, where the test data was collected before model development. Our single-model AUC of is competitive with the state-of-the-art (single model) [2] and (ensemble of models) [4], while using much less training data. Our human expert has a very low false positive rate at recall, better than that of PatchFCN. Using both trainval and test data, our 4-fold cross-validation AUC is .

3.1.2 External (Prospective) Validation:

We collected a prospective test set of scans after the model was developed. No further hyper-parameter adjustment was allowed, in order to prevent overfitting to the test set. To minimize selection bias, we randomly selected from all head CT scans performed from November to December 2018 using the Radiology Information System (RIS) SQL database of our hospital. The positive rate is , which approximates the positive rates observed in emergency departments of many U.S. hospitals. Our ensemble model () achieves an AUC of , which is competitive with the state-of-the-art [2] and [4]. PatchFCN approaches but does not exceed the human expert. Our best operating point is .

Figure 3: Internal and External Validation. We compare PatchFCN to an expert (neuroradiology attending with 15 years of experience) at stack level on retrospective and prospective test sets. PatchFCN achieves AUCs of and respectively, competitive with state-of-the-art systems that use much more labeled data. PatchFCN approaches but does not exceed the attending neuroradiologist.

3.2 Pixel-level Evaluation

Apart from stack-level evaluation, we evaluate PatchFCN at pixel level because clinicians also want to know the location and volume of the bleeds for disease prognosis. Figure 1 visualizes the outputs of PatchFCN in comparison with the groundtruths. Results are shown on randomly selected positive frames in the retrospective test set.

On the retrospective test set, our model achieves a pixel-wise Dice score, Jaccard index, and average precision of , , and , respectively. In comparison, [2] reports Dice scores of to for the few types of hemorrhage they study. Our groundtruths are annotated pixel-wise by senior neuroradiologists who specialize in TBI and include many subtle findings that could easily be missed by inexperienced radiologists. Using both trainval and test data, our 4-fold cross-validation Dice score is .

Crop Size    80     120    160    240    480
Batch Size   144    64     36     16     4
Epochs       3600   1600   900    400    100
Dice         75.5   75.9   76.2   76.6   74.2
Jaccard      60.7   61.2   61.6   62.0   59.0
Pixel AP     78.5   78.1   78.5   78.5   75.9
Frame AP     87.8   89.3   89.8   89.9   87.8
Table 1: We benchmark PatchFCN at different patch sizes. Patch size 480 is the standard FCN that consumes whole images (baseline). As seen, PatchFCN consistently outperforms the baseline across a wide range of patch sizes on pixel- and frame-level metrics.

3.3 PatchFCN vs. FCN

Table 1 shows that PatchFCN consistently improves over the standard FCN by a healthy margin at both pixel and frame level, for a wide range of patch sizes. We report average precision (AP), Dice score, and Jaccard index at the pixel level with a threshold of 0.5. Note how robust PatchFCN is to patch size: it maintains performance even at a patch size of 80. We tried even smaller sizes and observed a significant performance drop due to difficult optimization. To compare across patch sizes, we choose the batch size so that the number of input pixels per batch is the same, and the number of epochs so that the number of gradient steps is the same. We also ensure that all performances are saturated and that training longer does not improve them further.
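The constant-compute control above reduces to simple arithmetic: scaling the batch size and epoch count by the ratio of pixel counts reproduces the numbers in Table 1. This is a sketch; the function name is ours, and the 480/4/100 baseline is read off the table's whole-image column:

```python
def matched_schedule(crop_size, image_size=480, base_batch=4, base_epochs=100):
    """Scale batch size and epochs so that every patch size sees the same
    number of input pixels per batch and takes the same number of
    gradient steps as the whole-image baseline."""
    scale = (image_size // crop_size) ** 2   # how many crops tile one image
    return base_batch * scale, base_epochs * scale

for crop in (80, 120, 160, 240, 480):
    print(crop, *matched_schedule(crop))   # matches the Crop/Batch/Epoch rows of Table 1
```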

3.4 What Makes PatchFCN Effective?

Given the effectiveness of PatchFCN, we delve deeper to understand what makes patches so effective. We identify a few differences from the standard FCN and study them in controlled experiments. For the following experiments, we define the batch size B = N_I × N_P, where N_I is the number of images per batch and N_P is the number of patches per image; the batch size is defined this way because we sample N_P patches from each of the N_I image samples. PatchFCN has N_I = 16, N_P = 1, and crop size c = 240, whereas the standard FCN has c = 480 (whole images). We perform these analyses on the test split because it is larger and yields more stable performance. In this section, we control the number of input pixels and the number of iterations in the same way as in Section 3.3, unless otherwise stated.

3.4.1 Batch Diversity:

One possible advantage of PatchFCN is that we can fit a larger batch, and thus include more diverse data, within a given GPU memory budget. To study the contribution of batch diversity, we fix the batch size B and decrease the number of images N_I we sample patches from. Since B = N_I × N_P, this means we sample more patches per image. As N_I decreases, we expect batch diversity to decrease as well. The default PatchFCN has N_I = 16 and N_P = 1, which gives the greatest diversity for any given B. By fixing the other hyper-parameters, we can safely say that the only difference here is batch diversity. Note that we control the number of gradient steps to be the same, so we decrease the number of epochs linearly with N_I.

Table 2 shows that decreased batch diversity results in lower pixel- and frame-level performance. The breaking point is at N_I = 2, where performance drops significantly. We speculate that this is due to the use of batch normalization in residual networks [9]. This experiment demonstrates the importance of batch diversity for PatchFCN.

N_I   N_P   B    Crop   Epochs   Dice   Jaccard   Pixel AP   Frame AP
16    1     16   240    400      76.6   62.0      78.5       89.8
8     2     16   240    200      76.4   61.8      78.5       89.7
4     4     16   240    100      74.7   59.6      77.3       87.7
2     8     16   240    50       57.5   40.3      67.6       81.4
Table 2: PatchFCN performance decreases with decreasing batch diversity.

3.4.2 How Much Context Does PatchFCN Need?

A trade-off of using patches is that we restrict the amount of context available to the network during training. Intuitively, one would think that more context is better. However, with a limited amount of data, it is possible that less context serves as an effective regularizer by forcing the prediction to rely on local information. To understand how much we lose or gain by having less context, we compare PatchFCN with different patch sizes while fixing the batch size and the number of steps (the number of input pixels is not the same here).

Crop   N_I   N_P   B    Epochs   Dice   Jaccard   Pixel AP   Frame AP
64     16    1     16   400      66.4   49.7      65.8       74.5
120    16    1     16   400      72.5   56.9      74.7       82.2
240    16    1     16   400      76.6   62.0      78.5       89.9
360    16    1     16   400      73.9   58.6      73.4       85.8
480    16    1     16   400      74.1   58.8      75.6       87.7
Table 3: Context helps PatchFCN from patch size 64 up to 240, but not beyond.

Table 3 shows that the improvement from context plateaus at patch size 240. Compared to patch size 64, 240 is significantly better. However, increasing the patch size beyond 240 does not offer any further gain. We speculate that the improvement comes from the context regularization of patches, which helps in the case of limited data. Overall, controlling context with patches is effective, and it allows the use of a larger and more diverse batch, as in Table 2.

To qualitatively study what cues PatchFCN uses, we back-propagate the gradients from each hemorrhage region to the image space (see Fig. 4). The gradient responses come primarily from pixels that are not confidently predicted, and they correspond to the cues used for hemorrhage prediction. Fig. 4 shows that FCN captures long-range dependencies that can easily overfit to limited data, while PatchFCN focuses on local morphology and may generalize better.

Figure 4: We visualize the gradients of PatchFCN and FCN in image space to see what cues the models rely on. Green speckles are the gradients, and the burgundy regions are the selected ground truths for back-propagation.

3.4.3 Patch-based Sliding Window Inference:

At inference time, the standard FCN is applied to the whole image at once [5]. We hypothesize that this is sub-optimal for PatchFCN, because the model is trained only on patches but must take whole images at test time. That is why the default PatchFCN adopts sliding-window inference: it minimizes the domain shift by letting PatchFCN evaluate patch by patch at test time. Table 4 shows that sliding-window inference consistently improves over fully convolutional inference for all patch sizes. Note that the gap is largest for the smallest crop size of 80 and decreases as the patch size increases.

Crop   Batch   Epochs   Dice          Jaccard       Pixel AP      Frame AP
80     144     3600     69.4 (-6.1)   53.1 (-7.6)   74.9 (-3.6)   85.5 (-2.3)
120    64      1600     75.0 (-0.9)   60.0 (-1.2)   75.6 (-2.5)   88.7 (-0.6)
240    16      400      75.9 (-0.7)   61.2 (-0.8)   76.4 (-2.1)   89.8 (-0.1)
Table 4: Sliding-window inference consistently outperforms fully convolutional inference for all patch sizes. The main numbers are for fully convolutional inference; the parenthesized values show its gap relative to sliding-window inference.

4 Conclusion

We propose PatchFCN – a simple yet effective framework for intracranial hemorrhage detection. PatchFCN approaches the performance of an expert neuroradiologist and performs competitively with the state of the art at the stack level. In addition, it localizes many subtypes of hemorrhage well and has strong pixel-level performance. Our analyses show that PatchFCN outperforms FCN by finding a good trade-off between batch diversity and the amount of context. This work demonstrates the capability of PatchFCN for intracranial hemorrhage detection, and potentially for other medical segmentation tasks.


  • [1] M. R. Arbabshirani, B. K. Fornwalt, G. J. Mongelluzzo, J. D. Suever, B. D. Geise, A. A. Patel, and G. J. Moore (2018) Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration. npj Digital Medicine 1 (1), pp. 9. Cited by: §1, §1.
  • [2] P. Chang, E. Kuoy, J. Grinband, B. Weinberg, M. Thompson, R. Homo, J. Chen, H. Abcede, M. Shafie, L. Sugrue, et al. (2018) Hybrid 3d/2d convolutional neural network for hemorrhage evaluation on head ct. American Journal of Neuroradiology 39 (9), pp. 1609–1616. Cited by: §1, §1, §1, §3.1.1, §3.1.2, §3.2.
  • [3] S. C. Karayumak, M. Kubicki, and Y. Rathi (2018) Harmonizing diffusion mri data across magnetic field strengths. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 116–124. Cited by: §2.
  • [4] H. Lee, S. Yune, M. Mansouri, M. Kim, S. H. Tajmir, C. E. Guerrier, S. A. Ebert, S. R. Pomerantz, J. M. Romero, S. Kamalian, et al. (2019) An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets. Nature Biomedical Engineering 3 (3), pp. 173. Cited by: §1, §1, §3.1.1, §3.1.2.
  • [5] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1, §3.4.3.
  • [6] Y. Qin, K. Kamnitsas, S. Ancha, J. Nanavati, G. Cottrell, A. Criminisi, and A. Nori (2018) Autofocus layer for semantic segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 603–611. Cited by: §1, §2.
  • [7] J. J. Titano, M. Badgeley, J. Schefflein, M. Pain, A. Su, M. Cai, N. Swinburne, J. Zech, J. Kim, J. Bederson, et al. (2018) Automated deep-neural-network surveillance of cranial images for acute neurologic events. Nat Med 24 (9), pp. 1337–1341. Cited by: §1, §1.
  • [8] H. Wang, M. Moradi, Y. Gur, P. Prasanna, and T. Syeda-Mahmood (2017) A multi-atlas approach to region of interest detection for medical image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 168–176. Cited by: §1, §2.
  • [9] F. Yu, V. Koltun, and T. Funkhouser (2017) Dilated residual networks. In CVPR, Cited by: §2.0.1, §2.0.4, §3.4.1.
  • [10] Y. Zhang and H. Yu (2018) Convolutional neural network based metal artifact reduction in x-ray computed tomography. IEEE transactions on medical imaging 37 (6), pp. 1370–1381. Cited by: §1, §2.
  • [11] Y. Zhang and A. C. Chung (2018) Deep supervision with additional labels for retinal vessel segmentation task. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 83–91. Cited by: §1, §2.