CT Data Curation for Liver Patients: Phase Recognition in Dynamic Contrast-Enhanced CT

09/05/2019 ∙ by Bo Zhou, et al.

As the demand for more descriptive machine learning models grows within medical imaging, bottlenecks due to data paucity will worsen. Thus, collecting enough large-scale data will require automated tools to harvest data/label pairs from messy and real-world datasets, such as hospital PACS. This is the focus of our work, where we present a principled data curation tool to extract multi-phase CT liver studies and identify each scan's phase from a real-world and heterogeneous hospital PACS dataset. Emulating a typical deployment scenario, we first obtain a set of noisy labels from our institutional partners that are text mined using simple rules from DICOM tags. We train a deep learning system, using a customized and streamlined 3D SE architecture, to identify non-contrast, arterial, venous, and delay phase dynamic CT liver scans, filtering out anything else, including other types of liver contrast studies. To exploit as much training data as possible, we also introduce an aggregated cross entropy loss that can learn from scans only identified as "contrast". Extensive experiments on a dataset of 43K scans of 7680 patient imaging studies demonstrate that our 3DSE architecture, armed with our aggregated loss, can achieve a mean F1 of 0.977 and can correctly harvest up to 92.7% of studies, outperforming a standard-loss approach as well as other, and more complex, model architectures.





1 Introduction

Over the last decade, deep learning techniques have seen success in automatically interpreting biomedical and diagnostic imaging data [Litjens_2018, zhou2018generation]. However, robust performance often requires training on large-scale data. Unlike computer vision datasets, which can rely on crowd-sourcing [deng2009imagenet], the collection of large-scale medical imaging datasets must typically involve physician labor. Thus, there exists a tension between modeling power and data requirements that only promises to increase [Kohli_2017]. An enticing prospect is mining physician expertise by collecting retrospective data from PACS, but the current generation of PACS does not properly address the curation of large-scale data for machine learning. In PACS, DICOM tags containing scan descriptions are typically hand-inputted, non-standardized, and often incomplete, which leads to the need for extensive data curation [harvey2019standardised]. These limitations frequently produce high mislabeling rates, e.g., the rate reported by Gueld et al. [Gueld_2017], meaning that simply selecting the scans of interest (SOI) from a large set of studies can be prohibitively laborious. This has spurred efforts to automatically text mine image/label pairs from PACS [yan2018deeplesion, zhou2019progressively, Irvin_2019], but these efforts rely on complicated and customized natural language processing (NLP) technology to extract labels. Apart from the barriers put forth by this complexity, these solutions address contexts where it is possible to extract the information of interest from accompanying text. This is not always possible, as NLP parsers [Peng_2018, Irvin_2019] cannot always straightforwardly correct errors in the original reports or fill in missing information. As such, collecting large-scale data will also require developing automated, but robust, tools that go beyond mining from DICOM tags and/or reports.

This is the topic of our work, where we articulate a robust approach to large-scale data curation based on visual information. In our case, we focus on a hospital PACS dataset we collected that consists of 43K CT scans from imaging studies of unique patients with liver lesions, along with pathological diagnoses. Its makeup is highly heterogeneous, comprising studies of multiple organs, protocols, and reconstruction types. Very simple and accessible text matching rules applied to the DICOM tags can accurately extract scan descriptions; however, omissions and errors in the text mean these labels are noisy and unreliable. Without loss of generality, we focus on extracting a large-scale and well curated dataset of dynamic liver CT studies from our PACS data. Dynamic CT is the most common protocol to categorize and assess liver lesions [burrowes2017contrast], and we expect a large-scale dataset to prove highly valuable for the development of computer-aided diagnosis systems, provided it is well curated. Thus, the goal is to use the noisy labels to train a visual recognition system that can much more robustly identify dynamic liver CT studies, extract the corresponding axial-reconstructed scans, and identify the phase of each as being non-contrast (NC), arterial (A), venous (V), or delay (D). Fig. 1 shows examples of each phase and the discriminating features of each.

Figure 1: NC, A, V, and D phases are the SOI in dynamic CT. Radiologists use contrast information in several organs to determine the phase, such as contrast in the heart/aorta (red arrows), portal veins (green arrows), and kidneys (yellow arrows).

Unlike prior work, we focus on extracting multi-phase volumetric SOI of a certain type, rather than on extracting disease tags or labels. This places a high expectation on performance, i.e., very high F1 scores. To tackle this problem, we develop a principled phase recognition system whose contributions are threefold. First, we collect the aforementioned large-scale dataset from a hospital PACS, which includes more than 43K scans. Second, we introduce a customized phase-recognition deep-learning model, comprising a streamlined version of C3D [tran2015learning] with squeeze-and-excitation (SE) layers. We show that this simple, yet effective, model can outperform much more complicated models. Third, we address a common issue facing data curation systems, where many text-mined labels are too general. In our case, these are labels that indicate only "contrast" rather than the more specific NC, A, V, or D SOI. So that we can still use these images for training, along with their weak supervisory signals, we design an aggregated cross entropy (ACE) loss that incorporates the hierarchical relationship within annotations. Our experimental results demonstrate that our 3DSE model, in combination with our ACE loss, can achieve significantly better phase recognition performance than the text-mined method and other deep-learning based approaches. To the best of our knowledge, this is the first work investigating visual-information based data curation methods in PACS, and we expect that our data curation system would also prove a useful curation approach in domains other than liver dynamic CT.

2 Methods

2.1 Dataset

Our goal is to reliably curate as large as possible a dataset of liver dynamic CT scans, with minimal labor. To do this, we first extracted a dataset of CT studies from the PACS of Anonymized, corresponding to patients who had pathological diagnoses of liver lesions, with the hope that such a dataset would be of great interest for later downstream analysis. This resulted in 7680 studies of unique patients. The number of scans per study varies widely, and there are one to three studies per patient. The resulting dataset is highly heterogeneous, containing several types of reconstructions, projections, anatomical regions, and contrast protocols that we are not interested in, e.g., computed tomography arterial portography. Studies containing dynamic CT scans may have anywhere from one to all of the NC, A, V, and D contrast phase SOI. Our aim is to identify and extract the axial-reconstructed versions of these scans from each study, should they exist. As such, this task exemplifies many of the general demands and challenges of data curation across medical domains.

With the dataset collected, we next applied a set of simple text matching rules to the DICOM tags to noisily label each scan as being either NC, A, V, D, or O (other). The full set of rules is tabulated in our supplemental materials. The text-matching rules are more than sufficient to reliably extract labels based on text alone, due to the extremely simple structure and vocabulary of DICOM tags. However, because the source DICOM tags are themselves error-prone and unreliable [Gueld_2017], these labels suffer from inaccuracies, which we demonstrate later in our results. Finally, we filter out any scans that have too few slices or too coarse a spatial resolution, or that were taken after or during a biopsy or transplant procedure. As a result, we found 1728, 1703, 1504 and 1736 A, V, D and NC scans, respectively, with 326 scans labeled only as "contrast". We then manually annotated a validation set and a test set drawn from distinct studies and patients. The remaining scans, with their noisy text-mined annotations, formed the training set.
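As a rough illustration of this labeling step, the sketch below applies substring rules to a DICOM series description. The keyword lists are hypothetical stand-ins chosen for illustration; the actual rules are tabulated in our supplemental materials.

```python
# Hypothetical sketch of rule-based phase labeling from a DICOM series
# description. Keyword lists below are illustrative assumptions only.

# Phase labels: NC, A, V, D, O ("other"), plus the generic "contrast"
# label that the ACE loss later handles.
PHASE_KEYWORDS = {
    "NC": ["non-contrast", "noncontrast", "plain", "pre contrast"],
    "A": ["arterial", "art."],
    "V": ["venous", "portal"],
    "D": ["delay", "equilibrium"],
    "contrast": ["contrast", "c+"],  # generic; exact phase unknown
}

def label_from_description(series_description: str) -> str:
    """Return a noisy phase label for one scan, or 'O' if nothing matches."""
    text = series_description.lower()
    # Check the specific phases first; fall back to the generic label.
    for phase in ("NC", "A", "V", "D", "contrast"):
        if any(kw in text for kw in PHASE_KEYWORDS[phase]):
            return phase
    return "O"
```

For example, a description such as "PORTAL VENOUS abd/pel" would map to V, while an unmatched description such as "scout" falls through to O; in deployment these rule-based labels are exactly the noisy supervision the 3DSE model is trained on.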

2.2 3DSE Network

As Fig. 1 illustrates, visual cues indicating the phase can be located in different anatomical areas. Given this, we opt for a 3D classification network. State-of-the-art 3D classification networks, such as 3D-Resnet [hara2017learning] and C3D [tran2015learning], are often quite large, adding to the training time and increasing overfitting tendencies.

Figure 2: Our 3DSE network is designed to have a relatively small number of parameters and consists of three parts: two 3D convolution layers, one SE layer, and two fully connected layers.

Instead, we use a streamlined but effective architecture we call 3DSE, which is illustrated in Fig. 2. To begin, we first downsample all volumes to a fixed low resolution. From these, image features are extracted using two convolutional layers, each followed by a rectified linear unit and a max pooling layer. With such a streamlined feature extractor, activation maps are highly local [hu2018squeeze]. Thus, we add SE [hu2018squeeze] layers. These scale each feature channel with multiplicative factors computed using global pooling, providing an efficient means to increase descriptive capacity and inject global information. Subsequent pooling layers and two fully connected layers provide the five output phase predictions. The total parameter size is 19.22 MB, which is significantly smaller than 3D-Resnet [hara2017learning] and C3D [tran2015learning].
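To make the channel-scaling step concrete, here is a minimal NumPy sketch of squeeze-and-excitation applied to a single 3D feature map. The bottleneck width and the random weights are illustrative assumptions, not the trained parameters of 3DSE.

```python
import numpy as np

def squeeze_excite(features, w1, b1, w2, b2):
    """Minimal squeeze-and-excitation scaling for a 3D feature map.

    features: array of shape (C, D, H, W).
    w1/b1 and w2/b2: weights of the two small fully connected layers
    (reduce C -> C//r, then expand back to C).
    """
    c = features.shape[0]
    # Squeeze: global average pooling over all spatial dimensions,
    # injecting global context into one descriptor per channel.
    z = features.reshape(c, -1).mean(axis=1)              # (C,)
    # Excite: bottleneck MLP followed by a sigmoid gate.
    hidden = np.maximum(0.0, w1 @ z + b1)                 # ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))     # sigmoid, (C,)
    # Scale: multiply each channel by its learned gate in (0, 1).
    return features * gates[:, None, None, None]

# Toy usage with random weights (reduction ratio r = 2 on C = 4 channels).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8, 8))
out = squeeze_excite(x, rng.standard_normal((2, 4)), np.zeros(2),
                     rng.standard_normal((4, 2)), np.zeros(4))
assert out.shape == x.shape
```

Because the gates are computed from a global pool, every voxel in a channel is rescaled by the same factor, which is how the otherwise local feature extractor receives global information.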

2.3 Aggregated Cross Entropy

Frequently, text-mined labels are only able to provide the more general label of "contrast" for a scan, indicating that it could be any of the A, V, or D SOI. Since our goal is to determine the exact phase, the easiest way to handle such scans is to simply remove them from training, at the cost of using less data. Yet, such weakly supervised data still provides useful information, which should ideally be exploited to use as much training data as possible. To do this, we formulate a simple aggregated cross entropy (ACE) loss that can execute a cross entropy (CE) loss over these weakly supervised instances. We formulate the probability of "contrast" as equalling the sum of the probabilities of all contrast phases:

p_c = Σ_{j ∈ {A,V,D}} p_j = ( Σ_{j ∈ {A,V,D}} exp(a_j) ) / ( Σ_i exp(a_i) ),

where the pseudo-probabilities p are calculated using softmax, a denotes the logit outputs, and i indexes all five outputs.

The probability p_c can be naively used in a CE loss, but that would preclude using a numerically stable "softmax with CE" formulation. Instead, for scans that can only be labelled as "contrast", the CE loss can be written as:

ℓ_ACE = -Σ_k y_k log p_k = -log p_c = log Σ_i exp(a_i) - log Σ_{j ∈ {A,V,D}} exp(a_j),

where y denotes the ground truth. The elimination of all terms but the contrast term follows from y_c equalling one, with all other values equalling zero. The logsumexp function enjoys numerically stable forward- and backward-pass implementations. Thus, when presented with a "contrast" scan, our model uses this loss, providing a simple and numerically stable means to exploit all available data to train our desired, but more fine-grained, outputs.
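The loss above reduces to a difference of two logsumexp terms, which can be sketched directly in plain Python. The output ordering (A, V, D at indices 1-3) is an assumption made for illustration only.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(xs))) via the max-shift trick."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ace_loss(logits, contrast_idx=(1, 2, 3)):
    """Aggregated cross entropy for a scan labelled only as "contrast".

    logits: the five phase outputs; contrast_idx: assumed positions of
    the A, V, and D outputs. Implements
    -log p_c = logsumexp(all logits) - logsumexp(contrast logits).
    """
    return logsumexp(logits) - logsumexp([logits[i] for i in contrast_idx])

# Sanity check against the naive softmax-and-sum formulation.
logits = [2.0, 0.5, -1.0, 3.0, 0.1]
denom = sum(math.exp(a) for a in logits)
p_c = sum(math.exp(logits[i]) for i in (1, 2, 3)) / denom
assert abs(ace_loss(logits) - (-math.log(p_c))) < 1e-9
```

Unlike the naive formulation, this version does not overflow for large logits, since each exponential is shifted by the maximum before summing.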

3 Results

We tested our 3DSE network, with and without the ACE loss, on our dataset, and compared it to both the noisy text-mined labels and also 3D-Resnet-101 [hara2017learning] and C3D [tran2015learning]. For all models we perform a sweep of learning rates and report results corresponding to the best setting and stopping point based on the validation set.

Focusing first on scan-level comparisons, Tbl. 1 presents precision, recall, and F1 scores across the different phase types.

Text Mining 3DSE 3DSE + ACE
Precision Recall F1 Score Precision Recall F1 Score Precision Recall F1 Score
NC 0.977 0.895 0.934 0.965 0.965 0.964 0.993 0.986 0.988
A 0.966 0.983 0.974 0.974 0.966 0.970 0.991 0.991 0.992
V 0.975 0.782 0.868 0.965 0.946 0.956 0.930 0.993 0.963
D 0.964 0.956 0.960 0.964 0.956 0.960 0.972 0.930 0.951
O 0.926 0.986 0.955 0.981 0.989 0.985 0.997 0.990 0.993
mean 0.962 0.920 0.938 0.970 0.964 0.967 0.977 0.978 0.977
Table 1: Quantitative comparison of scan-level performance. Best results are marked in blue. For the 3DSE + ACE phase-level F1 scores, statistical significance compared to the text-mining and 3DSE models was calculated using randomized tests [Yeh_2000] and adjusted using the Holm-Bonferroni multiple comparison correction [Holm_1979].

As can be observed from the text-mined results, many scans are misclassified as O and many D scans are missed, demonstrating the shortfalls of relying on labels based on DICOM tags. In contrast, the vision-based 3DSE significantly reduces classification errors, improving the mean F1 score from 0.938 (via text mining) to 0.967. In particular, V's F1 score is improved from 0.868 to 0.956. Performance is increased even further when we use the ACE loss to include the "contrast" scans in training, boosting the mean F1 score to 0.977. While tests show a degradation of performance for the D phase, these differences do not meet statistical significance, unlike the statistically significant improvements seen in the NC, V, and O phases. Thus, these results validate the use of our ACE formulation to exploit as much training data as possible.

NC A V D O mean model size (MB)
3DResnet[hara2017learning] 0.560 0.866 0.259 0.052 0.929 0.533 325.22
C3D[tran2015learning] 0.972 0.965 0.920 0.895 0.989 0.948 33.56
3DSE-SE 0.954 0.953 0.924 0.914 0.985 0.946 11.44
3DSE 0.964 0.970 0.956 0.960 0.985 0.967 19.22
3DSE+ACE 0.988 0.992 0.963 0.951 0.993 0.977 19.22
Table 2: Across-model quantitative evaluation using the F1 score. Best and second-best results are marked in blue and red, respectively.

Shifting focus to across-model comparisons, Tbl. 2 compares our 3DSE model, with and without SE, against other state-of-the-art 3D deep models [hara2017learning, tran2015learning]. As can be seen, 3D-Resnet is nearly 17 times larger than 3DSE and performs poorly, which we observed was due to overfitting. Moving down in model size, C3D [tran2015learning] performs better than 3D-Resnet, but is still unable to match 3DSE. If we remove the SE layer from our 3DSE model, performance suffers considerably, which demonstrates that the SE layer is important in achieving high performance. Despite this, performance still matches C3D even though a significantly smaller number of parameters is used. The last two rows show 3DSE without and with the ACE loss, with the latter achieving the highest performance at a model size much smaller than its competitors. Finally, as Fig. 3 illustrates, the 3DSE model focuses on anatomical regions that are consistent with clinical practice. More visualizations can be found in our supplementary material.

Figure 3: Respond-CAM [zhao2018respond] visualizations of 3DSE from three different dynamic CT scans. (A): 3DSE focuses on contrast accumulation in the cardiac region; (V): 3DSE focuses on contrast remnants in the cardiac blood pool, liver portal veins, and kidney veins; (D): 3DSE focuses on contrast accumulation in the ureters of the kidney.
Text Mining 3DSE 3DSE + ACE
0 Errs. 1 Err. ≥2 Errs. 0 Errs. 1 Err. ≥2 Errs. 0 Errs. 1 Err. ≥2 Errs.
0 SOI 35 8 10 47 4 2 48 5 0
1 SOI 36 13 1 47 1 2 49 1 0
2 SOI 0 1 0 1 0 0 1 0 0
3 SOI 15 3 1 19 0 0 19 0 0
4 SOI 101 6 1 95 12 1 97 10 1
Total 186 32 13 209 16 6 214 16 1
Accuracy 80.9% 90.5% 92.7%
Table 3: Study-level performance of text mining and 3DSE. Each row groups studies based on the number of dynamic CT SOI they possess. Each column counts the number of studies based on how many scans were misclassified, if any. Best results for each SOI number are marked in blue.

These boosts in scan-level performance are important, but arguably the study-level performance is even more important, as the ultimate goal is to identify and extract as many dynamic liver CT studies as possible for downstream analysis. Thus, we also evaluate how many studies are correctly extracted, meaning all of their corresponding SOI are correctly classified. As Tbl. 3 demonstrates, 90.5% of studies have all of their scans correctly classified by our 3DSE model. Including the weakly supervised data using the ACE loss, we can further improve this to 92.7%. If we extrapolate these results to the entire dataset of studies, the 3DSE model, armed with the ACE loss, can correctly identify and extract substantially more studies than the text mining approach. This is a significant boost in study numbers for any subsequent analyses.

4 Conclusion

We presented a data curation tool to robustly extract multi-phase liver studies from a real-world and heterogeneous hospital PACS. This includes a streamlined, but powerful, 3DSE model and a principled ACE loss designed to handle incompletely labelled data. Experiments demonstrated that our 3DSE model, along with the ACE loss, can outperform both text mining and more complex deep models. These results indicate that our vision-based approach can be an effective means to better curate large-scale clinical datasets. Future work includes evaluating our approach in other clinical scenarios, as well as investigating how to harmonize text-mined features with our visual-based system.