Over the last decade, deep learning techniques have seen success in automatically interpreting biomedical and diagnostic imaging data [Litjens_2018, zhou2018generation]. However, robust performance often requires training on large-scale data. Unlike computer vision datasets, which can rely on crowd-sourcing [deng2009imagenet], the collection of large-scale medical imaging datasets must typically involve physician labor. Thus, there exists a tension between modeling power and data requirements that only promises to increase [Kohli_2017]. An enticing prospect is mining physician expertise by collecting retrospective data from PACS, but the current generation of PACS does not properly address the curation of large-scale data for machine learning. In PACS, DICOM tags describing scans are typically entered by hand, non-standardized, and often incomplete, which leads to the need for extensive data curation [harvey2019standardised]. These limitations frequently produce high mislabeling rates, e.g., the rate reported by Gueld et al. [Gueld_2017], meaning that simply selecting the SOI from a large set of studies can be prohibitively laborious. This has spurred efforts to automatically text-mine image/label pairs from PACS [yan2018deeplesion, zhou2019progressively, Irvin_2019], but these efforts rely on complicated and customized NLP technology to extract labels. Apart from the barriers posed by this complexity, these solutions only address contexts where the information of interest can be extracted from accompanying text. This is not always possible, as NLP parsers [Peng_2018, Irvin_2019] cannot always straightforwardly correct errors in the original reports or fill in missing information. As such, collecting large-scale data will also require developing automated, but robust, tools that go beyond mining from DICOM tags and/or reports.
This is the topic of our work, where we articulate a robust approach to large-scale data curation based on visual information. In our case, we focus on a hospital PACS dataset we collected that consists of CT scans of imaging studies from unique patients with liver lesions, along with pathological diagnoses. Its makeup is highly heterogeneous, comprising studies of multiple organs, protocols, and reconstruction types. Very simple and accessible text-matching rules applied to the DICOM tags can accurately extract scan descriptions; however, omissions and errors in the text mean these labels are noisy and unreliable. Without loss of generality, we focus on extracting a large-scale and well-curated dataset of dynamic liver CT studies from our PACS data. Dynamic CT is the most common protocol for categorizing and assessing liver lesions [burrowes2017contrast], and we expect a large-scale dataset to prove highly valuable for the development of computer-aided diagnosis systems, provided it is well curated. Thus, the goal is to use the noisy labels to train a visual recognition system that can much more robustly identify dynamic liver CT studies, extract the corresponding axial-reconstructed scans, and identify the phase of each as being NC, A, V, or D. Fig. 1 shows examples of each phase along with its discriminating features.
Unlike prior work, we focus on extracting multi-phase volumetric SOI of a certain type, rather than on extracting disease tags or labels. This places a high expectation on performance, i.e., F1 scores of or higher. To tackle this problem, we develop a principled phase-recognition system whose contributions are threefold. First, we collect the aforementioned large-scale dataset from a hospital PACS, which includes more than scans. Second, we introduce a customized phase-recognition deep-learning model, comprising a streamlined version of C3D [tran2015learning] with squeeze-and-excitation (SE) layers. We show that this simple yet effective model can outperform much more complicated models. Third, we address a common issue facing data curation systems, where many text-mined labels are too general. In our case, these are labels that indicate only “contrast” rather than the more specific NC, A, V, or D SOI. So that we can still use these images for training, along with their weak supervisory signals, we design an aggregated cross-entropy (ACE) loss that incorporates the hierarchical relationship within annotations. Our experimental results demonstrate that our 3DSE model, in combination with our ACE loss, achieves significantly better phase-recognition performance than the text-mined method and other deep-learning-based approaches. To the best of our knowledge, this is the first work investigating visual-information-based data curation methods for PACS, and we expect that our data curation system would also prove useful in domains other than dynamic liver CT.
Our goal is to reliably curate as large a dataset of dynamic liver CT scans as possible, with minimal labor. To do this, we first extracted a dataset of CT studies from the PACS of Anonymized, corresponding to patients who had pathological diagnoses of liver lesions, with the hope that such a dataset would be of great interest for later downstream analysis. This resulted in studies of patients. For each study, the number of scans ranges from to, and there are one to three studies per patient. The resulting dataset is highly heterogeneous, containing several types of reconstructions, projections, anatomical regions, and contrast protocols that we are not interested in, e.g., computed tomography arterial portography. Studies containing dynamic CT scans may have anywhere from one to all of the NC, A, V, and D contrast-phase SOI. Our aim is to identify and extract the axial-reconstructed versions of these scans from each study, should they exist. As such, this task exemplifies many of the general demands and challenges of data curation across medical domains.
With the dataset collected, we next applied a set of simple text-matching rules to the DICOM tags to noisily label each scan as NC, A, V, D, or O. The full set of rules is tabulated in our supplemental materials. The text-matching rules are more than sufficient to reliably extract labels based on the text alone, owing to the extremely simple structure and vocabulary of DICOM tags. However, because the source DICOM tags are themselves error-prone and unreliable [Gueld_2017], these labels suffer from inaccuracies, which we demonstrate later in our results. Finally, we filtered out any scans that have fewer than slices, have a spatial resolution coarser than, or were taken after or during a biopsy or transplant procedure. As a result, we found 1728, 1703, 1504, and 1736 A, V, D, and NC scans, respectively, with 326 scans labeled as ‘contrast’. We then manually annotated a validation set and a test set, comprising and scans, and studies, and and patients, respectively. This left a training set of scans from studies of patients with noisy text-mined annotations.
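The text-matching step can be sketched as a handful of regular expressions applied to each scan's DICOM series description. The keywords below are illustrative assumptions only; the actual rules are tabulated in our supplemental materials.

```python
import re

# Illustrative rules only; the real rule set is given in the supplemental
# materials. Each phase label maps to a regex tested against the DICOM
# SeriesDescription tag; anything unmatched falls through to "O".
PHASE_RULES = {
    "NC": re.compile(r"\b(non[- ]?contrast|plain|pre)\b", re.I),
    "A":  re.compile(r"\b(arterial|art)\b", re.I),
    "V":  re.compile(r"\b(venous|portal)\b", re.I),
    "D":  re.compile(r"\b(delay(ed)?)\b", re.I),
}

def text_mine_phase(series_description: str) -> str:
    """Return a noisy phase label (NC/A/V/D) or "O" for everything else."""
    for phase, rule in PHASE_RULES.items():
        if rule.search(series_description):
            return phase
    return "O"
```

Because DICOM descriptions are free text, such rules inevitably inherit any omissions or typos in the tags, which is precisely the label noise the visual model must overcome.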
2.2 3DSE Network
As Fig. 1 illustrates, visual cues indicating the phase can be located in different anatomical areas. Given this, we opt for a 3D classification network. State-of-the-art 3D classification networks, such as 3D-Resnet [hara2017learning] and C3D [tran2015learning], are often quite large, adding to training time and increasing the tendency to overfit.
Instead, we use a streamlined but effective architecture we call 3DSE, illustrated in Fig. 2. To begin, we first downsample all volumes to a fixed size. From these, image features are extracted using two convolutional layers, each followed by a rectified linear unit and a max-pooling layer. With such a streamlined feature extractor, activation maps are highly local [hu2018squeeze]. Thus, we add SE [hu2018squeeze] layers, which scale each feature channel with multiplicative factors computed using global pooling, providing an efficient means to increase descriptive capacity and inject global information. Subsequent pooling layers and two fully connected layers provide the five output phase predictions. The total parameter size is MB, which is significantly smaller than that of 3D-Resnet [hara2017learning] and C3D [tran2015learning].
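The SE recalibration can be sketched in a few lines of NumPy. The bottleneck weights `w1` and `w2` below stand in for learned parameters, and the reduction ratio they imply is an assumption for illustration:

```python
import numpy as np

def squeeze_excite(feats: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Channel-wise SE recalibration of a 3D feature map.

    feats: (C, D, H, W) activation volume.
    w1:    (C // r, C) squeeze weights; w2: (C, C // r) excite weights,
           where r is an (illustrative) reduction ratio.
    """
    # Squeeze: global average pooling summarizes each channel with one
    # scalar, injecting global context into otherwise local activations.
    z = feats.mean(axis=(1, 2, 3))                                # (C,)
    # Excite: bottleneck MLP plus sigmoid yields per-channel gates in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))     # (C,)
    # Scale: multiplicatively reweight each feature channel by its gate.
    return feats * s[:, None, None, None]
```

The gating adds only two small dense layers per SE block, which is why the descriptive capacity grows with almost no increase in parameter count.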
2.3 Aggregated Cross Entropy
Frequently, text-mined labels can only provide the more general label of “contrast” for a scan, indicating that it could be any of the A, V, or D SOI. Since our goal is to determine the exact phase, the easiest way to handle such scans is to simply remove them from training, at the cost of using less data. Yet such weakly supervised data still provides useful information, which should ideally be exploited to use as much training data as possible. To do this, we formulate a simple ACE loss that allows a CE loss to be applied to these weakly supervised instances. We formulate the probability of “contrast” as equalling the sum of the probabilities of all contrast phases:

\[ \hat{p}_{con} = \hat{p}_{A} + \hat{p}_{V} + \hat{p}_{D}, \tag{1} \]
\[ \hat{p}_{i} = \frac{\exp(a_{i})}{\sum_{j}\exp(a_{j})}, \tag{2} \]

where (2) assumes a pseudo-probability calculated using softmax, \(a\) denotes the logit outputs, and \(j\) indexes all five outputs.
The \(\hat{p}_{con}\) can be naively used in a CE loss, but that would preclude using a numerically stable “softmax with CE” formulation. Instead, for scans that can only be labelled as “contrast”, the CE loss can be written as

\[ \ell_{ACE} = -\sum_{i} y_{i} \log \hat{p}_{i} = -\log \hat{p}_{con}, \tag{4} \]
\[ \ell_{ACE} = \log\sum_{j}\exp(a_{j}) - \log\sum_{i\in\{A,V,D\}}\exp(a_{i}), \tag{5} \]

where \(y\) denotes the ground truth. The elimination of all terms but the contrast term in (4) follows from \(y_{con}\) equalling one, with all other values equalling zero. The \(\log\sum\exp\) function enjoys numerically stable forward- and backward-pass implementations. Thus, when presented with a “contrast” scan, our model uses (5) for the loss, providing a simple and numerically stable means to exploit all available data to train our desired, but more fine-grained, outputs.
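For illustration, the stable form of the loss can be computed with log-sum-exp directly on the logits. This NumPy sketch assumes the five outputs are ordered (NC, A, V, D, O), so the contrast phases occupy indices 1-3; that ordering is an assumption, not part of the method:

```python
import numpy as np

def logsumexp(a: np.ndarray) -> float:
    """Numerically stable log(sum(exp(a)))."""
    m = a.max()
    return float(m + np.log(np.exp(a - m).sum()))

def ace_loss(logits, contrast_idx=(1, 2, 3)):
    """ACE loss for a scan labelled only as "contrast".

    Computes -log(p_A + p_V + p_D) as LSE(all logits) - LSE(contrast
    logits), avoiding an explicit softmax. The (NC, A, V, D, O) output
    ordering implied by contrast_idx is assumed for illustration.
    """
    logits = np.asarray(logits, dtype=float)
    return logsumexp(logits) - logsumexp(logits[list(contrast_idx)])
```

For example, if the softmax probabilities are (0.1, 0.2, 0.3, 0.25, 0.15), the loss equals -log(0.75); it is always non-negative because the contrast terms form a subset of the full sum.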
We tested our 3DSE network, with and without the ACE loss, on our dataset, and compared it to the noisy text-mined labels as well as to 3D-Resnet-101 [hara2017learning] and C3D [tran2015learning]. For all models, we performed a sweep of learning rates and report results corresponding to the best setting and stopping point, chosen on the validation set.
Focusing first on scan-level comparisons, Tbl. 1 presents F1 scores across the different phase types.
[Table 1: scan-level precision, recall, and F1 score for text mining, 3DSE, and 3DSE + ACE.]
As can be observed from the text-mined results, many scans are misclassified as O and many D scans are missed, demonstrating the shortfalls of relying on labels based on DICOM tags. In contrast, the vision-based 3DSE significantly reduces classification errors, improving the mean F1 score from (via text mining) to . In particular, V's F1 score is improved from to . Performance increases even further when we use the ACE loss to include the “contrast” scans in training, boosting the mean F1 score to . While tests show a degradation of performance for the D phase, these differences do not reach statistical significance, unlike the statistically significant improvements seen in the NC, V, and O phases. Thus, these results validate the use of our ACE formulation to exploit as much training data as possible.
[Table 2: per-phase (NC, A, V, D, O) and mean F1 scores, along with model size (MB), for each model.]
Shifting focus to across-model comparisons, Tbl. 2 compares our 3DSE model, with and without SE, against other state-of-the-art 3D deep models [hara2017learning, tran2015learning]. As can be seen, 3D-Resnet is nearly times larger than 3DSE and performs poorly, which we observed was due to overfitting. Moving down in model size, C3D [tran2015learning] performs better than 3D-Resnet, but is still unable to match 3DSE. If we remove the SE layers from our 3DSE model, performance suffers considerably, which demonstrates that the SE layers are important for achieving high performance. Despite this, performance still matches C3D, even though a significantly smaller number of parameters is used. Finally, the last rows show 3DSE with and without the ACE loss, with the latter achieving the highest performance at a model size much smaller than its competitors. As Fig. 3 illustrates, the 3DSE model focuses on anatomical regions that are consistent with clinical practice. More visualizations can be found in our supplementary material.
[Table 3: study-level error counts (0 Errs., 1 Err., or more) for text mining, 3DSE, and 3DSE + ACE.]
These boosts in scan-level performance are important, but arguably the study-level performance is even more important, as the ultimate goal is to identify and extract as many dynamic liver CT studies as possible for downstream analysis. Thus, we also evaluate how many studies are correctly extracted, meaning all of their corresponding SOI are correctly classified. As Tbl. 3 demonstrates, of studies have all of their scans correctly classified by our 3DSE model. Including the weakly supervised data using the ACE loss, we can further improve this to . If we extrapolate these results to the entire dataset of studies, the 3DSE model, armed with the ACE loss, can correctly identify and extract more studies than the text-mining approach. This is a significant boost in study numbers for any subsequent analyses.
We presented a data curation tool to robustly extract multi-phase liver studies from a real-world and heterogeneous hospital PACS. This includes a streamlined, but powerful, 3DSE model and a principled ACE loss designed to handle incompletely labelled data. Experiments demonstrated that our 3DSE model, along with the ACE loss, can outperform both text mining and more complex deep models. These results indicate that our vision-based approach can be an effective means to better curate large-scale clinical datasets. Future work includes evaluating our approach in other clinical scenarios, as well as investigating how to harmonize text-mined features with our visual-based system.