Liver fibrosis is a major health threat with high prevalence [poynard2010prevalence]. Without timely diagnosis and treatment, liver fibrosis can develop into liver cirrhosis [poynard2010prevalence] and even hepatocellular carcinoma [saverymuttu1986ultrasound]. While histopathology remains the gold standard, non-invasive approaches minimize patient discomfort and danger. Elastography is a useful non-invasive modality, but it is not always available or affordable, and it can be confounded by inflammation, the presence of steatosis, and the patient’s etiology [Tai_2015, Chen_2019, Lee_2017]. Assessment using conventional US is potentially more versatile; however, it is a subjective measurement that can suffer from insufficient sensitivity and specificity as well as high inter- and intra-rater variability [Manning_2008, Li_2019]. Thus, there is great impetus for an automated and less subjective assessment of liver fibrosis. This is the goal of our work.
Although a relatively understudied topic, prior work has advanced automated US fibrosis assessment [Mojsilovic_1997, Wu_1992, Ogawa_1998, meng2017liver, liu2019ultrasound]. In terms of deep NNs, Meng et al. [meng2017liver] proposed a straightforward liver fibrosis parenchyma VGG-16-based [Simonyan15] classifier and tested it on a small dataset of images. Importantly, they only performed image-wise predictions and did not report a method for study-wise classification. On the other hand, Liu et al. [liu2019ultrasound] correctly identified the value in fusing features from all US images in a study when making a prediction. However, their algorithm requires exactly images, whereas real patient studies may contain any arbitrary number of US scans. Their feature concatenation approach would also drastically increase computational and memory costs as more images are incorporated. Moreover, they rely on manually labeled indicators as ancillary supervision, which are typically not available without considerable labor costs. Finally, their system treats all US images identically, even though a study consists of different viewpoints of the liver, each of which may have its own set of clinical markers correlating with fibrosis. Ideally, a liver fibrosis assessment system could learn directly from supervisory signals already present in hospital archives, i.e., image-level fibrosis scores produced during daily clinical routines. In addition, a versatile system should also be able to effectively use all US images/views in a patient study, regardless of their number, with no ballooning of computational costs.
We fill these gaps by proposing a robust and versatile pipeline for conventional ultrasound liver fibrosis assessment. Like others, we use a deep NN, but with key innovations. First, informed by clinical practice [Aube_2017], we ensure the network focuses only on a clinically important ROI, i.e., the liver parenchyma and the upper liver border. This prevents the NN from erroneously overfitting on spurious or background features. Second, inspired by HeMIS [Havaei_2016], we adapt and expand on this approach and propose HVF as a way to learn from, and perform inference on, any arbitrary number of US scans within a study. While HVF shares similarities with deep feature-based multi-instance learning [ilse2018attention], there are two important distinctions: (1) HVF includes variance as part of the fusion, as per HeMIS [Havaei_2016]; (2) HVF is trained using arbitrary image combinations from a patient study, which is possible because, unlike multi-instance learning, each image (or instance) is strongly supervised by the same label. We are the first to propose and develop this mechanism to fuse global NN feature vectors. Finally, we implement a VSP that tailors the NN processing based on common liver US views. While the majority of processing is shared, each view possesses its own set of so-called “style”-based normalization parameters [Huang_2017] to customize the analysis. While others have used similar ideas for segmenting different anatomical structures [Huang_2019], we are the first to apply this concept to clinical decision support and the first to use it in concert with a hetero-image fusion mechanism. The result is a highly robust and practical liver fibrosis assessment solution.
To validate our approach, we use a cross-validated dataset of US patient studies, comprising images. We measure the ability to identify patients with moderate to severe liver fibrosis. Compared to strong classification baselines, our enhancements improve recall at precision by , with commensurate boosts in partial AUC. Importantly, ablation studies confirm that each component contributes to these gains, demonstrating that our liver fibrosis assessment pipeline, and its constituent clinical ROI, HVF, and VSP parts, represents a significant advancement for this important task.
We assume we are given a dataset, , comprised of US patient studies and ground-truth labels indicating liver fibrosis status, dropping the when convenient. Each study , in turn, is comprised of an arbitrary number of 2D conventional US scans of the patient’s liver, . Fig. 1 depicts the workflow of our automated liver assessment tool, which combines clinical ROI pooling, HVF, and VSP.
2.1 Clinical ROI Pooling
We use a deep NN as the backbone for our pipeline. Popular deep NNs, e.g., ResNet [He_2016], can be formulated with the following convention:
where is a FCN feature extractor parameterized by , is the FCN output, is some global pooling function, e.g., average pooling, and
is a fully-connected layer (and sigmoid function) parameterized by. When multiple US scans are present, a standard approach is to aggregate individual image-wise predictions, e.g., taking the median:
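As a concrete sketch of this study-wise aggregation, the median-based late fusion can be illustrated in a few lines of NumPy (the per-image sigmoid outputs below are hypothetical values, not from our experiments):

```python
import numpy as np

def study_prediction(image_probs):
    """Late fusion: aggregate per-image fibrosis probabilities into a
    single study-wise score by taking their median."""
    return float(np.median(np.asarray(image_probs)))

# Five hypothetical image-wise sigmoid outputs from one patient study.
probs = [0.91, 0.42, 0.67, 0.73, 0.58]
study_prob = study_prediction(probs)  # -> 0.67
```

Because the median is robust to a few outlying image-wise scores, a single spurious prediction does not dominate the study-wise result.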
This conventional approach may have drawbacks, as it is possible for the NN to overfit to spurious background variations. However, based on clinical practice [Aube_2017], we know a priori that certain features are crucial for assessing liver fibrosis, e.g., the parenchyma texture and surface nodularity. As Fig. 2 demonstrates, to make the NN focus on these features we use a masking technique. We first generate a liver mask for each US scan. This is done by training a simple segmentation network on a small subset of the images.
Fig. 2: (a) US image; (b) liver mask; (c) clinical ROI.
Then, for each scan, we create a rectangle that just covers the top half of the liver mask, extended 10 pixels above it, to ensure the liver border is covered. The resulting binary mask is denoted . Because we only need to ensure we capture enough of the liver parenchyma and upper border to extract meaningful features, need not be perfect.
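This rectangle construction can be sketched as follows (a minimal NumPy illustration; since the text describes the rectangle only loosely, details such as the exact midpoint rule and the `margin` default are our assumptions):

```python
import numpy as np

def clinical_roi(liver_mask, margin=10):
    """Build the rectangular clinical ROI from a binary liver mask: a
    box spanning the mask's horizontal extent, running from `margin`
    pixels above the liver's top row down to its vertical midpoint, so
    that both the upper liver border and parenchyma are covered."""
    rows, cols = np.nonzero(liver_mask)
    top, bottom = rows.min(), rows.max()
    mid = (top + bottom) // 2                # keep only the top half
    roi = np.zeros_like(liver_mask)
    roi[max(top - margin, 0):mid + 1, cols.min():cols.max() + 1] = 1
    return roi
```

Note that an imperfect segmentation simply shifts the box slightly, which is tolerable given that the ROI only needs to capture enough parenchyma and border.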
With a clinical ROI obtained, we formulate the pooling function in (1) as a masked version of global average pooling:
where and denote the element-wise product and global average pooling, respectively. Interestingly, we found that including the zeroed-out regions within the global average pooling benefits performance [chen2020anatomyaware, pelvic_yirui]. We posit their inclusion helps implicitly capture liver size characteristics, which is another important clinical US marker for liver fibrosis [Aube_2017].
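A minimal NumPy sketch of this masked pooling, averaging over all spatial positions, zeros included (the feature values here are hypothetical):

```python
import numpy as np

def clinical_roi_pooling(features, roi_mask):
    """Masked global average pooling: zero out activations outside the
    clinical ROI, then average over ALL spatial positions (zeros
    included), so the pooled value also reflects the ROI's relative
    size within the image."""
    masked = features * roi_mask[None, :, :]  # broadcast mask over channels
    return masked.mean(axis=(1, 2))           # one pooled value per channel

# Hypothetical 2-channel 4x4 feature map; the ROI covers the top half.
feats = np.ones((2, 4, 4))
roi = np.zeros((4, 4))
roi[:2, :] = 1
pooled = clinical_roi_pooling(feats, roi)     # -> array([0.5, 0.5])
```

Dividing by the full spatial extent, rather than by the ROI area, is what lets the pooled features implicitly encode how large the liver region is.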
2.2 Global Hetero-Image Fusion
A challenge with US patient studies is that they may consist of a variable number of images, each of a potentially different view. Ideally, all available US images would contribute to the final prediction. In (3) this is accomplished via a late fusion of independent and image-specific predictions. But this does not allow the NN to integrate the combined features across US images. A better approach would fuse these features directly. The challenge, then, is to allow for an arbitrary number of US images in order to ensure flexibility and practicality.
The HeMIS approach [Havaei_2016] to segmentation offers a promising strategy that fuses features from arbitrary numbers of images using their first- and second-order moments. However, HeMIS fuses convolutional features early in its FCN pipeline, which is possible because it assumes pixel-to-pixel correspondence across images. This assumption is completely violated for US images. Instead, only global US features can be sensibly fused together, which we accomplish through HVF. More formally, we use and to denote the set of FCN features and clinical ROI, respectively, for each image. Then HVF modifies (1) to accept any arbitrary set of FCN features to produce a study-wise prediction:
Besides the first- and second-order moments, HVF (6) also incorporates the max operator as a powerful hetero-fusion function [Zhou_2018_fusion]. All three operators can accept any arbitrary number of samples to produce one fused feature vector. To the best of our knowledge, we are the first to apply hetero-fusion to global feature vectors. The difference, compared to late fusion, is that features, rather than predictions, are fused. Rather than always inputting all US scans when training, an important strategy is choosing random combinations of the scans for every epoch. This provides a form of data augmentation and allows the NN to learn from image signals that may otherwise be suppressed. An important implementation note is that training with random combinations of images can make HVF’s batch statistics unstable. For this reason, a normalization that does not rely on batch statistics, such as instance normalization [Ulyanov_2016], should be used.
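The fusion and the per-epoch subset sampling can be sketched as follows (a NumPy illustration; the feature vectors and the uniform subset-size distribution are our assumptions, not the exact training recipe):

```python
import numpy as np

def hetero_image_fusion(feature_vecs):
    """HVF-style global fusion: combine any number of per-image global
    feature vectors into one study-wise vector by concatenating their
    element-wise mean, variance, and max."""
    f = np.stack(feature_vecs)                       # (num_images, dim)
    return np.concatenate([f.mean(0), f.var(0), f.max(0)])

def sample_epoch_subset(study_images, rng):
    """Training-time augmentation: draw a random non-empty subset of a
    study's scans for this epoch instead of always using all of them."""
    k = int(rng.integers(1, len(study_images) + 1))
    idx = rng.choice(len(study_images), size=k, replace=False)
    return [study_images[i] for i in idx]

# Three hypothetical 4-D global feature vectors from one study.
vecs = [np.array([1., 2., 3., 4.]),
        np.array([3., 2., 1., 0.]),
        np.array([2., 2., 2., 2.])]
fused = hetero_image_fusion(vecs)  # 12-D: [mean | variance | max]
```

Crucially, the fused dimensionality is fixed (three times the per-image feature size) no matter how many scans the study contains, which is what keeps memory and computation from growing with study size.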
2.3 View-Specific Parameterization
While HVF can effectively integrate arbitrary numbers of US images within a study, it uses the same FCN feature extractor across all images, treating them all identically. Yet, there are certain US features, such as vascular markers, that are specific to particular views. As a result, some manner of view-specific analysis could help push performance further. In fact, based on guidance from our clinical partner, US views of the liver can be roughly divided into categories, which focus on different regions of the liver. These are shown in Fig. 3.
A naive solution would be to use a dedicated deep NN for each view category. However, this would drastically reduce the training set for each dedicated NN and would sextuple the number of parameters, computation, and memory consumption. Intuitively, there should be a great deal of analysis that is common across US views. The challenge is to retain this shared analysis, while also providing some tailored processing for each category.
To do this, we adapt the concept of “style” parameters to implement a VSP appropriate for US-based fibrosis assessment. Such parameters refer to the affine normalization parameters used in batch- [Ioffe_2015] or instance-normalization [Ulyanov_2016]. If these are switched out, while keeping all other parameters constant, one can alter the behavior of the NN in quite dramatic ways [Huang_2017, Huang_2019]. For our purposes, retaining view-specific normalization parameters allows for the majority of parameters and processing to be shared across views. VSP is then realized with a minimal number of additional parameters.
More formally, if we create sets of normalization parameters for an FCN, we can denote them as . The FCN from (2) is then modified to be parameterized also by :
where indexes each image by its view and now excludes the normalization parameters. VSP relies on identifying the view of each US scan in order to swap in the correct normalization parameters. This can be recorded as part of the acquisition process. Or, if this is not possible, we have found classifying the US views automatically to be quite reliable.
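The parameter-swapping idea can be sketched as follows (a minimal NumPy illustration of an instance-norm layer with per-view affine parameters; the class name and interface are hypothetical, not our actual implementation):

```python
import numpy as np

class ViewSpecificNorm:
    """Instance normalization with one affine (gamma, beta) pair per US
    view: the normalization itself is shared, but the "style" parameters
    are swapped in according to each image's view label."""

    def __init__(self, num_views, num_channels, eps=1e-5):
        self.gamma = np.ones((num_views, num_channels))
        self.beta = np.zeros((num_views, num_channels))
        self.eps = eps

    def __call__(self, x, view):
        """x: (channels, H, W) feature map; view: integer view index."""
        mu = x.mean(axis=(1, 2), keepdims=True)
        var = x.var(axis=(1, 2), keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return (self.gamma[view][:, None, None] * x_hat
                + self.beta[view][:, None, None])
```

Only the small per-view affine tables are view-specific; all convolutional weights remain shared, which is why VSP adds a minimal number of parameters compared to training one dedicated NN per view.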
Dataset. We test our system on a dataset of US patient studies collected from the Chang Gung Memorial Hospital in Taiwan, acquired from Siemens, Philips, and Toshiba devices. The dataset comprises patients, among which () patients have moderate to severe fibrosis (27 with severe liver steatosis). All patients were diagnosed with hepatitis B. Patients were scanned up to times, using a different scanner type each time. Each patient study is composed of up to US images, corresponding to the views in Fig. 3. The total number of images is . We use -fold cross validation, splitting each fold at the patient level into , , and , for training, testing, and validation, respectively. We also manually labeled liver contours from randomly chosen US images.
Implementation Details and Comparisons. Experiments evaluated our workflow against several strong classification baselines, where throughout we use the same ResNet50 [He_2016] backbone (pretrained on ImageNet [imagenet_cvpr09]). For methods using the clinical ROI pooling of (4), we use a truncated version of ResNet (only the first three layer blocks) for in (2). This keeps enough spatial resolution prior to the masking in (4). We call this truncated backbone “ResNet-3”. To create the clinical ROI, we train a simple 2D U-Net [Ronneberger_2015] on the images with masks. For training, we perform standard data augmentation with random brightness, contrast, rotation, and scale adjustments. We use the stochastic gradient descent optimizer and a learning rate of 0.001 to train all networks.
[Table 1: AUC, partial AUC, and recall results for all baselines and variants, including GHIF + VSP (I-Norm).]
For baselines that can output only image-wise predictions, we test against a conventional ResNet50 and also a ResNet-3 with clinical ROI pooling. For these two approaches, following clinical practice, we take the median value across the image-wise predictions to produce a study-wise prediction. All subsequent study-wise baselines are then built off the ResNet-3 with clinical ROI pooling. We first test the global feature fusion of (6), but train the ResNet-3 only with all available images in a US study. In this way, it follows the spirit of Liu et al.’s global fusion strategy [liu2019ultrasound]. To reveal the impact of our hetero-fusion training strategy that uses different random combinations of US images per epoch, we also test two HVF variants, one using batch normalization and one using instance normalization. The latter helps measure the importance of using proper normalization strategies to manage the instability of HVF’s batch statistics. Finally, we test our proposed model, which incorporates VSP on top of HVF and clinical ROI pooling.
Evaluation Protocols. The problem setup is binary classification, i.e., identifying patients with moderate to severe liver fibrosis, as these are the patient cohorts requiring intervention. While we report full AUC, we primarily focus on operating points within a useful range of specificity or precision. Thus, we evaluate using partial AUC, which only considers false positive rates within to , because higher values lose their practical usefulness. Partial AUCs are normalized to be within a range of to . We also report recalls at a range of precision points (R@P90, R@P85, R@P80) to reveal the achievable sensitivity at high precision. We report mean values and mean graphs across all cross-validation folds.
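As an illustration of the R@P metrics, recall at a given precision can be computed with a simple threshold sweep (a NumPy sketch; the labels and scores below are synthetic, not from our dataset):

```python
import numpy as np

def recall_at_precision(labels, scores, min_precision):
    """Sweep all score thresholds and return the highest recall
    attainable while precision stays at or above `min_precision`."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    positives = np.sum(labels == 1)
    best_recall = 0.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        precision = tp / pred.sum()  # pred.sum() >= 1 since t is a score
        if precision >= min_precision:
            best_recall = max(best_recall, tp / positives)
    return best_recall
```

For example, with labels [1, 1, 1, 0, 0] and scores [0.9, 0.8, 0.4, 0.7, 0.2], thresholding at 0.8 yields precision 1.0 and recall 2/3, so R@P90 is 2/3 even though full recall is only reachable at a lower precision.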
Results. Tab. 1 presents our AUC, partial AUC, and recall values, whereas Fig. 4 graphs the partial ROC. Several conclusions can be drawn. First, clinical ROI pooling produces significant boosts in performance, validating our strategy of forcing the network to focus on important regions of the image. Second, not surprisingly, global fusion without training on random combinations of images performs very poorly, as only presenting all study images during training severely limits the data size and variability, handicapping the model.
For instance, compared to variants that train on individual images, global fusion effectively reduces the training size by about a factor of in our dataset. In contrast, the HVF variants, which train with the combinatorial number of random combinations of images, not only avoid drastically reducing the training set size, but can effectively increase it. Importantly, as the table demonstrates, an appropriate choice of instance normalization is crucial to achieving good performance with HVF. Although not shown, switching to instance normalization did not improve performance for the image-wise or global fusion models. The boosts in HVF performance are apparent in the partial AUC and recalls at high precision points, underscoring the need to analyze results at appropriate operating points. Finally, adding the VSP provides even further performance improvements, particularly in R@P80-R@P90 values, which see a roughly increase over HVF alone. This indicates that VSP can significantly enhance the recall at the very demanding precision points necessary for clinical use. In total, compared to a conventional classifier, the enhancements we articulate contribute to roughly improvements in partial AUC and in R@P90 values. Table 1 of our supplementary material also presents AUC when only one particular view is inputted to the model during inference. We note that performance is highest when all views are inputted, indicating that our pipeline is able to usefully exploit the information across views. Our supplementary material also includes liver segmentation results and success and failure cases for our system.
We presented a principled and effective pipeline for liver fibrosis characterization from US studies, proposing several innovations: (1) clinical ROI pooling to discourage the network from focusing on spurious image features; (2) HVF to manage any arbitrary number of images in a US study in both training and inference; and (3) VSP to tailor the analysis based on the liver view being presented, using “style”-based parameters. In particular, we are the first to propose a deep global hetero-fusion approach and the first to combine it with VSP. Experiments demonstrate that our system can produce gains in partial AUC and R@P90 of roughly and , respectively, on a dataset of patient studies. Future work should expand to other liver diseases and more explicitly incorporate other clinical markers, such as absolute or relative liver lobe sizing.