Cardiovascular disease (CVD) remains the leading cause of morbidity and mortality worldwide, accounting for over 18 million deaths [roth2020global]. Cine magnetic resonance imaging (MRI) has been shown to provide a non-invasive way to evaluate the detailed morphology and function of the heart. In particular, cine MRI is considered the most accurate imaging modality for assessing various quantitative parameters with important prognostic implications.
Segmentation of the left ventricle (LV), right ventricle (RV), and myocardium (MYO) from cardiac cine MR images plays an important role in characterizing clinically important parameters [gaggin2013biomarkers], such as ejection fraction (EF), end-diastolic volume (EDV), end-systolic volume (ESV), and myocardial mass. These parameters, in turn, can be used to identify disease phenotypes, stratify disease risks, and develop diagnostic and prognostic tools [ammar2021automatic]. In clinical practice, semi-automated segmentation, which is time-consuming and suffers from inter-observer variability, is still predominantly used, partly due to the lack of accurate, fully automated segmentation tools [bernard2018deep].
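As a brief illustration of how such parameters follow from a segmentation, the sketch below computes EF from EDV and ESV using the standard definition EF = (EDV - ESV) / EDV. The helper `volume_ml`, which approximates a cavity volume by summing per-slice mask areas, is a hypothetical simplification introduced here for illustration and is not a procedure described in this work.

```python
def volume_ml(mask_pixel_counts, pixel_area_mm2, slice_spacing_mm):
    """Approximate a cavity volume (in mL) by slice summation.

    mask_pixel_counts: number of segmented pixels in each short-axis slice.
    This simple stacking rule is an assumption made for illustration.
    """
    volume_mm3 = sum(mask_pixel_counts) * pixel_area_mm2 * slice_spacing_mm
    return volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


def ejection_fraction(edv_ml, esv_ml):
    """Return the ejection fraction in percent from EDV and ESV."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("ESV must lie between 0 and EDV")
    return 100.0 * (edv_ml - esv_ml) / edv_ml
```

For example, `ejection_fraction(120.0, 50.0)` evaluates (120 - 50) / 120, i.e., an EF of roughly 58%.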
With the recent progress of deep learning [goodfellow2016deep], numerous convolutional neural network (CNN) models, e.g., U-Net [ronneberger2015u], have been developed and have demonstrated their accuracy in many medical image analysis tasks [shen2017deep]. While deep learning has achieved impressive results for segmentation and classification, a number of challenges arise in developing and deploying deep learning models for clinical applications [shen2017deep]. First, CNN models typically require a large amount of labeled training data [goodfellow2016deep]. Privacy issues and the high cost of labeling, however, lead to sparse and inaccurate labels, making it difficult to collect sufficient high-quality training data [goodfellow2016deep]; with limited training data, accurate model fitting at the training stage is challenging. Recently, to address this, efforts have been made to generate samples using data augmentation or adversarial training [liu2019hard], which, however, introduces an unavoidable appearance shift between real and generated data. Second, and importantly, many CNN models are seen as “black-box” models [kuo2016understanding, goodfellow2016deep]. Accordingly, it remains largely unclear how a particular CNN model makes a decision and when it can be trusted. Therefore, it is crucial to develop an explainable model that works with a limited amount of data for clinical applications.
To address the aforementioned challenges, in this work, we propose a lightweight, interpretable, and fully automated segmentation framework based on successive subspace learning (SSL) [rouhsedaghat2021successive]. Specifically, our framework comprises the following steps: (1) sequential expansion of the near-to-far neighborhood at different resolutions; (2) channel-wise subspace approximation using the subspace approximation with adjusted bias (Saab) transform for unsupervised dimension reduction; (3) a novel class-wise entropy-guided feature selection for supervised dimension reduction; (4) concatenation of features and pixel-wise classification with gradient boosting; and (5) a conditional random field for post-processing.
To the best of our knowledge, this is the first attempt at exploring the SSL framework with the Saab transform for a segmentation task. Our framework is lightweight and interpretable, yet achieves superior segmentation performance with about 200 times fewer parameters than state-of-the-art U-Net models.
II-A Fundamentals of SSL and Saab Transform
Inspired by the recent stacked design of CNN models, the SSL principle [rouhsedaghat2021successive] has been applied to classifying 2D natural images (e.g., PixelHop [chen2020pixelhop, zhang2020pointhop]), 3D MR images [liu2021voxelhop], and point clouds (e.g., PointHop [zhang2020pointhop]). In each layer of SSL, the Saab transform [kuo2019interpretable], a variant of principal component analysis (PCA), is used as an alternative to nonlinear activation, thereby alleviating the sign confusion problem [kuo2016understanding]. Furthermore, the Saab transform is deemed more interpretable than nonlinear activation functions in CNNs [kuo2019interpretable, fan2020interpretability], as the model parameters are computed stage-by-stage in a feedforward manner, without backpropagation. Accordingly, the training of our SSL-based method is more efficient and interpretable than that of CNN models [chen2020pixelhop].
II-B Our Saab-based SSL Segmentation Framework
In this work, we are given a 2D MR image $x$ and its corresponding pixel-wise label map, which share the same horizontal and vertical dimensions. The gray-value image has a single channel, and the label of each pixel is encoded as a four-dimensional one-hot vector for the four tissue classes. The architecture of our framework is illustrated in Fig. 1, as detailed below.
II-B1 Module 1: Unsupervised Feature Selection
We first construct cascaded SSL units and max-pooling operations to extract the attributes at different spatial scales in the unsupervised Module 1. Similar to PixelHop [chen2020pixelhop], in each SSL unit, we construct a neighboring region on the image plane. For instance, in the first SSL unit, for the single-channel data, we construct the $3\times 3$ region for each pixel position. Each region is then flattened to a 9-dimensional vector. With a padding operation, $x$ is thereby transformed to a cuboid that holds one 9-dimensional vector at every pixel position. Then, the Saab transform is used for unsupervised dimension reduction in the channel direction. Each 9-dimensional vector is mapped to a feature vector whose dimension is a hyperparameter that controls the output dimension of the first PixelHop unit.
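The neighborhood construction described above can be sketched as follows. This is a minimal NumPy illustration rather than our implementation: the function name `extract_neighborhoods` and the use of zero padding are assumptions, since the padding mode is not specified here.

```python
import numpy as np


def extract_neighborhoods(x, size=3):
    """Collect the size x size neighborhood of every pixel.

    x: array of shape (H, W) or (H, W, C).
    Returns an array of shape (H, W, size * size * C), i.e., one flattened
    neighborhood vector per pixel position.
    """
    if x.ndim == 2:
        x = x[..., None]  # treat a gray-value image as single-channel
    h, w, _ = x.shape
    pad = size // 2
    # Zero padding keeps the spatial size unchanged (assumed padding mode).
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    shifted = [padded[i:i + h, j:j + w, :]
               for i in range(size) for j in range(size)]
    return np.concatenate(shifted, axis=-1)
```

For a single-channel image, each pixel is thus described by a 9-dimensional vector; for a multi-channel input with C channels, the same call yields vectors of length 9C.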
Specifically, the terms direct current (DC) and alternating current (AC) are adopted from circuit theory. In the first Saab transform, we configure one DC anchor vector and a set of AC anchor vectors, each of size 9. Each dimension of the output feature vector is then an affine transform of the 9-dimensional input vector, i.e., the inner product of the input with an anchor vector plus a bias term, and the Saab transform has a special design of the anchor vector and the bias term [kuo2019interpretable]. Similar to [kuo2019interpretable], we set the bias term and divide the anchor vectors into two categories: the DC anchor vector and the AC anchor vectors.
After computing the output of the first SSL unit, we halve its spatial size with the max-pooling operation and send the result to the next SSL unit. With the multi-channel input, the neighborhood construction involves a region that spans all input channels at each pixel position. Then, the neighborhood union is flattened to a vector, which is further processed by the Saab transform for dimension reduction. The detailed structure of our Module 1 is provided in Table I.
With the cascaded SSL units, the neighborhood union is correlated with more pixels of $x$ to extract global information. This process is similar to CNN models in that a larger receptive field is achieved in the deeper layers.
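To make the Saab step and the subsequent pooling concrete, the following is a minimal sketch of a plain (not channel-wise) Saab transform in the spirit of [kuo2019interpretable]. It assumes a constant unit-norm DC anchor, AC anchors obtained by PCA on the DC-removed input, and a single bias equal to the largest input norm so that all training responses are non-negative. The function names and this particular bias choice are assumptions for illustration and may differ from the configuration used in our experiments.

```python
import numpy as np


def fit_saab(X, num_kernels):
    """Fit one DC and (num_kernels - 1) AC anchor vectors.

    X: array of shape (N, D) holding flattened neighborhood vectors.
    Returns (anchors, bias) with anchors of shape (num_kernels, D).
    """
    _, d = X.shape
    if not 1 <= num_kernels <= d:
        raise ValueError("num_kernels must be between 1 and D")
    dc = np.ones(d) / np.sqrt(d)             # DC anchor: constant unit vector
    residual = X - np.outer(X @ dc, dc)      # remove the DC component
    cov = np.cov(residual, rowvar=False)     # (D, D) covariance of residual
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_kernels - 1]
    ac = eigvecs[:, order].T                 # leading principal directions
    anchors = np.vstack([dc, ac])
    # Assumed bias rule: a constant no smaller than the largest input norm.
    bias = float(np.linalg.norm(X, axis=1).max())
    return anchors, bias


def apply_saab(X, anchors, bias):
    """Affine transform of each row of X: inner products plus the bias."""
    return X @ anchors.T + bias


def max_pool_2x2(feature):
    """Halve the spatial size of an (H, W, C) feature map by 2 x 2 max-pooling."""
    h, w, c = feature.shape
    h2, w2 = h // 2, w // 2
    blocks = feature[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2, c)
    return blocks.max(axis=(1, 3))
```

Because every anchor has unit norm, each inner product is bounded in magnitude by the input norm, so adding the bias keeps the training responses non-negative. This is what allows the transform to dispense with a nonlinear activation.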
II-B2 Module 2: Supervised Feature Selection
In what follows, we resort to the supervised dimension reduction based on class-wise entropy-guided feature selection to tailor the discriminative feature for our segmentation task.
Because of the resolution reduction in each unit, the extracted features have different spatial sizes. The features in the later units correspond to a larger receptive field (i.e., more pixels) in the image and its label map. To match the features in these units with the original pixels, we resize the features of each unit to the spatial size of the original image. We therefore obtain one resized feature map per SSL unit.
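The resizing step can be sketched as below. The interpolation scheme is not specified here, so nearest-neighbor resizing is used as an assumption, and `upsample_nearest` is a hypothetical helper name.

```python
import numpy as np


def upsample_nearest(feature, target_hw):
    """Resize an (h, w, C) feature map to (H, W, C) by nearest-neighbor lookup."""
    h, w = feature.shape[:2]
    target_h, target_w = target_hw
    # Map every target row/column back to its source row/column.
    rows = np.arange(target_h) * h // target_h
    cols = np.arange(target_w) * w // target_w
    return feature[rows][:, cols]
```

After this step, the feature vector at each position of every resized map refers to the same pixel of the original image.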
Because different channels are of disparate importance to the segmentation decision, supervised feature selection is necessary. In related developments, PixelHop++ [chen2020pixelhop++] proposes to classify with each channel as a whole and to select the channels with a low cross-entropy score. However, this channel selection scheme is not applicable to segmentation, since the label in the segmentation task is pixel-wise and the feature of a pixel in each channel is only a scalar, which is difficult to use as a feature for a classifier.
Instead, we propose to select the channels with a small entropy for each class. Specifically, we encourage the features of pixels in each channel to be similar if the corresponding pixels carry the same class label. The entropy is computed from the features of the pixels of a given class in a given channel, where the class index ranges over the four classes in our segmentation task. After calculating the entropy of the four classes for each channel, we rank the channels by their entropy. Then, we select the top 80% of the channels, i.e., those with the smallest entropy, for the subsequent pixel-wise classification task.
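The selection rule can be illustrated with the sketch below. Since the exact entropy expression is not reproduced here, the code substitutes a histogram-based Shannon entropy of the feature values of each class, summed over the classes. This estimator, the number of bins, and the function names are assumptions for illustration only.

```python
import numpy as np


def class_wise_entropy(features, labels, num_classes=4, bins=32):
    """Sum over classes of the entropy of per-class feature values.

    features: (N, C) array with one row per pixel.
    labels:   (N,) integer class index per pixel.
    Returns an array of C scores; lower means more class-consistent.
    """
    _, num_channels = features.shape
    scores = np.zeros(num_channels)
    for ch in range(num_channels):
        values = features[:, ch]
        lo, hi = values.min(), values.max()
        if hi == lo:
            hi = lo + 1.0  # avoid zero-width bins for a constant channel
        edges = np.linspace(lo, hi, bins + 1)
        for k in range(num_classes):
            class_values = values[labels == k]
            if class_values.size == 0:
                continue
            hist, _ = np.histogram(class_values, bins=edges)
            p = hist[hist > 0] / class_values.size
            scores[ch] += -(p * np.log2(p)).sum()
    return scores


def select_channels(features, labels, keep_ratio=0.8, **kwargs):
    """Return the sorted indices of the lowest-entropy channels."""
    scores = class_wise_entropy(features, labels, **kwargs)
    num_keep = max(1, int(round(keep_ratio * scores.size)))
    return np.sort(np.argsort(scores)[:num_keep])
```

A channel whose values are nearly constant within each class receives a low score and is retained, whereas a channel whose values are spread widely within a class is more likely to be discarded.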
Table I: Detailed structure of Module 1.
|Input Size|Type|Filter Shape|
| |Saab Trans|kernels of|
| |Saab Trans|kernels of for F1 channels|
| |Saab Trans|kernels of for F2 channels|
| |Saab Trans|kernels of for F3 channels|
II-B3 Module 3: Information Fusion for Segmentation and Post-processing
With the features extracted by both the Saab transform and the class-wise entropy-guided selection, we concatenate them along the channel dimension to obtain a single fused feature map. Its channel dimension is the sum of the channels selected from all of the SSL units. Each feature vector on the spatial plane of the fused feature map corresponds to an original pixel in the image or its label map. Then, we carry out pixel-wise classification of these feature vectors with a classifier. We empirically choose extreme gradient boosting (XGBoost) [chen2015xgboost], which is an optimized distributed gradient boosting library designed to be highly efficient and flexible. XGBoost is trained to learn the mapping between the pixel-wise features and the ground-truth pixel class labels in our training set.
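The fusion step amounts to turning the resized feature maps into one feature matrix with a row per pixel, which any pixel-wise classifier can consume. The sketch below shows this bookkeeping with hypothetical helper names. The classifier itself is omitted to keep the example free of external dependencies; a gradient boosting model such as `xgboost.XGBClassifier` could be fit on the resulting pair.

```python
import numpy as np


def fuse_features(feature_maps):
    """Concatenate (H, W, C_i) maps along channels and flatten the pixels.

    Returns an array of shape (H * W, sum of C_i): one feature row per pixel.
    """
    fused = np.concatenate(feature_maps, axis=-1)
    h, w, c = fused.shape
    return fused.reshape(h * w, c)


def one_hot_to_index(label_map):
    """Convert an (H, W, K) one-hot label map to a flat vector of class indices."""
    return label_map.argmax(axis=-1).reshape(-1)


# Usage sketch: X and y are aligned pixel by pixel.
# X = fuse_features([f1, f2, f3, f4]); y = one_hot_to_index(label_map)
# A classifier exposing fit(X, y) and predict(X) can then be trained on them.
```

Because the reshape uses the same row-major pixel order for features and labels, row i of the feature matrix and entry i of the label vector always refer to the same pixel.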
We note that with a limited number of SSL units, it is difficult for the receptive field to cover all of the pixels for global perception. Conversely, too many SSL units lead to a very low resolution in the later units, which is not sufficient to support pixel-wise segmentation. In addition, the channel size of the later units becomes very large, leading to a long and less discriminative feature vector, which can distract the pixel-wise classification.
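This trade-off can be made tangible by tracking the resolution and the receptive field across units. The sketch below assumes a 3 x 3 neighborhood with stride 1 followed by 2 x 2 max-pooling with stride 2 in every unit, and a hypothetical 256-pixel input side; the actual input size is not assumed to match.

```python
def ssl_unit_geometry(input_size, num_units, window=3, pool=2):
    """List (unit, input resolution, receptive field) for cascaded SSL units.

    The receptive field is measured in original-image pixels along one axis,
    right after the neighborhood construction of each unit.
    """
    rows = []
    size, rf, jump = input_size, 1, 1
    for unit in range(1, num_units + 1):
        rf += (window - 1) * jump   # neighborhood widens the field
        rows.append((unit, size, rf))
        rf += (pool - 1) * jump     # pooling widens it slightly more
        jump *= pool                # one step now spans more input pixels
        size //= pool               # pooling halves the resolution
    return rows


for unit, size, rf in ssl_unit_geometry(256, 4):
    print(f"unit {unit}: resolution {size}, receptive field {rf}")
```

Under these assumptions, four units reach a receptive field of 38 pixels per axis while the last unit operates at one eighth of the input resolution, which illustrates why adding units trades spatial detail for context.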
To balance this conflict, we propose to validate the most effective number of SSL units and adopt the well-established post-processing tool of conditional random field (CRF) to further refine the segmentation results and obtain the final results of our framework.
To demonstrate the performance of our Saab transform-based SSL framework, we validated it on the Automated Cardiac Diagnosis Challenge (ACDC 2017) database, which contains 100 subjects. The cine MRI short-axis slices were acquired with 1.5T or 3.0T MRI scanners. The acquired slices covered the LV, RV, and MYO from the base (upper slice) to the apex (lower slice), with a slice thickness of 5–8 mm, an inter-slice gap of 5 or 10 mm, and a spatial resolution of 1.37–1.68 mm²/pixel.
For each patient, the delineations of the LV, RV, and MYO were obtained by two clinical experts. On average, each subject had about 27 labeled slices. We report the average Dice similarity score with 30 subjects for testing, 10 subjects for validation, and 50 or 60 subjects for training.
III-A Implementation Details
All the experiments were implemented using Python on a server with a Xeon E5 v4 CPU/Nvidia Tesla V100 GPU with 128GB memory. We also used the widely adopted deep learning library, PyTorch, to implement U-Net [ronneberger2015u] and AttnUNet [schlemper2019attention]. For a fair comparison, we resized all of the slices to a common size, consistent with the input of the U-Net models.
We empirically used four SSL units and set their output dimensions to 5, 10, 30, and 100, respectively. We note that the number of the Saab AC filters in the unsupervised dimension reduction procedure controls the preserved energy ratio.
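The relation between the number of AC filters and the preserved energy can be sketched as follows, under the usual PCA reading that the energy of a component is its eigenvalue. The function name is hypothetical, and the eigenvalues in the usage note are illustrative numbers rather than values measured in our experiments.

```python
import numpy as np


def preserved_energy_ratio(eigenvalues, num_ac_filters):
    """Fraction of the AC energy kept by the leading AC filters.

    eigenvalues: PCA eigenvalues of the DC-removed neighborhood vectors.
    """
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    total = ev.sum()
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return float(ev[:num_ac_filters].sum() / total)
```

For instance, with illustrative eigenvalues 4, 3, 2, and 1, keeping two AC filters preserves 7/10 of the AC energy, and keeping all four preserves all of it.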
III-B Experimental Results
Fig. 2 shows the segmentation results of U-Net with ResNet50 backbone and our SSL framework. We can see that SSL is able to achieve comparable or even better performance than the widely used U-Net models.
For quantitative evaluation, we compared the Dice similarity scores in Tables II and III, which used 50 and 60 subjects for training, respectively. Note that a larger Dice similarity score indicates better segmentation performance. The best results are bolded. With 50 subjects for training, our SSL framework outperformed U-Net [ronneberger2015u] and attention-based U-Net [schlemper2019attention] in all three classes. We can observe that, with relatively limited training data, the performance of the CNN models is inferior to that of our framework. In addition, the numbers of network parameters are provided and compared in Table II. The number of parameters of our SSL framework was about 200 times smaller than that of the popular U-Net structures [ronneberger2015u, schlemper2019attention]. Having far fewer parameters can largely alleviate the difficulty of training with a small amount of data. With 60 subjects for training, our SSL framework achieved better performance than the U-Net-based methods in terms of the average Dice similarity score.
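For reference, the Dice similarity score of a class is twice the overlap between the predicted and ground-truth masks divided by the sum of their sizes. The sketch below is a minimal version of this standard definition. The label indices and the convention of returning 1 when a class is absent from both maps are assumptions for illustration.

```python
import numpy as np


def dice_score(pred, target, label):
    """Dice similarity score of one class between two integer label maps."""
    p = pred == label
    t = target == label
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # class absent from both maps (assumed convention)
    return float(2.0 * np.logical_and(p, t).sum() / denom)


def mean_dice(pred, target, labels=(1, 2, 3)):
    """Average Dice over the foreground classes (hypothetical label indices)."""
    return float(np.mean([dice_score(pred, target, k) for k in labels]))
```

For example, if a class covers two predicted pixels and one ground-truth pixel with one pixel in common, its Dice score is 2 x 1 / (2 + 1) = 2/3.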
III-C Sensitivity Analysis and Ablation Study
With four SSL units, we achieved a state-of-the-art Dice similarity score in both the 50- and 60-subject training settings. The number of SSL units is important for our segmentation framework to balance efficiency and perception area. A low resolution makes it challenging to provide accurate information for fine-grained pixel-wise classification. In Fig. 3, we show the detailed sensitivity study using different numbers of SSL units. The standard deviation was computed over five random choices of training and validation splits. The class-wise entropy-guided feature selection was developed to simplify the subsequent classification modules. In addition, CRF was applied as a post-processing step. To demonstrate their effectiveness, we provide the ablation study in Table IV and the effect of CRF in Fig. 4.
Table IV: Ablation study of our SSL framework.
|Method|Dice similarity score|
|SSL without CRF|84.76%|
|SSL without entropy-guided feature selection|85.91%|
In this work, we presented a lightweight, interpretable, and fully automated SSL framework with the Saab transform to segment the LV, RV, and MYO from cine MRI. A novel class-wise entropy-guided feature selection was proposed to achieve accurate segmentation. Our experiments on the ACDC 2017 database with different numbers of training subjects demonstrated that our framework achieved superior performance compared with the U-Net-based approaches, with about 200 times fewer parameters.