Magnetic Resonance Imaging (MRI) is the standard-of-care imaging modality for non-invasive cardiac diagnosis, due to its high contrast sensitivity to soft tissue, good image quality, and lack of exposure to ionizing radiation. Cine cardiac MRI enables the acquisition of high resolution two-dimensional (2D) anatomical images of the heart throughout the cardiac cycle, capturing the full cardiac dynamics via multiple 2D+time short axis acquisitions spanning the whole heart.
Segmentation of the heart structures from these images enables measurement of important cardiac diagnostic indices such as myocardial mass and thickness, left/right ventricle (LV/RV) volumes, and ejection fraction. Furthermore, high quality personalized heart models can be generated for cardiac morphology assessment, treatment planning, and precise localization of pathologies during an image-guided intervention. Manual delineation is the standard cardiac image segmentation approach, but it is time consuming and susceptible to high inter- and intra-observer variability. Hence there is a critical need for semi-/fully-automatic methods for cardiac cine MRI segmentation. However, MR imaging artifacts such as bias fields, respiratory motion, intensity inhomogeneity, and fuzziness render the segmentation of heart structures challenging.
Cardiac MR segmentation techniques can be classified based on the amount of prior knowledge used during segmentation. First, the no-prior based methods rely solely on the image content to segment the heart structures based on intensity thresholds, and edge- and/or region-information. Hence these methods are often ineffective for the segmentation of ill-defined boundary regions. Second, deformable models such as active contours and level-set methods incorporate weak-prior information regarding the smoothness of the segmented boundaries; similarly, graph theoretical models assume connectivity between the neighboring pixels, providing piece-wise smooth segmentation results. Third, active shape and appearance models and atlas-based methods impose very strong-prior information regarding the geometry of the heart structures and are sometimes too restricted by the training set. These weak-/strong-prior based methods may overcome segmentation challenges in ill-defined boundary regions, but at a high computational cost. Lastly, machine learning based methods aim to predict the probability of each pixel in the image belonging to the foreground/background class based on either patch-wise or image-wise training. These methods are able to produce fast and accurate segmentation, provided the training set captures the population variability.
In the context of deep learning, Long et al. proposed the first fully convolutional network (FCN) for semantic image segmentation, exploiting the capability of Convolutional Neural Networks (CNNs) [4, 5, 6] to learn task-specific hierarchical features in an end-to-end manner. However, their initial adoption in the medical domain was challenging due to the limited availability of medical imaging data and associated costly manual annotation. These challenges were later circumvented by patch-based training, data augmentation, and transfer learning techniques [7, 8].
Specifically, in the context of cardiac image segmentation, Tran adapted an FCN architecture for segmentation of various cardiac structures from short-axis MR images. Similarly, Poudel et al. proposed a recurrent FCN architecture to leverage inter-slice spatial dependencies between the 2D cine MR slices. Avendi et al. reported improved accuracy and robustness of the LV segmentation by using the output of an FCN to initialize a deformable model. Further, Oktay et al. pre-trained an auto-encoder network on ground-truth segmentations and imposed anatomical constraints on a CNN by adding an $L_2$-loss between the auto-encoder representation of the output and that of the corresponding ground-truth segmentation. Several modifications to the FCN architecture and various post-processing schemes have also been proposed to improve the semantic segmentation results.
To improve the generalization performance of neural networks, various regularization techniques have been proposed. These include parameter norm penalty (e.g. weight decay), noise injection, dropout [17], adversarial training, and multi-task learning (MTL). In this paper we focus on MTL-based network regularization. When a network is trained on multiple related tasks, the inductive bias provided by the auxiliary tasks causes the model to prefer a hypothesis that explains more than one task. This helps the network ignore task-specific noise and hence focus on learning features relevant to multiple tasks, improving the generalization performance. Furthermore, MTL reduces the Rademacher complexity of the model (i.e. its ability to fit random noise), hence reducing the risk of overfitting. Overviews of MTL applied to deep neural networks are available in the literature.
MTL has been widely employed in computer vision problems due to the similarity between the various tasks being performed. An FCN architecture with a common encoder and task-specific decoders has been proposed to perform joint classification, detection, and semantic segmentation, targeting real-time applications such as autonomous driving. A similar single-encoder-multiple-decoder architecture performs semantic segmentation, depth regression, and instance segmentation simultaneously. That architecture was further expanded to automatically learn the weights for each task based on its uncertainty, obtaining state-of-the-art results.
In the context of medical image analysis, Moeskops et al. demonstrated the use of MTL for joint segmentation of six tissue types from brain MRI, the pectoral muscle from breast MRI, and the coronary arteries from cardiac Computed Tomography Angiography (CTA) images, with performance equivalent to networks trained on individual tasks. Similarly, Valindria et al. employed an MTL framework to improve the performance of multi-organ segmentation from CT and MR images, exploring various encoder-decoder network architectures. Specific to cardiac MR applications, Xue et al. proposed a network capable of learning multi-task relationships in a Bayesian framework to estimate various local/global LV indices for full quantification of the LV. Similarly, Dangi et al. performed joint segmentation and quantification of the LV myocardium using the learned task uncertainties to weigh the losses, improving upon the state-of-the-art results. Most of these MTL methods in medical image analysis try to perform various clinically relevant tasks simultaneously. However, the focus of this work is on improving the segmentation performance of various FCN architectures using MTL as a network regularizer.
In this work, we propose to use the rich information available in the distance map of the segmentation mask as an auxiliary task for the image segmentation network. Since each pixel in the distance map represents its distance from the closest object boundary, this representation is redundant and robust compared to the per-pixel image label used for semantic segmentation. Furthermore, the distance map represents the shape and boundary information of the object to be segmented. Hence, training the segmentation network on the additional task of predicting the distance map is equivalent to enforcing shape and boundary constraints for the segmentation task; hence the name distance map regularized convolutional neural network.
Related work to ours includes approaches that take an image and its semantic segmentation as input and predict the distance transform of the object instances, such that thresholding the distance map yields the instance segmentation. Similarly, Hayder et al. represent the boundary of the object instances using a truncated distance map, which is used to refine the instance segmentation result. However, unlike these methods, our goal is not to perform instance segmentation, but to refine the semantic segmentation result using the distance map as an auxiliary task. The most closely related work to ours addresses segmentation of building footprints from satellite images using an MTL framework. That study is limited to a binary segmentation task, and its network architecture results in increased model complexity. In contrast, in this work we perform both binary and multi-class segmentation and propose a generic framework to use MTL as a network regularizer, without increasing the model complexity.
The main contributions of this work are as follows:
We propose using distance map prediction as an auxiliary task in an MTL framework to impose soft-constraints on the shape and boundary of the objects to be segmented.
We demonstrate the application of the proposed regularization method for binary as well as multi-class segmentation on two different cardiac MRI datasets.
We demonstrate that the addition of a distance map regularization block improves the segmentation performance of three popular FCN segmentation architectures without increasing the model complexity and inference time.
We apply the task-uncertainty based weighting scheme to automatically learn weights for the segmentation and distance map regression tasks during training.
We demonstrate superior generalization ability when using the proposed regularization framework with significantly improved cross-dataset segmentation performance.
II Methods and Materials
II-A CNN for Semantic Image Segmentation
Let $x = \{x_i \in \mathbb{R},\ i \in \Omega\}$ be the input intensity image and $y = \{y_i \in \mathcal{L},\ i \in \Omega\}$ be the corresponding image segmentation, with $\mathcal{L} = \{1, \dots, C\}$ representing a set of $C$ class labels, and $\Omega$ representing the image domain. The task of a CNN-based segmentation model, with weights $W$, is to learn a discriminative function $f^W(\cdot)$ that models the underlying conditional probability distribution $P(y \mid x)$. The output of a CNN model is passed through a softmax function to produce a probability distribution over the class labels, such that the function can be learned by maximizing the likelihood:

$$P(y = c \mid x, W) = \mathrm{softmax}\left(f^W(x)\right)_c = \frac{\exp\left(f_c^W(x)\right)}{\sum_{c'=1}^{C} \exp\left(f_{c'}^W(x)\right)} \quad (1)$$

where $f_c^W(x)$ represents the $c$'th element of the vector $f^W(x)$. In practice, the negative log-likelihood, $-\log P(y \mid x, W)$, is minimized to learn the optimal CNN model weights, $W$. This is equivalent to minimizing the cross-entropy loss of the ground-truth segmentation, $y$, with respect to the softmax of the network output, $f^W(x)$.
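As an illustration of the softmax likelihood and its negative log-likelihood, the following minimal sketch computes the per-pixel cross-entropy for a toy set of network outputs. The function names and toy values are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of per-class scores."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_pixel, labels):
    """Mean negative log-likelihood of the true labels over all pixels."""
    nll = 0.0
    for logits, label in zip(logits_per_pixel, labels):
        nll -= math.log(softmax(logits)[label])
    return nll / len(labels)


# Toy example (assumed values): two pixels, three classes.
toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
toy_labels = [0, 2]
toy_loss = cross_entropy(toy_logits, toy_labels)
```

With uniform scores over two classes, the loss equals $\log 2$, the entropy of a fair coin, which is a convenient sanity check for an untrained binary segmentation network.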
A typical FCN architecture (Fig. 1) for image segmentation consists of an encoder and a decoder network. The encoder network includes multiple pooling (max/average pooling) layers applied after several convolution and non-linear activation layers (e.g. Rectified Linear Unit (ReLU)). It encodes hierarchical features important for the image segmentation task. To obtain per-pixel image segmentation, the global features obtained at the bottleneck layer need to be up-sampled to the original image resolution using the decoder network. The up-sampling filters can either be fixed (e.g. nearest-neighbor or bilinear up-sampling), or can be learned during training (deconvolutional layer). The final output of the decoder network is passed to a softmax classifier to obtain a per-pixel classification.
In the SegNet architecture (Fig. 1a), the decoder produces sparse feature maps by up-sampling its inputs using the pooling indices transferred from its encoder. These sparse feature maps are then convolved with a trainable filter bank to obtain dense feature maps, and are finally passed through a softmax classifier to produce per-pixel image segmentation. Since the decoder in the SegNet architecture uses only the global features obtained at the bottleneck layer of the encoder, the high frequency details in the segmentation are lost during the up-sampling process.
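The index-based up-sampling described above can be sketched in one dimension as follows. This is a simplified illustration with hypothetical function names; an actual SegNet operates on 2D feature maps with many channels.

```python
def max_pool_with_indices(x, k=2):
    """Non-overlapping 1D max pooling; returns the maxima and their positions in x."""
    pooled, indices = [], []
    for start in range(0, len(x) - k + 1, k):
        window = x[start:start + k]
        offset = max(range(k), key=lambda j: window[j])
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, size):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = [0.0] * size
    for value, idx in zip(pooled, indices):
        out[idx] = value
    return out


features = [1.0, 3.0, 2.0, 0.5, 4.0, 4.5]
pooled, idx = max_pool_with_indices(features)
sparse = max_unpool(pooled, idx, len(features))
# sparse == [0.0, 3.0, 2.0, 0.0, 0.0, 4.5]
```

The up-sampled map is sparse because only the positions of the maxima are restored, which is why the decoder follows it with trainable convolutions to densify the features.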
The U-Net architecture (Fig. 1b) introduced skip connections, concatenating the output of encoder layers at different resolutions to the input of the decoder layers at corresponding resolutions, hence preserving the high frequency details important for accurate image segmentation. Furthermore, the skip connections are known to ease the network optimization by introducing multiple paths for backpropagation of the gradients, hence mitigating the vanishing/exploding gradient problem. Similarly, skip connections also allow the network to learn lower level details in the outer layers and focus on learning the residual global features in the deeper encoder layers. Hence, the U-Net architecture is able to produce excellent segmentation results using limited training data with augmentation, and has been extensively used in medical image segmentation.
We observed that the learned deconvolution filters in the original U-Net architecture can be replaced by a SegNet-like decoder. The up-sampled sparse features are then densified using convolution filters and concatenated with the feature maps obtained from the skip connection. We hypothesize that this architecture, which uses the encoder pooling indices during up-sampling, ensures that most information flows through the bottleneck layer, such that the proposed regularization on the bottleneck layer has a stronger effect. We refer to this modified architecture as U-SegNet (Fig. 1e) throughout this paper, and use it as one of the baseline FCN architectures.
II-B Distance Map Regularization Network
The distance map of a binary segmentation mask can be obtained by computing the Euclidean distance of each pixel from the nearest boundary pixel. This representation provides rich, redundant, and robust information about the boundary, shape, and location of the object to be segmented. For a binary segmentation mask, where $\Omega_f$ is the set of foreground pixels, $\partial\Omega_f$ represents the boundary pixels, and $d(p, q)$ is the Euclidean distance between any two pixels $p$ and $q$, the truncated signed distance map, $D(p)$, is computed as:

$$D(p) = \begin{cases} \min\left(d_{\min}(p),\ T\right) & \text{if } p \in \Omega_f \\ -\min\left(d_{\min}(p),\ T\right) & \text{if } p \notin \Omega_f \text{ and } \Omega_f \neq \emptyset \\ -T & \text{if } \Omega_f = \emptyset \end{cases} \quad (2)$$

where $d_{\min}(p) = \min_{q \in \partial\Omega_f} d(p, q)$ is the minimum distance of pixel $p$ from the boundary pixels $\partial\Omega_f$. We truncate the signed distance map at a predefined distance threshold, $T$, hence assigning this maximum negative distance to the slices not containing any foreground pixels (i.e. $\Omega_f = \emptyset$), indicating that all pixels in the slice are far from the foreground (typically in the apical/basal regions of cardiac cine MR images).
The distance map regularization network is a SegNet-like decoder network, up-sampling the feature maps obtained at the bottleneck layer of the encoder to the size of the input image, with the number of output channels equal to the number of foreground classes (i.e. $C - 1$). For example, for a four class segmentation problem ($C = 4$): background, RV blood-pool, LV myocardium, and LV blood-pool, the regularization network has three output channels, predicting the truncated signed distance maps (Eq. 2) computed from the binary masks of the foreground classes: RV blood-pool, LV myocardium, and LV blood-pool.
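The construction of the regression targets can be sketched with a brute-force computation on small masks. The sketch assumes positive distances inside the object and negative distances outside (consistent with empty slices receiving the maximum negative distance), and it assumes a boundary pixel is a foreground pixel with a 4-connected non-foreground neighbour; both function names and the boundary definition are illustrative choices, and a production pipeline would likely use an optimized distance transform instead.

```python
import math

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def truncated_signed_distance_map(mask, threshold):
    """Brute-force truncated signed distance map of a binary mask (list of lists)."""
    rows, cols = len(mask), len(mask[0])

    def is_foreground(r, c):
        return 0 <= r < rows and 0 <= c < cols and mask[r][c] == 1

    # Assumption: boundary = foreground pixels with at least one 4-connected
    # neighbour that is background or lies outside the image.
    boundary = [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if mask[r][c] == 1
        and not all(is_foreground(r + dr, c + dc) for dr, dc in NEIGHBORS)
    ]
    if not boundary:
        # Empty slice: every pixel receives the maximum negative distance.
        return [[-float(threshold)] * cols for _ in range(rows)]

    dmap = []
    for r in range(rows):
        row = []
        for c in range(cols):
            d = min(math.hypot(r - br, c - bc) for br, bc in boundary)
            d = min(d, float(threshold))
            row.append(d if mask[r][c] == 1 else -d)
        dmap.append(row)
    return dmap


def per_class_distance_maps(label_map, num_classes, threshold):
    """One distance-map channel per foreground class (labels 1..num_classes-1)."""
    channels = []
    for label in range(1, num_classes):
        binary = [[1 if v == label else 0 for v in row] for row in label_map]
        channels.append(truncated_signed_distance_map(binary, threshold))
    return channels
```

For a four-class label map, `per_class_distance_maps(label_map, 4, threshold)` returns three channels, one per foreground class, matching the number of output channels of the regularization decoder.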
Fig. 2 shows the regularization network added to the bottleneck layer of existing FCN architectures. The network training loss is the weighted sum of the cross-entropy loss for segmentation and the mean absolute difference (MAD) loss between the predicted and the ground-truth distance maps. Although the predicted distance maps are similar to the ground-truth maps, we observed that they are not accurate enough to be used to refine the obtained segmentation. Therefore, we remove the regularization network after training, such that the original FCN architecture remains unchanged.
II-C MTL using Uncertainty to Weigh Losses
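We model the likelihood for a segmentation task as the squashed and scaled version of the model output through a softmax function:

$$P\left(y \mid f^W(x), \sigma\right) = \mathrm{softmax}\left(\frac{1}{\sigma^2} f^W(x)\right) \quad (3)$$

where $\sigma$ is a positive scalar, equivalent to the temperature, for the defined Gibbs/Boltzmann distribution. The magnitude of $\sigma$ determines how uniform the discrete distribution is, and hence relates to the uncertainty of the prediction measured in entropy. The log-likelihood for the segmentation task can be written as:

$$\log P\left(y = c \mid f^W(x), \sigma\right) = \frac{1}{\sigma^2} f_c^W(x) - \log \sum_{c'} \exp\left(\frac{1}{\sigma^2} f_{c'}^W(x)\right) \quad (4)$$

where $f_c^W(x)$ is the $c$'th element of the vector $f^W(x)$.

Similarly, for the regression task, we define our likelihood as a Laplacian distribution with its mean and scale parameter given by the neural network output:

$$P\left(y \mid f^W(x), \sigma\right) = \frac{1}{2\sigma} \exp\left(-\frac{\left|y - f^W(x)\right|}{\sigma}\right) \quad (5)$$

The log-likelihood for the regression task can be written as:

$$\log P\left(y \mid f^W(x), \sigma\right) \propto -\frac{1}{\sigma} \left|y - f^W(x)\right| - \log \sigma \quad (6)$$

where $\sigma$ is the neural network's observation noise parameter, capturing the noise in the output.

For a network with two outputs, a continuous output $y_1$ modeled with a Laplacian likelihood and a discrete output $y_2$ modeled with a softmax likelihood, the joint loss is:

$$\mathcal{L}(W, \sigma_1, \sigma_2) = -\log P\left(y_1, y_2 = c \mid f^W(x)\right) \approx \frac{1}{\sigma_1} \mathcal{L}_1(W) + \frac{1}{\sigma_2^2} \mathcal{L}_2(W) + \log \sigma_1 + \log \sigma_2 \quad (7)$$

where $\mathcal{L}_1(W) = \left|y_1 - f^W(x)\right|$ is the MAD loss of $y_1$ and $\mathcal{L}_2(W) = -\log \mathrm{softmax}\left(y_2, f^W(x)\right)$ is the cross-entropy loss of $y_2$. To arrive at Eq. 7, the two tasks are assumed independent, and the simplifying assumption $\frac{1}{\sigma_2} \sum_{c'} \exp\left(\frac{1}{\sigma_2^2} f_{c'}^W(x)\right) \approx \left(\sum_{c'} \exp\left(f_{c'}^W(x)\right)\right)^{1/\sigma_2^2}$, satisfied at $\sigma_2 \to 1$, has been made for the softmax likelihood, resulting in a simple optimization objective with improved empirical results. During training, the joint likelihood loss is optimized with respect to $W$ as well as $\sigma_1$, $\sigma_2$.

From Eq. 7, we can observe that the losses for the individual tasks are weighted by the inverse of their corresponding uncertainties ($\sigma_1$, $\sigma_2^2$) learned during training. Hence, the task with higher uncertainty will be weighted less and vice versa. Furthermore, the uncertainties cannot grow too large due to the penalization by the last two terms in Eq. 7. In practice, the network is trained to predict the log variance, $s := \log \sigma^2$, for numerical stability and to avoid any division by zero, such that the positive scale parameter, $\sigma^2$, can be computed via the exponential mapping $\exp(s)$.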
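A minimal numerical sketch of uncertainty-based loss weighting is given below. It assumes the joint loss has the form $\mathcal{L}_1/\sigma_1 + \mathcal{L}_2/\sigma_2^2 + \log\sigma_1 + \log\sigma_2$, with $\mathcal{L}_1$ the MAD loss and $\mathcal{L}_2$ the cross-entropy loss, and it parameterizes the uncertainties through their logarithms so that the weights stay positive. The function and argument names are hypothetical, and in a real network both log-parameters would be trainable scalars updated by the optimizer.

```python
import math


def joint_loss(mad_loss, ce_loss, log_sigma1, log_var2):
    """Uncertainty-weighted sum of a regression (MAD) and a segmentation (CE) loss.

    log_sigma1 = log(sigma_1):   Laplacian scale of the distance-map task.
    log_var2   = log(sigma_2^2): softmax temperature of the segmentation task.
    """
    weight_reg = math.exp(-log_sigma1)      # 1 / sigma_1
    weight_seg = math.exp(-log_var2)        # 1 / sigma_2^2
    penalty = log_sigma1 + 0.5 * log_var2   # log(sigma_1) + log(sigma_2)
    return weight_reg * mad_loss + weight_seg * ce_loss + penalty


# With both log-parameters at zero the tasks are weighted equally.
equal_weighting = joint_loss(1.0, 1.0, 0.0, 0.0)  # 1.0 + 1.0 + 0.0
```

Under this assumed form, for a fixed MAD loss value $\mathcal{L}_1$ the term $\mathcal{L}_1/\sigma_1 + \log\sigma_1$ is minimized at $\sigma_1 = \mathcal{L}_1$, which illustrates why the learned weights tend to bring the two losses to a similar scale while the logarithmic penalty keeps the uncertainties from growing without bound.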
II-D Clinical Datasets
II-D1 Left Ventricle Segmentation Challenge (LVSC)
This study employed 97 de-identified cardiac MRI image datasets from patients suffering from myocardial infarction and impaired LV contraction, available as a part of the STACOM 2011 Cardiac Atlas Segmentation Challenge project [37, 38] database (http://www.cardiacatlas.org/challenges/lv-segmentation-challenge/). Cine-MRI images in short-axis and long-axis views are available for each case. The images were acquired using the Steady-State Free Precession (SSFP) MR imaging protocol with the following settings: typical thickness , gap , TR , TE , flip angle , FOV , spatial resolution to and image matrix using multiple scanners from various manufacturers. Corresponding reference myocardium segmentations generated from an expert-analyzed 3D surface finite element model are available for all 97 cases throughout the cardiac cycle.
II-D2 Automated Cardiac Diagnosis Challenge (ACDC)
This dataset (https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html) is composed of short-axis cardiac cine-MR images acquired for 100 patients divided into 5 evenly distributed subgroups: normal, myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, and abnormal right ventricle, available as a part of the STACOM 2017 ACDC challenge. The acquisitions were obtained over a 6 year period using two MRI scanners of different magnetic strengths (1.5T and 3.0T). The images were acquired using the SSFP sequence with the following settings: thickness (sometimes ), interslice gap , spatial resolution to , to frames per cardiac cycle. Corresponding manual segmentations for RV blood-pool, LV myocardium, and LV blood-pool, performed by a clinical expert for the end-systole (ES) and end-diastole (ED) phases, are provided.
II-E Data Preprocessing and Augmentation
SimpleITK was used to resample short-axis images to a common resolution of 1.5625 and crop/zero-pad to a common size of and for the LVSC and ACDC datasets, respectively. Image intensities were clipped at the 99th percentile and normalized to zero mean and unit standard deviation. Each dataset was divided into train, validation, and test sets with five non-overlapping folds for cross-validation. The train-validation-test split was performed randomly over the whole LVSC dataset, whereas it was performed per subgroup (stratified sampling) for the ACDC dataset to maintain an even distribution of subgroups over the training, validation, and testing sets. The training images were subjected to a random similarity transform with: isotropic scaling of to , rotation of to , and translation of to of the image size along both x- and y-axes. The training sets for the LVSC and ACDC datasets included the original images along with two and four randomly transformed versions of each image, respectively. We heavily augment the ACDC dataset, as the labels are available only for the ES and ED phases, whereas we lightly augment the LVSC dataset, as the labels are available throughout the cardiac cycle.
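The intensity normalization step can be sketched as follows for a flattened list of pixel intensities. The nearest-rank percentile definition and the use of the population standard deviation are assumptions made for this illustration; the function name is hypothetical.

```python
import math
import statistics


def clip_and_normalize(pixels, percentile=99.0):
    """Clip intensities at the given percentile, then z-score normalize."""
    ordered = sorted(pixels)
    # Nearest-rank percentile (an assumed definition).
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    cap = ordered[rank - 1]
    clipped = [min(p, cap) for p in pixels]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped)
    if std == 0.0:
        return [0.0 for _ in clipped]
    return [(p - mean) / std for p in clipped]


# Toy image (assumed values): 99 ordinary intensities and one bright outlier.
toy_pixels = [float(i) for i in range(99)] + [10000.0]
normalized = clip_and_normalize(toy_pixels)
```

Clipping before normalization keeps a few very bright pixels from dominating the mean and standard deviation, so the bulk of the tissue intensities occupy a comparable range across scans.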
II-F Network Training and Testing Details
Networks implemented in PyTorch (https://github.com/pytorch/pytorch) were initialized with the Kaiming uniform initializer and trained for 50 and 100 epochs for the LVSC and ACDC datasets, respectively, with a batch size of 15 images. The RMSprop optimizer was used with a learning rate of 0.0001 and 0.0005 for single- and multi-task networks (0.0002 for DMR-UNet), respectively, decayed by 0.99 every epoch. We saved the model with the best average Dice coefficient on the validation set, and evaluated it on the test set.
Networks were trained on an NVIDIA Titan Xp GPU. The distance map threshold was selected empirically and set to a large value of , i.e. full distance map. To make sure the cross-entropy and the MAD loss are on a similar scale, their weights were initialized to 100 and 0.1, respectively. The auxiliary task of distance map regression was removed after network training. Table I provides the model complexity and average timing requirements for training and testing the models.
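As a small worked example of the learning rate schedule, the sketch below assumes the decay is applied multiplicatively once per epoch, so that the rate after $e$ epochs is the initial rate times $0.99^e$. The function name is hypothetical.

```python
def decayed_learning_rate(initial_lr, epoch, decay=0.99):
    """Learning rate after `epoch` multiplicative decay steps."""
    return initial_lr * decay ** epoch


# Multi-task setting: initial rate 0.0005, evaluated after 100 epochs.
final_lr = decayed_learning_rate(0.0005, 100)
```

Since $0.99^{100} \approx 0.366$, the rate at the end of a 100-epoch run is roughly a third of its initial value under this assumption.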
II-G Evaluation Metrics
We use overlap and surface distance measures to evaluate the segmentation. Additionally we evaluate the clinical indices associated with the segmentation.
II-G1 Dice and Jaccard Coefficients
Given two binary segmentation masks, $A$ and $B$, the Dice and Jaccard coefficients are defined as:

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad \mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $|\cdot|$ gives the cardinality (i.e. the number of non-zero elements) of each set. Maximum and minimum values (1.0 and 0.0, respectively) for the Dice and Jaccard coefficients occur when there is 100% and 0% overlap between the two binary segmentation masks, respectively.
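The two overlap measures, Dice $= 2|A \cap B| / (|A| + |B|)$ and Jaccard $= |A \cap B| / |A \cup B|$, can be computed on flattened binary masks as in the sketch below. Returning 1.0 when both masks are empty is a convention assumed here, not one stated in the text.

```python
def dice_jaccard(mask_a, mask_b):
    """Dice and Jaccard coefficients of two flattened binary masks."""
    a = {i for i, v in enumerate(mask_a) if v}
    b = {i for i, v in enumerate(mask_b) if v}
    intersection = len(a & b)
    union = len(a | b)
    if union == 0:
        return 1.0, 1.0  # Assumed convention: two empty masks agree perfectly.
    dice = 2.0 * intersection / (len(a) + len(b))
    jaccard = intersection / union
    return dice, jaccard


dice, jaccard = dice_jaccard([1, 1, 0, 0], [0, 1, 1, 0])
# dice == 0.5, jaccard == 1/3
```

The two coefficients are monotonically related through Jaccard $=$ Dice $/ (2 -$ Dice$)$, so they always rank a pair of segmentations in the same order.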
II-G2 Mean Surface Distance and Hausdorff Distance
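Let $S_A$ and $S_B$ be surfaces (with $N_A$ and $N_B$ points, respectively) corresponding to two binary segmentation masks, $A$ and $B$, respectively. The mean surface distance (MSD) is defined as:

$$\mathrm{MSD} = \frac{1}{2} \left( \frac{1}{N_A} \sum_{p \in S_A} d(p, S_B) + \frac{1}{N_B} \sum_{q \in S_B} d(q, S_A) \right)$$

Similarly, the Hausdorff distance (HD) is defined as:

$$\mathrm{HD} = \max \left( \max_{p \in S_A} d(p, S_B),\ \max_{q \in S_B} d(q, S_A) \right)$$

where $d(p, S) = \min_{q \in S} \lVert p - q \rVert_2$ is the minimum Euclidean distance of point $p$ from the points $q \in S$. Hence, MSD computes the mean distance between the two surfaces, whereas HD computes the largest distance between the two surfaces, and is sensitive to outliers.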
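The two surface measures can be illustrated on small point sets as follows. The sketch assumes the symmetric MSD is the average of the two directed mean distances (other averaging conventions exist) and takes HD as the larger of the two directed maxima; the function names are hypothetical.

```python
import math


def directed_distances(surface_a, surface_b):
    """Distance from each point of surface_a to its nearest point on surface_b."""
    return [min(math.dist(p, q) for q in surface_b) for p in surface_a]


def mean_surface_distance(surface_a, surface_b):
    """Average of the two directed mean distances (assumed convention)."""
    d_ab = directed_distances(surface_a, surface_b)
    d_ba = directed_distances(surface_b, surface_a)
    return 0.5 * (sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba))


def hausdorff_distance(surface_a, surface_b):
    """Largest nearest-neighbour distance in either direction."""
    d_ab = directed_distances(surface_a, surface_b)
    d_ba = directed_distances(surface_b, surface_a)
    return max(max(d_ab), max(d_ba))


# Toy surfaces (assumed values): surface_b has one stray point 3 units away.
surface_a = [(0.0, 0.0), (1.0, 0.0)]
surface_b = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
msd = mean_surface_distance(surface_a, surface_b)   # 0.5
hd = hausdorff_distance(surface_a, surface_b)       # 3.0
```

The single stray point raises HD to 3.0 while MSD stays at 0.5, which shows the outlier sensitivity of the Hausdorff distance noted above.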
II-G3 Ejection Fraction and Myocardial Mass
Ejection Fraction (EF) is an important cardiac parameter quantifying the cardiac output. EF is defined as:

$$\mathrm{EF} = \frac{\mathrm{EDV} - \mathrm{ESV}}{\mathrm{EDV}} \times 100\%$$

where EDV is the end-diastolic volume, and ESV is the end-systolic volume. Similarly, the myocardial mass can be computed from the myocardial volume as:

$$\text{Myocardial mass} = \text{Myocardial volume} \times \rho_{\text{myo}}$$

where $\rho_{\text{myo}}$ is the density of myocardial tissue. The correlation coefficients for the EF and myocardial mass computed from the ground-truth versus those computed from the automatic segmentation are reported. A correlation coefficient of $+1$ ($-1$) represents a perfect positive (negative) linear relationship, whereas that of $0$ represents no linear relationship between two variables.
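The clinical indices can be derived from a segmentation as in the sketch below. Volumes are obtained by counting labeled voxels and multiplying by the voxel size; the myocardial density default of 1.05 g/mL is a commonly quoted value used here purely as an assumption, and the function names are hypothetical.

```python
def volume_ml(num_voxels, spacing_mm):
    """Volume in millilitres from a voxel count and (x, y, z) spacing in mm."""
    sx, sy, sz = spacing_mm
    return num_voxels * sx * sy * sz / 1000.0  # 1 mL = 1000 mm^3


def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml


def myocardial_mass(volume_in_ml, density_g_per_ml=1.05):
    """Myocardial mass in grams; the default density is an assumed value."""
    return volume_in_ml * density_g_per_ml


ef = ejection_fraction(120.0, 48.0)  # 60.0 percent
```

For instance, an LV blood-pool of 120 mL at end-diastole and 48 mL at end-systole gives an EF of 60%.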
II-G4 Limits of Agreement
To compare the clinical indices computed from the ground-truth versus those obtained from the automatic segmentation, we take the difference between each pair of the two observations. The mean of these differences is termed the bias, and the 95% confidence interval, mean $\pm$ 1.96 $\times$ standard deviation (assuming a Gaussian distribution), is termed the limits of agreement (LoA).
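The bias and limits of agreement can be computed from paired measurements as sketched below. The sign convention (automatic minus reference) and the use of the sample standard deviation are assumptions of this illustration, and the toy values are made up.

```python
import statistics


def bias_and_limits_of_agreement(reference, automatic):
    """Bias (mean difference) and 95% limits of agreement of paired measurements."""
    differences = [a - r for r, a in zip(reference, automatic)]
    bias = statistics.fmean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation (assumed)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)


# Toy EF values in percent (assumed): differences are [2, -1, 2, 1].
reference_ef = [50.0, 60.0, 55.0, 65.0]
automatic_ef = [52.0, 59.0, 57.0, 66.0]
bias, (loa_low, loa_high) = bias_and_limits_of_agreement(reference_ef, automatic_ef)
```

Here the bias is 1.0 and the standard deviation of the differences is $\sqrt{2}$, so the limits of agreement are $1.0 \pm 1.96\sqrt{2}$.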
| Metric | ED: SN | ED: DMR SN | ED: UNet | ED: DMR UNet | ED: USN | ED: DMR USN | ES: SN | ES: DMR SN | ES: UNet | ES: DMR UNet | ES: USN | ES: DMR USN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dice (%) | 82.2 (3.8) | 83.0 (3.7) | 82.6 (3.8) | 83.4 (3.2) | 82.3 (3.9) | 83.4 (3.3) | 83.6 (3.8) | 84.0 (4.0) | 83.8 (4.4) | 84.4 (4.0) | 83.7 (4.1) | 84.4 (3.7) |
| Jaccard (%) | 69.9 (5.3) | 71.0 (5.1) | 70.5 (5.4) | 71.6 (4.5) | 70.1 (5.4) | 71.6 (4.7) | 72.0 (5.5) | 72.6 (5.7) | 72.3 (6.4) | 73.2 (5.8) | 72.1 (6.0) | 73.2 (5.5) |
| MSD (mm) | 0.30 (0.08) | 0.29 (0.08) | 0.29 (0.07) | 0.27 (0.07) | 0.29 (0.08) | 0.28 (0.07) | 0.30 (0.08) | 0.30 (0.10) | 0.30 (0.11) | 0.28 (0.09) | 0.30 (0.11) | 0.29 (0.09) |
| HD (mm) | 12.98 (2.64) | 12.72 (3.72) | 14.08 (5.20) | 13.10 (3.42) | 13.93 (4.68) | 12.97 (4.86) | 12.73 (3.36) | 12.66 (3.69) | 13.63 (5.20) | 13.28 (3.51) | 13.44 (4.10) | 12.81 (3.23) |
| Mass (g), bias (LoA) | 3.13 (35.60) | 2.67 (33.55) | 2.67 (36.60) | 4.30 (28.07) | 5.57 (39.56) | 2.12 (31.70) | 5.81 (32.45) | 4.49 (33.13) | 4.61 (33.84) | 5.23 (27.39) | 5.66 (36.96) | 4.33 (32.09) |

ED: end diastole; ES: end systole; SN: SegNet; USN: USegNet; DMR: distance map regularized.
III-A Segmentation and Clinical Indices Evaluation
The proposed Distance Map Regularized (DMR) SegNet, UNet, and USegNet models, along with the baseline models, were trained for the joint segmentation of RV blood-pool, LV myocardium, and LV blood-pool from the ACDC challenge dataset. The provided reference segmentation and the corresponding automatic segmentation obtained from the DMR-USegNet model for a test patient are shown in Fig. 3. Automatic segmentations obtained from all networks, for the ED and ES phases, are evaluated against the reference segmentation and summarized in Table II; also shown is the evaluation of the subsequently computed clinical indices.
We can observe consistent improvement in the segmentation performance of the models across all heart chambers and phases after the DM-Regularization. Specifically, there is statistically significant improvement (the Wilcoxon signed-rank test is performed for statistical significance testing) on several segmentation metrics for the SegNet and USegNet models. The same results carry over to the clinical indices, with better correlation and LoA on both EF and myocardial mass. Furthermore, the DMR-USegNet model outperforms all other evaluated networks.
To further analyze the improvement in segmentation performance, we performed a regional analysis by sub-dividing the slices into apical (25% slices in the apical region and beyond), basal (25% slices in the basal region and beyond), and mid-region (remaining 50% mid slices), based on the reference segmentation. From Fig. 3(a), we can observe the highest improvement in segmentation performance at the problematic apical slices; however, due to the small size of these regions, the improvement does not have a large effect on the overall performance, though it is of significance when constructing patient specific models of the heart for simulation purposes. We postulate that the additional constraint imposed by a very high negative distance assigned to empty apical/basal slices prevents the network from over-segmenting these regions, hence improving the regional Dice overlap and effectively reducing the overall Hausdorff distance.
To study the effect of the distance map regularization across the five patient sub-groups, we plot the average Dice coefficient for each sub-group computed for all six models in Fig. 5. As expected, we can observe the segmentation performance is better for the normal patients in comparison to the pathological cases. Furthermore, we can observe consistent improvement in segmentation performance after the distance map regularization for all patient sub-groups (with exception of MINF patients for SegNet and UNet models).
Table III shows the segmentation performance evaluated on the LVSC dataset, demonstrating superior performance of the DM regularized models over their baselines. Specifically, there is statistically significant improvement on the Dice and Jaccard metrics for the ED phase. Furthermore, the correlation and LoA for the myocardial mass improve after network regularization. The improvement in performance is consistent across different heart regions as shown in Fig. 3(b), excluding a slight performance reduction in basal slices for the regularized USegNet model, which is compensated by a large performance improvement in the apical slices. Interestingly, the simple encoder-decoder SegNet architecture performs on par with the UNet architecture with skip connections, likely due to the large training dataset; this is also reflected in the marginal improvement of segmentation performance on the ES phase after DM regularization. Lastly, the segmentation performance on the LVSC dataset is significantly lower than on the ACDC dataset due to the large variability and noise exhibited by the LVSC data as compared to the ACDC dataset.
III-B Cross Dataset Evaluation (Transfer Learning)
To analyze the generalization ability of our proposed distance map regularized networks, we performed a cross-dataset segmentation evaluation. The networks trained on ACDC dataset for five-fold cross-validation were tested on the LVSC dataset, and vice versa; such that, the majority voting scheme produced the final per-pixel segmentation. Quantitative evaluation of the automatic myocardium segmentation against the provided reference segmentation is summarized in Table IV. We can observe a significant boost in Dice coefficient of 8% to 13% for distance map regularized networks over their baseline models when trained on ACDC and tested on LVSC dataset (194 ED and ES volumes). Similarly, the distance map regularized models significantly outperform the baseline models by 17% to 41% improvement in Dice coefficient, when trained on LVSC and tested on ACDC dataset (200 ED and ES volumes).
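The fusion of the five cross-validation models by per-pixel majority voting can be sketched as below for flattened label maps. The function name and the toy predictions are illustrative; ties, which cannot occur with five voters and two labels but may with more classes, are resolved here in favour of the label that the earliest-listed model predicted.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-model label lists (all of equal length) by per-pixel majority."""
    fused = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest count; among equal
        # counts, the label encountered first is returned.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Toy predictions (assumed values) from five models for three pixels.
fold_predictions = [
    [0, 1, 2],
    [0, 1, 1],
    [0, 2, 1],
    [1, 1, 1],
    [0, 1, 2],
]
fused_labels = majority_vote(fold_predictions)
# fused_labels == [0, 1, 1]
```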
We further analyzed the feature maps across different layers of the baseline and distance map regularized networks (supplementary material Fig. S4). We can observe the baseline models preserve the intensity information and propagate it throughout the network, hence, they are more sensitive to the dataset-specific intensity distribution. On the other hand, the multi-task regularized networks focus more on the edges and other discriminative features, producing sparse feature maps, while ignoring dataset-specific intensity distribution. Moreover, from the feature maps at the decoding layers, we can observe a clear delineation of several cardiac structures in the regularized network, while those for the baseline models are less discriminative, and contain information about all structures present in the image. Hence, we verify that multi-task learning based distance map regularization helps the network learn generalizable features important for the segmentation task, demonstrated by their excellent transfer learning capabilities (see Supplementary Materials for details on feature visualization (Fig. S4) and network learning curves showing the robustness of distance map regularized models against overfitting (Fig. S2)).
IV Discussion and Conclusion
We performed an extensive study on the effects of hyper-parameters on the performance of the proposed regularization framework. Here we summarize the effects of the learned vs fixed task weighting, and various choices of the distance map threshold. Furthermore, we analyzed the distribution of network weights before and after regularization.
Task Weighting: During network training, we observed that the automatic weighting scheme learns to weigh the cross-entropy and MAD loss, such that, they are brought to the same scale. Hence, to accelerate network training, we initialize the weights to 100 and 0.1 for the cross-entropy and MAD loss, respectively. The learned weights for (cross-entropy, MAD) are around (1.5, 0.5) and (1.0, 1.0) for ACDC and LVSC dataset, respectively. Next, to determine the effect of learned task weighting scheme presented in section II-C, we analyzed the average Dice coefficient of the test set segmentation results for both ACDC (100 volumes) and LVSC (1050 volumes) datasets with fixed versus learned weighting. From Fig. 6, we can observe a slight improvement in average Dice coefficient with learned weights compared to fixed weighting.
Effect of Distance Map Threshold: We selected three extreme values for the distance map threshold: , , and , with weights initialized to , , and for (cross-entropy, MAD) loss, respectively. The networks were then trained with uncertainty based task weighting for a fixed number of epochs. Average Dice coefficient on the test-set obtained from the best performing models on the validation-set across five-fold cross-validation is summarized in Fig. 7. We can observe similar performance for different threshold values, demonstrating low sensitivity to the hyper-parameter. Hence, we decided to use a very high threshold of pixels, which is almost equivalent to regressing the full distance map and neglecting this hyper-parameter.
Network Weight Distribution: We also analyzed the weight distribution of the network before and after distance map regularization, as shown in the Supplementary Materials (Fig. S3). We observe that the number of non-zero weights increases after distance map regularization, indicating better utilization of the network capacity. However, this effect is less prominent in the U-Net architecture, as its deconvolution filters are learned rather than relying on the pooling indices for up-sampling. This gives the U-Net architecture more flexibility to pass information through the skip connections, such that the regularization imposed at the bottleneck layer has a reduced effect.
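One simple way to quantify such a change is the fraction of weights whose magnitude exceeds a small tolerance. The sketch below is only an illustration of that measurement: the helper name, the tolerance, and the example weight values are hypothetical and are not taken from the paper.

```python
def nonzero_fraction(weights, tol=1e-3):
    """Fraction of weights with magnitude above `tol` (treated as non-zero)."""
    weights = list(weights)
    if not weights:
        return 0.0
    return sum(abs(w) > tol for w in weights) / len(weights)


# Hypothetical flattened weights of one layer, before and after regularization.
baseline_weights = [0.0, 0.0004, -0.31, 0.0, 0.12, -0.0002]
regularized_weights = [0.05, -0.02, -0.28, 0.01, 0.10, -0.0002]

print(nonzero_fraction(baseline_weights))
print(nonzero_fraction(regularized_weights))
```

A higher fraction for the regularized network would correspond to the increase in non-zero weights described above; in practice the comparison would be made per layer over the full weight tensors.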
Summary: In this work, we proposed, implemented, and demonstrated the benefit of multi-task learning based regularization of fully convolutional networks for semantic image segmentation. We append a decoder network at the bottleneck layer of existing FCN architectures to perform the auxiliary task of distance map prediction; this decoder can be removed after training to reduce inference time. We automatically learn the weighting of the tasks based on their uncertainty. As the distance map contains robust information regarding the shape, location, and boundary of the object to be segmented, it helps the FCN encoder learn robust global features important for the segmentation task. Our experiments verify that introducing the distance map regularization improves the segmentation performance of three popular FCN architectures for both binary and multi-class segmentation across two publicly available cardiac cine MRI datasets. Specifically, we observed a significant improvement in segmentation performance in the problematic apical slices in response to the soft constraints imposed by the distance map regularization. We also found consistent segmentation improvement in all five patient sub-groups of the ACDC dataset. These improvements were also reflected in the computed clinical indices important for the diagnosis of various heart conditions. Furthermore, we demonstrated that the proposed regularization significantly improved the generalization ability of the networks in cross-dataset segmentation (transfer learning), with 8% to 41% improvement in Dice coefficient over the baseline FCN architectures.
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R35GM128877 and by the Office of Advanced Cyberinfrastructure of the National Science Foundation under Award No. 1808530. Ziv Yaniv’s work was supported by the Intramural Research Program of the U.S. National Institutes of Health, National Library of Medicine.