I Introduction
Magnetic Resonance Imaging (MRI) is the standard-of-care imaging modality for noninvasive cardiac diagnosis, owing to its high soft-tissue contrast, good image quality, and lack of exposure to ionizing radiation. Cine cardiac MRI enables the acquisition of high-resolution two-dimensional (2D) anatomical images of the heart throughout the cardiac cycle, capturing the full cardiac dynamics via multiple 2D+time short-axis acquisitions spanning the whole heart.
Segmentation of the heart structures from these images enables the measurement of important cardiac diagnostic indices such as myocardial mass and thickness, left/right ventricle (LV/RV) volumes, and ejection fraction. Furthermore, high-quality personalized heart models can be generated for cardiac morphology assessment, treatment planning, as well as precise localization of pathologies during an image-guided intervention. Manual delineation is the standard cardiac image segmentation approach, but it is not only time-consuming, it is also susceptible to high inter- and intra-observer variability. Hence, there is a critical need for semi-/fully-automatic methods for cardiac cine MRI segmentation. However, MR imaging artifacts such as bias fields, respiratory motion, and intensity inhomogeneity and fuzziness render the segmentation of heart structures challenging.
A comprehensive review of cardiac MR segmentation techniques can be found in [1, 2]. These techniques can be classified based on the amount of prior knowledge used during segmentation. First, no-prior methods rely solely on the image content, segmenting the heart structures based on intensity thresholds and edge and/or region information. Hence, these methods are often ineffective for the segmentation of ill-defined boundary regions. Second, deformable models such as active contours and level-set methods incorporate weak-prior information regarding the smoothness of the segmented boundaries; similarly, graph-theoretical models assume connectivity between neighboring pixels, providing piecewise-smooth segmentation results. Third, active shape and appearance models and atlas-based methods impose very strong prior information regarding the geometry of the heart structures and are sometimes too restricted by the training set. These weak-/strong-prior methods may overcome segmentation challenges in ill-defined boundary regions, but at a high computational cost. Lastly, machine learning-based methods aim to predict the probability of each pixel in the image belonging to the foreground/background class based on either patch-wise or image-wise training. These methods are able to produce fast and accurate segmentations, provided the training set captures the population variability.
In the context of deep learning, Long et al. [3] proposed the first fully convolutional network (FCN) for semantic image segmentation, exploiting the capability of Convolutional Neural Networks (CNNs) [4, 5, 6] to learn task-specific hierarchical features in an end-to-end manner. However, their initial adoption in the medical domain was challenging due to the limited availability of medical imaging data and the associated costly manual annotation. These challenges were later circumvented by patch-based training, data augmentation, and transfer learning techniques [7, 8]. Specifically, in the context of cardiac image segmentation, Tran [9] adapted a FCN architecture for the segmentation of various cardiac structures from short-axis MR images. Similarly, Poudel et al. [10] proposed a recurrent FCN architecture to leverage inter-slice spatial dependencies between the 2D cine MR slices. Avendi et al. [11] reported improved accuracy and robustness of LV segmentation by using the output of a FCN to initialize a deformable model. Further, Oktay et al. [12] pre-trained an autoencoder network on ground-truth segmentations and imposed anatomical constraints on a CNN by adding a loss between the autoencoder representation of the network output and that of the corresponding ground-truth segmentation. Several modifications to the FCN architecture and various post-processing schemes have been proposed to improve semantic segmentation results, as summarized in [13].
To improve the generalization performance of neural networks, various regularization techniques have been proposed. These include parameter norm penalties (e.g. weight decay [14]), noise injection [15], dropout [16, 17], adversarial training [18], and multi-task learning (MTL) [19]. In this paper, we focus on MTL-based network regularization. When a network is trained on multiple related tasks, the inductive bias provided by the auxiliary tasks causes the model to prefer a hypothesis that explains more than one task. This helps the network ignore task-specific noise and focus on learning features relevant to multiple tasks, improving generalization performance [19]. Furthermore, MTL reduces the Rademacher complexity [20] of the model (i.e. its ability to fit random noise), hence reducing the risk of overfitting. An overview of MTL applied to deep neural networks can be found in [21]. MTL has been widely employed in computer vision problems due to the similarity between the various tasks being performed. A FCN architecture with a common encoder and task-specific decoders was proposed in
[22] to perform joint classification, detection, and semantic segmentation, targeting real-time applications such as autonomous driving. A similar single-encoder, multiple-decoder architecture described in [23] performs semantic segmentation, depth regression, and instance segmentation simultaneously. The architecture was further extended by [24] to automatically learn the weights for each task based on its uncertainty, obtaining state-of-the-art results. In the context of medical image analysis, Moeskops et al. [25] demonstrated the use of MTL for the joint segmentation of six tissue types from brain MRI, the pectoral muscle from breast MRI, and the coronary arteries from cardiac Computed Tomography Angiography (CTA) images, with performance equivalent to networks trained on the individual tasks. Similarly, Valindria et al. [26] employed a MTL framework to improve the performance of multi-organ segmentation from CT and MR images, exploring various encoder-decoder network architectures. Specific to cardiac MR applications, Xue et al. [27] proposed a network capable of learning multi-task relationships in a Bayesian framework to estimate various local/global LV indices for full quantification of the LV. Similarly, Dangi et al. [28] performed joint segmentation and quantification of the LV myocardium using the learned task uncertainties to weigh the losses, improving upon state-of-the-art results. Most of these MTL methods in medical image analysis aim to perform several clinically relevant tasks simultaneously. In contrast, the focus of this work is on improving the segmentation performance of various FCN architectures using MTL as a network regularizer.

In this work, we propose to use the rich information available in the distance map of the segmentation mask as an auxiliary task for the image segmentation network. Since each pixel in the distance map represents its distance from the closest object boundary, this representation is redundant and robust compared to the per-pixel image labels used for semantic segmentation. Furthermore, the distance map encodes the shape and boundary information of the object to be segmented. Hence, training the segmentation network on the additional task of predicting the distance map is equivalent to enforcing shape and boundary constraints on the segmentation task; hence the name distance map regularized convolutional neural network.
Related work to ours includes [29], which takes an image and its semantic segmentation as input and predicts the distance transform of the object instances, such that thresholding the distance map yields the instance segmentation. Similarly, Hayder et al. [30] represent the boundary of the object instances using a truncated distance map, which is used to refine the instance segmentation result. However, unlike these methods, our goal is not to perform instance segmentation, but to refine the semantic segmentation result using the distance map as an auxiliary task. The most closely related work to ours is presented in [31] for the segmentation of building footprints from satellite images using a MTL framework. Their study is limited to a binary segmentation task, and their network architecture results in increased model complexity. In contrast, in this work, we perform both binary and multi-class segmentation and propose a generic framework for using MTL as a network regularizer, without increasing the model complexity.
The main contributions of this work are as follows:

We propose using distance map prediction as an auxiliary task in a MTL framework to impose soft constraints on the shape and boundary of the objects to be segmented.

We demonstrate the application of the proposed regularization method for binary as well as multi-class segmentation on two different cardiac MRI datasets.

We demonstrate that the addition of a distance map regularization block improves the segmentation performance of three popular FCN segmentation architectures without increasing the model complexity or inference time.

We apply the task-uncertainty-based weighting [24] scheme to automatically learn the weights for the segmentation and distance map regression tasks during training.

We demonstrate superior generalization ability when using the proposed regularization framework, with significantly improved cross-dataset segmentation performance.
II Methods and Materials
II-A CNN for Semantic Image Segmentation
Let $\mathbf{x}: \Omega \to \mathbb{R}$ be the input intensity image and $\mathbf{y}: \Omega \to \mathcal{L}$ be the corresponding image segmentation, with $\mathcal{L} = \{1, \ldots, C\}$ representing the set of class labels and $\Omega$ representing the image domain. The task of a CNN-based segmentation model with weights $\mathbf{W}$ is to learn a discriminative function $f(\mathbf{x}; \mathbf{W})$ that models the underlying conditional probability distribution $P(\mathbf{y} \mid \mathbf{x})$. The output of the CNN model is passed through a softmax function to produce a probability distribution over the class labels, such that the function can be learned by maximizing the likelihood:

$P(\mathbf{y} \mid \mathbf{x}; \mathbf{W}) = \prod_{i \in \Omega} \operatorname{softmax}\big(f_i(\mathbf{x}; \mathbf{W})\big)^{(y_i)}, \quad (1)$

where $f_i(\mathbf{x}; \mathbf{W})^{(c)}$ represents the $c$'th element of the logit vector $f_i(\mathbf{x}; \mathbf{W})$ at pixel $i$. In practice, the negative log-likelihood is minimized to learn the optimal CNN model weights, $\mathbf{W}^* = \arg\min_{\mathbf{W}} \big(-\log P(\mathbf{y} \mid \mathbf{x}; \mathbf{W})\big)$. This is equivalent to minimizing the cross-entropy loss between the ground-truth segmentation, $\mathbf{y}$, and the softmax of the network output, $\operatorname{softmax}\big(f(\mathbf{x}; \mathbf{W})\big)$.

A typical FCN architecture (Fig. 1
) for image segmentation consists of an encoder and a decoder network. The encoder network includes multiple pooling (max/average pooling) layers applied after several convolution and non-linear activation layers (e.g. the rectified linear unit (ReLU) [32]). It encodes hierarchical features important for the image segmentation task. To obtain a per-pixel image segmentation, the global features obtained at the bottleneck layer must be upsampled to the original image resolution by the decoder network. The upsampling filters can either be fixed (e.g. nearest-neighbor or bilinear upsampling) or learned during training (deconvolutional layers). The final output of the decoder network is passed to a softmax classifier to obtain a per-pixel classification.

In the SegNet [33] (Fig. 1a) architecture, the decoder produces sparse feature maps by upsampling its inputs using the pooling indices transferred from its encoder. These sparse feature maps are then convolved with a trainable filter bank to obtain dense feature maps, which are finally passed through a softmax classifier to produce the per-pixel image segmentation. Since the decoder in the SegNet architecture uses only the global features obtained at the bottleneck layer of the encoder, high-frequency details of the segmentation are lost during the upsampling process.
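The maximum-likelihood objective above reduces to the familiar per-pixel cross-entropy; a minimal NumPy sketch (the helper names are ours, purely illustrative, not the paper's code):

```python
import numpy as np

def softmax(logits):
    """Softmax over the class dimension (last axis)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean per-pixel negative log-likelihood (cross-entropy) loss.

    `logits`: (n_pixels, n_classes) raw network outputs f_i(x; W);
    `labels`: integer ground-truth class of each pixel.
    """
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```

Minimizing this quantity over the network weights is exactly the negative log-likelihood minimization described above.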
The UNet architecture [34] (Fig. 1b) introduced skip connections, concatenating the outputs of encoder layers at different resolutions to the inputs of the decoder layers at the corresponding resolutions, hence preserving the high-frequency details important for accurate image segmentation. Furthermore, skip connections are known to ease network optimization [35] by introducing multiple paths for the backpropagation of gradients, thus mitigating the vanishing/exploding gradient problem. Skip connections also allow the network to learn lower-level details in the outer layers and to focus on learning the residual global features in the deeper encoder layers. Hence, the UNet architecture is able to produce excellent segmentation results using limited training data with augmentation, and it has been used extensively in medical image segmentation.
We observed that the learned deconvolution filters in the original UNet architecture can be replaced by a SegNet-like decoder. The upsampled sparse features are then densified using convolution filters and concatenated with the feature maps obtained from the skip connection. We hypothesize that this architecture, by using the encoder pooling indices during upsampling, ensures that most information flows through the bottleneck layer, such that the proposed regularization on the bottleneck layer has a stronger effect. We refer to this modified architecture as USegNet (Fig. 1e) throughout this paper, and use it as one of the baseline FCN architectures.
II-B Distance Map Regularization Network
The distance map of a binary segmentation mask can be obtained by computing the Euclidean distance of each pixel from the nearest boundary pixel [36]. This representation provides rich, redundant, and robust information about the boundary, shape, and location of the object to be segmented. For a binary segmentation mask where $S$ is the set of foreground pixels, $\partial S$ represents the boundary pixels, and $d(p, q)$ is the Euclidean distance between any two pixels $p$ and $q$, the truncated signed distance map, $D(p)$, is computed as:

$D(p) = \begin{cases} \phantom{-}\min\big(d(p, \partial S), \gamma\big), & p \in S \\ -\min\big(d(p, \partial S), \gamma\big), & p \notin S \end{cases} \quad (2)$

where $d(p, \partial S) = \min_{q \in \partial S} d(p, q)$ is the minimum distance of pixel $p$ from the boundary pixels $\partial S$, and $\gamma$ is a predefined distance threshold at which the signed distance map is truncated. We assign the maximum negative distance, $-\gamma$, to slices not containing any foreground pixels (i.e. $S = \emptyset$), indicating that all pixels in the slice are far from the foreground (typically in the apical/basal regions of cardiac cine MR images).
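The truncated signed distance map of (Eq. 2) can be sketched with SciPy's Euclidean distance transform. The helper name and the exact boundary-distance convention below are our own; this is a sketch, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def truncated_signed_distance_map(mask, threshold=20.0):
    """Signed Euclidean distance map of a binary mask (cf. Eq. 2).

    Positive inside the foreground, negative outside, truncated at
    +/- `threshold`. Slices with no foreground pixels receive the
    maximum negative distance, -threshold.
    """
    mask = mask.astype(bool)
    if not mask.any():                            # empty apical/basal slice
        return np.full(mask.shape, -float(threshold))
    dist_out = distance_transform_edt(~mask)      # background: distance to foreground
    dist_in = distance_transform_edt(mask)        # foreground: distance to background
    sdm = np.where(mask, dist_in, -dist_out)
    return np.clip(sdm, -threshold, threshold)
```

During training, one such map would be computed per foreground class from its binary mask and regressed by the auxiliary decoder.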
The distance map regularization network is a SegNet-like decoder network that upsamples the feature maps obtained at the bottleneck layer of the encoder to the size of the input image, with the number of output channels equal to the number of foreground classes (i.e. $C - 1$). For example, for a four-class segmentation problem ($C = 4$) comprising background, RV blood-pool, LV myocardium, and LV blood-pool, the regularization network has three output channels, predicting the truncated signed distance maps (Eq. 2) computed from the binary masks of the foreground classes: RV blood-pool, LV myocardium, and LV blood-pool.

Fig. 2 shows the regularization network added to the bottleneck layer of existing FCN architectures. The network training loss is the weighted sum of the cross-entropy loss for the segmentation and the mean absolute difference (MAD) loss between the predicted and ground-truth distance maps. Although the predicted distance maps are similar to the ground-truth maps, we observed that they are not accurate enough to be used to refine the obtained segmentation. Therefore, we remove the regularization network after training, such that the original FCN architecture remains unchanged.
II-C MTL using Uncertainty to Weigh Losses
We model the likelihood for the segmentation task as a squashed and scaled version of the model output through a softmax function:

$P(y_i = c \mid \mathbf{x}; \mathbf{W}, \sigma) = \operatorname{softmax}\Big(\frac{1}{\sigma^2} f_i(\mathbf{x}; \mathbf{W})\Big)^{(c)}, \quad (3)$

where $\sigma$ is a positive scalar, equivalent to the temperature of the defined Gibbs/Boltzmann distribution. The magnitude of $\sigma$ determines how uniform the discrete distribution is, and hence relates to the uncertainty of the prediction, as measured in entropy. The log-likelihood for the segmentation task can be written as:

$\log P(y_i = c \mid \mathbf{x}; \mathbf{W}, \sigma) = \frac{1}{\sigma^2} f_i^{(c)}(\mathbf{x}; \mathbf{W}) - \log \sum_{c'} \exp\Big(\frac{1}{\sigma^2} f_i^{(c')}(\mathbf{x}; \mathbf{W})\Big), \quad (4)$

where $f_i^{(c)}(\mathbf{x}; \mathbf{W})$ is the $c$'th element of the vector $f_i(\mathbf{x}; \mathbf{W})$.
Similarly, for the regression task, we define our likelihood as a Laplace distribution with its mean given by the neural network output and scale parameter $\sigma$:

$P(\mathbf{d} \mid \mathbf{x}; \mathbf{W}, \sigma) = \frac{1}{2\sigma} \exp\Big(-\frac{\lvert \mathbf{d} - f(\mathbf{x}; \mathbf{W}) \rvert}{\sigma}\Big), \quad (5)$

The log-likelihood for the regression task can be written as:

$\log P(\mathbf{d} \mid \mathbf{x}; \mathbf{W}, \sigma) \propto -\frac{1}{\sigma} \lvert \mathbf{d} - f(\mathbf{x}; \mathbf{W}) \rvert - \log \sigma, \quad (6)$

where $\sigma$ is the neural network's observation noise parameter, capturing the amount of noise in the output.
For a network with two outputs, a continuous output modeled with a Laplace likelihood and a discrete output modeled with a softmax likelihood, the joint loss is:

$\mathcal{L}(\mathbf{W}, \sigma_1, \sigma_2) = \frac{1}{\sigma_1} \mathcal{L}_1(\mathbf{W}) + \frac{1}{\sigma_2^2} \mathcal{L}_2(\mathbf{W}) + \log \sigma_1 + \log \sigma_2, \quad (7)$

where $\mathcal{L}_1(\mathbf{W})$ is the MAD loss of the distance map regression and $\mathcal{L}_2(\mathbf{W})$ is the cross-entropy loss of the segmentation. To arrive at (Eq. 7), the two tasks are assumed independent, and the simplifying assumption $\frac{1}{\sigma_2^2} \sum_{c'} \exp\big(\frac{1}{\sigma_2^2} f_i^{(c')}\big) \approx \big(\sum_{c'} \exp f_i^{(c')}\big)^{1/\sigma_2^2}$, which holds exactly at $\sigma_2 = 1$, has been made for the softmax likelihood, resulting in a simple optimization objective with improved empirical results [24]. During training, the joint likelihood loss is optimized with respect to $\mathbf{W}$ as well as $\sigma_1$ and $\sigma_2$.
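The uncertainty-weighted objective can be sketched as a scalar function of the two task losses and the learned log variances. The factor conventions below follow Kendall et al. [24] and are an assumption rather than the paper's exact code:

```python
import math

def joint_loss(mad_loss, ce_loss, s1, s2):
    """Uncertainty-weighted multi-task loss, a scalar sketch of Eq. 7.

    s1 = log(sigma_1^2) for the Laplace (MAD) regression task and
    s2 = log(sigma_2^2) for the softmax (cross-entropy) segmentation
    task; both are free parameters learned jointly with the network
    weights. The exact factor conventions are an assumption.
    """
    term_reg = math.exp(-0.5 * s1) * mad_loss + 0.5 * s1  # (1/sigma_1) L_1 + log sigma_1
    term_seg = math.exp(-s2) * ce_loss + 0.5 * s2         # (1/sigma_2^2) L_2 + log sigma_2
    return term_reg + term_seg
```

Raising a log variance shrinks that task's weighted loss while the additive penalty grows, which is exactly the trade-off that keeps the uncertainties bounded.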
From (Eq. 7), we can observe that the losses for the individual tasks are weighted by the inverses of their corresponding uncertainties ($\sigma_1$, $\sigma_2^2$) learned during training. Hence, the task with higher uncertainty is weighted less, and vice versa. Furthermore, the uncertainties cannot grow too large, due to the penalization imposed by the last two terms in (Eq. 7). In practice, the network is trained to predict the log variance, $s := \log \sigma^2$, for numerical stability and to avoid division by zero, such that the positive scale parameter can be recovered via the exponential mapping $\sigma^2 = \exp(s)$.

II-D Clinical Datasets
II-D1 Left Ventricle Segmentation Challenge (LVSC)
This study employed 97 de-identified cardiac MRI datasets from patients suffering from myocardial infarction and impaired LV contraction, available as part of the STACOM 2011 Cardiac Atlas Segmentation Challenge project [37, 38] database (http://www.cardiacatlas.org/challenges/lv-segmentation-challenge/). Cine MRI images in short-axis and long-axis views are available for each case. The images were acquired using the Steady-State Free Precession (SSFP) MR imaging protocol, with the typical slice thickness, gap, TR, TE, flip angle, FOV, spatial resolution, and image matrix varying across the multiple scanners from various manufacturers. Corresponding reference myocardium segmentations, generated from an expert-analyzed 3D surface finite element model, are available for all 97 cases throughout the cardiac cycle.
II-D2 Automated Cardiac Diagnosis Challenge (ACDC)
This dataset (https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html) is composed of short-axis cardiac cine-MR images acquired for 100 patients divided into 5 evenly distributed subgroups: normal, myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, and abnormal right ventricle, available as part of the STACOM 2017 ACDC challenge [39]. The acquisitions were obtained over a 6-year period using two MRI scanners of different magnetic field strengths (1.5 T and 3.0 T). The images were acquired using the SSFP sequence, with the slice thickness, inter-slice gap, spatial resolution, and number of frames per cardiac cycle varying across acquisitions. Corresponding manual segmentations of the RV blood-pool, LV myocardium, and LV blood-pool, performed by a clinical expert for the end-systole (ES) and end-diastole (ED) phases, are provided.
II-E Data Preprocessing and Augmentation
SimpleITK [40] was used to resample the short-axis images to a common in-plane resolution of 1.5625 and to crop/zero-pad them to a common size for the LVSC and ACDC datasets. Image intensities were clipped at the 99th percentile and normalized to zero mean and unit standard deviation. Each dataset was divided into train, validation, and test sets with five non-overlapping folds for cross-validation. The train-validation-test split was performed randomly over the whole LVSC dataset, whereas it was performed per subgroup (stratified sampling) for the ACDC dataset to maintain an even distribution of subgroups over the training, validation, and testing sets. The training images were subjected to a random similarity transform comprising isotropic scaling, rotation, and translation along both the x- and y-axes. The training set for the LVSC and ACDC datasets included the original images along with two and four randomly transformed versions of each image, respectively. We heavily augment the ACDC dataset, as its labels are available only for the ES and ED phases, whereas we lightly augment the LVSC dataset, as its labels are available throughout the cardiac cycle.

II-F Network Training and Testing Details
Networks, implemented in PyTorch (https://github.com/pytorch/pytorch), were initialized with the Kaiming uniform initializer [41] and trained for 50 and 100 epochs for the LVSC and ACDC datasets, respectively, with a batch size of 15 images. The RMSprop optimizer [42], with a learning rate of 0.0001 and 0.0005 for the single- and multi-task networks (0.0002 for DMR-UNet), respectively, decayed by a factor of 0.99 every epoch, was used. We saved the model with the best average Dice coefficient on the validation set and evaluated it on the test set. Networks were trained on an NVIDIA Titan Xp GPU. The distance map threshold was selected empirically and set to a large value, i.e. effectively the full distance map. To ensure the cross-entropy and MAD losses were on a similar scale, their weights were initialized to 100 and 0.1, respectively. The auxiliary distance map regression task was removed after network training. Table I provides the model complexity and average timing requirements for training and testing the models.
Table I: Model complexity and average timing requirements for training and testing the models.

Model          #Parameters           Train                  Test (ms/volume)
               ACDC     LVSC         ACDC       LVSC        Train     Test
SegNet         2.11     14.80        71         69          2.96      2.96
UNet           1.96     14.91        79         71          4.10      4.10
USegNet        2.29     13.59        81         77          4.33      4.33
DMR-SegNet     4.62     20.32        69 (149)   65 (93)     3.56      2.96
DMR-UNet       4.49     18.60        73 (154)   67 (97)     4.70      4.10
DMR-USegNet    4.65     20.11        69 (149)   65 (97)     4.93      4.33
II-G Evaluation Metrics
We use overlap and surface distance measures to evaluate the segmentation quality. Additionally, we evaluate the clinical indices computed from the segmentations.
II-G1 Dice and Jaccard Coefficients
Given two binary segmentation masks, $A$ and $B$, the Dice and Jaccard coefficients are defined as:

$\text{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad \text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \quad (8)$

where $|\cdot|$ gives the cardinality (i.e. the number of non-zero elements) of a set. The maximum and minimum values (1.0 and 0.0, respectively) of the Dice and Jaccard coefficients occur when there is 100% and 0% overlap between the two binary segmentation masks, respectively.
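A minimal NumPy sketch of these overlap measures (our own helper, assuming non-empty masks):

```python
import numpy as np

def dice_jaccard(a, b):
    """Dice and Jaccard coefficients between two binary masks (Eq. 8)."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()          # |A intersect B|
    dice = 2.0 * inter / (a.sum() + b.sum())
    jaccard = inter / np.logical_or(a, b).sum() # |A union B| in the denominator
    return float(dice), float(jaccard)
```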
II-G2 Mean Surface Distance and Hausdorff Distance
Let $\mathcal{A} = \{a_1, \ldots, a_n\}$ and $\mathcal{B} = \{b_1, \ldots, b_m\}$ be the surfaces (with $n$ and $m$ points, respectively) corresponding to two binary segmentation masks, $A$ and $B$. The mean surface distance (MSD) is defined as:

$\text{MSD}(\mathcal{A}, \mathcal{B}) = \frac{1}{2} \Big( \frac{1}{n} \sum_{i=1}^{n} d(a_i, \mathcal{B}) + \frac{1}{m} \sum_{j=1}^{m} d(b_j, \mathcal{A}) \Big), \quad (9)$

Similarly, the Hausdorff distance (HD) is defined as:

$\text{HD}(\mathcal{A}, \mathcal{B}) = \max\Big\{ \max_{i}\, d(a_i, \mathcal{B}),\ \max_{j}\, d(b_j, \mathcal{A}) \Big\}, \quad (10)$

where $d(p, \mathcal{S}) = \min_{q \in \mathcal{S}} \lVert p - q \rVert_2$ is the minimum Euclidean distance of a point $p$ from the points of the surface $\mathcal{S}$. Hence, MSD computes the mean distance between the two surfaces, whereas HD computes the largest distance between the two surfaces and is therefore sensitive to outliers.
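Both surface measures can be sketched over explicit point sets. This is a brute-force O(nm) illustration with our own helper name, not the evaluation code used in the paper:

```python
import numpy as np

def msd_hd(p, q):
    """Mean surface distance and Hausdorff distance between two point
    sets p (n, dim) and q (m, dim), per Eqs. 9 and 10."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (n, m) pairwise distances
    d_pq, d_qp = d.min(axis=1), d.min(axis=0)                   # point-to-surface distances
    msd = 0.5 * (d_pq.mean() + d_qp.mean())
    hd = max(d_pq.max(), d_qp.max())
    return float(msd), float(hd)
```

Averaging both directions makes MSD symmetric, while the outer max in HD picks up the single worst boundary disagreement.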
II-G3 Ejection Fraction and Myocardial Mass
Ejection fraction (EF) is an important cardiac parameter quantifying the cardiac output. EF is defined as:

$\text{EF} = \frac{\text{EDV} - \text{ESV}}{\text{EDV}} \times 100\%, \quad (11)$

where EDV is the end-diastolic volume and ESV is the end-systolic volume. Similarly, the myocardial mass can be computed from the myocardial volume as:

$\text{Mass} = \rho_{\text{myo}} \cdot V_{\text{myo}}, \quad (12)$

where $V_{\text{myo}}$ is the myocardial volume and $\rho_{\text{myo}}$ is the density of the myocardial tissue. The correlation coefficients between the EF and myocardial mass computed from the ground-truth segmentation and those computed from the automatic segmentation are reported. A correlation coefficient of $+1$ ($-1$) represents a perfect positive (negative) linear relationship, whereas a coefficient of $0$ represents no linear relationship between the two variables.
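The two clinical indices can be sketched directly. The myocardial tissue density of 1.05 g/ml is a commonly used literature value, assumed here because the paper's constant is not given in the text:

```python
def ejection_fraction(edv, esv):
    """Ejection fraction (%) from end-diastolic and end-systolic volumes (Eq. 11)."""
    return 100.0 * (edv - esv) / edv

def myocardial_mass(myo_volume_ml, density_g_per_ml=1.05):
    """Myocardial mass (g) from myocardial volume (ml), per Eq. 12.

    The default density of 1.05 g/ml is an assumed, commonly used
    literature value, not a constant taken from the paper.
    """
    return myo_volume_ml * density_g_per_ml
```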
II-G4 Limits of Agreement
To compare the clinical indices computed from the ground-truth segmentation versus those obtained from the automatic segmentation, we take the difference between each pair of observations. The mean of these differences is termed the bias, and the 95% confidence interval, mean ± 1.96 × standard deviation (assuming a Gaussian distribution), is termed the limits of agreement (LoA).

III Results




Table II: Segmentation and clinical-index evaluation on the ACDC dataset at end diastole (ED) and end systole (ES); values are mean (standard deviation). SN = SegNet, USN = USegNet; DMR denotes the distance map regularized variant.

End Diastole (ED)
Metric               SN             DMR-SN         UNet           DMR-UNet       USN            DMR-USN
Dice (%)             82.2 (3.8)     83.0 (3.7)     82.6 (3.8)     83.4 (3.2)     82.3 (3.9)     83.4 (3.3)
Jaccard (%)          69.9 (5.3)     71.0 (5.1)     70.5 (5.4)     71.6 (4.5)     70.1 (5.4)     71.6 (4.7)
MSD (mm)             0.30 (0.08)    0.29 (0.08)    0.29 (0.07)    0.27 (0.07)    0.29 (0.08)    0.28 (0.07)
HD (mm)              12.98 (2.64)   12.72 (3.72)   14.08 (5.20)   13.10 (3.42)   13.93 (4.68)   12.97 (4.86)
Mass (Corr)          0.909          0.918          0.901          0.946          0.883          0.927
Mass (g) Bias (LoA)  3.13 (35.60)   2.67 (33.55)   2.67 (36.60)   4.30 (28.07)   5.57 (39.56)   2.12 (31.70)

End Systole (ES)
Metric               SN             DMR-SN         UNet           DMR-UNet       USN            DMR-USN
Dice (%)             83.6 (3.8)     84.0 (4.0)     83.8 (4.4)     84.4 (4.0)     83.7 (4.1)     84.4 (3.7)
Jaccard (%)          72.0 (5.5)     72.6 (5.7)     72.3 (6.4)     73.2 (5.8)     72.1 (6.0)     73.2 (5.5)
MSD (mm)             0.30 (0.08)    0.30 (0.10)    0.30 (0.11)    0.28 (0.09)    0.30 (0.11)    0.29 (0.09)
HD (mm)              12.73 (3.36)   12.66 (3.69)   13.63 (5.20)   13.28 (3.51)   13.44 (4.10)   12.81 (3.23)
Mass (Corr)          0.923          0.919          0.914          0.946          0.896          0.923
Mass (g) Bias (LoA)  5.81 (32.45)   4.49 (33.13)   4.61 (33.84)   5.23 (27.39)   5.66 (36.96)   4.33 (32.09)
III-A Segmentation and Clinical Indices Evaluation
The proposed distance map regularized (DMR) SegNet, UNet, and USegNet models, along with the baseline models, were trained for the joint segmentation of the RV blood-pool, LV myocardium, and LV blood-pool from the ACDC challenge dataset. The provided reference segmentation and the corresponding automatic segmentation obtained from the DMR-USegNet model for a test patient are shown in Fig. 3. The automatic segmentations obtained from all networks, for the ED and ES phases, are evaluated against the reference segmentation and summarized in Table II; also shown is the evaluation of the subsequently computed clinical indices.
We can observe a consistent improvement in the segmentation performance of the models across all heart chambers and phases after DM regularization. Specifically, there is a statistically significant improvement (assessed with the Wilcoxon signed-rank test) on several segmentation metrics for the SegNet and USegNet models. The same trend manifests in the clinical indices, with better correlation and LoA for both EF and myocardial mass. Furthermore, the DMR-USegNet model outperforms all other evaluated networks.

To further analyze the improvement in segmentation performance, we performed a regional analysis by subdividing the slices into apical (25% of slices in the apical region and beyond), basal (25% of slices in the basal region and beyond), and mid-region (the remaining 50% of mid slices), based on the reference segmentation. From Fig. 3(a), we can observe the highest improvement in segmentation performance at the problematic apical slices [39]; however, due to the small size of these regions, the improvement does not have a large effect on the overall performance, though it is of significance when constructing patient-specific models of the heart for simulation purposes [43]. We postulate that the additional constraint imposed by the very high negative distance assigned to empty apical/basal slices prevents the network from over-segmenting these regions, hence improving the regional Dice overlap and effectively reducing the overall Hausdorff distance.

To study the effect of the distance map regularization across the five patient subgroups, we plot the average Dice coefficient for each subgroup computed for all six models in Fig. 5. As expected, we can observe that the segmentation performance is better for the normal patients than for the pathological cases. Furthermore, we can observe a consistent improvement in segmentation performance after distance map regularization for all patient subgroups (with the exception of the MINF patients for the SegNet and UNet models).

Table III shows the segmentation performance evaluated on the LVSC dataset, demonstrating the superior performance of the DM-regularized models over their baselines. Specifically, there is a statistically significant improvement in the Dice and Jaccard metrics for the ED phase. Furthermore, the correlation and LoA for the myocardial mass improve after network regularization. The improvement in performance is consistent across the different heart regions, as shown in Fig. 3(b), excluding a slight performance reduction in the basal slices for the regularized USegNet model, which is compensated by a large performance improvement in the apical slices. Interestingly, the simple encoder-decoder SegNet architecture performs on par with the UNet architecture with skip connections, likely due to the large training dataset; this is also reflected in the marginal improvement in segmentation performance for the ES phase after DM regularization. Lastly, the segmentation performance on the LVSC dataset is significantly lower than on the ACDC dataset, due to the larger variability and noise exhibited by the LVSC data compared to the ACDC data.


III-B Cross-Dataset Evaluation (Transfer Learning)
To analyze the generalization ability of our proposed distance map regularized networks, we performed a cross-dataset segmentation evaluation. The networks trained on the ACDC dataset with five-fold cross-validation were tested on the LVSC dataset, and vice versa, with a majority voting scheme over the five models producing the final per-pixel segmentation. Quantitative evaluation of the automatic myocardium segmentation against the provided reference segmentation is summarized in Table IV. We can observe a significant boost in Dice coefficient of 8% to 13% for the distance map regularized networks over their baseline models when trained on ACDC and tested on the LVSC dataset (194 ED and ES volumes). Similarly, the distance map regularized models significantly outperform the baseline models, with a 17% to 41% improvement in Dice coefficient, when trained on LVSC and tested on the ACDC dataset (200 ED and ES volumes).
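The per-pixel majority voting used to fuse the five cross-validation models can be sketched as follows (our own helper; ties resolve toward the lower class index via argmax):

```python
import numpy as np

def majority_vote(fold_predictions):
    """Fuse per-pixel label maps from the cross-validation models into
    a final segmentation by per-pixel majority voting."""
    preds = np.stack(fold_predictions)            # (n_folds, H, W) integer labels
    n_classes = int(preds.max()) + 1
    votes = np.stack([(preds == c).sum(axis=0)    # vote count per class, per pixel
                      for c in range(n_classes)])
    return votes.argmax(axis=0)
```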
We further analyzed the feature maps across different layers of the baseline and distance map regularized networks (supplementary material, Fig. S4). We can observe that the baseline models preserve the intensity information and propagate it throughout the network; hence, they are more sensitive to the dataset-specific intensity distribution. In contrast, the multi-task regularized networks focus more on the edges and other discriminative features, producing sparse feature maps while ignoring the dataset-specific intensity distribution. Moreover, from the feature maps at the decoding layers, we can observe a clear delineation of several cardiac structures in the regularized networks, while those of the baseline models are less discriminative and contain information about all structures present in the image. Hence, we verify that multi-task learning-based distance map regularization helps the network learn generalizable features important for the segmentation task, as demonstrated by their excellent transfer learning capabilities (see the Supplementary Materials for details on feature visualization (Fig. S4) and the network learning curves showing the robustness of the distance map regularized models against overfitting (Fig. S2)).
IV Discussion and Conclusion
We performed an extensive study of the effects of hyperparameters on the performance of the proposed regularization framework. Here we summarize the effects of learned versus fixed task weighting and of various choices of the distance map threshold. Furthermore, we analyzed the distribution of network weights before and after regularization.
Task Weighting: During network training, we observed that the automatic weighting scheme learns to weigh the cross-entropy and MAD losses such that they are brought to the same scale. Hence, to accelerate network training, we initialize the weights to 100 and 0.1 for the cross-entropy and MAD loss, respectively. The learned weights for (cross-entropy, MAD) are around (1.5, 0.5) and (1.0, 1.0) for the ACDC and LVSC datasets, respectively. Next, to determine the effect of the learned task weighting scheme presented in Section II-C, we analyzed the average Dice coefficient of the test set segmentation results for both the ACDC (100 volumes) and LVSC (1050 volumes) datasets with fixed versus learned weighting. From Fig. 6, we can observe a slight improvement in the average Dice coefficient with learned weights compared to fixed weighting.
Effect of Distance Map Threshold: We selected three extreme values for the distance map threshold: , , and , with weights initialized to , , and for the (cross-entropy, MAD) loss, respectively. The networks were then trained with uncertainty-based task weighting for a fixed number of epochs. The average Dice coefficient on the test set, obtained from the best performing models on the validation set across five-fold cross-validation, is summarized in Fig. 7. We observe similar performance across the different threshold values, demonstrating low sensitivity to this hyperparameter. Hence, we use a very high threshold of pixels, which is almost equivalent to regressing the full distance map, effectively eliminating this hyperparameter.
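A thresholded (truncated) distance map of the kind regressed here can be sketched as follows. This is a brute-force illustration for small masks; a real pipeline would use a fast distance transform such as the chamfer algorithm of Borgefors [36]:

```python
import numpy as np

def truncated_distance_map(mask, threshold):
    """Euclidean distance from every pixel to the nearest foreground
    pixel of a binary mask, clipped at `threshold`.

    Brute-force O(H*W*|fg|) sketch for illustration; assumes the mask
    contains at least one foreground pixel.
    """
    fg = np.argwhere(mask)          # coordinates of foreground pixels
    h, w = mask.shape
    dmap = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            d = np.sqrt(((fg - [i, j]) ** 2).sum(axis=1)).min()
            dmap[i, j] = min(d, threshold)
    return dmap
```

Clipping at a large threshold makes the target nearly identical to the full distance map, which is the regime we ultimately adopt.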
Network Weight Distribution: We also analyzed the weight distribution of the network before and after distance map regularization, as shown in the Supplementary Materials (Fig. S3). We observe that the number of non-zero weights increases after distance map regularization, thus better utilizing the network capacity. However, this effect is less prominent in the U-Net architecture, as its deconvolution filters are learned rather than relying on pooling indices for upsampling. This gives the U-Net architecture more flexibility to pass information through the skip connections, such that the regularization imposed at the bottleneck layer has a reduced effect.
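The notion of utilized capacity above can be approximated by the fraction of weights whose magnitude exceeds a small tolerance. The sketch below is illustrative; the tolerance is an assumption, not a value used in our analysis:

```python
import numpy as np

def nonzero_fraction(weights, tol=1e-3):
    """Fraction of weights with magnitude above `tol` -- a crude proxy
    for how much of a layer's capacity is utilized. The tolerance is an
    illustrative assumption, not a value from the paper.
    """
    w = np.asarray(weights, dtype=np.float64)
    return float((np.abs(w) > tol).mean())
```

Computed per layer before and after regularization, this statistic gives a one-number summary of the shift in the weight distribution shown in Fig. S3.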
Summary: In this work we proposed, implemented, and demonstrated the benefit of multi-task learning based regularization of fully convolutional networks for semantic image segmentation. We append a decoder network at the bottleneck layer of existing FCN architectures to perform an auxiliary task of distance map prediction, which can be removed after training to reduce inference time. The weighting of the tasks is learned automatically based on their uncertainty. As the distance map contains robust information regarding the shape, location, and boundary of the object to be segmented, it encourages the FCN encoder to learn robust global features important for the segmentation task. Our experiments verify that the distance map regularization improves the segmentation performance of three popular FCN architectures for both binary and multi-class segmentation across two publicly available cardiac cine MRI datasets. Specifically, we observed significant improvement in segmentation performance in the problematic apical slices in response to the soft constraints imposed by the distance map regularization. We also found consistent segmentation improvement across all five patient subgroups in the ACDC dataset. These improvements were also reflected in the computed clinical indices important for the diagnosis of various heart conditions. Furthermore, we demonstrated that the proposed regularization significantly improves the generalization ability of the networks in cross-dataset segmentation (transfer learning), with 8% to 41% improvement in Dice coefficient over the baseline FCN architectures.
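The Dice coefficient used throughout our evaluation can be computed for a pair of binary masks as follows (a minimal sketch; the small epsilon guarding empty masks is an assumption):

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-7):
    """Dice overlap 2*|A & B| / (|A| + |B|) between two binary masks.

    `eps` avoids division by zero when both masks are empty; it is an
    illustrative choice, not a value specified in the paper.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```

A Dice value of 1 indicates perfect overlap and 0 indicates disjoint masks, so per-slice or per-structure averages of this quantity are directly comparable across architectures and datasets.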
Acknowledgment
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R35GM128877 and by the Office of Advanced Cyberinfrastructure of the National Science Foundation under Award No. 1808530. Ziv Yaniv’s work was supported by the Intramural Research Program of the U.S. National Institutes of Health, National Library of Medicine.
References
 [1] P. Peng, K. Lekadir, A. Gooya, L. Shao, S. E. Petersen, and A. F. Frangi, “A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging,” Magnetic Resonance Materials in Physics, Biology and Medicine, vol. 29, no. 2, pp. 155–195, Apr 2016.
 [2] C. Petitjean and J.-N. Dacher, “A review of segmentation methods in short axis cardiac MR images,” Medical Image Analysis, vol. 15, no. 2, pp. 169–184, 2011.
 [3] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [4] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
 [5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
 [6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 05 2015.
 [7] D. Shen, G. Wu, and H.I. Suk, “Deep learning in medical image analysis,” Annual review of biomedical engineering, vol. 19, pp. 221–248, 06 2017.
 [8] G. Litjens et al., “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
 [9] P. V. Tran, “A fully convolutional neural network for cardiac segmentation in short-axis MRI,” CoRR, vol. abs/1604.00494, 2016.
 [10] R. P. K. Poudel, P. Lamata, and G. Montana, “Recurrent fully convolutional neural networks for multi-slice MRI cardiac segmentation,” in Reconstruction, Segmentation, and Analysis of Medical Images. Cham: Springer International Publishing, 2017, pp. 83–94.
 [11] M. Avendi, A. Kheradvar, and H. Jafarkhani, “A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI,” Medical Image Analysis, vol. 30, pp. 108–119, 2016.
 [12] O. Oktay et al., “Anatomically constrained neural networks (ACNNs): Application to cardiac image enhancement and segmentation,” IEEE Transactions on Medical Imaging, vol. 37, no. 2, pp. 384–395, Feb 2018.
 [13] A. GarciaGarcia, S. OrtsEscolano, S. Oprea, V. VillenaMartinez, and J. G. Rodríguez, “A review on deep learning techniques applied to semantic segmentation,” CoRR, vol. abs/1704.06857, 2017.
 [14] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” ser. NIPS’91. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1991, pp. 950–957.
 [15] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 1096–1103.
 [16] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
 [17] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32Nd International Conference on International Conference on Machine Learning  Volume 37, ser. ICML’15. JMLR.org, 2015, pp. 448–456.
 [18] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.
 [19] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. 41–75, Jul 1997.
 [20] P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” J. Mach. Learn. Res., vol. 3, pp. 463–482, Mar. 2003.
 [21] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098, 2017.
 [22] M. Teichmann, M. Weber, M. Zöllner, R. Cipolla, and R. Urtasun, “MultiNet: Real-time joint semantic reasoning for autonomous driving,” in 2018 IEEE Intelligent Vehicles Symposium (IV), June 2018, pp. 1013–1020.
 [23] J. Uhrig, M. Cordts, U. Franke, and T. Brox, “Pixel-level encoding and depth layering for instance-level semantic labeling,” in GCPR, 2016.
 [24] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” CoRR, vol. abs/1705.07115, 2017.
 [25] P. Moeskops, J. M. Wolterink, B. H. M. van der Velden, K. G. A. Gilhuijs, T. Leiner, M. A. Viergever, and I. Isgum, “Deep learning for multi-task medical image segmentation in multiple modalities,” in MICCAI, 2016.
 [26] V. V. Valindria et al., “Multi-modal learning from unpaired images: Application to multi-organ segmentation in CT and MRI,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), March 2018, pp. 547–556.
 [27] W. Xue, G. Brahm, S. Pandey, S. Leung, and S. Li, “Full left ventricle quantification via deep multitask relationships learning,” Medical Image Analysis, vol. 43, pp. 54–65, 2018.
 [28] S. Dangi, Z. Yaniv, and C. A. Linte, “Left Ventricle Segmentation and Quantification from Cardiac Cine MR Images via Multitask Learning,” ArXiv eprints, Sep. 2018.
 [29] M. Bai and R. Urtasun, “Deep watershed transform for instance segmentation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 2858–2866.
 [30] Z. Hayder, X. He, and M. Salzmann, “Boundary-aware instance segmentation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 00, July 2017, pp. 587–595.
 [31] B. Bischke, P. Helber, J. Folz, D. Borth, and A. Dengel, “Multi-task learning for segmentation of building footprints with deep neural networks,” CoRR, vol. abs/1709.05932, 2017.
 [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
 [33] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” CoRR, vol. abs/1511.00561, 2015.
 [34] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015.
 [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [36] G. Borgefors, “Distance transformations in digital images,” Computer Vision, Graphics, and Image Processing, vol. 34, no. 3, pp. 344–371, 1986.
 [37] C. G. Fonseca et al., “The Cardiac Atlas Project: an imaging database for computational modeling and statistical atlases of the heart,” Bioinformatics, vol. 27, no. 16, pp. 2288–2295, 2011.
 [38] A. Suinesiaputra et al., “A collaborative resource to build consensus for automated left ventricular segmentation of cardiac MR images,” Medical Image Analysis, vol. 18, no. 1, pp. 50–62, 2014.
 [39] O. Bernard et al., “Deep learning techniques for automatic MRI cardiac multistructures segmentation and diagnosis: Is the problem solved?” IEEE Transactions on Medical Imaging, vol. 37, no. 11, pp. 2514–2525, Nov 2018.
 [40] Z. Yaniv, B. C. Lowekamp, H. J. Johnson, and R. Beare, “SimpleITK image-analysis notebooks: A collaborative environment for education and reproducible research,” Journal of Digital Imaging, vol. 31, no. 3, pp. 290–303, 2018.
 [41] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
 [42] G. Hinton, N. Srivastava, and K. Swersky, “Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.”
 [43] T. E. Peters, C. E. Linte, Z. E. Yaniv, and J. E. Williams, “Mixed and augmented reality in medicine.” CRC Press, 2018, ch. 16: Augmented and Virtual Visualization for Image-Guided Cardiac Therapeutics, pp. 231–250.