Cardiac cine magnetic resonance imaging (MRI) is regarded as the gold standard for the diagnosis and treatment of cardiovascular diseases. The contours of the left ventricle (LV) extracted from cardiac MRI can be used to calculate clinical parameters such as ventricular volume, myocardial mass, cardiac output, ejection fraction, and wall thickness. Because of the complexity of the heart's structure and the large amount of cardiac MRI data, traditional manual delineation is inefficient and labor-intensive. Methods for segmenting the LV endocardium (Endo) and epicardium (Epi) have therefore long been a research focus in cardiac MR image analysis.
Many fully automated or semi-automated segmentation methods for LV MR images have been proposed. Traditional methods can be roughly divided into: 1) methods based on curve evolution, such as active contours; 2) methods based on statistical shape models, such as the active appearance model (AAM) and the active shape model (ASM); 3) methods based on graph theory, such as graph cut; and 4) methods that combine neural networks with the former methods [2, 16].
In recent years, driven by powerful deep neural networks, convolutional neural networks (CNNs) have achieved remarkable results in semantic segmentation [14, 6], represented by the Fully Convolutional Network (FCN). FCN discards all fully connected layers of a classification architecture and applies a fully convolutional structure to realize pixel-wise prediction. Later, researchers proposed symmetrical encoder-decoder structures, in which the encoder uses pooling layers to gradually reduce the spatial resolution of the input image, while the decoder gradually recovers the details of the target and the corresponding spatial resolution. U-net, SegNet and FC-DenseNet are representatives of this network structure.
CNN-based segmentation algorithms have also been applied to left ventricular segmentation with strong results. Tran was the first to apply FCN to cardiac MRI segmentation, achieving a state-of-the-art result that surpassed previous automated methods. Romaguera et al. train an FCN model using whole cardiac images as input without cropping and obtain good performance for LV segmentation. Khened et al. adopt a modified FC-DenseNet architecture for cardiac segmentation, which is more parameter- and memory-efficient than FCN; they add parallel pathways in the initial convolution layer and introduce short-cut connections in the upsampling path. Tan et al. design three separate neural networks, which are used to locate the left ventricle, estimate the LV centerpoint, and compute the radial distances of the endocardium and epicardium, respectively.
In this paper, we propose a new CNN architecture for cardiac LV segmentation that adopts a conventional encoder-decoder structure, which we call the Multi-Scale Fully Convolutional Network (MS-FCN). A multi-scale pooling module is employed in the encoding stage and a dense connection structure is used in the decoding stage; the effect of each is explained in the following sections. Moreover, we conduct a series of ablation experiments on the network structure. Finally, we demonstrate state-of-the-art results on the public Sunnybrook dataset, which comes from the MICCAI 2009 Cardiac MR Left Ventricle Segmentation Challenge.
2 Network architecture
In this section, we briefly introduce our neural network architecture and describe the ideas behind its design.
Figure 1 illustrates our proposed MS-FCN architecture, which adopts an encoder-decoder structure. In the encoding stage, MS-FCN comprises 15 convolution layers and three downsampling operations. Each convolution layer is followed by a Batch Normalization (BN) operation and a Rectified Linear Unit (ReLU) activation function. In place of the conventional pooling layer that FCN uses for its first downsampling step, MS-FCN employs a multi-scale pooling module to capture more contextual information. In the decoding stage, in order to restore high-resolution detail, we use a cascade structure called the dense connection structure. In addition, MS-FCN uses skip connections similar to those of U-net, which concatenate the feature maps of the encoding stage with the corresponding feature maps of the decoding stage. This benefits both the forward flow of feature maps and the backward propagation of gradient information.
To make the network structure clear, the multi-scale pooling module and the dense connection structure are described in detail below.
2.1 Multi-scale pooling module
Inspired by PSPNet’s pyramid pooling module, we propose a hierarchical structure called the multi-scale pooling module, which is used to maximize the feature extraction ability. The main difference from PSPNet is that upsampling inside the multi-scale pooling module is implemented by deconvolution rather than bilinear interpolation. Compared with fixed bilinear interpolation, a deconvolution layer with learnable parameters is better at restoring rich image details, as demonstrated in Section 3.
The multi-scale pooling module consists of multiple subpaths in parallel, as illustrated in Figure 2. The first subpath uses a pooling layer with ratio as a baseline. The following subpaths employ pooling layers of different ratios which are all greater than . Then, we employ a convolutional layer in each path as the compression layer to reduce the number of feature maps. After that, we use deconvolution layers to upsample the low-dimension feature maps to the same size as the baseline. Finally, features of all subpaths are concatenated as the multi-scale output features.
It is noteworthy that the number of subpaths and the scaling ratio of the pooling layer in each subpath are all adjustable in the multi-scale pooling module, and they mainly depend on the input image resolution. Thus the downsampling ratios should maintain reasonable intervals to abstract additional contextual information. The multi-scale pooling module in MS-FCN includes four subpaths in parallel, where the downsampling ratios in the four subpaths are , , , and , respectively.
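As a rough illustration of the data flow through the module, the following pure-Python sketch pools a single-channel map at several ratios, brings every path back to the baseline resolution, and returns the paths as a stack of channels. The function names, the choice of max pooling, and any ratios passed in are assumptions made for this example and are not the configuration of MS-FCN. The compression convolution is omitted, and fixed nearest-neighbour upsampling stands in for the learned deconvolution.

```python
def max_pool(fmap, ratio):
    """Non-overlapping max pooling with a square window of side `ratio`."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[y + dy][x + dx] for dy in range(ratio) for dx in range(ratio))
            for x in range(0, w - ratio + 1, ratio)
        ]
        for y in range(0, h - ratio + 1, ratio)
    ]


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling (a fixed stand-in for a learned deconvolution)."""
    return [
        [value for value in row for _ in range(factor)]
        for row in fmap
        for _ in range(factor)
    ]


def multi_scale_pool(fmap, ratios):
    """Pool `fmap` at every ratio and resize each path to the baseline size.

    `ratios[0]` is the baseline; the remaining ratios must be multiples of it.
    Returns one channel per subpath, all with the baseline resolution.
    """
    baseline = ratios[0]
    channels = []
    for ratio in ratios:
        pooled = max_pool(fmap, ratio)                 # coarser context
        channels.append(upsample_nearest(pooled, ratio // baseline))
    return channels                                    # "concatenated" output
```

For an 8x8 input and the hypothetical ratios `(2, 4, 8)`, every returned channel is 4x4: the first keeps local detail, while the last carries a single value summarising the whole image.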
2.2 Dense connection structure
In the decoding stage, most semantic segmentation networks [3, 5, 7] alternate upsampling layers and convolution layers, as shown in Figure 3(a). Because the upsampling factor in each step is fixed, the deconvolution layers can extract only limited feature information from the decoded image.
To address this limitation, we propose a dense connection structure, illustrated in Figure 3(b). In this structure, low-resolution feature maps are upsampled by several different factors, the upsampled feature maps that share the same resolution are concatenated, and the subsequent convolution operations are then performed. In this way, the information in low-resolution feature maps can flow through different paths in the network, thereby increasing the possibility of refining the LV boundaries.
It is worth mentioning that the number of upsampling paths and the upsampling factors in the dense connection structure are also adjustable. These settings depend not only on the resolution of the encoded feature maps but also on the size of the original image. In MS-FCN, since the input image is downsampled by an overall factor of 12 during the encoding stage, the first two upsampling layers in our model have scale ratios of 3 and 6, respectively, as shown in Figure 4.
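The resolution bookkeeping behind the two scale ratios can be sketched as follows. The wiring shown here, in which the map upsampled by 3 is upsampled once more by 2 and joined with the map upsampled by 6, is only one reading of the description above and is an assumption; Figure 4 defines the actual layout. The function names are hypothetical, the intermediate convolutions are omitted, and nearest-neighbour upsampling again replaces the learned layers.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D list by an integer factor."""
    return [
        [value for value in row for _ in range(factor)]
        for row in fmap
        for _ in range(factor)
    ]


def dense_decode(low_res, factors=(3, 6)):
    """Upsample one low-resolution map along two paths and merge them.

    Returns the intermediate map (upsampled by the smaller factor) and the
    list of maps that share the higher resolution and would be concatenated.
    """
    small, large = factors
    path_a = upsample_nearest(low_res, small)               # 1/12 -> 1/4 of the input
    path_b = upsample_nearest(low_res, large)               # 1/12 -> 1/2 of the input
    path_a_up = upsample_nearest(path_a, large // small)    # 1/4  -> 1/2 of the input
    return path_a, [path_a_up, path_b]
```

With fixed nearest-neighbour upsampling the two merged maps coincide; in a trained network the learned layers on each path would make them differ, which is what gives the extra paths their value.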
3 Experiments and results
3.1 Data augmentation
As part of MICCAI 2009 challenge on automated LV segmentation, the Sunnybrook cardiac dataset contains cine-MRI images with a variety of pathologies and ventricular shapes. This dataset consists of 45 DICOM studies with a resolution of and associated ground truth contours, and it is divided into three parts of 15 cases each: training, validation and online. The training set is used to train our model for LV segmentation, while the validation and online sets are combined as test set to evaluate model performance.
Since the amount of Sunnybrook training samples is small, we perform data augmentation, including displacement, rotation, and flipping, on the training set. First, we use a displacement operation on the training images and each image is translated five pixels in four diagonal directions. Then, we notice that the left ventricle is mostly in the center of image and perform a center cropping. To ensure that all clipped images contain a complete region of interest, we uniformly crop the image to as shown in Figure 5. Meanwhile, the proportion of the left ventricle in the clipped image is relatively large and the class imbalance problem  is relieved to some extent.
Subsequently, we perform rotation and flipping on the cropped images (as shown in Figure 6): rotations of 90, 180, and 270 degrees, as well as horizontal and vertical flips. Together with the displacement operation, this yields 40 times as many images as the original training set.
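The individual augmentation operations described above can be sketched in a few lines of pure Python on 2-D lists. The helper names, the zero fill value for translated borders, and the clockwise rotation direction are assumptions of this sketch; it does not reproduce the exact pipeline or the final image count.

```python
# Five-pixel shifts in the four diagonal directions, as (dy, dx) pairs.
DIAGONAL_SHIFTS = [(5, 5), (5, -5), (-5, 5), (-5, -5)]


def translate(img, dy, dx, fill=0):
    """Shift the image content by (dy, dx) pixels, padding with `fill`."""
    h, w = len(img), len(img[0])
    return [
        [
            img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
            for x in range(w)
        ]
        for y in range(h)
    ]


def center_crop(img, size):
    """Cut a size x size window around the image centre."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def rot90(img):
    """Rotate by 90 degrees clockwise; apply repeatedly for 180 and 270."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror top to bottom."""
    return [row[:] for row in img[::-1]]
```

The same operations must be applied to the ground-truth masks so that each augmented image keeps a matching label.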
The model is trained and executed with the Caffe deep learning framework, using an NVIDIA GeForce GTX 1080Ti GPU on Ubuntu 16.04 Linux. We employ the Nesterov solver with a momentum of 0.9 as the optimization algorithm to minimize a cross-entropy cost function calculated from the prediction and the ground truth. We choose the “Xavier” scheme to randomly initialize the weights. We also use regularization with a decay coefficient of 0.0005 and a dropout ratio of to prevent overfitting of the network. In the training process, we use a polynomial decay strategy to reduce the learning rate as follows,
$$lr = lr_0 \left(1 - \frac{i}{T}\right)^{p}$$

where $lr_0$ is the initial learning rate, $p$ is the decay rate, $i$ is the current iteration number and $T$ is the total iteration number, which is approximately equal to 10 epochs in our experiments.
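The polynomial decay strategy can be written as a one-line function. The form below follows Caffe's `poly` learning-rate policy, `base_lr * (1 - iter / max_iter) ^ power`; the argument names are chosen for this sketch, and any numeric values passed in are illustrative rather than the settings used in the experiments.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Polynomially decayed learning rate at a given iteration.

    base_lr   : initial learning rate
    iteration : current iteration number (0 <= iteration <= max_iter)
    max_iter  : total number of iterations
    power     : decay rate (exponent of the polynomial)
    """
    return base_lr * (1.0 - iteration / max_iter) ** power
```

The rate starts at `base_lr`, falls monotonically, and reaches zero at the final iteration; a `power` of 1 gives a linear ramp, while smaller values keep the rate high for longer.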
3.3 Ablation experiments
Multi-scale pooling module
To evaluate the effect of the multi-scale pooling module on segmentation performance, we conduct an ablation experiment with and without this module. Figure 7 shows a simplified diagram of the network structures used in this comparison, with the convolutional layers omitted. Figure 7(b) replaces the first pooling layer in Figure 7(a) with the multi-scale pooling module, marked in red.
Table 1 shows the comparison results, from which we conclude that using the multi-scale pooling module in the encoding stage yields better segmentation performance.
Table 1: Ablation experiment comparing an ordinary pooling layer and the multi-scale pooling module as the first downsampling step. Values are given as mean (standard deviation).
Upsampling methods in the multi-scale module
Convolutional neural networks generally employ either bilinear interpolation or deconvolution for upsampling. Deconvolution uses a trainable kernel and therefore, in theory, has greater learning capacity.
In order to evaluate the effects of different upsampling methods in the multi-scale pooling module, we test bilinear interpolation, deconvolution, and group deconvolution, separately, with results listed in Table 2.
The table shows that, for both the endocardium and the epicardium, deconvolution outperforms bilinear interpolation in the network. Among the deconvolution variants, group deconvolution not only reduces the number of parameters but also gives better segmentation performance. Therefore, we use deconvolution upsampling in the multi-scale pooling module and, to avoid adding too many network weights, set the group parameter to 32.
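The parameter saving from grouping follows from simple counting: with `groups` groups, each group connects only its own slice of input channels to its own slice of output channels, so the weight count shrinks by a factor of `groups`. The sketch below counts kernel weights only (no bias); the function name, channel counts, and kernel size in the usage note are hypothetical, and only the group value of 32 comes from the text.

```python
def deconv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of kernel weights in a (grouped) deconvolution layer."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    per_group = (in_channels // groups) * (out_channels // groups)
    return per_group * kernel_size * kernel_size * groups
```

For example, a layer with 64 input channels, 64 output channels and a 4x4 kernel has 65536 weights without grouping and 2048 with `groups=32`.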
Dense connection structure
To examine the effect of the dense connection structure on segmentation performance, we train two separate models, with and without the dense connection structure. Table 3 shows the comparison results. Compared with the network without the dense connection structure, the network with it improves the percentage of good contours and reduces APD significantly, while its endocardial Dice value is only slightly lower.
|Dense connection structure||Without||With|
3.4 Comparison and analysis
In this section, we compare our model with two previous state-of-the-art segmentation methods: FCN and U-net. Note that the U-net structure performs only three downsampling operations because the input image resolution is only , and it uses zero-padded convolutions to keep the image size unchanged. The comparison results are reported in Table 4. As shown in the table, our model has a lower APD and a higher percentage of good contours than FCN and U-net (except for the percentage of good endocardial contours). We also achieve a Dice score of and for endocardium and epicardium respectively, which is the best reported fully automated result on this challenging dataset to date.
Number of evaluated cases. 30 - validation and online cases, 45 - all cases.
Only provides dice score of endocardial contours.
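The Dice score used in the comparison above measures the overlap between a predicted region and the ground-truth region as twice the intersection divided by the sum of the two areas. A minimal sketch on binary masks stored as 2-D lists follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, not part of the challenge's evaluation code.

```python
def dice_score(pred, truth):
    """Dice overlap between two binary masks of equal shape (values 0 or 1)."""
    intersection = 0
    pred_area = 0
    truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            pred_area += 1 if p else 0
            truth_area += 1 if t else 0
            intersection += 1 if (p and t) else 0
    if pred_area + truth_area == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / (pred_area + truth_area)
```

A score of 1 means the two regions coincide exactly and 0 means they do not overlap at all.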
Good results: Selected prediction examples from the Sunnybrook test set are shown in Figure 8. As shown in the figure, our model is able to produce accurate results on most slices of left ventricular MRI.
Failure results: As shown in Figure 9, our model has difficulty in segmenting (a) left ventricle with ambiguous or imperceptible boundaries, and (b) the apex and basal regions.
Our proposed model, MS-FCN, employs an encoder-decoder structure in which the multi-scale pooling module encodes rich contextual information and the dense connection decoder refines the segmentation results along object boundaries. To verify the efficacy of the network structure, we conduct a series of ablation experiments and compare it with existing methods. The experimental results indicate that the proposed model achieves state-of-the-art performance on the Sunnybrook cardiac dataset.
-  A. Andreopoulos and J. K. Tsotsos. Efficient and generalizable statistical models of shape and appearance for analysis of cardiac MRI. Medical Image Analysis, 12(3):335–357, jun 2008. doi:10.1016/j.media.2007.12.003.
-  M. Avendi, A. Kheradvar, and H. Jafarkhani. A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical Image Analysis, 30:108–119, may 2016. doi:10.1016/j.media.2016.01.005.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. 1511.00561v3.
-  M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. 1710.05381v1.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, apr 2017. doi:10.1109/tpami.2017.2699184.
-  L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. 1802.02611v2.
-  J. Fu, J. Liu, Y. Wang, and H. Lu. Stacked deconvolutional network for semantic segmentation. 1708.04943v1.
-  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
-  S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. 1611.09326v3.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. 1408.5093v1.
-  M. Khened, V. A. Kollerathu, and G. Krishnamurthi. Fully convolutional multi-scale residual densenets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers. 1801.05173v1.
-  X. Lin, B. Cowan, and A. Young. Model-based graph cut method for segmentation of the left ventricle. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE, 2005. doi:10.1109/iembs.2005.1617120.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jun 2015. doi:10.1109/cvpr.2015.7298965.
-  P. Luo, G. Wang, L. Lin, and X. Wang. Deep dual learning for semantic image segmentation. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, oct 2017. doi:10.1109/iccv.2017.296.
-  P. Makowski, T. S. Sørensen, S. V. Therkildsen, A. Materka, H. Stødkilde-Jørgensen, and E. M. Pedersen. Two-phase active contour method for semiautomatic segmentation of the heart and blood vessels from MRI images for 3d visualization. Computerized Medical Imaging and Graphics, 26(1):9–17, jan 2002. doi:10.1016/s0895-6111(01)00026-x.
-  T. A. Ngo, Z. Lu, and G. Carneiro. Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance. Medical Image Analysis, 35:159–171, jan 2017. doi:10.1016/j.media.2016.05.009.
-  P. Radau, Y. Lu, K. Connelly, G. Paul, A. Dick, and G. Wright. Evaluation framework for algorithms segmenting short axis cardiac MRI. The MIDAS Journal -Cardiac MR Left Ventricle Segmentation Challenge, 2009. URL http://hdl.handle.net/10380/3070.
-  L. V. Romaguera, M. G. F. Costa, F. P. Romero, and C. F. F. C. Filho. Left ventricle segmentation in cardiac MRI images using fully convolutional neural networks. In S. G. Armato and N. A. Petrick, editors, Medical Imaging 2017: Computer-Aided Diagnosis. SPIE, mar 2017. doi:10.1117/12.2253901.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. 1505.04597v1.
-  C. Santiago, J. C. Nascimento, and J. S. Marques. A new ASM framework for left ventricle segmentation exploring slice variability in cardiac MRI volumes. Neural Computing and Applications, 28(9):2489–2500, may 2016. doi:10.1007/s00521-016-2337-1.
-  L. K. Tan, R. A. McLaughlin, E. Lim, Y. F. A. Aziz, and Y. M. Liew. Fully automated segmentation of the left ventricle in cine cardiac MRI using neural network regression. Journal of Magnetic Resonance Imaging, jan 2018. doi:10.1002/jmri.25932.
-  P. V. Tran. A fully convolutional neural network for cardiac segmentation in short-axis mri. 1604.00494v3.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. 1612.01105v2.