4D CNN for semantic segmentation of cardiac volumetric sequences

by   Andriy Myronenko, et al.
Harvard University

We propose a 4D convolutional neural network (CNN) for the segmentation of retrospective ECG-gated cardiac CT, a series of single-channel volumetric data over time. While only a small subset of volumes in the temporal sequence are annotated, we define a sparse loss function on available labels to allow the network to leverage unlabeled images during training and generate a fully segmented sequence. We investigate the accuracy of the proposed 4D network to predict temporally consistent segmentations and compare with traditional 3D segmentation approaches. We demonstrate the feasibility of the 4D CNN and establish its performance on cardiac 4D CCTA.


Echocardiography Segmentation with Enforced Temporal Consistency

Convolutional neural networks (CNN) have demonstrated their ability to s...

An automatic deep learning approach for coronary artery calcium segmentation

Coronary artery calcium (CAC) is a significant marker of atherosclerosis...

V-FCNN: Volumetric Fully Convolution Neural Network For Automatic Atrial Segmentation

Atrial Fibrillation (AF) is a common electro-physiological cardiac disor...

Adversarial Convolutional Networks with Weak Domain-Transfer for Multi-Sequence Cardiac MR Images Segmentation

Analysis and modeling of the ventricles and myocardium are important in ...

End-to-End Deep Learning Model for Cardiac Cycle Synchronization from Multi-View Angiographic Sequences

Dynamic reconstructions (3D+T) of coronary arteries could give important...

3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation

This paper introduces a network for volumetric segmentation that learns ...

Coronary Artery Centerline Extraction in Cardiac CT Angiography Using a CNN-Based Orientation Classifier

Coronary artery centerline extraction in cardiac CT angiography (CCTA) i...

1 Introduction

Cardiovascular disease is responsible for 18 million deaths annually, making it one of the leading causes of mortality globally [14]

. Coronary computed tomography angiography (CCTA) uses contrast-enhanced CT to evaluate cardiac muscle morphology, function, and vascular patency. Two measurements derived from CCTA with significant diagnostic and prognostic importance are the Left Ventricular Ejection Fraction (LVEF) and Left Ventricular Wall Thickness. Both measurements require the segmentation of the left ventricular muscle, with the former requiring temporal segmentation over the cardiac cycle. The American College of Radiology (ACR) has highlighted the importance of these measurements by listing them among the most important initial ‘use cases’ of artificial intelligence as applied to radiology

[2]. A segmentation model of the left ventricular muscle and cavity over the cardiac cycle, especially the end-systole and end-diastole time points, would allow for automated determination of both measurements from 4D CCTA studies. The clinical utility of such a model is highly relevant as it reduces study reading time and improves the consistency of measurements, thereby potentially preventing missed pathology in cases where the measurements may not have otherwise been performed.

Modern 4D CCTA images are acquired over the entire cardiac cycle, including end-systole and end-diastole. A typical 4D scan includes 20 3D volumes reflecting the cardiac anatomy at equally-spaced time points within a 240 ms time interval. This allows for enough temporal resolution to study the heart’s function. In order to limit the amount of effort required to annotate these images, we restrict the annotation to only certain frames, an example of which is shown in Fig. 2.

While convolutional neural networks (CNNs) have demonstrated state-of-the-art performance across a variety of segmentation tasks [9], the adoption of 4D CNNs for 4D medical imaging (3D + time – e.g., CT or ultrasound) has been limited due to the high computational complexity and lack of manually segmented data. The cost of annotating volumetric imaging is significant, making 4D labeling prohibitively expensive. Nevertheless, the temporal dimension offers valuable information that is otherwise lost when treating each volume independently.

In this work, we propose a 4D CNN for the segmentation of the left ventricle (LV) and left ventricular myocardium (LVM) from 4D CCTA images, enabling the computation of the aforementioned cardiac measurements. To reduce annotation costs, our 4D dataset is sparsely labeled across the temporal dimension – only a fraction of volumes in the sequence are labeled. This enables us to leverage a 4D CNN with a sparse loss function, allowing our algorithm to take advantage of unlabeled images which would otherwise be discarded in a 3D model. The network jointly segments the sequence of volumes, implicitly learning temporal correlations and imposing a soft temporal smoothness constraint. We describe the 4D convolution layer generalization in Section 3.1 and introduce a sparse Dice loss function as well as a temporal consistency regularization in Section 3.2. We demonstrate the feasibility of a 4D CNN and compare its performance to a traditional 3D CNN in Section 4.

2 Related work

Deep learning has achieved state-of-the-art segmentation performance in 2D natural images [3] and 3D medical images [7, 9, 8]

. To leverage the temporal dependency and account for segmentation continuity, recurrent neural networks (RNNs) have been adopted for videos 

[12, 16] and 2D+T cardiac MRI datasets [18]. Convolutional models can also represent temporal relationships and offer competitive performance for language translation [5]. 3D CNNs have also been applied spatio-temporally and proven effective in segmentation of videos [10, 11] and 2D+T cardiac MRIs [17].

For sequences of volumetric imaging, such as 3D+T CT or ultrasound, 4D CNNs are a natural extension. Wang et al. [13] proposed a CNN for 4D light-field material recognition incorporating separable 4D convolutions to reduce computational complexity. Clark et al. [4] adopted a 4D CNN for the de-noising of low-dose CT, where three independent 3D convolutions (with fixed cyclic time delay) were used to simulate 4D convolutions.

To date, 4D CNNs for semantic segmentation have not been explored in similar depth to 2D and 3D CNNs, in part due to their high computational requirements and lack of available annotations. In this work, we demonstrate the feasibility and advantageousness of a true 4D CNN.

3 Methods

Figure 1: 4D network architecture: input is a single channel (grayscale) 4D CT crop, followed by initial 3x3x3x3 4D convolution with 8 filters. Each green building block is a ResNet-like block with GroupNorm normalization. The output has three channels followed by a softmax: background, left ventricle, and myocardium. For a detailed description of the building blocks see Table 1.

Our 4D segmentation network architecture follows an encoder-decoder semantic segmentation strategy, typical for 2D and 3D images. Throughout the network, we use 4D convolutions with kernel size of 3x3x3x3, where the last dimension corresponds to time. The network architecture follows the one proposed in [8], where only the main decoder branch is used and modified to fit 4D images within GPU memory limits.

The input size of the network is 1x1x96x96x64x16 (corresponding to a batch size of 1, input channel 1, and spatial crop of 96x96x64 with 16 frames). We randomly crop this 4D array from the input data during training.

Each building block of the network consists of two convolutions with group normalization [15]

and ReLU, followed by identity skip-connections similar to ResNet 


blocks. A sequence of the building blocks is applied sequentially at different spatial levels. In the encoder part of the network we downscale the spatial dimension after each level and double the feature dimension. We use strided convolutions (stride of 2) for downsizing, and all convolutions are 3x3x3x3. We use one block at level 0 (initial size), two blocks at level 1, and four blocks at level 2. At the smallest scale the input image crop is downsized by a factor of 4 (to 24x24x16x4), which provides a balance between network depth and GPU memory limits. For the encoder branch, we leverage a similar structure with a single block per each spatial level. To upsample, we use 4D nearest neighbor interpolation after 1x1x1x1 convolution. Finally, we use additive skip-connections between the corresponding levels. The details of network structure are shown in Table 

1 and in Figure 1.

Name Ops Repeat Output size
Input 1x96x96x64x16
InitConv Conv 3x3x3x3 8x96x96x64x16
EncoderBlock0 GN,ReLU,Conv,GN,ReLU,Conv, AddId 8x96x96x64x16
EncoderDown1 Conv 3x3x3x3 stride 2 16x48x48x32x8
EncoderBlock1 GN,ReLU,Conv,GN,ReLU,Conv, AddId x2 16x48x48x32x8
EncoderDown2 Conv 3x3x3x3 stride 2 32x24x24x16x4
EncoderBlock2 GN,ReLU,Conv,GN,ReLU,Conv, AddId x4 32x24x24x16x4
DecoderUp1 Conv1, UpNearest, +EncoderBlock1 16x48x48x32x8
DecoderBlock1 GN,ReLU,Conv,GN,ReLU,Conv, AddId 16x48x48x32x8
DecoderUp0 Conv1, UpNearest, +EncoderBlock0 8x96x96x64x16
DecoderBlock0 GN,ReLU,Conv,GN,ReLU,Conv, AddId 8x96x96x64x16
DecoderEnd Conv 1x1x1x1, Softmax 3x96x96x64x16
Table 1: Network structure, where GN stands for group normalization (with group size of 8), Conv - 3x3x3x3 convolution, AddId - addition of identity/skip connection. Repeat column shows the number of repetitions of the block. The output, after softmax, has 3 channels (background and 2 foreground classes)

3.1 4D convolutions

While 4D convolutional layers are not available in common deep-learning frameworks (such as TensorFlow


of PyTorch

222https://pytorch.org), they can be represented as a sum over a sequence of 3D convolutions along the fourth (temporal) dimension. For efficiency, we rearranged the loop to avoid repeated 3D convolutions by implementing 4D convolution as a custom TensorFlow layer. This strategy allows for a true (non-separable) 4D convolution.

3.2 Loss

Our training dataset is sparsely labeled along the temporal dimension since labeling medical images in 4D (and even in 3D) is complex and time-consuming. Therefore, we have defined a sparse loss function that is applied only to the labelled time-frames and include a regularization term to ensure temporal consistency between frames.

The proposed loss function is therefore composed of two terms,


where is a soft dice loss [7] applied only to labeled time points (3D images) to match the corresponding outputs :


is the number of frames (K=16 in our case, since we use the 96x96x64x16 crop size). The second term in (1) is a first-order derivative over time to enforce similarity between frames.

3.3 Optimization

We apply the Adam optimizer with an initial learning rate of and progressively decrease it according to the following schedule:



is an epoch counter, and

is the total number of training epochs.

We use batch size of 1 and sample input sequences randomly (ensuring that each training sequence is drawn once per epoch). From each 4D sequence, we apply a random crop of size 96x96x64x16 centered on a foreground (with a probability of 0.6), otherwise centered on a background voxel. Thus, at each iteration a different number of ground truth labels is available, depending on the location of the crop window (16) of the time dimension.

4 Results




3D Rendering
Figure 2: A typical segmentation example of our 4D network in axial, sagittal and coronal views of a single 3D frame. Notice that the predicted results look better and much smoother than manual annotations in sagittal and coronal cross sections. Manual labeling was done slice-by-slice, which results in noisy out-of-plane ground-truth labels. The 4D segmentation network is able to average out these errors when learning such noisy data examples.

4.1 Dataset

Our dataset consists of 61 4D CCTA sequences, each of 512x512x(40-108)x20 size (512x512 axial size, with 40-108 slices of variable thickness and 20 time points). The spatial image resolution is (0.24-0.46)x(0.24-0.46)x2mm. All images were acquired at [BLINDED] using a 128-slice dual-source multi-detector CT with retrospective ECG gating and tube current modulation. Sequences were reconstructed from multiple R-R333R corresponds to the peak of the QRS complex in the ECG wave. intervals, measured via electrocardiogram.

All images were resampled to an isotropic spatial resolution of 1x1x1mm , retaining the temporal resolution. After re-sampling, the 4D image sizes vary between 112x122x80x20 and 238x238x158x20 voxels. We apply a random data split, with 49 4D images used for training and 12 4D images for validation.

The number of annotated frames in each sequences varies widely, ranging from only 2 out 20 (e.g. end-systole and end-diastole), to 9 (every second time point). Overall, 247 time-points have been annotated throughout the dataset, which represents approximately 20% of all frames. We include studies with differing numbers of frames in both training and validation splits to maximize temporal coverage during both training and validation.

4.2 Experiments

We implemented our 4D network in Tensorflow [1] and trained it on an NVIDIA Tesla P100 SXM2 GPU with 16GB memory. During training we used a random crop of size 96x96x64x16, batch size 1. Data is normalized to [-1,1] using a fixed scaling from input CT range [-1024,1024]. We train for 500 epochs, and use the model at the end of training for evaluations.

For comparison, we also implemented a 3D network largely following the same architecture as in Figure 1, except that all convolutions are 3D and include a greater number of layers with one additional down-sampling level (the end of the encoder being of size 12x12x8) as GPU memory requirements permit deeper architecture in the 3D case. For 3D network we use a crop of size 96x96x64, and train it only on labeled 3D frames. The 3D network learns to predict segmentation without any temporal constraint considerations.

We evaluate both networks on the validation set, using only the labeled frames, in terms of average dice score. In addition, we assess the temporal continuity of the produced results. A a temporal smoothness metric, we compute the L2 norm of first order time derivative of segmentation labels, as well as average surface distance between the consecutive frames. Intuitively, accurate segmentation results must respect the temporal continuity of the heart motion, and are expected to be smoother in the time domain.

Dice Smoothness
Arch LVM LV L2 Surf
3D network 0.85 0.91 1.28 0.74
4D network 0.85 0.90 1.05 0.59
Table 2: Performance evaluation of the 4D semantic segmentation network. LVM - left ventricular myocardium and LV - left ventricle. We also measure temporal smoothness in the result using the L2 norm of temporal derivative of the predictions and average surface distance between the consecutive frames. The proposed 4D network produces temporally smoother results with comparable dice scores.

The evaluation results are shown in Table 2. Visually, the 4D CNN segmentation results have superior temporal consistency, where the label changes more “fluidly” between time-frames. Our smoothness metric confirms this observation, with the proposed 4D network achieving lower smoothness loss than its 3D counterpart (see Table 2).

In terms of dice score, the proposed 4D network demostrated comparable results, with the exception of one of the structures (LV cavity) where the 3D network was 1% better. One reason for this might be that 4D network is not as deep as its 3D counterpart and the dice score is estimated frame by frame; frame-by-frame dice score may not be the most representative accuracy measure of temporal sequence segmentations as it does not account for jittering between frames.

We also observe that in many cases, 4D CNN results look better than the ground truth (See Figure 2). The manual annotations are done slice-by-slice, which results in jittery out-of-plane annotation profiles; this especially visible in sagittal and coronal views. The proposed 4D segmentation network is able to average out these errors while learning from the overall dataset, and produce coherent results both spatially and temporally.

5 Conclusion

We proposed a 4D convolutional neural network for semantic segmentation of the left ventricle (LV) and left ventricular myocardium (LVM) from 4D CCTA studies. The network is fully convolutional and jointly segments a temporal sequence of volumetric images from CCTA.

We utilize a sparse Dice loss function and a temporal consistency regularization to handle the problem of sparse temporal annotation. We have demonstrated the feasibility and advantagousness of 4D CNN compared to 3D CNN, where the first shows improvement both in segmentation performance and temporal consistency.


  • [1]

    Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015),

    https://www.tensorflow.org/, software available from tensorflow.org
  • [2] American College of Radiology: Touch-ai directory (2019), https://www.acrdsi.org/DSI-Services/TOUCH-AI
  • [3] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv:1802.02611 (2018)
  • [4] Clark, D., Badea, C.: Convolutional regularization methods for 4d, x-ray ct reconstruction. In: Medical Imaging 2019: Physics of Medical Imaging. vol. 10948, p. 109482A. International Society for Optics and Photonics (2019)
  • [5] Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1243–1252. JMLR. org (2017)
  • [6]

    He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision (ECCV) (2016)

  • [7] Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: Fourth International Conference on 3D Vision (3DV) (2016)
  • [8] Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization. In: BrainLes, Medical Image Computing and Computer Assisted Intervention (MICCAI). pp. 311–320. LNCS, Springer (2018), https://arxiv.org/abs/1810.11654
  • [9] Ronneberger, O., P.Fischer, Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). LNCS, vol. 9351, pp. 234–241. Springer (2015)
  • [10] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 4489–4497 (2015)
  • [11]

    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Deep end2end voxel2voxel prediction. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 17–24 (2016)

  • [12] Valipour, S., Siam, M., Jagersand, M., Ray, N.: Recurrent fully convolutional networks for video segmentation. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 29–36. IEEE (2017)
  • [13] Wang, T.C., Zhu, J.Y., Hiroaki, E., Chandraker, M., Efros, A., Ramamoorthi, R.: A 4D light-field dataset and CNN architectures for material recognition. In: Proceedings of European Conference on Computer Vision (ECCV) (2016)
  • [14] World Health Organization: Cardiovascular diseases (CVDs) (May 2017), https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
  • [15] Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vision (ECCV) (2018)
  • [16] Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., Huang, T.: Youtube-vos: Sequence-to-sequence video object segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 585–601 (2018)
  • [17] Yang, D., Huang, Q., Axel, L., Metaxas, D.: Multi-component deformable models coupled with 2d-3d u-net for automated probabilistic segmentation of cardiac walls and blood. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). pp. 479–483. IEEE (2018)
  • [18] Zhang, D., Icke, I., Dogdas, B., Parimal, S., Sampath, S., Forbes, J., Bagchi, A., Chin, C.L., Chen, A.: Segmentation of left ventricle myocardium in porcine cardiac cine mr images using a hybrid of fully convolutional neural networks and convolutional lstm. In: Medical Imaging 2018: Image Processing. vol. 10574, p. 105740A. International Society for Optics and Photonics (2018)