Left ventricle quantification through spatio-temporal CNNs

08/23/2018 ∙ by Alejandro Debus, et al. ∙ 0

Cardiovascular diseases are among the leading causes of death globally. Cardiac left ventricle (LV) quantification is known to be one of the most important tasks for the identification and diagnosis of such pathologies. In this paper, we propose a deep learning method that incorporates 3D spatio-temporal convolutions to perform direct left ventricle quantification from cardiac MR sequences. Instead of analysing slices independently, we process stacks of temporally adjacent slices by means of 3D convolutional kernels which fuse the spatio-temporal information, incorporating the temporal dynamics of the heart into the learned model. We show that incorporating such information by means of spatio-temporal convolutions into standard LV quantification architectures improves the accuracy of the predictions when compared with single-slice models, achieving competitive results for all cardiac indices and significantly improving on the state of the art (Xue et al., 2018, MedIA) for cardiac phase estimation.




1 Introduction

In 2015, around 17.7 million people died worldwide due to heart diseases. Left ventricle (LV) quantification is a key factor for the identification and diagnosis of such pathologies [2]. However, the estimation of cardiac indices remains a very complex task due to the intricate temporal dynamics of the heart and the inter-subject variability of the cardiac structures. Indices such as cavity and myocardium area, regional wall thickness and cavity dimensions, among others, provide useful information to diagnose various types of cardiac pathologies. Cardiovascular magnetic resonance (CMR) is one of the preferred modalities for LV-related studies since it is non-invasive, presents high spatio-temporal resolution, has a good signal-to-noise ratio and allows clear identification of the tissues and muscles of interest [6].

Figure 1: Illustration of indices of the left cardiac ventricle (based on Fig. 1 from [10]). (a) Cavity area (brown) and myocardial area (orange). (b) Directional dimensions of cavity (white arrows). (c) Regional wall thicknesses. A: anterior; AS: anteroseptal; IS: inferoseptal; I: inferior; IL: inferolateral; AL: anterolateral. (d) Cardiac phase (systole or diastole).

The classical approach to LV quantification consists in estimating such indices by means of automatic segmentation [3, 4, 5, 6, 7, 9]. Segmentation is usually performed following supervised learning approaches, which require expert manual annotations contouring the edges of the myocardium for training. Once the segmentation is performed, the indices are computed from the resulting mask. Therefore, the accuracy of the predicted indices is conditioned on the quality of the segmentation. In this work, we follow an alternative strategy that directly estimates the indices of interest from the input image sequence. Inspired by the work of [11, 10, 12], our model is based on a convolutional neural network directly operating on images and regressing the target indices. Unlike previous approaches such as [10], where the temporal dynamics of cardiac sequences are incorporated using recurrent neural networks (RNNs), we propose a simple but effective strategy based on the use of spatio-temporal convolutions [8]. In the context of video analysis, spatio-temporal convolutions are standard 3D convolutions that operate on spatio-temporal video volumes [7]. Here we employ them to process subsets of temporally contiguous CMR slices, leveraging temporal information towards improving prediction accuracy.

We investigate the use of spatio-temporal convolutions for estimating cardiac phase, directional dimensions of the cavity, regional wall thicknesses and area of cavity and myocardium under the hypothesis that such indices may be better explained when taking into account the temporal dynamics of the heart. We benchmark the proposed architecture using the LVQuan Challenge 2018 dataset (challenge website: https://lvquan18.github.io/), which provides CMR sequences with annotations for the aforementioned indices, and provide empirical evidence that incorporating the temporal dynamics of the heart through 3D spatio-temporal convolutions improves prediction accuracy when compared with single-slice models.

2 Materials and methods

2.1 Architecture

Figure 2: Overview of proposed architecture.

An overview of the proposed CNN architecture is presented in Figure 2. The network takes sequences of slices and outputs the corresponding indices for the central slice. In this way, we incorporate information from the surrounding slices, easing the prediction task. In what follows, we describe in detail the main components of the proposed architecture.

Encoder-CNN. We use a first CNN (referred to as encoder-CNN in Figures 2 and 3) to extract informative features from individual slices. Inspired by [11], we designed the per-slice encoding phase using a two-layer CNN where the convolutional and pooling kernels are of size 5x5, instead of the frequently used 3x3, to introduce more shift invariance (see Figure 3 for more details). We use the ReLU activation function and batch normalization to ease the training process.

Spatio-Temporal CNN. After the encoding phase, the 40 feature maps generated by each individual encoder-CNN are used to construct a spatio-temporal volume with 40 channels per temporal slice. This volume is then processed using 3D convolutions that operate on the temporal and spatial dimensions (see Figure 3), producing compound feature maps that incorporate information from both of them. This module is composed of two 3D convolutional layers with kernels of size 3x5x5 and 2x5x5 when considering slices. When considering slices, the proposed architecture is modified by using padding in the temporal dimension ( ) and adding an extra convolution ( ) so that the shape of the output tensor matches 1x6x6, the size required by the CNN Regression and Fully Connected modules. ReLU activations and batch normalization are also used in this module.

Final parallel branches. After fusing the spatio-temporal features, two parallel branches are derived: (i) the first branch corresponds to a shallow CNN coupled after the spatio-temporal module, acting as a regressor of the directional dimensions, wall thicknesses and areas; (ii) in the second branch, a third convolutional layer is coupled to the spatio-temporal module, followed by a fully connected multilayer perceptron (MLP) with 640 neurons in the hidden layer and 2 output neurons encoding the probability for the cardiac phase (systole or diastole).

Figure 3: (a) Detailed overview of the spatio-temporal CNN based on 3D convolutions. (b) Zoomed version of the individual encoder-CNNs: for a single input slice of size 80x80 it outputs 40 feature maps of size 16x16 which are then fed to the spatio-temporal CNN.

Training procedure and loss function. We train the proposed network by minimizing a loss function over sets of temporally contiguous slices where annotations are provided only for the central slice. Given a set of slices $X$, ground-truth annotations $Y$ for the central slice and the corresponding predictions $\hat{Y}$ from the proposed neural network, the loss function is defined as:

$$\mathcal{L}(Y, \hat{Y}) = \mathcal{L}_{mse}(Y, \hat{Y}) + \mathcal{L}_{ce}(Y, \hat{Y}) + \lambda \, \mathcal{L}_{reg}$$

where $\mathcal{L}_{mse}$ is the mean squared error between predictions and ground truth, $\mathcal{L}_{ce}$ is the cross-entropy loss, $\mathcal{L}_{reg}$ is the regularizer (L2 norm of the network weights) and $\lambda$ is a weighting factor. We minimize this loss using mini-batch stochastic gradient descent with momentum.

Circular hypothesis. Since we require sets of temporally contiguous slices as input for our spatio-temporal architecture, we adopt a circular hypothesis: given a sequence of slices, the last slice is assumed to be temporally followed by the first one (slice 0). This hypothesis was corroborated by visual inspection of the training dataset. Following this strategy, we generate a set of contiguous slices around every slice of a sequence and use these sets as independent data samples. At prediction time, we employ the same hypothesis to generate the sets of test slices.
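The circular hypothesis amounts to modular indexing over the frames of a sequence. The following is a minimal sketch, assuming an odd window length; the name `circular_windows` and the symbol `k` for the number of contiguous slices are illustrative choices, not the authors' identifiers.

```python
def circular_windows(num_frames, k):
    """Return, for every central frame, the indices of its k contiguous frames.

    Indices wrap around the sequence, so the last frame is followed by frame 0.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number")
    half = k // 2
    return [
        [(centre + offset) % num_frames for offset in range(-half, half + 1)]
        for centre in range(num_frames)
    ]


# One window per frame of a 20-frame cardiac cycle, each with 3 slices.
windows = circular_windows(20, 3)
```

Under this scheme a 20-frame sequence yields 20 windows, so no frame is discarded at the boundaries of the cycle.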

2.2 Dataset and experimental setting

Our method is experimentally validated using the training data provided by the LVQuan challenge 2018, composed of short axis cardiac MR images of 145 subjects. For each subject, it contains 20 frames corresponding to a complete cardiac cycle, giving a total of 2900 images in the dataset. Pixel spacing ranges from 0.6836 mm/pixel to 2.0833 mm/pixel, with a mean of 1.5625 mm/pixel. The images have been collected from 3 different hospitals and subjects are between 16 and 97 years of age, with an average of 58.9 years. All cardiac images undergo several preprocessing steps (including landmark labelling, rotation, ROI cropping, and resizing). The resulting images are roughly aligned, with a dimension of 80x80. Epicardium and endocardium borders were manually annotated by radiologists, and used to extract the ground truth LV indices and cardiac phase. The values of regional wall thickness and the dimensions of the cavity are normalized by the dimension of the image, while the areas are normalized by the pixel number (6400).
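Because the targets are normalized, reporting errors in physical units requires undoing that scaling. The sketch below assumes that `spacing_mm` is the pixel spacing of the 80x80 resized image; both helper names are hypothetical, and any additional resize factor stored with the dataset would need to be applied as well.

```python
IMAGE_SIZE = 80  # side length, in pixels, of the preprocessed images


def dimension_to_mm(normalized_value, spacing_mm):
    """Convert a normalized cavity dimension or wall thickness to millimetres."""
    return normalized_value * IMAGE_SIZE * spacing_mm


def area_to_mm2(normalized_area, spacing_mm):
    """Convert a normalized area (fraction of the 6400 pixels) to square millimetres."""
    return normalized_area * IMAGE_SIZE ** 2 * spacing_mm ** 2
```

For example, a normalized dimension of 0.5 at the mean spacing of 1.5625 mm/pixel corresponds to 40 pixels, that is, 62.5 mm.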

In our experiments, we used cross validation with 3, 5 and 7 folds as suggested by the LVQuan organizers, resulting in subject partitions of size (49, 48, 48), (29, 29, 29, 29, 29) and (21, 21, 21, 21, 21, 20, 20) respectively. We used a learning rate of 1e-4 and a momentum of 0.5; the hyperparameters were obtained by grid search.
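The partition sizes quoted above follow from distributing the 145 subjects as evenly as possible across folds. A minimal sketch of such a subject-level split is given below; assigning consecutive subject indices to each fold is an assumption made for illustration, and the function names are hypothetical.

```python
def fold_sizes(num_subjects, num_folds):
    """Sizes of near-equal folds; the first folds absorb the remainder."""
    base, extra = divmod(num_subjects, num_folds)
    return [base + 1 if i < extra else base for i in range(num_folds)]


def subject_folds(num_subjects, num_folds):
    """Assign consecutive subject indices to folds (assumed ordering)."""
    folds, start = [], 0
    for size in fold_sizes(num_subjects, num_folds):
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```

Splitting by subject rather than by image keeps all 20 frames of a subject in the same fold, which avoids leaking frames of a test subject into training.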

The model was implemented in Python using PyTorch and trained on a GPU. The source code for the proposed architecture is publicly available at https://github.com/alejandrodebus/SpatioTemporalCNN_lvquan

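Evaluation criteria. Pearson correlation coefficient (PCC) and Mean Absolute Error (MAE) were used to assess the performance of the algorithms for estimation of areas, dimensions and regional wall thicknesses. Error Rate (ER) was used to assess the performance for cardiac phase classification.

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{PCC} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2 \, \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}$$

where $N$ is the number of samples, $y_i$ is the ground-truth value and $\hat{y}_i$ is the estimated value. $\bar{y}$ and $\bar{\hat{y}}$ are their mean values, respectively.

$$\mathrm{ER} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(\hat{p}_i \neq p_i\right) \times 100\%$$

where $\mathbb{1}(\cdot)$ is the indicator function, and $\hat{p}_i$ and $p_i$ are the estimated and ground truth value of the cardiac phase, respectively.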
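The three evaluation criteria can be computed with a few lines of standard-library Python. This is a sketch of the usual textbook definitions of MAE, PCC and error rate, not the challenge's official evaluation script; the function names are illustrative.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and estimated values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true, y_pred):
    """Pearson correlation coefficient between ground truth and estimates."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def error_rate(phase_true, phase_pred):
    """Percentage of frames whose predicted cardiac phase is wrong."""
    wrong = sum(1 for t, p in zip(phase_true, phase_pred) if t != p)
    return 100.0 * wrong / len(phase_true)
```

Note that `pcc` is undefined when either input is constant, since the denominator becomes zero; a production implementation would need to handle that case.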

3 Results and discussion

The effectiveness of the proposed method was validated under the experimental setting discussed in Section 2.2. We measured the influence of the number of contiguous slices fed to the network, using 1 (single slice), 3, 5 and 7 slices for the proposed spatio-temporal model based on 3D convolutions, and compared it with the state-of-the-art method recently proposed in [10]. Results are presented in Table 1 for a 5-fold cross validation setting (the same experimental setting and dataset were used in [10]). Note that using sets of multiple slices significantly outperforms the single-slice configuration for all the indices, highlighting the importance of the temporal dynamics. However, considering and slices achieves a similar performance. Therefore, we consider as enough temporal context for the remaining experiments.

In quantitative terms, we reduce the error rate from 28.45.06% to 3.85% for cardiac phase estimation and the MAE from 270 to 190, 3.18 to 2.29 and 2.62 to 1.42 on average for the areas, directional dimensions of the cavity and regional wall thickness, respectively, when comparing the single-slice model with the selected multi-slice model. Moreover, considering the baseline [10], we observe similar results for most indices, except for the phase, where our model improves over the state of the art by a significant margin (reducing the error rate from 8.2% to 3.2%).

Finally, Table 2 presents the results for phase, directional dimensions, regional wall thicknesses and areas of cavity and myocardium obtained with the best performing spatio-temporal model under 3 different cross-validation configurations (3, 5 and 7 folds), as required by the LVQuan challenge organizers. Note that performance is consistent across the three configurations.

=1 =3 =7 DMTRL [10]
Dimensions ()
Regional wall Thickness
wt1 (IS)
wt2 (I)
wt3 (IL)
wt4 (AL)
wt5 (A)
wt6 (AS)
Phase (ER%)
Table 1: Sensitivity analysis for the number of neighbouring slices when using the spatio-temporal model based on 3D convolutions with 5-fold cross validation, compared with the state-of-the-art DMTRL proposed in [10]. Note that incorporating the temporal dynamics by considering multiple slices makes a significant difference with respect to the single-slice case. However, considering and slices present a similar performance. Therefore, we consider as enough temporal context for the remaining experiments. When comparing with [10] we observe similar results for most indices, except for the phase, where the proposed model improves on the state of the art significantly (from 8.2% to 3.2%).

4 Conclusions

In this work, we proposed a new CNN architecture for LV quantification that incorporates the dynamics of the heart by means of spatio-temporal convolutions. Unlike other methods that rely on more complex mechanisms (such as recurrent neural networks [10]), we employ simple 3D convolutions to fuse information coming from temporally contiguous CMR slices. We generated training samples following a circular hypothesis, meaning that the first and last slices of the sequences are considered temporally contiguous. Validation was performed using CMR sequences provided by the LVQuan challenge organizers. Results show that incorporating temporal information through spatio-temporal convolutions significantly boosts prediction performance for all the indices. Moreover, when compared with the RNN-based model presented in [10], we observe a significant reduction in error rate for phase estimation (from 8.2% to 3.85%) while keeping equivalent results for the other indices. More importantly, our method achieves state-of-the-art results employing simple 3D convolutions instead of the more complex parallel RNN and Bayesian-based multitask relationship learning module proposed in [10].

In this work we incorporated the spatio-temporal dynamics by means of 3D convolutions. However, if we consider the slices as multiple channels of a standard 2D architecture, conventional 2D convolutions could also be used, reducing the complexity of the model. Moreover, temporal information encoded by inter-slice deformation fields (obtained through deep learning based image registration methods [1]) could also be considered to improve model performance. In the future, we plan to explore the performance of these models when compared with the proposed architecture.

N-fold cross validation as required by LVQuan Challenge (PCC)
                              N=3      N=5      N=7
Areas
  a-cav                       0.932    0.940    0.939
  a-myo                       0.915    0.923    0.930
  average                     0.924    0.932    0.935
Dimensions
  dim1                        0.938    0.961    0.959
  dim2                        0.926    0.957    0.954
  dim3                        0.933    0.963    0.958
  average                     0.932    0.960    0.957
Regional wall thickness
  wt1 (IS)                    0.831    0.854    0.857
  wt2 (I)                     0.768    0.797    0.802
  wt3 (IL)                    0.743    0.765    0.755
  wt4 (AL)                    0.776    0.785    0.797
  wt5 (A)                     0.829    0.842    0.861
  wt6 (AS)                    0.857    0.870    0.873
  average                     0.801    0.819    0.824
Phase (ER %)
Table 2: Results obtained for the LVQuan challenge dataset using the proposed spatio-temporal model (areas of LV cavity and myocardium, directional dimensions, wall thicknesses and cardiac phase) for N-fold cross validation with N = 3, 5 and 7.


The present work used computational resources of the Pirayu Cluster, acquired with funds from the Santa Fe Science, Technology and Innovation Agency (ASACTEI), Government of the Province of Santa Fe, through Project AC-00010-18, Resolution Nº 117/14. This equipment is part of the National System of High Performance Computing of the Ministry of Science, Technology and Productive Innovation of the Republic of Argentina. We also thank NVidia for the donation of a GPU used for this project. Enzo Ferrante is a beneficiary of an AXA Research Fund grant.


  • [1] Ferrante, E., Oktay, O., Glocker, B., Milone, D.: On the adaptability of unsupervised cnn-based deformable image registration to unseen image domains. In: 9th International Conference on Machine Learning in Medical Imaging (MLMI 2018 - MICCAI) (2018 - In Press)
  • [2] Karamitsos, T.D., Francis, J.M., Myerson, S., Selvanayagam, J.B., Neubauer, S.: The role of cardiovascular magnetic resonance imaging in heart failure. Journal of the American College of Cardiology 54(15), 1407–1424 (2009)
  • [3] Peng, P., Lekadir, K., Gooya, A., Shao, L., Petersen, S.E., Frangi, A.F.: A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging. Magnetic Resonance Materials in Physics, Biology and Medicine 29(2), 155–195 (2016)
  • [4] Petitjean, C., Dacher, J.N.: A review of segmentation methods in short axis cardiac mr images. Medical image analysis 15(2), 169–184 (2011)
  • [5] Poudel, R.P., Lamata, P., Montana, G.: Recurrent fully convolutional neural networks for multi-slice mri cardiac segmentation. In: Reconstruction, Segmentation, and Analysis of Medical Images, pp. 83–94. Springer (2016)
  • [6] Suinesiaputra, A., Bluemke, D.A., Cowan, B.R., Friedrich, M.G., Kramer, C.M., Kwong, R., Plein, S., Schulz-Menger, J., Westenberg, J.J., Young, A.A., et al.: Quantification of lv function and mass by cardiovascular magnetic resonance: multi-center variability and consensus contours. Journal of Cardiovascular Magnetic Resonance 17(1),  63 (2015)
  • [7] Tan, L.K., Liew, Y.M., Lim, E., McLaughlin, R.A.: Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine mr sequences. Medical image analysis 39, 78–86 (2017)
  • [8] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Computer Vision (ICCV), 2015 IEEE International Conference on. pp. 4489–4497. IEEE (2015)
  • [9] Tran, P.V.: A fully convolutional neural network for cardiac segmentation in short-axis mri. arXiv preprint arXiv:1604.00494 (2016)
  • [10] Xue, W., Brahm, G., Pandey, S., Leung, S., Li, S.: Full left ventricle quantification via deep multitask relationships learning. Medical image analysis 43, 54–65 (2018)
  • [11] Xue, W., Lum, A., Mercado, A., Landis, M., Warrington, J., Li, S.: Full quantification of left ventricle via deep multitask learning network respecting intra-and inter-task relatedness. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 276–284. Springer (2017)
  • [12] Xue, W., Nachum, I.B., Pandey, S., Warrington, J., Leung, S., Li, S.: Direct estimation of regional wall thicknesses via residual recurrent neural network. In: International Conference on Information Processing in Medical Imaging. pp. 505–516. Springer (2017)