Left ventricle quantification through spatio-temporal CNNs
Cardiovascular diseases are among the leading causes of death globally. Cardiac left ventricle (LV) quantification is known to be one of the most important tasks for the identification and diagnosis of such pathologies. In this paper, we propose a deep learning method that incorporates 3D spatio-temporal convolutions to perform direct left ventricle quantification from cardiac MR sequences. Instead of analysing slices independently, we process stacks of temporally adjacent slices by means of 3D convolutional kernels which fuse the spatio-temporal information, incorporating the temporal dynamics of the heart into the learned model. We show that incorporating such information by means of spatio-temporal convolutions into standard LV quantification architectures improves the accuracy of the predictions when compared with single-slice models, achieving competitive results for all cardiac indices and significantly improving on the state of the art (Xue et al., 2018, MedIA) for cardiac phase estimation.
In 2015, around 17.7 million people died worldwide due to heart diseases. Left ventricle (LV) quantification is a key factor for the identification and diagnosis of such pathologies. However, the estimation of cardiac indices remains a very complex task due to the intricate temporal dynamics and the inter-subject variability of the cardiac structures. Indices such as cavity and myocardium area, regional wall thickness and cavity dimensions, among others, provide useful information to diagnose various types of cardiac pathologies. Cardiovascular magnetic resonance (CMR) is one of the preferred modalities for LV-related studies since it is non-invasive, presents high spatio-temporal resolution, has a good signal-to-noise ratio and allows clear identification of the tissues and muscles of interest.
Segmentation is usually performed following supervised learning approaches, which require expert manual annotations contouring the edges of the myocardium for training. Once the segmentation is performed, the indices are computed from the resulting mask. Therefore, the accuracy of the predicted indices is conditioned on the quality of the segmentation. In this work, we follow an alternative strategy that directly estimates the indices of interest from the input image sequence. Inspired by the work of [11, 10, 12], our model is based on a convolutional neural network directly operating on images and regressing the target indices. Unlike previous approaches, where the temporal dynamics of cardiac sequences are incorporated using recurrent neural networks (RNNs), we propose a simple but effective strategy based on the use of spatio-temporal convolutions. In the context of video analysis, spatio-temporal convolutions are standard 3D convolutions that operate on spatio-temporal video volumes (Tran et al., 2015). Here we employ them to process subsets of temporally contiguous CMR slices, leveraging temporal information to improve prediction accuracy.
We investigate the use of spatio-temporal convolutions for estimating cardiac phase, directional dimensions of the cavity, regional wall thicknesses and area of cavity and myocardium under the hypothesis that such indices may be better explained when taking into account the temporal dynamics of the heart. We benchmark the proposed architecture using the LVQuan Challenge 2018 dataset (LVQuan Challenge website: https://lvquan18.github.io/), which provides CMR sequences with annotations for the aforementioned indices, and provide empirical evidence that incorporating the temporal dynamics of the heart through 3D spatio-temporal convolutions improves prediction accuracy when compared with single-slice models.
An overview of the proposed CNN architecture is presented in Figure 2. The network takes a sequence of temporally contiguous slices and outputs the corresponding indices for the central slice. In this way, we incorporate information from the surrounding slices, easing the prediction task.
In what follows, we describe in detail the main components of the proposed architecture.
Encoder-CNN. We use a first CNN (referred to as encoder-CNN in Figures 2 and 3) to extract informative features from individual slices. Inspired by prior work, we designed the per-slice encoding phase using a two-layer CNN where the convolutional and pooling kernels are of size 5x5, instead of the frequently used 3x3, to introduce more shift invariance (see Figure 3).
Spatio-Temporal CNN. After the encoding phase, the 40 feature maps generated by each individual encoder-CNN are used to construct a spatio-temporal volume with 40 channels per temporal slice. This volume is then processed using 3D convolutions that operate on the temporal and spatial dimensions (see Figure 3), producing compound feature maps that incorporate information from both of them. In its base configuration, this module is composed of two 3D convolutional layers with kernels of size 3x5x5 and 2x5x5. For the other numbers of input slices considered, the architecture is modified by using padding in the temporal dimension and adding an extra convolution, so that the shape of the output tensor matches 1x6x6, the size required by the CNN Regression and Fully Connected modules. ReLU activations and batch normalization are also used in this module.
Final parallel branches. After fusing the spatio-temporal features, two parallel branches are derived: (i) the first branch corresponds to a shallow CNN coupled after the spatio-temporal module, acting as a regressor of the directional dimensions, wall thickness and areas; (ii) in the second branch, a third convolutional layer is coupled to the spatio-temporal module, followed by a fully connected multilayer perceptron (MLP) with 640 neurons in the hidden layer and 2 output neurons encoding the probability for the cardiac phase (systole or diastole).
Training procedure and loss function. We train the proposed network by minimizing a loss function over sets of slices where annotations are provided only for the central slice. Given a set of slices, the ground-truth annotations for the central slice and the corresponding predictions from the proposed neural network, the loss function combines three terms: the mean squared error between predictions and ground truth, the cross-entropy loss, and a regularizer (L2 norm of the network weights), balanced by a weighting factor. We minimize this loss using mini-batch stochastic gradient descent with momentum.
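As a rough illustration of how such a loss might be assembled, the sketch below computes it for a single sample in plain Python. It rests on several assumptions that are not stated in the text: the weighting factor (called `lam` here) is taken to multiply the regularizer, the regularizer is taken to be the squared L2 norm, and all function names are hypothetical. It is not the authors' implementation.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def l2_penalty(weights):
    """Squared L2 norm of a flat list of weights (assumed form of the regularizer)."""
    return sum(w * w for w in weights)

def total_loss(pred_indices, true_indices, phase_logits, true_phase, weights, lam):
    """Regression term + classification term + lam * regularizer.

    Assumption: `lam` scales only the regularizer; the source does not
    say which term the weighting factor applies to.
    """
    return (mse(pred_indices, true_indices)
            + cross_entropy(phase_logits, true_phase)
            + lam * l2_penalty(weights))
```

In a PyTorch training loop the regularizer would more commonly be delegated to the optimizer's weight decay, which is a design choice rather than something the text specifies.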
Circular hypothesis. Since we require sets of temporally contiguous slices as input for our spatio-temporal architecture, given a sequence of slices, we adopt a circular hypothesis, meaning that the last slice of the sequence is temporally followed by slice 0. This hypothesis was corroborated by visual inspection of the training dataset. Following this strategy, we generate sets of slices for every sequence and use them as independent data samples. At prediction time, we employ the same hypothesis to generate the sets of test slices.
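The circular hypothesis can be sketched with modular indexing. The example below is a minimal illustration, assuming one window per frame and an odd window length `k` so that a central slice exists; the function name and the symbol `k` are illustrative choices, not taken from the source.

```python
def circular_windows(num_frames, k):
    """For each frame, return the indices of the k contiguous frames centred
    on it, wrapping around the ends of the sequence."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number so a central slice exists")
    half = k // 2
    return [
        [(centre + offset) % num_frames for offset in range(-half, half + 1)]
        for centre in range(num_frames)
    ]

# A 20-frame cardiac cycle with windows of 5 slices.
windows = circular_windows(20, 5)
print(windows[0])    # [18, 19, 0, 1, 2]
print(windows[19])   # [17, 18, 19, 0, 1]
print(len(windows))  # 20
```

Under these assumptions every frame of a sequence serves once as the central slice, so a 20-frame sequence yields 20 samples regardless of `k`.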
Our method is experimentally validated using the training data provided by the LVQuan challenge 2018, composed of short axis cardiac MR images of 145 subjects. For each subject, it contains 20 frames corresponding to a complete cardiac cycle (giving a total of 2900 images in the dataset with pixel spacing ranging from 0.6836 mm/pixel to 2.0833 mm/pixel, with a mean of 1.5625 mm/pixel). The images have been collected from 3 different hospitals and subjects are between 16 and 97 years of age, with an average of 58.9 years. All cardiac images undergo several preprocessing steps (including historical tagging, rotation, ROI clipping, and resizing). The resulting images are roughly aligned with a dimension of 80x80. Epicardium and endocardium borders were manually annotated by radiologists, and used to extract the ground truth LV indices and cardiac phase. The values of regional wall thickness and the dimensions of the cavity are normalized by the dimension of the image, while the areas are normalized by the pixel number (6400).
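To make the normalization concrete, the following sketch maps normalized values back to pixel units of the 80x80 image. The conversion to millimetres is included only as an assumption: it supposes that an effective pixel spacing for the resized image (`spacing_mm`, a hypothetical parameter) is available, which the text does not state.

```python
IMAGE_SIZE = 80                 # side of the preprocessed image, in pixels
PIXEL_COUNT = IMAGE_SIZE ** 2   # 6400 pixels

def to_pixels(normalized_length):
    """Wall thicknesses and cavity dimensions are normalized by the image dimension."""
    return normalized_length * IMAGE_SIZE

def to_square_pixels(normalized_area):
    """Areas are normalized by the number of pixels."""
    return normalized_area * PIXEL_COUNT

def to_mm(normalized_length, spacing_mm):
    """Assumption: spacing_mm is the effective spacing of the resized image."""
    return to_pixels(normalized_length) * spacing_mm

def to_mm2(normalized_area, spacing_mm):
    """Assumption: isotropic spacing, so areas scale with spacing squared."""
    return to_square_pixels(normalized_area) * spacing_mm ** 2
```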
In our experiments, we used cross validation with 3, 5 and 7 folds as suggested by the LVQuan organizers, resulting in partitions of size (49, 48, 48), (29, 29, 29, 29, 29) and (21, 21, 21, 21, 21, 20, 20) subjects respectively. We used a learning rate of 1e-4 and a momentum of 0.5 (these parameters were obtained by grid search).
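The partition sizes follow from splitting the 145 subjects as evenly as possible. The short sketch below reproduces that arithmetic; the helper name is hypothetical, and it only computes sizes, not the actual subject assignment used in the challenge.

```python
def fold_sizes(num_subjects, num_folds):
    """Sizes of num_folds near-equal partitions; the first folds absorb the remainder."""
    base, extra = divmod(num_subjects, num_folds)
    return [base + 1 if i < extra else base for i in range(num_folds)]

for n in (3, 5, 7):
    print(n, fold_sizes(145, n))
# 3 [49, 48, 48]
# 5 [29, 29, 29, 29, 29]
# 7 [21, 21, 21, 21, 21, 20, 20]
```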
The model was implemented in Python using PyTorch and trained on a GPU. The source code for the proposed architecture is publicly available at https://github.com/alejandrodebus/SpatioTemporalCNN_lvquan
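Evaluation criteria. Pearson correlation coefficient (PCC) and Mean Absolute Error (MAE) were used to assess the performance of the algorithms for estimation of areas, dimensions and regional wall thicknesses. Error Rate (ER) was used to assess the performance for cardiac phase classification. Following their standard definitions:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{PCC} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2 \, \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}$$

where $N$ is the number of samples, $y_i$ is the ground-truth value and $\hat{y}_i$ is the estimated value. $\bar{y}$ and $\bar{\hat{y}}$ are their mean values, respectively.

$$\mathrm{ER} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left(\hat{p}_i \neq p_i\right) \times 100\%$$

where $\mathbf{1}(\cdot)$ is the indicator function, and $\hat{p}_i$ and $p_i$ are the estimated and ground-truth values of the cardiac phase, respectively.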
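The three evaluation metrics can be sketched in a few lines of plain Python using their standard definitions. The function names are illustrative, and this is not the challenge's official evaluation code.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and estimated values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pcc(y_true, y_pred):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def error_rate(phase_true, phase_pred):
    """Percentage of frames whose predicted cardiac phase differs from the ground truth."""
    wrong = sum(1 for t, p in zip(phase_true, phase_pred) if t != p)
    return 100.0 * wrong / len(phase_true)
```

For example, `error_rate([0, 1, 1, 0], [0, 1, 0, 0])` counts one mismatch out of four frames and returns 25.0.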
The effectiveness of the proposed method was validated under the experimental setting discussed in Section 2.2. We measured the influence of the number of contiguous slices fed to the network, using 1 (single slice), 3, 5 and 7 slices for the proposed spatio-temporal model based on 3D convolutions, and compared it with the state-of-the-art method recently proposed by Xue et al. (2018). Results are presented in Table 1 for a 5-fold cross validation setting (the same experimental setting and dataset were used by Xue et al.). Note that using sets of several slices significantly outperforms the single-slice configuration for all the indices, highlighting the importance of the temporal dynamics. However, some of the multi-slice settings achieve similar performance. Therefore, we keep the smaller of them as enough temporal context for the remaining experiments.
In quantitative terms, we reduce the error rate from 28.45.06% to 3.85% for cardiac phase estimation and the MAE from 270 to 190, 3.18 to 2.29 and 2.62 to 1.42 on average for the areas, directional dimensions of the cavity and regional wall thickness, respectively, when comparing the single-slice configuration with the selected multi-slice configuration. Moreover, compared with the baseline (Xue et al., 2018), we observe similar results for most indices, except for the phase, where our model improves over the state of the art by a significant margin (reducing the error rate from 8.2% to 3.2%).
Finally, Table 2 presents the results for 3 different cross-validation configurations (3, 5 and 7 folds), as required by the LVQuan challenge organizers, for phase, directional dimensions, regional wall thicknesses and area of cavity and myocardium, obtained with the best performing spatio-temporal model. Note that performance is consistent across folds.
In this work, we proposed a new CNN architecture for LV quantification that incorporates the dynamics of the heart by means of spatio-temporal convolutions. Unlike other methods that rely on more complex mechanisms (like recurrent neural networks), we employ simple 3D convolutions to fuse information coming from temporally contiguous CMR slices. We generated training samples following a circular hypothesis, meaning that the first and last slices of the sequences are considered temporally contiguous. Validation was performed using CMR sequences provided by the LVQuan challenge organizers. Results show that incorporating temporal information through spatio-temporal convolutions significantly boosts prediction performance for all the indices. Moreover, when compared with the RNN-based model presented by Xue et al. (2018), we observe a significant reduction in error rate for phase estimation (from 8.2% to 3.85%) while keeping equivalent results for the other indices. More importantly, our method achieves state-of-the-art results employing simple 3D convolutions instead of the more complex parallel RNN and Bayesian-based multitask relationship learning module proposed in that work.
In this work we incorporated the spatio-temporal dynamics by means of 3D convolutions. However, if we consider the slices as multiple channels of a standard 2D architecture, conventional 2D convolutions could also be used, reducing the complexity of the model. Moreover, temporal information encoded by inter-slice deformation fields (obtained through deep learning based image registration methods such as Ferrante et al., 2018) could also be considered to improve model performance. In the future, we plan to explore the performance of these models when compared with the proposed architecture.
Table 2: N-fold cross validation as required by the LVQuan Challenge (reported indices include regional wall thickness and phase, ER %).
The present work used computational resources of the Pirayu Cluster, acquired with funds from the Santa Fe Science, Technology and Innovation Agency (ASACTEI), Government of the Province of Santa Fe, through Project AC-00010-18, Resolution Nº 117/14. This equipment is part of the National System of High Performance Computing of the Ministry of Science, Technology and Productive Innovation of the Republic of Argentina. We also thank NVidia for the donation of a GPU used for this project. Enzo Ferrante is a beneficiary of an AXA Research Fund grant.
Ferrante, E., Oktay, O., Glocker, B., Milone, D.: On the adaptability of unsupervised cnn-based deformable image registration to unseen image domains. In: 9th International Conference on Machine Learning in Medical Imaging (MLMI 2018 - MICCAI) (2018 - In Press)
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Computer Vision (ICCV), 2015 IEEE International Conference on. pp. 4489–4497. IEEE (2015)