Temporal Convolution Networks for Real-Time Abdominal Fetal Aorta Analysis with Ultrasound

07/11/2018 · Nicolò Savioli et al. · Università di Padova and University of Warwick

The automatic analysis of ultrasound sequences can substantially improve the efficiency of clinical diagnosis. In this work we present our attempt to automate the challenging task of measuring the vascular diameter of the fetal abdominal aorta from ultrasound images. We propose a neural network architecture consisting of three blocks: a convolutional layer for the extraction of imaging features, a Convolution Gated Recurrent Unit (C-GRU) for enforcing the temporal coherence across video frames and exploiting the temporal redundancy of a signal, and a regularized loss function, called CyclicLoss, to impose our prior knowledge about the periodicity of the observed signal. We present experimental evidence suggesting that the proposed architecture can reach an accuracy substantially superior to previously proposed methods, providing an average reduction of the mean squared error from 0.31 mm^2 (state-of-art) to 0.09 mm^2, and a relative error reduction from 8.1% to 5.3%. The mean execution speed of the proposed approach of 289 frames per second makes it suitable for real time clinical use.







1 Introduction

Fetal ultrasound (US) imaging plays a fundamental role in the monitoring of fetal growth during pregnancy and in the assessment of fetal well-being. Growth monitoring is becoming increasingly important, since there is epidemiological evidence that abnormal birth weight is associated with an increased predisposition to diseases related to cardiovascular risk (such as diabetes, obesity and hypertension) in children and adults [1].

Among the possible biomarkers of adverse cardiovascular remodelling in fetuses and newborns, the most promising are the Intima-Media Thickness (IMT) and the stiffness of the abdominal aorta, both measured by ultrasound examination. Obtaining reliable measurements depends critically on the accurate estimation of the diameter of the aorta over time. However, the poor signal-to-noise ratio of US data and fetal movement make the acquisition of a clear and stable US video challenging. Moreover, the measurements rely either on visual assessment at the bedside during the patient examination, or on tedious, error-prone and operator-dependent review of the data with manual tracing at a later time. Very few attempts towards automated assessment have been presented [2, 3], all of which have computational requirements that prevent their use in real time; as such, they have reduced appeal for clinical use. In this paper we describe a method for automated measurement of the abdominal aortic diameter directly from fetal US videos. We propose a neural network architecture that is able to process US videos in real time and that leverages both the temporal redundancy of US videos and the quasi-periodicity of the aorta diameter.

The main contributions of the proposed method are as follows. First, we show that a shallow CNN is able to learn imaging features and outperforms classical methods, such as level sets, for fetal abdominal aorta diameter prediction. Second, we add to the CNN a Convolution Gated Recurrent Unit (C-GRU) [15] for exploiting the temporal redundancy of the features extracted by the CNN from the US video sequence. Finally, we add a new penalty term to the loss function used to train the network, in order to exploit the periodic variations of the diameter.

2 Related work

The interest in measuring the diameter and intima-media thickness (IMT) of major vessels stems from their importance as biomarkers of hypertensive damage and atherosclerosis in adults. Typically, the IMT is assessed on the carotid artery by identifying its lumen and the different layers of its wall on high-resolution US images. Improvements have been provided by the design of semi-automatic and automatic methods based mainly on the analysis of image intensity profiles, distributions and gradients, and more recently on active contours. For a comprehensive review of these classical methods we refer the reader to [4] and [5]. In the prenatal setting, the lower image quality, due to the need to image deeper in the mother's womb and to the movement of the fetus, makes the measurement of the IMT biomarker, here taken on the abdominal aorta, challenging.

Methods that proved successful for adult carotid image analysis do not perform well on such data, for which only a handful of methods (semi-automatic or automatic) have been proposed, making use of classical tracing methods and mixture-of-Gaussian modelling of the blood-lumen and media-adventitia interfaces [2], or of level-set segmentation with additional regularizing terms tailored to the specific task [3]. However, their sensitivity to image quality and their lengthy computation have prevented easy use in the clinical routine.

Deep learning approaches have outperformed classical methods in many medical tasks [8]. The first attempt at using a CNN for the measurement of carotid IMT has been made only recently [9]. In this work, two separate CNNs are used to localize a region of interest and then segment it to obtain the lumen-intima and media-adventitia regions. Further classical post-processing steps are then used to extract the boundaries from the CNN-based segmentation. The method assumes the presence of strong and stable gradients across the vessel walls, and extracts from the US sequence only the frames related to the same cardiac phase, obtained from a concomitant ECG signal.

However, the exploitation of temporal redundancy in US sequences was shown to improve the overall detection results for the fetal heart [11], where the use of a CNN coupled with a recurrent neural network (RNN) is strategic. Other works propose a similar approach to detect the presence of standard planes in prenatal US data, using a CNN with Long Short-Term Memory (LSTM) units [10].


3 Datasets

This study makes use of a dataset consisting of 25 ultrasound video sequences acquired during routine third-trimester pregnancy check-up at the Department of Woman and Child Health of the University Hospital of Padova (Italy). The local ethical committee approved the study and all patients gave written informed consent.

Fetal US data were acquired using a US machine (Voluson E8, GE) equipped with a 5 MHz linear array transducer, according to the guidelines in [6, 7], with an image dimension of 720x960 pixels, a variable resolution between 0.03 and 0.1, and a mean frame rate of 47 fps. Gain settings were tuned to enhance the visual quality and contrast during the examination. The length of each video is between 2 s and 15 s, ensuring that at least one full cardiac cycle is imaged.

After the examination, the video of each patient was reviewed and a relevant video segment was selected for semi-automatic annotation based on its visual quality and length: all frames of the segment were processed with the algorithm described in [2], and the diameters of all frames in the segments were then manually reviewed and corrected. The length of the selected segments varied between 21 frames (0.5 s) and 126 frames (2.5 s). The 25 annotated segments in the dataset were then randomly divided into training, validation and testing sets. In order to keep the computational and memory requirements low, each frame was cropped to a square aspect ratio and then resized. We also make this dataset public to allow reproducibility of the results.
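The cropping and resizing step described above can be sketched as follows. The 128-pixel target size is a placeholder (the actual value used in the paper was lost from the text), and the nearest-neighbour sampling is an assumption chosen to keep the sketch dependency-free:

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, out_size: int = 128) -> np.ndarray:
    """Center-crop a frame to a square aspect ratio, then resize.

    out_size=128 is a placeholder, not the paper's value. Nearest-neighbour
    sampling via index selection avoids any imaging dependency."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    square = frame[top:top + side, left:left + side]
    # Nearest-neighbour resize: pick source rows/columns for each output pixel.
    idx = (np.arange(out_size) * side / out_size).astype(int)
    return square[np.ix_(idx, idx)]
```

Applied to the 720x960 frames of this dataset, the crop keeps the central 720x720 region before downsampling.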

4 Network architecture

Our output is the predicted value of the diameter of the abdominal aorta at each time point. Our proposed deep learning solution consists of three main components (see Figure 1): a Convolutional Neural Network (CNN) that captures the salient characteristics of the ultrasound input images; a Convolution Gated Recurrent Unit (C-GRU) [15] that exploits the temporal coherence through the sequence; and a regularized loss function, called CyclicLoss, that exploits the redundancy between adjacent cardiac cycles.

Our input consists of a set of sequences, each a series of frames x_t at time t, with t = 1, ..., T. At each time point t, the CNN extracts feature maps of dimensions k × N × M, where k is the number of maps and N and M are their in-plane pixel dimensions, which depend on the extent of dimensionality reduction obtained by the CNN through its pooling operators.

The feature maps are then processed by a C-GRU layer [15]. The C-GRU combines the current feature maps x_t with an encoded representation of the feature maps extracted at previous time points of the sequence to obtain an updated encoded representation h_t, the current state, at time t: this allows the network to exploit the temporal coherence in the data. The state h_t of the C-GRU layer is obtained through two specific gates designed to control the information inside the unit: a reset gate, r_t, and an update gate, z_t, defined as follows:

z_t = σ(W_xz ∗ x_t + W_hz ∗ h_{t−1} + b_z)
r_t = σ(W_xr ∗ x_t + W_hr ∗ h_{t−1} + b_r)

where σ is the sigmoid function, and W_xz, W_hz, W_xr, W_hr are recurrent weight matrices whose first subscript letter refers to the input of the convolution operator (either the feature maps x or the state h), and whose second subscript letter refers to the gate (reset r or update z); each b is a bias vector. In this notation, ∗ defines the convolution operation. The current state is then obtained as:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ tanh(W_xh ∗ x_t + W_hh ∗ (r_t ⊙ h_{t−1}) + b_h)

where ⊙ denotes the element-wise product and W_xh and W_hh are recurrent weight matrices for x_t and h_{t−1}, used to balance the new information represented by the feature maps derived from the current input data with the information obtained by observing previous data. The state h_t is then passed on for updating the state at the next time point and, at the same time, is flattened and fed into the last part of the network, built from Fully Connected (FC) layers that progressively reduce the input vector to a scalar output representing the current diameter estimate d̂_t.
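Under these definitions, one C-GRU state update can be sketched in NumPy. For brevity the k × k spatial convolutions are reduced to 1 × 1 convolutions (a per-pixel channel mixing), and the weight dictionary `W` is a hypothetical parameterisation, not the paper's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cgru_step(x, h_prev, W):
    """One C-GRU state update following the gate equations above.

    x: input feature maps, shape (C_in, H, W).
    h_prev: previous state, shape (C_h, H, W).
    W: dict of weight matrices; spatial convolutions are simplified to
       1x1 convolutions (channel mixing at each pixel) for readability."""
    def conv1x1(M, t):
        # Mix channels at every spatial location: (out, in) x (in, H, W).
        return np.einsum('oc,chw->ohw', M, t)
    z = sigmoid(conv1x1(W['xz'], x) + conv1x1(W['hz'], h_prev))  # update gate
    r = sigmoid(conv1x1(W['xr'], x) + conv1x1(W['hr'], h_prev))  # reset gate
    h_tilde = np.tanh(conv1x1(W['xh'], x) + conv1x1(W['hh'], r * h_prev))
    # Convex combination of the old state and the candidate state.
    return (1.0 - z) * h_prev + z * h_tilde
```

Because z and r lie in (0, 1), the new state is always a bounded blend of the previous state and the candidate state, which is what mitigates vanishing gradients over long sequences.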

Figure 1: The deep-learning architecture proposed for abdominal aorta diameter prediction. The blue blocks represent feature extraction through a CNN (AlexNet), which takes as input a US sequence and provides for each frame a feature map that is passed to a Convolution Gated Recurrent Unit (C-GRU) (yellow circle), which encodes and combines the information from different time points to exploit the temporal coherence. The fully connected block (FC, in green) takes the current encoded state as features to estimate the aorta diameter.

4.1 CyclicLoss

Under the assumption that the pulsatility of the aorta follows a periodic pattern with the cardiac cycle, the diameter of the vessel at corresponding instants of the cardiac cycle should ideally be equal. Assuming a known cardiac period T, we propose to add a regularization term to the loss function used to train the network, so as to penalize large differences between diameter values estimated at time points that are one cardiac period apart.

We call this regularization term CyclicLoss (CL), computed as the L2 norm between pairs of predictions at the same point of the heart cycle in adjacent cycles:

CL = Σ_{n=1}^{N_c − 1} Σ_{t=1}^{T} ‖ d̂_{t+nT} − d̂_{t+(n−1)T} ‖_2

where T is the period of the cardiac cycle, N_c is the number of whole cycles present in the sequence, and d̂_t is the estimated diameter at time t. Notably, T is determined through a peak detection algorithm on the diameter signal, and the average of all peak-to-peak distances defines its value, while N_c is calculated as the total length of the signal divided by T.
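The period estimation can be sketched as below; the simple local-maximum rule is an assumption, since the paper does not detail its peak detector:

```python
import numpy as np

def estimate_period(d: np.ndarray) -> float:
    """Estimate the cardiac period T as the mean peak-to-peak distance
    of the diameter trace d, as described above."""
    # A sample is a peak if it is strictly larger than both neighbours.
    peaks = np.where((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:]))[0] + 1
    if len(peaks) < 2:
        raise ValueError("need at least two peaks to estimate a period")
    return float(np.mean(np.diff(peaks)))
```

On noisy clinical traces a smoothing step before peak detection would likely be needed; this sketch assumes a clean quasi-periodic signal.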

The loss to be minimized is therefore a combination of the classical mean squared error (MSE) with the CL, the balance between the two being controlled by a constant λ:

L = (1/T_s) Σ_{t=1}^{T_s} (d_t − d̂_t)^2 + λ · CL

where d_t is the target diameter at time point t and T_s is the number of frames in the sequence. It is worth noting that knowledge of the period of the cardiac cycle is needed only during the training phase; during the test phase, on an unknown image sequence, the trained network provides its estimate blind to the periodicity of the specific sequence under analysis.
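The combined loss can be sketched as follows; the default λ = 0.01 is a placeholder, since the value used in the paper was lost from the text:

```python
import numpy as np

def cyclic_loss(d_hat: np.ndarray, period: int) -> float:
    """L2 penalty between predictions one cardiac period apart,
    summed over adjacent cycles (the CyclicLoss defined above)."""
    n_cycles = len(d_hat) // period
    cl = 0.0
    for n in range(1, n_cycles):
        prev_cycle = d_hat[(n - 1) * period:n * period]
        curr_cycle = d_hat[n * period:(n + 1) * period]
        cl += np.linalg.norm(curr_cycle - prev_cycle)  # ||.||_2 over one cycle
    return float(cl)

def total_loss(d: np.ndarray, d_hat: np.ndarray, period: int,
               lam: float = 0.01) -> float:
    """MSE plus lambda * CyclicLoss. lam=0.01 is a placeholder value."""
    mse = float(np.mean((d - d_hat) ** 2))
    return mse + lam * cyclic_loss(d_hat, period)
```

A perfectly periodic prediction incurs zero CyclicLoss, so the penalty only activates when adjacent cycles disagree.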

Figure 2: Each panel (a-d) shows the estimation of the aortic diameter at each frame of fetal ultrasound videos in the test set, using the level-set method (dashed purple line), the naive architecture using AlexNet (dashed orange line), AlexNet+C-GRU (dashed red line), and AlexNet+C-GRU trained with the CyclicLoss (dashed blue line). The ground truth (solid black line) is reported for comparison. Panels (a,c) show results on long sequences where more than 3 cardiac cycles are imaged, whereas panels (b,d) show results on short sequences where only one or two cycles are available.

4.2 Implementation details

For our experiments, we chose AlexNet [12] as a feature extractor for its simplicity. It has five hidden layers, with kernels of size 11×11 in the first layer, 5×5 in the second and 3×3 in the last three layers; it is well suited to the low image contrast and diffuse edges characteristic of US sequences. Each network input for training is a sequence of ultrasound frames; AlexNet provides the feature maps, and the final output is the estimated abdominal aorta diameter value at each frame.

The loss function is optimised with the Adam algorithm [16], a first-order gradient-based technique, for a fixed number of epochs, with the number of iterations per epoch given by the number of patients times the number of ultrasound sequences. In order to improve generalization, data augmentation of the input with random vertical and horizontal flips is used at each iteration. Training with the CyclicLoss uses a fixed value of the constant λ.
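The flip augmentation can be sketched as below; applying the same flip to the whole sequence is an assumption made so that the temporal coherence the C-GRU relies on is preserved:

```python
import numpy as np

def random_flip(seq: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random vertical and horizontal flips of a video sequence (T, H, W).

    Each flip is applied with probability 0.5 and consistently across all
    frames, so the temporal ordering and geometry stay coherent."""
    if rng.random() < 0.5:
        seq = seq[:, ::-1, :]  # vertical flip (rows)
    if rng.random() < 0.5:
        seq = seq[:, :, ::-1]  # horizontal flip (columns)
    return seq
```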

5 Experiments

The proposed architecture is compared with the currently adopted approach described in Section 2: a fully-automated method for lumen identification on prenatal US images of the abdominal aorta [3], based on edge-based level sets. In order to understand the behaviour of different feature extraction methods, we also explored the performance of deeper network architectures, whereby AlexNet was replaced by InceptionV4 [13] or DenseNet-121 [14].

Methods              MSE [mm^2]      RE [%]         p-value
AlexNet              0.29 (0.09)     8.67 (10)      1.01e-12
AlexNet+C-GRU        0.093 (0.191)   6.11 (5.22)    1.21e-05
AlexNet+C-GRU+CL     0.085 (0.17)    5.23 (4.91)    -
DenseNet121          0.31 (0.56)     9.55 (8.52)    6.00e-13
DenseNet121+C-GRU    0.13 (0.21)     7.72 (5.46)    7.78e-12
InceptionV4          6.81 (14)       50.4 (39.5)    6.81e-12
InceptionV4+C-GRU    0.76 (1.08)     16.3 (9.83)    2.89e-48
Level-set            0.31 (0.80)     8.13 (9.39)    1.9e-04
Table 1: The table shows the mean (standard deviation) of the MSE and RE for all compared models. The combination of the C-GRU and the CyclicLoss with AlexNet yields the best performance. Adding recurrent units to any CNN architecture improves its performance; however, deeper networks such as InceptionV4 and DenseNet do not show any particular benefit with respect to the simpler AlexNet. We also report the p-value for the comparison of each model with the proposed network AlexNet+C-GRU+CL; in this case the significance level should be 0.05/7, using the Bonferroni correction [17].

The performance of each method was evaluated both with respect to the mean squared error (MSE) and to the mean absolute relative error (RE); all values are reported in Tab.1 in terms of average and standard deviation across the test set.

In order to provide a visual assessment of the performance, representative estimations on four sequences of the test set are shown in Fig. 2. The naive architecture relying on a standard loss and its C-GRU version are incapable of capturing the periodicity of the diameter signal. The problem is mitigated by adding the CyclicLoss regularization to the MSE. This is quantitatively shown in Tab. 1, where the use of this loss further decreases the MSE from 0.093 mm^2 to 0.085 mm^2, and the relative error from 6.11% to 5.23%.

Strikingly, we observed that deeper networks are not able to outperform AlexNet on this dataset; their limitation may be due to over-fitting. Nevertheless, the use of the C-GRU greatly improves the performance of both deeper networks, in terms of both MSE and RE. Further, we performed a non-parametric test (the Kolmogorov-Smirnov test) to check whether the best model was statistically different from the others.

The results obtained with the complete model AlexNet+C-GRU+CL are indeed significantly different from all the others (p < 0.05), even when the significance level is adjusted for multiple comparisons by applying the Bonferroni correction [17, 18].
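As a quick sanity check, the p-values reported in Table 1 can be compared directly against the Bonferroni-corrected threshold for the seven comparisons:

```python
# p-values from Table 1 (each model vs. AlexNet+C-GRU+CL).
p_values = {
    "AlexNet": 1.01e-12,
    "AlexNet+C-GRU": 1.21e-05,
    "DenseNet121": 6.00e-13,
    "DenseNet121+C-GRU": 7.78e-12,
    "InceptionV4": 6.81e-12,
    "InceptionV4+C-GRU": 2.89e-48,
    "Level-set": 1.9e-04,
}
# Bonferroni correction: divide the nominal level by the number of tests.
alpha_corrected = 0.05 / 7
significant = {m: p < alpha_corrected for m, p in p_values.items()}
```

Every entry clears the corrected threshold of about 7.1e-03, which is what supports the claim of significance above.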

6 Discussion and conclusion

The proposed deep learning (DL) architecture shows excellent performance compared to traditional image analysis methods, both in accuracy and in efficiency. This improvement is achieved through the combination of a shallow CNN with the exploitation of temporal and cyclic coherence. Our results indicate that shallow CNNs perform better than deeper CNNs, such as DenseNet-121 and InceptionV4, on this task; this might be due to the small size of the dataset, a common issue in medical settings when manual annotation of the data is required.

6.1 The CyclicLoss benefits

The exploitation of temporal coherence is what pushes the performance of the DL solution beyond current image analysis methods, reducing the MSE from 0.29 mm^2 (naive architecture) to 0.093 mm^2 with the addition of the C-GRU. The CyclicLoss is an efficient way to guide the training of a DL solution on data showing some periodicity, as in cardiovascular imaging. Note that knowledge of the signal period is only required during training, so it does not bring additional requirements on the input data in real clinical application. We argue that the CyclicLoss makes the network learn to expect a periodic input and to provide some periodicity in the output sequence.

6.2 Limitations and future works

A drawback of this work is that it assumes the presence of the vessel in the current field of view. Further research is thus required to evaluate how well the solution adapts to the scenario of lacking cyclic consistency, when the vessel of interest can move in and out of the field of view during the acquisition, and to investigate the possibility of concurrent estimation of the cardiac cycle and the vessel diameter. Finally, the C-GRU used in our architecture has two particular advantages compared to previous approaches [10, 11]: first, it is not subject to the vanishing gradient problem of plain RNNs, allowing training on long sequences of data; second, it has a lower computational cost than the LSTM, which makes it suitable for real-time video applications.


  • [1] Visentin S., Grumolato F., Nardelli G.B., Di Camillo B. , Grisan E., Cosmi E. Early origins of adult disease: Low birth weight and vascular remodeling, Atherosclerosis, 237(2), pp. 391-399, 2014.
  • [2] Veronese E., Tarroni G., Visentin S., Cosmi E., Linguraru M.G., Grisan E. Estimation of prenatal aorta intima-media thickness from ultrasound examination. Phys Med Biol, 59(21), pp. 6355-71, 2014.
  • [3] Tarroni G., Visentin S., Cosmi E., Grisan E. Fully-Automated Identification and Segmentation of Aortic Lumen from Fetal Ultrasound Images. In: IEEE EMBC pp. 153-6, 2015.
  • [4] Molinari F., Zeng G., Suri J.S.: A state of the art review on intima–media thickness (IMT) measurement and wall segmentation techniques for carotid ultrasound. Comp Meth Prog Biomed, 100(3), 201–221, 2010.
  • [5] Loizou C.P.: A review of ultrasound common carotid artery image and video segmentation techniques, Med & Biol Eng & Comp, 52(12), pp. 1073-1093, 2014.
  • [6] Cosmi E., Visentin S., Fanelli T., Mautone A.J., Zanardo V. Aortic intima media thickness in fetuses and children with intrauterine growth restriction. Obs Gyn,114, pp. 1109–1114, 2009.
  • [7] Skilton M.R., Evans N. Griffiths K.A., Harmer J.A., Celermajer D.S.: Aortic wall thickness in newborns with intrauterine growth restriction. Lancet, 365, pp. 1484–6, 2005.
  • [8] Litjens G., Kooi T., Bejnordi B.E., Setio A.A.A., Ciompi F., Ghafoorian M., van der Laak J.A.W.M., van Ginneken B., Sánchez C.I.: A Survey on Deep Learning in Medical Image Analysis., Med Image Anal, 42, pp 60-88, 2017
  • [9] Shin J.Y., Tajbakhsh N., Hurst R.T., Kendall C.B., Liang J.: Automating Carotid Intima-Media Thickness Video Interpretation with Convolutional Neural Networks, In: IEEE CVPR Conference, pp. 2526-2535, 2016
  • [10] Chen H., Dou Q., Ni D., Cheng J.-Z., Qin J., Li S., Heng P.-A.: Automatic fetal ultrasound standard plane detection using knowledge transferred recurrent neural networks. In: MICCAI 2015, LNCS, Vol. 9349, pp. 507–514, 2015
  • [11] Huang W., Bridge C.P., Noble J.A, Zisserman A.: Temporal HeartNet: Towards Human-Level Automatic Analysis of Fetal Cardiac Screening Video. In: MICCAI 2017, LNCS, vol 10434, pp. 341-349, 2017.
  • [12] Krizhevsky A., Sutskever I., Hinton G.E.: ImageNet Classification with Deep Convolutional Neural Networks. In: NIPS-2012, pp. 1097–1105, 2012.
  • [13] Szegedy C., Ioffe S., Vanhoucke V.: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In: AAAI-17, pp. 4278-4284, 2017.

  • [14] Huang G., Liu Z., van der Maaten L., Weinberger K.Q.: Densely Connected Convolutional Networks, In: IEEE CVPR Conference, pp. 2261-2269, 2017
  • [15] Siam M., Valipour A., Jägersand M., Ray N.: Convolutional Gated Recurrent Networks for Video Segmentation, In: IEEE ICIP Conference, pp. 3090-3094, 2017.
  • [16] Kingma D.P., Ba L.J.: Adam: A Method for Stochastic Optimization, 3rd International Conference for Learning Representations 2015
  • [17] Bonferroni, C. E., Teoria statistica delle classi e calcolo delle probabilità, Pubblicazioni del Regio Istituto Superiore di Scienze Economiche e Commerciali di Firenze 1936
  • [18] Dunn, Olive Jean (1961). ”Multiple Comparisons Among Means” Journal of the American Statistical Association. 56 (293): 52–64