Fetal ultrasound (US) scanning is a vital part of ensuring the good health of mothers and fetuses during and after pregnancy. Accurate anomaly detection and assessment of fetal development from US scans are required to ensure that the best care is given at the earliest identifiable stage. In many countries a mid-trimester US scan is carried out between 18 and 22 weeks of gestation as part of standard prenatal care. ‘Standardized plane’ views are used to acquire images from which distinct anatomical features can be extracted. From some of these standard plane views, measurements of the head, abdomen and femur are most commonly used to predict fetal age and weight, and are the key biometrics identified from US. Biometrics acquired longitudinally can be used to predict the fetal development trajectory. Unfortunately, rates for early detection of fetal abnormalities are low, largely due to the high level of skill required of the sonographer to perform such scans and extract the relevant biometrics. Recently, automatic US scanning approaches have been developed using deep learning [2], which mitigate the problems of manual US measurement through automatic detection of diagnostically relevant anatomical planes. Such systems have allowed the development of robust automated methods for estimating anatomical biometrics [14, 16] in diverse acquisition conditions with various imaging artefacts, outperforming non-deep-learning approaches [3, 8, 11]. Critically, such methods only provide point estimates of biometrics such as head circumference (HC), without confidence or uncertainty measures, and do not provide any means to assess the quality of individual measurements during real-time scans. This can lead to many, potentially contradictory, measurements without any means to control the trustworthiness of the predictions during examination or retrospectively.
To this end, several approaches have been proposed for estimating uncertainty in deep networks. These include Monte-Carlo Dropout (MC Dropout) [5], the most common dropout method, which has been shown to model a posterior mixture of Gaussians well. Weights in a deep neural network are ‘dropped’ randomly during inference with a given probability, which has been shown to approximate Bayesian inference in deep Gaussian processes. In addition, ensemble approaches produce prediction samples per input image by training a set of separate networks for the same task. The results are then combined to produce a final segmentation, which seems to offer a good trade-off between robustness and accuracy [6]. Finally, the Probabilistic U-Net [7] is a generative segmentation model that combines a U-Net with a conditional variational autoencoder. It is capable of producing an unlimited number of plausible hypotheses, reproducing the possible segmentation variants as well as the frequencies with which they occur.
Contribution: In this paper, we extend a state-of-the-art convolutional deep learning approach for automatic fetal HC measurement [14] to develop a new approach for automated probabilistic fetal HC measurement with real-time feedback on measurement robustness. Two probabilistic deep learning methods are evaluated: MC Dropout during inference and the Probabilistic U-Net. These are used to return an ensemble of segmentations, from which upper and lower bounds on the measurement are generated. In addition, we propose the derivation of a ‘variance score’, used to reject acquired images that produce sub-optimal HC measurements. In this way, the system will guide operators towards acquiring optimal US views, resulting in more consistent and accurate measurements.
Our HC estimation builds on the approach developed in [14], which achieves human-level performance. First, a U-Net [10] segmentation network masks out the head from a US image. Then, an ellipse is fitted to the segmented contours, from which the ellipse parameters can be obtained in mm. We extract the ellipse centroid co-ordinates ($c_x$ and $c_y$), the major and minor axis radii ($a$ and $b$), each in pixels, and the angle of rotation ($\alpha$), and estimate HC using the Ramanujan approximation II [1] as
$$\mathrm{HC} \approx \pi (a + b) \left( 1 + \frac{3h}{10 + \sqrt{4 - 3h}} \right), \qquad h = \frac{(a - b)^2}{(a + b)^2}.$$
The error of this approximation is $\mathcal{O}(h^5)$, which for more circular ellipses is negligible. This ellipse fitting process mimics the sonographer’s manual actions when extracting an HC measurement during fetal US screening.
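As a minimal sketch of the circumference calculation described above, the following Python function evaluates the Ramanujan approximation II for an ellipse with semi-axes `a` and `b`. The function names and the `mm_per_px` pixel-spacing parameter are hypothetical, and isotropic pixel spacing is assumed; this is an illustration rather than the implementation used in the experiments.

```python
import math


def ellipse_circumference(a: float, b: float) -> float:
    """Ramanujan's second approximation to the perimeter of an ellipse.

    a and b are the semi-axis lengths (in any order, same unit).
    """
    if a < 0 or b < 0:
        raise ValueError("semi-axes must be non-negative")
    if a + b == 0:
        return 0.0
    # h measures how far the ellipse is from a circle (h = 0 for a circle).
    h = ((a - b) ** 2) / ((a + b) ** 2)
    return math.pi * (a + b) * (1.0 + 3.0 * h / (10.0 + math.sqrt(4.0 - 3.0 * h)))


def head_circumference_mm(a_px: float, b_px: float, mm_per_px: float) -> float:
    """Convert semi-axes from pixels to mm, then estimate HC.

    Assumption: the pixel spacing is the same along both image axes.
    """
    return ellipse_circumference(a_px * mm_per_px, b_px * mm_per_px)
```

For a circle the expression reduces to the exact value $2\pi a$, and the approximation degrades only slowly as the ellipse becomes more elongated.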
Probabilistic segmentation: Given the inherent variability between sonographers’ annotations in the training data, we generate a set of plausible segmentations from a single input using the following methods: i) MC Dropout: We randomly drop weights of the network with probability $p$ to predict $N$ segmentation samples. Here, single-sample experiments ($N = 1$) were used to optimise the configuration of the network. This led to the use of a single dropout layer before the bottleneck layer of the U-Net during inference. ii) Probabilistic U-Net: We sample a set of plausible segmentations using the Probabilistic U-Net [7], following the same training scheme as [7].
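The sampling idea behind MC Dropout can be illustrated with a toy example. The sketch below is not a U-Net: it applies dropout to the inputs of a single linear score and keeps dropout active at inference time, so that repeated forward passes on the same input give different outputs. All names are hypothetical, and the example drops units with inverted scaling, which is one common way of implementing dropout and is an assumption here.

```python
import random


def dropout(values, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


def toy_forward(features, weights, p, rng):
    """One stochastic forward pass: dropout on the features, then a linear score."""
    dropped = dropout(features, p, rng)
    return sum(w * x for w, x in zip(weights, dropped))


def mc_dropout_samples(features, weights, p, n_samples, seed=0):
    """Draw n_samples predictions for the same input with dropout left on."""
    rng = random.Random(seed)  # seeded so that the samples are reproducible
    return [toy_forward(features, weights, p, rng) for _ in range(n_samples)]
```

With `p = 0` every pass returns the same deterministic output; with `p > 0` the spread of the samples is what the later variance measures summarise.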
Variance Estimation: With a probabilistic mapping function $f$, in our case a deep probabilistic neural network, we can map a continuous input image $X$ to a possible segmentation mask $Y$. We assume a deterministic function $g$ that maps $Y$ to an ellipse with semi-major axis length $a$, semi-minor axis length $b$, angle of orientation $\alpha$ and center $c$, and that provides a least-squares solution to the problem of fitting an ellipse to the set of points in $Y$. Based on the fitted ellipses, we can evaluate hypotheses for their suitability to act as a metric of robustness during inference, given $N$ prediction samples from $f$. The proposed metrics are h1) Ellipse parameter variance; h2) Total ring area, scaled to world space; h3) Mask classification entropy, based on the number of pixels in $Y$ after class assignment; and h4) Softmax confidence entropy, computed before class assignment: after conversion of the logits of the network’s final layer with a softmax, the result can be interpreted as a two-element prediction confidence for foreground and background, from which we can estimate a class-agnostic prediction entropy.
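To make two of the proposed measures more concrete, the sketch below shows one plausible way of computing an ellipse parameter variance and a softmax confidence entropy from a set of samples. The exact definitions are not reproduced here, so both functions are assumptions: the first sums the per-parameter population variances over the fitted ellipses, and the second averages the per-pixel binary entropy of the foreground probability. All names are hypothetical.

```python
import math
from statistics import pvariance


def ellipse_parameter_variance(ellipses):
    """Sum of per-parameter population variances over N fitted ellipses.

    Each ellipse is a tuple such as (a, b, alpha, cx, cy).
    Assumption: parameters are combined by a plain, unweighted sum.
    """
    if len(ellipses) < 2:
        raise ValueError("need at least two samples to estimate a variance")
    return sum(pvariance(column) for column in zip(*ellipses))


def binary_entropy(p_foreground):
    """Shannon entropy in bits of a two-class (foreground/background) prediction."""
    if p_foreground <= 0.0 or p_foreground >= 1.0:
        return 0.0
    q = 1.0 - p_foreground
    return -(p_foreground * math.log2(p_foreground) + q * math.log2(q))


def mean_confidence_entropy(prob_maps):
    """Average per-pixel entropy over all samples.

    prob_maps is a list of N flat lists of foreground probabilities.
    """
    values = [binary_entropy(p) for prob_map in prob_maps for p in prob_map]
    return sum(values) / len(values)
```

Note that a plain variance of the orientation angle ignores wrap-around, so a practical version would need to treat the angle as a circular quantity.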
3 Experiments and Results
Our base dataset, subsequently referred to as Dataset A, consists of 2,724 two-dimensional US examinations from volunteers at 18-22 weeks gestation, acquired and labelled during routine screening by 45 expert sonographers. Several images were taken during each session, including the standard transverse brain view at the posterior horn of the ventricle (TV) plane used for HC measurement. This data was combined with the HC18 Challenge [15] dataset, which consists of 1,334 two-dimensional US images of the standard plane that is used to measure HC. Each image is 800x540 pixels, with a pixel size ranging from 0.052 mm to 0.326 mm. Each image in the training set has an accompanying manual annotation of the HC (ellipse outline) performed by a single trained sonographer. We resample all images to a common size and produce a head mask from the expert ground truth delineation. Training data is randomly flipped both horizontally and vertically, and a random rotation is applied.
Single-Sampling Experiments: In the first instance, single-sample experiments, which generate a single segmentation and HC measurement ($N = 1$) per subject, were used to verify the performance of the proposed model against the state of the art [14]. Table 1 reports performance measures for all single-sampling experiments. Our U-Net implementation, trained on Dataset A, shows performance comparable to [14]. This result improves further when the same model is trained on Dataset A and HC18 data, and MC Dropout during training improves the result again. For subsequent analysis, all experiments for MC Dropout (during inference) use the combined data and are trained using MC Dropout.
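Table 1: Performance measures for the single-sampling experiments.
| Experiment | Mean abs difference ± std (mm) | Mean DICE ± std (%) | Mean Hausdorff distance ± std |
|---|---|---|---|
| Baseline | 2.09 ± 1.97 | 0.982 ± 0.011 | 1.289 ± 0.880 |
| Dataset A + HC18 | 1.90 ± 1.90 | 0.982 ± 0.010 | 1.292 ± 0.791 |
| Dropout | 1.808 ± 1.65 | 0.982 ± 0.008 | 1.295 ± 0.664 |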
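The overlap and boundary measures reported in the table can be sketched as follows. This is a minimal illustration of the standard Dice coefficient and symmetric Hausdorff distance, not the evaluation code used for the experiments; masks are assumed to be nested lists of 0/1 values and contours lists of (x, y) points, and the function names are hypothetical.

```python
import math


def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as 2D lists of 0/1."""
    flat_pred = [v for row in pred for v in row]
    flat_truth = [v for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_truth if t)
    if total == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    if not points_a or not points_b:
        raise ValueError("point sets must be non-empty")

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(
            min(math.hypot(px - qx, py - qy) for qx, qy in dst)
            for px, py in src
        )

    return max(directed(points_a, points_b), directed(points_b, points_a))
```

The brute-force Hausdorff computation is quadratic in the number of contour points, which is acceptable for a sketch but would usually be replaced by a library routine in practice.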
MC Dropout during inference has been compared against a Probabilistic U-Net. Here, multiple segmentation predictions ($N > 1$) are made for each US image. From these, the mean and median of the set of fitted ellipse parameters are used to obtain a single HC value for each test case, and the set of segmentations is used to obtain an upper and lower bound. Table 2 shows the performance measures for our multi-sampling experiments. Results show that we lose performance by aggregating multiple results using the mean or median, although this is likely because dropout is not applied during inference in the single-sample experiments. However, the multi-sampling methods do allow us to produce an upper and lower bound on the HC value. For MC (inf.), the difference between the upper-lower bounds and the ground truth HC measurement was evaluated for the cases where the ground truth is not within the bounds.
Table 2: Performance measures for the multi-sampling experiments.
| | Mean abs difference ± std (mm) | Mean DICE ± std (%) | Mean Hausdorff distance ± std | |
|---|---|---|---|---|
| MC | 1.81 ± 1.65 | 0.982 ± 0.008 | 1.295 ± 0.664 | N/A |
| Mean | 2.22 ± 2.15 | 0.980 ± 0.011 | 1.413 ± 0.751 | 20.4 |
| Median | 2.21 ± 2.15 | 0.980 ± 0.011 | 1.410 ± 0.748 | 20.4 |
| Mean | 2.15 ± 2.09 | 0.981 ± 0.010 | 1.313 ± 0.613 | 27.8 |
| Median | 2.15 ± 2.07 | 0.981 ± 0.010 | 1.307 ± 0.604 | 27.8 |
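The aggregation step used in the multi-sampling experiments can be sketched as below. Each sample is reduced to its fitted semi-axes, the mean or median of these parameters gives a single HC value, and bounds are taken across the samples. Taking the bounds as the smallest and largest per-sample HC is an assumption made for this illustration, and all names are hypothetical.

```python
import math
from statistics import mean, median


def ramanujan_hc(a, b):
    """Ramanujan approximation II for the perimeter of an ellipse."""
    h = ((a - b) ** 2) / ((a + b) ** 2)
    return math.pi * (a + b) * (1.0 + 3.0 * h / (10.0 + math.sqrt(4.0 - 3.0 * h)))


def aggregate_hc(samples, reducer=mean):
    """Reduce N (a, b) samples to one HC value via mean or median parameters."""
    a = reducer([s[0] for s in samples])
    b = reducer([s[1] for s in samples])
    return ramanujan_hc(a, b)


def hc_bounds(samples):
    """Lower and upper HC bound over the samples.

    Assumption: bounds are the min and max of the per-sample HC values.
    """
    values = [ramanujan_hc(a, b) for a, b in samples]
    return min(values), max(values)
```

A typical call would be `aggregate_hc(samples, reducer=median)` followed by `hc_bounds(samples)`, with a wide gap between the two bounds indicating disagreement between samples.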
Variance Measure Thresholding: Finally, we experiment with each of the variance scores produced over the test set as a means to accept or reject images at test time. We assess their performance by counting the number of accepted and rejected cases for a range of thresholds between zero and one, and by measuring how the threshold affects the resulting average performance scores after rejected images are removed from the test set. In this experiment we use only MC Dropout during inference, which performs best in our previous experiments. Figure 2 shows how each variance measure can be used to reject test cases, and how rejecting high-variance cases can lead to improved performance. In each case we normalise the variance score to lie between 0 and 1, and for each threshold between 0 and 1 we ‘reject’ cases whose variance score is above the threshold. Plots show the performance for the remaining ‘accepted’ cases, plotted against the number of ‘rejected’ cases. For most variance scores we obtain an initial performance boost from ‘rejecting’ the worst cases, but after this initial improvement the variance scores do not separate ‘good’ from ‘bad’ cases very well. Results suggest that higher measurement variance may indicate sub-optimal imaging plane acquisition.
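The thresholding procedure can be sketched as follows: scores are min-max normalised to lie between 0 and 1, cases above the threshold are rejected, and the mean error is recomputed over the accepted cases. The function names and the choice of min-max normalisation are assumptions made for illustration.

```python
def normalise(scores):
    """Min-max normalise scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def accept_reject(scores, errors, threshold):
    """Reject cases whose normalised variance score exceeds the threshold.

    Returns the number of rejected cases and the mean error of the
    accepted cases (None if every case is rejected).
    """
    if len(scores) != len(errors):
        raise ValueError("scores and errors must have the same length")
    norm = normalise(scores)
    accepted = [e for s, e in zip(norm, errors) if s <= threshold]
    n_rejected = len(errors) - len(accepted)
    mean_error = sum(accepted) / len(accepted) if accepted else None
    return n_rejected, mean_error
```

Sweeping `threshold` from 0 to 1 and plotting the mean error against the number of rejected cases gives curves of the kind described above.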
Qualitative assessment: Figure 3 shows examples for successful and less model-compliant images using Dropout during inference to produce the samples, where model-compliance captures the proximity of the image to the training data. Note that the best performing examples produce very narrow upper and lower bounds (in this figure where the upper and lower bounds occupy the same pixels the margin is not visible). The worst performing examples show a wider upper and lower bound range but the ground truth ellipse is often not contained within the predicted range. These images often show a lack of clear white presentation of the skull. However, ambiguous segmentation of the regions with missing signal is often reflected in the confidence margin produced, showing greater variation in those image regions, which can be seen clearly in the second example in the bottom row - a wider upper-lower bound area for image regions with low signal from fetal skull. The example on the bottom far right shows missing signal on both sides, which results in a large uncertainty in the ellipses globally due to the compounded effect of missing signal on both sides of the skull.
While we cannot claim that our proposed ‘variance scores’ represent model uncertainty directly, they show some capability to ‘reject’ particularly low-performing test cases. In this way, the ‘variance scores’ can be described as a measure of how close an unseen test sample lies to the variance of the training data. This is also desirable, since it shows the confidence of the network with respect to its capacity and the training examples it has seen. Scenarios in which an operator is present stand to benefit in practice from the methods introduced in this work, which prompt operators to reject sub-optimal measurements by providing real-time feedback during acquisition, thus improving inter-operator consistency. This work lays the foundations for methods by which this can be achieved.
We demonstrate the effectiveness of probabilistic CNNs to automatically generate HC measurements from US scans, and produce upper-lower bound confidence intervals in real-time. Using multi-sampling probabilistic networks we derive ‘variance scores’, which indicate how confident our network is in generating measurements for a given image. This approach could be used to derive a system which rejects images collected from sub-optimal views, forcing sonographers to take measurements from a view for which the network performs optimally. This could lead to techniques for automated fetal HC measurement which outperform manual approaches in terms of accuracy and consistency. Future directions of this work include exploring alternative methods for multi-sampling networks, alternative segmentation fusion strategies and alternative ‘variance scores’. Analysis of new datasets to investigate network bias towards particular datasets is valuable, as well as analysis of cases with anomalous anatomy to verify high performance in the presence of pathologies, clinically the most important cases to identify.
Acknowledgements: This work is supported by the Wellcome Trust IEH 102431, EPSRC (EP/S022104/1, EP/S013687/1), and Nvidia GPU donations.
- [1] Barnard, R.W., Pearce, K., Schovanec, L.: Inequalities for the Perimeter of an Ellipse. Journal of Mathematical Analysis and Applications 260(2), 295–306 (8 2001). https://doi.org/10.1006/JMAA.2000.7128
- [2] Baumgartner, C.F., et al.: SonoNet: Real-Time Detection and Localisation of Fetal Standard Scan Planes in Freehand Ultrasound. IEEE Trans Med Imag 36(11), 2204–2215 (11 2017). https://doi.org/10.1109/TMI.2017.2712367
- [3] Carneiro, G., Georgescu, B., Good, S., Comaniciu, D.: Detection and Measurement of Fetal Anatomies from Ultrasound Images using a Constrained Probabilistic Boosting Tree. IEEE Trans Med Imag 27(9), 1342–1355 (9 2008). https://doi.org/10.1109/TMI.2008.928917
- [4] Fitzgibbon, A., Pilu, M., Fisher, R.: Direct least squares fitting of ellipses. In: 13th ICPR’96. pp. 253–257. IEEE (1996). https://doi.org/10.1109/ICPR.1996.546029
- [5] Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML’16. pp. 1050–1059 (2016)
- [6] Kamnitsas, K., et al.: Ensembles of Multiple Models and Architectures for Robust Brain Tumour Segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. pp. 450–462 (9 2018). https://doi.org/10.1007/978-3-319-75238-9_38
- [7] Kohl, S., et al.: A probabilistic U-Net for segmentation of ambiguous images. In: Advances in Neural Information Processing Systems. pp. 6965–6975 (2018)
- [8] Li, J., et al.: Automatic Fetal Head Circumference Measurement in Ultrasound Using Random Forest and Fast Ellipse Fitting. IEEE J Biomed Health Inform 22(1), 215–223 (1 2018). https://doi.org/10.1109/JBHI.2017.2703890
- [9] Prasad, D., Leung, M., Quek, C.: Ellifit: An unconstrained, non-iterative, least squares based geometric ellipse fitting method. Pattern Recognition 46(5), 1449–1465 (2013). https://doi.org/10.1016/j.patcog.2012.11.007
- [10] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI’15. pp. 234–241. Springer (2015). https://doi.org/10.1007/978-3-319-24574-4_28
- [11] Rueda, S., et al.: Evaluation and Comparison of Current Fetal Ultrasound Image Segmentation Methods for Biometric Measurements: A Grand Challenge. IEEE Trans Med Imag 33(4), 797–813 (4 2014). https://doi.org/10.1109/TMI.2013.2276943
- [12] Sarris, I., et al.: Intra- and interobserver variability in fetal ultrasound measurements. Ultrasound Obstet Gynecol 39(3), 266–273 (3 2012). https://doi.org/10.1002/uog.10082
- [13] National Health Service, U.: NHS Fetal Anomaly Screening Programme (FASP) Handbook Valid from August 2018. Tech. rep. (2018)
- [14] Sinclair, M., et al.: Human-level Performance On Automatic Head Biometrics In Fetal Ultrasound Using Fully Convolutional Neural Networks. In: 40th EMBC’18. pp. 714–717. IEEE (7 2018). https://doi.org/10.1109/EMBC.2018.8512278
- [15] van den Heuvel, T.L.A., de Bruijn, D., de Korte, C.L., Ginneken, B.v.: Automated measurement of fetal head circumference using 2D ultrasound images. PLOS ONE 13(8), e0200412 (8 2018). https://doi.org/10.1371/journal.pone.0200412
- [16] Wu, L., et al.: Cascaded Fully Convolutional Networks for automatic prenatal ultrasound image segmentation. In: IEEE 14th ISBI’17. pp. 663–666. IEEE (4 2017). https://doi.org/10.1109/ISBI.2017.7950607