Hybrid Attention for Automatic Segmentation of Whole Fetal Head in Prenatal Ultrasound Volumes

04/28/2020 ∙ by Xin Yang, et al.

Background and Objective: Biometric measurements of the fetal head are important indicators for maternal and fetal health monitoring during pregnancy. 3D ultrasound (US) has unique advantages over 2D scanning in covering the whole fetal head and may improve diagnosis. However, automatically segmenting the whole fetal head in US volumes remains an emerging and unsolved problem. The challenges that automated solutions need to tackle include poor image quality, boundary ambiguity, long-span occlusion, and appearance variability across different fetal poses and gestational ages. In this paper, we propose the first fully-automated solution to segment the whole fetal head in US volumes. Methods: The segmentation task is first formulated as an end-to-end volumetric mapping under an encoder-decoder deep architecture. We then combine the segmentor with a proposed hybrid attention scheme (HAS) to select discriminative features and suppress non-informative volumetric features in a composite and hierarchical way. With little computation overhead, HAS proves effective in addressing boundary ambiguity and deficiency. To enhance spatial consistency in the segmentation, we further organize multiple segmentors in a cascaded fashion to refine the results by revisiting the context in the predictions of predecessors. Results: Validated on a large dataset collected from 100 healthy volunteers, our method presents superior segmentation performance (Dice Similarity Coefficient (DSC), 96.05%). On volumes collected from 52 volunteers, we achieve high reproducibility (mean standard deviation 11.524 mL) against scan variations. Conclusion: This is the first investigation of whole fetal head segmentation in 3D US. Our method is a promising and feasible solution for assisting volumetric US-based prenatal studies.




1 Introduction

Prenatal examinations during different trimesters depend heavily on ultrasound (US) scanning, which is well-recognized as real-time, non-invasive and radiation-free. Biometric measurements interpreted from US images are the foundation for evaluating fetal and maternal health across different gestational ages (hadlock1985estimation).

Figure 1: Volumetric ultrasound of the whole fetal head. (a) Illustration of the 3D ultrasound imaging process around the fetal head. Varying fetal poses are allowed during our acquisition. (b) Anatomical definition of the whole fetal head region (area above the blue line). The skull (including P, F, E, S, T and O) is only a sub-region of the whole head. (c)-(e) Coronal, transverse and sagittal planes from a fetal head ultrasound volume. Arrows denote the various occluded sites (blue arrows) and the deficient and ambiguous boundaries (green arrows) around the fetal head.

Among all the biometrics, measurements of the fetal head are major indicators accepted by sonographers, since they explicitly reflect the growth stage of the fetus. By combining the measurements of the fetal head with those of other anatomical structures, like the fetal abdomen and femur, sonographers can further estimate the fetal weight and hence gain better insights for diagnosis. However, limited by 2D US scanning, the current clinical measurement pipeline often exposes the diagnosis of the fetal head to potential risks. First, 2D measurements obtained from approximated geometric primitives, like lines and ellipses, are rough in describing the complex 3D geometry of the fetal head. Sonographers often need multiple 2D measurements, like Head Circumference (HC) and Biparietal Diameter (BPD), to justify their diagnoses rueda2014evaluation . Second, selecting standard planes which contain key anatomical substructures is a prerequisite for measurement. Sonographer bias in this selection step often enlarges the discrepancy in diagnoses Ni_umb_plane . Previous automatic solutions yu2008fetal ; wu2017cascaded ; li2018automatic for fetal head measurement partially address these problems, but they are still limited by 2D US imaging.

As shown in Fig. 1, 3D US is emerging as a promising alternative for circumventing the aforementioned problems Tutschek . It provides a broad volumetric field of view and therefore enables sonographers to inspect the fetal head anatomy in a more straightforward way (Fig. 1(a)). Volumetric scanning is less expert-dependent than 2D scanning and thus alleviates the risk of discrepancy in image acquisition. Biometric measurements extracted from 3D US, such as volume, are more representative and comprehensive than 2D ones for diagnosis. Volumetric metrics may also provide earlier indicators than planar ones for prognosis Tutschek .

Although 3D US is attractive for fetal head imaging, efficient and effective tools for whole fetal head segmentation and quantitative analysis running on the massive volume are still absent. As illustrated in Fig. 1(c)-(e), automated segmentation solutions need to tackle the following challenges possessed by volumetric US of whole fetal head: i) the poor image quality resulting from the speckle noise and low resolution, ii) inevitable boundary ambiguity caused by low tissue contrast and long-span shadow occlusion caused by the severe acoustic attenuation on skull, particularly in the far field, iii) the large varieties of fetal head in appearance (particularly the inner structures), scale, orientation and shape across different fetal poses and gestational ages.

Much work has been proposed for prenatal volumetric US segmentation. Semi-automatic solutions, like VOCAL (GE Healthcare), were investigated in the clinic to segment fetal anatomical structures Luewan_vocal_placentaV . These semi-automatic solutions often simplify the segmentation and discard many important details. Dahdouh et al. Dahdouh explored both intensity distributions and shape priors to segment the fetus. Feng et al. Feng_limb exploited boundary traces to extract fetal limb volume. Recently, Namburete et al. proposed a 3D deformable parametric surface to represent and fit the fetal skull for fetal brain maturity evaluation Namburete_head . Although shape models provide proper constraints for robust fitting, they are initialization-dependent and rough in capturing the case-specific boundary details. Traditional machine learning methods, like random forests, were leveraged to segment fetal brain structures Yaqub_brain . A Structured Random Forest was further used to segment the fetal skull cerrolaza2017fetal .

Deep convolutional neural networks (CNNs) have largely taken over from the traditional methods in US image segmentation Shen_review ; torrents2019segmentation . Characterized by end-to-end dense mapping, the fully convolutional network (FCN) Long_fcn was adopted by wu2017cascaded for 2D prenatal US image segmentation with high performance. A 3D FCN with a Recurrent Neural Network was further developed by yang2017towards to segment the whole fetus, placenta and gestational sac in early gestational ages. Namburete et al. Namburete2018fully used the shape model to generate skull masks and trained deep neural networks to segment the fetal skull in 3D US volumes. They reported an average segmentation DSC of 83%. Recently, Cerrolaza et al. cerrolaza2018deep proposed to combine the acoustic shadow casting map with a deep network to segment the 3D fetal skull, also achieving an average DSC of 83%. A deep conditional generative network was further developed for 3D fetal skull reconstruction from a few 2D slices by cerrolaza20183d , reporting an average DSC of 90%.

Although the tasks in cerrolaza2017fetal ; Namburete2018fully ; cerrolaza2018deep ; cerrolaza20183d are similar to the work in this paper, they focus on fetal skull segmentation, while our task is segmenting the whole fetal head. As shown in Fig. 1(b), the whole fetal head is defined as the region of the fetus above the plane determined by the fetal lower jaw and cervical spine C2 (denoted as the blue line in Fig. 1(b)). It includes not only the skull (P, F, E, S, T, O), but also the maxillo-facial structures of the fetus (M, Z, Y, C). Segmentation of the whole fetal head is more informative than skull segmentation for fetal growth evaluation. However, since the maxillo-facial areas of the fetus often present larger non-rigid deformations, occlusions and boundary deficiencies than the skull (Fig. 1(c)-(e)), our task is much more difficult than skull segmentation.

Figure 2: Schematic view of our segmentation solution. (a) Architecture of our proposed segmentor. An encoder-decoder design with 3D operators digests the whole US volume. The hybrid attention scheme is combined to enhance the feature maps at key sites. Deep supervision with side branches boosts the network training. (b) Cascaded segmentors in an Auto-Context framework for segmentation refinement. The summation of the US volume and the probability map of the whole fetal head is the input of each context level.

In this paper, we propose the first fully-automated solution to segment the whole fetal head in US volumes. In order to fully explore the context in the whole US volume, we first formulate the segmentation task as an end-to-end volumetric mapping under an encoder-decoder deep architecture. Making every feature descriptor representative is crucial for deep networks, especially for our volumetric network, which is severely limited by computation resources. Therefore, we propose a hybrid attention scheme (HAS) and combine it with our segmentor to promote discriminative features and suppress non-informative volumetric features in a composite and hierarchical way. HAS brings minimal computation overhead and proves effective in helping the segmentor combat boundary ambiguity and deficiency. To enhance spatial consistency in the segmentation, we further organize multiple segmentors in a cascaded fashion to refine the results by explicitly revisiting the global shape context encoded in the predictions from predecessors. With experiments on a large dataset, our method shows the ability to handle a wide range of gestational ages with superior segmentation performance (average DSC of 96%), high agreement with experts and decent reproducibility (mean standard deviation 11.524 mL). The automated segmentation will not only benefit the extraction of representative biometrics of the fetal head, but also has the potential to boost many advanced prenatal studies, like brain alignment Namburete2018fully , volume stitching Gomez_3Dus_regis and longitudinal analysis Namburete_head . Code is publicly available at https://github.com/wxde/USegNet-DS-HAS.

2 Materials and Methods

2.1 Datasets

To cover the most important gestational ages in which the fetal head is intensively examined, we built a dataset consisting of 100 US volumes of the fetal head acquired from 100 healthy pregnant volunteers, with a gestational age (GA) ranging from 20 to 31 weeks. All data acquisition was approved by the local Institutional Review Board. All the volunteers that participated in this study reviewed and signed the consent forms. All the volumes were anonymized and acquired by an experienced sonographer using a US machine (S50, SonoScape Medical Corp., Shenzhen, Guangdong Province, China) with an integrated 3D probe. The probe's scan angle ensures a complete scan of the whole fetal head. The fetus is static during scanning, and free fetal poses are allowed during acquisition. Varying the scanning orientation of the US probe is accepted to ensure the acquisition quality. The original size of a volume is 388×258×440 voxels with a spacing of 0.38×0.38×0.38 mm³. An expert with 10 years of experience manually delineated all volumes as ground truth. Although skilled in using the annotation software ITK-SNAP itk-snap , the expert needs about 2 hours to annotate one volume. All the annotation results were double-checked by a senior expert with 20 years of experience. We then randomly split the dataset into 50 volumes for training and 50 volumes for testing. To account for the varying fetal head pose, the training dataset is first augmented to 600 volumes with flipping and stepwise rotation around the three axes. We then further augment the training data by applying Random Erasing zhong2017random , which sets a randomly positioned sub-region around the whole fetal head to 0 to mimic the ubiquitous acoustic shadow. Finally, the training dataset is augmented to 2112 volumes.
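The Random Erasing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `random_erase`, the box size, the uniform choice of the box position over the whole volume, and the nested-list volume layout are all assumptions made for the example (the paper places the erased region around the fetal head, and a practical pipeline would use an array library).

```python
import random


def random_erase(volume, box_size, rng):
    """Return a copy of `volume` (nested lists indexed [z][y][x]) in which one
    randomly placed box of shape `box_size` is set to 0, plus the box corner.

    Hypothetical helper: the box position is drawn uniformly over the volume.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    bz, by, bx = box_size
    # Draw the box corner so that the whole box stays inside the volume.
    z0 = rng.randint(0, depth - bz)
    y0 = rng.randint(0, height - by)
    x0 = rng.randint(0, width - bx)
    # Copy the volume so the original training sample is left untouched.
    erased = [[row[:] for row in plane] for plane in volume]
    for z in range(z0, z0 + bz):
        for y in range(y0, y0 + by):
            for x in range(x0, x0 + bx):
                erased[z][y][x] = 0  # mimic an acoustic shadow
    return erased, (z0, y0, x0)


rng = random.Random(0)
volume = [[[1] * 8 for _ in range(8)] for _ in range(8)]  # toy 8x8x8 volume
erased, corner = random_erase(volume, (2, 3, 4), rng)
zeros = sum(v == 0 for plane in erased for row in plane for v in row)
```

In this toy example exactly 2·3·4 = 24 voxels are zeroed, and the original volume is unchanged, so erased copies can be added to the training set alongside the originals.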

2.2 Methodology

Our proposed framework is elucidated in Fig. 2. The system directly takes the whole US volume as an input. Our deep encoder-decoder architecture then densely labels each voxel in the volume, and generates intermediate probability maps for the foreground and background. Deep supervision is attached to boost training efficiency. Attention modules are injected in both upsample path and skip connections to form the hybrid filtering on volumetric features (Fig. 2(a)). Multiple segmentors then follow a cascaded fashion to refine the volume predictions level by level. Final output of the system is the segmentation of whole fetal head (Fig. 2(b)).

2.3 3D Encoder-Decoder Architecture for Dense Labeling

With interleaved convolution, pooling and non-linearity layers to extract features and organize the semantic hierarchy, convolutional neural networks have become the dominant workhorse for medical image segmentation Shen_review . Among all the architectures, the FCN Long_fcn with an encoder-decoder design is a popular choice for dense pixel-wise end-to-end mapping.

Because a 3D FCN has orders of magnitude more parameters than a 2D one, processing the whole US volume with a 3D FCN is challenging under limited GPU memory. Previous studies often resort to slice-based Namburete2018fully ; cerrolaza2018deep and tri-plane-based 2D FCNs. However, these discard spatial dependencies and thus sacrifice accuracy dou2016automatic . A 3D patch-based FCN with an overlap-tiling stitching strategy is another attempt at volumetric segmentation yang2017hybrid . Although it explores spatial cues in 3D patches, it is time-consuming and loses the global context in the volume that would regularize its segmentation. To directly process the whole US volume with a 3D FCN, the architecture should be carefully tailored. Also, all features should learn to be highly task-relevant. We elaborate the details of our 3D FCN design in this section. Improving the discriminative power of features with the attention mechanism is introduced in the section that follows.

As parameterized in Fig. 2(a), we customize a 3D U-net Ronneberger_unet with long skip connections bridging the encoder and decoder paths as our backbone. A concatenation operator is used to merge feature volumes between the encoder and decoder paths. Each convolutional layer (Conv) is followed by a batch normalization layer (BN) and a rectified linear unit (ReLU). With the whole US volume as input, we focus on tuning the number of pooling layers, the number of Conv layers and the kernel sizes in our architecture to balance the input volume resolution and GPU memory constraints. All these factors affect the receptive field size and feature hierarchy of deep networks in perceiving global and local contexts. We finally opt for 2 successive Conv+BN+ReLU layers as a block between each two max-pooling layers. There are 3 max-pooling layers in total in our encoder path. We select 3×3×3 kernels for shallow Conv layers and 5×5×5 kernels for deep Conv layers to further enlarge the receptive field.

Because of the gradient vanishing problem along the long backpropagation path in deep networks Glorot_vanish , deep layers are often over-tuned while the shallow layers are under-tuned during training. To maintain the training efficacy for all layers of our 3D deep network, we adopt the deep supervision mechanism to replenish the gradient flow with auxiliary losses and shorten the backpropagation path for shallow layers Lee-ds ; dou2017_ds .

As shown in Fig. 2(a), besides the main loss function at the end of the network, the deep supervision mechanism exposes shallow layers to the extra supervision of auxiliary loss signals via side branches. These branches share the same ground truth with the main loss function but have a shortened backpropagation path. Together, these paths build a composite loss signal and therefore enhance the gradient flow in updating the parameters of shallow layers. The basic formulation of deep supervision is as follows:

$$\mathcal{L}(\mathcal{X}, \mathcal{Y}; W, w) = \mathcal{L}_{main}(\mathcal{X}, \mathcal{Y}; W) + \sum_{m=1}^{M} \mathcal{L}_{m}(\mathcal{X}, \mathcal{Y}; W, w_m), \quad (1)$$

where $\mathcal{X}$, $\mathcal{Y}$ are training pairs, $W$ is the weight of the main network, and $w = (w_1, \dots, w_M)$ are the weights of the $M$ side branches.

Several consecutive deconvolution operations are often used in the side branches to upsample the outputs to the same size as the ground truth label dou2017_ds . However, deconvolution is very computation-intensive, and the side branches often contain multiple times as many deconvolutions as the main network. Thus, this setting consumes a lot of GPU memory. Recently, Lin et al. lin2017refinenet proposed to directly downscale the ground truth label to fit the different branches and hence remove the heavy deconvolutions. As shown in Fig. 2(a), there is no deconvolution-based upsampling in our side branches. Instead, we downscale the ground truth label to 1/2 and 1/4 to fit the sizes of the two branches, respectively. To increase non-linearity while limiting computation cost, we use a Conv layer with a 1×1×1 kernel to output the probability maps in each branch. Our modification preserves the effectiveness of deep supervision and makes more GPU memory available for the main network to explore. The final composite loss function for our deeply supervised network is accordingly modified as Eq. 2,

$$\mathcal{L}(\mathcal{X}, \mathcal{Y}; W, w) = \mathcal{L}_{main}(\mathcal{X}, \mathcal{Y}; W) + \sum_{m=1}^{M} \mathcal{L}_{m}(\mathcal{X}, \phi_m(\mathcal{Y}); W, w_m), \quad (2)$$

where $\mathcal{X}$ is the input volume, $W$ and $w_m$ are the weights of the main network and of the $m$-th side branch ($M = 2$ branches here), cross entropy is the metric for the main loss function $\mathcal{L}_{main}$ and the auxiliary losses $\mathcal{L}_{m}$, and $\phi_m$ indicates the downscale operation on the ground truth label $\mathcal{Y}$.
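The composite loss with downscaled labels can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released code: the helper names (`downscale`, `cross_entropy`, `composite_loss`), nearest-neighbour label downscaling, the binary form of the cross entropy, and the unit branch weights are all hypothetical choices; volumes are nested lists indexed [z][y][x].

```python
import math


def downscale(label, factor):
    """Nearest-neighbour downscaling of a [z][y][x] label volume (assumed scheme)."""
    return [[row[::factor] for row in plane[::factor]] for plane in label[::factor]]


def flatten(volume):
    """Flatten a [z][y][x] nested list into a flat list of voxel values."""
    return [v for plane in volume for row in plane for v in row]


def cross_entropy(prob, label, eps=1e-7):
    """Mean binary cross entropy between foreground probabilities and 0/1 labels."""
    p, y = flatten(prob), flatten(label)
    assert len(p) == len(y), "prediction and label must have the same size"
    total = 0.0
    for pi, yi in zip(p, y):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(p)


def composite_loss(main_prob, side_probs, label, factors=(2, 4), weights=(1.0, 1.0)):
    """Main loss plus one auxiliary loss per side branch.

    Each side branch is compared against the label downscaled by its factor,
    so no upsampling of the branch output is needed. `weights` is hypothetical.
    """
    loss = cross_entropy(main_prob, label)
    for prob, factor, weight in zip(side_probs, factors, weights):
        loss += weight * cross_entropy(prob, downscale(label, factor))
    return loss


# Toy 4x4x4 example: foreground is the half of the volume with x < 2.
label = [[[1 if x < 2 else 0 for x in range(4)] for _ in range(4)] for _ in range(4)]
main_prob = [[[0.5] * 4 for _ in range(4)] for _ in range(4)]   # uninformative main output
side_half = [[[float(v) for v in row] for row in plane] for plane in downscale(label, 2)]
side_quarter = [[[float(v) for v in row] for row in plane] for plane in downscale(label, 4)]
loss = composite_loss(main_prob, [side_half, side_quarter], label)
```

With an uninformative main prediction (0.5 everywhere) and perfect side-branch predictions, the loss is dominated by the main term, ln 2 ≈ 0.693. The point of the design is visible in the shapes: the side branches are scored at 2×2×2 and 1×1×1 here, without any deconvolution to bring them back to 4×4×4.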

2.4 Hybrid Attention Scheme to Promote Features

The decoder path and skip connections give U-net variants advantages in distilling global context and preserving local details for fetal head segmentation. However, as the deconvolution successively upsamples the feature maps, not only the features representing the fetal head but also the features of non-head regions are learned at different scales. As these feature maps propagate along the decoder path, the final segmentation is adversely affected. At the same time, feature maps at the shallow layers of the encoder path not only contain the detailed boundary cues of the fetal head, but also include abundant background noise. Skip connections convey these features to the decoder path to enhance the boundary details at the risk of introducing false positives. Therefore, filtering the features to be task-relevant and focused on the fetal head region becomes very important under limited feature capacity and computation resources.

Figure 3: Schematic illustration of our proposed attention module.

The attention mechanism is an attractive solution to this problem. It borrows from human perception the idea of applying limited computation resources to the focus of view. Irrelevant features are suppressed under the mechanism to improve learning efficiency and efficacy. Recently, attention mechanisms have become popular in image analysis wang2017residual and natural language processing bahdanau2014neural . Wang et al. wang2018deep proposed a framework with an attention setting to improve multi-scale features for prostate segmentation in 2D US images. Schlemper et al. further investigated the attention mechanism as a general module for both medical image classification and segmentation schlemper2019attention . However, they only explored the attention module to filter the feature maps on skip connections, and ignored the upsample path and the composite filtering effect. In this work, we propose a hybrid attention scheme (HAS) to progressively refine the feature maps in both the skip connections and the upsample path at different scales.

As the building block of HAS, we first introduce the design of our attention module (AM) (Fig. 3). Given the feature maps $F$ as input, we apply a Conv layer with a 1×1×1 kernel to shrink the feature map channels to 32, yielding $F_r$. This operation reduces the computation cost of our attention module and enhances the interactions among channels. We then attach two consecutive Conv layers with 3×3×3 kernels to produce the un-normalized attention weights $W_a$:

$$W_a = f_a(F; \theta), \quad (3)$$

where $\theta$ represents the parameters learned by $f_a$, which contains the 3 Conv layers mentioned above. After that, we obtain the attention map $A$ by normalizing $W_a$ across the channel dimension with a Softmax function:

$$a_{i,k} = \frac{\exp(w_{i,k})}{\sum_{k'} \exp(w_{i,k'})}, \quad (4)$$

where $w_{i,k}$ denotes the value at spatial location $i$ and $k$-th channel on $W_a$, while $a_{i,k}$ denotes the normalized attention weight at spatial location $i$ and $k$-th channel on $A$. After getting the normalized attention map, we multiply it with $F_r$ in an element-by-element manner to generate the refined feature map $F_a$. With the attention mechanism, $A$ will learn to indicate and enhance the semantic features of the fetal head boundary in $F_r$ with higher $a_{i,k}$, while suppressing the non-head regions with lower attention values (see our Results section). To make full use of $F_r$ and $F_a$, we concatenate them and output the final 64-channel attention features after a Conv layer.
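The gating part of the attention module (channel-wise Softmax, element-wise multiplication, concatenation) can be illustrated with a small sketch. It deliberately leaves out the Conv layers: the un-normalized attention weights are passed in as `logits`, and the final Conv after the concatenation is omitted. The function names and the flat list-of-voxels layout are assumptions for the example, and it assumes the gated features are the channel-reduced ones.

```python
import math


def channel_softmax(logits):
    """Softmax over the channel values of a single voxel."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, logits):
    """Gate features with channel-normalized attention and concatenate.

    features, logits: lists of voxels, each voxel a list of K channel values.
    Returns, per voxel, the K input channels followed by the K gated channels
    (2K channels in total, e.g. 32 + 32 = 64 in the paper's setting).
    """
    attended = []
    for feat, logit in zip(features, logits):
        weights = channel_softmax(logit)
        refined = [f * a for f, a in zip(feat, weights)]  # element-wise gating
        attended.append(feat + refined)                   # channel concatenation
    return attended


features = [[1.0, 2.0], [3.0, 4.0]]              # two voxels, K = 2 channels
logits = [[0.0, 0.0], [0.0, math.log(3.0)]]      # toy un-normalized weights
out = attend(features, logits)
```

For the second voxel the Softmax gives weights of about 0.25 and 0.75, so the gated channels are about 0.75 and 3.0: the channel with the larger attention weight is largely preserved while the other is suppressed, and the untouched input channels remain available through the concatenation.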

As shown in Fig. 2, to thoroughly exploit the feature filtering effect of the attention mechanism, and different from wang2018deep ; schlemper2019attention , our HAS implants the AM in both the skip connections (denoted as SAM) and the upsample path (denoted as UAM) at different scales in our network. HAS forms a composite and hierarchical feature selection in the segmentor. Specifically, we place a SAM in each skip connection and a UAM after each feature concatenation in the upsample path. The UAM enforces feature interaction among the concatenated feature maps and then selects the most discriminative features among them for further decoding. The auxiliary loss branches are attached behind the UAMs to ease the learning with filtered and discriminative features.

2.5 Refinement with Auto-Context Scheme

Neighboring predictions are beneficial to support the decision on current location and address boundary ambiguity. Thanks to the whole US volume input and large receptive field size in our network design, our network can get arbitrary access to the context dependencies in long or short ranges. We hence combine our HAS based network with a classic iterative refinement framework, Auto-Context Tu_atoctxt , to explore varying context and better recover the boundary of fetal head in ambiguous sites.

Auto-Context is designed to learn the context encoded in images and probability maps. It proves to be an elegant scheme for successive labeling refinement. As shown in Fig. 2(b), it stacks a series of models in such a way that the model at level $k$ can simultaneously revisit the appearance context in the intensity image and the shape context in the probability map generated by the model at level $k-1$. Eq. 5 illustrates the general iterative process of a typical Auto-Context scheme,

$$P^{k} = H^{k}\big(J(X, P^{k-1})\big), \quad (5)$$

where $H^{k}$ is the mapping function of the segmentor at level $k$, and $X$ and $P^{k-1}$ are the US volume and the probability map of the fetal head from level $k-1$, respectively. $J$ is a join operator to combine $X$ and $P^{k-1}$, which is set as an element-wise summation in this work. Summation saves computation cost and performs better than concatenation wu2017cascaded . Limited by GPU memory, we train the network at level $k$ after finishing the training of level $k-1$. The probability map in level 0 is initialized with the constant value 0. As a compromise between the time efficiency in training/testing and the performance gain, we use 2 context levels in total. The last context level outputs the final refined segmentation.
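The cascaded inference described above can be sketched as a short loop. This is a minimal sketch, assuming each trained segmentor can be treated as a callable that maps an input volume to a foreground probability map; `join`, `auto_context` and the two toy segmentors are hypothetical stand-ins, and volumes are flattened to plain lists for brevity.

```python
def join(volume, prob):
    """Join operator: element-wise summation of volume and probability map."""
    return [v + p for v, p in zip(volume, prob)]


def auto_context(volume, segmentors):
    """Run the cascade: each level sees the volume plus the previous prediction."""
    prob = [0.0] * len(volume)  # level 0 starts from an all-zero probability map
    for segment in segmentors:
        prob = segment(join(volume, prob))
    return prob


# Toy stand-ins for two trained segmentors (purely illustrative).
def level0(x):
    return [0.5 for _ in x]                      # coarse, uninformative prediction


def level1(x):
    return [1.0 if v > 1.0 else 0.0 for v in x]  # refines using volume + context


refined = auto_context([0.25, 0.75], [level0, level1])
```

Because the levels are only coupled through the probability map, each level can be trained after the previous one has finished, which matches the sequential training under limited GPU memory described in the text.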

3 Experimental Results

3.1 Implementation and Evaluation Criteria

We implemented our framework in TensorFlow. Training and testing were run on an NVIDIA GeForce GTX TITAN X GPU (12GB). All the Conv layers were initialized from truncated normal distributions. As a trade-off between image quality and segmentation performance, we downscale the original US volume by a factor of 0.4 in each dimension for input. The final segmentation result was resampled back to the original full resolution for evaluation. We updated the weights of all layers with an Adam optimizer (batch size = 1, initial learning rate = […], momentum term = 0.5). The number of training epochs in each Auto-Context level was set to 30. At test time, our final method needs only about […] seconds to segment a US volume.

For segmentation evaluation, we assess the region, boundary and voxel-wise similarities with 5 criteria: the Dice similarity coefficient (DSC, %), Conformity (Conf, %), Jaccard (Jacc, %), average distance of boundaries (Adb [mm]) and Hausdorff distance of boundaries (Hdb [mm]). DSC indicates the mutual overlap between segmentation and ground truth. Conformity ($Conf = (3 \cdot DSC - 2)/DSC$) provides a wider range and can be more sensitive and rigorous than DSC, as suggested by Chang_Conform . Adb describes the average distance from the segmentation surface to the ground truth. Hdb is sensitive to boundary outliers and emphasizes the worst labeling cases Rueda_Challenge . DSC, Jaccard, Adb and Hdb are defined as follows,

$$DSC = \frac{2\,|S \cap G|}{|S| + |G|}, \qquad Jacc = \frac{|S \cap G|}{|S \cup G|},$$

$$Adb = \frac{1}{N_s + N_g}\Big(\sum_{p \in S_s} \min_{q \in S_g} d(p, q) + \sum_{q \in S_g} \min_{p \in S_s} d(p, q)\Big),$$

$$Hdb = \max\Big(\max_{p \in S_s} \min_{q \in S_g} d(p, q),\ \max_{q \in S_g} \min_{p \in S_s} d(p, q)\Big),$$

where $S$ and $G$ are the segmentation and ground truth. $|\cdot|$ calculates the volume of the segmented object. $S_s$ and $S_g$ are the surfaces of the segmentation and ground truth, with $N_s$ and $N_g$ vertices, respectively. $p$ and $q$ are vertices on the surfaces, and $d(p, q)$ is the Euclidean distance between vertices $p$ and $q$.
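The five criteria can be computed with a short sketch like the one below. It is an illustration under stated assumptions rather than the evaluation code used in the paper: segmentations are given as sets of foreground voxel coordinates, surfaces as lists of vertices, Conformity follows the relation Conf = (3·DSC − 2)/DSC, and Adb is taken as the pooled average of nearest-surface distances in both directions. Distances are in voxel units and would be scaled by the voxel spacing to obtain millimetres.

```python
import math


def overlap_metrics(seg, gt):
    """DSC, Jaccard and Conformity for sets of (z, y, x) foreground voxels."""
    inter = len(seg & gt)
    dsc = 2.0 * inter / (len(seg) + len(gt))
    jacc = inter / len(seg | gt)
    conf = (3.0 * dsc - 2.0) / dsc  # can become negative for poor overlaps
    return dsc, jacc, conf


def directed_distances(src, dst):
    """Distance from every vertex in `src` to its nearest vertex in `dst`."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def boundary_metrics(surf_seg, surf_gt):
    """Average (Adb) and Hausdorff (Hdb) boundary distances, both symmetric."""
    d_sg = directed_distances(surf_seg, surf_gt)
    d_gs = directed_distances(surf_gt, surf_seg)
    adb = (sum(d_sg) + sum(d_gs)) / (len(d_sg) + len(d_gs))
    hdb = max(max(d_sg), max(d_gs))
    return adb, hdb


seg = {(0, 0, 0), (0, 0, 1), (0, 0, 2)}
gt = {(0, 0, 1), (0, 0, 2), (0, 0, 3)}
dsc, jacc, conf = overlap_metrics(seg, gt)

adb, hdb = boundary_metrics([(0, 0, 0), (0, 0, 2)], [(0, 0, 0), (0, 0, 5)])
```

In the toy example two of three voxels overlap, giving DSC = 2/3, Jaccard = 1/2 and a Conformity of about 0, which shows how much more harshly Conformity penalizes a mediocre overlap. The brute-force nearest-vertex search is quadratic in the number of surface vertices; a spatial index or a distance transform would be the usual choice for full-resolution volumes.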

3.2 Quantitative and Qualitative Analysis

Henceforth, we denote our basic 3D segmentor network without deep supervision and HAS as USegNet. As shown in Table 1, we first compare USegNet with several competitors, namely the 3D deconvolution network (3D-DeconvNet) noh2015learning , 2D USegNet (2D-USegNet) and 3D patch-based USegNet (p-USegNet), to prove the effectiveness of our backbone architecture. USegNet shares the same encoder-decoder layout with 3D-DeconvNet, 2D-USegNet and p-USegNet. In contrast, 3D-DeconvNet lacks the skip connections between encoder and decoder, and 2D-USegNet takes slices at the original resolution as input and outputs slices of the same size. p-USegNet digests 64×64×64 3D patches and generates the prediction of the whole US volume with an overlap-tiling stitching strategy yang2017hybrid . We keep proper settings for all the compared methods for fair comparisons.

Method                DSC [%]   Conf [%]   Jacc [%]   Adb [mm]   Hdb [mm]
p-USegNet             93.31     85.53      87.53      0.8908     7.839
2D-USegNet            94.31     87.86      89.27      0.8023     6.861
3D-DeconvNet          94.51     88.31      89.62      0.6804     6.334
USegNet               94.83     89.06      90.20      0.6186     5.400
USegNet-DS            94.95     89.33      90.42      0.6225     5.409
USegNet-DS-UAM        95.57     90.70      91.54      0.5242     4.785
USegNet-DS-SAM        95.63     90.83      91.64      0.5362     4.702
USegNet-DS-HAS        95.85     91.30      92.05      0.5070     4.876
USegNet-DS-HAS-Ctx    96.05     91.74      92.42      0.4793     4.609
Table 1: Quantitative comparison of different segmentation methods

Lacking global contextual information for guidance, p-USegNet presents the worst results in terms of both shape and boundary similarities among the compared methods. With the original whole slice as high-resolution input and a large slice-based training dataset, 2D-USegNet gets better results than p-USegNet but still has high boundary distance errors due to the lack of spatial regularization. 3D-DeconvNet achieves a 1.2 percent DSC improvement over p-USegNet and reduces the Adb error by about 15% relative to 2D-USegNet. This shows the importance of adopting 3D operators and taking the whole volume as input to exploit global context for whole fetal head segmentation. By establishing skip connections to revisit detailed boundary cues in multi-scale feature maps, our USegNet further refines the results over 3D-DeconvNet with a 0.3 percent improvement in DSC. Finally, by adding deep supervision (DS) to boost the training process, USegNet-DS gets another 0.1 percent improvement in DSC.

Figure 4: Two cases (first and second row) to show the comparison of Hausdorff distance [mm] among different methods. From left to right: 2D-USegNet, p-USegNet, USegNet-DS-UAM and USegNet-DS-HAS-Ctx. The color bar is annotated with mean in the center, min and max on the ends.

Based on USegNet-DS, we then conduct an ablation study on our introduced modules, including the SAM, UAM, HAS and Auto-Context (Table 1). Located on the feature flow of the main network, UAM (USegNet-DS-UAM) presents a clear improvement over USegNet-DS, about 0.6 percent in DSC. By selectively enhancing the detailed boundary features of the fetal head and discarding the background noise as the upsampling progresses, UAM proves its importance in our architecture. Benefiting from the feature filtering effect on skip connections, SAM (USegNet-DS-SAM) brings a 0.7 percent improvement in DSC and reduces the Adb error by about 12% over USegNet-DS. This result indicates that the feature maps from the shallow layers of the encoder path indeed contain redundant and irrelevant features. SAM suppresses these kinds of features and hence improves the segmentation. Because UAM sits on the main stream of feature flow in the network, it yields the lower Adb of the two (0.5242 mm vs. 0.5362 mm), while SAM is marginally ahead on the overlap metrics and Hdb.

Figure 5: Segmentation results of six cases in the testing set. These cases have different shapes, sizes and gestational ages. Hausdorff distances [mm] from the segmentation surface to the ground truth are shown to provide a more detailed illustration. The color bar is annotated with the mean in the center and the min and max on the ends.

Combining the SAM and UAM to form the composite and hierarchical feature filtering, HAS (USegNet-DS-HAS) adds only a little computation overhead but contributes more segmentation improvement than the SAM-only and UAM-only models. It brings the highest refinement over USegNet-DS (about 0.9 percent in DSC, 19% in Adb). At this point, the whole fetal head segmentation performance of USegNet-DS-HAS is already very promising. The averaged absolute relative error in voxel number between our segmentation and ground truth gets as low as […]. Considering the computation cost in training/testing and the performance gain, our Auto-Context scheme stacks only two USegNet-DS-HAS models with the same configuration. We denote the stacked models as USegNet-DS-HAS-Ctx. As Table 1 shows, USegNet-DS-HAS-Ctx improves on all metrics compared to USegNet-DS-HAS (about 0.2 percent in DSC, 0.4 percent in Jacc). Compared to the results reported by Namburete2018fully ; cerrolaza2018deep ; cerrolaza20183d for fetal skull segmentation (at most 90% in DSC), our task is more challenging and our method achieves better results.

With the two cases shown in Fig. 4, we visualize the Hausdorff distance from different whole fetal head segmentation surfaces to the ground truth. Lacking proper spatial context to guide and regularize the segmentation, the results of 2D-USegNet and p-USegNet are rough and visually implausible. The use of UAM (USegNet-DS-UAM) clearly reduces the surface distances of most boundary points. Benefiting from the hybrid feature filtering effect of HAS and Auto-Context, USegNet-DS-HAS-Ctx further narrows the surface distances. More visualization results of USegNet-DS-HAS-Ctx are shown in Fig. 5. It can be observed that our proposed method copes with the poor image quality, scale and shape variations, occlusion and boundary ambiguities of the whole fetal head in US volumes, and presents promising segmentation results.

Figure 6: Two cases (first and second row) to explicitly compare the segmentation details of different methods. From left to right, sagittal slice of a whole fetal head in right, middle and left region. Red, green and yellow curves denote the contour from ground truth, USegNet-DS and our USegNet-DS-HAS.

In Fig. 6, we show an explicit comparison between USegNet-DS and our USegNet-DS-HAS to demonstrate the effectiveness of hybrid attention. As we can observe, both USegNet-DS and USegNet-DS-HAS properly fit the ground truth around the fetal skull regions. However, suffering from the irrelevant features in the background, USegNet-DS tends to under-segment the whole fetal head around the fetal facial and neck areas (blue arrows). The boundaries in these areas are hard to recognize due to the lack of hard bone structures. With Fig. 7, we further show a case (USegNet-DS vs. USegNet-DS-HAS) to reveal the impact of our proposed hybrid attention. Through point-to-point comparisons, we can see that the probability maps produced by USegNet-DS are still fuzzy and low around the fetal head boundaries. In contrast, the maps produced by USegNet-DS-HAS are compact and high around the whole fetal head, even in the severely occluded spots. This demonstrates that our hybrid attention scheme not only suppresses false positive predictions in non-head regions but also enlarges the gap between the foreground object and background noise.

Figure 7: Advantages of HAS in enhancing the prediction maps. From top to bottom: coronal, transverse and sagittal slices of a whole fetal head; segmentation ground truth (green) overlaid on the slices; probability maps of the whole fetal head for the three slices produced by USegNet-DS; probability maps of the whole fetal head for the three slices produced by USegNet-DS-HAS. The color map denotes the probability range. The white crosshair facilitates point-to-point comparison.

Table 2 reports an ablation study on the number of SAM and UAM modules. The base model for these experiments is USegNet-DS, whose results are listed for reference. Following the module layout in Fig. 2, we define the experimental settings as follows. SAM-1 denotes USegNet-DS with only the SAM 1 module. SAM-12 denotes USegNet-DS with both the SAM 1 and SAM 2 modules in the skip connections. USegNet-DS-SAM is our model with all 3 SAMs. UAM-1 denotes USegNet-DS with only the UAM 1 module before the second deconvolution layer. UAM-12 denotes USegNet-DS with both the UAM 1 and UAM 2 modules. USegNet-DS-UAM is our model with all 3 UAMs. In the UAM experiments, when a UAM is removed, its auxiliary loss branch is attached to the concatenation layer that preceded it so that the training process remains fair. Even a single module helps: SAM-1 and UAM-1 already bring a clear refinement of the segmentation (about 0.5 percent in DSC). Increasing the number of SAMs and UAMs increases the performance further, but the gains diminish and the performance saturates at USegNet-DS-SAM and USegNet-DS-UAM. Compared with the computation cost of USegNet-DS, the SAM and UAM modules add only slight overhead (see the Time column of Table 2: 0.76 to 1.11 s versus 0.78 s for USegNet-DS). SAM-12 and UAM-12 improve every segmentation metric over SAM-1 and UAM-1 except Hdb. Our interpretation is that SAM 2 and UAM 2 sit at the middle semantic levels and may miss some very fine details that are conveyed only by SAM 3. They therefore show a slight degradation in the strict Hdb metric, which emphasizes the worst boundary outliers.
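For readers who want to relate the overlap columns of Table 2 to each other, the following minimal sketch computes DSC, Jaccard and Conformity from voxel counts. It assumes the standard definitions (DSC = 2TP/(2TP+FP+FN), Jacc = TP/(TP+FP+FN), Conf = 1 - (FP+FN)/TP); the function name and the toy masks are hypothetical, and masks are given as flat sequences of 0/1 labels.

```python
def overlap_metrics(pred, truth):
    """Return DSC, Jaccard and Conformity (all in percent) for two binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp == 0:
        raise ValueError("no overlap: Conformity is undefined when TP = 0")
    return {
        "DSC": 100.0 * 2 * tp / (2 * tp + fp + fn),
        "Jacc": 100.0 * tp / (tp + fp + fn),
        "Conf": 100.0 * (1 - (fp + fn) / tp),
    }


# Hypothetical toy masks: 8 true positives, 1 false positive, 1 false negative.
pred = [1] * 8 + [1, 0] + [0] * 5
truth = [1] * 8 + [0, 1] + [0] * 5

for name, value in overlap_metrics(pred, truth).items():
    print(f"{name}: {value:.2f} %")
```

Under these definitions Conformity penalizes mis-segmented voxels more steeply than DSC does, which is consistent with the larger spread of the Conf column compared with the DSC column.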

Network Layout    DSC [%]   Conf [%]   Jacc [%]   Adb [mm]   Hdb [mm]   Time [s]
USegNet-DS        94.95     89.33      90.42      0.6225     5.409      0.78
SAM-1             95.44     90.42      91.30      0.5556     4.939      0.76
SAM-12            95.60     90.75      91.59      0.5380     5.149      0.78
USegNet-DS-SAM    95.63     90.83      91.64      0.5362     4.702      1.11
UAM-1             95.49     90.51      91.39      0.5374     4.832      0.77
UAM-12            95.54     90.62      91.48      0.5350     5.024      0.81
USegNet-DS-UAM    95.57     90.70      91.54      0.5242     4.785      0.88

Table 2: Comparison of different numbers of SAM and UAM modules.

Figure 8: Correlation and Bland-Altman agreement on measuring the fetal head volume.

From the fetal head segmentation we can derive useful biometrics such as the volume. We adopt the correlation coefficient and Bland-Altman agreement Rueda_Challenge to evaluate the discrepancy between the volumes derived from expert annotations and those derived from our USegNet-DS-HAS-Ctx segmentations. As shown in Fig. 8, tested on the 50 varying volumes, our solution achieves high correlation (0.990) and agreement (-1.6 ± 19.5 mL, with 95% of the measurements located within ±1.96 standard deviations in the Bland-Altman plot) with the expert in measuring the fetal head volume. This high correlation and agreement indicate that our solution may serve as a promising alternative in assisting experts to analyze whole fetal head volumes.
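The agreement analysis can be sketched as follows. This is a minimal illustration, assuming the Pearson correlation coefficient and the usual Bland-Altman limits of agreement (mean difference ± 1.96 sample standard deviations). The function names are hypothetical and the volumes are made-up values, not the study data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def bland_altman(auto, manual):
    """Return (bias, lower limit, upper limit) of the automatic-minus-manual differences."""
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = sum(diffs) / len(diffs)
    # Sample standard deviation (n - 1 in the denominator) is an assumption here.
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (len(diffs) - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical head volumes in mL (expert vs. automatic).
manual_ml = [100.0, 200.0, 300.0, 400.0]
auto_ml = [102.0, 198.0, 303.0, 397.0]

r = pearson(auto_ml, manual_ml)
bias, lower, upper = bland_altman(auto_ml, manual_ml)
print(f"r = {r:.4f}, bias = {bias:.2f} mL, limits = [{lower:.2f}, {upper:.2f}] mL")
```

Correlation alone cannot reveal a systematic offset between the two measurements, which is why the bias and limits of agreement are reported alongside it.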

Figure 9: Fetal head volume (mL) measurement reproducibility (right) against three pre-defined scanning directions (left). Blue mesh represents the whole fetal head. Three volumes of a fetus in the same group.

As shown in Fig. 1, owing to the strong acoustic reflection at the fetal skull, different fetal head orientations or scanning directions can cause various shadows and occlusions, so the appearance of the whole fetal head in US volumes can change drastically. High reproducibility and robustness against variation in scanning direction are therefore crucial requirements before our method can be applied in real clinical scenarios. Accordingly, we newly collected 156 volumes from 52 volunteers to validate the reproducibility of our solution (3 volumes per volunteer, with free fetal pose and GA varying from 21 to 31 weeks). The same US machine, an S50 from SonoScape Medical Corp., Shenzhen, Guangdong Province, China, was used for image acquisition. Each volunteer was scanned along three predefined directions, as shown in Fig. 9, where the Anterior (A), Posterior (P), Left (L), Right (R), Superior (S) and Inferior (I) axes are sketched to denote the fetal head orientation. For each volunteer, one volume was acquired along each direction, and the 3 volumes from the same volunteer form a group. All data were anonymized and the acquisition was approved by the local Institutional Review Board. All volunteers reviewed and signed the consent forms. Fig. 9 shows the box-plot of the volume measurements generated by our USegNet-DS-HAS-Ctx for each group. Our method suffers little from pose or scanning variations and attains high reproducibility in measuring the whole fetal head volume across all groups (the mean of the per-group standard deviations is 11.524 mL, the minimum is 1.960 mL and the maximum is 49.869 mL).
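The reproducibility summary above amounts to one standard deviation per group followed by a mean, minimum and maximum over groups. A minimal sketch is given below; it assumes the sample standard deviation (the text does not state which estimator was used), and the function name and the volumes are hypothetical.

```python
import statistics


def reproducibility(groups):
    """Summarize per-group standard deviations of repeated volume measurements.

    groups: list of groups, each a list of volumes (mL) measured on the same
    fetus along different scanning directions.
    Returns (mean SD, min SD, max SD) across groups.
    """
    # statistics.stdev is the sample standard deviation (assumed estimator).
    sds = [statistics.stdev(g) for g in groups]
    return statistics.mean(sds), min(sds), max(sds)


# Hypothetical measurements: two volunteers, three scanning directions each.
groups_ml = [
    [100.0, 100.0, 100.0],
    [200.0, 210.0, 220.0],
]

mean_sd, min_sd, max_sd = reproducibility(groups_ml)
print(f"mean SD = {mean_sd:.3f} mL, min = {min_sd:.3f} mL, max = {max_sd:.3f} mL")
```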

4 Discussion

Automated analysis of US volumes has appealing potential to promote prenatal examinations and to change the traditional clinical workflow. Accurately segmenting the whole fetal head in a volume may provide more precise biometrics for describing fetal growth. However, automated segmentation of the whole fetal head in US volumes is non-trivial because of the poor image quality, the varying fetal poses and the massive volume data. In this paper, we approach the task by proposing a fully-automated solution with high performance and good reproducibility.

Several key points remain for future study. First, since our method enables automated extraction of the fetal head volume, a population study of fetal head volume against GA becomes more tractable than ever. Previously, for lack of efficient tools for analyzing US volumes, no widely accepted reference chart of fetal head volume existed, which limits the use of 3D US in supporting prenatal diagnoses. Only with a population study and the associated reference charts can volumetric measurement of the fetal head really benefit fetal health monitoring. To achieve this goal, we need to collect more volume data and enhance our solution to cover a broad GA range. Second, in a population study, ultrasound images will be acquired across different subjects, sites, devices, sonographers, GAs, etc. Unpredictable appearance shifts often occur in US images because of differing imaging conditions. Deep neural networks tend to suffer from this kind of appearance shift and can be severely degraded yang2018generalizing. Improving the generalization ability and robustness of deep neural networks to varying imaging conditions is critical for automated ultrasound image analysis, especially for a population study. Leveraging a shared shape prior yang2018generalizing or fine-tuning the deep model for each acquisition site with a few samples gibson2018inter will be considered for our task. Third, to accelerate the collection and annotation of a large dataset for a population study, we should greatly reduce the time and cost of manually annotating the volumes. Currently, volume annotation is very expensive and time-consuming (more than 2 hours for one volume). Assisting the experts during annotation with machine learning powered algorithms, such as interactive segmentation wang2018interactive, is highly demanded in our scenario. Finally, based on our proposed automated segmentation, we should conduct a longitudinal study to analyze the development pattern of the fetal head volume. This kind of study may provide earlier and better indicators than 2D measurements for the prognosis of rare diseases, such as the intrauterine growth restriction (IUGR) syndrome.

Considering the computation burden of volumetric segmentation, we need to further reduce the computation cost of 3D deep networks so that larger, higher-resolution volumes can be used as input for better segmentation results. Checkpointed backpropagation chen2016training can save GPU memory when training with high-resolution input. In yang2019fetusmap, we explored checkpointed backpropagation and provided clear evidence that, in 3D ultrasound, higher-resolution volumetric input localizes multiple fetal landmarks better than low-resolution input. Although real-time performance is not strongly required in current 3D ultrasound applications, network architectures that reduce the computation cost of convolution kernels, such as pseudo-3D networks Qiu_pseudo3d and MobileNets howard2017mobilenets, should be seriously investigated in the near future. Also, as the fetal pose and scale vary greatly across subjects and timepoints, efficient detection strategies that locate the fetal head in the volume, such as Faster R-CNN ren2015faster in 2D or 3D form xu2019efficient, can greatly narrow the search space and hence reduce the computation burden of segmentation. Finally, the segmentation results should be fully exploited to facilitate, or complement, more advanced applications such as landmark detection, standard plane localization and longitudinal analysis of the fetal brain.
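To make the memory trade-off of checkpointed backpropagation concrete, here is a toy sketch on a chain of scalar layers. It is not the training code used in this work: the layer (a tanh unit), the function names and the segment length are all hypothetical. The checkpointed version stores only the input of every segment during the forward pass and recomputes the intermediate activations segment by segment during the backward pass, yielding the same gradient with fewer stored values.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of tanh(w * x) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full(weights, x0):
    """Standard backprop: store the input of every layer."""
    inputs = [x0]
    for w in weights[:-1]:
        inputs.append(layer(w, inputs[-1]))
    g = 1.0
    for w, x in zip(reversed(weights), reversed(inputs)):
        g *= layer_grad(w, x)
    return g, len(inputs)  # gradient, number of stored activations


def grad_checkpointed(weights, x0, segment):
    """Checkpointed backprop: store segment inputs only, recompute the rest."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # keep only the input of each segment
        x = layer(w, x)
    g, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg_w = weights[start:start + segment]
        inputs = [checkpoints[start]]
        for w in seg_w[:-1]:  # recompute activations inside the segment
            inputs.append(layer(w, inputs[-1]))
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for w, xin in zip(reversed(seg_w), reversed(inputs)):
            g *= layer_grad(w, xin)
    return g, peak


weights = [0.5 + 0.05 * i for i in range(16)]
g_full, n_full = grad_full(weights, 0.3)
g_ckpt, n_ckpt = grad_checkpointed(weights, 0.3, segment=4)
print(f"same gradient: {math.isclose(g_full, g_ckpt)}, stored: {n_full} vs {n_ckpt}")
```

The price is roughly one extra forward pass through each segment, which is the usual trade of computation for memory in this technique.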

5 Conclusions

In this work, we propose the first fully-automated solution for the precise segmentation of the whole fetal head in US volumes. Before this work, the task remained open and lacked a satisfactory solution. The highlight of our work is a hybrid attention scheme, which imposes a composite and hierarchical feature filtering effect on our 3D encoder-decoder backbone for better feature learning under limited GPU resources and network depth. Experiments and demonstrations show that the proposed modules are effective. Promising segmentation accuracy, high correlation and agreement with experts, and high reproducibility against scanning variations indicate that our work has the potential to assist sonographers in reviewing fetal growth from a new perspective.

6 Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2019YFC0118300), the Shenzhen Peacock Plan (No. KQTD2016053112051497, KQJSCX20180328095606003), the Medical Scientific Research Foundation of Guangdong Province, China (No. B2018031) and the National Natural Science Foundation of China (Project No. U1813204).

References

  • (1) F. P. Hadlock, R. Harrist, R. S. Sharman, R. L. Deter, S. K. Park, Estimation of fetal weight with the use of head, body, and femur measurements—a prospective study, American Journal of Obstetrics & Gynecology 151 (3) (1985) 333–337.
  • (2) S. Rueda, S. Fathima, C. L. Knight, M. Yaqub, A. T. Papageorghiou, B. Rahmatullah, A. Foi, M. Maggioni, A. Pepe, J. Tohka, et al., Evaluation and comparison of current fetal ultrasound image segmentation methods for biometric measurements: a grand challenge, IEEE Transactions on medical imaging 33 (4) (2014) 797–813.
  • (3) D. Ni, X. Yang, X. Chen, C.-T. Chin, S. Chen, P. A. Heng, S. Li, J. Qin, T. Wang, Standard plane localization in ultrasound by radial component model and selective search, Ultrasound in medicine & biology 40 (11) (2014) 2728–2742.
  • (4) J. Yu, Y. Wang, P. Chen, Y. Shen, Fetal abdominal contour extraction and measurement in ultrasound images, Ultrasound in medicine and biology 34 (2) (2008) 169–182.
  • (5) L. Wu, Y. Xin, S. Li, T. Wang, P.-A. Heng, D. Ni, Cascaded fully convolutional networks for automatic prenatal ultrasound image segmentation, in: Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on, IEEE, 2017, pp. 663–666.
  • (6) J. Li, Y. Wang, B. Lei, J.-Z. Cheng, J. Qin, T. Wang, S. Li, D. Ni, Automatic fetal head circumference measurement in ultrasound using random forest and fast ellipse fitting, IEEE journal of biomedical and health informatics 22 (1) (2018) 215–223.
  • (7) B. Tutschek, H. Blaas, J. Abramowicz, K. Baba, J. Deng, W. Lee, E. Merz, L. Platt, D. Pretorius, I. Timor-Tritsch, et al., Three-dimensional ultrasound imaging of the fetal skull and face, Ultrasound in obstetrics & gynecology 50 (1) (2017) 7.
  • (8) D. Meengeonthong, S. Luewan, S. Sirichotiyakul, T. Tongsong, Reference ranges of placental volume measured by virtual organ computer-aided analysis between 10 and 14 weeks of gestation, Journal of Clinical Ultrasound 45 (4) (2017) 185–191.
  • (9) S. Dahdouh, E. D. Angelini, G. Grangé, I. Bloch, Segmentation of embryonic and fetal 3d ultrasound images based on pixel intensity distributions and shape priors, Medical Image Analysis 24 (1) (2015) 255–268.
  • (10) S. Feng, K. S. Zhou, W. Lee, Automatic fetal weight estimation using 3d ultrasonography, in: Proc. of SPIE Vol, Vol. 8315, 2012, pp. 83150I–1.
  • (11) A. I. Namburete, R. V. Stebbing, B. Kemp, M. Yaqub, A. T. Papageorghiou, J. A. Noble, Learning-based prediction of gestational age from ultrasound images of the fetal brain, Medical Image Analysis 21 (1) (2015) 72–86.
  • (12) M. Yaqub, R. Cuingnet, R. Napolitano, D. Roundhill, A. Papageorghiou, R. Ardon, J. A. Noble, Volumetric segmentation of key fetal brain structures in 3d ultrasound, in: International Workshop on Machine Learning in Medical Imaging, Springer, 2013, pp. 25–32.
  • (13) J. J. Cerrolaza, O. Oktay, A. Gomez, J. Matthew, C. Knight, B. Kainz, D. Rueckert, Fetal skull segmentation in 3d ultrasound via structured geodesic random forest, in: Fetal, Infant and Ophthalmic Medical Image Analysis, Springer, 2017, pp. 25–32.
  • (14) D. Shen, G. Wu, H.-I. Suk, Deep learning in medical image analysis, Annual review of biomedical engineering 19 (2017) 221–248.
  • (15) J. Torrents-Barrena, G. Piella, N. Masoller, E. Gratacós, E. Eixarch, M. Ceresa, M. Á. G. Ballester, Segmentation and classification in mri and us fetal imaging: Recent trends and future prospects, Medical Image Analysis 51 (2019) 61–88.
  • (16) J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • (17) X. Yang, L. Yu, S. Li, X. Wang, N. Wang, J. Qin, D. Ni, P.-A. Heng, Towards automatic semantic segmentation in volumetric ultrasound, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2017, pp. 711–719.
  • (18) A. I. Namburete, W. Xie, M. Yaqub, A. Zisserman, J. A. Noble, Fully-automated alignment of 3d fetal brain ultrasound to a canonical reference space using multi-task learning, Medical Image Analysis 46 (2018) 1–14.
  • (19) J. J. Cerrolaza, M. Sinclair, Y. Li, A. Gomez, E. Ferrante, J. Matthew, C. Gupta, C. L. Knight, D. Rueckert, Deep learning with ultrasound physics for fetal skull segmentation, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, pp. 564–567.
  • (20) J. J. Cerrolaza, Y. Li, C. Biffi, A. Gomez, M. Sinclair, J. Matthew, C. Knight, B. Kainz, D. Rueckert, 3d fetal skull reconstruction from 2dus via deep conditional generative networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 383–391.
  • (21) A. Gomez, K. Bhatia, S. Tharin, J. Housden, N. Toussaint, J. A. Schnabel, Fast registration of 3d fetal ultrasound images using learned corresponding salient points, in: Fetal, Infant and Ophthalmic Medical Image Analysis, Springer, 2017, pp. 33–41.
  • (22) P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, G. Gerig, User-guided 3d active contour segmentation of anatomical structures: significantly improved efficiency and reliability, Neuroimage 31 (3) (2006) 1116–1128.
  • (23) Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, arXiv preprint arXiv:1708.04896.
  • (24) Q. Dou, H. Chen, L. Yu, L. Zhao, J. Qin, D. Wang, V. C. Mok, L. Shi, P.-A. Heng, Automatic detection of cerebral microbleeds from mr images via 3d convolutional neural networks, IEEE transactions on medical imaging 35 (5) (2016) 1182–1195.
  • (25) X. Yang, C. Bian, L. Yu, D. Ni, P.-A. Heng, Hybrid loss guided convolutional networks for whole heart parsing, in: International Workshop on Statistical Atlases and Computational Models of the Heart, Springer, 2017, pp. 215–223.
  • (26) O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
  • (27) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
  • (28) C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets, in: Artificial Intelligence and Statistics, 2015, pp. 562–570.
  • (29) Q. Dou, L. Yu, H. Chen, Y. Jin, X. Yang, J. Qin, P.-A. Heng, 3d deeply supervised network for automated segmentation of volumetric medical images, Medical Image Analysis 41 (2017) 40–54.
  • (30) G. Lin, A. Milan, C. Shen, I. Reid, Refinenet: Multi-path refinement networks for high-resolution semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • (31) F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, X. Tang, Residual attention network for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.
  • (32) D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.
  • (33) Y. Wang, Z. Deng, X. Hu, L. Zhu, X. Yang, X. Xu, P.-A. Heng, D. Ni, Deep attentional features for prostate segmentation in ultrasound, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 523–530.
  • (34) J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, D. Rueckert, Attention gated networks: Learning to leverage salient regions in medical images, Medical Image Analysis 53 (2019) 197–207.
  • (35) Z. Tu, Auto-context and its application to high-level vision tasks, in: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, 2008, pp. 1–8.
  • (36) H.-H. Chang, A. H. Zhuang, D. J. Valentino, W.-C. Chu, Performance measure characterization for evaluating neuroimage segmentation algorithms, Neuroimage 47 (1) (2009) 122–135.
  • (37) S. Rueda, S. Fathima, C. L. Knight, M. Yaqub, A. T. Papageorghiou, B. Rahmatullah, A. Foi, M. Maggioni, A. Pepe, J. Tohka, et al., Evaluation and comparison of current fetal ultrasound image segmentation methods for biometric measurements: a grand challenge, IEEE Transactions on medical imaging 33 (4) (2014) 797–813.
  • (38) H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.
  • (39) X. Yang, H. Dou, R. Li, X. Wang, C. Bian, S. Li, D. Ni, P.-A. Heng, Generalizing deep models for ultrasound image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 497–505.
  • (40) E. Gibson, Y. Hu, N. Ghavami, H. U. Ahmed, C. Moore, M. Emberton, H. J. Huisman, D. C. Barratt, Inter-site variability in prostate segmentation accuracy using deep learning, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2018, pp. 506–514.
  • (41) G. Wang, W. Li, M. A. Zuluaga, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, S. Ourselin, et al., Interactive medical image segmentation using deep learning with image-specific fine tuning, IEEE transactions on medical imaging 37 (7) (2018) 1562–1573.
  • (42) T. Chen, B. Xu, C. Zhang, C. Guestrin, Training deep nets with sublinear memory cost, arXiv preprint arXiv:1604.06174.
  • (43) X. Yang, W. Shi, H. Dou, J. Qian, Y. Wang, W. Xue, S. Li, D. Ni, P.-A. Heng, Fetusmap: Fetal pose estimation in 3d ultrasound, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 281–289.
  • (44) Z. Qiu, T. Yao, T. Mei, Learning spatio-temporal representation with pseudo-3d residual networks, in: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, 2017, pp. 5534–5542.
  • (45) A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861.
  • (46) S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in neural information processing systems, 2015, pp. 91–99.
  • (47) X. Xu, F. Zhou, B. Liu, D. Fu, X. Bai, Efficient multiple organ localization in ct image using 3d region proposal network, IEEE transactions on medical imaging.