FetusMap: Fetal Pose Estimation in 3D Ultrasound

by   Xin Yang, et al.
Shenzhen University

The advent of 3D ultrasound (US) has inspired a multitude of automated prenatal examinations. However, studies on structured descriptions of the whole fetus in 3D US are still rare. In this paper, we propose to estimate the 3D pose of the fetus in US volumes to facilitate its quantitative analysis at global and local scales. Given the great challenges in 3D US, including the high volume dimension, poor image quality, symmetric ambiguity in anatomical structures and large variations of fetal pose, our contribution is three-fold. (i) This is the first work on 3D pose estimation of the fetus in the literature. We aim to extract the skeleton of the whole fetus and assign the different segments/joints the correct torso/limb labels. (ii) We propose a self-supervised learning (SSL) framework to fine-tune the deep network to form visually plausible pose predictions. Specifically, we leverage landmark-based registration to effectively encode case-adaptive anatomical priors and generate an evolving label proxy for supervision. (iii) To enable our 3D network to perceive better contextual cues from higher-resolution input under limited computing resources, we further adopt the gradient checkpointing (GCP) strategy to save GPU memory and improve the prediction. Extensively validated on a large 3D US dataset, our method tackles varying fetal poses and achieves promising results. 3D pose estimation of the fetus has the potential to serve as a map that provides navigation for many advanced studies.



1 Introduction

Being real-time and radiation-free, ultrasound (US) is widely accepted in clinical practice for fetal health monitoring. Recent research has shown that many diagnostic biometrics can be interpreted automatically from US images. With its broad field of view and low user dependency, the advent of 3D US brings further opportunities for automated solutions to attain precise descriptions of the fetus [13].

Figure 1: 3D pose estimation of fetus in US volumes. (a) A sectional view of a fetus in US volume. (b) An instance of 3D fetal pose with 16 landmark indexes and 15 colored segments. (c) All the pose annotations of 152 fetuses in our dataset. Large variations exist when referring to (b). Better view in color version.

However, a solution that provides structured descriptions of the whole fetus in 3D US is still lacking. Such a description should facilitate not only traditional tasks at the local scale, like standard plane detection [2] and biometric measurements [11], but also advanced analyses at the global scale, like fetal movement patterns and longitudinal comparison. Therefore, we approach this goal by exploring a new task dedicated to 3D pose estimation of the fetus in US volumes. Specifically, as illustrated in Fig. 1(b), by localizing 16 landmarks over the full body of the fetus, we aim to extract the skeleton of the whole fetus and assign the different segments/joints the correct torso/limb labels.

As shown in Fig. 1, estimating fetal pose in 3D US needs to tackle several challenges. First, the image quality of 3D US is relatively low due to speckle noise, low resolution and acoustic shadows (Fig. 1(a)). Second, large variations in fetal pose, scale and orientation cause high variations in image appearance, which not only create ambiguity in localizing symmetric landmarks but also degrade the generalization ability of automated methods (Fig. 1(b)&(c)). Third, accurate landmark localization depends heavily on the global context of the whole volume to suppress false positives. However, digesting a whole US volume with a size of about 200×200×200 is very difficult under limited computing resources.

Deep neural networks are nowadays the dominant method for landmark detection in 3D US. A multi-task deep network was proposed in [7] for fetal eye localization in US volumes. Huang et al. exploited a semi-supervised learning method to localize 6 fetal head landmarks [5]. To use geometric or class constraints, a generative adversarial scheme was explored in [12] to regularize the landmark predictions. Although these methods are promising, the networks often suffer from limited generalization ability, especially in our task, where fetuses have free poses with varying appearances. 2D pose estimation has been well studied in the computer vision field. Chen et al. proposed to learn the joint inter-connectivity prior in an adversarial scheme to refine the human pose prediction [4]. Liu et al. further distilled the articulated relationship between joints with a recurrent neural network for feature boosting [6]. However, when facing large volumes and varying poses, these methods tend to degrade, as our experiments show.

In this paper, we tackle the challenges in 3D US for whole-body fetal pose estimation and generalize landmark detection to large volumes. Our contribution is three-fold. (i) To the best of our knowledge, this is the first work on 3D pose estimation of the fetus in the literature. We believe that, taking the fetal pose estimation as a map, navigation can be generated to assist a series of advanced studies on automated prenatal examinations. (ii) We propose a self-supervised learning (SSL) framework to force the deep network to produce visually plausible pose predictions. Specifically, we leverage landmark-based registration to effectively encode case-adaptive anatomical priors and generate an evolving label proxy for supervision. The proxy is a suboptimal supervision but proves to be explicit in conveying prior knowledge for successive refinement. (iii) To enable our 3D deep network to generate better features from higher-resolution input under limited computing resources, we further adopt the gradient checkpointing (GCP) strategy to save GPU memory. With little computation overhead, GCP facilitates the training and inference of larger volumes and hence contributes to better localization performance. In extensive experiments on a large 3D US dataset, our proposed method deals with varying fetal poses and generalizes well, with promising results.

Figure 2: Schematic view of our proposed framework for on-line refinement.

2 Methodology

Fig. 2 gives an overview of our proposed framework. The system input is a whole US volume. A pre-trained, deep-network-based landmark detector first digests the input and predicts the heatmaps of 16 landmarks together with an intermediate fetal pose estimation. By retrieving a support set of atlases from the pose library via rigid registration, label proxies are produced to form the self-supervision. The landmark detector is then tuned iteratively for on-line refinement. The system outputs the final pose estimation after a few iterations. The landmark detector is updated under the gradient checkpointing strategy if necessary.

Figure 3: Our proposed U-net like architecture for landmark detection.

2.1 Backbone of Landmark Detector

Since localizing fetal landmarks needs to consider both the global context and local details, as shown in Fig. 3, we build a 3D U-net [8] like network to simultaneously localize 16 landmarks over the full body of the fetus. Specifically, we deepen the network with consecutive convolutional (Conv) layers in each block and 4 pooling layers to encode high-level semantic features of the whole volume. Each Conv and deconvolutional (Deconv) layer is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU). An L2 regression loss is minimized as the loss function.
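To make the heatmap regression target concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each landmark is encoded as an isotropic 3D Gaussian (the width `sigma` is a hypothetical value), that the L2 loss is a mean squared error over all voxels, and that a landmark is decoded as the arg-max voxel of its channel. The function names are illustrative only.

```python
import numpy as np

def gaussian_heatmaps(landmarks, shape, sigma=2.0):
    """Build one 3D Gaussian map per landmark; result has shape (N, D, H, W).

    `sigma` is an assumed value, the paper does not report the Gaussian width.
    """
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    coords = np.stack(grids, axis=-1).astype(np.float64)  # (D, H, W, 3)
    maps = []
    for point in np.asarray(landmarks, dtype=np.float64):
        sq_dist = np.sum((coords - point) ** 2, axis=-1)
        maps.append(np.exp(-sq_dist / (2.0 * sigma ** 2)))
    return np.stack(maps, axis=0)

def l2_loss(pred, target):
    """Mean squared error between predicted and target heatmaps."""
    return float(np.mean((pred - target) ** 2))

def heatmaps_to_landmarks(heatmaps):
    """Decode each channel to the voxel index of its maximum response."""
    flat_idx = heatmaps.reshape(heatmaps.shape[0], -1).argmax(axis=1)
    return np.stack(np.unravel_index(flat_idx, heatmaps.shape[1:]), axis=1)

# Tiny usage example with two hypothetical landmarks in a small volume.
target = gaussian_heatmaps([[3, 4, 5], [10, 2, 7]], shape=(16, 12, 14))
print(target.shape, heatmaps_to_landmarks(target))
```

In the full system the target would have 16 channels, one per fetal landmark, and the prediction would come from the 3D U-net like network rather than from the target itself.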

2.2 Self-supervised Learning for On-line Refinement

Due to the large variations of fetal pose, scale and orientation, deep networks for 3D fetal pose estimation in US often suffer from low generalization ability when facing varying and unseen fetal appearances. Anatomical priors are helpful for this problem [4, 6]. However, these priors are often modeled in an indirect way and hardly take effect in our task (see Section 3). In this paper, as shown in Fig. 2, we propose to address the problem by producing a direct shape prior for on-line refinement under an SSL scheme.

The core idea of SSL is to supervise a model with a label proxy generated by the model itself, which makes the procedure annotation-free. SSL changes the classic testing fashion from simple inference to on-line learning: it fine-tunes the trained deep model with a label proxy. Strong guidance from the label proxy helps the deep model update itself and generalize well to unseen cases. Recently, conditional random fields [1] and interactive annotation [10] have been proposed to learn pixel-wise dependency to synthesize the label proxy in SSL. However, these methods are intractable for our discrete landmark detection. Therefore, we propose to synthesize the landmark label proxy by combining the model prediction with the shape knowledge of a pose library.


As shown in Fig. 2, after being pre-trained on the training dataset, the landmark detector enters our SSL scheme for testing. Following Eq. 1, for an unseen testing US volume of a fetus, the detector predicts its 16-channel landmark heatmaps and an intermediate 3D pose estimation. Each atlas in the pose library is then aligned to this intermediate pose via a rigid transformation. Since the landmarks of the intermediate pose often contain flaws, we only select a subset of landmarks to calculate the transformation. Specifically, referring to Fig. 1, we choose the landmarks that can be robustly detected across the dataset and that have relatively small variances, so as to fulfil the conditions of rigid registration. By retrieving the top-K candidates with the lowest registration errors, a support set of aligned atlases is formed. Then, a 16-channel landmark label proxy is produced by averaging the landmark Gaussian maps of the aligned atlases in the support set. The label proxy serves as the pseudo ground truth that triggers the loss function in each iteration. To minimize the loss, the landmark detector updates itself, which refines its predictions and, in turn, the label proxy.
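The retrieval step described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the rigid transformation is solved in closed form with the Kabsch (SVD) method, the registration error is taken as the mean residual distance on the selected landmarks, and the indices in `stable_idx`, the value of `k`, and the Gaussian width `sigma` are hypothetical placeholders because the paper's actual choices are not recoverable here.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t such that R @ src_i + t ~ dst_i."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    cov = (src - src_c).T @ (dst - dst_c)
    u, _, vt = np.linalg.svd(cov)
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0  # avoid reflections
    rot = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    return rot, dst_c - rot @ src_c

def gaussian_maps(points, shape, sigma):
    """One 3D Gaussian map per point, stacked to shape (N, D, H, W)."""
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    coords = np.stack(grids, axis=-1).astype(np.float64)
    return np.stack([np.exp(-np.sum((coords - p) ** 2, axis=-1) / (2.0 * sigma ** 2))
                     for p in points], axis=0)

def build_label_proxy(pred_pose, atlases, stable_idx, k, shape, sigma=2.0):
    """Align every atlas to the predicted pose, keep the k best, average their maps."""
    scored = []
    for atlas in atlases:
        # Estimate the transform from the (assumed) robust landmark subset only.
        rot, trans = rigid_align(atlas[stable_idx], pred_pose[stable_idx])
        aligned = atlas @ rot.T + trans
        residual = aligned[stable_idx] - pred_pose[stable_idx]
        scored.append((np.mean(np.linalg.norm(residual, axis=1)), aligned))
    scored.sort(key=lambda item: item[0])
    support = [aligned for _, aligned in scored[:k]]
    proxy = np.mean([gaussian_maps(a, shape, sigma) for a in support], axis=0)
    return proxy, support
```

In an on-line refinement loop, `proxy` would replace the ground-truth heatmaps in the L2 loss for one update of the detector, after which the prediction and the proxy are recomputed.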

Although the label proxy is initially rough, it encodes a strong, case-adaptive shape prior that helps the detector generalize to unseen US cases. The label proxy evolves towards a suboptimal, case-specific state as the SSL iterates. The effectiveness of SSL is elaborated in Section 3.

2.3 Enable Better Performance with Larger Input

Limited by GPU memory, 3D deep models often sacrifice input size to enlarge network capacity, and many content details are destroyed during the downscaling. Reducing the GPU memory consumption to break the bottleneck of input size is therefore crucial for our task. In this work, we opt for the gradient checkpointing (GCP) strategy [3, 9], which trades GPU memory usage for re-computation and makes high-resolution US volumes available to the deep model.

Figure 4: Illustration of the forward pass and gradient re-computation in backward pass of the GCP. Dotted circle denotes the node in the computation graph to be emptied.

As shown in Fig. 4, the core idea of GCP is to discard the data in some computation graph nodes once a milestone node has been reached, which makes more GPU memory available for subsequent inference. The data of the discarded nodes are then recovered by re-computation during backpropagation. Given an input, the data in the first node are computed with the parameters of the first function. Based on the first node and the parameters of the second function, the milestone node is then reached. At this moment, the data in the first node are discarded to release the occupied GPU memory. During the backward pass, to get the gradients for the parameters of both functions, the first node is recovered by re-computation from the input and the parameters of the first function. The gradient for the parameters of the second function is calculated from the recovered node and the gradient of the milestone node. With the recovered node as a transition node, the gradient for the parameters of the first function can then be obtained. Thus, without losing model accuracy, both the forward and backward passes can fit in the GPU. For our task, restricted by the skip connections, we manually select all the Conv layers, except those directly connected with the concatenation layer, into the layer set for GCP. By using GCP to reduce GPU memory consumption, we enable the network to process US volumes with high resolution (enlarged by a factor of 1.25 along each dimension).
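The re-computation idea can be illustrated with a toy two-function chain in NumPy. This is a didactic sketch, not the GCP library used in the paper: the functions `f1` and `f2`, their weight shapes, and the squared-norm loss in the usage lines are all assumptions chosen so that the gradients can be written by hand. The forward pass returns only the milestone output, and the backward pass rebuilds the discarded intermediate node from the input before computing the gradients.

```python
import numpy as np

def f1(x, w1):
    """First function: produces the intermediate node that will be discarded."""
    return np.tanh(w1 @ x)

def f2(h, w2):
    """Second function: produces the milestone node."""
    return w2 @ h

def forward_checkpointed(x, w1, w2):
    """Forward pass that keeps only the input and the milestone output."""
    h = f1(x, w1)
    y = f2(h, w2)
    return y  # h is not returned or stored, so its memory can be released

def backward_checkpointed(x, w1, w2, grad_y):
    """Backward pass that first recomputes the discarded node from x and w1."""
    h = f1(x, w1)                                    # re-computation
    grad_w2 = np.outer(grad_y, h)                    # gradient for f2's parameters
    grad_h = w2.T @ grad_y                           # gradient at the transition node
    grad_w1 = np.outer(grad_h * (1.0 - h ** 2), x)   # gradient for f1's parameters
    return grad_w1, grad_w2

# Usage with an assumed loss L = 0.5 * ||y||^2, whose gradient w.r.t. y is y itself.
rng = np.random.default_rng(0)
x, w1, w2 = rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=(3, 5))
y = forward_checkpointed(x, w1, w2)
grad_w1, grad_w2 = backward_checkpointed(x, w1, w2, grad_y=y)
```

The gradients are identical to those of ordinary backpropagation; the only cost is the extra evaluation of `f1`, which is the memory-for-computation trade described above.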

3 Experimental Results

3.0.1 Materials and Implementation

We validate our method on a dataset of 152 fetal US volumes acquired from 152 pregnant volunteers, with gestational ages ranging from 10 to 14 weeks. The average volume size is 220×205×260. Voxel size is mm. Approved by the local IRB, all volumes were anonymized and obtained by experts using a Mindray DC-8 system. Free fetal poses are allowed. An expert with 10 years of experience manually annotated 16 landmarks. These 16 landmarks cover the fetal head, neck, shoulder, elbow, wrist, spine, sacra, hip joint, knee and ankle. We randomly split the dataset into 100/52 volumes for training/testing. The training set is augmented to 800 volumes with flipping and rotation.

We implement our method in TensorFlow on a standard PC with a single NVIDIA TITAN Xp GPU (12 GB). Code will be made available online. The original US volume is downscaled by a factor of 0.4 before being fed into our basic landmark detector; 0.4 is the highest ratio that the GPU allows for our network. With GCP, we can enlarge the ratio to 0.5. During the training of the landmark detector on the training dataset, we update the weights with an Adam optimizer (batch size 1, initial learning rate 1e-3, momentum term 0.5, 20 epochs). During testing with SSL, the initial learning rate is decreased to 5e-4. The landmark detector runs on each testing case with SSL for 6 iterations (about 12 seconds in total). GCP is used for all the methods compared in this paper when it is needed. Training with GCP needs about 1.5 times of extra running time.
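As a rough illustration of what the change from a ratio of 0.4 to 0.5 means for the network input, the short Python calculation below applies both ratios to the reported average volume size. Rounding each dimension to the nearest integer is an assumption; the paper does not state how the downscaled size is computed.

```python
def downscaled_shape(shape, ratio):
    """Scale each dimension and round to the nearest integer (assumed rounding rule)."""
    return tuple(int(round(s * ratio)) for s in shape)

def voxel_count(shape):
    """Total number of voxels in a volume of the given shape."""
    count = 1
    for s in shape:
        count *= s
    return count

avg_shape = (220, 205, 260)                   # average volume size reported in the paper
shape_r4 = downscaled_shape(avg_shape, 0.4)   # input of the basic detector
shape_gcp = downscaled_shape(avg_shape, 0.5)  # input enabled by GCP
growth = voxel_count(shape_gcp) / voxel_count(shape_r4)
print(shape_r4, shape_gcp, round(growth, 2))
```

A factor of 1.25 along each dimension corresponds to roughly 1.25^3 ≈ 1.95 times as many input voxels, so the activations that must be held in GPU memory nearly double, which is the bottleneck GCP is used to relieve.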

3.0.2 Quantitative and Qualitative Analysis

Two metrics are used to evaluate the accuracy of pose estimation: the Euclidean distance (mm) between the landmark prediction and the ground truth, and the area under the PCK curve (AUC, %), where PCK is the Percentage of Correct Keypoints, i.e., the percentage of detections with a Euclidean distance below a threshold. With the basic landmark detector (Land) as backbone, we compared our SSL method with two typical refinement methods that explore the landmark dependency: (a) generative adversarial learning (GAN) [4, 12] and (b) recurrent neural network (RNN) [6]. We implemented GAN by learning to classify the pair of US volume and 16-channel heatmaps, and RNN by adding a convolutional RNN layer to the last Conv layer of our landmark detector. GCP is applied to the input when the downscale ratio is 0.5.


Euclidean Distance [mm] ↓
Method      L1    L2    L3    L4    L5    L6    L7    L8    L9   L10   L11   L12   L13   L14   L15   L16   mean
Land-R4   1.75  6.69  2.54  7.18  8.85  11.1  2.57  3.31  9.87  13.7  4.34  9.23  7.22  3.84  6.69  6.46   6.59
LandGCP   1.74  8.78  2.54  5.59  6.39  6.40  2.15  2.39  11.5  11.1  2.97  6.27  5.05  2.36  5.91  4.49   5.35
RNN-R4    1.85  12.1  7.62  14.6  22.1  22.6  2.70  10.4  20.2  18.9  10.7  12.3  6.86  2.87  5.49  7.11  11.14
RNNGCP    1.76  9.47  5.93  13.5  18.1  14.9  4.45  10.4  17.1  15.0  4.41  7.27  5.56  5.33  7.11  5.35   9.11
GAN-R4    1.85  7.16  2.18  6.57  8.54  11.7  2.44  2.50  10.3  11.4  3.35  8.30  5.00  3.37  5.35  4.21   5.89
GANGCP    1.68  8.61  2.42  5.18  7.79  9.48  2.34  2.32  11.2  12.2  2.99  6.84  6.29  2.02  5.43  4.18   5.69
SSL-R4    1.72  5.00  2.36  4.37  6.81  13.3  2.56  3.19  8.40  10.8  3.32  6.45  7.40  2.93  4.63  4.26   5.47
SSLGCP    1.76  6.39  2.44  4.57  6.66  6.00  2.23  2.30  9.27  9.40  2.65  6.51  5.68  1.98  5.60  5.31   4.92
Table 1: Comparison of Euclidean Distance in Landmark Localization

AUC Ratio [%] ↑
Method      L1    L2    L3    L4    L5    L6    L7    L8    L9   L10   L11   L12   L13   L14   L15   L16   mean
Land-R4   81.4  56.5  75.5  50.1  47.5  32.3  72.8  74.0  48.0  28.4  63.6  27.3  44.9  61.8  48.1  49.4   53.8
LandGCP   82.5  48.4  74.8  61.0  65.6  51.8  78.4  75.9  45.7  35.8  70.4  46.5  49.7  76.6  51.7  55.7   60.7
RNN-R4    80.6  34.2  77.1  29.6  34.3  21.9  71.3  72.7  35.5  29.1  55.5  37.3  54.7  69.7  59.3  55.6   51.2
RNNGCP    82.8  40.3  77.3  36.5  35.5  36.2  75.5  76.0  26.6  27.5  58.1  43.9  51.3  62.4  54.0  60.0   52.8
GAN-R4    80.6  55.5  77.2  55.1  50.3  33.0  74.2  74.4  46.6  38.1  66.6  32.9  51.9  66.4  55.0  56.4   57.1
GANGCP    83.4  49.8  75.6  64.2  57.7  36.1  76.4  77.0  46.6  35.8  71.1  44.6  51.6  80.0  55.8  58.8   60.3
SSL-R4    81.8  61.9  75.9  65.6  56.7  22.0  73.0  75.3  56.8  37.8  65.1  46.8  42.4  70.1  63.0  57.0   59.5
SSLGCP    82.6  57.7  75.6  66.1  63.4  55.0  77.5  76.9  56.0  43.6  73.6  47.4  45.1  80.5  55.3  50.5   62.9
Table 2: Comparison of AUC in Landmark Localization

Figure 5: PCK curves for 3 fetal landmarks. x axis is the distance threshold. SSLGCP (dotted green curve) gets the best results among all the competitors.

Table 1 presents the Euclidean distance of the different methods for all 16 landmarks. We use R4 to denote a model handling input with a downscale ratio of 0.4, and GCP to denote a model that uses GCP to handle input with the larger downscale ratio of 0.5. As the table shows, almost all methods achieve a lower mean prediction distance with GCP, benefiting from the better features perceived from the higher-resolution input. To our knowledge, this work is the first to show that GCP can improve landmark localization by enabling larger ultrasound volume input. Besides, although the RNN and GAN based refinement methods bring improvements over the baseline, they still perform obviously worse for some landmarks. With the case-adaptive label proxy as a strong prior, SSL based methods surpass GAN/RNN and get almost the best results by achieving the top rank on 10 landmarks. The advantage of SSL is also visible in the average prediction distance: the proposed SSLGCP achieves an average distance of 4.92 mm and outperforms the two competitors (5.69 mm for GANGCP and 9.11 mm for RNNGCP).

Figure 6: Visualization of two 3D fetal pose estimations. From left to right: ground truth, Land-R4, LandGCP and SSLGCP. Blue digit for landmark index, green digit for length.

PCK evaluates the distribution of predicted landmarks around the ground truth. Table 2 further compares the AUC of the methods. Similar trends for GCP and SSL can be observed. SSL equipped with GCP (SSLGCP) ranks first on more landmarks than any other method (6 of 16 in Table 2). It also achieves the highest mean AUC among all competitors. The largest improvements over the baseline Land-R4 occur on landmarks including L4, L6, L10 and L12. Referring to Fig. 1, we can find that these are symmetric landmarks on the limbs, which are hard for the Land, RNN and GAN methods to differentiate. We believe that both the strong shape prior from the evolving label proxy and the better feature input enabled by GCP contribute to this improvement. We further provide the PCK curves of these landmarks from the different methods in Fig. 5 for readers to check details.
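For readers who want to reproduce the two metrics, the following NumPy sketch computes a PCK curve and its normalized area. It is an illustration under assumptions: the threshold range (`max_threshold`, in mm) and the number of sampling steps are hypothetical, since the paper does not state the range over which the AUC is integrated, and the area is computed with a simple trapezoidal rule.

```python
import numpy as np

def pck_curve(distances, thresholds):
    """Fraction of detections whose error is below each threshold."""
    distances = np.asarray(distances, dtype=np.float64)
    return np.array([(distances < t).mean() for t in thresholds])

def pck_auc(distances, max_threshold=10.0, steps=101):
    """Area under the PCK curve, normalized to a percentage in [0, 100].

    `max_threshold` (mm) and `steps` are assumed values for illustration.
    """
    thresholds = np.linspace(0.0, max_threshold, steps)
    curve = pck_curve(distances, thresholds)
    # Trapezoidal integration written out to stay independent of NumPy version.
    area = np.sum(0.5 * (curve[1:] + curve[:-1]) * np.diff(thresholds))
    return 100.0 * area / max_threshold

# Usage with hypothetical per-case errors (mm) for one landmark.
errors_mm = [1.2, 2.5, 4.0, 7.5, 12.0]
print(pck_curve(errors_mm, [2.0, 5.0, 10.0]), pck_auc(errors_mm))
```

Unlike the mean Euclidean distance, the AUC is bounded, so a few gross failures such as left/right swaps of symmetric landmarks lower it in proportion to their frequency rather than their magnitude.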

In Fig. 6, we visualize two cases of fetal pose estimation to show the advantages of our method SSLGCP. Land-R4 and LandGCP tend to be trapped by symmetric landmarks (green arrows), while our method can rectify these flaws and present visually plausible estimations. As a byproduct of the pose estimation, the lengths of key segments of the fetus are also produced in the 3D pose.

4 Conclusion

In this paper, we present the first work on 3D fetal pose estimation in US volumes. We tackle two main challenges: limited generalization ability, addressed with self-supervised learning, and the computational burden of large volumes, addressed with the gradient checkpointing strategy. Extensive experiments demonstrate the feasibility and effectiveness of our proposed method. We believe that the pose estimation of the fetus can serve as a map and inspire automated prenatal US image analyses.

4.0.1 Acknowledgments:

The work in this paper was supported by a grant from the Research Grants Council of Hong Kong SAR (Project No. CUHK14225616), the National Natural Science Foundation of China (Project No. U1813204) and the Shenzhen Peacock Plan (No. KQTD2016053112051497, KQJSCX20180328095606003).


  • [1] W. Bai, O. Oktay, et al. (2017) Semi-supervised learning for network-based cardiac MR image segmentation. In MICCAI, pp. 253–260.
  • [2] C. F. Baumgartner et al. (2017) SonoNet: real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE TMI 36 (11), pp. 2204–2215.
  • [3] T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  • [4] Y. Chen, C. Shen, et al. (2017) Adversarial PoseNet: a structure-aware convolutional network for human pose estimation. In ICCV, pp. 1212–1221.
  • [5] R. Huang, J. A. Noble, and A. I. Namburete (2018) Omni-supervised learning: scaling up to large unlabelled medical datasets. In MICCAI, pp. 572–580.
  • [6] J. Liu, H. Ding, A. Shahroudy, L. Duan, X. Jiang, G. Wang, and A. K. Chichung (2019) Feature boosting network for 3D pose estimation. IEEE TPAMI.
  • [7] A. I. Namburete et al. (2018) Fully-automated alignment of 3D fetal brain ultrasound to a canonical reference space using multi-task learning. MedIA 46, pp. 1–14.
  • [8] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
  • [9] T. Salimans and Y. Bulatov. Saving memory using gradient-checkpointing. https://github.com/openai/gradient-checkpointing/
  • [10] G. Wang, W. Li, et al. (2018) Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE TMI 37 (7), pp. 1562–1573.
  • [11] L. Wu et al. (2017) Cascaded fully convolutional networks for automatic prenatal ultrasound image segmentation. In ISBI, pp. 663–666.
  • [12] Z. Xu et al. (2018) Less is more: simultaneous view classification and landmark detection for abdominal ultrasound images. In MICCAI, pp. 711–719.
  • [13] X. Yang, L. Yu, et al. (2019) Towards automated semantic segmentation in prenatal volumetric ultrasound. IEEE TMI 38 (1), pp. 180–193.