Being real-time and radiation-free, ultrasound (US) is widely accepted in clinical practice for fetal health monitoring. Recent research has shown that many diagnostic biometrics can be interpreted automatically from US images. With its broad field of view and low user dependency, 3D US further offers automated solutions the opportunity to obtain precise descriptions of the fetus.
However, a solution that provides structured descriptions of the whole fetus in 3D US is still lacking. Such a description should facilitate not only traditional local-scale tasks, such as standard plane detection and biometric measurement, but also advanced global-scale analyses, such as fetal movement patterns and longitudinal comparison. We therefore approach this goal by exploring a new task dedicated to 3D pose estimation of the fetus in US volumes. Specifically, as illustrated in Fig. 1(b), by localizing 16 landmarks over the full body of the fetus, we aim to extract the skeleton of the whole fetus and assign the correct torso/limb labels to its segments and joints.
As shown in Fig. 1, estimating fetal pose in 3D US has to tackle several challenges. First, the image quality of 3D US is relatively low due to speckle noise, low resolution and acoustic shadows (Fig. 1(a)). Second, large variations in fetal pose, scale and orientation cause high variations in image appearance, which not only create ambiguity in localizing symmetric landmarks but also degrade the generalization ability of automated methods (Fig. 1(b)&(c)). Third, accurate landmark localization depends heavily on the global context of the whole volume to suppress false positives. However, digesting a whole US volume of about 200×200×200 voxels is very demanding under limited computing resources.
Deep neural networks are nowadays the dominant method for landmark detection in 3D US. A multi-task deep network was proposed for fetal eye localization in US volumes. Huang et al. exploited a semi-supervised learning method to localize 6 fetal head landmarks. To exploit geometric or class constraints, a generative adversarial scheme was explored to regularize the landmark predictions. Although these methods are promising, the networks often suffer from limited generalization ability, especially in our task, where fetuses have free poses with varying appearances. 2D pose estimation has been well studied in the computer vision field. Chen et al. [4] proposed to learn the joint inter-connectivity prior in an adversarial scheme to refine human pose predictions. Liu et al. [6] further distilled the articulated relationship between joints with a recurrent neural network for feature boosting. However, when faced with large volumes and varying poses, these methods tend to degrade, as our experiments show.
In this paper, we tackle the challenges of whole-body fetal pose estimation in 3D US and generalize landmark detection to large volumes. Our contribution is three-fold. (i) To the best of our knowledge, this is the first work on 3D pose estimation of the fetus in the literature. We believe that, with the fetal pose estimate serving as a map, navigation can be generated to assist a series of advanced studies on automated prenatal examination. (ii) We propose a self-supervised learning (SSL) framework that forces the deep network to produce visually plausible pose predictions. Specifically, we leverage landmark-based registration to effectively encode case-adaptive anatomical priors and generate an evolving label proxy for supervision. The proxy is a suboptimal supervision signal, but it proves to be explicit in conveying prior knowledge for successive refinement. (iii) To enable our 3D deep network to generate better features from higher-resolution input under limited computing resources, we further adopt the gradient checkpointing (GCP) strategy to save GPU memory. With little computation overhead, GCP facilitates training and inference on larger volumes and hence contributes to better localization performance. In extensive experiments on a large 3D US dataset, our proposed method copes with varying fetal poses and generalizes well, with promising results.
Fig. 2 gives an overview of our proposed framework. The system input is a whole US volume. A pre-trained, deep-network-based landmark detector first digests the input and predicts the heatmaps of 16 landmarks together with an intermediate fetal pose estimate. By retrieving a support set of atlases from the pose library via rigid registration, label proxies are produced to form the self-supervision. The landmark detector is then tuned iteratively for on-line refinement. The system outputs the final pose estimate after a few iterations. The landmark detector is updated under the gradient checkpointing strategy when necessary.
2.1 Backbone of Landmark Detector
We build a U-net [8] like network to simultaneously localize 16 landmarks over the full body of the fetus. Specifically, we deepen the network with consecutive convolutional (Conv) layers in each block and 4 pooling layers to encode high-level semantic features of the whole volume. Each Conv and deconvolutional (Deconv) layer is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU). The network is trained by minimizing an L2 regression loss.
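As an illustration of heatmap regression with an L2 loss, the following minimal NumPy sketch builds 16 Gaussian target maps on a toy grid, evaluates a mean-squared-error loss, and reads landmark positions back as per-channel argmax. The Gaussian width `sigma`, the toy grid size, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=2.0):
    """Return a 3D Gaussian map that peaks (value 1) at `center` = (z, y, x)."""
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    d2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def l2_loss(pred, target):
    """Mean squared error over all channels and voxels."""
    return float(np.mean((pred - target) ** 2))

def heatmaps_to_landmarks(heatmaps):
    """Take the argmax voxel of each channel as that landmark's position."""
    return np.array([np.unravel_index(np.argmax(h), h.shape) for h in heatmaps])

# Toy example: 16 channels on a 24^3 grid (real US volumes are far larger).
rng = np.random.default_rng(0)
shape = (24, 24, 24)
centers = rng.integers(3, 21, size=(16, 3))   # hypothetical landmark voxels
target = np.stack([gaussian_heatmap(shape, c) for c in centers])

print(l2_loss(np.zeros_like(target), target))  # loss of an all-zero prediction
print(heatmaps_to_landmarks(target)[:3])       # recovers the first three centers
```

In a real detector, `pred` would be the 16-channel network output, and the same loss would be back-propagated through the network.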
2.2 Self-supervised Learning for On-line Refinement
Due to the large variations in fetal pose, scale and orientation, deep networks for 3D fetal pose estimation in US often generalize poorly when faced with varying and unseen fetal appearances. Anatomical priors are helpful for this problem [6, 4]. However, these priors are often modeled indirectly and are hard to make effective in our task (see Section 3). In this paper, as shown in Fig. 2, we propose to address the problem by producing a direct shape prior for on-line refinement under an SSL scheme.
The core idea of SSL is to supervise a model with a label proxy generated by the model itself, which makes the process annotation-free. SSL changes the classic testing fashion from simple inference to on-line learning: it fine-tunes the trained deep model with a label proxy. Strong guidance from the label proxy helps the deep model update itself and generalize well to unseen cases. Recently, conditional random fields and interactive annotation have been proposed to learn pixel-wise dependency for synthesizing the label proxy in SSL. However, these methods are intractable for our discrete landmark detection. Therefore, we propose to synthesize the landmark label proxy by combining the model prediction with the shape knowledge of a pose library.
As shown in Fig. 2, after being pre-trained on the training dataset, the landmark detector enters our SSL scheme for testing. Following Eq. 1, for an unseen testing US volume of a fetus, the detector predicts its 16-channel landmark heatmaps and an intermediate 3D pose estimate. Each atlas in the pose library is then aligned to this intermediate pose via a rigid transformation. Since the landmarks of the intermediate pose often contain flaws, we select only a subset of landmarks to calculate the transformation. Specifically, referring to Fig. 1, we choose the landmarks that can be robustly detected across the dataset and that have relatively small variances, so as to fulfil the conditions of rigid registration. By retrieving the top-K candidates with the lowest registration errors, a support set of aligned atlases is formed. Then, a 16-channel landmark label proxy is produced by averaging the landmark Gaussian maps of the aligned atlases in the support set. The label proxy serves as the pseudo ground truth in each iteration to drive the loss function. To minimize the loss, the landmark detector needs to update itself, which refines its predictions and, in turn, the label proxy.
Although the label proxy is initially rough, it encodes a strong, case-adaptive shape prior that helps the detector generalize to unseen US cases. The label proxy evolves towards a suboptimal, case-specific state as SSL iterates. The effectiveness of SSL is elaborated in Section 3.
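The label-proxy construction described in this section can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: rigid alignment is done with the standard Kabsch (SVD-based) least-squares solution, the registration error is taken as the mean residual on the selected landmark subset, and the subset indices, `k`, `sigma`, the toy pose library, and all function names are hypothetical.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ~ dst_i (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # forbid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, dst_c - R @ src_c

def build_support_set(pred, atlases, subset, k):
    """Align every atlas to `pred` using only `subset` landmarks; keep the k best."""
    scored = []
    for atlas in atlases:
        R, t = rigid_align(atlas[subset], pred[subset])
        aligned = atlas @ R.T + t  # transform all 16 landmarks
        err = np.mean(np.linalg.norm(aligned[subset] - pred[subset], axis=1))
        scored.append((err, aligned))
    scored.sort(key=lambda item: item[0])
    return [aligned for _, aligned in scored[:k]]

def label_proxy(support, shape, sigma=2.0):
    """Average the per-landmark Gaussian maps of the aligned atlases."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), axis=-1)
    proxy = np.zeros((support[0].shape[0],) + tuple(shape))
    for atlas in support:
        for i, c in enumerate(atlas):
            proxy[i] += np.exp(-np.sum((grid - c) ** 2, axis=-1) / (2 * sigma ** 2))
    return proxy / len(support)

# Toy pose library: one atlas is a rotated and shifted copy of the true pose.
rng = np.random.default_rng(0)
base = rng.uniform(4, 20, size=(16, 3))
theta = np.pi / 3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
atlases = [base @ Rz.T + 5.0] + [rng.uniform(4, 20, size=(16, 3)) for _ in range(4)]
pred = base.copy()
pred[15] += 8.0             # simulate a flawed landmark outside the subset
subset = [0, 1, 2, 3, 4]    # hypothetical indices of robust landmarks
support = build_support_set(pred, atlases, subset, k=2)
proxy = label_proxy(support, shape=(24, 24, 24))
```

Because the flawed landmark is excluded from registration, the best-matching atlas still aligns to the true pose, and the proxy channel for that landmark points to an anatomically plausible position rather than to the erroneous prediction.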
2.3 Enable Better Performance with Larger Input
Limited by GPU memory, 3D deep models often sacrifice input size to enlarge network capacity, and many content details are destroyed by the downscaling. Reducing GPU memory consumption to break the input-size bottleneck is therefore crucial for our task. In this work, we opt for the gradient checkpointing (GCP) strategy [3, 9], which trades GPU memory usage for re-computation and makes high-resolution US volumes available to the deep model.
As shown in Fig. 4, the core idea of GCP is to discard the data in some computation graph nodes once a milestone node has been reached, making more GPU memory available for subsequent inference. The data of the discarded nodes is then recovered by re-computation during backpropagation. Given an input, the data in an intermediate node is computed from the input by a first function and its parameters. From this intermediate node and a second function, the milestone node is then reached. At this moment, the data in the intermediate node is discarded to release the GPU memory it occupies. During the backward pass, to obtain the gradients for both functions, the intermediate node is recovered by re-computation from the input and the first function. The gradient for the parameters of the second function is calculated from the recovered node and the gradient of the milestone node. With the recovered node as a transition node, the gradient for the parameters of the first function can then be obtained. Thus, without losing model accuracy, both the forward and backward passes fit in the GPU. For our task, restricted by the skip connections, we manually select all Conv layers, except those directly connected to the concatenation layer, into the layer set for GCP. By using GCP to reduce GPU memory consumption, we enable the network to process US volumes at higher resolution (enlarged 1.25 times along each dimension).
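The trade between memory and re-computation can be illustrated with a toy scalar chain in plain Python. This is only a sketch of the general checkpointing idea, not the TensorFlow mechanism used in the paper: each "layer" is `tanh(w * h)`, milestones are kept every `every` layers (a hypothetical setting), and memory is approximated by the number of stored activations.

```python
import math

def layer(w, h):
    return math.tanh(w * h)

def grads_full(ws, x):
    """Baseline backward pass: every activation is kept in memory."""
    acts = [x]
    for w in ws:
        acts.append(layer(w, acts[-1]))
    grads, g = [0.0] * len(ws), 1.0   # g = d(output) / d(acts[i + 1])
    for i in reversed(range(len(ws))):
        local = 1.0 - acts[i + 1] ** 2  # derivative of tanh at this layer
        grads[i] = g * local * acts[i]
        g = g * local * ws[i]
    return grads, len(acts)

def grads_checkpointed(ws, x, every=4):
    """Keep only milestone activations; recompute each segment when needed."""
    ckpts = {0: x}
    h = x
    for i, w in enumerate(ws):
        h = layer(w, h)
        if (i + 1) % every == 0:
            ckpts[i + 1] = h            # milestone node stays in memory
    grads, g = [0.0] * len(ws), 1.0
    peak, end = len(ckpts), len(ws)
    while end > 0:
        start = max(k for k in ckpts if k < end)
        acts = [ckpts[start]]           # re-computation of discarded nodes
        for i in range(start, end):
            acts.append(layer(ws[i], acts[-1]))
        peak = max(peak, len(ckpts) + len(acts) - 1)  # rough activation count
        for i in reversed(range(start, end)):
            a_in, a_out = acts[i - start], acts[i - start + 1]
            local = 1.0 - a_out ** 2
            grads[i] = g * local * a_in
            g = g * local * ws[i]
        end = start
    return grads, peak

ws = [0.5 + 0.1 * i for i in range(16)]
x = 0.3
g_full, mem_full = grads_full(ws, x)
g_ckpt, mem_ckpt = grads_checkpointed(ws, x, every=4)
print(mem_full, mem_ckpt)  # stored activations: full vs. checkpointed
```

The gradients of the two passes agree, while the checkpointed pass holds fewer activations at any one time, at the cost of one extra forward computation per segment.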
3 Experimental Results
3.0.1 Materials and Implementation
We validate our method on a dataset of 152 fetal US volumes acquired from 152 pregnant volunteers, with gestational ages ranging from 10 to 14 weeks. The average volume size is 220×205×260 voxels. The voxel size is 0.5×0.5×0.5 mm³. With approval from the local IRB, all volumes were anonymized and were acquired by experts using a Mindray DC-8 system. Free fetal poses are allowed. An expert with 10 years of experience manually annotated 16 landmarks, which cover the fetal head, neck, shoulder, elbow, wrist, spine, sacrum, hip joint, knee and ankle. We randomly split the dataset into 100/52 volumes for training/testing. The training set is augmented to 800 volumes with flipping and rotation.
We implement our method in TensorFlow on a standard PC with a single NVIDIA TITAN Xp GPU (12 GB). Code will be made available online. The original US volume is downscaled by a factor of 0.4 before being input to our basic landmark detector; 0.4 is the highest ratio the GPU allows for our network. With GCP, we can raise the ratio to 0.5. When training the landmark detector on the training dataset, we update the weights with an Adam optimizer (batch size = 1, initial learning rate = 1e-3, moment term = 0.5, 20 epochs). During testing with SSL, the initial learning rate is decreased to 5e-4. The landmark detector runs on each testing case with SSL for 6 iterations (about 12 seconds in total). GCP is used for all the compared methods whenever it is needed. Training with GCP requires about 1.5 times extra running time.
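To put the two downscale ratios in perspective, the short calculation below compares the number of input voxels at ratio 0.5 with that at ratio 0.4. It relies only on the ratios stated above; the remark that activation memory grows roughly in proportion to the voxel count is an assumption about fully convolutional networks in general, not a measurement from the paper.

```latex
\[
\frac{N_{0.5}}{N_{0.4}} = \left(\frac{0.5}{0.4}\right)^{3} = 1.25^{3} \approx 1.95
\]
```

Raising the ratio from 0.4 to 0.5 therefore nearly doubles the number of voxels the network has to process, which is why the larger input does not fit without a memory-saving strategy.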
3.0.2 Quantitative and Qualitative Analysis
Two metrics are used to evaluate the accuracy of pose estimation: the Euclidean distance (mm) between the landmark prediction and the ground truth, and the area under the PCK curve (AUC, %), where PCK is the Percentage of Correct Keypoints, i.e., the percentage of detections whose Euclidean distance is below a threshold. With the basic landmark detector (Land) as backbone, we compare our SSL method with two typical refinement methods that explore landmark dependency: (a) generative adversarial learning (GAN) [4, 12] and (b) recurrent neural network (RNN) [6]. We implemented GAN by learning to classify pairs of US volume and 16-channel heatmaps, and RNN by adding a convolutional RNN layer to the last Conv layer of our landmark detector. GCP is applied when the downscale ratio of the input is 0.5.
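The two metrics can be computed as in the following NumPy sketch. The threshold range used for the AUC (`max_threshold`, `steps`) is a hypothetical choice, since it is not stated here, and the function names are illustrative only.

```python
import numpy as np

def euclidean_mm(pred_vox, gt_vox, spacing_mm):
    """Distance in mm between voxel coordinates, given the grid spacing in mm."""
    diff = (np.asarray(pred_vox, dtype=float) - np.asarray(gt_vox, dtype=float)) * spacing_mm
    return np.linalg.norm(diff, axis=-1)

def pck_curve(distances, thresholds):
    """Fraction of detections whose error is below each threshold."""
    distances = np.asarray(distances, dtype=float)
    return np.array([(distances < t).mean() for t in thresholds])

def pck_auc(distances, max_threshold=20.0, steps=201):
    """Normalized area under the PCK curve, in percent (trapezoidal rule)."""
    thresholds = np.linspace(0.0, max_threshold, steps)
    pck = pck_curve(distances, thresholds)
    area = np.sum((pck[1:] + pck[:-1]) / 2.0 * np.diff(thresholds))
    return 100.0 * area / max_threshold

# Toy errors for one landmark over four test cases (values in mm are made up).
errors = [1.0, 2.0, 3.0, 4.0]
print(pck_curve(errors, [2.5]))  # half of the detections fall below 2.5 mm
print(pck_auc(errors))
```

Note that `spacing_mm` must be the spacing of the grid in which the coordinates are expressed, which differs from the original voxel size once the volume has been downscaled.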
Table 1: Euclidean distance [mm] (↓) of each method.
Table 2: AUC ratio [%] (↑) of each method.
Table 1 presents the Euclidean distance of the different methods for all 16 landmarks. We use R4 to denote a model handling input with a downscale ratio of 0.4, and GCP to denote a method that uses GCP to handle input with the larger downscale ratio of 0.5. As the table shows, with GCP almost all methods achieve a lower prediction distance on all landmarks, benefiting from the better features learned from higher-resolution input. To our knowledge, this work is the first to show that GCP can improve landmark localization by enabling larger ultrasound volume input. Moreover, although the RNN- and GAN-based refinement methods bring improvements over the basic landmark detector, they still perform noticeably worse on some landmarks. With the case-adaptive label proxy as a strong prior, the SSL-based methods surpass GAN/RNN and rank first on 10 landmarks. The advantage of SSL is also evident in the average prediction distance: the proposed SSL achieves an average distance of 4.92 mm and significantly outperforms the two competitors.
PCK evaluates the distribution of predicted landmarks around the ground truth. Table 2 further compares the AUC of the methods. Similar trends for GCP and SSL can be observed. SSL equipped with GCP (SSLGCP) ranks first on most landmarks and also achieves the highest mean AUC among all competitors. The largest improvements over the baseline Land-R4 occur on the detection of landmarks including L4, L6, L10 and L12. Referring to Fig. 1, these are symmetric landmarks on the limbs, which are hard for the Land, RNN and GAN methods to differentiate. We believe that both the strong shape prior from the evolving label proxy and the better feature input enabled by GCP contribute to this significant improvement. The PCK curves of these landmarks for the different methods are provided in Fig. 5 for further detail.
In Fig. 6, we visualize the fetal pose estimates for two cases to show the advantages of our method SSLGCP. Land-R4 and LandGCP tend to be confused by symmetric landmarks (green arrows), whereas our method rectifies these flaws and produces visually plausible estimates. As a byproduct of the pose estimation, the lengths of the key segments of the fetus are also obtained from the 3D pose.
In this paper, we present the first work on 3D fetal pose estimation in US volumes. We mainly tackle two challenges: generalization ability, with self-supervised learning, and the computation burden of large volumes, with the gradient checkpointing strategy. Extensive experiments demonstrate the feasibility and effectiveness of our proposed method. We believe that the pose estimation of the fetus can serve as a map and inspire further work on automated prenatal US image analysis.
The work in this paper was supported by grants from the Research Grants Council of Hong Kong SAR (Project No. CUHK14225616), the National Natural Science Foundation of China (Project No. U1813204) and the Shenzhen Peacock Plan (No. KQTD2016053112051497, KQJSCX20180328095606003).
-  (2017) Semi-supervised learning for network-based cardiac mr image segmentation. In MICCAI, pp. 253–260. Cited by: §2.2.
-  (2017) SonoNet: real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE TMI 36 (11), pp. 2204–2215. Cited by: §1.
-  (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. Cited by: §2.3.
-  (2017) Adversarial posenet: a structure-aware convolutional network for human pose estimation. In ICCV, pp. 1212–1221. Cited by: §1, §2.2, §3.0.2.
-  (2018) Omni-supervised learning: scaling up to large unlabelled medical datasets. In MICCAI, pp. 572–580. Cited by: §1.
-  (2019) Feature boosting network for 3d pose estimation. IEEE TPAMI. Cited by: §1, §2.2, §3.0.2.
-  (2018) Fully-automated alignment of 3d fetal brain ultrasound to a canonical reference space using multi-task learning. MedIA 46, pp. 1–14. Cited by: §1.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §2.1.
-  Saving memory using gradient-checkpointing. Note: https://github.com/openai/gradient-checkpointing/ Cited by: §2.3.
-  (2018) Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE TMI 37 (7), pp. 1562–1573. Cited by: §2.2.
-  (2017) Cascaded fully convolutional networks for automatic prenatal ultrasound image segmentation. In ISBI, pp. 663–666. Cited by: §1.
-  (2018) Less is more: simultaneous view classification and landmark detection for abdominal ultrasound images. In MICCAI, pp. 711–719. Cited by: §1, §3.0.2.
-  (2019) Towards automated semantic segmentation in prenatal volumetric ultrasound. IEEE TMI 38 (1), pp. 180–193. Cited by: §1.