## 1 Introduction

Estimation of fetal pose from volumetric MRI in pregnancy has applications that include motion tracking and prospective artifact mitigation during diagnostic imaging, retrospective analysis and evaluation of movement by the fetus, as well as the establishment of kinematic models of fetal movement during MRI. Prior work in fetal motion includes methods that rely on simple indices for fetal motion analysis and quantification, such as the angle of the fetal body axes with respect to the maternal body [1] and maternal perception of fetal movements [2].

Although pose estimation for the adult human body is an established domain in computer vision [3], to the best of our knowledge, no work has demonstrated fetal pose estimation over time in MR images. In contrast to human pose estimation from 2D photography, fetal pose estimation must predict 3D pose from dense volumetric data, which increases the computational burden. The task is further complicated by the variable orientation of the fetus within the mother, rapid growth and change in fetal features over gestational age, and poor-quality observations of ground-truth pose.

In pose estimation, methods built on handcrafted features, such as graphical models and tree-based methods, typically suffer from low accuracy and low processing speed, whereas recent developments in deep learning have demonstrated great success in computer vision, with acceleration by GPUs and the capability to learn high-level features from data. Consequently, deep convolutional neural networks have also found their way into human pose estimation and achieved state-of-the-art results.

In an ongoing study of placental function by EPI BOLD imaging time series (see Figure 1 (a)), we have built an archive of over 70 subjects, each with 200-500 time frames of EPI volumes imaged continuously over 10-30 minute observation intervals, resulting in over 18,000 EPI volumes. The fetal pose can be inferred from these data by visual inspection, but manual labeling of keypoints for pose estimation (see Figure 1 (b)) across all these volumes is prohibitive. We therefore propose a method based on deep neural networks to identify fetal keypoints.

We propose, demonstrate, and characterize the performance of a two-stage framework for fetal pose estimation in 3D MRI using deep learning. We first generate a heatmap for each fetal keypoint using a convolutional network and then infer fetal pose from the heatmaps using a Markov Random Field (MRF) that exploits anatomically plausible constraints on the connections between keypoints. Evaluation shows that the proposed method achieves a mean error of 4.47 mm and a percentage of correct detection of 96.4%. Further, the computation time of our pipeline is less than 1 s/volume, which potentially enables low-latency tracking of fetal pose during diagnostic MRI in pregnancy.


## 2 Methods

### 2.1 Pose Estimation Framework

Building on the idea of heatmap prediction in human pose estimation [3], we propose a two-stage framework for fetal pose estimation in 3D MRI using deep learning (see Fig. 2). In the first stage, a CNN generates heatmaps from the input MR volume, which give per-voxel likelihoods for keypoints on the fetal skeleton. However, a generated heatmap may have multiple local maxima, and simply taking the location of maximum activation as the prediction may lead to low accuracy.
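The failure mode of maximum-activation decoding can be illustrated with a toy example. The following is a minimal sketch in plain Python, not the authors' implementation: the 4x4x4 grid, the peak values, the 26-neighbour definition of a local maximum, and the helper names `argmax_3d` and `local_maxima_3d` are all assumptions made for illustration.

```python
from itertools import product

def argmax_3d(heatmap):
    """Return the (z, y, x) index of the largest value in a nested-list volume."""
    best_idx, best_val = None, float("-inf")
    for z, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for x, val in enumerate(row):
                if val > best_val:
                    best_idx, best_val = (z, y, x), val
    return best_idx

def local_maxima_3d(heatmap, k):
    """Return up to k voxels that strictly exceed all in-bounds neighbours, strongest first."""
    nz, ny, nx = len(heatmap), len(heatmap[0]), len(heatmap[0][0])
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    peaks = []
    for z, y, x in product(range(nz), range(ny), range(nx)):
        val = heatmap[z][y][x]
        neighbours = [
            heatmap[z + dz][y + dy][x + dx]
            for dz, dy, dx in offsets
            if 0 <= z + dz < nz and 0 <= y + dy < ny and 0 <= x + dx < nx
        ]
        if all(val > n for n in neighbours):
            peaks.append((val, (z, y, x)))
    peaks.sort(reverse=True)
    return [idx for _, idx in peaks[:k]]

# Toy heatmap: the true keypoint is at (1, 1, 1), but a spurious
# response at (3, 3, 3) happens to be slightly stronger.
heatmap = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
heatmap[1][1][1] = 0.8
heatmap[3][3][3] = 0.9

prediction = argmax_3d(heatmap)            # selects the spurious peak
candidates = local_maxima_3d(heatmap, 2)   # keeps both peaks as candidates
```

Here the single strongest voxel is the wrong one, while the list of local maxima still contains the true location, which is what a later refinement step can exploit.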

To address this problem, a second stage infers keypoint locations from the estimated heatmaps, exploiting the constraints of fetal pose to refine the results. We model the fetal pose as an MRF, where each fetal keypoint is represented by a node in the graph and the states of a node are the plausible locations of that keypoint. The final prediction is generated by performing inference on this MRF.
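To make the second stage concrete, the sketch below scores every combination of candidate locations on a three-keypoint chain and keeps the lowest-energy one. It is an illustration under stated assumptions rather than the method itself: the unary cost is taken as the negative log of the heatmap score, the pairwise cost as a quadratic penalty on bone length normalized by a mean bone length, exhaustive search stands in for belief propagation, and all numbers and names (`energy`, `infer_pose`, the candidate lists) are hypothetical.

```python
import math
from itertools import product

def energy(config, edges, lam):
    """Energy of one pose: unary heatmap terms plus weighted bone-length terms.

    config: one (location_mm, heatmap_score) pair per keypoint.
    edges:  (i, j, mean_bone_mm, mu, var) for each connected keypoint pair.
    """
    unary = sum(-math.log(score) for _, score in config)
    pairwise = 0.0
    for i, j, mean_bone_mm, mu, var in edges:
        normalized = math.dist(config[i][0], config[j][0]) / mean_bone_mm
        pairwise += (normalized - mu) ** 2 / (2.0 * var)
    return unary + lam * pairwise

def infer_pose(candidates, edges, lam=1.0):
    """Exhaustive search over candidate combinations (a stand-in for belief propagation)."""
    best = min(product(*candidates), key=lambda cfg: energy(cfg, edges, lam))
    return [loc for loc, _ in best]

# Hypothetical arm: shoulder, elbow, wrist (locations in mm).
# The wrist heatmap has a spurious peak that is stronger than the true one.
candidates = [
    [((0.0, 0.0, 0.0), 0.9)],
    [((20.0, 0.0, 0.0), 0.9)],
    [((80.0, 0.0, 0.0), 0.6), ((38.0, 0.0, 0.0), 0.5)],
]
edges = [(0, 1, 20.0, 1.0, 0.01), (1, 2, 20.0, 1.0, 0.01)]

# Independent decoding takes the strongest candidate of each keypoint.
independent = [max(c, key=lambda s: s[1])[0] for c in candidates]
refined = infer_pose(candidates, edges)
```

Independent decoding places the wrist 60 mm from the elbow, whereas the energy-based search prefers the slightly weaker candidate whose bone length is plausible. Exhaustive search is only practical for a handful of candidates; message passing on the keypoint graph avoids enumerating every combination.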

The following subsections describe the proposed framework in detail.

### 2.2 Heatmap Prediction using CNN

Inspired by the successful application of hourglass networks in human pose estimation [3], we propose a 3D hourglass network for heatmap prediction of fetal keypoints. The overall architecture is shown in Fig. 3. The network follows an encoder-decoder structure, motivated by the need to capture multi-scale information: while local evidence, e.g., local contrast, is important for identifying a keypoint, global information, such as the orientation of the fetus and the relative position of other joints or body parts, can help resolve ambiguity. At each scale of the network, residual blocks with 3D convolution layers are used to extract features. To recover the high-resolution information lost in the downscale-upscale structure, skip connections with element-wise addition connect symmetric scales.
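The regression target of a heatmap network of this kind is a stack of Gaussian blobs, one channel per keypoint, compared with the prediction by a mean-squared error. A minimal sketch in plain Python follows; the grid size, the value of `sigma`, the peak normalization to 1, and the function names are illustrative assumptions, not values from the paper.

```python
import math

def gaussian_heatmap(shape, center, sigma):
    """3D Gaussian with peak value 1 at `center` on a voxel grid of `shape` (z, y, x)."""
    nz, ny, nx = shape
    cz, cy, cx = center
    return [
        [
            [
                math.exp(-((z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2)
                         / (2.0 * sigma ** 2))
                for x in range(nx)
            ]
            for y in range(ny)
        ]
        for z in range(nz)
    ]

def make_target(shape, keypoints, sigma):
    """Stack one heatmap per keypoint; the result is indexed [channel][z][y][x]."""
    return [gaussian_heatmap(shape, kp, sigma) for kp in keypoints]

def flatten(nested):
    """Yield every scalar in an arbitrarily nested list."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def mse(pred, target):
    """Mean-squared error over all channels and voxels."""
    p, t = list(flatten(pred)), list(flatten(target))
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)

shape = (4, 4, 4)
keypoints = [(1, 1, 1), (2, 3, 0)]   # two hypothetical keypoints in voxel coordinates
target = make_target(shape, keypoints, sigma=1.0)

# A network that predicts all zeros incurs a positive loss.
zeros = [[[[0.0] * 4 for _ in range(4)] for _ in range(4)] for _ in keypoints]
loss = mse(zeros, target)
```

In practice the same computation would be done with array operations on the GPU; nested lists are used here only to keep the example dependency-free.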

The CNN learns a mapping from MR images to target heatmaps. Each target heatmap is generated by placing a Gaussian distribution with a fixed standard deviation at the ground-truth position of one keypoint, and the heatmaps of all keypoints are stacked along the channel dimension. The output therefore has the same spatial dimensions as the input but $N$ channels, where $N$ is the number of keypoints to predict. The loss function used for training is the mean-squared error (MSE) between the predicted and target heatmaps. Instead of the whole volume, fixed-size 3D patches are used as input for training. This strategy reduces GPU memory usage and enables mini-batch training. Since the network is fully convolutional, at inference the whole 3D MR volume is fed into the network to generate a full-scale heatmap.

### 2.3 Location Estimation from Heatmap

Given the output heatmaps from the CNN, the second stage of the framework estimates the location of each keypoint. Let $x_i$ and $H_i$ be the location and heatmap of the $i$-th keypoint, $i = 1, \dots, N$, and let $x = (x_1, \dots, x_N)$. One simple way to infer keypoint positions from the heatmaps is to take the maximally activated location of each heatmap:

$$\hat{x}_i = \arg\max_{x_i} H_i(x_i).$$

However, this method handles each keypoint independently and ignores the connections between keypoints; for example, the distance between two joints connected by a bone should be approximately constant. To exploit these connections, we model the fetal pose as an MRF in which each keypoint corresponds to a node and connections between keypoints are represented as edges. The states of node $i$ are the top-$k$ local maxima of heatmap $H_i$. A prediction of fetal pose is then a particular configuration $x$ of the MRF. Each configuration is assigned an energy, defined as

$$E(x) = \sum_{i=1}^{N} \phi_i(x_i) + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(x_i, x_j), \tag{1}$$

where $\mathcal{E}$ is the set of connections. A low energy implies a high probability, so inference is equivalent to finding the configuration with the lowest energy, $\hat{x} = \arg\min_{x} E(x)$.

Since the heatmap can be regarded as a surrogate for the probability distribution of the corresponding keypoint, the unary term $\phi_i$ of the energy function can be modeled as

$$\phi_i(x_i) = -\log H_i(x_i). \tag{2}$$

The pairwise term $\psi_{ij}$ is defined as a quadratic function of $d_{ij} = \lVert x_i - x_j \rVert_2$, the distance between keypoints $i$ and $j$:

$$\psi_{ij}(x_i, x_j) = \lambda \, \frac{\left( d_{ij} / \bar{l}_{ij}(g) - \mu_{ij} \right)^2}{2 \sigma_{ij}^2}, \tag{3}$$

where $\bar{l}_{ij}(g)$ is the mean bone length at gestational age $g$, so that $d_{ij} / \bar{l}_{ij}(g)$ can be regarded as the distance between two keypoints normalized by gestational age; $\mu_{ij}$ and $\sigma_{ij}^2$ are the mean and variance of the normalized distance, estimated from the training data; and $\lambda$ is the regularization weight. The optimization problem is solved by a belief propagation algorithm [4].

## 3 Experiments and Results

### 3.1 Dataset

The data for this study consist of volumetric MRI time series from 70 mothers pregnant with singletons at gestational ages ranging from 25 to 35 weeks. MRIs were acquired on a 3T Skyra scanner (Siemens Healthcare, Erlangen, Germany). A multislice, single-shot, gradient-echo EPI sequence was used for acquisition, with in-plane resolution of mm, slice thickness of 3 mm, mean matrix size = ; TR=s, TE=ms, FA=90°. Each subject was scanned for 10 to 30 min.

Similar to adult human pose estimation, we model the pose of a fetus with a set of keypoints. We chose fifteen keypoints (ankles, knees, hips, bladder, shoulders, elbows, wrists and eyes) to capture pose and labeled them manually; a representative example is shown in Fig. 1(b). These fifteen landmarks were selected because they capture the gross fetal anatomy that is critical in subsequent motion analysis and because they present with adequate image contrast to be observed relatively robustly in the MR volumes, which mitigates error and noise in labeling. In total, 1705 MR volumes were labeled: 1028 for training, 240 for validation and 437 for testing, where the testing set consists of subjects different from those in the training and validation sets.
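Holding out whole subjects for testing prevents volumes of the same fetus from appearing on both sides of the split. The sketch below shows one way to do this and also checks the arithmetic of the reported split; the subject identifiers, the `split_by_subject` helper, and the 25% test fraction are hypothetical, and only the three volume counts come from the text.

```python
import random

def split_by_subject(volume_subjects, test_fraction, seed=0):
    """Assign whole subjects to train or test so no subject appears in both."""
    subjects = sorted(set(volume_subjects))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(test_fraction * len(subjects)))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(volume_subjects) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(volume_subjects) if s in test_subjects]
    return train_idx, test_idx

# One subject label per labeled volume (made-up identifiers).
volume_subjects = ["s1", "s1", "s2", "s2", "s2", "s3", "s4", "s4"]
train_idx, test_idx = split_by_subject(volume_subjects, test_fraction=0.25)

# Consistency check of the reported split sizes.
counts = {"train": 1028, "val": 240, "test": 437}
total = sum(counts.values())
percent = {name: round(100 * n / total) for name, n in counts.items()}
```

The three counts add up to the 1705 labeled volumes, which corresponds to roughly 60%, 14% and 26% of the data.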

In order to improve the generalization capacity and avoid overfitting, several data augmentation techniques were used, including intensity scaling, 3D rotation and flipping.

### 3.2 Experimental Setup

All experiments were performed on a server with an Intel Xeon E5-1650 CPU, 128 GB RAM and an NVIDIA TITAN X GPU. Neural networks were implemented with TensorFlow. For optimization we used Adam with an initial learning rate of , weight decay of and the restart strategy [5]. The networks were trained for 200 epochs. For the second stage, we set and .

### 3.3 Results

In this section, we evaluate the proposed pipeline for fetal pose estimation. First, we evaluate the proposed 3D hourglass network (HG) with the maximally activated location of each heatmap as the final prediction. For comparison, we use 3D UNet [6], which has previously been used for heatmap regression [7]. Finally, we examine the whole pipeline by combining CNN-based heatmap regression with the MRF; these models are denoted UNet-M and HG-M, respectively.

Several metrics are used for evaluation: a) Percentage of Correct Keypoints (PCK), where a detected keypoint is considered correct if the distance between the predicted and the true keypoint is within a certain threshold; b) mean error (in mm), i.e., the mean distance between the predicted and the ground-truth keypoints; and c) median error (in mm).
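These three metrics are straightforward to compute from predicted and ground-truth coordinates. A minimal sketch follows, assuming keypoints are given in voxel coordinates, an isotropic 3 mm voxel size, and an inclusive threshold for PCK; the coordinates and the function names are made up for illustration.

```python
import math
from statistics import mean, median

def keypoint_errors(pred, truth, voxel_size_mm):
    """Euclidean distance in mm between predicted and true keypoints (voxel coordinates)."""
    return [
        math.dist(
            [p * s for p, s in zip(pk, voxel_size_mm)],
            [t * s for t, s in zip(tk, voxel_size_mm)],
        )
        for pk, tk in zip(pred, truth)
    ]

def pck(errors, threshold_mm):
    """Fraction of keypoints whose error does not exceed the threshold."""
    return sum(e <= threshold_mm for e in errors) / len(errors)

truth = [(10, 10, 10), (20, 20, 20), (5, 5, 5), (8, 8, 8)]
pred = [(10, 10, 11), (20, 22, 20), (5, 5, 5), (12, 8, 8)]

errors = keypoint_errors(pred, truth, voxel_size_mm=(3.0, 3.0, 3.0))
pck_5mm = pck(errors, 5.0)
pck_10mm = pck(errors, 10.0)
mean_error = mean(errors)
median_error = median(errors)
```

With these toy values the errors are 3, 6, 0 and 12 mm. The single 12 mm outlier pulls the mean (5.25 mm) above the median (4.5 mm), which is why reporting both is informative.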

Table 1: Median and mean error (mm) of each method, per keypoint and over all keypoints.

| metric | method | wrist | elbow | shoulder | eye | bladder | hip | knee | ankle | all |
|---|---|---|---|---|---|---|---|---|---|---|
| median (mm) | UNet | 3.84 | 3.43 | 2.87 | 2.74 | 3.20 | 3.12 | 4.00 | 4.42 | 3.47 |
| | UNet-M | 3.84 | 3.43 | 2.87 | 2.73 | 3.19 | 3.12 | 3.99 | 4.36 | 3.46 |
| | HG | 3.82 | 3.42 | 2.83 | 2.72 | 3.37 | 3.16 | 3.87 | 4.15 | 3.42 |
| | HG-M | 3.82 | 3.41 | 2.83 | 2.72 | 3.36 | 3.16 | 3.86 | 4.15 | 3.42 |
| mean (mm) | UNet | 7.34 | 4.06 | 4.27 | 3.96 | 4.48 | 3.33 | 5.19 | 10.2 | 5.41 |
| | UNet-M | 5.64 | 3.81 | 3.75 | 3.29 | 3.52 | 3.23 | 4.84 | 8.18 | 4.60 |
| | HG | 7.48 | 4.81 | 3.24 | 3.35 | 4.69 | 3.58 | 4.39 | 7.49 | 4.89 |
| | HG-M | 6.37 | 4.11 | 3.10 | 3.28 | 4.12 | 3.33 | 4.19 | 7.07 | 4.47 |

Table 2: Computation time and number of parameters of the two networks.

| network | computation time (ms/volume) | number of parameters |
|---|---|---|
| UNet | 271 | 22M |
| HG | 225 | 3.5M |

Fig. 4 shows PCK at two thresholds, 5 mm (1.67 pixels) and 10 mm (3.33 pixels), while the mean and median errors of the different models are listed in Table 1. With the proposed pipeline, 96.4% of the keypoints are located correctly (error within 10 mm), and the mean distance between predicted and ground-truth keypoints is 4.47 mm (1.5 pixels).

On average, the proposed 3D hourglass network performs similarly to 3D UNet. However, as shown in Table 2, UNet has about 6 times as many parameters as the hourglass network, indicating that the proposed network is more compact and efficient. The main reason is that the hourglass network uses element-wise summation instead of concatenation in its skip connections and keeps the number of channels fixed across scales.

The second-stage MRF refinement improves on CNN heatmap regression in terms of both PCK and mean error. As illustrated in Fig. 5(b), fetal pose estimation based on the maximally activated location of each heatmap may produce anatomically implausible predictions. Such errors are corrected by the MRF refinement, which trades off prior information about keypoint connections against the heatmaps generated by the CNN.

As for computation time, the proposed 3D hourglass network runs at 225 ms/volume on a GPU, and solving the optimization problem that infers keypoint locations from the heatmaps takes 290 ms/volume on a CPU. The end-to-end processing time of the whole pipeline is therefore less than 1 s/volume, shorter than the temporal resolution of the current fetal MR protocol, which potentially enables low-latency tracking of fetal pose in fetal MR imaging.
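The effect of the two design choices on model size can be checked with a quick parameter count. The sketch below is illustrative only: the channel widths, the 3x3x3 kernel, and the one-convolution-per-scale layout are assumptions chosen for the arithmetic, not the architectures evaluated above.

```python
def conv3d_params(c_in, c_out, kernel=3):
    """Weights plus biases of one 3D convolution layer."""
    return kernel ** 3 * c_in * c_out + c_out

channels = 64  # illustrative width, not taken from the paper

# Decoder convolution directly after a skip connection.
add_skip = conv3d_params(channels, channels)          # element-wise sum keeps c_in
concat_skip = conv3d_params(2 * channels, channels)   # concatenation doubles c_in

# One convolution per scale over four scales.
fixed_width = sum(conv3d_params(64, 64) for _ in range(4))
doubling_width = sum(conv3d_params(c, c) for c in (64, 128, 256, 512))
```

Under these assumptions, concatenation roughly doubles the cost of the convolution that follows the skip connection, and doubling the channel count at every scale increases the total by more than an order of magnitude compared with a fixed width. Both effects point in the same direction as the parameter gap reported in Table 2, although the exact ratio depends on the real layer configuration.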


## 4 Conclusions

In this work, we proposed a two-stage deep learning framework for fetal pose estimation in 3D MRI. The proposed method achieves a mean error of 4.47 mm (about 1.5 pixels) and a percentage of correct detection of 96.4%, which indicates that deep neural networks are able to identify key features for fetal pose estimation from time frames in low-resolution, volumetric EPI data from pregnant mothers. Further, the total processing time of the proposed framework is less than 1 s per volume, potentially enabling low-latency tracking of fetal pose in fetal MR imaging. The current method has two main limitations: the pipeline was trained only on singleton pregnancies, and pose detection was performed on each time frame in isolation, without using temporal correlations in the MR series. In future work, the proposed framework could be extended to multiple pregnancies and to exploit temporal correlations across volumes in a time sequence.

Overall, the proposed pipeline could be deployed for fetal motion estimation during MR scanning of pregnant mothers, with applications to fetal health and disease, the establishment of kinematic models of fetal motion, and prospective motion correction with slice-prescription updates for more robust diagnostic fetal and maternal MRI.

## Acknowledgements

This research was supported by NIH U01HD087211, NIH R01EB01733, NIH NIBIB NAC P41EB015902 and NIH NICHD U01HD087211.

## References

- [1] Biglari, H., Sameni, R.: Fetal motion estimation from noninvasive cardiac signal recordings. Physiological measurement 37(11), 2003 (2016)
- [2] Heazell, A.P., Frøen, J.: Methods of fetal movement counting and the detection of fetal compromise. Journal of Obstetrics and Gynaecology 28(2), 147–154 (2008)
- [3] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision. pp. 483–499. Springer (2016)
- [4] Schmidt, M.: UGM: Matlab code for undirected graphical models. URL http://www.di.ens.fr/mschmidt/Software/UGM.html (2012)
- [5] Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101 (2017)
- [6] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 424–432. Springer (2016)
- [7] Payer, C., Štern, D., Bischof, H., Urschler, M.: Regressing heatmaps for multiple landmark localization using CNNs. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 230–238. Springer (2016)
