Over the past decade, there have been significant advances in realistic human body shape modeling and simulation in the graphics domain [2, 14, 25, 16], where different statistical models have been applied to learn compact parametric representations of the human body shape. However, their impact on the healthcare domain has been relatively limited [21, 29]. One major reason is that existing shape modeling approaches focus primarily on the skin surface, while the healthcare domain pays more attention to the internal organs. This work attempts to tackle this limitation by addressing the challenging task of estimating the internal anatomy of the human body from surface data. More specifically, our approach generates a synthetic X-ray image of a person from the surface geometry alone. Because the synthetic X-ray serves only as an approximation of the true internal anatomy, we simultaneously predict markers that can be adjusted to update the X-ray image. Since the markers serve as spatial parameters that can be used to perturb the image, we refer to the predicted image as a parametrized image. Figure 1 shows examples of parametrized X-ray images generated from surface data.
Learning to predict parametrized images is a very challenging task due to strict constraints in the output space. As the training framework learns to predict images and the corresponding spatial parameters (i.e. markers), it also needs to ensure that the perturbations of these parameters lie on a manifold of 'realistic deformations' (e.g. realistic facial expressions when generating face images, or realistic body anatomy when generating synthetic X-rays). Since learning such output spaces (which are implicitly highly correlated) is difficult, we propose to learn a pair of networks: one trained to predict the parameters from the image contents, and the other trained to predict the image contents from the parameters. When the parameters are updated, the networks are applied iteratively in a loop until convergence. To facilitate such convergent behavior during the test phase, we present a novel learning algorithm that jointly learns both networks. While some recent works have used marker prediction as a supplementary task within the context of multi-task learning [10, 30], to the best of our knowledge, this work is the first to explicitly learn a bijection between the predicted markers and the generated images.
We use the proposed framework to generate synthetic X-ray images from 3D body surface meshes. Such a technology can be used in conjunction with existing approaches to estimate 3D body surface models from depth and potentially benefit medical procedures such as patient positioning for scanning or interventional procedures. We report several impactful applications of this technology, such as anomaly detection and completion of a full X-ray from a partial X-ray. We also demonstrate via experiments that the parametrized X-ray images can be used to generate training data, thus helping to overcome the significant data barrier faced when applying deep learning approaches to medical imaging tasks.
The key contributions of this paper can be summarized as follows:
- A novel framework to generate parametrized images using a convergent training pipeline.
- A novel technology to predict parametrized X-ray images from body surface data, which, as we demonstrate, can significantly impact the healthcare domain.
- Use of Wasserstein GANs (WGAN) with the gradient penalty method for a conditional regression task.
2 Related Work
Learning parametrized image representations can be formulated as multi-task learning, where one task is to generate the synthetic image from the input data, and the other is to ensure that the synthetic image correlates strongly with the set of control parameters. One such multi-task architecture regresses both facial markers and a corresponding 3D face represented as a volumetric image. It consists of two stacked hourglass networks: one predicts the markers, and the other predicts the volumetric image, taking the output of the marker network as input. While the authors show that the use of landmarks improves face prediction, there is no guarantee that the predicted markers and the generated face converge to a strong correlation, as the marker network is pre-trained independently of the face network and never receives any feedback from it. Another approach incorporates the tasks within a single network, but the auxiliary tasks are categorical, such as gender recognition. While such networks have the ability to implicitly learn the correlation between the two tasks (i.e. parameter space and image space in our setting), they are difficult to train because of the large variation in losses during the training period.
With the advent of generative adversarial networks, several attempts have been made to generate synthetic training data or to augment data in domains where data is difficult to acquire or annotations are difficult to obtain. Although one such approach focuses on generating data from noise, it bears similarities with ours, as it incorporates semantic parametrization during image generation and shows that the generated data can help with training deep networks. However, the sampled parameter space is limited to global scene parameters such as lighting variation and background variation, while in our work we focus on spatial parametrization, which allows changing the scene structure. Another line of work employs GANs to learn the natural image manifold and allows the user to manipulate the generated image by making structural edits; these are provided as sketches, and the image is updated by finding the nearest instance on the manifold. In contrast, our approach learns an explicit mapping between the image and several markers, and provides editability through these markers. A related approach performs image domain transfer by learning a pair of networks with an inverse relationship, using cyclic consistency as a regularizer. Our approach, however, learns a mapping from the source domain (surface data) to the target domain (X-ray image) together with control parameters (markers), and consistency is enforced between the target image space and the parameter space.
3 Parametrized Images
We use the term parametrized image to refer to an image that is parametrized by a set of spatially distributed markers, which can be perturbed to manipulate the contents of the image and generate realistic image variations. Such manipulation of an image via markers (spatial parameters) requires learning a bijective mapping. In this work, we achieve this by learning a pair of networks: the Marker Prediction network (MPN), trained to predict the parameters from the image contents, and the Conditional Image Generation network (CIGN), trained to predict the image contents given the parametrization. We use the networks to predict an initial estimate and then iteratively refine it until convergence. Figure 2 shows an overview of the networks involved in the parametrized image generation. While parametrized images can be generated from noise (similar to the image generation task), we focus on the task of conditional image generation, as it naturally applies to the task of generating X-ray images from 3D body surface data.
We represent the 3D human surface mesh data with a 2-channel 2D image: the first channel stores the depth of the body surface as observed from the front, and the second channel stores the thickness, computed as the distance between the closest and furthest points as observed from the front. In the rest of the document, we refer to this 2-channel image as the surface image.
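The surface image described above can be sketched as follows. This is a minimal illustration, assuming the body has already been voxelized into a binary occupancy volume whose first axis runs from front to back; the function name `surface_image` and the `spacing` parameter are hypothetical and not part of the original pipeline.

```python
import numpy as np

def surface_image(occupancy, spacing=1.0):
    """Return a (2, H, W) array: channel 0 = depth of the front surface,
    channel 1 = thickness (distance between closest and furthest point).

    occupancy: boolean array of shape (D, H, W); axis 0 runs front to back.
    spacing:   assumed voxel size along axis 0 (hypothetical parameter).
    Pixels whose ray misses the body are set to 0 in both channels.
    """
    occupancy = np.asarray(occupancy, dtype=bool)
    depth_len = occupancy.shape[0]
    hit = occupancy.any(axis=0)
    first = occupancy.argmax(axis=0)                       # closest occupied voxel
    last = depth_len - 1 - occupancy[::-1].argmax(axis=0)  # furthest occupied voxel
    depth = np.where(hit, first * spacing, 0.0)
    thickness = np.where(hit, (last - first) * spacing, 0.0)
    return np.stack([depth, thickness]).astype(np.float32)
```

Working from a mesh rather than a voxel grid would replace the two `argmax` calls with ray casting or a depth-buffer render from the front and from the back.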
3.1 Marker Prediction Network
The marker prediction network takes the surface image as well as the predicted X-ray image as input and predicts the locations of all the markers. We employ a U-Net-like architecture to train a network that regresses from a 3-channel input image (2 surface data channels, 1 X-ray image channel) to a 17-channel heatmap image by minimizing the L2 loss; these heatmaps correspond to anatomically meaningful landmarks (such as lung top, liver top and kidney center). Each output channel is compared with the corresponding ground truth heatmap, a Gaussian mask centered at the target location (similar to other landmark detection approaches [28, 17, 27]).
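The heatmap regression targets described above can be illustrated with a short sketch. It assumes an isotropic Gaussian with a standard deviation `sigma` chosen here purely for illustration; the helper names are hypothetical.

```python
import numpy as np

def gaussian_heatmaps(markers_xy, height, width, sigma=3.0):
    """Build one Gaussian target heatmap per marker.

    markers_xy: array of shape (K, 2) holding (x, y) pixel coordinates.
    sigma:      assumed standard deviation in pixels (illustrative value).
    Returns an array of shape (K, height, width) with peak value 1.
    """
    markers_xy = np.asarray(markers_xy, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None] - markers_xy[:, 0, None, None]
    dy = ys[None] - markers_xy[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))

def l2_loss(pred, target):
    """Mean squared error over all heatmap channels and pixels."""
    return float(np.mean((pred - target) ** 2))

def decode_markers(heatmaps):
    """Recover (x, y) marker locations as the argmax of each channel."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1).argmax(axis=1)
    return np.stack([flat % w, flat // w], axis=1)
```

With 17 landmarks, `gaussian_heatmaps` would return a 17-channel target of the same spatial size as the input, and `decode_markers` turns a predicted heatmap stack back into pixel coordinates.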
3.2 Conditional Image Generation Network
The proposed conditional image generation network is derived from the conditional GAN architecture. The generator, which uses the U-Net architecture, takes the surface image and marker heatmaps as input and outputs the synthetic X-ray image. To stabilize the CIGN training, we adopt the Wasserstein loss with gradient penalty, which has been reported to outperform other adversarial losses. The critic takes the surface image and the corresponding X-ray image as input. In theory, a better critic would take the surface image, the marker maps (parameter space) and the X-ray image as input, to implicitly force a strong correlation between them. However, fusing all the data together as a multi-channel image did not help train a useful critic in any of the GAN variants we tried [9, 3, 6]: the critic was not able to use the marker maps to distinguish real from fake, and thus failed to provide gradients that would eventually help generate the X-ray image. The final network architecture was chosen after thorough experimentation with several architectures and GAN variants.
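For reference, the critic objective of a WGAN with gradient penalty can be written in a conditional form as sketched below. The notation is an assumption made for illustration: $s$ is the surface image, $x$ a real X-ray image, $\tilde{x}$ a generated one, $\hat{x}$ a random interpolate between the two, $D$ the critic and $\lambda$ the penalty weight.

```latex
\begin{aligned}
L_{\text{critic}} &= \mathbb{E}_{\tilde{x}}\big[D(s,\tilde{x})\big]
  - \mathbb{E}_{x}\big[D(s,x)\big]
  + \lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(s,\hat{x}) \rVert_2 - 1\big)^2\Big], \\
\hat{x} &= \epsilon\, x + (1-\epsilon)\,\tilde{x}, \qquad \epsilon \sim U[0,1].
\end{aligned}
```

Under this formulation the generator is trained to minimize $-\mathbb{E}_{\tilde{x}}[D(s,\tilde{x})]$, and the penalty term pushes the critic's gradient norm toward 1 instead of clipping its weights.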
4 Learning Parametrized Image Representation
In this section, we describe the procedure to train the networks (the marker prediction network as well as the conditional image generation network). We first pre-train the networks using the available ground truth data, and subsequently refine them end-to-end to minimize a combined loss with two terms: the mean squared error between the predicted and the ground truth heat maps for the markers, and the loss between the predicted and the ground truth X-ray image.
4.1 Pre-training Marker Prediction Network
We pre-train the marker network using the Adam optimizer to minimize the MSE loss, and set the initial learning rate to . During pre-training, we use the ground truth X-ray images with body surface images as input.
During the convergent training process, the input is replaced by the predicted X-ray image. This initially worsens the performance of the marker prediction network, but it recovers after a few epochs of convergent training, as demonstrated in the experiments.
4.2 Pre-training Conditional Image Generation Network
We pre-train the image generation network with surface images and ground truth landmark maps as input, using the Adam optimizer with an initial learning rate of . After pre-training, we use RMSProp with a low learning rate of . In our experiments, we found the gradient penalty variant of WGAN to outperform the original WGAN with weight clipping. The architecture of the critic was similar to the encoder section of the generator network. We observed that, in the case of WGAN, using a more complex critic helps training converge faster.
During the convergent training, the network is iteratively updated using the predicted landmarks as input.
4.3 Convergent Training via Iterative Feedback
During the test phase, we apply both networks iteratively in succession until both of them reach a steady state. This implicitly requires the networks to have a high likelihood of convergence, which must be established during the training stage. A stable solution is one where the markers/parameters and the synthetic image are in complete agreement with each other, suggesting a bijection. We achieve this goal by freezing one network and updating the weights of the other network using its own loss as well as the loss backpropagated from the other network. Thus, the networks not only get feedback from the ground truth, but also get feedback on how they helped each other (good markers give a good X-ray image, and vice versa). The gradient flow during backpropagation is also shown in Figure 2.
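The test-time loop described above amounts to a fixed-point iteration. The following is a minimal sketch in which two plain callables stand in for the trained CIGN and MPN; the function name, the tolerance `tol` and the iteration cap `max_iters` are assumptions chosen for illustration.

```python
import numpy as np

def refine_until_stable(generate_image, predict_markers, surface, markers,
                        tol=1e-3, max_iters=50):
    """Alternate image generation and marker prediction until markers settle.

    generate_image(surface, markers) -> image    (stands in for the CIGN)
    predict_markers(surface, image)  -> markers  (stands in for the MPN)
    tol, max_iters: assumed stopping criteria, not taken from the paper.
    Returns (image, markers, number of iterations used).
    """
    markers = np.asarray(markers, dtype=np.float64)
    image = generate_image(surface, markers)
    for it in range(1, max_iters + 1):
        new_markers = np.asarray(predict_markers(surface, image), dtype=np.float64)
        shift = np.max(np.abs(new_markers - markers))  # largest marker movement
        markers = new_markers
        image = generate_image(surface, markers)
        if shift < tol:
            return image, markers, it
    return image, markers, max_iters
```

If the pair of networks is not contractive, the loop simply runs to `max_iters` without the markers settling, which is the behavior the convergent training is meant to prevent.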
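At each iteration, the conditional image generation network (CIGN) and the marker prediction network (MPN) each optimize a loss that combines two terms: the error of their own output with respect to the ground truth, and the error backpropagated through the other (frozen) network. Both networks are treated as functions of their inputs; the losses compare the ground truth image and marker heat maps with the images and marker heat maps predicted at the current iteration.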
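One plausible way to write the two per-iteration losses, consistent with the freezing scheme of Section 4.3, is sketched below. The notation and the exact form of each term are assumptions made for illustration: $G$ denotes the CIGN, $F$ the MPN, $s$ the surface image, $I_{gt}$ and $M_{gt}$ the ground truth image and marker heat maps, $I^{i}$ and $M^{i}$ the predictions at iteration $i$, $L_{\text{adv}}$ the adversarial loss, and $\alpha$, $\beta$ hypothetical weighting factors.

```latex
\begin{aligned}
L_{\text{CIGN}}^{\,i} &= L_{\text{adv}}\big(I_{gt},\, G(s, M^{i})\big)
  + \alpha\, \big\lVert F\big(s,\, G(s, M^{i})\big) - M^{i} \big\rVert_2^2, \\
L_{\text{MPN}}^{\,i} &= \big\lVert F(s, I^{i}) - M_{gt} \big\rVert_2^2
  + \beta\, \big\lVert G\big(s,\, F(s, I^{i})\big) - I_{gt} \big\rVert_1 .
\end{aligned}
```

In each line the first term is the network's own loss and the second is the feedback term; $F$ is held fixed while minimizing $L_{\text{CIGN}}$, and $G$ is held fixed while minimizing $L_{\text{MPN}}$.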
The iterative approach to train the networks to facilitate convergence is motivated by the iterative adversarial training procedure explained in . However, in this case, the networks are learning to cooperate instead of compete. Similar to GAN training, there is a possibility that the training may become unstable and diverge. We address this issue by weighting the losses with appropriate scale.
While the number of epochs required to reach convergence depends on how tightly the outputs of the two networks correlate, in our experiments, we found epochs to be sufficient. After these epochs, no significant change in the X-ray image or in the landmark positions was observed, suggesting convergence. Algorithm 1 details the pseudo-code for convergent training.
To validate the convergent training, we selected a random data sample from the testing set and monitored the marker displacement across iterations. Without the convergent training, the markers kept changing across iterations. Figure 3 shows the variation of the y position of body markers over 50 iterations. Notice that with the convergent training, markers become stable after 15 iterations.
To evaluate our approach, we collected full body Computed Tomography (CT) images from patients at several different hospital sites in North America and Europe. We randomly split the entire dataset into a testing set of images, a validation set of images and a training set with the rest. The 3D body surface meshes were obtained from the CT scans using thresholding and morphological image operations. The X-ray images were generated by orthographically projecting the CT images. All the data were normalized to a single scale using the neck and pubic symphysis body markers (since these can be easily approximated from the body surface data). All the experiments were conducted in the PyTorch environment. Our U-Net-like networks are composed of four levels of convolution blocks (each consisting of 3 repetitions of Convolution, Batch Normalization and ReLU). Each network has 27 convolutional layers, with 32 filters in each layer.
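The data preparation steps above (orthographic projection of the CT volume and scale normalization from two markers) can be sketched as follows. Summing intensities along one axis is only one simple way to realize an orthographic projection and is used here as an assumption; the function names and the `target_distance` reference length are hypothetical.

```python
import numpy as np

def orthographic_xray(ct_volume, axis=0):
    """Project a CT volume to a 2D image by summing intensities along one axis.

    axis: assumed front-to-back axis of the volume. The result is rescaled
    to [0, 1]; the exact intensity mapping used in the paper is not stated.
    """
    proj = np.asarray(ct_volume, dtype=np.float64).sum(axis=axis)
    span = proj.max() - proj.min()
    return (proj - proj.min()) / span if span > 0 else np.zeros_like(proj)

def scale_factor(neck_xy, pubis_xy, target_distance=1.0):
    """Scale that maps the neck-to-pubic-symphysis distance to a fixed length.

    target_distance is a hypothetical reference length in normalized units.
    """
    d = np.linalg.norm(np.asarray(neck_xy, float) - np.asarray(pubis_xy, float))
    if d == 0:
        raise ValueError("neck and pubic symphysis markers coincide")
    return target_distance / d
```

Applying the same scale factor to the surface image, the X-ray image and the marker coordinates keeps the three in register after normalization.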
5.1 Landmark Estimation
Although the purpose of the convergent training is to ensure a tight correlation between the X-ray image and the markers, we also computed the error statistics with respect to the ground truth marker annotations provided by medical experts. Interestingly, the convergent training helped improve the accuracy of the marker prediction network, although the improvement is not quantitatively significant (the mean Euclidean distance dropped from 2.50 cm to 2.44 cm). The accuracy improved most for some of the markers that are particularly difficult to predict, such as the tip of the sternum, where the error dropped from 2.46 cm to 2.00 cm; we attribute this to the fact that convergent training facilitates more stable prediction.
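The error statistic reported above, the mean Euclidean distance between predicted and annotated markers, can be computed as in the sketch below. The function name and the isotropic `cm_per_pixel` spacing are assumptions for illustration.

```python
import numpy as np

def marker_errors_cm(pred_xy, gt_xy, cm_per_pixel):
    """Per-marker and overall mean Euclidean distance in centimetres.

    pred_xy, gt_xy: arrays of shape (N, K, 2) in pixel coordinates
                    (N samples, K markers).
    cm_per_pixel:   assumed isotropic pixel spacing.
    """
    pred_xy = np.asarray(pred_xy, dtype=np.float64)
    gt_xy = np.asarray(gt_xy, dtype=np.float64)
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1) * cm_per_pixel  # (N, K)
    return dist.mean(axis=0), float(dist.mean())
```

The per-marker vector makes it easy to inspect individual landmarks such as the tip of the sternum alongside the overall mean.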
5.2 Synthetic X-ray Prediction
We use the widely used pix2pix approach as our baseline for conditional image regression and, following it, use an L1 loss for training the generators. Using receptive field of achieved the lowest validation error. During training, we found it difficult to optimize the network with batch size . We alleviated this issue by gradually reducing the batch size from a larger initial value to 1. This enabled faster training in the beginning, although with blurred results; as training continued and the batch size was reduced to 1, more details were recovered.
Table 1 shows a quantitative comparison between different approaches. Since the L1 error is known to be insufficient for capturing perceived image quality, we also report the Multi-Scale Structural Similarity index (MS-SSIM) averaged over the entire testing set. Notice that the MS-SSIM score for WGAN-GP is significantly higher than for the other methods. Furthermore, the convergent training is able to retain a high perceived image quality while ensuring that the markers are strongly correlated with the X-ray image. Figure 7 shows predicted X-ray images for several different surface images using these methods. Compared to the ground truth, the synthetic X-ray images do surprisingly well in certain regions such as the upper thorax, but poorly in other regions such as the lower abdomen, where the variance is known to be significantly higher. More importantly, the images generated using the proposed method are sharper around organ contours, and the spine and pelvic bone structures are much clearer.
6.1 Training Data Generation
Due to privacy and health safety issues, medical imaging data is difficult to obtain, which creates a significant barrier for data-driven analytics such as deep learning. Recently, generative deep learning models such as GANs and other variants [12, 20, 23, 4] have been effectively employed to generate realistic training data; however, they offer limited (categorical) control over the nature of the perturbations.
Parametrized X-ray images offer an approach to generating medical image training data; furthermore, the spatial parametrization allows controlled perturbations, such as generating data variations with lungs of a certain size. For tasks such as marker detection, we further observe that, since the image manifold is smooth, it is possible to generate training data (for augmentation) together with annotations, by annotating the marker in one image and tracking it in the image domain as the image is perturbed along the manifold.
We demonstrate the effectiveness of this method on the task of detecting the left lung bottom landmark (note that this marker is not in the original set of landmarks). We manually annotated this marker in 50 synthetic X-ray images to be used as training data. For evaluation, we manually annotated 50 ground truth X-ray images. To generate the augmented training dataset, we generate random perturbations from the annotated parametrized images (by allowing the markers to move within a certain range). Since the image manifold is smooth, as the position of the marker changed in the perturbed image, we were able to propagate the annotation using coefficient normed template matching. Figure 4 shows sample synthetic training images.
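The annotation propagation step can be sketched with a brute-force zero-mean normalized cross-correlation, the score that OpenCV exposes as `cv2.TM_CCOEFF_NORMED`. The NumPy version below is written for clarity rather than speed; the function names and the patch half-size `half` are assumptions for illustration.

```python
import numpy as np

def match_template_ccoeff_normed(image, template):
    """Brute-force zero-mean normalized cross-correlation.

    Returns the (x, y) of the top-left corner of the best match and its
    score in [-1, 1].
    """
    image = np.asarray(image, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_xy = -np.inf, (0, 0)
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = t_norm * np.sqrt((p ** 2).sum())
            score = (t * p).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, float(best_score)

def propagate_annotation(src_image, src_marker_xy, dst_image, half=4):
    """Track a marker from src_image into dst_image via a patch around it.

    half: assumed half-size of the template patch in pixels. The marker must
    lie at least `half` pixels from the border of src_image.
    """
    x, y = src_marker_xy
    template = src_image[y - half:y + half + 1, x - half:x + half + 1]
    (bx, by), score = match_template_ccoeff_normed(dst_image, template)
    return (bx + half, by + half), score
```

The returned score can serve as a sanity check: a low value suggests the perturbation moved the image too far for the annotation to be tracked reliably.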
We train a Fully Convolutional Network to regress the marker location, depicted as a Gaussian mask, from the X-ray image. We used the Adam optimizer with an initial learning rate of .
To measure the usefulness of the data generated using parametrized images, we created a baseline training dataset by augmenting the training images using random translations. Table 2 lists the error metrics as the networks are trained on the two datasets for 25 epochs (after which both overfitted). Notice that after only 5 epochs, the model trained with the parametrized training data had a cm mean error on the testing set, compared to cm for the baseline. After 25 epochs, the baseline has a mean error of 1.20 cm, while the network trained on data with parametrized perturbations has a much lower error of 0.68 cm.
Table 2 (columns): Epoch | MSE (p) | MSE (t)
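The baseline augmentation with random translations can be sketched as below. This is an illustrative version that shifts by whole pixels and fills the uncovered border with zeros; the function name and the `max_shift` parameter are assumptions.

```python
import numpy as np

def random_translation(image, marker_xy, max_shift, rng):
    """Shift an image by a random integer offset and move the marker with it.

    Pixels shifted in from outside the image are filled with zeros.
    max_shift: assumed maximum absolute shift in pixels along each axis.
    rng:       a numpy.random.Generator instance.
    """
    h, w = image.shape
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(image)
    # Part of the source image that remains visible after the shift.
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out, (marker_xy[0] + int(dx), marker_xy[1] + int(dy))
```

Unlike perturbations along the parametrized image manifold, this changes only where the anatomy sits in the frame and not its shape. A marker close to the border can be shifted out of the image, so such samples would need to be filtered.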
6.2 X-ray Image Completion
As radiation exposure is considered harmful, X-ray images are often acquired with a limited field of view, covering only a certain body region (say, the thorax or abdomen). Using parametrized images, we can reconstruct the X-ray image of the entire body such that it is still consistent with the partial yet real image. While the reconstructed X-ray image would be of limited diagnostic use, it would be beneficial for acquisition planning in subsequent or future medical scans. Using the reconstructed X-ray, the scan region can be specified more precisely, thus potentially reducing the radiation exposure.
To reconstruct the complete X-ray, we first generate a parametrized X-ray image of the patient from the surface data. As previously mentioned, the predicted X-ray may not always correspond to the true internal anatomy. This, however, can be addressed by adjusting the markers on the parametrized image such that the synthetic X-ray matches the real one where they overlap. Once the markers are adjusted, we regenerate the complete X-ray together with all the markers (see Figure 5). Table 3 shows the quantitative comparison between the predicted synthetic X-ray and markers, before and after being refined using the real X-ray image.
6.3 Anomaly Detection
Another potential use case of the proposed method is anatomical anomaly detection. As the proposed method generates a representation of healthy anatomy learned from healthy patients, it can be applied to anomaly detection by quantifying the difference between the real X-ray image and the predicted one. Figure 6 illustrates two such examples, where one patient has a missing lung and the other has an implant, which is sometimes overlooked by technicians. While such anatomical anomalies are easier to identify, the proposed approach, combined with higher resolution imaging, could potentially be used to suggest candidates for lung nodules (in a chest X-ray) or other pathological conditions.
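Quantifying the difference between the real and the predicted X-ray can be sketched as a thresholded absolute-difference map. This is a deliberately simple illustration; the function name and the `threshold` value are hypothetical, and both images are assumed to be aligned and scaled to [0, 1].

```python
import numpy as np

def anomaly_map(real_xray, predicted_xray, threshold=0.2):
    """Absolute difference between real and predicted X-ray plus a binary mask.

    threshold is a hypothetical cut-off, not a value from the paper.
    Returns (difference image, boolean mask, fraction of flagged pixels).
    """
    real = np.asarray(real_xray, dtype=np.float64)
    pred = np.asarray(predicted_xray, dtype=np.float64)
    if real.shape != pred.shape:
        raise ValueError("images must have the same shape")
    diff = np.abs(real - pred)
    mask = diff > threshold
    return diff, mask, float(mask.mean())
```

Large connected regions in the mask, such as a missing lung or an implant, would stand out more clearly than the pixel-level noise expected from an approximate prediction.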
We presented a novel framework that learns to predict parametrized images from partial image data, which enables natural perturbations of the predicted images. We applied the presented method to the challenging task of predicting a synthetic X-ray image from patient surface data, together with corresponding markers distributed over the X-ray; the predicted image can be further manipulated by adjusting the body markers while ensuring a physically consistent X-ray image. The proposed technology has been demonstrated to address the significant barrier of training data scarcity in the medical domain, in addition to enabling novel use cases with benefits to the medical community, such as device positioning. One direction for future work is to directly predict a 3D CT volume from the surface. This is currently limited by the difficulty of obtaining a sufficiently large dataset to learn the variation in 3D anatomical structure, and by the fact that most CT scans have a limited field of view and may sometimes include contrast, depending on the scan protocol.
- [1] https://github.com/pytorch/pytorch.
- [2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: shape completion and animation of people. ACM Trans. Graph., 24:408–416, 2005.
- [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- [4] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: fine-grained image generation through asymmetric training. CoRR, abs/1703.10155, 2017.
- [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
- [6] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.
- [7] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. 2010.
- [8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
- [9] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016.
- [10] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. CoRR, abs/1703.07834, 2017.
- [11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
- [13] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
- [14] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
- [15] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. CoRR, abs/1603.06937, 2016.
- [16] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (Proc. SIGGRAPH), 34(4):120:1–120:14, Aug. 2015.
- [17] T. Probst, K. Maninis, A. Chhatkuli, M. Ourak, E. V. Poorten, and L. V. Gool. Automatic tool landmark detection for stereo vision in robot-assisted retinal surgery. arXiv, 2017.
- [18] R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR, abs/1603.01249, 2016.
- [19] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.
- [20] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. CoRR, abs/1706.04987, 2017.
- [21] V. Singh, K. Ma, B. Tamersoy, Y.-J. Chang, A. Wimmer, T. O'Donnell, and T. Chen. Darwin: Deformable patient avatar representation with deep image network. In MICCAI, pages 497–504, 09 2017.
- [22] L. Sixt, B. Wild, and T. Landgraf. Rendergan: Generating realistic labeled data. CoRR, abs/1611.01331, 2016.
- [23] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3483–3491. Curran Associates, Inc., 2015.
- [24] T. Tieleman and G. Hinton. Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- [25] A. Tsoli, N. Mahmood, and M. J. Black. Breathing life into shape: Capturing, modeling and animating 3D human breathing. ACM Trans. Graph., 33(4):52:1–52:11, July 2014.
- [26] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multi-scale structural similarity for image quality assessment. 2003.
- [27] Y. Wu and Q. Ji. Robust facial landmark detection under significant head poses and occlusion. CoRR, abs/1709.08127, 2017.
- [28] S. Yan, Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Unconstrained fashion landmark detection via hierarchical recurrent transformer networks. CoRR, abs/1708.02044, 2017.
- [29] S. Y. Yeo, J. Romero, M. Loper, J. Machann, and M. Black. Shape estimation of subcutaneous adipose tissue using an articulated statistical shape model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging and Visualization, 0(0):1–8, 2016.
- [30] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.
- [31] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.
- [32] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In Proceedings of European Conference on Computer Vision (ECCV), 2016.